Mine a Database: PubMed XML to OLTP & OLAP

Northeastern University, CS 5200 Database Management Systems · Spring 2023

R MySQL SQLite XML
PubMed XML pipeline from OLTP to analytical warehouse

Overview

I built an end-to-end pipeline from PubMed-style bibliography XML into relational storage: R with XPath pulls structured fields out of nested records, normalizes them, and lands them in SQLite as an OLTP-friendly schema so edits and integrity rules stay tight at fine grain.

The design then feeds a star schema in MySQL suited to OLAP-style questions about publication timing, authorship patterns, and how complete or consistent metadata looks over time. Loader scripts and analytical SQL connect the two tiers without hand-waving the grain shift.

Deliverables

  • Transactional layer: Normalized SQLite schema and ingest logic for cleaned citation metadata.
  • Analytical layer: Star-schema construction in MySQL and warehouse load routines for reporting slices.
  • XML extraction: R/XPath transforms that map semi-structured feeds into relational tuples ready for SQL.
  • Analysis artifacts: Analyze Data (R Markdown) and companion PDF export; repository also contains the XML-to-SQL and warehouse-load scripts in the same folder.
  • Source bundle: Materials live under Practicum II – Mine a Database on GitHub.

Technology stack

  • SQLite: Initial normalized schema and OLTP-oriented storage.
  • MySQL: Dimensional / star layout and analytical querying.
  • R: XPath-based parsing, reshaping, and orchestration of database writes.
  • SQL: Load steps, integrity-minded DDL, and reporting queries across both engines.