Mine a Database: PubMed XML to OLTP & OLAP
Overview
I built an end-to-end pipeline from PubMed-style bibliography XML into relational storage: R with XPath pulls structured fields out of nested records, normalizes them, and lands them in SQLite as an OLTP-friendly schema so edits and integrity rules stay tight at fine grain.
The design then feeds a star schema in MySQL suited to OLAP-style questions about publication timing, authorship patterns, and how complete or consistent metadata looks over time. Loader scripts and analytical SQL connect the two tiers without hand-waving the grain shift.
Deliverables
- Transactional layer: Normalized SQLite schema and ingest logic for cleaned citation metadata.
- Analytical layer: Star-schema construction in MySQL and warehouse load routines for reporting slices.
- XML extraction: R/XPath transforms that map semi-structured feeds into relational tuples ready for SQL.
- Analysis artifacts: Analyze Data (R Markdown) and companion PDF export; repository also contains the XML-to-SQL and warehouse-load scripts in the same folder.
- Source bundle: Materials live under Practicum II – Mine a Database on GitHub.
Technology stack
- SQLite: Initial normalized schema and OLTP-oriented storage.
- MySQL: Dimensional / star layout and analytical querying.
- R: XPath-based parsing, reshaping, and orchestration of database writes.
- SQL: Load steps, integrity-minded DDL, and reporting queries across both engines.