A reproducible data engineering pipeline for extracting, processing, and formatting public domain Catholic texts for use with NotebookLM and other AI tools.
This project is named "The Depositum" (Latin: "deposit") in reference to the Catholic concept of the Deposit of Faith (depositum fidei) - the body of revealed truth entrusted by Christ to the Apostles and their successors. The Deposit of Faith consists of:
- Sacred Scripture - The written Word of God
- Sacred Tradition - The living transmission of the Word of God through the Church
- Magisterium - The teaching authority of the Church that interprets and safeguards both Scripture and Tradition
The three datasets in this project directly correspond to these three components:
- Douay-Rheims Bible = Sacred Scripture
- Haydock Commentary = Sacred Tradition (Church Fathers' interpretation)
- Catechism of Trent = Magisterium (official Church teaching)
Together, they create a digital "depositum" - a repository that preserves and makes accessible the Deposit of Faith in a format suitable for AI-powered study, ensuring responses draw from Scripture, Tradition, and Magisterium as an integrated whole.
This pipeline produces Markdown files specifically optimized for AI tools and NotebookLM:
- Structured Format: Markdown's hierarchical structure (headers, lists, formatting) is easily parsed by AI systems while remaining human-readable
- Clean Text: Minimal markup ensures AI models can focus on content rather than complex formatting
- Metadata Support: YAML frontmatter provides structured metadata (titles, tags) that AI tools can leverage for organization and search
- NotebookLM Compatibility: NotebookLM accepts Markdown files as sources, so the output can be uploaded and queried without further conversion
- Semantic Structure: Clear chapter, verse, and section markers enable precise referencing and context-aware responses
- Universal Compatibility: Markdown works seamlessly across AI platforms, text editors, and documentation systems
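As an illustration of the format described above, the following sketch renders one chapter as Markdown with a YAML frontmatter header. The function name `render_chapter`, the frontmatter keys, and the heading layout are assumptions made for this example; the files the pipeline actually writes may be laid out differently.

```python
def render_chapter(book, chapter, verses, tags):
    """Render one chapter as Markdown with a YAML frontmatter header.

    `verses` maps verse numbers to verse text. The layout is illustrative only.
    """
    lines = [
        "---",
        f'title: "{book} {chapter}"',
        "tags: [" + ", ".join(tags) + "]",
        "---",
        "",
        f"# {book}",
        "",
        f"## Chapter {chapter}",
        "",
    ]
    # Emit verses in numeric order, one paragraph per verse.
    for number in sorted(verses):
        lines.append(f"**{number}** {verses[number]}")
        lines.append("")
    # Normalize to exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"
```

Calling `render_chapter("Genesis", 1, {1: "In the beginning God created heaven, and earth."}, ["bible", "douay-rheims"])` returns a string that can be written straight to a `.md` file.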
- Douay-Rheims Bible: Complete 73-book Catholic canon extracted via a patchwork approach (66 books from bible-api.com, 7 Deuterocanonical books from GitHub) - see bible_douay_rheims/README.md for details
- Haydock Bible Commentary: Full commentary extracted from EPUB format
- Roman Catechism (McHugh & Callan): Catechism of the Council of Trent converted from PDF to Markdown
- Reproducible Pipeline: Complete automation for data extraction and transformation
- NotebookLM-Ready Output: Clean, formatted Markdown files optimized for AI tools
- Python 3.10+
- uv (fast Python package manager) - See installation instructions below
- Internet connection (for API downloads)
- EPUB file for Haydock Commentary (download separately)
- PDF file for Catechism (download separately)
Step 1: Install uv (if you don't have it):
# macOS (recommended)
brew install uv
# Linux/macOS (alternative; on Windows use the pip command below)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or: pip install uv
Step 2: Set up the project:
# Clone the repository
git clone https://github.com/your-username/the_depositum.git
cd the_depositum
# Create a virtual environment at .venv/ in the project root
uv venv
# Install all dependencies (uses pyproject.toml)
# This automatically installs the project in editable mode
uv sync
# Optional: Install with dev dependencies for testing
uv sync --extra dev
# Activate virtual environment (optional - uv run works without activation)
source .venv/bin/activate # On Windows: .venv\Scripts\activate
Note: The virtual environment (.venv/) is created in the project root directory. You can use uv run <command> to run commands in the venv without activating it manually.
Step 3: Download Source Files (if needed):
Bible: No download needed - the script downloads directly from the API
Haydock Commentary:
- Download EPUB from Isidore E-Book Library or JohnBlood GitLab
- Place in the data_engineering/data_sources/bible_commentary_haydock/ directory (the script looks for files matching the pattern Haydock Catholic Bible Comment*.epub)
Catechism:
- Download PDF from SaintsBooks.net
- Important: Ensure it's the McHugh & Callan translation (1923)
- Place The Roman Catechism.pdf in: data_engineering/data_sources/catholic_catechism_trent/
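Before running the pipeline, it can help to confirm that both manually downloaded files are where the scripts expect them. The sketch below is one way to do that, assuming the directory names and file patterns given above; the function name `find_missing_sources` is hypothetical and not part of the project.

```python
from pathlib import Path

SOURCES_DIR = Path("data_engineering/data_sources")

# Subdirectory -> file pattern, taken from the placement instructions above.
EXPECTED = {
    "bible_commentary_haydock": "Haydock Catholic Bible Comment*.epub",
    "catholic_catechism_trent": "The Roman Catechism.pdf",
}


def find_missing_sources(root=SOURCES_DIR):
    """Return 'subdir/pattern' for every expected source file that is absent."""
    missing = []
    for subdir, pattern in EXPECTED.items():
        # glob() yields nothing if the directory or the file does not exist.
        if not any((Path(root) / subdir).glob(pattern)):
            missing.append(f"{subdir}/{pattern}")
    return missing
```

An empty list means both files were found. Matching is case-sensitive on most Linux filesystems, so a file named `the roman catechism.pdf` would be reported as missing there.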
Step 4: Run the Pipeline:
# Run everything
python data_engineering/scripts/run_pipeline.py
# Or run individual sources
python data_engineering/scripts/run_pipeline.py --source bible
python data_engineering/scripts/run_pipeline.py --source commentary
python data_engineering/scripts/run_pipeline.py --source catechism
# Run and copy to final output
python data_engineering/scripts/run_pipeline.py --copy-output
Step 5: Verify Output:
# Bible (should have 73 files - complete Catholic canon)
ls data_final/bible_douay_rheims/ | wc -l
# Commentary
ls data_final/bible_commentary_haydock/
# Catechism
ls data_final/catholic_catechism_trent/
This pipeline extracts and processes three foundational Catholic texts that together represent the complete Deposit of Faith - Scripture, Tradition, and Magisterium. Each source has been carefully selected for its historical significance, doctrinal authority, and public domain status. Together, they ensure that AI responses are grounded in authoritative Catholic teaching.
- Source: Patchwork approach (MVP solution)
- 66 books from bible-api.com (ebible.org data)
- 7 Deuterocanonical books from GitHub repository (xxruyle/Bible-DouayRheims)
- Format: API/JSON → Markdown
- Output: 73 individual Markdown files (Bible_Book_01_Genesis.md through Bible_Book_73_Revelation.md) - Complete Catholic canon
- Scripts:
- data_engineering/data_sources/bible_douay_rheims/extract_bible.py (66 books)
- data_engineering/data_sources/bible_douay_rheims/extract_deuterocanonical.py (7 books)
- No prerequisites: Downloads directly from API and GitHub
- Note: This is an MVP patchwork approach. While functional and complete, it requires two separate scripts. Future improvements may include migration to a unified API source for better robustness. See bible_douay_rheims/README.md for details.
- Historical Significance: First officially authorized Catholic Bible in English, translated from the Latin Vulgate. The 1899 American Edition represents the Challoner revision, which became the standard English Catholic Bible for centuries.
- Role in Deposit of Faith: Represents Sacred Scripture - the written Word of God
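The JSON-to-Markdown step for the Bible can be sketched roughly as follows. This is not the code in `extract_bible.py`: the payload shape (a `verses` list whose items carry `chapter`, `verse`, and `text`) is an assumption about the API response, and the handling of spaces in book names is a guess, since only the Genesis and Revelation file names are documented here.

```python
def book_filename(index, name):
    """Build an output name like Bible_Book_01_Genesis.md.

    Replacing spaces with underscores is an assumption for multi-word names.
    """
    return f"Bible_Book_{index:02d}_{name.replace(' ', '_')}.md"


def verses_to_markdown(payload):
    """Convert a decoded JSON payload into Markdown with chapter headings.

    Assumes payload["verses"] is a list of dicts with the keys
    "chapter", "verse" and "text", ordered as they appear in the book.
    """
    lines = []
    current_chapter = None
    for verse in payload["verses"]:
        if verse["chapter"] != current_chapter:
            current_chapter = verse["chapter"]
            lines += [f"## Chapter {current_chapter}", ""]
        # Collapse stray newlines and repeated spaces inside the verse text.
        text = " ".join(verse["text"].split())
        lines += [f"**{verse['verse']}** {text}", ""]
    return "\n".join(lines).rstrip() + "\n"
```

Keeping the file naming in one small function makes it easier to hold the Bible and commentary outputs in the same order.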
- Source: EPUB file (from Isidore E-Book Library or JohnBlood GitLab)
- Format: EPUB → HTML → Markdown
- Output: Commentary files organized by book/chapter
- Script: data_engineering/data_sources/bible_commentary_haydock/extract_commentary.py
- Prerequisite: Download EPUB file separately
- Historical Significance: Comprehensive Catholic Bible commentary compiled by Father George Leo Haydock, drawing extensively from Church Fathers (Augustine, Jerome, Chrysostom) and traditional Catholic exegesis. The 1859 edition represents the mature form of this influential commentary.
- Role in Deposit of Faith: Represents Sacred Tradition - the living transmission of how the Church has understood Scripture through the ages, preserving the interpretive insights of the Church Fathers
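To make the EPUB → HTML → Markdown path concrete: an EPUB is a ZIP archive of (X)HTML documents, so a minimal conversion needs only the standard library. The sketch below is an illustration under simplifying assumptions, not the logic in `extract_commentary.py`. It handles only `h1`-`h3` and `p` elements, and it reads documents in file-name order instead of following the reading order declared in the EPUB's package file.

```python
import zipfile
from html.parser import HTMLParser
from io import BytesIO


class _MarkdownExtractor(HTMLParser):
    """Collect headings and paragraphs as Markdown blocks."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    BLOCKS = {"p", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._prefix = ""
        self._buffer = None  # None while outside any block element

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKS:
            self._prefix = self.HEADINGS.get(tag, "")
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.BLOCKS and self._buffer is not None:
            # Join text fragments and collapse internal whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append(self._prefix + text)
            self._buffer = None


def html_to_markdown(html):
    parser = _MarkdownExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks) + "\n"


def epub_to_markdown(epub):
    """Return {archive member name: Markdown} for each (X)HTML document.

    `epub` may be a file path or a binary file-like object.
    """
    result = {}
    with zipfile.ZipFile(epub) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                result[name] = html_to_markdown(archive.read(name).decode("utf-8"))
    return result
```

Inline markup such as `<i>` is dropped here while its text is kept. A fuller converter would map it to Markdown emphasis and split the result per book and chapter.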
- Source: PDF file from SaintsBooks.net
- Format: PDF → Text with Formatting → Markdown
- Output: Single Markdown file with comprehensive header detection (PART, ARTICLE, major sections, italicized subsections)
- Script: data_engineering/data_sources/catholic_catechism_trent/extract_catechism.py
- Prerequisite: Download PDF file separately (McHugh & Callan translation, 1923)
- Historical Significance: Official catechism commissioned by the Council of Trent (1545-1563) and published in 1566. The McHugh & Callan translation (1923) is considered one of the most accurate English translations, produced by Dominican scholars. This catechism represents authoritative post-Tridentine Catholic doctrine.
- Role in Deposit of Faith: Represents Magisterium - the teaching authority of the Church providing official interpretation and explanation of the faith
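Header detection on extracted PDF text usually comes down to pattern matching on each line. The sketch below shows the idea for the PART and ARTICLE levels only. The regular expressions and the heading depths are assumptions for illustration; the patterns in `extract_catechism.py` (including those for major sections and italicized subsections) are not reproduced here.

```python
import re

# (pattern, Markdown prefix) pairs, checked in order. Hypothetical patterns:
# they assume headers look like "PART I ..." and "ARTICLE II ...".
HEADER_PATTERNS = [
    (re.compile(r"^PART\s+[IVXLC]+\b"), "#"),
    (re.compile(r"^ARTICLE\s+[IVXLC]+\b"), "##"),
]


def mark_headers(lines):
    """Prefix recognized header lines with Markdown heading markers."""
    out = []
    for line in lines:
        stripped = line.strip()
        for pattern, prefix in HEADER_PATTERNS:
            if pattern.match(stripped):
                out.append(f"{prefix} {stripped}")
                break
        else:
            # No pattern matched: keep the line as body text.
            out.append(stripped)
    return out
```

Requiring whitespace and a Roman numeral after the keyword keeps ordinary words such as "PARTICULAR" from being promoted to headings.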
Raw Sources → Extraction Scripts → Processed Data → NotebookLM
     ↓                ↓                   ↓
API/EPUB/PDF    Python Scripts      Clean Markdown
- Extraction: Download and parse source materials
- Transformation: Convert to clean Markdown format
- Validation: Verify data quality and completeness
- Output: Generate final files in data_final/
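The stages above, combined with the `--source` and `--copy-output` flags shown in the run commands, suggest an orchestrator along these lines. This is a planning sketch only, not the contents of `run_pipeline.py`. The stage names, the `plan` function, and the choice to model copying as an extra final step are assumptions.

```python
import argparse

SOURCES = ("bible", "commentary", "catechism")
STAGES = ("extract", "transform", "validate")


def build_parser():
    parser = argparse.ArgumentParser(description="Run the extraction pipeline")
    parser.add_argument("--source", choices=SOURCES,
                        help="run a single source (default: all sources)")
    parser.add_argument("--copy-output", action="store_true",
                        help="copy results to data_final/ after validation")
    return parser


def plan(argv):
    """Return the ordered (source, stage) steps implied by the arguments."""
    args = build_parser().parse_args(argv)
    selected = [args.source] if args.source else list(SOURCES)
    stages = STAGES + (("copy_output",) if args.copy_output else ())
    return [(source, stage) for source in selected for stage in stages]
```

For example, `plan(["--source", "bible", "--copy-output"])` yields the three stages for the Bible followed by a copy step. Separating the plan from its execution keeps the ordering easy to test without touching the network or the filesystem.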
the_depositum/
├── README.md                                # Main project documentation
├── LICENSE                                  # MIT License
├── pyproject.toml                           # Python project config & dependencies
├── .gitignore                               # Git ignore rules
│
├── .cursor/                                 # Cursor IDE rules (2025 format)
│   └── rules/
│       ├── error-handling.mdc               # Critical error handling rules
│       └── data-engineering.mdc             # Data engineering standards
│
├── .github/                                 # GitHub configuration
│   ├── CODEOWNERS                           # Code review assignments
│   └── workflows/
│       └── security-audit.yml               # Automated security scanning
│
├── data_final/                              # Final output (ready for NotebookLM)
│   ├── 00_Project_Prompt_and_Sources.md     # Project constitution and source documentation
│   ├── bible_douay_rheims/                  # 73 Bible books (.md files) - Complete Catholic canon
│   ├── bible_commentary_haydock/            # 73 Commentary files (.md files)
│   └── catholic_catechism_trent/            # Catechism file (.md file)
│
└── data_engineering/                        # All technical components
    ├── README.md                            # Technical documentation
    ├── config/
    │   └── pipeline_config.yaml             # Pipeline configuration
    ├── scripts/
    │   └── run_pipeline.py                  # Main pipeline orchestrator
    ├── data_sources/                        # Extraction scripts
    │   ├── README.md                        # Data sources overview
    │   ├── bible_douay_rheims/
    │   │   ├── extract_bible.py
    │   │   ├── extract_deuterocanonical.py
    │   │   └── README.md
    │   ├── bible_commentary_haydock/
    │   │   ├── extract_commentary.py
    │   │   └── README.md
    │   └── catholic_catechism_trent/
    │       ├── extract_catechism.py
    │       └── README.md
    └── processed_data/                      # Intermediate processed files
- README.md - Main project documentation
- LICENSE - MIT License
- pyproject.toml - Python project configuration and dependencies
- .gitignore - Git ignore rules
- .cursor/rules/error-handling.mdc - Error handling standards (applies to all *.py) - if directory exists
- .cursor/rules/data-engineering.mdc - Data engineering standards (applies to data_engineering/**/*.py) - if directory exists
- .github/CODEOWNERS - Code review assignments - if directory exists
- .github/workflows/security-audit.yml - Security scanning workflow - if directory exists
- scripts/security_check.sh - Local security scanning script
- data_engineering/README.md - Technical documentation
- data_engineering/config/pipeline_config.yaml - Pipeline configuration
- data_engineering/scripts/run_pipeline.py - Main pipeline orchestrator
- data_engineering/data_sources/README.md - Data sources overview
- data_engineering/data_sources/bible_douay_rheims/extract_bible.py - Bible extraction script (66 books from API)
- data_engineering/data_sources/bible_douay_rheims/extract_deuterocanonical.py - Deuterocanonical books extraction script (7 books from GitHub)
- data_engineering/data_sources/bible_douay_rheims/README.md - Bible extraction guide
- data_engineering/data_sources/bible_commentary_haydock/extract_commentary.py - Commentary extraction script
- data_engineering/data_sources/bible_commentary_haydock/README.md - Commentary extraction guide
- data_engineering/data_sources/catholic_catechism_trent/extract_catechism.py - Catechism extraction script
- data_engineering/data_sources/catholic_catechism_trent/README.md - Catechism extraction guide
- data_engineering/data_sources/catholic_catechism_trent/EXTRACTION_ANALYSIS.md - Analysis of header detection improvements
- data_engineering/data_sources/catholic_catechism_trent/cleaned_table_of_contents.csv - Reference table of contents for validation
- data_final/00_Project_Prompt_and_Sources.md - Project constitution and source documentation defining the three pillars and operational guidelines
- data_engineering/processed_data/ - Intermediate processed files
- data_final/bible_douay_rheims/ - Final Bible output (73 .md files, named like Bible_Book_01_Genesis.md, Bible_Book_73_Revelation.md - complete Catholic canon)
- data_final/bible_commentary_haydock/ - Final commentary output (73 .md files, named like Bible_Book_01_Genesis_Commentary.md, Bible_Book_73_Revelation_Commentary.md)
- data_final/catholic_catechism_trent/ - Final catechism output (.md file)
- data_final/00_Project_Prompt_and_Sources.md - Project constitution and source documentation
- data_engineering/logs/ - Execution logs (bible_extraction.log, catechism_extraction.log)
This pipeline prepares texts for use with Google's NotebookLM:
- Run the pipeline to generate clean Markdown files
- Upload to NotebookLM: Import the files from data_final/
- Query the Deposit of Faith: Ask questions that draw from Scripture, Tradition, and Magisterium
- Generate Audio Overviews: Create podcasts discussing theological topics
- "Explain the Biblical roots of Confession using only the Gospel of John and the Council of Trent."
- "What do the Psalms and the Church Fathers say about trusting God in times of anxiety?"
- "Explain the Real Presence of the Eucharist using John 6 and the Haydock Commentary."
- Completeness Checks: Verify all expected books/files are present
- Format Validation: Ensure proper Markdown structure
- Content Verification: Check for missing verses or chapters
- Encoding Validation: Ensure UTF-8 encoding throughout
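A few of these checks are simple enough to sketch with the standard library. The function below covers file count, UTF-8 decoding, and a basic Markdown structure test. It is an illustration with a hypothetical name (`validate_output`), not the project's validation code, and it does not attempt verse-level content verification.

```python
from pathlib import Path

EXPECTED_BIBLE_FILES = 73  # complete Catholic canon, per the project description


def validate_output(directory, expected_count):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    files = sorted(Path(directory).glob("*.md"))

    # Completeness: the number of Markdown files must match exactly.
    if len(files) != expected_count:
        problems.append(f"expected {expected_count} .md files, found {len(files)}")

    for path in files:
        # Encoding: the raw bytes must decode as UTF-8.
        try:
            text = path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append(f"{path.name}: not valid UTF-8")
            continue
        # Format: a non-empty file should contain at least one heading line.
        if not text.strip():
            problems.append(f"{path.name}: file is empty")
        elif not any(line.startswith("#") for line in text.splitlines()):
            problems.append(f"{path.name}: no Markdown heading found")
    return problems
```

Running `validate_output("data_final/bible_douay_rheims", EXPECTED_BIBLE_FILES)` after the pipeline finishes gives the same count check as the `ls | wc -l` command, plus the per-file checks.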
Edit data_engineering/config/pipeline_config.yaml to customize:
- API endpoints and rate limits
- File paths
- Output formatting
- Validation rules
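As an example of how a rate-limit setting might be applied, the sketch below pauses between requests and backs off after a failure. The name `API_RATE_LIMIT_DELAY` matches the setting mentioned in the troubleshooting notes, but its default value here, the `fetch_all` function, and the retry policy are assumptions. The `fetch` callable is injected so the logic can be exercised without network access.

```python
import time

API_RATE_LIMIT_DELAY = 1.0  # seconds between requests; hypothetical default


def fetch_all(urls, fetch, delay=API_RATE_LIMIT_DELAY, max_retries=3, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests and retrying failures.

    `fetch` is any callable taking a URL and returning its content. Network
    errors are expected to surface as OSError (urllib's URLError is one).
    """
    results = {}
    for url in urls:
        for attempt in range(max_retries):
            try:
                results[url] = fetch(url)
                break
            except OSError:
                if attempt == max_retries - 1:
                    raise  # give up after the final attempt
                # Exponential backoff: delay, 2 * delay, 4 * delay, ...
                sleep(delay * 2 ** attempt)
        # Pause before the next request to stay under the rate limit.
        sleep(delay)
    return results
```

Raising the delay slows the whole run linearly, so it is usually worth increasing it in small steps until rate-limit errors stop.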
- Fork and clone the repository
- Set up environment:
uv venv
uv sync --extra dev # Installs with dev dependencies
source .venv/bin/activate # Optional: activate venv
- Run tests:
python data_engineering/scripts/run_pipeline.py --test
- Data Quality: All extractions must pass validation checks
- Documentation: Update README and code comments as needed
- Testing: Add tests for new extraction methods
- Format Consistency: Maintain consistent Markdown formatting
For detailed technical information, see:
- data_engineering/README.md: Complete technical documentation
- data_engineering/data_sources/README.md: Data sources overview
- data_final/README.md: Final output documentation with historical context
- FILES.md: Complete file listing and organization guide
- Individual source READMEs in each data_sources/{source}/ directory
- Ensure source files (EPUB/PDF) are in the correct directories
- Check file names match exactly (case-sensitive)
- Increase API_RATE_LIMIT_DELAY in data_engineering/config/pipeline_config.yaml
- Check internet connection
- For Haydock: Inspect EPUB structure and adjust parsing logic in extraction script
- For Catechism: Check PDF structure and header detection patterns
- Ensure virtual environment is activated (or use uv run)
- Reinstall dependencies: uv sync
This project implements automated security scanning:
- Code Security: Bandit scans Python code for security vulnerabilities
- Dependency Scanning: pip-audit checks for known vulnerabilities in dependencies
- Automated Workflows: Security audits run on every push and pull request
- Weekly Scans: Scheduled security audits run every Monday
Security scans are automatically performed via GitHub Actions. See .github/workflows/security-audit.yml for details.
To run security scans locally:
# Install security tools (if not already installed)
uv sync --extra dev
# Run Bandit code scan
bandit -r data_engineering/
# Run pip-audit dependency scan
pip-audit
This project is licensed under the MIT License - see the LICENSE file for details.
All source texts (Douay-Rheims, Haydock Commentary, Roman Catechism) are in the public domain and free of copyright restrictions.
- bible-api.com: For providing structured Bible data
- ebible.org: For maintaining public domain Bible texts
- Isidore E-Book Library / JohnBlood GitLab: For Haydock Commentary EPUB
- SaintsBooks.net: For McHugh & Callan Catechism PDF
Note: This pipeline is designed for educational and faith formation purposes. The texts are authoritative Catholic sources, but users should consult official Church documents for canonical matters.