Project Helios is a specialized initiative dedicated to the systematic archival, auditing, and analysis of United States historical, legal, and governmental datasets. By applying Zero Trust principles and advanced Natural Language Processing, we aim to transform static archives into dynamic, interlinked knowledge ecosystems for cybersecurity research and legal transparency.
This repository serves as the central node for the US Archive and the Protocol Helios initiatives. It bridges the gap between raw historical data and actionable intelligence through phased development.
Phase 1: Foundational Ingestion
- Primary sources: National Archives, GovInfo, and Project Gutenberg.
- Core assets: Declaration of Independence, US Constitution, and Federalist Papers.
Phase 2: Legal & Crisis Archiving
- Mapping major US historical events and crises via Wikidata SPARQL.
- Initial scaffolding for authenticated case law ingestion via CourtListener.
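
The event mapping above could be sketched roughly as follows, using Wikidata's public SPARQL endpoint. This is a minimal illustration, not the repository's actual query: the function names (`build_event_query`, `parse_bindings`, `fetch_events`), the `User-Agent` string, and the choice of properties (`P31` instance of, `P17` country with `Q30` for the United States, `P585` point in time) are all assumptions made for the example. The event class is left as a caller-supplied QID.

```python
import json
import re
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_event_query(event_class_qid, limit=50):
    """Build a SPARQL query for dated US events of one class (assumed shape)."""
    if not re.fullmatch(r"Q\d+", event_class_qid):
        raise ValueError(f"not a Wikidata QID: {event_class_qid!r}")
    return f"""
SELECT ?event ?eventLabel ?date WHERE {{
  ?event wdt:P31 wd:{event_class_qid} ;
         wdt:P17 wd:Q30 ;
         wdt:P585 ?date .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
ORDER BY ?date
LIMIT {int(limit)}
""".strip()


def parse_bindings(payload):
    """Flatten the standard SPARQL JSON result format into plain dicts."""
    rows = []
    for binding in payload["results"]["bindings"]:
        rows.append({
            "uri": binding["event"]["value"],
            "label": binding.get("eventLabel", {}).get("value", ""),
            "date": binding["date"]["value"],
        })
    return rows


def fetch_events(event_class_qid, limit=50):
    """Run the query over HTTP; requires network access."""
    params = urllib.parse.urlencode(
        {"query": build_event_query(event_class_qid, limit), "format": "json"}
    )
    request = urllib.request.Request(
        f"{WIKIDATA_ENDPOINT}?{params}",
        # Hypothetical agent string; Wikidata asks clients to identify themselves.
        headers={"User-Agent": "helios-archive-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_bindings(json.load(response))
```

Keeping query construction and result parsing separate from the network call means both can be checked offline against a saved response.
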
Phase 3: NLP & Knowledge Graphing
- Named Entity Recognition (NER) using `spaCy` to extract Persons, Organizations, and Geopolitical Entities.
- Generation of a structured `knowledge_graph.json` mapping relationships across 60+ foundational documents.
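
As a rough sketch of how the Phase 3 steps might fit together: spaCy's `PERSON`, `ORG`, and `GPE` labels correspond to the three entity types listed above, and a graph can be built by linking entities that appear in the same document. The schema of `knowledge_graph.json` is not documented here, so the node and edge layout, the co-occurrence rule, and every function name below are assumptions for illustration only.

```python
import json
from collections import defaultdict
from itertools import combinations

KEPT_LABELS = {"PERSON", "ORG", "GPE"}


def load_pipeline(model="en_core_web_sm"):
    """Load a spaCy pipeline; imported lazily so the graph helpers run without it."""
    import spacy
    return spacy.load(model)


def extract_entities(text, nlp):
    """Return (text, label) pairs for the entity types of interest."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents
            if ent.label_ in KEPT_LABELS]


def build_graph(doc_entities):
    """Build a co-occurrence graph from {doc_id: [(entity_text, label), ...]}.

    Assumed schema: nodes carry the documents they appear in; an edge's
    weight counts the documents in which both endpoints occur.
    """
    nodes = {}
    weights = defaultdict(int)
    for doc_id, entities in doc_entities.items():
        for name, label in entities:
            # The first label seen for a name wins; real data may need merging.
            node = nodes.setdefault(
                name, {"id": name, "label": label, "documents": []})
            if doc_id not in node["documents"]:
                node["documents"].append(doc_id)
        for a, b in combinations(sorted({name for name, _ in entities}), 2):
            weights[(a, b)] += 1
    edges = [{"source": a, "target": b, "weight": w}
             for (a, b), w in sorted(weights.items())]
    return {"nodes": sorted(nodes.values(), key=lambda n: n["id"]),
            "edges": edges}


def write_graph(graph, path="knowledge_graph.json"):
    """Serialize the graph as indented JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(graph, fh, indent=2)
```

With spaCy and the `en_core_web_sm` model installed, a caller would pass `load_pipeline()` as the `nlp` argument of `extract_entities` for each document, collect the results into a dictionary keyed by document ID, and hand that dictionary to `build_graph`.
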
Phase 4: Semantic Search & Analysis (Upcoming)
- Implementation of vector embeddings and semantic conceptual search.
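
Since Phase 4 is still upcoming, the following is only a toy illustration of the ranking step behind vector search: each text becomes a vector, and documents are ordered by cosine similarity to the query. The `embed` function here is a plain term-count vector standing in for a learned embedding model, so it matches shared words rather than concepts; the function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def search(query, documents, top_k=3):
    """Rank {doc_id: text} by similarity to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(text)), doc_id)
              for doc_id, text in documents.items()]
    # Sort by descending score, breaking ties by document ID for stability.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:top_k]]
```

Swapping `embed` for a dense sentence-embedding model would leave `cosine` and `search` conceptually unchanged, apart from operating on arrays instead of counters.
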
Use the links below to navigate the core indices of the archive.
- Historical Eras: Documents organized by Founding, Civil War, and Modern eras.
- Federal Law: Central repository for foundational acts and Supreme Court cases.
- Major Events & Crises: A chronological index of US history linked to legal shifts.
- Full Law Books: Complete digital library of foundational legal treatises.
- Web Archives: Offline HTML/Text captures of key public legal resources.
- Audit Reports: Security and integrity reports for the archive datasets.
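
One common building block for integrity reports like those mentioned above is a checksum manifest. The sketch below is a generic SHA-256 check, not the project's audit procedure: the manifest format (a mapping from file path to expected hex digest) and the function names are assumptions.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest):
    """Compare each file against its expected digest.

    manifest: {path: expected_hex_digest} (assumed format).
    Returns {path: "ok" | "mismatch" | "missing"}.
    """
    report = {}
    for path, expected in manifest.items():
        try:
            actual = sha256_of(path)
        except FileNotFoundError:
            report[path] = "missing"
            continue
        report[path] = "ok" if actual == expected.lower() else "mismatch"
    return report
```
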
- Language: Python 3.x
- NLP: spaCy (en_core_web_sm)
- Auditing: Lazarus Protocol Standards (Zero Trust Verification)
- State Management: Automated checkpointing via `archive_state.json`
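
Checkpointing of this kind might look like the sketch below, which records each finished item and skips it on the next run. The layout of `archive_state.json` is not documented here, so the `{"completed": [...]}` schema and the function names are assumptions; the write goes through a temporary file and `os.replace` so an interrupted run does not leave a half-written state file.

```python
import json
import os
import tempfile


def load_state(path="archive_state.json"):
    """Read the checkpoint, or start fresh if none exists yet."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"completed": []}  # assumed schema


def save_state(state, path="archive_state.json"):
    """Write the checkpoint atomically: temp file first, then rename over it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
    os.replace(tmp_path, path)


def run(items, process, path="archive_state.json"):
    """Process items in order, skipping any already recorded as completed."""
    state = load_state(path)
    done = set(state["completed"])
    for item in items:
        if item in done:
            continue
        process(item)
        state["completed"].append(item)
        done.add(item)
        save_state(state, path)  # checkpoint after every item
    return state
```

Saving after every item trades some I/O for the guarantee that at most one item is repeated after a crash.
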