Packaging and Merging Updates #377

Merged: EvanDietzMorris merged 72 commits into master from packaging on Mar 31, 2026

@EvanDietzMorris

PyPI Packaging & Project Modernization

This restructures ORION into a distributable Python package published on PyPI as robokop-orion.

  • Renamed Common/ → orion/ to follow packaging conventions
  • Moved cli/ → orion/cli/ so cli scripts are part of the distributed package
  • Renamed load_manager.py & SourceDataManager → ingest_pipeline.py & IngestPipeline
  • Removed requirements.txt; all dependencies are now declared in pyproject.toml
  • Added pyproject.toml, uv.lock, and .python-version
  • Now using uv as the build backend for packaging and development
  • Separated dependencies into core, dev, and parser-only groups
  • CLI entry points registered as [project.scripts]:
    • orion-build - build complete knowledge graphs from a Graph Spec
    • orion-ingest - run the ingest pipeline for individual data sources
    • orion-merge - merge KGX node/edge files
    • orion-meta-kg - generate MetaKG and test data files
    • orion-redundant-kg - generate redundant edge files
    • orion-ac - generate AnswerCoalesce files
    • orion-neo4j-dump / orion-memgraph-dump - generate database dumps

CI/CD

  • New pypi.yml workflow - publishes to TestPyPI and PyPI on GitHub release using trusted publishing
  • Updated test.yml - migrated from Python 3.9 + pip to Python 3.12 + uv

Docker & Deployment

  • Dockerfile updated: uses uv to install the package and manage entry points
  • Restructured the Neo4j image and its Python usage to support newer Python versions (3.12)
  • Docker Compose and Helm charts now use executables like orion-build instead of python commands

Merging Improvements

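  • Significantly improved the on-disk merging algorithm (entity keys stored in temp files, less JSON deserialization, lower memory use while buffering)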
  • Added custom merging option for retrieval sources
  • Merging now always treats lists as sets
  • Simplified merging buffer with stricter buffer size enforcement
  • Better knowledge source merging with improved tests
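
To illustrate what "treats lists as sets" can mean in practice, here is a minimal sketch rather than ORION's actual merge code. The function name `merge_entities` is hypothetical. It assumes list items are hashable and mutually comparable (for example strings), and it assumes that on a scalar conflict the existing value is kept.

```python
from typing import Any


def merge_entities(existing: dict[str, Any], incoming: dict[str, Any]) -> dict[str, Any]:
    """Merge two property dicts for the same entity, unioning list values."""
    merged = dict(existing)
    for key, value in incoming.items():
        current = merged.get(key)
        if isinstance(current, list) and isinstance(value, list):
            # Lists behave like sets: duplicates collapse and order is not
            # significant. Sorting only makes the output deterministic.
            merged[key] = sorted(set(current) | set(value))
        elif key not in merged:
            merged[key] = value
        # Otherwise keep the existing value (an assumption of this sketch).
    return merged
```

For example, merging `{"synonyms": ["b", "a"]}` with `{"synonyms": ["a", "c"]}` produces a single `synonyms` list containing `a`, `b`, and `c` once each.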

Other Improvements

  • Added add_edge_id and edge_id_type support (including UUID option) at the graph spec level
  • Prevented duplicate edges in example data
  • Improved handling of list qualifier values for MetaKG
  • Increased default normalization batch size
  • Updated documentation (README, contributor guide, Jupyter notebook) to reflect new structure and commands
  • Updated .gitignore to exclude .DS_Store, .pytest_cache/, dist/, and *.egg-info/
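
The edge ID option can be pictured with the sketch below. It is an illustration under assumptions, not ORION's implementation: the function name `add_edge_id`, the string values `"uuid"` and `"hash"`, the `id` property, and the choice of subject/predicate/object as hash inputs are all hypothetical. Only the existence of a UUID option comes from the description above.

```python
import hashlib
import json
import uuid


def add_edge_id(edge: dict, edge_id_type: str = "uuid") -> dict:
    """Return a copy of `edge` with an "id" property added."""
    if edge_id_type == "uuid":
        # Random identifier: unique per call, not reproducible across builds.
        edge_id = str(uuid.uuid4())
    elif edge_id_type == "hash":
        # Hypothetical alternative: a deterministic ID derived from key fields.
        key_fields = {k: edge.get(k) for k in ("subject", "predicate", "object")}
        canonical = json.dumps(key_fields, sort_keys=True)
        edge_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    else:
        raise ValueError(f"unsupported edge_id_type: {edge_id_type!r}")
    return {**edge, "id": edge_id}
```

The trade-off shown here is general: random UUIDs need no agreement on what defines an edge, while content-derived IDs stay stable across rebuilds.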

- remove "regular" node distinction
- reword "merged" field which is a lie now
Some of the imports in __init__ did more than most users would want; removing them until lazy loading is improved.
Improved the on-disk merging algorithm in several major ways. Entity keys are now included in the written temp files, so they are computed only once and unnecessary JSON deserialization is avoided. The key is derived and each entity's JSON is serialized as entities are encountered, which reduces memory use while they are buffered. The entity buffer size is now a hard cap, which removes the chunking complexity. Edge ID addition and custom edge key handling are also improved.
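
The on-disk merging approach described in this commit message can be sketched roughly as follows. This is a simplified illustration, not the code from this PR. The names `merge_on_disk`, `key_fn`, and `merge_fn` are hypothetical, and the sketch assumes keys contain no tab or newline characters. Keys are computed once and stored next to the serialized JSON in sorted temp files, the buffer is flushed as soon as it reaches a fixed size, and JSON is deserialized only when two records share a key.

```python
import heapq
import json
import os
import tempfile
from typing import Callable, Iterable, Iterator


def _flush(buffer: list[tuple[str, str]], tmp_dir: str) -> str:
    """Write the buffered (key, json) pairs to a sorted temp file and empty the buffer."""
    buffer.sort(key=lambda pair: pair[0])
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".tsv")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for key, payload in buffer:
            handle.write(f"{key}\t{payload}\n")
    buffer.clear()
    return path


def _read_run(path: str) -> Iterator[tuple[str, str]]:
    """Yield (key, json) pairs from a temp file without parsing the JSON."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            key, payload = line.rstrip("\n").split("\t", 1)
            yield key, payload


def merge_on_disk(
    entities: Iterable[dict],
    key_fn: Callable[[dict], str],
    merge_fn: Callable[[dict, dict], dict],
    tmp_dir: str,
    buffer_size: int = 10_000,
) -> Iterator[str]:
    """Yield one JSON string per unique key, in key order."""
    buffer: list[tuple[str, str]] = []
    runs: list[str] = []
    for entity in entities:
        # Derive the key and serialize once, as each entity is encountered.
        buffer.append((key_fn(entity), json.dumps(entity)))
        if len(buffer) >= buffer_size:  # hard cap on buffered entities
            runs.append(_flush(buffer, tmp_dir))
    if buffer:
        runs.append(_flush(buffer, tmp_dir))

    current_key, current_payload = None, None
    streams = (_read_run(path) for path in runs)
    for key, payload in heapq.merge(*streams, key=lambda pair: pair[0]):
        if key == current_key:
            # Deserialize only when a duplicate key actually needs merging.
            merged = merge_fn(json.loads(current_payload), json.loads(payload))
            current_payload = json.dumps(merged)
        else:
            if current_payload is not None:
                yield current_payload
            current_key, current_payload = key, payload
    if current_payload is not None:
        yield current_payload
```

The caller owns `tmp_dir` and is responsible for cleaning it up, for example by wrapping the call in `tempfile.TemporaryDirectory()` and consuming the generator inside that context.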
@github-actions github-actions bot added the Biological Context QC label (Require validation of biological context to ensure accuracy and consistency) on Mar 26, 2026
@EvanDietzMorris EvanDietzMorris merged commit 7992cd1 into master Mar 31, 2026
2 checks passed