A current bottleneck in pipeline processing is merging the separate per-organism datasets into a single multi-organism dataset. Historically, the data design placed items such as genes and networks for all organisms into shared database tables, identified by an internal ID. This created a unified ID space for looking up the data, but it also causes interdependencies between organisms when building a dataset. The revised data pipeline allows each organism to be built independently, and adds a merge step that takes care of the interdependencies.
However, the merge step duplicates data (it just re-indexes files to map them to a unified ID space). This wastes time and disk space during the build, and requires re-executing the data processing steps that follow the merge step (the merge step runs at the generic_db level).
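To make the cost concrete, the re-indexing can be pictured roughly as below. This is only an illustrative sketch, not the pipeline's actual merge code: the function name `merge_reindex`, the record layout, and the sequential ID assignment are all assumptions made for the example.

```python
from itertools import count


def merge_reindex(per_organism_records):
    """Copy every record into a unified ID space (hypothetical sketch).

    per_organism_records: dict of organism name -> list of (local_id, payload).
    Returns (merged, id_map), where merged is a list of (unified_id, payload)
    and id_map translates (organism, local_id) -> unified_id.
    """
    next_id = count(1)
    merged = []
    id_map = {}
    # Sort organism names so the assigned IDs are reproducible between runs.
    for organism in sorted(per_organism_records):
        for local_id, payload in per_organism_records[organism]:
            unified_id = next(next_id)
            id_map[(organism, local_id)] = unified_id
            # The payload is copied unchanged; only its ID differs.
            merged.append((unified_id, payload))
    return merged, id_map


datasets = {
    "yeast": [(1, "geneA"), (2, "geneB")],
    "worm": [(1, "geneC")],
}
merged, id_map = merge_reindex(datasets)
```

Every payload ends up stored twice (once per organism, once merged), and anything downstream that refers to IDs has to be regenerated against `id_map`.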
This could be improved in a couple of ways:
The 'right' way (but could affect users, requires care):
- update application APIs and binary data products so that data is always retrieved by organism; IDs then only need to be unique within an organism and may clash between organisms.
- then we could simply skip the merge step and just distribute the aggregation of the individual organism datasets
- would make it easy to update an individual organism without touching the others
- but requires changes to application code
- but will change the format of data distributed to users via the plugin, requiring compatibility workarounds
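A minimal sketch of what organism-scoped lookup could look like, assuming lookups are keyed by organism plus a local ID. The class `OrganismScopedStore` and its methods are hypothetical and do not correspond to existing application code.

```python
class OrganismScopedStore:
    """Hypothetical store where IDs are only unique within one organism."""

    def __init__(self):
        self._data = {}  # organism name -> {local_id: payload}

    def add_organism(self, organism, records):
        # Adding or replacing one organism leaves the others untouched.
        self._data[organism] = dict(records)

    def get(self, organism, local_id):
        return self._data[organism][local_id]


store = OrganismScopedStore()
store.add_organism("human", {1: "BRCA1", 2: "TP53"})
store.add_organism("yeast", {1: "CDC28"})  # ID 1 clashes with human, which is fine

# Updating yeast alone does not require rebuilding or re-indexing human.
store.add_organism("yeast", {1: "CDC28", 2: "ACT1"})
```

With this shape there is nothing to merge: the distributed dataset is just the collection of per-organism datasets. The cost is that every caller must now supply the organism, which is the application and plugin change noted above.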
The 'wrong' way (but won't affect users):
- could still reduce data duplication by reserving ID space for each organism in the organism.cfg properties file (e.g. network ID range X-Y, node ID range W-Z, and likewise for attributes and so on)
- then the merge step could be simplified to checking that IDs don't clash, and then just copying the data
- no changes to user data or application code, only the pipeline
- but manual and likely error-prone
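The simplified merge check could be sketched as follows, assuming each organism's reserved range has already been read from its organism.cfg into an inclusive (low, high) pair. Parsing the file is not shown, and the function names and the example ranges are made up for illustration.

```python
from itertools import combinations


def find_range_clashes(reserved):
    """Return pairs of organisms whose reserved ID ranges overlap.

    reserved: dict of organism name -> (low, high), both bounds inclusive.
    """
    clashes = []
    for (org_a, (a_lo, a_hi)), (org_b, (b_lo, b_hi)) in combinations(
        sorted(reserved.items()), 2
    ):
        # Two inclusive ranges overlap when each starts before the other ends.
        if a_lo <= b_hi and b_lo <= a_hi:
            clashes.append((org_a, org_b))
    return clashes


def ids_outside_range(ids, id_range):
    """Return the IDs that fall outside an organism's reserved range."""
    low, high = id_range
    return [i for i in ids if not low <= i <= high]


# Hypothetical network ID ranges; worm was mistakenly given part of yeast's space.
network_ranges = {
    "human": (1, 1000),
    "yeast": (1001, 2000),
    "worm": (1500, 2500),
}
clashes = find_range_clashes(network_ranges)
stray = ids_outside_range([999, 1000, 1001], network_ranges["human"])
```

If both checks come back empty for every ID type, the merge reduces to copying the per-organism files as they are. The checks catch mistakes in the manually maintained ranges, but they do not remove the need to maintain them.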