simplify organism data merging

A current bottleneck in the pipeline is merging the separate per-organism datasets into a single multi-organism dataset. Historically, the data design placed items such as genes and networks for all organisms into shared database tables, identified by an internal ID. This created a unified ID space for looking up the data, but it also causes interdependencies between organisms when building a dataset. The revised data pipeline allows each organism to be built independently and adds a merge step that takes care of the interdependencies.

However, the merge step duplicates data (it just re-indexes files to map them into the unified ID space), which wastes time and disk space during the build. It also requires re-running the data processing steps that follow the merge (the merge step runs at the generic_db level).

This could be improved in a couple of ways:

The 'right' way (but could affect users, requires care):
- update application APIs and binary data products to always be retrievable by organism, so that IDs only need to be unique within an organism and may clash across organisms
- then we could simply skip the merge step and just distribute the aggregation of the individual organism datasets
- would make it easy to update an individual organism without touching the others
- but requires changes to application code
- but will change the format of data distributed to users via the plugin, requiring compatibility workarounds
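
As a rough illustration of the organism-scoped lookup described above, here is a minimal Python sketch. The names `GeneKey`, `aggregate`, `org_a`, and `org_b` are hypothetical and not taken from the pipeline or application code. It only shows why per-organism IDs may clash once every lookup is keyed by organism.

```python
from typing import Dict, NamedTuple


class GeneKey(NamedTuple):
    """Hypothetical composite key: an ID is only unique within its organism."""
    organism: str
    gene_id: int


def aggregate(datasets: Dict[str, Dict[int, str]]) -> Dict[GeneKey, str]:
    """Combine per-organism tables without re-indexing any IDs."""
    combined: Dict[GeneKey, str] = {}
    for organism, genes in datasets.items():
        for gene_id, name in genes.items():
            combined[GeneKey(organism, gene_id)] = name
    return combined


# Both organisms use IDs 1 and 2; the organism part of the key keeps them apart.
org_a = {1: "a-gene-1", 2: "a-gene-2"}
org_b = {1: "b-gene-1", 2: "b-gene-2"}
combined = aggregate({"org_a": org_a, "org_b": org_b})
```

With this shape, updating one organism means replacing only its own entries. The IDs of the other organisms never change.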

The 'wrong' way (but won't affect users):
- could still reduce data duplication by reserving an ID range for each organism in the organism.cfg properties file (e.g. network ID range X-Y, node ID range W-Z, and likewise for attributes and so on)
- then the merge step could be simplified to checking that ids don't clash, and then just copying the data
- no changes to user data or application code, only the pipeline
- but manual and likely error-prone
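
The clash check in the simplified merge step could look roughly like the Python sketch below. It assumes the reserved ranges have already been read from each organism.cfg into a dict of inclusive `(start, end)` pairs. The organism names, the range values, and the function names are hypothetical, and the actual config keys are not specified here.

```python
from itertools import combinations
from typing import Dict, List, Tuple

IdRange = Tuple[int, int]  # inclusive (start, end); assumed representation


def ranges_overlap(a: IdRange, b: IdRange) -> bool:
    """Two inclusive ranges overlap if each starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_clashes(reserved: Dict[str, IdRange]) -> List[Tuple[str, str]]:
    """Return every pair of organisms whose reserved ID ranges overlap."""
    pairs = combinations(sorted(reserved.items()), 2)
    return [(x, y) for (x, rx), (y, ry) in pairs if ranges_overlap(rx, ry)]


# Hypothetical node ID reservations; org_c was given a range that runs into org_b's.
node_ranges = {
    "org_a": (1, 1000),
    "org_b": (1001, 2000),
    "org_c": (1500, 2500),
}
clashes = find_clashes(node_ranges)
```

A pairwise comparison is quadratic in the number of organisms, which should be fine for a small organism list. If the list of clashes is empty, the merge can copy the data as is.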


simplify organism data merging #19
