simplify organism data merging #19

@kzuberi

A current bottleneck in pipeline processing is merging the separate per-organism datasets into a single multi-organism dataset. Historically, the data design placed items like genes and networks for all organisms into shared database tables, identified by an internal ID. This created a unified ID space for looking up the data, but it also causes interdependencies between organisms when building a dataset. The revised data pipeline allows each organism to be built independently, and adds a merge step that takes care of the interdependencies.

However, the merge step duplicates data (it just re-indexes files to map them to the unified ID space), which wastes time and disk space during the build. It also requires re-running the data processing steps that follow the merge (the merge step runs at the generic_db level).

This could be improved in a couple of ways:

The 'right' way (but could affect users, requires care):

  • update application APIs and binary data products so that data is always retrieved by organism; IDs then only need to be unique within an organism and can clash across organisms.
  • then we could simply skip the merge step and just distribute the aggregation of the individual organism datasets
  • would make it easy to update an individual organism without touching the others
  • but requires changes to application code
  • but will change the format of data distributed to users via the plugin, requiring compatibility workarounds
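
As a rough illustration of what "retrievable by organism" could mean, here is a minimal sketch in which every lookup is keyed by an (organism, ID) pair. The class name, method names, and sample values are hypothetical and are not taken from the existing application code; the point is only that the same numeric ID can exist in two organisms without a merge step.

```python
from typing import Dict, Tuple


class OrganismScopedStore:
    """Hypothetical store where IDs only need to be unique per organism."""

    def __init__(self) -> None:
        # Composite key: (organism name, organism-local gene ID).
        self._genes: Dict[Tuple[str, int], str] = {}

    def add_gene(self, organism: str, gene_id: int, symbol: str) -> None:
        self._genes[(organism, gene_id)] = symbol

    def get_gene(self, organism: str, gene_id: int) -> str:
        # Callers must always say which organism they mean.
        return self._genes[(organism, gene_id)]


store = OrganismScopedStore()
store.add_gene("human", 1, "gene-a")  # illustrative values only
store.add_gene("yeast", 1, "gene-b")  # same ID, different organism: no clash
```

With this shape, updating one organism only touches the entries under its own key prefix, which is the property the list above is after.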

The 'wrong' way (but won't affect users):

  • could still reduce data duplication by reserving an ID range for each organism in the organism.cfg properties file (e.g. network ID range X-Y, node ID range W-Z, and likewise for attributes and so on)
  • then the merge step could be simplified to checking that IDs don't clash, and then just copying the data
  • no changes to user data or application code, only the pipeline
  • but manual and likely error prone
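
The simplified merge check described above could look something like the following sketch. It assumes the reserved ranges have already been read from each organism.cfg into a dictionary of inclusive (start, end) pairs; the function names, the dictionary layout, and the sample ranges are all hypothetical, since the issue does not define a config format for them.

```python
from typing import Dict, Iterable, List, Tuple

IdRange = Tuple[int, int]  # inclusive (start, end); assumed representation


def find_range_clashes(reserved: Dict[str, IdRange]) -> List[Tuple[str, str]]:
    """Return pairs of organisms whose reserved ID ranges overlap."""
    ordered = sorted(reserved.items(), key=lambda item: item[1][0])
    clashes = []
    for i, (name_a, (_start_a, end_a)) in enumerate(ordered):
        for name_b, (start_b, _end_b) in ordered[i + 1:]:
            if start_b > end_a:
                break  # sorted by start, so no later range overlaps name_a
            clashes.append((name_a, name_b))
    return clashes


def ids_outside_range(ids: Iterable[int], id_range: IdRange) -> List[int]:
    """Return the IDs that fall outside an organism's reserved range."""
    start, end = id_range
    return [i for i in ids if not start <= i <= end]


# Illustrative ranges only; real values would come from organism.cfg.
network_ranges = {
    "human": (1, 1000),
    "mouse": (1001, 2000),
    "yeast": (1500, 2500),
}
clashes = find_range_clashes(network_ranges)
stray = ids_outside_range([5, 1000, 1001], network_ranges["human"])
```

If both checks come back empty for every ID type, the merge could reduce to copying files. The checks do not remove the manual step of choosing the ranges, which is where the error-proneness noted above comes from.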
