binary file format for pairs #85

@golobor

Issue: storing Hi-C contacts in a gzipped .tsv causes major slowdowns for some computations. We need to pick a binary container and write software for common operations.

.tsv/.csv:
Cons:

  • .tsv/.csv is row-oriented. Extra fields, like readID or sam_fields, are really heavy compared to chrom and pos, yet they have to be unpacked every time a file is read through.
  • text is very expensive to compress/decompress and parse. As a result, calculation of P(s) curves and other stats can take 10 minutes or more. It could potentially be done in seconds, if chrom and pos were binary.
  • no random access. There is a solution, bgzip+pairix, but it has many moving parts/dependencies.
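
To make the row-oriented cost concrete, here is a minimal sketch (not a benchmark) of the two access patterns. The column order `readID chrom1 pos1 chrom2 pos2 strand1 strand2` follows the .pairs convention; the sample rows, the integer chromosome codes and the use of Python's `array` module as a stand-in for a binary column store are all assumptions made for illustration.

```python
import array
import gzip
import io

# Hypothetical sample rows: readID, chrom1, pos1, chrom2, pos2, strand1, strand2
rows = [
    ("read1", "chr1", 100, "chr1", 5100, "+", "-"),
    ("read2", "chr1", 200, "chr1", 1200, "+", "+"),
    ("read3", "chr1", 300, "chr2", 900, "-", "+"),
]

# Row-oriented text: every field is decompressed and split,
# even though only chrom and pos are needed for a P(s) curve.
text = "".join("\t".join(map(str, r)) + "\n" for r in rows)
gz_blob = gzip.compress(text.encode())


def cis_distances_from_text(blob):
    dists = []
    with gzip.open(io.BytesIO(blob), "rt") as f:
        for line in f:
            _, c1, p1, c2, p2, _, _ = line.rstrip("\n").split("\t")
            if c1 == c2:
                dists.append(abs(int(p2) - int(p1)))
    return dists


# Column-oriented binary: each column is its own buffer, so readID and
# strands are never touched and nothing is parsed from text.
chrom_codes = {"chr1": 0, "chr2": 1}  # assumed integer encoding
columns = {
    "chrom1": array.array("i", (chrom_codes[r[1]] for r in rows)).tobytes(),
    "pos1": array.array("q", (r[2] for r in rows)).tobytes(),
    "chrom2": array.array("i", (chrom_codes[r[3]] for r in rows)).tobytes(),
    "pos2": array.array("q", (r[4] for r in rows)).tobytes(),
}


def load(typecode, buf):
    col = array.array(typecode)
    col.frombytes(buf)
    return col


def cis_distances_from_columns(cols):
    c1, c2 = load("i", cols["chrom1"]), load("i", cols["chrom2"])
    p1, p2 = load("q", cols["pos1"]), load("q", cols["pos2"])
    return [abs(b - a) for a, b, x, y in zip(p1, p2, c1, c2) if x == y]


text_dists = cis_distances_from_text(gz_blob)
column_dists = cis_distances_from_columns(columns)
```

Both functions return the same cis distances; the difference is how many bytes have to be decompressed and parsed to get there.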

Pros:

  • tsv/csv is a format that is easy for a community to agree on, and it is the default expectation in bioinformatics
  • platform-agnostic: command-line, python, R, C, win/linux/mac - all can work with tsv, to some extent.
  • can be streamed between processes via pipes
  • merge-sort is available and fairly efficient for text files
  • .pairs.gz is already used by 4DN and is not going away.
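
For reference, this is roughly what the merge step looks like for text streams, and what would have to be reimplemented for any binary container. It is a sketch only: the sort key `(chrom1, chrom2, pos1, pos2)` and the column order are assumptions, and `merge_sorted_pairs` is a hypothetical helper name.

```python
import heapq
import io


def sort_key(line):
    # Assumed column order: readID, chrom1, pos1, chrom2, pos2, ...
    f = line.rstrip("\n").split("\t")
    return (f[1], f[3], int(f[2]), int(f[4]))


def merge_sorted_pairs(*streams):
    """Lazily merge already-sorted line streams into one sorted stream."""
    return heapq.merge(*streams, key=sort_key)


# Two small, already-sorted chunks standing in for sorted .pairs files.
chunk_a = io.StringIO("r1\tchr1\t100\tchr1\t500\nr3\tchr1\t900\tchr2\t50\n")
chunk_b = io.StringIO("r2\tchr1\t300\tchr1\t400\nr4\tchr2\t10\tchr2\t20\n")

merged = list(merge_sorted_pairs(chunk_a, chunk_b))
merged_ids = [line.split("\t")[0] for line in merged]
```

Because `heapq.merge` is lazy, the inputs can be pipes or files of any size; only one line per input stream is held in memory at a time.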

The alternative is to store pair tables in existing binary container files. The two options are:

HDF5:
Pros:

  • a major standard, developed by a company, used by NASA, not going away
  • an existing dependency, can be seen as an extension of cooler
  • can store multiple tables per file: chromsizes and artifacts, like P(s) curves, trans-levels and other summaries, can be kept inside the file. HDF5 can even store non-tabular data, which could be useful for summary tables.
  • easy appending, both along columns and along rows

Cons:

  • columnar storage has to be implemented on top of HDF5. @nvictus has prototyped it: https://github.com/nvictus/coltab, but it needs more work. The result can potentially be popular and useful for other people and projects (including cooler?..).
  • variable-length strings are not compressed, which makes them useless. The solution is to store string arrays in multiple chunks of fixed-length strings, with the length varying between chunks; the chunks could then be merged together in a virtual dataset. Requires some work to implement and maintain.
  • merge-sort has to be implemented.
  • development of the core format is glacial.
  • there is a certain dislike of the format by the community, due to issues with parallelization and complex implementation. Both issues seem to be improving over time.
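
The fixed-length-chunk workaround for strings mentioned above can be sketched in plain Python, without any HDF5 calls. Everything here is an assumption for illustration: the helper names, null padding, ASCII-only strings, and the chunk size. In a real implementation each chunk would be a fixed-length string dataset, and the concatenation done in `decode_chunks` would be the job of the virtual dataset.

```python
def encode_chunks(strings, chunk_size):
    """Split strings into chunks; pad each chunk to its own max byte length."""
    chunks = []
    for start in range(0, len(strings), chunk_size):
        encoded = [s.encode("ascii") for s in strings[start:start + chunk_size]]
        # Width is chosen per chunk, so one long string only inflates its own chunk.
        width = max(1, max(len(b) for b in encoded))
        # Null-pad to the chunk width (assumes no string contains "\0").
        payload = b"".join(b.ljust(width, b"\0") for b in encoded)
        chunks.append((width, payload))
    return chunks


def decode_chunks(chunks):
    """Concatenate all chunks back into one list of strings."""
    out = []
    for width, payload in chunks:
        for i in range(0, len(payload), width):
            out.append(payload[i:i + width].rstrip(b"\0").decode("ascii"))
    return out


read_ids = ["r1", "r22", "read_with_a_long_name", "r4"]
chunks = encode_chunks(read_ids, chunk_size=2)
widths = [width for width, _ in chunks]
chunked_bytes = sum(len(payload) for _, payload in chunks)
single_width_bytes = max(len(s) for s in read_ids) * len(read_ids)
```

With a single fixed width, every string would be padded to the longest one in the whole column; with per-chunk widths, only the chunk containing the long string pays for it.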

Parquet:
Pros:

  • a major standard, supported by big players in the IT industry. Not likely to die.
  • the library is young and is being developed very rapidly.
  • first-class support by pandas and arrow.
  • supports compressed variable-length strings and dictionary compression.
  • has built-in indexing of dataframes via block statistics
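
The idea behind indexing via block statistics is simple enough to sketch without Parquet itself. The code below is not the Parquet API; it is a toy model, with hypothetical helper names, of how per-block min/max values let a reader skip blocks that cannot contain the requested range.

```python
def build_block_stats(positions, block_size):
    """Record (start offset, min, max) for each block of a column."""
    stats = []
    for start in range(0, len(positions), block_size):
        block = positions[start:start + block_size]
        stats.append((start, min(block), max(block)))
    return stats


def query_range(positions, stats, block_size, lo, hi):
    """Return values in [lo, hi] and the number of blocks actually scanned."""
    hits, scanned = [], 0
    for start, block_min, block_max in stats:
        if block_max < lo or block_min > hi:
            continue  # rejected from the statistics alone, block is never read
        scanned += 1
        block = positions[start:start + block_size]
        hits.extend(p for p in block if lo <= p <= hi)
    return hits, scanned


# A sorted pos column split into three blocks of three values each.
positions = [10, 20, 30, 40, 50, 60, 70, 80, 90]
stats = build_block_stats(positions, block_size=3)
hits, scanned = query_range(positions, stats, 3, lo=35, hi=65)
```

The statistics are only selective when the column is sorted or at least clustered, which is why this works well for sorted pairs and poorly for unsorted ones.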

Cons:

  • only one table per file! Chromsizes would have to be stored in the header (could be an issue for low-quality assemblies), artifacts would have to be stored in separate parquet and non-parquet files (for non-tabular data). Alternatively, we keep all related datasets together in a single zero-compression zip file, but that could be difficult to use in C/C++ and R.
  • not designed for appending columns. Adding extra columns to .pairs would require rewriting the whole file. Appending rows seems to work, but is not preferred by design.
  • merge-sort has to be implemented.
  • extra dependency, though not a major one.
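
The zero-compression zip option is cheap to prototype in Python with the standard library. This is a sketch under stated assumptions: the member names are hypothetical, and the parquet payload is a placeholder byte string rather than a real parquet file.

```python
import io
import zipfile


def bundle(members):
    """Pack named byte payloads into an uncompressed (ZIP_STORED) archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, payload in members.items():
            zf.writestr(name, payload)
    return buf.getvalue()


# Hypothetical layout: one pairs table plus its side artifacts.
members = {
    "pairs.parquet": b"<placeholder for parquet bytes>",
    "chromsizes.tsv": b"chr1\t1000\nchr2\t800\n",
    "stats.json": b'{"cis": 2, "trans": 1}',
}
archive = bundle(members)

# Any member can be read back on its own, without touching the others.
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    names = zf.namelist()
    chromsizes = zf.read("chromsizes.tsv")
    info = zf.getinfo("pairs.parquet")
```

Since members are stored without compression, each one occupies a contiguous byte range of the archive, so the already-compressed parquet data is not compressed twice. Whether C/C++ and R tooling can use that range as conveniently is the open question raised above.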

Personally, I'm not happy with either of the solutions. Thoughts?...
