Incorrect compression of N bases #2

@fwip

From the documentation at https://docs.google.com/document/pub?id=1f-8C-ZfCUTEsO-EqvlcTXQ0M5aYM61Aet902dA8QZZk:

Bases are encoded to the .fxb file by first deleting all N’s, and then packing 3 or 4 bases per byte using a variable length code. The N’s can be restored because they always have a quality score of 0, and no other bases do.

This does not hold true for our data. As near as I can tell, N bases always have a quality score of 2 ("#"). Unfortunately, other bases sometimes have a quality score of 2 as well. No base in our data has a quality below 2.

As-is, the error only becomes evident on decompression:
fastqz error: unexpected end of .fxb

The N bases are left out entirely and, because none of them has a quality score of 0, they are never restored. Every subsequent base therefore shifts to an earlier position (including bases in subsequent reads).
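To illustrate the failure mode, here is a toy model of the scheme as the documentation describes it. This is a sketch, not fastqz's actual code: `pack` and `unpack` are hypothetical names, the real variable-length packing is replaced by a plain string, and qualities are assumed to be Phred+33 characters.

```python
def pack(reads):
    """Toy encoder: concatenate all bases from all reads, dropping every N."""
    return "".join(seq.replace("N", "") for seq, _qual in reads)


def unpack(packed, quals, n_qual="!"):
    """Toy decoder: emit N wherever the quality equals n_qual,
    otherwise consume the next base from the shared stream."""
    bases = iter(packed)
    reads = []
    for qual in quals:
        read = []
        for q in qual:
            if q == n_qual:
                read.append("N")
            else:
                base = next(bases, None)
                if base is None:
                    raise ValueError("unexpected end of base stream")
                read.append(base)
        reads.append("".join(read))
    return reads


# The N has quality "#" (2), as in our data, rather than "!" (0).
reads = [("ACNGT", "II#II"), ("TTGCA", "IIIII")]
packed = pack(reads)
```

With this input, `unpack(packed, ["II#II"])` returns `["ACGTT"]`: the first read has borrowed a base from the second. Decoding both reads raises `ValueError` once the stream runs dry, which matches the point at which the error above shows up.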

Possible fixes, in order of increasing estimated difficulty:

  1. Pre- and post-process our data outside of fastqz: convert N qualities to 0 ("!") before compression, and convert 0s back to 2 ("#") after decompression.
  2. Bundle the above into fastqz, possibly with a customizable "offset" value.
  3. Change the encoding scheme to store N values.
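
A minimal sketch of fix 1, assuming strict 4-line FASTQ records and Phred+33 qualities. The function names are hypothetical and nothing here touches fastqz itself. The quality character is only remapped at positions where the base is N, so non-N bases with quality 2 are left alone.

```python
def remap_n_quality(fastq_lines, src, dst):
    """Yield FASTQ lines, replacing quality char `src` with `dst`
    at positions where the base is N.

    Assumes each record is exactly 4 lines: header, sequence, '+', quality.
    """
    seq = ""
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 1:
            seq = line  # remember the bases for the quality line
        elif i % 4 == 3:
            line = "".join(
                dst if base == "N" and q == src else q
                for base, q in zip(seq, line)
            )
        yield line


def before_compression(fastq_lines):
    """Map N qualities from 2 ('#') to 0 ('!')."""
    return remap_n_quality(fastq_lines, "#", "!")


def after_decompression(fastq_lines):
    """Map N qualities from 0 ('!') back to 2 ('#')."""
    return remap_n_quality(fastq_lines, "!", "#")
```

The round trip is only lossless if every N has quality 2 and no N carries quality 0 in the original file, which is what we observe in our data. An N with any other quality would pass through unchanged and still break decompression.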
