Incorrect compression of N bases

From the documentation at https://docs.google.com/document/pub?id=1f-8C-ZfCUTEsO-EqvlcTXQ0M5aYM61Aet902dA8QZZk

> Bases are encoded to the .fxb file by first deleting all N’s, and then packing 3 or 4 bases per byte using a variable length code. The N’s can be restored because they always have a quality score of 0, and no other bases do.

This does not hold true for our data. As far as I can tell, N bases always have a quality score of 2 ("#"). Unfortunately, other bases **also** sometimes have a quality score of 2, so the quality score alone cannot identify the N bases. No observed bases have a quality below 2.

As-is, the error only becomes evident on decompression:

    fastqz error: unexpected end of .fxb

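The N bases are left out entirely and are never restored, so all subsequent bases shift forward into the vacated positions (including bases from subsequent reads), and the decoder runs out of stored bases before it reaches the end of the file.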
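To make the failure mode concrete, here is a toy model of the scheme as the documentation describes it. It is not fastqz's actual code: `encode_bases` and `decode_bases` are hypothetical names, the bit packing is skipped, and qualities are plain integers.

```python
def encode_bases(seq):
    """Toy encoder: drop every N, as the documentation describes."""
    return [base for base in seq if base != "N"]


def decode_bases(stored, quals):
    """Toy decoder: emit N where quality is 0, else take the next stored base."""
    stored = iter(stored)
    out = []
    for q in quals:
        if q == 0:
            out.append("N")
        else:
            try:
                out.append(next(stored))
            except StopIteration:
                # Analogous to "unexpected end of .fxb" in this toy model.
                raise EOFError("unexpected end of stored bases") from None
    return "".join(out)


# N carries quality 0: the round trip works.
print(decode_bases(encode_bases("ACNGT"), [30, 30, 0, 30, 30]))

# N carries quality 2: the decoder consumes "G" in the N position,
# then runs out of stored bases.
try:
    decode_bases(encode_bases("ACNGT"), [30, 30, 2, 30, 30])
except EOFError as err:
    print("error:", err)
```

In the second call every base after the N is read one position early, which matches the shifting described above.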

Possible fixes, in order of increasing estimated difficulty:
1. Pre- and post-process our data outside of fastqz: convert N qualities to 0 ("!") before compression, and convert 0s back to 2 ("#") after decompression.
2. Bundle the above into fastqz, possibly with a customizable "offset" value.
3. Change the encoding scheme to store N bases explicitly.
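A rough sketch of option 1 in Python, assuming plain four-line FASTQ records with Phred+33 qualities. The function names are made up for illustration and nothing here is part of fastqz. The round trip is only lossless because no base in our data has a quality below 2, so "!" never appears in the original files.

```python
N_QUAL_OBSERVED = "#"  # quality 2, what N bases carry in our data
N_QUAL_EXPECTED = "!"  # quality 0, what the fastqz documentation assumes


def mask_n_qualities(seq, qual):
    """Before compression: set the quality of every N base to '!'."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return "".join(
        N_QUAL_EXPECTED if base == "N" else q for base, q in zip(seq, qual)
    )


def restore_n_qualities(qual):
    """After decompression: map every '!' back to '#'."""
    return qual.replace(N_QUAL_EXPECTED, N_QUAL_OBSERVED)


def convert_fastq(lines, preprocess):
    """Yield FASTQ lines with each quality line rewritten.

    Assumes four lines per record (header, sequence, '+', quality).
    """
    it = (line.rstrip("\n") for line in lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        if preprocess:
            qual = mask_n_qualities(seq, qual)
        else:
            qual = restore_n_qualities(qual)
        yield from (header, seq, plus, qual)


record = ["@read1", "ACNT", "+", "#I#I"]
masked = list(convert_fastq(record, preprocess=True))
restored = list(convert_fastq(masked, preprocess=False))
print(masked[3], restored[3])
```

Note that the non-N base with quality "#" is left alone in the first pass, so only true N positions end up with quality 0.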

