From the documentation at https://docs.google.com/document/pub?id=1f-8C-ZfCUTEsO-EqvlcTXQ0M5aYM61Aet902dA8QZZk:
"Bases are encoded to the .fxb file by first deleting all N’s, and then packing 3 or 4 bases per byte using a variable length code. The N’s can be restored because they always have a quality score of 0, and no other bases do."
This does not hold true for our data. As far as I can tell, N bases always have a quality score of 2 ("#"), but other bases sometimes have a quality score of 2 as well. No observed base has a quality below 2.
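One way to check this observation on a given file is to tally quality scores separately for N and non-N bases. The following is a minimal sketch, assuming plain 4-line FASTQ records and Phred+33 quality encoding; the function name `tally_qualities` is hypothetical and not part of fastqz.

```python
from collections import Counter


def tally_qualities(fastq_lines, offset=33):
    """Count quality scores for N bases and for all other bases.

    Assumes 4-line FASTQ records (header, sequence, '+', qualities)
    and Phred+33 encoding; both are assumptions about the input.
    """
    n_quals = Counter()
    other_quals = Counter()
    lines = [line.rstrip("\n") for line in fastq_lines]
    # Step through complete 4-line records only.
    for i in range(0, len(lines) - 3, 4):
        seq, qual = lines[i + 1], lines[i + 3]
        for base, q in zip(seq, qual):
            score = ord(q) - offset
            if base == "N":
                n_quals[score] += 1
            else:
                other_quals[score] += 1
    return n_quals, other_quals
```

If the documented assumption held, the first counter would contain only the key 0 and the second would never contain it. On data like ours, the first counter would contain only the key 2, and the second would contain 2 among its keys.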
As-is, the error only becomes evident on decompression:
fastqz error: unexpected end of .fxb
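The N bases are left out of the .fxb file entirely, and since none of them has a quality score of 0, none is restored on decompression. Every subsequent base is therefore shifted forward by one position per missing N (including bases in subsequent reads), which is presumably why the decoder runs out of data before the last read is complete.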
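The failure mode can be illustrated with a toy model of the documented scheme. This is not fastqz's actual code: it skips the bit packing entirely, and `encode_bases` and `decode_bases` are hypothetical names. It only shows what happens when N bases are dropped on encoding and restored purely from quality 0 (Phred+33 assumed).

```python
def encode_bases(reads):
    """Toy encoder: concatenate the bases of all reads, dropping every N."""
    return "".join(b for seq, _qual in reads for b in seq if b != "N")


def decode_bases(stream, quals, offset=33):
    """Toy decoder: emit N wherever the quality is 0, else the next stored base."""
    bases = iter(stream)
    reads = []
    for qual in quals:
        seq = []
        for q in qual:
            if ord(q) - offset == 0:
                seq.append("N")
            else:
                try:
                    seq.append(next(bases))
                except StopIteration:
                    # Analogous to "unexpected end of .fxb".
                    raise EOFError("unexpected end of base stream") from None
        reads.append("".join(seq))
    return reads
```

With reads `("ACNG", "II#I")` and `("TTAA", "IIII")`, the encoder stores 7 bases but the decoder asks for 8: the first read comes back as `ACGT`, and the second read hits the end of the stream. If the N instead has quality `!`, both reads round-trip unchanged.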
Possible fixes, in order of increasing estimated difficulty:
- Pre- and post-process our data outside of fastqz: convert N qualities to 0 ("!") before compression, and convert them back to 2 ("#") after decompression.
- Bundle the above into fastqz, possibly with a customizable "offset" value.
- Change the encoding scheme to store N values.
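The first option could look roughly like the sketch below. It assumes Phred+33 encoding and that every N in the input has quality 2, as observed in our data; the function names are hypothetical. Only positions where the base is N are touched, so non-N bases with quality 2 keep their score.

```python
N_QUAL_OBSERVED = "#"  # Phred+33 quality 2, what our data uses for N
N_QUAL_EXPECTED = "!"  # Phred+33 quality 0, what fastqz documents for N


def remap_n_qualities(seq, qual, src, dst):
    """Replace quality `src` with `dst`, but only at positions where the base is N."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return "".join(
        dst if base == "N" and q == src else q
        for base, q in zip(seq, qual)
    )


def preprocess(seq, qual):
    """Before compression: N qualities 2 -> 0."""
    return remap_n_qualities(seq, qual, N_QUAL_OBSERVED, N_QUAL_EXPECTED)


def postprocess(seq, qual):
    """After decompression: N qualities 0 -> 2."""
    return remap_n_qualities(seq, qual, N_QUAL_EXPECTED, N_QUAL_OBSERVED)
```

For example, `preprocess("ACNGN", "I##I#")` returns `"I#!I!"`, and `postprocess("ACNGN", "I#!I!")` returns the original `"I##I#"`. The round trip is only lossless if no N in the input already has quality 0, which matches what we have observed so far but should be checked per data set.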