Retain spaces when splitting long lines #4
When long qualifier values are split on spaces to keep the total line length less than 80 characters, and the final character on a line is a space, that space is deleted. However, this prevents parsing the original value - it's not possible to know when joining the lines together whether a space should be inserted or not. For example:

Lipopolysaccharide export system ATP-binding protein LptB

will appear in the GenBank text as:

/product="Lipopolysaccharide export system ATP-binding
protein LptB"

and when this is read later and newlines are removed, it will be:

Lipopolysaccharide export system ATP-bindingprotein LptB

(note `bindingprotein` is one word now).

This commit retains the final space at the end of the line so that the original value can be reconstructed unambiguously.
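To make the round trip concrete, here is a minimal sketch of the idea, not the code from this PR. The function names `split_retaining_spaces` and `join_lines` are hypothetical, the width is passed in rather than derived from the 80-column layout, and lengths are counted in bytes, which assumes ASCII qualifier values.

```rust
/// Split `value` into lines of at most `width` bytes, breaking only after a
/// space and keeping that space at the end of the line it closes.
/// Hypothetical helper for illustration; a single word longer than `width`
/// is left on an overlong line of its own.
fn split_retaining_spaces(value: &str, width: usize) -> Vec<String> {
    let mut lines = Vec::new();
    let mut current = String::new();
    // `split_inclusive` keeps each trailing space attached to its word.
    for token in value.split_inclusive(' ') {
        if !current.is_empty() && current.len() + token.len() > width {
            // Start a new line; the finished line keeps its trailing space.
            lines.push(std::mem::take(&mut current));
        }
        current.push_str(token);
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}

/// Rejoin by plain concatenation. Because no space was dropped when
/// splitting, there is nothing to guess when joining.
fn join_lines(lines: &[String]) -> String {
    lines.concat()
}
```

With a width of 48, the example value splits after `ATP-binding ` (space included), and concatenating the two lines gives back the original string exactly.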
This commit implements all the changes recommended by `cargo clippy`, which is currently causing the CI pipeline to fail.
Thanks for the PR and thanks for fixing the clippy errors! It might take me a while to review this one though, gotta get my head around it (I remember this being a bit tricky).
So this is a bit of a tricky one. From my reading of the docs here and looking at some official GenBank files (like mg1566.gb), it seems like the interpretation of the newlines in the qualifier values needs to be context-dependent, i.e. dependent on the key. For example, for sequences the newlines should be dropped, but for descriptive text they are equivalent to spaces. I did go a tiny bit of the way towards implementing that here for translations, but for everything else the newlines are just left as is and it is up to the library user to figure out what they mean, which is not very consistent at all 😀. I'm not sure what the best solution is here really. I think it might be good to compare other implementations like BioPython to see how they handle this.
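For discussion, a key-dependent join could look roughly like the sketch below. This is only an illustration of the idea, not what the library currently does: `join_qualifier_value` is a hypothetical name, and treating `translation` as the only sequence-like key is an assumption.

```rust
/// Join the physical lines of a qualifier value according to its key.
/// Hypothetical helper; the set of sequence-like keys is an assumption.
fn join_qualifier_value(key: &str, raw: &str) -> String {
    // Trim each line so stray leading or trailing whitespace is ignored.
    let parts = raw.lines().map(str::trim);
    match key {
        // Sequence data: newlines carry no meaning, so drop them.
        "translation" => parts.collect::<Vec<_>>().concat(),
        // Descriptive text: a newline stands in for a single space.
        _ => parts.collect::<Vec<_>>().join(" "),
    }
}
```

Under this scheme a `/product` value split after `ATP-binding` reads back with a single space whether or not the writer kept the trailing space, at the cost of not being able to represent a value that was split in the middle of a word.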