Skip to content

Retain spaces when splitting long lines#4

Open
jimrybarski wants to merge 2 commits into
dlesl:masterfrom
jimrybarski:master
Open

Retain spaces when splitting long lines#4
jimrybarski wants to merge 2 commits into
dlesl:masterfrom
jimrybarski:master

Conversation

@jimrybarski
Copy link
Copy Markdown

When long qualifier values are split on spaces to keep the total line length less than 80 characters, and the final character on a line is a space, that space is deleted. However, this prevents parsing the original value - it's not possible to know when joining the lines together whether a space should be inserted or not. For example:

Lipopolysaccharide export system ATP-binding protein LptB

will appear in the Genbank text as:

/product="Lipopolysaccharide export system ATP-binding
 protein LptB"

and when this is read later and newlines are removed, it will be:

Lipopolysaccharide export system ATP-bindingprotein LptB
(note bindingprotein is one word now).

This commit retains the final space at the end of the line so that the original value can be reconstructed unambiguously.

When long qualifier values are split on spaces to keep the total line length less than 80 characters,
and the final character on a line is a space, that space is deleted. However, this prevents parsing the
original value - it's not possible to know when joining the lines together whether a space should be
inserted or not. For example:

Lipopolysaccharide export system ATP-binding protein LptB

will appear in the Genbank text as:

/product="Lipopolysaccharide export system ATP-binding
 protein LptB"

and when this is read later and newlines are removed, it will be:

Lipopolysaccharide export system ATP-bindingprotein LptB

This commit retains the final space at the end of the line so that the original value can be
reconstructed unambiguously.
This commit implements all the changes recommended by `cargo clippy`, which is currently causing the CI pipeline to fail.
@dlesl
Copy link
Copy Markdown
Owner

dlesl commented Nov 24, 2022

Thanks for the PR and thanks for fixing the clippy errors! It might take me a while to review this one though, gotta get my head around it (I remember this being a bit tricky)

@dlesl
Copy link
Copy Markdown
Owner

dlesl commented Nov 24, 2022

So this is a bit of a tricky one, from my reading of the docs here and looking at some official GenBank files (like mg1566.gb) it seems like the interpretation of the newlines in the qualifier values needs to be context-dependent, ie. on the key. For example for sequences the newlines should be dropped but for descriptive text they are equivalent to spaces. I did go a tiny bit of the way towards implementing that here for translations, but for everything thing else the newlines are just left as is and it is up to the library user to figure out what they mean, which not very consistent at all 😀. I'm not sure what the best solution is here really, I think it might be good to compare other implementations like BioPython to see how they handle this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants