Skip to content

[BUG] Validate DNA input for AptaNet k-mers#399

Open
prashu0705 wants to merge 3 commits intogc-os-ai:mainfrom
prashu0705:fix/aptanet-dna-validation
Open

[BUG] Validate DNA input for AptaNet k-mers#399
prashu0705 wants to merge 3 commits intogc-os-ai:mainfrom
prashu0705:fix/aptanet-dna-validation

Conversation

@prashu0705
Copy link
Copy Markdown

Reference Issues/PRs

Fixes #322.

What does this implement/fix? Explain your changes.

This adds DNA-only input validation to generate_kmer_vecs.

AptaNet k-mer features are documented as taking DNA aptamer sequences. Previously, RNA-like input such as a sequence containing U was accepted, but k-mers containing non-DNA characters were not represented in the generated feature vector.

Now, generate_kmer_vecs raises a clear ValueError when the input sequence contains non-DNA characters.

What should a reviewer concentrate their feedback on?

  • Whether the validation location is appropriate.
  • Whether the ValueError message is clear enough for users.

Did you add any tests for the change?

Yes. I added one focused regression test checking that RNA input containing U raises a ValueError.

Tested with: pytest pyaptamer/utils/tests/test_aptanet_utils.py -q

Any other comments?

Kept this PR minimal based on maintainer feedback.

PR checklist

  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
  • Added/modified tests
  • Used pre-commit hooks when committing to ensure that code is compliant with hooks. Install hooks with pre-commit install. To run hooks independent of commit, execute pre-commit run --all-files

Copy link
Copy Markdown
Collaborator

@satvshr satvshr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if tests pass

@prashu0705
Copy link
Copy Markdown
Author

Hi @satvshr, I've opened PR #399 to address this.
The fix adds input validation to generate_kmer_vecs that raises a ValueError when the sequence contains non-DNA characters (e.g. U). Previously, such input was silently accepted but k-mers with non-DNA characters were simply absent from the feature vector, which could cause subtle bugs.
As a follow-up, I also updated the AptaNet tutorial notebooks and the _aptamer_aptanet.py docstring to use DNA sequences consistently, since the CI was failing due to the existing RNA-style examples conflicting with the new validation.
A regression test is included. Happy to make changes based on feedback!

@satvshr
Copy link
Copy Markdown
Collaborator

satvshr commented Apr 13, 2026

Wait if AptaNet accepts U too I dont see why we should remove its support, why do we need this PR again?

but k-mers containing non-DNA characters were not represented in the generated feature vector.

Why do you say so, can you give evidence supporting this statement?

@prashu0705
Copy link
Copy Markdown
Author

prashu0705 commented Apr 13, 2026

Looking at the code in _aptanet_utils.py, the k-mer vocabulary is generated as:

DNA_BASES = list("ACGT")
all_kmers = ["".join(p) for p in product(DNA_BASES, repeat=i)]

This means the vocabulary only contains k-mers made of A, C, G, T. When counting k-mers from the input sequence:

if kmer in kmer_counts:
    kmer_counts[kmer] += 1

Any substring containing U silently fails this check and is skipped. For example, for a sequence like AUCG, substrings U, AU, UC, AUC, UCG, AUCG are all dropped with no warning. The final frequency vector is then normalized over only the DNA k-mers that were counted, giving incorrect relative frequencies — the vector looks valid but doesn't actually represent the full sequence.

So U isn't "supported" so much as silently ignored, which is arguably worse than raising an error, since the user gets subtly wrong features without any indication that something went wrong. That's why this PR removes the appearance of U support — it was never actually supported correctly in the first place, just silently broken.

@satvshr
Copy link
Copy Markdown
Collaborator

satvshr commented Apr 13, 2026

Thanks! Why not add U support rather than removing it?

@prashu0705
Copy link
Copy Markdown
Author

That's a fair point, and I did consider it. The reason I went with validation instead is that "adding U support" isn't a single obvious fix — it requires a design decision about what U support actually means for AptaNet specifically:

Option 1: U → T normalization

This is what the original AptaNet paper does (Emami & Ferdousi, 2021 — they explicitly convert RNA to DNA by replacing U with T before encoding). It's biologically reasonable since T and U are functionally equivalent for base-pairing purposes. But it silently modifies the user's input, which is the same category of problem as the original bug — just less harmful. It should at minimum be documented clearly and opt-in.

Option 2: Expand the vocabulary to include U

I actually think this is the worse option, for a few reasons:

  1. It breaks model compatibility. AptaNet is a trained neural network whose input layer expects exactly 4^k features — 16 for k=2, 64 for k=3, 256 for k=4. Expanding the alphabet to {A, C, G, T, U} changes the vector size to 5^k (25, 125, 625). You can't feed that into the existing model without retraining from scratch. So this isn't just a change to a utility function — it breaks the entire AptaNet pipeline.

  2. U and T are not meaningfully distinct in this feature space. They represent the same base in terms of Watson-Crick pairing — the only difference is the sugar backbone (ribose vs deoxyribose), which k-mer frequency encoding doesn't capture at all. Treating them as separate characters would add a fifth "distinct" base that carries no additional biological signal, just noise.

  3. It breaks the reverse complement logic. The current implementation pairs A↔T and C↔G. If U is its own character, its complement would also be A — meaning both T and U complement A, which is inconsistent and would require special-casing throughout the codebase.

Option 2 would only make sense if we were building a brand new RNA-specific model trained from scratch with a 5-character alphabet — which is a research project, not a bug fix.

Why I went with a ValueError

@siddharth7113 clarified earlier in this thread that AptaNet is DNA-only by design, and the function's own docstring says "The DNA sequence of the aptamer". So rather than expanding scope, I kept this PR minimal: just make the existing constraint explicit rather than silently broken. That aligns with the direction the maintainers confirmed when this issue was first triaged.

That said, if you're open to it, I'd genuinely like to work on proper RNA support as a follow-up issue/PR. I think Option 1 (U → T with clear documentation and a warning) is the right approach and I'm happy to implement it — just wanted to keep this bug fix focused and reviewable on its own first. Happy to open a separate issue to track it if that works for you.

Copy link
Copy Markdown
Collaborator

@satvshr satvshr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@prashu0705
Copy link
Copy Markdown
Author

The only remaining failure is run-jupyter-notebooksexamples/aptanet_tutorial.ipynb is hitting the new ValueError because the dataset it uses contains RNA sequences with U.

To fix it I was planning to push one more commit to this PR that adds a preprocessing step in the notebook replacing U with T before pipeline.fit(), along with a markdown note explaining why. The library validation stays exactly as is — this just shows users the correct preprocessing step, consistent with the original paper.

Should I go ahead and push it here, or would you prefer I handle it differently?

@satvshr
Copy link
Copy Markdown
Collaborator

satvshr commented Apr 14, 2026

To fix it I was planning to push one more commit to this PR that adds a preprocessing step in the notebook replacing U with T before pipeline.fit(), along with a markdown note explaining why. The library validation stays exactly as is — this just shows users the correct preprocessing step, consistent with the original paper.

Sounds good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] RNA k-mers with 'U' are silently ignored in generate_kmer_vecs

2 participants