[BUG] Validate DNA input for AptaNet k-mers by prashu0705 · Pull Request #399 · gc-os-ai/pyaptamer

prashu0705 · 2026-04-13T10:35:45Z

Reference Issues/PRs

Fixes #322.

What does this implement/fix? Explain your changes.

This adds DNA-only input validation to generate_kmer_vecs.

AptaNet k-mer features are documented as taking DNA aptamer sequences. Previously, RNA-like input such as a sequence containing U was accepted, but k-mers containing non-DNA characters were not represented in the generated feature vector.

Now, generate_kmer_vecs raises a clear ValueError when the input sequence contains non-DNA characters.

What should a reviewer concentrate their feedback on?

Whether the validation location is appropriate.
Whether the ValueError message is clear enough for users.

Did you add any tests for the change?

Yes. I added one focused regression test checking that RNA input containing U raises a ValueError.

Tested with: pytest pyaptamer/utils/tests/test_aptanet_utils.py -q

Any other comments?

Kept this PR minimal based on maintainer feedback.

PR checklist

The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
Added/modified tests
Used pre-commit hooks when committing to ensure that code is compliant with hooks. Install hooks with pre-commit install. To run hooks independent of commit, execute pre-commit run --all-files

satvshr

LGTM if tests pass

prashu0705 · 2026-04-13T13:11:59Z

Hi @satvshr, I've opened PR #399 to address this.
The fix adds input validation to generate_kmer_vecs that raises a ValueError when the sequence contains non-DNA characters (e.g. U). Previously, such input was silently accepted but k-mers with non-DNA characters were simply absent from the feature vector, which could cause subtle bugs.
As a follow-up, I also updated the AptaNet tutorial notebooks and the _aptamer_aptanet.py docstring to use DNA sequences consistently, since the CI was failing due to the existing RNA-style examples conflicting with the new validation.
A regression test is included. Happy to make changes based on feedback!

satvshr · 2026-04-13T14:05:10Z

Wait if AptaNet accepts U too I dont see why we should remove its support, why do we need this PR again?

but k-mers containing non-DNA characters were not represented in the generated feature vector.

Why do you say so, can you give evidence supporting this statement?

prashu0705 · 2026-04-13T16:32:48Z

Looking at the code in _aptanet_utils.py, the k-mer vocabulary is generated as:

DNA_BASES = list("ACGT")
all_kmers = ["".join(p) for p in product(DNA_BASES, repeat=i)]

This means the vocabulary only contains k-mers made of A, C, G, T. When counting k-mers from the input sequence:

if kmer in kmer_counts:
    kmer_counts[kmer] += 1

Any substring containing U silently fails this check and is skipped. For example, for a sequence like AUCG, substrings U, AU, UC, AUC, UCG, AUCG are all dropped with no warning. The final frequency vector is then normalized over only the DNA k-mers that were counted, giving incorrect relative frequencies — the vector looks valid but doesn't actually represent the full sequence.

So U isn't "supported" so much as silently ignored, which is arguably worse than raising an error, since the user gets subtly wrong features without any indication that something went wrong. That's why this PR removes the appearance of U support — it was never actually supported correctly in the first place, just silently broken.

satvshr · 2026-04-13T20:18:49Z

Thanks! Why not add U support rather than removing it?

prashu0705 · 2026-04-14T05:39:36Z

That's a fair point, and I did consider it. The reason I went with validation instead is that "adding U support" isn't a single obvious fix — it requires a design decision about what U support actually means for AptaNet specifically:

Option 1: U → T normalization

This is what the original AptaNet paper does (Emami & Ferdousi, 2021 — they explicitly convert RNA to DNA by replacing U with T before encoding). It's biologically reasonable since T and U are functionally equivalent for base-pairing purposes. But it silently modifies the user's input, which is the same category of problem as the original bug — just less harmful. It should at minimum be documented clearly and opt-in.

Option 2: Expand the vocabulary to include U

I actually think this is the worse option, for a few reasons:

It breaks model compatibility. AptaNet is a trained neural network whose input layer expects exactly 4^k features — 16 for k=2, 64 for k=3, 256 for k=4. Expanding the alphabet to {A, C, G, T, U} changes the vector size to 5^k (25, 125, 625). You can't feed that into the existing model without retraining from scratch. So this isn't just a change to a utility function — it breaks the entire AptaNet pipeline.
U and T are not meaningfully distinct in this feature space. They represent the same base in terms of Watson-Crick pairing — the only difference is the sugar backbone (ribose vs deoxyribose), which k-mer frequency encoding doesn't capture at all. Treating them as separate characters would add a fifth "distinct" base that carries no additional biological signal, just noise.
It breaks the reverse complement logic. The current implementation pairs A↔T and C↔G. If U is its own character, its complement would also be A — meaning both T and U complement A, which is inconsistent and would require special-casing throughout the codebase.

Option 2 would only make sense if we were building a brand new RNA-specific model trained from scratch with a 5-character alphabet — which is a research project, not a bug fix.

Why I went with a `ValueError`

@siddharth7113 clarified earlier in this thread that AptaNet is DNA-only by design, and the function's own docstring says "The DNA sequence of the aptamer". So rather than expanding scope, I kept this PR minimal: just make the existing constraint explicit rather than silently broken. That aligns with the direction the maintainers confirmed when this issue was first triaged.

That said, if you're open to it, I'd genuinely like to work on proper RNA support as a follow-up issue/PR. I think Option 1 (U → T with clear documentation and a warning) is the right approach and I'm happy to implement it — just wanted to keep this bug fix focused and reviewable on its own first. Happy to open a separate issue to track it if that works for you.

satvshr

LGTM

prashu0705 · 2026-04-14T06:56:41Z

The only remaining failure is run-jupyter-notebooks — examples/aptanet_tutorial.ipynb is hitting the new ValueError because the dataset it uses contains RNA sequences with U.

To fix it I was planning to push one more commit to this PR that adds a preprocessing step in the notebook replacing U with T before pipeline.fit(), along with a markdown note explaining why. The library validation stays exactly as is — this just shows users the correct preprocessing step, consistent with the original paper.

Should I go ahead and push it here, or would you prefer I handle it differently?

satvshr · 2026-04-14T07:56:53Z

To fix it I was planning to push one more commit to this PR that adds a preprocessing step in the notebook replacing U with T before pipeline.fit(), along with a markdown note explaining why. The library validation stays exactly as is — this just shows users the correct preprocessing step, consistent with the original paper.

Sounds good

[BUG] Validate DNA input for AptaNet k-mers

69c1d76

satvshr approved these changes Apr 13, 2026

View reviewed changes

[BUG] Update AptaNet examples for DNA-only input

fac5f86

satvshr approved these changes Apr 14, 2026

View reviewed changes

[BUG] Normalize RNA sequences in AptaNet tutorial notebook

3e19646

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Validate DNA input for AptaNet k-mers#399

[BUG] Validate DNA input for AptaNet k-mers#399
prashu0705 wants to merge 3 commits intogc-os-ai:mainfrom
prashu0705:fix/aptanet-dna-validation

prashu0705 commented Apr 13, 2026

Uh oh!

satvshr left a comment

Uh oh!

prashu0705 commented Apr 13, 2026

Uh oh!

satvshr commented Apr 13, 2026 •

edited

Loading

Uh oh!

prashu0705 commented Apr 13, 2026 •

edited

Loading

Uh oh!

satvshr commented Apr 13, 2026

Uh oh!

prashu0705 commented Apr 14, 2026

Uh oh!

satvshr left a comment

Uh oh!

prashu0705 commented Apr 14, 2026

Uh oh!

satvshr commented Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

prashu0705 commented Apr 13, 2026

Reference Issues/PRs

What does this implement/fix? Explain your changes.

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

Any other comments?

PR checklist

Uh oh!

satvshr left a comment

Choose a reason for hiding this comment

Uh oh!

prashu0705 commented Apr 13, 2026

Uh oh!

satvshr commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

prashu0705 commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

satvshr commented Apr 13, 2026

Uh oh!

prashu0705 commented Apr 14, 2026

Option 1: U → T normalization

Option 2: Expand the vocabulary to include U

Why I went with a ValueError

Uh oh!

satvshr left a comment

Choose a reason for hiding this comment

Uh oh!

prashu0705 commented Apr 14, 2026

Uh oh!

satvshr commented Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

satvshr commented Apr 13, 2026 •

edited

Loading

prashu0705 commented Apr 13, 2026 •

edited

Loading

Why I went with a `ValueError`