[BUG] Validate DNA input for AptaNet k-mers #399
prashu0705 wants to merge 3 commits into gc-os-ai:main
Hi @satvshr, I've opened PR #399 to address this.
Wait, if AptaNet accepts U too, I don't see why we should remove its support. Why do we need this PR again?
Why do you say so? Can you give evidence supporting this statement?
Looking at the code in
DNA_BASES = list("ACGT")
all_kmers = ["".join(p) for p in product(DNA_BASES, repeat=i)]
This means the vocabulary only contains k-mers made of
if kmer in kmer_counts:
    kmer_counts[kmer] += 1
Any substring containing
So
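The silent drop described in this comment can be reproduced with a small standalone sketch. This is not the project's code: the function name count_kmers and the single fixed k are assumptions made for illustration, and only DNA_BASES, all_kmers and kmer_counts mirror the names quoted above.

```python
from itertools import product

DNA_BASES = list("ACGT")


def count_kmers(sequence, k=2):
    """Count k-mers against a DNA-only vocabulary (illustrative sketch only)."""
    # The vocabulary contains only k-mers built from A, C, G and T.
    all_kmers = ["".join(p) for p in product(DNA_BASES, repeat=k)]
    kmer_counts = dict.fromkeys(all_kmers, 0)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i : i + k]
        # Substrings outside the vocabulary are skipped without any warning.
        if kmer in kmer_counts:
            kmer_counts[kmer] += 1
    return kmer_counts


dna_total = sum(count_kmers("ACGT").values())  # AC, CG, GT are all counted
rna_total = sum(count_kmers("ACGU").values())  # GU is not in the vocabulary
print(dna_total, rna_total)  # prints "3 2"
```

In this sketch the RNA-like input loses one k-mer from its counts without any error, which is the behaviour the PR replaces with an explicit failure.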
Thanks! Why not add U support rather than removing it?
|
That's a fair point, and I did consider it. The reason I went with validation instead is that "adding U support" isn't a single obvious fix: it requires a design decision about what U support actually means for AptaNet specifically.
Option 1: U → T normalization
This is what the original AptaNet paper does (Emami & Ferdousi, 2021; they explicitly convert RNA to DNA by replacing U with T before encoding). It's biologically reasonable since T and U are functionally equivalent for base-pairing purposes. But it silently modifies the user's input, which is the same category of problem as the original bug, just less harmful. It should at minimum be documented clearly and opt-in.
Option 2: Expand the vocabulary to include U
I actually think this is the worse option, for a few reasons:
Option 2 would only make sense if we were building a brand new RNA-specific model trained from scratch with a 5-character alphabet, which is a research project, not a bug fix.
Why I went with a
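For reference, Option 1 from the comment above (U → T normalization that is documented and opt-in) could look roughly like the following. The helper name normalize_rna_to_dna and the convert_u flag are hypothetical and are not part of this PR or of the existing codebase.

```python
def normalize_rna_to_dna(sequence, convert_u=False):
    """Hypothetical helper: replace U with T only when the caller opts in."""
    if not convert_u:
        # Default: leave the input untouched so nothing is modified silently.
        return sequence
    # Explicit opt-in: map both upper- and lower-case U to the matching T.
    return sequence.replace("U", "T").replace("u", "t")


print(normalize_rna_to_dna("ACGU"))                  # ACGU
print(normalize_rna_to_dna("ACGU", convert_u=True))  # ACGT
```

Keeping the conversion behind a flag means a caller who passes RNA has to ask for the rewrite explicitly, which addresses the "silently modifies the user's input" concern.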
The only remaining failure is To fix it I was planning to push one more commit to this PR that adds a preprocessing step in the notebook replacing Should I go ahead and push it here, or would you prefer I handle it differently?
Sounds good |
Reference Issues/PRs
Fixes #322.
What does this implement/fix? Explain your changes.
This adds DNA-only input validation to generate_kmer_vecs.
AptaNet k-mer features are documented as taking DNA aptamer sequences. Previously, RNA-like input such as a sequence containing U was accepted, but k-mers containing non-DNA characters were not represented in the generated feature vector.
Now, generate_kmer_vecs raises a clear ValueError when the input sequence contains non-DNA characters.
What should a reviewer concentrate their feedback on?
ValueError message is clear enough for users.
Did you add any tests for the change?
Yes. I added one focused regression test checking that RNA input containing U raises a ValueError.
Tested with:
pytest pyaptamer/utils/tests/test_aptanet_utils.py -q
Any other comments?
Kept this PR minimal based on maintainer feedback.
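As a rough illustration of the validation described under "What does this implement/fix?", a DNA-only check might be sketched as below. The function name validate_dna, the error wording, and the assumption that only upper-case A, C, G and T are valid are all illustrative; the actual check inside generate_kmer_vecs may differ.

```python
DNA_BASES = frozenset("ACGT")


def validate_dna(sequence):
    """Hypothetical check: raise ValueError on any character outside A, C, G, T."""
    # Collect the offending characters so the message tells the user what to fix.
    invalid = sorted(set(sequence) - DNA_BASES)
    if invalid:
        raise ValueError(
            "Expected a DNA sequence containing only A, C, G, T; "
            f"found invalid characters: {invalid}"
        )
    return sequence


validate_dna("ACGT")  # passes and returns the sequence unchanged
try:
    validate_dna("ACGU")
except ValueError as err:
    print(err)
```

A regression test in the spirit of the one mentioned above would assert that the second call raises ValueError.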
PR checklist
pre-commit install. To run hooks independent of commit, execute pre-commit run --all-files