[BUG] seq2vec silently drops unknown characters instead of mapping to index 0 by rupeshca007 · Pull Request #382 · gc-os-ai/pyaptamer

rupeshca007 · 2026-04-12T12:17:26Z

Reference Issues/PRs

Fixes #381

What does this implement/fix? Explain your changes.

Fixes a silent data-corruption bug in seq2vec: unknown characters were dropped from sequences entirely instead of being mapped to index 0, as documented.

Changes implemented:

- Added output.append(0) and output_ss.append(0) inside the unmatched logic block.
- Implemented the seq_max_len boundary check in the fallback block to ensure chunk resets are correctly maintained.
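
For reviewers who want the intended behavior in isolation, here is a minimal sketch of the idea. It is not the pyaptamer implementation. The function name `encode_chars`, the toy vocabulary, and the greedy longest-match loop are all assumptions made for illustration, and the parallel secondary-structure list (output_ss) is left out for brevity. It only shows the two points above: an unmatched character emits 0 rather than being skipped, and the seq_max_len check also runs after that fallback.

```python
def encode_chars(seq, vocab, seq_max_len):
    """Hypothetical stand-in for the tokenizer loop, not the library's code.

    Greedily matches the longest known word at each position. A character
    that matches nothing is encoded as 0 instead of being dropped.
    Returns a list of chunks, each at most seq_max_len tokens long.
    """
    max_word_len = max(len(word) for word in vocab)
    chunks, output = [], []
    i = 0
    while i < len(seq):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_word_len, len(seq) - i), 0, -1):
            word = seq[i : i + n]
            if word in vocab:
                output.append(vocab[word])
                i += n
                break
        else:
            # Fallback: nothing matched, so emit the unknown index 0
            # and advance by one character (the buggy path only advanced).
            output.append(0)
            i += 1
        # Boundary check applies to matched and unmatched tokens alike,
        # so a chunk is closed even when its last token is an unknown.
        if len(output) == seq_max_len:
            chunks.append(output)
            output = []
    if output:
        chunks.append(output)  # trailing partial chunk, left unpadded here
    return chunks
```

With the toy vocabulary `{"A": 1, "C": 2, "G": 3, "U": 4, "AC": 5}` and `seq_max_len=3`, `encode_chars("ACGXU", ...)` returns `[[5, 3, 0], [4]]`: the unknown `X` occupies a position as 0 and is the token that closes the first chunk.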

What should a reviewer concentrate their feedback on?

N/A

Did you add any tests for the change?

No - existing tests continue to pass. The fix is a specific logic adjustment inside the tokenizer.

Any other comments?

None.

PR checklist

The PR title starts with either [ENH], [MNT], [DOC], or [BUG].
- [ ] Added/modified tests
- [x] Used pre-commit hooks when committing to ensure that code is compliant with hooks.

fix(utils): cleanly map unknown tokens to index 0 in seq2vec

f267624

rupeshca007 wants to merge 1 commit into gc-os-ai:main from rupeshca007:fix/seq2vec-unknown-token-bug