[BUG] seq2vec silently drops unknown characters instead of mapping to index 0 #382

Open
rupeshca007 wants to merge 1 commit into gc-os-ai:main from rupeshca007:fix/seq2vec-unknown-token-bug
Conversation

@rupeshca007
Reference Issues/PRs

Fixes #381

What does this implement/fix? Explain your changes.

Fixes a silent data-corruption bug in seq2vec: unknown characters were dropped from the sequence entirely instead of being mapped to the index 0 token, as documented.

Changes implemented:

  • Added output.append(0) and output_ss.append(0) to the block that handles unmatched characters.
  • Added the seq_max_len boundary check to that same fallback block so that chunk resets are still applied correctly.
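For reviewers, here is a minimal sketch of the behaviour the two changes aim for. It is not the project's seq2vec: the function name `encode_chunks`, its parameters, and the greedy longest-match strategy are assumptions made for illustration, and the parallel output_ss list is left out. It only shows an unknown character being emitted as 0 and the seq_max_len check running on the fallback path as well as the matched path.

```python
from typing import Dict, List


def encode_chunks(
    seq: str,
    vocab: Dict[str, int],
    max_word_len: int,
    seq_max_len: int,
) -> List[List[int]]:
    """Hypothetical greedy tokenizer; illustrative only, not the real seq2vec."""
    chunks: List[List[int]] = []
    output: List[int] = []
    i = 0
    while i < len(seq):
        matched = False
        # Try the longest candidate word first, then shorter ones.
        for k in range(min(max_word_len, len(seq) - i), 0, -1):
            word = seq[i : i + k]
            if word in vocab:
                output.append(vocab[word])
                i += k
                matched = True
                break
        if not matched:
            # Fallback: emit index 0 for the unknown character
            # instead of skipping it.
            output.append(0)
            i += 1
        # Boundary check runs after both the matched and the fallback
        # path, so a chunk filled by an unknown token is still closed.
        if len(output) == seq_max_len:
            chunks.append(output)
            output = []
    if output:
        # Keep the trailing partial chunk (no padding in this sketch).
        chunks.append(output)
    return chunks


vocab = {"A": 1, "C": 2, "AC": 3}
print(encode_chunks("AXC", vocab, max_word_len=2, seq_max_len=4))
print(encode_chunks("AXXC", vocab, max_word_len=2, seq_max_len=2))
```

In this sketch the unknown "X" keeps its position in the output, so downstream code that relies on positional alignment sees a sequence of the expected length.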

What should a reviewer concentrate their feedback on?

  • N/A

Did you add any tests for the change?

  • No. Existing tests continue to pass; the fix is a targeted logic adjustment inside the tokenizer.

Any other comments?

None.

PR checklist

  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG].
  • [ ] Added/modified tests
  • [x] Used pre-commit hooks when committing to ensure that code is compliant with hooks.

Development

Successfully merging this pull request may close these issues.

[BUG] seq2vec silently drops unknown characters instead of mapping to index 0

1 participant