@beabevi commented Jul 25, 2025

Use NumPy arrays for most of the processing. This doesn't change any existing functionality or sampling results.

In _create_sentences, replace the loop with np.array_split.
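
For illustration only, here is a minimal sketch of how a chunking loop can be swapped for np.array_split. The function name `split_into_sentences`, its parameters, and the fixed-size chunking are assumptions made for this example and are not taken from the actual _create_sentences code. Passing split indices (rather than a section count) keeps the same fixed-size chunks a slicing loop would produce, with a shorter final chunk.

```python
import numpy as np

def split_into_sentences(indices, sentence_size):
    """Hypothetical helper: cut an index array into chunks of `sentence_size`.

    Equivalent to looping over range(0, len(indices), sentence_size) and
    slicing, but done in a single np.array_split call.
    """
    indices = np.asarray(indices)
    # Split points at sentence_size, 2 * sentence_size, ... (exclusive of the end).
    cut_points = np.arange(sentence_size, len(indices), sentence_size)
    return np.array_split(indices, cut_points)

# Ten cell indices, sentences of four cells: the last sentence is short.
sentences = split_into_sentences(np.arange(10), 4)
```

Note that calling np.array_split with an integer section count instead would balance the chunk sizes, which is a different behaviour from a fixed-size loop.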

Split _create_batches into two passes:

  1. upsample potential "cell sentences" that are too small
  2. use np.concat to form batches
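
The two passes above could look roughly like the sketch below. It is an assumption-laden illustration, not the code in this PR: the helper names `upsample_small` and `make_batches`, the sampling-with-replacement strategy, and the `sentences_per_batch` parameter are all hypothetical. It uses np.concatenate, which np.concat aliases in NumPy 2.0 and later.

```python
import numpy as np

def upsample_small(sentences, size, rng):
    """Pass 1 (hypothetical): pad sentences shorter than `size`.

    Short sentences are extended by drawing extra items from themselves
    with replacement; sentences are assumed to be non-empty.
    """
    padded = []
    for s in sentences:
        if len(s) < size:
            extra = rng.choice(s, size=size - len(s), replace=True)
            s = np.concatenate([s, extra])
        padded.append(s)
    return padded

def make_batches(sentences, sentences_per_batch):
    """Pass 2 (hypothetical): concatenate consecutive sentences into flat batches."""
    return [
        np.concatenate(sentences[i:i + sentences_per_batch])
        for i in range(0, len(sentences), sentences_per_batch)
    ]

rng = np.random.default_rng(0)
sentences = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 10)]
padded = upsample_small(sentences, 4, rng)
batches = make_batches(padded, 2)
```

Separating the passes means the batching step only ever sees sentences of full length, so it reduces to slicing and concatenation.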

This also speeds up processing slightly (33% on the parse dataset).

@beabevi force-pushed the beabevi/simpler-sample branch from 73860fa to 5444ec6 on July 25, 2025 23:50
@abhinadduri left a comment

looks great, thanks!! we can merge after #51 and #54 with
