
feat(tokenization): Added flag to run tokenization on multiple cores#58

Open
talolard wants to merge 161 commits into allenai:master from talolard:master

Conversation

@talolard

I added a flag that runs the tokenization step on multiple cores per #57.

I ended up doing this with worker processes that read from and write to queues, so that each process loads spaCy only once.
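For readers who want a picture of the pattern, here is a minimal sketch of queue-fed worker processes. It is not the code in this PR: `load_tokenizer`, `worker`, and `tokenize_parallel` are hypothetical names, and whitespace splitting stands in for the spaCy tokenizer so the example has no dependencies. The n-1 default mirrors the core count mentioned in the test setup.

```python
import multiprocessing as mp


def load_tokenizer():
    # Stand-in for an expensive once-per-process load (e.g. a spaCy model).
    # Whitespace splitting keeps this sketch dependency-free.
    return str.split


def worker(in_q, out_q):
    tokenize = load_tokenizer()  # loaded once per worker, not per document
    while True:
        item = in_q.get()
        if item is None:  # sentinel: no more work for this worker
            break
        idx, text = item
        out_q.put((idx, tokenize(text)))


def tokenize_parallel(docs, n_workers=None, ctx=None):
    """Tokenize a list of strings in worker processes, preserving order."""
    ctx = ctx or mp.get_context()
    if n_workers is None:
        n_workers = max(1, mp.cpu_count() - 1)  # leave one core for the parent
    in_q, out_q = ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=worker, args=(in_q, out_q))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for item in enumerate(docs):
        in_q.put(item)
    for _ in procs:
        in_q.put(None)  # one sentinel per worker
    # Drain the output queue before joining, otherwise workers can block
    # on a full pipe and the join never returns.
    results = [None] * len(docs)
    for _ in range(len(docs)):
        idx, tokens = out_q.get()
        results[idx] = tokens
    for p in procs:
        p.join()
    return results
```

On platforms that start processes with spawn or forkserver, the call to `tokenize_parallel` needs to sit under an `if __name__ == "__main__":` guard. The single parent loop that drains `out_q` is the part that limits scaling to more cores.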

Seems to be about 2x faster with 3 cores. I tested with:

  • 50K documents
  • 146,813,409 "words" (per wc)
  • Using spacy tokenizer
  • On 4 cores (so 3 in use because we use n-1 cores)
  • Times are for the entire process, including the count vectorizer
    With multiproc: [image]
    Without multiproc: [image]

Possible improvements

  • Figuring out a way to drain the output queue faster would help this scale to more cores.
  • Reading the file in the child processes instead of the parent would reduce IPC. Not much of a gain though, because we're constrained on clearing the output queue.
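A rough sketch of the second idea, where each child reads the input file itself and the parent only passes a path. This is an illustration under assumptions, not code from this PR: it assumes one document per line, `shard_worker` and `tokenize_file` are hypothetical names, and whitespace splitting again stands in for the real tokenizer. Results still flow back through one output queue, so the bottleneck described above remains.

```python
import multiprocessing as mp


def shard_worker(path, worker_id, n_workers, out_q):
    tokenize = str.split  # stand-in for a tokenizer loaded once per process
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f):
            # Each worker scans the whole file but tokenizes only its share.
            if line_no % n_workers == worker_id:
                out_q.put((line_no, tokenize(line)))
    out_q.put(None)  # tell the parent this worker is finished


def tokenize_file(path, n_workers=2, ctx=None):
    """Tokenize a one-document-per-line file; returns tokens in line order."""
    ctx = ctx or mp.get_context()
    out_q = ctx.Queue()
    procs = [ctx.Process(target=shard_worker,
                         args=(path, i, n_workers, out_q))
             for i in range(n_workers)]
    for p in procs:
        p.start()
    results = {}
    finished = 0
    while finished < n_workers:  # drain until every worker has signalled
        item = out_q.get()
        if item is None:
            finished += 1
        else:
            line_no, tokens = item
            results[line_no] = tokens
    for p in procs:
        p.join()
    return [results[i] for i in sorted(results)]
```

Only the file path crosses the process boundary on the way in, which removes the input-side IPC. The cost is that every worker reads the full file, which is usually cheap next to tokenization.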

kernelmachine and others added 28 commits July 8, 2019 13:20
Add Colab Notebook with AG News walkthrough.
Miscellaneous updates and enhancements
Thanks for releasing this work!

This PR will update the currently dead link in ELMo docs
Fix dead link to ELMo docs in README.md
@talolard
Author

Don't merge this yet, I found a bug.



4 participants