feat(tokenization): Added flag to run tokenization on multiple cores by talolard · Pull Request #58 · allenai/vampire

talolard · 2020-03-22T19:44:48Z

I added a flag that runs the tokenization step on multiple cores, per #57.

I ended up doing this with worker processes that read from and write to a queue, so that each process loads spaCy only once.
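For readers who want a picture of the pattern described above, here is a minimal sketch of queue-based worker processes. It is not the code in this PR: `load_tokenizer`, `worker`, `tokenize_parallel`, and the `start_method` parameter are hypothetical names, and `str.split` stands in for the spaCy tokenizer so the example stays self-contained.

```python
import multiprocessing as mp
import os
from typing import Callable, Iterable, List, Optional

SENTINEL = None  # a None on the input queue tells a worker to stop


def load_tokenizer() -> Callable[[str], List[str]]:
    # Stand-in for an expensive one-time load (e.g. a spaCy model).
    # Assumption: whitespace splitting is enough for illustration.
    return str.split


def worker(in_q, out_q) -> None:
    tokenize = load_tokenizer()  # runs once per process, not once per document
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        idx, text = item
        out_q.put((idx, tokenize(text)))


def tokenize_parallel(
    docs: Iterable[str],
    n_workers: Optional[int] = None,
    start_method: Optional[str] = None,
) -> List[List[str]]:
    docs = list(docs)
    ctx = mp.get_context(start_method)  # None selects the platform default
    if n_workers is None:
        # Leave one core for the parent, mirroring the n-1 rule in this PR.
        n_workers = max(1, (os.cpu_count() or 2) - 1)

    in_q, out_q = ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=worker, args=(in_q, out_q)) for _ in range(n_workers)]
    for p in procs:
        p.start()

    for item in enumerate(docs):
        in_q.put(item)
    for _ in procs:
        in_q.put(SENTINEL)  # one stop signal per worker

    # The parent drains the output queue; results arrive out of order,
    # so the index restores the original document order.
    results: List[List[str]] = [[] for _ in docs]
    for _ in range(len(docs)):
        idx, tokens = out_q.get()
        results[idx] = tokens

    for p in procs:
        p.join()
    return results
```

Calling `tokenize_parallel(["a b", "c d e"], n_workers=2)` from under an `if __name__ == "__main__":` guard returns the token lists in input order. The single parent loop that drains `out_q` is the step that limits scaling in a design like this.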

This seems to be 2x faster with 3 cores. I tested with:

50K documents
146,813,409 "words" (per wc)
The spaCy tokenizer
4 cores (so 3 in use, because we use n-1 cores)
Times are for the entire process, including the count vectorizer

With multiproc

Without multiproc

Possible improvements

Figuring out a way to drain the output queue faster would help this scale to more cores.
Reading the file in the child processes instead of the parent would reduce IPC. That is not much of a gain, though, because we're constrained by how fast the output queue is cleared.
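The second idea can be sketched as follows, under stated assumptions: one document per line, `str.split` standing in for the real tokenizer, and hypothetical names `tokenize_stride` and `tokenize_file`. Each child opens the file itself and handles every `n_workers`-th line, so only the path is sent to the children. The results still travel back to the parent, which is the constraint noted above.

```python
import multiprocessing as mp
from typing import List, Optional, Tuple


def tokenize_stride(args: Tuple[str, int, int]) -> List[Tuple[int, List[str]]]:
    path, worker_id, n_workers = args
    out = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f):
            # Each child scans the whole file but tokenizes only its own lines.
            if line_no % n_workers == worker_id:
                out.append((line_no, line.split()))  # stand-in tokenizer
    return out


def tokenize_file(
    path: str,
    n_workers: int = 3,
    start_method: Optional[str] = None,
) -> List[List[str]]:
    ctx = mp.get_context(start_method)  # None selects the platform default
    jobs = [(path, i, n_workers) for i in range(n_workers)]
    with ctx.Pool(n_workers) as pool:
        chunks = pool.map(tokenize_stride, jobs)
    # Line numbers are unique, so sorting restores the original order.
    merged = sorted(pair for chunk in chunks for pair in chunk)
    return [tokens for _, tokens in merged]
```

Striding by line number keeps the example short. Splitting the file by byte offsets would avoid having every child scan the whole file, at the cost of more bookkeeping.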

talolard · 2020-03-23T12:04:59Z

Don't merge this yet; I found a bug.

Suchin Gururangan added 30 commits May 25, 2019 20:17

updated

77a8f58

updated

9a44a62

update

63034fd

update

583dcaf

all tests pass

9f5b193

all tests pass

19e1814

added fast vampire

414c7c6

added fast vampire

0454b73

updated

043cc8e

updated

c7daa38

updated tests

1e101df

updated tests

3e8a440

update

5f6f66d

update

617ad42

updated

a39de3d

updated

6d4ba39

runs

d33a142

runs

2e6e2d5

update

7f7e6da

update

10913d5

all tests pass

139788a

all tests pass

dedcf85

update

dd796cd

update

057fb3c

update

1508284

update

25fd616

updated scripts

2b841c6

updated scripts

407a49b

mypy fixes

75f1d3f

mypy fixes

29f44a3

kernelmachine and others added 28 commits July 8, 2019 13:20

Merge branch 'master' of github.com:allenai/vampire into colab

ca79598

Merge branch 'colab' of github.com:allenai/vampire into colab

8b8b9f1

added comments

1d0c575

Merge pull request allenai#36 from allenai/colab

2172842

Add Colab Notebook with AG News walkthrough.

add updates (e.g. kld clamping, dataset sampling, preprocessing speed…

2aeaf50

… updates)

updated embedding

f4de5e7

updated embedding

52c7052

update to kld clamping

667a926

updates to dataset reader, removed unused tokenizer

ca7ba3a

all tests pass

8aa43af

all checks pass

767b24f

all tests pass

46f9653

added intermediate encoder output to vae for efficiency

09659b0

Update README.md

19c0948

removed dependency on make-reference-corpus script

21ceb63

removed typos

b47533d

checks pass

5e3cbc0

Sums across document dimension instead of vocab dimension

85facf6

update'

5704a52

remove pdb trace

c108b11

Merge pull request allenai#38 from allenai/updates

e3795dd

Miscellaneous updates and enhancements

pinned to 0.9.0

56aa5d3

Merge pull request allenai#49 from allenai/pin

cef93f8

pinned to 0.9.0

pinned scipy

f896f34

Merge pull request allenai#50 from allenai/scipy

6d8be13

pinned scipy

Fix dead link to ELMo docs in README.md

cb0d7cb

Thanks for releasing this work! This PR will update the currently dead link in ELMo docs

Merge pull request allenai#54 from iechevarria/patch-1

2613609

Fix dead link to ELMo docs in README.md

feat(tokenization): Added flag to run tokenization on multiple cores

e2ae461

kernelmachine force-pushed the master branch from a81d40b to cf5c2fe Compare July 29, 2020 20:23

talolard wants to merge 161 commits into allenai:master from talolard:master


4 participants
