For larger datasets, tokenizing on a single core is wasteful when many cores are available. I'd suggest wrapping the relevant function in a process pool, or passing a pool in as an argument and calling `Pool.map` (or `Pool.imap`, which streams results and so plays nicely with the `tqdm` progress bar).
Happy to make a PR if it's a good fit for the repo
```python
with tqdm(open(data_path, "r"), desc=f"loading {data_path}") as f:
```
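Roughly what I have in mind, as a sketch: the `tokenize` function below is a placeholder for whatever the repo actually uses per line, and the pool/chunk sizes are illustrative defaults, not tuned values.

```python
from multiprocessing import Pool


def tokenize(line):
    # Placeholder for the repo's real per-line tokenizer; must be a
    # module-level function so it can be pickled and sent to workers.
    return line.strip().split()


def tokenize_lines(lines, processes=4, chunksize=1024):
    # imap preserves input order and yields results as they complete,
    # so the caller can still wrap the iterator in tqdm for progress.
    # chunksize batches lines per IPC round-trip to cut overhead.
    with Pool(processes) as pool:
        return list(pool.imap(tokenize, lines, chunksize=chunksize))
```

Alternatively the pool could be created once by the caller and passed in, so repeated loads don't pay worker-startup cost each time.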