For larger datasets, tokenizing on a single core is wasteful when many cores are available. I'd suggest wrapping the relevant function in a process pool, or passing a pool in as an argument and calling `Pool.map` (or `Pool.imap`, which streams results and so plays nicely with the `tqdm` progress bar).
Happy to make a PR if it's a good fit for the repo
```python
with tqdm(open(data_path, "r"), desc=f"loading {data_path}") as f:
```
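Roughly what I have in mind, as a sketch: the `tokenize` function below is a placeholder for whatever the repo actually uses per line, and the pool/chunk sizes are illustrative defaults, not tuned values.

```python
from multiprocessing import Pool


def tokenize(line):
    # Placeholder for the repo's real per-line tokenizer; must be a
    # module-level function so it can be pickled and sent to workers.
    return line.strip().split()


def tokenize_lines(lines, processes=4, chunksize=1024):
    # imap preserves input order and yields results as they complete,
    # so the caller can still wrap the iterator in tqdm for progress.
    # chunksize batches lines per IPC round-trip to cut overhead.
    with Pool(processes) as pool:
        return list(pool.imap(tokenize, lines, chunksize=chunksize))
```

Alternatively the pool could be created once by the caller and passed in, so repeated loads don't pay worker-startup cost each time.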