
Reduce memory usage for loading DwC-A file #37

@adityajain07

Description


Suggestion by the IDT team:

You could use a "streaming" approach: read lines from the CSV gradually with an iterator and hand them to pool.imap_unordered() (https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap_unordered) as they come. That way you never need all of the data in memory at once.
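A minimal sketch of that streaming approach, assuming the DwC-A occurrence file is a tab-separated `occurrence.txt` whose `identifier` column holds the image URLs (both names are assumptions, not the repository's actual code):

```python
import csv
import os
from multiprocessing import Pool
from urllib.request import urlretrieve


def download_image(row):
    # "identifier" as the URL column and the images/ output directory are
    # assumptions for this sketch; adjust to the actual occurrence schema.
    url = row.get("identifier")
    if url:
        filename = url.rsplit("/", 1)[-1] or "image.jpg"
        urlretrieve(url, os.path.join("images", filename))
    return url


def iter_rows(csv_path):
    # Yield one row at a time so the full occurrence file never sits in memory.
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            yield row


if __name__ == "__main__":
    os.makedirs("images", exist_ok=True)
    # imap_unordered pulls rows lazily from the generator; only a small
    # buffer of rows (roughly chunksize * number of workers) is in flight.
    with Pool(processes=4) as pool:
        for url in pool.imap_unordered(download_image, iter_rows("occurrence.txt"), chunksize=64):
            pass
```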

A second suggestion, about not requesting so many CPUs:

A download job is network-I/O-bound: it is limited entirely by the Internet connection to the outside world. That is so slow that even one CPU core is far more than enough for this workload, yet the job asks for 64, which means 63+ of those cores are wasted. Likewise, with a streaming I/O approach as suggested above, you should not need more than 10 GB of RAM, so the request for 300 GB is also more than 95% waste. You simply do not need to load all of these URLs into RAM, still less push them into Pandas.
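If only a single core is requested, the same overlap of network waits can be achieved with threads rather than extra processes. A short sketch, reusing the hypothetical `download_image` and `iter_rows` helpers from the example above:

```python
from multiprocessing.pool import ThreadPool

# Downloads spend almost all their time waiting on the network, so threads
# on a single core overlap that waiting just as well as dozens of processes;
# the pool size only needs to cover in-flight requests, not CPU work.
with ThreadPool(processes=16) as pool:
    for url in pool.imap_unordered(download_image, iter_rows("occurrence.txt"), chunksize=64):
        pass
```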
