
Reduce memory usage for loading DwC-A file #37

@adityajain07

Description


Suggestion by the IDT team:

You could use a "streaming" approach: read lines from the CSV gradually with an iterator and hand them to pool.imap_unordered() (https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap_unordered) as they come. That way you never need all of the data in memory at once.
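A minimal sketch of that streaming approach, assuming the DwC-A occurrence file is a tab-separated `occurrence.txt` whose `identifier` column holds the image URLs (both names are assumptions, not the repository's actual code):

```python
import csv
import os
from multiprocessing import Pool
from urllib.request import urlretrieve


def download_image(row):
    # "identifier" as the URL column and the images/ output directory are
    # assumptions for this sketch; adjust to the actual occurrence schema.
    url = row.get("identifier")
    if url:
        filename = url.rsplit("/", 1)[-1] or "image.jpg"
        urlretrieve(url, os.path.join("images", filename))
    return url


def iter_rows(csv_path):
    # Yield one row at a time so the full occurrence file never sits in memory.
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            yield row


if __name__ == "__main__":
    os.makedirs("images", exist_ok=True)
    # imap_unordered pulls rows lazily from the generator; only a small
    # buffer of rows (roughly chunksize * number of workers) is in flight.
    with Pool(processes=4) as pool:
        for url in pool.imap_unordered(download_image, iter_rows("occurrence.txt"), chunksize=64):
            pass
```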

A second suggestion, about not requesting so many CPUs:

A download job is network-I/O-bound: it is limited entirely by the Internet connection to the outside world. That is so slow that even one CPU core is far more than enough for this workload, yet the job asks for 64, which means 63+ of those cores are wasted. Likewise, with a streaming I/O approach as suggested above, you should not need more than 10 GB of RAM, so the request for 300 GB is also more than 95% waste. You simply do not need to load all of these URLs into RAM, still less push them into Pandas.
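If only a single core is requested, the same overlap of network waits can be achieved with threads rather than extra processes. A short sketch, reusing the hypothetical `download_image` and `iter_rows` helpers from the example above:

```python
from multiprocessing.pool import ThreadPool

# Downloads spend almost all their time waiting on the network, so threads
# on a single core overlap that waiting just as well as dozens of processes;
# the pool size only needs to cover in-flight requests, not CPU work.
with ThreadPool(processes=16) as pool:
    for url in pool.imap_unordered(download_image, iter_rows("occurrence.txt"), chunksize=64):
        pass
```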
