Computing and loading full dataset into memory (very slow) #35

@karnawhat

Description

When preprocessing a dataset with many structures (>100), the computed molecular features quickly take up a lot of the available memory, since we have to hold 4 different np.ndarrays for each example. The system then has to constantly free up memory, which significantly slows down the whole process and makes caching the dataset into its respective format take a long time.

To combat this, we should avoid keeping every ndarray in memory and concatenating them into one big ndarray. We can either:

  1. Serialize and save each example as we compute it during the preprocessing step. This might make the code more complex, however.
  2. Chunk the computed dataset into manageable batches and save each batch (see the sketch after this list).
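
A minimal sketch of option 2, flushing a chunk to disk whenever `chunk_size` examples have been computed. The names (`preprocess_in_chunks`, `compute_features`, the `.npz` chunk layout) are illustrative assumptions, not the existing API:

```python
import numpy as np
from pathlib import Path

def preprocess_in_chunks(structures, compute_features, out_dir, chunk_size=1000):
    """Compute features and flush them to disk every `chunk_size` examples,
    so only one chunk of examples is ever held in memory."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    buffer, chunk_idx = [], 0
    for structure in structures:
        buffer.append(compute_features(structure))  # e.g. a tuple of 4 ndarrays
        if len(buffer) == chunk_size:
            _flush_chunk(buffer, out_dir / f"chunk_{chunk_idx:05d}.npz")
            buffer, chunk_idx = [], chunk_idx + 1
    if buffer:  # flush the final, possibly smaller, chunk
        _flush_chunk(buffer, out_dir / f"chunk_{chunk_idx:05d}.npz")

def _flush_chunk(examples, path):
    # Save every array of every example under an indexed key; this avoids
    # assuming the per-example arrays all have the same shape.
    arrays = {
        f"example_{i}_feature_{j}": arr
        for i, example in enumerate(examples)
        for j, arr in enumerate(example)
    }
    np.savez_compressed(path, **arrays)
```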

Similarly, when loading the dataset back into memory, such as when using an InMemorySerializer (see #33), chunk the dataset into batches to reduce memory pressure. Essentially, it should return an iterator that yields one batch at a time. We could also check how much memory is available and, if the dataset cannot be fully loaded into memory, return the iterator instead (with a message saying that we are returning the iterator), as in the sketch below.
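
A rough sketch of the loading side, assuming the chunk layout from the preprocessing sketch above. It uses psutil to check available memory; the safety factor and the use of compressed file size as a lower bound for the in-memory footprint are assumptions:

```python
import numpy as np
import psutil
from pathlib import Path

def load_dataset(chunk_dir, safety_factor=2.0):
    """Load all chunks into memory if they appear to fit; otherwise return a
    lazy iterator over chunks and say so."""
    paths = sorted(Path(chunk_dir).glob("chunk_*.npz"))
    # Compressed size is only a lower bound on the in-memory size, hence the factor.
    total_bytes = sum(p.stat().st_size for p in paths)
    if total_bytes * safety_factor < psutil.virtual_memory().available:
        return list(_iter_chunks(paths))
    print("Dataset does not fit in available memory; returning a chunk iterator instead.")
    return _iter_chunks(paths)

def _iter_chunks(paths):
    # Yield one chunk at a time so only one batch is resident in memory.
    for path in paths:
        with np.load(path) as data:
            yield {key: data[key] for key in data.files}
```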

Do a performance test to determine how much chunking the dataset saves, both in terms of memory usage and time to process the full dataset.
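
For the performance test, a quick way to compare the two code paths could look like this. The function names are the hypothetical ones from the sketches above, and RSS growth is only a coarse proxy for peak memory usage:

```python
import time
import psutil

def profile(fn, *args, **kwargs):
    """Report wall-clock time and resident-memory growth for one call of fn."""
    proc = psutil.Process()
    rss_before = proc.memory_info().rss
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    rss_grown = proc.memory_info().rss - rss_before
    print(f"{fn.__name__}: {elapsed:.1f} s, RSS grew by {rss_grown / 1e6:.1f} MB")
    return result

# e.g. compare the chunked path against the current all-in-memory path:
# profile(preprocess_in_chunks, structures, compute_features, "cache/chunks")
# profile(preprocess_all_in_memory, structures, compute_features)
```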

Metadata
Labels

performance (Performance related issues/improvements), question (Further information is requested)
