Description
When preprocessing a dataset with many structures (>100), the computed molecular features quickly consume a lot of the available memory, since four separate np.ndarrays have to be stored for each example. The system then has to constantly free up memory, which significantly slows down the whole process and makes caching the dataset into its respective format take a long time.
To combat this, we should avoid keeping every ndarray in memory and concatenating them into one big ndarray. We can either:
- Serialize and save each example as we compute it during the preprocessing step. This might make the code more complex, however.
- Chunk the computed dataset into manageable batches and save each batch (see the sketch after this list).
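
A minimal sketch of the chunked approach, assuming a per-example feature function (`compute_features` here is a placeholder, not the project's actual API) and pickle as a stand-in for the project's own serializer:

```python
import os
import pickle

def preprocess_in_chunks(structures, compute_features, out_dir, chunk_size=64):
    """Compute features and flush them to disk every `chunk_size` examples."""
    os.makedirs(out_dir, exist_ok=True)
    buffer, chunk_idx = [], 0
    for structure in structures:
        # compute_features is assumed to return the four per-example ndarrays.
        buffer.append(compute_features(structure))
        if len(buffer) == chunk_size:
            _flush(buffer, out_dir, chunk_idx)
            buffer, chunk_idx = [], chunk_idx + 1
    if buffer:  # flush the final partial chunk
        _flush(buffer, out_dir, chunk_idx)

def _flush(buffer, out_dir, chunk_idx):
    # The project's serializer would replace pickle here; pickle keeps the
    # sketch self-contained and works for arrays of differing shapes.
    path = os.path.join(out_dir, f"chunk_{chunk_idx:05d}.pkl")
    with open(path, "wb") as f:
        pickle.dump(buffer, f)
```

Only one chunk's worth of arrays is ever held in memory at a time, so peak memory stays roughly constant regardless of dataset size.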
Similarly, when loading the dataset back into memory, such as when using an InMemorySerializer (see #33), chunk the dataset into batches to reduce memory pressure. Essentially, it should yield an iterator that can properly iterate through each batch. It could also check how much memory is available and, if the dataset cannot be fully loaded into memory, return the iterator instead (with a message saying that the iterator is being returned).
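
A rough sketch of that loading behavior, assuming the chunk file layout from the sketch above and psutil as an added dependency for the memory check:

```python
import glob
import logging
import os
import pickle

import psutil

logger = logging.getLogger(__name__)

def load_dataset(out_dir, headroom=0.8):
    """Load all chunks if they fit in memory, otherwise return a chunk iterator."""
    chunk_paths = sorted(glob.glob(os.path.join(out_dir, "chunk_*.pkl")))
    total_bytes = sum(os.path.getsize(p) for p in chunk_paths)
    available = psutil.virtual_memory().available * headroom
    if total_bytes <= available:
        # Small enough: materialize everything in memory as one list of examples.
        examples = []
        for path in chunk_paths:
            with open(path, "rb") as f:
                examples.extend(pickle.load(f))
        return examples
    logger.warning(
        "Dataset (%.1f MB) exceeds available memory; returning a chunk iterator.",
        total_bytes / 1e6,
    )
    return iter_chunks(chunk_paths)

def iter_chunks(chunk_paths):
    """Yield one chunk (a list of examples) at a time to bound memory usage."""
    for path in chunk_paths:
        with open(path, "rb") as f:
            yield pickle.load(f)
```

The on-disk size is only a proxy for the in-memory footprint, so the `headroom` factor is deliberately conservative.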
Do a performance test to determine how much chunking the dataset saves, both in terms of memory usage and time to process the full dataset.
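
A simple way to run that comparison (function names below are placeholders for the in-memory and chunked code paths):

```python
import time

import psutil

def benchmark(label, fn, *args, **kwargs):
    """Time a preprocessing run and report the change in resident memory."""
    proc = psutil.Process()
    rss_before = proc.memory_info().rss
    start = time.perf_counter()
    fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    rss_delta = proc.memory_info().rss - rss_before
    print(f"{label}: {elapsed:.1f}s, RSS delta {rss_delta / 1e6:.1f} MB")

# Example usage (structures and paths are placeholders):
# benchmark("in-memory", preprocess_in_memory, structures)
# benchmark("chunked", preprocess_in_chunks, structures, compute_features, "cache/")
```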