pandas groupby is too slow for experiment size datasets #23

@turbach

Problem

epf.py uses pandas groupby operations on epoch_id and time, for instance in QC and center_eeg.

The groupby operations are too slow for use on experiment-sized datasets and need to be replaced, probably with numpy operations.
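For reference, a per-epoch groupby centering could look roughly like the sketch below. This is not the code in epf.py; the function name `center_eeg_groupby`, the column names `epoch_id` and `time`, and the toy data are assumptions made for illustration. It only shows the loop-over-groups pattern that tends to scale poorly as the number of epochs grows.

```python
import numpy as np
import pandas as pd


def center_eeg_groupby(epochs, eeg_streams, start, stop,
                       epoch_id="epoch_id", time="time"):
    """Hypothetical baseline: center each epoch inside a Python-level groupby loop."""
    centered = epochs.copy()
    for _, epoch in epochs.groupby(epoch_id):
        # rows of this epoch that fall in the centering interval [start, stop)
        in_interval = (epoch[time] >= start) & (epoch[time] < stop)
        # one mean per EEG column for this epoch
        interval_mean = epoch.loc[in_interval, eeg_streams].mean()
        # DataFrame - Series subtracts the matching column mean from every row
        centered.loc[epoch.index, eeg_streams] = (
            epoch[eeg_streams] - interval_mean
        ).to_numpy()
    return centered


# toy data: 3 epochs, 5 samples each, 2 EEG columns (all names are made up)
times = np.arange(-2, 3)
demo = pd.DataFrame({
    "epoch_id": np.repeat([0, 1, 2], len(times)),
    "time": np.tile(times, 3),
    "ch1": np.arange(15, dtype=float),
    "ch2": np.arange(15, dtype=float) * 2,
})
baseline = center_eeg_groupby(demo, ["ch1", "ch2"], start=-2, stop=0)
```

Timing a function like this with %%timeit on a realistically sized frame would give a baseline number to compare any replacement against.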

Solution

TBD. Centering operates only on floats, so it only needs the underlying numpy arrays.

Maybe vectorize ... something like this pseudocode for center_eeg:

  • look up rows in each epoch in the centering interval
idxs = np.where((epochs.time >= start) & (epochs.time < stop))[0]
  • slice out the np array of (n_epochs * n_center_times, n_eeg_streams) for the centering interval
center_data = epochs[eeg_streams].to_numpy()[idxs]
  • unstack/reshape the center_data 2D (n_epochs * n_center_times, n_eeg_streams) to 3D (n_epochs, n_center_times, n_eeg_streams)
  • compute epoch mean across times (axis 1) = a 2D array of interval means (n_epochs, n_eeg_streams)
  • np.repeat/tile/broadcast the interval means for each epoch by the number of times per epoch to original dimensions (n_epochs * n_times, n_eeg_streams)

This gives a new 2D array (n_epochs * n_times, n_eeg_streams) where every row of an epoch holds that epoch's centering-interval mean for each eeg_stream.

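# assumes rows are sorted by epoch then time, with the same n_times in every epoch
center_mns = np.repeat(center_data.reshape(n_epochs, n_center_times, n_eeg_streams).mean(axis=1), n_times, axis=0)
assert center_mns.shape == epochs[eeg_streams].shape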

Centering the epochs by the mean of the centering interval is then a one-line subtraction:

epochs[eeg_streams] = epochs[eeg_streams] - center_mns
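Putting the steps above together, a minimal runnable sketch might look like the following. It is an untested-at-scale illustration, not the epf.py implementation: the function name `center_eeg_vectorized`, the column names `epoch_id` and `time`, and the toy data are assumptions. It also assumes rows are sorted by epoch and then time, and that every epoch has the same number of samples; under that layout `np.repeat` along axis 0 (rather than `np.tile`) puts each epoch's mean back on that epoch's rows.

```python
import numpy as np
import pandas as pd


def center_eeg_vectorized(epochs, eeg_streams, start, stop,
                          epoch_id="epoch_id", time="time"):
    """Subtract each epoch's mean over [start, stop) without a groupby.

    Assumes rows are sorted by epoch then time and that all epochs
    share the same number of samples.
    """
    n_epochs = epochs[epoch_id].nunique()
    n_times = len(epochs) // n_epochs
    assert n_epochs * n_times == len(epochs), "epochs differ in length"

    # (n_epochs * n_times, n_eeg_streams) float array
    data = epochs[eeg_streams].to_numpy(dtype=float)

    # boolean mask of rows inside the centering interval
    in_interval = ((epochs[time] >= start) & (epochs[time] < stop)).to_numpy()
    center_data = data[in_interval]
    n_center_times = center_data.shape[0] // n_epochs
    assert n_epochs * n_center_times == center_data.shape[0]

    # 2D -> 3D, then mean across the time axis: (n_epochs, n_eeg_streams)
    interval_means = center_data.reshape(n_epochs, n_center_times, -1).mean(axis=1)

    # repeat each epoch's row of means n_times times, back to the original shape
    center_mns = np.repeat(interval_means, n_times, axis=0)
    assert center_mns.shape == data.shape

    centered = epochs.copy()
    centered[eeg_streams] = data - center_mns
    return centered


# toy data: 3 epochs, 5 samples each, 2 EEG columns (all names are made up)
times = np.arange(-2, 3)
demo = pd.DataFrame({
    "epoch_id": np.repeat([0, 1, 2], len(times)),
    "time": np.tile(times, 3),
    "ch1": np.arange(15, dtype=float),
    "ch2": np.arange(15, dtype=float) * 2,
})
centered = center_eeg_vectorized(demo, ["ch1", "ch2"], start=-2, stop=0)
```

The two assertions on epoch length are there because the reshape silently depends on equal-length epochs; ragged epochs would need a different approach.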

Run %%timeit to see if this helps; if not, find something that does.
