nested multiprocessing in notebooks with mpire #29

@sergpolly

Consider demonstrating parallel execution of some of the cooltools API functions for multiple samples - i.e. when an API function itself uses multiprocessing and we want to run several such calls at once in a notebook ...

If one has a big multicore system (16 real cores or more), it is easy to run several CLI tasks in parallel for multiple samples, where each task itself uses several cores - i.e. is multiprocessed. Very often such multiprocessed operations do not scale well beyond 8-12 processes - so it is indeed more "economical" to process multiple samples at once with fewer cores each.
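
The trade-off above can be sketched as a small core-budget helper. This is only an illustration: `plan_parallel_jobs` and its arguments are hypothetical names, not part of cooltools or MPIRE, and the numbers in the example are placeholders for a large workstation.

```python
def plan_parallel_jobs(total_cores: int, nproc_per_task: int, n_samples: int) -> int:
    """Return how many samples to process at once without oversubscribing cores.

    Hypothetical helper: it assumes every task keeps nproc_per_task cores busy.
    """
    if min(total_cores, nproc_per_task, n_samples) < 1:
        raise ValueError("all arguments must be positive integers")
    # Each concurrent task occupies nproc_per_task cores.
    fits = max(1, total_cores // nproc_per_task)
    # There is no point in starting more workers than there are samples.
    return min(fits, n_samples)


# Example: 112 hardware threads, 12 processes per task, 16 samples.
n_jobs = plan_parallel_jobs(total_cores=112, nproc_per_task=12, n_samples=16)
print(n_jobs)  # 9, since 9 * 12 = 108 <= 112
```

Rounding the result down further leaves headroom for the notebook kernel and other users of the machine.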

Now - what if we want to achieve the same in a notebook? It is not trivial, because multiprocess does not allow nesting (the way we typically use it, out of the box). It can, however, be done easily with MPIRE https://github.com/sybrenjansen/mpire , which allows running multiple multiprocessed tasks in parallel and whose API is very similar to that of multiprocess ... Check it out:

mpire test:

import cooltools
from mpire import WorkerPool

# clrs: dictionary of several coolers keyed by sample name (defined earlier)
# hg38_arms: view dataframe of chromosome arms (defined earlier)
exp_kwargs = dict(view_df=hg38_arms, nproc=12)

def _job(sample):
    _clr = clrs[sample]
    _exp = cooltools.expected_cis(_clr, **exp_kwargs)
    return (sample, _exp)

# have to use daemon=False, because _job is multiprocessing-based already ...
# trying to run 8 samples in parallel, each using 12 processes - 8*12=96
with WorkerPool(n_jobs=8, daemon=False) as wpool:
    # iterate over the sample names, i.e. the keys of clrs
    results = wpool.map(_job, list(clrs), progress_bar=True)

# sort out the results ...
exps = {sample: _exp for sample, _exp in results}

# this takes ~1 min for 16 coolers @ 25kb on 56-core system (112 thread)
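
Why daemon=False is needed can be seen with a small standalone sketch. It uses the standard-library multiprocessing as a stand-in for multiprocess, assumes a Unix-like OS where the "fork" start method is available, and the helper names (`_noop`, `_try_to_spawn`, `try_nested`) are hypothetical. A daemonic process is refused permission to start children, which is exactly what a multiprocessed job inside a daemonic worker would try to do.

```python
import multiprocessing as mp


def _noop():
    """Stand-in for the inner, multiprocessed work."""


def _try_to_spawn(queue):
    """Run inside the outer process: try to start a child and report the outcome."""
    try:
        child = mp.get_context("fork").Process(target=_noop)
        child.start()  # raises AssertionError when the current process is daemonic
        child.join()
        queue.put("child started")
    except Exception as err:  # report anything so the parent never waits forever
        queue.put(f"{type(err).__name__}: {err}")


def try_nested(daemon):
    """Start an outer process (daemonic or not) that tries to spawn its own child."""
    ctx = mp.get_context("fork")  # assumption: "fork" exists on this platform
    queue = ctx.SimpleQueue()
    outer = ctx.Process(target=_try_to_spawn, args=(queue,), daemon=daemon)
    outer.start()
    message = queue.get()
    outer.join()
    return message


if __name__ == "__main__":
    print(try_nested(daemon=True))   # reports the "daemonic processes" AssertionError
    print(try_nested(daemon=False))  # child started
```

With daemon=False the outer process is allowed to have children, so the nested call goes through.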

one-by-one using a ton of cores per task:

exp_kwargs = dict(view_df=hg38_arms, nproc=112)
exps = {}
for sample, _clr in clrs.items():
    print(f"calculating expected for {sample} ...")
    exps[sample] = cooltools.expected_cis(_clr, **exp_kwargs)

# this takes > 2 mins, and shows no time improvements after nproc=32 ...
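
To find where the scaling flattens out on a given machine, a rough timing scan along these lines may help. It is a sketch only: `scan_nproc`, `run` and `min_gain` are hypothetical names, and `run(nproc)` is assumed to wrap one call of the function of interest, e.g. `cooltools.expected_cis` on a single cooler with that `nproc`.

```python
import time


def scan_nproc(run, nproc_values, min_gain=0.10):
    """Time run(nproc) for each value and report where extra processes stop paying off.

    Returns (timings, best_nproc): timings maps nproc to elapsed seconds, and
    best_nproc is the largest value that was still at least min_gain (as a
    fraction) faster than the previous best.
    """
    timings = {}
    best_nproc, best_time = None, float("inf")
    for nproc in sorted(nproc_values):
        start = time.perf_counter()
        run(nproc)
        elapsed = time.perf_counter() - start
        timings[nproc] = elapsed
        # Accept the larger nproc only if it gives a clear improvement.
        if elapsed < best_time * (1.0 - min_gain):
            best_nproc, best_time = nproc, elapsed
    return timings, best_nproc


# Toy workload standing in for the real call: it stops speeding up beyond 4 processes.
def _fake_run(nproc):
    time.sleep(0.2 / min(nproc, 4))


timings, best = scan_nproc(_fake_run, [1, 2, 4, 8])
print(best)  # 4
```

A single timing per value is noisy, so repeating each measurement a few times would give a more trustworthy picture.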

This has limited application - it only helps projects with many samples and people with big workstations - but when both criteria are met, the speed-up is much appreciated.
