Document parallelization tuning and automatic sparse diagonalization #36
Open
kochjens wants to merge 10 commits into
- Add MULTIPROC_BLAS_THREADS to the user-accessible settings table.
- Extend the Parallel Processing guide with a "Limiting BLAS threads per worker" section (programmatic cap, threadpoolctl requirement for fork-based workers, no-op cases incl. Apple Accelerate, scipy-OpenBLAS caveat) and a "Worker-pool reuse" section.
- Add a changelog entry for MULTIPROC_BLAS_THREADS, pool reuse, and the per-grid-point bare-eigensystem shipping in ParameterSweep.
Mirror the expectation-setting from the rewritten demo_multiprocessing notebook: parallelization helps only when grid size × per-point cost exceeds the per-task overhead, so small or cheap sweeps see no speedup (or even a slowdown), and that is normal. Note the two usual causes of confusing num_cpus comparisons (running below break-even, or BLAS threads not capped), and that sparse diagonalization is often a bigger lever.
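The break-even argument can be sketched as a toy cost model. This is only an illustration under stated assumptions, not the heuristic scqubits uses: the function name `estimate_speedup` and the parameters `overhead_per_point` and `startup_seconds` are hypothetical, and the numbers in the usage lines are made up rather than measured.

```python
def estimate_speedup(num_points, seconds_per_point, num_cpus,
                     overhead_per_point=0.0, startup_seconds=0.0):
    """Toy estimate of serial_time / parallel_time for a parameter sweep.

    Assumes (hypothetically) that every grid point costs the same, that work
    is split evenly across workers, and that parallel execution adds a fixed
    startup cost plus a per-point dispatch overhead.
    """
    if num_points < 1 or num_cpus < 1:
        raise ValueError("num_points and num_cpus must be at least 1")
    serial_time = num_points * seconds_per_point
    # Ceiling division: the busiest worker handles this many points.
    points_per_worker = -(-num_points // num_cpus)
    parallel_time = startup_seconds + points_per_worker * (
        seconds_per_point + overhead_per_point
    )
    return serial_time / parallel_time


if __name__ == "__main__":
    # Large, expensive sweep: well above break-even (ratio greater than 1).
    print(estimate_speedup(1000, 0.1, 4, overhead_per_point=0.001, startup_seconds=0.5))
    # Small, cheap sweep: overhead dominates (ratio less than 1, a slowdown).
    print(estimate_speedup(20, 0.0001, 4, overhead_per_point=0.001, startup_seconds=0.5))
```

A ratio below 1 in this model corresponds to the "no speedup or a slowdown" case described above; it says nothing about the actual overheads on a given machine.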
…TART_METHOD
- Add a "Process start method (fork vs spawn)" section to the Parallel Processing guide: the platform defaults (fork on Linux, spawn on macOS/Windows), why fork is unsafe on macOS (both Intel and Apple Silicon), the __main__-guard requirement for plain scripts under spawn, the once-per-session (not per-sweep) cost thanks to pool reuse, the override, and the orphaned-worker note.
- Add MULTIPROC_START_METHOD to the user-accessible settings table.
- Add a 4.3.2 changelog entry.
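For readers unfamiliar with the __main__-guard requirement, here is a minimal standard-library sketch. It uses plain `multiprocessing` rather than scqubits, forces the spawn start method so it behaves the same on every platform, and the name `run_sweep` is hypothetical. It is meant to be saved and run as a script.

```python
import math
import multiprocessing as mp


def run_sweep(values, num_workers=2):
    """Map math.sqrt over `values` using a pool of spawn-started workers."""
    ctx = mp.get_context("spawn")  # explicit spawn, independent of platform default
    with ctx.Pool(processes=num_workers) as pool:
        return pool.map(math.sqrt, values)


if __name__ == "__main__":
    # Under spawn, each worker re-imports this script. Without the guard,
    # that import would reach the line below again and try to start new
    # pools from inside the workers, which Python rejects with an error.
    print(run_sweep([1.0, 4.0, 9.0]))
```

Everything that starts worker processes belongs under the guard; function definitions and imports stay at module level so the workers can find them.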
…-determined
The start method is no longer a public setting: the only safe choice per platform is fork on Linux and spawn on macOS/Windows. Remove the settings row, rewrite the parallel-guide section to present the start method as platform-determined rather than configurable, delete the override code cell, and fix the stale "macOS forks" bullet in the BLAS-cap section.
…elization
- Rewrite the parallel guide around the shipped tuning API: a new "Letting scqubits choose the settings" section covering recommend_parallelization, the num_cpus="auto" sentinel, settings.AUTO_PARALLEL, and the one-time calibrate_parallelization(); drop the dead tools/autotune_multiprocessing.py reference.
- Add AUTO_PARALLEL and PARALLEL_CALIBRATION_PATH to the settings table, and changelog entries.
The three multiprocessing/diagonalization PRs release together, so the docs cover the sparse-diag default alongside the parallelization work already on this branch.
- custom_diagonalization notebook: new "Automatic sparse diagonalization" section explaining the default-on behavior (esys/evals_method = None -> sparse eigsh for large spectra with few eigenvalues), the dim/evals gating, the dense fallback on raise or residual-check failure, and how to disable.
- settings guide: AUTO_SPARSE_DIAG row added to the settings table.
- changelog (4.3.2): automatic sparse diagonalization entry.
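The gating-plus-fallback pattern described above can be sketched with SciPy directly. This is not the scqubits implementation: the thresholds `MIN_DIM`, `MAX_EVALS_FRACTION`, and `RESIDUAL_TOL`, and the helpers `example_hamiltonian` and `lowest_eigenvalues`, are hypothetical names and values chosen only for illustration.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

# Hypothetical gating thresholds, chosen for illustration only.
MIN_DIM = 200              # below this dimension, always diagonalize densely
MAX_EVALS_FRACTION = 0.1   # sparse only if few eigenvalues are requested
RESIDUAL_TOL = 1e-8        # acceptance threshold for the residual check


def example_hamiltonian(dim):
    """Sparse symmetric test matrix: a linear ladder with weak coupling."""
    main = np.arange(dim, dtype=float)
    off = np.full(dim - 1, 0.1)
    return diags([main, off, off], [0, -1, 1], format="csr")


def lowest_eigenvalues(matrix, evals_count):
    """Return (eigenvalues, method), trying sparse eigsh before dense."""
    dim = matrix.shape[0]
    if dim >= MIN_DIM and evals_count <= MAX_EVALS_FRACTION * dim:
        try:
            evals, evecs = eigsh(matrix, k=evals_count, which="SA")
            order = np.argsort(evals)
            evals, evecs = evals[order], evecs[:, order]
            # Residual check: largest column norm of H v - lambda v.
            residual = np.linalg.norm(matrix @ evecs - evecs * evals, axis=0).max()
            if residual <= RESIDUAL_TOL * max(1.0, np.abs(evals).max()):
                return evals, "sparse"
        except Exception:
            pass  # any solver error falls through to the dense path
    # Dense fallback: full spectrum, then keep the lowest few eigenvalues.
    return np.linalg.eigvalsh(matrix.toarray())[:evals_count], "dense"


if __name__ == "__main__":
    print(lowest_eigenvalues(example_hamiltonian(400), 4))
    print(lowest_eigenvalues(example_hamiltonian(50), 4))
```

The dense path is always available as a reference, so a failed or rejected sparse solve costs extra time but does not change the result.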
MULTIPROC_BLAS_THREADS now defaults to "auto" (cap each worker to cores // num_cpus), so parallel sweeps no longer oversubscribe the cores out of the box. Update the docs to match:
- parallel.ipynb no longer opens with the manual "export thread-count env vars before importing numpy" workaround; it states that capping is automatic and documents the "auto"/int/None values. The break-even diagnostic no longer lists oversubscription as a default failure mode.
- guide-settings table and changelog: default "auto", and threadpoolctl is now a runtime dependency rather than optional.
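The "auto" rule amounts to integer division of the core count by the worker count. A minimal sketch of how the three setting values might resolve to a per-worker cap follows. The helper name `resolve_blas_cap` is hypothetical, and two behaviors are assumptions rather than documented scqubits behavior: the floor of one thread when there are more workers than cores, and treating None as "no cap".

```python
import os


def resolve_blas_cap(setting, num_cpus, total_cores=None):
    """Resolve a BLAS-thread setting to a per-worker thread cap.

    setting: "auto", a positive int, or None (assumed here to mean no cap).
    """
    if num_cpus < 1:
        raise ValueError("num_cpus must be at least 1")
    if setting is None:
        return None  # assumption: None leaves the BLAS library uncapped
    if setting == "auto":
        if total_cores is None:
            total_cores = os.cpu_count() or 1
        # cores // num_cpus, floored at 1 (the floor is an assumption).
        return max(1, total_cores // num_cpus)
    if isinstance(setting, int) and setting >= 1:
        return setting
    raise ValueError(f"unsupported setting: {setting!r}")


if __name__ == "__main__":
    print(resolve_blas_cap("auto", num_cpus=4, total_cores=8))
    print(resolve_blas_cap(3, num_cpus=4, total_cores=8))
    print(resolve_blas_cap(None, num_cpus=4, total_cores=8))
```

With 8 cores and 4 workers, "auto" gives each worker 2 BLAS threads, so the total stays at the core count instead of 4 workers each spawning 8 threads.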
… model, calibration guidance
- New top section 'Updating scqubits -- what changes for you?': reassures that existing code is unchanged, explains the automatic BLAS-thread cap, and gives a do-this table + checklist for getting more speed (auto / calibrate / AUTO_PARALLEL).
- Rewrote 'Letting scqubits choose the settings' with a 30-second mental model (the switch = num_cpus vs the map = calibration data), an explicit num_cpus='auto' vs AUTO_PARALLEL precedence block, and expanded calibration guidance: re-running overwrites the file, and calibrate on an idle, plugged-in machine (battery/CPU-throttling skews the measurements).
Documents the parallelization and diagonalization changes shipping in this release.
What changes
- num_cpus > 1 helps (the grid break-even), recommend_parallelization / num_cpus="auto" / AUTO_PARALLEL, calibrate_parallelization, the MULTIPROC_BLAS_THREADS cap, worker-pool reuse, the platform-determined fork/spawn start method, and the __main__ guard.
- MULTIPROC_BLAS_THREADS, AUTO_PARALLEL, PARALLEL_CALIBRATION_PATH, AUTO_SPARSE_DIAG.

Pairs with the corresponding scqubits code PRs (multiprocessing core, automatic sparse diagonalization, parallelization heuristic + calibration) and the demo_multiprocessing examples notebook.