298 changes: 298 additions & 0 deletions examples/demo_multiprocessing.ipynb
@@ -0,0 +1,298 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Letting scqubits choose your multiprocessing settings\n",
"\n",
"J. Koch and P. Groszkowski\n",
"\n",
"For further documentation of scqubits see https://scqubits.readthedocs.io/en/latest/.\n",
"\n",
"---\n",
"\n",
"A `ParameterSweep` can spread its grid points across worker processes through the `num_cpus`\n",
"argument. The catch is that the *right* number of workers — and the right number of BLAS\n",
"threads each worker should use — depends on the problem and the machine. Guess wrong and you\n",
"get no speedup, or, when the workers oversubscribe the cores, a slowdown of **one to two\n",
"orders of magnitude**.\n",
"\n",
"So scqubits can choose for you. This notebook shows the easy path —\n",
"`recommend_parallelization` and `num_cpus=\"auto\"` — and a one-time per-machine calibration,\n",
"and then explains what is being balanced under the hood."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"import scqubits as scq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A system to sweep\n",
"\n",
"We use three capacitively coupled tunable transmons and sweep the flux of the first. The\n",
"dressed Hilbert space has dimension `6**3 = 216`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def build_hilbertspace():\n",
"    qubits = [\n",
"        scq.TunableTransmon(\n",
"            EJmax=30.0, EC=0.2, d=0.1, flux=0.0, ng=0.0, ncut=50,\n",
"            truncated_dim=6, id_str=f\"tmon{i}\",\n",
"        )\n",
"        for i in range(3)\n",
"    ]\n",
"    hs = scq.HilbertSpace(qubits)\n",
"    for i in range(2):\n",
"        hs.add_interaction(\n",
"            g_strength=0.1, op1=qubits[i].n_operator, op2=qubits[i + 1].n_operator\n",
"        )\n",
"\n",
"    def update(flux):\n",
"        qubits[0].flux = flux\n",
"\n",
"    return hs, update\n",
"\n",
"\n",
"hs, update = build_hilbertspace()\n",
"print(\"dressed dimension:\", hs.dimension)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The easy way: let scqubits choose\n",
"\n",
"`scq.recommend_parallelization` reads the workload (Hilbert-space dimension, number of grid\n",
"points, eigenvalue count, and whether sparse diagonalization applies) and returns a\n",
"recommended `num_cpus` together with a per-worker BLAS-thread cap. It is a *pure* function: it\n",
"starts no worker processes, so it is safe to call anywhere, and it does not run the sweep.\n",
"\n",
"Because a `ParameterSweep` runs the moment it is constructed, call\n",
"`recommend_parallelization` *before* building the sweep. Here are its choices for our system\n",
"at two grid sizes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for n_points in (16, 384):\n",
"    cfg = scq.recommend_parallelization(\n",
"        hilbertspace=hs, num_points=n_points, evals_count=20\n",
"    )\n",
"    print(f\"{n_points:>4} points -> num_cpus={cfg.num_cpus}, blas_threads={cfg.blas_threads}\")\n",
"    print(f\" {cfg.reason}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A 16-point sweep stays serial, because there are too few points to repay the cost of starting\n",
"and feeding worker processes, while the 384-point sweep is spread across workers, each capped\n",
"to a single BLAS thread so they do not oversubscribe the cores.\n",
"\n",
"To apply the recommendation without copying numbers by hand, pass the sentinel\n",
"`num_cpus=\"auto\"` to the sweep. The choice is then made automatically, *before* the sweep\n",
"runs:\n",
"\n",
"```python\n",
"sweep = scq.ParameterSweep(..., num_cpus=\"auto\")\n",
"```\n",
"\n",
"To make *every* sweep that does not specify `num_cpus` tune itself this way, set\n",
"`scq.settings.AUTO_PARALLEL = True`.\n",
"\n",
"Both routes use the same auto-tuner and differ only in reach: `num_cpus=\"auto\"` opts in for\n",
"one sweep, while `AUTO_PARALLEL = True` covers every sweep that omits `num_cpus`. An explicit\n",
"number always wins:\n",
"\n",
"```text\n",
"num_cpus=4        -> exactly 4 workers (you decide)\n",
"num_cpus=\"auto\"   -> auto-tuner decides (always)\n",
"num_cpus omitted  -> auto-tuner if AUTO_PARALLEL=True, else scq.settings.NUM_CPUS (serial by default)\n",
"```"
]
},
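{
"cell_type": "markdown",
"metadata": {},
"source": [
"The precedence rules above can be sketched as a small resolver. This is only an illustration,\n",
"not scqubits' implementation: `resolve_num_cpus`, its arguments, and `pretend_tuner` are\n",
"hypothetical names introduced here, and the value `8` is a stand-in for a real recommendation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical helper mirroring the precedence table (not part of scqubits).\n",
"def resolve_num_cpus(num_cpus=None, auto_parallel=False, default_num_cpus=1, tuner=None):\n",
"    if isinstance(num_cpus, int):\n",
"        return num_cpus  # an explicit number always wins\n",
"    if num_cpus == \"auto\" or (num_cpus is None and auto_parallel):\n",
"        return tuner()  # the auto-tuner decides\n",
"    if num_cpus is None:\n",
"        return default_num_cpus  # configured default, serial unless changed\n",
"    raise ValueError(f\"unsupported num_cpus: {num_cpus!r}\")\n",
"\n",
"\n",
"def pretend_tuner():\n",
"    # Stand-in for the real recommendation; the number is made up.\n",
"    return 8\n",
"\n",
"\n",
"print(resolve_num_cpus(4, tuner=pretend_tuner))                          # explicit\n",
"print(resolve_num_cpus(\"auto\", tuner=pretend_tuner))                     # per-sweep opt-in\n",
"print(resolve_num_cpus(None, auto_parallel=True, tuner=pretend_tuner))   # global opt-in\n",
"print(resolve_num_cpus(None, tuner=pretend_tuner))                       # default"
]
},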
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The automatic choice changes only *how* a sweep is computed, never the result. We confirm\n",
"that by running the same 384-point sweep both serially and with `num_cpus=\"auto\"`, and\n",
"comparing the spectra:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"flux_vals = np.linspace(0.0, 0.5, 384)\n",
"\n",
"serial = scq.ParameterSweep(\n",
" hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n",
" update_hilbertspace=update, evals_count=20, num_cpus=1,\n",
")\n",
"auto = scq.ParameterSweep(\n",
" hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n",
" update_hilbertspace=update, evals_count=20, num_cpus=\"auto\",\n",
")\n",
"print(\"spectra identical:\", np.allclose(serial[\"evals\"][:], auto[\"evals\"][:]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tune to your machine (optional, run once)\n",
"\n",
"The recommendation above uses conservative built-in thresholds that work everywhere. For choices tuned to *your* hardware, run the one-time calibration, which takes about a minute. It times a short battery of sweeps in isolated subprocesses to measure this machine's per-task dispatch overhead, its one-time pool-startup cost, and its per-point diagonalization cost (dense and sparse), then writes the results to `~/.scqubits/parallel_calibration.json`.\n",
"\n",
"The calibration is **just data**: it does nothing on its own. From then on, every `recommend_parallelization` / `num_cpus=\"auto\"` call reads that file and makes a sharper, machine-specific choice instead of using the generic defaults.\n",
"\n",
"**Re-running overwrites the file, so recalibrate freely.** *Do* recalibrate if a previous run was taken under bad conditions: while the machine was busy, or on a laptop that was on battery or otherwise CPU-throttled. Many laptops clock down hard when unplugged, which makes the calibration over-estimate every cost, so `\"auto\"` then under-parallelizes. For the most representative numbers, calibrate on an otherwise-idle machine plugged into wall power.\n",
"\n",
"Because it launches its measurements as `python -m` subprocesses, the call needs no `if __name__ == \"__main__\":` guard, in Jupyter or in a plain script."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"scq.calibrate_parallelization()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the calibration exists, `recommend_parallelization` (and `num_cpus=\"auto\"`) use the\n",
"measured break-even — parallelizing only once the grid is large enough to repay the measured\n",
"pool-startup cost. Re-running the recommendation now reflects your machine:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for n_points in (16, 384):\n",
"    cfg = scq.recommend_parallelization(\n",
"        hilbertspace=hs, num_points=n_points, evals_count=20\n",
"    )\n",
"    print(f\"{n_points:>4} points -> num_cpus={cfg.num_cpus} ({cfg.reason})\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What scqubits is balancing for you\n",
"\n",
"There is no single right answer because two effects pull in opposite directions.\n",
"\n",
"**1. The grid break-even.** Sending a grid point to a worker costs a fixed amount (pickling,\n",
"inter-process hand-off), and starting the pool costs a one-time amount (about a second when\n",
"workers are *spawned*, as on macOS and Windows). Parallelism pays off only when\n",
"\n",
"> (number of grid points) × (cost per point) ≫ that fixed overhead.\n",
"\n",
"Few points, or cheap points, stay faster serially, which is why the heuristic keeps small\n",
"sweeps on a single process.\n",
"\n",
"**2. The BLAS oversubscription cliff.** Every eigensolve already runs on a multithreaded BLAS\n",
"backend. If several workers each use all cores, the cores are oversubscribed by a factor of\n",
"`num_cpus`, which on large dense matrices is not a small slowdown but a collapse. Measured on\n",
"a 10-core Mac mini with five capacitively coupled fluxonia (dressed dimension 3125, dense)\n",
"and a 16-point sweep:\n",
"\n",
"| configuration | wall time |\n",
"|---|---:|\n",
"| `num_cpus=1` | 42 s |\n",
"| `num_cpus=4`, BLAS uncapped | **3608 s** (~90x slower than capped) |\n",
"| `num_cpus=4`, BLAS capped to 1 | 40 s |\n",
"| `num_cpus=8`, BLAS capped to 1 | 28 s |\n",
"\n",
"This is why the recommendation always pairs a worker count with a BLAS-thread cap, keeping\n",
"`num_cpus x BLAS-threads` near the core count."
]
},
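{
"cell_type": "markdown",
"metadata": {},
"source": [
"The break-even condition above can be put into a toy cost model. This is a sketch under\n",
"stated assumptions, not the formula scqubits uses: `estimated_times` is a hypothetical\n",
"helper, and the timings passed to it are illustrative values rather than measurements."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy cost model; every number below is illustrative, not measured.\n",
"def estimated_times(n_points, cost_per_point, num_cpus, pool_startup, dispatch_per_point):\n",
"    # Serial: every point is diagonalized one after another.\n",
"    serial = n_points * cost_per_point\n",
"    # Parallel: one-time pool startup, plus per-point dispatch overhead,\n",
"    # plus the diagonalization work divided among the workers.\n",
"    parallel = pool_startup + n_points * (dispatch_per_point + cost_per_point / num_cpus)\n",
"    return serial, parallel\n",
"\n",
"\n",
"# Assumed values: 50 ms per point, 1 s pool startup, 5 ms dispatch, 4 workers.\n",
"for n_points in (16, 384):\n",
"    serial, parallel = estimated_times(n_points, 0.05, 4, 1.0, 0.005)\n",
"    choice = \"parallel\" if parallel < serial else \"serial\"\n",
"    print(f\"{n_points:>4} points: serial {serial:.2f} s, parallel {parallel:.2f} s -> {choice}\")"
]
},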
{
"cell_type": "markdown",
"metadata": {},
"source": "## Manual control\n\nThe same knobs are available directly if you would rather set them yourself:\n\n```python\nscq.settings.NUM_CPUS = 4 # default worker count when num_cpus is unset\nscq.settings.MULTIPROC_BLAS_THREADS = 1 # per-worker BLAS-thread cap during a sweep\n```\n\n`MULTIPROC_BLAS_THREADS` already defaults to `\"auto\"`, which caps each worker to\n`cores // num_cpus` so parallel sweeps never oversubscribe the cores — setting an integer\njust overrides that with a fixed cap (use `None` to opt out entirely). Rule of thumb: keep\n`num_cpus x BLAS-threads` near the number of physical cores. The cap reaches spawn-based\nworkers (macOS, Windows) through the thread-count environment variables, and fork-based\nworkers (Linux) through `threadpoolctl`.\n\nFor large composite systems the per-point **diagonalization method** is often a bigger lever\nthan parallelism: scqubits uses sparse diagonalization by default for large spectra (see\n`scq.settings.AUTO_SPARSE_DIAG`), which can be far faster per point. Once each point is cheap,\nparallelism helps even less — so try sparse first, and parallelize second."
},
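{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rule of thumb and the two capping routes can be sketched with generic tools. This is an\n",
"illustration only: `blas_cap` and `capped_environment` are hypothetical helpers, the list of\n",
"environment variables is an assumption about which backends are in play, and none of it is\n",
"taken from the scqubits source. The `threadpoolctl` part runs only if that package is installed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import numpy as np\n",
"\n",
"# Thread-count variables read by common BLAS/OpenMP backends at import time.\n",
"# Which of these scqubits sets is an assumption of this sketch.\n",
"THREAD_ENV_VARS = (\n",
"    \"OMP_NUM_THREADS\",\n",
"    \"OPENBLAS_NUM_THREADS\",\n",
"    \"MKL_NUM_THREADS\",\n",
"    \"VECLIB_MAXIMUM_THREADS\",\n",
")\n",
"\n",
"\n",
"def blas_cap(num_cpus, cores=None):\n",
"    # Hypothetical helper: keep num_cpus x BLAS-threads near the core count.\n",
"    cores = cores or os.cpu_count() or 1\n",
"    return max(1, cores // num_cpus)\n",
"\n",
"\n",
"def capped_environment(n_threads):\n",
"    # Copy of the current environment with the thread caps applied; a process\n",
"    # started with this environment loads its BLAS backend already capped.\n",
"    env = dict(os.environ)\n",
"    env.update({name: str(n_threads) for name in THREAD_ENV_VARS})\n",
"    return env\n",
"\n",
"\n",
"cap = blas_cap(num_cpus=4, cores=10)\n",
"env = capped_environment(cap)\n",
"print(\"cap:\", cap, {name: env[name] for name in THREAD_ENV_VARS})\n",
"\n",
"# In an already-running process the limit can be changed with threadpoolctl,\n",
"# if it is installed.\n",
"try:\n",
"    from threadpoolctl import threadpool_limits\n",
"except ImportError:\n",
"    threadpool_limits = None\n",
"\n",
"if threadpool_limits is not None:\n",
"    with threadpool_limits(limits=cap, user_api=\"blas\"):\n",
"        print(np.linalg.eigvalsh(np.eye(3)))"
]
},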
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running as a script\n",
"\n",
"In Jupyter, everything above runs as shown. In a plain Python script on macOS or Windows,\n",
"workers are *spawned*, which re-imports your script in each worker, so the entry point that\n",
"triggers a parallel sweep must be guarded:\n",
"\n",
"```python\n",
"import scqubits as scq\n",
"\n",
"if __name__ == \"__main__\":\n",
"    sweep = scq.ParameterSweep(..., num_cpus=\"auto\")\n",
"```\n",
"\n",
"Linux (which forks) and Jupyter need no guard. scqubits prints a one-time reminder the first\n",
"time it spawns workers outside of IPython."
]
},
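{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why the guard matters can be reproduced with the standard library alone. The script below is\n",
"a minimal sketch, not scqubits code; `work` is a made-up function. Save it to a file and run\n",
"it with `python`, rather than pasting it into a notebook cell, because spawned workers must\n",
"be able to import `work` from a module.\n",
"\n",
"```python\n",
"import multiprocessing as mp\n",
"\n",
"\n",
"def work(x):\n",
"    # Defined at module level so that spawned workers can import it.\n",
"    return x * x\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
"    # Under \"spawn\", each worker re-imports this file. Without the guard,\n",
"    # the pool creation below would be attempted again in every worker.\n",
"    ctx = mp.get_context(\"spawn\")\n",
"    with ctx.Pool(2) as pool:\n",
"        print(pool.map(work, range(4)))\n",
"```\n",
"\n",
"Requesting the `\"spawn\"` context explicitly makes the behaviour the same on Linux, which is\n",
"convenient for testing a script before running it on macOS or Windows."
]
},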
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"- **Let scqubits choose.** `scq.recommend_parallelization(...)` recommends `num_cpus` and a\n",
" BLAS-thread cap from the workload; `num_cpus=\"auto\"` applies it per sweep, and\n",
" `scq.settings.AUTO_PARALLEL = True` applies it everywhere.\n",
"- **Calibrate once** with `scq.calibrate_parallelization()` for advice measured on your own\n",
" hardware.\n",
"- Parallelism helps only when the grid is large enough to repay the fixed overhead, and the\n",
" BLAS-thread cap is what keeps many workers from oversubscribing the cores.\n",
"- All of this takes effect live, without restarting the kernel, in Jupyter and in scripts\n",
" alike."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 4
}