From 6f6121a5ed04c93a9fe2865e1fd681bd9c4a8308 Mon Sep 17 00:00:00 2001 From: Jens Koch Date: Sun, 31 May 2026 17:39:32 -0500 Subject: [PATCH 1/8] examples: add demo_multiprocessing notebook Demonstrates efficient parameter sweeps: the interaction between num_cpus and BLAS/LAPACK threads, capping BLAS threads before import, and timing num_cpus on the user's own machine. Cells are left unexecuted; timings are machine-specific. Co-Authored-By: Claude Opus 4.8 --- examples/demo_multiprocessing.ipynb | 220 ++++++++++++++++++++++++++++ 1 file changed, 220 insertions(+) create mode 100644 examples/demo_multiprocessing.ipynb diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb new file mode 100644 index 0000000..947db0b --- /dev/null +++ b/examples/demo_multiprocessing.ipynb @@ -0,0 +1,220 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# scqubits example: efficient parameter sweeps (`num_cpus` and BLAS threads)\n", + "\n", + "J. Koch and P. Groszkowski\n", + "\n", + "For further documentation of scqubits see https://scqubits.readthedocs.io/en/latest/.\n", + "\n", + "---\n", + "\n", + "`ParameterSweep` can compute a sweep in parallel across worker processes via its\n", + "`num_cpus` argument. How much that helps \u2014 and whether it helps at all \u2014 depends on\n", + "**two interacting knobs**:\n", + "\n", + "1. **`num_cpus`** \u2014 how many worker processes share the grid points.\n", + "2. **BLAS/LAPACK threads** \u2014 every eigensolve (via `numpy`/`scipy`) runs on a\n", + " multithreaded linear-algebra backend (OpenBLAS or MKL) that by default uses *all*\n", + " cores.\n", + "\n", + "Running `num_cpus` workers that each launch a full BLAS thread pool oversubscribes the\n", + "cores; and on the small matrices typical of qubit truncations, a heavily threaded BLAS\n", + "can be *slower* than a single thread. The fastest setting is a balance \u2014 roughly\n", + "`num_cpus \u00d7 BLAS-threads \u2248 number of cores` \u2014 and it depends on your machine, BLAS\n", + "library, and problem size.\n", + "\n", + "**This notebook shows how to measure it on your own system.** The timing cells are left\n", + "unexecuted on purpose: run them yourself \u2014 your numbers will differ." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Set the BLAS thread count *before* importing scqubits\n", + "\n", + "The BLAS backend reads its thread count once, at import time. So this must be the\n", + "**first** cell you run in a fresh kernel \u2014 before `numpy` or `scqubits` is imported. (If\n", + "you have already imported scqubits in this kernel, restart it first.)\n", + "\n", + "Capping to a small number \u2014 often `1` \u2014 is a good starting point for qubit sweeps, and\n", + "is also what prevents oversubscription once `num_cpus > 1`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# Must run before numpy / scqubits are imported (restart the kernel otherwise).\n", + "for _var in (\"OPENBLAS_NUM_THREADS\", \"MKL_NUM_THREADS\", \"OMP_NUM_THREADS\"):\n", + " os.environ[_var] = \"1\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "\n", + "import numpy as np\n", + "\n", + "import scqubits as scq" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. A coupled example system\n", + "\n", + "Two tunable transmons coupled to a common resonator (the same system as\n", + "`demo_parametersweep`), swept over `flux` and `ng`. We wrap construction in a factory so\n", + "we can rebuild a fresh sweep for each timed run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def make_sweep(num_cpus, flux_points=11):\n", + " tmon1 = scq.TunableTransmon(\n", + " EJmax=40.0, EC=0.2, d=0.1, flux=0.0, ng=0.3, ncut=40, truncated_dim=3\n", + " )\n", + " tmon2 = scq.TunableTransmon(\n", + " EJmax=15.0, EC=0.15, d=0.2, flux=0.0, ng=0.0, ncut=30, truncated_dim=3\n", + " )\n", + " resonator = scq.Oscillator(E_osc=4.5, truncated_dim=4)\n", + "\n", + " hilbertspace = scq.HilbertSpace([tmon1, tmon2, resonator])\n", + " hilbertspace.add_interaction(\n", + " g_strength=0.1, op1=tmon1.n_operator, op2=resonator.creation_operator, add_hc=True\n", + " )\n", + " hilbertspace.add_interaction(\n", + " g_strength=0.2, op1=tmon2.n_operator, op2=resonator.creation_operator, add_hc=True\n", + " )\n", + "\n", + " flux_vals = np.linspace(0.0, 2.0, flux_points)\n", + " ng_vals = np.linspace(-0.5, 0.5, 3)\n", + "\n", + " def update_hilbertspace(flux, ng):\n", + " tmon1.flux = flux\n", + " tmon2.flux = 1.2 * flux\n", + " tmon2.ng = ng\n", + "\n", + " return scq.ParameterSweep(\n", + " hilbertspace=hilbertspace,\n", + " paramvals_by_name={\"flux\": flux_vals, \"ng\": ng_vals},\n", + " update_hilbertspace=update_hilbertspace,\n", + " subsys_update_info={\"flux\": [tmon1, tmon2], \"ng\": [tmon2]},\n", + " evals_count=20,\n", + " num_cpus=num_cpus,\n", + " autorun=False,\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Time `ParameterSweep.run()` versus `num_cpus`\n", + "\n", + "Wall-clock timing is noisy, so we take the median of a few repeats and discard a warm-up\n", + "run (the first parallel run pays a one-time process-startup cost). **Results are\n", + "machine-specific** \u2014 what matters is the *shape* of the trend on your hardware, not any\n", + "single number." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def time_run(num_cpus, flux_points=11, repeats=3):\n", + " make_sweep(num_cpus, flux_points).run() # warm-up (discarded)\n", + " times = []\n", + " for _ in range(repeats):\n", + " sweep = make_sweep(num_cpus, flux_points)\n", + " start = time.perf_counter()\n", + " sweep.run()\n", + " times.append(time.perf_counter() - start)\n", + " return float(np.median(times))\n", + "\n", + "\n", + "cores = os.cpu_count() or 1\n", + "for n in [c for c in (1, 2, 4) if c <= cores]:\n", + " print(f\"num_cpus={n}: {time_run(n):.3f} s (median)\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. See the BLAS-threading effect\n", + "\n", + "To feel the impact of the thread cap, **restart the kernel**, change the cap in the first\n", + "cell from `\"1\"` to e.g. `str(os.cpu_count())` (let BLAS use every core), and re-run the\n", + "timing. On small-matrix sweeps you will typically find the capped version is faster even\n", + "at `num_cpus=1`, because an un-capped BLAS spends time coordinating threads on matrices\n", + "too small to benefit." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Larger sweeps and automated tuning\n", + "\n", + "The benefit of `num_cpus > 1` grows with the number of grid points \u2014 more work to\n", + "amortize the fixed startup cost. Compare a bigger grid to see parallelism pay off:\n", + "\n", + "```python\n", + "print(\"num_cpus=1:\", time_run(1, flux_points=64))\n", + "print(\"num_cpus=4:\", time_run(4, flux_points=64))\n", + "```\n", + "\n", + "Because the optimum depends on your machine, BLAS library, and problem, the most reliable\n", + "approach is to measure. If you work from the scqubits **source tree**, the script\n", + "`tools/autotune_multiprocessing.py` automates this: it searches the\n", + "`num_cpus \u00d7 BLAS-threads` frontier and reports the fastest configuration for your system." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "- Two knobs interact: `num_cpus` (worker processes) and BLAS/LAPACK threads.\n", + "- Set BLAS threads **before** importing scqubits; capping low (often `1`) is a good\n", + " default for qubit sweeps and prevents oversubscription when `num_cpus > 1`.\n", + "- Keep `num_cpus \u00d7 BLAS-threads \u2248 cores`; raise `num_cpus` only for large grids.\n", + "- Measure on your own hardware \u2014 timings are machine-specific." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file From bdb860bf2efefc2a9417cfd81af961509514ca26 Mon Sep 17 00:00:00 2001 From: Jens Koch <19193849+jkochNU@users.noreply.github.com> Date: Mon, 1 Jun 2026 17:42:53 -0500 Subject: [PATCH 2/8] examples: rewrite multiprocessing demo to set correct expectations The previous version led the reader to expect a num_cpus speedup, but its default example (small system, small grid) is in the regime where multiprocessing is net-negative -- so the first hands-on result showed num_cpus=2 slower than 1 with no explanation, and the "it helps for larger grids" caveat came afterwards. Restructure around expectation-setting and two measured regimes: - Open with the mental model: parallelism helps only when grid-size x per-point cost greatly exceeds the per-task overhead; on small/cheap sweeps it does nothing or slows you down -- that is normal, keep num_cpus=1. - Cap BLAS threads FIRST (before imports), with a threadpoolctl verification cell (the #1 cause of confusing/backwards timings is the cap not taking effect). - Demo 1: a cheap system (dim 216, ~5 ms/pt) where num_cpus does not help. - Demo 2: the same system at larger truncated_dim (dim 512, ~40 ms/pt) where num_cpus gives ~2.8x -- the only change is per-point cost. - Explain the BLAS oversubscription footgun with the measured ~90x cliff, note num_cpus x BLAS-threads ~ cores, and point to sparse diagonalization as the bigger lever for large composite systems. All illustrative numbers are real measurements (10-core laptop); timing cells are left runnable but unexecuted (run-it-yourself). Code paths smoke-tested. --- examples/demo_multiprocessing.ipynb | 305 +++++++++++++++++++--------- 1 file changed, 205 insertions(+), 100 deletions(-) diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb index 947db0b..335260f 100644 --- a/examples/demo_multiprocessing.ipynb +++ b/examples/demo_multiprocessing.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# scqubits example: efficient parameter sweeps (`num_cpus` and BLAS threads)\n", + "# scqubits example: when does parallelizing a sweep help? (`num_cpus`, BLAS threads)\n", "\n", "J. Koch and P. Groszkowski\n", "\n", @@ -12,37 +12,40 @@ "\n", "---\n", "\n", - "`ParameterSweep` can compute a sweep in parallel across worker processes via its\n", - "`num_cpus` argument. How much that helps \u2014 and whether it helps at all \u2014 depends on\n", - "**two interacting knobs**:\n", + "`ParameterSweep` can compute a sweep across worker processes via its `num_cpus`\n", + "argument. **Parallelism is not free, and very often it does *not* help — sometimes it\n", + "makes a sweep slower.** Whether it pays off is governed by one simple balance:\n", "\n", - "1. **`num_cpus`** \u2014 how many worker processes share the grid points.\n", - "2. **BLAS/LAPACK threads** \u2014 every eigensolve (via `numpy`/`scipy`) runs on a\n", - " multithreaded linear-algebra backend (OpenBLAS or MKL) that by default uses *all*\n", - " cores.\n", + "> parallelism helps only when  **(number of grid points) × (cost per point)  ≫  the fixed overhead**\n", + "> of starting worker processes and pickling each task to them.\n", "\n", - "Running `num_cpus` workers that each launch a full BLAS thread pool oversubscribes the\n", - "cores; and on the small matrices typical of qubit truncations, a heavily threaded BLAS\n", - "can be *slower* than a single thread. The fastest setting is a balance \u2014 roughly\n", - "`num_cpus \u00d7 BLAS-threads \u2248 number of cores` \u2014 and it depends on your machine, BLAS\n", - "library, and problem size.\n", + "So the behaviour you should *expect*:\n", "\n", - "**This notebook shows how to measure it on your own system.** The timing cells are left\n", - "unexecuted on purpose: run them yourself \u2014 your numbers will differ." + "- **Small grids, or cheap-per-point systems** (small Hilbert spaces): `num_cpus > 1`\n", + " gives little or no speedup, and frequently a **slowdown**. This is normal — keep the\n", + " default `num_cpus = 1`.\n", + "- **Large grids of expensive points** (big composite Hilbert spaces): this is where\n", + " workers pay off.\n", + "\n", + "A second knob interacts with the first: every eigensolve runs on a multithreaded\n", + "**BLAS/LAPACK** backend. If you run several workers and let each use all cores, you\n", + "oversubscribe — which on large dense matrices is not a small slowdown but a\n", + "**catastrophe** (an example below is ~90× slower than serial). So we cap BLAS threads\n", + "**first**, before importing anything. Timings are machine-specific — run the cells\n", + "yourself; the illustrative numbers are from a 10-core laptop and yours will differ.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## 1. Set the BLAS thread count *before* importing scqubits\n", - "\n", - "The BLAS backend reads its thread count once, at import time. So this must be the\n", - "**first** cell you run in a fresh kernel \u2014 before `numpy` or `scqubits` is imported. (If\n", - "you have already imported scqubits in this kernel, restart it first.)\n", + "## Step 0 (do this first): cap BLAS threads *before* importing scqubits\n", "\n", - "Capping to a small number \u2014 often `1` \u2014 is a good starting point for qubit sweeps, and\n", - "is also what prevents oversubscription once `num_cpus > 1`." + "The BLAS backend reads its thread count **once, at import time**. So this must be the\n", + "**first cell you run in a fresh kernel** — before `numpy` or `scqubits` is imported. (If\n", + "scqubits is already imported in this kernel, restart it and run this first.) Capping to\n", + "`1` is a good starting point for qubit sweeps and is what prevents oversubscription once\n", + "`num_cpus > 1`. We explain *why this matters so much* in the 'BLAS footgun' section below.\n" ] }, { @@ -53,9 +56,55 @@ "source": [ "import os\n", "\n", - "# Must run before numpy / scqubits are imported (restart the kernel otherwise).\n", - "for _var in (\"OPENBLAS_NUM_THREADS\", \"MKL_NUM_THREADS\", \"OMP_NUM_THREADS\"):\n", - " os.environ[_var] = \"1\"" + "# MUST run before numpy / scqubits are imported (restart the kernel otherwise).\n", + "for _var in (\"OPENBLAS_NUM_THREADS\", \"MKL_NUM_THREADS\", \"OMP_NUM_THREADS\", \"NUMEXPR_NUM_THREADS\"):\n", + " os.environ[_var] = \"1\"\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Confirm the cap took effect\n", + "\n", + "The single most common cause of confusing sweep timings is that the cap **did not\n", + "apply** — because `numpy`/`scqubits` were already imported when the env vars were set.\n", + "Check it (needs the small `threadpoolctl` package, `pip install threadpoolctl`); each\n", + "reported thread count should be `1`:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import scqubits as scq # imports scipy, whose BLAS the eigensolvers use\n", + "\n", + "try:\n", + " from threadpoolctl import threadpool_info\n", + "\n", + " pools = threadpool_info()\n", + " for pool in pools:\n", + " print(f\"{pool['internal_api']:>10}: {pool['num_threads']} thread(s)\")\n", + " if not pools:\n", + " print(\"(no controllable BLAS pool detected)\")\n", + " print(\"\\nEach count should be 1. If not, restart the kernel and run the cap cell\")\n", + " print(\"FIRST. (On Apple Silicon, numpy itself uses Accelerate, which ignores these\")\n", + " print(\" variables; scipy's bundled OpenBLAS -- shown here -- is what scqubits uses.)\")\n", + "except ImportError:\n", + " print(\"threadpoolctl not installed (pip install threadpoolctl) -- skipping check.\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A small timing helper\n", + "\n", + "Wall-clock timing is noisy, so we take the median of a few repeats and discard a warm-up\n", + "run (the first parallel run pays a one-time process-startup cost).\n" ] }, { @@ -66,20 +115,34 @@ "source": [ "import time\n", "\n", - "import numpy as np\n", "\n", - "import scqubits as scq" + "\n", + "def time_run(make_sweep, num_cpus, repeats=3):\n", + " \"\"\"Median wall time of one full sweep, after a discarded warm-up run.\n", + "\n", + " `make_sweep(num_cpus)` builds *and runs* a fresh ParameterSweep (autorun=True).\n", + " \"\"\"\n", + " make_sweep(num_cpus) # warm-up (discarded: pays one-time process startup)\n", + " times = []\n", + " for _ in range(repeats):\n", + " start = time.perf_counter()\n", + " make_sweep(num_cpus)\n", + " times.append(time.perf_counter() - start)\n", + " return float(np.median(times))\n", + "\n", + "\n", + "cores = os.cpu_count() or 1\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## 2. A coupled example system\n", + "## Demo 1 — when parallelism does *not* help (the common case)\n", "\n", - "Two tunable transmons coupled to a common resonator (the same system as\n", - "`demo_parametersweep`), swept over `flux` and `ng`. We wrap construction in a factory so\n", - "we can rebuild a fresh sweep for each timed run." + "Three coupled tunable transmons, swept over the flux of the first. With\n", + "`truncated_dim = 6` the dressed Hilbert space has dimension 6³ = 216 and each point costs\n", + "only a few milliseconds — far below the per-task overhead.\n" ] }, { @@ -88,52 +151,59 @@ "metadata": {}, "outputs": [], "source": [ - "def make_sweep(num_cpus, flux_points=11):\n", - " tmon1 = scq.TunableTransmon(\n", - " EJmax=40.0, EC=0.2, d=0.1, flux=0.0, ng=0.3, ncut=40, truncated_dim=3\n", - " )\n", - " tmon2 = scq.TunableTransmon(\n", - " EJmax=15.0, EC=0.15, d=0.2, flux=0.0, ng=0.0, ncut=30, truncated_dim=3\n", - " )\n", - " resonator = scq.Oscillator(E_osc=4.5, truncated_dim=4)\n", + "def make_sweep_light(num_cpus, n_points=96, truncated_dim=6):\n", + " qubits = [\n", + " scq.TunableTransmon(\n", + " EJmax=30.0, EC=0.2, d=0.1, flux=0.0, ng=0.0, ncut=50,\n", + " truncated_dim=truncated_dim, id_str=f\"tmon{i}\",\n", + " )\n", + " for i in range(3)\n", + " ]\n", + " hs = scq.HilbertSpace(qubits)\n", + " for i in range(2):\n", + " hs.add_interaction(g_strength=0.1, op1=qubits[i].n_operator, op2=qubits[i + 1].n_operator)\n", + " flux_vals = np.linspace(0.0, 0.5, n_points)\n", "\n", - " hilbertspace = scq.HilbertSpace([tmon1, tmon2, resonator])\n", - " hilbertspace.add_interaction(\n", - " g_strength=0.1, op1=tmon1.n_operator, op2=resonator.creation_operator, add_hc=True\n", - " )\n", - " hilbertspace.add_interaction(\n", - " g_strength=0.2, op1=tmon2.n_operator, op2=resonator.creation_operator, add_hc=True\n", + " def update(flux):\n", + " qubits[0].flux = flux\n", + "\n", + " return scq.ParameterSweep(\n", + " hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n", + " update_hilbertspace=update, evals_count=20, num_cpus=num_cpus, autorun=True,\n", " )\n", "\n", - " flux_vals = np.linspace(0.0, 2.0, flux_points)\n", - " ng_vals = np.linspace(-0.5, 0.5, 3)\n", "\n", - " def update_hilbertspace(flux, ng):\n", - " tmon1.flux = flux\n", - " tmon2.flux = 1.2 * flux\n", - " tmon2.ng = ng\n", + "for n in [c for c in (1, 2, 4) if c <= cores]:\n", + " print(f\"num_cpus={n}: {time_run(make_sweep_light, n):.3f} s (median)\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**What to expect:** roughly *flat* timings — `num_cpus = 4` is no faster than `1`, and\n", + "`num_cpus = 2` may even be slower. On the reference 10-core laptop (dim 216, 96 points,\n", + "BLAS capped to 1):\n", "\n", - " return scq.ParameterSweep(\n", - " hilbertspace=hilbertspace,\n", - " paramvals_by_name={\"flux\": flux_vals, \"ng\": ng_vals},\n", - " update_hilbertspace=update_hilbertspace,\n", - " subsys_update_info={\"flux\": [tmon1, tmon2], \"ng\": [tmon2]},\n", - " evals_count=20,\n", - " num_cpus=num_cpus,\n", - " autorun=False,\n", - " )" + "| num_cpus | wall time | speedup |\n", + "|---:|---:|---:|\n", + "| 1 | 0.51 s | 1.00× |\n", + "| 4 | 0.51 s | 1.00× |\n", + "\n", + "This is the **normal** outcome for a modest sweep: at ~5 ms/point the cost of pickling\n", + "each task to a worker is comparable to the work itself, so spreading the points across\n", + "processes buys nothing. **This is not a bug — it is the expected regime, and\n", + "`num_cpus = 1` (the default) is the right choice here.**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## 3. Time `ParameterSweep.run()` versus `num_cpus`\n", + "## Demo 2 — when parallelism *does* help\n", "\n", - "Wall-clock timing is noisy, so we take the median of a few repeats and discard a warm-up\n", - "run (the first parallel run pays a one-time process-startup cost). **Results are\n", - "machine-specific** \u2014 what matters is the *shape* of the trend on your hardware, not any\n", - "single number." + "The *only* change: `truncated_dim = 8`, so the dressed dimension becomes 8³ = 512 and\n", + "each eigensolve now costs ~40 ms — large enough to amortize the per-task overhead.\n" ] }, { @@ -142,66 +212,101 @@ "metadata": {}, "outputs": [], "source": [ - "def time_run(num_cpus, flux_points=11, repeats=3):\n", - " make_sweep(num_cpus, flux_points).run() # warm-up (discarded)\n", - " times = []\n", - " for _ in range(repeats):\n", - " sweep = make_sweep(num_cpus, flux_points)\n", - " start = time.perf_counter()\n", - " sweep.run()\n", - " times.append(time.perf_counter() - start)\n", - " return float(np.median(times))\n", + "def make_sweep_heavy(num_cpus, n_points=96):\n", + " return make_sweep_light(num_cpus, n_points=n_points, truncated_dim=8)\n", "\n", "\n", - "cores = os.cpu_count() or 1\n", "for n in [c for c in (1, 2, 4) if c <= cores]:\n", - " print(f\"num_cpus={n}: {time_run(n):.3f} s (median)\")" + " print(f\"num_cpus={n}: {time_run(make_sweep_heavy, n):.3f} s (median)\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## 4. See the BLAS-threading effect\n", + "**What to expect:** now the workers pay off. On the reference laptop (dim 512, 96\n", + "points, BLAS capped to 1):\n", + "\n", + "| num_cpus | wall time | speedup |\n", + "|---:|---:|---:|\n", + "| 1 | 3.62 s | 1.00× |\n", + "| 4 | 1.30 s | **2.8×** |\n", "\n", - "To feel the impact of the thread cap, **restart the kernel**, change the cap in the first\n", - "cell from `\"1\"` to e.g. `str(os.cpu_count())` (let BLAS use every core), and re-run the\n", - "timing. On small-matrix sweeps you will typically find the capped version is faster even\n", - "at `num_cpus=1`, because an un-capped BLAS spends time coordinating threads on matrices\n", - "too small to benefit." + "Nothing changed except the **cost per point** (~5 ms → ~40 ms, via `truncated_dim`).\n", + "That is the whole story: parallelism helps once *grid size × per-point cost* clears the\n", + "overhead. If your sweep looks like Demo 1, raising `num_cpus` will not help — make the\n", + "grid much larger, or accept that the sweep is simply fast enough serially.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## 5. Larger sweeps and automated tuning\n", + "## The BLAS-thread footgun — *why* we capped first\n", "\n", - "The benefit of `num_cpus > 1` grows with the number of grid points \u2014 more work to\n", - "amortize the fixed startup cost. Compare a bigger grid to see parallelism pay off:\n", + "Each eigensolve runs on a multithreaded BLAS backend that by default uses *all* cores.\n", + "Run `num_cpus` workers and each launches a full BLAS pool, so you oversubscribe the cores\n", + "by a factor of `num_cpus`. On small matrices this is wasteful; on **large dense**\n", + "matrices it is *catastrophic*. Measured on the reference laptop — 5 capacitively coupled\n", + "fluxonia, dressed dim 3125, dense diagonalization, 16-point sweep:\n", "\n", - "```python\n", - "print(\"num_cpus=1:\", time_run(1, flux_points=64))\n", - "print(\"num_cpus=4:\", time_run(4, flux_points=64))\n", - "```\n", + "| configuration | wall time |\n", + "|---|---:|\n", + "| `num_cpus=1` | 42 s |\n", + "| `num_cpus=4`, BLAS **uncapped** | **3608 s** (≈ 90× *slower*) |\n", + "| `num_cpus=4`, BLAS capped to 1 | 40 s |\n", + "| `num_cpus=8`, BLAS capped to 1 | 28 s |\n", "\n", - "Because the optimum depends on your machine, BLAS library, and problem, the most reliable\n", - "approach is to measure. If you work from the scqubits **source tree**, the script\n", - "`tools/autotune_multiprocessing.py` automates this: it searches the\n", - "`num_cpus \u00d7 BLAS-threads` frontier and reports the fastest configuration for your system." + "40 threads fighting over 10 cores on dense LAPACK collapses performance — the cap is the\n", + "difference between working and broken. (If you skip Step 0 and rerun Demo 2 with the cap\n", + "off, you will reproduce a milder version of this: `num_cpus = 2`/`4` become *much* slower\n", + "than `1`.) That is also the usual reason a `num_cpus` comparison looks 'inconclusive' or\n", + "backwards: the cap was never in effect.\n", + "\n", + "**Caveat — `1` is not universally optimal.** On small matrices a single BLAS thread is\n", + "best; on large dense matrices each eigensolve benefits from several threads, so the\n", + "fastest setting balances the knobs: roughly **`num_cpus × BLAS-threads ≈ cores`**. The\n", + "optimum depends on machine, BLAS library, and matrix size — measure it.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Summary\n", + "## For large composite systems, try sparse diagonalization first\n", + "\n", + "Before reaching for `num_cpus`, note that for large coupled systems the dominant cost is\n", + "the *per-point diagonalization*, and that is usually a bigger lever than parallelism.\n", + "Recent scqubits automatically uses **sparse** diagonalization (`scipy.eigsh`) for the\n", + "default method when only a few eigenstates of a large Hilbert space are requested\n", + "(`scqubits.settings.AUTO_SPARSE_DIAG`). For the 5-fluxonia system above (dressed dim\n", + "3125) this alone is ~16× faster per point than dense — far more than the ~1.5× that\n", + "multiprocessing buys there. And once sparse makes each point cheap, `num_cpus > 1` helps\n", + "even less. **Try sparse first; parallelize second.**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Automated tuning and summary\n", + "\n", + "Because the optimum depends on your machine, BLAS library, and problem, the most\n", + "reliable approach is to measure. From the scqubits **source tree**, the script\n", + "`tools/autotune_multiprocessing.py` searches the `num_cpus × BLAS-threads` frontier and\n", + "reports the fastest configuration for your system.\n", + "\n", + "### Summary\n", "\n", - "- Two knobs interact: `num_cpus` (worker processes) and BLAS/LAPACK threads.\n", - "- Set BLAS threads **before** importing scqubits; capping low (often `1`) is a good\n", - " default for qubit sweeps and prevents oversubscription when `num_cpus > 1`.\n", - "- Keep `num_cpus \u00d7 BLAS-threads \u2248 cores`; raise `num_cpus` only for large grids.\n", - "- Measure on your own hardware \u2014 timings are machine-specific." + "- **Default `num_cpus = 1`.** Parallelism helps only when *grid size × cost-per-point*\n", + " greatly exceeds the per-task overhead; on small or cheap sweeps it does nothing, or\n", + " slows you down — expected, not a bug (Demo 1 vs Demo 2).\n", + "- **Cap BLAS threads before importing scqubits**, and verify it took effect.\n", + " Oversubscription is catastrophic on large dense matrices (the 90× example). Keep\n", + " `num_cpus × BLAS-threads ≈ cores`.\n", + "- For large composite Hilbert spaces, **sparse diagonalization is usually a bigger lever\n", + " than multiprocessing** — try it first.\n", + "- **Measure on your own hardware** — every number here is machine-specific.\n" ] } ], @@ -217,4 +322,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} \ No newline at end of file +} From d6dd13422a9c9c8c5db41e2bc539fd32de9bb2f2 Mon Sep 17 00:00:00 2001 From: Jens Koch <19193849+jkochNU@users.noreply.github.com> Date: Mon, 1 Jun 2026 18:39:15 -0500 Subject: [PATCH 3/8] examples: re-anchor MP demo on grid size; refresh numbers; warning-free The multiprocessing follow-up (per-task IPC reduced from O(N^2) to O(N)) lowered per-task overhead enough that the previous "light" system (dim 216) now benefits from num_cpus even at 96 points, so the old truncated_dim contrast no longer demonstrated "parallelism does not help." Re-anchor the two demos on GRID SIZE, the lever users actually vary, measured on the integration of all current perf branches: - Demo 1: same system, 16 points -> num_cpus 1/2/4 are flat (0.10 s, ~1x/0.9x), the normal "too few points to amortize" regime. - Demo 2: same system, 384 points -> 2.14 s -> 0.90 s at 4 cores (2.37x). Single make_sweep(num_cpus, n_points) factory; refreshed illustrative tables; note the crossover also shifts with per-point cost. Verified against the combined branches: cap-first ordering keeps BLAS at 1 thread, and the matrix-element matmul fix means the coupled-transmon sweep now runs with zero spurious matmul warnings. --- examples/demo_multiprocessing.ipynb | 185 +++++++++++++--------------- 1 file changed, 87 insertions(+), 98 deletions(-) diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb index 335260f..1b255f9 100644 --- a/examples/demo_multiprocessing.ipynb +++ b/examples/demo_multiprocessing.ipynb @@ -13,26 +13,24 @@ "---\n", "\n", "`ParameterSweep` can compute a sweep across worker processes via its `num_cpus`\n", - "argument. **Parallelism is not free, and very often it does *not* help — sometimes it\n", - "makes a sweep slower.** Whether it pays off is governed by one simple balance:\n", + "argument. **Parallelism is not free, and on small sweeps it does *not* help — sometimes\n", + "it makes them slower.** Whether it pays off is governed by one simple balance:\n", "\n", "> parallelism helps only when  **(number of grid points) × (cost per point)  ≫  the fixed overhead**\n", - "> of starting worker processes and pickling each task to them.\n", + "> of starting worker processes and shipping each task to them.\n", "\n", "So the behaviour you should *expect*:\n", "\n", - "- **Small grids, or cheap-per-point systems** (small Hilbert spaces): `num_cpus > 1`\n", - " gives little or no speedup, and frequently a **slowdown**. This is normal — keep the\n", - " default `num_cpus = 1`.\n", - "- **Large grids of expensive points** (big composite Hilbert spaces): this is where\n", - " workers pay off.\n", + "- **Few grid points:** `num_cpus > 1` gives no speedup, and can be *slower* than serial.\n", + " This is normal — keep the default `num_cpus = 1`.\n", + "- **Many grid points** (and/or an expensive-per-point system): workers pay off.\n", "\n", "A second knob interacts with the first: every eigensolve runs on a multithreaded\n", "**BLAS/LAPACK** backend. If you run several workers and let each use all cores, you\n", "oversubscribe — which on large dense matrices is not a small slowdown but a\n", - "**catastrophe** (an example below is ~90× slower than serial). So we cap BLAS threads\n", - "**first**, before importing anything. Timings are machine-specific — run the cells\n", - "yourself; the illustrative numbers are from a 10-core laptop and yours will differ.\n" + "**catastrophe** (~90× in the example below). So we cap BLAS threads **first**, before\n", + "importing anything. Timings are machine-specific — run the cells yourself; the\n", + "illustrative numbers are from a 10-core laptop and yours will differ.\n" ] }, { @@ -41,11 +39,11 @@ "source": [ "## Step 0 (do this first): cap BLAS threads *before* importing scqubits\n", "\n", - "The BLAS backend reads its thread count **once, at import time**. So this must be the\n", + "The BLAS backend reads its thread count **once, at import time**, so this must be the\n", "**first cell you run in a fresh kernel** — before `numpy` or `scqubits` is imported. (If\n", - "scqubits is already imported in this kernel, restart it and run this first.) Capping to\n", - "`1` is a good starting point for qubit sweeps and is what prevents oversubscription once\n", - "`num_cpus > 1`. We explain *why this matters so much* in the 'BLAS footgun' section below.\n" + "scqubits is already imported, restart the kernel and run this first.) Capping to `1` is a\n", + "good starting point for qubit sweeps and prevents oversubscription once `num_cpus > 1`;\n", + "the 'BLAS footgun' section below explains why this matters so much.\n" ] }, { @@ -67,10 +65,9 @@ "source": [ "### Confirm the cap took effect\n", "\n", - "The single most common cause of confusing sweep timings is that the cap **did not\n", - "apply** — because `numpy`/`scqubits` were already imported when the env vars were set.\n", - "Check it (needs the small `threadpoolctl` package, `pip install threadpoolctl`); each\n", - "reported thread count should be `1`:\n" + "The single most common cause of confusing sweep timings is that the cap **did not apply**\n", + "— because `numpy`/`scqubits` were already imported when the env vars were set. Check it\n", + "(needs `threadpoolctl`, `pip install threadpoolctl`); each reported count should be `1`:\n" ] }, { @@ -101,10 +98,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## A small timing helper\n", + "## A small timing helper and an example system\n", "\n", - "Wall-clock timing is noisy, so we take the median of a few repeats and discard a warm-up\n", - "run (the first parallel run pays a one-time process-startup cost).\n" + "Three coupled tunable transmons, swept over the flux of the first. We rebuild a fresh\n", + "sweep for each timed run, and we will sweep **the same system at two grid sizes** to see\n", + "both regimes. Wall-clock timing is noisy, so we take the median of a few repeats and\n", + "discard a warm-up run (the first parallel run pays a one-time process-startup cost).\n" ] }, { @@ -116,17 +115,35 @@ "import time\n", "\n", "\n", + "def make_sweep(num_cpus, n_points, truncated_dim=6):\n", + " qubits = [\n", + " scq.TunableTransmon(\n", + " EJmax=30.0, EC=0.2, d=0.1, flux=0.0, ng=0.0, ncut=50,\n", + " truncated_dim=truncated_dim, id_str=f\"tmon{i}\",\n", + " )\n", + " for i in range(3)\n", + " ]\n", + " hs = scq.HilbertSpace(qubits)\n", + " for i in range(2):\n", + " hs.add_interaction(g_strength=0.1, op1=qubits[i].n_operator, op2=qubits[i + 1].n_operator)\n", + " flux_vals = np.linspace(0.0, 0.5, n_points)\n", "\n", - "def time_run(make_sweep, num_cpus, repeats=3):\n", - " \"\"\"Median wall time of one full sweep, after a discarded warm-up run.\n", + " def update(flux):\n", + " qubits[0].flux = flux\n", "\n", - " `make_sweep(num_cpus)` builds *and runs* a fresh ParameterSweep (autorun=True).\n", - " \"\"\"\n", - " make_sweep(num_cpus) # warm-up (discarded: pays one-time process startup)\n", + " return scq.ParameterSweep(\n", + " hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n", + " update_hilbertspace=update, evals_count=20, num_cpus=num_cpus, autorun=True,\n", + " )\n", + "\n", + "\n", + "def time_run(n_points, num_cpus, repeats=3):\n", + " \"\"\"Median wall time of make_sweep(num_cpus, n_points), after a discarded warm-up.\"\"\"\n", + " make_sweep(num_cpus, n_points) # warm-up (discarded)\n", " times = []\n", " for _ in range(repeats):\n", " start = time.perf_counter()\n", - " make_sweep(num_cpus)\n", + " make_sweep(num_cpus, n_points)\n", " times.append(time.perf_counter() - start)\n", " return float(np.median(times))\n", "\n", @@ -138,11 +155,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Demo 1 — when parallelism does *not* help (the common case)\n", + "## Demo 1 — a small sweep: parallelism does *not* help\n", "\n", - "Three coupled tunable transmons, swept over the flux of the first. With\n", - "`truncated_dim = 6` the dressed Hilbert space has dimension 6³ = 216 and each point costs\n", - "only a few milliseconds — far below the per-task overhead.\n" + "Only **16 grid points**. There are too few points to amortize the cost of starting\n", + "workers and shipping tasks, so spreading them across processes buys nothing.\n" ] }, { @@ -151,59 +167,35 @@ "metadata": {}, "outputs": [], "source": [ - "def make_sweep_light(num_cpus, n_points=96, truncated_dim=6):\n", - " qubits = [\n", - " scq.TunableTransmon(\n", - " EJmax=30.0, EC=0.2, d=0.1, flux=0.0, ng=0.0, ncut=50,\n", - " truncated_dim=truncated_dim, id_str=f\"tmon{i}\",\n", - " )\n", - " for i in range(3)\n", - " ]\n", - " hs = scq.HilbertSpace(qubits)\n", - " for i in range(2):\n", - " hs.add_interaction(g_strength=0.1, op1=qubits[i].n_operator, op2=qubits[i + 1].n_operator)\n", - " flux_vals = np.linspace(0.0, 0.5, n_points)\n", - "\n", - " def update(flux):\n", - " qubits[0].flux = flux\n", - "\n", - " return scq.ParameterSweep(\n", - " hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n", - " update_hilbertspace=update, evals_count=20, num_cpus=num_cpus, autorun=True,\n", - " )\n", - "\n", - "\n", "for n in [c for c in (1, 2, 4) if c <= cores]:\n", - " print(f\"num_cpus={n}: {time_run(make_sweep_light, n):.3f} s (median)\")\n" + " print(f\"num_cpus={n}: {time_run(16, n):.3f} s (median)\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**What to expect:** roughly *flat* timings — `num_cpus = 4` is no faster than `1`, and\n", - "`num_cpus = 2` may even be slower. On the reference 10-core laptop (dim 216, 96 points,\n", - "BLAS capped to 1):\n", + "**What to expect:** roughly *flat*, and `num_cpus = 2`/`4` may be slightly **slower** than\n", + "`1`. On the reference 10-core laptop (16 points, BLAS capped to 1):\n", "\n", "| num_cpus | wall time | speedup |\n", "|---:|---:|---:|\n", - "| 1 | 0.51 s | 1.00× |\n", - "| 4 | 0.51 s | 1.00× |\n", + "| 1 | 0.10 s | 1.00× |\n", + "| 2 | 0.10 s | 0.93× |\n", + "| 4 | 0.11 s | 0.84× |\n", "\n", - "This is the **normal** outcome for a modest sweep: at ~5 ms/point the cost of pickling\n", - "each task to a worker is comparable to the work itself, so spreading the points across\n", - "processes buys nothing. **This is not a bug — it is the expected regime, and\n", - "`num_cpus = 1` (the default) is the right choice here.**\n" + "This is the **normal** outcome for a quick sweep, and **not a bug** — with only 16 points\n", + "the per-task overhead dominates, so `num_cpus = 1` (the default) is the right choice.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Demo 2 — when parallelism *does* help\n", + "## Demo 2 — a large sweep: parallelism pays off\n", "\n", - "The *only* change: `truncated_dim = 8`, so the dressed dimension becomes 8³ = 512 and\n", - "each eigensolve now costs ~40 ms — large enough to amortize the per-task overhead.\n" + "The *same system*, now with **384 grid points**. There is enough work to amortize the\n", + "fixed overhead, so the workers help.\n" ] }, { @@ -212,30 +204,27 @@ "metadata": {}, "outputs": [], "source": [ - "def make_sweep_heavy(num_cpus, n_points=96):\n", - " return make_sweep_light(num_cpus, n_points=n_points, truncated_dim=8)\n", - "\n", - "\n", "for n in [c for c in (1, 2, 4) if c <= cores]:\n", - " print(f\"num_cpus={n}: {time_run(make_sweep_heavy, n):.3f} s (median)\")\n" + " print(f\"num_cpus={n}: {time_run(384, n):.3f} s (median)\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**What to expect:** now the workers pay off. On the reference laptop (dim 512, 96\n", - "points, BLAS capped to 1):\n", + "**What to expect:** now the workers pay off. On the reference laptop (384 points, BLAS\n", + "capped to 1):\n", "\n", "| num_cpus | wall time | speedup |\n", "|---:|---:|---:|\n", - "| 1 | 3.62 s | 1.00× |\n", - "| 4 | 1.30 s | **2.8×** |\n", - "\n", - "Nothing changed except the **cost per point** (~5 ms → ~40 ms, via `truncated_dim`).\n", - "That is the whole story: parallelism helps once *grid size × per-point cost* clears the\n", - "overhead. If your sweep looks like Demo 1, raising `num_cpus` will not help — make the\n", - "grid much larger, or accept that the sweep is simply fast enough serially.\n" + "| 1 | 2.14 s | 1.00× |\n", + "| 2 | 1.50 s | 1.42× |\n", + "| 4 | 0.90 s | **2.37×** |\n", + "\n", + "Nothing changed except the **number of grid points** (16 → 384). That is the lever:\n", + "parallelism helps once *grid size × per-point cost* clears the overhead. The crossover\n", + "depends on the per-point cost too — a heavier system (larger `truncated_dim`, or a large\n", + "composite Hilbert space) reaches it at fewer points; a cheaper one needs more.\n" ] }, { @@ -246,9 +235,9 @@ "\n", "Each eigensolve runs on a multithreaded BLAS backend that by default uses *all* cores.\n", "Run `num_cpus` workers and each launches a full BLAS pool, so you oversubscribe the cores\n", - "by a factor of `num_cpus`. On small matrices this is wasteful; on **large dense**\n", - "matrices it is *catastrophic*. Measured on the reference laptop — 5 capacitively coupled\n", - "fluxonia, dressed dim 3125, dense diagonalization, 16-point sweep:\n", + "by a factor of `num_cpus`. On small matrices this is wasteful; on **large dense** matrices\n", + "it is *catastrophic*. Measured on the reference laptop — 5 capacitively coupled fluxonia,\n", + "dressed dim 3125, dense diagonalization, 16-point sweep:\n", "\n", "| configuration | wall time |\n", "|---|---:|\n", @@ -258,10 +247,10 @@ "| `num_cpus=8`, BLAS capped to 1 | 28 s |\n", "\n", "40 threads fighting over 10 cores on dense LAPACK collapses performance — the cap is the\n", - "difference between working and broken. (If you skip Step 0 and rerun Demo 2 with the cap\n", - "off, you will reproduce a milder version of this: `num_cpus = 2`/`4` become *much* slower\n", - "than `1`.) That is also the usual reason a `num_cpus` comparison looks 'inconclusive' or\n", - "backwards: the cap was never in effect.\n", + "difference between working and broken. (Skip Step 0 and rerun Demo 2 with the cap off and\n", + "you will reproduce a milder version: `num_cpus = 2`/`4` become *much* slower than `1`.)\n", + "That is also the usual reason a `num_cpus` comparison looks 'inconclusive' or backwards:\n", + "the cap was never in effect.\n", "\n", "**Caveat — `1` is not universally optimal.** On small matrices a single BLAS thread is\n", "best; on large dense matrices each eigensolve benefits from several threads, so the\n", @@ -276,13 +265,13 @@ "## For large composite systems, try sparse diagonalization first\n", "\n", "Before reaching for `num_cpus`, note that for large coupled systems the dominant cost is\n", - "the *per-point diagonalization*, and that is usually a bigger lever than parallelism.\n", - "Recent scqubits automatically uses **sparse** diagonalization (`scipy.eigsh`) for the\n", - "default method when only a few eigenstates of a large Hilbert space are requested\n", + "the *per-point diagonalization*, usually a bigger lever than parallelism. Recent scqubits\n", + "automatically uses **sparse** diagonalization (`scipy.eigsh`) for the default method when\n", + "only a few eigenstates of a large Hilbert space are requested\n", "(`scqubits.settings.AUTO_SPARSE_DIAG`). For the 5-fluxonia system above (dressed dim\n", - "3125) this alone is ~16× faster per point than dense — far more than the ~1.5× that\n", - "multiprocessing buys there. And once sparse makes each point cheap, `num_cpus > 1` helps\n", - "even less. **Try sparse first; parallelize second.**\n" + "3125) this alone is ~16× faster per point than dense — far more than parallelism buys\n", + "there. And once sparse makes each point cheap, `num_cpus > 1` helps even less. **Try\n", + "sparse first; parallelize second.**\n" ] }, { @@ -291,16 +280,16 @@ "source": [ "## Automated tuning and summary\n", "\n", - "Because the optimum depends on your machine, BLAS library, and problem, the most\n", - "reliable approach is to measure. From the scqubits **source tree**, the script\n", + "Because the optimum depends on your machine, BLAS library, and problem, the most reliable\n", + "approach is to measure. From the scqubits **source tree**, the script\n", "`tools/autotune_multiprocessing.py` searches the `num_cpus × BLAS-threads` frontier and\n", "reports the fastest configuration for your system.\n", "\n", "### Summary\n", "\n", "- **Default `num_cpus = 1`.** Parallelism helps only when *grid size × cost-per-point*\n", - " greatly exceeds the per-task overhead; on small or cheap sweeps it does nothing, or\n", - " slows you down — expected, not a bug (Demo 1 vs Demo 2).\n", + " greatly exceeds the per-task overhead; on small sweeps it does nothing, or slows you\n", + " down — expected, not a bug (Demo 1 vs Demo 2).\n", "- **Cap BLAS threads before importing scqubits**, and verify it took effect.\n", " Oversubscription is catastrophic on large dense matrices (the 90× example). Keep\n", " `num_cpus × BLAS-threads ≈ cores`.\n", From 4c18b6caf6efd3739155111e3a079349821cb730 Mon Sep 17 00:00:00 2001 From: Jens Koch <19193849+jkochNU@users.noreply.github.com> Date: Tue, 2 Jun 2026 20:58:45 -0500 Subject: [PATCH 4/8] examples: demo the shipped parallelization heuristic + calibration Add a "Letting scqubits pick the settings" section showing recommend_parallelization (with a runnable cell over the 16- and 384-point grids), num_cpus="auto", AUTO_PARALLEL, and calibrate_parallelization; replace the dead tools/autotune_multiprocessing.py reference. Also fold in the earlier demo fixes (restored runnable second half, Mac mini timing labels, the __main__-guard note for plain scripts, and the corrected fork/spawn BLAS-cap wording). --- examples/demo_multiprocessing.ipynb | 205 +++++++++++++++------------- 1 file changed, 107 insertions(+), 98 deletions(-) diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb index 1b255f9..02d334f 100644 --- a/examples/demo_multiprocessing.ipynb +++ b/examples/demo_multiprocessing.ipynb @@ -3,35 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# scqubits example: when does parallelizing a sweep help? (`num_cpus`, BLAS threads)\n", - "\n", - "J. Koch and P. Groszkowski\n", - "\n", - "For further documentation of scqubits see https://scqubits.readthedocs.io/en/latest/.\n", - "\n", - "---\n", - "\n", - "`ParameterSweep` can compute a sweep across worker processes via its `num_cpus`\n", - "argument. **Parallelism is not free, and on small sweeps it does *not* help — sometimes\n", - "it makes them slower.** Whether it pays off is governed by one simple balance:\n", - "\n", - "> parallelism helps only when  **(number of grid points) × (cost per point)  ≫  the fixed overhead**\n", - "> of starting worker processes and shipping each task to them.\n", - "\n", - "So the behaviour you should *expect*:\n", - "\n", - "- **Few grid points:** `num_cpus > 1` gives no speedup, and can be *slower* than serial.\n", - " This is normal — keep the default `num_cpus = 1`.\n", - "- **Many grid points** (and/or an expensive-per-point system): workers pay off.\n", - "\n", - "A second knob interacts with the first: every eigensolve runs on a multithreaded\n", - "**BLAS/LAPACK** backend. If you run several workers and let each use all cores, you\n", - "oversubscribe — which on large dense matrices is not a small slowdown but a\n", - "**catastrophe** (~90× in the example below). So we cap BLAS threads **first**, before\n", - "importing anything. Timings are machine-specific — run the cells yourself; the\n", - "illustrative numbers are from a 10-core laptop and yours will differ.\n" - ] + "source": "# scqubits example: when does parallelizing a sweep help? (`num_cpus`, BLAS threads)\n\nJ. Koch and P. Groszkowski\n\nFor further documentation of scqubits see https://scqubits.readthedocs.io/en/latest/.\n\n---\n\n`ParameterSweep` can compute a sweep across worker processes via its `num_cpus`\nargument. **Parallelism is not free, and on small sweeps it does *not* help — sometimes\nit makes them slower.** Whether it pays off is governed by one simple balance:\n\n> parallelism helps only when  **(number of grid points) × (cost per point)  ≫  the fixed overhead**\n> of starting worker processes and shipping each task to them.\n\nSo the behaviour you should *expect*:\n\n- **Few grid points:** `num_cpus > 1` gives no speedup, and can be *slower* than serial.\n This is normal — keep the default `num_cpus = 1`.\n- **Many grid points** (and/or an expensive-per-point system): workers pay off.\n\nA second knob interacts with the first: every eigensolve runs on a multithreaded\n**BLAS/LAPACK** backend. If you run several workers and let each use all cores, you\noversubscribe — which on large dense matrices is not a small slowdown but a\n**catastrophe** (~90× in the example below). So we cap BLAS threads **first**, before\nimporting anything. Timings are machine-specific — run the cells yourself; the\nillustrative numbers are from a 10-core Mac mini and yours will differ." }, { "cell_type": "markdown", @@ -151,6 +123,11 @@ "cores = os.cpu_count() or 1\n" ] }, + { + "cell_type": "markdown", + "source": "### Running this as a `.py` script? Guard the entry point\n\nThese cells run fine **in Jupyter**. But if you move parallel code (`num_cpus > 1`) into a\nplain Python script, note that macOS and Windows start workers with the `spawn` method,\nwhich **re-imports your script** in each worker — so the entry point must be guarded, or\nPython raises a `RuntimeError`:\n\n```python\nif __name__ == \"__main__\":\n ... # code that triggers num_cpus > 1\n```\n\nLinux (which uses `fork`) and Jupyter need no guard. scqubits also prints a one-time\nreminder the first time it spawns workers outside Jupyter.", + "metadata": {} + }, { "cell_type": "markdown", "metadata": {}, @@ -174,19 +151,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "**What to expect:** roughly *flat*, and `num_cpus = 2`/`4` may be slightly **slower** than\n", - "`1`. On the reference 10-core laptop (16 points, BLAS capped to 1):\n", - "\n", - "| num_cpus | wall time | speedup |\n", - "|---:|---:|---:|\n", - "| 1 | 0.10 s | 1.00× |\n", - "| 2 | 0.10 s | 0.93× |\n", - "| 4 | 0.11 s | 0.84× |\n", - "\n", - "This is the **normal** outcome for a quick sweep, and **not a bug** — with only 16 points\n", - "the per-task overhead dominates, so `num_cpus = 1` (the default) is the right choice.\n" - ] + "source": "**What to expect:** roughly *flat*, and `num_cpus = 2`/`4` may be slightly **slower** than\n`1`. On the reference 10-core Mac mini (16 points, BLAS capped to 1):\n\n| num_cpus | wall time | speedup |\n|---:|---:|---:|\n| 1 | 0.10 s | 1.00× |\n| 2 | 0.10 s | 0.93× |\n| 4 | 0.11 s | 0.84× |\n\nThis is the **normal** outcome for a quick sweep, and **not a bug** — with only 16 points\nthe per-task overhead dominates, so `num_cpus = 1` (the default) is the right choice." }, { "cell_type": "markdown", @@ -211,53 +176,29 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "**What to expect:** now the workers pay off. On the reference laptop (384 points, BLAS\n", - "capped to 1):\n", - "\n", - "| num_cpus | wall time | speedup |\n", - "|---:|---:|---:|\n", - "| 1 | 2.14 s | 1.00× |\n", - "| 2 | 1.50 s | 1.42× |\n", - "| 4 | 0.90 s | **2.37×** |\n", - "\n", - "Nothing changed except the **number of grid points** (16 → 384). That is the lever:\n", - "parallelism helps once *grid size × per-point cost* clears the overhead. The crossover\n", - "depends on the per-point cost too — a heavier system (larger `truncated_dim`, or a large\n", - "composite Hilbert space) reaches it at fewer points; a cheaper one needs more.\n" - ] + "source": "**What to expect:** now the workers pay off. On the reference Mac mini (384 points, BLAS\ncapped to 1):\n\n| num_cpus | wall time | speedup |\n|---:|---:|---:|\n| 1 | 2.14 s | 1.00× |\n| 2 | 1.50 s | 1.42× |\n| 4 | 0.90 s | **2.37×** |\n\nNothing changed except the **number of grid points** (16 → 384). That is the lever:\nparallelism helps once *grid size × per-point cost* clears the overhead. The crossover\ndepends on the per-point cost too — a heavier system (larger `truncated_dim`, or a large\ncomposite Hilbert space) reaches it at fewer points; a cheaper one needs more." + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "## The BLAS-thread footgun — *why* we capped first\n\nEach eigensolve runs on a multithreaded BLAS backend that by default uses *all* cores.\nRun `num_cpus` workers and each launches a full BLAS pool, so you oversubscribe the cores\nby a factor of `num_cpus`. On small matrices this is wasteful; on **large dense** matrices\nit is *catastrophic*. Measured on the reference Mac mini — 5 capacitively coupled fluxonia,\ndressed dim 3125, dense diagonalization, 16-point sweep:\n\n| configuration | wall time |\n|---|---:|\n| `num_cpus=1` | 42 s |\n| `num_cpus=4`, BLAS **uncapped** | **3608 s** (≈ 90× *slower*) |\n| `num_cpus=4`, BLAS capped to 1 | 40 s |\n| `num_cpus=8`, BLAS capped to 1 | 28 s |\n\n40 threads fighting over 10 cores on dense LAPACK collapses performance — the cap is the\ndifference between working and broken. (Skip Step 0 and rerun Demo 2 with the cap off and\nyou will reproduce a milder version: `num_cpus = 2`/`4` become *much* slower than `1`.)\nThat is also the usual reason a `num_cpus` comparison looks 'inconclusive' or backwards:\nthe cap was never in effect.\n\n**Caveat — `1` is not universally optimal.** On small matrices a single BLAS thread is\nbest; on large dense matrices each eigensolve benefits from several threads, so the\nfastest setting balances the knobs: roughly **`num_cpus × BLAS-threads ≈ cores`**. The\noptimum depends on machine, BLAS library, and matrix size — measure it." }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## The BLAS-thread footgun — *why* we capped first\n", - "\n", - "Each eigensolve runs on a multithreaded BLAS backend that by default uses *all* cores.\n", - "Run `num_cpus` workers and each launches a full BLAS pool, so you oversubscribe the cores\n", - "by a factor of `num_cpus`. On small matrices this is wasteful; on **large dense** matrices\n", - "it is *catastrophic*. Measured on the reference laptop — 5 capacitively coupled fluxonia,\n", - "dressed dim 3125, dense diagonalization, 16-point sweep:\n", - "\n", - "| configuration | wall time |\n", - "|---|---:|\n", - "| `num_cpus=1` | 42 s |\n", - "| `num_cpus=4`, BLAS **uncapped** | **3608 s** (≈ 90× *slower*) |\n", - "| `num_cpus=4`, BLAS capped to 1 | 40 s |\n", - "| `num_cpus=8`, BLAS capped to 1 | 28 s |\n", - "\n", - "40 threads fighting over 10 cores on dense LAPACK collapses performance — the cap is the\n", - "difference between working and broken. (Skip Step 0 and rerun Demo 2 with the cap off and\n", - "you will reproduce a milder version: `num_cpus = 2`/`4` become *much* slower than `1`.)\n", - "That is also the usual reason a `num_cpus` comparison looks 'inconclusive' or backwards:\n", - "the cap was never in effect.\n", - "\n", - "**Caveat — `1` is not universally optimal.** On small matrices a single BLAS thread is\n", - "best; on large dense matrices each eigensolve benefits from several threads, so the\n", - "fastest setting balances the knobs: roughly **`num_cpus × BLAS-threads ≈ cores`**. The\n", - "optimum depends on machine, BLAS library, and matrix size — measure it.\n" + "The thread cap above was set with environment variables (the reliable way, since BLAS\n", + "reads them at import). scqubits also exposes the relevant knobs as **settings** you can\n", + "change at any time in a session:\n" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# Global default number of worker processes (used when `num_cpus` is not passed):\nscq.settings.NUM_CPUS = 1\n\n# Cap BLAS threads per worker process during parallel sweeps. This is applied only while\n# the worker pool is created. Spawn-based workers (macOS, Windows) re-read the env vars at\n# import; fork-based workers (Linux) need `threadpoolctl`. Keep num_cpus x BLAS-threads ~ cores.\nscq.settings.MULTIPROC_BLAS_THREADS = 1" + }, { "cell_type": "markdown", "metadata": {}, @@ -278,24 +219,92 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Automated tuning and summary\n", + "The setting is `scqubits.settings.AUTO_SPARSE_DIAG` (on by default). Here it is on a\n", + "moderately large system you can actually run — **4 capacitively coupled fluxonia**,\n", + "dressed dimension 6⁴ = 1296 — timing the same 8-point sweep with sparse vs dense\n", + "diagonalization:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import warnings\n", + "\n", + "def make_fluxonia_sweep(n_points=8):\n", + " fluxonia = [\n", + " scq.Fluxonium(EJ=4.0, EC=1.0, EL=1.0, flux=0.5, cutoff=110, truncated_dim=6, id_str=f\"flx{i}\")\n", + " for i in range(4)\n", + " ]\n", + " hs = scq.HilbertSpace(fluxonia)\n", + " for i in range(3):\n", + " hs.add_interaction(g_strength=0.1, op1=fluxonia[i].n_operator, op2=fluxonia[i + 1].n_operator)\n", + "\n", + " def update(flux):\n", + " fluxonia[0].flux = flux\n", "\n", - "Because the optimum depends on your machine, BLAS library, and problem, the most reliable\n", - "approach is to measure. From the scqubits **source tree**, the script\n", - "`tools/autotune_multiprocessing.py` searches the `num_cpus × BLAS-threads` frontier and\n", - "reports the fastest configuration for your system.\n", + " return scq.ParameterSweep(\n", + " hilbertspace=hs, paramvals_by_name={\"flux\": np.linspace(0.45, 0.55, n_points)},\n", + " update_hilbertspace=update, evals_count=20, num_cpus=1, autorun=True,\n", + " )\n", "\n", - "### Summary\n", "\n", - "- **Default `num_cpus = 1`.** Parallelism helps only when *grid size × cost-per-point*\n", - " greatly exceeds the per-task overhead; on small sweeps it does nothing, or slows you\n", - " down — expected, not a bug (Demo 1 vs Demo 2).\n", - "- **Cap BLAS threads before importing scqubits**, and verify it took effect.\n", - " Oversubscription is catastrophic on large dense matrices (the 90× example). Keep\n", - " `num_cpus × BLAS-threads ≈ cores`.\n", - "- For large composite Hilbert spaces, **sparse diagonalization is usually a bigger lever\n", - " than multiprocessing** — try it first.\n", - "- **Measure on your own hardware** — every number here is machine-specific.\n" + "def time_fluxonia():\n", + " make_fluxonia_sweep() # warm-up\n", + " start = time.perf_counter()\n", + " make_fluxonia_sweep()\n", + " return time.perf_counter() - start\n", + "\n", + "\n", + "scq.settings.AUTO_SPARSE_DIAG = True\n", + "t_sparse = time_fluxonia()\n", + "\n", + "# The dense path exercises scipy's matrix-function code, which emits spurious 'matmul'\n", + "# RuntimeWarnings on some BLAS backends (e.g. Apple Accelerate); silence them for the demo.\n", + "with warnings.catch_warnings():\n", + " warnings.simplefilter(\"ignore\")\n", + " scq.settings.AUTO_SPARSE_DIAG = False\n", + " t_dense = time_fluxonia()\n", + "scq.settings.AUTO_SPARSE_DIAG = True # restore the default\n", + "\n", + "print(f\"sparse (default): {t_sparse:.2f} s dense: {t_dense:.2f} s speedup: {t_dense / t_sparse:.1f}x\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "On the reference Mac mini this prints roughly `sparse 0.55 s dense 3.94 s speedup 7.2x`.\nThe advantage grows with system size, and for very large Hilbert spaces dense may not fit\nin memory at all. (The dense run may also emit a few scipy-internal `matmul` warnings —\nanother reason to prefer the sparse default.)" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "## Letting scqubits pick the settings\n\nYou don't have to find the break-even by hand. `scqubits.recommend_parallelization` reads\nthe workload — Hilbert-space dimension, number of grid points, eigenvalue count, and\nwhether sparse diagonalization applies — and returns a recommended `num_cpus` and\nBLAS-thread cap. It is a *pure* function (it starts no workers), so it is safe to call\nanywhere. For the three-transmon system above (dressed dimension 6³ = 216), at the two grid\nsizes from Demos 1 and 2:" + }, + { + "cell_type": "code", + "source": "for n_points in (16, 384):\n cfg = scq.recommend_parallelization(dimension=6**3, num_points=n_points, evals_count=20)\n print(f\"{n_points:>4} points -> num_cpus={cfg.num_cpus}, blas_threads={cfg.blas_threads}\")\n print(f\" {cfg.reason}\")\n", + "metadata": {}, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": "The recommendation reproduces what the demos showed: serial for the 16-point sweep,\nparallel for the 384-point one. Two conveniences build on it:\n\n- pass **`num_cpus=\"auto\"`** to a sweep and it tunes itself *before* it runs:\n `scq.ParameterSweep(..., num_cpus=\"auto\")`;\n- set **`scq.settings.AUTO_PARALLEL = True`** to apply this to every sweep that does not\n specify `num_cpus`.\n\nFor a recommendation tuned to *your* hardware (rather than built-in defaults), run the\none-time **`scq.calibrate_parallelization()`** — it measures this machine's per-task\noverhead, pool-startup cost, and per-point cost (writing\n`~/.scqubits/parallel_calibration.json`), after which the recommendation uses your measured\nbreak-even.\n\n### Summary\n\n- **Easiest:** let `recommend_parallelization` / `num_cpus=\"auto\"` choose; optionally run\n `calibrate_parallelization()` once for machine-tuned advice.\n- Parallelism helps only when *grid size × cost-per-point* greatly exceeds the per-task\n overhead; on small sweeps it does nothing, or slows you down — expected, not a bug\n (Demo 1 vs Demo 2).\n- **Cap BLAS threads** to avoid oversubscription (the 90× example); keep\n `num_cpus × BLAS-threads ≈ cores`. Capping before import is the blunt manual route;\n `MULTIPROC_BLAS_THREADS` and the heuristic handle it per sweep.\n- For large composite Hilbert spaces, **sparse diagonalization is usually a bigger lever\n than multiprocessing** — try it first.\n- **Every number here is machine-specific** — measure (or calibrate) on your own hardware.", + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The settings touched in this notebook, with their current values:\n", + "print(\"NUM_CPUS =\", scq.settings.NUM_CPUS)\n", + "print(\"MULTIPROC_BLAS_THREADS =\", scq.settings.MULTIPROC_BLAS_THREADS)\n", + "print(\"AUTO_SPARSE_DIAG =\", scq.settings.AUTO_SPARSE_DIAG)\n", + "print(\"MULTIPROC =\", scq.settings.MULTIPROC, \"(parallelization backend)\")\n" ] } ], @@ -311,4 +320,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file From 5660a50727558f8a0e22868f253f8b2a5f56179c Mon Sep 17 00:00:00 2001 From: Jens Koch <19193849+jkochNU@users.noreply.github.com> Date: Tue, 2 Jun 2026 21:50:52 -0500 Subject: [PATCH 5/8] examples: rewrite multiprocessing demo around the auto-tuning story Rewrite from scratch so the main story is the shipped capability: let scqubits choose num_cpus and the BLAS-thread cap. Lead with recommend_parallelization, num_cpus="auto", and AUTO_PARALLEL; show a serial-vs-auto correctness check; then the one-time calibrate_parallelization(). The manual mechanics (grid break-even, the BLAS oversubscription cliff, manual settings, sparse-first, the __main__ guard) are demoted to supporting "why" and "manual control" sections. American English throughout. 17 cells, down from 25. --- examples/demo_multiprocessing.ipynb | 345 +++++++++++++--------------- 1 file changed, 166 insertions(+), 179 deletions(-) diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb index 02d334f..8dfbdaf 100644 --- a/examples/demo_multiprocessing.ipynb +++ b/examples/demo_multiprocessing.ipynb @@ -1,45 +1,26 @@ { "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": "# scqubits example: when does parallelizing a sweep help? (`num_cpus`, BLAS threads)\n\nJ. Koch and P. Groszkowski\n\nFor further documentation of scqubits see https://scqubits.readthedocs.io/en/latest/.\n\n---\n\n`ParameterSweep` can compute a sweep across worker processes via its `num_cpus`\nargument. **Parallelism is not free, and on small sweeps it does *not* help — sometimes\nit makes them slower.** Whether it pays off is governed by one simple balance:\n\n> parallelism helps only when  **(number of grid points) × (cost per point)  ≫  the fixed overhead**\n> of starting worker processes and shipping each task to them.\n\nSo the behaviour you should *expect*:\n\n- **Few grid points:** `num_cpus > 1` gives no speedup, and can be *slower* than serial.\n This is normal — keep the default `num_cpus = 1`.\n- **Many grid points** (and/or an expensive-per-point system): workers pay off.\n\nA second knob interacts with the first: every eigensolve runs on a multithreaded\n**BLAS/LAPACK** backend. If you run several workers and let each use all cores, you\noversubscribe — which on large dense matrices is not a small slowdown but a\n**catastrophe** (~90× in the example below). So we cap BLAS threads **first**, before\nimporting anything. Timings are machine-specific — run the cells yourself; the\nillustrative numbers are from a 10-core Mac mini and yours will differ." - }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Step 0 (do this first): cap BLAS threads *before* importing scqubits\n", + "# Letting scqubits choose your multiprocessing settings\n", "\n", - "The BLAS backend reads its thread count **once, at import time**, so this must be the\n", - "**first cell you run in a fresh kernel** — before `numpy` or `scqubits` is imported. (If\n", - "scqubits is already imported, restart the kernel and run this first.) Capping to `1` is a\n", - "good starting point for qubit sweeps and prevents oversubscription once `num_cpus > 1`;\n", - "the 'BLAS footgun' section below explains why this matters so much.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", + "J. Koch and P. Groszkowski\n", "\n", - "# MUST run before numpy / scqubits are imported (restart the kernel otherwise).\n", - "for _var in (\"OPENBLAS_NUM_THREADS\", \"MKL_NUM_THREADS\", \"OMP_NUM_THREADS\", \"NUMEXPR_NUM_THREADS\"):\n", - " os.environ[_var] = \"1\"\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Confirm the cap took effect\n", + "For further documentation of scqubits see https://scqubits.readthedocs.io/en/latest/.\n", + "\n", + "---\n", + "\n", + "A `ParameterSweep` can spread its grid points across worker processes through the `num_cpus`\n", + "argument. The catch is that the *right* number of workers — and the right number of BLAS\n", + "threads each worker should use — depends on the problem and the machine. Guess wrong and you\n", + "get no speedup, or, when the workers oversubscribe the cores, a slowdown of **one to two\n", + "orders of magnitude**.\n", "\n", - "The single most common cause of confusing sweep timings is that the cap **did not apply**\n", - "— because `numpy`/`scqubits` were already imported when the env vars were set. Check it\n", - "(needs `threadpoolctl`, `pip install threadpoolctl`); each reported count should be `1`:\n" + "So scqubits can choose for you. This notebook shows the easy path —\n", + "`recommend_parallelization` and `num_cpus=\"auto\"` — and a one-time per-machine calibration,\n", + "and then explains what is being balanced under the hood." ] }, { @@ -49,33 +30,18 @@ "outputs": [], "source": [ "import numpy as np\n", - "import scqubits as scq # imports scipy, whose BLAS the eigensolvers use\n", - "\n", - "try:\n", - " from threadpoolctl import threadpool_info\n", - "\n", - " pools = threadpool_info()\n", - " for pool in pools:\n", - " print(f\"{pool['internal_api']:>10}: {pool['num_threads']} thread(s)\")\n", - " if not pools:\n", - " print(\"(no controllable BLAS pool detected)\")\n", - " print(\"\\nEach count should be 1. If not, restart the kernel and run the cap cell\")\n", - " print(\"FIRST. (On Apple Silicon, numpy itself uses Accelerate, which ignores these\")\n", - " print(\" variables; scipy's bundled OpenBLAS -- shown here -- is what scqubits uses.)\")\n", - "except ImportError:\n", - " print(\"threadpoolctl not installed (pip install threadpoolctl) -- skipping check.\")\n" + "\n", + "import scqubits as scq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## A small timing helper and an example system\n", + "## A system to sweep\n", "\n", - "Three coupled tunable transmons, swept over the flux of the first. We rebuild a fresh\n", - "sweep for each timed run, and we will sweep **the same system at two grid sizes** to see\n", - "both regimes. Wall-clock timing is noisy, so we take the median of a few repeats and\n", - "discard a warm-up run (the first parallel run pays a one-time process-startup cost).\n" + "We use three capacitively coupled tunable transmons and sweep the flux of the first. The\n", + "dressed Hilbert space has dimension `6**3 = 216`." ] }, { @@ -84,58 +50,43 @@ "metadata": {}, "outputs": [], "source": [ - "import time\n", - "\n", - "\n", - "def make_sweep(num_cpus, n_points, truncated_dim=6):\n", + "def build_hilbertspace():\n", " qubits = [\n", " scq.TunableTransmon(\n", " EJmax=30.0, EC=0.2, d=0.1, flux=0.0, ng=0.0, ncut=50,\n", - " truncated_dim=truncated_dim, id_str=f\"tmon{i}\",\n", + " truncated_dim=6, id_str=f\"tmon{i}\",\n", " )\n", " for i in range(3)\n", " ]\n", " hs = scq.HilbertSpace(qubits)\n", " for i in range(2):\n", - " hs.add_interaction(g_strength=0.1, op1=qubits[i].n_operator, op2=qubits[i + 1].n_operator)\n", - " flux_vals = np.linspace(0.0, 0.5, n_points)\n", + " hs.add_interaction(\n", + " g_strength=0.1, op1=qubits[i].n_operator, op2=qubits[i + 1].n_operator\n", + " )\n", "\n", " def update(flux):\n", " qubits[0].flux = flux\n", "\n", - " return scq.ParameterSweep(\n", - " hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n", - " update_hilbertspace=update, evals_count=20, num_cpus=num_cpus, autorun=True,\n", - " )\n", - "\n", - "\n", - "def time_run(n_points, num_cpus, repeats=3):\n", - " \"\"\"Median wall time of make_sweep(num_cpus, n_points), after a discarded warm-up.\"\"\"\n", - " make_sweep(num_cpus, n_points) # warm-up (discarded)\n", - " times = []\n", - " for _ in range(repeats):\n", - " start = time.perf_counter()\n", - " make_sweep(num_cpus, n_points)\n", - " times.append(time.perf_counter() - start)\n", - " return float(np.median(times))\n", + " return hs, update\n", "\n", "\n", - "cores = os.cpu_count() or 1\n" + "hs, update = build_hilbertspace()\n", + "print(\"dressed dimension:\", hs.dimension)" ] }, - { - "cell_type": "markdown", - "source": "### Running this as a `.py` script? Guard the entry point\n\nThese cells run fine **in Jupyter**. But if you move parallel code (`num_cpus > 1`) into a\nplain Python script, note that macOS and Windows start workers with the `spawn` method,\nwhich **re-imports your script** in each worker — so the entry point must be guarded, or\nPython raises a `RuntimeError`:\n\n```python\nif __name__ == \"__main__\":\n ... # code that triggers num_cpus > 1\n```\n\nLinux (which uses `fork`) and Jupyter need no guard. scqubits also prints a one-time\nreminder the first time it spawns workers outside Jupyter.", - "metadata": {} - }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Demo 1 — a small sweep: parallelism does *not* help\n", + "## The easy way: let scqubits choose\n", + "\n", + "`scq.recommend_parallelization` reads the workload — Hilbert-space dimension, number of grid\n", + "points, eigenvalue count, and whether sparse diagonalization applies — and returns a\n", + "recommended `num_cpus` together with a per-worker BLAS-thread cap. It is a *pure* function: it\n", + "starts no worker processes, so it is safe to call anywhere, and it does not run the sweep.\n", "\n", - "Only **16 grid points**. There are too few points to amortize the cost of starting\n", - "workers and shipping tasks, so spreading them across processes buys nothing.\n" + "Because a `ParameterSweep` runs the moment it is constructed, call it *before* building the\n", + "sweep. Here is its choice for our system at two grid sizes:" ] }, { @@ -144,23 +95,41 @@ "metadata": {}, "outputs": [], "source": [ - "for n in [c for c in (1, 2, 4) if c <= cores]:\n", - " print(f\"num_cpus={n}: {time_run(16, n):.3f} s (median)\")\n" + "for n_points in (16, 384):\n", + " cfg = scq.recommend_parallelization(\n", + " hilbertspace=hs, num_points=n_points, evals_count=20\n", + " )\n", + " print(f\"{n_points:>4} points -> num_cpus={cfg.num_cpus}, blas_threads={cfg.blas_threads}\")\n", + " print(f\" {cfg.reason}\")" ] }, { "cell_type": "markdown", "metadata": {}, - "source": "**What to expect:** roughly *flat*, and `num_cpus = 2`/`4` may be slightly **slower** than\n`1`. On the reference 10-core Mac mini (16 points, BLAS capped to 1):\n\n| num_cpus | wall time | speedup |\n|---:|---:|---:|\n| 1 | 0.10 s | 1.00× |\n| 2 | 0.10 s | 0.93× |\n| 4 | 0.11 s | 0.84× |\n\nThis is the **normal** outcome for a quick sweep, and **not a bug** — with only 16 points\nthe per-task overhead dominates, so `num_cpus = 1` (the default) is the right choice." + "source": [ + "A 16-point sweep stays serial — there are too few points to repay the cost of starting and\n", + "feeding worker processes — while the 384-point sweep is spread across workers, each capped to\n", + "a single BLAS thread so they do not oversubscribe the cores.\n", + "\n", + "To apply the recommendation without copying numbers by hand, pass the sentinel\n", + "`num_cpus=\"auto\"` to the sweep. The choice is then made automatically, *before* the sweep\n", + "runs:\n", + "\n", + "```python\n", + "sweep = scq.ParameterSweep(..., num_cpus=\"auto\")\n", + "```\n", + "\n", + "To make *every* sweep that does not specify `num_cpus` tune itself this way, set\n", + "`scq.settings.AUTO_PARALLEL = True`." + ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Demo 2 — a large sweep: parallelism pays off\n", - "\n", - "The *same system*, now with **384 grid points**. There is enough work to amortize the\n", - "fixed overhead, so the workers help.\n" + "The automatic choice changes only *how* a sweep is computed, never the result. We confirm\n", + "that by running the same 384-point sweep both serially and with `num_cpus=\"auto\"`, and\n", + "comparing the spectra:" ] }, { @@ -169,27 +138,33 @@ "metadata": {}, "outputs": [], "source": [ - "for n in [c for c in (1, 2, 4) if c <= cores]:\n", - " print(f\"num_cpus={n}: {time_run(384, n):.3f} s (median)\")\n" + "flux_vals = np.linspace(0.0, 0.5, 384)\n", + "\n", + "serial = scq.ParameterSweep(\n", + " hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n", + " update_hilbertspace=update, evals_count=20, num_cpus=1,\n", + ")\n", + "auto = scq.ParameterSweep(\n", + " hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n", + " update_hilbertspace=update, evals_count=20, num_cpus=\"auto\",\n", + ")\n", + "print(\"spectra identical:\", np.allclose(serial[\"evals\"][:], auto[\"evals\"][:]))" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "**What to expect:** now the workers pay off. On the reference Mac mini (384 points, BLAS\ncapped to 1):\n\n| num_cpus | wall time | speedup |\n|---:|---:|---:|\n| 1 | 2.14 s | 1.00× |\n| 2 | 1.50 s | 1.42× |\n| 4 | 0.90 s | **2.37×** |\n\nNothing changed except the **number of grid points** (16 → 384). That is the lever:\nparallelism helps once *grid size × per-point cost* clears the overhead. The crossover\ndepends on the per-point cost too — a heavier system (larger `truncated_dim`, or a large\ncomposite Hilbert space) reaches it at fewer points; a cheaper one needs more." - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "## The BLAS-thread footgun — *why* we capped first\n\nEach eigensolve runs on a multithreaded BLAS backend that by default uses *all* cores.\nRun `num_cpus` workers and each launches a full BLAS pool, so you oversubscribe the cores\nby a factor of `num_cpus`. On small matrices this is wasteful; on **large dense** matrices\nit is *catastrophic*. Measured on the reference Mac mini — 5 capacitively coupled fluxonia,\ndressed dim 3125, dense diagonalization, 16-point sweep:\n\n| configuration | wall time |\n|---|---:|\n| `num_cpus=1` | 42 s |\n| `num_cpus=4`, BLAS **uncapped** | **3608 s** (≈ 90× *slower*) |\n| `num_cpus=4`, BLAS capped to 1 | 40 s |\n| `num_cpus=8`, BLAS capped to 1 | 28 s |\n\n40 threads fighting over 10 cores on dense LAPACK collapses performance — the cap is the\ndifference between working and broken. (Skip Step 0 and rerun Demo 2 with the cap off and\nyou will reproduce a milder version: `num_cpus = 2`/`4` become *much* slower than `1`.)\nThat is also the usual reason a `num_cpus` comparison looks 'inconclusive' or backwards:\nthe cap was never in effect.\n\n**Caveat — `1` is not universally optimal.** On small matrices a single BLAS thread is\nbest; on large dense matrices each eigensolve benefits from several threads, so the\nfastest setting balances the knobs: roughly **`num_cpus × BLAS-threads ≈ cores`**. The\noptimum depends on machine, BLAS library, and matrix size — measure it." - }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The thread cap above was set with environment variables (the reliable way, since BLAS\n", - "reads them at import). scqubits also exposes the relevant knobs as **settings** you can\n", - "change at any time in a session:\n" + "## Tune to your machine (optional, run once)\n", + "\n", + "The recommendation above uses conservative built-in thresholds. For a recommendation tuned to\n", + "*your* hardware, run the one-time calibration. It times a short battery of sweeps in isolated\n", + "subprocesses and measures this machine's per-task dispatch overhead, one-time pool-startup\n", + "cost, and per-point diagonalization cost (dense and sparse). It takes about a minute and\n", + "writes `~/.scqubits/parallel_calibration.json`.\n", + "\n", + "Because it launches its measurements as `python -m` subprocesses, the call needs no\n", + "`if __name__ == \"__main__\":` guard, in Jupyter or in a plain script." ] }, { @@ -197,32 +172,17 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": "# Global default number of worker processes (used when `num_cpus` is not passed):\nscq.settings.NUM_CPUS = 1\n\n# Cap BLAS threads per worker process during parallel sweeps. This is applied only while\n# the worker pool is created. Spawn-based workers (macOS, Windows) re-read the env vars at\n# import; fork-based workers (Linux) need `threadpoolctl`. Keep num_cpus x BLAS-threads ~ cores.\nscq.settings.MULTIPROC_BLAS_THREADS = 1" - }, - { - "cell_type": "markdown", - "metadata": {}, "source": [ - "## For large composite systems, try sparse diagonalization first\n", - "\n", - "Before reaching for `num_cpus`, note that for large coupled systems the dominant cost is\n", - "the *per-point diagonalization*, usually a bigger lever than parallelism. Recent scqubits\n", - "automatically uses **sparse** diagonalization (`scipy.eigsh`) for the default method when\n", - "only a few eigenstates of a large Hilbert space are requested\n", - "(`scqubits.settings.AUTO_SPARSE_DIAG`). For the 5-fluxonia system above (dressed dim\n", - "3125) this alone is ~16× faster per point than dense — far more than parallelism buys\n", - "there. And once sparse makes each point cheap, `num_cpus > 1` helps even less. **Try\n", - "sparse first; parallelize second.**\n" + "scq.calibrate_parallelization()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The setting is `scqubits.settings.AUTO_SPARSE_DIAG` (on by default). Here it is on a\n", - "moderately large system you can actually run — **4 capacitively coupled fluxonia**,\n", - "dressed dimension 6⁴ = 1296 — timing the same 8-point sweep with sparse vs dense\n", - "diagonalization:\n" + "Once the calibration exists, `recommend_parallelization` (and `num_cpus=\"auto\"`) use the\n", + "measured break-even — parallelizing only once the grid is large enough to repay the measured\n", + "pool-startup cost. Re-running the recommendation now reflects your machine:" ] }, { @@ -231,80 +191,107 @@ "metadata": {}, "outputs": [], "source": [ - "import warnings\n", - "\n", - "def make_fluxonia_sweep(n_points=8):\n", - " fluxonia = [\n", - " scq.Fluxonium(EJ=4.0, EC=1.0, EL=1.0, flux=0.5, cutoff=110, truncated_dim=6, id_str=f\"flx{i}\")\n", - " for i in range(4)\n", - " ]\n", - " hs = scq.HilbertSpace(fluxonia)\n", - " for i in range(3):\n", - " hs.add_interaction(g_strength=0.1, op1=fluxonia[i].n_operator, op2=fluxonia[i + 1].n_operator)\n", - "\n", - " def update(flux):\n", - " fluxonia[0].flux = flux\n", - "\n", - " return scq.ParameterSweep(\n", - " hilbertspace=hs, paramvals_by_name={\"flux\": np.linspace(0.45, 0.55, n_points)},\n", - " update_hilbertspace=update, evals_count=20, num_cpus=1, autorun=True,\n", + "for n_points in (16, 384):\n", + " cfg = scq.recommend_parallelization(\n", + " hilbertspace=hs, num_points=n_points, evals_count=20\n", " )\n", + " print(f\"{n_points:>4} points -> num_cpus={cfg.num_cpus} ({cfg.reason})\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What scqubits is balancing for you\n", "\n", + "There is no single right answer because two effects pull in opposite directions.\n", "\n", - "def time_fluxonia():\n", - " make_fluxonia_sweep() # warm-up\n", - " start = time.perf_counter()\n", - " make_fluxonia_sweep()\n", - " return time.perf_counter() - start\n", + "**1. The grid break-even.** Sending a grid point to a worker costs a fixed amount (pickling,\n", + "inter-process hand-off), and starting the pool costs a one-time amount (about a second when\n", + "workers are *spawned*, as on macOS and Windows). Parallelism pays off only when\n", "\n", + "> (number of grid points) × (cost per point) ≫ that fixed overhead.\n", "\n", - "scq.settings.AUTO_SPARSE_DIAG = True\n", - "t_sparse = time_fluxonia()\n", + "Few points, or cheap points, stay faster serially — which is why the heuristic keeps small\n", + "sweeps on a single process.\n", "\n", - "# The dense path exercises scipy's matrix-function code, which emits spurious 'matmul'\n", - "# RuntimeWarnings on some BLAS backends (e.g. Apple Accelerate); silence them for the demo.\n", - "with warnings.catch_warnings():\n", - " warnings.simplefilter(\"ignore\")\n", - " scq.settings.AUTO_SPARSE_DIAG = False\n", - " t_dense = time_fluxonia()\n", - "scq.settings.AUTO_SPARSE_DIAG = True # restore the default\n", + "**2. The BLAS oversubscription cliff.** Every eigensolve already runs on a multithreaded BLAS\n", + "backend. If several workers each use all cores, the cores are oversubscribed by a factor of\n", + "`num_cpus`, which on large dense matrices is not a small slowdown but a collapse. Measured on\n", + "a 10-core Mac mini — five capacitively coupled fluxonia (dressed dimension 3125, dense),\n", + "16-point sweep:\n", "\n", - "print(f\"sparse (default): {t_sparse:.2f} s dense: {t_dense:.2f} s speedup: {t_dense / t_sparse:.1f}x\")\n" + "| configuration | wall time |\n", + "|---|---:|\n", + "| `num_cpus=1` | 42 s |\n", + "| `num_cpus=4`, BLAS uncapped | **3608 s** (~90x slower) |\n", + "| `num_cpus=4`, BLAS capped to 1 | 40 s |\n", + "| `num_cpus=8`, BLAS capped to 1 | 28 s |\n", + "\n", + "This is why the recommendation always pairs a worker count with a BLAS-thread cap, keeping\n", + "`num_cpus x BLAS-threads` near the core count." ] }, { "cell_type": "markdown", "metadata": {}, - "source": "On the reference Mac mini this prints roughly `sparse 0.55 s dense 3.94 s speedup 7.2x`.\nThe advantage grows with system size, and for very large Hilbert spaces dense may not fit\nin memory at all. (The dense run may also emit a few scipy-internal `matmul` warnings —\nanother reason to prefer the sparse default.)" + "source": [ + "## Manual control\n", + "\n", + "The same knobs are available directly if you would rather set them yourself:\n", + "\n", + "```python\n", + "scq.settings.NUM_CPUS = 4 # default worker count when num_cpus is unset\n", + "scq.settings.MULTIPROC_BLAS_THREADS = 1 # per-worker BLAS-thread cap during a sweep\n", + "```\n", + "\n", + "Rule of thumb: keep `num_cpus x BLAS-threads` near the number of physical cores.\n", + "`MULTIPROC_BLAS_THREADS` reaches spawn-based workers (macOS, Windows) through the thread-count\n", + "environment variables, and fork-based workers (Linux) through the optional `threadpoolctl`\n", + "package.\n", + "\n", + "For large composite systems the per-point **diagonalization method** is often a bigger lever\n", + "than parallelism: scqubits uses sparse diagonalization by default for large spectra (see\n", + "`scq.settings.AUTO_SPARSE_DIAG`), which can be far faster per point. Once each point is cheap,\n", + "parallelism helps even less — so try sparse first, and parallelize second." + ] }, { "cell_type": "markdown", "metadata": {}, - "source": "## Letting scqubits pick the settings\n\nYou don't have to find the break-even by hand. `scqubits.recommend_parallelization` reads\nthe workload — Hilbert-space dimension, number of grid points, eigenvalue count, and\nwhether sparse diagonalization applies — and returns a recommended `num_cpus` and\nBLAS-thread cap. It is a *pure* function (it starts no workers), so it is safe to call\nanywhere. For the three-transmon system above (dressed dimension 6³ = 216), at the two grid\nsizes from Demos 1 and 2:" - }, - { - "cell_type": "code", - "source": "for n_points in (16, 384):\n cfg = scq.recommend_parallelization(dimension=6**3, num_points=n_points, evals_count=20)\n print(f\"{n_points:>4} points -> num_cpus={cfg.num_cpus}, blas_threads={cfg.blas_threads}\")\n print(f\" {cfg.reason}\")\n", - "metadata": {}, - "execution_count": null, - "outputs": [] + "source": [ + "## Running as a script\n", + "\n", + "In Jupyter, everything above runs as shown. In a plain Python script on macOS or Windows,\n", + "workers are *spawned*, which re-imports your script in each worker — so the entry point that\n", + "triggers a parallel sweep must be guarded:\n", + "\n", + "```python\n", + "import scqubits as scq\n", + "\n", + "if __name__ == \"__main__\":\n", + " sweep = scq.ParameterSweep(..., num_cpus=\"auto\")\n", + "```\n", + "\n", + "Linux (which forks) and Jupyter need no guard. scqubits prints a one-time reminder the first\n", + "time it spawns workers outside of IPython." + ] }, { "cell_type": "markdown", - "source": "The recommendation reproduces what the demos showed: serial for the 16-point sweep,\nparallel for the 384-point one. Two conveniences build on it:\n\n- pass **`num_cpus=\"auto\"`** to a sweep and it tunes itself *before* it runs:\n `scq.ParameterSweep(..., num_cpus=\"auto\")`;\n- set **`scq.settings.AUTO_PARALLEL = True`** to apply this to every sweep that does not\n specify `num_cpus`.\n\nFor a recommendation tuned to *your* hardware (rather than built-in defaults), run the\none-time **`scq.calibrate_parallelization()`** — it measures this machine's per-task\noverhead, pool-startup cost, and per-point cost (writing\n`~/.scqubits/parallel_calibration.json`), after which the recommendation uses your measured\nbreak-even.\n\n### Summary\n\n- **Easiest:** let `recommend_parallelization` / `num_cpus=\"auto\"` choose; optionally run\n `calibrate_parallelization()` once for machine-tuned advice.\n- Parallelism helps only when *grid size × cost-per-point* greatly exceeds the per-task\n overhead; on small sweeps it does nothing, or slows you down — expected, not a bug\n (Demo 1 vs Demo 2).\n- **Cap BLAS threads** to avoid oversubscription (the 90× example); keep\n `num_cpus × BLAS-threads ≈ cores`. Capping before import is the blunt manual route;\n `MULTIPROC_BLAS_THREADS` and the heuristic handle it per sweep.\n- For large composite Hilbert spaces, **sparse diagonalization is usually a bigger lever\n than multiprocessing** — try it first.\n- **Every number here is machine-specific** — measure (or calibrate) on your own hardware.", - "metadata": {} - }, - { - "cell_type": "code", - "execution_count": null, "metadata": {}, - "outputs": [], "source": [ - "# The settings touched in this notebook, with their current values:\n", - "print(\"NUM_CPUS =\", scq.settings.NUM_CPUS)\n", - "print(\"MULTIPROC_BLAS_THREADS =\", scq.settings.MULTIPROC_BLAS_THREADS)\n", - "print(\"AUTO_SPARSE_DIAG =\", scq.settings.AUTO_SPARSE_DIAG)\n", - "print(\"MULTIPROC =\", scq.settings.MULTIPROC, \"(parallelization backend)\")\n" + "## Summary\n", + "\n", + "- **Let scqubits choose.** `scq.recommend_parallelization(...)` recommends `num_cpus` and a\n", + " BLAS-thread cap from the workload; `num_cpus=\"auto\"` applies it per sweep, and\n", + " `scq.settings.AUTO_PARALLEL = True` applies it everywhere.\n", + "- **Calibrate once** with `scq.calibrate_parallelization()` for advice measured on your own\n", + " hardware.\n", + "- Parallelism helps only when the grid is large enough to repay the fixed overhead, and the\n", + " BLAS-thread cap is what keeps many workers from oversubscribing the cores.\n", + "- All of this takes effect live, without restarting the kernel, in Jupyter and in scripts\n", + " alike." ] } ], @@ -320,4 +307,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} \ No newline at end of file +} From 5fa0767fbcf3e3fe46bcbb639bf8a1dfe1eb567a Mon Sep 17 00:00:00 2001 From: Jens Koch <19193849+jkochNU@users.noreply.github.com> Date: Fri, 5 Jun 2026 23:18:50 -0500 Subject: [PATCH 6/8] docs(demo): note MULTIPROC_BLAS_THREADS defaults to "auto" The manual-control section now explains that the BLAS-thread cap is automatic by default ("auto" = cores // num_cpus); setting an int overrides it and None opts out. --- examples/demo_multiprocessing.ipynb | 23 ++--------------------- 1 file changed, 2 insertions(+), 21 deletions(-) diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb index 8dfbdaf..f25cc31 100644 --- a/examples/demo_multiprocessing.ipynb +++ b/examples/demo_multiprocessing.ipynb @@ -235,26 +235,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "## Manual control\n", - "\n", - "The same knobs are available directly if you would rather set them yourself:\n", - "\n", - "```python\n", - "scq.settings.NUM_CPUS = 4 # default worker count when num_cpus is unset\n", - "scq.settings.MULTIPROC_BLAS_THREADS = 1 # per-worker BLAS-thread cap during a sweep\n", - "```\n", - "\n", - "Rule of thumb: keep `num_cpus x BLAS-threads` near the number of physical cores.\n", - "`MULTIPROC_BLAS_THREADS` reaches spawn-based workers (macOS, Windows) through the thread-count\n", - "environment variables, and fork-based workers (Linux) through the optional `threadpoolctl`\n", - "package.\n", - "\n", - "For large composite systems the per-point **diagonalization method** is often a bigger lever\n", - "than parallelism: scqubits uses sparse diagonalization by default for large spectra (see\n", - "`scq.settings.AUTO_SPARSE_DIAG`), which can be far faster per point. Once each point is cheap,\n", - "parallelism helps even less — so try sparse first, and parallelize second." - ] + "source": "## Manual control\n\nThe same knobs are available directly if you would rather set them yourself:\n\n```python\nscq.settings.NUM_CPUS = 4 # default worker count when num_cpus is unset\nscq.settings.MULTIPROC_BLAS_THREADS = 1 # per-worker BLAS-thread cap during a sweep\n```\n\n`MULTIPROC_BLAS_THREADS` already defaults to `\"auto\"`, which caps each worker to\n`cores // num_cpus` so parallel sweeps never oversubscribe the cores — setting an integer\njust overrides that with a fixed cap (use `None` to opt out entirely). Rule of thumb: keep\n`num_cpus x BLAS-threads` near the number of physical cores. The cap reaches spawn-based\nworkers (macOS, Windows) through the thread-count environment variables, and fork-based\nworkers (Linux) through `threadpoolctl`.\n\nFor large composite systems the per-point **diagonalization method** is often a bigger lever\nthan parallelism: scqubits uses sparse diagonalization by default for large spectra (see\n`scq.settings.AUTO_SPARSE_DIAG`), which can be far faster per point. Once each point is cheap,\nparallelism helps even less — so try sparse first, and parallelize second." }, { "cell_type": "markdown", @@ -307,4 +288,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file From 04a415f0373112d8a1dfaa875e784463454d433c Mon Sep 17 00:00:00 2001 From: Jens Koch <19193849+jkochNU@users.noreply.github.com> Date: Mon, 8 Jun 2026 18:19:33 -0500 Subject: [PATCH 7/8] docs(demo): reassure existing users, clarify auto vs AUTO_PARALLEL, recalibration --- examples/demo_multiprocessing.ipynb | 29 +++++++++++++++++++---------- 1 file changed, 19 insertions(+), 10 deletions(-) diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb index f25cc31..a8cbd3f 100644 --- a/examples/demo_multiprocessing.ipynb +++ b/examples/demo_multiprocessing.ipynb @@ -20,7 +20,9 @@ "\n", "So scqubits can choose for you. This notebook shows the easy path —\n", "`recommend_parallelization` and `num_cpus=\"auto\"` — and a one-time per-machine calibration,\n", - "and then explains what is being balanced under the hood." + "and then explains what is being balanced under the hood.\n", + "\n", + "> **Updating from an older scqubits?** Nothing here is required: your existing code runs unchanged and gives identical results (the default is still serial, `num_cpus=1`). And if you already pass `num_cpus > 1`, you now get automatic protection against the oversubscription slowdown for free. Everything below is opt-in — for *more* speed with *less* guesswork." ] }, { @@ -120,7 +122,15 @@ "```\n", "\n", "To make *every* sweep that does not specify `num_cpus` tune itself this way, set\n", - "`scq.settings.AUTO_PARALLEL = True`." + "`scq.settings.AUTO_PARALLEL = True`.\n", + "\n", + "These are the same auto-tuner with different reach — `num_cpus=\"auto\"` opts in for one sweep, while `AUTO_PARALLEL = True` makes it the default for every sweep where you don't pass `num_cpus`. An explicit number always wins:\n", + "\n", + "```text\n", + "num_cpus=4 -> exactly 4 workers (you decide)\n", + "num_cpus=\"auto\" -> auto-tuner decides (always)\n", + "num_cpus omitted -> auto-tuner if AUTO_PARALLEL=True, else serial (the default)\n", + "```" ] }, { @@ -157,14 +167,13 @@ "source": [ "## Tune to your machine (optional, run once)\n", "\n", - "The recommendation above uses conservative built-in thresholds. For a recommendation tuned to\n", - "*your* hardware, run the one-time calibration. It times a short battery of sweeps in isolated\n", - "subprocesses and measures this machine's per-task dispatch overhead, one-time pool-startup\n", - "cost, and per-point diagonalization cost (dense and sparse). It takes about a minute and\n", - "writes `~/.scqubits/parallel_calibration.json`.\n", + "The recommendation above uses conservative built-in thresholds that work everywhere. For choices tuned to *your* hardware, run the one-time calibration. It times a short battery of sweeps in isolated subprocesses — this machine's per-task dispatch overhead, the one-time pool-startup cost, and per-point diagonalization cost (dense and sparse) — and writes `~/.scqubits/parallel_calibration.json` (about a minute).\n", + "\n", + "The calibration is **just data**: it does nothing on its own. From then on, every `recommend_parallelization` / `num_cpus=\"auto\"` call reads that file and makes a sharper, machine-specific choice instead of using the generic defaults.\n", + "\n", + "**Re-running overwrites the file, so recalibrate freely** — and *do* recalibrate if a previous run was taken under bad conditions: while the machine was busy, or on a laptop that was on battery / CPU-throttled (many laptops clock down hard when unplugged, which makes the calibration over-estimate every cost so `\"auto\"` then under-parallelizes). For the most representative numbers, calibrate on an otherwise-idle machine plugged into wall power.\n", "\n", - "Because it launches its measurements as `python -m` subprocesses, the call needs no\n", - "`if __name__ == \"__main__\":` guard, in Jupyter or in a plain script." + "Because it launches its measurements as `python -m` subprocesses, the call needs no `if __name__ == \"__main__\":` guard, in Jupyter or in a plain script." ] }, { @@ -288,4 +297,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} \ No newline at end of file +} From 7878728781880d6df866e545d03c02ffacdcd9a1 Mon Sep 17 00:00:00 2001 From: Jens Koch <19193849+jkochNU@users.noreply.github.com> Date: Mon, 8 Jun 2026 18:36:15 -0500 Subject: [PATCH 8/8] docs(demo): drop update-from-older-scqubits aside; keep current-state intro --- examples/demo_multiprocessing.ipynb | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb index a8cbd3f..0cbbca2 100644 --- a/examples/demo_multiprocessing.ipynb +++ b/examples/demo_multiprocessing.ipynb @@ -20,9 +20,7 @@ "\n", "So scqubits can choose for you. This notebook shows the easy path —\n", "`recommend_parallelization` and `num_cpus=\"auto\"` — and a one-time per-machine calibration,\n", - "and then explains what is being balanced under the hood.\n", - "\n", - "> **Updating from an older scqubits?** Nothing here is required: your existing code runs unchanged and gives identical results (the default is still serial, `num_cpus=1`). And if you already pass `num_cpus > 1`, you now get automatic protection against the oversubscription slowdown for free. Everything below is opt-in — for *more* speed with *less* guesswork." + "and then explains what is being balanced under the hood." ] }, {