From 6f6121a5ed04c93a9fe2865e1fd681bd9c4a8308 Mon Sep 17 00:00:00 2001
From: Jens Koch <jens-koch@northwestern.edu>
Date: Sun, 31 May 2026 17:39:32 -0500
Subject: [PATCH 1/8] examples: add demo_multiprocessing notebook

Demonstrates efficient parameter sweeps: the interaction between num_cpus and
BLAS/LAPACK threads, capping BLAS threads before import, and timing num_cpus on
the user's own machine. Cells are left unexecuted; timings are machine-specific.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 examples/demo_multiprocessing.ipynb | 220 ++++++++++++++++++++++++++++
 1 file changed, 220 insertions(+)
 create mode 100644 examples/demo_multiprocessing.ipynb

diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb
new file mode 100644
index 0000000..947db0b
--- /dev/null
+++ b/examples/demo_multiprocessing.ipynb
@@ -0,0 +1,220 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# scqubits example: efficient parameter sweeps (`num_cpus` and BLAS threads)\n",
+    "\n",
+    "J. Koch and P. Groszkowski\n",
+    "\n",
+    "For further documentation of scqubits see https://scqubits.readthedocs.io/en/latest/.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "`ParameterSweep` can compute a sweep in parallel across worker processes via its\n",
+    "`num_cpus` argument. How much that helps \u2014 and whether it helps at all \u2014 depends on\n",
+    "**two interacting knobs**:\n",
+    "\n",
+    "1. **`num_cpus`** \u2014 how many worker processes share the grid points.\n",
+    "2. **BLAS/LAPACK threads** \u2014 every eigensolve (via `numpy`/`scipy`) runs on a\n",
+    "   multithreaded linear-algebra backend (OpenBLAS or MKL) that by default uses *all*\n",
+    "   cores.\n",
+    "\n",
+    "Running `num_cpus` workers that each launch a full BLAS thread pool oversubscribes the\n",
+    "cores; and on the small matrices typical of qubit truncations, a heavily threaded BLAS\n",
+    "can be *slower* than a single thread. The fastest setting is a balance \u2014 roughly\n",
+    "`num_cpus \u00d7 BLAS-threads \u2248 number of cores` \u2014 and it depends on your machine, BLAS\n",
+    "library, and problem size.\n",
+    "\n",
+    "**This notebook shows how to measure it on your own system.** The timing cells are left\n",
+    "unexecuted on purpose: run them yourself \u2014 your numbers will differ."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Set the BLAS thread count *before* importing scqubits\n",
+    "\n",
+    "The BLAS backend reads its thread count once, at import time. So this must be the\n",
+    "**first** cell you run in a fresh kernel \u2014 before `numpy` or `scqubits` is imported. (If\n",
+    "you have already imported scqubits in this kernel, restart it first.)\n",
+    "\n",
+    "Capping to a small number \u2014 often `1` \u2014 is a good starting point for qubit sweeps, and\n",
+    "is also what prevents oversubscription once `num_cpus > 1`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "# Must run before numpy / scqubits are imported (restart the kernel otherwise).\n",
+    "for _var in (\"OPENBLAS_NUM_THREADS\", \"MKL_NUM_THREADS\", \"OMP_NUM_THREADS\"):\n",
+    "    os.environ[_var] = \"1\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "import numpy as np\n",
+    "\n",
+    "import scqubits as scq"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. A coupled example system\n",
+    "\n",
+    "Two tunable transmons coupled to a common resonator (the same system as\n",
+    "`demo_parametersweep`), swept over `flux` and `ng`. We wrap construction in a factory so\n",
+    "we can rebuild a fresh sweep for each timed run."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def make_sweep(num_cpus, flux_points=11):\n",
+    "    tmon1 = scq.TunableTransmon(\n",
+    "        EJmax=40.0, EC=0.2, d=0.1, flux=0.0, ng=0.3, ncut=40, truncated_dim=3\n",
+    "    )\n",
+    "    tmon2 = scq.TunableTransmon(\n",
+    "        EJmax=15.0, EC=0.15, d=0.2, flux=0.0, ng=0.0, ncut=30, truncated_dim=3\n",
+    "    )\n",
+    "    resonator = scq.Oscillator(E_osc=4.5, truncated_dim=4)\n",
+    "\n",
+    "    hilbertspace = scq.HilbertSpace([tmon1, tmon2, resonator])\n",
+    "    hilbertspace.add_interaction(\n",
+    "        g_strength=0.1, op1=tmon1.n_operator, op2=resonator.creation_operator, add_hc=True\n",
+    "    )\n",
+    "    hilbertspace.add_interaction(\n",
+    "        g_strength=0.2, op1=tmon2.n_operator, op2=resonator.creation_operator, add_hc=True\n",
+    "    )\n",
+    "\n",
+    "    flux_vals = np.linspace(0.0, 2.0, flux_points)\n",
+    "    ng_vals = np.linspace(-0.5, 0.5, 3)\n",
+    "\n",
+    "    def update_hilbertspace(flux, ng):\n",
+    "        tmon1.flux = flux\n",
+    "        tmon2.flux = 1.2 * flux\n",
+    "        tmon2.ng = ng\n",
+    "\n",
+    "    return scq.ParameterSweep(\n",
+    "        hilbertspace=hilbertspace,\n",
+    "        paramvals_by_name={\"flux\": flux_vals, \"ng\": ng_vals},\n",
+    "        update_hilbertspace=update_hilbertspace,\n",
+    "        subsys_update_info={\"flux\": [tmon1, tmon2], \"ng\": [tmon2]},\n",
+    "        evals_count=20,\n",
+    "        num_cpus=num_cpus,\n",
+    "        autorun=False,\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Time `ParameterSweep.run()` versus `num_cpus`\n",
+    "\n",
+    "Wall-clock timing is noisy, so we take the median of a few repeats and discard a warm-up\n",
+    "run (the first parallel run pays a one-time process-startup cost). **Results are\n",
+    "machine-specific** \u2014 what matters is the *shape* of the trend on your hardware, not any\n",
+    "single number."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def time_run(num_cpus, flux_points=11, repeats=3):\n",
+    "    make_sweep(num_cpus, flux_points).run()  # warm-up (discarded)\n",
+    "    times = []\n",
+    "    for _ in range(repeats):\n",
+    "        sweep = make_sweep(num_cpus, flux_points)\n",
+    "        start = time.perf_counter()\n",
+    "        sweep.run()\n",
+    "        times.append(time.perf_counter() - start)\n",
+    "    return float(np.median(times))\n",
+    "\n",
+    "\n",
+    "cores = os.cpu_count() or 1\n",
+    "for n in [c for c in (1, 2, 4) if c <= cores]:\n",
+    "    print(f\"num_cpus={n}:  {time_run(n):.3f} s (median)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. See the BLAS-threading effect\n",
+    "\n",
+    "To feel the impact of the thread cap, **restart the kernel**, change the cap in the first\n",
+    "cell from `\"1\"` to e.g. `str(os.cpu_count())` (let BLAS use every core), and re-run the\n",
+    "timing. On small-matrix sweeps you will typically find the capped version is faster even\n",
+    "at `num_cpus=1`, because an un-capped BLAS spends time coordinating threads on matrices\n",
+    "too small to benefit."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Larger sweeps and automated tuning\n",
+    "\n",
+    "The benefit of `num_cpus > 1` grows with the number of grid points \u2014 more work to\n",
+    "amortize the fixed startup cost. Compare a bigger grid to see parallelism pay off:\n",
+    "\n",
+    "```python\n",
+    "print(\"num_cpus=1:\", time_run(1, flux_points=64))\n",
+    "print(\"num_cpus=4:\", time_run(4, flux_points=64))\n",
+    "```\n",
+    "\n",
+    "Because the optimum depends on your machine, BLAS library, and problem, the most reliable\n",
+    "approach is to measure. If you work from the scqubits **source tree**, the script\n",
+    "`tools/autotune_multiprocessing.py` automates this: it searches the\n",
+    "`num_cpus \u00d7 BLAS-threads` frontier and reports the fastest configuration for your system."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Summary\n",
+    "\n",
+    "- Two knobs interact: `num_cpus` (worker processes) and BLAS/LAPACK threads.\n",
+    "- Set BLAS threads **before** importing scqubits; capping low (often `1`) is a good\n",
+    "  default for qubit sweeps and prevents oversubscription when `num_cpus > 1`.\n",
+    "- Keep `num_cpus \u00d7 BLAS-threads \u2248 cores`; raise `num_cpus` only for large grids.\n",
+    "- Measure on your own hardware \u2014 timings are machine-specific."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file

From bdb860bf2efefc2a9417cfd81af961509514ca26 Mon Sep 17 00:00:00 2001
From: Jens Koch <19193849+jkochNU@users.noreply.github.com>
Date: Mon, 1 Jun 2026 17:42:53 -0500
Subject: [PATCH 2/8] examples: rewrite multiprocessing demo to set correct
 expectations

The previous version led the reader to expect a num_cpus speedup, but its default
example (small system, small grid) is in the regime where multiprocessing is
net-negative -- so the first hands-on result showed num_cpus=2 slower than 1 with
no explanation, and the "it helps for larger grids" caveat came afterwards.

Restructure around expectation-setting and two measured regimes:
- Open with the mental model: parallelism helps only when grid-size x per-point
  cost greatly exceeds the per-task overhead; on small/cheap sweeps it does
  nothing or slows you down -- that is normal, keep num_cpus=1.
- Cap BLAS threads FIRST (before imports), with a threadpoolctl verification cell
  (the #1 cause of confusing/backwards timings is the cap not taking effect).
- Demo 1: a cheap system (dim 216, ~5 ms/pt) where num_cpus does not help.
- Demo 2: the same system at larger truncated_dim (dim 512, ~40 ms/pt) where
  num_cpus gives ~2.8x -- the only change is per-point cost.
- Explain the BLAS oversubscription footgun with the measured ~90x cliff, note
  num_cpus x BLAS-threads ~ cores, and point to sparse diagonalization as the
  bigger lever for large composite systems.

All illustrative numbers are real measurements (10-core laptop); timing cells are
left runnable but unexecuted (run-it-yourself). Code paths smoke-tested.
---
 examples/demo_multiprocessing.ipynb | 305 +++++++++++++++++++---------
 1 file changed, 205 insertions(+), 100 deletions(-)

diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb
index 947db0b..335260f 100644
--- a/examples/demo_multiprocessing.ipynb
+++ b/examples/demo_multiprocessing.ipynb
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# scqubits example: efficient parameter sweeps (`num_cpus` and BLAS threads)\n",
+    "# scqubits example: when does parallelizing a sweep help? (`num_cpus`, BLAS threads)\n",
     "\n",
     "J. Koch and P. Groszkowski\n",
     "\n",
@@ -12,37 +12,40 @@
     "\n",
     "---\n",
     "\n",
-    "`ParameterSweep` can compute a sweep in parallel across worker processes via its\n",
-    "`num_cpus` argument. How much that helps \u2014 and whether it helps at all \u2014 depends on\n",
-    "**two interacting knobs**:\n",
+    "`ParameterSweep` can compute a sweep across worker processes via its `num_cpus`\n",
+    "argument. **Parallelism is not free, and very often it does *not* help — sometimes it\n",
+    "makes a sweep slower.** Whether it pays off is governed by one simple balance:\n",
     "\n",
-    "1. **`num_cpus`** \u2014 how many worker processes share the grid points.\n",
-    "2. **BLAS/LAPACK threads** \u2014 every eigensolve (via `numpy`/`scipy`) runs on a\n",
-    "   multithreaded linear-algebra backend (OpenBLAS or MKL) that by default uses *all*\n",
-    "   cores.\n",
+    "> parallelism helps only when&nbsp;&nbsp;**(number of grid points) × (cost per point)&nbsp;&nbsp;≫&nbsp;&nbsp;the fixed overhead**\n",
+    "> of starting worker processes and pickling each task to them.\n",
     "\n",
-    "Running `num_cpus` workers that each launch a full BLAS thread pool oversubscribes the\n",
-    "cores; and on the small matrices typical of qubit truncations, a heavily threaded BLAS\n",
-    "can be *slower* than a single thread. The fastest setting is a balance \u2014 roughly\n",
-    "`num_cpus \u00d7 BLAS-threads \u2248 number of cores` \u2014 and it depends on your machine, BLAS\n",
-    "library, and problem size.\n",
+    "So the behaviour you should *expect*:\n",
     "\n",
-    "**This notebook shows how to measure it on your own system.** The timing cells are left\n",
-    "unexecuted on purpose: run them yourself \u2014 your numbers will differ."
+    "- **Small grids, or cheap-per-point systems** (small Hilbert spaces): `num_cpus > 1`\n",
+    "  gives little or no speedup, and frequently a **slowdown**. This is normal — keep the\n",
+    "  default `num_cpus = 1`.\n",
+    "- **Large grids of expensive points** (big composite Hilbert spaces): this is where\n",
+    "  workers pay off.\n",
+    "\n",
+    "A second knob interacts with the first: every eigensolve runs on a multithreaded\n",
+    "**BLAS/LAPACK** backend. If you run several workers and let each use all cores, you\n",
+    "oversubscribe — which on large dense matrices is not a small slowdown but a\n",
+    "**catastrophe** (an example below is ~90× slower than serial). So we cap BLAS threads\n",
+    "**first**, before importing anything. Timings are machine-specific — run the cells\n",
+    "yourself; the illustrative numbers are from a 10-core laptop and yours will differ.\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 1. Set the BLAS thread count *before* importing scqubits\n",
-    "\n",
-    "The BLAS backend reads its thread count once, at import time. So this must be the\n",
-    "**first** cell you run in a fresh kernel \u2014 before `numpy` or `scqubits` is imported. (If\n",
-    "you have already imported scqubits in this kernel, restart it first.)\n",
+    "## Step 0 (do this first): cap BLAS threads *before* importing scqubits\n",
     "\n",
-    "Capping to a small number \u2014 often `1` \u2014 is a good starting point for qubit sweeps, and\n",
-    "is also what prevents oversubscription once `num_cpus > 1`."
+    "The BLAS backend reads its thread count **once, at import time**. So this must be the\n",
+    "**first cell you run in a fresh kernel** — before `numpy` or `scqubits` is imported. (If\n",
+    "scqubits is already imported in this kernel, restart it and run this first.) Capping to\n",
+    "`1` is a good starting point for qubit sweeps and is what prevents oversubscription once\n",
+    "`num_cpus > 1`. We explain *why this matters so much* in the 'BLAS footgun' section below.\n"
    ]
   },
   {
@@ -53,9 +56,55 @@
    "source": [
     "import os\n",
     "\n",
-    "# Must run before numpy / scqubits are imported (restart the kernel otherwise).\n",
-    "for _var in (\"OPENBLAS_NUM_THREADS\", \"MKL_NUM_THREADS\", \"OMP_NUM_THREADS\"):\n",
-    "    os.environ[_var] = \"1\""
+    "# MUST run before numpy / scqubits are imported (restart the kernel otherwise).\n",
+    "for _var in (\"OPENBLAS_NUM_THREADS\", \"MKL_NUM_THREADS\", \"OMP_NUM_THREADS\", \"NUMEXPR_NUM_THREADS\"):\n",
+    "    os.environ[_var] = \"1\"\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Confirm the cap took effect\n",
+    "\n",
+    "The single most common cause of confusing sweep timings is that the cap **did not\n",
+    "apply** — because `numpy`/`scqubits` were already imported when the env vars were set.\n",
+    "Check it (needs the small `threadpoolctl` package, `pip install threadpoolctl`); each\n",
+    "reported thread count should be `1`:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import scqubits as scq  # imports scipy, whose BLAS the eigensolvers use\n",
+    "\n",
+    "try:\n",
+    "    from threadpoolctl import threadpool_info\n",
+    "\n",
+    "    pools = threadpool_info()\n",
+    "    for pool in pools:\n",
+    "        print(f\"{pool['internal_api']:>10}: {pool['num_threads']} thread(s)\")\n",
+    "    if not pools:\n",
+    "        print(\"(no controllable BLAS pool detected)\")\n",
+    "    print(\"\\nEach count should be 1. If not, restart the kernel and run the cap cell\")\n",
+    "    print(\"FIRST. (On Apple Silicon, numpy itself uses Accelerate, which ignores these\")\n",
+    "    print(\" variables; scipy's bundled OpenBLAS -- shown here -- is what scqubits uses.)\")\n",
+    "except ImportError:\n",
+    "    print(\"threadpoolctl not installed (pip install threadpoolctl) -- skipping check.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## A small timing helper\n",
+    "\n",
+    "Wall-clock timing is noisy, so we take the median of a few repeats and discard a warm-up\n",
+    "run (the first parallel run pays a one-time process-startup cost).\n"
    ]
   },
   {
@@ -66,20 +115,34 @@
    "source": [
     "import time\n",
     "\n",
-    "import numpy as np\n",
     "\n",
-    "import scqubits as scq"
+    "\n",
+    "def time_run(make_sweep, num_cpus, repeats=3):\n",
+    "    \"\"\"Median wall time of one full sweep, after a discarded warm-up run.\n",
+    "\n",
+    "    `make_sweep(num_cpus)` builds *and runs* a fresh ParameterSweep (autorun=True).\n",
+    "    \"\"\"\n",
+    "    make_sweep(num_cpus)  # warm-up (discarded: pays one-time process startup)\n",
+    "    times = []\n",
+    "    for _ in range(repeats):\n",
+    "        start = time.perf_counter()\n",
+    "        make_sweep(num_cpus)\n",
+    "        times.append(time.perf_counter() - start)\n",
+    "    return float(np.median(times))\n",
+    "\n",
+    "\n",
+    "cores = os.cpu_count() or 1\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 2. A coupled example system\n",
+    "## Demo 1 — when parallelism does *not* help (the common case)\n",
     "\n",
-    "Two tunable transmons coupled to a common resonator (the same system as\n",
-    "`demo_parametersweep`), swept over `flux` and `ng`. We wrap construction in a factory so\n",
-    "we can rebuild a fresh sweep for each timed run."
+    "Three coupled tunable transmons, swept over the flux of the first. With\n",
+    "`truncated_dim = 6` the dressed Hilbert space has dimension 6³ = 216 and each point costs\n",
+    "only a few milliseconds — far below the per-task overhead.\n"
    ]
   },
   {
@@ -88,52 +151,59 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "def make_sweep(num_cpus, flux_points=11):\n",
-    "    tmon1 = scq.TunableTransmon(\n",
-    "        EJmax=40.0, EC=0.2, d=0.1, flux=0.0, ng=0.3, ncut=40, truncated_dim=3\n",
-    "    )\n",
-    "    tmon2 = scq.TunableTransmon(\n",
-    "        EJmax=15.0, EC=0.15, d=0.2, flux=0.0, ng=0.0, ncut=30, truncated_dim=3\n",
-    "    )\n",
-    "    resonator = scq.Oscillator(E_osc=4.5, truncated_dim=4)\n",
+    "def make_sweep_light(num_cpus, n_points=96, truncated_dim=6):\n",
+    "    qubits = [\n",
+    "        scq.TunableTransmon(\n",
+    "            EJmax=30.0, EC=0.2, d=0.1, flux=0.0, ng=0.0, ncut=50,\n",
+    "            truncated_dim=truncated_dim, id_str=f\"tmon{i}\",\n",
+    "        )\n",
+    "        for i in range(3)\n",
+    "    ]\n",
+    "    hs = scq.HilbertSpace(qubits)\n",
+    "    for i in range(2):\n",
+    "        hs.add_interaction(g_strength=0.1, op1=qubits[i].n_operator, op2=qubits[i + 1].n_operator)\n",
+    "    flux_vals = np.linspace(0.0, 0.5, n_points)\n",
     "\n",
-    "    hilbertspace = scq.HilbertSpace([tmon1, tmon2, resonator])\n",
-    "    hilbertspace.add_interaction(\n",
-    "        g_strength=0.1, op1=tmon1.n_operator, op2=resonator.creation_operator, add_hc=True\n",
-    "    )\n",
-    "    hilbertspace.add_interaction(\n",
-    "        g_strength=0.2, op1=tmon2.n_operator, op2=resonator.creation_operator, add_hc=True\n",
+    "    def update(flux):\n",
+    "        qubits[0].flux = flux\n",
+    "\n",
+    "    return scq.ParameterSweep(\n",
+    "        hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n",
+    "        update_hilbertspace=update, evals_count=20, num_cpus=num_cpus, autorun=True,\n",
     "    )\n",
     "\n",
-    "    flux_vals = np.linspace(0.0, 2.0, flux_points)\n",
-    "    ng_vals = np.linspace(-0.5, 0.5, 3)\n",
     "\n",
-    "    def update_hilbertspace(flux, ng):\n",
-    "        tmon1.flux = flux\n",
-    "        tmon2.flux = 1.2 * flux\n",
-    "        tmon2.ng = ng\n",
+    "for n in [c for c in (1, 2, 4) if c <= cores]:\n",
+    "    print(f\"num_cpus={n}:  {time_run(make_sweep_light, n):.3f} s (median)\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**What to expect:** roughly *flat* timings — `num_cpus = 4` is no faster than `1`, and\n",
+    "`num_cpus = 2` may even be slower. On the reference 10-core laptop (dim 216, 96 points,\n",
+    "BLAS capped to 1):\n",
     "\n",
-    "    return scq.ParameterSweep(\n",
-    "        hilbertspace=hilbertspace,\n",
-    "        paramvals_by_name={\"flux\": flux_vals, \"ng\": ng_vals},\n",
-    "        update_hilbertspace=update_hilbertspace,\n",
-    "        subsys_update_info={\"flux\": [tmon1, tmon2], \"ng\": [tmon2]},\n",
-    "        evals_count=20,\n",
-    "        num_cpus=num_cpus,\n",
-    "        autorun=False,\n",
-    "    )"
+    "| num_cpus | wall time | speedup |\n",
+    "|---:|---:|---:|\n",
+    "| 1 | 0.51 s | 1.00× |\n",
+    "| 4 | 0.51 s | 1.00× |\n",
+    "\n",
+    "This is the **normal** outcome for a modest sweep: at ~5 ms/point the cost of pickling\n",
+    "each task to a worker is comparable to the work itself, so spreading the points across\n",
+    "processes buys nothing. **This is not a bug — it is the expected regime, and\n",
+    "`num_cpus = 1` (the default) is the right choice here.**\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 3. Time `ParameterSweep.run()` versus `num_cpus`\n",
+    "## Demo 2 — when parallelism *does* help\n",
     "\n",
-    "Wall-clock timing is noisy, so we take the median of a few repeats and discard a warm-up\n",
-    "run (the first parallel run pays a one-time process-startup cost). **Results are\n",
-    "machine-specific** \u2014 what matters is the *shape* of the trend on your hardware, not any\n",
-    "single number."
+    "The *only* change: `truncated_dim = 8`, so the dressed dimension becomes 8³ = 512 and\n",
+    "each eigensolve now costs ~40 ms — large enough to amortize the per-task overhead.\n"
    ]
   },
   {
@@ -142,66 +212,101 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "def time_run(num_cpus, flux_points=11, repeats=3):\n",
-    "    make_sweep(num_cpus, flux_points).run()  # warm-up (discarded)\n",
-    "    times = []\n",
-    "    for _ in range(repeats):\n",
-    "        sweep = make_sweep(num_cpus, flux_points)\n",
-    "        start = time.perf_counter()\n",
-    "        sweep.run()\n",
-    "        times.append(time.perf_counter() - start)\n",
-    "    return float(np.median(times))\n",
+    "def make_sweep_heavy(num_cpus, n_points=96):\n",
+    "    return make_sweep_light(num_cpus, n_points=n_points, truncated_dim=8)\n",
     "\n",
     "\n",
-    "cores = os.cpu_count() or 1\n",
     "for n in [c for c in (1, 2, 4) if c <= cores]:\n",
-    "    print(f\"num_cpus={n}:  {time_run(n):.3f} s (median)\")"
+    "    print(f\"num_cpus={n}:  {time_run(make_sweep_heavy, n):.3f} s (median)\")\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 4. See the BLAS-threading effect\n",
+    "**What to expect:** now the workers pay off. On the reference laptop (dim 512, 96\n",
+    "points, BLAS capped to 1):\n",
+    "\n",
+    "| num_cpus | wall time | speedup |\n",
+    "|---:|---:|---:|\n",
+    "| 1 | 3.62 s | 1.00× |\n",
+    "| 4 | 1.30 s | **2.8×** |\n",
     "\n",
-    "To feel the impact of the thread cap, **restart the kernel**, change the cap in the first\n",
-    "cell from `\"1\"` to e.g. `str(os.cpu_count())` (let BLAS use every core), and re-run the\n",
-    "timing. On small-matrix sweeps you will typically find the capped version is faster even\n",
-    "at `num_cpus=1`, because an un-capped BLAS spends time coordinating threads on matrices\n",
-    "too small to benefit."
+    "Nothing changed except the **cost per point** (~5 ms → ~40 ms, via `truncated_dim`).\n",
+    "That is the whole story: parallelism helps once *grid size × per-point cost* clears the\n",
+    "overhead. If your sweep looks like Demo 1, raising `num_cpus` will not help — make the\n",
+    "grid much larger, or accept that the sweep is simply fast enough serially.\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 5. Larger sweeps and automated tuning\n",
+    "## The BLAS-thread footgun — *why* we capped first\n",
     "\n",
-    "The benefit of `num_cpus > 1` grows with the number of grid points \u2014 more work to\n",
-    "amortize the fixed startup cost. Compare a bigger grid to see parallelism pay off:\n",
+    "Each eigensolve runs on a multithreaded BLAS backend that by default uses *all* cores.\n",
+    "Run `num_cpus` workers and each launches a full BLAS pool, so you oversubscribe the cores\n",
+    "by a factor of `num_cpus`. On small matrices this is wasteful; on **large dense**\n",
+    "matrices it is *catastrophic*. Measured on the reference laptop — 5 capacitively coupled\n",
+    "fluxonia, dressed dim 3125, dense diagonalization, 16-point sweep:\n",
     "\n",
-    "```python\n",
-    "print(\"num_cpus=1:\", time_run(1, flux_points=64))\n",
-    "print(\"num_cpus=4:\", time_run(4, flux_points=64))\n",
-    "```\n",
+    "| configuration | wall time |\n",
+    "|---|---:|\n",
+    "| `num_cpus=1` | 42 s |\n",
+    "| `num_cpus=4`, BLAS **uncapped** | **3608 s**  (≈ 90× *slower*) |\n",
+    "| `num_cpus=4`, BLAS capped to 1 | 40 s |\n",
+    "| `num_cpus=8`, BLAS capped to 1 | 28 s |\n",
     "\n",
-    "Because the optimum depends on your machine, BLAS library, and problem, the most reliable\n",
-    "approach is to measure. If you work from the scqubits **source tree**, the script\n",
-    "`tools/autotune_multiprocessing.py` automates this: it searches the\n",
-    "`num_cpus \u00d7 BLAS-threads` frontier and reports the fastest configuration for your system."
+    "40 threads fighting over 10 cores on dense LAPACK collapses performance — the cap is the\n",
+    "difference between working and broken. (If you skip Step 0 and rerun Demo 2 with the cap\n",
+    "off, you will reproduce a milder version of this: `num_cpus = 2`/`4` become *much* slower\n",
+    "than `1`.) That is also the usual reason a `num_cpus` comparison looks 'inconclusive' or\n",
+    "backwards: the cap was never in effect.\n",
+    "\n",
+    "**Caveat — `1` is not universally optimal.** On small matrices a single BLAS thread is\n",
+    "best; on large dense matrices each eigensolve benefits from several threads, so the\n",
+    "fastest setting balances the knobs: roughly **`num_cpus × BLAS-threads ≈ cores`**. The\n",
+    "optimum depends on machine, BLAS library, and matrix size — measure it.\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Summary\n",
+    "## For large composite systems, try sparse diagonalization first\n",
+    "\n",
+    "Before reaching for `num_cpus`, note that for large coupled systems the dominant cost is\n",
+    "the *per-point diagonalization*, and that is usually a bigger lever than parallelism.\n",
+    "Recent scqubits automatically uses **sparse** diagonalization (`scipy.eigsh`) for the\n",
+    "default method when only a few eigenstates of a large Hilbert space are requested\n",
+    "(`scqubits.settings.AUTO_SPARSE_DIAG`). For the 5-fluxonia system above (dressed dim\n",
+    "3125) this alone is ~16× faster per point than dense — far more than the ~1.5× that\n",
+    "multiprocessing buys there. And once sparse makes each point cheap, `num_cpus > 1` helps\n",
+    "even less. **Try sparse first; parallelize second.**\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Automated tuning and summary\n",
+    "\n",
+    "Because the optimum depends on your machine, BLAS library, and problem, the most\n",
+    "reliable approach is to measure. From the scqubits **source tree**, the script\n",
+    "`tools/autotune_multiprocessing.py` searches the `num_cpus × BLAS-threads` frontier and\n",
+    "reports the fastest configuration for your system.\n",
+    "\n",
+    "### Summary\n",
     "\n",
-    "- Two knobs interact: `num_cpus` (worker processes) and BLAS/LAPACK threads.\n",
-    "- Set BLAS threads **before** importing scqubits; capping low (often `1`) is a good\n",
-    "  default for qubit sweeps and prevents oversubscription when `num_cpus > 1`.\n",
-    "- Keep `num_cpus \u00d7 BLAS-threads \u2248 cores`; raise `num_cpus` only for large grids.\n",
-    "- Measure on your own hardware \u2014 timings are machine-specific."
+    "- **Default `num_cpus = 1`.** Parallelism helps only when *grid size × cost-per-point*\n",
+    "  greatly exceeds the per-task overhead; on small or cheap sweeps it does nothing, or\n",
+    "  slows you down — expected, not a bug (Demo 1 vs Demo 2).\n",
+    "- **Cap BLAS threads before importing scqubits**, and verify it took effect.\n",
+    "  Oversubscription is catastrophic on large dense matrices (the 90× example). Keep\n",
+    "  `num_cpus × BLAS-threads ≈ cores`.\n",
+    "- For large composite Hilbert spaces, **sparse diagonalization is usually a bigger lever\n",
+    "  than multiprocessing** — try it first.\n",
+    "- **Measure on your own hardware** — every number here is machine-specific.\n"
    ]
   }
  ],
@@ -217,4 +322,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 4
-}
\ No newline at end of file
+}

From d6dd13422a9c9c8c5db41e2bc539fd32de9bb2f2 Mon Sep 17 00:00:00 2001
From: Jens Koch <19193849+jkochNU@users.noreply.github.com>
Date: Mon, 1 Jun 2026 18:39:15 -0500
Subject: [PATCH 3/8] examples: re-anchor MP demo on grid size; refresh
 numbers; warning-free

The multiprocessing follow-up (per-task IPC reduced from O(N^2) to O(N)) lowered
per-task overhead enough that the previous "light" system (dim 216) now benefits
from num_cpus even at 96 points, so the old truncated_dim contrast no longer
demonstrated "parallelism does not help." Re-anchor the two demos on GRID SIZE,
the lever users actually vary, measured on the integration of all current perf
branches:
- Demo 1: same system, 16 points -> num_cpus 1/2/4 are flat (0.10 s, ~1x/0.9x),
  the normal "too few points to amortize" regime.
- Demo 2: same system, 384 points -> 2.14 s -> 0.90 s at 4 cores (2.37x).
Single make_sweep(num_cpus, n_points) factory; refreshed illustrative tables;
note the crossover also shifts with per-point cost.

Verified against the combined branches: cap-first ordering keeps BLAS at 1
thread, and the matrix-element matmul fix means the coupled-transmon sweep now
runs with zero spurious matmul warnings.
---
 examples/demo_multiprocessing.ipynb | 185 +++++++++++++---------------
 1 file changed, 87 insertions(+), 98 deletions(-)

diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb
index 335260f..1b255f9 100644
--- a/examples/demo_multiprocessing.ipynb
+++ b/examples/demo_multiprocessing.ipynb
@@ -13,26 +13,24 @@
     "---\n",
     "\n",
     "`ParameterSweep` can compute a sweep across worker processes via its `num_cpus`\n",
-    "argument. **Parallelism is not free, and very often it does *not* help — sometimes it\n",
-    "makes a sweep slower.** Whether it pays off is governed by one simple balance:\n",
+    "argument. **Parallelism is not free, and on small sweeps it does *not* help — sometimes\n",
+    "it makes them slower.** Whether it pays off is governed by one simple balance:\n",
     "\n",
     "> parallelism helps only when&nbsp;&nbsp;**(number of grid points) × (cost per point)&nbsp;&nbsp;≫&nbsp;&nbsp;the fixed overhead**\n",
-    "> of starting worker processes and pickling each task to them.\n",
+    "> of starting worker processes and shipping each task to them.\n",
     "\n",
     "So the behaviour you should *expect*:\n",
     "\n",
-    "- **Small grids, or cheap-per-point systems** (small Hilbert spaces): `num_cpus > 1`\n",
-    "  gives little or no speedup, and frequently a **slowdown**. This is normal — keep the\n",
-    "  default `num_cpus = 1`.\n",
-    "- **Large grids of expensive points** (big composite Hilbert spaces): this is where\n",
-    "  workers pay off.\n",
+    "- **Few grid points:** `num_cpus > 1` gives no speedup, and can be *slower* than serial.\n",
+    "  This is normal — keep the default `num_cpus = 1`.\n",
+    "- **Many grid points** (and/or an expensive-per-point system): workers pay off.\n",
     "\n",
     "A second knob interacts with the first: every eigensolve runs on a multithreaded\n",
     "**BLAS/LAPACK** backend. If you run several workers and let each use all cores, you\n",
     "oversubscribe — which on large dense matrices is not a small slowdown but a\n",
-    "**catastrophe** (an example below is ~90× slower than serial). So we cap BLAS threads\n",
-    "**first**, before importing anything. Timings are machine-specific — run the cells\n",
-    "yourself; the illustrative numbers are from a 10-core laptop and yours will differ.\n"
+    "**catastrophe** (~90× in the example below). So we cap BLAS threads **first**, before\n",
+    "importing anything. Timings are machine-specific — run the cells yourself; the\n",
+    "illustrative numbers are from a 10-core laptop and yours will differ.\n"
    ]
   },
   {
@@ -41,11 +39,11 @@
    "source": [
     "## Step 0 (do this first): cap BLAS threads *before* importing scqubits\n",
     "\n",
-    "The BLAS backend reads its thread count **once, at import time**. So this must be the\n",
+    "The BLAS backend reads its thread count **once, at import time**, so this must be the\n",
     "**first cell you run in a fresh kernel** — before `numpy` or `scqubits` is imported. (If\n",
-    "scqubits is already imported in this kernel, restart it and run this first.) Capping to\n",
-    "`1` is a good starting point for qubit sweeps and is what prevents oversubscription once\n",
-    "`num_cpus > 1`. We explain *why this matters so much* in the 'BLAS footgun' section below.\n"
+    "scqubits is already imported, restart the kernel and run this first.) Capping to `1` is a\n",
+    "good starting point for qubit sweeps and prevents oversubscription once `num_cpus > 1`;\n",
+    "the 'BLAS footgun' section below explains why this matters so much.\n"
    ]
   },
   {
@@ -67,10 +65,9 @@
    "source": [
     "### Confirm the cap took effect\n",
     "\n",
-    "The single most common cause of confusing sweep timings is that the cap **did not\n",
-    "apply** — because `numpy`/`scqubits` were already imported when the env vars were set.\n",
-    "Check it (needs the small `threadpoolctl` package, `pip install threadpoolctl`); each\n",
-    "reported thread count should be `1`:\n"
+    "The single most common cause of confusing sweep timings is that the cap **did not apply**\n",
+    "— because `numpy`/`scqubits` were already imported when the env vars were set. Check it\n",
+    "(needs `threadpoolctl`, `pip install threadpoolctl`); each reported count should be `1`:\n"
    ]
   },
   {
@@ -101,10 +98,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## A small timing helper\n",
+    "## A small timing helper and an example system\n",
     "\n",
-    "Wall-clock timing is noisy, so we take the median of a few repeats and discard a warm-up\n",
-    "run (the first parallel run pays a one-time process-startup cost).\n"
+    "Three coupled tunable transmons, swept over the flux of the first. We rebuild a fresh\n",
+    "sweep for each timed run, and we will sweep **the same system at two grid sizes** to see\n",
+    "both regimes. Wall-clock timing is noisy, so we take the median of a few repeats and\n",
+    "discard a warm-up run (the first parallel run pays a one-time process-startup cost).\n"
    ]
   },
   {
@@ -116,17 +115,35 @@
     "import time\n",
     "\n",
     "\n",
+    "def make_sweep(num_cpus, n_points, truncated_dim=6):\n",
+    "    qubits = [\n",
+    "        scq.TunableTransmon(\n",
+    "            EJmax=30.0, EC=0.2, d=0.1, flux=0.0, ng=0.0, ncut=50,\n",
+    "            truncated_dim=truncated_dim, id_str=f\"tmon{i}\",\n",
+    "        )\n",
+    "        for i in range(3)\n",
+    "    ]\n",
+    "    hs = scq.HilbertSpace(qubits)\n",
+    "    for i in range(2):\n",
+    "        hs.add_interaction(g_strength=0.1, op1=qubits[i].n_operator, op2=qubits[i + 1].n_operator)\n",
+    "    flux_vals = np.linspace(0.0, 0.5, n_points)\n",
     "\n",
-    "def time_run(make_sweep, num_cpus, repeats=3):\n",
-    "    \"\"\"Median wall time of one full sweep, after a discarded warm-up run.\n",
+    "    def update(flux):\n",
+    "        qubits[0].flux = flux\n",
     "\n",
-    "    `make_sweep(num_cpus)` builds *and runs* a fresh ParameterSweep (autorun=True).\n",
-    "    \"\"\"\n",
-    "    make_sweep(num_cpus)  # warm-up (discarded: pays one-time process startup)\n",
+    "    return scq.ParameterSweep(\n",
+    "        hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n",
+    "        update_hilbertspace=update, evals_count=20, num_cpus=num_cpus, autorun=True,\n",
+    "    )\n",
+    "\n",
+    "\n",
+    "def time_run(n_points, num_cpus, repeats=3):\n",
+    "    \"\"\"Median wall time of make_sweep(num_cpus, n_points), after a discarded warm-up.\"\"\"\n",
+    "    make_sweep(num_cpus, n_points)  # warm-up (discarded)\n",
     "    times = []\n",
     "    for _ in range(repeats):\n",
     "        start = time.perf_counter()\n",
-    "        make_sweep(num_cpus)\n",
+    "        make_sweep(num_cpus, n_points)\n",
     "        times.append(time.perf_counter() - start)\n",
     "    return float(np.median(times))\n",
     "\n",
@@ -138,11 +155,10 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Demo 1 — when parallelism does *not* help (the common case)\n",
+    "## Demo 1 — a small sweep: parallelism does *not* help\n",
     "\n",
-    "Three coupled tunable transmons, swept over the flux of the first. With\n",
-    "`truncated_dim = 6` the dressed Hilbert space has dimension 6³ = 216 and each point costs\n",
-    "only a few milliseconds — far below the per-task overhead.\n"
+    "Only **16 grid points**. There are too few points to amortize the cost of starting\n",
+    "workers and shipping tasks, so spreading them across processes buys nothing.\n"
    ]
   },
   {
@@ -151,59 +167,35 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "def make_sweep_light(num_cpus, n_points=96, truncated_dim=6):\n",
-    "    qubits = [\n",
-    "        scq.TunableTransmon(\n",
-    "            EJmax=30.0, EC=0.2, d=0.1, flux=0.0, ng=0.0, ncut=50,\n",
-    "            truncated_dim=truncated_dim, id_str=f\"tmon{i}\",\n",
-    "        )\n",
-    "        for i in range(3)\n",
-    "    ]\n",
-    "    hs = scq.HilbertSpace(qubits)\n",
-    "    for i in range(2):\n",
-    "        hs.add_interaction(g_strength=0.1, op1=qubits[i].n_operator, op2=qubits[i + 1].n_operator)\n",
-    "    flux_vals = np.linspace(0.0, 0.5, n_points)\n",
-    "\n",
-    "    def update(flux):\n",
-    "        qubits[0].flux = flux\n",
-    "\n",
-    "    return scq.ParameterSweep(\n",
-    "        hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n",
-    "        update_hilbertspace=update, evals_count=20, num_cpus=num_cpus, autorun=True,\n",
-    "    )\n",
-    "\n",
-    "\n",
     "for n in [c for c in (1, 2, 4) if c <= cores]:\n",
-    "    print(f\"num_cpus={n}:  {time_run(make_sweep_light, n):.3f} s (median)\")\n"
+    "    print(f\"num_cpus={n}:  {time_run(16, n):.3f} s (median)\")\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**What to expect:** roughly *flat* timings — `num_cpus = 4` is no faster than `1`, and\n",
-    "`num_cpus = 2` may even be slower. On the reference 10-core laptop (dim 216, 96 points,\n",
-    "BLAS capped to 1):\n",
+    "**What to expect:** roughly *flat*, and `num_cpus = 2`/`4` may be slightly **slower** than\n",
+    "`1`. On the reference 10-core laptop (16 points, BLAS capped to 1):\n",
     "\n",
     "| num_cpus | wall time | speedup |\n",
     "|---:|---:|---:|\n",
-    "| 1 | 0.51 s | 1.00× |\n",
-    "| 4 | 0.51 s | 1.00× |\n",
+    "| 1 | 0.10 s | 1.00× |\n",
+    "| 2 | 0.10 s | 0.93× |\n",
+    "| 4 | 0.11 s | 0.84× |\n",
     "\n",
-    "This is the **normal** outcome for a modest sweep: at ~5 ms/point the cost of pickling\n",
-    "each task to a worker is comparable to the work itself, so spreading the points across\n",
-    "processes buys nothing. **This is not a bug — it is the expected regime, and\n",
-    "`num_cpus = 1` (the default) is the right choice here.**\n"
+    "This is the **normal** outcome for a quick sweep, and **not a bug** — with only 16 points\n",
+    "the per-task overhead dominates, so `num_cpus = 1` (the default) is the right choice.\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Demo 2 — when parallelism *does* help\n",
+    "## Demo 2 — a large sweep: parallelism pays off\n",
     "\n",
-    "The *only* change: `truncated_dim = 8`, so the dressed dimension becomes 8³ = 512 and\n",
-    "each eigensolve now costs ~40 ms — large enough to amortize the per-task overhead.\n"
+    "The *same system*, now with **384 grid points**. There is enough work to amortize the\n",
+    "fixed overhead, so the workers help.\n"
    ]
   },
   {
@@ -212,30 +204,27 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "def make_sweep_heavy(num_cpus, n_points=96):\n",
-    "    return make_sweep_light(num_cpus, n_points=n_points, truncated_dim=8)\n",
-    "\n",
-    "\n",
     "for n in [c for c in (1, 2, 4) if c <= cores]:\n",
-    "    print(f\"num_cpus={n}:  {time_run(make_sweep_heavy, n):.3f} s (median)\")\n"
+    "    print(f\"num_cpus={n}:  {time_run(384, n):.3f} s (median)\")\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**What to expect:** now the workers pay off. On the reference laptop (dim 512, 96\n",
-    "points, BLAS capped to 1):\n",
+    "**What to expect:** now the workers pay off. On the reference laptop (384 points, BLAS\n",
+    "capped to 1):\n",
     "\n",
     "| num_cpus | wall time | speedup |\n",
     "|---:|---:|---:|\n",
-    "| 1 | 3.62 s | 1.00× |\n",
-    "| 4 | 1.30 s | **2.8×** |\n",
-    "\n",
-    "Nothing changed except the **cost per point** (~5 ms → ~40 ms, via `truncated_dim`).\n",
-    "That is the whole story: parallelism helps once *grid size × per-point cost* clears the\n",
-    "overhead. If your sweep looks like Demo 1, raising `num_cpus` will not help — make the\n",
-    "grid much larger, or accept that the sweep is simply fast enough serially.\n"
+    "| 1 | 2.14 s | 1.00× |\n",
+    "| 2 | 1.50 s | 1.42× |\n",
+    "| 4 | 0.90 s | **2.37×** |\n",
+    "\n",
+    "Nothing changed except the **number of grid points** (16 → 384). That is the lever:\n",
+    "parallelism helps once *grid size × per-point cost* clears the overhead. The crossover\n",
+    "depends on the per-point cost too — a heavier system (larger `truncated_dim`, or a large\n",
+    "composite Hilbert space) reaches it at fewer points; a cheaper one needs more.\n"
    ]
   },
   {
@@ -246,9 +235,9 @@
     "\n",
     "Each eigensolve runs on a multithreaded BLAS backend that by default uses *all* cores.\n",
     "Run `num_cpus` workers and each launches a full BLAS pool, so you oversubscribe the cores\n",
-    "by a factor of `num_cpus`. On small matrices this is wasteful; on **large dense**\n",
-    "matrices it is *catastrophic*. Measured on the reference laptop — 5 capacitively coupled\n",
-    "fluxonia, dressed dim 3125, dense diagonalization, 16-point sweep:\n",
+    "by a factor of `num_cpus`. On small matrices this is wasteful; on **large dense** matrices\n",
+    "it is *catastrophic*. Measured on the reference laptop — 5 capacitively coupled fluxonia,\n",
+    "dressed dim 3125, dense diagonalization, 16-point sweep:\n",
     "\n",
     "| configuration | wall time |\n",
     "|---|---:|\n",
@@ -258,10 +247,10 @@
     "| `num_cpus=8`, BLAS capped to 1 | 28 s |\n",
     "\n",
     "40 threads fighting over 10 cores on dense LAPACK collapses performance — the cap is the\n",
-    "difference between working and broken. (If you skip Step 0 and rerun Demo 2 with the cap\n",
-    "off, you will reproduce a milder version of this: `num_cpus = 2`/`4` become *much* slower\n",
-    "than `1`.) That is also the usual reason a `num_cpus` comparison looks 'inconclusive' or\n",
-    "backwards: the cap was never in effect.\n",
+    "difference between working and broken. (Skip Step 0 and rerun Demo 2 with the cap off and\n",
+    "you will reproduce a milder version: `num_cpus = 2`/`4` become *much* slower than `1`.)\n",
+    "That is also the usual reason a `num_cpus` comparison looks 'inconclusive' or backwards:\n",
+    "the cap was never in effect.\n",
     "\n",
     "**Caveat — `1` is not universally optimal.** On small matrices a single BLAS thread is\n",
     "best; on large dense matrices each eigensolve benefits from several threads, so the\n",
@@ -276,13 +265,13 @@
     "## For large composite systems, try sparse diagonalization first\n",
     "\n",
     "Before reaching for `num_cpus`, note that for large coupled systems the dominant cost is\n",
-    "the *per-point diagonalization*, and that is usually a bigger lever than parallelism.\n",
-    "Recent scqubits automatically uses **sparse** diagonalization (`scipy.eigsh`) for the\n",
-    "default method when only a few eigenstates of a large Hilbert space are requested\n",
+    "the *per-point diagonalization*, usually a bigger lever than parallelism. Recent scqubits\n",
+    "automatically uses **sparse** diagonalization (`scipy.eigsh`) for the default method when\n",
+    "only a few eigenstates of a large Hilbert space are requested\n",
     "(`scqubits.settings.AUTO_SPARSE_DIAG`). For the 5-fluxonia system above (dressed dim\n",
-    "3125) this alone is ~16× faster per point than dense — far more than the ~1.5× that\n",
-    "multiprocessing buys there. And once sparse makes each point cheap, `num_cpus > 1` helps\n",
-    "even less. **Try sparse first; parallelize second.**\n"
+    "3125) this alone is ~16× faster per point than dense — far more than parallelism buys\n",
+    "there. And once sparse makes each point cheap, `num_cpus > 1` helps even less. **Try\n",
+    "sparse first; parallelize second.**\n"
    ]
   },
   {
@@ -291,16 +280,16 @@
    "source": [
     "## Automated tuning and summary\n",
     "\n",
-    "Because the optimum depends on your machine, BLAS library, and problem, the most\n",
-    "reliable approach is to measure. From the scqubits **source tree**, the script\n",
+    "Because the optimum depends on your machine, BLAS library, and problem, the most reliable\n",
+    "approach is to measure. From the scqubits **source tree**, the script\n",
     "`tools/autotune_multiprocessing.py` searches the `num_cpus × BLAS-threads` frontier and\n",
     "reports the fastest configuration for your system.\n",
     "\n",
     "### Summary\n",
     "\n",
     "- **Default `num_cpus = 1`.** Parallelism helps only when *grid size × cost-per-point*\n",
-    "  greatly exceeds the per-task overhead; on small or cheap sweeps it does nothing, or\n",
-    "  slows you down — expected, not a bug (Demo 1 vs Demo 2).\n",
+    "  greatly exceeds the per-task overhead; on small sweeps it does nothing, or slows you\n",
+    "  down — expected, not a bug (Demo 1 vs Demo 2).\n",
     "- **Cap BLAS threads before importing scqubits**, and verify it took effect.\n",
     "  Oversubscription is catastrophic on large dense matrices (the 90× example). Keep\n",
     "  `num_cpus × BLAS-threads ≈ cores`.\n",

From 4c18b6caf6efd3739155111e3a079349821cb730 Mon Sep 17 00:00:00 2001
From: Jens Koch <19193849+jkochNU@users.noreply.github.com>
Date: Tue, 2 Jun 2026 20:58:45 -0500
Subject: [PATCH 4/8] examples: demo the shipped parallelization heuristic +
 calibration

Add a "Letting scqubits pick the settings" section showing
recommend_parallelization (with a runnable cell over the 16- and 384-point
grids), num_cpus="auto", AUTO_PARALLEL, and calibrate_parallelization; replace the
dead tools/autotune_multiprocessing.py reference. Also fold in the earlier demo
fixes (restored runnable second half, Mac mini timing labels, the __main__-guard
note for plain scripts, and the corrected fork/spawn BLAS-cap wording).
---
 examples/demo_multiprocessing.ipynb | 205 +++++++++++++++-------------
 1 file changed, 107 insertions(+), 98 deletions(-)

diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb
index 1b255f9..02d334f 100644
--- a/examples/demo_multiprocessing.ipynb
+++ b/examples/demo_multiprocessing.ipynb
@@ -3,35 +3,7 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "# scqubits example: when does parallelizing a sweep help? (`num_cpus`, BLAS threads)\n",
-    "\n",
-    "J. Koch and P. Groszkowski\n",
-    "\n",
-    "For further documentation of scqubits see https://scqubits.readthedocs.io/en/latest/.\n",
-    "\n",
-    "---\n",
-    "\n",
-    "`ParameterSweep` can compute a sweep across worker processes via its `num_cpus`\n",
-    "argument. **Parallelism is not free, and on small sweeps it does *not* help — sometimes\n",
-    "it makes them slower.** Whether it pays off is governed by one simple balance:\n",
-    "\n",
-    "> parallelism helps only when&nbsp;&nbsp;**(number of grid points) × (cost per point)&nbsp;&nbsp;≫&nbsp;&nbsp;the fixed overhead**\n",
-    "> of starting worker processes and shipping each task to them.\n",
-    "\n",
-    "So the behaviour you should *expect*:\n",
-    "\n",
-    "- **Few grid points:** `num_cpus > 1` gives no speedup, and can be *slower* than serial.\n",
-    "  This is normal — keep the default `num_cpus = 1`.\n",
-    "- **Many grid points** (and/or an expensive-per-point system): workers pay off.\n",
-    "\n",
-    "A second knob interacts with the first: every eigensolve runs on a multithreaded\n",
-    "**BLAS/LAPACK** backend. If you run several workers and let each use all cores, you\n",
-    "oversubscribe — which on large dense matrices is not a small slowdown but a\n",
-    "**catastrophe** (~90× in the example below). So we cap BLAS threads **first**, before\n",
-    "importing anything. Timings are machine-specific — run the cells yourself; the\n",
-    "illustrative numbers are from a 10-core laptop and yours will differ.\n"
-   ]
+   "source": "# scqubits example: when does parallelizing a sweep help? (`num_cpus`, BLAS threads)\n\nJ. Koch and P. Groszkowski\n\nFor further documentation of scqubits see https://scqubits.readthedocs.io/en/latest/.\n\n---\n\n`ParameterSweep` can compute a sweep across worker processes via its `num_cpus`\nargument. **Parallelism is not free, and on small sweeps it does *not* help — sometimes\nit makes them slower.** Whether it pays off is governed by one simple balance:\n\n> parallelism helps only when&nbsp;&nbsp;**(number of grid points) × (cost per point)&nbsp;&nbsp;≫&nbsp;&nbsp;the fixed overhead**\n> of starting worker processes and shipping each task to them.\n\nSo the behaviour you should *expect*:\n\n- **Few grid points:** `num_cpus > 1` gives no speedup, and can be *slower* than serial.\n  This is normal — keep the default `num_cpus = 1`.\n- **Many grid points** (and/or an expensive-per-point system): workers pay off.\n\nA second knob interacts with the first: every eigensolve runs on a multithreaded\n**BLAS/LAPACK** backend. If you run several workers and let each use all cores, you\noversubscribe — which on large dense matrices is not a small slowdown but a\n**catastrophe** (~90× in the example below). So we cap BLAS threads **first**, before\nimporting anything. Timings are machine-specific — run the cells yourself; the\nillustrative numbers are from a 10-core Mac mini and yours will differ."
   },
   {
    "cell_type": "markdown",
@@ -151,6 +123,11 @@
     "cores = os.cpu_count() or 1\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "source": "### Running this as a `.py` script? Guard the entry point\n\nThese cells run fine **in Jupyter**. But if you move parallel code (`num_cpus > 1`) into a\nplain Python script, note that macOS and Windows start workers with the `spawn` method,\nwhich **re-imports your script** in each worker — so the entry point must be guarded, or\nPython raises a `RuntimeError`:\n\n```python\nif __name__ == \"__main__\":\n    ...   # code that triggers num_cpus > 1\n```\n\nLinux (which uses `fork`) and Jupyter need no guard. scqubits also prints a one-time\nreminder the first time it spawns workers outside Jupyter.",
+   "metadata": {}
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -174,19 +151,7 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "**What to expect:** roughly *flat*, and `num_cpus = 2`/`4` may be slightly **slower** than\n",
-    "`1`. On the reference 10-core laptop (16 points, BLAS capped to 1):\n",
-    "\n",
-    "| num_cpus | wall time | speedup |\n",
-    "|---:|---:|---:|\n",
-    "| 1 | 0.10 s | 1.00× |\n",
-    "| 2 | 0.10 s | 0.93× |\n",
-    "| 4 | 0.11 s | 0.84× |\n",
-    "\n",
-    "This is the **normal** outcome for a quick sweep, and **not a bug** — with only 16 points\n",
-    "the per-task overhead dominates, so `num_cpus = 1` (the default) is the right choice.\n"
-   ]
+   "source": "**What to expect:** roughly *flat*, and `num_cpus = 2`/`4` may be slightly **slower** than\n`1`. On the reference 10-core Mac mini (16 points, BLAS capped to 1):\n\n| num_cpus | wall time | speedup |\n|---:|---:|---:|\n| 1 | 0.10 s | 1.00× |\n| 2 | 0.10 s | 0.93× |\n| 4 | 0.11 s | 0.84× |\n\nThis is the **normal** outcome for a quick sweep, and **not a bug** — with only 16 points\nthe per-task overhead dominates, so `num_cpus = 1` (the default) is the right choice."
   },
   {
    "cell_type": "markdown",
@@ -211,53 +176,29 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "**What to expect:** now the workers pay off. On the reference laptop (384 points, BLAS\n",
-    "capped to 1):\n",
-    "\n",
-    "| num_cpus | wall time | speedup |\n",
-    "|---:|---:|---:|\n",
-    "| 1 | 2.14 s | 1.00× |\n",
-    "| 2 | 1.50 s | 1.42× |\n",
-    "| 4 | 0.90 s | **2.37×** |\n",
-    "\n",
-    "Nothing changed except the **number of grid points** (16 → 384). That is the lever:\n",
-    "parallelism helps once *grid size × per-point cost* clears the overhead. The crossover\n",
-    "depends on the per-point cost too — a heavier system (larger `truncated_dim`, or a large\n",
-    "composite Hilbert space) reaches it at fewer points; a cheaper one needs more.\n"
-   ]
+   "source": "**What to expect:** now the workers pay off. On the reference Mac mini (384 points, BLAS\ncapped to 1):\n\n| num_cpus | wall time | speedup |\n|---:|---:|---:|\n| 1 | 2.14 s | 1.00× |\n| 2 | 1.50 s | 1.42× |\n| 4 | 0.90 s | **2.37×** |\n\nNothing changed except the **number of grid points** (16 → 384). That is the lever:\nparallelism helps once *grid size × per-point cost* clears the overhead. The crossover\ndepends on the per-point cost too — a heavier system (larger `truncated_dim`, or a large\ncomposite Hilbert space) reaches it at fewer points; a cheaper one needs more."
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## The BLAS-thread footgun — *why* we capped first\n\nEach eigensolve runs on a multithreaded BLAS backend that by default uses *all* cores.\nRun `num_cpus` workers and each launches a full BLAS pool, so you oversubscribe the cores\nby a factor of `num_cpus`. On small matrices this is wasteful; on **large dense** matrices\nit is *catastrophic*. Measured on the reference Mac mini — 5 capacitively coupled fluxonia,\ndressed dim 3125, dense diagonalization, 16-point sweep:\n\n| configuration | wall time |\n|---|---:|\n| `num_cpus=1` | 42 s |\n| `num_cpus=4`, BLAS **uncapped** | **3608 s**  (≈ 90× *slower*) |\n| `num_cpus=4`, BLAS capped to 1 | 40 s |\n| `num_cpus=8`, BLAS capped to 1 | 28 s |\n\n40 threads fighting over 10 cores on dense LAPACK collapses performance — the cap is the\ndifference between working and broken. (Skip Step 0 and rerun Demo 2 with the cap off and\nyou will reproduce a milder version: `num_cpus = 2`/`4` become *much* slower than `1`.)\nThat is also the usual reason a `num_cpus` comparison looks 'inconclusive' or backwards:\nthe cap was never in effect.\n\n**Caveat — `1` is not universally optimal.** On small matrices a single BLAS thread is\nbest; on large dense matrices each eigensolve benefits from several threads, so the\nfastest setting balances the knobs: roughly **`num_cpus × BLAS-threads ≈ cores`**. The\noptimum depends on machine, BLAS library, and matrix size — measure it."
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## The BLAS-thread footgun — *why* we capped first\n",
-    "\n",
-    "Each eigensolve runs on a multithreaded BLAS backend that by default uses *all* cores.\n",
-    "Run `num_cpus` workers and each launches a full BLAS pool, so you oversubscribe the cores\n",
-    "by a factor of `num_cpus`. On small matrices this is wasteful; on **large dense** matrices\n",
-    "it is *catastrophic*. Measured on the reference laptop — 5 capacitively coupled fluxonia,\n",
-    "dressed dim 3125, dense diagonalization, 16-point sweep:\n",
-    "\n",
-    "| configuration | wall time |\n",
-    "|---|---:|\n",
-    "| `num_cpus=1` | 42 s |\n",
-    "| `num_cpus=4`, BLAS **uncapped** | **3608 s**  (≈ 90× *slower*) |\n",
-    "| `num_cpus=4`, BLAS capped to 1 | 40 s |\n",
-    "| `num_cpus=8`, BLAS capped to 1 | 28 s |\n",
-    "\n",
-    "40 threads fighting over 10 cores on dense LAPACK collapses performance — the cap is the\n",
-    "difference between working and broken. (Skip Step 0 and rerun Demo 2 with the cap off and\n",
-    "you will reproduce a milder version: `num_cpus = 2`/`4` become *much* slower than `1`.)\n",
-    "That is also the usual reason a `num_cpus` comparison looks 'inconclusive' or backwards:\n",
-    "the cap was never in effect.\n",
-    "\n",
-    "**Caveat — `1` is not universally optimal.** On small matrices a single BLAS thread is\n",
-    "best; on large dense matrices each eigensolve benefits from several threads, so the\n",
-    "fastest setting balances the knobs: roughly **`num_cpus × BLAS-threads ≈ cores`**. The\n",
-    "optimum depends on machine, BLAS library, and matrix size — measure it.\n"
+    "The thread cap above was set with environment variables (the reliable way, since BLAS\n",
+    "reads them at import). scqubits also exposes the relevant knobs as **settings** you can\n",
+    "change at any time in a session:\n"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# Global default number of worker processes (used when `num_cpus` is not passed):\nscq.settings.NUM_CPUS = 1\n\n# Cap BLAS threads per worker process during parallel sweeps. This is applied only while\n# the worker pool is created. Spawn-based workers (macOS, Windows) re-read the env vars at\n# import; fork-based workers (Linux) need `threadpoolctl`. Keep num_cpus x BLAS-threads ~ cores.\nscq.settings.MULTIPROC_BLAS_THREADS = 1"
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -278,24 +219,92 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Automated tuning and summary\n",
+    "The setting is `scqubits.settings.AUTO_SPARSE_DIAG` (on by default). Here it is on a\n",
+    "moderately large system you can actually run — **4 capacitively coupled fluxonia**,\n",
+    "dressed dimension 6⁴ = 1296 — timing the same 8-point sweep with sparse vs dense\n",
+    "diagonalization:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import warnings\n",
+    "\n",
+    "def make_fluxonia_sweep(n_points=8):\n",
+    "    fluxonia = [\n",
+    "        scq.Fluxonium(EJ=4.0, EC=1.0, EL=1.0, flux=0.5, cutoff=110, truncated_dim=6, id_str=f\"flx{i}\")\n",
+    "        for i in range(4)\n",
+    "    ]\n",
+    "    hs = scq.HilbertSpace(fluxonia)\n",
+    "    for i in range(3):\n",
+    "        hs.add_interaction(g_strength=0.1, op1=fluxonia[i].n_operator, op2=fluxonia[i + 1].n_operator)\n",
+    "\n",
+    "    def update(flux):\n",
+    "        fluxonia[0].flux = flux\n",
     "\n",
-    "Because the optimum depends on your machine, BLAS library, and problem, the most reliable\n",
-    "approach is to measure. From the scqubits **source tree**, the script\n",
-    "`tools/autotune_multiprocessing.py` searches the `num_cpus × BLAS-threads` frontier and\n",
-    "reports the fastest configuration for your system.\n",
+    "    return scq.ParameterSweep(\n",
+    "        hilbertspace=hs, paramvals_by_name={\"flux\": np.linspace(0.45, 0.55, n_points)},\n",
+    "        update_hilbertspace=update, evals_count=20, num_cpus=1, autorun=True,\n",
+    "    )\n",
     "\n",
-    "### Summary\n",
     "\n",
-    "- **Default `num_cpus = 1`.** Parallelism helps only when *grid size × cost-per-point*\n",
-    "  greatly exceeds the per-task overhead; on small sweeps it does nothing, or slows you\n",
-    "  down — expected, not a bug (Demo 1 vs Demo 2).\n",
-    "- **Cap BLAS threads before importing scqubits**, and verify it took effect.\n",
-    "  Oversubscription is catastrophic on large dense matrices (the 90× example). Keep\n",
-    "  `num_cpus × BLAS-threads ≈ cores`.\n",
-    "- For large composite Hilbert spaces, **sparse diagonalization is usually a bigger lever\n",
-    "  than multiprocessing** — try it first.\n",
-    "- **Measure on your own hardware** — every number here is machine-specific.\n"
+    "def time_fluxonia():\n",
+    "    make_fluxonia_sweep()  # warm-up\n",
+    "    start = time.perf_counter()\n",
+    "    make_fluxonia_sweep()\n",
+    "    return time.perf_counter() - start\n",
+    "\n",
+    "\n",
+    "scq.settings.AUTO_SPARSE_DIAG = True\n",
+    "t_sparse = time_fluxonia()\n",
+    "\n",
+    "# The dense path exercises scipy's matrix-function code, which emits spurious 'matmul'\n",
+    "# RuntimeWarnings on some BLAS backends (e.g. Apple Accelerate); silence them for the demo.\n",
+    "with warnings.catch_warnings():\n",
+    "    warnings.simplefilter(\"ignore\")\n",
+    "    scq.settings.AUTO_SPARSE_DIAG = False\n",
+    "    t_dense = time_fluxonia()\n",
+    "scq.settings.AUTO_SPARSE_DIAG = True  # restore the default\n",
+    "\n",
+    "print(f\"sparse (default): {t_sparse:.2f} s    dense: {t_dense:.2f} s    speedup: {t_dense / t_sparse:.1f}x\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "On the reference Mac mini this prints roughly `sparse 0.55 s   dense 3.94 s   speedup 7.2x`.\nThe advantage grows with system size, and for very large Hilbert spaces dense may not fit\nin memory at all. (The dense run may also emit a few scipy-internal `matmul` warnings —\nanother reason to prefer the sparse default.)"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Letting scqubits pick the settings\n\nYou don't have to find the break-even by hand. `scqubits.recommend_parallelization` reads\nthe workload — Hilbert-space dimension, number of grid points, eigenvalue count, and\nwhether sparse diagonalization applies — and returns a recommended `num_cpus` and\nBLAS-thread cap. It is a *pure* function (it starts no workers), so it is safe to call\nanywhere. For the three-transmon system above (dressed dimension 6³ = 216), at the two grid\nsizes from Demos 1 and 2:"
+  },
+  {
+   "cell_type": "code",
+   "source": "for n_points in (16, 384):\n    cfg = scq.recommend_parallelization(dimension=6**3, num_points=n_points, evals_count=20)\n    print(f\"{n_points:>4} points -> num_cpus={cfg.num_cpus}, blas_threads={cfg.blas_threads}\")\n    print(f\"            {cfg.reason}\")\n",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": "The recommendation reproduces what the demos showed: serial for the 16-point sweep,\nparallel for the 384-point one. Two conveniences build on it:\n\n- pass **`num_cpus=\"auto\"`** to a sweep and it tunes itself *before* it runs:\n  `scq.ParameterSweep(..., num_cpus=\"auto\")`;\n- set **`scq.settings.AUTO_PARALLEL = True`** to apply this to every sweep that does not\n  specify `num_cpus`.\n\nFor a recommendation tuned to *your* hardware (rather than built-in defaults), run the\none-time **`scq.calibrate_parallelization()`** — it measures this machine's per-task\noverhead, pool-startup cost, and per-point cost (writing\n`~/.scqubits/parallel_calibration.json`), after which the recommendation uses your measured\nbreak-even.\n\n### Summary\n\n- **Easiest:** let `recommend_parallelization` / `num_cpus=\"auto\"` choose; optionally run\n  `calibrate_parallelization()` once for machine-tuned advice.\n- Parallelism helps only when *grid size × cost-per-point* greatly exceeds the per-task\n  overhead; on small sweeps it does nothing, or slows you down — expected, not a bug\n  (Demo 1 vs Demo 2).\n- **Cap BLAS threads** to avoid oversubscription (the 90× example); keep\n  `num_cpus × BLAS-threads ≈ cores`. Capping before import is the blunt manual route;\n  `MULTIPROC_BLAS_THREADS` and the heuristic handle it per sweep.\n- For large composite Hilbert spaces, **sparse diagonalization is usually a bigger lever\n  than multiprocessing** — try it first.\n- **Every number here is machine-specific** — measure (or calibrate) on your own hardware.",
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# The settings touched in this notebook, with their current values:\n",
+    "print(\"NUM_CPUS               =\", scq.settings.NUM_CPUS)\n",
+    "print(\"MULTIPROC_BLAS_THREADS =\", scq.settings.MULTIPROC_BLAS_THREADS)\n",
+    "print(\"AUTO_SPARSE_DIAG       =\", scq.settings.AUTO_SPARSE_DIAG)\n",
+    "print(\"MULTIPROC              =\", scq.settings.MULTIPROC, \"(parallelization backend)\")\n"
    ]
   }
  ],
@@ -311,4 +320,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 4
-}
+}
\ No newline at end of file

From 5660a50727558f8a0e22868f253f8b2a5f56179c Mon Sep 17 00:00:00 2001
From: Jens Koch <19193849+jkochNU@users.noreply.github.com>
Date: Tue, 2 Jun 2026 21:50:52 -0500
Subject: [PATCH 5/8] examples: rewrite multiprocessing demo around the
 auto-tuning story

Rewrite from scratch so the main story is the shipped capability: let scqubits
choose num_cpus and the BLAS-thread cap. Lead with recommend_parallelization,
num_cpus="auto", and AUTO_PARALLEL; show a serial-vs-auto correctness check; then
the one-time calibrate_parallelization(). The manual mechanics (grid break-even,
the BLAS oversubscription cliff, manual settings, sparse-first, the __main__ guard)
are demoted to supporting "why" and "manual control" sections. American English
throughout. 17 cells, down from 25.
---
 examples/demo_multiprocessing.ipynb | 345 +++++++++++++---------------
 1 file changed, 166 insertions(+), 179 deletions(-)

diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb
index 02d334f..8dfbdaf 100644
--- a/examples/demo_multiprocessing.ipynb
+++ b/examples/demo_multiprocessing.ipynb
@@ -1,45 +1,26 @@
 {
  "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": "# scqubits example: when does parallelizing a sweep help? (`num_cpus`, BLAS threads)\n\nJ. Koch and P. Groszkowski\n\nFor further documentation of scqubits see https://scqubits.readthedocs.io/en/latest/.\n\n---\n\n`ParameterSweep` can compute a sweep across worker processes via its `num_cpus`\nargument. **Parallelism is not free, and on small sweeps it does *not* help — sometimes\nit makes them slower.** Whether it pays off is governed by one simple balance:\n\n> parallelism helps only when&nbsp;&nbsp;**(number of grid points) × (cost per point)&nbsp;&nbsp;≫&nbsp;&nbsp;the fixed overhead**\n> of starting worker processes and shipping each task to them.\n\nSo the behaviour you should *expect*:\n\n- **Few grid points:** `num_cpus > 1` gives no speedup, and can be *slower* than serial.\n  This is normal — keep the default `num_cpus = 1`.\n- **Many grid points** (and/or an expensive-per-point system): workers pay off.\n\nA second knob interacts with the first: every eigensolve runs on a multithreaded\n**BLAS/LAPACK** backend. If you run several workers and let each use all cores, you\noversubscribe — which on large dense matrices is not a small slowdown but a\n**catastrophe** (~90× in the example below). So we cap BLAS threads **first**, before\nimporting anything. Timings are machine-specific — run the cells yourself; the\nillustrative numbers are from a 10-core Mac mini and yours will differ."
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Step 0 (do this first): cap BLAS threads *before* importing scqubits\n",
+    "# Letting scqubits choose your multiprocessing settings\n",
     "\n",
-    "The BLAS backend reads its thread count **once, at import time**, so this must be the\n",
-    "**first cell you run in a fresh kernel** — before `numpy` or `scqubits` is imported. (If\n",
-    "scqubits is already imported, restart the kernel and run this first.) Capping to `1` is a\n",
-    "good starting point for qubit sweeps and prevents oversubscription once `num_cpus > 1`;\n",
-    "the 'BLAS footgun' section below explains why this matters so much.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import os\n",
+    "J. Koch and P. Groszkowski\n",
     "\n",
-    "# MUST run before numpy / scqubits are imported (restart the kernel otherwise).\n",
-    "for _var in (\"OPENBLAS_NUM_THREADS\", \"MKL_NUM_THREADS\", \"OMP_NUM_THREADS\", \"NUMEXPR_NUM_THREADS\"):\n",
-    "    os.environ[_var] = \"1\"\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Confirm the cap took effect\n",
+    "For further documentation of scqubits see https://scqubits.readthedocs.io/en/latest/.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "A `ParameterSweep` can spread its grid points across worker processes through the `num_cpus`\n",
+    "argument. The catch is that the *right* number of workers — and the right number of BLAS\n",
+    "threads each worker should use — depends on the problem and the machine. Guess wrong and you\n",
+    "get no speedup, or, when the workers oversubscribe the cores, a slowdown of **one to two\n",
+    "orders of magnitude**.\n",
     "\n",
-    "The single most common cause of confusing sweep timings is that the cap **did not apply**\n",
-    "— because `numpy`/`scqubits` were already imported when the env vars were set. Check it\n",
-    "(needs `threadpoolctl`, `pip install threadpoolctl`); each reported count should be `1`:\n"
+    "So scqubits can choose for you. This notebook shows the easy path —\n",
+    "`recommend_parallelization` and `num_cpus=\"auto\"` — and a one-time per-machine calibration,\n",
+    "and then explains what is being balanced under the hood."
    ]
   },
   {
@@ -49,33 +30,18 @@
    "outputs": [],
    "source": [
     "import numpy as np\n",
-    "import scqubits as scq  # imports scipy, whose BLAS the eigensolvers use\n",
-    "\n",
-    "try:\n",
-    "    from threadpoolctl import threadpool_info\n",
-    "\n",
-    "    pools = threadpool_info()\n",
-    "    for pool in pools:\n",
-    "        print(f\"{pool['internal_api']:>10}: {pool['num_threads']} thread(s)\")\n",
-    "    if not pools:\n",
-    "        print(\"(no controllable BLAS pool detected)\")\n",
-    "    print(\"\\nEach count should be 1. If not, restart the kernel and run the cap cell\")\n",
-    "    print(\"FIRST. (On Apple Silicon, numpy itself uses Accelerate, which ignores these\")\n",
-    "    print(\" variables; scipy's bundled OpenBLAS -- shown here -- is what scqubits uses.)\")\n",
-    "except ImportError:\n",
-    "    print(\"threadpoolctl not installed (pip install threadpoolctl) -- skipping check.\")\n"
+    "\n",
+    "import scqubits as scq"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## A small timing helper and an example system\n",
+    "## A system to sweep\n",
     "\n",
-    "Three coupled tunable transmons, swept over the flux of the first. We rebuild a fresh\n",
-    "sweep for each timed run, and we will sweep **the same system at two grid sizes** to see\n",
-    "both regimes. Wall-clock timing is noisy, so we take the median of a few repeats and\n",
-    "discard a warm-up run (the first parallel run pays a one-time process-startup cost).\n"
+    "We use three capacitively coupled tunable transmons and sweep the flux of the first. The\n",
+    "dressed Hilbert space has dimension `6**3 = 216`."
    ]
   },
   {
@@ -84,58 +50,43 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "import time\n",
-    "\n",
-    "\n",
-    "def make_sweep(num_cpus, n_points, truncated_dim=6):\n",
+    "def build_hilbertspace():\n",
     "    qubits = [\n",
     "        scq.TunableTransmon(\n",
     "            EJmax=30.0, EC=0.2, d=0.1, flux=0.0, ng=0.0, ncut=50,\n",
-    "            truncated_dim=truncated_dim, id_str=f\"tmon{i}\",\n",
+    "            truncated_dim=6, id_str=f\"tmon{i}\",\n",
     "        )\n",
     "        for i in range(3)\n",
     "    ]\n",
     "    hs = scq.HilbertSpace(qubits)\n",
     "    for i in range(2):\n",
-    "        hs.add_interaction(g_strength=0.1, op1=qubits[i].n_operator, op2=qubits[i + 1].n_operator)\n",
-    "    flux_vals = np.linspace(0.0, 0.5, n_points)\n",
+    "        hs.add_interaction(\n",
+    "            g_strength=0.1, op1=qubits[i].n_operator, op2=qubits[i + 1].n_operator\n",
+    "        )\n",
     "\n",
     "    def update(flux):\n",
     "        qubits[0].flux = flux\n",
     "\n",
-    "    return scq.ParameterSweep(\n",
-    "        hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n",
-    "        update_hilbertspace=update, evals_count=20, num_cpus=num_cpus, autorun=True,\n",
-    "    )\n",
-    "\n",
-    "\n",
-    "def time_run(n_points, num_cpus, repeats=3):\n",
-    "    \"\"\"Median wall time of make_sweep(num_cpus, n_points), after a discarded warm-up.\"\"\"\n",
-    "    make_sweep(num_cpus, n_points)  # warm-up (discarded)\n",
-    "    times = []\n",
-    "    for _ in range(repeats):\n",
-    "        start = time.perf_counter()\n",
-    "        make_sweep(num_cpus, n_points)\n",
-    "        times.append(time.perf_counter() - start)\n",
-    "    return float(np.median(times))\n",
+    "    return hs, update\n",
     "\n",
     "\n",
-    "cores = os.cpu_count() or 1\n"
+    "hs, update = build_hilbertspace()\n",
+    "print(\"dressed dimension:\", hs.dimension)"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "source": "### Running this as a `.py` script? Guard the entry point\n\nThese cells run fine **in Jupyter**. But if you move parallel code (`num_cpus > 1`) into a\nplain Python script, note that macOS and Windows start workers with the `spawn` method,\nwhich **re-imports your script** in each worker — so the entry point must be guarded, or\nPython raises a `RuntimeError`:\n\n```python\nif __name__ == \"__main__\":\n    ...   # code that triggers num_cpus > 1\n```\n\nLinux (which uses `fork`) and Jupyter need no guard. scqubits also prints a one-time\nreminder the first time it spawns workers outside Jupyter.",
-   "metadata": {}
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Demo 1 — a small sweep: parallelism does *not* help\n",
+    "## The easy way: let scqubits choose\n",
+    "\n",
+    "`scq.recommend_parallelization` reads the workload — Hilbert-space dimension, number of grid\n",
+    "points, eigenvalue count, and whether sparse diagonalization applies — and returns a\n",
+    "recommended `num_cpus` together with a per-worker BLAS-thread cap. It is a *pure* function: it\n",
+    "starts no worker processes, so it is safe to call anywhere, and it does not run the sweep.\n",
     "\n",
-    "Only **16 grid points**. There are too few points to amortize the cost of starting\n",
-    "workers and shipping tasks, so spreading them across processes buys nothing.\n"
+    "Because a `ParameterSweep` runs the moment it is constructed, call it *before* building the\n",
+    "sweep. Here is its choice for our system at two grid sizes:"
    ]
   },
   {
@@ -144,23 +95,41 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "for n in [c for c in (1, 2, 4) if c <= cores]:\n",
-    "    print(f\"num_cpus={n}:  {time_run(16, n):.3f} s (median)\")\n"
+    "for n_points in (16, 384):\n",
+    "    cfg = scq.recommend_parallelization(\n",
+    "        hilbertspace=hs, num_points=n_points, evals_count=20\n",
+    "    )\n",
+    "    print(f\"{n_points:>4} points -> num_cpus={cfg.num_cpus}, blas_threads={cfg.blas_threads}\")\n",
+    "    print(f\"            {cfg.reason}\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "**What to expect:** roughly *flat*, and `num_cpus = 2`/`4` may be slightly **slower** than\n`1`. On the reference 10-core Mac mini (16 points, BLAS capped to 1):\n\n| num_cpus | wall time | speedup |\n|---:|---:|---:|\n| 1 | 0.10 s | 1.00× |\n| 2 | 0.10 s | 0.93× |\n| 4 | 0.11 s | 0.84× |\n\nThis is the **normal** outcome for a quick sweep, and **not a bug** — with only 16 points\nthe per-task overhead dominates, so `num_cpus = 1` (the default) is the right choice."
+   "source": [
+    "A 16-point sweep stays serial — there are too few points to repay the cost of starting and\n",
+    "feeding worker processes — while the 384-point sweep is spread across workers, each capped to\n",
+    "a single BLAS thread so they do not oversubscribe the cores.\n",
+    "\n",
+    "To apply the recommendation without copying numbers by hand, pass the sentinel\n",
+    "`num_cpus=\"auto\"` to the sweep. The choice is then made automatically, *before* the sweep\n",
+    "runs:\n",
+    "\n",
+    "```python\n",
+    "sweep = scq.ParameterSweep(..., num_cpus=\"auto\")\n",
+    "```\n",
+    "\n",
+    "To make *every* sweep that does not specify `num_cpus` tune itself this way, set\n",
+    "`scq.settings.AUTO_PARALLEL = True`."
+   ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Demo 2 — a large sweep: parallelism pays off\n",
-    "\n",
-    "The *same system*, now with **384 grid points**. There is enough work to amortize the\n",
-    "fixed overhead, so the workers help.\n"
+    "The automatic choice changes only *how* a sweep is computed, never the result. We confirm\n",
+    "that by running the same 384-point sweep both serially and with `num_cpus=\"auto\"`, and\n",
+    "comparing the spectra:"
    ]
   },
   {
@@ -169,27 +138,33 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "for n in [c for c in (1, 2, 4) if c <= cores]:\n",
-    "    print(f\"num_cpus={n}:  {time_run(384, n):.3f} s (median)\")\n"
+    "flux_vals = np.linspace(0.0, 0.5, 384)\n",
+    "\n",
+    "serial = scq.ParameterSweep(\n",
+    "    hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n",
+    "    update_hilbertspace=update, evals_count=20, num_cpus=1,\n",
+    ")\n",
+    "auto = scq.ParameterSweep(\n",
+    "    hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n",
+    "    update_hilbertspace=update, evals_count=20, num_cpus=\"auto\",\n",
+    ")\n",
+    "print(\"spectra identical:\", np.allclose(serial[\"evals\"][:], auto[\"evals\"][:]))"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": "**What to expect:** now the workers pay off. On the reference Mac mini (384 points, BLAS\ncapped to 1):\n\n| num_cpus | wall time | speedup |\n|---:|---:|---:|\n| 1 | 2.14 s | 1.00× |\n| 2 | 1.50 s | 1.42× |\n| 4 | 0.90 s | **2.37×** |\n\nNothing changed except the **number of grid points** (16 → 384). That is the lever:\nparallelism helps once *grid size × per-point cost* clears the overhead. The crossover\ndepends on the per-point cost too — a heavier system (larger `truncated_dim`, or a large\ncomposite Hilbert space) reaches it at fewer points; a cheaper one needs more."
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": "## The BLAS-thread footgun — *why* we capped first\n\nEach eigensolve runs on a multithreaded BLAS backend that by default uses *all* cores.\nRun `num_cpus` workers and each launches a full BLAS pool, so you oversubscribe the cores\nby a factor of `num_cpus`. On small matrices this is wasteful; on **large dense** matrices\nit is *catastrophic*. Measured on the reference Mac mini — 5 capacitively coupled fluxonia,\ndressed dim 3125, dense diagonalization, 16-point sweep:\n\n| configuration | wall time |\n|---|---:|\n| `num_cpus=1` | 42 s |\n| `num_cpus=4`, BLAS **uncapped** | **3608 s**  (≈ 90× *slower*) |\n| `num_cpus=4`, BLAS capped to 1 | 40 s |\n| `num_cpus=8`, BLAS capped to 1 | 28 s |\n\n40 threads fighting over 10 cores on dense LAPACK collapses performance — the cap is the\ndifference between working and broken. (Skip Step 0 and rerun Demo 2 with the cap off and\nyou will reproduce a milder version: `num_cpus = 2`/`4` become *much* slower than `1`.)\nThat is also the usual reason a `num_cpus` comparison looks 'inconclusive' or backwards:\nthe cap was never in effect.\n\n**Caveat — `1` is not universally optimal.** On small matrices a single BLAS thread is\nbest; on large dense matrices each eigensolve benefits from several threads, so the\nfastest setting balances the knobs: roughly **`num_cpus × BLAS-threads ≈ cores`**. The\noptimum depends on machine, BLAS library, and matrix size — measure it."
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The thread cap above was set with environment variables (the reliable way, since BLAS\n",
-    "reads them at import). scqubits also exposes the relevant knobs as **settings** you can\n",
-    "change at any time in a session:\n"
+    "## Tune to your machine (optional, run once)\n",
+    "\n",
+    "The recommendation above uses conservative built-in thresholds. For a recommendation tuned to\n",
+    "*your* hardware, run the one-time calibration. It times a short battery of sweeps in isolated\n",
+    "subprocesses and measures this machine's per-task dispatch overhead, one-time pool-startup\n",
+    "cost, and per-point diagonalization cost (dense and sparse). It takes about a minute and\n",
+    "writes `~/.scqubits/parallel_calibration.json`.\n",
+    "\n",
+    "Because it launches its measurements as `python -m` subprocesses, the call needs no\n",
+    "`if __name__ == \"__main__\":` guard, in Jupyter or in a plain script."
    ]
   },
   {
@@ -197,32 +172,17 @@
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "# Global default number of worker processes (used when `num_cpus` is not passed):\nscq.settings.NUM_CPUS = 1\n\n# Cap BLAS threads per worker process during parallel sweeps. This is applied only while\n# the worker pool is created. Spawn-based workers (macOS, Windows) re-read the env vars at\n# import; fork-based workers (Linux) need `threadpoolctl`. Keep num_cpus x BLAS-threads ~ cores.\nscq.settings.MULTIPROC_BLAS_THREADS = 1"
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
    "source": [
-    "## For large composite systems, try sparse diagonalization first\n",
-    "\n",
-    "Before reaching for `num_cpus`, note that for large coupled systems the dominant cost is\n",
-    "the *per-point diagonalization*, usually a bigger lever than parallelism. Recent scqubits\n",
-    "automatically uses **sparse** diagonalization (`scipy.eigsh`) for the default method when\n",
-    "only a few eigenstates of a large Hilbert space are requested\n",
-    "(`scqubits.settings.AUTO_SPARSE_DIAG`). For the 5-fluxonia system above (dressed dim\n",
-    "3125) this alone is ~16× faster per point than dense — far more than parallelism buys\n",
-    "there. And once sparse makes each point cheap, `num_cpus > 1` helps even less. **Try\n",
-    "sparse first; parallelize second.**\n"
+    "scq.calibrate_parallelization()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The setting is `scqubits.settings.AUTO_SPARSE_DIAG` (on by default). Here it is on a\n",
-    "moderately large system you can actually run — **4 capacitively coupled fluxonia**,\n",
-    "dressed dimension 6⁴ = 1296 — timing the same 8-point sweep with sparse vs dense\n",
-    "diagonalization:\n"
+    "Once the calibration exists, `recommend_parallelization` (and `num_cpus=\"auto\"`) use the\n",
+    "measured break-even — parallelizing only once the grid is large enough to repay the measured\n",
+    "pool-startup cost. Re-running the recommendation now reflects your machine:"
    ]
   },
   {
@@ -231,80 +191,107 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "import warnings\n",
-    "\n",
-    "def make_fluxonia_sweep(n_points=8):\n",
-    "    fluxonia = [\n",
-    "        scq.Fluxonium(EJ=4.0, EC=1.0, EL=1.0, flux=0.5, cutoff=110, truncated_dim=6, id_str=f\"flx{i}\")\n",
-    "        for i in range(4)\n",
-    "    ]\n",
-    "    hs = scq.HilbertSpace(fluxonia)\n",
-    "    for i in range(3):\n",
-    "        hs.add_interaction(g_strength=0.1, op1=fluxonia[i].n_operator, op2=fluxonia[i + 1].n_operator)\n",
-    "\n",
-    "    def update(flux):\n",
-    "        fluxonia[0].flux = flux\n",
-    "\n",
-    "    return scq.ParameterSweep(\n",
-    "        hilbertspace=hs, paramvals_by_name={\"flux\": np.linspace(0.45, 0.55, n_points)},\n",
-    "        update_hilbertspace=update, evals_count=20, num_cpus=1, autorun=True,\n",
+    "for n_points in (16, 384):\n",
+    "    cfg = scq.recommend_parallelization(\n",
+    "        hilbertspace=hs, num_points=n_points, evals_count=20\n",
     "    )\n",
+    "    print(f\"{n_points:>4} points -> num_cpus={cfg.num_cpus}  ({cfg.reason})\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## What scqubits is balancing for you\n",
     "\n",
+    "There is no single right answer because two effects pull in opposite directions.\n",
     "\n",
-    "def time_fluxonia():\n",
-    "    make_fluxonia_sweep()  # warm-up\n",
-    "    start = time.perf_counter()\n",
-    "    make_fluxonia_sweep()\n",
-    "    return time.perf_counter() - start\n",
+    "**1. The grid break-even.** Sending a grid point to a worker costs a fixed amount (pickling,\n",
+    "inter-process hand-off), and starting the pool costs a one-time amount (about a second when\n",
+    "workers are *spawned*, as on macOS and Windows). Parallelism pays off only when\n",
     "\n",
+    "> (number of grid points) × (cost per point) ≫ that fixed overhead.\n",
     "\n",
-    "scq.settings.AUTO_SPARSE_DIAG = True\n",
-    "t_sparse = time_fluxonia()\n",
+    "Few points, or cheap points, stay faster serially — which is why the heuristic keeps small\n",
+    "sweeps on a single process.\n",
     "\n",
-    "# The dense path exercises scipy's matrix-function code, which emits spurious 'matmul'\n",
-    "# RuntimeWarnings on some BLAS backends (e.g. Apple Accelerate); silence them for the demo.\n",
-    "with warnings.catch_warnings():\n",
-    "    warnings.simplefilter(\"ignore\")\n",
-    "    scq.settings.AUTO_SPARSE_DIAG = False\n",
-    "    t_dense = time_fluxonia()\n",
-    "scq.settings.AUTO_SPARSE_DIAG = True  # restore the default\n",
+    "**2. The BLAS oversubscription cliff.** Every eigensolve already runs on a multithreaded BLAS\n",
+    "backend. If several workers each use all cores, the cores are oversubscribed by a factor of\n",
+    "`num_cpus`, which on large dense matrices is not a small slowdown but a collapse. Measured on\n",
+    "a 10-core Mac mini — five capacitively coupled fluxonia (dressed dimension 3125, dense),\n",
+    "16-point sweep:\n",
     "\n",
-    "print(f\"sparse (default): {t_sparse:.2f} s    dense: {t_dense:.2f} s    speedup: {t_dense / t_sparse:.1f}x\")\n"
+    "| configuration | wall time |\n",
+    "|---|---:|\n",
+    "| `num_cpus=1` | 42 s |\n",
+    "| `num_cpus=4`, BLAS uncapped | **3608 s**  (~90x slower) |\n",
+    "| `num_cpus=4`, BLAS capped to 1 | 40 s |\n",
+    "| `num_cpus=8`, BLAS capped to 1 | 28 s |\n",
+    "\n",
+    "This is why the recommendation always pairs a worker count with a BLAS-thread cap, keeping\n",
+    "`num_cpus x BLAS-threads` near the core count."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "On the reference Mac mini this prints roughly `sparse 0.55 s   dense 3.94 s   speedup 7.2x`.\nThe advantage grows with system size, and for very large Hilbert spaces dense may not fit\nin memory at all. (The dense run may also emit a few scipy-internal `matmul` warnings —\nanother reason to prefer the sparse default.)"
+   "source": [
+    "## Manual control\n",
+    "\n",
+    "The same knobs are available directly if you would rather set them yourself:\n",
+    "\n",
+    "```python\n",
+    "scq.settings.NUM_CPUS = 4                 # default worker count when num_cpus is unset\n",
+    "scq.settings.MULTIPROC_BLAS_THREADS = 1   # per-worker BLAS-thread cap during a sweep\n",
+    "```\n",
+    "\n",
+    "Rule of thumb: keep `num_cpus x BLAS-threads` near the number of physical cores.\n",
+    "`MULTIPROC_BLAS_THREADS` reaches spawn-based workers (macOS, Windows) through the thread-count\n",
+    "environment variables, and fork-based workers (Linux) through the optional `threadpoolctl`\n",
+    "package.\n",
+    "\n",
+    "For large composite systems the per-point **diagonalization method** is often a bigger lever\n",
+    "than parallelism: scqubits uses sparse diagonalization by default for large spectra (see\n",
+    "`scq.settings.AUTO_SPARSE_DIAG`), which can be far faster per point. Once each point is cheap,\n",
+    "parallelism helps even less — so try sparse first, and parallelize second."
+   ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "## Letting scqubits pick the settings\n\nYou don't have to find the break-even by hand. `scqubits.recommend_parallelization` reads\nthe workload — Hilbert-space dimension, number of grid points, eigenvalue count, and\nwhether sparse diagonalization applies — and returns a recommended `num_cpus` and\nBLAS-thread cap. It is a *pure* function (it starts no workers), so it is safe to call\nanywhere. For the three-transmon system above (dressed dimension 6³ = 216), at the two grid\nsizes from Demos 1 and 2:"
-  },
-  {
-   "cell_type": "code",
-   "source": "for n_points in (16, 384):\n    cfg = scq.recommend_parallelization(dimension=6**3, num_points=n_points, evals_count=20)\n    print(f\"{n_points:>4} points -> num_cpus={cfg.num_cpus}, blas_threads={cfg.blas_threads}\")\n    print(f\"            {cfg.reason}\")\n",
-   "metadata": {},
-   "execution_count": null,
-   "outputs": []
+   "source": [
+    "## Running as a script\n",
+    "\n",
+    "In Jupyter, everything above runs as shown. In a plain Python script on macOS or Windows,\n",
+    "workers are *spawned*, which re-imports your script in each worker — so the entry point that\n",
+    "triggers a parallel sweep must be guarded:\n",
+    "\n",
+    "```python\n",
+    "import scqubits as scq\n",
+    "\n",
+    "if __name__ == \"__main__\":\n",
+    "    sweep = scq.ParameterSweep(..., num_cpus=\"auto\")\n",
+    "```\n",
+    "\n",
+    "Linux (which forks) and Jupyter need no guard. scqubits prints a one-time reminder the first\n",
+    "time it spawns workers outside of IPython."
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": "The recommendation reproduces what the demos showed: serial for the 16-point sweep,\nparallel for the 384-point one. Two conveniences build on it:\n\n- pass **`num_cpus=\"auto\"`** to a sweep and it tunes itself *before* it runs:\n  `scq.ParameterSweep(..., num_cpus=\"auto\")`;\n- set **`scq.settings.AUTO_PARALLEL = True`** to apply this to every sweep that does not\n  specify `num_cpus`.\n\nFor a recommendation tuned to *your* hardware (rather than built-in defaults), run the\none-time **`scq.calibrate_parallelization()`** — it measures this machine's per-task\noverhead, pool-startup cost, and per-point cost (writing\n`~/.scqubits/parallel_calibration.json`), after which the recommendation uses your measured\nbreak-even.\n\n### Summary\n\n- **Easiest:** let `recommend_parallelization` / `num_cpus=\"auto\"` choose; optionally run\n  `calibrate_parallelization()` once for machine-tuned advice.\n- Parallelism helps only when *grid size × cost-per-point* greatly exceeds the per-task\n  overhead; on small sweeps it does nothing, or slows you down — expected, not a bug\n  (Demo 1 vs Demo 2).\n- **Cap BLAS threads** to avoid oversubscription (the 90× example); keep\n  `num_cpus × BLAS-threads ≈ cores`. Capping before import is the blunt manual route;\n  `MULTIPROC_BLAS_THREADS` and the heuristic handle it per sweep.\n- For large composite Hilbert spaces, **sparse diagonalization is usually a bigger lever\n  than multiprocessing** — try it first.\n- **Every number here is machine-specific** — measure (or calibrate) on your own hardware.",
-   "metadata": {}
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
    "metadata": {},
-   "outputs": [],
    "source": [
-    "# The settings touched in this notebook, with their current values:\n",
-    "print(\"NUM_CPUS               =\", scq.settings.NUM_CPUS)\n",
-    "print(\"MULTIPROC_BLAS_THREADS =\", scq.settings.MULTIPROC_BLAS_THREADS)\n",
-    "print(\"AUTO_SPARSE_DIAG       =\", scq.settings.AUTO_SPARSE_DIAG)\n",
-    "print(\"MULTIPROC              =\", scq.settings.MULTIPROC, \"(parallelization backend)\")\n"
+    "## Summary\n",
+    "\n",
+    "- **Let scqubits choose.** `scq.recommend_parallelization(...)` recommends `num_cpus` and a\n",
+    "  BLAS-thread cap from the workload; `num_cpus=\"auto\"` applies it per sweep, and\n",
+    "  `scq.settings.AUTO_PARALLEL = True` applies it everywhere.\n",
+    "- **Calibrate once** with `scq.calibrate_parallelization()` for advice measured on your own\n",
+    "  hardware.\n",
+    "- Parallelism helps only when the grid is large enough to repay the fixed overhead, and the\n",
+    "  BLAS-thread cap is what keeps many workers from oversubscribing the cores.\n",
+    "- All of this takes effect live, without restarting the kernel, in Jupyter and in scripts\n",
+    "  alike."
    ]
   }
  ],
@@ -320,4 +307,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 4
-}
\ No newline at end of file
+}

From 5fa0767fbcf3e3fe46bcbb639bf8a1dfe1eb567a Mon Sep 17 00:00:00 2001
From: Jens Koch <19193849+jkochNU@users.noreply.github.com>
Date: Fri, 5 Jun 2026 23:18:50 -0500
Subject: [PATCH 6/8] docs(demo): note MULTIPROC_BLAS_THREADS defaults to
 "auto"

The manual-control section now explains that the BLAS-thread cap is automatic
by default ("auto" = cores // num_cpus); setting an int overrides it and None
opts out.
---
 examples/demo_multiprocessing.ipynb | 23 ++---------------------
 1 file changed, 2 insertions(+), 21 deletions(-)

diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb
index 8dfbdaf..f25cc31 100644
--- a/examples/demo_multiprocessing.ipynb
+++ b/examples/demo_multiprocessing.ipynb
@@ -235,26 +235,7 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "## Manual control\n",
-    "\n",
-    "The same knobs are available directly if you would rather set them yourself:\n",
-    "\n",
-    "```python\n",
-    "scq.settings.NUM_CPUS = 4                 # default worker count when num_cpus is unset\n",
-    "scq.settings.MULTIPROC_BLAS_THREADS = 1   # per-worker BLAS-thread cap during a sweep\n",
-    "```\n",
-    "\n",
-    "Rule of thumb: keep `num_cpus x BLAS-threads` near the number of physical cores.\n",
-    "`MULTIPROC_BLAS_THREADS` reaches spawn-based workers (macOS, Windows) through the thread-count\n",
-    "environment variables, and fork-based workers (Linux) through the optional `threadpoolctl`\n",
-    "package.\n",
-    "\n",
-    "For large composite systems the per-point **diagonalization method** is often a bigger lever\n",
-    "than parallelism: scqubits uses sparse diagonalization by default for large spectra (see\n",
-    "`scq.settings.AUTO_SPARSE_DIAG`), which can be far faster per point. Once each point is cheap,\n",
-    "parallelism helps even less — so try sparse first, and parallelize second."
-   ]
+   "source": "## Manual control\n\nThe same knobs are available directly if you would rather set them yourself:\n\n```python\nscq.settings.NUM_CPUS = 4                 # default worker count when num_cpus is unset\nscq.settings.MULTIPROC_BLAS_THREADS = 1   # per-worker BLAS-thread cap during a sweep\n```\n\n`MULTIPROC_BLAS_THREADS` already defaults to `\"auto\"`, which caps each worker to\n`cores // num_cpus` so parallel sweeps never oversubscribe the cores — setting an integer\njust overrides that with a fixed cap (use `None` to opt out entirely). Rule of thumb: keep\n`num_cpus x BLAS-threads` near the number of physical cores. The cap reaches spawn-based\nworkers (macOS, Windows) through the thread-count environment variables, and fork-based\nworkers (Linux) through `threadpoolctl`.\n\nFor large composite systems the per-point **diagonalization method** is often a bigger lever\nthan parallelism: scqubits uses sparse diagonalization by default for large spectra (see\n`scq.settings.AUTO_SPARSE_DIAG`), which can be far faster per point. Once each point is cheap,\nparallelism helps even less — so try sparse first, and parallelize second."
   },
   {
    "cell_type": "markdown",
@@ -307,4 +288,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 4
-}
+}
\ No newline at end of file

From 04a415f0373112d8a1dfaa875e784463454d433c Mon Sep 17 00:00:00 2001
From: Jens Koch <19193849+jkochNU@users.noreply.github.com>
Date: Mon, 8 Jun 2026 18:19:33 -0500
Subject: [PATCH 7/8] docs(demo): reassure existing users, clarify auto vs
 AUTO_PARALLEL, recalibration

---
 examples/demo_multiprocessing.ipynb | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb
index f25cc31..a8cbd3f 100644
--- a/examples/demo_multiprocessing.ipynb
+++ b/examples/demo_multiprocessing.ipynb
@@ -20,7 +20,9 @@
     "\n",
     "So scqubits can choose for you. This notebook shows the easy path —\n",
     "`recommend_parallelization` and `num_cpus=\"auto\"` — and a one-time per-machine calibration,\n",
-    "and then explains what is being balanced under the hood."
+    "and then explains what is being balanced under the hood.\n",
+    "\n",
+    "> **Updating from an older scqubits?** Nothing here is required: your existing code runs unchanged and gives identical results (the default is still serial, `num_cpus=1`). And if you already pass `num_cpus > 1`, you now get automatic protection against the oversubscription slowdown for free. Everything below is opt-in — for *more* speed with *less* guesswork."
    ]
   },
   {
@@ -120,7 +122,15 @@
     "```\n",
     "\n",
     "To make *every* sweep that does not specify `num_cpus` tune itself this way, set\n",
-    "`scq.settings.AUTO_PARALLEL = True`."
+    "`scq.settings.AUTO_PARALLEL = True`.\n",
+    "\n",
+    "These are the same auto-tuner with different reach — `num_cpus=\"auto\"` opts in for one sweep, while `AUTO_PARALLEL = True` makes it the default for every sweep where you don't pass `num_cpus`. An explicit number always wins:\n",
+    "\n",
+    "```text\n",
+    "num_cpus=4        ->  exactly 4 workers      (you decide)\n",
+    "num_cpus=\"auto\"   ->  auto-tuner decides     (always)\n",
+    "num_cpus omitted  ->  auto-tuner if AUTO_PARALLEL=True, else serial (the default)\n",
+    "```"
    ]
   },
   {
@@ -157,14 +167,13 @@
    "source": [
     "## Tune to your machine (optional, run once)\n",
     "\n",
-    "The recommendation above uses conservative built-in thresholds. For a recommendation tuned to\n",
-    "*your* hardware, run the one-time calibration. It times a short battery of sweeps in isolated\n",
-    "subprocesses and measures this machine's per-task dispatch overhead, one-time pool-startup\n",
-    "cost, and per-point diagonalization cost (dense and sparse). It takes about a minute and\n",
-    "writes `~/.scqubits/parallel_calibration.json`.\n",
+    "The recommendation above uses conservative built-in thresholds that work everywhere. For choices tuned to *your* hardware, run the one-time calibration. It times a short battery of sweeps in isolated subprocesses — this machine's per-task dispatch overhead, the one-time pool-startup cost, and per-point diagonalization cost (dense and sparse) — and writes `~/.scqubits/parallel_calibration.json` (about a minute).\n",
+    "\n",
+    "The calibration is **just data**: it does nothing on its own. From then on, every `recommend_parallelization` / `num_cpus=\"auto\"` call reads that file and makes a sharper, machine-specific choice instead of using the generic defaults.\n",
+    "\n",
+    "**Re-running overwrites the file, so recalibrate freely** — and *do* recalibrate if a previous run was taken under bad conditions: while the machine was busy, or on a laptop that was on battery / CPU-throttled (many laptops clock down hard when unplugged, which makes the calibration over-estimate every cost so `\"auto\"` then under-parallelizes). For the most representative numbers, calibrate on an otherwise-idle machine plugged into wall power.\n",
     "\n",
-    "Because it launches its measurements as `python -m` subprocesses, the call needs no\n",
-    "`if __name__ == \"__main__\":` guard, in Jupyter or in a plain script."
+    "Because it launches its measurements as `python -m` subprocesses, the call needs no `if __name__ == \"__main__\":` guard, in Jupyter or in a plain script."
    ]
   },
   {
@@ -288,4 +297,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 4
-}
\ No newline at end of file
+}

From 7878728781880d6df866e545d03c02ffacdcd9a1 Mon Sep 17 00:00:00 2001
From: Jens Koch <19193849+jkochNU@users.noreply.github.com>
Date: Mon, 8 Jun 2026 18:36:15 -0500
Subject: [PATCH 8/8] docs(demo): drop update-from-older-scqubits aside; keep
 current-state intro

---
 examples/demo_multiprocessing.ipynb | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb
index a8cbd3f..0cbbca2 100644
--- a/examples/demo_multiprocessing.ipynb
+++ b/examples/demo_multiprocessing.ipynb
@@ -20,9 +20,7 @@
     "\n",
     "So scqubits can choose for you. This notebook shows the easy path —\n",
     "`recommend_parallelization` and `num_cpus=\"auto\"` — and a one-time per-machine calibration,\n",
-    "and then explains what is being balanced under the hood.\n",
-    "\n",
-    "> **Updating from an older scqubits?** Nothing here is required: your existing code runs unchanged and gives identical results (the default is still serial, `num_cpus=1`). And if you already pass `num_cpus > 1`, you now get automatic protection against the oversubscription slowdown for free. Everything below is opt-in — for *more* speed with *less* guesswork."
+    "and then explains what is being balanced under the hood."
    ]
   },
   {