diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb
new file mode 100644
index 0000000..0cbbca2
--- /dev/null
+++ b/examples/demo_multiprocessing.ipynb
@@ -0,0 +1,298 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Letting scqubits choose your multiprocessing settings\n",
+    "\n",
+    "J. Koch and P. Groszkowski\n",
+    "\n",
+    "For further documentation of scqubits see https://scqubits.readthedocs.io/en/latest/.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "A `ParameterSweep` can spread its grid points across worker processes through the `num_cpus`\n",
+    "argument. The catch is that the *right* number of workers — and the right number of BLAS\n",
+    "threads each worker should use — depends on the problem and the machine. Guess wrong and you\n",
+    "get no speedup, or, when the workers oversubscribe the cores, a slowdown of **one to two\n",
+    "orders of magnitude**.\n",
+    "\n",
+    "So scqubits can choose for you. This notebook first shows the easy path\n",
+    "(`recommend_parallelization` and `num_cpus=\"auto\"`), then a one-time per-machine calibration,\n",
+    "and finally explains what is being balanced under the hood."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "import scqubits as scq"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## A system to sweep\n",
+    "\n",
+    "We use three capacitively coupled tunable transmons and sweep the flux of the first. The\n",
+    "dressed Hilbert space has dimension `6**3 = 216`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def build_hilbertspace():\n",
+    "    qubits = [\n",
+    "        scq.TunableTransmon(\n",
+    "            EJmax=30.0, EC=0.2, d=0.1, flux=0.0, ng=0.0, ncut=50,\n",
+    "            truncated_dim=6, id_str=f\"tmon{i}\",\n",
+    "        )\n",
+    "        for i in range(3)\n",
+    "    ]\n",
+    "    hs = scq.HilbertSpace(qubits)\n",
+    "    for i in range(2):\n",
+    "        hs.add_interaction(\n",
+    "            g_strength=0.1, op1=qubits[i].n_operator, op2=qubits[i + 1].n_operator\n",
+    "        )\n",
+    "\n",
+    "    def update(flux):\n",
+    "        qubits[0].flux = flux\n",
+    "\n",
+    "    return hs, update\n",
+    "\n",
+    "\n",
+    "hs, update = build_hilbertspace()\n",
+    "print(\"dressed dimension:\", hs.dimension)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The easy way: let scqubits choose\n",
+    "\n",
+    "`scq.recommend_parallelization` reads the workload — Hilbert-space dimension, number of grid\n",
+    "points, eigenvalue count, and whether sparse diagonalization applies — and returns a\n",
+    "recommended `num_cpus` together with a per-worker BLAS-thread cap. It has no side effects: it\n",
+    "starts no worker processes and does not run the sweep, so it is safe to call anywhere.\n",
+    "\n",
+    "Because a `ParameterSweep` runs the moment it is constructed, call\n",
+    "`recommend_parallelization` *before* building the sweep. Here is its choice for our system at two grid sizes:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for n_points in (16, 384):\n",
+    "    cfg = scq.recommend_parallelization(\n",
+    "        hilbertspace=hs, num_points=n_points, evals_count=20\n",
+    "    )\n",
+    "    print(f\"{n_points:>4} points -> num_cpus={cfg.num_cpus}, blas_threads={cfg.blas_threads}\")\n",
+    "    print(f\"            {cfg.reason}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A 16-point sweep stays serial — there are too few points to repay the cost of starting and\n",
+    "feeding worker processes — while the 384-point sweep is spread across workers, each capped to\n",
+    "a single BLAS thread so they do not oversubscribe the cores.\n",
+    "\n",
+    "To apply the recommendation without copying numbers by hand, pass the sentinel\n",
+    "`num_cpus=\"auto\"` to the sweep. The choice is then made automatically, *before* the sweep\n",
+    "runs:\n",
+    "\n",
+    "```python\n",
+    "sweep = scq.ParameterSweep(..., num_cpus=\"auto\")\n",
+    "```\n",
+    "\n",
+    "To make *every* sweep that does not specify `num_cpus` tune itself this way, set\n",
+    "`scq.settings.AUTO_PARALLEL = True`.\n",
+    "\n",
+    "Both switches invoke the same auto-tuner and differ only in reach: `num_cpus=\"auto\"` opts in for one sweep, while `AUTO_PARALLEL = True` makes it the default for every sweep where you don't pass `num_cpus`. An explicit number always wins:\n",
+    "\n",
+    "```text\n",
+    "num_cpus=4        ->  exactly 4 workers      (you decide)\n",
+    "num_cpus=\"auto\"   ->  auto-tuner decides     (always)\n",
+    "num_cpus omitted  ->  auto-tuner if AUTO_PARALLEL=True, else scq.settings.NUM_CPUS (serial by default)\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The automatic choice changes only *how* a sweep is computed, never the result. We confirm\n",
+    "that by running the same 384-point sweep both serially and with `num_cpus=\"auto\"`, and\n",
+    "comparing the spectra:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "flux_vals = np.linspace(0.0, 0.5, 384)\n",
+    "\n",
+    "serial = scq.ParameterSweep(\n",
+    "    hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n",
+    "    update_hilbertspace=update, evals_count=20, num_cpus=1,\n",
+    ")\n",
+    "auto = scq.ParameterSweep(\n",
+    "    hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n",
+    "    update_hilbertspace=update, evals_count=20, num_cpus=\"auto\",\n",
+    ")\n",
+    "print(\"spectra agree (np.allclose):\", np.allclose(serial[\"evals\"][:], auto[\"evals\"][:]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Tune to your machine (optional, run once)\n",
+    "\n",
+    "The recommendation above uses conservative built-in thresholds that work everywhere. For choices tuned to *your* hardware, run the one-time calibration. It times a short battery of sweeps in isolated subprocesses to measure this machine's per-task dispatch overhead, its one-time pool-startup cost, and its per-point diagonalization cost (dense and sparse), then writes the results to `~/.scqubits/parallel_calibration.json`. The whole run takes about a minute.\n",
+    "\n",
+    "The calibration is **just data**: it does nothing on its own. From then on, every `recommend_parallelization` / `num_cpus=\"auto\"` call reads that file and makes a sharper, machine-specific choice instead of using the generic defaults.\n",
+    "\n",
+    "**Re-running overwrites the file, so recalibrate freely.** *Do* recalibrate if a previous run was taken under bad conditions: while the machine was busy, or on a laptop that was on battery or CPU-throttled. Many laptops clock down hard when unplugged, which makes the calibration overestimate every cost, so `\"auto\"` then under-parallelizes. For the most representative numbers, calibrate on an otherwise idle machine plugged into wall power.\n",
+    "\n",
+    "Because it launches its measurements as `python -m` subprocesses, the call needs no `if __name__ == \"__main__\":` guard, in Jupyter or in a plain script."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "scq.calibrate_parallelization()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once the calibration exists, `recommend_parallelization` (and `num_cpus=\"auto\"`) use the\n",
+    "measured break-even — parallelizing only once the grid is large enough to repay the measured\n",
+    "pool-startup cost. Re-running the recommendation now reflects your machine:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for n_points in (16, 384):\n",
+    "    cfg = scq.recommend_parallelization(\n",
+    "        hilbertspace=hs, num_points=n_points, evals_count=20\n",
+    "    )\n",
+    "    print(f\"{n_points:>4} points -> num_cpus={cfg.num_cpus}  ({cfg.reason})\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## What scqubits is balancing for you\n",
+    "\n",
+    "There is no single right answer because two effects pull in opposite directions.\n",
+    "\n",
+    "**1. The grid break-even.** Sending a grid point to a worker costs a fixed amount (pickling,\n",
+    "inter-process hand-off), and starting the pool costs a one-time amount (about a second when\n",
+    "workers are *spawned*, as on macOS and Windows). Parallelism pays off only when\n",
+    "\n",
+    "> (number of grid points) × (cost per point) ≫ that fixed overhead.\n",
+    "\n",
+    "Few points, or cheap points, stay faster serially — which is why the heuristic keeps small\n",
+    "sweeps on a single process.\n",
+    "\n",
+    "**2. The BLAS oversubscription cliff.** Every eigensolve already runs on a multithreaded BLAS\n",
+    "backend. If several workers each use all cores, the cores are oversubscribed by a factor of\n",
+    "`num_cpus`, which on large dense matrices is not a small slowdown but a collapse. Measured on\n",
+    "a 10-core Mac mini — five capacitively coupled fluxonia (dressed dimension 3125, dense),\n",
+    "16-point sweep:\n",
+    "\n",
+    "| configuration | wall time |\n",
+    "|---|---:|\n",
+    "| `num_cpus=1` | 42 s |\n",
+    "| `num_cpus=4`, BLAS uncapped | **3608 s**  (~86x slower than serial) |\n",
+    "| `num_cpus=4`, BLAS capped to 1 | 40 s |\n",
+    "| `num_cpus=8`, BLAS capped to 1 | 28 s |\n",
+    "\n",
+    "This is why the recommendation always pairs a worker count with a BLAS-thread cap, keeping\n",
+    "`num_cpus × BLAS-threads` near the core count."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Manual control\n\nThe same knobs are available directly if you would rather set them yourself:\n\n```python\nscq.settings.NUM_CPUS = 4                 # default worker count when num_cpus is unset\nscq.settings.MULTIPROC_BLAS_THREADS = 1   # per-worker BLAS-thread cap during a sweep\n```\n\n`MULTIPROC_BLAS_THREADS` already defaults to `\"auto\"`, which caps each worker to\n`cores // num_cpus` threads so that parallel sweeps never oversubscribe the cores. Setting an\ninteger just overrides that with a fixed cap; use `None` to opt out entirely. Rule of thumb: keep\n`num_cpus x BLAS-threads` near the number of physical cores. The cap reaches spawn-based\nworkers (macOS, Windows) through the thread-count environment variables, and fork-based\nworkers (Linux) through `threadpoolctl`.\n\nFor large composite systems the per-point **diagonalization method** is often a bigger lever\nthan parallelism: scqubits uses sparse diagonalization by default for large spectra (see\n`scq.settings.AUTO_SPARSE_DIAG`), which can be far faster per point. Once each point is cheap,\nparallelism helps even less, so try sparse first and parallelize second."
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Running as a script\n",
+    "\n",
+    "In Jupyter, everything above runs as shown. In a plain Python script on macOS or Windows,\n",
+    "workers are *spawned*, which re-imports your script in each worker — so the entry point that\n",
+    "triggers a parallel sweep must be guarded:\n",
+    "\n",
+    "```python\n",
+    "import scqubits as scq\n",
+    "\n",
+    "if __name__ == \"__main__\":\n",
+    "    sweep = scq.ParameterSweep(..., num_cpus=\"auto\")\n",
+    "```\n",
+    "\n",
+    "Linux (which forks) and Jupyter need no guard. scqubits prints a one-time reminder the first\n",
+    "time it spawns workers outside of IPython."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Summary\n",
+    "\n",
+    "- **Let scqubits choose.** `scq.recommend_parallelization(...)` recommends `num_cpus` and a\n",
+    "  BLAS-thread cap from the workload; `num_cpus=\"auto\"` applies it per sweep, and\n",
+    "  `scq.settings.AUTO_PARALLEL = True` applies it everywhere.\n",
+    "- **Calibrate once** with `scq.calibrate_parallelization()` for advice measured on your own\n",
+    "  hardware.\n",
+    "- Parallelism helps only when the grid is large enough to repay the fixed overhead, and the\n",
+    "  BLAS-thread cap is what keeps many workers from oversubscribing the cores.\n",
+    "- All of this takes effect live, without restarting the kernel, in Jupyter and in scripts\n",
+    "  alike."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}