diff --git a/examples/demo_multiprocessing.ipynb b/examples/demo_multiprocessing.ipynb new file mode 100644 index 0000000..0cbbca2 --- /dev/null +++ b/examples/demo_multiprocessing.ipynb @@ -0,0 +1,298 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Letting scqubits choose your multiprocessing settings\n", + "\n", + "J. Koch and P. Groszkowski\n", + "\n", + "For further documentation of scqubits see https://scqubits.readthedocs.io/en/latest/.\n", + "\n", + "---\n", + "\n", + "A `ParameterSweep` can spread its grid points across worker processes through the `num_cpus`\n", + "argument. The catch is that the *right* number of workers — and the right number of BLAS\n", + "threads each worker should use — depends on the problem and the machine. Guess wrong and you\n", + "get no speedup, or, when the workers oversubscribe the cores, a slowdown of **one to two\n", + "orders of magnitude**.\n", + "\n", + "So scqubits can choose for you. This notebook shows the easy path —\n", + "`recommend_parallelization` and `num_cpus=\"auto\"` — and a one-time per-machine calibration,\n", + "and then explains what is being balanced under the hood." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "import scqubits as scq" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A system to sweep\n", + "\n", + "We use three capacitively coupled tunable transmons and sweep the flux of the first. The\n", + "dressed Hilbert space has dimension `6**3 = 216`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def build_hilbertspace():\n", + "    # Three identical tunable transmons, each truncated to 6 levels.\n", + "    qubits = [\n", + "        scq.TunableTransmon(\n", + "            EJmax=30.0, EC=0.2, d=0.1, flux=0.0, ng=0.0, ncut=50,\n", + "            truncated_dim=6, id_str=f\"tmon{i}\",\n", + "        )\n", + "        for i in range(3)\n", + "    ]\n", + "    hs = scq.HilbertSpace(qubits)\n", + "    # Couple nearest neighbors through their charge operators.\n", + "    for i in range(2):\n", + "        hs.add_interaction(\n", + "            g_strength=0.1, op1=qubits[i].n_operator, op2=qubits[i + 1].n_operator\n", + "        )\n", + "\n", + "    def update(flux):\n", + "        # Only the first transmon's flux is swept.\n", + "        qubits[0].flux = flux\n", + "\n", + "    return hs, update\n", + "\n", + "\n", + "hs, update = build_hilbertspace()\n", + "print(\"dressed dimension:\", hs.dimension)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The easy way: let scqubits choose\n", + "\n", + "`scq.recommend_parallelization` reads the workload (Hilbert-space dimension, number of grid\n", + "points, eigenvalue count, and whether sparse diagonalization applies) and returns a\n", + "recommended `num_cpus` together with a per-worker BLAS-thread cap. It is a *pure* function: it\n", + "starts no worker processes, so it is safe to call anywhere, and it does not run the sweep.\n", + "\n", + "Because a `ParameterSweep` runs the moment it is constructed, call\n", + "`recommend_parallelization` *before* building the sweep. 
Here is its choice for our system at two grid sizes:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for n_points in (16, 384):\n", + " cfg = scq.recommend_parallelization(\n", + " hilbertspace=hs, num_points=n_points, evals_count=20\n", + " )\n", + " print(f\"{n_points:>4} points -> num_cpus={cfg.num_cpus}, blas_threads={cfg.blas_threads}\")\n", + " print(f\" {cfg.reason}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A 16-point sweep stays serial — there are too few points to repay the cost of starting and\n", + "feeding worker processes — while the 384-point sweep is spread across workers, each capped to\n", + "a single BLAS thread so they do not oversubscribe the cores.\n", + "\n", + "To apply the recommendation without copying numbers by hand, pass the sentinel\n", + "`num_cpus=\"auto\"` to the sweep. The choice is then made automatically, *before* the sweep\n", + "runs:\n", + "\n", + "```python\n", + "sweep = scq.ParameterSweep(..., num_cpus=\"auto\")\n", + "```\n", + "\n", + "To make *every* sweep that does not specify `num_cpus` tune itself this way, set\n", + "`scq.settings.AUTO_PARALLEL = True`.\n", + "\n", + "These are the same auto-tuner with different reach — `num_cpus=\"auto\"` opts in for one sweep, while `AUTO_PARALLEL = True` makes it the default for every sweep where you don't pass `num_cpus`. An explicit number always wins:\n", + "\n", + "```text\n", + "num_cpus=4 -> exactly 4 workers (you decide)\n", + "num_cpus=\"auto\" -> auto-tuner decides (always)\n", + "num_cpus omitted -> auto-tuner if AUTO_PARALLEL=True, else serial (the default)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The automatic choice changes only *how* a sweep is computed, never the result. 
We confirm\n", + "that by running the same 384-point sweep both serially and with `num_cpus=\"auto\"`, and\n", + "comparing the spectra:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "flux_vals = np.linspace(0.0, 0.5, 384)\n", + "\n", + "serial = scq.ParameterSweep(\n", + "    hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n", + "    update_hilbertspace=update, evals_count=20, num_cpus=1,\n", + ")\n", + "auto = scq.ParameterSweep(\n", + "    hilbertspace=hs, paramvals_by_name={\"flux\": flux_vals},\n", + "    update_hilbertspace=update, evals_count=20, num_cpus=\"auto\",\n", + ")\n", + "print(\"spectra identical:\", np.allclose(serial[\"evals\"][:], auto[\"evals\"][:]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Tune to your machine (optional, run once)\n", + "\n", + "The recommendation above uses conservative built-in thresholds that work everywhere. For choices tuned to *your* hardware, run the one-time calibration. It times a short battery of sweeps in isolated subprocesses to measure this machine's per-task dispatch overhead, the one-time pool-startup cost, and the per-point diagonalization cost (dense and sparse), then writes the results to `~/.scqubits/parallel_calibration.json`. The whole run takes about a minute.\n", + "\n", + "The calibration is **just data**: it does nothing on its own. Once the file exists, every `recommend_parallelization` / `num_cpus=\"auto\"` call reads it and makes a sharper, machine-specific choice instead of using the generic defaults.\n", + "\n", + "**Re-running overwrites the file, so recalibrate freely.** In particular, *do* recalibrate if a previous run was taken under bad conditions: while the machine was busy, or on a laptop that was on battery or CPU-throttled (many laptops clock down hard when unplugged, which makes the calibration over-estimate every cost, so `\"auto\"` then under-parallelizes). 
For the most representative numbers, calibrate on an otherwise-idle machine plugged into wall power.\n", + "\n", + "Because it launches its measurements as `python -m` subprocesses, the call needs no `if __name__ == \"__main__\":` guard, in Jupyter or in a plain script." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "scq.calibrate_parallelization()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once the calibration exists, `recommend_parallelization` (and `num_cpus=\"auto\"`) use the\n", + "measured break-even — parallelizing only once the grid is large enough to repay the measured\n", + "pool-startup cost. Re-running the recommendation now reflects your machine:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for n_points in (16, 384):\n", + " cfg = scq.recommend_parallelization(\n", + " hilbertspace=hs, num_points=n_points, evals_count=20\n", + " )\n", + " print(f\"{n_points:>4} points -> num_cpus={cfg.num_cpus} ({cfg.reason})\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What scqubits is balancing for you\n", + "\n", + "There is no single right answer because two effects pull in opposite directions.\n", + "\n", + "**1. The grid break-even.** Sending a grid point to a worker costs a fixed amount (pickling,\n", + "inter-process hand-off), and starting the pool costs a one-time amount (about a second when\n", + "workers are *spawned*, as on macOS and Windows). Parallelism pays off only when\n", + "\n", + "> (number of grid points) × (cost per point) ≫ that fixed overhead.\n", + "\n", + "Few points, or cheap points, stay faster serially — which is why the heuristic keeps small\n", + "sweeps on a single process.\n", + "\n", + "**2. The BLAS oversubscription cliff.** Every eigensolve already runs on a multithreaded BLAS\n", + "backend. 
If several workers each use all cores, the cores are oversubscribed by a factor of\n", + "`num_cpus`, which on large dense matrices is not a small slowdown but a collapse. Measured on\n", + "a 10-core Mac mini with five capacitively coupled fluxonia (dressed dimension 3125, dense)\n", + "in a 16-point sweep:\n", + "\n", + "| configuration | wall time |\n", + "|---|---:|\n", + "| `num_cpus=1` | 42 s |\n", + "| `num_cpus=4`, BLAS uncapped | **3608 s** (~86x slower) |\n", + "| `num_cpus=4`, BLAS capped to 1 | 40 s |\n", + "| `num_cpus=8`, BLAS capped to 1 | 28 s |\n", + "\n", + "This is why the recommendation always pairs a worker count with a BLAS-thread cap, keeping\n", + "`num_cpus x BLAS-threads` near the core count." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "## Manual control\n\nThe same knobs are available directly if you would rather set them yourself:\n\n```python\nscq.settings.NUM_CPUS = 4  # default worker count when num_cpus is unset\nscq.settings.MULTIPROC_BLAS_THREADS = 1  # per-worker BLAS-thread cap during a sweep\n```\n\n`MULTIPROC_BLAS_THREADS` already defaults to `\"auto\"`, which caps each worker to\n`cores // num_cpus` so parallel sweeps never oversubscribe the cores; setting an integer\njust overrides that with a fixed cap (use `None` to opt out entirely). Rule of thumb: keep\n`num_cpus x BLAS-threads` near the number of physical cores. The cap reaches spawn-based\nworkers (macOS, Windows) through the thread-count environment variables, and fork-based\nworkers (Linux) through `threadpoolctl`.\n\nFor large composite systems the per-point **diagonalization method** is often a bigger lever\nthan parallelism: scqubits uses sparse diagonalization by default for large spectra (see\n`scq.settings.AUTO_SPARSE_DIAG`), which can be far faster per point. Once each point is cheap,\nparallelism helps even less, so try sparse first, and parallelize second."
+ }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Running as a script\n", + "\n", + "In Jupyter, everything above runs as shown. In a plain Python script on macOS or Windows,\n", + "workers are *spawned*, which re-imports your script in each worker — so the entry point that\n", + "triggers a parallel sweep must be guarded:\n", + "\n", + "```python\n", + "import scqubits as scq\n", + "\n", + "if __name__ == \"__main__\":\n", + " sweep = scq.ParameterSweep(..., num_cpus=\"auto\")\n", + "```\n", + "\n", + "Linux (which forks) and Jupyter need no guard. scqubits prints a one-time reminder the first\n", + "time it spawns workers outside of IPython." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "- **Let scqubits choose.** `scq.recommend_parallelization(...)` recommends `num_cpus` and a\n", + " BLAS-thread cap from the workload; `num_cpus=\"auto\"` applies it per sweep, and\n", + " `scq.settings.AUTO_PARALLEL = True` applies it everywhere.\n", + "- **Calibrate once** with `scq.calibrate_parallelization()` for advice measured on your own\n", + " hardware.\n", + "- Parallelism helps only when the grid is large enough to repay the fixed overhead, and the\n", + " BLAS-thread cap is what keeps many workers from oversubscribing the cores.\n", + "- All of this takes effect live, without restarting the kernel, in Jupyter and in scripts\n", + " alike." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}