Skip to content

PML-386 Update README roadmap for 0.4.1#279

Merged
LF-Vigneux merged 1 commit into
merlinquantum:release/0.4.1from
ben9871:PML-386
Jun 26, 2026
Merged

PML-386 Update README roadmap for 0.4.1#279
LF-Vigneux merged 1 commit into
merlinquantum:release/0.4.1from
ben9871:PML-386

Conversation

@ben9871

@ben9871 ben9871 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

update readme to match current version

Related Issue

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor / Cleanup
  • Performance improvement
  • CI / Build / Tooling
  • Breaking change (requires migration notes)

Proposed changes

How to test / How to run

  1. Command lines
Block of code

Screenshots / Logs (optional)

Performance considerations (optional)

Documentation

  • User docs updated (Sphinx)
  • Examples / notebooks updated
  • Docstrings updated
  • Updated the API

Checklist

  • PR title includes Jira issue key (e.g., PML-126)

  • "Related Jira ticket" section includes the Jira issue key (no URL)

  • Code formatted (ruff format)

  • Lint passes (ruff)

  • Static typing passes (mypy) if applicable

  • Unit tests added/updated (pytest)

  • Tests pass locally (pytest)

  • Tests pass on GPU (pytest)

  • Test coverage not decreased significantly

  • Docs build locally if affected (sphinx)

  • With this command:

      > SPHINXOPTS="-W --keep-going -n" make -C docs clean html 
    

    the docs are built without any warning or errors.

  • New public classes/methods/packages are added in the API following the methodology presented in other files.

  • Dependencies updated (if needed) and pinned appropriately

  • PR description explains what changed and how to validate it

@ben9871

ben9871 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Summary

update readme to match current version

Related Issue

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor / Cleanup
  • Performance improvement
  • CI / Build / Tooling
  • Breaking change (requires migration notes)

Proposed changes

How to test / How to run

  1. Command lines
Block of code

Screenshots / Logs (optional)

Performance considerations (optional)

Documentation

  • User docs updated (Sphinx)
  • Examples / notebooks updated
  • Docstrings updated
  • Updated the API

Checklist

  • PR title includes Jira issue key (e.g., PML-126)

  • "Related Jira ticket" section includes the Jira issue key (no URL)

  • Code formatted (ruff format)

  • Lint passes (ruff)

  • Static typing passes (mypy) if applicable

  • Unit tests added/updated (pytest)

  • Tests pass locally (pytest)

  • Tests pass on GPU (pytest)

  • Test coverage not decreased significantly

  • Docs build locally if affected (sphinx)

  • With this command:

      > SPHINXOPTS="-W --keep-going -n" make -C docs clean html 
    

    the docs are built without any warning or errors.

  • New public classes/methods/packages are added in the API following the methodology presented in other files.

  • Dependencies updated (if needed) and pinned appropriately

  • PR description explains what changed and how to validate it

@claude this ticket is fine but instead of reviewing this, do a full optimisation-wise review of merlin particularly on CPU-GPU transfers, if they are redundant( even on GPU we observe alot of CPU wallclock time ) and ways in which we could be more efficient with this.

@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown

Claude finished @ben9871's task in 3m 9s —— View job


Optimisation review: CPU↔GPU transfers & host-side stalls

You asked me to set the README ticket aside and instead audit Merlin for redundant CPU↔GPU traffic and the "lots of CPU wallclock even on GPU" symptom. I traced the per-forward hot path: QuantumLayer.forwardComputationProcessSLOSComputeGraph.compute* (merlin/pcvl_pytorch/slos_torchscript.py) plus the unitary builder CircuitConverter.to_tensor (merlin/pcvl_pytorch/locirc_to_tensor.py). Findings below, ordered by impact.

  • Map the codebase & hot paths
  • Hunt for .cpu() / .item() / .numpy() / .tolist() sync points
  • Check device placement & redundant .to(device) transfers
  • Check torch.tensor(...) rebuilds in the forward pass
  • Write up findings

🔴 Critical — this is almost certainly your "CPU wallclock on GPU"

1. A .item() device sync runs on every layer of every forward pass.

merlin/pcvl_pytorch/slos_torchscript.py:142 (and :235 for the batched path):

next_size = int(destinations.max().item()) + 1

.item() forces a cudaStreamSynchronize — the CPU blocks until all queued GPU work drains before it can read the scalar back. This sits inside the per-photon layer loop (compute at :679, compute_batch at :788), so a circuit with k photons pays k full pipeline stalls per forward. That is exactly the profile of "GPU is idle, CPU wallclock is high" — the launch queue never gets to run ahead because every layer re-synchronizes.

The key point: destinations is a constant tensor built once in _build_graph_structure / prepare_vectorized_operations. Its .max() never changes after construction. It should be computed once at build time and stored as a plain Python int per layer (e.g. alongside the (sources, destinations, modes) tuple in self.vectorized_operations), then read with zero sync in the hot loop. Same fix applies to layer_compute_backward (:304-305).

Fix this →

2. norm_factor_output is re-uploaded host→device on every forward.

It is created CPU-only at :494:

self.norm_factor_output = torch.tensor([v[0] for k, v in keys_and_factors], dtype=self.dtype)  # no device=

but SLOSComputeGraph.to() (:852) only moves vectorized_operations — it never moves norm_factor_output. So compute (:687) and compute_batch (:800-801) both run:

amplitudes *= torch.sqrt(self.norm_factor_output.to(amplitudes.device))

— a fresh H2D copy of a constant tensor, every forward, on every graph. Move it inside to() and cache the device-resident version (or register it as a buffer).

Fix this →


🟠 Medium

3. SLOSComputeGraph.to() has a dead, buggy line. At :881:

if self.output_map_func is not None:
    self.target_indices.to(dtype=dtype, device=self.device)   # result discarded; `dtype` is undefined

.to() returns a new tensor — the result is thrown away, so target_indices never actually moves (it would then be re-coerced every forward). It also references an undefined dtype, so this would NameError for any graph with an output_map_func. Should be self.target_indices = self.target_indices.to(device=self.device).

4. p_tensor is rebuilt from a Python list every layer, every forward. layer_compute_batch at :240:

p_tensor = torch.tensor(p, device=unitary.device, dtype=torch.long)

Small, but it's a host-list→device construction in the inner loop. For a fixed input state p is constant across forwards; it can be cached per layer.

5. The inner kernels defensively .to(device) constant index tensors on every calllayer_compute_vectorized does modes.to(unitary.device), sources.to(prev_amplitudes.device), contributions.to(destinations.device), destinations.repeat(...) (:145-163), and layer_compute_batch likewise (:244-272). When graph.to(cuda) has been called these are cheap no-ops, but if a user passes a CUDA unitary to a graph whose tensors are still on CPU, each of these silently re-uploads a constant every forward instead of failing fast. Per AGENTS.md ("let it fail clearly", "no silent fallbacks"), this is better handled by asserting device equality once at entry and relying on to() to place tensors — not by per-element coercion that hides the mismatch and pays the transfer repeatedly.


🟡 Minor — CircuitConverter.to_tensor rebuilds constants each forward

In merlin/pcvl_pytorch/locirc_to_tensor.py, to_tensor runs every forward and reconstructs constant tensors via torch.tensor([...]):

  • BS base matrices [[1, 1j], [1j, 1]] etc. at :721-731 (BS is the most common component, so this recurs a lot);
  • constant BS params at :700-702 and the .to(self.device) coercions at :743-746.

These are constant per (convention, device, dtype) and can be cached at compile time. Lower impact than the SLOS sync, but it's avoidable per-forward Python + allocation overhead.


✅ Not problems (checked, legitimate)

  • reservoir_classifier.py .detach().cpu().numpy() (:481, 679-681, …) — these are in sklearn-style fit/predict/feature-caching, not the autograd hot path.
  • merlin_processor.py:985 input_tensor.cpu() and the .numpy() chunks (:1110, 1403, 2089) — the Perceval/hardware sampling backend genuinely runs on CPU, so the transfer is required, not redundant.
  • state_vector.py / probability_distribution.py .tolist()/.item() calls are in sparse-state serialization/inspection helpers, not forward.

How to confirm the win

Wrap a training step in torch.profiler.profile(activities=[CPU, CUDA]) and look for cudaStreamSynchronize / aten::item events — you should see one per photon-layer per forward today. Running with CUDA_LAUNCH_BLOCKING=0 and checking that GPU kernels queue ahead of the CPU (rather than ping-ponging) will show the difference once #1 lands. I'd expect #1 alone to recover most of the lost wallclock on multi-photon circuits.

I've kept this to a review (no code changed). If you'd like, I can implement #1#3 on this branch — they're contained and #1/#2 are pure wins with no behavior change. Just say the word.
· branch PML-386

@LF-Vigneux LF-Vigneux left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve

@LF-Vigneux LF-Vigneux merged commit a053c8b into merlinquantum:release/0.4.1 Jun 26, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants