PML-386 Update README roadmap for 0.4.1 by ben9871 · Pull Request #279 · merlinquantum/merlin

ben9871 · 2026-06-26T08:27:05Z

Summary

update readme to match current version

Related Issue

Type of change

Proposed changes

How to test / How to run

Command lines

Block of code

Screenshots / Logs (optional)

Performance considerations (optional)

Documentation

User docs updated (Sphinx)
Examples / notebooks updated
Docstrings updated
Updated the API

Checklist

PR title includes Jira issue key (e.g., PML-126)
"Related Jira ticket" section includes the Jira issue key (no URL)
Code formatted (ruff format)
Lint passes (ruff)
Static typing passes (mypy) if applicable
Unit tests added/updated (pytest)
Tests pass locally (pytest)
Tests pass on GPU (pytest)
Test coverage not decreased significantly
Docs build locally if affected (sphinx)
With this command:
```
  > SPHINXOPTS="-W --keep-going -n" make -C docs clean html 
```
the docs are built without any warning or errors.
New public classes/methods/packages are added in the API following the methodology presented in other files.
Dependencies updated (if needed) and pinned appropriately
PR description explains what changed and how to validate it

ben9871 · 2026-06-26T08:32:06Z

Summary

update readme to match current version

Related Issue

Type of change

Bug fix

New feature

Documentation update

Refactor / Cleanup

Performance improvement

CI / Build / Tooling

Breaking change (requires migration notes)

Proposed changes

How to test / How to run

Command lines
Block of code
Screenshots / Logs (optional)

Performance considerations (optional)

Documentation

User docs updated (Sphinx)

Examples / notebooks updated

Docstrings updated

Updated the API

Checklist
PR title includes Jira issue key (e.g., PML-126)

"Related Jira ticket" section includes the Jira issue key (no URL)

Code formatted (ruff format)

Lint passes (ruff)

Static typing passes (mypy) if applicable

Unit tests added/updated (pytest)

Tests pass locally (pytest)

Tests pass on GPU (pytest)

Test coverage not decreased significantly

Docs build locally if affected (sphinx)
With this command:
  > SPHINXOPTS="-W --keep-going -n" make -C docs clean html 
the docs are built without any warning or errors.
New public classes/methods/packages are added in the API following the methodology presented in other files.

Dependencies updated (if needed) and pinned appropriately

PR description explains what changed and how to validate it

@claude this ticket is fine but instead of reviewing this, do a full optimisation-wise review of merlin particularly on CPU-GPU transfers, if they are redundant( even on GPU we observe alot of CPU wallclock time ) and ways in which we could be more efficient with this.

claude · 2026-06-26T08:32:23Z

Claude finished @ben9871's task in 3m 9s —— View job

Optimisation review: CPU↔GPU transfers & host-side stalls

You asked me to set the README ticket aside and instead audit Merlin for redundant CPU↔GPU traffic and the "lots of CPU wallclock even on GPU" symptom. I traced the per-forward hot path: QuantumLayer.forward → ComputationProcess → SLOSComputeGraph.compute* (merlin/pcvl_pytorch/slos_torchscript.py) plus the unitary builder CircuitConverter.to_tensor (merlin/pcvl_pytorch/locirc_to_tensor.py). Findings below, ordered by impact.

Map the codebase & hot paths
Hunt for .cpu() / .item() / .numpy() / .tolist() sync points
Check device placement & redundant .to(device) transfers
Check torch.tensor(...) rebuilds in the forward pass
Write up findings

🔴 Critical — this is almost certainly your "CPU wallclock on GPU"

1. A .item() device sync runs on every layer of every forward pass.

merlin/pcvl_pytorch/slos_torchscript.py:142 (and :235 for the batched path):

next_size = int(destinations.max().item()) + 1

.item() forces a cudaStreamSynchronize — the CPU blocks until all queued GPU work drains before it can read the scalar back. This sits inside the per-photon layer loop (compute at :679, compute_batch at :788), so a circuit with k photons pays k full pipeline stalls per forward. That is exactly the profile of "GPU is idle, CPU wallclock is high" — the launch queue never gets to run ahead because every layer re-synchronizes.

The key point: destinations is a constant tensor built once in _build_graph_structure / prepare_vectorized_operations. Its .max() never changes after construction. It should be computed once at build time and stored as a plain Python int per layer (e.g. alongside the (sources, destinations, modes) tuple in self.vectorized_operations), then read with zero sync in the hot loop. Same fix applies to layer_compute_backward (:304-305).

Fix this →

2. norm_factor_output is re-uploaded host→device on every forward.

It is created CPU-only at :494:

self.norm_factor_output = torch.tensor([v[0] for k, v in keys_and_factors], dtype=self.dtype)  # no device=

but SLOSComputeGraph.to() (:852) only moves vectorized_operations — it never moves norm_factor_output. So compute (:687) and compute_batch (:800-801) both run:

amplitudes *= torch.sqrt(self.norm_factor_output.to(amplitudes.device))

— a fresh H2D copy of a constant tensor, every forward, on every graph. Move it inside to() and cache the device-resident version (or register it as a buffer).

Fix this →

🟠 Medium

3. SLOSComputeGraph.to() has a dead, buggy line. At :881:

if self.output_map_func is not None:
    self.target_indices.to(dtype=dtype, device=self.device)   # result discarded; `dtype` is undefined

.to() returns a new tensor — the result is thrown away, so target_indices never actually moves (it would then be re-coerced every forward). It also references an undefined dtype, so this would NameError for any graph with an output_map_func. Should be self.target_indices = self.target_indices.to(device=self.device).

4. p_tensor is rebuilt from a Python list every layer, every forward. layer_compute_batch at :240:

p_tensor = torch.tensor(p, device=unitary.device, dtype=torch.long)

Small, but it's a host-list→device construction in the inner loop. For a fixed input state p is constant across forwards; it can be cached per layer.

5. The inner kernels defensively .to(device) constant index tensors on every call — layer_compute_vectorized does modes.to(unitary.device), sources.to(prev_amplitudes.device), contributions.to(destinations.device), destinations.repeat(...) (:145-163), and layer_compute_batch likewise (:244-272). When graph.to(cuda) has been called these are cheap no-ops, but if a user passes a CUDA unitary to a graph whose tensors are still on CPU, each of these silently re-uploads a constant every forward instead of failing fast. Per AGENTS.md ("let it fail clearly", "no silent fallbacks"), this is better handled by asserting device equality once at entry and relying on to() to place tensors — not by per-element coercion that hides the mismatch and pays the transfer repeatedly.

🟡 Minor — `CircuitConverter.to_tensor` rebuilds constants each forward

In merlin/pcvl_pytorch/locirc_to_tensor.py, to_tensor runs every forward and reconstructs constant tensors via torch.tensor([...]):

BS base matrices [[1, 1j], [1j, 1]] etc. at :721-731 (BS is the most common component, so this recurs a lot);
constant BS params at :700-702 and the .to(self.device) coercions at :743-746.

These are constant per (convention, device, dtype) and can be cached at compile time. Lower impact than the SLOS sync, but it's avoidable per-forward Python + allocation overhead.

✅ Not problems (checked, legitimate)

reservoir_classifier.py .detach().cpu().numpy() (:481, 679-681, …) — these are in sklearn-style fit/predict/feature-caching, not the autograd hot path.
merlin_processor.py:985 input_tensor.cpu() and the .numpy() chunks (:1110, 1403, 2089) — the Perceval/hardware sampling backend genuinely runs on CPU, so the transfer is required, not redundant.
state_vector.py / probability_distribution.py .tolist()/.item() calls are in sparse-state serialization/inspection helpers, not forward.

How to confirm the win

Wrap a training step in torch.profiler.profile(activities=[CPU, CUDA]) and look for cudaStreamSynchronize / aten::item events — you should see one per photon-layer per forward today. Running with CUDA_LAUNCH_BLOCKING=0 and checking that GPU kernels queue ahead of the CPU (rather than ping-ponging) will show the difference once #1 lands. I'd expect #1 alone to recover most of the lost wallclock on multi-photon circuits.

I've kept this to a review (no code changed). If you'd like, I can implement #1–#3 on this branch — they're contained and #1/#2 are pure wins with no behavior change. Just say the word.
· branch PML-386

LF-Vigneux

I approve

Update README roadmap for 0.4.1

eeeeefa

LF-Vigneux approved these changes Jun 26, 2026

View reviewed changes

LF-Vigneux merged commit a053c8b into merlinquantum:release/0.4.1 Jun 26, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

PML-386 Update README roadmap for 0.4.1#279

PML-386 Update README roadmap for 0.4.1#279
LF-Vigneux merged 1 commit into
merlinquantum:release/0.4.1from
ben9871:PML-386

ben9871 commented Jun 26, 2026

Uh oh!

ben9871 commented Jun 26, 2026

Summary

Related Issue

Type of change

Proposed changes

How to test / How to run

Screenshots / Logs (optional)

Performance considerations (optional)

Documentation

Checklist

Uh oh!

claude Bot commented Jun 26, 2026 •

edited

Loading

Uh oh!

LF-Vigneux left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

ben9871 commented Jun 26, 2026

Summary

Related Issue

Type of change

Proposed changes

How to test / How to run

Screenshots / Logs (optional)

Performance considerations (optional)

Documentation

Checklist

Uh oh!

ben9871 commented Jun 26, 2026

Summary

Related Issue

Type of change

Proposed changes

How to test / How to run

Screenshots / Logs (optional)

Performance considerations (optional)

Documentation

Checklist

Uh oh!

claude Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Optimisation review: CPU↔GPU transfers & host-side stalls

🔴 Critical — this is almost certainly your "CPU wallclock on GPU"

🟠 Medium

🟡 Minor — CircuitConverter.to_tensor rebuilds constants each forward

✅ Not problems (checked, legitimate)

How to confirm the win

Uh oh!

LF-Vigneux left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented Jun 26, 2026 •

edited

Loading

🟡 Minor — `CircuitConverter.to_tensor` rebuilds constants each forward