PML-386 Update README roadmap for 0.4.1 #279
@claude this ticket is fine, but instead of reviewing it, do a full optimisation review of Merlin, particularly of CPU-GPU transfers: whether they are redundant (even on GPU we observe a lot of CPU wallclock time) and ways in which we could be more efficient with them.
Claude finished @ben9871's task in 3m 9s

Optimisation review: CPU↔GPU transfers & host-side stalls

You asked me to set the README ticket aside and instead audit Merlin for redundant CPU↔GPU traffic and the "lots of CPU wallclock even on GPU" symptom. I traced the per-forward hot path:
🔴 Critical: this is almost certainly your "CPU wallclock on GPU"

1. A
    next_size = int(destinations.max().item()) + 1
The key point:

2. It is created CPU-only at
    self.norm_factor_output = torch.tensor([v[0] for k, v in keys_and_factors], dtype=self.dtype)  # no device=
but
    amplitudes *= torch.sqrt(self.norm_factor_output.to(amplitudes.device))
is a fresh H2D copy of a constant tensor, every forward, on every graph. Move it inside

🟠 Medium

3.
    if self.output_map_func is not None:
        self.target_indices.to(dtype=dtype, device=self.device)  # result discarded; `dtype` is undefined

4.
    p_tensor = torch.tensor(p, device=unitary.device, dtype=torch.long)
Small, but it's a host-list→device construction in the inner loop. For a fixed input state

5. The inner kernels defensively

🟡 Minor
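Items 2 and 3 of the review can be illustrated with a minimal sketch. It assumes the constant can be registered as a module buffer; the review's own suggested location is truncated above, so this is one possible fix and not necessarily the one intended. `OutputNormalizer` and `norm_factor_sqrt` are hypothetical names, not Merlin identifiers.

```python
import torch
from torch import nn


class OutputNormalizer(nn.Module):
    """Hypothetical stand-in for the module that owns norm_factor_output."""

    def __init__(self, factors, dtype=torch.float32):
        super().__init__()
        # A buffer follows the module on .to(device) / .cuda(), so forward()
        # does not need to copy this constant from host to device on each call.
        # persistent=False keeps it out of state_dict, like a plain attribute.
        self.register_buffer(
            "norm_factor_sqrt",
            torch.sqrt(torch.tensor(factors, dtype=dtype)),
            persistent=False,
        )

    def forward(self, amplitudes):
        # No .to(amplitudes.device) here: the buffer already lives on the
        # module's device, and the sqrt was taken once at construction.
        return amplitudes * self.norm_factor_sqrt


# Tensor.to() is not in-place: it returns a tensor and leaves the original
# untouched, so the result must be assigned for the call to have any effect.
indices = torch.arange(4)
indices.to(dtype=torch.float64)              # result discarded, still int64
converted = indices.to(dtype=torch.float64)  # assigned, conversion applies
```

With this layout, calling `module.to("cuda")` once moves the constant along with the parameters, which is why the per-forward `.to(amplitudes.device)` becomes unnecessary.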
Summary
Update the README to match the current version.
Related Issue
Type of change
Proposed changes
How to test / How to run
Screenshots / Logs (optional)
Performance considerations (optional)
Documentation
Checklist
PR title includes Jira issue key (e.g., PML-126)
"Related Jira ticket" section includes the Jira issue key (no URL)
Code formatted (ruff format)
Lint passes (ruff)
Static typing passes (mypy) if applicable
Unit tests added/updated (pytest)
Tests pass locally (pytest)
Tests pass on GPU (pytest)
Test coverage not decreased significantly
Docs build locally if affected (sphinx)
With this command:
the docs are built without any warnings or errors.
New public classes/methods/packages are added in the API following the methodology presented in other files.
Dependencies updated (if needed) and pinned appropriately
PR description explains what changed and how to validate it