build(deps): bump actions/checkout from 5 to 6 #4
Merged
Bumps [actions/checkout](https://github.com/actions/checkout) from 5 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v5...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
…knobs

Kernel-side scaffolding for the six-cut minimal-tier port from main's df21286. No host-side wiring yet: existing call sites continue to go through the thin launch_t*_match wrappers and see no behavior change. The next commit wires cuts #1, #2, #3 in GpuPipeline.cu streaming impl + BatchPlotter dispatch.

src/gpu/T1Kernel.{cu,cuh}: split launch_t1_match into launch_t1_match_prepare (computes bucket + fine-bucket offset arrays once per plot, resets d_out_count) and launch_t1_match_range (runs match_all_buckets over a [b_begin, b_end) bucket sub-range, accumulating into d_out_meta + d_out_mi + d_out_count via atomicAdd). The original launch_t1_match becomes a thin prepare+range wrapper for the pool path and parity tests. match_all_buckets gains a uint32_t bucket_begin parameter; bucket_id is now bucket_begin + blockIdx.y so range launches resolve to the correct (section_l, match_key_r) tuples. This mirrors the existing T2 / T3 prepare-range plumbing (d4f54ae and b86939f). Used by the upcoming cut #4 (T1 match sliced per section_l).

src/gpu/T3Kernel.{cu,cuh}: T3 match_all_buckets gains two int64_t biases (meta_l_index_bias, meta_r_index_bias) that shift the kernel-internal global l/r indices into a sliced-meta buffer position. Full-cap callers pass biases = 0 so indexing is unchanged. The existing launch_t3_match_range wrapper passes 0/0; behavior preserved.

Add launch_t3_match_section_pair_range: accepts a sliced d_sorted_meta buffer (section_l + section_r rows packed) plus the two biases. Used by the upcoming cut #3 (T3 match section-pair input slicing): d_t2_meta_sorted parked on pinned host across T3 match, the two row slices H2D'd per pass, d_t2_xbits_sorted + d_t2_keys_merged stay full-cap on device for binary-search / target reads. Drops T3 match peak from 5200 → ~3700 MB at k=28.

Expose matching_section_host(section_l, num_section_bits) so the streaming caller can compute section_r on the host side from section_l (the kernel still does this internally; this helper avoids duplicating the rotation math at the wiring site).

src/host/GpuPipeline.hpp: StreamingPinnedScratch gains two knobs:
- gather_tile_count (default 1): T1 / T2 sort gather tile count. When >= 2, the merged-key + permuted-meta gather output is D2H'd per tile to host pinned (h_meta / h_keys_merged) so the cap-sized sorted_meta never has to be alive on device in full. Drops T1-sort and T2-sort phase peaks from 5200 → ~3640 MB at k=28.
- t3_input_slice_count (default 1): T3 match input-slice count. When >= 2, d_t2_meta_sorted is parked on h_meta across T3 match and each pass H2Ds the section_l + section_r row slices onto cap/N device buffers. Must equal num_sections (= 4 at k=28 strength=2) when active.

Defaults preserve old compact-tier behavior. The minimal tier will set both in the upcoming BatchPlotter wiring.

All TUs nvcc-clean at sm_89. Existing parity tests + pool path unaffected: they call launch_t1_match / launch_t3_match (thin wrappers) which preserve the original API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
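The prepare/range split described in this commit depends on each range launch offsetting its block index by bucket_begin. The following is a minimal host-side sketch of that indexing only, not the kernel itself: match_range_sim, match_full_sim and match_sliced_sim are hypothetical names, block_y stands in for blockIdx.y, and the 4 × 4 bucket layout used in the usage note is an assumption chosen to line up with the 17-entry offsets array mentioned for k=28 strength=2.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Host-side stand-in for a match kernel launched over [bucket_begin, bucket_end).
// It records which bucket ids the launch would touch; no matching is done.
void match_range_sim(uint32_t bucket_begin, uint32_t bucket_end,
                     std::vector<uint32_t>& visited) {
    assert(bucket_begin <= bucket_end);
    const uint32_t grid_y = bucket_end - bucket_begin;  // blocks in this launch
    for (uint32_t block_y = 0; block_y < grid_y; ++block_y) {
        // Without the bucket_begin offset every range launch would restart at bucket 0.
        const uint32_t bucket_id = bucket_begin + block_y;
        visited.push_back(bucket_id);
    }
}

// Thin wrapper: a single full-range launch, as the original one-call API would do.
std::vector<uint32_t> match_full_sim(uint32_t num_buckets) {
    std::vector<uint32_t> visited;
    match_range_sim(0, num_buckets, visited);
    return visited;
}

// Sliced caller: one range launch per section, each covering buckets_per_section buckets.
std::vector<uint32_t> match_sliced_sim(uint32_t num_sections,
                                       uint32_t buckets_per_section) {
    std::vector<uint32_t> visited;
    for (uint32_t s = 0; s < num_sections; ++s) {
        match_range_sim(s * buckets_per_section, (s + 1) * buckets_per_section, visited);
    }
    return visited;
}
```

Under these assumptions, match_sliced_sim(4, 4) visits the same bucket ids in the same order as match_full_sim(16), which is the property that lets the thin wrapper and the sliced caller share one kernel.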
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Closes the gap d4f54ae's caveat flagged: cuda-only minimal was aspirational (kMinimalFloorBytes = 3828 MiB advertised, but real peak still 5200 MB at T1 sort / T2 sort / T3 match). The three remaining SYCL-branch cuts now land on this branch and bring all three phases below the 4 GiB cliff.

Cut #1: T1 sort gather tiled. src/host/GpuPipeline.cu T1 sort phase: when scratch.gather_tile_count >= 2, the per-tile sort-merge feeds a tiled gather instead of a single-shot one. Per tile: gather_u64 to a cap/N device tile, D2H to h_meta on host (whose unsorted-meta park lifetime ended at the JIT H2D into d_t1_meta a few lines earlier, so h_meta is dead and free for reuse as the sorted-meta accumulator). After the loop, free d_t1_meta + merged_vals + tile, allocate d_t1_meta_sorted full-cap, H2D from h_meta. Live-set during gather drops from 8 + 8 + 4 = 20 cap (5200 MB) to 8 + 8/N + 4 = 12 + 8/N cap. At N=4: 14 cap = 3640 MB.

Cut #2: T2 sort meta + xbits gathers tiled, deferred re-hydrate. Mirror of cut #1 at the T2 sort gather sites, plus a deferred re-hydrate so d_t2_meta_sorted (8 cap) and d_t2_xbits_sorted (4 cap) don't co-reside with d_merged_vals (4 cap). Both accumulators land on host first (h_meta + h_t2_xbits), then d_merged_vals is freed, then both sorted streams are re-hydrated full-cap on device for T3 match. Gather peak: 5200 → ~3640 MB. Re-hydrate peak: ~3120 MB.

Cut #3: T3 match section-pair input slicing. src/host/GpuPipeline.cu T3 match phase: a new t3_input_slice_path branch precedes the existing t3_stage_path. When scratch.t3_input_slice_count >= 2, cut #2's deferred re-hydrate skips the d_t2_meta_sorted H2D entirely, so T2 meta stays parked on h_meta. The T3 match phase then:
1. launch_t3_match_prepare to populate d_offsets in the temp storage region.
2. D2H d_offsets so the host loop can compute section_l / section_r row spans. Tiny (17 × 8 = 136 bytes at k=28 strength=2).
3. For each section_l ∈ [0, num_sections): compute section_r via matching_section_host, look up the row spans, H2D the section_l + section_r meta rows from h_meta into a cap/2 device slice buffer (tightly packed at indices [0, l_count) and [l_count, l_count + r_count)), set the kernel biases to map global l/r → slice indices, run launch_t3_match_section_pair_range over the section_l × num_match_keys bucket sub-range, D2H d_t3_stage to a per-plot pinned h_t3_acc accumulator at offset t3_count, increment t3_count.
4. After all section_l: free d_t2_meta_slice + d_t3_stage + d_t3_match_temp + d_t2_xbits_sorted + d_t2_keys_merged, allocate d_t3 full-cap, H2D from h_t3_acc, free h_t3_acc.

Per-plot pinned h_t3_acc (cap × T3PairingGpu = cap × u64) is necessary because h_meta is in active read-use across the section_l loop and can't double as the existing t3_stage_path's accumulator. T3 match peak: 5200 → ~3700 MB (cap/2 meta slice 1040 + cap xbits 1040 + cap keys_merged 1040 + cap/4 t3 stage 520 + offsets ~80 = ~3720 MB at k=28).

src/host/BatchPlotter.cpp: minimal tier sets gather_tile_count = 4 (= num_sections at k=28 strength=2) and t3_input_slice_count = num_sections. Dispatch message updated to advertise the layered cuts. kMinimalFloorBytes stays 3828 MiB, which already matches expected peak (~3700 MB) + 128 MB margin.

README.md: minimal-tier description rewritten to describe the three layered cuts, the new bottleneck (T3 match at ~3700 MB), and the wider 4-GiB-card target. The b86939f-era "N=8 T2 staging only" wording was stale after d4f54ae shifted the bottleneck.

Verification on hardware (RTX 4090 was main's verification host):
- k=22 batch across plain / compact / minimal must produce byte-identical .plot2 output (cuts re-shape memory only).
- k=28 minimal forced under POS2GPU_MAX_VRAM_MB=4096 should dispatch minimal and complete; POS2GPU_STREAMING_STATS=1 should confirm peak ≤ ~3700 MB.
- k=28 minimal vs k=28 compact must be byte-identical.

Cuts #4 (T1 match sliced) and #6 (Xs gen+sort+pack tiled) deferred: they're additive savings on phases that are no longer the bottleneck after the above three. Cut #4's kernel-side split landed in the previous commit so the wiring is straightforward when needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
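The bias mapping in step 3 of cut #3 can be sketched as plain index arithmetic. This is a minimal sketch, assuming the kernel computes slice_index = global_index + bias; that sign convention, the RowSpan and SliceBiases types, and the helper names are all hypothetical. Only the packed layout ([0, l_count) followed by [l_count, l_count + r_count)) and the two bias parameter names come from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Row span [begin, begin + count) of one section in the full sorted-meta array.
struct RowSpan {
    int64_t begin;
    int64_t count;
};

// The two per-launch biases named in the commit message.
struct SliceBiases {
    int64_t meta_l_index_bias;
    int64_t meta_r_index_bias;
};

// Packed slice layout: section_l rows occupy [0, l.count),
// section_r rows occupy [l.count, l.count + r.count).
// Assumed convention: slice_index = global_index + bias.
SliceBiases compute_biases(RowSpan l, RowSpan r) {
    return SliceBiases{-l.begin, l.count - r.begin};
}

// Map a global left-side row index to its position in the packed slice.
int64_t slice_index_l(int64_t global_l, SliceBiases b) {
    return global_l + b.meta_l_index_bias;
}

// Map a global right-side row index to its position in the packed slice.
int64_t slice_index_r(int64_t global_r, SliceBiases b) {
    return global_r + b.meta_r_index_bias;
}
```

With both biases at zero the helpers return the global index unchanged, which matches the statement that full-cap callers pass 0/0 and see no change in indexing.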
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Cuts #1+#2+#3 brought T1 sort, T2 sort, and T3 match below 4 GiB, but T1 match was unaffected and stayed at ~5280 MB at k=28 (d_xs 2080 + d_t1_meta 2080 + d_t1_mi 1040 + temp ~80), making it the new overall pipeline peak. Cut #4 closes that gap.

src/host/GpuPipeline.cu T1 match phase: when scratch.gather_tile_count >= 2, gate a tiled_t1_match branch that uses the existing launch_t1_match_prepare + launch_t1_match_range plumbing landed in commit bca9bf1. Each section_l pass writes to cap/N device staging buffers (cap/N × u64 meta + cap/N × u32 mi), D2H'd per pass to scratch.h_meta + a per-plot pinned h_t1_mi accumulator at offset t1_count. After all passes, free stage + d_xs and re-hydrate d_t1_mi full-cap from h_t1_mi for the upcoming T1 sort. d_t1_meta is never allocated: h_meta already holds the unsorted meta when entering T1 sort, so the existing park step becomes a no-op (now gated on d_t1_meta != nullptr).

Peak: d_xs (2080) + cap/N × 12 (stage) + temp ≈ 2940 MB at N=4 (= num_sections at k=28 strength=2). Plain / compact paths unchanged.

src/host/BatchPlotter.cpp: dispatch message updated to advertise "N=4 T1-match" alongside the existing "T1/T2 sort gather" and "T3 input slicing" cuts.

README.md: minimal-tier description rewritten as four layered cuts (was three); adds cut #4 and re-orders the "compact's tied 5200 MB" summary to include T1 match.

After this commit the cuda-only minimal-tier peak budget at k=28 strength=2 should be:

  Xs phase : ~3072 MB (unchanged, no cut #6 yet)
  T1 match : ~2940 MB (cut #4, was 5280)
  T1 sort  : ~3640 MB (cut #1, was 5200)
  T2 match : ~3640 MB (existing N=8 staging)
  T2 sort  : ~3640 MB (cut #2, was 5200)
  T3 match : ~3700 MB (cut #3, was 5200)
  T3 sort  : ~3155 MB (no change needed)

Overall peak: ~3700 MB at T3 match, which fits kMinimalFloorBytes (3828 MiB = ~3700 MB + 128 MB margin) and the 4 GiB-card floor.

Cut #6 (Xs gen+sort+pack tiled) deferred: Xs already under 4 GiB and not the bottleneck.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
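The per-phase figures quoted in these commit messages follow from a simple live-set calculation, which the sketch below reproduces. It assumes that one byte per entry across a cap-sized buffer costs 260 MB at k=28, a constant derived here from the messages' own numbers (cap × u32 = 1040 MB, cap × u64 = 2080 MB) rather than taken from the code; kMbPerCapByte and peak_mb are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 1 byte per entry over a cap-sized buffer = 260 MB at k=28,
// back-derived from "cap x u32 = 1040 MB" in the commit messages.
constexpr int64_t kMbPerCapByte = 260;

// full_bytes  : bytes per entry held in full-cap device buffers
// tiled_bytes : bytes per entry held only as a cap/N tile
// n_tiles     : N, the tile count (1 means untiled)
// temp_mb     : fixed scratch / offsets overhead in MB
constexpr int64_t peak_mb(int64_t full_bytes, int64_t tiled_bytes,
                          int64_t n_tiles, int64_t temp_mb) {
    return kMbPerCapByte * full_bytes
         + (kMbPerCapByte * tiled_bytes) / n_tiles
         + temp_mb;
}
```

Under that assumption, peak_mb(20, 0, 1, 0) gives the untiled sort gather peak of 5200 MB, peak_mb(12, 8, 4, 0) gives 3640 MB for cut #1 at N=4, peak_mb(20, 0, 1, 80) gives the untiled T1 match peak of 5280 MB, and peak_mb(8, 12, 4, 80) gives 2940 MB for cut #4.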
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Closes the last cap × 4 (uint32) hot-spot. The non-tiled Xs phase peaks at four cap-sized uint32 buffers + CUB DoubleBuffer scratch (~4136 MB at k=28); cut #6 generates once into 2 cap × u32, then sorts in N tiles using cap/N alternate buffers, accumulates into host pinned, and packs into d_xs without ever holding 4 cap on device.

src/host/GpuPipeline.cu Xs phase: when scratch.gather_tile_count >= 2 and scratch.h_meta != nullptr, take a tiled_xs branch:
1. Allocate d_xs_keys_full + d_xs_vals_full (2 cap × u32).
2. launch_xs_gen → fill them.
3. Allocate one shared cap/N alternate pair (keys + vals) + CUB scratch sized for tile_cap_xs.
4. For each tile in [0, N): CUB DoubleBuffer SortPairs over the slice, D2H sorted (key, val) pair to scratch.h_meta reinterpreted as a 2-cap u32 buffer (h_xs_keys at the first cap entries, h_xs_vals at the next cap; h_meta is cap × u64 = 2 cap × u32 of storage, with total_xs <= cap so both halves fit). h_meta gets overwritten by T1 match's cut #4 D2H later, so reusing it through Xs is safe.
5. Free per-tile alt + scratch + d_xs_keys_full + d_xs_vals_full (peak drops to 0 device-side).
6. Host paired stable merge (cut #5 shape) over h_xs_keys + h_xs_vals so the host buffers end up globally sorted by match_info with vals tiebreak.
7. Allocate d_xs (cap × XsCandidateGpu = 2 cap) and pack via two cudaMemcpy2DAsync H2D copies: the match_info field gets h_xs_keys at struct stride 8, the x field gets h_xs_vals at the same stride. No separate d_xs_keys_b / d_xs_vals_b on-device pack pair needed.

Per-phase peak: 2 cap (full keys+vals) + 2 cap/N (sort alt) + scratch ≈ 2.5 cap = 2570 MB at N=4. Final d_xs alloc is the post-merge peak at ~2 cap = 2080 MB. Plain / compact paths unchanged (gated on the same tier flags as the other cuts).

src/host/BatchPlotter.cpp: kMinimalFloorBytes 4356 → 3768 MiB (= 3640 measured peak + 128 MiB margin). Dispatch message "3.68 GiB floor".

README.md: minimal-tier description rewritten as six layered cuts with measured per-phase peaks (Xs 2570, T1/T2 sort 3640, T3 match/sort 3640) and the new ~31 s/plot wall (vs ~12 s compact) reflecting the host-CPU merge overhead. Top-of-file streaming-floor summary 4.25 → 3.7 GiB. 4 GiB cards now targeted (with the standard "real 4 GiB hardware reports ~3.5 GiB free post-CUDA-context, please report actual fit" caveat).

tools/xchplot2/cli.cpp: --tier help "minimal = ~3.7 GiB floor, fits 4 GiB".

Verification on RTX 4090 (XCHPLOT2_STREAMING=1 + --tier minimal, POS2GPU_STREAMING_STATS=1):
- k=22 plain / compact / minimal byte-identical (sha256 17dbf594…).
- k=28 minimal byte-identical with k=28 compact (sha256 f42e62ad…).
- k=28 minimal peak 4228 → 3640 MB; the bottleneck is now T1 sort / T2 sort / T3 match / T3 sort all tied at 3640 MB (T2 match was already at this level via the existing N=8 staging).
- k=28 minimal wall: ~31 s/plot (vs ~12 s compact). The 2.6× slowdown matches the SYCL-branch's measured ~34 vs ~13 s for the same six-cut configuration on sm_89.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
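Steps 4 to 7 of the tiled Xs path (sort each tile, merge on the host, pack into the candidate struct) can be illustrated with a host-only sketch. This is a simplified stand-in under several assumptions: std::sort over (key, val) pairs replaces the per-tile CUB SortPairs and D2H copy, std::inplace_merge replaces the paired stable merge, a plain loop replaces the two strided cudaMemcpy2DAsync copies, and XsCandidate with its two u32 fields is a hypothetical layout inferred from the "stride 8" remark. tiled_sort_and_pack is not a function from the repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical 8-byte candidate layout: one u32 key field, one u32 value field.
struct XsCandidate {
    uint32_t match_info;
    uint32_t x;
};

std::vector<XsCandidate> tiled_sort_and_pack(const std::vector<uint32_t>& keys,
                                             const std::vector<uint32_t>& vals,
                                             std::size_t n_tiles) {
    assert(keys.size() == vals.size() && n_tiles >= 1);
    const std::size_t total = keys.size();

    // Host accumulator holding (key, val) pairs, standing in for h_xs_keys / h_xs_vals.
    std::vector<std::pair<uint32_t, uint32_t>> host(total);
    for (std::size_t i = 0; i < total; ++i) host[i] = {keys[i], vals[i]};

    // Sort each tile independently; only one tile would need device memory at a time.
    const std::size_t tile = (total + n_tiles - 1) / n_tiles;
    std::vector<std::size_t> tile_ends;
    for (std::size_t b = 0; b < total; b += tile) {
        const std::size_t e = std::min(b + tile, total);
        std::sort(host.begin() + b, host.begin() + e);  // by key, then val
        tile_ends.push_back(e);
    }

    // Merge the sorted runs left to right so the whole buffer is globally sorted.
    for (std::size_t i = 1; i < tile_ends.size(); ++i) {
        std::inplace_merge(host.begin(), host.begin() + tile_ends[i - 1],
                           host.begin() + tile_ends[i]);
    }

    // Pack keys and vals into the interleaved struct array (the strided copy step).
    std::vector<XsCandidate> out(total);
    for (std::size_t i = 0; i < total; ++i) {
        out[i] = XsCandidate{host[i].first, host[i].second};
    }
    return out;
}
```

The design point the sketch tries to show is that tiling changes only where the intermediate buffers live: the packed result for N tiles is the same as for a single full sort, which is consistent with the byte-identical verification reported above.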
Bumps actions/checkout from 5 to 6.
Release notes
Sourced from actions/checkout's releases.
Changelog
Sourced from actions/checkout's changelog.
... (truncated)
Commits
- de0fac2 Fix tag handling: preserve annotations and explicit fetch-tags (#2356)
- 064fe7f Add orchestration_id to git user-agent when ACTIONS_ORCHESTRATION_ID is set (...
- 8e8c483 Clarify v6 README (#2328)
- 033fa0d Add worktree support for persist-credentials includeIf (#2327)
- c2d88d3 Update all references from v5 and v4 to v6 (#2314)
- 1af3b93 update readme/changelog for v6 (#2311)
- 71cf226 v6-beta (#2298)
- 069c695 Persist creds to a separate file (#2286)
- ff7abcd Update README to include Node.js 24 support details and requirements (#2248)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- @dependabot rebase will rebase this PR
- @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
- @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
- @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)