build(deps): bump actions/checkout from 5 to 6#4

Merged
Jsewill merged 1 commit into main from dependabot/github_actions/actions/checkout-6
Apr 27, 2026

Conversation

Contributor

dependabot (bot) commented on behalf of github on Apr 27, 2026

Bumps actions/checkout from 5 to 6.

Release notes

Sourced from actions/checkout's releases.

v6.0.0

What's Changed

Full Changelog: actions/checkout@v5.0.0...v6.0.0

v6-beta

What's Changed

Updated persist-credentials to store the credentials under $RUNNER_TEMP instead of directly in the local git config.

This requires a minimum Actions Runner version of v2.329.0 to access the persisted credentials for Docker container action scenarios.

v5.0.1

What's Changed

Full Changelog: actions/checkout@v5...v5.0.1

Changelog

Sourced from actions/checkout's changelog.

Changelog

v6.0.2

v6.0.1

v6.0.0

v5.0.1

v5.0.0

v4.3.1

v4.3.0

v4.2.2

v4.2.1

v4.2.0

v4.1.7

v4.1.6

... (truncated)

Commits

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [actions/checkout](https://github.com/actions/checkout) from 5 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v5...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
dependabot (bot) added the dependencies (Pull requests that update a dependency file) and github_actions (Pull requests that update GitHub Actions code) labels on Apr 27, 2026
@Jsewill Jsewill merged commit 752a39b into main Apr 27, 2026
11 checks passed
dependabot (bot) deleted the dependabot/github_actions/actions/checkout-6 branch on April 27, 2026 21:43
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
…knobs

Kernel-side scaffolding for the six-cut minimal-tier port from main's
df21286. No host-side wiring yet — existing call sites continue to
go through the thin launch_t*_match wrappers and see no behavior
change. The next commit wires cuts #1, #2, #3 in GpuPipeline.cu
streaming impl + BatchPlotter dispatch.

  src/gpu/T1Kernel.{cu,cuh}: split launch_t1_match into
    launch_t1_match_prepare (computes bucket + fine-bucket offset
    arrays once per plot, resets d_out_count) and launch_t1_match_range
    (runs match_all_buckets over a [b_begin, b_end) bucket sub-range,
    accumulating into d_out_meta + d_out_mi + d_out_count via atomicAdd).
    The original launch_t1_match becomes a thin prepare+range wrapper
    for the pool path and parity tests. match_all_buckets gains a
    uint32_t bucket_begin parameter; bucket_id is now bucket_begin +
    blockIdx.y so range launches resolve to the correct (section_l,
    match_key_r) tuples — mirror of the existing T2 / T3 prepare-range
    plumbing (d4f54ae and b86939f). Used by the upcoming cut #4
    (T1 match sliced per section_l).
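
A host-side C++ stand-in may help picture the range split. It is a sketch, not the kernel: the flat layout `bucket_id = section_l * num_match_keys + match_key_r`, and the helpers `decode_bucket`, `visit_range`, and `visit_all_by_section`, are assumptions made for illustration. Only the `bucket_begin + blockIdx.y` offset comes from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical flat bucket layout (an assumption, not taken from the source):
// bucket_id = section_l * num_match_keys + match_key_r.
std::pair<uint32_t, uint32_t> decode_bucket(uint32_t bucket_id,
                                            uint32_t num_match_keys) {
    return {bucket_id / num_match_keys, bucket_id % num_match_keys};
}

// Stand-in for one range launch over [b_begin, b_end). block_y plays the role
// of blockIdx.y; adding b_begin recovers the global bucket id.
std::vector<uint32_t> visit_range(uint32_t b_begin, uint32_t b_end) {
    assert(b_begin <= b_end);
    std::vector<uint32_t> visited;
    for (uint32_t block_y = 0; block_y < b_end - b_begin; ++block_y) {
        const uint32_t bucket_id = b_begin + block_y;
        visited.push_back(bucket_id);
    }
    return visited;
}

// Issues one range per section_l and concatenates the visited bucket ids.
std::vector<uint32_t> visit_all_by_section(uint32_t num_sections,
                                           uint32_t num_match_keys) {
    std::vector<uint32_t> all;
    for (uint32_t s = 0; s < num_sections; ++s) {
        const auto part =
            visit_range(s * num_match_keys, (s + 1) * num_match_keys);
        all.insert(all.end(), part.begin(), part.end());
    }
    return all;
}
```

Under that assumed layout, four per-section ranges over a 4 × 4 bucket grid visit the same bucket ids, in the same order, as a single launch over [0, 16).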

  src/gpu/T3Kernel.{cu,cuh}: T3 match_all_buckets gains two int64_t
    biases (meta_l_index_bias, meta_r_index_bias) that shift the
    kernel-internal global l/r indices into a sliced-meta buffer
    position. Full-cap callers pass biases = 0 so indexing is
    unchanged. The existing launch_t3_match_range wrapper passes
    0/0; behavior preserved.

    Add launch_t3_match_section_pair_range — accepts a sliced
    d_sorted_meta buffer (section_l + section_r rows packed) plus
    the two biases. Used by the upcoming cut #3 (T3 match section-pair
    input slicing): d_t2_meta_sorted parked on pinned host across T3
    match, the two row slices H2D'd per pass, d_t2_xbits_sorted +
    d_t2_keys_merged stay full-cap on device for binary-search /
    target reads. Drops T3 match peak from 5200 → ~3700 MB at k=28.

    Expose matching_section_host(section_l, num_section_bits) so the
    streaming caller can compute section_r on the host side from
    section_l (the kernel still does this internally; this helper
    avoids duplicating the rotation math at the wiring site).

  src/host/GpuPipeline.hpp: StreamingPinnedScratch gains two knobs:
    - gather_tile_count (default 1) — T1 / T2 sort gather tile
      count. When >= 2, the merged-key + permuted-meta gather output
      is D2H'd per tile to host pinned (h_meta / h_keys_merged) so
      the cap-sized sorted_meta never has to be alive on device in
      full. Drops T1-sort and T2-sort phase peaks from 5200 → ~3640
      MB at k=28.
    - t3_input_slice_count (default 1) — T3 match input-slice count.
      When >= 2, d_t2_meta_sorted is parked on h_meta across T3 match
      and each pass H2Ds the section_l + section_r row slices onto
      cap/N device buffers. Must equal num_sections (= 4 at k=28
      strength=2) when active.

    Defaults preserve old compact-tier behavior. The minimal tier
    will set both in the upcoming BatchPlotter wiring.
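
The two knobs and the constraint on `t3_input_slice_count` could be expressed roughly as follows. `ScratchKnobs`, `gather_is_tiled`, `t3_input_is_sliced`, and `knobs_valid` are hypothetical names for this sketch; the real `StreamingPinnedScratch` has other members that are not shown, and rejecting a count of 0 is an assumption rather than something the commit states.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the two knobs described above.
struct ScratchKnobs {
    uint32_t gather_tile_count = 1;     // 1 keeps the single-shot gather
    uint32_t t3_input_slice_count = 1;  // 1 keeps the full-cap T3 match input
};

// A knob is "active" once it is set to 2 or more.
bool gather_is_tiled(const ScratchKnobs& k) { return k.gather_tile_count >= 2; }
bool t3_input_is_sliced(const ScratchKnobs& k) {
    return k.t3_input_slice_count >= 2;
}

// When input slicing is active, the slice count must equal num_sections.
// Treating 0 as invalid is an assumption of this sketch.
bool knobs_valid(const ScratchKnobs& k, uint32_t num_sections) {
    if (k.gather_tile_count == 0 || k.t3_input_slice_count == 0) return false;
    if (t3_input_is_sliced(k) && k.t3_input_slice_count != num_sections) {
        return false;
    }
    return true;
}
```

With `num_sections = 4`, the defaults pass, a minimal-tier setting of 4 and 4 passes, and a slice count of 2 is rejected.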

All TUs nvcc-clean at sm_89. Existing parity tests + pool path
unaffected — they call launch_t1_match / launch_t3_match (thin
wrappers) which preserve the original API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Closes the gap d4f54ae's caveat flagged: cuda-only minimal was
aspirational (kMinimalFloorBytes = 3828 MiB advertised, but real
peak still 5200 MB at T1 sort / T2 sort / T3 match). The three
remaining SYCL-branch cuts now land on this branch and bring all
three phases below the 4 GiB cliff.

  Cut #1 — T1 sort gather tiled.
    src/host/GpuPipeline.cu T1 sort phase: when
    scratch.gather_tile_count >= 2, the per-tile sort-merge feeds a
    tiled gather instead of a single-shot one. Per tile: gather_u64
    to a cap/N device
    tile, D2H to h_meta on host (whose unsorted-meta park lifetime
    ended at the JIT H2D into d_t1_meta a few lines earlier — h_meta
    is dead, free for reuse as the sorted-meta accumulator). After
    the loop, free d_t1_meta + merged_vals + tile, allocate
    d_t1_meta_sorted full-cap, H2D from h_meta. Live-set during
    gather drops from 8 + 8 + 4 = 20 cap (5200 MB) to 8 + 8/N + 4 =
    12 + 8/N cap. At N=4: 14 cap = 3640 MB.
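
The tiled gather can be pictured with a CPU stand-in. This is a sketch under stated assumptions, not the code in GpuPipeline.cu: `tiled_gather` is a hypothetical name, `tile` stands in for the cap/N device tile, and `host_acc` stands in for the pinned h_meta accumulator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Computes out[i] = src[perm[i]] in n_tiles chunks, so only one tile-sized
// output buffer is alive at a time. The std::copy models the per-tile D2H.
std::vector<uint64_t> tiled_gather(const std::vector<uint64_t>& src,
                                   const std::vector<uint32_t>& perm,
                                   std::size_t n_tiles) {
    assert(n_tiles >= 1);
    const std::size_t total = perm.size();
    const std::size_t tile_cap = (total + n_tiles - 1) / n_tiles;
    std::vector<uint64_t> host_acc(total);  // sorted-meta accumulator
    std::vector<uint64_t> tile(tile_cap);   // reused for every tile
    for (std::size_t begin = 0; begin < total; begin += tile_cap) {
        const std::size_t count = std::min(tile_cap, total - begin);
        for (std::size_t i = 0; i < count; ++i) {
            tile[i] = src[perm[begin + i]];  // gather into the tile
        }
        std::copy(tile.begin(),
                  tile.begin() + static_cast<std::ptrdiff_t>(count),
                  host_acc.begin() + static_cast<std::ptrdiff_t>(begin));
    }
    return host_acc;
}
```

The result does not depend on the tile count, which is consistent with the claim that the cuts reshape memory only.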

  Cut #2 — T2 sort meta + xbits gathers tiled, deferred re-hydrate.
    Mirror of cut #1 at the T2 sort gather sites, plus a deferred
    re-hydrate so d_t2_meta_sorted (8 cap) and d_t2_xbits_sorted
    (4 cap) don't co-reside with d_merged_vals (4 cap). Both
    accumulators land on host first (h_meta + h_t2_xbits), then
    d_merged_vals is freed, then both sorted streams are re-hydrated
    full-cap on device for T3 match. Gather peak: 5200 → ~3640 MB.
    Re-hydrate peak: ~3120 MB.
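
The quoted peaks can be reproduced with a little arithmetic. The sketch below assumes the T2 gather has the same 8 + 8/N + 4 bytes-per-entry shape as cut #1 (which the "mirror of cut #1" wording suggests) and that the re-hydrate peak is the 8 + 4 of the two sorted streams; the commit gives the totals but does not spell out these sums. `kMbPerCapByte` is back-derived from 20 cap = 5200 MB and is not a constant from the source.

```cpp
#include <cassert>
#include <cstdint>

// MB occupied by one byte per entry across a full-cap buffer at k=28,
// back-derived from the commit's own figure 20 cap = 5200 MB.
constexpr int64_t kMbPerCapByte = 5200 / 20;  // 260

// Gather live set with the sorted output tiled N ways (assumed shape):
// 8 (input meta) + 8/N (output tile) + 4 (merged vals) bytes per entry.
constexpr int64_t gather_peak_mb(int64_t n_tiles) {
    return (8 + 4) * kMbPerCapByte + (8 * kMbPerCapByte) / n_tiles;
}

// Re-hydrate live set (assumed): meta_sorted (8) + xbits_sorted (4), with the
// merged vals already freed.
constexpr int64_t rehydrate_peak_mb() { return (8 + 4) * kMbPerCapByte; }

static_assert(gather_peak_mb(1) == 5200, "single-shot gather");
static_assert(gather_peak_mb(4) == 3640, "N=4 tiled gather");
static_assert(rehydrate_peak_mb() == 3120, "deferred re-hydrate");
```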

  Cut #3 — T3 match section-pair input slicing.
    src/host/GpuPipeline.cu T3 match phase: a new t3_input_slice_path
    branch precedes the existing t3_stage_path. When
    scratch.t3_input_slice_count >= 2, cut #2's deferred re-hydrate
    skips the d_t2_meta_sorted H2D entirely — T2 meta stays parked
    on h_meta. The T3 match phase then:

    1. launch_t3_match_prepare to populate d_offsets in the temp
       storage region.
    2. D2H d_offsets so the host loop can compute section_l /
       section_r row spans. Tiny (17 × 8 = 136 bytes at k=28
       strength=2).
    3. For each section_l ∈ [0, num_sections): compute section_r via
       matching_section_host, look up the row spans, H2D the
       section_l + section_r meta rows from h_meta into a cap/2
       device slice buffer (tightly packed at indices [0, l_count)
       and [l_count, l_count + r_count)), set the kernel biases to
       map global l/r → slice indices, run
       launch_t3_match_section_pair_range over the section_l ×
       num_match_keys bucket sub-range, D2H d_t3_stage to a per-plot
       pinned h_t3_acc accumulator
       at offset t3_count, increment t3_count.
    4. After all section_l: free d_t2_meta_slice + d_t3_stage +
       d_t3_match_temp + d_t2_xbits_sorted + d_t2_keys_merged,
       allocate d_t3 full-cap, H2D from h_t3_acc, free h_t3_acc.

    Per-plot pinned h_t3_acc (cap × T3PairingGpu = cap × u64) is
    necessary because h_meta is in active read-use across the
    section_l loop and can't double as the existing t3_stage_path's
    accumulator. T3 match peak: 5200 → ~3700 MB (cap/2 meta slice
    1040 + cap xbits 1040 + cap keys_merged 1040 + cap/4 t3 stage
    520 + offsets ~80 = ~3720 MB at k=28).
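
One way to read the bias step in the loop above is sketched below. It assumes the kernel adds each bias to the global index, which the commit implies ("shift the kernel-internal global l/r indices") but does not state as a formula. `RowSpan`, `SliceBiases`, `compute_biases`, and `to_slice_index` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Row span [begin, begin + count) of one section in the full-cap sorted meta.
struct RowSpan {
    int64_t begin;
    int64_t count;
};

// Biases for a slice buffer packed as [section_l rows | section_r rows].
struct SliceBiases {
    int64_t meta_l_index_bias;
    int64_t meta_r_index_bias;
};

// section_l rows sit at slice indices [0, l.count), so global l indices shift
// by -l.begin. section_r rows sit at [l.count, l.count + r.count), so global
// r indices shift by l.count - r.begin.
SliceBiases compute_biases(RowSpan l, RowSpan r) {
    return {-l.begin, l.count - r.begin};
}

// Assumed kernel-side mapping: slice index = global index + bias.
int64_t to_slice_index(int64_t global_index, int64_t bias) {
    return global_index + bias;
}
```

For example, with section_l at rows [100, 150) and section_r at rows [300, 340), global row 100 maps to slice index 0 and global row 300 maps to slice index 50. A full-cap caller passing biases of 0 gets its indices back unchanged.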

  src/host/BatchPlotter.cpp: minimal tier sets gather_tile_count = 4
    (= num_sections at k=28 strength=2) and t3_input_slice_count =
    num_sections. Dispatch message updated to advertise the layered
    cuts. kMinimalFloorBytes stays 3828 MiB — already matches
    expected peak (~3700 MB) + 128 MB margin.

  README.md: minimal-tier description rewritten to describe the
    three layered cuts, the new bottleneck (T3 match at ~3700 MB),
    and the wider 4-GiB-card target. The b86939f-era "N=8 T2
    staging only" wording was stale after d4f54ae shifted the
    bottleneck.

Verification checklist for hardware (main was verified on an RTX 4090):
  - k=22 batch across plain / compact / minimal must produce
    byte-identical .plot2 output (cuts re-shape memory only).
  - k=28 minimal forced under POS2GPU_MAX_VRAM_MB=4096 should
    dispatch minimal and complete; POS2GPU_STREAMING_STATS=1 should
    confirm peak ≤ ~3700 MB.
  - k=28 minimal vs k=28 compact must be byte-identical.

Cuts #4 (T1 match sliced) and #6 (Xs gen+sort+pack tiled) deferred —
they're additive savings on phases that are no longer the
bottleneck after the above three. Cut #4's kernel-side split landed
in the previous commit so the wiring is straightforward when
needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Cuts #1+#2+#3 brought T1 sort, T2 sort, and T3 match below 4 GiB,
but T1 match was unaffected and stayed at ~5280 MB at k=28
(d_xs 2080 + d_t1_meta 2080 + d_t1_mi 1040 + temp ~80) — the new
overall pipeline peak.  Cut #4 closes that gap.

  src/host/GpuPipeline.cu T1 match phase: when
    scratch.gather_tile_count >= 2, gate a tiled_t1_match branch that
    uses the existing launch_t1_match_prepare + launch_t1_match_range
    plumbing landed
    in commit bca9bf1.  Each section_l pass writes to cap/N device
    staging buffers (cap/N × u64 meta + cap/N × u32 mi), D2H'd per
    pass to scratch.h_meta + a per-plot pinned h_t1_mi accumulator
    at offset t1_count.  After all passes, free stage + d_xs and
    re-hydrate d_t1_mi full-cap from h_t1_mi for the upcoming T1
    sort.  d_t1_meta is never allocated — h_meta already holds the
    unsorted meta when entering T1 sort, so the existing park step
    becomes a no-op (now gated on d_t1_meta != nullptr).

    Peak: d_xs (2080) + cap/N × 12 (stage) + temp ≈ 2940 MB at
    N=4 (= num_sections at k=28 strength=2).  Plain / compact paths
    unchanged.
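
The per-pass accumulation at offset t1_count can be sketched on the host as follows. `append_pass` is a hypothetical helper, and the assumption is that t1_count advances by the number of entries each pass produced; the commit only says the passes are written "at offset t1_count".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies the first pass_count staged entries into host_acc at offset
// count_so_far (modelling the per-pass D2H) and returns the new running count.
std::size_t append_pass(std::vector<uint64_t>& host_acc,
                        std::size_t count_so_far,
                        const std::vector<uint64_t>& stage,
                        std::size_t pass_count) {
    assert(pass_count <= stage.size());
    assert(count_so_far + pass_count <= host_acc.size());
    std::copy_n(stage.begin(), pass_count,
                host_acc.begin() + static_cast<std::ptrdiff_t>(count_so_far));
    return count_so_far + pass_count;
}
```

Because each pass lands directly after the previous one, the accumulator ends up tightly packed regardless of how many entries each section_l pass emits.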

  src/host/BatchPlotter.cpp: dispatch message updated to advertise
    "N=4 T1-match" alongside the existing "T1/T2 sort gather" and
    "T3 input slicing" cuts.

  README.md: minimal-tier description rewritten as four layered
    cuts (was three) — adds cut #4 and re-orders the "compact's
    tied 5200 MB" summary to include T1 match.

After this commit the cuda-only minimal-tier peak budget at
k=28 strength=2 should be:

  Xs phase    : ~3072 MB (unchanged, no cut #6 yet)
  T1 match    : ~2940 MB (cut #4, was 5280)
  T1 sort     : ~3640 MB (cut #1, was 5200)
  T2 match    : ~3640 MB (existing N=8 staging)
  T2 sort     : ~3640 MB (cut #2, was 5200)
  T3 match    : ~3700 MB (cut #3, was 5200)
  T3 sort     : ~3155 MB (no change needed)

Overall peak: ~3700 MB at T3 match — fits kMinimalFloorBytes
(3828 MiB = ~3700 MB + 128 MB margin) and the 4 GiB-card floor.
Cut #6 (Xs gen+sort+pack tiled) deferred — Xs already under 4 GiB
and not the bottleneck.
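
The budget check above can be written down as a small sketch. The per-phase figures are copied from the table in this commit message; `PhasePeak`, `minimal_tier_budget`, `overall_peak`, and `fits_floor` are hypothetical names, and the comparison treats MB and MiB as the same unit, as the commit message itself does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct PhasePeak {
    std::string name;
    int64_t mb;
};

// Expected per-phase peaks at k=28 strength=2, copied from the table above.
std::vector<PhasePeak> minimal_tier_budget() {
    return {
        {"Xs phase", 3072}, {"T1 match", 2940}, {"T1 sort", 3640},
        {"T2 match", 3640}, {"T2 sort", 3640},  {"T3 match", 3700},
        {"T3 sort", 3155},
    };
}

// Returns the phase with the largest peak (the first one on ties).
PhasePeak overall_peak(const std::vector<PhasePeak>& phases) {
    assert(!phases.empty());
    return *std::max_element(
        phases.begin(), phases.end(),
        [](const PhasePeak& a, const PhasePeak& b) { return a.mb < b.mb; });
}

// True when the largest peak plus the safety margin fits under the floor.
bool fits_floor(const std::vector<PhasePeak>& phases, int64_t floor_mb,
                int64_t margin_mb) {
    return overall_peak(phases).mb + margin_mb <= floor_mb;
}
```

With a floor of 3828 and a margin of 128, the check passes with no slack, since 3700 + 128 = 3828.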

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Closes the last cap × 4 (uint32) hot-spot. The non-tiled Xs phase
peaks at four cap-sized uint32 buffers + CUB DoubleBuffer scratch
(~4136 MB at k=28); cut #6 generates once into 2 cap × u32, then
sorts in N tiles using cap/N alternate buffers, accumulates into
host pinned, and packs into d_xs without ever holding 4 cap on
device.

  src/host/GpuPipeline.cu Xs phase: when scratch.gather_tile_count
    >= 2 and scratch.h_meta != nullptr, take a tiled_xs branch:

    1. Allocate d_xs_keys_full + d_xs_vals_full (2 cap × u32).
    2. launch_xs_gen → fill them.
    3. Allocate one shared cap/N alternate pair (keys + vals) +
       CUB scratch sized for tile_cap_xs.
    4. For each tile in [0, N): CUB DoubleBuffer SortPairs over
       the slice, D2H sorted (key, val) pair to scratch.h_meta
       reinterpreted as a 2-cap u32 buffer (h_xs_keys at the
       first cap entries, h_xs_vals at the next cap — h_meta is
       cap × u64 = 2 cap × u32 of storage, with total_xs <= cap
       so both halves fit). h_meta gets overwritten by T1
       match's cut #4 D2H later, so reusing it through Xs is safe.
    5. Free per-tile alt + scratch + d_xs_keys_full +
       d_xs_vals_full (peak drops to 0 device-side).
    6. Host paired stable merge (cut #5 shape) over h_xs_keys +
       h_xs_vals so the host buffers end up globally sorted by
       match_info with vals tiebreak.
    7. Allocate d_xs (cap × XsCandidateGpu = 2 cap) and pack via
       two cudaMemcpy2DAsync H2D copies — match_info field gets
       h_xs_keys at struct stride 8, x field gets h_xs_vals at
       the same stride. No separate d_xs_keys_b / d_xs_vals_b
       on-device pack pair needed.

    Per-phase peak: 2 cap (full keys+vals) + 2 cap/N (sort alt)
    + scratch ≈ 2.5 cap = 2570 MB at N=4. Final d_xs alloc is
    the post-merge peak at ~2 cap = 2080 MB. Plain / compact
    paths unchanged (gated on the same tier flags as the other
    cuts).
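
The shape of steps 4 to 7 can be sketched on the CPU. This is a stand-in under stated assumptions, not the pipeline code: `std::sort` on (key, val) pairs replaces the per-tile CUB SortPairs, `std::inplace_merge` replaces the host paired stable merge, and two plain loops replace the two strided cudaMemcpy2DAsync copies. `XsCandidate` and its field order are hypothetical; the commit only says the struct stride is 8 bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical 8-byte candidate layout; the field order is an assumption.
struct XsCandidate {
    uint32_t match_info;
    uint32_t x;
};
static_assert(sizeof(XsCandidate) == 8, "two packed u32 fields");

std::vector<XsCandidate> tiled_sort_and_pack(const std::vector<uint32_t>& keys,
                                             const std::vector<uint32_t>& vals,
                                             std::size_t n_tiles) {
    assert(keys.size() == vals.size() && n_tiles >= 1);
    const std::size_t total = keys.size();
    const std::size_t tile_cap = (total + n_tiles - 1) / n_tiles;

    // Host accumulator of (key, val) pairs, standing in for h_xs_keys/h_xs_vals.
    std::vector<std::pair<uint32_t, uint32_t>> host(total);
    for (std::size_t i = 0; i < total; ++i) host[i] = {keys[i], vals[i]};
    auto at = [&](std::size_t i) {
        return host.begin() + static_cast<std::ptrdiff_t>(i);
    };

    // Sort each tile on its own (stand-in for the per-tile device sort + D2H).
    for (std::size_t begin = 0; begin < total; begin += tile_cap) {
        std::sort(at(begin), at(std::min(begin + tile_cap, total)));
    }
    // Fold the sorted tiles into one run, ordered by key with val as tiebreak.
    for (std::size_t mid = tile_cap; mid < total; mid += tile_cap) {
        std::inplace_merge(at(0), at(mid), at(std::min(mid + tile_cap, total)));
    }
    // Pack in two passes, one per field, mirroring the two strided H2D copies.
    std::vector<XsCandidate> packed(total);
    for (std::size_t i = 0; i < total; ++i) packed[i].match_info = host[i].first;
    for (std::size_t i = 0; i < total; ++i) packed[i].x = host[i].second;
    return packed;
}
```

In this stand-in the output is the same for any tile count, which mirrors the byte-identical requirement across tiers.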

  src/host/BatchPlotter.cpp: kMinimalFloorBytes 4356 → 3768 MiB
    (= 3640 measured peak + 128 MiB margin). Dispatch message
    "3.68 GiB floor".

  README.md: minimal-tier description rewritten as six layered
    cuts with measured per-phase peaks (Xs 2570, T1/T2 sort 3640,
    T3 match/sort 3640) and the new ~31 s/plot wall (vs ~12 s
    compact) reflecting the host-CPU merge overhead. Top-of-file
    streaming-floor summary 4.25 → 3.7 GiB. 4 GiB cards now
    targeted (with the standard "real 4 GiB hardware reports
    ~3.5 GiB free post-CUDA-context, please report actual fit"
    caveat).

  tools/xchplot2/cli.cpp: --tier help "minimal = ~3.7 GiB floor,
    fits 4 GiB".

Verification on RTX 4090 (XCHPLOT2_STREAMING=1 + --tier minimal,
POS2GPU_STREAMING_STATS=1):
  - k=22 plain / compact / minimal byte-identical (sha256 17dbf594…).
  - k=28 minimal byte-identical with k=28 compact (sha256 f42e62ad…).
  - k=28 minimal peak 4228 → 3640 MB; the bottleneck is now T1
    sort / T2 sort / T3 match / T3 sort all tied at 3640 MB
    (T2 match was already at this level via the existing N=8
    staging).
  - k=28 minimal wall: ~31 s/plot (vs ~12 s compact). The 2.6×
    slowdown matches the SYCL-branch's measured ~34 vs ~13 s
    for the same six-cut configuration on sm_89.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

dependencies: Pull requests that update a dependency file
github_actions: Pull requests that update GitHub Actions code
