build(deps): bump DavidAnson/markdownlint-cli2-action from 18 to 23 #2
Merged
Jsewill merged 1 commit on Apr 27, 2026
Bumps [DavidAnson/markdownlint-cli2-action](https://github.com/davidanson/markdownlint-cli2-action) from 18 to 23.
- [Release notes](https://github.com/davidanson/markdownlint-cli2-action/releases)
- [Commits](DavidAnson/markdownlint-cli2-action@v18...v23)

---
updated-dependencies:
- dependency-name: DavidAnson/markdownlint-cli2-action
  dependency-version: '23'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
de0781c to 90db2b0
Jsewill pushed a commit that referenced this pull request on Apr 28, 2026
…knobs

Kernel-side scaffolding for the six-cut minimal-tier port from main's
df21286. No host-side wiring yet: existing call sites continue to go
through the thin launch_t*_match wrappers and see no behavior change.
The next commit wires cuts #1, #2, #3 in GpuPipeline.cu streaming impl
+ BatchPlotter dispatch.

src/gpu/T1Kernel.{cu,cuh}: split launch_t1_match into
launch_t1_match_prepare (computes bucket + fine-bucket offset arrays
once per plot, resets d_out_count) and launch_t1_match_range (runs
match_all_buckets over a [b_begin, b_end) bucket sub-range,
accumulating into d_out_meta + d_out_mi + d_out_count via atomicAdd).
The original launch_t1_match becomes a thin prepare+range wrapper for
the pool path and parity tests. match_all_buckets gains a uint32_t
bucket_begin parameter; bucket_id is now bucket_begin + blockIdx.y so
range launches resolve to the correct (section_l, match_key_r) tuples,
mirroring the existing T2 / T3 prepare-range plumbing (d4f54ae and
b86939f). Used by the upcoming cut #4 (T1 match sliced per section_l).

src/gpu/T3Kernel.{cu,cuh}: T3 match_all_buckets gains two int64_t
biases (meta_l_index_bias, meta_r_index_bias) that shift the
kernel-internal global l/r indices into a sliced-meta buffer position.
Full-cap callers pass biases = 0 so indexing is unchanged. The existing
launch_t3_match_range wrapper passes 0/0; behavior preserved.

Add launch_t3_match_section_pair_range, which accepts a sliced
d_sorted_meta buffer (section_l + section_r rows packed) plus the two
biases. Used by the upcoming cut #3 (T3 match section-pair input
slicing): d_t2_meta_sorted parked on pinned host across T3 match, the
two row slices H2D'd per pass, d_t2_xbits_sorted + d_t2_keys_merged
stay full-cap on device for binary-search / target reads. Drops T3
match peak from 5200 → ~3700 MB at k=28.
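The bias mechanism described for T3 match_all_buckets can be illustrated with a small host-side sketch. It assumes the convention slice_index = global_index + bias and the packed layout "section_l rows first, section_r rows right after"; the sign convention, the struct, and both helper names are hypothetical and not taken from the kernel source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pair of biases for one packed slice (illustration only).
struct SliceBiases {
    int64_t meta_l_index_bias;  // added to a global l index
    int64_t meta_r_index_bias;  // added to a global r index
};

// Assumed convention: slice_index = global_index + bias.
// section_l rows occupy [0, l_count) in the slice and section_r rows
// start at l_count, so a full-cap buffer corresponds to bias 0 / 0.
inline SliceBiases make_slice_biases(uint64_t l_begin, uint64_t l_count,
                                     uint64_t r_begin) {
    SliceBiases b;
    b.meta_l_index_bias = -static_cast<int64_t>(l_begin);
    b.meta_r_index_bias =
        static_cast<int64_t>(l_count) - static_cast<int64_t>(r_begin);
    return b;
}

// Maps a global row index to its position inside the packed slice.
inline uint64_t to_slice_index(uint64_t global_index, int64_t bias) {
    const int64_t s = static_cast<int64_t>(global_index) + bias;
    assert(s >= 0);  // a negative result means the row is not in the slice
    return static_cast<uint64_t>(s);
}
```

With section_l rows at global [100, 150) and section_r rows starting at global 300, row 100 maps to slice index 0 and row 300 maps to slice index 50.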
Expose matching_section_host(section_l, num_section_bits) so the
streaming caller can compute section_r on the host side from section_l
(the kernel still does this internally; this helper avoids duplicating
the rotation math at the wiring site).

src/host/GpuPipeline.hpp: StreamingPinnedScratch gains two knobs:
- gather_tile_count (default 1): T1 / T2 sort gather tile count. When
  it is >= 2, the merged-key + permuted-meta gather output is D2H'd per
  tile to host pinned (h_meta / h_keys_merged) so the cap-sized
  sorted_meta never has to be alive on device in full. Drops T1-sort
  and T2-sort phase peaks from 5200 → ~3640 MB at k=28.
- t3_input_slice_count (default 1): T3 match input-slice count. When
  it is >= 2, d_t2_meta_sorted is parked on h_meta across T3 match and
  each pass H2Ds the section_l + section_r row slices onto cap/N device
  buffers. Must equal num_sections (= 4 at k=28 strength=2) when active.

Defaults preserve old compact-tier behavior. The minimal tier will set
both in the upcoming BatchPlotter wiring.

All TUs nvcc-clean at sm_89. Existing parity tests + pool path
unaffected: they call launch_t1_match / launch_t3_match (thin wrappers)
which preserve the original API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
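The gating rules for the two knobs can be sketched as follows. StreamingKnobs is a hypothetical stand-in for the two new StreamingPinnedScratch members only (the real struct has more fields); treating a count of 0 as invalid and rounding the per-tile capacity up are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the two knobs, with the defaults from the commit.
struct StreamingKnobs {
    uint32_t gather_tile_count = 1;     // 1 = single-shot gather
    uint32_t t3_input_slice_count = 1;  // 1 = full-cap T3 match input
};

inline bool tiled_gather_active(const StreamingKnobs& k) {
    return k.gather_tile_count >= 2;
}

inline bool t3_input_slicing_active(const StreamingKnobs& k) {
    return k.t3_input_slice_count >= 2;
}

// When input slicing is active the slice count must equal num_sections.
// Rejecting 0 is an assumption; the commit only documents 1 and >= 2.
inline bool knobs_valid(const StreamingKnobs& k, uint32_t num_sections) {
    if (k.gather_tile_count == 0 || k.t3_input_slice_count == 0) return false;
    if (t3_input_slicing_active(k) && k.t3_input_slice_count != num_sections)
        return false;
    return true;
}

// Rows per tile for a cap/N device buffer, rounded up (an assumption).
inline uint64_t tile_capacity(uint64_t cap, uint32_t tile_count) {
    assert(tile_count >= 1);
    return (cap + tile_count - 1) / tile_count;
}
```

The defaults leave both paths inactive, which matches the commit's statement that the old compact-tier behavior is preserved.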
Jsewill pushed a commit that referenced this pull request on Apr 28, 2026
Closes the gap d4f54ae's caveat flagged: cuda-only minimal was
aspirational (kMinimalFloorBytes = 3828 MiB advertised, but real peak
still 5200 MB at T1 sort / T2 sort / T3 match). The three remaining
SYCL-branch cuts now land on this branch and bring all three phases
below the 4 GiB cliff.

Cut #1: T1 sort gather tiled.
src/host/GpuPipeline.cu T1 sort phase: when scratch.gather_tile_count
is >= 2, the per-tile sort-merge feeds a tiled gather instead of a
single-shot one. Per tile: gather_u64 to a cap/N device tile, D2H to
h_meta on host (whose unsorted-meta park lifetime ended at the JIT H2D
into d_t1_meta a few lines earlier; h_meta is dead, free for reuse as
the sorted-meta accumulator). After the loop, free d_t1_meta +
merged_vals + tile, allocate d_t1_meta_sorted full-cap, H2D from
h_meta. Live-set during gather drops from 8 + 8 + 4 = 20 cap (5200 MB)
to 8 + 8/N + 4 = 12 + 8/N cap. At N=4: 14 cap = 3640 MB.

Cut #2: T2 sort meta + xbits gathers tiled, deferred re-hydrate.
Mirror of cut #1 at the T2 sort gather sites, plus a deferred
re-hydrate so d_t2_meta_sorted (8 cap) and d_t2_xbits_sorted (4 cap)
don't co-reside with d_merged_vals (4 cap). Both accumulators land on
host first (h_meta + h_t2_xbits), then d_merged_vals is freed, then
both sorted streams are re-hydrated full-cap on device for T3 match.
Gather peak: 5200 → ~3640 MB. Re-hydrate peak: ~3120 MB.

Cut #3: T3 match section-pair input slicing.
src/host/GpuPipeline.cu T3 match phase: a new t3_input_slice_path
branch precedes the existing t3_stage_path. When
scratch.t3_input_slice_count is >= 2, cut #2's deferred re-hydrate
skips the d_t2_meta_sorted H2D entirely, so T2 meta stays parked on
h_meta. The T3 match phase then:
1. launch_t3_match_prepare to populate d_offsets in the temp storage
   region.
2. D2H d_offsets so the host loop can compute section_l / section_r
   row spans. Tiny (17 × 8 = 136 bytes at k=28 strength=2).
3. For each section_l ∈ [0, num_sections): compute section_r via
   matching_section_host, look up the row spans, H2D the section_l +
   section_r meta rows from h_meta into a cap/2 device slice buffer
   (tightly packed at indices [0, l_count) and
   [l_count, l_count + r_count)), set the kernel biases to map global
   l/r → slice indices, run launch_t3_match_section_pair_range over the
   section_l × num_match_keys bucket sub-range, D2H d_t3_stage to a
   per-plot pinned h_t3_acc accumulator at offset t3_count, increment
   t3_count.
4. After all section_l: free d_t2_meta_slice + d_t3_stage +
   d_t3_match_temp + d_t2_xbits_sorted + d_t2_keys_merged, allocate
   d_t3 full-cap, H2D from h_t3_acc, free h_t3_acc.

Per-plot pinned h_t3_acc (cap × T3PairingGpu = cap × u64) is necessary
because h_meta is in active read-use across the section_l loop and
can't double as the existing t3_stage_path's accumulator.

T3 match peak: 5200 → ~3700 MB (cap/2 meta slice 1040 + cap xbits 1040
+ cap keys_merged 1040 + cap/4 t3 stage 520 + offsets ~80 = ~3720 MB
at k=28).

src/host/BatchPlotter.cpp: minimal tier sets gather_tile_count = 4
(= num_sections at k=28 strength=2) and t3_input_slice_count =
num_sections. Dispatch message updated to advertise the layered cuts.
kMinimalFloorBytes stays 3828 MiB, which already matches expected peak
(~3700 MB) + 128 MB margin.

README.md: minimal-tier description rewritten to describe the three
layered cuts, the new bottleneck (T3 match at ~3700 MB), and the wider
4-GiB-card target. The b86939f-era "N=8 T2 staging only" wording was
stale after d4f54ae shifted the bottleneck.

Verification on hardware (RTX 4090 was main's verification host):
- k=22 batch across plain / compact / minimal must produce
  byte-identical .plot2 output (cuts re-shape memory only).
- k=28 minimal forced under POS2GPU_MAX_VRAM_MB=4096 should dispatch
  minimal and complete; POS2GPU_STREAMING_STATS=1 should confirm
  peak ≤ ~3700 MB.
- k=28 minimal vs k=28 compact must be byte-identical.
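The row-span lookup and slice packing in step 3 of cut #3 can be sketched on the host with plain vectors. This assumes the offsets array holds num_sections × num_match_keys + 1 bucket boundaries ordered section-major (17 entries at 4 × 4) and that sorted meta rows follow the same order; both are assumptions, and RowSpan, section_span, and pack_section_pair are hypothetical names. The vector inserts stand in for the two H2D copies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct RowSpan { uint64_t begin; uint64_t end; };  // half-open [begin, end)

// Assumed layout: offsets[s * num_match_keys] is the first row of section s,
// and the array carries one trailing end boundary.
inline RowSpan section_span(const std::vector<uint64_t>& offsets,
                            uint32_t section, uint32_t num_match_keys) {
    const size_t first = static_cast<size_t>(section) * num_match_keys;
    assert(first + num_match_keys < offsets.size());
    return {offsets[first], offsets[first + num_match_keys]};
}

// Packs section_l rows at [0, l_count) and section_r rows right after them.
// Returns l_count so the caller can derive the r-side bias.
inline uint64_t pack_section_pair(const std::vector<uint64_t>& h_meta,
                                  RowSpan l, RowSpan r,
                                  std::vector<uint64_t>& slice) {
    assert(l.begin <= l.end && l.end <= h_meta.size());
    assert(r.begin <= r.end && r.end <= h_meta.size());
    const auto base = h_meta.begin();
    slice.clear();
    slice.insert(slice.end(), base + static_cast<std::ptrdiff_t>(l.begin),
                 base + static_cast<std::ptrdiff_t>(l.end));
    slice.insert(slice.end(), base + static_cast<std::ptrdiff_t>(r.begin),
                 base + static_cast<std::ptrdiff_t>(r.end));
    return l.end - l.begin;
}
```

section_r is taken as an input here because the rotation math behind matching_section_host is not reproduced in the commit message.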
Cuts #4 (T1 match sliced) and #6 (Xs gen+sort+pack tiled) deferred:
they're additive savings on phases that are no longer the bottleneck
after the above three. Cut #4's kernel-side split landed in the
previous commit so the wiring is straightforward when needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jsewill pushed a commit that referenced this pull request on Apr 28, 2026
Cuts #1+#2+#3 brought T1 sort, T2 sort, and T3 match below 4 GiB, but
T1 match was unaffected and stayed at ~5280 MB at k=28 (d_xs 2080 +
d_t1_meta 2080 + d_t1_mi 1040 + temp ~80), the new overall pipeline
peak. Cut #4 closes that gap.

src/host/GpuPipeline.cu T1 match phase: when scratch.gather_tile_count
is >= 2, gate a tiled_t1_match branch that uses the existing
launch_t1_match_prepare + launch_t1_match_range plumbing landed in
commit bca9bf1. Each section_l pass writes to cap/N device staging
buffers (cap/N × u64 meta + cap/N × u32 mi), D2H'd per pass to
scratch.h_meta + a per-plot pinned h_t1_mi accumulator at offset
t1_count. After all passes, free stage + d_xs and re-hydrate d_t1_mi
full-cap from h_t1_mi for the upcoming T1 sort. d_t1_meta is never
allocated: h_meta already holds the unsorted meta when entering T1
sort, so the existing park step becomes a no-op (now gated on
d_t1_meta != nullptr).

Peak: d_xs (2080) + cap/N × 12 (stage) + temp ≈ 2940 MB at N=4
(= num_sections at k=28 strength=2). Plain / compact paths unchanged.

src/host/BatchPlotter.cpp: dispatch message updated to advertise
"N=4 T1-match" alongside the existing "T1/T2 sort gather" and
"T3 input slicing" cuts.

README.md: minimal-tier description rewritten as four layered cuts
(was three): adds cut #4 and re-orders the "compact's tied 5200 MB"
summary to include T1 match.

After this commit the cuda-only minimal-tier peak budget at k=28
strength=2 should be:
  Xs phase : ~3072 MB (unchanged, no cut #6 yet)
  T1 match : ~2940 MB (cut #4, was 5280)
  T1 sort  : ~3640 MB (cut #1, was 5200)
  T2 match : ~3640 MB (existing N=8 staging)
  T2 sort  : ~3640 MB (cut #2, was 5200)
  T3 match : ~3700 MB (cut #3, was 5200)
  T3 sort  : ~3155 MB (no change needed)

Overall peak: ~3700 MB at T3 match, which fits kMinimalFloorBytes
(3828 MiB = ~3700 MB + 128 MB margin) and the 4 GiB-card floor.

Cut #6 (Xs gen+sort+pack tiled) deferred: Xs already under 4 GiB and
not the bottleneck.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
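The peak estimates quoted for cut #1 and cut #4 follow from simple arithmetic, sketched below. The constant of 260 MB per "byte per cap entry" is read off the commit's own figures (cap × u32 = 1040 MB, cap × u64 = 2080 MB at k=28), and the breakdown of the live set into 8 + 8/N + 4 is one reading of the cut #1 description; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption derived from the commit's numbers: 20 cap = 5200 MB at k=28,
// so one byte per cap entry costs 260 MB.
constexpr uint32_t kMbPerCapByte = 260;

// Cut #1 gather live set: u64 meta input (8) + gather output tile (8/N)
// + u32 merged vals (4), in bytes per cap entry.
inline uint32_t tiled_gather_peak_mb(uint32_t n) {
    assert(n >= 1);
    return (8 + 4) * kMbPerCapByte + 8 * kMbPerCapByte / n;
}

// Cut #4 T1 match: d_xs (cap x u64) + staging of u64 meta and u32 mi
// at cap/N (12 bytes per staged entry) + temp storage.
inline uint32_t tiled_t1_match_peak_mb(uint32_t n, uint32_t temp_mb) {
    assert(n >= 1);
    return 8 * kMbPerCapByte + 12 * kMbPerCapByte / n + temp_mb;
}
```

Setting N=1 reproduces the untiled figures (5200 MB and 5280 MB), and N=4 gives 3640 MB and 2940 MB, matching the budget table.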
Jsewill pushed a commit that referenced this pull request on Apr 28, 2026
Verification on RTX 4090 with --tier minimal forced via
XCHPLOT2_STREAMING=1 surfaced two corrections:
1. The post-cuts overall peak is 4228 MB at T3 sort phase
(d_t3 2080 + d_frags_out 2080 + CUB scratch 68), NOT
~3700 MB at T3 match as the previous commit's README +
dispatch message claimed. Cuts #1+#2 reduced T1/T2-sort
GATHER peaks (sub-phases inside the sort phase) but the
CUB DeviceRadixSort itself still allocates four cap-sized
uint32 buffers under DoubleBuffer mode (~4 cap = 4160 MB
at k=28), so T1/T2/T3 sort phases stay near the 4 GiB
line. Per-phase peaks measured:
  Xs       : 4136 MB (4 cap u32 + CUB scratch 40)
  T1 sort  : 4180 MB (4 cap u32 + CUB scratch 20)
  T2 sort  : 4170 MB (4 cap u32 + CUB scratch 10)
  T3 match : ~3700 MB (cut #3 working as designed)
  T3 sort  : 4228 MB ← bottleneck
Compact tier unchanged at 5200 MB peak (no cuts active).
Drop from 5200 → 4228 (−972 MB / −19%) is real, just less
than the README claimed.
2. 4 GiB cards (≤ ~3.5 GiB free post-CUDA-context) still
don't fit. Closing that gap requires the SYCL-branch's
cuts #5 (CUB sort output tiling with host accumulators
across T1/T2/T3 sort) and #6 (Xs gen+sort+pack tiling) —
not yet ported to cuda-only. The minimal tier in its
current form fits 5 GiB+ cards (RTX 2060, RX 6600 / 6600
XT, RX 7600) comfortably with ~1 GiB headroom.
Updates:
src/host/BatchPlotter.cpp: kMinimalFloorBytes 3828 → 4356
MiB (= 4228 measured peak + 128 MB margin). Dispatch
message floor reads as "4.25 GiB floor" instead of the
overstated "3.74 GiB".
README.md: minimal-tier description rewritten with measured
peak (4228 MB), the new bottleneck phase (T3 sort), the
accurate target hardware (5 GiB+ cards, not 4 GiB), and a
pointer to cuts #5/#6 as the remaining work for genuine
4 GiB-card support. Top-of-file streaming-floor summary
updated 3.8 → 4.25 GiB.
tools/xchplot2/cli.cpp: --tier help text updated to match.
Verified byte-identical at k=22 across plain / compact / minimal
(sha256 17dbf594…) and at k=28 across compact / minimal
(sha256 f42e62ad…). Plain pool and compact streaming paths
unchanged by this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
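The floor update above is a peak-plus-margin calculation, which the following sketch reproduces. It follows the commit in treating MB and MiB interchangeably; floor_from_peak, fits_floor, and as_gib are hypothetical helpers, not functions from BatchPlotter.cpp.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kMarginMb = 128;  // safety margin quoted in the commit

// Floor = measured (or expected) peak + margin.
inline uint64_t floor_from_peak(uint64_t peak_mb) {
    assert(peak_mb > 0);
    return peak_mb + kMarginMb;
}

// Hypothetical dispatch check: a tier is eligible only if free VRAM
// reaches its floor.
inline bool fits_floor(uint64_t free_vram_mb, uint64_t floor_mb) {
    return free_vram_mb >= floor_mb;
}

// Converts a MiB count to GiB for the dispatch message.
inline double as_gib(uint64_t mib) {
    return static_cast<double>(mib) / 1024.0;
}
```

A measured peak of 4228 gives a floor of 4356 (about 4.25 GiB), while the earlier expected peak of 3700 gave 3828 (about 3.74 GiB). A card with roughly 3.5 GiB free (3584 MiB) falls below either floor.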
Jsewill pushed a commit that referenced this pull request on Apr 28, 2026
Replaces the four cap-sized uint32/uint64 buffers that CUB
DeviceRadixSort needs in DoubleBuffer mode with cap/N per-tile
buffers + host accumulators across all three sort phases. Drops
each phase peak from ~4200 MB to 3640 MB at k=28 by paying ~7 sec
of host CPU merge time per plot (k=28 minimal: 12 → 22 s/plot).
Each tile's CUB sort lands either in the input slice (d_*_mi /
d_t3) or the cap/N alternate buffer; whichever side it lands on,
we D2H to a host pinned accumulator at the matching offset. After
all tiles, we free the per-tile device buffers and the input
buffer, then run a tree of pairwise stable in-place merges on
host (std::inplace_merge for keys-only; a hand-rolled
paired_merge_t* for the pairs cases). The result is a globally
sorted run that we H2D back to the output buffer that downstream
consumers expect.
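A minimal host-only sketch of the keys-only case: sort each tile independently, then merge neighbouring runs pairwise until one sorted run remains. std::sort stands in for the per-tile CUB sort and the single vector stands in for the pinned host accumulator; the function name and the round-up tile size are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sorts `keys` tile by tile, then runs a bottom-up tree of pairwise
// in-place merges over the sorted tiles.
inline void tiled_sort_with_host_merge(std::vector<uint32_t>& keys,
                                       size_t tile_count) {
    assert(tile_count >= 1);
    const size_t n = keys.size();
    const size_t tile = (n + tile_count - 1) / tile_count;  // ceil(n / N)
    if (tile == 0) return;  // empty input

    // Per-tile sort (stand-in for the cap/N device sort + D2H).
    for (size_t begin = 0; begin < n; begin += tile) {
        const size_t end = std::min(begin + tile, n);
        std::sort(keys.begin() + begin, keys.begin() + end);
    }

    // Merge tree: run length doubles each level until it covers the array.
    for (size_t run = tile; run < n; run *= 2) {
        for (size_t begin = 0; begin + run < n; begin += 2 * run) {
            const size_t mid = begin + run;
            const size_t end = std::min(begin + 2 * run, n);
            std::inplace_merge(keys.begin() + begin, keys.begin() + mid,
                               keys.begin() + end);
        }
    }
}
```

With N tiles the tree has about log2(N) levels, and each level touches every element once, which is where the extra host CPU time per plot comes from.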
T1 sort: scratch.h_keys_merged + scratch.h_t2_xbits as accumulators.
h_keys_merged was already going to receive the T1 sorted-mi park
after the device-side merge — cut #5 just writes it directly,
skipping the round-trip. h_t2_xbits is dead at T1 sort time
(T2 match staging hasn't filled it yet) so it doubles as the
T1 vals accumulator. Final H2D rehydrates d_t1_merged_vals
from h_t2_xbits for the cut #1 gather phase. d_t1_keys_merged
stays null — h_keys_merged is already the parked form. Per-
phase peak: 4180 → 3640 MB.
T2 sort: scratch.h_keys_merged + a per-plot pinned h_t2_sort_vals
(cap × u32, freed at end of phase). h_t2_xbits is NOT reused
for T2 sort — cut #2's xbits gather still reads h_t2_xbits as
the parked unsorted xbits stream, so an in-place reuse would
corrupt that data. Mirror of T1 sort otherwise. Per-phase peak:
4170 → 3640 MB.
T3 sort: scratch.h_meta as the keys-only accumulator. h_meta's
cut #3 lifetime as parked T2 meta ends at the H2D-back step
that cut #3 emits before T3 sort entry, so it's reusable.
SortKeys (no vals) → std::inplace_merge for the host merge
step. Per-phase peak: 4228 → 3640 MB.
Plus a small init_u32_identity_offset kernel — the cap/N tile
sort needs its vals_in seeded with global positions
[tile_start..tile_end) so the post-merge d_merged_vals stream
indexes directly into the cap-sized d_t*_meta / d_t*_xbits.
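The role of the identity-offset seeding can be shown with a host stand-in. init_u32_identity_offset_host and sort_tile are hypothetical names (the real kernel's signature is not given in the commit), and std::stable_sort replaces the per-tile CUB pair sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct KeyVal { uint32_t key; uint32_t val; };

// Host stand-in for the seeding kernel: vals[i] = tile_start + i, so each
// value is a global position rather than a tile-local one.
inline void init_u32_identity_offset_host(std::vector<uint32_t>& vals,
                                          uint32_t tile_start) {
    for (size_t i = 0; i < vals.size(); ++i)
        vals[i] = tile_start + static_cast<uint32_t>(i);
}

// Sorts keys[tile_start, tile_end) and carries the global positions along.
inline std::vector<KeyVal> sort_tile(const std::vector<uint32_t>& keys,
                                     uint32_t tile_start, uint32_t tile_end) {
    assert(tile_start <= tile_end && tile_end <= keys.size());
    std::vector<uint32_t> vals(tile_end - tile_start);
    init_u32_identity_offset_host(vals, tile_start);

    std::vector<KeyVal> out(vals.size());
    for (size_t i = 0; i < out.size(); ++i)
        out[i] = {keys[tile_start + i], vals[i]};
    std::stable_sort(out.begin(), out.end(),
                     [](const KeyVal& a, const KeyVal& b) { return a.key < b.key; });
    return out;
}
```

Because every val is a global position, keys[out[i].val] == out[i].key holds for any tile, so the merged vals stream can index the cap-sized buffers directly.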
Verification (RTX 4090 at k=22 + k=28 strength=2):
- k=22 plain / compact / minimal byte-identical (sha256 17dbf594…).
- k=28 minimal byte-identical with k=28 compact (sha256 f42e62ad…).
- k=28 minimal peak 4228 → 4136 MB (-92 MB; cut #5 saves on
each sort phase but cap-sized Xs gen+sort+pack is the new
overall bottleneck — cut #6 closes that gap).
- Compact / plain paths unchanged (the new tile path is gated
on scratch.gather_tile_count >= 2 + the per-tier scratch
pinned slots being populated).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jsewill pushed a commit that referenced this pull request on May 17, 2026
…able sort (Phase 1.5b)
Replaces Minimal T2 sort (full-cap d_t2_mi + d_keys_tile + d_vals_tile
+ d_sort_scratch + d_merged_vals + T2 sort gather d_t2_meta +
d_t2_meta_tile = ~3.6 GB at k=28) with:
1. Pre-compute h_global_idx[i] = i (u32) on host pinned for stable
sort tiebreak.
2. Triple-val streaming partition (key=d_t2_mi, val=meta_u64, val2=
global_idx_u32) — bucketizes (key, meta, global_idx) triples via
atomic-claim into host pinned arenas. Frees d_t2_mi after.
3. Per-bucket sort:
- Host: pack (key, global_idx) → u64 (key in high 32, global_idx
in low 32) for the bucket's entries.
- H2D packed keys + meta vals.
- CUB DeviceRadixSort::SortPairs<u64, u64> with end_bit = 32 + k.
Top 32 bits = key (radix-sorted first), bottom 32 bits =
global_idx (tiebreak on equal keys → matches Minimal/Plain's
full-array sort-by-(key, identity_idx) byte-parity contract).
- D2H sorted packed keys + meta.
- Unpack sorted keys → scratch.h_keys_merged (in place).
Unpack sorted global_idx, gather xbits from original
scratch.h_t2_xbits[orig_idx] → separate h_t2_xbits_sorted_out
(gather pattern hits any position, can't write in place).
4. Memcpy h_t2_xbits_sorted_out → scratch.h_t2_xbits.
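The pack, sort, unpack, and gather sequence of step 3 can be sketched on the host as below. std::sort on the packed u64 stands in for the CUB radix sort; it compares all 64 bits, which yields the same order as limiting the sort to the low 32 + k bits whenever keys fit in k bits. The meta payload is omitted for brevity, and all function and struct names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Key in the high 32 bits, global index in the low 32 bits, so equal keys
// are ordered by their original position (the stable-sort tiebreak).
inline uint64_t pack_key_idx(uint32_t key, uint32_t global_idx) {
    return (static_cast<uint64_t>(key) << 32) | global_idx;
}
inline uint32_t unpack_key(uint64_t p) { return static_cast<uint32_t>(p >> 32); }
inline uint32_t unpack_idx(uint64_t p) { return static_cast<uint32_t>(p & 0xFFFFFFFFu); }

struct SortedBucket {
    std::vector<uint32_t> keys;   // sorted keys
    std::vector<uint32_t> xbits;  // xbits gathered in sorted order
};

// Sorts one bucket and gathers xbits from the unsorted stream through the
// recovered global indices. Output goes to a separate buffer because the
// gather can read any position of the input.
inline SortedBucket sort_bucket(const std::vector<uint32_t>& bucket_keys,
                                const std::vector<uint32_t>& bucket_global_idx,
                                const std::vector<uint32_t>& xbits_unsorted) {
    assert(bucket_keys.size() == bucket_global_idx.size());
    std::vector<uint64_t> packed(bucket_keys.size());
    for (size_t i = 0; i < packed.size(); ++i)
        packed[i] = pack_key_idx(bucket_keys[i], bucket_global_idx[i]);

    std::sort(packed.begin(), packed.end());

    SortedBucket out;
    for (uint64_t p : packed) {
        const uint32_t idx = unpack_idx(p);
        assert(idx < xbits_unsorted.size());
        out.keys.push_back(unpack_key(p));
        out.xbits.push_back(xbits_unsorted[idx]);
    }
    return out;
}
```

For keys {7, 3, 7, 3} at global indices {0, 1, 2, 3}, the sorted order is (3,1), (3,3), (7,0), (7,2), so ties keep their original relative order without needing a stable sort primitive.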
Outputs: scratch.h_meta + scratch.h_keys_merged + scratch.h_t2_xbits
hold sorted (meta, mi, xbits) on host pinned. d_merged_vals and
d_t2_keys_merged stay nullptr — no permutation needed because Tiny
T3 match (per-bucket-pair via host prepare) reads slices directly
from these host buffers.
Gating updates: tiled_t2_sort and tiled_gather_t2 add !scratch.tiny_mode
so Tiny skips both the Minimal-style sort and the Cut #2 gather.
The plain-non-streaming T2-sort gather else (line ~2861) also gated
on !scratch.tiny_mode.
Measured at k=28 RTX 4090:
- Tiny plot peak: 3640 → 2615 MB (-1025 MB, -28%)
- New floor: T3 sort phase (2615 MB) — d_t3 full-cap + sort
scratch + tiled_t3_sort working set. T2 sort phase live now
1064 MB (vs 3640 prior).
- All 4 tiers byte-identical to SYCL Plain reference at k=22 + k=28
- Tiny wall: 49.2 s (Minimal: 32.0 s, +54%) — per-bucket sequential
pack + sort + gather + xbits-memcpy adds wall; acceptable per
cost model for sub-2 GB cards.
Remaining peak win to reach SYCL Tiny's measured 1064 MB at k=28:
- T3 sort streaming partition (would drop d_t3 full-cap)
- T3 match per-bucket xbits (would drop d_t2_xbits_sorted hydration)
- Xs gen+sort tiling (would drop Xs phase 2566 MB peak)
Tracked as follow-up; closes #49.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bumps DavidAnson/markdownlint-cli2-action from 18 to 23.
Commits
- ce4853d Update to version 23.0.0.
- 63a898c Improve type fidelity.
- 08fc3a2 Add configPointer input, examples for package.json/pyproject.toml.
- 154744f Freshen generated index.js file.
- d1d523c Bump markdownlint-cli2 from 0.21.0 to 0.22.0
- 619b235 Bump eslint from 10.0.3 to 10.1.0
- a226cbe Freshen generated index.js file.
- 5d93b2e Migrate from Node.js 20 to Node.js 24.
- 0cf8cdd Bump eslint from 10.0.2 to 10.0.3
- 462cc85 Bump @stylistic/eslint-plugin from 5.9.0 to 5.10.0