build(deps): bump hadolint/hadolint-action from 3.1.0 to 3.3.0 #3
Merged
Jsewill merged 1 commit on Apr 27, 2026
Bumps [hadolint/hadolint-action](https://github.com/hadolint/hadolint-action) from 3.1.0 to 3.3.0.
- [Release notes](https://github.com/hadolint/hadolint-action/releases)
- [Commits](hadolint/hadolint-action@v3.1.0...v3.3.0)

---
updated-dependencies:
- dependency-name: hadolint/hadolint-action
  dependency-version: 3.3.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
9b25c42 to 444c2e4
Jsewill pushed a commit that referenced this pull request on Apr 28, 2026
…knobs

Kernel-side scaffolding for the six-cut minimal-tier port from main's df21286. No host-side wiring yet: existing call sites continue to go through the thin launch_t*_match wrappers and see no behavior change. The next commit wires cuts #1, #2, #3 in GpuPipeline.cu streaming impl + BatchPlotter dispatch.

src/gpu/T1Kernel.{cu,cuh}: split launch_t1_match into launch_t1_match_prepare (computes bucket + fine-bucket offset arrays once per plot, resets d_out_count) and launch_t1_match_range (runs match_all_buckets over a [b_begin, b_end) bucket sub-range, accumulating into d_out_meta + d_out_mi + d_out_count via atomicAdd). The original launch_t1_match becomes a thin prepare+range wrapper for the pool path and parity tests. match_all_buckets gains a uint32_t bucket_begin parameter; bucket_id is now bucket_begin + blockIdx.y so range launches resolve to the correct (section_l, match_key_r) tuples, mirroring the existing T2 / T3 prepare-range plumbing (d4f54ae and b86939f). Used by the upcoming cut #4 (T1 match sliced per section_l).

src/gpu/T3Kernel.{cu,cuh}: T3 match_all_buckets gains two int64_t biases (meta_l_index_bias, meta_r_index_bias) that shift the kernel-internal global l/r indices into a sliced-meta buffer position. Full-cap callers pass biases = 0 so indexing is unchanged. The existing launch_t3_match_range wrapper passes 0/0; behavior preserved.

Add launch_t3_match_section_pair_range, which accepts a sliced d_sorted_meta buffer (section_l + section_r rows packed) plus the two biases. Used by the upcoming cut #3 (T3 match section-pair input slicing): d_t2_meta_sorted parked on pinned host across T3 match, the two row slices H2D'd per pass, d_t2_xbits_sorted + d_t2_keys_merged stay full-cap on device for binary-search / target reads. Drops T3 match peak from 5200 → ~3700 MB at k=28.

Expose matching_section_host(section_l, num_section_bits) so the streaming caller can compute section_r on the host side from section_l (the kernel still does this internally; this helper avoids duplicating the rotation math at the wiring site).

src/host/GpuPipeline.hpp: StreamingPinnedScratch gains two knobs:
- gather_tile_count (default 1): T1 / T2 sort gather tile count. When >= 2, the merged-key + permuted-meta gather output is D2H'd per tile to host pinned (h_meta / h_keys_merged) so the cap-sized sorted_meta never has to be alive on device in full. Drops T1-sort and T2-sort phase peaks from 5200 → ~3640 MB at k=28.
- t3_input_slice_count (default 1): T3 match input-slice count. When >= 2, d_t2_meta_sorted is parked on h_meta across T3 match and each pass H2Ds the section_l + section_r row slices onto cap/N device buffers. Must equal num_sections (= 4 at k=28 strength=2) when active.

Defaults preserve old compact-tier behavior. The minimal tier will set both in the upcoming BatchPlotter wiring.

All TUs nvcc-clean at sm_89. Existing parity tests + pool path unaffected: they call launch_t1_match / launch_t3_match (thin wrappers), which preserve the original API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
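The prepare + range split described for T1Kernel can be pictured with a small host-side model. This is only a sketch under stated assumptions: `MatchState`, `match_prepare`, `match_range` and `match_all` are hypothetical names (not the real `launch_t1_match_*` signatures), and the "match" is reduced to counting rows per bucket. What it shows is that `bucket_id = bucket_begin + block_y` lets several sub-range calls accumulate to the same result as one full-range call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of the prepare/range split; not the real API.
struct MatchState {
    std::vector<uint32_t> bucket_offsets;  // computed once per plot by prepare
    uint64_t out_count = 0;                // reset by prepare, accumulated by range
};

// "prepare": build the per-bucket offset array once and reset the counter.
inline void match_prepare(MatchState& st, const std::vector<uint32_t>& bucket_sizes) {
    st.bucket_offsets.assign(bucket_sizes.size() + 1, 0);
    for (std::size_t b = 0; b < bucket_sizes.size(); ++b)
        st.bucket_offsets[b + 1] = st.bucket_offsets[b] + bucket_sizes[b];
    st.out_count = 0;
}

// "range": process buckets [b_begin, b_end). block_y plays the role of
// blockIdx.y, so the bucket id is b_begin + block_y as in the commit text.
inline void match_range(MatchState& st, uint32_t b_begin, uint32_t b_end) {
    assert(b_begin <= b_end && b_end < st.bucket_offsets.size());
    for (uint32_t block_y = 0; block_y < b_end - b_begin; ++block_y) {
        const uint32_t bucket_id = b_begin + block_y;
        // Placeholder for the real match: count the rows in this bucket.
        st.out_count += st.bucket_offsets[bucket_id + 1] - st.bucket_offsets[bucket_id];
    }
}

// Thin wrapper: one prepare followed by one full-range call.
inline uint64_t match_all(MatchState& st, const std::vector<uint32_t>& bucket_sizes) {
    match_prepare(st, bucket_sizes);
    match_range(st, 0, static_cast<uint32_t>(bucket_sizes.size()));
    return st.out_count;
}
```

Calling `match_prepare` once and then `match_range` over disjoint sub-ranges that cover all buckets should leave `out_count` equal to what `match_all` returns, which is the property the thin wrapper relies on.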
Jsewill pushed a commit that referenced this pull request on Apr 28, 2026
Closes the gap d4f54ae's caveat flagged: cuda-only minimal was aspirational (kMinimalFloorBytes = 3828 MiB advertised, but real peak still 5200 MB at T1 sort / T2 sort / T3 match). The three remaining SYCL-branch cuts now land on this branch and bring all three phases below the 4 GiB cliff.

Cut #1: T1 sort gather tiled. src/host/GpuPipeline.cu T1 sort phase: when scratch.gather_tile_count >= 2, the per-tile sort-merge feeds a tiled gather instead of a single-shot one. Per tile: gather_u64 to a cap/N device tile, D2H to h_meta on host (whose unsorted-meta park lifetime ended at the JIT H2D into d_t1_meta a few lines earlier; h_meta is dead, free for reuse as the sorted-meta accumulator). After the loop, free d_t1_meta + merged_vals + tile, allocate d_t1_meta_sorted full-cap, H2D from h_meta. Live-set during gather drops from 8 + 8 + 4 = 20 cap (5200 MB) to 8 + 8/N + 4 = 12 + 8/N cap. At N=4: 14 cap = 3640 MB.

Cut #2: T2 sort meta + xbits gathers tiled, deferred re-hydrate. Mirror of cut #1 at the T2 sort gather sites, plus a deferred re-hydrate so d_t2_meta_sorted (8 cap) and d_t2_xbits_sorted (4 cap) don't co-reside with d_merged_vals (4 cap). Both accumulators land on host first (h_meta + h_t2_xbits), then d_merged_vals is freed, then both sorted streams are re-hydrated full-cap on device for T3 match. Gather peak: 5200 → ~3640 MB. Re-hydrate peak: ~3120 MB.

Cut #3: T3 match section-pair input slicing. src/host/GpuPipeline.cu T3 match phase: a new t3_input_slice_path branch precedes the existing t3_stage_path. When scratch.t3_input_slice_count >= 2, cut #2's deferred re-hydrate skips the d_t2_meta_sorted H2D entirely; T2 meta stays parked on h_meta. The T3 match phase then:
1. launch_t3_match_prepare to populate d_offsets in the temp storage region.
2. D2H d_offsets so the host loop can compute section_l / section_r row spans. Tiny (17 × 8 = 136 bytes at k=28 strength=2).
3. For each section_l ∈ [0, num_sections): compute section_r via matching_section_host, look up the row spans, H2D the section_l + section_r meta rows from h_meta into a cap/2 device slice buffer (tightly packed at indices [0, l_count) and [l_count, l_count + r_count)), set the kernel biases to map global l/r → slice indices, run launch_t3_match_section_pair_range over the section_l × num_match_keys bucket sub-range, D2H d_t3_stage to a per-plot pinned h_t3_acc accumulator at offset t3_count, increment t3_count.
4. After all section_l: free d_t2_meta_slice + d_t3_stage + d_t3_match_temp + d_t2_xbits_sorted + d_t2_keys_merged, allocate d_t3 full-cap, H2D from h_t3_acc, free h_t3_acc.

Per-plot pinned h_t3_acc (cap × T3PairingGpu = cap × u64) is necessary because h_meta is in active read-use across the section_l loop and can't double as the existing t3_stage_path's accumulator. T3 match peak: 5200 → ~3700 MB (cap/2 meta slice 1040 + cap xbits 1040 + cap keys_merged 1040 + cap/4 t3 stage 520 + offsets ~80 = ~3720 MB at k=28).

src/host/BatchPlotter.cpp: minimal tier sets gather_tile_count = 4 (= num_sections at k=28 strength=2) and t3_input_slice_count = num_sections. Dispatch message updated to advertise the layered cuts. kMinimalFloorBytes stays 3828 MiB, which already matches expected peak (~3700 MB) + 128 MB margin.

README.md: minimal-tier description rewritten to describe the three layered cuts, the new bottleneck (T3 match at ~3700 MB), and the wider 4-GiB-card target. The b86939f-era "N=8 T2 staging only" wording was stale after d4f54ae shifted the bottleneck.

Verification on hardware (RTX 4090 was main's verification host):
- k=22 batch across plain / compact / minimal must produce byte-identical .plot2 output (cuts re-shape memory only).
- k=28 minimal forced under POS2GPU_MAX_VRAM_MB=4096 should dispatch minimal and complete; POS2GPU_STREAMING_STATS=1 should confirm peak ≤ ~3700 MB.
- k=28 minimal vs k=28 compact must be byte-identical.

Cuts #4 (T1 match sliced) and #6 (Xs gen+sort+pack tiled) deferred: they're additive savings on phases that are no longer the bottleneck after the above three. Cut #4's kernel-side split landed in the previous commit so the wiring is straightforward when needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
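The bias mapping used by cut #3 can be illustrated with a few lines of host code. This is a minimal sketch, assuming the kernel forms the slice position as `global_index + bias` (the commit only says the biases "shift" the indices, so the sign convention is an assumption), and `RowSpan`, `SliceBiases`, `make_biases` and `to_slice_index` are hypothetical helper names. The packing follows the commit text: section_l rows at [0, l_count), section_r rows at [l_count, l_count + r_count).

```cpp
#include <cassert>
#include <cstdint>

// Rows [begin, begin + count) of one section inside the full-cap buffer.
struct RowSpan {
    int64_t begin;
    int64_t count;
};

// Field names follow the commit message; the struct itself is hypothetical.
struct SliceBiases {
    int64_t meta_l_index_bias;
    int64_t meta_r_index_bias;
};

// Assumption: slice_index = global_index + bias.
// section_l rows start at slice index 0, section_r rows at l.count.
inline SliceBiases make_biases(RowSpan l, RowSpan r) {
    return SliceBiases{-l.begin, l.count - r.begin};
}

// Map a global row index to its position in the packed slice buffer.
// With bias = 0 (full-cap callers) the index is returned unchanged.
inline int64_t to_slice_index(int64_t global_index, int64_t bias) {
    const int64_t idx = global_index + bias;
    assert(idx >= 0);
    return idx;
}
```

For example, with section_l at global rows [100, 150) and section_r at [300, 340), global row 300 maps to slice index 50, directly after the 50 packed section_l rows.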
Jsewill pushed a commit that referenced this pull request on Apr 28, 2026
Cuts #1+#2+#3 brought T1 sort, T2 sort, and T3 match below 4 GiB, but T1 match was unaffected and stayed at ~5280 MB at k=28 (d_xs 2080 + d_t1_meta 2080 + d_t1_mi 1040 + temp ~80), the new overall pipeline peak. Cut #4 closes that gap.

src/host/GpuPipeline.cu T1 match phase: when scratch.gather_tile_count >= 2, gate a tiled_t1_match branch that uses the existing launch_t1_match_prepare + launch_t1_match_range plumbing landed in commit bca9bf1. Each section_l pass writes to cap/N device staging buffers (cap/N × u64 meta + cap/N × u32 mi), D2H'd per pass to scratch.h_meta + a per-plot pinned h_t1_mi accumulator at offset t1_count. After all passes, free stage + d_xs and re-hydrate d_t1_mi full-cap from h_t1_mi for the upcoming T1 sort. d_t1_meta is never allocated: h_meta already holds the unsorted meta when entering T1 sort, so the existing park step becomes a no-op (now gated on d_t1_meta != nullptr).

Peak: d_xs (2080) + cap/N × 12 (stage) + temp ≈ 2940 MB at N=4 (= num_sections at k=28 strength=2). Plain / compact paths unchanged.

src/host/BatchPlotter.cpp: dispatch message updated to advertise "N=4 T1-match" alongside the existing "T1/T2 sort gather" and "T3 input slicing" cuts.

README.md: minimal-tier description rewritten as four layered cuts (was three): adds cut #4 and re-orders the "compact's tied 5200 MB" summary to include T1 match.

After this commit the cuda-only minimal-tier peak budget at k=28 strength=2 should be:

Xs phase : ~3072 MB (unchanged, no cut #6 yet)
T1 match : ~2940 MB (cut #4, was 5280)
T1 sort  : ~3640 MB (cut #1, was 5200)
T2 match : ~3640 MB (existing N=8 staging)
T2 sort  : ~3640 MB (cut #2, was 5200)
T3 match : ~3700 MB (cut #3, was 5200)
T3 sort  : ~3155 MB (no change needed)

Overall peak: ~3700 MB at T3 match, which fits kMinimalFloorBytes (3828 MiB = ~3700 MB + 128 MB margin) and the 4 GiB-card floor.

Cut #6 (Xs gen+sort+pack tiled) deferred: Xs already under 4 GiB and not the bottleneck.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
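The peak figures quoted for cuts #1 and #4 follow from simple per-entry byte counts, which the sketch below reproduces. It assumes one byte per cap entry costs about 260 MB at k=28 (derived from the "20 cap = 5200 MB" figure in the earlier commit message, not measured); `gather_peak_mb` and `t1_match_peak_mb` are hypothetical helper names.

```cpp
#include <cassert>

// Assumption: at k=28, one byte per cap entry is about 260 MB
// (taken from "8 + 8 + 4 = 20 cap (5200 MB)" in the commit text).
constexpr double kMbPerCapByte = 260.0;

// Cut #1 gather live-set: 8-byte meta + 8/N-byte gather tile + 4-byte merged vals.
// N = 1 corresponds to the untiled single-shot gather.
inline double gather_peak_mb(int n_tiles) {
    assert(n_tiles >= 1);
    return (8.0 + 8.0 / n_tiles + 4.0) * kMbPerCapByte;
}

// Cut #4 T1 match: full d_xs + cap/N staging (u64 meta + u32 mi = 12 bytes) + temp.
// The d_xs and temp defaults are the k=28 figures quoted in the commit.
inline double t1_match_peak_mb(int n_tiles, double d_xs_mb = 2080.0, double temp_mb = 80.0) {
    assert(n_tiles >= 1);
    return d_xs_mb + (12.0 / n_tiles) * kMbPerCapByte + temp_mb;
}
```

With N=1 these give 5200 MB and 5280 MB, the untiled figures; with N=4 they give 3640 MB and 2940 MB, matching the budget table above.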
Jsewill pushed a commit that referenced this pull request on Apr 28, 2026
Verification on RTX 4090 with --tier minimal forced via
XCHPLOT2_STREAMING=1 surfaced two corrections:
1. The post-cuts overall peak is 4228 MB at T3 sort phase
(d_t3 2080 + d_frags_out 2080 + CUB scratch 68), NOT
~3700 MB at T3 match as the previous commit's README +
dispatch message claimed. Cuts #1+#2 reduced T1/T2-sort
GATHER peaks (sub-phases inside the sort phase) but the
CUB DeviceRadixSort itself still allocates four cap-sized
uint32 buffers under DoubleBuffer mode (~4 cap = 4160 MB
at k=28), so T1/T2/T3 sort phases stay near the 4 GiB
line. Per-phase peaks measured:
Xs : 4136 MB (4 cap u32 + CUB scratch 40)
T1 sort : 4180 MB (4 cap u32 + CUB scratch 20)
T2 sort : 4170 MB (4 cap u32 + CUB scratch 10)
T3 match : ~3700 MB (cut #3 working as designed)
T3 sort : 4228 MB ← bottleneck
Compact tier unchanged at 5200 MB peak (no cuts active).
Drop from 5200 → 4228 (−972 MB / −19%) is real, just less
than the README claimed.
2. 4 GiB cards (≤ ~3.5 GiB free post-CUDA-context) still
don't fit. Closing that gap requires the SYCL-branch's
cuts #5 (CUB sort output tiling with host accumulators
across T1/T2/T3 sort) and #6 (Xs gen+sort+pack tiling) —
not yet ported to cuda-only. The minimal tier in its
current form fits 5 GiB+ cards (RTX 2060, RX 6600 / 6600
XT, RX 7600) comfortably with ~1 GiB headroom.
Updates:
src/host/BatchPlotter.cpp: kMinimalFloorBytes 3828 → 4356
MiB (= 4228 measured peak + 128 MB margin). Dispatch
message floor reads as "4.25 GiB floor" instead of the
overstated "3.74 GiB".
README.md: minimal-tier description rewritten with measured
peak (4228 MB), the new bottleneck phase (T3 sort), the
accurate target hardware (5 GiB+ cards, not 4 GiB), and a
pointer to cuts #5/#6 as the remaining work for genuine
4 GiB-card support. Top-of-file streaming-floor summary
updated 3.8 → 4.25 GiB.
tools/xchplot2/cli.cpp: --tier help text updated to match.
Verified byte-identical at k=22 across plain / compact / minimal
(sha256 17dbf594…) and at k=28 across compact / minimal
(sha256 f42e62ad…). Plain pool and compact streaming paths
unchanged by this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
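The floor update is a one-line calculation: measured peak plus a fixed margin, shown in GiB in the dispatch message. A small sketch, assuming hypothetical helpers `floor_mib`, `floor_label` and `fits_floor` (the real BatchPlotter code is not shown in this commit) and following the commit's loose mixing of MB and MiB:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Floor = measured peak + safety margin. Units follow the commit's usage,
// which adds a peak quoted in MB to a margin in MB and calls the sum MiB.
inline uint64_t floor_mib(uint64_t measured_peak, uint64_t margin = 128) {
    return measured_peak + margin;
}

// Format a MiB floor the way the dispatch message quotes it ("4.25 GiB floor").
inline std::string floor_label(uint64_t mib) {
    assert(mib > 0);
    char buf[32];
    std::snprintf(buf, sizeof buf, "%.2f GiB floor", static_cast<double>(mib) / 1024.0);
    return std::string(buf);
}

// A tier is eligible only if the free VRAM reaches its floor.
inline bool fits_floor(uint64_t free_mib, uint64_t floor) {
    return free_mib >= floor;
}
```

Under these assumptions the old value 3828 renders as 3.74 GiB and the new value 4356 as 4.25 GiB, and a card with about 3.5 GiB (3584 MiB) free fails the check against either floor.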
Jsewill pushed a commit that referenced this pull request on Apr 28, 2026
Replaces the four cap-sized uint32/uint64 buffers that CUB
DeviceRadixSort needs in DoubleBuffer mode with cap/N per-tile
buffers + host accumulators across all three sort phases. Drops
each phase peak from ~4200 MB to 3640 MB at k=28 by paying ~7 sec
of host CPU merge time per plot (k=28 minimal: 12 → 22 s/plot).
Each tile's CUB sort lands either in the input slice (d_*_mi /
d_t3) or the cap/N alternate buffer; whichever side it lands on,
we D2H to a host pinned accumulator at the matching offset. After
all tiles, we free the per-tile device buffers and the input
buffer, then run a tree of pairwise stable in-place merges on
host (std::inplace_merge for keys-only; a hand-rolled
paired_merge_t* for the pairs cases). The result is a globally
sorted run that we H2D back to the output buffer that downstream
consumers expect.
T1 sort: scratch.h_keys_merged + scratch.h_t2_xbits as accumulators.
h_keys_merged was already going to receive the T1 sorted-mi park
after the device-side merge — cut #5 just writes it directly,
skipping the round-trip. h_t2_xbits is dead at T1 sort time
(T2 match staging hasn't filled it yet) so it doubles as the
T1 vals accumulator. Final H2D rehydrates d_t1_merged_vals
from h_t2_xbits for the cut #1 gather phase. d_t1_keys_merged
stays null — h_keys_merged is already the parked form. Per-
phase peak: 4180 → 3640 MB.
T2 sort: scratch.h_keys_merged + a per-plot pinned h_t2_sort_vals
(cap × u32, freed at end of phase). h_t2_xbits is NOT reused
for T2 sort — cut #2's xbits gather still reads h_t2_xbits as
the parked unsorted xbits stream, so an in-place reuse would
corrupt that data. Mirror of T1 sort otherwise. Per-phase peak:
4170 → 3640 MB.
T3 sort: scratch.h_meta as the keys-only accumulator. h_meta's
cut #3 lifetime as parked T2 meta ends at the H2D-back step
that cut #3 emits before T3 sort entry, so it's reusable.
SortKeys (no vals) → std::inplace_merge for the host merge
step. Per-phase peak: 4228 → 3640 MB.
Plus a small init_u32_identity_offset kernel — the cap/N tile
sort needs its vals_in seeded with global positions
[tile_start..tile_end) so the post-merge d_merged_vals stream
indexes directly into the cap-sized d_t*_meta / d_t*_xbits.
Verification (RTX 4090 at k=22 + k=28 strength=2):
- k=22 plain / compact / minimal byte-identical (sha256 17dbf594…).
- k=28 minimal byte-identical with k=28 compact (sha256 f42e62ad…).
- k=28 minimal peak 4228 → 4136 MB (-92 MB; cut #5 saves on
each sort phase but cap-sized Xs gen+sort+pack is the new
overall bottleneck — cut #6 closes that gap).
- Compact / plain paths unchanged (the new tile path is gated
on scratch.gather_tile_count >= 2 + the per-tier scratch
pinned slots being populated).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
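The tile-sort plus host merge-tree idea in cut #5 can be sketched entirely on the host. This is an illustration under stated assumptions, not the pipeline's code: `std::stable_sort` stands in for the per-tile CUB radix sort, key/value pairs are stored as an array of structs merged with `std::inplace_merge` (the commit uses a hand-rolled paired_merge_t* over separate streams), and `KeyVal`, `key_less` and `tiled_sort` are hypothetical names. The vals are seeded with global positions, as the commit describes for init_u32_identity_offset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct KeyVal {
    uint32_t key;
    uint32_t val;  // global position of the key in the unsorted input
};

inline bool key_less(const KeyVal& a, const KeyVal& b) { return a.key < b.key; }

// Sort `keys` in n_tiles independent tiles, then merge the sorted runs on host.
inline std::vector<KeyVal> tiled_sort(const std::vector<uint32_t>& keys, std::size_t n_tiles) {
    assert(n_tiles >= 1);
    const std::size_t cap = keys.size();
    std::vector<KeyVal> acc(cap);     // stands in for the pinned host accumulator
    std::vector<std::size_t> bounds;  // run i covers [bounds[i], bounds[i + 1])
    for (std::size_t t = 0; t <= n_tiles; ++t) bounds.push_back(cap * t / n_tiles);

    for (std::size_t t = 0; t < n_tiles; ++t) {
        // Seed vals with global positions [tile_start, tile_end).
        for (std::size_t i = bounds[t]; i < bounds[t + 1]; ++i)
            acc[i] = KeyVal{keys[i], static_cast<uint32_t>(i)};
        // Per-tile sort (stand-in for the device-side sort of one tile).
        std::stable_sort(acc.begin() + bounds[t], acc.begin() + bounds[t + 1], key_less);
    }

    // Tree of pairwise stable in-place merges over adjacent runs.
    while (bounds.size() > 2) {
        std::vector<std::size_t> next;
        std::size_t i = 0;
        for (; i + 2 < bounds.size(); i += 2) {
            std::inplace_merge(acc.begin() + bounds[i], acc.begin() + bounds[i + 1],
                               acc.begin() + bounds[i + 2], key_less);
            next.push_back(bounds[i]);
        }
        if (i + 1 < bounds.size()) next.push_back(bounds[i]);  // odd run carries over
        next.push_back(bounds.back());
        bounds.swap(next);
    }
    return acc;
}
```

Because both the per-tile sort and the merges are stable and the vals start as global positions, the result should not depend on the tile count, which is consistent with the byte-identical output the commit reports across tiers.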
Jsewill pushed a commit that referenced this pull request on May 6, 2026
…iveCpp
Three layered install paths so users can pick the friction they want:
1. Containerfile (podman-first, also docker). Build args select the
base image: nvidia/cuda for CUB+SYCL, rocm/dev-ubuntu for AMD,
intel/oneapi for Intel (experimental). All variants build
AdaptiveCpp 25.10 from source inside the image and ship a slim
runtime stage. ~15-30 min first build, layer-cached after.
2. scripts/install-deps.sh — distro-aware native bootstrap covering
Arch, Ubuntu/Debian, and Fedora families. Detects GPU vendor via
nvidia-smi/rocminfo and installs the right toolchain (full CUDA
for NVIDIA, CUDA *headers* + ROCm for AMD), then builds
AdaptiveCpp into /opt/adaptivecpp. --no-acpp opts out and lets
CMake fetch it.
3. CMake FetchContent fallback. find_package(AdaptiveCpp QUIET)
followed by FetchContent_Declare at v25.10.0 with
FetchContent_MakeAvailable when the local lookup fails. Opt-in
option XCHPLOT2_FETCH_ADAPTIVECPP=ON (default ON). The
add_sycl_to_target macro is verified after the fetch — if
AdaptiveCpp doesn't expose it as a subproject we error with a
pointer to the manual install.
build.rs also now reads $XCHPLOT2_BUILD_CUDA so the AMD/Intel container
builds can flip XCHPLOT2_BUILD_CUDA=OFF without touching CMake invocation.
README's Build section restructured into three clearly-labeled paths
with the full dependency table moved into path #3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps hadolint/hadolint-action from 3.1.0 to 3.3.0.
Release notes
Sourced from hadolint/hadolint-action's releases.
Commits
- 2332a7b feat: trigger release workflow
- 2bfd2b9 Don't trigger release workflow on Tag
- 0931ae0 Release v3.3.0
- 3fc49fb feat: new minor release
- 45eb072 Trigger release workflow on tag
- 97f3e4f Merge pull request #94 from felipecrs/patch-1
- 3e9a095 Merge branch 'master' into patch-1
- 3285327 Merge pull request #96 from m-ildefons/update-ci-yml
- 8bde06f Update CI yml
- 24598f4 Update base image for Hadolint