build(deps): bump hadolint/hadolint-action from 3.1.0 to 3.3.0 by dependabot[bot] · Pull Request #3 · Jsewill/xchplot2

dependabot · 2026-04-27T20:58:49Z

Bumps hadolint/hadolint-action from 3.1.0 to 3.3.0.

Release notes

Sourced from hadolint/hadolint-action's releases.

v3.3.0

3.3.0 (2025-09-22)

Features

trigger release workflow (2332a7b)

v3.2.0

3.2.0 (2025-09-03)

Features

new minor release (3fc49fb)

Commits

2332a7b feat: trigger release workflow
2bfd2b9 Don't trigger release workflow on Tag
0931ae0 Release v3.3.0
3fc49fb feat: new minor release
45eb072 Trigger release workflow on tag
97f3e4f Merge pull request #94 from felipecrs/patch-1
3e9a095 Merge branch 'master' into patch-1
3285327 Merge pull request #96 from m-ildefons/update-ci-yml
8bde06f Update CI yml
24598f4 Update base image for Hadolint
Additional commits viewable in compare view

Bumps [hadolint/hadolint-action](https://github.com/hadolint/hadolint-action) from 3.1.0 to 3.3.0.
- [Release notes](https://github.com/hadolint/hadolint-action/releases)
- [Commits](hadolint/hadolint-action@v3.1.0...v3.3.0)

---
updated-dependencies:
- dependency-name: hadolint/hadolint-action
  dependency-version: 3.3.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

…knobs

Kernel-side scaffolding for the six-cut minimal-tier port from main's df21286. No host-side wiring yet: existing call sites continue to go through the thin launch_t*_match wrappers and see no behavior change. The next commit wires cuts #1, #2, #3 in GpuPipeline.cu streaming impl + BatchPlotter dispatch.

src/gpu/T1Kernel.{cu,cuh}: split launch_t1_match into launch_t1_match_prepare (computes bucket + fine-bucket offset arrays once per plot, resets d_out_count) and launch_t1_match_range (runs match_all_buckets over a [b_begin, b_end) bucket sub-range, accumulating into d_out_meta + d_out_mi + d_out_count via atomicAdd). The original launch_t1_match becomes a thin prepare+range wrapper for the pool path and parity tests. match_all_buckets gains a uint32_t bucket_begin parameter; bucket_id is now bucket_begin + blockIdx.y so range launches resolve to the correct (section_l, match_key_r) tuples, a mirror of the existing T2 / T3 prepare-range plumbing (d4f54ae and b86939f). Used by the upcoming cut #4 (T1 match sliced per section_l).

src/gpu/T3Kernel.{cu,cuh}: T3 match_all_buckets gains two int64_t biases (meta_l_index_bias, meta_r_index_bias) that shift the kernel-internal global l/r indices into a sliced-meta buffer position. Full-cap callers pass biases = 0 so indexing is unchanged. The existing launch_t3_match_range wrapper passes 0/0; behavior preserved.

Add launch_t3_match_section_pair_range, which accepts a sliced d_sorted_meta buffer (section_l + section_r rows packed) plus the two biases. Used by the upcoming cut #3 (T3 match section-pair input slicing): d_t2_meta_sorted parked on pinned host across T3 match, the two row slices H2D'd per pass, d_t2_xbits_sorted + d_t2_keys_merged stay full-cap on device for binary-search / target reads. Drops T3 match peak from 5200 → ~3700 MB at k=28.

Expose matching_section_host(section_l, num_section_bits) so the streaming caller can compute section_r on the host side from section_l (the kernel still does this internally; this helper avoids duplicating the rotation math at the wiring site).

src/host/GpuPipeline.hpp: StreamingPinnedScratch gains two knobs:

- gather_tile_count (default 1): T1 / T2 sort gather tile count. When >= 2, the merged-key + permuted-meta gather output is D2H'd per tile to host pinned (h_meta / h_keys_merged) so the cap-sized sorted_meta never has to be alive on device in full. Drops T1-sort and T2-sort phase peaks from 5200 → ~3640 MB at k=28.
- t3_input_slice_count (default 1): T3 match input-slice count. When >= 2, d_t2_meta_sorted is parked on h_meta across T3 match and each pass H2Ds the section_l + section_r row slices onto cap/N device buffers. Must equal num_sections (= 4 at k=28 strength=2) when active.

Defaults preserve old compact-tier behavior. The minimal tier will set both in the upcoming BatchPlotter wiring.

All TUs nvcc-clean at sm_89. Existing parity tests + pool path unaffected: they call launch_t1_match / launch_t3_match (thin wrappers) which preserve the original API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
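The index biases described in this commit can be illustrated on the host. This is a minimal sketch, not the project's code: it assumes the slice packs section_l rows first and section_r rows directly after them, and that the kernel subtracts the bias from a global index. The names `SliceBiases`, `make_biases`, and `to_slice_index`, as well as the sign convention, are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration of the two int64_t biases; not the real API.
struct SliceBiases {
    std::int64_t meta_l_index_bias;
    std::int64_t meta_r_index_bias;
};

// Assumed slice layout:
//   slice[0, l_count)                 <- section_l rows
//   slice[l_count, l_count + r_count) <- section_r rows
// Assumed convention: slice_index = global_index - bias.
inline SliceBiases make_biases(std::uint64_t l_begin, std::uint64_t l_count,
                               std::uint64_t r_begin) {
    SliceBiases b;
    b.meta_l_index_bias = static_cast<std::int64_t>(l_begin);
    // Negative whenever r_begin < l_count (for example section_r starting
    // at global row 0), which is why a signed type is needed.
    b.meta_r_index_bias = static_cast<std::int64_t>(r_begin) -
                          static_cast<std::int64_t>(l_count);
    return b;
}

inline std::uint64_t to_slice_index(std::uint64_t global_index,
                                    std::int64_t bias) {
    const std::int64_t idx = static_cast<std::int64_t>(global_index) - bias;
    assert(idx >= 0);  // a valid bias never maps below the slice start
    return static_cast<std::uint64_t>(idx);
}
```

With both biases set to 0 the mapping is the identity, which matches the commit's statement that full-cap callers see unchanged indexing.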

Closes the gap d4f54ae's caveat flagged: cuda-only minimal was aspirational (kMinimalFloorBytes = 3828 MiB advertised, but real peak still 5200 MB at T1 sort / T2 sort / T3 match). The three remaining SYCL-branch cuts now land on this branch and bring all three phases below the 4 GiB cliff.

Cut #1: T1 sort gather tiled. src/host/GpuPipeline.cu T1 sort phase: when scratch.gather_tile_count >= 2, the per-tile sort-merge feeds a tiled gather instead of a single-shot one. Per tile: gather_u64 to a cap/N device tile, D2H to h_meta on host (whose unsorted-meta park lifetime ended at the JIT H2D into d_t1_meta a few lines earlier; h_meta is dead, free for reuse as the sorted-meta accumulator). After the loop, free d_t1_meta + merged_vals + tile, allocate d_t1_meta_sorted full-cap, H2D from h_meta. Live-set during gather drops from 8 + 8 + 4 = 20 cap (5200 MB) to 8 + 8/N + 4 = 12 + 8/N cap. At N=4: 14 cap = 3640 MB.

Cut #2: T2 sort meta + xbits gathers tiled, deferred re-hydrate. Mirror of cut #1 at the T2 sort gather sites, plus a deferred re-hydrate so d_t2_meta_sorted (8 cap) and d_t2_xbits_sorted (4 cap) don't co-reside with d_merged_vals (4 cap). Both accumulators land on host first (h_meta + h_t2_xbits), then d_merged_vals is freed, then both sorted streams are re-hydrated full-cap on device for T3 match. Gather peak: 5200 → ~3640 MB. Re-hydrate peak: ~3120 MB.

Cut #3: T3 match section-pair input slicing. src/host/GpuPipeline.cu T3 match phase: a new t3_input_slice_path branch precedes the existing t3_stage_path. When scratch.t3_input_slice_count >= 2, cut #2's deferred re-hydrate skips the d_t2_meta_sorted H2D entirely and T2 meta stays parked on h_meta. The T3 match phase then:

1. launch_t3_match_prepare to populate d_offsets in the temp storage region.
2. D2H d_offsets so the host loop can compute section_l / section_r row spans. Tiny (17 × 8 = 136 bytes at k=28 strength=2).
3. For each section_l ∈ [0, num_sections): compute section_r via matching_section_host, look up the row spans, H2D the section_l + section_r meta rows from h_meta into a cap/2 device slice buffer (tightly packed at indices [0, l_count) and [l_count, l_count + r_count)), set the kernel biases to map global l/r → slice indices, run launch_t3_match_section_pair_range over the section_l × num_match_keys bucket sub-range, D2H d_t3_stage to a per-plot pinned h_t3_acc accumulator at offset t3_count, increment t3_count.
4. After all section_l: free d_t2_meta_slice + d_t3_stage + d_t3_match_temp + d_t2_xbits_sorted + d_t2_keys_merged, allocate d_t3 full-cap, H2D from h_t3_acc, free h_t3_acc.

Per-plot pinned h_t3_acc (cap × T3PairingGpu = cap × u64) is necessary because h_meta is in active read-use across the section_l loop and can't double as the existing t3_stage_path's accumulator. T3 match peak: 5200 → ~3700 MB (cap/2 meta slice 1040 + cap xbits 1040 + cap keys_merged 1040 + cap/4 t3 stage 520 + offsets ~80 = ~3720 MB at k=28).

src/host/BatchPlotter.cpp: minimal tier sets gather_tile_count = 4 (= num_sections at k=28 strength=2) and t3_input_slice_count = num_sections. Dispatch message updated to advertise the layered cuts. kMinimalFloorBytes stays 3828 MiB, which already matches expected peak (~3700 MB) + 128 MB margin.

README.md: minimal-tier description rewritten to describe the three layered cuts, the new bottleneck (T3 match at ~3700 MB), and the wider 4-GiB-card target. The b86939f-era "N=8 T2 staging only" wording was stale after d4f54ae shifted the bottleneck.

Verification on hardware (RTX 4090 was main's verification host):

- k=22 batch across plain / compact / minimal must produce byte-identical .plot2 output (cuts re-shape memory only).
- k=28 minimal forced under POS2GPU_MAX_VRAM_MB=4096 should dispatch minimal and complete; POS2GPU_STREAMING_STATS=1 should confirm peak ≤ ~3700 MB.
- k=28 minimal vs k=28 compact must be byte-identical.

Cuts #4 (T1 match sliced) and #6 (Xs gen+sort+pack tiled) deferred: they're additive savings on phases that are no longer the bottleneck after the above three. Cut #4's kernel-side split landed in the previous commit so the wiring is straightforward when needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
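The live-set arithmetic in this commit message can be reproduced in a few lines. This is a sketch under stated assumptions: the 260 MB unit is derived from the message's own "20 cap = 5200 MB" figure at k=28, the element widths (u64 meta and T3 stage, u32 xbits and keys) are read off the message's breakdown, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// MB occupied by one cap-sized buffer of 1-byte elements at k=28,
// derived from the commit's figure "20 cap = 5200 MB".
constexpr std::uint32_t kCapMb = 5200 / 20;  // 260

// Gather live set with N tiles: 8 cap (u64 meta) + 8/N cap (gather tile)
// + 4 cap (u32 merged vals) = 12 + 8/N cap.
inline std::uint32_t gather_peak_mb(std::uint32_t tiles) {
    assert(tiles >= 1);
    return (8 + 4) * kCapMb + (8 * kCapMb) / tiles;
}

// T3 match live set with section-pair input slicing.
inline std::uint32_t t3_match_peak_mb() {
    const std::uint32_t meta_slice  = 8 * kCapMb / 2;  // cap/2 x u64
    const std::uint32_t xbits       = 4 * kCapMb;      // cap x u32
    const std::uint32_t keys_merged = 4 * kCapMb;      // cap x u32
    const std::uint32_t t3_stage    = 8 * kCapMb / 4;  // cap/4 x u64
    const std::uint32_t offsets     = 80;              // approximate, per the commit
    return meta_slice + xbits + keys_merged + t3_stage + offsets;
}
```

Under these assumptions `gather_peak_mb(1)` gives the untiled 5200 MB, `gather_peak_mb(4)` gives 3640 MB, and `t3_match_peak_mb()` gives 3720 MB, in line with the figures quoted in the message.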

Cuts #1+#2+#3 brought T1 sort, T2 sort, and T3 match below 4 GiB, but T1 match was unaffected and stayed at ~5280 MB at k=28 (d_xs 2080 + d_t1_meta 2080 + d_t1_mi 1040 + temp ~80), which made it the new overall pipeline peak. Cut #4 closes that gap.

src/host/GpuPipeline.cu T1 match phase: when scratch.gather_tile_count >= 2, gate a tiled_t1_match branch that uses the existing launch_t1_match_prepare + launch_t1_match_range plumbing landed in commit bca9bf1. Each section_l pass writes to cap/N device staging buffers (cap/N × u64 meta + cap/N × u32 mi), D2H'd per pass to scratch.h_meta + a per-plot pinned h_t1_mi accumulator at offset t1_count. After all passes, free stage + d_xs and re-hydrate d_t1_mi full-cap from h_t1_mi for the upcoming T1 sort. d_t1_meta is never allocated: h_meta already holds the unsorted meta when entering T1 sort, so the existing park step becomes a no-op (now gated on d_t1_meta != nullptr).

Peak: d_xs (2080) + cap/N × 12 (stage) + temp ≈ 2940 MB at N=4 (= num_sections at k=28 strength=2). Plain / compact paths unchanged.

src/host/BatchPlotter.cpp: dispatch message updated to advertise "N=4 T1-match" alongside the existing "T1/T2 sort gather" and "T3 input slicing" cuts.

README.md: minimal-tier description rewritten as four layered cuts (was three); it adds cut #4 and re-orders the "compact's tied 5200 MB" summary to include T1 match.

After this commit the cuda-only minimal-tier peak budget at k=28 strength=2 should be:

    Xs phase : ~3072 MB (unchanged, no cut #6 yet)
    T1 match : ~2940 MB (cut #4, was 5280)
    T1 sort  : ~3640 MB (cut #1, was 5200)
    T2 match : ~3640 MB (existing N=8 staging)
    T2 sort  : ~3640 MB (cut #2, was 5200)
    T3 match : ~3700 MB (cut #3, was 5200)
    T3 sort  : ~3155 MB (no change needed)

Overall peak: ~3700 MB at T3 match, which fits kMinimalFloorBytes (3828 MiB = ~3700 MB + 128 MB margin) and the 4 GiB-card floor.

Cut #6 (Xs gen+sort+pack tiled) deferred: Xs is already under 4 GiB and not the bottleneck.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
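The per-pass staging pattern used by cut #4 (a small staging buffer drained into a host accumulator at a running offset) can be sketched on the host. This is an illustration only: `append_pass` is a hypothetical name, plain vectors stand in for the device staging buffer and the pinned accumulator, and `std::copy_n` stands in for the D2H copy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies the first pass_count entries of one pass's staging buffer into the
// host accumulator at `offset` and returns the new running offset
// (the role t1_count plays in the commit message).
inline std::size_t append_pass(const std::vector<std::uint32_t>& stage,
                               std::size_t pass_count,
                               std::vector<std::uint32_t>& h_acc,
                               std::size_t offset) {
    assert(pass_count <= stage.size());          // staging is only cap/N wide
    assert(offset + pass_count <= h_acc.size()); // accumulator is cap wide
    std::copy_n(stage.begin(), pass_count, h_acc.begin() + offset);
    return offset + pass_count;
}
```

Because each pass appends at the offset left by the previous one, the accumulator ends up tightly packed regardless of how many matches each section_l pass produced, and the staging buffer can be reused across passes.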

Verification on RTX 4090 with --tier minimal forced via XCHPLOT2_STREAMING=1 surfaced two corrections:

1. The post-cuts overall peak is 4228 MB at T3 sort phase (d_t3 2080 + d_frags_out 2080 + CUB scratch 68), NOT ~3700 MB at T3 match as the previous commit's README + dispatch message claimed. Cuts #1+#2 reduced T1/T2-sort GATHER peaks (sub-phases inside the sort phase) but the CUB DeviceRadixSort itself still allocates four cap-sized uint32 buffers under DoubleBuffer mode (~4 cap = 4160 MB at k=28), so T1/T2/T3 sort phases stay near the 4 GiB line. Per-phase peaks measured:

        Xs       : 4136 MB (4 cap u32 + CUB scratch 40)
        T1 sort  : 4180 MB (4 cap u32 + CUB scratch 20)
        T2 sort  : 4170 MB (4 cap u32 + CUB scratch 10)
        T3 match : ~3700 MB (cut #3 working as designed)
        T3 sort  : 4228 MB ← bottleneck

   Compact tier unchanged at 5200 MB peak (no cuts active). Drop from 5200 → 4228 (−972 MB / −19%) is real, just less than the README claimed.

2. 4 GiB cards (≤ ~3.5 GiB free post-CUDA-context) still don't fit. Closing that gap requires the SYCL-branch's cuts #5 (CUB sort output tiling with host accumulators across T1/T2/T3 sort) and #6 (Xs gen+sort+pack tiling), neither of which is ported to cuda-only yet. The minimal tier in its current form fits 5 GiB+ cards (RTX 2060, RX 6600 / 6600 XT, RX 7600) comfortably with ~1 GiB headroom.

Updates:

src/host/BatchPlotter.cpp: kMinimalFloorBytes 3828 → 4356 MiB (= 4228 measured peak + 128 MB margin). Dispatch message floor reads as "4.25 GiB floor" instead of the overstated "3.74 GiB".

README.md: minimal-tier description rewritten with measured peak (4228 MB), the new bottleneck phase (T3 sort), the accurate target hardware (5 GiB+ cards, not 4 GiB), and a pointer to cuts #5/#6 as the remaining work for genuine 4 GiB-card support. Top-of-file streaming-floor summary updated 3.8 → 4.25 GiB.

tools/xchplot2/cli.cpp: --tier help text updated to match.

Verified byte-identical at k=22 across plain / compact / minimal (sha256 17dbf594…) and at k=28 across compact / minimal (sha256 f42e62ad…). Plain pool and compact streaming paths unchanged by this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
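The way the new floor follows from the measured peaks can be written out as a short calculation. This is a sketch, not the project's code: `floor_from_peaks` and `kMarginMb` are hypothetical names, and the rule "largest per-phase peak plus a 128 MB margin" is inferred from the commit's "4228 measured peak + 128 MB margin".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <initializer_list>

constexpr std::uint32_t kMarginMb = 128;  // margin quoted in the commit

// Floor = largest measured per-phase peak + fixed margin.
inline std::uint32_t floor_from_peaks(
        std::initializer_list<std::uint32_t> peaks_mb) {
    assert(peaks_mb.size() > 0);
    return std::max(peaks_mb) + kMarginMb;
}
```

Feeding in the five measured peaks listed in the commit (4136, 4180, 4170, 3700, 4228) yields 4356, the value the commit assigns to kMinimalFloorBytes.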

Replaces the four cap-sized uint32/uint64 buffers that CUB DeviceRadixSort needs in DoubleBuffer mode with cap/N per-tile buffers + host accumulators across all three sort phases. Drops each phase peak from ~4200 MB to 3640 MB at k=28 by paying ~7 sec of host CPU merge time per plot (k=28 minimal: 12 → 22 s/plot).

Each tile's CUB sort lands either in the input slice (d_*_mi / d_t3) or the cap/N alternate buffer; whichever side it lands on, we D2H to a host pinned accumulator at the matching offset. After all tiles, we free the per-tile device buffers and the input buffer, then run a tree of pairwise stable in-place merges on host (std::inplace_merge for keys-only; a hand-rolled paired_merge_t* for the pairs cases). The result is a globally sorted run that we H2D back to the output buffer that downstream consumers expect.

T1 sort: scratch.h_keys_merged + scratch.h_t2_xbits as accumulators. h_keys_merged was already going to receive the T1 sorted-mi park after the device-side merge; cut #5 just writes it directly, skipping the round-trip. h_t2_xbits is dead at T1 sort time (T2 match staging hasn't filled it yet) so it doubles as the T1 vals accumulator. Final H2D rehydrates d_t1_merged_vals from h_t2_xbits for the cut #1 gather phase. d_t1_keys_merged stays null: h_keys_merged is already the parked form. Per-phase peak: 4180 → 3640 MB.

T2 sort: scratch.h_keys_merged + a per-plot pinned h_t2_sort_vals (cap × u32, freed at end of phase). h_t2_xbits is NOT reused for T2 sort: cut #2's xbits gather still reads h_t2_xbits as the parked unsorted xbits stream, so an in-place reuse would corrupt that data. Mirror of T1 sort otherwise. Per-phase peak: 4170 → 3640 MB.

T3 sort: scratch.h_meta as the keys-only accumulator. h_meta's cut #3 lifetime as parked T2 meta ends at the H2D-back step that cut #3 emits before T3 sort entry, so it's reusable. SortKeys (no vals) → std::inplace_merge for the host merge step. Per-phase peak: 4228 → 3640 MB.

Plus a small init_u32_identity_offset kernel: the cap/N tile sort needs its vals_in seeded with global positions [tile_start..tile_end) so the post-merge d_merged_vals stream indexes directly into the cap-sized d_t*_meta / d_t*_xbits.

Verification (RTX 4090 at k=22 + k=28 strength=2):

- k=22 plain / compact / minimal byte-identical (sha256 17dbf594…).
- k=28 minimal byte-identical with k=28 compact (sha256 f42e62ad…).
- k=28 minimal peak 4228 → 4136 MB (-92 MB; cut #5 saves on each sort phase but cap-sized Xs gen+sort+pack is the new overall bottleneck; cut #6 closes that gap).
- Compact / plain paths unchanged (the new tile path is gated on scratch.gather_tile_count >= 2 + the per-tier scratch pinned slots being populated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
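The tile-sort-then-host-merge scheme of cut #5 can be sketched entirely on the host. This is a simplified illustration under several assumptions: `std::stable_sort` stands in for the per-tile CUB radix sort, a single vector of (key, position) pairs stands in for the separate key and value accumulators, and `std::inplace_merge` with a key-only comparator stands in for the hand-rolled paired merge. The function name `tiled_sort_with_host_merge` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using KeyVal = std::pair<std::uint32_t, std::uint32_t>;  // (key, global position)

// Sorts `keys` in independent tiles, then merges the sorted runs pairwise.
// Returns (key, original index) pairs in stable ascending key order.
inline std::vector<KeyVal> tiled_sort_with_host_merge(
        const std::vector<std::uint32_t>& keys, std::size_t num_tiles) {
    assert(num_tiles >= 1);
    const std::size_t n = keys.size();
    const std::size_t tile = (n + num_tiles - 1) / num_tiles;
    auto by_key = [](const KeyVal& a, const KeyVal& b) { return a.first < b.first; };

    std::vector<KeyVal> acc(n);       // stand-in for the host accumulator
    std::vector<std::size_t> bounds;  // start offset of every sorted run
    for (std::size_t start = 0; start < n; start += tile) {
        const std::size_t end = std::min(start + tile, n);
        // Seed values with global positions [start, end), the job the
        // commit assigns to the identity-offset kernel.
        for (std::size_t i = start; i < end; ++i)
            acc[i] = KeyVal(keys[i], static_cast<std::uint32_t>(i));
        std::stable_sort(acc.begin() + start, acc.begin() + end, by_key);
        bounds.push_back(start);
    }
    bounds.push_back(n);

    // Tree of pairwise stable merges: every round halves the run count.
    while (bounds.size() > 2) {
        std::vector<std::size_t> next;
        std::size_t i = 0;
        for (; i + 2 < bounds.size(); i += 2) {
            std::inplace_merge(acc.begin() + bounds[i], acc.begin() + bounds[i + 1],
                               acc.begin() + bounds[i + 2], by_key);
            next.push_back(bounds[i]);
        }
        if (i + 1 < bounds.size()) next.push_back(bounds[i]);  // odd run carried over
        next.push_back(n);
        bounds.swap(next);
    }
    return acc;
}
```

Because both the per-tile sort and the merges are stable, and earlier tiles hold lower global positions, the result matches a single stable sort over the whole array. The value column then indexes directly into the original cap-sized arrays, which is the property the identity-offset seeding is meant to preserve.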

…iveCpp

Three layered install paths so users can pick the friction they want:

1. Containerfile (podman-first, also docker). Build args select the base image: nvidia/cuda for CUB+SYCL, rocm/dev-ubuntu for AMD, intel/oneapi for Intel (experimental). All variants build AdaptiveCpp 25.10 from source inside the image and ship a slim runtime stage. ~15-30 min first build, layer-cached after.
2. scripts/install-deps.sh: distro-aware native bootstrap covering Arch, Ubuntu/Debian, and Fedora families. Detects GPU vendor via nvidia-smi/rocminfo and installs the right toolchain (full CUDA for NVIDIA, CUDA *headers* + ROCm for AMD), then builds AdaptiveCpp into /opt/adaptivecpp. --no-acpp opts out and lets CMake fetch it.
3. CMake FetchContent fallback. find_package(AdaptiveCpp QUIET) followed by FetchContent_Declare at v25.10.0 with FetchContent_MakeAvailable when the local lookup fails. Opt-in option XCHPLOT2_FETCH_ADAPTIVECPP=ON (default ON). The add_sycl_to_target macro is verified after the fetch; if AdaptiveCpp doesn't expose it as a subproject we error with a pointer to the manual install.

build.rs also now reads $XCHPLOT2_BUILD_CUDA so the AMD/Intel container builds can flip XCHPLOT2_BUILD_CUDA=OFF without touching CMake invocation.

README's Build section restructured into three clearly-labeled paths with the full dependency table moved into path #3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dependabot Bot added the dependencies (Pull requests that update a dependency file) and github_actions (Pull requests that update GitHub Actions code) labels Apr 27, 2026

dependabot Bot force-pushed the dependabot/github_actions/hadolint/hadolint-action-3.3.0 branch from 9b25c42 to 444c2e4 Compare April 27, 2026 21:43

Jsewill merged commit 66c0a18 into main Apr 27, 2026
11 checks passed

dependabot Bot deleted the dependabot/github_actions/hadolint/hadolint-action-3.3.0 branch April 27, 2026 22:59

Jsewill merged 1 commit into main from dependabot/github_actions/hadolint/hadolint-action-3.3.0