build(deps): bump hadolint/hadolint-action from 3.1.0 to 3.3.0 #3

Merged: Jsewill merged 1 commit into main from dependabot/github_actions/hadolint/hadolint-action-3.3.0 on Apr 27, 2026

dependabot (bot) commented on behalf of github on Apr 27, 2026

Bumps hadolint/hadolint-action from 3.1.0 to 3.3.0.

Release notes

Sourced from hadolint/hadolint-action's releases.

v3.3.0

3.3.0 (2025-09-22)

Features

  • trigger release workflow (2332a7b)

v3.2.0

3.2.0 (2025-09-03)

Features

Commits
  • 2332a7b feat: trigger release workflow
  • 2bfd2b9 Don't trigger release workflow on Tag
  • 0931ae0 Release v3.3.0
  • 3fc49fb feat: new minor release
  • 45eb072 Trigger release workflow on tag
  • 97f3e4f Merge pull request #94 from felipecrs/patch-1
  • 3e9a095 Merge branch 'master' into patch-1
  • 3285327 Merge pull request #96 from m-ildefons/update-ci-yml
  • 8bde06f Update CI yml
  • 24598f4 Update base image for Hadolint
  • Additional commits viewable in compare view

dependabot (bot) added the dependencies (pull requests that update a dependency file) and github_actions (pull requests that update GitHub Actions code) labels on Apr 27, 2026
Bumps [hadolint/hadolint-action](https://github.com/hadolint/hadolint-action) from 3.1.0 to 3.3.0.
- [Release notes](https://github.com/hadolint/hadolint-action/releases)
- [Commits](hadolint/hadolint-action@v3.1.0...v3.3.0)

---
updated-dependencies:
- dependency-name: hadolint/hadolint-action
  dependency-version: 3.3.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
dependabot (bot) force-pushed the dependabot/github_actions/hadolint/hadolint-action-3.3.0 branch from 9b25c42 to 444c2e4 on April 27, 2026 21:43
@Jsewill Jsewill merged commit 66c0a18 into main Apr 27, 2026
11 checks passed
dependabot (bot) deleted the dependabot/github_actions/hadolint/hadolint-action-3.3.0 branch on April 27, 2026 22:59
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
…knobs

Kernel-side scaffolding for the six-cut minimal-tier port from main's
df21286. No host-side wiring yet — existing call sites continue to
go through the thin launch_t*_match wrappers and see no behavior
change. The next commit wires cuts #1, #2, #3 in GpuPipeline.cu
streaming impl + BatchPlotter dispatch.

  src/gpu/T1Kernel.{cu,cuh}: split launch_t1_match into
    launch_t1_match_prepare (computes bucket + fine-bucket offset
    arrays once per plot, resets d_out_count) and launch_t1_match_range
    (runs match_all_buckets over a [b_begin, b_end) bucket sub-range,
    accumulating into d_out_meta + d_out_mi + d_out_count via atomicAdd).
    The original launch_t1_match becomes a thin prepare+range wrapper
    for the pool path and parity tests. match_all_buckets gains a
    uint32_t bucket_begin parameter; bucket_id is now bucket_begin +
    blockIdx.y so range launches resolve to the correct (section_l,
    match_key_r) tuples — mirror of the existing T2 / T3 prepare-range
    plumbing (d4f54ae and b86939f). Used by the upcoming cut #4
    (T1 match sliced per section_l).

  src/gpu/T3Kernel.{cu,cuh}: T3 match_all_buckets gains two int64_t
    biases (meta_l_index_bias, meta_r_index_bias) that shift the
    kernel-internal global l/r indices into a sliced-meta buffer
    position. Full-cap callers pass biases = 0 so indexing is
    unchanged. The existing launch_t3_match_range wrapper passes
    0/0; behavior preserved.

    Add launch_t3_match_section_pair_range — accepts a sliced
    d_sorted_meta buffer (section_l + section_r rows packed) plus
    the two biases. Used by the upcoming cut #3 (T3 match section-pair
    input slicing): d_t2_meta_sorted parked on pinned host across T3
    match, the two row slices H2D'd per pass, d_t2_xbits_sorted +
    d_t2_keys_merged stay full-cap on device for binary-search /
    target reads. Drops T3 match peak from 5200 → ~3700 MB at k=28.

    Expose matching_section_host(section_l, num_section_bits) so the
    streaming caller can compute section_r on the host side from
    section_l (the kernel still does this internally; this helper
    avoids duplicating the rotation math at the wiring site).

  src/host/GpuPipeline.hpp: StreamingPinnedScratch gains two knobs:
    - gather_tile_count (default 1) — T1 / T2 sort gather tile
      count. When >= 2, the merged-key + permuted-meta gather output
      is D2H'd per tile to host pinned (h_meta / h_keys_merged) so
      the cap-sized sorted_meta never has to be alive on device in
      full. Drops T1-sort and T2-sort phase peaks from 5200 → ~3640
      MB at k=28.
    - t3_input_slice_count (default 1) — T3 match input-slice count.
      When >= 2, d_t2_meta_sorted is parked on h_meta across T3 match
      and each pass H2Ds the section_l + section_r row slices onto
      cap/N device buffers. Must equal num_sections (= 4 at k=28
      strength=2) when active.

    Defaults preserve old compact-tier behavior. The minimal tier
    will set both in the upcoming BatchPlotter wiring.
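
A minimal sketch of the two knobs and the constraint stated above. `ScratchKnobs` and the helper functions are hypothetical; the real StreamingPinnedScratch has other members that this commit message does not list. Rejecting a count of 0 is an assumption added here, not something the commit states.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the two new knobs only.
struct ScratchKnobs {
    uint32_t gather_tile_count = 1;     // 1 = single-shot gather (old behavior)
    uint32_t t3_input_slice_count = 1;  // 1 = full-cap T3 match input (old behavior)
};

// Per the commit message, a knob takes effect once it is >= 2.
inline bool gather_tiling_active(const ScratchKnobs& k) {
    return k.gather_tile_count >= 2;
}
inline bool t3_slicing_active(const ScratchKnobs& k) {
    return k.t3_input_slice_count >= 2;
}

// When T3 input slicing is active, the slice count must equal num_sections.
// Assumption: a count of 0 is treated as invalid.
inline bool knobs_valid(const ScratchKnobs& k, uint32_t num_sections) {
    if (k.gather_tile_count == 0 || k.t3_input_slice_count == 0) return false;
    return !t3_slicing_active(k) || k.t3_input_slice_count == num_sections;
}
```

With the defaults both predicates return false, which matches the statement that defaults preserve compact-tier behavior.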

All TUs nvcc-clean at sm_89. Existing parity tests + pool path
unaffected — they call launch_t1_match / launch_t3_match (thin
wrappers) which preserve the original API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Closes the gap d4f54ae's caveat flagged: cuda-only minimal was
aspirational (kMinimalFloorBytes = 3828 MiB advertised, but real
peak still 5200 MB at T1 sort / T2 sort / T3 match). The three
remaining SYCL-branch cuts now land on this branch and bring all
three phases below the 4 GiB cliff.

  Cut #1 — T1 sort gather tiled.
    src/host/GpuPipeline.cu T1 sort phase: when scratch.gather_tile_
    count >= 2, the per-tile sort-merge feeds a tiled gather instead
    of a single-shot one. Per tile: gather_u64 to a cap/N device
    tile, D2H to h_meta on host (whose unsorted-meta park lifetime
    ended at the JIT H2D into d_t1_meta a few lines earlier — h_meta
    is dead, free for reuse as the sorted-meta accumulator). After
    the loop, free d_t1_meta + merged_vals + tile, allocate
    d_t1_meta_sorted full-cap, H2D from h_meta. Live-set during
    gather drops from 8 + 8 + 4 = 20 cap (5200 MB) to 8 + 8/N + 4 =
    12 + 8/N cap. At N=4: 14 cap = 3640 MB.
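
The live-set arithmetic above can be checked with a few lines of C++. The 260 MB unit is derived from the commit's own figures (20 bytes per entry giving 5200 MB at k=28), not measured here, and the mapping of the 8 + 8 + 4 terms to specific buffers is a reading of the text rather than something confirmed against the source.

```cpp
#include <cassert>
#include <cstdint>

// MB occupied by one byte-per-entry of a cap-sized buffer at k=28,
// derived from the commit's figures: 20 bytes/entry -> 5200 MB.
constexpr uint32_t kMbPerCapByte = 5200 / 20;  // 260

// Single-shot gather: 8-byte unsorted meta + 8-byte sorted-meta output
// + 4-byte merged vals, all cap-sized.
constexpr uint32_t single_shot_gather_mb() {
    return (8 + 8 + 4) * kMbPerCapByte;
}

// Tiled gather: the 8-byte output shrinks to a cap/N tile, so the live set
// is 12 + 8/N bytes per entry.
constexpr uint32_t tiled_gather_mb(uint32_t n_tiles) {
    return (8 + 4) * kMbPerCapByte + (8 * kMbPerCapByte) / n_tiles;
}
```

With N=1 the tiled formula reduces to the single-shot figure, and each doubling of N halves only the 8/N term, so the savings flatten out toward 12 cap.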

  Cut #2 — T2 sort meta + xbits gathers tiled, deferred re-hydrate.
    Mirror of cut #1 at the T2 sort gather sites, plus a deferred
    re-hydrate so d_t2_meta_sorted (8 cap) and d_t2_xbits_sorted
    (4 cap) don't co-reside with d_merged_vals (4 cap). Both
    accumulators land on host first (h_meta + h_t2_xbits), then
    d_merged_vals is freed, then both sorted streams are re-hydrated
    full-cap on device for T3 match. Gather peak: 5200 → ~3640 MB.
    Re-hydrate peak: ~3120 MB.

  Cut #3 — T3 match section-pair input slicing.
    src/host/GpuPipeline.cu T3 match phase: a new t3_input_slice_path
    branch precedes the existing t3_stage_path. When
    scratch.t3_input_slice_count >= 2, cut #2's deferred re-hydrate
    skips the d_t2_meta_sorted H2D entirely — T2 meta stays parked
    on h_meta. The T3 match phase then:

    1. launch_t3_match_prepare to populate d_offsets in the temp
       storage region.
    2. D2H d_offsets so the host loop can compute section_l /
       section_r row spans. Tiny (17 × 8 = 136 bytes at k=28
       strength=2).
    3. For each section_l ∈ [0, num_sections): compute section_r via
       matching_section_host, look up the row spans, H2D the
       section_l + section_r meta rows from h_meta into a cap/2
       device slice buffer (tightly packed at indices [0, l_count)
       and [l_count, l_count + r_count)), set the kernel biases to
       map global l/r → slice indices, run launch_t3_match_section_
       pair_range over the section_l × num_match_keys bucket sub-
       range, D2H d_t3_stage to a per-plot pinned h_t3_acc accumulator
       at offset t3_count, increment t3_count.
    4. After all section_l: free d_t2_meta_slice + d_t3_stage +
       d_t3_match_temp + d_t2_xbits_sorted + d_t2_keys_merged,
       allocate d_t3 full-cap, H2D from h_t3_acc, free h_t3_acc.

    Per-plot pinned h_t3_acc (cap × T3PairingGpu = cap × u64) is
    necessary because h_meta is in active read-use across the
    section_l loop and can't double as the existing t3_stage_path's
    accumulator. T3 match peak: 5200 → ~3700 MB (cap/2 meta slice
    1040 + cap xbits 1040 + cap keys_merged 1040 + cap/4 t3 stage
    520 + offsets ~80 = ~3720 MB at k=28).
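
One way to picture the bias computation in step 3, as a sketch under stated assumptions. `RowSpan`, `SliceBiases`, `compute_biases`, and `to_slice_index` are hypothetical names, and the convention slice_index = global_index + bias is assumed. The commit only says the biases shift global l/r indices into the sliced buffer and that full-cap callers pass 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical: rows [begin, begin + count) of the full-cap sorted meta.
struct RowSpan { int64_t begin; int64_t count; };

struct SliceBiases {
    int64_t meta_l_index_bias;
    int64_t meta_r_index_bias;
};

// Slice layout from the commit: section_l rows packed at [0, l_count),
// section_r rows packed at [l_count, l_count + r_count).
// Assumed convention: slice_index = global_index + bias.
inline SliceBiases compute_biases(RowSpan l, RowSpan r) {
    return SliceBiases{ -l.begin, l.count - r.begin };
}

inline int64_t to_slice_index(int64_t global_index, int64_t bias) {
    return global_index + bias;
}
```

For example, with section_l rows at [100, 150) and section_r rows at [300, 340), global row 100 maps to slice index 0 and global row 300 maps to slice index 50. A bias of 0 leaves the index unchanged, which is the full-cap case.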

  src/host/BatchPlotter.cpp: minimal tier sets gather_tile_count = 4
    (= num_sections at k=28 strength=2) and t3_input_slice_count =
    num_sections. Dispatch message updated to advertise the layered
    cuts. kMinimalFloorBytes stays 3828 MiB — already matches
    expected peak (~3700 MB) + 128 MB margin.

  README.md: minimal-tier description rewritten to describe the
    three layered cuts, the new bottleneck (T3 match at ~3700 MB),
    and the wider 4-GiB-card target. The b86939f-era "N=8 T2
    staging only" wording was stale after d4f54ae shifted the
    bottleneck.

Verification to run on hardware (an RTX 4090 was the verification host on main):
  - k=22 batch across plain / compact / minimal must produce
    byte-identical .plot2 output (cuts re-shape memory only).
  - k=28 minimal forced under POS2GPU_MAX_VRAM_MB=4096 should
    dispatch minimal and complete; POS2GPU_STREAMING_STATS=1 should
    confirm peak ≤ ~3700 MB.
  - k=28 minimal vs k=28 compact must be byte-identical.

Cuts #4 (T1 match sliced) and #6 (Xs gen+sort+pack tiled) deferred —
they're additive savings on phases that are no longer the
bottleneck after the above three. Cut #4's kernel-side split landed
in the previous commit so the wiring is straightforward when
needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Cuts #1+#2+#3 brought T1 sort, T2 sort, and T3 match below 4 GiB,
but T1 match was unaffected and stayed at ~5280 MB at k=28
(d_xs 2080 + d_t1_meta 2080 + d_t1_mi 1040 + temp ~80) — the new
overall pipeline peak.  Cut #4 closes that gap.

  src/host/GpuPipeline.cu T1 match phase: when scratch.gather_tile_
    count >= 2, gate a tiled_t1_match branch that uses the existing
    launch_t1_match_prepare + launch_t1_match_range plumbing landed
    in commit bca9bf1.  Each section_l pass writes to cap/N device
    staging buffers (cap/N × u64 meta + cap/N × u32 mi), D2H'd per
    pass to scratch.h_meta + a per-plot pinned h_t1_mi accumulator
    at offset t1_count.  After all passes, free stage + d_xs and
    re-hydrate d_t1_mi full-cap from h_t1_mi for the upcoming T1
    sort.  d_t1_meta is never allocated — h_meta already holds the
    unsorted meta when entering T1 sort, so the existing park step
    becomes a no-op (now gated on d_t1_meta != nullptr).

    Peak: d_xs (2080) + cap/N × 12 (stage) + temp ≈ 2940 MB at
    N=4 (= num_sections at k=28 strength=2).  Plain / compact paths
    unchanged.

  src/host/BatchPlotter.cpp: dispatch message updated to advertise
    "N=4 T1-match" alongside the existing "T1/T2 sort gather" and
    "T3 input slicing" cuts.

  README.md: minimal-tier description rewritten as four layered
    cuts (was three) — adds cut #4 and re-orders the "compact's
    tied 5200 MB" summary to include T1 match.

After this commit the cuda-only minimal-tier peak budget at
k=28 strength=2 should be:

  Xs phase    : ~3072 MB (unchanged, no cut #6 yet)
  T1 match    : ~2940 MB (cut #4, was 5280)
  T1 sort     : ~3640 MB (cut #1, was 5200)
  T2 match    : ~3640 MB (existing N=8 staging)
  T2 sort     : ~3640 MB (cut #2, was 5200)
  T3 match    : ~3700 MB (cut #3, was 5200)
  T3 sort     : ~3155 MB (no change needed)

Overall peak: ~3700 MB at T3 match — fits kMinimalFloorBytes
(3828 MiB = ~3700 MB + 128 MB margin) and the 4 GiB-card floor.
Cut #6 (Xs gen+sort+pack tiled) deferred — Xs already under 4 GiB
and not the bottleneck.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Verification on RTX 4090 with --tier minimal forced via
XCHPLOT2_STREAMING=1 surfaced two corrections:

  1. The post-cuts overall peak is 4228 MB at T3 sort phase
     (d_t3 2080 + d_frags_out 2080 + CUB scratch 68), NOT
     ~3700 MB at T3 match as the previous commit's README +
     dispatch message claimed. Cuts #1+#2 reduced T1/T2-sort
     GATHER peaks (sub-phases inside the sort phase) but the
     CUB DeviceRadixSort itself still allocates four cap-sized
     uint32 buffers under DoubleBuffer mode (~4 cap = 4160 MB
     at k=28), so T1/T2/T3 sort phases stay near the 4 GiB
     line. Per-phase peaks measured:

       Xs       : 4136 MB (4 cap u32 + CUB scratch 40)
       T1 sort  : 4180 MB (4 cap u32 + CUB scratch 20)
       T2 sort  : 4170 MB (4 cap u32 + CUB scratch 10)
       T3 match : ~3700 MB (cut #3 working as designed)
       T3 sort  : 4228 MB ← bottleneck

     Compact tier unchanged at 5200 MB peak (no cuts active).
     Drop from 5200 → 4228 (−972 MB / −19%) is real, just less
     than the README claimed.

  2. 4 GiB cards (≤ ~3.5 GiB free post-CUDA-context) still
     don't fit. Closing that gap requires the SYCL-branch's
     cuts #5 (CUB sort output tiling with host accumulators
     across T1/T2/T3 sort) and #6 (Xs gen+sort+pack tiling) —
     not yet ported to cuda-only. The minimal tier in its
     current form fits 5 GiB+ cards (RTX 2060, RX 6600 / 6600
     XT, RX 7600) comfortably with ~1 GiB headroom.

Updates:

  src/host/BatchPlotter.cpp: kMinimalFloorBytes 3828 → 4356
    MiB (= 4228 measured peak + 128 MB margin). Dispatch
    message floor reads as "4.25 GiB floor" instead of the
    overstated "3.74 GiB".

  README.md: minimal-tier description rewritten with measured
    peak (4228 MB), the new bottleneck phase (T3 sort), the
    accurate target hardware (5 GiB+ cards, not 4 GiB), and a
    pointer to cuts #5/#6 as the remaining work for genuine
    4 GiB-card support. Top-of-file streaming-floor summary
    updated 3.8 → 4.25 GiB.

  tools/xchplot2/cli.cpp: --tier help text updated to match.
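
The floor arithmetic in the BatchPlotter.cpp update can be reproduced as follows. This is a sketch, not the project's code: `floor_mib` and `floor_centi_gib` are hypothetical helpers, and the commit's MB and MiB figures are treated as the same unit, as the commit itself does when it adds them.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kMarginMb = 128;  // margin quoted in the commit

// Floor = peak + margin.
constexpr uint32_t floor_mib(uint32_t peak_mb) {
    return peak_mb + kMarginMb;
}

// Floor in hundredths of a GiB, rounded to nearest, for the dispatch message.
constexpr uint32_t floor_centi_gib(uint32_t mib) {
    return (mib * 100 + 512) / 1024;
}
```

The measured 4228 MB peak gives the new 4356 MiB floor (4.25 GiB), and the earlier ~3700 MB estimate gives the old 3828 MiB floor (3.74 GiB).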

Verified byte-identical at k=22 across plain / compact / minimal
(sha256 17dbf594…) and at k=28 across compact / minimal
(sha256 f42e62ad…). Plain pool and compact streaming paths
unchanged by this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Replaces the four cap-sized uint32/uint64 buffers that CUB
DeviceRadixSort needs in DoubleBuffer mode with cap/N per-tile
buffers + host accumulators across all three sort phases. Drops
each phase peak from ~4200 MB to 3640 MB at k=28 by paying ~7 sec
of host CPU merge time per plot (k=28 minimal: 12 → 22 s/plot).

Each tile's CUB sort lands either in the input slice (d_*_mi /
d_t3) or the cap/N alternate buffer; whichever side it lands on,
we D2H to a host pinned accumulator at the matching offset. After
all tiles, we free the per-tile device buffers and the input
buffer, then run a tree of pairwise stable in-place merges on
host (std::inplace_merge for keys-only; a hand-rolled
paired_merge_t* for the pairs cases). The result is a globally
sorted run that we H2D back to the output buffer that downstream
consumers expect.
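
A host-only sketch of the keys-only case, assuming std::sort as a stand-in for the per-tile CUB sort and leaving out all D2H/H2D traffic. `tiled_sort_then_merge` is a hypothetical name. The tree of pairwise std::inplace_merge calls is the part the commit describes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort each of n_tiles contiguous tiles independently, then merge sorted
// runs pairwise, doubling the run width each round, until a single globally
// sorted run remains.
inline void tiled_sort_then_merge(std::vector<uint32_t>& keys,
                                  std::size_t n_tiles) {
    const std::size_t n = keys.size();
    if (n == 0 || n_tiles == 0) return;
    const std::size_t tile = (n + n_tiles - 1) / n_tiles;  // ceil(n / n_tiles)

    // Per-tile sort (stand-in for the cap/N CUB sort on device).
    for (std::size_t b = 0; b < n; b += tile) {
        const std::size_t e = std::min(b + tile, n);
        std::sort(keys.begin() + b, keys.begin() + e);
    }

    // Tree of pairwise in-place merges on the host accumulator.
    for (std::size_t width = tile; width < n; width *= 2) {
        for (std::size_t b = 0; b + width < n; b += 2 * width) {
            const std::size_t mid = b + width;
            const std::size_t e = std::min(b + 2 * width, n);
            std::inplace_merge(keys.begin() + b, keys.begin() + mid,
                               keys.begin() + e);
        }
    }
}
```

With N tiles the merge tree has about log2(N) rounds, each touching the whole accumulator once, which is where the extra host CPU time per plot comes from.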

  T1 sort: scratch.h_keys_merged + scratch.h_t2_xbits as accumulators.
    h_keys_merged was already going to receive the T1 sorted-mi park
    after the device-side merge — cut #5 just writes it directly,
    skipping the round-trip. h_t2_xbits is dead at T1 sort time
    (T2 match staging hasn't filled it yet) so it doubles as the
    T1 vals accumulator. Final H2D rehydrates d_t1_merged_vals
    from h_t2_xbits for the cut #1 gather phase. d_t1_keys_merged
    stays null — h_keys_merged is already the parked form. Per-
    phase peak: 4180 → 3640 MB.

  T2 sort: scratch.h_keys_merged + a per-plot pinned h_t2_sort_vals
    (cap × u32, freed at end of phase). h_t2_xbits is NOT reused
    for T2 sort — cut #2's xbits gather still reads h_t2_xbits as
    the parked unsorted xbits stream, so an in-place reuse would
    corrupt that data. Mirror of T1 sort otherwise. Per-phase peak:
    4170 → 3640 MB.

  T3 sort: scratch.h_meta as the keys-only accumulator. h_meta's
    cut #3 lifetime as parked T2 meta ends at the H2D-back step
    that cut #3 emits before T3 sort entry, so it's reusable.
    SortKeys (no vals) → std::inplace_merge for the host merge
    step. Per-phase peak: 4228 → 3640 MB.

Plus a small init_u32_identity_offset kernel — the cap/N tile
sort needs its vals_in seeded with global positions
[tile_start..tile_end) so the post-merge d_merged_vals stream
indexes directly into the cap-sized d_t*_meta / d_t*_xbits.
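
Why the offset matters can be shown with a small host-side sketch. `identity_offset` and `sorted_positions` are hypothetical names, std::stable_sort stands in for the per-tile CUB pairs sort, and only two tiles are handled for brevity. The seeding of vals with global positions is the behavior the commit attributes to init_u32_identity_offset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Host equivalent of seeding a tile's vals with its global positions
// [tile_start, tile_start + count).
inline std::vector<uint32_t> identity_offset(uint32_t tile_start,
                                             std::size_t count) {
    std::vector<uint32_t> vals(count);
    std::iota(vals.begin(), vals.end(), tile_start);
    return vals;
}

// Sort two tiles of keys with identity-offset vals, merge the (key, val)
// runs, and return the merged vals. Requires split <= keys.size().
inline std::vector<uint32_t> sorted_positions(const std::vector<uint32_t>& keys,
                                              std::size_t split) {
    typedef std::pair<uint32_t, uint32_t> KV;
    const std::vector<uint32_t> lo = identity_offset(0, split);
    const std::vector<uint32_t> hi =
        identity_offset(static_cast<uint32_t>(split), keys.size() - split);

    std::vector<KV> kv(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i) {
        const uint32_t pos = (i < split) ? lo[i] : hi[i - split];
        kv[i] = std::make_pair(keys[i], pos);
    }

    auto by_key = [](const KV& a, const KV& b) { return a.first < b.first; };
    std::stable_sort(kv.begin(), kv.begin() + split, by_key);  // tile 0
    std::stable_sort(kv.begin() + split, kv.end(), by_key);    // tile 1
    std::inplace_merge(kv.begin(), kv.begin() + split, kv.end(), by_key);

    std::vector<uint32_t> vals(kv.size());
    for (std::size_t i = 0; i < kv.size(); ++i) vals[i] = kv[i].second;
    return vals;
}
```

Because the second tile's vals start at `split` rather than 0, the merged vals can be used directly as gather indices into the full-size array. Seeding every tile from 0 would make the second tile's entries point at the first tile's rows.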

Verification (RTX 4090 at k=22 + k=28 strength=2):
  - k=22 plain / compact / minimal byte-identical (sha256 17dbf594…).
  - k=28 minimal byte-identical with k=28 compact (sha256 f42e62ad…).
  - k=28 minimal peak 4228 → 4136 MB (-92 MB; cut #5 saves on
    each sort phase but cap-sized Xs gen+sort+pack is the new
    overall bottleneck — cut #6 closes that gap).
  - Compact / plain paths unchanged (the new tile path is gated
    on scratch.gather_tile_count >= 2 + the per-tier scratch
    pinned slots being populated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jsewill pushed a commit that referenced this pull request May 6, 2026
…iveCpp

Three layered install paths so users can pick the friction they want:

  1. Containerfile (podman-first, also docker). Build args select the
     base image: nvidia/cuda for CUB+SYCL, rocm/dev-ubuntu for AMD,
     intel/oneapi for Intel (experimental). All variants build
     AdaptiveCpp 25.10 from source inside the image and ship a slim
     runtime stage. ~15-30 min first build, layer-cached after.

  2. scripts/install-deps.sh — distro-aware native bootstrap covering
     Arch, Ubuntu/Debian, and Fedora families. Detects GPU vendor via
     nvidia-smi/rocminfo and installs the right toolchain (full CUDA
     for NVIDIA, CUDA *headers* + ROCm for AMD), then builds
     AdaptiveCpp into /opt/adaptivecpp. --no-acpp opts out and lets
     CMake fetch it.

  3. CMake FetchContent fallback. find_package(AdaptiveCpp QUIET)
     followed by FetchContent_Declare at v25.10.0 with
     FetchContent_MakeAvailable when the local lookup fails. Opt-in
     option XCHPLOT2_FETCH_ADAPTIVECPP=ON (default ON). The
     add_sycl_to_target macro is verified after the fetch — if
     AdaptiveCpp doesn't expose it as a subproject we error with a
     pointer to the manual install.

build.rs also now reads $XCHPLOT2_BUILD_CUDA so the AMD/Intel container
builds can flip XCHPLOT2_BUILD_CUDA=OFF without touching CMake invocation.

README's Build section restructured into three clearly-labeled paths
with the full dependency table moved into path #3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>