Add loop virtualization to TwoPass scan stage 2 #2428
Agent-Logs-Url: https://github.com/diku-dk/futhark/sessions/2be13663-35fc-4858-9f57-0a302c2aa1ca Co-authored-by: WilliamDue <79925124+WilliamDue@users.noreply.github.com>
…runtime hacks Agent-Logs-Url: https://github.com/diku-dk/futhark/sessions/10f86a4a-2f73-4800-95a3-8e7476c963b4 Co-authored-by: WilliamDue <79925124+WilliamDue@users.noreply.github.com>
… directly Agent-Logs-Url: https://github.com/diku-dk/futhark/sessions/b0c6e6d6-9973-480e-b85e-05bfc3e52dfb Co-authored-by: WilliamDue <79925124+WilliamDue@users.noreply.github.com>
@copilot Look at the GitHub Actions runs so far; some of the tests fail.
…rtualization-two-pass-scan Co-authored-by: WilliamDue <79925124+WilliamDue@users.noreply.github.com>
…y SegScan post-ops

Master commit 66d68ea ("Short circuiting: hacky workaround for SegScan") added an early return `pure bu_env` for SegScan with non-identity post-ops, but left `activeCoals` non-empty, causing a BROKEN INV crash.

Fix: mark all active coalescing entries as failed before returning, so the `activeCoals` table is empty and the invariant is maintained.

Agent-Logs-Url: https://github.com/diku-dk/futhark/sessions/333c7112-91c5-4b22-bcbf-487be81e6c9d
Co-authored-by: WilliamDue <79925124+WilliamDue@users.noreply.github.com>
Investigated. The failures were caused by master commit Fixed in
@copilot You have changed hip.h and opencl.h, why was it not necessary to change cuda.h?
…rtualization-two-pass-scan Co-authored-by: WilliamDue <79925124+WilliamDue@users.noreply.github.com>
…piled, tblock=4), scan12 (compiled, tblock=8) Agent-Logs-Url: https://github.com/diku-dk/futhark/sessions/7a06d382-a0a3-4d57-9320-b452adaf1b91 Co-authored-by: WilliamDue <79925124+WilliamDue@users.noreply.github.com>
@athas is it correctly understood that you see no reason to have tests with tuning parameters, and that I should just remove them?
Get rid of them.
WilliamDue
left a comment
Please get rid of all the tests you have added.
Agent-Logs-Url: https://github.com/diku-dk/futhark/sessions/b0604a3e-9137-4ebe-823f-7db1807e2c55 Co-authored-by: WilliamDue <79925124+WilliamDue@users.noreply.github.com>
Done in 02e76a5 — all scan10, scan11, scan12 test files and their
Removed in 02e76a5.
WilliamDue
left a comment
Looks correct; we have to wait and see how the benchmarks perform.
@copilot there is something wrong with your implementation. Look at the 100MI OpenCL benchmarks in CI: multiple benchmarks using scans fail.
@copilot It seems the abominable intelligence is unable to actually find the log with the error. It is this one: https://github.com/diku-dk/futhark/actions/runs/24935927112/job/73022634776?pr=2428 And the pertinent snippet is this:
I investigated the stage-2 loop virtualization implementation thoroughly but was unable to identify the specific root cause from static analysis alone. The algorithm appears correct for both scalar and segmented scans across multiple chunks, but I ran out of time before I could access the actual benchmark failure logs to pinpoint the issue. The most suspicious area is the I was not able to produce a fix in the available time. The implementation needs further review and likely a targeted test case that reproduces the failure with a real 100M-element segmented scan on OpenCL. |
…gation

The crosses_seg2 check was using an element-level crossesSegment check (next_chunk_start * epg - 1, next_chunk_start * epg), which tests if two consecutive *elements* across the chunk boundary are in different segments. For arrays where epg >> segment_size (e.g. scan-weird.fut with segment_size=1000 and epg~195584), this check is almost never true, so the inter-chunk carry was always propagated, even when consecutive stage-1 *blocks* are in different segments.

The correct check is block-level: crossesSegment(next*epg - 1, (next+1)*epg - 1), which is true whenever consecutive stage-1 blocks have their last elements in different segments. For epg >> segment_size this is almost always true, correctly preventing carry propagation. For segment_size > epg (large segments spanning multiple blocks) it is false, correctly propagating the carry.

This was causing wrong results in the 100MI OpenCL benchmarks for segmented scans (e.g. scan-weird.fut with [100000000]i32 inputs).

Agent-Logs-Url: https://github.com/diku-dk/futhark/sessions/4be66d54-0311-49cf-a43b-9772d63f5838
Co-authored-by: athas <55833+athas@users.noreply.github.com>
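To make the difference between the two checks concrete, here is a minimal Python sketch. It is not the compiler's code: the definition of `crosses_segment` (true when `to - frm > to % segment_size`) is an assumption about how a regular-segment crossing test behaves, and the function and parameter names are hypothetical. Only the index expressions are taken from the commit message above.

```python
def crosses_segment(frm, to, segment_size):
    """True when indices frm <= to lie in different regular segments.

    Assumed definition: `to % segment_size` is the offset of `to` inside
    its segment, so `frm` is in an earlier segment exactly when the
    distance between the two indices exceeds that offset.
    """
    return (to - frm) > (to % segment_size)


def element_level_check(next_block, epg, segment_size):
    # Old check: the two adjacent elements straddling the block boundary.
    boundary = next_block * epg
    return crosses_segment(boundary - 1, boundary, segment_size)


def block_level_check(next_block, epg, segment_size):
    # New check: the last elements of two consecutive stage-1 blocks.
    last_prev = next_block * epg - 1
    last_next = (next_block + 1) * epg - 1
    return crosses_segment(last_prev, last_next, segment_size)


if __name__ == "__main__":
    epg = 195584
    # Small segments: the old check misses the crossing, the new one sees it.
    print(element_level_check(1, epg, 1000), block_level_check(1, epg, 1000))
    # Large segments: the carry may propagate until a real boundary is hit.
    print(block_level_check(1, epg, 1_000_000), block_level_check(5, epg, 1_000_000))
```

With `segment_size=1000` the element-level test only fires when a block boundary happens to coincide with a segment boundary, whereas the block-level test fires for every block, because each block spans many segments.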
Fixed in 05e15e0. The bug was in the The fix changes it to a block-level check:
@copilot The problem is likely related to scans with array-typed operands.
For scalar scans (array_scan = False) with non-trivial vec_shape (e.g. scan_vector which scans with a map2-vectorised operator), the scalar carry in xParams was incorrectly shared across vec_i iterations within the same chunk. The carry from vec_i=0 polluted vec_i=1, vec_i=2, etc.

Fix: introduce use_global_carry = array_scan || not (null (shapeDims vec_shape)). When this flag is set, use a FenceGlobal barrier at the start of each chunk and reload carries per-vec-i from global memory (same as the existing array_scan path). The scalar register carry (xParams) is only used when vec_shape is trivial (no vec loop), where there is exactly one carry value per chunk.

This fixes the scan_vector / scan_arr benchmark failures on OpenCL.

Agent-Logs-Url: https://github.com/diku-dk/futhark/sessions/6104e0b2-762d-4dcf-b0de-c3e7558428dd
Co-authored-by: athas <55833+athas@users.noreply.github.com>
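The carry-sharing problem can be reproduced in a small sequential model. The following Python sketch is an illustration under stated assumptions, not the generated GPU code: `stage2_scan`, `block_sums`, and `vec_dims` are hypothetical names, the operator is fixed to addition, and a plain list stands in for carries reloaded from global memory. Only the `use_global_carry` condition mirrors the expression in the commit message.

```python
def use_global_carry(array_scan, vec_dims):
    # Mirrors: array_scan || not (null (shapeDims vec_shape)).
    # `vec_dims` is a hypothetical stand-in for the dimensions of vec_shape.
    return array_scan or len(vec_dims) > 0


def stage2_scan(block_sums, chunk_size, per_lane_carry):
    """Inclusive +-scan of block_sums[vec_i][block], processed in chunks.

    per_lane_carry=True keeps one carry per vec_i (the fixed behaviour);
    per_lane_carry=False reuses a single carry for every vec_i (the bug).
    """
    n_lanes = len(block_sums)
    n_blocks = len(block_sums[0])
    out = [[0] * n_blocks for _ in range(n_lanes)]
    carries = [0] * n_lanes  # one carry per vec_i, as if reloaded from global memory
    shared = 0               # single register-like carry shared by all vec_i
    for start in range(0, n_blocks, chunk_size):
        for vec_i in range(n_lanes):
            acc = carries[vec_i] if per_lane_carry else shared
            for b in range(start, min(start + chunk_size, n_blocks)):
                acc += block_sums[vec_i][b]
                out[vec_i][b] = acc
            if per_lane_carry:
                carries[vec_i] = acc
            else:
                shared = acc
    return out


if __name__ == "__main__":
    sums = [[1, 1, 1, 1], [10, 10, 10, 10]]
    flag = use_global_carry(array_scan=False, vec_dims=[2])
    print(stage2_scan(sums, chunk_size=2, per_lane_carry=flag))
    print(stage2_scan(sums, chunk_size=2, per_lane_carry=False))
```

With a single shared carry, the running total of one `vec_i` leaks into the next, which is the pollution described above. When `vec_dims` is empty there is only one lane, so both variants agree and the cheaper register carry is safe.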
Fixed in 70489b5. The root cause was that The fix introduces
sMin64 cap on scan_stage2_tblock_size; use tuned/default value directly
hip.h and opencl.h