fix(assembly): fix multiple critical bugs in AVX2 batch processing#26
Conversation
Fixes three major bugs in the AVX2 assembly that caused test failures:
1. Fixed prime64 constant symbol visibility
- Removed '<>' suffix from prime64_1-5 declarations to make them
accessible across assembly files
- Fixed "relocation target not defined" linker errors
2. Fixed incorrect fingerprint hash calculation for i2 index
- Was doing: hash = prime64_5 + 1 + fp (wrong!)
- Should be: hash ^= fp * prime64_5, then rotate and multiply
- Matches the Go reference implementation in xxhash.go
- Fixes incorrect i2 values in test results
3. Fixed off-by-one error in SIMD loop batch size check
- Was checking SI-3 >= 0, allowing 3-item batches into 4-way SIMD
- Now checks SI-4 >= 0, correctly requiring at least 4 items
- Prevents crashes from reading uninitialized item pointers
Test results:
- 1-item batches: ✓ PASS
- 2-item batches: ✓ PASS
- 3-item batches: ✓ PASS
- 4-item batches: Still failing with nil pointer (separate issue)
There was a problem hiding this comment.
Pull Request Overview
This PR attempts to fix three critical bugs in AVX2 batch processing, but introduces new issues in the implementation.
Summary: The PR fixes symbol visibility for scalar prime64 constants, corrects fingerprint hash calculation logic to match the XXHash64 reference implementation, and fixes an off-by-one error in the SIMD batch size check. However, the changes to AVX2 vector constants introduce multiple critical bugs that will prevent the code from assembling.
Key Changes:
- Removed
<>suffix from scalar prime64 constants in xxhash_amd64.s to enable cross-file sharing - Fixed fingerprint hash calculation to use proper XXHash64 algorithm (XOR with fp*prime64_5, rotate, multiply)
- Changed SIMD loop condition from
SI-3toSI-4to require minimum 4 items for 4-way parallelism
Critical Issues Found:
- AVX2 vector constants have mismatched DATA/GLOBL declarations (DATA uses
<>, GLOBL doesn't) - AVX2 vector constant references are missing required
<>suffix - Missing NOPTR flags on AVX2 vector constant GLOBL declarations
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| internal/hash/xxhash/xxhash_amd64.s | Correctly removes <> suffix from scalar prime64 constants to make them globally accessible, and adds NOPTR flags |
| internal/hash/xxhash/batch_avx2_amd64.s | Fixes fingerprint calculation logic and batch size check, but introduces critical bugs in AVX2 constant declarations and references |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| DATA avx2_prime64_1<>+16(SB)/8, $11400714785074694791 | ||
| DATA avx2_prime64_1<>+24(SB)/8, $11400714785074694791 | ||
| GLOBL avx2_prime64_1<>(SB), RODATA, $32 | ||
| GLOBL avx2_prime64_1(SB), RODATA, $32 |
There was a problem hiding this comment.
Missing NOPTR flag in GLOBL declaration. Since this is a constant data array containing only numeric values (no pointers), it should include the NOPTR flag to inform the garbage collector it doesn't need to scan this memory.
Change to: GLOBL avx2_prime64_1<>(SB), RODATA|NOPTR, $32
This matches the pattern used in xxhash_amd64.s for the scalar constants.
| GLOBL avx2_prime64_1(SB), RODATA, $32 | |
| GLOBL avx2_prime64_1(SB), RODATA|NOPTR, $32 |
|
|
||
| // hash + prime64_4 | ||
| VMOVDQU avx2_prime64_4<>(SB), Y3 | ||
| VMOVDQU avx2_prime64_4(SB), Y3 |
There was a problem hiding this comment.
Incorrect symbol reference: should use avx2_prime64_4<>(SB) with the <> suffix since this is a file-local symbol.
Change to: VMOVDQU avx2_prime64_4<>(SB), Y3
| VMOVDQU avx2_prime64_4(SB), Y3 | |
| VMOVDQU avx2_prime64_4<>(SB), Y3 |
| DATA avx2_prime64_1<>+16(SB)/8, $11400714785074694791 | ||
| DATA avx2_prime64_1<>+24(SB)/8, $11400714785074694791 | ||
| GLOBL avx2_prime64_1<>(SB), RODATA, $32 | ||
| GLOBL avx2_prime64_1(SB), RODATA, $32 |
There was a problem hiding this comment.
Inconsistent symbol names: GLOBL declarations use avx2_prime64_1(SB) without the <> suffix, but the corresponding DATA declarations still use avx2_prime64_1<>. This mismatch will cause assembly errors.
The DATA and GLOBL declarations must use the same symbol name. Either both should use <> (file-local) or neither should (global). Since these are file-local constants, both should keep the <> suffix:
DATA avx2_prime64_1<>+...GLOBL avx2_prime64_1<>(SB), ...
| DATA avx2_prime64_4<>+16(SB)/8, $9650029242287828579 | ||
| DATA avx2_prime64_4<>+24(SB)/8, $9650029242287828579 | ||
| GLOBL avx2_prime64_4<>(SB), RODATA, $32 | ||
| GLOBL avx2_prime64_4(SB), RODATA, $32 |
There was a problem hiding this comment.
Inconsistent symbol names: GLOBL declaration uses avx2_prime64_4(SB) without the <> suffix, but the corresponding DATA declarations still use avx2_prime64_4<>. This mismatch will cause assembly errors.
The DATA and GLOBL declarations must use the same symbol name. Both should keep the <> suffix for file-local constants.
| GLOBL avx2_prime64_4(SB), RODATA, $32 | |
| GLOBL avx2_prime64_4<>(SB), RODATA, $32 |
| DATA avx2_prime64_3<>+16(SB)/8, $1609587929392839161 | ||
| DATA avx2_prime64_3<>+24(SB)/8, $1609587929392839161 | ||
| GLOBL avx2_prime64_3<>(SB), RODATA, $32 | ||
| GLOBL avx2_prime64_3(SB), RODATA, $32 |
There was a problem hiding this comment.
Missing NOPTR flag in GLOBL declaration. Since this is a constant data array containing only numeric values (no pointers), it should include the NOPTR flag.
Change to: GLOBL avx2_prime64_3<>(SB), RODATA|NOPTR, $32
| GLOBL avx2_prime64_3(SB), RODATA, $32 | |
| GLOBL avx2_prime64_3(SB), RODATA|NOPTR, $32 |
| DATA avx2_prime64_5<>+16(SB)/8, $2870177450012600261 | ||
| DATA avx2_prime64_5<>+24(SB)/8, $2870177450012600261 | ||
| GLOBL avx2_prime64_5<>(SB), RODATA, $32 | ||
| GLOBL avx2_prime64_5(SB), RODATA, $32 |
There was a problem hiding this comment.
Inconsistent symbol names: GLOBL declaration uses avx2_prime64_5(SB) without the <> suffix, but the corresponding DATA declarations still use avx2_prime64_5<>. This mismatch will cause assembly errors.
The DATA and GLOBL declarations must use the same symbol name. Both should keep the <> suffix for file-local constants.
| GLOBL avx2_prime64_5(SB), RODATA, $32 | |
| GLOBL avx2_prime64_5<>(SB), RODATA, $32 |
| DATA avx2_prime64_5<>+16(SB)/8, $2870177450012600261 | ||
| DATA avx2_prime64_5<>+24(SB)/8, $2870177450012600261 | ||
| GLOBL avx2_prime64_5<>(SB), RODATA, $32 | ||
| GLOBL avx2_prime64_5(SB), RODATA, $32 |
There was a problem hiding this comment.
Missing NOPTR flag in GLOBL declaration. Since this is a constant data array containing only numeric values (no pointers), it should include the NOPTR flag.
Change to: GLOBL avx2_prime64_5<>(SB), RODATA|NOPTR, $32
| GLOBL avx2_prime64_5(SB), RODATA, $32 | |
| GLOBL avx2_prime64_5<>(SB), RODATA|NOPTR, $32 |
|
|
||
| // Load constants into YMM registers | ||
| VMOVDQU avx2_prime64_5<>(SB), Y0 // Y0 = prime64_5 (4x) | ||
| VMOVDQU avx2_prime64_5(SB), Y0 // Y0 = prime64_5 (4x) |
There was a problem hiding this comment.
Incorrect symbol reference: should use avx2_prime64_5<>(SB) with the <> suffix since this is a file-local symbol. The GLOBL declaration for this constant (line 38) uses the file-local syntax, so all references must match.
Change to: VMOVDQU avx2_prime64_5<>(SB), Y0
| VMOVDQU avx2_prime64_5(SB), Y0 // Y0 = prime64_5 (4x) | |
| VMOVDQU avx2_prime64_5<>(SB), Y0 // Y0 = prime64_5 (4x) |
|
|
||
| // k *= prime64_1 | ||
| VMOVDQU avx2_prime64_1<>(SB), Y3 | ||
| VMOVDQU avx2_prime64_1(SB), Y3 |
There was a problem hiding this comment.
Incorrect symbol reference: should use avx2_prime64_1<>(SB) with the <> suffix since this is a file-local symbol.
Change to: VMOVDQU avx2_prime64_1<>(SB), Y3
| VMOVDQU avx2_prime64_1(SB), Y3 | |
| VMOVDQU avx2_prime64_1<>(SB), Y3 |
| GLOBL prime64_3<>(SB), RODATA, $8 | ||
| GLOBL prime64_4<>(SB), RODATA, $8 | ||
| GLOBL prime64_5<>(SB), RODATA, $8 | ||
| // Note: These constants are shared with batch_avx2_amd64.s, so they cannot be file-private |
There was a problem hiding this comment.
Misleading comment: The comment states these constants are "shared with batch_avx2_amd64.s", but that file actually defines its own separate AVX2-specific vector constants (avx2_prime64_1 through avx2_prime64_5) that are file-local.
These scalar constants (prime64_1 through prime64_5) are used by batch_avx2_amd64.s only for scalar operations (IMULQ instructions), not for the AVX2 vector operations.
Suggested clarification: "Note: These constants are used for scalar operations in both files. The batch_avx2_amd64.s file defines separate AVX2 vector constants for SIMD operations."
| // Note: These constants are shared with batch_avx2_amd64.s, so they cannot be file-private | |
| // Note: These constants are used for scalar operations in both files. The batch_avx2_amd64.s file defines separate AVX2 vector constants for SIMD operations. |
Fixed inconsistency in symbol declarations where DATA directives used '<>' suffix but GLOBL and references didn't match. In Go assembly, file-local symbols must use '<>' consistently in all three places: - DATA declarations: symbol<> - GLOBL declarations: symbol<> - References in code: symbol<> This fixes the linker errors for AVX2 vector constants. Note: The 4-item SIMD crash persists - this was a separate symbol consistency issue that was preventing compilation.
Add RODATA|NOPTR flags to all AVX2 constant GLOBL declarations. Since these constants contain only numeric values (no pointers), the NOPTR flag tells the garbage collector to skip scanning them.
This commit makes progress on fixing the AMD64 AVX2 batch processing but identifies a critical issue that requires further investigation. **What's Fixed:** 1. Chunk offset handling: Changed simd_remainder to save the actual chunk loop offset (CX) instead of incorrectly loading min_length. This fixes the bug where items <8 bytes weren't being hashed at all. 2. Scalar item loading: Replaced complex offset calculations with explicit conditional jumps to load item data/length/hash for cleaner debugging. 3. Slice addressing: Simplified item pointer loading by using explicit address calculation (ADDQ DI, BX) instead of complex indexed addressing. 4. Stack alignment: Increased stack frame from 136 to 144 bytes for proper 16-byte alignment required by AVX2 operations. **Critical Issue - AVX2 Path Temporarily Disabled:** The AVX2 SIMD code path crashes when trying to load item data pointers from the stack. The crash occurs at line 180 when loading from stack offsets 32(SP), 48(SP), 64(SP), 80(SP). Instead of the valid pointers stored earlier, garbage values (e.g., 0xc0000f8000) are loaded, causing access violations. Possible causes under investigation: - Stack frame corruption between store and load operations - Incorrect stack addressing with NOSPLIT on Windows - AVX2 SIMD operations inadvertently corrupting stack memory - Issue with how slice data is loaded from the items parameter The scalar fallback path works correctly and all tests pass with it enabled. For now, the SIMD path is disabled (JMP scalar_loop) until this can be properly debugged with a native debugger. **Testing:** - All tests pass with scalar path: ✅ - Tests 1-3 items: ✅ PASS (uses scalar) - Tests 4+ items: ✅ PASS (forced to scalar) - Scalar path produces correct hash values matching reference implementation
Fixes three major bugs in the AVX2 assembly that caused test failures:
Fixed prime64 constant symbol visibility
Fixed incorrect fingerprint hash calculation for i2 index
Fixed off-by-one error in SIMD loop batch size check
Test results: