Simd optimization 16bit#27
Conversation
Fixes three major bugs in the AVX2 assembly that caused test failures:
1. Fixed prime64 constant symbol visibility
- Removed '<>' suffix from prime64_1-5 declarations to make them
accessible across assembly files
- Fixed "relocation target not defined" linker errors
2. Fixed incorrect fingerprint hash calculation for i2 index
- Was doing: hash = prime64_5 + 1 + fp (wrong!)
- Should be: hash ^= fp * prime64_5, then rotate and multiply
- Matches the Go reference implementation in xxhash.go
- Fixes incorrect i2 values in test results
3. Fixed off-by-one error in SIMD loop batch size check
- Was checking SI-3 >= 0, allowing 3-item batches into 4-way SIMD
- Now checks SI-4 >= 0, correctly requiring at least 4 items
- Prevents crashes from reading uninitialized item pointers
Test results:
- 1-item batches: ✓ PASS
- 2-item batches: ✓ PASS
- 3-item batches: ✓ PASS
- 4-item batches: Still failing with nil pointer (separate issue)
Fixed inconsistency in symbol declarations where DATA directives used '<>' suffix but GLOBL and references didn't match. In Go assembly, file-local symbols must use '<>' consistently in all three places: - DATA declarations: symbol<> - GLOBL declarations: symbol<> - References in code: symbol<> This fixes the linker errors for AVX2 vector constants. Note: The 4-item SIMD crash persists - this was a separate symbol consistency issue that was preventing compilation.
Add RODATA|NOPTR flags to all AVX2 constant GLOBL declarations. Since these constants contain only numeric values (no pointers), the NOPTR flag tells the garbage collector to skip scanning them.
This commit makes progress on fixing the AMD64 AVX2 batch processing but identifies a critical issue that requires further investigation. **What's Fixed:** 1. Chunk offset handling: Changed simd_remainder to save the actual chunk loop offset (CX) instead of incorrectly loading min_length. This fixes the bug where items <8 bytes weren't being hashed at all. 2. Scalar item loading: Replaced complex offset calculations with explicit conditional jumps to load item data/length/hash for cleaner debugging. 3. Slice addressing: Simplified item pointer loading by using explicit address calculation (ADDQ DI, BX) instead of complex indexed addressing. 4. Stack alignment: Increased stack frame from 136 to 144 bytes for proper 16-byte alignment required by AVX2 operations. **Critical Issue - AVX2 Path Temporarily Disabled:** The AVX2 SIMD code path crashes when trying to load item data pointers from the stack. The crash occurs at line 180 when loading from stack offsets 32(SP), 48(SP), 64(SP), 80(SP). Instead of the valid pointers stored earlier, garbage values (e.g., 0xc0000f8000) are loaded, causing access violations. Possible causes under investigation: - Stack frame corruption between store and load operations - Incorrect stack addressing with NOSPLIT on Windows - AVX2 SIMD operations inadvertently corrupting stack memory - Issue with how slice data is loaded from the items parameter The scalar fallback path works correctly and all tests pass with it enabled. For now, the SIMD path is disabled (JMP scalar_loop) until this can be properly debugged with a native debugger. **Testing:** - All tests pass with scalar path: ✅ - Tests 1-3 items: ✅ PASS (uses scalar) - Tests 4+ items: ✅ PASS (forced to scalar) - Scalar path produces correct hash values matching reference implementation
This commit fixes a SIGSEGV crash and hash correctness issues in the AVX2 batch processing path. Key changes: - Defined local scalar constants to avoid linkage issues. - Unrolled the finalize loop to process 4 items sequentially. - Implemented full scalar fallback logic (including 8-byte chunks) within the unrolled loop. - Bypassed stack corruption by recalculating initial hashes and reloading base pointers.
There was a problem hiding this comment.
Pull request overview
This PR upgrades the cuckoo filter to support 16-bit fingerprints (up from 8-bit), enabling lower false positive rates at the cost of increased memory usage. The change involves updating fingerprint types from byte to uint16 throughout the codebase and adapting SIMD operations accordingly.
Key Changes:
- Fingerprint size limit increased from 8 bits to 16 bits in validation and documentation
- Core fingerprint type changed from
bytetouint16across all hash implementations and bucket operations - AVX2 batch hash processing temporarily disabled with scalar fallback while 16-bit support is in progress
- New AVX2 assembly implementation for 16-bit fingerprint bucket lookups on AMD64
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| options.go | Updated validation to allow 1-16 bit fingerprints instead of 1-8 bits |
| internal/hash/types/types.go | Changed HashResult.Fp from byte to uint16 |
| internal/hash/hash.go | Updated HashInterface signatures to use uint16 for fingerprints |
| internal/hash/xxhash/xxhash.go | Modified GetIndices and GetAltIndex to use uint16 fingerprints with simplified alt-index calculation |
| internal/hash/xxhash/xxhash_amd64.s | Updated prime constant symbols to remove <> suffix and add NOPTR flag |
| internal/hash/xxhash/batch_amd64.go | Disabled AVX2 batch processing, using scalar fallback for 16-bit support |
| internal/hash/xxhash/batch_avx2_amd64.s | Extensive modifications for 16-bit support (currently unused due to fallback) |
| internal/hash/fnv/hash_fnv.go | Updated to use uint16 fingerprints with 2-byte buffer when needed |
| internal/hash/crc32/crc32.go | Updated to use uint16 fingerprints with conditional 1 or 2-byte hashing |
| internal/bucket/bucket.go | Changed bucket fingerprint storage from []byte to []uint16 |
| internal/bucket/bucket_simd_amd64.go | Updated to call new AVX2 assembly for 16-bit fingerprint lookups |
| internal/bucket/bucket_avx2_amd64.s | New AVX2 assembly for efficient 16-bit fingerprint comparison |
| internal/bucket/bucket_simd_arm64.go | Disabled NEON assembly, using scalar fallback for 16-bit support |
| internal/filter/filter.go | Updated relocate signature to accept uint16 fingerprints |
| Various test files | Updated test cases to use uint16 instead of byte for fingerprints |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| // Process 8-byte chunks for all 4 items in parallel | ||
| XORQ CX, CX // CX = offset within data | ||
| // JMP simd_remainder // DEBUG: Skip chunk loop |
There was a problem hiding this comment.
Commented-out debug code // JMP simd_remainder // DEBUG: Skip chunk loop left in the codebase. Remove commented debug code before merging to maintain code cleanliness.
| // JMP simd_remainder // DEBUG: Skip chunk loop |
| /* | ||
| //go:noescape | ||
| func processBatchXXHashAVX2(items [][]byte, results []types.HashResult, fingerprintBits, numBuckets uint) | ||
| */ |
There was a problem hiding this comment.
[nitpick] The AVX2 implementation is completely disabled in favor of a scalar fallback, but the assembly code in batch_avx2_amd64.s still exists and has been extensively modified. This creates maintenance burden and confusion. Consider either:
- Fully implementing 16-bit support in AVX2 assembly and re-enabling it
- Removing the unused assembly file if AVX2 support is not planned
The current state with 800+ lines of modified but unused assembly code is problematic for maintainability.
| /* | |
| //go:noescape | |
| func processBatchXXHashAVX2(items [][]byte, results []types.HashResult, fingerprintBits, numBuckets uint) | |
| */ |
| DATA avx2_shift_37<>+16(SB)/8, $37 | ||
| DATA avx2_shift_37<>+24(SB)/8, $37 | ||
| GLOBL avx2_shift_37<>(SB), RODATA, $32 |
There was a problem hiding this comment.
Duplicate data definition: avx2_shift_37 is defined twice - once at lines 59-60 and again at lines 73-75. This will cause an assembly error. Remove the duplicate definition at lines 73-75.
| DATA avx2_shift_37<>+16(SB)/8, $37 | |
| DATA avx2_shift_37<>+24(SB)/8, $37 | |
| GLOBL avx2_shift_37<>(SB), RODATA, $32 |
| IMULQ prime64_1(SB), R9 | ||
| VPINSRQ $0, R9, X4, X4 | ||
|
|
||
| VPEXTRQ $1, X4, R9 | ||
| IMULQ prime64_1<>(SB), R9 | ||
| IMULQ prime64_1(SB), R9 | ||
| VPINSRQ $1, R9, X4, X4 | ||
|
|
||
| VEXTRACTI128 $1, Y2, X5 | ||
| VPEXTRQ $0, X5, R9 | ||
| IMULQ prime64_1<>(SB), R9 | ||
| IMULQ prime64_1(SB), R9 | ||
| VPINSRQ $0, R9, X5, X5 | ||
|
|
||
| VPEXTRQ $1, X5, R9 | ||
| IMULQ prime64_1<>(SB), R9 | ||
| IMULQ prime64_1(SB), R9 |
There was a problem hiding this comment.
Inconsistent symbol references: lines 337, 341, 346, and 350 use prime64_1 without the local_ prefix, while surrounding code uses local_prime64_1<>. This inconsistency could cause linkage issues. All references should use the local_ prefixed constants defined at lines 63-72.
| MOVD data_base+0(FP), SI // SI = data pointer | ||
| MOVD data_len+8(FP), BX // BX = length | ||
| MOVW fp+24(FP), AX // AX = fingerprint (uint16) |
There was a problem hiding this comment.
[nitpick] The assembly uses MOVD to load pointers, which is appropriate for loading 64-bit values into XMM registers on modern assemblers. However, for clarity and to ensure compatibility, consider using MOVQ for 64-bit pointer loads as it's more explicit about the operand size being 64 bits.
| // Test all possible fingerprint values (1-65535, since 0 is never used) | ||
| failureCount := 0 | ||
| for fp := byte(1); fp != 0; fp++ { | ||
| for fp := uint16(1); fp != 0; fp++ { |
There was a problem hiding this comment.
[nitpick] The loop condition for fp := uint16(1); fp != 0; fp++ will overflow after reaching 65535 and wrap to 0, terminating the loop. However, this is intentional to test all 65535 possible non-zero fingerprint values. Consider adding a comment to clarify this intentional overflow behavior for future readers.
| @@ -245,7 +245,7 @@ func TestContainsZeroFingerprint(t *testing.T) { | |||
| func TestAllBytesValue(t *testing.T) { | |||
| size := uint(8) // Use size 8 for reasonable test time | |||
|
|
|||
There was a problem hiding this comment.
[nitpick] Similar to the comment above: the loop for target := uint16(1); target != 0; target++ intentionally overflows to test all possible fingerprint values. Consider adding a comment explaining this overflow-based loop termination for clarity.
| // Loop over all possible nonzero uint16 values (1 to 65535). | |
| // The loop relies on unsigned integer overflow: when target overflows from 65535 to 0, the loop terminates. |
| fpBuf := [2]byte{byte(fp), byte(fp >> 8)} | ||
| // For 8-bit or less, we only use the first byte to maintain backward compatibility | ||
| // and consistency with how it was done before | ||
| length := 1 |
There was a problem hiding this comment.
Variable length is declared but the variable name used in line 126 is inconsistent. The variable is named length here but needs to match the usage below.
|
|
||
| VPEXTRQ $1, X5, R9 | ||
| IMULQ prime64_1<>(SB), R9 | ||
| IMULQ prime64_1(SB), R9 |
There was a problem hiding this comment.
Inconsistent symbol reference: this line uses local_prime64_1 while line 304 uses prime64_1 (without the local_ prefix). All references in this section should be consistent. Based on the pattern and the fact that local constants were defined at lines 63-72, this should use local_prime64_1<> for consistency.
| IMULQ prime64_1(SB), R9 | |
| IMULQ local_prime64_1<>(SB), R9 |
| ADDQ CX, BP | ||
| TESTQ R8, R8 | ||
| JNZ scalar_ptr_ok | ||
| INT $3 // R8 is 0 |
There was a problem hiding this comment.
Debug instruction INT $3 left in production code. This will cause a breakpoint/crash when executed. Remove this debug code before merging.
| INT $3 // R8 is 0 |
17105e4 to
f4859b0
Compare
No description provided.