Skip to content

Simd optimization 16bit#27

Merged
shaia merged 17 commits into
mainfrom
simd-optimization-16bit
Nov 24, 2025
Merged

Simd optimization 16bit#27
shaia merged 17 commits into
mainfrom
simd-optimization-16bit

Conversation

@shaia
Copy link
Copy Markdown
Owner

@shaia shaia commented Nov 23, 2025

No description provided.

Fixes three major bugs in the AVX2 assembly that caused test failures:

1. Fixed prime64 constant symbol visibility
   - Removed '<>' suffix from prime64_1-5 declarations to make them
     accessible across assembly files
   - Fixed "relocation target not defined" linker errors

2. Fixed incorrect fingerprint hash calculation for i2 index
   - Was doing: hash = prime64_5 + 1 + fp (wrong!)
   - Should be: hash ^= fp * prime64_5, then rotate and multiply
   - Matches the Go reference implementation in xxhash.go
   - Fixes incorrect i2 values in test results

3. Fixed off-by-one error in SIMD loop batch size check
   - Was checking SI-3 >= 0, allowing 3-item batches into 4-way SIMD
   - Now checks SI-4 >= 0, correctly requiring at least 4 items
   - Prevents crashes from reading uninitialized item pointers

Test results:
- 1-item batches: ✓ PASS
- 2-item batches: ✓ PASS
- 3-item batches: ✓ PASS
- 4-item batches: Still failing with nil pointer (separate issue)
Fixed inconsistency in symbol declarations where DATA directives used
'<>' suffix but GLOBL and references didn't match. In Go assembly,
file-local symbols must use '<>' consistently in all three places:
- DATA declarations: symbol<>
- GLOBL declarations: symbol<>
- References in code: symbol<>

This fixes the linker errors for AVX2 vector constants.

Note: The 4-item SIMD crash persists - this was a separate symbol
consistency issue that was preventing compilation.
Add RODATA|NOPTR flags to all AVX2 constant GLOBL declarations.
Since these constants contain only numeric values (no pointers),
the NOPTR flag tells the garbage collector to skip scanning them.
This commit makes progress on fixing the AMD64 AVX2 batch processing but
identifies a critical issue that requires further investigation.

**What's Fixed:**
1. Chunk offset handling: Changed simd_remainder to save the actual chunk
   loop offset (CX) instead of incorrectly loading min_length. This fixes
   the bug where items <8 bytes weren't being hashed at all.

2. Scalar item loading: Replaced complex offset calculations with explicit
   conditional jumps to load item data/length/hash for cleaner debugging.

3. Slice addressing: Simplified item pointer loading by using explicit
   address calculation (ADDQ DI, BX) instead of complex indexed addressing.

4. Stack alignment: Increased stack frame from 136 to 144 bytes for proper
   16-byte alignment required by AVX2 operations.

**Critical Issue - AVX2 Path Temporarily Disabled:**
The AVX2 SIMD code path crashes when trying to load item data pointers from
the stack. The crash occurs at line 180 when loading from stack offsets
32(SP), 48(SP), 64(SP), 80(SP). Instead of the valid pointers stored earlier,
garbage values (e.g., 0xc0000f8000) are loaded, causing access violations.

Possible causes under investigation:
- Stack frame corruption between store and load operations
- Incorrect stack addressing with NOSPLIT on Windows
- AVX2 SIMD operations inadvertently corrupting stack memory
- Issue with how slice data is loaded from the items parameter

The scalar fallback path works correctly and all tests pass with it enabled.
For now, the SIMD path is disabled (JMP scalar_loop) until this can be
properly debugged with a native debugger.

**Testing:**
- All tests pass with scalar path: ✅
- Tests 1-3 items: ✅ PASS (uses scalar)
- Tests 4+ items: ✅ PASS (forced to scalar)
- Scalar path produces correct hash values matching reference implementation
This commit fixes a SIGSEGV crash and hash correctness issues in the AVX2 batch processing path. Key changes:
- Defined local scalar constants to avoid linkage issues.
- Unrolled the finalize loop to process 4 items sequentially.
- Implemented full scalar fallback logic (including 8-byte chunks) within the unrolled loop.
- Bypassed stack corruption by recalculating initial hashes and reloading base pointers.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR upgrades the cuckoo filter to support 16-bit fingerprints (up from 8-bit), enabling lower false positive rates at the cost of increased memory usage. The change involves updating fingerprint types from byte to uint16 throughout the codebase and adapting SIMD operations accordingly.

Key Changes:

  • Fingerprint size limit increased from 8 bits to 16 bits in validation and documentation
  • Core fingerprint type changed from byte to uint16 across all hash implementations and bucket operations
  • AVX2 batch hash processing temporarily disabled with scalar fallback while 16-bit support is in progress
  • New AVX2 assembly implementation for 16-bit fingerprint bucket lookups on AMD64

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
options.go Updated validation to allow 1-16 bit fingerprints instead of 1-8 bits
internal/hash/types/types.go Changed HashResult.Fp from byte to uint16
internal/hash/hash.go Updated HashInterface signatures to use uint16 for fingerprints
internal/hash/xxhash/xxhash.go Modified GetIndices and GetAltIndex to use uint16 fingerprints with simplified alt-index calculation
internal/hash/xxhash/xxhash_amd64.s Updated prime constant symbols to remove <> suffix and add NOPTR flag
internal/hash/xxhash/batch_amd64.go Disabled AVX2 batch processing, using scalar fallback for 16-bit support
internal/hash/xxhash/batch_avx2_amd64.s Extensive modifications for 16-bit support (currently unused due to fallback)
internal/hash/fnv/hash_fnv.go Updated to use uint16 fingerprints with 2-byte buffer when needed
internal/hash/crc32/crc32.go Updated to use uint16 fingerprints with conditional 1 or 2-byte hashing
internal/bucket/bucket.go Changed bucket fingerprint storage from []byte to []uint16
internal/bucket/bucket_simd_amd64.go Updated to call new AVX2 assembly for 16-bit fingerprint lookups
internal/bucket/bucket_avx2_amd64.s New AVX2 assembly for efficient 16-bit fingerprint comparison
internal/bucket/bucket_simd_arm64.go Disabled NEON assembly, using scalar fallback for 16-bit support
internal/filter/filter.go Updated relocate signature to accept uint16 fingerprints
Various test files Updated test cases to use uint16 instead of byte for fingerprints

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/hash/xxhash/batch_avx2_amd64.s Outdated

// Process 8-byte chunks for all 4 items in parallel
XORQ CX, CX // CX = offset within data
// JMP simd_remainder // DEBUG: Skip chunk loop
Copy link

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented-out debug code // JMP simd_remainder // DEBUG: Skip chunk loop left in the codebase. Remove commented debug code before merging to maintain code cleanliness.

Suggested change
// JMP simd_remainder // DEBUG: Skip chunk loop

Copilot uses AI. Check for mistakes.
Comment on lines +40 to +43
/*
//go:noescape
func processBatchXXHashAVX2(items [][]byte, results []types.HashResult, fingerprintBits, numBuckets uint)
*/
Copy link

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The AVX2 implementation is completely disabled in favor of a scalar fallback, but the assembly code in batch_avx2_amd64.s still exists and has been extensively modified. This creates maintenance burden and confusion. Consider either:

  1. Fully implementing 16-bit support in AVX2 assembly and re-enabling it
  2. Removing the unused assembly file if AVX2 support is not planned

The current state with 800+ lines of modified but unused assembly code is problematic for maintainability.

Suggested change
/*
//go:noescape
func processBatchXXHashAVX2(items [][]byte, results []types.HashResult, fingerprintBits, numBuckets uint)
*/

Copilot uses AI. Check for mistakes.
Comment thread internal/hash/xxhash/batch_avx2_amd64.s Outdated
Comment on lines 73 to 75
DATA avx2_shift_37<>+16(SB)/8, $37
DATA avx2_shift_37<>+24(SB)/8, $37
GLOBL avx2_shift_37<>(SB), RODATA, $32
Copy link

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate data definition: avx2_shift_37 is defined twice - once at lines 59-60 and again at lines 73-75. This will cause an assembly error. Remove the duplicate definition at lines 73-75.

Suggested change
DATA avx2_shift_37<>+16(SB)/8, $37
DATA avx2_shift_37<>+24(SB)/8, $37
GLOBL avx2_shift_37<>(SB), RODATA, $32

Copilot uses AI. Check for mistakes.
Comment thread internal/hash/xxhash/batch_avx2_amd64.s Outdated
Comment on lines +337 to +350
IMULQ prime64_1(SB), R9
VPINSRQ $0, R9, X4, X4

VPEXTRQ $1, X4, R9
IMULQ prime64_1<>(SB), R9
IMULQ prime64_1(SB), R9
VPINSRQ $1, R9, X4, X4

VEXTRACTI128 $1, Y2, X5
VPEXTRQ $0, X5, R9
IMULQ prime64_1<>(SB), R9
IMULQ prime64_1(SB), R9
VPINSRQ $0, R9, X5, X5

VPEXTRQ $1, X5, R9
IMULQ prime64_1<>(SB), R9
IMULQ prime64_1(SB), R9
Copy link

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent symbol references: lines 337, 341, 346, and 350 use prime64_1 without the local_ prefix, while surrounding code uses local_prime64_1<>. This inconsistency could cause linkage issues. All references should use the local_ prefixed constants defined at lines 63-72.

Copilot uses AI. Check for mistakes.
Comment thread internal/bucket/bucket_avx2_amd64.s Outdated
Comment on lines +8 to +10
MOVD data_base+0(FP), SI // SI = data pointer
MOVD data_len+8(FP), BX // BX = length
MOVW fp+24(FP), AX // AX = fingerprint (uint16)
Copy link

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The assembly uses MOVD to load pointers, which is appropriate for loading 64-bit values into XMM registers on modern assemblers. However, for clarity and to ensure compatibility, consider using MOVQ for 64-bit pointer loads as it's more explicit about the operand size being 64 bits.

Copilot uses AI. Check for mistakes.
// Test all possible fingerprint values (1-65535, since 0 is never used)
failureCount := 0
for fp := byte(1); fp != 0; fp++ {
for fp := uint16(1); fp != 0; fp++ {
Copy link

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The loop condition for fp := uint16(1); fp != 0; fp++ will overflow after reaching 65535 and wrap to 0, terminating the loop. However, this is intentional to test all 65535 possible non-zero fingerprint values. Consider adding a comment to clarify this intentional overflow behavior for future readers.

Copilot uses AI. Check for mistakes.
@@ -245,7 +245,7 @@ func TestContainsZeroFingerprint(t *testing.T) {
func TestAllBytesValue(t *testing.T) {
size := uint(8) // Use size 8 for reasonable test time

Copy link

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Similar to the comment above: the loop for target := uint16(1); target != 0; target++ intentionally overflows to test all possible fingerprint values. Consider adding a comment explaining this overflow-based loop termination for clarity.

Suggested change
// Loop over all possible nonzero uint16 values (1 to 65535).
// The loop relies on unsigned integer overflow: when target overflows from 65535 to 0, the loop terminates.

Copilot uses AI. Check for mistakes.
fpBuf := [2]byte{byte(fp), byte(fp >> 8)}
// For 8-bit or less, we only use the first byte to maintain backward compatibility
// and consistency with how it was done before
length := 1
Copy link

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable length is declared but the variable name used in line 126 is inconsistent. The variable is named length here but needs to match the usage below.

Copilot uses AI. Check for mistakes.
Comment thread internal/hash/xxhash/batch_avx2_amd64.s Outdated

VPEXTRQ $1, X5, R9
IMULQ prime64_1<>(SB), R9
IMULQ prime64_1(SB), R9
Copy link

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent symbol reference: this line uses local_prime64_1 while line 304 uses prime64_1 (without the local_ prefix). All references in this section should be consistent. Based on the pattern and the fact that local constants were defined at lines 63-72, this should use local_prime64_1<> for consistency.

Suggested change
IMULQ prime64_1(SB), R9
IMULQ local_prime64_1<>(SB), R9

Copilot uses AI. Check for mistakes.
Comment thread internal/hash/xxhash/batch_avx2_amd64.s Outdated
ADDQ CX, BP
TESTQ R8, R8
JNZ scalar_ptr_ok
INT $3 // R8 is 0
Copy link

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug instruction INT $3 left in production code. This will cause a breakpoint/crash when executed. Remove this debug code before merging.

Suggested change
INT $3 // R8 is 0

Copilot uses AI. Check for mistakes.
@shaia shaia force-pushed the simd-optimization-16bit branch from 17105e4 to f4859b0 Compare November 23, 2025 14:28
@shaia shaia merged commit ca88ffb into main Nov 24, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants