Simd optimization 16bit by shaia · Pull Request #27 · shaia/SIMDCuckooFilter

shaia · 2025-11-23T09:28:08Z

No description provided.

Fixes three major bugs in the AVX2 assembly that caused test failures: 1. Fixed prime64 constant symbol visibility - Removed '<>' suffix from prime64_1-5 declarations to make them accessible across assembly files - Fixed "relocation target not defined" linker errors 2. Fixed incorrect fingerprint hash calculation for i2 index - Was doing: hash = prime64_5 + 1 + fp (wrong!) - Should be: hash ^= fp * prime64_5, then rotate and multiply - Matches the Go reference implementation in xxhash.go - Fixes incorrect i2 values in test results 3. Fixed off-by-one error in SIMD loop batch size check - Was checking SI-3 >= 0, allowing 3-item batches into 4-way SIMD - Now checks SI-4 >= 0, correctly requiring at least 4 items - Prevents crashes from reading uninitialized item pointers Test results: - 1-item batches: ✓ PASS - 2-item batches: ✓ PASS - 3-item batches: ✓ PASS - 4-item batches: Still failing with nil pointer (separate issue)

Fixed inconsistency in symbol declarations where DATA directives used '<>' suffix but GLOBL and references didn't match. In Go assembly, file-local symbols must use '<>' consistently in all three places: - DATA declarations: symbol<> - GLOBL declarations: symbol<> - References in code: symbol<> This fixes the linker errors for AVX2 vector constants. Note: The 4-item SIMD crash persists - this was a separate symbol consistency issue that was preventing compilation.

Add RODATA|NOPTR flags to all AVX2 constant GLOBL declarations. Since these constants contain only numeric values (no pointers), the NOPTR flag tells the garbage collector to skip scanning them.

This commit makes progress on fixing the AMD64 AVX2 batch processing but identifies a critical issue that requires further investigation. **What's Fixed:** 1. Chunk offset handling: Changed simd_remainder to save the actual chunk loop offset (CX) instead of incorrectly loading min_length. This fixes the bug where items <8 bytes weren't being hashed at all. 2. Scalar item loading: Replaced complex offset calculations with explicit conditional jumps to load item data/length/hash for cleaner debugging. 3. Slice addressing: Simplified item pointer loading by using explicit address calculation (ADDQ DI, BX) instead of complex indexed addressing. 4. Stack alignment: Increased stack frame from 136 to 144 bytes for proper 16-byte alignment required by AVX2 operations. **Critical Issue - AVX2 Path Temporarily Disabled:** The AVX2 SIMD code path crashes when trying to load item data pointers from the stack. The crash occurs at line 180 when loading from stack offsets 32(SP), 48(SP), 64(SP), 80(SP). Instead of the valid pointers stored earlier, garbage values (e.g., 0xc0000f8000) are loaded, causing access violations. Possible causes under investigation: - Stack frame corruption between store and load operations - Incorrect stack addressing with NOSPLIT on Windows - AVX2 SIMD operations inadvertently corrupting stack memory - Issue with how slice data is loaded from the items parameter The scalar fallback path works correctly and all tests pass with it enabled. For now, the SIMD path is disabled (JMP scalar_loop) until this can be properly debugged with a native debugger. **Testing:** - All tests pass with scalar path: ✅ - Tests 1-3 items: ✅ PASS (uses scalar) - Tests 4+ items: ✅ PASS (forced to scalar) - Scalar path produces correct hash values matching reference implementation

This commit fixes a SIGSEGV crash and hash correctness issues in the AVX2 batch processing path. Key changes: - Defined local scalar constants to avoid linkage issues. - Unrolled the finalize loop to process 4 items sequentially. - Implemented full scalar fallback logic (including 8-byte chunks) within the unrolled loop. - Bypassed stack corruption by recalculating initial hashes and reloading base pointers.

Copilot

Pull request overview

This PR upgrades the cuckoo filter to support 16-bit fingerprints (up from 8-bit), enabling lower false positive rates at the cost of increased memory usage. The change involves updating fingerprint types from byte to uint16 throughout the codebase and adapting SIMD operations accordingly.

Key Changes:

Fingerprint size limit increased from 8 bits to 16 bits in validation and documentation
Core fingerprint type changed from byte to uint16 across all hash implementations and bucket operations
AVX2 batch hash processing temporarily disabled with scalar fallback while 16-bit support is in progress
New AVX2 assembly implementation for 16-bit fingerprint bucket lookups on AMD64

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 10 comments.

Show a summary per file

File	Description
options.go	Updated validation to allow 1-16 bit fingerprints instead of 1-8 bits
internal/hash/types/types.go	Changed HashResult.Fp from `byte` to `uint16`
internal/hash/hash.go	Updated HashInterface signatures to use `uint16` for fingerprints
internal/hash/xxhash/xxhash.go	Modified GetIndices and GetAltIndex to use `uint16` fingerprints with simplified alt-index calculation
internal/hash/xxhash/xxhash_amd64.s	Updated prime constant symbols to remove `<>` suffix and add NOPTR flag
internal/hash/xxhash/batch_amd64.go	Disabled AVX2 batch processing, using scalar fallback for 16-bit support
internal/hash/xxhash/batch_avx2_amd64.s	Extensive modifications for 16-bit support (currently unused due to fallback)
internal/hash/fnv/hash_fnv.go	Updated to use `uint16` fingerprints with 2-byte buffer when needed
internal/hash/crc32/crc32.go	Updated to use `uint16` fingerprints with conditional 1 or 2-byte hashing
internal/bucket/bucket.go	Changed bucket fingerprint storage from `[]byte` to `[]uint16`
internal/bucket/bucket_simd_amd64.go	Updated to call new AVX2 assembly for 16-bit fingerprint lookups
internal/bucket/bucket_avx2_amd64.s	New AVX2 assembly for efficient 16-bit fingerprint comparison
internal/bucket/bucket_simd_arm64.go	Disabled NEON assembly, using scalar fallback for 16-bit support
internal/filter/filter.go	Updated relocate signature to accept `uint16` fingerprints
Various test files	Updated test cases to use `uint16` instead of `byte` for fingerprints

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-11-23T09:31:55Z


    // Process 8-byte chunks for all 4 items in parallel
    XORQ CX, CX       // CX = offset within data
+    // JMP simd_remainder // DEBUG: Skip chunk loop


Commented-out debug code // JMP simd_remainder // DEBUG: Skip chunk loop left in the codebase. Remove commented debug code before merging to maintain code cleanliness.

Suggested change

// JMP simd_remainder // DEBUG: Skip chunk loop

Copilot · 2025-11-23T09:31:56Z

+/*
 //go:noescape
 func processBatchXXHashAVX2(items [][]byte, results []types.HashResult, fingerprintBits, numBuckets uint)
+*/


[nitpick] The AVX2 implementation is completely disabled in favor of a scalar fallback, but the assembly code in batch_avx2_amd64.s still exists and has been extensively modified. This creates maintenance burden and confusion. Consider either:

Fully implementing 16-bit support in AVX2 assembly and re-enabling it

Removing the unused assembly file if AVX2 support is not planned

The current state with 800+ lines of modified but unused assembly code is problematic for maintainability.

Suggested change

/*

//go:noescape

func processBatchXXHashAVX2(items [][]byte, results []types.HashResult, fingerprintBits, numBuckets uint)

*/

Copilot · 2025-11-23T09:31:56Z

 DATA avx2_shift_37<>+16(SB)/8, $37
 DATA avx2_shift_37<>+24(SB)/8, $37
 GLOBL avx2_shift_37<>(SB), RODATA, $32


Duplicate data definition: avx2_shift_37 is defined twice - once at lines 59-60 and again at lines 73-75. This will cause an assembly error. Remove the duplicate definition at lines 73-75.

Suggested change

DATA avx2_shift_37<>+16(SB)/8, $37

DATA avx2_shift_37<>+24(SB)/8, $37

GLOBL avx2_shift_37<>(SB), RODATA, $32

Copilot · 2025-11-23T09:31:56Z

+    IMULQ prime64_1(SB), R9
    VPINSRQ $0, R9, X4, X4

    VPEXTRQ $1, X4, R9
-    IMULQ prime64_1<>(SB), R9
+    IMULQ prime64_1(SB), R9
    VPINSRQ $1, R9, X4, X4

    VEXTRACTI128 $1, Y2, X5
    VPEXTRQ $0, X5, R9
-    IMULQ prime64_1<>(SB), R9
+    IMULQ prime64_1(SB), R9
    VPINSRQ $0, R9, X5, X5

    VPEXTRQ $1, X5, R9
-    IMULQ prime64_1<>(SB), R9
+    IMULQ prime64_1(SB), R9


Inconsistent symbol references: lines 337, 341, 346, and 350 use prime64_1 without the local_ prefix, while surrounding code uses local_prime64_1<>. This inconsistency could cause linkage issues. All references should use the local_ prefixed constants defined at lines 63-72.

Copilot · 2025-11-23T09:31:56Z

+	MOVD    data_base+0(FP), SI    // SI = data pointer
+	MOVD    data_len+8(FP), BX     // BX = length
+	MOVW    fp+24(FP), AX          // AX = fingerprint (uint16)


[nitpick] The assembly uses MOVD to load pointers, which is appropriate for loading 64-bit values into XMM registers on modern assemblers. However, for clarity and to ensure compatibility, consider using MOVQ for 64-bit pointer loads as it's more explicit about the operand size being 64 bits.

Copilot · 2025-11-23T09:31:57Z

+	// Test all possible fingerprint values (1-65535, since 0 is never used)
 	failureCount := 0
-	for fp := byte(1); fp != 0; fp++ {
+	for fp := uint16(1); fp != 0; fp++ {


[nitpick] The loop condition for fp := uint16(1); fp != 0; fp++ will overflow after reaching 65535 and wrap to 0, terminating the loop. However, this is intentional to test all 65535 possible non-zero fingerprint values. Consider adding a comment to clarify this intentional overflow behavior for future readers.

Copilot · 2025-11-23T09:31:57Z

@@ -245,7 +245,7 @@ func TestContainsZeroFingerprint(t *testing.T) {
 func TestAllBytesValue(t *testing.T) {
 	size := uint(8) // Use size 8 for reasonable test time



[nitpick] Similar to the comment above: the loop for target := uint16(1); target != 0; target++ intentionally overflows to test all possible fingerprint values. Consider adding a comment explaining this overflow-based loop termination for clarity.

Suggested change

// Loop over all possible nonzero uint16 values (1 to 65535).

// The loop relies on unsigned integer overflow: when target overflows from 65535 to 0, the loop terminates.

Copilot · 2025-11-23T09:31:57Z

+	fpBuf := [2]byte{byte(fp), byte(fp >> 8)}
+	// For 8-bit or less, we only use the first byte to maintain backward compatibility
+	// and consistency with how it was done before
+	length := 1


Variable length is declared but the variable name used in line 126 is inconsistent. The variable is named length here but needs to match the usage below.

Copilot · 2025-11-23T09:31:57Z


    VPEXTRQ $1, X5, R9
-    IMULQ prime64_1<>(SB), R9
+    IMULQ prime64_1(SB), R9


Inconsistent symbol reference: this line uses local_prime64_1 while line 304 uses prime64_1 (without the local_ prefix). All references in this section should be consistent. Based on the pattern and the fact that local constants were defined at lines 63-72, this should use local_prime64_1<> for consistency.

Suggested change

IMULQ prime64_1(SB), R9

IMULQ local_prime64_1<>(SB), R9

Copilot · 2025-11-23T09:31:58Z

-    ADDQ CX, BP
+    TESTQ R8, R8
+    JNZ scalar_ptr_ok
+    INT $3 // R8 is 0


Debug instruction INT $3 left in production code. This will cause a breakpoint/crash when executed. Remove this debug code before merging.

Suggested change

INT $3 // R8 is 0

…tency

shaia added 6 commits November 21, 2025 06:47

fix(assembly): add NOPTR flags to AVX2 constant declarations

0e5190e

Add RODATA|NOPTR flags to all AVX2 constant GLOBL declarations. Since these constants contain only numeric values (no pointers), the NOPTR flag tells the garbage collector to skip scanning them.

Implement 16-bit fingerprint support and AVX2 SIMD optimization

23bcfe4

shaia requested a review from Copilot November 23, 2025 09:28

shaia self-assigned this Nov 23, 2025

Copilot started reviewing on behalf of shaia November 23, 2025 09:28 View session

Copilot finished reviewing on behalf of shaia November 23, 2025 09:30

Copilot AI reviewed Nov 23, 2025

View reviewed changes

Optimize AVX2 bucket lookups and fix small bucket regression

f4859b0

shaia force-pushed the simd-optimization-16bit branch from 17105e4 to f4859b0 Compare November 23, 2025 14:28

shaia added 10 commits November 23, 2025 16:48

Remove debug instruction INT from xxhash assembly

295c0fc

Fix inconsistent symbol references in xxhash AVX2 assembly

5f18bef

Rename length to len in crc32.go for consistency

f50b9e6

Remove debug comment from xxhash assembly

e936b7f

Remove unused and duplicate avx2_shift_37 definition

a9fca72

Replace MOVD with MOVQ for pointer loads in AVX2 assembly

f503f79

Document intentional overflow in altindex test loop

f514a12

Document intentional overflow in bucket edge test loop

83f7e37

Rename len to length in crc32.go to avoid shadowing and ensure consis…

385b215

…tency

Fix compilation error in ARM64 batch processing and rename len to length

aeced50

shaia merged commit ca88ffb into main Nov 24, 2025
15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simd optimization 16bit#27

Simd optimization 16bit#27
shaia merged 17 commits into
mainfrom
simd-optimization-16bit

shaia commented Nov 23, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Nov 23, 2025

Uh oh!

Copilot AI Nov 23, 2025

Uh oh!

Copilot AI Nov 23, 2025

Uh oh!

Copilot AI Nov 23, 2025

Uh oh!

Copilot AI Nov 23, 2025

Uh oh!

Copilot AI Nov 23, 2025

Uh oh!

Copilot AI Nov 23, 2025

Uh oh!

Copilot AI Nov 23, 2025

Uh oh!

Copilot AI Nov 23, 2025

Uh oh!

Copilot AI Nov 23, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	DATA avx2_shift_37<>+16(SB)/8, $37
	DATA avx2_shift_37<>+24(SB)/8, $37
	GLOBL avx2_shift_37<>(SB), RODATA, $32

		@@ -245,7 +245,7 @@ func TestContainsZeroFingerprint(t *testing.T) {
		func TestAllBytesValue(t *testing.T) {
		size := uint(8) // Use size 8 for reasonable test time



	// Loop over all possible nonzero uint16 values (1 to 65535).
	// The loop relies on unsigned integer overflow: when target overflows from 65535 to 0, the loop terminates.

Conversation

shaia commented Nov 23, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants