Skip to content

fix(assembly): fix multiple critical bugs in AVX2 batch processing#26

Merged
shaia merged 4 commits into
mainfrom
fix/assembly-prime-constants
Nov 24, 2025
Merged

fix(assembly): fix multiple critical bugs in AVX2 batch processing#26
shaia merged 4 commits into
mainfrom
fix/assembly-prime-constants

Conversation

@shaia
Copy link
Copy Markdown
Owner

@shaia shaia commented Nov 21, 2025

Fixes three major bugs in the AVX2 assembly that caused test failures:

  1. Fixed prime64 constant symbol visibility

    • Removed '<>' suffix from prime64_1-5 declarations to make them accessible across assembly files
    • Fixed "relocation target not defined" linker errors
  2. Fixed incorrect fingerprint hash calculation for i2 index

    • Was doing: hash = prime64_5 + 1 + fp (wrong!)
    • Should be: hash ^= fp * prime64_5, then rotate and multiply
    • Matches the Go reference implementation in xxhash.go
    • Fixes incorrect i2 values in test results
  3. Fixed off-by-one error in SIMD loop batch size check

    • Was checking SI-3 >= 0, allowing 3-item batches into 4-way SIMD
    • Now checks SI-4 >= 0, correctly requiring at least 4 items
    • Prevents crashes from reading uninitialized item pointers

Test results:

  • 1-item batches: ✓ PASS
  • 2-item batches: ✓ PASS
  • 3-item batches: ✓ PASS
  • 4-item batches: Still failing with nil pointer (separate issue)

Fixes three major bugs in the AVX2 assembly that caused test failures:

1. Fixed prime64 constant symbol visibility
   - Removed '<>' suffix from prime64_1-5 declarations to make them
     accessible across assembly files
   - Fixed "relocation target not defined" linker errors

2. Fixed incorrect fingerprint hash calculation for i2 index
   - Was doing: hash = prime64_5 + 1 + fp (wrong!)
   - Should be: hash ^= fp * prime64_5, then rotate and multiply
   - Matches the Go reference implementation in xxhash.go
   - Fixes incorrect i2 values in test results

3. Fixed off-by-one error in SIMD loop batch size check
   - Was checking SI-3 >= 0, allowing 3-item batches into 4-way SIMD
   - Now checks SI-4 >= 0, correctly requiring at least 4 items
   - Prevents crashes from reading uninitialized item pointers

Test results:
- 1-item batches: ✓ PASS
- 2-item batches: ✓ PASS
- 3-item batches: ✓ PASS
- 4-item batches: Still failing with nil pointer (separate issue)
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR attempts to fix three critical bugs in AVX2 batch processing, but introduces new issues in the implementation.

Summary: The PR fixes symbol visibility for scalar prime64 constants, corrects fingerprint hash calculation logic to match the XXHash64 reference implementation, and fixes an off-by-one error in the SIMD batch size check. However, the changes to AVX2 vector constants introduce multiple critical bugs that will prevent the code from assembling.

Key Changes:

  • Removed <> suffix from scalar prime64 constants in xxhash_amd64.s to enable cross-file sharing
  • Fixed fingerprint hash calculation to use proper XXHash64 algorithm (XOR with fp*prime64_5, rotate, multiply)
  • Changed SIMD loop condition from SI-3 to SI-4 to require minimum 4 items for 4-way parallelism

Critical Issues Found:

  • AVX2 vector constants have mismatched DATA/GLOBL declarations (DATA uses <>, GLOBL doesn't)
  • AVX2 vector constant references are missing required <> suffix
  • Missing NOPTR flags on AVX2 vector constant GLOBL declarations

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 15 comments.

File Description
internal/hash/xxhash/xxhash_amd64.s Correctly removes <> suffix from scalar prime64 constants to make them globally accessible, and adds NOPTR flags
internal/hash/xxhash/batch_avx2_amd64.s Fixes fingerprint calculation logic and batch size check, but introduces critical bugs in AVX2 constant declarations and references

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/hash/xxhash/batch_avx2_amd64.s Outdated
DATA avx2_prime64_1<>+16(SB)/8, $11400714785074694791
DATA avx2_prime64_1<>+24(SB)/8, $11400714785074694791
GLOBL avx2_prime64_1<>(SB), RODATA, $32
GLOBL avx2_prime64_1(SB), RODATA, $32
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing NOPTR flag in GLOBL declaration. Since this is a constant data array containing only numeric values (no pointers), it should include the NOPTR flag to inform the garbage collector it doesn't need to scan this memory.

Change to: GLOBL avx2_prime64_1<>(SB), RODATA|NOPTR, $32

This matches the pattern used in xxhash_amd64.s for the scalar constants.

Suggested change
GLOBL avx2_prime64_1(SB), RODATA, $32
GLOBL avx2_prime64_1(SB), RODATA|NOPTR, $32

Copilot uses AI. Check for mistakes.
Comment thread internal/hash/xxhash/batch_avx2_amd64.s Outdated

// hash + prime64_4
VMOVDQU avx2_prime64_4<>(SB), Y3
VMOVDQU avx2_prime64_4(SB), Y3
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect symbol reference: should use avx2_prime64_4<>(SB) with the <> suffix since this is a file-local symbol.

Change to: VMOVDQU avx2_prime64_4<>(SB), Y3

Suggested change
VMOVDQU avx2_prime64_4(SB), Y3
VMOVDQU avx2_prime64_4<>(SB), Y3

Copilot uses AI. Check for mistakes.
Comment thread internal/hash/xxhash/batch_avx2_amd64.s Outdated
DATA avx2_prime64_1<>+16(SB)/8, $11400714785074694791
DATA avx2_prime64_1<>+24(SB)/8, $11400714785074694791
GLOBL avx2_prime64_1<>(SB), RODATA, $32
GLOBL avx2_prime64_1(SB), RODATA, $32
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent symbol names: GLOBL declarations use avx2_prime64_1(SB) without the <> suffix, but the corresponding DATA declarations still use avx2_prime64_1<>. This mismatch will cause assembly errors.

The DATA and GLOBL declarations must use the same symbol name. Either both should use <> (file-local) or neither should (global). Since these are file-local constants, both should keep the <> suffix:

  • DATA avx2_prime64_1<>+...
  • GLOBL avx2_prime64_1<>(SB), ...

Copilot uses AI. Check for mistakes.
Comment thread internal/hash/xxhash/batch_avx2_amd64.s Outdated
DATA avx2_prime64_4<>+16(SB)/8, $9650029242287828579
DATA avx2_prime64_4<>+24(SB)/8, $9650029242287828579
GLOBL avx2_prime64_4<>(SB), RODATA, $32
GLOBL avx2_prime64_4(SB), RODATA, $32
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent symbol names: GLOBL declaration uses avx2_prime64_4(SB) without the <> suffix, but the corresponding DATA declarations still use avx2_prime64_4<>. This mismatch will cause assembly errors.

The DATA and GLOBL declarations must use the same symbol name. Both should keep the <> suffix for file-local constants.

Suggested change
GLOBL avx2_prime64_4(SB), RODATA, $32
GLOBL avx2_prime64_4<>(SB), RODATA, $32

Copilot uses AI. Check for mistakes.
Comment thread internal/hash/xxhash/batch_avx2_amd64.s Outdated
DATA avx2_prime64_3<>+16(SB)/8, $1609587929392839161
DATA avx2_prime64_3<>+24(SB)/8, $1609587929392839161
GLOBL avx2_prime64_3<>(SB), RODATA, $32
GLOBL avx2_prime64_3(SB), RODATA, $32
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing NOPTR flag in GLOBL declaration. Since this is a constant data array containing only numeric values (no pointers), it should include the NOPTR flag.

Change to: GLOBL avx2_prime64_3<>(SB), RODATA|NOPTR, $32

Suggested change
GLOBL avx2_prime64_3(SB), RODATA, $32
GLOBL avx2_prime64_3(SB), RODATA|NOPTR, $32

Copilot uses AI. Check for mistakes.
Comment thread internal/hash/xxhash/batch_avx2_amd64.s Outdated
DATA avx2_prime64_5<>+16(SB)/8, $2870177450012600261
DATA avx2_prime64_5<>+24(SB)/8, $2870177450012600261
GLOBL avx2_prime64_5<>(SB), RODATA, $32
GLOBL avx2_prime64_5(SB), RODATA, $32
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent symbol names: GLOBL declaration uses avx2_prime64_5(SB) without the <> suffix, but the corresponding DATA declarations still use avx2_prime64_5<>. This mismatch will cause assembly errors.

The DATA and GLOBL declarations must use the same symbol name. Both should keep the <> suffix for file-local constants.

Suggested change
GLOBL avx2_prime64_5(SB), RODATA, $32
GLOBL avx2_prime64_5<>(SB), RODATA, $32

Copilot uses AI. Check for mistakes.
Comment thread internal/hash/xxhash/batch_avx2_amd64.s Outdated
DATA avx2_prime64_5<>+16(SB)/8, $2870177450012600261
DATA avx2_prime64_5<>+24(SB)/8, $2870177450012600261
GLOBL avx2_prime64_5<>(SB), RODATA, $32
GLOBL avx2_prime64_5(SB), RODATA, $32
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing NOPTR flag in GLOBL declaration. Since this is a constant data array containing only numeric values (no pointers), it should include the NOPTR flag.

Change to: GLOBL avx2_prime64_5<>(SB), RODATA|NOPTR, $32

Suggested change
GLOBL avx2_prime64_5(SB), RODATA, $32
GLOBL avx2_prime64_5<>(SB), RODATA|NOPTR, $32

Copilot uses AI. Check for mistakes.
Comment thread internal/hash/xxhash/batch_avx2_amd64.s Outdated

// Load constants into YMM registers
VMOVDQU avx2_prime64_5<>(SB), Y0 // Y0 = prime64_5 (4x)
VMOVDQU avx2_prime64_5(SB), Y0 // Y0 = prime64_5 (4x)
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect symbol reference: should use avx2_prime64_5<>(SB) with the <> suffix since this is a file-local symbol. The GLOBL declaration for this constant (line 38) uses the file-local syntax, so all references must match.

Change to: VMOVDQU avx2_prime64_5<>(SB), Y0

Suggested change
VMOVDQU avx2_prime64_5(SB), Y0 // Y0 = prime64_5 (4x)
VMOVDQU avx2_prime64_5<>(SB), Y0 // Y0 = prime64_5 (4x)

Copilot uses AI. Check for mistakes.
Comment thread internal/hash/xxhash/batch_avx2_amd64.s Outdated

// k *= prime64_1
VMOVDQU avx2_prime64_1<>(SB), Y3
VMOVDQU avx2_prime64_1(SB), Y3
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect symbol reference: should use avx2_prime64_1<>(SB) with the <> suffix since this is a file-local symbol.

Change to: VMOVDQU avx2_prime64_1<>(SB), Y3

Suggested change
VMOVDQU avx2_prime64_1(SB), Y3
VMOVDQU avx2_prime64_1<>(SB), Y3

Copilot uses AI. Check for mistakes.
GLOBL prime64_3<>(SB), RODATA, $8
GLOBL prime64_4<>(SB), RODATA, $8
GLOBL prime64_5<>(SB), RODATA, $8
// Note: These constants are shared with batch_avx2_amd64.s, so they cannot be file-private
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misleading comment: The comment states these constants are "shared with batch_avx2_amd64.s", but that file actually defines its own separate AVX2-specific vector constants (avx2_prime64_1 through avx2_prime64_5) that are file-local.

These scalar constants (prime64_1 through prime64_5) are used by batch_avx2_amd64.s only for scalar operations (IMULQ instructions), not for the AVX2 vector operations.

Suggested clarification: "Note: These constants are used for scalar operations in both files. The batch_avx2_amd64.s file defines separate AVX2 vector constants for SIMD operations."

Suggested change
// Note: These constants are shared with batch_avx2_amd64.s, so they cannot be file-private
// Note: These constants are used for scalar operations in both files. The batch_avx2_amd64.s file defines separate AVX2 vector constants for SIMD operations.

Copilot uses AI. Check for mistakes.
Fixed inconsistency in symbol declarations where DATA directives used
'<>' suffix but GLOBL and references didn't match. In Go assembly,
file-local symbols must use '<>' consistently in all three places:
- DATA declarations: symbol<>
- GLOBL declarations: symbol<>
- References in code: symbol<>

This fixes the linker errors for AVX2 vector constants.

Note: The 4-item SIMD crash persists - this was a separate symbol
consistency issue that was preventing compilation.
Add RODATA|NOPTR flags to all AVX2 constant GLOBL declarations.
Since these constants contain only numeric values (no pointers),
the NOPTR flag tells the garbage collector to skip scanning them.
This commit makes progress on fixing the AMD64 AVX2 batch processing but
identifies a critical issue that requires further investigation.

**What's Fixed:**
1. Chunk offset handling: Changed simd_remainder to save the actual chunk
   loop offset (CX) instead of incorrectly loading min_length. This fixes
   the bug where items <8 bytes weren't being hashed at all.

2. Scalar item loading: Replaced complex offset calculations with explicit
   conditional jumps to load item data/length/hash for cleaner debugging.

3. Slice addressing: Simplified item pointer loading by using explicit
   address calculation (ADDQ DI, BX) instead of complex indexed addressing.

4. Stack alignment: Increased stack frame from 136 to 144 bytes for proper
   16-byte alignment required by AVX2 operations.

**Critical Issue - AVX2 Path Temporarily Disabled:**
The AVX2 SIMD code path crashes when trying to load item data pointers from
the stack. The crash occurs at line 180 when loading from stack offsets
32(SP), 48(SP), 64(SP), 80(SP). Instead of the valid pointers stored earlier,
garbage values (e.g., 0xc0000f8000) are loaded, causing access violations.

Possible causes under investigation:
- Stack frame corruption between store and load operations
- Incorrect stack addressing with NOSPLIT on Windows
- AVX2 SIMD operations inadvertently corrupting stack memory
- Issue with how slice data is loaded from the items parameter

The scalar fallback path works correctly and all tests pass with it enabled.
For now, the SIMD path is disabled (JMP scalar_loop) until this can be
properly debugged with a native debugger.

**Testing:**
- All tests pass with scalar path: ✅
- Tests 1-3 items: ✅ PASS (uses scalar)
- Tests 4+ items: ✅ PASS (forced to scalar)
- Scalar path produces correct hash values matching reference implementation
@shaia shaia merged commit 811b13e into main Nov 24, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants