Skip to content

Latest commit

 

History

History
141 lines (101 loc) · 5.38 KB

File metadata and controls

141 lines (101 loc) · 5.38 KB

CacheTrace Validation & Benchmarks

CacheTrace has been validated against real hardware performance counters on Intel Core i7-9850H (Coffee Lake).

Validation Methodology

Toolchain:

  • Valgrind Lackey: Generate memory access traces
  • CacheTrace: Simulate cache behavior
  • perf stat: Measure real hardware counters
  • Comparison: Validate trends match between simulation and hardware

Test Configuration:

  • CPU: Intel Core i7-9850H (Coffee Lake, 12MB L3)
  • CacheTrace Model: Coffee Lake (2MB L3)
  • Validation Range: 8KB - 1MB (within model L3 capacity)

Synthetic Benchmarks

Sequential access patterns with varying working set sizes to stress different cache levels.

Results

Workload Size Rounds CacheTrace L1 Hit CacheTrace L2 Hit CacheTrace L3 Hit perf L1 Miss perf LLC Miss
8KB seq 8,192 B 50 98% 0% 0% 5.00% 36.62%
128KB seq 131,072 B 20 0% 95% 0% 16.60% 32.31%
1MB seq 1,048,576 B 8 0% 0% 88% 20.00% 4.90%
1MB rand 1,048,576 B 8 0% 0% 88% 17.72% 1.72%

Interpretation

Cache Level Predictions Match Hardware Trends:

  1. 8KB workload: Fits in L1 (32KB)

    • CacheTrace: 98% L1 hits
    • Hardware: 5% L1 miss rate
    • Trend Match: Both show L1 serving most accesses
  2. 128KB workload: Exceeds L1 but fits in L2 (256KB)

    • CacheTrace: 95% L2 hits
    • Hardware: 16.6% L1 miss, 32% LLC miss
    • Trend Match: Both show L2 serving most accesses after L1 eviction
  3. 1MB workload: Exceeds L2 but fits in L3 (2MB model)

    • CacheTrace: 88% L3 hits
    • Hardware: 20% L1 miss, 4.9% LLC miss
    • Trend Match: Both show L3 serving most accesses
  4. 1MB random access: Same capacity, different pattern

    • CacheTrace: Same L3 hit rate as sequential
    • Hardware: Lower LLC miss rate (1.72% vs 4.90%)
    • Observation: Random access benefits from prefetcher bypass (not modeled)

Key Insight

CacheTrace correctly predicts which cache level serves accesses as working set size increases, validating the replacement policy implementations.

Real-World Applications

Validation with actual software: xxHash (hash function) and jq (JSON parser).

Results

Application Input Size Trace Accesses CT L1 Hit CT L2 Hit CT L3 Hit CT Avg Cycles perf L1 Miss perf LLC Miss
xxHash 128 KB 78,217 94% 10% 0% 13 5.56% 30.29%
xxHash 8 MB 1,110,411 88% 0% 0% 27 8.26% 65.98%
jq 2 KB 25,406,566 99% 28% 76% 4 1.83% 12.98%

Interpretation

xxHash (128KB):

  • Small hash buffer fits mostly in L1/L2
  • CacheTrace: 94% L1 hits
  • Hardware: 5.56% L1 miss
  • Trend Match: Excellent L1 locality

xxHash (8MB):

  • Large buffer exceeds all cache levels in model
  • CacheTrace: 88% L1 hits (streaming pattern)
  • Hardware: 8.26% L1 miss, 65.98% LLC miss
  • Observation: Shows memory bandwidth bottleneck

jq (2KB input):

  • JSON parsing with small input
  • CacheTrace: 99% L1 hits (hot path)
  • Hardware: 1.83% L1 miss
  • Trend Match: Very cache-friendly workload

Validation Summary

What CacheTrace Gets Right

  1. Cache Level Transitions: Correctly predicts when accesses move from L1 → L2 → L3
  2. Working Set Sensitivity: Hit rates change appropriately as data size increases
  3. Replacement Policy Accuracy: PLRU/QLRU implementations match hardware behavior
  4. Real-World Applicability: Works on actual applications, not just synthetic tests

Known Limitations

  1. Absolute Miss Rates Differ: CacheTrace shows hit rates, perf shows miss rates for all loads (including OS/runtime)
  2. Model vs Hardware L3 Size: Model uses 2MB L3, test hardware has 12MB L3
  3. No Prefetcher Modeling: Hardware prefetchers affect real performance but aren't simulated
  4. No Out-of-Order Effects: Effective latency model, not cycle-accurate OOO simulation
  5. Single-Threaded: No multi-core cache coherency or sharing

CacheTrace provides per-access visibility into cache behavior that perf counters cannot:

  • perf: "20% L1 miss rate" (aggregate)
  • CacheTrace: "Access to 0x1234 missed L1, hit L2, evicted block 0x5678" (per-access)

This level of detail is helpful for:

  • Understanding why a workload has poor cache behavior
  • Identifying which memory accesses cause thrashing
  • Predicting performance on different CPU generations

Reproducing These Results

The full validation harness is public in the tracebugtest repository. Use that repo to reproduce synthetic and real-world runs end-to-end.

Quick Validation

# Build CacheTrace
nasm -f elf64 cachetrace.asm -o cachetrace.o && ld cachetrace.o -o cachetrace

# Run the validation harness (sibling repo)
cd ../tracebugtest
make validate-synth       # synthetic benchmark validation
make validate-realworld   # real-world validation (xxHash)

Conclusion

CacheTrace's simulation matches hardware trends across:

  • Synthetic benchmarks (8KB - 1MB)
  • Real applications (xxHash, jq)
  • Different access patterns (sequential, random)
  • Multiple CPU generations (6 Intel CPUs supported)