This document specifies the current implementation in this repository (version series 0.55.x) for memory_benchmark on macOS Apple Silicon.
It is intentionally implementation-driven and reflects real behavior in code paths under main.cpp, src/core, src/benchmark, src/pattern_benchmark, src/output, and src/asm.
Primary goals:
- Define runtime architecture and execution flow.
- Define command/config semantics and validation rules.
- Define memory allocation, initialization, and benchmark execution contracts.
- Define output contracts (console and JSON).
- Capture current constraints, known drift, and measurement caveats.
Out of scope:
- Generic memory-performance theory (see LATENCY_WHITEPAPER.md).
-analyze-tlbmethodology details (see TLB_ANALYSIS_WHITEPAPER.md).-analyze-core2coremethodology details (see CORE_TO_CORE_WHITEPAPER.md).- Historical behavior from older releases.
- Target OS: macOS.
- Target CPU architecture: ARM64 Apple Silicon.
- Language: C++17 with ARM64 Apple Silicon assembly kernels.
- Build:
Makefile(clang++,as). - Test framework: GoogleTest (
test_runner).
The tool is designed and tuned for Apple Silicon execution characteristics (cache hierarchy, page behavior, QoS, and assembly kernels).
Fanless Apple Silicon systems (e.g., MacBook Air M5) have limited thermal capacity and cannot sustain heavy benchmarking for extended durations.
Observed behavior:
- MacBook Air M5 typically enters Heavy thermal state after approximately 2 pattern benchmark runs.
- Thermal state transitions depend significantly on ambient environment temperature.
- Thermal throttling during benchmarking invalidates measurement reliability and comparability.
Impact:
- Repeated back-to-back benchmark executions may encounter thermal limiting.
- Measurements from throttled runs should not be compared with baseline measurements from thermal-normal conditions.
- For consistent results on fanless systems, allow sufficient idle time between consecutive benchmark runs to permit thermal cool-down.
Recommendation:
- For reliable baseline measurement on fanless systems, run single benchmark iterations with thermal cool-down intervals between runs.
- Use
caffeinate -i -d memory_benchmark ...to prevent system sleep during longer measurement sessions on cooler systems.
Main orchestration (main.cpp) follows this pipeline:
Standalone modes (-analyze-tlb, -analyze-core2core) are dispatched early and use dedicated runners.
The pipeline below applies to standard/pattern benchmark execution.
- Create high-resolution total-execution timer.
- Parse CLI arguments into
BenchmarkConfig(parse_arguments). - Validate configuration (
validate_config). - Calculate derived sizes and counts:
calculate_buffer_sizescalculate_access_countscalculate_total_allocation_bytes
- Print resolved configuration and cache info.
- Raise main thread QoS (
QOS_CLASS_USER_INTERACTIVE) best-effort. - Execute one mode:
- Standard benchmark mode (
run_all_benchmarks), which allocates/initializes buffers per phase and releases them after the phase. - Pattern benchmark mode (
run_pattern_benchmarks), which pre-allocates/initializes required buffers (allocate_all_buffers+initialize_all_buffers).
- Standard benchmark mode (
- Print loop results and aggregate statistics.
- Optionally serialize JSON output.
- Print total elapsed runtime.
Memory cleanup is RAII-based through MmapPtr custom deleters (munmap on scope exit).
Configuration state is represented by BenchmarkConfig (src/core/config/config.h).
- Main options: buffer size MB, iterations, loop count, output path, threads.
- Mode flags:
run_patterns,only_bandwidth,only_latency. - Standalone analysis flags are handled outside
BenchmarkConfigorchestration flow:-analyze-tlb-analyze-core2core
- Cache behavior: auto L1/L2 or user-provided
-cache-size. - Latency sampling:
latency_sample_count. - Latency-chain construction mode:
latency_chain_mode(typeLatencyChainMode, CLI flag-latency-chain-mode)user_specified_latency_chain_modeflag
- TLB-locality control for latency chain construction:
latency_tlb_locality_bytes(default 1024 KB)0means global random chain.
- Best-effort cache-discouraging mode:
use_non_cacheable.
- Resolved byte sizes for main and cache buffers.
- Access counts for latency paths.
- System metadata (CPU name, macOS version, core counts).
- Max memory limits and bookkeeping flags.
- Two-pass parse:
- First pass extracts
-cache-sizeearly. - Second pass parses remaining options.
- First pass extracts
- Parser may throw internally (
std::stoll/validation) but converts to return-code failures at function boundary. - Help (
-h,--help) prints usage and exits successfully. -latency-chain-modeaccepts string values and resolves toLatencyChainModeenum.-analyze-core2coreuses dedicated mode parsing (outsideargument_parser.cpp) and only allows optional-output,-count, and-latency-samples. Full methodology and JSON contract: CORE_TO_CORE_WHITEPAPER.md.
Validation rejects incompatible flag combinations and invalid value states:
-only-bandwidthand-only-latencyare mutually exclusive.-patternscannot be combined with-only-bandwidthor-only-latency.-only-bandwidthcannot be combined with-cache-sizeand cannot use latency-sample overrides.-only-latencycannot be combined with-iterationsoverride.
Zero-disabling semantics (supported only in -only-latency):
-buffersize 0disables main-memory latency path.-cache-size 0disables cache-latency path.- Both cannot be zero simultaneously in
-only-latency.
TLB-locality constraints:
- Non-zero
-latency-tlb-locality-kbmust be a multiple of system page size. - Non-zero locality window must span at least two latency-stride steps.
Latency stride constraints:
-latency-stride-bytesmust be greater than zero.- Stride must be pointer-size aligned.
Memory-limit model:
- System available memory is queried.
- Global cap uses
MEMORY_LIMIT_FACTOR(80%). - Per-main-buffer cap is mode-aware (1 or 2 main buffers, depending on active mode/phase needs).
- A second peak-concurrent allocation check validates the highest active phase footprint (main + cache paths).
- Main buffer size is derived from
buffer_size_mb. - L1/L2/custom cache test buffers use factor constants currently set to
1.0. - Cache buffers are rounded to active latency stride granularity (
latency_stride_bytes, default256) and minimum constraints. - Minimum practical lower bound includes page-size enforcement.
-cache-size 0(in allowed mode) produces zero custom cache buffer.
- Main-memory latency accesses scale from base count relative to default buffer size.
- Cache latency access counts use fixed constants (
L1,L2,CUSTOM). -buffersize 0(in allowed mode) sets main latency accesses to zero.
Allocation entrypoints:
- Standard benchmark mode: per-phase allocators in
src/benchmark/benchmark_executor.cpp(prepare_*_buffershelpers). - Pattern benchmark mode:
allocate_all_buffers(src/core/memory/buffer_allocator.cpp).
Shared allocation behavior:
- Uses
mmapanonymous private mappings (macOS-specific behavior and limits apply). - Uses mode-aware conditional allocation to avoid unused buffers.
- Performs overflow-safe byte arithmetic before allocation.
- Enforces global memory-limit checks from peak-concurrent requirements.
Allocated buffer families (conditional):
- Main bandwidth:
src,dst. - Main latency:
lat. - Cache latency:
l1/l2orcustom. - Cache bandwidth:
l1_bw_src/dst,l2_bw_src/dstorcustom_bw_src/dst.
Pattern mode intentionally allocates and uses only main source/destination buffers.
allocate_buffer_non_cacheablestill uses normal user-space mappings.- Applies
madvise(MADV_RANDOM)hints on macOS. - This is best-effort cache discouragement only; not true uncached memory.
Initialization entrypoints:
- Standard benchmark mode: per-phase initialization in
run_single_benchmark_loopbefore each measured phase. - Pattern benchmark mode:
initialize_all_buffers(src/core/memory/buffer_initializer.cpp).
Initialization semantics:
- Bandwidth buffers: deterministic source pattern + zeroed destination.
- Latency buffers: randomized pointer-chasing circular chain via
setup_latency_chain. - Allocation/initialization happen before phase timing starts and are excluded from measured benchmark durations.
setup_latency_chain (src/core/memory/memory_utils.cpp) builds pointer chains used by main/cache latency tests.
The LatencyChainMode enum (from src/core/memory/memory_utils.h) defines four explicit modes plus Auto:
Auto(0, default): Resolves to effective mode based ontlb_locality_bytes:- If
tlb_locality_bytes == 0, behaves asGlobalRandom. - If
tlb_locality_bytes > 0, behaves asRandomInBoxRandomBox.
- If
GlobalRandom: Global random permutation across entire buffer (ignores locality).RandomInBoxRandomBox: Randomize within locality windows, then randomize window order.SameRandomInBoxIncreasingBox: Locality windows grow (doubling each step), randomization within each box.DiffRandomInBoxIncreasingBox: Randomization offset by locality window step.
- Uses stride-spaced pointer slots across buffer.
- Requires at least two pointers.
- Produces a circular linked structure.
- Collects chain diagnostics (pointer count, unique pages touched, page size, stride).
tlb_locality_bytes == 0: global random permutation (mode-independent).tlb_locality_bytes > 0: randomization within locality windows, then randomized window order (specific behavior mode-dependent).
- Preserve dependent load-to-use semantics.
- Reduce prefetch predictability.
- Allow controlled locality pressure experiments.
Warmup functions (src/warmup) run before measured tests.
- Bandwidth warmups are multi-threaded and adaptive in size.
- Latency warmups are single-threaded page-touch prefault style, avoiding chain execution.
- Worker and/or single-thread warmups attempt high QoS class best-effort.
Adaptive warmup size (warmup_internal.h):
min(buffer_size, max(64MB, 10% of buffer_size)).
When the user selects or defaults to latency_chain_mode=Auto, the implementation runs two latency-chain passes to decompose memory latency into TLB-related components:
- TLB-hit pass: Chain with 16 KB locality window (within typical TLB entry footprint).
- Global-random pass: Chain with global randomization (no locality).
Results from both passes are stored in BenchmarkResults and BenchmarkStatistics:
tlb_hit_latency_ns: Average latency from 16 KB locality pass.tlb_miss_latency_ns: Average latency from global-random pass.page_walk_penalty_ns:tlb_miss_latency_ns - tlb_hit_latency_ns(approximate TLB miss cost).
These breakdown metrics are optionally serialized to JSON under main_memory.latency.auto_tlb_breakdown.
Standard mode coordinator: run_all_benchmarks -> run_single_benchmark_loop.
Per loop, conditional by flags:
- Main-memory bandwidth tests (read/write/copy).
- Cache bandwidth tests (L1/L2 or custom).
- Cache latency tests (L1/L2 or custom).
- Main-memory latency test.
Important execution semantics:
- Phase-local buffers are allocated and initialized immediately before each phase and released after the phase, reducing standard-mode peak footprint.
- Cache bandwidth uses
iterations * CACHE_ITERATIONS_MULTIPLIER(saturated) for stability. - Cache bandwidth defaults to single-thread unless user explicitly provides
-threads. - Main-memory latency headline is computed from one continuous chase pass.
- If latency samples are enabled, sample collection runs in a separate pass.
Pattern mode coordinator: run_pattern_benchmarks (src/pattern_benchmark/pattern_coordinator.cpp).
Executed pattern families:
- Sequential forward.
- Sequential reverse.
- Strided 64B.
- Strided 4096B.
- Strided 16384B.
- Strided 2MB.
- Random uniform.
Each pattern reports read/write/copy bandwidth metrics.
Implementation notes:
- Random indices generated once per pattern loop for random benchmarks.
- Large-stride patterns can be skipped when constraints invalidate execution.
- Pattern statistics are aggregated across outer loop count.
Assembly entrypoints are declared in src/asm/asm_functions.h and used by benchmark/warmup code:
- Main-memory kernels: read, write, copy (non-temporal stores).
- Cache kernels: read, write, copy (cache-focused variants).
- Latency kernel: pointer-chase loop.
- Pattern kernels: reverse (read/write/copy), strided generic (read/write/copy), strided fixed-stride variants (64B, 4096B, 16384B, 2MB — each as read/write/copy), random (read/write/copy).
- Core-to-core kernels: initiator and responder round-trip loops.
Design intent:
- Keep hot loops in ARM64 Apple Silicon assembly for predictable overhead and high throughput.
- Follow AAPCS64 conventions for register preservation and call boundaries.
- Use checksum sinks in read paths to keep loads architecturally meaningful.
Latency kernel (memory_latency_chase_asm) performs strictly dependent pointer chasing and returns terminal pointer to prevent dead-code elimination.
Core-to-core kernels:
core_to_core_initiator_round_trips_asm: Waits for responder token, sends initiator token; repeats for specified round trips.core_to_core_responder_round_trips_asm: Mirrors initiator; coordinates via shared token word.
Timing API: HighResTimer (src/core/timing/timer.*).
- Provides second and nanosecond stop methods.
- Used for both macro (test durations) and micro (latency sample windows) timing.
- Factory creation returns optional; failure is treated as fatal at call sites.
BenchmarkStatistics and pattern statistics collect vectors per metric across outer loops.
- Single loop: direct value reporting.
- Multiple loops: aggregate statistics printed and optionally serialized.
- Statistics include central tendency and percentile-oriented summaries.
For contention-prone environments, percentiles (P50/P95/P99) are more informative than mean-only interpretation.
Computed from collected value vectors:
- Average: Mean of all values.
- Median (P50): 50th percentile value.
- P90: 90th percentile value.
- P95: 95th percentile value.
- P99: 99th percentile value.
- Stddev: Standard deviation.
- Min: Minimum observed value.
- Max: Maximum observed value.
Console rendering is centralized in src/output/console and message helpers in src/output/console/messages.
Contract highlights:
- Configuration and cache info printed before execution.
- Per-loop results are printed in standard mode.
- Pattern mode prints pattern table-style sections and derived efficiency indicators.
- Aggregate statistics printed when loop count > 1.
- Errors and warnings use
Messages::error_prefix()/Messages::warning_prefix()conventions.
JSON writer API (src/output/json/json_output/json_output.cpp):
- Standard mode:
configuration,execution_time_sec,main_memory,cache,timestamp,version. - Pattern mode:
configuration,execution_time_sec,patterns,timestamp,version.
In addition to standard fields (buffer size, iterations, loop count, thread count, CPU/OS info):
latency_chain_mode(string): Resolved pointer-chain construction mode.use_latency_tlb_locality(boolean): Whether TLB-locality window is in use.latency_tlb_locality_bytes(number): TLB-locality window size in bytes.latency_tlb_locality_kb(number): TLB-locality window size in KB.
main_memory.latency.average_ns.values(array): Latency sample values per loop.main_memory.latency.average_ns.statistics(object, optional): Aggregate stats (average, median, P90, P95, P99, stddev, min, max).main_memory.latency.samples_ns.values(array, optional): Per-sample latency values.main_memory.latency.samples_ns.statistics(object, optional): Per-sample aggregate statistics.main_memory.latency.chain_diagnostics(object): Pointer chain metadata (pointer_count, unique_pages_touched, page_size_bytes, stride_bytes).main_memory.latency.auto_tlb_breakdown(object, optional): TLB breakdown whenlatency_chain_mode==Auto:tlb_hit_ns.values(array): TLB-hit latency per loop.tlb_miss_ns.values(array): Global-random latency per loop.page_walk_penalty_ns.values(array): Approximate TLB miss cost.
- Ordered JSON is used for stable key order.
- Bandwidth metrics are nested as arrays in
valueswith optionalstatistics. - Latency is nested as described in §17.2.
- Relative
-outputpaths are resolved against current working directory.
This codebase uses boundary-aware mixed error handling:
- Program orchestration and most modules: return codes (
EXIT_SUCCESS/EXIT_FAILURE). - Argument parsing internals: exceptions for parsing/validation, converted to return codes at API boundary.
- Allocation internals: null smart-pointer returns for allocation failure.
Principle: no uncaught exceptions should escape to main() control flow.
- Bandwidth and pattern bandwidth paths are parallelized by thread-count configuration.
- Latency tests are intentionally single-threaded pointer-chase measurements.
- Cache tests default to single-thread unless user overrides thread count.
- Threaded work partitioning attempts cache-line-aware chunk handling to reduce false sharing effects.
The tool can be run on active systems with concurrent workloads, but interpretation must account for scheduler and memory-system contention.
Practical caveats:
- Heavy concurrent activity can inflate tail latency and depress bandwidth.
- Tail metrics (
P95,P99) usually reveal contention more clearly than averages. - Comparing runs across time requires similar background-load conditions.
- Small buffers can become cache-dominated and hide DRAM behavior.
For high-confidence baselines, run repeated loops and analyze distributions rather than single-point values.
Recommended validation commands:
- Build:
make - Unit tests (non-integration):
make test - Integration-only:
make test-integration - Full test set:
make test-all - CLI help smoke check:
memory_benchmark -h
For narrow changes, prefer targeted gtest filters via ./test_runner --gtest_filter=....
- Program entry:
main.cpp - Config parse/validate/derive:
src/core/config/argument_parser.cppsrc/core/config/config_validator.cppsrc/core/config/buffer_calculator.cpp
- Memory allocation/init:
src/core/memory/buffer_allocator.cppsrc/core/memory/buffer_initializer.cppsrc/core/memory/memory_manager.cppsrc/core/memory/memory_utils.cpp
- Standard benchmark:
src/benchmark/benchmark_runner.cppsrc/benchmark/benchmark_executor.cppsrc/benchmark/bandwidth_tests.cppsrc/benchmark/latency_tests.cpp
- Pattern benchmark:
src/pattern_benchmark/pattern_coordinator.cppsrc/pattern_benchmark/output.cpp
- Output:
src/output/console/output_printer.cppsrc/output/json/json_output/*.cpp
- Assembly kernels:
src/asm/*.ssrc/asm/core_to_core_latency.s(core-to-core latency measurements; see CORE_TO_CORE_WHITEPAPER.md)