Source: snapshots/bench/benchmarks_compare.txt
Note: While EMEL is modular and easy to benchmark in isolation, llama.cpp's code is deeply entangled. These microbenchmarks aim for apples-to-apples comparisons but likely fall short; true benchmarks will be end-to-end once the system is complete.
ARM-first benchmark policy: maintained publication claims are ARM/AArch64 first and must be source-backed by the EMEL-owned runtime lane plus an isolated reference lane. Non-ARM, GPU, or browser-target backend rows are historical inventory or future backend scaffolding unless a milestone section explicitly names them as maintained evidence.
| Benchmark | emel.cpp ns/op | Reference | Reference ns/op | Comparison |
|---|---|---|---|---|
| batch/planner_equal | 2042.333 | llama.cpp | 6278.708 | 0.325x |
| batch/planner_seq | 2451.542 | llama.cpp | 2673.333 | 0.917x |
| batch/planner_simple | 1213.625 | llama.cpp | 2256.459 | 0.538x |
| diarization/sortformer/ami_en2002b_mix_headset_137.00_152.04_16khz_mono | 682069791.000 | onnx.sortformer.v2_1 | 543529041.000 | 1.255x (+25.5%), exact_match |
| flash_attention/aarch64/op_flash_attn_ext_decode_like | 15031.000 | llama.cpp | 25210.250 | 0.596x |
| gbnf/rule_parser_basic | 476.875 | llama.cpp | 257.375 | 1.853x |
| gbnf/rule_parser_complex | 3282.625 | llama.cpp | 1479.458 | 2.219x |
| generation/preloaded_request/lfm2_5_1_2b_thinking_q4_k_m_prompt_hello_max_tokens_1 | 402703625.000 | llama.cpp | 308659250.000 | 1.305x |
| generation/preloaded_request/lfm2_5_1_2b_thinking_q4_k_m_prompt_hello_max_tokens_10 | 592400791.000 | llama.cpp | 538869584.000 | 1.099x |
| generation/preloaded_request/lfm2_5_1_2b_thinking_q4_k_m_prompt_hello_max_tokens_100 | 2471608417.000 | llama.cpp | 2900662541.000 | 0.852x |
| generation/preloaded_request/lfm2_5_1_2b_thinking_q4_k_m_prompt_hello_max_tokens_1000 | 22331522750.000 | llama.cpp | 27152140167.000 | 0.822x |
| generation/preloaded_request/qwen3_0_6b_q8_0_prompt_hello_max_tokens_1 | 102294166.000 | llama.cpp | 136357709.000 | 0.750x |
| generation/preloaded_request/qwen3_0_6b_q8_0_prompt_hello_max_tokens_10 | 206045666.000 | llama.cpp | 254174875.000 | 0.811x |
| generation/preloaded_request/qwen3_0_6b_q8_0_prompt_hello_max_tokens_100 | 1189609833.000 | llama.cpp | 1498063125.000 | 0.794x |
| generation/preloaded_request/qwen3_0_6b_q8_0_prompt_hello_max_tokens_1000 | 15148360708.000 | llama.cpp | 18977629917.000 | 0.798x |
| kernel/aarch64/op_add | 118.625 | llama.cpp | 6258.416 | 0.019x |
| kernel/aarch64/op_cos | 1723.208 | llama.cpp | 6794.000 | 0.254x |
| kernel/aarch64/op_div | 110.958 | llama.cpp | 4979.625 | 0.022x |
| kernel/aarch64/op_dup | 103.166 | llama.cpp | 4995.792 | 0.021x |
| kernel/aarch64/op_log | 1873.666 | llama.cpp | 6701.333 | 0.280x |
| kernel/aarch64/op_mul | 111.834 | llama.cpp | 6293.917 | 0.018x |
| kernel/aarch64/op_mul_mat | 2834.208 | llama.cpp | 18014.208 | 0.157x |
| kernel/aarch64/op_sin | 1316.167 | llama.cpp | 6108.000 | 0.215x |
| kernel/aarch64/op_soft_max | 1073.833 | llama.cpp | 5972.833 | 0.180x |
| kernel/aarch64/op_sqr | 103.416 | llama.cpp | 4939.250 | 0.021x |
| kernel/aarch64/op_sqrt | 144.750 | llama.cpp | 5108.375 | 0.028x |
| kernel/aarch64/op_sub | 118.625 | llama.cpp | 6595.209 | 0.018x |
| kernel/aarch64/op_unary_exp | 1432.166 | llama.cpp | 6451.166 | 0.222x |
| kernel/aarch64/op_unary_neg | 125.833 | llama.cpp | 5438.875 | 0.023x |
| kernel/aarch64/op_unary_relu | 139.709 | llama.cpp | 5285.041 | 0.026x |
| logits/sampler_raw/vocab_128000 | 19439.792 | llama.cpp | 19685.958 | 0.987x |
| logits/sampler_raw/vocab_256000 | 37210.000 | llama.cpp | 36543.292 | 1.018x |
| logits/sampler_raw/vocab_32000 | 4796.041 | llama.cpp | 5427.334 | 0.884x |
| logits/sampler_sml/vocab_128000 | 19775.917 | llama.cpp | 18578.417 | 1.064x |
| logits/sampler_sml/vocab_256000 | 33379.292 | llama.cpp | 23628.959 | 1.413x |
| logits/sampler_sml/vocab_32000 | 4402.542 | llama.cpp | 4797.750 | 0.918x |
| logits/validator_raw/vocab_128000 | 92666.375 | llama.cpp | 90186.917 | 1.027x |
| logits/validator_raw/vocab_256000 | 185233.000 | llama.cpp | 175983.250 | 1.053x |
| logits/validator_raw/vocab_32000 | 23632.000 | llama.cpp | 23353.958 | 1.012x |
| logits/validator_sml/vocab_128000 | 97867.333 | llama.cpp | 95745.125 | 1.022x |
| logits/validator_sml/vocab_256000 | 200314.208 | llama.cpp | 193214.375 | 1.037x |
| logits/validator_sml/vocab_32000 | 24274.209 | llama.cpp | 23702.209 | 1.024x |
| memory/hybrid_full | 444.458 | llama.cpp | 34307.000 | 0.013x |
| memory/kv_full | 131.833 | llama.cpp | 33199.750 | 0.004x |
| memory/recurrent_full | 148.625 | llama.cpp | 4460.167 | 0.033x |
| text/encoders/bpe_long | 65.417 | llama.cpp | 66.333 | 0.986x |
| text/encoders/bpe_short | 58.416 | llama.cpp | 57.000 | 1.025x |
| text/encoders/fallback_long | 2432.500 | llama.cpp | 2504.917 | 0.971x |
| text/encoders/fallback_short | 61.541 | llama.cpp | 63.708 | 0.966x |
| text/encoders/plamo2_long | 7505.625 | llama.cpp | 7621.416 | 0.985x |
| text/encoders/plamo2_short | 197.541 | llama.cpp | 204.083 | 0.968x |
| text/encoders/rwkv_long | 817830.666 | llama.cpp | 817051.958 | 1.001x |
| text/encoders/rwkv_short | 56197.916 | llama.cpp | 55188.041 | 1.018x |
| text/encoders/spm_long | 3571985.041 | llama.cpp | 3530273.833 | 1.012x |
| text/encoders/spm_short | 1345.542 | llama.cpp | 1296.458 | 1.038x |
| text/encoders/ugm_long | 1365791.166 | llama.cpp | 1345215.917 | 1.015x |
| text/encoders/ugm_short | 711.459 | llama.cpp | 728.083 | 0.977x |
| text/encoders/wpm_long | 30303.667 | llama.cpp | 30499.167 | 0.994x |
| text/encoders/wpm_short | 579.958 | llama.cpp | 546.333 | 1.062x |
| text/jinja/formatter_long | 60.541 | llama.cpp | 217732.916 | 0.000x |
| text/jinja/formatter_short | 15.583 | llama.cpp | 3876.583 | 0.004x |
| text/jinja/parser_long | 65722.792 | llama.cpp | 49943.250 | 1.316x |
| text/jinja/parser_short | 982.792 | llama.cpp | 505.125 | 1.946x |
| tokenizer/full_bpe_long | 13085.167 | llama.cpp | 13487.333 | 0.970x |
| tokenizer/full_bpe_short | 304.333 | llama.cpp | 310.458 | 0.980x |
| tokenizer/full_plamo2_long | 12613.167 | llama.cpp | 12076.209 | 1.044x |
| tokenizer/full_plamo2_short | 1963.167 | llama.cpp | 1886.250 | 1.041x |
| tokenizer/full_rwkv_long | 829591.750 | llama.cpp | 811812.250 | 1.022x |
| tokenizer/full_rwkv_short | 55552.208 | llama.cpp | 54091.042 | 1.027x |
| tokenizer/full_spm_long | 3602667.042 | llama.cpp | 3540245.417 | 1.018x |
| tokenizer/full_spm_short | 1455.375 | llama.cpp | 1453.000 | 1.002x |
| tokenizer/full_ugm_long | 1365972.708 | llama.cpp | 1349505.083 | 1.012x |
| tokenizer/full_ugm_short | 2427.667 | llama.cpp | 2406.584 | 1.009x |
| tokenizer/full_wpm_long | 33084.125 | llama.cpp | 32833.125 | 1.008x |
| tokenizer/full_wpm_short | 2178.833 | llama.cpp | 2255.458 | 0.966x |
| tokenizer/preprocessor_bpe_long | 3278.208 | llama.cpp | 5181.875 | 0.633x |
| tokenizer/preprocessor_bpe_short | 132.417 | llama.cpp | 1737.375 | 0.076x |
| tokenizer/preprocessor_plamo2_long | 4133.541 | llama.cpp | 7169.542 | 0.577x |
| tokenizer/preprocessor_plamo2_short | 2506.708 | llama.cpp | 5110.208 | 0.491x |
| tokenizer/preprocessor_rwkv_long | 4179.666 | llama.cpp | 6981.542 | 0.599x |
| tokenizer/preprocessor_rwkv_short | 2475.209 | llama.cpp | 5352.417 | 0.462x |
| tokenizer/preprocessor_spm_long | 3996.833 | llama.cpp | 7463.042 | 0.536x |
| tokenizer/preprocessor_spm_short | 2610.333 | llama.cpp | 4721.791 | 0.553x |
| tokenizer/preprocessor_ugm_long | 4256.458 | llama.cpp | 7343.500 | 0.580x |
| tokenizer/preprocessor_ugm_short | 2493.334 | llama.cpp | 4917.500 | 0.507x |
| tokenizer/preprocessor_wpm_long | 3994.042 | llama.cpp | 7210.958 | 0.554x |
| tokenizer/preprocessor_wpm_short | 2533.833 | llama.cpp | 5622.666 | 0.451x |
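
The Comparison column appears to be the plain ratio of the two ns/op columns (emel.cpp ÷ reference), so values below 1.0 mean emel.cpp is faster. A minimal sketch of that arithmetic follows; the helper name is hypothetical, not part of the bench tooling.

```cpp
// Minimal sketch of the Comparison column, assuming it is simply
// emel ns/op divided by reference ns/op. Helper name is hypothetical.
#include <cstdio>

static double comparison_ratio(double emel_ns_per_op, double reference_ns_per_op) {
    return emel_ns_per_op / reference_ns_per_op;
}

int main() {
    // batch/planner_equal row: 2042.333 / 6278.708 ~= 0.325x (emel.cpp ~3x faster).
    std::printf("%.3fx\n", comparison_ratio(2042.333, 6278.708));
    // diarization row: 682069791 / 543529041 ~= 1.255x (+25.5% slower than reference).
    std::printf("%.3fx\n", comparison_ratio(682069791.0, 543529041.0));
    return 0;
}
```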
- Source snapshot: `snapshots/bench/benchmarks_compare.txt`
- Maintained benchmark suite: `diarization_sortformer`.
- Supported maintained model: `tests/models/diar_streaming_sortformer_4spk-v2.1.gguf` (`diar_streaming_sortformer_4spk_v2_1_gguf`).
- Canonical proof fixture: `tests/fixtures/diarization/ami_en2002b_mix_headset_137.00_152.04_16khz_mono.wav` (`ami_en2002b_mix_headset_137.00_152.04_16khz_mono`).
- Maintained fixture provenance: source=ami ihm/test path=EN2002b.Mix-Headset.wav window=137.00-152.04s proof_status=maintained_loader_real_audio
- Input contract: real 15.04-second mono 16 kHz audio, maintained GGUF loader, maintained request frontend (1504 x 128 log-mel features), maintained pre-encode stack (188 x 512 encoder frames), then the maintained executor/probability/segment pipeline (see the shape sketch after this list).
- Current exact output: `output_dim=17` segments, `output_checksum=4249677247906920305`.
- Benchmark reference note: proof_status=onnxruntime_cpu thread_contract=intra_op=1 inter_op=1 execution_mode=sequential benchmark_config=iterations:1,runs:15,warmup_iterations:1,warmup_runs:3 actual_providers=CPUExecutionProvider feature_contract=emel_maintained_features onnx_model=build/onnx_ref/diar_streaming_sortformer_4spk-v2.1.onnx onnx_sha256=5df5e883c8dae4e0ecba77739f3db38997c2ae57153de2583d625afb6abb2be0 output_contract=onnx_output_probabilities onnx_repo=ooobo/diar_streaming_sortformer_4spk-v2.1-onnx
- Current publication caveat: the ONNX row is the CPU single-thread benchmark reference. PyTorch/NeMo remains the independent parity reference, and the ONNX row becomes a closeout target only after it exact-matches that parity lane.
- Table row: the primary steady-state case is included in the benchmark table above with its real reference timing and proof status.
- Publication workflow: the deterministic compare artifacts now live under `scripts/bench_diarization_compare.sh`, which writes `raw/emel.jsonl`, `raw/reference.jsonl`, optional `raw/onnx_reference.jsonl`, and `compare_summary.json` using `diarization_compare/v1` / `diarization_compare_summary/v1`.
- PyTorch/NeMo parity reference lane: when supplied with `--pytorch-reference-model`, the workflow emits a distinct `pytorch.nemo.sortformer.v2_1` backend using the official model-card `SortformerEncLabelModel.diarize(..., include_tensor_outputs=True)` path on the maintained WAV fixture. The reproducible environment is synced with `uv` from `tools/bench/diarization_pytorch_reference_requirements.txt` via `scripts/setup_diarization_pytorch_ref_env.sh`. Timing excludes one-time model load.
- ONNX benchmark reference lane: when supplied with `--onnx-reference-model`, the workflow emits a distinct `onnx.sortformer.v2_1` backend using ONNX Runtime CPU single-thread (`intra_op=1`, `inter_op=1`, `ORT_SEQUENTIAL`) and the maintained feature tensor from the same fixture (see the session-options sketch after this list). Generated records include `actual_providers` from ONNX Runtime. Missing ONNX dependencies or artifacts are hard failures, not recorded-baseline fallbacks. ONNX output must be cross-checked against the PyTorch parity lane before it is treated as a correctness target.
- Optimization contract: one-time setup costs are secondary evidence. The primary benchmark claim remains the steady-state maintained pipeline lane on exact hardware.
- Rejected or deferred candidates:
- Tool-local synthetic contracts/PCM fixtures and reference-lane dependencies remain rejected.
- Full transformer dense/matmul kernelization remains future kernel-contract work.
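
As referenced in the input-contract bullet above, here is a compile-time sketch of the shape arithmetic. It assumes a 10 ms mel hop (100 frames/s) and an 8x encoder subsampling factor; both are inferred from the published numbers, not stated by the contract.

```cpp
// Compile-time sanity check of the input-contract shapes (C++17).
// ASSUMPTIONS (not stated in the contract): 10 ms mel hop, 8x subsampling.
#include <cstddef>

constexpr std::size_t kAudioCentiSec = 1504;  // 15.04 s fixture, in 10 ms units
constexpr std::size_t kSampleRate    = 16000; // mono 16 kHz
constexpr std::size_t kMelFrames     = 1504;  // request frontend: 1504 x 128
constexpr std::size_t kMelBins       = 128;
constexpr std::size_t kEncFrames     = 188;   // pre-encode stack: 188 x 512
constexpr std::size_t kEncDim        = 512;

// 15.04 s * 16000 Hz = 240640 raw samples.
static_assert(kAudioCentiSec * (kSampleRate / 100) == 240640);
// One mel frame per 10 ms (assumed hop) -> 1504 frames over 15.04 s.
static_assert(kMelFrames == kAudioCentiSec);
// 1504 / 188 = 8 -> assumed 8x subsampling from mel frames to encoder frames.
static_assert(kMelFrames % kEncFrames == 0 && kMelFrames / kEncFrames == 8);
// Feature widths as published: 128 mel bins in, 512-dim encoder frames out.
static_assert(kMelBins == 128 && kEncDim == 512);
```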
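As referenced in the ONNX benchmark reference lane bullet, here is a sketch of the recorded thread contract expressed through the ONNX Runtime C++ API. The real harness is driven by `scripts/bench_diarization_compare.sh`; this is illustrative only.

```cpp
// Sketch of the ONNX reference lane's thread contract, assuming the
// standard ONNX Runtime C++ API (onnxruntime_cxx_api.h).
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "diar_bench");

    Ort::SessionOptions so;
    so.SetIntraOpNumThreads(1);                          // intra_op=1
    so.SetInterOpNumThreads(1);                          // inter_op=1
    so.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);  // execution_mode=sequential

    // With no other provider appended, ONNX Runtime falls back to
    // CPUExecutionProvider, matching actual_providers in the recorded note.
    Ort::Session session(env, "build/onnx_ref/diar_streaming_sortformer_4spk-v2.1.onnx", so);
    return 0;
}
```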