Psionic is only inference-ready when it can honestly serve compute products on admitted hardware classes rather than just run tensor math.
Psionic does not start by picking one upstream engine brand and then forcing every lane through that shape.
It starts by fixing:
- hardware class and backend family
- residency and memory posture
- single-node versus clustered topology
- admitted serving role
- capability, refusal, and latency publication
The runtime lane follows from that decision.
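As a rough illustration, that decision can be read as plain data fixed before any engine choice. A minimal Rust sketch, with every type and variant name invented for this example rather than taken from the repo:

```rust
// Illustrative only: none of these are the repo's real types. The point is
// that the lane decision is data fixed up front, and the engine choice is a
// function of it rather than the other way around.

#[derive(Debug, Clone, Copy)]
enum HardwareClass { Cuda, Metal, Cpu }

#[derive(Debug, Clone, Copy)]
enum Residency { DeviceResident, HostStepped, HostOnly }

#[derive(Debug, Clone, Copy)]
enum Topology { SingleNode, PipelineSharded, TensorSharded }

#[derive(Debug, Clone, Copy)]
enum ServingRole { TextGeneration, Embeddings, ValidationOnly }

#[derive(Debug)]
struct LaneDecision {
    hardware: HardwareClass,
    residency: Residency,
    topology: Topology,
    role: ServingRole,
    publishes_latency_envelope: bool, // capability/refusal/latency publication
}

fn main() {
    // One admitted lane, fixed before any engine is selected.
    let lane = LaneDecision {
        hardware: HardwareClass::Cuda,
        residency: Residency::DeviceResident,
        topology: Topology::SingleNode,
        role: ServingRole::TextGeneration,
        publishes_latency_envelope: true,
    };
    println!("{lane:?}");
}
```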
docs/HARDWARE_VALIDATION_MATRIX.md owns the minimum shipped hardware rows and
their admitted roles. This document owns the serving-runtime contract above
those rows.
If two lanes share one API surface but differ in hardware class, topology, or admitted role, they must publish different runtime truth instead of collapsing into one generic engine claim. Each admitted lane owns the following surfaces (sketched as a trait after the list):
- model load/unload lifecycle
- request execution path
- token streaming or equivalent delivery model
- KV cache lifecycle
- deterministic execution metadata
- runtime-side latency telemetry that keeps Tokio scheduling and async wait time separate from backend compute profiling
- backend capability gating
- served capability publication that keeps supported, route-required, refusal-required, and unsupported regions explicit together with context and latency envelopes
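A hedged trait sketch of that per-lane contract follows; all method names and types here are illustrative assumptions, not the repo's real API.

```rust
#![allow(dead_code)]

struct ModelHandle;
struct TokenDelivery;          // token streaming or an equivalent delivery model
struct KvCacheLease;           // explicit KV cache lifecycle ownership

/// Tokio scheduling / async wait time is reported separately from backend
/// compute time, as the telemetry bullet above requires.
struct LatencyTelemetry { queue_wait_us: u64, backend_compute_us: u64 }

enum CapabilityRegion { Supported, RouteRequired, RefusalRequired, Unsupported }

trait ServingLane {
    fn load(&mut self, artifact_path: &str) -> Result<ModelHandle, String>;
    fn unload(&mut self, model: ModelHandle);
    fn execute(&mut self, model: &ModelHandle, prompt: &str) -> Result<TokenDelivery, String>;
    fn kv_lease(&mut self, model: &ModelHandle) -> KvCacheLease;
    fn latency(&self) -> LatencyTelemetry;
    /// Supported, route-required, refusal-required, and unsupported regions
    /// published together with context and latency envelopes.
    fn published_capabilities(&self) -> Vec<(String, CapabilityRegion)>;
}
```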
- Generic OpenAI-compatible GGUF serving may expose different runtime truth per
  loaded model inside the same process. Publication must stay model-specific in
  /health, /v1/models, and response headers.
- Gemma 4 now has one published bounded Psionic lane on this checkout. The
  canonical publication doc is docs/NON_GPT_OSS_GEMMA4_PILOT.md. The first
  frozen target remains:
  - artifact = gemma4:e4b
  - family shape = dense Gemma 4
  - first claim = one bounded text-generation lane on CUDA
  - first publication bar = truthful backend, execution mode, execution engine,
    and refusal metadata on the generic OpenAI-compatible server
- The published bounded Gemma 4 lane now provides:
  - native Psionic CUDA runtime admission for dense quantized GGUF projection
    artifacts
  - backend = cuda, execution_mode = native, and execution_engine = psionic on
    /v1/chat/completions and /v1/responses on the generic OpenAI-compatible
    server
  - typed promoted-revision adoption for gemma4:e4b from the bounded Gemma
    trainer, with checkpoint-plus-export revalidation, active served-revision
    provenance, fail-closed mismatch refusal, and rollback to the last
    known-good revision
  - Gemma-native tool calling through explicit
    <|tool_call>call:<tool>{...}<tool_call|> blocks with JSON-schema-subset
    argument validation (a parsing sketch follows this list)
  - replayable /v1/responses state across assistant tool turns and replayed
    tool results
  - published multimodal_projection_mode = processor_owned for image and
    video, with explicit refusal on the current generic OpenAI surface until a
    real Gemma processor lands
  - published audio_input_mode = processor_owned with
    audio_input_parts = ["input_audio"] on dense e2b and e4b rows only, with
    explicit refusal on the current generic OpenAI surface until a real Gemma
    audio processor lands
  - bounded prompt-render, server-smoke, refusal, and repeatable CUDA
    conformance coverage for gemma4:e4b
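The tagged tool-call shape quoted above can be parsed with ordinary string scanning. A minimal sketch, assuming only the delimiter spelling shown in the list; the repo's real parser and its JSON-schema-subset validation are certainly richer than this:

```rust
// Illustrative parser for the tagged Gemma tool-call block format
// <|tool_call>call:<tool>{...}<tool_call|>, where <tool> stands for the
// tool name. Returns (tool_name, raw_args_json); validation happens elsewhere.
fn parse_tool_call(text: &str) -> Option<(String, String)> {
    let start = text.find("<|tool_call>")? + "<|tool_call>".len();
    let end = text[start..].find("<tool_call|>")? + start;
    let body = &text[start..end];              // e.g. "call:my_tool{\"x\":1}"
    let body = body.strip_prefix("call:")?;
    let brace = body.find('{')?;
    let name = body[..brace].to_string();      // tool name
    let args_json = body[brace..].to_string(); // JSON arguments, validated later
    Some((name, args_json))
}

fn main() {
    let s = "<|tool_call>call:get_weather{\"city\":\"Oslo\"}<tool_call|>";
    assert_eq!(
        parse_tool_call(s),
        Some(("get_weather".into(), "{\"city\":\"Oslo\"}".into()))
    );
}
```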
- Gemma 4 also now has one second dense validation lane outside the first
  published claim:
  - artifact class = dense 31B
  - operator hook = PSIONIC_GEMMA4_31B_PILOT_GGUF_PATH
  - validation scope = the same checked-in prompt fixture, CUDA backend truth,
    /v1/chat/completions and /v1/responses endpoint set, structured-output
    refusal posture, and processor-owned image/video refusal posture as the
    e4b lane
  - claim boundary = validation only; it does not widen the original
    gemma4:e4b publication
- Gemma 4 now also has one first sparse-family publication contract outside
  the dense first claim:
  - artifact class = sparse gemma4:26b
  - runtime contract = family_specific_placement
  - published topology = the sparse expert topology inspected from the loaded
    GGUF; current retained cluster fixtures still exercise the older
    expert_count = 64, active_expert_count = 4,
    expert_feed_forward_length = 4096 shape, so topology-value reconciliation
    remains a follow-on consistency task instead of part of this lane claim
  - publication surfaces = /health, /v1/models, routed inventory, and mesh
    management status all carry the same sparse-topology truth
  - default request posture = the local generic OpenAI-compatible server now
    admits one single-node text lane for gemma4:26b on cuda and metal without
    requiring a distributed sparse schedule
  - single-node hardware posture = one host must admit the full quantized
    sparse GGUF plus active KV cache on the chosen backend; otherwise the
    server must fail closed or the operator must admit a distributed sparse
    schedule instead (the admission rule is sketched after this list)
  - admitted request posture = once the operator admits a real
    two-or-more-node sparse schedule, the same server can execute gemma4:26b
    and publish disposition = sharded, execution_topology = tensor_sharded,
    and request-specific expert-routing proof in response headers and
    psionic_cluster_execution
  - shard lifecycle = admitting that sparse schedule now also materializes and
    caches one explicit shard artifact per expert-host assignment, publishes
    shard build cache keys and artifact digests into routed inventory,
    /health, /v1/models, and mesh management status, and reuses the same shard
    artifacts on repeated admission instead of silently pretending every
    sparse turn is cold
  - stateful locality = /v1/responses now persists one sparse placement
    binding per conversation so follow-up turns stay on the same worker and
    placement digest while that shard state remains healthy, then rebind
    cleanly if the placement changes
  - current Metal single-node closure = the native local Metal path now keeps
    all 30 / 30 FFN layers on device for the real
    gemma-4-26B-A4B-it-Q4_K_M.gguf artifact, including fused Q4_K sparse
    gate/up experts and Q5_0 dense plus sparse down projections; the first
    retained 2026-04-14 benchmark receipt moved from about 5.14 tok/s to about
    24.22 tok/s, and the follow-on device-resident decode pass moved the same
    retained prompt again to about 29.58 tok/s while keeping greedy readback
    bounded to 4 B/token and timed-request host KV materialization at 0
  - current competitive benchmark gate = the repo now also has one fail-closed
    same-host benchmark gate for this exact lane at
    scripts/release/run-gemma4-26b-competitive-benchmark.sh; the retained
    2026-04-15 receipt keeps the sparse path honest with
    native_sparse_execution = true, host_fallback_observed = false,
    sparse_ffn_backend = metal_grouped_experts, and
    router_backend = metal_router_topk_softmax, but still reports
    decode_tok_s = 30.63 for Psionic versus 102.34 on ollama and 113.79 on
    llama.cpp, so the gate currently fails for competitive throughput rather
    than sparse-receipt dishonesty; the canonical audit is
    docs/audits/2026-04-15-gemma4-26b-competitive-benchmark-gate-audit.md
  - current boundary = that pass fixes the local fallback cliff but does not
    yet close parity with ollama or llama.cpp, and the sparse 26B local lane
    still returns malformed text on the shared benchmark prompt
  - claim boundary = one local single-node text lane plus one admitted
    distributed sparse extension only; this still does not promote multimodal,
    audio, structured-output, or training claims for gemma4:26b
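The single-node admission rule referenced in that list reduces to one fail-closed capacity check. A hypothetical sketch, with all names invented here:

```rust
#![allow(dead_code)]

// Only the rule itself comes from the list above: the whole quantized sparse
// GGUF plus the active KV cache must fit on one device, or the lane fails
// closed / requires an explicitly admitted distributed sparse schedule.

enum Admission {
    SingleNode,
    NeedsDistributedSchedule,
    RefuseClosed,
}

fn admit_sparse_single_node(
    device_free_bytes: u64,
    gguf_bytes: u64,
    active_kv_bytes: u64,
    distributed_schedule_admitted: bool,
) -> Admission {
    if gguf_bytes.saturating_add(active_kv_bytes) <= device_free_bytes {
        Admission::SingleNode
    } else if distributed_schedule_admitted {
        Admission::NeedsDistributedSchedule
    } else {
        Admission::RefuseClosed // never silently spill or degrade
    }
}
```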
- The first real distributed Gemma 4 proof is now one split pipeline_sharded
  gemma4:e4b request across two CUDA machines:
  - execution target = dense gemma4:e4b
  - realized topology = pipeline_sharded
  - realized disposition = sharded
  - Metal prefix execution now keeps the split-stage layer stack on the device
    and only materializes the handoff hidden state and forwarded KV rows that
    the remote suffix actually needs
  - request publication now exposes the clustered path directly in response
    headers and response or receipt provenance instead of inferring it from a
    remote route
- The generic OpenAI-compatible server, routed inventory, and mesh management
status now also publish one family-agnostic clustered-execution summary for
every admitted model row:
  - generic execution modes = remote_whole_request, replicated, dense_split,
    sparse_expert
  - generic topology kinds = replicated, pipeline_sharded, layer_sharded,
    tensor_sharded
  - publication surfaces = /health, /v1/models, routed inventory, and mesh
    management status
  - local routed detail = routed inventory and mesh management status can also
    carry one runtime-owned cluster capability profile that explains which
    multi-machine path is actually admitted on that row
  - intent = downstream product surfaces no longer need gpt_oss-specific
    inference to tell whether a model is proxied to one remote host,
    replicated across hosts, split across hosts, or running as a sparse expert
    topology (both vocabularies are sketched as enums below)
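Written out as Rust enums, the two published vocabularies look like this; the variant spellings mirror the listed wire values, while the type names and any serialization mapping are assumptions:

```rust
#![allow(dead_code)]

/// How a model row actually executes across machines (wire values:
/// remote_whole_request, replicated, dense_split, sparse_expert).
enum ClusterExecutionMode {
    RemoteWholeRequest, // proxied whole to one remote host
    Replicated,         // same weights served from several hosts
    DenseSplit,         // dense layer stack split across hosts
    SparseExpert,       // expert-routed sparse topology
}

/// The realized multi-machine topology for a request (wire values:
/// replicated, pipeline_sharded, layer_sharded, tensor_sharded).
enum ExecutionTopology {
    Replicated,
    PipelineSharded,
    LayerSharded,
    TensorSharded,
}
```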
- The older bootstrap path still exists, but it remains explicitly classified
as remote whole-request proxying:
  - remote execution target = CUDA-backed gemma4:e4b
  - routed publication keeps remote route truth explicit on /v1/models with
    route_backend = cuda, route_execution_mode = native, and
    route_execution_engine = psionic
  - thin-client served truth stays separate and honest with
    served_backend = remote and execution_mode = proxy
  - routed remote publication now keeps the same admitted endpoint set as the
    local dense lane instead of silently narrowing back to chat-only
- psionic-runtime now also owns one bounded dense multi-device execution core
  below the served product surface:
  - admitted topology = ordered pipeline_sharded or layer_sharded execution
    over explicit layer-range shard artifacts
  - execution posture = adjacent shard runtimes prefer direct worker-to-worker
    stage handoff when both sides support it and otherwise fall back to the
    host-mediated path, with final output assembly on the last stage (handoff
    choice sketched below)
  - gemma4:e4b now has one bounded proof on top of that runtime-core path
    through runtime reports, provider receipts, and generic-server response
    headers
  - claim boundary = one bounded split dense path only; this does not by
    itself promote broader wallet-settled or sparse-family served product
    claims
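The stage-handoff preference in that list is a simple two-sided capability check. A sketch with hypothetical names:

```rust
#![allow(dead_code)]

enum Handoff {
    DirectWorkerToWorker, // both adjacent shard runtimes support the channel
    HostMediated,         // fallback path; final assembly on the last stage
}

fn choose_handoff(sender_direct: bool, receiver_direct: bool) -> Handoff {
    if sender_direct && receiver_direct {
        Handoff::DirectWorkerToWorker
    } else {
        Handoff::HostMediated
    }
}
```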
- Gemma 4 now also has one first-class Metal lane contract on the generic
  OpenAI-compatible server:
  - backend = metal
  - execution_mode = native
  - execution_engine = psionic
  - residency_mode = metal_accelerated
  - fallback_policy = refuse
  - the same /v1/chat/completions and /v1/responses publication surface as the
    admitted CUDA lane
  - no scheduler policy claim
  - current execution posture = live native single-node execution on Apple
    Silicon for the admitted local Gemma text lanes, including dense
    gemma4:e4b and sparse gemma4:26b
- The first bounded Gemma 4 claim must stay explicit about its unsupported
  regions:
  - admitted image execution on the processor-owned lane
  - admitted video execution on the processor-owned lane
  - admitted audio execution on the processor-owned audio lane
  - dense 31B
  - broader gemma4:26b support beyond the admitted single-node text lane and
    optional admitted distributed sparse extension
  - full Metal parity across Gemma artifacts and clustered roles
  - generic structured outputs
  - unquantized projection tensors on the native CUDA lane
  - full parity with llama.cpp or ollama
- That unsupported-region list is the boundary of the first published
  gemma4:e4b claim. The optional dense 31B validation lane keeps the same
  refusal posture and publication shape, but it still does not promote 31B
  into the published first claim automatically.
- The dense Gemma 4 audio lane is narrower than the family-wide multimodal
  publication:
  - e2b and e4b publish audio_input_mode = processor_owned and
    audio_input_parts = ["input_audio"]
  - 31B and sparse gemma4:26b do not publish audio capability and fail closed
    for direct input_audio parts on the current generic surface
- CPU-only debug bring-up may still be useful while the lane is under active
  development. The repo now admits Gemma 4 on CPU for bounded debug execution,
  but CPU still does not satisfy the first published Gemma 4 support claim for
  Psionic.

qwen35 is implemented_early through a native Psionic CUDA text-generation
runtime with prompt-projected image and video inputs at the HTTP layer.

- Current main also admits native local execution for the real Hugging Face
  Qwen3.5-27B-Q4_K_M.gguf artifact without a llama.cpp proxy on both CPU and
  Metal. That path now accepts the scalar qwen35.attention.head_count_kv form,
  the blk.N.ssm_dt.bias tensor naming used by the public 27B artifact, the
  official qwen35 tokenizer pre, GGUF MRoPE metadata, and the missing
  qwen35.ssm.v_head_reordered family fact by defaulting it to true.
- The current native Metal qwen35 lane is coherent and materially faster than
  the first bring-up, but it is still host-stepped and not yet in the same
  performance class as the local Ollama / llama.cpp reference. The current gap
  is execution shape, not GGUF admission.
- The qwen35 lane must publish:
  - backend = cuda
  - execution_mode = native
  - execution_engine = psionic
  - residency_mode = cuda_accelerated
  - single-request execution posture
  - no scheduler policy claim
- The qwen35 lane must also publish:
  - multimodal_projection_mode = prompt_projection_only
  - accepted projected media = image, video
  - the derived qwen35 multimodal projection config from GGUF family facts
- The first qwen35 lane supports prompt-replay response-state flows on
  /v1/responses.
- The first qwen35 lane supports image and video request projection on
  /v1/chat/completions and /v1/responses without claiming a native image or
  video encoder.
- The first qwen35 lane now supports bounded sampled decode on native CUDA
  when the request stays inside the exact candidate-only envelope (sketched
  below):
  - sampled decode or non-zero effective temperature
  - effective top_k available and <= 128
  - structured-output masking inactive
  - mirostat inactive
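That envelope check is mechanical. A minimal sketch, assuming a flat request-parameter struct that the repo does not necessarily use:

```rust
struct SampleParams {
    temperature: f32,   // > 0.0 means sampled decode rather than greedy
    top_k: Option<u32>, // must be present and <= 128 for the bounded lane
    structured_masking: bool,
    mirostat: bool,
}

fn fits_bounded_candidate_lane(p: &SampleParams) -> bool {
    let sampled = p.temperature > 0.0;
    let bounded_k = matches!(p.top_k, Some(k) if k <= 128);
    // Outside this envelope the lane falls back to explicit raw_logits
    // readback instead of silently narrowing behavior.
    sampled && bounded_k && !p.structured_masking && !p.mirostat
}

fn main() {
    let p = SampleParams {
        temperature: 0.8,
        top_k: Some(40),
        structured_masking: false,
        mirostat: false,
    };
    assert!(fits_bounded_candidate_lane(&p));
}
```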
- The runtime sampling surface now also honors min_p, typical_p, mirostat,
  mirostat_tau, mirostat_eta, and request-level repeat_last_n in addition to
  the existing sampled controls.
- The generic OpenAI-compatible qwen35 request surface now forwards top_k,
  top_p, min_p, typical_p, mirostat, mirostat_tau, mirostat_eta,
  repeat_penalty, repeat_last_n, presence_penalty, frequency_penalty, and seed
  on both /v1/chat/completions and /v1/responses.
- repeat_last_n follows the Ollama-compatible local sampler contract (window
  helper sketched below):
  - default 64
  - 0 disables the penalty lookback window
  - -1 expands the penalty window to the full available history
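A tiny helper shows the stated window semantics; the treatment of negative values other than -1 is this sketch's assumption, not part of the quoted contract:

```rust
/// repeat_last_n -> effective penalty lookback length, per the contract
/// above: default 64, 0 disables, -1 means the full available history.
/// Other negative values are treated as disabled here (an assumption).
fn penalty_window(repeat_last_n: Option<i64>, history_len: usize) -> usize {
    match repeat_last_n.unwrap_or(64) {
        0 => 0,
        -1 => history_len,
        n if n > 0 => (n as usize).min(history_len),
        _ => 0,
    }
}

fn main() {
    assert_eq!(penalty_window(None, 1000), 64);
    assert_eq!(penalty_window(Some(0), 1000), 0);
    assert_eq!(penalty_window(Some(-1), 1000), 1000);
}
```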
- min_p remains compatible with the bounded qwen35 CUDA sampled lane because
  Psionic applies it after exact top-k candidate selection on both the dense
  and bounded sampling paths. typical_p remains compatible with the bounded
  qwen35 CUDA sampled lane for the same reason.
- The local Ollama qwen3.5 runner on this checkout does not expose the full
  Psionic sampler surface through the same active path. Its live
  apples-to-apples sampler contract is the one built by
  sample.NewSampler(temperature, topK, topP, minP, seed, grammar).
- mirostat is now supported on the qwen35 runtime surface too, but it is
  still exact-via-fallback rather than fast-path. The current lane routes it
  through explicit raw_logits readback instead of the bounded candidate lane.
- Outside that envelope the qwen35 lane still falls back to explicit
  raw_logits readback instead of silently narrowing behavior.
- Native qwen35 structured outputs are now supported on two explicit paths
  (dispatch sketched below):
  - greedy, no-penalty, no-mirostat structured requests stay on
    TopKCandidates { top_k: 128 } and use exact sparse allowed-logit gather on
    candidate misses instead of dense vocab replay
  - structured requests outside that envelope still fall back to explicit
    raw_logits readback instead of silently narrowing behavior
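The two-path split reduces to one dispatch decision. A sketch whose OutputMode variants mirror the published mode strings, with everything else illustrative:

```rust
#![allow(dead_code)]

enum OutputMode {
    TopKCandidates { top_k: u32 }, // bounded lane; sparse gather on candidate miss
    RawLogits,                     // explicit dense fallback, never silent
}

fn structured_output_mode(greedy: bool, penalties_active: bool, mirostat: bool) -> OutputMode {
    if greedy && !penalties_active && !mirostat {
        OutputMode::TopKCandidates { top_k: 128 }
    } else {
        OutputMode::RawLogits
    }
}
```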
- The native structured-output path now uses tokenizer-native incremental token append caches with per-leading-char token-id buckets and replay-safe sparse fallback, so qwen35 candidate misses no longer double-advance the decode state, linearly rescan the full vocabulary on sparse schema misses, or require dense vocab replay on the bounded structured lane.
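A per-leading-char bucket index of that kind might look like the following sketch; the bucket granularity and all names are assumptions, and the real cache is tokenizer-native and incremental rather than rebuilt per request:

```rust
use std::collections::HashMap;

// Instead of linearly rescanning the whole vocabulary when a structured-schema
// miss needs allowed tokens, look up only the token ids whose decoded text
// starts with a character the grammar can accept next.
struct LeadingCharBuckets {
    buckets: HashMap<char, Vec<u32>>, // leading char -> token ids
}

impl LeadingCharBuckets {
    fn build(vocab: &[(u32, String)]) -> Self {
        let mut buckets: HashMap<char, Vec<u32>> = HashMap::new();
        for (id, text) in vocab {
            if let Some(c) = text.chars().next() {
                buckets.entry(c).or_default().push(*id);
            }
        }
        Self { buckets }
    }

    /// Candidate token ids for a set of grammar-acceptable next characters.
    fn candidates<'a>(&'a self, accepted: &[char]) -> impl Iterator<Item = u32> + 'a {
        let accepted: Vec<char> = accepted.to_vec();
        accepted
            .into_iter()
            .filter_map(move |c| self.buckets.get(&c))
            .flatten()
            .copied()
    }
}

fn main() {
    let vocab = vec![(0u32, "\"".to_string()), (1, "{".to_string()), (2, "}".to_string())];
    let idx = LeadingCharBuckets::build(&vocab);
    let cands: Vec<u32> = idx.candidates(&['{', '"']).collect();
    assert_eq!(cands, vec![1, 0]);
}
```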
- The local qwen35_cuda_bench harness now reproduces native-versus-Ollama JSON
  object and JSON schema requests too through --json-object and
  --json-schema-file, and it now writes machine-readable per-run evidence
  through --json-out. The native direct row now also keeps benchmark_class,
  load_s, per-run ttft_s, per-run itl_s, and mean TTFT / ITL fields explicit
  in the direct receipt instead of leaving those timings trapped inside the
  runtime metrics lane.
- The repo-owned direct-versus-HTTP collector for the admitted native qwen35
  CUDA lane now lives at scripts/release/qwen35_direct_vs_http_compare.py. It
  runs the native direct-engine receipt and the native psionic-openai-server
  receipt on one explicit prompt contract, keeps the two benchmark classes
  separate in the published JSON, records HTTP startup and warmup timing,
  writes one explicit concurrency ladder instead of folding runtime and server
  overhead into one number, and now requires the native direct row to pass the
  fallback-free CUDA publication gate unless an operator opts into
  compatibility receipts explicitly.
- The repo-owned sequential collector for the canonical qwen35 versus Ollama
  matrix now lives at scripts/release/run-qwen35-ollama-matrix.sh. It writes a
  combined manifest plus row reports that preserve output-token arrays,
  prompt/decode timing, qwen35 output modes, readback bytes, raw-logit
  materialization, termination classification, first-divergence evidence,
  host power-limit metadata, Psionic commit, and Ollama version. The same
  harness now forces Psionic benchmark requests onto PrefixCacheMode::Bypass
  so the canonical matrix measures raw qwen35 prompt and decode throughput
  instead of unrelated shared prefix-cache behavior.
- Structured-output throughput is still not part of the canonical
  Psionic-versus-Ollama matrix. On March 28, 2026, after adding leading-char
  token-id buckets to the structured-output append cache, removing per-rule
  name allocation from the recursive matcher memo path, and rebuilding from a
  clean isolated target, a fresh local qwen3.5:0.8b summary-schema spot check
  measured native Psionic at about 78 tok/s on the first bounded sparse-gather
  run and about 162 tok/s mean across a warmed three-repeat pass, versus local
  Ollama at about 331 tok/s. Psionic published
  qwen35_output_modes=[top_k_candidates:128,sparse_logits:2,sparse_logits:3,sparse_logits:10],
  qwen35_readback_bytes=5700, and qwen35_raw_logits=false on the sparse run.
  The later warmed repeats stayed on
  qwen35_output_modes=[top_k_candidates:128] and hit the token cap without
  materializing structured_output_value, so this remains a bounded parity note
  rather than a canonical throughput row.
- Native qwen35 tool calling is now supported on the generic OpenAI-compatible
  server surface through the bounded tagged-JSON-schema tool contract (request
  shapes sketched after this list):
  - /v1/chat/completions
    - modes none, auto, required, and named tool choice
    - request-level parallel_tool_calls
    - ordered machine-readable message.tool_calls
    - streamed delta.tool_calls with ordered per-call indexes
  - /v1/responses prompt-replay state continuation across assistant tool turns
    and replayed role = tool results
  - JSON-schema-subset argument validation
  - proxy qwen35 still fails closed for tool calling
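The admitted tool-choice surface maps naturally onto a small request-side enum. An illustrative sketch, not the repo's real request types:

```rust
#![allow(dead_code)]

/// Wire values: "none", "auto", "required", or a named tool.
enum ToolChoice {
    None,          // never emit tool calls
    Auto,          // model decides; plain assistant text stays a valid turn
    Required,      // a machine-readable tool call must be produced
    Named(String), // force one specific tool by name
}

struct ToolCallOut {
    index: u32,             // ordered per-call index, reused by streamed deltas
    name: String,
    arguments_json: String, // validated against the JSON-schema subset
}

struct ChatToolRequest {
    tool_choice: ToolChoice,
    parallel_tool_calls: bool, // request-level toggle from the list above
}
```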
- On March 31, 2026, the generic OpenAI-compatible /v1/chat/completions stream
  path stopped collapsing plain-text responses into one final SSE delta and
  now emits incremental content chunks instead. The same pass also stopped
  rejecting auto tool-mode turns when the model returned plain assistant text
  without a machine-readable tool envelope.
- That stream improvement is still bounded on native qwen35: the server now
  emits incremental SSE content deltas, but the native qwen35 backend does not
  yet publish true decoder-time token events. The current lane still
  materializes the response before chunk emission at the server surface.
- The first qwen35 lane must still fail closed for system-message image and
  video parts to stay aligned with the real template semantics.
- On March 27, 2026, the native qwen35 CUDA lane gained the captured-graph
  greedy path, fused q/k and activation kernels, and the MMVQ-backed greedy
  output-head fast path that materially raised local greedy throughput on the
  Psionic side.
- On March 30, 2026, the repo added the first retained shared RVLLM runtime
  harvest packet at docs/PSION_RVLLM_CUDA_GRAPH_POOL.md, making CUDA-graph
  hits, misses, captures, shape drift, and refusal posture machine-visible
  across both native qwen35 and native gpt_oss decode lanes instead of keeping
  that truth as lane-local benchmark folklore.
- The same runtime-harvest pass now also exposes one explicit cuBLAS warmup
  and handle-reuse packet at docs/PSION_RVLLM_CUBLAS_WARMUP.md, together with
  machine-readable psionic_cuda_startup startup evidence in
  qwen35_cuda_bench --json-out.
- On March 31, 2026, the same runtime-harvest line gained one explicit
  cublasLt plan-cache packet at docs/PSION_RVLLM_CUBLASLT_PLAN_CACHE.md,
  making admitted startup autotune status, selected-plan receipts, fallback
  shape counts, and workspace posture machine-visible for native qwen35 and
  native gpt_oss CUDA decode.
- Later on March 31, 2026, the admitted native qwen35 CUDA decode lane moved
  the device token-embedding mirror, request decode params, and initial-token
  seed inside the request-local captured graph for the argmax, top-k, and
  raw-logit output branches, and the repo added captured-lane T>1 parity tests
  for those same branches against the debug-attention reference path in
  crates/psionic-serve/src/openai_http.rs.
- The same issue-805 pass also published one explicit before/after receipt
  pair at
  fixtures/qwen35/benchmarks/qwen35_cuda_issue_805_20260331_archlinux_nongated.json
  and fixtures/qwen35/benchmarks/qwen35_cuda_issue_805_20260331_archlinux.json.
  The compatibility receipt measured about 492.08 tok/s with one graph shape
  drift per request; the admitted rerun measured about 502.39 tok/s with
  qwen35_graph_hits=27, one initial graph capture plus one matching miss,
  qwen35_graph_shape_drifts=0, qwen35_readback_bytes=224, zero host fallback
  evidence, and qwen35_attention_backends=[fa3_split_kv_f16_kv_graph@split1].
- The same pass now also retains one explicit GPU logits-selection packet at
  docs/PSION_RVLLM_GPU_LOGITS_SELECTION.md, binding the already-shipped qwen35
  and gpt-oss device-argmax / bounded-candidate lanes to explicit
  readback-byte and raw-logits fallback truth.
- The same pass now also retains one explicit sampling-loop packet at
  docs/PSION_RVLLM_SAMPLING_LOOP.md, making the admitted seeded sampler, exact
  bounded-candidate replay lane, and sparse penalty scratch-reuse path
  machine-visible instead of leaving that hot-path truth buried inside
  qwen35.rs and gpt_oss.rs.
- The same pass now also retains one explicit pre-flight bundle packet at
  docs/PSION_RVLLM_PREFLIGHT_BUNDLE.md, tying together graph capture, cuBLAS
  warmup, allocator-pool posture, kernel-cache posture, and cold-versus-warm
  startup evidence for the admitted CUDA serving lane.
- The same pass now also retains one explicit fallback-free CUDA benchmark
  gate packet at docs/PSION_RVLLM_FALLBACK_FREE_CUDA_GATE.md, making host
  fallback evidence, raw-logit materialization, graph-stability checks, and
  refusal posture machine-visible for the admitted native qwen35 greedy lane
  instead of silently publishing degraded direct-engine rows.
- The same pass now also retains one explicit paged-KV manager packet at
  docs/PSION_RVLLM_PAGED_KV_MANAGER.md, making logical page layout,
  owner-bound growth accounting, spill policy, residency movement, and refusal
  posture explicit instead of leaving the block-manager contract implicit.
- The same pass now also retains one explicit prefill/decode scheduler packet
  at docs/PSION_RVLLM_PREFILL_DECODE_SCHEDULER.md, making the admitted
  continuous-batch policy, realized scheduling classes, TTFT/ITL exposure, and
  response-header contract machine-visible instead of leaving that split
  buried across runtime receipts and HTTP glue.
- The same pass now also retains one explicit attention-backend packet at
  docs/PSION_RVLLM_ATTENTION_BACKEND.md, making the current CUDA attention
  backend selector explicit so dense f16 KV remains the default while
  turboquant KV and q8_1 output fusion stay capability-gated alternates.
- The same pass now also retains one explicit FA3-class decode-attention
  packet at docs/PSION_RVLLM_FA3_DECODE_ATTENTION_BACKEND.md, making the
  admitted qwen35 CUDA graph backend, split-KV heuristic, SM gate, and
  downgrade evidence machine-visible instead of leaving the graph decode
  kernel identity implicit.
- The same pass now also retains one explicit memory-pool packet at
  docs/PSION_RVLLM_MEMORY_POOL.md, making the exact-spec CUDA allocator pool
  real for the admitted decode lane and binding before/after allocation count,
  steady-state memory, and long-run leak posture into one explicit runtime
  packet instead of leaving allocator reuse as an unimplemented contract.
- The same pass now also retains one explicit fused-kernel shortlist packet at
  docs/PSION_RVLLM_FUSED_KERNELS.md, making the admitted qwen35 QKV/RMSNorm
  and gpt-oss selected MoE kernels explicit together with their env-gated
  disable paths instead of leaving fused-kernel ownership as broad folklore.
- The same pass now also retains one explicit KV eviction-and-reuse packet at
  docs/PSION_RVLLM_KV_EVICTION_REUSE.md, making oldest-page eviction,
  reclaimed-page reuse, predictive reuse reporting, and long-context stress
  truth explicit instead of leaving that policy as implicit page churn.
- The same pass now also retains one explicit direct-engine comparator packet
  at docs/PSION_RVLLM_DIRECT_ENGINE_COMPARATOR.md, making the admitted native
  direct row, native HTTP row, and optional direct vllm reference row explicit
  as separate benchmark classes instead of leaving runtime-versus-server
  attribution to ad hoc local notes.
- The older March 27 greedy qwen35-versus-Ollama numbers on this checkout are
  now historical only. The older harness omitted explicit Ollama greedy
  settings and therefore let Ollama use its default sampler surface instead of
  a forced greedy contract.
- On March 28, 2026, after adding a native one-row CUDA top-k candidate output
  path and routing qwen35 sampled decode through TopKCandidates { top_k }
  instead of unconditional dense-vocab readback, and after refreshing the
  local sampler surface to honor min_p, typical_p, mirostat, mirostat_tau,
  mirostat_eta, and request-level repeat_last_n, the same host now has a clean
  committed rerun with explicit divergence evidence in
  docs/QWEN35_OLLAMA_COMPARISON.md.
- The same March 28, 2026 refresh widened the Psionic-side sampler and request
  surface to include typical_p, mirostat, mirostat_tau, mirostat_eta, and
  request-level repeat_last_n, but those controls are not part of the
  canonical Psionic-versus-Ollama matrix on this checkout because the local
  ollama runner qwen3.5 path does not wire them through the same active
  sampler path.
- On March 28, 2026, after widening the shared-memory CUDA top-k fast path to
  the full bounded qwen35 envelope and then replacing the older one-row
  radix-sort route with a partitioned multi-block one-row candidate path, the
  same host now publishes the larger bounded-candidate top_k = 100 contract
  with row-strength classification instead of summary-only throughput claims.
- Later on March 28, 2026, routing the canonical top_k = 40 sampled contract
  through that same partitioned one-row selector at the inclusive threshold
  removed the prior sampled overhead cliff on the clean RTX 4080 host and
  restored Psionic throughput leadership on all four clean length-matched
  sampled rows.
- Later the same day, tuning the partitioned one-row top-k block count from 8
  to 24 on the same idle RTX 4080 host widened the sampled lead again. The
  follow-on rerun at
  fixtures/qwen35/benchmarks/qwen35_ollama_matrix_20260328_200654_archlinux-.json
  raised Psionic sampled_topk40 throughput from
  470.49, 243.56, 175.06, 108.36 tok/s to 506.49, 252.83, 179.55, 110.13
  tok/s, and raised sampled_topk100 throughput from
  446.26, 236.17, 171.24, 106.62 tok/s to 492.81, 247.57, 177.28, 109.02
  tok/s, while the row-strength classifications stayed unchanged.
- Later again on March 28, 2026, current main made the partitioned block count
  adaptive by requested top_k instead of using the same fixed value for every
  bounded sampled row. The commit-pinned rerun at
  fixtures/qwen35/benchmarks/qwen35_ollama_matrix_20260328_210428_archlinux-.json
  kept sampled_topk40 effectively flat at 505.33, 252.94, 179.42, 110.12 tok/s
  while raising sampled_topk100 again to 501.50, 250.76, 178.31, 109.42 tok/s.
- On March 29, 2026, doubling the partitioned top-k tile width by raising
  kLogitsTopKItemsPerThread from 8 to 16 produced another clean sampled gain
  on the same idle RTX 4080 host. The rerun at
  fixtures/qwen35/benchmarks/qwen35_ollama_matrix_20260329_003921_archlinux-.json
  raised sampled_topk40 to 511.51, 254.07, 180.01, 110.35 tok/s against Ollama
  337.58, 206.38, 144.10, 96.37 tok/s while keeping the same bounded-candidate
  output mode and row-strength classifications.
- Later on March 29, 2026, the wider tile changed the best small-lane
  partitioned block shape too. Retuning the top_k = 40 small profile from 24
  to 48 on the same idle RTX 4080 host produced another clean sampled gain at
  fixtures/qwen35/benchmarks/qwen35_ollama_matrix_20260329_004540_archlinux-.json,
  raising sampled_topk40 again to 519.35, 256.03, 180.99, 110.68 tok/s against
  Ollama 337.91, 206.37, 144.43, 96.55 tok/s while keeping the same
  bounded-candidate output mode and row-strength classifications.
- Later the same day, changing the partitioned one-row top-k block shape from
  128 threads x 16 items to 256 threads x 8 items produced the next full
  sampled rerun on the idle RTX 4080 host at
  fixtures/qwen35/benchmarks/qwen35_ollama_matrix_20260329_011606_archlinux-.json.
  That rerun kept clean sampled_topk40 effectively flat at
  518.74, 256.25, 181.13, 110.75 tok/s while raising sampled_topk100 to
  513.06, 253.55, 179.74, 109.98 tok/s against Ollama
  322.06, 205.61, 144.73, 97.59 tok/s, with the same bounded-candidate output
  mode and row-strength classifications.
canonical interpretation:
- raw greedy
tok/sis higher on Psionic across all four models, andqwen3.5:2bis now astrongexact-match row with matching EOS termination across all repeats - greedy
qwen3.5:0.8b,qwen3.5:4b, andqwen3.5:9bstill remainmismatched - clean sampled
top_k = 40rows stay on the bounded candidate lane, remain length-matched, and now beat Ollama on all four models even though token divergence still starts within the first few generated tokens - sampled
top_k = 100rows remainmismatched, but Psionic now still leads Ollama on all four of those rows on the clean host
- raw greedy
- Later on March 28, 2026, zeroing the per-request hybrid SSM state on request
  init removed the old qwen3.5:4b cap-hit corruption on both greedy and
  top_k = 100 sampled reruns while preserving the lead over Ollama on that
  row.
- On March 28, 2026, the same bounded qwen35 CUDA sampled lane was widened
  again to apply repeat, presence, and frequency penalties on device before
  exact top-k selection instead of forcing explicit dense raw_logits readback.
  On the local short-prompt smoke contract with temperature = 0.8,
  top_k = 40, top_p = 0.9, min_p = 0.05, repeat_penalty = 1.1,
  repeat_last_n = 64, presence_penalty = 0.2, frequency_penalty = 0.1, and
  seed = 42, the qwen35 lane stayed on
  qwen35_output_modes=[top_k_candidates:40] with qwen35_raw_logits=false
  across all four local rows and measured about 89 tok/s on qwen3.5:0.8b,
  about 121 tok/s on qwen3.5:2b, about 56 tok/s on qwen3.5:4b, and about
  43 tok/s on qwen3.5:9b.
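The ordering matters: penalties touch the full logits first, and exact top-k selection runs afterwards, which is what keeps the request on the bounded candidate lane. A host-side sketch of that order of operations, using standard presence/frequency-penalty semantics as an assumption (the admitted lane runs this in CUDA kernels, and every name here is illustrative):

```rust
use std::collections::HashMap;

fn sampled_step(
    logits: &mut [f32],
    penalty_window_ids: &[u32], // token ids inside the repeat_last_n window
    repeat_penalty: f32,
    presence_penalty: f32,
    frequency_penalty: f32,
    top_k: usize,
) -> Vec<(usize, f32)> {
    // 1. Penalties first (on device in the real lane).
    let mut counts: HashMap<u32, u32> = HashMap::new();
    for &id in penalty_window_ids {
        *counts.entry(id).or_insert(0) += 1;
    }
    for (&id, &n) in &counts {
        let l = &mut logits[id as usize];
        *l = if *l > 0.0 { *l / repeat_penalty } else { *l * repeat_penalty };
        *l -= presence_penalty;             // once per token seen
        *l -= frequency_penalty * n as f32; // scaled by occurrence count
    }
    // 2. Exact top-k selection afterwards, so only k candidates ever leave
    //    the device and the request stays off the raw_logits fallback.
    let mut indexed: Vec<(usize, f32)> = logits.iter().copied().enumerate().collect();
    indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    indexed.truncate(top_k);
    indexed
}

fn main() {
    let mut logits = vec![0.5, 2.0, -1.0, 1.5];
    let top = sampled_step(&mut logits, &[1, 1, 3], 1.1, 0.2, 0.1, 2);
    assert_eq!(top.len(), 2);
}
```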
output head for mixed
Q4_KandQ6_Kweights. GreedyArgmaxOnlydecode now routesQ6_Koutput weights throughQ8_1projection plusargmax_f32instead of falling back to the slower generic quantized matvec path. - The 9B row also fits and runs natively on this 16 GB host. The only extra benchmark requirement is operational: unload Ollama's resident GPU caches before measuring Psionic, because Ollama keeps prior model weights live in VRAM.
- The qwen35 lane is materially faster than the earlier pilot on this host,
  and the current canonical matrix is now "ahead everywhere" on raw tok/s
  while still mixed on parity quality: greedy raw tok/s is higher on all four
  models with one strong exact-match row, clean sampled top_k = 40 rows are
  ahead on all four models but only weak_length_matched_only, and the
  remaining headroom is still in greedy parity and the broader exact-match
  divergence work.
- The multi-row local comparison matrix for 0.8b, 2b, 4b, and 9b lives in
  docs/QWEN35_OLLAMA_COMPARISON.md.
The psionic.embeddings serving lane must own the following, sketched after the list:
- explicit embeddings request/response contract
- deterministic vector shape metadata
- stable model identifier
- capability reporting tied to the served product
- execution receipt fields for outputs and runtime metadata
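An illustrative request/response shape for that contract; every field name here is an assumption:

```rust
#![allow(dead_code)]

struct EmbeddingsRequest {
    model: String, // stable model identifier
    inputs: Vec<String>,
}

struct EmbeddingsResponse {
    model: String,
    dimensions: usize, // deterministic vector-shape metadata
    vectors: Vec<Vec<f32>>,
    receipt: ExecutionReceipt,
}

struct ExecutionReceipt {
    backend: String,        // e.g. "cpu" for the Phase 0 smoke path
    execution_mode: String, // truthful runtime metadata, not an engine brand
    deterministic: bool,
}
```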
Psionic now has served KV-cache support. The remaining completion bar is not "whether KV cache exists." The remaining bar is whether the runtime can publish truthful ownership, residency, reuse, and refusal behavior across host and device paths.
The architecture must support (see the sketch after this list):
- in-memory KV cache
- paged KV cache
- tiered KV cache
- concurrency-safe session ownership
- device-resident active decode state
- deferred host materialization for persistence, replay, and fallback paths
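A sketch of those axes as explicit states; the enum and struct names are invented for this example:

```rust
#![allow(dead_code)]

enum KvResidency {
    DeviceResident,   // active decode state stays on the accelerator
    HostMaterialized, // deferred copy for persistence, replay, and fallback
    Tiered,           // split across device pages and host spill
}

enum KvLayout {
    InMemory, // one contiguous region per session
    Paged,    // fixed-size logical pages with owner-bound growth
}

/// Concurrency-safe session ownership: one writer holds the lease at a time;
/// everyone else waits or is refused explicitly rather than racing.
struct KvSession {
    session_id: u64,
    write_lease: Option<u64>, // request id currently holding the lease
    layout: KvLayout,
    residency: KvResidency,
}
```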
Phase 0 is complete when Psionic can run a deterministic, CPU-backed
psionic.embeddings smoke path with truthful capability and receipt surfaces.