Releases: zemo-g/rail
v5.1.0 — Rail emits its own GPU kernels
Rail now generates Metal Shading Language from its op-DAG, JIT-compiles it via Metal's newLibraryWithSource:, and dispatches the kernel at runtime. Every kernel the GPU executes is emitted by an attested Rail binary — the substrate piece needed for end-to-end attested GPU training.
This release bundles the full GPU substrate the auto-emission pipeline rests on: per-op Metal kernels, the bf16 numerics regime that unlocks stable 10k-step training, the JIT compile foundation, two hand-fused kernels, and the DAG matcher + emitter that drive them.
Auto-emission pipeline
| Module | Role |
|---|---|
| stdlib/jit_node.rail | JIT op-DAG types (JitNode, TracedTensor) + tape primitives + jit_nth. Pure-DAG module — no Tensor/transformer dependency, so codegen consumers import without pulling the training stack. |
| stdlib/jit_tape.rail | Execution tracers (jit_leaf, traced_rmsnorm, traced_matmul) layered on jit_node. |
| stdlib/jit_match.rail | DAG matcher. walk_tape returns a list of FuseMatch records (FuseRmsQKV, FuseSiluHad) in tape order, identifying subgraphs that fit known fusion shapes. |
| stdlib/jit_emit.rail | MSL emitter. emit_msl_for_match (tape, FuseMatch) returns the kernel source. Stubbed against known patterns today; v5.2+ replaces with shape-parameterized codegen driven by JitNode data. |
| stdlib/jit.rail | compile/dispatch shim. No longer owns MSL text; jit_compile_* pulls strings from jit_emit. External API unchanged for existing consumers. |
Hand-fused Metal kernels
| Kernel | Fusion | Speedup |
|---|---|---|
| fused_rmsnorm_qkv | RMSNorm + 3 matmul (Q/K/V) in one threadgroup-per-row dispatch. Exports Q \| K \| V \| rstd \| LN1 from a single packed buffer; rstd + LN1 kept for backward. 4 dispatches → 1. | 35× over the per-op chain at seq=512, d=64 |
| fused_silu_hadamard | SiLU(gate) * up in one elementwise dispatch. Exports h_act \| sigmoid(gate); sigmoid kept for backward. | 18× over the per-op chain |
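As a plain reference for what the two fused kernels compute per row, here is a minimal pure-Python sketch. It is not the Metal source: the function names, the `eps` value, the absence of a learned RMSNorm gain, and the row-major `w[i][j]` weight layout are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    # Numerically stable form: never exponentiates a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def silu_hadamard(gate, up):
    """Return (h_act, sig) with h_act[i] = SiLU(gate[i]) * up[i].

    sig is returned alongside because the backward pass reuses it.
    """
    sig = [sigmoid(g) for g in gate]
    h_act = [g * s * u for g, s, u in zip(gate, sig, up)]
    return h_act, sig

def rmsnorm_qkv(x, wq, wk, wv, eps=1e-6):
    """Normalize one row x, then project it three times.

    Returns (q, k, v, rstd, ln1); rstd and ln1 are kept for backward.
    Assumption: no learned gain, weights indexed as w[input][output].
    """
    mean_sq = sum(v * v for v in x) / len(x)
    rstd = 1.0 / math.sqrt(mean_sq + eps)
    ln1 = [v * rstd for v in x]

    def project(w):
        return [sum(ln1[i] * w[i][j] for i in range(len(ln1)))
                for j in range(len(w[0]))]

    return project(wq), project(wk), project(wv), rstd, ln1
```

Run per op, this is one normalization plus three projections (four dispatches) or two elementwise passes; the fused kernels produce the same outputs in a single dispatch each.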
JIT compile foundation
- tgl_jit_compile_from_tmp_file — Metal's newLibraryWithSource: driven from a Rail-emitted .metal file. Returns a pipeline ID cached in g_jit_pipes for reuse across steps.
- tgl_jit_dispatch_1in1out / _2in1out / _rmsnorm_qkv / _silu_hadamard — per-pattern dispatchers with f64↔f32 host staging.
Per-op GPU kernels
tgl_rmsnorm_save_f64 (1.8× over CPU), tgl_rope_apply_f64 (7×), tgl_silu_fwd_f64 (19×) at training shapes.
Per-op wins translated to ~2% per-step — confirming fusion, not per-op throughput, is the real ceiling, which is why the JIT pipeline above is the load-bearing thesis.
bf16 numerics regime
tgl_matmul_bf16 + matmul_bf16 wrapper. bf16 has f32's exponent range, sidestepping fp16's step-2759 NaN cliff. Training scripts default to forward bf16 with f64 on embedding + LM-head + backward.
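The exponent-range point can be checked with the standard library alone. The sketch below is illustrative only: it converts to bf16 by truncating a float32 to its top 16 bits (real kernels usually round to nearest), and the helper names are hypothetical. It does not reproduce the step-2759 failure; it only shows how a value that overflows fp16 stays finite in bf16.

```python
import math
import struct

def to_bf16(x):
    """Round-trip x through bfloat16 by keeping the top 16 bits of its float32 form."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def to_fp16(x):
    """Round-trip x through IEEE half precision; out-of-range values become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range.
        return math.copysign(math.inf, x)

big = 70000.0                  # above the largest finite fp16 value (65504)
print(to_fp16(big))            # inf
print(math.isfinite(to_bf16(big)))
```

bf16 pays for the range with precision: it keeps 8 exponent bits and 7 mantissa bits, where fp16 keeps 5 and 10.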
Training scripts (chunked-corpus sampler, 2-block d=64)
- tools/train/lm_v3_chunked_bf16_full_long.rail — bf16 forward, 10k-step stable, ~40% wall under f64 baseline.
- tools/train/lm_v3_chunked_jit_long.rail — Q/K/V + SwiGLU through fused JIT'd kernels. 200-step matched-seed pilot vs bf16 baseline: trajectory shape preserved, no NaN, both converge. 2.85% step-throughput improvement (3×3 alternating runs, seq=512 d=64 d_ff=192).
- tools/train/lm_v3_chunked_fp16_attn_f64_long.rail — falsification experiment ruling out attention as the fp16 culprit.
Tests + benches
7 JIT/GPU smoke tests + 2 benches added. Numerical parity verified: fused QKV max diff 1.67e-6 (f32 floor), silu-hadamard 1.5e-8. DAG matcher 5/5, MSL emitter 4/4, block-integration rstd/ln1/sigmoid parity all green.
Stability
- 140/140 compiler test suite still green.
- 2-pass byte-identical self-bootstrap unchanged — this release adds stdlib + foreign decls + Metal sources; the compiler core is untouched.
Full detail in CHANGELOG.md.
v5.0.2 — Attestation pipeline goes fully pure-Rail
Patch release. The first Rail release attested end-to-end through the Rail substrate — no curl, no shasum, no Python anywhere in the attestation path.
Fixes
- stdlib/file.rail: foreign fopen path mode -> int (084791f). fopen returns a file descriptor, not a FILE*; declaring it as ptr bypassed Rail's tagging and tripped a polymorphic untag for odd fds (3 → 1, i.e. stdout). Corrected the foreign return type.
- runtime _fopen: unwrap path + mode before open(2) (95d81de). Wrapped Rail strings carry their ptr at the heap header, so open() saw the header tag byte as the path. _fopen now calls _str_unwrap, matching the _rail_read_file contract. Closes the argv-vs-literal path bug.
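The 3 → 1 corruption in the first fix can be modeled in a few lines. This is a sketch under assumptions, not Rail's runtime: it assumes integers are tagged as `(n << 1) | 1` and that the polymorphic untag shifts any word whose low bit is set while leaving even words (pointers) alone. That model is consistent with "odd fds (3 → 1)" but is not confirmed by the notes.

```python
def tag(n):
    # Assumed encoding: low bit 1 marks a tagged integer.
    return (n << 1) | 1

def poly_untag(word):
    # Assumed polymorphic untag: odd words are treated as tagged ints,
    # even words are treated as raw pointers and passed through.
    return word >> 1 if word & 1 else word

# Declared as int: the fd is tagged on return, so the round trip is exact.
assert poly_untag(tag(3)) == 3

# Declared as ptr: the raw fd skips tagging, so an odd fd is mangled.
print(poly_untag(3))   # 1, which is stdout
print(poly_untag(4))   # 4, even fds happen to survive
```

Under this model the bug only bites on odd descriptors, which would explain why it could go unnoticed when the first opened file landed on an even fd.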
Attestation — retired the shell escape hatches (bbda5dd): attest.sh, sign_attestation.sh, and publish.sh deleted; release_index.rail replaces the Python heredoc in attest_release.sh. attest.rail + publish.rail are now canonical.
Stability — new seed rail_native (3b89d0f5) is at the 2-pass byte-identical fixed point.
Full detail in CHANGELOG.md.
v5.0.1 — Attestation hygiene + codegen tightening
Patch release. No new features.
Codegen — closed the half-applied ARM64 compile-fixes patch (emit_x1 large-immediate path, emit_x1 global-V fallback, O-handler RHS exclusion, ?-handler global-V exclusion). Self-compile fixed point and test suite unchanged from the v5.0.0 baseline.
Attestation hygiene
- Backfilled v4.0.0 / v4.0.1 / v4.1.0 release attestations (previously tagged-but-unattested).
- .gitignore whitelist for releases/**/rail_native.attestation.json so a new release can't silently lose that file to the rail_native.* wildcard rule.
- New docs/RELEASES.md operational runbook + 6 known gotchas.
Known limitation — attest.rail was still blocked on the ftell FFI bug, so attestation in this release (including its own artifacts) still used the documented tools/attest/attest.sh shell escape hatch. Closed in v5.0.2.
Full detail in CHANGELOG.md.
v5.0.0 — Self-hosted toolchain (Linux ELF substrate)
Rail produces its own aarch64 Linux ELF binaries. Encoder, assembler, static linker, and ELF writer are pure Rail. For the supported subset of inputs, the build pipeline invokes no external as, ld, or codesign.
What ships
| Module | Lines | Role |
|---|---|---|
| jit/arm64.rail | +200 | 23 new encoders for the Linux mnemonic set (ldrb/strb imm/reg-offset/post-index, clz, neg, cmn imm+reg, rev, rev16, fneg, frinta, fcvt s_d / d_s, tbnz, stp/ldp pre/post-index, add/sub imm, asr/lsr/lsl imm). 31/31 byte-verified against as. |
| stdlib/elf.rail | 175 | Elf64 writer for static aarch64 binaries (one PT_LOAD for tiny, three for full text+data+bss-via-memsz). |
| tools/v5/elf_asm.rail | 567 | Section-aware ARM64 assembler + static linker. Two-pass: pass 1 builds (name, section, offset) label table; layout resolves to vaddrs; pass 2 emits bytes with adrp/:lo12: symbol resolution. Handles .text/.data/.bss/.rodata/.section __DATA,__mod_init_func (skipped), .quad/.byte/.long/.ascii/.asciz/.space/.comm/.p2align/.align, plus writeback stp/ldp variants and the adrp+add :lo12: symbol-load idiom. |
| tools/v5/compile_elf_full.rail | 80 | Driver: source .s → 3-pass pipeline → multi-segment ELF. Patches e_entry to _start. |
End-to-end verified on aarch64 Linux (Pi Zero 2 W)
| Program | ELF size | Result |
|---|---|---|
| exit42_linux.s | 132 B | exit 42 |
| fib_linux.s | 204 B | exit(fib(10)) = 55 |
| hello_linux.s | 4105 B | prints "v5 lives\n", exit 9 (adrp + add :lo12: + write syscall) |
| bss_test_linux.s | 4096 B | BSS counter loop, exit 7 |
Each binary's .text bytes are byte-equivalent to canonical as + ld output. Pipeline invokes neither external assembler nor linker.
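The 132 B figure for exit42_linux.s is consistent with a 64-byte Elf64 header, one 56-byte program header, and three 4-byte instructions, although the notes do not spell out that breakdown. The sketch below builds such an image with Python's `struct` to make the arithmetic concrete. The load address, segment flags, alignment, and the exact three-instruction body are assumptions; this is not the output of stdlib/elf.rail.

```python
import struct

EM_AARCH64 = 183
BASE = 0x400000                 # hypothetical load address
EHDR_SIZE, PHDR_SIZE = 64, 56   # fixed sizes for Elf64

# Assumed program body: exit(42) via the aarch64 Linux exit syscall (93).
code = struct.pack(
    "<3I",
    0xD2800540,  # mov x0, #42
    0xD2800BA8,  # mov x8, #93
    0xD4000001,  # svc #0
)
file_size = EHDR_SIZE + PHDR_SIZE + len(code)

# e_ident: magic, ELFCLASS64, little-endian, version 1, SysV ABI, padding.
e_ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
ehdr = struct.pack(
    "<16sHHIQQQIHHHHHH",
    e_ident,
    2,                               # e_type = ET_EXEC
    EM_AARCH64,                      # e_machine
    1,                               # e_version
    BASE + EHDR_SIZE + PHDR_SIZE,    # e_entry: first instruction
    EHDR_SIZE,                       # e_phoff: phdr follows the header
    0, 0,                            # e_shoff, e_flags
    EHDR_SIZE, PHDR_SIZE, 1,         # e_ehsize, e_phentsize, e_phnum
    0, 0, 0,                         # no section headers
)
# Single PT_LOAD (type 1), flags R+X (5), mapping the whole file.
phdr = struct.pack("<IIQQQQQQ", 1, 5, 0, BASE, BASE,
                   file_size, file_size, 0x1000)

image = ehdr + phdr + code
print(len(image))   # 132
```

The same accounting gives 204 B for fib_linux.s if its body is 21 instructions (120 + 84), again as an inference rather than a stated fact.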
compile.rail Linux pipeline fixes (precursor)
Two long-standing bugs in build_linux fixed: duplicate-symbol awk strip ran only on macOS-cross (now runs on Linux too, list broadened to cover the _rail_* runtime helpers that linux_libc.s redefines), and the macOS-only .section __DATA,__mod_init_func block is now stripped. tools/linux_libc.s gains _memcpy and _fmod — the two libSystem references with no Linux-side definition.
Tag-readiness checklist
- compile.rail's real Linux output traverses the new pipeline → byte-equivalent ELF
- Linux ELF substrate verified on aarch64 hardware
- 176 encoders byte-verified against as (89 + 56 + 31)
- No regression on ./rail_native test (136/140; 4 pre-existing tensor failures)
- No regression on ./rail_native self byte-identical fixed point
- CHANGELOG.md v5.0.0 entry
- Leak guard CI green
Deferred
- Pi self-host of ./rail_native test via Rail-only build — current GC layout reserves 1.2 GB BSS, exceeding Pi Zero 2 W RAM. Heap-size knob is v5.1 scope.
- macOS Mach-O end-to-end with dyld stubs — Phase 4b covered the libSystem-free subset; stub-aware Mach-O (LC_LOAD_DYLIB, indirect symbol table, __stubs, __got, bind opcodes) is ~1500 more lines. Tracked as v5.2.
v4.1.0 — Repo hygiene + leak-guard CI
Minor release. Comprehensive cleanup pass over the public tree. No
compiled-binary change; no language or stdlib changes.
CI + leak-prevention (B1)
- New workflow .github/workflows/leak-guard.yml — every push and PR
  is grep-scanned for the operator-recon pattern set (Tailscale IPs,
  internal SSH targets, home-directory paths, internal Slack channel
  IDs). Fails the build on any hit. Per-line opt-out via the comment
  marker leak-guard-allow. CHANGELOG.md and the guard file are
  excluded.
- ci.yml triggers extended to include next branch and v* tags.
  Test-count assertion generalised from hardcoded 137/137 to any
  matching N/N (master is 137, next is 140, future may grow).
- .gitignore — explicit ignores for .mcp.json, .ledatic/,
  .fleet/, *.pre-*. Closes the casual git add recurrence path
  for the v4.0.1 leak class.
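The leak-guard idea (pattern scan with a per-line opt-out marker) can be sketched as follows. The real workflow is a grep step whose pattern set is not reproduced here; the two regexes below are illustrative stand-ins, one for the 100.64.0.0/10 CGNAT range that Tailscale addresses fall in and one for home-directory paths. Only the marker string leak-guard-allow comes from the notes.

```python
import re

# Illustrative patterns only; the workflow's actual pattern set is not shown.
PATTERNS = {
    "tailscale_ip": re.compile(
        r"\b100\.(?:6[4-9]|[7-9]\d|1[01]\d|12[0-7])\.\d{1,3}\.\d{1,3}\b"
    ),
    "home_dir": re.compile(r"/(?:Users|home)/[A-Za-z0-9_.-]+/"),
}
ALLOW_MARKER = "leak-guard-allow"

def scan(lines):
    """Return (line_number, pattern_name) for every hit not opted out."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if ALLOW_MARKER in line:
            continue  # per-line opt-out
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits

sample = [
    "ssh deploy@100.64.0.1",
    "path = /home/alice/x  # leak-guard-allow",
    "ping 10.0.0.1",
    "cd /Users/bob/src",
]
print(scan(sample))   # [(1, 'tailscale_ip'), (4, 'home_dir')]
```

A CI step would fail the build when the returned list is non-empty.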
Branch hygiene (B2)
21 remote branches deleted from origin:
- 18 feat/* branches fully merged into next (security A/B/C lanes,
  x86 conformance harness, x86 runtime extensions, JIT fixes, docs
  refresh, auto-deploy, punch-list integration).
- jit (merged into next).
- track-mhd-kernel (merged into master).
- history-scrub-prep-2026-05-12 (unused experimental branch).

Remaining: master, next, half-s2-kernels (open compiler work),
compound/exp-008-bytes_to_str (halted POC artifact). Down from 26
branches to 4.
Doc pruning (B3)
~104 operator session-handoff files removed from the public tree:
- docs/plans/ (74 files) — operator session-planning notes
  (SESSION_HANDOFF_, PROMPT_SESSION_, WEEK_PLAN_, PHASE_, etc.).
- notes/ orphan files (12).
- docs/handoffs/ orphans (8).
- jit/ operator notes (9) — SCRATCH, CONTINUATION, SESSION_PROMPT*,
  AGENT_DRY_RUN, NEXT_STAGES, closures, floats.
- SECURITY_HANDOFF.md — internal Fort Knox punch list (the public
  policy lives in SECURITY.md).

Kept: docs referenced from CHANGELOG (notes/bootstrap_convergence_audit_*,
notes/phase3_external_pilot_pitch_v0); jit/ code + README + CHANGELOG;
docs/sessions/ versioned handoffs (CHANGELOG-linked).
Dead-code pruning (B4)
- Deleted tools/autocatalyst_v4.rail (broken — referenced runtime/llm.o
  which never landed in-tree, flywheel-v1 artifact).
- Deleted tools/ac_dashboard.rail (orphan, flywheel dashboard).
- Removed Razer3070 live-path references (decommissioned 2026-04-17):
  - tools/apps/control.rail — Razer fleet row + curl status segment.
  - tools/fleet/fleet_display.rail — razer_status/razer_iter/razer_max/
    razer_ping/razer_loss + RAZER row in the SPI-LCD render.
  - tools/mcp/rail_mcp.rail — tool_fleet_status no longer SSHes for
    nvidia-smi / v6_train.log; description updated.
  - tools/compile.rail — compile_x86 fallback message no longer
    recommends scp-to-Razer; suggests cross-tools or native host.
    Byte-identical bootstrap preserved.
  - CLAUDE.md target list: 'Linux x86_64 (Razer WSL)' →
    'Linux x86_64 (cross-compile)'.
Structure pass (B5)
- Deleted 7 docs with no CHANGELOG or code references:
  RAIL_ENGINEER_PROMPT.md, flywheel-data-quality.md,
  flywheel-world-research.md, cascade-training.md,
  rail-plasma.md, railgpt-from-scratch.md,
  self-improving-playbook.md.
- Flattened docs/handoffs/ (down to a single entry after B3 prune):
  docs/handoffs/2026-05-02.md → docs/handoff-2026-05-02.md.
README polish (B6)
- Badge: v3.0.0 → v4.0.0; tagline → "Substrate maturity".
- Intro paragraph adds the v4.0.0 substrate-maturity lede (dual-backend
  parity, JIT in Rail, 30/30 hard-bench, multi-witness attest).
- New Releases section entry for v4.0.0 + a v4.0.1 sanitization note.
- History table extended: 7 new rows spanning v3.7.0 → v4.0.1
  (previously jumped from v3.0.0 to v2.23.0).
Verification
- Leak guard: 0 hits across tracked files for the union pattern set.
- Test suite: 140/140 on the v4.1.0 tree (modulo the documented
  /tmp/rail_out orphan-process collision when run concurrently with
  another rail_native test).
- git push on next: clean fast-forward; tag v4.1.0 cuts at 6 commits
  past v4.0.1, all CI-green via the new workflow.
v4.0.1 — Public-surface sanitization
Patch release. Removes operator-specific infrastructure strings from the
public tree: Tailscale IPs, SSH usernames, home-directory paths, internal
Slack channel IDs, and a stray operator MCP config. No behavior change;
the compiled binary is identical to v4.0.0.
What was scrubbed (~110 files)
- Hard SSH targets in tools/attest/*.sh, tools/fleet/*.sh,
  tools/fleet/fleet_display.rail, tools/apps/control.rail — replaced
  with <witness-user>@<witness-host> / <peer-user>@<peer-host>
  placeholders. Callers must supply real values via environment.
- Tailscale IPs (100.87.231.45, 100.79.50.108, 100.120.203.70,
  100.109.107.54, 100.109.63.37) replaced with role placeholders
  (<witness-tailscale-ip> etc.). Tailscale CGNAT-range addresses aren't
  reachable from the public internet, but they were operational recon.
- Home-directory paths (/Users/ledaticempire/, /Users/user/,
  /home/zemog/) replaced with ~/ or <HOME> placeholders across
  source, docs, docs/plans/, training fixtures, and Objective-C dispatchers.
- Operator service files — tools/fleet/witness.service,
  tools/fleet/witness_push.service, tools/fleet/com.ledatic.*.plist
  renamed to *.example with <user> / <HOME> placeholders. Existing
  install scripts already substitute these at install time.
- Operator MCP config — .mcp.json removed from the tree. It was an
  operator's Claude Code MCP wiring (path to tools/mcp/rail_mcp.py),
  not a build artifact; the MCP server still runs locally with a
  per-user .mcp.json outside the repo.
- Slack channel IDs / DM names in CHANGELOG.md, README.md,
  stdlib/slack_client.rail docblock, docs/sessions/HANDOFF_v3_6.md —
  D0ATHQ1BQD7 and brockbro2 replaced with <DM_CHANNEL_ID> and
  <test-dm>. Slack IDs don't grant access on their own, but these
  were the only remaining specific-channel references in the public surface.
What was intentionally NOT scrubbed
- reillygomez13@icloud.com in tools/deploy/gen_*.rail — public
  contact email rendered onto ledatic.org pages; meant to be public.
- Commit messages in the v4.0.0 surface — rewriting history would break
  existing clones for a topology-recon leak, not a credential leak.
  The forward tree is clean; git history retains the originals.
- ~/.ledatic/ path convention — generic project-named subdirectory,
  not operator-specific.
Verification
git grep -E "100\.(87|79|109|120)\.|zemog@|user@100|reillygomez@|\
ledaticempire@|/Users/ledaticempire|/Users/user|/home/zemog|\
Detro|D0ATHQ1BQD7|brockbro2"
→ empty across tracked files.
Why a patch release
v4.0.0 carried operator-recon strings inadvertently included via the
multi-witness publisher work on the next lineage. The master lineage
was scrubbed in c4f6050 (2026-05-06) but next hadn't received the
same pass. v4.0.1 brings the substrate-track tree to the same hygiene
standard.
v4.0.0 — Substrate maturity
⚠️ Superseded by v4.0.1.
v4.0.0 included operator-recon strings (Tailscale IPs, internal SSH targets,
home-directory paths) that were inadvertently carried over from the next
lineage. v4.0.1 is a documentation/config sanitization patch — the compiled
binary is identical. Please consume v4.0.1 instead.
A major version bump tagged on the next lineage. (master continues the parallel
v3.x attestation/agent track; the two have diverged on purpose.) 216 commits since
v3.11.0 was tagged on master 11 days ago — concurrency, playground, public JIT,
dual-backend parity, 30/30 substrate hard-bench publicly reproducible, browser-side
provenance verifier, four sweeping bug-class closures including a 17-day silent-
corruption fix discovered by a dual-implementation falsification harness.
No public API breaks; the major bump is a positioning marker, not a SemVer surface
change. The substrate-not-model thesis (docs/site/jit.md + tools/bench/repro_30of30.sh +
https://ledatic.org/verify/<id> + this entire shipping volume) is now publicly
defensible without hand-waving.
What "substrate maturity" means here
- A frontier model + a 1KB Rail spec compiles 30/30 on a held-out hard-bench,
  reproducible by any partner with an API key. (f2c88b2)
- The compiler is genuinely self-hosted on two backends, each with full
  same-bug-class parity for the 9 binary ops across both operand orderings.
  (ARM64 140/140, x86_64 136/136. 9e16aa7 + c9de6e9 + b223960.)
- The verifier is a library, not a tool — import "jit/grade.rail" and a Rail
  program can compile + execute new Rail at runtime in the same process. (07366ea)
- The provenance pipeline is multi-witness Ed25519, browser-verifiable, with
  pulse_id binding closing the prior session-replay gap. (f732176 + 2ada525)
- A standalone single-file verifier ships at deterministic SHA — anyone can
  grade reports without trusting the original signer's infrastructure.
  (ledatic-site 8f5b928)
Compiler & runtime
- Concurrency v1. Typed channels + select over a pthread-backed runtime.
  import "stdlib/concurrent.rail" exposes rc_chan_make / rc_chan_send /
  rc_chan_recv / rc_spawn. int64-only values in v0; 9 + 8 falsification tests
  green. (4623e72)
- Auto-memo fib silent-corruption — FIXED. compile.rail:2593 memo_store emit
  was double-untagging x19 (which was already untagged in the prologue). Writes
  went to memo[n/2] while reads keyed memo[n]; pairs collided on shared slots.
  fact escaped because it has only one recursive call and never reads back; fib
  failed because two recursive reads collide. fib(10) was returning 293886
  instead of 55. Found by the JIT REPL agent comparing shell-compile vs JIT on
  the same program; one-line fix using x19 directly as the index register.
  Falsification at tools/test/auto_memo_fib_correctness.rail. (b89a60b)
- Nullary-LHS binary-op bug — FIXED. Any binary op with a top-level nullary
  LHS expression was using the prior x0 instead of the freshly-computed value.
  compile.rail::emit_x1 fast-path patched; 2-cycle bootstrap byte-identical.
  Was the root of the multi-week "CPU substrate is mysteriously wrong" arc; closes
  the substrate investigation. (pre-window but retroactively notable)
- _rail_join O(n²) — FIXED. Runtime asm rewrite of join: 53.5 GB → 267 MB on
  the 8×100K-float dump pattern (200× memory, 120× wall-clock). Diagnostic
  harnesses kept at tools/diagnose/dump_pattern_smoke.rail +
  tools/diagnose/dump_bisect.rail. (pre-window)
- Same-bug-class parity sweep — CLOSED on both backends, both orderings.
  Each of 9 binary ops (+,-,*,/,%,<,>,<=,>=) now has
  symmetric handling for (int, float) and (float, int) operand orderings.
  - x86 (int, float): inline emit check_both + .L<op>_mixed_if. (b223960)
  - x86 (float, int): already covered by b223960's symmetric routing.
  - ARM64 (int, float): inline emit check_both + .L<op>_mixed_if mirror.
    (9e16aa7 + d4e3696)
  - ARM64 (float, int): .L<op>_mixed_fi mirror that takes raw-f64 LHS via
    fmov d0, x1, untags+converts tagged-int RHS via asr + scvtf. For
    _rail_add specifically the dispatch is inserted at the top of .Ladd_heap
    so the string-append path remains correct. (c9de6e9)
  - 9 + 9 = 18 falsification tests at
    tools/test/<op>_{int_float,float_int}_ordering.rail.
- 3-movk integer literal codegen. emit_load_int at compile.rail:829 now
  emits movz + up to 3 movk chunks (bits 0-15, 16-31, 32-47, 48-63) with zero
  chunks at ≥#32 skipped, plus a symmetric movn + movk path for negatives.
  k16/k32/k48 computed via shl 1 N so constant-folding doesn't bake the
  64-bit literal as a constant the seed can't emit. Regression tests t132/t133/t134.
  ARM64 floor: 137 → 140. (872424b)
- Bootstrap convergence audit — published. The "bootstrap doesn't converge"
  claim was falsified: it's a 2-cycle limit cycle. gen0's shipped runtime asm
  doesn't necessarily match what gen0's source emits, so cycle 1 typically differs;
  gen2 always lands the byte-identical fixed point. See
  notes/bootstrap_convergence_audit_2026-05-13.md.
- Diagnostic surface. strip_trailing_ws helper replaces trim at 4 multi-line
  as/ld result sites so undefined-symbol errors and assembly errors no longer
  silently truncate to the first line. shell_quote_arg + shell_quote_join +
  join_args_quoted preserve quoted argv through ./rail_native run. (b7f267a,
  23fa5fd)
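The movz/movk and movn/movk chunking described in the integer-literal item can be sketched as below. This is an illustration of the general AArch64 technique, not a port of emit_load_int: it skips every redundant upper chunk (the notes only mention skipping zero chunks at shifts of 32 and above), and the tuple representation and the `execute` checker are hypothetical helpers added so the result can be verified.

```python
MASK64 = (1 << 64) - 1

def emit_load_int(value):
    """Return [(mnemonic, imm16, shift)] that materializes value in a 64-bit register."""
    if not -(1 << 63) <= value <= MASK64:
        raise ValueError("value does not fit in 64 bits")
    bits = value & MASK64
    chunks = [(bits >> shift) & 0xFFFF for shift in (0, 16, 32, 48)]
    if value >= 0:
        # movz zeroes the register, so all-zero upper chunks need no movk.
        ops = [("movz", chunks[0], 0)]
        redundant = 0x0000
    else:
        # movn writes NOT(imm), so upper chunks start as 0xFFFF.
        ops = [("movn", chunks[0] ^ 0xFFFF, 0)]
        redundant = 0xFFFF
    for i in (1, 2, 3):
        if chunks[i] != redundant:
            ops.append(("movk", chunks[i], 16 * i))
    return ops

def execute(ops):
    """Interpret the emitted sequence and return the final register bits."""
    reg = 0
    for mnemonic, imm, shift in ops:
        if mnemonic == "movz":
            reg = imm << shift
        elif mnemonic == "movn":
            reg = ~(imm << shift) & MASK64
        else:  # movk replaces one 16-bit field and keeps the rest
            reg = (reg & ~(0xFFFF << shift) & MASK64) | (imm << shift)
    return reg

print(emit_load_int(0x100000000))   # [('movz', 0, 0), ('movk', 1, 32)]
print(emit_load_int(-1))            # [('movn', 0, 0)]
```

Small literals cost one instruction, and only literals with four distinct non-redundant chunks need the full movz plus three movk.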
Self-hosted JIT — now a first-class tool
- Public documentation. docs/site/jit.md (109 lines): substrate-honesty
  framing, end-to-end test_codegen demo with output, honest capability + limit
  table, file map for inspection. Linked from docs/site/index.md. Public surface
  at https://ledatic.org/rail/docs/jit.html once deployed. (def1bcd)
- JIT-first REPL at tools/repl_jit.rail. ~3000× per-line vs shell-compile
  (0.1 ms median JIT-line vs 319 ms shell-line). Persistent definitions across
  lines via string-concat buffer; every line re-lowers the full defs + expr at
  ~0.4 ms. One-time ~21 s REPL compile mitigated by pre-compiled binary at
  /tmp/repl_jit_bin. 11/11 smoke green including JIT-hits, ADT-fallback,
  parse-error path. tools/repl.rail (the shell-based REPL) untouched. (6ab2666)
- JIT-grade fast path at tools/bench/jit_grade_batch.rail, opt-in via the
  --jit-fast flag on tools/bench/repro_anthropic.py. Modest 1.18× grading-only
  speedup (101.75 s → 86.18 s) — the public bench is API-bound, so default driver
  stays shell-only. Lower-hit 14.2 % on synthesized completions, 40 % on canonical
  hand-curated shapes. Soundness finding the falsification test earned:
  jit_can_lower=1 was UNSOUND as a fast-path predicate — JIT recognizes builtins
  (str_eq/str_len/str_at/is_nil) that rail_native rejects; naive routing
  would have silently marked fail cases as passes. contains_unsafe_jit_builtin
  guard added; 26/26 parity. The bug was simultaneously fixed at the JIT source
  itself in jit_can_lower. (163521e)
- In-process agentic loop at tools/agent/jit_loop.rail. Single Rail program
  that calls the Anthropic API via stdlib/anthropic_client.rail, JIT-compiles
  the response via jit/grade.rail, executes, returns. Offline smoke green (fib 10
  → 55, fact 6 → 720). 5/5 in-subset programs JIT cleanly; 5/5 out-of-subset
  reject loudly with diagnostics — hard verifier, no silent wrong answers. (07366ea)
- JIT lower cluster fixes. Three closing bugs from the JIT integrations:
  multi-line let inside fn bodies (parse_fn_body now skip_nls's before body);
  st_fail no longer prints (uses mutable arr cell pattern from
  stdlib/https_session.rail:64); jit_can_lower substring-checks for unsafe
  builtins. (09263e6 + ef88a42 + 1226600)
Substrate hard-bench — publicly reproducible
- F-53 closure. tools/bench/substrate_hard_bench.rail +
  tools/bench/repro_anthropic.py + tools/bench/repro_30of30.sh +
  tools/bench/README.md. Two reproduction paths: Anthropic API (~$15–20 / run,
  ~15–25 min) and local MLX/vLLM (any 100B+ open-weight on an OpenAI-compatible
  endpoint). Partners can now run the 30/30 bench without Studio access. (f2c88b2)
- The empirical claim it backs: a frontier model + 1KB Rail spec scores 30/30
  on a held-out hard-bench, beating a fine-tuned ensemble. Every band 5/5; 15.4
  min wall-clock; multi-witness Ed25519 signed; verifiable at /verify/<id>.
Provenance — v2 with browser-side verify
- Pulse_id binding. Attestation v2 binds pulse_id so old attestations
  cannot be replayed against new pulses. TOCTOU on weights closed via re-hash
  inside the signing transaction. (f732176)
- Standalone verifier ships from the ledatic-site repo as a single-file
  executable with a deterministic SHA. Third parties can grade reports without
  trusting Studio infrastructure. (ledatic-site 8f5b928)
- Crypto stdlib hardening. 2 CRITICALs + 7 HIGHs closed via the 2026-05-12
  parallel security-audit pass. Crypto stdlib + provenance + fleet posture all
  tightened. See memory entry security_audit_2026-05-12. (bf7ff54,
  f065e0e, 2ada525, 39e02fe)
- DNS-match short-circuit fix. cv_dns_match wildcard-vs-equal-length
  path patched so SAN matching can't bypass on edge inputs. (47ca7f1)
- Fleet bind to Tailscale IPv4. fleet_agent_v3 no longer listens on
  0.0.0.0; bound to the Tailscale IP only. (1de6cff)
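For readers unfamiliar with why SAN wildcard matching has edge cases, here is a conservative matcher in the spirit of RFC 6125. It is not cv_dns_match and does not reproduce the patched bug, whose details the notes leave out. The function name and the three-label minimum for wildcard patterns are assumptions chosen for the sketch.

```python
def dns_match(pattern, host):
    """Return True if host matches a certificate SAN pattern.

    Rules in this sketch: comparison is case-insensitive, a wildcard may
    only be the entire leftmost label, and it covers exactly one label.
    """
    p = pattern.lower().rstrip(".").split(".")
    h = host.lower().rstrip(".").split(".")
    if len(p) != len(h):
        return False              # a wildcard never spans multiple labels
    if p[0] == "*":
        # Refuse overly broad patterns such as "*.com" (assumed policy).
        return len(p) >= 3 and h[0] != "" and p[1:] == h[1:]
    return p == h

print(dns_match("*.example.com", "api.example.com"))   # True
print(dns_match("*.example.com", "a.b.example.com"))   # False
print(dns_match("*.example.com", "example.com"))       # False
```

Comparing label lists rather than raw strings avoids shortcuts based on string length, which is one common source of bypasses in hand-written matchers.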
x86_64 backend — full conformance
- 136/136. From the prior 71/79 baseline, the 2026-05-12 punch-list (Agents
  A–E in parallel) drove the backend to 100 % conformance via:
  - Bit-op runtime + char_from_int + byte_at/set (60cd486)
  - **5 `rail...
v3.8.0 — Releases physicified (attestation)
v3.8.0 — 2026-05-01 — Releases physicified (attestation)
Every tagged release, every ./rail_native test pass, and every 2-pass self-compile fixed point now binds to a live entropy beacon pulse_id and an Ed25519 signature from the project's fleet0 Pi witness (pk_fp = cac5f21a70564aeb). The signed artifacts ship in releases/<tag>/, are mirrored at https://ledatic.org/releases/<tag>/, and are reproducible offline with https://ledatic.org/attest/verify.sh.
Attestation kernel + drivers
- tools/attest/attest.sh — primitive: signs sha256(input) ⊗ pulse_id ⊗ value_hex via the Pi witness using a namespace-prefixed message (attest|v1|...) so attestation sigs can never collide with beacon-witness sigs.
- attest_release.sh / attest_test_run.sh / attest_selfhost.sh — call the primitive on the binary, the test log, and the byte-identical fixed point respectively.
- verify.sh — re-derives the digest, fetches the public key from ledatic.org/attest/fleet0.pub.pem, runs the Ed25519 verify, exits non-zero on tamper.
- tools/attest/backfill_releases.sh — extracts each historical tag's rail_native + tools/compile.rail blobs (no checkout) and signs them. v2.0.0 → v3.7.0 are all attested + downloadable.
Cadenced drivers
- tools/attest/daily.sh — LaunchAgent com.ledatic.attest_daily re-attests production every morning at 06:00 local. Updates /builds/latest and /selfhost/latest pointers; production drift = "latest" pointer falls behind the live tree, immediately self-evident.
- tools/attest/fleet_status_publisher.sh — LaunchAgent com.ledatic.fleet_attest polls each fleet node's /health every 60 s, fetches the current pulse, signs the bundle, and publishes to https://ledatic.org/fleet/status.json.
Public surfaces
- https://ledatic.org/system — mission control: five panels (beacon · witness · fleet · build · selfhost), each resolves to a signed JSON artifact, refreshes on 2.5 s cadence, self-marks "live" or "stale" based on signature freshness.
v3.7.0 included (was on next branch only)
This release also rolls in v3.7.0 — Float-TCO root fix, mixed-precision inference, parallel rerank — which was tagged on the next branch (2026-04-30) but never merged into master. See v3.7.0 release notes.
CI fix
.github/workflows/ci.yml now builds tools/metal/libtensor_gpu.dylib before running the test suite. Four tensor tests had been failing at link with "Undefined symbols for architecture arm64: _tgl_init …" since the Metal-FFI introduction (~2026-04-15); CI returned to green.
Validation
- 137/137 tests green
- Byte-identical 2-pass self-compile fixed point verified
The verb: Rail releases are no longer claims, they're physical events anchored to real time.
v3.7.0 — Float-TCO root fix, mixed-precision inference, parallel rerank
v3.7.0 — 2026-04-30 — Float-TCO root fix, mixed-precision inference, parallel rerank
Substantial substrate work. Seven commits, three real bugs (one fixed at
root, one workaround'd at source, one falsified), one substantial new
feature (Rail-native mixed-precision GPU inference), one substantial new
tool (parallel rerank wrapper), and a precise reproducer for one bug
that stayed open. 137/137 tests green; byte-identical self-bootstrap
verified.
Compiler / runtime
- Float-TCO root fix. Re-added body_has_float guard to
  all_params_int in tools/compile.rail:1992. Closes a 17-day silent
  wrong-result bug introduced by commit 82516e4 (2026-04-13) that
  caused tail-recursive float helpers (e.g. rms_row_apply) to
  reinterpret float bits as ints in register-ABI calls, producing
  garbage. Headline affected sites: RMSNorm CPU path, AdamW weight
  decay, LayerNorm CPU backward. (7752738)
- Runtime-mmap arena (A1.P4). RAIL_ARENA_MB env var (default 1 GB,
  scales to 4 GB+ via mmap). Replaces the fixed 512 MB BSS arena that
  was bumping the macOS dyld static-data ceiling. envp passthrough via
  _rail_envp so env vars reach ./rail_native run child processes.
  Long-context training (seq=2048+) now mechanically tractable on
  macOS. (7752738)
- Diagnostic counters (A1.P5). alloc_stats_snapshot returns 17
  ints now: per-class freelist misses (0–11), munmap_count (12),
  mmap_large_count (13), arena_spill_count (14), gc_count (15),
  arena_spill_bytes (16). Plus RAIL_ARENA_TRACE=1 for stderr-emitted
  spill events. (7752738)
- Parser multi-line compound expressions. Cons chains, nested calls,
  list literals inside unclosed (...) / [...] now parse cleanly. Same
  post-tokenizer pass routes both tokenize and tokenize_with_pos.
  (7752738)
- ./rail_native quick. 15 critical tests in ~5s, vs the full
  suite's 10+ min. Use between code edits. (7752738)
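The 17-int layout of alloc_stats_snapshot listed above maps naturally onto a named record. The sketch below decodes such a snapshot on the consumer side; the index assignments come from the notes, while the `AllocStats` type and `decode_alloc_stats` helper are hypothetical names for illustration, not part of Rail's stdlib.

```python
from typing import NamedTuple, Tuple

class AllocStats(NamedTuple):
    freelist_misses: Tuple[int, ...]  # indices 0-11, one per size class
    munmap_count: int                 # index 12
    mmap_large_count: int             # index 13
    arena_spill_count: int            # index 14
    gc_count: int                     # index 15
    arena_spill_bytes: int            # index 16

def decode_alloc_stats(snapshot):
    """Turn the flat 17-int snapshot into named fields."""
    if len(snapshot) != 17:
        raise ValueError("expected 17 ints, got %d" % len(snapshot))
    return AllocStats(tuple(snapshot[:12]), *snapshot[12:17])

stats = decode_alloc_stats(list(range(17)))
print(stats.gc_count)            # 15
print(stats.arena_spill_bytes)   # 16
```

Checking the length up front makes a future change to the snapshot size fail loudly instead of silently shifting fields.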
Inference path
- Rail-native mixed-precision matmul. New Metal kernel
  matmul_f32x_halfw (fp32 activations × fp16 weights → fp32, fp32
  accumulator). Host wrapper tgl_matmul_f32x_halfw_host casts
  f64↔f32 once at the GPU boundary; Rail-side surface stays in f64.
  stdlib/tensor.rail:matmul_mixed. Primitive correctness:
  max_abs_diff = 0.00042 vs f64 reference (vs 0.00082 for the
  all-fp16 path — 2× tighter). Byte-deterministic across 100+
  sequential calls. New harness at
  tools/train/lm_infer_v3_mixed.rail. Right substrate for d=384+
  scaling; not the d=256 winner today (CPU+KV-narrowing remains
  faster for current model). (ee6bdce)
- Parallel rerank wrapper. tools/train/parallel_rerank.sh fans
  out N inference subprocesses concurrently with distinct seeds,
  pre-compiling the harness once for amortization. Validated 7.1×
  wall-clock at N=8, ~11× projected at N=20 — bench projection drops
  from 2.25hr to ~13min for 30 prompts × N=20 rerank. --bin <path>
  flag (added in v3.7.0 as a follow-on) lets orchestrators skip the
  built-in pre-compile. (ee6bdce, 73043e2)
- tools/train/parity_check.sh. Three-way diff harness running
  CPU (f64), GPU half (existing v3_half), and GPU mixed (new) on the
  same checkpoint+prompt+seed. Useful for reasoning about which
  precision path is producing which degenerate argmax token under
  undertrained models. (ee6bdce)
- tools/test/sequential_matmul_half_test.rail. Regression test
  verifying tgl_matmul_half_host is byte-deterministic across 1000
  sequential calls. Eliminates the "primitive corruption" hypothesis
  for any future GPU-collapse investigation. (7752738)
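The fan-out pattern behind the parallel rerank wrapper (N concurrent subprocesses, one distinct seed each, results collected in seed order) can be sketched as follows. The real wrapper is a shell script around the compiled inference harness; here a trivial Python one-liner stands in for that binary, and `WORKER`, `run_one`, and `parallel_rerank` are hypothetical names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the pre-compiled inference binary (assumption): it just
# echoes the seed it was given.
WORKER = [sys.executable, "-c",
          "import sys; print('sample for seed', sys.argv[1])"]

def run_one(seed):
    """Run one inference subprocess and return its trimmed stdout."""
    proc = subprocess.run(WORKER + [str(seed)],
                          capture_output=True, text=True, check=True)
    return proc.stdout.strip()

def parallel_rerank(n, base_seed=0):
    """Launch n workers concurrently with seeds base_seed .. base_seed+n-1."""
    seeds = [base_seed + i for i in range(n)]
    # Threads only wait on child processes, so the GIL is not a bottleneck.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(run_one, seeds))  # map preserves seed order

if __name__ == "__main__":
    for line in parallel_rerank(4, base_seed=100):
        print(line)
```

On the quoted figures, an 11× speedup takes the 2.25 hr (135 min) projection to about 12.3 min, in line with the ~13 min stated above.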
Diagnostic infrastructure
- RAIL_GPU_POOL_DISABLE=1 env flag in tools/metal/tensor_gpu_lib.m.
  Bypasses MTLBuffer pool best-fit reuse, forcing fresh
  newBufferWithLength on every acquire. Falsifies the standing
  hypothesis that pool reuse caused GPU sequential-inference collapse;
  with the flag set, collapse is byte-identical to baseline. (7752738)
Inference workaround
- tools/train/lm_infer_cpu.rail:gen_loop no longer calls arena_reset.
  Eliminates a compiler-codegen interaction (between arena_reset and
  multiply-add expressions in float_arr_set) that corrupted
  _rail_small_fl[0] with the value being stored, surfacing as
  SIGSEGV in _rail_chained_malloc on a subsequent allocation. Bug
  was seed-deterministic (~50% of seeds at --max 128 --k 10), and
  silently confounded all post-04-13 single-sample compile-rate
  measurements. Workaround eliminates the trigger; per-iteration
  intermediate tensors now accumulate in the bump arena, which the
  default 1 GB easily holds for bounded inference runs. 30/30 stress
  tests pass. The compiler-level fix remains open with a precise
  one-line reproducer documented. (f215039)
Documentation
- docs/SESSION_HANDOFF_2026-04-30_EOD.md — full afternoon arc.
- docs/SPUR_HANDOFF_2026-04-30.md, docs/MODEL_SESSION_HANDOFF.md,
  docs/ROADMAP_2026-04-30.md — morning arc + 6-month framing.
- docs/RAIL_ENGINEER_SESSION_PROMPT_2026-04-30_NIGHT.md —
  forward-looking prompt for the engineer picking up the open compiler
  bug + remaining substrate debt.
- Six new design notes: arena-design.md,
  arena-leak-fix-strategy.md, data-section-quirk.md,
  backlog-deferred-design-notes.md, strict-typecheck-design.md,
  garmin-research-notes.md.
What was falsified (negatives)
- GPU sequential-collapse "MTLBuffer pool reuse" hypothesis:
falsified via RAIL_GPU_POOL_DISABLE. Collapse byte-identical with
pool off. Surviving cause: fp16 precision compounding across 22
matmul round-trips/token (intrinsic, not a fixable substrate bug).
- 2026-04-15 "10 MB/step leak" hypothesis: falsified by
arena_reset chain-drain test (10 cycles, byte-tight). The
allocator is sound; remaining leak suspects are GPU-side
(MTLBuffer pool) or gpu_available 0 re-eval churn.
- Static 2 GB arena: tested, breaks dyld at link time. 1 GB is
the macOS BSS ceiling; runtime mmap (A1.P4) is the path beyond.
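The fp16 compounding named in the first item can be shown with a scalar toy. This Python sketch is not the Rail inference path: to_half and compounded_error are hypothetical helpers, and a chain of scalar multiplies stands in for matmul round-trips. Each rounding to binary16 can add up to about 2^-11 relative error, so the worst-case bound grows with the number of round-trips, although individual errors may partly cancel.

```python
import struct

def to_half(x):
    """Round a Python float (f64) to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def compounded_error(x0, scale, round_trips):
    """Relative gap between an f64 chain and one rounded to fp16 at every step."""
    exact, half = x0, to_half(x0)
    for _ in range(round_trips):
        exact *= scale
        half = to_half(half * scale)
    return abs(half - exact) / abs(exact)

print(to_half(0.1))                       # nearest fp16 value to 0.1
print(compounded_error(0.1, 1.01, 1))     # one round-trip
print(compounded_error(0.1, 1.01, 22))    # 22 round-trips, as in the per-token count above
```

Because the rounding happens inside the number format itself, no change to buffer management removes it; only a wider intermediate precision does.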
Memory entries
Fifteen entries in ~/.claude/projects/-Users-user/memory/ capture
today's earned knowledge: substrate findings, discipline rules
(feedback_verify_removals, feedback_diagnostics_first,
feedback_honest_backlog), the dylib investigation chain, the
mixed-precision and parallel-rerank specs, and the segfault
bisection.
v3.0.0 — Rail speaks TLS
Rail speaks HTTPS alone. A complete pure-Rail TLS 1.3 stack, X.509 chain validation, and HTTPS client — with zero C transitive dependency beyond as, ld, and the kernel's BSD sockets.
Live on release day, in production
anthropic_chat "claude-haiku-4-5-20251001" "Reply with exactly: hello from pure rail"
→ HTTP 200, reply "hello from pure rail" (6.9 s, pure Rail → Anthropic)
slack_post_text "D0ATHQ1BQD7" "v3.0.0 smoke: pure-Rail TLS direct to slack.com"
→ ok=true, HTTP 200 with x-slack-req-id (1.0 s, pure Rail → Slack)
https_get_url "https://www.amazon.com/"
→ HTTP 200 with set-cookie, x-amz-rid (4.0 s, RSA chain validated
to DigiCert Global Root G2)
The full Google Trust Services chain for api.anthropic.com (leaf → WE1 intermediate → GTS Root R4) validates end-to-end to the macOS /etc/ssl/cert.pem trust store.
What shipped
~3,800 lines of new pure-Rail crypto + TLS across 16 new stdlib modules. Every primitive NIST- or RFC-vector validated:
- Hashes: SHA-256, SHA-384, SHA-512
- MAC / KDF: HMAC-SHA-256, HKDF-Extract/Expand
- Symmetric: ChaCha20, Poly1305, ChaCha20-Poly1305 AEAD
- Public key: X25519, ECDSA-P256 (16-limb), ECDSA-P384 (24-limb), RSA-PSS / RSA-PKCS1 (128-limb)
- X.509 / PKI: ASN.1 DER parser, Base64 decoder, PEM iterator, macOS trust store loader (128 roots)
- TLS 1.3: key schedule, handshake state machine, record layer, CertificateVerify dispatch, SAN hostname match, validity period, full chain walker with shortest-path policy
- Application: https_get / https_post / https_get_url + URL parser + UDP DNS + live Anthropic + Slack clients
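As an illustration of what RFC-vector validation looks like for one primitive in the list, the sketch below checks HKDF with HMAC-SHA-256 against Test Case 1 from RFC 5869, Appendix A.1. It is written in Python against the standard library rather than in Rail, and the function names hkdf_extract and hkdf_expand are this sketch's own, not Rail's stdlib API.

```python
import hashlib
import hmac

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes

def hkdf_extract(salt, ikm):
    """HKDF-Extract (RFC 5869, section 2.2) with HMAC-SHA-256."""
    if not salt:
        salt = bytes(HASH_LEN)
    return hmac.new(salt, ikm, hashlib.sha256).digest()

def hkdf_expand(prk, info, length):
    """HKDF-Expand (RFC 5869, section 2.3) with HMAC-SHA-256."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

# RFC 5869, Appendix A.1, Test Case 1.
ikm = bytes.fromhex("0b" * 22)
salt = bytes.fromhex("000102030405060708090a0b0c")
info = bytes.fromhex("f0f1f2f3f4f5f6f7f8f9")

prk = hkdf_extract(salt, ikm)
okm = hkdf_expand(prk, info, 42)

assert prk.hex() == "077709362c2e32df0ddc3f0dc47bba6390b6c73bb50f9c3122ec844ad7c2b3e5"
assert okm.hex() == ("3cb25f25faacd57a90434f64d0362f2a2d2d0a90cf1a5a4c5db02d56ecc4c5bf"
                     "34007208d5b887185865")
print("HKDF-SHA-256 matches RFC 5869 test case 1")
```

A vector check of this kind pins the implementation to published intermediate values (here the PRK) as well as the final output, which localizes a failure to Extract or Expand.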
Trust posture
A TLS connection through Rail v3.0.0 refuses to hand plaintext to the caller unless all of the following hold:
- The server's CertificateVerify signature checks out (ECDSA-P256-SHA256 or RSA-PSS-SHA256) against the public key in the leaf's SubjectPublicKey.
- The leaf's SubjectAltName dNSName entries include a match for the hostname asked for (RFC 6125 §6.4.3 wildcard support).
- The current time is within the leaf's notBefore/notAfter window.
- The server Finished MAC validates.
Full chain walk to a CA root is available as cc_walk_chain (opt-in primitive).
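The hostname condition in the list above can be sketched as follows. This is a conservative Python illustration in the spirit of RFC 6125 §6.4.3, not Rail's matcher: dns_name_matches and san_matches are hypothetical names, only a whole left-most "*" label is treated as a wildcard, and partial-label wildcards such as f*o are deliberately not supported.

```python
def dns_name_matches(pattern, hostname):
    """Match one SAN dNSName entry against the requested hostname."""
    p_labels = pattern.lower().rstrip(".").split(".")
    h_labels = hostname.lower().rstrip(".").split(".")
    if len(p_labels) != len(h_labels):
        return False  # a wildcard never spans more than one label
    if p_labels[0] == "*":
        # Conservative policy: at least two literal labels after the wildcard,
        # and the matched hostname label must be non-empty.
        if len(p_labels) < 3 or not h_labels[0]:
            return False
        return p_labels[1:] == h_labels[1:]
    return p_labels == h_labels

def san_matches(dns_names, hostname):
    """True if any dNSName entry in the leaf's SubjectAltName matches."""
    return any(dns_name_matches(name, hostname) for name in dns_names)

print(san_matches(["example.com", "*.example.com"], "api.example.com"))
print(san_matches(["*.example.com"], "a.b.example.com"))
```

Requiring equal label counts keeps a wildcard from matching across dots, so *.example.com covers api.example.com but neither example.com nor a.b.example.com.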
Honest limits
Single cipher suite (TLS_CHACHA20_POLY1305_SHA256), single ECDHE group (x25519), three sig-algs, no session resumption, no 0-RTT, no constant-time guarantees, ~5–8 s per connection (public-key verify dominates). See SECURITY.md before deploying.
Tests
22 pure-Rail TLS tests all green + 116-test core suite still 116/116. Self-compile 2-pass byte-identical preserved. Full details in CHANGELOG.md.
The arc
- v1.x: Rail compiled itself.
- v2.x: Rail gained networks, trained transformers, shipped to Cloudflare.
- v3.0.0: Rail calls api.anthropic.com by itself.
Rail runs on Rail, the rest runs on physics.