Releases · zemo-g/rail

28 May 19:40

zemo-g

v5.1.0

94afdd1

v5.1.0 — Rail emits its own GPU kernels Latest


Rail now generates Metal Shading Language from its op-DAG, JIT-compiles it via Metal's newLibraryWithSource:, and dispatches the kernel at runtime. Every kernel the GPU executes is emitted by an attested Rail binary — the substrate piece needed for end-to-end attested GPU training.

This release bundles the full GPU substrate the auto-emission pipeline rests on: per-op Metal kernels, the bf16 numerics regime that unlocks stable 10k-step training, the JIT compile foundation, two hand-fused kernels, and the DAG matcher + emitter that drive them.

Auto-emission pipeline

| Module | Role |
| --- | --- |
| `stdlib/jit_node.rail` | JIT op-DAG types (`JitNode`, `TracedTensor`) + tape primitives + `jit_nth`. Pure-DAG module with no Tensor/transformer dependency, so codegen consumers can import it without pulling in the training stack. |
| `stdlib/jit_tape.rail` | Execution tracers (`jit_leaf`, `traced_rmsnorm`, `traced_matmul`) layered on jit_node. |
| `stdlib/jit_match.rail` | DAG matcher. `walk_tape` returns a list of `FuseMatch` records (`FuseRmsQKV`, `FuseSiluHad`) in tape order, identifying subgraphs that fit known fusion shapes. |
| `stdlib/jit_emit.rail` | MSL emitter. `emit_msl_for_match (tape, FuseMatch)` returns the kernel source. Stubbed against known patterns today; v5.2+ replaces with shape-parameterized codegen driven by JitNode data. |
| `stdlib/jit.rail` | Compile/dispatch shim. No longer owns MSL text; `jit_compile_*` pulls strings from `jit_emit`. External API unchanged for existing consumers. |
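
The matcher's job can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Node` type, the op names, and the matching rules are hypothetical stand-ins, not the contents of `stdlib/jit_match.rail`. Only the pattern names `FuseRmsQKV` and `FuseSiluHad` and the "matches in tape order" behavior come from the notes above.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical stand-in for a JitNode: an op name plus tape indices of its inputs.
@dataclass(frozen=True)
class Node:
    op: str
    inputs: Tuple[int, ...] = ()

def walk_tape(tape: List[Node]) -> List[Tuple[str, Tuple[int, ...]]]:
    """Return (pattern_name, node_indices) matches in tape order."""
    matches = []
    for i, node in enumerate(tape):
        if node.op == "rmsnorm":
            # Assumed fusion shape 1: one rmsnorm feeding exactly three matmuls (Q, K, V).
            users = [j for j, n in enumerate(tape)
                     if n.op == "matmul" and n.inputs and n.inputs[0] == i]
            if len(users) == 3:
                matches.append(("FuseRmsQKV", (i, *users)))
        elif node.op == "mul" and len(node.inputs) == 2:
            # Assumed fusion shape 2: silu(gate) * up.
            if tape[node.inputs[0]].op == "silu":
                matches.append(("FuseSiluHad", (node.inputs[0], i)))
    return matches

tape = [
    Node("leaf"),                               # 0: x
    Node("rmsnorm", (0,)),                      # 1
    Node("leaf"), Node("leaf"), Node("leaf"),   # 2, 3, 4: Wq, Wk, Wv
    Node("matmul", (1, 2)),                     # 5: Q
    Node("matmul", (1, 3)),                     # 6: K
    Node("matmul", (1, 4)),                     # 7: V
    Node("leaf"), Node("leaf"),                 # 8: gate, 9: up
    Node("silu", (8,)),                         # 10
    Node("mul", (10, 9)),                       # 11
]
found = walk_tape(tape)
```

An emitter in this scheme would take each match and return kernel source for it, which is the role the table assigns to `emit_msl_for_match`.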

Hand-fused Metal kernels

| Kernel | Fusion | Speedup |
| --- | --- | --- |
| `fused_rmsnorm_qkv` | RMSNorm + 3 matmul (Q/K/V) in one threadgroup-per-row dispatch. Exports Q \| K \| V \| rstd \| LN1 from a single packed buffer; rstd + LN1 kept for backward. 4 dispatches → 1. | 35× over the per-op chain at seq=512, d=64 |
| `fused_silu_hadamard` | SiLU(gate) * up in one elementwise dispatch. Exports h_act \| sigmoid(gate); sigmoid kept for backward. | 18× over the per-op chain |
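
As a rough CPU reference for what the two fused kernels compute, here is a plain-Python sketch for a single row. It is not the Metal source: the learned scale `gamma`, the `eps` default, and the per-row packing order shown in `packed` are assumptions chosen to mirror the "Q | K | V | rstd | LN1" description above.

```python
import math

def fused_rmsnorm_qkv_ref(x, gamma, wq, wk, wv, eps=1e-6):
    """Reference for one input row: returns (q, k, v, rstd, ln1)."""
    d = len(x)
    rstd = 1.0 / math.sqrt(sum(v * v for v in x) / d + eps)
    ln1 = [x[i] * rstd * gamma[i] for i in range(d)]

    def matvec(w):  # w is a d x d_out matrix stored as a list of rows
        return [sum(ln1[i] * w[i][j] for i in range(d)) for j in range(len(w[0]))]

    return matvec(wq), matvec(wk), matvec(wv), rstd, ln1

def fused_silu_hadamard_ref(gate, up):
    """Reference for SiLU(gate) * up; also returns sigmoid(gate) for backward."""
    sig = [1.0 / (1.0 + math.exp(-g)) for g in gate]
    h_act = [g * s * u for g, s, u in zip(gate, sig, up)]
    return h_act, sig

eye = [[1.0, 0.0], [0.0, 1.0]]
q, k, v, rstd, ln1 = fused_rmsnorm_qkv_ref([3.0, 4.0], [1.0, 1.0], eye, eye, eye, eps=0.0)
packed = q + k + v + [rstd] + ln1          # assumed per-row layout: Q | K | V | rstd | LN1
h_act, sig = fused_silu_hadamard_ref([0.0, 2.0], [5.0, 1.0])
```

Keeping `rstd`, `ln1`, and `sigmoid(gate)` in the output is what lets a backward pass reuse them instead of recomputing the forward intermediates.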

JIT compile foundation

tgl_jit_compile_from_tmp_file — Metal's newLibraryWithSource: driven from a Rail-emitted .metal file. Returns a pipeline ID cached in g_jit_pipes for reuse across steps.
tgl_jit_dispatch_1in1out / _2in1out / _rmsnorm_qkv / _silu_hadamard — per-pattern dispatchers with f64↔f32 host staging.
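
The "compile once, reuse the pipeline ID across steps" idea can be sketched as a small cache. This is a hypothetical illustration: the notes only say that IDs are cached in g_jit_pipes, so keying by a hash of the kernel source and the `compile_fn` callback are assumptions.

```python
import hashlib

class PipelineCache:
    """Maps kernel source text to a small integer pipeline ID, compiling at most once."""

    def __init__(self, compile_fn):
        self._compile = compile_fn   # hypothetical: source text -> compiled pipeline object
        self._pipes = []             # position in this list is the pipeline ID
        self._by_source = {}         # sha256(source) -> pipeline ID

    def get(self, source: str) -> int:
        key = hashlib.sha256(source.encode()).hexdigest()
        if key not in self._by_source:
            self._pipes.append(self._compile(source))
            self._by_source[key] = len(self._pipes) - 1
        return self._by_source[key]

compiled = []                        # records every real compile for the demo
cache = PipelineCache(lambda src: compiled.append(src) or len(src))
first = cache.get("kernel void k() {}")
again = cache.get("kernel void k() {}")     # served from the cache, no recompile
other = cache.get("kernel void k2() {}")
```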

Per-op GPU kernels

tgl_rmsnorm_save_f64 (1.8× over CPU), tgl_rope_apply_f64 (7×), tgl_silu_fwd_f64 (19×) at training shapes.

Per-op wins translated to only a ~2% per-step improvement, confirming that fusion, not per-op throughput, is the real ceiling. That is why the JIT pipeline above is the load-bearing thesis.

bf16 numerics regime

tgl_matmul_bf16 + matmul_bf16 wrapper. bf16 has f32's exponent range, sidestepping fp16's step-2759 NaN cliff. Training scripts default to forward bf16 with f64 on embedding + LM-head + backward.
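
The exponent-range point can be checked with a short Python sketch. It assumes the usual definition of bf16 as the top 16 bits of an IEEE f32 (rounded to nearest even) and uses the standard library's half-precision `struct` format for fp16. It handles finite inputs only and says nothing about what actually happened at step 2759.

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Round a finite f32 value to bf16: keep the top 16 bits, round to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF

def from_bf16_bits(b: int) -> float:
    """Widen bf16 bits back to a Python float by padding the low 16 bits with zeros."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

def fits_fp16(x: float) -> bool:
    """True if x is within fp16 range (struct raises OverflowError otherwise)."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False

big = 70000.0                                 # above fp16's largest finite value, 65504
in_fp16 = fits_fp16(big)
in_bf16 = from_bf16_bits(to_bf16_bits(big))   # coarser mantissa, but still finite
```

The trade is visible in the result: bf16 keeps the magnitude with only 8 bits of mantissa precision, while fp16 cannot represent the value at all.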

Training scripts (chunked-corpus sampler, 2-block d=64)

tools/train/lm_v3_chunked_bf16_full_long.rail — bf16 forward, 10k-step stable, ~40% wall under f64 baseline.
tools/train/lm_v3_chunked_jit_long.rail — Q/K/V + SwiGLU through fused JIT'd kernels. 200-step matched-seed pilot vs bf16 baseline: trajectory shape preserved, no NaN, both converge. 2.85% step-throughput improvement (3×3 alternating runs, seq=512 d=64 d_ff=192).
tools/train/lm_v3_chunked_fp16_attn_f64_long.rail — falsification experiment ruling out attention as the fp16 culprit.

Tests + benches

7 JIT/GPU smoke tests + 2 benches added. Numerical parity verified: fused QKV max diff 1.67e-6 (f32 floor), silu-hadamard 1.5e-8. DAG matcher 5/5, MSL emitter 4/4, block-integration rstd/ln1/sigmoid parity all green.

Stability

140/140 compiler test suite still green.
2-pass byte-identical self-bootstrap unchanged — this release adds stdlib + foreign decls + Metal sources; the compiler core is untouched.

Full detail in CHANGELOG.md.


28 May 20:18

zemo-g

v5.0.2

bbda5dd

v5.0.2 — Attestation pipeline goes fully pure-Rail

Patch release. The first Rail release attested end-to-end through the Rail substrate — no curl, no shasum, no Python anywhere in the attestation path.

Fixes

stdlib/file.rail: foreign fopen path mode -> int (084791f). fopen returns a file descriptor, not a FILE*; declaring it as ptr bypassed Rail's tagging and tripped a polymorphic untag for odd fds (3 → 1, i.e. stdout). Corrected the foreign return type.
runtime _fopen: unwrap path + mode before open(2) (95d81de). Wrapped Rail strings carry their ptr at the heap header, so open() saw the header tag byte as the path. _fopen now calls _str_unwrap, matching the _rail_read_file contract. Closes the argv-vs-literal path bug.
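
The "3 → 1" symptom in the first fix is what a low-bit integer tag would produce. The sketch below assumes a scheme where tagged integers are `(n << 1) | 1` and a polymorphic untag shifts only odd values; the notes do not spell out Rail's tagging, so treat this as an illustration of the bug class rather than the runtime's actual code.

```python
def tag(n: int) -> int:
    """Assumed scheme: the low bit set to 1 marks an integer."""
    return (n << 1) | 1

def untag(v: int) -> int:
    return v >> 1

def polymorphic_untag(v: int) -> int:
    """Untag values that look like tagged ints (odd); pass even values through as pointers."""
    return untag(v) if v & 1 else v

raw_fd = 3
declared_as_ptr = polymorphic_untag(raw_fd)   # raw fd was never tagged, yet it looks tagged
declared_as_int = untag(tag(raw_fd))          # tagged on return, untagged on use
even_fd_as_ptr = polymorphic_untag(4)         # even fds slip through unchanged
```

Under this assumption only odd descriptors are corrupted, which matches the note that the bug hit odd fds and turned 3 into 1.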

Attestation — retired the shell escape hatches (bbda5dd): attest.sh, sign_attestation.sh, and publish.sh deleted; release_index.rail replaces the Python heredoc in attest_release.sh. attest.rail + publish.rail are now canonical.

Stability — new seed rail_native (3b89d0f5) is at the 2-pass byte-identical fixed point.

Full detail in CHANGELOG.md.


28 May 20:18

zemo-g

v5.0.1

06e25a6

v5.0.1 — Attestation hygiene + codegen tightening

Patch release. No new features.

Codegen — closed the half-applied ARM64 compile-fixes patch (emit_x1 large-immediate path, emit_x1 global-V fallback, O-handler RHS exclusion, ?-handler global-V exclusion). Self-compile fixed point and test suite unchanged from the v5.0.0 baseline.

Attestation hygiene

Backfilled v4.0.0 / v4.0.1 / v4.1.0 release attestations (previously tagged-but-unattested).
.gitignore whitelist for releases/**/rail_native.attestation.json so a new release can't silently lose that file to the rail_native.* wildcard rule.
New docs/RELEASES.md operational runbook + 6 known gotchas.

Known limitation — attest.rail was still blocked on the ftell FFI bug, so attestation in this release (including its own artifacts) still used the documented tools/attest/attest.sh shell escape hatch. Closed in v5.0.2.

Full detail in CHANGELOG.md.


14 May 03:54

zemo-g

v5.0.0

33cda18

v5.0.0 — Self-hosted toolchain (Linux ELF substrate)

Rail produces its own aarch64 Linux ELF binaries. Encoder, assembler, static linker, and ELF writer are pure Rail. For the supported subset of inputs, the build pipeline invokes no external as, ld, or codesign.

What ships

| Module | Lines | Role |
| --- | --- | --- |
| `jit/arm64.rail` | +200 | 23 new encoders for the Linux mnemonic set (ldrb/strb imm/reg-offset/post-index, clz, neg, cmn imm+reg, rev, rev16, fneg, frinta, fcvt s_d / d_s, tbnz, stp/ldp pre/post-index, add/sub imm, asr/lsr/lsl imm). 31/31 byte-verified against `as`. |
| `stdlib/elf.rail` | 175 | Elf64 writer for static aarch64 binaries (one PT_LOAD for tiny, three for full text+data+bss-via-memsz). |
| `tools/v5/elf_asm.rail` | 567 | Section-aware ARM64 assembler + static linker. Two-pass: pass 1 builds (name, section, offset) label table; layout resolves to vaddrs; pass 2 emits bytes with adrp/:lo12: symbol resolution. Handles .text/.data/.bss/.rodata/.section __DATA,__mod_init_func (skipped), .quad/.byte/.long/.ascii/.asciz/.space/.comm/.p2align/.align, plus writeback stp/ldp variants and the adrp+add :lo12: symbol-load idiom. |
| `tools/v5/compile_elf_full.rail` | 80 | Driver: source `.s` → 3-pass pipeline → multi-segment ELF. Patches `e_entry` to `_start`. |
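
The label-table, layout, and adrp/:lo12: steps described for `elf_asm.rail` can be sketched in a few lines of Python. This is a toy under stated assumptions: only `.text`/`.data`, `.quad`, labels, and fixed 4-byte instructions are handled, and the section base addresses are made up for the example.

```python
def pass1(lines):
    """Pass 1: record (section, offset) for every label."""
    labels, offsets, section = {}, {}, ".text"
    for line in lines:
        tok = line.split()
        if not tok:
            continue
        head = tok[0]
        if head in (".text", ".data"):
            section = head
        elif head.endswith(":"):
            labels[head[:-1]] = (section, offsets.get(section, 0))
        elif head == ".quad":
            offsets[section] = offsets.get(section, 0) + 8
        else:  # every ARM64 instruction is 4 bytes
            offsets[section] = offsets.get(section, 0) + 4
    return labels

def layout(labels, bases):
    """Layout: turn (section, offset) into virtual addresses."""
    return {name: bases[sec] + off for name, (sec, off) in labels.items()}

def adrp_lo12(pc, target):
    """Pass 2 helper: 4 KiB page delta for adrp and the low 12 bits for add :lo12:."""
    return (target >> 12) - (pc >> 12), target & 0xFFF

source = [
    ".text", "_start:", "adrp x1, msg", "add x1, x1, :lo12:msg", "svc #0",
    ".data", "msg:", ".quad 0",
]
addrs = layout(pass1(source), {".text": 0x400000, ".data": 0x410010})  # assumed bases
page_delta, lo12 = adrp_lo12(addrs["_start"], addrs["msg"])
```

Two passes are needed because `msg` is referenced before its address is known; the first pass fixes every offset so the second can encode final values.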

End-to-end verified on aarch64 Linux (Pi Zero 2 W)

| Program | ELF size | Result |
| --- | --- | --- |
| `exit42_linux.s` | 132 B | exit 42 |
| `fib_linux.s` | 204 B | `exit(fib(10)) = 55` |
| `hello_linux.s` | 4105 B | prints `"v5 lives\n"`, exit 9 (adrp + add :lo12: + write syscall) |
| `bss_test_linux.s` | 4096 B | BSS counter loop, exit 7 |

Each binary's .text bytes are byte-equivalent to the output of the canonical as + ld toolchain. The pipeline invokes neither an external assembler nor an external linker.
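
To make the "one PT_LOAD for tiny" case concrete, here is a Python sketch that builds a minimal static aarch64 ELF which exits with status 42. It is an independent illustration, not the output of `stdlib/elf.rail`: the load address, the three-instruction body, and the zeroed section-header fields are assumptions. The result happens to come out at 132 bytes, the size the table reports for `exit42_linux.s`, but the actual layout Rail emits is not shown in these notes.

```python
import struct

BASE = 0x400000            # assumed load address
EHDR, PHDR = 64, 56        # Elf64 file-header and program-header sizes

code = struct.pack(
    "<3I",
    0xD2800540,            # mov x0, #42
    0xD2800BA8,            # mov x8, #93  (exit on aarch64 Linux)
    0xD4000001,            # svc #0
)
size = EHDR + PHDR + len(code)

ehdr = struct.pack(
    "<16sHHIQQQIHHHHHH",
    b"\x7fELF\x02\x01\x01" + bytes(9),   # 64-bit, little-endian, ELF version 1
    2,                     # e_type = ET_EXEC
    183,                   # e_machine = EM_AARCH64
    1,                     # e_version
    BASE + EHDR + PHDR,    # e_entry: first instruction after the headers
    EHDR,                  # e_phoff: program header follows the file header
    0, 0,                  # e_shoff, e_flags
    EHDR, PHDR, 1,         # e_ehsize, e_phentsize, e_phnum
    0, 0, 0,               # no section headers
)
phdr = struct.pack(
    "<IIQQQQQQ",
    1, 5,                  # PT_LOAD, flags = read + execute
    0, BASE, BASE,         # p_offset, p_vaddr, p_paddr
    size, size,            # p_filesz, p_memsz
    0x1000,                # p_align
)
image = ehdr + phdr + code
```

Writing `image` to a file and marking it executable should give a runnable binary on aarch64 Linux; that step is left out here because it depends on the host.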

compile.rail Linux pipeline fixes (precursor)

Two long-standing bugs in build_linux fixed: duplicate-symbol awk strip ran only on macOS-cross (now runs on Linux too, list broadened to cover the _rail_* runtime helpers that linux_libc.s redefines), and the macOS-only .section __DATA,__mod_init_func block is now stripped. tools/linux_libc.s gains _memcpy and _fmod — the two libSystem references with no Linux-side definition.

Tag-readiness checklist

compile.rail's real Linux output traverses the new pipeline → byte-equivalent ELF
Linux ELF substrate verified on aarch64 hardware
176 encoders byte-verified against as (89 + 56 + 31)
No regression on ./rail_native test (136/140; 4 pre-existing tensor failures)
No regression on ./rail_native self byte-identical fixed point
CHANGELOG.md v5.0.0 entry
Leak guard CI green

Deferred

Pi self-host of ./rail_native test via Rail-only build — current GC layout reserves 1.2 GB BSS, exceeding Pi Zero 2 W RAM. Heap-size knob is v5.1 scope.
macOS Mach-O end-to-end with dyld stubs — Phase 4b covered the libSystem-free subset; stub-aware Mach-O (LC_LOAD_DYLIB, indirect symbol table, __stubs, __got, bind opcodes) is ~1500 more lines. Tracked as v5.2.


13 May 18:38

zemo-g

v4.1.0

147a7fd

v4.1.0 — Repo hygiene + leak-guard CI

Minor release. Comprehensive cleanup pass over the public tree. No
compiled-binary change; no language or stdlib changes.

CI + leak-prevention (B1)

New workflow .github/workflows/leak-guard.yml — every push and PR
is grep-scanned for the operator-recon pattern set (Tailscale IPs,
internal SSH targets, home-directory paths, internal Slack channel
IDs). Fails the build on any hit. Per-line opt-out via the comment
marker leak-guard-allow. CHANGELOG.md and the guard file are
excluded.
ci.yml triggers extended to include next branch and v* tags.
Test-count assertion generalised from hardcoded 137/137 to any
matching N/N (master is 137, next is 140, future may grow).
.gitignore — explicit ignores for .mcp.json, .ledatic/,
.fleet/, *.pre-*. Closes the casual-git add recurrence path
for the v4.0.1 leak class.
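
A line-oriented scanner with a per-line opt-out, plus the generalised N/N test-count check, might look like the Python sketch below. The regexes are hypothetical examples of the pattern classes named above (the Tailscale pattern covers the 100.64.0.0/10 CGNAT range); the real workflow's pattern set is not reproduced in these notes.

```python
import re

# Hypothetical pattern set; the real guard's regexes are not shown in the notes.
PATTERNS = [
    re.compile(r"\b100\.(6[4-9]|[7-9]\d|1[01]\d|12[0-7])\.\d{1,3}\.\d{1,3}\b"),  # 100.64.0.0/10
    re.compile(r"/Users/[A-Za-z0-9_]+/"),
    re.compile(r"/home/[A-Za-z0-9_]+/"),
]
ALLOW_MARKER = "leak-guard-allow"

def scan(text):
    """Return 1-based line numbers that match any pattern and carry no opt-out marker."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), 1):
        if ALLOW_MARKER in line:
            continue
        if any(p.search(line) for p in PATTERNS):
            hits.append(lineno)
    return hits

def all_tests_green(summary):
    """Accept any N/N result instead of one hardcoded count."""
    return re.search(r"\b(\d+)/\1\b", summary) is not None

sample = "\n".join([
    "host = 100.100.1.2",
    "doc example: 100.100.1.2  # leak-guard-allow",
    "path = /home/alice/build",
    "version 100.2.3.4",
])
hits = scan(sample)
```

A CI step would fail the build whenever `scan` returns a non-empty list for any tracked file.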

Branch hygiene (B2)

21 remote branches deleted from origin:

18 feat/* branches fully merged into next (security A/B/C lanes,
x86 conformance harness, x86 runtime extensions, JIT fixes, docs
refresh, auto-deploy, punch-list integration).
jit (merged into next).
track-mhd-kernel (merged into master).
history-scrub-prep-2026-05-12 (unused experimental branch).

Remaining: master, next, half-s2-kernels (open compiler work),
compound/exp-008-bytes_to_str (halted POC artifact). Down from 26
branches to 4.

Doc pruning (B3)

~104 operator session-handoff files removed from the public tree:

docs/plans/ (74 files) — operator session-planning notes
(SESSION_HANDOFF_, PROMPT_SESSION_, WEEK_PLAN_, PHASE_, etc.).
notes/ orphan files (12).
docs/handoffs/ orphans (8).
jit/ operator notes (9) — SCRATCH, CONTINUATION, SESSION_PROMPT*,
AGENT_DRY_RUN, NEXT_STAGES, closures, floats.
SECURITY_HANDOFF.md — internal Fort Knox punch list (the public
policy lives in SECURITY.md).

Kept: docs referenced from CHANGELOG (notes/bootstrap_convergence_audit_*,
notes/phase3_external_pilot_pitch_v0); jit/ code + README + CHANGELOG;
docs/sessions/ versioned handoffs (CHANGELOG-linked).

Dead-code pruning (B4)

Deleted tools/autocatalyst_v4.rail (broken — referenced runtime/llm.o
which never landed in-tree, flywheel-v1 artifact).
Deleted tools/ac_dashboard.rail (orphan, flywheel dashboard).
Removed Razer3070 live-path references (decommissioned 2026-04-17):
- tools/apps/control.rail — Razer fleet row + curl status segment.
- tools/fleet/fleet_display.rail — razer_status/razer_iter/razer_max/
  razer_ping/razer_loss + RAZER row in the SPI-LCD render.
- tools/mcp/rail_mcp.rail — tool_fleet_status no longer SSHes for
  nvidia-smi / v6_train.log; description updated.
- tools/compile.rail — compile_x86 fallback message no longer
  recommends scp-to-Razer; suggests cross-tools or native host.
  Byte-identical bootstrap preserved.
CLAUDE.md target list: 'Linux x86_64 (Razer WSL)' →
'Linux x86_64 (cross-compile)'.

Structure pass (B5)

Deleted 7 docs with no CHANGELOG or code references:
RAIL_ENGINEER_PROMPT.md, flywheel-data-quality.md,
flywheel-world-research.md, cascade-training.md,
rail-plasma.md, railgpt-from-scratch.md,
self-improving-playbook.md.
Flattened docs/handoffs/ (down to a single entry after B3 prune):
docs/handoffs/2026-05-02.md → docs/handoff-2026-05-02.md.

README polish (B6)

Badge: v3.0.0 → v4.0.0; tagline → "Substrate maturity".
Intro paragraph adds the v4.0.0 substrate-maturity lede (dual-backend
parity, JIT in Rail, 30/30 hard-bench, multi-witness attest).
New Releases section entry for v4.0.0 + a v4.0.1 sanitization note.
History table extended: 7 new rows spanning v3.7.0 → v4.0.1
(previously jumped from v3.0.0 to v2.23.0).

Verification

Leak guard: 0 hits across tracked files for the union pattern set.
Test suite: 140/140 on the v4.1.0 tree (modulo the documented
/tmp/rail_out orphan-process collision when run concurrently with
another rail_native test).
git push on next: clean fast-forward; tag v4.1.0 cuts at 6 commits
past v4.0.1, all CI-green via the new workflow.


13 May 18:03

zemo-g

v4.0.1

70089ba

v4.0.1 — Public-surface sanitization

Patch release. Removes operator-specific infrastructure strings from the
public tree: Tailscale IPs, SSH usernames, home-directory paths, internal
Slack channel IDs, and a stray operator MCP config. No behavior change;
the compiled binary is identical to v4.0.0.

What was scrubbed (~110 files)

Hard SSH targets in tools/attest/*.sh, tools/fleet/*.sh,
tools/fleet/fleet_display.rail, tools/apps/control.rail — replaced
with <witness-user>@<witness-host> / <peer-user>@<peer-host>
placeholders. Callers must supply real values via environment.
Tailscale IPs (100.87.231.45, 100.79.50.108, 100.120.203.70,
100.109.107.54, 100.109.63.37) replaced with role placeholders
(<witness-tailscale-ip> etc.). Tailscale CGNAT-range addresses aren't
reachable from the public internet, but they were operational recon.
Home-directory paths (/Users/ledaticempire/, /Users/user/,
/home/zemog/) replaced with ~/ or <HOME> placeholders across
source, docs, docs/plans/, training fixtures, and Objective-C dispatchers.
Operator service files — tools/fleet/witness.service,
tools/fleet/witness_push.service, tools/fleet/com.ledatic.*.plist
renamed to *.example with <user> / <HOME> placeholders. Existing
install scripts already substitute these at install time.
Operator MCP config — .mcp.json removed from the tree. It was an
operator's Claude Code MCP wiring (path to tools/mcp/rail_mcp.py),
not a build artifact; the MCP server still runs locally with a
per-user .mcp.json outside the repo.
Slack channel IDs / DM names in CHANGELOG.md, README.md,
stdlib/slack_client.rail docblock, docs/sessions/HANDOFF_v3_6.md —
D0ATHQ1BQD7 and brockbro2 replaced with <DM_CHANNEL_ID> and
<test-dm>. Slack IDs don't grant access on their own, but these
were the only remaining specific-channel references in the public surface.

What was intentionally NOT scrubbed

reillygomez13@icloud.com in tools/deploy/gen_*.rail — public
contact email rendered onto ledatic.org pages; meant to be public.
Commit messages in the v4.0.0 surface — rewriting history would break
existing clones for a topology-recon leak, not a credential leak.
The forward tree is clean; git history retains the originals.
~/.ledatic/ path convention — generic project-named subdirectory,
not operator-specific.

Verification

git grep -E "100\.(87|79|109|120)\.|zemog@|user@100|reillygomez@|\
ledaticempire@|/Users/ledaticempire|/Users/user|/home/zemog|\
Detro|D0ATHQ1BQD7|brockbro2"

→ empty across tracked files.

Why a patch release

v4.0.0 carried operator-recon strings inadvertently included via the
multi-witness publisher work on the next lineage. The master lineage
was scrubbed in c4f6050 (2026-05-06) but next hadn't received the
same pass. v4.0.1 brings the substrate-track tree to the same hygiene
standard.


13 May 17:46

zemo-g

v4.0.0

5502cae

v4.0.0 — Substrate maturity

⚠️ Superseded by v4.0.1.
v4.0.0 included operator-recon strings (Tailscale IPs, internal SSH targets,
home-directory paths) that were inadvertently carried over from the next
lineage. v4.0.1 is a documentation/config sanitization patch — the compiled
binary is identical. Please consume v4.0.1 instead.

A major version bump tagged on the next lineage. (master continues the parallel
v3.x attestation/agent track; the two have diverged on purpose.) 216 commits since
v3.11.0 was tagged on master 11 days ago — concurrency, playground, public JIT,
dual-backend parity, 30/30 substrate hard-bench publicly reproducible, browser-side
provenance verifier, four sweeping bug-class closures including a 17-day silent-
corruption fix discovered by a dual-implementation falsification harness.

No public API breaks; the major bump is a positioning marker, not a SemVer surface
change. The substrate-not-model thesis (docs/site/jit.md + tools/bench/repro_30of30.sh
+ https://ledatic.org/verify/<id> + this entire shipping volume) is now publicly
defensible without hand-waving.

What "substrate maturity" means here

A frontier model + a 1KB Rail spec compiles 30/30 on a held-out hard-bench,
reproducible by any partner with an API key. (f2c88b2)
The compiler is genuinely self-hosted on two backends, each with full
same-bug-class parity for the 9 binary ops across both operand orderings.
(ARM64 140/140, x86_64 136/136. 9e16aa7 + c9de6e9 + b223960.)
The verifier is a library, not a tool — import "jit/grade.rail" and a Rail
program can compile + execute new Rail at runtime in the same process. (07366ea)
The provenance pipeline is multi-witness Ed25519, browser-verifiable, with
pulse_id binding closing the prior session-replay gap. (f732176 + 2ada525)
A standalone single-file verifier ships at deterministic SHA — anyone can
grade reports without trusting the original signer's infrastructure.
(ledatic-site 8f5b928)

Compiler & runtime

Concurrency v1. Typed channels + select over a pthread-backed runtime.
import "stdlib/concurrent.rail" exposes rc_chan_make/rc_chan_send/
rc_chan_recv/rc_spawn. int64-only values in v0; 9 + 8 falsification tests
green. (4623e72)
Auto-memo fib silent-corruption — FIXED. compile.rail:2593 memo_store emit
was double-untagging x19 (which was already untagged in the prologue). Writes
went to memo[n/2] while reads keyed memo[n]; pairs collided on shared slots.
fact escaped because it has only one recursive call and never reads back; fib
failed because two recursive reads collide. fib(10) was returning 293886
instead of 55. Found by the JIT REPL agent comparing shell-compile vs JIT on
the same program; one-line fix using x19 directly as the index register.
Falsification at tools/test/auto_memo_fib_correctness.rail. (b89a60b)
Nullary-LHS binary-op bug — FIXED. Any binary op with a top-level nullary
LHS expression was using the prior x0 instead of the freshly-computed value.
compile.rail::emit_x1 fast-path patched; 2-cycle bootstrap byte-identical.
Was the root of the multi-week "CPU substrate is mysteriously wrong" arc; closes
the substrate investigation. (pre-window but retroactively notable)
_rail_join O(n²) — FIXED. Runtime asm rewrite of join: 53.5 GB → 267 MB on
the 8×100K-float dump pattern (200× memory, 120× wall-clock). Diagnostic
harnesses kept at tools/diagnose/dump_pattern_smoke.rail +
tools/diagnose/dump_bisect.rail. (pre-window)
Same-bug-class parity sweep — CLOSED on both backends, both orderings.
Each of 9 binary ops (+, -, *, /, %, <, >, <=, >=) now has
symmetric handling for (int, float) and (float, int) operand orderings.
- x86 (int, float): inline emit check_both + .L<op>_mixed_if. (b223960)
- x86 (float, int): already covered by b223960's symmetric routing.
- ARM64 (int, float): inline emit check_both + .L<op>_mixed_if mirror.
  (9e16aa7 + d4e3696)
- ARM64 (float, int): .L<op>_mixed_fi mirror that takes raw-f64 LHS via
  fmov d0, x1, untags+converts tagged-int RHS via asr + scvtf. For
  _rail_add specifically the dispatch is inserted at the top of .Ladd_heap
  so the string-append path remains correct. (c9de6e9)
- 9 + 9 = 18 falsification tests at tools/test/<op>_{int_float,float_int}_ordering.rail.
3-movk integer literal codegen. emit_load_int at compile.rail:829 now
emits movz + up to 3 movk chunks (bits 0-15, 16-31, 32-47, 48-63) with zero
chunks at ≥#32 skipped, plus a symmetric movn + movk path for negatives.
k16/k32/k48 computed via shl 1 N so constant-folding doesn't bake the
64-bit literal as a constant the seed can't emit. Regression tests t132/t133/t134.
ARM64 floor: 137 → 140. (872424b)
Bootstrap convergence audit — published. The "bootstrap doesn't converge"
claim was falsified: it's a 2-cycle limit cycle. gen0's shipped runtime asm
doesn't necessarily match what gen0's source emits, so cycle 1 typically differs;
gen2 always lands the byte-identical fixed point. See
notes/bootstrap_convergence_audit_2026-05-13.md.
Diagnostic surface. strip_trailing_ws helper replaces trim at 4 multi-line
as/ld result sites so undefined-symbol errors and assembly errors no longer
silently truncate to the first line. shell_quote_arg + shell_quote_join +
join_args_quoted preserve quoted argv through ./rail_native run. (b7f267a,
23fa5fd)
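
The movz/movn + movk decomposition from the integer-literal item above can be sketched in Python. This is an illustration of the general A64 technique, not a transcription of `emit_load_int`: in particular it skips every upper chunk that equals the filler value (0 for positives, 0xFFFF for negatives), which may differ from Rail's exact skipping rule.

```python
MASK16 = 0xFFFF

def load_int_sequence(value, reg="x0"):
    """Return assembly lines that materialise a signed 64-bit literal in reg."""
    assert -(1 << 63) <= value < (1 << 63)
    u = value & 0xFFFFFFFFFFFFFFFF
    chunks = [(u >> shift) & MASK16 for shift in (0, 16, 32, 48)]
    if value >= 0:
        out = [f"movz {reg}, #{chunks[0]}"]              # upper bits start as zeros
        filler = 0
    else:
        out = [f"movn {reg}, #{chunks[0] ^ MASK16}"]     # upper bits start as ones
        filler = MASK16
    for i in (1, 2, 3):
        if chunks[i] != filler:                          # already correct, no movk needed
            out.append(f"movk {reg}, #{chunks[i]}, lsl #{16 * i}")
    return out

small = load_int_sequence(42)
sparse = load_int_sequence(0x100000001)
minus_one = load_int_sequence(-1)
negative = load_int_sequence(-65537)
```

The worst case is one movz or movn followed by three movk instructions, which matches the "up to 3 movk" bound in the notes.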

Self-hosted JIT — now a first-class tool

Public documentation. docs/site/jit.md (109 lines): substrate-honesty
framing, end-to-end test_codegen demo with output, honest capability + limit
table, file map for inspection. Linked from docs/site/index.md. Public surface
at https://ledatic.org/rail/docs/jit.html once deployed. (def1bcd)
JIT-first REPL at tools/repl_jit.rail. ~3000× per-line vs shell-compile
(0.1 ms median JIT-line vs 319 ms shell-line). Persistent definitions across
lines via string-concat buffer; every line re-lowers the full defs + expr at
~0.4 ms. One-time ~21 s REPL compile mitigated by pre-compiled binary at
/tmp/repl_jit_bin. 11/11 smoke green including JIT-hits, ADT-fallback,
parse-error path. tools/repl.rail (the shell-based REPL) untouched. (6ab2666)
JIT-grade fast path at tools/bench/jit_grade_batch.rail, opt-in via the
--jit-fast flag on tools/bench/repro_anthropic.py. Modest 1.18× grading-only
speedup (101.75 s → 86.18 s) — the public bench is API-bound, so default driver
stays shell-only. Lower-hit 14.2 % on synthesized completions, 40 % on canonical
hand-curated shapes. Soundness finding the falsification test earned:
jit_can_lower=1 was UNSOUND as a fast-path predicate — JIT recognizes builtins
(str_eq/str_len/str_at/is_nil) that rail_native rejects; naive routing
would have silently marked fail cases as passes. contains_unsafe_jit_builtin
guard added; 26/26 parity. The bug was simultaneously fixed at the JIT source
itself in jit_can_lower. (163521e)
In-process agentic loop at tools/agent/jit_loop.rail. Single Rail program
that calls the Anthropic API via stdlib/anthropic_client.rail, JIT-compiles
the response via jit/grade.rail, executes, returns. Offline smoke green (fib 10
→ 55, fact 6 → 720). 5/5 in-subset programs JIT cleanly; 5/5 out-of-subset
reject loudly with diagnostics — hard verifier, no silent wrong answers. (07366ea)
JIT lower cluster fixes. Three closing bugs from the JIT integrations:
multi-line let inside fn bodies (parse_fn_body now skip_nls's before body);
st_fail no longer prints (uses mutable arr cell pattern from
stdlib/https_session.rail:64); jit_can_lower substring-checks for unsafe
builtins. (09263e6 + ef88a42 + 1226600)

Substrate hard-bench — publicly reproducible

F-53 closure. tools/bench/substrate_hard_bench.rail +
tools/bench/repro_anthropic.py + tools/bench/repro_30of30.sh +
tools/bench/README.md. Two reproduction paths: Anthropic API (~$15–20 / run,
~15–25 min) and local MLX/vLLM (any 100B+ open-weight on an OpenAI-compatible
endpoint). Partners can now run the 30/30 bench without Studio access. (f2c88b2)
The empirical claim it backs: a frontier model + 1KB Rail spec scores 30/30
on a held-out hard-bench, beating a fine-tuned ensemble. Every band 5/5; 15.4
min wall-clock; multi-witness Ed25519 signed; verifiable at /verify/<id>.

Provenance — v2 with browser-side verify

Pulse_id binding. Attestation v2 binds pulse_id so old attestations
cannot be replayed against new pulses. TOCTOU on weights closed via re-hash
inside the signing transaction. (f732176)
Standalone verifier ships from the ledatic-site repo as a single-file
executable with a deterministic SHA. Third parties can grade reports without
trusting Studio infrastructure. (ledatic-site 8f5b928)
Crypto stdlib hardening. 2 CRITICALs + 7 HIGHs closed via the 2026-05-12
parallel security-audit pass. Crypto stdlib + provenance + fleet posture all
tightened. See memory entry security_audit_2026-05-12. (bf7ff54,
f065e0e, 2ada525, 39e02fe)
DNS-match short-circuit fix. cv_dns_match wildcard-vs-equal-length
path patched so SAN matching can't bypass on edge inputs. (47ca7f1)
Fleet bind to Tailscale IPv4. fleet_agent_v3 no longer listens on
0.0.0.0; bound to the Tailscale IP only. (1de6cff)

x86_64 backend — full conformance

136/136. From the prior 71/79 baseline, the 2026-05-12 punch-list (Agents
A–E in parallel) drove the backend to 100 % conformance via:
- Bit-op runtime + char_from_int + byte_at/set (60cd486)
- **5 `rail...


01 May 13:38

zemo-g

v3.8.0

ea9dbb3

v3.8.0 — Releases physicified (attestation)

v3.8.0 — 2026-05-01 — Releases physicified (attestation)

Every tagged release, every ./rail_native test pass, and every 2-pass self-compile fixed point now binds to a live entropy beacon pulse_id and an Ed25519 signature from the project's fleet0 Pi witness (pk_fp = cac5f21a70564aeb). The signed artifacts ship in releases/<tag>/, are mirrored at https://ledatic.org/releases/<tag>/, and are reproducible offline with https://ledatic.org/attest/verify.sh.

Attestation kernel + drivers

tools/attest/attest.sh — primitive: signs sha256(input) ⊗ pulse_id ⊗ value_hex via the Pi witness using a namespace-prefixed message (attest|v1|...) so attestation sigs can never collide with beacon-witness sigs.
attest_release.sh / attest_test_run.sh / attest_selfhost.sh — call the primitive on the binary, the test log, and the byte-identical fixed point respectively.
verify.sh — re-derives the digest, fetches the public key from ledatic.org/attest/fleet0.pub.pem, runs the Ed25519 verify, exits non-zero on tamper.
tools/attest/backfill_releases.sh — extracts each historical tag's rail_native + tools/compile.rail blobs (no checkout) and signs them. v2.0.0 → v3.7.0 are all attested + downloadable.
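
The binding described for the primitive can be sketched as message construction in Python. Only the `attest|v1|` prefix and the three bound values (sha256 of the input, pulse_id, value_hex) come from the notes; the field order and the `|` joining after the prefix are assumptions, and the Ed25519 signing step is left out because it needs a library outside the standard one.

```python
import hashlib

def attest_message(artifact: bytes, pulse_id: str, value_hex: str) -> bytes:
    """Build the byte string a witness would sign (hypothetical field layout)."""
    digest = hashlib.sha256(artifact).hexdigest()
    return "|".join(["attest", "v1", digest, pulse_id, value_hex]).encode()

binary = b"example release bytes"                   # placeholder artifact
msg_a = attest_message(binary, "pulse-0001", "ab12")
msg_b = attest_message(binary, "pulse-0002", "ab12")          # same artifact, later pulse
msg_c = attest_message(binary + b"!", "pulse-0001", "ab12")   # tampered artifact
```

Because the pulse identifier is part of the signed bytes, a signature over `msg_a` says nothing about `msg_b`, and the fixed prefix keeps attestation messages distinct from any message that does not start with it.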

Cadenced drivers

tools/attest/daily.sh: the LaunchAgent com.ledatic.attest_daily re-attests production every morning at 06:00 local and updates the /builds/latest and /selfhost/latest pointers. Production drift shows up immediately as a "latest" pointer that has fallen behind the live tree.
tools/attest/fleet_status_publisher.sh: the LaunchAgent com.ledatic.fleet_attest polls each fleet node's /health every 60 s, fetches the current pulse, signs the bundle, and publishes to https://ledatic.org/fleet/status.json.

Public surfaces

https://ledatic.org/system — mission control: five panels (beacon · witness · fleet · build · selfhost), each resolves to a signed JSON artifact, refreshes on 2.5 s cadence, self-marks "live" or "stale" based on signature freshness.

v3.7.0 included (was on next branch only)

This release also rolls in v3.7.0 — Float-TCO root fix, mixed-precision inference, parallel rerank — which was tagged on the next branch (2026-04-30) but never merged into master. See v3.7.0 release notes.

CI fix

.github/workflows/ci.yml now builds tools/metal/libtensor_gpu.dylib before running the test suite. Four tensor tests had been failing at link with "Undefined symbols for architecture arm64: _tgl_init …" since the Metal-FFI introduction (~2026-04-15); CI returned to green.

Validation

137/137 tests green
Byte-identical 2-pass self-compile fixed point verified

The verb: Rail releases are no longer claims, they're physical events anchored to real time.


01 May 03:03

zemo-g

v3.7.0

5354c51

v3.7.0 — Float-TCO root fix, mixed-precision inference, parallel rerank

v3.7.0 — 2026-04-30 — Float-TCO root fix, mixed-precision inference, parallel rerank

Substantial substrate work. Seven commits, three real bugs (one fixed at
root, one worked around at source, one falsified), one new feature
(Rail-native mixed-precision GPU inference), one new tool (parallel rerank
wrapper), and a precise reproducer for one bug that stayed open. 137/137
tests green; byte-identical self-bootstrap verified.

Compiler / runtime

Float-TCO root fix. Re-added body_has_float guard to
all_params_int in tools/compile.rail:1992. Closes a 17-day silent
wrong-result bug introduced by commit 82516e4 (2026-04-13) that
caused tail-recursive float helpers (e.g. rms_row_apply) to
reinterpret float bits as ints in register-ABI calls, producing
garbage. Headline affected sites: RMSNorm CPU path, AdamW weight
decay, LayerNorm CPU backward. (7752738)
Runtime-mmap arena (A1.P4). RAIL_ARENA_MB env var (default 1 GB,
scales to 4 GB+ via mmap). Replaces the fixed 512 MB BSS arena that
was bumping the macOS dyld static-data ceiling. envp passthrough via
_rail_envp so env vars reach ./rail_native run child processes.
Long-context training (seq=2048+) now mechanically tractable on
macOS. (7752738)
Diagnostic counters (A1.P5). alloc_stats_snapshot returns 17
ints now: per-class freelist misses (0–11), munmap_count (12),
mmap_large_count (13), arena_spill_count (14), gc_count (15),
arena_spill_bytes (16). Plus RAIL_ARENA_TRACE=1 for stderr-emitted
spill events. (7752738)
Parser multi-line compound expressions. Cons chains, nested calls,
list literals inside unclosed (...)/[...] now parse cleanly. Same
post-tokenizer pass routes both tokenize and tokenize_with_pos.
(7752738)
./rail_native quick. 15 critical tests in ~5s, vs the full
suite's 10+ min. Use between code edits. (7752738)

Inference path

Rail-native mixed-precision matmul. New Metal kernel
matmul_f32x_halfw (fp32 activations × fp16 weights → fp32, fp32
accumulator). Host wrapper tgl_matmul_f32x_halfw_host casts
f64↔f32 once at the GPU boundary; Rail-side surface stays in f64.
stdlib/tensor.rail:matmul_mixed. Primitive correctness:
max_abs_diff = 0.00042 vs f64 reference (vs 0.00082 for the
all-fp16 path — 2× tighter). Byte-deterministic across 100+
sequential calls. New harness at
tools/train/lm_infer_v3_mixed.rail. Right substrate for d=384+
scaling; not the d=256 winner today (CPU+KV-narrowing remains
faster for current model). (ee6bdce)
Parallel rerank wrapper. tools/train/parallel_rerank.sh fans
out N inference subprocesses concurrently with distinct seeds,
pre-compiling the harness once for amortization. Validated 7.1×
wall-clock at N=8, ~11× projected at N=20 — bench projection drops
from 2.25hr to ~13min for 30 prompts × N=20 rerank. --bin <path>
flag (added in v3.7.0 as a follow-on) lets orchestrators skip the
built-in pre-compile. (ee6bdce, 73043e2)
tools/train/parity_check.sh. Three-way diff harness running
CPU (f64), GPU half (existing v3_half), and GPU mixed (new) on the
same checkpoint+prompt+seed. Useful for reasoning about which
precision path is producing which degenerate argmax token under
undertrained models. (ee6bdce)
tools/test/sequential_matmul_half_test.rail. Regression test
verifying tgl_matmul_half_host is byte-deterministic across 1000
sequential calls. Eliminates the "primitive corruption" hypothesis
for any future GPU-collapse investigation. (7752738)
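
The precision argument behind the mixed-precision matmul item above can be illustrated with a scalar dot product in Python. This sketch emulates the number formats with `struct` round-trips; it is not the Metal kernel, and the "all-half" variant (fp16 inputs with an fp32 accumulator) is an assumption about what the comparison path does.

```python
import struct

def f16(x):
    """Round to the nearest fp16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def f32(x):
    """Round to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_mixed(acts, weights):
    """fp32 activations x fp16 weights, accumulated in fp32."""
    acc = 0.0
    for a, w in zip(acts, weights):
        acc = f32(acc + f32(f32(a) * f16(w)))
    return acc

def dot_all_half(acts, weights):
    """Both operands rounded to fp16 first (assumed comparison path)."""
    acc = 0.0
    for a, w in zip(acts, weights):
        acc = f32(acc + f32(f16(a) * f16(w)))
    return acc

acts = [0.1, 0.2, 0.3]
weights = [1.0, 0.5, 0.25]                 # exactly representable in fp16
exact = sum(a * w for a, w in zip(acts, weights))
err_mixed = abs(dot_mixed(acts, weights) - exact)
err_half = abs(dot_all_half(acts, weights) - exact)
```

In this toy case the weights lose nothing in fp16, so all of the extra error in the all-half path comes from rounding the activations, which is the loss the mixed path avoids.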

Diagnostic infrastructure

RAIL_GPU_POOL_DISABLE=1 env flag in tools/metal/tensor_gpu_lib.m.
Bypasses MTLBuffer pool best-fit reuse, forcing fresh
newBufferWithLength on every acquire. Falsifies the standing
hypothesis that pool reuse caused GPU sequential-inference collapse;
with the flag set, collapse is byte-identical to baseline. (7752738)

Inference workaround

tools/train/lm_infer_cpu.rail:gen_loop no longer calls arena_reset.
Eliminates a compiler-codegen interaction (between arena_reset and
multiply-add expressions in float_arr_set) that corrupted
_rail_small_fl[0] with the value being stored, surfacing as
SIGSEGV in _rail_chained_malloc on a subsequent allocation. The bug
was seed-deterministic (hitting ~50% of seeds at --max 128 --k 10) and
silently confounded all post-04-13 single-sample compile-rate
measurements. The workaround eliminates the trigger; per-iteration
intermediate tensors now accumulate in the bump arena, which the
default 1 GB easily holds for bounded inference runs. 30/30 stress
tests pass. The compiler-level fix remains open with a precise
one-line reproducer documented. (f215039)
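
The trade-off in the workaround, letting intermediates pile up in a bump arena instead of resetting each iteration, can be modeled as below. This is an illustrative sketch, not Rail's allocator: the BumpArena class and the 4 MiB per-token footprint are hypothetical, while the 1 GB capacity and the 128-token bound come from the notes above.

```python
class BumpArena:
    """Toy bump allocator: allocations only move an offset forward."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.offset = 0

    def alloc(self, nbytes, align=8):
        start = (self.offset + align - 1) // align * align
        end = start + nbytes
        if end > self.capacity:
            raise MemoryError("arena exhausted")
        self.offset = end
        return start

    def reset(self):
        # The operation the workaround stops calling inside the loop.
        self.offset = 0

def arena_bytes_for_run(tokens, bytes_per_token, capacity=1 << 30):
    """Bytes consumed when no reset happens between generated tokens."""
    arena = BumpArena(capacity)
    for _ in range(tokens):
        arena.alloc(bytes_per_token)
    return arena.offset

# Hypothetical footprint: 4 MiB of intermediates per generated token.
used = arena_bytes_for_run(tokens=128, bytes_per_token=4 << 20)
print("bytes used:", used, "of", 1 << 30)
```

Under these assumed numbers a 128-token run uses half the arena. The same model shows why the approach only suits bounded runs: usage grows linearly with token count until alloc fails.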

Documentation

docs/SESSION_HANDOFF_2026-04-30_EOD.md — full afternoon arc.
docs/SPUR_HANDOFF_2026-04-30.md, docs/MODEL_SESSION_HANDOFF.md,
docs/ROADMAP_2026-04-30.md — morning arc + 6-month framing.
docs/RAIL_ENGINEER_SESSION_PROMPT_2026-04-30_NIGHT.md —
forward-looking prompt for the engineer picking up the open compiler
bug + remaining substrate debt.
Six new design notes: arena-design.md,
arena-leak-fix-strategy.md, data-section-quirk.md,
backlog-deferred-design-notes.md, strict-typecheck-design.md,
garmin-research-notes.md.

What was falsified (negatives)

GPU sequential-collapse "MTLBuffer pool reuse" hypothesis —
falsified via RAIL_GPU_POOL_DISABLE. Collapse byte-identical with
pool off. Surviving cause: fp16 precision compounding across 22
matmul round-trips/token (intrinsic, not a fixable substrate bug).
2026-04-15 "10 MB/step leak" hypothesis — falsified by
arena_reset chain-drain test (10 cycles, byte-tight). The
allocator is sound; remaining leak suspects are GPU-side
(MTLBuffer pool) or gpu_available 0 re-eval churn.
Static 2 GB arena — tested, breaks dyld at link time. 1 GB is
the macOS BSS ceiling; runtime mmap (A1.P4) is the path beyond.
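
The surviving explanation, fp16 rounding compounding across 22 round-trips per token, can be illustrated with a scalar toy model. This is a sketch under a strong simplification: each "round-trip" is modeled as one multiply followed by rounding to fp16, and the weights are arbitrary values near 1. It says nothing about the real model's error magnitude, only about how per-step rounding accumulates.

```python
import struct

def to_f16(x):
    """Round an f64 value to the nearest IEEE half, returned as f64."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def compounding_error(stages=22):
    """Relative error of an fp16-rounded chain vs. f64 after each stage."""
    exact = 1.0
    rounded = 1.0
    errors = []
    for i in range(stages):
        weight = 0.9 + 0.02 * ((7 * i) % 10)   # hypothetical weights in [0.9, 1.08]
        exact = exact * weight                  # f64 reference chain
        rounded = to_f16(rounded * weight)      # round after every stage
        errors.append(abs(rounded - exact) / abs(exact))
    return errors

errors = compounding_error(22)
for stage in (1, 11, 22):
    print("stage", stage, "relative error", errors[stage - 1])
```

Each rounding contributes a relative error of at most 2^-11, so after n stages the worst case is (1 + 2^-11)^n - 1, roughly 1.1% at n=22. That bound is a property of the number format, which is consistent with treating the collapse as intrinsic to the all-fp16 path.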

Memory entries

Fifteen entries in ~/.claude/projects/-Users-user/memory/ capture
today's earned knowledge: substrate findings, discipline rules
(feedback_verify_removals, feedback_diagnostics_first,
feedback_honest_backlog), the dylib investigation chain, the
mixed-precision and parallel-rerank specs, and the segfault
bisection.

18 Apr 16:26

zemo-g

v3.0.0

bafefc5

v3.0.0 — Rail speaks TLS

Rail speaks HTTPS on its own: a complete pure-Rail TLS 1.3 stack, X.509 chain validation, and an HTTPS client, with no transitive C dependency beyond as, ld, and the kernel's BSD sockets.

Live on release day, in production

anthropic_chat "claude-haiku-4-5-20251001" "Reply with exactly: hello from pure rail"
  → HTTP 200, reply "hello from pure rail"       (6.9 s, pure Rail → Anthropic)

slack_post_text "D0ATHQ1BQD7" "v3.0.0 smoke: pure-Rail TLS direct to slack.com"
  → ok=true, HTTP 200 with x-slack-req-id        (1.0 s, pure Rail → Slack)

https_get_url "https://www.amazon.com/"
  → HTTP 200 with set-cookie, x-amz-rid          (4.0 s, RSA chain validated
                                                  to DigiCert Global Root G2)

The full Google Trust Services chain for api.anthropic.com (leaf → WE1 intermediate → GTS Root R4) validates end-to-end to the macOS /etc/ssl/cert.pem trust store.

What shipped

~3,800 lines of new pure-Rail crypto + TLS across 16 new stdlib modules. Every primitive NIST- or RFC-vector validated:

Hashes: SHA-256, SHA-384, SHA-512
MAC / KDF: HMAC-SHA-256, HKDF-Extract/Expand
Symmetric: ChaCha20, Poly1305, ChaCha20-Poly1305 AEAD
Public key: X25519, ECDSA-P256 (16-limb), ECDSA-P384 (24-limb), RSA-PSS / RSA-PKCS1 (128-limb)
X.509 / PKI: ASN.1 DER parser, Base64 decoder, PEM iterator, macOS trust store loader (128 roots)
TLS 1.3: key schedule, handshake state machine, record layer, CertificateVerify dispatch, SAN hostname match, validity period, full chain walker with shortest-path policy
Application: https_get / https_post / https_get_url + URL parser + UDP DNS + live Anthropic + Slack clients
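
As a reference point for one of the primitives listed above, HKDF-Extract and HKDF-Expand (RFC 5869) with SHA-256 fit in a few lines on top of HMAC. The sketch below uses Python's standard hmac and hashlib modules and is not the Rail implementation; the function names are hypothetical. The inputs in the usage lines are those of RFC 5869 test case 1.

```python
import hashlib
import hmac

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes

def hkdf_extract(salt, ikm):
    """HKDF-Extract: PRK = HMAC-Hash(salt, IKM)."""
    if not salt:
        salt = b"\x00" * HASH_LEN   # RFC 5869: empty salt means HashLen zeros
    return hmac.new(salt, ikm, hashlib.sha256).digest()

def hkdf_expand(prk, info, length):
    """HKDF-Expand: T(i) = HMAC-Hash(PRK, T(i-1) | info | i), concatenated."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long")
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

# Inputs from RFC 5869 test case 1.
ikm = bytes.fromhex("0b" * 22)
salt = bytes.fromhex("000102030405060708090a0b0c")
info = bytes.fromhex("f0f1f2f3f4f5f6f7f8f9")

prk = hkdf_extract(salt, ikm)
okm = hkdf_expand(prk, info, 42)
print("PRK:", prk.hex())
print("OKM:", okm.hex())
```

Comparing the printed values against the vectors published in the RFC is the same kind of check the release describes for its own primitives.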

Trust posture

A TLS connection through Rail v3.0.0 refuses to hand plaintext to the caller unless all of the following hold:

The server's CertificateVerify signature checks out (ECDSA-P256-SHA256 or RSA-PSS-SHA256) against the public key in the leaf's SubjectPublicKey.
The leaf's SubjectAltName dNSName entries include a match for the hostname asked for (RFC 6125 §6.4.3 wildcard support).
The current time is within the leaf's notBefore/notAfter window.
The server Finished MAC validates.

Full chain walk to a CA root is available as cc_walk_chain (opt-in primitive).
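
The hostname check in the list above can be sketched as follows. This is a deliberately conservative reading of the RFC 6125 §6.4.3 wildcard rules, not Rail's matcher: it only honors a wildcard that is the entire left-most label, lets it match exactly one label, and refuses patterns as broad as "*.com". The function names are hypothetical.

```python
def dns_name_matches(pattern, hostname):
    """Compare one SAN dNSName entry against the requested hostname."""
    p_labels = pattern.lower().rstrip(".").split(".")
    h_labels = hostname.lower().rstrip(".").split(".")
    # Same label count, and no empty labels in the hostname.
    if len(p_labels) != len(h_labels) or not all(h_labels):
        return False
    if p_labels[0] == "*":
        # Require at least two literal labels after the wildcard.
        if len(p_labels) < 3:
            return False
        return p_labels[1:] == h_labels[1:]
    # No wildcard: exact, case-insensitive match.
    return p_labels == h_labels

def san_matches(dns_names, hostname):
    """True if any dNSName entry in the leaf matches the hostname."""
    return any(dns_name_matches(name, hostname) for name in dns_names)

names = ["slack.com", "*.slack.com"]
for host in ("api.slack.com", "slack.com", "a.b.slack.com", "slack.com.evil.example"):
    print(host, san_matches(names, host))
```

In a gate like the one described above, a False result here would be enough on its own to refuse the connection, independent of the signature, validity, and Finished checks.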

Honest limits

Single cipher suite (TLS_CHACHA20_POLY1305_SHA256), single ECDHE group (x25519), three sig-algs, no session resumption, no 0-RTT, no constant-time guarantees, ~5–8 s per connection (public-key verify dominates). See SECURITY.md before deploying.

Tests

22 pure-Rail TLS tests all green + 116-test core suite still 116/116. Self-compile 2-pass byte-identical preserved. Full details in CHANGELOG.md.

The arc

v1.x — Rail compiled itself.
v2.x — Rail gained networks, trained transformers, shipped to Cloudflare.
v3.0.0 — Rail calls api.anthropic.com by itself.

Rail runs on Rail, the rest runs on physics.
