Releases: zemo-g/rail

v5.1.0 — Rail emits its own GPU kernels

28 May 19:40

Rail now generates Metal Shading Language from its op-DAG, JIT-compiles it via Metal's newLibraryWithSource:, and dispatches the kernel at runtime. Every kernel the GPU executes is emitted by an attested Rail binary — the substrate piece needed for end-to-end attested GPU training.

This release bundles the full GPU substrate the auto-emission pipeline rests on: per-op Metal kernels, the bf16 numerics regime that unlocks stable 10k-step training, the JIT compile foundation, two hand-fused kernels, and the DAG matcher + emitter that drive them.

Auto-emission pipeline

| Module | Role |
| --- | --- |
| stdlib/jit_node.rail | JIT op-DAG types (JitNode, TracedTensor) + tape primitives + jit_nth. Pure-DAG module with no Tensor/transformer dependency, so codegen consumers import without pulling the training stack. |
| stdlib/jit_tape.rail | Execution tracers (jit_leaf, traced_rmsnorm, traced_matmul) layered on jit_node. |
| stdlib/jit_match.rail | DAG matcher. walk_tape returns a list of FuseMatch records (FuseRmsQKV, FuseSiluHad) in tape order, identifying subgraphs that fit known fusion shapes. |
| stdlib/jit_emit.rail | MSL emitter. emit_msl_for_match (tape, FuseMatch) returns the kernel source. Stubbed against known patterns today; v5.2+ replaces with shape-parameterized codegen driven by JitNode data. |
| stdlib/jit.rail | compile/dispatch shim. No longer owns MSL text; jit_compile_* pulls strings from jit_emit. External API unchanged for existing consumers. |
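
To make the matcher's role concrete, here is a minimal Python sketch of walking a tape and reporting fusion candidates in tape order. The `Node` and `Match` classes, the op names, and the index-based inputs are assumptions made for illustration; they are not Rail's `JitNode` or `FuseMatch` definitions, and the real `walk_tape` may use different matching criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    op: str              # hypothetical op name, e.g. "rmsnorm", "matmul"
    inputs: tuple = ()   # indices of earlier tape entries


@dataclass(frozen=True)
class Match:
    kind: str            # "FuseRmsQKV" or "FuseSiluHad"
    nodes: tuple         # tape indices covered by the match


def walk_tape(tape):
    """Return fusion candidates ordered by the tape position of their anchor node."""
    matches = []
    for i, node in enumerate(tape):
        if node.op == "rmsnorm":
            # Three matmuls whose left operand is this rmsnorm: a Q/K/V candidate.
            users = tuple(j for j, n in enumerate(tape)
                          if n.op == "matmul" and n.inputs[:1] == (i,))
            if len(users) == 3:
                matches.append(Match("FuseRmsQKV", (i,) + users))
        elif node.op == "mul" and len(node.inputs) == 2:
            # Elementwise product with a silu(...) operand: a SiLU-Hadamard candidate.
            silu_inputs = [k for k in node.inputs if tape[k].op == "silu"]
            if silu_inputs:
                matches.append(Match("FuseSiluHad", (silu_inputs[0], i)))
    return matches


# Hypothetical tape: x, three weights, rmsnorm(x), Q/K/V matmuls, then silu(gate) * up.
EXAMPLE_TAPE = [
    Node("leaf"), Node("leaf"), Node("leaf"), Node("leaf"),
    Node("rmsnorm", (0,)),
    Node("matmul", (4, 1)), Node("matmul", (4, 2)), Node("matmul", (4, 3)),
    Node("leaf"), Node("leaf"),
    Node("silu", (8,)),
    Node("mul", (10, 9)),
]
```

Calling `walk_tape(EXAMPLE_TAPE)` yields one `FuseRmsQKV` match followed by one `FuseSiluHad` match, which is the shape of result an emitter could then consume one record at a time.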

Hand-fused Metal kernels

| Kernel | Fusion | Speedup |
| --- | --- | --- |
| fused_rmsnorm_qkv | RMSNorm + 3 matmul (Q/K/V) in one threadgroup-per-row dispatch. Exports Q \| K \| V \| rstd \| LN1 from a single packed buffer; rstd + LN1 kept for backward. 4 dispatches → 1. | 35× over the per-op chain at seq=512, d=64 |
| fused_silu_hadamard | SiLU(gate) * up in one elementwise dispatch. Exports h_act \| sigmoid(gate); sigmoid kept for backward. | 18× over the per-op chain |
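
As a CPU-side reference for what the SiLU-Hadamard fusion computes, the sketch below returns both the activation and the sigmoid, then reuses the saved sigmoid in the backward pass. The function names and the list-based layout are assumptions for illustration; this is not the Metal kernel, only the arithmetic it is described as performing.

```python
import math


def silu_hadamard(gate, up):
    """Return (h_act, sig): h_act[i] = silu(gate[i]) * up[i], sig[i] = sigmoid(gate[i])."""
    sig = [1.0 / (1.0 + math.exp(-g)) for g in gate]
    h_act = [g * s * u for g, s, u in zip(gate, sig, up)]
    return h_act, sig


def silu_hadamard_backward(d_h, gate, up, sig):
    """Gradients w.r.t. gate and up, reusing the sigmoid saved by the forward pass."""
    # d/dg [g * sigmoid(g)] = sigmoid(g) * (1 + g * (1 - sigmoid(g)))
    d_gate = [dh * u * s * (1.0 + g * (1.0 - s))
              for dh, g, u, s in zip(d_h, gate, up, sig)]
    d_up = [dh * g * s for dh, g, s in zip(d_h, gate, sig)]
    return d_gate, d_up
```

Saving the sigmoid means the backward pass needs no second exponential per element, which is presumably why the fused kernel exports it alongside h_act.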

JIT compile foundation

  • tgl_jit_compile_from_tmp_file — Metal's newLibraryWithSource: driven from a Rail-emitted .metal file. Returns a pipeline ID cached in g_jit_pipes for reuse across steps.
  • tgl_jit_dispatch_1in1out / _2in1out / _rmsnorm_qkv / _silu_hadamard — per-pattern dispatchers with f64↔f32 host staging.
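
The f64↔f32 host staging mentioned above can be pictured with a small Python sketch: doubles are narrowed into a packed f32 buffer before dispatch and widened again afterwards. The helper names are hypothetical and no GPU is involved; the point is only the rounding that staging introduces.

```python
from array import array


def stage_f64_to_f32(values):
    """Narrow host doubles to a packed f32 buffer (what a kernel would read)."""
    return array("f", values)


def stage_f32_to_f64(buf):
    """Widen an f32 buffer back to doubles for the host side."""
    return array("d", list(buf))


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))
```

Values that are exactly representable in f32 (such as 1.0 or -2.5) survive the round trip unchanged, while a value like 0.1 comes back slightly different, so parity checks against an f64 reference have to tolerate f32-level error.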

Per-op GPU kernels

  • tgl_rmsnorm_save_f64 (1.8× over CPU), tgl_rope_apply_f64 (7×), tgl_silu_fwd_f64 (19×) at training shapes.

Per-op wins translated to ~2% per-step — confirming fusion, not per-op throughput, is the real ceiling, which is why the JIT pipeline above is the load-bearing thesis.
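
One way to read this result is through Amdahl's law, which bounds the step-level speedup when only a fraction of step time is accelerated. The fraction $p$ and the single effective speedup $s$ below are assumptions for illustration; the notes report only the per-kernel speedups and the ~2% step-level figure, read here as a throughput gain $S \approx 1.02$.

$$
S = \frac{1}{(1 - p) + p/s}, \qquad \lim_{s \to \infty} S = \frac{1}{1 - p}, \qquad p = \frac{1 - 1/S}{1 - 1/s}
$$

With $S = 1.02$, this gives $p \approx 0.044$ at $s = 1.8$ and $p \approx 0.021$ at $s = 19$. Under this reading the accelerated ops account for only a few percent of step time, so even an unbounded per-op speedup on them could not lift $S$ much past about 1.05.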

bf16 numerics regime

  • tgl_matmul_bf16 + matmul_bf16 wrapper. bf16 has f32's exponent range, sidestepping fp16's step-2759 NaN cliff. Training scripts default to forward bf16 with f64 on embedding + LM-head + backward.
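
The range claim can be checked with a short Python sketch: bf16 keeps the upper 16 bits of an f32 (sign, 8 exponent bits, 7 mantissa bits), so it covers f32's exponent range, whereas fp16 tops out at 65504. The helper names are hypothetical, NaN handling is omitted, and this illustrates range only; it does not claim that overflow was the cause of the step-2759 failure.

```python
import struct


def f32_to_bf16_bits(x):
    """Round an f32 value to bf16 (round-to-nearest-even); return the 16-bit pattern."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # 0x7FFF plus the lowest kept bit implements ties-to-even on the dropped half.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def bf16_bits_to_f32(b):
    """Widen a bf16 bit pattern back to a Python float via f32."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


def fits_fp16(x):
    """True if x can be packed as IEEE half precision without overflowing."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False
```

For example, 70000.0 does not fit in fp16 but rounds to 70144.0 in bf16: coarser precision, but no overflow.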

Training scripts (chunked-corpus sampler, 2-block d=64)

  • tools/train/lm_v3_chunked_bf16_full_long.rail — bf16 forward, 10k-step stable, ~40% wall under f64 baseline.
  • tools/train/lm_v3_chunked_jit_long.rail — Q/K/V + SwiGLU through fused JIT'd kernels. 200-step matched-seed pilot vs bf16 baseline: trajectory shape preserved, no NaN, both converge. 2.85% step-throughput improvement (3×3 alternating runs, seq=512 d=64 d_ff=192).
  • tools/train/lm_v3_chunked_fp16_attn_f64_long.rail — falsification experiment ruling out attention as the fp16 culprit.

Tests + benches

7 JIT/GPU smoke tests + 2 benches added. Numerical parity verified: fused QKV max diff 1.67e-6 (f32 floor), silu-hadamard 1.5e-8. DAG matcher 5/5, MSL emitter 4/4, block-integration rstd/ln1/sigmoid parity all green.

Stability

  • 140/140 compiler test suite still green.
  • 2-pass byte-identical self-bootstrap unchanged — this release adds stdlib + foreign decls + Metal sources; the compiler core is untouched.

Full detail in CHANGELOG.md.

v5.0.2 — Attestation pipeline goes fully pure-Rail

28 May 20:18

Patch release. The first Rail release attested end-to-end through the Rail substrate — no curl, no shasum, no Python anywhere in the attestation path.

Fixes

  • stdlib/file.rail: foreign fopen path mode -> int (084791f). fopen returns a file descriptor, not a FILE*; declaring it as ptr bypassed Rail's tagging and tripped a polymorphic untag for odd fds (3 → 1, i.e. stdout). Corrected the foreign return type.
  • runtime _fopen: unwrap path + mode before open(2) (95d81de). Wrapped Rail strings carry their ptr at the heap header, so open() saw the header tag byte as the path. _fopen now calls _str_unwrap, matching the _rail_read_file contract. Closes the argv-vs-literal path bug.
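
The "3 → 1" symptom is what low-bit integer tagging would produce, and the sketch below shows the mechanism. It assumes a scheme where tagged integers carry a set low bit and are decoded with a one-bit arithmetic shift, which is an inference from the symptom rather than a documented description of Rail's runtime; the function names are hypothetical.

```python
def tag(n):
    """Encode an integer as a tagged value: shift left, set the low bit."""
    return (n << 1) | 1


def untag(v):
    """Decode a tagged integer with an arithmetic shift right."""
    return v >> 1


def polymorphic_untag(v):
    """Treat odd values as tagged ints and even values as untagged pointers."""
    return v >> 1 if v & 1 else v
```

Under this assumption, a raw fd of 3 that skipped tagging looks like a tagged int and decodes to 1, while an even fd such as 4 passes through untouched, which would explain why only odd descriptors misbehaved. Once the foreign return type is an int, the value is tagged on the way in and `untag(tag(3))` gives 3 back.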

Attestation — retired the shell escape hatches (bbda5dd): attest.sh, sign_attestation.sh, and publish.sh deleted; release_index.rail replaces the Python heredoc in attest_release.sh. attest.rail + publish.rail are now canonical.

Stability — new seed rail_native (3b89d0f5) is at the 2-pass byte-identical fixed point.

Full detail in CHANGELOG.md.

v5.0.1 — Attestation hygiene + codegen tightening

28 May 20:18

Patch release. No new features.

Codegen — closed the half-applied ARM64 compile-fixes patch (emit_x1 large-immediate path, emit_x1 global-V fallback, O-handler RHS exclusion, ?-handler global-V exclusion). Self-compile fixed point and test suite unchanged from the v5.0.0 baseline.

Attestation hygiene

  • Backfilled v4.0.0 / v4.0.1 / v4.1.0 release attestations (previously tagged-but-unattested).
  • .gitignore whitelist for releases/**/rail_native.attestation.json so a new release can't silently lose that file to the rail_native.* wildcard rule.
  • New docs/RELEASES.md operational runbook + 6 known gotchas.

Known limitation: attest.rail was still blocked on the ftell FFI bug, so attestation in this release (including its own artifacts) still used the documented tools/attest/attest.sh shell escape hatch. Closed in v5.0.2.

Full detail in CHANGELOG.md.

v5.0.0 — Self-hosted toolchain (Linux ELF substrate)

14 May 03:54

Rail produces its own aarch64 Linux ELF binaries. Encoder, assembler, static linker, and ELF writer are pure Rail. For the supported subset of inputs, the build pipeline invokes no external as, ld, or codesign.

What ships

| Module | Lines | Role |
| --- | --- | --- |
| jit/arm64.rail | +200 | 23 new encoders for the Linux mnemonic set (ldrb/strb imm/reg-offset/post-index, clz, neg, cmn imm+reg, rev, rev16, fneg, frinta, fcvt s_d / d_s, tbnz, stp/ldp pre/post-index, add/sub imm, asr/lsr/lsl imm). 31/31 byte-verified against as. |
| stdlib/elf.rail | 175 | Elf64 writer for static aarch64 binaries (one PT_LOAD for tiny, three for full text+data+bss-via-memsz). |
| tools/v5/elf_asm.rail | 567 | Section-aware ARM64 assembler + static linker. Two-pass: pass 1 builds (name, section, offset) label table; layout resolves to vaddrs; pass 2 emits bytes with adrp/:lo12: symbol resolution. Handles .text/.data/.bss/.rodata/.section __DATA,__mod_init_func (skipped), .quad/.byte/.long/.ascii/.asciz/.space/.comm/.p2align/.align, plus writeback stp/ldp variants and the adrp+add :lo12: symbol-load idiom. |
| tools/v5/compile_elf_full.rail | 80 | Driver: source .s → 3-pass pipeline → multi-segment ELF. Patches e_entry to _start. |

End-to-end verified on aarch64 Linux (Pi Zero 2 W)

| Program | ELF size | Result |
| --- | --- | --- |
| exit42_linux.s | 132 B | exit 42 |
| fib_linux.s | 204 B | exit(fib(10)) = 55 |
| hello_linux.s | 4105 B | prints "v5 lives\n", exit 9 (adrp + add :lo12: + write syscall) |
| bss_test_linux.s | 4096 B | BSS counter loop, exit 7 |

Each binary's .text bytes are byte-equivalent to canonical as + ld output. Pipeline invokes neither external assembler nor linker.
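
The 132 B figure for exit42_linux.s is consistent with a 64-byte ELF64 header, one 56-byte program header, and three 4-byte instructions (64 + 56 + 12 = 132). The Python sketch below builds such a file to show what a single-PT_LOAD static aarch64 ELF contains. The load address, segment flags, and alignment are assumptions; this is not the code in stdlib/elf.rail, and the notes do not confirm that the shipped binary uses exactly this layout.

```python
import struct

BASE = 0x400000                  # assumed load address
EHDR_SIZE, PHDR_SIZE = 64, 56    # fixed sizes of Elf64_Ehdr and Elf64_Phdr

# mov x0, #42 ; mov x8, #93 (exit syscall number) ; svc #0
CODE = struct.pack("<III", 0xD2800540, 0xD2800BA8, 0xD4000001)


def build_exit42_elf():
    """Return the bytes of a minimal static aarch64 ELF that exits with status 42."""
    total = EHDR_SIZE + PHDR_SIZE + len(CODE)
    entry = BASE + EHDR_SIZE + PHDR_SIZE          # code starts right after the headers
    # e_ident: magic, ELFCLASS64, little-endian, version 1, System V ABI, padding
    ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
    ehdr = ident + struct.pack(
        "<HHIQQQIHHHHHH",
        2,            # e_type = ET_EXEC
        183,          # e_machine = EM_AARCH64
        1,            # e_version
        entry,        # e_entry
        EHDR_SIZE,    # e_phoff: program header follows the ELF header
        0,            # e_shoff: no section headers
        0,            # e_flags
        EHDR_SIZE,    # e_ehsize
        PHDR_SIZE,    # e_phentsize
        1,            # e_phnum
        0, 0, 0,      # e_shentsize, e_shnum, e_shstrndx
    )
    # One PT_LOAD (type 1), flags R+X (5), mapping the whole file at BASE.
    phdr = struct.pack("<IIQQQQQQ", 1, 5, 0, BASE, BASE, total, total, 0x1000)
    return ehdr + phdr + CODE
```

Writing the result to a file and marking it executable should give a runnable binary on an aarch64 Linux host, though that has not been checked against the Pi Zero 2 W setup described here.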

compile.rail Linux pipeline fixes (precursor)

Two long-standing bugs in build_linux fixed: duplicate-symbol awk strip ran only on macOS-cross (now runs on Linux too, list broadened to cover the _rail_* runtime helpers that linux_libc.s redefines), and the macOS-only .section __DATA,__mod_init_func block is now stripped. tools/linux_libc.s gains _memcpy and _fmod — the two libSystem references with no Linux-side definition.

Tag-readiness checklist

  • compile.rail's real Linux output traverses the new pipeline → byte-equivalent ELF
  • Linux ELF substrate verified on aarch64 hardware
  • 176 encoders byte-verified against as (89 + 56 + 31)
  • No regression on ./rail_native test (136/140; 4 pre-existing tensor failures)
  • No regression on ./rail_native self byte-identical fixed point
  • CHANGELOG.md v5.0.0 entry
  • Leak guard CI green

Deferred

  • Pi self-host of ./rail_native test via Rail-only build — current GC layout reserves 1.2 GB BSS, exceeding Pi Zero 2 W RAM. Heap-size knob is v5.1 scope.
  • macOS Mach-O end-to-end with dyld stubs — Phase 4b covered the libSystem-free subset; stub-aware Mach-O (LC_LOAD_DYLIB, indirect symbol table, __stubs, __got, bind opcodes) is ~1500 more lines. Tracked as v5.2.

🤖 Generated with Claude Code

v4.1.0 — Repo hygiene + leak-guard CI

13 May 18:38

Minor release. Comprehensive cleanup pass over the public tree. No
compiled-binary change; no language or stdlib changes.

CI + leak-prevention (B1)

  • New workflow .github/workflows/leak-guard.yml — every push and PR
    is grep-scanned for the operator-recon pattern set (Tailscale IPs,
    internal SSH targets, home-directory paths, internal Slack channel
    IDs). Fails the build on any hit. Per-line opt-out via the comment
    marker leak-guard-allow. CHANGELOG.md and the guard file are
    excluded.
  • ci.yml triggers extended to include next branch and v* tags.
    Test-count assertion generalised from hardcoded 137/137 to any
    matching N/N (master is 137, next is 140, future may grow).
  • .gitignore — explicit ignores for .mcp.json, .ledatic/,
    .fleet/, *.pre-*. Closes the casual-git add recurrence path
    for the v4.0.1 leak class.
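
A grep-style guard with a per-line opt-out can be sketched in a few lines of Python. The two patterns below (the 100.64.0.0/10 CGNAT range and home-directory paths) are stand-ins chosen for illustration; the workflow's actual pattern set is not listed in these notes, and `scan` is a hypothetical helper, not the CI script.

```python
import re

# Hypothetical pattern set; the real workflow's patterns are not shown in the notes.
PATTERNS = [
    # IPv4 addresses in 100.64.0.0/10 (second octet 64..127)
    re.compile(r"\b100\.(?:6[4-9]|[7-9]\d|1[01]\d|12[0-7])\.\d{1,3}\.\d{1,3}\b"),
    # Absolute home-directory paths
    re.compile(r"/(?:Users|home)/[A-Za-z0-9_.-]+/"),
]
ALLOW_MARKER = "leak-guard-allow"


def scan(text):
    """Return (line_number, line) pairs that hit a pattern and carry no opt-out marker."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if ALLOW_MARKER in line:
            continue  # per-line opt-out
        if any(p.search(line) for p in PATTERNS):
            hits.append((lineno, line))
    return hits
```

A CI step built on this would fail the build whenever `scan` returns a non-empty list for any tracked file.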

Branch hygiene (B2)

21 remote branches deleted from origin:

  • 18 feat/* branches fully merged into next (security A/B/C lanes,
    x86 conformance harness, x86 runtime extensions, JIT fixes, docs
    refresh, auto-deploy, punch-list integration).
  • jit (merged into next).
  • track-mhd-kernel (merged into master).
  • history-scrub-prep-2026-05-12 (unused experimental branch).

Remaining: master, next, half-s2-kernels (open compiler work),
compound/exp-008-bytes_to_str (halted POC artifact). Down from 26
branches to 4.

Doc pruning (B3)

~104 operator session-handoff files removed from the public tree:

  • docs/plans/ (74 files) — operator session-planning notes
    (SESSION_HANDOFF_, PROMPT_SESSION_, WEEK_PLAN_, PHASE_, etc.).
  • notes/ orphan files (12).
  • docs/handoffs/ orphans (8).
  • jit/ operator notes (9) — SCRATCH, CONTINUATION, SESSION_PROMPT*,
    AGENT_DRY_RUN, NEXT_STAGES, closures, floats.
  • SECURITY_HANDOFF.md — internal Fort Knox punch list (the public
    policy lives in SECURITY.md).

Kept: docs referenced from CHANGELOG (notes/bootstrap_convergence_audit_*,
notes/phase3_external_pilot_pitch_v0); jit/ code + README + CHANGELOG;
docs/sessions/ versioned handoffs (CHANGELOG-linked).

Dead-code pruning (B4)

  • Deleted tools/autocatalyst_v4.rail (broken — referenced runtime/llm.o
    which never landed in-tree, flywheel-v1 artifact).
  • Deleted tools/ac_dashboard.rail (orphan, flywheel dashboard).
  • Removed Razer3070 live-path references (decommissioned 2026-04-17):
    • tools/apps/control.rail — Razer fleet row + curl status segment.
    • tools/fleet/fleet_display.rail — razer_status/razer_iter/razer_max/
      razer_ping/razer_loss + RAZER row in the SPI-LCD render.
    • tools/mcp/rail_mcp.rail — tool_fleet_status no longer SSHes for
      nvidia-smi / v6_train.log; description updated.
    • tools/compile.rail — compile_x86 fallback message no longer
      recommends scp-to-Razer; suggests cross-tools or native host.
      Byte-identical bootstrap preserved.
  • CLAUDE.md target list: 'Linux x86_64 (Razer WSL)' →
    'Linux x86_64 (cross-compile)'.

Structure pass (B5)

  • Deleted 7 docs with no CHANGELOG or code references:
    RAIL_ENGINEER_PROMPT.md, flywheel-data-quality.md,
    flywheel-world-research.md, cascade-training.md,
    rail-plasma.md, railgpt-from-scratch.md,
    self-improving-playbook.md.
  • Flattened docs/handoffs/ (down to a single entry after B3 prune):
    docs/handoffs/2026-05-02.md → docs/handoff-2026-05-02.md.

README polish (B6)

  • Badge: v3.0.0 → v4.0.0; tagline → "Substrate maturity".
  • Intro paragraph adds the v4.0.0 substrate-maturity lede (dual-backend
    parity, JIT in Rail, 30/30 hard-bench, multi-witness attest).
  • New Releases section entry for v4.0.0 + a v4.0.1 sanitization note.
  • History table extended: 7 new rows spanning v3.7.0 → v4.0.1
    (previously jumped from v3.0.0 to v2.23.0).

Verification

  • Leak guard: 0 hits across tracked files for the union pattern set.
  • Test suite: 140/140 on the v4.1.0 tree (modulo the documented
    /tmp/rail_out orphan-process collision when run concurrently with
    another rail_native test).
  • git push on next: clean fast-forward; tag v4.1.0 cuts at 6 commits
    past v4.0.1, all CI-green via the new workflow.

v4.0.1 — Public-surface sanitization

13 May 18:03

Patch release. Removes operator-specific infrastructure strings from the
public tree: Tailscale IPs, SSH usernames, home-directory paths, internal
Slack channel IDs, and a stray operator MCP config. No behavior change;
the compiled binary is identical to v4.0.0.

What was scrubbed (~110 files)

  • Hard SSH targets in tools/attest/*.sh, tools/fleet/*.sh,
    tools/fleet/fleet_display.rail, tools/apps/control.rail — replaced
    with <witness-user>@<witness-host> / <peer-user>@<peer-host>
    placeholders. Callers must supply real values via environment.
  • Tailscale IPs (100.87.231.45, 100.79.50.108, 100.120.203.70,
    100.109.107.54, 100.109.63.37) replaced with role placeholders
    (<witness-tailscale-ip> etc.). Tailscale CGNAT-range addresses aren't
    reachable from the public internet, but they were operational recon.
  • Home-directory paths (/Users/ledaticempire/, /Users/user/,
    /home/zemog/) replaced with ~/ or <HOME> placeholders across
    source, docs, docs/plans/, training fixtures, and Objective-C dispatchers.
  • Operator service files: tools/fleet/witness.service,
    tools/fleet/witness_push.service, tools/fleet/com.ledatic.*.plist
    renamed to *.example with <user> / <HOME> placeholders. Existing
    install scripts already substitute these at install time.
  • Operator MCP config: .mcp.json removed from the tree. It was an
    operator's Claude Code MCP wiring (path to tools/mcp/rail_mcp.py),
    not a build artifact; the MCP server still runs locally with a
    per-user .mcp.json outside the repo.
  • Slack channel IDs / DM names in CHANGELOG.md, README.md,
    stdlib/slack_client.rail docblock, docs/sessions/HANDOFF_v3_6.md:
    D0ATHQ1BQD7 and brockbro2 replaced with <DM_CHANNEL_ID> and
    <test-dm>. Slack IDs don't grant access on their own, but these
    were the only remaining specific-channel references in the public surface.

What was intentionally NOT scrubbed

  • reillygomez13@icloud.com in tools/deploy/gen_*.rail — public
    contact email rendered onto ledatic.org pages; meant to be public.
  • Commit messages in the v4.0.0 surface — rewriting history would break
    existing clones for a topology-recon leak, not a credential leak.
    The forward tree is clean; git history retains the originals.
  • ~/.ledatic/ path convention — generic project-named subdirectory,
    not operator-specific.

Verification

git grep -E "100\.(87|79|109|120)\.|zemog@|user@100|reillygomez@|\
ledaticempire@|/Users/ledaticempire|/Users/user|/home/zemog|\
Detro|D0ATHQ1BQD7|brockbro2"

→ empty across tracked files.

Why a patch release

v4.0.0 carried operator-recon strings inadvertently included via the
multi-witness publisher work on the next lineage. The master lineage
was scrubbed in c4f6050 (2026-05-06) but next hadn't received the
same pass. v4.0.1 brings the substrate-track tree to the same hygiene
standard.

v4.0.0 — Substrate maturity

13 May 17:46

⚠️ Superseded by v4.0.1.
v4.0.0 included operator-recon strings (Tailscale IPs, internal SSH targets,
home-directory paths) that were inadvertently carried over from the next
lineage. v4.0.1 is a documentation/config sanitization patch — the compiled
binary is identical. Please consume v4.0.1 instead.


A major version bump tagged on the next lineage. (master continues the parallel
v3.x attestation/agent track; the two have diverged on purpose.) 216 commits since
v3.11.0 was tagged on master 11 days ago — concurrency, playground, public JIT,
dual-backend parity, 30/30 substrate hard-bench publicly reproducible, browser-side
provenance verifier, four sweeping bug-class closures including a 17-day silent-
corruption fix discovered by a dual-implementation falsification harness.

No public API breaks; the major bump is a positioning marker, not a SemVer surface
change. The substrate-not-model thesis (docs/site/jit.md + tools/bench/repro_30of30.sh
+ https://ledatic.org/verify/<id> + this entire shipping volume) is now publicly
defensible without hand-waving.

What "substrate maturity" means here

  • A frontier model + a 1KB Rail spec compiles 30/30 on a held-out hard-bench,
    reproducible by any partner with an API key. (f2c88b2)
  • The compiler is genuinely self-hosted on two backends, each with full
    same-bug-class parity for the 9 binary ops across both operand orderings.
    (ARM64 140/140, x86_64 136/136. 9e16aa7 + c9de6e9 + b223960.)
  • The verifier is a library, not a tool — import "jit/grade.rail" and a Rail
    program can compile + execute new Rail at runtime in the same process. (07366ea)
  • The provenance pipeline is multi-witness Ed25519, browser-verifiable, with
    pulse_id binding closing the prior session-replay gap. (f732176 + 2ada525)
  • A standalone single-file verifier ships at deterministic SHA — anyone can
    grade reports without trusting the original signer's infrastructure.
    (ledatic-site 8f5b928)

Compiler & runtime

  • Concurrency v1. Typed channels + select over a pthread-backed runtime.
    import "stdlib/concurrent.rail" exposes rc_chan_make/rc_chan_send/
    rc_chan_recv/rc_spawn. int64-only values in v0; 9 + 8 falsification tests
    green. (4623e72)
  • Auto-memo fib silent-corruption — FIXED. compile.rail:2593 memo_store emit
    was double-untagging x19 (which was already untagged in the prologue). Writes
    went to memo[n/2] while reads keyed memo[n]; pairs collided on shared slots.
    fact escaped because it has only one recursive call and never reads back; fib
    failed because two recursive reads collide. fib(10) was returning 293886
    instead of 55. Found by the JIT REPL agent comparing shell-compile vs JIT on
    the same program; one-line fix using x19 directly as the index register.
    Falsification at tools/test/auto_memo_fib_correctness.rail. (b89a60b)
  • Nullary-LHS binary-op bug — FIXED. Any binary op with a top-level nullary
    LHS expression was using the prior x0 instead of the freshly-computed value.
    compile.rail::emit_x1 fast-path patched; 2-cycle bootstrap byte-identical.
    Was the root of the multi-week "CPU substrate is mysteriously wrong" arc; closes
    the substrate investigation. (pre-window but retroactively notable)
  • _rail_join O(n²) — FIXED. Runtime asm rewrite of join: 53.5 GB → 267 MB on
    the 8×100K-float dump pattern (200× memory, 120× wall-clock). Diagnostic
    harnesses kept at tools/diagnose/dump_pattern_smoke.rail +
    tools/diagnose/dump_bisect.rail. (pre-window)
  • Same-bug-class parity sweep — CLOSED on both backends, both orderings.
    Each of 9 binary ops (+, -, *, /, %, <, >, <=, >=) now has
    symmetric handling for (int, float) and (float, int) operand orderings.
    • x86 (int, float): inline emit check_both + .L<op>_mixed_if. (b223960)
    • x86 (float, int): already covered by b223960's symmetric routing.
    • ARM64 (int, float): inline emit check_both + .L<op>_mixed_if mirror.
      (9e16aa7 + d4e3696)
    • ARM64 (float, int): .L<op>_mixed_fi mirror that takes raw-f64 LHS via
      fmov d0, x1, untags+converts tagged-int RHS via asr + scvtf. For
      _rail_add specifically the dispatch is inserted at the top of .Ladd_heap
      so the string-append path remains correct. (c9de6e9)
    • 9 + 9 = 18 falsification tests at tools/test/<op>_{int_float,float_int}_ordering.rail.
  • 3-movk integer literal codegen. emit_load_int at compile.rail:829 now
    emits movz + up to 3 movk chunks (bits 0-15, 16-31, 32-47, 48-63) with zero
    chunks at ≥#32 skipped, plus a symmetric movn + movk path for negatives.
    k16/k32/k48 computed via shl 1 N so constant-folding doesn't bake the
    64-bit literal as a constant the seed can't emit. Regression tests t132/t133/t134.
    ARM64 floor: 137 → 140. (872424b)
  • Bootstrap convergence audit — published. The "bootstrap doesn't converge"
    claim was falsified: it's a 2-cycle limit cycle. gen0's shipped runtime asm
    doesn't necessarily match what gen0's source emits, so cycle 1 typically differs;
    gen2 always lands the byte-identical fixed point. See
    notes/bootstrap_convergence_audit_2026-05-13.md.
  • Diagnostic surface. strip_trailing_ws helper replaces trim at 4 multi-line
    as/ld result sites so undefined-symbol errors and assembly errors no longer
    silently truncate to the first line. shell_quote_arg + shell_quote_join +
    join_args_quoted preserve quoted argv through ./rail_native run. (b7f267a,
    23fa5fd)
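
The movz/movn + movk scheme described in the integer-literal bullet above can be sketched in Python, together with a tiny interpreter that checks the emitted sequence reconstructs the value. This is an illustration, not the code at compile.rail:829: the function names are hypothetical, and the sketch skips every redundant upper chunk, whereas the notes only say that zero chunks at ≥#32 are skipped.

```python
MASK64 = (1 << 64) - 1


def emit_load_int(value, reg="x0"):
    """Emit movz/movn + movk lines that materialize a signed 64-bit integer."""
    assert -(1 << 63) <= value < (1 << 63)
    u = value & MASK64
    chunks = [(u >> shift) & 0xFFFF for shift in (0, 16, 32, 48)]
    if value >= 0:
        lines = [f"movz {reg}, #{chunks[0]}"]            # upper bits start as zeros
        implied = 0x0000
    else:
        lines = [f"movn {reg}, #{chunks[0] ^ 0xFFFF}"]   # upper bits start as ones
        implied = 0xFFFF
    for i in (1, 2, 3):
        if chunks[i] != implied:                         # skip chunks already correct
            lines.append(f"movk {reg}, #{chunks[i]}, lsl #{16 * i}")
    return lines


def simulate(lines):
    """Interpret the three mnemonics above and return the signed 64-bit result."""
    x = 0
    for line in lines:
        mnemonic, rest = line.split(" ", 1)
        parts = [p.strip() for p in rest.split(",")]
        imm = int(parts[1].lstrip("#"))
        shift = int(parts[2].split("#")[1]) if len(parts) > 2 else 0
        if mnemonic == "movz":
            x = imm << shift
        elif mnemonic == "movn":
            x = ~(imm << shift) & MASK64
        else:  # movk replaces one 16-bit field and keeps the other 48 bits
            x = (x & ~(0xFFFF << shift) & MASK64) | (imm << shift)
    return x - (1 << 64) if x >> 63 else x
```

Small values need a single instruction, and a full 64-bit literal needs at most four, which matches the "movz + up to 3 movk" description.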

Self-hosted JIT — now a first-class tool

  • Public documentation. docs/site/jit.md (109 lines): substrate-honesty
    framing, end-to-end test_codegen demo with output, honest capability + limit
    table, file map for inspection. Linked from docs/site/index.md. Public surface
    at https://ledatic.org/rail/docs/jit.html once deployed. (def1bcd)
  • JIT-first REPL at tools/repl_jit.rail. ~3000× per-line vs shell-compile
    (0.1 ms median JIT-line vs 319 ms shell-line). Persistent definitions across
    lines via string-concat buffer; every line re-lowers the full defs + expr at
    ~0.4 ms. One-time ~21 s REPL compile mitigated by pre-compiled binary at
    /tmp/repl_jit_bin. 11/11 smoke green including JIT-hits, ADT-fallback,
    parse-error path. tools/repl.rail (the shell-based REPL) untouched. (6ab2666)
  • JIT-grade fast path at tools/bench/jit_grade_batch.rail, opt-in via the
    --jit-fast flag on tools/bench/repro_anthropic.py. Modest 1.18× grading-only
    speedup (101.75 s → 86.18 s); the public bench is API-bound, so the default driver
    stays shell-only. Lower-hit 14.2 % on synthesized completions, 40 % on canonical
    hand-curated shapes. Soundness finding the falsification test earned:
    jit_can_lower=1 was UNSOUND as a fast-path predicate — JIT recognizes builtins
    (str_eq/str_len/str_at/is_nil) that rail_native rejects; naive routing
    would have silently marked fail cases as passes. contains_unsafe_jit_builtin
    guard added; 26/26 parity. The bug was simultaneously fixed at the JIT source
    itself in jit_can_lower. (163521e)
  • In-process agentic loop at tools/agent/jit_loop.rail. Single Rail program
    that calls the Anthropic API via stdlib/anthropic_client.rail, JIT-compiles
    the response via jit/grade.rail, executes, returns. Offline smoke green (fib 10
    → 55, fact 6 → 720). 5/5 in-subset programs JIT cleanly; 5/5 out-of-subset
    reject loudly with diagnostics — hard verifier, no silent wrong answers. (07366ea)
  • JIT lower cluster fixes. Three closing bugs from the JIT integrations:
    multi-line let inside fn bodies (parse_fn_body now skip_nls's before body);
    st_fail no longer prints (uses mutable arr cell pattern from
    stdlib/https_session.rail:64); jit_can_lower substring-checks for unsafe
    builtins. (09263e6 + ef88a42 + 1226600)

Substrate hard-bench — publicly reproducible

  • F-53 closure. tools/bench/substrate_hard_bench.rail +
    tools/bench/repro_anthropic.py + tools/bench/repro_30of30.sh +
    tools/bench/README.md. Two reproduction paths: Anthropic API (~$15–20 / run,
    ~15–25 min) and local MLX/vLLM (any 100B+ open-weight on an OpenAI-compatible
    endpoint). Partners can now run the 30/30 bench without Studio access. (f2c88b2)
  • The empirical claim it backs: a frontier model + 1KB Rail spec scores 30/30
    on a held-out hard-bench, beating a fine-tuned ensemble. Every band 5/5; 15.4
    min wall-clock; multi-witness Ed25519 signed; verifiable at /verify/<id>.

Provenance — v2 with browser-side verify

  • Pulse_id binding. Attestation v2 binds pulse_id so old attestations
    cannot be replayed against new pulses. TOCTOU on weights closed via re-hash
    inside the signing transaction. (f732176)
  • Standalone verifier ships from the ledatic-site repo as a single-file
    executable with a deterministic SHA. Third parties can grade reports without
    trusting Studio infrastructure. (ledatic-site 8f5b928)
  • Crypto stdlib hardening. 2 CRITICALs + 7 HIGHs closed via the 2026-05-12
    parallel security-audit pass. Crypto stdlib + provenance + fleet posture all
    tightened. See memory entry security_audit_2026-05-12. (bf7ff54,
    f065e0e, 2ada525, 39e02fe)
  • DNS-match short-circuit fix. cv_dns_match wildcard-vs-equal-length
    path patched so SAN matching can't bypass on edge inputs. (47ca7f1)
  • Fleet bind to Tailscale IPv4. fleet_agent_v3 no longer listens on
    0.0.0.0; bound to the Tailscale IP only. (1de6cff)

x86_64 backend — full conformance

  • 136/136. From the prior 71/79 baseline, the 2026-05-12 punch-list (Agents
    A–E in parallel) drove the backend to 100 % conformance via:
    • Bit-op runtime + char_from_int + byte_at/set (60cd486)
    • **5 `rail...

v3.8.0 — Releases physicified (attestation)

01 May 13:38

v3.8.0 — 2026-05-01 — Releases physicified (attestation)

Every tagged release, every ./rail_native test pass, and every 2-pass self-compile fixed point now binds to a live entropy beacon pulse_id and an Ed25519 signature from the project's fleet0 Pi witness (pk_fp = cac5f21a70564aeb). The signed artifacts ship in releases/<tag>/, are mirrored at https://ledatic.org/releases/<tag>/, and are reproducible offline with https://ledatic.org/attest/verify.sh.

Attestation kernel + drivers

  • tools/attest/attest.sh — primitive: signs sha256(input) ⊗ pulse_id ⊗ value_hex via the Pi witness using a namespace-prefixed message (attest|v1|...) so attestation sigs can never collide with beacon-witness sigs.
  • attest_release.sh / attest_test_run.sh / attest_selfhost.sh — call the primitive on the binary, the test log, and the byte-identical fixed point respectively.
  • verify.sh — re-derives the digest, fetches the public key from ledatic.org/attest/fleet0.pub.pem, runs the Ed25519 verify, exits non-zero on tamper.
  • tools/attest/backfill_releases.sh — extracts each historical tag's rail_native + tools/compile.rail blobs (no checkout) and signs them. v2.0.0 → v3.7.0 are all attested + downloadable.

Cadenced drivers

  • tools/attest/daily.sh — LaunchAgent com.ledatic.attest_daily re-attests production every morning at 06:00 local. Updates /builds/latest and /selfhost/latest pointers; production drift = "latest" pointer falls behind the live tree, immediately self-evident.
  • tools/attest/fleet_status_publisher.sh — LaunchAgent com.ledatic.fleet_attest polls each fleet node's /health every 60 s, fetches the current pulse, signs the bundle, and publishes to https://ledatic.org/fleet/status.json.

Public surfaces

  • https://ledatic.org/system — mission control: five panels (beacon · witness · fleet · build · selfhost), each resolves to a signed JSON artifact, refreshes on 2.5 s cadence, self-marks "live" or "stale" based on signature freshness.

v3.7.0 included (was on next branch only)

This release also rolls in v3.7.0 — Float-TCO root fix, mixed-precision inference, parallel rerank — which was tagged on the next branch (2026-04-30) but never merged into master. See v3.7.0 release notes.

CI fix

.github/workflows/ci.yml now builds tools/metal/libtensor_gpu.dylib before running the test suite. Four tensor tests had been failing at link with "Undefined symbols for architecture arm64: _tgl_init …" since the Metal-FFI introduction (~2026-04-15); CI returned to green.

Validation

  • 137/137 tests green
  • Byte-identical 2-pass self-compile fixed point verified

The verb: Rail releases are no longer claims, they're physical events anchored to real time.

v3.7.0 — Float-TCO root fix, mixed-precision inference, parallel rerank

01 May 03:03

v3.7.0 — 2026-04-30 — Float-TCO root fix, mixed-precision inference, parallel rerank

Substantial substrate work. Seven commits, three real bugs (one fixed at
root, one workaround'd at source, one falsified), one substantial new
feature (Rail-native mixed-precision GPU inference), one substantial new
tool (parallel rerank wrapper), and a precise reproducer for one bug
that stayed open. 137/137 tests green; byte-identical self-bootstrap
verified.

Compiler / runtime

  • Float-TCO root fix. Re-added body_has_float guard to
    all_params_int in tools/compile.rail:1992. Closes a 17-day silent
    wrong-result bug introduced by commit 82516e4 (2026-04-13) that
    caused tail-recursive float helpers (e.g. rms_row_apply) to
    reinterpret float bits as ints in register-ABI calls, producing
    garbage. Headline affected sites: RMSNorm CPU path, AdamW weight
    decay, LayerNorm CPU backward. (7752738)
  • Runtime-mmap arena (A1.P4). RAIL_ARENA_MB env var (default 1 GB,
    scales to 4 GB+ via mmap). Replaces the fixed 512 MB BSS arena that
    was bumping the macOS dyld static-data ceiling. envp passthrough via
    _rail_envp so env vars reach ./rail_native run child processes.
    Long-context training (seq=2048+) now mechanically tractable on
    macOS. (7752738)
  • Diagnostic counters (A1.P5). alloc_stats_snapshot returns 17
    ints now: per-class freelist misses (0–11), munmap_count (12),
    mmap_large_count (13), arena_spill_count (14), gc_count (15),
    arena_spill_bytes (16). Plus RAIL_ARENA_TRACE=1 for stderr-emitted
    spill events. (7752738)
  • Parser multi-line compound expressions. Cons chains, nested calls,
    list literals inside unclosed (...)/[...] now parse cleanly. Same
    post-tokenizer pass routes both tokenize and tokenize_with_pos.
    (7752738)
  • ./rail_native quick. 15 critical tests in ~5s, vs the full
    suite's 10+ min. Use between code edits. (7752738)
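
The runtime-mmap arena bullet above describes a size taken from the environment and backed by a mapping rather than a fixed BSS array. A minimal Python sketch of that idea follows; only the RAIL_ARENA_MB name and its 1 GB default come from the notes, while the `BumpArena` class, its alignment rule, and its error behavior are assumptions, not the runtime's implementation.

```python
import mmap
import os


def arena_size_bytes(env=os.environ, default_mb=1024):
    """Read RAIL_ARENA_MB (megabytes), falling back to the 1 GB default."""
    return int(env.get("RAIL_ARENA_MB", default_mb)) * 1024 * 1024


class BumpArena:
    """Hypothetical bump allocator over an anonymous mapping."""

    def __init__(self, size):
        self.buf = mmap.mmap(-1, size)   # anonymous mapping sized at run time
        self.size = size
        self.top = 0

    def alloc(self, nbytes, align=8):
        """Return the offset of a fresh block, or raise if the arena is full."""
        start = (self.top + align - 1) & ~(align - 1)
        if start + nbytes > self.size:
            raise MemoryError("arena exhausted")
        self.top = start + nbytes
        return start

    def reset(self):
        """Discard every allocation at once by rewinding the bump pointer."""
        self.top = 0
```

Because the size is read at startup, raising the limit for a long-context run is a matter of setting the variable, not rebuilding the binary.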

Inference path

  • Rail-native mixed-precision matmul. New Metal kernel
    matmul_f32x_halfw (fp32 activations × fp16 weights → fp32, fp32
    accumulator). Host wrapper tgl_matmul_f32x_halfw_host casts
    f64↔f32 once at the GPU boundary; Rail-side surface stays in f64.
    stdlib/tensor.rail:matmul_mixed. Primitive correctness:
    max_abs_diff = 0.00042 vs f64 reference (vs 0.00082 for the
    all-fp16 path — 2× tighter). Byte-deterministic across 100+
    sequential calls. New harness at
    tools/train/lm_infer_v3_mixed.rail. Right substrate for d=384+
    scaling; not the d=256 winner today (CPU+KV-narrowing remains
    faster for current model). (ee6bdce)
  • Parallel rerank wrapper. tools/train/parallel_rerank.sh fans
    out N inference subprocesses concurrently with distinct seeds,
    pre-compiling the harness once for amortization. Validated 7.1×
    wall-clock at N=8, ~11× projected at N=20 — bench projection drops
    from 2.25hr to ~13min for 30 prompts × N=20 rerank. --bin <path>
    flag (added in v3.7.0 as a follow-on) lets orchestrators skip the
    built-in pre-compile. (ee6bdce, 73043e2)
  • tools/train/parity_check.sh. Three-way diff harness running
    CPU (f64), GPU half (existing v3_half), and GPU mixed (new) on the
    same checkpoint+prompt+seed. Useful for working out which
    precision path produces which degenerate argmax token when the
    model is undertrained. (ee6bdce)
  • tools/test/sequential_matmul_half_test.rail. Regression test
    verifying tgl_matmul_half_host is byte-deterministic across 1000
    sequential calls. Eliminates the "primitive corruption" hypothesis
    for any future GPU-collapse investigation. (7752738)
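
The precision trade-off behind the mixed path can be illustrated without a GPU. The sketch below is standard-library Python, not the Metal kernel. It emulates rounding with `struct`, and it assumes the all-fp16 path also accumulates in fp16, which the notes do not state. It does not try to reproduce the 0.00042 / 0.00082 figures; the sizes, seed, and function names are made up.

```python
import random
import struct


def round_f16(x: float) -> float:
    """Round an f64 value to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_f32(x: float) -> float:
    """Round an f64 value to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_f64(acts, weights):
    """Reference dot product in f64."""
    return sum(a * w for a, w in zip(acts, weights))


def dot_mixed(acts, weights):
    """fp32 activations x fp16 weights with an fp32 accumulator."""
    acc = 0.0
    for a, w in zip(acts, weights):
        # The f32 x f16 product is exact in f64; round once after the add.
        acc = round_f32(acc + round_f32(a) * round_f16(w))
    return acc


def dot_half(acts, weights):
    """Assumed all-fp16 path: fp16 inputs and an fp16 accumulator."""
    acc = 0.0
    for a, w in zip(acts, weights):
        acc = round_f16(acc + round_f16(a) * round_f16(w))
    return acc


def compare(seed: int = 0, k: int = 256, rows: int = 32):
    """Return (mixed_err, half_err) as max abs diff vs the f64 reference."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(k)]
    batch = [[rng.uniform(-1.0, 1.0) for _ in range(k)] for _ in range(rows)]
    mixed = max(abs(dot_mixed(r, weights) - dot_f64(r, weights)) for r in batch)
    half = max(abs(dot_half(r, weights) - dot_f64(r, weights)) for r in batch)
    return mixed, half


if __name__ == "__main__":
    mixed_err, half_err = compare()
    print(f"mixed max_abs_diff = {mixed_err:.6f}")
    print(f"half  max_abs_diff = {half_err:.6f}")
```

Under these assumptions the mixed path is limited mainly by weight quantization, while the all-fp16 path also loses precision on every accumulation step.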

Diagnostic infrastructure

  • RAIL_GPU_POOL_DISABLE=1 env flag in tools/metal/tensor_gpu_lib.m.
    Bypasses MTLBuffer pool best-fit reuse, forcing fresh
    newBufferWithLength on every acquire. Falsifies the standing
    hypothesis that pool reuse caused GPU sequential-inference collapse;
    with the flag set, collapse is byte-identical to baseline. (7752738)
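
The bypass logic can be sketched as a best-fit pool with an off switch. This is Python with `bytearray` standing in for MTLBuffer. It is not the code in tools/metal/tensor_gpu_lib.m; the class and its counters are hypothetical, and only the env-flag name and the best-fit/fresh-allocation behaviour come from the note above.

```python
import os
from typing import List, Optional


class BufferPool:
    """Toy best-fit buffer pool with a disable switch."""

    def __init__(self, disabled: Optional[bool] = None) -> None:
        if disabled is None:
            # Same flag name as in the release note.
            disabled = os.environ.get("RAIL_GPU_POOL_DISABLE") == "1"
        self.disabled = disabled
        self._free: List[bytearray] = []
        self.fresh_allocations = 0

    def acquire(self, length: int) -> bytearray:
        if not self.disabled:
            best_index = None
            for i, buf in enumerate(self._free):
                fits = len(buf) >= length
                if fits and (best_index is None
                             or len(buf) < len(self._free[best_index])):
                    best_index = i  # smallest buffer that still fits
            if best_index is not None:
                return self._free.pop(best_index)
        # Pool miss or pool disabled: always allocate fresh.
        self.fresh_allocations += 1
        return bytearray(length)

    def release(self, buf: bytearray) -> None:
        if not self.disabled:
            self._free.append(buf)


if __name__ == "__main__":
    pool = BufferPool(disabled=False)
    first = pool.acquire(64)
    pool.release(first)
    print(pool.acquire(32) is first)  # reuse happens when the pool is on
```

Running the same workload with the pool on and off, then comparing outputs byte for byte, is the shape of the falsification test described above.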

Inference workaround

  • tools/train/lm_infer_cpu.rail:gen_loop no longer calls arena_reset.
    Eliminates a compiler-codegen interaction (between arena_reset and
    multiply-add expressions in float_arr_set) that corrupted
    _rail_small_fl[0] with the value being stored, surfacing as
    SIGSEGV in _rail_chained_malloc on a subsequent allocation. The
    bug was seed-deterministic (triggered on ~50% of seeds at
    --max 128 --k 10) and silently confounded all post-04-13
    single-sample compile-rate measurements. The workaround removes
    the trigger; per-iteration
    intermediate tensors now accumulate in the bump arena, which the
    default 1 GB easily holds for bounded inference runs. 30/30 stress
    tests pass. The compiler-level fix remains open with a precise
    one-line reproducer documented. (f215039)
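
The cost of dropping arena_reset is that the bump pointer only moves forward. A minimal Python sketch of that behaviour follows. `BumpArena`, the 16-byte alignment, and the 2 MiB per-step footprint are illustrative assumptions; only the 1 GB default capacity comes from the note above.

```python
class BumpArena:
    """Toy bump allocator: allocations only advance an offset."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.offset = 0

    def alloc(self, size: int, align: int = 16) -> int:
        # Round the current offset up to the requested alignment.
        start = (self.offset + align - 1) // align * align
        if start + size > self.capacity:
            raise MemoryError("arena exhausted")
        self.offset = start + size
        return start


def bytes_after(steps: int, bytes_per_step: int, capacity: int = 1 << 30) -> int:
    """Arena usage after `steps` iterations with no reset in between."""
    arena = BumpArena(capacity)
    for _ in range(steps):
        arena.alloc(bytes_per_step)
    return arena.offset


if __name__ == "__main__":
    # Hypothetical footprint: 2 MiB of intermediates per generated token.
    used = bytes_after(steps=128, bytes_per_step=2 * 1024 * 1024)
    print(f"{used / (1 << 20):.0f} MiB of 1024 MiB used")
```

With a bounded step count the total stays a fixed multiple of the per-step footprint, which is why the workaround is safe for bounded inference runs but not for open-ended ones.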

Documentation

  • docs/SESSION_HANDOFF_2026-04-30_EOD.md — full afternoon arc.
  • docs/SPUR_HANDOFF_2026-04-30.md, docs/MODEL_SESSION_HANDOFF.md,
    docs/ROADMAP_2026-04-30.md — morning arc + 6-month framing.
  • docs/RAIL_ENGINEER_SESSION_PROMPT_2026-04-30_NIGHT.md, a
    forward-looking prompt for the engineer picking up the open compiler
    bug + remaining substrate debt.
  • Six new design notes: arena-design.md,
    arena-leak-fix-strategy.md, data-section-quirk.md,
    backlog-deferred-design-notes.md, strict-typecheck-design.md,
    garmin-research-notes.md.

What was falsified (negatives)

  • GPU sequential-collapse "MTLBuffer pool reuse" hypothesis
    falsified via RAIL_GPU_POOL_DISABLE. Collapse byte-identical with
    pool off. Surviving cause: fp16 precision compounding across 22
    matmul round-trips/token (intrinsic, not a fixable substrate bug).
  • 2026-04-15 "10 MB/step leak" hypothesis — falsified by
    arena_reset chain-drain test (10 cycles, byte-tight). The
    allocator is sound; remaining leak suspects are GPU-side
    (MTLBuffer pool) or gpu_available 0 re-eval churn.
  • Static 2 GB arena — tested, breaks dyld at link time. 1 GB is
    the macOS BSS ceiling; runtime mmap (A1.P4) is the path beyond.

Memory entries

Fifteen entries in ~/.claude/projects/-Users-user/memory/ capture
today's earned knowledge: substrate findings, discipline rules
(feedback_verify_removals, feedback_diagnostics_first,
feedback_honest_backlog), the dylib investigation chain, the
mixed-precision and parallel-rerank specs, and the segfault
bisection.

v3.0.0 — Rail speaks TLS

18 Apr 16:26

Rail speaks HTTPS alone: a complete pure-Rail TLS 1.3 stack, X.509 chain validation, and HTTPS client, with zero transitive C dependency beyond as, ld, and the kernel's BSD sockets.

Live on release day, in production

anthropic_chat "claude-haiku-4-5-20251001" "Reply with exactly: hello from pure rail"
  → HTTP 200, reply "hello from pure rail"       (6.9 s, pure Rail → Anthropic)

slack_post_text "D0ATHQ1BQD7" "v3.0.0 smoke: pure-Rail TLS direct to slack.com"
  → ok=true, HTTP 200 with x-slack-req-id        (1.0 s, pure Rail → Slack)

https_get_url "https://www.amazon.com/"
  → HTTP 200 with set-cookie, x-amz-rid          (4.0 s, RSA chain validated
                                                  to DigiCert Global Root G2)

The full Google Trust Services chain for api.anthropic.com (leaf → WE1 intermediate → GTS Root R4) validates end-to-end to the macOS /etc/ssl/cert.pem trust store.

What shipped

~3,800 lines of new pure-Rail crypto + TLS across 16 new stdlib modules. Every primitive is validated against NIST or RFC test vectors:

  • Hashes: SHA-256, SHA-384, SHA-512
  • MAC / KDF: HMAC-SHA-256, HKDF-Extract/Expand
  • Symmetric: ChaCha20, Poly1305, ChaCha20-Poly1305 AEAD
  • Public key: X25519, ECDSA-P256 (16-limb), ECDSA-P384 (24-limb), RSA-PSS / RSA-PKCS1 (128-limb)
  • X.509 / PKI: ASN.1 DER parser, Base64 decoder, PEM iterator, macOS trust store loader (128 roots)
  • TLS 1.3: key schedule, handshake state machine, record layer, CertificateVerify dispatch, SAN hostname match, validity period, full chain walker with shortest-path policy
  • Application: https_get / https_post / https_get_url + URL parser + UDP DNS + live Anthropic + Slack clients
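
To make "RFC-vector validated" concrete, here is what such a check looks like for one primitive in the list, HKDF with SHA-256. The sketch is Python using `hashlib`/`hmac`, not the Rail module. The vector is test case 1 from RFC 5869 Appendix A.1, and the function names are illustrative.

```python
import hashlib
import hmac


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869 section 2.2) with SHA-256."""
    if not salt:
        # An absent salt defaults to HashLen zero bytes.
        salt = bytes(hashlib.sha256().digest_size)
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand (RFC 5869 section 2.3) with SHA-256."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested output too long")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        # T(n) = HMAC(PRK, T(n-1) | info | n)
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


# RFC 5869 Appendix A.1, test case 1.
IKM = bytes.fromhex("0b" * 22)
SALT = bytes.fromhex("000102030405060708090a0b0c")
INFO = bytes.fromhex("f0f1f2f3f4f5f6f7f8f9")
EXPECTED_PRK = bytes.fromhex(
    "077709362c2e32df0ddc3f0dc47bba6390b6c73bb50f9c3122ec844ad7c2b3e5"
)
EXPECTED_OKM = bytes.fromhex(
    "3cb25f25faacd57a90434f64d0362f2a2d2d0a90cf1a5a4c5db02d56ecc4c5bf"
    "34007208d5b887185865"
)


def check_rfc5869_case1() -> bool:
    """True when both the PRK and the 42-byte OKM match the RFC vector."""
    prk = hkdf_extract(SALT, IKM)
    okm = hkdf_expand(prk, INFO, 42)
    return prk == EXPECTED_PRK and okm == EXPECTED_OKM


if __name__ == "__main__":
    print("RFC 5869 case 1:", "ok" if check_rfc5869_case1() else "MISMATCH")
```

The TLS 1.3 key schedule wraps these two functions in HKDF-Expand-Label, which this sketch does not cover.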

Trust posture

A TLS connection through Rail v3.0.0 refuses to hand plaintext to the caller unless all of the following hold:

  1. The server's CertificateVerify signature checks out (ECDSA-P256-SHA256 or RSA-PSS-SHA256) against the public key in the leaf's SubjectPublicKey.
  2. The leaf's SubjectAltName dNSName entries include a match for the hostname asked for (RFC 6125 §6.4.3 wildcard support).
  3. The current time is within the leaf's notBefore/notAfter window.
  4. The server Finished MAC validates.

Full chain walk to a CA root is available as cc_walk_chain (opt-in primitive).
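
Checks 2 and 3 are simple enough to sketch. The Python below is illustrative and not Rail's implementation. It supports only a whole-label wildcard in the left-most position, and it rejects wildcards with fewer than three labels; that second rule is a conservative assumption of this sketch, not something stated above. Signature and Finished verification (checks 1 and 4) are out of scope.

```python
from datetime import datetime, timezone
from typing import Iterable


def san_matches(pattern: str, hostname: str) -> bool:
    """Match one dNSName entry against a hostname.

    A '*' is honoured only as the entire left-most label and covers
    exactly one label.
    """
    p_labels = pattern.lower().rstrip(".").split(".")
    h_labels = hostname.lower().rstrip(".").split(".")
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        return (
            len(p_labels) >= 3          # assumption: reject e.g. "*.com"
            and h_labels[0] != ""
            and p_labels[1:] == h_labels[1:]
        )
    return p_labels == h_labels


def leaf_name_and_time_ok(
    san_dns_names: Iterable[str],
    hostname: str,
    not_before: datetime,
    not_after: datetime,
    now: datetime,
) -> bool:
    """Hostname match (check 2) plus validity window (check 3)."""
    in_window = not_before <= now <= not_after
    return in_window and any(san_matches(p, hostname) for p in san_dns_names)


if __name__ == "__main__":
    start = datetime(2026, 1, 1, tzinfo=timezone.utc)
    end = datetime(2026, 4, 1, tzinfo=timezone.utc)
    now = datetime(2026, 2, 1, tzinfo=timezone.utc)
    print(leaf_name_and_time_ok(["*.example.com"], "api.example.com",
                                start, end, now))
```

Because the wildcard covers exactly one label, `*.example.com` matches `api.example.com` but neither `example.com` nor `a.b.example.com`.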

Honest limits

Single cipher suite (TLS_CHACHA20_POLY1305_SHA256), single ECDHE group (x25519), three sig-algs, no session resumption, no 0-RTT, no constant-time guarantees, ~5–8 s per connection (public-key verify dominates). See SECURITY.md before deploying.

Tests

22 pure-Rail TLS tests all green + 116-test core suite still 116/116. Self-compile 2-pass byte-identical preserved. Full details in CHANGELOG.md.

The arc

  • v1.x — Rail compiled itself.
  • v2.x — Rail gained networks, trained transformers, shipped to Cloudflare.
  • v3.0.0 — Rail calls api.anthropic.com by itself.

Rail runs on Rail, the rest runs on physics.