Skip to content

Latest commit

 

History

History
575 lines (441 loc) · 40.7 KB

File metadata and controls

575 lines (441 loc) · 40.7 KB

λ-Calculus as the fsm_llm Substrate

Companion documents. This document is the architectural thesis — why λ-calculus is the substrate, what FSMs are in this picture, the five theorems. For the merge contract (the user-facing API specification, the invariants, and the falsification gates that defend them), see docs/lambda_fsm_merge.md, which is the canonical document for the post-merge state of the codebase. The earlier audit-and-plan (docs/lambda_integration.md v2.0) is superseded but preserved for historical reference.

Status: Architectural thesis (v0.2, 2026-04-24). Source: Distilled from Roy, Tutunov, Ji, Zimmer, Bou-Ammar, The Y-Combinator for LLMs: Solving Long-Context Rot with λ-Calculus (arXiv:2603.20105, Mar 2026), fused with the existing fsm_llm architecture.

This document proposes λ-calculus as the primary substrate of fsm_llm. FSMs remain — but as one surface syntax over that substrate, not the substrate itself. Most programs that today ceremonially wrap themselves in FSM JSON do not actually need dialog state, and should be expressed directly as λ-terms. Programs that do need dialog state continue to work unchanged: the existing FSM JSON compiles down to λ-terms and runs on the same executor as native λ-programs.

One runtime. Two surface syntaxes. FSMs where they earn their keep; λ-terms everywhere else.


1. Thesis

Every fsm_llm program is already a λ-term. The FSM JSON is a serialisation of a per-turn function step : (state, input, context) → (state', output, context') with a switch on state_id — which is precisely a simply-typed λ-expression with a case analysis. The runtime never loops; the host application does, via converse(). Making λ-terms primary lets us:

  1. Delete FSM overhead from programs that don't need state. Most pipelines, agentic patterns, and one-shot tasks are stateless — the FSM wrapping is ceremony.
  2. Unify the five sub-packages. fsm_llm_reasoning, fsm_llm_workflows, fsm_llm_agents all become named combinator patterns in a single standard library, not separate execution engines.
  3. Inherit the paper's theorems (termination, closed-form cost, polynomial accuracy decay, optimal k*=2) for every program, not just the ones that opt into a special "λ-substrate" extension.
  4. Preserve every existing program. The FSM JSON compiler runs before M1 ships; no example, no handler, no production deployment breaks.

The design move is not "add λ-substrate alongside FSM". It is: the runtime is a typed λ-term interpreter; FSM JSON is one front-end; a direct λ-DSL is another.


2. An FSM is a λ-term — constructively

Let S be the set of state IDs, I the input alphabet (user messages), O the output alphabet (assistant responses), C a context dict. The current fsm_llm runtime executes this function per converse() call:

$$ \text{step} : S \times I \times C \to S \times O \times C $$

In the existing 2-pass architecture this decomposes as:

$$ \text{step}(s, i, c) = \big(\delta(s, c'),; \rho(s', c')\big)\quad\text{where}\quad c' = \text{extract}_s(i, c) $$

  • extract_s — Pass 1 data extraction at state s (leaf β-reduction of 𝓜)
  • δ — transition evaluator (deterministic over c′; ambiguity resolved by a bounded Classifier call, which is itself a leaf β-reduction)
  • ρ — Pass 2 response generation at the post-transition state (leaf β-reduction)

Every one of these is a function; their composition is a λ-term. Writing it out in combinators (where [[s]] denotes state s's specification):

step = λ(s, i, c).
  let c' = extract_leaf([[s]].extraction_instructions, i ⊕ c)  -- one 𝓜 call
  let s' = dispatch(c', [[s]].transitions)                     -- pure + optional 𝓜 via Classifier
  let o  = response_leaf([[s']].response_instructions, c')     -- one 𝓜 call
  in (s', o, c')

Host-side recursion (the conversation loop) is external:

conversation = fix(λloop. λ(s, c).
  let i = await_user()
  let (s', o, c') = step(s, i, c)
  emit(o)
  if [[s']].terminal then halt(c') else loop(s', c'))

So converse() is one β-reduction step of the conversational fix. The FSM JSON is a serialisation of the [[s]] table and the δ dispatch. Everything runs in λ already; we have just not named it that.

Consequence: expressing a program directly as a λ-term (no FSM JSON) is a strict simplification for any program where the switch-on-s is degenerate (one state, or linear transitions). That turns out to be most of examples/.


3. Taxonomy: when FSMs earn their keep, when they don't

We audit the 80 examples against the question "does this program need a runtime switch on persistent dialog state?". Three categories emerge.

Category A — Genuinely stateful dialog (FSM surface is the right fit)

State must persist across user turns, and the transition graph is genuinely non-linear (user can redirect mid-conversation).

Examples: form_filling, adaptive_quiz, e_commerce (FSM stacking used), most dialog bots in examples/basic and examples/intermediate, smart_helpdesk, the classification/intent_routing family.

Recommendation: keep FSM JSON. The compiler (§6) translates to λ; nothing else changes for these.

Category B — Pipelines with implicit state (state is fake)

The "state" is really just a stage in a linear pipeline. No user re-entry, no branching on persistent context. The FSM is ceremony for what is structurally a function composition.

Examples: most of examples/agents/*prompt_chain, plan_execute, evaluator_optimizer, react_search, reflexion, debate, maker_checker, rewoo, orchestrator, eval_opt_structured, etc. (roughly 30 of 48 agent examples). Also basic/story_time, basic/simple_greeting, most of examples/reasoning.

Recommendation: express directly as λ-terms. Keep FSM version as a compatibility shim during migration; remove after M4.

Category C — Recursive decomposition over long inputs (λ-native from day one)

The paper's home turf. Long-document QA, aggregation over large corpora, pairwise comparison, multi-hop retrieval. fsm_llm currently has no good expression of these — push_fsm stacking approximates bounded recursion but gives no termination proof, no cost model, no optimal partition.

Examples: none today; likely additions per paper benchmarks — S-NIAH, OOLONG, OOL-Pairs, CodeQA-style tasks.

Recommendation: express as λ-terms with fix and the combinator library. This is where the paper's +21.9 pp / 4.1× wins live.

Rough split: A ≈ 20 examples, B ≈ 50, C ≈ 10 (new). The substrate change shrinks the FSM-authoring surface by ~75% without breaking anything in the remaining 25%.


4. Unified architecture

┌────────────────────────────────────────────────────────────────────┐
│ SURFACE  (authoring)                                               │
│                                                                    │
│   FSM JSON v4.1       λ-DSL (Python)       λ-JSON (serialised)     │
│   (legacy,            (preferred for       (portable,              │
│    Category A)         Category B/C)        optional)              │
│          │                     │                   │               │
│          └──── compile ────────┴────── parse ──────┘               │
│                                 │                                  │
│                                 ▼                                  │
│ ══════════════════════════  λ-AST  ════════════════════════════   │
│   Typed terms: Var · Abs · App · Let · Case                        │
│                Combinator(op, args) · Fix · Leaf(𝓜, template)     │
│                                                                    │
│                                 │                                  │
│                                 ▼                                  │
│ RUNTIME                                                            │
│                                                                    │
│   Planner    — computes (k*, τ*, d, ⊕, π) for any Fix node         │
│   Executor   — β-reduction with bounded 𝓜 at Leaf nodes           │
│   Oracle     — fsm_llm.LiteLLMInterface adapter (unchanged)        │
│   Session    — persistence for Category A dialog (unchanged)       │
│   Monitor    — per-term span tracing, depth/cost histograms        │
└────────────────────────────────────────────────────────────────────┘

Two crucial architectural properties:

  1. Single executor. There is no "FSM path" and "λ path". Every program is a λ-AST. converse(user_msg, conv_id) does one β-reduction of the program's per-turn term, threading (state_id, context) through the session store for Category A programs, or purely functionally for Category B/C.
  2. Single planner. Every Fix node — whether it originated from an FSM's push_fsm stacking or from a native fix combinator — goes through the same planner to get a termination proof, a cost estimate, and an execution plan.

5. λ-AST

The AST is deliberately small. Keeping it small is what makes the theorems compositional.

# src/fsm_llm/lam/ast.py  (becomes part of core)
class Term: ...

class Var(Term):          name: str
class Abs(Term):          param: str; body: Term
class App(Term):          fn: Term; arg: Term
class Let(Term):          name: str; value: Term; body: Term
class Case(Term):         scrutinee: Term; branches: dict[str, Term]   # finite discrimination

class Combinator(Term):   op: Literal["SPLIT","PEEK","MAP","FILTER","REDUCE","CONCAT","CROSS"]
                          args: list[Term]
class Fix(Term):           body: Abs                                 # body is a self-application λf. ...
class Leaf(Term):          oracle: OracleRef                         # 𝓜 call, schema-typed
                           template: PromptTemplate
                           input_vars: list[str]
                           schema: type[BaseModel] | None             # if set, structured extraction

Notes on design:

  • Case replaces the FSM's transition table. A compiled FSM has exactly one top-level Case on state_id; each branch is the λ-term for that state.
  • Leaf is the only node that invokes 𝓜. Everything else is symbolic. This is Assumption A2 encoded in the type system.
  • Fix appears exactly where the program wants bounded recursion — in Category C programs natively, in Category A programs only if stacking is used.
  • Combinator's op is a closed enum. New combinators are added by extending the enum + the executor; users cannot introduce arbitrary operators, which is what makes the library "pre-verified" in the paper's sense.

Typing (pragmatic, not full System F)

Each Leaf declares its schema (pydantic type) for typed IO at oracle boundaries. Combinators are monomorphised at parse time from the concrete types flowing through the tree. This catches the common errors (passing a list where a string is expected, REDUCE with a non-associative operator on multi-chunk output) without the complexity of full inference.


6. FSM JSON → λ-AST compiler

Legacy FSM JSON is compiled at load time by a small, pure function. No FSM is ever executed except via this compile → run path after M2 — FSMManager and MessagePipeline become thin adapters over the λ-executor.

6.1 Compilation scheme

For an FSM with states s₁…sₙ and initial s₀:

compile(fsm) =
  λ(state_id, user_msg, context).
    Case state_id of
      "s1" → compile_state(s1, user_msg, context)
      "s2" → compile_state(s2, user_msg, context)
      ...

compile_state(s, user_msg, context) =
  Let c' = extract_for(s, user_msg, context)            -- Leaf with s.extraction_instructions schema
  Let s' = eval_transitions(s.transitions, c')          -- pure case analysis + optional Classifier Leaf
  Let o  = respond_for(s', c')                          -- Leaf with s'.response_instructions
  (s', o, c')

extract_for(s, i, c) =
  Leaf(oracle=𝓜, template=s.extraction_instructions,
       input_vars=[i, c], schema=schema_from(s.classification_extractions))

eval_transitions(ts, c) =
  Case (all_guards_evaluated(ts, c)) of
    DETERMINISTIC(s')  → s'
    BLOCKED            → emit_blocked_response
    AMBIGUOUS(cands)   → Leaf(oracle=Classifier, template=disambig(cands), input_vars=[c])

push_fsm(sub) =
  Fix(λloop. compile(sub))   -- each push is a bounded fix invocation

6.2 Semantic preservation

For every FSM F, running the compiled λ-term via the executor produces the same (state', output, context') triple as the current FSMManager.process for every (state, input, context) triple in the valid input space. This is a theorem we prove via structural induction on the FSM AST, not a hope.

Regression test requirement: the existing 2,349 tests run against the λ-executor with no changes. Any deviation is a compiler bug, not a behavioural change.

6.3 Handler hooks survive compilation

The 8 HandlerTiming hooks dispatch through one execution path (make_handler_runner), with the call-site placement split between term-side AST splices and host-side dispatch to match the cardinality and conditional semantics each timing requires:

Term-side splices (real AST rewriters via Combinator(op=HOST_CALL, ...)):

  • PRE_PROCESSING — composed before the per-turn body's outermost Abs
  • POST_PROCESSING — composed after the per-turn body's outermost Abs

Host-side dispatch (call sites in dialog/pipeline.py and dialog/fsm.py, all routed through the same make_handler_runner callable for execution-path uniformity):

  • START_CONVERSATION / END_CONVERSATION — fire once per conversation, not per turn; splicing into the per-turn term would over-fire.
  • PRE_TRANSITION / POST_TRANSITION — fire only when a transition actually applies, not on every turn; their conditional gating semantics differ from a structural splicer's per-Case-branch wrap.
  • CONTEXT_UPDATE — fires per extracted_data write with a specific updated_keys set; per-Let updated_keys precision and if extracted_data: conditional gating cannot be matched by a per-Let splicer wrap.
  • ERROR — caught at the host's exception boundary; a kernel exception node would breach the closed combinator-op set.

The handler registration API (Program.register_handler and API.register_handler) is unchanged: handlers compose into the compiled term via fsm_llm.handlers.compose(term, handlers), and the composed term is cached on FSMManager keyed on (fsm_id, _handlers_version).

Status (2026-04-27): Shipped narrowed in plan v1 R5 (r5-green, commit 9208b8a) per D-STEP-04-RESOLUTION. Refining the splicer to cover transition + context_update is deferred to a follow-up plan (would need a discriminant-Case walk for PRE_TRANSITION's "advanced" branch, and per-Let key-set propagation for CONTEXT_UPDATE).


7. The λ-DSL — direct authoring for Category B and C

For programs that don't need FSM ceremony, write λ-terms directly in Python via a fluent combinator API. No JSON round-trip required.

from fsm_llm.lam import split, fmap, ffilter, reduce_, fix, leaf, oracle

# Category B: an evaluator-optimizer loop (was fsm_llm_agents/evaluator_optimizer)
draft  = leaf(template="Write a draft response to: {q}",           schema=str)
score  = leaf(template="Score the draft (0-10): {draft}",          schema=float)
refine = leaf(template="Improve this draft: {draft}\nFeedback: {fb}", schema=str)

evaluator_optimizer = fix(lambda self: lambda q, draft_so_far:
    leaf_if(score(draft_so_far) >= 8.0,
            then=draft_so_far,
            else_=self(q, refine(draft_so_far, feedback=score.explanation))))

# Category C: long-document NIAH (paper's S-NIAH plan)
niah = fix(lambda self: lambda P:
    leaf_answer(P) if size(P) <= TAU else
    reduce_(best, fmap(self, split(P, K_STAR))))

The Python API is an Abs/App builder — its __call__ returns AST nodes, not values. Execution is explicit: executor.run(niah, document). This keeps the semantics referentially transparent and makes the term serialisable.

Optional JSON surface (λ-JSON) exists for programs that want to be stored, diffed, or shared across language runtimes — but is secondary to the Python DSL.


8. The combinator library (Layer 1)

Same library as the paper, now part of the runtime kernel rather than an extra. Invariants (tokenizer-aware split, total/deterministic ℒ∖{𝓜}, associative reduce) are enforced at AST construction time, not at runtime.

split(P: str, k: int, *, boundary: BoundaryFn = token_aware) -> list[str]
peek(P: str, start: int, end: int) -> str
fmap(f, xs)        # (α→β) × [α] → [β]
ffilter(pred, xs)  # (α→𝔹) × [α] → [α]
reduce_(op, xs, *, unit=None)   # associative fold
concat(xs, sep="")
cross(xs, ys)      # Cartesian product

Composition operators ⊕ are a registry of named ReduceOp values, each with an associativity flag, an optional cost(k) method, and an accuracy_preserving_prob used by the planner.


9. Planner and executor (Layers 2 & 3)

Unchanged from the v0.1 design in function, but now the default path for every program, not an opt-in.

  • Planner (src/fsm_llm/lam/planner.py): pure function from PlanInputs (|P|, K, task_type, α, cost_config) to Plan (k*, τ*, d, ⊕, π, predicted_cost, predicted_calls, accuracy_floor). Zero LLM calls. Closed-form per paper's Theorems 2 & 4. Runs at Fix node entry.
  • Executor (src/fsm_llm/lam/executor.py): standard β-reduction interpreter. Recurses on AST, delegates Combinator to the library, Leaf to the oracle. For Fix it either uses a trampoline (headless) or the session-stack (conversational) — the two cases pick the same algorithm, just with different state-persistence backends.
  • Oracle (src/fsm_llm/lam/oracle.py): adapter from LiteLLMInterface to an Oracle protocol enforcing A1 (|P| ≤ K) at call time.

Cost model (CostAccumulator) wraps LiteLLMInterface transparently and emits per-leaf usage to Monitor.


10. Request lifecycle — unified

For a Category A dialog (existing FSM JSON):

API.converse(user_msg, conv_id)
  ▸ session.load(conv_id) → (state_id, context)
  ▸ program = COMPILED[fsm_id]                      -- cached λ-term
  ▸ (state', output, context') =
      executor.run(program, (state_id, user_msg, context))
      ├── Case state_id → branch for current state
      ├── Leaf extract_for     → 1 𝓜 call           Pass 1
      ├── Case eval_transitions → 0 or 1 𝓜 call     Pass 1 (classifier)
      └── Leaf respond_for     → 1 𝓜 call           Pass 2
  ▸ session.save(conv_id, (state', context'))
  ▸ return output

For a Category B pipeline (direct λ-DSL, no session):

executor.run(my_pipeline, inputs)
  ▸ planner.plan(...) at each Fix node
  ▸ β-reduction traverses AST
  ▸ leaves invoke 𝓜; combinators execute symbolically
  ▸ return final result

For a Category C long-context task (paper's home turf):

executor.run(niah_program, document)
  ▸ top-level Fix → planner computes (k*=2, τ*=K, d=⌈log₂(n/K)⌉)
  ▸ predicted_cost logged; budget check
  ▸ SPLIT → MAP → FILTER → MAP(𝓜) → REDUCE chain
  ▸ exactly (k*)^d + 1 oracle calls, guaranteed

Same code path for all three. This is the payoff.


11. What happens to the existing sub-packages

The five sub-packages reorganise into one kernel + a standard library of named λ-terms.

Today Post-unification Notes
fsm_llm/ (FSM, 2-pass pipeline, classifier, handlers, session, LLM iface) fsm_llm/ = kernel (λ-AST + compiler + executor + oracle + planner + session) 2-pass becomes a λ-term schema; FSM JSON is compiled; LLM iface unchanged
fsm_llm_reasoning (structured reasoning engine) fsm_llm/stdlib/reasoning.py Reasoning steps become named λ-terms; engine is a scheduler over them, which is itself a λ-term
fsm_llm_workflows (DAG orchestration) fsm_llm/stdlib/workflows.py DAG is a λ-term built by .then / .parallel / .branch combinators; the engine dissolves
fsm_llm_agents (12 patterns + graph, swarm, MCP, A2A, SOPs, semantic tools, meta) fsm_llm/stdlib/agents/ (one module per pattern) Each pattern becomes a named λ-term factory returning an Abs; the class-per-pattern proliferation collapses
fsm_llm_monitor fsm_llm_monitor/ (unchanged externally; gains λ-term span exporter) Trace granularity shifts from FSM state → AST node; the OTEL schema is richer but the consumer API is the same

The meta-builder (fsm_llm_agents/meta_*) targets λ-terms as its output format instead of FSM JSON.

The examples/ tree reorganises to reflect the taxonomy:

examples/
├── dialog/            (was basic/ intermediate/ advanced/ — Category A survivors)
├── pipeline/          (was agents/, most of reasoning/ — Category B)
├── long_context/      (new — Category C: NIAH, aggregate, pairwise, multi_hop)
└── mixed/             (programs that push sub-FSMs from within pipelines)

12. Preserved and extended theorems

The four theorems from the paper now apply universally.

T1 (Termination). Every λ-term without Fix halts in bounded β-reductions. Every term with Fix halts in exactly (k*)^d + 1 oracle calls, where d is the planner-computed depth, provided (i) split strictly reduces rank, (ii) all combinators in ℒ ∖ {𝓜} are total, (iii) session depth ≤ max_stack_depth.

T2 (Cost is pre-computable). For every Fix node, plan.predicted_cost bounds actual spend within 5% given matched pricing constants. For terms without Fix, cost is the sum of Leaf per-call costs — trivially bounded by a single traversal.

T3 (Accuracy floor). plan.accuracy_floor ≥ 𝒜(τ*)^{n·k*/τ*} · 𝒜_⊕^d. For decomposable tasks (𝒜_⊕ = 1), per-query accuracy is constant in n — polynomial, not exponential, decay with input length.

T4 (Default k*=2). Under linear cost and composition, the planner selects k* = 2 unless an accuracy constraint tightens it.

T5 (FSM semantic preservation). For any FSM F, execute(compile(F), x)FSMManager.process(F, x) for all valid inputs x. Proven by induction on FSM structure, tested by running the existing 2,349 tests against the λ-executor.

T1–T4 are carried from the paper. T5 is what lets us swap the runtime without breaking a single production deployment.


13. Migration path

Five milestones. Each delivers working software and can land independently.

Milestone Scope Ships
M1 — Kernel fsm_llm/lam/ (AST, parser for λ-DSL, combinator library, planner, executor, oracle adapter) λ-DSL runs end-to-end for Category B & C programs. FSM path untouched.
M2 — FSM compiler + executor unification fsm_llm/lam/fsm_compile.py; FSMManager delegates to λ-executor; T5 regression (all 2,349 tests pass). Status: S11 complete (compiled-only) — legacy MessagePipeline.process / process_stream retired; FSMManager.use_compiled flag removed; single-path architecture. Existing programs run on the new substrate. No user-visible change.
M3 — Stdlib Agents, reasoning, workflows patterns reimplemented as λ-term factories; old sub-packages re-export the names for backward compat. Status: slices 1 + 2 + 3 complete — M3 stdlib reorganisation done. Slice 1 ships src/fsm_llm/stdlib/agents/lam_factories.py with the four canonical M4-verified shapes as named factories: react_term (S1, 2 oracle calls), rewoo_term (S2, 2), reflexion_term (S3, 4), memory_term (S4, 2). Each factory closes over no Python state, takes prompt strings + env-var names as arguments, builds Leaf nodes internally, and returns a closed Term. Module imports only from fsm_llm.lam (purity invariant). Re-exported from fsm_llm.stdlib.agents.__all__ (additive — legacy class-based agents ReactAgent etc. retained). 6 shape-equivalence unit tests + 4 live smoke tests PASS on ollama_chat/qwen3.5:4b (oracle_calls match M4 baselines exactly: 2/2/4/2). Slice 2 ships src/fsm_llm/stdlib/reasoning/lam_factories.py with 11 named factories (~330 LOC) covering all 9 ReasoningType strategies + classifier + outer orchestrator: analytical_term / deductive_term / inductive_term / abductive_term / analogical_term / creative_term / critical_term (3-leaf chains); hybrid_term (4-leaf); calculator_term (2-leaf); classifier_term (4-leaf — domain → structure → needs → recommend); solve_term (orchestrator with 2 host-callable Apps + 2-leaf validation/final). Private _chain helper folds [(name, leaf), ...] into nested let-chains (D-S2-003). Purity invariant enforced by AST-walk unit test — module imports only from fsm_llm.lam. Re-exported additively from fsm_llm.stdlib.reasoning.__all__ (legacy ReasoningEngine retained; 112 baseline reasoning tests preserved). Evidence: 17 shape-equivalence unit tests PASS; 5 live smokes (analytical/deductive/creative/hybrid/classifier) PASS on ollama_chat/qwen3.5:4b in 73s (oracle_calls == leaf_count strict); bench scorecard evaluation/m3_slice2_reasoning_scorecard.json runs all 10 cells (9 strategies + classifier) — 10/10 theorem2_holds=true in 190s wall (3-leaf cells ~20-25s each, calculator 5s, classifier 8s). Slice 3 ships src/fsm_llm/stdlib/workflows/lam_factories.py (~330 LOC) with 5 named factories exercising kernel pieces NOT used by slices 1-2: linear_term (sequential let-chain over sub-terms), branch_term (boolean branch via case_ + 1 host-callable App), switch_term (N-way classifier dispatch via case_), parallel_term (fan-out + reduce via let-chain → reduce_(fmap(identity, branch_outputs))), retry_term (bounded retry via fix(λself. λx. let_("attempt", body(x), case_(success(attempt), {"true": attempt}, default=self(x))))). First slice exercising case_ (Case node) + fix (bounded recursion) + fmap/reduce_ (parallel idiom) outside the long_context stdlib. Module imports only from fsm_llm.lam (purity invariant enforced by AST-walk unit test). Re-exported additively from fsm_llm.stdlib.workflows.__all__ (legacy WorkflowEngine, WorkflowBuilder, 12 step classes, 14 dsl functions all retained; 136 baseline workflow tests preserved). Evidence: 14 shape-equivalence + purity unit tests PASS; 5 live smokes PASS on ollama_chat/qwen3.5:4b in 38s (linear=3 strict, branch=1 runtime-arm-only, switch=1 runtime, parallel=3 strict, retry=0 no-leaf); bench scorecard evaluation/m3_slice3_workflow_scorecard.json confirms 5/5 theorem2_holds=true in 15s wall — strict equality for linear/parallel cells, runtime arm-only equality for case_-using cells (with actual_le_upper_bound also recorded), and 0-leaf for retry. Theorem-2 is now demonstrated across 26 distinct factories (4 agent + 11 reasoning + 11 long_context + 5 workflow), three different shape families (let-chain, case_-branch, fix-recursion), and four kernel sub-systems (let / case_ / fmap+reduce_ / fix). M3 stdlib reorganisation is complete for the agent / reasoning / workflow trinity; long_context already in M5 path. Category B examples migrate — smaller, faster, theorems apply.
M4 — Category B example migration Rewrite the ~30 Category B examples as direct λ-DSL; deprecate (don't delete) their FSM JSON. Status: slices 1-5 complete — M4 done (46 of 46 migratable examples migrated under examples/pipeline/; 2 Category-A interactive HITL examples permanently skipped per D-010). Slice 5: 2 final S1-shape examples migrated — structured_output (synth_leaf enforces a MovieRecommendation Pydantic schema; preserves the original's structured-output theme) and tool_decorator (canonical Fahrenheit→Celsius task; the original's @tool decorator + Annotated[T, "desc"] schema-inference novelty lives at the registration layer and does not surface in λ-term shape). Both verified 3/3 PASS × 3 runs on ollama_chat/qwen3.5:4b with oracle_calls=2. Slice 4: 33 Tier 3-5 examples migrated across 4 canonical λ-shapes — S1 ReAct family (11): react_search, classified_tools, classified_dispatch, hierarchical_tools, reasoning_tool, multi_tool_recovery, concurrent_react, full_pipeline, react_structured_pipeline, security_audit, legal_document_review (decide_leaf → tool_dispatch → synth_leaf, 2 oracle calls); S2 REWOO/plan-execute (4): rewoo, rewoo_multi_step, plan_execute, plan_execute_recovery (plan_leaf → plan_executor → synth_leaf, 2 oracle calls); S3 Reflexion (4): reflexion, reflexion_code_gen, adapt, adapt_with_memory (solve → eval → reflect → re-solve, 4 oracle calls; depth-1 retry flatten); S4 memory/orchestrator (14): memory_agent, agent_memory_chain, orchestrator, orchestrator_specialist, hierarchical_orchestrator, multi_debate_panel, agent_as_tool, supply_chain_optimizer, workflow_agent, investment_portfolio, medical_literature, architecture_review, skill_loader, reasoning_stacking (context_leaf → answer_leaf, 2 oracle calls). Shared runtime helpers under examples/pipeline/_helpers.py (~180 LOC: run_pipeline, make_tool_dispatcher, make_plan_executor) eliminate boilerplate. Oracle patches landed (D-011): _invoke_structured now bypasses LiteLLMInterface.generate_response's _parse_response_generation_response (which collided with user schemas containing a reasoning field), forces temperature=0 for grammar-constrained decoding, and synthesises top-level required from properties when Pydantic omits it. 33/33 slice-4 examples PASS VERIFICATION 3/3 on ollama_chat/qwen3.5:4b. Skipped: hitl_approval, react_hitl_combined — truly interactive (Category A by D-010 logic).

Prior status (slices 1-3): Slice 1: prompt_chain (3-leaf let-chain) + self_consistency (fmap+reduce_ majority vote). Landed LiteLLMOracle._invoke_structured re-route through generate_response with Pydantic-derived response_format (D-008): the original extract_field wrapper put a JSON-string-of-answer in the value slot for small Ollama models (qwen3.5:4b), breaking structured Pydantic at leaves. Slice 2: Tier-1 remainder — debate (6-leaf 2-round prop/crit/judge), consistency_with_tools (3-sample math fold), debate_with_tools (evidence-based debate). Slice 3: Tier-2 — maker_checker, evaluator_optimizer, eval_opt_structured, maker_checker_code, regulatory_compliance, pipeline_review (3- and 5-leaf chains; bounded-recursion semantics of MakerCheckerAgent / EvaluatorOptimizerAgent flattened to fixed-depth chains, documented per-example). All 11 migrated examples emit byte-compatible VERIFICATION blocks (3/3 at 100%) on ollama_chat/qwen3.5:4b and exercise structured Pydantic at every leaf. examples/basic/{simple_greeting,story_time} deferred — interactive multi-turn dialogs (input() loops) don't fit batch λ-pipelines. Tiers 3-5 (35 examples: ReAct/REWOO/Reflexion/orchestrator) deferred to slice 4+ — they require tool-loop callback wiring and fix(...) recursion semantics that haven't been exercised in production yet. | Ceremonial FSM usage measurably shrinks. | | M5 — Long-context library + benchmarks | Category C programs (NIAH, OOLONG, OOL-Pairs equivalents) as λ-terms; publish benchmarks on Qwen3/Llama/Mistral paralleling the paper. Status: slices 1+2+3+4 complete. Slice 1: niah factory + examples/long_context/niah_demo/ (Theorem-2 evidence verified live on qwen3.5:4b). Slice 2: aggregate factory + examples/long_context/aggregate_demo/ + scripts/bench_long_context.py multi-model harness; 4-cell scorecard (qwen3.5 4b+9b × niah+aggregate) shows theorem2_holds=true on every cell — multi-model evidence that ex.oracle_calls == plan(...).predicted_calls is invariant in model size and factory shape (see evaluation/bench_long_context_*.json). Slice 3 complete: pairwise + multi_hop factories; private _recursive_long_context helper extracted (D-S2-001 fulfilled — niah/aggregate/pairwise share one term-construction body, multi_hop reuses it via extra_input_vars); bench harness gains --workers N (ProcessPoolExecutor, default 1) and cloud-model env-var preflight (OpenAI/Anthropic/Gemini/Bedrock/Azure). 4-cell sequential bench (qwen3.5:4b × niah/aggregate/pairwise/multi_hop) shows theorem2_holds=true on every cell — pairwise at 8/8 calls (single-Fix), multi_hop at 16/16 (= hops × predicted_calls, hops=2). Live demos pairwise_demo/ and multi_hop_demo/ produce VERIFICATION blocks with hard theorem-2 gates. Slice 4 complete: niah_padded factory for non-τ·k^d-aligned inputs (+ aligned_size / pad_to_aligned / make_pad_callable helpers); examples/long_context/niah_padded_demo/ (DOC_LEN=2000 → N*=2048, 1.024× pad overhead) shows oracle_calls_match_planner=true and needle_found=true on qwen3.5:4b; bench cell --factories niah_padded records padded_size + raw_doc_size and produces theorem2_holds=true. Slice 5 complete: oracle-mediated pairwise tournament op (oracle_compare_op) — at each reduce step the oracle picks the more relevant of two candidate segments. Planner extension: PlanInputs.reduce_calls_per_node: int = 0 (default preserves slice-3 cost equality bit-identically) plus derived Plan.leaf_calls and Plan.reduce_calls fields. With reduce_calls_per_node=1, Theorem-2 holds with predicted_calls = leaf_calls + reduce_calls = k^d + (k^d - 1) = 2·k^d - 1 on aligned inputs (sentinel-arm short-circuit relaxes strict equality to upper bound on sparse-needle inputs — D-S5-001). The slice-3 compare_op is retained as the back-compat default. Bench harness gains --pairwise-mode {length,oracle} (default length); pairwise_demo/run.py gains --mode {length,oracle} (default length) — both preserve existing eval-harness baselines. Status: slice 6 complete. Slice 6 ships (a) multi_hop_dynamic factory + make_dynamic_hop_runner host-callable orchestrator + not_found_gate helper (D-S6-001: each hop stays a niah-shaped Fix; iteration is lifted to Python so the confidence gate can short-circuit. D-S6-002: Theorem-2 reformulated as upper bound — strict actual == actual_hops · predicted_calls; loose actual ≤ max_hops · predicted_calls. Both reported); (b) labelled-dataset benchmark path on scripts/bench_long_context.py via new --dataset PATH / --score-mode {exact,substring,f1_token} / --max-hops N flags + run_dataset_cell per-record loop (D-004: inside-cell iteration; picklable by re-loading dataset in worker); (c) committed synthetic OOLONG-shaped fixtures evaluation/datasets/oolong_synth.jsonl (30 records: 10 niah + 10 aggregate + 10 multi_hop) and oolpairs_synth.jsonl (10 pairwise) per D-003 (real-OOLONG ingestion deferred — slice ships infrastructure). Live evidence on ollama_chat/qwen3.5:4b: demo --dynamic --max-hops 4 3/3 PASS (gate fires hop-0; actual_hops=1, oracle_calls=8=1·k^d); bench --dataset oolong_synth.jsonl --factories niah 10 records, accuracy=0.80, pass_rate=1.00; bench --dataset oolong_synth.jsonl --factories multi_hop_dynamic --max-hops 3 10 records, accuracy=0.80, pass_rate=1.00 — every record satisfies strict per-actual-hops T2 (evaluation/slice6_dataset_{niah,mhd}.json). 7 unit tests + 1 @pytest.mark.real_llm smoke pass. Status: slice 7 complete (OOLONG ingestion). Slice 7 ships a HuggingFace-backed loader (scripts/datasets/oolong_loader.py) that converts records from the real OOLONG benchmark (arXiv 2511.02817 — Bertsch et al.; HF oolongbench/oolong-{synth,real}) into the slice-6 internal JSONL schema ({id, task, question, answer, document, metadata}); a new optional install extra [oolong] (adds datasets>=3.0.0); HF streaming-mode load (D-S7-001; never bulk-downloads the 12.4 GB parquet); license-conservative posture (D-S7-002; converted records gitignored — evaluation/datasets/.gitignore); per-task subsetting via --limit-per-task + --max-context-len cap; and evaluation/datasets/README.md for reproduction. Scope reduction (D-001): OOL-Pairs is fictional (no public benchmark) — slice-6 oolpairs_synth.jsonl remains the canonical pairwise fixture. All OOLONG tasks map to task: "aggregate" (D-S7-003); original task preserved in metadata.oolong_task. Live evidence on ollama_chat/qwen3.5:4b: 5 OOLONG-synth records (one per task type, ≤2048 token ctx) bench-aggregate cell — strict Theorem-2 holds 5/5 (oracle_calls == predicted = k^d: 4/4, 4/4, 8/8, 8/8, 8/8); accuracy 0.00 expected (paper: GPT-5 < 50% at 128K — slice ships infrastructure, not numbers; evaluation/slice7_oolong_synth.json). 8 unit tests pass (mocked HF, no network in CI); planner + factories + bench harness byte-identical. Slice 8+: multi-model scorecards (Llama/Mistral/cloud), confidence-calibrated gate alternatives, OOLONG-real D&D ingestion + bench, per-task factory routing (NUMERIC_ONE_CLASS → niah). | fsm_llm becomes a superset of the paper's λ-RLM, with conversational interop. |

At no point does an existing program break. At every point a new kind of program becomes expressible.


14. What we explicitly do not claim

  • FSMs aren't disappearing. Category A programs (real dialog state) are a first-class surface. We're not arguing for "λ-only"; we're arguing for "λ as substrate, FSM as dialog surface".
  • The substrate does not subsume free-form code generation. Paper's Table 8: for CodeQA with strong models, free-form agents win. fsm_llm keeps the ability to embed an unstructured agent as a Leaf — that's what makes it strictly more expressive than λ-RLM-as-published.
  • Streaming is still open. The paper is batch. Leaf-level streaming works for CONCAT and monotone REDUCE variants; neural ⊕ buffers. Design for partial streaming in M4.
  • Full System F typing is not on the table for v1. Monomorphisation at parse time catches the common bugs. Richer types if M5 needs them.

15. Risks

Risk Likelihood Mitigation
T5 semantic-preservation proof has a hole; some FSM behaviour diverges under the compiler Medium All 2,349 tests must pass pre-M2 merge; CI gate. Any divergence is a ship-blocker.
Performance regression from AST interpretation overhead Low-medium Benchmark M2 vs baseline; if >5% regression, compile hot paths (per-FSM cache of reduced forms).
User confusion from two surface syntaxes Medium Migration guide + decision tree ("do you need dialog state? → FSM; otherwise → λ-DSL"). Don't deprecate FSM JSON.
Scope creep — substrate-wide change invites unrelated refactoring High Milestones are strict; each merge is narrow; stdlib reorganisation (M3) is separate from kernel (M1) and compiler (M2).
Tokenizer mismatch breaks T1 termination Medium Oracle protocol exposes tokenize(); split uses the oracle's tokenizer by default; integration test per provider.
Fix-via-push_fsm per-level overhead exceeds trampoline Low Trampoline is default; stacked executor only when dialog interop is needed. Benchmarked.

16. Summary

The proposal is not to add λ-substrate beside FSMs. The proposal is: λ-calculus is the substrate. FSMs become a surface syntax, kept for dialog-state programs where they earn their keep, compiled to λ-terms everywhere else. Most of fsm_llm_agents, most of fsm_llm_reasoning, and most of examples/ stop being FSMs and become what they always structurally were — λ-terms with combinators. The paper's theorems (termination, closed-form cost, polynomial accuracy decay, optimal k*=2) stop being opt-in guarantees for a sub-package and become universal properties of the runtime. And long-context tasks, which the current FSM model can't express well, gain a native, provably-bounded home.

One runtime. Two surface syntaxes. A smaller, stronger, more honest framework.


References

  • Roy, Tutunov, Ji, Zimmer, Bou-Ammar. The Y-Combinator for LLMs: Solving Long-Context Rot with λ-Calculus. arXiv:2603.20105, 20 Mar 2026.
  • Barendregt, The Lambda Calculus: Its Syntax and Semantics, 1984.
  • Plotkin, Call-by-name, call-by-value and the λ-calculus, TCS 1975.
  • fsm_llm CLAUDE.md — framework architecture, 2-pass flow, handler timing, FSM stacking, classification.