Whenever you are asked to edit AGENTS.md, do not take these as literal, step-by-step instructions. Think about the spirit of the request and edit the file to meet that spirit.
- When the user says a requirement is vague/misaligned, stop work and do the following steps in order:
- Translate it into a precise, checkable rule using the user's wording (no paraphrased or "improved" structure).
- Add that rule to AGENTS.md before doing any other work.
- Only add rules to AGENTS.md when explicitly asked to edit it or when the Misalignment Fix Protocol is triggered; never record one-off task directives or situational preferences as permanent policy.
- Do not introduce alternative structures or "helpful" reframes unless the user asks.
- You should capture the spirit of the requirement, not preserve it in amber. You MUST think CRITICALLY about what the requirement is and why, in context, it exists. Capture that in full, not merely its form.
- Avoid literal, rule‑lawyer interpretations that create busywork or miss the user's intent. Optimize for the best final result within the requested scope, even when that requires large changes.
- If the user says they don't know what these suggestions mean or it's not even clear that they are mutually exclusive, explain what the suggestions mean and whether they are mutually exclusive before asking them to choose or proceed.
- NEVER EVER EVER DIRECTLY QUOTE THE USER IN AGENTS.md!
- The system has only relations; nothing more.
- The system normalizes relations into (possibly infinite) unions of spans.
- There are no inputs.
- There are no outputs.
- There is no directionality.
- Every test should have a dual version (if it is not already symmetric).
- Dual tests should output duals of the same spans; ordering does not need to match.
- If a variable is fully instantiated, the constraint doesn't simply drop; the instance gets substituted for the variable. Whether the constraint should then drop depends on what ground term it becomes.
- The system does NOT support unification; it only supports matching.
- Do not implement or expose unification APIs or algorithms; any unification-style code path is a correctness bug that must be removed or renamed to matching with tests to prove it.
- Matching uses two substitutions (one per side) so that both sides become equal after applying their own substitution.
- Variable identities are local to each side; the same variable index on both sides has no shared meaning.
- Any algorithm that compares two sides must rename apart (or otherwise ensure disjoint variable namespaces) before matching.
- Most-general matching is required: any other matching must factor through it via post-substitutions on each side.
- Treat any unification-style behavior (shared-variable equality across sides) as a severe correctness bug.
- Matching results must be invariant under renaming variables on only one side; property tests should cover this.
- Definition (matching): a matching of terms s and t is a pair of substitutions (θ1, θ2) such that s[θ1] = t[θ2].
- Definition (most general): a matching (θ1, θ2) is most general when any other matching (λ1, λ2) can be written as λ1 = θ1 ∘ μ1 and λ2 = θ2 ∘ μ2 for some μ1, μ2.
- Consequence: the most-general matching is invariant to variable names being shared across the two terms; disjoint namespaces make this explicit, and variables that do not correspond remain identity under matching.
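The definitions above can be sketched concretely. This is a minimal illustration, not rwlog's real code: the `Term` shape, the rename-apart offset, and all names are assumptions, and the sketch omits the occurs-check and any most-generality argument. It finds (θ1, θ2) by renaming the right side apart and running one first-order solver over the combined namespace, as the rename-apart rule above requires.

```rust
// Sketch only: assumed Term shape and names, not rwlog's real types.
// Two-sided matching of `left` and `right`: find (theta1, theta2) with
// left[theta1] = right[theta2].
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq, Eq)]
enum Term {
    Var(u32),
    App(&'static str, Vec<Term>),
}
use Term::*;

// Apply a substitution, chasing bindings until no bound variable remains.
fn apply(t: &Term, s: &HashMap<u32, Term>) -> Term {
    match t {
        Var(v) => match s.get(v) {
            Some(b) => apply(b, s),
            None => Var(*v),
        },
        App(f, args) => App(f, args.iter().map(|a| apply(a, s)).collect()),
    }
}

// Rename variables apart by a fixed offset (disjoint namespaces).
fn shift(t: &Term, off: u32) -> Term {
    match t {
        Var(v) => Var(v + off),
        App(f, args) => App(f, args.iter().map(|a| shift(a, off)).collect()),
    }
}

// First-order solver over the combined namespace (no occurs-check in this sketch).
fn solve(a: &Term, b: &Term, s: &mut HashMap<u32, Term>) -> bool {
    let (a, b) = (apply(a, s), apply(b, s));
    match (&a, &b) {
        (Var(x), Var(y)) if x == y => true,
        (Var(x), _) => { s.insert(*x, b.clone()); true }
        (_, Var(y)) => { s.insert(*y, a.clone()); true }
        (App(f, xs), App(g, ys)) => {
            f == g && xs.len() == ys.len()
                && xs.iter().zip(ys).all(|(x, y)| solve(x, y, s))
        }
    }
}

fn collect_vars(t: &Term, f: &mut impl FnMut(u32)) {
    match t {
        Var(v) => f(*v),
        App(_, args) => {
            for a in args {
                collect_vars(a, f);
            }
        }
    }
}

// Split the combined solution back into one substitution per side.
fn match_terms(left: &Term, right: &Term) -> Option<(HashMap<u32, Term>, HashMap<u32, Term>)> {
    let off = 1_000; // assumption: fewer than 1000 variables per side
    let mut sigma = HashMap::new();
    if !solve(left, &shift(right, off), &mut sigma) {
        return None;
    }
    let mut theta1 = HashMap::new();
    collect_vars(left, &mut |v| { theta1.insert(v, apply(&Var(v), &sigma)); });
    let mut theta2 = HashMap::new();
    collect_vars(right, &mut |v| { theta2.insert(v, apply(&Var(v + off), &sigma)); });
    Some((theta1, theta2))
}

fn main() {
    let left = App("B", vec![Var(0), Var(1)]);
    let right = App("B", vec![App("A", vec![Var(2)]), Var(3)]);
    let (t1, t2) = match_terms(&left, &right).expect("terms should match");
    // Check the defining property: left[theta1] = right[theta2].
    assert_eq!(apply(&left, &t1), apply(&right, &t2));
}
```

Because the two sides are renamed apart before solving, the same variable index on both sides carries no shared meaning, matching the locality rule above.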
- Run tests after code changes that can affect behavior. Do not ask the user to run tests.
- Do not run tests for instruction-only edits, documentation-only edits, or git-only changes (branch/worktree/checkout/reset).
- When running tests, run `cargo test --no-run` in the same profile (debug/release) first.
- The user does not care about phrasing or framing; do not propose rephrasings or reframings for solutions or fixes, and describe them objectively.
- Never propose fixes that change semantics; correctness is defined by the current semantics, and a semantic change is not an acceptable bug fix unless the user explicitly requests it.
- Treat semantics-breaking changes as catastrophic to correctness; even when explicitly requested, treat them as severe, warn strongly, and resist by default.
- If the user explicitly requests a semantic change, the agent must push back with strong warnings, explain the correctness risks, and treat it as a last resort rather than normal progress.
- Large deletions or full rewrites that preserve semantics are encouraged; do not require extra approval based only on scale or perceived risk.
- Do not invent process requirements that are not written here; when scope is unclear, ask a concrete clarification question instead of asserting a policy.
- Do not assume a redesign preserves semantics just because the user says so; independently validate semantic equivalence.
- Do not be conservative about scope or effort: if a real problem requires broad, invasive changes to fix correctly, make those changes without hesitation. Effort and size are not blockers.
- When the user has already requested a task, proceed without asking for permission or prompting for confirmation; avoid “say the word” or similar re-ask phrasing.
- When objective standards in this file or the system/developer instructions already determine the correct action, follow them without asking the user for preference or permission; only ask when those standards are insufficient to decide.
- User preference is not the deciding factor; follow objective standards relevant to the task to determine the best action. Do not ask for preference when standards already decide.
- When a bug or failing behavior is identified, use a hypothesis-falsification workflow before attempting fixes.
- Start by listing multiple plausible hypotheses for the bug.
- For each hypothesis, design ways to disprove it; prefer broad, exploratory probes.
- Use multiple debug statements and/or custom exploratory tests to maximize visibility; remove temporary debug output after testing.
- Default to maximizing visibility of execution dynamics; prefer instrumentation, broad probes, and exploratory tests over silent reasoning.
- Update hypotheses based on evidence; remove disproven ones and refine remaining ones.
- Continue until there is a firm, mechanical, step-by-step explanation of where expected behavior diverges from actual behavior.
- Produce a minimal, isolated reproduction program that calls specific library functions in sequence and demonstrates the bug step-by-step with logged output.
- Do not proceed to fixes until the bug is fully traced and explained without hedging; if the explanation contains uncertainty, continue testing instead.
- Refactors must be deletion-first: remove the obsolete implementation as soon as a replacement is chosen.
- Never keep parallel implementations for the same behavior.
- Do not perform incremental replacements; complete each replacement in a single change set.
TESTS MUST NEVER EXPECT OR CODIFY BROKEN BEHAVIOR.
A test that expects incorrect output is a lie. It is a machine designed to deceive. When a test passes because it expects broken behavior, it actively prevents detecting bugs and masks the true state of the system.
Rules:
- Every test must verify the CORRECT expected behavior
- If a test fails after a fix, the test was WRONG - update it to expect correct behavior
- Never write a test that "passes" by expecting an incorrect result
- Never describe a test as "expecting broken behavior" - that is not a test, it is sabotage
- If you discover a test expecting wrong behavior, DELETE IT or FIX IT immediately
The only valid test outcomes:
- Test passes: system behaves correctly
- Test fails: system has a bug that needs fixing
There is no third category of "test passes because it expects the bug."
- Tests must assert semantic behavior, not implementation form.
- Form means internal representation coupling (enum/struct layout, node shapes, incidental ordering, debug strings, IDs), not formatting or output rendering.
- Avoid tests that only validate enum construction, field layout, specific data structure shapes, or exact internal ordering.
- Assert on observable outcomes (answer sets, spans, exhaustiveness, duality), not intermediate pipeline states.
- Output formatting is a separate concern: only check exact output strings when that textual format is part of a specified public, user-facing interface; otherwise check content/meaning.
- Place tests at the layer that defines the behavior: REPL semantics in `repl` tests, engine iteration semantics in `engine` tests, and internal modules only for their semantic contracts.
- Semantic test suites must not have essentially all examples grounded; include non-ground examples so matching is meaningfully tested and coverage is thorough.
AVOID TARGETED FIXES; DO NOT IMPLEMENT NARROW PATCHES THAT ONLY ADDRESS A SPECIFIC SYMPTOM INSTEAD OF THE UNDERLYING DESIGN INVARIANTS.
This is non-negotiable. If you find yourself writing:
- "for now"
- "band-aid fix"
- "temporary workaround"
- "this loses context but prevents..."
- "simple fix for now"
STOP IMMEDIATELY. You are about to commit sabotage.
If you don't know how to implement something correctly:
- ASK the user for clarification
- RESEARCH the correct approach
- PLAN before coding
But NEVER write code you know is wrong. Wrong code is worse than no code because:
- It introduces bugs that must be debugged later
- It creates technical debt
- It masks the real problem
- It wastes everyone's time
The only acceptable implementation is a CORRECT implementation.
These apply to the walkthrough blog post work:
- Thorough coverage: cover every aspect of development, data representation, engine, and computation.
- Per-concept interactables: each concept gets its own interactable; do not limit to one per "unit."
- Examples are interactive: any example that would benefit from interaction should be an interactable.
- Depth over basics: avoid shallow, repetitive, or purely textual demos; use visuals that build intuition about data structure and algorithm flow.
- Incremental delivery: work segment-by-segment; finish the interactable for the current segment before moving on.
- Brainstorming-first: spend time exploring multiple presentation ideas for each segment before implementing the chosen interactable.
- Concept inventory first: before drafting a section's structure or text, enumerate distinct concepts for that section and use that inventory to drive the interactables.
- Hierarchical inventory workflow and outline format: build the walkthrough hierarchy incrementally from the code and record it in `walkthrough/index.html` as an outline:
  - Layer 1: one `<h2>` per source file/module.
  - Layer 2 (one file at a time): immediately under that file's `<h2>`, add `<h3>Types</h3><ul><li>...</li></ul>` and `<h3>Functions / Methods</h3><ul><li>...</li></ul>`. Do not add other content until all files have Layer 2.
  - Layer 3: after all Layer 2 sections exist, expand one Layer 2 item at a time (a single `<li>` entry under Types or Functions), and add a nested concept list under that item only: `<li>Thing<ul><li>Concept: ... | Explanation stub | Interactable stub</li></ul></li>`. Do not expand multiple items in the same pass. If an item has no concepts, remove it from Layer 2 instead of skipping. Each concept item becomes a single-concept segment + its own interactable.
  - No other prose or UI until all files have Layers 1-3 placeholders.
  - End state: a granular, full-coverage concept list across all files.
- Concept quality rules: before expanding any Layer 2 item, explicitly brainstorm the mechanics and edge-case examples that need coverage, then convert that list into concepts. This is a tutorial on how the system works, so lead each concept list with internal representation/encoding and mechanism details (e.g., interner layout, id scheme). Do not restate the Layer 2 item as a “concept.” Instead, brainstorm concrete examples that expose mechanics and edge cases. Each distinct mechanic or edge case is its own concept.
- Interactable quality rule: each interactable must be a complete, ambitious, intuitive, and visual exploration that builds a full mental model of the concept. It is not a dressed‑up example; it should eliminate gaps and ambiguities in understanding.
- Expanding each concept in separate edits but proceeding continuously without waiting: do not pause for approval or status updates between concepts; limit the scope of each edit to a single concept and keep moving automatically to the next.
Walkthrough content directives:
- Visual communication: use SVGs and other mechanisms to communicate visually throughout.
- Layout rule: avoid multi-column exposition; keep text single-column and give diagrams full-width space.
NEVER run tests without a timeout. If tests don't ALL finish in less than 30 seconds, there's an infinite loop bug.
Always use:
```shell
cargo test --no-run 2>&1
timeout 30 cargo test 2>&1
```
Never use:
```shell
cargo test 2>&1  # WRONG - will hang forever on infinite loop
```
This applies to ALL test commands - full suite, filtered tests, individual tests. No exceptions.
Baseline to preserve during the parallelism port (I ran this myself on commit 8eab6f01f1be85aa5a5790afb84a604e6948b336):
- Built release tests with `cargo test --release --no-run`.
- Ran `timeout 30 cargo test --release --lib engine::tests::program_synth_flip_query_emits_answer -- --exact --nocapture`.
- Result: pass.
When implementing a new feature using TDD, you MUST write comprehensive failing tests FIRST.
A single vague test like "this feature doesn't work" is USELESS. It doesn't guide implementation, doesn't document expected behavior, and doesn't catch regressions.
Before implementing ANY feature:
- Analyze the feature using a subagent to identify ALL edge cases
- Write a failing test for EVERY edge case identified
- Each test must be:
- Specific: Tests exactly ONE behavior or edge case
- Named descriptively: The test name documents the expected behavior
- Minimal: Sets up only the state needed to test that specific case
- Assertive: Makes precise assertions about expected outcomes
Example of BAD TDD:
```rust
#[test]
fn feature_doesnt_work() {
    // One vague test that says "ya, this isn't implemented"
    assert!(feature.works(), "BUG: Feature not implemented");
}
```
Example of GOOD TDD:
```rust
#[test]
fn feature_handles_empty_input() { /* specific test */ }
#[test]
fn feature_handles_single_element() { /* specific test */ }
#[test]
fn feature_handles_boundary_value_zero() { /* specific test */ }
#[test]
fn feature_handles_boundary_value_max() { /* specific test */ }
#[test]
fn feature_rejects_invalid_input_gracefully() { /* specific test */ }
#[test]
fn feature_terminates_on_recursive_case() { /* specific test */ }
// ... one test per edge case identified in analysis
```
The number of failing tests should reflect the complexity of the feature. A complex feature like Fix/Call with tabling should have 20-50 failing tests covering:
- Basic happy path cases
- Boundary values (0, MAX, empty, single, many)
- Error cases (undefined references, invalid states)
- Termination cases (recursive patterns that must terminate)
- Interaction with other features
- State transitions
- Concurrency/coordination (if applicable)
DO NOT proceed to implementation until you have comprehensive failing tests.
rwlog is a relational/logic programming system built on term rewriting. It uses Free monads over functors to represent terms, with a constraint system for extensibility.
The semantic domain is tensor relations: relations between lists of ground terms.
TRel = [Term] -> [Term] -> Prop
Key operations:
- `empty`: the empty relation (bottom)
- `union R S`: disjunction (R union S)
- `inter R S`: conjunction (R intersection S)
- `comp R S`: sequential composition (R ; S)
- `dual R`: converse relation (swap inputs and outputs)
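As a hedged illustration only, the key operations can be modeled on finite relations between tuples of atoms. The real domain is Prop-valued and possibly infinite; the names and representation here are assumptions, not rwlog's API.

```rust
// Finite-model sketch of the TRel operations. Tuples of atoms stand in
// for lists of ground terms; a relation is a finite set of tuple pairs.
use std::collections::BTreeSet;

type Tuple = Vec<&'static str>;
type Rel = BTreeSet<(Tuple, Tuple)>;

fn union(r: &Rel, s: &Rel) -> Rel { r | s } // disjunction
fn inter(r: &Rel, s: &Rel) -> Rel { r & s } // conjunction

// Sequential composition: (a, d) whenever (a, b) in R and (b, d) in S.
fn comp(r: &Rel, s: &Rel) -> Rel {
    let mut out = Rel::new();
    for (a, b) in r {
        for (c, d) in s {
            if b == c {
                out.insert((a.clone(), d.clone()));
            }
        }
    }
    out
}

// Converse: swap inputs and outputs.
fn dual(r: &Rel) -> Rel {
    r.iter().map(|(a, b)| (b.clone(), a.clone())).collect()
}

fn main() {
    let r: Rel = [(vec!["a"], vec!["b"])].into_iter().collect();
    let s: Rel = [(vec!["b"], vec!["c"])].into_iter().collect();
    let expected: Rel = [(vec!["a"], vec!["c"])].into_iter().collect();
    assert_eq!(comp(&r, &s), expected);
    assert_eq!(dual(&dual(&r)), r); // dual is an involution
    assert!(inter(&r, &s).is_empty());
}
```

Note that `dual` is its own inverse and distributes over `union` and `inter`; the "no inputs/outputs, no directionality" principles above are what make the converse a total, information-preserving operation.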
A span Rw lhs rhs denotes a relation where:
- Input is a singleton list [t]
- Output is a singleton list [s]
- There exists a substitution sigma such that lhs[sigma] = t and rhs[sigma] = s
[[Rw lhs rhs]] [t] [s] <=> exists sigma. lhs[sigma] = t and rhs[sigma] = s
User-facing Rw nodes are factored into three internal forms before execution:
Rw lhs rhs c ~= RwL [normLhs] ; DropFresh w ; RwR [normRhs]
This separation isolates:
- RwL: Pattern matching (decomposition)
- DropFresh: Variable routing
- RwR: Term construction (composition)
RwL [p1, p2, ...] matches input terms against patterns and extracts variable bindings.
Semantics:
[[RwL patterns]] inp out <=>
exists sigma. patterns.map(.[sigma]) = inp and
out = sortedDistinctVars(patterns).map(sigma)
- Input: list of terms to match against patterns
- Output: list of variable values in sorted order of variable indices
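This semantics can be sketched for ground inputs. The term shape and all names below are assumptions for illustration, not rwlog's real API; the point is the output ordering (sorted distinct variable indices) and the one-sided nature of the match.

```rust
// Sketch: RwL matches ground input terms against patterns and outputs the
// matched variable values in sorted order of variable index.
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq, Eq)]
enum Term {
    Var(u32),
    App(&'static str, Vec<Term>),
}
use Term::*;

// One-sided matching of a pattern against a ground term.
fn match_pattern(pat: &Term, ground: &Term, sigma: &mut HashMap<u32, Term>) -> bool {
    match (pat, ground) {
        (Var(v), g) => match sigma.get(v) {
            Some(bound) => bound == g, // nonlinear patterns must agree
            None => { sigma.insert(*v, g.clone()); true }
        },
        (App(pf, pargs), App(gf, gargs)) => {
            pf == gf && pargs.len() == gargs.len()
                && pargs.iter().zip(gargs).all(|(p, t)| match_pattern(p, t, sigma))
        }
        _ => false, // inputs are ground here, so a pattern App never meets a Var
    }
}

fn sorted_distinct_vars(pats: &[Term]) -> Vec<u32> {
    fn go(t: &Term, acc: &mut Vec<u32>) {
        match t {
            Var(v) => acc.push(*v),
            App(_, args) => {
                for a in args {
                    go(a, acc);
                }
            }
        }
    }
    let mut vars = Vec::new();
    for p in pats {
        go(p, &mut vars);
    }
    vars.sort();
    vars.dedup();
    vars
}

// [[RwL patterns]]: input terms in, variable values (sorted by index) out.
fn rwl(pats: &[Term], inp: &[Term]) -> Option<Vec<Term>> {
    if pats.len() != inp.len() {
        return None;
    }
    let mut sigma = HashMap::new();
    for (p, t) in pats.iter().zip(inp) {
        if !match_pattern(p, t, &mut sigma) {
            return None;
        }
    }
    sorted_distinct_vars(pats)
        .iter()
        .map(|v| sigma.get(v).cloned())
        .collect()
}

fn main() {
    // RwL [B(A(0), 1)] on input [B(A(Z), W)] yields [Z, W].
    let pat = App("B", vec![App("A", vec![Var(0)]), Var(1)]);
    let inp = App("B", vec![App("A", vec![App("Z", vec![])]), App("W", vec![])]);
    let out = rwl(&[pat], &[inp]).expect("should match");
    assert_eq!(out, vec![App("Z", vec![]), App("W", vec![])]);
}
```

Reading the same function relationally in the other direction (variable values in, constructed terms out) gives the RwR direction, which is the duality stated below.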
RwR [p1, p2, ...] takes variable values and constructs output terms.
Semantics:
[[RwR patterns]] inp out <=>
exists sigma. inp = sortedDistinctVars(patterns).map(sigma) and
patterns.map(.[sigma]) = out
- Input: list of variable values in sorted order
- Output: list of terms constructed from patterns
Duality: RwL and RwR are perfect duals: [[RwR ps]] = dual([[RwL ps]])
DropFresh specifies how variables flow from input to output.
Intuition: Start with a tuple of n values, drop some, keep k, add fresh values, end with m values. There are no swaps or reorderings - just drop and add.
- Start with n input values
- Drop (n - k) values: the ones NOT selected by domain
- Keep k values: the ones selected by domain
- Add (m - k) fresh values: existentially quantified
- Result is m output values
Representation: A monotone partial injection - list of (input_pos, output_pos) pairs, strictly increasing in both coordinates.
```rust
struct DropFresh<C> {
    in_arity: u32,
    out_arity: u32,
    map: SmallVec<[(u32, u32); 4]>, // strictly increasing in both coords
    constraint: C,
}
```
Semantics:
[[DropFresh w]] inp out <=>
length inp = w.in_arity and
length out = w.out_arity and
forall (i, j) in w.map. inp[i] = out[j]
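The relation above is easy to check on concrete value lists. This is a sketch with assumed names (and integers standing in for terms), mirroring the semantics term by term:

```rust
// Sketch: checking the DropFresh relation on concrete value lists.
// Integers stand in for ground terms.
struct DropFresh {
    in_arity: usize,
    out_arity: usize,
    map: Vec<(usize, usize)>, // strictly increasing in both coordinates
}

// inp and out are related iff arities match and every mapped input
// position equals its output position; unmapped outputs are fresh.
fn relates(w: &DropFresh, inp: &[i64], out: &[i64]) -> bool {
    inp.len() == w.in_arity
        && out.len() == w.out_arity
        && w.map.iter().all(|&(i, j)| inp[i] == out[j])
}

fn main() {
    // n = 3 inputs: drop position 1, keep positions 0 and 2,
    // add one fresh value at output position 2.
    let w = DropFresh { in_arity: 3, out_arity: 3, map: vec![(0, 0), (2, 1)] };
    assert!(relates(&w, &[10, 20, 30], &[10, 30, 99])); // 99 is fresh
    assert!(!relates(&w, &[10, 20, 30], &[10, 31, 99])); // kept value must agree
}
```

Because the map is a monotone partial injection, there is no way to express a swap or reorder, only drops and fresh additions, exactly as the intuition above states.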
Given Rw lhs rhs c:
- Extract variables from each side independently (order of first appearance):
  ```
  lhsVars = nub (collectVarsOrdered lhs)  -- e.g., [2, 0, 5]
  rhsVars = nub (collectVarsOrdered rhs)  -- e.g., [0, 3, 5]
  ```
- Renumber each side to use consecutive indices starting at 0:
  ```
  normLhs uses vars 0..n-1  (where n = length lhsVars)
  normRhs uses vars 0..m-1  (where m = length rhsVars)
  ```
- Build labels for DropFresh:
  ```
  lhsLabels = [0, 1, ..., n-1]
  rhsLabels = for each (j, v) in enumerate(rhsVars):
                if v in lhsVars at position i: label = i
                else:                          label = n + j  (fresh, unique)
  ```
- Construct DropFresh from matching labels:
  - Shared variables: where lhsLabels[i] = rhsLabels[j]
  - map contains (i, j) pairs for shared variables
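The label steps above can be sketched on the example index lists from step 1. This is an illustration with assumed names, not rwlog's real factoring code, and it does not enforce the monotonicity invariant of the real map:

```rust
// Sketch of the DropFresh map construction. lhs_vars / rhs_vars are the
// nub'd first-appearance variable orders from step 1; the result is the
// list of (input_pos, output_pos) pairs for shared variables.
fn drop_fresh_map(lhs_vars: &[u32], rhs_vars: &[u32]) -> Vec<(usize, usize)> {
    let mut map = Vec::new();
    for (j, v) in rhs_vars.iter().enumerate() {
        if let Some(i) = lhs_vars.iter().position(|u| u == v) {
            map.push((i, j)); // shared: input slot i flows to output slot j
        }
        // otherwise: this rhs variable is fresh (existentially quantified)
    }
    map
}

fn main() {
    // lhsVars = [2, 0, 5], rhsVars = [0, 3, 5]:
    // shared var 0 (lhs pos 1 -> rhs pos 0), shared var 5 (lhs pos 2 -> rhs pos 2),
    // var 3 is fresh on the right.
    let expected: Vec<(usize, usize)> = vec![(1, 0), (2, 2)];
    assert_eq!(drop_fresh_map(&[2, 0, 5], &[0, 3, 5]), expected);
}
```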
The kernel has two primary operations:
Fuses NF1 ; NF2 into a single NF.
Homogeneous Fusion:
RwL ; RwL - Substitute inner terms into outer patterns:
RwL [p1, p2, ...] ; RwL [q1, q2, ...] -> RwL [p1[q/vars], p2[q/vars], ...]
RwR ; RwR - Substitute outer terms into inner patterns:
RwR [p1, p2, ...] ; RwR [q1, q2, ...] -> RwR [q1[p/vars], q2[p/vars], ...]
Heterogeneous Fusion:
RwR ; RwL - Matching at the interface (variables are renamed apart; names are not shared):
This fusion ALWAYS produces RwL ; DropFresh ; RwR (never just a DropFresh):
RwR [p1, ...] ; RwL [q1, ...] ->
if match(pi, qi) succeeds with sigma:
RwL [varsOf(p)[sigma], ...] ; DropFresh w ; RwR [varsOf(q)[sigma], ...]
else:
Fail
CRITICAL: The result RwL/RwR contain the variable lists with the matching substitution applied, NOT the original patterns.
Example: RwR [B(0,1)] ; RwL [B(A(2),3)]
- Match B(0,1) with B(A(2),3) under disjoint namespaces: sigma = {0 -> A(2), 1 -> 3}
- RwR vars [0,1] -> [A(2), 3]
- RwL vars [2,3] -> [2, 3] (unchanged)
- Result:
RwL [A(2), 3] ; DropFresh(identity 2->2) ; RwR [2, 3]
Common mistake to avoid: The fact that patterns become identical after matching says nothing about whether the operation is identity. The actual transformation is determined by the variable structure, not pattern equality.
Fuses And(NF1, NF2) into a single NF.
Implementation:
- Convert each NF to "direct rule" form (lhs_terms, rhs_terms, constraint)
- Rename-apart vars of the second side
- Match lhs lists, match rhs lists; combine constraints; normalize
- Factor back into NF
This eliminates looping behavior because meet_nf is a single terminating function, not a rewriting schema.
The universal principle for normalization is:
- RwL always moves left
- RwR always moves right
- DropFresh moves left (arbitrary choice, but consistent)
For And (RwL x ; rest) other, we pull RwL left:
And (RwL x ; rest) other -> RwL x ; And rest (RwR x ; other)
Why this is correct: RwL x and RwR x are converses:
a (RwL x) c <=> c (RwR x) a
Or nodes represent a search tree. There is no explicit search tree separate from the Or nodes.
Key principles:
- Add atomic relations to the search tree ONLY when one of its branches is normalized
- Only propagate up the branch to the leaf that is normalized
- Distribute lazily (copying across Ors could be unnecessary since most branches get pruned)
- Every time we take a step on an Or, we step into the left branch and rotate the Or
This mechanism mimics miniKanren-style interleaving search through Or rewrites. Treating Ors this way maintains separate search trees, where components not yet distributed to either branch are shared between the trees.
Similar rotation system for And for the sake of And fairness.
The kernel operates on a single compact canonical form:
```rust
struct NF<C> {
    // Match: patterns to decompose input
    match_pats: SmallVec<[TermId; 1]>,
    drop_fresh: DropFresh<C>,
    // Build: patterns to construct output
    build_pats: SmallVec<[TermId; 1]>,
}
```
Invariant: each NF is fully normalized (all adjacent fusions already done; identity Match/Build already folded away).
Initial: Rw (B(A(x),y)) (B(x,y)) ; Rw (B(A(u),v)) (B(u,v))
After factoring (with disjoint variables):
RwL [B(A(0),1)] ; W1 ; RwR [B(0,1)] ; RwL [B(A(2),3)] ; W2 ; RwR [B(2,3)]
Step 1 - RwR;RwL fusion on RwR [B(0,1)] ; RwL [B(A(2),3)]:
- Match (disjoint namespaces): 0 = A(2), 1 = 3
- RwR vars [0,1] -> [A(2), 3]
- RwL vars [2,3] -> [2, 3]
- Result:
RwL [A(2), 3] ; W_id ; RwR [2, 3]
After step 1:
RwL [B(A(0),1)] ; W1 ; RwL [A(2), 3] ; W_id ; RwR [2, 3] ; W2 ; RwR [B(2,3)]
Step 2 - W1;RwL fusion: W1 passes through unchanged (identity)
Step 3 - RwL;RwL fusion on RwL [B(A(0),1)] ; RwL [A(2), 3]:
- Substitute: var 0 -> A(2), var 1 -> 3
- B(A(0), 1) becomes B(A(A(2)), 3)
- Result:
RwL [B(A(A(2)), 3)]
Steps 4-6: Continue fusing DropFreshs and RwRs
Final result:
RwL [B(A(A(2)), 3)] ; W_final ; RwR [B(2,3)]
Collected as: Rw B(A(A(0)),1) B(0,1)
This correctly represents composing the "decrement" operation twice.
rwlog has a tracing feature flag that enables detailed evaluation traces with zero overhead when disabled.
```shell
# Run tests with tracing enabled
RUST_LOG=debug cargo test --features tracing

# Run with trace-level detail
RUST_LOG=trace cargo test --features tracing

# Run benchmarks
cargo bench

# Generate flamegraph (requires perf and inferno)
RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release --features tracing
perf record -g ./target/release/rwlog
perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg
```
When tracing is enabled, the following are instrumented:
Priority 1 (DEBUG/TRACE level):
- `step()` - Main eval dispatch (task_id, goal_id, kont_depth)
- `backtrack()` - Search backtracking (initial_depth, kont types popped)
- `resume_after_yield()` - Answer flow through continuations
- `compose_nf()` - Rule composition (arities, matching result)
Priority 2 (TRACE level):
- `handle_rule/alt/seq/both/call()` - Individual goal handlers
- `meet_nf()` - Conjunction/intersection
- `push_kont()` / `pop_kont()` - Continuation stack operations
The EvalMetrics struct collects aggregate statistics:
- Step counts
- Composition attempts (success/failure)
- Backtrack events
- Maximum continuation stack depth
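A minimal sketch of such a struct, matching the bullets above; the field and method names here are assumptions, not rwlog's actual `EvalMetrics` definition:

```rust
// Sketch only: aggregate statistics mirroring the bullets above.
#[derive(Debug, Default, Clone, Copy)]
struct EvalMetrics {
    steps: u64,
    compose_successes: u64,
    compose_failures: u64,
    backtracks: u64,
    max_kont_depth: usize,
}

impl EvalMetrics {
    // Track the high-water mark of the continuation stack.
    fn record_kont_depth(&mut self, depth: usize) {
        self.max_kont_depth = self.max_kont_depth.max(depth);
    }
}

fn main() {
    let mut m = EvalMetrics::default();
    m.steps += 1;
    m.record_kont_depth(4);
    m.record_kont_depth(2); // lower depth does not regress the maximum
    assert_eq!(m.max_kont_depth, 4);
}
```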
- `error` - Critical failures only
- `warn` - Unexpected but recoverable conditions
- `info` - High-level operation summaries
- `debug` - Detailed operation flow
- `trace` - Fine-grained step-by-step detail
All implementation follows strict TDD:
- Write failing tests FIRST - Before writing any implementation code
- Tests must be thorough and nontrivial - Not just basic happy-path tests
- Test both success and specific failures - Happy path AND unhappy path with specific expected failures
- Tests are verbose for debugging - Include enough output to diagnose failures
- Write code to make tests pass - Only after tests exist and fail
For each module, tests should cover:
Happy Path:
- Normal operation with valid inputs
- Edge cases within valid bounds
- Multiple valid input combinations
Unhappy Path (Specific Failures):
- Invalid inputs that should cause specific errors
- Boundary violations
- Malformed data
- Resource exhaustion scenarios where applicable
```rust
#[cfg(test)]
mod tests {
    use super::*;

    // Happy path tests
    #[test]
    fn match_identical_terms_succeeds() { ... }
    #[test]
    fn match_compatible_terms_produces_correct_substitution() { ... }

    // Unhappy path tests - specific expected failures
    #[test]
    fn match_incompatible_constructors_fails() { ... }
    #[test]
    fn match_occurs_check_prevents_infinite_term() { ... }
}
```
Tests use real implementations, not mocks. This ensures tests reflect actual behavior.
EVERY time a bug is identified, ALWAYS work to add failing tests that detect the bug FIRST. Then fix the bug.
The workflow is:
- Identify the bug
- Write a test that fails because of the bug
- Verify the test fails for the right reason
- Fix the bug
- Verify the test now passes
This ensures:
- The bug is well-understood before attempting a fix
- The fix actually addresses the root cause
- Regression protection is in place immediately
- The test suite grows to cover real-world failure modes
When a plan includes deletion or removal of old code, perform deletions as one of the earliest tasks.
Rationale:
- Old code lying around creates confusion and technical debt
- Stubs and compatibility layers accumulate complexity
- Deleting first forces a clean break and reveals true dependencies
- It's easier to build new code on a clean foundation than to maintain two systems
Approach:
- Delete the old files immediately
- Update module declarations to remove references
- Fix compilation errors by building new code, not by adding stubs
- Never maintain "legacy" or "backwards compatibility" code
A bug-demonstrating test must isolate the exact condition that triggers the bug. This is different from a test that merely detects the bug exists somewhere in the stack.
Principles:
- Use the lowest-level API possible - If the bug is in `eval.rs`, write the test in `eval.rs`, not in `repl.rs`, which calls through 5 layers
- Set up only the minimal state needed - Don't load files, parse strings, or initialize full systems when you can construct the exact data structure directly
- Assert on the specific broken behavior - The assertion should fail for exactly the reason the bug exists, not a downstream symptom
- Name the test after the bug pattern - e.g., `resume_after_yield_alt_over_seq_must_continue`, not `backward_query_test`
Example - BAD (detects but doesn't demonstrate):
```rust
fn test_backward_query_bug() {
    let mut repl = Repl::new();
    repl.process_line("load examples/addition.txt").unwrap();
    let result = repl.process_line("add ; z").unwrap();
    assert!(result.contains("(cons z z) -> z")); // Fails, but WHY?
}
```
Example - GOOD (demonstrates the exact bug):
```rust
fn resume_after_yield_with_seq_below_alt_must_continue() {
    // Set up exact stack state: SeqNext below AltNext
    let mut task = make_task(0);
    task.push_kont(Kont::SeqNext { remaining: smallvec![constraint_goal] });
    task.push_kont(Kont::AltNext { remaining: smallvec![alt2] });

    // This is the exact call that misbehaves
    let result = resume_after_yield(&mut task, answer, &mut ctx);

    // The bug: returns Yielded when it should Continue through SeqNext
    assert_eq!(result, StepResult::Continue,
        "BUG: Alt yield must flow through SeqNext, not be treated as final");
}
```