Agentic ML-engineering loop

An LLM proposes, the classical engine decides — a deterministic verifier, not "it ran", is the judge.

The agentic loop realizes the SOTA AutoML pattern grounded on a deterministic executor: an LLM proposes a solution (a trainer plus hyperparameters), the classical engine trains and cross-validates it, and a Verifier — a stage distinct from execution-success — decides whether the result is genuinely good. Search is greedy with reflection over the attempt history, bounded by an iteration and patience budget.

The whole cycle is propose → train/CV → verify → reflect → select.

!!! firefly "The recurring pattern — the LLM proposes; the classical engine decides"

The LLM never gets the last word. It *suggests* the next `(trainer, params)` candidate; the
classical engine cross-validates it and a deterministic `Verifier` rules on whether it actually
beats a trivial baseline. A candidate that runs but fails to clear the baseline is rejected — so
the loop can only ever return a model that measurably earned its place.

The pieces

Type	Role
`SolutionCandidate`	A proposal: `trainer`, `params`, `rationale`.
`Verdict`	The verifier's judgement: `valid`, `reason`, `score`.
`AttemptRecord`	One iteration: `candidate`, `score`, `verdict`.
`EngineeringRun`	The full trace of a run, plus the fitted best model.
`AgenticAutoML`	The loop engine (`AgenticLoopPort`).
`DeterministicVerifier`	Correctness check: finite + beats the trivial baseline.
`AgentSolutionProposer`	LLM-backed proposer (reflects via a `FireflyAgent`).
`SequenceProposer`	Deterministic proposer for tests / fixed strategies.

SolutionCandidate, Verdict, and AttemptRecord are frozen dataclasses; EngineeringRun carries the trace and the refit model. The two proposers both satisfy the CandidateProposer protocol, and DeterministicVerifier satisfies the Verifier protocol — so any of them can be swapped for a custom implementation.

Quick start

AgenticAutoML takes any proposer and runs the loop over a Dataset:

from fireflyframework_datascience.datasets import Dataset
from fireflyframework_datascience.engineering import SolutionCandidate, SequenceProposer
from fireflyframework_datascience.engineering.loop import AgenticAutoML

dataset = Dataset.from_frame(df, target="churned")

proposer = SequenceProposer([
    SolutionCandidate("linear"),
    SolutionCandidate("random_forest", {"n_estimators": 300}),
    SolutionCandidate("hist_gradient_boosting", {"learning_rate": 0.05}),
])

loop = AgenticAutoML(proposer, max_iterations=8, patience=3, cv=5)
run = loop.solve(dataset)

print(run.summary())

!!! success "Expected"

```text
Agentic AutoML: 3 attempts (2 verified); best=hist_gradient_boosting roc_auc=0.9123 (baseline 0.5000)
```

The engine seeds the candidates from propose_initial, then repeatedly calls propose_next until a candidate is None, the iteration budget is spent, or patience runs out.

propose → train/CV → verify → reflect → select

The engine never trusts a candidate just because it executed. Each attempt is:

propose — the proposer yields a SolutionCandidate.
train / CV — the candidate is wrapped in a preprocessing pipeline and scored with sklearn's cross_val_score (default cv=5). A failing candidate scores -inf and never aborts the loop.
verify — the Verifier turns a raw score into a Verdict. Only valid verdicts can become the best.
reflect — propose_next is handed the full history to inform the next proposal.
select — the highest-scoring verified candidate wins and is refit on all data.

for candidate in self._proposer.propose_initial(dataset, names):  # (1)!
    record = self._attempt(dataset, candidate, task, scoring, baseline)  # (2)!
    attempts.append(record)
    if record.verdict.valid and record.score > best_score:  # (3)!
        best, best_score = candidate, record.score

patience = self._patience
for _ in range(self._max_iterations):
    candidate = self._proposer.propose_next(dataset, attempts, names)  # (4)!
    if candidate is None:
        break
    record = self._attempt(dataset, candidate, task, scoring, baseline)
    attempts.append(record)
    if record.verdict.valid and record.score > best_score:
        best, best_score, patience = candidate, record.score, self._patience  # (5)!
    else:
        patience -= 1
        if patience <= 0:
            break

Seed — propose_initial returns the starting population, evaluated before any reflection.
Train / CV + verify — _attempt cross-validates the candidate and asks the Verifier for a Verdict in one step.
Select — only a valid verdict that improves on the running best can take the lead.
Reflect — propose_next receives the full attempts history; a returned None ends the loop.
Patience reset — an improving verified attempt restores the full patience budget; a non-improving one decrements it, stopping the loop at zero.

DeterministicVerifier — correctness, not execution

A run that produces a number is not the same as a run that produced a good number.

!!! note "Correctness ≠ ran"

A candidate that trained and returned a score has only proven it *executed*. Verification is a
separate stage: `DeterministicVerifier` demands a *finite* score that *beats the trivial
baseline* by a `margin`. Anything else is rejected — execution-success is never mistaken for
correctness.

DeterministicVerifier requires a finite score that beats the trivial baseline (a DummyClassifier with strategy="prior" / DummyRegressor with strategy="mean") by a margin:

from fireflyframework_datascience.engineering.loop import DeterministicVerifier

verifier = DeterministicVerifier(margin=0.01)   # must beat baseline by at least 0.01
loop = AgenticAutoML(proposer, verifier=verifier)

Verdicts read like a review:

Verdict(valid=False, reason="training failed or produced a non-finite score", score=-inf)
Verdict(valid=False, reason="does not beat the trivial baseline (0.5010 <= 0.5000)", score=0.5010)
Verdict(valid=True,  reason="beats trivial baseline by +0.4123", score=0.9123)

You can supply any object implementing the Verifier protocol (verify(dataset, candidate, score, baseline) -> Verdict).

Proposers

Both built-in proposers satisfy the CandidateProposer protocol — pick the LLM-backed one for real search, or the deterministic one for tests and fixed strategies.

=== "AgentSolutionProposer (LLM)"

`AgentSolutionProposer` seeds every trainer at its defaults, then reflects on the ranked history
via a `FireflyAgent`. The LLM client is built lazily on first reflection — no client is created at
startup:

```python
from fireflyframework_datascience.engineering.loop import AgenticAutoML, AgentSolutionProposer

proposer = AgentSolutionProposer(model="openai:gpt-4o")
run = AgenticAutoML(proposer, max_iterations=10).solve(dataset)
```

On each `propose_next`, the agent receives the task, the allowed trainers, and the best-first
attempt history (top 8), and returns a structured `(trainer, params_json, rationale)`. If the
model names a trainer outside the allowed list, the proposer falls back to the best trainer seen so
far; malformed `params_json` degrades to `{}`. You can also inject a pre-built agent with
`AgentSolutionProposer(agent=my_agent)`.

=== "SequenceProposer (deterministic)"

For tests or fixed search plans, `SequenceProposer` replays a fixed candidate list — the first is
the seed, the rest are dispensed one per `propose_next`:

```python
from fireflyframework_datascience.engineering import SequenceProposer, SolutionCandidate

proposer = SequenceProposer([
    SolutionCandidate("linear", rationale="cheap baseline"),
    SolutionCandidate("random_forest", {"max_depth": 8}),
])
```

To write your own, implement the CandidateProposer protocol: propose_initial(dataset, trainers) and propose_next(dataset, history, trainers).

!!! tip "Which trainers are allowed"

The `trainers` list a proposer sees comes from the loop's registry. By default that is `linear`,
`random_forest`, and `hist_gradient_boosting`, plus `xgboost`, `lightgbm`, and `catboost` when
those optional libraries are installed. Pass `trainers=...` to `AgenticAutoML` to constrain or
extend the search space.

The EngineeringRun trace

solve returns an EngineeringRun — a full, auditable trace plus the refit best model:

run = loop.solve(dataset)

run.best_candidate      # SolutionCandidate | None
run.best_score          # float (nan if nothing verified)
run.model               # the refit Model (None if nothing verified)
run.metric              # e.g. "roc_auc"
run.baseline_score      # the trivial baseline it had to beat
run.n_iterations        # total attempts
run.valid_attempts      # only the verified ones

for a in run.attempts:
    print(a.candidate.trainer, a.score, a.verdict.valid, a.verdict.reason)

n_iterations and valid_attempts are derived from attempts, and summary() renders the one-line recap shown under Quick start.

Budgets: iterations and patience

The loop is greedy with two knobs:

max_iterations (default 8) — the hard cap on reflection rounds after seeding.
patience (default 3) — consecutive non-improving attempts allowed before early stopping.

Each improving verified attempt resets patience to the full budget; each non-improving one decrements it, and the loop stops when it hits zero. Tune the trade-off between thoroughness and cost:

loop = AgenticAutoML(
    proposer,
    cv=5,
    max_iterations=12,   # explore more
    patience=4,          # tolerate more dead ends
    random_state=42,
)

!!! warning "Patience only counts after seeding"

The initial population from `propose_initial` is always fully evaluated; patience and
`max_iterations` bound only the reflection rounds that follow. A run with an empty or trivial seed
still respects the iteration budget.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Agentic ML-engineering loop

The pieces

Quick start

propose → train/CV → verify → reflect → select

DeterministicVerifier — correctness, not execution

Proposers

The EngineeringRun trace

Budgets: iterations and patience

See also

Uh oh!

FilesExpand file tree

agentic-loop.md

Latest commit

History

agentic-loop.md

File metadata and controls

Agentic ML-engineering loop

The pieces

Quick start

propose → train/CV → verify → reflect → select

DeterministicVerifier — correctness, not execution

Proposers

The EngineeringRun trace

Budgets: iterations and patience

See also