An LLM proposes, the classical engine decides: a deterministic verifier, not "it ran", is the judge.

The agentic loop realizes the state-of-the-art AutoML pattern grounded on a deterministic executor: an LLM proposes a solution (a trainer plus hyperparameters), the classical engine trains and cross-validates it, and a `Verifier`, a stage distinct from execution success, decides whether the result is genuinely good. Search is greedy with reflection over the attempt history, bounded by an iteration budget and a patience budget.

The whole cycle is propose → train/CV → verify → reflect → select.

!!! firefly "The recurring pattern: the LLM proposes; the classical engine decides"
    The LLM never gets the last word. It *suggests* the next `(trainer, params)` candidate; the
    classical engine cross-validates it and a deterministic `Verifier` rules on whether it actually
    beats a trivial baseline. A candidate that runs but fails to clear the baseline is rejected, so
    the loop can only ever return a model that measurably earned its place.

| Type | Role |
|---|---|
| `SolutionCandidate` | A proposal: trainer, params, rationale. |
| `Verdict` | The verifier's judgement: valid, reason, score. |
| `AttemptRecord` | One iteration: candidate, score, verdict. |
| `EngineeringRun` | The full trace of a run, plus the fitted best model. |
| `AgenticAutoML` | The loop engine (`AgenticLoopPort`). |
| `DeterministicVerifier` | Correctness check: finite + beats the trivial baseline. |
| `AgentSolutionProposer` | LLM-backed proposer (reflects via a `FireflyAgent`). |
| `SequenceProposer` | Deterministic proposer for tests / fixed strategies. |

`SolutionCandidate`, `Verdict`, and `AttemptRecord` are frozen dataclasses; `EngineeringRun` carries
the trace and the refit model. The two proposers both satisfy the `CandidateProposer` protocol, and
`DeterministicVerifier` satisfies the `Verifier` protocol, so any of them can be swapped for a custom
implementation.

`AgenticAutoML` takes any proposer and runs the loop over a `Dataset`:

```python
from fireflyframework_datascience.datasets import Dataset
from fireflyframework_datascience.engineering import SolutionCandidate, SequenceProposer
from fireflyframework_datascience.engineering.loop import AgenticAutoML

# df is assumed to be loaded already
dataset = Dataset.from_frame(df, target="churned")
proposer = SequenceProposer([
    SolutionCandidate("linear"),
    SolutionCandidate("random_forest", {"n_estimators": 300}),
    SolutionCandidate("hist_gradient_boosting", {"learning_rate": 0.05}),
])
loop = AgenticAutoML(proposer, max_iterations=8, patience=3, cv=5)
run = loop.solve(dataset)
print(run.summary())
```

!!! success "Expected"
    ```text
    Agentic AutoML: 3 attempts (2 verified); best=hist_gradient_boosting roc_auc=0.9123 (baseline 0.5000)
    ```

The engine seeds the candidates from `propose_initial`, then repeatedly calls `propose_next` until it
returns `None`, the iteration budget is spent, or patience runs out.

The engine never trusts a candidate just because it executed. Each attempt is:

- propose: the proposer yields a `SolutionCandidate`.
- train / CV: the candidate is wrapped in a preprocessing pipeline and scored with scikit-learn's `cross_val_score` (default `cv=5`). A failing candidate scores `-inf` and never aborts the loop.
- verify: the `Verifier` turns a raw score into a `Verdict`. Only `valid` verdicts can become the best.
- reflect: `propose_next` is handed the full `history` to inform the next proposal.
- select: the highest-scoring verified candidate wins and is refit on all data.
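
The "a failing candidate scores `-inf` and never aborts the loop" rule can be illustrated without the library. This is a minimal sketch, not the engine's code: `safe_score`, `healthy`, and `broken` are hypothetical names, and the scoring callables stand in for a real cross-validation call.

```python
import math
from typing import Callable


def safe_score(score_fn: Callable[[], float]) -> float:
    """Run a scoring callable; map any raised exception to -inf."""
    try:
        return float(score_fn())
    except Exception:  # a broken candidate must not abort the search
        return -math.inf


def healthy() -> float:
    return 0.87  # stands in for a mean cross-validation score


def broken() -> float:
    raise ValueError("bad hyperparameters")


scores = [safe_score(healthy), safe_score(broken)]
print(scores)
```

Because the failure is turned into an ordinary (very bad) score, the verification step can reject it the same way it rejects any other non-finite result.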

```python
for candidate in self._proposer.propose_initial(dataset, names):  # (1)!
    record = self._attempt(dataset, candidate, task, scoring, baseline)  # (2)!
    attempts.append(record)
    if record.verdict.valid and record.score > best_score:  # (3)!
        best, best_score = candidate, record.score

patience = self._patience
for _ in range(self._max_iterations):
    candidate = self._proposer.propose_next(dataset, attempts, names)  # (4)!
    if candidate is None:
        break
    record = self._attempt(dataset, candidate, task, scoring, baseline)
    attempts.append(record)
    if record.verdict.valid and record.score > best_score:
        best, best_score, patience = candidate, record.score, self._patience  # (5)!
    else:
        patience -= 1
        if patience <= 0:
            break
```

1. Seed: `propose_initial` returns the starting population, evaluated before any reflection.
2. Train / CV + verify: `_attempt` cross-validates the candidate and asks the `Verifier` for a `Verdict` in one step.
3. Select: only a `valid` verdict that improves on the running best can take the lead.
4. Reflect: `propose_next` receives the full `attempts` history; a returned `None` ends the loop.
5. Patience reset: an improving verified attempt restores the full patience budget; a non-improving one decrements it, stopping the loop at zero.
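
To see how the two budgets interact, the control flow can be replayed on pre-computed scores. The sketch below is a simplified stand-in, not the engine itself: `run_budget` is a hypothetical helper, and "verified" is reduced to "finite and above the baseline".

```python
import math


def run_budget(seed_scores, reflect_scores, baseline, max_iterations=8, patience=3):
    """Replay scores through the seed phase and the bounded reflection phase.

    Returns (best_score, number_of_attempts_evaluated).
    """
    def verified(score):
        # Simplified verification: finite and strictly above the baseline.
        return math.isfinite(score) and score > baseline

    best = -math.inf
    evaluated = 0

    for score in seed_scores:  # seeds are always fully evaluated
        evaluated += 1
        if verified(score) and score > best:
            best = score

    remaining = patience
    for score in reflect_scores[:max_iterations]:  # hard cap on reflection rounds
        evaluated += 1
        if verified(score) and score > best:
            best, remaining = score, patience  # improvement restores the budget
        else:
            remaining -= 1
            if remaining <= 0:
                break
    return best, evaluated


best, evaluated = run_budget(
    seed_scores=[0.70, 0.81],
    reflect_scores=[0.80, 0.85, 0.84, 0.83, 0.82, 0.90],
    baseline=0.50,
)
print(best, evaluated)  # 0.85 7
```

In this replay the final `0.90` is never evaluated: three consecutive non-improving rounds after `0.85` exhaust the patience budget first, which is the cost of a greedy search with early stopping.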
A run that produces a number is not the same as a run that produced a good number.

!!! note "Correctness ≠ ran"
    A candidate that trained and returned a score has only proven it *executed*. Verification is a
    separate stage: `DeterministicVerifier` demands a *finite* score that *beats the trivial
    baseline* by a `margin`. Anything else is rejected; execution success is never mistaken for
    correctness.

`DeterministicVerifier` requires a finite score that beats the trivial baseline (a `DummyClassifier`
with `strategy="prior"` or a `DummyRegressor` with `strategy="mean"`) by a `margin`:

```python
from fireflyframework_datascience.engineering.loop import AgenticAutoML, DeterministicVerifier

verifier = DeterministicVerifier(margin=0.01)  # must beat baseline by at least 0.01
loop = AgenticAutoML(proposer, verifier=verifier)  # proposer: any CandidateProposer
```

Verdicts read like a review:

```text
Verdict(valid=False, reason="training failed or produced a non-finite score", score=-inf)
Verdict(valid=False, reason="does not beat the trivial baseline (0.5010 <= 0.5000)", score=0.5010)
Verdict(valid=True, reason="beats trivial baseline by +0.4123", score=0.9123)
```

You can supply any object implementing the `Verifier` protocol
(`verify(dataset, candidate, score, baseline) -> Verdict`).
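
As an illustration of a custom verifier, the sketch below adds an absolute score floor on top of the finite-and-beats-baseline rule. It is self-contained and hedged: `Verdict` here is a local stand-in that only mirrors the documented fields, `FloorVerifier` and its `floor` parameter are hypothetical, and the exact comparison the built-in verifier uses for `margin` is an assumption.

```python
import math
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Verdict:
    """Local stand-in mirroring the documented fields: valid, reason, score."""
    valid: bool
    reason: str
    score: float


class FloorVerifier:
    """Hypothetical verifier: finite, beats baseline by `margin`, clears `floor`."""

    def __init__(self, margin: float = 0.01, floor: float = 0.75) -> None:
        self.margin = margin
        self.floor = floor

    def verify(self, dataset: Any, candidate: Any, score: float, baseline: float) -> Verdict:
        if not math.isfinite(score):
            return Verdict(False, "training failed or produced a non-finite score", score)
        threshold = baseline + self.margin  # assumed form of the margin check
        if score <= threshold:
            return Verdict(False, f"does not clear baseline + margin ({score:.4f} <= {threshold:.4f})", score)
        if score < self.floor:
            return Verdict(False, f"below the required floor ({score:.4f} < {self.floor:.4f})", score)
        return Verdict(True, f"beats trivial baseline by +{score - baseline:.4f}", score)


verifier = FloorVerifier(margin=0.01, floor=0.75)
good = verifier.verify(None, None, 0.9123, 0.5)
weak = verifier.verify(None, None, 0.60, 0.5)
print(good)
print(weak)
```

With the real library, the stand-in `Verdict` would be replaced by the library's own type, and the object would be passed as `AgenticAutoML(proposer, verifier=...)`.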

Both built-in proposers satisfy the `CandidateProposer` protocol. Pick the LLM-backed one for real
search, or the deterministic one for tests and fixed strategies.

=== "AgentSolutionProposer (LLM)"

    `AgentSolutionProposer` seeds every trainer at its defaults, then reflects on the ranked history
    via a `FireflyAgent`. The LLM client is built lazily on first reflection, so no client is created
    at startup:

    ```python
    from fireflyframework_datascience.engineering.loop import AgenticAutoML, AgentSolutionProposer

    proposer = AgentSolutionProposer(model="openai:gpt-4o")
    run = AgenticAutoML(proposer, max_iterations=10).solve(dataset)
    ```

    On each `propose_next`, the agent receives the task, the allowed trainers, and the best-first
    attempt history (top 8), and returns a structured `(trainer, params_json, rationale)`. If the
    model names a trainer outside the allowed list, the proposer falls back to the best trainer seen
    so far; malformed `params_json` degrades to `{}`. You can also inject a pre-built agent with
    `AgentSolutionProposer(agent=my_agent)`.

=== "SequenceProposer (deterministic)"

    For tests or fixed search plans, `SequenceProposer` replays a fixed candidate list: the first is
    the seed, the rest are dispensed one per `propose_next`:

    ```python
    from fireflyframework_datascience.engineering import SequenceProposer, SolutionCandidate

    proposer = SequenceProposer([
        SolutionCandidate("linear", rationale="cheap baseline"),
        SolutionCandidate("random_forest", {"max_depth": 8}),
    ])
    ```
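
The fallback rules described for `AgentSolutionProposer` (an unknown trainer falls back to the best one seen so far, malformed `params_json` degrades to `{}`) can be sketched as a small sanitizing step. This is an illustration only: `sanitize_proposal` is a hypothetical helper, not the library's function, and treating valid JSON that is not an object as `{}` is an added assumption.

```python
import json
from typing import Any, Dict, Sequence, Tuple


def sanitize_proposal(
    trainer: str,
    params_json: str,
    allowed: Sequence[str],
    best_trainer_so_far: str,
) -> Tuple[str, Dict[str, Any]]:
    """Coerce a raw LLM proposal into a safe (trainer, params) pair."""
    if trainer not in allowed:
        trainer = best_trainer_so_far  # unknown trainer: reuse the current best
    try:
        params = json.loads(params_json)
    except (json.JSONDecodeError, TypeError):
        params = {}  # malformed JSON degrades to defaults
    if not isinstance(params, dict):
        params = {}  # assumption: non-object JSON is also discarded
    return trainer, params


allowed = ["linear", "random_forest"]
print(sanitize_proposal("svm", "{not json", allowed, "random_forest"))
print(sanitize_proposal("linear", '{"C": 0.5}', allowed, "random_forest"))
```

Sanitizing before evaluation keeps a malformed LLM answer from wasting an iteration on a candidate that cannot even be constructed.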

To write your own, implement the `CandidateProposer` protocol: `propose_initial(dataset, trainers)`
and `propose_next(dataset, history, trainers)`.
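
A custom proposer only needs those two methods. The sketch below is a hedged, self-contained example: `SolutionCandidate` and `AttemptRecord` are simplified local stand-ins for the library types (the real record also carries a verdict), and `DepthSweepProposer` is a hypothetical strategy that seeds one trainer and then sweeps `max_depth`.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Sequence


@dataclass(frozen=True)
class SolutionCandidate:  # stand-in mirroring the documented fields
    trainer: str
    params: dict = field(default_factory=dict)
    rationale: str = ""


@dataclass(frozen=True)
class AttemptRecord:  # simplified stand-in: candidate and score only
    candidate: SolutionCandidate
    score: float


class DepthSweepProposer:
    """Hypothetical proposer: seed one trainer, then try each max_depth once."""

    def __init__(self, trainer: str = "random_forest", depths: Sequence[int] = (4, 8, 16)) -> None:
        self._trainer = trainer
        self._depths = list(depths)

    def _pick(self, trainers: Sequence[str]) -> str:
        # Respect the allowed list; fall back to its first entry.
        return self._trainer if self._trainer in trainers else trainers[0]

    def propose_initial(self, dataset: Any, trainers: Sequence[str]) -> List[SolutionCandidate]:
        return [SolutionCandidate(self._pick(trainers), rationale="seed at defaults")]

    def propose_next(
        self, dataset: Any, history: Sequence[AttemptRecord], trainers: Sequence[str]
    ) -> Optional[SolutionCandidate]:
        tried = {a.candidate.params.get("max_depth") for a in history}
        for depth in self._depths:
            if depth not in tried:
                return SolutionCandidate(self._pick(trainers), {"max_depth": depth}, f"try max_depth={depth}")
        return None  # sweep exhausted: signals the loop to stop


proposer = DepthSweepProposer()
trainers = ["linear", "random_forest"]
history = [AttemptRecord(c, 0.8) for c in proposer.propose_initial(None, trainers)]
while (nxt := proposer.propose_next(None, history, trainers)) is not None:
    history.append(AttemptRecord(nxt, 0.8))  # constant score: only the proposal order matters here
print([a.candidate.params for a in history])
```

Returning `None` once the sweep is exhausted is what lets a deterministic strategy end the search before the iteration or patience budget does.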

!!! tip "Which trainers are allowed"
    The `trainers` list a proposer sees comes from the loop's registry. By default that is `linear`,
    `random_forest`, and `hist_gradient_boosting`, plus `xgboost`, `lightgbm`, and `catboost` when
    those optional libraries are installed. Pass `trainers=...` to `AgenticAutoML` to constrain or
    extend the search space.

`solve` returns an `EngineeringRun`: a full, auditable trace plus the refit best model:

```python
run = loop.solve(dataset)

run.best_candidate   # SolutionCandidate | None
run.best_score       # float (nan if nothing verified)
run.model            # the refit Model (None if nothing verified)
run.metric           # e.g. "roc_auc"
run.baseline_score   # the trivial baseline it had to beat
run.n_iterations     # total attempts
run.valid_attempts   # only the verified ones

for a in run.attempts:
    print(a.candidate.trainer, a.score, a.verdict.valid, a.verdict.reason)
```

`n_iterations` and `valid_attempts` are derived from `attempts`, and `summary()` renders the one-line
recap shown under Quick start.
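
What "derived from `attempts`" means can be shown with a small stand-in. This is a sketch under assumptions, not the library's `EngineeringRun`: `Attempt` and `RunTrace` are hypothetical, flattened types, and computing `best_score` from the attempts is an illustrative choice (the real run may store it directly).

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Attempt:  # hypothetical, flattened stand-in for AttemptRecord
    trainer: str
    score: float
    valid: bool


@dataclass(frozen=True)
class RunTrace:  # hypothetical stand-in for EngineeringRun
    attempts: Tuple[Attempt, ...]

    @property
    def n_iterations(self) -> int:
        return len(self.attempts)  # every attempt counts, verified or not

    @property
    def valid_attempts(self) -> List[Attempt]:
        return [a for a in self.attempts if a.valid]

    @property
    def best(self) -> Optional[Attempt]:
        return max(self.valid_attempts, key=lambda a: a.score, default=None)

    @property
    def best_score(self) -> float:
        best = self.best
        return best.score if best is not None else math.nan


trace = RunTrace((
    Attempt("linear", 0.50, False),
    Attempt("random_forest", 0.88, True),
    Attempt("hist_gradient_boosting", 0.91, True),
))
print(trace.n_iterations, len(trace.valid_attempts), trace.best.trainer)
```

Deriving the counters from the attempt list keeps the trace as the single source of truth, so the summary can never disagree with the recorded attempts.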

The loop is greedy with two knobs:

- `max_iterations` (default `8`): the hard cap on reflection rounds after seeding.
- `patience` (default `3`): consecutive non-improving attempts allowed before early stopping.

Each improving verified attempt resets patience to the full budget; each non-improving one decrements it, and the loop stops when it hits zero. Tune the trade-off between thoroughness and cost:

```python
loop = AgenticAutoML(
    proposer,
    cv=5,
    max_iterations=12,  # explore more
    patience=4,         # tolerate more dead ends
    random_state=42,
)
```

!!! warning "Patience only counts after seeding"
    The initial population from `propose_initial` is always fully evaluated; patience and
    `max_iterations` bound only the reflection rounds that follow. A run with an empty or trivial seed
    still respects the iteration budget.

- Datasets: the `Dataset` the loop searches over.
- AutoML: the trainer registry, metrics, and the preprocessing pipeline wrapped around every candidate.
- GenAI features: other places the LLM proposes and the classical engine decides.
- LLM configuration: wiring the model behind `AgentSolutionProposer`.
- Architecture: the ports and adapters the loop plugs into.