Every. Single. Commit. Passes. Your. Tests.
Point it at a green repo, give it a goal — "migrate Flask → FastAPI", "Angular → React", "unittest → pytest" — and it walks your codebase there in dozens of tiny commits, running your real test suite after each one, refusing to advance while anything is red.
Most "AI migration" tools dump one giant diff in your lap and wish you luck. evergreen sells the opposite — a guarantee:
No exceptions. Enforced in deterministic code the model cannot reach.
The result is a git history that reads like a careful senior engineer wrote it: atomic,
reviewable, and git bisect-able from the first commit to the last.
The magic isn't the code generation — it's the closed loop. Here's a real run (a module rename driven by a local model via Ollama). Watch step 5: the model writes a broken edit, the suite goes red, the agent diagnoses it, repairs it, and only then commits:
baseline green: 14 tests in 0.23s
step 5/8 step-004 · Update calclib.stats to import from arithmetic (expand_contract, move)
apply llm edit (attempt 1)
apply error: replace_in_file: 'import calclib.ops' not found in calclib/stats.py
diagnose repair: Update calclib.stats to import from arithmetic
apply builtin:apply_edits (attempt 2)
green (full suite, 0.22s)
commit a8752e7d (step-004) ← committed ONLY after green
step 7/8 step-006 · Remove calclib.ops.py (expand_contract, remove)
apply llm edit (attempt 1)
red (full suite, 0.25s) → tests/test_ops.py::TestOps::test_div
diagnose repair: Remove calclib.ops.py
apply builtin:apply_edits (attempt 2)
red (full suite, 0.23s) → 3 failing
rollback step-006: could not reach green within the repair budget ← NO commit. clean.
done 6 green commits / 8 steps
migration incomplete: rolled back [step-006]. Your branch is untouched.
That's the whole pitch in one screen: the model makes mistakes, the invariant catches every one of them. A weaker model just means more repairs and the occasional honest "I couldn't finish this step" — never a broken commit, never a touched branch.
Install straight from GitHub (no PyPI release needed):
pip install "git+https://github.com/utsabpanta/evergreen-migration-agent" # Python 3.11+Or clone for development:
git clone https://github.com/utsabpanta/evergreen-migration-agent
cd evergreen-migration-agent
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"Then point it at any OpenAI-compatible model endpoint and run it inside a target repo:
export LLM_BASE_URL=http://localhost:11434/v1 # e.g. Ollama
export LLM_MODEL=qwen3-coder:480b-cloud
cd ~/path/to/your/green/repo # must be: a git repo, clean tree, passing tests
evergreen plan "migrate the test suite from unittest to pytest" # preview the DAG — changes nothing
evergreen run "migrate the test suite from unittest to pytest" # do it, with a live traceThe four commands:
| Command | What it does |
|---|---|
evergreen plan "<goal>" |
Show the step DAG + chosen strategies. Changes nothing. |
evergreen run "<goal>" |
Run it: plan → apply → test → commit, never advancing on red. |
evergreen status |
Done / next / blocked steps for the saved run. |
evergreen resume |
Continue an interrupted run from the last green commit. |
Flags worth knowing: --no-promote (leave the result on the evergreen/* branch instead
of fast-forwarding yours), --interactive (approve each step), --plan-file plan.json (run a
reviewed plan deterministically), --characterize (generate safety-net tests if coverage is
missing), --redact (log only hashes of what's sent to a hosted model), --max-repairs /
--max-replans (tune the autonomy budget).
The unittest→pytest migration ships as a deterministic recipe — it needs no LLM at all:
cp -r tests/fixtures/toy_unittest_pkg /tmp/demo && cd /tmp/demo
git init -b main && git add -A && git commit -m "initial"
evergreen run "migrate the test suite from unittest to pytest"
git log --oneline # 5 commits — check out any one; pytest is green at every stepevergreen is language-agnostic; "green" is defined by adapters that read your real test runner:
| Ecosystem | Runner | Structured results via |
|---|---|---|
| Python | pytest (also collects unittest suites) |
junit XML |
| Node.js | node --test (built-in) |
TAP reporter |
| JS / TS | jest | --json reporter |
| JS / TS | vitest | --reporter=json |
| anything | any command via [tool.evergreen] test_command |
exit code |
For well-known migrations, playbooks inject battle-tested strategy into the planner: Angular→React, Next.js→TanStack, JS→TS, CommonJS→ESM, Flask→FastAPI, unittest→pytest — each with the right coexistence pattern and the oracle warning (e.g. "Angular TestBed unit tests die with the framework — you need behavior-level tests").
goal + repo ─► Planner (LLM) ─► DAG of atomic steps ─► Orchestrator (DETERMINISTIC)
▲ │ for each step:
│ re-plan │ apply → verify → commit
│ │ red? → diagnose → repair → retry
Diagnostician (LLM) ◄── red tests ────────────┤ stuck? → roll back, never commit
▼
one green commit per step ✅
- The model proposes; your test suite disposes. The LLM plans, edits, and diagnoses. The
commit decision lives in
CommitManager, which demands the actual green full-suite result as proof and raisesPrimeDirectiveViolationotherwise. The model has no path to a commit. - Crossing the valley. You can't atomically swap a framework and stay green, so the planner picks a coexistence strategy per concern — expand→migrate→contract, strangler fig, shim, branch-by-abstraction — and a validator rejects any plan that deletes before it migrates.
- Sandboxed & reversible. All work happens in a
git worktreeon anevergreen/*branch. Snapshot before each step,git reseton failure, your branch untouched until you say so. - Trustworthy green. Zero tests → it refuses to migrate blindly and offers characterization tests. Flaky tests → quarantined so nondeterminism never defines "green." Slow suite → affected tests run first for fast feedback, but the full suite always gates the commit.
- Resumable & auditable. Every commit is a safe resume point; each carries an
Evergreen-Step:trailer; the whole run replays from a JSONL trace + plan.
One env-var interface, never a hard-coded vendor — vLLM, SGLang, Ollama, Z.AI, or any
endpoint speaking POST /v1/chat/completions:
export LLM_BASE_URL=... # required to enable LLM planning/editing/diagnosis
export LLM_API_KEY=... # if your endpoint needs one
export LLM_MODEL=... # e.g. glm-5.1, qwen3-coder, deepseek, …Local-first: the repo, sandbox, test execution, and static analysis never leave your
machine. Self-host the model (or run a recipe-only migration) and nothing leaves at all.
When a hosted endpoint is used, every payload is logged to .git/evergreen/llm.jsonl —
--redact keeps only hashes. Smarter model = better plans and fewer rollbacks; it can
never mean a broken commit.
Trust is the product, so here's what it won't do:
- Your test oracle must survive the migration. Tests coupled to the framework you're
removing (Angular
TestBed, mockednext/router) can't pin behavior across the swap. Use behavior-level tests (HTTP / DOM / E2E); evergreen detects the gap and warns. - Infrastructure migrations are mostly out of scope. "API Gateway → ALB + EC2" has no fast local test oracle for "the ALB routes correctly" — that's verified by deploying to the cloud. evergreen can rehost the application code (handler → server) under the invariant, but the IaC + traffic cutover is human-owned deploy work it can scaffold, not guarantee.
- Dependency-changing migrations that require
npm install/pip installof new packages need the Docker-isolated sandbox (a documented extension point, not yet built); the worktree baseline assumes deps are already present.
27 acceptance tests gate the spec's four build phases — including the headline proof: a real
Flask→FastAPI migration where git bisect run pytest finds no red commit in the produced
range (it pins only a deliberately injected bad commit), and the same loop runs end-to-end on
a real JavaScript repo under node --test.
pip install -e ".[dev]" && pytest # 27 passingevergreen/
cli.py # typer entrypoints: plan / run / resume / status
orchestrator.py # the deterministic loop; enforces the Prime Directive
planner.py # LLM-backed DAG planner, grounded by static analysis
executor.py # codemod-first, LLM-fallback step application
verifier.py # runner adapters → structured pass/fail (sole authority on "green")
diagnostician.py # LLM-backed repair on red
commit.py # CommitManager: refuses to commit anything not full-suite green
sandbox.py # git worktree isolation, snapshot/restore, promote
suite_assessment.py # baseline check, flaky quarantine, test-impact analysis
playbooks.py # per-migration strategy guidance injected into the planner
strategies/ # expand_contract · strangler_fig · shim · branch_by_abstraction
recipes.py # deterministic, LLM-free migrations (e.g. unittest→pytest)
llm.py # OpenAI-compatible client (any vendor)
models.py # MigrationPlan / Step / StepResult (pydantic)
tests/ # acceptance gates incl. the git-bisect proof; tiny real fixture repos
MIT.
The model proposes. Your test suite decides.
Built with Claude Code · runs on open weights