Commit d42d299

DexterStorey and claude authored
contract engineering blog post touches
* Add Contract Engineering blog post

  Interactive figures for contract lifecycle, scenario matrix, and event log. Updates announcement banner to feature the new post. Adapted from site-rebuild branch with paths updated to main's directory structure.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* more

* touches

* Update contract engineering copy with final edits

  - Add TDD comparison paragraph with Wikipedia link
  - Link PRDs to Wikipedia on first mention
  - Add Ambiguity limitation, Acknowledgments section
  - Fix tense ("nobody had defined"), revert Limitations opener
  - Simplify Sharpen the Spec examples back to concise form
  - Update links: rubriclab, full Unblocking Agents URL

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore fleshed-out contract examples and experimental Limitations tone

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent a1f5343 commit d42d299

File tree

1 file changed: +12 -4 lines


src/lib/posts/contract-engineering.mdx

Lines changed: 12 additions & 4 deletions
@@ -13,9 +13,9 @@ export const metadata = {
   bannerImageUrl: "/images/excavator.png",
 }
 
-Our team has been running experiments toward fully unsupervised development, with the goal of being able to write a spec at night, and wake up to working, deployed, production-ready software. Agents can already build and deploy without handing the keyboard back to you, given their own accounts, infrastructure, and verification tooling. We wrote about how we set that up in [Unblocking Agents](/blog/unblocking-agents).
+Our team has been running experiments toward fully unsupervised development, with the goal of being able to write a spec at night, and wake up to working, deployed, production-ready software. Agents can already build and deploy without handing the keyboard back to you, given their own accounts, infrastructure, and verification tooling. We wrote about how we set that up in [Unblocking Agents](https://rubriclabs.com/blog/unblocking-agents).
 
-But the results haven't always been consistent, because nobody defined what "done" means in a form the agent can be held to.
+But the results haven't always been consistent. Over time, we realized this was because nobody had defined what "done" means in a form the agent can be held to.
 
 We've been approaching this with something we call contracts: hard, versioned, executable claims about what the system must do. A contract is both the specification and the acceptance criteria. The agent builds until every contract passes, and it cannot ship until they do.

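The loop the paragraph above describes, build until every contract passes and gate shipping on all of them, can be sketched as follows. Everything here is hypothetical: the `ContractCheck` type, `buildUntilGreen`, and the repair callback are our illustration, not Rubric's actual tooling.

```typescript
// Hypothetical sketch of the contract loop. Names and shapes are
// illustrative, not an actual API.
type ContractCheck = { id: string; verify: () => Promise<boolean> };

async function buildUntilGreen(
  contracts: ContractCheck[],
  build: (failing: string[]) => Promise<void>,
  maxAttempts = 5,
): Promise<{ shipped: boolean; attempts: number }> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Verify every contract against the real system.
    const results = await Promise.all(
      contracts.map(async (c) => ({ id: c.id, passed: await c.verify() })),
    );
    const failing = results.filter((r) => !r.passed).map((r) => r.id);
    // All green: the work is done and may ship.
    if (failing.length === 0) return { shipped: true, attempts: attempt };
    // Otherwise, hand the failing contract ids back to the agent to repair.
    await build(failing);
  }
  return { shipped: false, attempts: maxAttempts }; // cannot ship
}
```

The key property is that "done" is decided by the verifiers, not by the agent's own judgment mid-build.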
@@ -27,14 +27,16 @@ The first is decay. The first prompt is always the best because you're offering
 
 The second is non-determinism. Even with a perfect prompt, agent runs aren't reproducible. You can run the same prompt in a different session and get a completely different result. Some runs feel locked in: the agent makes good decisions and the architecture is clean. Other runs drift. The difference is sampling luck, context ordering, and temperature. You can't reliably reproduce what worked, and you can't explain why something didn't.
 
-The third is specificity. PRDs were designed for humans under the assumption that engineering is expensive and that efforts take weeks with constant subtle feedback signals during sprints. A PRD says "the user should be able to send an email" and trusts the engineer to figure out what that means across OAuth, API calls, database writes, cache invalidation, and realtime updates. An agent, on the other hand, needs a level of specificity that would be unreasonable to ask of a human engineer, but is exactly right for a process where execution is cheap and restarts are free.
+The third is specificity. [PRDs](https://en.wikipedia.org/wiki/Product_requirements_document) were designed for humans under the assumption that engineering is expensive and that efforts take weeks with constant subtle feedback signals during sprints. A PRD says "the user should be able to send an email" and trusts the engineer to figure out what that means across OAuth, API calls, database writes, cache invalidation, and realtime updates. An agent, on the other hand, needs a level of specificity that would be unreasonable to ask of a human engineer, but is exactly right for a process where execution is cheap and restarts are free.
 
 We need a format that's durable, deterministic, and precise enough that an agent can execute against it with no interim feedback and get it right.
 
 ## Contracts
 
 A contract is a hard, versioned, executable definition of what the system must do. It defines a scenario — what the agent should build — and a sequence of events the system must produce, each with a proof requirement. Every claim resolves to something binary, verifiable against the real system. For example, either the database row exists or it doesn't, the webhook arrived or it didn't, etc. It doesn't live in a conversation or decay between runs. The agent doesn't decide what matters while it's building because the contract decides up front.
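To make that shape concrete, here is one hypothetical way such a contract could be encoded. The `Proof` variants, field names, and the `sendEmail` example are our assumptions for illustration, not the post's actual format.

```typescript
// Hypothetical contract encoding: a scenario plus the events the system
// must produce, each with a proof that resolves to a binary verdict.
type Proof =
  | { kind: "db_row"; table: string; where: Record<string, string> }
  | { kind: "webhook"; endpoint: string; event: string }
  | { kind: "http"; method: string; path: string; status: number };

interface EventClaim {
  name: string;
  proof: Proof; // checked against the real system: it happened or it didn't
}

interface Contract {
  id: string;
  version: number; // contracts are versioned, not re-prompted
  scenario: string; // what the agent should build
  events: EventClaim[]; // the sequence the system must produce
}

// Illustrative example only.
const sendEmail: Contract = {
  id: "email.send",
  version: 1,
  scenario: "A signed-in user composes and sends an email",
  events: [
    { name: "send accepted", proof: { kind: "http", method: "POST", path: "/api/emails", status: 202 } },
    { name: "message persisted", proof: { kind: "db_row", table: "emails", where: { status: "sent" } } },
    { name: "delivery confirmed", proof: { kind: "webhook", endpoint: "/webhooks/provider", event: "delivered" } },
  ],
};
```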

+Contracts differ from [test-driven development (TDD)](https://en.wikipedia.org/wiki/Test-driven_development) in that they are durable and survive rewrites. As code changes, tests must be rewritten, whereas contracts are expected to remain untouched except by the designer.
+
 This is the specificity that prompting lacks. A PRD says "the user should be able to send an email" and trusts the engineer to fill in the gaps. A contract, on the other hand, spells out every event the system must produce across every layer — UI, API, database, cache, webhooks, realtime — each with proof that it happened.
 
 ## What This Looks Like
@@ -116,10 +118,16 @@ We've been running this approach against real applications, and it's already sur
 
 **Qualitative drift.** LLM-scored screenshots are powerful but noisy. The same screenshot can score differently across evaluations. We're working on pinned scoring (caching evaluations so a passing screenshot stays passed until the UI actually changes), but calibrating the boundary between acceptable variance and a real regression is ongoing.
 
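One way the pinned-scoring idea could work is to key the cached verdict on a hash of the screenshot bytes, so the LLM judge is only re-consulted when the UI actually changes. The `pinnedScorer` helper and `Judge` type below are hypothetical, not the actual implementation.

```typescript
import { createHash } from "node:crypto";

// Hypothetical pinned-scoring cache: the LLM judge runs only when the
// screenshot's content hash changes, so a verdict can't drift between
// runs of an unchanged UI.
type Judge = (screenshot: Buffer) => Promise<boolean>;

function pinnedScorer(judge: Judge) {
  const pins = new Map<string, boolean>(); // content hash -> pinned verdict
  return async (screenshot: Buffer): Promise<boolean> => {
    const key = createHash("sha256").update(screenshot).digest("hex");
    const pinned = pins.get(key);
    if (pinned !== undefined) return pinned; // unchanged UI: reuse verdict
    const verdict = await judge(screenshot); // changed UI: re-evaluate
    pins.set(key, verdict);
    return verdict;
  };
}
```

Calibrating when to invalidate a pin (a real regression versus acceptable variance) is the open problem the post describes; the cache only removes noise for pixel-identical screenshots.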
+**Ambiguity.** Contracts currently follow two principles: being verifiable (or, at least, having a feedback loop to hill-climb), and being unambiguous enough that a junior developer could implement them. Formalizing and streamlining these guidelines into a reusable framework is non-trivial, and something we still need to build.
+
 
 These are hard problems, but they're also the right problems given the tools at our disposal.
 
 ---
 
 At night, you write the contracts. By morning, the agent has built against them, deployed the result, rerun failures, fixed what it could, and left behind proof: what passed, what failed, what changed, and why. This is Contract Engineering.
 
-We're building in the open at [github.com/rubriclabs](https://github.com/rubriclabs).
+We're building in the open at [github.com/rubriclab](https://github.com/rubriclab).
+
+## Acknowledgments
+
+We'd like to extend a special thank you to Jihad Esmail, Max Musing, and Erik Kaunismäki for their thoughtful feedback.
