* Add Contract Engineering blog post
Interactive figures for contract lifecycle, scenario matrix, and event
log. Updates announcement banner to feature the new post. Adapted from
site-rebuild branch with paths updated to main's directory structure.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* more
* touches
* Update contract engineering copy with final edits
- Add TDD comparison paragraph with Wikipedia link
- Link PRDs to Wikipedia on first mention
- Add Ambiguity limitation, Acknowledgments section
- Fix tense ("nobody had defined"), revert Limitations opener
- Simplify Sharpen the Spec examples back to concise form
- Update links: rubriclab, full Unblocking Agents URL
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Restore fleshed-out contract examples and experimental Limitations tone
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
src/lib/posts/contract-engineering.mdx (+12 −4)
@@ -13,9 +13,9 @@ export const metadata = {
 bannerImageUrl: "/images/excavator.png",
 }

-Our team has been running experiments toward fully unsupervised development, with the goal of being able to write a spec at night, and wake up to working, deployed, production-ready software. Agents can already build and deploy without handing the keyboard back to you, given their own accounts, infrastructure, and verification tooling. We wrote about how we set that up in [Unblocking Agents](/blog/unblocking-agents).
+Our team has been running experiments toward fully unsupervised development, with the goal of being able to write a spec at night, and wake up to working, deployed, production-ready software. Agents can already build and deploy without handing the keyboard back to you, given their own accounts, infrastructure, and verification tooling. We wrote about how we set that up in [Unblocking Agents](https://rubriclabs.com/blog/unblocking-agents).
-But the results haven't always been consistent, because nobody defined what "done" means in a form the agent can be held to.
+But the results haven't always been consistent. Over time, we realized this was because nobody had defined what "done" means in a form the agent can be held to.
We've been approaching this with something we call contracts: hard, versioned, executable claims about what the system must do. A contract is both the specification and the acceptance criteria. The agent builds until every contract passes, and it cannot ship until they do.
@@ -27,14 +27,16 @@ The first is decay. The first prompt is always the best because you're offering
The second is non-determinism. Even with a perfect prompt, agent runs aren't reproducible. You can run the same prompt in a different session and get a completely different result. Some runs feel locked in: the agent makes good decisions and the architecture is clean. Other runs drift. The difference is sampling luck, context ordering, and temperature. You can't reliably reproduce what worked, and you can't explain why something didn't.
-The third is specificity. PRDs were designed for humans under the assumption that engineering is expensive and that efforts take weeks with constant subtle feedback signals during sprints. A PRD says "the user should be able to send an email" and trusts the engineer to figure out what that means across OAuth, API calls, database writes, cache invalidation, and realtime updates. An agent, on the other hand, needs a level of specificity that would be unreasonable to ask of a human engineer, but is exactly right for a process where execution is cheap and restarts are free.
+The third is specificity. [PRDs](https://en.wikipedia.org/wiki/Product_requirements_document) were designed for humans under the assumption that engineering is expensive and that efforts take weeks with constant subtle feedback signals during sprints. A PRD says "the user should be able to send an email" and trusts the engineer to figure out what that means across OAuth, API calls, database writes, cache invalidation, and realtime updates. An agent, on the other hand, needs a level of specificity that would be unreasonable to ask of a human engineer, but is exactly right for a process where execution is cheap and restarts are free.
We need a format that's durable, deterministic, and precise enough that an agent can execute against it with no interim feedback and get it right.
## Contracts
A contract is a hard, versioned, executable definition of what the system must do. It defines a scenario — what the agent should build — and a sequence of events the system must produce, each with a proof requirement. Every claim resolves to something binary, verifiable against the real system. For example, either the database row exists or it doesn't, the webhook arrived or it didn't, etc. It doesn't live in a conversation or decay between runs. The agent doesn't decide what matters while it's building because the contract decides up front.
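The scenario-plus-events shape described above could be sketched roughly like this (a hypothetical illustration, not the post's actual format; the type names and `verify` helper are ours):

```typescript
// Hypothetical sketch of a contract: a versioned scenario plus the
// events the system must produce, each backed by a binary proof check
// against the real system (row exists or it doesn't, webhook arrived
// or it didn't).
type Proof = () => Promise<boolean>; // resolves true only if the claim holds

interface ContractEvent {
  description: string; // e.g. "email row written to the database"
  proof: Proof;        // e.g. query the real database for the row
}

interface Contract {
  version: number;
  scenario: string; // what the agent should build
  events: ContractEvent[];
}

// The agent builds until every proof passes; it cannot ship before then.
async function verify(contract: Contract): Promise<boolean> {
  for (const event of contract.events) {
    if (!(await event.proof())) return false; // binary: passed or not
  }
  return true;
}
```

Because each proof is a check against the live system rather than a description in a conversation, the definition of "done" survives across sessions and restarts.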
+Contracts differ from [test-driven development (TDD)](https://en.wikipedia.org/wiki/Test-driven_development) in that they are durable and survive rewrites. As code changes, tests must be rewritten, whereas contracts are expected to remain untouched except by the designer.
+
This is the specificity that prompting lacks. A PRD says "the user should be able to send an email" and trusts the engineer to fill in the gaps. A contract, on the other hand, spells out every event the system must produce across every layer — UI, API, database, cache, webhooks, realtime — each with proof that it happened.
## What This Looks Like
@@ -116,10 +118,16 @@ We've been running this approach against real applications, and it's already sur
**Qualitative drift.** LLM-scored screenshots are powerful but noisy. The same screenshot can score differently across evaluations. We're working on pinned scoring (caching evaluations so a passing screenshot stays passed until the UI actually changes) but calibrating the boundary between acceptable variance and a real regression is ongoing.
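The pinned-scoring idea — reusing a verdict until the UI actually changes — might be sketched as a cache keyed by a hash of the screenshot bytes (a minimal sketch under our own assumptions; the `scoreScreenshot` name and pin-only-passes policy are hypothetical, not the team's implementation):

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch of pinned scoring: cache the LLM verdict by a
// hash of the screenshot bytes, so a passing screenshot stays passed
// until the rendered UI actually changes.
const pinned = new Map<string, boolean>();

async function scoreScreenshot(
  png: Buffer,
  llmScore: (png: Buffer) => Promise<boolean>, // noisy LLM-based check
): Promise<boolean> {
  const key = createHash("sha256").update(png).digest("hex");
  const cached = pinned.get(key);
  if (cached !== undefined) return cached; // identical pixels: reuse verdict
  const verdict = await llmScore(png);
  if (verdict) pinned.set(key, verdict); // pin passes only; failures re-score
  return verdict;
}
```

Pinning only passing verdicts means a flaky failure gets re-evaluated on the next run, while a pass is never randomly revoked — the calibration question the paragraph above describes is deciding when a pixel-level change is a real regression rather than acceptable variance.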
+**Ambiguity.** Contracts currently follow two principles: being verifiable (or, at least, having a feedback loop to hill-climb), and being unambiguous enough that a junior developer could implement them. Formalizing and streamlining these guidelines into a reusable framework is non-trivial.
+
These are hard problems, but they're also the right problems given the tools at our disposal.
---
At night, you write the contracts. By morning, the agent has built against them, deployed the result, rerun failures, fixed what it could, and left behind proof: what passed, what failed, what changed, and why. This is Contract Engineering.
124
128
125
-We're building in the open at [github.com/rubriclabs](https://github.com/rubriclabs).
+We're building in the open at [github.com/rubriclab](https://github.com/rubriclab).
+
+## Acknowledgments
+
+We'd like to extend a special thank you to Jihad Esmail, Max Musing, and Erik Kaunismäki for their thoughtful feedback.