"Context engineering is the delicate art and science of filling the context window with just the right information for the next step." β Andrej Karpathy
LLMs have an attention budget. Every token depletes it.
- O(n²) attention pairs → longer context = thinner, noisier attention
- ChromaDB study: 11/12 models dropped below 50% performance at 32K tokens
- Microsoft study: accuracy fell from 90% → 51% in longer conversations
More context ≠ better outcomes. After a threshold, performance degrades (context rot).
```
Performance (%)
100 |\
 90 | \
 80 |  \
 70 |   \
 60 |    \
 50 |     \__ context rot begins
 40 |        \______
    +------------------→ Context Length (tokens)
       4K       16K      32K+
```
The parallel to human cognition is striking. When humans face information overload, the dorsolateral prefrontal cortex "gives up" and decision quality deteriorates. But humans have anxiety as a warning signal; LLMs have no such mechanism and silently degrade without self-awareness.
Research reveals counterintuitive findings:
- Distractors: Even ONE irrelevant element reduces performance
- Structure Paradox: Logically organized contexts can perform worse than shuffled ones
- Position Effects: Information at start/end is retrieved better than middle
The implication: careful curation beats comprehensive context every time.
Not all memory is equal:
| Type | What it stores | Example | Limitation |
|---|---|---|---|
| Semantic | Facts about things | "Python uses indentation" | Doesn't teach how |
| Episodic | Events that happened | "Build failed at 3pm" | Context-specific, doesn't generalize |
| Procedural | How to do things | "Always check schema before migration" | Transfers across tasks |
RAG gives you semantic memory. Chat history gives you episodic memory. The challenge is building procedural memory: patterns of how to succeed that transfer to new situations.
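A minimal sketch of what a procedural memory store could look like; the file name and helpers here are illustrative, not any library's API:

```python
import json
from pathlib import Path

STORE = Path("procedural_memory.json")  # hypothetical persistent store

def load_rules() -> list[str]:
    return json.loads(STORE.read_text()) if STORE.exists() else []

def add_rule(rule: str) -> None:
    """Persist a lesson, e.g. 'Always check schema before migration'."""
    rules = load_rules()
    if rule not in rules:  # dedupe to keep the store small
        rules.append(rule)
        STORE.write_text(json.dumps(rules, indent=2))

def inject(system_prompt: str) -> str:
    """Prepend learned rules so they transfer to the next task."""
    rules = load_rules()
    if not rules:
        return system_prompt
    return system_prompt + "\n\nLearned rules:\n" + "\n".join(f"- {r}" for r in rules)
```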
What most teams start with:
- `CLAUDE.md`/`CURSOR_RULES` files with project rules
- Examples folders
- Manual PRPs (Product Requirements Prompts)
Trade-offs:
- ✅ Simple to implement
- ✅ Predictable behavior
- ❌ Goes stale fast
- ❌ Manual maintenance overhead
- ❌ Token bloat (loads everything every time)
For tasks that exceed the context window:
Compaction
- Summarize history → restart with a high-fidelity summary
- Keep architectural decisions, discard redundant tool outputs
- Best for: conversational tasks with extensive back-and-forth
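A rough compaction sketch, assuming an `llm` callable (any text-in/text-out model wrapper) and a crude ~4-chars-per-token estimate; the budget and keep-last-5 policy are arbitrary:

```python
def estimate_tokens(messages: list[dict]) -> int:
    return sum(len(m["content"]) // 4 for m in messages)  # ~4 chars per token

def compact(messages: list[dict], llm, budget: int = 8_000) -> list[dict]:
    if estimate_tokens(messages) <= budget:
        return messages                          # still within budget
    head, recent = messages[:-5], messages[-5:]  # keep the last turns verbatim
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in head)
    summary = llm(
        "Summarize this conversation for a fresh session. Preserve architectural "
        "decisions and open questions; drop redundant tool outputs:\n" + transcript
    )
    return [{"role": "system", "content": f"Summary of earlier work:\n{summary}"}, *recent]
```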
Structured Note-Taking
- Agent writes persistent notes outside the context window (e.g., `NOTES.md`)
- Pull notes back into context as needed
- Best for: iterative development with clear milestones
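A minimal note-taking sketch; the section format and keyword-based `recall` heuristic are assumptions, not a prescribed schema:

```python
from pathlib import Path

NOTES = Path("NOTES.md")  # the persistent scratchpad outside the context window

def append_note(milestone: str, detail: str) -> None:
    """Write progress to disk instead of carrying it in context."""
    with NOTES.open("a") as f:
        f.write(f"## {milestone}\n{detail}\n\n")

def recall(keyword: str) -> str:
    """Pull only the relevant sections back into context."""
    if not NOTES.exists():
        return ""
    sections = NOTES.read_text().split("## ")
    return "\n".join("## " + s for s in sections if keyword.lower() in s.lower())
```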
Sub-Agent Architectures
- Coordinator plans; specialized sub-agents do deep dives
- Return condensed summaries (≈1-2k tokens)
- Best for: complex research where parallel exploration pays off
```mermaid
flowchart LR
    user(User Request) --> coord[Coordinator Agent]
    coord -->|Task decomposition| res[Research Agent]
    coord -->|Code deep dive| dev[Builder Agent]
    coord -->|Data gathering| data[Retriever Agent]
    res --> summaries[Summaries & Evidence]
    dev --> summaries
    data --> summaries
    summaries --> coord
    coord -->|1-2k token brief| user
```
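In code, the pattern might look like this sketch, where `run_agent(role, task)` is an assumed helper that runs a sub-agent in its own fresh context and returns a condensed summary:

```python
import concurrent.futures as cf

def coordinate(request: str, run_agent) -> str:
    subtasks = {
        "research":  f"Survey prior art for: {request}",
        "builder":   f"Inspect the relevant code paths for: {request}",
        "retriever": f"Gather supporting data for: {request}",
    }
    # Sub-agents explore in parallel, each in its own fresh context window.
    with cf.ThreadPoolExecutor() as pool:
        futures = {role: pool.submit(run_agent, role, task)
                   for role, task in subtasks.items()}
        summaries = {role: f.result() for role, f in futures.items()}
    # Only the condensed briefs re-enter the coordinator's context.
    brief = "\n\n".join(f"[{role}] {s}" for role, s in summaries.items())
    return run_agent("coordinator", f"Synthesize an answer from:\n{brief}")
```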
Systems where context evolves through execution:
- Reflect on what worked/failed
- Curate strategies into persistent memory
- Inject learned patterns on future runs
This addresses the maintenance problem of static context: the system learns instead of requiring manual updates.
The Stanford ACE framework formalizes this as a feedback loop between execution and curation. Our open-source implementation of the framework (agentic-context-engine) has shown promising results: 30% → 100% success rate on browser automation with 82% fewer steps.
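The shape of that loop, roughly (this is not the agentic-context-engine API; `execute` and `llm` are assumed callables, and a plain list stands in for the curated playbook):

```python
playbook: list[str] = []  # strategies curated across runs

def run_with_learning(task: str, execute, llm) -> str:
    context = "Strategies that worked before:\n" + "\n".join(playbook)
    result = execute(task, context)        # inject learned patterns
    reflection = llm(                      # reflect on what worked/failed
        f"Task: {task}\nResult: {result}\n"
        "State one reusable strategy that helped or one mistake to avoid."
    )
    if reflection not in playbook:         # curate into persistent memory
        playbook.append(reflection)
    return result
```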
Good context engineering = finding the minimum tokens that maximize the desired outcome.
Techniques:
- Compression formats (reduce token overhead)
- Citation-based tracking (reference, don't repeat)
- Active pruning (remove what doesn't help)
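Citation-based tracking, for example, can be as simple as a registry that puts short references in context instead of full documents; this sketch and its names are illustrative:

```python
registry: dict[str, str] = {}  # full documents live here, outside the prompt

def cite(doc_id: str, text: str) -> str:
    """Register a source; only the token-cheap reference goes in context."""
    registry[doc_id] = text
    return f"[{doc_id}]"

def expand(answer_draft: str) -> dict[str, str]:
    """Resolve only the citations the model actually used."""
    return {doc_id: registry[doc_id]
            for doc_id in registry if f"[{doc_id}]" in answer_draft}
```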
Don't preload everything. Fetch what's needed during execution.
- Keep lightweight references (file paths, queries)
- Load data at runtime using tools
- Mirrors human cognition: we don't memorize databases, we know how to look things up
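A just-in-time retrieval sketch: the prompt carries only paths, and a tool reads content on demand (the `*.py` glob and 4,000-char cap are arbitrary assumptions):

```python
from pathlib import Path

def list_references(root: str) -> list[str]:
    """Lightweight references for the prompt: paths, not contents."""
    return [str(p) for p in Path(root).rglob("*.py")]

def read_file(path: str, max_chars: int = 4_000) -> str:
    """Tool the agent calls at runtime to pull in only what it needs."""
    return Path(path).read_text()[:max_chars]
```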
System prompts should be clear but not over-specified:
- Too specific → fragility, high maintenance
- Too vague → bad output, false assumptions
Find the level of abstraction that guides without constraining.
Fewer, well-scoped tools beat many overlapping ones. If a human can't pick the right tool from your set, the model won't either.