## Use Case

Building a web-based code generation platform using the Copilot SDK (Python, v0.2.0). Users create and iteratively modify single-file projects (500–5,000+ lines of HTML/CSS/JS). A typical session involves 5–15 modification requests like "make the character bigger" or "change the background color" on an existing project.
## Current Approach (Pseudocode)
```text
For each modification request:
  1. Reset session (disconnect + create fresh session)  ← clears history
  2. Build prompt:
       - System message (~1K tokens, same every time)
       - Full current project code (2K–30K tokens)
       - User's modification request (~50 tokens)
  3. Send prompt via session.send()
  4. Parse response for line-based patches (REPLACE_LINES / INSERT_AFTER / DELETE_LINES)
  5. If patches fail → retry with "return full updated code" (~doubles tokens)
  6. Apply patches / extract code → validate → finalize
```

**Optional thinking pre-pass:** For complex requests, a separate model call analyzes the code first (GPT-5.2 with reasoning), then the plan is prepended to the main prompt → effectively 2× input tokens.
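For reference, step 6 currently looks roughly like this. The op names are our own REPLACE_LINES / INSERT_AFTER / DELETE_LINES format, not anything from the SDK, and the dict shape shown is just one possible representation:

```python
# Minimal sketch of our line-based patch application (step 6).
# Line numbers are 1-based; REPLACE_LINES/DELETE_LINES ranges are inclusive.
def apply_patches(source: str, patches: list[dict]) -> str:
    lines = source.splitlines()
    # Apply bottom-up so earlier line numbers stay valid after each edit.
    for p in sorted(patches, key=lambda p: p["start"], reverse=True):
        start = p["start"] - 1  # convert to 0-based index
        if p["op"] == "REPLACE_LINES":
            lines[start : p["end"]] = p["lines"]
        elif p["op"] == "DELETE_LINES":
            del lines[start : p["end"]]
        elif p["op"] == "INSERT_AFTER":
            lines[p["start"] : p["start"]] = p["lines"]
        else:
            raise ValueError(f"unknown op: {p['op']}")
    return "\n".join(lines)
```

When this raises, or when the model's response contains no parsable ops at all, we fall into the step-5 retry.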
## The Problem: Token Waste

| Scenario | Est. Input Tokens | Est. Output Tokens | Notes |
|---|---|---|---|
| Small project modification | ~4K | ~2K | 500-line project |
| Medium project modification | ~12K | ~8K | 2,000-line project |
| Large project modification | ~35K | ~30K | 5,000+ lines |
| + Thinking pre-pass | 2× input | same | Two full model calls |
| + Patch failure retry | 3× total | 2× output | Re-sends full code asking for complete output |
**Key observations:**

- `cache_read_tokens` is consistently 0 in our `session.usage` events, even though we track it
- Every modification resets the session → no context reuse between turns
- For a "change the button color" request on a 3,000-line project, we're sending ~15K tokens of unchanged code
- With 10 modifications per session, that's ~150K+ input tokens for what should be incremental edits
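The ~150K figure above falls out of a simple per-turn sum, and the same arithmetic shows why caching the static prefix matters. The 90% cached-token discount below is purely illustrative (actual pricing varies by provider and is not an SDK guarantee):

```python
# Back-of-envelope input cost for the 3,000-line "button color" scenario.
# cached_fraction: share of each prompt billed at the cached rate.
# cache_discount: 0.9 means cached tokens cost 10% of normal (illustrative only).
def session_input_tokens(turns: int, system: int, code: int, request: int,
                         cached_fraction: float = 0.0,
                         cache_discount: float = 0.9) -> int:
    per_turn = system + code + request
    billed = (per_turn * (1 - cached_fraction)
              + per_turn * cached_fraction * (1 - cache_discount))
    return round(billed * turns)

no_cache = session_input_tokens(10, 1_000, 14_000, 50)  # matches the ~150K estimate
with_cache = session_input_tokens(10, 1_000, 14_000, 50, cached_fraction=0.95)
```

Even a partially cached prefix cuts the session total by most of an order of magnitude, which is why the `cache_read_tokens == 0` observation is the first thing we want to fix.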
## Specific Questions

### 1. Prompt Caching

I noticed from other issues (e.g. the #1005 session logs) that `cacheReadTokens` can be non-zero. Our setup always shows 0.

- Is prompt caching automatic when the system message + prompt prefix stay the same?
- Does resetting the session (disconnect + `create_session`) break cache eligibility?
- Would keeping the session alive across turns (instead of resetting) enable caching of the static system-message portion?
### 2. Session Continuity vs. Reset

We currently reset the session before each modification to avoid accumulating conversation history (since we always embed the full current code in the prompt anyway), but this may be preventing prompt caching.

Trade-off question: is it better to:

- (A) Keep the session alive across turns and accept the accumulated history?
- (B) Reset each time but find a way to enable prompt caching?
- (C) Use some hybrid, e.g. keep the session alive for N turns, then reset?

What's the recommended pattern for repeated modifications to the same large context?
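Option (C) is easy to prototype if session creation and teardown are injected as callables. The sketch below assumes nothing about the SDK beyond the `create_session` / `disconnect` calls our own pseudocode already uses; those would be passed in as `create` and `destroy`:

```python
from typing import Any, Callable

# Hybrid pattern (C): reuse one session for up to `max_turns`
# modifications, then reset. Keeping the session alive between resets
# is what (we hope) keeps the static system-message prefix cacheable.
class HybridSession:
    def __init__(self, create: Callable[[], Any],
                 destroy: Callable[[Any], None], max_turns: int = 5):
        self.create, self.destroy, self.max_turns = create, destroy, max_turns
        self.session: Any = None
        self.turns = 0

    def get(self) -> Any:
        # Reset only when the turn budget is exhausted.
        if self.session is None or self.turns >= self.max_turns:
            if self.session is not None:
                self.destroy(self.session)
            self.session = self.create()
            self.turns = 0
        self.turns += 1
        return self.session
```

The open question is whether the provider's cache survives the history growth in between resets better than it survives a fresh session per turn.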
### 3. Reducing Output Tokens

The model frequently ignores patch/diff instructions and returns the entire file instead of targeted changes, which wastes output tokens proportional to project size.

- Are there SDK-level mechanisms to constrain the output format?
- Has anyone found reliable prompting strategies that consistently produce diffs rather than full rewrites?
- Would `reasoning_effort: "low"` help for simple modifications while keeping output focused?
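Right now we only detect the problem when patch application fails. A cheaper guard is a heuristic check on the response before parsing; the marker names below match our own patch format, and the length-ratio threshold is an arbitrary choice:

```python
# Heuristic: did the model emit our patch directives, or did it ignore
# them and return a full rewrite? Marker names come from our own
# REPLACE_LINES / INSERT_AFTER / DELETE_LINES format; the 0.5 ratio is
# an arbitrary cutoff, not a tuned value.
PATCH_MARKERS = ("REPLACE_LINES", "INSERT_AFTER", "DELETE_LINES")

def looks_like_patch(response: str, original: str, max_ratio: float = 0.5) -> bool:
    has_markers = any(m in response for m in PATCH_MARKERS)
    # A genuine patch for a small edit should be far shorter than the file.
    is_short = len(response) < max_ratio * len(original)
    return has_markers and is_short
```

When this returns False we could salvage the full rewrite directly (it's already paid for) instead of issuing the 2×-output retry from step 5.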
### 4. Thinking Pre-Pass Overhead

For complex requests, we run a separate "thinking" model call (GPT-5.2 with reasoning) to produce a plan, then feed that plan plus the full code to the main model. This doubles the input token cost.

- With `reasoning_effort` available on the main model, is a separate thinking pre-pass still justified?
- Are there patterns for "think-then-act" that avoid sending the full context twice?
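One variant we're considering: give the planning call only a compact structural outline of the file (declarations plus their line numbers), so the full source is sent once, to the main model only. The regexes below are a rough sketch for single-file HTML/CSS/JS projects, not a general parser:

```python
import re

# Rough outline extractor: JS function/class/const declarations and
# HTML elements with ids, each tagged with its line number. The plan
# pass sees this summary instead of the full 2K-30K-token file.
OUTLINE_RE = re.compile(
    r'^\s*(function\s+\w+|class\s+\w+|const\s+\w+\s*='
    r'|<(?:div|canvas|script|style)\b[^>]*id="[^"]+")'
)

def outline(source: str) -> str:
    out = []
    for i, line in enumerate(source.splitlines(), start=1):
        if OUTLINE_RE.match(line):
            out.append(f"L{i}: {line.strip()}")
    return "\n".join(out)
```

A 5,000-line project typically compresses to a few hundred tokens of outline, so the pre-pass cost becomes nearly flat instead of 1× the file.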
### 5. Large File Strategies

For projects over 5,000 lines, we currently force a full rewrite (no patches).

- Are there recommended patterns for chunked/windowed modifications, e.g. sending only the relevant portion of the file plus surrounding context?
- Does the SDK's file handling (edit tool) use any internal optimization we could leverage instead of manual prompt construction?
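The chunked approach we have in mind looks roughly like this: find the lines that mention the user's target, send only a window around them, and keep the offset so patch line numbers returned against the window can be mapped back to the full file. The substring match is deliberately naive (a real version might use an AST or embeddings):

```python
# Windowed modification sketch: extract `context` lines around every
# line mentioning `needle` (e.g. "button"), and return the offset
# needed to remap window-relative patch line numbers to the full file.
def extract_window(source: str, needle: str, context: int = 20) -> tuple[str, int]:
    lines = source.splitlines()
    hits = [i for i, l in enumerate(lines) if needle.lower() in l.lower()]
    if not hits:
        return source, 0  # no match: fall back to the whole file
    start = max(0, min(hits) - context)
    end = min(len(lines), max(hits) + context + 1)
    return "\n".join(lines[start:end]), start  # offset = lines skipped

def remap_line(window_line: int, offset: int) -> int:
    # Patch line numbers from the model are relative to the window.
    return window_line + offset
```

For a "change the button color" request on a 3,000-line file, this turns ~15K input tokens into a few hundred, at the cost of the retrieval step occasionally missing the right region.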
## Environment

- SDK: `github-copilot-sdk==0.2.0` (Python)
- Models: GPT-5.2 (thinking), Claude Sonnet 4 / Claude Opus (main generation)
- Session config: `streaming=True`, `available_tools=["ask_user"]`, session reset per modification
Would love to hear how others in the community handle similar "iterative code modification" workflows with the SDK. Any insights on which of these optimizations yield the biggest token savings would be greatly appreciated! 🙏