## Use Case

Building a web-based code generation platform using the Copilot SDK (Python, v0.2.0). Users create and iteratively modify single-file projects (500–5,000+ lines of HTML/CSS/JS). A typical session involves 5–15 modification requests like "make the character bigger" or "change the background color" on an existing project.
## Current Approach (Pseudocode)
```text
For each modification request:
  1. Reset session (disconnect + create fresh session)  ← clears history
  2. Build prompt:
       - System message (~1K tokens, same every time)
       - Full current project code (2K–30K tokens)
       - User's modification request (~50 tokens)
  3. Send prompt via session.send()
  4. Parse response for line-based patches (REPLACE_LINES / INSERT_AFTER / DELETE_LINES)
  5. If patches fail → retry with "return full updated code" (~doubles tokens)
  6. Apply patches / extract code → validate → finalize
```

**Optional thinking pre-pass:** For complex requests, a separate model call analyzes the code first (GPT-5.2 with reasoning), then the plan is prepended to the main prompt → effectively 2× input tokens.
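For reference, step 6 currently looks roughly like this. The op names are our own REPLACE_LINES / INSERT_AFTER / DELETE_LINES format, not anything from the SDK, and the dict shape shown is just one possible representation:

```python
# Minimal sketch of our line-based patch application (step 6).
# Line numbers are 1-based; REPLACE_LINES/DELETE_LINES ranges are inclusive.
def apply_patches(source: str, patches: list[dict]) -> str:
    lines = source.splitlines()
    # Apply bottom-up so earlier line numbers stay valid after each edit.
    for p in sorted(patches, key=lambda p: p["start"], reverse=True):
        start = p["start"] - 1  # convert to 0-based index
        if p["op"] == "REPLACE_LINES":
            lines[start : p["end"]] = p["lines"]
        elif p["op"] == "DELETE_LINES":
            del lines[start : p["end"]]
        elif p["op"] == "INSERT_AFTER":
            lines[p["start"] : p["start"]] = p["lines"]
        else:
            raise ValueError(f"unknown op: {p['op']}")
    return "\n".join(lines)
```

When this raises, or when the model's response contains no parsable ops at all, we fall into the step-5 retry.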
## The Problem: Token Waste

| Scenario | Est. Input Tokens | Est. Output Tokens | Notes |
|---|---|---|---|
| Small project modification | ~4K | ~2K | 500-line project |
| Medium project modification | ~12K | ~8K | 2,000-line project |
| Large project modification | ~35K | ~30K | 5,000+ lines |
| + Thinking pre-pass | 2× input | same | Two full model calls |
| + Patch failure retry | 3× total | 2× output | Re-sends full code asking for complete output |
**Key observations:**

- `cache_read_tokens` is consistently 0 in our `session.usage` events, even though we track it
- Every modification resets the session → no context reuse between turns
- For a "change the button color" request on a 3,000-line project, we're sending ~15K tokens of unchanged code
- With 10 modifications per session, that's ~150K+ input tokens for what should be incremental edits
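The ~150K figure above falls out of a simple per-turn sum, and the same arithmetic shows why caching the static prefix matters. The 90% cached-token discount below is purely illustrative (actual pricing varies by provider and is not an SDK guarantee):

```python
# Back-of-envelope input cost for the 3,000-line "button color" scenario.
# cached_fraction: share of each prompt billed at the cached rate.
# cache_discount: 0.9 means cached tokens cost 10% of normal (illustrative only).
def session_input_tokens(turns: int, system: int, code: int, request: int,
                         cached_fraction: float = 0.0,
                         cache_discount: float = 0.9) -> int:
    per_turn = system + code + request
    billed = (per_turn * (1 - cached_fraction)
              + per_turn * cached_fraction * (1 - cache_discount))
    return round(billed * turns)

no_cache = session_input_tokens(10, 1_000, 14_000, 50)  # matches the ~150K estimate
with_cache = session_input_tokens(10, 1_000, 14_000, 50, cached_fraction=0.95)
```

Even a partially cached prefix cuts the session total by most of an order of magnitude, which is why the `cache_read_tokens == 0` observation is the first thing we want to fix.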
## Specific Questions

### 1. Prompt Caching

I noticed from other issues (e.g. the #1005 session logs) that `cacheReadTokens` can be non-zero. Our setup always shows 0.

- Is prompt caching automatic when the system message + prompt prefix stay the same?
- Does resetting the session (disconnect + `create_session`) break cache eligibility?
- Would keeping the session alive across turns (instead of resetting) enable caching of the static system-message portion?
### 2. Session Continuity vs. Reset

We currently reset the session before each modification to avoid accumulating conversation history (since we always embed the full current code in the prompt anyway), but this may be preventing prompt caching.

Trade-off question: is it better to:

- (A) Keep the session alive across turns and accept the accumulated history?
- (B) Reset each time but find a way to enable prompt caching?
- (C) Use some hybrid, e.g. keep the session alive for N turns, then reset?

What's the recommended pattern for repeated modifications to the same large context?
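Option (C) is easy to prototype if session creation and teardown are injected as callables. The sketch below assumes nothing about the SDK beyond the `create_session` / `disconnect` calls our own pseudocode already uses; those would be passed in as `create` and `destroy`:

```python
from typing import Any, Callable

# Hybrid pattern (C): reuse one session for up to `max_turns`
# modifications, then reset. Keeping the session alive between resets
# is what (we hope) keeps the static system-message prefix cacheable.
class HybridSession:
    def __init__(self, create: Callable[[], Any],
                 destroy: Callable[[Any], None], max_turns: int = 5):
        self.create, self.destroy, self.max_turns = create, destroy, max_turns
        self.session: Any = None
        self.turns = 0

    def get(self) -> Any:
        # Reset only when the turn budget is exhausted.
        if self.session is None or self.turns >= self.max_turns:
            if self.session is not None:
                self.destroy(self.session)
            self.session = self.create()
            self.turns = 0
        self.turns += 1
        return self.session
```

The open question is whether the provider's cache survives the history growth in between resets better than it survives a fresh session per turn.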
### 3. Reducing Output Tokens

The model frequently ignores patch/diff instructions and returns the entire file instead of targeted changes, which wastes output tokens proportional to project size.

- Are there SDK-level mechanisms to constrain the output format?
- Has anyone found reliable prompting strategies that consistently produce diffs rather than full rewrites?
- Would `reasoning_effort: "low"` help for simple modifications while keeping output focused?
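Right now we only detect the problem when patch application fails. A cheaper guard is a heuristic check on the response before parsing; the marker names below match our own patch format, and the length-ratio threshold is an arbitrary choice:

```python
# Heuristic: did the model emit our patch directives, or did it ignore
# them and return a full rewrite? Marker names come from our own
# REPLACE_LINES / INSERT_AFTER / DELETE_LINES format; the 0.5 ratio is
# an arbitrary cutoff, not a tuned value.
PATCH_MARKERS = ("REPLACE_LINES", "INSERT_AFTER", "DELETE_LINES")

def looks_like_patch(response: str, original: str, max_ratio: float = 0.5) -> bool:
    has_markers = any(m in response for m in PATCH_MARKERS)
    # A genuine patch for a small edit should be far shorter than the file.
    is_short = len(response) < max_ratio * len(original)
    return has_markers and is_short
```

When this returns False we could salvage the full rewrite directly (it's already paid for) instead of issuing the 2×-output retry from step 5.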
### 4. Thinking Pre-Pass Overhead

For complex requests, we run a separate "thinking" model call (GPT-5.2 with reasoning) to produce a plan, then feed that plan plus the full code to the main model. This doubles the input token cost.

- With `reasoning_effort` available on the main model, is a separate thinking pre-pass still justified?
- Are there patterns for "think-then-act" that avoid sending the full context twice?
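One variant we're considering: give the planning call only a compact structural outline of the file (declarations plus their line numbers), so the full source is sent once, to the main model only. The regexes below are a rough sketch for single-file HTML/CSS/JS projects, not a general parser:

```python
import re

# Rough outline extractor: JS function/class/const declarations and
# HTML elements with ids, each tagged with its line number. The plan
# pass sees this summary instead of the full 2K-30K-token file.
OUTLINE_RE = re.compile(
    r'^\s*(function\s+\w+|class\s+\w+|const\s+\w+\s*='
    r'|<(?:div|canvas|script|style)\b[^>]*id="[^"]+")'
)

def outline(source: str) -> str:
    out = []
    for i, line in enumerate(source.splitlines(), start=1):
        if OUTLINE_RE.match(line):
            out.append(f"L{i}: {line.strip()}")
    return "\n".join(out)
```

A 5,000-line project typically compresses to a few hundred tokens of outline, so the pre-pass cost becomes nearly flat instead of 1× the file.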
### 5. Large File Strategies

For projects over 5,000 lines, we currently force a full rewrite (no patches).

- Are there recommended patterns for chunked/windowed modifications, e.g. sending only the relevant portion of the file plus surrounding context?
- Does the SDK's file handling (edit tool) use any internal optimization we could leverage instead of manual prompt construction?
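The chunked approach we have in mind looks roughly like this: find the lines that mention the user's target, send only a window around them, and keep the offset so patch line numbers returned against the window can be mapped back to the full file. The substring match is deliberately naive (a real version might use an AST or embeddings):

```python
# Windowed modification sketch: extract `context` lines around every
# line mentioning `needle` (e.g. "button"), and return the offset
# needed to remap window-relative patch line numbers to the full file.
def extract_window(source: str, needle: str, context: int = 20) -> tuple[str, int]:
    lines = source.splitlines()
    hits = [i for i, l in enumerate(lines) if needle.lower() in l.lower()]
    if not hits:
        return source, 0  # no match: fall back to the whole file
    start = max(0, min(hits) - context)
    end = min(len(lines), max(hits) + context + 1)
    return "\n".join(lines[start:end]), start  # offset = lines skipped

def remap_line(window_line: int, offset: int) -> int:
    # Patch line numbers from the model are relative to the window.
    return window_line + offset
```

For a "change the button color" request on a 3,000-line file, this turns ~15K input tokens into a few hundred, at the cost of the retrieval step occasionally missing the right region.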
## Environment

- SDK: `github-copilot-sdk==0.2.0` (Python)
- Models: GPT-5.2 (thinking), Claude Sonnet 4 / Claude Opus (main generation)
- Session config: `streaming=True`, `available_tools=["ask_user"]`, session reset per modification
Would love to hear how others in the community handle similar "iterative code modification" workflows with the SDK. Any insights on which of these optimizations yield the biggest token savings would be greatly appreciated! 🙏