V4: Claude Code agent patterns + FlexAttention + proposal pipeline fixes#10
Conversation
Proposal generation fixes:
- Add _strip_thinking() to handle Qwen's &lt;think&gt;...&lt;/think&gt; tags
- Add _parse_proposal() with 3 fallback strategies (exact, markdown, assignment)
- Restructure prompt with XML sections, few-shot examples, and explicit constraints
- Lower temperature 0.9→0.7 for more reliable structured output
- Add diagnostic logging: every rejection reason is printed with a counts summary

FlexAttention (PyTorch 2.5+):
- Replace SDPA fallback with FlexAttention for sliding window + GQA support
- Create cached BlockMask per window size (short=1024, long=2048)
- Falls back to SDPA only if FlexAttention unavailable
- Result: val_bpb 1.680 (was 1.739 with SDPA), 18% faster tok/sec

Add 16 new tests (64 total): 5 for _strip_thinking, 11 for _parse_proposal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The original train.py selected FA3 repo by compute capability:
- varunneal/flash-attention-3 for Hopper (SM 9.0)
- kernels-community/flash-attn3 for all other GPUs

This was lost during the V4 rewrite. Restored now.

Note: neither repo supports Blackwell (SM 12.0) yet — falls through to FlexAttention. See #1, #4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4f3fbf721b
y = _flex_attention(q, k, v, block_mask=block_mask,
                    enable_gqa=use_gqa)
Add SDPA fallback when FlexAttention execution errors
On non-FA3 GPUs this path is always selected when flex_attention imports, but _flex_attention(...) is called without any runtime fallback. If FlexAttention is present yet unsupported for the current device/dtype/shape (a common partial-support case), forward will raise and abort training instead of using the SDPA fallback that previously kept runs alive. Wrapping this call in a guarded fallback to scaled_dot_product_attention preserves training continuity across environments.
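The guarded fallback the review suggests could look like the sketch below. `flex_attention` and `scaled_dot_product_attention` are real PyTorch 2.5+ APIs; the wrapper name and the broad `except` (chosen so any partial-support failure falls back rather than aborting a run, per the review's rationale) are assumptions, not the PR's actual code.

```python
import torch
import torch.nn.functional as F

try:
    from torch.nn.attention.flex_attention import flex_attention as _flex_attention
except ImportError:  # pre-2.5 PyTorch: FlexAttention not available at all
    _flex_attention = None

def attention(q, k, v, block_mask=None, use_gqa=False):
    """FlexAttention with a guarded runtime fallback to SDPA.

    Catching broadly here is deliberate: the goal is training
    continuity on devices/dtypes/shapes FlexAttention only
    partially supports, not precise error classification.
    """
    if _flex_attention is not None:
        try:
            return _flex_attention(q, k, v, block_mask=block_mask,
                                   enable_gqa=use_gqa)
        except Exception:
            pass  # partial-support case: fall back instead of aborting
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

Note the fallback loses the sliding-window BlockMask semantics (plain causal SDPA), which matches the pre-FlexAttention behaviour the review says "previously kept runs alive".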
…2 dep
- Mark V3 as stable (1.1507 BPB), V4 as WIP in README
- Add attention backend status table (SDPA/FlexAttention/FA3/FA4)
- Document Blackwell FA status (FA3 Hopper-only, FA4 beta crashes)
- Fix .gitignore: exclude *.lock but keep uv.lock tracked
- Add missing jinja2 dependency to pyproject.toml
- Update run commands to show V3, V4, and baseline options

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
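The .gitignore fix described above (exclude `*.lock` but keep `uv.lock` tracked) relies on gitignore's negation syntax; a sketch of the relevant two lines:

```gitignore
# Ignore lock files generally, but re-include uv.lock so it stays tracked.
*.lock
!uv.lock
```

Order matters: the `!uv.lock` re-include must come after the `*.lock` exclusion, and a file already tracked before the pattern was added stays tracked regardless.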
Summary
Results
Issues
Test plan
🤖 Generated with Claude Code