
[Fix] Prevent symbolic over-unification in multi-modality torch.split and add comprehensive tests #26

Merged
jiahy0825 merged 1 commit into SandAI-org:main from cennn:fix/unbacked-symint-symbolic-unification on Apr 26, 2026


@cennn cennn commented Apr 26, 2026

🗂️ PR Category

  • ✨ New Feature
  • 🚀 Optimization (performance, memory, etc.)
  • 💥 Breaking Change
  • 🐛 Bug Fix
  • 🛠️ Development / Refactoring
  • 📚 Documentation
  • 🧹 Chore (Dependencies, CI/CD, Configuration, etc.)
  • 🧪 Testing

📝 Description

When a modality has 0 tokens during the initial compilation (e.g. a CP rank receives only video tokens), Dynamo unifies the symbolic size variables (total_tokens == video_tokens). Reusing the cached graph with a different modality distribution then fails with AssertionError: expected size X==Y.
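To make the failure mode concrete, here is a toy model in plain Python. It does not use Dynamo at all: the compile_guard helper is hypothetical and only mimics a tracer that records an equality it happened to observe between sizes on the first call, then asserts that equality on every later call.

```python
def compile_guard(sizes):
    """Toy stand-in for specialization: remember which size equalled the total."""
    total = sum(sizes)
    # If one modality accounts for every token, a tracer may treat
    # total_tokens and that modality's size as the same symbol.
    unified = [i for i, s in enumerate(sizes) if s == total]

    def check(new_sizes):
        new_total = sum(new_sizes)
        for i in unified:
            assert new_sizes[i] == new_total, (
                f"expected size {new_sizes[i]}=={new_total}"
            )
        return True

    return check


# First call on a rank that only sees video tokens: (video, audio, text).
check = compile_guard((128, 0, 0))
assert check((256, 0, 0))  # still video-only: reuse succeeds

try:
    check((100, 20, 8))  # mixed modalities: the recorded equality no longer holds
    reused = True
except AssertionError as exc:
    reused = False
    message = str(exc)
```

The point of the sketch is only that an equality observed by accident at compile time becomes a constraint on every later input; the real guard machinery is more involved.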

Fix: Use a carrier tensor whose dimensions are marked with mark_unbacked, so each modality size becomes an independent unbacked SymInt (u0, u1, u2) and cannot be unified with the others. In the two-level compile architecture (outer @torch.compile, inner @magi_compile), tolist() triggers a graph break; the is_compiling() guard ensures mark_unbacked runs only in eager mode, where it does not hit the forbidden-callable error.

Additional changes:

  • Disable triton.autotune_at_compile_time in standalone_compile to avoid CUDA illegal-memory-access with unbacked SymInt dimensions; tuning happens at first runtime invocation instead.
  • Skip absolute perf thresholds on non-H100 GPUs (parity check only).

Tests (test_symbolic_unification.py, 9 cases):

  • Part A: Reproduce symbolic over-unification bug (single-level compile)
  • Part B: Verify carrier tensor + mark_unbacked fix
  • Part C: CP4-like cache reuse across rank distributions
  • Part D: Two-level compile — good/bad order, Inductor cache symbol verification (u0,u1,u2 independence), is_compiling() guard necessity

@cennn cennn force-pushed the fix/unbacked-symint-symbolic-unification branch from 9e531fb to 3dc7fd7 Compare April 26, 2026 10:07
…d symbolic unification tests

- Disable triton.autotune_at_compile_time in standalone_compile to avoid
  CUDA illegal-memory-access with unbacked SymInt dimensions; tuning
  happens at first runtime invocation instead.
- Add comprehensive tests for symbolic over-unification (Part A-D):
  single-level and two-level compile, CP4 cache reuse, bad-order
  compilation, and Inductor cache symbol verification.
- Skip absolute perf thresholds on non-H100 GPUs (parity check only).
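The perf-threshold gating can be sketched as below. All names, the tolerance, and the threshold values are hypothetical, and the sketch assumes that "parity" means comparing the compiled timing against an eager baseline. In a real test the device name would typically come from torch.cuda.get_device_name().

```python
def should_check_absolute_perf(device_name: str) -> bool:
    """Return True only for H100 devices, where absolute thresholds apply."""
    return "H100" in device_name.upper()


def check_perf(compiled_ms, eager_ms, device_name, max_ms, parity_tol=1.10):
    """Always check parity with eager; check the absolute bound only on H100."""
    # Parity: compiled may be at most parity_tol times slower than eager.
    assert compiled_ms <= eager_ms * parity_tol, "compiled slower than eager"
    if should_check_absolute_perf(device_name):
        assert compiled_ms <= max_ms, "absolute threshold exceeded"
    return True
```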
@cennn cennn force-pushed the fix/unbacked-symint-symbolic-unification branch from 3dc7fd7 to 676a246 Compare April 26, 2026 10:12

@jiahy0825 jiahy0825 left a comment


LGTM

@jiahy0825 jiahy0825 merged commit 6df0b5f into SandAI-org:main Apr 26, 2026
1 check passed