[Fix] Prevent symbolic over-unification in multi-modality torch.split and add comprehensive tests #26
Merged
jiahy0825 merged 1 commit on Apr 26, 2026
…d symbolic unification tests
- Disable triton.autotune_at_compile_time in standalone_compile to avoid CUDA illegal-memory-access with unbacked SymInt dimensions; tuning happens at first runtime invocation instead.
- Add comprehensive tests for symbolic over-unification (Part A-D): single-level and two-level compile, CP4 cache reuse, bad-order compilation, and Inductor cache symbol verification.
- Skip absolute perf thresholds on non-H100 GPUs (parity check only).
🗂️ PR Category
📝 Description
When a modality has 0 tokens during the initial compilation (e.g. a CP rank receives only video tokens), Dynamo unifies the symbolic size variables (total_tokens == video_tokens). Reusing the cached graph with a different modality distribution then fails with AssertionError: expected size X==Y.
Fix: Use a carrier tensor with mark_unbacked dimensions so each modality size becomes an independent unbacked SymInt (u0, u1, u2), preventing symbolic unification. In the two-level compile architecture (@torch.compile outer + @magi_compile inner), tolist() triggers a graph break; the is_compiling() guard ensures mark_unbacked executes in eager without hitting the forbidden callable error.
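The difference between the failing and the fixed behavior can be illustrated with a small pure-Python toy. This is only a sketch and not Dynamo's implementation: the names `compile_sizes`, `total_expr`, and `reuse` are hypothetical, and the assumption that a size of 0 is burned in as a constant at trace time is a simplification chosen to reproduce the symptom described above.

```python
"""Toy model of size-symbol handling; NOT Dynamo's real implementation."""


def compile_sizes(sizes, unbacked=False):
    """Map each modality name to None (symbolic) or an int (burned-in constant).

    Assumption for illustration: a backed size of 0 is specialized to a
    constant at trace time, so the total collapses onto the only non-empty
    modality. With unbacked=True every size stays an independent symbol.
    """
    return {
        name: None if (unbacked or n != 0) else n
        for name, n in sizes.items()
    }


def total_expr(plan):
    """Symbolic total: the sum of the names that are still symbolic."""
    return " + ".join(name for name, const in plan.items() if const is None)


def reuse(plan, sizes):
    """Replay a cached plan; fail like a shape assert if a constant changed."""
    for name, const in plan.items():
        if const is not None and sizes[name] != const:
            raise AssertionError(f"expected size {sizes[name]}=={const}")
    return sum(sizes.values())


video_only = {"text": 0, "image": 0, "video": 8}   # first compile on this rank
mixed = {"text": 3, "image": 1, "video": 4}        # later call, same total

backed = compile_sizes(video_only)                  # over-unified plan
unbacked = compile_sizes(video_only, unbacked=True) # independent symbols
```

In this toy, `total_expr(backed)` is just `video`, which mirrors total_tokens == video_tokens, and `reuse(backed, mixed)` raises an AssertionError. The unbacked plan keeps three independent symbols, playing the role of u0, u1, u2, so `reuse(unbacked, mixed)` succeeds.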
Additional changes:
- Disable triton.autotune_at_compile_time in standalone_compile to avoid CUDA illegal-memory-access with unbacked SymInt dimensions; tuning happens at first runtime invocation instead.
- Skip absolute perf thresholds on non-H100 GPUs (parity check only).
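The perf-threshold gating could look roughly like the sketch below. The helper name `checks_to_run` and the check labels are hypothetical, and matching on the substring "H100" in the device name is an assumption about how the GPU is detected; in a real test the name would typically come from `torch.cuda.get_device_name()`.

```python
def checks_to_run(device_name: str) -> list[str]:
    """Decide which perf checks apply on a given GPU (illustrative only)."""
    checks = ["parity"]  # the parity check runs on every GPU
    if "H100" in device_name:
        # Absolute thresholds are assumed to be calibrated for H100 only.
        checks.append("absolute_threshold")
    return checks


h100_checks = checks_to_run("NVIDIA H100 80GB HBM3")
a100_checks = checks_to_run("NVIDIA A100-SXM4-80GB")
```

Keeping the parity check unconditional means non-H100 machines still exercise the comparison path, while numbers tuned for one GPU model are not enforced elsewhere.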
Tests (test_symbolic_unification.py, 9 cases):