[Sync] Merge mainstream TileLang TVM-FFI features into TileScale #47

uv-xiao · 2026-01-15T07:39:37Z

Summary

This PR synchronizes TileScale with mainstream TileLang, integrating 577 commits of new features while preserving all TileScale distributed computing capabilities.

Key highlights:

- TVM-FFI API modernization
- All distributed TileOperators preserved and updated to new API

Related Issue: #46

Background

TileScale diverged from mainstream TileLang on July 21, 2025 (commit 8205791d). Since then:

- Mainstream accumulated 577 commits with significant API changes
- TileScale accumulated 135 commits with distributed features

This merge integrates both, ensuring no functionality is lost.

Documentation

Comprehensive documentation added in docs/sync_with_tilelang/:

| Document | Description |
| --- | --- |
| `BUILD_AND_RUN.md` | Build instructions and test commands |
| `MERGE_RATIONALE.md` | Why this merge is needed |
| `MERGE_ANALYSIS.md` | Detailed feature status |
| `compare_to_tilang/` | Detailed study of TileScale contributions |

Breaking Changes

- C++ TileOperators now use TVM-FFI macros (already updated)
- Some internal APIs changed (documented in merge analysis)

Commits

- 1cd95ce - Main merge commit
- 5e368a8 - Documentation updates

…-ai#1247)

* [Refactor] Update buffer handling in copy and atomic operations
  * Refactored the `copy` and `atomic_add` functions to use element-wise minimum for defining copy extents, ensuring correct handling of overlapping regions.
  * Updated utility functions to create `BufferLoad` instances with explicit extents, improving memory management and clarity.
  * Removed unused imports from `atomic.py` and `copy.py` to streamline the codebase.
  * Adjusted logging in `copy.cc` to provide clearer warnings for fallback scenarios in bulk copy operations.
* Remove obsolete .git_commit.txt file
* Add unit test for dynamic copy extent handling in TileLang
  * Introduced a new test file `test_tilelang_issue_1237.py` to verify that the `T.copy` function correctly manages dynamic extents during primitive function building.
  * The test reproduces a specific issue related to dynamic slice lengths and static buffer sizes, ensuring robustness in the handling of such scenarios.
  * The test does not require execution of the kernel, as building the primitive function is sufficient to validate the fix.
* lint fix
* fix
* Revert "fix" This reverts commit 828b4c1.
* Update TVM submodule and refactor atomic and copy functions
  * Updated the TVM submodule to a dirty state.
  * Refactored `atomic_add` and `copy` functions to pass extents explicitly to the `_to_region` helper, improving clarity and correctness in handling buffer regions.
  * Commented out the main execution call in the test example for `cast` and added a new function call to better demonstrate the example usage.
* Enhance extent handling in atomic and copy functions
  * Introduced `legalize_pairwise_extents` utility to align and broadcast extent lists for `atomic_add` and `copy` functions, ensuring compatibility and correctness in buffer operations.
  * Updated both functions to utilize the new utility, improving clarity and robustness in handling dynamic and static extents.
  * Added comments to clarify the extent handling logic.
* Enhance `legalize_pairwise_extents` function with early-exit rule
  * Added an early-exit condition to the `legalize_pairwise_extents` function to return original extents if the number of non-1 dimensions in both source and destination extents is equal, improving performance by avoiding unnecessary adjustments.
  * Updated the function's documentation to clarify the new behavior and maintain clarity in the extent handling logic.
* lint fix

… with `-inf` instead of clearing accumulators. (tile-ai#1222) * Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators. * lint * pre-commit * Update imports in flash attention test file to use new backward and forward examples for better clarity and consistency.
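To illustrate why filling out-of-bounds positions with `-inf` is a safe way to exclude them, here is a minimal plain-Python sketch. It is not the kernel code from the commit; `masked_softmax` and `valid_len` are hypothetical names chosen for this example. A `-inf` score maps to a softmax weight of exactly zero, so it drops out of both the numerator and the normalizer.

```python
import math


def masked_softmax(scores, valid_len):
    """Softmax over `scores`, treating positions >= valid_len as out of bounds.

    Hypothetical helper for illustration only. Out-of-bounds scores are
    replaced with -inf, so exp() maps them to 0.0.
    """
    masked = [s if i < valid_len else -math.inf for i, s in enumerate(scores)]
    row_max = max(masked)  # finite as long as valid_len >= 1
    exps = [math.exp(s - row_max) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# The last two positions are out of bounds and receive zero weight.
weights = masked_softmax([1.0, 2.0, 0.0, 0.0], valid_len=2)
```

Had the out-of-bounds scores been left at 0.0 instead, they would each still contribute `exp(0.0 - row_max)` to the normalizer and skew the valid weights.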

* [fix] NVRTC execution backend * [fmt] run pre-commit * [fix] coderabbit reviews * [test] add cuda-python to test dep * [fix] coderabbit reviews * [fix] CUDA 13 compatibility * [fix] sm90 * [fix] CUDA 13 compatibility * [fix] pre-commit * [fix] always use cuda::std::__atomic_ref_impl * [fix] restore to external API * Revert "[fix] restore to external API" This reverts commit 49bd875. * [fmt] use space instead tabs for py codegen * [fix] im2col API * [fix] revert atomic.h * [fix] dynamic shape * [refactor] extract common utils * [feat] support L2 persistent map * [fix] l2 persistent map * [fix] pre-commit * [fix] restore _TYPE_MAP * [fix] pre-commit * [fix] avoid duplicate TMA descs * [docs] add docstring * [fix] coderabbit * [fix] coderabbit * [fix] coderabbit * [fix] coderabbit

…e-ai#1260) * fix nsa bwd and atomic * [Lint] * [BugFix] - New implementation for atomicMax and atomicMin using atomicCAS - PTX version atomicAdd for single 16-byte data - Modify the test cases * [Lint] --------- Co-authored-by: tzj-fxz <tzjfxz@gmail.com>

* add typing stub for tir.ir * remove idents * minor update * [Refactor] add numpy conversion for dtype * fix lint error * remove unused np.float_ in dtype conversion * fix type in np.int_ * fix typo * minor fix * remove debug files

…n for better numerical stablity (tile-ai#1269) * Implement max score retention across blocks in FlashAttention for improved stability * fix manual pipeline parameters * Update examples/flash_attention/example_gqa_fwd_varlen.py Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * fix typo * more * fix a previous typo --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
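The idea of retaining the maximum score across blocks can be sketched in plain Python as follows. This is an illustrative sketch of the usual online-softmax recurrence, not the code from the FlashAttention examples; `online_softmax_sum` is a hypothetical name, and the real kernels also rescale an output accumulator, which is omitted here.

```python
import math


def online_softmax_sum(blocks):
    """Return (running_max, normalizer) over a sequence of score blocks.

    The running max never decreases: each block's max is combined with the
    max seen so far, and the previously accumulated sum is rescaled to the
    new reference max before the block's terms are added.
    """
    running_max = -math.inf
    normalizer = 0.0
    for block in blocks:
        new_max = max(running_max, max(block))
        # Nothing has been accumulated yet on the first block, so the
        # rescale factor is irrelevant; use 0.0 to avoid (-inf) - (-inf).
        if running_max == -math.inf:
            scale = 0.0
        else:
            scale = math.exp(running_max - new_max)
        normalizer = normalizer * scale + sum(math.exp(s - new_max) for s in block)
        running_max = new_max
    return running_max, normalizer
```

Because every exponent is taken relative to the largest score seen so far, no term can overflow, and the result matches a single-pass softmax normalizer over all scores.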

* fix argument order for fla chunk_gated_delta_rule_fwd_h * explicit import assert_similar from utils * rename utils module to avoid name clash * set store_final_state and save_new_value to True * fix --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Refactor] Update FFI type handling and simplify argument management
  * Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity.
  * Updated function registration in `runtime.cc` to utilize canonical names for better consistency.
  * Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled.
  * Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection.
  * Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity.
* [Update] Sync TVM submodule and enhance kernel source handling
  * Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes.
  * Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging.
  * Commented out the main execution call in test files to prevent unintended execution during testing.
  * Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues.
  * Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends.
* [Refactor] Clean up imports and improve code formatting
  * Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code.
  * Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency.
  * Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality.
* Update execution backend options and improve resolution logic
  - Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target.
  - Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions.
  - Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target.
  - Updated documentation to reflect changes in execution backend options and their defaults.
* lint fix
* fix
* Enhance argument handling in CUDA and HIP runtime modules
  - Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime.
  - Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers.
  - Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks.
* lint fix
* lint fix
* lint fix
* lint fix
* minor fix
* fix
* recover check
* Refactor argument binding and validation in `arg_binder.cc`
  - Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers.
  - Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards.
  - Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling.
  - Minor adjustments in test files to streamline kernel execution and improve readability.
* lint fix
* stride fix
* minor fix
* fix
* lint fix
* lint fix
* Add CUDA stream access policy window helpers and integrate with L2 persistent cache management
  - Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage.
  - Updated runtime files to include new FFI packed functions for managing stream attributes.
  - Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown.
  - Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source.
* check with symbolic
* support null ptr
* Update CMakeLists and lower.py for code generation and subproject status
  - Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support.
  - Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility.
  - Marked the TVM subproject as dirty to indicate local modifications.
* lint fix
* Update comments for clarity in quickstart.py

* add typing stub for tir.ir * remove idents * minor update * [Refactor] add numpy conversion for dtype * fix lint error * remove unused np.float_ in dtype conversion * fix type in np.int_ * fix typo * minor fix * remove debug files * fix memory leak bug * fix lint error * add comments * fix lint error * remove duplicated, because tilelang doesn't dependent deprecated

…nfiguration (tile-ai#1283)

- Updated the `tilelang_callback_cuda_compile` function to accept a `pass_config` parameter, allowing for more flexible compilation options.
- Introduced handling for fast math and PTXAS options based on the provided pass configuration.
- Modified the CUDA build process in `rt_mod_cuda.cc` to utilize the current pass context, improving the integration of compilation settings.
- Refactored NVCC command construction to use a dedicated function for better clarity and maintainability.
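A rough sketch of what a dedicated NVCC command builder driven by a pass configuration might look like is given below. Everything here is an assumption made for illustration: the function name `build_nvcc_command`, its signature, and the config keys `tl.enable_fast_math` and `tl.enable_ptxas_verbose_output` are not taken from the commit. Only the `nvcc` flags themselves (`--cubin`, `-O3`, `-arch`, `--use_fast_math`, `--ptxas-options=-v`, `-o`) are standard.

```python
def build_nvcc_command(source, output, arch, pass_config=None):
    """Assemble an nvcc argument list from a pass-config dictionary.

    Hypothetical helper: the config key names below are assumptions,
    not the project's confirmed option names.
    """
    pass_config = pass_config or {}
    cmd = ["nvcc", "--cubin", "-O3", f"-arch={arch}"]
    # Fast math is opt-in, since it trades accuracy for speed.
    if pass_config.get("tl.enable_fast_math", False):
        cmd.append("--use_fast_math")
    # Ask ptxas to report register and shared-memory usage.
    if pass_config.get("tl.enable_ptxas_verbose_output", False):
        cmd.append("--ptxas-options=-v")
    cmd += ["-o", output, source]
    return cmd


cmd = build_nvcc_command(
    "kernel.cu", "kernel.cubin", "sm_90", {"tl.enable_fast_math": True}
)
```

Keeping the command construction in one function makes it straightforward to unit-test the flag handling without invoking the compiler.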

…le-ai#1305) * [Feat] add missing support of uint32x2 * [Feat] Add `T.Ref` annotation and tests * fix lint error * minor update for error message on twice decl * Remove unused let_bindings_ in CodeGenC to fix tile-ai#1300

Rachmanino · 2026-01-28T08:59:45Z

cc @chengyupku

chengyupku · 2026-01-29T16:33:46Z

I noticed that the distributed scenarios still default to the Cython backend instead of the TVM-FFI backend (specifically in tilelang/jit/execution_backend.py). When I attempted to force the use of the TVM-FFI backend, it triggered several errors. I think we should resolve this issue in this PR.

Rachmanino · 2026-01-30T00:37:55Z

I also noticed this; I'll take care of it.

LeiWang1999 and others added 30 commits November 14, 2025 00:50

[Language] Add missing while statement (tile-ai#1254)

5eb30a4

* add typing stub for tir.ir * remove idents * minor update * [Language] Add missing while statement * add test

[BugFix] Add autotune and exp2 for GDN kernel (tile-ai#1258)

eac96cd

* [BugFix] Add autotune and exp2 for GDN kernel * [Lint] * [Lint]

[AMD] Update CK for ROCm7 (tile-ai#1262)

729e66c

[Example] Add GQA decoding kernel with varlen page table (tile-ai#1265)

716dbef

* [Example] Add page table for gqa decode * [Example] Page table for varlen decoding * [Lint] * [Refactor] Remove redundant code * [Lint] * [Lint] * [Lint]

[EXAMPLE] In the flash attention example keep the max of all blocks s…

a2a2781

…een in scores_max numerical stability (tile-ai#1148) * Keep the max of all blocks seen in scores_max for stability * ruff formatting

[Docs] Improve Installation Guide (tile-ai#1270)

b3d6f03

* [Docs] Improve installation guide * address comments

[Bugfix] Fix multiple cg defination when using T.sync_grid (tile-ai#1272

220c323

)

[Minor] Remove from __future__ import annotations for python 3.8 (til…

b192251

…e-ai#1273)

[BugFix] Adding extra parameters into autotune hashkey (tile-ai#1274)

e805f8e

* [BugFix] Adding extra parameters into autotune hashkey * lint * None check * check serializable
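The steps listed in this commit (extra parameters in the hash key, a `None` check, and a serializability check) can be sketched roughly as below. This is a hypothetical illustration, not the autotuner's actual code: `autotune_cache_key`, its arguments, and the `repr()` fallback are assumptions, and `config` is assumed to contain only JSON-serializable values.

```python
import hashlib
import json


def autotune_cache_key(kernel_name, config, extra_params=None):
    """Build a deterministic cache key that also covers extra parameters.

    Hypothetical helper for illustration only.
    """
    extras = {}
    for name, value in (extra_params or {}).items():
        if value is None:
            continue  # None check: unset parameters do not affect the key
        try:
            json.dumps(value)  # serializability check
        except (TypeError, ValueError):
            # Fallback for non-serializable values; repr() is only stable
            # for objects whose repr does not embed a memory address.
            value = repr(value)
        extras[name] = value
    payload = {"kernel": kernel_name, "config": config, "extra": extras}
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

Sorting the keys before hashing keeps the result independent of dictionary insertion order, so two equivalent configurations map to the same cache entry.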

Fix various issues under int64_t static and dynamic shape. (tile-ai…

49c8571

…#1218) * Fix various issues under int64_t static and dynamic shape. * Resolve reviewed issues. * Add unit test. * fix --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

[Bugfix] Minor fix for some cases (tile-ai#1278)

1b0efb6

[Language] Add shape check in T.view/reshape (tile-ai#1277)

921b96a

* [Language] Add shape check in T.view/reshape * address comments
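The commit does not show the check itself, so the following is only a sketch of one plausible rule, under the assumption that a view or reshape must preserve the total size in bits. The function name `check_reshape` and the `old_bits` / `new_bits` parameters are hypothetical.

```python
import math


def check_reshape(old_shape, new_shape, old_bits=32, new_bits=32):
    """Raise ValueError if the reshaped buffer would change total size.

    Hypothetical check for illustration: sizes are compared in bits so
    that a view with a different element width can also be validated.
    """
    old_total = math.prod(old_shape) * old_bits
    new_total = math.prod(new_shape) * new_bits
    if old_total != new_total:
        raise ValueError(
            f"cannot reshape {tuple(old_shape)} ({old_bits}-bit) to "
            f"{tuple(new_shape)} ({new_bits}-bit): "
            f"{old_total} bits != {new_total} bits"
        )
```

For example, `check_reshape((4, 8), (32,))` passes silently, while `check_reshape((4, 8), (5, 8))` raises because the element counts differ.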

[Bugfix] Supply missing T.print for bool type (tile-ai#1279)

4c8b9ad

* fix for bool dtype * lint fix * fix * ci fix

Fix the bug in issue tile-ai#1266 (tile-ai#1284)

49f3539

Co-authored-by: cheeryBloosm <liu_yu_hao@126.com>

[Language][UX] Nested loop checker in pre-lowering stage (tile-ai#1288)

9e67b86

* [Language][UX] Nested loop checker in pre-lowering stage * rename * comment * address comments

[Compatibility] Support CUDA 11.3 (tile-ai#1290)

bef7e52

[Feat] Add support for using T.Tensor(n * 2 + 1) in function annota…

bccb648

…tion (tile-ai#1285) * [Feature] Add support for A: T.Tensor(n + 1) and A: T.Tensor(2*n) * issue fix * fix * fix * decreate nproc for debugging --------- Co-authored-by: Lei Wang <leiwang1999@outlook.com>

[Feat] add support for passing reference in T.Var annotation (tile-ai…

dd7fdb8

…#1291)

[Enhancement] Shared Memory Size Can be Dynamic (tile-ai#1294)

d4b6d09

* bugfix * lint fix * test * lint fix * increate procs * recover

Rachmanino added 5 commits January 19, 2026 17:29

disable ci for arm and metal

6ea6d96

fix ts_ext

de8d36d

use sdist for ci

141e4d4

use tilelang's new ci

109e67a

use cmake rather than pyproject dependency for tilescale extension

af1906e

Rachmanino force-pushed the uv/tilescale_tvmffi branch from 35a8c0f to 0d38d16 Compare January 19, 2026 14:02

install torch before ts_ext

b51575d

Rachmanino force-pushed the uv/tilescale_tvmffi branch from 0d38d16 to b51575d Compare January 19, 2026 14:14

Rachmanino added 3 commits January 28, 2026 15:12

fix torch lib link bug

78fc8fe

add missing codegen

add3089

disable ci test for deepep

b241d88

Rachmanino requested a review from chengyupku January 28, 2026 08:59

Rachmanino and others added 10 commits January 28, 2026 19:28

fix gitignore bug

4ce401d

disable ib for nccl

86b49ec

switch to new ci runner

71c6800

lint

ed2e798

set num_procs to 2

93f8dd0

fix typo

6f1fb5c

using tsinghua src for pip

1b7b053

refactor CI workflow to remove SDist download step, simplifying artif…

60450af

…act management

[BugFix] Add device_ids attribute to BaseAllocator for improved devic…

611d4a0

…e management

[Doc] Update Installation Guide for TileScale: Simplify installation …

2629772

…methods, update prerequisites, and enhance Docker instructions.

Rachmanino and others added 5 commits January 30, 2026 11:44

[Feature] Support tvm-ffi for TileScale

9eb91e4

update DeepEP installation script and

c250854

draft for supporting tvm-ffi in deepep

af40faf

[Refactor] Update memory management to use constant memory for meta_d…

c3b5392

…ata, streamline kernel initialization, and enhance dispatch mechanisms with improved CUDA stream handling.

lint fix

df240f2
