[Feat] Add native sparse attention op #79

michaelwithu · 2026-01-04T07:43:17Z

Implemented the nsa_fwd kernel and wrapped it according to the TileOps interface hierarchy.
Developed a mean pooling kernel using TileLang, and verified its correctness by comparing outputs against the Triton implementation.

resolve #70

TODO:

Remove temporary/test-only code that will be invalid in the final implementation
Integrate mean pooling kernel into the TileOp interface layer
Reuse FlashAttention’s infrastructure to support compression and sliding window attention functionality
Implement the complete NSA pipeline

gemini-code-assist · 2026-01-04T07:43:44Z

Summary of Changes

Hello @michaelwithu, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the foundational components for Native Sparse Attention (NSA) within the project. It delivers a TileLang-based forward kernel for NSA, ensuring its integration into the existing TileOps structure. Additionally, a mean pooling kernel, also implemented with TileLang, has been validated against a Triton counterpart. The changes include extensive benchmarking and testing infrastructure to support the new attention mechanism, setting the stage for future enhancements like compression and sliding window attention.

Highlights

Native Sparse Attention (NSA) Kernel Implementation: A new nsa_fwd kernel has been implemented using TileLang and integrated into the TileOps interface hierarchy, enabling efficient forward passes for Native Sparse Attention.
Mean Pooling Kernel with TileLang: A mean pooling kernel has been developed using TileLang and its correctness verified by comparing its outputs against a Triton implementation, demonstrating the interoperability and accuracy of the TileLang kernel.
Comprehensive Benchmarking and Testing: Dedicated benchmark and test files have been added for the NativeSparseAttentionForwardOp, NativeSparseAttentionFunc, and NativeSparseAttentionLayer, along with input parameters, to ensure thorough validation and performance profiling of the new NSA implementation.
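As a rough illustration of what the mean pooling check involves, the sketch below computes chunked mean pooling over a single sequence in plain Python and measures the largest deviation between two outputs. It is a reference-style illustration only, not the TileLang or Triton kernel from this PR. The names `mean_pool_chunks` and `max_abs_diff`, the single-sequence `[T, D]` layout, and the handling of a short final chunk (averaged over the rows it actually contains) are assumptions.

```python
from typing import List


def mean_pool_chunks(x: List[List[float]], chunk_size: int) -> List[List[float]]:
    """Average consecutive groups of `chunk_size` rows of `x` (shape [T, D]).

    Assumption: a trailing partial chunk is averaged over the rows it
    actually contains rather than being padded or dropped.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pooled = []
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]
        dim = len(chunk[0])
        # Column-wise mean over the rows in this chunk.
        pooled.append([sum(row[d] for row in chunk) / len(chunk) for d in range(dim)])
    return pooled


def max_abs_diff(a: List[List[float]], b: List[List[float]]) -> float:
    """Largest element-wise absolute difference between two [N, D] outputs."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))
```

A comparison against another implementation would then reduce to checking that `max_abs_diff(kernel_out, reference_out)` stays below a tolerance suited to the data type.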

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces the forward pass for Native Sparse Attention (NSA), including the TileLang kernel, op/function/layer wrappers, and benchmarks. It also adds a mean pooling kernel in TileLang, verified against a Triton implementation. The overall structure is good, following the existing TileOps hierarchy. However, there's a significant amount of temporary and test code committed within the library files, which should be removed or moved to the tests directory as noted in the TODOs. I've also identified a potential bug in the nsa_fwd kernel that limits the head dimension and some areas for improvement in the benchmark code.

gemini-code-assist · 2026-01-04T07:45:33Z

top/kernels/deepseek_nsa/nsa_fwd.py

+
+        NK = tilelang.cdiv(dim, block_T)
+        NV = tilelang.cdiv(dim, block_T)
+        assert NK == 1, "The key dimension can not be larger than 256"


The assertion assert NK == 1 fails whenever the head dimension dim is larger than block_T. Given that block_T is at most 128, this kernel will not work for dim > 128. This is a significant limitation and should be fixed to support larger head dimensions. The assertion message "The key dimension can not be larger than 256" is also misleading, because the effective limit is 128.
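The arithmetic behind this comment can be checked with a small standalone sketch. It assumes that `tilelang.cdiv` is ceiling division and that `block_T` is chosen as `min(128, next_power_of_2(dim))`, as in the `main()` harness of the same file; the helper `key_tiles` is a hypothetical name used only for illustration.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the usual meaning of a `cdiv` helper."""
    return -(-a // b)


def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def key_tiles(dim: int, max_block: int = 128) -> int:
    """Number of block_T-wide tiles needed to cover the head dimension."""
    block_T = min(max_block, next_power_of_2(dim))
    return cdiv(dim, block_T)


# NK stays at 1 up to dim = 128 and becomes 2 for any larger dim,
# which is where `assert NK == 1` would fire.
for dim in (32, 64, 128, 192, 256):
    print(dim, key_tiles(dim))
```

Under these assumptions, a head dimension of 192 or 256 already yields two tiles, so the assertion rejects it even though the message names 256 as the limit.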

gemini-code-assist · 2026-01-04T07:45:33Z

top/kernels/deepseek_nsa/mean_pooling.py

+    H, D, chunk_size = 4, 64, 32
+
+    x_unpad = torch.randn(total_T, H, D, dtype=torch.float16, device=device)
+    # x_triton = x_unpad.unsqueeze(0)  # (1, total_T, H, D)


This commented-out line appears to be dead code and should be removed.

gemini-code-assist · 2026-01-04T07:45:33Z

test_tileops.py

+import torch
+import top
+from top import MLAKernel
+
+device = "cuda"
+dtype = torch.float16
+
+batch = 128
+heads = 64
+kv_heads = 1
+kv_ctx = 8192
+dim = 512
+pe_dim = 64
+
+# Query input: [batch, heads, dim]
+q = torch.randn(batch, heads, dim, device=device, dtype=dtype)
+
+# Query positional encoding: [batch, heads, pe_dim]
+q_pe = torch.randn(batch, heads, pe_dim, device=device, dtype=dtype)
+
+# KV cache input: [batch, kv_ctx, kv_heads, dim]
+kv = torch.randn(batch, kv_ctx, kv_heads, dim, device=device, dtype=dtype)
+
+# KV positional encoding: [batch, kv_ctx, kv_heads, pe_dim]
+k_pe = torch.randn(batch, kv_ctx, kv_heads, pe_dim, device=device, dtype=dtype)
+
+# Use MLA kernel
+block_N = 64
+block_H = 64
+num_split = 1
+
+mla = MLAKernel(batch, heads, kv_heads, kv_ctx, dim, pe_dim, block_N, block_H, num_split)
+
+out = mla(q, q_pe, kv, k_pe)


This entire file seems to be for testing MLAKernel, which is unrelated to the Native Sparse Attention (NSA) changes in this pull request. It appears to be a temporary test file that was accidentally included. Please remove it to keep the PR focused.

gemini-code-assist · 2026-01-04T07:45:33Z

tests/functions/test_deepseek_nsa_func.py

+    block_size=64,
+    groups=1,
+    selected_blocks=16,
+    # dtype='float16',


This line is commented out and appears to be dead code. It should be removed. A similar issue exists in tests/layers/test_deepseek_nsa_layer.py and tests/ops/test_deepseek_nsa_ops.py.

gemini-code-assist · 2026-01-04T07:45:33Z

top/functions/deepseek_nsa.py

+# def main():
+#     B, SEQ_LEN, H, HQ, D, S, block_size, dtype, scale = 2, 64, 1, 16, 32, 1, 32, torch.float16, 0.1
+
+#     block_T = min(128, 16)
+
+#     kernel = NativeSparseAttentionFunc(
+#         batch=B,
+#         heads=HQ,
+#         seq_len=SEQ_LEN,
+#         dim=D,
+#         is_causal=True,
+#         block_size=block_size,
+#         groups=HQ // H,
+#         selected_blocks=S,
+#         scale=scale,
+#         tune=True,
+#     )
+
+
+#     torch.random.manual_seed(0)
+#     Q = torch.randn((B, SEQ_LEN, HQ, D), dtype=dtype, device="cuda").requires_grad_(True)
+#     K = torch.randn((B, SEQ_LEN, H, D), dtype=dtype, device="cuda").requires_grad_(True)
+#     V = torch.randn((B, SEQ_LEN, H, D), dtype=dtype, device="cuda").requires_grad_(True)
+#     g_slc = torch.ones((B, SEQ_LEN, HQ), dtype=dtype, device="cuda").requires_grad_(True)
+#     g_swa = torch.ones((B, SEQ_LEN, HQ), dtype=dtype, device="cuda").requires_grad_(True)
+#     DO = torch.randn((B, SEQ_LEN, HQ, D), dtype=dtype, device="cuda")
+
+#     block_indices = torch.full((B, SEQ_LEN, H, S), SEQ_LEN, dtype=torch.long, device="cuda")
+#     block_counts = torch.zeros((B, SEQ_LEN, H), dtype=torch.long, device="cuda")
+#     for b in range(B):
+#         for t in range(SEQ_LEN):
+#             for h in range(H):
+#                 i_i = torch.randperm(max(1, (t // block_size)))[:S]
+#                 block_indices[b, t, h, : len(i_i)] = i_i
+#                 block_counts[b, t, h] = (block_indices[b, t, h] != SEQ_LEN).sum().item()
+#     block_indices = block_indices.sort(-1)[0]
+
+#     out = kernel.forward(Q, K, V, block_indices.to(torch.int32))
+
+#     ref = naive_nsa(
+#         q=Q,
+#         k=K,
+#         v=V,
+#         g_slc=g_slc,
+#         g_swa=g_swa,
+#         block_indices=block_indices,
+#         block_counts=block_counts,
+#         block_size=block_size,
+#         scale=scale,
+#     )
+
+#     print("out", out)
+#     print("ref", ref)
+#     torch.testing.assert_close(ref, out, atol=1e-2, rtol=1e-2)
+
+
+# if __name__ == "__main__":
+#     main()


This large block of commented-out code seems to be a main function for testing. As per the PR's TODO list, temporary and test-only code should be removed from library files.

gemini-code-assist · 2026-01-04T07:45:33Z

top/kernels/deepseek_nsa/nsa_fwd.py

+def main():
+    # B, SEQ_LEN, H, HQ, D, S, block_size, dtype, scale = 2, 64, 1, 16, 32, 1, 32, torch.float16, 0.1
+    B, SEQ_LEN, H, HQ, D, S, block_size, dtype, scale = 2,  8192, 4, 16*4, 128, 16, 32, torch.float16, 0.1
+
+    block_T = min(128, tilelang.math.next_power_of_2(D))
+    kernel = _nsa_fwd_kernel(
+        batch=B,
+        heads=HQ,
+        seq_len=SEQ_LEN,
+        dim=D,
+        is_causal=True,
+        scale=scale,
+        block_size=block_size,
+        groups=HQ // H,
+        selected_blocks=S,
+    )(block_T=block_T, num_stages=2, threads=32)
+
+    kernel2 = nsa_fwd_kernel(
+        batch=B,
+        heads=HQ,
+        seq_len=SEQ_LEN,
+        dim=D,
+        is_causal=True,
+        block_size=block_size,
+        groups=HQ // H,
+        selected_blocks=S,
+        scale=scale,
+        tune=True,
+    )
+
+
+    src_kernel = kernel.get_kernel_source()
+    print(src_kernel)
+    # with open("nsa_fwd_kernel.cu", "w") as f:
+    #     f.write(src_kernel)
+    torch.random.manual_seed(0)
+    Q = torch.randn((B, SEQ_LEN, HQ, D), dtype=dtype, device="cuda").requires_grad_(True)
+    K = torch.randn((B, SEQ_LEN, H, D), dtype=dtype, device="cuda").requires_grad_(True)
+    V = torch.randn((B, SEQ_LEN, H, D), dtype=dtype, device="cuda").requires_grad_(True)
+    g_slc = torch.ones((B, SEQ_LEN, HQ), dtype=dtype, device="cuda").requires_grad_(True)
+    g_swa = torch.ones((B, SEQ_LEN, HQ), dtype=dtype, device="cuda").requires_grad_(True)
+    DO = torch.randn((B, SEQ_LEN, HQ, D), dtype=dtype, device="cuda")
+
+    block_indices = torch.full((B, SEQ_LEN, H, S), SEQ_LEN, dtype=torch.long, device="cuda")
+    block_counts = torch.zeros((B, SEQ_LEN, H), dtype=torch.long, device="cuda")
+    for b in range(B):
+        for t in range(SEQ_LEN):
+            for h in range(H):
+                i_i = torch.randperm(max(1, (t // block_size)))[:S]
+                block_indices[b, t, h, : len(i_i)] = i_i
+                block_counts[b, t, h] = (block_indices[b, t, h] != SEQ_LEN).sum().item()
+    block_indices = block_indices.sort(-1)[0]
+
+    out = kernel(Q, K, V, block_indices.to(torch.int32))
+
+    out2 = kernel2.forward(Q, K, V, block_indices.to(torch.int32))
+
+
+if __name__ == "__main__":
+    main()


This main function is for testing and should be removed from the kernel file before merging.
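If the harness moves into the tests directory, the block-selection setup is the part most worth keeping as a helper. The sketch below reproduces the per-token loop body in plain Python using the standard `random` module instead of `torch.randperm`; the function name `sample_block_indices` and its signature are hypothetical and not part of this PR.

```python
import random
from typing import List, Tuple


def sample_block_indices(t: int, block_size: int, num_selected: int,
                         seq_len: int, rng: random.Random) -> Tuple[List[int], int]:
    """Pick up to `num_selected` distinct block ids for query position `t`.

    Mirrors the per-token loop body in the test harness: candidates are the
    blocks 0 .. max(1, t // block_size) - 1, unused slots keep the sentinel
    value `seq_len`, and the row is sorted so sentinels end up last.
    """
    num_candidates = max(1, t // block_size)
    picked = rng.sample(range(num_candidates), min(num_selected, num_candidates))
    # Pad the row with the sentinel, as torch.full(..., SEQ_LEN) does in the harness.
    row = picked + [seq_len] * (num_selected - len(picked))
    count = sum(1 for idx in row if idx != seq_len)
    return sorted(row), count
```

For example, with `block_size=32` and `num_selected=4`, a token at position 100 has three candidate blocks, so the row holds three block ids followed by one sentinel.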

gemini-code-assist · 2026-01-04T07:45:33Z

top/kernels/deepseek_nsa/nsa_fwd.py

+    # with open("nsa_fwd_kernel.cu", "w") as f:
+    #     f.write(src_kernel)


This commented-out code seems to be for debugging and should be removed.

gemini-code-assist · 2026-01-04T07:45:33Z

top/layers/deepseek_nsa.py

+def main():
+    B, SEQ_LEN, H, HQ, D, S, block_size, dtype, scale = 2, 64, 1, 16, 32, 1, 32, torch.float16, 0.1
+
+    block_T = min(128, 16)
+
+    kernel = NativeSparseAttentionLayer(
+        batch=B,
+        heads=HQ,
+        seq_len=SEQ_LEN,
+        dim=D,
+        is_causal=True,
+        block_size=block_size,
+        groups=HQ // H,
+        selected_blocks=S,
+        scale=scale,
+        tune=True,
+    )
+
+
+    torch.random.manual_seed(0)
+    Q = torch.randn((B, SEQ_LEN, HQ, D), dtype=dtype, device="cuda").requires_grad_(True)
+    K = torch.randn((B, SEQ_LEN, H, D), dtype=dtype, device="cuda").requires_grad_(True)
+    V = torch.randn((B, SEQ_LEN, H, D), dtype=dtype, device="cuda").requires_grad_(True)
+    g_slc = torch.ones((B, SEQ_LEN, HQ), dtype=dtype, device="cuda").requires_grad_(True)
+    g_swa = torch.ones((B, SEQ_LEN, HQ), dtype=dtype, device="cuda").requires_grad_(True)
+    DO = torch.randn((B, SEQ_LEN, HQ, D), dtype=dtype, device="cuda")
+
+    block_indices = torch.full((B, SEQ_LEN, H, S), SEQ_LEN, dtype=torch.long, device="cuda")
+    block_counts = torch.zeros((B, SEQ_LEN, H), dtype=torch.long, device="cuda")
+    for b in range(B):
+        for t in range(SEQ_LEN):
+            for h in range(H):
+                i_i = torch.randperm(max(1, (t // block_size)))[:S]
+                block_indices[b, t, h, : len(i_i)] = i_i
+                block_counts[b, t, h] = (block_indices[b, t, h] != SEQ_LEN).sum().item()
+    block_indices = block_indices.sort(-1)[0]
+
+    out = kernel.forward(Q, K, V, block_indices.to(torch.int32))
+
+    ref = naive_nsa(
+        q=Q,
+        k=K,
+        v=V,
+        g_slc=g_slc,
+        g_swa=g_swa,
+        block_indices=block_indices,
+        block_counts=block_counts,
+        block_size=block_size,
+        scale=scale,
+    )
+
+    print("out", out)
+    print("ref", ref)
+    torch.testing.assert_close(ref, out, atol=1e-2, rtol=1e-2)
+
+
+if __name__ == "__main__":
+    main()


This main function contains test code which should be removed from the layer definition file.
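When this check is moved into a test, the tolerance semantics are worth keeping in mind. `torch.testing.assert_close` accepts a pair of values when `|actual - expected| <= atol + rtol * |expected|`; the plain-Python sketch below restates that criterion for flat sequences of floats. The helper name `all_close` is hypothetical, and the defaults simply copy the `atol=1e-2, rtol=1e-2` used in the harness.

```python
from typing import Sequence


def all_close(actual: Sequence[float], expected: Sequence[float],
              atol: float = 1e-2, rtol: float = 1e-2) -> bool:
    """True when every pair satisfies |a - e| <= atol + rtol * |e|."""
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))
```

With these settings, values near zero are effectively compared against the absolute tolerance alone, which matters for attention outputs that are small in magnitude.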

gemini-code-assist · 2026-01-04T07:45:34Z

top/ops/deepseek_nsa.py

+        print("batch ", self.batch)
+        print("heads ", self.heads)
+        print("seq_len ", self.seq_len)
+        print("dim ", self.dim)
+        print("is_causal ", self.is_causal)
+        print("scale ", self.scale)
+        print("block_size ", self.block_size)
+        print("groups ", self.groups)
+        print("selected_blocks ", self.selected_blocks)
+        print("tune ", self.tune)


These print statements are for debugging and should be removed.
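If these values are still useful during development, one possible alternative to deleting them outright is to route them through the standard `logging` module at debug level, so that normal runs stay quiet. This is only a sketch of that option, not something the review asks for; the logger name and the helpers `describe_config` and `log_config` are hypothetical.

```python
import logging

logger = logging.getLogger("nsa_op_example")  # hypothetical logger name


def describe_config(**params) -> str:
    """Format op parameters as a single 'key=value' line."""
    return ", ".join(f"{name}={value!r}" for name, value in params.items())


def log_config(**params) -> None:
    # Emitted only when DEBUG logging is enabled, so normal runs stay quiet.
    logger.debug("NSA forward op config: %s", describe_config(**params))
```

A call such as `log_config(batch=2, heads=16, is_causal=True)` would then replace the ten separate `print` calls with one debug record.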

gemini-code-assist · 2026-01-04T07:45:34Z

top/ops/deepseek_nsa.py

+def main():
+    # B, SEQ_LEN, H, HQ, D, S, block_size, dtype, scale = 2, 64, 1, 16, 32, 1, 32, torch.float16, 0.1
+
+    B, SEQ_LEN, H, HQ, D, S, block_size, dtype, scale = 2,  8192, 4, 16*4, 128, 16, 32, torch.float16, 0.1
+
+    block_T = min(128, 16)
+
+    kernel = NativeSparseAttentionForwardOp(
+        batch=B,
+        heads=HQ,
+        seq_len=SEQ_LEN,
+        dim=D,
+        is_causal=True,
+        block_size=block_size,
+        groups=HQ // H,
+        selected_blocks=S,
+        scale=scale,
+        tune=True,
+    )
+
+
+    torch.random.manual_seed(0)
+    Q = torch.randn((B, SEQ_LEN, HQ, D), dtype=dtype, device="cuda").requires_grad_(True)
+    K = torch.randn((B, SEQ_LEN, H, D), dtype=dtype, device="cuda").requires_grad_(True)
+    V = torch.randn((B, SEQ_LEN, H, D), dtype=dtype, device="cuda").requires_grad_(True)
+    g_slc = torch.ones((B, SEQ_LEN, HQ), dtype=dtype, device="cuda").requires_grad_(True)
+    g_swa = torch.ones((B, SEQ_LEN, HQ), dtype=dtype, device="cuda").requires_grad_(True)
+    DO = torch.randn((B, SEQ_LEN, HQ, D), dtype=dtype, device="cuda")
+
+    block_indices = torch.full((B, SEQ_LEN, H, S), SEQ_LEN, dtype=torch.long, device="cuda")
+    block_counts = torch.zeros((B, SEQ_LEN, H), dtype=torch.long, device="cuda")
+    for b in range(B):
+        for t in range(SEQ_LEN):
+            for h in range(H):
+                i_i = torch.randperm(max(1, (t // block_size)))[:S]
+                block_indices[b, t, h, : len(i_i)] = i_i
+                block_counts[b, t, h] = (block_indices[b, t, h] != SEQ_LEN).sum().item()
+    block_indices = block_indices.sort(-1)[0]
+
+    out = kernel.forward(Q, K, V, block_indices.to(torch.int32))
+
+    # ref = naive_nsa(
+    #     q=Q,
+    #     k=K,
+    #     v=V,
+    #     g_slc=g_slc,
+    #     g_swa=g_swa,
+    #     block_indices=block_indices,
+    #     block_counts=block_counts,
+    #     block_size=block_size,
+    #     scale=scale,
+    # )
+
+    print("out", out)
+    # print("ref", ref)
+    # torch.testing.assert_close(ref, out, atol=1e-2, rtol=1e-2)
+
+
+if __name__ == "__main__":
+    main()


This main function contains test code and should be removed from the op definition file.

yJnmIchaeL and others added 7 commits December 26, 2025 17:21

nsa kernel

3476b96

add deepseek nsa

de259cd

nsa fwd benchmark

8849353

feat(python): run benchmarks and mean_pooling_tilelang kernel

edc6a19

feat(python): run benchmarks and mean_pooling_tilelang kernel

8644719

feat(python): run benchmarks and mean_pooling_tilelang kernel

b687402

feat(python): run benchmarks and mean_pooling_tilelang kernel

6b57bcd

gemini-code-assist bot reviewed Jan 4, 2026


xysmlx changed the title from "Jayden/nsa" to "[Feat] Add native sparse attention op" on Jan 4, 2026

yJnmIchaeL added 7 commits January 6, 2026 14:33

[Feat]add nsa_fwd kernel/op & mean_pool kernel/op

e4bb142

[Feat]add nsa_fwd kernel/op & mean_pool kernel/op

92da705

[Feat]add nsa_fwd kernel/op & mean_pool kernel/op

406dc5b

test: using pytest for better extensibility.

abe1982

test: using pytest for better extensibility.

dca22b8

test: using pytest for better extensibility.

afc38bf

test: using pytest for better extensibility.

f35d549

michaelwithu commented Jan 4, 2026 · edited by lcy-seso

3 participants
