[Examples] add tfla op and tfla-optimized#722

Open
Old-cpu wants to merge 2 commits into buddy-compiler:main from Old-cpu:tiled-flash-linea-attention-new
Conversation


@Old-cpu Old-cpu commented Mar 17, 2026

Description

This PR adds an optimized implementation of the Tiled Flash Linear Attention (TFLA) kernel based on the NeurIPS 2025 paper "Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels", demonstrating significant performance improvements over the baseline versions. The optimized TFLA kernel achieves approximately a 21× speedup over the baseline TFLA implementation and approximately a 3× speedup over the fused GQA Attention kernel (next-gqa-attention-fusion.mlir).
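For reviewers unfamiliar with linear attention, here is a minimal NumPy sketch (not code from this PR) of the recurrence that TFLA tiles: instead of softmax attention over the whole cache, linear attention keeps a running d×d state S = Σᵢ kᵢvᵢᵀ, so a decode step is O(d²) regardless of sequence length. All names below are illustrative.

```python
import numpy as np

def linear_attention_step(S, q, k, v):
    """One decode step of (unnormalized) linear attention.

    S       : (d, d) running state, sum of k_i v_i^T over past steps
    q, k, v : (d,) query/key/value for the current token
    Returns (output, updated state).
    """
    S = S + np.outer(k, v)   # fold the new key/value into the state
    o = q @ S                # read out: q^T S == sum_i (q . k_i) v_i
    return o, S

# Tiny demo: the recurrence matches the quadratic-attention form.
d, T = 4, 8
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, T, d))

S = np.zeros((d, d))
recurrent = []
for t in range(T):
    o, S = linear_attention_step(S, Q[t], K[t], V[t])
    recurrent.append(o)
recurrent = np.stack(recurrent)

# Same thing computed directly: O_t = sum_{i<=t} (q_t . k_i) v_i
direct = np.stack([(Q[t] @ K[: t + 1].T) @ V[: t + 1] for t in range(T)])
assert np.allclose(recurrent, direct)
```

The tiled/flash variants in the paper chunk the sequence so the state stays in fast memory across a tile, rather than re-reading it per token; this sketch only shows the per-token recurrence being tiled.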

Hardware Configuration

  • CPU: Intel Xeon Silver 4114 @ 2.20GHz
  • Cores: 2 sockets × 10 cores = 20 physical cores (40 logical threads)
  • Architecture: x86_64 with AVX-512 support
  • OpenMP threads: 48

Background

The TFLA kernel provides a fair comparison with Grouped Query Attention (GQA) under identical configurations:

  • Batch size: 1
  • Query heads: 12
  • KV groups: 2 (each group serves 6 heads via GQA)
  • Sequence length: 1 (single-token inference)
  • Hidden dimension per head: 128
  • KV cache sequence length: 1024 per group
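The configuration above implies a concrete GQA layout and cache footprint. A small arithmetic sketch (the float32 element type is an assumption; the PR does not state the dtype):

```python
# GQA head mapping and KV-cache size for the configuration listed above.
batch, q_heads, kv_groups = 1, 12, 2
head_dim, kv_seq = 128, 1024
bytes_per_elem = 4  # assumed float32

# Each KV group is shared by q_heads / kv_groups query heads.
heads_per_group = q_heads // kv_groups

# K and V tensors: 2 * batch * groups * seq * head_dim elements total.
elems = 2 * batch * kv_groups * kv_seq * head_dim
cache_mib = elems * bytes_per_elem / 2**20

print(heads_per_group)  # 6 query heads per KV group
print(elems)            # 524288 cached elements
print(cache_mib)        # 2.0 MiB under the float32 assumption
```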

Expected Performance

  • Approximately 21× speedup over the baseline TFLA kernel and approximately 3× over the fused GQA Attention kernel, under the configuration above.

How to Test

  • Run make next-tfla-run and make next-tfla-optimized-run and compare the reported execution times.

Checklist

  • The code builds successfully
  • Existing tests pass
  • New tests are added where appropriate
  • Code follows the project coding style
  • Documentation is updated if needed

@Old-cpu Old-cpu requested a review from zhanghb97 as a code owner March 17, 2026 11:49