add fatrelu by zhenwei-intel · Pull Request #3 · zhenwei-intel/vllm-xpu-kernels

zhenwei-intel · 2026-04-09T06:00:22Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result

(Optional) Documentation Update

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

gemini-code-assist

Code Review

This pull request implements the fatrelu_and_mul activation function for XPU, including the SYCL kernel, C++ bindings, and comprehensive Python tests. The feedback highlights significant code duplication between the newly added act_and_mul_with_param_vec_kernel and the existing act_and_mul_vec_kernel, as well as their associated macros. It is recommended to refactor these components into a unified implementation that can handle activation functions both with and without additional parameters to improve maintainability.
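For orientation, here is a minimal scalar sketch of what a fatrelu_and_mul op is expected to compute. It is not taken from the PR. It assumes FATReLU zeroes every value that does not exceed a threshold, and that each input row holds the activation half followed by the gate half (the same layout the kernel below indexes). The names fatrelu and fatrelu_and_mul_ref are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Assumed FATReLU definition: keep x only when it exceeds the threshold.
inline float fatrelu(float x, float threshold) {
  return x > threshold ? x : 0.0f;
}

// Hypothetical CPU reference (not part of the PR).
// input: [num_tokens, 2 * d], each row laid out as [x(0..d) | y(0..d)]
// returns: [num_tokens, d] with out[t][i] = fatrelu(x[t][i]) * y[t][i]
std::vector<float> fatrelu_and_mul_ref(const std::vector<float>& input,
                                       std::size_t num_tokens,
                                       std::size_t d,
                                       float threshold) {
  std::vector<float> out(num_tokens * d);
  for (std::size_t t = 0; t < num_tokens; ++t) {
    const float* x = input.data() + t * 2 * d;  // activation half
    const float* y = x + d;                     // gate half
    for (std::size_t i = 0; i < d; ++i) {
      out[t * d + i] = fatrelu(x[i], threshold) * y[i];
    }
  }
  return out;
}
```

A reference like this is mainly useful as a comparison target when testing a vectorized device kernel on small inputs.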

gemini-code-assist · 2026-04-09T06:04:09Z

+template <
+    typename scalar_t,
+    scalar_t (*ACT_FN)(const scalar_t&, const float),
+    int VEC_SIZE>
+class act_and_mul_with_param_vec_kernel {
+ public:
+  act_and_mul_with_param_vec_kernel(
+      scalar_t* __restrict__ out,
+      const scalar_t* __restrict__ input,
+      const int d,
+      const float param)
+      : out_(out), input_(input), d_(d), param_(param) {}
+
+  void operator()(sycl::nd_item<1> item) const {
+    using vec_t = vllm::xpu::aligned_vec<scalar_t, VEC_SIZE>;
+    const int64_t token_idx = item.get_group(0);
+    const int64_t offset = item.get_local_linear_id();
+    const int64_t step = item.get_local_range(0);
+    const int64_t bound = d_ / VEC_SIZE;
+
+    for (int64_t i = offset; i < bound; i += step) {
+      auto x_vec =
+          reinterpret_cast<const vec_t*>(input_)[token_idx * bound * 2 + i];
+      auto y_vec = reinterpret_cast<const vec_t*>(
+          input_)[token_idx * bound * 2 + i + bound];
+      vec_t out_vec;
+#pragma unroll
+      for (int j = 0; j < VEC_SIZE; ++j) {
+        out_vec[j] = ACT_FN(x_vec[j], param_) * y_vec[j];
+      }
+      reinterpret_cast<vec_t*>(out_)[token_idx * bound + i] = out_vec;
+    }
+  }
+
+ private:
+  scalar_t* __restrict__ out_;
+  const scalar_t* __restrict__ input_;
+  const int d_;
+  const float param_;
+};


This new kernel act_and_mul_with_param_vec_kernel is almost a complete duplicate of act_and_mul_vec_kernel. Similarly, the new macros VEC_LAUNCH_ACT_AND_MUL_WITH_PARAM and LAUNCH_ACTIVATION_GATE_KERNEL_WITH_PARAM_VEC are duplicates of existing macros. This significant code duplication makes maintenance harder because changes may need to be applied in multiple places.

Consider refactoring to unify these kernels and macros. For example, you could use a single kernel that can handle activation functions both with and without an extra parameter. This could be achieved using if constexpr and std::is_invocable_v to check the signature of the activation function pointer. This would reduce code duplication and improve maintainability.
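The suggestion above could look roughly like the following host-only sketch. It is an illustration under stated assumptions, not the repository's code: the SYCL nd_item indexing and vectorized loads are replaced by a plain loop over one token, the activation is passed as a C++17 auto non-type template parameter (the PR uses a typed function pointer), and relu_like, fatrelu_like, and act_and_mul_unified are hypothetical names.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical activation without an extra parameter.
template <typename T>
inline T relu_like(const T& x) {
  return x > T(0) ? x : T(0);
}

// Hypothetical activation with an extra float parameter (assumed FATReLU).
template <typename T>
inline T fatrelu_like(const T& x, const float threshold) {
  return static_cast<float>(x) > threshold ? x : T(0);
}

// One functor for both signatures; the branch is resolved at compile time.
template <typename scalar_t, auto ACT_FN>
struct act_and_mul_unified {
  scalar_t* out;
  const scalar_t* input;
  int d;
  float param;  // ignored when ACT_FN takes a single argument

  static scalar_t apply(const scalar_t& x, float param) {
    if constexpr (std::is_invocable_v<decltype(ACT_FN), const scalar_t&,
                                      const float>) {
      return ACT_FN(x, param);
    } else {
      static_assert(std::is_invocable_v<decltype(ACT_FN), const scalar_t&>,
                    "ACT_FN must accept (x) or (x, param)");
      return ACT_FN(x);
    }
  }

  // Scalar stand-in for the per-token work of the device kernel.
  void operator()(std::int64_t token_idx) const {
    for (int i = 0; i < d; ++i) {
      const scalar_t x = input[token_idx * 2 * d + i];
      const scalar_t y = input[token_idx * 2 * d + d + i];
      out[token_idx * d + i] = apply(x, param) * y;
    }
  }
};
```

With this shape, act_and_mul_unified<float, &relu_like<float>> and act_and_mul_unified<float, &fatrelu_like<float>> share one body. The launch macros could likewise collapse into one if callers that have no parameter pass a dummy value. Whether the unused param member has any measurable cost on the device would need to be checked.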


…#257)
* tune num_kv_splits for page decode kernel
  Signed-off-by: baodii <di.bao@intel.com>
* change min blocks in kernel
  Signed-off-by: baodii <di.bao@intel.com>
* Redesign get_num_splits() heuristic for better split decisions
  Key changes to the heuristic:
  - Early-exit when cur_parallel >= num_xe_cores (was 4x cores)
  - kv_heads >= 4: fixed target of num_xe_cores * 64 / block_size (~20 at p64)
  - kv_heads <= 2: floor at num_xe_cores splits, scale with blocks/10 for long sequences
  Benchmark results (page_size=64, BMG 20 XE cores, unitrace, L2 flush): 10 configs faster (avg 1.05x), 10 neutral, 2 slower (kernel-level)
  Signed-off-by: baodii <di.bao@intel.com>
---------
Signed-off-by: baodii <di.bao@intel.com>

* apply_scale for 4 bits
  Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
* apply_scale only for bf16
  Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
* prefetch
  Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
* prefetch
  Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
* finetune moe gemm policy
  Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
* remove useless code and format
  Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
---------
Signed-off-by: mayuyuace <qiming1.zhang@intel.com>

* fix overflow
  Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
* add UT for overflow
  Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
* file mode
  Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
---------
Signed-off-by: mayuyuace <qiming1.zhang@intel.com>


* [Test] refine test socpe definition
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* add scope
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* add scope
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* minor
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* onednn version
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* llama3
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* fix
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
---------
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>


…d_qk_norm_rope kernel (vllm-project#267)
* Add fuse_norm_quant, fuse_act_quant and fused_qk_norm_rope kernel
  Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
* fix format
  Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
* fix format
  Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
* add fused_qk_norm_rope head_dim=512 case and update vec_size
  Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
---------
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Signed-off-by: Yejing Lai <yejing.lai@intel.com>


gemini-code-assist Bot reviewed Apr 9, 2026


zhenwei-intel force-pushed the fatrelu branch from fdf96d8 to 6f84b42 · April 9, 2026 06:26

YangQun1 and others added 14 commits April 9, 2026 22:16

skip gdn core_attn_out check for f32 ssm_state and 8k len due to random numeric error (vllm-project#261)

0f0411a

Signed-off-by: yangqun <qun.yang@intel.com>

skip gdn core_attn_out check for 8k len due to random numeric error (vllm-project#264)

c86c3f6

Signed-off-by: yangqun <qun.yang@intel.com>

[fmha] support head dim 512 (vllm-project#251)

273c322

refactor cmake to enable selective kernel build (vllm-project#260)

acc6a7d

Add swap_blocks_batch op with batched async memcpy (vllm-project#265)

56ab101

Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com>

Fix XPU CPU-view tensor lifetime (vllm-project#262)

3fb2f1e

Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com> Co-authored-by: Harish Subramony <harish.subramony@intel.com>

remove yapf (vllm-project#272)

bd5809d

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

use local lru (vllm-project#275)

fd5e534

Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>

update scm version check and project python version (vllm-project#274)

2a19a4a

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

zhenwei-intel force-pushed the fatrelu branch 2 times, most recently from 27897c7 to 15b12eb · April 15, 2026 06:28

add fatrelu_and_mul

ce8995f

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

zhenwei-intel force-pushed the fatrelu branch from 15b12eb to ce8995f · April 15, 2026 06:56

zhenwei-intel wants to merge 15 commits into main from fatrelu


9 participants
