Skip to content

add fatrelu#3

Open
zhenwei-intel wants to merge 15 commits into
mainfrom
fatrelu
Open

add fatrelu#3
zhenwei-intel wants to merge 15 commits into
mainfrom
fatrelu

Conversation

@zhenwei-intel
Copy link
Copy Markdown
Owner

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result

(Optional) Documentation Update

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the fatrelu_and_mul activation function for XPU, including the SYCL kernel, C++ bindings, and comprehensive Python tests. The feedback highlights significant code duplication between the newly added act_and_mul_with_param_vec_kernel and the existing act_and_mul_vec_kernel, as well as their associated macros. It is recommended to refactor these components into a unified implementation that can handle activation functions both with and without additional parameters to improve maintainability.

Comment thread csrc/activation.cpp
Comment on lines +173 to +212
template <
typename scalar_t,
scalar_t (*ACT_FN)(const scalar_t&, const float),
int VEC_SIZE>
class act_and_mul_with_param_vec_kernel {
public:
act_and_mul_with_param_vec_kernel(
scalar_t* __restrict__ out,
const scalar_t* __restrict__ input,
const int d,
const float param)
: out_(out), input_(input), d_(d), param_(param) {}

void operator()(sycl::nd_item<1> item) const {
using vec_t = vllm::xpu::aligned_vec<scalar_t, VEC_SIZE>;
const int64_t token_idx = item.get_group(0);
const int64_t offset = item.get_local_linear_id();
const int64_t step = item.get_local_range(0);
const int64_t bound = d_ / VEC_SIZE;

for (int64_t i = offset; i < bound; i += step) {
auto x_vec =
reinterpret_cast<const vec_t*>(input_)[token_idx * bound * 2 + i];
auto y_vec = reinterpret_cast<const vec_t*>(
input_)[token_idx * bound * 2 + i + bound];
vec_t out_vec;
#pragma unroll
for (int j = 0; j < VEC_SIZE; ++j) {
out_vec[j] = ACT_FN(x_vec[j], param_) * y_vec[j];
}
reinterpret_cast<vec_t*>(out_)[token_idx * bound + i] = out_vec;
}
}

private:
scalar_t* __restrict__ out_;
const scalar_t* __restrict__ input_;
const int d_;
const float param_;
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This new kernel act_and_mul_with_param_vec_kernel is almost a complete duplicate of act_and_mul_vec_kernel. Similarly, the new macros VEC_LAUNCH_ACT_AND_MUL_WITH_PARAM and LAUNCH_ACTIVATION_GATE_KERNEL_WITH_PARAM_VEC are duplicates of existing macros. This significant code duplication makes maintenance harder because changes may need to be applied in multiple places.

Consider refactoring to unify these kernels and macros. For example, you could use a single kernel that can handle activation functions both with and without an extra parameter. This could be achieved using if constexpr and std::is_invocable_v to check the signature of the activation function pointer. This would reduce code duplication and improve maintainability.

YangQun1 and others added 14 commits April 9, 2026 22:16
…om numeric error (vllm-project#261)

Signed-off-by: yangqun <qun.yang@intel.com>
…#257)

* tune num_kv_splits for page decode kernel

Signed-off-by: baodii <di.bao@intel.com>

* change min blocks in kernel

Signed-off-by: baodii <di.bao@intel.com>

* Redesign get_num_splits() heuristic for better split decisions

Key changes to the heuristic:
- Early-exit when cur_parallel >= num_xe_cores (was 4x cores)
- kv_heads >= 4: fixed target of num_xe_cores * 64 / block_size (~20 at p64)
- kv_heads <= 2: floor at num_xe_cores splits, scale with blocks/10 for
  long sequences

Benchmark results (page_size=64, BMG 20 XE cores, unitrace, L2 flush):
  10 configs faster (avg 1.05x), 10 neutral, 2 slower (kernel-level)

Signed-off-by: baodii <di.bao@intel.com>

---------

Signed-off-by: baodii <di.bao@intel.com>
* apply_scale for 4 bits

Signed-off-by: mayuyuace <qiming1.zhang@intel.com>

* apply_scale only for bf16

Signed-off-by: mayuyuace <qiming1.zhang@intel.com>

* prefetch

Signed-off-by: mayuyuace <qiming1.zhang@intel.com>

* prefetch

Signed-off-by: mayuyuace <qiming1.zhang@intel.com>

* finetune moe gemm policy

Signed-off-by: mayuyuace <qiming1.zhang@intel.com>

* remove useless code and format

Signed-off-by: mayuyuace <qiming1.zhang@intel.com>

---------

Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
* fix overflow

Signed-off-by: mayuyuace <qiming1.zhang@intel.com>

* add UT for overflow

Signed-off-by: mayuyuace <qiming1.zhang@intel.com>

* file mode

Signed-off-by: mayuyuace <qiming1.zhang@intel.com>

---------

Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com>
* [Test] refine test socpe definition

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* add scope

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* add scope

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* minor

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* onednn version

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* llama3

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* fix

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

---------

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com>
Co-authored-by: Harish Subramony <harish.subramony@intel.com>
…d_qk_norm_rope kernel (vllm-project#267)

* Add fuse_norm_quant, fuse_act_quant and fused_qk_norm_rope kernel

Signed-off-by: Lai, Yejing <yejing.lai@intel.com>

* fix format

Signed-off-by: Lai, Yejing <yejing.lai@intel.com>

* fix format

Signed-off-by: Lai, Yejing <yejing.lai@intel.com>

* add fused_qk_norm_rope head_dim=512 case and update vec_size

Signed-off-by: Lai, Yejing <yejing.lai@intel.com>

---------

Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Signed-off-by: Yejing Lai <yejing.lai@intel.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
@zhenwei-intel zhenwei-intel force-pushed the fatrelu branch 2 times, most recently from 27897c7 to 15b12eb Compare April 15, 2026 06:28
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants