
fix(micro_perf): Fix GPU profiler never starting and incorrect timing statistics for multi-kernel ops #191

Open
OOOOOChen wants to merge 1 commit into bytedance:main from OOOOOChen:fix/micro-perf-profiling

Conversation

@OOOOOChen

Related issue

Closes #190

Problems

Bug 1: the GPU profiler can never start

BasicOp.__init__() hardcodes self.require_profiling to False, so in backend.py:

actual_profiling = self.enable_profiling and op_instance.require_profiling  # always False

Regardless of whether enable_profiling is turned on, the profiler path in BackendGPU.core_perf() is never executed.
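The short circuit can be seen in a minimal sketch (class and attribute names follow the PR description; the surrounding backend class and the should_profile helper are simplified stand-ins):

```python
class BasicOp:
    def __init__(self):
        # Bug: hardcoded to False, so profiling can never be requested by any op
        self.require_profiling = False


class Backend:
    def __init__(self, enable_profiling):
        self.enable_profiling = enable_profiling

    def should_profile(self, op_instance):
        # Always False while require_profiling is hardcoded to False,
        # no matter what enable_profiling is set to
        return self.enable_profiling and op_instance.require_profiling


backend = Backend(enable_profiling=True)
print(backend.should_profile(BasicOp()))  # False even with profiling enabled
```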

Bug 2: incorrect timing statistics for ops that call the same kernel multiple times per iteration

The existing filter assumes each core_run() invokes a given kernel name exactly once:

if len(kernel_latency_list[kernel]) != prefer_iterations:
    removed_keys.append(kernel)
average_latency += sum(...)  # outside the if, so removed kernels are still counted

For ops like MoeQuantGroupGemmOp (which loop over N experts, so each core_run() produces N × prefer_iterations records under the same kernel name), these kernels are wrongly filtered out, and average_latency still counts them because the accumulation sits outside the if.

For example, moe_quant_group_gemm in workloads/llm/single_test_ops/gemm_ops.json (ep_size=8, num_experts=128 → 16 experts/rank → 16 torch.matmul calls per core_run()).
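A simplified reproduction of the mis-filtering (the latency values and calls_per_iter here are made-up numbers; in the real backend the lists hold kernel latencies collected by the profiler):

```python
prefer_iterations = 4
calls_per_iter = 2  # e.g. an op that launches the same kernel twice per core_run()

# 2 calls/iter × 4 iterations = 8 records accumulated under one kernel name
kernel_latency_list = {"matmul_kernel": [1.0] * (calls_per_iter * prefer_iterations)}

removed_keys = []
average_latency = 0.0
for kernel, latencies in kernel_latency_list.items():
    # Old check: 8 != 4, so a perfectly valid kernel is flagged for removal ...
    if len(latencies) != prefer_iterations:
        removed_keys.append(kernel)
    # ... yet its latency is still accumulated, because this line sits outside the if
    average_latency += sum(latencies)

print(removed_keys)     # ['matmul_kernel']
print(average_latency)  # 8.0
```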

Fix

micro_perf/core/op.py: change the default of require_profiling from False to True.

micro_perf/backends/GPU/backend_gpu.py:

  • Change the filter condition to count % prefer_iterations != 0, supporting kernels invoked multiple times per iteration.
  • Move the latency accumulation into the branch that passes the filter, so removed kernels are no longer counted.
  • Scale the slice offset by calls_per_iter × iters_offset, ensuring only the second half of the iterations is used for statistics.
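The three fixes combine into roughly the following aggregation loop. This is a sketch, not the actual patch: iters_offset and take_iters follow the names used later in the review thread, and the sample latency values are invented.

```python
prefer_iterations = 4
iters_offset = prefer_iterations // 2            # skip the first half as warmup
take_iters = prefer_iterations - iters_offset    # iterations kept for statistics

kernel_latency_list = {
    "matmul_kernel": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0],  # 2 calls/iter × 4 iters
    "stray_kernel": [1.0, 1.0, 1.0],  # count not a multiple of prefer_iterations
}

removed_keys = []
average_latency = 0.0
for kernel, latencies in kernel_latency_list.items():
    # New check: allow N calls per iteration, reject only non-multiples
    if len(latencies) % prefer_iterations != 0:
        removed_keys.append(kernel)
        continue  # accumulation now happens only in the passing branch
    calls_per_iter = len(latencies) // prefer_iterations
    # Slice offset scaled by calls_per_iter: keep only the last take_iters iterations
    average_latency += sum(latencies[calls_per_iter * iters_offset:])

average_latency /= take_iters
print(removed_keys)     # ['stray_kernel']
print(average_latency)  # (3+3+4+4) / 2 = 7.0
```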

Changed files

File                                   | Changes
micro_perf/core/op.py                  | require_profiling default changed from False to True
micro_perf/backends/GPU/backend_gpu.py | fixed kernel latency filtering and aggregation logic

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Chen Zhang does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@testman0001

Line 195: average_latency /= take_iters still seems wrong — this is the latency over all iterations, so it should be average_latency /= prefer_iterations

@OOOOOChen
Author

OOOOOChen commented Apr 17, 2026

> Line 195: average_latency /= take_iters still seems wrong — this is the latency over all iterations, so it should be average_latency /= prefer_iterations

Per the original logic, only the data from the last take_iters of the prefer_iterations iterations should be used in the calculation, so dividing by take_iters is correct.
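The author's point can be checked numerically. A toy example, assuming one kernel call per iteration and invented latency values where the warmup iterations are slower:

```python
prefer_iterations = 4
iters_offset = prefer_iterations // 2
take_iters = prefer_iterations - iters_offset  # 2

latencies = [9.0, 9.0, 2.0, 2.0]  # first half is warmup and runs slower
kept = latencies[iters_offset:]   # only the last take_iters iterations: [2.0, 2.0]

# Dividing the kept sum by take_iters yields the per-iteration latency of the
# kept window; dividing by prefer_iterations would halve it incorrectly.
print(sum(kept) / take_iters)          # 2.0 — matches the kept iterations
print(sum(kept) / prefer_iterations)   # 1.0 — underestimates by half
```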

@testman0001

> Line 195: average_latency /= take_iters still seems wrong — this is the latency over all iterations, so it should be average_latency /= prefer_iterations

> Per the original logic, only the data from the last take_iters of the prefer_iterations iterations should be used in the calculation, so dividing by take_iters is correct.

Ah, I see — your offset also skips the first half of the data. I raised the warmup issue in my earlier issue; please take a look.



Development

Successfully merging this pull request may close these issues.

[micro_perf] GPU profiler fails to start & incorrect timing statistics for multi-kernel ops

3 participants