
fix(micro_perf): Fix GPU profiler never starting and incorrect timing statistics for multi-kernel ops #191

Open
OOOOOChen wants to merge 1 commit into bytedance:main from OOOOOChen:fix/micro-perf-profiling

Conversation

@OOOOOChen

Related issue

Closes #190

Problems

Bug 1: the GPU profiler can never start

BasicOp.__init__() hardcodes self.require_profiling to False, so in backend.py:

actual_profiling = self.enable_profiling and op_instance.require_profiling  # always False

Regardless of whether enable_profiling is turned on, the profiler path in BackendGPU.core_perf() is never executed.
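The short circuit can be seen in a minimal sketch (class and attribute names follow the PR description; the surrounding backend class and the should_profile helper are simplified stand-ins):

```python
class BasicOp:
    def __init__(self):
        # Bug: hardcoded to False, so profiling can never be requested by any op
        self.require_profiling = False


class Backend:
    def __init__(self, enable_profiling):
        self.enable_profiling = enable_profiling

    def should_profile(self, op_instance):
        # Always False while require_profiling is hardcoded to False,
        # no matter what enable_profiling is set to
        return self.enable_profiling and op_instance.require_profiling


backend = Backend(enable_profiling=True)
print(backend.should_profile(BasicOp()))  # False even with profiling enabled
```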

Bug 2: incorrect timing statistics for ops that call the same kernel multiple times per iteration

The existing filter assumes each core_run() invokes a given kernel name exactly once:

if len(kernel_latency_list[kernel]) != prefer_iterations:
    removed_keys.append(kernel)
average_latency += sum(...)  # outside the if, so removed kernels are still counted

For ops like MoeQuantGroupGemmOp (which loop over N experts, so each core_run() produces N × prefer_iterations records under the same kernel name), these kernels are wrongly filtered out, and average_latency still counts them because the accumulation sits outside the if.

For example, moe_quant_group_gemm in workloads/llm/single_test_ops/gemm_ops.json (ep_size=8, num_experts=128 → 16 experts/rank → 16 torch.matmul calls per core_run()).
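A simplified reproduction of the mis-filtering (the latency values and calls_per_iter here are made-up numbers; in the real backend the lists hold kernel latencies collected by the profiler):

```python
prefer_iterations = 4
calls_per_iter = 2  # e.g. an op that launches the same kernel twice per core_run()

# 2 calls/iter × 4 iterations = 8 records accumulated under one kernel name
kernel_latency_list = {"matmul_kernel": [1.0] * (calls_per_iter * prefer_iterations)}

removed_keys = []
average_latency = 0.0
for kernel, latencies in kernel_latency_list.items():
    # Old check: 8 != 4, so a perfectly valid kernel is flagged for removal ...
    if len(latencies) != prefer_iterations:
        removed_keys.append(kernel)
    # ... yet its latency is still accumulated, because this line sits outside the if
    average_latency += sum(latencies)

print(removed_keys)     # ['matmul_kernel']
print(average_latency)  # 8.0
```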

Fix

micro_perf/core/op.py: change the default of require_profiling from False to True.

micro_perf/backends/GPU/backend_gpu.py:

  • Change the filter condition to count % prefer_iterations != 0, supporting kernels invoked multiple times per iteration.
  • Move the latency accumulation into the branch that passes the filter, so removed kernels are no longer counted.
  • Scale the slice offset by calls_per_iter × iters_offset, ensuring only the second half of the iterations is used for statistics.
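The three fixes combine into roughly the following aggregation loop. This is a sketch, not the actual patch: iters_offset and take_iters follow the names used later in the review thread, and the sample latency values are invented.

```python
prefer_iterations = 4
iters_offset = prefer_iterations // 2            # skip the first half as warmup
take_iters = prefer_iterations - iters_offset    # iterations kept for statistics

kernel_latency_list = {
    "matmul_kernel": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0],  # 2 calls/iter × 4 iters
    "stray_kernel": [1.0, 1.0, 1.0],  # count not a multiple of prefer_iterations
}

removed_keys = []
average_latency = 0.0
for kernel, latencies in kernel_latency_list.items():
    # New check: allow N calls per iteration, reject only non-multiples
    if len(latencies) % prefer_iterations != 0:
        removed_keys.append(kernel)
        continue  # accumulation now happens only in the passing branch
    calls_per_iter = len(latencies) // prefer_iterations
    # Slice offset scaled by calls_per_iter: keep only the last take_iters iterations
    average_latency += sum(latencies[calls_per_iter * iters_offset:])

average_latency /= take_iters
print(removed_keys)     # ['stray_kernel']
print(average_latency)  # (3+3+4+4) / 2 = 7.0
```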

Changed files

File                                   | Changes
micro_perf/core/op.py                  | require_profiling default changed from False to True
micro_perf/backends/GPU/backend_gpu.py | fixed kernel latency filtering and aggregation logic

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Chen Zhang does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@testman0001

Line 195: average_latency /= take_iters still seems wrong — this is the latency over all iterations, so it should be average_latency /= prefer_iterations

@OOOOOChen
Author

OOOOOChen commented Apr 17, 2026

> Line 195: average_latency /= take_iters still seems wrong — this is the latency over all iterations, so it should be average_latency /= prefer_iterations

Per the original logic, only the data from the last take_iters of the prefer_iterations iterations should be used in the calculation, so dividing by take_iters is correct.
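The author's point can be checked numerically. A toy example, assuming one kernel call per iteration and invented latency values where the warmup iterations are slower:

```python
prefer_iterations = 4
iters_offset = prefer_iterations // 2
take_iters = prefer_iterations - iters_offset  # 2

latencies = [9.0, 9.0, 2.0, 2.0]  # first half is warmup and runs slower
kept = latencies[iters_offset:]   # only the last take_iters iterations: [2.0, 2.0]

# Dividing the kept sum by take_iters yields the per-iteration latency of the
# kept window; dividing by prefer_iterations would halve it incorrectly.
print(sum(kept) / take_iters)          # 2.0 — matches the kept iterations
print(sum(kept) / prefer_iterations)   # 1.0 — underestimates by half
```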

@testman0001

> Line 195: average_latency /= take_iters still seems wrong — this is the latency over all iterations, so it should be average_latency /= prefer_iterations

> Per the original logic, only the data from the last take_iters of the prefer_iterations iterations should be used in the calculation, so dividing by take_iters is correct.

Ah, I see — your offset also skips the first half of the data. I raised the warmup issue in my earlier issue; please take a look.



Development

Successfully merging this pull request may close these issues.

[micro_perf] GPU profiler fails to start & incorrect timing statistics for multi-kernel ops

3 participants