fix(micro_perf): Fix GPU profiler never starting and incorrect latency statistics for multi-kernel ops #191
Open
OOOOOChen wants to merge 1 commit into
Conversation
Chen Zhang seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Line 195: `average_latency /= take_iters` still looks wrong — this is the latency accumulated over all iterations, so it should be `average_latency /= prefer_iterations`.
Author
Per the original logic, only the last `take_iters` of the `prefer_iterations` iterations should be used in the calculation, so it should be `take_iters`.
Ah, I see — your offset here also skips the first half of the data. I mentioned the warmup problem in my earlier issue; please take a look.
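The averaging being discussed can be sketched as follows. This is a minimal illustration, not the patch's actual code: `average_tail_latency` is a hypothetical helper, and the warmup-skipping offset mirrors the behavior described in the comments above.

```python
# Hypothetical sketch: out of `prefer_iterations` recorded latencies,
# skip the leading warmup samples and average only the last
# `take_iters` iterations, dividing by take_iters (not prefer_iterations).
def average_tail_latency(latencies, prefer_iterations, take_iters):
    assert len(latencies) == prefer_iterations
    # offset skips the warmup portion; only the tail window is averaged
    tail = latencies[prefer_iterations - take_iters:]
    return sum(tail) / take_iters
```

For example, with 5 warmup samples of 10.0 ms followed by 5 steady-state samples of 2.0 ms, the helper reports 2.0 ms rather than the warmup-inflated mean.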
Related Issue
Closes #190
Problems
Bug 1: The GPU profiler can never start
`BasicOp.__init__()` hardcodes `self.require_profiling` to `False`, so regardless of whether `enable_profiling` is set in backend.py, the profiler path in `BackendGPU.core_perf()` is never executed.
Bug 2: Incorrect latency statistics for ops that call the same kernel multiple times per iteration
The statistics logic assumed each `core_run()` invokes a given kernel name only once. Ops like `MoeQuantGroupGemmOp` loop over N experts, so each `core_run()` produces N × prefer_iterations records with the same kernel name; those kernels were wrongly filtered out, and `average_latency` was double-counted because the accumulation sits outside the `if` block.
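A toy illustration of the filtering bug described above (the variable names and the `count == prefer_iterations` check are assumptions for illustration, not the backend's actual code):

```python
# Illustration of the bug: if the profiler filter assumes each
# core_run() emits a kernel name exactly once per iteration, a kernel
# invoked `calls_per_iter` times per run accumulates
# calls_per_iter * prefer_iterations records and fails the
# count == prefer_iterations check, so it is dropped from the stats.
from collections import Counter

prefer_iterations = 4
calls_per_iter = 16  # e.g. 16 experts/rank -> 16 matmul launches per run
records = ["matmul_kernel"] * calls_per_iter * prefer_iterations

counts = Counter(records)
# Buggy filter: keeps only kernels seen exactly once per iteration.
kept = [name for name, c in counts.items() if c == prefer_iterations]
# "matmul_kernel" has 64 records, not 4, so it is wrongly filtered out
# and `kept` ends up empty.
```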
Example: `moe_quant_group_gemm` in workloads/llm/single_test_ops/gemm_ops.json (ep_size=8, num_experts=128 → 16 experts/rank → 16 `torch.matmul` calls per `core_run()`).
Fix
micro_perf/core/op.py: change the default of `require_profiling` from `False` to `True`.
micro_perf/backends/GPU/backend_gpu.py: relax the kernel filter to skip only records where `count % prefer_iterations != 0`, supporting ops that call the same kernel multiple times per iteration; shift the slice offset by `calls_per_iter × iters_offset` to ensure only the second-half measurement window is counted.
Changed Files
micro_perf/core/op.py: `require_profiling` default `False` → `True`
micro_perf/backends/GPU/backend_gpu.py
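The fixed filtering and slicing might look roughly like this. This is a sketch under assumed names (`per_kernel_average` is hypothetical), intended only to show how the `count % prefer_iterations` check and the `calls_per_iter × iters_offset` slice interact:

```python
# Sketch of the fixed logic: keep any kernel whose record count is a
# whole multiple of prefer_iterations, then shift the slice start by
# calls_per_iter * iters_offset so only the post-warmup window of
# take_iters iterations is averaged.
def per_kernel_average(latencies, prefer_iterations, iters_offset, take_iters):
    count = len(latencies)
    if count % prefer_iterations != 0:
        return None  # irregular kernel: skip, as the old filter did
    calls_per_iter = count // prefer_iterations
    start = calls_per_iter * iters_offset          # skip warmup iterations
    window = latencies[start:start + calls_per_iter * take_iters]
    # Per-iteration latency = sum of all of this kernel's calls in an
    # iteration, averaged over the take_iters measured iterations.
    return sum(window) / take_iters
```

With 2 calls per iteration over 4 iterations (latencies `[9, 9, 9, 9, 1, 1, 1, 1]`), an offset of 2 and `take_iters = 2` average only the last four records, giving 2.0 per iteration instead of a warmup-skewed value.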