[micro_perf] GPU profiler never starts & incorrect latency statistics for multi-kernel operators

## Problem description

The GPU backend of `micro_perf` has two bugs that make profiling results incorrect.

---

### Bug 1: The GPU profiler never starts

**File**: `micro_perf/core/op.py`

`BasicOp.__init__()` hardcodes `self.require_profiling` to `False`, so the following check in `backend.py` can never pass:

```python
actual_profiling = self.enable_profiling and op_instance.require_profiling  # always False
```

Regardless of whether `enable_profiling` is turned on, the profiler path in `BackendGPU.core_perf()` is never executed.
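The gating behavior can be reproduced in isolation. The sketch below is only an illustration: `FakeOp` and `should_profile` are hypothetical stand-ins, not names from the repository, and only the `and` expression mirrors the check quoted above.

```python
class FakeOp:
    """Hypothetical stand-in for an operator object (not the real BasicOp)."""

    def __init__(self, require_profiling=False):
        # The default mirrors the hardcoded False described in this issue.
        self.require_profiling = require_profiling


def should_profile(enable_profiling, op_instance):
    # Same shape as the check quoted from backend.py: both flags must be True.
    return enable_profiling and op_instance.require_profiling


# With the flag stuck at False, enabling profiling has no effect.
print(should_profile(True, FakeOp()))
# Once the op reports require_profiling=True, the profiler path becomes reachable.
print(should_profile(True, FakeOp(require_profiling=True)))
```

Because the operator-side flag is the second operand of `and`, no backend setting can override it while it stays `False`.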

---

### Bug 2: Incorrect latency statistics for operators that launch the same kernel multiple times per iteration

**File**: `micro_perf/backends/GPU/backend_gpu.py`

The current kernel-filtering logic:

```python
if len(kernel_latency_list[kernel]) != prefer_iterations:
    removed_keys.append(kernel)
average_latency += sum(kernel_latency_list[kernel][iters_offset:])  # outside the if, so removed kernels are still counted
```

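The existing logic assumes that each `core_run()` launches a given kernel name only once. For operators such as `MoeQuantGroupGemmOp`, however, each `core_run()` loops over N experts and calls `torch.matmul` for each one, producing `N × prefer_iterations` records with the same kernel name. These kernels are incorrectly filtered out (their latency is reported as 0), and because the `average_latency` accumulation sits outside the `if`, kernels marked for removal are still added to the total.

**Reproducible scenario**: `moe_quant_group_gemm` in `workloads/llm/single_test_ops/gemm_ops.json` (`ep_size=8, num_experts=128` → 16 experts per rank → 16 `torch.matmul` calls per `core_run()`).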
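The arithmetic behind this scenario can be checked with a few lines. This is a rough sketch: `prefer_iterations = 10` is an assumed value chosen for illustration, and the variable names are hypothetical.

```python
num_experts = 128
ep_size = 8
prefer_iterations = 10  # assumed value, for illustration only

# Each rank handles an equal share of the experts.
experts_per_rank = num_experts // ep_size

# One same-named kernel record per expert per iteration.
record_count = experts_per_rank * prefer_iterations

# Equality check: rejects any kernel launched more than once per iteration.
dropped_by_equality_check = record_count != prefer_iterations

# Modulo check: accepts any whole number of launches per iteration.
dropped_by_modulo_check = record_count % prefer_iterations != 0

print(experts_per_rank, record_count)
print(dropped_by_equality_check, dropped_by_modulo_check)
```

Under these assumptions the equality check drops the kernel while the modulo check keeps it.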

---

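## Expected fix

1. Change the default value of `require_profiling` to `True` so that the profiler can run.
2. Change the kernel filter condition to `count % prefer_iterations != 0` to support operators that launch a kernel several times per iteration, and move the latency accumulation into the branch that passes the filter.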
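A minimal sketch of the second fix, under stated assumptions: `summarize_kernel_latencies` is a hypothetical helper rather than a function in `backend_gpu.py`, warm-up handling (`iters_offset`) is omitted, and dividing the total by `prefer_iterations` is an assumed way of obtaining a per-iteration figure.

```python
def summarize_kernel_latencies(kernel_latency_list, prefer_iterations):
    """Return (per-iteration latency, removed kernel names).

    kernel_latency_list maps a kernel name to its recorded latencies.
    """
    removed_keys = []
    total_latency = 0.0
    for kernel, latencies in kernel_latency_list.items():
        count = len(latencies)
        if count == 0 or count % prefer_iterations != 0:
            # Not a whole number of launches per iteration: drop it.
            removed_keys.append(kernel)
        else:
            # Accumulate only for kernels that pass the filter.
            total_latency += sum(latencies)
    return total_latency / prefer_iterations, removed_keys


# 16 launches per iteration over 2 iterations, plus a stray kernel seen 3 times.
records = {"gemm_kernel": [1.0] * 32, "stray_kernel": [5.0] * 3}
per_iteration, removed = summarize_kernel_latencies(records, prefer_iterations=2)
print(per_iteration, removed)
```

Placing the accumulation in the `else` branch ensures a kernel listed in `removed_keys` never contributes to the total.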
