A Claude Code superpowers skill that analyzes and optimizes C++ code for target platforms (ARM/X86). It operates as a 7-stage pipeline: static analysis, optional instrumentation profiling, performance report, benchmark generation, cross-compilation with disassembly analysis, remote execution, and data-driven optimization.
graph TB
subgraph Input["Stage 1: Input Parsing"]
I1[Code Snippet] --> PARSE
I2[Git Diff] --> PARSE
I3[File / Function] --> PARSE
PARSE[Parse & Expand Context]
end
subgraph Analysis["Stage 2: Static Analysis"]
PARSE --> L1[Algorithm Layer]
PARSE --> L2[Language Layer]
PARSE --> L3[Microarchitecture Layer]
PARSE --> L4[System Layer]
L1 & L2 & L3 & L4 --> KB["Knowledge Base\n56 patterns + 25+ libraries"]
KB --> SCORE["Score Issues\n(cycle estimation + sanity checks)"]
PROFILE[("Platform Profile\n(YAML, cycles)")] -.-> SCORE
end
subgraph Instrument["Stage 2.5: Instrumentation (Optional)"]
SCORE -->|LOW confidence| PROBE["Insert TLS Probes\n(L1→L2→L3 iterative)"]
PROBE --> XCOMP1["Cross-Compile\n(host)"]
XCOMP1 --> RUN1["Run on Target\n(SSH)"]
RUN1 --> HOTSPOT["Hotspot Report\n(cycle breakdown)"]
HOTSPOT -->|drill down| PROBE
end
subgraph Report["Stage 3: Performance Report"]
SCORE -->|HIGH confidence| RPT
HOTSPOT --> RPT["Graded Report\n🔴 High 🟡 Medium 🟢 Low"]
RPT -->|user selects issues| SEL["Selected Issues"]
RPT -->|"user says 'stop'"| DONE_RPT(("Done\n(report only)"))
end
subgraph Benchmark["Stage 4: Benchmark & Baseline"]
SEL --> GEN["Generate Benchmark\n(from template)"]
GEN --> XCOMP2["Cross-Compile\n(host)"]
XCOMP2 --> DISASM["Disassemble\n(verify compiler output)"]
DISASM -->|"contradicts analysis"| RETRACT["Retract Issue"]
DISASM -->|"confirms"| UPLOAD["SCP to Target"]
UPLOAD --> EXEC["Execute & Collect\n(JSON stats)"]
EXEC --> BASELINE["Baseline Data\nmedian / p99 / stddev"]
end
subgraph Optimize["Stage 5: Optimize & Verify"]
BASELINE --> OPTGEN["Generate\nOptimized Code"]
OPTGEN --> CORRECT["Correctness Check\n(baseline vs optimized)"]
CORRECT -->|mismatch| OPTGEN
CORRECT -->|pass| XCOMP3["Cross-Compile\nOptimized"]
XCOMP3 --> DISASM2["Disassemble\n(confirm expected instructions)"]
DISASM2 --> EXEC2["Execute on Target"]
EXEC2 --> CMP["Comparison Report\nbaseline vs optimized\nspeedup + correctness"]
end
subgraph Iterate["Stage 6: Iteration"]
CMP -->|"< 1.0x regression"| REVERT["REVERT\n(negative result = data)"]
CMP -->|"1.0x-1.2x"| ACCEPT(("Accept\n(good enough)"))
CMP -->|"> 1.2x"| DELIVER(("Deliver\nOptimized Code"))
CMP -->|"user: retry"| ALT["Try Alternative\nStrategy"]
ALT --> OPTGEN
end
subgraph Data["Data Sources"]
direction LR
PROF_YAML[("profiles/*.yaml\nCortex-A78, A55\nNeoverse-N1, Skylake")]
PATTERNS[("knowledge/patterns/\nvectorization, memory\nbranching, compute\nsystem")]
LIBS[("knowledge/libraries.yaml\n25+ alternatives")]
PROFILER[("profiler/\nC++ hardware\nmeasurement tool")]
PROFILER -->|generates| PROF_YAML
end
PROF_YAML -.-> PROFILE
PATTERNS -.-> KB
LIBS -.-> KB
classDef stage fill:#1a1a2e,stroke:#e94560,color:#fff
classDef data fill:#0f3460,stroke:#16213e,color:#fff
classDef decision fill:#533483,stroke:#e94560,color:#fff
classDef endpoint fill:#2d6a4f,stroke:#40916c,color:#fff
class PARSE,SCORE,PROBE,XCOMP1,RUN1,HOTSPOT,GEN,XCOMP2,DISASM,UPLOAD,EXEC,OPTGEN,CORRECT,XCOMP3,DISASM2,EXEC2 stage
class PROF_YAML,PATTERNS,LIBS,PROFILER data
class RPT,CMP decision
class DONE_RPT,ACCEPT,DELIVER endpoint
Give it C++ code (snippet, git diff, or file reference) and a target platform. It will:
- Analyze — scan for performance issues across 4 layers (algorithm, language, microarchitecture, system)
- Instrument (optional) — insert lightweight TLS-based timing probes to measure actual hotspots
- Report — grade issues by estimated impact (HIGH/MEDIUM/LOW) with cycle-level estimates
- Benchmark — generate standalone benchmarks, cross-compile, disassemble to verify compiler output
- Optimize — generate optimized code, verify correctness, measure speedup on target hardware
- Iterate — try alternative strategies with clear stopping rules
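The pipeline diagram labels the baseline data as median / p99 / stddev. As a rough illustration of how raw timing samples could be reduced to those three numbers, here is a minimal sketch. The `Summary` struct and `summarize` function are hypothetical names, not taken from the skill's templates, and the p99 uses the nearest-rank method, which is only one of several common percentile definitions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary of one benchmark run; field names are illustrative.
struct Summary {
    double median;
    double p99;
    double stddev;
};

// Reduce per-iteration timings (any unit) to median, p99 and stddev.
// The vector is taken by value because it has to be sorted.
Summary summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();

    // Median: middle element, or the mean of the two middle elements.
    const double median = (n % 2 == 1)
        ? samples[n / 2]
        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);

    // p99 by nearest rank: the sample at 1-based rank ceil(0.99 * n).
    const std::size_t rank =
        static_cast<std::size_t>(std::ceil(0.99 * static_cast<double>(n)));
    const double p99 = samples[rank - 1];

    // Population standard deviation around the mean.
    double mean = 0.0;
    for (double s : samples) mean += s;
    mean /= static_cast<double>(n);
    double var = 0.0;
    for (double s : samples) var += (s - mean) * (s - mean);
    var /= static_cast<double>(n);

    return {median, p99, std::sqrt(var)};
}
```

The median is usually the more robust figure to compare between baseline and optimized runs, since a few slow outliers (interrupts, frequency changes) shift the mean far more than the median.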
# In a Claude Code session with this skill installed:
> Optimize the performance of my_code.cpp for ARM Cortex-A78
The skill will guide you through the full pipeline interactively.
skills/cpp-perf/
├── SKILL.md # Trigger metadata
├── cpp-perf.md # Pipeline instructions (7 stages)
├── templates/
│ ├── benchmark.cpp.tmpl # Benchmark harness (steady_clock, JSON, DoNotOptimize)
│ ├── correctness.cpp.tmpl # Optimization correctness verifier
│ └── cpp_perf_probe.h # Instrumentation probe (TLS ring buffer, ns timing)
├── profiles/ # Platform performance profiles (cycles)
│ ├── cortex-a78.yaml
│ ├── cortex-a55.yaml
│ ├── neoverse-n1.yaml
│ └── x86-skylake.yaml
├── knowledge/
│ ├── libraries.yaml # 25+ high-perf library alternatives
│ └── patterns/ # 56 optimization patterns from references
│ ├── vectorization/ # Auto-vectorization blockers, NEON idioms, SVE
│ ├── memory/ # AoS→SoA, loop tiling, prefetch, false sharing
│ ├── branching/ # Branch→cmov (with counter-examples), lookup tables
│ ├── compute/ # Dependency chains, FMA, strength reduction
│ └── system/ # Huge pages, alignment
└── profiler/ # C++ hardware profiler
├── CMakeLists.txt
├── common.h # Timing, calibration, SIGILL fault tolerance
├── main.cpp # CLI entry point
├── output.cpp # Structured YAML output + CPU model detection
├── measure_compute.cpp # 34 instruction measurements (int/fp/SIMD/LSE/crypto)
├── measure_cache.cpp # Cache hierarchy detection via pointer chasing
├── measure_memory.cpp # Bandwidth, TLB miss penalty
├── measure_branch.cpp # Branch misprediction penalty
├── measure_os.cpp # Syscall, thread, fork, synchronization primitives
├── measure_alloc.cpp # malloc, mmap, page faults
├── measure_io.cpp # File I/O (open/close, read/write, fsync)
└── measure_ipc.cpp # Pipe, eventfd, signal, scheduling
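The tree above describes cpp_perf_probe.h as an instrumentation probe built on a TLS ring buffer with nanosecond timing. The sketch below shows one way such a probe could look. It is not the contents of the real header: `ProbeRecord`, `ProbeRing`, `thread_ring`, `ScopedProbe` and the capacity of 1024 are all assumptions made for illustration.

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Assumed buffer size; the real header may use a different capacity.
constexpr std::size_t kProbeCapacity = 1024;

// Hypothetical record layout: a static label plus elapsed nanoseconds.
struct ProbeRecord {
    const char* label;
    std::int64_t ns;
};

// Fixed-size ring buffer: once full, new records overwrite the oldest.
class ProbeRing {
public:
    void push(const char* label, std::int64_t ns) {
        records_[head_ % kProbeCapacity] = {label, ns};
        ++head_;
    }
    // Records currently held (saturates at kProbeCapacity).
    std::size_t size() const {
        if (head_ < kProbeCapacity) return head_;
        return kProbeCapacity;
    }
    // Records ever pushed, including overwritten ones.
    std::size_t total() const { return head_; }
    // i = 0 is the oldest record still in the buffer.
    const ProbeRecord& at(std::size_t i) const {
        const std::size_t first =
            head_ < kProbeCapacity ? 0 : head_ - kProbeCapacity;
        return records_[(first + i) % kProbeCapacity];
    }
private:
    std::array<ProbeRecord, kProbeCapacity> records_{};
    std::size_t head_ = 0;
};

// One buffer per thread, so recording needs no locks or atomics.
inline ProbeRing& thread_ring() {
    thread_local ProbeRing ring;
    return ring;
}

// RAII scope timer: records elapsed time when it goes out of scope.
class ScopedProbe {
public:
    explicit ScopedProbe(const char* label)
        : label_(label), start_(std::chrono::steady_clock::now()) {}
    ~ScopedProbe() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        thread_ring().push(
            label_,
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count());
    }
    ScopedProbe(const ScopedProbe&) = delete;
    ScopedProbe& operator=(const ScopedProbe&) = delete;
private:
    const char* label_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping a region in `{ ScopedProbe p("inner_loop"); /* work */ }` would append one record to the calling thread's buffer. Keeping the buffer thread-local is what makes the probe cheap: the hot path is two clock reads and one array store, with no synchronization.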
Generates a platform performance profile by measuring actual hardware characteristics.
# Build (requires C++17)
cd skills/cpp-perf/profiler
mkdir build && cd build
cmake .. && make -j4
# Run (on target platform)
./profiler > my-board.yaml 2>progress.log
# Run specific measurements only
./profiler compute cache branch
Output is a YAML file compatible with the profiles/ schema. Supports:
- ARM aarch64: NEON, DotProd, FP16, LSE atomics, CRC32, AES (with SIGILL fallback for unsupported extensions)
- x86_64: SSE, AVX (FMA if available)
- macOS Apple Silicon: full support via mach_absolute_time() plus frequency calibration
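The profiler's measure_cache.cpp is described as detecting the cache hierarchy via pointer chasing. For readers unfamiliar with the technique, the following is a minimal sketch of the idea, not the profiler's actual code: `make_chain`, `cycle_length` and `chase_ns_per_load` are hypothetical names. The chain is a random single-cycle permutation (built with Sattolo's algorithm), so every load depends on the previous one and hardware prefetchers get little to work with.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a random permutation that forms one single cycle (Sattolo's
// algorithm): following next[i] from slot 0 visits every slot once.
std::vector<std::size_t> make_chain(std::size_t n, unsigned seed) {
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937 rng(seed);
    for (std::size_t i = n; i > 1; --i) {
        // Swap slot i-1 with a strictly earlier slot.
        std::uniform_int_distribution<std::size_t> pick(0, i - 2);
        std::swap(next[i - 1], next[pick(rng)]);
    }
    return next;
}

// Sanity check: hops needed to get back to slot 0 (n for a single cycle).
std::size_t cycle_length(const std::vector<std::size_t>& next) {
    if (next.empty()) return 0;
    std::size_t idx = next[0];
    std::size_t hops = 1;
    while (idx != 0) {
        idx = next[idx];
        ++hops;
    }
    return hops;
}

// Average nanoseconds per dependent load over `steps` hops.
double chase_ns_per_load(const std::vector<std::size_t>& next,
                         std::size_t steps) {
    assert(!next.empty() && steps > 0);
    std::size_t idx = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s) {
        idx = next[idx];  // each load depends on the previous one
    }
    const auto t1 = std::chrono::steady_clock::now();
    // Publish the result so the compiler cannot discard the loop.
    volatile std::size_t sink = idx;
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() /
           static_cast<double>(steps);
}
```

Running `chase_ns_per_load` over chains of increasing size (for example from a few KiB up to tens of MiB) and plotting the result typically shows steps in latency where the working set stops fitting in each cache level. A real measurement would also convert nanoseconds to cycles and repeat runs to reduce noise.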
56 optimization patterns extracted from professional references, organized across 7 categories:
| Category | Patterns |
|---|---|
| vectorization | Auto-vectorization blockers, NEON idioms, SVE patterns, FP16, DotProd |
| memory | AoS→SoA, loop tiling, prefetch, false sharing, NUMA, huge pages, alignment |
| branching | Branch→cmov (with counter-examples), branchless lookup tables, indirect dispatch |
| compute | Dependency chains, FMA utilization, strength reduction, reciprocal tricks |
| system | Syscall batching, huge pages, alignment, NUMA-aware allocation |
| concurrency | Amdahl/USL analysis, dynamic scheduling, lock contention, thread pools, bandwidth saturation |
| libraries | 25+ high-performance library alternatives (BLAS, SIMD wrappers, allocators, I/O) |
Sources: perf-book, perf-ninja, ComputeLibrary, optimized-routines, Cpp-High-Performance
Each pattern includes: problem description, detection method, before/after code, expected impact, and caveats (including when the optimization can make things worse).
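To give a sense of what a before/after pair looks like, here is a small sketch of the AoS→SoA pattern listed under the memory category. It is an illustration written for this README, not an excerpt from knowledge/patterns/; the `ParticleAoS` and `ParticlesSoA` types are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Before: array of structures. Summing x also pulls y, z and mass
// through the cache, so only a quarter of each cache line is useful.
struct ParticleAoS {
    float x, y, z, mass;
};

float sum_x_aos(const std::vector<ParticleAoS>& ps) {
    float sum = 0.0f;
    for (const ParticleAoS& p : ps) sum += p.x;
    return sum;
}

// After: structure of arrays. The x values are contiguous, so every
// byte fetched is used and the loop reads unit-stride memory.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

float sum_x_soa(const ParticlesSoA& ps) {
    float sum = 0.0f;
    for (float v : ps.x) sum += v;
    return sum;
}

// Conversion helper, useful for checking both layouts agree.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& ps) {
    ParticlesSoA out;
    out.x.reserve(ps.size());
    out.y.reserve(ps.size());
    out.z.reserve(ps.size());
    out.mass.reserve(ps.size());
    for (const ParticleAoS& p : ps) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
        out.mass.push_back(p.mass);
    }
    return out;
}
```

The caveat for this pattern runs the other way: when a loop touches all fields of each element together, the AoS layout keeps them on one cache line and the SoA version can be slower, which is why measuring on the target matters.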
- Cycle-based estimation with sanity checks — prevents over-confident recommendations (learned from a Game of Life case where "branchless optimization" caused a 3.2x regression)
- Cross-compilation on host, execution on target — the skill compiles on your dev machine and runs benchmarks on the ARM board via SSH
- Disassembly verification — always checks compiler output before claiming an optimization works
- Correctness-first — verifies optimized code matches baseline output before reporting speedup
- Iterative with stopping rules — regressions are immediately reverted; <1.2x gains are accepted as "good enough"
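The stopping rules can be written down as a tiny decision function. The sketch below follows the thresholds shown in the pipeline diagram (below 1.0x revert, 1.0x to 1.2x accept, above 1.2x deliver). The names `Verdict`, `classify` and `speedup` are hypothetical, and defining speedup as baseline median divided by optimized median is an assumption, since the README does not spell out the formula.

```cpp
#include <cassert>

enum class Verdict { Revert, Accept, Deliver };

// Assumed definition: ratio of median times, so values below 1.0
// mean the "optimized" version is slower than the baseline.
double speedup(double baseline_median, double optimized_median) {
    assert(baseline_median > 0.0 && optimized_median > 0.0);
    return baseline_median / optimized_median;
}

// Map a measured speedup onto the three outcomes in the diagram.
// Exactly 1.2x is treated as Accept, following the "> 1.2x" label.
Verdict classify(double s) {
    if (s < 1.0) return Verdict::Revert;   // regression: undo the change
    if (s <= 1.2) return Verdict::Accept;  // small gain: good enough, stop
    return Verdict::Deliver;               // clear win: ship optimized code
}
```

For example, a baseline median of 300 ns against an optimized median of 200 ns gives a speedup of 1.5 and a `Deliver` verdict, while 200 ns against 300 ns gives roughly 0.67 and a `Revert`.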
On first use, the skill guides you through creating cpp-perf-platform.yaml:
platforms:
my-arm-board:
compiler: aarch64-linux-gnu-g++
compiler_flags: "-O2 -march=armv8.2-a"
sysroot: /opt/arm-sysroot # optional
host: 192.168.1.100
port: 22
user: dev
arch: aarch64
work_dir: /tmp/cpp-perf
profile: cortex-a78
- Design Spec: full architecture and pipeline design
- Instrumentation Spec: TLS probe infrastructure design
- Plan 1: Core Skill
- Plan 2: Knowledge Base
- Plan 3: Profiler
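A Claude Code superpowers skill that automatically analyzes and optimizes the performance of C++ code on target platforms (ARM/X86). It runs as a 7-stage pipeline: static analysis, optional instrumentation measurement, performance report, benchmark generation, cross-compilation with disassembly analysis, remote execution, and data-driven optimization.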
graph TB
subgraph Input["Stage 1: Input Parsing"]
I1[Code Snippet] --> PARSE
I2[Git Diff] --> PARSE
I3[File / Function Reference] --> PARSE
PARSE[Parse & Expand Context]
end
subgraph Analysis["Stage 2: Static Analysis"]
PARSE --> L1[Algorithm Layer]
PARSE --> L2[Language Layer]
PARSE --> L3[Microarchitecture Layer]
PARSE --> L4[System Layer]
L1 & L2 & L3 & L4 --> KB["Knowledge Base\n56 optimization patterns + 25+ library alternatives"]
KB --> SCORE["Score Issues\n(cycle estimation + triple check)"]
PROFILE[("Platform Profile\n(YAML, cycles)")] -.-> SCORE
end
subgraph Instrument["Stage 2.5: Instrumentation (Optional)"]
SCORE -->|low confidence| PROBE["Insert TLS Probes\n(L1→L2→L3 iterative)"]
PROBE --> XCOMP1["Cross-Compile\n(host)"]
XCOMP1 --> RUN1["Run on Target\n(SSH)"]
RUN1 --> HOTSPOT["Hotspot Report\n(cycle breakdown)"]
HOTSPOT -->|drill down further| PROBE
end
subgraph Report["Stage 3: Performance Report"]
SCORE -->|high confidence| RPT
HOTSPOT --> RPT["Graded Report\n🔴 High impact 🟡 Medium impact 🟢 Low impact"]
RPT -->|user selects issues to optimize| SEL["Selected Issues"]
RPT -->|"user says 'stop'"| DONE_RPT(("Done\n(report only)"))
end
subgraph Benchmark["Stage 4: Benchmark & Baseline"]
SEL --> GEN["Generate Benchmark\n(from template)"]
GEN --> XCOMP2["Cross-Compile\n(host)"]
XCOMP2 --> DISASM["Disassemble\n(verify compiler output)"]
DISASM -->|"contradicts analysis"| RETRACT["Retract Issue"]
DISASM -->|"confirms"| UPLOAD["SCP Upload"]
UPLOAD --> EXEC["Execute & Collect\n(JSON stats)"]
EXEC --> BASELINE["Baseline Data\nmedian / p99 / stddev"]
end
subgraph Optimize["Stage 5: Optimize & Verify"]
BASELINE --> OPTGEN["Generate\nOptimized Code"]
OPTGEN --> CORRECT["Correctness Check\n(baseline vs optimized)"]
CORRECT -->|mismatch| OPTGEN
CORRECT -->|pass| XCOMP3["Cross-Compile\nOptimized"]
XCOMP3 --> DISASM2["Disassemble\n(confirm expected instructions)"]
DISASM2 --> EXEC2["Execute on Target"]
EXEC2 --> CMP["Comparison Report\nbaseline vs optimized\nspeedup + correctness"]
end
subgraph Iterate["Stage 6: Iteration"]
CMP -->|"< 1.0x regression"| REVERT["Revert\n(a negative result is still data)"]
CMP -->|"1.0x-1.2x"| ACCEPT(("Accept as-is\n(good enough)"))
CMP -->|"> 1.2x"| DELIVER(("Deliver\nOptimized Code"))
CMP -->|"user requests retry"| ALT["Try Alternative Strategy"]
ALT --> OPTGEN
end
subgraph Data["Data Sources"]
direction LR
PROF_YAML[("profiles/*.yaml\nCortex-A78, A55\nNeoverse-N1, Skylake")]
PATTERNS[("knowledge/patterns/\nvectorization, memory\nbranching, compute, system")]
LIBS[("knowledge/libraries.yaml\n25+ alternatives")]
PROFILER[("profiler/\nC++ hardware\nmeasurement tool")]
PROFILER -->|generates| PROF_YAML
end
PROF_YAML -.-> PROFILE
PATTERNS -.-> KB
LIBS -.-> KB
classDef stage fill:#1a1a2e,stroke:#e94560,color:#fff
classDef data fill:#0f3460,stroke:#16213e,color:#fff
classDef decision fill:#533483,stroke:#e94560,color:#fff
classDef endpoint fill:#2d6a4f,stroke:#40916c,color:#fff
class PARSE,SCORE,PROBE,XCOMP1,RUN1,HOTSPOT,GEN,XCOMP2,DISASM,UPLOAD,EXEC,OPTGEN,CORRECT,XCOMP3,DISASM2,EXEC2 stage
class PROF_YAML,PATTERNS,LIBS,PROFILER data
class RPT,CMP decision
class DONE_RPT,ACCEPT,DELIVER endpoint
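Give it a piece of C++ code (a snippet, git diff, or file reference) and a target platform, and it will:
- Analyze: scan for performance issues at 4 levels (algorithm, language features, microarchitecture, system)
- Instrument (optional): insert lightweight TLS timing probes to measure the actual hotspot distribution
- Report: grade issues by estimated impact (high/medium/low), with cycle-level estimates
- Benchmark: generate a standalone benchmark, cross-compile it, and disassemble to verify compiler output
- Optimize: generate optimized code, verify correctness, and measure the speedup on target hardware
- Iterate: try alternative strategies, with clear stopping rules
# In a Claude Code session with this skill installed:
> Optimize the performance of my_code.cpp on ARM Cortex-A78
The skill will guide you through the full pipeline interactively.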
skills/cpp-perf/
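├── SKILL.md # Trigger metadata
├── cpp-perf.md # Pipeline instructions (7 stages)
├── templates/
│ ├── benchmark.cpp.tmpl # Benchmark template (steady_clock, JSON, guard against optimization elimination)
│ ├── correctness.cpp.tmpl # Optimization correctness verification template
│ └── cpp_perf_probe.h # Instrumentation probe (TLS ring buffer, nanosecond timing)
├── profiles/ # Platform performance profiles (in cycles)
│ ├── cortex-a78.yaml
│ ├── cortex-a55.yaml
│ ├── neoverse-n1.yaml
│ └── x86-skylake.yaml
├── knowledge/
│ ├── libraries.yaml # 25+ high-performance library alternatives
│ └── patterns/ # 56 optimization patterns (extracted from professional books)
│ ├── vectorization/ # Auto-vectorization blockers, NEON idioms, SVE
│ ├── memory/ # AoS→SoA, loop tiling, prefetch, false sharing
│ ├── branching/ # Branch→conditional move (with counter-examples), lookup tables
│ ├── compute/ # Dependency chains, FMA, strength reduction
│ └── system/ # Huge pages, alignment
└── profiler/ # C++ hardware performance measurement tool
├── CMakeLists.txt
├── common.h # Timing, calibration, SIGILL fault tolerance
├── main.cpp # CLI entry point
├── output.cpp # Structured YAML output + CPU model detection
├── measure_compute.cpp # 34 instruction measurements (integer/floating-point/SIMD/LSE/crypto)
├── measure_cache.cpp # Cache hierarchy detection (pointer chasing)
├── measure_memory.cpp # Bandwidth, TLB miss penalty
├── measure_branch.cpp # Branch misprediction penalty
├── measure_os.cpp # Syscalls, threads, fork, synchronization primitives
├── measure_alloc.cpp # malloc, mmap, page faults
├── measure_io.cpp # File I/O (open/close, read/write, fsync)
└── measure_ipc.cpp # Pipes, eventfd, signals, scheduling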
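Automatically measures the hardware characteristics of the target platform and generates a performance profile.
# Build (requires C++17)
cd skills/cpp-perf/profiler
mkdir build && cd build
cmake .. && make -j4
# Run on the target platform
./profiler > my-board.yaml 2>progress.log
# Run specific measurements only
./profiler compute cache branch
The output is compatible with the YAML format used in the profiles/ directory. Supports:
- ARM aarch64: NEON, DotProd, FP16, LSE atomics, CRC32, AES (unsupported instructions are skipped automatically instead of crashing)
- x86_64: SSE, AVX (FMA is detected automatically when supported)
- macOS Apple Silicon: full support via mach_absolute_time() plus frequency calibration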
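56 optimization patterns extracted from professional references, organized into 7 categories:
| Category | Patterns covered |
|---|---|
| Vectorization | Auto-vectorization blockers, NEON idioms, SVE patterns, FP16, DotProd |
| Memory | AoS→SoA, loop tiling, prefetch, false sharing, NUMA, huge pages, alignment |
| Branching | Branch→conditional move (with counter-examples), branchless lookup tables, indirect dispatch |
| Compute | Dependency chains, FMA utilization, strength reduction, reciprocal tricks |
| System | Syscall batching, huge pages, alignment, NUMA-aware allocation |
| Concurrency | Amdahl/USL analysis, dynamic scheduling, lock contention, thread pools, bandwidth saturation |
| Libraries | 25+ high-performance library alternatives (BLAS, SIMD wrappers, allocators, I/O) |
Sources: perf-book, perf-ninja, ComputeLibrary, optimized-routines, Cpp-High-Performance
Each pattern includes: problem description, detection method, before/after code, expected benefit, and when not to use it (for example, in the Game of Life case a "branchless optimization" actually caused a 3.2x slowdown).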
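- Cycle-based quantitative estimation plus a triple check: prevents over-confident optimization advice (a lesson learned from measurement: the Game of Life "branchless optimization" caused a 3.2x performance regression on Apple Silicon)
- Cross-compile on the host, execute on the target board: compiles on the development machine and runs benchmarks on the ARM board over SSH
- Disassembly verification: checks the instructions the compiler actually generated before every optimization, rather than guessing
- Correctness first: verifies that the optimized code produces the same output as the original before reporting a speedup
- Iteration with stopping rules: regressions are reverted immediately; gains below 1.2x are accepted as-is, avoiding pointless over-optimization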
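On first use, the skill guides you through creating cpp-perf-platform.yaml:
platforms:
my-arm-board:
compiler: aarch64-linux-gnu-g++
compiler_flags: "-O2 -march=armv8.2-a"
sysroot: /opt/arm-sysroot # optional
host: 192.168.1.100
port: 22
user: dev
arch: aarch64
work_dir: /tmp/cpp-perf
profile: cortex-a78
- Design document: full architecture and pipeline design
- Instrumentation design: TLS probe infrastructure design
- Implementation plan 1: Core Skill
- Implementation plan 2: Knowledge Base
- Implementation plan 3: Performance measurement tool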
MIT