cpp-perf: C++ Performance Optimization Skill for Claude Code

English

A Claude Code superpowers skill that analyzes and optimizes C++ code for target platforms (ARM/X86). It operates as a 7-stage pipeline: static analysis, optional instrumentation profiling, performance report, benchmark generation, cross-compilation with disassembly analysis, remote execution, and data-driven optimization.

Architecture Overview

graph TB
    subgraph Input["Stage 1: Input Parsing"]
        I1[Code Snippet] --> PARSE
        I2[Git Diff] --> PARSE
        I3[File / Function] --> PARSE
        PARSE[Parse & Expand Context]
    end

    subgraph Analysis["Stage 2: Static Analysis"]
        PARSE --> L1[Algorithm Layer]
        PARSE --> L2[Language Layer]
        PARSE --> L3[Microarchitecture Layer]
        PARSE --> L4[System Layer]
        L1 & L2 & L3 & L4 --> KB["Knowledge Base\n56 patterns + 25+ libraries"]
        KB --> SCORE["Score Issues\n(cycle estimation + sanity checks)"]
        PROFILE[("Platform Profile\n(YAML, cycles)")] -.-> SCORE
    end

    subgraph Instrument["Stage 2.5: Instrumentation (Optional)"]
        SCORE -->|LOW confidence| PROBE["Insert TLS Probes\n(L1→L2→L3 iterative)"]
        PROBE --> XCOMP1["Cross-Compile\n(host)"]
        XCOMP1 --> RUN1["Run on Target\n(SSH)"]
        RUN1 --> HOTSPOT["Hotspot Report\n(cycle breakdown)"]
        HOTSPOT -->|drill down| PROBE
    end

    subgraph Report["Stage 3: Performance Report"]
        SCORE -->|HIGH confidence| RPT
        HOTSPOT --> RPT["Graded Report\n🔴 High  🟡 Medium  🟢 Low"]
        RPT -->|user selects issues| SEL["Selected Issues"]
        RPT -->|"user says 'stop'"| DONE_RPT(("Done\n(report only)"))
    end

    subgraph Benchmark["Stage 4: Benchmark & Baseline"]
        SEL --> GEN["Generate Benchmark\n(from template)"]
        GEN --> XCOMP2["Cross-Compile\n(host)"]
        XCOMP2 --> DISASM["Disassemble\n(verify compiler output)"]
        DISASM -->|"contradicts analysis"| RETRACT["Retract Issue"]
        DISASM -->|"confirms"| UPLOAD["SCP to Target"]
        UPLOAD --> EXEC["Execute & Collect\n(JSON stats)"]
        EXEC --> BASELINE["Baseline Data\nmedian / p99 / stddev"]
    end

    subgraph Optimize["Stage 5: Optimize & Verify"]
        BASELINE --> OPTGEN["Generate\nOptimized Code"]
        OPTGEN --> CORRECT["Correctness Check\n(baseline vs optimized)"]
        CORRECT -->|mismatch| OPTGEN
        CORRECT -->|pass| XCOMP3["Cross-Compile\nOptimized"]
        XCOMP3 --> DISASM2["Disassemble\n(confirm expected instructions)"]
        DISASM2 --> EXEC2["Execute on Target"]
        EXEC2 --> CMP["Comparison Report\nbaseline vs optimized\nspeedup + correctness"]
    end

    subgraph Iterate["Stage 6: Iteration"]
        CMP -->|"< 1.0x regression"| REVERT["REVERT\n(negative result = data)"]
        CMP -->|"1.0x-1.2x"| ACCEPT(("Accept\n(good enough)"))
        CMP -->|"> 1.2x"| DELIVER(("Deliver\nOptimized Code"))
        CMP -->|"user: retry"| ALT["Try Alternative\nStrategy"]
        ALT --> OPTGEN
    end

    subgraph Data["Data Sources"]
        direction LR
        PROF_YAML[("profiles/*.yaml\nCortex-A78, A55\nNeoverse-N1, Skylake")]
        PATTERNS[("knowledge/patterns/\nvectorization, memory\nbranching, compute\nsystem")]
        LIBS[("knowledge/libraries.yaml\n25+ alternatives")]
        PROFILER[("profiler/\nC++ hardware\nmeasurement tool")]
        PROFILER -->|generates| PROF_YAML
    end

    PROF_YAML -.-> PROFILE
    PATTERNS -.-> KB
    LIBS -.-> KB

    classDef stage fill:#1a1a2e,stroke:#e94560,color:#fff
    classDef data fill:#0f3460,stroke:#16213e,color:#fff
    classDef decision fill:#533483,stroke:#e94560,color:#fff
    classDef endpoint fill:#2d6a4f,stroke:#40916c,color:#fff

    class PARSE,SCORE,PROBE,XCOMP1,RUN1,HOTSPOT,GEN,XCOMP2,DISASM,UPLOAD,EXEC,OPTGEN,CORRECT,XCOMP3,DISASM2,EXEC2 stage
    class PROF_YAML,PATTERNS,LIBS,PROFILER data
    class RPT,CMP decision
    class DONE_RPT,ACCEPT,DELIVER endpoint

What It Does

Give it C++ code (snippet, git diff, or file reference) and a target platform. It will:

Analyze — scan for performance issues across 4 layers (algorithm, language, microarchitecture, system)
Instrument (optional) — insert lightweight TLS-based timing probes to measure actual hotspots
Report — grade issues by estimated impact (HIGH/MEDIUM/LOW) with cycle-level estimates
Benchmark — generate standalone benchmarks, cross-compile, disassemble to verify compiler output
Optimize — generate optimized code, verify correctness, measure speedup on target hardware
Iterate — try alternative strategies with clear stopping rules

Quick Start

# In a Claude Code session with this skill installed:
> Optimize the performance of my_code.cpp for ARM Cortex-A78

The skill will guide you through the full pipeline interactively.

Project Structure

skills/cpp-perf/
├── SKILL.md                    # Trigger metadata
├── cpp-perf.md                 # Pipeline instructions (7 stages)
├── templates/
│   ├── benchmark.cpp.tmpl      # Benchmark harness (steady_clock, JSON, DoNotOptimize)
│   ├── correctness.cpp.tmpl    # Optimization correctness verifier
│   └── cpp_perf_probe.h        # Instrumentation probe (TLS ring buffer, ns timing)
├── profiles/                   # Platform performance profiles (cycles)
│   ├── cortex-a78.yaml
│   ├── cortex-a55.yaml
│   ├── neoverse-n1.yaml
│   └── x86-skylake.yaml
├── knowledge/
│   ├── libraries.yaml          # 25+ high-perf library alternatives
│   └── patterns/               # 56 optimization patterns from references
│       ├── vectorization/      # Auto-vectorization blockers, NEON idioms, SVE
│       ├── memory/             # AoS→SoA, loop tiling, prefetch, false sharing
│       ├── branching/          # Branch→cmov (with counter-examples), lookup tables
│       ├── compute/            # Dependency chains, FMA, strength reduction
│       └── system/             # Huge pages, alignment
└── profiler/                   # C++ hardware profiler
    ├── CMakeLists.txt
    ├── common.h                # Timing, calibration, SIGILL fault tolerance
    ├── main.cpp                # CLI entry point
    ├── output.cpp              # Structured YAML output + CPU model detection
    ├── measure_compute.cpp     # 34 instruction measurements (int/fp/SIMD/LSE/crypto)
    ├── measure_cache.cpp       # Cache hierarchy detection via pointer chasing
    ├── measure_memory.cpp      # Bandwidth, TLB miss penalty
    ├── measure_branch.cpp      # Branch misprediction penalty
    ├── measure_os.cpp          # Syscall, thread, fork, synchronization primitives
    ├── measure_alloc.cpp       # malloc, mmap, page faults
    ├── measure_io.cpp          # File I/O (open/close, read/write, fsync)
    └── measure_ipc.cpp         # Pipe, eventfd, signal, scheduling

Platform Profiler

Generates a platform performance profile by measuring actual hardware characteristics.

# Build (requires C++17)
cd skills/cpp-perf/profiler
mkdir build && cd build
cmake .. && make -j4

# Run (on target platform)
./profiler > my-board.yaml 2>progress.log

# Run specific measurements only
./profiler compute cache branch

Output is a YAML file compatible with the profiles/ schema. Supports:

ARM aarch64 — NEON, DotProd, FP16, LSE atomics, CRC32, AES (with SIGILL fallback for unsupported extensions)
x86_64 — SSE, AVX (FMA if available)
macOS Apple Silicon — full support via mach_absolute_time() + frequency calibration

Knowledge Base

56 optimization patterns extracted from professional references, organized across 7 categories:

Category	Patterns
vectorization	Auto-vectorization blockers, NEON idioms, SVE patterns, FP16, DotProd
memory	AoS→SoA, loop tiling, prefetch, false sharing, NUMA, huge pages, alignment
branching	Branch→cmov (with counter-examples), branchless lookup tables, indirect dispatch
compute	Dependency chains, FMA utilization, strength reduction, reciprocal tricks
system	Syscall batching, huge pages, alignment, NUMA-aware allocation
concurrency	Amdahl/USL analysis, dynamic scheduling, lock contention, thread pools, bandwidth saturation
libraries	25+ high-performance library alternatives (BLAS, SIMD wrappers, allocators, I/O)

Sources: perf-book, perf-ninja, ComputeLibrary, optimized-routines, Cpp-High-Performance

Each pattern includes: problem description, detection method, before/after code, expected impact, and caveats (including when the optimization can make things worse).

Key Design Decisions

Cycle-based estimation with sanity checks — prevents over-confident recommendations (learned from a Game of Life case where "branchless optimization" caused a 3.2x regression)
Cross-compilation on host, execution on target — the skill compiles on your dev machine and runs benchmarks on the ARM board via SSH
Disassembly verification — always checks compiler output before claiming an optimization works
Correctness-first — verifies optimized code matches baseline output before reporting speedup
Iterative with stopping rules — regressions are immediately reverted; <1.2x gains are accepted as "good enough"

Platform Configuration

On first use, the skill guides you through creating cpp-perf-platform.yaml:

platforms:
  my-arm-board:
    compiler: aarch64-linux-gnu-g++
    compiler_flags: "-O2 -march=armv8.2-a"
    sysroot: /opt/arm-sysroot        # optional
    host: 192.168.1.100
    port: 22
    user: dev
    arch: aarch64
    work_dir: /tmp/cpp-perf
    profile: cortex-a78

Documentation

Design Spec — full architecture and pipeline design
Instrumentation Spec — TLS probe infrastructure design
Plan 1: Core Skill
Plan 2: Knowledge Base
Plan 3: Profiler

中文

一个 Claude Code 超能力技能（superpowers skill），用于自动分析和优化 C++ 代码在目标平台（ARM/X86）上的性能。采用 7 阶段流水线：静态分析、可选插桩测量、性能报告、基准测试生成、交叉编译+反汇编分析、远程执行、数据驱动优化。

架构总览

graph TB
    subgraph 输入["阶段 1：输入解析"]
        I1[代码片段] --> PARSE
        I2[Git Diff] --> PARSE
        I3[文件/函数引用] --> PARSE
        PARSE[解析 & 扩展上下文]
    end

    subgraph 分析["阶段 2：静态分析"]
        PARSE --> L1[算法层]
        PARSE --> L2[语言层]
        PARSE --> L3[微架构层]
        PARSE --> L4[系统层]
        L1 & L2 & L3 & L4 --> KB["知识库\n56 个优化模式 + 25+ 个库替代"]
        KB --> SCORE["评分\n(cycle 估算 + 三重检查)"]
        PROFILE[("平台 Profile\n(YAML, cycles)")] -.-> SCORE
    end

    subgraph 插桩["阶段 2.5：插桩测量（可选）"]
        SCORE -->|低置信度| PROBE["插入 TLS 探针\n(L1→L2→L3 迭代)"]
        PROBE --> XCOMP1["交叉编译\n(宿主机)"]
        XCOMP1 --> RUN1["目标板运行\n(SSH)"]
        RUN1 --> HOTSPOT["热点报告\n(cycle 分布)"]
        HOTSPOT -->|继续下钻| PROBE
    end

    subgraph 报告["阶段 3：性能报告"]
        SCORE -->|高置信度| RPT
        HOTSPOT --> RPT["分级报告\n🔴 高影响  🟡 中影响  🟢 低影响"]
        RPT -->|用户选择优化项| SEL["选中的问题"]
        RPT -->|"用户说 'stop'"| DONE_RPT(("完成\n(仅报告)"))
    end

    subgraph 基准测试["阶段 4：基准测试 & Baseline"]
        SEL --> GEN["生成 Benchmark\n(从模板)"]
        GEN --> XCOMP2["交叉编译\n(宿主机)"]
        XCOMP2 --> DISASM["反汇编\n(验证编译器输出)"]
        DISASM -->|"与分析矛盾"| RETRACT["撤回该优化项"]
        DISASM -->|"确认"| UPLOAD["SCP 上传"]
        UPLOAD --> EXEC["执行 & 采集\n(JSON 统计)"]
        EXEC --> BASELINE["Baseline 数据\nmedian / p99 / stddev"]
    end

    subgraph 优化["阶段 5：优化 & 验证"]
        BASELINE --> OPTGEN["生成\n优化代码"]
        OPTGEN --> CORRECT["正确性验证\n(baseline vs 优化版)"]
        CORRECT -->|不一致| OPTGEN
        CORRECT -->|通过| XCOMP3["交叉编译\n优化版"]
        XCOMP3 --> DISASM2["反汇编\n(确认预期指令)"]
        DISASM2 --> EXEC2["目标板执行"]
        EXEC2 --> CMP["对比报告\nbaseline vs 优化版\n加速比 + 正确性"]
    end

    subgraph 迭代["阶段 6：迭代"]
        CMP -->|"< 1.0x 倒退"| REVERT["回滚\n(负结果也是数据)"]
        CMP -->|"1.0x-1.2x"| ACCEPT(("接受现状\n(已足够好)"))
        CMP -->|"> 1.2x"| DELIVER(("交付\n优化代码"))
        CMP -->|"用户要求重试"| ALT["尝试替代策略"]
        ALT --> OPTGEN
    end

    subgraph 数据源["数据源"]
        direction LR
        PROF_YAML[("profiles/*.yaml\nCortex-A78, A55\nNeoverse-N1, Skylake")]
        PATTERNS[("knowledge/patterns/\n向量化、内存\n分支、计算、系统")]
        LIBS[("knowledge/libraries.yaml\n25+ 替代方案")]
        PROFILER[("profiler/\nC++ 硬件\n测量工具")]
        PROFILER -->|生成| PROF_YAML
    end

    PROF_YAML -.-> PROFILE
    PATTERNS -.-> KB
    LIBS -.-> KB

    classDef stage fill:#1a1a2e,stroke:#e94560,color:#fff
    classDef data fill:#0f3460,stroke:#16213e,color:#fff
    classDef decision fill:#533483,stroke:#e94560,color:#fff
    classDef endpoint fill:#2d6a4f,stroke:#40916c,color:#fff

    class PARSE,SCORE,PROBE,XCOMP1,RUN1,HOTSPOT,GEN,XCOMP2,DISASM,UPLOAD,EXEC,OPTGEN,CORRECT,XCOMP3,DISASM2,EXEC2 stage
    class PROF_YAML,PATTERNS,LIBS,PROFILER data
    class RPT,CMP decision
    class DONE_RPT,ACCEPT,DELIVER endpoint

功能概述

给它一段 C++ 代码（代码片段、git diff 或文件引用）和目标平台，它会：

分析 — 从 4 个层面扫描性能问题（算法、语言特性、微架构、系统）
插桩（可选） — 插入轻量级 TLS 计时探针，实测热点分布
报告 — 按预估影响分级（高/中/低），附带 cycle 级估算
基准测试 — 生成独立 benchmark，交叉编译，反汇编验证编译器输出
优化 — 生成优化代码，验证正确性，在目标硬件上实测加速比
迭代 — 尝试替代策略，有明确的停止规则

快速开始

# 在安装了此 skill 的 Claude Code 会话中：
> 优化 my_code.cpp 在 ARM Cortex-A78 上的性能

Skill 会引导你交互式地完成完整流水线。

项目结构

skills/cpp-perf/
├── SKILL.md                    # 触发元数据
├── cpp-perf.md                 # 流水线指令（7 个阶段）
├── templates/
│   ├── benchmark.cpp.tmpl      # 基准测试模板（steady_clock, JSON, 防优化消除）
│   ├── correctness.cpp.tmpl    # 优化正确性验证模板
│   └── cpp_perf_probe.h        # 插桩探针（TLS 环形缓冲区, 纳秒计时）
├── profiles/                   # 平台性能档案（cycle 为单位）
│   ├── cortex-a78.yaml
│   ├── cortex-a55.yaml
│   ├── neoverse-n1.yaml
│   └── x86-skylake.yaml
├── knowledge/
│   ├── libraries.yaml          # 25+ 高性能库替代方案
│   └── patterns/               # 56 个优化模式（从专业书籍提取）
│       ├── vectorization/      # 自动向量化障碍、NEON 惯用法、SVE
│       ├── memory/             # AoS→SoA、循环分块、预取、伪共享
│       ├── branching/          # 分支→条件移动（含反面案例）、查找表
│       ├── compute/            # 依赖链、FMA、强度削减
│       └── system/             # 大页、对齐
└── profiler/                   # C++ 硬件性能测量工具
    ├── CMakeLists.txt
    ├── common.h                # 计时、校准、SIGILL 容错
    ├── main.cpp                # CLI 入口
    ├── output.cpp              # 结构化 YAML 输出 + CPU 型号检测
    ├── measure_compute.cpp     # 34 项指令测量（整数/浮点/SIMD/LSE/加密）
    ├── measure_cache.cpp       # Cache 层级检测（指针追踪法）
    ├── measure_memory.cpp      # 带宽、TLB 缺失代价
    ├── measure_branch.cpp      # 分支预测失败代价
    ├── measure_os.cpp          # 系统调用、线程、fork、同步原语
    ├── measure_alloc.cpp       # malloc、mmap、缺页中断
    ├── measure_io.cpp          # 文件 I/O（open/close, read/write, fsync）
    └── measure_ipc.cpp         # 管道、eventfd、信号、调度

硬件性能测量工具

自动测量目标平台的硬件特性，生成性能档案。

# 构建（需要 C++17）
cd skills/cpp-perf/profiler
mkdir build && cd build
cmake .. && make -j4

# 在目标平台运行
./profiler > my-board.yaml 2>progress.log

# 只运行特定测量
./profiler compute cache branch

输出与 profiles/ 目录下的 YAML 格式兼容。支持：

ARM aarch64 — NEON、DotProd、FP16、LSE 原子操作、CRC32、AES（不支持的指令自动跳过，不会崩溃）
x86_64 — SSE、AVX（支持 FMA 时自动检测）
macOS Apple Silicon — 通过 mach_absolute_time() + 频率校准完整支持

知识库

从专业参考资料中提取的 56 个优化模式，按 7 个类别组织：

类别	覆盖的模式
向量化	自动向量化障碍、NEON 惯用法、SVE 模式、FP16、DotProd
内存	AoS→SoA、循环分块、预取、伪共享、NUMA、大页、对齐
分支	分支→条件移动（含反面案例）、无分支查找表、间接分发
计算	依赖链、FMA 利用、强度削减、倒数技巧
系统	系统调用批处理、大页、对齐、NUMA 感知分配
并发	Amdahl/USL 分析、动态调度、锁竞争、线程池、带宽饱和
库	25+ 高性能库替代方案（BLAS、SIMD 封装、分配器、I/O）

来源：perf-book、perf-ninja、ComputeLibrary、optimized-routines、Cpp-High-Performance

每个模式包含：问题描述、检测方法、优化前后代码、预期收益、以及什么时候不该用（比如 Game of Life 案例中，"无分支优化"反而导致 3.2 倍性能下降）。

核心设计理念

基于 cycle 的量化估算 + 三重检查 — 防止过度自信的优化建议（源自实测教训：Game of Life 的"无分支优化"在 Apple Silicon 上造成 3.2x 性能倒退）
宿主机交叉编译，目标板执行 — 在开发机上编译，通过 SSH 在 ARM 板上运行 benchmark
反汇编验证 — 每次优化前都检查编译器实际生成的指令，不靠猜
正确性优先 — 先验证优化后代码与原始代码输出一致，再报告加速比
有停止规则的迭代 — 性能倒退立即回滚；<1.2x 的提升接受现状，不做无意义的过度优化

平台配置

首次使用时，skill 会引导你创建 cpp-perf-platform.yaml：

platforms:
  my-arm-board:
    compiler: aarch64-linux-gnu-g++
    compiler_flags: "-O2 -march=armv8.2-a"
    sysroot: /opt/arm-sysroot        # 可选
    host: 192.168.1.100
    port: 22
    user: dev
    arch: aarch64
    work_dir: /tmp/cpp-perf
    profile: cortex-a78

文档

设计文档 — 完整架构和流水线设计
插桩设计 — TLS 探针基础设施设计
实现计划 1：核心 Skill
实现计划 2：知识库
实现计划 3：性能测量工具

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 76 Commits
.claude		.claude
docs/superpowers		docs/superpowers
reference		reference
skills/cpp-perf		skills/cpp-perf
tests		tests
tools		tools
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cpp-perf: C++ Performance Optimization Skill for Claude Code

English

Architecture Overview

What It Does

Quick Start

Project Structure

Platform Profiler

Knowledge Base

Key Design Decisions

Platform Configuration

Documentation

中文

架构总览

功能概述

快速开始

项目结构

硬件性能测量工具

知识库

核心设计理念

平台配置

文档

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

cpp-perf: C++ Performance Optimization Skill for Claude Code

English

Architecture Overview

What It Does

Quick Start

Project Structure

Platform Profiler

Knowledge Base

Key Design Decisions

Platform Configuration

Documentation

中文

架构总览

功能概述

快速开始

项目结构

硬件性能测量工具

知识库

核心设计理念

平台配置

文档

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages