zpprof

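A runtime performance monitoring and automatic profiling tool for Go. It triggers pprof dumps automatically based on CPU, memory, goroutine, thread, and GC heap metrics, and can report profiling data to Pyroscope.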

Project status

✅ Core modules implemented
✅ Example program compiles and runs
⏳ Unit tests pending
⏳ Performance tests pending

Installation

go get github.com/zlsgo/zpprof

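Core capabilities

Multi-dimensional monitoring: CPU usage, RSS, goroutine count, OS thread count, GC heap memory
Smart triggering: automatic dumps based on thresholds, growth rate, and GC cycles
Cooldown: avoids repeated triggering and supports a warmup window
Asynchronous reporting: built-in queue with HTTP and Pyroscope reporters
Thread shrinking: when a thread leak is detected, dumps automatically and then shrinks after a delay
CGroup detection: automatically detects container resource limits (Linux cgroup v1)
Hot reload: configuration can be updated at runtime without a restart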

Architecture

zpprof/
├── config.go          # Configuration definitions
├── engine.go          # Engine core
├── types.go           # Type definitions
├── collector/         # Resource collection
│   ├── collector.go   # CPU/RSS/goroutine/thread collection
│   └── cgroup.go      # CGroup v1 resource detection
├── rule/              # Rule engine
│   ├── ringbuffer.go  # Sliding window
│   ├── rule.go        # Rule checker
│   └── gcheap.go      # GCHeap trigger
├── dumper/            # Dump mechanism
│   └── dumper.go      # heap/goroutine/threadcreate/cpu dumps
├── reporter/          # Reporter mechanism
│   ├── types.go       # Reporter interface
│   ├── queue.go       # Asynchronous report queue
│   ├── http.go        # HTTP Reporter
│   └── pyroscope.go   # Pyroscope Reporter
├── internal/shrink/   # Thread shrinking
│   └── shrink.go
└── example/           # Example program
    └── main.go

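Use cases

Production performance monitoring: dump automatically on CPU or memory anomalies for post-mortem analysis
Goroutine leak detection: dump automatically when the goroutine count grows abnormally
Thread leak diagnosis: dump automatically when the OS thread count grows abnormally, then shrink after a delay
GC pressure analysis: dump automatically when heap memory approaches NextGC within a GC cycle
Continuous profiling: integrate with Pyroscope to report profiling data continuously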

Core features

Implemented features

Quick start

Basic usage

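package main

import (
    "log"
    "time"

    "github.com/zlsgo/zpprof"
)

func main() {
    engine, err := zpprof.New(func(cfg *zpprof.Config) {
        cfg.CPU.Enable = true
        cfg.CPU.Min = 10  // ignore samples below 10% CPU
        cfg.CPU.Abs = 50  // trigger above 50% CPU
        cfg.CPU.Diff = 25 // trigger on 25% growth over the average

        cfg.CollectInterval = 3 * time.Second
    })
    if err != nil {
        log.Fatal(err)
    }

    engine.Start()

    // Run business logic
    select {}
}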

Full configuration example

engine, _ := zpprof.New(func(cfg *zpprof.Config) {
    // Collection settings
    cfg.CollectInterval = 5 * time.Second
    cfg.WarmupCycles = 10

    // CPU rule
    cfg.CPU.Enable = true
    cfg.CPU.Min = 20
    cfg.CPU.Abs = 80
    cfg.CPU.Diff = 25
    cfg.CPU.Cooldown = 5 * time.Minute

    // Memory rule
    cfg.Mem.Enable = true
    cfg.Mem.Min = 20
    cfg.Mem.Abs = 80
    cfg.Mem.Diff = 25
    cfg.Mem.Cooldown = 5 * time.Minute

    // Goroutine rule
    cfg.Goroutine.Enable = true
    cfg.Goroutine.Min = 100
    cfg.Goroutine.Diff = 25
    cfg.Goroutine.Max = 100000
    cfg.Goroutine.Cooldown = 5 * time.Minute

    // Thread rule
    cfg.Thread.Enable = true
    cfg.Thread.Min = 10
    cfg.Thread.Diff = 25
    cfg.Thread.Cooldown = 5 * time.Minute

    // GCHeap rule
    cfg.GCHeap.Enable = true
    cfg.GCHeap.MonitorMode = zpprof.GCHeapMonitorFinalizer // or GCHeapMonitorPoll
    cfg.GCHeap.TriggerRatio = 0.7
    cfg.GCHeap.DumpCount = 2
    cfg.GCHeap.Cooldown = 5 * time.Minute

    // Thread shrinking
    cfg.ShrinkThread.Enable = true
    cfg.ShrinkThread.Threshold = 1000
    cfg.ShrinkThread.Delay = 30 * time.Second
    cfg.ShrinkThread.ShrinkCount = 100

    // Dump settings
    cfg.Dump.DumpPath = "./dumps"
    cfg.Dump.BinaryDump = true
    cfg.Dump.TextDump = false
    cfg.Dump.CPUDuration = 10 * time.Second  // CPU profile sampling duration

    // Resource settings (auto-detect CGroup limits)
    cfg.Resource.UseCGroup = true
})

Configuring reporters

import (
    "time"

    "github.com/zlsgo/zpprof"
    "github.com/zlsgo/zpprof/reporter"
)

engine, _ := zpprof.New(func(cfg *zpprof.Config) {
    // Reporter settings
    cfg.Reporter.Enable = true
    cfg.Reporter.QueueSize = 100
    cfg.Reporter.Reporters = []reporter.Reporter{
        // HTTP Reporter
        reporter.NewHTTPReporter(
            "http://your-server.com/metrics",
            "",  // token
            30 * time.Second,
        ),

        // Pyroscope Reporter
        reporter.NewPyroscopeReporter(
            "http://pyroscope:4040",
            "",  // token
            map[string]string{
                "env": "production",
                "app": "my-app",
            },
            30 * time.Second,
        ),
    }
})

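Configuration reference

Collection parameters

CollectInterval: collection interval (default 5s)
WarmupCycles: number of warmup cycles (the first N collections do not trigger rules and are used to establish a baseline)

Monitoring rules

Each monitored type (CPU/Mem/Goroutine/Thread) supports the following rule parameters:

Parameter	Description	Example
`Enable`	Whether the rule is enabled	`true`
`Min`	Minimum threshold (no trigger below this value)	`20` (20% CPU)
`Abs`	Absolute threshold (triggers above this value)	`80` (80% CPU)
`Diff`	Difference percentage (growth relative to the average)	`25` (25% growth)
`Cooldown`	Cooldown (how long after a trigger before the rule can fire again)	`5m`
`Max`	Upper cap (Goroutine only; no trigger above this value)	`100000`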

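Trigger logic

Trigger conditions (either one is sufficient):

Absolute threshold: current > Min && current > Abs
Difference percentage: current > Min && (current - average) / average > Diff/100

Additional restrictions:

No trigger during warmup (collection count < WarmupCycles)
No trigger during cooldown (time since last trigger < Cooldown)
No trigger when the goroutine count exceeds Max (to avoid overload)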

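GCHeap configuration

GCHeap is a trigger mechanism based on GC cycles. It runs a check at every GC:

Enable: whether the rule is enabled
MonitorMode: monitoring mode
- GCHeapMonitorFinalizer: triggered automatically via a finalizer (recommended for production)
- GCHeapMonitorPoll: polls by forcing a GC every second (suitable for test environments)
TriggerRatio: trigger ratio; fires when HeapAlloc / NextGC > TriggerRatio (default 0.7)
DumpCount: number of consecutive dumps (used for heap diff analysis, default 2)
Cooldown: cooldown duration

Rationale: when the heap approaches the GC trigger point (NextGC), memory pressure is high, and a dump taken at that moment captures more in-memory objects.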

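Monitoring mode comparison:

Mode	Pros	Cons	Suitable for
`GCHeapMonitorFinalizer`	No polling overhead; follows GC as it naturally occurs	Depends on GC scheduling, so trigger timing is not deterministic	Production
`GCHeapMonitorPoll`	Controllable; takes effect immediately	Forces a GC every second, which has a performance cost	Test environments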

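Thread shrink configuration

Thread shrinking is used to handle OS thread leaks:

Enable: whether shrinking is enabled
Threshold: threshold (shrinking is triggered when the thread count exceeds this value)
Delay: delay (how long to wait after the dump before shrinking)
ShrinkCount: number of threads to shrink each time

Workflow:

Detect thread count > Threshold
Dump the threadcreate profile
Wait for Delay
Call debug.SetMaxThreads to lower the thread count
Restore the original value after a delay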

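Dump configuration

DumpPath: output directory for dump files (default ./dumps)
BinaryDump: whether to write binary profile files (default true)
TextDump: whether to write text output (default false)
CPUDuration: CPU profile sampling duration (default 10s)

Dump file naming: {type}_{timestamp}_{eventID}.pprof

Example: cpu_20250129-150405_1735459445123456789.pprof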

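Resource configuration

CPUCore: number of CPU cores (0 means auto-detect)
MemoryLimit: memory limit in bytes (0 means auto-detect)
UseCGroup: whether to use CGroup detection (Linux container environments)

Auto-detection logic:

If UseCGroup = true and the process runs in a Linux container, resource limits are read from /sys/fs/cgroup
Otherwise runtime.NumCPU() and physical memory are used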

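Metrics in detail

Metric	Unit	Description	Dump type
CPU	%	Process CPU usage (relative to the core count)	`cpu`
Mem	%	RSS memory usage (relative to total memory or the CGroup limit)	`heap`
Goroutine	count	Current number of goroutines	`goroutine`
Thread	count	Number of OS threads	`threadcreate`
GCHeap	-	HeapAlloc/NextGC ratio	`heap`

CPU usage calculation

CPU% = (current CPU time - previous CPU time) / (current time - previous time) / CPUCore * 100

Memory usage calculation

Mem% = RSS / MemoryLimit * 100

In a CGroup environment, MemoryLimit comes from memory.limit_in_bytes.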

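Running the example

# Build
go build -o example/demo example/main.go

# Run
./example/demo

# List dump files
ls -lh dumps/

# Analyze a CPU profile
go tool pprof dumps/cpu_*.pprof

# Analyze a heap profile
go tool pprof dumps/heap_*.pprof

# Heap diff analysis (consecutive GCHeap dumps); heap_1/heap_2 stand for the two actual dump files
go tool pprof -base dumps/heap_1.pprof dumps/heap_2.pprof

The example program simulates CPU, memory, and goroutine load to trigger the corresponding dumps.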

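Best practices

Recommended production settings

// Recommended production configuration
cfg.CollectInterval = 10 * time.Second  // lower the collection frequency
cfg.WarmupCycles = 12                   // lengthen the warmup
cfg.CPU.Cooldown = 10 * time.Minute     // lengthen the cooldown
cfg.Mem.Cooldown = 10 * time.Minute
cfg.Goroutine.Max = 50000               // set a reasonable upper cap

Guidelines for choosing thresholds

Min: set to 1.5-2x the normal value to filter out noise
Abs: set to 70-80% of the container or physical resource
Diff: set to 25-50% to catch sudden spikes
Cooldown: 5-10 minutes is recommended in production to avoid frequent triggering

Container settings

cfg.Resource.UseCGroup = true           // auto-detect container resource limits
cfg.CollectInterval = 10 * time.Second  // lower the collection frequency in containers
cfg.Dump.DumpPath = "/data/dumps"       // use persistent storage

GCHeap recommendations

Useful for memory leak investigation and GC pressure analysis
TriggerRatio = 0.7-0.8: captures the heap before GC is triggered
DumpCount = 2: enables heap diff analysis
Use together with the Mem rule to cover different scenarios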

Reporter integration

import (
    "os"
    "time"

    "github.com/zlsgo/zpprof/reporter"
)

// Continuous profiling with Pyroscope
cfg.Reporter.Enable = true
cfg.Reporter.Reporters = []reporter.Reporter{
    reporter.NewPyroscopeReporter(
        "http://pyroscope:4040",
        "",  // token
        map[string]string{
            "env":     os.Getenv("ENV"),
            "version": os.Getenv("VERSION"),
        },
        30 * time.Second,
    ),
}

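Advanced usage

Updating the configuration at runtime

// Build the new configuration (a complete configuration is required)
newCfg := zpprof.Config{
    CollectInterval: 15 * time.Second,
    CPU: zpprof.Rule{
        Enable: true,
        Abs:    90,  // adjusted threshold
        Min:    10,
        Diff:   25,
    },
    // ... other configuration fields
}

// Apply the update
if err := engine.Update(newCfg); err != nil {
    log.Printf("zpprof: config update failed: %v", err)
}

Custom reporter

// Requires the "context" package and github.com/zlsgo/zpprof/reporter.
type CustomReporter struct{}

func (r *CustomReporter) Report(ctx context.Context, event *reporter.Event) error {
    // Report to a custom backend
    return nil
}

cfg.Reporter.Reporters = []reporter.Reporter{
    &CustomReporter{},
}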

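Prometheus integration

// Report to a Prometheus Pushgateway through the HTTP Reporter.
// Note: Pushgateway only accepts metrics in the Prometheus exposition format,
// so this assumes the HTTP Reporter payload is compatible with it.
cfg.Reporter.Reporters = []reporter.Reporter{
    reporter.NewHTTPReporter(
        "http://pushgateway:9091/metrics/job/zpprof",
        "",
        30 * time.Second,
    ),
}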

常见问题

Q: 为什么没有触发 dump？

检查：

是否启用规则（Enable = true）
是否在预热期内（采集次数 < WarmupCycles）
是否在冷却期内（上次触发后 < Cooldown）
当前值是否满足触发条件（> Min 且满足 Abs 或 Diff）
Goroutine 是否超过 Max（超过不触发）

Q: CPU 使用率不准确？

确认 UseCGroup = true（容器环境）
手动设置 CPUCore（某些虚拟化环境）
增加 WarmupCycles（建立稳定基准）

Q: Dump 文件过大？

调整 BinaryDump = false，只输出到日志
定期清理旧 dump 文件
增加 Cooldown 时间，减少触发频率

Q: 如何分析 heap diff？

# GCHeap 会连续 dump 2 次（DumpCount = 2）
go tool pprof -base dumps/heap_1.pprof dumps/heap_2.pprof

# 查看新增内存
(pprof) top
(pprof) list <function>

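Q: How does thread shrinking work?

When Thread > Threshold is detected, dump threadcreate
Wait for Delay (so in-flight requests can finish)
Call debug.SetMaxThreads(current - ShrinkCount) to shrink
Restore the original value after a delay

Note: shrinking may affect performance and is intended only for locating thread leaks. Per the Go documentation, a program that attempts to use more OS threads than the limit set by debug.SetMaxThreads crashes, so enable this feature with care.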

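Performance overhead

zpprof design goals:

CPU overhead: < 2% (normal collection)
Memory overhead: < 20MB (excluding dump files)
Collection latency: millisecond level (asynchronous collection)
Report queue: processed asynchronously without blocking the main flow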
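
One common way to keep a report queue from blocking the caller is a buffered channel with a non-blocking send, drained by a background worker. The sketch below shows that pattern. It is not zpprof's queue: the `event` and `queue` types are stand-ins, and dropping events when the buffer is full is an assumed policy.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a stand-in for whatever the report queue carries.
type event struct{ Kind string }

// queue is a bounded buffer drained by a single background worker.
type queue struct {
	ch chan event
	wg sync.WaitGroup
}

// newQueue starts the worker, which calls report for each queued event.
func newQueue(size int, report func(event)) *queue {
	q := &queue{ch: make(chan event, size)}
	q.wg.Add(1)
	go func() {
		defer q.wg.Done()
		for ev := range q.ch {
			report(ev)
		}
	}()
	return q
}

// enqueue never blocks: it returns false and drops the event when the
// buffer is full (an assumed policy for this sketch).
func (q *queue) enqueue(ev event) bool {
	select {
	case q.ch <- ev:
		return true
	default:
		return false
	}
}

// shutdown stops accepting events and waits for the worker to drain the buffer.
func (q *queue) shutdown() {
	close(q.ch)
	q.wg.Wait()
}

func main() {
	var got []string
	q := newQueue(100, func(ev event) { got = append(got, ev.Kind) })
	q.enqueue(event{Kind: "cpu"})
	q.enqueue(event{Kind: "heap"})
	q.shutdown() // after this returns, got is safe to read
	fmt.Println(got)
}
```

With this design a slow or unreachable backend costs at most one buffer of memory; the collection loop only pays for a channel send.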

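Technical constraints

Go version: 1.24+
Platforms: cross-platform (Linux/macOS/Windows)
- CGroup resource detection is limited to Linux container environments
- Other platforms use system APIs to obtain resource limits (runtime.NumCPU())
Dependencies: standard library, zlsgo, and gopsutil only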
