llm-power-profiler

llm-power-profiler is a lightweight local monitor that measures watts, tokens, and joules per token for OpenAI-compatible LLM servers.

nvidia-smi tells you watts. llm-power-profiler tells you joules per token.

Why

Local LLM servers such as vLLM, SGLang, TGI, llama.cpp, and Ollama-compatible endpoints can report token usage, while NVIDIA GPUs can report power telemetry through NVML. llm-power-profiler connects those two signals so you can understand the energy profile of your own inference workload.

This project is intentionally lightweight:

  • No daemon
  • No database
  • No Prometheus required
  • No root required
  • No benchmark harness required

Install

pip install llm-power-profiler

For local development:

git clone https://github.com/chenxuniu/llm-power-profiler
cd llm-power-profiler
pip install -e .

Quick Start

Start your OpenAI-compatible LLM server, for example vLLM:

vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

Run the profiler as a local proxy:

llm-power-profiler proxy \
  --target http://localhost:8000 \
  --port 9000 \
  --gpus 0 \
  --export reports/session.json

Send requests through the profiler:

curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain joules per token in one sentence."}],
    "max_tokens": 128
  }'
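The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the proxy from the previous step is listening on localhost:9000; the helper names `build_chat_request` and `send` are illustrative and not part of llm-power-profiler.

```python
import json
import urllib.request

# Assumption: the proxy from the Quick Start is listening on localhost:9000.
PROXY_URL = "http://localhost:9000/v1/chat/completions"


def build_chat_request(prompt, model, max_tokens=128, url=PROXY_URL):
    """Build the same POST request as the curl example, without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON body (needs a running proxy)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the proxy running, `send(build_chat_request("Explain joules per token in one sentence.", "meta-llama/Llama-3.1-8B-Instruct"))` returns the decoded response, whose `usage` object carries the token counts.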

The terminal dashboard shows aggregate power and token metrics:

Requests        128
Total tokens    842,112
Throughput      1,284 tok/s
Avg power       421 W
Peak power      518 W
Energy          0.37 kWh
J/token         1.58
kWh/1M tokens   0.44
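The last two rows follow from the energy and token totals by unit conversion alone. The sketch below redoes that arithmetic with the sample values above, assuming both ratios are computed over the whole session (1 kWh = 3,600,000 J); the variable names are illustrative.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 1000 W * 3600 s

# Values copied from the sample dashboard above.
energy_kwh = 0.37
total_tokens = 842_112

energy_joules = energy_kwh * JOULES_PER_KWH
joules_per_token = energy_joules / total_tokens
kwh_per_million_tokens = energy_kwh / (total_tokens / 1_000_000)

print(f"J/token       {joules_per_token:.2f}")        # 1.58
print(f"kWh/1M tokens {kwh_per_million_tokens:.2f}")  # 0.44
```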

Screenshots

NVML and A100 power telemetry check:

[Screenshot: llm-power-profiler doctor on A100]

Local proxy smoke run with the mock OpenAI-compatible server:

[Screenshot: llm-power-profiler proxy dashboard]

The mock server is useful for validating the monitoring pipeline. Real LLM energy numbers should be collected with an actual inference server such as vLLM or SGLang.

Commands

llm-power-profiler doctor

Checks whether NVML and NVIDIA GPU power telemetry are available.

llm-power-profiler watch

Displays a lightweight view of GPU power, utilization, memory, and temperature.

llm-power-profiler proxy --target http://localhost:8000 --port 9000

Runs a local OpenAI-compatible proxy and calculates aggregate watts, tokens, and joules per token.

Add --export reports/session.json to write a JSON report when the proxy exits.

Add --gpus 0,1 to sample a subset of local NVIDIA GPUs.

Try Without a GPU

You can test the token-accounting path with the included mock OpenAI-compatible server:

python3 examples/mock_openai_server.py

If port 8000 is already in use:

python3 examples/mock_openai_server.py --port 8001

Then run the proxy, pointing --target at the port the mock server is listening on (8000 by default):

llm-power-profiler proxy \
  --target http://127.0.0.1:8000 \
  --port 9000 \
  --export reports/mock-session.json

If NVML is unavailable, token metrics still work and power metrics are marked as disabled.
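One way to get this kind of graceful fallback is to treat a missing NVML binding or driver as "no reading" rather than as an error. The sketch below is an illustration, not the project's actual code: it assumes the `pynvml` module (from the nvidia-ml-py package), and the function names are hypothetical.

```python
def read_gpu_power_watts(gpu_index=0):
    """Return the GPU's current power draw in watts, or None without NVML."""
    try:
        import pynvml  # provided by the nvidia-ml-py package
    except ImportError:
        return None  # bindings not installed: power metrics disabled
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # no driver or no NVIDIA GPU on this machine
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        # NVML reports power in milliwatts.
        return pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    except pynvml.NVMLError:
        return None  # this GPU does not expose power telemetry
    finally:
        pynvml.nvmlShutdown()


def format_power(watts):
    """Render a power reading for display, marking missing data as disabled."""
    return "disabled" if watts is None else f"{watts:.0f} W"
```

On a machine without an NVIDIA GPU, `format_power(read_gpu_power_watts())` yields `disabled`, while token accounting can proceed unaffected.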

Generate repeatable test traffic:

python3 examples/traffic_generator.py \
  --base-url http://127.0.0.1:9000 \
  --model mock-llm \
  --requests 16 \
  --concurrency 4 \
  --max-tokens 128

MVP Scope

The first version focuses on:

  • NVIDIA GPUs through NVML
  • OpenAI-compatible HTTP APIs
  • /v1/chat/completions and /v1/completions
  • response usage.prompt_tokens, usage.completion_tokens, and usage.total_tokens
  • aggregate session metrics
  • terminal-first monitoring
  • non-streaming responses
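Token accounting from the `usage` fields listed above can be sketched as a small accumulator. This is a hypothetical illustration of the idea, not the project's implementation; the class name `SessionTokens` and its treatment of missing fields (counted as zero) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class SessionTokens:
    """Running token totals for one monitoring session."""

    requests: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

    def add_response(self, body):
        """Fold one decoded non-streaming JSON response into the totals."""
        self.requests += 1
        usage = body.get("usage") or {}  # tolerate a missing or null usage object
        prompt = int(usage.get("prompt_tokens", 0))
        completion = int(usage.get("completion_tokens", 0))
        self.prompt_tokens += prompt
        self.completion_tokens += completion
        # Fall back to prompt + completion if the server omits total_tokens.
        self.total_tokens += int(usage.get("total_tokens", prompt + completion))
```

Calling `add_response` once per proxied response gives the aggregate counts that the session metrics are built on.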

Not in the first version:

  • multi-node attribution
  • Slurm integration
  • DCGM dependency
  • request-level precise energy attribution
  • CPU or DRAM power telemetry
  • hosted dashboard
  • streaming token accounting

Metrics

Core metrics:

  • duration
  • request count
  • prompt tokens
  • completion tokens
  • total tokens
  • tokens per second
  • average GPU power
  • peak GPU power
  • energy in joules, Wh, and kWh
  • joules per token
  • kWh per 1M tokens
  • GPU utilization
  • GPU memory usage
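The energy-related metrics can all be derived from a series of timestamped power readings plus a token count. The sketch below integrates power over time with the trapezoidal rule; that integration method, the function name `summarize_power`, and the dictionary keys are assumptions for illustration, and the tool itself may compute these differently.

```python
def summarize_power(samples, total_tokens):
    """Derive session metrics from (timestamp_seconds, watts) samples.

    samples must be in time order and contain at least two readings.
    """
    if len(samples) < 2:
        raise ValueError("need at least two power samples")
    duration_s = samples[-1][0] - samples[0][0]
    if duration_s <= 0:
        raise ValueError("samples must span a positive duration")

    # Trapezoidal rule: average of adjacent readings times the time step.
    energy_j = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy_j += 0.5 * (p0 + p1) * (t1 - t0)

    energy_kwh = energy_j / 3_600_000
    has_tokens = total_tokens > 0
    return {
        "duration_s": duration_s,
        "avg_power_w": energy_j / duration_s,  # time-weighted average
        "peak_power_w": max(p for _, p in samples),
        "energy_j": energy_j,
        "energy_wh": energy_j / 3600,
        "energy_kwh": energy_kwh,
        "tokens_per_s": total_tokens / duration_s,
        "joules_per_token": energy_j / total_tokens if has_tokens else None,
        "kwh_per_1m_tokens": (
            energy_kwh / total_tokens * 1_000_000 if has_tokens else None
        ),
    }
```

For example, three readings of 400 W, 500 W, and 400 W taken one second apart integrate to 900 J over 2 s; with 1,800 tokens generated in that window, that is 0.5 J/token.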

Relationship to TokenPowerBench

TokenPowerBench is for rigorous LLM inference power benchmarking across engines, models, datasets, and cluster configurations.

llm-power-profiler is for lightweight local monitoring while you are already running an LLM server.

Development

pip install -e ".[dev]"
llm-power-profiler doctor

Run a quick syntax check:

python3 -m compileall src

Run tests:

PYTHONPATH=src python3 -m unittest discover -s tests

See docs/validation.md for a step-by-step local validation flow. See docs/minimal-validation.md for the lowest-burden validation plan. See docs/experiments.md for the first A100/H100/H200 experiment plan. See docs/hpec-paper-plan.md for the short paper direction.

Validation Scripts

Collect machine metadata:

python3 scripts/collect_env.py --gpus 0 --output reports/a100-env.json

Run the mock end-to-end validation:

python3 scripts/run_mock_validation.py --gpus 0 --prefix a100-mock

Run validation against an existing vLLM/OpenAI-compatible server:

python3 scripts/run_vllm_validation.py \
  --target http://127.0.0.1:8001 \
  --model <MODEL_NAME> \
  --gpus 0 \
  --prefix a100-vllm

Run a small concurrency sweep:

python3 scripts/run_concurrency_sweep.py \
  --target http://127.0.0.1:8001 \
  --model <MODEL_NAME> \
  --gpus 0 \
  --concurrency 1,4,8 \
  --prefix a100-vllm

Summarize reports:

python3 scripts/summarize_reports.py --reports-dir reports --csv reports/summary.csv
