llm-power-profiler is a lightweight local monitor that measures watts, tokens, and joules per token for OpenAI-compatible LLM servers.
nvidia-smi tells you watts. llm-power-profiler tells you joules per token.
Local LLM servers such as vLLM, SGLang, TGI, llama.cpp, and Ollama-compatible endpoints can report token usage, while NVIDIA GPUs can report power telemetry through NVML. llm-power-profiler connects those two signals so you can understand the energy profile of your own inference workload.
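As a rough illustration of how the two signals combine, the sketch below integrates timestamped power samples into energy and divides by a token count. The function names and the trapezoidal rule are assumptions made for illustration; this README does not state how llm-power-profiler integrates power internally.

```python
from typing import Sequence, Tuple


def energy_joules(samples: Sequence[Tuple[float, float]]) -> float:
    """Integrate (timestamp_seconds, watts) samples with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def joules_per_token(samples: Sequence[Tuple[float, float]], total_tokens: int) -> float:
    """Divide the integrated energy by the number of tokens served."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return energy_joules(samples) / total_tokens


# Hypothetical samples: a steady 400 W for 10 s while 2,000 tokens are served.
samples = [(0.0, 400.0), (5.0, 400.0), (10.0, 400.0)]
print(joules_per_token(samples, 2000))  # 2.0
```

With denser sampling the same function tracks bursts in power draw more closely; the sampling interval is a trade-off between accuracy and overhead.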
This project is intentionally lightweight:
- No daemon
- No database
- No Prometheus required
- No root required
- No benchmark harness required
pip install llm-power-profiler
For local development:
git clone https://github.com/chenxuniu/llm-power-profiler
cd llm-power-profiler
pip install -e .
Start your OpenAI-compatible LLM server, for example vLLM:
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
Run the profiler as a local proxy:
llm-power-profiler proxy \
--target http://localhost:8000 \
--port 9000 \
--gpus 0 \
--export reports/session.json
Send requests through the profiler:
curl http://localhost:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Explain joules per token in one sentence."}],
"max_tokens": 128
}'
The terminal dashboard shows aggregate power and token metrics:
Requests 128
Total tokens 842,112
Throughput 1,284 tok/s
Avg power 421 W
Peak power 518 W
Energy 0.37 kWh
J/token 1.58
kWh/1M tokens 0.44
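The energy-derived rows in the sample output are related by plain unit conversions. The sketch below reproduces that arithmetic from the Energy and Total tokens values shown; the helper names are hypothetical and are not part of the CLI.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6 MJ


def joules_per_token_from_kwh(energy_kwh: float, total_tokens: int) -> float:
    """Convert session energy to joules, then divide by the token count."""
    return energy_kwh * JOULES_PER_KWH / total_tokens


def kwh_per_million_tokens(energy_kwh: float, total_tokens: int) -> float:
    """Scale the per-token energy up to one million tokens."""
    return energy_kwh / total_tokens * 1_000_000


energy_kwh = 0.37        # "Energy" row
total_tokens = 842_112   # "Total tokens" row
print(round(joules_per_token_from_kwh(energy_kwh, total_tokens), 2))  # 1.58
print(round(kwh_per_million_tokens(energy_kwh, total_tokens), 2))     # 0.44
```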
[Screenshot: NVML and A100 power telemetry check]
[Screenshot: local proxy smoke run with the mock OpenAI-compatible server]
The mock server is useful for validating the monitoring pipeline. Real LLM energy numbers should be collected with an actual inference server such as vLLM or SGLang.
llm-power-profiler doctor
Checks whether NVML and NVIDIA GPU power telemetry are available.
llm-power-profiler watch
Displays a lightweight GPU power/utilization/memory/temperature view.
llm-power-profiler proxy --target http://localhost:8000 --port 9000
Runs a local OpenAI-compatible proxy and calculates aggregate watts, tokens, and joules per token.
Add --export reports/session.json to write a JSON report when the proxy exits.
Add --gpus 0,1 to sample a subset of local NVIDIA GPUs.
You can test the token-accounting path with the included mock OpenAI-compatible server:
python3 examples/mock_openai_server.py
If port 8000 is already in use:
python3 examples/mock_openai_server.py --port 8001
Then run the proxy, pointing --target at the port the mock server is listening on:
llm-power-profiler proxy \
--target http://127.0.0.1:8000 \
--port 9000 \
--export reports/mock-session.json
If NVML is unavailable, token metrics still work and power metrics are marked as disabled.
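One way to get this kind of graceful fallback is to treat a missing NVML binding or driver as "no power data" instead of an error. The sketch below uses the pynvml module (from the nvidia-ml-py package); the function name and the None-means-disabled convention are assumptions for illustration, not the profiler's actual code.

```python
from typing import Optional


def read_gpu_power_watts(gpu_index: int = 0) -> Optional[float]:
    """Return the current GPU power draw in watts, or None if NVML is unavailable."""
    try:
        import pynvml  # provided by the nvidia-ml-py package
    except ImportError:
        return None  # Python bindings not installed
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # no NVIDIA driver or NVML library on this machine
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        # NVML reports power in milliwatts.
        return pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    except pynvml.NVMLError:
        return None  # GPU index missing or power telemetry unsupported
    finally:
        pynvml.nvmlShutdown()


power = read_gpu_power_watts(0)
print("power metrics disabled" if power is None else f"{power:.1f} W")
```

A long-running sampler would normally initialize NVML once and reuse the handle; the init/shutdown per call here just keeps the example self-contained.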
Generate repeatable test traffic:
python3 examples/traffic_generator.py \
--base-url http://127.0.0.1:9000 \
--model mock-llm \
--requests 16 \
--concurrency 4 \
--max-tokens 128
The first version focuses on:
- NVIDIA GPUs through NVML
- OpenAI-compatible HTTP APIs
- /v1/chat/completions and /v1/completions
- response usage.prompt_tokens, usage.completion_tokens, and usage.total_tokens
- aggregate session metrics
- terminal-first monitoring
- non-streaming responses
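Token accounting for non-streaming responses comes down to reading the usage object from the JSON body. A minimal sketch, assuming a standard OpenAI-compatible response shape; the extract_usage helper and its fallback to prompt plus completion tokens are illustrative choices, not the profiler's documented behavior.

```python
import json
from typing import Dict


def extract_usage(response_body: bytes) -> Dict[str, int]:
    """Read token counts from a non-streaming OpenAI-compatible response body."""
    payload = json.loads(response_body)
    usage = payload.get("usage") or {}
    prompt = int(usage.get("prompt_tokens", 0))
    completion = int(usage.get("completion_tokens", 0))
    # Fall back to the sum when the server omits total_tokens.
    total = int(usage.get("total_tokens", prompt + completion))
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": total,
    }


body = b'{"choices": [], "usage": {"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42}}'
print(extract_usage(body))
# {'prompt_tokens': 12, 'completion_tokens': 30, 'total_tokens': 42}
```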
Not in the first version:
- multi-node attribution
- Slurm integration
- DCGM dependency
- request-level precise energy attribution
- CPU or DRAM power telemetry
- hosted dashboard
- streaming token accounting
Core metrics:
- duration
- request count
- prompt tokens
- completion tokens
- total tokens
- tokens per second
- average GPU power
- peak GPU power
- energy in joules, Wh, and kWh
- joules per token
- kWh per 1M tokens
- GPU utilization
- GPU memory usage
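Several of the metrics above are simple running aggregates over requests and power samples. The sketch below shows one possible shape for such an accumulator; SessionStats and its fields are hypothetical and do not describe the profiler's actual data model or report schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SessionStats:
    """Hypothetical running aggregate of token and power observations."""
    requests: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    power_samples_w: List[float] = field(default_factory=list)

    def add_request(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.requests += 1
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    def tokens_per_second(self, duration_s: float) -> float:
        return self.total_tokens / duration_s if duration_s > 0 else 0.0

    def average_power_w(self) -> float:
        samples = self.power_samples_w
        return sum(samples) / len(samples) if samples else 0.0

    def peak_power_w(self) -> float:
        return max(self.power_samples_w, default=0.0)


stats = SessionStats()
stats.add_request(prompt_tokens=20, completion_tokens=100)
stats.add_request(prompt_tokens=30, completion_tokens=50)
stats.power_samples_w.extend([300.0, 400.0, 500.0])
print(stats.requests, stats.total_tokens)              # 2 200
print(stats.tokens_per_second(duration_s=4.0))         # 50.0
print(stats.average_power_w(), stats.peak_power_w())   # 400.0 500.0
```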
TokenPowerBench is for rigorous LLM inference power benchmarking across engines, models, datasets, and cluster configurations.
llm-power-profiler is for lightweight local monitoring while you are already running an LLM server.
pip install -e ".[dev]"
llm-power-profiler doctor
Run a quick syntax check:
python3 -m compileall src
Run tests:
PYTHONPATH=src python3 -m unittest discover -s tests
See docs/validation.md for a step-by-step local validation flow. See docs/minimal-validation.md for the lowest-burden validation plan. See docs/experiments.md for the first A100/H100/H200 experiment plan. See docs/hpec-paper-plan.md for the short paper direction.
Collect machine metadata:
python3 scripts/collect_env.py --gpus 0 --output reports/a100-env.json
Run the mock end-to-end validation:
python3 scripts/run_mock_validation.py --gpus 0 --prefix a100-mock
Run validation against an existing vLLM/OpenAI-compatible server:
python3 scripts/run_vllm_validation.py \
--target http://127.0.0.1:8001 \
--model <MODEL_NAME> \
--gpus 0 \
--prefix a100-vllm
Run a small concurrency sweep:
python3 scripts/run_concurrency_sweep.py \
--target http://127.0.0.1:8001 \
--model <MODEL_NAME> \
--gpus 0 \
--concurrency 1,4,8 \
--prefix a100-vllm
Summarize reports:
python3 scripts/summarize_reports.py --reports-dir reports --csv reports/summary.csv
