# p99-latency

Here is 1 public repository matching this topic...

Thermal-aware batch controller for vLLM/TensorRT-LLM. Prevents HBM thermal throttling from degrading p99 latency on H100/H200 GPUs. Monitors GPU temperature via nvidia-smi, automatically cuts batch size at 85 °C, and migrates cold KV-cache blocks to DRAM. Prometheus metrics and a Grafana dashboard included. Reported p99 latency at 128K context: 4.2 s -> 2.1 s.

  • Updated Apr 13, 2026
  • Python
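The description above outlines a poll-and-back-off control loop: watch GPU temperature and shrink the serving batch size before the hardware throttles. Below is a minimal sketch of that idea in Python, assuming `pynvml` for temperature readings; the 85 °C cut threshold comes from the description, while the restore threshold, poll interval, and the `set_max_batch` callback into the serving engine are illustrative assumptions, not the repository's actual code or a real vLLM API. Note that `NVML_TEMPERATURE_GPU` reports the core sensor; it is used here as a proxy for HBM temperature.

```python
import time
import pynvml

TEMP_CUT = 85       # deg C: cut batch size near the throttle point (from the repo description)
TEMP_RESTORE = 78   # deg C: assumed hysteresis floor before restoring the normal batch size
POLL_SECONDS = 2.0  # assumed polling interval


def gpu_temperature(handle) -> int:
    """Read the GPU core temperature in Celsius (proxy for HBM temperature)."""
    return pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)


def control_loop(set_max_batch, normal_batch: int, reduced_batch: int) -> None:
    """Poll temperature and switch between normal and reduced batch sizes.

    `set_max_batch` is a hypothetical callback into the serving engine
    (e.g. something that adjusts the scheduler's max batch size); it is
    not an actual vLLM or TensorRT-LLM API.
    """
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    throttled = False
    try:
        while True:
            temp = gpu_temperature(handle)
            if not throttled and temp >= TEMP_CUT:
                set_max_batch(reduced_batch)   # back off before the GPU throttles
                throttled = True
            elif throttled and temp <= TEMP_RESTORE:
                set_max_batch(normal_batch)    # cooled down; restore throughput
                throttled = False
            time.sleep(POLL_SECONDS)
    finally:
        pynvml.nvmlShutdown()
```

The hysteresis gap between the cut and restore thresholds keeps the controller from oscillating when the temperature hovers near 85 °C.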
