KiriSchrieffer/llm-serving-runtime

GPU-Aware LLM Serving Runtime

A small local LLM serving runtime focused on the systems work behind inference serving: async request handling, scheduling, batching, streaming, metrics, and reproducible benchmarking.

This project is intentionally not an agent framework. It does not implement tools, RAG, fine-tuning, or CUDA kernels. The default path uses a mock backend that simulates prefill and decode latency so the serving path can be tested without a GPU.
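As a rough illustration of what simulating prefill and decode latency can look like, here is a minimal sketch. The class name `SketchMockBackend`, the `measure` helper, and the exact delay model are assumptions made for this example and are not taken from the repository's backend; the default delays mirror the `--prefill-latency-ms 25` and `--decode-latency-ms 10` values used in the benchmark commands later in this README, and the `tokN ` token format mirrors the SSE example.

```python
import asyncio
import time
from typing import AsyncIterator


class SketchMockBackend:
    """Hypothetical mock backend: one prefill delay, then one delay per token."""

    def __init__(self, prefill_latency_s: float = 0.025, decode_latency_s: float = 0.010) -> None:
        self.prefill_latency_s = prefill_latency_s
        self.decode_latency_s = decode_latency_s

    async def generate(self, max_tokens: int) -> AsyncIterator[str]:
        # Prefill: a single delay before any token is produced.
        await asyncio.sleep(self.prefill_latency_s)
        for i in range(max_tokens):
            # Decode: a fixed delay for each generated token.
            await asyncio.sleep(self.decode_latency_s)
            yield f"tok{i} "


async def measure(backend: SketchMockBackend, max_tokens: int) -> tuple[list[str], float, float]:
    """Return (tokens, time-to-first-token, total latency), both times in seconds."""
    start = time.perf_counter()
    ttft = 0.0
    tokens: list[str] = []
    async for token in backend.generate(max_tokens):
        if not tokens:
            # The first token marks the end of prefill plus one decode step.
            ttft = time.perf_counter() - start
        tokens.append(token)
    return tokens, ttft, time.perf_counter() - start
```

Calling `asyncio.run(measure(SketchMockBackend(), 4))` returns the four tokens along with the two timings, which is enough to see why TTFT and total latency diverge as `max_tokens` grows.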

Why This Exists

LLM serving is a systems problem: requests arrive concurrently, wait in queues, get scheduled into batches, stream tokens back to clients, and produce latency and throughput metrics. This repository demonstrates that engineering surface in a compact, inspectable Python service.

The goal is to make tradeoffs visible:

  • FIFO vs priority scheduling
  • single-request serving vs dynamic batching
  • time-to-first-token vs total latency
  • queue wait time under load
  • backend adapter boundaries for llama.cpp and vLLM

Architecture

flowchart LR
    Client["HTTP / SSE client"] --> API["FastAPI API<br/>/v1/chat/completions"]
    API --> Admission["AdmissionController<br/>queue cap + token bucket"]
    Admission --> Request["RuntimeRequest<br/>priority, stream flag, timing"]
    Request --> Scheduler["Scheduler<br/>FIFO or priority"]
    Scheduler --> Worker["WorkerManager<br/>batch formation + dispatch"]
    Worker --> Backend["Backend adapter<br/>mock / llama.cpp / vLLM"]
    Backend --> Worker
    Worker --> Handles["CompletionHandle / StreamingHandle"]
    Handles --> API
    API --> Client
    Worker --> Metrics["MetricsCollector<br/>queue wait, TTFT, latency"]
    Metrics --> GPU["NvidiaSmiSampler<br/>optional GPU telemetry"]
    API --> MetricsEndpoint["/metrics<br/>JSON or Prometheus"]
    Metrics --> MetricsEndpoint

The runtime has these layers:

  • API layer: FastAPI routes for health, metrics, and OpenAI-style chat completions.
  • Admission control: optional queue-cap and token-bucket guards that reject overload before requests enter the scheduler.
  • Core request model: internal request object with ID, messages, priority, timing, stream flag, and sampling params.
  • Scheduler: FIFO and priority-based schedulers, both with dynamic batch formation using configurable size and timeout.
  • Backend: abstract backend interface with mock, llama.cpp, and vLLM adapters. The real backends manage external llama-server / vllm serve subprocesses.
  • Worker path: background batch execution with per-request result routing through completion or streaming channels.
  • Streaming: Server-Sent Events for streaming chat completions.
  • Metrics: in-memory counters, latency histograms, optional GPU telemetry, Prometheus text-format export, and structured JSON request logs.
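The scheduler's dynamic batch formation (configurable size and timeout) can be sketched roughly as follows. This is an illustration under assumptions, not the repository's scheduler: `collect_batch` and `demo` are hypothetical names, and the policy shown (block for the first request, drain anything already queued, then wait out the remaining collection window) is one plausible reading of "size and timeout".

```python
import asyncio
from typing import Any


async def collect_batch(
    queue: "asyncio.Queue[Any]", max_batch_size: int = 8, batch_timeout_s: float = 0.010
) -> list[Any]:
    """Block for the first request, then gather more until full or the window closes."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + batch_timeout_s
    while len(batch) < max_batch_size:
        try:
            # Take anything that is already queued without waiting.
            batch.append(queue.get_nowait())
            continue
        except asyncio.QueueEmpty:
            pass
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            # Wait for a late arrival, but only until the window closes.
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return batch


async def demo() -> tuple[list[int], list[int]]:
    """Five queued requests with a batch cap of three form two batches."""
    queue: "asyncio.Queue[int]" = asyncio.Queue()
    for request_id in range(5):
        queue.put_nowait(request_id)
    first = await collect_batch(queue, max_batch_size=3, batch_timeout_s=0.01)
    second = await collect_batch(queue, max_batch_size=3, batch_timeout_s=0.01)
    return first, second
```

A larger timeout gives late arrivals more chance to join a batch at the cost of added queue wait for the first request, which is the tradeoff the batch sweep benchmark explores.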

Current Status

What is implemented and tested:

  • GET /health
  • GET /metrics (JSON and Prometheus text format)
  • POST /v1/chat/completions (OpenAI-compatible)
  • non-streaming responses
  • streaming responses (Server-Sent Events)
  • FIFO scheduler
  • Priority scheduler with optional aging fairness policy
  • mock token generation
  • llama.cpp backend adapter with llama-server subprocess lifecycle
  • vLLM backend adapter with vllm serve subprocess lifecycle
  • background async worker execution with per-request response channels
  • native-backend worker fan-out so vLLM can receive concurrent requests and manage continuous batching internally
  • dynamic micro-batching with configurable maximum size and collection timeout
  • non-streaming request timeout handling and streaming disconnect cancellation
  • optional admission control with maximum queue size and token-bucket request rate limiting
  • batch-aware mock backend with shared prefill/decode simulation
  • batch size, queue wait, TTFT, total latency, and token metrics
  • optional nvidia-smi GPU memory/utilization sampler with unavailable fallback
  • real vLLM GPU smoke/load artifacts on an RTX 4090 with Qwen2.5-0.5B-Instruct
  • priority scheduler benchmark with high-priority TTFT, low-priority queue wait, starvation, fairness, and aging-policy metrics
  • rejected-request metrics by reason and priority
  • JSON metrics snapshots and Prometheus-style text exposition
  • structured JSON request lifecycle logging
  • CI quality gates with ruff, mypy, and pytest
  • pytest coverage for core paths (73 test cases)

Known limitations:

  • mock-backend benchmarks validate serving behavior, not real GPU throughput
  • llama.cpp benchmark artifacts are CPU-only and hardware-specific
  • vLLM GPU artifacts validate backend integration; they are not tuned maximum-throughput runs
  • GPU metrics use nvidia-smi; unsupported hosts report unavailable
  • Redis-backed queues are not implemented; the request queue is in-process only
  • production authentication, distributed rate limiting, and multi-node serving are not implemented

Benchmark Coverage

Benchmark scripts and saved artifacts track:

  • tokens/s
  • P50 latency
  • P95 latency
  • TTFT
  • queue wait time
  • batch size distribution
  • vLLM GPU smoke/load latency and throughput
  • priority scheduling fairness/starvation tradeoffs

Mock-backend runs are complete and documented. The repository also includes CPU-only llama.cpp notes and RTX 4090 vLLM smoke/load artifacts to show how mock results differ from real backends. The curated benchmark write-up lives in docs/benchmark_report.md; raw JSON artifacts live under benchmarks/results/.
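For readers unfamiliar with how numbers such as P50, P95, and tokens/s are typically derived from per-request samples, the following is a minimal sketch. It uses the nearest-rank percentile definition, which is an assumption; the repository's benchmark scripts may interpolate or use a different method, and the function and key names here are hypothetical.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def summarize(latencies_s: list[float], generated_tokens: int, wall_time_s: float) -> dict[str, float]:
    """Collapse one benchmark run into the headline numbers."""
    return {
        "p50_latency_s": percentile(latencies_s, 50),
        "p95_latency_s": percentile(latencies_s, 95),
        # Throughput is measured over wall-clock time, not summed request latency.
        "tokens_per_s": generated_tokens / wall_time_s,
    }


# Illustrative inputs only; these are not measurements from the repository.
EXAMPLE_LATENCIES_S = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
EXAMPLE_SUMMARY = summarize(EXAMPLE_LATENCIES_S, generated_tokens=320, wall_time_s=2.0)
```

With only ten samples, nearest-rank P95 is simply the slowest request, which is why runs with few requests should be read with care.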

Run Locally

python -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
uvicorn llm_runtime.main:app --reload

On Windows PowerShell, activate with .venv\Scripts\Activate.ps1. For dependency upgrades, edit pyproject.toml first and then refresh requirements-dev.txt from a tested virtual environment.

Health check:

curl http://127.0.0.1:8000/health

Runtime metrics:

curl http://127.0.0.1:8000/metrics

Abbreviated JSON output after a small mock request:

{
  "request_count": 1,
  "completed_count": 1,
  "failed_count": 0,
  "rejected_count": 0,
  "rejections_by_reason": {},
  "rejections_by_priority": {},
  "active_requests": 0,
  "generated_tokens_total": 4,
  "batch_count": 1,
  "batch_size_avg": 1.0,
  "batch_size_distribution": {"1": 1},
  "queue_wait_time_avg_s": 0.0001,
  "ttft_avg_s": 0.045,
  "total_latency_avg_s": 0.095,
  "gpu": {
    "status": "unavailable",
    "source": "nvidia-smi",
    "reason": "nvidia-smi not found",
    "gpu_count": 0,
    "memory_used_mb": 0,
    "memory_total_mb": 0,
    "utilization_pct": 0,
    "gpus": []
  },
  "queue_size": 0
}

Non-streaming chat completion:

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"mock-llm\",\"messages\":[{\"role\":\"user\",\"content\":\"hello\"}],\"max_tokens\":8}"

Streaming chat completion:

curl -N -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"mock-llm\",\"messages\":[{\"role\":\"user\",\"content\":\"stream a short reply\"}],\"max_tokens\":4,\"stream\":true}"

Example SSE output:

data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "model": "mock-llm", "choices": [{"index": 0, "delta": {"content": "tok0 "}, "finish_reason": null}]}

data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "model": "mock-llm", "choices": [{"index": 0, "delta": {"content": "tok1 "}, "finish_reason": null}]}

data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "model": "mock-llm", "choices": [{"index": 0, "delta": {"content": "tok2 "}, "finish_reason": null}]}

data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "model": "mock-llm", "choices": [{"index": 0, "delta": {"content": "tok3 "}, "finish_reason": null}]}

data: [DONE]
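A client can reassemble the reply from a stream like the one above by reading each `data:` line, stopping at `[DONE]`, and concatenating the `delta.content` fields. The sketch below works on an iterable of already-received lines so it can run without a server; `iter_sse_content` is a hypothetical helper name, and the sample lines are shortened copies of the output shown above.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield delta content strings from chat.completion.chunk SSE lines until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and other SSE fields are skipped
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


SAMPLE_LINES = [
    'data: {"object": "chat.completion.chunk", "model": "mock-llm", '
    '"choices": [{"index": 0, "delta": {"content": "tok0 "}, "finish_reason": null}]}',
    "",
    'data: {"object": "chat.completion.chunk", "model": "mock-llm", '
    '"choices": [{"index": 0, "delta": {"content": "tok1 "}, "finish_reason": null}]}',
    "",
    "data: [DONE]",
]

REPLY = "".join(iter_sse_content(SAMPLE_LINES))
```

The same function can be fed lines from any HTTP client that exposes a streaming response line by line.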

Admission control example:

LLM_RUNTIME_MAX_QUEUE_SIZE=128 \
LLM_RUNTIME_REQUEST_RATE_LIMIT_PER_S=50 \
LLM_RUNTIME_REQUEST_RATE_LIMIT_BURST=100 \
uvicorn llm_runtime.main:app

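When admission control is enabled, a request that would exceed the maximum queue size is rejected before enqueue with HTTP 503, and a request that exceeds the token-bucket rate limit is rejected with HTTP 429. Rejection counts are exported in both the JSON metrics and the Prometheus text format.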
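The two guards can be sketched as a queue-size check followed by a token bucket. This is a minimal illustration, not the repository's AdmissionController: the names `TokenBucket` and `admit`, the order of the checks, and the `>=` comparison against the queue cap are all assumptions. The injectable `clock` parameter exists only so the behavior can be exercised deterministically.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Refills at rate_per_s tokens per second, holding at most burst tokens."""

    def __init__(
        self, rate_per_s: float, burst: int, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.rate_per_s = rate_per_s
        self.burst = burst
        self._clock = clock
        self._tokens = float(burst)  # start full so an initial burst is allowed
        self._last = clock()

    def try_acquire(self) -> bool:
        now = self._clock()
        # Credit tokens for the time elapsed since the last call, capped at burst.
        self._tokens = min(float(self.burst), self._tokens + (now - self._last) * self.rate_per_s)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def admit(queue_size: int, max_queue_size: int, bucket: TokenBucket) -> Optional[int]:
    """Return None to admit the request, or the HTTP status code to reject it with."""
    if queue_size >= max_queue_size:
        return 503  # queue is at capacity
    if not bucket.try_acquire():
        return 429  # rate limit exceeded
    return None
```

With `rate_per_s=50` and `burst=100`, matching the environment variables above, a client could send 100 requests at once and then sustain 50 per second before seeing 429 responses.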

Run tests:

python -m pytest

Run local quality checks:

python -m ruff check src tests benchmarks
python -m mypy
python -m pytest -q

Cloud GPU vLLM Benchmark

For the vLLM GPU smoke and native-concurrency benchmarks, download only the model assets needed to serve Qwen/Qwen2.5-0.5B-Instruct. The helper is retryable, writes to a local models/ directory, and keeps Hugging Face cache data under .hf-cache/ so a failed network session can usually continue from partial downloads.

cd ~/llm-serving-runtime
git pull
git rev-parse --short HEAD

source .venv/bin/activate
python -m pip install "huggingface_hub>=0.24"

python scripts/download_vllm_smoke_assets.py \
  --hf-endpoint https://hf-mirror.com \
  --retries 20 \
  --retry-sleep-s 30 \
  --max-workers 1

When the script finishes, copy the printed export ... lines, then start the runtime with the local model directory:

export LLM_RUNTIME_BACKEND=vllm
export LLM_RUNTIME_MODEL_PATH=/root/llm-serving-runtime/models/Qwen2.5-0.5B-Instruct
export LLM_RUNTIME_NATIVE_BACKEND_CONCURRENCY=8
export LLM_RUNTIME_VLLM_GPU_MEMORY_UTILIZATION=0.85
export LLM_RUNTIME_VLLM_MAX_MODEL_LEN=2048
export LLM_RUNTIME_REQUEST_TIMEOUT_S=300
uvicorn llm_runtime.main:app --host 0.0.0.0 --port 8000

If the mirror is unavailable, omit --hf-endpoint https://hf-mirror.com and rerun the same command. If the SSH connection drops, reconnect and rerun the same command from the project directory.

To record the native-concurrency vLLM benchmark artifact, start the runtime with LLM_RUNTIME_NATIVE_BACKEND_CONCURRENCY=8, then run this from a second SSH session:

cd ~/llm-serving-runtime
source .venv/bin/activate

python benchmarks/run_load_test.py \
  --base-url http://127.0.0.1:8000 \
  --concurrency 8 \
  --requests 32 \
  --max-tokens 32 \
  --output benchmarks/results/vllm_gpu_native_concurrency_0_5b_c8.json

curl -s http://127.0.0.1:8000/metrics \
  | python -m json.tool \
  > benchmarks/results/vllm_gpu_metrics_after_native_0_5b_c8.json

Run with Docker

docker compose up --build llm-runtime

Run a quick benchmark against the compose service:

docker compose --profile benchmark up --build benchmark

Compare Batching Modes

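Run the reproducible mock-backend comparison suite. The commands in this section set environment variables with PowerShell syntax ($env:NAME="value"); in bash, prefix the command with NAME=value or use export instead: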

$env:PYTHONPATH="src"; python benchmarks/run_local_comparison.py --mode both --levels 1 8 16 32 64 --requests 64 --max-tokens 32 --max-batch-size 8 --batch-timeout-ms 10 --prefill-latency-ms 25 --decode-latency-ms 10 --output-dir benchmarks/results
python benchmarks/analyze_results.py --baseline benchmarks/results/fifo_baseline.json --dynamic benchmarks/results/dynamic_batching.json

The runner creates a fresh local Uvicorn app for each mode, ensuring in-memory metrics do not leak between the FIFO and dynamic batching experiments. The first recorded mock result is documented in docs/benchmark_report.md.

Run a fixed high-concurrency parameter sweep:

$env:PYTHONPATH="src"; python benchmarks/run_batch_sweep.py --concurrency 64 --requests 64 --max-tokens 32 --batch-sizes 2 4 8 16 --timeouts 0 5 10 20 --prefill-latency-ms 25 --decode-latency-ms 10 --output benchmarks/results/batch_sweep_c64.json
python benchmarks/analyze_batch_sweep.py --sweep benchmarks/results/batch_sweep_c64.json --baseline benchmarks/results/fifo_baseline.json --ttft-budget-ms 1000

Run the mixed-priority scheduler benchmark:

python benchmarks/run_priority_scheduler_benchmark.py --output benchmarks/results/priority_scheduler_mixed.json

The default workload creates a low-priority backlog, injects high-priority requests after 40 ms, and compares FIFO, strict priority, and priority aging. The saved artifact reports high-priority TTFT, low-priority queue wait, Jain fairness over inverse queue wait, and low-priority starvation counts.
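Two ideas in that description can be made concrete with a short sketch: Jain's fairness index computed over inverse queue wait, and a linear aging rule that lets long-waiting low-priority requests eventually overtake newer high-priority ones. Jain's index is the standard formula (sum of x)^2 / (n * sum of x^2). Everything else here is an assumption for illustration: the function names, the `floor_s` guard against zero waits, the linear aging formula, and the convention that a larger number means higher priority. The repository's scheduler and benchmark may differ.

```python
def jain_fairness(values: list[float]) -> float:
    """Jain's index: (sum x)^2 / (n * sum x^2). 1.0 is perfectly even, 1/n is maximally uneven."""
    if not values:
        raise ValueError("values must not be empty")
    total = sum(values)
    squares = sum(v * v for v in values)
    if squares == 0:
        return 1.0  # all values are zero, so treat the allocation as even
    return total * total / (len(values) * squares)


def fairness_over_inverse_wait(queue_waits_s: list[float], floor_s: float = 1e-6) -> float:
    """Shorter waits count as better service; floor_s avoids division by zero."""
    return jain_fairness([1.0 / max(wait, floor_s) for wait in queue_waits_s])


def pick_next(waiting: list[tuple[str, int, float]], now_s: float, aging_per_s: float) -> str:
    """Choose from (request_id, base_priority, enqueue_time_s) by aged priority.

    Effective priority = base priority + aging_per_s * time spent waiting.
    With aging_per_s = 0 this reduces to strict priority.
    """
    return max(waiting, key=lambda r: r[1] + aging_per_s * (now_s - r[2]))[0]


# A low-priority request that has waited 1.0 s versus a high-priority one that has waited 0.1 s.
WAITING = [("low", 0, 0.0), ("high", 10, 0.9)]
STRICT_CHOICE = pick_next(WAITING, now_s=1.0, aging_per_s=0.0)
AGED_CHOICE = pick_next(WAITING, now_s=1.0, aging_per_s=20.0)
```

In this example strict priority serves "high" first, while an aging rate of 20 priority units per second lets "low" win, which is the starvation-versus-TTFT tradeoff the benchmark reports.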

Generate a scratch Markdown view from raw benchmark JSON artifacts:

python benchmarks/generate_report.py --output benchmarks/results/generated_benchmark_report.md

This generated Markdown is intentionally ignored by git. Keep curated benchmark analysis in docs/benchmark_report.md.

To manually run a server with dynamic batching enabled:

$env:LLM_RUNTIME_ENABLE_BATCHING="true"
$env:LLM_RUNTIME_MAX_BATCH_SIZE="8"
$env:LLM_RUNTIME_BATCH_TIMEOUT_MS="10"
uvicorn llm_runtime.main:app

To run with the llama.cpp backend:

$env:LLM_RUNTIME_BACKEND="llama.cpp"
$env:LLM_RUNTIME_MODEL_PATH="path/to/model.gguf"
$env:LLM_RUNTIME_N_GPU_LAYERS="35"
uvicorn llm_runtime.main:app

About

GPU-aware LLM serving runtime with async scheduling, dynamic batching, streaming, admission control, metrics, and mock/llama.cpp/vLLM backends.
