A small local LLM serving runtime focused on the systems work behind inference serving: async request handling, scheduling, batching, streaming, metrics, and reproducible benchmarking.
This project is intentionally not an agent framework. It does not implement tools, RAG, fine-tuning, or CUDA kernels. The default path uses a mock backend that simulates prefill and decode latency so the serving path can be tested without a GPU.
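As a rough illustration of what "simulates prefill and decode latency" can mean, here is a minimal sketch of a mock generator. The function names `mock_generate` and `collect` are hypothetical and not taken from the repository. The default delays mirror the 25 ms prefill and 10 ms decode values used in the benchmark commands in this README, and the `tokN ` output mirrors the example SSE output.

```python
import asyncio
import time
from collections.abc import AsyncIterator


async def mock_generate(
    max_tokens: int,
    prefill_latency_s: float = 0.025,
    decode_latency_s: float = 0.010,
) -> AsyncIterator[str]:
    """Yield placeholder tokens after a simulated prefill and per-token decode delay."""
    await asyncio.sleep(prefill_latency_s)  # one-off cost of processing the prompt
    for i in range(max_tokens):
        await asyncio.sleep(decode_latency_s)  # cost of producing each token
        yield f"tok{i} "


async def collect(max_tokens: int) -> tuple[list[str], float]:
    """Drain the generator and return the tokens plus elapsed wall time in seconds."""
    start = time.perf_counter()
    tokens = [tok async for tok in mock_generate(max_tokens)]
    return tokens, time.perf_counter() - start
```

Calling `asyncio.run(collect(4))` yields four placeholder tokens. Under these assumptions, time-to-first-token is roughly prefill plus one decode step, and total latency grows linearly with `max_tokens`.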
LLM serving is a systems problem: requests arrive concurrently, wait in queues, get scheduled into batches, stream tokens back to clients, and produce latency and throughput metrics. This repository demonstrates that engineering surface in a compact, inspectable Python service.
The goal is to make tradeoffs visible:
- FIFO vs priority scheduling
- single-request serving vs dynamic batching
- time-to-first-token vs total latency
- queue wait time under load
- backend adapter boundaries for llama.cpp and vLLM
flowchart LR
Client["HTTP / SSE client"] --> API["FastAPI API<br/>/v1/chat/completions"]
API --> Admission["AdmissionController<br/>queue cap + token bucket"]
Admission --> Request["RuntimeRequest<br/>priority, stream flag, timing"]
Request --> Scheduler["Scheduler<br/>FIFO or priority"]
Scheduler --> Worker["WorkerManager<br/>batch formation + dispatch"]
Worker --> Backend["Backend adapter<br/>mock / llama.cpp / vLLM"]
Backend --> Worker
Worker --> Handles["CompletionHandle / StreamingHandle"]
Handles --> API
API --> Client
Worker --> Metrics["MetricsCollector<br/>queue wait, TTFT, latency"]
Metrics --> GPU["NvidiaSmiSampler<br/>optional GPU telemetry"]
API --> MetricsEndpoint["/metrics<br/>JSON or Prometheus"]
Metrics --> MetricsEndpoint
The runtime has these layers:
- API layer: FastAPI routes for health, metrics, and OpenAI-style chat completions.
- Admission control: optional queue-cap and token-bucket guards that reject overload before requests enter the scheduler.
- Core request model: internal request object with ID, messages, priority, timing, stream flag, and sampling params.
- Scheduler: FIFO and priority-based schedulers, both with dynamic batch formation using configurable size and timeout.
- Backend: abstract backend interface with mock, llama.cpp, and vLLM adapters. The real backends manage external llama-server / vllm serve subprocesses.
- Worker path: background batch execution with per-request result routing through completion or streaming channels.
- Streaming: Server-Sent Events for streaming chat completions.
- Metrics: in-memory counters, latency histograms, optional GPU telemetry, Prometheus text-format export, and structured JSON request logs.
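Dynamic batch formation with a size cap and a collection timeout can be sketched as follows. This is an illustrative assumption about the general technique, not the repository's scheduler code; `collect_batch` and `demo` are hypothetical names.

```python
import asyncio
from typing import Any


async def collect_batch(
    queue: asyncio.Queue[Any],
    max_batch_size: int,
    batch_timeout_s: float,
) -> list[Any]:
    """Wait for one request, then gather more until the batch is full or time runs out."""
    batch = [await queue.get()]  # block until at least one request exists
    loop = asyncio.get_running_loop()
    deadline = loop.time() + batch_timeout_s
    while len(batch) < max_batch_size:
        try:
            batch.append(queue.get_nowait())  # take anything already waiting
            continue
        except asyncio.QueueEmpty:
            pass
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return batch


async def demo() -> list[list[str]]:
    """Form two batches from five queued requests with a batch size cap of four."""
    queue: asyncio.Queue[str] = asyncio.Queue()
    for i in range(5):
        queue.put_nowait(f"req-{i}")
    first = await collect_batch(queue, max_batch_size=4, batch_timeout_s=0.01)
    second = await collect_batch(queue, max_batch_size=4, batch_timeout_s=0.01)
    return [first, second]
```

In this sketch the first batch fills immediately, while the second waits out the timeout with a single request. That is the tradeoff the timeout controls: a longer wait can produce larger batches at the cost of extra queue time.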
What is implemented and tested:
- GET /health
- GET /metrics (JSON and Prometheus text format)
- POST /v1/chat/completions (OpenAI-compatible)
- non-streaming responses
- streaming responses (Server-Sent Events)
- FIFO scheduler
- Priority scheduler with optional aging fairness policy
- mock token generation
- llama.cpp backend adapter with llama-server subprocess lifecycle
- vLLM backend adapter with vllm serve subprocess lifecycle
- background async worker execution with per-request response channels
- native-backend worker fan-out so vLLM can receive concurrent requests and manage continuous batching internally
- dynamic micro-batching with configurable maximum size and collection timeout
- non-streaming request timeout handling and streaming disconnect cancellation
- optional admission control with maximum queue size and token-bucket request rate limiting
- batch-aware mock backend with shared prefill/decode simulation
- batch size, queue wait, TTFT, total latency, and token metrics
- optional nvidia-smi GPU memory/utilization sampler with unavailable fallback
- real vLLM GPU smoke/load artifacts on an RTX 4090 with Qwen2.5-0.5B-Instruct
- priority scheduler benchmark with high-priority TTFT, low-priority queue wait, starvation, fairness, and aging-policy metrics
- rejected-request metrics by reason and priority
- JSON metrics snapshots and Prometheus-style text exposition
- structured JSON request lifecycle logging
- CI quality gates with ruff, mypy, and pytest
- pytest coverage for core paths (73 test cases)
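The "priority scheduler with optional aging" item above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Pending` type, the convention that a lower number means more urgent, and the linear aging formula are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    request_id: str
    priority: int       # assumption: lower value means more urgent
    enqueued_at: float  # seconds on a monotonic clock


def pick_next(pending: list[Pending], now: float, aging_per_s: float) -> Pending:
    """Remove and return the request with the best effective priority.

    Each second of waiting improves a request's effective priority by
    aging_per_s, so old low-priority work eventually overtakes new
    high-priority work. With aging_per_s=0 this is strict priority,
    with arrival order as the tie-breaker.
    """

    def effective(item: Pending) -> tuple[float, float]:
        waited = now - item.enqueued_at
        return (item.priority - aging_per_s * waited, item.enqueued_at)

    chosen = min(pending, key=effective)
    pending.remove(chosen)
    return chosen
```

For example, a priority-5 request that has waited 10 s beats a fresh priority-1 request when `aging_per_s=0.5`, but loses to it under strict priority.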
Known limitations:
- mock-backend benchmarks validate serving behavior, not real GPU throughput
- llama.cpp benchmark artifacts are CPU-only and hardware-specific
- vLLM GPU artifacts validate backend integration; they are not tuned maximum-throughput runs
- GPU metrics use nvidia-smi; unsupported hosts report unavailable
- Redis-backed queues
- production authentication, distributed rate limiting, and multi-node serving
Benchmark scripts and saved artifacts track:
- tokens/s
- P50 latency
- P95 latency
- TTFT
- queue wait time
- batch size distribution
- vLLM GPU smoke/load latency and throughput
- priority scheduling fairness/starvation tradeoffs
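For readers unfamiliar with these metrics, the sketch below shows one common way to compute P50/P95 latency and tokens/s from raw samples. It uses the nearest-rank percentile definition, which is an assumption; the repository's benchmark scripts may interpolate or use a different method. The names `percentile` and `summarize` are hypothetical.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(
    latencies_s: list[float], generated_tokens: int, wall_time_s: float
) -> dict[str, float]:
    """Reduce per-request latencies and a token count to headline numbers."""
    return {
        "p50_latency_s": percentile(latencies_s, 50),
        "p95_latency_s": percentile(latencies_s, 95),
        "tokens_per_s": generated_tokens / wall_time_s,
    }
```

P95 is usually the more informative number under load, because queue wait inflates the tail well before it moves the median.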
Mock-backend runs are complete and documented. The repository also includes
CPU-only llama.cpp notes and RTX 4090 vLLM smoke/load artifacts to show how
mock results differ from real backends.
The curated benchmark write-up lives in docs/benchmark_report.md; raw JSON
artifacts live under benchmarks/results/.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
uvicorn llm_runtime.main:app --reload
On Windows PowerShell, activate with .venv\Scripts\Activate.ps1.
For dependency upgrades, edit pyproject.toml first and then refresh
requirements-dev.txt from a tested virtual environment.
Health check:
curl http://127.0.0.1:8000/health
Runtime metrics:
curl http://127.0.0.1:8000/metrics
Abbreviated JSON output after a small mock request:
{
"request_count": 1,
"completed_count": 1,
"failed_count": 0,
"rejected_count": 0,
"rejections_by_reason": {},
"rejections_by_priority": {},
"active_requests": 0,
"generated_tokens_total": 4,
"batch_count": 1,
"batch_size_avg": 1.0,
"batch_size_distribution": {"1": 1},
"queue_wait_time_avg_s": 0.0001,
"ttft_avg_s": 0.045,
"total_latency_avg_s": 0.095,
"gpu": {
"status": "unavailable",
"source": "nvidia-smi",
"reason": "nvidia-smi not found",
"gpu_count": 0,
"memory_used_mb": 0,
"memory_total_mb": 0,
"utilization_pct": 0,
"gpus": []
},
"queue_size": 0
}
Non-streaming chat completion:
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{\"model\":\"mock-llm\",\"messages\":[{\"role\":\"user\",\"content\":\"hello\"}],\"max_tokens\":8}"
Streaming chat completion:
curl -N -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{\"model\":\"mock-llm\",\"messages\":[{\"role\":\"user\",\"content\":\"stream a short reply\"}],\"max_tokens\":4,\"stream\":true}"
Example SSE output:
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "model": "mock-llm", "choices": [{"index": 0, "delta": {"content": "tok0 "}, "finish_reason": null}]}
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "model": "mock-llm", "choices": [{"index": 0, "delta": {"content": "tok1 "}, "finish_reason": null}]}
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "model": "mock-llm", "choices": [{"index": 0, "delta": {"content": "tok2 "}, "finish_reason": null}]}
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk", "model": "mock-llm", "choices": [{"index": 0, "delta": {"content": "tok3 "}, "finish_reason": null}]}
data: [DONE]
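A client can reassemble the streamed text from output like the above by reading `data:` lines until the `[DONE]` sentinel. The following is a minimal sketch that works on an iterable of already-decoded lines; `iter_sse_content` is a hypothetical helper, and a real client would feed it lines from an HTTP streaming response.

```python
import json
from collections.abc import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the delta content of each chunk until the [DONE] sentinel appears."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content
```

Joining the yielded pieces from the four-chunk example above would give `tok0 tok1 tok2 tok3 `.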
Admission control example:
LLM_RUNTIME_MAX_QUEUE_SIZE=128 \
LLM_RUNTIME_REQUEST_RATE_LIMIT_PER_S=50 \
LLM_RUNTIME_REQUEST_RATE_LIMIT_BURST=100 \
uvicorn llm_runtime.main:app
When enabled, requests that would overflow the queue are rejected before
enqueue with HTTP 503, and requests that exceed the token-bucket rate limit are
rejected with HTTP 429. Rejections are exported in JSON metrics and Prometheus
text format.
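The interaction between the queue cap and the token bucket can be sketched as below. This is a generic token-bucket illustration, not the repository's AdmissionController: the class and function names are hypothetical, and the order of the two checks (queue cap first, then rate limit) is an assumption.

```python
import time
from typing import Optional


class TokenBucket:
    """Allow rate_per_s requests per second on average, with bursts up to burst."""

    def __init__(self, rate_per_s: float, burst: int, now: Optional[float] = None) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.updated_at = time.monotonic() if now is None else now

    def try_acquire(self, now: Optional[float] = None) -> bool:
        """Refill according to elapsed time, then spend one token if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admit(
    queue_size: int,
    max_queue_size: int,
    bucket: TokenBucket,
    now: Optional[float] = None,
) -> int:
    """Return 200 to admit, 503 when the queue is full, or 429 when rate limited."""
    if queue_size >= max_queue_size:
        return 503
    if not bucket.try_acquire(now):
        return 429
    return 200
```

With the example settings above (rate 50, burst 100), a client could send 100 requests at once and then sustain about 50 per second before seeing 429 responses, provided the queue stays below 128 entries.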
Run tests:
python -m pytest
Run local quality checks:
python -m ruff check src tests benchmarks
python -m mypy
python -m pytest -q
For the vLLM GPU smoke and native-concurrency benchmarks, download only the
model assets needed to serve Qwen/Qwen2.5-0.5B-Instruct. The helper is
retryable, writes to a local
models/ directory, and keeps Hugging Face cache data under .hf-cache/ so a
failed network session can usually continue from partial downloads.
cd ~/llm-serving-runtime
git pull
git rev-parse --short HEAD
source .venv/bin/activate
python -m pip install "huggingface_hub>=0.24"
python scripts/download_vllm_smoke_assets.py \
--hf-endpoint https://hf-mirror.com \
--retries 20 \
--retry-sleep-s 30 \
--max-workers 1
When the script finishes, copy the printed export ... lines, then start the
runtime with the local model directory:
export LLM_RUNTIME_BACKEND=vllm
export LLM_RUNTIME_MODEL_PATH=/root/llm-serving-runtime/models/Qwen2.5-0.5B-Instruct
export LLM_RUNTIME_NATIVE_BACKEND_CONCURRENCY=8
export LLM_RUNTIME_VLLM_GPU_MEMORY_UTILIZATION=0.85
export LLM_RUNTIME_VLLM_MAX_MODEL_LEN=2048
export LLM_RUNTIME_REQUEST_TIMEOUT_S=300
uvicorn llm_runtime.main:app --host 0.0.0.0 --port 8000
If the mirror is unavailable, omit --hf-endpoint https://hf-mirror.com and
rerun the same command. If the SSH connection drops, reconnect and rerun the
same command from the project directory.
To record the native-concurrency vLLM benchmark artifact, start the runtime
with LLM_RUNTIME_NATIVE_BACKEND_CONCURRENCY=8, then run this from a second
SSH session:
cd ~/llm-serving-runtime
source .venv/bin/activate
python benchmarks/run_load_test.py \
--base-url http://127.0.0.1:8000 \
--concurrency 8 \
--requests 32 \
--max-tokens 32 \
--output benchmarks/results/vllm_gpu_native_concurrency_0_5b_c8.json
curl -s http://127.0.0.1:8000/metrics \
| python -m json.tool \
> benchmarks/results/vllm_gpu_metrics_after_native_0_5b_c8.json
Run the runtime with Docker Compose:
docker compose up --build llm-runtime
Run a quick benchmark against the compose service:
docker compose --profile benchmark up --build benchmark
Run the reproducible mock-backend comparison suite:
$env:PYTHONPATH="src"; python benchmarks/run_local_comparison.py --mode both --levels 1 8 16 32 64 --requests 64 --max-tokens 32 --max-batch-size 8 --batch-timeout-ms 10 --prefill-latency-ms 25 --decode-latency-ms 10 --output-dir benchmarks/results
python benchmarks/analyze_results.py --baseline benchmarks/results/fifo_baseline.json --dynamic benchmarks/results/dynamic_batching.json
The runner creates a fresh local Uvicorn app for each mode, ensuring in-memory
metrics do not leak between the FIFO and dynamic batching experiments. The
first recorded mock result is documented in docs/benchmark_report.md.
Run a fixed high-concurrency parameter sweep:
$env:PYTHONPATH="src"; python benchmarks/run_batch_sweep.py --concurrency 64 --requests 64 --max-tokens 32 --batch-sizes 2 4 8 16 --timeouts 0 5 10 20 --prefill-latency-ms 25 --decode-latency-ms 10 --output benchmarks/results/batch_sweep_c64.json
python benchmarks/analyze_batch_sweep.py --sweep benchmarks/results/batch_sweep_c64.json --baseline benchmarks/results/fifo_baseline.json --ttft-budget-ms 1000
Run the mixed-priority scheduler benchmark:
python benchmarks/run_priority_scheduler_benchmark.py --output benchmarks/results/priority_scheduler_mixed.json
The default workload creates a low-priority backlog, injects high-priority requests after 40 ms, and compares FIFO, strict priority, and priority aging. The saved artifact reports high-priority TTFT, low-priority queue wait, Jain fairness over inverse queue wait, and low-priority starvation counts.
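Jain's fairness index for values x_1..x_n is (sum x)^2 / (n * sum x^2); it equals 1.0 when all values are equal and approaches 1/n when one value dominates. The sketch below applies it to inverse queue wait, so that long waits count as small shares. The function names and the `epsilon_s` floor that guards against division by zero are assumptions for illustration, not the benchmark script's actual code.

```python
def jain_fairness(values: list[float]) -> float:
    """Jain's index: (sum x)^2 / (n * sum x^2). 1.0 means perfectly equal shares."""
    if not values:
        raise ValueError("no values")
    total = sum(values)
    squares = sum(v * v for v in values)
    if squares == 0:
        return 1.0  # every share is zero, so all shares are equal
    return (total * total) / (len(values) * squares)


def fairness_over_inverse_wait(
    queue_waits_s: list[float], epsilon_s: float = 1e-6
) -> float:
    """Treat 1 / wait as each request's share; epsilon_s avoids dividing by zero."""
    return jain_fairness([1.0 / max(wait, epsilon_s) for wait in queue_waits_s])
```

Under this definition, a strict priority run in which a few requests wait far longer than the rest would score noticeably below 1.0, while equal waits score exactly 1.0.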
Generate a scratch Markdown view from raw benchmark JSON artifacts:
python benchmarks/generate_report.py --output benchmarks/results/generated_benchmark_report.md
This generated Markdown is intentionally ignored by git. Keep curated benchmark
analysis in docs/benchmark_report.md.
To manually run a server with dynamic batching enabled:
$env:LLM_RUNTIME_ENABLE_BATCHING="true"
$env:LLM_RUNTIME_MAX_BATCH_SIZE="8"
$env:LLM_RUNTIME_BATCH_TIMEOUT_MS="10"
uvicorn llm_runtime.main:app
To run with the llama.cpp backend:
$env:LLM_RUNTIME_BACKEND="llama.cpp"
$env:LLM_RUNTIME_MODEL_PATH="path/to/model.gguf"
$env:LLM_RUNTIME_N_GPU_LAYERS="35"
uvicorn llm_runtime.main:app