Layyyth/InferenceOps

InferenceOps Assignment

End-to-end LLM inference system with experimentation and monitoring.

Architecture

Client → nginx (8780) → Gateway (8000) → Modal vLLM
                             ↓
                        Prometheus (9090)
  • nginx: Load balancer with health checks
  • Gateway: FastAPI with X-Technique routing, Prometheus metrics
  • vLLM: Modal-hosted Mistral-7B (baseline + chunked prefill)
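The X-Technique routing step can be pictured as a small lookup from the header value to one of the two Modal URLs. The sketch below is an illustration only, not the gateway's actual code: the function name `resolve_upstream`, the `TECHNIQUE_ENV_VARS` table, and the fallback to `baseline` when the header is absent are all assumptions.

```python
import os

# Hypothetical mapping from X-Technique values to the variables set in .env.
TECHNIQUE_ENV_VARS = {
    "baseline": "VLLM_BASELINE_URL",
    "chunked_prefill": "VLLM_CHUNKED_URL",
}


def resolve_upstream(headers, env=None, default="baseline"):
    """Return (technique, upstream_url) for a request's headers.

    Falling back to `default` when the header is missing is an assumption;
    the real gateway may reject such requests instead.
    """
    env = os.environ if env is None else env
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    technique = lowered.get("x-technique", default).strip().lower()
    if technique not in TECHNIQUE_ENV_VARS:
        raise ValueError(f"unknown technique: {technique!r}")
    var_name = TECHNIQUE_ENV_VARS[technique]
    url = env.get(var_name)
    if not url:
        raise KeyError(f"{var_name} is not set")
    # Strip a trailing slash so callers can append paths such as /v1/...
    return technique, url.rstrip("/")
```

Keeping the lookup in a plain function makes it easy to unit-test without starting FastAPI or nginx.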

Setup

1. Clone and install

git clone <repo-url>
cd InferenceOps
uv pip install -r requirements.txt

2. Configure environment

cp .env.example .env
# Edit .env with your Modal URLs:
# VLLM_BASELINE_URL=https://your-account--vllm-baseline-serve.modal.run
# VLLM_CHUNKED_URL=https://your-account--vllm-chunked-prefill-serve.modal.run
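Before starting the stack it can help to confirm that both URLs are actually set in `.env`. The following is a minimal sketch of such a check; the helper names `parse_env` and `missing_vars` are hypothetical, and the parser handles only simple `KEY=value` lines rather than the full dotenv syntax.

```python
REQUIRED_VARS = ("VLLM_BASELINE_URL", "VLLM_CHUNKED_URL")


def parse_env(text):
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values, required=REQUIRED_VARS):
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    with open(".env", encoding="utf-8") as handle:
        missing = missing_vars(parse_env(handle.read()))
    print("missing:", ", ".join(missing) if missing else "none")
```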

3. Deploy Modal vLLM (if not already deployed)

modal deploy vllm_engine/modal_baseline.py
modal deploy vllm_engine/modal_chunked.py

Running the Stack

Start all services

Terminal 1 - Gateway:

uv run python gateway/main.py

Terminal 2 - nginx:

nginx -c $(pwd)/nginx/nginx.conf

Terminal 3 - Prometheus (optional):

docker run -p 9090:9090 -v $(pwd)/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

Verify stack

curl http://localhost:8780/health
curl http://localhost:8000/metrics

Experiments

Run inference experiments

uv run python experiments/runner.py

Runs a 60-second load test against both the baseline and chunked_prefill arms and saves the results to data/metrics/.
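A load-test run of this kind is usually summarized by throughput and latency percentiles for each arm. The sketch below shows one way to compute them from a list of per-request latencies. It is not taken from experiments/runner.py: the function names and the choice of the nearest-rank percentile method are assumptions.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, for 0 < p <= 100."""
    if not sorted_values:
        raise ValueError("no samples")
    # Multiply before dividing to keep the rank exact for whole-number inputs.
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(latencies_s, duration_s):
    """Summarize one arm: request count, throughput, and tail latencies."""
    ordered = sorted(latencies_s)
    return {
        "requests": len(ordered),
        "throughput_rps": len(ordered) / duration_s,
        "p50_s": percentile(ordered, 50),
        "p95_s": percentile(ordered, 95),
        "p99_s": percentile(ordered, 99),
    }
```

Calling `summarize` once per arm with the same `duration_s` gives two dictionaries that can be compared side by side.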

Evaluate code agent

uv run python experiments/evaluate_agent.py

Runs the code generation tasks in the golden set and reports pass rates.

Generate plots

uv run python experiments/generate_plots.py

Writes latency distribution and throughput comparison plots to figures/.

API Usage

Direct gateway

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Technique: baseline" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 10
  }'
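The same request can be issued from Python with only the standard library. This is a sketch that mirrors the curl call above; the helper names `build_chat_request` and `send` are hypothetical, and the response is returned as parsed JSON without assuming any particular fields.

```python
import json
import urllib.request


def build_chat_request(prompt, technique="baseline",
                       base_url="http://localhost:8000", max_tokens=10):
    """Build (but do not send) a POST request matching the curl example."""
    payload = {
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Technique": technique,
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the stack running, `send(build_chat_request("Hello"))` targets the gateway directly; passing `base_url="http://localhost:8780"` goes through nginx instead.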

Through nginx

curl -X POST http://localhost:8780/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Technique: chunked_prefill" \
  -d '{...}'

Project Structure

.
├── gateway/              # FastAPI gateway with metrics
├── vllm_engine/          # Modal deployment scripts
├── agent/                # Code generation agent
├── experiments/          # Load testing and evaluation
├── data/                 # Experiment results
├── figures/              # Generated plots
├── nginx/                # Load balancer config
└── monitoring/           # Prometheus config

Metrics

The gateway exposes Prometheus metrics at http://localhost:8000/metrics:

  • llm_gateway_requests_total{technique, model}
  • llm_gateway_request_duration_seconds{technique}
  • llm_gateway_errors_total{error_type, layer}
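The /metrics endpoint returns the Prometheus text exposition format, so the counters can also be inspected without a Prometheus server. The sketch below parses that text and totals `llm_gateway_requests_total` per technique. The function names are hypothetical, and the parser covers only the common `name{labels} value` line shape, not every corner of the format.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text, prefix="llm_gateway_"):
    """Return (name, labels, value) tuples for metrics starting with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and the # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if not match or not match.group(1).startswith(prefix):
            continue
        labels = dict(_LABEL.findall(match.group(2) or ""))
        samples.append((match.group(1), labels, float(match.group(3))))
    return samples


def requests_by_technique(samples):
    """Sum llm_gateway_requests_total across models for each technique."""
    totals = {}
    for name, labels, value in samples:
        if name == "llm_gateway_requests_total":
            key = labels.get("technique", "unknown")
            totals[key] = totals.get(key, 0.0) + value
    return totals
```

Feeding it the body of `curl http://localhost:8000/metrics` gives a quick per-arm request count after an experiment run.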

Submission

Generate submission PDF:

jupyter nbconvert submission.ipynb --to pdf

See submission.ipynb for full analysis and results.

About

Final Project for AI System Design & Inference Engineering
