InferenceOps Assignment

End-to-end LLM inference system with experimentation and monitoring.

Architecture

Client → nginx (8780) → Gateway (8000) → Modal vLLM
                             ↓
                        Prometheus (9090)

nginx: Load balancer with health checks
Gateway: FastAPI with X-Technique routing, Prometheus metrics
vLLM: Modal-hosted Mistral-7B (baseline + chunked prefill)
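The X-Technique routing above can be pictured as a lookup from header value to upstream URL. The following is a minimal sketch, not the gateway's actual code: `resolve_backend` and `TECHNIQUE_ENV_VARS` are hypothetical names, and falling back to `baseline` when the header is absent is an assumption. The environment variable names are the ones listed in the Setup section.

```python
import os

# Hypothetical mapping from X-Technique header values to the environment
# variables named in .env.example; the real gateway may route differently.
TECHNIQUE_ENV_VARS = {
    "baseline": "VLLM_BASELINE_URL",
    "chunked_prefill": "VLLM_CHUNKED_URL",
}


def resolve_backend(technique, env=None):
    """Return the upstream URL for a technique name.

    Assumption: a missing header falls back to the baseline arm.
    """
    env = os.environ if env is None else env
    name = (technique or "baseline").strip().lower()
    if name not in TECHNIQUE_ENV_VARS:
        raise ValueError(f"unknown technique: {technique!r}")
    var = TECHNIQUE_ENV_VARS[name]
    url = env.get(var)
    if not url:
        raise RuntimeError(f"{var} is not set")
    # Strip a trailing slash so paths can be appended consistently.
    return url.rstrip("/")
```

Passing `env` explicitly keeps the lookup easy to test without touching the real process environment.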

Setup

1. Clone and install

git clone <repo-url>
cd InferenceProject
uv pip install -r requirements.txt

2. Configure environment

cp .env.example .env
# Edit .env with your Modal URLs:
# VLLM_BASELINE_URL=https://your-account--vllm-baseline-serve.modal.run
# VLLM_CHUNKED_URL=https://your-account--vllm-chunked-prefill-serve.modal.run
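Before starting the gateway it can help to confirm that both URLs are actually set and not still commented out. The helper below is an illustrative sketch only; `parse_env_text` and `missing_keys` are hypothetical names, and the parser handles just plain `KEY=VALUE` lines.

```python
# Keys taken from the .env comments above.
REQUIRED_KEYS = ("VLLM_BASELINE_URL", "VLLM_CHUNKED_URL")


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes, if any.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the .env text."""
    values = parse_env_text(text)
    return [key for key in required if not values.get(key)]
```

For example, `missing_keys(open(".env").read())` returns an empty list once both URLs are filled in.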

3. Deploy Modal vLLM (if not done)

modal deploy vllm_engine/modal_baseline.py
modal deploy vllm_engine/modal_chunked.py

Running the Stack

Start all services

Terminal 1 - Gateway:

uv run python gateway/main.py

Terminal 2 - nginx:

nginx -c $(pwd)/nginx/nginx.conf

Terminal 3 - Prometheus (optional):

docker run -p 9090:9090 -v $(pwd)/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

Verify stack

curl http://localhost:8780/health
curl http://localhost:8000/metrics

Experiments

Run inference experiments

uv run python experiments/runner.py

Runs a 60-second load test against both the baseline and chunked_prefill arms. Results are saved to data/metrics/.
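To compare the two arms, per-request latencies are usually reduced to a few summary numbers. The sketch below assumes each arm yields a list of request latencies in seconds. The file format written by experiments/runner.py is not documented here, so `percentile` and `summarize` are hypothetical helpers, and the nearest-rank percentile is one common definition among several.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(latencies_s, window_s=60.0):
    """Summarize one arm: request count, throughput, and p50/p95 latency."""
    return {
        "count": len(latencies_s),
        "throughput_rps": len(latencies_s) / window_s,
        "p50_s": percentile(latencies_s, 50),
        "p95_s": percentile(latencies_s, 95),
    }
```

Calling `summarize` once per arm gives two dictionaries that can be compared side by side.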

Evaluate code agent

uv run python experiments/evaluate_agent.py

Runs the code generation tasks in the golden set and reports pass rates.

Generate plots

uv run python experiments/generate_plots.py

Writes latency distribution and throughput comparison plots to figures/.

API Usage

Direct gateway

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Technique: baseline" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 10
  }'

Through nginx

curl -X POST http://localhost:8780/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Technique: chunked_prefill" \
  -d '{...}'
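The same request can be issued from Python. This is a minimal sketch using only the standard library. `build_chat_request` and `send` are hypothetical helper names, the payload mirrors the direct-gateway curl example, and the 60-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_chat_request(base_url, technique, prompt, max_tokens=10):
    """Build a POST request equivalent to the curl examples above."""
    payload = {
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Technique": technique,
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With the stack running, `send(build_chat_request("http://localhost:8780", "chunked_prefill", "Hello"))` goes through nginx; pointing `base_url` at port 8000 talks to the gateway directly.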

Project Structure

.
├── gateway/              # FastAPI gateway with metrics
├── vllm_engine/          # Modal deployment scripts
├── agent/                # Code generation agent
├── experiments/          # Load testing and evaluation
├── data/                 # Experiment results
├── figures/              # Generated plots
├── nginx/                # Load balancer config
└── monitoring/           # Prometheus config

Metrics

The gateway exposes the following Prometheus metrics at http://localhost:8000/metrics (label names in braces):

llm_gateway_requests_total{technique, model}
llm_gateway_request_duration_seconds{technique}
llm_gateway_errors_total{error_type, layer}
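The /metrics endpoint returns the Prometheus text exposition format, so counters can be inspected without a running Prometheus server. The sketch below sums one metric by one label. `sum_by_label` is a hypothetical helper, and the regular expressions cover only the common `name{label="value"} number` shape (label values containing `}` are not handled).

```python
import re
from collections import defaultdict

# Matches lines such as: metric_name{a="x",b="y"} 12.0
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_by_label(text, metric, label):
    """Sum the samples of `metric`, grouped by the value of `label`."""
    totals = defaultdict(float)
    for line in text.splitlines():
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)
```

For example, `sum_by_label(body, "llm_gateway_requests_total", "technique")` gives the request count per arm, where `body` is the text returned by `curl http://localhost:8000/metrics`.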

Submission

Generate submission PDF:

jupyter nbconvert submission.ipynb --to pdf

See submission.ipynb for full analysis and results.
