A production-grade MLOps platform that captures quality signal from live LLM traffic,
scores every response with a real AI judge, detects statistical drift, and orchestrates
the full self-healing loop — built as a cloud-native event-driven microservices system.
Every LLM in production silently degrades. Prompts shift, user behaviour changes, model capability drifts. Most teams find out weeks later — through user complaints or a drop in business metrics. By then the damage is done.
Continuum solves this. It instruments your production traffic, scores every LLM response in real-time using a multi-signal quality evaluator (ROUGE/BLEU + semantic similarity + a real Claude LLM-as-judge), applies Statistical Process Control to detect drift the moment it becomes statistically significant, and maintains the full infrastructure to orchestrate surgical LoRA fine-tuning on exactly the failing capability cluster — without human intervention.
The platform was deployed end-to-end on AWS ap-south-1 and verified with real production traffic. 30 LLM traces were seeded, scored by Claude, and queried through the live API and dashboard.
The Claude judge separates good answers from degraded ones: good answers score ~0.89 (coherence 0.95, relevance 0.98), while degraded answers score ~0.40 (informativeness 0.10, relevance 0.15).
| Component | Verified state |
|---|---|
| ECS Fargate | All 11 microservices running on continuum-dev-cluster |
| Fargate Tasks | Running containers |
| ECR | 11 Docker image repositories |
| Application Load Balancer | Routes |
| RDS PostgreSQL | Quality scores persisted |
| EC2 | Self-managed Kafka (KRaft): 10 topics, 30 traces flowed end-to-end |
| CloudWatch | Live quality-scorer logs: DB init → embedding model → Kafka connected → scoring |
| Capability | Evidence |
|---|---|
| 11-service event-driven pipeline on ECS Fargate | All services Active on continuum-dev-cluster |
| Kafka (KRaft) with 10 topics | 30 traces: production-traces → quality-scoring-queue → scored-traces-queue |
| Real LLM-as-judge via Claude | 23 traces scored — good ~0.89, degraded ~0.40, 4 dimensions each |
| ROUGE/BLEU + sentence-transformer semantic scoring | Fully local, no API needed |
| SPC drift detector, poison guard, decision engine | All Kafka-connected, consuming, DB-persisted |
| PostgreSQL persistence (RDS) | Schema created, quality scores written and queryable |
| Authenticated REST API gateway (FastAPI + ALB) | All 7 endpoints live with Bearer auth |
| React dashboard with real data | 74.9% avg quality trend, real evaluations table |
| 166 Prometheus metric series | Exposed at /metrics |
| Full Terraform IaC | One terraform apply provisions the entire cloud environment |
| e2e test: 10/10 PASS | python scripts/cloud_e2e_test.py |
Statistical window. The SPC drift detector requires a minimum of 100 scored traces in the rolling window before Western Electric rules activate. The current deployment was validated with 30 traces — sufficient to verify the ingestion and scoring pipeline end-to-end, but the drift signal and downstream decision engine were not exercised. Closing the loop requires sustained production-like traffic.
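The gating described above can be sketched in a few lines. This is a minimal illustration, not the drift detector's actual code: the function names are hypothetical, the constants mirror the documented `SPC_WINDOW_SIZE` and `SPC_K_SIGMA` defaults, and the four rules follow the textbook Western Electric definitions.

```python
from statistics import mean, stdev
from typing import Sequence

MIN_WINDOW = 100  # mirrors SPC_WINDOW_SIZE: rules stay silent below this
K_SIGMA = 3.0     # mirrors SPC_K_SIGMA: control limit multiplier


def _run(z: Sequence[float], window: int, needed: int, limit: float) -> bool:
    """True if `needed` of any `window` consecutive z-scores lie beyond
    `limit` sigma on the same side of the centre line."""
    for i in range(len(z) - window + 1):
        chunk = z[i:i + window]
        above = sum(v > limit for v in chunk)
        below = sum(v < -limit for v in chunk)
        if above >= needed or below >= needed:
            return True
    return False


def western_electric_violations(
    baseline: Sequence[float], recent: Sequence[float]
) -> list[str]:
    """Return the names of the rules that `recent` scores violate.

    `baseline` supplies the centre line and sigma. Nothing fires until it
    holds at least MIN_WINDOW samples (hypothetical helper, for illustration).
    """
    if len(baseline) < MIN_WINDOW:
        return []
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return []
    z = [(x - mu) / sigma for x in recent]

    violations = []
    if any(abs(v) > K_SIGMA for v in z):               # Rule 1
        violations.append("rule1_beyond_3sigma")
    if _run(z, window=3, needed=2, limit=2.0):         # Rule 2
        violations.append("rule2_two_of_three_beyond_2sigma")
    if _run(z, window=5, needed=4, limit=1.0):         # Rule 3
        violations.append("rule3_four_of_five_beyond_1sigma")
    if _run(z, window=8, needed=8, limit=0.0):         # Rule 4
        violations.append("rule4_eight_same_side")
    return violations
```

With only 30 scored traces as the baseline, this function returns an empty list regardless of how bad the recent scores are, which matches the limitation described above.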
GPU dependency. The fine-tuning orchestrator (LoRA/QLoRA), vLLM serving layer, and DARE/TIES adapter merging all require GPU compute. ECS Fargate is CPU-only; these services run as stubs in the current cloud deployment. A GPU instance (g5.xlarge or equivalent) or SageMaker endpoint is required to exercise the full self-healing loop.
Single-node Kafka. The broker runs as a single KRaft node — no replication, no ISR guarantees. Suitable for development and demonstration; not appropriate for production data durability requirements. A 3-broker KRaft cluster or AWS MSK would be the upgrade path.
Model registry is a stub. The model-registry service (MLflow integration) is scaffolded but not yet wired into the fine-tuning orchestrator. Adapter lineage — base model, training data hash, benchmark results, merge configuration — is not persisted across runs.
HTTP only. The ALB serves HTTP in the dev environment. A production deployment requires TLS termination via ACM, a Route53 record, and HTTPS listeners. The current network topology uses public subnets with security-group-only isolation — cost-optimised for dev sessions, not a production security posture.
The core engineering work is complete. What remains is operational: provision a GPU instance, seed ≥150 traces with a deliberate quality degradation pattern (e.g., truncated answers after trace 80), observe the SPC window fire a CRITICAL drift signal, trigger a LoRA fine-tuning job on the failing capability cluster, merge the adapter with DARE/TIES, and canary-deploy at 5% → 50% → 100% with automatic rollback if the quality regression re-appears. This is the intended demo path once GPU compute is available.
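The canary step of that demo path can be pictured as a small state machine. The sketch below is illustrative only: the `CanaryRollout` class and the `max_regression` tolerance are assumptions made for the example, not the ab-deployer's real interface. Only the 5% → 50% → 100% stages and the rollback-on-regression behaviour come from the description above.

```python
from dataclasses import dataclass

CANARY_STAGES = (5, 50, 100)  # percent of traffic routed to the new adapter


@dataclass
class CanaryRollout:
    baseline_quality: float        # quality of the currently deployed model
    max_regression: float = 0.05   # assumed tolerance, not a project default
    stage_index: int = 0
    rolled_back: bool = False

    @property
    def traffic_percent(self) -> int:
        """Share of traffic the canary currently receives."""
        return 0 if self.rolled_back else CANARY_STAGES[self.stage_index]

    def observe(self, canary_quality: float) -> int:
        """Advance one stage if quality holds, otherwise roll back to 0%.

        Returns the traffic percentage after the decision.
        """
        if self.rolled_back:
            return 0
        if canary_quality < self.baseline_quality - self.max_regression:
            self.rolled_back = True
        elif self.stage_index < len(CANARY_STAGES) - 1:
            self.stage_index += 1
        return self.traffic_percent
```

A rollout that starts at 5% moves to 50% after one healthy observation and to 100% after a second; a single observation below `baseline_quality - max_regression` drops it to 0% permanently.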
Migrate to a 3-broker KRaft cluster with replication.factor=3 and min.insync.replicas=2. Add schema registry (Confluent or AWS Glue) for contract enforcement at the broker level rather than at the consumer. Consider MSK Serverless for operational simplicity at the cost of ~$150/month.
Wire the finetuning-orchestrator to register every adapter checkpoint with full lineage: base model ID, training data hash, capability benchmark results pre/post merge, and the DARE/TIES merge configuration. Expose registered models through the model-registry service for the A/B deployer to query.
Deploy vLLM on a GPU auto-scaling group (Karpenter on EKS or EC2 Auto Scaling) with a draft model for speculative decoding. Add model sharding for 7B+ parameter models. Expose an OpenAI-compatible /v1/chat/completions endpoint so any existing LLM client can route traffic through Continuum without code changes.
Replace the dashboard's 60-second polling interval with Server-Sent Events or WebSocket streams from the API gateway. Expose a /v1/stream/quality endpoint that pushes scored trace events as they arrive. Add quality score percentile bands (P50/P95/P99) to the drift detection chart.
Route CRITICAL drift signals to PagerDuty and Slack via SNS. Implement SLO burn-rate alerts in CloudWatch: if the burn rate measured over a 1-hour window would exhaust the 30-day error budget within 6 hours, page on-call. Add a dead-letter-queue depth alarm to catch silent pipeline failures.
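The burn-rate arithmetic behind such an alert is short enough to work through. The sketch below assumes a "bad event" is any trace that violates the quality SLO (for instance, one scoring below the low-quality threshold); that definition, the function names, and the 99.9% target in the usage note are illustrative assumptions, not project settings.

```python
SLO_WINDOW_HOURS = 30 * 24  # 30-day error budget window


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A burn rate of 1.0 spends the budget exactly over the full window.
    """
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def hours_to_exhaustion(rate: float) -> float:
    """Hours until the 30-day budget is gone if `rate` is sustained."""
    return float("inf") if rate <= 0 else SLO_WINDOW_HOURS / rate


def should_page(
    bad_events: int,
    total_events: int,
    slo_target: float,
    exhaust_within_hours: float = 6.0,
) -> bool:
    """Page when the last hour's burn rate would empty the budget in time."""
    rate = burn_rate(bad_events, total_events, slo_target)
    return hours_to_exhaustion(rate) <= exhaust_within_hours
```

Exhausting a 720-hour budget in 6 hours corresponds to a burn rate of 720 / 6 = 120. With a 99.9% target, that means roughly 12% of traces in the last hour were bad, so `should_page(150, 1000, 0.999)` pages while `should_page(50, 1000, 0.999)` does not.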
Move from ECS Fargate to EKS for horizontal pod autoscaling, GPU node pools managed by Karpenter, and better resource bin-packing across services. The existing Kubernetes manifests in infrastructure/k8s/ provide the starting point.
Add per-model-id Kafka consumer group routing so multiple models can share the same pipeline without cross-contamination. Implement RBAC on the API gateway: separate API keys per deployment, admin-only access to trigger endpoints, read-only keys for dashboard consumers.
PRODUCTION TRAFFIC (LLM responses)
│
▼
┌──────────────────────────────────────┐
│ INGESTION │
│ Kafka consumer · Pydantic validation│
│ Data lineage · Provenance tagging │
│ ────────────────────────────────── │
│ → quality-scoring-queue │
└──────────────────┬───────────────────┘
│
┌────────▼────────┐
│ QUALITY SCORER │
│ │
│ 1. ROUGE/BLEU │ ← lexical (local)
│ 2. Semantic sim │ ← sentence-transformers
│ 3. LLM-as-judge │ ← Claude: 4 dimensions
│ │
│ composite = 0.6·judge + 0.2·lexical + 0.2·semantic
│ → scored-traces-queue + PostgreSQL
└────────┬────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌──────────────────────┐
│ DRIFT │ │ POISON │ │ FINETUNING │
│ DETECTOR │ │ GUARD │ │ ORCHESTRATOR │
│ │ │ │ │ (GPU-gated) │
│ SPC │ │ Provenance│ │ │
│ Western │ │ Self-gen │ │ LoRA / QLoRA │
│ Electric │ │ filter │ │ DARE / TIES merge │
│ DoWhy │ │ Dist. div │ │ Capability guard │
│ │ │ │ │ │
│ → drift- │ │ → approved│ │ → model-registry │
│ signals │ │ traces │ │ → ab-deployer │
└─────┬─────┘ └───────────┘ └──────────────────────┘
│
▼
┌──────────────────┐ ┌─────────────────────────────┐
│ DECISION ENGINE │ │ SERVING (GPU-gated) │
│ │ │ vLLM → ab-deployer │
│ P95 > 2σ drop? │ │ Canary 5%→50%→100% │
│ Cooldown check │ │ Auto-rollback on regression │
│ → finetuning- │ └─────────────────────────────┘
│ triggers │
└──────────────────┘
OBSERVABILITY: Prometheus (166 series) · CloudWatch · OpenTelemetry · Structured logs
API GATEWAY: FastAPI 0.111 + ALB · rate limiting · Bearer auth
DASHBOARD: React 18 + Recharts + Zustand + TanStack Query + Tailwind
INFRASTRUCTURE: ECS Fargate · EC2 Kafka (KRaft) · RDS PostgreSQL · ElastiCache Redis
Secrets Manager · ECR · ALB · Terraform IaC
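The composite formula in the quality-scorer box can be written out as a short function. This is a sketch under one stated assumption: the judge signal is taken as the plain mean of the four judge dimensions, which the diagram does not specify. The function name and signature are hypothetical.

```python
from statistics import mean

# Weights from the diagram: composite = 0.6*judge + 0.2*lexical + 0.2*semantic
WEIGHTS = {"judge": 0.6, "lexical": 0.2, "semantic": 0.2}


def composite_score(
    judge_dimensions: dict[str, float], lexical: float, semantic: float
) -> float:
    """Blend the three signals into one score clamped to [0, 1].

    Assumption: the judge signal is the unweighted mean of its dimensions
    (coherence, relevance, informativeness, conciseness).
    """
    judge = mean(judge_dimensions.values())
    score = (
        WEIGHTS["judge"] * judge
        + WEIGHTS["lexical"] * lexical
        + WEIGHTS["semantic"] * semantic
    )
    return round(min(1.0, max(0.0, score)), 4)
```

Because the judge carries 60% of the weight, a response with strong lexical overlap but poor judged relevance still lands in the low tier.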
| Layer | Technology | Why |
|---|---|---|
| Event streaming | Apache Kafka (KRaft) | Industry-standard, exactly-once semantics |
| Quality scoring | PyTorch + sentence-transformers + Claude | Multi-signal: lexical + semantic + LLM judge |
| Drift detection | SciPy SPC + Western Electric rules | Low false-positive rate vs threshold alerts |
| Causal attribution | DoWhy | Most mature Python causal inference library |
| Fine-tuning | HuggingFace PEFT + LoRA/QLoRA | Most mature LoRA ecosystem |
| Adapter merging | mergekit (DARE/TIES) | Only production-grade merger preventing capability regression |
| Inference | vLLM | Best throughput, continuous batching, speculative decoding |
| Pipeline orchestration | Celery + Redis | Proven, async task execution |
| API | FastAPI 0.111 + Pydantic v2 | Async-native, type-safe, OpenAPI |
| Dashboard | React 18 + TypeScript + Recharts + Zustand | Real-time, typed, reactive |
| Containers | AWS ECS Fargate | Serverless, pay-per-second |
| Kafka hosting | AWS EC2 (self-managed KRaft) | ~$150/month cheaper than MSK |
| Database | AWS RDS PostgreSQL | Managed, stoppable between sessions |
| Cache / queues | AWS ElastiCache Redis | Celery broker, response caching |
| IaC | Terraform (modular) | Reproducible, version-controlled infrastructure |
| CI/CD | GitHub Actions | lint → type-check → test → build → push → deploy |
| Observability | Prometheus + OpenTelemetry + CloudWatch | 166 metric series, traces, structured logs |
continuum/
├── services/
│ ├── ingestion/ # Kafka consumer, Pydantic validation, lineage tagging
│ ├── quality_scorer/ # ROUGE + semantic + Claude LLM-as-judge
│ ├── drift_detector/ # SPC, Western Electric rules, DoWhy causal attribution
│ ├── poison_guard/ # Provenance tagging, self-generated data filter
│ ├── decision_engine/ # SPC threshold policy, cooldown, trigger management
│ ├── finetuning_orchestrator/ # LoRA/QLoRA pipeline, capability guard (GPU-gated)
│ ├── model-registry/ # MLflow wrapper + lineage
│ ├── ab_deployer/ # Canary 5→50→100%, auto-rollback on regression
│ ├── serving/ # vLLM proxy (GPU-gated)
│ └── api_gateway/ # FastAPI unified API, rate limiting, Bearer auth
├── dashboard/ # React 18 TypeScript frontend
├── shared/ # Pydantic contracts, Kafka producer/consumer, DB session
├── infrastructure/
│ ├── terraform/
│ │ ├── modules/ # alb, ecs, ecr, networking, rds, redis, kafka-ec2, secrets
│ │ └── environments/dev/
│ └── k8s/ # Kubernetes manifests (EKS alternative)
├── tests/
│ ├── unit/
│ ├── integration/ # Full Docker Compose stack tests
│ └── e2e/
├── scripts/
│ ├── local-setup.py # Initialize Kafka topics + DB schema
│ └── cloud_e2e_test.py # End-to-end test against live ALB (10/10 checks)
├── docs/
│ ├── architecture.md # Full system technical deep-dive
│ ├── deployment/ # Cloud deployment guide
│ └── operations/ # Runbooks, smoke tests
└── docker-compose.yml # Full local dev environment
docker --version # Docker Desktop 4.x+
python --version # Python 3.11+
node --version     # Node.js 20+
git clone https://github.com/HarshTomar1234/continuum.git
cd continuum
# Start Kafka, PostgreSQL, Redis, MLflow
docker compose up -d
# Initialize Kafka topics + DB schema
python scripts/local-setup.py
cd services/quality_scorer
pip install -e ".[dev]"
export ANTHROPIC_API_KEY=sk-ant-... # Required for LLM judge only
export ENVIRONMENT=local
python -m services.quality_scorer.src.main
python - << 'EOF'
import json
from confluent_kafka import Producer
trace = {
"schema_version": "1.0",
"trace_id": "test-001",
"timestamp_utc": "2026-06-21T10:00:00Z",
"model_id": "gpt2-ft-v3",
"model_version": "v3.1",
"deployment_id": "prod-slot-A",
"request": {
"prompt": "Explain gradient descent in machine learning.",
"system_prompt": None,
"temperature": 0.7,
"max_tokens": 512
},
"response": {
"completion": "Gradient descent iteratively minimizes a loss function by moving parameters in the direction of steepest descent, scaled by a learning rate, until convergence.",
"finish_reason": "stop",
"input_token_count": 8,
"output_token_count": 30,
"latency_ms": 312
},
"metadata": {
"user_id": "u1",
"session_id": "sess-001",
"source_system": "chatbot",
"region": "ap-south-1",
"tags": {}
}
}
p = Producer({"bootstrap.servers": "localhost:9094"})
p.produce("production-traces", json.dumps(trace).encode())
p.flush()
print("Trace produced.")
EOF
pytest tests/unit/ -v
pytest tests/integration/ -v # requires: docker compose up -d
python scripts/cloud_e2e_test.py # end-to-end against local or live ALB
cd dashboard && npm install && npm run dev
# → http://localhost:3000
Cost model: designed for start/stop sessions. ~$5–15 per active session. ~$2–5/month at rest (S3 state + ECR only).
aws configure # Region: ap-south-1
# S3 bucket for Terraform state
aws s3 mb s3://continuum-terraform-state-YOURACCOUNTID --region ap-south-1
# DynamoDB table for state locking
aws dynamodb create-table \
--table-name continuum-terraform-locks \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--region ap-south-1
# EC2 key pair for Kafka SSH access
aws ec2 create-key-pair \
--key-name continuum-dev-key \
--region ap-south-1 \
--query 'KeyMaterial' --output text > ~/.ssh/continuum-dev-key.pem
chmod 400 ~/.ssh/continuum-dev-key.pem
cd infrastructure/terraform/environments/dev
cat > terraform.tfvars << 'EOF'
aws_region = "ap-south-1"
project = "continuum"
environment = "dev"
vpc_cidr = "10.0.0.0/16"
ecs_desired_count = 1
db_password = "YourSecurePassword123!"
anthropic_api_key = "sk-ant-..."
kafka_key_name = "continuum-dev-key"
alert_email = "you@email.com"
domain_name = ""
EOF
terraform init
terraform plan
terraform apply # ~10–15 min first time
After apply:
platform_url = "http://continuum-dev-alb-{id}.ap-south-1.elb.amazonaws.com"
kafka_public_ip = "x.x.x.x"
ecs_cluster = "continuum-dev-cluster"
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
ECR="${AWS_ACCOUNT_ID}.dkr.ecr.ap-south-1.amazonaws.com"
aws ecr get-login-password --region ap-south-1 | \
docker login --username AWS --password-stdin ${ECR}
for svc in api_gateway quality_scorer drift_detector poison_guard decision_engine \
finetuning_orchestrator ab_deployer serving ingestion; do
ecr_name=$(echo $svc | tr '_' '-')
docker build -f services/${svc}/Dockerfile -t continuum-${ecr_name}:latest .
docker tag continuum-${ecr_name}:latest ${ECR}/continuum-dev-${ecr_name}:latest
docker push ${ECR}/continuum-dev-${ecr_name}:latest
done
# Dashboard
docker build -f dashboard/Dockerfile -t continuum-dashboard:latest .
docker tag continuum-dashboard:latest ${ECR}/continuum-dev-dashboard:latest
docker push ${ECR}/continuum-dev-dashboard:latest
CLUSTER="continuum-dev-cluster"
for svc in api-gateway quality-scorer drift-detector poison-guard decision-engine \
finetuning-orchestrator ab-deployer serving ingestion dashboard; do
aws ecs update-service \
--cluster ${CLUSTER} \
--service continuum-dev-${svc} \
--force-new-deployment \
--region ap-south-1 \
--query 'service.serviceName' --output text
done
pip install httpx
python scripts/cloud_e2e_test.py
Expected:
[PASS] ALB routes to api-gateway HTTP 200
[PASS] health payload ok
[PASS] unauthenticated request rejected HTTP 401
[PASS] quality endpoint authorized HTTP 200
[PASS] scored traces present count=23
[PASS] quality differentiation
[PASS] drift endpoint authorized HTTP 200
[PASS] triggers endpoint authorized HTTP 200
[PASS] deployments endpoint authorized HTTP 200
[PASS] metrics endpoint exposed HTTP 200 (166 series)
Passed: 10 Failed: 0
KAFKA_IP=$(terraform output -raw kafka_public_ip)
KEY=~/.ssh/continuum-dev-key.pem
scp -i $KEY seed_traces.jsonl ec2-user@${KAFKA_IP}:/tmp/
ssh -i $KEY ec2-user@${KAFKA_IP} \
"cat /tmp/seed_traces.jsonl | sudo /opt/kafka/bin/kafka-console-producer.sh \
--bootstrap-server localhost:9092 --topic production-traces"
# Monitor pipeline flow
ssh -i $KEY ec2-user@${KAFKA_IP} "
for t in production-traces quality-scoring-queue scored-traces-queue drift-signals; do
T=\$(sudo /opt/kafka/bin/kafka-get-offsets.sh --bootstrap-server localhost:9092 \
--topic \$t 2>/dev/null | awk -F: '{s+=\$3} END {print s}')
printf '%-28s %s messages\n' \"\$t\" \"\${T:-0}\"
done"
# Scale ECS to 0 (stops Fargate billing)
terraform apply -var="ecs_desired_count=0" -auto-approve
# Stop RDS
aws rds stop-db-instance \
--db-instance-identifier continuum-dev --region ap-south-1
# Stop Kafka EC2
aws ec2 stop-instances \
--instance-ids $(aws ec2 describe-instances \
--filters "Name=tag:Name,Values=continuum-dev-kafka" \
--query 'Reservations[0].Instances[0].InstanceId' --output text) \
--region ap-south-1
# Full teardown
terraform destroy -auto-approve
Base URL: http://{alb-dns}/
Auth: Authorization: Bearer {key} (default: dev-key)
| Method | Endpoint | Description |
|---|---|---|
| GET | /v1/health | Gateway liveness, no auth required |
| GET | /v1/quality/metrics?hours=24&limit=100 | Quality scores from PostgreSQL |
| GET | /v1/drift/status?hours=24&limit=100 | Drift events |
| GET | /v1/triggers?limit=50 | Fine-tuning triggers |
| GET | /v1/deployments | Active model deployments |
| GET | /v1/deployments/{model_id} | Single deployment state |
| POST | /v1/infer/{model_id} | Inference via vLLM (GPU-gated) |
| POST | /v1/admin/trigger | Manual fine-tuning trigger (admin key) |
| GET | /metrics | Prometheus metrics (166 series) |
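The endpoints above can also be called from Python without extra dependencies. The sketch below uses only the standard library; the helper names are hypothetical, and the base URL and key are whatever your deployment exposes (the documented dev default key is `dev-key`).

```python
import json
import urllib.parse
import urllib.request


def build_quality_request(
    base_url: str, api_key: str, hours: int = 24, limit: int = 100
) -> urllib.request.Request:
    """Build an authenticated GET request for /v1/quality/metrics."""
    query = urllib.parse.urlencode({"hours": hours, "limit": limit})
    url = f"{base_url.rstrip('/')}/v1/quality/metrics?{query}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )


def fetch_quality_metrics(
    base_url: str,
    api_key: str,
    hours: int = 24,
    limit: int = 100,
    timeout: float = 10.0,
) -> dict:
    """Send the request and decode the JSON body.

    Raises urllib.error.HTTPError on non-2xx responses, for example a 401
    when the Bearer key is missing or wrong.
    """
    request = build_quality_request(base_url, api_key, hours, limit)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Splitting request construction from the network call keeps the URL and header logic testable without a live ALB.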
Example — query real Claude judge scores:
curl -s -H "Authorization: Bearer dev-key" \
  "http://{alb-dns}/v1/quality/metrics?hours=24&limit=3" | python -m json.tool
{
"metrics": [
{
"trace_id": "showcase-0024",
"model_id": "gpt2-ft-v3",
"composite_score": 0.8903,
"coherence": 0.95,
"relevance": 0.98,
"informativeness": 0.85,
"conciseness": 0.92,
"scored_at_utc": "2026-06-20T15:31:56Z"
},
{
"trace_id": "showcase-0029",
"model_id": "gpt2-ft-v3",
"composite_score": 0.4,
"coherence": 0.7,
"relevance": 0.4,
"informativeness": 0.1,
"conciseness": 0.3,
"scored_at_utc": "2026-06-20T15:32:00Z"
}
],
"count": 23,
"window_hours": 24
}
| Topic | Producer | Consumers |
|---|---|---|
| production-traces | Your serving layer | ingestion |
| quality-scoring-queue | ingestion | quality-scorer |
| scored-traces-queue | quality-scorer | drift-detector, poison-guard |
| approved-traces-queue | poison-guard | finetuning-orchestrator |
| drift-signals | drift-detector | decision-engine |
| finetuning-triggers | decision-engine | finetuning-orchestrator |
| deployment-events | ab-deployer | monitoring |
| dead-letter-queue | all services on error | ops review |
| Variable | Default | Description |
|---|---|---|
| ENVIRONMENT | local | local / dev / prod |
| KAFKA_BOOTSTRAP_SERVERS | localhost:9094 | Broker address |
| POSTGRES_URL | postgresql+asyncpg://...localhost... | Full async DB URL |
| DB_PASSWORD | continuum | Injected from Secrets Manager in cloud |
| REDIS_URL | redis://localhost:6379/0 | Redis connection URL |
| ANTHROPIC_API_KEY | (none) | Required for quality-scorer LLM judge |
| LLM_JUDGE_MODEL | claude-haiku-4-5-20251001 | Claude model for scoring |
| COMPOSITE_SCORE_HIGH_THRESHOLD | 0.7 | High quality tier threshold |
| COMPOSITE_SCORE_LOW_THRESHOLD | 0.5 | Low quality tier threshold |
| SPC_WINDOW_SIZE | 100 | Samples before SPC rules activate |
| SPC_K_SIGMA | 3.0 | Control limit multiplier |
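One way to consume these variables is a small typed settings object. The sketch below is an assumption about how a service might read them, not the project's actual configuration code: the `ScorerSettings` class is hypothetical, and the tier boundaries (at or above the high threshold is "high", below the low threshold is "low") are an illustrative reading of the two threshold descriptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ScorerSettings:
    environment: str
    kafka_bootstrap_servers: str
    high_threshold: float
    low_threshold: float
    spc_window_size: int
    spc_k_sigma: float

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "ScorerSettings":
        """Read settings from `env`, falling back to the documented defaults."""
        settings = cls(
            environment=env.get("ENVIRONMENT", "local"),
            kafka_bootstrap_servers=env.get(
                "KAFKA_BOOTSTRAP_SERVERS", "localhost:9094"
            ),
            high_threshold=float(env.get("COMPOSITE_SCORE_HIGH_THRESHOLD", "0.7")),
            low_threshold=float(env.get("COMPOSITE_SCORE_LOW_THRESHOLD", "0.5")),
            spc_window_size=int(env.get("SPC_WINDOW_SIZE", "100")),
            spc_k_sigma=float(env.get("SPC_K_SIGMA", "3.0")),
        )
        if not settings.low_threshold < settings.high_threshold:
            raise ValueError("low threshold must be below high threshold")
        return settings

    def tier(self, composite: float) -> str:
        """Map a composite score to a quality tier (assumed boundaries)."""
        if composite >= self.high_threshold:
            return "high"
        if composite < self.low_threshold:
            return "low"
        return "medium"
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in a unit test, for example `ScorerSettings.from_env({}).tier(0.4)` returns `"low"`.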
GitHub Actions runs on every push:
push → ruff lint → mypy type-check → pytest unit → docker build
→ [on PR to develop] ECR push → ECS force-redeploy
# Run locally before pushing
ruff check services/ shared/
mypy services/ shared/ --ignore-missing-imports
pytest tests/unit/ -v --cov=services --cov-report=term-missing
| Challenge | Why it's hard | Approach |
|---|---|---|
| Causal drift attribution | Detecting drift is easy. Knowing why requires causal graphs over feature-outcome pairs | DoWhy causal inference over production trace pairs |
| Feedback loop poison detection | Self-generated training data causes silent model collapse | Distribution divergence + provenance tagging on every trace |
| Surgical LoRA fine-tuning | Targeting only failing capabilities requires knowing which ones failed | Capability cluster mapping via benchmarks → selective adapter training |
| Safe adapter merging | Merging adapters without destroying base capabilities is non-trivial | DARE + TIES merging with capability regression guard |
| SPC-based quality triggering | Threshold alerts produce false positives | Statistical Process Control + Western Electric rules |
| KRaft single-node offsets | offsets.topic.replication.factor defaults to 3, which silently breaks all consumer groups | Set RF=1 for internal topics in server.properties |
| Cross-service contracts | Message schemas defined in one service, consumed by others | Consuming images bundle required contract modules |
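To make the poison-detection row concrete, here is a minimal sketch of a provenance filter followed by a distribution check. The table does not say which divergence measure the poison guard uses; Jensen-Shannon divergence is chosen here purely for illustration. The `provenance` field, its `"self_generated"` value, and the `max_divergence` cutoff are hypothetical names and values.

```python
import math
from typing import Iterable


def histogram(values: Iterable[float], bins: int = 10) -> list[float]:
    """Normalised histogram for scores in [0, 1]."""
    counts = [0] * bins
    total = 0
    for value in values:
        index = min(bins - 1, max(0, int(value * bins)))
        counts[index] += 1
        total += 1
    if total == 0:
        raise ValueError("cannot build a histogram from no values")
    return [count / total for count in counts]


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    def kl(a: list[float], b: list[float]) -> float:
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    midpoint = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, midpoint) + 0.5 * kl(q, midpoint)


def approve_for_training(
    traces: list[dict],
    reference_scores: list[float],
    max_divergence: float = 0.1,  # assumed cutoff, not a project default
) -> list[dict]:
    """Drop self-generated traces, then reject the batch if its score
    distribution has drifted too far from the trusted reference."""
    external = [t for t in traces if t.get("provenance") != "self_generated"]
    if not external:
        return []
    batch = histogram(t["composite_score"] for t in external)
    divergence = js_divergence(batch, histogram(reference_scores))
    return external if divergence <= max_divergence else []
```

The provenance filter handles the known case (data the model produced itself), while the divergence check is a backstop for batches that look unlike trusted traffic even when every trace carries an acceptable tag.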
| Decision | Rationale |
|---|---|
| ECS Fargate over EKS | No $72/month control plane; pay-per-second fits start/stop model |
| Self-managed Kafka over MSK | ~$150/month cheaper for dev/demo |
| FastAPI 0.111 (pinned) | 0.138+ introduces _IncludedRouter breaking both prometheus and OTel middleware |
| SPC over threshold alerts | Western Electric rules reduce false-positive alert fatigue |
| mergekit for adapter merging | Only production-grade DARE/TIES implementation |
| vLLM over TGI | Better throughput, speculative decoding, active community |
MIT © 2026 Harsh Tomar





