Phase 6: MCP server, Docker/K8s, observability, CI #7
Merged
Closes the production-readiness arc. Multi-client agent access via MCP
(Claude UI / Claude Code / LibreChat), single-command local deploy via
docker-compose, namespace-scoped K8s manifests with HPAs + PDB + an
example NetworkPolicy stack, JSON logs + Prometheus metrics, and a CI
pipeline that runs the suite against a real Postgres before shipping
the image to GHCR.
services/mcp_server/
- FastMCP server with 16 tools: start_run, ingest_lines, classify_run,
build_matrix, list_runs, get_run, list_classifications,
list_missing_sgts, list_sgt_entries, set_sgt_name (gated),
list_proposals, get_proposal, create_proposal, approve_proposal,
reject_proposal, lookup_threat_intel.
- Two transports, one shared registry: stdio.py for local clients,
http.py for remote (streamable HTTP via Starlette).
- Stateless — every tool takes an explicit tenant_id / run_id; per-
tenant authorization at the repo layer (already established in
Phase 1).
- --allow-dictionary-edit gates set_sgt_name behind a CLI flag.
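The gating described above can be sketched in plain Python. This is a minimal illustration, not the FastMCP code in this PR: `build_registry`, `parse_args`, and the stub tool bodies are hypothetical names chosen for the example. It shows one tool table that either entry point could consume, with `set_sgt_name` registered only when the flag is set.

```python
import argparse
from typing import Callable

ToolFn = Callable[..., dict]


def build_registry(allow_dictionary_edit: bool = False) -> dict[str, ToolFn]:
    """Build the tool table that both transports would share (hypothetical)."""
    registry: dict[str, ToolFn] = {}

    def tool(fn: ToolFn) -> ToolFn:
        # Register the function under its own name and return it unchanged.
        registry[fn.__name__] = fn
        return fn

    @tool
    def list_sgt_entries(tenant_id: str) -> dict:
        # Stateless: the tenant arrives as an explicit argument on every call.
        return {"tenant_id": tenant_id, "entries": []}

    if allow_dictionary_edit:
        # The write tool only exists when the operator opted in.
        @tool
        def set_sgt_name(tenant_id: str, sgt: int, name: str) -> dict:
            return {"tenant_id": tenant_id, "sgt": sgt, "name": name}

    return registry


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the one CLI flag this sketch cares about."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--allow-dictionary-edit", action="store_true")
    return parser.parse_args(argv)
```

Leaving the tool out of the registry, rather than registering it and rejecting calls, means a client listing the tools never sees it unless the flag is passed.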
deploy/
- Dockerfile: multi-stage, non-root (uid 10001), one image for every
role. CMD swap per replica.
- docker-compose.yml: postgres + redis + migrate (one-shot) + api +
worker + scheduler + mcp-http, with profiles `ui`, `webex`,
`threat` for the optional services.
- k8s/base: kustomize layout — Deployments + Services + HPAs + a
leader-respecting scheduler PDB + cert-manager + nginx-ingress for
api/mcp/ui/webex hosts. Postgres + Redis as single-replica
StatefulSets (swap for managed instances in prod). Migration as a
pre-install Job (helm + argocd sync wave annotations). Example
NetworkPolicy stack with default-deny intra-namespace.
- Pod Security Standards `restricted` profile enforced on the
namespace; pods set runAsNonRoot + readOnlyRootFilesystem + drop all
capabilities + seccomp RuntimeDefault.
core/observability/
- JSON structured logging (text|json via SCOPILOT_LOG_FORMAT).
- Prometheus counters: scopilot_flow_unknown_published_total,
scopilot_classifications_total, scopilot_proposals_total,
scopilot_threat_lookups_total, plus an http_request_duration_seconds
histogram.
- API exposes /metrics. Worker scan increments
flow_unknown_published on every fan-out.
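For illustration, the text|json toggle could look roughly like the sketch below, built on the standard library only. The class and function names, the JSON field names, and the `text` default are assumptions made for the example; only the `SCOPILOT_LOG_FORMAT` variable comes from this PR.

```python
import json
import logging
import os
from typing import Mapping, Optional


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (field names assumed)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def make_formatter(env: Optional[Mapping[str, str]] = None) -> logging.Formatter:
    """Pick a formatter from SCOPILOT_LOG_FORMAT; 'text' default is assumed."""
    env = os.environ if env is None else env
    if env.get("SCOPILOT_LOG_FORMAT", "text").lower() == "json":
        return JsonFormatter()
    return logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
```

A service would attach the result to its root handler once at startup, for example `handler.setFormatter(make_formatter())`.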
.github/workflows/ci.yml
- ruff (clean — 98 fixes applied across the tree, including a real
bug: duplicate dict key in the AbuseIPDB category map).
- Alembic upgrade against a real postgres:16 service container.
- pytest with SQLite (fast).
- Docker build + push to ghcr.io/$REPO + Trivy scan, gated on push
to main.
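The duplicate-key bug is easy to miss because Python accepts a repeated key in a dict literal and silently keeps the last value. The sketch below reproduces the effect with illustrative values and adds a small AST-based check; `duplicate_keys` is a hypothetical helper written for this example, not ruff's implementation and not code from this PR.

```python
import ast

# Legal Python: the second entry for key 14 silently replaces the first.
categories = {14: "port_scan", 14: "scanner"}

SOURCE = '''
CATEGORY_MAP = {
    14: "port_scan",
    14: "scanner",
}
'''


def duplicate_keys(source: str) -> list:
    """Return constant keys that appear more than once in any dict literal."""
    dupes = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            seen = set()
            for key in node.keys:
                # Skip non-constant keys and ** unpacking (key is None there).
                if isinstance(key, ast.Constant):
                    if key.value in seen:
                        dupes.append(key.value)
                    seen.add(key.value)
    return dupes
```

Here `categories[14]` evaluates to `"scanner"` and `duplicate_keys(SOURCE)` returns `[14]`, which is the kind of finding a linter surfaces before the masked value causes a wrong classification.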
Tests — 79/79 passing (+4 new):
- test_mcp_server.py (4): run lifecycle through the tool surface,
dictionary-edit gating, proposal create/approve via MCP, threat-
intel tool returns a graceful error when no providers configured.
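As a rough sketch of the last case, a tool can report a missing configuration as data instead of raising, so the calling agent gets a message it can relay. The signature, the provider mapping, and the error text below are assumptions for illustration; the real `lookup_threat_intel` tool may differ.

```python
from typing import Callable, Mapping

# A provider takes an IP address and returns its findings (hypothetical shape).
Provider = Callable[[str], dict]


def lookup_threat_intel(ip: str, providers: Mapping[str, Provider]) -> dict:
    """Query every configured provider, or return an error payload if none."""
    if not providers:
        # Graceful failure: a structured error rather than an exception.
        return {"error": "no threat-intel providers configured"}
    return {"ip": ip, "results": {name: fn(ip) for name, fn in providers.items()}}
```

A test in this style only needs to assert that the `"error"` key is present when the provider mapping is empty, and that results come back keyed by provider name otherwise.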
https://claude.ai/code/session_01THsbGHdqjcvJeWUwrzZtp8
Context
Sixth and final phase. Closes the production-readiness arc: the agent is reachable from any MCP-aware client, the stack deploys in one command (locally or on K8s), every service emits Prometheus metrics + JSON logs, and CI gates lint + a real Postgres test run + a scanned image push.
What's in this PR
services/mcp_server/: MCP server
Two transports, one shared registry (`server.py`):
- stdio (`python -m services.mcp_server.stdio`): for Claude Code / Desktop and other local clients.
- Streamable HTTP (`uvicorn services.mcp_server.http:app`): for LibreChat and remote MCP clients.
16 tools spanning runs, classifications, the SGT dictionary, proposals, and threat-intel:
- `start_run`, `ingest_lines`, `classify_run`, `build_matrix`, `list_runs`, `get_run`, `list_classifications`, `list_missing_sgts`
- `list_sgt_entries`, `set_sgt_name` (gated by `--allow-dictionary-edit`)
- `list_proposals`, `get_proposal`, `create_proposal`, `approve_proposal`, `reject_proposal`
- `lookup_threat_intel`
Tools are stateless: every call takes an explicit `tenant_id` / `run_id`. Per-tenant authorization is enforced by the repo layer (established in Phase 1).
deploy/: one-command stack
- `Dockerfile`: multi-stage, non-root (uid 10001), `readOnlyRootFilesystem`-friendly. One image, every role; pod CMD picks the role.
- `docker-compose.yml`: postgres + redis + migrate (one-shot) + api + worker + scheduler + mcp-http always-on. Optional `ui`, `webex`, `threat` behind Compose profiles.
- `k8s/base/` (kustomize): Deployments + Services + HPAs (CPU-based for now; KEDA-on-Redis-Streams is a follow-up), a `minAvailable: 1` PDB on the scheduler so leader handover is fast during rollouts, postgres + redis as single-replica StatefulSets, migration as a pre-install Job (helm + argocd sync wave annotations), cert-manager + nginx-ingress for the four hostnames, example NetworkPolicy stack with default-deny. Pod Security Standards `restricted` profile enforced on the namespace.
core/observability/
- `logging.py`: JSON structured logs (toggleable via `SCOPILOT_LOG_FORMAT`).
- `metrics.py`: Prometheus counters for the high-signal events: `scopilot_flow_unknown_published_total`, `scopilot_classifications_total`, `scopilot_proposals_total`, `scopilot_threat_lookups_total`, plus an `http_request_duration_seconds` histogram. API exposes `/metrics`.
- Worker scan increments `flow_unknown_published` as proof of plumbing; downstream services adopt the rest as they need it.
.github/workflows/ci.yml
- ruff: clean (98 fixes applied across the tree, including a real bug: a duplicate `14:` key in the AbuseIPDB category map silently masked `"port_scan"` behind `"scanner"`).
- Alembic upgrade against a real `postgres:16` service container.
- Docker build + push to `ghcr.io/$REPO` (on push to main only) + Trivy scan (warn for now; fail-on-CRITICAL is a follow-up).
Tests: 79/79 passing (+4 new)
- `test_mcp_server.py` (4): full run lifecycle through the MCP tool surface (`start_run` → `ingest_lines` → `classify_run` → `build_matrix`); dictionary-edit gating proves `set_sgt_name` is hidden by default; `create_proposal` → `approve_proposal` → `list_proposals` end-to-end; `lookup_threat_intel` returns a graceful `{"error": …}` when no providers are configured.
What this PR explicitly does not do (out of scope)
- `DELETE … WHERE ingestion_ts < now() - 90d` cron lives in operations.
End-to-end sanity check
Test plan
- `pytest`: 79/79 passing
- `ruff check .`: clean
- `call_tool` API
- `kubectl apply -k` smoke (deferred, no cluster)
Generated by Claude Code