
Phase 6: MCP server, Docker/K8s, observability, CI #7

Merged: rlienard merged 1 commit into main from claude/fervent-gauss-La7rd on May 22, 2026

@rlienard
Owner

Context

Sixth and final phase. Closes the production-readiness arc: the agent is reachable from any MCP-aware client, the stack deploys in one command (locally or on K8s), services emit Prometheus metrics + JSON logs, and CI gates on lint, an Alembic migration against a real Postgres, the pytest suite, and a scanned image push.

What's in this PR

services/mcp_server/ — MCP server

Two transports, one shared registry (server.py):

  • stdio (python -m services.mcp_server.stdio) — for Claude Code / Desktop and other local clients.
  • streamable HTTP (uvicorn services.mcp_server.http:app) — for LibreChat and remote MCP clients.

16 tools spanning runs, classifications, the SGT dictionary, proposals, and threat-intel:

| Group | Tools |
| --- | --- |
| Runs | start_run, ingest_lines, classify_run, build_matrix, list_runs, get_run, list_classifications, list_missing_sgts |
| SGT dict | list_sgt_entries, set_sgt_name (gated by --allow-dictionary-edit) |
| Proposals | list_proposals, get_proposal, create_proposal, approve_proposal, reject_proposal |
| Threat intel | lookup_threat_intel |

Tools are stateless — every call takes an explicit tenant_id / run_id. Per-tenant authorization is enforced by the repo layer (established in Phase 1).

deploy/ — one-command stack

  • Dockerfile — multi-stage, non-root (uid 10001), readOnlyRootFilesystem-friendly. One image, every role; pod CMD picks the role.
  • docker-compose.yml — postgres + redis + migrate (one-shot) + api + worker + scheduler + mcp-http always-on. Optional ui, webex, threat behind Compose profiles.
  • k8s/base/ (kustomize) — Deployments + Services + HPAs (CPU-based for now; KEDA-on-Redis-Streams is a follow-up); a minAvailable: 1 PDB on the scheduler so leader handover is fast during rollouts; postgres + redis as single-replica StatefulSets; migration as a pre-install Job (Helm hook + Argo CD sync-wave annotations); cert-manager + nginx-ingress for the four hostnames (api, mcp, ui, webex); an example NetworkPolicy stack with default-deny. Pod Security Admission (restricted) on the namespace.

core/observability/

  • logging.py — JSON structured logs (toggleable via SCOPILOT_LOG_FORMAT).
  • metrics.py — Prometheus counters for the high-signal events: scopilot_flow_unknown_published_total, scopilot_classifications_total, scopilot_proposals_total, scopilot_threat_lookups_total, plus an http_request_duration_seconds histogram. API exposes /metrics.
  • The scheduler scan path now increments flow_unknown_published as proof of plumbing — downstream services adopt the rest as they need it.
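The logging toggle above could look roughly like the following stdlib-only sketch. `JsonFormatter`, `make_formatter`, the field names in the JSON payload, and the `json` default are assumptions for illustration; only the `SCOPILOT_LOG_FORMAT` variable and its text/json values come from this PR.

```python
import json
import logging
import os
from typing import Optional


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def make_formatter(log_format: Optional[str] = None) -> logging.Formatter:
    # An explicit argument wins; otherwise fall back to the environment.
    # Defaulting to "json" is an assumption of this sketch.
    fmt = (log_format or os.environ.get("SCOPILOT_LOG_FORMAT", "json")).lower()
    if fmt == "json":
        return JsonFormatter()
    return logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
```

A service would attach the result to its root handler once at startup, for example `handler.setFormatter(make_formatter())`.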

.github/workflows/ci.yml

  • ruff clean (the cleanup applied 98 fixes across the tree — including a real bug: a duplicate 14: key in the AbuseIPDB category map silently masked "port_scan" behind "scanner").
  • Alembic migration applied against a real postgres:16 service container.
  • pytest against SQLite (the suite is hermetic).
  • Docker build + push to ghcr.io/$REPO (on push to main only) + Trivy scan (warn for now; fail-on-CRITICAL is a follow-up).
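The duplicate-key bug mentioned in the ruff item is easy to miss because Python accepts a repeated key in a dict literal and silently keeps the last value. The sketch below reproduces the effect and detects it with the standard `ast` module; the three-entry map is illustrative only and is not the project's real AbuseIPDB table, and `duplicate_literal_keys` is a hypothetical helper, not ruff's implementation.

```python
import ast
from typing import List


def duplicate_literal_keys(source: str) -> List[object]:
    """Return constant keys that appear more than once in any dict literal."""
    duplicates: List[object] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Dict):
            continue
        seen = set()
        for key in node.keys:
            # Keys are None for ** unpacking; only constant keys are compared.
            if isinstance(key, ast.Constant):
                if key.value in seen:
                    duplicates.append(key.value)
                seen.add(key.value)
    return duplicates


# Illustrative map: key 14 appears twice, so the first value is lost.
SOURCE = 'CATEGORIES = {14: "port_scan", 15: "hacking", 14: "scanner"}'

namespace: dict = {}
exec(SOURCE, namespace)  # runs without any warning or error
```

Here `namespace["CATEGORIES"][14]` evaluates to `"scanner"` and `duplicate_literal_keys(SOURCE)` returns `[14]`, which is the kind of finding a linter surfaces before the masked value causes a misclassification.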

Tests — 79/79 passing (+4 new)

  • test_mcp_server.py (4): full run lifecycle through the MCP tool surface (start_run → ingest_lines → classify_run → build_matrix); dictionary-edit gating proves set_sgt_name is hidden by default; create_proposal → approve_proposal → list_proposals end-to-end; lookup_threat_intel returns a graceful {"error": …} when no providers are configured.

What this PR explicitly does not do (out of scope)

  • OIDC verifier with JWKS — the auth dependency surface from Phase 2 is JWKS-ready; swapping in the verifier is plumbing-only, deferred to a follow-up because it needs an IdP to integration-test against.
  • mTLS via Linkerd — short-lived JWTs cover service-to-service for now.
  • KEDA Redis-Streams scaler — CPU-based HPAs ship today; KEDA is the next iteration.
  • Loki/Tempo exporters — the structured logs and Prometheus metrics drop into any backend; specific exporters are deployment choices.
  • GDPR purge cron job — schema + retention guidance is in place; the actual DELETE … WHERE ingestion_ts < now() - 90d cron lives in operations.

End-to-end sanity check

docker compose -f deploy/docker-compose.yml up -d
# api on :8000, mcp-http on :8002, worker + scheduler in the background

# Drive the agent from the CLI client
SCOPILOT_API_BASE=http://localhost:8000 scopilot health
SCOPILOT_API_BASE=http://localhost:8000 scopilot run start tests/fixtures/sample.log

# Or from any MCP client
python -m services.mcp_server.stdio

Test plan

  • pytest — 79/79 passing
  • ruff check . — clean
  • MCP tool surface covered by tests against the FastMCP call_tool API
  • Docker image structure validates (multi-stage, non-root, copies app.py)
  • Real Docker Compose smoke test (deferred — environment doesn't run Docker)
  • Real kubectl apply -k smoke (deferred — no cluster)

Generated by Claude Code

Closes the production-readiness arc. Multi-client agent access via MCP
(Claude UI / Claude Code / LibreChat), single-command local deploy via
docker-compose, namespace-scoped K8s manifests with HPAs + PDB + an
example NetworkPolicy stack, JSON logs + Prometheus metrics, and a CI
pipeline that applies migrations against a real Postgres and runs the
suite before shipping the image to GHCR.

services/mcp_server/
  - FastMCP server with 16 tools: start_run, ingest_lines, classify_run,
    build_matrix, list_runs, get_run, list_classifications,
    list_missing_sgts, list_sgt_entries, set_sgt_name (gated),
    list_proposals, get_proposal, create_proposal, approve_proposal,
    reject_proposal, lookup_threat_intel.
  - Two transports, one shared registry: stdio.py for local clients,
    http.py for remote (streamable HTTP via Starlette).
  - Stateless — every tool takes an explicit tenant_id / run_id; per-
    tenant authorization at the repo layer (already established in
    Phase 1).
  - --allow-dictionary-edit gates set_sgt_name behind a CLI flag.

deploy/
  - Dockerfile: multi-stage, non-root (uid 10001), one image for every
    role. CMD swap per replica.
  - docker-compose.yml: postgres + redis + migrate (one-shot) + api +
    worker + scheduler + mcp-http, with profiles `ui`, `webex`,
    `threat` for the optional services.
  - k8s/base: kustomize layout — Deployments + Services + HPAs + a
    leader-respecting scheduler PDB + cert-manager + nginx-ingress for
    api/mcp/ui/webex hosts. Postgres + Redis as single-replica
    StatefulSets (swap for managed instances in prod). Migration as a
    pre-install Job (helm + argocd sync wave annotations). Example
    NetworkPolicy stack with default-deny intra-namespace.
  - Pod Security Admission (restricted) on the namespace; pods
    runAsNonRoot + readOnlyRootFS + drop all capabilities + seccomp
    RuntimeDefault.

core/observability/
  - JSON structured logging (text|json via SCOPILOT_LOG_FORMAT).
  - Prometheus counters: scopilot_flow_unknown_published_total,
    scopilot_classifications_total, scopilot_proposals_total,
    scopilot_threat_lookups_total + http_request_duration histogram.
  - API exposes /metrics. Worker scan increments
    flow_unknown_published on every fan-out.

.github/workflows/ci.yml
  - ruff (clean — 98 fixes applied across the tree, including a real
    bug: duplicate dict key in the AbuseIPDB category map).
  - Alembic upgrade against a real postgres:16 service container.
  - pytest with SQLite (fast).
  - Docker build + push to ghcr.io/$REPO + Trivy scan, gated on push
    to main.

Tests — 79/79 passing (+4 new):
  - test_mcp_server.py (4): run lifecycle through the tool surface,
    dictionary-edit gating, proposal create/approve via MCP, threat-
    intel tool returns a graceful error when no providers configured.

https://claude.ai/code/session_01THsbGHdqjcvJeWUwrzZtp8
@rlienard rlienard marked this pull request as ready for review May 22, 2026 21:40
@rlienard rlienard merged commit 56e1cd9 into main May 22, 2026
@rlienard rlienard deleted the claude/fervent-gauss-La7rd branch May 22, 2026 21:41