
Harden security, reliability, and observability across the radar #6

Merged
stxkxs merged 2 commits into main from harden-quality-and-security
Jun 20, 2026

@stxkxs stxkxs commented Jun 20, 2026

See the commit message for full detail (source of truth).

Summary

  • Resolves all 21 GitHub findings — 18 Dependabot + 3 code-scanning (ws/form-data CVEs via npm overrides + @anthropic-ai/sdk bump; CodeQL js/incomplete-url-scheme-check in parser.ts via an http(s) scheme allowlist). npm audit is clean.
  • Reliability + integrity: explicit timeouts on every external call, upsert-then-prune vector replace (no history wipe), pgvector TLS verification + dim-drift guard, SSRF guard hardened against IPv4-mapped IPv6.
  • Observability fixed end-to-end: all dashboard/PrometheusRule metrics now actually emit, NetworkPolicy egress to the OTel collector added, OTel double-init removed.
  • Deploy correctness: image ships sources.json (was monitoring zero sources), valid Bedrock Converse model ID, ExternalSecret no longer fails on a Bedrock-default deploy.
  • Tests 50 → 77 with coverage thresholds enforced in CI; runbook added; docs corrected.

Merging closes the open Dependabot + code-scanning alerts on main once GitHub re-scans.
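
The http(s) scheme allowlist named in the first summary bullet can be illustrated with a minimal sketch. This is not the code in src/crawler/parser.ts; the function name `resolveHttpLink` is hypothetical. It resolves a link against its base URL first and then allowlists the resulting protocol, so mixed-case and whitespace-obfuscated schemes are normalized by the URL parser before the check runs.

```typescript
// Hypothetical helper, not the actual parser.ts implementation.
// Returns the absolute URL when the resolved scheme is http or https, else null.
function resolveHttpLink(href: string, baseUrl: string): string | null {
  let url: URL;
  try {
    // The WHATWG URL parser lowercases the scheme and strips tabs/newlines,
    // so the check below sees the normalized protocol.
    url = new URL(href, baseUrl);
  } catch {
    return null; // unparseable link or base: drop it
  }
  // Allowlist instead of blocklist: anything that is not http(s) is rejected.
  return url.protocol === "http:" || url.protocol === "https:" ? url.href : null;
}
```

Checking the resolved URL rather than the raw `href` string is the point of the design: a blocklist on the raw string has to anticipate every spelling of every dangerous scheme, while an allowlist on the parsed protocol only has to name the two that are wanted.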

stxkxs and others added 2 commits June 19, 2026 20:57
Resolves all 21 GitHub findings (18 Dependabot + 3 code-scanning) plus the
issues from a three-pass quality audit. npm audit is clean and the full local
gate is green (typecheck / lint / format / 77 tests + coverage / build / helm).

─── Security ───
- Dependency CVEs: npm overrides force patched transitive deps (ws 8.21,
  form-data 4.0.6, undici 7.28, vite 8.0.16, postcss, qs, protobufjs,
  brace-expansion, esbuild, @opentelemetry/core 2.8) and @anthropic-ai/sdk is
  bumped to 0.105 — clears all 20 dependency alerts (18 Dependabot + the ws /
  form-data code-scanning CVEs).
- SSRF guard (src/crawler/url-guard.ts): decode IPv4-mapped IPv6 (::ffff:…,
  incl. the hex-quad form Node emits) and re-gate on the embedded IPv4, so
  mapped loopback / RFC1918 / cloud-metadata can no longer bypass the block list.
- CodeQL js/incomplete-url-scheme-check (src/crawler/parser.ts): link extraction
  now allowlists http(s) on the resolved URL instead of blocklisting only
  `javascript:` (which let data:/vbscript:/mixed-case/whitespace variants slip).
- pgvector TLS now verifies the server certificate (rejectUnauthorized:true,
  optional PG_CA_PATH CA pin); verification was previously disabled. LLM
  analysis JSON is Zod-validated at the boundary.
- Slack DM handler gated to channel_type==='im' — was running an embedding + LLM
  query on every message in any readable channel (unbounded Bedrock spend).
- chart: ExternalSecret optional provider keys gated behind provider
  conditionals (a Bedrock-default deploy no longer fails the whole secret sync);
  pod UID aligned to the image's 1001.
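
As a rough illustration of the SSRF guard item above, the sketch below decodes an IPv4-mapped IPv6 literal (both the dotted form and the hex-quad form) and re-applies an IPv4 block list to the embedded address. All function names and the exact block list are assumptions made for this example, not the contents of src/crawler/url-guard.ts. It covers only the mapped-address step; a real guard would also handle ::1, DNS resolution, and other ranges.

```typescript
// Hypothetical helpers; only the mapped-address step of an SSRF guard.

// Returns the embedded IPv4 of an IPv4-mapped IPv6 literal, or null.
function extractMappedIPv4(host: string): string | null {
  // Strip URL-style brackets and normalize case.
  const h = host.replace(/^\[|\]$/g, "").toLowerCase();
  // Dotted form, e.g. ::ffff:127.0.0.1
  const dotted = /^::ffff:(\d{1,3}(?:\.\d{1,3}){3})$/.exec(h);
  if (dotted) return dotted[1] ?? null;
  // Hex-quad form, e.g. ::ffff:7f00:1 (two 16-bit groups hold the 4 octets)
  const hex = /^::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})$/.exec(h);
  if (hex) {
    const hi = parseInt(hex[1] ?? "0", 16);
    const lo = parseInt(hex[2] ?? "0", 16);
    return [hi >> 8, hi & 0xff, lo >> 8, lo & 0xff].join(".");
  }
  return null;
}

// Assumed block list: loopback, RFC1918, link-local (cloud metadata), 0.0.0.0/8.
function isBlockedIPv4(ip: string): boolean {
  const parts = ip.split(".").map(Number);
  const valid =
    parts.length === 4 && parts.every((p) => Number.isInteger(p) && p >= 0 && p <= 255);
  if (!valid) return true; // fail closed on anything malformed
  const [a = -1, b = -1] = parts;
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}

// Gate a host literal: mapped IPv6 is decoded and checked as IPv4.
function isBlockedHost(host: string): boolean {
  const mapped = extractMappedIPv4(host);
  if (mapped !== null) return isBlockedIPv4(mapped);
  const bare = host.replace(/^\[|\]$/g, "");
  if (/^\d{1,3}(\.\d{1,3}){3}$/.test(bare)) return isBlockedIPv4(bare);
  return false; // hostnames and other IPv6 forms are out of scope here
}
```

For example, `isBlockedHost("[::ffff:7f00:1]")` is treated the same as `isBlockedHost("127.0.0.1")`, which is the bypass the commit describes closing.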

─── Reliability ───
- Explicit timeouts on every external call: Bedrock (requestHandler +
  AbortSignal.timeout), Anthropic/OpenAI (SDK timeout), pg Pool
  (statement/query/connection/idle), Slack sink (CircuitBreaker + client
  timeout). A hung dependency can no longer stall the single-writer crawl.
- Data integrity: the pipeline replaces a source's vectors upsert-then-prune
  (deleteByMetadata with a keepIds exclusion) so a mid-write failure can't wipe
  history and silently re-baseline away a real change.
- pgvector: dimension-drift guard (atttypmod) fails loud on a config change;
  HNSW index avoids IVFFlat's empty-table recall pitfall.
- Slack HTTP-mode receiver now actually binds; /crawl reports a skipped run
  honestly; OpenAI embeddings are batched to respect request caps.
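
The upsert-then-prune ordering from the data-integrity item can be sketched against an in-memory stand-in. The store class, its synchronous methods, and `replaceSourceVectors` are hypothetical and exist only to show the ordering; the real pipeline talks to pgvector asynchronously, and only the names `deleteByMetadata` and `keepIds` come from the commit message.

```typescript
// In-memory stand-in for a vector store; synchronous for brevity.
interface VectorRecord {
  id: string;
  sourceId: string;
  embedding: number[];
}

class InMemoryVectorStore {
  private rows = new Map<string, VectorRecord>();
  failOnUpsert = false; // flip to simulate a failed write

  upsert(records: VectorRecord[]): void {
    if (this.failOnUpsert) throw new Error("simulated write failure");
    for (const r of records) this.rows.set(r.id, r);
  }

  // Delete every row of the source except the ids listed in keepIds.
  deleteByMetadata(sourceId: string, keepIds: ReadonlySet<string>): number {
    let deleted = 0;
    this.rows.forEach((row, id) => {
      if (row.sourceId === sourceId && !keepIds.has(id)) {
        this.rows.delete(id);
        deleted += 1;
      }
    });
    return deleted;
  }

  idsFor(sourceId: string): string[] {
    const ids: string[] = [];
    this.rows.forEach((row, id) => {
      if (row.sourceId === sourceId) ids.push(id);
    });
    return ids.sort();
  }
}

// Write the new vectors first, then prune whatever was not just written.
// If upsert throws, the prune never runs and the old rows stay in place.
function replaceSourceVectors(
  store: InMemoryVectorStore,
  sourceId: string,
  next: VectorRecord[],
): void {
  store.upsert(next);
  store.deleteByMetadata(sourceId, new Set(next.map((r) => r.id)));
}
```

The alternative ordering (delete everything for the source, then insert) leaves an empty history if the insert fails, and the next crawl would then treat the current page as the baseline.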

─── Observability ───
- src/metrics.ts now emits the full surface the Grafana dashboard +
  PrometheusRule query (per-kind Bedrock token counters, a circuit_breaker.open
  gauge, crawl outcomes/failures, a change_score distribution with explicit 0–1
  buckets, alert-send failures, pgvector errors, chunks/diffs) — previously most
  were defined-but-never-emitted, leaving four paging alerts permanently dead.
- NetworkPolicy: added egress to the OTel collector (tcp/4318) — default-deny
  was dropping every metric and trace in production.
- OTel init de-duplicated: removed the programmatic SDK init (it was rejected as
  a duplicate of the Dockerfile --require preload and patched modules too late
  anyway); the preload is now the single env-driven path, and the orphaned OTel
  deps are pruned.
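
To show why the change_score distribution needs explicit 0–1 buckets, here is a small self-contained sketch that buckets the same scores two ways. `OTEL_DEFAULT_BOUNDARIES` is the OpenTelemetry SDK's default explicit-bucket list; `SCORE_BOUNDARIES` is an assumed choice for this example, not necessarily what src/metrics.ts configures, and the helper names are hypothetical.

```typescript
// Default explicit bucket boundaries from the OpenTelemetry metrics SDK spec.
const OTEL_DEFAULT_BOUNDARIES: readonly number[] = [
  0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000,
];

// Assumed boundaries for a score in [0, 1]; the real values may differ.
const SCORE_BOUNDARIES: readonly number[] = [
  0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1,
];

// Index of the bucket a value falls into; upper bounds are inclusive,
// and the last index (boundaries.length) is the overflow bucket.
function bucketIndex(boundaries: readonly number[], value: number): number {
  const i = boundaries.findIndex((b) => value <= b);
  return i === -1 ? boundaries.length : i;
}

// Per-bucket counts for a list of observations.
function bucketCounts(boundaries: readonly number[], values: readonly number[]): number[] {
  const counts = new Array<number>(boundaries.length + 1).fill(0);
  for (const v of values) {
    const i = bucketIndex(boundaries, v);
    counts[i] = (counts[i] ?? 0) + 1;
  }
  return counts;
}
```

With the default boundaries every score between 0 and 1 lands in the single (0, 5] bucket, so any quantile computed from the histogram is meaningless; with 0–1 boundaries the distribution becomes visible.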

─── Code quality, tests, docs ───
- toMessage() dedups the error-stringify idiom; the CLI reuses crawlAll via an
  onResult callback; metric emission is consolidated into the engines so the CLI
  path counts identically.
- Tests 50 → 77 with coverage thresholds enforced in CI: alert-gating
  orchestration, analysis JSON parse/fallback, chunker/differ boundaries,
  vectors keepIds + dim-drift, url-guard mapped-IPv6, parser scheme filtering.
- Dockerfile ships sources.example.json as sources.json (the path the app
  reads) — production was monitoring zero sources. BEDROCK_LLM_MODEL is a valid
  us.anthropic.*-v1:0 cross-region profile (was a bare alias invalid for
  Converse). Added docs/RUNBOOK.md; corrected stale Pino / metric-shape / model
  claims across the docs; documented the collector resource_to_telemetry
  requirement.
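
For readers unfamiliar with the error-stringify idiom that toMessage() replaces, a plausible shape is sketched below. The body is an assumption; only the function name comes from the commit message.

```typescript
// Assumed shape of the helper: turn any thrown value into a log-safe string.
function toMessage(err: unknown): string {
  // Error instances carry a message; anything else is stringified as-is.
  return err instanceof Error ? err.message : String(err);
}

// Typical call site: catch clauses receive `unknown`, so the narrowing
// lives in one place instead of being repeated at every catch.
function describeFailure(run: () => void): string | null {
  try {
    run();
    return null;
  } catch (err) {
    return toMessage(err);
  }
}
```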

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
The trivy-config and trivy-fs jobs ran a single Trivy invocation with both
`format: sarif` and `exit-code: "1"`. When Trivy finds a HIGH/CRITICAL it exits
non-zero, which marks the uploaded SARIF's invocation as executionSuccessful:
false — GitHub then renders the code-scanning analysis as "Trivy is reporting
errors" / a configuration error, even though the findings upload correctly.
That's the banner shown on `main` after the weekly scheduled run flagged the
ws / form-data CVEs.

Split each job into two steps:
- Report: `exit-code: "0"`, `format: sarif` → always a successful analysis, so
  code scanning records clean results and surfaces findings as alerts.
- Gate: a second `format: table`, `exit-code: "1"` step (with `skip-db-update`
  to reuse the DB the report step downloaded) that fails CI on HIGH/CRITICAL.

Same gate behavior (build still fails on HIGH/CRITICAL), but the code-scanning
analysis is never poisoned by the gate's non-zero exit — so a real future CVE
no longer re-triggers the configuration-error banner.

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
@stxkxs stxkxs marked this pull request as ready for review June 20, 2026 04:11
@stxkxs stxkxs merged commit 29b9ae8 into main Jun 20, 2026
9 checks passed
@stxkxs stxkxs deleted the harden-quality-and-security branch June 20, 2026 04:11