Harden security, reliability, and observability across the radar #6
Merged
Resolves all 21 GitHub findings (18 Dependabot + 3 code-scanning) plus the issues from a three-pass quality audit. npm audit is clean and the full local gate is green (typecheck / lint / format / 77 tests + coverage / build / helm).

─── Security ───
- Dependency CVEs: npm overrides force patched transitive deps (ws 8.21, form-data 4.0.6, undici 7.28, vite 8.0.16, postcss, qs, protobufjs, brace-expansion, esbuild, @opentelemetry/core 2.8) and @anthropic-ai/sdk is bumped to 0.105. This clears all 20 dependency alerts (18 Dependabot + the ws / form-data code-scanning CVEs).
- SSRF guard (src/crawler/url-guard.ts): decode IPv4-mapped IPv6 (::ffff:…, incl. the hex-quad form Node emits) and re-gate on the embedded IPv4, so mapped loopback / RFC1918 / cloud-metadata can no longer bypass the block list.
- CodeQL js/incomplete-url-scheme-check (src/crawler/parser.ts): link extraction now allowlists http(s) on the resolved URL instead of blocklisting only `javascript:` (which let data:/vbscript:/mixed-case/whitespace variants slip).
- pgvector TLS now verifies (rejectUnauthorized:true, optional PG_CA_PATH CA pin); it was previously disabled. LLM analysis JSON is Zod-validated at the boundary.
- Slack DM handler gated to channel_type==='im'; it was running an embedding + LLM query on every message in any readable channel (unbounded Bedrock spend).
- chart: ExternalSecret optional provider keys gated behind provider conditionals (a Bedrock-default deploy no longer fails the whole secret sync); pod UID aligned to the image's 1001.

─── Reliability ───
- Explicit timeouts on every external call: Bedrock (requestHandler + AbortSignal.timeout), Anthropic/OpenAI (SDK timeout), pg Pool (statement/query/connection/idle), Slack sink (CircuitBreaker + client timeout). A hung dependency can no longer stall the single-writer crawl.
- Data integrity: the pipeline replaces a source's vectors upsert-then-prune (deleteByMetadata with a keepIds exclusion) so a mid-write failure can't wipe history and silently re-baseline away a real change.
- pgvector: dimension-drift guard (atttypmod) fails loud on a config change; HNSW index avoids IVFFlat's empty-table recall pitfall.
- Slack HTTP-mode receiver now actually binds; /crawl reports a skipped run honestly; OpenAI embeddings are batched to respect request caps.

─── Observability ───
- src/metrics.ts now emits the full surface the Grafana dashboard + PrometheusRule query (per-kind Bedrock token counters, a circuit_breaker.open gauge, crawl outcomes/failures, a change_score distribution with explicit 0–1 buckets, alert-send failures, pgvector errors, chunks/diffs). Previously most were defined-but-never-emitted, leaving four paging alerts permanently dead.
- NetworkPolicy: added egress to the OTel collector (tcp/4318); default-deny was dropping every metric and trace in production.
- OTel init de-duplicated: removed the programmatic SDK (rejected as a duplicate of the Dockerfile --require preload, and patching modules too late anyway); the preload is now the single env-driven path, and the orphaned OTel deps are pruned.

─── Code quality, tests, docs ───
- toMessage() dedups the error-stringify idiom; the CLI reuses crawlAll via an onResult callback; metric emission is consolidated into the engines so the CLI path counts identically.
- Tests 50 → 77 with coverage thresholds enforced in CI: alert-gating orchestration, analysis JSON parse/fallback, chunker/differ boundaries, vectors keepIds + dim-drift, url-guard mapped-IPv6, parser scheme filtering.
- Dockerfile ships sources.example.json as sources.json (the path the app reads); production was monitoring zero sources. BEDROCK_LLM_MODEL is a valid us.anthropic.*-v1:0 cross-region profile (was a bare alias invalid for Converse). Added docs/RUNBOOK.md; corrected stale Pino / metric-shape / model claims across the docs; documented the collector resource_to_telemetry requirement.

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
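The SSRF-guard bullet above can be illustrated with a minimal sketch. This is not the contents of src/crawler/url-guard.ts: the function names (`unmapIPv4`, `isBlockedIPv4`, `isBlockedHost`) and the exact block list are assumptions made for illustration. It assumes the host string comes from a WHATWG `URL` hostname, which serializes an IPv4-mapped address in the compressed hex-quad form (for example `[::ffff:7f00:1]`).

```typescript
// Hypothetical sketch of "decode IPv4-mapped IPv6, then re-gate on the
// embedded IPv4". Names and block ranges are illustrative assumptions.

/** Return the embedded IPv4 of an IPv4-mapped IPv6 literal, or null. */
function unmapIPv4(host: string): string | null {
  // URL.hostname wraps IPv6 literals in brackets; strip them first.
  const h = host.replace(/^\[|\]$/g, "").toLowerCase();

  // Dotted form, e.g. ::ffff:127.0.0.1
  const dotted = /^::ffff:(\d{1,3}(?:\.\d{1,3}){3})$/.exec(h);
  if (dotted) return dotted[1] ?? null;

  // Hex-quad form, e.g. ::ffff:7f00:1 (two 16-bit groups = four octets)
  const hex = /^::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})$/.exec(h);
  if (!hex) return null;
  const hi = parseInt(hex[1] ?? "", 16);
  const lo = parseInt(hex[2] ?? "", 16);
  return [hi >> 8, hi & 0xff, lo >> 8, lo & 0xff].join(".");
}

/** True for loopback, RFC1918, link-local (incl. cloud metadata) and 0/8. */
function isBlockedIPv4(ip: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(ip);
  if (!m) return false; // not an IPv4 literal; other checks would apply
  const [a, b, c, d] = [Number(m[1]), Number(m[2]), Number(m[3]), Number(m[4])];
  if ([a, b, c, d].some((o) => o > 255)) return false;
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

/** Gate a host on its IPv4 form, unwrapping a mapped IPv6 literal first. */
function isBlockedHost(hostname: string): boolean {
  return isBlockedIPv4(unmapIPv4(hostname) ?? hostname);
}
```

A real guard would also need to cover hostnames that resolve to blocked addresses and native IPv6 ranges; this sketch only shows the mapped-address decoding step that the commit message describes.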
The trivy-config and trivy-fs jobs ran a single Trivy invocation with both `format: sarif` and `exit-code: "1"`. When Trivy finds a HIGH/CRITICAL it exits non-zero, which marks the uploaded SARIF's invocation as executionSuccessful: false. GitHub then renders the code-scanning analysis as "Trivy is reporting errors" / a configuration error, even though the findings upload correctly. That's the banner shown on `main` after the weekly scheduled run flagged the ws / form-data CVEs.

Split each job into two steps:
- Report: `exit-code: "0"`, `format: sarif` → always a successful analysis, so code scanning records clean results and surfaces findings as alerts.
- Gate: a second `format: table`, `exit-code: "1"` step (with `skip-db-update` to reuse the DB the report step downloaded) that fails CI on HIGH/CRITICAL.

Same gate behavior (build still fails on HIGH/CRITICAL), but the code-scanning analysis is never poisoned by the gate's non-zero exit, so a real future CVE no longer re-triggers the configuration-error banner.

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
See the commit message for full detail (source of truth).
Summary
`@anthropic-ai/sdk` bump; CodeQL `js/incomplete-url-scheme-check` in `parser.ts` via an http(s) scheme allowlist). `npm audit` is clean.
`sources.json` (was monitoring zero sources), valid Bedrock Converse model ID, ExternalSecret no longer fails on a Bedrock-default deploy.
Merging closes the open Dependabot + code-scanning alerts on `main` once GitHub re-scans.
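As an illustration of the http(s) scheme allowlist mentioned for `parser.ts`, here is a minimal sketch. The function name `resolveCrawlableLink` and the constant `ALLOWED_PROTOCOLS` are hypothetical, not the project's actual code. The idea is to check the scheme on the resolved URL rather than on the raw `href`, since the WHATWG URL parser lowercases the scheme and strips surrounding whitespace and embedded tabs/newlines before the comparison.

```typescript
// Hypothetical sketch: allowlist http(s) on the resolved URL instead of
// blocklisting individual dangerous schemes on the raw href string.
const ALLOWED_PROTOCOLS = new Set(["http:", "https:"]);

/**
 * Resolve href against baseUrl and return the absolute URL only if its
 * scheme is http or https; return null for anything else or on parse failure.
 */
function resolveCrawlableLink(href: string, baseUrl: string): string | null {
  let resolved: URL;
  try {
    resolved = new URL(href, baseUrl);
  } catch {
    return null; // unparseable href or base
  }
  // resolved.protocol is already normalized (lowercase, trailing colon).
  return ALLOWED_PROTOCOLS.has(resolved.protocol) ? resolved.href : null;
}
```

Under this approach, `data:`, `vbscript:`, mixed-case `JaVaScRiPt:` and whitespace-padded variants are all rejected without being listed individually, while relative links inherit the scheme of the page they were found on.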