Harden security, reliability, and observability across the radar by stxkxs · Pull Request #6 · nanohype/competitive-intelligence

stxkxs · 2026-06-20T03:57:36Z

See the commit message for full detail (source of truth).

Summary

Resolves all 21 GitHub findings — 18 Dependabot + 3 code-scanning (ws/form-data CVEs via npm overrides + @anthropic-ai/sdk bump; CodeQL js/incomplete-url-scheme-check in parser.ts via an http(s) scheme allowlist). npm audit is clean.
Reliability + integrity: explicit timeouts on every external call, upsert-then-prune vector replace (no history wipe), pgvector TLS verification + dim-drift guard, SSRF guard hardened against IPv4-mapped IPv6.
Observability fixed end-to-end: all dashboard/PrometheusRule metrics now actually emit, NetworkPolicy egress to the OTel collector added, OTel double-init removed.
Deploy correctness: image ships sources.json (was monitoring zero sources), valid Bedrock Converse model ID, ExternalSecret no longer fails on a Bedrock-default deploy.
Tests 50 → 77 with coverage thresholds enforced in CI; runbook added; docs corrected.

Merging closes the open Dependabot + code-scanning alerts on main once GitHub re-scans.
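
The http(s) scheme allowlist mentioned in the summary can be sketched roughly as below. This is an illustration only, not the code in src/crawler/parser.ts: the helper name `resolveHttpLink` and the `ALLOWED_PROTOCOLS` set are hypothetical. It assumes the WHATWG `URL` parser built into Node, which lowercases the scheme and strips surrounding whitespace and embedded tabs/newlines, so checking the parsed `protocol` catches variants that a string-prefix blocklist misses.

```typescript
// Hypothetical helper for illustration; not the actual parser.ts implementation.
const ALLOWED_PROTOCOLS = new Set<string>(["http:", "https:"]);

/**
 * Resolve `href` against `base` and return the absolute URL only when the
 * resolved scheme is http or https. Anything else (javascript:, data:,
 * vbscript:, mailto:, unparseable input) yields null.
 */
function resolveHttpLink(href: string, base: string): string | null {
  let resolved: URL;
  try {
    // The URL parser normalizes case and whitespace before we inspect the scheme.
    resolved = new URL(href, base);
  } catch {
    return null; // href could not be parsed at all
  }
  return ALLOWED_PROTOCOLS.has(resolved.protocol) ? resolved.href : null;
}
```

Checking the resolved URL rather than the raw attribute means relative links are kept while mixed-case or whitespace-padded `javascript:` links are dropped along with every other non-http(s) scheme.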

Resolves all 21 GitHub findings (18 Dependabot + 3 code-scanning) plus the issues from a three-pass quality audit. npm audit is clean and the full local gate is green (typecheck / lint / format / 77 tests + coverage / build / helm).

─── Security ───
- Dependency CVEs: npm overrides force patched transitive deps (ws 8.21, form-data 4.0.6, undici 7.28, vite 8.0.16, postcss, qs, protobufjs, brace-expansion, esbuild, @opentelemetry/core 2.8) and @anthropic-ai/sdk is bumped to 0.105. This clears all 20 dependency alerts (18 Dependabot + the ws / form-data code-scanning CVEs).
- SSRF guard (src/crawler/url-guard.ts): decode IPv4-mapped IPv6 (::ffff:…, incl. the hex-quad form Node emits) and re-gate on the embedded IPv4, so mapped loopback / RFC1918 / cloud-metadata can no longer bypass the block list.
- CodeQL js/incomplete-url-scheme-check (src/crawler/parser.ts): link extraction now allowlists http(s) on the resolved URL instead of blocklisting only `javascript:` (which let data:/vbscript:/mixed-case/whitespace variants slip).
- pgvector TLS now verifies (rejectUnauthorized:true, optional PG_CA_PATH CA pin); it was disabled. LLM analysis JSON is Zod-validated at the boundary.
- Slack DM handler gated to channel_type==='im'; it was running an embedding + LLM query on every message in any readable channel (unbounded Bedrock spend).
- chart: ExternalSecret optional provider keys gated behind provider conditionals (a Bedrock-default deploy no longer fails the whole secret sync); pod UID aligned to the image's 1001.

─── Reliability ───
- Explicit timeouts on every external call: Bedrock (requestHandler + AbortSignal.timeout), Anthropic/OpenAI (SDK timeout), pg Pool (statement/query/connection/idle), Slack sink (CircuitBreaker + client timeout). A hung dependency can no longer stall the single-writer crawl.
- Data integrity: the pipeline replaces a source's vectors upsert-then-prune (deleteByMetadata with a keepIds exclusion) so a mid-write failure can't wipe history and silently re-baseline away a real change.
- pgvector: dimension-drift guard (atttypmod) fails loud on a config change; HNSW index avoids IVFFlat's empty-table recall pitfall.
- Slack HTTP-mode receiver now actually binds; /crawl reports a skipped run honestly; OpenAI embeddings are batched to respect request caps.

─── Observability ───
- src/metrics.ts now emits the full surface the Grafana dashboard + PrometheusRule query (per-kind Bedrock token counters, a circuit_breaker.open gauge, crawl outcomes/failures, a change_score distribution with explicit 0–1 buckets, alert-send failures, pgvector errors, chunks/diffs). Previously most were defined-but-never-emitted, leaving four paging alerts permanently dead.
- NetworkPolicy: added egress to the OTel collector (tcp/4318); default-deny was dropping every metric and trace in production.
- OTel init de-duplicated: removed the programmatic SDK (rejected as a duplicate of the Dockerfile --require preload, and patching modules too late anyway); the preload is now the single env-driven path, and the orphaned OTel deps are pruned.

─── Code quality, tests, docs ───
- toMessage() dedups the error-stringify idiom; the CLI reuses crawlAll via an onResult callback; metric emission is consolidated into the engines so the CLI path counts identically.
- Tests 50 → 77 with coverage thresholds enforced in CI: alert-gating orchestration, analysis JSON parse/fallback, chunker/differ boundaries, vectors keepIds + dim-drift, url-guard mapped-IPv6, parser scheme filtering.
- Dockerfile ships sources.example.json as sources.json (the path the app reads); production was monitoring zero sources. BEDROCK_LLM_MODEL is a valid us.anthropic.*-v1:0 cross-region profile (was a bare alias invalid for Converse).
- Added docs/RUNBOOK.md; corrected stale Pino / metric-shape / model claims across the docs; documented the collector resource_to_telemetry requirement.

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
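
The IPv4-mapped IPv6 handling described under Security can be illustrated with a small sketch. It is not the code in src/crawler/url-guard.ts: the function names (`extractMappedIPv4`, `isBlockedIPv4`, `isBlockedIpLiteral`) and the exact block list are assumptions made for the example. It also assumes the hostname has already been normalized by the WHATWG `URL` parser (compressed `::ffff:` prefix), and it deliberately ignores DNS resolution and native IPv6 ranges such as ::1.

```typescript
/** Return the embedded dotted-quad IPv4 of an IPv4-mapped IPv6 host, else null. */
function extractMappedIPv4(host: string): string | null {
  // Strip URL-style brackets and normalize case: "[::FFFF:7F00:1]" -> "::ffff:7f00:1".
  const h = host.replace(/^\[|\]$/g, "").toLowerCase();

  // Dotted form, e.g. ::ffff:127.0.0.1
  const dotted = /^::ffff:(\d{1,3}(?:\.\d{1,3}){3})$/.exec(h);
  if (dotted) return dotted[1] ?? null;

  // Hex-quad form, e.g. ::ffff:7f00:1 (two 16-bit groups carry the 32-bit IPv4).
  const hex = /^::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})$/.exec(h);
  if (!hex) return null;
  const hi = parseInt(hex[1] ?? "", 16);
  const lo = parseInt(hex[2] ?? "", 16);
  return [hi >> 8, hi & 0xff, lo >> 8, lo & 0xff].join(".");
}

/** Illustrative block list: loopback, "this network", RFC1918, link-local/metadata. */
function isBlockedIPv4(ip: string): boolean {
  const parts = ip.split(".").map(Number);
  const valid =
    parts.length === 4 &&
    parts.every((p) => Number.isInteger(p) && p >= 0 && p <= 255);
  if (!valid) return true; // fail closed on anything malformed
  const [a = -1, b = -1] = parts;
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254) // includes 169.254.169.254 cloud metadata
  );
}

/** Gate an IP-literal hostname, re-checking any embedded IPv4 against the same list. */
function isBlockedIpLiteral(hostname: string): boolean {
  const mapped = extractMappedIPv4(hostname);
  if (mapped !== null) return isBlockedIPv4(mapped);
  if (/^\d{1,3}(\.\d{1,3}){3}$/.test(hostname)) return isBlockedIPv4(hostname);
  return false; // DNS names and other IPv6 forms are out of scope for this sketch
}
```

The point of the re-gate is that the mapped address goes through the same IPv4 check as a plain literal, so `[::ffff:7f00:1]` is treated exactly like `127.0.0.1` rather than passing as an unrecognized IPv6 host.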

The trivy-config and trivy-fs jobs ran a single Trivy invocation with both `format: sarif` and `exit-code: "1"`. When Trivy finds a HIGH/CRITICAL it exits non-zero, which marks the uploaded SARIF's invocation as executionSuccessful: false. GitHub then renders the code-scanning analysis as "Trivy is reporting errors" / a configuration error, even though the findings upload correctly. That's the banner shown on `main` after the weekly scheduled run flagged the ws / form-data CVEs.

Split each job into two steps:
- Report: `exit-code: "0"`, `format: sarif` → always a successful analysis, so code scanning records clean results and surfaces findings as alerts.
- Gate: a second `format: table`, `exit-code: "1"` step (with `skip-db-update` to reuse the DB the report step downloaded) that fails CI on HIGH/CRITICAL.

Same gate behavior (build still fails on HIGH/CRITICAL), but the code-scanning analysis is never poisoned by the gate's non-zero exit, so a real future CVE no longer re-triggers the configuration-error banner.

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>

stxkxs and others added 2 commits June 19, 2026 20:57

stxkxs marked this pull request as ready for review June 20, 2026 04:11

stxkxs merged commit 29b9ae8 into main Jun 20, 2026
9 checks passed

stxkxs deleted the harden-quality-and-security branch June 20, 2026 04:11

stxkxs merged 2 commits into main from harden-quality-and-security
