Skip to content

fix: clear all gh code-scanning CVEs + quality-audit findings#6

Merged
stxkxs merged 2 commits into
mainfrom
quality-audit-and-security-fixes
Jun 20, 2026
Merged

fix: clear all gh code-scanning CVEs + quality-audit findings#6
stxkxs merged 2 commits into
mainfrom
quality-audit-and-security-fixes

Conversation

@stxkxs

@stxkxs stxkxs commented Jun 20, 2026

Copy link
Copy Markdown
Member

See commit message for full details.

Summary

  • Security: clears all 5 HIGH gh code-scanning CVEs + 26 moderate OTel findings; both lockfiles now at 0 npm-audit vulnerabilities (was 28 + 7). Adds authenticated DB TLS (bundled RDS CA) and Pino log redaction.
  • Systems/quality: pg timeouts, graceful SIGTERM drain (+ terminationGracePeriodSeconds), audit-consumer early shutdown, Bedrock token metering, bounded cache, composite chunks PK, plus a batch of correctness/cleanup fixes.
  • Docs: all 6 docs rewritten off the dead ECS/CDK/Lambda model onto the actual EKS/Helm/ArgoCD/KEDA/landing-zone/Platform/ESO deployment.

Verification

130 tests pass · coverage 92.8/79/90/95 · typecheck/lint/format/build/chart-lint green · 0 vulnerabilities.

🤖 Two-pass run: quality audit + fix, then an adversarial regression-hunt of the diff (caught & fixed 4 self-introduced regressions before commit).

A two-pass quality-check-and-fix sweep: resolve every GitHub
code-scanning vulnerability, then fix the findings from a 9-dimension
quality audit (and the regressions that first pass introduced). Net
result: 0 npm-audit vulnerabilities in both lockfiles, 130 tests green,
coverage 92.8/79/90/95, typecheck/lint/format/build/chart-lint all clean.

─────────────────────── Supply chain / CVEs ───────────────────────
All 5 HIGH gh code-scanning CVEs were transitive deps; pinned via npm
`overrides`: @grpc/grpc-js→1.14.4, form-data→4.0.6, protobufjs→7.6.1,
ws→8.21.0 (all in-major patch/minor). The 26 MODERATE OpenTelemetry
findings (@opentelemetry/core W3C-Baggage DoS) are fixed at the source by
bumping @opentelemetry/auto-instrumentations-node 0.76→0.77, which pulls
sdk-node 0.219 → core 2.8.0 — a real dep bump rather than a fragile
override. Dev-only vite HIGH + esbuild low cleared via overrides. The
OAuth subpackage's AWS-SDK xml transitives (fast-xml-builder/-parser,
@aws-sdk/xml-builder) bumped within range; its own vite/esbuild overrides
added. Both lockfiles now report 0 vulnerabilities.

──────────────────────── Systems & reliability ────────────────────────
- pgvector Pool gains statement_timeout/query_timeout (5s),
  connectionTimeoutMillis (2s), idleTimeoutMillis (30s) — the hottest
  user-path call was the one external call with no timeout, so a
  hung-but-alive Postgres could pin the Bolt handler slot.
- Graceful SIGTERM drain: the query handler now tracks in-flight queries
  in a set and exposes drainInFlight(deadline); shutdown ordering is
  app.stop() → drain (25s) → flushMetrics → close, so a deploy mid-query
  no longer drops the awaited compliance audit. The chart's main
  Deployment gains terminationGracePeriodSeconds: 45 to fit the drain.
- audit-consumer shutdown exits as soon as the receive loop drains
  (loopExited flag, 500ms tick) instead of always burning the full 30s.
- uncaughtException now logs + exits(1) for a clean restart instead of
  swallowing into an undefined process state; unhandledRejection stays
  swallowed (Bolt Socket-Mode reconnect noise) and is now clearly scoped.
- Email cache is bounded (max 10k, oldest-evicted, expired-on-read) so a
  churn of distinct users can't grow it without limit.
- Bedrock answer path now meters input/output/cache-read tokens from the
  response usage block — the AI path's cost is observable per-query.

──────────────────────────── Security ────────────────────────────
- Postgres TLS now verifies the Aurora server cert against the bundled
  Amazon RDS global CA (certs/rds-global-bundle.pem, COPYed into the
  image) instead of blanket rejectUnauthorized:false — authenticated
  encryption, MITM-resistant. Secure by default (PG_SSL_REJECT_UNAUTHORIZED
  defaults true) with a documented escape hatch and graceful degrade if
  the CA is unreadable. Applied to both the app and the demo seeder.
- Pino logger gains a redact config mirroring the OAuth module's token
  field set + the realistic nested header/SDK-error shapes, so tokens and
  secrets never land in app logs.
- pii-scrubber email regex char-class bug fixed ([A-Z|a-z] → [A-Za-z],
  the pipe was matching a literal '|').

──────────────────────── Data model & schema ────────────────────────
chunks table keyed on a composite PRIMARY KEY (doc_id, chunk_index)
instead of doc_id alone, which had capped each document to a single chunk
row. initSchema runs its DDL on a dedicated client inside a transaction
with SET LOCAL statement_timeout = 0, so the new hot-path Pool timeout
can't kill an IVFFlat index build on a populated table. Seeder INSERT
upserts on (doc_id, chunk_index) and its PGDATABASE default is corrected
to the underscored name; its Bedrock embed call and Pool gain timeouts.

─────────────────────── Config & correctness ───────────────────────
- BEDROCK_LLM_MODEL_ID corrected to the cross-region inference profile
  (us.anthropic.claude-sonnet-4-6) across chart/values.yaml, .env.example,
  and CLAUDE.md — the bare anthropic.* ID is not invocable on-demand and
  would have shipped a broken default to every deploy.
- Removed dead config (CRAWL_INTERVAL_MINUTES) and a dead export
  (scrubPiiFromObject); rewrote a comment that recommended the
  CI-banned vi.mock("ioredis") pattern; fixed a stale Node-version note.
- Coverage config honestly excludes the two bootstrap entry files
  (src/redis.ts, src/bin/audit-consumer.ts) — only verifiable against
  real Redis / a live process — matching the existing index.ts precedent.

──────────────────────────── Documentation ────────────────────────────
All six docs (runbook, integrations, qa-playbook, rag-architecture,
test-plan, threat-model) rewritten off the dead AWS ECS/ECR/CDK/Fargate/
Lambda deployment model and onto the actual one: an eks-agent-platform
Platform tenant whose Helm chart ArgoCD reconciles from git, with
landing-zone substrate, the Platform CR's IRSA, External Secrets, and a
KEDA-scaled audit-consumer Deployment. The runbook is fully reworked to
kubectl/helm/argocd incident steps and the chart's PrometheusRule alerts.

──────────────────────────── Testing ────────────────────────────
Added tests for the graceful-drain/in-flight tracking and the Bedrock
token metering; coverage rose to 92.8/79/90/95 against the 75/60/75/75
thresholds. No SDK module-mocking introduced (the grep-enforced ban
holds); every new external touchpoint carries an explicit timeout.

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
@stxkxs

stxkxs commented Jun 20, 2026

Copy link
Copy Markdown
Member Author

Deferred items (tracked separately)

These quality-audit findings were deliberately not included here — each needs validation infra, a design decision, or a feature slice beyond an audit-fix:

CodeQL js/incomplete-url-substring-sanitization (HIGH) flagged two test
fetch-fakes that routed by `url.includes("api.notion.com")` — a substring
check an attacker URL like `evil.com/api.notion.com` could bypass. These
are test fixtures, not production security checks, but the pattern is worth
correcting: route on the parsed hostname (`new URL(url).hostname ===
"api.notion.com"`) instead. Resolves the two open CodeQL alerts at
src/connectors/acl-guard.test.ts and
src/slack/query-handler.integration.test.ts.

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
@stxkxs stxkxs marked this pull request as ready for review June 20, 2026 03:44
@stxkxs stxkxs merged commit cade037 into main Jun 20, 2026
9 checks passed
@stxkxs stxkxs deleted the quality-audit-and-security-fixes branch June 20, 2026 03:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant