
feat(processor): Phase 5 — handler registry + Tier-1 types + safe streaming ZIP#108

Merged
jeffgreendesign merged 4 commits into main from feat/processor-registry
Jun 12, 2026

@jeffgreendesign

Owner

Implements Phase 5 of the large-upload feature (parent plan §8/§8a/§9, sub-plan docs/plans/2026-06-10-phase5-handler-registry-zip-impl.md). Server-side only.

Three atomic, test-first slices:

T5.1 — File-handler registry + Tier-1 types (510ebe2)

  • Extraction moved from the hardcoded SUPPORTED_TYPES map into a registry (src/services/processor/{types,registry}.ts + handlers/); processor.ts stays the stable façade (extractText/validateFileType/isSupportedType delegate), so existing callers are unchanged.
  • Wires the already-present csv / xlsx / json parsers. Existing pdf/docx/txt/md behaviour preserved.
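
A minimal sketch of the registry-behind-a-façade shape described in T5.1. The `FileHandler` fields, the `HandlerRegistry` class, and the `extractText` signature are assumptions made for illustration; the PR's actual `types.ts` and `registry.ts` are not shown on this page.

```typescript
// Hypothetical shapes; not the PR's actual types.ts / registry.ts.
interface FileHandler {
  readonly id: string;
  readonly extensions: readonly string[]; // lowercase, with leading dot
  readonly mimeTypes: readonly string[]; // lowercase, no parameters
  extract(bytes: Uint8Array): Promise<string>;
}

class HandlerRegistry {
  private readonly byExt = new Map<string, FileHandler>();
  private readonly byMime = new Map<string, FileHandler>();

  register(handler: FileHandler): void {
    for (const ext of handler.extensions) this.byExt.set(ext.toLowerCase(), handler);
    for (const mime of handler.mimeTypes) this.byMime.set(mime.toLowerCase(), handler);
  }

  resolveByMime(mime: string): FileHandler | undefined {
    return this.byMime.get(mime.trim().toLowerCase());
  }

  resolveByFilename(name: string): FileHandler | undefined {
    const dot = name.lastIndexOf('.');
    return dot < 0 ? undefined : this.byExt.get(name.slice(dot).toLowerCase());
  }
}

const textHandler: FileHandler = {
  id: 'text',
  extensions: ['.txt', '.md'],
  mimeTypes: ['text/plain', 'text/markdown'],
  extract: async (bytes) => new TextDecoder('utf-8').decode(bytes),
};

const jsonHandler: FileHandler = {
  id: 'json',
  extensions: ['.json'],
  mimeTypes: ['application/json'],
  // Re-serialise so the extracted text is normalised; throws on invalid JSON.
  extract: async (bytes) =>
    JSON.stringify(JSON.parse(new TextDecoder('utf-8').decode(bytes)), null, 2),
};

const registry = new HandlerRegistry();
registry.register(textHandler);
registry.register(jsonHandler);

// Façade functions keep a stable call shape while delegating to the registry.
function isSupportedType(mime: string): boolean {
  return registry.resolveByMime(mime) !== undefined;
}

async function extractText(bytes: Uint8Array, mime: string): Promise<string> {
  const handler = registry.resolveByMime(mime);
  if (!handler) throw new Error(`Unsupported type: ${mime}`);
  return handler.extract(bytes);
}
```

With this shape, adding a type means registering one more handler object. Callers of the façade functions do not change.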

T5.2 — HTML handler (e228f7b)

  • Adds html-to-text; .html/.htm (text/html) extracted to readable text (scripts/styles dropped, original heading case preserved). Automatically a valid ZIP entry type too.

T5.3 — Safe streaming ZIP (aa4f330)

  • One ZIP → one document per supported entry. Full §9 safety enforced against the central directory before any decompression: entry count, compressed/expanded size, compression ratio, per-entry size, path traversal (.. segments, absolute paths, drive letters, backslashes), symlinks/non-regular entries, over-long names, nested archives. OS junk skipped; unsupported entries recorded as skipped.
  • Supported entries decompressed one at a time (memory bounded to a single entry).
  • Outcomes: archive-level violation → failed (no documents); per-entry failures recorded → partial; all succeed → completed; zero supported → ZIP_NO_SUPPORTED_ENTRIES.
  • Idempotent retry: completed entries reused (never re-decompressed); per-entry rows upsert on (upload_id, entry_path).
  • New config (ZIP_MAX_*, §7 defaults), ZIP_* error codes (§4), recordUploadEntry/listUploadEntries DB writers. Documented in .env.example + docs/ENV.md.
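
The archive-level checks in T5.3 can be sketched as a pure function over central-directory metadata, so nothing is decompressed before the archive is accepted. Field names, limit values, and every error code except `ZIP_PATH_TRAVERSAL` are assumptions for illustration; the real `ZIP_MAX_*` defaults and `ZIP_*` codes live in the plan's §7 and §4.

```typescript
// Illustrative only: names, limits, and codes (other than ZIP_PATH_TRAVERSAL)
// are assumptions, not the PR's actual definitions.
interface CentralDirEntry {
  path: string;
  compressedSize: number;
  uncompressedSize: number;
  isRegularFile: boolean; // false for symlinks and other non-regular entries
}

interface ZipLimits {
  maxEntries: number;
  maxCompressedBytes: number;
  maxExpandedBytes: number;
  maxEntryBytes: number;
  maxRatio: number;
  maxNameLength: number;
}

class ZipSafetyError extends Error {
  readonly code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
  }
}

function isUnsafePath(path: string): boolean {
  return (
    path.startsWith('/') || // absolute path
    path.includes('\\') || // backslash separators
    /^[A-Za-z]:/.test(path) || // drive letter
    path.split('/').includes('..') // parent-directory segment
  );
}

function assertArchiveSafe(entries: readonly CentralDirEntry[], limits: ZipLimits): void {
  if (entries.length > limits.maxEntries) {
    throw new ZipSafetyError('ZIP_TOO_MANY_ENTRIES', `${entries.length} entries`);
  }
  let compressed = 0;
  let expanded = 0;
  for (const e of entries) {
    // Symlinks and over-long names fold into the unsafe-path code, as in the PR notes.
    if (!e.isRegularFile || e.path.length > limits.maxNameLength || isUnsafePath(e.path)) {
      throw new ZipSafetyError('ZIP_PATH_TRAVERSAL', `unsafe entry: ${e.path}`);
    }
    if (/\.zip$/i.test(e.path)) {
      throw new ZipSafetyError('ZIP_NESTED_ARCHIVE', e.path);
    }
    if (e.uncompressedSize > limits.maxEntryBytes) {
      throw new ZipSafetyError('ZIP_ENTRY_TOO_LARGE', e.path);
    }
    compressed += e.compressedSize;
    expanded += e.uncompressedSize;
  }
  if (compressed > limits.maxCompressedBytes) {
    throw new ZipSafetyError('ZIP_TOO_LARGE', `${compressed} compressed bytes`);
  }
  if (expanded > limits.maxExpandedBytes) {
    throw new ZipSafetyError('ZIP_EXPANDED_TOO_LARGE', `${expanded} expanded bytes`);
  }
  if (compressed > 0 && expanded / compressed > limits.maxRatio) {
    throw new ZipSafetyError('ZIP_RATIO_EXCEEDED', `ratio ${expanded / compressed}`);
  }
}
```

Sizes declared in the central directory are supplied by whoever built the archive, so a check like this is a first gate. A byte counter during per-entry decompression would still be needed to catch entries that expand past their declared size.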

Notes / judgment calls

  • Symlink + over-long-name map to ZIP_PATH_TRAVERSAL (§4 defines no dedicated codes; folded into the "unsafe entry path" code).
  • All-entries-failed (≥1 supported, 0 succeeded) → failed/PROCESSING_FAILED, not partial (partial requires ≥1 success per §5).
  • Content-mismatch entry (extension lies) → skipped/UNSUPPORTED_ENTRY, not failed.
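
The outcome rules and judgment calls above reduce to a small aggregation over per-entry states. This is a sketch with assumed type names. It also makes two assumptions not stated here: that `ZIP_NO_SUPPORTED_ENTRIES` is reported as a failed upload, and that skipped entries alone do not downgrade a result to partial.

```typescript
// Hypothetical types; the state and code strings are taken from the PR description.
type EntryState = 'completed' | 'failed' | 'skipped';

type UploadOutcome =
  | { state: 'completed' }
  | { state: 'partial' }
  | { state: 'failed'; code: 'PROCESSING_FAILED' | 'ZIP_NO_SUPPORTED_ENTRIES' };

function aggregateOutcome(states: readonly EntryState[]): UploadOutcome {
  const completed = states.filter((s) => s === 'completed').length;
  const failed = states.filter((s) => s === 'failed').length;

  // No supported entry was attempted at all (everything was skipped, or the archive was empty).
  if (completed + failed === 0) {
    return { state: 'failed', code: 'ZIP_NO_SUPPORTED_ENTRIES' };
  }
  // At least one supported entry, none succeeded: failed, because partial needs >= 1 success.
  if (completed === 0) {
    return { state: 'failed', code: 'PROCESSING_FAILED' };
  }
  return failed > 0 ? { state: 'partial' } : { state: 'completed' };
}
```

Archive-level violations never reach this function: they fail the upload before any entry is processed.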

Honest scope

Unblocks the server side of ZIP. The dashboard cannot send ZIP/large files until Phase 6 (resumable client + accept-list + error mapping) — do not advertise ZIP in the UI yet.

Verification

  • Test-first throughout. 247 tests passing; pnpm verify green (lint + lint:md + typecheck + test + security/docs/tool-sync + server build + docs site build).
  • No MCP tool-list change → tool-sync-check.sh stays green.
  • End-to-end (post-merge + deploy): drive the resumable API with a real mixed-content ZIP → expect partial with one document per supported entry and per-entry rows.

@vercel

vercel Bot commented Jun 11, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| dashboard | Building | Preview, Comment | Jun 11, 2026 2:24am |
| textrawl | Ready | Preview, Comment | Jun 11, 2026 2:24am |

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9659b0c2-2f28-4bc6-8a49-76d14b1b09ea

📥 Commits

Reviewing files that changed from the base of the PR and between aa4f330 and c49eda4.

📒 Files selected for processing (5)
  • src/services/processor/__tests__/archive-zip.test.ts
  • src/services/processor/__tests__/registry.test.ts
  • src/services/processor/handlers/archive-zip.ts
  • src/services/processor/registry.ts
  • src/utils/config.ts
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/utils/config.ts
  • src/services/processor/__tests__/registry.test.ts
  • src/services/processor/registry.ts
  • src/services/processor/handlers/archive-zip.ts

Walkthrough

This PR introduces ZIP archive support to the upload processor while refactoring single-file extraction into a pluggable handler registry. The implementation validates ZIP safety (path traversal, compression bombs, entry limits) without full decompression, implements handlers for seven file formats (text, PDF, DOCX, CSV, XLSX, JSON, HTML), adds per-entry database persistence, and orchestrates multi-entry processing with idempotency and per-entry error recovery.

sequenceDiagram
  participant Client
  participant Storage
  participant UploadProcessor
  participant ZipValidator
  participant Registry
  participant DB
  participant DocumentService

  Client->>Storage: PUT object (uploadId, bytes)
  Client->>UploadProcessor: trigger processUpload(uploadId)
  UploadProcessor->>Storage: stream & hash object
  UploadProcessor->>UploadProcessor: verify checksum_expected (if provided)
  UploadProcessor->>ZipValidator: validateZip(buffer) (if declared_mimetype is ZIP)
  ZipValidator-->>UploadProcessor: { candidates, skipped }
  UploadProcessor->>DB: recordUploadEntry(skipped...) (mark unsupported/skipped)
  UploadProcessor->>DB: listUploadEntries(uploadId) (load prior entry states)
  loop per candidate
    UploadProcessor->>Registry: resolveForEntry(entry.name, bytes)
    Registry-->>UploadProcessor: FileHandler | undefined
    alt handler found
      UploadProcessor->>DocumentService: createDocument(entry bytes, metadata)
      DocumentService-->>UploadProcessor: documentId
      UploadProcessor->>DB: recordUploadEntry({ state: 'completed', document_id })
      UploadProcessor->>UploadProcessor: onDocumentIngested(documentId)
    else no handler
      UploadProcessor->>DB: recordUploadEntry({ state: 'skipped' })
    end
  end
  UploadProcessor->>DB: record aggregated upload result
  UploadProcessor-->>Client: upload state => failed/partial/completed
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is comprehensive and well-structured, covering the three atomic slices (T5.1, T5.2, T5.3), implementation details, judgment calls, scope limitations, and verification steps. However, the template requires a 'Summary' section, 'Changes' bullet list, 'Type of Change' checkboxes, and 'Related Issues' field, none of which are present in the provided description. | Add required template sections: a brief summary, bulleted changes list, checked type-of-change box (New feature), completed checklist items, and Related Issues reference. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 70.97% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(processor): Phase 5 — handler registry + Tier-1 types + safe streaming ZIP' clearly summarizes the main changes: a handler registry, additional file types, and ZIP archive support. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Comment thread src/services/processor/handlers/archive-zip.ts Fixed

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/services/processor/handlers/archive-zip.ts`:
- Around line 135-141: The loop that iterates files currently calls
isOsJunk(path) before safety checks, which lets malicious paths like
../../Thumbs.db slip through; reorder the checks in the handler that processes
archive entries so isUnsafePath(path) (and any non-regular-path checks that
throw ZipPathTraversalError) run before isOsJunk(path), and only after
validating path safety apply the OS-junk filter; update the block around the for
(const f of files) loop, referencing isUnsafePath, isOsJunk, and the
ZipPathTraversalError to ensure unsafe entries are rejected first.

In `@src/services/processor/registry.ts`:
- Around line 42-43: resolveByMime and related functions (isSupportedMime,
extractByMime) currently use the raw mime string lowercased, which fails when
the value contains parameters like "; charset=utf-8"; update these functions to
normalize the incoming mime by trimming, lowercasing, and removing any
parameters (split on ';' and take the first token) before performing byMime
lookups or registry checks—use a local variable such as canonicalMime for the
normalized value and replace direct uses of mime.toLowerCase() in resolveByMime,
isSupportedMime, and extractByMime with that canonical value so registered types
like "text/html" match "text/html; charset=utf-8".

In `@src/services/upload-processor.ts`:
- Around line 258-263: Idempotency currently uses a pre-read snapshot
(priorCompleted) and then calls createDocument followed by recordUploadEntry,
which can race or fail between those steps causing duplicate/orphan documents;
fix by making per-entry processing atomic: for each entry (identified by
uploadId and entry_path) acquire a DB-level guard or perform the
create-and-record in a single transactional upsert so either both the document
and entry row are created/updated or neither are. Concretely, change the flow
around createDocument and recordUploadEntry so you either (A) run them inside a
single transaction that inserts the document and upserts the upload entry to
'completed' (using a unique key on upload_id+entry_path to avoid duplicates), or
(B) first insert/upsert an entry row with state='in_progress' (idempotent via
unique upload_id+entry_path) and then createDocument and atomically update that
row to 'completed' in the same transaction; reference functions/variables:
priorCompleted, listUploadEntries(uploadId), createDocument(...),
recordUploadEntry(...), uploadId, entry_path.

In `@src/utils/config.ts`:
- Around line 205-208: The ZIP_MAX_COMPRESSED_BYTES zod entry can return NaN
because parseInt may fail; update the transform on ZIP_MAX_COMPRESSED_BYTES to
parse the string into a number, validate with Number.isFinite (and optionally
positive/integer checks), and return undefined when the parsed value is
NaN/invalid so compressedLimit() won't receive NaN; locate the
ZIP_MAX_COMPRESSED_BYTES schema entry and change its transform to explicitly
check the parsed value (e.g., const n = parseInt(val,10); return
Number.isFinite(n) && n > 0 ? n : undefined) to harden numeric validation
referenced by compressedLimit() in archive-zip.ts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7c2a35fb-d236-4532-8d82-8af872707e21

📥 Commits

Reviewing files that changed from the base of the PR and between c8746c0 and aa4f330.

⛔ Files ignored due to path filters (3)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
  • src/services/processor/__tests__/fixtures/sample.csv is excluded by !**/*.csv
  • src/services/processor/__tests__/fixtures/sample.xlsx is excluded by !**/*.xlsx
📒 Files selected for processing (29)
  • .env.example
  • docs/ENV.md
  • package.json
  • src/db/__tests__/uploads.test.ts
  • src/db/uploads.ts
  • src/services/__tests__/upload-processor.test.ts
  • src/services/processor.ts
  • src/services/processor/__tests__/archive-zip.test.ts
  • src/services/processor/__tests__/fixtures/sample.html
  • src/services/processor/__tests__/fixtures/sample.json
  • src/services/processor/__tests__/fixtures/sample.md
  • src/services/processor/__tests__/fixtures/sample.txt
  • src/services/processor/__tests__/html.test.ts
  • src/services/processor/__tests__/registry.test.ts
  • src/services/processor/handlers/archive-zip.ts
  • src/services/processor/handlers/csv.ts
  • src/services/processor/handlers/docx.ts
  • src/services/processor/handlers/html.ts
  • src/services/processor/handlers/index.ts
  • src/services/processor/handlers/json.ts
  • src/services/processor/handlers/pdf.ts
  • src/services/processor/handlers/text.ts
  • src/services/processor/handlers/xlsx.ts
  • src/services/processor/registry.ts
  • src/services/processor/types.ts
  • src/services/upload-processor.ts
  • src/types/unzipper.d.ts
  • src/utils/config.ts
  • src/utils/errors.ts

Comment thread src/services/processor/handlers/archive-zip.ts
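
The archive-zip.ts finding (a path such as ../../Thumbs.db skipping validation because the junk filter runs first) comes down to check ordering. A sketch of the corrected order follows. `isUnsafePath`, `isOsJunk`, and `ZipPathTraversalError` are names from the review; their bodies and the `selectCandidates` wrapper are assumptions.

```typescript
// Bodies below are illustrative stand-ins for the PR's helpers.
class ZipPathTraversalError extends Error {}

function isUnsafePath(path: string): boolean {
  return (
    path.startsWith('/') ||
    path.includes('\\') ||
    /^[A-Za-z]:/.test(path) ||
    path.split('/').includes('..')
  );
}

function isOsJunk(path: string): boolean {
  const base = path.split('/').pop() ?? '';
  return path.startsWith('__MACOSX/') || base === '.DS_Store' || base === 'Thumbs.db';
}

// Hypothetical wrapper: returns the entry paths that should go on to extraction.
function selectCandidates(paths: readonly string[]): string[] {
  const candidates: string[] = [];
  for (const path of paths) {
    // Safety first: an unsafe path rejects the archive even if it looks like OS junk.
    if (isUnsafePath(path)) {
      throw new ZipPathTraversalError(`unsafe entry path: ${path}`);
    }
    // Only paths already known to be safe are eligible for the silent junk skip.
    if (isOsJunk(path)) continue;
    candidates.push(path);
  }
  return candidates;
}
```

With the checks in this order, `selectCandidates(['a.txt', '.DS_Store'])` keeps only `a.txt`, while any list containing `../../Thumbs.db` throws.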
Comment thread src/services/processor/registry.ts Outdated
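
The registry.ts finding concerns MIME values that carry parameters, such as "text/html; charset=utf-8". A minimal sketch of the suggested normalisation follows. `canonicalMime` is the variable name proposed in the review, written here as a helper; the lookup table is a stand-in for the real registry.

```typescript
// Strip parameters, trim, and lowercase so lookups see only the type/subtype.
function canonicalMime(mime: string): string {
  return (mime.split(';')[0] ?? '').trim().toLowerCase();
}

// Stand-in table: maps a canonical MIME type to a hypothetical handler id.
const byMime = new Map<string, string>([
  ['text/html', 'html'],
  ['text/plain', 'text'],
]);

function resolveByMime(mime: string): string | undefined {
  return byMime.get(canonicalMime(mime));
}
```

Normalising once in a helper keeps `resolveByMime`, `isSupportedMime`, and `extractByMime` consistent, because all three can share the same canonical value.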
Comment on lines +258 to +263
// Idempotent retry: reuse documents from entries a prior attempt already completed.
const priorCompleted = new Map(
  (await listUploadEntries(uploadId))
    .filter((e) => e.state === 'completed' && e.document_id)
    .map((e) => [e.entry_path, e.document_id as string]),
);

Contributor


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Make per-entry processing atomic to prevent duplicate/orphan documents on retry/concurrency.

At Line 258 and Lines 295-333, idempotency is based on a pre-read snapshot plus post-write recordUploadEntry. If execution fails after createDocument but before recordUploadEntry(state: 'completed'), or two workers race, the same entry_path can generate multiple documents while only one entry row survives via upsert.

Also applies to: 295-333, 343-351
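
One way to picture the claim-then-complete flow (the reviewer's option B) is the sketch below. An in-memory map stands in for a table with a unique key on (upload_id, entry_path). The store class, its method names, and the synchronous `createDocument` callback are all hypothetical; a real implementation would rely on the database's unique constraint and a transaction.

```typescript
// In-memory stand-in for an upload-entries table keyed on (upload_id, entry_path).
type EntryRow = { state: 'in_progress' | 'completed'; documentId?: string };

class InMemoryEntryStore {
  private readonly rows = new Map<string, EntryRow>();

  private key(uploadId: string, entryPath: string): string {
    return `${uploadId}\u0000${entryPath}`;
  }

  get(uploadId: string, entryPath: string): EntryRow | undefined {
    return this.rows.get(this.key(uploadId, entryPath));
  }

  // Insert-if-absent: returns true only for the caller that created the row.
  tryClaim(uploadId: string, entryPath: string): boolean {
    const k = this.key(uploadId, entryPath);
    if (this.rows.has(k)) return false;
    this.rows.set(k, { state: 'in_progress' });
    return true;
  }

  complete(uploadId: string, entryPath: string, documentId: string): void {
    this.rows.set(this.key(uploadId, entryPath), { state: 'completed', documentId });
  }
}

// Returns the document id for the entry, or undefined if another worker holds the claim.
function processEntry(
  store: InMemoryEntryStore,
  uploadId: string,
  entryPath: string,
  createDocument: () => string,
): string | undefined {
  const prior = store.get(uploadId, entryPath);
  if (prior?.state === 'completed') return prior.documentId; // reuse, never recreate
  if (!store.tryClaim(uploadId, entryPath)) return undefined; // claimed elsewhere
  const documentId = createDocument();
  store.complete(uploadId, entryPath, documentId);
  return documentId;
}
```

The sketch does not handle a worker that dies after claiming: the row would stay `in_progress`, and a real design would need a lease or timeout to reclaim it.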


Comment thread src/utils/config.ts Outdated
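
The config.ts finding (a non-numeric ZIP_MAX_COMPRESSED_BYTES parsing to NaN) can be guarded with a small parser like the one below. `parsePositiveInt` is a hypothetical helper name. It follows the check suggested in the review and is written as a plain function that could serve as a schema transform.

```typescript
// Returns a positive integer, or undefined for missing or invalid input,
// so downstream limit code never receives NaN.
function parsePositiveInt(val: string | undefined): number | undefined {
  if (val === undefined) return undefined;
  const n = Number.parseInt(val, 10);
  return Number.isFinite(n) && n > 0 ? n : undefined;
}
```

Note that `parseInt` accepts a numeric prefix, so "12abc" yields 12. A stricter variant could first require the whole string to match `/^\d+$/`.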
@jeffgreendesign jeffgreendesign merged commit 6361176 into main Jun 12, 2026
7 of 8 checks passed
2 participants