feat(processor): Phase 5 — handler registry + Tier-1 types + safe streaming ZIP#108
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (4)
WalkthroughThis PR introduces ZIP archive support to the upload processor while refactoring single-file extraction into a pluggable handler registry. The implementation validates ZIP safety (path traversal, compression bombs, entry limits) without full decompression, implements handlers for seven file formats (text, PDF, DOCX, CSV, XLSX, JSON, HTML), adds per-entry database persistence, and orchestrates multi-entry processing with idempotency and per-entry error recovery. sequenceDiagram
participant Client
participant Storage
participant UploadProcessor
participant ZipValidator
participant Registry
participant DB
participant DocumentService
Client->>Storage: PUT object (uploadId, bytes)
Client->>UploadProcessor: trigger processUpload(uploadId)
UploadProcessor->>Storage: stream & hash object
UploadProcessor->>UploadProcessor: verify checksum_expected (if provided)
UploadProcessor->>ZipValidator: validateZip(buffer) (if declared_mimetype is ZIP)
ZipValidator-->>UploadProcessor: { candidates, skipped }
UploadProcessor->>DB: recordUploadEntry(skipped...) (mark unsupported/skipped)
UploadProcessor->>DB: listUploadEntries(uploadId) (load prior entry states)
loop per candidate
UploadProcessor->>Registry: resolveForEntry(entry.name, bytes)
Registry-->>UploadProcessor: FileHandler | undefined
alt handler found
UploadProcessor->>DocumentService: createDocument(entry bytes, metadata)
DocumentService-->>UploadProcessor: documentId
UploadProcessor->>DB: recordUploadEntry({ state: 'completed', document_id })
UploadProcessor->>UploadProcessor: onDocumentIngested(documentId)
else no handler
UploadProcessor->>DB: recordUploadEntry({ state: 'skipped' })
end
end
UploadProcessor->>DB: record aggregated upload result
UploadProcessor-->>Client: upload state => failed/partial/completed
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/services/processor/handlers/archive-zip.ts`:
- Around line 135-141: The loop that iterates files currently calls
isOsJunk(path) before safety checks, which lets malicious paths like
../../Thumbs.db slip through; reorder the checks in the handler that processes
archive entries so isUnsafePath(path) (and any non-regular-path checks that
throw ZipPathTraversalError) run before isOsJunk(path), and only after
validating path safety apply the OS-junk filter; update the block around the for
(const f of files) loop, referencing isUnsafePath, isOsJunk, and the
ZipPathTraversalError to ensure unsafe entries are rejected first.
In `@src/services/processor/registry.ts`:
- Around line 42-43: resolveByMime and related functions (isSupportedMime,
extractByMime) currently use the raw mime string lowercased, which fails when
the value contains parameters like "; charset=utf-8"; update these functions to
normalize the incoming mime by trimming, lowercasing, and removing any
parameters (split on ';' and take the first token) before performing byMime
lookups or registry checks—use a local variable such as canonicalMime for the
normalized value and replace direct uses of mime.toLowerCase() in resolveByMime,
isSupportedMime, and extractByMime with that canonical value so registered types
like "text/html" match "text/html; charset=utf-8".
In `@src/services/upload-processor.ts`:
- Around line 258-263: Idempotency currently uses a pre-read snapshot
(priorCompleted) and then calls createDocument followed by recordUploadEntry,
which can race or fail between those steps causing duplicate/orphan documents;
fix by making per-entry processing atomic: for each entry (identified by
uploadId and entry_path) acquire a DB-level guard or perform the
create-and-record in a single transactional upsert so either both the document
and entry row are created/updated or neither are. Concretely, change the flow
around createDocument and recordUploadEntry so you either (A) run them inside a
single transaction that inserts the document and upserts the upload entry to
'completed' (using a unique key on upload_id+entry_path to avoid duplicates), or
(B) first insert/upsert an entry row with state='in_progress' (idempotent via
unique upload_id+entry_path) and then createDocument and atomically update that
row to 'completed' in the same transaction; reference functions/variables:
priorCompleted, listUploadEntries(uploadId), createDocument(...),
recordUploadEntry(...), uploadId, entry_path.
In `@src/utils/config.ts`:
- Around line 205-208: The ZIP_MAX_COMPRESSED_BYTES zod entry can return NaN
because parseInt may fail; update the transform on ZIP_MAX_COMPRESSED_BYTES to
parse the string into a number, validate with Number.isFinite (and optionally
positive/integer checks), and return undefined when the parsed value is
NaN/invalid so compressedLimit() won't receive NaN; locate the
ZIP_MAX_COMPRESSED_BYTES schema entry and change its transform to explicitly
check the parsed value (e.g., const n = parseInt(val,10); return
Number.isFinite(n) && n > 0 ? n : undefined) to harden numeric validation
referenced by compressedLimit() in archive-zip.ts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7c2a35fb-d236-4532-8d82-8af872707e21
⛔ Files ignored due to path filters (3)
pnpm-lock.yamlis excluded by!**/pnpm-lock.yamlsrc/services/processor/__tests__/fixtures/sample.csvis excluded by!**/*.csvsrc/services/processor/__tests__/fixtures/sample.xlsxis excluded by!**/*.xlsx
📒 Files selected for processing (29)
.env.exampledocs/ENV.mdpackage.jsonsrc/db/__tests__/uploads.test.tssrc/db/uploads.tssrc/services/__tests__/upload-processor.test.tssrc/services/processor.tssrc/services/processor/__tests__/archive-zip.test.tssrc/services/processor/__tests__/fixtures/sample.htmlsrc/services/processor/__tests__/fixtures/sample.jsonsrc/services/processor/__tests__/fixtures/sample.mdsrc/services/processor/__tests__/fixtures/sample.txtsrc/services/processor/__tests__/html.test.tssrc/services/processor/__tests__/registry.test.tssrc/services/processor/handlers/archive-zip.tssrc/services/processor/handlers/csv.tssrc/services/processor/handlers/docx.tssrc/services/processor/handlers/html.tssrc/services/processor/handlers/index.tssrc/services/processor/handlers/json.tssrc/services/processor/handlers/pdf.tssrc/services/processor/handlers/text.tssrc/services/processor/handlers/xlsx.tssrc/services/processor/registry.tssrc/services/processor/types.tssrc/services/upload-processor.tssrc/types/unzipper.d.tssrc/utils/config.tssrc/utils/errors.ts
| // Idempotent retry: reuse documents from entries a prior attempt already completed. | ||
| const priorCompleted = new Map( | ||
| (await listUploadEntries(uploadId)) | ||
| .filter((e) => e.state === 'completed' && e.document_id) | ||
| .map((e) => [e.entry_path, e.document_id as string]), | ||
| ); |
There was a problem hiding this comment.
Make per-entry processing atomic to prevent duplicate/orphan documents on retry/concurrency.
At Line 258 and Lines 295-333, idempotency is based on a pre-read snapshot plus post-write recordUploadEntry. If execution fails after createDocument but before recordUploadEntry(state: 'completed'), or two workers race, the same entry_path can generate multiple documents while only one entry row survives via upsert.
Also applies to: 295-333, 343-351
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/services/upload-processor.ts` around lines 258 - 263, Idempotency
currently uses a pre-read snapshot (priorCompleted) and then calls
createDocument followed by recordUploadEntry, which can race or fail between
those steps causing duplicate/orphan documents; fix by making per-entry
processing atomic: for each entry (identified by uploadId and entry_path)
acquire a DB-level guard or perform the create-and-record in a single
transactional upsert so either both the document and entry row are
created/updated or neither are. Concretely, change the flow around
createDocument and recordUploadEntry so you either (A) run them inside a single
transaction that inserts the document and upserts the upload entry to
'completed' (using a unique key on upload_id+entry_path to avoid duplicates), or
(B) first insert/upsert an entry row with state='in_progress' (idempotent via
unique upload_id+entry_path) and then createDocument and atomically update that
row to 'completed' in the same transaction; reference functions/variables:
priorCompleted, listUploadEntries(uploadId), createDocument(...),
recordUploadEntry(...), uploadId, entry_path.
…tion, ZIP bomb-guard hardening
Implements Phase 5 of the large-upload feature (parent plan §8/§8a/§9, sub-plan
docs/plans/2026-06-10-phase5-handler-registry-zip-impl.md). Server-side only.Three atomic, test-first slices:
T5.1 — File-handler registry + Tier-1 types (
510ebe2)SUPPORTED_TYPESmap into a registry (src/services/processor/{types,registry}.ts+handlers/);processor.tsstays the stable façade (extractText/validateFileType/isSupportedTypedelegate), so existing callers are unchanged.T5.2 — HTML handler (
e228f7b)html-to-text;.html/.htm(text/html) extracted to readable text (scripts/styles dropped, original heading case preserved). Automatically a valid ZIP entry type too.T5.3 — Safe streaming ZIP (
aa4f330)..//absolute/drive/backslash), symlinks/non-regular entries, over-long names, nested archives. OS junk skipped; unsupported entries recordedskipped.failed(no documents); per-entry failures recorded →partial; all succeed →completed; zero supported →ZIP_NO_SUPPORTED_ENTRIES.completedentries reused (never re-decompressed); per-entry rows upsert on(upload_id, entry_path).ZIP_MAX_*, §7 defaults),ZIP_*error codes (§4),recordUploadEntry/listUploadEntriesDB writers. Documented in.env.example+docs/ENV.md.Notes / judgment calls
ZIP_PATH_TRAVERSAL(§4 defines no dedicated codes; folded into the "unsafe entry path" code).failed/PROCESSING_FAILED, notpartial(partial requires ≥1 success per §5).skipped/UNSUPPORTED_ENTRY, notfailed.Honest scope
Unblocks the server side of ZIP. The dashboard cannot send ZIP/large files until Phase 6 (resumable client + accept-list + error mapping) — do not advertise ZIP in the UI yet.
Verification
pnpm verifygreen (lint + lint:md + typecheck + test + security/docs/tool-sync + server build + docs site build).tool-sync-check.shstays green.partialwith one document per supported entry and per-entry rows.