
fix: Ingest scanner, bulk queue, error handling, metrics, and MCP job submission#60

Merged
mriechers merged 10 commits into main from worktree-bold-fox-rn6y on Mar 31, 2026

Conversation

@mriechers (Owner) commented Mar 31, 2026

Summary

  • Recursive directory scanning — Scanner now descends into subdirectories (up to 3 levels deep). Default config changed to scan from root /, auto-discovering directories like /IWP/. Migration 009 updates stored config.
  • Bulk queue error fix — Frontend now correctly reads BulkQueueResponse wrapper. Also fixes orphaned file records on job creation failure.
  • Global JSON exception handler — All unhandled server errors return JSON instead of HTML. Fixes "Unexpected token '<'" on uploads.
  • Transcript metrics persistence — Worker saves word_count and duration_minutes to the job record after calculation, surviving failures and retries.
  • Rate limit relaxation — RATE_EXPENSIVE bumped to 30/min, made configurable via env vars.
  • MCP job submission tool — New submit_processing_job tool queues jobs by Media ID from any MCP-connected workspace.

Issues resolved (14 total)

Issue  Fix
#57    Recursive directory scanning discovers /IWP/ and future directories
#45    Frontend reads BulkQueueResponse wrapper correctly
#30    Global JSON exception handler + upload mkdir fix
#44    Worker persists transcript metrics to DB
#39    Metrics now survive failures (root cause of silent None)
#41    Rate limits relaxed to 30/min and made configurable
#47    New MCP tool: submit_processing_job
#43    $0.00 costs are expected for free-tier models (no code change)
#17    Rename already complete (no action needed)
#26    Already fixed in Sprint 14
#27    Table creation works via dual-path init
#32    Docker permissions issue with guidance
#55    Timestamp report display verified correct

Test plan

  • 54 tests passing (42 scanner + 12 API), including 6 new recursive scanning tests
  • Deploy and run a scan — verify /IWP/ directory files appear
  • Bulk select transcripts on Ready for Work and queue — verify success toast
  • Upload transcripts — verify JSON errors on failure
  • Retry a failed job — verify word_count and duration_minutes populated
  • Test submit_processing_job MCP tool with a known Media ID

🤖 Generated with Claude Code

mriechers and others added 6 commits March 31, 2026 13:07
The scanner was only checking files in the top-level configured
directories (/misc/, /SCC2SRT/, /wisconsinlife/) and skipping all
subdirectories. This meant directories like /IWP/ were completely
invisible. Now scans from root (/) and recurses into subdirectories
up to 3 levels deep, respecting the ignore_directories config.

- Add recursive scanning with MAX_SCAN_DEPTH=3 to _scan_directory
- Split _parse_directory_listing to return (files, subdirs) tuple
- Change default directories from curated list to ["/"]
- Wire ignore_directories through router to scanner
- Update default scan_time from midnight to 07:00
- Add migration 009 to update stored config values
- Add 6 new tests for recursive scanning behavior

[Agent: Main Assistant]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
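
The depth-limited recursion described in this commit can be sketched roughly as below. This is a hypothetical stand-in for the real `_scan_directory`, assuming a `list_dir(path)` helper that returns a `(files, subdirs)` tuple (mirroring the `_parse_directory_listing` split); only `MAX_SCAN_DEPTH` and `ignore_directories` are named in the commit itself.

```python
MAX_SCAN_DEPTH = 3

def scan_directory(path, list_dir, ignore_directories=(), depth=0):
    """Collect files under path, recursing up to MAX_SCAN_DEPTH levels."""
    if depth >= MAX_SCAN_DEPTH:
        return []
    files, subdirs = list_dir(path)
    collected = list(files)
    for sub in subdirs:
        # Per the review note: ignore_directories matches by directory
        # NAME only, not by full path.
        name = sub.rstrip("/").rsplit("/", 1)[-1]
        if name in ignore_directories:
            continue
        collected.extend(
            scan_directory(sub, list_dir, ignore_directories, depth + 1)
        )
    return collected
```

Starting from `"/"` with the default config, a directory like `/IWP/` is discovered automatically instead of requiring a curated list.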
The frontend expected the bulk queue response to be a flat array but
the API returns a BulkQueueResponse wrapper object. Calling .filter()
on the object threw TypeError, caught as a generic "Failed to queue"
error — even though the files were actually queued on the backend.

Fix: Read .queued and .failed from the BulkQueueResponse directly.

Also fixes orphaned file records: if download_file succeeds but
create_job fails, the file status is now reset to 'new' instead of
being stuck as 'queued' with no job_id (invisible on Ready for Work).

[Agent: Main Assistant]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
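
The orphan-reset behavior on the backend side can be sketched as follows. `queue_file`, `download_file`, and `create_job` here are hypothetical stand-ins for the real helpers; the point is the rollback of the status field when job creation fails after a successful download.

```python
def queue_file(record, download_file, create_job):
    """Download a file and create its job; roll back status on failure."""
    download_file(record)
    record["status"] = "queued"
    try:
        record["job_id"] = create_job(record)
    except Exception:
        # Reset to 'new' so the file reappears on Ready for Work
        # instead of being stuck as 'queued' with no job_id.
        record["status"] = "new"
        record["job_id"] = None
        raise
    return record
```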

Added global exception handler in api/main.py to ensure all unhandled
server errors return JSON responses, not FastAPI's default HTML page.
The frontend was getting "Unexpected token '<'" when trying to parse
HTML error responses as JSON.

Also fixes:
- upload.py: mkdir now uses parents=True and catches PermissionError
  with a clear message instead of an unhandled exception
- TranscriptUploader.tsx: gracefully handles non-JSON error responses
  instead of crashing on response.json()

Closes #30, relates to #32 (Docker permissions)

[Agent: Main Assistant]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
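
The core idea of the handler, framework-free, looks roughly like the sketch below: any unhandled exception becomes a JSON payload with a generic message (no `str(exc)` leaked to the client, per the later review fix), so the frontend never tries to parse HTML. In the real code this logic would be registered on the FastAPI app (e.g. via `@app.exception_handler(Exception)`); the function shape here is an illustrative assumption.

```python
import json
import logging

logger = logging.getLogger("api")

def handle_unexpected(exc):
    """Return (status_code, json_text) for an unhandled server error."""
    # Full traceback goes to the server log only.
    logger.error("Unhandled exception", exc_info=True)
    body = {"detail": "Internal server error"}  # generic; no str(exc)
    return 500, json.dumps(body)
```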

Transcript metrics (word_count, duration_minutes) were calculated
in-memory for routing decisions but never saved to the database.
If a job failed and was retried, the metrics remained None because
the initial calculation was lost.

Now persists metrics to the job record after calculation, with a
guard to only write when the values are missing (avoids overwriting
on retry of already-populated jobs).

[Agent: Main Assistant]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
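
A minimal sketch of the write-only-when-missing guard, treating the job as a dict-like record; `persist` is a hypothetical stand-in for the DB write. The try/except mirrors the later review fix: a transient DB error on this non-critical backfill must not kill the whole job.

```python
import logging

logger = logging.getLogger("worker")

def persist_metrics(job, word_count, duration_minutes, persist):
    """Backfill metrics onto the job record without clobbering retries."""
    if job.get("word_count") is None:
        job["word_count"] = word_count
    if job.get("duration_minutes") is None:
        job["duration_minutes"] = duration_minutes
    try:
        persist(job)
    except Exception:
        # Non-critical backfill: log and continue rather than fail the job.
        logger.warning("metrics backfill failed", exc_info=True)
    return job
```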
Bumped RATE_EXPENSIVE from 10/min to 30/min and RATE_READ from
60/min to 120/min — the previous limits were too restrictive for
a single-user internal editorial tool. Both are now configurable
via RATE_LIMIT_EXPENSIVE and RATE_LIMIT_READ env vars.

Also includes the transcript metrics persistence fix from the
previous commit (workers now save word_count and duration_minutes
to the job record after calculation).

[Agent: Main Assistant]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
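
The env-var override described above can be sketched as below; the env var names match the commit, the defaults match the new values, and the `"N/minute"` limit-string format is an assumption about the rate-limiter library.

```python
import os

def rate_limits(env=os.environ):
    """Read configurable limits, falling back to the relaxed defaults."""
    return {
        "expensive": env.get("RATE_LIMIT_EXPENSIVE", "30/minute"),
        "read": env.get("RATE_LIMIT_READ", "120/minute"),
    }
```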
New tool allows queuing transcript processing jobs by Media ID from
any MCP-connected workspace (Claude Desktop, other projects). The
tool:

1. Checks for existing jobs to avoid duplicates
2. Searches local transcripts/ dir for matching files
3. Falls back to ingest server available_files
4. Queues via the appropriate API endpoint

This enables workflows like scheduling Content Calendar entries and
triggering processing in the same session without switching projects.

[Agent: Main Assistant]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
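
The four lookup steps can be sketched as a fall-through chain. All callables here (`find_existing_job`, `find_local_file`, `find_ingest_file`, `queue_job`) are hypothetical stand-ins for the real helpers; only the tool name and the ordering come from the commit.

```python
def submit_processing_job(media_id, find_existing_job, find_local_file,
                          find_ingest_file, queue_job):
    """Queue a processing job by Media ID, avoiding duplicates."""
    existing = find_existing_job(media_id)        # 1. duplicate check
    if existing is not None:
        return {"status": "exists", "job_id": existing}
    path = find_local_file(media_id)              # 2. local transcripts/
    if path is None:
        path = find_ingest_file(media_id)         # 3. ingest server fallback
    if path is None:
        return {"status": "not_found"}
    return {"status": "queued", "job_id": queue_job(media_id, path)}  # 4.
```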
@mriechers changed the title from "fix: Ingest scanner, bulk queue, and error handling improvements" to "fix: Ingest scanner, bulk queue, error handling, metrics, and MCP job submission" on Mar 31, 2026
mriechers and others added 4 commits March 31, 2026 16:23
The cost tracker was a single module-level global that got silently
overwritten when multiple jobs ran concurrently. The second job's
start_run_tracking() destroyed the first job's tracker, causing
the first job to write actual_cost=0.

Fix: replaced single global with a dict of trackers keyed by job_id.
Each job now gets its own isolated tracker. The chat() method looks
up the correct tracker by job_id, and end_run_tracking(job_id)
retrieves the right one.

Also:
- Set TESTING=1 early in conftest.py to disable rate limiter in tests
- Tests now pass: 307 passed, 0 failed

[Agent: Main Assistant]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes 5 critical issues from PR review:

1. Global exception handler: return generic "Internal server error"
   instead of leaking str(exc) to clients; add exc_info=True to log
2. Upload error: remove filesystem path from PermissionError detail
3. MCP submit_processing_job: replace bare except:pass with logged
   warnings for duplicate check and ingest search failures
4. Worker metrics persistence: wrap in try/except so transient DB
   errors don't kill the whole job for a non-critical backfill
5. Orphaned file reset: include exception details in error log

Also adds logging module to MCP server.

[Agent: Main Assistant]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Important issues:
- Add finally cleanup for _run_trackers dict to prevent memory leak
  when jobs crash before end_run_tracking (worker.py)
- Add 2 concurrency tests for per-job cost tracker isolation
  (test_llm.py)
- Replace 3 pre-existing bare except:pass in MCP server helpers
  with debug-level logging (server.py)

Suggestions:
- Add DEBUG logging to _parse_file_metadata bare except
  (ingest_scanner.py)
- Document ignore_directories matches by name only, not path
  (ingest_scanner.py)

309 tests passing.

[Agent: Main Assistant]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mriechers mriechers merged commit 96ab2da into main Mar 31, 2026
6 checks passed
@mriechers mriechers deleted the worktree-bold-fox-rn6y branch March 31, 2026 22:17
