Pre-Prod Task List — SOLVRO MCP (MVP Blockers/Improvements) #17

@W1ndrunn3rr

Reference: Priority order for MVP launch


1. Persistent User & Session Storage

  • Replace in-memory SessionManager with a real database (PostgreSQL or MongoDB as a new compose.stack.yml service)
  • Migrate ConversationSession and Message models to SQLAlchemy (async ORM, if PostgreSQL) or Motor (async driver, if MongoDB)
  • Store: user_id, session_id, messages[], created_at, is_active, metadata
  • Add user registration/auth (JWT or API key) — currently any user_id string is accepted with zero validation
  • Migrate session_manager.py to async DB calls; drop the threading.Lock
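
A minimal sketch of the storage shape described above, assuming the stored fields map onto one table. It uses stdlib sqlite3 as a stand-in so it runs without a database service; the real target would be PostgreSQL or MongoDB with async calls. The table name and all function names (open_store, create_session, add_message, load_session) are hypothetical and not taken from session_manager.py.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone
from typing import Optional

# Columns mirror the fields listed in the task: user_id, session_id,
# messages[], created_at, is_active, metadata.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    session_id TEXT PRIMARY KEY,
    user_id    TEXT NOT NULL,
    messages   TEXT NOT NULL DEFAULT '[]',
    created_at TEXT NOT NULL,
    is_active  INTEGER NOT NULL DEFAULT 1,
    metadata   TEXT NOT NULL DEFAULT '{}'
)
"""


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the session store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def create_session(conn: sqlite3.Connection, user_id: str) -> str:
    """Insert an empty, active session and return its generated id."""
    session_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO sessions (session_id, user_id, created_at) VALUES (?, ?, ?)",
        (session_id, user_id, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return session_id


def add_message(conn: sqlite3.Connection, session_id: str, role: str, content: str) -> None:
    """Append one message to the session's JSON-encoded message list."""
    row = conn.execute(
        "SELECT messages FROM sessions WHERE session_id = ?", (session_id,)
    ).fetchone()
    if row is None:
        raise KeyError(session_id)
    messages = json.loads(row["messages"])
    messages.append({"role": role, "content": content})
    conn.execute(
        "UPDATE sessions SET messages = ? WHERE session_id = ?",
        (json.dumps(messages), session_id),
    )
    conn.commit()


def load_session(conn: sqlite3.Connection, session_id: str) -> Optional[dict]:
    """Return the session as a plain dict, or None if it does not exist."""
    row = conn.execute(
        "SELECT * FROM sessions WHERE session_id = ?", (session_id,)
    ).fetchone()
    if row is None:
        return None
    return {
        "session_id": row["session_id"],
        "user_id": row["user_id"],
        "messages": json.loads(row["messages"]),
        "created_at": row["created_at"],
        "is_active": bool(row["is_active"]),
        "metadata": json.loads(row["metadata"]),
    }
```

Because state lives in the store rather than in process memory, closing and reopening the connection (a restart) still returns the same session from load_session. In the async migration each of these calls would become an awaited DB call, which is what makes the threading.Lock unnecessary.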

2. Multi-threaded / Concurrent Data Pipeline

  • Make the Prefect pipeline process documents in parallel
  • Replace sequential for page in pages: loop in pipeline.py with asyncio.gather() or Prefect's task.submit() with thread/process pool
  • Each page: extract → generate Cypher → populate runs concurrently (configurable concurrency limit)
  • Add idempotency: track processed documents (hash → DB)
  • Write integration test: mock 10 pages and verify concurrent write without schema reflection race
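
A rough sketch of the asyncio.gather() variant with a configurable concurrency limit and hash-based idempotency. process_page is a placeholder for the real extract, generate Cypher, populate steps; run_pipeline and processed_hashes are hypothetical names, and an in-memory set stands in for the hash table in the DB.

```python
import asyncio
import hashlib
from typing import Dict, List, Set


async def process_page(page: str) -> str:
    # Placeholder for: extract -> generate Cypher -> populate.
    await asyncio.sleep(0.01)
    return f"processed:{page}"


async def run_pipeline(
    pages: List[str], processed_hashes: Set[str], max_concurrency: int = 4
) -> List[str]:
    """Process unseen pages concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    # Idempotency: skip pages whose content hash was already processed,
    # and collapse duplicates inside the same batch.
    pending: Dict[str, str] = {}
    for page in pages:
        digest = hashlib.sha256(page.encode("utf-8")).hexdigest()
        if digest not in processed_hashes:
            pending.setdefault(digest, page)

    async def guarded(digest: str, page: str) -> str:
        async with semaphore:
            result = await process_page(page)
        # Record the hash only after the page was processed successfully.
        processed_hashes.add(digest)
        return result

    # gather() returns results in the order the coroutines were passed in.
    return await asyncio.gather(*(guarded(d, p) for d, p in pending.items()))
```

Running it twice over the same 10 pages with the same processed_hashes set returns 10 results the first time and an empty list the second time. The semaphore bounds how many pages are in flight; it does not by itself protect against the schema reflection race, which the integration test still needs to cover.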

3. Google Drive as Data Source

  • Replace Azure Blob with Google Drive in data_acquisition.py
  • Authenticate via Google Drive API (service account JSON, secret-managed)
  • List files from configured Drive folder (support PDF, DOCX, TXT)
  • Download to temp dir, pass paths to OCR extraction
  • Add GOOGLE_DRIVE_FOLDER_ID and GOOGLE_SERVICE_ACCOUNT_JSON to .env.example
  • Handle pagination for large folders
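
A sketch of the folder listing with pagination, assuming `service` is a Drive v3 client such as the one returned by googleapiclient.discovery.build("drive", "v3", credentials=...). The function only relies on service.files().list(...).execute(), so it can be tested with a fake. build_query and iter_drive_files are hypothetical names.

```python
from typing import Any, Dict, Iterator

# MIME types for the three supported formats: PDF, DOCX, TXT.
SUPPORTED_MIME_TYPES = (
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "text/plain",
)


def build_query(folder_id: str) -> str:
    """Drive search query: direct children of the folder, not trashed, supported types."""
    mime_filter = " or ".join(f"mimeType = '{m}'" for m in SUPPORTED_MIME_TYPES)
    return f"'{folder_id}' in parents and trashed = false and ({mime_filter})"


def iter_drive_files(
    service: Any, folder_id: str, page_size: int = 100
) -> Iterator[Dict[str, str]]:
    """Yield file metadata dicts, following nextPageToken until the listing ends."""
    page_token = None
    while True:
        response = (
            service.files()
            .list(
                q=build_query(folder_id),
                fields="nextPageToken, files(id, name, mimeType)",
                pageSize=page_size,
                pageToken=page_token,
            )
            .execute()
        )
        yield from response.get("files", [])
        page_token = response.get("nextPageToken")
        if not page_token:
            break
```

In data_acquisition.py the folder id would come from GOOGLE_DRIVE_FOLDER_ID, and each yielded id would then be downloaded to the temp dir before OCR extraction. The download step is left out of this sketch.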

4. Neo4j Graph State Snapshot (Data Dump)

  • Export full graph after pipeline via CALL apoc.export.cypher.all() to .cypher dump (stored in cloud)
  • On pipeline startup, check for dump and import if present, skipping LLM extraction
  • Add just dump-graph and just restore-graph recipes
  • Track pipeline_run metadata in Neo4j, only process new/changed files
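
A small sketch of the export call and the startup check, assuming `session` behaves like a neo4j driver session (run(query, **params) returning a result with consume()). export_graph and startup_mode are hypothetical names, and the dump is treated as a local file here; fetching it from cloud storage is out of scope. APOC file export normally also has to be enabled on the server (apoc.export.file.enabled=true).

```python
from pathlib import Path
from typing import Any

# APOC procedure named in the task; 'cypher-shell' is one of its export formats.
EXPORT_QUERY = "CALL apoc.export.cypher.all($file, {format: 'cypher-shell'})"


def export_graph(session: Any, file_name: str) -> None:
    """Ask Neo4j to write the whole graph to file_name as Cypher statements."""
    session.run(EXPORT_QUERY, file=file_name).consume()


def startup_mode(dump_path: Path) -> str:
    """Return 'restore' if a non-empty dump exists, otherwise 'extract'."""
    if dump_path.is_file() and dump_path.stat().st_size > 0:
        return "restore"
    return "extract"
```

On pipeline startup, a result of "restore" would trigger the import and skip LLM extraction; "extract" falls back to the normal pipeline run. The just dump-graph and just restore-graph recipes could wrap the same two entry points.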

5. Frontend Improvements

  • Conversation naming (LLM-powered title after first reply; store in session.metadata["title"])
  • Conversation list sidebar (titles, timestamp, rename/delete)
  • Message streaming (SSE for /api/chat)
  • Empty and error states (suggested questions, API error surfacing)
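
For the streaming item, a sketch of how reply chunks could be framed as server-sent events. The payload shape ({"delta": ...}) and the final "done" event are assumptions for illustration, not an existing /api/chat contract; sse_event and stream_reply are hypothetical names.

```python
import asyncio
import json
from typing import AsyncIterator, Iterable, Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Format one SSE frame: optional event name, one data: line per payload line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


async def stream_reply(tokens: Iterable[str]) -> AsyncIterator[str]:
    """Yield one frame per token, then a final 'done' event."""
    for token in tokens:
        yield sse_event(json.dumps({"delta": token}))
        await asyncio.sleep(0)  # give other tasks a chance to run between frames
    yield sse_event("{}", event="done")
```

If the API is built on FastAPI or Starlette (an assumption), the generator can be returned as StreamingResponse(stream_reply(tokens), media_type="text/event-stream"), and the frontend can append each delta to the message as it arrives.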

6. Component & Integration Tests

  • Core logic test coverage is near zero; see recommended coverage targets for:
    • Guardrails, schema caching, cypher generation, pipeline execution
    • Full API/graph integration tests with real or mocked backends

7. (Extra) Query Caching & Rate Limiting

  • Add semantic query cache (Redis + pgvector); check for near-duplicates before running LLM/graph
  • Rate limiting on /api/chat (slowapi middleware), default 20 req/min per user
  • Both prod features toggleable in .env
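
To pin down what "20 req/min per user" and "toggleable" mean, here is a stdlib sliding-window sketch. It is not slowapi and not meant to replace it; it only illustrates the expected behaviour. SlidingWindowLimiter and its parameters are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user within the last `window_seconds`."""

    def __init__(
        self,
        limit: int = 20,
        window_seconds: float = 60.0,
        enabled: bool = True,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.enabled = enabled
        self.clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        # When the feature is switched off, every request passes.
        if not self.enabled:
            return True
        now = self.clock()
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

With the defaults, the 21st request from one user inside a minute is rejected while other users are unaffected. The `enabled` flag would be read from a .env variable (name to be decided), so the feature can be turned off outside prod.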

Blocking/effort tags:

  • Graph dump/restore: low effort, blocks prod (MVP must-have)
  • Google Drive source: medium effort, blocks prod
  • Tests: medium effort, doesn't block prod
  • Persistent storage: high effort, blocks prod (sessions lost on restart!)
  • Concurrent pipeline: medium effort, doesn't block prod
  • Frontend UX: medium effort, doesn't block prod
  • Cache/rate limit: low effort, doesn't block prod

See the full task list above for details. Prioritize blockers first, then by effort.


Labels: MCP (MCP related task)
