ryanlane/document-manager

Archive Brain

License: MIT

Archive Brain is a local-first document archive assistant.
It ingests your personal files, enriches them with metadata, and enables semantic search and Retrieval-Augmented Generation (RAG), all running on your own machine.

This project is designed for people who want to understand and explore their archives, not ship their data to the cloud.


✨ What Archive Brain Does

  • Ingests documents from local folders automatically
  • Extracts text from PDFs, images (OCR), and plain text
  • Segments large documents into meaningful chunks
  • Uses local LLMs to generate:
    • Titles
    • Summaries
    • Tags
  • Builds vector embeddings for semantic search
  • Provides a gallery view for browsing and analyzing images with vision models
  • Shows a real-time dashboard with pipeline progress and the current processing phase
  • Lets you ask natural-language questions over your archive with RAG

Archive Brain Search UI
Main semantic search interface

All processing runs locally via Docker and Ollama.


πŸ” Data & Privacy Model

Archive Brain is local-first by default.

  • Files are read from your local filesystem
  • All processing happens inside Docker containers on your machine
  • LLM inference runs via Ollama (local or self-hosted)
  • No data is sent to external services unless you explicitly configure it

If you point the system at a remote LLM or external API, you control that tradeoff.


πŸš€ Quick Start

docker compose -f docker-compose.yml --profile prod up -d --build

On first run, the system will download required LLM models (several GB). You can monitor progress with:

docker compose -f docker-compose.yml --profile prod logs -f ollama-init

Once running, open:

  • Web UI: http://localhost:3000 - Search your archive with semantic queries
    • Browse files and images in gallery view
    • Analyze images with AI vision models
    • Monitor pipeline progress on the dashboard
  • API: http://localhost:8000

That’s it β€” the ingestion pipeline starts automatically.

➑️ New here? Read docs/first-run.md for what to expect on first startup.


πŸ“‚ Adding Your Documents

Archive Brain runs in Docker, so folders from your host system must be explicitly mounted before they can be indexed.

If you can search but clicking a document shows empty content, or "Open source" returns {"detail":"File not found on disk"}, the file path most likely exists in the database while the underlying folder is not mounted into the containers. Set STORY_SOURCE / KNOWLEDGE_SOURCE in .env to point at your real folders (see .env.example), then restart the stack.

This is a one-time setup step and is required before your files will appear in the UI.

➑️ Read: Adding Folders to Archive Brain


🧠 How It Works (High Level)

Archive Brain runs a background pipeline:

  1. Ingest – Scan folders and extract raw content
  2. Segment – Split content into logical chunks
  3. Enrich Documents – Generate metadata for full documents
  4. Enrich Chunks – Optionally enrich chunks (configurable mode)
  5. Embed Documents – Create vector embeddings for document-level search
  6. Embed Chunks – Create vector embeddings for chunk-level search
  7. Retrieve & Generate – Power semantic search and Q&A

The dashboard shows real-time progress, including which phase the worker is currently processing and estimated completion times for each stage.
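The stages above can be sketched as a simple sequential pipeline. This is an illustrative stand-in with stubbed functions and hypothetical names, not the project's actual worker code:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    path: str
    text: str = ""
    chunks: list = field(default_factory=list)      # segmented text
    metadata: dict = field(default_factory=dict)    # title, summary, tags
    embedding: list = field(default_factory=list)   # document-level vector

def ingest(path: str) -> Document:
    # Stage 1: scan a folder entry and extract raw content (stubbed here)
    return Document(path=path, text="raw extracted text")

def segment(doc: Document, size: int = 20) -> Document:
    # Stage 2: split content into fixed-size chunks (the real splitter
    # produces logical chunks, not fixed windows)
    doc.chunks = [doc.text[i:i + size] for i in range(0, len(doc.text), size)]
    return doc

def enrich(doc: Document) -> Document:
    # Stages 3-4: LLM-generated metadata (stubbed)
    doc.metadata = {"title": doc.path, "summary": doc.text[:20], "tags": []}
    return doc

def embed(doc: Document) -> Document:
    # Stages 5-6: vector embeddings (stubbed as a single length feature)
    doc.embedding = [float(len(doc.text))]
    return doc

def run_pipeline(path: str) -> Document:
    # Stage 7 (retrieve & generate) queries the stored vectors; not shown
    return embed(enrich(segment(ingest(path))))

doc = run_pipeline("/archive/example.txt")
```

In the real worker these stages run as separate phases over many documents, which is why the dashboard can report a per-phase status.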

Archive Brain Dashboard
Dashboard: pipeline status with real-time phase tracking

➑️ For details, see docs/architecture.md.


πŸ“¦ Supported File Types

| Type      | Extensions                            | Notes                    |
|-----------|---------------------------------------|--------------------------|
| Text      | .txt, .md, .html                      | Full text extraction     |
| PDF       | .pdf                                  | Text + OCR fallback      |
| Images    | .jpg, .png, .gif, .webp, .tiff        | OCR + vision descriptions|
| Documents | .docx                                 | Planned                  |
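Extraction dispatch by file extension can be sketched as below (handler names are illustrative; the real pipeline delegates to Apache Tika and Tesseract):

```python
from pathlib import Path

TEXT_EXTS = {".txt", ".md", ".html"}
IMAGE_EXTS = {".jpg", ".png", ".gif", ".webp", ".tiff"}

def extractor_for(path: str) -> str:
    # Pick an extraction strategy from the file extension,
    # mirroring the supported-types table above.
    ext = Path(path).suffix.lower()
    if ext in TEXT_EXTS:
        return "text"          # full text extraction
    if ext == ".pdf":
        return "pdf"           # text layer, with OCR fallback
    if ext in IMAGE_EXTS:
        return "ocr+vision"    # OCR plus vision description
    if ext == ".docx":
        raise NotImplementedError("docx support is planned")
    raise ValueError(f"unsupported file type: {ext}")

print(extractor_for("scan.TIFF"))  # -> ocr+vision
```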

πŸ–ΌοΈ Image Gallery & Analysis

Archive Brain includes a dedicated gallery view for browsing and analyzing images:

  • Grid and list views for browsing all extracted images
  • Lightbox viewer with full-resolution display
  • OCR text extraction from images
  • AI-powered descriptions using vision models (LLaVA)
  • On-demand analysis - generate descriptions for any image with a single click
  • Sortable views - sort by date, filename, or file size

Images are automatically extracted during ingestion and can be analyzed individually or in batch.
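On-demand analysis ultimately means sending the image to a vision model. Ollama's /api/generate endpoint accepts base64-encoded images for vision models such as LLaVA; the sketch below only builds the request payload (you would POST it to your Ollama server), and uses a stand-in file rather than a real image:

```python
import base64
import os
import tempfile

def vision_request(image_path: str, model: str = "llava") -> dict:
    # Build an Ollama /api/generate payload asking a vision model to
    # describe one image. The prompt text here is illustrative.
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "prompt": "Describe this image in two sentences.",
        "images": [encoded],
        "stream": False,
    }

# Demo with a stand-in file (a real call would use an actual image):
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
    tmp.write(b"\x89PNG fake bytes")
payload = vision_request(tmp.name)
os.unlink(tmp.name)
```

The model name should come from OLLAMA_VISION_MODEL in your configuration rather than being hard-coded.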


βš™οΈ Configuration

cp .env.example .env

Key settings:

  • OLLAMA_MODEL – Chat model for enrichment and Q&A
  • OLLAMA_EMBEDDING_MODEL – Embedding model for vector search
  • OLLAMA_VISION_MODEL – Vision model for image analysis
  • DB_PASSWORD – Database password

Source folders and file types are defined in:

config/config.yaml
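Values in .env reach the containers as environment variables, so application code can read them with ordinary lookups. A minimal sketch (the fallback model names here are illustrative, not the project's actual defaults):

```python
import os

# Illustrative fallbacks; the project's real defaults may differ.
DEFAULTS = {
    "OLLAMA_MODEL": "llama3",
    "OLLAMA_EMBEDDING_MODEL": "nomic-embed-text",
    "OLLAMA_VISION_MODEL": "llava",
}

def setting(name: str) -> str:
    # Prefer the environment variable (populated from .env),
    # falling back to a default when unset.
    return os.environ.get(name, DEFAULTS[name])

os.environ["OLLAMA_MODEL"] = "qwen2.5"  # as if set in your .env
print(setting("OLLAMA_MODEL"))
```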

Performance Tuning

Chunk Enrichment Mode

Control how chunks are processed to balance speed vs. metadata richness:

  • none – Skip chunk enrichment entirely (fastest)
  • embed_only – Only create embeddings, no LLM enrichment (recommended default)
  • full – Full LLM enrichment with titles, summaries, and tags (slowest)

Change this in Settings via the web UI or by calling the API. The embed_only mode provides excellent search quality while dramatically reducing processing time.
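The three modes trade work per chunk for metadata richness. A stand-in dispatch (stubbed enrich/embed steps; names are illustrative):

```python
def process_chunk(chunk: str, mode: str) -> dict:
    # Dispatch on the chunk-enrichment mode described above.
    result = {"text": chunk}
    if mode == "none":
        return result                              # fastest: raw text only
    result["embedding"] = [float(len(chunk))]      # stand-in for a real vector
    if mode == "full":
        # slowest: also call the LLM for titles, summaries, and tags
        result["metadata"] = {"title": chunk[:12], "tags": []}
    return result

print(process_chunk("hello world", "embed_only"))
```

With embed_only, chunks still get vectors (so chunk-level semantic search works) but skip the per-chunk LLM calls, which is where most of the processing time goes.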

Multi-Provider LLM Support

Archive Brain can distribute load across multiple LLM providers for faster processing:

  • Configure additional Ollama servers or cloud providers
  • Worker automatically balances requests across available providers
  • Improves throughput for large archives
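The balancing idea can be sketched as a simple round-robin pool (a stand-in for whatever scheduling the worker actually uses; the URLs are hypothetical):

```python
import itertools

class ProviderPool:
    """Hand out LLM provider endpoints in round-robin order."""

    def __init__(self, urls):
        self._cycle = itertools.cycle(urls)

    def next_provider(self) -> str:
        # Each request goes to the next provider in the rotation.
        return next(self._cycle)

pool = ProviderPool(["http://ollama-a:11434", "http://ollama-b:11434"])
assignments = [pool.next_provider() for _ in range(4)]
print(assignments)
```

A production balancer would also weight providers by speed and skip unhealthy ones, but round-robin already spreads a batch of enrichment requests across servers.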

πŸ–₯️ Deployment Options

Default (Recommended)

Self-contained Docker setup with Ollama included:

docker compose -f docker-compose.yml --profile prod up -d

NVIDIA GPU Acceleration

Requires NVIDIA Container Toolkit:

docker compose -f docker-compose.yml -f docker-compose.gpu.yml --profile prod up -d

External Ollama (Advanced)

Run Ollama on the host or another machine:

export OLLAMA_URL=http://host.docker.internal:11434
docker compose -f docker-compose.yml -f docker-compose.external-llm.yml --profile prod up -d

πŸ”„ Re-running & Iteration

  • Pipeline steps are designed to be idempotent
  • Re-running ingestion will skip unchanged files
  • Metadata and embeddings are reused when possible
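Skipping unchanged files typically comes down to change detection. One common approach is a content hash, sketched here (illustrative; the project's actual change detection may differ):

```python
import hashlib

def file_fingerprint(data: bytes) -> str:
    # A stable fingerprint of the file's content.
    return hashlib.sha256(data).hexdigest()

def needs_reingest(data: bytes, seen: dict, path: str) -> bool:
    # Skip files whose content hash is unchanged since the last run,
    # making repeated ingestion runs effectively idempotent.
    fp = file_fingerprint(data)
    if seen.get(path) == fp:
        return False
    seen[path] = fp
    return True

seen = {}
print(needs_reingest(b"hello", seen, "a.txt"))   # new file
print(needs_reingest(b"hello", seen, "a.txt"))   # unchanged: skipped
print(needs_reingest(b"hello!", seen, "a.txt"))  # content changed
```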

To reset everything:

docker compose -f docker-compose.yml --profile prod down -v
docker compose -f docker-compose.yml --profile prod up -d --build

🚧 Current Limitations

  • Single-user only - no multi-tenancy support
  • No authentication or access control
  • Not optimized for real-time ingestion - designed for batch processing
  • Large archives may require extended processing time
    • Use embed_only mode for faster processing
    • GPU acceleration recommended for 1M+ document archives
  • Worker cycles through phases - processes one type of task at a time (documents → chunks → embeddings)

🧬 Embeddings Visualization

Embeddings Visualization
Visualize document and chunk embeddings

πŸ› οΈ Tech Stack

  • PostgreSQL + pgvector
  • Python + FastAPI
  • React + Vite
  • Apache Tika, Tesseract OCR
  • Ollama (LLMs)
  • Docker Compose

πŸ“š Documentation


πŸ“„ License

MIT