ryanlane/document-manager

Archive Brain

License: MIT

Archive Brain is a local-first document archive assistant.
It ingests your personal files, enriches them with metadata, and enables semantic search and Retrieval-Augmented Generation (RAG), all running on your own machine.

This project is designed for people who want to understand and explore their archives, not ship their data to the cloud.


✨ What Archive Brain Does

  • Ingests documents from local folders automatically
  • Extracts text from PDFs, images (OCR), and plain text
  • Segments large documents into meaningful chunks
  • Uses local LLMs to generate:
    • Titles
    • Summaries
    • Tags
  • Builds vector embeddings for semantic search
  • Provides a gallery view for browsing and analyzing images with vision models
  • Shows a real-time dashboard with pipeline progress and the current processing phase
  • Lets you ask natural-language questions over your archive with RAG

Archive Brain Search UI
Main semantic search interface

All processing runs locally via Docker and Ollama.


πŸ” Data & Privacy Model

Archive Brain is local-first by default.

  • Files are read from your local filesystem
  • All processing happens inside Docker containers on your machine
  • LLM inference runs via Ollama (local or self-hosted)
  • No data is sent to external services unless you explicitly configure it

If you point the system at a remote LLM or external API, you control that tradeoff.


πŸš€ Quick Start

docker compose -f docker-compose.yml --profile prod up -d --build

On first run, the system will download required LLM models (several GB). You can monitor progress with:

docker compose -f docker-compose.yml --profile prod logs -f ollama-init

Once running, open:

  • Web UI: http://localhost:3000 - Search your archive with semantic queries
    • Browse files and images in gallery view
    • Analyze images with AI vision models
    • Monitor pipeline progress on the dashboard
  • API: http://localhost:8000

That’s it β€” the ingestion pipeline starts automatically.

➑️ New here? Read docs/first-run.md for what to expect on first startup.


πŸ“‚ Adding Your Documents

Archive Brain runs in Docker, so folders from your host system must be explicitly mounted before they can be indexed.

If you can search but clicking a document shows empty content, or "Open source" returns {"detail":"File not found on disk"}, the file path most likely exists in the database while the underlying folder is not mounted into the containers. Set STORY_SOURCE / KNOWLEDGE_SOURCE in .env to point at your real folders (see .env.example), then restart the stack.

This is a one-time setup step and is required before your files will appear in the UI.

➑️ Read: Adding Folders to Archive Brain


🧠 How It Works (High Level)

Archive Brain runs a background pipeline:

  1. Ingest – Scan folders and extract raw content
  2. Segment – Split content into logical chunks
  3. Enrich Documents – Generate metadata for full documents
  4. Enrich Chunks – Optionally enrich chunks (configurable mode)
  5. Embed Documents – Create vector embeddings for document-level search
  6. Embed Chunks – Create vector embeddings for chunk-level search
  7. Retrieve & Generate – Power semantic search and Q&A

The dashboard shows real-time progress, including which phase the worker is currently processing and estimated completion times for each stage.
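The stages above can be sketched as a simple sequential pipeline. This is an illustrative stand-in with stubbed functions and hypothetical names, not the project's actual worker code:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    path: str
    text: str = ""
    chunks: list = field(default_factory=list)      # segmented text
    metadata: dict = field(default_factory=dict)    # title, summary, tags
    embedding: list = field(default_factory=list)   # document-level vector

def ingest(path: str) -> Document:
    # Stage 1: scan a folder entry and extract raw content (stubbed here)
    return Document(path=path, text="raw extracted text")

def segment(doc: Document, size: int = 20) -> Document:
    # Stage 2: split content into fixed-size chunks (the real splitter
    # produces logical chunks, not fixed windows)
    doc.chunks = [doc.text[i:i + size] for i in range(0, len(doc.text), size)]
    return doc

def enrich(doc: Document) -> Document:
    # Stages 3-4: LLM-generated metadata (stubbed)
    doc.metadata = {"title": doc.path, "summary": doc.text[:20], "tags": []}
    return doc

def embed(doc: Document) -> Document:
    # Stages 5-6: vector embeddings (stubbed as a single length feature)
    doc.embedding = [float(len(doc.text))]
    return doc

def run_pipeline(path: str) -> Document:
    # Stage 7 (retrieve & generate) queries the stored vectors; not shown
    return embed(enrich(segment(ingest(path))))

doc = run_pipeline("/archive/example.txt")
```

In the real worker these stages run as separate phases over many documents, which is why the dashboard can report a per-phase status.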

Archive Brain Dashboard
Dashboard: pipeline status with real-time phase tracking

➑️ For details, see docs/architecture.md.


πŸ“¦ Supported File Types

| Type      | Extensions                            | Notes                    |
|-----------|---------------------------------------|--------------------------|
| Text      | .txt, .md, .html                      | Full text extraction     |
| PDF       | .pdf                                  | Text + OCR fallback      |
| Images    | .jpg, .png, .gif, .webp, .tiff        | OCR + vision descriptions|
| Documents | .docx                                 | Planned                  |
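Extraction dispatch by file extension can be sketched as below (handler names are illustrative; the real pipeline delegates to Apache Tika and Tesseract):

```python
from pathlib import Path

TEXT_EXTS = {".txt", ".md", ".html"}
IMAGE_EXTS = {".jpg", ".png", ".gif", ".webp", ".tiff"}

def extractor_for(path: str) -> str:
    # Pick an extraction strategy from the file extension,
    # mirroring the supported-types table above.
    ext = Path(path).suffix.lower()
    if ext in TEXT_EXTS:
        return "text"          # full text extraction
    if ext == ".pdf":
        return "pdf"           # text layer, with OCR fallback
    if ext in IMAGE_EXTS:
        return "ocr+vision"    # OCR plus vision description
    if ext == ".docx":
        raise NotImplementedError("docx support is planned")
    raise ValueError(f"unsupported file type: {ext}")

print(extractor_for("scan.TIFF"))  # -> ocr+vision
```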

πŸ–ΌοΈ Image Gallery & Analysis

Archive Brain includes a dedicated gallery view for browsing and analyzing images:

  • Grid and list views for browsing all extracted images
  • Lightbox viewer with full-resolution display
  • OCR text extraction from images
  • AI-powered descriptions using vision models (LLaVA)
  • On-demand analysis - generate descriptions for any image with a single click
  • Sortable views - sort by date, filename, or file size

Images are automatically extracted during ingestion and can be analyzed individually or in batch.
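On-demand analysis ultimately means sending the image to a vision model. Ollama's /api/generate endpoint accepts base64-encoded images for vision models such as LLaVA; the sketch below only builds the request payload (you would POST it to your Ollama server), and uses a stand-in file rather than a real image:

```python
import base64
import os
import tempfile

def vision_request(image_path: str, model: str = "llava") -> dict:
    # Build an Ollama /api/generate payload asking a vision model to
    # describe one image. The prompt text here is illustrative.
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "prompt": "Describe this image in two sentences.",
        "images": [encoded],
        "stream": False,
    }

# Demo with a stand-in file (a real call would use an actual image):
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
    tmp.write(b"\x89PNG fake bytes")
payload = vision_request(tmp.name)
os.unlink(tmp.name)
```

The model name should come from OLLAMA_VISION_MODEL in your configuration rather than being hard-coded.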


βš™οΈ Configuration

cp .env.example .env

Key settings:

  • OLLAMA_MODEL – Chat model for enrichment and Q&A
  • OLLAMA_EMBEDDING_MODEL – Embedding model for vector search
  • OLLAMA_VISION_MODEL – Vision model for image analysis
  • DB_PASSWORD – Database password

Source folders and file types are defined in:

config/config.yaml
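Values in .env reach the containers as environment variables, so application code can read them with ordinary lookups. A minimal sketch (the fallback model names here are illustrative, not the project's actual defaults):

```python
import os

# Illustrative fallbacks; the project's real defaults may differ.
DEFAULTS = {
    "OLLAMA_MODEL": "llama3",
    "OLLAMA_EMBEDDING_MODEL": "nomic-embed-text",
    "OLLAMA_VISION_MODEL": "llava",
}

def setting(name: str) -> str:
    # Prefer the environment variable (populated from .env),
    # falling back to a default when unset.
    return os.environ.get(name, DEFAULTS[name])

os.environ["OLLAMA_MODEL"] = "qwen2.5"  # as if set in your .env
print(setting("OLLAMA_MODEL"))
```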

Performance Tuning

Chunk Enrichment Mode

Control how chunks are processed to balance speed vs. metadata richness:

  • none – Skip chunk enrichment entirely (fastest)
  • embed_only – Only create embeddings, no LLM enrichment (recommended default)
  • full – Full LLM enrichment with titles, summaries, and tags (slowest)

Change this in Settings via the web UI or by calling the API. The embed_only mode provides excellent search quality while dramatically reducing processing time.
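The three modes trade work per chunk for metadata richness. A stand-in dispatch (stubbed enrich/embed steps; names are illustrative):

```python
def process_chunk(chunk: str, mode: str) -> dict:
    # Dispatch on the chunk-enrichment mode described above.
    result = {"text": chunk}
    if mode == "none":
        return result                              # fastest: raw text only
    result["embedding"] = [float(len(chunk))]      # stand-in for a real vector
    if mode == "full":
        # slowest: also call the LLM for titles, summaries, and tags
        result["metadata"] = {"title": chunk[:12], "tags": []}
    return result

print(process_chunk("hello world", "embed_only"))
```

With embed_only, chunks still get vectors (so chunk-level semantic search works) but skip the per-chunk LLM calls, which is where most of the processing time goes.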

Multi-Provider LLM Support

Archive Brain can distribute load across multiple LLM providers for faster processing:

  • Configure additional Ollama servers or cloud providers
  • Worker automatically balances requests across available providers
  • Improves throughput for large archives
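The balancing idea can be sketched as a simple round-robin pool (a stand-in for whatever scheduling the worker actually uses; the URLs are hypothetical):

```python
import itertools

class ProviderPool:
    """Hand out LLM provider endpoints in round-robin order."""

    def __init__(self, urls):
        self._cycle = itertools.cycle(urls)

    def next_provider(self) -> str:
        # Each request goes to the next provider in the rotation.
        return next(self._cycle)

pool = ProviderPool(["http://ollama-a:11434", "http://ollama-b:11434"])
assignments = [pool.next_provider() for _ in range(4)]
print(assignments)
```

A production balancer would also weight providers by speed and skip unhealthy ones, but round-robin already spreads a batch of enrichment requests across servers.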

πŸ–₯️ Deployment Options

Default (Recommended)

Self-contained Docker setup with Ollama included:

docker compose -f docker-compose.yml --profile prod up -d

NVIDIA GPU Acceleration

Requires NVIDIA Container Toolkit:

docker compose -f docker-compose.yml -f docker-compose.gpu.yml --profile prod up -d

External Ollama (Advanced)

Run Ollama on the host or another machine:

export OLLAMA_URL=http://host.docker.internal:11434
docker compose -f docker-compose.yml -f docker-compose.external-llm.yml --profile prod up -d

πŸ”„ Re-running & Iteration

  • Pipeline steps are designed to be idempotent
  • Re-running ingestion will skip unchanged files
  • Metadata and embeddings are reused when possible
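Skipping unchanged files typically comes down to change detection. One common approach is a content hash, sketched here (illustrative; the project's actual change detection may differ):

```python
import hashlib

def file_fingerprint(data: bytes) -> str:
    # A stable fingerprint of the file's content.
    return hashlib.sha256(data).hexdigest()

def needs_reingest(data: bytes, seen: dict, path: str) -> bool:
    # Skip files whose content hash is unchanged since the last run,
    # making repeated ingestion runs effectively idempotent.
    fp = file_fingerprint(data)
    if seen.get(path) == fp:
        return False
    seen[path] = fp
    return True

seen = {}
print(needs_reingest(b"hello", seen, "a.txt"))   # new file
print(needs_reingest(b"hello", seen, "a.txt"))   # unchanged: skipped
print(needs_reingest(b"hello!", seen, "a.txt"))  # content changed
```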

To reset everything:

docker compose -f docker-compose.yml --profile prod down -v
docker compose -f docker-compose.yml --profile prod up -d --build

🚧 Current Limitations

  • Single-user only - no multi-tenancy support
  • No authentication or access control
  • Not optimized for real-time ingestion - designed for batch processing
  • Large archives may require extended processing time
    • Use embed_only mode for faster processing
    • GPU acceleration recommended for 1M+ document archives
  • Worker cycles through phases - processes one type of task at a time (documents → chunks → embeddings)

🧬 Embeddings Visualization

Embeddings Visualization
Visualize document and chunk embeddings

πŸ› οΈ Tech Stack

  • PostgreSQL + pgvector
  • Python + FastAPI
  • React + Vite
  • Apache Tika, Tesseract OCR
  • Ollama (LLMs)
  • Docker Compose

πŸ“š Documentation


πŸ“„ License

MIT