Colonial Archives Graph-RAG

Source-grounded Q&A over colonial-era archive documents, powered by a knowledge graph and retrieval-augmented generation. Every answer traces back to specific document pages — zero tolerance for hallucination.

Hackathon Objectives

This project addresses all three objectives of the CO 273 Pipeline challenge, plus the bonus task.

Objective 1: High-Fidelity OCR & Correction

We use Google Cloud Document AI to produce accurate text from colonial-era scanned PDFs — far surpassing the legacy Gale OCR.

Capability	Implementation
OCR engine	Google Cloud Document AI
Batch processing	15-page batches with semaphore concurrency (5 parallel requests)
Text cleaning	Sliding-window chunking (450 tokens, 100 overlap), CJK language detection
Quality monitoring	Per-page confidence scores; pages below 0.5 flagged for review
Scale	28 PDFs ingested from CO 273 series (volumes 534, 550, 579) — architecture handles the full collection

The 9-step ingestion pipeline (PDF → OCR → chunk → embed → vector upsert → entity extraction → normalization → Neo4j MERGE → auto-classification) runs end-to-end, tested with real colonial archive data.

Key files: backend/app/services/ocr.py, backend/app/services/chunking.py, backend/app/routers/ingest.py

Design docs: docs/plans/2026-03-01-phase1-backend-foundation.md, docs/plans/2026-03-01-data-ingestion-and-integration.md

Objective 2: The "Kratoska Link" (Re-indexing)

We go beyond keyword indexing by building a full knowledge graph of entities and relationships extracted from the archive text, enabling researchers to discover connections between people, places, events, and institutions across the entire CO 273 collection.

Capability	Implementation
Entity extraction	Gemini 2.0 Flash structured JSON output per chunk
Entity normalization	Three-stage dedup: exact match → embedding similarity (0.85 threshold) → fuzzy string (rapidfuzz)
Graph database	Neo4j AuraDB with idempotent MERGE (re-ingestion safe)
Entity search	`GET /graph/search?q=Opium` — find entities with word-split fallback
Subgraph traversal	1–3 hop Cypher queries from seed entities, filtered by category
Category classification	5 predefined archive categories (General, Economic, Social, Internal, Defence) + Gemini auto-classification
Source traceability	Every entity stores `evidence_doc_id`, `evidence_page`, `evidence_text_span` — click any node to view the original PDF page

Data at scale: 1,463 entities, 6,843 relationships extracted from 28 PDFs, all 100% traceable to source pages.

The interactive two-state knowledge graph (Cytoscape.js + fcose layout) lets researchers visually explore the full archive on load, then see query-relevant subgraphs after asking a question. Nodes are sized by connection count, colored by category, and clickable — opening the source document at the exact evidence page.

Key files: backend/app/services/entity_extraction.py, backend/app/services/entity_normalization.py, backend/app/services/neo4j_service.py, frontend/src/components/GraphCanvas.tsx

Design docs: docs/plans/2026-03-01-colonial-archives-graph-rag-design.md, docs/plans/2026-03-02-graph-visualization-overhaul.md

Objective 3: The Semantic Historian (RAG & Summarization)

The chatbot uses an archive-first retrieval pipeline — answers are generated strictly from colonial documents, with web fallback only when the archive cannot answer (clearly marked with a disclaimer).

Capability	Implementation
Hybrid retrieval	Parallel vector search (Vertex AI Vector Search, text-embedding-004) + graph traversal (Neo4j)
LLM generation	Gemini 2.0 Flash, temperature 0.1, archive-only prompt
Citation system	`[archive:N]` markers → clickable badges → PDF viewer at exact source page
Web fallback	Tavily search, triggered only when archive LLM returns fallback — prefixed with disclaimer
Full-doc retrieval	"Show me CO 273/550/18" → bypasses vector search, fetches full OCR text directly
Scoring	`(1 - cosine_distance) * 0.6 + graph_hit_ratio * 0.4` — archive context ranked by relevance
Category filtering	Optional multi-select filter restricts search to specific archive categories

No LangChain or LlamaIndex — all retrieval orchestration uses direct SDK calls for full traceability and control. Every answer includes numbered citations linking back to specific document pages, viewable in an in-browser PDF viewer (pdf.js with signed URLs).

Key files: backend/app/services/hybrid_retrieval.py, backend/app/services/llm.py, backend/app/services/vector_search.py, frontend/src/components/ChatPanel.tsx, frontend/src/components/PdfModal.tsx

Design docs: docs/plans/2026-03-01-archive-first-query.md, docs/plans/2026-03-01-query-pipeline-fix.md, docs/plans/2026-03-02-full-document-retrieval.md

Bonus: Commodities and Capitalism

We started by broadly classifying all ingested documents into five thematic categories aligned with the hackathon themes — visible as the color-coded legend in the knowledge graph:

Category	Theme Mapping	Data Share
Economic and Financial	Commodities and Capitalism	~70%
General and Establishment	Colonialism, Race and Politics	~30% (combined across 4 themes)
Social Services	Gender and Sexuality
Internal Relations and Research	Popular Culture
Defence and Military	Ecology and Other Animals

From this broad classification, we specialized — 70% of our data comes from the theme of Commodities and Capitalism, with 30% drawn from the other four themes. This concentration reflects the CO 273 collection's heavy focus on trade, revenue, and economic administration in the Straits Settlements.

Our specialization is supported by secondary source ingestion: we incorporated Trocki's Opium and Empire and Huff's The Economic Growth of Singapore (see document_categories.json) alongside the primary CO 273 correspondence, enabling the chatbot to contextualize colonial economic documents against modern historiography.

What researchers can do:

Filter by category — the graph legend toggles category visibility, letting researchers isolate Economic/Financial nodes to explore trade networks, opium revenue, port administration
Explore entity connections — the graph reveals cross-document relationships (e.g., opium trade networks linking merchants, ports, and policies across volumes)
Trace evidence chains — every entity and relationship is source-grounded, enabling historians to verify AI-discovered connections against the original handwritten documents
Visual cluster analysis — the fcose-clustered overview graph shows thematic groupings by color, revealing structural patterns in the colonial correspondence

The warm archival design theme (Crimson Pro typography, stone/ink color palette, Merlion logo) was chosen to respect the historical character of the material while providing a modern research interface.

Key files: backend/app/config/document_categories.json, backend/app/services/auto_classification.py

Design docs: docs/plans/2026-03-01-frontend-design-refinement.md, docs/plans/2026-03-02-node-source-document-link.md

Architecture

┌─────────────┐     ┌──────────────────────────────────────────────┐
│   React UI  │────▶│  FastAPI Backend                             │
│  Cytoscape  │     │                                              │
│  PDF.js     │     │  ┌─────────┐  ┌──────────┐  ┌────────────┐  │
│  Zustand    │     │  │ Vertex  │  │  Neo4j   │  │ Document   │  │
└─────────────┘     │  │ AI Vec  │  │ AuraDB   │  │ AI OCR     │  │
                    │  │ Search  │  │ (Graph)  │  │            │  │
                    │  └─────────┘  └──────────┘  └────────────┘  │
                    │  ┌─────────┐  ┌──────────┐  ┌────────────┐  │
                    │  │ Gemini  │  │  Cloud   │  │  Tavily    │  │
                    │  │ 2.0     │  │  Storage │  │  (Web)     │  │
                    │  │ Flash   │  │  (GCS)   │  │            │  │
                    │  └─────────┘  └──────────┘  └────────────┘  │
                    └──────────────────────────────────────────────┘

Stack: FastAPI (Python 3.11) · React 19 · TypeScript · Cytoscape.js · Tailwind CSS · Google Cloud (Document AI, Vertex AI, Cloud Storage, Vector Search) · Neo4j AuraDB · Gemini 2.0 Flash

No LangChain or LlamaIndex — direct SDK calls for full traceability and control.

Data Pipeline

The ingestion pipeline processes archive PDFs in 9 steps:

PDF upload — from Google Cloud Storage
OCR — Document AI with page batching (15/batch) and semaphore concurrency
Text cleaning + chunking — sliding window (450 tokens, 100 overlap), CJK language detection
Embedding — Vertex AI text-embedding-004, batch size 250
Vector upsert — Vertex AI Vector Search
Entity extraction — Gemini structured JSON output per chunk
Entity normalization — three-stage dedup (exact, embedding similarity, fuzzy string)
Graph MERGE — idempotent upsert into Neo4j
Auto-classification — Gemini-based category assignment for unmapped PDFs

Steps 7–9 (graph) are non-blocking — vector ingestion succeeds even if the graph pipeline fails.

Getting Started

Prerequisites

Python 3.11+
Node.js 18+
GCP project with Document AI, Vertex AI, Cloud Storage enabled
Neo4j AuraDB instance (free tier sufficient)
Tavily API key (for web search fallback)

Environment Setup

cp backend/.env.example backend/.env
# Fill in all PLACEHOLDER_* values

Required environment variables:

Variable	Description
`GCP_PROJECT_ID`	Google Cloud project ID
`GCP_REGION`	Primary region (e.g. `asia-southeast1`)
`DOC_AI_PROCESSOR_ID`	Document AI OCR processor ID
`CLOUD_STORAGE_BUCKET`	GCS bucket containing archive PDFs
`VECTOR_SEARCH_ENDPOINT`	Vertex AI Vector Search endpoint
`VECTOR_SEARCH_INDEX_ID`	Vector Search index ID
`VECTOR_SEARCH_DEPLOYED_INDEX_ID`	Deployed index ID
`NEO4J_URI`	Neo4j connection URI (`neo4j+s://...`)
`NEO4J_USER`	Neo4j username
`NEO4J_PASSWORD`	Neo4j password
`TAVILY_API_KEY`	Tavily web search API key

Running Locally

Backend:

cd backend
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8080
# Swagger docs at http://localhost:8080/docs

Frontend:

cd frontend
npm install
npm run dev

Running with Docker

cd infra
docker-compose up --build
# Requires GCP credentials — see backend/.env.example

API

Method	Path	Description
`POST`	`/ingest_pdf`	Ingest a PDF (OCR + graph + classification)
`GET`	`/ingest_status/{job_id}`	Check ingestion job status
`POST`	`/query`	Ask a question (archive-first + web fallback)
`GET`	`/document/signed_url`	Get a signed URL for a source PDF
`GET`	`/document/{doc_id}/text`	Get OCR text for specific pages
`GET`	`/graph/overview`	Full knowledge graph (5-min cache)
`GET`	`/graph/search`	Search entities by name
`GET`	`/graph/{canonical_id}`	Get subgraph for an entity
`GET`	`/admin/documents`	List ingested documents
`GET`	`/admin/documents/{doc_id}/ocr`	OCR quality report
`GET`	`/health`	Health check (includes Neo4j status)

Testing

# Backend (85 tests)
cd backend
python -m pytest tests/ -v

# Frontend (45 tests)
cd frontend
npx vitest run

Project Structure

backend/
├── app/
│   ├── main.py              # FastAPI app, lifespan, CORS
│   ├── config/
│   │   ├── settings.py      # Pydantic Settings from .env
│   │   └── document_categories.json
│   ├── models/
│   │   └── schemas.py       # All Pydantic v2 models
│   ├── routers/             # Thin HTTP layer
│   │   ├── ingest.py
│   │   ├── query.py
│   │   ├── graph.py
│   │   └── admin.py
│   └── services/            # Business logic
│       ├── ocr.py
│       ├── chunking.py
│       ├── embeddings.py
│       ├── vector_search.py
│       ├── llm.py
│       ├── entity_extraction.py
│       ├── entity_normalization.py
│       ├── neo4j_service.py
│       ├── hybrid_retrieval.py
│       ├── web_search.py
│       └── storage.py
├── tests/
docs/
└── plans/                   # Design documents and implementation plans
dev/
└── active/                  # Multi-phase team context (plan + tasks per phase)
frontend/
├── src/
│   ├── App.tsx
│   ├── api/client.ts
│   ├── components/
│   │   ├── ChatPanel.tsx
│   │   ├── GraphCanvas.tsx
│   │   ├── GraphLegend.tsx
│   │   ├── NodeSidebar.tsx
│   │   ├── PdfModal.tsx
│   │   └── AdminPanel.tsx
│   ├── stores/useAppStore.ts
│   └── types/index.ts
└── package.json
infra/
└── docker-compose.yml

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
.github/workflows		.github/workflows
backend		backend
frontend		frontend
infra		infra
.gitignore		.gitignore
PORTFOLIO.md		PORTFOLIO.md
README.md		README.md
cloudbuild.yaml		cloudbuild.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Colonial Archives Graph-RAG

Hackathon Objectives

Objective 1: High-Fidelity OCR & Correction

Objective 2: The "Kratoska Link" (Re-indexing)

Objective 3: The Semantic Historian (RAG & Summarization)

Bonus: Commodities and Capitalism

Architecture

Data Pipeline

Getting Started

Prerequisites

Environment Setup

Running Locally

Running with Docker

API

Testing

Project Structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Colonial Archives Graph-RAG

Hackathon Objectives

Objective 1: High-Fidelity OCR & Correction

Objective 2: The "Kratoska Link" (Re-indexing)

Objective 3: The Semantic Historian (RAG & Summarization)

Bonus: Commodities and Capitalism

Architecture

Data Pipeline

Getting Started

Prerequisites

Environment Setup

Running Locally

Running with Docker

API

Testing

Project Structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages