PastPortals v2: AI-Powered Multimodal CRAG System for Cultural Heritage Interpretation

🏛️ Project Overview

PastPortals is an intelligent, AI-powered museum guide system developed to address the limitations of traditional and existing digital museum information systems. The platform integrates Correction + Retrieval-Augmented Generation (CRAG), natural language processing, multimodal interaction, vector-based retrieval, voice-first conversational AI, and self-improving feedback loops to deliver accurate, context-aware, and engaging cultural heritage experiences.

🎯 Problem Statement & Motivation

❌ Limitations of Existing Systems

Traditional museum systems rely on static methods such as printed labels, brochures, and audio guides, while existing digital information systems suffer from:

Hallucination: Generation of inaccurate or unsupported information
Lack of domain grounding: Insufficient knowledge of historical and cultural contexts
No transparency: Inability to verify information sources
Limited multimodal support: Restricted to single interaction modes
Poor scalability: Inefficient handling of simultaneous visitors

✅ Proposed Solution

PastPortals implements Correction + Retrieval-Augmented Generation (CRAG) to bridge this gap by:

Retrieving verified information from curated knowledge bases
Validating and correcting generated content through fact-checking mechanisms
Supporting multimodal interaction (text, voice, image, video)
Enabling voice-first conversational AI for hands-free cultural exploration
Implementing intelligent feedback loops that refine system behavior with each user interaction
Enabling multilingual communication across 18+ languages
Ensuring scalability and continuous improvement for high-traffic museum environments
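To make the validate-and-correct idea concrete, here is a minimal sketch of a lexical support check: each generated sentence is scored by the fraction of its content words that also appear in the retrieved passages, and sentences below a threshold are flagged for correction. The function `flag_unsupported_sentences`, the stopword list, and the 0.6 threshold are hypothetical choices for illustration. The project's actual CRAG module is described in CRAG_ARCHITECTURE_AND_PIPELINE_FLOW.md and may use a different technique (for example embeddings or an LLM judge).

```python
import re

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "to", "was", "is",
             "were", "by", "as", "for", "with", "its", "it", "that"}


def content_words(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def flag_unsupported_sentences(answer: str, passages: list[str],
                               threshold: float = 0.6) -> list[tuple[str, float]]:
    """Return (sentence, support) pairs whose lexical support is below threshold.

    Support is the share of a sentence's content words found anywhere in the
    retrieved passages. This is a crude lexical proxy, not a semantic fact check.
    """
    evidence = set().union(*(content_words(p) for p in passages))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if not words:
            continue  # skip empty fragments
        support = len(words & evidence) / len(words)
        if support < threshold:
            flagged.append((sentence, round(support, 2)))
    return flagged
```

For example, with the passage "The Colosseum in Rome was completed in 80 CE under Emperor Titus.", the answer "The Colosseum was completed in 80 CE. It was designed by Leonardo da Vinci." yields one flagged sentence, the unsupported claim about Leonardo da Vinci, with support 0.0.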

🏗️ Technical Architecture

📊 System Components

┌─────────────────────────────────────────────────────────────────────┐
│                       Frontend Application Layer                    │
│              React 18 | Document Upload | Voice Interface           │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ HTTP/REST API
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     Content Processing Layer                        │
│        Document Extraction | OCR | Video Analysis | Voice           │
│        (PyMuPDF | python-docx | Tesseract | OpenCV)                 │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ Extracted Content + Metadata
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                   Retrieval & Ranking Layer                         │
│              FAISS Vector Search | Wikipedia API                    │
│              Historical Content Classification                      │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ Contextually Relevant Information
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                  Generation & Response Layer                        │
│        Google Gemini 2.5 Flash | Fallback Enrichment                │
│              Fact Validation | Response Synthesis                   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ Generated Response with Metadata
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Response Delivery Layer                          │
│         Markdown Rendering | Audio Output | Related Topics          │
└─────────────────────────────────────────────────────────────────────┘

🛠️ Technology Stack

Frontend Architecture 🎨

Component	Technology	Purpose
UI Framework	React 18.2	Component-based interface with virtual DOM rendering
Routing	React Router 6	Client-side routing and navigation
Animations	Framer Motion	Smooth transitions and interactive UI elements
Icons	Lucide React	Comprehensive, accessible icon system
HTTP Client	Axios	RESTful API communication with request/response interceptors
Testing	Jest + React Testing Library	40+ component tests with coverage reporting

Backend Infrastructure ⚙️

Component	Technology	Purpose
API Framework	FastAPI	High-performance async REST API framework
Language	Python 3.13	Primary backend language with modern features
Async Processing	asyncio + uvicorn	Non-blocking concurrent request handling
Testing	pytest	50+ unit tests with comprehensive coverage
API Documentation	Pydantic + Swagger	Auto-generated interactive API documentation

Content Processing & Extraction 📄

Component	Technology	Purpose
PDF Extraction	PyMuPDF (fitz)	High-fidelity text and metadata extraction
Word Documents	python-docx	Structured parsing of DOCX format
Optical Character Recognition	pytesseract + Tesseract	Text extraction from images and scanned documents
Video Analysis	OpenCV (cv2)	Frame sampling and temporal processing (8 frames/video)
Voice Processing	Web Speech API	Real-time speech-to-text transcription
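The "8 frames/video" sampling can be sketched as choosing evenly spaced frame indices across the clip. The helper `sample_frame_indices` below is hypothetical and pure Python; the commented OpenCV calls show where the indices would plausibly be used, assuming the project seeks to each frame before running OCR.

```python
def sample_frame_indices(total_frames: int, num_samples: int = 8) -> list[int]:
    """Return up to num_samples evenly spaced frame indices in [0, total_frames).

    Each index is the midpoint of one of num_samples equal-width segments, which
    avoids the very first and last frames (often black or title cards).
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if total_frames <= num_samples:
        return list(range(total_frames))  # short clip: take every frame
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


# Hypothetical usage with OpenCV (requires opencv-python):
#   cap = cv2.VideoCapture("clip.mp4")
#   total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
#   for i in sample_frame_indices(total):
#       cap.set(cv2.CAP_PROP_POS_FRAMES, i)
#       ok, frame = cap.read()  # frame would then go to Tesseract OCR
```

For an 80-frame clip this returns `[5, 15, 25, 35, 45, 55, 65, 75]`. Midpoint sampling is one reasonable choice; uniform spacing from frame 0 or scene-change detection are common alternatives.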

AI/ML & Generation Layer 🧠

Component	Technology	Purpose
LLM Generation	Google Gemini 2.5 Flash	Advanced language generation with low latency
Retrieval-Augmented Generation	CRAG (Correction Module)	Fact validation and hallucination correction
Vector Similarity	FAISS	Fast approximate nearest neighbor search
Sentence Embeddings	Sentence Transformers	Dense vector representation of content
Domain Classification	Historical Keyword Analysis	Context-aware content categorization
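"Historical Keyword Analysis" is not specified further here. As a minimal sketch, assume the classifier counts matches against per-domain keyword lists and returns the best domain with a confidence score. The domain names, keyword sets, and `classify_domain` function are hypothetical.

```python
import re

# Hypothetical domain vocabularies; a real deployment would curate these per museum.
DOMAIN_KEYWORDS = {
    "ancient_rome": {"roman", "rome", "caesar", "empire", "senate", "legion", "colosseum"},
    "ancient_egypt": {"pharaoh", "pyramid", "egypt", "hieroglyph", "nile", "mummy"},
    "medieval_europe": {"castle", "knight", "feudal", "crusade", "monastery", "medieval"},
}


def classify_domain(query: str, min_hits: int = 1) -> tuple[str, float]:
    """Return (domain, confidence), or ("general", 0.0) when too few keywords match.

    Confidence is the winning domain's share of all keyword hits in the query.
    """
    tokens = re.findall(r"[a-z]+", query.lower())
    hits = {d: sum(t in kws for t in tokens) for d, kws in DOMAIN_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    total = sum(hits.values())
    if total == 0 or hits[best] < min_hits:
        return "general", 0.0
    return best, hits[best] / total
```

A query such as "Roman legion near a pyramid" matches two Rome keywords and one Egypt keyword, so it is classified as `ancient_rome` with confidence 2/3; that confidence is the kind of value a threshold could be tuned on using feedback.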

Voice-First Conversational AI 🎙️

Component	Technology	Purpose
Speech-to-Text	Google Cloud Speech-to-Text / Web Speech API	Multilingual voice input processing
Natural Language Understanding	LLM + RAG Pipeline	Intent extraction and query comprehension
Text-to-Speech	Google Cloud Text-to-Speech	Natural-sounding response delivery
Voice Assistant Framework	Custom voice conversation bot	Context-aware dialogue management
Real-time Streaming	WebSocket support	Continuous, low-latency voice interaction

Data & Retrieval Systems 📚

Component	Technology	Purpose
Vector Database	FAISS with in-memory indexing	Millisecond-level similarity search
Knowledge Bases	Wikipedia API + Smithsonian Open Access	Curated historical content retrieval
Domain Datasets	Custom museum collections	Institution-specific artifact metadata
Feedback Storage	JSON + structured logs	User interaction tracking for improvement
Cache Layer	Redis (optional)	Response caching and session management
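FAISS performs the similarity search in this stack. To show what that search computes, here is a dependency-free sketch of exact cosine search: vectors are normalised to unit length and ranked by inner product, which matches what a FAISS flat inner-product index (`IndexFlatIP`) returns when the vectors are normalised before indexing. The toy 3-dimensional vectors stand in for Sentence Transformers embeddings, and `normalize`/`top_k` are hypothetical names.

```python
import heapq
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so the inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def top_k(query: list[float], corpus: dict[str, list[float]],
          k: int = 3) -> list[tuple[str, float]]:
    """Exact (brute-force) cosine search returning the k best (doc_id, score) pairs."""
    q = normalize(query)
    scores = (
        (doc_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for doc_id, vec in corpus.items()
    )
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])


corpus = {
    "colosseum": [0.9, 0.1, 0.0],
    "pyramids": [0.0, 0.2, 0.95],
    "forum": [0.8, 0.3, 0.1],
}
print(top_k([1.0, 0.2, 0.0], corpus, k=2))  # colosseum ranks first, then forum
```

Brute force is linear in corpus size; FAISS adds optimised exact search and approximate indexes that keep latency low as collections grow.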

Intelligent Feedback Loop System 🔄

Component	Technology	Purpose
User Interaction Tracking	Event logging pipeline	Capture queries, dwell time, user ratings
Feedback Collection	Implicit + explicit signals	Track relevance, accuracy, and satisfaction
Vector Similarity Refinement	Weight adjustment algorithms	Dynamically tune ranking for domain-specific queries
Model Adaptation	Online learning mechanisms	Continuous improvement of retrieval quality
Performance Monitoring	Metrics & analytics dashboard	Track system improvement across sessions

🎓 System Objectives

The development of PastPortals targets the following key objectives:

Accuracy & Reliability: Ground all responses in trusted, curated datasets with fact-checking mechanisms to reduce hallucination and enhance credibility
Voice-First Interaction: Enable seamless voice-based conversational interfaces for hands-free cultural exploration with natural language understanding
Continuous Self-Improvement: Implement intelligent feedback loops that refine retrieval ranking, response quality, and domain understanding from every user interaction
Multilingual Support: Provide accessibility across 18+ languages for diverse visitor populations with cultural context preservation
Multimodal Delivery: Process and respond to diverse input modalities (text, voice, image, video) while delivering content in preferred formats
Scalability & Performance: Handle multiple concurrent users without degradation using FastAPI async architecture
Accessibility & Cultural Sensitivity: Maintain authenticity in heritage interpretation while supporting diverse learning styles and accessibility requirements

⭐ Core Features

✨ Current Implementation (v2.0)

Feature	Implementation
Document Processing	PDF, DOCX, TXT, MD, JSON, CSV, HTML extraction
Image Recognition	Tesseract-based OCR for photographic content
Video Analysis	Frame sampling with temporal OCR processing
Voice Interaction	WebRTC recording + transcription pipeline
Unified API	Single endpoint supporting all input modalities
Progress Tracking	Real-time upload status visualization (0-100%)
Content Validation	Format and size limit enforcement with user feedback
Fallback Responses	Wikipedia-enriched responses for API unavailability
Comprehensive Testing	50+ backend + 40+ frontend unit tests
Museum Integration	Curated museum data and virtual tour content

📁 File Processing Specifications

Category	Maximum Size	Supported Formats
Documents	50 MB	PDF, DOCX, TXT, MD, CSV, JSON, HTML, HTM
Images	25 MB	PNG, JPG, JPEG, WEBP, BMP, TIFF, TIF
Video	500 MB	MP4, MOV, AVI, MKV, WEBM, M4V
Voice	N/A	Real-time recording via WebRTC
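The limits above map naturally onto a lookup table. The sketch below shows one way to enforce them server-side. `ALLOWED` mirrors the specification table, assuming 1 MB = 1,048,576 bytes; `validate_upload` and its error messages are hypothetical.

```python
import os

MB = 1024 * 1024  # assumption: binary megabytes

# Category -> (max size in bytes, allowed extensions), mirroring the table above.
ALLOWED = {
    "document": (50 * MB, {".pdf", ".docx", ".txt", ".md", ".csv", ".json", ".html", ".htm"}),
    "image": (25 * MB, {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tiff", ".tif"}),
    "video": (500 * MB, {".mp4", ".mov", ".avi", ".mkv", ".webm", ".m4v"}),
}


def validate_upload(filename: str, size_bytes: int) -> str:
    """Return the file's category, or raise ValueError if its type or size is not allowed."""
    ext = os.path.splitext(filename)[1].lower()  # case-insensitive extension match
    for category, (max_bytes, extensions) in ALLOWED.items():
        if ext in extensions:
            if size_bytes > max_bytes:
                raise ValueError(
                    f"{filename}: {size_bytes} bytes exceeds the "
                    f"{max_bytes // MB} MB {category} limit"
                )
            return category
    raise ValueError(f"{filename}: unsupported file type '{ext}'")
```

Returning the category lets the caller pick the extraction path (document parser, OCR, or frame sampling) from the same check.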

🔄 Intelligent Self-Improving Feedback Loop System

Every user interaction represents an opportunity for system learning. PastPortals v2 incorporates a sophisticated feedback pipeline that continuously refines retrieval accuracy, response relevance, and domain understanding.

🔀 Feedback Mechanism Architecture

Stage 1: User Feedback Captured

Explicit ratings and implicit signals (re-queries, dwell time) logged per interaction
Domain context stored with each query-response pair
User satisfaction metrics tracked across museum exhibition types
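Stage 1 could be implemented as an append-only JSON Lines log, consistent with the "JSON + structured logs" feedback storage listed in the Data & Retrieval table. The field names in `FeedbackEvent` are assumptions chosen to match the signals described (explicit rating, re-query, dwell time, domain context), not the project's actual schema.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class FeedbackEvent:
    """One query-response interaction plus its feedback signals (hypothetical schema)."""
    query: str
    response_id: str
    domain: str                   # domain context stored with the query-response pair
    rating: Optional[int] = None  # explicit signal, e.g. 1-5 stars; None if not given
    requeried: bool = False       # implicit signal: visitor rephrased the question
    dwell_seconds: float = 0.0    # implicit signal: time spent on the answer
    timestamp: float = field(default_factory=time.time)


def log_event(event: FeedbackEvent, path: str) -> None:
    """Append the event as one JSON object per line (JSON Lines)."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(event)) + "\n")


def read_events(path: str) -> list[FeedbackEvent]:
    """Load all logged events back into FeedbackEvent instances."""
    with open(path, encoding="utf-8") as fh:
        return [FeedbackEvent(**json.loads(line)) for line in fh if line.strip()]
```

An append-only file keeps logging cheap on the request path; aggregation for ranking updates can then run offline over the accumulated lines.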

Stage 2: Ranking Model Updated

Feedback dynamically adjusts vector similarity weights
Domain classifier confidence thresholds refined based on user validation
Historical accuracy data incorporated into retrieval ranking
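Stage 2 describes feedback adjusting similarity weights. One simple way to realise this, sketched under our own assumptions, is to keep a per-document feedback score as an exponential moving average of normalised ratings and add a small multiple of it to the vector similarity at ranking time. `update_weight`, `rerank`, the smoothing factor `alpha`, and the blend factor `beta` are hypothetical; the project's actual weight-adjustment algorithm is not described here.

```python
def update_weight(current: float, rating: int, alpha: float = 0.2) -> float:
    """Move a document's feedback score toward the latest rating.

    rating is on a 1-5 scale and is mapped to [-1, 1]; alpha controls how quickly
    new feedback overrides old (exponential moving average).
    """
    signal = (rating - 3) / 2
    return (1 - alpha) * current + alpha * signal


def rerank(candidates: list[tuple[str, float]], weights: dict[str, float],
           beta: float = 0.1) -> list[tuple[str, float]]:
    """Blend similarity with feedback: score = similarity + beta * feedback_weight."""
    blended = [(doc_id, sim + beta * weights.get(doc_id, 0.0)) for doc_id, sim in candidates]
    return sorted(blended, key=lambda pair: pair[1], reverse=True)


weights = {"forum": update_weight(0.0, rating=5), "colosseum": update_weight(0.0, rating=1)}
print(rerank([("colosseum", 0.90), ("forum", 0.88)], weights))  # forum now ranks first
```

Keeping `beta` small means feedback can reorder near-ties without overriding strong semantic matches.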

Stage 3: System Evolution

Retrieval and ranking quality improve incrementally as feedback accumulates across sessions
Adaptive behavior emerges from aggregated feedback signals
Cultural context understanding deepens through continuous learning

🎯 Key Benefits

Adaptive Responses: Museum guides learn visitor preferences and knowledge levels
Domain Refinement: Historical accuracy improves through expert feedback integration
Personalization: Interaction quality increases for returning visitors
Continuous Validation: User corrections automatically retrain ranking models

🎤 Voice-First Conversational AI Bot

PastPortals v2 delivers a seamless, hands-free cultural exploration experience through intelligent voice-first conversational AI.

A museum guide that listens, reasons, and speaks back in real time, turning every artifact into a conversation instead of a static label.

✨ Why It Stands Out

Hands-free discovery: Ask questions naturally and get spoken answers without typing or navigating menus.
Grounded responses: Every reply is filtered through CRAG so the assistant stays accurate, contextual, and museum-ready.
Multilingual conversations: Visitors can interact in 18+ languages, making the experience accessible and global.

🗣️ Core Voice Features

Feature	Technology	Implementation
Speech-to-Text Input	Google Cloud Speech-to-Text / Web Speech API	Converts user voice into text queries in real-time
AI Understanding	LLM + RAG + CRAG Pipeline	Processes natural language intent with cultural context
Text-to-Speech Output	Google Cloud Text-to-Speech	Delivers responses as natural, human-like voice
Real-Time Interaction	WebSocket streaming protocol	Low-latency conversational feedback
Context-Aware Dialogue	Domain-aware conversation state	Adapts responses based on museum location and artifact
Multilingual Support	18+ language voice processing	Multilingual interactions for international visitors

🔬 Technical Stack for Voice AI

Voice Input: Web Speech API + Whisper transcription
Voice Processing: TensorFlow Lite for on-device optimization
Response Generation: Gemini 2.5 Flash with domain context
Voice Output: Google Cloud TTS with natural prosody
Conversation Management: State machine for dialogue flow
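The "state machine for dialogue flow" could look something like the minimal sketch below. The states and event names (`wake`, `speech_end`, `barge_in`, and so on) are assumptions based on the voice loop described above (speech-to-text, CRAG-grounded generation, text-to-speech); they are not taken from the project's code.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    PROCESSING = auto()
    SPEAKING = auto()


# (current state, event) -> next state; pairs not listed are ignored.
TRANSITIONS = {
    (State.IDLE, "wake"): State.LISTENING,
    (State.LISTENING, "speech_end"): State.PROCESSING,   # transcript goes to the RAG/CRAG pipeline
    (State.LISTENING, "timeout"): State.IDLE,
    (State.PROCESSING, "answer_ready"): State.SPEAKING,  # answer text goes to text-to-speech
    (State.PROCESSING, "error"): State.IDLE,
    (State.SPEAKING, "barge_in"): State.LISTENING,       # visitor interrupts playback
    (State.SPEAKING, "done"): State.LISTENING,           # keep the conversation open
}


class DialogueManager:
    """Tracks where the voice conversation is and applies transition events."""

    def __init__(self) -> None:
        self.state = State.IDLE

    def handle(self, event: str) -> State:
        """Apply an event; unknown (state, event) pairs leave the state unchanged."""
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state
```

A table-driven design keeps every allowed transition in one place, which makes it easy to test that, for example, a stray `speech_end` while idle is ignored.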

🌊 Data Flow: Multimodal Intelligent Pipeline

PastPortals v2 implements an end-to-end pipeline from diverse user inputs to verified outputs, refining itself continuously through feedback.

⚡ Processing Pipeline

User Input Acquisition → Text, Voice, Image, or Video submission
Multimodal Processing → Speech-to-Text, OCR, Frame Extraction, Document Parsing
Domain Classification → Historical/cultural context detection
Vector Retrieval → FAISS semantic search of curated knowledge bases
LLM Generation → Google Gemini 2.5 Flash response synthesis
Fact Validation → CRAG correction module validates accuracy
Output Delivery → Markdown-formatted response + voice synthesis
Feedback Collection → User interaction logged for continuous improvement
System Refinement → Ranking and understanding models updated
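The steps above can be read as a chain of functions. The skeleton below wires them together with trivial stand-ins so the control flow is runnable. Every function name and stub behaviour is hypothetical; in the real system each stub would call the corresponding component (speech-to-text or OCR, the domain classifier, FAISS, Gemini 2.5 Flash, the CRAG module, the feedback store).

```python
from dataclasses import dataclass, field


@dataclass
class PipelineResult:
    query: str
    domain: str
    passages: list[str]
    response: str
    validated: bool
    notes: list[str] = field(default_factory=list)


# --- Hypothetical stand-ins for the real components ---
def extract_text(raw: str, mode: str) -> str:  # step 2: STT / OCR / frames / parsing
    return raw.strip()


def classify(text: str) -> str:  # step 3: domain classification
    return "ancient_rome" if "rome" in text.lower() else "general"


def retrieve(text: str, domain: str) -> list[str]:  # step 4: vector retrieval
    kb = {"ancient_rome": ["Rome's Colosseum was completed in 80 CE."]}
    return kb.get(domain, [])


def generate(text: str, passages: list[str]) -> str:  # step 5: LLM generation
    return passages[0] if passages else "I could not find a grounded answer."


def validate(response: str, passages: list[str]) -> bool:  # step 6: CRAG validation
    return any(response in p for p in passages)


def run_pipeline(raw: str, mode: str, feedback_log: list[dict]) -> PipelineResult:
    text = extract_text(raw, mode)
    domain = classify(text)
    passages = retrieve(text, domain)
    response = generate(text, passages)
    ok = validate(response, passages)
    result = PipelineResult(text, domain, passages, response, ok)  # step 7: delivered to the client
    if not ok:
        result.notes.append("Response not supported by retrieved passages; flagged for correction")
    # Steps 8-9: record the interaction so ranking can be refined later.
    feedback_log.append({"query": text, "domain": domain, "validated": ok})
    return result
```

Keeping each stage behind its own function makes it possible to swap a stub for the real component and test the stages in isolation.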

📋 System Requirements

Node.js: v16 or higher
Python: v3.10 or higher
Tesseract OCR: System-level installation required
Virtual Environment: Python venv or equivalent

💻 Development Setup

# Create and activate a virtual environment
python -m venv .venv
& .venv\Scripts\Activate.ps1      # Windows PowerShell
# source .venv/bin/activate       # macOS / Linux

# Install backend dependencies
pip install -r backend/requirements.txt

# Install frontend dependencies
cd frontend
npm install

# Configure environment variables
# Root .env file:
# GEMINI_API_KEY=your_api_key
# CORS_ORIGINS=http://localhost:3001

# frontend/.env file:
# PORT=3001
# REACT_APP_API_URL=http://localhost:5000
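On the backend, the root `.env` values have to reach the application. A minimal standard-library sketch, assuming the variables are already exported into the process environment (for example by `python-dotenv` or the shell) and that `CORS_ORIGINS` is a comma-separated list; `Settings`, `load_settings`, and the `http://localhost:3001` default are hypothetical.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    cors_origins: tuple[str, ...]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from environment variables, failing fast if the API key is missing."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to the root .env file")
    raw = env.get("CORS_ORIGINS", "http://localhost:3001")  # assumed default: dev frontend
    origins = tuple(o.strip() for o in raw.split(",") if o.strip())
    return Settings(gemini_api_key=key, cors_origins=origins)
```

The resulting `cors_origins` tuple is the kind of value FastAPI's `CORSMiddleware` accepts through its `allow_origins` parameter.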

🚀 Running the Application

Terminal 1 - Backend Server:

cd backend
# FastAPI server (async support for concurrent requests)
uvicorn app:app --reload --port 5000

# Or using Python directly (if configured)
python app.py
# Server runs on http://localhost:5000 with auto-generated docs at http://localhost:5000/docs

Terminal 2 - Frontend Application:

cd frontend
npm start
# Application accessible at http://localhost:3001

Navigate to http://localhost:3001/multimodal to access the multimodal input interface.

✅ Testing & Quality Assurance

👾 Backend Testing

# Run all backend tests with the project virtualenv's interpreter (macOS/Linux path; on Windows use .venv\Scripts\python.exe)
./.venv/bin/python -m pytest -q backend

# Generate coverage report (requires the pytest-cov plugin)
./.venv/bin/python -m pytest -q backend --cov=backend.utils --cov=backend.routes --cov-report=html

# Test specific modules
./.venv/bin/python -m pytest -q backend/tests/test_multimodal_utils.py
./.venv/bin/python -m pytest -q backend/tests/test_multimodal_routes.py

If you already have the virtualenv activated, pytest -q backend also works from the repository root.

Test Coverage:

test_multimodal_utils.py: 35+ tests (content extraction, OCR validation, response generation)
test_multimodal_routes.py: 15+ tests (API endpoint validation, error handling)
Aggregate Coverage: 90%+ of core functionality

📋 Frontend Testing

cd frontend
npm test                    # Execute all component tests
npm test -- --coverage      # Generate coverage report
npm test MultimodalPanel    # Test specific component

Test Coverage:

MultimodalPanel.test.jsx: 40+ tests (file validation, upload workflow, results display)
Framework: Jest + React Testing Library

🔐 API Specification

🔓 Primary Endpoint: Multimodal Analysis

Endpoint: POST /api/multimodal/analyze

Request Format:

Content-Type: multipart/form-data

Parameters:
- file (optional): File object (document/image/video)
- question (required): User query string
- mode (required): Input modality (document|image|video|voice)
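A client call to this endpoint can be sketched with the Python standard library alone. `build_multipart` and `analyze` are hypothetical helpers: the first encodes the `question` and `mode` fields plus an optional file as multipart/form-data, and the second posts it. The base URL assumes the local development server on port 5000, and the decoded JSON is assumed to follow the documented response schema.

```python
import json
import mimetypes
import urllib.request
import uuid
from typing import Optional


def build_multipart(fields: dict[str, str],
                    file: Optional[tuple[str, bytes]] = None) -> tuple[bytes, str]:
    """Encode form fields and an optional (filename, content) pair as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts: list[bytes] = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    if file is not None:
        filename, content = file
        ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        header = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            f"Content-Type: {ctype}\r\n\r\n"
        ).encode()
        parts.append(header + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())  # closing boundary
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def analyze(question: str, mode: str, file: Optional[tuple[str, bytes]] = None,
            base_url: str = "http://localhost:5000") -> dict:
    """POST to /api/multimodal/analyze and return the decoded JSON response."""
    body, content_type = build_multipart({"question": question, "mode": mode}, file)
    req = urllib.request.Request(
        f"{base_url}/api/multimodal/analyze", data=body,
        headers={"Content-Type": content_type}, method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For an image question, a call might look like `analyze("What does this inscription say?", "image", ("inscription.jpg", image_bytes))`. In practice a library such as `requests` handles the multipart encoding for you; the manual version just makes the wire format visible.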

Response Schema:

{
  "success": boolean,
  "mode": "document|image|video|voice",
  "method": "text-file|pdf|docx|ocr-image|ocr-video|generic-text",
  "extracted_text": "Full text extracted from input",
  "response": "Generated or fallback response (900-1100 words)",
  "metadata": {
    "filename": "original_filename.ext",
    "extension": ".pdf|.jpg|.mp4|...",
    "size_bytes": number,
    "processing_method": "extraction_method_used"
  },
  "notes": ["Processing note 1", "Processing note 2"],
  "related_topics": [
    {
      "title": "Topic Title",
      "extract": "Brief description from Wikipedia"
    }
  ],
  "fallback": false
}
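On the client side, the response schema can be mirrored with typed dataclasses so missing or malformed fields fail early. This sketch assumes that `success`, `mode`, `method`, and `response` are always present and that the other fields may be omitted; the class and function names are hypothetical, and unknown keys are ignored.

```python
from dataclasses import dataclass, field
from typing import Any

VALID_MODES = {"document", "image", "video", "voice"}


@dataclass
class RelatedTopic:
    title: str
    extract: str


@dataclass
class AnalysisResponse:
    success: bool
    mode: str
    method: str
    extracted_text: str
    response: str
    metadata: dict[str, Any]
    notes: list[str] = field(default_factory=list)
    related_topics: list[RelatedTopic] = field(default_factory=list)
    fallback: bool = False


def parse_response(payload: dict[str, Any]) -> AnalysisResponse:
    """Build an AnalysisResponse from decoded JSON; raises KeyError or ValueError on bad input."""
    if payload["mode"] not in VALID_MODES:
        raise ValueError(f"unexpected mode: {payload['mode']!r}")
    return AnalysisResponse(
        success=bool(payload["success"]),
        mode=payload["mode"],
        method=payload["method"],
        extracted_text=payload.get("extracted_text", ""),
        response=payload["response"],
        metadata=payload.get("metadata", {}),
        notes=list(payload.get("notes", [])),
        related_topics=[RelatedTopic(t["title"], t["extract"])
                        for t in payload.get("related_topics", [])],
        fallback=bool(payload.get("fallback", False)),
    )
```

Checking `fallback` after parsing lets a UI label Wikipedia-enriched fallback answers differently from fully generated ones.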

📖 Documentation

This repository includes comprehensive technical documentation:

📘 CRAG_ARCHITECTURE_AND_PIPELINE_FLOW.md

Detailed technical documentation covering:

CRAG architecture and the 4-stage retrieval, generation, validation, and correction flow
Multimodal integration points and response structure
Validation metrics, configuration defaults, and troubleshooting guidance
Practical testing notes, including the existing multimodal test files and recommended CRAG cases

Intended Audience: Developers, code maintainers, technical architects, and QA engineers

🚀 Future Developments

🌟 Version 2.2 (Planned)

3D Artifact Visualization: Interactive 3D models of museum pieces with voice guidance
Mobile Voice Assistant: Dedicated mobile app with voice-first experience
Knowledge Graph Integration: Semantic relationship mapping for cultural artifacts
Performance Optimization: Latency reduction to <500ms for voice interactions
Collaborative Annotation: Visitor annotations that improve cultural understanding

📊 Project Statistics

Metric	Value
Total Lines of Code	8,500+
Test Coverage	90%+
API Endpoints	15+
Supported File Types	20+
Supported Languages	18+ (planned)
Museum Partnerships	6 institutions
Development Duration	3+ months

🔗 Key References & Data Sources

Content: Wikipedia (en.wikipedia.org), Wikimedia Foundation
Historical Images: Wikimedia Commons
Museum Data: Smithsonian Open Access, Louvre API, British Museum Collections
AI Generation: Google Generative AI (Gemini 2.5 Flash)
OCR Engine: Tesseract Open Source OCR
Vector Search: Facebook FAISS
Video Processing: OpenCV Foundation

🤛 Contributing & Support

📋 Reporting Issues

Please submit issues via GitHub Issues with:

Detailed description and reproduction steps
Environment specifications (OS, Python version, Node version)
Error logs and stack traces
Screenshots or relevant attachments

🔨 Development Workflow

Create feature branch: git checkout -b feature/feature-name
Implement changes and execute tests locally
Commit with descriptive messages following the Conventional Commits specification
Push to remote and create pull request
Submit for code review and CI/CD validation

Sample Output

Multimodal Analysis Result

{
  "success": true,
  "mode": "document",
  "method": "pdf_extraction",
  "extracted_text": "The Roman Empire was one of the most influential civilizations in human history, spanning over 500 years...",
  "response": "The Roman Empire, originating from the Italian peninsula, became a dominant force that transformed Western civilization. From 27 BCE to 476 CE, Rome developed sophisticated administrative systems, advanced architectural techniques, and influential legal frameworks. Key achievements include the construction of infrastructure such as aqueducts, roads, and amphitheaters, alongside the development of Latin as a universal language. The Roman military was renowned for its organization and effectiveness, while Roman law established principles that continue to influence modern legal systems.",
  "metadata": {
    "filename": "roman_history.pdf",
    "extension": ".pdf",
    "size_bytes": 2048576,
    "processing_method": "pdf_extraction"
  },
  "notes": [
    "PDF extracted successfully with 8 keywords identified",
    "Content grounded in Wikipedia historical data"
  ],
  "related_topics": [
    {
      "title": "Roman Republic",
      "extract": "The Roman Republic was the period of Roman history when the state operated as a republic..."
    },
    {
      "title": "Julius Caesar",
      "extract": "Gaius Julius Caesar was a Roman military general and statesman who played a critical role..."
    }
  ],
  "fallback": false
}

User Interface Examples

Document Upload Interface

Search Results with Historical Context

Timeline Navigation

License & Attribution

This project is distributed under the MIT License and was developed for educational and research purposes.

Data Attribution:

Historical Content: Wikipedia (Wikimedia Foundation)
Imagery: Wikimedia Commons (Creative Commons License)
Museum Information: Official institutional APIs
AI Capabilities: Google Gemini API
OCR Technology: Tesseract OCR Project

Authors

Yash Kumar Kalirawan

Keywords

Artificial Intelligence · Museums · Conversational Agents · Retrieval-Augmented Generation · Multimodal AI · Visitor Engagement · Cultural Heritage · Natural Language Processing · Vector Databases · OCR Technology

PastPortals v2 — Advancing Cultural Heritage Interpretation Through Intelligent Technology

🤝 Contributing

Contributions are welcome!

Fork the repository.
Create a feature branch.
Commit your changes.
Push to the branch.
Open a Pull Request.

⭐ If you like my work, drop a ⭐ and let's connect!
