ykjaat6104/PastPortals-V2

PastPortals v2: AI-Powered Multimodal CRAG System for Cultural Heritage Interpretation

🏛️ Project Overview

PastPortals is an AI-powered museum guide system built to address the limitations of traditional and existing digital museum information systems. The platform combines Corrective Retrieval-Augmented Generation (CRAG), natural language processing, multimodal interaction, vector-based retrieval, voice-first conversational AI, and feedback loops that refine the system over time, with the goal of delivering accurate, context-aware, and engaging cultural heritage experiences.


🎯 Problem Statement & Motivation

❌ Limitations of Existing Systems

Traditional museum systems rely on static methods such as printed labels, brochures, and audio guides, while existing digital systems suffer from:

  • Hallucination: Generation of inaccurate or unsupported information
  • Lack of domain grounding: Insufficient knowledge of historical and cultural contexts
  • No transparency: Inability to verify information sources
  • Limited multimodal support: Restricted to single interaction modes
  • Poor scalability: Inefficient handling of simultaneous visitors

✅ Proposed Solution

PastPortals implements Corrective Retrieval-Augmented Generation (CRAG) to bridge this gap by:

  1. Retrieving verified information from curated knowledge bases
  2. Validating and correcting generated content through fact-checking mechanisms
  3. Supporting multimodal interaction (text, voice, image, video)
  4. Enabling voice-first conversational AI for hands-free cultural exploration
  5. Implementing intelligent feedback loops that refine system behavior with each user interaction
  6. Enabling multilingual communication across 18+ languages
  7. Ensuring scalability and continuous improvement for high-traffic museum environments

🏗️ Technical Architecture

📊 System Components

┌─────────────────────────────────────────────────────────────────────┐
│                       Frontend Application Layer                    │
│              React 18 | Document Upload | Voice Interface           │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ HTTP/REST API
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     Content Processing Layer                        │
│        Document Extraction | OCR | Video Analysis | Voice           │
│        (PyMuPDF | python-docx | Tesseract | OpenCV)                │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ Extracted Content + Metadata
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                   Retrieval & Ranking Layer                         │
│              FAISS Vector Search | Wikipedia API                    │
│              Historical Content Classification                       │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ Contextually Relevant Information
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                  Generation & Response Layer                        │
│        Google Gemini 2.5 Flash | Fallback Enrichment               │
│              Fact Validation | Response Synthesis                   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ Generated Response with Metadata
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Response Delivery Layer                          │
│         Markdown Rendering | Audio Output | Related Topics          │
└─────────────────────────────────────────────────────────────────────┘

🛠️ Technology Stack

Frontend Architecture 🎨

| Component | Technology | Purpose |
|---|---|---|
| UI Framework | React 18.2 | Component-based interface with virtual DOM rendering |
| Routing | React Router 6 | Client-side navigation and state management |
| Animations | Framer Motion | Smooth transitions and interactive UI elements |
| Icons | Lucide React | Comprehensive, accessible icon system |
| HTTP Client | Axios | RESTful API communication with request/response interceptors |
| Testing | Jest + React Testing Library | 40+ component tests with coverage reporting |

Backend Infrastructure ⚙️

| Component | Technology | Purpose |
|---|---|---|
| API Framework | FastAPI | High-performance async REST API framework |
| Language | Python 3.13 | Primary backend language with modern features |
| Async Processing | asyncio + uvicorn | Non-blocking concurrent request handling |
| Testing | pytest | 50+ unit tests with comprehensive coverage |
| API Documentation | Pydantic + Swagger | Auto-generated interactive API documentation |

Content Processing & Extraction 📄

| Component | Technology | Purpose |
|---|---|---|
| PDF Extraction | PyMuPDF (fitz) | High-fidelity text and metadata extraction |
| Word Documents | python-docx | Structured parsing of DOCX format |
| Optical Character Recognition | pytesseract + Tesseract | Text extraction from images and scanned documents |
| Video Analysis | OpenCV (cv2) | Frame sampling and temporal processing (8 frames/video) |
| Voice Processing | Web Speech API | Real-time speech-to-text transcription |
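
The Video Analysis row mentions sampling 8 frames per video. As a rough illustration, the sketch below picks evenly spaced frame indices. `sample_frame_indices` is a hypothetical helper, and the even-spacing strategy is an assumption rather than something taken from the project code. With OpenCV, the frame total would typically come from `cap.get(cv2.CAP_PROP_FRAME_COUNT)`.

```python
def sample_frame_indices(total_frames: int, num_samples: int = 8) -> list[int]:
    """Return up to num_samples evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if total_frames <= num_samples:
        # Short clip: use every frame.
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the midpoint of each of the num_samples equal-length segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


indices = sample_frame_indices(80)
print(indices)
```

Sampling segment midpoints avoids the first and last frames, which are often black or title cards; whether that matters for museum footage is a judgment call.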

AI/ML & Generation Layer 🧠

| Component | Technology | Purpose |
|---|---|---|
| LLM Generation | Google Gemini 2.5 Flash | Advanced language generation with low latency |
| Retrieval-Augmented | CRAG (Correction Module) | Fact validation and hallucination correction |
| Vector Similarity | FAISS | Fast approximate nearest neighbor search |
| Sentence Embeddings | Sentence Transformers | Dense vector representation of content |
| Domain Classification | Historical Keyword Analysis | Context-aware content categorization |
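
To show what the Vector Similarity row does conceptually, here is a brute-force cosine-similarity search in pure Python. It is a stand-in for illustration only: FAISS performs the same kind of nearest-neighbor lookup far more efficiently, and real embeddings would come from a Sentence Transformers model rather than the toy three-dimensional vectors assumed here. The names `cosine`, `top_k`, and the corpus entries are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=2):
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy embeddings, made up for this example.
corpus = {
    "roman_empire": [0.9, 0.1, 0.0],
    "julius_caesar": [0.8, 0.2, 0.1],
    "ming_dynasty": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.0, 0.0], corpus, k=2)
print(results)
```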

Voice-First Conversational AI 🎙️

| Component | Technology | Purpose |
|---|---|---|
| Speech-to-Text | Google Cloud Speech-to-Text / Web Speech API | Multilingual voice input processing |
| Natural Language Understanding | LLM + RAG Pipeline | Intent extraction and query comprehension |
| Text-to-Speech | Google Cloud Text-to-Speech | Natural-sounding response delivery |
| Voice Assistant Framework | Custom voice conversation bot | Context-aware dialogue management |
| Real-time Streaming | WebSocket support | Continuous, low-latency voice interaction |

Data & Retrieval Systems 📚

| Component | Technology | Purpose |
|---|---|---|
| Vector Database | FAISS with in-memory indexing | Millisecond-level similarity search |
| Knowledge Bases | Wikipedia API + Smithsonian Open Access | Curated historical content retrieval |
| Domain Datasets | Custom museum collections | Institution-specific artifact metadata |
| Feedback Storage | JSON + structured logs | User interaction tracking for improvement |
| Cache Layer | Redis (optional) | Response caching and session management |

Intelligent Feedback Loop System 🔄

| Component | Technology | Purpose |
|---|---|---|
| User Interaction Tracking | Event logging pipeline | Capture queries, dwell time, user ratings |
| Feedback Collection | Implicit + explicit signals | Track relevance, accuracy, and satisfaction |
| Vector Similarity Refinement | Weight adjustment algorithms | Dynamically tune ranking for domain-specific queries |
| Model Adaptation | Online learning mechanisms | Continuous improvement of retrieval quality |
| Performance Monitoring | Metrics & analytics dashboard | Track system improvement across sessions |

🎓 System Objectives

The development of PastPortals targets the following key objectives:

  1. Accuracy & Reliability: Ground all responses in trusted, curated datasets with fact-checking mechanisms to reduce hallucination and enhance credibility
  2. Voice-First Interaction: Enable seamless voice-based conversational interfaces for hands-free cultural exploration with natural language understanding
  3. Continuous Self-Improvement: Implement intelligent feedback loops that refine retrieval ranking, response quality, and domain understanding from every user interaction
  4. Multilingual Support: Provide accessibility across 18+ languages for diverse visitor populations with cultural context preservation
  5. Multimodal Delivery: Process and respond to diverse input modalities (text, voice, image, video) while delivering content in preferred formats
  6. Scalability & Performance: Handle multiple concurrent users without degradation using FastAPI async architecture
  7. Accessibility & Cultural Sensitivity: Maintain authenticity in heritage interpretation while supporting diverse learning styles and accessibility requirements

⭐ Core Features

✨ Current Implementation (v2.0)

| Feature | Implementation |
|---|---|
| Document Processing | PDF, DOCX, TXT, MD, JSON, CSV, HTML extraction |
| Image Recognition | Tesseract-based OCR for photographic content |
| Video Analysis | Frame sampling with temporal OCR processing |
| Voice Interaction | WebRTC recording + transcription pipeline |
| Unified API | Single endpoint supporting all input modalities |
| Progress Tracking | Real-time upload status visualization (0-100%) |
| Content Validation | Format and size limit enforcement with user feedback |
| Fallback Responses | Wikipedia-enriched responses for API unavailability |
| Comprehensive Testing | 50+ backend + 40+ frontend unit tests |
| Museum Integration | Curated museum data and virtual tour content |

📁 File Processing Specifications

| Category | Maximum Size | Supported Formats |
|---|---|---|
| Documents | 50 MB | PDF, DOCX, TXT, MD, CSV, JSON, HTML, HTM |
| Images | 25 MB | PNG, JPG, JPEG, WEBP, BMP, TIFF, TIF |
| Video | 500 MB | MP4, MOV, AVI, MKV, WEBM, M4V |
| Voice | N/A | Real-time recording via WebRTC |
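
The limits above could be enforced with a check along these lines. This is a minimal sketch, not the project's validator: `validate_upload` and `LIMITS` are hypothetical names, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: binary megabytes

# (maximum size in bytes, allowed extensions) per upload category, from the table above
LIMITS = {
    "document": (50 * MB, {".pdf", ".docx", ".txt", ".md", ".csv", ".json", ".html", ".htm"}),
    "image": (25 * MB, {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tiff", ".tif"}),
    "video": (500 * MB, {".mp4", ".mov", ".avi", ".mkv", ".webm", ".m4v"}),
}


def validate_upload(filename: str, size_bytes: int, mode: str) -> tuple[bool, str]:
    """Return (ok, message) for a file uploaded under the given mode."""
    if mode not in LIMITS:
        return False, f"unsupported mode: {mode}"
    max_size, extensions = LIMITS[mode]
    ext = Path(filename).suffix.lower()
    if ext not in extensions:
        return False, f"unsupported {mode} format: {ext or '(none)'}"
    if size_bytes > max_size:
        return False, f"file exceeds the {max_size // MB} MB limit for {mode} uploads"
    return True, "ok"


print(validate_upload("roman_history.pdf", 2_048_576, "document"))
```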

🔄 Intelligent Self-Improving Feedback Loop System

Every user interaction represents an opportunity for system learning. PastPortals v2 incorporates a sophisticated feedback pipeline that continuously refines retrieval accuracy, response relevance, and domain understanding.

🔀 Feedback Mechanism Architecture

Stage 1: User Feedback Captured

  • Explicit ratings and implicit signals (re-queries, dwell time) logged per interaction
  • Domain context stored with each query-response pair
  • User satisfaction metrics tracked across museum exhibition types

Stage 2: Ranking Model Updated

  • Feedback dynamically adjusts vector similarity weights
  • Domain classifier confidence thresholds refined based on user validation
  • Historical accuracy data incorporated into retrieval ranking

Stage 3: System Evolution

  • Retrieval ranking and response quality are intended to improve as feedback accumulates across user sessions
  • Adaptive behavior emerges from aggregated feedback signals
  • Cultural context understanding deepens through continuous learning
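
One way the "feedback adjusts similarity weights" idea in Stage 2 might look in code is sketched below. Everything here is an assumption for illustration: the `FeedbackReranker` class, the 1 to 5 rating scale, and the exponential-moving-average update are not taken from the project.

```python
from collections import defaultdict


class FeedbackReranker:
    """Keeps a per-document weight nudged by user ratings and applies it at ranking time."""

    def __init__(self, learning_rate: float = 0.2):
        self.learning_rate = learning_rate
        self.weights = defaultdict(lambda: 1.0)  # 1.0 means "no adjustment"

    def record(self, doc_id: str, rating: int) -> None:
        """Move the document's weight toward a target derived from a 1-5 rating."""
        if not 1 <= rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        target = 1.0 + (rating - 3) / 4.0  # maps ratings 1..5 onto 0.5..1.5
        current = self.weights[doc_id]
        self.weights[doc_id] = current + self.learning_rate * (target - current)

    def rerank(self, hits):
        """Sort (doc_id, similarity) pairs by similarity scaled by the learned weight."""
        return sorted(hits, key=lambda hit: hit[1] * self.weights[hit[0]], reverse=True)


reranker = FeedbackReranker()
hits = [("colosseum_overview", 0.80), ("colosseum_construction", 0.78)]
for _ in range(3):
    reranker.record("colosseum_construction", 5)  # visitors rate this passage highly
print(reranker.rerank(hits))
```

After three five-star ratings the second passage overtakes the first even though its raw similarity is lower, which is the behavior the stages above describe.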

🎯 Key Benefits

  • Adaptive Responses: Museum guides learn visitor preferences and knowledge levels
  • Domain Refinement: Historical accuracy improves through expert feedback integration
  • Personalization: Interaction quality increases for returning visitors
  • Continuous Validation: User corrections automatically retrain ranking models

🎤 Voice-First Conversational AI Bot

PastPortals v2 delivers a seamless, hands-free cultural exploration experience through intelligent voice-first conversational AI.

A museum guide that listens, reasons, and speaks back in real time, turning every artifact into a conversation instead of a static label.

✨ Why It Stands Out

  • Hands-free discovery: Ask questions naturally and get spoken answers without typing or navigating menus.
  • Grounded responses: Every reply is filtered through CRAG so the assistant stays accurate, contextual, and museum-ready.
  • Multilingual conversations: Visitors can interact in 18+ languages, making the experience accessible and global.

🗣️ Core Voice Features

| Feature | Technology | Implementation |
|---|---|---|
| Speech-to-Text Input | Google Cloud Speech-to-Text / Web Speech API | Converts user voice into text queries in real time |
| AI Understanding | LLM + RAG + CRAG Pipeline | Processes natural language intent with cultural context |
| Text-to-Speech Output | Google Cloud Text-to-Speech | Delivers responses as natural, human-like voice |
| Real-Time Interaction | WebSocket streaming protocol | Low-latency conversational feedback |
| Context-Aware Dialogue | Domain-aware conversation state | Adapts responses based on museum location and artifact |
| Multilingual Support | 18+ language voice processing | Multilingual interactions for international visitors |

🔬 Technical Stack for Voice AI

  • Voice Input: Web Speech API + Whisper transcription
  • Voice Processing: TensorFlow Lite for on-device optimization
  • Response Generation: Gemini 2.5 Flash with domain context
  • Voice Output: Google Cloud TTS with natural prosody
  • Conversation Management: State machine for dialogue flow
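
The "state machine for dialogue flow" item can be pictured as a small transition table. The states and event names below are hypothetical, chosen only to illustrate the pattern; the project's actual dialogue states may differ.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


# (current state, event) -> next state
TRANSITIONS = {
    (State.IDLE, "wake"): State.LISTENING,
    (State.LISTENING, "transcript_ready"): State.THINKING,
    (State.THINKING, "response_ready"): State.SPEAKING,
    (State.SPEAKING, "playback_done"): State.IDLE,
    (State.SPEAKING, "interrupt"): State.LISTENING,  # visitor talks over the answer
}


class DialogueMachine:
    def __init__(self):
        self.state = State.IDLE

    def handle(self, event: str) -> State:
        """Apply an event, raising ValueError if it is not valid in the current state."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} not allowed in state {self.state.name}")
        self.state = TRANSITIONS[key]
        return self.state


machine = DialogueMachine()
for event in ("wake", "transcript_ready", "response_ready", "interrupt"):
    print(event, "->", machine.handle(event).name)
```

Keeping the transitions in one table makes invalid sequences (for example, a response arriving while idle) fail loudly instead of silently corrupting the conversation.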

🌊 Data Flow: Multimodal Intelligent Pipeline

PastPortals v2 represents a complete data journey, from diverse user inputs to intelligent, verified outputs, constantly refining itself through feedback.

⚡ Processing Pipeline

  1. User Input Acquisition → Text, Voice, Image, or Video submission
  2. Multimodal Processing → Speech-to-Text, OCR, Frame Extraction, Document Parsing
  3. Domain Classification → Historical/cultural context detection
  4. Vector Retrieval → FAISS semantic search of curated knowledge bases
  5. LLM Generation → Google Gemini 2.5 Flash response synthesis
  6. Fact Validation → CRAG correction module validates accuracy
  7. Output Delivery → Markdown-formatted response + voice synthesis
  8. Feedback Collection → User interaction logged for continuous improvement
  9. System Refinement → Ranking and understanding models updated
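
Steps 4 to 6 (retrieve, generate, validate) can be sketched as follows. This is an illustration under loud assumptions: `crag_answer` and `support_score` are hypothetical names, the retriever and generator are passed in as plain callables, and a crude word-overlap check stands in for whatever validation the CRAG correction module actually performs.

```python
def support_score(answer: str, passages: list[str]) -> float:
    """Fraction of the answer's longer words that occur (as substrings) in the passages."""
    words = {w.strip(".,").lower() for w in answer.split()}
    words = {w for w in words if len(w) > 3}
    if not words:
        return 0.0
    context = " ".join(passages).lower()
    return sum(1 for w in words if w in context) / len(words)


def crag_answer(question, retrieve, generate, threshold=0.6):
    """Retrieve passages, draft an answer, and replace the draft if it is poorly supported."""
    passages = retrieve(question)
    draft = generate(question, passages)
    score = support_score(draft, passages)
    if score >= threshold:
        return {"response": draft, "corrected": False, "support": score}
    # Correction step (assumption): fall back to quoting the top retrieved passage.
    safe = passages[0] if passages else "No verified information found."
    return {"response": safe, "corrected": True, "support": score}


def retrieve(question):
    # Stub retriever returning one fixed passage.
    return ["The Colosseum in Rome was completed in 80 CE under Titus."]


grounded = crag_answer("Who finished the Colosseum?", retrieve,
                       lambda q, p: "The Colosseum was completed under Titus.")
unsupported = crag_answer("Who finished the Colosseum?", retrieve,
                          lambda q, p: "Napoleon built the Colosseum during 1805.")
print(grounded["corrected"], unsupported["corrected"])
```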

📋 System Requirements

  • Node.js: v16 or higher
  • Python: v3.10 or higher
  • Tesseract OCR: System-level installation required
  • Virtual Environment: Python venv or equivalent

💻 Development Setup

# Activate virtual environment (Windows PowerShell)
& .venv\Scripts\Activate.ps1
# On macOS/Linux use instead: source .venv/bin/activate

# Install backend dependencies
pip install -r backend/requirements.txt

# Install frontend dependencies
cd frontend
npm install

# Configure environment variables
# Root .env file:
# GEMINI_API_KEY=your_api_key
# CORS_ORIGINS=http://localhost:3001

# frontend/.env file:
# PORT=3001
# REACT_APP_API_URL=http://localhost:5000
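
On the backend side, the root `.env` values might be read at startup roughly like this. It is a sketch under assumptions: `Settings` and `load_settings` are hypothetical names, `CORS_ORIGINS` is assumed to be a comma-separated list, and the variables are assumed to already be present in the process environment (for example via a dotenv loader).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    cors_origins: list[str]


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of environment variables (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is not set")
    raw_origins = env.get("CORS_ORIGINS", "http://localhost:3001")
    # Assumption: multiple origins are separated by commas.
    origins = [origin.strip() for origin in raw_origins.split(",") if origin.strip()]
    return Settings(gemini_api_key=api_key, cors_origins=origins)


settings = load_settings({
    "GEMINI_API_KEY": "your_api_key",
    "CORS_ORIGINS": "http://localhost:3001, http://localhost:3000",
})
print(settings.cors_origins)
```

Failing fast on a missing key surfaces configuration mistakes at startup rather than on the first generation request.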

🚀 Running the Application

Terminal 1 - Backend Server:

cd backend
# FastAPI server (async support for concurrent requests)
uvicorn app:app --reload --port 5000

# Or using Python directly (if configured)
python app.py
# Server runs on http://localhost:5000 with auto-generated docs at http://localhost:5000/docs

Terminal 2 - Frontend Application:

cd frontend
npm start
# Application accessible at http://localhost:3001

Navigate to http://localhost:3001/multimodal to access the multimodal input interface.


✅ Testing & Quality Assurance

👾 Backend Testing

# Run all backend tests with the virtualenv's interpreter (no activation needed; on Windows use .venv\Scripts\python.exe)
./.venv/bin/python -m pytest -q backend

# Generate coverage report
./.venv/bin/python -m pytest -q backend --cov=backend.utils --cov=backend.routes --cov-report=html

# Test specific modules
./.venv/bin/python -m pytest -q backend/tests/test_multimodal_utils.py
./.venv/bin/python -m pytest -q backend/tests/test_multimodal_routes.py

If you already have the virtualenv activated, pytest -q backend also works from the repository root.

Test Coverage:

  • test_multimodal_utils.py: 35+ tests (content extraction, OCR validation, response generation)
  • test_multimodal_routes.py: 15+ tests (API endpoint validation, error handling)
  • Aggregate Coverage: 90%+ of core functionality

📋 Frontend Testing

cd frontend
npm test                    # Execute all component tests
npm test -- --coverage      # Generate coverage report
npm test MultimodalPanel    # Test specific component

Test Coverage:

  • MultimodalPanel.test.jsx: 40+ tests (file validation, upload workflow, results display)
  • Framework: Jest + React Testing Library

🔐 API Specification

🔓 Primary Endpoint: Multimodal Analysis

Endpoint: POST /api/multimodal/analyze

Request Format:

Content-Type: multipart/form-data

Parameters:
- file (optional): File object (document/image/video)
- question (required): User query string
- mode (required): Input modality (document|image|video|voice)

Response Schema:

{
  "success": boolean,
  "mode": "document|image|video|voice",
  "method": "text-file|pdf|docx|ocr-image|ocr-video|generic-text",
  "extracted_text": "Full text extracted from input",
  "response": "Generated or fallback response (900-1100 words)",
  "metadata": {
    "filename": "original_filename.ext",
    "extension": ".pdf|.jpg|.mp4|...",
    "size_bytes": number,
    "processing_method": "extraction_method_used"
  },
  "notes": ["Processing note 1", "Processing note 2"],
  "related_topics": [
    {
      "title": "Topic Title",
      "extract": "Brief description from Wikipedia"
    }
  ],
  "fallback": false
}
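
A client might load this response into a typed structure before using it. The sketch below uses only the standard library; `AnalysisResult`, `parse_result`, and the choice of which fields count as required are assumptions made for the example, not part of the documented API.

```python
import json
from dataclasses import dataclass, field

# Assumption: these fields are always present in a successful response.
REQUIRED_KEYS = ("success", "mode", "method", "extracted_text", "response")


@dataclass
class AnalysisResult:
    success: bool
    mode: str
    method: str
    extracted_text: str
    response: str
    metadata: dict = field(default_factory=dict)
    notes: list = field(default_factory=list)
    related_topics: list = field(default_factory=list)
    fallback: bool = False


def parse_result(payload: str) -> AnalysisResult:
    """Parse a JSON response body, ignoring unknown keys and rejecting incomplete ones."""
    data = json.loads(payload)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"response is missing fields: {missing}")
    known = AnalysisResult.__dataclass_fields__
    return AnalysisResult(**{k: v for k, v in data.items() if k in known})


sample = (
    '{"success": true, "mode": "document", "method": "pdf", '
    '"extracted_text": "...", "response": "...", "fallback": false}'
)
result = parse_result(sample)
print(result.mode, result.fallback)
```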

📖 Documentation

This repository includes comprehensive technical documentation:

📘 CRAG_ARCHITECTURE_AND_PIPELINE_FLOW.md

Detailed technical documentation covering:

  • CRAG architecture and the 4-stage retrieval, generation, validation, and correction flow
  • Multimodal integration points and response structure
  • Validation metrics, configuration defaults, and troubleshooting guidance
  • Practical testing notes, including the existing multimodal test files and recommended CRAG cases

Intended Audience: Developers, code maintainers, technical architects, and QA engineers


🚀 Future Developments

🌟 Version 2.2 (Planned)

  • 3D Artifact Visualization: Interactive 3D models of museum pieces with voice guidance
  • Mobile Voice Assistant: Dedicated mobile app with voice-first experience
  • Knowledge Graph Integration: Semantic relationship mapping for cultural artifacts
  • Performance Optimization: Latency reduction to <500ms for voice interactions
  • Collaborative Annotation: Visitor annotations that improve cultural understanding

📊 Project Statistics

| Metric | Value |
|---|---|
| Total Lines of Code | 8,500+ |
| Test Coverage | 90%+ |
| API Endpoints | 15+ |
| Supported File Types | 20+ |
| Supported Languages | 18+ (planned) |
| Museum Partnerships | 6 institutions |
| Development Duration | 3+ months |

🔗 Key References & Data Sources

  • Content: Wikipedia, hosted by the Wikimedia Foundation (en.wikipedia.org)
  • Historical Images: Wikimedia Commons
  • Museum Data: Smithsonian Open Access, Louvre API, British Museum Collections
  • AI Generation: Google Generative AI (Gemini 2.5 Flash)
  • OCR Engine: Tesseract Open Source OCR
  • Vector Search: Facebook FAISS
  • Video Processing: OpenCV Foundation

🤛 Contributing & Support

📋 Reporting Issues

Please submit issues via GitHub Issues with:

  • Detailed description and reproduction steps
  • Environment specifications (OS, Python version, Node version)
  • Error logs and stack traces
  • Screenshots or relevant attachments

🔨 Development Workflow

  1. Create feature branch: git checkout -b feature/feature-name
  2. Implement changes and execute tests locally
  3. Commit with descriptive messages following conventional commits
  4. Push to remote and create pull request
  5. Submit for code review and CI/CD validation

Sample Output

Multimodal Analysis Result

{
  "success": true,
  "mode": "document",
  "method": "pdf_extraction",
  "extracted_text": "The Roman Empire was one of the most influential civilizations in human history, spanning over 500 years...",
  "response": "The Roman Empire, originating from the Italian peninsula, became a dominant force that transformed Western civilization. From 27 BCE to 476 CE, Rome developed sophisticated administrative systems, advanced architectural techniques, and influential legal frameworks. Key achievements include the construction of infrastructure such as aqueducts, roads, and amphitheaters, alongside the development of Latin as a universal language. The Roman military was renowned for its organization and effectiveness, while Roman law established principles that continue to influence modern legal systems.",
  "metadata": {
    "filename": "roman_history.pdf",
    "extension": ".pdf",
    "size_bytes": 2048576,
    "processing_method": "pdf_extraction"
  },
  "notes": [
    "PDF extracted successfully with 8 keywords identified",
    "Content grounded in Wikipedia historical data"
  ],
  "related_topics": [
    {
      "title": "Roman Republic",
      "extract": "The Roman Republic was the period of Roman history when the state operated as a republic..."
    },
    {
      "title": "Julius Caesar",
      "extract": "Gaius Julius Caesar was a Roman military general and statesman who played a critical role..."
    }
  ],
  "fallback": false
}

User Interface Examples

  • Document Upload Interface (hero screenshot)
  • Search Results with Historical Context (search results screenshot)
  • Timeline Navigation (timeline view screenshot)


License & Attribution

This project is distributed under the MIT License for educational and research purposes.

Data Attribution:

  • Historical Content: Wikipedia (Wikimedia Foundation)
  • Imagery: Wikimedia Commons (Creative Commons License)
  • Museum Information: Official institutional APIs
  • AI Capabilities: Google Gemini API
  • OCR Technology: Tesseract OCR Project

Authors

  • Yash Kumar Kalirawan

Keywords

Artificial Intelligence · Museums · Conversational Agents · Retrieval-Augmented Generation · Multimodal AI · Visitor Engagement · Cultural Heritage · Natural Language Processing · Vector Databases · OCR Technology


PastPortals v2 — Advancing Cultural Heritage Interpretation Through Intelligent Technology

🤝 Contributing

Contributions are welcome!

  1. Fork the repository.
  2. Create a feature branch.
  3. Commit your changes.
  4. Push to the branch.
  5. Open a Pull Request.

⭐ If you like my work, drop a ⭐ and let's connect!
