Document Processor

A comprehensive document processing application that allows users to upload various document types, extract data, create embeddings, and interact with processed content using AI. The application features a robust backend with asynchronous job processing and a modern React frontend.

Demo

https://youtu.be/2bKFfopVvn8

Features

  • Multi-format Document Upload: Support for PDF, PNG, JPEG, SVG, CSV, TXT files
  • Intelligent Data Extraction: Extract text and structured data from uploaded documents
  • Asynchronous Processing: Background job queue built on Redis and Bull
  • Vector Embeddings: Generate embeddings using OpenAI and store them in the Pinecone vector database
  • Interactive Q&A: Ask questions about your uploaded documents
  • Search Functionality: Search through extracted document content
  • Data Visualization: View extracted data in structured format
  • Queue Monitoring: Real-time job queue status and monitoring

Tech Stack

  • Backend: Node.js, Express.js with TypeScript
  • Frontend: React 18 with Material-UI (MUI) v5 and TypeScript
  • Job Queue: Redis + Bull for asynchronous background processing
  • AI/ML: OpenAI GPT & Embeddings
  • Vector Database: Pinecone
  • Document Processing: PDF-parse, Tesseract.js (OCR), Sharp (image processing)
  • Database: MongoDB with Mongoose
  • Configuration: Centralized configuration management with environment variables
  • Type Safety: Full TypeScript implementation with strict type checking

Project Structure

document-processor/
├── app/
│   ├── backend/
│   │   ├── src/
│   │   │   ├── config/           # Configuration management
│   │   │   │   ├── appConfig.ts  # Centralized configuration service
│   │   │   │   └── database.ts   # Database connection setup
│   │   │   ├── models/           # Database models
│   │   │   │   └── Document.ts   # Document schema and model
│   │   │   ├── services/         # Business logic services
│   │   │   │   ├── documentProcessor.ts  # Document processing logic
│   │   │   │   ├── vectorService.ts      # Vector embeddings and search
│   │   │   │   ├── queryService.ts       # Document querying with AI
│   │   │   │   └── jobQueue.ts           # Background job processing
│   │   │   └── server.ts         # Express server with all routes
│   │   ├── .env.example          # Environment variables template
│   │   ├── eng.traineddata       # Tesseract OCR language data
│   │   └── package.json          # Backend dependencies
│   └── frontend/
│       ├── public/               # Static assets
│       ├── src/
│       │   ├── components/       # React UI components
│       │   │   ├── DocumentUpload.tsx    # File upload component
│       │   │   ├── DocumentList.tsx      # Document listing component
│       │   │   └── DocumentInteraction.tsx # Q&A and search interface
│       │   ├── services/         # API client services
│       │   │   └── documentService.ts    # API communication layer
│       │   ├── types/            # TypeScript type definitions
│       │   │   └── index.ts      # Shared interfaces and types
│       │   ├── App.tsx           # Main application component
│       │   └── index.tsx         # React app entry point
│       └── package.json          # Frontend dependencies
├── .gitignore                    # Git ignore patterns
└── README.md                     # This file

Setup

  1. Clone the repository

  2. Install backend dependencies:

    cd app/backend
    npm install
  3. Install frontend dependencies:

    cd app/frontend
    npm install
  4. Set up environment variables:

    cd app/backend
    cp .env.example .env

    Edit the .env file with your configuration.

  5. Required Services:

    • Redis: Required for job queue
    • MongoDB: For document storage
    • OpenAI API Key: For text embeddings and completions
    • Pinecone API Key: For vector similarity search
  6. Environment Variables:

    # Server Configuration
    PORT=3002
    NODE_ENV=development
    
    # Database Configuration
    MONGODB_URI=mongodb://localhost:27017/document-processor
    
    # Redis Configuration (for job queue)
    REDIS_HOST=localhost
    REDIS_PORT=6379
    
    # OpenAI Configuration
    OPENAI_API_KEY=your_openai_api_key
    OPENAI_EMBEDDING_MODEL=text-embedding-3-small
    OPENAI_GPT_MODEL=gpt-4o
    # IMPORTANT: This dimension must match your Pinecone index dimension
    # text-embedding-3-small supports 512, 1024, 1536 dimensions
    # text-embedding-3-large supports 256, 1024, 3072 dimensions
    OPENAI_EMBEDDING_DIMENSIONS=1024
    
    # Pinecone Configuration
    PINECONE_API_KEY=your_pinecone_api_key
    PINECONE_ENVIRONMENT=your_pinecone_environment
    PINECONE_INDEX_NAME=document-processor-index
    
    # Upload Configuration
    MAX_FILE_SIZE=10485760  # 10MB
    UPLOAD_DIR=./uploads
    

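As a concrete illustration, a centralized configuration service like the one in appConfig.ts might read and validate these variables roughly as follows. All names, types, and defaults here are illustrative, not the project's actual API:

```typescript
// Hypothetical sketch of a centralized config loader with validation.
interface AppConfig {
  port: number;
  mongodbUri: string;
  openaiApiKey?: string;
  embeddingDimensions: number;
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  // Fall back to development defaults when a variable is unset.
  const config: AppConfig = {
    port: Number(env.PORT ?? 3002),
    mongodbUri: env.MONGODB_URI ?? "mongodb://localhost:27017/document-processor",
    openaiApiKey: env.OPENAI_API_KEY, // warn rather than crash when missing
    embeddingDimensions: Number(env.OPENAI_EMBEDDING_DIMENSIONS ?? 1024),
  };
  if (!config.openaiApiKey) {
    console.warn("OPENAI_API_KEY is not set; AI features will be disabled.");
  }
  return config;
}
```

A singleton wrapper would call this once at startup (passing `process.env`) and hand the frozen result to every service, which is the "consistent configuration access" described later in this README.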
Development

Start Backend Server

cd app/backend
npm install
npm run dev

Start Frontend Development Server

cd app/frontend
npm install
npm start

Usage

  1. Start required services:

    # Start MongoDB (if not using Atlas)
    mongod
    
    # Start Redis
    redis-server
  2. Start the backend server:

    cd app/backend
    npm run dev

    Backend will run on http://localhost:3002

  3. Start the frontend development server:

    cd app/frontend
    npm start

    Frontend will run on http://localhost:3000 with proxy to backend

  4. Access the application at http://localhost:3000

  5. Document Processing Flow:

    • Upload documents through the web interface
    • Documents are automatically queued for processing using Bull job queue
    • Background workers process embeddings immediately when jobs are added
    • Monitor job queue status through the application
    • Search and interact with processed documents using AI-powered Q&A
  6. Architecture Notes:

    • Backend uses centralized configuration management with singleton pattern
    • All routes are defined in the main server.ts file
    • Job processing is handled asynchronously with Redis and Bull
    • Vector embeddings are stored in Pinecone for semantic search
    • Frontend uses Material-UI for consistent design system
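The embedding step above depends on splitting each document into overlapping chunks before vectors are generated. A minimal sketch, assuming 500-token chunks with 100-token overlap and approximating tokens as whitespace-separated words (the real processor would use a proper tokenizer, and the function name is illustrative):

```typescript
// Split text into overlapping chunks for embedding.
// "Tokens" are approximated as whitespace-separated words here.
function chunkText(text: string, chunkSize = 500, overlap = 100): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last chunk reached the end
  }
  return chunks;
}
```

The overlap means the last 100 words of each chunk reappear at the start of the next one, so a sentence straddling a chunk boundary still lands intact in at least one chunk.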

API Endpoints

All API endpoints are defined in the main server.ts file and follow RESTful conventions.

Document Management

  • POST /api/upload - Upload and queue document for processing

    • Accepts: multipart/form-data with file field
    • Supported formats: PDF, PNG, JPEG, SVG, CSV, TXT
    • Returns: Document object with processing status and job ID
  • GET /api/documents - List all documents

    • Query params: ?status=processing|completed|failed
    • Returns: Array of document objects with metadata
  • GET /api/documents/:id - Get specific document details

    • Returns: Complete document object with extracted data and processing status
  • DELETE /api/documents/:id - Delete document and cleanup

    • Removes document, associated files, embeddings, and vector data

Search & Query

  • POST /api/search - Semantic search across all processed documents

    • Body: { "query": "search terms", "limit": 10 }
    • Uses vector similarity search with Pinecone
    • Returns: Array of matching document chunks with relevance scores
  • POST /api/ask - Ask questions about specific documents

    • Body: { "question": "Your question", "documentId": "optional" }
    • Uses OpenAI GPT with document context
    • Returns: AI-generated answer with confidence score and sources
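For illustration, the documented request bodies can be captured in small typed helpers. The field names mirror the bodies above; the helper names are hypothetical (the project's actual client layer lives in documentService.ts):

```typescript
// Typed shapes for the two documented request bodies.
interface SearchRequest { query: string; limit: number }
interface AskRequest { question: string; documentId?: string }

function buildSearchRequest(query: string, limit = 10): SearchRequest {
  if (!query.trim()) throw new Error("query must be non-empty");
  return { query, limit };
}

function buildAskRequest(question: string, documentId?: string): AskRequest {
  // documentId is optional: omit it to ask across all processed documents
  return documentId ? { question, documentId } : { question };
}
```

A caller would then POST the result, e.g. `fetch("/api/search", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify(buildSearchRequest("invoice totals")) })`.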

Supported File Types

The application supports multiple document formats with intelligent processing:

  • PDF: Text extraction using pdf-parse library
  • Images (PNG, JPEG, SVG): OCR text extraction using Tesseract.js with eng.traineddata
  • CSV: Structured data parsing with automatic table detection and visualization
  • TXT: Direct text processing with metadata extraction
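One way to organize this is a lookup from file extension to processing strategy. A sketch (the actual documentProcessor.ts may be organized differently; names here are illustrative):

```typescript
// Map each supported extension to the processing strategy named above.
type Strategy = "pdf-parse" | "ocr" | "csv" | "text";

const strategies: Record<string, Strategy> = {
  ".pdf": "pdf-parse",
  ".png": "ocr",
  ".jpg": "ocr",
  ".jpeg": "ocr",
  ".svg": "ocr",
  ".csv": "csv",
  ".txt": "text",
};

function strategyFor(filename: string): Strategy {
  const ext = filename.slice(filename.lastIndexOf(".")).toLowerCase();
  const strategy = strategies[ext];
  if (!strategy) throw new Error(`Unsupported file type: ${ext}`);
  return strategy;
}
```

Rejecting unknown extensions up front keeps unsupported files out of the job queue entirely.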

Key Features & Technologies

Backend Architecture

  • Configuration Management: Centralized singleton configuration service with validation
  • Document Processing: Multi-format document processor with metadata extraction
  • Vector Search: OpenAI embeddings with Pinecone vector database for semantic search
  • Job Queue: Bull with Redis for asynchronous background processing; workers pick up jobs as soon as they are enqueued
  • Database: MongoDB with Mongoose ODM for document storage
  • Type Safety: Full TypeScript implementation with strict typing

Frontend Architecture

  • React 18: Modern React with hooks and functional components
  • Material-UI v5: Comprehensive design system with emotion styling
  • TypeScript: Strict type checking with shared interfaces
  • Component Structure: Modular components for upload, listing, and interaction
  • API Integration: Centralized service layer for backend communication
  • State Management: React hooks for local state management

AI & ML Integration

  • OpenAI GPT-4o: Advanced question answering and document analysis with a two-step Q&A process
    • Step 1: OpenAI analyzes questions to extract key concepts and generate refined search queries
    • Step 2: Uses original question with enhanced context for final answer generation
  • OpenAI Embeddings: text-embedding-3-small for semantic understanding with configurable dimensions
    • Configurable via OPENAI_EMBEDDING_DIMENSIONS environment variable
    • Must match Pinecone index dimensions exactly (512, 1024, or 1536 for text-embedding-3-small)
  • Enhanced Vector Search: Score boosting algorithm with sigmoid transformation for better relevance
    • Applies sigmoid transformation: 1 / (1 + exp(-10 * (score - 0.5)))
    • Additional 20% boost for high-confidence matches (>0.7)
    • Filters low-relevance results (<0.1) early in the process
  • Optimized Chunking Strategy: Improved text segmentation for better semantic matching
    • Reduced chunk size to 500 tokens for more focused content
    • Increased overlap to 100 tokens for better context preservation
  • Token Management: Automatic context truncation to prevent OpenAI rate limit errors
    • Limits context to top 3 most relevant documents
    • Truncates each document to 2000 characters maximum
    • Maintains essential information while staying within API limits
  • Semantic Query Enhancement: Query refinement for improved search accuracy
    • Enhances queries with contextual information for better embedding matching
  • Pinecone: High-performance vector similarity search
    • Searches 3x more candidates initially, then filters and re-ranks results
  • Automatic Worker Processing: Embedding jobs run as soon as they are queued, with no cron-scheduling overhead
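The score-boosting rules above can be sketched as a single function. The function name is hypothetical, and two details are assumptions: the 20% boost is applied after the sigmoid, and the final score is clamped to 1:

```typescript
// Sigmoid score boosting as described above: filter < 0.1, apply
// 1 / (1 + exp(-10 * (score - 0.5))), then boost high-confidence matches.
function boostScore(raw: number): number | null {
  if (raw < 0.1) return null;                          // drop low-relevance matches early
  let score = 1 / (1 + Math.exp(-10 * (raw - 0.5)));   // sigmoid transformation
  if (raw > 0.7) score *= 1.2;                         // extra 20% for high-confidence hits
  return Math.min(score, 1);                           // clamp is an assumption here
}
```

The sigmoid is steep around 0.5, so middling cosine similarities get pushed apart: a raw 0.5 stays at 0.5, while scores much above or below it saturate toward 1 or 0.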

Configuration Notes

The application uses a robust configuration system that:

  • Validates required API keys on startup
  • Provides fallback defaults for development
  • Supports environment-specific configurations
  • Gracefully handles missing API keys with warnings
  • Uses singleton pattern for consistent configuration access
  • Embedding Dimensions: Configurable via OPENAI_EMBEDDING_DIMENSIONS environment variable
    • Must match your Pinecone index dimension exactly
    • text-embedding-3-small: supports 512, 1024, 1536 dimensions
    • text-embedding-3-large: supports 256, 1024, 3072 dimensions
    • Default: 1024 dimensions

Troubleshooting

Common Issues

  1. Redis Connection: Ensure Redis server is running on the configured port
  2. MongoDB Connection: Check MongoDB URI and ensure database is accessible
  3. API Keys: Verify OpenAI and Pinecone API keys are valid and have sufficient credits
  4. File Upload: Check file size limits and supported formats
  5. Port Conflicts: Backend runs on 3002, frontend on 3000 by default

Development Tips

  • Use npm run type-check to validate TypeScript without compilation
  • Check application logs for detailed error information
  • Ensure Pinecone index dimensions match OPENAI_EMBEDDING_DIMENSIONS setting
  • Monitor token usage to avoid OpenAI rate limits (context is automatically truncated)
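The automatic truncation mentioned in the last tip reduces to a couple of slices; a sketch with illustrative names, using the limits stated earlier (top 3 documents, 2000 characters each):

```typescript
// Keep only the top-ranked documents, each cut to a character budget,
// so the assembled context stays within OpenAI rate limits.
function truncateContext(rankedDocs: string[], maxDocs = 3, maxChars = 2000): string[] {
  return rankedDocs.slice(0, maxDocs).map((doc) => doc.slice(0, maxChars));
}
```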
