A comprehensive document processing application that allows users to upload various document types, extract data, create embeddings, and interact with processed content using AI. The application features a robust backend with asynchronous job processing and a modern React frontend.
- Multi-format Document Upload: Support for PDF, PNG, JPEG, SVG, CSV, TXT files
- Intelligent Data Extraction: Extract text and structured data from uploaded documents
- Asynchronous Processing: Job queue system with Redis and Bull for performance optimization
- Vector Embeddings: Generate embeddings using OpenAI and store in Pinecone vector database
- Interactive Q&A: Ask questions about your uploaded documents
- Search Functionality: Search through extracted document content
- Data Visualization: View extracted data in structured format
- Queue Monitoring: Real-time job queue status and monitoring
- Backend: Node.js, Express.js with TypeScript
- Frontend: React 18 with Material-UI (MUI) v5 and TypeScript
- Job Queue: Redis + Bull for asynchronous processing
- AI/ML: OpenAI GPT & Embeddings
- Vector Database: Pinecone
- Document Processing: PDF-parse, Tesseract.js (OCR), Sharp (image processing)
- Database: MongoDB with Mongoose
- Configuration: Centralized configuration management with environment variables
- Type Safety: Full TypeScript implementation with strict type checking
document-processor/
├── app/
│ ├── backend/
│ │ ├── src/
│ │ │ ├── config/ # Configuration management
│ │ │ │ ├── appConfig.ts # Centralized configuration service
│ │ │ │ └── database.ts # Database connection setup
│ │ │ ├── models/ # Database models
│ │ │ │ └── Document.ts # Document schema and model
│ │ │ ├── services/ # Business logic services
│ │ │ │ ├── documentProcessor.ts # Document processing logic
│ │ │ │ ├── vectorService.ts # Vector embeddings and search
│ │ │ │ ├── queryService.ts # Document querying with AI
│ │ │ │ └── jobQueue.ts # Background job processing
│ │ │ └── server.ts # Express server with all routes
│ │ ├── .env.example # Environment variables template
│ │ ├── eng.traineddata # Tesseract OCR language data
│ │ └── package.json # Backend dependencies
│ └── frontend/
│ ├── public/ # Static assets
│ ├── src/
│ │ ├── components/ # React UI components
│ │ │ ├── DocumentUpload.tsx # File upload component
│ │ │ ├── DocumentList.tsx # Document listing component
│ │ │ └── DocumentInteraction.tsx # Q&A and search interface
│ │ ├── services/ # API client services
│ │ │ └── documentService.ts # API communication layer
│ │ ├── types/ # TypeScript type definitions
│ │ │ └── index.ts # Shared interfaces and types
│ │ ├── App.tsx # Main application component
│ │ └── index.tsx # React app entry point
│ └── package.json # Frontend dependencies
├── .gitignore # Git ignore patterns
└── README.md # This file
- Clone the repository

- Install backend dependencies:

  ```bash
  cd app/backend
  npm install
  ```

- Install frontend dependencies:

  ```bash
  cd app/frontend
  npm install
  ```

- Set up environment variables:

  ```bash
  cd app/backend
  cp .env.example .env
  ```

  Edit the `.env` file with your configuration.
- Required Services:
  - Redis: required for the job queue
  - MongoDB: for document storage
  - OpenAI API key: for text embeddings and completions
  - Pinecone API key: for vector similarity search

- Environment Variables:

  ```env
  # Server Configuration
  PORT=3002
  NODE_ENV=development

  # Database Configuration
  MONGODB_URI=mongodb://localhost:27017/document-processor

  # Redis Configuration (for job queue)
  REDIS_HOST=localhost
  REDIS_PORT=6379

  # OpenAI Configuration
  OPENAI_API_KEY=your_openai_api_key
  OPENAI_EMBEDDING_MODEL=text-embedding-3-small
  OPENAI_GPT_MODEL=gpt-4o

  # IMPORTANT: This dimension must match your Pinecone index dimension
  # text-embedding-3-small supports 512, 1024, 1536 dimensions
  # text-embedding-3-large supports 256, 1024, 3072 dimensions
  OPENAI_EMBEDDING_DIMENSIONS=1024

  # Pinecone Configuration
  PINECONE_API_KEY=your_pinecone_api_key
  PINECONE_ENVIRONMENT=your_pinecone_environment
  PINECONE_INDEX_NAME=document-processor-index

  # Upload Configuration
  MAX_FILE_SIZE=10485760  # 10MB
  UPLOAD_DIR=./uploads
  ```
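The dimensions setting is consumed when embeddings are created. As a minimal sketch (assuming the official `openai` Node package; the helper name is illustrative, not the project's actual code):

```typescript
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Generate an embedding whose size follows OPENAI_EMBEDDING_DIMENSIONS.
async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: process.env.OPENAI_EMBEDDING_MODEL ?? "text-embedding-3-small",
    input: text,
    // Must match the Pinecone index dimension, or upserts will be rejected.
    dimensions: Number(process.env.OPENAI_EMBEDDING_DIMENSIONS ?? 1024),
  });
  return response.data[0].embedding;
}
```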
To run both apps in development:

```bash
# Backend
cd app/backend
npm install
npm run dev
```

```bash
# Frontend (in a separate terminal)
cd app/frontend
npm install
npm start
```
- Start required services:

  ```bash
  # Start MongoDB (if not using Atlas)
  mongod

  # Start Redis
  redis-server
  ```

- Start the backend server:

  ```bash
  cd app/backend
  npm run dev
  ```

  The backend will run on http://localhost:3002

- Start the frontend development server:

  ```bash
  cd app/frontend
  npm start
  ```

  The frontend will run on http://localhost:3000 with a proxy to the backend

- Access the application at http://localhost:3000
- Document Processing Flow:
  - Upload documents through the web interface
  - Documents are automatically queued for processing via the Bull job queue
  - Background workers process embeddings as soon as jobs are added (see the sketch after the Architecture Notes below)
  - Monitor job queue status through the application
  - Search and interact with processed documents using AI-powered Q&A
- Architecture Notes:
  - The backend uses centralized configuration management with a singleton pattern
  - All routes are defined in the main server.ts file
  - Job processing is handled asynchronously with Redis and Bull, as sketched below
  - Vector embeddings are stored in Pinecone for semantic search
  - The frontend uses Material-UI for a consistent design system
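A minimal sketch of the Redis + Bull pattern referenced above (names are illustrative, not the actual contents of jobQueue.ts):

```typescript
import Queue from "bull";

const documentQueue = new Queue("document-processing", {
  redis: {
    host: process.env.REDIS_HOST ?? "localhost",
    port: Number(process.env.REDIS_PORT ?? 6379),
  },
});

// Worker: runs as soon as a job is added -- no cron scheduling needed.
documentQueue.process(async (job) => {
  const { documentId } = job.data;
  // ...extract text, generate embeddings, upsert to Pinecone, update MongoDB...
  return { documentId, status: "completed" };
});

// Producer: called from the upload route to enqueue a document.
export function enqueueDocument(documentId: string) {
  return documentQueue.add({ documentId }, { attempts: 3 });
}
```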
All API endpoints are defined in the main server.ts file and follow RESTful conventions.
- `POST /api/upload` - Upload and queue a document for processing
  - Accepts: `multipart/form-data` with a `file` field
  - Supported formats: PDF, PNG, JPEG, SVG, CSV, TXT
  - Returns: Document object with processing status and job ID

- `GET /api/documents` - List all documents
  - Query params: `?status=processing|completed|failed`
  - Returns: Array of document objects with metadata

- `GET /api/documents/:id` - Get specific document details
  - Returns: Complete document object with extracted data and processing status

- `DELETE /api/documents/:id` - Delete a document and clean up
  - Removes the document, associated files, embeddings, and vector data

- `POST /api/search` - Semantic search across all processed documents
  - Body: `{ "query": "search terms", "limit": 10 }`
  - Uses vector similarity search with Pinecone
  - Returns: Array of matching document chunks with relevance scores

- `POST /api/ask` - Ask questions about specific documents
  - Body: `{ "question": "Your question", "documentId": "optional" }`
  - Uses OpenAI GPT with document context
  - Returns: AI-generated answer with confidence score and sources
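A hypothetical client-side usage example for the two POST endpoints (request bodies follow the list above; the response shapes in the comments are assumptions):

```typescript
async function searchDocuments(query: string) {
  const res = await fetch("http://localhost:3002/api/search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, limit: 10 }),
  });
  return res.json(); // array of matching chunks with relevance scores
}

async function askQuestion(question: string, documentId?: string) {
  const res = await fetch("http://localhost:3002/api/ask", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question, documentId }),
  });
  return res.json(); // AI-generated answer with confidence score and sources
}
```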
The application supports multiple document formats with intelligent processing:
- PDF: Text extraction using pdf-parse library
- Images (PNG, JPEG, SVG): OCR text extraction using Tesseract.js with eng.traineddata
- CSV: Structured data parsing with automatic table detection and visualization
- TXT: Direct text processing with metadata extraction
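A sketch of how per-format dispatch might look with the libraries named above (helper names are illustrative):

```typescript
import pdf from "pdf-parse";
import Tesseract from "tesseract.js";
import { readFile } from "fs/promises";

async function extractText(filePath: string, mimeType: string): Promise<string> {
  if (mimeType === "application/pdf") {
    // Text extraction with pdf-parse
    const data = await pdf(await readFile(filePath));
    return data.text;
  }
  if (mimeType.startsWith("image/")) {
    // OCR using the bundled eng.traineddata language pack
    const { data } = await Tesseract.recognize(filePath, "eng");
    return data.text;
  }
  // CSV and TXT are read directly; CSV is additionally parsed into rows
  return readFile(filePath, "utf8");
}
```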
- Configuration Management: Centralized singleton configuration service with validation
- Document Processing: Multi-format document processor with metadata extraction
- Vector Search: OpenAI embeddings with Pinecone vector database for semantic search
- Job Queue: Bull queue with Redis for asynchronous background processing with automatic worker processing
- Database: MongoDB with Mongoose ODM for document storage
- Type Safety: Full TypeScript implementation with strict typing
- React 18: Modern React with hooks and functional components
- Material-UI v5: Comprehensive design system with emotion styling
- TypeScript: Strict type checking with shared interfaces
- Component Structure: Modular components for upload, listing, and interaction
- API Integration: Centralized service layer for backend communication
- State Management: React hooks for local state management
- OpenAI GPT-4: Advanced question answering and document analysis with a two-step Q&A process (see the first sketch after this list)
  - Step 1: OpenAI analyzes the question to extract key concepts and generate a refined search query
  - Step 2: The original question is answered against the enhanced context retrieved with that query
- OpenAI Embeddings: text-embedding-3-small for semantic understanding with configurable dimensions
  - Configurable via the `OPENAI_EMBEDDING_DIMENSIONS` environment variable
  - Must match the Pinecone index dimensions exactly (512, 1024, or 1536 for text-embedding-3-small)
- Enhanced Vector Search: Score-boosting algorithm with a sigmoid transformation for better relevance (see the second sketch after this list)
  - Applies the sigmoid transformation `1 / (1 + exp(-10 * (score - 0.5)))`
  - Adds a further 20% boost for high-confidence matches (score > 0.7)
  - Filters out low-relevance results (score < 0.1) early in the process
- Optimized Chunking Strategy: Improved text segmentation for better semantic matching
  - Reduced chunk size to 500 tokens for more focused content
  - Increased overlap to 100 tokens for better context preservation
- Token Management: Automatic context truncation to prevent OpenAI rate-limit errors
  - Limits context to the top 3 most relevant documents
  - Truncates each document to a maximum of 2000 characters
  - Maintains essential information while staying within API limits
- Semantic Query Enhancement: Query refinement for improved search accuracy
  - Enhances queries with contextual information for better embedding matching
- Pinecone: High-performance vector similarity search
  - Searches 3x more candidates initially, then filters and re-ranks the results
  - Automatic worker processing without cron-job scheduling overhead
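A sketch of the two-step Q&A flow described above (prompts and helper names are illustrative assumptions, not the project's actual code):

```typescript
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Assumed helper: vector search returning scored text chunks.
declare function semanticSearch(query: string): Promise<{ text: string }[]>;

async function answerQuestion(question: string): Promise<string | null> {
  // Step 1: have the model extract key concepts into a refined search query.
  const refinement = await openai.chat.completions.create({
    model: process.env.OPENAI_GPT_MODEL ?? "gpt-4o",
    messages: [
      { role: "system", content: "Rewrite the user's question as a concise search query." },
      { role: "user", content: question },
    ],
  });
  const searchQuery = refinement.choices[0].message.content ?? question;

  // Retrieve context, truncated per the token-management rules above.
  const chunks = await semanticSearch(searchQuery);
  const context = chunks
    .slice(0, 3)                        // top 3 most relevant documents
    .map((c) => c.text.slice(0, 2000))  // at most 2000 characters each
    .join("\n---\n");

  // Step 2: answer the original question against the enhanced context.
  const answer = await openai.chat.completions.create({
    model: process.env.OPENAI_GPT_MODEL ?? "gpt-4o",
    messages: [
      { role: "system", content: `Answer using only this context:\n${context}` },
      { role: "user", content: question },
    ],
  });
  return answer.choices[0].message.content;
}
```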
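And a sketch of the score-boosting pipeline, directly implementing the sigmoid transform, high-confidence boost, and early filtering listed above (pure functions; names are illustrative):

```typescript
function boostScore(score: number): number {
  // Sigmoid transform: spreads mid-range similarity scores apart.
  let boosted = 1 / (1 + Math.exp(-10 * (score - 0.5)));
  // Extra 20% boost for high-confidence matches.
  if (score > 0.7) boosted *= 1.2;
  return Math.min(boosted, 1);
}

interface Match {
  id: string;
  score: number;
  text: string;
}

function rankMatches(candidates: Match[], limit: number): Match[] {
  return candidates
    .filter((m) => m.score >= 0.1)                    // drop low-relevance early
    .map((m) => ({ ...m, score: boostScore(m.score) }))
    .sort((a, b) => b.score - a.score)                // re-rank by boosted score
    .slice(0, limit);                                 // candidates fetched at 3x limit
}
```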
The application uses a robust configuration system that:
- Validates required API keys on startup
- Provides fallback defaults for development
- Supports environment-specific configurations
- Gracefully handles missing API keys with warnings
- Uses a singleton pattern for consistent configuration access (see the sketch below)
- Embedding Dimensions: Configurable via the `OPENAI_EMBEDDING_DIMENSIONS` environment variable
  - Must match your Pinecone index dimension exactly
  - text-embedding-3-small: supports 512, 1024, 1536 dimensions
  - text-embedding-3-large: supports 256, 1024, 3072 dimensions
  - Default: 1024 dimensions
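A minimal sketch of a singleton configuration service of this kind (illustrative; the real appConfig.ts may differ):

```typescript
class AppConfig {
  private static instance: AppConfig | undefined;

  readonly port = Number(process.env.PORT ?? 3002);
  readonly mongoUri =
    process.env.MONGODB_URI ?? "mongodb://localhost:27017/document-processor";
  readonly openaiApiKey = process.env.OPENAI_API_KEY ?? "";
  readonly embeddingDimensions = Number(process.env.OPENAI_EMBEDDING_DIMENSIONS ?? 1024);

  private constructor() {
    // Warn (rather than crash) when an API key is missing.
    if (!this.openaiApiKey) console.warn("OPENAI_API_KEY is not set");
  }

  // Every caller gets the same instance, so config is read and validated once.
  static get(): AppConfig {
    return (AppConfig.instance ??= new AppConfig());
  }
}

export const config = AppConfig.get();
```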
- Redis Connection: Ensure Redis server is running on the configured port
- MongoDB Connection: Check MongoDB URI and ensure database is accessible
- API Keys: Verify OpenAI and Pinecone API keys are valid and have sufficient credits
- File Upload: Check file size limits and supported formats
- Port Conflicts: Backend runs on 3002, frontend on 3000 by default
- Use `npm run type-check` to validate TypeScript without compilation
- Check application logs for detailed error information
- Ensure Pinecone index dimensions match the `OPENAI_EMBEDDING_DIMENSIONS` setting (see the sketch below)
- Monitor token usage to avoid OpenAI rate limits (context is automatically truncated)
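If you suspect a dimension mismatch, a quick one-off check along these lines can confirm it (assuming the `@pinecone-database/pinecone` client; not part of the project's code):

```typescript
import { Pinecone } from "@pinecone-database/pinecone";

async function checkDimensions(): Promise<void> {
  const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
  // describeIndex reports the dimension the index was created with.
  const index = await pc.describeIndex(
    process.env.PINECONE_INDEX_NAME ?? "document-processor-index"
  );
  const expected = Number(process.env.OPENAI_EMBEDDING_DIMENSIONS ?? 1024);
  if (index.dimension !== expected) {
    console.error(`Dimension mismatch: index=${index.dimension}, env=${expected}`);
  } else {
    console.log(`Dimensions match (${expected})`);
  }
}

checkDimensions();
```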