DocQuery is an application that lets you upload and interact with your PDFs. It combines a hybrid search engine with a Q&A agent powered by Retrieval-Augmented Generation (RAG) to search and retrieve information from PDF documents.
- Upload, view, and delete PDF documents
- Store and display document metadata
- Organize documents by custom categories for better management
- Find relevant documents by combining keyword search (exact text matching) with semantic search (natural language and contextual meaning)
- Ask questions and get answers that synthesize information across multiple documents
- Get concise and accurate answers that cite relevant document sources
- Go to the "Home" section
- Click "Add Category"
- Enter a category name and click "Add Category"
- Click the toggle for a category to see document information
- Click "Delete Category" to delete the category and all documents inside it
- Select a category using the category dropdown
- Go to the "Home" section
- Click the toggle for a category to see document information
- Click "View" to view a document
- Click "Delete" to delete a document
- Go to the "Home" section
- Click "Add Document"
- Select a PDF file
- Enter a document description
- Choose or create a category
- Click "Upload"
- Go to the "Search" section
- Enter your search query
- View results ranked by relevance
- Click "View" to view the document
- Use filters to narrow results by category
- Navigate to the "Q&A" section
- Enter a query about the documents
- View the LLM-generated answer with cited sources
- Continue the conversation
PDF Upload → Text Extraction → Chunking → Embedding Generation → Vector Store
User Query → Hybrid Search → Document Results
User Query → Processing → Retrieval (Hybrid) → Augmentation → LLM Response
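The chunking step in the ingestion pipeline can be illustrated with a minimal sketch. This is not the project's actual implementation: the function name `chunk_text`, the word-based splitting, and the default `chunk_size` and `overlap` values are all assumptions chosen for illustration.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Overlapping windows keep a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.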
- Document Dashboard Page (Home)
- Document and Category Management and View
- Category Filter
- Add/Remove Documents and Categories
- Add Document Page
- Search Page
- Q&A Page
- Endpoints
- Documents
- Categories
- Search
- Q&A
- Core Logic
- Document Ingestion
- Hybrid Search
- RAG Pipeline
- PostgreSQL: Document metadata and categories
- Qdrant: Vector embeddings (dense and sparse) for hybrid search
- AWS S3: PDF files and processed documents
- Qdrant stores dense and sparse embeddings to perform hybrid search
- sentence-transformers/all-MiniLM-L6-v2 to generate dense embeddings for semantic search
- Qdrant/bm25 to generate sparse embeddings for keyword search using BM25
- Semantic search computes cosine similarity between query and document embeddings to determine document relevance (capturing similar contextual meaning)
- Keyword search uses BM25, a ranking algorithm that determines document relevance to a query (for exact keyword matching)
- BM25 calculates scores based on:
- (1) term frequency (TF)
- (2) inverse document frequency (IDF)
- (3) document length normalization
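To make the two scoring methods concrete, here is a minimal pure-Python sketch of cosine similarity and of one common BM25 formulation (the Lucene-style IDF). It is illustrative only: in DocQuery the scoring happens inside Qdrant and the embedding models, whose exact formula and parameters may differ. The function names and the `k1` and `b` defaults are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query terms with BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # (2) inverse document frequency: rarer terms weigh more.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # (3) length normalization: longer documents are penalized.
            length_norm = 1 - b + b * len(doc) / avg_len
            # (1) term frequency, saturating as it grows.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

With equal document lengths, a document that repeats a query term scores higher than one that mentions it once, and a document without the term scores zero.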
- Model: Qwen/Qwen2-1.5B-Instruct
- Uses retrieved documents from hybrid search as context to augment response generation
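The augmentation step can be sketched as building a prompt that places the retrieved text ahead of the user's question. The function name `build_rag_prompt`, the `source` and `text` keys, and the prompt wording are all hypothetical; the project's actual prompt template may look different.

```python
def build_rag_prompt(question: str, chunks: list[dict], max_chunks: int = 3) -> str:
    """Assemble an LLM prompt from retrieved chunks (hypothetical helper).

    Each chunk is assumed to be a dict with "source" and "text" keys.
    """
    context_lines = []
    # Number the chunks so the model can cite them as [1], [2], ...
    for i, chunk in enumerate(chunks[:max_chunks], start=1):
        context_lines.append(f"[{i}] ({chunk['source']}) {chunk['text']}")
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the chunks gives the model a simple handle for citing sources, which the application can then map back to document names.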
- Node.js 16+ (for frontend development)
- Python 3.9+ (for backend)
- PostgreSQL 14+ (for database)
- Docker (for containerized deployment)
git clone https://github.com/aarav27/document-query-application.git
cd document-query-application
cd backend
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set up environment variables
cp .env.example .env
# Edit .env with your configuration:
# - DATABASE_URL=postgresql://user:password@localhost/docquery
# - AWS_S3_BUCKET="your_bucket_name"
# - AWS_REGION_NAME="region_name"
# - AWS_ACCESS_KEY_ID=your_access_key
# - AWS_SECRET_ACCESS_KEY=your_secret_key
# - QDRANT_URL=http://localhost:6333
# - QDRANT_API_KEY=your_api_key
# Start the backend server
uvicorn app.main:app --reload
Backend will be available at http://localhost:8000
cd frontend
# Install dependencies
npm install
# Start the development server
npm start
Frontend will be available at http://localhost:5173
# Start PostgreSQL container
docker run -d \
--name docquery-postgres \
-e POSTGRES_DB=docquery \
-e POSTGRES_USER=docuser \
-e POSTGRES_PASSWORD=password \
-p 5432:5432 \
-v pgdata:/var/lib/postgresql/data \
postgres:14
# Copy SQL initialization file into container
docker cp db-init/init.sql docquery-postgres:/init.sql
# Execute SQL inside container to create tables
docker exec -it docquery-postgres psql -U docuser -d docquery -f /init.sql
# Using Docker
docker run -d \
--name qdrant \
-p 6333:6333 \
qdrant/qdrant:latest
Qdrant dashboard will be available at http://localhost:6333/dashboard
docquery/
├── frontend/ # React TypeScript application
│ ├── src/
│ │ ├── components/ # React components
│ │ ├── pages/ # Page components
│ │ ├── styles/ # CSS/styling
│ │ ├── util/ # Types
│ │ ├── App.tsx # Navigation and routes
│ │ └── main.tsx # Application root
│ └── package.json
├── backend/ # FastAPI application
│ ├── app/
│ │ ├── api/ # API endpoints
│ │ ├── core/ # Configuration and utilities
│ │ ├── models/ # SQLAlchemy models
│ │ ├── rag/ # Document search and Q&A logic
│ │ ├── schemas/ # Pydantic schemas
│ │ └── services/ # Document and category management logic
│ ├── main.py # Backend startup
│ └── requirements.txt
└── db-init/ # Database setup
  └── init.sql # Application schema
- GET /api/documents - Retrieve all documents
- GET /api/documents/{id} - Retrieve a specific document
- POST /api/documents - Add a document
- DELETE /api/documents/{id} - Delete a document
- POST /api/documents/download-url - Generate download URL for document
- POST /api/documents/upload-url - Generate upload URL for document
- GET /api/categories - Retrieve all categories
- POST /api/categories - Create a category
- DELETE /api/categories/{id} - Delete a category
- POST /api/search - Hybrid search across documents
- POST /api/rag/qa - Ask a question about documents and get an answer
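As a rough illustration of calling the search endpoint from Python, the sketch below builds a POST request with the standard library. The request body shape (`{"query": ...}`) is an assumption, since the schema is not documented here; check the Pydantic schemas in the backend for the real field names. The helper name `build_search_request` is also hypothetical.

```python
import json
import urllib.request


def build_search_request(query: str,
                         base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/search."""
    # Assumed body shape: a single "query" field. The real schema may differ.
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send the request against a running backend:
# with urllib.request.urlopen(build_search_request("refund policy")) as resp:
#     results = json.loads(resp.read())
```

Keeping request construction separate from sending makes the helper easy to check without a running server.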