RAG application: upload PDFs, PPTX files, and images, then chat with them using Azure AI Search + Azure OpenAI.
- Upload: drop a PDF, PPTX, or image
- Parse: extracts text (pdf-parse for PDFs, pptx-parser for slides, Azure Form Recognizer OCR for scanned docs/images)
- Chunk: splits content into overlapping chunks
- Embed: generates embeddings via Azure OpenAI
- Index: stores chunks + vectors in Azure AI Search
- Query: hybrid search (keyword + semantic + vector) retrieves relevant chunks
- Answer: Azure OpenAI generates answers grounded in your documents
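
The chunk step above can be sketched as a sliding window over the extracted text. This is a minimal illustration, not the code in `chunkingService.js`: the function name `chunkText`, the character-based sizing, and the default sizes are all assumptions.

```javascript
// Split text into fixed-size chunks where consecutive chunks share `overlap`
// characters. chunkSize and overlap defaults are illustrative, not repo values.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("need chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const chunks = [];
  const step = chunkSize - overlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(chunkText("abcdefghij", 4, 2));
// [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks, which tends to help retrieval.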
| Layer | Tech |
|---|---|
| Frontend | React 19, Vite, MUI, React Router |
| Backend | Express 5, Node.js |
| Search | Azure AI Search (hybrid: keyword + semantic + vector) |
| LLM | Azure OpenAI |
| Storage | Azure Blob Storage |
| Parsing | pdf-parse, pptx-parser, Azure Form Recognizer (OCR) |

```
frontend (react :5173)
        ↓
backend (express :5000)
  ├── /api/upload → parse → chunk → embed → index (azure ai search)
  ├── /api/chat → hybrid search → azure openai → answer
  └── /api/documents → list uploaded docs
```

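
For the hybrid search step behind `/api/chat`, the request sent to Azure AI Search combines a keyword query, a vector query, and semantic reranking. The sketch below builds such a request body, assuming the REST API shape of version 2023-11-01. The helper name `buildHybridSearchBody`, the vector field `contentVector`, and the semantic configuration name `default` are placeholders, not names taken from this repo's index schema.

```javascript
// Build the JSON body for an Azure AI Search "Search Documents" request that
// combines keyword, vector, and semantic ranking.
// ASSUMPTIONS: "contentVector" and "default" are placeholder names; replace
// them with the vector field and semantic configuration of your own index.
function buildHybridSearchBody(queryText, queryVector, top = 5) {
  return {
    search: queryText, // keyword part of the query
    vectorQueries: [
      {
        kind: "vector",
        vector: queryVector, // embedding of the user question
        fields: "contentVector",
        k: top, // number of nearest neighbours to retrieve
      },
    ],
    queryType: "semantic", // ask the service to rerank semantically
    semanticConfiguration: "default",
    top,
  };
}

const body = buildHybridSearchBody("What is in the Q3 deck?", [0.12, -0.03, 0.4], 3);
console.log(JSON.stringify(body, null, 2));
```

A body like this would be POSTed to the index's `docs/search` endpoint with the `api-key` header set to `AZURE_SEARCH_KEY`. Check the field names against the API version you target.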
- Node.js 18+
- Azure subscription with:
  - Azure OpenAI (deployment with embeddings + chat model)
  - Azure AI Search
  - Azure Blob Storage
  - Azure Form Recognizer (for OCR)
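
Because the backend depends on all four Azure services, a startup check can fail fast when a setting is missing. This is a minimal sketch using the variable names from `backend/.env`. `missingEnv` is a hypothetical helper, not part of this repo, and it assumes the variables are already loaded into `process.env`.

```javascript
// Names taken from backend/.env in this README.
const REQUIRED_ENV = [
  "AZURE_OPENAI_ENDPOINT",
  "AZURE_OPENAI_KEY",
  "AZURE_OPENAI_DEPLOYMENT",
  "AZURE_OPENAI_EMBEDDING_DEPLOYMENT",
  "AZURE_SEARCH_ENDPOINT",
  "AZURE_SEARCH_KEY",
  "AZURE_SEARCH_INDEX",
  "AZURE_STORAGE_CONNECTION_STRING",
  "AZURE_STORAGE_CONTAINER",
  "AZURE_FORM_RECOGNIZER_ENDPOINT",
  "AZURE_FORM_RECOGNIZER_KEY",
];

// Hypothetical helper: return the names that are unset or blank in `env`.
function missingEnv(env = process.env) {
  return REQUIRED_ENV.filter((name) => !env[name] || env[name].trim() === "");
}

const missing = missingEnv();
if (missing.length > 0) {
  console.warn(`Missing settings in backend/.env: ${missing.join(", ")}`);
}
```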

```
# backend/.env
AZURE_OPENAI_ENDPOINT=
AZURE_OPENAI_KEY=
AZURE_OPENAI_DEPLOYMENT=
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=
AZURE_SEARCH_ENDPOINT=
AZURE_SEARCH_KEY=
AZURE_SEARCH_INDEX=
AZURE_STORAGE_CONNECTION_STRING=
AZURE_STORAGE_CONTAINER=
AZURE_FORM_RECOGNIZER_ENDPOINT=
AZURE_FORM_RECOGNIZER_KEY=
```

```bash
cd backend
npm install
node scripts/createSearchIndex.js   # create Azure AI Search index
npm start                           # runs on :5000
```

```bash
cd frontend
npm install
npm run dev                         # runs on :5173
```

```
backend/
  server.js
  routes/
    upload.js              # file upload + processing pipeline
    chat.js                # RAG query endpoint
    documents.js           # list documents
  services/
    pdfParser.js           # PDF text extraction
    pptParser.js           # PPTX text extraction
    ocrService.js          # Azure Form Recognizer OCR
    ocrLargePDF.js         # OCR fallback for scanned PDFs
    chunkingService.js     # text chunking with overlap
    embeddingService.js    # Azure OpenAI embeddings
    searchIndexer.js       # index chunks into Azure AI Search
    searchQueryService.js  # hybrid search (keyword + semantic + vector)
    chatService.js         # Azure OpenAI chat completion
    blobStorage.js         # Azure Blob Storage
  scripts/
    createSearchIndex.js   # index schema setup
frontend/
  src/
    pages/                 # upload + chat views
    theme/                 # MUI theme config
```

- PDF (text-based, plus scanned with OCR fallback)
- PPTX (PowerPoint slides)
- Images (PNG, JPG, via Azure Form Recognizer OCR)
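
The format list above implies a dispatch from file type to parsing service. The sketch below routes by file extension and is only an illustration: `pickParser` is a hypothetical helper, the returned strings are the service file names under `backend/services/`, and the actual upload route may detect types differently (for example by MIME type). The OCR fallback for scanned PDFs is not modelled here.

```javascript
// Map of supported extensions to the service that would handle them.
const PARSER_BY_EXTENSION = {
  pdf: "pdfParser",
  pptx: "pptParser",
  png: "ocrService",
  jpg: "ocrService",
};

// Hypothetical helper: choose a parsing service from the file name.
function pickParser(filename) {
  const dot = filename.lastIndexOf(".");
  const ext = dot === -1 ? "" : filename.slice(dot + 1).toLowerCase();
  const parser = PARSER_BY_EXTENSION[ext];
  if (!parser) {
    throw new Error(`Unsupported file type: "${ext || filename}"`);
  }
  return parser;
}

console.log(pickParser("slides.PPTX")); // pptParser
console.log(pickParser("scan.jpg")); // ocrService
```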
MIT