A production-style Corrective RAG (CRAG) pipeline built with LangGraph, FAISS, Groq LLaMA-3.3-70b, and Streamlit. Unlike a basic retrieve-then-answer loop, the system uses an LLM to grade every retrieval for relevance, automatically rewrites and retries the query when the first pass falls short, and only then synthesises a grounded answer — all orchestrated as a compiled LangGraph state machine.

      ┌──────────────┐
      │  User Query  │
      └──────┬───────┘
             │
             ▼
      ┌──────────────────┐
      │  retrieve_docs   │  FAISS semantic search over indexed documents
      └──────┬───────────┘
             │
             ▼
      ┌──────────────────┐
      │    grade_docs    │  LLM binary relevance check on retrieved passages
      └──────┬───────────┘
             │
        ─────┴──────────────────────────
        │                              │
    relevant                     not relevant
        │                              │
        │                  ┌───────────────────────┐
        │                  │ rewrite_and_retrieve  │  LLM rewrites query,
        │                  │     (max 1 retry)     │  re-fetches from FAISS
        │                  └───────────┬───────────┘
        │                              │
        └──────────────┬───────────────┘
                       │
                       ▼
                ┌──────────────────┐
                │ generate_answer  │  Direct LLM call with graded context
                └──────┬───────────┘
                       │
                       ▼
                ┌───────────────┐
                │ Final Answer  │
                └───────────────┘

State (RAGState) flows through each node carrying question, retrieved_docs, grade, and answer.
The 1-retry ceiling is enforced structurally — there is no edge from rewrite_and_retrieve back to grade_docs.
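
The routing above can be sketched in plain Python, without LangGraph, to show why the retry ceiling holds. This is a minimal illustration under stated assumptions, not the project's `graph_builder.py`: the dataclass stands in for the Pydantic `RAGState`, `run_crag` is a hypothetical name, and the `retrieve`, `grade`, `rewrite`, and `generate` callables are toy stubs for the FAISS and LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RAGState:
    """Stand-in for the Pydantic RAGState; field names follow the README."""
    question: str
    retrieved_docs: List[str] = field(default_factory=list)
    grade: str = ""
    answer: str = ""


def run_crag(
    question: str,
    retrieve: Callable[[str], List[str]],
    grade: Callable[[str, List[str]], str],
    rewrite: Callable[[str], str],
    generate: Callable[[str, List[str]], str],
) -> RAGState:
    """Walk the four nodes in the order the diagram shows."""
    state = RAGState(question=question)

    # retrieve_docs
    state.retrieved_docs = retrieve(state.question)

    # grade_docs: runs exactly once per query
    state.grade = grade(state.question, state.retrieved_docs)

    # Conditional edge: only the not_relevant branch visits rewrite_and_retrieve.
    if state.grade == "not_relevant":
        state.retrieved_docs = retrieve(rewrite(state.question))
        # Control never returns to grading, so a second retry cannot happen.

    # generate_answer
    state.answer = generate(state.question, state.retrieved_docs)
    return state


# Toy stubs (hypothetical) that exercise the corrective branch.
corpus = {"agents": ["Agents combine planning, memory, and tool use."]}
calls = {"grade": 0, "rewrite": 0}


def retrieve(query: str) -> List[str]:
    return corpus.get(query, [])


def grade(query: str, docs: List[str]) -> str:
    calls["grade"] += 1
    return "relevant" if docs else "not_relevant"


def rewrite(query: str) -> str:
    calls["rewrite"] += 1
    return "agents"


def generate(query: str, docs: List[str]) -> str:
    return docs[0] if docs else "No grounded answer found."


result = run_crag("what is an agent?", retrieve, grade, rewrite, generate)
print(result.grade, "|", result.answer)
```

Because the rewrite branch falls straight through to generation, the grader is called once per query regardless of how the second retrieval turns out, which mirrors the missing edge in the graph.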
| Layer | Technology |
|---|---|
| LLM | Groq — llama-3.3-70b-versatile |
| Orchestration | LangGraph StateGraph with conditional routing |
| RAG pattern | Corrective RAG — LLM-based relevance grading + query rewriting |
| Vector store | FAISS (in-memory) |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 (local, no API cost) |
| Document loaders | LangChain — Web, PDF, TXT |
| UI | Streamlit |
| Evaluation | Cosine similarity via sentence-transformers |
| Package management | pip / uv |
Clone the repository and enter it:

    git clone https://github.com/your-username/agentic-rag.git
    cd agentic-rag

Create and activate a virtual environment:

    python -m venv .venv

    # Windows
    .venv\Scripts\activate

    # macOS / Linux
    source .venv/bin/activate

Install the dependencies:

    pip install -r requirements.txt

Get a free key at console.groq.com → API Keys, then copy the environment template:

    # Windows
    copy .env.example .env

    # macOS / Linux
    cp .env.example .env

Open .env and replace the placeholder:

    GROQ_API_KEY="gsk_your_actual_key_here"

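
As a rough illustration of how a config module might pick that key up and fail early when it is missing, here is a standard-library sketch. It is an assumption, not the contents of `config.py`: the project may well rely on a library such as python-dotenv, and `read_env_file` and `require_groq_key` are hypothetical names.

```python
import os
from pathlib import Path
from typing import Dict

PLACEHOLDER = "gsk_your_actual_key_here"  # value shipped in the template above


def read_env_file(path: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as in GROQ_API_KEY="gsk_..."
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_groq_key(env: Dict[str, str]) -> str:
    """Return the key from the parsed file or the process environment."""
    key = env.get("GROQ_API_KEY") or os.environ.get("GROQ_API_KEY", "")
    if not key or key == PLACEHOLDER:
        raise RuntimeError("Set GROQ_API_KEY in .env before starting the app.")
    return key
```

Checking for the unedited placeholder catches the common case where the template was copied but never filled in.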
Start the app:

    streamlit run streamlit_app.py

The app fetches and indexes the default source documents on first launch (~30 seconds), then the chat interface is ready.
Questions drawn from the indexed sources (Lilian Weng's blog posts on LLM agents and video diffusion):
- What are the three main components of an LLM-powered autonomous agent?
- How does chain-of-thought prompting help agents solve complex tasks?
- What types of memory does an LLM agent use, and how do they differ?
- What are the main challenges of applying diffusion models to video generation?
- What role does classifier-free guidance play in diffusion models?
eval.py scores the system against 6 reference Q&A pairs using cosine similarity between expected and actual answers:

    python eval.py

Results are printed to the terminal and saved to eval_results.json.
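
To make the metric concrete, the sketch below computes cosine similarity for small hand-written vectors. It is illustrative only: in the project the vectors would be sentence-transformers embeddings of the expected and actual answers, and `cosine_similarity` here is a hypothetical helper rather than the code in `eval.py`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy "embeddings": same direction scores near 1, orthogonal scores 0.
expected = [1.0, 2.0, 0.0]
actual_close = [2.0, 4.0, 0.0]
actual_orthogonal = [0.0, 0.0, 3.0]

print(round(cosine_similarity(expected, actual_close), 6))
print(round(cosine_similarity(expected, actual_orthogonal), 6))
```

The score depends only on direction, not magnitude, so it rewards answers that are semantically close to the reference without checking that the facts match.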

    agentic-rag/
    │
    ├── src/
    │   ├── config/
    │   │   └── config.py               # Groq API key, model name, chunk settings
    │   ├── document_ingestion/
    │   │   └── document_processor.py   # Web / PDF / TXT loaders + text splitter
    │   ├── graph_builder/
    │   │   └── graph_builder.py        # LangGraph StateGraph definition
    │   ├── node/
    │   │   └── nodes.py                # retrieve_docs, grade_docs,
    │   │                               # rewrite_and_retrieve, generate_answer
    │   ├── state/
    │   │   └── rag_state.py            # RAGState Pydantic model
    │   └── vectorstore/
    │       └── vectorstore.py          # FAISS + HuggingFace embeddings
    │
    ├── data/
    │   ├── attention.pdf               # Optional local PDF source
    │   └── url.txt                     # Default source URLs
    │
    ├── streamlit_app.py                # Streamlit UI entry point
    ├── main.py                         # CLI entry point with interactive mode
    ├── eval.py                         # Evaluation harness (cosine similarity)
    ├── requirements.txt                # Pip dependencies
    ├── pyproject.toml                  # Project metadata and pinned deps
    ├── .env.example                    # Environment variable template
    └── README.md


- Ingestion: URLs and local files are loaded, split into 500-token chunks, and embedded with `all-MiniLM-L6-v2` into an in-memory FAISS index.
- Retrieval: `retrieve_docs` runs a cosine-similarity search and returns the top-k passages.
- Grading: `grade_docs` sends the passages to the LLM with a binary relevance prompt (relevant / not relevant). It short-circuits without an LLM call when no docs are returned.
- Correction: if graded `not_relevant`, `rewrite_and_retrieve` asks the LLM to reformulate the query, then re-fetches. Max one retry, enforced by graph topology.
- Generation: `generate_answer` calls the LLM with the graded context and returns a concise, grounded answer.
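
The grading step, including its short-circuit, could look roughly like the sketch below. It is a hedged illustration rather than the code in `nodes.py`: the prompt wording is invented, and `llm` is a hypothetical callable that takes a prompt string and returns the model's text, standing in for the Groq client.

```python
from typing import Callable, List

# Assumed prompt wording; the real prompt in the project may differ.
GRADE_PROMPT = (
    "You are grading retrieved passages for relevance.\n"
    "Question: {question}\n\nPassages:\n{passages}\n\n"
    "Reply with exactly one word: relevant or not_relevant."
)


def grade_docs(question: str, docs: List[str], llm: Callable[[str], str]) -> str:
    """Return 'relevant' or 'not_relevant' for the retrieved passages."""
    if not docs:
        # Nothing to grade: skip the LLM call entirely.
        return "not_relevant"
    prompt = GRADE_PROMPT.format(question=question, passages="\n---\n".join(docs))
    reply = llm(prompt).strip().strip(".").lower()
    # Anything other than a clear "relevant" is treated as a failed retrieval.
    return "relevant" if reply == "relevant" else "not_relevant"


# Fake LLM (hypothetical) that records how often it is called.
llm_calls: List[str] = []


def fake_llm(prompt: str) -> str:
    llm_calls.append(prompt)
    return "Relevant."


empty_grade = grade_docs("What is CRAG?", [], fake_llm)
doc_grade = grade_docs("What is CRAG?", ["CRAG grades retrieved passages."], fake_llm)
print(empty_grade, doc_grade, len(llm_calls))
```

Defaulting ambiguous replies to `not_relevant` errs toward one extra rewrite rather than answering from passages the grader did not clearly endorse.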