🦊 Agentic Doc Search RAG

A production-grade, Self-Corrective Agentic RAG system that reasons before it answers —
built to solve the silent hallucination problem in standard RAG pipelines.

Live Demo — Screenshots

The following screenshots demonstrate the system working end-to-end: correct answers with full agent path transparency, graceful out-of-scope rejection, live session metrics, and MCP server tool validation.

1. Accurate Policy Answer with Agent Path Trace

Query about security policy responsibilities — system retrieves relevant chunks, grades them as relevant, and generates a cited answer. The agent path (retrieve → grade_documents → generate) is visible in real time.

2. Graceful Out-of-Scope Rejection — Zero Hallucination

Query completely outside the document domain — the grading node correctly identifies retrieved chunks as irrelevant, routes to the no_answer node, and returns an honest rejection. No fabricated answer.

3. MCP Server Tool Validation — Inspector Confirmed

The RAG system exposed as an MCP tool, called through MCP Inspector. Result validated as ✓ Valid according to output schema — proving the server is production-ready for AI agent integration.

Why This Project Exists — The Problem

Standard RAG (Retrieval-Augmented Generation) pipelines follow a fixed, linear path:

User Query → Retrieve Docs → Generate Answer

This architecture always generates an answer — regardless of whether the retrieved documents are actually relevant or whether the question is even in scope. In production, this creates three failure modes that are silent and dangerous:

Failure Mode 1: Irrelevant Retrieval with Confident Hallucination

A vector similarity search returns the top-K "closest" chunks by embedding distance. But cosine similarity is not the same as semantic relevance. A chunk about "access token expiry" may rank high for a query about "password policy" — and the LLM will use it to generate a confident, wrong answer.

Standard RAG produces no warning. The user never knows.

Failure Mode 2: No Out-of-Scope Handling

Ask a naive RAG about topics outside its documents ("How do I reset my GitLab password?") and it will still attempt to answer using whatever tangentially related chunks it found. This is not just wrong — in security and compliance contexts, it is dangerous misinformation.

Failure Mode 3: No Transparency or Observability

A linear RAG pipeline gives users zero visibility into what happened: which documents were retrieved, whether they were relevant, how long inference took, and which policy produced the answer. Debugging hallucinations becomes guesswork.

In high-stakes domains like security policy compliance, a hallucinated answer is not just unhelpful — it is a liability.

The Solution: Agentic Self-Corrective RAG

Instead of a fixed pipeline, this project implements an agent — a system that observes the situation and decides what to do at each step.

The agent is built as a LangGraph StateGraph with 4 specialised nodes. Between each node, the graph evaluates a condition and chooses the next path. This enables:

Self-correction: If retrieved documents fail the relevance grade, the system routes away from generation rather than hallucinating.
Graceful rejection: Queries outside the document domain get an honest "I don't have that information" response.
Full transparency: Every node execution is visible to the user in real time, including timing and path.

The target corpus is a set of 6 GitLab Security and Technology Policy documents — a realistic enterprise compliance knowledge base covering Access Management, Audit Logging, Change Management, Penetration Testing, the SDLC, and overarching policy governance.

Agent Architecture

The graph has 4 nodes and conditional routing:

User Query
    │
    ▼
┌──────────────┐
│   retrieve   │  ← Embeds query, fetches top-6 chunks from ChromaDB (cosine similarity)
└──────┬───────┘   Stores both documents AND their metadata (policy_title, filename) in state
       │
       ▼
┌──────────────────────┐
│   grade_documents    │  ← Dedicated LLM call: "Are these docs relevant to the question?"
│   (Llama 3.1 8B)     │   Plain-text yes/no — avoids function-calling failures on small models
└──────┬───────────────┘
       │
   ┌───┴─────────────────────────┐
   │ yes (relevant)              │ no (irrelevant)
   ▼                             ▼
┌──────────┐              ┌─────────────┐
│ generate │              │  no_answer  │  ← Returns polite rejection, zero hallucination
│          │              │             │   Eliminates infinite retry loops
│ Formats context          └──────┬──────┘
│ with [Source: Policy |          │
│  File:] headers                 │
│ so LLM always cites             │
│ the correct document            │
└─────┬────┘                      │
      │                           │
      └─────────────┬─────────────┘
                    ▼
                   END

Visual Graph (Auto-generated by LangGraph)

This diagram is auto-generated by calling graph.get_graph().draw_mermaid_png() — it is the actual compiled execution graph, not a manual diagram.

Core Design Decisions

This table explains why each technical choice was made. Decisions made under constraints are where real engineering judgment shows.

Decision	What Was Chosen	Why — The Reasoning
Orchestration framework	LangGraph `StateGraph`	Enables conditional branching, stateful loops, and graph-based routing — impossible in a plain LangChain `chain`. A chain always executes every step; a graph decides what to execute.
Document grading node	Dedicated LLM call before generation	Forces the agent to evaluate relevance before committing to an answer. Eliminates the root cause of RAG hallucination: using irrelevant context.
Plain-text grading (not structured output)	`yes`/`no` string response	Smaller open-source models like Llama 3.1 8B frequently fail at JSON function-calling / structured output. Plain-text prompting is more reliable at this scale.
`no_answer` exit node	Explicit graceful rejection path	Prevents the graph from looping infinitely when no relevant docs are found, and eliminates out-of-scope hallucination.
Metadata-tagged context	`[Source: Policy	File:]` prefix per chunk
Similarity search over MMR	`search_type="similarity"`	Maximum Marginal Relevance (MMR) prioritises diversity, which caused wrong-policy chunks to be retrieved. Similarity search returns the most relevant chunks consistently.
MemorySaver checkpointer	Per-`thread_id` conversation state	Persists full conversation history across turns without re-processing documents. True multi-turn chat in a stateful graph.
Local HuggingFace embeddings	`sentence-transformers/all-MiniLM-L6-v2`	No API cost, no rate limits, no latency overhead for embedding calls. Fast enough for this corpus size and runs entirely offline.
Groq inference	Llama 3.1 8B Instant	Sub-second LLM responses even for complex policy questions — critical for a chat UI. Free tier is sufficient for development and demos.
MCP server	FastMCP over the RAG graph	Exposes the entire agent as a standard MCP tool, enabling AI agents (Claude Desktop, VS Code Copilot, etc.) to query the policy knowledge base directly.

Implementation Walkthrough

Phase 1: Data Ingestion Pipeline

.md policy files → UnstructuredMarkdownLoader → Text Cleaner → RecursiveCharacterTextSplitter → ChromaDB

Each Markdown file is:

Loaded with UnstructuredMarkdownLoader (preserves section structure)
Cleaned (strip markdown headers and excess whitespace for embedding quality)
Split into overlapping chunks — CHUNK_SIZE=800, CHUNK_OVERLAP=150 — to prevent information loss at section boundaries
Tagged with metadata: { "policy_title": "Audit Logging Policy", "filename": "audit-logging-policy.md" }
Embedded with all-MiniLM-L6-v2 and persisted to ChromaDB

The CHUNK_OVERLAP=150 is intentional — it ensures that sentences spanning a chunk boundary are fully represented in both chunks, preventing partial context retrieval.

Phase 2: Agent State

The graph shares a typed state dictionary AgentState across all nodes:

class AgentState(TypedDict):
    messages:      Annotated[list[AnyMessage], add_messages]  # Full conversation history
    documents:     Optional[list[Document]]                   # Retrieved LangChain Documents
    doc_metadata:  Optional[list[dict]]                       # Extracted policy_title + filename per chunk
    next_action:   Optional[Literal["generate", "no_answer"]] # Grader's routing decision

The doc_metadata field is the key to correct citations — it is populated by retrieve_node and consumed by generate_node to build the [Source: Policy | File:] context headers.

Phase 3: The Four Nodes

retrieve_node

# Embeds query, fetches top-6 chunks, stores both documents AND metadata in state
results = retriever.invoke(query)
doc_metadata = [{"policy_title": d.metadata["policy_title"], 
                 "filename": d.metadata["filename"]} for d in results]
return {"documents": results, "doc_metadata": doc_metadata}

grade_documents_node

# Plain-text yes/no grading — avoids structured output failures on small models
prompt = f"Question: {query}\nDocuments: {content}\nAre these documents relevant? Answer yes or no."
response = llm.invoke(prompt)
next_action = "generate" if "yes" in response.content.lower() else "no_answer"
return {"next_action": next_action}

generate_node

# Tags each chunk with [Source: Policy | File:] so the LLM can always cite correctly
context_parts = []
for doc, meta in zip(documents, doc_metadata):
    header = f"[Source: {meta['policy_title']} | File: {meta['filename']}]"
    context_parts.append(f"{header}\n{doc.page_content}")
formatted_context = "\n\n---\n\n".join(context_parts)

no_answer_node

# Honest rejection — no hallucination, no retry loop
message = "I could not find relevant information in the policy documents to answer your question."
return {"messages": [AIMessage(content=message)]}

Phase 4: Conditional Routing

graph.add_conditional_edges(
    "grade_documents",
    lambda state: state["next_action"],   # reads grader's decision from state
    {"generate": "generate", "no_answer": "no_answer"}
)

The router reads directly from AgentState["next_action"] — set by the grader — and routes to the correct node. This is the core of the agent's decision-making.

Phase 5: Streamlit UI with Observability

The UI uses graph.stream(input, config, stream_mode="updates") to receive each node's output as it executes. This enables:

Real-time agent step display via st.status — users see retrieve → grade_documents → generate as it happens
Inference timing — time.time() wraps the entire stream call; elapsed time is displayed in the status bar
Session metrics sidebar — cumulative query count, running average inference time, last query duration, last agent path

Tech Stack

Layer	Technology	Version	Role
Agent Orchestration	LangGraph	Latest	Stateful graph with conditional edges, loops, and memory
LLM Inference	Groq — Llama 3.1 8B Instant	API	Ultra-fast inference for both grading (~100ms) and generation
Embeddings	`sentence-transformers/all-MiniLM-L6-v2`	HuggingFace	Local embedding model — no API cost, no rate limits
Vector Store	ChromaDB	`langchain-chroma`	Persistent local vector database with cosine similarity
Document Loading	`UnstructuredMarkdownLoader`	LangChain	Parses `.md` policy files preserving structure
Text Splitting	`RecursiveCharacterTextSplitter`	LangChain	Overlapping chunk strategy for context continuity
Chat UI	Streamlit	Latest	Streaming chat with real-time agent status and metrics
Conversation Memory	LangGraph `MemorySaver`	Built-in	Per-session state persistence via `thread_id`
MCP Protocol	FastMCP	Latest	Exposes RAG as a callable MCP tool for AI agents
Language	Python	3.11	Core implementation

MCP Server Integration

Beyond the Streamlit UI, this project exposes the entire RAG agent as a Model Context Protocol (MCP) server — making it usable by any MCP-compatible AI client (Claude Desktop, VS Code Copilot Chat, Cursor, etc.).

What is MCP?

MCP (Model Context Protocol) is an open standard that lets AI assistants call external tools through a structured interface. By wrapping the RAG pipeline as an MCP tool, any AI agent can query these policy documents as a first-class tool call.

What the MCP Server Exposes

Type	Name	Description
Tool	`search_security_policies`	Takes a natural language query, runs the full agentic RAG graph, returns the answer as a string
Resource	`policies://list`	Returns the list of all indexed policy document names

Server Implementation Highlights

sys.path injection — Server file lives in mcp-server/ subdirectory; project root is added to path so src.* imports resolve correctly
Logging suppression — transformers, sentence_transformers, and chromadb loggers set to ERROR level to prevent noise in MCP Inspector stderr
String return type — FastMCP requires tools to return primitive types; the AIMessage.content string is extracted before return to pass Pydantic validation
Correct input format — Wraps query in HumanMessage before invoking the graph, matching the state schema exactly

# mcp-server/mcp_server.py — core tool
@mcp.tool()
def search_security_policies(query: str) -> str:
    """Search GitLab security and technology policy documents using an agentic RAG pipeline."""
    agent = build_graph()
    config = {"configurable": {"thread_id": "mcp-session"}}
    result = agent.invoke({"messages": [HumanMessage(content=query)]}, config=config)
    for msg in reversed(result["messages"]):
        if isinstance(msg, AIMessage):
            return msg.content
    return "No answer could be generated."

Running the MCP Server

cd mcp-server
python mcp_server.py

Validating with MCP Inspector

npx @modelcontextprotocol/inspector@0.14.3 python mcp_server.py

Project Structure

DocRAGSearch/
│
├── app.py                          # Streamlit chat UI — streaming, agent status, session metrics
├── save_graph.py                   # One-time utility: generates agent_graph.png from compiled graph
├── agent_graph.png                 # Auto-generated LangGraph execution graph diagram
├── pyproject.toml                  # Project dependencies (uv/pip compatible)
├── .env                            # API keys — NOT committed
│
├── assets/                         # Screenshots for README showcase
│   ├── SS1.png                     # Accurate policy answer with agent path trace
│   ├── SS2.png                     # Graceful out-of-scope rejection
│   ├── SS3.png                     # Session metrics sidebar
│   └── SS4.png                     # MCP Inspector tool validation
│
├── data/
│   └── security-and-technology-policies/
│       ├── access-management-policy.md
│       ├── audit-logging-policy.md
│       ├── change-management-policy.md
│       ├── penetration-testing-policy.md
│       ├── software-development-lifecycle-policy.md
│       └── security-and-technology-policies-management.md
│
├── db/
│   └── chroma_db/                  # Persisted ChromaDB vector store (not committed)
│
├── mcp-server/
│   ├── mcp_server.py               # FastMCP server — exposes RAG as MCP tool + resource
│   └── test_client.py              # Direct Python test client (no MCP protocol needed)
│
└── src/
    ├── config.py                   # Centralized config: env vars, model names, chunk parameters
    ├── state.py                    # LangGraph AgentState TypedDict
    ├── graph.py                    # All 4 nodes + conditional routing + graph compilation
    ├── prompts.py                  # RAG system prompt (hardened against hedging)
    ├── data_ingestion.py           # Markdown loader → text cleaner → chunker → metadata tagger
    ├── vector_store.py             # ChromaDB create + retriever factory
    ├── schema.py                   # Pydantic schemas
    └── utils.py                    # Utility helpers

Setup & Run

1. Clone the repository

git clone https://github.com/BrijeshRakhasiya/Agentic-Doc-Search-RAG.git
cd Agentic-Doc-Search-RAG

2. Create and activate a virtual environment

python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS / Linux
source .venv/bin/activate

3. Install dependencies

pip install -e .

4. Set up environment variables

Create a .env file in the project root:

GROQ_API_KEY=your_groq_api_key_here

Get a free API key at console.groq.com

5. Build the vector store (run once)

python -c "
from src.data_ingestion import DataIngestor
from src.vector_store import VectorStoreManager
ingestor = DataIngestor()
chunks = ingestor.load_and_split()
VectorStoreManager().create_vector_store(chunks)
"

6. Launch the Streamlit app

streamlit run app.py

7. (Optional) Launch the MCP server

cd mcp-server
python mcp_server.py

8. (Optional) Regenerate the agent graph diagram

python save_graph.py

Sample Q&A Results

Question	Agent Path	Behavior
"Who is responsible for implementing the Audit Logging Policy?"	`retrieve → grade → generate`	Correctly identifies Security Team as responsible owner
"What system tiers are in scope for Change Management?"	`retrieve → grade → generate`	Returns Tier 1, 2, 3 in-scope; Tier 4 explicitly excluded
"What happens after a penetration test finds critical vulnerabilities?"	`retrieve → grade → generate`	Cites CM-3/pen testing policy: assess severity, Vuln Management Standard, retest requirement
"How often must penetration tests be conducted?"	`retrieve → grade → generate`	Answers: at minimum annually, and after significant system changes
"How do I reset my GitLab password?"	`retrieve → grade → no_answer`	Retrieves docs, grades as NOT relevant, routes to rejection — zero hallucination
"What is the company's vacation policy?"	`retrieve → grade → no_answer`	Completely out-of-scope query handled cleanly — polite, accurate rejection

Engineering Highlights

What Makes This Different from Standard RAG

Standard RAG Pipeline	This System
Always generates an answer	Only generates when docs pass relevance grading
Silently hallucinates on irrelevant retrieval	Routes to explicit `no_answer` node
No visibility into what was retrieved	Full agent path in UI (`retrieve → grade → generate`)
No citation tracking	Every chunk tagged `[Source: Policy \| File:]` before generation
Single-use query	Multi-turn chat via `MemorySaver` + `thread_id`
Linear, no self-correction	Conditional routing — the agent decides what to do
UI only	Also exposed as MCP tool for AI agent integration

Specific Engineering Decisions That Prevented Bugs

Plain-text grading over structured output — Llama 3.1 8B failed structured output / function-calling consistently. Switching to plain yes/no prompting eliminated BadRequestError entirely.
override=True in load_dotenv — Without this flag, dotenv skips variables already set in the environment. The API key appeared to load but was ignored, causing AuthenticationError at runtime.
Explicit next_action key in grader return — The grader initially returned {"generate": "yes"} (wrong key). The router was reading state["next_action"] (correct key). This mismatch caused a KeyError on every query. Fixed by aligning the return key to match the state field name.
Similarity search over MMR — MMR's diversity-first heuristic was retrieving chunks from unrelated policies when the query matched multiple documents. Switching to pure similarity search ensured the most relevant document's chunks dominated the top-6 results.
Metadata stored separately from documents — LangChain Document objects carry .metadata but the generate node needed a clean, structured list of {policy_title, filename} pairs indexed to match the documents list. Storing doc_metadata as a separate state field made this clean and reliable.

License

This project is licensed under the MIT License — see LICENSE for details.

MIT License — Copyright (c) 2026 Brijesh Rakhasiya

Built with care by Brijesh Rakhasiya
GitHub Profile • Repository

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
assets		assets
data/security-and-technology-policies		data/security-and-technology-policies
mcp-server		mcp-server
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
agent_graph.png		agent_graph.png
app.py		app.py
pyproject.toml		pyproject.toml
save_graph.py		save_graph.py
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

🦊 Agentic Doc Search RAG

Table of Contents

Live Demo — Screenshots

1. Accurate Policy Answer with Agent Path Trace

2. Graceful Out-of-Scope Rejection — Zero Hallucination

3. MCP Server Tool Validation — Inspector Confirmed

Why This Project Exists — The Problem

Failure Mode 1: Irrelevant Retrieval with Confident Hallucination

Failure Mode 2: No Out-of-Scope Handling

Failure Mode 3: No Transparency or Observability

The Solution: Agentic Self-Corrective RAG

Agent Architecture

Visual Graph (Auto-generated by LangGraph)

Core Design Decisions

Implementation Walkthrough

Phase 1: Data Ingestion Pipeline

Phase 2: Agent State

Phase 3: The Four Nodes

Phase 4: Conditional Routing

Phase 5: Streamlit UI with Observability

Tech Stack

MCP Server Integration

What is MCP?

What the MCP Server Exposes

Server Implementation Highlights

Running the MCP Server

Validating with MCP Inspector

Project Structure

Setup & Run

1. Clone the repository

2. Create and activate a virtual environment

3. Install dependencies

4. Set up environment variables

5. Build the vector store (run once)

6. Launch the Streamlit app

7. (Optional) Launch the MCP server

8. (Optional) Regenerate the agent graph diagram

Sample Q&A Results

Engineering Highlights

What Makes This Different from Standard RAG

Specific Engineering Decisions That Prevented Bugs

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages