Grounded retrieval-augmented generation over the medical literature — with a citation-faithfulness eval built in.
Most RAG demos stop at "it retrieved something and wrote an answer." LitRAG goes one step further: it checks whether the generated answer is actually supported by the retrieved sources, and flags hallucinated or unsupported claims. That groundedness layer — not the pipeline — is the point.
Built on LangChain (orchestration) + Hugging Face sentence-transformers (embeddings) + FAISS (vector store), so it runs locally with no managed vector-DB key required.
Status: built. Corpus = 15 real semaglutide abstracts (see
data/). Retrieval runs locally and is verified; generation + faithfulness judge need an LLM key (see Quickstart).
Two things at once:
- A faithful RAG reference. A small, readable pipeline that retrieves from PubMed abstracts and answers questions with citations, then verifies those citations hold up.
- An honest framework comparison. The pipeline is implemented in LangChain; the README documents how the same retrieval would look in LlamaIndex, and where each framework's abstraction helps vs. gets in the way. (See Framework notes.)
The groundedness eval reuses the citation-faithfulness approach from a separate cookbook notebook: locate the cited quote in the source deterministically, then use an LLM-as-judge to grade whether the source supports the claim (supports / partial / contradicts / not-found).
PubMed abstracts ──▶ chunk ──▶ HF sentence-transformers embeddings ──▶ FAISS index
│
question ──▶ retrieve top-k ──────────────┘
│
▼
LangChain RAG chain (Claude / OpenAI)
│
▼
answer + cited passages
│
▼
citation-faithfulness eval ──▶ grounded? / flagged claims
| File | Role |
|---|---|
ingest.py |
Load the abstract corpus, chunk into passages with source metadata |
index.py |
Build/load the FAISS index from HF sentence-transformers embeddings |
rag.py |
LangChain retrieval + generation chain; returns answer with cited passages |
faithfulness.py |
Citation-faithfulness eval — locate quote in source, LLM-judge support level |
demo.py |
End-to-end run: ingest → index → ask → answer → grade |
data/ |
Sample PubMed abstracts (static sample so the repo runs key-free for retrieval) |
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # add ANTHROPIC_API_KEY (or OPENAI_API_KEY) for the generation + judge steps
python demo.pyEmbedding + retrieval run fully local (HF + FAISS); only generation and the faithfulness judge call an LLM API.
Implement in this order — each step is independently runnable:
ingest.py— loaddata/sample_abstracts.*, chunk to ~512-token passages, attach{pmid, title, source}metadata. (Optional: a--from-pubcrawlpath that pulls fresh abstracts via the PubCrawl MCP server instead of the static sample.)index.py— embed passages withlangchain_huggingface.HuggingFaceEmbeddings(model e.g.sentence-transformers/all-MiniLM-L6-v2), build alangchain_community.vectorstores.FAISSindex, save/load from disk.rag.py— a LangChain retrieval chain (chat model=langchain_anthropic.ChatAnthropicwithclaude-sonnet-4-6, orlangchain_openai), prompted to answer and quote the supporting passage per claim. Return structured{answer, claims:[{text, cited_quote, source}]}.faithfulness.py— for each claim: locatecited_quotein the retrieved source (normalize +rapidfuzz.partial_ratio); if found, LLM-judge support level (supports/partial/contradicts/not-found). Short-circuit to "hallucinated quote" if the quote isn't locatable. (Port the logic from the cookbookcitation_faithfulness.py.)demo.py— wire it end-to-end on 2–3 example questions; print answer + per-claim grounded verdict.- Tests — a couple of unit tests on the quote-locator (exact hit, fuzzy hit, fabricated quote → not found).
- Fill in Framework notes from the actual build experience — be specific about where LangChain's abstraction earned its keep and where it cost a layer of indirection.
Keep it small. A reviewer should be able to read the whole thing in ten minutes.
Written from the build, not the docs.
What LangChain bought.
Document+ metadata as the universal currency.ingest.pyemitsDocument(page_content, metadata={pmid, title, source}); FAISS embeds it, the retriever returns it, and the{pmid, title, source}rides through embedding and retrieval untouched. The faithfulness eval needs exactly that provenance, and the framework carried it end-to-end for free — no parallel bookkeeping of "which text came from which abstract."- Swappable embedder + LLM.
HuggingFaceEmbeddingsandChatAnthropicare drop-in; switching the judge/generator to OpenAI is a one-line import change. The local embedder and the API generator sit behind the same interfaces. with_structured_output(PydanticModel). This is the biggest win for this pipeline. Structured per-claim citations ({answer, claims:[{text, cited_quote, source}]}) are the whole point, and I got validated objects back without hand-writing a tool schema or a parser — just a Pydantic model.- FAISS persistence.
from_documents/save_local/load_localgave embed-once, reuse-after for nothing (seeindex.get_or_build_index).
Where it added indirection.
- LCEL composition is clean until you debug it.
{"context": retriever | format, "question": passthrough} | prompt | structuredreads well, but the dict→Runnable coercion is opaque: drop a non-Runnableinto that dict and you get a crypticExpected a Runnable, callable or dictfrom deep incoerce_to_runnable, far from the line you wrote. (Hit this verbatim while wiring a test stub.) - The judge left the framework on purpose.
faithfulness.grade_supportuses the raw Anthropic SDK, not LangChain — because it wants forced tool use andcache_controlon the source-passage content block (so checking many claims against one long abstract reuses the cached passage).with_structured_outputabstracts the tool away, but it also abstracts away per-block cache control — so the one place I most wanted provider-specific control is the place I dropped out of the abstraction. That's the honest seam. HuggingFaceEmbeddingsis a thin wrapper. For a corpus this small you could callsentence-transformers+faissdirectly in ~20 lines and lose almost nothing.
I built the pipeline in LangChain; this section is how I'd map it onto LlamaIndex and the trade-offs I'd expect — not a second, shipped implementation. Kept deliberately separate from the build notes above, which are from the build.
Retrieve-then-generate collapses to roughly:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
index = VectorStoreIndex.from_documents(SimpleDirectoryReader("data/").load_data())
print(index.as_query_engine().query("How much weight loss with semaglutide?"))Mapping LitRAG's pieces onto LlamaIndex primitives:
| LitRAG (LangChain) | LlamaIndex equivalent |
|---|---|
Document(page_content, metadata={pmid, title, source}) |
TextNode + metadata, with excluded_embed_metadata_keys to keep PMIDs out of the embedded text |
ingest.chunk (word-window splitter) |
a SentenceSplitter / TokenTextSplitter node parser |
HuggingFaceEmbeddings |
HuggingFaceEmbedding (same sentence-transformers model) |
FAISS.from_documents |
VectorStoreIndex over a FaissVectorStore |
store.as_retriever(search_kwargs={"k": 4}) |
index.as_retriever(similarity_top_k=4) |
llm.with_structured_output(StructuredAnswer) |
index.as_query_engine(output_cls=StructuredAnswer) (Pydantic program) |
hand-rolled per-claim cited_quote contract |
CitationQueryEngine — citations are first-class; the response carries source_nodes |
What it would buy. Citation is native. CitationQueryEngine numbers its sources and
hands back the source_nodes it used, so "answer with provenance" isn't something I bolt
on with a Pydantic schema — it's the default contract. For a product whose entire point
is grounded citations, that's a genuine fit advantage.
What would stay exactly the same. The faithfulness judge. faithfulness.grade_support
would still drop to the raw Anthropic SDK for forced tool use + cache_control on the
passage — LlamaIndex abstracts those away just as LangChain does. The honest seam
(framework for retrieval, SDK for the eval that is the product) is identical either
way; the framework choice only ever touches the boring 80%.
What it would cost. A second embedding/index stack to maintain, and — the real catch —
LlamaIndex's citation synthesizer re-chunks and renumbers sources its own way, which
fights the contract the eval depends on. The locator needs the model to quote a span
verbatim from a retrieved passage; a synthesizer that paraphrases into a numbered
citation breaks locate_quote before the judge ever runs. Reconciling first-class
citations with a verbatim-quote requirement is non-trivial, and it's the main reason
swapping frameworks isn't a free lunch here.
Honest status. Not built. If the LlamaIndex equivalent matters for a given purpose,
the right move is a small rag_llamaindex.py variant — code, not more prose.
The honest takeaway. For a 5-file pipeline the framework earns its place on the
glue — the Document/metadata plumbing and with_structured_output — and costs a
layer exactly where the product lives: the faithfulness judge, which I built on the
Anthropic SDK directly for forced tool use + caching. Neither framework touches the two
things that make this repo more than a demo (the per-claim cited_quote contract and the
two-stage groundedness check) — those are plain Pydantic + the SDK. So: reach for the
framework for the boring 80% (load → embed → retrieve → structure); drop to the SDK for
the 20% that is the point. If retrieval-with-citations were the whole product,
LlamaIndex would be the better-fitting default at this size.
MIT © 2026 Nick Lamb