
Write-Up: Document RAG with Grounded Citations

Setup

pip install .

Usage

# 1. Parse and chunk the document
python src/main.py ingest [-o OUTPUT_FILE]

# 2. Build the vector + BM25 index
python src/main.py index

# 3. Retrieve relevant chunks for a query
python src/main.py retrieve -q "your query"

# 4. Generate a grounded answer with citations
python src/main.py generate -q "your query" [-p]

# 5. Interactive multi-turn conversation
python src/main.py conversation [-p]

# 6. Generate a report from a template
python src/main.py template -t template.json [-o OUTPUT_FILE]

Part 1: Document Parsing and Chunking

Noise Filtering

Filtering happens in two passes. First, chunks with explicit noise types (marginalia, logo) are dropped outright. Second, a frequency-based heuristic catches noise that the upstream parser mislabeled: text chunks whose content also appears in a marginalia chunk are removed, and any text appearing on more pages than a configurable 50% threshold (computed as max(page_count * 0.5, 1)) is treated as a repeated header/footer.
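The two passes can be sketched as follows. This is a minimal sketch: the chunk shape (dicts with type/text keys) and the one-occurrence-per-page assumption in the frequency count are illustrative, not the pipeline's actual data model.

```python
from collections import Counter

def filter_noise(chunks, page_count, freq_threshold=0.5):
    """Two-pass noise filter: drop explicit noise types, then text that
    duplicates marginalia or repeats across too many pages."""
    # Pass 1: drop chunks the parser already labeled as noise.
    kept = [c for c in chunks if c["type"] not in ("marginalia", "logo")]
    marginalia_texts = {c["text"] for c in chunks if c["type"] == "marginalia"}

    # Pass 2: frequency heuristic. Assuming each repeated header/footer
    # occurs once per page, the occurrence count approximates a page count.
    occurrences = Counter(c["text"] for c in kept)
    max_pages = max(page_count * freq_threshold, 1)

    return [
        c for c in kept
        if c["text"] not in marginalia_texts
        and occurrences[c["text"]] <= max_pages
    ]
```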

Chunking Strategy

Chunks are split using a recursive hierarchical strategy with three levels of granularity: paragraph boundaries (\n\n), then newlines, then sentences (via NLTK sent_tokenize). If a piece still exceeds the 500-token limit (configurable) after sentence splitting, a final fallback splits at the token level using tiktoken. The recursion means a large paragraph first tries to stay intact, and only gets sentence-split if it exceeds the budget, so embedding quality is preserved for paragraphs that fit within limits.
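The recursive descent can be sketched as below. This is not the actual implementation: a whitespace token count and a naive regex sentence splitter stand in for tiktoken and NLTK's sent_tokenize, and a small budget is used so the behavior is easy to trace.

```python
import re

MAX_TOKENS = 20  # stand-in for the configurable 500-token limit

def count_tokens(text):
    # Whitespace proxy; the real pipeline uses tiktoken.
    return len(text.split())

def split_sentences(text):
    # Naive stand-in for NLTK's sent_tokenize.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def recursive_split(text, level=0):
    """Try paragraph, then line, then sentence granularity; a piece that
    already fits stays intact, and a hard token split is the last resort."""
    if count_tokens(text) <= MAX_TOKENS:
        return [text]
    separators = ["\n\n", "\n"]
    if level < len(separators):
        parts = [p for p in text.split(separators[level]) if p.strip()]
    elif level == len(separators):
        parts = split_sentences(text)
    else:
        # Final fallback: hard split at the token level (tiktoken in the real code).
        tokens = text.split()
        return [" ".join(tokens[i:i + MAX_TOKENS])
                for i in range(0, len(tokens), MAX_TOKENS)]
    if len(parts) <= 1:
        return recursive_split(text, level + 1)
    out = []
    for p in parts:
        out.extend(recursive_split(p, level + 1))
    return out
```

Note how a paragraph under the budget is returned whole before any separator is tried, which is what preserves embedding quality for paragraphs that fit.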

Adjacent split chunks get a 50-token overlap (configurable) drawn from the tail sentences of the previous chunk. This overlap is sentence-aligned (not token-sliced) to avoid mid-word artifacts. The overlap exists so that retrieval can find information that falls near a split boundary, at the cost of slight index bloat. Only text chunks are split; table rows and figures are small enough to pass through unchanged.

Limitation: The splitter operates on individual chunks independently; it does not merge small adjacent chunks into larger ones. A sequence of short paragraphs (e.g., a bulleted list where each bullet is its own chunk) produces many undersized chunks rather than being consolidated. This increases index size and can fragment context that belongs together. The trade-off was accepted to preserve the chunk granularity produced by the upstream data pipeline, relying on neighbor expansion during retrieval to restore context. Additionally, when a chunk is split, each piece still references the original chunk's grounding, so grounding precision is lost at the sub-chunk level.

Table Handling

Tables are decomposed into one chunk per row. Each row is linearized as "Header1: value1 | Header2: value2 | ...", which gives embedding models the column context needed to match queries like "IC50 values for BD1-selective inhibitors" against a row containing "IC50 (nM): 3.4". Each cell's HTML id attribute is resolved against the global grounding dictionary to get cell-level bounding boxes, so a retrieved row about BMS-986158 carries bboxes for just that row's cells, not the entire table.

When a preceding text chunk matches the pattern "Table N..." (detected by is_table_title), it is consumed and prepended to every row of the following table (e.g., "Table 1. Clinical trials - Drug: BMS-986158 | Phase: I | ..."). This helps retrieval by injecting the table's topic into each row's content.
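Row linearization, including the optional table-title prefix, can be sketched as follows; the warning is a print stand-in for the real logging, and the function shape is illustrative rather than the actual API.

```python
def linearize_row(headers, cells, table_title=None):
    """Linearize one table row as 'Header1: value1 | Header2: value2 | ...',
    optionally prefixed with a consumed table title."""
    if len(headers) != len(cells):
        # Merged cells or row spans: warn, then zip what we can.
        print(f"warning: {len(headers)} headers vs {len(cells)} cells")
    body = " | ".join(f"{h}: {v}" for h, v in zip(headers, cells))
    return f"{table_title} - {body}" if table_title else body
```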

Limitation: Row-per-chunk indexing loses cross-row relationships. A query like "which drug has the highest IC50" requires comparing multiple rows, but each row is retrieved independently. The neighbor expansion mitigates this, but the system cannot reason across the full table during retrieval, only during generation when multiple rows appear in context.

Limitation: If the header and data row have different cell counts (e.g., merged cells in HTML), the linearization logs a warning but still zips what it can. Merged cells or multi-row spans are not explicitly handled.

Figure Handling

Figure descriptions (the <::...::> markers) are indexed as chunks with type figure, stripped of the marker syntax. During retrieval, figure chunks receive a 30% score penalty (configurable); during generation, their content is wrapped in <AI_GENERATED> tags so the LLM knows not to treat them as author-written text.

The rationale: figure descriptions sometimes contain useful factual content (compound names, IC50 values extracted from diagrams), so excluding them entirely would lose retrievable information. But they are AI-generated paraphrases, not author text, so they should not dominate retrieval results or be cited with the same confidence as body text. The penalty and tagging strike a middle ground. A more aggressive approach would be to extract only named entities/values from figure descriptions and discard the rest, but that adds complexity for marginal gain given the penalty already demotes them.

Section Tracking

Section headings are detected via regex: numbered patterns (^\d+(\.\d+)*\.?\s+\S), ALL-CAPS lines, and markdown # headings. The current section is carried as metadata through all subsequent chunks until a new heading is found. This metadata serves two purposes: it appears in the output for context, and it is used by the neighbor expansion to avoid crossing section boundaries when fetching adjacent chunks.
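The detection logic might look like the following. The numbered-heading regex is the one quoted above; the ALL-CAPS and markdown patterns are assumptions about the exact expressions used.

```python
import re

HEADING_PATTERNS = [
    re.compile(r"^\d+(\.\d+)*\.?\s+\S"),  # numbered: "2.1 Methods"
    re.compile(r"^[A-Z][A-Z\s\d]+$"),     # ALL-CAPS lines (assumed pattern)
    re.compile(r"^#{1,6}\s+\S"),          # markdown headings
]

def is_heading(line):
    """True if the stripped line matches any heading pattern."""
    line = line.strip()
    return bool(line) and any(p.match(line) for p in HEADING_PATTERNS)
```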

Part 2: Indexing and Retrieval

Embedding Model: BAAI/bge-small-en-v1.5

This is a 384-dimensional sentence-transformer model. The "small" variant was chosen for fast inference during development and indexing. For a 14-page document with ~200 chunks, embedding latency is not a bottleneck, but a larger model (bge-large, 1024-dim) would likely improve retrieval quality for nuanced queries. The BGE family was chosen because it is specifically trained for retrieval tasks (with instruction-tuned query prefixes), unlike general-purpose sentence encoders.

Vector Store: ChromaDB

ChromaDB was chosen for simplicity: it runs in-process with persistent storage, requires no external service, and handles embedding + metadata storage in a single API. For a single-document prototype this is sufficient.

ChromaDB uses L2 (Euclidean) distance by default. Scores are converted to a [0, 1] similarity via 1 / (1 + L2_distance). This is a monotonic transformation that preserves ranking order. An alternative would be cosine similarity (which ChromaDB supports via collection configuration), but since BGE embeddings are normalized, L2 and cosine rankings are equivalent.
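The conversion is a one-liner; for unit-norm embeddings, ||a - b||^2 = 2 - 2*cos(a, b), which is why L2 and cosine produce the same ranking here.

```python
def l2_to_similarity(d):
    """Monotonic map from L2 distance to (0, 1]; identical vectors score 1.0."""
    return 1.0 / (1.0 + d)
```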

BM25 Keyword Index

BM25Okapi provides exact keyword matching to complement the embedding search. Tokenization lowercases, splits on non-alphanumeric characters (preserving intra-word hyphens like "BMS-986158"), and adds Porter-stemmed variants alongside originals. Including both the original token and its stem means "inhibitors" matches both "inhibitors" and "inhibitor" without losing the ability to exact-match on the original form.

BM25 scores are normalized by dividing by the max score in the result set. This maps them to [0, 1] to be commensurable with the vector similarity scores for fusion.
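The tokenizer and normalization can be sketched as below; a trivial suffix-stripping lambda stands in for NLTK's PorterStemmer in the example, and the hyphen-preserving regex is an assumption consistent with the described behavior.

```python
import re

def tokenize(text, stem=lambda t: t):
    """Lowercase, split on non-alphanumerics while keeping intra-word
    hyphens (so 'BMS-986158' survives), and add stemmed variants
    alongside the originals. `stem` would be PorterStemmer().stem."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    out = []
    for t in tokens:
        out.append(t)
        s = stem(t)
        if s != t:
            out.append(s)  # stem added next to the original, not replacing it
    return out

def normalize_scores(scores):
    """Max-normalize BM25 scores to [0, 1] for fusion with vector scores."""
    m = max(scores) if scores and max(scores) > 0 else 1.0
    return [s / m for s in scores]
```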

Hybrid Retrieval and Fusion

Both vector and BM25 retrieve top_k * 3 candidates each (30 by default). These are fused via convex combination with alpha = 0.5 (configurable):

score = 0.5 * vector_score + 0.5 * bm25_score
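The fusion, together with the figure penalty from Part 1, can be sketched as follows. A chunk missing from one result list contributes 0 for that component, and the 30% penalty is interpreted here as multiplying the fused score by 0.7 — both are assumptions about details the text leaves open.

```python
def fuse(vector_scores, bm25_scores, chunk_types, alpha=0.5, figure_penalty=0.3):
    """Convex combination of normalized vector and BM25 scores per chunk id,
    returning ids sorted by fused score (highest first)."""
    ids = set(vector_scores) | set(bm25_scores)
    fused = {}
    for cid in ids:
        score = (alpha * vector_scores.get(cid, 0.0)
                 + (1 - alpha) * bm25_scores.get(cid, 0.0))
        if chunk_types.get(cid) == "figure":
            score *= 1 - figure_penalty  # demote AI-generated figure text
        fused[cid] = score
    return sorted(fused, key=fused.get, reverse=True)
```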

After fusion, a cross-encoder reranker (BAAI/bge-reranker-base) rescores the candidates. The cross-encoder sees the full query-document pair jointly (unlike bi-encoder embeddings which encode them independently), so it can capture fine-grained relevance. This is the most expensive step in retrieval but only runs on the already-filtered candidate set.

Neighbor Expansion

After selecting the final top-K chunks, each chunk's immediate neighbors (1 before, 1 after by default) are fetched from a JSON-serialized raw chunk store. Neighbors are only included if they share the same section, preventing context pollution across section boundaries. This addresses the limitation that individual chunks may lack surrounding context: an answer about "BRD4 phosphorylation" might need the preceding sentence that sets up the mechanism.

The neighbor chunks are passed to the LLM as separate numbered sources (not concatenated into the main chunk), so each can be independently cited. This preserves grounding precision.
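The section-aware expansion can be sketched as below, assuming the raw chunk store is an ordered list of dicts with a section key (an illustrative shape, not the actual store format).

```python
def expand_neighbors(chunk_idx, chunks, window=1):
    """Fetch up to `window` chunks before and after chunk_idx, skipping
    any neighbor that belongs to a different section."""
    section = chunks[chunk_idx]["section"]
    out = []
    for i in range(chunk_idx - window, chunk_idx + window + 1):
        if i == chunk_idx or i < 0 or i >= len(chunks):
            continue
        if chunks[i]["section"] == section:
            out.append(chunks[i])
    return out
```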

Retrieval Cache

The cache stores (query_text, query_embedding, results) tuples in memory. On a new query, it computes cosine similarity against all cached query embeddings:

  • Similarity >= 0.9: full reuse (return cached results directly)
  • Similarity >= 0.7: partial reuse (retrieve fresh top_k/2 results, merge with cached, deduplicate)
  • Below 0.7: cache miss

This is designed for multi-turn conversation where follow-up queries often overlap with prior turns. The 0.9 threshold for full reuse is conservative, but a false positive (returning stale results for a genuinely different query) would silently degrade answer quality. The partial reuse path at 0.7 hedges by fetching some fresh results while still benefiting from cached ones.
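The cache decision can be sketched as follows; the tuple shape and return convention are illustrative, while the thresholds match those above.

```python
import math

FULL_REUSE, PARTIAL_REUSE = 0.9, 0.7

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cache_decision(query_emb, cached):
    """Compare a query embedding against cached (text, embedding, results)
    tuples and return ('full' | 'partial' | 'miss', best_entry)."""
    best, best_sim = None, -1.0
    for entry in cached:
        sim = cosine(query_emb, entry[1])
        if sim > best_sim:
            best, best_sim = entry, sim
    if best_sim >= FULL_REUSE:
        return "full", best
    if best_sim >= PARTIAL_REUSE:
        return "partial", best
    return "miss", None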

Limitation: Full reuse above the 0.9 similarity threshold relies purely on embedding similarity, so it can miss exact keyword matches that BM25 would surface for the new query. This can lead to stale retrievals; incorporating additional signals (e.g., lexical overlap) into the cache-similarity check would mitigate this.

Part 3: Grounded Answer Generation

LLM and Prompting

Generation uses Claude Sonnet 4.6 at temperature 0.0 for deterministic outputs. The system prompt instructs the model to cite every claim using <<SOURCE N>> markers and to refuse answering if context is insufficient. The <<SOURCE N>> format was chosen over [N] in the prompt to avoid ambiguity with markdown (the markers are post-processed into [N] in the final output).

Context is formatted as numbered source blocks with content and page number. A token budget (configurable) caps how much context is included; sources are added group-by-group (main chunk + neighbors) until the budget is exhausted. The first group is always included even if it exceeds the budget, ensuring the model always has at least some context.

Citation Parsing

After the LLM responds, all <<SOURCE N>> markers are extracted via regex, deduplicated, and validated against the flat chunk list. Invalid references (where N exceeds the number of provided sources) are logged as hallucinated citations and dropped; they do not appear in the final output. Valid references are mapped to their corresponding GroundedChunk objects, carrying through the original bounding boxes.
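A minimal sketch of the extraction, validation, and marker rewriting; the function shape and 1-based source numbering are assumptions consistent with the description above.

```python
import re

def parse_citations(answer, num_sources):
    """Extract <<SOURCE N>> markers, deduplicate preserving order, split
    valid from hallucinated references, and rewrite the final text:
    valid markers become [N], hallucinated ones are dropped."""
    refs = [int(n) for n in re.findall(r"<<SOURCE (\d+)>>", answer)]
    seen, valid, hallucinated = set(), [], []
    for n in refs:
        if n in seen:
            continue
        seen.add(n)
        (valid if 1 <= n <= num_sources else hallucinated).append(n)

    def repl(m):
        n = int(m.group(1))
        return f"[{n}]" if 1 <= n <= num_sources else ""

    cleaned = re.sub(r"<<SOURCE (\d+)>>", repl, answer)
    return cleaned, valid, hallucinated
```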

Part 4: Multi-Turn Conversation and Agentic Retrieval

Query Rewriting

Follow-up questions are rewritten into standalone queries using conversation history. The rewriter sees the full history formatted as User: ... / Assistant: ... pairs, and is instructed to resolve pronouns and implicit references. For example, "What about its resistance profile?" after discussing BMS-986158 becomes a query explicitly mentioning BMS-986158 and resistance.

A minimum-length guard (3 characters) catches degenerate rewrites. The rewriting step adds one LLM call per turn.

Query Decomposition

Complex queries are decomposed into 1-3 sub-queries using structured tool output.

Gap Detection

After the initial retrieval round, the system asks the LLM whether critical information is missing given the query and retrieved context. The prompt is explicit: "Only suggest queries if critical information is completely missing." This conservative framing avoids unnecessary additional retrieval rounds for minor gaps. Gap queries are capped at 2 per iteration.

The gap context is truncated at 2000 tokens (configurable) to keep the gap-check LLM call fast and cheap.

Agentic Loop

The full agentic retrieval flow is: decompose the query into sub-queries, retrieve for each, check for gaps, retrieve for gap queries, then rerank all accumulated results together. max_iterations (configurable) controls how many times this loop is repeated. Chunk deduplication uses a seen_ids set to avoid re-retrieving the same chunk across sub-queries.

After all iterations, the combined results are reranked by the cross-encoder against the original query. This is important because sub-query retrieval may have scored chunks highly for a sub-query that is only tangentially relevant to the main question.

Part 5: Template-Driven Generation

Section-to-Query Mapping

For each template section, the retrieval query is constructed as "{heading}: {guidance}". This concatenation gives the embedding model both the section topic and the specific content requirements. When agentic retrieval is enabled, this query is further decomposed into sub-queries.

The whole template is provided as context to the LLM, so that generation of each section is informed by the document's overall structure rather than a blinkered view of one section at a time.

Data Gap Detection

After generating each section's content, a separate LLM call compares the guidance requirements against what was actually generated. This is fundamentally a recall check: did the section cover what the template asked for? Gaps are reported as free-text strings (e.g., "No in vivo pharmacokinetic data available for BRD4 degraders").

The gap detection is post-hoc: it does not trigger additional retrieval. Integrating it with the agentic loop (retrieve more if gaps are found) would improve coverage but significantly increase latency and cost per section.

Hallucination Guard

The section generation prompt includes explicit instructions to mark missing data as [INSUFFICIENT DATA] rather than fabricating it. If no chunks are retrieved for a section at all, the entire section is pre-filled with an insufficient data message before the LLM is even called. This two-layer guard (no-context shortcut + prompt instruction) reduces the risk of hallucinated regulatory content, which in a real IND submission context would be a serious compliance issue.

Architectural Decisions

No Framework

The system is built without LangChain or LlamaIndex. This was a deliberate choice: the retrieval pipeline has specific requirements (hybrid fusion, cell-level grounding preservation, section-aware neighbor expansion) that would require fighting against framework abstractions rather than benefiting from them. The total codebase is small enough that the coordination overhead of a framework is not justified.

Structured LLM Output via Tool Use

Query decomposition, gap detection, and data gap analysis all use Anthropic's tool use API with JSON schemas to get structured responses. This is more reliable than prompting for JSON and parsing it: the API guarantees valid JSON matching the schema, eliminating a class of parsing failures.

Token Budgeting

Every context-building step has an explicit token budget tracked via tiktoken. The tokenizer uses cl100k_base, which is a reasonable approximation for Claude's tokenization. Exact token counts will differ slightly from Claude's actual tokenizer, but the budget is set conservatively enough (6000 tokens for context) that small discrepancies do not matter.
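The group-wise packing can be sketched as below, with a whitespace token count standing in for tiktoken's cl100k_base; the group shape (a list of source texts per main-chunk-plus-neighbors group) is illustrative.

```python
def pack_context(groups, budget=6000, count=lambda t: len(t.split())):
    """Add source groups in order until the token budget is exhausted;
    the first group is always included even if it alone exceeds the budget."""
    packed, used = [], 0
    for i, group in enumerate(groups):
        cost = sum(count(text) for text in group)
        if i > 0 and used + cost > budget:
            break  # budget exhausted; later groups are dropped
        packed.append(group)
        used += cost
    return packed
```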

Configuration

All tunable parameters (chunk size, overlap, alpha, penalties, thresholds, model names, feature flags) are in a single config.toml file. This makes it easy to experiment with different settings without modifying code. The agentic features (query decomposition, gap checking) can each be independently toggled off, which is useful for comparing single-shot vs. agentic retrieval quality.
