
feat: page-aware sources + numbered RAG citations (#2106) #2353

Open
RealBhupesh wants to merge 2 commits into arc53:main from RealBhupesh:citations


@RealBhupesh

Summary

This PR adds Perplexity-style numbered citations with page-aware RAG sources.

Changes

  • Docling PDF ingestion now parses PDFs per-page and stores page in each chunk’s metadata.
  • Retrieval (ClassicRAG) forwards page to the frontend sources payload.
  • Prompt context is rendered with numbered excerpts like [1] p. N <filename> and instructions to cite using [1], [2], etc.
  • Research citations are deduplicated using source + title + page, and references show (p. N).
  • Frontend source cards display the page label (p. N) alongside each source, and existing citation clicks still jump to the correct source card.
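
The numbered-excerpt format described in the list above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `Chunk` dataclass, its field names, and `render_numbered_excerpts` are hypothetical, and the PR's own helper may differ in signature and output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical retrieved chunk; the field names are assumptions."""
    text: str
    filename: str
    page: Optional[int] = None  # None when the index has no page metadata


def render_numbered_excerpts(chunks: list[Chunk]) -> str:
    """Render each chunk as a '[n] p. N filename' header followed by its text."""
    blocks = []
    for number, chunk in enumerate(chunks, start=1):
        header = f"[{number}]"
        if chunk.page is not None:
            header += f" p. {chunk.page}"
        header += f" {chunk.filename}"
        blocks.append(f"{header}\n{chunk.text}")
    return "\n\n".join(blocks)


context = render_numbered_excerpts([
    Chunk("Install with pip.", "guide.pdf", page=3),
    Chunk("Older content.", "legacy.pdf"),
])
print(context)
```

Numbering from 1 in retrieval order keeps the [n] markers in the answer aligned with the order of the source cards, which is presumably what lets citation clicks map to the right card.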

Notes

  • Indexes built before this change carry no page metadata; re-ingest those documents to populate it.
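
One way to find which documents need re-ingesting is to scan stored chunks for a missing page value. The sketch below assumes each chunk is a dict with a `metadata` mapping that holds `source` and `page` keys. That shape is hypothetical and may not match the project's vector store schema.

```python
def sources_missing_page(chunks: list[dict]) -> list[str]:
    """Return the sorted source names that have at least one chunk without a page."""
    missing = {
        chunk.get("metadata", {}).get("source", "<unknown>")
        for chunk in chunks
        if chunk.get("metadata", {}).get("page") is None
    }
    return sorted(missing)


stored_chunks = [
    {"text": "New chunk", "metadata": {"source": "guide.pdf", "page": 3}},
    {"text": "Old chunk", "metadata": {"source": "legacy.pdf"}},
    {"text": "Old chunk 2", "metadata": {"source": "legacy.pdf", "page": None}},
]
print(sources_missing_page(stored_chunks))
```

Formats without pages would also show up in this list, so the result is a candidate list for review rather than a definitive one.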

Testing

  • Unit: python -m pytest tests/retriever/test_rag_prompt_docs.py -q
  • Manual: Re-ingest a multi-page PDF, ask a question that requires citations, and verify:
    • sources show p. N
    • answers include [1], [2] style citations that correspond to the correct source entries

Future work (optional)

  • Open-document + in-page highlight viewer (not included in this PR).

Bhupesh Cholake added 2 commits April 2, 2026 11:28

- Docling PDF: per-page segments with page metadata; bulk reader merges segment extra_info
- ClassicRAG + rag_prompt_docs: pass page through; build numbered excerpts for prompts
- Stream/workflow: use build_numbered_docs_together for template context
- Research CitationManager: dedupe by page; references show p. N
- Frontend: show page on source cards; optional source/link types
- Tests: rag_prompt_docs unit tests; adjust workflow/stream pre_fetch assertions

Made-with: Cursor
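
The page-aware deduplication and reference formatting mentioned in the commit message can be sketched as below. This is a minimal illustration under assumed data shapes, not the CitationManager implementation: `dedupe_citations`, `format_reference`, and the dict keys are hypothetical.

```python
def dedupe_citations(citations: list[dict]) -> list[dict]:
    """Keep the first citation per (source, title, page) key, preserving order."""
    seen = set()
    unique = []
    for citation in citations:
        key = (citation.get("source"), citation.get("title"), citation.get("page"))
        if key not in seen:
            seen.add(key)
            unique.append(citation)
    return unique


def format_reference(number: int, citation: dict) -> str:
    """Format a reference line, appending '(p. N)' only when a page is known."""
    line = f"[{number}] {citation.get('title', 'Untitled')}"
    if citation.get("page") is not None:
        line += f" (p. {citation['page']})"
    return line


citations = [
    {"source": "guide.pdf", "title": "User Guide", "page": 3},
    {"source": "guide.pdf", "title": "User Guide", "page": 3},  # exact duplicate
    {"source": "guide.pdf", "title": "User Guide", "page": 7},  # same doc, new page
]
unique = dedupe_citations(citations)
references = [format_reference(i, c) for i, c in enumerate(unique, start=1)]
print(references)
```

Including the page in the key means two excerpts from the same document on different pages stay separate references, while repeated hits on the same page collapse into one.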

vercel Bot commented Apr 2, 2026

Someone is attempting to deploy a commit to the Arc53 Team on Vercel.

A member of the Team first needs to authorize it.
