RAG Service

Getting Started

For prerequisites, environment setup, and general development workflow, see the Contributing Guide.

This README covers RAG service-specific architecture and development details.

Introduction

The RAG (Retrieval-Augmented Generation) service processes grant documents to extract structured templates and generates complete grant applications using LLMs with retrieval from indexed documents. The service integrates Wikidata-enhanced scientific context to improve the quality and scientific accuracy of generated applications.

Core capabilities:

Grant Template Extraction: Extracts structured templates from call-for-proposals (CFP) documents, identifying required sections, length constraints, and evaluation criteria
Grant Application Generation: Generates complete grant applications using a multi-stage pipeline that combines document retrieval, scientific context enhancement, and LLM-powered synthesis
Wikidata Enhancement: Enriches research objectives with scientific context from Wikidata's knowledge base, expanding scientific terms and providing authoritative references

Service Structure

services/rag/
├── src/
│   ├── grant_template/         # CFP template extraction
│   │   ├── extract_sections.py
│   │   ├── cfp_section_analysis.py
│   │   ├── extract_cfp_data.py
│   │   └── generate_metadata.py
│   ├── grant_application/      # Application generation pipeline
│   │   ├── enrich_research_objective.py
│   │   ├── generate_work_plan.py
│   │   ├── generate_section_text.py
│   │   └── extract_relationships.py
│   ├── autofill/               # Autofill request handling
│   │   └── process_autofill.py
│   └── utils/                  # Shared utilities
│       ├── wikidata_client.py
│       ├── format_scientific_context.py
│       └── llm_clients.py
└── tests/
    ├── grant_template/
    ├── grant_application/
    ├── utils/
    └── benchmarks/

Operation Flow

Grant Application Pipeline

graph TD
    A[Pub/Sub: autofill-requests] --> B[Stage 1: Blueprint Prep]
    B --> C[Stage 2: Workplan Generation]
    C --> D[Stage 3: Section Synthesis]

    B --> B1[Enrich Research Objective]
    B1 --> B2[Wikidata Enhancement]
    B2 --> B3[Generate Search Queries]

    C --> C1[Retrieve Relevant Docs]
    C1 --> C2[Generate Work Plan]
    C2 --> C3[Extract Relationships]

    D --> D1[Generate Section Text]
    D1 --> D2[Save to PostgreSQL]
    D2 --> D3[Pub/Sub: frontend-notifications]

Grant Template Extraction

graph TD
    A[Pub/Sub: rag-processing] --> B[Extract CFP Sections]
    B --> C[Analyze Section Requirements]
    C --> D[Extract CFP Data]
    D --> E[Generate Metadata]
    E --> F[Save to PostgreSQL]
    F --> G[Pub/Sub: frontend-notifications]

Wikidata Enhancement

graph TD
    A[Research Objective] --> B[Extract Scientific Terms]
    B --> C[Batch Terms]
    C --> D[Query Wikidata SPARQL]
    D --> E{Success?}
    E -->|Yes| F[Format Scientific Context]
    E -->|No| G[Retry with Backoff]
    G --> D
    F --> H[Enhance Application]

Processing Stages

Grant Template Extraction

Section Extraction: Identifies all required sections in CFP document, distinguishing between research plan and supporting sections
Requirement Analysis: Analyzes each section for length constraints, evaluation criteria, and formatting requirements
Data Extraction: Extracts structured data including submission deadlines, funding amounts, and eligibility criteria
Metadata Generation: Generates searchable metadata for template discovery and matching

Grant Application Generation

BLUEPRINT_PREP: Enriches research objective with scientific context from Wikidata, extracts core terms, generates guiding questions and search queries
WORKPLAN_GENERATION: Retrieves relevant documents, generates comprehensive work plan with tasks and milestones, extracts dependencies between tasks
SECTION_SYNTHESIS: Generates final section text using work plan, retrieved documents, and CFP requirements

Integration Points

Pub/Sub Topics

rag-processing: Receives grant template extraction requests
autofill-requests: Receives grant application generation requests
frontend-notifications: Publishes completion notifications to frontend

PostgreSQL

grant_templates: Stores extracted CFP templates with sections and requirements
grant_applications: Stores generated applications with section content
research_objectives: Stores enriched objectives with scientific terms
work_plans: Stores generated work plans with tasks and dependencies

Wikidata SPARQL

Endpoint: https://query.wikidata.org/sparql
Batch Processing: 5 terms per batch for optimal performance
Retry Logic: 3 attempts with exponential backoff
Timeout: 30 seconds per request

LLM Providers

Gemini 2.5 Flash: Primary model for structured extraction and generation (1M context window)
Claude 3.5 Sonnet: Secondary model for complex reasoning tasks
Vertex AI: Managed model deployment with rate limiting and retry logic

Notes

JSON Schema Design Principles

All JSON schemas for Gemini structured output follow official Google best practices to minimize token usage and improve model reliability:

Core Principles

Short Property Names: Single words preferred (name, type, source, quote) not verbose (section_name, measurement_type, cfp_source_reference, quote_from_source)
Minimal Nesting: Max 2 levels deep, flatten arrays where possible
Reduced Required Fields: 2-6 required fields per object, use NotRequired for rest
Object Arrays Over Tuples: Always use named properties [{source, target, desc}] not [["s","t","d"]]
Concise Descriptions: 1-2 sentences max in schema, move details to prompts
Strategic Constraints: Use minItems/maxItems (3-10 typical), avoid over-constraining
Property Ordering: Match example order in prompts

Optimization Pattern

RAG service uses optimized short property names throughout the pipeline. Conversion to database column names happens only at the database boundary when saving to tables.

# Optimized schema for Gemini
schema = {
    "properties": {
        "name": {"type": "string"},
        "quote": {"type": "string"},
        "source": {"type": "string"}
    },
    "required": ["name", "quote"]
}

# LLM returns optimized format
response = {"name": "Research Plan", "quote": "...", "source": "..."}

# Use short names throughout RAG service
section = ExtractedSectionDTO(
    title=response["name"],
    ...
)

# Convert to DB format only when saving to database tables

Property Name Mapping

DB Schema (Descriptive)	Pipeline Schema (Optimized)	Token Savings
`section_name`	`name`	50%
`quote_from_source`	`quote`	70%
`cfp_source_reference`	`source`	70%
`is_detailed_research_plan`	`is_plan`	50%
`enriched_objective`	`enriched`	50%
`core_scientific_terms`	`terms`	60%
`guiding_questions`	`questions`	50%
`search_queries`	`queries`	40%
`sections_count`	`count`	60%
`length_constraints_found`	`constraints_count`	50%

Schema Example: Before and After

Before (verbose):

{
    "is_detailed_research_plan": {"type": "boolean"},
    "needs_applicant_writing": {"type": "boolean"},
    "length_limit": {"type": "integer", "nullable": True},
    "length_source": {"type": "string", "nullable": True}
}

After (optimized):

{
    "is_plan": {"type": "boolean"},
    "needs_writing": {"type": "boolean"},
    "length_constraint": {
        "type": "object",
        "nullable": True,
        "properties": {
            "type": {"type": "string", "enum": ["words", "characters"]},
            "value": {"type": "integer", "minimum": 1},
            "source": {"type": "string", "nullable": True},
        },
        "required": ["type", "value", "source"],
    }
}

Business Logic Preservation

All validation logic uses short property names:

extract_sections.py: Exactly 1 section with is_plan=true, research plan must have long_form=true
enrich_research_objective.py: Exactly 5 terms, minimum 3 questions, minimum 3 queries
cfp_section_analysis.py: count must match required_sections array length, constraints_count must match length_constraints array length
extract_relationships.py: Tuple arrays replaced with object arrays [{source, target, desc}]

Prompt Engineering Guidelines

All RAG prompts target Gemini 2.5 Flash (1M context, thinking mode) and follow official best practices:

Core Principles

Concise & Clear: State instructions once, no repetition or shouting (ALL CAPS)
NO JSON Examples in Prompts: Schema provides structure, prompt provides context only
Hierarchical Structure: Use ## headers and numbered lists for organization
Professional Tone: Avoid emoji warnings and excessive emphasis

JSON Output Prompts

Provide JSON schema separately (see *_json_schema constants)
DO NOT include JSON examples in prompts - schema is sufficient
Focus prompt on task description and requirements
Structure: Task → Requirements → Schema reference (no examples)

Example files: extract_sections.py, cfp_section_analysis.py, generate_metadata.py, extract_cfp_data.py

Text Output Prompts

Structure: Requirements → Materials → Guidelines → Format
Keep under 50 lines for simple tasks, under 100 for complex
Use model's thinking mode (don't prescribe chain-of-thought)

Example files: generate_section_text.py

Token Budget Guidelines

System prompts: < 30 lines (state role and key constraints)
User prompts: < 100 lines for complex tasks, < 50 for simple
NO JSON examples: Schema-only approach saves 200-500 tokens per prompt
Total prompt tokens: Target < 500 tokens per prompt (excluding input data)

Anti-Patterns to Avoid

Massive verbose prompts (>400 lines)
JSON examples in prompts (schema is sufficient)
Repetitive instructions
Emoji warnings and excessive ALL CAPS
Contradictory or overlapping instructions
Prescriptive chain-of-thought (model has thinking mode)
Tuple arrays instead of object arrays

Prompt Refactoring Checklist

Remove all repetition (say things once)
Remove ALL JSON examples - schema provides structure
Consolidate overlapping sections
Remove emoji warnings and excessive caps
Verify JSON schema uses short property names
Add transformation layer if schema changed
Test with validation logic to ensure compatibility

Wikidata Enhancement

The service integrates Wikidata's scientific knowledge base to enrich grant applications with authoritative scientific context.

Implementation

Client: Function-based httpx client with no context manager overhead
Configuration: Constants-based (no environment variables): batch size 5, timeout 30s, max retries 3
Processing: Batch processing with exponential backoff retry logic
Integration: Enriches research objectives in BLUEPRINT_PREP stage

Performance Metrics

Processing Speed: < 5 seconds for typical scientific term sets (10 terms)
Scalability: Handles 5-50 terms efficiently with batch processing
Reliability: 95%+ success rate with retry mechanisms
Throughput: > 0.1 terms/second
Success Rate: > 80% success rate

Scientific Context

Term Coverage: 10/10 scientific terms detected
Context Relevance: 100% relevant scientific context
Field Organization: Organized by scientific field for better LLM consumption
Template-based: Uses prompt template library for consistent formatting

Quality Improvements

AI evaluation results demonstrate significant improvements from Wikidata enhancement:

AI Evaluation Results

Baseline Quality: 4.0/5.0
Wiki-Enhanced Quality: 5.0/5.0 (+25% improvement)
Scientific Terms: 10% → 100% coverage (+900% improvement)
All Criteria: +1.0 point improvement across all evaluation dimensions

Scientific Accuracy

Term Detection: 100% (10/10 scientific terms identified)
Context Relevance: 100% relevant scientific context from Wikidata
Structured Organization: Context organized by scientific field for optimal LLM processing
Quality Metrics: Comprehensive AI evaluation with statistical analysis

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RAG Service

Getting Started

Introduction

Service Structure

Operation Flow

Grant Application Pipeline

Grant Template Extraction

Wikidata Enhancement

Processing Stages

Grant Template Extraction

Grant Application Generation

Integration Points

Pub/Sub Topics

PostgreSQL

Wikidata SPARQL

LLM Providers

Notes

JSON Schema Design Principles

Core Principles

Optimization Pattern

Property Name Mapping

Schema Example: Before and After

Business Logic Preservation

Prompt Engineering Guidelines

Core Principles

JSON Output Prompts

Text Output Prompts

Token Budget Guidelines

Anti-Patterns to Avoid

Prompt Refactoring Checklist

Wikidata Enhancement

Implementation

Performance Metrics

Scientific Context

Quality Improvements

AI Evaluation Results

Scientific Accuracy

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

RAG Service

Getting Started

Introduction

Service Structure

Operation Flow

Grant Application Pipeline

Grant Template Extraction

Wikidata Enhancement

Processing Stages

Grant Template Extraction

Grant Application Generation

Integration Points

Pub/Sub Topics

PostgreSQL

Wikidata SPARQL

LLM Providers

Notes

JSON Schema Design Principles

Core Principles

Optimization Pattern

Property Name Mapping

Schema Example: Before and After

Business Logic Preservation

Prompt Engineering Guidelines

Core Principles

JSON Output Prompts

Text Output Prompts

Token Budget Guidelines

Anti-Patterns to Avoid

Prompt Refactoring Checklist

Wikidata Enhancement

Implementation

Performance Metrics

Scientific Context

Quality Improvements

AI Evaluation Results

Scientific Accuracy