For prerequisites, environment setup, and general development workflow, see the Contributing Guide.
This README covers RAG service-specific architecture and development details.
The RAG (Retrieval-Augmented Generation) service processes grant documents to extract structured templates and generates complete grant applications using LLMs with retrieval from indexed documents. The service integrates Wikidata-enhanced scientific context to improve the quality and scientific accuracy of generated applications.
Core capabilities:
- Grant Template Extraction: Extracts structured templates from call-for-proposals (CFP) documents, identifying required sections, length constraints, and evaluation criteria
- Grant Application Generation: Generates complete grant applications using a multi-stage pipeline that combines document retrieval, scientific context enhancement, and LLM-powered synthesis
- Wikidata Enhancement: Enriches research objectives with scientific context from Wikidata's knowledge base, expanding scientific terms and providing authoritative references
```
services/rag/
├── src/
│   ├── grant_template/                 # CFP template extraction
│   │   ├── extract_sections.py
│   │   ├── cfp_section_analysis.py
│   │   ├── extract_cfp_data.py
│   │   └── generate_metadata.py
│   ├── grant_application/              # Application generation pipeline
│   │   ├── enrich_research_objective.py
│   │   ├── generate_work_plan.py
│   │   ├── generate_section_text.py
│   │   └── extract_relationships.py
│   ├── autofill/                       # Autofill request handling
│   │   └── process_autofill.py
│   └── utils/                          # Shared utilities
│       ├── wikidata_client.py
│       ├── format_scientific_context.py
│       └── llm_clients.py
└── tests/
    ├── grant_template/
    ├── grant_application/
    ├── utils/
    └── benchmarks/
```
Grant application generation flow:

```mermaid
graph TD
    A[Pub/Sub: autofill-requests] --> B[Stage 1: Blueprint Prep]
    B --> C[Stage 2: Workplan Generation]
    C --> D[Stage 3: Section Synthesis]
    B --> B1[Enrich Research Objective]
    B1 --> B2[Wikidata Enhancement]
    B2 --> B3[Generate Search Queries]
    C --> C1[Retrieve Relevant Docs]
    C1 --> C2[Generate Work Plan]
    C2 --> C3[Extract Relationships]
    D --> D1[Generate Section Text]
    D1 --> D2[Save to PostgreSQL]
    D2 --> D3[Pub/Sub: frontend-notifications]
```
Grant template extraction flow:

```mermaid
graph TD
    A[Pub/Sub: rag-processing] --> B[Extract CFP Sections]
    B --> C[Analyze Section Requirements]
    C --> D[Extract CFP Data]
    D --> E[Generate Metadata]
    E --> F[Save to PostgreSQL]
    F --> G[Pub/Sub: frontend-notifications]
```
Wikidata enhancement flow:

```mermaid
graph TD
    A[Research Objective] --> B[Extract Scientific Terms]
    B --> C[Batch Terms]
    C --> D[Query Wikidata SPARQL]
    D --> E{Success?}
    E -->|Yes| F[Format Scientific Context]
    E -->|No| G[Retry with Backoff]
    G --> D
    F --> H[Enhance Application]
```
Template extraction stages:
- Section Extraction: Identifies all required sections in the CFP document, distinguishing between research plan and supporting sections
- Requirement Analysis: Analyzes each section for length constraints, evaluation criteria, and formatting requirements
- Data Extraction: Extracts structured data including submission deadlines, funding amounts, and eligibility criteria
- Metadata Generation: Generates searchable metadata for template discovery and matching
Application generation stages (a dispatch sketch follows the list):
- BLUEPRINT_PREP: Enriches the research objective with scientific context from Wikidata, extracts core terms, and generates guiding questions and search queries
- WORKPLAN_GENERATION: Retrieves relevant documents, generates a comprehensive work plan with tasks and milestones, and extracts dependencies between tasks
- SECTION_SYNTHESIS: Generates the final section text using the work plan, retrieved documents, and CFP requirements
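A minimal dispatch sketch for these stages; the handler bodies and the `run_pipeline` helper are illustrative placeholders, not this repo's actual code:

```python
from enum import Enum
from typing import Callable

class PipelineStage(Enum):
    BLUEPRINT_PREP = "BLUEPRINT_PREP"
    WORKPLAN_GENERATION = "WORKPLAN_GENERATION"
    SECTION_SYNTHESIS = "SECTION_SYNTHESIS"

def blueprint_prep(ctx: dict) -> dict:
    # Placeholder: enrich the objective, extract terms, build search queries.
    return {**ctx, "queries": []}

def workplan_generation(ctx: dict) -> dict:
    # Placeholder: retrieve documents, generate the work plan, extract dependencies.
    return {**ctx, "work_plan": []}

def section_synthesis(ctx: dict) -> dict:
    # Placeholder: generate final section text from the work plan and documents.
    return {**ctx, "sections": []}

HANDLERS: dict[PipelineStage, Callable[[dict], dict]] = {
    PipelineStage.BLUEPRINT_PREP: blueprint_prep,
    PipelineStage.WORKPLAN_GENERATION: workplan_generation,
    PipelineStage.SECTION_SYNTHESIS: section_synthesis,
}

def run_pipeline(ctx: dict) -> dict:
    # Stages run in definition order; each stage's output feeds the next.
    for stage in PipelineStage:
        ctx = HANDLERS[stage](ctx)
    return ctx
```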
Pub/Sub topics (a wiring sketch follows the list):
- rag-processing: Receives grant template extraction requests
- autofill-requests: Receives grant application generation requests
- frontend-notifications: Publishes completion notifications to the frontend
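A hedged wiring sketch using the google-cloud-pubsub client; the project ID and subscription name are placeholders, not values from this repo:

```python
import json
from google.cloud import pubsub_v1

PROJECT_ID = "your-gcp-project"  # placeholder

def publish_completion(payload: dict) -> None:
    """Publish a completion notification to the frontend-notifications topic."""
    publisher = pubsub_v1.PublisherClient()
    topic = publisher.topic_path(PROJECT_ID, "frontend-notifications")
    publisher.publish(topic, data=json.dumps(payload).encode("utf-8")).result()

def listen_for_autofill(callback) -> None:
    """Stream messages from an autofill-requests subscription.

    The callback receives a Message and must call message.ack().
    """
    subscriber = pubsub_v1.SubscriberClient()
    sub = subscriber.subscription_path(PROJECT_ID, "autofill-requests")  # assumed name
    subscriber.subscribe(sub, callback=callback).result()  # blocks until error/cancel
```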
PostgreSQL tables (an illustrative model sketch follows the list):
- grant_templates: Stores extracted CFP templates with sections and requirements
- grant_applications: Stores generated applications with section content
- research_objectives: Stores enriched objectives with scientific terms
- work_plans: Stores generated work plans with tasks and dependencies
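An illustrative SQLAlchemy sketch of one of these tables; the `title` column and overall shape are assumptions, not this repo's actual schema:

```python
from sqlalchemy import JSON, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class GrantTemplate(Base):
    """Illustrative shape for grant_templates; actual columns may differ."""
    __tablename__ = "grant_templates"

    id: Mapped[int] = mapped_column(primary_key=True)
    title: Mapped[str] = mapped_column(String)    # assumed column name
    sections: Mapped[dict] = mapped_column(JSON)  # extracted sections and requirements
```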
Wikidata SPARQL configuration:
- Endpoint: https://query.wikidata.org/sparql
- Batch Processing: 5 terms per batch for optimal performance
- Retry Logic: 3 attempts with exponential backoff
- Timeout: 30 seconds per request
LLM models (a client sketch follows the list):
- Gemini 2.5 Flash: Primary model for structured extraction and generation (1M context window)
- Claude 3.5 Sonnet: Secondary model for complex reasoning tasks
- Vertex AI: Managed model deployment with rate limiting and retry logic
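A hedged sketch of a structured-output call through the google-genai SDK; llm_clients.py may wrap a different client, so treat the model ID and configuration as assumptions:

```python
from google import genai
from google.genai import types

# Route through Vertex AI; project and location are placeholders.
client = genai.Client(vertexai=True, project="your-gcp-project", location="us-central1")

def extract_structured(prompt: str, schema: dict) -> str:
    """Ask Gemini for JSON that conforms to the given response schema."""
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=schema,
        ),
    )
    return response.text  # JSON string matching the schema
```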
All JSON schemas for Gemini structured output follow official Google best practices to minimize token usage and improve model reliability:
- Short Property Names: Single words preferred (`name`, `type`, `source`, `quote`), not verbose (`section_name`, `measurement_type`, `cfp_source_reference`, `quote_from_source`)
- Minimal Nesting: Max 2 levels deep, flatten arrays where possible
- Reduced Required Fields: 2-6 required fields per object, use `NotRequired` for the rest
- Object Arrays Over Tuples: Always use named properties `[{source, target, desc}]`, not `[["s","t","d"]]`
- Concise Descriptions: 1-2 sentences max in the schema, move details to prompts
- Strategic Constraints: Use `minItems`/`maxItems` (3-10 typical), avoid over-constraining
- Property Ordering: Match the example order in prompts
The RAG service uses optimized short property names throughout the pipeline; conversion to database column names happens only at the database boundary, when saving to tables.
```python
# Optimized schema for Gemini
schema = {
    "properties": {
        "name": {"type": "string"},
        "quote": {"type": "string"},
        "source": {"type": "string"}
    },
    "required": ["name", "quote"]
}

# LLM returns the optimized format
response = {"name": "Research Plan", "quote": "...", "source": "..."}

# Use short names throughout the RAG service
section = ExtractedSectionDTO(
    title=response["name"],
    ...
)

# Convert to DB format only when saving to database tables
```

| DB Schema (Descriptive) | Pipeline Schema (Optimized) | Token Savings |
|---|---|---|
| `section_name` | `name` | 50% |
| `quote_from_source` | `quote` | 70% |
| `cfp_source_reference` | `source` | 70% |
| `is_detailed_research_plan` | `is_plan` | 50% |
| `enriched_objective` | `enriched` | 50% |
| `core_scientific_terms` | `terms` | 60% |
| `guiding_questions` | `questions` | 50% |
| `search_queries` | `queries` | 40% |
| `sections_count` | `count` | 60% |
| `length_constraints_found` | `constraints_count` | 50% |
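A minimal sketch of that database-boundary conversion; the mapping mirrors the table above, and the helper name is illustrative:

```python
# Short pipeline keys -> descriptive DB column names (from the table above).
SHORT_TO_DB = {
    "name": "section_name",
    "quote": "quote_from_source",
    "source": "cfp_source_reference",
    "is_plan": "is_detailed_research_plan",
    "enriched": "enriched_objective",
    "terms": "core_scientific_terms",
    "questions": "guiding_questions",
    "queries": "search_queries",
    "count": "sections_count",
    "constraints_count": "length_constraints_found",
}

def to_db_row(pipeline_obj: dict) -> dict:
    """Rename optimized keys to DB column names; unknown keys pass through."""
    return {SHORT_TO_DB.get(key, key): value for key, value in pipeline_obj.items()}
```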
An example of this optimization applied to a section-analysis schema. Before (verbose):

```python
{
    "is_detailed_research_plan": {"type": "boolean"},
    "needs_applicant_writing": {"type": "boolean"},
    "length_limit": {"type": "integer", "nullable": True},
    "length_source": {"type": "string", "nullable": True}
}
```

After (optimized):

```python
{
    "is_plan": {"type": "boolean"},
    "needs_writing": {"type": "boolean"},
    "length_constraint": {
        "type": "object",
        "nullable": True,
        "properties": {
            "type": {"type": "string", "enum": ["words", "characters"]},
            "value": {"type": "integer", "minimum": 1},
            "source": {"type": "string", "nullable": True},
        },
        "required": ["type", "value", "source"],
    },
}
```

All validation logic uses short property names (an example check follows the list):
- extract_sections.py: Exactly 1 section with `is_plan=true`; the research plan must have `long_form=true`
- enrich_research_objective.py: Exactly 5 `terms`, minimum 3 `questions`, minimum 3 `queries`
- cfp_section_analysis.py: `count` must match the `required_sections` array length, `constraints_count` must match the `length_constraints` array length
- extract_relationships.py: Tuple arrays replaced with object arrays `[{source, target, desc}]`
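For example, the enrich_research_objective.py rules could be enforced with a check like this sketch (the result dict shape is an assumption):

```python
def validate_enrichment(result: dict) -> None:
    """Enforce the rules above: exactly 5 terms, at least 3 questions and queries."""
    if len(result["terms"]) != 5:
        raise ValueError(f"expected exactly 5 terms, got {len(result['terms'])}")
    if len(result["questions"]) < 3:
        raise ValueError("expected at least 3 guiding questions")
    if len(result["queries"]) < 3:
        raise ValueError("expected at least 3 search queries")
```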
All RAG prompts target Gemini 2.5 Flash (1M context, thinking mode) and follow official best practices:
- Concise & Clear: State instructions once, no repetition or shouting (ALL CAPS)
- NO JSON Examples in Prompts: Schema provides structure, prompt provides context only
- Hierarchical Structure: Use `##` headers and numbered lists for organization
- Professional Tone: Avoid emoji warnings and excessive emphasis

For extraction prompts:
- Provide the JSON schema separately (see the `*_json_schema` constants)
- Do not include JSON examples in prompts; the schema is sufficient
- Focus the prompt on task description and requirements
- Structure: Task → Requirements → Schema reference (no examples; see the illustrative constant below)
Example files: extract_sections.py, cfp_section_analysis.py, generate_metadata.py, extract_cfp_data.py
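An illustrative extraction prompt in that style (the wording and constant name are assumptions, not copied from the repo):

```python
EXTRACT_SECTIONS_PROMPT = """\
## Task
Extract every section an applicant must complete from the CFP text below.

## Requirements
1. Identify each required section and its title.
2. Record any length constraint and quote its source.
3. Mark exactly one section as the detailed research plan.

Respond with JSON matching the provided schema. The CFP text follows.
"""
```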
For generation prompts:
- Structure: Requirements → Materials → Guidelines → Format
- Keep under 50 lines for simple tasks, under 100 for complex
- Use the model's thinking mode (don't prescribe chain-of-thought)
Example files: generate_section_text.py
Prompt size budgets:
- System prompts: < 30 lines (state role and key constraints)
- User prompts: < 100 lines for complex tasks, < 50 for simple
- NO JSON examples: Schema-only approach saves 200-500 tokens per prompt
- Total prompt tokens: Target < 500 tokens per prompt (excluding input data)
Prompt anti-patterns to avoid:
- Massive verbose prompts (>400 lines)
- JSON examples in prompts (schema is sufficient)
- Repetitive instructions
- Emoji warnings and excessive ALL CAPS
- Contradictory or overlapping instructions
- Prescriptive chain-of-thought (model has thinking mode)
- Tuple arrays instead of object arrays
Checklist for refactoring an existing prompt:
- Remove all repetition (say things once)
- Remove ALL JSON examples - schema provides structure
- Consolidate overlapping sections
- Remove emoji warnings and excessive caps
- Verify JSON schema uses short property names
- Add transformation layer if schema changed
- Test with validation logic to ensure compatibility
The service integrates Wikidata's scientific knowledge base to enrich grant applications with authoritative scientific context.
- Client: Function-based httpx client with no context-manager overhead (see the sketch below)
- Configuration: Constants-based rather than environment variables: batch size 5, timeout 30s, max retries 3
- Processing: Batch processing with exponential backoff retry logic
- Integration: Enriches research objectives in BLUEPRINT_PREP stage
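A hedged sketch of that client; the constants mirror the configuration stated above, while the SPARQL query and function name are illustrative:

```python
import time
import httpx

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"
BATCH_SIZE = 5    # terms per batch
TIMEOUT_S = 30    # seconds per request
MAX_RETRIES = 3   # attempts with exponential backoff

def query_terms(terms: list[str]) -> list[dict]:
    """Look up scientific terms on Wikidata in batches, retrying on failure."""
    results: list[dict] = []
    for i in range(0, len(terms), BATCH_SIZE):
        batch = terms[i : i + BATCH_SIZE]
        values = " ".join(f'"{term}"@en' for term in batch)
        sparql = f"""
            SELECT ?item ?itemLabel ?itemDescription WHERE {{
              VALUES ?label {{ {values} }}
              ?item rdfs:label ?label.
              SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
            }}"""
        for attempt in range(MAX_RETRIES):
            try:
                resp = httpx.get(
                    WIKIDATA_ENDPOINT,
                    params={"query": sparql, "format": "json"},
                    timeout=TIMEOUT_S,
                )
                resp.raise_for_status()
                results.extend(resp.json()["results"]["bindings"])
                break
            except httpx.HTTPError:
                if attempt == MAX_RETRIES - 1:
                    raise
                time.sleep(2**attempt)  # exponential backoff: 1s, 2s
    return results
```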
Performance targets:
- Processing Speed: < 5 seconds for typical scientific term sets (10 terms)
- Scalability: Handles 5-50 terms efficiently with batch processing
- Reliability: 95%+ success rate with retry mechanisms
- Throughput: > 0.1 terms/second
- Success Rate: > 80%
- Term Coverage: 10/10 scientific terms detected
- Context Relevance: 100% relevant scientific context
- Field Organization: Context is grouped by scientific field for better LLM consumption (sketched below)
- Template-based: Uses prompt template library for consistent formatting
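A sketch of that field-grouped formatting; the input record shape is an assumption, though the function name matches format_scientific_context.py:

```python
from collections import defaultdict

def format_scientific_context(entries: list[dict]) -> str:
    """Group Wikidata context by scientific field and render it for the LLM.

    Each entry is assumed to carry 'term', 'description', and 'field' keys.
    """
    by_field: dict[str, list[dict]] = defaultdict(list)
    for entry in entries:
        by_field[entry.get("field", "General")].append(entry)

    lines: list[str] = []
    for field, items in sorted(by_field.items()):
        lines.append(f"## {field}")
        lines.extend(f"- {item['term']}: {item['description']}" for item in items)
    return "\n".join(lines)
```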
AI evaluation results demonstrate significant improvements from Wikidata enhancement:
- Baseline Quality: 4.0/5.0
- Wiki-Enhanced Quality: 5.0/5.0 (+25% improvement)
- Scientific Terms: 10% → 100% coverage (+900% improvement)
- All Criteria: +1.0 point improvement across all evaluation dimensions
- Term Detection: 100% (10/10 scientific terms identified)
- Context Relevance: 100% relevant scientific context from Wikidata
- Structured Organization: Context organized by scientific field for optimal LLM processing
- Quality Metrics: Comprehensive AI evaluation with statistical analysis