ARF (Advanced Retrieval Framework) is a Retrieval-Augmented Generation (RAG) system, based on R-Flow, that is designed to minimize cost and hallucination. This deployment is optimized for legal document search and analysis. It provides semantic search, multi-strategy retrieval, and context-aware document summarization across multiple legal domains.
Experience ARF in action: KnowYourRights.ai, an AI-powered legal rights search and case intake platform built on ARF
- Overview
- Features
- Architecture
- Installation
- Configuration
- Usage
- Data Sources
- Components
- Development
- Contributing
ARF is a production-ready RAG framework that enables:
- Multi-domain legal document retrieval across US Constitution, US Code, Code of Federal Regulations, USCIS Policy Manual, Supreme Court cases, and client cases
- Intelligent semantic search using MongoDB Atlas Vector Search with Voyage AI embeddings
- Hybrid search strategies combining semantic, keyword, alias, and exact matching
- LLM-powered reranking to improve result relevance and ordering
- Bilingual support (English/Spanish) for queries and responses
- Automatic document ingestion and embedding generation
- Domain-specific threshold tuning for optimal retrieval performance
- Semantic Vector Search: MongoDB Atlas Vector Search with Voyage AI embeddings (voyage-3-large, 1024 dimensions)
- Multi-Strategy Retrieval:
  - Semantic similarity search
  - Keyword/BM25 matching (configurable per domain)
  - Alias-based search (for US Constitution)
  - Exact pattern matching
  - Hybrid search combining multiple strategies
- Query Processing Pipeline:
  - Query rephrasing and expansion
  - Multi-stage filtering with configurable thresholds
  - LLM reranking for borderline results
  - Result ranking and gap filtering
- Intelligent Caching: Query result caching and summary reuse
- Bilingual Support: English and Spanish query processing and response generation
- Case-to-Document Mapping: Automatic linking of Supreme Court cases to relevant constitutional provisions
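One way to picture the hybrid strategy listed above is as a merge of per-strategy hit lists. The sketch below is an illustration only, not ARF's actual implementation: the `merge_strategy_hits` helper, the `(document_id, score)` hit shape, and the sample ids and scores are all assumptions made for this example. It keeps the best score seen for each document across strategies.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical hit shape: (document_id, score). ARF's real result objects may differ.
Hit = Tuple[str, float]


def merge_strategy_hits(hit_lists: Iterable[List[Hit]]) -> List[Hit]:
    """Merge hits from several strategies, keeping the best score per document."""
    best: Dict[str, float] = {}
    for hits in hit_lists:
        for doc_id, score in hits:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Illustrative inputs only.
semantic = [("amendment_14", 0.82), ("amendment_5", 0.61)]
alias = [("amendment_14", 0.95)]
keyword = [("article_1", 0.40)]
merged = merge_strategy_hits([semantic, alias, keyword])
```

Taking the maximum is only one possible combination rule; a weighted sum or rank fusion would fit the same interface.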
- US Constitution: Alias search, keyword matching, structured article/section navigation
- US Code: Large-scale document handling with efficient indexing
- Code of Federal Regulations (CFR): Hierarchical part/chapter/section organization
- USCIS Policy Manual: Automatic weekly updates, reference tracking
- Supreme Court Cases: Case-to-constitutional provision mapping
- Client Cases: SQL-based search for private case databases
ARF/
├── RAG_interface.py # Main orchestrator class
├── config.py # Configuration and collection definitions
├── rag_dependencies/ # Core RAG components
│ ├── mongo_manager.py # MongoDB connection and query management
│ ├── vector_search.py # MongoDB Atlas Vector Search implementation
│ ├── query_manager.py # Query processing and normalization
│ ├── query_processor.py # End-to-end query pipeline
│ ├── alias_manager.py # Alias/keyword search for US Constitution
│ ├── keyword_matcher.py # Structured keyword matching
│ ├── llm_verifier.py # LLM-based result reranking
│ ├── openai_service.py # OpenAI API integration
│ └── ai_service.py # AI service abstraction
└── preprocess/ # Data ingestion scripts
    ├── us_constitution/ # US Constitution ingestion
    ├── us_code/ # US Code ingestion
    ├── cfr/ # CFR ingestion
    ├── uscis_policy_manual/ # USCIS Policy Manual ingestion
    ├── supreme_court_cases/ # Supreme Court cases ingestion
    └── [other sources]/ # Additional data sources
- Query Input: User query with optional filters (jurisdiction, language, case filters)
- Query Normalization: Text normalization, pattern matching, domain detection
- Multi-Strategy Search:
  - Semantic vector search (primary)
  - Alias search (if enabled)
  - Keyword matching (if enabled)
  - Exact pattern matching
- Result Filtering:
  - Threshold-based filtering (domain-specific)
  - LLM reranking for borderline results
  - Gap filtering to remove outliers
- Result Ranking: Score-based ranking with bias adjustments
- Summary Generation: LLM-powered document summaries (cached for reuse)
- Response Formatting: Bilingual response generation
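The filtering stage of the flow above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the framework's code: the function name, the `(document_id, score)` hit shape, and the reading of "gap filtering" as "stop at the first large drop between consecutive ranked scores" are all guesses made for illustration.

```python
from typing import List, Tuple

# Hypothetical hit shape: (document_id, score).
Hit = Tuple[str, float]


def filter_by_threshold_and_gap(hits: List[Hit], min_score: float, max_gap: float) -> List[Hit]:
    """Drop hits below min_score, then cut the ranked list at the first large score gap."""
    ranked = sorted(
        (hit for hit in hits if hit[1] >= min_score),
        key=lambda hit: hit[1],
        reverse=True,
    )
    kept: List[Hit] = []
    for hit in ranked:
        # Assumed interpretation: a drop larger than max_gap marks the rest as outliers.
        if kept and kept[-1][1] - hit[1] > max_gap:
            break
        kept.append(hit)
    return kept


# Illustrative scores only.
hits = [("a", 0.90), ("c", 0.60), ("b", 0.86), ("d", 0.30)]
filtered = filter_by_threshold_and_gap(hits, min_score=0.5, max_gap=0.1)
```

With these sample values, `d` fails the threshold and `c` is cut because it sits far below `b`.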
- Python 3.8+
- MongoDB Atlas account with vector search enabled
- OpenAI API key
- Voyage AI API key
- Clone the repository:
  git clone <repository-url>
  cd arf
- Install dependencies:
  pip install -r requirements.txt
- Configure environment variables: create a .env file (or .env.local, .env.dev, .env.production) with:
  OPENAI_API_KEY=your_openai_api_key
  VOYAGE_API_KEY=your_voyage_api_key
  MONGO_URI=your_mongodb_atlas_connection_string
- Set up MongoDB Atlas:
  - Create vector search indexes on your collections
  - Index name: vector_index (default)
  - Vector field: embedding
  - Dimensions: 1024
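For reference, the index settings above correspond to an Atlas Vector Search index definition along these lines. The helper name is hypothetical, and the `cosine` similarity is an assumption (this README does not state which similarity function ARF uses); the path and dimension count come from the settings listed above.

```python
import json


def vector_index_definition(path: str = "embedding",
                            dimensions: int = 1024,
                            similarity: str = "cosine") -> dict:
    """Build an Atlas Vector Search index definition for one vector field."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,                 # field that stores the embedding
                "numDimensions": dimensions,  # must match the embedding model output
                "similarity": similarity,     # assumed; "euclidean" and "dotProduct" also exist
            }
        ]
    }


definition = vector_index_definition()
print(json.dumps(definition, indent=2))
```

The printed JSON can be pasted into the Atlas UI when creating an index named vector_index on a collection.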
Collections are defined in config.py with domain-specific settings:
COLLECTION = {
    "US_CONSTITUTION_SET": {
        "db_name": "public",
        "main_collection_name": "us_constitution",
        "document_type": "US Constitution",
        "use_alias_search": True,
        "use_keyword_matcher": True,
        "thresholds": DOMAIN_THRESHOLDS["us_constitution"],
        # ... additional settings
    },
    # ... other collections
}
Each domain has optimized thresholds for:
- query_search: Initial semantic search threshold
- alias_search: Alias matching threshold
- RAG_SEARCH_min: Minimum score to continue processing
- LLM_VERIFication: Threshold for LLM reranking
- RAG_SEARCH: High-confidence result threshold
- confident: Threshold for saving summaries
- FILTER_GAP: Maximum score gap between results
- LLM_SCORE: LLM reranking score adjustment
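To make the role of these thresholds more concrete, here is a simplified sketch of how a score might be routed. It is not taken from ARF: the `route_score` function, the numeric values, and the three-band logic are assumptions, and only two of the listed thresholds (RAG_SEARCH_min and RAG_SEARCH) are modeled.

```python
from typing import Dict

# Illustrative values only; the real per-domain numbers live in config.py.
EXAMPLE_THRESHOLDS: Dict[str, float] = {
    "RAG_SEARCH_min": 0.55,
    "RAG_SEARCH": 0.80,
}


def route_score(score: float, thresholds: Dict[str, float]) -> str:
    """Classify a similarity score into one of three assumed bands."""
    if score < thresholds["RAG_SEARCH_min"]:
        return "drop"      # too weak to continue processing
    if score >= thresholds["RAG_SEARCH"]:
        return "accept"    # high-confidence result
    return "rerank"        # borderline: hand over to LLM reranking


decisions = [route_score(s, EXAMPLE_THRESHOLDS) for s in (0.90, 0.70, 0.40)]
```

Keeping the thresholds in a per-domain dictionary means the same routing function can serve every collection.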
The framework supports multiple environments:
- --production: Uses .env.production
- --dev: Uses .env.dev
- --local: Uses .env.local
- Auto-detection: Based on Docker environment and file existence
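A rough sketch of how such environment selection could work is shown below. The `select_env_file` function and the fallback order are assumptions for illustration; the Docker check mentioned above is deliberately left out, and ARF's own logic may differ.

```python
import argparse
from pathlib import Path
from typing import List


def select_env_file(argv: List[str], base_dir: Path = Path(".")) -> Path:
    """Pick an env file from CLI flags, falling back to whichever file exists."""
    parser = argparse.ArgumentParser(add_help=False)
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--production", action="store_true")
    group.add_argument("--dev", action="store_true")
    group.add_argument("--local", action="store_true")
    # Ignore unrelated flags such as --with-embeddings.
    args, _unknown = parser.parse_known_args(argv)

    if args.production:
        return base_dir / ".env.production"
    if args.dev:
        return base_dir / ".env.dev"
    if args.local:
        return base_dir / ".env.local"

    # Assumed fallback order when no flag is given.
    for name in (".env.local", ".env.dev", ".env.production", ".env"):
        if (base_dir / name).exists():
            return base_dir / name
    return base_dir / ".env"


chosen = select_env_file(["--production", "--with-embeddings"])
```

The returned path could then be handed to a loader such as python-dotenv, if the project uses one.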
from RAG_interface import RAG
from config import COLLECTION
# Initialize RAG for a specific collection
rag = RAG(COLLECTION["US_CONSTITUTION_SET"], debug_mode=False)
# Process a query
results, query = rag.process_query(
    query="What does the 14th Amendment say about equal protection?",
    language="en"
)
# Get summary for a specific result
summary = rag.process_summary(
    query=query,
    result_list=results,
    index=0,
    language="en"
)
# With jurisdiction filtering
results, query = rag.process_query(
    query="immigration policy",
    jurisdiction="federal",
    language="en"
)
# Bilingual summary
insight_en, insight_es = rag.process_summary_bilingual(
    query=query,
    result_list=results,
    index=0,
    language="es"  # Returns both English and Spanish
)
# SQL-based client case search
rag_sql = RAG(COLLECTION["CLIENT_CASES"], debug_mode=False)
results = rag_sql.process_query(
    query="asylum case",
    filtered_cases=["case_id_1", "case_id_2"]
)
Query options:
- skip_pre_checks: Skip initial query validation
- skip_cases_search: Skip Supreme Court case search
- filtered_cases: Filter results to specific case IDs (SQL path)
- jurisdiction: Filter by jurisdiction
- language: Query language ("en" or "es")
ARF supports ingestion from multiple legal document sources:
- US Constitution (preprocess/us_constitution/)
  - Main constitutional text
  - Alias mappings for articles/sections
  - Supreme Court case references
- US Code (preprocess/us_code/)
  - All 54 titles of the United States Code
  - XML to JSON conversion
  - Hierarchical clause organization
- Code of Federal Regulations (preprocess/cfr/)
  - All CFR titles
  - Part/chapter/section structure
  - XML parsing and normalization
- USCIS Policy Manual (preprocess/uscis_policy_manual/)
  - HTML to JSON conversion
  - Automatic weekly updates
  - Reference tracking to CFR
- Supreme Court Cases (preprocess/supreme_court_cases/)
  - Public case database
  - Case-to-constitutional provision mapping
- California Codes (preprocess/ca_codes/)
  - California State Codes
  - Multiple fetch strategies
- California Constitution (preprocess/ca_constitution/)
  - State constitutional text
- Federal Register (preprocess/federal_register/)
  - Federal Register documents
- Agency Guidance (preprocess/agency_guidance/)
  - USCIS, DHS, ICE guidance documents
See preprocess/README.md for detailed ingestion instructions. Example:
# Ingest US Constitution with embeddings
python preprocess/us_constitution/ingest_con_law.py --production --from-scratch --with-embeddings
# Ingest Supreme Court cases
python preprocess/supreme_court_cases/ingest_supreme_court_cases.py --production --with-embeddings
Main orchestrator class (RAG_interface.py) that wires all subsystems together:
- Collection configuration management
- Domain-specific threshold selection
- Component initialization
- Public API for query processing
End-to-end query processing pipeline (query_processor.py):
- Query normalization and expansion
- Multi-stage search execution
- Result filtering and ranking
- Summary generation and caching
- Case-to-document mapping
MongoDB Atlas Vector Search implementation (vector_search.py):
- Native $vectorSearch aggregation
- Score bias adjustments
- Efficient similarity search
- Error handling and retries
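For readers unfamiliar with `$vectorSearch`, a pipeline of the kind described above might be assembled like this. The builder function and the projected field names (`title`, `text`) are hypothetical; the index name and vector path follow the defaults given in the installation notes, and the stage options are the standard Atlas ones.

```python
from typing import Any, Dict, List, Sequence


def build_vector_search_pipeline(query_vector: Sequence[float],
                                 limit: int = 10,
                                 num_candidates: int = 100,
                                 index: str = "vector_index",
                                 path: str = "embedding",
                                 fields: Sequence[str] = ("title", "text")) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline that runs $vectorSearch and exposes the score."""
    if num_candidates < limit:
        raise ValueError("num_candidates must be at least as large as limit")
    projection: Dict[str, Any] = {name: 1 for name in fields}  # hypothetical field names
    projection["score"] = {"$meta": "vectorSearchScore"}
    return [
        {
            "$vectorSearch": {
                "index": index,
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {"$project": projection},
    ]


# A zero vector stands in for a real 1024-dimensional query embedding.
pipeline = build_vector_search_pipeline([0.0] * 1024, limit=5)
```

The resulting list can be passed to a PyMongo collection's `aggregate` method; score bias adjustments and retries would be layered on top of the returned documents.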
Query processing utilities (query_manager.py):
- Text normalization
- Pattern matching
- Query rephrasing
- Domain detection
Alias-based search for US Constitution (alias_manager.py):
- Keyword/alias embeddings
- Fast alias matching
- Score boosting for exact matches
Structured keyword matching (keyword_matcher.py):
- Article/section pattern matching
- Hierarchical document navigation
- Exact match detection
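As an illustration of article/section pattern matching, the sketch below uses two regular expressions to pull a structured reference out of a query. The patterns, the `match_reference` function, and the returned dictionary shape are assumptions for this example and are much simpler than what a production matcher would need.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns, e.g. "14th Amendment" and "Article I, Section 8".
AMENDMENT_RE = re.compile(r"\b(\d{1,2})(?:st|nd|rd|th)\s+amendment\b", re.IGNORECASE)
ARTICLE_RE = re.compile(
    r"\barticle\s+([ivx]+|\d+)\b(?:\s*,?\s*section\s+(\d+))?",
    re.IGNORECASE,
)


def match_reference(query: str) -> Optional[Dict[str, str]]:
    """Return a structured reference if the query names an amendment or article."""
    found = AMENDMENT_RE.search(query)
    if found:
        return {"type": "amendment", "number": found.group(1)}
    found = ARTICLE_RE.search(query)
    if found:
        reference = {"type": "article", "number": found.group(1).upper()}
        if found.group(2):
            reference["section"] = found.group(2)
        return reference
    return None


amendment = match_reference("What does the 14th Amendment say about equal protection?")
article = match_reference("Powers listed in Article I, Section 8")
```

A query with no recognizable reference returns `None`, which would let the pipeline fall back to semantic search.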
LLM-based result reranking (llm_verifier.py):
- Relevance scoring and reranking
- Borderline result reranking
- Confidence adjustment and score refinement
MongoDB connection and query management (mongo_manager.py):
- Database connections
- Collection access
- Query caching
- User query history
arf/
├── RAG_interface.py # Main entry point
├── config.py # Configuration
├── rag_dependencies/ # Core RAG modules
├── preprocess/ # Data ingestion
│ ├── [source]/ # Source-specific scripts
│ └── README.md # Ingestion documentation
└── Data/ # Knowledge base data
    └── Knowledge/ # Processed JSON files
# Run preprocessing verification scripts
python preprocess/cfr/check_cfr_structure.py
python preprocess/us_code/verify_clause_numbers.py
To add a new data source:
- Create a new directory in preprocess/
- Implement fetch and ingest scripts
- Add collection configuration to config.py
- Define domain-specific thresholds
- Create vector search indexes in MongoDB Atlas
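As a hedged example of the configuration steps above, a new entry might look like the following. The collection key, database and collection names, and every threshold value are placeholders invented for illustration; only the dictionary keys mirror the ones shown in the configuration section of this README.

```python
# Hypothetical thresholds for a new domain; tune against real queries.
DOMAIN_THRESHOLDS = {
    "ca_codes": {
        "query_search": 0.70,
        "alias_search": 0.80,
        "RAG_SEARCH_min": 0.55,
        "LLM_VERIFication": 0.65,
        "RAG_SEARCH": 0.80,
        "confident": 0.85,
        "FILTER_GAP": 0.10,
        "LLM_SCORE": 0.05,
    }
}

# Hypothetical collection entry; the names below are placeholders.
COLLECTION = {
    "CA_CODES_SET": {
        "db_name": "public",
        "main_collection_name": "ca_codes",
        "document_type": "California Codes",
        "use_alias_search": False,
        "use_keyword_matcher": True,
        "thresholds": DOMAIN_THRESHOLDS["ca_codes"],
    }
}
```

In the real config.py the new entries would be added to the existing DOMAIN_THRESHOLDS and COLLECTION dictionaries rather than redefining them.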
Enable debug mode for detailed logging:
rag = RAG(COLLECTION["US_CONSTITUTION_SET"], debug_mode=True)
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- Follow PEP 8 Python style guide
- Use type hints where appropriate
- Add docstrings to public functions
- Include logging for important operations
This project is licensed under the MIT License.
- MongoDB Atlas for vector search capabilities
- Voyage AI for embedding models
- OpenAI for LLM services
For detailed information on data ingestion, see preprocess/README.md.
