R-Flow - Optimized Retrieval-Augmented Generation System

logo

🌍DEMO


Introduction

This project introduces enhancements to Retrieval-Augmented Generation (RAG) systems through adaptive query rewriting, knowledge graph integration, and intelligent caching. By combining dual-threshold mechanisms with ConceptNet-based semantic reasoning, the system interprets user intent more effectively. A hybrid graph-vector retrieval model enables context-aware document matching by blending symbolic and neural approaches. Additionally, a dynamic caching layer accelerates retrieval while maintaining high accuracy.

Demonstrated in the legal domain using the U.S. Constitution as a primary corpus, this system enhances precision, reduces hallucinations in large language models (LLMs), and minimizes redundant computations. The approach optimizes both retrieval accuracy and computational efficiency, with real-time legal insights and flexible search interaction.

Concepts

1. Although t-SNE reduces dimensionality and does not represent the relationships precisely, the idea is to get as close as possible to the point where the user is searching.

(a) Below is the 3D t-SNE plot (US Constitution). Click here to View Online

plot

2. By combining two techniques, the graph (query reformulation) and semantic search (dual thresholds), we can not only visualize the path of the search but also construct well-grounded reasoning, and therefore achieve higher accuracy.

(b) Below is the basic US Constitution graph with ChatGPT intercept. Click here to View Online

graph

(c) Below is the visualized query reformulation on the graph with ChatGPT intercept. Click here to View Online

graph

(d) Below is the visualized query reformulation overlaid on the t-SNE US Constitution plot. Click here to View Online

plot

- For full interactivity, all tools are located in ./src/visualizations

  • embedding_plot.py - generates the t-SNE plot (a)
  • graph_usc.py - generates the graph of the US Constitution (b)
  • query_expansions.py - generates the query reformulation graph (c) and the t-SNE plot with query embeddings (d)

Features

1. Adaptive Query Rewriting with ConceptNet

R-Flow enhances traditional RAG pipelines with dual-threshold query reformulation, now powered by ConceptNet, enabling semantic reasoning and contrastive variation. This improves interpretation of nuanced user inputs and adapts to varying levels of query specificity.
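The dual-threshold routing described above can be sketched as follows (a minimal, hypothetical illustration using the threshold values from config.py; the actual logic lives in src/dependencies/query_reformulator.py and may differ):

```python
# Hypothetical routing helper built on the two thresholds from config.py.
THRESHOLD_QUERY_SEARCH_MIN = 0.38  # below this, a candidate match is discarded
THRESHOLD_QUERY_SEARCH = 0.48      # at or above this, a match is accepted as-is

def route_match(similarity: float) -> str:
    """Decide what to do with a candidate given its cosine similarity."""
    if similarity >= THRESHOLD_QUERY_SEARCH:
        return "accept"        # confident match: return it directly
    if similarity >= THRESHOLD_QUERY_SEARCH_MIN:
        return "reformulate"   # borderline: rewrite the query (e.g. via ConceptNet) and retry
    return "discard"           # too dissimilar: drop the candidate

print(route_match(0.52), route_match(0.43), route_match(0.20))  # accept reformulate discard
```

Under this interpretation, only borderline matches pay the cost of ConceptNet-based reformulation; confident matches return immediately and weak ones are dropped.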

2. Hybrid Graph-Based Retrieval

R-Flow introduces a hybrid RAG model that merges vector search with graph-based reasoning. Leveraging document embeddings alongside knowledge graph traversal, this model improves contextual matching and reasoning across semantically related legal concepts.

3. Intelligent Caching

To improve throughput, R-Flow incorporates a dynamic caching layer that prevents redundant queries and speeds up responses. Caches are intelligently managed based on reformulated query patterns, boosting performance while maintaining precision.
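A minimal sketch of the idea, assuming the cache is keyed by a normalized query string (hypothetical; the real cache is managed in the MongoDB layer and tracks reformulated query patterns):

```python
# Toy in-memory query cache keyed by a normalized query string.
class QueryCache:
    def __init__(self):
        self._store = {}

    @staticmethod
    def _normalize(query: str) -> str:
        # Collapse case and whitespace so trivially different variants of
        # the same query hit the same cache entry.
        return " ".join(query.lower().split())

    def get(self, query):
        return self._store.get(self._normalize(query))

    def put(self, query, results):
        self._store[self._normalize(query)] = results

cache = QueryCache()
cache.put("Most  Important Right", ["First Amendment"])
print(cache.get("most important right"))  # ['First Amendment'] — cache hit on the normalized key
```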

4. Query Expansion & Reformulation

Using Sentence-BERT, Spacy, and ConceptNet, R-Flow expands and refines user queries semantically and contrastively. This approach improves recall and document relevance, allowing for more accurate downstream LLM generation.
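As a toy illustration of semantic and contrastive expansion, the tiny relation table below is a hypothetical stand-in for ConceptNet edges (RelatedTo / Antonym); the real pipeline queries ConceptNet and scores variants with Sentence-BERT:

```python
# Hypothetical miniature relation tables standing in for ConceptNet edges.
RELATED = {"right": ["freedom", "liberty"], "speech": ["expression"]}
ANTONYMS = {"freedom": ["restriction"]}

def expand_query(query: str):
    terms = query.lower().split()
    expanded = set(terms)
    for t in terms:
        expanded.update(RELATED.get(t, []))       # semantic variants
        for r in RELATED.get(t, []):
            expanded.update(ANTONYMS.get(r, []))  # contrastive variants
    return sorted(expanded)

print(expand_query("right speech"))
# ['expression', 'freedom', 'liberty', 'restriction', 'right', 'speech']
```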

5. Annoy-based Vector Search

For neural retrieval, R-Flow uses Annoy indexing to perform fast, approximate nearest neighbor searches on large legal document sets (e.g., U.S. and Australian constitutional data). Annoy trees are pre-built and optimized for high-speed access.

6. Document Embedding with OpenAI

R-Flow supports embeddings via OpenAI’s API or local sentence transformers, transforming legal documents into rich vector representations. These embeddings are used both for retrieval and graph integration.

7. Real-time Summarization and Case Insights

Retrieved documents are summarized using OpenAI’s ChatGPT, extracting key insights (e.g., religious freedom under the First Amendment) and making legal content more digestible to non-experts.

8. Flexible Dataset Integration

Users can configure new datasets easily by modifying input configs and document parsing logic. R-Flow currently supports the U.S. Constitution and Australian legal corpora, with expansion support for global documents.

9. MongoDB-Backed Storage & Indexing

All documents, queries, embeddings, and reformulations are stored in MongoDB, with separate collections for fast querying, lookup, and audit trails. Supports local or remote instances with secure access options.

10. Highly Configurable System Design

R-Flow is built for flexibility:

Embedding Models: Choose between OpenAI or HuggingFace embeddings.

Search Parameters: Adjust thresholds, ANN trees, and result counts.

Query Modes: Control how reformulated queries are generated and scored.

Daily Limits: Set daily usage caps to manage costs and access control.
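The daily cap can be sketched as a simple counter (a hypothetical in-memory version; R-Flow persists usage in the search_limits MongoDB collection):

```python
import datetime

LIMIT = 10000  # requests allowed per day (see config.py)

class DailyLimiter:
    def __init__(self, limit=LIMIT):
        self.limit = limit
        self.day = None
        self.count = 0

    def allow(self) -> bool:
        today = datetime.date.today()
        if today != self.day:        # reset the counter on a new day
            self.day, self.count = today, 0
        if self.count >= self.limit:
            return False             # cap reached: refuse the request
        self.count += 1
        return True

limiter = DailyLimiter(limit=2)
print(limiter.allow(), limiter.allow(), limiter.allow())  # True True False
```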

11. Rich Query Interaction

The system supports advanced user query flows:

Next: Show more documents matching the last query.

More: Drill into summaries, quotes, or insights.

Exit: Terminate cleanly.

12. Optimized Data Ingestion Pipeline

Legal documents are processed through a multi-step pipeline:

Preprocessing and parsing (e.g., sectioning and clause extraction).

Embedding generation.

Graph construction for symbolic reasoning.

Vector index creation using Annoy.
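The first stage above can be sketched as a sectioning pass (the regex and sample text are simplifying assumptions; the real parsers live in src/preprocess/):

```python
import re

RAW = """First Amendment
Congress shall make no law respecting an establishment of religion...
Second Amendment
A well regulated Militia, being necessary to the security of a free State..."""

def parse_sections(text):
    """Split raw text into titled sections ready for embedding."""
    sections, current = [], None
    for line in text.splitlines():
        # Treat lines like "First Amendment" as section titles (toy heuristic).
        if re.match(r"^[A-Z][a-z]+ Amendment$", line.strip()):
            current = {"title": line.strip(), "text": ""}
            sections.append(current)
        elif current is not None:
            current["text"] += line.strip() + " "
    return sections

docs = parse_sections(RAW)
print([d["title"] for d in docs])  # ['First Amendment', 'Second Amendment']
```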

13. Scalable and Secure Architecture

R-Flow is cloud-ready and can be deployed to serve enterprise or academic use cases. It’s built with scalability and modular security, supporting both local development and secure cloud-hosted MongoDB setups.

Project Structure

  • outdated
R-Flow/
|   .env
|   .gitignore
|   License.md
|   README.md
|   requirements.txt
|
+---Data
|   +---github-img
|   |       logo.png
|   |       Query_Reformation_Graph.png
|   |       Query_Reformation_Plot.png
|   |       US_Constitution_Graph.png
|   |       US_Constitution_plot.png
|   |
|   +---Indexing
|   |   +---annoy
|   |   |       usc.ann
|   |   |       usc_id_map.pkl
|   |   |
|   |   \---faiss
|   |           usc.index
|   |           usc.pkl
|   |
|   +---knowledge
|   |       Us_Constitution.json
|   |
|   +---Mongo
|   |   \---dump
|   |       \---ai_rag_db
|   |               search_limits.bson
|   |               search_limits.metadata.json
|   |               User_queries.bson
|   |               User_queries.metadata.json
|   |               User_queries_annoy.bson
|   |               User_queries_annoy.metadata.json
|   |               us_constitution_annoy.bson
|   |               us_constitution_annoy.metadata.json
|   |               us_constitution_embedding.bson
|   |               us_constitution_embedding.metadata.json
|   |               us_constitution_faiss.bson
|   |               us_constitution_faiss.metadata.json
|   |
|   \---visualizations_outputs
|           3D_Visualization_of_Embeddings.html
|           Query_Reformation_Graph.html
|           Query_Reformation_Plot.html
|           US_Constitution_Graph.html
|           US_Constitution_Graph.png
|
+---src
|   +---dependencies
|   |   |   annoySearch.py
|   |   |   config.py
|   |   |   DatabaseHandler.py
|   |   |   faissSearch.py
|   |   |   graph_rag.py
|   |   |   openai_service.py
|   |   |   query_reformulator.py
|   |
|   +---preprocess
|   |   |   build_ANNOY.py
|   |   |   build_FAISS.py
|   |   |   ingest_Australian_Legal_Corpus.py
|   |   |   ingest_Us_constititon.py
|   |   |   update_embedding.py
|   |   |   __init__.py
|   |   |
|   |
|   \---visualizations
|       |   embedding_plot.py
|       |   graph_usc.py
|       |   query_expansions.py
|       |   __init__.py
|       |
|
\---tests
    |   main.py
    |   test.py
    |
    \---pytest

Setup Instructions

1. Clone the Repository

git clone https://github.com/jager47X/R-Flow.git
cd R-Flow

2. 📦 Install Dependencies

  • (Optional) Activate a virtual environment
  • Ensure you have Python installed, then run:
pip install -r requirements.txt
python -m spacy download en_core_web_sm

If you do not have MongoDB: 🔗 Install MongoDB

3. Configure Environment Variables

Create a .env file in the project root with the following content, or use the provided .env if you have access.

Obtain an API key from OpenAI (optional):

OPENAI_API_KEY=your_openai_api_key_here # OpenAI API key
MONGO_URI=mongodb://localhost:27017/   # Local 

4. Restore MongoDB Backup (Optional)

To use the provided MongoDB backup, run the following command:

mongorestore --db rag_db Data/Mongo/dump/ai_rag_db

Ensure that the MongoDB server is running before executing the command.

5. Download Dataset

  • Once downloaded, place it under the Corpus folder in the project root

Open Australian Legal Corpus, available on Kaggle: 🔗 Open Australian Legal Corpus

Constitution of the United States: 🔗 Constitution of the United States

6. Check Data Structure of the dataset in config.py

-Outdated-

# config.py
import os
from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # Load variables from .env

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
MONGO_URI = os.getenv("MONGO_URI", "mongodb://localhost:27017/")
EMBEDDING_MODEL = "text-embedding-3-large"
EMBEDDING_DIMENSIONS = 3072
THRESHOLD_QUERY_SEARCH_MIN = 0.38  # Rephrasing threshold
THRESHOLD_QUERY_SEARCH = 0.48       
TOP_QUERY_RESULT = 10              # Number of query results at once
LIMIT = 10000                      # Limit of requests per day
AUSLEGAL_DOCUMENT_PATH="./Data/knowledge/Open_Australian_Legal_Corpus.jsonl"
USCON_DOCUMENT_PATH="./Data/knowledge/Us_Constitution.json"
DB_NAME = "ai_rag_db"
QUERY_COLLECTION_NAME = "User_queries"
CHATMODEL="gpt-4o"  # ChatGPT model to use
# For MongoDB connection
# For Dataset

COLLECTION = {
    "US_CONSTITUTION_SET": {
        "db_name": DB_NAME,
        "query_collection_name": QUERY_COLLECTION_NAME,
        "embedding_collection_name": "us_constitution_embedding",
        "annoy_collection_name": "us_constitution_annoy",
        "faiss_collection_name": "us_constitution_faiss",
        "faiss_index_path": "Data/Indexing/faiss/usc.index",
        "annoy_index_path": "Data/Indexing/annoy/usc.ann",
        "id_map_path": "Data/Indexing/annoy/usc_id_map.pkl",
        "faiss_id_map_path": "Data/Indexing/faiss/usc.pkl",
        "document_type": "US Constitution", 
        "unique_index": "title",  
    },
    "AUS_LAW_SET": {
        "db_name": DB_NAME,
        "query_collection_name": QUERY_COLLECTION_NAME,
        "embedding_collection_name": "Australian_Law_2024_embedding",
        "annoy_collection_name": "Australian_Law_2024_annoy",
        "faiss_collection_name": "Australian_Law_2024_faiss",
        "faiss_index_path": "Data/Indexing/faiss/auslaw.index",
        "annoy_index_path": "Data/Indexing/annoy/auslaw.ann",
        "id_map_path": "Data/Indexing/annoy/aus_id_map.pkl",
        "faiss_id_map_path": "Data/Indexing/faiss/aus_id_map.pkl",
        "document_type": "Australia Laws 2024",  
        "unique_index": "version_id"
    }
}

Usage

A. Use the Mongo backup data from Data/Mongo/dump/ai_rag_db

B. Build from scratch

B.1 Ingest Data

  • Each parser needs to be implemented beforehand; then load your JSONL data into MongoDB by running:

Input

python -m src.preprocess.ingest_Us_constititon # For the US Constitution
# Or write your own ingest script for MongoDB (at least one unique ID is required, such as title, version_id, etc.)
# The Australian corpus is used for testing

Output

[INFO] Connected to MongoDB with write concern w=0.
[INFO] Indexes dropped temporarily.
[INFO] Found 52 documents with a title.
[INFO] Found 0 existing titles in the collection.
[INFO] Inserted 52 new documents using bulk unordered insert.
[INFO] Re-created unique index on 'title'.
[INFO] JSON data ingestion complete.
[INFO] Processed: 52 new documents.
[INFO] Skipped: 0 documents with missing 'title' or duplicates.
[INFO] MongoDB connection closed.

B.2 Create Embedding

  • Create embeddings and insert them into the database using OpenAI's embedding API. Optionally, a sentence transformer can be used locally.

Input

python -m src.preprocess.update_embedding

Output

Generating embedding for text...
[INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Embedding generated.
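The embedding call in B.2 can be sketched as follows, assuming the OpenAI v1 Python client interface (generate_embedding and the injectable client argument are illustrative, not the project's actual code):

```python
EMBEDDING_MODEL = "text-embedding-3-large"  # 3072 dimensions, per config.py

def generate_embedding(text: str, client, model: str = EMBEDDING_MODEL):
    """Return the embedding vector for `text` using the given client.

    The client is passed in (rather than constructed here) so it can be
    stubbed out in tests without making network calls.
    """
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding

# In production (assumption, not project code):
#   from openai import OpenAI
#   vector = generate_embedding(doc["text"], OpenAI(api_key=OPENAI_API_KEY))
```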

B.3 Build Search Engine

  • Build the Annoy index for each collection separately; each index is then loaded as a search engine.

Input

python -m src.preprocess.build_ANNOY #for annoy
python -m src.preprocess.build_FAISS #for faiss

Output

[INFO] Available configurations:
[INFO] 1: US Constitution
[INFO] 2: Australia Laws 2024
[INFO] Enter configuration number: 1
[INFO] Using configuration: US Constitution
[INFO] Selected configuration details: {
    "db_name": "ai_rag_db",
    "query_collection_name": "User_queries",
    "embedding_collection_name": "us_constitution_embedding",
    "annoy_collection_name": "us_constitution_annoy",
    "annoy_index_path": "./annoy/usc.ann",
    "id_map_path": "./annoy/usc_id_map.pkl",
    "document_type": "US Constitution",
    "unique_index": "title"
}
[INFO] ANNOY_INDEX_PATH: ./annoy/usc.ann
[INFO] ID_MAP_PATH: ./annoy/usc_id_map.pkl
[INFO] Fetched 52 documents with embeddings from 'us_constitution_embedding'.
[INFO] Annoy index built and saved to ./annoy/usc.ann
[INFO] ID map saved to file: ./annoy/usc_id_map.pkl
[INFO] Cleared previous documents from collection 'us_constitution_annoy'.
[INFO] Inserted 52 documents into 'us_constitution_annoy'.
[INFO] MongoDB connection closed.

2. Testing

  • Create test.py, then launch the main application to handle user queries:

Q&A Testing:

  1. Dummy testing (Bryan): automated query generation using an LLM; capture the output and save each query/output pair to CSV.
  2. Modify the dummy testing to import the CSV and visualize the data, including:
  • normal RAG vs. our RAG
  • benchmark of FAISS vs. Annoy

Idea: we want to assert the values for each key:

  • query from the CSV or dummy set (as key)
  • assert the titles of the retrieved documents, as a list, against the CSV titles (as values)

Relationships:

  • Many (queries) to many (documents): multiple unique queries map to multiple result documents

Example:

# main.py
import json
import logging
from src.dependencies.DatabaseHandler import DatabaseHandler
from src.dependencies.openai_service import ChatGPT
from src.dependencies.config import COLLECTION

# Set up logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)

def display_source(case):
    """
    Prints additional details about a case, excluding '_id' and 'map_id'.
    """
    print("\n--- Source ---")
    for key, value in case.items():
        if key not in ["_id", "map_id"]:
            print(f"{key}: {value}")
    print("--- End Source ---\n")


def main():
    logger.info("Starting RAG assistant interactive session...")

    # List available configurations from COLLECTION.
    keys = list(COLLECTION.keys())
    print("Available configurations:")
    for i, key in enumerate(keys, start=1):
        doc_type = COLLECTION[key].get("document_type", "Unknown")
        print(f"{i}: {doc_type}")
    
    # Let the user choose a configuration by number.
    try:
        selected_num = int(input("Enter configuration number: ").strip())
        if selected_num < 1 or selected_num > len(keys):
            raise ValueError("Selection out of range")
    except Exception as e:
        logger.warning("Invalid configuration number. Defaulting to 1.")
        selected_num = 1

    config = COLLECTION[keys[selected_num - 1]]
    logger.info("Using configuration: %s", config["document_type"])
    print(f"Using configuration: {config['document_type']}")
    print("Selected configuration details:")
    print(json.dumps(config, indent=4, default=list))

    db_handler = DatabaseHandler(config)
    chat_service = ChatGPT(db_handler.db)
    
    last_query_results = None
    current_idx = 0
    
    while True:
        user_input = input("Enter a query, 'next' for next result, 'source' for details, or 'exit': ").strip().lower()
        logger.info("User input: %s", user_input)

        if user_input == "exit":
            logger.info("User exited the session.")
            break
        elif user_input == "next":
            if not last_query_results:
                logger.info("No previous query found for 'next' command.")
                print("No previous query found. Please enter a new query first.")
                continue
            if current_idx >= len(last_query_results):
                logger.info("No more results available in current query result set.")
                print("No source results for this query. Enter a new query.")
                continue
            case, similarity = last_query_results[current_idx]
            current_idx += 1
        elif user_input == "source":
            if not last_query_results or current_idx == 0:
                logger.info("No result available to show source details.")
                print("No result available for the source. Enter a query first.")
                continue
            case, similarity = last_query_results[current_idx - 1]
            logger.info("Displaying more details for current case.")
            display_source(case)
            continue
        else:
            logger.info("Processing new query: %s", user_input)
            last_query_results, processed = db_handler.process_query(user_input)
            if not processed:
                logger.warning("Daily search limit reached during query processing.")
                print("Daily search limit reached. Exiting.")
                break
            if not last_query_results:
                logger.info("No results returned for query: %s", user_input)
                print("No results found. Try another query.")
                continue
            current_idx = 0
            case, similarity = last_query_results[current_idx]
            current_idx += 1

        logger.info("Generating summary for result index %d with similarity %.4f", current_idx - 1, similarity)
        summary = chat_service.summarize_cases(case,query=user_input)
        print(f"\nSummary (Similarity: {similarity:.2f}):\n{summary}\n")

    logger.info("Session terminated. Connection closed.")
    print("Goodbye!")

if __name__ == "__main__":
    main()

Input

python -m tests.main

Output

Available configurations:
1: US Constitution
2: Australia Laws 2024
Enter configuration number: 1
Using configuration: US Constitution
Selected configuration details:
{
    "db_name": "ai_rag_db",
    "query_collection_name": "User_queries",
    "embedding_collection_name": "us_constitution_embedding",
    "annoy_collection_name": "us_constitution_annoy",
    "annoy_index_path": "./annoy/usc.ann",
    "id_map_path": "./annoy/usc_id_map.pkl",
    "document_type": "US Constitution",
    "unique_index": "title"
}
[INFO] Annoy index loaded from ./annoy/usc.ann
[INFO] ID map loaded from ./annoy/usc_id_map.pkl
[INFO] Annoy index and ID map loaded successfully.
[INFO] DatabaseHandler initialized with configuration: US Constitution
Enter a query, 'next' for next result, 'more' for details, or 'exit': most important right
Processing query...
[INFO] User query: most important right
[INFO] Querying...
[INFO] Today's date: 2025-03-04
[INFO] Usage of today: 15
[INFO] Generating embedding for text...
[INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[INFO] Embedding generated.
[INFO] Today's date: 2025-03-04
[INFO] Updated usage of today to 16.
[INFO] Stored new query embedding in MongoDB.
[INFO] Searching in the vector database for up to 10 results.
[INFO] Searching for similar documents...
[INFO] Annoy returned 10 indices.
[INFO] Search complete. 1 documents returned.
[INFO] Query: most important right | Document [title]: First Amendment | Similarity: 0.43
[INFO] Generating summary for case with title: First Amendment
[INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[INFO] Summary generated successfully for case with _id: 67c6ba0c21385f76c48478de
[INFO] Updated summary in database for case with _id: 67c6ba0c21385f76c48478de
[INFO] Today's date: 2025-03-04
[INFO] Usage of today: 31
[INFO] Today's date: 2025-03-04
[INFO] Updated usage of today to 32.

Summary (Similarity: 0.43):
This excerpt is from the First Amendment to the United States Constitution, which outlines several fundamental rights and freedoms. It prohibits Congress from enacting legislation that establishes any religion or impedes the free practice of religion. Additionally, it safeguards freedoms related to speech, the press, peaceful assembly, and the right to petition the government to address grievances.

Top Insights:
1. **Religious Freedom**: The First Amendment ensures both the prevention of a government-mandated religion and the protection of individuals' rights to practice religion as they choose.
2. **Freedom of Speech and Press**: These freedoms are crucial for a democratic society, allowing individuals to express themselves and share information without undue government interference.
3. **Right to Assemble and Petition**: Citizens have the right to gather peacefully and to seek governmental changes, which supports public participation in democracy and accountability from elected officials.

Enter a query, 'next' for next result, 'more' for details, or 'exit': more

--- More Details ---
article: Amendment
section:
title: First Amendment
text: Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.
summary: This excerpt is from the First Amendment to the United States Constitution, which outlines several fundamental rights and freedoms. It prohibits Congress from enacting legislation that establishes any religion or impedes the free practice of religion. Additionally, it safeguards freedoms related to speech, the press, peaceful assembly, and the right to petition the government to address grievances.

Top Insights:
1. **Religious Freedom**: The First Amendment ensures both the prevention of a government-mandated religion and the protection of individuals' rights to practice religion as they choose.
2. **Freedom of Speech and Press**: These freedoms are crucial for a democratic society, allowing individuals to express themselves and share information without undue government interference.
3. **Right to Assemble and Petition**: Citizens have the right to gather peacefully and to seek governmental changes, which supports public participation in democracy and accountability from elected officials.
--- End Details ---

Enter a query, 'next' for next result, 'more' for details, or 'exit': exit
Goodbye!

Customization

  • Embedding Model: Change the EMBEDDING_MODEL in your .env file to use a different OpenAI model, or compute locally using SentenceTransformer if needed.

  • MongoDB Configuration: Adjust the MONGO_URI in your .env file to connect to a different MongoDB instance.

  • Annoy Settings: Tweak parameters such as VECTOR_SIZE and ANNOY_TREE_COUNT in build_ANNOY.py to suit your data and performance requirements.

  • Summarization Prompt: Modify the prompt in summarizer.py to tailor the summarization output.

  • More Databases: To add more customized data, follow these steps:

    1. Check Data Structure of the dataset in config.py and add the dataset
    2. Ingest Data
    3. Create Embedding
    4. Build Search Engine
  • Other Configuration:

    • THRESHOLD_QUERY_SEARCH - Threshold on the cosine similarity of the search
    • TOP_QUERY_RESULT - Number of query results retrieved at once
    • LIMIT - Limit on queries per day
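For step 1 of adding a new dataset, a new entry added to COLLECTION in config.py might look like this (all collection names and paths below are hypothetical placeholders, not real project files):

```python
# Hypothetical COLLECTION entry for a new dataset (placeholder names).
NEW_SET = {
    "db_name": "ai_rag_db",
    "query_collection_name": "User_queries",
    "embedding_collection_name": "uk_law_embedding",
    "annoy_collection_name": "uk_law_annoy",
    "annoy_index_path": "Data/Indexing/annoy/uklaw.ann",
    "id_map_path": "Data/Indexing/annoy/uk_id_map.pkl",
    "document_type": "UK Laws",
    "unique_index": "section_id",  # at least one unique ID is required
}
# Then register it (COLLECTION["UK_LAW_SET"] = NEW_SET) and run the
# ingest -> embedding -> build-search-engine steps for the new set.
```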

License

This project is licensed under MIT License

Contributors

Logo Designer: Ambre Grimault

Q&A Engineer: Bryan Lee

Supervised by Professor Okada, Professor Mateescu

Testers: Jordan Rosenbarg (Chief Law Clerk), Ethan Oppenhaimer (Law Student), Gianluca Allesina

Contact

Contributions, issues, and feature requests are welcome! Please check the issues page for known issues and to submit new ones.
