Open-Source PDF Assistant: This tool allows users to ask questions based on the content of a PDF by simply providing a link to the document. It leverages Docker to create a vector database using pgvector for efficient text retrieval, ensuring unlimited queries without OpenAI embedder limitations. 🚀📄

tejas-130704/PDF_Assistant


🚀 AI-Powered Knowledge Base with pgvector and Sentence Transformer 🧠📚

🌟 Overview

This project uses pgvector for vector storage and retrieval, paired with an open-source sentence transformer for text embedding. The default OpenAI embedder requires an API key, and its usage can exhaust your credits, so the system is configured to use a sentence transformer model that produces 1024-dimensional embeddings instead of the default 1536.

🔥 Open-Source PDF Assistant

I created this open-source PDF assistant to remove the limits imposed by OpenAI's embedder. Embedding no longer depends on OpenAI credits, so documents can be processed and queried without embedder quotas. The app still asks for a GROQ_API_KEY when you load the knowledge base (Step 5).

🛠️ Setup Instructions

✅ Prerequisites

  • 🐳 Docker & Docker Compose installed
  • 🐍 Python 3.8+ installed
  • 🗄️ PostgreSQL with pgvector extension enabled
  • 🌐 Streamlit for the frontend

🚀 Running the Project

Step 0: 🔗 Clone the GitHub Repository

First, clone the project repository from GitHub:

git clone https://github.com/tejas-130704/PDF_Assistant.git
cd PDF_Assistant

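Step 1: 🏗️ Start pgvector with Docker

Check that your docker-compose.yaml file defines the pgvector service with the database name and user used in Step 2 (mydb and root), then start it in the background. With Docker Compose v2, the equivalent command is docker compose up -d.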

docker-compose up -d

Step 2: 🗄️ Configure the Database

Once the Docker container is running, open a psql session inside it:

docker exec -it <container_id> psql -U root -d mydb

Replace <container_id> with the actual container ID (can be found using docker ps).
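
If you would rather not copy the ID by hand, the lookup can be scripted. The sketch below parses the output of docker ps and picks the first container whose image name contains a given substring. The substring "pgvector" and the helper names are assumptions for illustration; the actual image name depends on your docker-compose.yaml.

```python
import subprocess

# Tab-separated Go template understood by `docker ps --format`.
PS_FORMAT = "{{.ID}}\t{{.Image}}\t{{.Names}}"


def parse_docker_ps(output):
    """Turn tab-separated `docker ps` rows into a list of dicts."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        container_id, image, name = line.split("\t", 2)
        rows.append({"id": container_id, "image": image, "name": name})
    return rows


def find_container(rows, image_substring="pgvector"):
    """Return the ID of the first container whose image contains the
    substring, or None. "pgvector" is an assumed image name fragment."""
    for row in rows:
        if image_substring in row["image"]:
            return row["id"]
    return None


def running_containers():
    """Ask the Docker CLI for running containers (needs Docker on PATH)."""
    result = subprocess.run(
        ["docker", "ps", "--format", PS_FORMAT],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_docker_ps(result.stdout)
```

Calling find_container(running_containers()) returns an ID you can pass to docker exec, or None if no matching container is running.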

The -d mydb flag already connects you to mydb. If you started psql against a different database, switch to it:

\c mydb

Check existing tables:

\dt

Drop the existing embeddings table if it exists:

DROP TABLE IF EXISTS ai.embeddings;

Create the new table with 1024-dimensional embeddings (the statement requires the ai schema and the vector extension to already exist in mydb):

CREATE TABLE ai.embeddings (
    id VARCHAR PRIMARY KEY,
    name VARCHAR NOT NULL,
    meta_data JSONB,
    filters JSONB,
    content TEXT NOT NULL,
    embedding vector(1024), -- Adjusted to match the embedding model dimensions
    usage JSONB,
    content_hash VARCHAR UNIQUE
);
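
Because the column width has to track the embedding model, it can help to generate the statement from a single dimension value rather than editing SQL by hand. The following is a minimal sketch, not part of the repository: the function name embeddings_ddl is hypothetical, and the CREATE EXTENSION and CREATE SCHEMA lines are added on the assumption that they may not exist yet in a fresh database.

```python
def embeddings_ddl(dimensions=1024, schema="ai", table="embeddings"):
    """Build the DDL for the embeddings table with a given vector width.

    The column list mirrors the CREATE TABLE statement above; only the
    vector dimension varies.
    """
    if isinstance(dimensions, bool) or not isinstance(dimensions, int) or dimensions <= 0:
        raise ValueError("dimensions must be a positive integer")
    return "\n".join(
        [
            # Assumption: extension and schema may be missing on a fresh database.
            "CREATE EXTENSION IF NOT EXISTS vector;",
            f"CREATE SCHEMA IF NOT EXISTS {schema};",
            f"DROP TABLE IF EXISTS {schema}.{table};",
            f"CREATE TABLE {schema}.{table} (",
            "    id VARCHAR PRIMARY KEY,",
            "    name VARCHAR NOT NULL,",
            "    meta_data JSONB,",
            "    filters JSONB,",
            "    content TEXT NOT NULL,",
            f"    embedding vector({dimensions}),",
            "    usage JSONB,",
            "    content_hash VARCHAR UNIQUE",
            ");",
        ]
    )


if __name__ == "__main__":
    # Print the statements so they can be pasted into psql.
    print(embeddings_ddl(1024))
```

If you later switch to a model with a different output size, regenerating the DDL with that size keeps the table and the embedder in agreement.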

Verify that the table was created successfully:

\dt ai.*

Step 3: 📦 Install Dependencies

Navigate to your project directory and install required packages:

pip install -r requirements.txt

Step 4: 🚀 Run the Application

Start the Streamlit application:

streamlit run app.py

Once running, open your browser and go to:

http://localhost:8501/
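
If the page does not load, a quick scripted check can tell you whether anything is answering at that address. This is a small sketch using only the standard library; the function name is_reachable is hypothetical and the check only confirms that an HTTP server responds, not that the app itself is healthy.

```python
import urllib.error
import urllib.request


def is_reachable(url="http://localhost:8501/", timeout=2.0):
    """Return True if an HTTP server answers at the URL with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("Streamlit reachable:", is_reachable())
```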

Step 5: 📚 Load the Knowledge Base

  1. Enter your GROQ_API_KEY in the sidebar.
  2. Provide a link to the PDF that contains the knowledge base content.
  3. Click "Load Knowledge Base".
  4. Once you see the message "Knowledge Base Loaded Successfully!", you can start asking questions. 🎉
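
Once the knowledge base is loaded, answering a question comes down to a nearest-neighbour lookup against ai.embeddings. The sketch below shows one way such a lookup could be expressed with pgvector's cosine-distance operator (<=>). Whether the app uses cosine distance is an assumption, the helper names to_vector_literal and build_search_query are hypothetical, and the %s placeholders follow the psycopg parameter style.

```python
def to_vector_literal(values):
    """Format a sequence of numbers in pgvector's text form, e.g. "[0.1,0.2]"."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def build_search_query(query_embedding, top_k=5):
    """Return (sql, params) for a cosine-distance search over ai.embeddings.

    The query embedding must come from the same model that filled the
    table, so that both sides have the same number of dimensions.
    """
    sql = (
        "SELECT name, content "
        "FROM ai.embeddings "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    params = (to_vector_literal(query_embedding), top_k)
    return sql, params


if __name__ == "__main__":
    # Toy 3-dimensional vector, only to show the shape of the output.
    sql, params = build_search_query([0.1, 0.2, 0.3], top_k=3)
    print(sql)
    print(params)
```

The returned pair can be handed to a database cursor's execute method; passing the vector as a parameter avoids building SQL by string concatenation.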

Screenshots

Screenshot 2025-03-01 125025

Screenshot 2025-03-01 125213

🛠️ Troubleshooting

  • ⚠️ If you get an error about mismatched vector dimensions, make sure the embedding column in PostgreSQL is declared as vector(1024), matching the output size of the sentence transformer.
  • 🛑 If OpenAI is still being used, check that your Python script configures the sentence transformer embedder instead of the OpenAI embedder.
  • ✅ If imports fail, confirm that all required dependencies were installed with pip install -r requirements.txt.
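
A dimension mismatch is easier to diagnose before the insert reaches PostgreSQL. The sketch below is one way to fail early with a readable message. The names check_embedding and model_dimensions are hypothetical, and the model name is left as a parameter because this README does not say which sentence transformer is used.

```python
EXPECTED_DIMENSIONS = 1024  # width of the embedding column created in Step 2


def check_embedding(vector, expected=EXPECTED_DIMENSIONS):
    """Raise ValueError if the vector length differs from the column width."""
    actual = len(vector)
    if actual != expected:
        raise ValueError(
            f"Embedding has {actual} dimensions but the table expects {expected}; "
            "recreate the table or switch to a matching model."
        )
    return vector


def model_dimensions(model_name):
    """Report the output size of a sentence-transformers model.

    Imported lazily so the length check above works without the package.
    Loading a model may download its weights.
    """
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name).get_sentence_embedding_dimension()
```

A vector of length 1536 passed to check_embedding usually means the OpenAI embedder is still configured, which matches the second troubleshooting item above.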

🚀 Future Enhancements

  • 🔒 Adding user authentication for secure access
  • 🚀 Implementing cache storage to speed up repeated queries
  • 🎨 Enhancing UI/UX for a more interactive experience

🎖️ Contributors

  • Tejas Narayan Jadhav - GitHub

🤝 Feel free to contribute by submitting pull requests or reporting issues! 🚀
