Harshitha-arch/wueb-chatbot

🎓 WUEB Document Assistant

A semantic search chatbot for the governing documents of Wroclaw University of Economics and Business (WUEB). It uses OpenAI embeddings to retrieve relevant passages and GPT models to answer from them, so responses stay grounded in university policies and procedures rather than in the model's own guesses.

Python OpenAI Streamlit ChromaDB

✨ Features

  • 🔍 Semantic Search: Uses OpenAI embeddings for understanding document meaning
  • 🛡️ Anti-Hallucination: Only uses information from provided documents
  • 🇵🇱 Polish Language Support: Handles Polish text extraction and processing
  • 📊 Confidence Scoring: Shows confidence levels for transparency
  • 🗄️ Vector Database: ChromaDB for efficient similarity search
  • 🌐 Chat UI: Streamlit interface with a chat-style experience
  • 📚 Document Management: Easy loading and reloading of PDFs
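The Polish text handling mentioned above can be illustrated with a minimal cleaning sketch. It is not the code in pdf_processor.py; the function name `clean_extracted_text` and the three cleaning steps are assumptions chosen to show typical issues with text extracted from Polish PDFs (decomposed diacritics, words hyphenated across line breaks, irregular whitespace).

```python
import re
import unicodedata


def clean_extracted_text(raw: str) -> str:
    """Normalize text extracted from a PDF (hypothetical helper, not the project's code)."""
    # Compose characters such as "a" + combining ogonek into a single "ą",
    # so that identical words compare and embed consistently.
    text = unicodedata.normalize("NFC", raw)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces, tabs and newlines into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(clean_extracted_text("Jakie sa\u0328  wymagania rekru-\ntacyjne?\n"))
```

Rejoining on every `-` followed by a newline can also merge genuine hyphenated compounds that happen to break at the hyphen, so a real pipeline may need a more careful rule.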

🏗️ System Architecture

📄 PDF Documents (Polish)
    ↓
🔍 Text Extraction & Cleaning
    ↓
✂️ Semantic Chunking (1000 chars, 200 overlap)
    ↓
🧠 OpenAI Embeddings (text-embedding-ada-002)
    ↓
🗄️ ChromaDB Vector Database
    ↓
❓ User Query → Embedding → Similarity Search
    ↓
📋 Context Retrieval (Top 5 results)
    ↓
🤖 GPT-3.5-turbo Response Generation
    ↓
✅ Accurate Answer with Confidence Score
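
The chunking step in the pipeline above (1000 characters, 200 overlap) can be sketched as a sliding character window. This is a simplified illustration, not the project's implementation: the function name `chunk_text` is hypothetical, and it splits at fixed character offsets rather than at the semantic boundaries the pipeline describes.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into character windows where neighbours share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each new window starts this many characters later
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "x" * 2500
    print([len(c) for c in chunk_text(sample)])
```

The overlap means a sentence cut at a window boundary still appears whole in at least one neighbouring chunk, at the cost of embedding some text twice.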

🚀 Quick Start

Prerequisites

  • Python 3.8 or higher
  • OpenAI API key
  • WUEB PDF documents

Installation

  1. Clone the repository

    git clone https://github.com/Harshitha-arch/wueb-chatbot.git
    cd wueb-chatbot
  2. Install dependencies

    pip install -r requirements.txt
  3. Set up environment

    cp env_example.txt .env
    # Edit .env file and add your OpenAI API key
  4. Add your PDF documents

    mkdir pdfs
    # Copy your WUEB PDF documents to the pdfs/ directory
  5. Run the application

    python quick_start.py
    # or
    streamlit run app.py

📁 Project Structure

wueb-chatbot/
├── 📄 app.py                 # Main Streamlit application
├── 📄 chatbot.py            # Core chatbot logic
├── 📄 config.py             # Configuration settings
├── 📄 data_loader.py        # PDF processing pipeline
├── 📄 pdf_processor.py      # Text extraction & chunking
├── 📄 vector_store.py       # Vector database operations
├── 🧪 test_chatbot.py       # System testing
├── 🚀 quick_start.py        # Automated setup
├── 📋 requirements.txt       # Python dependencies
├── 📖 README.md             # Project documentation
├── 📝 USAGE_GUIDE.md        # Detailed usage guide
├── ⚙️ setup.py              # Package installation
├── 🚫 .gitignore            # Git ignore rules
├── 📝 env_example.txt       # Environment template
├── 📁 pdfs/                 # PDF documents directory
└── 📁 vector_db/            # ChromaDB vector database

🎯 Usage Examples

Basic Questions

Q: "What are the admission requirements?"
Q: "How do I apply for a program?"
Q: "What are the tuition fees?"

Advanced Queries

Q: "Tell me about the university structure and governance"
Q: "What are the academic calendar dates for 2024?"
Q: "Explain the student rights and responsibilities"

Polish Queries

Q: "Jakie są wymagania rekrutacyjne?"
Q: "Ile kosztuje czesne?"
Q: "Jakie są prawa studentów?"

⚙️ Configuration

Key settings can be modified in config.py:

# Chunking Settings
CHUNK_SIZE = 1000          # Characters per chunk
CHUNK_OVERLAP = 200        # Overlap between chunks

# Search Settings  
TOP_K_RESULTS = 5          # Number of similar documents
SIMILARITY_THRESHOLD = 0.7  # Minimum similarity score

# Model Settings
OPENAI_MODEL = "gpt-3.5-turbo"
EMBEDDING_MODEL = "text-embedding-ada-002"
MAX_TOKENS = 1000
TEMPERATURE = 0.1          # Low for factual responses
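
To show how the search settings interact, here is a small sketch of top-k selection with a similarity cutoff. It assumes the score is cosine similarity between the query embedding and each chunk embedding; the project may compute or scale scores differently (ChromaDB itself reports distances), and the names `cosine_similarity` and `select_chunks` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

TOP_K_RESULTS = 5           # same values as in the settings above
SIMILARITY_THRESHOLD = 0.7


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    top_k: int = TOP_K_RESULTS,
    threshold: float = SIMILARITY_THRESHOLD,
) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs above the threshold, best first, at most top_k."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.9, 0.1]]
    print(select_chunks([1.0, 0.0], vectors))
```

Raising the threshold returns fewer but more relevant chunks; raising the top-k value gives the language model more context at the cost of a longer prompt.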

🧪 Testing

Run the test script:

python test_chatbot.py

This tests:

  • ✅ System initialization
  • ✅ PDF directory validation
  • ✅ Document loading
  • ✅ Chatbot queries
  • ✅ System information

🔧 API Usage

Use the chatbot programmatically:

from chatbot import WUEBChatbot
from data_loader import DataLoader

# Initialize
chatbot = WUEBChatbot()
data_loader = DataLoader()

# Load documents
data_loader.load_documents()

# Ask questions
result = chatbot.process_query("What are the admission requirements?")
print(result['response'])
print(f"Confidence: {result['confidence']}")
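
A caller may want to treat low-confidence results differently from confident ones. The sketch below works on the `response` and `confidence` keys shown above; it assumes `confidence` is a number between 0 and 1, and the helper `format_answer` and its `min_confidence` cutoff are hypothetical, not part of the project's API.

```python
from typing import Any, Dict


def format_answer(result: Dict[str, Any], min_confidence: float = 0.5) -> str:
    """Return the response text, or a fallback message when confidence is too low."""
    confidence = float(result.get("confidence", 0.0))
    response = str(result.get("response", ""))
    if not response or confidence < min_confidence:
        return "I could not find a reliable answer in the loaded documents."
    return f"{response} (confidence: {confidence:.2f})"


if __name__ == "__main__":
    # A hand-written dictionary stands in for the chatbot's result here.
    print(format_answer({"response": "Apply online.", "confidence": 0.9}))
```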

🛡️ Security & Privacy

  • ✅ API keys stored in environment variables
  • ✅ No sensitive data logged
  • ✅ User queries validated
  • ✅ Vector database stored locally
  • ✅ PDF documents remain private
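
As an illustration of the first two points, the sketch below reads the key from the environment and masks it before it could reach a log line. The variable name `OPENAI_API_KEY` is an assumption (check env_example.txt for the name the project expects), and both helper functions are hypothetical.

```python
import os
from typing import Mapping


def get_openai_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Read the API key from the environment and fail early if it is missing."""
    key = env.get("OPENAI_API_KEY", "").strip()  # assumed variable name
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or export it in your shell.")
    return key


def mask_secret(secret: str, visible: int = 3) -> str:
    """Keep only the first few characters so the full key never appears in logs."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


if __name__ == "__main__":
    fake_env = {"OPENAI_API_KEY": "sk-example"}
    print(mask_secret(get_openai_api_key(fake_env)))
```

Values placed in a .env file only become environment variables once something loads them, for example a dotenv loader or the shell.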

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • OpenAI for providing the embedding and language models
  • ChromaDB for the vector database
  • Streamlit for the web interface framework
  • WUEB for the governing documents

📞 Support

For issues and questions:

  • Check the USAGE_GUIDE.md
  • Run python test_chatbot.py for diagnostics
  • Review configuration in config.py
  • Open an issue on GitHub

🎓 Ready to help with WUEB questions!
