A sophisticated semantic search chatbot for Wroclaw University of Economics and Business (WUEB) governing documents. The system uses OpenAI embeddings and GPT models to provide accurate answers grounded strictly in university policies and procedures, with safeguards against hallucination.
- 🔍 Semantic Search: Uses OpenAI embeddings for understanding document meaning
- 🛡️ Anti-Hallucination: Only uses information from provided documents
- 🇵🇱 Polish Language Support: Handles Polish text extraction and processing
- 📊 Confidence Scoring: Shows confidence levels for transparency
- 🗄️ Vector Database: ChromaDB for efficient similarity search
- 🌐 Beautiful UI: Streamlit interface with chat experience
- 📚 Document Management: Easy loading and reloading of PDFs
```
📄 PDF Documents (Polish)
        ↓
🔍 Text Extraction & Cleaning
        ↓
✂️ Semantic Chunking (1000 chars, 200 overlap)
        ↓
🧠 OpenAI Embeddings (text-embedding-ada-002)
        ↓
🗄️ ChromaDB Vector Database
        ↓
❓ User Query → Embedding → Similarity Search
        ↓
📋 Context Retrieval (Top 5 results)
        ↓
🤖 GPT-3.5-turbo Response Generation
        ↓
✅ Accurate Answer with Confidence Score
```
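The chunking step in the pipeline above can be sketched in a few lines. This is a simplified illustration using the sizes from `config.py`; the actual implementation in `pdf_processor.py` may split on semantic boundaries rather than raw character counts:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks, mirroring the semantic-chunking
    step above (defaults match CHUNK_SIZE / CHUNK_OVERLAP in config.py)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text
        start += chunk_size - overlap  # step back by the overlap
    return chunks

# Toy sizes so the overlap is visible on a short string
print([len(c) for c in chunk_text("a" * 25, chunk_size=10, overlap=4)])  # → [10, 10, 10, 7]
```

The overlap ensures that a sentence falling on a chunk boundary still appears intact in at least one chunk, which keeps the embeddings meaningful.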
- Python 3.8 or higher
- OpenAI API key
- WUEB PDF documents
1. Clone the repository

   ```bash
   git clone https://github.com/harshitha/arch/wueb-chatbot.git
   cd wueb-chatbot
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment

   ```bash
   cp env_example.txt .env
   # Edit .env file and add your OpenAI API key
   ```

4. Add your PDF documents

   ```bash
   mkdir pdfs
   # Copy your WUEB PDF documents to the pdfs/ directory
   ```

5. Run the application

   ```bash
   python quick_start.py
   # or
   streamlit run app.py
   ```
```
wueb-chatbot/
├── 📄 app.py             # Main Streamlit application
├── 📄 chatbot.py         # Core chatbot logic
├── 📄 config.py          # Configuration settings
├── 📄 data_loader.py     # PDF processing pipeline
├── 📄 pdf_processor.py   # Text extraction & chunking
├── 📄 vector_store.py    # Vector database operations
├── 🧪 test_chatbot.py    # System testing
├── 🚀 quick_start.py     # Automated setup
├── 📋 requirements.txt   # Python dependencies
├── 📖 README.md          # Project documentation
├── 📝 USAGE_GUIDE.md     # Detailed usage guide
├── ⚙️ setup.py           # Package installation
├── 🚫 .gitignore         # Git ignore rules
├── 📝 env_example.txt    # Environment template
├── 📁 pdfs/              # PDF documents directory
└── 📁 vector_db/         # ChromaDB vector database
```
Q: "What are the admission requirements?"
Q: "How do I apply for a program?"
Q: "What are the tuition fees?"
Q: "Tell me about the university structure and governance"
Q: "What are the academic calendar dates for 2024?"
Q: "Explain the student rights and responsibilities"
Q: "Jakie są wymagania rekrutacyjne?" ("What are the admission requirements?")
Q: "Ile kosztuje czesne?" ("How much is the tuition?")
Q: "Jakie są prawa studentów?" ("What are students' rights?")
Key settings can be modified in `config.py`:

```python
# Chunking Settings
CHUNK_SIZE = 1000           # Characters per chunk
CHUNK_OVERLAP = 200         # Overlap between chunks

# Search Settings
TOP_K_RESULTS = 5           # Number of similar documents
SIMILARITY_THRESHOLD = 0.7  # Minimum similarity score

# Model Settings
OPENAI_MODEL = "gpt-3.5-turbo"
EMBEDDING_MODEL = "text-embedding-ada-002"
MAX_TOKENS = 1000
TEMPERATURE = 0.1           # Low for factual responses
```

Run the comprehensive test suite:

```bash
python test_chatbot.py
```

This tests:
- ✅ System initialization
- ✅ PDF directory validation
- ✅ Document loading
- ✅ Chatbot queries
- ✅ System information
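As an illustration of how the `TOP_K_RESULTS` and `SIMILARITY_THRESHOLD` settings interact during retrieval, here is a toy sketch using 2-D vectors in place of ada-002's 1536-dimensional embeddings. This is not the project's actual ChromaDB query path:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, doc_vecs, top_k=5, threshold=0.7):
    """Score every stored chunk, drop those below the similarity
    threshold, then keep the top_k best matches."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    passing = [(i, s) for i, s in scored if s >= threshold]
    return sorted(passing, key=lambda p: p[1], reverse=True)[:top_k]

# 2-D toy vectors; real ada-002 embeddings have 1536 dimensions
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
print([i for i, score in retrieve((1.0, 0.0), docs, top_k=2)])  # → [0, 1]
```

Raising the threshold trades recall for precision: fewer chunks reach the model, which reduces noise in the prompt but can cause "no relevant documents" responses for loosely worded questions.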
Use the chatbot programmatically:

```python
from chatbot import WUEBChatbot
from data_loader import DataLoader

# Initialize
chatbot = WUEBChatbot()
data_loader = DataLoader()

# Load documents
data_loader.load_documents()

# Ask questions
result = chatbot.process_query("What are the admission requirements?")
print(result['response'])
print(f"Confidence: {result['confidence']}")
```

- ✅ API keys stored in environment variables
- ✅ No sensitive data logged
- ✅ User queries validated
- ✅ Vector database stored locally
- ✅ PDF documents remain private
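The query-validation point above can be sketched as follows; `validate_query` and its limits are illustrative, not the project's actual checks:

```python
def validate_query(query, max_len=500):
    """Trim and bound a user query before it reaches the model.
    (This helper and its max_len are hypothetical, for illustration.)"""
    query = query.strip()
    if not query:
        return None  # reject empty / whitespace-only input
    return query[:max_len]  # cap very long queries

print(validate_query("  What are the tuition fees?  "))
```

Bounding query length also caps token usage per request, which keeps API costs predictable.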
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for providing the embedding and language models
- ChromaDB for the vector database
- Streamlit for the web interface framework
- WUEB for the governing documents
For issues and questions:
- Check the USAGE_GUIDE.md
- Run `python test_chatbot.py` for diagnostics
- Review configuration in `config.py`
- Open an issue on GitHub
🎓 Ready to help with WUEB questions!