A simple system for information retrieval and summarization from the internet, a vector database, and uploaded files.
- Use a well-defined workflow instead
- Agentic approaches and tool calling are still very unreliable.
- Identify realistic use case first
- Understand business logic / human task workflow -> identify steps that can leverage IT -> define IT workflow -> identify steps in IT that require ML / AI for better accuracy
- LLMs are quite good at docstring and documentation generation
- generate draft -> verify and modify -> confirm changes -> update when necessary
- A chatbot is an anti-pattern for most use cases
- A well-designed GUI is more effective for those use cases.
- A free-text box still helps in certain use cases, letting users describe their context in more detail.
- Internet search + LLM
- Research topics
- ROI: Time saving from manually searching for information
- Limitations: Cannot access articles behind paywalls. Cannot differentiate slop from legitimate information, especially with more slop being generated by AI nowadays.
- Suggestions: Limit the information sources and give access to paid sources via API
- Summarization
- Summarize articles, emails, meeting notes, etc.
- ROI: Time saving from reading the whole content
- Limitations: May still hallucinate
- Suggestions: Minimize hallucination to an acceptable degree
- Vector Database search + LLM
- Q&A from internal documents (FAQ, technical, laws, etc.)
- ROI: Outsource repetitive questions to AI, freeing up human resources.
- Limitations: May still hallucinate
- Suggestions: Minimize hallucination to an acceptable degree
- Document search
- ROI: Time saving from digging for relevant documents
- Limitations: Retrieval is not 100% accurate
- Suggestions: Use tags for more accurate and faster search
- Invoice data extraction
- Invoice / chat / receipt / etc. data extraction
- ROI: Remove time required for data entry
- Limitations: Not 100% accurate
- Suggestions: Manual verification still required
Note: Due to the hallucination problem, always have a domain expert verify the outputs. Showing the sources for validation helps.
- Internet: General web information retrieval using DuckDuckGo
- News Website: News article search using DuckDuckGo
- Vector Database: RAG-based local document search
- File Upload: Read and retrieve information from uploaded files (currently only supports PDF invoice reading)
- Smart Search Routing: Automatically selects the most appropriate search tool via AI agent
- Vector Store Integration: Efficient document indexing and similarity search
- Intelligent Summarization: Context-aware summarization (100-250 words)
- UI with Streaming Updates: Real-time response streaming in the UI
- Invoice Reading: Extract data from uploaded invoices with Layout Detection + OCR + LLM
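The Smart Search Routing feature can be illustrated with a minimal tool-selection sketch. The tool names, keyword hints, and scoring rule below are illustrative assumptions; the actual system delegates the choice to an AI agent rather than keyword matching.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SearchTool:
    name: str
    description: str
    keywords: tuple  # crude routing hints; the real system lets the agent decide
    run: Callable[[str], str]

# Hypothetical tool registry mirroring the features listed above.
TOOLS = [
    SearchTool("internet", "General web search", ("what", "how", "why"),
               lambda q: f"web results for {q}"),
    SearchTool("news", "News article search", ("news", "latest", "today"),
               lambda q: f"news results for {q}"),
    SearchTool("vector_db", "Local document search", ("document", "budget", "internal"),
               lambda q: f"document hits for {q}"),
]

def route(query: str) -> SearchTool:
    """Pick the tool whose routing hints best overlap the query words
    (a keyword stand-in for the AI agent's decision)."""
    words = set(query.lower().split())
    return max(TOOLS, key=lambda t: len(words & set(t.keywords)))

print(route("latest news on Malaysia Budget 2025").name)  # "news" wins on overlap
```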
The system follows a modular architecture with several key components:
- WebSearcher: Interprets queries and coordinates search tool selection
- Summarizer: Processes search results into coherent summaries
- Invoice Data Extractor: Extracts data from the OCR-ed invoice with its layout structure preserved
- Language Models:
- Qwen 3 (4B params) with AWQ quantization (default)
- Qwen 2.5 (3B params) with AWQ quantization
- SmolLM2 (1.7B params)
- Embedding Model:
- Stella EN 1.5B v5
- Internet and news search via DuckDuckGo
- Vector store retriever for document search
- Tweaked for retrieving Malaysia Budget 2025 information.
- Modify the description in the build_my_budget_retriever function in vector_store_retriever.py to accommodate other document types.
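Since the routing agent picks the vector-store tool based on its description string, repointing the retriever at another document set mostly means rewriting that description. A minimal sketch of the idea (only the function name comes from vector_store_retriever.py; the tool structure and body here are assumptions):

```python
from dataclasses import dataclass

@dataclass
class RetrieverTool:
    name: str
    description: str  # the routing agent reads this to decide when to call the tool

def build_my_budget_retriever() -> RetrieverTool:
    # To target a different document set, change this description so the agent
    # knows when the tool is relevant (and re-index the new documents).
    return RetrieverTool(
        name="vector_db_search",
        description="Searches local documents about Malaysia Budget 2025.",
    )
```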
The workflow operates as follows:
- WebSearcher agent processes the user query
- Appropriate tools are selected and executed
- Results are passed to the Summarizer agent
- Final summary is presented to the user
Note: Invoice data extraction has a separate workflow.
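The four steps above can be sketched as a linear pipeline; web_searcher and summarizer below are placeholders standing in for the actual agents, not the repository's implementation:

```python
def web_searcher(query: str) -> list[str]:
    """Placeholder for the WebSearcher agent: interpret the query,
    select the appropriate tools, and execute them (steps 1-2)."""
    return [f"result for: {query}"]

def summarizer(results: list[str]) -> str:
    """Placeholder for the Summarizer agent: condense tool results
    into a coherent summary (step 3)."""
    return " ".join(results)

def answer(query: str) -> str:
    results = web_searcher(query)
    return summarizer(results)  # step 4: final summary shown to the user

print(answer("Malaysia Budget 2025 highlights"))
```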
- Python 3.11 or higher
- CUDA-capable GPU (at least 12GB VRAM recommended)
- Clone the repository
- Install dependencies:
pip install -r requirements.txt
- Install Chrome and ChromeDriver:
- Download ChromeDriver from Google Chrome for Testing
- Linux: Copy chromedriver to /usr/bin/
- Start the application:
python src/app.py
- Access the web interface (default: http://localhost:7860)
- Upload documents or start chatting to search for information
- Run:
mkdocs serve
- Access the documentation (default: http://localhost:8000)
Not included in this repository, but can be used as a reference.
- ROUGE scores
- A type of statistical method
- Suitable for assessing the summarization quality
- Not 100% reliable but good enough as an indicator with other metrics
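A minimal ROUGE-1 F1 from scratch (no stemming or synonym handling; production code should use a library such as rouge-score):

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1: F1 over overlapping unigrams between a reference
    summary and a generated one."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped common word counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"), 3))  # 0.833
```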
- BLEU scores
- A type of statistical method
- Suitable for measuring translation accuracy (note: this repo does not have a translation feature)
- Again, not 100% reliable but good enough as an indicator with other metrics
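A minimal BLEU-1 sketch: clipped unigram precision with the brevity penalty. Real BLEU averages 1- to 4-gram precisions; use a library such as NLTK in practice:

```python
import math
from collections import Counter

def bleu1(reference: str, candidate: str) -> float:
    """BLEU-1: clipped unigram precision times a brevity penalty that
    punishes candidates shorter than the reference."""
    ref, cand = reference.split(), candidate.split()
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand)
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(round(bleu1("the cat is on the mat", "the cat sat on the mat"), 3))  # 0.833
```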
- LLM as a judge
- Use a different LLM to score the quality of the generated content.
- Not very reliable either.
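A sketch of the prompt side of LLM-as-a-judge. The rubric, scale, and JSON output format here are assumptions, and the actual call to the judge model is omitted since it depends on your LLM stack:

```python
def build_judge_prompt(source: str, summary: str) -> str:
    """Build a rubric-based grading prompt to send to a judge LLM."""
    return (
        "You are grading a summary against its source text.\n"
        "Score each criterion from 1 (poor) to 5 (excellent):\n"
        "- Faithfulness: no claims absent from the source\n"
        "- Coverage: key points of the source are included\n"
        "- Fluency: readable, well-structured prose\n"
        f"Source:\n{source}\n\n"
        f"Summary:\n{summary}\n\n"
        'Reply as JSON: {"faithfulness": n, "coverage": n, "fluency": n}'
    )
```

Structured criteria and a fixed output format make the judge's scores easier to parse and slightly more consistent across runs, though they remain noisy.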
- Human Evaluation
- A more reliable approach; it is recommended to have a team of domain experts perform the evaluation.
Note: Recommended approach
- Use statistical methods and/or LLM-as-a-judge as indicators during model improvement and tweaking.
- Once satisfied with the outcome, perform human evaluation.
- If not good enough, repeat from step 1. Else, finalize the model.
- Precision
- Useful for measuring how many of the detections are wrong (false positives).
- Recall
- Useful for measuring how many items go undetected (false negatives).
- F1 Score
- Useful for measuring overall performance; the harmonic mean of precision and recall.
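The three metrics above, computed from true-positive, false-positive, and false-negative counts:

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision penalizes false positives (wrong detections), recall
    penalizes false negatives (missed items); F1 is their harmonic mean."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# e.g. 8 invoice fields extracted correctly, 2 spurious, 2 missed:
print(detection_metrics(8, 2, 2))  # each metric comes out to 0.8
```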
