Live Demo: https://webpizza-ai-poc.vercel.app/
⚠️ Experimental POC: This is a proof-of-concept for testing purposes only. It may contain bugs and errors. Loosely inspired by DataPizza AI.
100% Client-Side AI Document Chat - No servers, no APIs, complete privacy.
Chat with your PDF documents using AI that runs entirely in your browser via WebGPU.
- 100% Private: All processing happens in your browser; your documents never leave your device
- Dual Engine: Choose between standard WebLLM or optimized WeInfer (~3.76x faster)
- Multiple Models: Phi-3, Llama 3, Mistral 7B, Qwen, Gemma
- PDF Support: Upload and chat with your PDF documents
- RAG Pipeline: Retrieval-augmented generation with vector search
- Local Storage: Documents cached in IndexedDB for instant access
- WebGPU Accelerated: Uses your GPU for fast inference
- Frontend: Angular 20.3.0
- LLM Engines:
  - WebLLM v0.2.79 (Standard)
  - WeInfer (Optimized fork with buffer reuse + async pipeline)
- Embeddings: Transformers.js v2.17.2 (all-MiniLM-L6-v2)
- PDF Parsing: PDF.js v5.4.296
- Vector Store: IndexedDB with cosine similarity
- Compute: WebGPU / WebAssembly
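The "IndexedDB with cosine similarity" vector store can be pictured with the minimal sketch below. It is not the project's actual code: the `StoredChunk` shape and the `topKChunks` helper are hypothetical names, and the chunks are assumed to have already been loaded from IndexedDB into memory with their embeddings (all-MiniLM-L6-v2 produces 384-dimensional vectors).

```typescript
// Hypothetical record shape: one embedded chunk of a PDF (an assumption, not the project's schema).
interface StoredChunk {
  text: string;
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force search: score every chunk against the query and keep the k best.
function topKChunks(query: number[], chunks: StoredChunk[], k: number): StoredChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}
```

A linear scan like this is usually adequate for the few hundred chunks a single PDF yields; an approximate index would only matter for much larger collections.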
- Modern browser with WebGPU support (Chrome 113+, Edge 113+)
- 4GB+ RAM available
- Modern GPU (Intel HD 5500+, NVIDIA GTX 650+, AMD HD 7750+, Apple M1+)
# Install dependencies
npm install
# Start dev server
npm start
# Build for production
npm run build
- Open chrome://flags or edge://flags
- Search for "WebGPU"
- Enable "Unsafe WebGPU"
- Restart browser
Check your browser: https://webgpureport.org/
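The same check can be done from code before loading a model. The sketch below relies only on the standard `navigator.gpu.requestAdapter()` call; the `NavigatorLike` type, the `detectWebGpu` name, and the three status strings are illustrative assumptions, introduced so the snippet compiles without the `@webgpu/types` package.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Hypothetical status values for illustration.
type WebGpuStatus = "available" | "no-adapter" | "unsupported";

async function detectWebGpu(nav: NavigatorLike): Promise<WebGpuStatus> {
  // Browsers without WebGPU do not define navigator.gpu at all.
  if (!nav.gpu) {
    return "unsupported";
  }
  try {
    // requestAdapter() resolves to null when no suitable GPU is found.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "available" : "no-adapter";
  } catch {
    return "no-adapter";
  }
}

// In a browser: detectWebGpu(navigator as NavigatorLike).then(console.log);
```

A "no-adapter" result typically points at outdated graphics drivers or a blocklisted GPU rather than a missing browser feature.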
- Select Engine: Choose between WebLLM (standard) or WeInfer (optimized)
- Choose Model: Select an LLM based on your hardware capabilities
- Upload PDF: Drop your document (the first load downloads the selected model, ~1-4GB)
- Ask Questions: Chat with your document using natural language
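Behind the "Ask Questions" step, a RAG app generally combines the retrieved document excerpts with the user's question before calling the model. The sketch below shows one plausible way to do that; `buildRagMessages`, the `ChatMessage` type, and the prompt wording are assumptions for illustration, not the project's actual prompt. The message shape follows the OpenAI-style chat format that WebLLM accepts.

```typescript
// OpenAI-style chat message (the shape WebLLM's chat completion API takes).
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical helper: turn retrieved excerpts plus a question into chat messages.
function buildRagMessages(question: string, retrievedChunks: string[]): ChatMessage[] {
  // Number each excerpt so the model (and the reader) can refer back to it.
  const context = retrievedChunks
    .map((chunk, i) => `[${i + 1}] ${chunk.trim()}`)
    .join("\n\n");
  return [
    {
      role: "system",
      content:
        "Answer using only the document excerpts provided. If the answer is not in them, say so.",
    },
    {
      role: "user",
      content: `Document excerpts:\n${context}\n\nQuestion: ${question}`,
    },
  ];
}
```

Keeping the number of excerpts small matters here: several of the listed models have a 4k-token context window, so the excerpts, the question, and the answer all have to fit within it.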
- No data collection
- No server uploads
- No tracking cookies
- No analytics
- 100% client-side processing
- Your data never leaves your device
See our Privacy Policy and Cookie Policy for details.
- Phi-3 Mini: ~3-6 tokens/sec
- Llama 3.2 1B: ~8-12 tokens/sec
- Mistral 7B: ~2-4 tokens/sec
- ~3.76x faster across all models
- Buffer reuse optimization
- Asynchronous pipeline processing
- GPU sampling optimization
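To see what these figures would mean in practice, the sketch below applies the ~3.76x factor to a baseline range and converts throughput into waiting time. It assumes the per-model tokens/sec figures above are standard WebLLM baselines and that the speedup applies uniformly; both are assumptions, and the results are projections, not measurements. The function names are illustrative.

```typescript
// Speedup factor quoted above for WeInfer relative to standard WebLLM.
const WEINFER_SPEEDUP = 3.76;

// Throughput range in tokens per second.
interface ThroughputRange {
  min: number;
  max: number;
}

// Projected WeInfer range, assuming the speedup applies uniformly (an assumption).
function estimateWeInfer(baseline: ThroughputRange): ThroughputRange {
  return {
    min: baseline.min * WEINFER_SPEEDUP,
    max: baseline.max * WEINFER_SPEEDUP,
  };
}

// Time to generate a reply of the given length at a fixed throughput.
function secondsForTokens(tokens: number, tokensPerSecond: number): number {
  return tokens / tokensPerSecond;
}

// Example: Phi-3 Mini at 3-6 tokens/sec.
const phi3Projected = estimateWeInfer({ min: 3, max: 6 });
console.log(phi3Projected.min.toFixed(1), phi3Projected.max.toFixed(1));
console.log(secondsForTokens(200, 3).toFixed(0), secondsForTokens(200, phi3Projected.min).toFixed(0));
```

Under these assumptions, Phi-3 Mini would go from 3-6 to roughly 11-23 tokens/sec, so a 200-token answer at the slow end would take about 18 seconds instead of about 67.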
# Install Vercel CLI
npm i -g vercel
# Deploy
vercel
The project includes vercel.json with optimal configuration for WebGPU and routing.
Ensure your hosting supports:
- SPA routing (all routes → index.html)
- Cross-Origin headers for WebGPU:
Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Opener-Policy: same-origin
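For hosts other than Vercel, the two requirements above amount to a small piece of request-handling logic. The sketch below expresses it as a pure function that any Node or edge server could call; `resolveRequest`, `ResolvedResponse`, and the file set are hypothetical names for illustration, not part of the project.

```typescript
// The two headers listed above; together they make the page cross-origin isolated.
const CROSS_ORIGIN_HEADERS: Record<string, string> = {
  "Cross-Origin-Embedder-Policy": "require-corp",
  "Cross-Origin-Opener-Policy": "same-origin",
};

interface ResolvedResponse {
  file: string;
  headers: Record<string, string>;
}

// Decide which file to serve and which headers to attach for a request path.
function resolveRequest(urlPath: string, staticFiles: ReadonlySet<string>): ResolvedResponse {
  // Ignore the query string when matching files.
  const pathOnly = urlPath.split("?")[0];
  // SPA fallback: anything that is not a known static file gets index.html,
  // so the Angular router can handle the route on the client.
  const file = staticFiles.has(pathOnly) ? pathOnly : "/index.html";
  return { file, headers: { ...CROSS_ORIGIN_HEADERS } };
}

// Example usage with a hypothetical build output.
const builtFiles = new Set(["/index.html", "/main.js", "/styles.css"]);
console.log(resolveRequest("/chat/123", builtFiles).file);
console.log(resolveRequest("/main.js?v=2", builtFiles).file);
```

Note that with `require-corp`, cross-origin resources must opt in through CORS or a `Cross-Origin-Resource-Policy` header, which is worth verifying for any third-party assets the page loads.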
| Browser | Version | WebGPU Support |
|---|---|---|
| Chrome | 113+ | Full Support |
| Edge | 113+ | Full Support |
| Safari | 18+ | |
| Firefox | - | Not Yet |
- Phi-3-mini-4k-instruct-q4f16_1-MLC (~2GB)
- Llama-3.2-1B-Instruct-q4f16_1-MLC (~1GB)
- Llama-3.2-3B-Instruct-q4f16_1-MLC (~1.5GB)
- Mistral-7B-Instruct-v0.3-q4f16_1-MLC (~4GB)
- Qwen2.5-1.5B-Instruct-q4f16_1-MLC (~1GB)
- Phi-3-mini-4k-instruct-q4f16_1-MLC (~2GB)
- Qwen2-1.5B-Instruct-q4f16_1-MLC (~1GB)
- Mistral-7B-Instruct-v0.3-q4f16_1-MLC (~4GB)
- Llama-3-8B-Instruct-q4f16_1-MLC (~4GB)
- gemma-2b-it-q4f16_1-MLC (~1.2GB)
- Check browser version (Chrome/Edge 113+)
- Enable chrome://flags#enable-unsafe-webgpu
- Update graphics drivers
- Test at https://webgpureport.org/
- Try a smaller model (Llama 3.2 1B, Qwen)
- Use the WeInfer engine for a ~3.76x speedup
- Close other tabs/applications
- Check GPU isn't throttling
- Use smaller models
- Close other browser tabs
- Increase browser memory limit
- Clear browser cache and restart
This is a proof-of-concept project. Contributions, issues, and feature requests are welcome!
MIT License - See LICENSE file for details
Emanuele Strazzullo
- Website: emanuelestrazzullo.dev
- LinkedIn: linkedin.com/in/emanuelestrazzullo
- MLC LLM - WebLLM inference engine
- WeInfer - Optimized WebLLM fork
- Transformers.js - Browser ML library
- PDF.js - PDF parsing
- Hugging Face - Model hosting
Made with ❤️ by Emanuele Strazzullo