A Flask web application for real-time interaction with the FastVLM-0.5B model using ONNX runtime on CPU. Capture live video from your webcam, provide voice or text prompts, and generate concise image descriptions or responses streamed in real-time. Supports continuous voice recognition, prompt locking for automated captioning, and aspect-ratio-preserving video rendering with black bars for unfilled areas.
Try the Hugging Face Space for this app (expect some latency due to free-tier hardware) at https://huggingface.co/spaces/safouaneelg/Talk2FastVLM.
Demo video: DEMO.webm
Follow these steps to set up and run the app locally.
To avoid conflicts, use conda or Python venv to create an isolated environment:
Using Conda:

```bash
conda create -n talk2fastvlm python=3.10
conda activate talk2fastvlm
```

Using venv:

```bash
python -m venv talk2fastvlm
source talk2fastvlm/bin/activate  # On Windows: talk2fastvlm\Scripts\activate
```

Clone the repository:

```bash
git clone https://github.com/safouaneelg/Talk2FastVLM.git
cd Talk2FastVLM/
```

The model files are hosted on Hugging Face and require Git LFS for large files:
```bash
git lfs install   # one-time setup if LFS is not already installed
git lfs clone https://huggingface.co/onnx-community/FastVLM-0.5B-ONNX
```

This downloads the ONNX model files to the `./FastVLM-0.5B-ONNX/onnx/` directory, along with the model tokenizer, configs, etc.
You can also manually download the three quantized ONNX files (vision encoder, decoder, and embed tokens) and store them in `FastVLM-0.5B-ONNX/onnx/`.
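If you prefer a scripted download over Git LFS, a minimal sketch using `huggingface_hub` (assuming `pip install huggingface_hub`, the default `q4f16` variant, and the file naming used in `load_model()` further below) could look like this:

```python
# Sketch: download only the three q4f16 ONNX files into ./FastVLM-0.5B-ONNX/onnx/.
# Assumes the repo layout onnx/<component>_q4f16.onnx; adjust names for other variants.
from huggingface_hub import hf_hub_download

REPO_ID = "onnx-community/FastVLM-0.5B-ONNX"
FILES = [
    "onnx/vision_encoder_q4f16.onnx",
    "onnx/embed_tokens_q4f16.onnx",
    "onnx/decoder_model_merged_q4f16.onnx",
]

for filename in FILES:
    # local_dir mirrors the repo layout, so files land under FastVLM-0.5B-ONNX/onnx/
    path = hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir="FastVLM-0.5B-ONNX")
    print("Downloaded:", path)
```

Note that the tokenizer and config files from the same repo are still needed; the Git LFS clone above includes them automatically.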
Install the dependencies:

```bash
pip install -r requirements.txt
```

Start the Flask server:

```bash
python app.py
```

The app will be available at http://localhost:7860. Open this URL in a modern browser (Chrome or Firefox recommended for speech recognition).
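For reference, a minimal sketch of what the Flask entrypoint is assumed to look like (the actual `app.py` additionally loads the model and defines the video/prompt/streaming routes):

```python
# Minimal sketch of the Flask entrypoint (assumed; check app.py for the real arguments).
# Port 7860 matches the Hugging Face Spaces default.
from flask import Flask

app = Flask(__name__)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=7860)
```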
- Access Permissions: Grant camera and microphone access when prompted.
- Voice Mode: Enabled by default; click the mic icon to toggle off/on.
- Prompt Locking: Use the lock button to fix a prompt and enable auto-captioning every 5 seconds.
The app uses the q4f16 quantized version by default for a balance of speed and quality (282 MB decoder, 272 MB embed, 253 MB vision). To switch quantizations:
- Edit `app.py` and update the file names in `load_model()`:

  ```python
  vision_session = ort.InferenceSession(os.path.join(onnx_path, "vision_encoder_<variant>.onnx"), providers=providers)
  embed_session = ort.InferenceSession(os.path.join(onnx_path, "embed_tokens_<variant>.onnx"), providers=providers)
  decoder_session = ort.InferenceSession(os.path.join(onnx_path, "decoder_model_merged_<variant>.onnx"), providers=providers)
  ```

  Replace `<variant>` with one of the available options below (see also the sketch after these steps).
- Restart the app (`python app.py`).
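To avoid editing three lines each time, one possible refactor (a sketch, not the repo's actual `load_model()` code) keeps the variant in a single constant:

```python
# Sketch: parameterize the quantization variant so switching touches one line only.
import os
import onnxruntime as ort

VARIANT = "q4f16"  # e.g. "fp16", "int8", "q4", "uint8", "bnb4", "quantized"
onnx_path = "FastVLM-0.5B-ONNX/onnx"
providers = ["CPUExecutionProvider"]

def make_session(component: str) -> ort.InferenceSession:
    # File names follow the repo convention: <component>_<variant>.onnx
    return ort.InferenceSession(
        os.path.join(onnx_path, f"{component}_{VARIANT}.onnx"), providers=providers
    )

vision_session = make_session("vision_encoder")
embed_session = make_session("embed_tokens")
decoder_session = make_session("decoder_model_merged")
```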
Thanks to onnx-community, these are the ONNX model files available from the Hugging Face repo.
| Variant | Decoder Size | Embed Size | Vision Size |
|---|---|---|---|
| `q4f16` (default) | 282 MB | 272 MB | 253 MB |
| `bnb4` | 287 MB | 544 MB | 505 MB |
| `fp16` | 992 MB | 272 MB | 253 MB |
| `int8` | 503 MB | 136 MB | 223 MB |
| `q4` | 317 MB | 544 MB | 505 MB |
| `quantized` | 503 MB | 136 MB | 223 MB |
| `uint8` | 503 MB | 136 MB | 223 MB |
- Recommendations:
  - `q4f16` for most users (fast on CPU).
  - `fp16` for better accuracy.
  - Lower-bit variants like `int8` or `uint8` for memory-constrained setups.
- Open http://localhost:7860 in your browser.
- Allow camera/mic access.
- Voice Input: Speak your prompt (e.g., "Describe my gesture"); it auto-fills the text area and triggers generation.
- Text Input: Type a prompt (default: "Describe this image in detail, focusing on any visible hands or gestures") and press Enter or click Send.
- Lock Mode: Click the lock icon to fix the typed prompt and enable continuous captioning (updates every 5 seconds).
- Captions stream in real-time below the video.
The system prompt encourages concise, one-sentence responses focused on visible actions (e.g., hands/gestures).
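For illustration only, a system prompt in that spirit might look like the following (the exact wording lives in `app.py` and may differ):

```python
# Illustrative system prompt; not necessarily the exact string used by the app.
SYSTEM_PROMPT = (
    "You are a concise visual assistant. Answer in one short sentence, "
    "focusing on visible actions such as hands and gestures."
)
```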
- Speech Recognition Issues: Ensure you are on HTTPS or localhost; check the browser console for errors. Speech recognition is not supported in all browsers.
- Model Loading Errors: Verify that the Git LFS download completed all files (~8 GB total), or manually download the required ONNX files and store them in the `./FastVLM-0.5B-ONNX/onnx/` folder (see the check below).
- Port Conflicts: Port `7860` was chosen to match Hugging Face Spaces, but you can change it in `app.py` if needed.
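One quick way to check for an incomplete LFS download is to look at file sizes: real ONNX payloads are tens to hundreds of MB, while LFS pointer stubs are only a few hundred bytes. A minimal sketch:

```python
# Sketch: flag .onnx files that look like Git LFS pointer stubs instead of real payloads.
import os

onnx_dir = "FastVLM-0.5B-ONNX/onnx"
for name in sorted(os.listdir(onnx_dir)):
    if name.endswith(".onnx"):
        size_mb = os.path.getsize(os.path.join(onnx_dir, name)) / 1e6
        status = "OK" if size_mb > 1 else "pointer stub? try `git lfs pull`"
        print(f"{name}: {size_mb:.0f} MB ({status})")
```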
- Built on FastVLM.
- Original FastVLM repo: apple/ml-fastvlm