A human-centered Visual Question Answering System Prototype
Author: Yuzhang (Leo) Mei
Clarivision is a multi-turn, ambiguity-aware Visual Question Answering (VQA) system designed to assist blind and low-vision individuals (BLVI).
Unlike traditional one-pass VQA systems, Clarivision:
- Detects ambiguity in user questions
- Engages in clarification dialogue
- Supports both multi-turn interaction and one-pass response
- Supports both image and short video input
- Incorporates temporal reasoning across video frames
- Provides voice input (Speech-to-Text) and voice output (Text-to-Speech)
- Includes screen-reader-friendly UI elements
If a question is ambiguous (e.g., multiple similar objects exist), the system asks follow-up clarification questions and presents the candidate referents as buttons before answering.
When ambiguity is present, it generates grouped and structured descriptions instead of a flat answer.
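A grouped, structured description of this kind could be produced roughly as follows (a minimal sketch; the function and field names are illustrative, not the actual `response_generator.py` API):

```python
from collections import defaultdict

def grouped_description(detections):
    """Group detected objects by label and produce a structured summary.

    `detections` is a list of dicts like {"label": "cup", "position": "left"}.
    Labels with more than one instance are enumerated so the user can tell
    that several similar objects are present.
    """
    groups = defaultdict(list)
    for det in detections:
        groups[det["label"]].append(det.get("position", "unknown"))

    lines = []
    for label, positions in groups.items():
        if len(positions) > 1:
            # Ambiguity: several similar objects -> enumerate their positions.
            lines.append(f"{len(positions)} {label}s: " + ", ".join(positions))
        else:
            lines.append(f"1 {label}: {positions[0]}")
    return "; ".join(lines)
```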
For video input, the system:
- Extracts key frames
- Aggregates objects across time
- Detects temporal ambiguity
- Allows time-aware clarification
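Key-frame extraction can be as simple as sampling frame indices at a fixed time stride (a sketch of the idea; the real `video_processor.py` may choose frames differently):

```python
def sample_key_frames(total_frames, fps, interval_sec=1.0):
    """Return (frame_index, timestamp) pairs sampled every `interval_sec`.

    Sampling at fixed time steps keeps later per-frame object detection
    and cross-frame aggregation cheap for short videos.
    """
    stride = max(1, int(fps * interval_sec))
    return [(i, i / fps) for i in range(0, total_frames, stride)]
```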
Accessibility features:
- Keyboard navigation
- Screen-reader-friendly labeling
- Text-to-Speech (automatic answer narration)
- Speech-to-Text input
System components:
- Frontend (React)
- Flask Backend API
- Vision Model (GPT-4V in this prototype; other models are supported as well)
- Ambiguity Detection
- Session-based Multi-turn Dialogue
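The session layer can be a plain in-memory dictionary keyed by session id, consistent with the "no persistent database" limitation noted below (a sketch; the names are illustrative, not the actual `session_store.py` interface):

```python
import uuid

class SessionStore:
    """In-memory store for multi-turn dialogue state (lost on restart)."""

    def __init__(self):
        self._sessions = {}

    def create(self):
        # A fresh session tracks the dialogue history and any clarification
        # question still waiting for the user's answer.
        sid = uuid.uuid4().hex
        self._sessions[sid] = {"history": [], "pending_clarification": None}
        return sid

    def get(self, sid):
        return self._sessions.get(sid)

    def append_turn(self, sid, role, text):
        self._sessions[sid]["history"].append({"role": role, "text": text})

    def end(self, sid):
        # Called when the user clicks "End Session".
        self._sessions.pop(sid, None)
```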
```
Self-made-Visual-Question-Answering-System-Simple/
├── README.md
├── backend/
│   ├── ambiguity.py
│   ├── app.py
│   ├── llm_answer.py
│   ├── openai_vision.py
│   ├── requirements.txt
│   ├── response_generator.py
│   ├── session_store.py
│   ├── temporal_aggregator.py
│   ├── temporal_ambiguity.py
│   └── video_processor.py
├── frontend/
│   └── src/
│       ├── App.jsx
│       └── styles.css
├── .vscode/
│   └── settings.json
└── .gitignore
```
Note: files present in the repo but not listed in the tree can be safely ignored. :)
Clone the repository and enter the backend directory:

```
git clone https://github.com/YuzhangMei/Self-made-Visual-Question-Answering-System-Simple
cd <your-repo>
cd backend
```

Create and activate a virtual environment:

```
python -m venv venv
```

macOS / Linux:

```
source venv/bin/activate
```

Windows:

```
venv\Scripts\activate
```

Install dependencies:

```
pip install -r requirements.txt
```

If no requirements.txt is provided, install manually:

```
pip install flask flask-cors openai
```

Create an environment variable. macOS / Linux:

```
export OPENAI_API_KEY="your_api_key_here"
```

Windows:

```
setx OPENAI_API_KEY "your_api_key_here"
```

Restart the terminal after setting it.

Run the following command in backend/:

```
python app.py
```

Backend runs at: http://localhost:5000.
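At startup the backend presumably reads the key from the environment; a minimal sketch of that pattern, using the standard `os` module (the function name is illustrative):

```python
import os

def load_api_key():
    """Fetch OPENAI_API_KEY from the environment, failing fast if unset.

    Failing at startup gives a clearer error than a mid-request auth failure.
    """
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; see the setup steps above.")
    return key
```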
The following steps set up the frontend (React):

```
cd frontend
npm install
npm run dev
```

Frontend runs at: http://localhost:5173.
- Upload an image or short video.
- Enter a question either by typing or speaking.
- Select mode:
  - Clarify (iterative clarification; multi-turn interaction)
  - One-pass (direct structured answer in a single response)
- If ambiguity is detected, choose the correct object from the buttons given.
- Continue with follow-up questions if desired.
- Click "End Session" to reset.
The system explicitly detects when:
- Multiple similar objects are present
- Temporal ambiguity exists in video
- The user's question underspecifies the referent
It then generates clarification options before answering.
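Detecting an underspecified referent and producing clickable clarification options might look like this (a hedged sketch; not the actual `ambiguity.py` logic):

```python
def clarification_options(question, detections):
    """Return one clarification option per candidate object, or [] if the
    question is unambiguous.

    `detections` is a list of dicts like {"label": "cup", "position": "left"}.
    The question is considered ambiguous when it mentions a label that
    matches more than one detected object.
    """
    words = question.lower().split()
    for label in {d["label"] for d in detections}:
        candidates = [d for d in detections if d["label"] == label]
        if label in words and len(candidates) > 1:
            # One button per candidate, distinguished by position.
            return [f"the {label} on the {d['position']}" for d in candidates]
    return []
```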
For video inputs:
- Frames are sampled
- Objects are detected per frame
- Objects are aggregated across timestamps
- Clarification options include time spans
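The aggregation step above can be sketched as collapsing per-frame detections into labeled time spans (illustrative only; the real `temporal_aggregator.py` may merge labels differently):

```python
def aggregate_over_time(frame_detections):
    """Collapse per-frame detections into {label: (first_seen, last_seen)}.

    `frame_detections` is a list of (timestamp, [labels]) pairs, one per
    sampled frame. The resulting spans let clarification options mention
    when each object appears in the video.
    """
    spans = {}
    for ts, labels in frame_detections:
        for label in labels:
            first, last = spans.get(label, (ts, ts))
            spans[label] = (min(first, ts), max(last, ts))
    return spans
```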
The UI is designed for accessibility:
- Fully keyboard navigable
- Buttons are screen-reader compatible
- Automatic TTS narration
- Built-in Speech Recognition input
Known limitations:
- Temporal spans are based on sampled frames, not continuous tracking.
- Object detection quality depends on the underlying vision model.
- No persistent database (sessions are in-memory).
Planned future work:
- Real-time video stream support
- Object tracking instead of frame-based aggregation
- Improved semantic merging of similar object labels
- Persistent session storage