Clarivision

A human-centered Visual Question Answering System Prototype

Author: Yuzhang (Leo) Mei

Overview

Clarivision is a multi-turn, ambiguity-aware Visual Question Answering (VQA) system designed to assist blind and low-vision individuals (BLVI).

Unlike traditional one-pass VQA systems, Clarivision:

Detects ambiguity in user questions
Engages in clarification dialogue
Supports both multi-turn interaction and one-pass response
Supports both image and short video input
Incorporates temporal reasoning across video frames
Provides voice input (Speech-to-Text) and voice output (Text-to-Speech)
Includes screen-reader-friendly UI elements

Features

✅ Multi-turn Clarification

If a question is ambiguous (e.g., multiple similar objects exist), the system asks a follow-up clarification question and presents the candidate referents as buttons before answering.

✅ One-pass Structured Response for Ambiguity

Generates grouped and structured descriptions when ambiguity is present.
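As a rough illustration of what a grouped description could look like, the sketch below groups detections by label and prints one line per group. The function name `describe_grouped` and the `label` / `position` keys are hypothetical and are not taken from `response_generator.py`.

```python
from collections import defaultdict


def describe_grouped(objects):
    """Group detected objects by label and render one line per group.

    `objects` is a list of dicts with the (assumed) keys "label" and "position".
    """
    groups = defaultdict(list)
    for obj in objects:
        groups[obj["label"]].append(obj["position"])

    lines = []
    for label, positions in groups.items():
        count = len(positions)
        # Naive pluralisation; good enough for a sketch.
        noun = label if count == 1 else f"{label}s"
        lines.append(f"{count} {noun}: " + "; ".join(positions))
    return "\n".join(lines)


detections = [
    {"label": "mug", "position": "left of the laptop"},
    {"label": "mug", "position": "near the window"},
    {"label": "laptop", "position": "center of the desk"},
]
print(describe_grouped(detections))
# 2 mugs: left of the laptop; near the window
# 1 laptop: center of the desk
```

Grouping by label keeps the spoken answer short, which matters when the response is narrated through Text-to-Speech.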

✅ Video Support

Extracts key frames
Aggregates objects across time
Detects temporal ambiguity
Allows time-aware clarification
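One simple way to choose which frames to inspect is to sample evenly spaced timestamps. The sketch below assumes uniform sampling and a hypothetical `max_frames` parameter; the strategy actually used in `video_processor.py` may differ.

```python
def sample_timestamps(duration_s, max_frames=8):
    """Return up to `max_frames` evenly spaced timestamps in seconds.

    Each timestamp sits at the midpoint of its segment, so the first and
    last frames of the clip (often black or blurred) are avoided.
    """
    if duration_s <= 0 or max_frames <= 0:
        return []
    step = duration_s / max_frames
    return [round(step * (i + 0.5), 2) for i in range(max_frames)]


print(sample_timestamps(8.0, max_frames=4))  # [1.0, 3.0, 5.0, 7.0]
```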

✅ Accessibility Features

Keyboard navigation
Screen-reader-friendly labeling
Text-to-Speech (automatic answer narration)
Speech-to-Text input

System Architecture

Frontend (React)
Flask Backend API
Vision Model (GPT-4V in this prototype, other models supported as well)
Ambiguity Detection
Session-based Multi-turn Dialogue
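To make the session component concrete, here is a minimal sketch of an in-memory session store, consistent with the note under Known Limitations that sessions are not persisted. The class and method names are hypothetical and do not describe the actual `session_store.py`.

```python
import uuid


class InMemorySessionStore:
    """Keeps dialogue turns in a plain dict; everything is lost on restart."""

    def __init__(self):
        self._sessions = {}

    def create(self):
        """Start a new session and return its id."""
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = {"turns": []}
        return session_id

    def add_turn(self, session_id, role, text):
        """Append one dialogue turn (e.g. role "user" or "assistant")."""
        self._sessions[session_id]["turns"].append({"role": role, "text": text})

    def history(self, session_id):
        """Return a copy of the turns recorded so far."""
        return list(self._sessions[session_id]["turns"])

    def end(self, session_id):
        """Drop the session; ending an unknown session is a no-op."""
        self._sessions.pop(session_id, None)


store = InMemorySessionStore()
sid = store.create()
store.add_turn(sid, "user", "What color is the mug?")
store.add_turn(sid, "assistant", "Which mug do you mean?")
print(len(store.history(sid)))  # 2
store.end(sid)
```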

Repository Structure

Self-made-Visual-Question-Answering-System-Simple/
├── README.md
├── backend/
│   ├── ambiguity.py
│   ├── app.py
│   ├── llm_answer.py
│   ├── openai_vision.py
│   ├── requirements.txt
│   ├── response_generator.py
│   ├── session_store.py
│   ├── temporal_aggregator.py
│   ├── temporal_ambiguity.py
│   └── video_processor.py
├── frontend/
│   └── src/
│       ├── App.jsx
│       └── styles.css
├── .vscode/
│   └── settings.json
└── .gitignore

Note: Please ignore any files in the repository that are not listed in the tree above.

Setup Instructions

Step 1: Clone the Repository

git clone https://github.com/YuzhangMei/Self-made-Visual-Question-Answering-System-Simple
cd Self-made-Visual-Question-Answering-System-Simple

Step 2-a: Create a Virtual Environment

cd backend
python -m venv venv

Step 2-b: Activate the Virtual Environment

macOS / Linux

source venv/bin/activate

Windows

venv\Scripts\activate

Step 3: Install Dependencies

pip install -r requirements.txt

If no requirements.txt is provided, install manually:

pip install flask flask-cors openai

Step 4: Set OpenAI API Key

Create an environment variable.

macOS / Linux

export OPENAI_API_KEY="your_api_key_here"

Windows

setx OPENAI_API_KEY "your_api_key_here"

On Windows, restart the terminal after running setx; the variable is only visible in new sessions.
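A quick way to confirm that the key is visible to Python is a small check like the one below. This is a sketch only: `get_api_key` is a hypothetical helper, not a function from the backend, and the demo call passes a dict instead of reading the real environment.

```python
import os


def get_api_key(env=None):
    """Return the OpenAI API key from `env` (defaults to os.environ).

    Raises RuntimeError with a readable message when the key is missing.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it (macOS / Linux) or use setx (Windows)."
        )
    return key


# Demo with an explicit mapping so the example runs without a real key.
print(get_api_key({"OPENAI_API_KEY": "your_api_key_here"}))
```

Calling `get_api_key()` with no argument reads the real environment, which is what you would do before starting the backend.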

Step 5: Run Backend Server

Run the following command in backend/.

python app.py

The backend runs at http://localhost:5000

The remaining steps set up the React frontend. Run them in a new terminal, starting from the repository root, so the backend keeps running.

Step 6: Install Node Dependencies

cd frontend
npm install

Step 7: Run Frontend

npm run dev

The frontend runs at http://localhost:5173

How to use

1. Upload an image or short video.
2. Enter a question either by typing or speaking.
3. Select a mode:
   - Clarify (iterative clarification; multi-turn interaction)
   - One-pass (one-pass response; direct structured answer)
4. If ambiguity is detected, choose the intended object from the buttons shown.
5. Continue with follow-up questions if desired.
6. Click "End Session" to reset.

Design Highlights

Ambiguity Detection

The system explicitly detects when:

Multiple similar objects are present
Temporal ambiguity exists in video
The user's question underspecifies the referent

It then generates clarification options before answering.
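The idea can be sketched with a simple rule: if more than one detected object matches the noun in the question, the question is treated as ambiguous and one option is produced per candidate. The matching below is plain word lookup and is an assumption made for illustration; the function names and the `label` / `position` keys are hypothetical, and `ambiguity.py` may work quite differently.

```python
def find_candidates(question, objects):
    """Return detected objects whose label appears as a word in the question."""
    words = {w.strip("?.,!").lower() for w in question.split()}
    return [o for o in objects if o["label"].lower() in words]


def clarification_options(question, objects):
    """Return one option string per candidate, or [] if there is no ambiguity."""
    candidates = find_candidates(question, objects)
    if len(candidates) < 2:
        return []
    return [f'The {c["label"]} {c["position"]}' for c in candidates]


scene = [
    {"label": "mug", "position": "left of the laptop"},
    {"label": "mug", "position": "near the window"},
    {"label": "laptop", "position": "center of the desk"},
]
print(clarification_options("What color is the mug?", scene))
# ['The mug left of the laptop', 'The mug near the window']
print(clarification_options("What brand is the laptop?", scene))
# []
```

Exact word matching misses plurals and synonyms ("mugs", "cup"), so a real detector would need a more tolerant comparison.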

Temporal Reasoning

For video inputs:

Frames are sampled
Objects are detected per frame
Objects are aggregated across timestamps
Clarification options include time spans
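The aggregation step can be sketched as follows: per-frame labels are merged into spans, and a label that disappears for a sampled frame and later reappears starts a new span. The input shape and the name `aggregate_spans` are assumptions for this example, not the interface of `temporal_aggregator.py`.

```python
def aggregate_spans(frame_detections):
    """Merge per-frame labels into (label, first_seen, last_seen) spans.

    `frame_detections` is a list of (timestamp_s, [labels]) sorted by time.
    """
    spans = []       # each entry is [label, start, end]
    open_spans = {}  # label -> index of its currently open span in `spans`
    for ts, labels in frame_detections:
        seen = set(labels)
        # Close spans for labels that are absent from this frame.
        for label in list(open_spans):
            if label not in seen:
                del open_spans[label]
        # Extend open spans, or start new ones, for labels present now.
        for label in sorted(seen):
            if label in open_spans:
                spans[open_spans[label]][2] = ts
            else:
                open_spans[label] = len(spans)
                spans.append([label, ts, ts])
    return [tuple(s) for s in spans]


frames = [
    (1.0, ["dog"]),
    (3.0, ["dog", "ball"]),
    (5.0, ["ball"]),
    (7.0, ["dog"]),
]
for label, start, end in aggregate_spans(frames):
    print(f"{label}: {start:.1f}s to {end:.1f}s")
# dog: 1.0s to 3.0s
# ball: 3.0s to 5.0s
# dog: 7.0s to 7.0s
```

Here "dog" yields two spans, which is exactly the kind of temporal ambiguity a clarification option such as "the dog at 1.0s to 3.0s" can resolve. Because spans are built from sampled frames, their boundaries are only as precise as the sampling interval.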

Accessibility

Fully keyboard navigable
Buttons are screen-reader compatible
Automatic TTS narration
Built-in Speech Recognition input

Known Limitations

Temporal spans are based on sampled frames, not continuous tracking.
Object detection quality depends on the underlying vision model.
No persistent database (sessions are in-memory).

Future Improvements

Real-time video stream support
Object tracking instead of frame-based aggregation
Improved semantic merging of similar object labels
Persistent session storage
