Turn raw data into insights — no SQL, no coding required. Upload a dataset, ask a question, and get charts, tables, and AI-powered explanations in seconds.
AI Data Analyst is a full-stack GenAI application that acts as your personal data scientist. Upload a file — CSV, Excel, or JSON — type a question, and the app figures out the best way to analyse and visualise it.
Two analysis modes are available:
⚡ Standard — one AI model call generates code and an explanation in seconds.
🤖 Multi-Agent — three specialised agents run in sequence:
- 🗺️ Planner — breaks the question into steps and automatically detects the best chart types for your data
- ✍️ Coder — writes chart-aware Python analysis code guided by the plan
- 📝 Narrator — reads the execution output and writes an executive summary with key findings and a recommendation
No SQL. No coding. Just ask, and get answers.
- Features
- Quick Start
- App Walkthrough
- Multi-Agent Workflow
- Automatic Chart Detection
- Example Questions
- Architecture
- Project Structure
- Running Tests
- Docker Deployment
- Environment Variables
- AI Models
- Contributing
- License
Upload CSV, Excel (.xlsx, .xls), or JSON files — multiple at once. Switch between datasets instantly from the sidebar. Need to join two files? Use the built-in merge tool to combine datasets on a shared key column with inner, left, right, or outer joins.
The moment you upload a file, the app generates a full profile:
- Dataset shape (rows × columns)
- Per-column statistics: min, max, mean, std for numeric columns; unique counts and top values for categorical
- Missing value counts and percentages
- Sample rows preview
Automatically scans for issues and surfaces them with severity ratings and one-line fix snippets:
| Severity | Issue |
|---|---|
| 🔴 High | Column has >50% missing values |
| 🟡 Medium | Column has >5% missing values, or duplicate rows detected |
| 🟢 Low | Dates stored as strings, all-unique ID columns, extreme outliers |
One AI model call handles everything:
- Receives the dataset profile + question
- Uses function calling with a strict JSON schema → structured
code+explanation - Code runs in a subprocess sandbox with a 60-second timeout
- matplotlib/seaborn charts are captured automatically
- Generated code always visible in an expandable panel
Three AI agents, each with a focused role — see Multi-Agent Workflow for full details.
The Planner agent reads your question and data types to recommend the most appropriate chart. See Automatic Chart Detection for the decision rules.
Run the same question through two AI models side-by-side and compare charts, insights, token usage, and response time.
IQR (per-column): Detects outliers using the interquartile range — box plot + distribution histogram with fences, plus a filterable table of outlier rows.
Isolation Forest (multivariate): Fits scikit-learn's IsolationForest across all numeric columns to catch anomalies that only appear in combination. Results visualised in a scatter plot coloured by anomaly status.
Every analysis tracked automatically:
- Total queries, cumulative token usage, average response time
- Per-query history table (model, question, tokens, time, charts, success/fail)
- Stacked token-usage bar chart across the session
Download a full report for any analysis: dataset summary, question, AI insight, charts, text output, usage metrics, and generated code.
One-command startup with health check, 2 GB memory limit, and auto-restart.
- Python 3.9+
- An OpenAI API key
# 1. Clone the repo
git clone https://github.com/BenElhajWiem/ai_data_analyst.git
cd ai_data_analyst
# 2. Install dependencies
pip install -r requirements.txt
# 3. Add your API key
cp .env.example .env
# Open .env and set: OPENAI_API_KEY=sk-proj-...
# 4. Launch
streamlit run app.pyOpen http://localhost:8501 in your browser.
Tip: If
streamlitis not on your PATH, runpython -m streamlit run app.py
Click "Upload Dataset" in the sidebar. Upload one or more CSV, Excel, or JSON files. Switch the active dataset using the selector at the top of the main area.
- Row/column counts, missing values, and duplicates at a glance
- Interactive preview of the first 100 rows
- Auto Profile — full text summary of every column
- Column Statistics —
describe()table for numeric columns - Multi-file Merge — join two datasets on a shared key (if multiple files are loaded)
The Quality tab scans for missing values, duplicates, string-encoded dates, and outliers. Each issue includes a one-line pandas fix. The missing-value heatmap shows affected rows and columns visually.
Go to AI Analysis. Choose a workflow mode, type your question, and click Analyze.
⚡ Standard — fastest, single AI model call, insight card + charts.
🤖 Multi-Agent — richer output: analysis plan, detected chart types, executive summary, key findings, and a recommendation.
Go to Compare, type a question, and select two models. Both run and results appear side-by-side with token and latency badges.
Anomalies tab: pick a column for IQR outlier detection, or run Isolation Forest across all numeric columns for multivariate anomaly detection.
The Metrics tab shows cumulative token usage, response times, and a per-query history table for the whole session.
Go to Export, select an analysis, and click Generate PDF Report.
When you select 🤖 Multi-Agent mode, three AI agents run in sequence, each with a strict function-calling schema:
Question + Data Profile
│
▼
┌───────────────────┐
│ Agent 1: Planner │ → steps[], chart_types[], key_columns[], chart_rationale
└───────────────────┘
│
▼
┌──────────────────┐
│ Agent 2: Coder │ → Python code (chart-type aware, guided by the plan)
└──────────────────┘
│
▼
┌──────────────────────┐
│ Subprocess Sandbox │ → charts (PNG bytes), stdout, exit code
└──────────────────────┘
│
▼
┌────────────────────┐
│ Agent 3: Narrator │ → summary, key_findings[], recommendation
└────────────────────┘
Why three agents instead of one? Each agent is smaller, more focused, and easier to improve independently. The Planner's chart selection feeds directly into the Coder's prompt, so the generated code uses the right visualisation type from the start. The Narrator reads the actual execution output — real numbers, not hallucinations — before writing its summary.
Token cost: Multi-Agent mode makes 3 API calls instead of 1. The Metrics dashboard shows the combined token count so you can compare cost vs. quality.
The Planner agent applies these rules when selecting chart types:
| Data pattern | Detected chart |
|---|---|
| Date / time column in the question | line or area |
| Comparing values across categories | bar |
| Two numeric variables | scatter |
| Distribution of one numeric variable | histogram or box |
| Correlation across all numeric columns | heatmap |
| Proportions / shares (≤ 6 groups) | pie |
| Exact values are important | table |
| Grouped distributions | box or violin |
The detected chart types appear as coloured badges in the Analysis Plan card in the UI, along with a one-sentence rationale. The Coder agent is explicitly instructed to use those chart types when writing the code.
Supported chart types: bar · line · scatter · histogram · heatmap · pie · box · violin · area · table
Show monthly revenue trends and highlight the best and worst months.
What are the top 10 products by total sales? Use a bar chart.
Is there a correlation between price and quantity sold? Show a scatter plot.
Plot the distribution of every numeric column in a grid of histograms.
Which customer segment has the highest average order value?
Show how sales split by region — pie chart and a summary table.
Are there any unusual spikes in daily transactions?
Compare revenue this year vs last year, month by month.
graph TD
U([👤 User]) --> UI[Streamlit UI\napp.py — 7 tabs]
UI -->|Standard mode| AL[analyst.py\nSingle-agent]
UI -->|Multi-Agent mode| AG[agents.py\n3-agent pipeline]
AL -->|function calling| OA[☁️ OpenAI API]
OA -->|code + explanation| AL
AG -->|Agent 1: Planner| OA
OA -->|steps + chart_types| AG
AG -->|Agent 2: Coder| OA
OA -->|Python code| AG
AG -->|Agent 3: Narrator| OA
OA -->|summary + findings| AG
AL --> EX[executor.py\nSubprocess Sandbox]
AG --> EX
EX --> PL[📈 Charts PNG bytes]
EX --> TX[📝 stdout output]
PL --> UI
TX --> UI
UI --> UP[📂 File Upload]
UP --> DF[(pandas DataFrame)]
DF --> PR[profiler.py]
DF --> AN[anomaly.py]
UI --> EXP[exporter.py\nPDF Report]
Sandbox: Generated code runs in a child process. The DataFrame is written to a temp CSV; the subprocess loads it, executes the code, saves all open matplotlib figures to PNGs, and exits. The parent reads the PNGs into memory before the temp directory is deleted.
Function calling everywhere: Every agent uses a strict JSON schema. No string parsing, no regex, guaranteed schema compliance across all three agents.
ai_data_analyst/
│
├── app.py ← Streamlit UI — 7 tabs, custom CSS, two analysis modes
├── analyst.py ← Standard mode: single AI model call, function calling
├── agents.py ← Multi-agent pipeline: Planner → Coder → Narrator
├── executor.py ← Subprocess sandbox — runs code, captures charts
├── profiler.py ← Data profile string + cleaning suggestions
├── anomaly.py ← IQR outlier detection + Isolation Forest
├── exporter.py ← PDF report generation (fpdf2)
│
├── requirements.txt
├── .env.example ← Copy to .env and add your API key
├── .gitignore ← .env excluded — your key stays local
├── Dockerfile
├── docker-compose.yml
│
└── tests/
├── test_executor.py ← 6 tests: execution, plots, timeout, errors
└── test_profiler.py ← 7 tests: missing values, duplicates, dates, outliers
pip install pytest
pytest tests/ -vtests/test_executor.py::test_basic_print PASSED
tests/test_executor.py::test_computation PASSED
tests/test_executor.py::test_plot_captured PASSED
tests/test_executor.py::test_syntax_error PASSED
tests/test_executor.py::test_timeout PASSED
tests/test_executor.py::test_dataframe_accessible PASSED
tests/test_profiler.py::test_profile_contains_shape PASSED
tests/test_profiler.py::test_profile_contains_column_names PASSED
tests/test_profiler.py::test_no_suggestions_on_clean_data PASSED
tests/test_profiler.py::test_detects_missing_values PASSED
tests/test_profiler.py::test_detects_duplicates PASSED
tests/test_profiler.py::test_detects_date_string PASSED
tests/test_profiler.py::test_detects_outliers PASSED
13 passed
cp .env.example .env # add your OPENAI_API_KEY
docker compose up --build # build and start
docker compose up -d --build # background mode
docker compose down # stopOpen http://localhost:8501. The container includes a health check, 2 GB memory limit, and restart: unless-stopped.
| Variable | Required | Description |
|---|---|---|
OPENAI_API_KEY |
✅ Yes | Your OpenAI API key — get one here |
Stored in .env (via python-dotenv). .env is in .gitignore and never committed.
| Model | Speed | Cost | Best for |
|---|---|---|---|
| Standard (powerful) | ~5–10s | $$$ | Complex analysis, multi-agent mode, production |
| Standard (fast) | ~2–4s | $ | Quick exploration, simple queries, cost-sensitive use |
Multi-Agent mode always uses the most capable model for all three agents.
- Fork the repo and create a branch:
git checkout -b feature/my-feature - Make your changes and add tests where appropriate
- Verify all tests pass:
pytest tests/ -v - Open a pull request with a clear description of what changed and why
MIT — see LICENSE for details.