Skip to content

BenElhajWiem/ai_data_analyst

Repository files navigation

📊 AI Data Analyst

Turn raw data into insights — no SQL, no coding required. Upload a dataset, ask a question, and get charts, tables, and AI-powered explanations in seconds.


What is this?

AI Data Analyst is a full-stack GenAI application that acts as your personal data scientist. Upload a file — CSV, Excel, or JSON — type a question, and the app figures out the best way to analyse and visualise it.

Two analysis modes are available:

⚡ Standard — one AI model call generates code and an explanation in seconds.

🤖 Multi-Agent — three specialised agents run in sequence:

  1. 🗺️ Planner — breaks the question into steps and automatically detects the best chart types for your data
  2. ✍️ Coder — writes chart-aware Python analysis code guided by the plan
  3. 📝 Narrator — reads the execution output and writes an executive summary with key findings and a recommendation

No SQL. No coding. Just ask, and get answers.


Table of Contents


Features

📂 Multi-file Upload

Upload CSV, Excel (.xlsx, .xls), or JSON files — multiple at once. Switch between datasets instantly from the sidebar. Need to join two files? Use the built-in merge tool to combine datasets on a shared key column with inner, left, right, or outer joins.

📋 Automatic Data Profile

The moment you upload a file, the app generates a full profile:

  • Dataset shape (rows × columns)
  • Per-column statistics: min, max, mean, std for numeric columns; unique counts and top values for categorical
  • Missing value counts and percentages
  • Sample rows preview

🧹 Data Quality Checker

Automatically scans for issues and surfaces them with severity ratings and one-line fix snippets:

Severity Issue
🔴 High Column has >50% missing values
🟡 Medium Column has >5% missing values, or duplicate rows detected
🟢 Low Dates stored as strings, all-unique ID columns, extreme outliers

💬 AI Analysis — Standard Mode

One AI model call handles everything:

  • Receives the dataset profile + question
  • Uses function calling with a strict JSON schema → structured code + explanation
  • Code runs in a subprocess sandbox with a 60-second timeout
  • matplotlib/seaborn charts are captured automatically
  • Generated code always visible in an expandable panel

🤖 Multi-Agent Mode

Three AI agents, each with a focused role — see Multi-Agent Workflow for full details.

📊 Automatic Chart Detection

The Planner agent reads your question and data types to recommend the most appropriate chart. See Automatic Chart Detection for the decision rules.

⚖️ Model Comparison

Run the same question through two AI models side-by-side and compare charts, insights, token usage, and response time.

🔍 Anomaly Detection

IQR (per-column): Detects outliers using the interquartile range — box plot + distribution histogram with fences, plus a filterable table of outlier rows.

Isolation Forest (multivariate): Fits scikit-learn's IsolationForest across all numeric columns to catch anomalies that only appear in combination. Results visualised in a scatter plot coloured by anomaly status.

📊 Session Metrics Dashboard

Every analysis tracked automatically:

  • Total queries, cumulative token usage, average response time
  • Per-query history table (model, question, tokens, time, charts, success/fail)
  • Stacked token-usage bar chart across the session

📄 PDF Export

Download a full report for any analysis: dataset summary, question, AI insight, charts, text output, usage metrics, and generated code.

🐳 Docker Deployment

One-command startup with health check, 2 GB memory limit, and auto-restart.


Quick Start

Prerequisites

Local Setup

# 1. Clone the repo
git clone https://github.com/BenElhajWiem/ai_data_analyst.git
cd ai_data_analyst

# 2. Install dependencies
pip install -r requirements.txt

# 3. Add your API key
cp .env.example .env
# Open .env and set: OPENAI_API_KEY=sk-proj-...

# 4. Launch
streamlit run app.py

Open http://localhost:8501 in your browser.

Tip: If streamlit is not on your PATH, run python -m streamlit run app.py


App Walkthrough

Step 1 — Upload your data

Click "Upload Dataset" in the sidebar. Upload one or more CSV, Excel, or JSON files. Switch the active dataset using the selector at the top of the main area.

Step 2 — Explore the Data tab 📋

  • Row/column counts, missing values, and duplicates at a glance
  • Interactive preview of the first 100 rows
  • Auto Profile — full text summary of every column
  • Column Statisticsdescribe() table for numeric columns
  • Multi-file Merge — join two datasets on a shared key (if multiple files are loaded)

Step 3 — Check data quality 🧹

The Quality tab scans for missing values, duplicates, string-encoded dates, and outliers. Each issue includes a one-line pandas fix. The missing-value heatmap shows affected rows and columns visually.

Step 4 — Ask a question 💬

Go to AI Analysis. Choose a workflow mode, type your question, and click Analyze.

⚡ Standard — fastest, single AI model call, insight card + charts.

🤖 Multi-Agent — richer output: analysis plan, detected chart types, executive summary, key findings, and a recommendation.

Step 5 — Compare models ⚖️

Go to Compare, type a question, and select two models. Both run and results appear side-by-side with token and latency badges.

Step 6 — Detect anomalies 🔍

Anomalies tab: pick a column for IQR outlier detection, or run Isolation Forest across all numeric columns for multivariate anomaly detection.

Step 7 — Track metrics 📊

The Metrics tab shows cumulative token usage, response times, and a per-query history table for the whole session.

Step 8 — Export a report 📄

Go to Export, select an analysis, and click Generate PDF Report.


Multi-Agent Workflow

When you select 🤖 Multi-Agent mode, three AI agents run in sequence, each with a strict function-calling schema:

Question + Data Profile
        │
        ▼
┌───────────────────┐
│  Agent 1: Planner │  → steps[], chart_types[], key_columns[], chart_rationale
└───────────────────┘
        │
        ▼
┌──────────────────┐
│  Agent 2: Coder  │  → Python code (chart-type aware, guided by the plan)
└──────────────────┘
        │
        ▼
┌──────────────────────┐
│  Subprocess Sandbox  │  → charts (PNG bytes), stdout, exit code
└──────────────────────┘
        │
        ▼
┌────────────────────┐
│  Agent 3: Narrator │  → summary, key_findings[], recommendation
└────────────────────┘

Why three agents instead of one? Each agent is smaller, more focused, and easier to improve independently. The Planner's chart selection feeds directly into the Coder's prompt, so the generated code uses the right visualisation type from the start. The Narrator reads the actual execution output — real numbers, not hallucinations — before writing its summary.

Token cost: Multi-Agent mode makes 3 API calls instead of 1. The Metrics dashboard shows the combined token count so you can compare cost vs. quality.


Automatic Chart Detection

The Planner agent applies these rules when selecting chart types:

Data pattern Detected chart
Date / time column in the question line or area
Comparing values across categories bar
Two numeric variables scatter
Distribution of one numeric variable histogram or box
Correlation across all numeric columns heatmap
Proportions / shares (≤ 6 groups) pie
Exact values are important table
Grouped distributions box or violin

The detected chart types appear as coloured badges in the Analysis Plan card in the UI, along with a one-sentence rationale. The Coder agent is explicitly instructed to use those chart types when writing the code.

Supported chart types: bar · line · scatter · histogram · heatmap · pie · box · violin · area · table


Example Questions

Show monthly revenue trends and highlight the best and worst months.
What are the top 10 products by total sales? Use a bar chart.
Is there a correlation between price and quantity sold? Show a scatter plot.
Plot the distribution of every numeric column in a grid of histograms.
Which customer segment has the highest average order value?
Show how sales split by region — pie chart and a summary table.
Are there any unusual spikes in daily transactions?
Compare revenue this year vs last year, month by month.

Architecture

graph TD
    U([👤 User]) --> UI[Streamlit UI\napp.py — 7 tabs]

    UI -->|Standard mode| AL[analyst.py\nSingle-agent]
    UI -->|Multi-Agent mode| AG[agents.py\n3-agent pipeline]

    AL -->|function calling| OA[☁️ OpenAI API]
    OA -->|code + explanation| AL

    AG -->|Agent 1: Planner| OA
    OA -->|steps + chart_types| AG
    AG -->|Agent 2: Coder| OA
    OA -->|Python code| AG
    AG -->|Agent 3: Narrator| OA
    OA -->|summary + findings| AG

    AL --> EX[executor.py\nSubprocess Sandbox]
    AG --> EX
    EX --> PL[📈 Charts PNG bytes]
    EX --> TX[📝 stdout output]

    PL --> UI
    TX --> UI

    UI --> UP[📂 File Upload]
    UP --> DF[(pandas DataFrame)]
    DF --> PR[profiler.py]
    DF --> AN[anomaly.py]
    UI --> EXP[exporter.py\nPDF Report]
Loading

Sandbox: Generated code runs in a child process. The DataFrame is written to a temp CSV; the subprocess loads it, executes the code, saves all open matplotlib figures to PNGs, and exits. The parent reads the PNGs into memory before the temp directory is deleted.

Function calling everywhere: Every agent uses a strict JSON schema. No string parsing, no regex, guaranteed schema compliance across all three agents.


Project Structure

ai_data_analyst/
│
├── app.py              ← Streamlit UI — 7 tabs, custom CSS, two analysis modes
├── analyst.py          ← Standard mode: single AI model call, function calling
├── agents.py           ← Multi-agent pipeline: Planner → Coder → Narrator
├── executor.py         ← Subprocess sandbox — runs code, captures charts
├── profiler.py         ← Data profile string + cleaning suggestions
├── anomaly.py          ← IQR outlier detection + Isolation Forest
├── exporter.py         ← PDF report generation (fpdf2)
│
├── requirements.txt
├── .env.example        ← Copy to .env and add your API key
├── .gitignore          ← .env excluded — your key stays local
├── Dockerfile
├── docker-compose.yml
│
└── tests/
    ├── test_executor.py    ← 6 tests: execution, plots, timeout, errors
    └── test_profiler.py    ← 7 tests: missing values, duplicates, dates, outliers

Running Tests

pip install pytest
pytest tests/ -v
tests/test_executor.py::test_basic_print          PASSED
tests/test_executor.py::test_computation          PASSED
tests/test_executor.py::test_plot_captured        PASSED
tests/test_executor.py::test_syntax_error         PASSED
tests/test_executor.py::test_timeout              PASSED
tests/test_executor.py::test_dataframe_accessible PASSED
tests/test_profiler.py::test_profile_contains_shape        PASSED
tests/test_profiler.py::test_profile_contains_column_names PASSED
tests/test_profiler.py::test_no_suggestions_on_clean_data  PASSED
tests/test_profiler.py::test_detects_missing_values        PASSED
tests/test_profiler.py::test_detects_duplicates            PASSED
tests/test_profiler.py::test_detects_date_string           PASSED
tests/test_profiler.py::test_detects_outliers              PASSED

13 passed

Docker Deployment

cp .env.example .env        # add your OPENAI_API_KEY
docker compose up --build   # build and start

docker compose up -d --build  # background mode
docker compose down           # stop

Open http://localhost:8501. The container includes a health check, 2 GB memory limit, and restart: unless-stopped.


Environment Variables

Variable Required Description
OPENAI_API_KEY ✅ Yes Your OpenAI API key — get one here

Stored in .env (via python-dotenv). .env is in .gitignore and never committed.


AI Models

Model Speed Cost Best for
Standard (powerful) ~5–10s $$$ Complex analysis, multi-agent mode, production
Standard (fast) ~2–4s $ Quick exploration, simple queries, cost-sensitive use

Multi-Agent mode always uses the most capable model for all three agents.


Contributing

  1. Fork the repo and create a branch: git checkout -b feature/my-feature
  2. Make your changes and add tests where appropriate
  3. Verify all tests pass: pytest tests/ -v
  4. Open a pull request with a clear description of what changed and why

License

MIT — see LICENSE for details.

About

Upload a CSV, Excel, or JSON file and ask questions. AI writes and executes the Python analysis code, then returns charts, insights, and a PDF report — no SQL or coding required.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors