An AI co-pilot for retail ops. Upload a CSV of orders, ask the agent in plain English, watch it pick from 10 analytic tools, and get the answer + plan + an observation trace. Backed by a 55-case eval harness with an honest dual-baseline.
🔗 Live demo: liveops-agent.onrender.com (~20 sec cold start on Render free tier)
Most portfolio AI projects flex a single eval number. This one pins three, and the gap between them is the story:
| Corpus | Backend | Pass rate | What it tells you |
|---|---|---|---|
| Main (55 cases) | heuristic | 100.0% | Regression-catcher. Same regex, same questions — moves only if something breaks. |
| Holdout (14 cases) | heuristic | 7.1% | Paraphrases the heuristic was never tuned on. Falls off a cliff. |
| Holdout (14 cases) | Gemini | 28.6% | Same paraphrases through an LLM. ~4× the heuristic — meaningful generalization, with room to grow on prompt tuning. |
The 7.1% is the most honest number. It's the cliff that tells you whether the agent generalizes or just memorizes. Both holdout numbers are pinned in tests/fixtures/, regression-checked in CI, and surfaced on the live /evals page.
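To make the regression check concrete, here is a minimal sketch of how a pinned baseline can gate CI. The fixture filename, JSON key, and `run_holdout()` helper are illustrative stand-ins, not the repo's actual names:

```python
# Hedged sketch of the pinned-baseline idea: the holdout pass rate recorded in
# tests/fixtures/ acts as a floor, and CI fails if a change dips below it.
# Fixture name, JSON key, and run_holdout() are placeholders, not real names.
import json
from pathlib import Path

BASELINE = Path("tests/fixtures/holdout_baseline.json")

def run_holdout() -> float:
    """Stand-in for the real eval runner: re-run the 14 paraphrases, return the pass rate."""
    ...

def test_holdout_pass_rate_is_pinned():
    pinned = json.loads(BASELINE.read_text())["pass_rate"]   # e.g. 0.071
    # Improvements update the fixture; regressions fail the build.
    assert run_holdout() >= pinned
```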
git clone https://github.com/sathwikarr/liveops-agent.git
cd liveops-agent
pip install -r requirements.txt
uvicorn web.server:app --reload
# → http://127.0.0.1:8000

The bundled retail dataset (3,878 orders, 200 customers, 30 SKUs, 18 months) is loaded automatically. No signup required — /demo, /workbench, and /evals all work anonymously.
To exercise the full multi-user surface (saved datasets, question history, encrypted Slack/SMTP connectors), sign up at /signup.
| Page | What's there |
|---|---|
| `/` | Honest dual-baseline ribbon (100% / 7.1% / 28.6%) right on the landing |
| `/demo` | KPI tiles + 3 server-rendered charts (revenue line, top products bar, churn donut) over the bundled dataset |
| `/workbench` | Drop a CSV → ask plain-English questions → see plan + tools + answer + 6 charts inline. Loading skeletons, observability badges, copy-as-Markdown |
| `/evals` | Pass-rate timeline, per-tool breakdown, per-case failure log, Run-evals button |
| `/dashboard` (auth) | Active dataset KPIs + charts, saved-datasets list, recent + pinned questions, connector status |
| `/history` (auth) | Chronological feed of every upload + question, click-to-expand to see past answers |
| `/settings` (auth) | Slack webhook + SMTP credentials, Fernet-encrypted at rest, "Send test ping" |
The agent picks from these. Each is independently tested, eval-graded, and chartable:
| Tool | What it returns |
|---|---|
| `revenue_by_period` | Daily / weekly / monthly revenue trend |
| `top_products` / `top_customers` | Pareto rankings by total revenue |
| `segment_customers` | RFM bucket counts (Champions, Loyal, At-Risk, Lost…) |
| `product_quadrants` | BCG quadrant counts (Star, Cash Cow, Question Mark, Dog) |
| `co_purchases` | Frequent product pairs by lift |
| `price_elasticity` | Per-SKU log-log regression slope |
| `churn_risk` | Distribution of Active / Cooling / At-Risk / Churned |
| `cohort_retention` | Signup-cohort retention matrix |
| `describe_columns` | Schema + dtypes (fallback for unrecognised questions) |
The tool registry lives in analyst/agent.py. Each tool is a ToolSpec with name, description, params, and a callable.
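For a feel of the shape, here is a rough sketch of a `ToolSpec` and its registry entry. Field order, the column names, and the registry dict are assumptions for illustration, not the actual code in analyst/agent.py:

```python
# Illustrative sketch of the ToolSpec shape described above; the real
# definitions live in analyst/agent.py and may differ in detail.
from dataclasses import dataclass, field
from typing import Any, Callable

import pandas as pd

@dataclass
class ToolSpec:
    name: str                          # e.g. "revenue_by_period"
    description: str                   # what the planner / LLM sees when choosing a tool
    run: Callable[..., Any]            # the callable the executor invokes
    params: dict[str, Any] = field(default_factory=dict)

def revenue_by_period(df: pd.DataFrame, freq: str = "M") -> pd.Series:
    """Daily / weekly / monthly revenue trend (freq = 'D', 'W', or 'M')."""
    # Column names are assumptions about the bundled CSV, not the real schema.
    return df.set_index("order_date")["revenue"].resample(freq).sum()

TOOLS: dict[str, ToolSpec] = {
    "revenue_by_period": ToolSpec(
        name="revenue_by_period",
        description="Revenue trend aggregated by day, week, or month",
        run=revenue_by_period,
        params={"freq": "one of D / W / M"},
    ),
    # ...nine more specs, one per tool in the table above
}
```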
flowchart LR
subgraph In["Input"]
U["User CSV upload"]
B["Bundled retail dataset"]
end
subgraph Plan["Planner"]
H["Heuristic regex"]
L["Gemini"]
end
subgraph Exec["Executor"]
T["10-tool registry"]
end
subgraph Out["Surfaces"]
WB["/workbench"]
DSH["/dashboard"]
HIST["/history"]
end
subgraph Store["Storage"]
DB[("SQLite WAL")]
UF["user_data/<br/>per-user uploads"]
end
subgraph Eval["Eval harness"]
M["Main corpus<br/>55 cases"]
HO["Holdout<br/>14 paraphrases"]
end
U --> Plan
B --> Plan
Plan --> T
T --> WB
T --> DSH
T --> HIST
H -.fallback.-> T
L -.fallback.-> T
WB --> DB
DSH --> DB
HIST --> DB
U --> UF
M --> Plan
HO --> Plan
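A rough sketch of the planner split in the diagram — regex routing first, Gemini layered on top when a key is present, either path able to fall back to a deterministic tool choice. The patterns and function names below are illustrative only, not the code in analyst/agent.py:

```python
# Hedged sketch of heuristic-first routing with an LLM planner on top.
import os
import re

# Question pattern -> tool name. Patterns invented for illustration.
HEURISTIC_ROUTES = [
    (re.compile(r"top .*(product|customer)", re.I), "top_products"),
    (re.compile(r"churn", re.I), "churn_risk"),
    (re.compile(r"revenue|sales", re.I), "revenue_by_period"),
]

def heuristic_plan(question: str) -> str:
    for pattern, tool in HEURISTIC_ROUTES:
        if pattern.search(question):
            return tool
    return "describe_columns"          # schema fallback for unrecognised questions

def llm_plan(question: str) -> str:
    """Gemini-backed planner (not sketched): ask the model for a tool name as JSON."""
    raise NotImplementedError

def plan(question: str) -> str:
    if os.getenv("GEMINI_API_KEY"):
        try:
            return llm_plan(question)
        except Exception:
            pass                       # network error, bad JSON, etc. -> deterministic route
    return heuristic_plan(question)
```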
- Backend — FastAPI + Starlette sessions + bcrypt + Fernet
- Data — pandas, scikit-learn, Prophet (per-segment forecasting), networkx (basket lift)
- LLM — Gemini 2.5 Flash via the `google-genai` SDK, with a deterministic regex-routing fallback
- Frontend — Jinja2 + Tailwind CDN + Alpine.js + Chart.js
- Storage — SQLite (WAL), schema in `agent/db.py`
- Auth — bcrypt, signed-cookie sessions, per-IP sliding-window rate limit on signup + login
- Tests — 325 pytest cases, run on 3.11 + 3.12 in CI
- Deploy — Docker + GitHub Actions + Render free tier
- Honest about generalization. Most demos show one number. This one shows the cliff between memorized and novel input — and explains why the LLM number is 28.6%, not 80%, with the next iteration named explicitly.
- The eval harness is the product. Not an afterthought. Pass rates, per-tool scoreboards, regression checks, and a 14-case holdout are all surfaced in the UI, persisted to SQLite, and charted as a timeline.
- Multi-user from the start. Encrypted connector storage, isolated uploads, isolated history, isolated questions — every test asserts the isolation explicitly.
- Real production plumbing. Auth, rate limiting, mobile-responsive layout, loading skeletons, accessibility (`aria-busy`, `aria-controls`, focus-visible), CI on every PR.
agent/ # SQLite + auth + secrets
auth.py bcrypt signup + login
db.py users · anomalies · actions · datasets · questions · connectors · eval_runs
secret.py Fernet encrypt/decrypt with LIVEOPS_FERNET_KEY
bandit.py Thompson-sampling action picker
detect.py per-segment z-score + IsolationForest
explain.py Gemini wrapper (deterministic-fallback)
forecast.py Prophet per-segment + walk-forward CV
notify.py Slack + SMTP unified notifier
analyst/ # Analytic surface
agent.py Tool registry, heuristic planner, LLM planner, executor, ask()
evals/ 55-case main corpus + 14-case holdout + scorer + runner
analysis.py RFM, cohorts, market basket, elasticity
predict.py churn, stockout, demand
charts.py 8 chart builders
sample_data/ bundled retail_orders.csv
web/ # FastAPI app
server.py Routes, /api/* endpoints, dataset/profile/chart helpers
templates/ landing, demo, workbench, evals, dashboard, history,
run_agent, settings, login (Jinja2 + Alpine)
static/site.css Premium-vein motion polish
tests/ # 325 pytest cases
test_web.py 60 web tests
test_analyst_agent.py 41 agent tests (incl. why-string regression)
test_analyst_evals.py 17 eval harness tests
… auth, DB, forecast, notify, charts, pinboard
fixtures/ pinned baselines (3 JSON files)
See .env.example for the full list. Most are optional — the app degrades gracefully.
| Variable | Purpose |
|---|---|
| `GEMINI_API_KEY` | LLM backend; without it, the heuristic still works (degrades on holdout) |
| `GEMINI_MODEL` | Defaults to `gemini-2.5-flash` |
| `SESSION_SECRET` | Signs the session cookie. Generate with `python -c "import secrets; print(secrets.token_urlsafe(32))"` |
| `LIVEOPS_FERNET_KEY` | Encrypts saved Slack webhooks + SMTP creds. Generate with `python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"` |
| `LIVEOPS_DB` | SQLite path. Defaults to `data/liveops.sqlite3` |
| `LIVEOPS_UPLOAD_DIR` | Per-user upload directory. Defaults to `user_data/workbench_uploads/` |
| `LIVEOPS_LLM_DEBUG` | Set to `1` to log LLM-fallback reasons to stderr (network errors, parse failures, etc.) |
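As a sketch of that graceful degradation, the optional variables above might be read roughly like this (a simplified shim with the defaults taken from the table; the real lookups are spread across `agent/` and `web/`):

```python
# Hypothetical settings shim: every variable is optional and falls back to a
# safe default, matching the table above.
import os

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")                           # absent -> heuristic planner only
GEMINI_MODEL   = os.getenv("GEMINI_MODEL", "gemini-2.5-flash")
DB_PATH        = os.getenv("LIVEOPS_DB", "data/liveops.sqlite3")
UPLOAD_DIR     = os.getenv("LIVEOPS_UPLOAD_DIR", "user_data/workbench_uploads/")
LLM_DEBUG      = os.getenv("LIVEOPS_LLM_DEBUG") == "1"                 # log fallback reasons to stderr
```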
325 pytest cases cover auth, DB, every analytic tool, the LLM agent loop, the heuristic planner's friendly-prose contract (regex literals are explicitly forbidden in plan why strings), the 55-case eval harness, the held-out 14-case corpus, the FastAPI routes (signup → login → dashboard → upload → ask → history → settings), per-user isolation across datasets / questions / connectors, encryption-at-rest verification, rate-limiter behavior, and frontend HTML scaffolding.
pip install pytest
SLACK_WEBHOOK="" SMTP_HOST="" GEMINI_API_KEY="" pytest -v

CI runs on every PR — see .github/workflows/tests.yml.
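As one example of what those tests pin down, here is a sketch of the friendly-prose contract on plan why strings. The response shape (`result["plan"]`, `step["why"]`) is assumed for illustration and may not match the real test:

```python
# Sketch of the friendly-prose contract: "why" strings in a plan must read as
# plain English and never leak regex literals. Response shape is assumed.
import re

REGEX_SMELL = re.compile(r"\\b|\.\*|[\^\$]")   # crude detector for leaked pattern syntax

def test_plan_why_strings_are_friendly_prose():
    from analyst.agent import ask               # real entry point; shape below is assumed
    result = ask("How did monthly revenue trend over the last year?")
    for step in result["plan"]:
        assert not REGEX_SMELL.search(step["why"])
```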
docker build -t liveops-agent .
docker run --rm -p 8000:8000 \
--env-file .env \
-v "$PWD/data:/app/data" \
-v "$PWD/user_data:/app/user_data" \
  liveops-agent

- Connect the GitHub repo on render.com → New Web Service.
- Build command: `pip install -r requirements.txt`
- Start command: `uvicorn web.server:app --host 0.0.0.0 --port $PORT`
- Set `GEMINI_API_KEY`, `SESSION_SECRET`, `LIVEOPS_FERNET_KEY` in the Environment tab.
The app auto-deploys on every push to main.
Sensitive data handling:
- Passwords — bcrypt-hashed, never stored plaintext.
- Sessions — signed with `SESSION_SECRET` (Starlette's `SessionMiddleware`).
- Connector secrets (Slack webhooks, SMTP credentials) — Fernet-encrypted at rest with a key derived from `LIVEOPS_FERNET_KEY`. The encryption is verified by tests: the stored blob never contains the plaintext substring.
- Per-user isolation — datasets, questions, and connectors are all keyed by username and tested for cross-user access (404 on read, 401 on mutate).
- Rate limiting — 5 attempts / 60s sliding window per IP on `/login` and `/signup`.

`.env` is git-ignored. See `SECURITY.md` for the secret-rotation playbook.
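The encryption-at-rest property reduces to a simple invariant, sketched here with the key used directly (the real helpers live in agent/secret.py and may differ in detail):

```python
# Minimal sketch of the property the tests assert: what lands in SQLite is the
# Fernet ciphertext, and the plaintext never appears in the stored blob.
import os
from cryptography.fernet import Fernet

key = os.environ.get("LIVEOPS_FERNET_KEY") or Fernet.generate_key()
fernet = Fernet(key)

webhook = "https://hooks.slack.com/services/T000/B000/XXXX"   # example plaintext secret
blob = fernet.encrypt(webhook.encode())                       # what actually gets stored

assert webhook.encode() not in blob                           # plaintext never stored
assert fernet.decrypt(blob).decode() == webhook               # round-trips cleanly
```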
- Tighten the LLM routing prompt. The 28.6% holdout pass rate is the single biggest lever — most failures are the LLM returning the right tool name in prose but not in the JSON shape the parser wants. ~30 minutes of prompt iteration could push this past 60%.
- Token + per-tool latency in `obs_meta`. Currently `obs_meta` reports wall-clock latency and tool count; surfacing token cost (LLM) and per-tool timing would close the per-question observability story.
- Self-serve password reset. Today the login page has an honest "Forgot password?" stub — a proper email-driven reset needs a one-shot token table and server-side SMTP sending.
- Bandit feedback loop. Capturing thumbs-up/down on each agent answer to learn which tool selections lead to satisfying responses.
MIT — see LICENSE.