An AI co-pilot for retail ops. Upload a CSV of orders, ask the agent in plain English, watch it pick from 10 analytic tools, and get the answer + plan + an observation trace. Backed by a 55-case eval harness with an honest dual-baseline.
🔗 Live demo: liveops-agent.onrender.com (~20 sec cold start on Render free tier)
Most portfolio AI projects flex a single eval number. This one pins three, and the gap between them is the story:
| Corpus | Backend | Pass rate | What it tells you |
|---|---|---|---|
| Main (55 cases) | heuristic | 100.0% | Regression-catcher. Same regex, same questions — moves only if something breaks. |
| Holdout (14 cases) | heuristic | 7.1% | Paraphrases the heuristic was never tuned on. Falls off a cliff. |
| Holdout (14 cases) | Gemini | 28.6% | Same paraphrases through an LLM. ~4× the heuristic — meaningful generalization, with room to grow on prompt tuning. |
The 7.1% is the most honest number. It's the cliff that tells you whether the agent generalizes or just memorizes. Both holdout numbers are pinned in tests/fixtures/, regression-checked in CI, and surfaced on the live /evals page.
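To make the regression check concrete, here is a minimal sketch of how a pinned baseline can gate CI. The fixture filename, JSON key, and `run_holdout()` helper are illustrative stand-ins, not the repo's actual names:

```python
# Hedged sketch of the pinned-baseline idea: the holdout pass rate recorded in
# tests/fixtures/ acts as a floor, and CI fails if a change dips below it.
# Fixture name, JSON key, and run_holdout() are placeholders, not real names.
import json
from pathlib import Path

BASELINE = Path("tests/fixtures/holdout_baseline.json")

def run_holdout() -> float:
    """Stand-in for the real eval runner: re-run the 14 paraphrases, return the pass rate."""
    ...

def test_holdout_pass_rate_is_pinned():
    pinned = json.loads(BASELINE.read_text())["pass_rate"]   # e.g. 0.071
    # Improvements update the fixture; regressions fail the build.
    assert run_holdout() >= pinned
```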
git clone https://github.com/sathwikarr/liveops-agent.git
cd liveops-agent
pip install -r requirements.txt
uvicorn web.server:app --reload
# → http://127.0.0.1:8000

The bundled retail dataset (3,878 orders, 200 customers, 30 SKUs, 18 months) is loaded automatically. No signup required — /demo, /workbench, and /evals all work anonymously.
To exercise the full multi-user surface (saved datasets, question history, encrypted Slack/SMTP connectors), sign up at /signup.
| Page | What's there |
|---|---|
| `/` | Honest dual-baseline ribbon (100% / 7.1% / 28.6%) right on the landing |
| `/demo` | KPI tiles + 3 server-rendered charts (revenue line, top products bar, churn donut) over the bundled dataset |
| `/workbench` | Drop a CSV → ask plain-English questions → see plan + tools + answer + 6 charts inline. Loading skeletons, observability badges, copy-as-Markdown |
| `/evals` | Pass-rate timeline, per-tool breakdown, per-case failure log, Run-evals button |
| `/dashboard` (auth) | Active dataset KPIs + charts, saved-datasets list, recent + pinned questions, connector status |
| `/history` (auth) | Chronological feed of every upload + question, click-to-expand to see past answers |
| `/settings` (auth) | Slack webhook + SMTP credentials, Fernet-encrypted at rest, "Send test ping" |
The agent picks from these. Each is independently tested, eval-graded, and chartable:
| Tool | What it returns |
|---|---|
| `revenue_by_period` | Daily / weekly / monthly revenue trend |
| `top_products` / `top_customers` | Pareto rankings by total revenue |
| `segment_customers` | RFM bucket counts (Champions, Loyal, At-Risk, Lost…) |
| `product_quadrants` | BCG quadrant counts (Star, Cash Cow, Question Mark, Dog) |
| `co_purchases` | Frequent product pairs by lift |
| `price_elasticity` | Per-SKU log-log regression slope |
| `churn_risk` | Distribution of Active / Cooling / At-Risk / Churned |
| `cohort_retention` | Signup-cohort retention matrix |
| `describe_columns` | Schema + dtypes (fallback for unrecognised questions) |
The tool registry lives in analyst/agent.py. Each tool is a ToolSpec with name, description, params, and a callable.
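For a feel of the shape, here is a rough sketch of a `ToolSpec` and its registry entry. Field order, the column names, and the registry dict are assumptions for illustration, not the actual code in analyst/agent.py:

```python
# Illustrative sketch of the ToolSpec shape described above; the real
# definitions live in analyst/agent.py and may differ in detail.
from dataclasses import dataclass, field
from typing import Any, Callable

import pandas as pd

@dataclass
class ToolSpec:
    name: str                          # e.g. "revenue_by_period"
    description: str                   # what the planner / LLM sees when choosing a tool
    run: Callable[..., Any]            # the callable the executor invokes
    params: dict[str, Any] = field(default_factory=dict)

def revenue_by_period(df: pd.DataFrame, freq: str = "M") -> pd.Series:
    """Daily / weekly / monthly revenue trend (freq = 'D', 'W', or 'M')."""
    # Column names are assumptions about the bundled CSV, not the real schema.
    return df.set_index("order_date")["revenue"].resample(freq).sum()

TOOLS: dict[str, ToolSpec] = {
    "revenue_by_period": ToolSpec(
        name="revenue_by_period",
        description="Revenue trend aggregated by day, week, or month",
        run=revenue_by_period,
        params={"freq": "one of D / W / M"},
    ),
    # ...nine more specs, one per tool in the table above
}
```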
flowchart LR
subgraph In["Input"]
U["User CSV upload"]
B["Bundled retail dataset"]
end
subgraph Plan["Planner"]
H["Heuristic regex"]
L["Gemini"]
end
subgraph Exec["Executor"]
T["10-tool registry"]
end
subgraph Out["Surfaces"]
WB["/workbench"]
DSH["/dashboard"]
HIST["/history"]
end
subgraph Store["Storage"]
DB[("SQLite WAL")]
UF["user_data/<br/>per-user uploads"]
end
subgraph Eval["Eval harness"]
M["Main corpus<br/>55 cases"]
HO["Holdout<br/>14 paraphrases"]
end
U --> Plan
B --> Plan
Plan --> T
T --> WB
T --> DSH
T --> HIST
H -.fallback.-> T
L -.fallback.-> T
WB --> DB
DSH --> DB
HIST --> DB
U --> UF
M --> Plan
HO --> Plan
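A rough sketch of the planner split in the diagram — regex routing first, Gemini layered on top when a key is present, either path able to fall back to a deterministic tool choice. The patterns and function names below are illustrative only, not the code in analyst/agent.py:

```python
# Hedged sketch of heuristic-first routing with an LLM planner on top.
import os
import re

# Question pattern -> tool name. Patterns invented for illustration.
HEURISTIC_ROUTES = [
    (re.compile(r"top .*(product|customer)", re.I), "top_products"),
    (re.compile(r"churn", re.I), "churn_risk"),
    (re.compile(r"revenue|sales", re.I), "revenue_by_period"),
]

def heuristic_plan(question: str) -> str:
    for pattern, tool in HEURISTIC_ROUTES:
        if pattern.search(question):
            return tool
    return "describe_columns"          # schema fallback for unrecognised questions

def llm_plan(question: str) -> str:
    """Gemini-backed planner (not sketched): ask the model for a tool name as JSON."""
    raise NotImplementedError

def plan(question: str) -> str:
    if os.getenv("GEMINI_API_KEY"):
        try:
            return llm_plan(question)
        except Exception:
            pass                       # network error, bad JSON, etc. -> deterministic route
    return heuristic_plan(question)
```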
- Backend — FastAPI + Starlette sessions + bcrypt + Fernet
- Data — pandas, scikit-learn, Prophet (per-segment forecasting), networkx (basket lift)
- LLM — Gemini 2.5 Flash via the `google-genai` SDK, with a deterministic regex-routing fallback
- Frontend — Jinja2 + Tailwind CDN + Alpine.js + Chart.js
- Storage — SQLite (WAL), schema in `agent/db.py`
- Auth — bcrypt, signed-cookie sessions, per-IP sliding-window rate limit on signup + login
- Tests — 325 pytest cases, run on 3.11 + 3.12 in CI
- Deploy — Docker + GitHub Actions + Render free tier
- Honest about generalization. Most demos show one number. This one shows the cliff between memorized and novel input — and explains why the LLM number is 28.6%, not 80%, with the next iteration named explicitly.
- The eval harness is the product. Not an afterthought. Pass rates, per-tool scoreboards, regression checks, and a 14-case holdout are all surfaced in the UI, persisted to SQLite, and charted as a timeline.
- Multi-user from the start. Encrypted connector storage, isolated uploads, isolated history, isolated questions — every test asserts the isolation explicitly.
- Real production plumbing. Auth, rate limiting, mobile-responsive layout, loading skeletons, accessibility (`aria-busy`, `aria-controls`, focus-visible), CI on every PR.
agent/ # SQLite + auth + secrets
auth.py bcrypt signup + login
db.py users · anomalies · actions · datasets · questions · connectors · eval_runs
secret.py Fernet encrypt/decrypt with LIVEOPS_FERNET_KEY
bandit.py Thompson-sampling action picker
detect.py per-segment z-score + IsolationForest
explain.py Gemini wrapper (deterministic-fallback)
forecast.py Prophet per-segment + walk-forward CV
notify.py Slack + SMTP unified notifier
analyst/ # Analytic surface
agent.py Tool registry, heuristic planner, LLM planner, executor, ask()
evals/ 55-case main corpus + 14-case holdout + scorer + runner
analysis.py RFM, cohorts, market basket, elasticity
predict.py churn, stockout, demand
charts.py 8 chart builders
sample_data/ bundled retail_orders.csv
web/ # FastAPI app
server.py Routes, /api/* endpoints, dataset/profile/chart helpers
templates/ landing, demo, workbench, evals, dashboard, history,
run_agent, settings, login (Jinja2 + Alpine)
static/site.css Premium-vein motion polish
tests/ # 325 pytest cases
test_web.py 60 web tests
test_analyst_agent.py 41 agent tests (incl. why-string regression)
test_analyst_evals.py 17 eval harness tests
… auth, DB, forecast, notify, charts, pinboard
fixtures/ pinned baselines (3 JSON files)
See .env.example for the full list. Most are optional — the app degrades gracefully.
| Variable | Purpose |
|---|---|
| `GEMINI_API_KEY` | LLM backend; without it, the heuristic still works (degrades on holdout) |
| `GEMINI_MODEL` | Defaults to `gemini-2.5-flash` |
| `SESSION_SECRET` | Signs the session cookie. Generate with `python -c "import secrets; print(secrets.token_urlsafe(32))"` |
| `LIVEOPS_FERNET_KEY` | Encrypts saved Slack webhooks + SMTP creds. Generate with `python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"` |
| `LIVEOPS_DB` | SQLite path. Defaults to `data/liveops.sqlite3` |
| `LIVEOPS_UPLOAD_DIR` | Per-user upload directory. Defaults to `user_data/workbench_uploads/` |
| `LIVEOPS_LLM_DEBUG` | Set to `1` to log LLM-fallback reasons to stderr (network errors, parse failures, etc.) |
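As a sketch of that graceful degradation, the optional variables above might be read roughly like this (a simplified shim with the defaults taken from the table; the real lookups are spread across `agent/` and `web/`):

```python
# Hypothetical settings shim: every variable is optional and falls back to a
# safe default, matching the table above.
import os

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")                           # absent -> heuristic planner only
GEMINI_MODEL   = os.getenv("GEMINI_MODEL", "gemini-2.5-flash")
DB_PATH        = os.getenv("LIVEOPS_DB", "data/liveops.sqlite3")
UPLOAD_DIR     = os.getenv("LIVEOPS_UPLOAD_DIR", "user_data/workbench_uploads/")
LLM_DEBUG      = os.getenv("LIVEOPS_LLM_DEBUG") == "1"                 # log fallback reasons to stderr
```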
325 pytest cases cover auth, DB, every analytic tool, the LLM agent loop, the heuristic planner's friendly-prose contract (regex literals are explicitly forbidden in plan why strings), the 55-case eval harness, the held-out 14-case corpus, the FastAPI routes (signup → login → dashboard → upload → ask → history → settings), per-user isolation across datasets / questions / connectors, encryption-at-rest verification, rate-limiter behavior, and frontend HTML scaffolding.
pip install pytest
SLACK_WEBHOOK="" SMTP_HOST="" GEMINI_API_KEY="" pytest -v

CI runs on every PR — see .github/workflows/tests.yml.
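As one example of what those tests pin down, here is a sketch of the friendly-prose contract on plan why strings. The response shape (`result["plan"]`, `step["why"]`) is assumed for illustration and may not match the real test:

```python
# Sketch of the friendly-prose contract: "why" strings in a plan must read as
# plain English and never leak regex literals. Response shape is assumed.
import re

REGEX_SMELL = re.compile(r"\\b|\.\*|[\^\$]")   # crude detector for leaked pattern syntax

def test_plan_why_strings_are_friendly_prose():
    from analyst.agent import ask               # real entry point; shape below is assumed
    result = ask("How did monthly revenue trend over the last year?")
    for step in result["plan"]:
        assert not REGEX_SMELL.search(step["why"])
```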
docker build -t liveops-agent .
docker run --rm -p 8000:8000 \
--env-file .env \
-v "$PWD/data:/app/data" \
-v "$PWD/user_data:/app/user_data" \
  liveops-agent

- Connect the GitHub repo on render.com → New Web Service.
- Build command: `pip install -r requirements.txt`
- Start command: `uvicorn web.server:app --host 0.0.0.0 --port $PORT`
- Set `GEMINI_API_KEY`, `SESSION_SECRET`, `LIVEOPS_FERNET_KEY` in the Environment tab.
The app auto-deploys on every push to main.
Sensitive data handling:
- Passwords — bcrypt-hashed, never stored plaintext.
- Sessions — signed with `SESSION_SECRET` (Starlette's `SessionMiddleware`).
- Connector secrets (Slack webhooks, SMTP credentials) — Fernet-encrypted at rest with a key derived from `LIVEOPS_FERNET_KEY`. The encryption is verified by tests: the stored blob never contains the plaintext substring.
- Per-user isolation — datasets, questions, and connectors are all keyed by username and tested for cross-user access (404 on read, 401 on mutate).
- Rate limiting — 5 attempts / 60s sliding window per IP on `/login` and `/signup`.

`.env` is git-ignored. See `SECURITY.md` for the secret-rotation playbook.
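The encryption-at-rest property reduces to a simple invariant, sketched here with the key used directly (the real helpers live in agent/secret.py and may differ in detail):

```python
# Minimal sketch of the property the tests assert: what lands in SQLite is the
# Fernet ciphertext, and the plaintext never appears in the stored blob.
import os
from cryptography.fernet import Fernet

key = os.environ.get("LIVEOPS_FERNET_KEY") or Fernet.generate_key()
fernet = Fernet(key)

webhook = "https://hooks.slack.com/services/T000/B000/XXXX"   # example plaintext secret
blob = fernet.encrypt(webhook.encode())                       # what actually gets stored

assert webhook.encode() not in blob                           # plaintext never stored
assert fernet.decrypt(blob).decode() == webhook               # round-trips cleanly
```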
- Tighten the LLM routing prompt. The 28.6% holdout pass rate is the single biggest lever — most failures are the LLM returning the right tool name in prose but not in the JSON shape the parser wants. ~30 minutes of prompt iteration could push this past 60%.
- Token + per-tool latency in `obs_meta`. Currently `obs_meta` reports wall-clock latency and tool count; surfacing token cost (LLM) and per-tool timing would close the per-question observability story.
- Self-serve password reset. Today the login page has an honest "Forgot password?" stub — a proper email-driven reset needs a one-shot token table and server-side SMTP sending.
- Bandit feedback loop. Capturing thumbs-up/down on each agent answer to learn which tool selections lead to satisfying responses.
MIT — see LICENSE.