feat(sp0): benchmark suite, real framework integration tests, self-hosted login#1

Merged
aks-builds merged 17 commits into main from feature/sp0-benchmark-real-tests
Jun 25, 2026
@aks-builds

Owner

Summary

  • Benchmark suite (20 tasks): measured token reduction of 23.2% with a 0% accuracy delta; every claim is backed by a measurement
  • Real integration tests for all 5 framework adapters: LangChain, LangGraph, AutoGen (ag2), CrewAI (skip — langchain 1.x incompatibility), Smolagents
  • Bug fixes: LangChain LCEL callback injection, AutoGen _process_message signature, Smolagents CallbackRegistry API (v1.10+), benchmark accuracy word-boundary checks
  • agentsave login rewrite: self-hosted connection command — prompts URL + API key, verifies via /api/health + /api/billing, removes all app.agentsave.io references
  • Dep corrections: pyautogen to ag2>=0.13.0, crewai>=0.80.0 to crewai>=0.11.0, smolagents>=1.10.0 added to dev extras

Test plan

  • pytest tests/ -q -> 88 passed, 3 skipped (crewai expected)
  • agentsave login --help shows self-hosted prompt text, no app.agentsave.io
  • python -m benchmarks.runner regenerates BENCHMARKS.md with real numbers

Generated with Claude Code

…README

- Remove fake arXiv:2510.26585 and arXiv:2603.13358 citations
- Change title from 'Save 30%' to 'Cut AI agent token costs' until 30% is proven
- Replace 'zero accuracy loss' tagline with 'targeting ~30% with no accuracy loss — see BENCHMARKS.md'
- Update SDK layer copy: 23% measured, 30% target, not claimed as achieved
- Mark E2E tests badge as 'coming soon' (no E2E suite exists yet)
- Mark InferRoute PPD claim as 'in development'
- Add BENCHMARKS.md with current numbers and methodology
…laims true

Covers Sub-project 0 (benchmark suite + real framework tests), Sub-project 1
(agentsave-dashboard backend with JWT license keys + self-hosted auth),
Sub-project 2 (agentsave-ui with 70-test Playwright suite across 3 layers),
and Sub-project 3 (agentsave-inferroute Enterprise sidecar). Includes
agentsave login self-hosted flow and README update checkpoints per sub-project.
…P2 UI, SP3 InferRoute

SP0: 20-task benchmark suite, accuracy measurement, real framework integration
     tests, agentsave login self-hosted rewrite
SP1: agentsave-dashboard FastAPI backend — SQLite, JWT license keys, 7 endpoints,
     retention service, first-run keygen, CLI
SP2: agentsave-ui Next.js dashboard — all 8 pages, 3-layer Playwright suite
     (~23 tests covering API, browser, and SDK-to-UI flows)
SP3: agentsave-inferroute PPD routing sidecar — turn classifier, router,
     vLLM/SGLang adapters, Docker image, TTFT benchmark scaffold
…chema test

- Bug 1: use word-boundary regex for short ground truths (<=3 chars) to
  prevent single-char false positives (e.g. C matching Cu)
- Bug 2: use word-boundary regex for purely numeric ground truths to
  prevent numeric substring false positives (e.g. 100 matching 1001)
- Bug 3: return False immediately when ground_truth is whitespace-only
- Add import re required by the regex fixes
- Add pythonpath to pyproject.toml so benchmarks is importable without editable install
- Add test_tasks_schema to tests/benchmarks/test_accuracy.py; all 7 pass
…dary check

For len(truth) <= 3 or purely numeric truths, a failed word-boundary
regex now returns False immediately — no difflib fallback — preventing
matches_ground_truth("Cu", "C") and ("799","79") from returning True
via a high similarity ratio.
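
The two accuracy fixes above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the argument order (output first, ground truth second) is inferred from the examples in the commit message, and the substring check, the `threshold` parameter, and its 0.8 default are assumptions.

```python
import re
from difflib import SequenceMatcher


def matches_ground_truth(output: str, ground_truth: str, threshold: float = 0.8) -> bool:
    """Return True if `output` plausibly contains `ground_truth`."""
    truth = ground_truth.strip()
    if not truth:
        # Whitespace-only ground truth can never be matched.
        return False

    if len(truth) <= 3 or truth.isdigit():
        # Short or purely numeric truths must appear as a whole token.
        # A failed match returns False directly, with no fuzzy fallback.
        pattern = r"(?<!\w)" + re.escape(truth) + r"(?!\w)"
        return re.search(pattern, output, flags=re.IGNORECASE) is not None

    # Longer truths: substring match first, then a similarity ratio
    # (both the fallback and the threshold are assumed here).
    if truth.lower() in output.lower():
        return True
    return SequenceMatcher(None, output.lower(), truth.lower()).ratio() >= threshold
```

Returning early for short and numeric truths is what keeps a high `difflib` ratio from turning `"Cu"` into a match for `"C"`.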
…rious tokens_after clamp

- Delete unused TaskResult dataclass (run_benchmark() never instantiated it)
- Replace cf.is_relevant(output) with gain >= relevance_threshold to avoid
  redundant TF-IDF scoring on every output
- Remove the `if tokens_after == 0: tokens_after = 1` guard — BenchmarkResult
  divides by total_before (never zero-guarded by this line) so the clamp was
  wrong and unnecessary

Fix 1: open() in test_report.py now uses encoding="utf-8" to avoid
UnicodeDecodeError on Windows when reading ✓/✗ characters (cp1252 default).

Fix 2: per-task token-reduction division is now guarded against
tokens_before == 0 to prevent ZeroDivisionError.

Fix 3: report heading now derives task count from len(result.per_task)
instead of the hardcoded literal "20".

Fix 4: removed embedded \n from the first two heading strings so
"\n".join(lines) no longer produces a double blank line after each heading.
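
Fixes 2 to 4 can be illustrated with a small report renderer. The names `TaskRow`, `reduction_pct`, and `render_report`, and the heading text, are hypothetical and only stand in for whatever the benchmark code actually uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRow:
    # Hypothetical per-task record; field names are assumptions.
    name: str
    tokens_before: int
    tokens_after: int


def reduction_pct(tokens_before: int, tokens_after: int) -> float:
    """Percentage of tokens saved, guarded against a zero denominator."""
    if tokens_before == 0:
        return 0.0
    return 100.0 * (tokens_before - tokens_after) / tokens_before


def render_report(per_task: List[TaskRow]) -> str:
    # The task count comes from the data rather than a hardcoded "20".
    # Heading strings carry no embedded newline; join() supplies them.
    lines = [f"# Benchmark results ({len(per_task)} tasks)", ""]
    for row in per_task:
        pct = reduction_pct(row.tokens_before, row.tokens_after)
        lines.append(f"- {row.name}: {pct:.1f}% fewer tokens")
    return "\n".join(lines)
```

When such a report contains characters like ✓ or ✗, reading it back with `open(path, encoding="utf-8")` avoids depending on the platform default encoding, which is the point of Fix 1.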
…nused langchain-community dep

- langchain.py: route callbacks via config= for LCEL Runnables/Chains; keep kwarg path for AgentExecutor
- test_langchain_integration.py: replace trivial tokens_consumed > 0 with input+output token assertion; tighten importorskip to fake_chat_models submodule across all three tests
- pyproject.toml: remove langchain-community from dev deps (never imported)
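
As a rough sketch of the `config=` routing described above: LCEL Runnables accept callbacks through the `callbacks` key of the config passed to `invoke`. The helper name `invoke_with_callbacks` is hypothetical, and the sketch assumes any callbacks already present in the config are given as a plain list. It covers only the Runnable path, not the AgentExecutor kwarg path.

```python
from typing import Any, Dict, List, Optional


def invoke_with_callbacks(
    runnable: Any,
    inputs: Any,
    callbacks: List[Any],
    config: Optional[Dict[str, Any]] = None,
) -> Any:
    """Invoke a Runnable-like object with extra callbacks merged into config."""
    merged = dict(config or {})  # copy so the caller's config is not mutated
    existing = list(merged.get("callbacks") or [])
    merged["callbacks"] = existing + list(callbacks)
    return runnable.invoke(inputs, config=merged)
```

Merging rather than overwriting keeps any handlers the caller already configured.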

- Create tests/adapters/test_autogen_integration.py: 3 tests using real
  autogen (ag2 0.13.4) ConversableAgent with llm_config=False; all pass.
- Create tests/adapters/test_crewai_integration.py: 3 tests using
  Crew.__new__ bypass; skip cleanly when crewai not importable.
- Fix AutoGenAdapter._process_message signature to match ag2 register_reply
  calling convention: (agent_self, messages=None, sender=None, config=None).
- Add pyautogen>=0.4.0 and crewai>=0.80.0 to dev optional-dependencies.
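
For illustration, a reply hook with the calling convention named above might look like the sketch below. The factory `make_reply_hook` and the `record` list are hypothetical; the sketch assumes messages are dicts with a `content` key and that returning `(False, None)` lets the agent continue to its other reply functions.

```python
from typing import Any, Callable, List, Optional, Tuple


def make_reply_hook(record: List[str]) -> Callable[..., Tuple[bool, Optional[Any]]]:
    """Build an observing reply function that never produces a final reply."""

    def _process_message(agent_self, messages=None, sender=None, config=None):
        # Observe each message without altering the conversation.
        for message in messages or []:
            content = message.get("content") if isinstance(message, dict) else message
            record.append(str(content or ""))
        # (False, None): not a final reply, so later reply functions still run.
        return False, None

    return _process_message


# With ag2 installed, registration would look roughly like:
#   agent.register_reply([autogen.Agent, None], make_reply_hook(seen))
```

Keeping `messages`, `sender`, and `config` as keyword parameters with `None` defaults lets the function be called either positionally or by keyword.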

ag2>=0.13.0 preserves the classic ConversableAgent API; pyautogen>=0.10
redirects to autogen-agentchat which has an incompatible API.
crewai>=0.80.0 does not exist on PyPI (max is 0.11.2).

Fixes SmolagentsAdapter for smolagents>=1.10 where step_callbacks is
a CallbackRegistry (not a list). Uses registry.register() to inject the
supervisor callback and snapshots/restores _callbacks on exit.
Three integration tests pass against smolagents 1.26.0.
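
The snapshot-and-restore pattern described above can be sketched with a context manager. `FakeRegistry` is a stand-in written for this example; it assumes the real registry exposes `register(step_cls, callback)` and stores handlers in a `_callbacks` dict, as the commit message implies.

```python
from contextlib import contextmanager
from typing import Any, Callable, Dict, List


class FakeRegistry:
    """Stand-in for a callback registry; not the smolagents class itself."""

    def __init__(self) -> None:
        self._callbacks: Dict[type, List[Callable[..., Any]]] = {}

    def register(self, step_cls: type, callback: Callable[..., Any]) -> None:
        self._callbacks.setdefault(step_cls, []).append(callback)


@contextmanager
def injected_callback(registry: Any, step_cls: type, callback: Callable[..., Any]):
    # Copy the lists too, so later appends cannot leak into the snapshot.
    snapshot = {cls: list(cbs) for cls, cbs in registry._callbacks.items()}
    registry.register(step_cls, callback)
    try:
        yield registry
    finally:
        # Restore the registry exactly as it was before injection.
        registry._callbacks = snapshot
```

Restoring in `finally` means the supervisor callback is removed even if the agent run raises.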

Removes all references to app.agentsave.io. login now prompts for
a dashboard URL and API key, verifies connectivity via /api/health
and /api/billing, then saves api_url/token to ~/.agentsave/config.json.
Also removes the dashboard command and adds smolagents>=1.10.0 to dev deps.
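
A minimal sketch of the verify-then-save flow, assuming the API key is sent as a Bearer token and that both endpoints answer 200 on success (neither detail is stated in this PR). The function names and the injectable `fetch` parameter are hypothetical, and interactive prompting is left out.

```python
import json
from pathlib import Path
from typing import Callable
from urllib.request import Request, urlopen

DEFAULT_CONFIG = Path.home() / ".agentsave" / "config.json"


def http_status(url: str, token: str) -> int:
    # Assumed auth scheme: "Authorization: Bearer <token>".
    req = Request(url, headers={"Authorization": f"Bearer {token}"})
    with urlopen(req, timeout=10) as resp:
        return resp.status


def login(
    api_url: str,
    token: str,
    config_path: Path = DEFAULT_CONFIG,
    fetch: Callable[[str, str], int] = http_status,
) -> Path:
    """Verify the dashboard is reachable, then persist api_url and token."""
    base = api_url.rstrip("/")
    for endpoint in ("/api/health", "/api/billing"):
        status = fetch(base + endpoint, token)
        if status != 200:
            raise RuntimeError(f"{endpoint} returned HTTP {status}")
    config_path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"api_url": base, "token": token}
    config_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return config_path
```

Nothing is written to disk unless both checks succeed, so a mistyped URL or key does not overwrite a working configuration.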
…in crewai integration tests

crewai.Crew on Python 3.11 is Pydantic v2 and raises ValueError when
assigning non-field names (e.g. kickoff). FakeCrew mimics the type-detection
signature (type.__name__=='Crew', 'crewai' in type.__module__) without
using the real Pydantic model, so the adapter is still fully exercised.
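
The type-detection trick can be sketched as below. `is_crewai_crew` and `make_fake_crew` are hypothetical names, and the module string `"crewai.crew"` is an assumption; the only requirement taken from the commit message is that the class is named `Crew` and its module contains `crewai`.

```python
from typing import Any


def is_crewai_crew(obj: Any) -> bool:
    """Duck-type check that avoids importing crewai."""
    cls = type(obj)
    return cls.__name__ == "Crew" and "crewai" in cls.__module__


def make_fake_crew(result: Any = "ok") -> Any:
    """Build a plain object whose type looks like crewai's Crew."""

    def kickoff(self, *args: Any, **kwargs: Any) -> Any:
        return result

    # A dynamically created plain class: no Pydantic model, so attributes
    # such as kickoff can be reassigned freely by an adapter under test.
    fake_cls = type("Crew", (), {"__module__": "crewai.crew", "kickoff": kickoff})
    return fake_cls()
```

Because the fake is an ordinary Python object, assigning `crew.kickoff = wrapper` works without the `ValueError` that a Pydantic v2 model raises for non-field names.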

@aks-builds merged commit dc374c9 into main on Jun 25, 2026
3 checks passed
@aks-builds deleted the feature/sp0-benchmark-real-tests branch on June 25, 2026 04:05