feat(sp0): benchmark suite, real framework integration tests, self-hosted login by aks-builds · Pull Request #1 · aks-builds/agentsave

aks-builds · 2026-06-25T03:11:37Z

Summary

Benchmark suite (20 tasks): token reduction measured at 23.2%, accuracy delta 0%; all published claims are now backed by measurements
Real integration tests for all 5 framework adapters: LangChain, LangGraph, AutoGen (ag2), CrewAI (skip — langchain 1.x incompatibility), Smolagents
Bug fixes: LangChain LCEL callback injection, AutoGen _process_message signature, Smolagents CallbackRegistry API (v1.10+), benchmark accuracy word-boundary checks
agentsave login rewrite: self-hosted connection command — prompts URL + API key, verifies via /api/health + /api/billing, removes all app.agentsave.io references
Dep corrections: pyautogen to ag2>=0.13.0, crewai>=0.80.0 to crewai>=0.11.0, smolagents>=1.10.0 added to dev extras

Test plan

pytest tests/ -q -> 88 passed, 3 skipped (crewai expected)
agentsave login --help shows self-hosted prompt text, no app.agentsave.io
python -m benchmarks.runner regenerates BENCHMARKS.md with real numbers

Generated with Claude Code

…README
- Remove fake arXiv:2510.26585 and arXiv:2603.13358 citations
- Change title from 'Save 30%' to 'Cut AI agent token costs' until 30% is proven
- Replace 'zero accuracy loss' tagline with 'targeting ~30% with no accuracy loss — see BENCHMARKS.md'
- Update SDK layer copy: 23% measured, 30% target, not claimed as achieved
- Mark E2E tests badge as 'coming soon' (no E2E suite exists yet)
- Mark InferRoute PPD claim as 'in development'
- Add BENCHMARKS.md with current numbers and methodology

…laims true
Covers Sub-project 0 (benchmark suite + real framework tests), Sub-project 1 (agentsave-dashboard backend with JWT license keys + self-hosted auth), Sub-project 2 (agentsave-ui with 70-test Playwright suite across 3 layers), and Sub-project 3 (agentsave-inferroute Enterprise sidecar). Includes agentsave login self-hosted flow and README update checkpoints per sub-project.

…P2 UI, SP3 InferRoute
SP0: 20-task benchmark suite, accuracy measurement, real framework integration tests, agentsave login self-hosted rewrite
SP1: agentsave-dashboard FastAPI backend (SQLite, JWT license keys, 7 endpoints, retention service, first-run keygen, CLI)
SP2: agentsave-ui Next.js dashboard (all 8 pages, 3-layer Playwright suite with ~23 tests covering API, browser, and SDK-to-UI flows)
SP3: agentsave-inferroute PPD routing sidecar (turn classifier, router, vLLM/SGLang adapters, Docker image, TTFT benchmark scaffold)

…chema test
- Bug 1: use word-boundary regex for short ground truths (<=3 chars) to prevent single-char false positives (e.g. C matching Cu)
- Bug 2: use word-boundary regex for purely numeric ground truths to prevent numeric substring false positives (e.g. 100 matching 1001)
- Bug 3: return False immediately when ground_truth is whitespace-only
- Add import re required by the regex fixes
- Add pythonpath to pyproject.toml so benchmarks is importable without editable install
- Add test_tasks_schema to tests/benchmarks/test_accuracy.py; all 7 pass

…dary check
For len(truth) <= 3 or purely numeric truths, a failed word-boundary regex now returns False immediately, with no difflib fallback. This prevents matches_ground_truth("Cu", "C") and matches_ground_truth("799", "79") from returning True via a high similarity ratio.
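The matching rules described in the two commit messages above could look roughly like the sketch below. It is not the repository's code: the argument order (output, truth) is inferred from the examples in the commit text, and the similarity threshold, the case-insensitive comparison, and the use of str.isdigit() for "purely numeric" are all assumptions.

```python
import difflib
import re


def matches_ground_truth(output: str, truth: str, threshold: float = 0.8) -> bool:
    """Hypothetical sketch of the word-boundary rules from the commit messages."""
    truth = truth.strip()
    if not truth:
        # Bug 3: a whitespace-only ground truth can never be matched.
        return False
    if len(truth) <= 3 or truth.isdigit():
        # Bugs 1 and 2: short or purely numeric truths must appear as a whole
        # token. A failed check returns False directly, with no fuzzy fallback.
        pattern = r"(?<!\w)" + re.escape(truth) + r"(?!\w)"
        return re.search(pattern, output, flags=re.IGNORECASE) is not None
    if truth.lower() in output.lower():
        return True
    # Longer truths may still fall back to a similarity ratio (assumed threshold).
    ratio = difflib.SequenceMatcher(None, output.lower(), truth.lower()).ratio()
    return ratio >= threshold
```

With these rules, "C" no longer matches inside "Cu" and "100" no longer matches inside "1001", while a standalone "C" or "100" in the output still counts as correct.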

…n measurement

…rious tokens_after clamp
- Delete unused TaskResult dataclass (run_benchmark() never instantiated it)
- Replace cf.is_relevant(output) with gain >= relevance_threshold to avoid redundant TF-IDF scoring on every output
- Remove the `if tokens_after == 0: tokens_after = 1` guard: BenchmarkResult divides by total_before, which this line never protected against zero, so the clamp was wrong and unnecessary

…l numbers

Fix 1: open() in test_report.py now uses encoding="utf-8" to avoid UnicodeDecodeError on Windows when reading ✓/✗ characters (cp1252 default).
Fix 2: per-task token-reduction division is now guarded against tokens_before == 0 to prevent ZeroDivisionError.
Fix 3: report heading now derives task count from len(result.per_task) instead of the hardcoded literal "20".
Fix 4: removed embedded \n from the first two heading strings so "\n".join(lines) no longer produces a double blank line after each heading.
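Fixes 2 to 4 can be illustrated with a small sketch. The function names, the heading wording, and the (name, tokens_before, tokens_after) tuple layout are hypothetical; only the zero guard, the len()-derived task count, and the "\n".join(lines) assembly come from the commit message.

```python
from typing import List, Tuple


def reduction_pct(tokens_before: int, tokens_after: int) -> float:
    """Per-task token reduction in percent, guarded against an empty input."""
    if tokens_before == 0:
        # Nothing to reduce; avoids ZeroDivisionError (Fix 2).
        return 0.0
    return 100.0 * (tokens_before - tokens_after) / tokens_before


def render_report(per_task: List[Tuple[str, int, int]]) -> str:
    """Build report text from (name, tokens_before, tokens_after) tuples."""
    lines = [
        # Task count derived from the data, not hardcoded (Fix 3).
        f"Benchmark results ({len(per_task)} tasks)",
        # Blank line added as its own element, not as an embedded "\n" (Fix 4).
        "",
    ]
    for name, before, after in per_task:
        lines.append(f"{name}: {reduction_pct(before, after):.1f}% reduction")
    return "\n".join(lines)
```

For example, render_report([("task-a", 200, 150), ("task-b", 0, 0)]) yields a heading that says "2 tasks", and the empty task reports 0.0% without raising. The inputs here are illustrative, not measured values.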

…nused langchain-community dep
- langchain.py: route callbacks via config= for LCEL Runnables/Chains; keep kwarg path for AgentExecutor
- test_langchain_integration.py: replace trivial tokens_consumed > 0 with input+output token assertion; tighten importorskip to fake_chat_models submodule across all three tests
- pyproject.toml: remove langchain-community from dev deps (never imported)

- Create tests/adapters/test_autogen_integration.py: 3 tests using real autogen (ag2 0.13.4) ConversableAgent with llm_config=False; all pass.
- Create tests/adapters/test_crewai_integration.py: 3 tests using Crew.__new__ bypass; skip cleanly when crewai not importable.
- Fix AutoGenAdapter._process_message signature to match ag2 register_reply calling convention: (agent_self, messages=None, sender=None, config=None).
- Add pyautogen>=0.4.0 and crewai>=0.80.0 to dev optional-dependencies.
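To make the calling convention in the signature fix concrete, here is a minimal sketch of a reply function with that shape. It does not import ag2; MessageTally and make_reply_hook are hypothetical names, and returning (False, None) so that the agent's other reply functions still run is an assumption about what the adapter wants.

```python
from typing import Any, Dict, List, Optional, Tuple


class MessageTally:
    """Hypothetical stand-in for whatever bookkeeping the adapter performs."""

    def __init__(self) -> None:
        self.messages_seen = 0


def make_reply_hook(tally: MessageTally):
    """Return a function shaped like the signature quoted in the commit."""

    def _process_message(
        agent_self: Any,
        messages: Optional[List[Dict[str, Any]]] = None,
        sender: Optional[Any] = None,
        config: Optional[Any] = None,
    ) -> Tuple[bool, Optional[str]]:
        # Observe the conversation without producing a reply of our own.
        tally.messages_seen += len(messages or [])
        # (False, None): not a final reply, so later reply functions proceed.
        return False, None

    return _process_message
```

A hook built this way would be handed to the agent's register_reply; the three integration tests mentioned above exercise the real adapter against ag2 0.13.4.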

ag2>=0.13.0 preserves the classic ConversableAgent API; pyautogen>=0.10 redirects to autogen-agentchat which has an incompatible API. crewai>=0.80.0 does not exist on PyPI (max is 0.11.2).

Fixes SmolagentsAdapter for smolagents>=1.10 where step_callbacks is a CallbackRegistry (not a list). Uses registry.register() to inject the supervisor callback and snapshots/restores _callbacks on exit. Three integration tests pass against smolagents 1.26.0.

Removes all references to app.agentsave.io. login now prompts for a dashboard URL and API key, verifies connectivity via /api/health and /api/billing, then saves api_url/token to ~/.agentsave/config.json. Also removes the dashboard command and adds smolagents>=1.10.0 to dev deps.
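The verify-then-save flow described above might be sketched as follows. This is an illustration, not the shipped command: the Bearer authorization header, the requirement that both endpoints return HTTP 200, and the function names are assumptions. Only the two endpoint paths, the api_url/token keys, and the ~/.agentsave/config.json location come from the commit message.

```python
import json
from pathlib import Path
from typing import Callable
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

DEFAULT_CONFIG = Path.home() / ".agentsave" / "config.json"


def http_status(url: str, token: str) -> int:
    """Return the HTTP status for a GET request, or 0 if the host is unreachable."""
    # Assumption: the dashboard accepts the API key as a Bearer token.
    req = Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status
    except HTTPError as exc:
        return exc.code
    except URLError:
        return 0


def login(
    api_url: str,
    token: str,
    config_path: Path = DEFAULT_CONFIG,
    fetch: Callable[[str, str], int] = http_status,
) -> bool:
    """Verify connectivity, then persist api_url/token; False if any check fails."""
    base = api_url.rstrip("/")
    for endpoint in ("/api/health", "/api/billing"):
        if fetch(base + endpoint, token) != 200:
            return False  # nothing is written when verification fails
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(
        json.dumps({"api_url": base, "token": token}, indent=2), encoding="utf-8"
    )
    return True
```

Passing fetch as a parameter keeps the sketch testable without a running dashboard; a CLI wrapper would collect api_url and token from interactive prompts before calling login.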

…in crewai integration tests
crewai.Crew on Python 3.11 is Pydantic v2 and raises ValueError when assigning non-field names (e.g. kickoff). FakeCrew mimics the type-detection signature (type.__name__=='Crew', 'crewai' in type.__module__) without using the real Pydantic model, so the adapter is still fully exercised.
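A rough sketch of the FakeCrew idea: build a plain class whose name and module satisfy the detection check quoted above, so a wrapper can reassign kickoff without Pydantic's field validation getting in the way. The helper names and the module string "crewai.crew" are assumptions; only the two detection conditions come from the commit message.

```python
def looks_like_crew(obj: object) -> bool:
    """Detection rule quoted in the commit: class name and module are inspected."""
    cls = type(obj)
    return cls.__name__ == "Crew" and "crewai" in cls.__module__


def make_fake_crew() -> object:
    """Create a plain (non-Pydantic) object that passes looks_like_crew()."""

    def kickoff(self, inputs=None):
        return "fake result"

    # type() builds a class named "Crew"; __module__ is set through the
    # namespace so no real crewai import is needed. The value is an assumption.
    fake_cls = type("Crew", (), {"kickoff": kickoff, "__module__": "crewai.crew"})
    return fake_cls()
```

Because the fake is an ordinary Python object, an assignment such as crew.kickoff = wrapped succeeds, which is the operation the real Pydantic v2 model rejects.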

aks-builds added 17 commits June 24, 2026 18:05

feat(benchmarks): add 20-task benchmark set and accuracy module

f8d49be

feat(benchmarks): add benchmark runner with accuracy + token reductio…

e192f81

…n measurement

feat(benchmarks): add report generator, update BENCHMARKS.md with rea…

8ea3306

…l numbers

test(adapters): add real LangChain + LangGraph integration tests

bed4a9b

fix(deps): correct autogen and crewai version constraints

cff8e2a

aks-builds merged commit dc374c9 into main Jun 25, 2026
3 checks passed

aks-builds deleted the feature/sp0-benchmark-real-tests branch June 25, 2026 04:05
