feat(sp0): benchmark suite, real framework integration tests, self-hosted login #1
Merged
…README
- Remove fake arXiv:2510.26585 and arXiv:2603.13358 citations
- Change title from 'Save 30%' to 'Cut AI agent token costs' until 30% is proven
- Replace 'zero accuracy loss' tagline with 'targeting ~30% with no accuracy loss — see BENCHMARKS.md'
- Update SDK layer copy: 23% measured, 30% target, not claimed as achieved
- Mark E2E tests badge as 'coming soon' (no E2E suite exists yet)
- Mark InferRoute PPD claim as 'in development'
- Add BENCHMARKS.md with current numbers and methodology
…laims true
Covers Sub-project 0 (benchmark suite + real framework tests), Sub-project 1 (agentsave-dashboard backend with JWT license keys + self-hosted auth), Sub-project 2 (agentsave-ui with 70-test Playwright suite across 3 layers), and Sub-project 3 (agentsave-inferroute Enterprise sidecar). Includes the agentsave login self-hosted flow and README update checkpoints per sub-project.
…P2 UI, SP3 InferRoute
SP0: 20-task benchmark suite, accuracy measurement, real framework integration
tests, agentsave login self-hosted rewrite
SP1: agentsave-dashboard FastAPI backend — SQLite, JWT license keys, 7 endpoints,
retention service, first-run keygen, CLI
SP2: agentsave-ui Next.js dashboard — all 8 pages, 3-layer Playwright suite
(~23 tests covering API, browser, and SDK-to-UI flows)
SP3: agentsave-inferroute PPD routing sidecar — turn classifier, router,
vLLM/SGLang adapters, Docker image, TTFT benchmark scaffold
…chema test
- Bug 1: use word-boundary regex for short ground truths (<=3 chars) to prevent single-char false positives (e.g. C matching Cu)
- Bug 2: use word-boundary regex for purely numeric ground truths to prevent numeric substring false positives (e.g. 100 matching 1001)
- Bug 3: return False immediately when ground_truth is whitespace-only
- Add the import re required by the regex fixes
- Add pythonpath to pyproject.toml so benchmarks is importable without editable install
- Add test_tasks_schema to tests/benchmarks/test_accuracy.py; all 7 pass
…dary check
For len(truth) <= 3 or purely numeric truths, a failed word-boundary
regex now returns False immediately — no difflib fallback — preventing
matches_ground_truth("Cu", "C") and ("799","79") from returning True
via a high similarity ratio.
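The matching rules described in the two commits above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the 0.8 similarity threshold, the substring shortcut for longer truths, and the use of lookarounds in place of a literal `\b` are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def matches_ground_truth(output: str, truth: str, threshold: float = 0.8) -> bool:
    """Return True if `output` is judged to contain or approximate `truth`.

    `threshold` is an assumed default, not a value taken from the PR.
    """
    truth = truth.strip()
    if not truth:
        # Whitespace-only ground truth can never be matched.
        return False

    if len(truth) <= 3 or truth.isdigit():
        # Short or purely numeric truths: require a whole-token match and
        # return the result directly, with no fuzzy fallback.
        pattern = r"(?<!\w)" + re.escape(truth) + r"(?!\w)"
        return re.search(pattern, output) is not None

    # Longer truths: substring check first, then a similarity ratio.
    if truth.lower() in output.lower():
        return True
    ratio = SequenceMatcher(None, output.lower(), truth.lower()).ratio()
    return ratio >= threshold
```

With this sketch, `matches_ground_truth("Cu", "C")` and `matches_ground_truth("799", "79")` both return False, while `matches_ground_truth("The answer is C.", "C")` returns True.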
…rious tokens_after clamp
- Delete unused TaskResult dataclass (run_benchmark() never instantiated it)
- Replace cf.is_relevant(output) with gain >= relevance_threshold to avoid redundant TF-IDF scoring on every output
- Remove the `if tokens_after == 0: tokens_after = 1` guard: BenchmarkResult divides by total_before (never zero-guarded by this line), so the clamp was wrong and unnecessary
Fix 1: open() in test_report.py now uses encoding="utf-8" to avoid UnicodeDecodeError on Windows when reading ✓/✗ characters (cp1252 default).
Fix 2: per-task token-reduction division is now guarded against tokens_before == 0 to prevent ZeroDivisionError.
Fix 3: report heading now derives task count from len(result.per_task) instead of the hardcoded literal "20".
Fix 4: removed embedded \n from the first two heading strings so "\n".join(lines) no longer produces a double blank line after each heading.
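The four report fixes can be pictured together in a small sketch. `TaskRow`, `reduction_pct`, `render_report`, and `write_report` are hypothetical names chosen for illustration, and the heading text is invented; only the behaviours (UTF-8 I/O, guarded division, derived task count, no embedded newlines) come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class TaskRow:
    """Hypothetical stand-in for one per-task benchmark entry."""
    name: str
    tokens_before: int
    tokens_after: int
    correct: bool


def reduction_pct(tokens_before: int, tokens_after: int) -> float:
    """Percentage of tokens saved; 0.0 when there is nothing to divide by."""
    if tokens_before == 0:
        return 0.0
    return 100.0 * (tokens_before - tokens_after) / tokens_before


def render_report(per_task: list[TaskRow]) -> str:
    lines = [
        # Task count is derived from the data, not hardcoded.
        f"# Benchmark report ({len(per_task)} tasks)",
        # Blank separator as its own entry, so no heading carries a "\n".
        "",
    ]
    for row in per_task:
        mark = "✓" if row.correct else "✗"
        pct = reduction_pct(row.tokens_before, row.tokens_after)
        lines.append(f"{mark} {row.name}: {pct:.1f}% fewer tokens")
    return "\n".join(lines)


def write_report(path: str, per_task: list[TaskRow]) -> None:
    # Explicit UTF-8 so ✓/✗ round-trip on platforms that default to cp1252.
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(render_report(per_task))
```

Reading the file back in a test would need the same `encoding="utf-8"` argument, which is what Fix 1 addresses.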
…nused langchain-community dep
- langchain.py: route callbacks via config= for LCEL Runnables/Chains; keep kwarg path for AgentExecutor
- test_langchain_integration.py: replace trivial tokens_consumed > 0 with input+output token assertion; tighten importorskip to fake_chat_models submodule across all three tests
- pyproject.toml: remove langchain-community from dev deps (never imported)
- Create tests/adapters/test_autogen_integration.py: 3 tests using real autogen (ag2 0.13.4) ConversableAgent with llm_config=False; all pass.
- Create tests/adapters/test_crewai_integration.py: 3 tests using Crew.__new__ bypass; skip cleanly when crewai not importable.
- Fix AutoGenAdapter._process_message signature to match ag2 register_reply calling convention: (agent_self, messages=None, sender=None, config=None).
- Add pyautogen>=0.4.0 and crewai>=0.80.0 to dev optional-dependencies.
ag2>=0.13.0 preserves the classic ConversableAgent API; pyautogen>=0.10 redirects to autogen-agentchat which has an incompatible API. crewai>=0.80.0 does not exist on PyPI (max is 0.11.2).
Fixes SmolagentsAdapter for smolagents>=1.10 where step_callbacks is a CallbackRegistry (not a list). Uses registry.register() to inject the supervisor callback and snapshots/restores _callbacks on exit. Three integration tests pass against smolagents 1.26.0.
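The snapshot-and-restore pattern described above can be sketched without importing smolagents. `FakeRegistry` is a hypothetical stand-in that only assumes a `register()` method and a `_callbacks` mapping, as the commit message mentions; the real CallbackRegistry may differ in detail, and `injected_callback` is an illustrative name, not the adapter's code.

```python
from contextlib import contextmanager


class FakeRegistry:
    """Hypothetical stand-in for a registry-style callback container."""

    def __init__(self):
        self._callbacks = {}

    def register(self, step_type, callback):
        self._callbacks.setdefault(step_type, []).append(callback)


@contextmanager
def injected_callback(step_callbacks, step_type, callback):
    """Temporarily add `callback`, restoring the original state on exit."""
    if isinstance(step_callbacks, list):
        # Older shape: a plain list of callables.
        snapshot = list(step_callbacks)
        step_callbacks.append(callback)
        try:
            yield
        finally:
            step_callbacks[:] = snapshot
    else:
        # Registry shape: copy the mapping and its lists before mutating.
        snapshot = {k: list(v) for k, v in step_callbacks._callbacks.items()}
        step_callbacks.register(step_type, callback)
        try:
            yield
        finally:
            step_callbacks._callbacks = snapshot
```

Copying each inner list matters: restoring a shallow copy of the dict alone would leave the injected callback inside any list that already existed.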
Removes all references to app.agentsave.io. login now prompts for a dashboard URL and API key, verifies connectivity via /api/health and /api/billing, then saves api_url/token to ~/.agentsave/config.json. Also removes the dashboard command and adds smolagents>=1.10.0 to dev deps.
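A rough sketch of the self-hosted login flow described above, using only the standard library. The bearer-token header, the requirement that both endpoints return 200, and the `fetch` and `config_path` parameters are assumptions added for illustration and testability; only the two endpoints, the saved keys, and the config location come from the commit message.

```python
import json
from pathlib import Path
from urllib.request import Request, urlopen

CONFIG_PATH = Path.home() / ".agentsave" / "config.json"


def http_status(url: str, token: str) -> int:
    """GET `url` with a bearer token (assumed auth scheme); return the status."""
    req = Request(url, headers={"Authorization": f"Bearer {token}"})
    with urlopen(req, timeout=10) as resp:
        return resp.status


def login(api_url: str, token: str, fetch=http_status,
          config_path: Path = CONFIG_PATH) -> bool:
    """Verify connectivity, then persist api_url/token. Returns success."""
    base = api_url.rstrip("/")
    for endpoint in ("/api/health", "/api/billing"):
        try:
            if fetch(base + endpoint, token) != 200:
                return False
        except OSError:
            # urllib's URLError and HTTPError are OSError subclasses.
            return False
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(
        json.dumps({"api_url": base, "token": token}, indent=2),
        encoding="utf-8",
    )
    return True
```

Passing `fetch` as a parameter lets a test substitute a fake that returns fixed status codes, so the flow can be exercised without a running dashboard.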
…in crewai integration tests
crewai.Crew on Python 3.11 is Pydantic v2 and raises ValueError when assigning non-field names (e.g. kickoff). FakeCrew mimics the type-detection signature (type.__name__=='Crew', 'crewai' in type.__module__) without using the real Pydantic model, so the adapter is still fully exercised.
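The test-double idea can be illustrated as below. Only the detection rule (`type.__name__ == 'Crew'` and `'crewai' in type.__module__`) comes from the commit message; the `"crewai.crew"` module string, `looks_like_crew`, and `wrap_kickoff` are hypothetical names standing in for whatever the adapter actually does.

```python
class FakeCrew:
    """Plain-Python test double; not a Pydantic model, so any attribute
    (including `kickoff`) can be reassigned on an instance."""

    def __init__(self):
        self.calls = 0

    def kickoff(self, inputs=None):
        self.calls += 1
        return "done"


# Make the class look like a crewai Crew to name/module-based detection.
FakeCrew.__name__ = "Crew"
FakeCrew.__module__ = "crewai.crew"  # assumed module path, for illustration


def looks_like_crew(obj) -> bool:
    """Detection rule quoted in the commit message."""
    t = type(obj)
    return t.__name__ == "Crew" and "crewai" in t.__module__


def wrap_kickoff(crew, on_call):
    """Hypothetical adapter step: replace kickoff with an instrumented wrapper."""
    original = crew.kickoff

    def wrapped(*args, **kwargs):
        on_call()
        return original(*args, **kwargs)

    crew.kickoff = wrapped  # works here; per the commit, the real Crew raises
    return crew
```

Because the double passes the same type check as the real class, the adapter's detection and wrapping paths still run in the test.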
Generated with Claude Code