Releases: parameterlab/MASEval

v0.4.0

28 Mar 07:06
45098b0


[0.4.0] - 2026-03-28

Fixed

Core

  • Fixed MessageHistory.to_list() returning a reference to the internal list instead of a copy, causing simulator logs to contain future conversation messages that hadn't occurred at the time of logging. (PR: #48)
  • Fixed get_git_info() crashing on detached HEAD (e.g. in CI checkout), now returns detached@<short-hash> as the branch name. (PR: #41)
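The `to_list()` fix above is the classic reference-vs-copy pitfall. A minimal sketch (the `MessageHistory` class here is a hypothetical stand-in, not MASEval's implementation) shows why returning a defensive copy matters for logging:

```python
# Illustration of the reference-vs-copy pitfall fixed in
# MessageHistory.to_list(). Toy class, not MASEval's implementation.

class MessageHistory:
    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)

    def to_list_buggy(self):
        # Returns the internal list itself: callers holding this
        # reference see messages appended *after* the call.
        return self._messages

    def to_list(self):
        # Returns a snapshot, so later appends do not leak into
        # previously captured logs.
        return list(self._messages)


history = MessageHistory()
history.append({"role": "user", "content": "hi"})

leaked = history.to_list_buggy()
snapshot = history.to_list()

history.append({"role": "assistant", "content": "hello"})

print(len(leaked))    # 2 - the "log" grew retroactively
print(len(snapshot))  # 1 - the snapshot is stable
```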

Interface

  • Agent adapter gather_config() in smolagents, langgraph, and llamaindex no longer silently swallows exceptions, ensuring config collection errors are visible instead of producing incomplete configuration data. (PR: #53)

Added

Core

  • Usage and cost tracking via Usage and TokenUsage data classes. ModelAdapter tracks token usage automatically after each chat() call. Components that implement UsageTrackableMixin are collected via gather_usage(). Live totals available during benchmark runs via benchmark.usage (grand total) and benchmark.usage_by_component (per-component breakdowns). Post-hoc analysis via UsageReporter.from_reports(benchmark.reports) with breakdowns by task, component, or model. (PR: #45)
  • Pluggable cost calculation via CostCalculator protocol. StaticPricingCalculator computes cost from user-supplied per-token rates. LiteLLMCostCalculator in maseval.interface.usage for automatic pricing via LiteLLM's model database (supports custom_pricing overrides and model_id_map; requires litellm). Pass a cost_calculator to ModelAdapter or AgentAdapter to compute Usage.cost. Provider-reported cost always takes precedence. (PR: #45)
  • AgentAdapter now accepts cost_calculator and model_id parameters. For smolagents, CAMEL, and LlamaIndex, both are auto-detected from the framework's agent object (LiteLLMCostCalculator if litellm is installed). LangGraph requires explicit model_id since graphs can contain multiple models. Explicit parameters always override auto-detection. (PR: #45)
  • Task.freeze() and Task.unfreeze() methods to make task data read-only during benchmark runs, preventing accidental mutation of environment_data, user_data, evaluation_data, and metadata (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with Task.is_frozen. (PR: #42)
  • TaskFrozenError exception in maseval.core.exceptions, raised when attempting to modify a frozen task. (PR: #42)
  • Added InformativeSubsetQueue and DISCOQueue to maseval.core.task for subset-based evaluation (e.g., anchor-point selection for DISCO). DISCOQueue accepts anchor_points_path to load indices from a .json/.pkl file via DISCOQueue.load_anchor_points(). Available via from maseval import DISCOQueue, InformativeSubsetQueue. (PR: #34 and #41)
  • Added ModelScorer abstract base class in maseval.core.scorer for log-likelihood scoring, with loglikelihood(), loglikelihood_batch(), and loglikelihood_choices() methods. (PR: #34 and #41)
  • Added SeedGenerator abstract base class and DefaultSeedGenerator implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
  • Added seed and seed_generator parameters to Benchmark.__init__ for enabling reproducibility (PR: #24)
  • Added seed_generator parameter to all benchmark setup methods (setup_environment, setup_user, setup_agents, setup_evaluators) (PR: #24)
  • Added seed parameter to ModelAdapter.__init__ for deterministic model inference (PR: #24)
  • Added SeedingError exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
  • Added UserExhaustedError exception in maseval.core.exceptions for flow control when a user's turns are exhausted (PR: #39)
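To make the seeding entries concrete, here is a sketch of SHA-256-based seed derivation in the spirit of `DefaultSeedGenerator`. The exact derivation scheme (hashing a root seed together with a scope string) is an assumption for illustration, not MASEval's algorithm:

```python
# Sketch of deterministic seed derivation via SHA-256. The scheme shown
# (root seed + scope strings hashed together) is an assumption for
# illustration; DefaultSeedGenerator's exact recipe may differ.
import hashlib

def derive_seed(root_seed: int, *scope: str) -> int:
    """Derive a child seed from a root seed and a scope
    (e.g. component name, task id) deterministically."""
    material = ":".join([str(root_seed), *scope]).encode("utf-8")
    digest = hashlib.sha256(material).digest()
    # Fold the digest into a 32-bit seed usable by most RNG APIs.
    return int.from_bytes(digest[:4], "big")

# Same inputs always give the same seed; different scopes diverge.
a = derive_seed(42, "setup_agents", "task_001")
b = derive_seed(42, "setup_agents", "task_001")
c = derive_seed(42, "setup_user", "task_001")
print(a == b, a == c)  # True False
```

Deriving per-component seeds from one root seed is what lets a single `seed` passed to `Benchmark.__init__` reproduce an entire run without every component sharing the same RNG stream.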

Interface

  • Added seed support to interface adapters: OpenAIModelAdapter, GoogleGenAIModelAdapter, LiteLLMModelAdapter, HuggingFacePipelineModelAdapter pass seeds to underlying APIs (PR: #24)
  • Added HuggingFaceModelScorer in maseval.interface.inference — log-likelihood scorer backed by a HuggingFace AutoModelForCausalLM, with single-token optimisation for MCQ evaluation. Implements the ModelScorer interface. (PR: #34 and #41)
  • CAMEL-AI integration: CamelAgentAdapter and CamelLLMUser for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
    • Added CamelAgentUser for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
    • Added camel_role_playing_execution_loop() for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
    • Added CamelRolePlayingTracer and CamelWorkforceTracer for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
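The log-likelihood scorer above is used for multiple-choice evaluation: score each candidate answer by its summed token log-likelihood and pick the argmax. A toy sketch of that selection step (the scorer here is a stand-in; `HuggingFaceModelScorer` gets these numbers from a real `AutoModelForCausalLM`):

```python
# Conceptual sketch of loglikelihood_choices-style MCQ scoring: sum the
# per-token log-probs of each candidate continuation and take the argmax.
import math

def pick_choice(scores_per_choice):
    """scores_per_choice maps choice label -> list of token log-probs."""
    totals = {label: sum(lps) for label, lps in scores_per_choice.items()}
    return max(totals, key=totals.get)

scores = {
    "A": [math.log(0.2), math.log(0.5)],
    "B": [math.log(0.6), math.log(0.7)],
    "C": [math.log(0.1), math.log(0.3)],
}
print(pick_choice(scores))  # B
```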

Benchmarks

  • MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. MMLUBenchmark is a framework-agnostic base class (setup_agents() and get_model_adapter() must be implemented by subclasses); DefaultMMLUBenchmark provides a ready-made HuggingFace implementation. Also includes MMLUEnvironment, MMLUEvaluator, load_tasks(), and compute_benchmark_metrics(). Install with pip install maseval[mmlu]. Optional extras: lm-eval (for DefaultMMLUBenchmark.precompute_all_logprobs_lmeval), disco (for DISCO prediction in the example). (PR: #34 and #41)
  • CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including ConverseBenchmark, DefaultAgentConverseBenchmark, ConverseEnvironment, ConverseExternalAgent, PrivacyEvaluator, SecurityEvaluator, and load_tasks() utilities for travel, real_estate, and insurance domains. Benchmark source files are now downloaded on first use via ensure_data_exists() instead of being bundled in the package. (PR: #28)
  • GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
    • Gaia2Benchmark, Gaia2Environment, Gaia2Evaluator components for framework-agnostic evaluation with ARE simulation (PR: #26)
    • DefaultAgentGaia2Benchmark with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
    • Generic tool wrapper (Gaia2GenericTool) for MASEval tracing of ARE tools with simulation time tracking (PR: #26, #30)
    • Data loading utilities: load_tasks(), configure_model_ids() for loading scenarios from HuggingFace (PR: #26)
    • Gaia2JudgeEngineConfig for configuring the judge's LLM model and provider (PR: #30)
    • Metrics: compute_gaia2_metrics() for GSR (Goal Success Rate) computation by capability type (PR: #26)
    • Support for 5 capability dimensions: execution, search, adaptability, time, ambiguity (PR: #26, #30)
    • Added gaia2 optional dependency: pip install maseval[gaia2] (PR: #26)
  • MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across all 6 paper-defined domains: research, bargaining, coding, database, werewolf, and minecraft (PR: #25, #30)
    • MultiAgentBenchBenchmark abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
    • MarbleMultiAgentBenchBenchmark for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
    • MultiAgentBenchEnvironment and MultiAgentBenchEvaluator components (PR: #25)
    • Data loading utilities: load_tasks(), configure_model_ids(), get_domain_info(), ensure_marble_exists() (PR: #25)
    • MARBLE adapter: MarbleAgentAdapter for wrapping MARBLE agents with MASEval tracing (PR: #25)

Examples

  • Added usage tracking to the 5-A-Day benchmark: five_a_day_benchmark.ipynb (section 2.7) and five_a_day_benchmark.py (post-run usage summary with per-component and per-task breakdowns). (PR: #45)
  • MMLU benchmark example at examples/mmlu_benchmark/ for evaluating HuggingFace models on MMLU with optional DISCO prediction (--disco_model_path, --disco_transform_path). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34 and #41)
  • Added a dedicated runnable CONVERSE default benchmark example at examples/converse_benchmark/default_converse_benchmark.py for quick start with DefaultAgentConverseBenchmark. (PR: #28)
  • Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)

Documentation

  • Usage & Cost Tracking guide (docs/guides/usage-tracking.md) and API reference (docs/reference/usage.md). (PR: #45)

Testing

  • Composable pytest markers (live, credentialed, slow, smoke) for fine-grained test selection; default runs exclude slow, credentialed, and smoke tests (PR: #29)
  • Marker implication hook: credentialed implies live, so -m "not live" always gives a fully offline run (PR: #29)
  • Skip decorators (requires_openai, requires_anthropic, requires_google) for tests needing API keys (PR: #29)
  • Data integrity tests for Tau2, MACS, GAIA2, and MultiAgentBench benchmarks validating download pipelines, file structures, and data content (PR: #29, #30)
  • Real-data integration tests for GAIA2 and MultiAgentBench (PR: #30)
  • HTTP-level API contract tests for model adapters (OpenAI, Anthropic, Google GenAI, LiteLLM) using respx mocks — no API keys needed (PR: #29)
  • Live API round-trip tests for all model adapters (-m credentialed) (PR: #29)
  • CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
  • Added respx dev dependency for HTTP-level mocking (PR: #29)
  • pytest marker mmlu for tests that require the MMLU benchmark (HuggingFace + DISCO). (PR: #34 and #41)

Changed

Core

  • Simplified seeding API: seed_generator parameter in setup methods is now always non-None (`SeedGener...

v0.3.0

18 Jan 21:55
2f70bcc


[0.3.0] - 2026-01-18

Added

Parallel Execution

  • Added parallel task execution with num_workers parameter in Benchmark.run() using ThreadPoolExecutor (PR: #14)
  • Added ComponentRegistry class for thread-safe component registration with thread-local storage (PR: #14)
  • Added TaskContext for cooperative timeout checking with check_timeout(), elapsed, remaining, and is_expired properties (PR: #14)
  • Added TaskProtocol dataclass with timeout_seconds, timeout_action, max_retries, priority, and tags fields for task-level execution control (PR: #14)
  • Added TimeoutAction enum (SKIP, RETRY, RAISE) for configurable timeout behavior (PR: #14)
  • Added TaskTimeoutError exception with elapsed, timeout, and partial_traces attributes (PR: #14)
  • Added TASK_TIMEOUT to TaskExecutionStatus enum for timeout classification (PR: #14)
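Cooperative timeouts mean long-running task code periodically calls `check_timeout()` at safe points rather than being killed from outside. A minimal sketch with the property names from the changelog (the implementation details are assumptions, not MASEval's code):

```python
# Sketch of cooperative timeout checking in the style of TaskContext.
# Property names mirror the changelog; internals are illustrative.
import time

class TaskTimeoutError(Exception):
    pass

class TaskContext:
    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self._start = time.monotonic()

    @property
    def elapsed(self) -> float:
        return time.monotonic() - self._start

    @property
    def remaining(self) -> float:
        return max(0.0, self.timeout_seconds - self.elapsed)

    @property
    def is_expired(self) -> bool:
        return self.elapsed >= self.timeout_seconds

    def check_timeout(self) -> None:
        # Call this at safe points in the task loop.
        if self.is_expired:
            raise TaskTimeoutError(f"task exceeded {self.timeout_seconds}s")

ctx = TaskContext(timeout_seconds=0.05)
ctx.check_timeout()      # fine immediately after creation
time.sleep(0.06)
print(ctx.is_expired)    # True
```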

Task Queue Abstraction

  • Added TaskQueue abstract base class with iterator interface for flexible task scheduling (PR: #14)
  • Added SequentialQueue for simple FIFO task ordering (PR: #14)
  • Added PriorityQueue for priority-based task scheduling using TaskProtocol.priority (PR: #14)
  • Added AdaptiveTaskQueue abstract base class for feedback-based adaptive scheduling with initial_state(), select_next_task(remaining, state), and update_state(task, report, state) methods (PR: #14)
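The queue abstraction boils down to an iterator over tasks whose ordering policy varies by subclass. A toy sketch of the FIFO and priority variants (simplified stand-ins; MASEval's versions operate on `Task`/`TaskProtocol` objects):

```python
# Sketch of the TaskQueue iterator idea: SequentialQueue yields tasks in
# FIFO order, PriorityQueue by a priority field. Toy classes for
# illustration only.
from dataclasses import dataclass

@dataclass
class Task:
    id: str
    priority: int = 0

class SequentialQueue:
    def __init__(self, tasks):
        self._tasks = list(tasks)

    def __iter__(self):
        return iter(self._tasks)

class PriorityQueue(SequentialQueue):
    def __iter__(self):
        # Higher priority first; Python's sort is stable for ties.
        return iter(sorted(self._tasks, key=lambda t: -t.priority))

tasks = [Task("a", 1), Task("b", 5), Task("c", 3)]
print([t.id for t in SequentialQueue(tasks)])  # ['a', 'b', 'c']
print([t.id for t in PriorityQueue(tasks)])    # ['b', 'c', 'a']
```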

ModelAdapter Chat Interface

  • Added chat() method to ModelAdapter as the primary interface for LLM inference, accepting a list of messages in OpenAI format (plus optional tools) and returning a ChatResponse object
  • Added ChatResponse dataclass containing content, tool_calls, role, usage, model, and stop_reason fields for structured response handling
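A sketch of this contract: OpenAI-format message dicts in, a structured `ChatResponse` out. The `EchoAdapter` is a dummy backend invented for illustration; the `ChatResponse` field names follow the changelog, but defaults and types are assumptions:

```python
# Sketch of the chat() contract: OpenAI-format messages in, ChatResponse
# out. EchoAdapter is a hypothetical stand-in for a real ModelAdapter.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ChatResponse:
    content: Optional[str]
    tool_calls: list = field(default_factory=list)
    role: str = "assistant"
    usage: dict = field(default_factory=dict)
    model: str = ""
    stop_reason: str = "stop"

class EchoAdapter:
    """Dummy adapter that echoes the last user message."""
    def chat(self, messages, tools=None) -> ChatResponse:
        last_user = next(m for m in reversed(messages) if m["role"] == "user")
        return ChatResponse(content=last_user["content"], model="echo-1")

resp = EchoAdapter().chat([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "ping"},
])
print(resp.content, resp.role)  # ping assistant
```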

AnthropicModelAdapter

  • New AnthropicModelAdapter for direct integration with Anthropic Claude models via the official Anthropic SDK
  • Handles Anthropic-specific message format conversion (system messages, tool_use/tool_result blocks) internally while accepting OpenAI-compatible input
  • Added anthropic optional dependency: pip install maseval[anthropic]
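The conversion the adapter performs stems from an API difference: Anthropic's Messages API takes the system prompt as a top-level parameter rather than as a message in the list. A simplified sketch of that split (the real adapter also maps tool_use/tool_result blocks, which this omits):

```python
# Sketch of the OpenAI-to-Anthropic message split: system messages become
# a top-level system string, the rest stay in the messages list.
# Simplified; tool_use/tool_result handling is omitted.
def split_system(messages):
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return "\n".join(system_parts), rest

system, msgs = split_system([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
])
print(system)                      # Be brief.
print([m["role"] for m in msgs])   # ['user']
```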

Benchmarks

  • Tau2 Benchmark: Full implementation of the tau2-bench benchmark for evaluating LLM-based agents on customer service tasks across airline, retail, and telecom domains (PR: #16)
  • Tau2Benchmark, Tau2Environment, Tau2User, Tau2Evaluator components for framework-agnostic evaluation (PR: #16)
  • DefaultAgentTau2Benchmark using an agent setup closely resembling the original tau2-bench implementation (PR: #16)
  • Data loading utilities: load_tasks(), ensure_data_exists(), configure_model_ids() (PR: #16)
  • Metrics: compute_benchmark_metrics(), compute_pass_at_k(), compute_pass_hat_k() for tau2-style scoring (PR: #16)
  • Domain implementations with tool kits: AirlineTools, RetailTools, TelecomTools with full database simulation (PR: #16)
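For context on the pass@k metric: the widely used unbiased estimator (Chen et al., 2021) computes, from n sampled attempts with c successes, the probability that at least one of k draws succeeds. Whether compute_pass_at_k uses exactly this estimator is an assumption based on common convention:

```python
# The standard unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
# Assumed here as the convention behind compute_pass_at_k; not verified
# against MASEval's source.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=4, c=2, k=1))  # 0.5
print(pass_at_k(n=4, c=2, k=2))  # 0.833...
```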

User

  • AgenticUser class for users that can use tools during conversations (PR: #16)
  • Multiple stop token support: User now accepts stop_tokens (list) instead of single stop_token, enabling different termination reasons (PR: #16)
  • Stop reason tracking: User traces now include stop_reason, max_turns, turns_used, and stopped_by_user for detailed termination analysis (PR: #16)

Simulator

  • AgenticUserLLMSimulator for LLM-based user simulation with tool use capabilities (PR: #16)

Examples

  • Tau2 benchmark example with default agent implementation and result comparison scripts (PR: #16)

Changed

Benchmark

  • Benchmark.agent_data parameter is now optional (defaults to empty dict) (PR: #16)
  • Refactored Benchmark to delegate registry operations to ComponentRegistry class (PR: #)
  • Benchmark.run() now accepts optional queue parameter (BaseTaskQueue) for custom task scheduling (PR: #14)

Task

  • Task.id is now str type instead of UUID. Benchmarks can provide human-readable IDs directly (e.g., Task(id="retail_001", ...)). Auto-generates UUID string if not provided. (PR: #16)

Fixed

  • Task reports now use task.id directly instead of metadata["task_id"] (PR: #16)

v0.2.0

05 Dec 16:29
c25fcdc


[0.2.0] - 2025-12-05

Added

Exceptions and Error Classification

  • Added AgentError, EnvironmentError, UserError exception hierarchy in maseval.core.exceptions for classifying execution failures by responsibility (PR: #13)
  • Added TaskExecutionStatus.AGENT_ERROR, ENVIRONMENT_ERROR, USER_ERROR, UNKNOWN_EXECUTION_ERROR for fine-grained error classification enabling fair scoring (PR: #13)
  • Added validation helpers: validate_argument_type(), validate_required_arguments(), validate_no_extra_arguments(), validate_arguments_from_schema() for tool implementers (PR: #13)
  • Added ToolSimulatorError and UserSimulatorError exception subclasses for simulator-specific context while inheriting proper classification (PR: #13)
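The point of these helpers is classification: a tool that rejects bad arguments with an agent-attributed error lets the benchmark score the failure fairly rather than blaming the environment. A minimal sketch of a schema-based check (the signature and error message are assumptions; MASEval's helpers may differ):

```python
# Sketch of a validate_arguments_from_schema-style check. Raising an
# agent-attributed error lets the framework classify the failure as the
# agent's fault. Signature and messages are illustrative assumptions.
class AgentError(Exception):
    """Failures attributable to the agent (e.g. bad tool arguments)."""

def validate_arguments_from_schema(args: dict, schema: dict) -> None:
    required = set(schema.get("required", []))
    allowed = set(schema.get("properties", {}))
    missing = required - args.keys()
    if missing:
        raise AgentError(f"missing required arguments: {sorted(missing)}")
    extra = args.keys() - allowed
    if extra:
        raise AgentError(f"unexpected arguments: {sorted(extra)}")

schema = {"properties": {"city": {}, "units": {}}, "required": ["city"]}
validate_arguments_from_schema({"city": "Oslo"}, schema)  # passes silently
try:
    validate_arguments_from_schema({"units": "C"}, schema)
except AgentError as e:
    print(e)  # missing required arguments: ['city']
```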

Documentation

  • Added Exception Handling guide explaining error classification, fair scoring, and rerunning failed tasks (PR: #13)

Benchmarks

  • MACS Benchmark: Multi-Agent Collaboration Scenarios benchmark (PR: #13)

Benchmark

  • Added execution_loop() method to Benchmark base class enabling iterative agent-user interaction (PR: #13)
  • Added max_invocations constructor parameter to Benchmark (default: 1 for backwards compatibility) (PR: #13)
  • Added abstract get_model_adapter(model_id, **kwargs) method to Benchmark base class as universal model factory to be used throughout the benchmarks. (PR: #13)

User

  • Added max_turns and stop_token parameters to User base class for multi-turn support with early stopping. Same applied to UserLLMSimulator. (PR: #13)
  • Added is_done(), _check_stop_token(), and increment_turn() methods to User base class (PR: #13)
  • Added get_initial_query() method to User base class for LLM-generated initial messages (PR: #13)
  • Added initial_query parameter in User base class to trigger the agentic system. (PR: #13)
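The multi-turn bookkeeping described above amounts to: stop when max_turns is reached or when a reply contains the stop token. A toy sketch using the method names from the changelog (MASEval's User base class carries more state than this):

```python
# Sketch of multi-turn user bookkeeping: is_done(), _check_stop_token(),
# and increment_turn() as described in the changelog. Toy class only.
class User:
    def __init__(self, max_turns: int = 3, stop_token: str = "<DONE>"):
        self.max_turns = max_turns
        self.stop_token = stop_token
        self.turns_used = 0
        self._stopped = False

    def _check_stop_token(self, reply: str) -> None:
        if self.stop_token in reply:
            self._stopped = True

    def increment_turn(self) -> None:
        self.turns_used += 1

    def is_done(self) -> bool:
        return self._stopped or self.turns_used >= self.max_turns

user = User(max_turns=3)
user.increment_turn()
user._check_stop_token("Thanks, that solves it <DONE>")
print(user.is_done())  # True after one turn, via the stop token
```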

Environment

  • Added Environment.get_tool(name) method for single-tool lookup (PR: #13)

Interface

  • LlamaIndex integration: LlamaIndexAgentAdapter and LlamaIndexUser for evaluating LlamaIndex workflow-based agents (PR: #7)
  • The logs property in SmolAgentAdapter and LanggraphAgentAdapter is now properly populated. (PR: #3)

Examples

  • Added a new example: The 5_a_day_benchmark (PR: #10)

Changed

Exception Handling

  • Benchmark now classifies execution errors into AGENT_ERROR (agent's fault), ENVIRONMENT_ERROR (tool/infra failure), USER_ERROR (user simulator failure), or UNKNOWN_EXECUTION_ERROR (unclassified) instead of generic TASK_EXECUTION_FAILED (PR: #13)
  • ToolLLMSimulator now raises ToolSimulatorError (classified as ENVIRONMENT_ERROR) on failure (PR: #13)
  • UserLLMSimulator now raises UserSimulatorError (classified as USER_ERROR) on failure (PR: #13)

Environment

  • Environment.create_tools() now returns Dict[str, Any] instead of list (PR: #13)

Benchmark

  • Benchmark.run_agents() signature changed: added query: str parameter (PR: #13)
  • Benchmark.run() now uses execution_loop() internally to handle agent-user interaction cycles (PR: #13)
  • Benchmark now has a fail_on_setup_error flag that, when set, raises errors observed during task setup (PR: #10)

Callback

  • FileResultLogger now accepts a pathlib.Path for the output_dir argument and has an overwrite argument to prevent clobbering existing log files.

Evaluator

  • The Evaluator class now has a filter_traces base method to conveniently adapt the same evaluator to different entities in the traces (PR: #10).

Simulator

  • The LLMSimulator now raises an exception when JSON cannot be decoded, instead of returning the error message as text to the agent (PR: #13).

Other

  • Documentation formatting improved. Added dark mode and links to GitHub (PR: #11).
  • Improved Quick Start Guide in docs/getting-started/quickstart.md. (PR: #10)
  • maseval.interface.agents structure changed. Tools requiring framework imports (beyond just typing) now in <framework>_optional.py and imported dynamically from <framework>.py. (PR: #12)
  • Various formatting improvements in the documentation (PR: #12)
  • Added documentation for View Source Code pattern in CONTRIBUTING.md and _optional.py pattern in interface README (PR: #12)

Fixed

Interface

  • LlamaIndexAgentAdapter now supports multiple LlamaIndex agent types including ReActAgent (workflow-based), FunctionAgent, and legacy agents by checking for .chat(), .query(), and .run() methods in priority order (PR: #10)

Other

  • Consistent naming of agent adapter over wrapper (PR: #3)
  • Fixed an issue where the LiteLLM interface and Mixins were not shown properly in the documentation (PR: #12)

Removed

  • Removed set_message_history, append_message_history and clear_message_history for AgentAdapter and subclasses. (PR: #3)

v0.1.2

18 Nov 18:03
982fca7


Full Changelog: v0.1.1...v0.1.2

Initial Release

18 Nov 15:47
a6294ff


This is the initial code release. The library is under active development; the API may change at any time.

v0.1.0-alpha

17 Nov 17:25
2728779


fixed email