Releases: parameterlab/MASEval

v0.4.0

28 Mar 07:06
45098b0


[0.4.0] - 2026-03-28

Fixed

Core

  • Fixed MessageHistory.to_list() returning a reference to the internal list instead of a copy, causing simulator logs to contain future conversation messages that hadn't occurred at the time of logging. (PR: #48)
  • Fixed get_git_info() crashing on detached HEAD (e.g. in CI checkout), now returns detached@<short-hash> as the branch name. (PR: #41)
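The `to_list()` fix above is the classic reference-vs-copy pitfall. A minimal sketch (the `MessageHistory` class here is a hypothetical stand-in, not MASEval's implementation) shows why returning a defensive copy matters for logging:

```python
# Illustration of the reference-vs-copy pitfall fixed in
# MessageHistory.to_list(). Toy class, not MASEval's implementation.

class MessageHistory:
    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)

    def to_list_buggy(self):
        # Returns the internal list itself: callers holding this
        # reference see messages appended *after* the call.
        return self._messages

    def to_list(self):
        # Returns a snapshot, so later appends do not leak into
        # previously captured logs.
        return list(self._messages)


history = MessageHistory()
history.append({"role": "user", "content": "hi"})

leaked = history.to_list_buggy()
snapshot = history.to_list()

history.append({"role": "assistant", "content": "hello"})

print(len(leaked))    # 2 - the "log" grew retroactively
print(len(snapshot))  # 1 - the snapshot is stable
```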

Interface

  • Agent adapter gather_config() in smolagents, langgraph, and llamaindex no longer silently swallows exceptions, ensuring config collection errors are visible instead of producing incomplete configuration data. (PR: #53)

Added

Core

  • Usage and cost tracking via Usage and TokenUsage data classes. ModelAdapter tracks token usage automatically after each chat() call. Components that implement UsageTrackableMixin are collected via gather_usage(). Live totals available during benchmark runs via benchmark.usage (grand total) and benchmark.usage_by_component (per-component breakdowns). Post-hoc analysis via UsageReporter.from_reports(benchmark.reports) with breakdowns by task, component, or model. (PR: #45)
  • Pluggable cost calculation via CostCalculator protocol. StaticPricingCalculator computes cost from user-supplied per-token rates. LiteLLMCostCalculator in maseval.interface.usage for automatic pricing via LiteLLM's model database (supports custom_pricing overrides and model_id_map; requires litellm). Pass a cost_calculator to ModelAdapter or AgentAdapter to compute Usage.cost. Provider-reported cost always takes precedence. (PR: #45)
  • AgentAdapter now accepts cost_calculator and model_id parameters. For smolagents, CAMEL, and LlamaIndex, both are auto-detected from the framework's agent object (LiteLLMCostCalculator if litellm is installed). LangGraph requires explicit model_id since graphs can contain multiple models. Explicit parameters always override auto-detection. (PR: #45)
  • Task.freeze() and Task.unfreeze() methods to make task data read-only during benchmark runs, preventing accidental mutation of environment_data, user_data, evaluation_data, and metadata (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with Task.is_frozen. (PR: #42)
  • TaskFrozenError exception in maseval.core.exceptions, raised when attempting to modify a frozen task. (PR: #42)
  • Added InformativeSubsetQueue and DISCOQueue to maseval.core.task for subset-based evaluation (e.g., anchor-point selection for DISCO). DISCOQueue accepts anchor_points_path to load indices from a .json/.pkl file via DISCOQueue.load_anchor_points(). Available via from maseval import DISCOQueue, InformativeSubsetQueue. (PR: #34 and #41)
  • Added ModelScorer abstract base class in maseval.core.scorer for log-likelihood scoring, with loglikelihood(), loglikelihood_batch(), and loglikelihood_choices() methods. (PR: #34 and #41)
  • Added SeedGenerator abstract base class and DefaultSeedGenerator implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
  • Added seed and seed_generator parameters to Benchmark.__init__ for enabling reproducibility (PR: #24)
  • Added seed_generator parameter to all benchmark setup methods (setup_environment, setup_user, setup_agents, setup_evaluators) (PR: #24)
  • Added seed parameter to ModelAdapter.__init__ for deterministic model inference (PR: #24)
  • Added SeedingError exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
  • Added UserExhaustedError exception in maseval.core.exceptions for flow control when a user's turns are exhausted (PR: #39)
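To make the seeding entries concrete, here is a sketch of SHA-256-based seed derivation in the spirit of `DefaultSeedGenerator`. The exact derivation scheme (hashing a root seed together with a scope string) is an assumption for illustration, not MASEval's algorithm:

```python
# Sketch of deterministic seed derivation via SHA-256. The scheme shown
# (root seed + scope strings hashed together) is an assumption for
# illustration; DefaultSeedGenerator's exact recipe may differ.
import hashlib

def derive_seed(root_seed: int, *scope: str) -> int:
    """Derive a child seed from a root seed and a scope
    (e.g. component name, task id) deterministically."""
    material = ":".join([str(root_seed), *scope]).encode("utf-8")
    digest = hashlib.sha256(material).digest()
    # Fold the digest into a 32-bit seed usable by most RNG APIs.
    return int.from_bytes(digest[:4], "big")

# Same inputs always give the same seed; different scopes diverge.
a = derive_seed(42, "setup_agents", "task_001")
b = derive_seed(42, "setup_agents", "task_001")
c = derive_seed(42, "setup_user", "task_001")
print(a == b, a == c)  # True False
```

Deriving per-component seeds from one root seed is what lets a single `seed` passed to `Benchmark.__init__` reproduce an entire run without every component sharing the same RNG stream.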

Interface

  • Added seed support to interface adapters: OpenAIModelAdapter, GoogleGenAIModelAdapter, LiteLLMModelAdapter, HuggingFacePipelineModelAdapter pass seeds to underlying APIs (PR: #24)
  • Added HuggingFaceModelScorer in maseval.interface.inference — log-likelihood scorer backed by a HuggingFace AutoModelForCausalLM, with single-token optimisation for MCQ evaluation. Implements the ModelScorer interface. (PR: #34 and #41)
  • CAMEL-AI integration: CamelAgentAdapter and CamelLLMUser for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
    • Added CamelAgentUser for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
    • Added camel_role_playing_execution_loop() for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
    • Added CamelRolePlayingTracer and CamelWorkforceTracer for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
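The log-likelihood scorer above is used for multiple-choice evaluation: score each candidate answer by its summed token log-likelihood and pick the argmax. A toy sketch of that selection step (the scorer here is a stand-in; `HuggingFaceModelScorer` gets these numbers from a real `AutoModelForCausalLM`):

```python
# Conceptual sketch of loglikelihood_choices-style MCQ scoring: sum the
# per-token log-probs of each candidate continuation and take the argmax.
import math

def pick_choice(scores_per_choice):
    """scores_per_choice maps choice label -> list of token log-probs."""
    totals = {label: sum(lps) for label, lps in scores_per_choice.items()}
    return max(totals, key=totals.get)

scores = {
    "A": [math.log(0.2), math.log(0.5)],
    "B": [math.log(0.6), math.log(0.7)],
    "C": [math.log(0.1), math.log(0.3)],
}
print(pick_choice(scores))  # B
```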

Benchmarks

  • MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. MMLUBenchmark is a framework-agnostic base class (setup_agents() and get_model_adapter() must be implemented by subclasses); DefaultMMLUBenchmark provides a ready-made HuggingFace implementation. Also includes MMLUEnvironment, MMLUEvaluator, load_tasks(), and compute_benchmark_metrics(). Install with pip install maseval[mmlu]. Optional extras: lm-eval (for DefaultMMLUBenchmark.precompute_all_logprobs_lmeval), disco (for DISCO prediction in the example). (PR: #34 and #41)
  • CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including ConverseBenchmark, DefaultAgentConverseBenchmark, ConverseEnvironment, ConverseExternalAgent, PrivacyEvaluator, SecurityEvaluator, and load_tasks() utilities for travel, real_estate, and insurance domains. Benchmark source files are now downloaded on first use via ensure_data_exists() instead of being bundled in the package. (PR: #28)
  • GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
    • Gaia2Benchmark, Gaia2Environment, Gaia2Evaluator components for framework-agnostic evaluation with ARE simulation (PR: #26)
    • DefaultAgentGaia2Benchmark with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
    • Generic tool wrapper (Gaia2GenericTool) for MASEval tracing of ARE tools with simulation time tracking (PR: #26, #30)
    • Data loading utilities: load_tasks(), configure_model_ids() for loading scenarios from HuggingFace (PR: #26)
    • Gaia2JudgeEngineConfig for configuring the judge's LLM model and provider (PR: #30)
    • Metrics: compute_gaia2_metrics() for GSR (Goal Success Rate) computation by capability type (PR: #26)
    • Support for 5 capability dimensions: execution, search, adaptability, time, ambiguity (PR: #26, #30)
    • Added gaia2 optional dependency: pip install maseval[gaia2] (PR: #26)
  • MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across all 6 paper-defined domains: research, bargaining, coding, database, werewolf, and minecraft (PR: #25, #30)
    • MultiAgentBenchBenchmark abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
    • MarbleMultiAgentBenchBenchmark for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
    • MultiAgentBenchEnvironment and MultiAgentBenchEvaluator components (PR: #25)
    • Data loading utilities: load_tasks(), configure_model_ids(), get_domain_info(), ensure_marble_exists() (PR: #25)
    • MARBLE adapter: MarbleAgentAdapter for wrapping MARBLE agents with MASEval tracing (PR: #25)

Examples

  • Added usage tracking to the 5-A-Day benchmark: five_a_day_benchmark.ipynb (section 2.7) and five_a_day_benchmark.py (post-run usage summary with per-component and per-task breakdowns). (PR: #45)
  • MMLU benchmark example at examples/mmlu_benchmark/ for evaluating HuggingFace models on MMLU with optional DISCO prediction (--disco_model_path, --disco_transform_path). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34 and #41)
  • Added a dedicated runnable CONVERSE default benchmark example at examples/converse_benchmark/default_converse_benchmark.py for quick start with DefaultAgentConverseBenchmark. (PR: #28)
  • Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)

Documentation

  • Usage & Cost Tracking guide (docs/guides/usage-tracking.md) and API reference (docs/reference/usage.md). (PR: #45)

Testing

  • Composable pytest markers (live, credentialed, slow, smoke) for fine-grained test selection; default runs exclude slow, credentialed, and smoke tests (PR: #29)
  • Marker implication hook: credentialed implies live, so -m "not live" always gives a fully offline run (PR: #29)
  • Skip decorators (requires_openai, requires_anthropic, requires_google) for tests needing API keys (PR: #29)
  • Data integrity tests for Tau2, MACS, GAIA2, and MultiAgentBench benchmarks validating download pipelines, file structures, and data content (PR: #29, #30)
  • Real-data integration tests for GAIA2 and MultiAgentBench (PR: #30)
  • HTTP-level API contract tests for model adapters (OpenAI, Anthropic, Google GenAI, LiteLLM) using respx mocks — no API keys needed (PR: #29)
  • Live API round-trip tests for all model adapters (-m credentialed) (PR: #29)
  • CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
  • Added respx dev dependency for HTTP-level mocking (PR: #29)
  • pytest marker mmlu for tests that require the MMLU benchmark (HuggingFace + DISCO). (PR: #34 and #41)

Changed

Core

  • Simplified seeding API: seed_generator parameter in setup methods is now always non-None (`SeedGener...

v0.3.0

18 Jan 21:55
2f70bcc


[0.3.0] - 2026-01-18

Added

Parallel Execution

  • Added parallel task execution with num_workers parameter in Benchmark.run() using ThreadPoolExecutor (PR: #14)
  • Added ComponentRegistry class for thread-safe component registration with thread-local storage (PR: #14)
  • Added TaskContext for cooperative timeout checking with check_timeout(), elapsed, remaining, and is_expired properties (PR: #14)
  • Added TaskProtocol dataclass with timeout_seconds, timeout_action, max_retries, priority, and tags fields for task-level execution control (PR: #14)
  • Added TimeoutAction enum (SKIP, RETRY, RAISE) for configurable timeout behavior (PR: #14)
  • Added TaskTimeoutError exception with elapsed, timeout, and partial_traces attributes (PR: #14)
  • Added TASK_TIMEOUT to TaskExecutionStatus enum for timeout classification (PR: #14)
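Cooperative timeouts mean long-running task code periodically calls `check_timeout()` at safe points rather than being killed from outside. A minimal sketch with the property names from the changelog (the implementation details are assumptions, not MASEval's code):

```python
# Sketch of cooperative timeout checking in the style of TaskContext.
# Property names mirror the changelog; internals are illustrative.
import time

class TaskTimeoutError(Exception):
    pass

class TaskContext:
    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self._start = time.monotonic()

    @property
    def elapsed(self) -> float:
        return time.monotonic() - self._start

    @property
    def remaining(self) -> float:
        return max(0.0, self.timeout_seconds - self.elapsed)

    @property
    def is_expired(self) -> bool:
        return self.elapsed >= self.timeout_seconds

    def check_timeout(self) -> None:
        # Call this at safe points in the task loop.
        if self.is_expired:
            raise TaskTimeoutError(f"task exceeded {self.timeout_seconds}s")

ctx = TaskContext(timeout_seconds=0.05)
ctx.check_timeout()      # fine immediately after creation
time.sleep(0.06)
print(ctx.is_expired)    # True
```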

Task Queue Abstraction

  • Added TaskQueue abstract base class with iterator interface for flexible task scheduling (PR: #14)
  • Added SequentialQueue for simple FIFO task ordering (PR: #14)
  • Added PriorityQueue for priority-based task scheduling using TaskProtocol.priority (PR: #14)
  • Added AdaptiveTaskQueue abstract base class for feedback-based adaptive scheduling with initial_state(), select_next_task(remaining, state), and update_state(task, report, state) methods (PR: #14)
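The queue abstraction boils down to an iterator over tasks whose ordering policy varies by subclass. A toy sketch of the FIFO and priority variants (simplified stand-ins; MASEval's versions operate on `Task`/`TaskProtocol` objects):

```python
# Sketch of the TaskQueue iterator idea: SequentialQueue yields tasks in
# FIFO order, PriorityQueue by a priority field. Toy classes for
# illustration only.
from dataclasses import dataclass

@dataclass
class Task:
    id: str
    priority: int = 0

class SequentialQueue:
    def __init__(self, tasks):
        self._tasks = list(tasks)

    def __iter__(self):
        return iter(self._tasks)

class PriorityQueue(SequentialQueue):
    def __iter__(self):
        # Higher priority first; Python's sort is stable for ties.
        return iter(sorted(self._tasks, key=lambda t: -t.priority))

tasks = [Task("a", 1), Task("b", 5), Task("c", 3)]
print([t.id for t in SequentialQueue(tasks)])  # ['a', 'b', 'c']
print([t.id for t in PriorityQueue(tasks)])    # ['b', 'c', 'a']
```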

ModelAdapter Chat Interface

  • Added chat() method to ModelAdapter as the primary interface for LLM inference, accepting a list of messages in OpenAI format (plus optional tools) and returning a ChatResponse object
  • Added ChatResponse dataclass containing content, tool_calls, role, usage, model, and stop_reason fields for structured response handling
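A sketch of this contract: OpenAI-format message dicts in, a structured `ChatResponse` out. The `EchoAdapter` is a dummy backend invented for illustration; the `ChatResponse` field names follow the changelog, but defaults and types are assumptions:

```python
# Sketch of the chat() contract: OpenAI-format messages in, ChatResponse
# out. EchoAdapter is a hypothetical stand-in for a real ModelAdapter.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ChatResponse:
    content: Optional[str]
    tool_calls: list = field(default_factory=list)
    role: str = "assistant"
    usage: dict = field(default_factory=dict)
    model: str = ""
    stop_reason: str = "stop"

class EchoAdapter:
    """Dummy adapter that echoes the last user message."""
    def chat(self, messages, tools=None) -> ChatResponse:
        last_user = next(m for m in reversed(messages) if m["role"] == "user")
        return ChatResponse(content=last_user["content"], model="echo-1")

resp = EchoAdapter().chat([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "ping"},
])
print(resp.content, resp.role)  # ping assistant
```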

AnthropicModelAdapter

  • New AnthropicModelAdapter for direct integration with Anthropic Claude models via the official Anthropic SDK
  • Handles Anthropic-specific message format conversion (system messages, tool_use/tool_result blocks) internally while accepting OpenAI-compatible input
  • Added anthropic optional dependency: pip install maseval[anthropic]
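The conversion the adapter performs stems from an API difference: Anthropic's Messages API takes the system prompt as a top-level parameter rather than as a message in the list. A simplified sketch of that split (the real adapter also maps tool_use/tool_result blocks, which this omits):

```python
# Sketch of the OpenAI-to-Anthropic message split: system messages become
# a top-level system string, the rest stay in the messages list.
# Simplified; tool_use/tool_result handling is omitted.
def split_system(messages):
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return "\n".join(system_parts), rest

system, msgs = split_system([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
])
print(system)                      # Be brief.
print([m["role"] for m in msgs])   # ['user']
```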

Benchmarks

  • Tau2 Benchmark: Full implementation of the tau2-bench benchmark for evaluating LLM-based agents on customer service tasks across airline, retail, and telecom domains (PR: #16)
  • Tau2Benchmark, Tau2Environment, Tau2User, Tau2Evaluator components for framework-agnostic evaluation (PR: #16)
  • DefaultAgentTau2Benchmark using an agent setup closely resembling the original tau2-bench implementation (PR: #16)
  • Data loading utilities: load_tasks(), ensure_data_exists(), configure_model_ids() (PR: #16)
  • Metrics: compute_benchmark_metrics(), compute_pass_at_k(), compute_pass_hat_k() for tau2-style scoring (PR: #16)
  • Domain implementations with tool kits: AirlineTools, RetailTools, TelecomTools with full database simulation (PR: #16)
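For context on the pass@k metric: the widely used unbiased estimator (Chen et al., 2021) computes, from n sampled attempts with c successes, the probability that at least one of k draws succeeds. Whether compute_pass_at_k uses exactly this estimator is an assumption based on common convention:

```python
# The standard unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
# Assumed here as the convention behind compute_pass_at_k; not verified
# against MASEval's source.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=4, c=2, k=1))  # 0.5
print(pass_at_k(n=4, c=2, k=2))  # 0.833...
```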

User

  • AgenticUser class for users that can use tools during conversations (PR: #16)
  • Multiple stop token support: User now accepts stop_tokens (list) instead of single stop_token, enabling different termination reasons (PR: #16)
  • Stop reason tracking: User traces now include stop_reason, max_turns, turns_used, and stopped_by_user for detailed termination analysis (PR: #16)

Simulator

  • AgenticUserLLMSimulator for LLM-based user simulation with tool use capabilities (PR: #16)

Examples

  • Tau2 benchmark example with default agent implementation and result comparison scripts (PR: #16)

Changed

Benchmark

  • Benchmark.agent_data parameter is now optional (defaults to empty dict) (PR: #16)
  • Refactored Benchmark to delegate registry operations to ComponentRegistry class (PR: #)
  • Benchmark.run() now accepts optional queue parameter (BaseTaskQueue) for custom task scheduling (PR: #14)

Task

  • Task.id is now str type instead of UUID. Benchmarks can provide human-readable IDs directly (e.g., Task(id="retail_001", ...)). Auto-generates UUID string if not provided. (PR: #16)

Fixed

  • Task reports now use task.id directly instead of metadata["task_id"] (PR: #16)

v0.2.0

05 Dec 16:29
c25fcdc


[0.2.0] - 2025-12-05

Added

Exceptions and Error Classification

  • Added AgentError, EnvironmentError, UserError exception hierarchy in maseval.core.exceptions for classifying execution failures by responsibility (PR: #13)
  • Added TaskExecutionStatus.AGENT_ERROR, ENVIRONMENT_ERROR, USER_ERROR, UNKNOWN_EXECUTION_ERROR for fine-grained error classification enabling fair scoring (PR: #13)
  • Added validation helpers: validate_argument_type(), validate_required_arguments(), validate_no_extra_arguments(), validate_arguments_from_schema() for tool implementers (PR: #13)
  • Added ToolSimulatorError and UserSimulatorError exception subclasses for simulator-specific context while inheriting proper classification (PR: #13)
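The point of these helpers is classification: a tool that rejects bad arguments with an agent-attributed error lets the benchmark score the failure fairly rather than blaming the environment. A minimal sketch of a schema-based check (the signature and error message are assumptions; MASEval's helpers may differ):

```python
# Sketch of a validate_arguments_from_schema-style check. Raising an
# agent-attributed error lets the framework classify the failure as the
# agent's fault. Signature and messages are illustrative assumptions.
class AgentError(Exception):
    """Failures attributable to the agent (e.g. bad tool arguments)."""

def validate_arguments_from_schema(args: dict, schema: dict) -> None:
    required = set(schema.get("required", []))
    allowed = set(schema.get("properties", {}))
    missing = required - args.keys()
    if missing:
        raise AgentError(f"missing required arguments: {sorted(missing)}")
    extra = args.keys() - allowed
    if extra:
        raise AgentError(f"unexpected arguments: {sorted(extra)}")

schema = {"properties": {"city": {}, "units": {}}, "required": ["city"]}
validate_arguments_from_schema({"city": "Oslo"}, schema)  # passes silently
try:
    validate_arguments_from_schema({"units": "C"}, schema)
except AgentError as e:
    print(e)  # missing required arguments: ['city']
```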

Documentation

  • Added Exception Handling guide explaining error classification, fair scoring, and rerunning failed tasks (PR: #13)

Benchmarks

  • MACS Benchmark: Multi-Agent Collaboration Scenarios benchmark (PR: #13)

Benchmark

  • Added execution_loop() method to Benchmark base class enabling iterative agent-user interaction (PR: #13)
  • Added max_invocations constructor parameter to Benchmark (default: 1 for backwards compatibility) (PR: #13)
  • Added abstract get_model_adapter(model_id, **kwargs) method to Benchmark base class as universal model factory to be used throughout the benchmarks. (PR: #13)

User

  • Added max_turns and stop_token parameters to User base class for multi-turn support with early stopping. Same applied to UserLLMSimulator. (PR: #13)
  • Added is_done(), _check_stop_token(), and increment_turn() methods to User base class (PR: #13)
  • Added get_initial_query() method to User base class for LLM-generated initial messages (PR: #13)
  • Added initial_query parameter in User base class to trigger the agentic system. (PR: #13)
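The multi-turn bookkeeping described above amounts to: stop when max_turns is reached or when a reply contains the stop token. A toy sketch using the method names from the changelog (MASEval's User base class carries more state than this):

```python
# Sketch of multi-turn user bookkeeping: is_done(), _check_stop_token(),
# and increment_turn() as described in the changelog. Toy class only.
class User:
    def __init__(self, max_turns: int = 3, stop_token: str = "<DONE>"):
        self.max_turns = max_turns
        self.stop_token = stop_token
        self.turns_used = 0
        self._stopped = False

    def _check_stop_token(self, reply: str) -> None:
        if self.stop_token in reply:
            self._stopped = True

    def increment_turn(self) -> None:
        self.turns_used += 1

    def is_done(self) -> bool:
        return self._stopped or self.turns_used >= self.max_turns

user = User(max_turns=3)
user.increment_turn()
user._check_stop_token("Thanks, that solves it <DONE>")
print(user.is_done())  # True after one turn, via the stop token
```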

Environment

  • Added Environment.get_tool(name) method for single-tool lookup (PR: #13)

Interface

  • LlamaIndex integration: LlamaIndexAgentAdapter and LlamaIndexUser for evaluating LlamaIndex workflow-based agents (PR: #7)
  • The logs property in SmolAgentAdapter and LanggraphAgentAdapter is now properly populated. (PR: #3)

Examples

  • Added a new example: The 5_a_day_benchmark (PR: #10)

Changed

Exception Handling

  • Benchmark now classifies execution errors into AGENT_ERROR (agent's fault), ENVIRONMENT_ERROR (tool/infra failure), USER_ERROR (user simulator failure), or UNKNOWN_EXECUTION_ERROR (unclassified) instead of generic TASK_EXECUTION_FAILED (PR: #13)
  • ToolLLMSimulator now raises ToolSimulatorError (classified as ENVIRONMENT_ERROR) on failure (PR: #13)
  • UserLLMSimulator now raises UserSimulatorError (classified as USER_ERROR) on failure (PR: #13)

Environment

  • Environment.create_tools() now returns Dict[str, Any] instead of list (PR: #13)

Benchmark

  • Benchmark.run_agents() signature changed: added query: str parameter (PR: #13)
  • Benchmark.run() now uses execution_loop() internally to handle agent-user interaction cycles (PR: #13)
  • Benchmark now has a fail_on_setup_error flag that, when set, raises errors observed during task setup (PR: #10)

Callback

  • FileResultLogger now accepts a pathlib.Path for the output_dir argument and has an overwrite argument to prevent clobbering existing log files.

Evaluator

  • The Evaluator class now has a filter_traces base method to conveniently adapt the same evaluator to different entities in the traces (PR: #10).

Simulator

  • The LLMSimulator now raises an exception when JSON cannot be decoded, instead of returning the error message as text to the agent (PR: #13).

Other

  • Documentation formatting improved. Added dark mode and links to GitHub (PR: #11).
  • Improved Quick Start Guide in docs/getting-started/quickstart.md. (PR: #10)
  • maseval.interface.agents structure changed. Tools requiring framework imports (beyond just typing) now in <framework>_optional.py and imported dynamically from <framework>.py. (PR: #12)
  • Various formatting improvements in the documentation (PR: #12)
  • Added documentation for View Source Code pattern in CONTRIBUTING.md and _optional.py pattern in interface README (PR: #12)

Fixed

Interface

  • LlamaIndexAgentAdapter now supports multiple LlamaIndex agent types including ReActAgent (workflow-based), FunctionAgent, and legacy agents by checking for .chat(), .query(), and .run() methods in priority order (PR: #10)

Other

  • Consistent naming of agent adapter over wrapper (PR: #3)
  • Fixed an issue where the LiteLLM interface and Mixins were not shown properly in the documentation (PR: #12)

Removed

  • Removed set_message_history, append_message_history and clear_message_history for AgentAdapter and subclasses. (PR: #3)

v0.1.2

18 Nov 18:03
982fca7


Full Changelog: v0.1.1...v0.1.2

Initial Release

18 Nov 15:47
a6294ff


This is the initial code release. The library is under active development; the API may change at any time.

v0.1.0-alpha

17 Nov 17:25
2728779


fixed email