feat: add GAIA benchmark for multi-step reasoning #239

Open
zeroasterisk wants to merge 1 commit into Exgentic:main from zeroasterisk:feat/gaia-benchmark

@zeroasterisk

Contributor

Summary

  • Adds GAIA (General AI Assistants) benchmark adapter following the existing Exgentic benchmark pattern
  • Loads tasks from HuggingFace datasets (gaia-benchmark/GAIA, validation split)
  • Supports 4 subset configs: 2023_all, 2023_level1, 2023_level2, 2023_level3
  • Scores by exact match with standard normalization (lowercase, strip articles, remove punctuation, collapse whitespace)
  • Reports per-level accuracy metrics in aggregation results
  • Registers the benchmark in interfaces/registry.py with slug gaia
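
For reviewers, the scoring rule above can be sketched roughly as follows. This is a minimal illustration, not the code in gaia_benchmark.py: the names `normalize_answer` and `exact_match` are hypothetical, and it assumes the four normalization steps are applied in the order listed.

```python
import re
import string

# Hypothetical helpers illustrating the described normalization;
# the actual names and step order in the PR may differ.
_ARTICLES = re.compile(r"\b(a|an|the)\b")
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip articles, remove punctuation, collapse whitespace."""
    text = text.lower()
    text = _ARTICLES.sub(" ", text)       # drop English articles
    text = text.translate(_PUNCT_TABLE)   # drop ASCII punctuation
    return " ".join(text.split())         # collapse runs of whitespace


def exact_match(prediction: str, reference: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
```

One consequence of removing punctuation this way is that numeric answers lose their separators (for example "3.14" normalizes to "314"), so both the prediction and the reference must go through the same function.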

Structure

src/exgentic/benchmarks/gaia/
  __init__.py
  gaia_benchmark.py    # GAIASession, GAIAEvaluator, GAIABenchmark
  requirements.txt     # datasets>=2.0.0

Pattern mapping

| Exgentic concept | GAIA implementation |
| Benchmark config | GAIABenchmark: subset selection, runner config |
| Session | GAIASession: loads one question, presents to agent, collects answer |
| Evaluator | GAIAEvaluator: lists task IDs, aggregates per-level accuracy |
| Finish action | submit with answer: str field |
| Scoring | Exact match after normalization (case-insensitive, whitespace/punctuation/article stripped) |
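
As an illustration of the per-level aggregation attributed to the evaluator, here is a small sketch. The `TaskResult` record, the `aggregate_by_level` function, and the metric key names are all assumptions made for this example; the real GAIAEvaluator may store and name things differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task record: GAIA level (1-3) and exact-match outcome."""
    task_id: str
    level: int
    correct: bool


def aggregate_by_level(results: Iterable[TaskResult]) -> Dict[str, float]:
    """Return accuracy per difficulty level plus overall accuracy."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for r in results:
        totals[r.level] += 1
        hits[r.level] += int(r.correct)

    # Metric key names are illustrative, not the ones used by the PR.
    metrics = {
        f"level{lvl}_accuracy": hits[lvl] / totals[lvl] for lvl in sorted(totals)
    }
    n = sum(totals.values())
    metrics["overall_accuracy"] = sum(hits.values()) / n if n else 0.0
    return metrics


results = [
    TaskResult("t1", 1, True),
    TaskResult("t2", 1, False),
    TaskResult("t3", 2, True),
]
print(aggregate_by_level(results))
```

Levels with no tasks in the selected subset are simply absent from the result, which avoids reporting a misleading 0.0 for, say, level 3 when running 2023_level1.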

Notes

  • The GAIA dataset is gated on HuggingFace — users need an accepted access request and HF_TOKEN set
  • Works with any Exgentic agent (tool_calling, claude_code, openai_solo, etc.)
  • No additional tools are provided by default; agents use their own capabilities to reason through multi-step questions
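
A rough sketch of what loading the gated dataset could look like, for anyone trying this locally. The helper names `check_gaia_access` and `load_gaia_validation` are hypothetical. It assumes the token in HF_TOKEN is picked up from the environment by the Hugging Face libraries, which may depend on the installed versions, and some `datasets` versions may need extra arguments for this dataset.

```python
import os
from typing import Mapping, Optional

# Subset configs listed in this PR.
GAIA_SUBSETS = ("2023_all", "2023_level1", "2023_level2", "2023_level3")


def check_gaia_access(subset: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Fail early with a clear message before touching the network."""
    env = os.environ if env is None else env
    if subset not in GAIA_SUBSETS:
        raise ValueError(f"unknown GAIA subset {subset!r}; expected one of {GAIA_SUBSETS}")
    if not env.get("HF_TOKEN"):
        raise RuntimeError(
            "HF_TOKEN is not set; GAIA is gated and needs an accepted access request"
        )
    return subset


def load_gaia_validation(subset: str = "2023_all"):
    """Load the validation split of one GAIA subset (requires network access)."""
    check_gaia_access(subset)
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the token is read from the environment by the HF libraries.
    return load_dataset("gaia-benchmark/GAIA", subset, split="validation")
```

Checking the subset name and token up front turns a late authentication failure from the Hub into an immediate, readable error.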

Test plan

  • Registry loads gaia benchmark correctly (validated via load_benchmark)
  • All existing benchmarks still load after registry changes
  • Answer normalization and exact-match scoring verified with unit assertions
  • Evaluator returns correct task counts for all 4 subsets
  • End-to-end run with a real agent (requires HF_TOKEN with GAIA access)

GAIA (General AI Assistants) benchmark for multi-step reasoning.
Loads tasks from HuggingFace datasets (gaia-benchmark/GAIA).
3 difficulty levels, exact-match scoring with answer normalization.
Reports per-level accuracy metrics in aggregation.