feat: add GAIA benchmark for multi-step reasoning by zeroasterisk · Pull Request #239 · Exgentic/exgentic

zeroasterisk · 2026-06-21T16:44:58Z

Summary

- Adds a GAIA (General AI Assistants) benchmark adapter following the existing Exgentic benchmark pattern
- Loads tasks from HuggingFace datasets (`gaia-benchmark/GAIA`, validation split)
- Supports 4 subset configs: `2023_all`, `2023_level1`, `2023_level2`, `2023_level3`
- Scores by exact match with standard normalization (lowercase, strip articles, remove punctuation, collapse whitespace)
- Reports per-level accuracy metrics in aggregation results
- Registers the benchmark in `interfaces/registry.py` with slug `gaia`
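
The normalization steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code from `gaia_benchmark.py`: the function names `normalize_answer` and `exact_match`, the article list (`a`, `an`, `the`), and the use of `string.punctuation` are assumptions.

```python
import re
import string

# Assumption: "articles" means the English articles a/an/the.
_ARTICLES = re.compile(r"\b(a|an|the)\b")
# Assumption: "punctuation" means the ASCII set in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip articles, remove punctuation, collapse whitespace."""
    text = text.lower()
    text = _ARTICLES.sub(" ", text)
    text = text.translate(_PUNCT_TABLE)
    # split() with no arguments drops leading/trailing and repeated whitespace.
    return " ".join(text.split())


def exact_match(predicted: str, expected: str) -> bool:
    """Compare two answers after applying the same normalization to both."""
    return normalize_answer(predicted) == normalize_answer(expected)


print(exact_match("The Eiffel Tower.", "eiffel tower"))  # True
print(exact_match("42", "43"))                           # False
```

One consequence of removing punctuation this way is that `3.14` and `314` normalize to the same string, so numeric answers may need separate handling.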

Structure

src/exgentic/benchmarks/gaia/
  __init__.py
  gaia_benchmark.py    # GAIASession, GAIAEvaluator, GAIABenchmark
  requirements.txt     # datasets>=2.0.0

Pattern mapping

| Exgentic concept | GAIA implementation |
| --- | --- |
| `Benchmark` config | `GAIABenchmark`: subset selection, runner config |
| `Session` | `GAIASession`: loads one question, presents it to the agent, collects the answer |
| `Evaluator` | `GAIAEvaluator`: lists task IDs, aggregates per-level accuracy |
| Finish action | `submit` with `answer: str` field |
| Scoring | Exact match after normalization (lowercase, articles stripped, punctuation removed, whitespace collapsed) |
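
As a rough sketch of the per-level aggregation the evaluator is described as doing, the function below groups per-task results by difficulty level. The input shape (pairs of level and correctness), the function name `per_level_accuracy`, and the metric key names are hypothetical and not taken from `GAIAEvaluator`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_level_accuracy(results: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """Aggregate (level, is_correct) pairs into per-level and overall accuracy."""
    totals: Dict[int, int] = defaultdict(int)
    correct: Dict[int, int] = defaultdict(int)
    for level, is_correct in results:
        totals[level] += 1
        correct[level] += int(is_correct)

    # Hypothetical metric names; only levels that actually occur are reported.
    metrics = {
        f"level{level}_accuracy": correct[level] / totals[level]
        for level in sorted(totals)
    }
    n_tasks = sum(totals.values())
    metrics["accuracy"] = sum(correct.values()) / n_tasks if n_tasks else 0.0
    return metrics


print(per_level_accuracy([(1, True), (1, False), (2, True)]))
```

Reporting only the levels that occur means a single-level subset such as `2023_level1` yields one per-level metric plus the overall accuracy.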

Notes

- The GAIA dataset is gated on HuggingFace: users need an approved access request and `HF_TOKEN` set in the environment
- Works with any Exgentic agent (`tool_calling`, `claude_code`, `openai_solo`, etc.)
- No additional tools are provided by default; agents use their own capabilities to reason through multi-step questions
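
For the gated-dataset note, loading might look roughly like the sketch below. The helper names `gaia_load_kwargs` and `load_gaia` are hypothetical, and passing `token=` to `load_dataset` assumes a recent `datasets` release (older releases used `use_auth_token` instead); the adapter itself may load the data differently.

```python
import os
from typing import Mapping, Optional

GAIA_SUBSETS = ("2023_all", "2023_level1", "2023_level2", "2023_level3")


def gaia_load_kwargs(subset: str = "2023_all",
                     env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for datasets.load_dataset, failing early on bad input."""
    env = os.environ if env is None else env
    if subset not in GAIA_SUBSETS:
        raise ValueError(f"unknown GAIA subset {subset!r}; expected one of {GAIA_SUBSETS}")
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; GAIA is a gated dataset")
    return {
        "path": "gaia-benchmark/GAIA",
        "name": subset,
        "split": "validation",
        "token": token,  # assumption: recent `datasets` versions accept `token`
    }


def load_gaia(subset: str = "2023_all"):
    """Load one GAIA subset; needs network access and an approved access request."""
    from datasets import load_dataset  # imported lazily so the helper above has no dependency
    return load_dataset(**gaia_load_kwargs(subset))
```

Checking the subset name and the token before calling `load_dataset` turns a missing `HF_TOKEN` into a clear local error instead of a failed download.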

Test plan

- Registry loads the `gaia` benchmark correctly (validated via `load_benchmark`)
- All existing benchmarks still load after registry changes
- Answer normalization and exact-match scoring verified with unit assertions
- Evaluator returns correct task counts for all 4 subsets
- End-to-end run with a real agent (requires `HF_TOKEN` with GAIA access)

feat: add GAIA benchmark adapter

19d8d67

GAIA (General AI Assistants) benchmark for multi-step reasoning. Loads tasks from HuggingFace datasets (gaia-benchmark/GAIA). 3 difficulty levels, exact-match scoring with answer normalization. Reports per-level accuracy metrics in aggregation.

zeroasterisk wants to merge 1 commit into Exgentic:main from zeroasterisk:feat/gaia-benchmark
