feat: add PinchBench benchmark adapter for coding agent evaluation by zeroasterisk · Pull Request #240 · Exgentic/exgentic

zeroasterisk · 2026-06-21T16:51:31Z

Summary

Adds PinchBench benchmark support to Exgentic: 147 real-world tasks across 11 categories for evaluating coding agents.

PinchBench evaluates LLM agents on practical tasks: scheduling, email triage, file management, CSV analysis, log analysis, coding, and DevOps. Tasks come from the open-source pinchbench/skill repo.

What's included

PinchBenchSession — loads tasks from PinchBench skill repo, provisions workspace files, collects agent output
PinchBenchEvaluator — aggregates per-category accuracy metrics
3 grading modes: automated (Python grade function), LLM judge, hybrid (weighted)
11 task categories, filterable via subset parameter
Follows the same Session/Evaluator/Benchmark pattern as gsm8k and GAIA adapters
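
As an illustration of the hybrid grading mode, here is a minimal sketch of how an automated score and an LLM-judge score might be combined by weight. The function name `combine_scores`, the `automated_weight` parameter, and the assumption that both scores lie in [0, 1] are hypothetical and are not taken from the adapter or from PinchBench itself.

```python
from typing import Optional


def combine_scores(
    automated: Optional[float],
    judge: Optional[float],
    automated_weight: float = 0.5,  # hypothetical default, not from PinchBench
) -> float:
    """Combine an automated score and an LLM-judge score, both assumed in [0, 1].

    If only one score is available, it is returned unchanged, which covers
    the purely automated and purely judged modes.
    """
    if not 0.0 <= automated_weight <= 1.0:
        raise ValueError("automated_weight must be between 0 and 1")
    if automated is None and judge is None:
        raise ValueError("at least one score is required")
    if judge is None:
        return automated
    if automated is None:
        return judge
    # Hybrid mode: weighted average of the two scores.
    return automated_weight * automated + (1.0 - automated_weight) * judge
```

With equal weights, `combine_scores(1.0, 0.5)` averages the two scores; passing `None` for either argument falls back to the remaining score.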

Task categories (147 tasks)

productivity (8), research (12), writing (6), coding (14), analysis (12), csv_analysis (26), log_analysis (30), meeting_analysis (28), memory (2), skills (6), integrations (3)
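
To make the per-category metrics and subset filtering concrete, the following sketch aggregates pass/fail results by category and optionally restricts them to one category. The function `per_category_accuracy` and the `(category, passed)` result shape are assumptions for illustration; the actual `PinchBenchEvaluator` may structure its results differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def per_category_accuracy(
    results: Iterable[Tuple[str, bool]],
    subset: Optional[str] = None,
) -> Dict[str, float]:
    """Return the fraction of passed tasks per category.

    `results` is an iterable of (category, passed) pairs. When `subset`
    is given, only tasks in that category are counted.
    """
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for category, ok in results:
        if subset is not None and category != subset:
            continue
        totals[category] += 1
        passed[category] += int(ok)
    return {category: passed[category] / totals[category] for category in totals}


if __name__ == "__main__":
    sample = [("coding", True), ("coding", False), ("memory", True)]
    print(per_category_accuracy(sample))
    print(per_category_accuracy(sample, subset="memory"))
```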

Attribution

PinchBench was created by Kilo. All tasks and grading criteria are open source under their license.

Adds an Exgentic adapter for PinchBench (https://pinchbench.com), a real-world benchmark suite for AI coding agents created by PinchBench / Kilo Code. The adapter loads 147 tasks across 11 categories (productivity, research, writing, coding, analysis, csv_analysis, log_analysis, meeting_analysis, memory, skills, integrations) from a local clone of the PinchBench skill repo. Grading supports all three PinchBench modes: automated Python checks, LLM-judge rubrics, and hybrid weighted combinations. Tasks with workspace file requirements are provisioned into temporary directories. Follows the same Session/Evaluator/Benchmark pattern as the existing GAIA and GSM8k adapters. Registered in the benchmark registry with per-category subset filtering.
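
The workspace provisioning step described above could look roughly like the sketch below, which writes a task's files into a fresh temporary directory. The function `provision_workspace`, the `pinchbench-` prefix, and the mapping of relative paths to text content are hypothetical; the adapter's real file format and layout are not shown in this PR description.

```python
import tempfile
from pathlib import Path
from typing import Dict


def provision_workspace(files: Dict[str, str]) -> Path:
    """Create a temporary directory and write the given files into it.

    `files` maps relative paths to text content. Paths that would resolve
    outside the workspace are rejected.
    """
    root = Path(tempfile.mkdtemp(prefix="pinchbench-")).resolve()
    for relative_path, content in files.items():
        target = (root / relative_path).resolve()
        # Guard against absolute paths and "../" escapes.
        if root not in target.parents:
            raise ValueError(f"path escapes workspace: {relative_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
    return root
```

The caller is responsible for removing the directory afterwards, for example with `shutil.rmtree`.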

zeroasterisk added 2 commits June 21, 2026 16:44

feat: add GAIA benchmark adapter

19d8d67

GAIA (General AI Assistants) benchmark for multi-step reasoning. Loads tasks from HuggingFace datasets (gaia-benchmark/GAIA). 3 difficulty levels, exact-match scoring with answer normalization. Reports per-level accuracy metrics in aggregation.
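
For reference, exact-match scoring with answer normalization can be sketched as follows. The normalization rules here (lowercasing, collapsing whitespace, dropping trailing punctuation) are assumptions chosen for illustration and are not the official GAIA scorer or necessarily what this commit implements.

```python
import re


def normalize_answer(text: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    text = re.sub(r"\s+", " ", text.strip().lower())
    return text.rstrip(".!?,;:").strip()


def exact_match(prediction: str, gold: str) -> bool:
    """Return True when both answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(gold)
```

Under these rules, `" Paris. "` matches `"paris"`, while `"3.14"` and `"314"` remain distinct because interior punctuation is preserved.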

zeroasterisk wants to merge 2 commits into Exgentic:main from zeroasterisk:feat/pinchbench-adapter
