feat: add PinchBench benchmark adapter for coding agent evaluation #240

Open
zeroasterisk wants to merge 2 commits into Exgentic:main from zeroasterisk:feat/pinchbench-adapter
@zeroasterisk

Summary

Adds PinchBench benchmark support to Exgentic: 147 tasks across 11 categories for evaluating coding agents on real-world work.

PinchBench evaluates LLM agents on practical tasks: scheduling, email triage, file management, CSV analysis, log analysis, coding, and DevOps. Tasks come from the open-source pinchbench/skill repo.

What's included

  • PinchBenchSession: loads tasks from the PinchBench skill repo, provisions workspace files, and collects agent output
  • PinchBenchEvaluator: aggregates per-category accuracy metrics
  • 3 grading modes: automated (Python grade function), LLM judge, hybrid (weighted)
  • 11 task categories, filterable via subset parameter
  • Follows the same Session/Evaluator/Benchmark pattern as gsm8k and GAIA adapters
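For reviewers unfamiliar with the hybrid mode, here is a minimal sketch of how a weighted combination of the automated and LLM-judge scores could work. The function name `hybrid_score`, the `automated_weight` parameter, and the fallback behavior when one score is missing are assumptions for illustration, not the adapter's actual API.

```python
from typing import Optional


def hybrid_score(
    automated: Optional[float],
    judge: Optional[float],
    automated_weight: float = 0.5,  # hypothetical default, not taken from PinchBench
) -> float:
    """Combine an automated score and an LLM-judge score, both expected in [0, 1]."""
    if not 0.0 <= automated_weight <= 1.0:
        raise ValueError("automated_weight must be in [0, 1]")
    if automated is None and judge is None:
        raise ValueError("at least one score is required")
    # Assumed fallback: use whichever score exists if only one mode produced a result.
    if judge is None:
        return automated
    if automated is None:
        return judge
    # Weighted average: the judge receives the remaining weight.
    return automated_weight * automated + (1.0 - automated_weight) * judge
```

With equal weights, a task that passes the automated check but fails the judge rubric would score 0.5 under this sketch.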

Task categories (147 tasks)

productivity (8), research (12), writing (6), coding (14), analysis (12), csv_analysis (26), log_analysis (30), meeting_analysis (28), memory (2), skills (6), integrations (3)
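The category counts above can be sanity-checked, and subset filtering sketched, roughly as follows. The `CATEGORY_COUNTS` mapping mirrors the list above; the `filter_tasks` helper and the `{"id", "category"}` task shape are hypothetical and may not match the adapter's internal representation.

```python
from typing import Dict, Iterable, List, Optional

# Counts copied from the category list in this PR description.
CATEGORY_COUNTS: Dict[str, int] = {
    "productivity": 8, "research": 12, "writing": 6, "coding": 14,
    "analysis": 12, "csv_analysis": 26, "log_analysis": 30,
    "meeting_analysis": 28, "memory": 2, "skills": 6, "integrations": 3,
}


def filter_tasks(tasks: Iterable[dict], subset: Optional[str] = None) -> List[dict]:
    """Return all tasks, or only those in the named category (hypothetical task shape)."""
    if subset is None:
        return list(tasks)
    if subset not in CATEGORY_COUNTS:
        raise ValueError(f"unknown subset: {subset!r}")
    return [task for task in tasks if task["category"] == subset]


# The per-category counts add up to the advertised total.
assert sum(CATEGORY_COUNTS.values()) == 147
assert len(CATEGORY_COUNTS) == 11
```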

Attribution

PinchBench is created by Kilo. All tasks and grading criteria are open source under their license.

GAIA (General AI Assistants) benchmark for multi-step reasoning.
Loads tasks from HuggingFace datasets (gaia-benchmark/GAIA).
3 difficulty levels, exact-match scoring with answer normalization.
Reports per-level accuracy metrics in aggregation.

Adds an Exgentic adapter for PinchBench (https://pinchbench.com), a
real-world benchmark suite for AI coding agents created by PinchBench /
Kilo Code. The adapter loads 147 tasks across 11 categories
(productivity, research, writing, coding, analysis, csv_analysis,
log_analysis, meeting_analysis, memory, skills, integrations) from a
local clone of the PinchBench skill repo.

Grading supports all three PinchBench modes: automated Python checks,
LLM-judge rubrics, and hybrid weighted combinations. Tasks with
workspace file requirements are provisioned into temporary directories.
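As a rough illustration of provisioning workspace files into a temporary directory, the sketch below writes a mapping of relative paths to file contents under a fresh directory. The helper names `provision_workspace` and `remove_workspace`, the `pinchbench_` prefix, and the path-escape check are assumptions for illustration; the adapter's own provisioning code may differ.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict


def provision_workspace(files: Dict[str, str]) -> Path:
    """Write task files into a fresh temporary directory and return its path."""
    root = Path(tempfile.mkdtemp(prefix="pinchbench_")).resolve()
    for rel_path, content in files.items():
        target = (root / rel_path).resolve()
        # Refuse absolute paths or "../" segments that would escape the workspace.
        if root not in target.parents:
            raise ValueError(f"path escapes workspace: {rel_path!r}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
    return root


def remove_workspace(root: Path) -> None:
    """Delete a workspace created by provision_workspace."""
    shutil.rmtree(root, ignore_errors=True)
```

The caller owns cleanup in this sketch, so a session would call `remove_workspace` once the agent's output has been collected.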

Follows the same Session/Evaluator/Benchmark pattern as the existing
GAIA and GSM8k adapters. Registered in the benchmark registry with
per-category subset filtering.
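Per-category accuracy aggregation, as performed by the evaluator, might be sketched like this. The function name `per_category_accuracy`, the `(category, passed)` input shape, and the `accuracy/<category>` metric keys are assumptions for illustration rather than the names used in this PR.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_category_accuracy(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (category, passed) pairs into per-category and overall accuracy."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += int(ok)
    # One metric per category seen in the results.
    metrics = {f"accuracy/{c}": passed[c] / totals[c] for c in totals}
    n = sum(totals.values())
    # Overall accuracy is weighted by task count; 0.0 when there are no results.
    metrics["accuracy/overall"] = sum(passed.values()) / n if n else 0.0
    return metrics
```

Because the categories are very uneven in size (2 tasks in memory versus 30 in log_analysis), reporting per-category numbers alongside the overall figure keeps small categories visible.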