feat: add PinchBench benchmark adapter for coding agent evaluation #240
Open
zeroasterisk wants to merge 2 commits into
GAIA (General AI Assistants) benchmark for multi-step reasoning. Loads tasks from HuggingFace datasets (gaia-benchmark/GAIA). 3 difficulty levels, exact-match scoring with answer normalization. Reports per-level accuracy metrics in aggregation.
Adds an Exgentic adapter for PinchBench (https://pinchbench.com), a real-world benchmark suite for AI coding agents created by PinchBench / Kilo Code. The adapter loads 147 tasks across 11 categories (productivity, research, writing, coding, analysis, csv_analysis, log_analysis, meeting_analysis, memory, skills, integrations) from a local clone of the PinchBench skill repo. Grading supports all three PinchBench modes: automated Python checks, LLM-judge rubrics, and hybrid weighted combinations. Tasks with workspace file requirements are provisioned into temporary directories. Follows the same Session/Evaluator/Benchmark pattern as the existing GAIA and GSM8k adapters. Registered in the benchmark registry with per-category subset filtering.
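The three grading modes above can be sketched as a single scoring function. This is a minimal illustration, not the adapter's actual code: `GradeResult`, `combine_scores`, and the `automated_weight` default are hypothetical names and values, and it assumes both graders return a score in [0, 1] and that the hybrid mode is a plain weighted average.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GradeResult:
    """Hypothetical container; not a PinchBench or Exgentic type."""
    automated: Optional[float] = None  # score in [0, 1] from Python checks
    llm_judge: Optional[float] = None  # score in [0, 1] from a rubric-based judge


def combine_scores(result: GradeResult, automated_weight: float = 0.5) -> float:
    """Return one score in [0, 1] for automated, LLM-judge, or hybrid grading."""
    if not 0.0 <= automated_weight <= 1.0:
        raise ValueError("automated_weight must be in [0, 1]")
    if result.automated is None and result.llm_judge is None:
        raise ValueError("at least one score is required")
    if result.llm_judge is None:  # automated-only task
        return result.automated
    if result.automated is None:  # LLM-judge-only task
        return result.llm_judge
    # Hybrid task: weighted average of the two scores (assumed combination rule).
    return automated_weight * result.automated + (1.0 - automated_weight) * result.llm_judge
```

Under these assumptions, a task that passes every automated check but earns half marks from the judge scores 0.75 at equal weights.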
Summary
Adds PinchBench benchmark support to Exgentic — 147 tasks across 11 categories for evaluating coding agents on real-world tasks.
PinchBench evaluates LLM agents on practical tasks: scheduling, email triage, file management, CSV analysis, log analysis, coding, and DevOps. Tasks come from the open-source pinchbench/skill repo.
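The description notes that tasks with workspace file requirements are provisioned into temporary directories. A rough sketch of that step follows. `provision_workspace` and its `files` mapping (relative path to text content) are hypothetical; the real adapter may copy files from the cloned repo instead.

```python
import tempfile
from pathlib import Path


def provision_workspace(files: dict[str, str]) -> Path:
    """Write each relative-path -> text-content pair into a fresh temp directory."""
    root = Path(tempfile.mkdtemp(prefix="pinchbench_")).resolve()
    for rel_path, content in files.items():
        target = (root / rel_path).resolve()
        # Refuse paths that would land outside the workspace (e.g. "../x").
        if not target.is_relative_to(root):
            raise ValueError(f"path escapes workspace: {rel_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
    return root
```

The caller owns the returned directory and should remove it (for example with `shutil.rmtree`) once the task has been graded.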
What's included
- PinchBenchSession: loads tasks from the PinchBench skill repo, provisions workspace files, and collects agent output
- PinchBenchEvaluator: aggregates per-category accuracy metrics
- subset parameter for per-category filtering
Task categories (147 tasks)
productivity (8), research (12), writing (6), coding (14), analysis (12), csv_analysis (26), log_analysis (30), meeting_analysis (28), memory (2), skills (6), integrations (3)
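To illustrate how per-category accuracy and `subset` filtering might fit together, here is a small sketch. The category counts are copied from the list above; `per_category_accuracy` and the `(category, passed)` result shape are assumptions for illustration, not the evaluator's real interface.

```python
from collections import defaultdict
from typing import Iterable, Optional

# Category sizes as listed in this PR description.
CATEGORY_COUNTS = {
    "productivity": 8, "research": 12, "writing": 6, "coding": 14,
    "analysis": 12, "csv_analysis": 26, "log_analysis": 30,
    "meeting_analysis": 28, "memory": 2, "skills": 6, "integrations": 3,
}
assert sum(CATEGORY_COUNTS.values()) == 147  # matches the stated task total


def per_category_accuracy(
    results: Iterable[tuple[str, bool]], subset: Optional[str] = None
) -> dict[str, float]:
    """Return the fraction of passed tasks per category, optionally for one category."""
    if subset is not None and subset not in CATEGORY_COUNTS:
        raise ValueError(f"unknown category: {subset}")
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for category, ok in results:
        if subset is not None and category != subset:
            continue  # skip tasks outside the requested subset
        totals[category] += 1
        passed[category] += int(ok)
    return {category: passed[category] / totals[category] for category in totals}
```

Only categories with at least one result appear in the output, so a `subset` run reports a single entry.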
Attribution
PinchBench is created by Kilo. All tasks and grading criteria are open source under their license.