Benchmark: ToolComp Enterprise — multi-tool dependent reasoning

## Overview
Evaluate OpenSymbolicAI against the **ToolComp Enterprise** benchmark (Scale AI SEAL) — 287 prompts testing dependent multi-tool usage with process supervision.

## Why this benchmark
- Tests **composing multiple tools together** through dependent calls, which maps directly onto our primitive chains
- Includes golden answer chains and process supervision labels for detailed evaluation
- High visibility via Scale AI's SEAL leaderboard
- Current leaders are GPT-4o and Claude 3.5 Sonnet, so there is an opportunity to show framework-level gains

## References
- [ToolComp (Scale AI)](https://scale.com/research/toolcomp-a-multi-tool-reasoning-and-process-supervision-benchmark)
- [Agentic Tool Use Leaderboard](https://labs.scale.com/leaderboard/tool_use_enterprise)

## Tasks
- [ ] Review ToolComp dataset and evaluation protocol
- [ ] Map ToolComp's 11 enterprise tools to OpenSymbolicAI primitives
- [ ] Implement benchmark harness
- [ ] Run evaluation and collect results
- [ ] Submit to Scale AI SEAL leaderboard
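
As a starting point for the harness task, here is a minimal sketch of a final-answer evaluation loop. Everything in it is an assumption for illustration: the `Example` schema, the `normalize` helper, and the `evaluate` function are hypothetical names, the agent is modeled as a plain `prompt -> answer` callable rather than any actual OpenSymbolicAI interface, and normalized exact match is only a stand-in for whatever grading the ToolComp evaluation protocol actually specifies. It does not score intermediate steps against the process supervision labels.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Example:
    """One benchmark prompt with its golden final answer (hypothetical schema)."""
    prompt: str
    golden_answer: str


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def evaluate(agent: Callable[[str], str], examples: Iterable[Example]) -> float:
    """Return the fraction of examples whose normalized answer equals the golden answer.

    Exact match is an assumption here, not the benchmark's official metric.
    """
    total = 0
    correct = 0
    for ex in examples:
        total += 1
        if normalize(agent(ex.prompt)) == normalize(ex.golden_answer):
            correct += 1
    # An empty example set scores 0.0 instead of raising ZeroDivisionError.
    return correct / total if total else 0.0
```

Wiring in a real run would mean replacing the callable with a wrapper around the agent under test and loading the 287 prompts into `Example` objects once the dataset format has been reviewed.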
