Benchmark: ToolComp Enterprise — multi-tool dependent reasoning #24

@rajkumar42

Overview

Evaluate OpenSymbolicAI against the ToolComp Enterprise benchmark (Scale AI SEAL) — 287 prompts testing dependent multi-tool usage with process supervision.

Why this benchmark

  • Tests composing multiple tools together with dependent calls — directly maps to our primitive chains
  • Includes golden answer chains and process supervision labels for detailed evaluation
  • High visibility via Scale AI's SEAL leaderboard
  • Current leaders are GPT-4o and Claude 3.5 Sonnet, which leaves room to show framework-level gains
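
To make "dependent calls" and "process supervision" concrete, here is a minimal sketch of a chain in which each tool consumes the previous tool's output, plus a per-step comparison against a golden chain. Everything in it is hypothetical: the toy tools `lookup_price` and `double`, the `Step` record, and the exact-match step labels are illustrations only, not ToolComp's schema or OpenSymbolicAI's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    tool: str      # hypothetical tool name
    argument: str  # input that was passed to the tool


def run_chain(
    tools: Dict[str, Callable[[str], str]], plan: List[str], question: str
) -> Tuple[str, List[Step]]:
    """Run tools in order; each step receives the previous step's output."""
    trace: List[Step] = []
    value = question
    for name in plan:
        trace.append(Step(tool=name, argument=value))
        value = tools[name](value)
    return value, trace


def step_labels(trace: List[Step], golden: List[Step]) -> List[bool]:
    """Label each step True where it matches the golden chain exactly."""
    labels = [actual == expected for actual, expected in zip(trace, golden)]
    # Missing or extra steps on either side count as incorrect.
    labels += [False] * abs(len(trace) - len(golden))
    return labels


# Toy stand-ins for real tools (assumption: string in, string out).
TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup_price": lambda item: {"widget": "4"}.get(item, "0"),
    "double": lambda x: str(int(x) * 2),
}

answer, trace = run_chain(TOOLS, ["lookup_price", "double"], "widget")
golden = [Step("lookup_price", "widget"), Step("double", "4")]
print(answer)                      # 8
print(step_labels(trace, golden))  # [True, True]
```

The second step can only be correct if the first one produced `"4"`, which is the dependency the benchmark is meant to stress. Step-level labels show where a chain went wrong, not just that the final answer was wrong.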

References

Tasks

  • Review ToolComp dataset and evaluation protocol
  • Map ToolComp's 11 enterprise tools to OpenSymbolicAI primitives
  • Implement benchmark harness
  • Run evaluation and collect results
  • Submit to Scale AI SEAL leaderboard
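
As a starting point for the harness task, the loop could look roughly like the sketch below. It makes several assumptions that the first task (reviewing the dataset and protocol) should confirm or replace: examples arrive as JSON lines with hypothetical `prompt` and `answer` fields, the agent is any callable from prompt to answer string, and normalized exact match stands in for the benchmark's actual grading method.

```python
import json
from typing import Callable, Dict, Iterable, List


def load_examples(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse JSON-lines input, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.strip().lower().split())


def evaluate(agent: Callable[[str], str], examples: List[Dict[str, str]]) -> Dict:
    """Run the agent on every example and report final-answer accuracy."""
    results = []
    for example in examples:
        error = None
        try:
            prediction = agent(example["prompt"])
        except Exception as exc:  # keep going so one failure does not abort the run
            prediction = ""
            error = repr(exc)
        results.append(
            {
                "prompt": example["prompt"],
                "prediction": prediction,
                "correct": normalize(prediction) == normalize(example["answer"]),
                "error": error,
            }
        )
    correct = sum(r["correct"] for r in results)
    accuracy = correct / len(results) if results else 0.0
    return {"total": len(results), "correct": correct,
            "accuracy": accuracy, "results": results}


# Usage with inline toy data and a placeholder agent.
sample_lines = [
    '{"prompt": "2 + 2", "answer": "4"}',
    '{"prompt": "capital of France", "answer": "Paris"}',
]


def toy_agent(prompt: str) -> str:
    return {"2 + 2": "4"}.get(prompt, "unknown")


report = evaluate(toy_agent, load_examples(sample_lines))
print(report["accuracy"])  # 0.5
```

Keeping the agent behind a plain callable means the same loop can run a bare model and the OpenSymbolicAI-wrapped version side by side, which is what a framework-level comparison needs.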
