Open
Description
Overview
Evaluate OpenSymbolicAI against the ToolComp Enterprise benchmark (Scale AI SEAL): 287 prompts that test dependent multi-tool usage with process supervision.
Why this benchmark
- Tests composing multiple tools through dependent calls, which maps directly to our primitive chains
- Includes golden answer chains and process supervision labels for detailed evaluation
- High visibility via Scale AI's SEAL leaderboard
- Current leaders are GPT-4o and Claude 3.5 Sonnet, so there is an opportunity to show framework-level gains
References
Tasks
- Review ToolComp dataset and evaluation protocol
- Map ToolComp's 11 enterprise tools to OpenSymbolicAI primitives
- Implement benchmark harness
- Run evaluation and collect results
- Submit to Scale AI SEAL leaderboard
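As a starting point for the mapping and harness tasks above, here is a minimal sketch of what the harness could look like. Everything in it is an assumption for illustration: the `Step`/`Example` record shape, the `PRIMITIVES` registry, the toy `add`/`double` tools, the `$N` convention for referring to an earlier step's output, and the `planner` callable standing in for OpenSymbolicAI. The real ToolComp schema, tool names, and scoring protocol need to be confirmed in the dataset review task.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


# Hypothetical record shape; the actual ToolComp schema may differ.
@dataclass
class Step:
    tool: str
    args: Dict[str, Any]


@dataclass
class Example:
    prompt: str
    golden_chain: List[Step]
    golden_answer: str


# Hypothetical registry: benchmark tool name -> primitive.
# The two toy tools below are placeholders, not ToolComp tools.
PRIMITIVES: Dict[str, Callable[..., Any]] = {
    "add": lambda a, b: a + b,
    "double": lambda x: 2 * x,
}


def run_chain(chain: List[Step]) -> List[Any]:
    """Execute steps in order; an argument "$N" is replaced by step N's output."""
    outputs: List[Any] = []
    for step in chain:
        resolved = {
            key: outputs[int(value[1:])]
            if isinstance(value, str) and value.startswith("$")
            else value
            for key, value in step.args.items()
        }
        outputs.append(PRIMITIVES[step.tool](**resolved))
    return outputs


def evaluate(
    examples: List[Example], planner: Callable[[str], List[Step]]
) -> Dict[str, float]:
    """Score final-answer accuracy and position-wise tool agreement with the golden chain."""
    correct = matched = total = 0
    for ex in examples:
        predicted = planner(ex.prompt)
        try:
            answer = str(run_chain(predicted)[-1])
        except Exception:
            answer = None  # a failed or empty chain counts as a wrong answer
        correct += answer == ex.golden_answer
        total += len(ex.golden_chain)
        matched += sum(p.tool == g.tool for p, g in zip(predicted, ex.golden_chain))
    return {
        "final_accuracy": correct / len(examples),
        "step_tool_match": matched / total,
    }


# Toy example: add(2, 3) feeds into double, so the second call depends on the first.
golden = [Step("add", {"a": 2, "b": 3}), Step("double", {"x": "$0"})]
examples = [Example("Double the sum of 2 and 3.", golden, "10")]
print(evaluate(examples, lambda prompt: golden))
```

The step-level score here is only a rough proxy for process supervision. If the benchmark's per-step labels turn out to be richer than tool identity, that metric should be replaced with whatever the evaluation protocol specifies.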