add-benchmark: check Harbor + Inspect Evals for prior art #19
Open
elronbandel wants to merge 1 commit into
Add a pre-start step to the add-benchmark skill: before porting from the raw upstream, check Inspect Evals (UKGovernmentBEIS/inspect_evals) and Harbor (harbor-framework/harbor) for an existing reference implementation, and reuse their pinned dataset revision, scorer, and prompt format as references (not dependencies).

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
3b71a71 to 4ddc059
4ddc059 to 30d862f
30d862f to 7685ecd
Adds a pre-start step to the add-benchmark skill: before porting a benchmark from the raw upstream, check Inspect Evals (UKGovernmentBEIS/inspect_evals) and Harbor (harbor-framework/harbor) for an existing reference implementation, and reuse their pinned dataset revision, scorer, and prompt format as references (not dependencies — the Dock benchmark stays self-contained per rule 2).
Motivated by checking SkillsBench: it's in neither harness, so that one would be a from-scratch port — exactly the thing this step makes explicit up front.
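The prior-art check described above could be sketched roughly as follows. This is only an illustration, not part of the skill: it assumes both repositories have already been cloned locally, and the helper names (`normalize`, `find_prior_art`, `is_from_scratch`) and the directory-name matching heuristic are hypothetical. Neither repository's actual layout is assumed, so the search simply walks every directory in each checkout.

```python
import re
from pathlib import Path


def normalize(name: str) -> str:
    """Lowercase and drop separators so 'Skills-Bench' matches 'skills_bench'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def find_prior_art(benchmark: str, checkouts: dict[str, Path]) -> dict[str, list[Path]]:
    """Return, per harness, the directories whose name matches the benchmark.

    `checkouts` maps a harness label to a local clone of its repository.
    Matching on directory names is a heuristic assumed for this sketch;
    any hit still needs a manual look at its dataset revision, scorer,
    and prompt format.
    """
    target = normalize(benchmark)
    hits: dict[str, list[Path]] = {}
    for harness, root in checkouts.items():
        hits[harness] = sorted(
            p for p in root.rglob("*")
            if p.is_dir() and normalize(p.name) == target
        )
    return hits


def is_from_scratch(hits: dict[str, list[Path]]) -> bool:
    """True when no harness has a candidate reference implementation."""
    return all(not paths for paths in hits.values())
```

Called with something like `find_prior_art("SkillsBench", {"inspect_evals": Path("inspect_evals"), "harbor": Path("harbor")})`, an empty result for both harnesses corresponds to the from-scratch case mentioned above. A non-empty result only points at candidates to read as references; nothing is imported from either harness.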