feat(eval): expand dataset to 37 tasks with JSON scenarios #185
Merged
Add 6 jq_mastery scenarios:
- jq_config_merge: deep-merge two JSON config files
- jq_log_ndjson: aggregate errors from NDJSON logs by service
- jq_reshape_api: transform API records between schema versions
- jq_json_to_csv: convert JSON array to CSV with headers
- jq_package_update: programmatically update package.json fields
- jq_group_aggregate: group_by + sum aggregation (SQL-like)

Add 6 scenarios covering other gaps:
- pipe_dedup_merge: merge and deduplicate sorted lists
- text_multifile_replace: rename function across multiple files
- script_health_check: multi-condition validation script using jq
- data_column_transform: TSV-to-CSV column reorder with awk
- complex_release_notes: parse conventional commits into changelog
- data_csv_join: join two CSVs on shared key column

Total scenarios: 25 → 37
https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG
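For readers unfamiliar with the "group_by + sum" shape that jq_group_aggregate exercises, here is a rough sketch of the same aggregation in Rust. The function name `sum_by_key` and the sample rows are invented for illustration and are not taken from the scenario files.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: sums amounts per key, like SQL's
/// `SELECT key, SUM(amount) ... GROUP BY key`.
fn sum_by_key(rows: &[(&str, u64)]) -> BTreeMap<String, u64> {
    let mut totals = BTreeMap::new();
    for (key, amount) in rows {
        // Start each new key at 0, then accumulate.
        *totals.entry(key.to_string()).or_insert(0) += amount;
    }
    totals
}

fn main() {
    // Invented sample data: (customer, order amount).
    let rows = [("alice", 30), ("bob", 5), ("alice", 12)];
    for (customer, total) in sum_by_key(&rows) {
        println!("{customer}: {total}");
    }
}
```

A `BTreeMap` keeps the output sorted by key, which makes results deterministic and easy to compare in a test.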
Prompts should describe the task, not prescribe the tool. Removes "Use jq", "Use awk", "using sed", etc. from all scenario prompts (both pre-existing and new).

Renames tool-based IDs:
- text_grep_extract → text_log_error_count
- text_sed_config → text_hostname_replace
- text_awk_report → text_csv_revenue
- jq_nested_transform → json_nested_names
- jq_api_response → json_api_pagination
- jq_config_merge → json_config_merge
- jq_log_ndjson → json_ndjson_error_aggregate
- jq_reshape_api → json_api_schema_migration
- jq_json_to_csv → json_to_csv_export
- jq_package_update → json_package_update
- jq_group_aggregate → json_order_totals

Renames category jq_mastery → json_processing. Updates spec table.
https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG
Eval results:
- Haiku 4.5: 32/37 passed (95%), 81% tool success
- GPT-5.2: 23/37 passed (80%), 71% tool success
- Opus 4.6: rate-limited, skipped

Updates README with new results and per-scenario breakdown.
https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG
Root cause: the Opus eval failed on all 37 tasks with a hidden 404 error.
The error was wrapped by anyhow's .context(), so only "provider chat
failed" was visible. Actual error: wrong model ID (claude-opus-4-6-20250610
doesn't exist; the correct ID is claude-opus-4-6).
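To illustrate why only the outermost message was visible: printing a wrapped error with plain `{}` shows just the top-level context, while the underlying cause is only reachable through the error's source chain (anyhow's `{:#}` prints that chain joined with ": "). The sketch below reproduces the idea with the standard library only, so it runs without extra crates. `HttpError`, `Context`, and `full_chain` are hypothetical names for illustration, not the runner's actual types.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical low-level error standing in for the provider's HTTP failure.
#[derive(Debug)]
struct HttpError {
    status: u16,
}

impl fmt::Display for HttpError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "HTTP {}", self.status)
    }
}

impl Error for HttpError {}

/// Hypothetical wrapper that attaches a message to a cause,
/// similar in spirit to what `.context()` does.
#[derive(Debug)]
struct Context {
    message: &'static str,
    source: Box<dyn Error + 'static>,
}

impl fmt::Display for Context {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Display shows only this layer's message, not the cause.
        write!(f, "{}", self.message)
    }
}

impl Error for Context {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(self.source.as_ref())
    }
}

/// Walks `source()` and joins every layer with ": ".
fn full_chain(err: &dyn Error) -> String {
    let mut out = err.to_string();
    let mut current = err.source();
    while let Some(cause) = current {
        out.push_str(": ");
        out.push_str(&cause.to_string());
        current = cause.source();
    }
    out
}

fn main() {
    let err = Context {
        message: "provider chat failed",
        source: Box::new(HttpError { status: 404 }),
    };
    println!("{}", err); // top-level message only
    println!("{}", full_chain(&err)); // message plus underlying cause
}
```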
Changes:
- Add exponential backoff retry (2s, 4s, 8s, 16s) for 429 and 529/5xx
in both Anthropic and OpenAI providers
- Use {:#} format in runner error output to show full error chain
- Update spec non-goals to reflect retry support
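
The retry schedule listed above (2s, 4s, 8s, 16s for 429 and 5xx responses) could be sketched roughly as follows. This is a minimal std-only illustration, not the providers' actual code: `is_retryable`, `backoff_delay`, and `with_retry` are hypothetical names, errors are reduced to a bare status code, and the sleep function is injected so the schedule can be exercised without waiting.

```rust
use std::time::Duration;

/// Hypothetical check: 429 plus any 5xx (which includes 529) is retryable.
fn is_retryable(status: u16) -> bool {
    status == 429 || (500..=599).contains(&status)
}

/// Delay before retry number `attempt` (0-based): 2s, 4s, 8s, 16s, then give up.
fn backoff_delay(attempt: u32) -> Option<Duration> {
    const MAX_RETRIES: u32 = 4;
    if attempt < MAX_RETRIES {
        Some(Duration::from_secs(2u64 << attempt))
    } else {
        None
    }
}

/// Calls `call` until it succeeds, returns a non-retryable status,
/// or the retry budget is exhausted.
fn with_retry<T>(
    mut call: impl FnMut() -> Result<T, u16>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, u16> {
    let mut attempt = 0;
    loop {
        match call() {
            Ok(value) => return Ok(value),
            Err(status) if is_retryable(status) => match backoff_delay(attempt) {
                Some(delay) => {
                    sleep(delay);
                    attempt += 1;
                }
                None => return Err(status),
            },
            // Non-retryable errors (for example 404) fail immediately.
            Err(status) => return Err(status),
        }
    }
}

fn main() {
    let mut calls = 0;
    let mut waited = Vec::new();
    // Simulated provider: two 429 responses, then success.
    let result = with_retry(
        || {
            calls += 1;
            if calls < 3 { Err(429) } else { Ok("done") }
        },
        |delay| waited.push(delay.as_secs()),
    );
    println!("{:?} after waiting {:?} seconds", result, waited);
}
```

Note that a 404 such as the wrong-model-ID failure is not retried in this sketch, so it surfaces immediately instead of consuming the retry budget.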
https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG
Opus 4.6: 29/37 passed (87%), 82% tool success, 25.2 min

Full 3-model comparison now in README with per-category and per-new-scenario breakdown.
https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG
Summary
- jq_config_merge → json_config_merge, text_sed_config → text_hostname_replace) and category jq_mastery → json_processing

Test plan
- cargo build passes
- cargo test -p bashkit-eval passes