feat: Leaderboard Submission Script#1
Conversation
|
This is a benchmark review for experiment This pull request was cloned from Experiment configurationreview_config:
# User configuration for the review
# - benchmark - use the user config from the benchmark reviews
# - <value> - use the value directly
user_review_config:
enable_ai_review: true
enable_rule_comments: false
enable_complexity_comments: benchmark
enable_security_comments: benchmark
enable_tests_comments: benchmark
enable_comment_suggestions: benchmark
enable_pull_request_summary: benchmark
enable_review_guide: benchmark
enable_approvals: false
base_branches: [base-sha.*]
ai_review_config:
# The model responses to use for the experiment
# - benchmark - use the model responses from the benchmark reviews
# - llm - call the language model to generate responses
model_responses:
comments_model: benchmark
comment_validation_model: benchmark
comment_suggestion_model: benchmark
complexity_model: benchmark
security_model: benchmark
tests_model: benchmark
pull_request_summary_model: benchmark
review_guide_model: benchmark
overall_comments_model: benchmark
# The pull request dataset to run the experiment on
pull_request_dataset:
# CodeRabbit
- https://github.com/neerajkumar161/node-coveralls-integration/pull/5
- https://github.com/gunner95/vertx-rest/pull/1
- https://github.com/Altinn/altinn-access-management-frontend/pull/1427
- https://github.com/theMr17/github-notifier/pull/14
- https://github.com/bearycool11/AI_memory_Loops/pull/142
# Greptile
- https://github.com/gumloop/guMCP/pull/119
- https://github.com/autoblocksai/python-sdk/pull/335
- https://github.com/grepdemos/ImageSharp/pull/6
- https://github.com/grepdemos/server/pull/61
- https://github.com/websentry-ai/pipelines/pull/25
# Graphite
- https://github.com/KittyCAD/modeling-app/pull/6648
- https://github.com/KittyCAD/modeling-app/pull/6628
- https://github.com/Varedis-Org/AI-Test-Repo/pull/2
- https://github.com/deeep-network/bedrock/pull/198
- https://github.com/Metta-AI/metta/pull/277
# Copilot
- https://github.com/hmcts/rpx-xui-webapp/pull/4438
- https://github.com/ganchdev/quez/pull/104
- https://github.com/xbcsmith/ymlfxr/pull/13
- https://github.com/tinapayy/B-1N1T/pull/36
- https://github.com/coder/devcontainer-features/pull/6
# Questions to ask to label the review comments
review_comment_labels: []
# - label: correct
# question: Is this comment correct?
# Benchmark reviews generated by running
# python -m scripts.experiment benchmark <experiment_name>
benchmark_reviews: []
|
Reviewer's GuideThis pull request introduces a new shell script ( File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Hellebore
left a comment
There was a problem hiding this comment.
Hey @Hellebore - I've reviewed your changes and found some issues that need to be addressed.
Blocking issues:
-
Using eval for command execution may introduce security concerns. (link)
-
The logic within the
_policy_recordsmethod, particularly for the "top" selector, has grown complex; consider refactoring parts of this logic into smaller helper functions. -
The
add_to_leaderboard.shscript usesevalto run constructed commands; consider executing the commands directly to avoid potential quoting issues or security risks associated witheval.
Here's what I looked at during the review
- 🟡 General issues: 2 issues found
- 🔴 Security: 1 blocking issue
- 🟢 Testing: all looks good
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| self.eval_stats_logger = EvalStatsLogger(self.sim_suite_config, wandb_run) | ||
| self.average_reward = 0.0 # Initialize average reward estimate | ||
| self._current_eval_score = None | ||
| self.eval_scores = None |
There was a problem hiding this comment.
suggestion (bug_risk): Changed initialization of eval_scores to an empty dictionary.
Verify downstream code treats eval_scores as a dict—this avoids null checks but requires consistent usage.
Suggested implementation:
if not self.eval_scores: self.eval_scores["latest"] = current_scoreDepending on your downstream code usage you may have to:
- Change all conditional checks that compare self.eval_scores to None (e.g., “if self.eval_scores is None:”) into checks for emptiness (e.g., “if not self.eval_scores:”).
- Replace any method calls (like append, extend, etc.) on self.eval_scores with dictionary updates that use keys.
- Ensure that downstream code which reads from eval_scores uses the correct dictionary key(s) rather than assuming a list.
Adjust the key names (“latest” in the example) to match the intended logic of your evaluation scoring.
| echo "Step 2: Running simulation..." | ||
| SIM_CMD="python3 -m tools.sim sim=navigation run=\"$RUN_NAME\" policy_uri=\"$WANDB_PATH\" +eval_db_uri=wandb://artifacts/navigation_db $ADDITIONAL_ARGS" | ||
| echo "Executing: $SIM_CMD" | ||
| eval $SIM_CMD |
There was a problem hiding this comment.
🚨 suggestion (security): Using eval for command execution may introduce security concerns.
Sanitize all arguments to prevent shell injection, or use an array-based command invocation instead of eval.
Suggested implementation:
# Step 2: Run the simulation using array-based command execution
echo "Step 2: Running simulation..."
# Split ADDITIONAL_ARGS into an array
read -r -a additional_args <<< "$ADDITIONAL_ARGS"
SIM_CMD_ARRAY=(python3 -m tools.sim sim=navigation "run=${RUN_NAME}" "policy_uri=${WANDB_PATH}" "+eval_db_uri=wandb://artifacts/navigation_db")
SIM_CMD_ARRAY+=( "${additional_args[@]}" )
echo "Executing: ${SIM_CMD_ARRAY[*]}"
"${SIM_CMD_ARRAY[@]}"
# Step 3: Analyze and update dashboard using array-based command execution
echo "Step 3: Analyzing results and updating dashboard..."
# Split ADDITIONAL_ARGS into an array for analyze (if needed)
read -r -a additional_args_analyze <<< "$ADDITIONAL_ARGS"
ANALYZE_CMD_ARRAY=(python3 -m tools.analyze "run=analyze" "+eval_db_uri=wandb://artifacts/navigation_db" "analyzer.output_path=s3://softmax-public/policydash/dashboard.html" "+analyzer.num_output_policies=all")
ANALYZE_CMD_ARRAY+=( "${additional_args_analyze[@]}" )
echo "Executing: ${ANALYZE_CMD_ARRAY[*]}"
"${ANALYZE_CMD_ARRAY[@]}"
Note: Ensure that the variable ADDITIONAL_ARGS does not contain unintended extra characters and is properly defined. Review the rest of the script so the usage of read -r -a additional_args does not interfere with other parts of your code.
| # Get the directory where this script is located | ||
| SCRIPT_DIR="$(dirname "$(readlink -f "$0")")" |
There was a problem hiding this comment.
suggestion (bug_risk): Usage of 'readlink -f' may have cross-platform issues.
macOS lacks 'readlink -f'. Consider using a portable method (e.g., realpath or a shell function) to determine the script directory.
| # Get the directory where this script is located | |
| SCRIPT_DIR="$(dirname "$(readlink -f "$0")")" | |
| # Get the directory where this script is located using a portable method | |
| SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd -P)" |
This PR introduces a new utility script (
devops/add_to_leaderboard.sh) that streamlines the process of adding policies to our navigation evaluation leaderboard. The script handles the end-to-end workflow of submitting a policy, running the navigation simulation, and updating the dashboard.Changes:
add_to_leaderboard.shwith clear documentation and error handlingsim_job.yamlfrom "top" to "latest" for consistent evaluationpolicy_store.pyto properly handle missing eval scores in metadataUsage:
Example:
Dashboard URL
Summary by Sourcery
Introduce a new leaderboard submission script and improve policy selection and build processes for navigation evaluation
New Features:
add_to_leaderboard.shfor streamlined policy submission and evaluationBug Fixes:
Enhancements:
policy_store.pyDeployment:
Chores: