advisor: gate autonomous learning on rewarded trajectories #52
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0458360c50
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
reward_value=float(final_reward.get("total_reward") or 0.0),
packet=trajectory["turns"][0].get("state_packet") if trajectory.get("turns") else row.get("input") or {},
advice=trajectory["turns"][0].get("advice") if trajectory.get("turns") else row.get("advice") or {},
outcome=final_outcome,
reward_label=final_reward,
Use canonical run rewards when building ingestion records
This now derives reward_value and reward_label from trajectory.final_reward, but the training pipeline treats rollout reward_label as canonical because stored trajectories can lag after replay/repair (training_backends._trajectory_final_reward). If a run’s reward is corrected in reward_labels without rewriting advisor_trajectories.final_reward_json, autonomous learning will train on stale rewards and report incorrect reward variation even though the canonical run reward is newer. This is a data-quality regression introduced by switching ingestion from run rows to trajectory payloads.
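One way to address this is to resolve the reward from the canonical run label first and fall back to the stored trajectory reward only when no canonical label exists. The following is a minimal sketch under stated assumptions: the helper name `resolve_reward_label` and the `reward_label` key on the run row are hypothetical, and the real precedence logic lives in `training_backends._trajectory_final_reward`, which is not shown in this review.

```python
from typing import Any, Mapping, Optional


def resolve_reward_label(
    run_row: Mapping[str, Any],
    trajectory: Mapping[str, Any],
) -> Optional[dict]:
    """Pick the reward label to ingest, preferring the canonical run reward.

    Hypothetical helper: assumes the run row exposes its canonical label
    under "reward_label" and the trajectory under "final_reward".
    """
    canonical = run_row.get("reward_label")
    if isinstance(canonical, dict) and canonical.get("total_reward") is not None:
        # The canonical label wins, even if the stored trajectory lags behind it.
        return dict(canonical)

    fallback = trajectory.get("final_reward")
    if isinstance(fallback, dict) and fallback.get("total_reward") is not None:
        # No canonical label available; use the trajectory's own reward.
        return dict(fallback)

    # Neither source carries a usable reward; the caller should skip the record.
    return None
```

With this ordering, a reward corrected in `reward_labels` takes effect without rewriting `advisor_trajectories.final_reward_json`.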
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2182809bbd
final_reward = trajectory.get("final_reward") or {}
final_outcome = trajectory.get("final_outcome") or {}
if final_reward.get("total_reward") is None:
    continue
Guard scalar trajectory rewards before calling .get()
This code assumes trajectory.final_reward is always a dict, but legacy/repaired payloads can contain scalar rewards (the backend explicitly handles int|float in training_backends._trajectory_final_reward). In that case final_reward.get("total_reward") raises AttributeError, which bubbles out of readiness evaluation and can abort tick() before any profile is processed. Please normalize/type-check final_reward before dict access so autonomous learning remains resilient to older trajectory rows.
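The normalization suggested above could look roughly like the following sketch. The helper name `extract_total_reward` is hypothetical, and the choice to reject booleans and non-numeric values is an assumption rather than documented backend behavior.

```python
from typing import Any, Optional


def extract_total_reward(final_reward: Any) -> Optional[float]:
    """Return the total reward as a float, or None if it cannot be determined.

    Hypothetical helper: accepts either a dict with a "total_reward" key
    or a bare int/float, as legacy or repaired payloads may contain.
    """
    if isinstance(final_reward, dict):
        # Unwrap the dict shape, then validate the inner value like a scalar.
        final_reward = final_reward.get("total_reward")

    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(final_reward, bool):
        return None
    if isinstance(final_reward, (int, float)):
        return float(final_reward)

    # None, strings, lists, and other shapes are treated as "no reward".
    return None
```

In the loop under review, `if extract_total_reward(trajectory.get("final_reward")) is None: continue` would then skip unusable rows instead of raising `AttributeError` on a scalar.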
- docs/plans/2026-04-24-final-multiturn-real-learning-completion.md.
- consumed_trajectory_ids alongside run ids after successful cycles.
- rollout_group_size, when trajectories are missing, or when final trajectory rewards are absent.

Testing
- python -m pytest tests/agent/advisor/test_learning_controller.py tests/agent/advisor/test_learning_service.py -q
- python -m pytest tests/agent/advisor/test_real_training_backend.py tests/agent/advisor/test_training_runtime.py -q
- ruff check .
- python -m pytest tests/agent/advisor -q
- git diff --check