advisor: gate autonomous learning on rewarded trajectories by sarvesh1327 · Pull Request #52 · sarvesh1327/advisor

sarvesh1327 · 2026-04-24T15:56:59Z

Restart Phase 8 from docs/plans/2026-04-24-final-multiturn-real-learning-completion.md.
Make autonomous learning readiness depend on fresh, reward-bearing, unconsumed trajectories instead of only rewarded runs.
Persist and consume consumed_trajectory_ids alongside run ids after successful cycles.
Build rollout groups from trajectory payloads and expose source trajectory ids in diagnostics/summary.
Block training when trajectory count is below profile rollout_group_size, when trajectories are missing, or when final trajectory rewards are absent.
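For reviewers, the gating rules above could be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `evaluate_readiness`, the `trajectory_id` key, and the reason strings are hypothetical names, and it assumes `final_reward` is a dict with a `total_reward` field.

```python
from typing import Any


def _blocked(reason: str, ids: list[str]) -> dict[str, Any]:
    # Hypothetical result shape: readiness flag, reason, and source trajectory ids.
    return {"ready": False, "reason": reason, "trajectory_ids": ids}


def evaluate_readiness(
    trajectories: list[dict[str, Any]],
    consumed_trajectory_ids: set[str],
    rollout_group_size: int,
) -> dict[str, Any]:
    """Hypothetical readiness gate; all field names are assumptions."""
    if not trajectories:
        return _blocked("missing_trajectories", [])

    # Fresh means the trajectory was not consumed by an earlier successful cycle.
    fresh = [
        t for t in trajectories
        if t["trajectory_id"] not in consumed_trajectory_ids
    ]
    if not fresh:
        return _blocked("no_fresh_trajectories", [])

    # Reward-bearing means a dict final_reward with a non-None total_reward.
    rewarded = [
        t for t in fresh
        if isinstance(t.get("final_reward"), dict)
        and t["final_reward"].get("total_reward") is not None
    ]
    ids = [t["trajectory_id"] for t in rewarded]
    if not rewarded:
        return _blocked("missing_final_rewards", ids)
    if len(rewarded) < rollout_group_size:
        return _blocked("below_rollout_group_size", ids)
    return {"ready": True, "reason": "ok", "trajectory_ids": ids}
```

After a successful cycle, the returned `trajectory_ids` would be the ones added to the persisted consumed set so that the next tick does not reuse them.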

Testing

python -m pytest tests/agent/advisor/test_learning_controller.py tests/agent/advisor/test_learning_service.py -q
python -m pytest tests/agent/advisor/test_real_training_backend.py tests/agent/advisor/test_training_runtime.py -q
ruff check .
python -m pytest tests/agent/advisor -q
git diff --check

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0458360c50

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-24T16:02:35Z

+            reward_value=float(final_reward.get("total_reward") or 0.0),
+            packet=trajectory["turns"][0].get("state_packet") if trajectory.get("turns") else row.get("input") or {},
+            advice=trajectory["turns"][0].get("advice") if trajectory.get("turns") else row.get("advice") or {},
+            outcome=final_outcome,
+            reward_label=final_reward,


Use canonical run rewards when building ingestion records

This now derives reward_value and reward_label from trajectory.final_reward, but the training pipeline treats rollout reward_label as canonical because stored trajectories can lag after replay/repair (training_backends._trajectory_final_reward). If a run’s reward is corrected in reward_labels without rewriting advisor_trajectories.final_reward_json, autonomous learning will train on stale rewards and report incorrect reward variation even though the canonical run reward is newer. This is a data-quality regression introduced by switching ingestion from run rows to trajectory payloads.
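One possible shape for the fix, as a minimal sketch: prefer the canonical run-level reward label and fall back to the stored trajectory reward only when no canonical label exists. The helper name `resolve_reward_label` and its arguments are hypothetical; how the canonical label is loaded from `reward_labels` is assumed, not taken from the repository.

```python
from typing import Any, Optional


def resolve_reward_label(
    canonical_label: Optional[dict[str, Any]],
    trajectory: dict[str, Any],
) -> dict[str, Any]:
    """Hypothetical helper: pick the reward label used for ingestion."""
    # The canonical run reward wins whenever it carries a total_reward,
    # because stored trajectory rewards can lag after replay or repair.
    if canonical_label and canonical_label.get("total_reward") is not None:
        return canonical_label

    # Fall back to the reward stored on the trajectory, if it is a dict.
    stored = trajectory.get("final_reward")
    return stored if isinstance(stored, dict) else {}


def reward_value(label: dict[str, Any]) -> float:
    # Mirrors the float(... or 0.0) pattern from the diff under review.
    return float(label.get("total_reward") or 0.0)
```

With this ordering, a run whose reward was corrected in `reward_labels` trains on the corrected value even if `final_reward_json` was never rewritten.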


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2182809bbd


chatgpt-codex-connector · 2026-04-24T16:11:49Z

+        final_reward = trajectory.get("final_reward") or {}
+        final_outcome = trajectory.get("final_outcome") or {}
+        if final_reward.get("total_reward") is None:
+            continue


Guard scalar trajectory rewards before calling .get()

This code assumes trajectory.final_reward is always a dict, but legacy/repaired payloads can contain scalar rewards (the backend explicitly handles int|float in training_backends._trajectory_final_reward). In that case final_reward.get("total_reward") raises AttributeError, which bubbles out of readiness evaluation and can abort tick() before any profile is processed. Please normalize/type-check final_reward before dict access so autonomous learning remains resilient to older trajectory rows.
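A minimal sketch of the kind of normalization being asked for, assuming scalar rewards should be read as the total reward. `normalize_final_reward` is a hypothetical helper name, and the treatment of booleans and unknown types is an assumption rather than the behavior of `training_backends._trajectory_final_reward`.

```python
from typing import Any


def normalize_final_reward(raw: Any) -> dict[str, Any]:
    """Hypothetical helper: coerce a stored final_reward into a dict."""
    if isinstance(raw, dict):
        return raw
    # bool is a subclass of int in Python; treat it as "no reward" (assumption).
    if isinstance(raw, bool):
        return {}
    # Legacy or repaired rows may store the reward as a bare number.
    if isinstance(raw, (int, float)):
        return {"total_reward": float(raw)}
    # None or any other type: no usable reward.
    return {}


def has_final_reward(trajectory: dict[str, Any]) -> bool:
    # Safe to call .get() here because the value is always a dict.
    reward = normalize_final_reward(trajectory.get("final_reward"))
    return reward.get("total_reward") is not None
```

With the value normalized first, the `total_reward is None` check from the diff no longer raises `AttributeError` on scalar rows.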


advisor: gate autonomous learning on rewarded trajectories

0458360

chatgpt-codex-connector Bot reviewed Apr 24, 2026


fix: use canonical rewards for trajectory ingestion

2182809

sarvesh1327 merged commit 692f257 into main Apr 24, 2026
1 check passed

sarvesh1327 deleted the phase8-real-eligible-trajectories branch April 24, 2026 16:06

chatgpt-codex-connector Bot reviewed Apr 24, 2026


sarvesh1327 merged 2 commits into main from phase8-real-eligible-trajectories
