advisor: gate autonomous learning on rewarded trajectories#52

Merged: sarvesh1327 merged 2 commits into main from phase8-real-eligible-trajectories on Apr 24, 2026

Conversation

@sarvesh1327
Owner

  • Restart Phase 8 from docs/plans/2026-04-24-final-multiturn-real-learning-completion.md.
  • Make autonomous learning readiness depend on fresh, reward-bearing, unconsumed trajectories instead of only rewarded runs.
  • Persist and consume consumed_trajectory_ids alongside run ids after successful cycles.
  • Build rollout groups from trajectory payloads and expose source trajectory ids in diagnostics/summary.
  • Block training when trajectory count is below profile rollout_group_size, when trajectories are missing, or when final trajectory rewards are absent.
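A minimal sketch of the gating rules listed above, assuming trajectories arrive as dicts with a `trajectory_id` key and a dict-shaped `final_reward` carrying `total_reward`. The `Readiness` dataclass, `evaluate_readiness`, and the reason strings are hypothetical names for illustration, not the code in this PR; freshness is modeled here only as "not yet consumed".

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Readiness:
    """Hypothetical result type: whether training may start, and why or why not."""
    ready: bool
    reason: str
    trajectory_ids: list[str] = field(default_factory=list)


def _has_final_reward(trajectory: dict[str, Any]) -> bool:
    # Assumption: a usable reward is a dict whose "total_reward" is not None.
    reward = trajectory.get("final_reward")
    return isinstance(reward, dict) and reward.get("total_reward") is not None


def evaluate_readiness(
    trajectories: list[dict[str, Any]],
    consumed_trajectory_ids: set[str],
    rollout_group_size: int,
) -> Readiness:
    """Block training unless enough unconsumed, reward-bearing trajectories exist."""
    if not trajectories:
        return Readiness(False, "no_trajectories")

    unconsumed = [
        t for t in trajectories
        if t["trajectory_id"] not in consumed_trajectory_ids
    ]
    if not unconsumed:
        return Readiness(False, "all_consumed")

    rewarded = [t for t in unconsumed if _has_final_reward(t)]
    if not rewarded:
        return Readiness(False, "missing_final_rewards")

    ids = [t["trajectory_id"] for t in rewarded]
    if len(rewarded) < rollout_group_size:
        return Readiness(False, "below_rollout_group_size", ids)
    return Readiness(True, "ready", ids)
```

After a successful cycle, the ids in `trajectory_ids` would be added to the persisted consumed set so the same trajectories do not make a later cycle look ready.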

Testing

  • python -m pytest tests/agent/advisor/test_learning_controller.py tests/agent/advisor/test_learning_service.py -q
  • python -m pytest tests/agent/advisor/test_real_training_backend.py tests/agent/advisor/test_training_runtime.py -q
  • ruff check .
  • python -m pytest tests/agent/advisor -q
  • git diff --check

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0458360c50

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread agent/advisor/learning/readiness.py Outdated
Comment on lines +236 to +240
reward_value=float(final_reward.get("total_reward") or 0.0),
packet=trajectory["turns"][0].get("state_packet") if trajectory.get("turns") else row.get("input") or {},
advice=trajectory["turns"][0].get("advice") if trajectory.get("turns") else row.get("advice") or {},
outcome=final_outcome,
reward_label=final_reward,

P1: Use canonical run rewards when building ingestion records

This now derives reward_value and reward_label from trajectory.final_reward, but the training pipeline treats rollout reward_label as canonical because stored trajectories can lag after replay/repair (training_backends._trajectory_final_reward). If a run’s reward is corrected in reward_labels without rewriting advisor_trajectories.final_reward_json, autonomous learning will train on stale rewards and report incorrect reward variation even though the canonical run reward is newer. This is a data-quality regression introduced by switching ingestion from run rows to trajectory payloads.
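One way to address this concern is to resolve the reward in a fixed order: the canonical run-level label first, the stored trajectory reward only as a fallback. The sketch below assumes both are dicts with a `total_reward` key; `resolve_reward_label` is a hypothetical helper, not an existing function in the repository.

```python
from typing import Any, Optional


def resolve_reward_label(
    run_reward_label: Optional[dict[str, Any]],
    trajectory: dict[str, Any],
) -> Optional[dict[str, Any]]:
    """Prefer the canonical run reward; fall back to the trajectory's stored copy.

    Hypothetical helper. The stored final_reward may lag behind the run-level
    label after replay or repair, so it is consulted only when no canonical
    label is available.
    """
    if isinstance(run_reward_label, dict) and run_reward_label.get("total_reward") is not None:
        return run_reward_label

    stored = trajectory.get("final_reward")
    if isinstance(stored, dict) and stored.get("total_reward") is not None:
        return stored

    # Neither source carries a usable reward; the caller should skip this record.
    return None
```

With this ordering, a reward corrected at the run level takes effect even if the trajectory row was never rewritten.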

@sarvesh1327 sarvesh1327 merged commit 692f257 into main Apr 24, 2026
1 check passed
@sarvesh1327 sarvesh1327 deleted the phase8-real-eligible-trajectories branch April 24, 2026 16:06

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2182809bbd

Comment on lines +220 to +223
final_reward = trajectory.get("final_reward") or {}
final_outcome = trajectory.get("final_outcome") or {}
if final_reward.get("total_reward") is None:
continue

P1: Guard scalar trajectory rewards before calling .get()

This code assumes trajectory.final_reward is always a dict, but legacy/repaired payloads can contain scalar rewards (the backend explicitly handles int|float in training_backends._trajectory_final_reward). In that case final_reward.get("total_reward") raises AttributeError, which bubbles out of readiness evaluation and can abort tick() before any profile is processed. Please normalize/type-check final_reward before dict access so autonomous learning remains resilient to older trajectory rows.
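The normalization requested here could look like the following sketch. `normalize_final_reward` is a hypothetical helper; wrapping a scalar as `{"total_reward": value}` and rejecting booleans are assumptions made for illustration, not behavior confirmed by the repository.

```python
from typing import Any


def normalize_final_reward(raw: Any) -> dict[str, Any]:
    """Coerce a stored final_reward into a dict so .get() is always safe.

    Hypothetical helper: dicts pass through, bare numbers are wrapped under
    "total_reward", and anything else (None, strings, lists) becomes {}.
    """
    if isinstance(raw, dict):
        return raw
    # bool is a subclass of int in Python; treating True/False as a reward
    # would be surprising, so it is excluded here (an assumption).
    if isinstance(raw, bool):
        return {}
    if isinstance(raw, (int, float)):
        return {"total_reward": float(raw)}
    return {}


def has_total_reward(trajectory: dict[str, Any]) -> bool:
    """Return True when the trajectory carries a usable total reward."""
    final_reward = normalize_final_reward(trajectory.get("final_reward"))
    return final_reward.get("total_reward") is not None
```

A readiness loop can then skip rows where `has_total_reward` is false instead of raising `AttributeError` on a scalar payload.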
