Add OPD training example #45

Merged
atoniolo76 merged 4 commits into main from alessio/add-opd-rl-tutorial
May 15, 2026
Conversation

@atoniolo76
Contributor

Checklist

  • Example is documented with comments throughout, in a Literate Programming style.
  • Example does not require third-party dependencies to be installed locally
  • Example pins its dependencies
    • Example pins container images to a stable tag, not a dynamic tag like latest
    • Example specifies a python_version for the base image, if it is used
    • Example pins all dependencies to at least minor version, ~=x.y.z or ==x.y
    • Example dependencies with version < 1 are pinned to patch version, ==0.y.z
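
One reading of the pinning rules above, as a minimal sketch rather than an official checker: `is_pinned` is a hypothetical helper, the package names in the usage note are only illustrative, and the regular expression handles only simple `name==version` or `name~=version` strings.

```python
import re

# Matches "name==X.Y", "name==X.Y.Z", "name~=X.Y" or "name~=X.Y.Z".
# Anything else (no specifier, ">=", extras markers, URLs) is treated as unpinned.
_PIN = re.compile(r"^[A-Za-z0-9_.\-\[\]]+(==|~=)(\d+)\.(\d+)(?:\.(\d+))?$")


def is_pinned(requirement: str) -> bool:
    """Return True if `requirement` satisfies this reading of the checklist."""
    match = _PIN.match(requirement.strip())
    if match is None:
        return False
    operator = match.group(1)
    major = int(match.group(2))
    has_patch = match.group(4) is not None
    if major == 0:
        # Versions below 1 must be pinned exactly, down to the patch: ==0.y.z
        return operator == "==" and has_patch
    if operator == "~=":
        # ~=x.y would only fix the major version, so require ~=x.y.z
        return has_patch
    # ==x.y or ==x.y.z
    return True
```

Under these assumptions, `is_pinned("torch~=2.5.1")` and `is_pinned("trl==0.19.1")` pass, while `is_pinned("torch>=2.0")` and `is_pinned("trl==0.19")` do not.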

Outside contributors

You're great! Thanks for your contribution.

@atoniolo76 requested a review from joyliu-q on May 15, 2026 00:18
@devin-ai-integration
Contributor

Review

Nice tutorial — the OPD walkthrough is a solid addition. A few things to address before merge:

Bug

Source docstring says 004_ but file is 003_

"""Tutorial source for `004_on_policy_distillation` — parsed by generate_tutorial.py."""

Should be 003_on_policy_distillation.

Typos (in source, propagated to all generated files)

| Typo | Fix | Occurrences |
| --- | --- | --- |
| determnistic | deterministic | 2 (dataset intro, reward section) |
| paramter | parameter | 1 (next steps) |
| models vocabularies | models' vocabularies | 1 (next steps) |
| log ()P(x) / Q(x)) | log(P(x) / Q(x)) | 1 (reward section, stray open paren) |

Conceptual: KL explanation contradicts itself

The reward function markdown says:

we use the "reverse" KL divergence where P is the student's probability distribution, allowing us to apply a teacher penalty to modes only valued by the student model.

Then immediately:

In the "reverse" case, where P is the teacher's probability distribution, we apply a penalty to modes that the teacher thinks is valuable but may be out of distribution for the student model.

These two sentences disagree about which distribution is P in the "reverse" case. The second reads like a description of the forward KL, not the reverse. Worth clarifying: the advantage formula in the intro (log π_student − log π_teacher), averaged over tokens sampled from the student, is KL(student ‖ teacher), which is the conventional "reverse" KL when the teacher is treated as the reference. One clean sentence would do.
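
To make the naming concrete, here is a minimal sketch with made-up numbers; it is not taken from the tutorial. The two toy distributions and the helper names `kl` and `log_ratio` are assumptions for illustration. It shows that averaging log π_student − log π_teacher over student samples gives KL(student ‖ teacher), and that swapping the arguments gives a different quantity.

```python
import math


def kl(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


# Toy next-token distributions over a 3-token vocabulary (illustrative values).
student = [0.70, 0.25, 0.05]
teacher = [0.10, 0.60, 0.30]

# "Reverse" KL: P is the student, so the expectation is over student samples.
reverse_kl = kl(student, teacher)
# "Forward" KL: P is the teacher, so the expectation is over teacher samples.
forward_kl = kl(teacher, student)


def log_ratio(token):
    """Per-token log pi_student(token) - log pi_teacher(token)."""
    return math.log(student[token]) - math.log(teacher[token])


# Weighting the per-token log-ratio by the student's own probabilities
# recovers the reverse KL exactly.
expected_log_ratio = sum(student[t] * log_ratio(t) for t in range(len(student)))

print(f"reverse KL(student || teacher) = {reverse_kl:.4f}")
print(f"forward KL(teacher || student) = {forward_kl:.4f}")
print(f"E_student[log-ratio]           = {expected_log_ratio:.4f}")
```

In this toy setup the log-ratio is positive for token 0, which the student favors far more than the teacher does, and negative for token 2. That matches the first quoted sentence: with P as the student, the penalty falls on modes valued only by the student.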

Nit

Missing newline at end of tutorials/tutorial_generator/rl/003_on_policy_distillation.py.

PR checklist

The checklist items are all unchecked — might want to tick the ones that apply (literate-programming style, dependency pinning, etc.).

@@ -0,0 +1,500 @@
# pyright: reportUndefinedVariable=false, reportMissingImports=false
"""Tutorial source for `004_on_policy_distillation` — parsed by generate_tutorial.py."""
Contributor

@joyliu-q commented on May 15, 2026

Typo: should be 003.

Contributor

@joyliu-q left a comment

Devin found some typos but otherwise this lgtm!

@atoniolo76 merged commit fb559b5 into main on May 15, 2026
2 checks passed