Add OPD training example by atoniolo76 · Pull Request #45 · modal-projects/training-gym

atoniolo76 · 2026-05-15T00:18:42Z

Checklist

- Example is documented with comments throughout, in a Literate Programming style.
- Example does not require third-party dependencies to be installed locally
- Example pins its dependencies
  - Example pins container images to a stable tag, not a dynamic tag like latest
  - Example specifies a python_version for the base image, if it is used
  - Example pins all dependencies to at least minor version, ~=x.y.z or ==x.y
  - Example dependencies with version < 1 are pinned to patch version, ==0.y.z
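The pinning rules in this checklist can be checked mechanically. The following is a minimal sketch of one possible reading of those rules, not part of the repository's tooling: the function name `is_acceptably_pinned` and the regular expression are hypothetical, and environment markers or URL requirements are not handled.

```python
import re

# Hypothetical helper: matches "name==1.2.3" or "name~=1.2.3" style specs only.
_PIN = re.compile(
    r"^(?P<name>[A-Za-z0-9_.\-\[\]]+)(?P<op>==|~=)(?P<version>\d+(?:\.\d+)*)$"
)


def is_acceptably_pinned(requirement: str) -> bool:
    """Return True if the requirement follows the checklist's pinning rules."""
    match = _PIN.match(requirement.strip())
    if match is None:
        # No pin at all, or an operator such as >= that allows drift.
        return False
    op = match["op"]
    parts = match["version"].split(".")
    if int(parts[0]) < 1:
        # Pre-1.0 dependencies must be pinned to an exact patch: ==0.y.z
        return op == "==" and len(parts) >= 3
    if op == "~=":
        # ~=x.y.z only lets the patch component move; ~=x.y would let minor move.
        return len(parts) >= 3
    # ==x.y or ==x.y.z
    return len(parts) >= 2


for spec in ["torch==2.5.1", "numpy~=2.1.0", "fastapi==0.115", "requests"]:
    print(spec, is_acceptably_pinned(spec))
```

Under this reading, `~=x.y` is rejected because it permits minor-version drift, which is a stricter interpretation than the checklist strictly requires.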

Outside contributors

You're great! Thanks for your contribution.

devin-ai-integration · 2026-05-15T14:28:31Z

Review

Nice tutorial — the OPD walkthrough is a solid addition. A few things to address before merge:

Bug

Source docstring says 004_ but file is 003_

"""Tutorial source for `004_on_policy_distillation` — parsed by generate_tutorial.py."""

Should be 003_on_policy_distillation.
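A mismatch like this could be caught automatically. Here is a minimal sketch, assuming the tutorial name always appears in backticks in the module docstring with a three-digit prefix; the helper names are hypothetical, the sample docstring is abbreviated, and nothing here reflects how `generate_tutorial.py` actually parses the file.

```python
import ast
import re
from pathlib import PurePosixPath
from typing import Optional


def docstring_tutorial_name(source: str) -> Optional[str]:
    """Extract a backticked name like `003_on_policy_distillation` from the module docstring."""
    docstring = ast.get_docstring(ast.parse(source))
    if docstring is None:
        return None
    match = re.search(r"`(\d{3}_\w+)`", docstring)
    return match.group(1) if match else None


def docstring_matches_filename(path: str, source: str) -> bool:
    """Compare the name in the docstring against the file's stem."""
    return docstring_tutorial_name(source) == PurePosixPath(path).stem


# Abbreviated version of the docstring from this PR (assumed shape).
sample_source = '"""Tutorial source for `004_on_policy_distillation`."""\n'
sample_path = "tutorials/tutorial_generator/rl/003_on_policy_distillation.py"
print(docstring_matches_filename(sample_path, sample_source))
```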

Typos (in source, propagated to all generated files)

| Typo | Fix | Occurrences |
| --- | --- | --- |
| `determnistic` | `deterministic` | 2 (dataset intro, reward section) |
| `paramter` | `parameter` | 1 (next steps) |
| `models vocabularies` | `models' vocabularies` | 1 (next steps) |
| `log ()P(x) / Q(x))` | `log(P(x) / Q(x))` | 1 (reward section, stray open paren) |

Conceptual: KL explanation contradicts itself

The reward function markdown says:

we use the "reverse" KL divergence where P is the student's probability distribution, allowing us to apply a teacher penalty to modes only valued by the student model.

Then immediately:

In the "reverse" case, where P is the teacher's probability distribution, we apply a penalty to modes that the teacher thinks is valuable but may be out of distribution for the student model.

These say opposite things about which distribution is P in the "reverse" case. The second sentence reads like a description of the forward KL, not the reverse. Worth clarifying — the advantage formula in the intro (log π_student − log π_teacher) is KL(student ‖ teacher), which is the conventional "reverse" KL when the teacher is treated as the reference. One clean sentence would do.
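To make the direction concrete, here is a small numeric sketch. The two three-token distributions are made up for illustration and do not come from the tutorial. It shows that averaging the log-ratio log π_student − log π_teacher over the student's own samples gives KL(student ‖ teacher), and that this differs from KL(teacher ‖ student).

```python
import math


def kl(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)), skipping terms where p(x) == 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a three-token vocabulary.
student = [0.70, 0.20, 0.10]
teacher = [0.30, 0.30, 0.40]

# "Reverse" KL: expectation taken under the student (the policy being trained).
reverse_kl = kl(student, teacher)
# "Forward" KL: expectation taken under the teacher.
forward_kl = kl(teacher, student)

# Expected per-token log-ratio when tokens are sampled from the student.
expected_log_ratio = sum(
    s * (math.log(s) - math.log(t)) for s, t in zip(student, teacher)
)

print(f"KL(student || teacher) = {reverse_kl:.4f}")
print(f"KL(teacher || student) = {forward_kl:.4f}")
print(f"E_student[log-ratio]   = {expected_log_ratio:.4f}")
```

Because the samples come from the student, the large log-ratio terms are the tokens the student favors and the teacher does not, which matches the first of the two quoted sentences.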

Nit

Missing newline at end of tutorials/tutorial_generator/rl/003_on_policy_distillation.py.
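A check along these lines would flag the missing newline. This is a sketch with hypothetical helper names, not a description of the repository's linting setup.

```python
from pathlib import Path


def ends_with_newline(data: bytes) -> bool:
    """An empty file is treated as fine; otherwise the last byte must be a newline."""
    return data == b"" or data.endswith(b"\n")


def file_ends_with_newline(path: str) -> bool:
    """Read the file as bytes so no encoding assumptions are needed."""
    return ends_with_newline(Path(path).read_bytes())


print(ends_with_newline(b"x = 1"))    # file content without a trailing newline
print(ends_with_newline(b"x = 1\n"))  # file content with a trailing newline
```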

PR checklist

The checklist items are all unchecked — might want to tick the ones that apply (literate-programming style, dependency pinning, etc.).

joyliu-q · 2026-05-15T14:29:47Z

@@ -0,0 +1,500 @@
+# pyright: reportUndefinedVariable=false, reportMissingImports=false
+"""Tutorial source for `004_on_policy_distillation` — parsed by generate_tutorial.py."""


Devin found some typos but otherwise this lgtm!

Add OPD training example

890cdb2

atoniolo76 requested a review from joyliu-q May 15, 2026 00:18

joyliu-q reviewed May 15, 2026

joyliu-q approved these changes May 15, 2026

atoniolo76 and others added 3 commits May 15, 2026 19:12

Tighten up wording and fix Devin nits

f14943e

Merge branch 'main' into alessio/add-opd-rl-tutorial

fba980d

Replace terminology for KL distributions with RL-correct terms

a18befb

atoniolo76 merged commit fb559b5 into main May 15, 2026
2 checks passed

atoniolo76 merged 4 commits into main from alessio/add-opd-rl-tutorial
