Add OPD training example #45
Review
Nice tutorial; the OPD walkthrough is a solid addition. A few things to address before merge:
Bug
Source docstring says """Tutorial source for `004_on_policy_distillation` — parsed by generate_tutorial.py.""" Should be
Typos (in source, propagated to all generated files)
Conceptual: KL explanation contradicts itself
The reward function markdown says:
Then immediately:
These say opposite things about which distribution is P in the "reverse" case. The second sentence reads like a description of the forward KL, not the reverse. Worth clarifying: the advantage formula in the intro (
Nit
Missing newline at end of PR checklist
The checklist items are all unchecked; might want to tick the ones that apply (literate-programming style, dependency pinning, etc.).
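For reference on the forward-versus-reverse KL point above, here is a minimal sketch of how the two directions differ. It assumes the usual distillation convention, which the quoted tutorial text does not confirm: the reverse KL is KL(student || teacher), with the expectation taken under the student, and the forward KL is KL(teacher || student). The `kl` helper and the two toy distributions are hypothetical and not taken from the PR.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    The expectation is taken under p, the first argument.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
student = [0.7, 0.2, 0.1]
teacher = [0.4, 0.4, 0.2]

# Assumed convention: "forward" weights the log-ratio by the teacher's probabilities,
# "reverse" weights it by the student's probabilities (on-policy samples).
forward_kl = kl(teacher, student)
reverse_kl = kl(student, teacher)

print(f"forward KL(teacher || student) = {forward_kl:.4f}")
print(f"reverse KL(student || teacher) = {reverse_kl:.4f}")
```

The two values differ because KL is not symmetric, so the tutorial text should state explicitly which distribution plays the role of P in each case.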
@@ -0,0 +1,500 @@
# pyright: reportUndefinedVariable=false, reportMissingImports=false
"""Tutorial source for `004_on_policy_distillation` — parsed by generate_tutorial.py."""
joyliu-q left a comment
Devin found some typos but otherwise this lgtm!
Checklist
latest
python_version for the base image, if it is used
~=x.y.z or ==x.y
version < 1 are pinned to patch version, ==0.y.z
Outside contributors
You're great! Thanks for your contribution.