Following the OG book - suttonBartoIPRLBook2ndEd.pdf. I really like this sheet for all the policy gradient algos - https://lilianweng.github.io/posts/2018-04-08-policy-gradient/#ppo and this for PPO - https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details
Following are my Obsidian notes while studying RL from Sutton and Barto.
| Reinforcement Learning | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Gets to decide implicitly the next state of the environment. | Doesn't get to decide at all which data points to sample. | Doesn't get to decide at all which data points to sample. |
| Learns to predict what the next state would be, in conjunction with the action. | Only learns the label and, as a consequence, the loss. | Predicts structure in un-labelled data. |
| Does not rely on examples of correct behavior. | Relies on labelled examples. | Does not rely on examples of correct behavior. |
Policy: a mapping from states to probabilities of selecting each possible action, $\pi(a|s)$.
An RL task that satisfies the Markov Property is called a Markov Decision Process.
Transition Probability Function: $p(s',r|s,a)=\Pr\{S_t=s', R_t=r \mid S_{t-1}=s, A_{t-1}=a\}$.
Value of a state: $v_\pi(s)=\mathbb{E}_\pi[G_t \mid S_t=s]$, the expected return starting from s and following $\pi$ thereafter.
Bellman equation: expresses the relation between the value of a state and the value of its successor states. Just draw the backup diagram of $v_\pi$: $v_\pi(s)=\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_\pi(s')]$.
$v_*(s)=\max_\pi v_\pi(s)$ : the maximum state value of state s under any policy $\pi$. Similarly, $q_*(s,a)=\max_\pi q_\pi(s,a)$.
$v_*(s)=\max_{a\in A(s)}\sum_{s',r}p(s',r|s,a)[r+\gamma v_*(s')]$
These methods use other state-values/action-values to update the estimates.
This calculates the state-value of a given policy.
Iterative Policy Evaluation: algorithm to compute the state-values of an arbitrary policy for all states, sweeping until convergence.
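A minimal sketch of iterative policy evaluation; the two-state MDP and the `P[s][a] = [(prob, next_state, reward), ...]` transition-table format are my assumptions for illustration, not from the book:

```python
# Assumed toy MDP: 2 states, 2 actions. Action 0 stays in state 0 (reward 0),
# action 1 moves to state 1 (reward 1), from either state.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}  # equiprobable policy

def policy_evaluation(P, policy, gamma, theta=1e-8):
    """Sweep all states, backing up with the Bellman equation:
    v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                pi_a * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a, pi_a in policy[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

V = policy_evaluation(P, policy, gamma=0.9)
```

For this toy MDP the fixed point is $v = 0.9v + 0.5$, i.e. both states converge to 5.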
Q. How to improve a policy, once it is evaluated?
A. At state s, choose another action greedily: $\pi'(s)=\arg\max_a q_\pi(s,a)$.
Q. When to choose?
A. Whenever $q_\pi(s,\pi'(s))\geq v_\pi(s)$: the policy improvement theorem then guarantees $\pi'$ is at least as good as $\pi$.
In 4.1 we used the Bellman equation for the state-value to update the state value. But what if we used the Bellman optimality equation to update it instead?
We would converge to $v_*$ directly: this is Value Iteration.
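A sketch of Value Iteration on an assumed toy 2-state MDP, using an assumed `P[s][a] = [(prob, next_state, reward), ...]` transition table:

```python
# Assumed toy MDP: action 0 stays (reward 0), action 1 moves to state 1 (reward 1).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}

def value_iteration(P, gamma, theta=1e-8):
    """Back up with the Bellman *optimality* equation:
    v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')],
    then extract the greedy policy from the converged values."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    pi = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in P
    }
    return V, pi

V, pi = value_iteration(P, gamma=0.9)
```

Here the optimal policy always takes action 1, so $v_* = 1 + 0.9 v_*$ gives $v_* = 10$ in both states.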
Q. Must we wait for exact convergence of the E (evaluation) step before improving the policy?
A. No. Do 4.3, but do E for only a single state. This is asynchronous DP; it doesn't need a full sweep in the E step.
==Bottleneck Alert:== These are DP methods and all require a complete model of the environment (the transition probabilities) plus sweeps over the full state space.
These methods wait till the end of the episode and update the state-value/action-value with the actual return.
If we don't have access to transition probabilities, we cannot compute expected updates weighted by $p(s',r|s,a)$.
Need to sample instead: average the actual returns observed after visiting a state.
Therefore we can keep on refining the state-value estimate as more returns are averaged in.
To update the policy, we must estimate action-values: with no model, state-values alone can't rank actions.
Estimating action-values from sample returns is called MC evaluation.
Alternate MC evaluation with a greedy policy-improvement step + guarantee some initial exploration by starting episodes in random state-action pairs.
MC ES: Do {MC Evaluation + PI episode by episode} + Exploring Starts
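First-visit MC evaluation as a sketch; the `[(state, reward), ...]` episode format (reward received on leaving each state) is an assumption for illustration:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit MC prediction: V(s) = average of the returns that follow
    the *first* visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:
        # compute G_t = R_{t+1} + gamma * G_{t+1} backwards in one pass
        G = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G[t] = r + gamma * G[t + 1]
        # record the return from the first visit to each state only
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        for s, t in first_visit.items():
            returns[s].append(G[t])
    return {s: sum(g) / len(g) for g_s, g in ((s, g) for s, g in returns.items()) for s in [g_s]}

# Assumed toy data: two short episodes.
episodes = [[("A", 1.0), ("B", 2.0)], [("A", 0.0)]]
V = first_visit_mc(episodes)
```

Here "A" sees returns 3 and 0, averaging to 1.5, while "B" sees a single return of 2.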
- On-policy Methods: enforce soft policies, e.g. $\epsilon$-greedy policies, i.e. $\pi(a|s)\geq \frac{\epsilon}{|A(s)|}$, instead of a greedy policy.
- Off-Policy Methods: $\pi$ : Target Policy, $\mu$ : Behavior Policy. Assumption of Coverage: $\pi(a|s)>0\implies\mu(a|s)>0$.
Here, x is an episode.
p(x) is Pr{episode generated under $\pi$}; q(x) is Pr{episode generated under $\mu$}.
==Trick:== in the ratio $p(x)/q(x)$ the environment's transition probabilities cancel, leaving only the product of policy probabilities: $\rho=\prod_t\frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}$.
Weighting returns by $\rho$ (importance sampling) is an algorithm for Policy Evaluation.
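The cancellation can be seen in code: only the two policies appear, never the transition probabilities. The dict-of-dicts policy format is an assumption:

```python
def importance_ratio(episode, target_pi, behavior_mu):
    """rho = prod_t pi(A_t|S_t) / mu(A_t|S_t).
    In Pr{episode | pi} / Pr{episode | mu} the environment's transition
    probabilities appear in both numerator and denominator, so they cancel."""
    rho = 1.0
    for s, a in episode:
        rho *= target_pi[s][a] / behavior_mu[s][a]
    return rho

# Assumed toy policies over one state with two actions.
pi = {"s": {"left": 0.9, "right": 0.1}}
mu = {"s": {"left": 0.5, "right": 0.5}}
rho = importance_ratio([("s", "left"), ("s", "left")], pi, mu)  # (0.9/0.5)^2 = 3.24
```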
| | MC Method [constant-$\alpha$ MC] | TD Method [TD(0)] |
|---|---|---|
| Update Rule | $V(S_t)\leftarrow V(S_t)+\alpha[G_t-V(S_t)]$ | $V(S_t)\leftarrow V(S_t)+\alpha[R_{t+1}+\gamma V(S_{t+1})-V(S_t)]$ |
| Update Freq. | Must wait till end of episode. | Updates on the next step. |
| Bootstrapping | No | Yes |
| Nature of Update | Each state/action-value estimate shifts towards the actual return. | Each state/action-value estimate shifts towards the estimate that immediately follows it. |
| On-line vs Off-line | Off-line. | On-line: advantageous when applications have really long episodes. |
| Environment Transition Prob? | Model-free | Model-free |
| In batch-training | Minimizes the MSE on the training data. | Maximizes the likelihood of the training data: TD(0) converges to the certainty-equivalence estimate. |
| Bias-Variance Tradeoff | High variance, but unbiased. | Introduces a small bias, but has smaller variance. Bias$^2$ + variance is lower, hence MSE is overall lower. |
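The TD(0) column's update rule as a one-step function; the toy state names and step sizes are assumed:

```python
def td0_step(V, s, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One TD(0) backup: V(S_t) += alpha * [R_{t+1} + gamma V(S_{t+1}) - V(S_t)].
    On a terminal transition the bootstrap term is dropped: target is just R."""
    target = r if done else r + gamma * V[s_next]
    V[s] = V[s] + alpha * (target - V[s])
    return V

V = {"A": 0.0, "B": 0.0}
V = td0_step(V, "A", 1.0, "B", alpha=0.5, gamma=1.0)  # V(A) += 0.5 * (1 + 0 - 0)
```

Unlike the MC column, this updates "A" immediately on the next step instead of waiting for the episode to end.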
- MC's batch estimate: $\hat{V}(s)=\frac{1}{k}\sum_{i=1}^{k} G_i(s)$, the average of the $k$ returns observed from $s$.
- TD(0)'s batch estimate: $\hat{V}(s)=\hat{R}(s)+\gamma\sum_{s'}\hat{P}(s'|s)\hat{V}(s')$, where $\hat{P}(s'|s)=\frac{N(s\rightarrow s')}{N(s)}$ and $\hat{R}(s)$ is just the average of all rewards observed after state s in the batch.

TD(0) is implicitly using the assumptions of the data-generating process: it behaves as if it first fit a maximum-likelihood model of the transition probabilities and rewards, then computed values exactly for that model.
MRP = MDP + Fixed Policy. MRPs are useful for long-term evaluation of the policy.
Constant-$\alpha$ MC is identical to TD(1).
SARSA is an on-policy TD control GPI algorithm that iteratively updates Q-values and the policy.
Q-learning is an off-policy TD control GPI algorithm.
The effect is of directly estimating $q_*$, independent of the policy being followed.
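The two update rules side by side; the flat `Q[(s, a)]` dict is an assumed representation:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    """On-policy: bootstrap from the action a2 the policy actually takes next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    """Off-policy: bootstrap from the greedy action, regardless of behavior."""
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])

Q = defaultdict(float)
Q[("s2", 0)], Q[("s2", 1)] = 1.0, 3.0
q_learning_update(Q, "s1", 0, 0.0, "s2", [0, 1], alpha=1.0, gamma=1.0)
# with alpha=1, gamma=1: Q(s1,0) = 0 + max(1, 3) = 3
```

The only difference is the bootstrap target: SARSA uses the sampled next action, Q-learning the max, which is what makes it off-policy.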
- Planning Methods: require a model of the environment: Heuristic Search, DP.
- Learning Methods: model-free: MC and TD.
- Environment Model: Distribution | Sample.
- Simulated Experience: when models are used to generate a single episode (using sample models) or generate all possible episodes (using distribution models).
- Planning $\rightarrow$ [1] State-space Planning, [2] Plan-space Planning.
- State-space Planning: $$ model \rightarrow simulated \space experience \overset{backups}{\rightarrow} values \rightarrow \pi $$ The heart of both learning and planning methods is estimation of value functions using backup operations.
We can actually take 6.5 Q-learning and have it generate episodes from a sample model of the environment, in which case it's called Q-planning.
Planning can be done on-line. i.e. model can be learnt while interacting with the environment and it can be used to improve the policy. Hence there are two roles for real experience: [1] model-learning: improve the model [2] direct RL: directly improve the value-function (using MC, TD(0), SARSA, Q-learning)
![[Pasted image 20250514101838.png]] Basically this sums up the Dyna-Q algorithm.
Vanilla Dyna-Q algo does Q-planning by sampling from the Model(S,A) at random.
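That planning phase can be sketched as follows; the deterministic `model[(s, a)] = (reward, next_state)` table and the tabular Q dict are assumptions matching tabular Dyna-Q:

```python
import random
from collections import defaultdict

def dyna_q_planning(Q, model, actions, n, alpha=0.1, gamma=0.95):
    """Dyna-Q's planning loop: replay n randomly chosen previously observed
    (S, A) pairs from the learned model and apply a Q-learning backup to each."""
    for _ in range(n):
        s, a = random.choice(list(model))
        r, s2 = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
    return Q

Q = defaultdict(float)
Q = dyna_q_planning(Q, {("A", 0): (1.0, "B")}, actions=[0], n=1, alpha=1.0, gamma=0.0)
```

In the full algorithm this loop runs n times after every real environment step, squeezing extra backups out of each real transition.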
Prioritized sweeping adds to a priority queue those (S,A) pairs whose update (the magnitude of their TD error) exceeds a threshold, then performs backups in priority order, propagating value changes backwards to predecessor pairs.
All state-space planning methods in AI are collectively called heuristic search. Actually we go over a tutorial: Searching for Solutions.
==Aims to give a method on how to solve for continuous state space or a huge state space==
Using function approximation: we parametrize the value function as $\hat{v}(s, \mathbf{w})\approx v_\pi(s)$ with a weight vector $\mathbf{w}$ that has far fewer components than there are states.
a. Initialize $w$
b. Repeat (for each episode):
c. $\quad e = 0$; initialize $S$
d. $\quad$ For each step: take $A\sim\pi(\cdot|S)$, observe $R, S'$; $\delta = R + \gamma\hat{v}(S',w) - \hat{v}(S,w)$; $e = \gamma\lambda e + \nabla_w\hat{v}(S,w)$; $w \leftarrow w + \alpha\delta e$; $S \leftarrow S'$ [semi-gradient TD($\lambda$) with eligibility traces]
This is an algo for policy evaluation. This can be extended to control as well. These methods are all called action-value methods.
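A sketch of the simplest case, semi-gradient TD(0) with linear features; the two-state one-hot example (which reduces to the tabular case) is an assumption:

```python
import numpy as np

def semi_gradient_td0(w, x_s, r, x_s2, alpha, gamma, done=False):
    """Semi-gradient TD(0) with linear v_hat(s, w) = w . x(s):
    w += alpha * [R + gamma v_hat(S') - v_hat(S)] * grad_w v_hat(S),
    and for a linear approximator grad_w v_hat(S) is just x(S).
    The gradient through the bootstrap target is ignored -- hence 'semi'."""
    v_s = w @ x_s
    v_s2 = 0.0 if done else w @ x_s2
    delta = r + gamma * v_s2 - v_s
    return w + alpha * delta * x_s

w = np.zeros(2)
x_a, x_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # one-hot features
w = semi_gradient_td0(w, x_a, 1.0, x_b, alpha=0.5, gamma=1.0)
```

With one-hot features each weight is exactly one state's value, so this single step moves $w_A$ halfway toward the TD target of 1.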
Here we parametrize the policy. And we shall update the policy parameters using a performance measure $J(\theta)$:
$$
\theta_{t+1} := \theta_{t} + \alpha\, \widehat{\nabla_{\theta} J(\theta_{t})}
$$
where $\widehat{\nabla_\theta J(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure.
In the episodic case, the performance measure of a policy is the value of the start state: $J(\theta) = v_{\pi_\theta}(s_0)$.
$$
\nabla_\theta J(\theta) = \nabla_\theta v_{\pi_\theta} (s_0) \propto \underset{s \in S}{\sum} \mu(s) \underset{a \in A}{\sum} q_\pi(s,a) \nabla_\theta \pi(a|s; \theta)
$$
Now if we sample an episode on-policy, we are only concerned with states visited under $\pi$: the weighting $\mu(s)$ is exactly the on-policy state distribution.
The expectation on the RHS becomes: $$ \underset{\pi}{\mathbb{E}} \left[\underset{a \in A}{\sum} q_\pi(s,a) \nabla_\theta \pi(a|s; \theta)\right] $$
We have removed the second summation owing to the fact that $\sum_a \pi(a|s)\frac{\nabla_\theta\pi(a|s;\theta)}{\pi(a|s;\theta)}q_\pi(s,a)$ is the expectation of $q_\pi(s,A)\nabla_\theta\ln\pi(A|s;\theta)$ under $A\sim\pi$, so sampling the taken action suffices. Replacing $q_\pi(S_t,A_t)$ by the sampled return $G_t$ gives REINFORCE: $\theta_{t+1} = \theta_t + \alpha G_t \nabla_\theta\ln\pi(A_t|S_t;\theta)$.
This has high variance and slow convergence. Fix? Baseline.
See that:
$$
\underset{a}{\sum} b(s)\,\nabla_\theta\pi(a|s;\theta) = b(s)\,\nabla_\theta\underset{a}{\sum}\pi(a|s;\theta) = b(s)\,\nabla_\theta 1 = 0
$$
so long as the baseline $b$ depends only on the state, not the action. Subtracting it therefore adds no bias, and the update becomes:
$$
\theta_{t+1} := \theta_{t} + \alpha \left[(G_t-b(S_t))\,\nabla_\theta \ln\pi(A_t|S_t; \theta)\right]
$$
Using $b(S_t)=v(S_t,w)$:
- $\mathbb{E}[G_t]=\mathbb{E}[v(S_t, w)]$ i.e. unbiased.
- $var(G_t-v(S_t, w))\leq var(G_t)$ i.e. variance reduction. We are pushing $\theta$ in proportion to what is expected at that state and not just due to a large $G_t$.
But how do we get the value of $v(S_t, w)$? Learn it too: knowing the value function can assist the policy update, such as by reducing gradient variance in vanilla policy gradients.
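A minimal sketch of REINFORCE with a learned baseline for a tabular softmax policy; the dict-based parameter storage, step sizes, and two-action setup are my assumptions for illustration:

```python
import numpy as np

def reinforce_baseline_step(theta, w, states, actions, returns,
                            alpha_theta=0.01, alpha_w=0.1):
    """One episode's worth of REINFORCE-with-baseline updates.
    theta[s] holds softmax action preferences; w[s] is the baseline v(s, w)."""
    for s, a, G in zip(states, actions, returns):
        prefs = theta[s]
        pi = np.exp(prefs - prefs.max())
        pi /= pi.sum()
        delta = G - w[s]               # (G_t - b(S_t))
        w[s] += alpha_w * delta        # move the baseline toward the return
        grad_log = -pi                 # grad of log-softmax wrt preferences:
        grad_log[a] += 1.0             # 1[a' = a] - pi(a'|s)
        theta[s] += alpha_theta * delta * grad_log
    return theta, w

theta = {"s": np.zeros(2)}
w = {"s": 0.0}
theta, w = reinforce_baseline_step(theta, w, ["s"], [0], [1.0])
```

After one step with a positive $(G_t - b)$, the preference for the taken action rises relative to the other, and the baseline drifts toward the observed return.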
If we bootstrap the state-value function in REINFORCE w/ baseline, we have an actor-critic method.
The TD-error $\delta_t = R_{t+1}+\gamma\hat{v}(S_{t+1},w)-\hat{v}(S_t,w)$ replaces $(G_t - b(S_t))$ in the update.
So intuitively, the critic judges each action as better or worse than expected, and the actor shifts probability toward actions the critic scores above expectation.
==The gradients give me information on how sensitive my action / state-value is to each policy / state function parameter. ==
Policy Gradient Algos: https://lilianweng.github.io/posts/2018-04-08-policy-gradient/
Advantage: how good is taking an action compared to the average at that state: $A(s,a)=Q(s,a)-V(s)$.
- $\hat{A}(s_t,a_t) = G_t-V(s_t)$ $-$ Monte-Carlo estimate
- $\hat{A}(s_t, a_t)=\delta_t=r_t+\gamma V(s_{t+1}) - V(s_t)$ $-$ TD error
- $\hat{A}_t^{GAE(\lambda, \gamma)}=\sum_{l=0}^{\infty}(\gamma \lambda)^l \delta_{t+l}$ $-$ Exponentially-weighted average of $k$-step advantage estimates: Generalized Advantage Estimation (GAE)
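The GAE sum can be computed backwards in one pass over a rollout via the recursion $\hat{A}_t = \delta_t + \gamma\lambda\hat{A}_{t+1}$; this sketch assumes a finished (terminal) rollout unless `last_value` is supplied to bootstrap:

```python
def gae(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation: A_t = sum_l (gamma*lam)^l delta_{t+l},
    with delta_t = r_t + gamma V(s_{t+1}) - V(s_t), computed backwards."""
    advantages = [0.0] * len(rewards)
    next_value, next_adv = last_value, 0.0
    for t in range(len(rewards) - 1, -1, -1):
        delta = rewards[t] + gamma * next_value - values[t]
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

adv = gae([1.0, 1.0], [0.0, 0.0], gamma=1.0, lam=1.0)
```

As a sanity check, with $\gamma=\lambda=1$ and zero values, GAE reduces to the Monte-Carlo advantage $G_t - V(s_t)$: here the returns-to-go [2, 1].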
I'd implemented an actor-critic method using a Q-value net and a policy net. To convert this to PPO, the following changes were made:
- Q-value net $\rightarrow$ V-value net.
- Instead of updating $\theta$ and $w$ after each step, we save all states, rewards, actions, v-values, and action log-probs in a rollout buffer.
- Need to compute advantages and returns from the buffer.
Note: in TRPO the importance sampling ratio in the surrogate objective, $r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$, is kept near 1 by a KL-divergence constraint; PPO gets the same effect more simply by clipping $r_t$ to $[1-\epsilon, 1+\epsilon]$.
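A sketch of PPO's per-sample clipped surrogate; the function name and scalar form are my own (real implementations vectorize this over the whole rollout buffer and maximize the mean):

```python
import math

def ppo_clip_objective(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(r_t * A, clip(r_t, 1-eps, 1+eps) * A),
    where r_t = pi_theta(a|s) / pi_theta_old(a|s). The clip removes any
    incentive to push r_t outside [1-eps, 1+eps]."""
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

Taking the min makes the bound pessimistic on both sides: a positive advantage can't be chased past $1+\epsilon$, and a negative one can't be escaped past $1-\epsilon$.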