This doc provides a quick start for developers. Prerequisites:
- Basic knowledge of reinforcement learning. We expect you to have run at least some local PPO training.
- Basic knowledge of the gym API.
- Basic knowledge of Docker/containers.
By the end, you will
- know how to wrap your environment, policy and trainer.
- run a minimal test.
- run a local mini experiment.
We will use a minimal test as a testing template.
Unlike the gym API, reset and step both return a list of StepResults.
- Write your environment code in environment, like legacy/environment/atari/atari_env.py;
- Implement required methods: reset, step, and agent_count;
- Implement optional methods: observation_space, action_space, seed, set_curriculum_stage;
- Register your experiment via the register method from api/environment.py;
- Build an image where your environment code runs. Doc: build_your_image

test_environment() will test your environments. If you do not have an action space with a sampling method implemented, pass test_stepping=False.
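
The environment contract above can be illustrated with a toy example. This is a minimal sketch, not the system's actual base class: the `StepResult` dataclass and its fields (`obs`, `reward`, `done`, `info`), the choice to expose `agent_count` as a property, and the `TwoAgentCounterEnv` class are all assumptions made for illustration. The only point it demonstrates is that `reset` and `step` both return a list of results rather than a single gym-style tuple.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StepResult:
    """Hypothetical stand-in for the system's StepResult; field names are assumed."""
    obs: Dict[str, Any]
    reward: float
    done: bool
    info: Dict[str, Any] = field(default_factory=dict)


class TwoAgentCounterEnv:
    """Toy environment: each agent adds 0 or 1 to a shared counter."""

    def __init__(self, horizon: int = 5) -> None:
        self.horizon = horizon
        self._t = 0
        self._counter = 0

    @property
    def agent_count(self) -> int:
        # Assumed to be a property here; the real API may define it differently.
        return 2

    def reset(self) -> List[StepResult]:
        """Return one StepResult per agent, unlike gym's single observation."""
        self._t = 0
        self._counter = 0
        return [self._result(0.0, False) for _ in range(self.agent_count)]

    def step(self, actions: List[int]) -> List[StepResult]:
        """Apply one action per agent and return one StepResult per agent."""
        assert len(actions) == self.agent_count
        self._counter += sum(actions)
        self._t += 1
        done = self._t >= self.horizon
        return [self._result(float(a), done) for a in actions]

    def _result(self, reward: float, done: bool) -> StepResult:
        return StepResult(obs={"obs": [self._counter, self._t]}, reward=reward, done=done)
```

A real environment would additionally be registered and then checked with test_environment(); neither call is shown here because their exact signatures are not given in this document.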
- If you are using PPO or PPG, you can use the actor-critic policies.
Suppose you have observation space (187,), state_space (623,) and 16 discrete actions. Then

    Policy(type_="actor-critic-separate",
           args=dict(obs_dim={"obs": 187},
                     state_dim={"state": 623},
                     action_dim=16,
                     **other_kwargs))

will suffice in most cases.
- For more info and variants such as shared backbone, auxiliary value head, and continuous actions, check out algorithm/actor_critic_policies/readme.md.
- Actor-critic policies are known to have a shared pre-processing layer on common inputs, which defeats the purpose of separate backbones. This is a bug and is to be fixed.
- Implement your policy in a subdirectory of algorithm;
- Consider subclassing SingleModelPytorchPolicy from algorithm/policy. It may save you the effort of implementing device assignment, DDP initialization, checkpoint saving and version control.
- Implement methods default_policy_state, rollout and analyze.
Some policies have a memory unit, such as an RNN. In the system, we call this memory policy_state.
When an agent spawns in an environment, it has no memory, so default_policy_state serves
as the initial value of the policy memory.
In the system, policy_state is implemented as a NamedArray. E.g. for a recurrent policy with a GRU unit, it could be

    @namedarray
    class MyPolicyState:
        hx: np.ndarray

    default_policy_state = MyPolicyState(
        hx=np.zeros((num_rnn_layers, 1, rnn_hidden_dim), dtype=np.float32)
    )  # The second dimension is for batching. Here it is set to 1 as a system convention.

If your policy does not have a memory unit, default_policy_state can be None.
The rollout method takes in a RolloutRequest and returns a RolloutResult; the structure of both can be found in policy.py. Both RolloutRequest and RolloutResult are NamedArrays.

    # algorithm/policy.py
    class Policy:
        ...

        def rollout(self, requests: RolloutRequest, **kwargs) -> RolloutResult:
            """ Generate actions (and rnn hidden states) during evaluation.

            Args:
                requests: All request received from actor generated by env.step.
            Returns:
                RolloutResult: Rollout results to be distributed (namedarray).
            """
            raise NotImplementedError()

        ...

The argument requests will provide sufficient information for you to run model inference, including:
- obs, as your environment implements.
- policy_state, as you just specified in this policy.
- is_evaluation, specifies whether the action is sampled deterministically or stochastically.
Please be aware that these values are batched along the first dimension. They come from different agents in different environments.
A rollout method normally contains the following steps:
- move obs and policy_state to GPU tensors;
- run the model forward pass;
- sample actions from the output of the forward pass;
- if necessary, rearrange the deterministic and stochastic actions;
- assemble RolloutResult and return.
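
The steps above can be sketched without any deep learning framework. This is a simplified illustration under stated assumptions: `RolloutRequest` and `RolloutResult` here are plain dataclasses with guessed fields, not the NamedArrays defined in policy.py; `is_evaluation` is assumed to carry one flag per batch entry; the "model" is a single linear layer over Python lists; and the move to GPU tensors is omitted. A real policy would use PyTorch tensors and batched operations instead of a Python loop.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RolloutRequest:  # hypothetical, simplified stand-in
    obs: List[List[float]]        # batched along the first dimension
    policy_state: Optional[list]  # None for memoryless policies
    is_evaluation: List[bool]     # assumed: one flag per batch entry


@dataclass
class RolloutResult:  # hypothetical, simplified stand-in
    action: List[int]
    log_prob: List[float]
    policy_state: Optional[list]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyPolicy:
    def __init__(self, weights: List[List[float]], seed: int = 0) -> None:
        self.weights = weights  # shape: [num_actions][obs_dim]
        self.rng = random.Random(seed)

    def rollout(self, requests: RolloutRequest) -> RolloutResult:
        actions, log_probs = [], []
        for obs, is_eval in zip(requests.obs, requests.is_evaluation):
            # Forward pass: one linear layer followed by softmax.
            logits = [sum(w * o for w, o in zip(row, obs)) for row in self.weights]
            probs = softmax(logits)
            # Deterministic (greedy) action in evaluation, sampled otherwise.
            if is_eval:
                action = max(range(len(probs)), key=probs.__getitem__)
            else:
                action = self.rng.choices(range(len(probs)), weights=probs)[0]
            actions.append(action)
            log_probs.append(math.log(probs[action]))
        # Memoryless policy: the state is passed through unchanged.
        return RolloutResult(action=actions, log_prob=log_probs,
                             policy_state=requests.policy_state)
```

Handling the deterministic and stochastic entries in one loop sidesteps the "rearrange" step; a batched implementation would typically compute both and select per entry with a mask.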
In most cases, we have to specify action and policy_state in RolloutResult. This policy_state is the memory state after this model forward pass. The agents, on the other end of the system, will pass their policy_state back to the policy the next time they request an action.
Algorithm-specific values can also be added. E.g. for PPO, we should add log_prob and maybe also values.
NOTE: The data structures of RolloutRequest and RolloutResult are subject to change in the next few versions. Algorithm-specific values will be put into a sub-NamedArray.
The analyze method is closely related to your training algorithm. The input value is a sample batch,
and the return value is determined by the training algorithm. E.g. if your training algorithm is PPO,
the analyzed result should at least contain new_logprob, old_logprob and entropy. This is formalized
by trainers specifying a dataclass named AnalyzedResult.
Some rules of thumb:
- If you are only customizing a policy, check out the AnalyzedResult specified by the trainer, and make sure the return value of your analyze method matches the trainer's requirements.
- If you are customizing both trainer and policy, implement your trainer first. Pretend that the policy will return just the values you need, formalize an AnalyzedResult, and then implement the analyze method of your policy.
- If your customized trainer works with an existing policy, implement your trainer first, and then try tweaking the analyze method of the policy. Note that there is a target argument that allows your trainer to tell the policy which analyze method to execute.
Unlike RolloutResult, AnalyzedResult is a plain Python dataclass. The data fields can be of any datatype, and their dimensions don't have to match, as long as your trainer can properly update the model parameters.
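
To make the contract between analyze and the trainer concrete, here is a minimal sketch of a PPO-style AnalyzedResult and the loss a trainer could compute from it. The field names follow the PPO example above; everything else (the `ppo_loss` helper, the list-of-floats representation, the `clip_eps` and `entropy_coef` defaults) is an assumption for illustration, and the loss is the standard clipped surrogate objective rather than this system's actual trainer code.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class AnalyzedResult:
    """What a PPO-style trainer asks the policy's analyze method to return."""
    new_logprob: List[float]
    old_logprob: List[float]
    entropy: List[float]


def ppo_loss(analyzed: AnalyzedResult, advantages: List[float],
             clip_eps: float = 0.2, entropy_coef: float = 0.01) -> float:
    """Clipped surrogate loss with an entropy bonus (hypothetical helper)."""
    n = len(advantages)
    surrogate = 0.0
    for new_lp, old_lp, adv in zip(analyzed.new_logprob, analyzed.old_logprob, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the unclipped and clipped objectives.
        surrogate += min(ratio * adv, clipped * adv)
    mean_entropy = sum(analyzed.entropy) / n
    # Negate because optimizers minimize, while PPO maximizes the surrogate.
    return -(surrogate / n) - entropy_coef * mean_entropy
```

The policy only reports log-probabilities and entropy; the clipping threshold and the entropy weight stay on the trainer side, which is the division of labor described above.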
In some cases, e.g. if your model contains more than a simple pytorch neural network,
implementing get_checkpoint and load_checkpoint is necessary. Just note that the return value of get_checkpoint
has to be pytorch-savable.
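
As a rough illustration of when custom checkpointing matters, the sketch below shows a policy that carries running observation statistics in addition to its network weights. The signatures of get_checkpoint and load_checkpoint are assumptions (the real ones live in the policy base class), `NormalizingPolicy` and `round_trip` are hypothetical names, and the weights are a plain dict standing in for a network state dict. The round trip uses `pickle` as a stand-in for saving to disk, on the assumption that "pytorch-savable" means the returned object can be serialized by `torch.save`, which pickles Python objects by default.

```python
import pickle
from typing import Any, Dict


class NormalizingPolicy:
    """Toy policy whose state is more than network weights (hypothetical)."""

    def __init__(self) -> None:
        # Stand-in for a network's state dict; a real policy would hold tensors.
        self.weights = {"linear.weight": [0.1, -0.2], "linear.bias": [0.0]}
        # Extra state that saving only the network would lose.
        self.obs_count = 0
        self.obs_mean = 0.0
        self.version = 0

    def observe(self, x: float) -> None:
        """Update the running mean used for observation normalization."""
        self.obs_count += 1
        self.obs_mean += (x - self.obs_mean) / self.obs_count

    def get_checkpoint(self) -> Dict[str, Any]:
        # Only plain containers and numbers, so the result is easy to serialize.
        return {
            "weights": self.weights,
            "obs_stats": {"count": self.obs_count, "mean": self.obs_mean},
            "version": self.version,
        }

    def load_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
        self.weights = checkpoint["weights"]
        self.obs_count = checkpoint["obs_stats"]["count"]
        self.obs_mean = checkpoint["obs_stats"]["mean"]
        self.version = checkpoint["version"]


def round_trip(policy: NormalizingPolicy) -> NormalizingPolicy:
    """Serialize a checkpoint and restore it into a fresh policy."""
    blob = pickle.dumps(policy.get_checkpoint())
    restored = NormalizingPolicy()
    restored.load_checkpoint(pickle.loads(blob))
    return restored
```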
To be documented: reanalyze.
When a trainer is initialized, a policy must be specified. If confused, think of trainers as optimizers and policies as neural networks. If still confused, read Trainer and Policy.
Let the code speak for itself.

    # algorithm/trainer.py
    class Trainer:

        @property
        def policy(self) -> api.policy.Policy:
            """Running policy of the trainer."""
            raise NotImplementedError()

        def step(self, samples: SampleBatch) -> TrainerStepResult:
            """Advances one training step given samples collected by actor workers.

            Example code:
                ...
                some_data = self.policy.analyze(samples)
                loss = loss_fn(some_data, samples)
                self.optimizer.zero_grad()
                loss.backward()
                ...
                self.optimizer.step()
                ...

            Args:
                samples (SampleBatch): A batch of data required for training.
            Returns:
                TrainerStepResult: Entry to be logged by trainer worker.
            """
            raise NotImplementedError()

test_pipeline will start a local process and run your environment, policy, and trainer for several episodes.
If the test passes, you can assume that your code can be deployed to the distributed system.
It is natural to think of the "neural network" and the "optimizer" as a whole, especially in local projects. We compute actions with some neural network (NN) and optimize directly on the same NN. But in a distributed system, trainer and policy must be decoupled: while the trainer is updating the parameters, the policy cannot stop functioning.
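
One way to picture this decoupling is a versioned parameter store that sits between the two sides. The sketch below is purely illustrative: `ParameterStore`, `InferencePolicy`, `ToyTrainer` and `maybe_refresh` are hypothetical names, the "model" is a single scalar weight, and the real system's parameter transport and versioning may work quite differently. It only shows that the inference side keeps serving with its own copy of the parameters until it chooses to pull a newer version.

```python
import copy
import threading
from typing import Dict, Tuple


class ParameterStore:
    """Hypothetical in-process stand-in for whatever transports parameters."""

    def __init__(self, params: Dict[str, float]) -> None:
        self._lock = threading.Lock()
        self._params = copy.deepcopy(params)
        self._version = 0

    def push(self, params: Dict[str, float]) -> int:
        """Publish new parameters and return their version number."""
        with self._lock:
            self._params = copy.deepcopy(params)
            self._version += 1
            return self._version

    def pull(self) -> Tuple[Dict[str, float], int]:
        """Return a private copy of the latest parameters and their version."""
        with self._lock:
            return copy.deepcopy(self._params), self._version


class InferencePolicy:
    """Serves rollouts from its own parameter copy."""

    def __init__(self, store: ParameterStore) -> None:
        self.store = store
        self.params, self.version = store.pull()

    def rollout(self, obs: float) -> float:
        return self.params["w"] * obs

    def maybe_refresh(self) -> None:
        # Swap in newer parameters only between rollouts, never mid-inference.
        params, version = self.store.pull()
        if version > self.version:
            self.params, self.version = params, version


class ToyTrainer:
    """Updates its own parameter copy and publishes the result."""

    def __init__(self, store: ParameterStore, lr: float = 0.1) -> None:
        self.store = store
        self.params, _ = store.pull()
        self.lr = lr

    def step(self, grad: float) -> int:
        self.params["w"] -= self.lr * grad
        return self.store.push(self.params)
```

After a trainer step, the inference policy still answers with the old weight until maybe_refresh is called, so agents never wait on an optimizer update.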
To summarize the relation between trainer and policy in our system:
- Agents make decisions through policies.
- Trainers optimize policies.
Then the methods rollout and analyze come in naturally for policies.
As a metaphor, policies are like humans making decisions (rollout)
and, at the end of the day, reflecting on what they did right or wrong (analyze).
Note that for analyze, we expect policies to tell us what was good or bad,
but it is the trainers who decide what to do the next day.
Trainers, seeing the rights and wrongs, behave more like a methodology, or philosophy:
"Do more rights and fewer wrongs".