Quick Start Developer's guide

This doc provides a quick start for developers. Prerequisites:

Basic knowledge of reinforcement learning. We expect you to have run at least some local PPO training.
Basic knowledge of the gym API.
Basic knowledge of Docker and containers.

By the end, you will

know how to wrap your environment, policy and trainer.
run a minimal test.
run a local mini experiment.

We will use a minimal test as a testing template.

Environment

Unlike the gym API, reset and step both return a list of StepResults.

Steps

Write your environment code in environment, like legacy/environment/atari/atari_env.py;
Implement the required methods: reset, step, and agent_count;
Implement the optional methods: observation_space, action_space, seed, set_curriculum_stage;
Register your environment via the register method from api/environment.py;
Build an image in which your environment code runs. Doc: build_your_image
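The steps above can be sketched with a toy environment. This is only an illustration under stated assumptions: StepResult here is a hypothetical stand-in dataclass (the real one lives in the codebase and its fields may differ), CoinGuessEnv is an invented example, agent_count is assumed to be a property, and registration is omitted because the signature of register is not shown in this guide.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepResult:
    """Hypothetical stand-in for the system's StepResult (one per agent)."""
    obs: List[float]
    reward: float
    done: bool


class CoinGuessEnv:
    """Toy two-agent environment: each agent guesses a hidden coin flip."""

    def __init__(self, episode_length: int = 5):
        self._episode_length = episode_length
        self._rng = random.Random(0)
        self._t = 0
        self._coin = 0

    @property
    def agent_count(self) -> int:
        # Assumed to be a property; the real API may define it differently.
        return 2

    def seed(self, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)

    def reset(self) -> List[StepResult]:
        self._t = 0
        self._coin = self._rng.randint(0, 1)
        # Unlike gym, return a list with one StepResult per agent.
        return [StepResult(obs=[float(self._t)], reward=0.0, done=False)
                for _ in range(self.agent_count)]

    def step(self, actions: List[int]) -> List[StepResult]:
        assert len(actions) == self.agent_count
        self._t += 1
        done = self._t >= self._episode_length
        results = [StepResult(obs=[float(self._t)],
                              reward=1.0 if action == self._coin else 0.0,
                              done=done)
                   for action in actions]
        self._coin = self._rng.randint(0, 1)  # flip a new coin for the next step
        return results
```

Calling reset() and then step([0, 1]) returns two StepResults each time, one per agent, which is the main difference from a single-agent gym loop.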

Test

test_environment() will test your environments. If your action space does not implement a sampling method, pass test_stepping=False.

Policy

Natively supported actor-critic policies.

If you are using PPO or PPG, you can use the actor-critic policies.

Suppose you have observation space (187,), state_space (623,) and 16 discrete actions.

Policy(type_="actor-critic-separate",
       args=dict(obs_dim={"obs": 187},
                 state_dim={"state": 623},
                 action_dim=16,
                 **other_kwargs))

will suffice in most cases.

For more info and variants such as a shared backbone, an auxiliary value head, and continuous actions, check out algorithm/actor_critic_policies/readme.md.
Known issue: actor-critic policies currently apply a shared pre-processing layer to common inputs, which defeats the purpose of the separate-backbone variant. This is a bug and is to be fixed.

Developing your own policy - Steps

Implement your policy in a subdirectory of algorithm;
Consider subclassing SingleModelPytorchPolicy from algorithm/policy. It can save you the effort of implementing device assignment, DDP initialization, checkpoint saving, and version control;
Implement the methods default_policy_state, rollout, and analyze.

`default_policy_state`

Some policies have a memory unit, such as an RNN. In the system, we call this memory policy_state. When an agent spawns in an environment, it has no memory, so default_policy_state serves as the initial value of the policy memory.

In the system, policy_state is implemented as a NamedArray. E.g. for a recurrent policy with a GRU unit, it could be

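import numpy as np

# namedarray is the system's NamedArray decorator; import it from the codebase.
@namedarray
class MyPolicyState:
    hx: np.ndarray

# num_rnn_layers and rnn_hidden_dim come from your model configuration.
default_policy_state = MyPolicyState(
    hx=np.zeros((num_rnn_layers, 1, rnn_hidden_dim), dtype=np.float32)
)  # The second dimension is for batching. Here it is set to 1 as a system convention.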

If your policy does not have a memory unit, default_policy_state can be None.

`rollout`

The rollout method takes a RolloutRequest and returns a RolloutResult. Both are NamedArrays, and both are defined in policy.py.

# algorithm/policy.py
class Policy:
    ...
    def rollout(self, requests: RolloutRequest, **kwargs) -> RolloutResult:
        """ Generate actions (and rnn hidden states) during evaluation.
        Args:
            requests: All requests received from actors, generated by env.step.
        Returns:
            RolloutResult: Rollout results to be distributed (namedarray).
        """
        raise NotImplementedError()
    ...

The argument requests will provide sufficient information for you to run model inference, including:

obs, as your environment implements.
policy_state, as you just specified in this policy.
is_evaluation, specifies whether the action is sampled deterministically or stochastically.

Please be aware that these values are batched along the first dimension. They come from different agents in different environments.

A rollout method normally contains the following steps:

Move obs and policy_state to GPU tensors;
Run the model forward pass;
Sample actions from the output of the forward pass;
If necessary, rearrange the deterministic and stochastic actions;
Assemble the RolloutResult and return it.

In most cases, we have to specify action and policy_state in RolloutResult. This policy_state is the memory state after the model forward pass. The agents, on the other end of the system, will pass their policy_state back to the policy the next time they request an action.

Algorithm-specific values can also be added. E.g. for PPO, we should add log_prob and possibly values.
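The rollout steps described above can be sketched with NumPy instead of a real GPU model. Everything here is an assumption made for illustration: RolloutRequest and RolloutResult are plain dataclass stand-ins rather than the system's NamedArrays, ToyRecurrentPolicy is invented, the state is laid out as (batch, hidden_dim) for simplicity, is_evaluation is taken to be a per-request boolean where True means "act deterministically", and sampling uses the Gumbel-max trick rather than whatever the built-in policies use.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class RolloutRequest:  # hypothetical stand-in for the system's NamedArray
    obs: np.ndarray            # (batch, obs_dim)
    policy_state: np.ndarray   # (batch, hidden_dim)
    is_evaluation: np.ndarray  # (batch,) bool


@dataclass
class RolloutResult:  # hypothetical stand-in for the system's NamedArray
    action: np.ndarray
    log_prob: np.ndarray
    policy_state: np.ndarray


class ToyRecurrentPolicy:
    def __init__(self, obs_dim: int, hidden_dim: int, action_dim: int, seed: int = 0):
        self._rng = np.random.default_rng(seed)
        self.w_in = self._rng.normal(scale=0.1, size=(obs_dim, hidden_dim))
        self.w_rec = self._rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        self.w_out = self._rng.normal(scale=0.1, size=(hidden_dim, action_dim))

    def rollout(self, requests: RolloutRequest) -> RolloutResult:
        # 1. A real policy would move obs and policy_state to GPU tensors here.
        obs = np.asarray(requests.obs, dtype=np.float64)
        hidden = np.asarray(requests.policy_state, dtype=np.float64)

        # 2. Forward pass: one recurrent update, then a log-softmax over actions.
        new_hidden = np.tanh(obs @ self.w_in + hidden @ self.w_rec)
        logits = new_hidden @ self.w_out
        logits = logits - logits.max(axis=-1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

        # 3. Sample actions (Gumbel-max trick) and compute the greedy actions.
        noise = self._rng.gumbel(size=log_probs.shape)
        stochastic = (log_probs + noise).argmax(axis=-1)
        deterministic = log_probs.argmax(axis=-1)

        # 4. Pick deterministic actions for evaluation requests, sampled otherwise.
        action = np.where(requests.is_evaluation, deterministic, stochastic)

        # 5. Assemble the result: action, new memory state, and PPO's log_prob.
        chosen_log_prob = log_probs[np.arange(len(action)), action]
        return RolloutResult(action=action, log_prob=chosen_log_prob,
                             policy_state=new_hidden)
```

The returned policy_state is what the agent would send back with its next request, so the policy itself stays stateless between calls.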

NOTE: The data structures of RolloutRequest and RolloutResult are subject to change in the next few versions. Algorithm-specific values will be put into a sub-NamedArray.

`analyze`

The analyze method is closely tied to your training algorithm. Its input is a sample batch, and its return value is determined by the training algorithm. E.g. if your training algorithm is PPO, the analyzed result should contain at least new_logprob, old_logprob and entropy. This is formalized by trainers specifying a dataclass named AnalyzedResult.

Some rules of thumb:

If you are only customizing a policy, check out the AnalyzedResult specified by the trainer, and make sure the return value of your analyze method matches the trainer's requirement.
If you are customizing both trainer and policy, implement your trainer first. Pretend that the policy returns just the values you need, formalize an AnalyzedResult, and then implement the analyze method of your policy.
If your customized trainer works with an existing policy, implement your trainer first, and then try tweaking the analyze method of the policy. Note that there is a target argument that allows your trainer to tell the policy which analyze method to execute.

Unlike RolloutResult, AnalyzedResult is a classic Python dataclass. The data fields can be of any datatype, and their dimensions don't have to match, as long as your trainer can properly update the model parameters.
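As a rough illustration of how a trainer might consume such a result, the sketch below defines a PPO-style AnalyzedResult and feeds it into the standard PPO clipped surrogate loss. The three field names come from the text above. The list-of-floats types, the ppo_clipped_loss helper, and its clip and entropy_coef defaults are assumptions for this example, not the system's actual trainer code.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class AnalyzedResult:
    """Hypothetical PPO-style result; the real fields are set by the trainer."""
    new_logprob: List[float]
    old_logprob: List[float]
    entropy: List[float]


def ppo_clipped_loss(result: AnalyzedResult,
                     advantages: List[float],
                     clip: float = 0.2,
                     entropy_coef: float = 0.01) -> float:
    """PPO clipped surrogate loss (to be minimized), with an entropy bonus."""
    surrogate = []
    for new_lp, old_lp, adv in zip(result.new_logprob, result.old_logprob, advantages):
        ratio = math.exp(new_lp - old_lp)               # importance ratio
        clipped = max(1.0 - clip, min(ratio, 1.0 + clip))
        surrogate.append(min(ratio * adv, clipped * adv))
    policy_loss = -sum(surrogate) / len(surrogate)
    mean_entropy = sum(result.entropy) / len(result.entropy)
    return policy_loss - entropy_coef * mean_entropy
```

In this sketch the policy only reports log-probabilities and entropy. The decision about how to turn them into a loss stays entirely in the trainer, which is the split the rules of thumb above are meant to preserve.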

Other utility methods

In some cases, e.g. if your model contains more than a simple pytorch neural network, implementing get_checkpoint and load_checkpoint is necessary. Just note that the return value of get_checkpoint has to be pytorch-savable.
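A minimal sketch of such a pair of methods follows, assuming a policy that keeps a running observation mean alongside its weights. PolicyWithNormalizer, RunningMean, and the checkpoint keys are all hypothetical. The idea shown is simply to return plain, picklable containers, since torch.save relies on pickle by default.

```python
import pickle


class RunningMean:
    """Extra state that is not part of the neural network."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x: float) -> None:
        self.count += 1
        self.mean += (x - self.mean) / self.count


class PolicyWithNormalizer:
    """Hypothetical policy that carries state beyond network weights."""

    def __init__(self):
        self.weights = [0.0, 0.0]  # stands in for a model state_dict
        self.obs_stats = RunningMean()
        self.version = 0

    def get_checkpoint(self) -> dict:
        # Return only plain containers so the result can be serialized.
        return {
            "weights": list(self.weights),
            "obs_stats": {"count": self.obs_stats.count,
                          "mean": self.obs_stats.mean},
            "version": self.version,
        }

    def load_checkpoint(self, checkpoint: dict) -> None:
        self.weights = list(checkpoint["weights"])
        self.obs_stats.count = checkpoint["obs_stats"]["count"]
        self.obs_stats.mean = checkpoint["obs_stats"]["mean"]
        self.version = checkpoint["version"]


# Round trip through pickle, used here as a stand-in for torch.save / torch.load.
source = PolicyWithNormalizer()
source.obs_stats.update(2.0)
source.version = 3
restored = PolicyWithNormalizer()
restored.load_checkpoint(pickle.loads(pickle.dumps(source.get_checkpoint())))
```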

Advanced Usage

To be documented.

reanalyze

Trainer

When a trainer is initialized, a policy must be specified. If this is confusing, think of trainers as optimizers and policies as neural networks. If it is still confusing, read the section Trainer and Policy (Optional Reading).

Step

Let the code speak for itself.

# algorithm/trainer.py
class Trainer:

    @property
    def policy(self) -> api.policy.Policy:
        """Running policy of the trainer.
        """
        raise NotImplementedError()

    def step(self, samples: SampleBatch) -> TrainerStepResult:
        """Advances one training step given samples collected by actor workers.

        Example code:
          ...
          some_data = self.policy.analyze(samples)
          loss = loss_fn(some_data, samples)
          self.optimizer.zero_grad()
          loss.backward()
          ...
          self.optimizer.step()
          ...

        Args:
            samples (SampleBatch): A batch of data required for training.

        Returns:
            TrainerStepResult: Entry to be logged by trainer worker.
        """
        raise NotImplementedError()
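To make the interface above concrete, here is a toy trainer that follows the same analyze, loss, update pattern without any deep learning library. SampleBatch and TrainerStepResult are hypothetical stand-ins with invented fields, LinearPolicy and SGDTrainer are illustrative names, and the hand-written gradient replaces the optimizer calls shown in the docstring.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SampleBatch:  # hypothetical stand-in; real batches hold rollout data
    obs: List[float]
    target: List[float]


@dataclass
class TrainerStepResult:  # hypothetical stand-in with assumed fields
    stats: Dict[str, float]
    step: int


class LinearPolicy:
    """One-parameter policy: prediction = w * obs."""

    def __init__(self):
        self.w = 0.0

    def analyze(self, samples: SampleBatch) -> List[float]:
        return [self.w * x for x in samples.obs]


class SGDTrainer:
    def __init__(self, policy: LinearPolicy, lr: float = 0.1):
        self._policy = policy
        self._lr = lr
        self._steps = 0

    @property
    def policy(self) -> LinearPolicy:
        return self._policy

    def step(self, samples: SampleBatch) -> TrainerStepResult:
        preds = self.policy.analyze(samples)
        n = len(preds)
        # Mean squared error and its gradient with respect to w.
        loss = sum((p - t) ** 2 for p, t in zip(preds, samples.target)) / n
        grad = sum(2 * (p - t) * x
                   for p, t, x in zip(preds, samples.target, samples.obs)) / n
        self._policy.w -= self._lr * grad  # plays the role of optimizer.step()
        self._steps += 1
        return TrainerStepResult(stats={"loss": loss}, step=self._steps)
```

Calling step repeatedly on the same batch drives the loss down, and the returned TrainerStepResult is the kind of entry a trainer worker would log.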

Testing environment, trainer and policy

test_pipeline will start a local process and run your environment, policy, and trainer for several episodes. If the test passes, you can assume that your code can be deployed to the distributed system.

Trainer and Policy (Optional Reading)

It is natural to think of the "neural network" and the "optimizer" as a whole, especially in local projects. We compute actions with a neural network (NN) and optimize directly on the same NN. But in a distributed system, trainer and policy must be decoupled: while the trainer is updating the parameters, the policy cannot stop serving actions.

To summarize the relation between trainer and policy in our system:

Agents make decisions through policies.
Trainers optimize policies.

The rollout and analyze methods then follow naturally for policies. As a metaphor, policies are like humans who make decisions during the day (rollout) and, at the end of the day, reflect on what they did right or wrong (analyze). Note that for analyze, we expect policies to tell us what was good or bad, but it is the trainers who decide what to do the next day. Trainers, seeing the rights and wrongs, behave more like a methodology or philosophy: "Do more rights and fewer wrongs".
