Reinforcement learning applied to Frogger, featuring a trained RL agent and an interactive command-line UI.
Objective: develop a competent Frogger agent that can play a simplified / discretized version of the arcade game. Ultimately, the state was implemented as follows:
- A fixed `W x H` grid, with cells harboring the player's frog, a car, or the goal (or simply being empty).
- There are four car lanes, two of which flow to the right and two of which flow left.
- The frog begins in an empty row and has to progress through the car lanes in order to reach the 'goal' row without hitting a car.
- Additionally, a `max_steps` attribute was added to introduce a time element that constrains the agent's behavior to optimize the time to reach the goal (and not just linger).
The environment is rendered as ASCII or emoji Frogger grids for user clarity.
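To make the state layout concrete, here is a minimal sketch of how such a grid can be encoded; the constants and the `encode_state` helper are illustrative assumptions, not the actual `frogger_env.py` implementation:

```python
import numpy as np

# Illustrative grid constants (defaults implied by the 3 x 6 x 5 = 90 input size below);
# the exact row layout in frogger_env.py may differ.
W, H = 5, 6
GOAL_ROW = 0                                      # 'goal' row the frog must reach
START_ROW = H - 1                                 # empty row where the frog begins
LANE_DIRECTIONS = {1: +1, 2: -1, 3: +1, 4: -1}    # four car lanes, two flowing each way

def encode_state(frog_rc, car_cells):
    """Encode the board as 3 binary channels (frog, cars, goal row), then flatten."""
    state = np.zeros((3, H, W), dtype=np.float32)
    state[0, frog_rc[0], frog_rc[1]] = 1.0        # frog location
    for r, c in car_cells:
        state[1, r, c] = 1.0                      # car locations
    state[2, GOAL_ROW, :] = 1.0                   # goal-row cells
    return state.flatten()                        # 90-dim vector by default
```

This three-channel encoding matches the policy input described in the agent section below.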
The goal was to incentivize the agent to "win" and actually reach the goal row, which motivated the following rewards (and penalties):
- `+5` for reaching the goal row (the only positive reward, since we want to emphasize this)
- `-1` for collision with a car (want to avoid death, but not so large as to discourage exploration)
- `-10` for exceeding the maximum number of steps (i.e., running out of time generally isn't justified)
- `-0.02` per step to avoid unnecessarily long paths, waiting, "dithering"
Through the first few iterations of training, found that the magnitude of the penalty for exceeding `max_steps` was too small: it was actually advantageous to exceed the maximum number of steps rather than collide with a car. The agent's policy likely learned this strategy as a local minimum that sufficed to optimize the loss.
As such, decided to drastically increase the penalty, first from the initial -0.5 to -2.0, and then found that a further increase to an arbitrarily large penalty of -10 yielded greater success by training away the lingering `max_steps` strategy.
Note: because this reward structure is inherently sparse, a small positive reward was later added for forward progress, discussed in reward shaping further below.
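As a rough illustration of the reward logic described above (the actual bookkeeping inside the environment's `step()` may be organized differently, and the forward-progress term is the shaping bonus introduced later):

```python
# Final tuned reward values from above (timeout penalty raised -0.5 -> -2.0 -> -10.0).
STEP_PENALTY = -0.02
COLLISION_PENALTY = -1.0
TIMEOUT_PENALTY = -10.0
GOAL_REWARD = +5.0

def compute_reward(reached_goal, collided, timed_out, row):
    """Per-step reward; `row` is the frog's progress into the lanes (0..5)."""
    reward = STEP_PENALTY + 0.02 * row      # shaping bonus added later (see reward shaping below)
    if collided:
        reward += COLLISION_PENALTY
    elif timed_out:
        reward += TIMEOUT_PENALTY
    elif reached_goal:
        reward += GOAL_REWARD
    return reward
```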
Ultimately trained the action-taking agent with an MLP policy using the REINFORCE policy gradient (augmented with an entropy bonus and a baseline).
TL;DR: built Frogger-playing agent with 95.9% success rate (validated over 20,000 episodes using greedy action selection).
Implemented a 2-layer MLP with the following:
Input: `3 x H x W` grid
- Had 3 channels in the input grid (one each for the frog location, car locations, and goal-row locations)
- Flattened this grid into a state vector (by default, `3 x 6 x 5 = 90` inputs)
Output: 5 logits that are translated into one of 5 actions:
- `0 = UP`
- `1 = DOWN`
- `2 = LEFT`
- `3 = RIGHT`
- `4 = STAY`
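A sketch of what this 2-layer policy could look like in PyTorch; the hidden width and exact layer choices are placeholders, not the actual `frogger_policy.py` definition:

```python
import torch
import torch.nn as nn

class FroggerPolicy(nn.Module):
    """2-layer MLP: flattened 3 x H x W state -> 5 action logits."""
    def __init__(self, grid_h=6, grid_w=5, n_actions=5, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * grid_h * grid_w, hidden),   # 90 -> hidden (width assumed)
            nn.ReLU(),
            nn.Linear(hidden, n_actions),             # hidden -> 5 logits
        )

    def forward(self, state):
        # state: (batch, 90); returns raw logits, softmaxed only when sampling actions
        return self.net(state)
```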
Using whatever action the policy produced at each step of the current episode (training sample), the environment was advanced via its `.step()` method, which moves the cars, moves the frog according to the action, and computes the reward, collision occurrence, etc.
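For concreteness, a minimal rollout helper under the assumption of a Gym-style `reset()` / `step()` interface (the real environment API and bookkeeping may differ); it also collects the log-probabilities and entropies used by the training objective described below:

```python
import torch

def run_episode(env, policy):
    """Roll out one episode, sampling actions from the policy's categorical distribution."""
    state = env.reset()
    log_probs, entropies, rewards = [], [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        state, reward, done, _ = env.step(action.item())   # cars advance, frog moves, reward computed
        log_probs.append(dist.log_prob(action).squeeze(0))
        entropies.append(dist.entropy().squeeze(0))
        rewards.append(reward)
    return log_probs, entropies, rewards
```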
- Temporal Credit Assignment (via Monte Carlo Returns): used a discount factor (`gamma`) so that each action is credited mainly for the rewards that follow it soon after. The return for the action taken at timestep `t`, given rewards `r_k`, is computed as follows (see the code sketch below):

  $$G_t = \sum_{k=t}^{T} \gamma^{\,k-t}\, r_k$$
- Baseline Smoothing: found lots of noise in the gradients during training, so subtracted out an expected, baseline return to reduce the high variance and smooth the gradient estimates (meaning the policy specifically updates when an action performs worse / better than the mean). Raw returns are translated into an 'advantage' by subtracting `b`, the running mean of returns to compare against:

  $$A_t = G_t - b$$
- Decaying Entropy: increased action stochasticity earlier in training to encourage exploration, which helped the agent develop a more generalizable policy rather than relying on a few highly probable actions. Specifically, calculated the entropy of the action distribution produced by the MLP policy (probabilities `pi_theta`) as follows:

  $$H\big(\pi_\theta(\cdot \mid s_t)\big) = -\sum_{a} \pi_\theta(a \mid s_t)\, \log \pi_\theta(a \mid s_t)$$
Intuition: high entropy corresponds to a more uniform distribution, i.e., one that spreads likelihood more widely across the action space, so the agent is more likely to explore a variety of actions.
- Following the standard optimized REINFORCE approach, composed the loss as a balance between (1) 'greedy' exploitation, which adjusts the probabilities of taken actions by their (dis)advantages, i.e., increasing/decreasing the probabilities of actions that performed better/worse than expected, and (2) adjusting policy parameters to encourage exploring a breadth of actions earlier on.
- Used `beta` as a decaying coefficient throughout training, so that the entropy term has a larger effect early on and a progressively smaller one as `beta` itself decays and as the advantages from the agent's learning grow to dominate the loss calculation:

  $$\mathcal{L}(\theta) = -\sum_{t}\Big[ A_t \log \pi_\theta(a_t \mid s_t) + \beta\, H\big(\pi_\theta(\cdot \mid s_t)\big) \Big]$$
This loss is used to compute the policy gradient, which accordingly updates MLP policy parameters.
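Putting the pieces together, here is a minimal sketch of the per-episode update, assuming the hypothetical `run_episode` outputs above; the running-mean baseline update and the linear `beta` decay are assumptions about details the description leaves open, not the exact `train_agent.py` code:

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo returns G_t = sum_{k=t..T} gamma^(k-t) * r_k, accumulated backwards."""
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return torch.tensor(returns, dtype=torch.float32)

def reinforce_update(optimizer, log_probs, entropies, rewards, baseline,
                     episode, num_episodes, gamma=0.99, beta_range=(0.1, 0.02)):
    """One REINFORCE step with a running-mean baseline and a decaying entropy bonus."""
    returns = discounted_returns(rewards, gamma)
    advantages = returns - baseline                                   # A_t = G_t - b
    frac = episode / max(1, num_episodes - 1)
    beta = beta_range[0] + frac * (beta_range[1] - beta_range[0])     # linear decay (assumed)
    loss = -(torch.stack(log_probs) * advantages).sum() - beta * torch.stack(entropies).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    baseline = 0.9 * baseline + 0.1 * returns.mean().item()           # simple EMA baseline (assumed)
    return loss.item(), baseline
```

The `beta_range = (0.1, 0.02)` default mirrors the training parameters listed below.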
Ran training with key parameters as follows, computing loss and updating the policy accordingly after each episode:
```bash
python train_agent.py
```
- `num_episodes = 30000`
- `gamma = 0.99`
- `beta_range = (0.1, 0.02)`
- `learning_rate = 1e-3`

- Training logs are saved to `training_logs.txt` (not displayed to console)
- Every `render_every` episodes, you can watch the agent play with clean rendering
- After training, plots show returns, loss, and success rate
- Trained policy is saved to `checkpoints/frogger_policy_{accuracy}.pt`
Note: chose `num_episodes` after finding that policy success/gradient jumped again after ~20k episodes; `gamma`, `beta`, and other parameters were chosen and loosely tweaked based on standard values.
Upon augmenting the initial training process with advantages (baseline smoothing) and decaying entropy, found that the improvement in guiding the agent to a winning strategy, one that was (a) more successful and (b) reached earlier in training, came from a mix of these two optimizations, but mostly from the addition of entropy.
As explained above, observed that the return and success (achieved goal or not) reached their maxima (at least with this policy and training approach) far earlier when incorporating decaying entropy. This can likely be attributed to the agent being pushed to adjust its weights early on to explore a broader variety of actions, allowing it to discover successful patterns, like moving UP, LEFT, etc. in whatever order avoids cars, so that it can reach the reward at the goal line.
The following shows this with greater clarity.
Return & success respectively across 30000 episodes prior to adding entropy
Return & success respectively across 30000 episodes after adding entropy
A problem with applying a strategy like REINFORCE / policy gradients to a sparse reward structure like Frogger's (i.e., the only positive reward comes upon reaching the end goal) is that positive signals for things like forward progress aren't learned explicitly until the agent manages to reach the end goal for the first time.
Solution: decided to add a small reward when the agent makes forward progress into the lanes, with that reward growing linearly as it gets closer to the goal (at which point the agent is granted the final +5 reward); this also helps offset the dithering step penalty, since successful forward progress is good.
```python
fwd_progress_reward = 0.02 * row  # row in lanes [0, 5]
```
Upon using this slightly adapted reward structure, along with slightly more training episodes (also noting that cumulative return is higher because of the shaping-induced intermediate rewards), was able to boost return and success further, as shown below:
Return & success respectively across 40000 episodes after reward shaping
Note: the loss across all versions here (not shown) is not as telling, given that it's an indirect representation of returns, and as such, the loss through training remains relatively noisy (which is not inherently problematic).
During training, tracked success rate, episode returns, and policy loss across all episodes. Logs are saved to training_logs.txt for detailed inspection. Training process showed steady improvement with the final checkpoints achieving high success rates.
After training, performed comprehensive validation using validate_agent.py (with 20k runs to obtain more reliable metrics):
Validation Setup:
- Episodes: 20,000 evaluation runs
- Action Selection: greedy/deterministic (argmax), NOT stochastic sampling
- Policy: `checkpoints/frogger_policy_0.98.pt` (best trained policy)
Results:
| Metric | Value | 95% Confidence Interval |
|---|---|---|
| Success Rate | 95.91% (19,181/20,000) | [95.63%, 96.18%] |
| Mean Return | 4.51 ± 3.15 | [4.47, 4.55] |
| Median Return | 5.16 | — |
| Mean Episode Length | 8.83 ± 8.48 steps | — |
| Median Episode Length | 7 steps | — |
Key Findings:
- The agent successfully reaches the goal in 95.91% of episodes, demonstrating learned behavior
- When successful, episodes typically complete in 7-9 steps, showing efficient pathfinding
- Return distribution is bimodal: successful episodes cluster around +5.0, while failures result in negative returns
- Narrow confidence interval (±0.3% on success rate) shows the estimate is statistically reliable over 20k episodes
Validation Visualizations:
The validation script is also helpful for generating comprehensive plots showing return distributions, rolling success rates, episode lengths, and correlations between metrics:
Interpretation: the return histogram (top-left) shows the clear bimodal split between successful episodes (+5 plus incremental intermediate rewards/penalties) and failed ones (with failures almost exclusively coming from the lingering, max-steps dithering behavior). The rolling success rate (top-right) remains stable around 96% throughout evaluation, confirming consistent performance. The episode-length distribution (bottom-left) shows most episodes terminate quickly, while the scatter plot (bottom-right) reveals that successful episodes cluster in the 5-10 step range; failed episodes are evidently due to dying or hitting max steps (with most genuine failures, i.e., not from max-steps, having slightly positive returns, indicating some forward progress).
Running Validation:
To reproduce these results or validate other checkpoints:
```bash
python validate_agent.py
```
Configuration can be adjusted in the script:
- `CHECKPOINT_PATH`: Policy to evaluate
- `NUM_EPISODES`: Number of validation episodes (default: 20,000)
- `USE_GREEDY`: Deterministic vs. stochastic action selection (default: `True` for evaluation)
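The deterministic (greedy) selection used for evaluation simply takes the argmax over the policy's logits rather than sampling; a small sketch (the function name is assumed, not `validate_agent.py`'s actual code):

```python
import torch

def greedy_action(policy, state):
    """Evaluation-time action: argmax over logits instead of stochastic sampling."""
    with torch.no_grad():
        logits = policy(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(logits.argmax(dim=-1).item())
```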
While this MLP + policy gradient approach (via the REINFORCE pattern) works relatively well, as shown by the minimal training load and the success achieved (with the few optimizations that were made), there are definitely ways this could be taken further:
- **Q-Learning (+ DQN)**: an approach that might lend itself to this setup, which has a relatively simple environment and action space compared to something like Go or even chess.
- **Actor-Critic**: makes use of a *critic* that estimates the value of a given state to generate a more stable learning signal, then communicates that to the *actor* to update its policy more efficiently; helpful for reducing variance.
- **Augmenting the Environment**: to make the game more interesting, one could introduce stochasticity into existing obstacles (e.g., the number of cars, their speed), create new obstacles (e.g., fast-moving vehicles, logs on water), and much more, inviting exploration into whether this RL approach would still succeed or whether stronger approaches (including those mentioned above) would be necessary.
The trained agent can be run live in the frogger grid. As validated through rigorous testing (see Metrics & Evaluation above), the agent achieves a 95.9% success rate over 20,000 episodes using greedy action selection. Though largely minimized through training optimizations, occasional suboptimal behaviors (like the "dithering" / max-steps waiting) still appear in edge cases...
```bash
python frogger_cli.py
```
- **Two Play Modes:** Watch the trained RL agent navigate and try to win, or play yourself with real-time game updates (cars keep moving)
- **Two Rendering Modes:** ASCII mode (the initial, simpler one) or Emoji mode
- **Three Speed Settings (Human Play):** `Fast` (0.75s per step), `Medium` (1s per step), `Slow` (1.25s per step)
The game runs continuously, i.e., cars keep moving even when the user doesn't act.
- `W`: Move up
- `A`: Move left
- `S`: Move down
- `D`: Move right
- `Space`: Stay in place (or don't press anything)
- `Q`: Quit
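For reference, the key-to-action correspondence implied by the controls above and the action encoding from earlier (the actual input handling in `frogger_cli.py` may differ):

```python
# Hypothetical mapping from pressed key to action index (0=UP, 1=DOWN, 2=LEFT, 3=RIGHT, 4=STAY).
KEY_TO_ACTION = {
    "w": 0,   # up
    "a": 2,   # left
    "s": 1,   # down
    "d": 3,   # right
    " ": 4,   # stay (also the default when nothing is pressed)
    # "q" quits rather than mapping to an action
}
```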
Core Implementation:
- `frogger_env.py` - Frogger game environment implementation
- `frogger_policy.py` - Neural network (MLP) policy for agent decision-making
- `render_utils.py` - Shared rendering utilities for ASCII/emoji display
Training & Evaluation:
- `train_agent.py` - Training script with REINFORCE algorithm, logging, and visualization
- `validate_agent.py` - Rigorous validation script (20k episodes, greedy evaluation, statistical analysis)
- `training_logs.txt` - Episode-by-episode training logs
Interactive Modes:
- `frogger_cli.py` - Interactive CLI (recommended): watch the agent or play yourself
- `simulate_frogger_agent.py` - Isolated agent evaluation and visualization
- `human_play.py` - Step-by-step human play mode
Saved Models & Results:
- `checkpoints/frogger_policy_0.98.pt` - Best learned policy (~98% training, 95.9% validation success rate)
- `evaluation/validation_frogger_policy_0.98.png` - Validation results visualization
- `evaluation/validation_results.json` - Validation metrics and statistics