
Frogger + RL

Reinforcement learning applied to Frogger, featuring a trained RL agent and an interactive command-line UI.

Frogger RL Demo

Defining the Environment

Objective: develop a competent Frogger agent that can play a simplified, discretized version of the arcade game. Ultimately, the state was implemented as follows:

  • A fixed W x H grid, with cells harboring the player's frog, a car, or the goal (or simply being empty).
  • There are four car lanes, two of which flow to the right and two of which flow left.
  • The frog begins in an empty row and has to progress through the car lanes in order to reach the 'goal' row without hitting a car.
  • Additionally, a max_steps attribute was added to introduce a time element that constrains the agent to optimize the time taken to reach the goal (and not just linger).

The environment is rendered as an ASCII or emoji grid for user clarity.
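
For concreteness, here is a minimal sketch of what such a discretized environment could look like; the class name FroggerEnv, the attribute names, and the car placement below are illustrative assumptions, not the exact implementation in frogger_env.py:

import random

class FroggerEnv:
    """Sketch: W x H grid with 4 car lanes, a start row at the bottom, and a goal row at the top."""

    def __init__(self, width=5, height=6, max_steps=50):
        self.W, self.H = width, height
        self.max_steps = max_steps
        self.goal_row = 0
        self.start_row = height - 1
        # Rows 1-4 are car lanes; two flow right (+1) and two flow left (-1).
        self.lane_dirs = {1: +1, 2: -1, 3: +1, 4: -1}
        self.reset()

    def reset(self):
        self.frog = [self.start_row, self.W // 2]
        # One car per lane at a random column (density/speed may differ in the real env).
        self.cars = {row: [random.randrange(self.W)] for row in self.lane_dirs}
        self.steps = 0
        return self._observe()

    def _observe(self):
        # 3 channels: frog position, car positions, goal-row cells.
        obs = [[[0] * self.W for _ in range(self.H)] for _ in range(3)]
        obs[0][self.frog[0]][self.frog[1]] = 1
        for row, cols in self.cars.items():
            for col in cols:
                obs[1][row][col] = 1
        for col in range(self.W):
            obs[2][self.goal_row][col] = 1
        return obs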

Reward Landscape / Surface

The goal was to incentivize the agent to "win" and actually reach the goal row, which motivated the following rewards (and penalties):

  • +5 for reaching the goal row (the only positive reward since we want to emphasize this)
  • -1 for collision with car (want to avoid death, but not so large as to discourage exploration)
  • -10 for exceeding maximum number of steps (i.e., running out of time generally isn't justified).
  • -0.02 per step to discourage unnecessarily long paths, waiting, and "dithering".

Through the first few iterations of training, found that the magnitude of the penalty for exceeding max_steps was too small: it was actually advantageous for the agent to run out the clock rather than risk colliding with a car. The policy likely learned this lingering strategy as a local minimum that sufficed in optimizing the loss.

As such, decided to drastically increase the penalty, first from an initial -0.5 to -2.0, and then found that a further increase (to an arbitrarily large penalty of -10) yielded greater success by training away the lingering, max-steps strategy.

Note: because this reward structure is inherently sparse, a small positive reward was later added for forward progress, discussed in the reward shaping section further below.
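
To make the full scheme concrete, here is a sketch of how the per-step reward (including the later shaping bonus) could be computed; the constant names, the moved_forward gating, and the exact shaping form are assumptions for illustration rather than the code in frogger_env.py:

# Illustrative constants mirroring the reward scheme described above.
GOAL_REWARD = 5.0         # reaching the goal row
COLLISION_PENALTY = -1.0  # hitting a car
TIMEOUT_PENALTY = -10.0   # exceeding max_steps
STEP_PENALTY = -0.02      # per-step cost to discourage dithering
SHAPING_COEF = 0.02       # forward-progress shaping added later

def step_reward(reached_goal, collided, timed_out, moved_forward, row_progress):
    """Reward for a single transition under the scheme above (sketch)."""
    if reached_goal:
        return GOAL_REWARD
    if collided:
        return COLLISION_PENALTY
    if timed_out:
        return TIMEOUT_PENALTY
    reward = STEP_PENALTY
    if moved_forward:
        # Shaping bonus grows linearly with how far into the lanes the frog has progressed.
        reward += SHAPING_COEF * row_progress
    return reward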

Training

Ultimately trained the agent with an MLP policy optimized via the REINFORCE policy gradient (augmented with an entropy bonus and a baseline).

TL;DR: built a Frogger-playing agent with a 95.9% success rate (validated over 20,000 episodes using greedy action selection).

Policy Definition

Implemented a 2-layer MLP with the following:

Input: 3 x H x W vector

  • The input grid has 3 channels (one each for the frog location, car locations, and goal-row locations)
  • This grid is flattened into a state vector (by default, 3 x 6 x 5 = 90 inputs)

Output: 5 logits that are translated into one of 5 actions:

  • 0 = UP
  • 1 = DOWN
  • 2 = LEFT
  • 3 = RIGHT
  • 4 = STAY
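
A minimal PyTorch sketch of such a policy; the hidden-layer size and class name are assumptions, and the actual architecture lives in frogger_policy.py:

import torch
import torch.nn as nn

class FroggerPolicy(nn.Module):
    """2-layer MLP: flattened 3 x H x W observation -> 5 action logits."""

    def __init__(self, height=6, width=5, hidden=128, num_actions=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * height * width, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs):
        # obs: (batch, 3, H, W) -> flatten -> logits over the 5 actions
        return self.net(obs.flatten(start_dim=1))

Sampling from a Categorical distribution over these logits during training (and taking the argmax during greedy evaluation) yields the action indices 0-4 listed above.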

Optimizing Policy

At each step of an episode (training sample), the action produced by the policy was passed to the environment's .step() method, which moves the cars, moves the frog according to the action, and computes the reward, collision occurrence, etc.
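
A sketch of what one episode rollout might look like under this setup; the (obs, reward, done, info) return signature of .step() is an assumption about the interface, not a guarantee of the actual one:

import torch

def run_episode(env, policy):
    """Collect log-probs, entropies, and rewards for one episode (sketch)."""
    obs = torch.tensor(env.reset(), dtype=torch.float32).unsqueeze(0)
    log_probs, entropies, rewards = [], [], []
    done = False
    while not done:
        dist = torch.distributions.Categorical(logits=policy(obs))
        action = dist.sample()                      # stochastic sampling during training
        log_probs.append(dist.log_prob(action))
        entropies.append(dist.entropy())
        next_obs, reward, done, info = env.step(action.item())  # assumed signature
        rewards.append(reward)
        obs = torch.tensor(next_obs, dtype=torch.float32).unsqueeze(0)
    return log_probs, entropies, rewards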

Return & Advantage Computation

  • Temporal Credit Assignment (via Monte Carlo Returns): used a discount factor (gamma) so that each action is credited primarily for the rewards that closely follow it. The following was used to calculate the return for the action taken at timestep t from rewards r_k:
$$G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t} \, r_k, \quad \gamma = 0.99$$
  • Baseline Smoothing: found lots of noise in the gradient estimates during training, so a baseline was used to reduce this high variance and smooth the estimates (by subtracting out an expected, baseline return, so that the policy updates specifically when an action performs worse or better than the mean). The following translates a raw return into an 'advantage' by subtracting b, the running mean of returns to compare against:
$$A_t = G_t - b$$
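
A sketch of both computations; the exponential-moving-average form of the running baseline is an assumption (any running mean of past returns serves the same purpose):

def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo returns: G_t = sum_{k>=t} gamma^(k-t) * r_k."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return returns

def advantages(returns, baseline):
    """A_t = G_t - b, where b is the running mean of returns."""
    return [g - baseline for g in returns]

def update_baseline(baseline, returns, momentum=0.95):
    """Update the running-mean baseline after each episode (EMA form assumed)."""
    return momentum * baseline + (1 - momentum) * (sum(returns) / len(returns))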

Encouraging Initial Policy Exploration

  • Decaying Entropy: increased action stochasticity earlier on in training to encourage exploration, which helped the agent develop a more generalizable policy rather than relying on highly probable actions. Specifically calculated the entropy of the action distribution from the MLP policy (probabilities pi_theta) as follows:
$$H(\pi_\theta(\cdot \mid s)) = - \sum_a \pi_\theta(a \mid s) \, \log \pi_\theta(a \mid s)$$

Intuition: high entropy corresponds to a more uniform distribution, i.e., one that spreads probability more evenly across the action space, so the agent is more likely to explore a variety of actions.
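
A small sketch of that calculation from the policy's raw logits; equivalently, torch.distributions.Categorical(logits=logits).entropy() gives the same quantity:

import torch

def action_entropy(logits):
    """H(pi(.|s)) = -sum_a pi(a|s) * log pi(a|s) for a single state's action logits."""
    probs = torch.softmax(logits, dim=-1)
    return -(probs * torch.log(probs + 1e-12)).sum(dim=-1)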

Loss Formulation

  • Following the standard optimized REINFORCE approach, composed the loss as a balance between (1) 'greedy' exploitation, adjusting the probabilities of taken actions by their (dis)advantages, i.e., increasing/decreasing the probabilities of actions that performed better/worse than expected, and (2) exploration, adjusting policy parameters to encourage a breadth of actions earlier on.
  • Used beta as a decaying coefficient throughout training so that the entropy term has a larger effect early on and a progressively smaller one as beta decays and as the advantages from the agent's learning grow to dominate the loss:
$$L(\theta) = -\sum_{t=0}^{T-1} \Big( A_t \,\log \pi_\theta(a_t \mid s_t) \;+\; \beta \, H\!\left(\pi_\theta(\cdot \mid s_t)\right) \Big)$$

This loss is used to compute the policy gradient, which accordingly updates MLP policy parameters.
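
A sketch of the per-episode loss and update; the optimizer setup and tensor shapes are illustrative assumptions built on the rollout sketch above:

import torch

def reinforce_loss(log_probs, entropies, advs, beta):
    """L = -sum_t ( A_t * log pi(a_t|s_t) + beta * H(pi(.|s_t)) )."""
    log_probs = torch.stack(log_probs).squeeze(-1)
    entropies = torch.stack(entropies).squeeze(-1)
    advs = torch.tensor(advs, dtype=torch.float32)   # treated as constants (no gradient)
    return -(advs * log_probs + beta * entropies).sum()

# Typical episode update (optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)):
# loss = reinforce_loss(log_probs, entropies, advs, beta)
# optimizer.zero_grad(); loss.backward(); optimizer.step()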

Training Details

Ran training with key parameters as follows, computing loss and updating the policy accordingly after each episode:

python train_agent.py
num_episodes=30000
gamma=0.99
beta_range=(0.1, 0.02)
learning_rate=1e-3
  • Training logs are saved to training_logs.txt (not displayed to console)
  • Every render_every episodes, you can watch the agent play with clean rendering
  • After training, plots show returns, loss, and success rate
  • Trained policy is saved to checkpoints/frogger_policy_{accuracy}.pt

Note: chose num_episodes after finding that the policy's success rate jumped again after ~20k episodes; gamma, beta, and other parameters were chosen and loosely tweaked from standard values.
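
One simple way to realize beta_range=(0.1, 0.02) is a linear anneal over episodes; the exact schedule in train_agent.py may differ, so treat this as an assumption:

def beta_at(episode, num_episodes=30000, beta_start=0.1, beta_end=0.02):
    """Linearly decay the entropy coefficient from beta_start to beta_end."""
    frac = min(episode / max(num_episodes - 1, 1), 1.0)
    return beta_start + frac * (beta_end - beta_start)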

Training / Optimization Takeaways

Upon augmenting the initial training process with advantages (baseline smoothing) and decaying entropy, found that the improvement in guiding the agent toward a winning strategy that was (a) more successful and (b) reached earlier lay in a mix of these two optimizations, but mostly in the addition of entropy.

As explained above, observed that the return and success (goal achieved or not) reached their maxima (at least with this policy and training approach) far earlier when incorporating decaying entropy. This can likely be attributed to the fact that the entropy term pushes the agent early on to explore a broader variety of actions, allowing it to discover successful patterns, like moving UP, LEFT, etc. in whatever order avoids cars, so that it can reach the reward at the goal line.

The following shows this with greater clarity.

Return & success respectively across 30000 episodes prior to adding entropy


Return & success respectively across 30000 episodes after adding entropy


Successful Reward Shaping

A problem with applying a strategy like REINFORCE / policy gradients to a sparse reward structure like Frogger's (i.e., a positive reward is only received upon reaching the end goal) is that positive signals for behaviors like forward progress aren't learned until the agent manages to reach the end goal for the first time.

Solution: decided to add a small reward when the agent makes forward progress into the lanes, with that reward growing linearly as it gets closer to the goal (at which point the agent is granted the final +5 reward); this also helps offset the per-step dithering penalty, since successful forward progress is good.

fwd_progress_reward = 0.02 * row # row in lanes [0, 5]

Upon using this slightly adapted reward structure (along with slightly more training episodes, and noting that cumulative return is higher because of the shaping-induced intermediate rewards), was able to boost return and success to $\geq 95\%$:

Return & success respectively across 40000 episodes after reward shaping


Note: the loss across all versions here (not shown) is not as telling, given that it is only an indirect representation of the returns; as such, the loss remains relatively noisy throughout training (which is not inherently problematic).

Metrics & Evaluation

Training Monitoring

During training, tracked success rate, episode returns, and policy loss across all episodes. Logs are saved to training_logs.txt for detailed inspection. Training process showed steady improvement with the final checkpoints achieving high success rates.

Rigorous Validation Protocol

After training, performed comprehensive validation using validate_agent.py (with 20k runs to obtain more reliable metrics):

Validation Setup:

  • Episodes: 20,000 evaluation runs
  • Action Selection: greedy/deterministic (argmax), NOT stochastic sampling
  • Policy: checkpoints/frogger_policy_0.98.pt (best trained policy)

Results:

Metric | Value | 95% Confidence Interval
Success Rate | 95.91% (19,181/20,000) | [95.63%, 96.18%]
Mean Return | 4.51 ± 3.15 | [4.47, 4.55]
Median Return | 5.16 | n/a
Mean Episode Length | 8.83 ± 8.48 steps | n/a
Median Episode Length | 7 steps | n/a

Key Findings:

  • The agent successfully reaches the goal in 95.91% of episodes, demonstrating learned behavior
  • When successful, episodes typically complete in 7-9 steps, showing efficient pathfinding
  • Return distribution is bimodal: successful episodes cluster around +5.0, while failures result in negative returns
  • Narrow confidence interval (±0.3% on success rate) shows the estimate is statistically reliable over 20k episodes
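
For reference, the quoted interval is consistent with a standard normal-approximation confidence interval for a binomial proportion; a sketch of that calculation (the exact method used in validate_agent.py is not specified here):

import math

def success_rate_ci(successes, n, z=1.96):
    """95% normal-approximation confidence interval for a success proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

print(success_rate_ci(19181, 20000))  # roughly (0.9563, 0.9618)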

Validation Visualizations:

The validation script is also helpful for generating comprehensive plots showing return distributions, rolling success rates, episode lengths, and correlations between metrics:

Validation Results for Best Policy

Interpretation: the return histogram (top-left) shows a clear bimodal distribution between successful episodes (+5 plus incremental intermediate rewards/penalties) and failed ones (almost exclusively from the lingering, max-steps "dithering" behavior). The rolling success rate (top-right) remains stable around 96% throughout evaluation, confirming consistent performance. The episode length distribution (bottom-left) shows most episodes terminate quickly, while the scatter plot (bottom-right) reveals that successful episodes cluster in the 5-10 step range; failed episodes are evidently due to dying or reaching max_steps (with most genuine failures, i.e., those not due to max_steps, having slightly positive returns, indicating some forward progress).

Running Validation:

To reproduce these results or validate other checkpoints:

python validate_agent.py

Configuration can be adjusted in the script:

  • CHECKPOINT_PATH: Policy to evaluate
  • NUM_EPISODES: Number of validation episodes (default: 20,000)
  • USE_GREEDY: Deterministic vs. stochastic action selection (default: True for evaluation)
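
A sketch of what the greedy evaluation loop could look like; the env.step() signature and the "reached_goal" info key are assumptions for illustration, not the exact implementation of validate_agent.py:

import torch

def evaluate(env, policy, num_episodes=20000):
    """Greedy evaluation: take the argmax action at every step and count successes."""
    successes = 0
    with torch.no_grad():
        for _ in range(num_episodes):
            obs = torch.tensor(env.reset(), dtype=torch.float32).unsqueeze(0)
            done, info = False, {}
            while not done:
                action = policy(obs).argmax(dim=-1).item()       # deterministic, no sampling
                next_obs, reward, done, info = env.step(action)  # assumed signature
                obs = torch.tensor(next_obs, dtype=torch.float32).unsqueeze(0)
            successes += int(info.get("reached_goal", False))    # assumed info key
    return successes / num_episodes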

Next Steps & Future Directions

While this MLP + policy gradient approach (via the REINFORCE pattern) works relatively well, as shown by the minimal training load and the success achieved (with the few optimizations that were made), there are definitely ways in which this can be taken further:

  • Q-Learning (+ DQN): an approach that might lend itself to this set-up, which has a relatively simple environment and action space compared to something like Go or even chess.

  • Actor-Critic: makes use of a critic that can estimate the value of the given state to generate a more stable learning signal, then communicate that to the actor to update its policy more efficiently; helpful for reducing variance.

  • Augmenting Environment: to make the game more interesting, one could introduce stochasticity to existing obstacles (e.g., the number of cars, speed), create new obstacles (e.g., fast-moving vehicles, logs on water), and do much more, inviting exploration into whether this RL approach might succeed there or whether stronger approaches (including those mentioned above) might be necessary.

Running Agent ( + Optional Gameplay)

The trained agent can be run live in the frogger grid. As validated through rigorous testing (see Metrics & Evaluation above), the agent achieves a 95.9% success rate over 20,000 episodes using greedy action selection. Though largely minimized through training optimizations, occasional suboptimal behaviors (like the "dithering" / max-steps waiting) still appear in edge cases...

Run Frogger Agent CLI

python frogger_cli.py

Features

  • Two Play Modes:

    Watch the trained RL agent navigate and try to win, or play yourself with real-time game updates (cars keep moving)

    Watch Agent or Play

  • Two Rendering Modes:

    ASCII mode (the initial, simpler option) or Emoji mode

    Frogger ASCII vs Emoji Rendering

  • Three Speed Settings (Human Play):

    Fast (0.75s per step), Medium (1s per step), Slow (1.25s per step)

    Frogger Player Speed Settings

Controls (User Play)

The game runs continuously, i.e., cars keep moving even when the user doesn't act.

  • W: Move up
  • A: Move left
  • S: Move down
  • D: Move right
  • Space: Stay in place (or don't press anything)
  • Q: Quit

Project Components

Core Implementation:

  • frogger_env.py - Frogger game environment implementation
  • frogger_policy.py - Neural network (MLP) policy for agent decision-making
  • render_utils.py - Shared rendering utilities for ASCII/emoji display

Training & Evaluation:

  • train_agent.py - Training script with REINFORCE algorithm, logging, and visualization
  • validate_agent.py - Rigorous validation script (20k episodes, greedy evaluation, statistical analysis)
  • training_logs.txt - Episode-by-episode training logs

Interactive Modes:

  • frogger_cli.py - Interactive CLI (recommended) - watch agent or play yourself
  • simulate_frogger_agent.py - Isolated agent evaluation and visualization
  • human_play.py - Step-by-step human play mode

Saved Models & Results:

  • checkpoints/frogger_policy_0.98.pt - Best learned policy (~98% training success rate, 95.9% validation success rate)
  • evaluation/validation_frogger_policy_0.98.png - Validation results visualization
  • evaluation/validation_results.json - Validation metrics and statistics

References & Resources

REINFORCE Algorithm

Q-Learning & SARSA

Actor-Critic Methods

Deep Q-Networks (DQN)
