Author: Enlin Gu
Course: Independent Study (Fall 2025)
This study implemented and evaluated three reinforcement learning (RL) strategies in robot racing simulators, with the aim of obtaining a driving agent that approximates a human driver for future studies. The approaches evaluated are: sim-to-sim transfer, direct end-to-end training, and a reproduction of the TC-Driver paper.
Reproduce the TC-Driver paper on the full suite of F1TENTH benchmark tracks. The paper introduces a structured SAC strategy and reports better generalizability than existing methods.
The reproduced TC-Driver policy completed 7 of the 23 F1TENTH tracks. Although the paper highlights generalizability, this result shows that the policy does not generalize well: it struggles to adapt to track geometries not present in its training and demonstration set.
| Track | Austin | BrandsHatch | IMS | MexicoCity | Sakhir | SaoPaulo | YasMarine |
|---|---|---|---|---|---|---|---|
| LapTime (s) | 181 | 153 | 124 | 148 | 192 | 155 | 223 |
| Mean Velocity (m/s) | 3.05 | 2.56 | 3.24 | 2.91 | 2.82 | 2.78 | 2.37 |
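As a quick sanity check on the table, lap time multiplied by mean velocity gives an approximate distance driven per lap. The sketch below assumes the reported mean velocity is a time average over the whole lap, which the table does not state; the dictionary and function names are illustrative only.

```python
# Values copied from the table above (track names as written there).
LAP_TIME_S = {
    "Austin": 181, "BrandsHatch": 153, "IMS": 124, "MexicoCity": 148,
    "Sakhir": 192, "SaoPaulo": 155, "YasMarine": 223,
}
MEAN_VELOCITY_MPS = {
    "Austin": 3.05, "BrandsHatch": 2.56, "IMS": 3.24, "MexicoCity": 2.91,
    "Sakhir": 2.82, "SaoPaulo": 2.78, "YasMarine": 2.37,
}


def implied_lap_distance_m(track: str) -> float:
    """Approximate distance driven in one lap: lap time x mean velocity.

    Assumption: the mean velocity is averaged over time for the full lap.
    """
    return LAP_TIME_S[track] * MEAN_VELOCITY_MPS[track]


if __name__ == "__main__":
    for name in LAP_TIME_S:
        print(f"{name}: {implied_lap_distance_m(name):.1f} m")
```

Comparing these distances against the known centerline length of each map would show how much extra path the agent drives, if that reference length is available.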
The recordings of successful tracks are below.
F1TENTHTCDRIVER.mp4
Youtube Link: https://youtu.be/QI4dagA5XG4
The primary failure mode is oscillation with low speed.
When the car sits at an angle relative to the track centerline, the agent fails to steer effectively. Instead of turning back toward the centerline, the steering output fluctuates rapidly between the left and right limits, causing the vehicle to stall.
This shows that the policy lacks the ability to use steering to recover from this condition.
The video below illustrates the agent failing to correct its heading, resulting in the oscillation loop:
Hockenheim.mp4
Youtube Link: https://youtu.be/1RZKaoyIHcE
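The failure mode described above could be flagged automatically from logged rollouts. The following is a minimal sketch, not the evaluation code used in this study: it counts sign changes of the steering command per second and combines that with a low mean speed. The function names and both thresholds are hypothetical and would need tuning against real logs.

```python
from typing import Sequence


def steering_flip_rate(steering: Sequence[float], dt: float) -> float:
    """Sign changes of the steering command per second (zeros are ignored)."""
    signs = [1 if s > 0 else -1 for s in steering if s != 0]
    flips = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    duration = dt * (len(steering) - 1)
    return flips / duration if duration > 0 else 0.0


def is_oscillating(
    steering: Sequence[float],
    speeds: Sequence[float],
    dt: float,
    flip_rate_threshold: float = 2.0,  # hypothetical, in flips per second
    speed_threshold: float = 0.5,      # hypothetical, in m/s
) -> bool:
    """True when steering flips rapidly while the car is barely moving."""
    if not speeds:
        return False
    mean_speed = sum(speeds) / len(speeds)
    fast_flipping = steering_flip_rate(steering, dt) >= flip_rate_threshold
    return fast_flipping and mean_speed <= speed_threshold
```

Applied over a sliding window of each episode, a check like this would let failed tracks be categorized without watching every recording.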
The source policy is an end-to-end PPO agent trained in the F1TENTH Gym with Stable-Baselines3 for two days. The agent converged to a smooth, collision-free policy on the source track, although it drives somewhat slowly.
f1tenth_evaluation.mp4
Youtube Link: https://youtu.be/7TR5dYTiEbI
To enable transfer, the physical parameters, observation space (LiDAR, position, state, etc.), and action space (throttle, steering) of the target simulator were manually aligned to match the source training environment.
Pre-Alignment: Without precise action scaling, the agent exhibited drastic, high-frequency steering oscillations immediately upon initialization.
first_migration.mp4
Youtube Link: https://youtu.be/-Bidc7pVoOM
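Action scaling of the kind mentioned above usually comes down to a linear map between the action range the policy was trained on and the range the target simulator expects. The sketch below is illustrative only: the helper name and the numeric ranges (steering in radians on the source side, a normalized [-1, 1] command on the target side) are assumptions, not the values used in this study.

```python
def rescale(value: float, src_low: float, src_high: float,
            dst_low: float, dst_high: float) -> float:
    """Clip value to [src_low, src_high], then map it linearly to [dst_low, dst_high]."""
    if src_high <= src_low:
        raise ValueError("source range must have positive width")
    clipped = min(max(value, src_low), src_high)
    fraction = (clipped - src_low) / (src_high - src_low)
    return dst_low + fraction * (dst_high - dst_low)


# Hypothetical ranges, for illustration only.
SRC_STEER_RANGE = (-0.4189, 0.4189)  # assumed steering limits of the source policy, in rad
DST_STEER_RANGE = (-1.0, 1.0)        # assumed normalized command of the target simulator


def to_target_steering(policy_steer: float) -> float:
    """Convert a source-policy steering output into the target simulator's units."""
    return rescale(policy_steer, *SRC_STEER_RANGE, *DST_STEER_RANGE)
```

If the two ranges are mismatched by a constant factor, small corrections from the policy can saturate the target's steering, which is consistent with the oscillation seen before alignment.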
Post-Alignment: Even after aligning the observation vectors and action scalars, the policy failed to generalize.
second_migration.mp4
Youtube Link: https://youtu.be/j6FQdK1cLrM
The potential reasons are:
- The internal physics engines differ significantly (e.g., friction coefficients, tire slip models).
- The PPO policy overfitted to the specific dynamics of the F1TENTH Gym.
Training used an end-to-end PPO implementation directly in the rendered environment. After several unsuccessful intermediate training runs, the current reward function produces forward progress during training and avoids the "Zero-Throttle Convergence" problem (where the agent learns to stand still to avoid collision penalties) through:
- A smaller penalty on collision
- A modified reset: each episode starts with throttle applied
- A reward for accumulated distance covered and speed (moving forward)
- Tracking the centerline
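The reward terms listed above can be combined as in the following minimal sketch. It is not the reward function used in this study: the `StepInfo` fields, the weights, and the penalty magnitude are all hypothetical, and the reset modification is not covered here. The point is only that standing still should never score higher than driving forward near the centerline.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Per-step quantities assumed to be available from the simulator (hypothetical)."""
    progress_m: float       # distance advanced along the centerline this step
    speed_mps: float        # forward speed
    lateral_error_m: float  # signed offset from the centerline
    collided: bool


def shaped_reward(
    info: StepInfo,
    w_progress: float = 1.0,         # hypothetical weight
    w_speed: float = 0.1,            # hypothetical weight
    w_center: float = 0.5,           # hypothetical weight
    collision_penalty: float = 1.0,  # kept small so that moving is not dominated by risk
) -> float:
    """Reward forward progress and speed, penalize centerline offset and collisions."""
    reward = w_progress * info.progress_m
    reward += w_speed * max(info.speed_mps, 0.0)
    reward -= w_center * abs(info.lateral_error_m)
    if info.collided:
        reward -= collision_penalty
    return reward
```

With these example weights, a stationary step on the centerline scores exactly zero, so any forward motion along the centerline is strictly preferred.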
The latest run lasted less than a week (10M steps).
This training did not converge to a smooth driving policy.
trained_for_week.mp4
Youtube Link: https://youtu.be/KQ0D8JEPjgc
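The run length stated above implies a lower bound on training throughput. The arithmetic below assumes the full seven days as an upper bound on wall-clock time, so the real rate is somewhat higher; the variable names are illustrative.

```python
TOTAL_STEPS = 10_000_000               # steps reported for the latest run
MAX_WALL_CLOCK_S = 7 * 24 * 3600       # one week in seconds (upper bound on run time)

# Lower bound on average throughput, including policy-update time.
min_steps_per_second = TOTAL_STEPS / MAX_WALL_CLOCK_S

if __name__ == "__main__":
    print(f"at least {min_steps_per_second:.1f} environment steps per second")
```

This comes to roughly 16.5 steps per second at minimum, which gives a concrete sense of how slowly experience accumulates in the rendered simulator.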
The agent developed a "pulsing" control strategy: it outputs throttle in short bursts (approx. 2 Hz) and steers randomly when no throttle is applied. While this strategy avoids high-speed crashes, it fails to complete a lap efficiently. The agent needs more timesteps to learn a better racing strategy.
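A pulse frequency like the one quoted above can be estimated from a logged throttle trace by counting off-to-on transitions. This is a minimal sketch under assumptions: the throttle is sampled at a fixed interval `dt`, and the `on_threshold` value separating "throttle applied" from "no throttle" is hypothetical.

```python
from typing import Sequence


def pulse_frequency_hz(throttle: Sequence[float], dt: float,
                       on_threshold: float = 0.1) -> float:
    """Estimate throttle bursts per second by counting off-to-on transitions."""
    on = [t > on_threshold for t in throttle]
    rising_edges = sum(1 for prev, cur in zip(on, on[1:]) if cur and not prev)
    duration = dt * len(throttle)
    return rising_edges / duration if duration > 0 else 0.0


if __name__ == "__main__":
    # Synthetic trace: 0.25 s off, 0.25 s on, repeated for 2 s at 20 samples per second.
    trace = ([0.0] * 5 + [1.0] * 5) * 4
    print(pulse_frequency_hz(trace, dt=0.05))
```

Tracking this number across checkpoints would show whether longer training actually smooths the throttle signal.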
- TC-Driver Reproduction: Achieved a 30% success rate (7/23 tracks). Although this method shows better generalizability than other methods, the reproduction failed to validate "zero-shot generalization" and identified overfitting issues in the original baseline.
- Sim-to-Sim Transfer: Policy transfer from the high-speed gym to high-fidelity rendered simulators proved infeasible due to simulation gaps.
- End-to-End Visual Training: Direct training in rendered simulators is currently bottlenecked by simulation speed (real-time training), preventing convergence within feasible timeframes.
Based on the limitations identified above, my future research should move away from pure end-to-end RL and focus on structured learning paradigms (like TC-Driver) for faster training.
In addition, the car's motion during training shows a significant difference between PPO learning (starting from random initialization) and human learning (from sense to actual input). My future work should focus on human studies and on building learning algorithm structures that better mimic human learning patterns in car racing.