Evaluation of Using RL Strategies as Approximate Human in Autonomous Racing

Author: Enlin Gu

Course: Independent Study (Fall 2025)

This study implemented and evaluated three reinforcement learning (RL) strategies in robot racing simulators as approximations of human drivers for future studies. The approaches evaluated are sim-to-sim transfer, direct end-to-end training, and a reproduction of the TC-Driver paper.

1. Experiment 1: TC-Driver Reproduction (on ALL F1TENTH Race Tracks)

This experiment reproduces the TC-Driver paper on the full suite of F1TENTH benchmark tracks. The paper introduces a structured SAC strategy and reports better generalizability than existing methods.

1.1 Experiment Results

TC-Driver succeeded on 7 out of 23 F1TENTH tracks. Although the paper highlights generalizability, this result shows that the policy does not generalize well: it struggles to adapt to track geometries not present in its training and demonstration set.

Track	Austin	BrandsHatch	IMS	MexicoCity	Sakhir	SaoPaulo	YasMarine
LapTime (s)	181	153	124	148	192	155	223
Mean Velocity (m/s)	3.05	2.56	3.24	2.91	2.82	2.78	2.37
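As a quick sanity check on the table, the distance covered per lap can be estimated as lap time multiplied by mean velocity. The sketch below assumes that the reported mean velocity is a time average over the lap; the helper name `implied_lap_length` is introduced here for illustration only.

```python
# Lap times (s) and mean velocities (m/s) copied from the table above.
results = {
    "Austin": (181, 3.05),
    "BrandsHatch": (153, 2.56),
    "IMS": (124, 3.24),
    "MexicoCity": (148, 2.91),
    "Sakhir": (192, 2.82),
    "SaoPaulo": (155, 2.78),
    "YasMarine": (223, 2.37),
}

TOTAL_TRACKS = 23  # number of tracks attempted, from the text above


def implied_lap_length(lap_time_s, mean_velocity_mps):
    """Distance covered in one lap, assuming the mean is a time average over the lap."""
    return lap_time_s * mean_velocity_mps


lap_lengths = {name: implied_lap_length(t, v) for name, (t, v) in results.items()}
success_rate = len(results) / TOTAL_TRACKS

for name, length in lap_lengths.items():
    print(f"{name}: {length:.1f} m")
print(f"success rate: {success_rate:.1%}")
```

If the implied distance differs noticeably from the known centerline length of a map, the agent is probably weaving rather than following a tight line.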

Recordings of the successful runs are below.

F1TENTHTCDRIVER.mp4

Youtube Link: https://youtu.be/QI4dagA5XG4

1.2 Key Failure Analysis:

The primary failure mode is low-speed oscillation.

When the car is at an angle relative to the track centerline, the agent fails to steer effectively. Instead of correcting its heading, the steering output fluctuates rapidly between the left and right limits, causing the vehicle to stall.

This shows that the policy lacks the ability to use steering to recover from this condition.
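One way to flag this failure mode automatically in logged rollouts is to count steering sign flips in a window and combine that with a low mean speed. This is a minimal sketch, not part of the original evaluation; the function names, the deadband, and both thresholds are hypothetical and would need tuning.

```python
def count_sign_flips(steering, deadband=0.05):
    """Count sign changes in a steering sequence, ignoring values inside the deadband."""
    flips = 0
    last_sign = 0
    for value in steering:
        if abs(value) <= deadband:
            continue  # near-zero commands carry no clear direction
        sign = 1 if value > 0 else -1
        if last_sign != 0 and sign != last_sign:
            flips += 1
        last_sign = sign
    return flips


def is_oscillating_stall(steering, speeds, flip_ratio=0.5, speed_limit=0.5):
    """Flag a window with frequent steering sign flips and a low mean speed.

    flip_ratio and speed_limit (m/s) are hypothetical thresholds.
    """
    if len(steering) < 2 or len(steering) != len(speeds):
        raise ValueError("steering and speeds must have equal length >= 2")
    flips = count_sign_flips(steering)
    mean_speed = sum(speeds) / len(speeds)
    return flips / (len(steering) - 1) >= flip_ratio and mean_speed < speed_limit
```

For example, a window that alternates between +0.4 and -0.4 steering at 0.2 m/s is flagged, while constant steering at 3 m/s is not.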

The video below illustrates the agent failing to correct its heading, resulting in the oscillation loop:

Hockenheim.mp4

Youtube Link: https://youtu.be/1RZKaoyIHcE

2. Experiment 2: Sim-to-Sim Transfer

2.1 Train PPO in F1TENTH

The policy is an end-to-end PPO agent trained in F1TENTH with Stable-Baselines3 for two days. The agent converged to a smooth, collision-free policy on the source track, although it drives somewhat slowly.

f1tenth_evaluation.mp4

Youtube Link: https://youtu.be/7TR5dYTiEbI

2.2 Direct Migration

Pre-Alignment: Without precise action scaling, the agent exhibited drastic, high-frequency steering oscillations immediately upon initialization.

first_migration.mp4

Youtube Link: https://youtu.be/-Bidc7pVoOM

To enable transfer, the physical parameters, observation space (LiDAR, position, state, etc.), and action space (throttle, steering) of the target simulator were manually aligned to match the source training environment.
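The action-space part of such an alignment can be sketched as a linear rescaling from the range the policy was trained with to the range the target simulator expects. All ranges and names below are hypothetical placeholders and are not taken from either simulator.

```python
def rescale(value, src_low, src_high, dst_low, dst_high):
    """Linearly map value from [src_low, src_high] to [dst_low, dst_high], clipping first."""
    clipped = min(max(value, src_low), src_high)
    fraction = (clipped - src_low) / (src_high - src_low)
    return dst_low + fraction * (dst_high - dst_low)


# Hypothetical ranges for illustration only.
SOURCE_STEER = (-0.4, 0.4)     # steering range seen during training
TARGET_STEER = (-1.0, 1.0)     # steering range expected by the target simulator
SOURCE_THROTTLE = (0.0, 8.0)   # throttle/speed range seen during training
TARGET_THROTTLE = (0.0, 1.0)   # throttle range expected by the target simulator


def align_action(steer, throttle):
    """Convert one policy action into the target simulator's action ranges."""
    return (
        rescale(steer, *SOURCE_STEER, *TARGET_STEER),
        rescale(throttle, *SOURCE_THROTTLE, *TARGET_THROTTLE),
    )
```

Clipping before scaling keeps out-of-range policy outputs from producing commands beyond the target simulator's limits, which is one plausible source of saturated steering.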

2.3 Migration After Alignment

Post-Alignment: Even after aligning the observation vectors and action scalars, the policy failed to generalize.

second_migration.mp4

Youtube Link: https://youtu.be/j6FQdK1cLrM

The potential reasons are:

The internal physics engines differ significantly (e.g., friction coefficients, tire slip models).
The PPO policy overfitted to the specific dynamics of the F1TENTH Gym.

3. Experiment 3: Direct End-to-End training

3.1 Methodology Details

Training used an end-to-end PPO implementation directly in the rendered environment. After several unsuccessful intermediate training runs, the current reward function rewards progress and avoids the "Zero-Throttle Convergence" problem (where the agent learns to stand still to avoid collision penalties) through:

A smaller penalty on collision
A modified reset that starts the car with throttle applied
A reward for accumulated distance covered and forward speed
A term for tracking the centerline
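The reward terms listed above could be combined along the following lines. This is a sketch only: the weights, argument names, and the exact form of each term are hypothetical and are not the values used in training. The modified reset is an environment change and is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # All weights are hypothetical placeholders, not the values used in training.
    progress: float = 1.0
    speed: float = 0.1
    centerline: float = 0.5
    collision: float = 1.0


DEFAULT_WEIGHTS = RewardWeights()


def shaped_reward(progress_m, speed_mps, lateral_error_m, collided, w=DEFAULT_WEIGHTS):
    """Per-step reward: progress and speed bonuses, centerline and collision penalties."""
    reward = w.progress * progress_m              # distance gained along the track this step
    reward += w.speed * max(speed_mps, 0.0)       # only forward speed is rewarded
    reward -= w.centerline * abs(lateral_error_m) # distance from the centerline
    if collided:
        reward -= w.collision                     # kept small so standing still is not optimal
    return reward
```

With these placeholder weights, a stationary step scores 0 while a step that moves 0.2 m forward at 2 m/s with 0.1 m lateral error scores 0.35, so moving is preferred over standing still.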

The latest run lasted less than a week (10M steps).

3.2 Most-Recent Result

This training did not converge to a smooth driving policy.

trained_for_week.mp4

Youtube Link: https://youtu.be/KQ0D8JEPjgc

The agent developed a "pulsing" control strategy: it outputs throttle in short bursts (at approximately 2 Hz) and steers randomly when no throttle is applied. While this strategy avoids high-speed crashes, it fails to complete a lap efficiently. The agent needs more timesteps to learn a better racing strategy.
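A burst frequency like the one quoted above can be estimated from a logged throttle trace by counting rising edges through a threshold. The sketch below assumes a uniformly sampled trace; the function name, threshold, and sample rate are hypothetical.

```python
def pulse_frequency_hz(throttle, sample_rate_hz, threshold=0.1):
    """Estimate throttle bursts per second by counting rising edges through the threshold."""
    if sample_rate_hz <= 0 or len(throttle) < 2:
        raise ValueError("need a positive sample rate and at least 2 samples")
    rising_edges = sum(
        1
        for prev, curr in zip(throttle, throttle[1:])
        if prev <= threshold < curr
    )
    duration_s = len(throttle) / sample_rate_hz
    return rising_edges / duration_s


# Synthetic trace: 5 samples off, 5 samples on, sampled at 20 Hz for 5 seconds.
synthetic = ([0.0] * 5 + [1.0] * 5) * 10
estimated_hz = pulse_frequency_hz(synthetic, sample_rate_hz=20)
```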

4. Conclusion and Future Prospects

4.1 Key Results

TC-Driver Reproduction: Achieved a 30% success rate (7/23 tracks). Although this method shows better generalizability than other methods, the reproduction could not validate the claimed "zero-shot generalization" and identified overfitting issues in the original baseline.
Sim-to-Sim Transfer: Policy transfer from the high-speed gym to high-fidelity rendered simulators proved infeasible due to simulation gaps.
End-to-End Visual Training: Direct training in rendered simulators is currently bottlenecked by simulation speed (real-time training), preventing convergence within feasible timeframes.

4.2 Proposed Improvements

Based on the limitations identified above, my future research should move away from pure end-to-end RL and focus on structured learning paradigms (like TC-Driver) for faster training.

In addition, the car's motion during training reveals a significant difference between PPO learning (which starts from random initialization) and human learning (which maps from sensing to actual input). My future work should focus on human studies and on building learning algorithm structures that better mimic human learning patterns in car racing.
