Multi-Agent Reinforcement Learning & DRFM

BAE- Towed Decoy (AI Gen.)

2 Drone Agents Success/Failure

This project aims to build a realistic Digital Radio Frequency Memory (DRFM) module, mounted on a drone and controlled by reinforcement learning (RL). The drone itself also maneuvers under a learned policy whose task is to survive radar tracking. The maneuvering model is based on [9], which in turn builds on the research paper [4]. The drone must survive an electromagnetic environment in which deterministic radars attempt to gain a lock. The DRFM module is trained to defeat them using realistic jamming techniques: transponder and repeater false targeting, selecting among Off, RGPO (range gate pull-off), VGPO (velocity gate pull-off), and RVGPO (coordinated range-velocity gate pull-off).
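The pull-off techniques named above come down to simple kinematics, which the sketch below illustrates. It is not taken from the repository: the function names are hypothetical, the pull-off is assumed to be linear in time (real jammers often use other profiles), and the sign convention treats an opening (receding) false target as a negative Doppler shift.

```python
# Illustrative sketch only: names and the linear walk-off profile are assumptions.
C = 299_792_458.0  # speed of light in m/s


def rgpo_extra_delay(pull_off_rate_mps, t_s):
    """Extra round-trip delay (s) that places the false target
    pull_off_rate_mps * t_s metres beyond the true range."""
    range_offset_m = pull_off_rate_mps * t_s
    return 2.0 * range_offset_m / C


def vgpo_doppler_shift(velocity_offset_mps, wavelength_m):
    """Doppler offset (Hz) for a false target receding at velocity_offset_mps.
    Positive velocity means opening range, hence a negative shift."""
    return -2.0 * velocity_offset_mps / wavelength_m


def rvgpo(pull_off_rate_mps, t_s, wavelength_m):
    """Coordinated pull-off: the Doppler offset matches the range walk-off rate,
    so the false target looks kinematically consistent to the radar."""
    delay = rgpo_extra_delay(pull_off_rate_mps, t_s)
    doppler = vgpo_doppler_shift(pull_off_rate_mps, wavelength_m)
    return delay, doppler


if __name__ == "__main__":
    delay, doppler = rvgpo(pull_off_rate_mps=30.0, t_s=2.0, wavelength_m=0.03)
    print(f"extra delay: {delay:.3e} s, Doppler offset: {doppler:.1f} Hz")
```

The coordinated case is the reason RVGPO exists as a separate technique: a radar that cross-checks range rate against Doppler can reject a false target whose two measurements disagree.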

Foundations

We strongly recommend reading the docs/ directory to get up to speed on how the foundational pieces of this project are implemented and where the inspiration came from.

RGPO

VGPO

Coordinated

Setup

Requirements: NVIDIA GPU (tested on RTX 4090), CUDA 12.x, Python 3.11.

Install Isaac Sim (binary) and IsaacLab following the official guide.
Create the conda environment and set Isaac Sim paths:

conda env create -f environment.yaml -n [name]
conda activate [name]
export ISAACSIM_PATH="${HOME}/isaacsim/_build/linux-x86_64/release"
export ISAACSIM_PYTHON_EXE="${ISAACSIM_PATH}/python.sh"
ln -s ${ISAACSIM_PATH} _isaac_sim

Run all scripts from the repo root. Python resolves local packages (drfm, dynamics) via the working directory.
Verify:

python scripts/train.py --task singleDRFM --headless --num_envs 4 --max_iterations 5

Usage

Early PPO Godmode Case

The environment is split into two phases: (1) navigation and (2) DRFM. This lets us test different agents and architectures on the individual problems. Later, the agent will be packaged so that it runs without regard to which phase is in use.

Full task (navigation + DRFM):

python3 scripts/train.py --task singleDRFM --headless --num_envs 4096 --algorithm PPO_GRU --log-level INFO
python3 scripts/play.py --task singleDRFM --num_envs 1 --algorithm PPO_GRU --debug

Scaffolding (PPO+GRU)

python3 scripts/train.py --task singleDRFM_stage1 --headless --num_envs 8192 --algorithm PPO_GRU --log-level INFO
python3 scripts/train.py --task singleDRFM_stage2 --headless --num_envs 8192 --algorithm PPO_GRU --log-level INFO --checkpoint path/to/stage1/best_agent.pt

python3 scripts/play.py --task singleDRFM_stage1 --num_envs 1 --algorithm PPO_GRU --debug
python3 scripts/play.py --task singleDRFM_stage2 --num_envs 1 --algorithm PPO_GRU --debug

MAPPO (MARL)

python3 scripts/train.py --task multiDRFM --headless --num_envs 2048 --algorithm MAPPO --log-level INFO
python3 scripts/play.py --task multiDRFM --num_envs 1 --algorithm MAPPO --debug

Environment

Observations

Term	Description
`target_pos_b`	Next waypoint in body frame
`waypoints_remaining`	Count of remaining waypoints
`attitude`	Quaternion orientation
`altitude`	Altitude error from target z=3 m
`vertical_vel`	Vertical velocity
`lin_vel`	Linear velocity in body frame
`ang_vel`	Angular velocity in body frame
`rwr`	Radar warning receiver per radar
`drfm_state`	DRFM jammer state

Actions (11D)

Term	Dim	Description
`control_action`	4	Per-motor thrust [-1, 1]
`drfm_technique`	4	Logits → OFF / RGPO / VGPO / RVGPO
`drfm_params`	3	Pull-off rate, velocity pull-off, coordination ratio
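
A minimal sketch of how the 11D action vector in the table above could be unpacked. The slice order simply follows the table, and the greedy argmax over the technique logits, the parameter key names, and the function name `split_action` are all assumptions for illustration; the repository may sample from the logits and order the slices differently.

```python
# Illustrative sketch only: slice order and key names are assumptions.
TECHNIQUES = ("OFF", "RGPO", "VGPO", "RVGPO")


def split_action(action):
    """Split a flat 11D action into motor commands, a DRFM technique,
    and the three continuous DRFM parameters."""
    if len(action) != 11:
        raise ValueError(f"expected 11 action values, got {len(action)}")
    # Per-motor thrust commands, clamped to the documented [-1, 1] range.
    control = [max(-1.0, min(1.0, a)) for a in action[0:4]]
    # Four logits, one per technique; pick the largest (greedy choice).
    logits = action[4:8]
    technique = TECHNIQUES[max(range(4), key=lambda i: logits[i])]
    # Continuous parameters in the order listed in the table.
    pull_off_rate, velocity_pull_off, coordination_ratio = action[8:11]
    return {
        "control_action": control,
        "drfm_technique": technique,
        "drfm_params": {
            "pull_off_rate": pull_off_rate,
            "velocity_pull_off": velocity_pull_off,
            "coordination_ratio": coordination_ratio,
        },
    }


if __name__ == "__main__":
    raw = [0.5, -2.0, 0.0, 1.5, 0.1, 2.0, -1.0, 0.3, 0.2, 0.4, 0.6]
    print(split_action(raw))
```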

Rewards

Term	Weight	Description
`waypoint_reached`	+50	Per waypoint bonus
`completion_bonus`	+100	All waypoints done
`progress`	+5	Forward progress toward waypoint
`forward_speed`	+2	Speed toward goal (target 5 m/s)
`heading`	+2	Aligned heading to goal
`drfm_effective`	+2	Jamming an active radar
`smart_jam`	+1	Jamming the right radar for the threat
`power_conserve`	+0.5	Low DRFM power when not needed
`upright`	+1	Upright orientation
`terminating`	-200	Bad termination (collision / radar lock)
`altitude_band`	-5	Deviation from z=3 m ±1 m
`illumination_low`	-2	Radar illumination on drone
`proximity`	-3	Within 2.5–6 m of obstacle
`ang_vel_l2`	-0.02	Angular velocity magnitude
`action_smooth`	-0.01	Action jitter between steps
`step_penalty`	-0.01	Time alive penalty
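
The table above can be read as a weighted sum. The sketch below shows that reading under stated assumptions: each term produces a non-negative raw value (an indicator or a magnitude), the sign lives in the weight, and no time-step scaling is applied. The function name `step_reward` is hypothetical, and the repository's reward manager may scale or combine terms differently.

```python
# Illustrative sketch only: weights are copied from the table, the rest is assumed.
REWARD_WEIGHTS = {
    "waypoint_reached": 50.0,
    "completion_bonus": 100.0,
    "progress": 5.0,
    "forward_speed": 2.0,
    "heading": 2.0,
    "drfm_effective": 2.0,
    "smart_jam": 1.0,
    "power_conserve": 0.5,
    "upright": 1.0,
    "terminating": -200.0,
    "altitude_band": -5.0,
    "illumination_low": -2.0,
    "proximity": -3.0,
    "ang_vel_l2": -0.02,
    "action_smooth": -0.01,
    "step_penalty": -0.01,
}


def step_reward(term_values):
    """Weighted sum of raw reward terms for one step.
    Terms missing from term_values count as zero; unknown names are rejected."""
    unknown = set(term_values) - set(REWARD_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown reward terms: {sorted(unknown)}")
    return sum(REWARD_WEIGHTS[name] * value for name, value in term_values.items())


if __name__ == "__main__":
    # One waypoint reached on a step that also pays the time-alive penalty.
    print(step_reward({"waypoint_reached": 1.0, "step_penalty": 1.0}))
```

Written this way, the imbalance the weights imply is easy to see: a single bad termination (-200) outweighs four waypoint bonuses, so the scale of the small per-step terms matters mostly through how long an episode lasts.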

Algorithms

Training plots: Episode Return, Episode Length, DRFM Technique Usage, Policy Loss.

We used Proximal Policy Optimization (PPO) as the backbone throughout the project, with Soft Actor-Critic (SAC) added later for ablation and replay-buffer comparison. Both agents support hybrid discrete-continuous actions, which is critical for the DRFM module: technique selection is discrete (OFF, RGPO, VGPO, RVGPO), while each technique's parameters (pull-off rate, velocity pull-off rate, coordination ratio) are continuous. Together, PPO and SAC give reasonable coverage of the design space, since one is on-policy and the other is off-policy.
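
One common way to handle such a hybrid action is a factorized policy: a categorical head for the technique and independent Gaussian heads for the parameters, whose log-probabilities add. The sketch below shows that calculation; the factorization itself and all function names are assumptions for illustration, not a description of how the agents in this repository are built.

```python
# Illustrative sketch only: a factorized categorical + Gaussian policy is assumed.
import math


def categorical_log_prob(logits, index):
    """Log-probability of choosing `index` under softmax(logits),
    computed with the log-sum-exp trick for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[index] - log_z


def gaussian_log_prob(x, mean, std):
    """Log-density of x under a univariate normal distribution."""
    var = std * std
    return -0.5 * ((x - mean) ** 2 / var + math.log(2.0 * math.pi * var))


def hybrid_log_prob(logits, technique_idx, params, means, stds):
    """Joint log-probability of a discrete technique and continuous parameters,
    assuming the two parts are independent given the observation."""
    log_p = categorical_log_prob(logits, technique_idx)
    log_p += sum(gaussian_log_prob(x, m, s) for x, m, s in zip(params, means, stds))
    return log_p


if __name__ == "__main__":
    lp = hybrid_log_prob([0.1, 2.0, -1.0, 0.3], 1,
                         params=[0.2, 0.4, 0.6],
                         means=[0.0, 0.5, 0.5],
                         stds=[1.0, 0.5, 0.5])
    print(f"joint log-prob: {lp:.4f}")
```

A joint log-probability of this kind is what PPO's probability ratio would be computed from if both heads are updated together.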

We also implemented PPO_GRU (PPO with a GRU recurrent encoder) specifically to handle partial observability in the radar environment. The drone receives Radar Warning Receiver (RWR) observations (bearing, power, illumination rate, and pulse interval variance), which are noisy single-timestep snapshots. A memoryless MLP policy cannot tell whether a radar is ramping up toward a lock or cooling down from a failed track. The GRU encodes the temporal RWR stream (32D) into a hidden state, while static observations (attitude, velocity, DRFM state) are passed through directly. In theory, this split-stream design lets the agent build a model of radar behavior over time without forcing the navigation state through the recurrence.
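
The split-stream idea can be sketched in plain Python. This is a toy, not the repository's encoder: it uses a textbook GRU cell with random weights and no biases, and the hidden size (16), the static observation size (10), and the names `TinyGRUCell` and `encode_split_stream` are assumptions. Only the 32D RWR stream comes from the description above.

```python
# Illustrative sketch only: toy GRU with random weights, sizes partly assumed.
import math
import random


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class TinyGRUCell:
    """Textbook GRU cell (biases omitted); weights act on the concatenation [x, h]."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        cols = input_dim + hidden_dim

        def init():
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(hidden_dim)]

        self.hidden_dim = hidden_dim
        self.w_update, self.w_reset, self.w_cand = init(), init(), init()

    def step(self, x, h):
        xh = list(x) + list(h)
        z = [_sigmoid(a) for a in _matvec(self.w_update, xh)]  # update gate
        r = [_sigmoid(a) for a in _matvec(self.w_reset, xh)]   # reset gate
        gated = list(x) + [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a) for a in _matvec(self.w_cand, gated)]  # candidate state
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def encode_split_stream(cell, rwr_sequence, static_obs):
    """Run only the RWR frames through the GRU, then append the static
    observations unchanged to form the policy input."""
    h = [0.0] * cell.hidden_dim
    for frame in rwr_sequence:
        h = cell.step(frame, h)
    return h + list(static_obs)


if __name__ == "__main__":
    cell = TinyGRUCell(input_dim=32, hidden_dim=16)
    ramp_up = [[0.0] * 32, [0.5] * 32]
    cool_down = [[1.0] * 32, [0.5] * 32]
    a = encode_split_stream(cell, ramp_up, [0.0] * 10)
    b = encode_split_stream(cell, cool_down, [0.0] * 10)
    print(len(a), a[:16] != b[:16])
```

The two sequences end on the same RWR frame but arrive there from opposite directions. A memoryless policy would see identical inputs, whereas the recurrent state differs, which is the distinction the paragraph above describes.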

MAPPO currently works better than plain PPO and PPO_GRU. Training for 25K timesteps with 5 drones takes roughly 45 minutes on an RTX 4090. All 5 drones share parameters and are trained with a centralized critic whose input is 280-dimensional in total. Each drone manages its own observation of the radars and of its DRFM status, i.e. which DRFM technique is active and which radar is currently illuminating it. Each drone receives a shared reward, the average of all individual rewards, to encourage team effort. Realistically, this is far from ideal, but it works as a starting point: each drone should maximize its own effort given its belief and the state of the environment, not get bogged down by the other drones' abilities (Mini4). I could not find a library online that would make it practical to incorporate QMIX or similar; RLlib (Ray) exists but is vastly different from the current implementation.
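
The shared reward and the centralized critic input can be sketched as follows. The function names are hypothetical, and the assumption that the 280 critic dimensions are simply five equal per-drone observations concatenated (56 each) is an inference for illustration, not something the repository confirms.

```python
# Illustrative sketch only: equal 56D per-drone observations are an assumption.
def shared_team_reward(individual_rewards):
    """Give every drone the mean of all individual rewards."""
    team = sum(individual_rewards) / len(individual_rewards)
    return [team] * len(individual_rewards)


def centralized_critic_input(per_agent_obs):
    """Concatenate all per-drone observations into one joint vector
    for the centralized critic; actors still see only their own slice."""
    joint = []
    for obs in per_agent_obs:
        joint.extend(obs)
    return joint


if __name__ == "__main__":
    rewards = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(shared_team_reward(rewards))  # every drone receives the same value

    observations = [[0.0] * 56 for _ in range(5)]
    print(len(centralized_critic_input(observations)))
```

Averaging makes the drawback mentioned above concrete: a drone that performs well and one that crashes receive the same signal, so credit assignment is left entirely to the critic.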

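The other agents mentioned (DQN, REINFORCE, vanilla Actor-Critic, DDPG, TD3, and TRPO) were ruled out for at least one of these reasons: discrete-only actions, continuous-only actions, or higher variance. For example, DQN handles only discrete actions, DDPG and TD3 handle only continuous ones, and REINFORCE has higher-variance gradient estimates. PPO is also more widely used than the others.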

Project Organization

  ├── LICENSE / NOTICE
  ├── README.md
  ├── environment.yaml
  ├── docs/               # Research notes & technical challenges
  │   ├── drfm.md
  │   ├── radar.md
  │   ├── meta.md
  │   ├── references.md
  │   └── technical-challenges.md
  │
  ├── media/
  ├── scripts/
  │   ├── train.py
  │   └── play.py
  │
  ├── outputs/
  ├── drfm/
  │   ├── assets/
  │   │   └── configuration/
  │   ├── robots/
  │   ├── dynamics/
  │   ├── algorithms/
  │   ├── agents/
  │   ├── utils/
  │   └── isaac/
  │       ├── drfm_env.py
  │       ├── agents/
  │       └── mdp/
  │
  └── models/
      ├── architectures/
      ├── checkpoints/
      └── replay_buffers/

Note on AI Use and Assets

Claude helped refactor the code from the old SKRL version 1.4.3 to 2.1.0, although it was hard to tell whether the refactor worked, since the agents broke as a result. The initial baseline of this project was based on Isaac Drone Racer, and we largely kept the same allocation and motor usage; little has changed there besides some testing of how thrust initializes. The MDP structure for actions, observations, and rewards also came from Isaac Drone Racer. On top of it we added waypoints and a good deal of reward structure, primarily so that the drone is limited in altitude (minz and maxz) and urged toward the waypoint while avoiding collisions with objects.

Both the deterministic and the probabilistic radar were designed based on the Phillip E. Pace book, and I had Claude validate some of the mathematics for the DRFM and drone interactions. The equations were pulled directly from the book or from online sources (SPJ). I did not create any of the assets used (drone mesh, USD, URDF, and so on). Most media was created by me, apart from the AI-generated image of the BAE Systems decoy used as the header of this file.

Future

1. Get MAPPO and PPO_GRU working properly.
    - I need an easier way to validate what is going on, visualize
      inconsistencies, and debug.
    - MAPPO does work, unlike the dead PPO (GRU), but collisions with obstacles
      are not solved, and the drones and the radars each need to communicate
      among themselves.
2. Change the reward structure so it is not unbearably fragile.
3. Make the environment more realistic.
4. Add IQ waveforms from USRP-recorded signals in place of the janky radar
   interactions we currently have.
5. Radars should share information with each other to better match a realistic
   environment.
    - The calculations are also heavily simplified to allow scaffolded training,
      and we never reverted those values.

References

Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." International Conference on Machine Learning. PMLR, 2018.
Wang, Chao, et al. "Autonomous navigation of UAV in large-scale unknown complex environment with deep reinforcement learning." GlobalSIP, 2017.
Kaufmann, E., et al. "Champion-level drone racing using deep reinforcement learning." Nature, 2023.
Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.
Merrick, R. Getting Started with FPGAs: Digital Circuit Design, Verilog, and VHDL for Beginners. No Starch Press, 2023.
Pace, P. E. Developing Digital RF Memories and Transceiver Technologies for Electromagnetic Warfare. Artech House, 2022.
Salimpour, Sahar, et al. "Sim-to-real transfer for mobile robots with reinforcement learning: from nvidia isaac sim to gazebo and real ros 2 robots." arXiv preprint arXiv:2501.02902 (2025).
PPO SKRL
Isaac Drone Racer
Isaac Sim: Foundation Model
Isaac Lab: RL Environments
Isaac Lab: Actuators
Radar Equations - MIT Lincoln Lab
Radar Jamming and Deception - Wikipedia
DRFM: History, Circuit & Testing - Rohde & Schwarz
TD Learning - Stanford CME241
Bellman Equation - Wikipedia
Bellman's Principle of Optimality - Wikipedia
MDP Algorithms: Value & Policy Iteration - Wikipedia
AN/ALE-55 Fiber-Optic Towed Decoy (FOTD) Image - BAE SYSTEMS
Radar Tutorials: Self Protection Jammer
Claude (Anthropic)
