ash-al1/drfm

Multi-Agent Reinforcement Learning & DRFM


BAE Towed Decoy (AI-generated)

2 Drone Agents Success/Failure

This project aims to build a realistic Digital Radio Frequency Memory (DRFM) module embedded on a drone and driven by Reinforcement Learning algorithms. The drone itself also maneuvers using an algorithm tasked with surviving radar tracking. Maneuvering is based on [9], which in turn builds on the research paper [4]. The drone has to survive an electromagnetic environment in which deterministic radars attempt to gain a lock. The DRFM module is trained to survive using realistic jamming techniques: transponder and repeater false targeting, with a choice of Off, RGPO (range gate pull-off), VGPO (velocity gate pull-off) and RVGPO (combined range-velocity gate pull-off).

Foundations

We highly recommend reading through the docs/ directory to get up to speed quickly on how the foundational pieces of this project are implemented, or to build a mental map of where the inspiration came from.


RGPO

VGPO

Coordinated

Setup

Requirements: NVIDIA GPU (tested on RTX 4090), CUDA 12.x, Python 3.11.

  1. Install Isaac Sim (binary) and IsaacLab following the official guide.

  2. Create the conda environment and set Isaac Sim paths:

conda env create -f environment.yaml -n [name]
conda activate [name]
export ISAACSIM_PATH="${HOME}/isaacsim/_build/linux-x86_64/release"
export ISAACSIM_PYTHON_EXE="${ISAACSIM_PATH}/python.sh"
ln -s "${ISAACSIM_PATH}" _isaac_sim
  3. Run all scripts from the repo root. Python resolves local packages (drfm, dynamics) via the working directory.

  4. Verify:

python scripts/train.py --task singleDRFM --headless --num_envs 4 --max_iterations 5
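If the verify command fails early, a quick preflight check can help separate path problems from training problems. The following is only a sketch: the `preflight` helper is hypothetical (it is not part of this repository) and it simply checks the two environment variables and the repo-root layout described in the steps above.

```python
import os
from pathlib import Path


def preflight(env, repo_root):
    """Return a list of problems; an empty list means the setup looks sane.

    Hypothetical helper, not part of the repository. It only checks what the
    Setup steps describe: the Isaac Sim variables and the repo-root layout.
    """
    problems = []
    for var in ("ISAACSIM_PATH", "ISAACSIM_PYTHON_EXE"):
        if not env.get(var):
            problems.append(f"{var} is not set")
    root = Path(repo_root)
    # Local packages are resolved via the working directory, so these
    # entries must be visible from wherever the scripts are launched.
    for rel in ("scripts/train.py", "drfm"):
        if not (root / rel).exists():
            problems.append(f"{rel} not found under {root}; run from the repo root")
    return problems


if __name__ == "__main__":
    for problem in preflight(os.environ, Path.cwd()):
        print("WARNING:", problem)
```

Run it from the directory where you intend to launch `scripts/train.py`; no output means both variables are set and the expected entries are present.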

Usage


Early PPO Godmode Case

The environment is split into two phases: (1) navigation and (2) DRFM. This lets us test different agents and architectures on individual problems. Later, the agent will be packaged without regard to which phase is in use.

Full task (navigation + DRFM):

python3 scripts/train.py --task singleDRFM --headless --num_envs 4096 --algorithm PPO_GRU --log-level INFO
python3 scripts/play.py --task singleDRFM --num_envs 1 --algorithm PPO_GRU --debug

Scaffolding (PPO+GRU)

python3 scripts/train.py --task singleDRFM_stage1 --headless --num_envs 8192 --algorithm PPO_GRU --log-level INFO
python3 scripts/train.py --task singleDRFM_stage2 --headless --num_envs 8192 --algorithm PPO_GRU --log-level INFO --checkpoint path/to/stage1/best_agent.pt

python3 scripts/play.py --task singleDRFM_stage1 --num_envs 1 --algorithm PPO_GRU --debug
python3 scripts/play.py --task singleDRFM_stage2 --num_envs 1 --algorithm PPO_GRU --debug

MAPPO (MARL)

python3 scripts/train.py --task multiDRFM --headless --num_envs 2048 --algorithm MAPPO --log-level INFO
python3 scripts/play.py --task multiDRFM --num_envs 1 --algorithm MAPPO --debug

Environment

Observations

| Term | Description |
| --- | --- |
| target_pos_b | Next waypoint in body frame |
| waypoints_remaining | Count of remaining waypoints |
| attitude | Quaternion orientation |
| altitude | Altitude error from target z = 3 m |
| vertical_vel | Vertical velocity |
| lin_vel | Linear velocity in body frame |
| ang_vel | Angular velocity in body frame |
| rwr | Radar warning receiver per radar |
| drfm_state | DRFM jammer state |

Actions (11D)

| Term | Dim | Description |
| --- | --- | --- |
| control_action | 4 | Per-motor thrust in [-1, 1] |
| drfm_technique | 4 | Logits → OFF / RGPO / VGPO / RVGPO |
| drfm_params | 3 | Pull-off rate, velocity pull-off, coordination ratio |
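To make the 11D layout concrete, here is a minimal sketch of how such a vector could be decoded. It assumes the three terms are concatenated in the order listed in the table and that the technique is picked greedily from the logits; the function name `split_action` and the parameter keys are illustrative, not the project's actual API. During training a policy would more likely sample from the logits than take the argmax.

```python
TECHNIQUES = ("OFF", "RGPO", "VGPO", "RVGPO")
PARAM_NAMES = ("pull_off_rate", "velocity_pull_off", "coordination_ratio")


def split_action(action):
    """Split an 11D action into motor thrusts, a DRFM technique and its parameters.

    Assumption: the layout is [control_action(4), drfm_technique(4), drfm_params(3)],
    i.e. the same order as the Actions table.
    """
    if len(action) != 11:
        raise ValueError(f"expected 11 action values, got {len(action)}")
    # Per-motor thrust is clamped to the documented [-1, 1] range.
    motors = [max(-1.0, min(1.0, a)) for a in action[0:4]]
    # Discrete part: greedy choice over the four technique logits.
    logits = action[4:8]
    technique = TECHNIQUES[max(range(4), key=lambda i: logits[i])]
    # Continuous part: the three technique parameters, passed through as-is.
    params = dict(zip(PARAM_NAMES, action[8:11]))
    return motors, technique, params


motors, technique, params = split_action(
    [0.5, -2.0, 0.0, 1.5, 0.1, 2.0, -1.0, 0.3, 0.2, 0.4, 0.6]
)
print(motors, technique, params)
```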

Rewards

| Term | Weight | Description |
| --- | --- | --- |
| waypoint_reached | +50 | Per waypoint bonus |
| completion_bonus | +100 | All waypoints done |
| progress | +5 | Forward progress toward waypoint |
| forward_speed | +2 | Speed toward goal (target 5 m/s) |
| heading | +2 | Aligned heading to goal |
| drfm_effective | +2 | Jamming an active radar |
| smart_jam | +1 | Jamming the right radar for the threat |
| power_conserve | +0.5 | Low DRFM power when not needed |
| upright | +1 | Upright orientation |
| terminating | -200 | Bad termination (collision / radar lock) |
| altitude_band | -5 | Deviation from z = 3 m ± 1 m |
| illumination_low | -2 | Radar illumination on drone |
| proximity | -3 | Within 2.5–6 m of obstacle |
| ang_vel_l2 | -0.02 | Angular velocity magnitude |
| action_smooth | -0.01 | Action jitter between steps |
| step_penalty | -0.01 | Time alive penalty |
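As a rough illustration of how these weights combine, the sketch below treats the step reward as a plain weighted sum of per-term signals. This is an assumption: the actual environment may scale terms differently (for example by the step time), and `total_reward` is a hypothetical helper rather than project code. The weights themselves are copied from the table.

```python
REWARD_WEIGHTS = {
    "waypoint_reached": 50.0,
    "completion_bonus": 100.0,
    "progress": 5.0,
    "forward_speed": 2.0,
    "heading": 2.0,
    "drfm_effective": 2.0,
    "smart_jam": 1.0,
    "power_conserve": 0.5,
    "upright": 1.0,
    "terminating": -200.0,
    "altitude_band": -5.0,
    "illumination_low": -2.0,
    "proximity": -3.0,
    "ang_vel_l2": -0.02,
    "action_smooth": -0.01,
    "step_penalty": -0.01,
}


def total_reward(terms):
    """Weighted sum of raw term values; terms that are absent contribute zero.

    Assumption: each raw term is a non-negative signal (e.g. 1.0 when a waypoint
    is reached), and the sign of the contribution comes from the weight.
    """
    unknown = set(terms) - set(REWARD_WEIGHTS)
    if unknown:
        raise KeyError(f"unknown reward terms: {sorted(unknown)}")
    return sum(REWARD_WEIGHTS[name] * value for name, value in terms.items())


# One step in which a waypoint is reached while the step penalty applies.
print(total_reward({"waypoint_reached": 1.0, "step_penalty": 1.0}))
```

Seen this way, a single bad termination (-200) outweighs four waypoint bonuses, which gives a sense of how strongly the agent is pushed away from collisions and radar locks.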

Algorithms


Episode Return

Episode Length

DRFM Technique Usage

Policy Loss

We used Proximal Policy Optimization (PPO) as the backbone throughout the project, with Soft Actor-Critic (SAC) added later for ablation and replay-buffer comparison. Both agents support hybrid discrete-continuous actions, which is critical for the DRFM module: technique selection is discrete (OFF, RGPO, VGPO, RVGPO), while each technique's parameters (pull-off rate, velocity pull-off rate, coordination ratio) are continuous. Together, PPO and SAC cover a decent range of approaches, since one is on-policy and the other is off-policy.

We also implemented PPO_GRU (PPO with a GRU recurrent encoder) specifically to handle partial observability in the radar environment. The drone receives Radar Warning Receiver (RWR) observations (bearing, power, illumination rate, pulse interval variance), which are noisy single-timestep snapshots. A memoryless MLP policy cannot distinguish whether a radar is ramping up toward lock or cooling down from a failed track. The GRU encodes the temporal RWR stream (32D) into a hidden state, while static observations (attitude, velocity, DRFM state) are passed through directly. In theory, this split-stream design lets the agent build a mental model of the radars over time without forcing navigation state through recurrence.
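The split-stream idea can be sketched in a few lines of NumPy. This is an illustration only, not the project's network: the class name, hidden size, static dimension, random weights and the omission of bias terms are all assumptions. Only the 32D RWR stream and the "recurrence for RWR, passthrough for static state" structure come from the description above.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class SplitStreamEncoder:
    """Toy GRU over the RWR stream; static observations bypass the recurrence."""

    def __init__(self, rwr_dim=32, hidden_dim=16, seed=0):
        rng = np.random.default_rng(seed)

        def weights(rows, cols):
            return rng.normal(0.0, 0.1, size=(rows, cols))

        # Input (W*) and recurrent (U*) weights for the update gate z,
        # the reset gate r and the candidate state n. Biases are omitted.
        self.Wz, self.Uz = weights(hidden_dim, rwr_dim), weights(hidden_dim, hidden_dim)
        self.Wr, self.Ur = weights(hidden_dim, rwr_dim), weights(hidden_dim, hidden_dim)
        self.Wn, self.Un = weights(hidden_dim, rwr_dim), weights(hidden_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def initial_state(self):
        return np.zeros(self.hidden_dim)

    def step(self, rwr, static, h):
        z = sigmoid(self.Wz @ rwr + self.Uz @ h)        # how much old memory to keep
        r = sigmoid(self.Wr @ rwr + self.Ur @ h)        # how much old memory feeds the candidate
        n = np.tanh(self.Wn @ rwr + self.Un @ (r * h))  # candidate hidden state
        h_new = (1.0 - z) * n + z * h
        # Policy features: radar memory concatenated with untouched static state.
        return np.concatenate([h_new, static]), h_new


encoder = SplitStreamEncoder()
h = encoder.initial_state()
static_obs = np.zeros(10)  # hypothetical size for attitude, velocity, DRFM state
for _ in range(3):
    features, h = encoder.step(np.ones(32), static_obs, h)
print(features.shape)
```

Feeding the same RWR snapshot repeatedly still changes the hidden state from step to step, which is the property a memoryless MLP lacks: the features depend on the history of the radar signal, not just its current value.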

MAPPO works, compared to plain PPO and PPO_GRU. Training for 25K timesteps with 5 drones takes roughly 45 minutes on an RTX 4090. The model shares parameters across all 5 drones via a centralized critic, which is 280-dimensional in total. Each drone manages its own observation of radar and DRFM status, i.e. which DRFM technique is turned on and which radar is currently illuminating it. Each drone receives a shared reward, the average of all individual rewards, to encourage team effort. Realistically, this is not ideal, but it works as a starting point: each drone should maximize its own effort given its belief and the state of the environment, and not get bogged down by the other drones' abilities (Mini4). I could not find any library online that makes it easy to incorporate QMIX or similar; Ray RLlib exists, but it is vastly different from the current implementation.
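The reward sharing described above amounts to broadcasting the team mean to every drone. The sketch below shows that, plus a hypothetical blended variant that interpolates between team and individual reward. The `team_weight` parameter is an illustration of one way to address the concern raised above, not something the project implements.

```python
def shared_team_reward(individual_rewards):
    """Every drone receives the mean of all individual rewards."""
    if not individual_rewards:
        raise ValueError("need at least one reward")
    mean = sum(individual_rewards) / len(individual_rewards)
    return [mean] * len(individual_rewards)


def blended_reward(individual_rewards, team_weight):
    """Hypothetical variant: team_weight=1.0 is fully shared, 0.0 is fully individual."""
    if not 0.0 <= team_weight <= 1.0:
        raise ValueError("team_weight must be in [0, 1]")
    team = shared_team_reward(individual_rewards)
    return [
        team_weight * t + (1.0 - team_weight) * r
        for t, r in zip(team, individual_rewards)
    ]


rewards = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up per-drone rewards for 5 drones
print(shared_team_reward(rewards))
print(blended_reward(rewards, 0.5))
```

With fully shared rewards, the best and worst drone get the same signal, which is what makes credit assignment hard; the blended form keeps some team incentive while letting each drone see its own contribution.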

All the other agents considered (DQN, REINFORCE, vanilla Actor-Critic, DDPG, TD3 and TRPO) were ruled out for at least one of these reasons: discrete-only actions, continuous-only actions, or higher variance. PPO is also pretty popular compared to all the others ...

Project Organization

  ├── LICENSE / NOTICE
  ├── README.md
  ├── environment.yaml
  ├── docs/               # Research notes & technical challenges
  │   ├── drfm.md
  │   ├── radar.md
  │   ├── meta.md
  │   ├── references.md
  │   └── technical-challenges.md
  │
  ├── media/
  ├── scripts/
  │   ├── train.py
  │   └── play.py
  │
  ├── outputs/
  ├── drfm/
  │   ├── assets/
  │   │   └── configuration/
  │   ├── robots/
  │   ├── dynamics/
  │   ├── algorithms/
  │   ├── agents/
  │   ├── utils/
  │   └── isaac/
  │       ├── drfm_env.py
  │       ├── agents/
  │       └── mdp/
  │
  └── models/
      ├── architectures/
      ├── checkpoints/
      └── replay_buffers/

Note on AI Use and Assets

Claude helped refactor the code from the old SKRL version 1.4.3 to 2.1.0, although we couldn't really tell whether the refactor worked, since the agents broke as a result. The initial baseline of this project was based on Isaac Drone Racer, and we largely stuck with the same allocation and motor usage; little has changed here besides some testing of how thrust initializes. We also took the MDP structure for actions, observations and rewards from Isaac Drone Racer and built on top of it, adding waypoints and a lot of reward structure, primarily so that the drone is limited in altitude (minz and maxz) and urged toward the waypoint while avoiding collisions with objects.

The radar, both deterministic and probabilistic, was designed based on the Phillip E. Pace book, and I got Claude to validate some of the mathematics for calculating DRFM and drone interactions. The equations were pulled directly from the book or from online sources (SPJ). I did not create any of the assets used (drone mesh, USD, URDF and so on); most media was created by me, apart from the AI-generated image of the BAE Systems decoy used as the header of this file.

Future

1. Get MAPPO and PPO GRU working properly.
    - I need an easier way to validate what's going on, visualize inconsistencies
      and debug.
    - MAPPO does work compared to the dead PPO (GRU), but collision with obstacles
      isn't solved, and both the drones and the radars need to communicate among
      themselves.
2. Change the reward structure so it's not unbearably fragile.
3. Change the environment to be more realistic.
4. Add IQ waveforms using USRP-recorded signals instead of the janky radar
   interactions we currently have.
5. Radars should share communication with each other to fit a more realistic
   environment.
    - Also, the calculations are heavily simplified to allow scaffolding training,
      but we never reverted the values.

References

  1. Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
  2. Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." International Conference on Machine Learning. PMLR, 2018.
  3. Wang, Chao, et al. "Autonomous navigation of UAV in large-scale unknown complex environment with deep reinforcement learning." GlobalSIP 2017
  4. Kaufmann, E., et al. "Champion-level drone racing using deep reinforcement learning." Nature, 2023
  5. Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.
  6. Merrick, R. Getting Started with FPGAs: Digital Circuit Design, Verilog, and VHDL for Beginners. No Starch Press, 2023.
  7. Pace, P. E. Developing Digital RF Memories and Transceiver Technologies for Electromagnetic Warfare. Artech House, 2022.
  8. Salimpour, Sahar, et al. "Sim-to-real transfer for mobile robots with reinforcement learning: from nvidia isaac sim to gazebo and real ros 2 robots." arXiv preprint arXiv:2501.02902 (2025).
  9. PPO SKRL
  10. Isaac Drone Racer
  11. Isaac Sim: Foundation Model
  12. Isaac Lab: RL Environments
  13. Isaac Lab: Actuators
  14. Radar Equations - MIT Lincoln Lab
  15. Radar Jamming and Deception - Wikipedia
  16. DRFM: History, Circuit & Testing - Rohde & Schwarz
  17. TD Learning - Stanford CME241
  18. Bellman Equation - Wikipedia
  19. Bellman's Principle of Optimality - Wikipedia
  20. MDP Algorithms: Value & Policy Iteration - Wikipedia
  21. AN/ALE-55 Fiber-Optic Towed Decoy (FOTD) Image - BAE SYSTEMS
  22. Radar Tutorials: Self Protection Jammer
  23. Claude (Anthropic)

About

Digital Radio Frequency Memory with RL
