A ROS2 (Humble) implementation of a multi-agent task coordination system with fault tolerance, performance metrics, and algorithm comparison. Demonstrates autonomous task assignment, parallel execution, and real-time status monitoring.
[coordinator]: Assigned T001 (navigate@zone_A) -> agent_1
[coordinator]: Assigned T003 (navigate@zone_C) -> agent_2
[agent_1]: Executing T001 (navigate) for 4s...
[agent_2]: Executing T003 (navigate) for 4s...
[agent_1]: Completed T001
[coordinator]: Task T001 completed by agent_1
graph LR
C[Coordinator Node]
A1[Agent 1]
A2[Agent 2]
C -->|/tasks| A1
C -->|/tasks| A2
A1 -->|/agent_status| C
A2 -->|/agent_status| C
Coordinator Node (coordinator_node.py)
- Maintains a priority-sorted task queue
- Monitors agent status via the `/agent_status` topic
- Assigns tasks to idle agents dynamically
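
The coordinator behaviour listed above can be sketched in plain Python, without ROS, roughly as follows. This is only an illustration under stated assumptions: `TaskQueue`, `assign_to_idle`, the `"idle"`/`"busy"` state strings, and the convention that a larger number means higher priority are hypothetical and may not match `coordinator_node.py`.

```python
import heapq
import itertools


class TaskQueue:
    """Priority-sorted queue; a larger priority value is popped first (assumed convention)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: equal priorities keep insertion order

    def push(self, task_id, priority):
        # heapq is a min-heap, so the priority is negated to pop the largest first
        heapq.heappush(self._heap, (-priority, next(self._counter), task_id))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def assign_to_idle(queue, agent_states):
    """Hand one queued task to each idle agent and return {agent_id: task_id}."""
    assignments = {}
    for agent_id in sorted(agent_states):
        if agent_states[agent_id] == "idle" and len(queue) > 0:
            assignments[agent_id] = queue.pop()
            agent_states[agent_id] = "busy"
    return assignments
```

In a real node, `assign_to_idle` would typically run whenever a status message reports that an agent has become idle.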
Agent Node (agent_node.py)
- Subscribes to the `/tasks` topic
- Executes assigned tasks concurrently (threading)
- Publishes real-time status to `/agent_status`
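
A minimal sketch of the agent pattern described above, written without ROS so it runs anywhere. The class name `AgentSketch`, the JSON field names (`agent_id`, `task_id`, `duration`, `state`), and the `publish` callable standing in for a ROS 2 publisher are all assumptions for illustration, not the contents of `agent_node.py`.

```python
import json
import threading
import time


class AgentSketch:
    def __init__(self, agent_id, publish):
        self.agent_id = agent_id
        self._publish = publish  # stand-in for a ROS 2 publisher's publish()
        self._threads = []

    def _report(self, state, task_id=None):
        # Status is serialized to JSON, as the topic table in this README suggests
        self._publish(json.dumps(
            {"agent_id": self.agent_id, "state": state, "task_id": task_id}))

    def on_task(self, raw):
        """Callback for an incoming JSON task message; returns without blocking."""
        task = json.loads(raw)
        if task.get("agent_id") != self.agent_id:
            return  # task is addressed to another agent
        worker = threading.Thread(target=self._execute, args=(task,), daemon=True)
        self._threads.append(worker)
        worker.start()

    def _execute(self, task):
        self._report("busy", task["task_id"])
        time.sleep(task["duration"])  # simulated work
        self._report("idle", task["task_id"])

    def wait(self):
        """Block until all started tasks have finished (useful in tests)."""
        for worker in self._threads:
            worker.join()
```

Running the task in a worker thread keeps the subscription callback short, so the agent can keep receiving messages while a task is in progress.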
- Priority-based scheduling — higher priority tasks assigned first
- Parallel execution — multiple agents work simultaneously
- Dynamic load balancing — tasks assigned immediately when agent becomes idle
- Fault tolerance — heartbeat monitoring with automatic task recovery
- Algorithm comparison — benchmarking framework for scheduling strategies
- Scalable — add more agents without changing coordinator logic
This demo extends my experience building multi-agent decision systems (100+ assets, 25 time dimensions, 6 API integrations) into the ROS2 robotics domain, demonstrating the same coordination patterns applied to autonomous robot task management.
- Docker Desktop (Apple Silicon / Linux)
- Git
git clone https://github.com/s870488-dev/ros2-multi-agent-demo.git
cd ros2-multi-agent-demo
# Build Docker image
cd docker && docker compose build
# Start container
docker compose up -d
# Enter container
docker exec -it ros2_dev bash
# Build ROS2 package
cd /ros2_ws && colcon build
source install/setup.bash
# Launch demo (basic)
ros2 launch multi_agent_coordinator demo.launch.py
# Run algorithm comparison
ros2 run multi_agent_coordinator coordinator_comparison priority
ros2 run multi_agent_coordinator coordinator_comparison random
ros2 run multi_agent_coordinator coordinator_comparison least_loaded

The coordinator monitors each agent via heartbeat. If an agent stops responding for more than 5 seconds, it is marked dead and all of its in-progress tasks are automatically re-queued and reassigned.
[FAULT] agent_1 heartbeat lost (7.0s) → marking dead
[RECOVER] T002 re-queued (was on agent_1)
[RECOVER] T005 re-queued (was on agent_1)
[RECOVER] T004 re-queued (was on agent_1)
[ASSIGN] T002 → agent_2
[ASSIGN] T005 → agent_2
[ASSIGN] T004 → agent_2
Result: All 6 tasks completed. Zero task loss.
Full log: docs/fault_tolerance_demo.txt
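
The recovery logic can be sketched roughly as below. Only the 5-second threshold comes from this README; the class `HeartbeatMonitor`, its method names, and the choice to pass timestamps in explicitly (rather than reading a clock) are assumptions made to keep the example small and testable.

```python
HEARTBEAT_TIMEOUT_S = 5.0  # threshold stated in this README


class HeartbeatMonitor:
    def __init__(self, timeout=HEARTBEAT_TIMEOUT_S):
        self.timeout = timeout
        self.last_seen = {}    # agent_id -> timestamp of the last status message
        self.in_progress = {}  # agent_id -> task ids currently assigned to it
        self.dead = set()

    def heartbeat(self, agent_id, now):
        self.last_seen[agent_id] = now

    def assign(self, agent_id, task_id):
        self.in_progress.setdefault(agent_id, []).append(task_id)

    def check(self, now):
        """Mark timed-out agents dead and return their task ids for re-queueing."""
        recovered = []
        for agent_id, seen in self.last_seen.items():
            if agent_id not in self.dead and now - seen > self.timeout:
                self.dead.add(agent_id)
                recovered.extend(self.in_progress.pop(agent_id, []))
        return recovered
```

The returned task ids would then go back into the coordinator's queue and be assigned to the remaining live agents.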
Three scheduling strategies were benchmarked on an identical 6-task workload (2× navigate, 2× inspect, 1× collect, varying priority levels):
| Algorithm | Avg Wait (s) | Avg Exec (s) | Runtime (s) | Avg Util (%) |
|---|---|---|---|---|
| Priority | 8.3 | 4.9 | 21.0 | 71.4 |
| Random | 9.3 | 4.9 | 21.0 | 70.2 |
| Least-loaded | 8.3 | 4.9 | 21.0 | 72.6 |
Observation: In execution-time-dominated scenarios, total runtime converges across strategies. The key differentiator is wait time distribution — Priority and Least-loaded consistently assign high-priority tasks earlier, while Random introduces unnecessary delay for critical tasks. Least-loaded achieves marginally higher agent utilization by balancing cumulative workload.
Raw data: docs/metrics_priority.csv · docs/metrics_random.csv · docs/metrics_least_loaded.csv
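
For orientation, one plausible reading of the three strategies is sketched below. The function `pick`, the task dictionary keys, and the exact selection rules are assumptions (in particular, that Least-loaded still picks the highest-priority task and differs only in which idle agent receives it); the implementation in `coordinator_comparison` may differ.

```python
import random


def pick(strategy, pending, idle_agents, load, rng=random):
    """Return (task, agent_id) for the next assignment, or None if nothing can be assigned.

    pending:     list of dicts with "id" and "priority" (larger = more urgent, assumed)
    idle_agents: list of agent ids that are currently idle
    load:        dict of agent id -> cumulative seconds of work already assigned
    """
    if not pending or not idle_agents:
        return None
    # Task choice: random ignores priority, the other two take the most urgent task
    if strategy == "random":
        task = rng.choice(pending)
    else:
        task = max(pending, key=lambda t: t["priority"])
    # Agent choice: least_loaded balances cumulative workload across idle agents
    if strategy == "least_loaded":
        agent_id = min(idle_agents, key=lambda a: load.get(a, 0.0))
    else:
        agent_id = idle_agents[0]
    return task, agent_id
```

Passing a seeded `random.Random` instance as `rng` makes the random strategy reproducible across benchmark runs.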
| Topic | Type | Description |
|---|---|---|
| `/tasks` | `std_msgs/String` (JSON) | Task assignments from coordinator |
| `/agent_status` | `std_msgs/String` (JSON) | Agent status reports |
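
Because both topics carry JSON inside a string message, the payloads can be built and parsed with the standard `json` module and placed in the `data` field of `std_msgs/String`. The sketch below shows one possible shape; every field name here (`task_id`, `type`, `zone`, `agent_id`, `priority`, `state`) is hypothetical and not confirmed by the source code.

```python
import json


def encode_task(task_id, task_type, zone, agent_id, priority):
    """Build the JSON payload for a /tasks message (field names are assumed)."""
    return json.dumps({
        "task_id": task_id,
        "type": task_type,
        "zone": zone,
        "agent_id": agent_id,
        "priority": priority,
    })


def decode_status(payload):
    """Parse an /agent_status payload and check the fields this sketch relies on."""
    msg = json.loads(payload)
    missing = {"agent_id", "state"} - msg.keys()
    if missing:
        raise ValueError(f"status message missing fields: {sorted(missing)}")
    return msg
```

Validating on receipt is worthwhile with string-typed topics, since the ROS message type itself does not enforce any schema on the JSON content.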
| Type | Duration | Description |
|---|---|---|
| navigate | 4s | Move to target zone |
| inspect | 6s | Inspect target zone |
| collect | 3s | Collect item at zone |
- ROS2 Humble (LTS)
- Python 3.10
- Docker (linux/arm64 + amd64)
- Tested on Apple Silicon (M4) and Linux
Wait time distribution is the primary differentiator across algorithms:
- Priority and Least-loaded assign high-priority tasks (T001, T003) first, achieving 2s wait — the theoretical minimum given startup delay
- Random scheduling assigned inspect tasks (T002, T005) first in the tested run, causing navigate tasks to wait up to 16s
- In execution-time-dominated workloads, total runtime converges across all strategies (21s) — the scheduling bottleneck only emerges when task arrival rate exceeds agent capacity
Agent utilization remains consistent (70–73%) across all algorithms with 2 agents and 6 tasks, suggesting utilization is bounded by task density rather than scheduling strategy in this configuration.
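
The metric definitions are not spelled out in this README. One plausible reading, sketched below, is that wait time is the gap between task creation and start, and utilization is total busy time divided by the agent-seconds available over the run. The function `summarize` and the record layout are assumptions for illustration.

```python
def summarize(records, n_agents):
    """Compute summary metrics from (created, started, finished) timestamps in seconds."""
    waits = [started - created for created, started, _ in records]
    execs = [finished - started for _, started, finished in records]
    # Runtime spans from the earliest creation to the latest completion
    runtime = max(f for _, _, f in records) - min(c for c, _, _ in records)
    return {
        "avg_wait_s": sum(waits) / len(records),
        "avg_exec_s": sum(execs) / len(records),
        "runtime_s": runtime,
        # Busy agent-seconds as a share of all agent-seconds available
        "avg_util_pct": 100.0 * sum(execs) / (n_agents * runtime),
    }
```

Under that definition, 71.4% utilization with 2 agents over a 21.0 s run corresponds to roughly 30 busy agent-seconds (0.714 × 42), which is close to 6 tasks × 4.9 s average execution time (29.4 s).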
Implication for multi-robot systems: Priority-based scheduling provides predictable latency for critical tasks (e.g. emergency navigation) without sacrificing overall throughput — a desirable property for real-world robot coordination where task urgency varies.
