Merged
23 changes: 23 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,23 @@
name: Lint

on:
push:
branches: [ main, master ]
pull_request:
branches: [ main, master ]

jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'

- name: Run pre-commit
uses: pre-commit/action@v3.0.1
# This action will fail the build if any files would be modified by the hooks.
# It does NOT commit or push any changes back to the repo.
21 changes: 21 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,21 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.6.0
hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-yaml
- id: check-added-large-files

- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.14.10
hooks:
- id: ruff
args: [ --fix ]
- id: ruff-format

- repo: https://github.com/pre-commit/mirrors-clang-format
rev: v18.1.8
hooks:
- id: clang-format
types_or: [c++, c]
84 changes: 42 additions & 42 deletions README.md
@@ -1,5 +1,5 @@
# rostrace
An eBPF-based performance tool for pinpointing latency bottlenecks in
multi-process ROS systems.

## Quick Start
@@ -18,21 +18,21 @@ rostrace trace --node /processing_delay_node --topic /slow_topic
For detailed instructions on running native nodes on your host, advanced configuration, and troubleshooting, see the [**Usage Guide**](Usage.md).

## The Problem
You publish a topic at a desired frequency, say 20 Hz. You know a message has
to arrive every 50 ms, but it keeps missing its schedule.
Now you want to know: why?
Standard ROS tools (like `rostopic hz` or `rqt_graph`) treat the nodes as a
black box. They can tell you a message arrived late, but they cannot tell you
why.
- Was your code inefficient?
- Was the OS scheduler starving the thread?
- Did a "silent" page fault pause the CPU?
- Was the message dropped inside the middleware queue before the callback
even started?

## The Solution
`rostrace` uses eBPF (via `bpftrace`) to peek inside the "black box" of both
ROS middleware and the Linux kernel simultaneously. It decomposes the
lifecycle of a message into four distinct metrics:

* **Network Latency** (Planned)
@@ -44,83 +44,83 @@ lifecycle of a message into four distinct metrics:

To debug a robot, you must see through the layers of the Operating System.

* **Multi-Process (Distributed) Debugging**: ROS splits a robot's brain into
many independent processes. Latency often hides in the communication
between these processes. `rostrace` tracks messages as they cross
process boundaries.
* **Zero-Instrumentation**: `rostrace` attaches to running processes
dynamically. You don't need to change a single line of your code or
recompile your nodes.
* **Eliminating Heisenbugs**: Standard tools run in *User-Space* and are
biased by the very latency they try to measure. `rostrace` runs inside
the *Linux Kernel*. It captures 100% of events with nanosecond precision,
seeing system-wide events—like the OS scheduler preempting your node—that
are invisible to ROS.

### Tracing the Entire ROS Ecosystem

While ROS offers many communication abstractions (Topics, Services, Actions, TF),
they all ultimately rely on common middleware patterns: data serialization,
internal queuing, and callback dispatch.

`rostrace` targets these fundamental building blocks inside `libroscpp.so`,
allowing it to provide latency insights across the entire ROS graph using a
unified analysis approach.

## Performance Bottleneck Analysis

`rostrace` breaks down the fault domain by measuring how long a message
spends in each internal stage:

* **Network Latency (Planned)**: From kernel TCP stack entry until the
middleware begins processing the raw buffer.
* **Serialization Latency**: Measured from the start to the end of the
message deserialization process. This identifies overhead from large or
complex message types.
* **Queuing Latency**: The time a message sits "ready and waiting" in the
`ros::CallbackQueue` before your code is free to process it.
* **Processing Latency**: The time your own callback code took to execute.
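
The decomposition above is pure arithmetic over probe timestamps. A minimal sketch, assuming hypothetical nanosecond timestamps at the deserialization and callback boundaries (the names below are illustrative, not rostrace's internal API):

```python
# Hypothetical probe timestamps (ns) for one message; the probe names and
# call signature are illustrative, not rostrace's actual internals.
def decompose_latency_us(deser_start, deser_end, cb_start, cb_end):
    """Split one message's lifecycle into the three implemented stages, in us."""
    serialization_us = (deser_end - deser_start) / 1000
    queuing_us = (cb_start - deser_end) / 1000   # time waiting in ros::CallbackQueue
    processing_us = (cb_end - cb_start) / 1000   # time inside the user callback
    return serialization_us, queuing_us, processing_us

s, q, p = decompose_latency_us(0, 8_000, 200_008_000, 400_008_000)
# s = 8 us of deserialization, q = 200 ms of queuing, p = 200 ms of processing
```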

### How to Interpret the Results

* **High Serialization Latency**: Indicates the message format is a
bottleneck. Consider optimizing message structures or using compression.
* **High Queuing Latency**: Indicates the node is overloaded or being
starved of CPU time by the OS scheduler.
* **High Processing Latency**: Indicates an inefficiency inside your
callback function (e.g., heavy algorithms or blocking I/O).
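
If you export the per-stage numbers, the table above reduces to "blame the dominant stage". A toy heuristic of our own, not a rostrace feature:

```python
def diagnose(serialization_us, queuing_us, processing_us):
    """Return (dominant_stage, hint) -- a toy heuristic, not part of rostrace."""
    hints = {
        "serialization": "message format bottleneck: simplify structures or compress",
        "queuing": "node overloaded or starved of CPU by the OS scheduler",
        "processing": "inefficient callback: heavy algorithms or blocking I/O",
    }
    stages = {"serialization": serialization_us,
              "queuing": queuing_us,
              "processing": processing_us}
    worst = max(stages, key=stages.get)  # stage with the largest latency
    return worst, hints[worst]

stage, hint = diagnose(8.0, 150.0, 200_000.0)  # processing dominates here
```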

## Technical Architecture

* **Event Tracing**: Unlike statistical profilers (like `parca`) that sample
the CPU at intervals, `rostrace` uses deterministic event tracing. It
captures 100% of occurrences, ensuring that even a single 100ms latency
spike is never missed.
* **Dynamic Symbol Resolution**: The tool automatically discovers mangled
C++ symbols in your specific version of `libroscpp.so`, ensuring robust
operation across different ROS distributions.
* **Container Compatibility**: Includes specific logic to resolve
namespaced PIDs, allowing you to trace production containers from the
host or within a privileged sidecar.
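
The symbol-discovery step amounts to scanning the dynamic symbol table for mangled C++ names containing a known fragment. A simplified sketch over canned `nm -D`-style output — the mangled names below are fabricated for illustration, and the real tool's matching logic may differ:

```python
import re

# Canned `nm -D --defined-only`-style lines; a real run would read libroscpp.so.
# The mangled names here are fabricated for illustration.
NM_OUTPUT = """\
00000000000a1b20 T _ZN3ros13CallbackQueue9callOneCBEPNS0_3TLSE
00000000000c4d10 T _ZN3ros12Subscription13handleMessageERKNS_17SerializedMessageE
"""

def find_symbol(nm_text, fragment):
    """Return (address, mangled_name) of the first defined symbol containing fragment."""
    for line in nm_text.splitlines():
        m = re.match(r"([0-9a-fA-F]+)\s+[TtWw]\s+(\S+)", line)
        if m and fragment in m.group(2):
            return int(m.group(1), 16), m.group(2)
    return None

addr, name = find_symbol(NM_OUTPUT, "CallbackQueue")  # attach a uprobe at `addr`
```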

## Future Work

### Extended Metrics
* **Network Latency (Kernel-to-User)**: Tracing from `tcp_recvmsg` to
middleware entry using `kprobes`.
* **Action-Specific Latency**: Direct tracing of the `actionlib` API to
measure **Goal-to-Accept** and **Feedback** delay.
* **TF Lookup Latency**: Measuring time spent inside blocking
`lookupTransform()` calls.

### Root Cause Analysis (System Events)
* **Scheduler Starvation**: Correlating latency with `sched_switch` events
to detect when a node is blocked by other processes.
* **Context Switch Storms**: Detecting thrashing due to high-frequency
preemption or priority inversion.

### Reliability & Failure Detection
* **Black-Box Drop Detection**: Identifying packet loss by tracking
divergence between arrival rates and execution rates.
* **Real-Time Safety Auditing**: Flagging dynamic memory allocation
(`malloc`) or blocking I/O inside critical control loops.
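
This drop-detection idea reduces to comparing two counters. A sketch of a feature that is still planned, assuming we can count middleware arrivals and callback executions separately (the numbers are hypothetical):

```python
def estimate_drop_rate(arrived, executed):
    """Fraction of messages seen at arrival but never executed (0.0 if none)."""
    if arrived == 0:
        return 0.0
    return max(0.0, (arrived - executed) / arrived)

# 1000 messages reached the middleware but only 940 callbacks ran -> 6% dropped.
rate = estimate_drop_rate(arrived=1000, executed=940)
```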
40 changes: 20 additions & 20 deletions Usage.md
@@ -4,7 +4,7 @@ This guide explains how to use `rostrace` to identify and debug latency in your

## 1. Quick Start with Demos (Docker)

We provide pre-packaged demos using Docker Compose. These scenarios simulate common
latency issues without requiring any installation on your host machine.

### Prerequisites
@@ -13,8 +13,8 @@ latency issues without requiring any installation on your host machine.

### 1.1 The Streamlined Way (Recommended)

To make tracing as seamless as possible, we provide a wrapper script
`./rostrace-docker` in the project root. This script automatically handles all
the complex Docker flags required for eBPF tracing.

```bash
@@ -37,7 +37,7 @@ In this scenario, a node has a callback that takes ~200ms to complete.
You will see an output like below:
```
1. Serialization Latency (us):
@serialization_latency_us:
[2, 4) 1 |@ |
[4, 8) 50 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16) 5 |@@@@@ |
@@ -46,22 +46,22 @@ You will see an output like below:


2. Queuing Latency (us):
@queuing_latency_us:
[128K, 256K) 57 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|



3. Processing Latency (us):
@processing_latency_us:
[128K, 256K) 56 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
```

You can see that due to **High Processing Latency** (~200ms), the node cannot
keep up with incoming messages. This causes them to back up, leading to a direct
increase in **Queuing Latency** (also ~200ms). The serialization overhead remains negligible.
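
The backlog is reproducible with a toy single-threaded queue model: messages arrive every 50 ms but each callback holds the thread for 200 ms, so each message waits roughly 150 ms longer than the previous one. The numbers mirror the demo; the model itself is only a sketch:

```python
def queuing_delays(n_msgs, period_s=0.05, service_s=0.2):
    """Per-message queuing delay for periodic arrivals into one callback thread."""
    delays, thread_free_at = [], 0.0
    for i in range(n_msgs):
        arrival = i * period_s
        start = max(arrival, thread_free_at)  # wait until the thread is free
        delays.append(start - arrival)
        thread_free_at = start + service_s    # callback occupies the thread
    return delays

d = queuing_delays(5)  # delays grow: 0, 0.15, 0.30, 0.45, 0.60 seconds
```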

### Scenario B: Queuing Latency (Service Blocking)
In this scenario, a node is busy handling a heavy service call, causing incoming
topic messages to wait in the queue.

1. **Start the environment**:
@@ -76,7 +76,7 @@ topic messages to wait in the queue.
You will see an output like below:
```
1. Serialization Latency (us):
@serialization_latency_us:
[4, 8) 128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16) 19 |@@@@@@@ |
[16, 32) 33 |@@@@@@@@@@@@@ |
@@ -85,7 +85,7 @@ You will see an output like below:


2. Queuing Latency (us):
@queuing_latency_us:
[8, 16) 9 |@@@@@@ |
[16, 32) 31 |@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 67 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
@@ -106,7 +106,7 @@ You will see an output like below:


3. Processing Latency (us):
@processing_latency_us:
[8, 16) 28 |@@@@@@@@@@@@@@ |
[16, 32) 29 |@@@@@@@@@@@@@@@ |
[32, 64) 99 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
@@ -126,12 +126,12 @@ You will see an output like below:

```

This scenario reveals a **bimodal distribution** in Processing Latency. The
cluster in the microseconds range represents your fast topic callbacks. The
cluster at ~500ms represents the `heavy_service` handler blocking the main thread.

Note how the **Queuing Latency** shows a wide spread: messages that arrive while
the service is running are forced to wait, while messages that arrive when the
thread is free are processed instantly. This is a classic "Thread Starvation" pattern.
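
If you export the raw samples, the bimodal split is easy to verify mechanically. A sketch using our own helper with an assumed 1 ms threshold — not a rostrace command:

```python
def split_modes(samples_us, threshold_us=1_000.0):
    """Split latency samples (us) into the fast and slow clusters of a bimodal mix."""
    fast = [s for s in samples_us if s < threshold_us]
    slow = [s for s in samples_us if s >= threshold_us]
    return fast, slow

# Microsecond-scale topic callbacks vs ~500 ms service-handler executions.
samples = [12.0, 40.0, 55.0, 480_000.0, 510_000.0, 30.0]
fast, slow = split_modes(samples)
```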


@@ -161,7 +161,7 @@ rosrun rostrace rostrace trace --node /my_node_name --topic /my_topic_name
## 3. Advanced Configuration

### Explicit Library Path (`--lib`)
If `rostrace` cannot automatically find `libroscpp.so` (common in custom builds
or statically linked environments), you can provide the path explicitly:

```bash
@@ -175,7 +175,7 @@ If you prefer to run the Python logic in a virtual environment:
3. Note: You still need `bpftrace` installed on your system.

### Troubleshooting
- **Missing DebugFS**: If you see "Kernel debug filesystem not found", ensure
`/sys/kernel/debug` is mounted. On host: `sudo mount -t debugfs none /sys/kernel/debug`.
- **Permissions**: `rostrace` requires root privileges to attach eBPF probes.
It will attempt to use `sudo` automatically if not run as root.
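
The DebugFS check above can also be scripted before attaching probes. A sketch that parses `/proc/mounts`-format text, shown here on canned input; on a live system you would pass the contents of `/proc/mounts`:

```python
def debugfs_mounted(mounts_text):
    """True if a debugfs filesystem is mounted at /sys/kernel/debug."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts format: <source> <mountpoint> <fstype> <options> <dump> <pass>
        if len(fields) >= 3 and fields[1] == "/sys/kernel/debug" and fields[2] == "debugfs":
            return True
    return False

SAMPLE = "proc /proc proc rw 0 0\ndebugfs /sys/kernel/debug debugfs rw 0 0\n"
ok = debugfs_mounted(SAMPLE)
```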
8 changes: 4 additions & 4 deletions demo/docker-compose.processing.yml
@@ -4,7 +4,7 @@ services:
container_name: roscore_proc
pid: host
network_mode: host
command:
- "bash"
- "-c"
- "source /catkin_ws/devel/setup.bash && roscore"
@@ -27,7 +27,7 @@ services:
- /sys/kernel/debug:/sys/kernel/debug:rw
- /lib/modules:/lib/modules:ro
- /usr/src:/usr/src:ro
command:
- "bash"
- "-c"
- |
@@ -45,7 +45,7 @@ services:
- ROS_MASTER_URI=http://localhost:11311
depends_on:
- scenario
command:
- "bash"
- "-c"
- |
@@ -67,7 +67,7 @@ services:
- /usr/src:/usr/src:ro
stdin_open: true
tty: true
command:
- "bash"
- "-c"
- "source /catkin_ws/devel/setup.bash && echo 'Run: rostrace trace --node /processing_delay_node --topic /slow_topic' && sleep infinity"
8 changes: 4 additions & 4 deletions demo/docker-compose.queuing.yml
@@ -4,7 +4,7 @@ services:
container_name: roscore_queue
pid: host
network_mode: host
command:
- "bash"
- "-c"
- "source /catkin_ws/devel/setup.bash && roscore"
@@ -27,7 +27,7 @@ services:
- /sys/kernel/debug:/sys/kernel/debug:rw
- /lib/modules:/lib/modules:ro
- /usr/src:/usr/src:ro
command:
- "bash"
- "-c"
- |
@@ -45,7 +45,7 @@ services:
- ROS_MASTER_URI=http://localhost:11311
depends_on:
- scenario
command:
- "bash"
- "-c"
- |
@@ -67,7 +67,7 @@ services:
- /usr/src:/usr/src:ro
stdin_open: true
tty: true
command:
- "bash"
- "-c"
- "source /catkin_ws/devel/setup.bash && echo 'Run: rostrace trace --node /service_blocker_node --topic /blocked_topic' && sleep infinity"