30 changes: 30 additions & 0 deletions docs/ai-control/_index.md
@@ -0,0 +1,30 @@
---
linkTitle: "AI & learned control"
title: "AI and learned control"
weight: 45
layout: "docs"
type: "docs"
no_list: true
description: "Run learned policies, vision-language-action models, and LLM-driven task planning on a machine using modules and the Viam APIs."
---

Classic robot control is written by hand: a PID loop, a motion planner, a
state machine. A growing class of applications instead runs a **learned
model** in the loop: a reinforcement-learning policy, a vision-language-action
(VLA) model, or a large language model that decomposes a goal into skills.

On Viam, these models run the same way any custom capability does: you package the
model in a [module](/build-modules/) that implements a component or service
API, and your application talks to it through the standard APIs. This section
explains how each kind of model fits that pattern.

- [Inference latency and loop rate](inference-latency/): why a loop with a model
  in it cannot run faster than the model's inference time, and how to size it.
- [Learned and policy-based control](learned-and-policy-control/): when a
trained policy beats a hand-written controller, and how it runs on a machine.
- [Run a vision-language-action model](run-a-vla/): drive a robot from a camera
frame plus a language prompt.
- [Integrate an LLM with a robot](integrate-an-llm/): use a language model to
plan tasks and dispatch robot skills, safely.
- [Simulation and sim-to-real](simulation-and-sim-to-real/): develop and
validate a policy before it touches hardware.
118 changes: 118 additions & 0 deletions docs/ai-control/inference-latency.md
@@ -0,0 +1,118 @@
---
linkTitle: "Inference latency and loop rate"
title: "Inference latency and loop rate"
weight: 5
layout: "docs"
type: "docs"
description: "Understand why model inference latency sets the ceiling on how fast a perception or control loop can run, and estimate an achievable loop rate from model size, image resolution, and hardware."
aliases:
- "/concepts/inference-latency/"
---

Consider a small program that watches a camera and reacts to what it sees:

```python
while True:
    detections = await detector.get_detections_from_camera("my-camera")
    steer(detections)
    await asyncio.sleep(0.05)  # aim for 20 Hz
```

The `sleep(0.05)` looks like it sets the pace: run 20 times per second.
In practice the loop rate depends far more on the line above it.
The call to [`GetDetectionsFromCamera`](/vision/) captures a frame, runs it through a machine learning model, and returns the results.
The loop awaits that call, so the iteration cannot continue until inference finishes.
If the model takes 200 ms to produce detections, one trip through the loop takes at least 200 ms, so the loop tops out near 5 Hz no matter what number you pass to `sleep`; with the 50 ms sleep added on top, it runs closer to 4 Hz.

This page explains why inference latency sets the ceiling on loop rate, and how to estimate that ceiling before you build a real-time task.

## Why each iteration waits for inference

A perception or control loop does its work one iteration at a time, in order.
Each iteration acquires an input, computes on it, and acts on the result.
When the compute step is a model inference call, the loop reaches that call and waits for a return value before it can act or start the next iteration.

The wall-clock time of one iteration is the sum of its blocking steps.
The single largest term is usually inference:

- Acquiring a frame from a camera: a few milliseconds to tens of milliseconds.
- Running inference on that frame: milliseconds to hundreds of milliseconds.
- Acting on the result (sending a command to a motor or base): typically a few milliseconds.

Your `sleep` only adds idle time on top of that sum.
It can slow the loop down, but it cannot speed it up past the inference call.
The achievable loop rate is the inverse of that sum, so it can never exceed the inverse of the slowest blocking step, roughly:

```text
max_loop_rate ≈ 1 / max(inference_time, actuation_time, frame_time)
```

In most vision loops the inference term dominates, so `max_loop_rate ≈ 1 / inference_time`.
A 200 ms model caps the loop near 5 Hz; a 20 ms model allows up to about 50 Hz, before you add any deliberate `sleep`.
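
The arithmetic above can be sketched in a few lines. This is a minimal illustration, not part of any Viam API: the function names and the timing values are assumptions chosen to match the examples on this page.

```python
def loop_rate_hz(frame_s: float, inference_s: float, actuation_s: float,
                 sleep_s: float = 0.0) -> float:
    """Rate of a strictly sequential loop: one over the sum of its blocking steps."""
    period_s = frame_s + inference_s + actuation_s + sleep_s
    if period_s <= 0:
        raise ValueError("at least one step must take time")
    return 1.0 / period_s


def loop_ceiling_hz(frame_s: float, inference_s: float, actuation_s: float) -> float:
    """Upper bound on the loop rate: one over the slowest blocking step."""
    return 1.0 / max(frame_s, inference_s, actuation_s)


# Illustrative values: 10 ms capture, 200 ms inference, 5 ms actuation, 50 ms sleep.
print(f"{loop_rate_hz(0.010, 0.200, 0.005, 0.050):.1f} Hz")  # 3.8 Hz
print(f"{loop_ceiling_hz(0.010, 0.200, 0.005):.1f} Hz")      # 5.0 Hz
```

The gap between the two results is the cost of the non-inference steps and the deliberate `sleep`; the ceiling itself moves only when inference gets faster.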

## What determines inference time

Inference time is a property of three things together: the model, the input, and the hardware.

**Model size and architecture.**
A larger model with more parameters and layers performs more arithmetic per frame.
A compact detector aimed at edge devices runs much faster than a large, high-accuracy backbone.
Quantized formats such as TFLite `int8` trade a small amount of accuracy for a substantial speedup.

**Input resolution.**
Compute grows with the number of pixels, which grows with the square of the linear resolution.
Halving both image dimensions cuts pixel count to a quarter and often cuts inference time by a similar factor.
Feeding a model a smaller frame is one of the cheapest ways to raise loop rate.
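
As a rough worked example of the pixel-count argument, the sketch below scales a measured latency by the ratio of pixel counts. It assumes compute is proportional to pixels, which is only an approximation, and the function name and numbers are illustrative. Many models also resize their input to a fixed tensor size, in which case the saving applies only when that model input size changes.

```python
def estimate_inference_ms(measured_ms: float,
                          measured_res: tuple[int, int],
                          new_res: tuple[int, int]) -> float:
    """Scale a measured latency by the ratio of pixel counts.

    Assumes inference time is proportional to the number of input pixels.
    """
    measured_w, measured_h = measured_res
    new_w, new_h = new_res
    return measured_ms * (new_w * new_h) / (measured_w * measured_h)


# Halving both dimensions quarters the pixel count.
print(estimate_inference_ms(200.0, (640, 480), (320, 240)))  # 50.0
```

Treat the result as a first guess to check against a measurement, not as a prediction.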

**Hardware.**
Where inference runs matters more than any other single factor.
As rough orders of magnitude, for a typical object detector:

| Where inference runs | Rough per-frame latency | Rough loop ceiling |
| ---------------------------------------------------------------- | --------------------------- | ------------------ |
| TFLite on a Raspberry Pi CPU | ~150-500 ms | ~2-6 Hz |
| Same model on a coprocessor or small GPU (for example, a Jetson) | ~10-40 ms | ~25-100 Hz |
| Larger model on a desktop or server GPU | single-digit to ~20 ms | ~50-200 Hz |

Treat these as illustrative ranges, not specifications.
Actual numbers depend on the exact model, resolution, framework, and device, and the only reliable figure is one you measure on your own hardware.
The pattern holds across cases: moving the same model from a general-purpose CPU to an accelerator built for tensor math changes latency by roughly an order of magnitude.
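
One way to get that measured figure is a small timing harness. The sketch below is an assumption about how you might structure it, not a Viam utility: `measure_latency_ms` and `fake_infer` are hypothetical names, and `fake_infer` stands in for a real inference call.

```python
import asyncio
import math
import statistics
import time


async def measure_latency_ms(infer, warmup=3, runs=20):
    """Time an awaitable inference call and summarize its latency in milliseconds."""
    for _ in range(warmup):  # discard cold-start iterations (model load, caches)
        await infer()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        await infer()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    median = statistics.median(samples)
    p95 = samples[max(0, math.ceil(0.95 * len(samples)) - 1)]  # nearest-rank percentile
    return {"median_ms": median, "p95_ms": p95, "ceiling_hz": 1000.0 / median}


async def fake_infer():
    """Hypothetical stand-in for a real inference call."""
    await asyncio.sleep(0.01)


if __name__ == "__main__":
    print(asyncio.run(measure_latency_ms(fake_infer)))
```

To time the loop from the top of this page, you could pass `lambda: detector.get_detections_from_camera("my-camera")` in place of `fake_infer`. Reporting the 95th percentile alongside the median shows how bursty the latency is, which matters for control loops.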

## Why remote inference adds latency

For an ML-backed [vision service](/vision/), the detection ultimately blocks on the ML model service's `Infer` method, and you can run that model locally on the machine or call one hosted elsewhere. (A heuristic detector such as `color_detector` runs no model and skips this cost.)
Running inference on a remote or cloud server can give you access to hardware far more powerful than an edge device.
That power comes with an added cost: every frame travels to the server and every result travels back.

Remote inference latency is the sum of the network round trip and the server-side compute:

```text
remote_inference_time ≈ upload_time + server_inference_time + download_time
```

Uploading a full-resolution frame over a constrained or high-latency link can add tens to hundreds of milliseconds and can vary from frame to frame.
For a monitoring task that reports a status every few seconds, that overhead is comfortably absorbed.
For a control loop steering a moving base, a variable extra 100 ms per iteration both lowers the loop rate and makes its timing less predictable.
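
The formula above can be turned into a back-of-the-envelope estimate. This sketch assumes a simple link model (a fixed one-way latency plus payload size divided by bandwidth) and ignores protocol overhead and jitter; every name and number in it is illustrative.

```python
def transfer_s(payload_bytes: int, bandwidth_mbps: float, one_way_latency_s: float) -> float:
    """Time to move a payload across a link: propagation delay plus serialization time."""
    return one_way_latency_s + (payload_bytes * 8) / (bandwidth_mbps * 1_000_000)


def remote_inference_s(frame_bytes: int, result_bytes: int, server_inference_s: float,
                       uplink_mbps: float, downlink_mbps: float,
                       one_way_latency_s: float) -> float:
    """upload_time + server_inference_time + download_time."""
    upload = transfer_s(frame_bytes, uplink_mbps, one_way_latency_s)
    download = transfer_s(result_bytes, downlink_mbps, one_way_latency_s)
    return upload + server_inference_s + download


# Illustrative values: 200 kB frame, 2 kB result, 15 ms on the server,
# 10 Mbps uplink, 50 Mbps downlink, 20 ms one-way latency.
total = remote_inference_s(200_000, 2_000, 0.015, 10, 50, 0.020)
print(f"{total * 1000:.0f} ms")  # 215 ms
```

In this example the upload alone takes 180 ms, far more than the 15 ms of server compute, which is why a smaller or more heavily compressed frame often helps remote inference more than a faster server does.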

## Sizing a real-time task

The tolerable loop rate follows from what the loop controls.

**Real-time control** acts on the physical world, where staleness compounds.
A base moving at 1 m/s travels 20 cm during a 200 ms inference call, so at 5 Hz every decision is based on a frame already 20 cm out of date.
For steering, obstacle avoidance, or closed-loop reaction, you generally want inference well under the physical time constant of the system, and you want that latency to be steady rather than bursty.
Viam's feedback controllers expose their own cadence directly: a [sensor-controlled base](/reference/components/base/sensor-controlled/) accepts a `control_frequency_hz` value (default 10 Hz), and its movement sensors must report at least that fast for the loop to hold rate.
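
The staleness arithmetic for a moving base can be written out directly. This is a minimal sketch with hypothetical function names; it treats speed as constant over the inference interval.

```python
def staleness_m(speed_m_s: float, latency_s: float) -> float:
    """Distance the base travels while one inference call is in flight."""
    return speed_m_s * latency_s


def max_latency_s(speed_m_s: float, tolerance_m: float) -> float:
    """Largest latency that keeps each decision within a distance tolerance."""
    if speed_m_s <= 0:
        raise ValueError("speed must be positive")
    return tolerance_m / speed_m_s


print(staleness_m(1.0, 0.200))   # 0.2 (metres, the 20 cm case above)
print(max_latency_s(1.0, 0.05))  # 0.05 (seconds, to stay within 5 cm)
```

Running the calculation in the second direction gives a latency target before you pick a model: at 1 m/s, a 5 cm tolerance leaves 50 ms for the whole iteration.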

**Monitoring and logging** consume detections rather than steering on them, so a loop running at 1 Hz, or slower, is often plenty.
Here you can favor a larger, more accurate model or a remote GPU and accept the higher per-frame latency, because no actuator is waiting on the result.

To size a task, work backward from the required rate.
Decide the loop rate the application needs, invert it to get the latency budget per iteration, subtract the frame-capture and actuation time, and choose a model, resolution, and hardware whose measured inference time fits what remains.
If nothing fits, you have three levers: shrink or quantize the model, lower the input resolution, or move inference to faster hardware.
Because these levers trade accuracy and cost against speed, measuring inference time on the target device is the step that turns an estimate into a design you can rely on.
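
The work-backward procedure can be sketched as a small budget calculation. The candidate names and latencies below are made-up placeholders standing in for figures you would measure on your own device, and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    inference_ms: float  # measured on the target device


def inference_budget_ms(required_hz: float, frame_ms: float, actuation_ms: float) -> float:
    """Invert the required rate, then subtract capture and actuation time."""
    return 1000.0 / required_hz - frame_ms - actuation_ms


def fitting(candidates: list[Candidate], budget_ms: float) -> list[Candidate]:
    """Keep only the candidates whose measured inference time fits the budget."""
    return [c for c in candidates if c.inference_ms <= budget_ms]


# Placeholder measurements, for illustration only.
candidates = [
    Candidate("large model on CPU", 400.0),
    Candidate("int8 model on CPU", 150.0),
    Candidate("int8 model on accelerator", 25.0),
]

budget = inference_budget_ms(required_hz=10, frame_ms=15, actuation_ms=5)
print(budget)                                         # 80.0
print([c.name for c in fitting(candidates, budget)])  # ['int8 model on accelerator']
```

An empty result is the signal to pull one of the three levers and measure again.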

## Next steps

- Learn how detection and classification calls work in the [vision service](/vision/).
- See how a feedback loop consumes sensor input at a fixed rate on a [sensor-controlled base](/reference/components/base/sensor-controlled/).
- Explore the [components](/reference/components/) that a perception or control loop reads from and acts on.
131 changes: 131 additions & 0 deletions docs/ai-control/integrate-an-llm.md
@@ -0,0 +1,131 @@
---
linkTitle: "Integrate an LLM"
title: "Integrate an LLM with a robot"
weight: 30
layout: "docs"
type: "docs"
description: "Build a logic module that uses an LLM to turn a high-level goal into a validated sequence of robot skills, then dispatches them through the Viam APIs."
---

Give a robot a high-level goal in plain language, such as "clear the cups off the table," and have it carry out a bounded sequence of actions: drive to the table, pick up a cup, place it in the bin, repeat.
A large language model (LLM) is well suited to decomposing that goal into steps.
Your module calls the LLM to propose which robot skills to run and with what arguments, validates each proposed action, and then dispatches the approved actions through the Viam APIs.

Viam does not host the LLM.
Your module calls an external LLM provider (such as an OpenAI-compatible API) or a local model that you run yourself, using that provider's own SDK.
Everything below is the code you write inside a [module](/build-modules/).

{{% alert title="Safety first" color="caution" %}}
An LLM produces unconstrained text.
Keep a validation layer between the model's proposed action and any [Viam API](/reference/apis/) call.
That layer admits only actions in a fixed allowlist, within fixed parameter bounds, so a malformed or out-of-range response never reaches an actuator.
Steps 4 and 5 cover this validation and the timeouts and human-confirmation gates that back it up.
{{% /alert %}}

## Prerequisites

- A machine with the components and services your robot uses (for example a [base](/reference/apis/components/base/) and an [arm](/reference/apis/components/arm/)), already [configured](/hardware/).
- Credentials for an LLM provider, or a local model you can query.
- Familiarity with [writing a module](/build-modules/).

## Author the logic module

1. **Scaffold a logic module.**
Generate a [generic component or service](/build-modules/) module in your language of choice.
A logic module holds no hardware of its own; it depends on the components and services it orchestrates and coordinates them.
Declare those resources as [dependencies](/build-modules/dependencies/) so your module receives clients for them at runtime.

2. **Define a set of robot skills.**
A skill is a small function that wraps one or more [Viam API](/reference/apis/) calls into a named, self-contained action.
Give each skill a clear name, a short description, and a typed set of arguments.
Keep the surface small and the arguments bounded.

```python {class="line-numbers linkable-line-numbers"}
async def drive_to(self, zone: str):
    """Drive the base to a named zone in the workspace."""
    pose = self.zones[zone]          # look up a known target
    await self.motion.move(...)      # see the Motion API reference

async def pick_cup(self):
    """Close the gripper on a cup at the current pose."""
    await self.gripper.grab()

# Build the table where `self` is in scope, for example in your constructor.
# Each key under a skill is an argument name; its value lists the allowed values.
self.SKILLS = {
    "drive_to": {"zone": list(self.zones)},
    "pick_cup": {},
}
```

The `SKILLS` table is the allowlist.
It is the single source of truth for what the robot can be asked to do and which arguments are legal.
For exact method signatures, see the [component and service APIs](/reference/apis/).

## Prompt the LLM to choose a skill

3. **Ask the model to select a skill and arguments.**
Send the goal and the skill definitions to your LLM provider using its function-calling (also called tool-use) API.
That style constrains the response to a structured choice: a skill name plus arguments, rather than free-form prose you would have to parse.

```python {class="line-numbers linkable-line-numbers"}
response = llm_client.chat(
    goal=goal,
    tools=self.skill_schemas,   # your SKILLS table as the provider's tool schema
)
proposed = response.tool_call   # e.g. {"skill": "drive_to", "args": {"zone": "table"}}
```

The provider returns a proposed action.
Treat it as a request, not a command: nothing runs until it passes validation.
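
One way to derive the tool schemas from the allowlist is sketched below. The output layout follows the JSON Schema style used by OpenAI-compatible function-calling APIs, which is an assumption about your provider; other providers use different shapes. The helper name `build_tool_schemas` is hypothetical, and the sketch assumes every argument takes one of a fixed set of string values.

```python
def build_tool_schemas(skills: dict, descriptions: dict) -> list[dict]:
    """Convert an allowlist of {skill: {argument: allowed_values}} into tool schemas."""
    tools = []
    for name, params in skills.items():
        properties = {
            arg: {"type": "string", "enum": list(allowed)}  # only listed values are legal
            for arg, allowed in params.items()
        }
        tools.append({
            "type": "function",
            "function": {
                "name": name,
                "description": descriptions.get(name, ""),
                "parameters": {
                    "type": "object",
                    "properties": properties,
                    "required": list(params),        # every declared argument is required
                    "additionalProperties": False,   # no arguments outside the allowlist
                },
            },
        })
    return tools


# Illustrative allowlist and descriptions.
skills = {"drive_to": {"zone": ["table", "bin"]}, "pick_cup": {}}
descriptions = {
    "drive_to": "Drive the base to a named zone in the workspace.",
    "pick_cup": "Close the gripper on a cup at the current pose.",
}
schemas = build_tool_schemas(skills, descriptions)
print(schemas[0]["function"]["parameters"]["properties"])
```

Generating the schemas from the same table that validation reads keeps the two from drifting apart. The schema only guides the model, so the validation step still has to run on every response.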

## Validate before executing

4. **Check the proposed action against the allowlist and bounds.**
Run this check on every proposed action, before any API call.
This is the guardrail that keeps an LLM from issuing an unsafe actuator command.

```python {class="line-numbers linkable-line-numbers"}
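def validate(self, proposed):
    skill = proposed["skill"]
    if skill not in self.SKILLS:                      # reject unknown skills
        raise ValueError(f"unknown skill: {skill}")
    spec = self.SKILLS[skill]                         # argument name -> allowed values
    args = proposed.get("args", {})
    if set(args) != set(spec):                        # reject missing or unexpected arguments
        raise ValueError(f"bad arguments for {skill}: {sorted(args)}")
    for name, value in args.items():
        if value not in spec[name]:                   # reject out-of-bounds values
            raise ValueError(f"{name}={value} out of bounds")
    return proposed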
```

Each rule maps to a specific failure it prevents:

- The **allowlist** check admits only skills you wrote and tested, so a hallucinated or misspelled action name never becomes an API call.
- The **parameter-bounds** check confirms every argument falls in a legal range, so an out-of-range value, such as a velocity above your safe cap, is refused before it reaches a motor.

If validation fails, log the rejected action and either stop or ask the model to try again.
A rejected action costs nothing; an unchecked one can move hardware.
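
The membership check above covers arguments with a fixed set of values. For a numeric argument such as a velocity, a range check is one way to express the bound. The sketch below is an assumption about how you might extend the allowlist, where a rule is either a list of allowed values or a `(low, high)` tuple; `check_argument` is a hypothetical helper.

```python
import math


def check_argument(name: str, value, rule) -> None:
    """Raise ValueError unless value satisfies rule.

    rule is either a list of allowed values or a (low, high) tuple
    giving an inclusive numeric range.
    """
    if isinstance(rule, tuple):
        low, high = rule
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name}={value!r} is not a number")
        if not (low <= value <= high):  # also rejects NaN, which fails every comparison
            raise ValueError(f"{name}={value} outside [{low}, {high}]")
    elif value not in rule:
        raise ValueError(f"{name}={value!r} not in allowed values")


check_argument("speed_m_s", 0.3, (0.0, 0.5))        # within the cap: passes silently
check_argument("zone", "table", ["table", "bin"])   # allowed value: passes silently

try:
    check_argument("speed_m_s", 0.9, (0.0, 0.5))    # above the cap
except ValueError as err:
    print(err)  # speed_m_s=0.9 outside [0.0, 0.5]
```

Refusing an out-of-range value is usually safer than clamping it, because a clamped command still executes something the model did not ask for.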

## Execute through the Viam APIs

5. **Dispatch the validated action, with timeouts and optional confirmation.**
Only actions that pass step 4 reach the Viam APIs.
Two further guardrails bound what execution can do:

- **Timeouts** bound each dispatched action in time, so a stalled or long-running motion returns control to your module instead of running unattended.
- **Human confirmation** gates high-consequence skills, such as moving an arm near a person, on an explicit approval before the action runs.

```python {class="line-numbers linkable-line-numbers"}
async def execute(self, action):
    action = self.validate(action)                 # never skip this
    if action["skill"] in self.CONFIRM_REQUIRED:
        if not await self.confirm(action):         # await a person's approval
            return
    # asyncio.timeout needs Python 3.11+; it raises TimeoutError if the step overruns.
    async with asyncio.timeout(self.step_timeout_s):
        await self.skills[action["skill"]](**action["args"])
```

Loop steps 3 through 5 until the goal is met or a step fails, feeding the outcome of each action back to the model as context for the next choice.
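
The outer loop might look like the sketch below. It is one possible structure under stated assumptions: `run_goal`, `propose`, and `execute` are hypothetical names, `propose` returns `None` when the model judges the goal met, and the stubs at the bottom stand in for a real LLM call and real skill dispatch.

```python
import asyncio


async def run_goal(goal, propose, execute, max_steps=10):
    """Propose, execute, and record actions until the goal is met or a step fails."""
    history = []                       # outcomes fed back to the model as context
    for _ in range(max_steps):         # hard cap so the loop always terminates
        proposed = await propose(goal, history)
        if proposed is None:           # assumed convention: None means the goal is met
            break
        try:
            await execute(proposed)    # validation, confirmation, and timeout live here
        except Exception as err:       # includes ValueError and TimeoutError
            history.append({"action": proposed, "ok": False, "error": str(err)})
            break                      # stop on the first failed step
        history.append({"action": proposed, "ok": True})
    return history


# Stubs for illustration only: a scripted plan instead of an LLM, and a no-op executor.
PLAN = [
    {"skill": "drive_to", "args": {"zone": "table"}},
    {"skill": "pick_cup", "args": {}},
]


async def scripted_propose(goal, history):
    return PLAN[len(history)] if len(history) < len(PLAN) else None


async def noop_execute(action):
    await asyncio.sleep(0)


if __name__ == "__main__":
    print(asyncio.run(run_goal("clear the cups off the table", scripted_propose, noop_execute)))
```

The `max_steps` cap is a design choice beyond what the text requires: it keeps a model that never declares the goal met from driving the robot indefinitely.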

## Next steps

- For a worked example of driving a robot from an LLM, see [Integrate Viam with ChatGPT](/tutorials/projects/integrating-viam-with-openai/).
- To package and deploy your logic module, see [Build and deploy modules](/build-modules/).
- To review the exact methods your skills call, see the [component and service APIs](/reference/apis/).