This repo now includes a local capability server at capabilities/server.py that acts as the control plane for:
- map building and persistence
- localization against a saved map
- autonomous navigation to named landmarks or poses
- camera-based scene understanding and 3D object grounding
- face enrollment and recognition
- manipulation and an end-to-end pick-object task pipeline
The current bridge remains the low-level locomotion interface for teleop and simple voice control. The new capability stack is the higher-level interface that OpenClaw should use for autonomous tasks.
The navigation model is intentionally "map once, then localize and navigate":
- Build and save a map.
- Load that map in later sessions.
- Initialize localization on the saved map.
- Navigate using persistent map coordinates and named landmarks.
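
The "map once, then localize and navigate" flow can be sketched as a short request plan against the server's map, localization, and navigation endpoints. This is only an illustration: the base URL, the JSON field names (`name`, `map`, `landmark`), and the helper names `plan_requests`, `run_plan`, and `http_post` are assumptions, not the server's documented request schema.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumption: host and port are not stated here


def plan_requests(map_name, landmark, first_session=False):
    """Return the ordered (path, payload) pairs for one session.

    Payload field names are hypothetical placeholders.
    """
    if first_session:
        # First session: build and save the map once.
        return [("/maps/build", {"name": map_name})]
    # Later sessions: reuse the saved map, localize on it, then navigate.
    return [
        ("/maps/load", {"name": map_name}),
        ("/localization/initialize", {"map": map_name}),
        ("/navigation/goal", {"landmark": landmark}),
    ]


def http_post(path, payload, base_url=BASE_URL):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_plan(steps, send=http_post):
    """Send each step in order and collect the responses."""
    return [send(path, payload) for path, payload in steps]
```

Passing `send` as a parameter keeps the ordering logic testable without a running server; for example, `run_plan(plan_requests("lab", "table"))` would issue the three later-session requests in order.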
- `GET /status`
- `GET /maps`
- `GET /localization/status`
- `GET /navigation/status`
- `POST /maps/build`
- `POST /maps/load`
- `POST /localization/initialize`
- `POST /navigation/goal`
- `POST /navigation/cancel`
- `POST /perception/scene`
- `POST /perception/object_pose`
- `POST /perception/grasp_pose`
- `POST /perception/face/enroll`
- `POST /perception/face/recognize`
In real-backend mode, the capability stack now defaults to a live ZED stereo perception backend.
- `scene()` captures `LEFT` images and `XYZRGBA` point clouds from the ZED camera.
- If you provide 2D detections directly, the backend grounds them into 3D using the ZED point cloud.
- If `DETECTOR_BACKEND=http` is configured, the backend sends the current ZED frame to an HTTP detector service and grounds the returned detections into 3D.
- A local detector service is included in `scripts/detector_service.py`; it can run an Ultralytics / YOLO model, an OpenAI vision-language model, or a fixture mode for debugging.
- If you do not provide detections and no detector service is enabled, the backend falls back to heuristic tabletop and color-based segmentation for simple scenes like a green apple on a table.
- Face recognition remains mock-only for now; the real ZED backend does not yet include a face embedding/recognition adapter.
- Camera-to-base extrinsics can be configured with `ZED_TO_BASE_{X,Y,Z,ROLL,PITCH,YAW}`.
- `GET /status` now reports the active `perception_backend` block, including the active detector backend, so you can see whether the live ZED backend actually initialized.
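
Grounding a 2D detection into 3D can be illustrated with a small sketch: collect the point-cloud samples inside the detection's bounding box and take a per-axis median. This is one common approach, not necessarily what the backend does. The function name `ground_bbox`, the row-major `[v][u]` layout, the bounding-box format, and the assumption that invalid depth shows up as non-finite values are all illustrative.

```python
import math
from statistics import median


def ground_bbox(point_cloud, bbox):
    """Estimate a 3D point (camera frame) for a 2D bounding box.

    point_cloud: rows of (x, y, z, ...) tuples indexed as point_cloud[v][u].
    bbox: (u_min, v_min, u_max, v_max) in pixels, max edges exclusive.
    Returns (x, y, z), or None if the box contains no valid depth.
    """
    u_min, v_min, u_max, v_max = bbox
    height = len(point_cloud)
    width = len(point_cloud[0]) if height else 0

    # Clamp the box to the image so out-of-frame detections do not raise.
    u_min, u_max = max(0, u_min), min(width, u_max)
    v_min, v_max = max(0, v_min), min(height, v_max)

    # Keep only samples whose x, y, z are all finite (skips NaN/inf holes).
    points = [
        point_cloud[v][u][:3]
        for v in range(v_min, v_max)
        for u in range(u_min, u_max)
        if all(math.isfinite(c) for c in point_cloud[v][u][:3])
    ]
    if not points:
        return None

    # A per-axis median is robust to background pixels inside the box.
    return tuple(median(p[axis] for p in points) for axis in range(3))
```

The result is in the camera frame; expressing it in the robot base frame would additionally require applying the configured camera-to-base extrinsics.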
- `POST /manipulation/pick`
- `POST /mission/pick_object`
- OpenClaw calls `perception_stack` to identify the table and green apple.
- OpenClaw calls `navigation_stack` to ensure a saved map is loaded and localization is ready.
- OpenClaw calls `navigation_stack` again to move to the `table` landmark.
- OpenClaw calls `perception_stack` again to refine the green apple pose at close range.
- OpenClaw calls `perception_stack` to turn that refined 3D pose into a candidate grasp.
- OpenClaw calls `manipulation_stack` to execute the pick with the pose-aware grasp candidate.
- OpenClaw verifies completion from the returned task status.
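
The sequence above can be sketched as a single orchestration function. The three stacks are passed in as callables so the ordering is easy to read and test. The action names (`"scene"`, `"ensure_localized"`, and so on), the keyword arguments, and the `"succeeded"` status value are hypothetical; the real tool signatures may differ.

```python
def pick_object(perception_stack, navigation_stack, manipulation_stack,
                target="green apple", landmark="table"):
    """Run the pick-object steps in order and report success.

    Each stack is a callable taking an action name plus keyword arguments
    and returning a dict. All action and field names are placeholders.
    """
    # 1. Coarse scene understanding: find the table and the target object.
    perception_stack("scene", targets=[landmark, target])
    # 2. Make sure a saved map is loaded and localization is ready.
    navigation_stack("ensure_localized")
    # 3. Drive to the named landmark.
    navigation_stack("goal", landmark=landmark)
    # 4. Refine the object pose at close range.
    pose = perception_stack("object_pose", target=target)
    # 5. Turn the refined 3D pose into a candidate grasp.
    grasp = perception_stack("grasp_pose", pose=pose)
    # 6. Execute the pick with the pose-aware grasp candidate.
    result = manipulation_stack("pick", grasp=grasp)
    # 7. Verify completion from the returned task status.
    return result.get("status") == "succeeded"
```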
mock_mode is a flag on the capability server itself.
- `./scripts/start_capability_server.sh` starts the server in mock mode by default.
- `CAPABILITY_REAL_BACKEND=1 PERCEPTION_BACKEND=zed ./scripts/start_capability_server.sh` starts it in real-backend mode, where the default perception backend becomes `zed`.
- `DETECTOR_BACKEND=http DETECTOR_URL=http://127.0.0.1:8790/detect` enables the detector-service path for the live ZED backend; inside that detector service you can choose `DETECTOR_SERVICE_BACKEND=ultralytics`, `openai`, or `fixture`.
- `PERCEPTION_DETECTIONS_PATH=/path/to/detections.json` still lets you inject 2D detections from a fixture file for grounding.
- `GET /status` includes `mock_mode` and `perception_backend`, so you can verify which mode is live and which detector backend is active.
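
A client can summarize which mode is live from a decoded `GET /status` payload. In the sketch below, only the top-level `mock_mode` and `perception_backend` keys come from this README; the nested `name` and `detector_backend` keys and the helper name `describe_status` are assumptions about the payload shape.

```python
def describe_status(status):
    """Summarize server mode from a decoded GET /status payload.

    Nested keys under "perception_backend" are hypothetical.
    """
    mock = status.get("mock_mode")
    if mock is None:
        mode = "unknown"  # the flag is missing, so do not guess
    else:
        mode = "mock" if mock else "real-backend"

    backend = status.get("perception_backend") or {}
    name = backend.get("name", "unknown")
    detector = backend.get("detector_backend", "none")
    return f"mode={mode} perception={name} detector={detector}"
```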
In mock mode:
- localization, navigation, and verification can succeed from the stored mock scene state
- pick verification succeeds by updating the in-memory scene after the staged pick executes
In real-backend mode:
- the server stops fabricating verification success from the mock scene state
- the staged pick still runs through the bridge and WBC path
- if every bridge stage succeeds, verification currently falls back to `bridge_execution_trust` and marks the pick as succeeded
- the long-term fix is still real perception confirmation that the object disappeared from the table or is now in hand
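
The difference between the two modes can be condensed into a small decision function. This is a sketch of the behaviour described above, not the server's code: the function name `verify_pick`, the result dict shape, and the `"stage_failed"` and `"mock_scene"` labels are hypothetical; only `bridge_execution_trust` comes from this README.

```python
def verify_pick(mock_mode, stage_results, mock_scene_updated=False):
    """Decide pick success from the mode and per-stage bridge results.

    stage_results: one boolean per staged bridge step.
    mock_scene_updated: whether the in-memory mock scene reflects the pick.
    """
    # An empty or partially failed stage list can never verify.
    if not stage_results or not all(stage_results):
        return {"succeeded": False, "verification": "stage_failed"}
    if mock_mode:
        # Mock mode: success comes from the updated in-memory scene.
        return {"succeeded": bool(mock_scene_updated), "verification": "mock_scene"}
    # Real-backend mode: no perception confirmation yet, so trust the bridge.
    return {"succeeded": True, "verification": "bridge_execution_trust"}
```

Returning the verification source alongside the boolean lets a caller distinguish a trusted-execution success from one that perception has confirmed.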
The current implementation is still an honest scaffold:
- the API surface, state transitions, persistence, task sequencing, and pose-aware grasp planning contract are implemented
- sensor fusion, object detection, SLAM, path planning, IK feasibility checks, and perception-based grasp verification are still represented by mock results inside `capabilities/state.py`
That means this is ready to serve as the orchestration contract for OpenClaw and for future ROS2 adapters, but it is not yet a production autonomy stack.