This repository accompanies the paper "ApBot: Robot Operation of Home Appliances by Reading User Manuals", and contains open-sourced code, datasets, simulation tools, and baseline experiments to support research on robotic operation of household appliances.
ApBot enables robots to operate previously unseen appliances by “reading” their user manuals, grounding symbolic actions to real-world control elements, and executing policies robustly and interactively with textual or visual feedback.
The repository is organized into three main parts:

- `code/` – Core implementation for foundation models, symbolic reasoning, visual grounding, simulation, and real-world execution. It consists of:
  - `foundation_models/` – GPT, OWL, and SAM2 integration
  - `simulated/` – Scripts for parsing user manuals, grounding, generating test cases, and running experiments. Includes:
    - `paper_exp/` – Scripts and outputs for reproducing the experiments and baselines reported in the paper
  - `real_world/` – Scripts for running real-world experiments
- `dataset/` – Structured datasets for training and evaluation. It consists of:
  - `simulated/` – Six simulated appliances (e.g., microwave, washer), each with user manuals, control panel images, and symbolic simulators
  - `real_world/` – Real-world recordings of five appliances, structured in the same format as the simulated data
- `README` – Documentation and usage guide
- Structured Appliance Modeling: Automatically extract and build symbolic models (variables, features, actions) from unstructured manuals.
- Vision-Language Grounding: Ground control instructions to appliance control panels using large VLMs (e.g., OWL-V2).
- Closed-loop Execution: Monitor execution visually and update world models based on feedback.
- Simulated + Real-world Evaluation: Test and benchmark baseline methods in both controlled and real scenarios.
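The closed-loop idea can be sketched in a few lines. The toy example below (hypothetical names such as `ToySimulator` and `closed_loop_run`, not this repository's actual API) shows the propose–execute–observe–update cycle that the real pipeline performs with LLM proposals and visual feedback:

```python
class ToySimulator:
    """Stands in for a real appliance: pressing 'power' toggles it."""
    def __init__(self):
        self.display = {"power": "off"}

    def press(self, button):
        if button == "power":
            self.display["power"] = "on" if self.display["power"] == "off" else "off"
        return dict(self.display)  # the "visual feedback" after acting


def closed_loop_run(state, sim, goal, max_steps=5):
    """Act until the tracked state satisfies the goal, trusting observed
    feedback over the model's own prediction after every action."""
    for _ in range(max_steps):
        if all(state.get(k) == v for k, v in goal.items()):
            return True
        # Naive policy: press the button named after a mismatched variable.
        target = next(k for k, v in goal.items() if state.get(k) != v)
        observed = sim.press(target)
        state.update(observed)  # closed-loop model update
    return all(state.get(k) == v for k, v in goal.items())


print(closed_loop_run({"power": "off"}, ToySimulator(), {"power": "on"}))  # True
```

The key design point is the `state.update(observed)` step: the world model is corrected from feedback after every action rather than trusted blindly.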
This directory contains structured code, foundation models, and scripts for appliance control research. It is organized into `foundation_models/`, `real_world/`, and `simulated/` directories.
📁 code/
The heart of the project, containing the logic for visual grounding, reasoning, and simulation.
- 📄 `gpt_4o_model.py` – Used to call GPT
- 📄 `owlv2_crane5_api.py` – Starts a microservice to recognize appliance panel buttons and dials
- 📄 `owlv2_crane5_query.py` – Once the API is running, functions in this script can be used to call OWLv2
- 📄 `__init__.py`
- 📄 `_0_t2a_config.py` – Specifies the root code path
- 📄 `_1_pdf_to_text.py` – Converts PDF user manuals to text
- 📄 `_2_extract_control_panel_element_names_from_user_manual.py` – Extracts button/dial name lists from manuals
- 📄 `_3_detect_bbox_from_photos.py` – Detects bounding boxes from appliance panel images
- 📄 `_4_map_control_panel_element_names_to_bbox_indexes.py` – Maps button/dial names to candidate bounding box indices
- 📄 `_5_json_map_control_panel_element_names_to_bbox_indexes.py` – Formats the previous mapping results into JSON
- 📄 `_6_remove_duplicate_bboxes.py` – Ensures a one-to-one mapping between buttons/dials and bounding box indices
- 📄 `_7_json_map_control_panel_element_names_to_bbox_indexes.py` – (Appears redundant; formats mapping results into JSON again)
- 📄 `_8_visualise_grounding_control_element_name_result.py` – Generates visualization images showing all buttons/dials as labeled bounding boxes
- 📄 `_10_propose_action_names.py` – Proposes actions based on user manuals
- 📄 `_11_generated_grounded_action_json.py` – Maps proposed actions to bounding box coordinates
- 📄 `_12_match_proposed_action_to_oracle_action.py` – Checks whether proposed actions match oracle execution regions using bounding box coordinates
- 📄 `_13_visualisation_grounding_action_result.py` – Generates visualization images displaying grounded actions as labeled bounding boxes
- 📄 `_18_generate_testcases.py` – (Deprecated) Generates ambiguous instructions that can be satisfied by one or multiple policies
- 📄 `_18_generate_testcases_v2.py` – Generates instructions requiring different step sizes to achieve goals
- 📄 `_19_generate_target_state.py` – For each test instruction, generates the target state in the appliance simulator
- 📄 `_0_logic_units.py` – Appliance model templates; combines the contents of variables, features, and actions
- 📄 `_0_sample_machine.py` – Example appliance models
- 📄 `_0_search.py` – Macro-action logic
- 📄 `_1_variables.py` – Appliance variable templates
- 📄 `_2_features.py` – Appliance feature templates, used to specify the macro actions
- 📄 `_3_actions.py` – Appliance dynamics templates, used to specify action effects
- 📄 Various prompt text files – Used by the Python files in the `simulated/` folder
- 📄 `paper_exp/baselines_v1/` – Code for the various baselines in the paper; `_4_HV_M_SR_MA_agnostic` refers to the ApBot algorithm
- 📄 `_17_testcase/` – Prompt used to generate instructions that are ambiguous or specific
- 📄 `_17_testcase_v2/` – Prompt used to generate instructions that require different numbers of steps to finish
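The template files above (`_1_variables.py`, `_2_features.py`, `_3_actions.py`) compose a symbolic appliance model from variables, features, and action dynamics. A minimal sketch of that decomposition, with hypothetical class names rather than the repository's actual templates:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Variable:
    """An appliance state variable with its legal values (cf. _1_variables.py)."""
    name: str
    values: List[str]
    current: str

@dataclass
class Action:
    """A primitive control action and its effect on one variable (cf. _3_actions.py)."""
    name: str
    variable: str
    effect: Callable[[str, List[str]], str]

def cycle(current, values):
    """Common button/dial dynamic: advance to the next value, wrapping around."""
    return values[(values.index(current) + 1) % len(values)]

# Example: a mode dial with three settings, advanced by one button press.
mode = Variable("mode", ["off", "low", "high"], "off")
press_mode = Action("press_mode_button", "mode", cycle)

mode.current = press_mode.effect(mode.current, mode.values)
print(mode.current)  # low
```

Features (macro actions) then sequence such primitive actions to reach a named setting, which is what `_0_search.py` plans over.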
📁 baselines_v1/
- Code for running the paper baselines; ApBot corresponds to `_4_M_SR_MA_agnostic`

📁 real_world/
- Code for running real-world experiments
- Helper functions to load files
📁 task/
- 📄 `calibrate_current_value.py` – During closed-loop execution, if the display shows an unexpected value, calibrates the appliance model according to predefined routines and templates
- 📄 `check_reasoning_file_existance.py` – Checks file existence
- 📄 `check_state_reached.py` – Checks whether the appliance state has reached the one required by an instruction
- 📄 `compare_visual_grounding.py` – Compares the visual grounding performance of ApBot and Molmo
- 📄 `derive_variable_definition.py` – Helper function to update a variable in the appliance model when there is a feedback mismatch
- 📄 `generate_report.py` – After baseline results are out, generates instruction execution results
- 📄 `generate_updated_goal.py` – Updates the ApBot appliance model's target state
- 📄 `interact_with_simulator.py` – Used by ApBot to interact with appliance simulators
- 📄 `mathces_run_action_format.py` – Checks the command formats ApBot uses to interact with appliance simulators
- 📄 `prepare_simulator.py` – Loads appliance simulators
- 📄 `propose_actions.py` – During closed-loop execution, ApBot proposes a suitable action given the current appliance state
- 📄 `propose_feature_list.py` – Proposes macro actions of appliance models
- 📄 `propose_goal_state.py` – Proposes the appliance model's target state
- 📄 `propose_variables.py` – Proposes appliance model variables
- 📄 `propose_world_model_agnostic_to_command.py` – Proposes the appliance model's action dynamics
📁 foundation_models/
Contains foundation model integrations: GPT, OWL, and SAM2.
📁 simulated/
Simulation-related scripts for processing user manuals, grounding control panel elements, and generating test cases.
📁 samples/
Used as example appliance models.
📁 prompts/
Stores all the prompts used.
📁 paper_exp/
Scripts and outputs for reproducing the experiments and baselines reported in the paper.
📁 utils/
Helper functions for reasoning, calibration, and simulator interaction.
📁 real_world/
Scripts for real-world execution. Its structure combines the contents of the dataset folder and the simulated code folder.
This repository contains structured datasets, simulators, and baseline experiment results for appliance control research. It is organized into simulated and real-world datasets.
The simulated dataset includes data for the following appliances, with 5 instances each:
- Dehumidifier
- Bottle Washer
- Rice Cooker
- Microwave Oven
- Bread Maker
- Washing Machine
📁 simulated/
The root directory for all simulated appliance data.
- 📁 0_pdf/ – Raw PDF manuals, manually added
- 📁 1_image/ – Image versions of manuals, generated via code
- 📁 2_text/ – Extracted text from PDFs, generated via code
- 📁 3_extracted_control_panel_element_names/ – Control panel element names extracted from manuals, generated via code
- 📁 0_raw/ – Images copied from websites, manually added
- 📁 1_selected/ – Final selected single image, manually added
- 📁 2_ground_control_panel_elements/
- 📁 1_validity_control_panel/ – One bounding box per image circling a button or dial
- 📁 2_bboxes_on_control_panel/ – JSON files of button/dial bounding boxes
- 📁 3_bboxes_on_control_panel_visualisation/ – Visual summary of all bounding boxes
- 📁 4_query_images_bbox_to_name/ – Red box for query, green boxes as references
- 📁 5_query_images_unique_bbox_id/ – Indexed candidate boxes per button/dial
- 📁 0_control_panel_images_groundtruth_annotation/ – COCO annotations (manual)
- 📁 1_oracle_available_actions/ – Oracle action lists (manual)
- 📁 2_map_oracle_action_to_annotation_label/ – Dict: action names → labels + types (manual)
- 📁 3_oracle_simulator_action_to_bbox_mapping/ – Auto: action names → bbox coords
- 📁 4_simulators_with_states_and_display/ – Sim: action → text display (manual)
- 📁 5_testcases/
- 📁 1_testcases_var_raw/ – Generated task instructions
- 📁 2_testcases_var/ – Selected instructions (manual)
- 📁 3_testcases_var_with_target_states/ – Target states (generated)
- 📁 4_testcases_var_formatted/ – Generated + manually corrected
- 📁 0_control_panel_element_bbox/
- 📁 0_control_panel_element_index/ – Bbox index → element names
- 📁 1_control_panel_element_index_json/ – Element names → bbox indices
- 📁 2_control_panel_element_index_json_unique/ – Unique name → bbox index
- 📁 3_proposed_control_panel_element_bbox/ – Element name → coordinates
- 📁 4_visualised_proposed_control_panel_element_bbox/ – Indexed bbox visualization
- 📁 5_visualised_localised_buttons_no_label/ – Center-indexed bbox visualizations
- 📁 6_visualised_localised_buttons_legends/ – Bbox index → coordinates
- 📁 7_proposed_buttons_to_oracle_action_mapping/ – Proposed bbox → oracle actions
- 📁 9_extracted_control_panel_labels/ – Element names from manual
- 📁 1_action_names/
- 📁 1_proposed_action_names/ – Proposed actions
- 📁 2_proposed_world_model_action_bbox/ – Action name → bbox
- 📁 3_proposed_to_oracle_action_mapping/ – Action name → oracle mapping (ApBot)
- 📁 4_molmo_proposed_action_coords/ – Action name → image coords
- 📁 5_molmo_proposed_actions_visualisation/ – Visualized action positions
- 📁 6_molmo_proposed_to_oracle_action_mapping/ – Action name → oracle mapping (Molmo)
- 📁 1_visual_grounding_action_results/ – Metrics for action visual grounding
- 📁 3_visualize_proposed_actions/ – Visual results of grounded actions
- 📁 6_paper_exp/
- Baseline experiment results
- 📁 11_visualisation/ – Baseline output visualizations
📁 0_websites/
Appliance panel images and user manuals (source websites).
📁 1_user_manual/
Contains different forms of user manuals (PDF, image, text, and parsed elements).
📁 2_control_panel_images/
Control panel images and grounded elements.
📁 3_simulators/
Contains simulators and their associated assets.
📁 4_visual_grounding/
Visual grounding data for control panel elements and action names.
📁 6_results/
Experiment results, including visual grounding metrics, action visualizations, and baseline outputs.

📁 real_world/
The real-world dataset includes data for the following appliances, with 1 instance each:
- Washing Machine
- Rice Cooker
- Blender
- Water Dispenser
- Induction Cooker
The internal folder structure is identical to the simulated dataset.
The baseline experiments evaluate different combinations of perception, appliance model, reasoning, and policy.
| ID | Name | Visual Grounding (Perception) | Access to User Manual | Reasoning | Action Policy | Name in Paper |
|---|---|---|---|---|---|---|
| 1 | NV_M_UR_LP | Oracle | Yes | Unstructured | LLM | LLM as policy w/ grounded actions |
| 2 | HV_M_UR_LP | Proposed | Yes | Unstructured | LLM | LLM as policy w/ image |
| 4 | M_SR_MA | Proposed | Yes | Structured | Macro-actions | ApBot |
| 5 | HV_M_SR_LP | Proposed | Yes | Structured | LLM | ApBot w/o button policy |
| 7 | HV_M_UR_MA | Proposed | Yes | Unstructured | Macro-actions | ApBot w/o model |
| 8 | HV_M_SR_MA_OL | Proposed | Yes | Structured | Macro-actions | ApBot w/o closed-loop update |
| 9 | oracle_V_oracle_M | Oracle | Yes | Structured & Oracle | Macro-actions | N/A |
| 10 | oracle_V_proposed_M | Oracle | Yes | Structured | Macro-actions | N/A |
- Visual Grounding Metrics: Stored in `dataset/simulated/6_results/1_visual_grounding_action_results/`
- Action Visualizations: Stored in `dataset/simulated/6_results/3_visualize_proposed_actions/`
- Baseline Experiment Outputs: Stored in `dataset/simulated/6_results/6_paper_exp/`
- Comparative Visualisations: Stored in `dataset/simulated/6_results/6_paper_exp/11_visualisation/`
To understand and run the full ApBot pipeline, start from the code/simulated/ directory. The pipeline includes reading PDFs, visual grounding, building symbolic models, and testing the model. Below is the step-by-step process:
Convert appliance manuals from PDF to structured elements.
| Step | Description | Script |
|---|---|---|
| 1.1 | Convert PDF manuals into raw text | _1_pdf_to_text.py |
| 1.2 | Extract control panel element names from text | _2_extract_control_panel_element_names_from_user_manual.py |
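Step 1.2 is driven by LLM prompts in the repository; purely to illustrate its input/output shape, a rule-based stand-in might look like this (the regex and function name are hypothetical):

```python
import re

def extract_element_names(manual_text):
    """Toy stand-in for step 1.2: collect names that appear in phrases
    like 'press the Start button' or 'turn the Temperature dial'.
    (The repository does this with an LLM prompt; this only mimics the
    shape of the result: a deduplicated list of element names.)"""
    pattern = r"(?:press|turn|hold)\s+the\s+([A-Z][\w/ ]*?)\s+(?:button|dial|knob)"
    names = re.findall(pattern, manual_text, flags=re.IGNORECASE)
    return sorted(set(names))

text = ("To begin, press the Start button. Turn the Temperature dial "
        "to adjust heat, then press the Start button again.")
print(extract_element_names(text))  # ['Start', 'Temperature']
```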
Use control panel images and align them with textual instructions.
| Step | Description | Script |
|---|---|---|
| 2.1 | Detect bounding boxes from panel images | _3_detect_bbox_from_photos.py |
| 2.2 | Map detected boxes to control element names | _4_map_control_panel_element_names_to_bbox_indexes.py |
| 2.3 | Format mappings into structured JSON | _5_json_map_control_panel_element_names_to_bbox_indexes.py, _7_json_map_control_panel_element_names_to_bbox_indexes.py |
| 2.4 | Remove duplicate or conflicting mappings | _6_remove_duplicate_bboxes.py |
| 2.5 | Visualize grounding results | _8_visualise_grounding_control_element_name_result.py |
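Step 2.4 enforces a one-to-one mapping between element names and bounding boxes. A greedy, hypothetical simplification of that de-duplication logic (not the actual `_6_remove_duplicate_bboxes.py`):

```python
def one_to_one_mapping(candidates):
    """Each element name starts with a ranked list of candidate box
    indices; assign every name its best still-unclaimed box so the
    final mapping is one-to-one."""
    assigned, taken = {}, set()
    for name, boxes in candidates.items():
        for idx in boxes:
            if idx not in taken:
                assigned[name] = idx
                taken.add(idx)
                break
    return assigned

# 'Start' and 'Stop' both rank box 3 first; greedy assignment resolves
# the conflict by giving 'Stop' its next-best box.
candidates = {"Start": [3, 1], "Stop": [3, 2], "Timer": [2, 5]}
print(one_to_one_mapping(candidates))  # {'Start': 3, 'Stop': 2, 'Timer': 5}
```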
Generate appliance variables, features, and macro-actions.
| Step | Description | Script |
|---|---|---|
| 3.1 | Propose actions from manuals | _10_propose_action_names.py |
| 3.2 | Map actions to bounding boxes | _11_generated_grounded_action_json.py |
| 3.3 | Match proposed actions to oracle ground-truth | _12_match_proposed_action_to_oracle_action.py |
| 3.4 | Visualize grounded actions | _13_visualisation_grounding_action_result.py |
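Step 3.3 judges a proposed action correct when its bounding box overlaps the oracle execution region. A minimal sketch using an illustrative IoU threshold (the actual matching criterion in `_12_match_proposed_action_to_oracle_action.py` may differ):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def matches_oracle(proposed_box, oracle_box, threshold=0.5):
    """A proposed action counts as correct when its box overlaps the
    oracle execution region enough; the 0.5 threshold is illustrative."""
    return iou(proposed_box, oracle_box) >= threshold

print(matches_oracle((0, 0, 10, 10), (1, 1, 11, 11)))  # True (IoU ~ 0.68)
```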
Generate test instructions and target appliance states.
| Step | Description | Script |
|---|---|---|
| 4.1 | Generate ambiguous or multistep instructions | _18_generate_testcases_v2.py |
| 4.2 | Generate target states for each test | _19_generate_target_state.py |
Run the main ApBot experiment using the generated appliance model and test instructions.
| Step | Description | Script |
|---|---|---|
| 5.1 | Main file for symbolic reasoning and macro-action execution (ApBot pipeline) | paper_exp/baselines_v1/_4_M_SR_MA_agnostic.py |
Located in `code/utils/task/`, these helper scripts support reasoning, calibration, and simulator interaction during closed-loop execution. For example:

- `interact_with_simulator.py` – ApBot's interface to the simulator
- `generate_updated_goal.py` – Update goals dynamically based on feedback
- `propose_goal_state.py` – Generate target states
- `check_state_reached.py` – Check whether the state matches the goal
- `calibrate_current_value.py` – Adjust the model if observed feedback is off
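For instance, the goal check performed by `check_state_reached.py` amounts to comparing observed variables against the instruction's target values; a toy, simplified stand-in:

```python
def check_state_reached(observed, goal):
    """The goal is met when every variable the instruction cares about
    has its target value; variables the goal omits are ignored."""
    return all(observed.get(k) == v for k, v in goal.items())

observed = {"power": "on", "mode": "defrost", "timer": "5:00"}
print(check_state_reached(observed, {"mode": "defrost", "timer": "5:00"}))  # True
print(check_state_reached(observed, {"mode": "popcorn"}))                   # False
```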