This is the official implementation of Spolacq-GDS.
Need Python>=3.10. 3.10.15 is recommended. We use pyenv and poetry for environment creation.
Linux/macOS
poetry install
source <env_name>/bin/activate
Windows
poetry install
./<env_name>/Scripts/activate
Please run the following command in your virtual environment.
python main.py --experiment_config_name yy.yaml
On yy, please specify the name of the config file(yaml).
Then, you can start learning!
You can specify food_task.yaml or food_corner_task.yaml as samples. The explanation of food_task or food_corner_task is given on here.
If you want to change what agent or task to use on learning, please change agent_name or task_name in the config file(yaml).
For the details how to write it, please refer to How to write config file(yaml) below.
This, Spolacq-GDS, is the multi-modal dialogue simulator that controls scene transition by automaton and generates natural dialogue action with generative models on each scene. It generates data with image generator and audio generator, and interpret audio, generate response, transit state and give items with LLM, ASR and audio generator. The figure of this simulator is as follows:
(DESM: DialogueEnvironmentStateManger)
You can use this simulator by defining dialogue task (dialogue defined file) and agent. It auto-generates automaton and data for scenario from defined file, uses them and roles as an opponent of the dialogue, then give the environment for dialogue reinforcement learning.
This simulator generally targets the agents and the tasks that use images and audio (optional) as inputs, and audio as an output, but you might want to use other modalities. You can do it by defining wrappers. Here wrapper can be defined by inheriting Wrapper class given by Gymnasium
On this simulator, there are a pre-defined sample task, food_task, and the agent (Continuous Action Space-Based Spoken Language Acquisition Agent Using Residual Sentence Embedding and Transformer Decoder) that can handle with that task. This agent receives image and outputs audio, but your agent might have to receive audio as an input. Even though, you can contain it in the observation or preprocess it without changing implementation of the environment.
You can see sample codes of task by visiting under tasks directory.
You can see sample codes of dialogue agent by visiting under agents directory.
Here I will explain how to define the task. At first, please decide the name of the task, and make the directory whose name is that. After that, please make following files in it.
xx.jsontask_functions.pyutterance_patterns.jsontask_config.json
On xx, you can specify arbitrary name.
Also, you can add following files.
task_wrappers.py
On this file please define the task.
For the details how to write it, please refer to How to write task json file below.
On this file, please define the function needed to have dialogue learning. To be detailed, following functions are defined.
| Function | Return value | Role |
|---|---|---|
| reward_function | float | Returns reward according to the previous environment and the next one. |
| termination_function | bool | Returns whether it terminates according to the previous environment and the next one. |
| truncation_function | bool | Returns whether it truncates according to the next environment. |
| internal_state_update_function | numpy.ndarray | Returns next internal state according to the internal state on this step and the feedback of dialogue. |
| observation_function | dict[str, numpy.ndarray] | Returns observation space according to the next environment. |
| initial_internal_state_function | numpy.ndarray | Returns the internal state when the environment resets. |
The error will occur if any function on the above isn't defined.
About EnvironmentState class, that is one of the arguments of these functions, please refer to the explanation on the later section.
On this file, please define what speech will be used on pretraining. You can make suitable sentences for each scene.
The format of this file can be as follows:
{
"scene": {
"japanese_food_corner":"food_corner",
"western_food_corner":"food_corner"
},
"food_corner": [
"{name}",
"i want to go to the {name}",
"let's go to the {name}",
"i'd like to visit the {name}",
"take me to the {name}"
],
"else": [
"{name}",
"i want {name}",
"i would like to have {name}",
"i'll take {name}",
"give me {name}"
]
}As the above, if you give identifier (here food_corner) on the scene, the audio will be generated based on the utterance patterns on food_corner on the scenes that takes food_corner as an identifier. If the identifier isn't given, the audio will be generated based on the utterance patterns on else on the scenes.
Then, the placeholder {name} will be replaced by the name of the item.
Finally, please add task_config.json in the folder.
This file has the following format:
{
"env_name": "FoodTask-v0",
"make_env": "tasks.food_task.task_wrappers:wrap_task_env",
"max_step": 2,
"task_config_name": "food_task_prompt.json"
}These contents are following things:
| content | explanation |
|---|---|
| env_name | The task name on gymnasium. |
| make_env (optional) | The function name if you wrap the environment with following task_wrappers.py etc. |
| max_step | The max step of the dialogue. |
| task_config_name | The file name of the former xx.json. |
You can add task_wrappers.py to wrap with the environment by Wrapper of gymnasium.
If so, please add following codes after the definition of the wrapper, on task_wrappers.py.
def wrap_task_env(base_env):
wrapped_env = YourOriginalWrapper(base_env)
return wrapped_envHere I will explain how to define the agent that will learn dialogue.
To define dialogue learning agent, please make the directory whose name is the name of the agent under agents.
Next please make the following file in it.
agent.pytrainer.pyagent_config.json
You have to define agent on agent.py and trainer on trainer.py.
Then you define used classes on agent_config.json.
This is the example.
{
"agent_class": "AgentWithUnits",
"trainer_class": "Trainer"
}Here you should define the name of the classes of agent and trainer.
From here, you diverge implementing ways whether you use units on language or image (such as audio representations or image representations) and so on.
To learn the simplest (and the most free) dialogue agent, it's easiest to use the agent that inherits BaseAgent.
This BaseAgent has some needed APIs to have dialogue learning.
| name | return value | explanation |
|---|---|---|
| action2speech | numpy.ndarray | It converts action of reinforcement learning to audio data. This function won't be used unless you define Wrapper. |
Then, it's simple to define trainer, with inheriting BaseTrainer.
BaseTrainer has following APIs.
| name | return value | explanation |
|---|---|---|
| pretrain | None | If you have some modules that needs to be pretrained, please define this function. |
| train_rl | None | Function that has procedure of reinforcement learning of agent. |
Please refer to sample01 under agents as an example of implementation. This is the agent proposed on Continuous Action Space-Based Spoken Language Acquisition Agent Using Residual Sentence Embedding and Transformer Decoder.
The files to be defined are same, but if you use units, you can define agent with inheriting BaseAgentWithUnits. Then you can define easier because you can use the following properties.
| name | type/return value | explanation |
|---|---|---|
| i2u | BaseImage2Unit | The instance of Image2Unit. |
| s2u | BaseSpeech2Unit | The instance of Speech2Unit. |
| u2s | BaseUnit2Speech | The instance of Unit2Speech. |
Also, you have to define the trainer that inherits BaseTrainerWithUnits, that has following functions.
| name | type/return value | explanation |
|---|---|---|
| train_i2u | None | Please define if you train I2U model. |
| train_u2s | None | Please define if you train U2S model. |
Task json file should be named as xx.json, and place it in task folder (e.g. in tasks/food_task for food_task).
Here xx is the value that is specified on task_config_name in task_config.json.
Task file can be structured as follows:
{
"initial_scene_id": 0,
"items": [
{
"id": 1,
"name": "<item name 1>",
"attributes": {}
},
...
],
"scenes": [
{
"id": 0,
"name": "<scene name one>",
"role_description": "<description of role>",
"speaker_description": "<description of speaker>",
"items": [<IDs of items that can be taken on this scene>],
"possible_next_scenes": [<IDs of scenes that it can transit from this scene to>],
"system_guidelines": "<Description that how simulator gives items or transits the scenes>"
},
...
]
}As the above, please specify the following values.
initial_scene_iditemsscenes
On initial_scene_id, please specify the id of the scene of the first step. We will explain scene later.
On items, please register the items that will appear on this dialogue, with list format.
You will specify the objects (often called "item") that has the following values in items.
| name | value |
|---|---|
| id | Integer that identifies the item. |
| name | Name of the item. |
| attributes | Attributes you want to have the item to take. |
| prompts (optional) | The image for this item will be generated based on this prompt. On default, the prompt is "A {name} on a white background, uncooked, realistic.", on {name} is the name of this item. |
On scenes, please register the scene that will appear on this dialogue, with list format.
You will specify the objects (often called "scene") that has the following values in scenes.
| name | value |
|---|---|
| id | Integer that identifies the scene. |
| name | Name of the scene. |
| role_description | What role this dialogue simulator plays. |
| speaker_description | What characteristics of person replies by this dialogue simulator. To see what you can choice, please refer to https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md#american-english |
| items | Ids of items that might be treated on this scene. |
| possible_next_scenes | Ids that will be the next scene of this. From this value and the all of scenes, you can construct automaton. |
| system_guidelines | The prompt that explains what situation this dialogue simulator is on. This will be input to LLM and the dialogue simulator decides what to reply. |
You can refer to food_task.json or food_corner_task.json in tasks/food_task or tasks/food_corner_task as examples.
The name of config file can be arbitrary, but place it in experiments folder.
Config file can be structured as follows:
base:
task_name: "food_task"
agent_name: "sample01"
device: "cuda"
dir_id: "1"
sampling_rate: 22050
pretrain:
audio:
noise_num: 5
overwrite_audio_dataset: False
SNR: 30
model_name: "kokoro"
voice: "af_heart"
image:
train_images_per_folder: 30
test_images_per_folder: 10
overwrite_image_dataset: False
batch_size: 16
model_name: "Mann-E_Turbo"
model:
num_inference_steps: 10
guidance_scale: 8.0
height: 512
width: 512
rl:
env:
device: "cuda"
num_images_per_item: 10
enable_audio_response: False
record_train_env: False
record_eval_env: False
llm_prompt_name: "food_task.txt"
agent:
...The above keys are at least needed if you don't change files under env and main.py, so they might not be necessary if you modify the codes.
The explanation of each key is as follows:
| variable name | explanation |
|---|---|
| task_name | The folder name of task. |
| agent_name | The folder name of agent for training. |
| (base) device | The device for training. We recommend "cuda". |
| dir_id | The ID for the directory that the log is saved in. We recommend to set number like 1 or 2. |
| sampling_rate | Sampling rate that the agent uses. |
| noise_num | It represents how many patterns of sounds with noise will be generated for pretraining. |
| overwrite_audio_dataset | Whether the dataset is overwritten if it already exists. |
| SNR | Signal-to-Noise Ratio. |
| (pretrain->audio) model_name | Which model to generate audio. Now only "kokoro" is supported. |
| voice | Which voice will be used. If you specify "kokoro" as model_name, you can refer to https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md#american-english to decide which to use. |
| train_images_per_folder | How many images for training will be generated for each item. |
| test_images_per_folder | How many images for test will be generated for each item. |
| overwrite_image_datset | Whether the dataset is overwritten if it already exists. |
| batch_size | Image generation model will generate how many images for pretraining at once. |
| (pretrain->image) model_name | Which model to generate images. Now only "Mann-E_Turbo" is supported. |
| num_inference_steps | How many steps will be used for inference on diffusion model for image generation. |
| guidance_scale | How loyal the image generation model to generation prompt. |
| height | The height of the images. |
| width | The width of the images. |
| (rl->env) device | The device for training. We recommend "cuda". |
| num_images_per_item | On reinforcement learning how many images will be used on each item. |
| enable_audio_response | Whether we use audio input on reinforcement learning. If you want to use audio input with food_task or food_corner_task, please turn this to True and please remove the comment on observation["audio"] = ... in task_functions.py in task folder. |
| record_train_env | Whether it records dialogue history on training of reinforcement learning. |
| record_eval_env | Whether it records dialogue history on evaluation of reinforcement learning. |
| llm_prompt_name | The file of prompt given for environment LLM. For more details for this file, please refer to How to write prompt file below. |
Addition to these, you can contain parameters needed for the training of your agent. In the codes, each section and variables are corresponded as follows:
| The section if the whole config file is as config | variable in codes |
|---|---|
| config["base"] | base_config |
| config["pretrain"] | pretrain_config |
| config["pretrain"]["audio"] | pretrain_audio_config |
| config["pretrain"]["image"] | pretrain_image_config |
| config["rl"] | rl_config |
| config["rl"]["env"] | rl_env_config |
| config["rl"]["agent"] | rl_agent_config |
If you want to refer to the parameters, please add mappings on the proper section and read them.
On this simulator, LLM judges all of transitions of the scenes and how to give the items on each task.
Therefore, we specify how LLM judges by prompt on each task. On constructing task, please write that file and save it under env/prompts.
The format of the prompt file can be as follows:
You are {role_description}. Generate a response based on the following guidelines to interact with the customer effectively.
Scene Information:
{scene_information}
Available Items:
{available_items_text}
Note: Only offer the listed items if the customer specifically and clearly chooses one, using correct grammar. Do not offer any other items, even if the customer requests something else or mentions the item indirectly.
Scene Transitions:
{transitions}
Guidelines:
{guidelines}
Response Format:
Your response should follow this JSON format:
{{
"response_text": "text of response",
"next_scene_id": scene ID (or null if staying in the current scene),
"next_scene_reason": "reason for transition (or null if staying in the current scene)",
"item_name": "name of the food item offered (or null if no item is offered)"
}}Here we can use placefolder with using brackets. Each has following meaning.
| name | explanation |
|---|---|
| role_description | role_description in task json file will be assigned. |
| scene_information | The text that unites all of scenes. |
| available_items_text | All items that can be offered now. |
| transitions | The text that explains how it can transit scenes. |
| guidelines | system_guidelines in task json file will be assigned. |
EnvironmentState class contains the environment state.
This class contains the following get-only properties.
| name | type | explanation |
|---|---|---|
| dialogue_scene | DialogueScene | Contains dialogue scene. |
| dialogue_feedback | Optional[DialogueFeedback] | Contains dialogue feedback. If the dialogue isn't held, this contains None. |
| internal_state | numpy.ndarray | Internal state at this time. |
DialogueScene class contains the dialogue scene at some point.
This class contains the following get-only properties.
| name | type | explanation |
|---|---|---|
| scene_id | int | The scene id on this point. The id is same as on the task defined file. |
| prompt_waveform | numpy.ndarray | The utterance from the environment. |
| images | list[numpy.ndarray] | The images proposed by the environment. |
DialogueFeedback class contains the feedback on the dialogue.
This class contains the following get-only properties.
| name | type | explanation |
|---|---|---|
| selected_item | Optional[FeedbackItem] | The item proposed on this dialogue. If nothing was given, this contains None. |
| response_waveform | numpy.ndarray | The utterance from the environment based on the agent's utterance. |
FeedbackItem class represents the item given as the result of the dialogue.
This class contains the following get-only properties.
| name | type | explanation |
|---|---|---|
| id | int | Represents item ID. Item ID is the same as on the task defined file. |
| image | Optional[numpy.ndarray] | The image itself. |
| attributes | dict[str, Any] | attributes defined on this item on the task defined file. |
Copyright (C) 2025- Shinozaki Lab Science Tokyo
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
LLM: Phi-4-mini-instruct
ASR: Whisper-large-v3-turbo
TTS: Kokoro-82M
Image generator: Stable diffusion + Mann-E_Turbo