Spolacq-GDS: Spoken Language Acquisition Generative Dialogue Simulator

Overview

This is the official implementation of Spolacq-GDS.

About author

Preparation

Requires Python >= 3.10 (3.10.15 is recommended). We use pyenv and poetry to create the environment.

Linux/macOS

poetry install
source <env_name>/bin/activate

Windows

poetry install
./<env_name>/Scripts/activate

Usage

Run the following command in your virtual environment.

python main.py --experiment_config_name yy.yaml

Replace yy with the name of your config file (YAML); training then starts. The sample configs food_task.yaml and food_corner_task.yaml are provided. An explanation of food_task and food_corner_task is given here.

To change the agent or task used for training, change agent_name or task_name in the config file (YAML). For details on how to write it, see How to write config file(yaml) below.

Introduction

Spolacq-GDS is a multimodal dialogue simulator that controls scene transitions with an automaton and generates natural dialogue actions with generative models in each scene. It generates data with an image generator and an audio generator. It interprets audio, generates responses, transitions between states, and gives items using an LLM, ASR, and the audio generator. The simulator is structured as shown in the following figure:

(DESM: DialogueEnvironmentStateManger)

You use this simulator by defining a dialogue task (a dialogue definition file) and an agent. From the definition file, it automatically generates the automaton and the data for the scenario, uses them to act as the dialogue partner, and so provides an environment for dialogue reinforcement learning.

The simulator mainly targets agents and tasks that take images and, optionally, audio as input and produce audio as output. To use other modalities, define wrappers by inheriting from the Wrapper class provided by Gymnasium.

The simulator ships with a predefined sample task, food_task, and an agent (Continuous Action Space-Based Spoken Language Acquisition Agent Using Residual Sentence Embedding and Transformer Decoder) that can handle that task. This agent receives images and outputs audio. If your agent also needs audio as input, you can include it in the observation or preprocess it without changing the implementation of the environment.

Sample task code is in the tasks directory, and sample dialogue agent code is in the agents directory.

Define tasks

This section explains how to define a task. First, decide on a name for the task and create a directory with that name. Then create the following files in it.

xx.json
task_functions.py
utterance_patterns.json
task_config.json

xx can be any name.

You can also add the following file.

task_wrappers.py

`xx.json`

Define the task in this file. For details on how to write it, see How to write task json file below.

`task_functions.py`

In this file, define the functions needed for dialogue learning. Specifically, the following functions must be defined.

Function	Return value	Role
reward_function	float	Returns the reward based on the previous and the next environment state.
termination_function	bool	Returns whether the episode terminates, based on the previous and the next environment state.
truncation_function	bool	Returns whether the episode is truncated, based on the next environment state.
internal_state_update_function	numpy.ndarray	Returns the next internal state, based on the internal state at this step and the dialogue feedback.
observation_function	dict[str, numpy.ndarray]	Returns the observation, based on the next environment state.
initial_internal_state_function	numpy.ndarray	Returns the internal state used when the environment resets.

An error occurs if any of the functions above is not defined.

For the EnvironmentState class, which is one of the arguments of these functions, see the API Reference section below.
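
A minimal sketch of three of these functions follows. The parameter names (`prev_state`, `next_state`), the stand-in dataclasses, and the `"tasty"` attribute are assumptions made for illustration; the real `EnvironmentState` is provided by the simulator, and the three functions that return `numpy.ndarray` values are omitted here.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


# Stand-in types that mirror only some properties listed in the API Reference.
# They are hypothetical substitutes so the sketch runs without the simulator.
@dataclass(frozen=True)
class FeedbackItem:
    id: int
    attributes: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class DialogueFeedback:
    selected_item: Optional[FeedbackItem] = None


@dataclass(frozen=True)
class EnvironmentState:
    dialogue_feedback: Optional[DialogueFeedback] = None


def _selected_item(state: EnvironmentState) -> Optional[FeedbackItem]:
    """Return the item handed over in this step, or None if there was none."""
    feedback = state.dialogue_feedback
    return None if feedback is None else feedback.selected_item


def reward_function(prev_state: EnvironmentState, next_state: EnvironmentState) -> float:
    # Assumed rule: reward 1.0 when the received item has a truthy "tasty" attribute.
    item = _selected_item(next_state)
    if item is None:
        return 0.0
    return 1.0 if item.attributes.get("tasty") else 0.0


def termination_function(prev_state: EnvironmentState, next_state: EnvironmentState) -> bool:
    # Assumed rule: the episode ends as soon as any item has been handed over.
    return _selected_item(next_state) is not None


def truncation_function(next_state: EnvironmentState) -> bool:
    # Assumed rule: this sketch never truncates an episode.
    return False
```

Both `dialogue_feedback` and `selected_item` may be `None`, so the helper checks each level before reading attributes.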

`utterance_patterns.json`

In this file, define the utterances used for pretraining. You can write sentences suited to each scene.

The format of this file is as follows:

{
  "scene": {
    "japanese_food_corner":"food_corner",
    "western_food_corner":"food_corner"
  },
  "food_corner": [
    "{name}",
    "i want to go to the {name}",
    "let's go to the {name}",
    "i'd like to visit the {name}",
    "take me to the {name}"
  ],
  "else": [
    "{name}",
    "i want {name}",
    "i would like to have {name}",
    "i'll take {name}",
    "give me {name}"
  ]
}

As shown above, if you assign an identifier (here food_corner) to a scene under scene, the audio for that scene is generated from the utterance patterns listed under that identifier. For scenes without an identifier, the audio is generated from the utterance patterns under else.

The placeholder {name} is replaced by the name of the item.
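
The lookup described above can be sketched as follows. This is an illustration only, not the simulator's code: the function name `expand_utterances` is hypothetical, and it assumes the keys under `scene` are scene names.

```python
import json

PATTERNS_JSON = """
{
  "scene": {"japanese_food_corner": "food_corner"},
  "food_corner": ["{name}", "take me to the {name}"],
  "else": ["{name}", "i want {name}"]
}
"""


def expand_utterances(patterns: dict, scene_name: str, item_name: str) -> list[str]:
    """Pick the pattern list for a scene and fill in the {name} placeholder."""
    # Scenes listed under "scene" use their identifier; all others fall back to "else".
    identifier = patterns.get("scene", {}).get(scene_name, "else")
    return [template.replace("{name}", item_name) for template in patterns[identifier]]


patterns = json.loads(PATTERNS_JSON)
print(expand_utterances(patterns, "japanese_food_corner", "sushi corner"))
print(expand_utterances(patterns, "kitchen", "apple"))
```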

`task_config.json`

Finally, add task_config.json to the folder. This file has the following format:

{
  "env_name": "FoodTask-v0",
  "make_env": "tasks.food_task.task_wrappers:wrap_task_env",
  "max_step": 2,
  "task_config_name": "food_task_prompt.json"
}

The keys have the following meanings:

key	explanation
env_name	The name of the task in Gymnasium.
make_env (optional)	The function to call if you wrap the environment, for example with the `task_wrappers.py` described below.
max_step	The maximum number of steps in a dialogue.
task_config_name	The file name of the `xx.json` described above.
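
The `make_env` value uses a `module:function` form. One way such a string can be resolved is sketched below; how the simulator itself resolves it is not shown in this README, so `resolve_entry_point` and `load_make_env` are hypothetical names.

```python
import importlib
import json
from typing import Callable, Optional


def resolve_entry_point(entry_point: str) -> Callable:
    """Import "package.module:function" and return the function object."""
    module_name, _, attr_name = entry_point.partition(":")
    if not module_name or not attr_name:
        raise ValueError(f"expected 'module:function', got {entry_point!r}")
    return getattr(importlib.import_module(module_name), attr_name)


def load_make_env(task_config_text: str) -> Optional[Callable]:
    """Return the wrapper factory named by make_env, or None if the key is absent."""
    task_config = json.loads(task_config_text)
    entry_point = task_config.get("make_env")
    return None if entry_point is None else resolve_entry_point(entry_point)


# "json:dumps" stands in for "tasks.food_task.task_wrappers:wrap_task_env"
# so that the sketch runs outside the repository.
demo_config = '{"env_name": "FoodTask-v0", "make_env": "json:dumps", "max_step": 2}'
make_env = load_make_env(demo_config)
print(make_env.__name__)
```

Because `make_env` is optional, the loader returns `None` when the key is missing, and the caller can then use the unwrapped environment.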

`task_wrappers.py`

You can add task_wrappers.py to wrap the environment with a Gymnasium Wrapper. If you do, add the following code to task_wrappers.py after the definition of the wrapper.

def wrap_task_env(base_env):
    wrapped_env = YourOriginalWrapper(base_env)
    return wrapped_env

Define agent

This section explains how to define an agent that learns dialogue. Create a directory named after the agent under agents, then create the following files in it.

agent.py
trainer.py
agent_config.json

Define the agent in agent.py and the trainer in trainer.py, then declare the classes to use in agent_config.json. For example:

{
    "agent_class": "AgentWithUnits",
    "trainer_class": "Trainer"
}

Here you specify the class names of the agent and the trainer.

From here, the implementation differs depending on whether you use units for language or images (such as audio representations or image representations).

Agents without using units

The simplest and most flexible way to train a dialogue agent is to write an agent that inherits from BaseAgent. BaseAgent provides the API needed for dialogue learning.

name	return value	explanation
action2speech	numpy.ndarray	Converts a reinforcement learning action to audio data. This function is not used unless you define a Wrapper.

The trainer is likewise simplest to define by inheriting from BaseTrainer. BaseTrainer has the following API.

name	return value	explanation
pretrain	None	Define this function if you have modules that need to be pretrained.
train_rl	None	Runs the reinforcement learning procedure for the agent.

See sample01 under agents for an example implementation. It is the agent proposed in Continuous Action Space-Based Spoken Language Acquisition Agent Using Residual Sentence Embedding and Transformer Decoder.

Agents with using units

The files to define are the same, but if you use units, you can define the agent by inheriting from BaseAgentWithUnits. This makes the definition easier because the following properties are available.

name	type/return value	explanation
i2u	BaseImage2Unit	The instance of Image2Unit.
s2u	BaseSpeech2Unit	The instance of Speech2Unit.
u2s	BaseUnit2Speech	The instance of Unit2Speech.

You also have to define a trainer that inherits from BaseTrainerWithUnits, which has the following functions.

name	type/return value	explanation
train_i2u	None	Define this if you train the I2U model.
train_u2s	None	Define this if you train the U2S model.

How to write task json file

Name the task json file xx.json and place it in the task folder (e.g. in tasks/food_task for food_task). The file name must match the value specified for task_config_name in task_config.json.

The task file is structured as follows:

{
  "initial_scene_id": 0,
  "items": [
    {
      "id": 1,
      "name": "<item name 1>",
      "attributes": {}
    },
    ...
  ],
  "scenes": [
    {
      "id": 0,
      "name": "<scene name one>",
      "role_description": "<description of role>",
      "speaker_description": "<description of speaker>",
      "items": [<IDs of items that can be taken on this scene>],
      "possible_next_scenes": [<IDs of scenes that it can transit from this scene to>],
      "system_guidelines": "<Description that how simulator gives items or transits the scenes>"
    },
    ...
  ]
}

As shown above, specify the following values.

initial_scene_id
items
scenes

For initial_scene_id, specify the id of the scene at the first step. Scenes are explained below.

For items, list the items that appear in this dialogue. Each entry is an object (called an "item") with the following keys.

name	value
id	Integer that identifies the item.
name	Name of the item.
attributes	Attributes you want the item to have.
prompts (optional)	The image for this item is generated from this prompt. The default prompt is "A {name} on a white background, uncooked, realistic.", where `{name}` is the name of this item.

For scenes, list the scenes that appear in this dialogue. Each entry is an object (called a "scene") with the following keys.

name	value
id	Integer that identifies the scene.
name	Name of the scene.
role_description	The role this dialogue simulator plays.
speaker_description	The characteristics of the speaker the dialogue simulator replies as. For the available choices, see https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md#american-english
items	Ids of the items that may be handled in this scene.
possible_next_scenes	Ids of the scenes that can follow this one. The automaton is constructed from this value across all scenes.
system_guidelines	The prompt that describes the situation the dialogue simulator is in. It is given to the LLM, which then decides how the simulator replies.

See food_task.json in tasks/food_task or food_corner_task.json in tasks/food_corner_task for examples.
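
As an illustration of how `possible_next_scenes` defines the automaton, the sketch below builds a transition table from a task file and checks that every referenced id exists. It is not the simulator's implementation: `build_transitions`, the item names, and the scene names are hypothetical, and the description fields are left out for brevity.

```python
import json

TASK_JSON = """
{
  "initial_scene_id": 0,
  "items": [
    {"id": 1, "name": "apple", "attributes": {}},
    {"id": 2, "name": "bread", "attributes": {}}
  ],
  "scenes": [
    {"id": 0, "name": "entrance", "items": [], "possible_next_scenes": [1]},
    {"id": 1, "name": "shop", "items": [1, 2], "possible_next_scenes": []}
  ]
}
"""


def build_transitions(task: dict) -> dict[int, list[int]]:
    """Map each scene id to the scene ids reachable from it, validating references."""
    item_ids = {item["id"] for item in task["items"]}
    scene_ids = {scene["id"] for scene in task["scenes"]}
    if task["initial_scene_id"] not in scene_ids:
        raise ValueError("initial_scene_id does not match any scene")

    transitions: dict[int, list[int]] = {}
    for scene in task["scenes"]:
        unknown_items = set(scene["items"]) - item_ids
        unknown_scenes = set(scene["possible_next_scenes"]) - scene_ids
        if unknown_items or unknown_scenes:
            raise ValueError(f"scene {scene['id']} references undefined ids")
        transitions[scene["id"]] = list(scene["possible_next_scenes"])
    return transitions


print(build_transitions(json.loads(TASK_JSON)))
```

Running a check like this before training can catch a mistyped id early, before any images or audio are generated.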

How to write config file(yaml)

The config file can have any name, but it must be placed in the experiments folder.

The config file is structured as follows:

base:
  task_name: "food_task"
  agent_name: "sample01"
  device: "cuda"
  dir_id: "1"
  sampling_rate: 22050


pretrain:
  audio:
    noise_num: 5
    overwrite_audio_dataset: False
    SNR: 30
    model_name: "kokoro"
    voice: "af_heart"
  image:
    train_images_per_folder: 30
    test_images_per_folder: 10
    overwrite_image_dataset: False
    batch_size: 16
    model_name: "Mann-E_Turbo"
    
    model:
      num_inference_steps: 10
      guidance_scale: 8.0
      height: 512
      width: 512

rl:
  env:
    device: "cuda"
    num_images_per_item: 10
    enable_audio_response: False
    record_train_env: False
    record_eval_env: False
    llm_prompt_name: "food_task.txt"

  agent:
    ...

The keys above are the minimum required as long as you do not change main.py or the files under env; if you modify that code, some of them may become unnecessary. Each key is explained below:

variable name	explanation
task_name	The folder name of the task.
agent_name	The folder name of the agent to train.
(base) device	The device for training. We recommend "cuda".
dir_id	The ID of the directory in which logs are saved. We recommend a number such as 1 or 2.
sampling_rate	The sampling rate that the agent uses.
noise_num	The number of noise-added variations of each sound generated for pretraining.
overwrite_audio_dataset	Whether the dataset is overwritten if it already exists.
SNR	Signal-to-noise ratio.
(pretrain->audio) model_name	The model used to generate audio. Currently only "kokoro" is supported.
voice	The voice to use. If you specify "kokoro" as `model_name`, see https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md#american-english to decide which to use.
train_images_per_folder	The number of training images generated for each item.
test_images_per_folder	The number of test images generated for each item.
overwrite_image_dataset	Whether the dataset is overwritten if it already exists.
batch_size	The number of pretraining images the image generation model generates at once.
(pretrain->image) model_name	The model used to generate images. Currently only "Mann-E_Turbo" is supported.
num_inference_steps	The number of inference steps of the diffusion model used for image generation.
guidance_scale	How closely the image generation model follows the generation prompt.
height	The height of the images.
width	The width of the images.
(rl->env) device	The device for training. We recommend "cuda".
num_images_per_item	The number of images used for each item in reinforcement learning.
enable_audio_response	Whether audio input is used in reinforcement learning. To use audio input with food_task or food_corner_task, set this to True and uncomment `observation["audio"] = ...` in `task_functions.py` in the task folder.
record_train_env	Whether the dialogue history is recorded during reinforcement learning training.
record_eval_env	Whether the dialogue history is recorded during reinforcement learning evaluation.
llm_prompt_name	The prompt file given to the environment LLM. For details, see `How to write prompt file` below.

In addition to these, you can include parameters needed for training your agent. In the code, the sections correspond to variables as follows:

Section (with the whole config file loaded as config)	Variable in code
config["base"]	base_config
config["pretrain"]	pretrain_config
config["pretrain"]["audio"]	pretrain_audio_config
config["pretrain"]["image"]	pretrain_image_config
config["rl"]	rl_config
config["rl"]["env"]	rl_env_config
config["rl"]["agent"]	rl_agent_config

To use your own parameters, add them to the appropriate section and read them from the corresponding variable.
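
The correspondence in the table can be sketched with a plain dictionary standing in for the parsed YAML file. `learning_rate` is a hypothetical agent parameter added for illustration; it is not a key defined by the simulator.

```python
from typing import Any

# A trimmed config as it might look after parsing the YAML file.
# "learning_rate" is a hypothetical agent parameter added for illustration.
config: dict[str, Any] = {
    "base": {"task_name": "food_task", "agent_name": "sample01"},
    "pretrain": {"audio": {"SNR": 30}, "image": {"batch_size": 16}},
    "rl": {"env": {"num_images_per_item": 10}, "agent": {"learning_rate": 1e-4}},
}

base_config = config["base"]
pretrain_config = config["pretrain"]
pretrain_audio_config = pretrain_config["audio"]
pretrain_image_config = pretrain_config["image"]
rl_config = config["rl"]
rl_env_config = rl_config["env"]
rl_agent_config = rl_config["agent"]

# Custom agent parameters are read from the section they were added to.
learning_rate = rl_agent_config.get("learning_rate", 3e-4)
print(base_config["task_name"], learning_rate)
```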

How to write prompt file

In this simulator, the LLM decides all scene transitions and how items are given in each task. The prompt therefore specifies, for each task, how the LLM makes these decisions. When you build a task, write this file and save it under env/prompts.

The prompt file has the following format:

You are {role_description}. Generate a response based on the following guidelines to interact with the customer effectively.

Scene Information:
{scene_information}

Available Items:
{available_items_text}
Note: Only offer the listed items if the customer specifically and clearly chooses one, using correct grammar. Do not offer any other items, even if the customer requests something else or mentions the item indirectly.

Scene Transitions:
{transitions}

Guidelines:
{guidelines}

Response Format:
Your response should follow this JSON format:
{{
"response_text": "text of response", 
"next_scene_id": scene ID (or null if staying in the current scene),
"next_scene_reason": "reason for transition (or null if staying in the current scene)", 
"item_name": "name of the food item offered (or null if no item is offered)"
}}

Placeholders are written in braces. Each has the following meaning.

name	explanation
role_description	Replaced with `role_description` from the task json file.
scene_information	Text that combines the information of all scenes.
available_items_text	All items that can be offered in the current scene.
transitions	Text that explains which scene transitions are possible.
guidelines	Replaced with `system_guidelines` from the task json file.
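
The doubled braces around the JSON response format suggest that the template is filled with Python's `str.format`, where `{{` and `}}` produce literal braces. The sketch below shows that behavior on a shortened template; that the simulator uses `str.format` is an assumption, and the filled-in values are made up.

```python
PROMPT_TEMPLATE = """You are {role_description}.

Available Items:
{available_items_text}

Guidelines:
{guidelines}

Response Format:
{{
"response_text": "text of response",
"item_name": "name of the food item offered (or null if no item is offered)"
}}
"""

prompt = PROMPT_TEMPLATE.format(
    role_description="a clerk at a food stand",
    available_items_text="- apple\n- bread",
    guidelines="Offer an item only when the customer names it clearly.",
)
print(prompt)
```

If you add literal braces elsewhere in your own prompt file, they would need the same doubling under this assumption.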

API Reference

`EnvironmentState` class

The EnvironmentState class holds the environment state. It has the following read-only properties.

name	type	explanation
dialogue_scene	DialogueScene	The dialogue scene.
dialogue_feedback	Optional[DialogueFeedback]	The dialogue feedback. `None` if no dialogue has taken place.
internal_state	numpy.ndarray	The internal state at this time.

`DialogueScene` class

The DialogueScene class holds the dialogue scene at a given point. It has the following read-only properties.

name	type	explanation
scene_id	int	The id of the scene at this point. It is the same id as in the task definition file.
prompt_waveform	numpy.ndarray	The utterance from the environment.
images	list[numpy.ndarray]	The images presented by the environment.

`DialogueFeedback` class

The DialogueFeedback class holds the feedback on the dialogue. It has the following read-only properties.

name	type	explanation
selected_item	Optional[FeedbackItem]	The item given in this dialogue. `None` if nothing was given.
response_waveform	numpy.ndarray	The utterance from the environment in response to the agent's utterance.

`FeedbackItem` class

The FeedbackItem class represents the item given as the result of the dialogue. It has the following read-only properties.

name	type	explanation
id	int	The item ID. It is the same ID as in the task definition file.
image	Optional[numpy.ndarray]	The image of the item.
attributes	dict[str, Any]	The `attributes` defined for this item in the task definition file.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Credits

LLM: Phi-4-mini-instruct
ASR: Whisper-large-v3-turbo
TTS: Kokoro-82M
Image generator: Stable Diffusion + Mann-E_Turbo
