CLEVR3D questions are generated using the script generate_questions.py, which is expected to be run from
the question_generation directory.
This script reads a JSON file 3dssg_scenes.json containing information about scenes and outputs a JSON file containing questions, functional programs, and answers for those scenes. In most cases the script will be invoked
like this:
python generate_questions.py --input_scene_file $INPUT_FILE --output_questions_file $OUTPUT_FILE
Question generation has no dependencies other than Python itself. The code was developed on Python 3.7, but should also work on Python 2.7.
Questions are generated by instantiating question templates; the question templates used for our CVPR paper can be found in the directory templates. Each file in this directory contains several related templates.
By default, generate_questions.py will generate questions for all scenes in the input file. However, you can generate questions
for only a subset of scenes using the --scene_start_idx and --num_scenes flags: the former gives the index at which to
start generating questions, and the latter gives the number of scenes for which questions should be generated.
These flags can be useful for distributing question generation among many workers.
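As a rough sketch of how the work might be split across workers, the helper below divides the scenes into contiguous chunks and builds one command line per chunk. The function name shard_commands and the per-worker output file naming are assumptions made for illustration; only the four flags come from the description above.

```python
import math


def shard_commands(total_scenes, num_workers, input_file, output_prefix):
    """Build one generate_questions.py command per contiguous chunk of scenes.

    shard_commands and the "<prefix>_<worker>.json" naming are hypothetical;
    the flags themselves are the ones documented above.
    """
    if total_scenes <= 0 or num_workers <= 0:
        return []
    chunk = math.ceil(total_scenes / num_workers)
    commands = []
    for worker, start in enumerate(range(0, total_scenes, chunk)):
        count = min(chunk, total_scenes - start)  # last chunk may be shorter
        commands.append([
            "python", "generate_questions.py",
            "--input_scene_file", input_file,
            "--output_questions_file", f"{output_prefix}_{worker}.json",
            "--scene_start_idx", str(start),
            "--num_scenes", str(count),
        ])
    return commands


# Example: 10 scenes over 3 workers gives chunks of 4, 4 and 2 scenes.
for cmd in shard_commands(10, 3, "3dssg_scenes.json", "questions"):
    print(" ".join(cmd))
```

Each command writes to its own output file, so the per-worker results would still need to be merged afterwards.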
The flag --templates_per_image (default 100) is the number of templates that we will aim to instantiate for every scene, and
the flag --instances_per_template gives the number of instantiations we will try to find per template. (Note that setting --instances_per_template to a value greater than 1 may cause bugs, so in practice it is set to 1 and --templates_per_image is increased instead.)
In total the number of questions per scene will be at most the product of --templates_per_image and --instances_per_template; some scenes may
have slightly fewer questions if no valid template instantiations can be found.
Each question template consists of four components:
- One or more parameters, each with a type and a name. Instantiating the template amounts to choosing a value for each of these parameters; parameters may be given a NULL value
- One or more text templates that give a natural-language representation of the question
- A program template consisting of a sequence of nodes; each node in the program template may expand to multiple functions in the final program instantiated from the template
- Zero or more constraints restricting the values that the parameters are allowed to take
Here is an example template:
{
"params": [
{"type": "Size", "name": "<Z>"},
{"type": "Color", "name": "<C>"},
{"type": "Material", "name": "<M>"},
{"type": "Shape", "name": "<S>"},
{"type": "Label", "name": "<L>"},
{"type": "Relation", "name": "<R>"},
{"type": "Size", "name": "<Z2>"},
{"type": "Color", "name": "<C2>"},
{"type": "Material", "name": "<M2>"},
{"type": "Shape", "name": "<S2>"},
{"type": "Label", "name": "<L2>"}
],
"text": [
"What size is the <Z2> <C2> <M2> <S2> <L2> [that is] <R> the <Z> <C> <M> <S> <L>?",
"What is the size of the <Z2> <C2> <M2> <S2> <L2> [that is] <R> the <Z> <C> <M> <S> <L>?",
"How big is the <Z2> <C2> <M2> <S2> <L2> [that is] <R> the <Z> <C> <M> <S> <L>?",
"There is a <Z2> <C2> <M2> <S2> <L2> [that is] <R> the <Z> <C> <M> <S> <L>; what size is it?",
"There is a <Z2> <C2> <M2> <S2> <L2> [that is] <R> the <Z> <C> <M> <S> <L>; how big is it?",
"There is a <Z2> <C2> <M2> <S2> <L2> [that is] <R> the <Z> <C> <M> <S> <L>; what is its size?"
],
"nodes": [
{"type": "scene", "inputs": []},
{"type": "filter_unique", "inputs": [0], "side_inputs": ["<Z>", "<C>", "<M>", "<S>", "<L>"]},
{"type": "relate_filter_unique", "inputs": [1], "side_inputs": ["<R>", "<Z2>", "<C2>", "<M2>", "<S2>", "<L2>"]},
{"type": "query_size", "inputs": [2]}
],
"constraints": [
{"type": "NULL", "params": ["<Z2>"]}
]
}
The special file 3dssg_metadata.json defines the simple functional programming language used to construct programs and
program templates.
Each template parameter has a type and a name; the allowed types are Size, Color, Material, Shape, Label and Relation.
The allowed values for each of these types are stored in 3dssg_metadata.json; in addition to the values defined there, each
non-Relation template parameter may also be assigned the value NULL.
By convention, Size parameters are called <Z>, <Z2>, <Z3>, etc.; similarly Color parameters are called <C>,
Material parameters are called <M>, Shape parameters are called <S>, Label parameters are called <L> and Relation parameters are called <R>.
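A quick sanity check on a template's params list could look like the sketch below. The helper check_params is hypothetical and not part of generate_questions.py; it only verifies that each parameter uses one of the six types listed above and that names are not repeated.

```python
# Allowed parameter types, as listed in the text above.
ALLOWED_TYPES = {"Size", "Color", "Material", "Shape", "Label", "Relation"}


def check_params(template):
    """Return a list of human-readable problems found in template["params"].

    Hypothetical helper for illustration only.
    """
    problems = []
    seen = set()
    for param in template.get("params", []):
        ptype, name = param.get("type"), param.get("name")
        if ptype not in ALLOWED_TYPES:
            problems.append(f"{name}: unknown type {ptype!r}")
        if name in seen:
            problems.append(f"{name}: duplicate parameter name")
        seen.add(name)
    return problems


example = {"params": [{"type": "Size", "name": "<Z>"},
                      {"type": "Colour", "name": "<C>"}]}
print(check_params(example))  # ["<C>: unknown type 'Colour'"]
```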
Each question template defines one or more text templates which give different ways of expressing the question in
natural language. Text templates must use all of the template parameters. After values have been chosen for all template
parameters, a natural language version of the question is generated by randomly choosing one of the text templates and
replacing the parameter names with their values. Parameters whose value is NULL are replaced with the empty string; parameters of type Label must always have a value.
To increase linguistic diversity, the file 3dssg_synonyms.json defines a set of synonyms for template parameter values,
e.g. "small" is a synonym for "tiny". When instantiating templates, values are randomly replaced by synonyms.
Text templates can also have optional segments; any text surrounded by brackets will be removed with probability 0.5 during
template instantiation. In the example above, the substring "that is" is optional in all text templates.
Finally, there are some special-case heuristics to replace the word "other" with "another", "a", or the empty string
in some circumstances to try to minimize ambiguity.
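The text instantiation steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in generate_questions.py: the function instantiate_text is hypothetical, NULL is represented as None, the synonyms argument is assumed to map a value to a list of alternatives, and the "other"/"another" heuristics are left out.

```python
import random
import re


def instantiate_text(text_templates, values, synonyms=None, rng=random):
    """Turn one randomly chosen text template into a question string.

    values maps parameter names such as "<Z>" to a string, or to None for NULL.
    synonyms (assumed format) maps a value to a list of alternative wordings.
    """
    synonyms = synonyms or {}
    text = rng.choice(text_templates)

    # Optional segments: drop each "[...]" span with probability 0.5,
    # otherwise keep its contents without the brackets.
    text = re.sub(r"\[([^\]]*)\]",
                  lambda m: m.group(1) if rng.random() < 0.5 else "",
                  text)

    for name, value in values.items():
        if value is None:
            replacement = ""  # NULL parameters become the empty string
        else:
            replacement = rng.choice([value] + list(synonyms.get(value, [])))
        text = text.replace(name, replacement)

    # Collapse the extra spaces left behind by removed pieces.
    return " ".join(text.split())


templates = ["What size is the <Z2> <C2> <L2> [that is] <R> the <Z> <L>?"]
values = {"<Z2>": None, "<C2>": "red", "<L2>": "chair",
          "<R>": "left of", "<Z>": None, "<L>": "table"}
print(instantiate_text(templates, values))
```

Depending on the random draw, the example prints the question either with or without "that is".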
A program template is defined as a sequence of nodes; each node receives input from zero or more other nodes, and produces
an output; this sequence is expected to be sorted topologically in the template. The inputs to each node are identified by
the inputs field of a node, which is a list of integers indexing into the node sequence. A node in a program template may expand
to more than one node in the program instantiated from the template.
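The topological-ordering requirement means that every index in a node's inputs list must point at an earlier node. A minimal check, using a hypothetical helper check_node_order that is not part of the repository, might look like this:

```python
def check_node_order(nodes):
    """Raise ValueError if any node reads from itself or from a later node."""
    for idx, node in enumerate(nodes):
        for inp in node.get("inputs", []):
            if not 0 <= inp < idx:
                raise ValueError(
                    f"node {idx} ({node['type']}) has invalid input index {inp}"
                )
    return True


# The "nodes" list from the example template above.
example_nodes = [
    {"type": "scene", "inputs": []},
    {"type": "filter_unique", "inputs": [0],
     "side_inputs": ["<Z>", "<C>", "<M>", "<S>", "<L>"]},
    {"type": "relate_filter_unique", "inputs": [1],
     "side_inputs": ["<R>", "<Z2>", "<C2>", "<M2>", "<S2>", "<L2>"]},
    {"type": "query_size", "inputs": [2]},
]
print(check_node_order(example_nodes))  # True
```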
Each node has a type, such as scene or filter_color; the 3dssg_metadata.json file defines the full list of available node types, as well as the input and output types for
each node type.
In addition to receiving inputs from earlier nodes, some nodes also receive side inputs (also called value inputs
in some places); these are literal values of some type. The number and types of expected side inputs for all node types are
also listed in the 3dssg_metadata.json file.
As a concrete example, in the template above the first node has type scene; the 3dssg_metadata.json file gives us the following
information about this node type:
// From 3dssg_metadata.json
{
"name": "scene",
"inputs": [],
"output": "ObjectSet",
"terminal": false
}
This indicates that scene nodes receive no inputs, and output an ObjectSet; scene nodes receive no side inputs, and
cannot be the final node in a fully instantiated program since they are not terminal.
The next node in the sequence above has type filter_unique; since its inputs field is [0] it receives as input the output from
the previous scene node. The 3dssg_metadata.json file gives us the following information about this node type:
// From 3dssg_metadata.json
{
"name": "filter_unique",
"inputs": ["ObjectSet"],
"side_inputs": ["Size", "Color", "Material", "Shape"],
"output": "Object",
"terminal": false,
"template_only": true
}
Thus nodes of type filter_unique receive one input of type ObjectSet and five side inputs of types Size, Color,
Material, Shape, and Label (corresponding to parameters <Z>, <C>, <M>, <S>, <L> in the side_inputs field of the template
node), and produce an output of type Object. Again, this node is not terminal so it cannot be the final node of a
fully instantiated program. This node type is marked as template_only, indicating that it is only valid as part of
a program template and cannot be used in a fully instantiated program; during instantiation, template nodes of type
filter_unique will be replaced by a subsequence of filter_size, filter_color, filter_material, filter_shape, filter_label,
followed by a unique node. The use of special template-only nodes like this leads to more expressive templates, and also
allows us to more easily prune the search space during template instantiation.
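To make the expansion concrete, here is a simplified sketch of how a filter_unique template node could be turned into concrete nodes once parameter values are known. It assumes that a filter node is emitted only for non-NULL values (NULL shown as None) and that instantiated nodes keep the side_inputs field name; the real instantiation code searches over candidate values and may differ. The name expand_filter_unique is hypothetical.

```python
# Filter node types in the order given in the text above.
FILTER_ORDER = ["filter_size", "filter_color", "filter_material",
                "filter_shape", "filter_label"]


def expand_filter_unique(input_idx, side_values, next_idx):
    """Expand one filter_unique node into filter_* nodes plus a unique node.

    input_idx:   index of the node feeding the filter_unique template node
    side_values: values for Size, Color, Material, Shape, Label (None = NULL)
    next_idx:    index the first emitted node will have in the final program
    """
    nodes = []
    prev = input_idx
    for node_type, value in zip(FILTER_ORDER, side_values):
        if value is None:
            continue  # assumption: NULL parameters contribute no filter node
        nodes.append({"type": node_type, "inputs": [prev],
                      "side_inputs": [value]})
        prev = next_idx + len(nodes) - 1  # index of the node just emitted
    nodes.append({"type": "unique", "inputs": [prev]})
    return nodes


# Only Color and Label are set, so two filters are followed by unique.
for node in expand_filter_unique(0, [None, "red", None, None, "chair"], 1):
    print(node)
```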
Continuing with the example template above, the output from the filter_unique node is passed to another node of type
relate_filter_unique, which takes an input of type Object and six side inputs (<R>, <Z2>, <C2>, <M2>, <S2>, <L2> in the template), and produces an output of type Object.
This is another special template-only node type which will expand into a relate node followed by some subsequence of
filter_size, filter_color, filter_material, filter_shape, filter_label, followed by a unique node. The output
of the relate_filter_unique node is then passed to a node of type query_size, which takes an Object as input and
produces an output of type Size. This node type is terminal and is not template-only, so it will be the final node of both
the program template and all programs instantiated from that template.
Templates can define constraints on the values that template parameters are allowed to take; constraints can be necessary
to ensure that the question does not give away its answer. The example template above includes a constraint that the
parameter <Z2> must be NULL; without this constraint the template could produce questions such as "What size is the big
thing left of the table?" which can be trivially answered from the text of the question.
The following two constraint types are supported:
- NULL: The parameter must take the value NULL, as in the example above.
- OUT_NEQ: The outputs of the two specified nodes must have different values when the instantiated program is run. This is used for templates like "Are there an equal number of <Z> <C> <M> <S> <L>s and <Z2> <C2> <M2> <S2> <L2>?" to ensure that the two question subparts refer to different sets of objects, which avoids trivial questions like "Are there an equal number of spheres and balls?".
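A candidate instantiation could be checked against these two constraint types roughly as follows. This sketch assumes that NULL is represented as None, that an OUT_NEQ constraint stores two node indices in its params list, and that node_outputs holds the per-node results of running the instantiated program; satisfies_constraints is a hypothetical name.

```python
def satisfies_constraints(constraints, values, node_outputs):
    """Return True if a candidate instantiation passes every constraint.

    values:       maps parameter names such as "<Z2>" to a value or None (NULL)
    node_outputs: output of each program node, indexed by node position
                  (assumed to come from running the instantiated program)
    """
    for constraint in constraints:
        ctype, params = constraint["type"], constraint["params"]
        if ctype == "NULL":
            # Every listed parameter must be NULL.
            if any(values[name] is not None for name in params):
                return False
        elif ctype == "OUT_NEQ":
            # Assumed layout: params holds the two node indices to compare.
            first, second = params
            if node_outputs[first] == node_outputs[second]:
                return False
        else:
            raise ValueError(f"unsupported constraint type: {ctype}")
    return True


constraints = [{"type": "NULL", "params": ["<Z2>"]}]
print(satisfies_constraints(constraints, {"<Z2>": None}, []))   # True
print(satisfies_constraints(constraints, {"<Z2>": "big"}, []))  # False
```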
Questions are generated based on the original labels, and the labels will be mapped to 27 categories in the post-processing step. The mappings are in the file 3RScan.v2_Mapping.csv.