Generating Benchmark Configurations

Overview

pbench genconfig generates cluster configuration files from Go templates. It takes:

Parameter files (-p) — global settings (memory percentages, thresholds) that apply to all cluster sizes
Template directory (-t) — Go template files that define the output file structure
Output directory (positional arg) — directory tree to search for config.json files; each config.json defines one cluster size

For each config.json found, the parameters and cluster config are merged into a single flat map[string]any, then each template is executed against that map to produce the output files. After generation, any files in the output directories that were not generated (and are not config.json) are automatically removed. This ensures that deleting or renaming a template does not leave stale output files behind.
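The cleanup rule can be pictured with a small Go sketch. This is an illustration only, not pbench's actual code: `staleFiles` and its arguments are hypothetical names, and the paths are made up.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
)

// staleFiles returns the paths in existing that were not produced by the
// current run and are not config.json. Hypothetical helper for illustration.
func staleFiles(existing []string, generated map[string]bool) []string {
	var stale []string
	for _, p := range existing {
		if generated[p] || filepath.Base(p) == "config.json" {
			continue // keep freshly generated files and the cluster config
		}
		stale = append(stale, p)
	}
	sort.Strings(stale)
	return stale
}

func main() {
	existing := []string{
		"small/config.json",
		"small/coordinator/jvm.config",
		"small/coordinator/old-template.properties", // its template was deleted
	}
	generated := map[string]bool{"small/coordinator/jvm.config": true}
	fmt.Println(staleFiles(existing, generated))
	// [small/coordinator/old-template.properties]
}
```

Only the leftover file is reported; config.json survives even though no template produced it.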

Data Flow

params.json + overrides.json   (global parameters, merged left-to-right)
        │
        ▼
   merged params ─── + ─── config.json   (per-cluster values override params)
                     │
                     ▼
              map[string]any   (single flat map)
                     │
                     ▼
              .prelude template   (computes derived values, adds them to the map)
                     │
                     ▼
              template execution   (each template file renders using the map)
                     │
                     ▼
              output files   (config.properties, jvm.config, etc.)

CLI Reference

pbench genconfig [flags] [directory]

Flags:
  -p, --parameter-file stringArray   Parameter file (repeatable; later files override earlier ones)
  -t, --template-dir string          Template directory (default: built-in templates)
  -h, --help                         Help for genconfig

Subcommands:
  default     Print the built-in default parameter file

Examples

# Use built-in params and templates, generate for all clusters/ subdirectories
pbench genconfig clusters

# Explicit params and templates
pbench genconfig -p clusters/params.json -t clusters/templates clusters

# Stack parameter files (base + overrides)
pbench genconfig -p clusters/params.json -p my-overrides.json clusters

# Print the default parameter file
pbench genconfig default

Parameter Files (`-p`)

Parameter files are JSON files containing key-value pairs. All values are loaded into a generic map[string]any — there is no fixed schema, so you can add any keys you want.

The built-in default is clusters/params.json:

{
  "sys_reserved_mem_percent": 0.05,
  "sys_reserved_mem_cap_gb": 2,
  "heap_size_percent_of_container_mem": 0.9,
  "headroom_percent_of_heap": 0.2,
  "query_max_total_mem_per_node_percent_of_heap": 0.8,
  "query_max_mem_per_node_percent_of_total": 0.9,
  "proxygen_mem_per_worker_gb": 0.125,
  "proxygen_mem_cap_gb": 2,
  "native_buffer_mem_percent": 0.05,
  "native_buffer_mem_cap_gb": 32,
  "native_query_mem_percent_of_sys_mem": 0.95,
  "join_max_bcast_size_percent_of_container_mem": 0.01,
  "memory_push_back_start_below_limit_gb": 5,
  "trino": "${PROVISION_TRINO}"
}

Stacking parameter files

When multiple -p flags are provided, files are merged left-to-right. Later files override keys from earlier ones. Keys not present in the override file are kept from the base.

# my-overrides.json: { "heap_size_percent_of_container_mem": 0.85, "my_custom_key": "hello" }
pbench genconfig -p clusters/params.json -p my-overrides.json clusters

Result: the merged map has all keys from params.json, with heap_size_percent_of_container_mem overridden to 0.85, plus the new key my_custom_key set to "hello".
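A minimal Go sketch of this left-to-right merge, assuming a shallow key-by-key merge into a flat map (the function name `mergeJSON` is hypothetical, and the base document is shortened to two keys):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergeJSON decodes each JSON document into a map and merges them
// left-to-right, so keys in later documents replace keys in earlier ones.
func mergeJSON(docs ...string) (map[string]any, error) {
	merged := map[string]any{}
	for _, doc := range docs {
		var m map[string]any
		if err := json.Unmarshal([]byte(doc), &m); err != nil {
			return nil, err
		}
		for k, v := range m {
			merged[k] = v
		}
	}
	return merged, nil
}

func main() {
	base := `{"heap_size_percent_of_container_mem": 0.9, "headroom_percent_of_heap": 0.2}`
	override := `{"heap_size_percent_of_container_mem": 0.85, "my_custom_key": "hello"}`
	merged, err := mergeJSON(base, override)
	if err != nil {
		panic(err)
	}
	fmt.Println(merged["heap_size_percent_of_container_mem"]) // 0.85
	fmt.Println(merged["headroom_percent_of_heap"])           // 0.2
	fmt.Println(merged["my_custom_key"])                      // hello
}
```

Note that encoding/json decodes every JSON number into a float64 when the target is map[string]any.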

Adding custom parameters

Since parameters are generic maps, you can add any key to a parameter file and reference it in templates. No Go code changes are needed.

For example, add "my_timeout_minutes": 60 to a parameter file, then use {{ .my_timeout_minutes }} in a template.

Cluster Config Files (`config.json`)

Each cluster size directory contains a config.json file that defines cluster-specific values like node count, memory, and instance types. These values are merged on top of the parameter file values, so cluster-specific keys override global parameters with the same name.

Example clusters/small/config.json:

{
    "cluster_size": "small",
    "coordinator_instance_type": "r6i.2xlarge",
    "coordinator_instance_ebs_size": 50,
    "worker_instance_type": "r6i.2xlarge",
    "worker_instance_ebs_size": 50,
    "number_of_workers": 4,
    "memory_per_node_gb": 62,
    "vcpu_per_worker": 8,
    "fragment_result_cache_enabled": true,
    "data_cache_enabled": true
}

Skipping directories with `.genconfigignore`

If a cluster directory contains a .genconfigignore file, pbench genconfig will skip it entirely — no files are generated or cleaned up. This is useful for directories that contain a config.json but are manually maintained rather than auto-generated.

# Create the marker file to opt out of generation
touch my-manual-cluster/.genconfigignore

The .genconfigignore file can be empty; only its presence matters.
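The discovery step could look roughly like the following Go sketch. It is not taken from pbench: `findClusterDirs`, `exists`, and `demo` are hypothetical names, and skipping the whole subtree below a marked directory is an assumption.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// exists reports whether dir contains an entry with the given name.
func exists(dir, name string) bool {
	_, err := os.Stat(filepath.Join(dir, name))
	return err == nil
}

// findClusterDirs walks root and returns (relative to root) every directory
// holding a config.json, skipping directories marked with .genconfigignore.
func findClusterDirs(root string) ([]string, error) {
	var dirs []string
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() {
			return nil
		}
		if exists(p, ".genconfigignore") {
			return filepath.SkipDir // assumption: the whole subtree is skipped
		}
		if exists(p, "config.json") {
			rel, err := filepath.Rel(root, p)
			if err != nil {
				return err
			}
			dirs = append(dirs, rel)
		}
		return nil
	})
	return dirs, err
}

// demo builds a throwaway tree with one generated and one manual cluster.
func demo() ([]string, error) {
	root, err := os.MkdirTemp("", "clusters")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)
	files := []string{"small/config.json", "manual/config.json", "manual/.genconfigignore"}
	for _, f := range files {
		full := filepath.Join(root, f)
		if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
			return nil, err
		}
		if err := os.WriteFile(full, []byte("{}"), 0o644); err != nil {
			return nil, err
		}
	}
	return findClusterDirs(root)
}

func main() {
	dirs, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(dirs) // [small]
}
```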

Merge order

The final map for each cluster is:

base params (-p files merged left-to-right)  →  config.json values on top

If params.json has "foo": 1 and config.json has "foo": 2, the template sees "foo": 2.

Available cluster sizes

Cluster	config.json	README
small	config.json	README
medium	config.json	README
medium-ssd	config.json	README
medium-spill	config.json	README
large	config.json	README
starburst-comp	config.json	README

Writing Templates

Templates use the Go text/template syntax. The data context (.) is the merged map[string]any containing all parameter and cluster config values.

Referencing values

Use .key_name to access any value from the merged map. Keys use snake_case matching the JSON field names.

Simple value substitution:

-Xmx{{ .heap_size_gb }}G

If heap_size_gb is 55, this renders as -Xmx55G.

Inline expressions:

query.max-total-memory={{ mul .java_query_max_total_mem_per_node_gb .number_of_workers }}GB

If java_query_max_total_mem_per_node_gb is 43 and number_of_workers is 4, this renders as query.max-total-memory=172GB.

Nested function calls:

system-mem-limit-gb={{ floor (mul .container_memory_gb 0.95) }}
native-buffer-mem={{ ceil (min .native_buffer_mem_cap_gb (mul .container_memory_gb .native_buffer_mem_percent)) }}GB
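To try an expression outside pbench, a small Go harness is enough. The sketch below registers a stand-in `mul` that only accepts float64 arguments (an assumption; the real function map converts other numeric types as well) and renders the inline expression shown above. The names `funcs` and `render` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"text/template"
)

// funcs is a stand-in for the tool's function map. Only mul is sketched, and
// it assumes every argument is a float64, which holds for JSON-decoded data.
var funcs = template.FuncMap{
	"mul": func(args ...float64) float64 {
		product := 1.0
		for _, a := range args {
			product *= a
		}
		return product
	},
}

// render parses text as a template and executes it against the JSON data.
func render(text, jsonData string) (string, error) {
	var data map[string]any
	if err := json.Unmarshal([]byte(jsonData), &data); err != nil {
		return "", err
	}
	t, err := template.New("example").Funcs(funcs).Parse(text)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := t.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	out, err := render(
		"query.max-total-memory={{ mul .java_query_max_total_mem_per_node_gb .number_of_workers }}GB",
		`{"java_query_max_total_mem_per_node_gb": 43, "number_of_workers": 4}`,
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // query.max-total-memory=172GB
}
```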

Conditional sections

Use {{ if }} to conditionally include blocks. A key counts as false when it is missing from the map or holds false, 0, or an empty string.

{{ if .spill_enabled -}}
experimental.spiller-spill-path=/opt/presto-server/spilled_data
{{ end -}}

{{ if .ssd_cache_size -}}
async-cache-ssd-gb={{ .ssd_cache_size }}
async-cache-ssd-path=/opt/presto-server/async_data_cache/
{{ end -}}

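A - placed directly inside a delimiter trims adjacent whitespace: {{- removes the whitespace before the action and -}} removes the whitespace after it, including newlines. This keeps the output clean. See the Go template docs on trimming for details.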
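The following Go sketch renders the two conditional blocks above with plain text/template, to show both the falsy missing key and the effect of -}}. The constant and function names are hypothetical, and the data values are made up.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

const workerTemplate = `{{ if .spill_enabled -}}
experimental.spiller-spill-path=/opt/presto-server/spilled_data
{{ end -}}

{{ if .ssd_cache_size -}}
async-cache-ssd-gb={{ .ssd_cache_size }}
async-cache-ssd-path=/opt/presto-server/async_data_cache/
{{ end -}}
`

// render executes workerTemplate against data and returns the output.
func render(data map[string]any) (string, error) {
	t, err := template.New("worker").Parse(workerTemplate)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := t.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	// spill_enabled is missing, so only the SSD cache block is emitted.
	out, err := render(map[string]any{"ssd_cache_size": 100})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", out)
	// "async-cache-ssd-gb=100\nasync-cache-ssd-path=/opt/presto-server/async_data_cache/\n"
}
```

The blank line between the two blocks and the newline after each action do not appear in the output because each -}} consumes them.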

Conditionally skipping entire files

If a template's output is empty (or whitespace-only), the file is not created at all. This lets you conditionally skip entire files by wrapping the full template content in an {{ if }} block:

{{ if .spark_enabled -}}
spark.master=yarn
spark.driver.memory={{ .heap_size_gb }}g
{{ end -}}

For example, if this template is saved as spark-defaults.conf and spark_enabled is absent or false in the cluster's config.json, no spark-defaults.conf file is generated for that cluster.
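One way to picture the skip rule is a render step that reports whether its output is worth writing. This is a sketch under that assumption, not the tool's code; `renderOrSkip` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

const sparkTemplate = `{{ if .spark_enabled -}}
spark.master=yarn
spark.driver.memory={{ .heap_size_gb }}g
{{ end -}}
`

// renderOrSkip renders text and reports whether the result should be written.
// Whitespace-only output means "create no file".
func renderOrSkip(text string, data map[string]any) (string, bool, error) {
	t, err := template.New("spark-defaults.conf").Parse(text)
	if err != nil {
		return "", false, err
	}
	var out strings.Builder
	if err := t.Execute(&out, data); err != nil {
		return "", false, err
	}
	if strings.TrimSpace(out.String()) == "" {
		return "", false, nil
	}
	return out.String(), true, nil
}

func main() {
	for _, data := range []map[string]any{
		{"heap_size_gb": 54},
		{"heap_size_gb": 54, "spark_enabled": true},
	} {
		out, write, err := renderOrSkip(sparkTemplate, data)
		if err != nil {
			panic(err)
		}
		fmt.Printf("write=%v output=%q\n", write, out)
	}
	// write=false output=""
	// write=true output="spark.master=yarn\nspark.driver.memory=54g\n"
}
```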

Using the `default` function

Use default to provide a fallback when a key might not exist in the map:

timeout={{ default .my_custom_timeout 30 }}m

If my_custom_timeout is not in the map, this renders as timeout=30m.
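A fallback function with this argument order can be sketched in a few lines of Go. The implementation below is an assumption about how such a helper might work (a missing map key arrives as nil), not a copy of pbench's version; `funcs` and `render` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

var funcs = template.FuncMap{
	// default returns fallback when val is nil, which is what a missing
	// key in a map[string]any turns into when passed as an argument.
	"default": func(val, fallback any) any {
		if val == nil {
			return fallback
		}
		return val
	},
}

// render executes text against data using the sketch function map.
func render(text string, data map[string]any) (string, error) {
	t, err := template.New("example").Funcs(funcs).Parse(text)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := t.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	const text = "timeout={{ default .my_custom_timeout 30 }}m"
	for _, data := range []map[string]any{{}, {"my_custom_timeout": 45}} {
		out, err := render(text, data)
		if err != nil {
			panic(err)
		}
		fmt.Println(out)
	}
	// timeout=30m
	// timeout=45m
}
```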

Iterating with `seq`

Use seq to generate a sequence of integers (useful for generating numbered entries). The end value is included, so the example below uses dec to stop at number_of_workers - 1:

{{ range $i := seq 0 (dec .number_of_workers) -}}
worker-{{ $i }}.host=worker{{ $i }}
{{ end -}}
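Since seq returns a channel, the template's range simply receives values until the channel is closed. A possible Go sketch, with stand-in `seq` and `dec` implementations that are assumptions rather than the tool's code (this `dec` only accepts float64, as decoded from JSON):

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

var funcs = template.FuncMap{
	// seq sends start..end (inclusive) on a channel and then closes it.
	"seq": func(start, end int) chan int {
		ch := make(chan int)
		go func() {
			defer close(ch)
			for i := start; i <= end; i++ {
				ch <- i
			}
		}()
		return ch
	},
	// dec subtracts one and returns an int.
	"dec": func(x float64) int { return int(x) - 1 },
}

const workersTemplate = `{{ range $i := seq 0 (dec .number_of_workers) -}}
worker-{{ $i }}.host=worker{{ $i }}
{{ end -}}
`

// render executes workersTemplate against data.
func render(data map[string]any) (string, error) {
	t, err := template.New("workers").Funcs(funcs).Parse(workersTemplate)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := t.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	out, err := render(map[string]any{"number_of_workers": 3.0})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
	// worker-0.host=worker0
	// worker-1.host=worker1
	// worker-2.host=worker2
}
```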

Real-world template example

Here is a simplified version of coordinator/config.properties:

{{ template "prelude" . -}}
coordinator=true
http-server.http.port=8080

memory.heap-headroom-per-node={{ .headroom_gb }}GB
join-max-broadcast-table-size={{ .join_max_broadcast_table_size_mb }}MB

query.max-total-memory-per-node={{ .java_query_max_total_mem_per_node_gb }}GB
query.max-total-memory={{ mul .java_query_max_total_mem_per_node_gb .number_of_workers }}GB
query.max-memory-per-node={{ .java_query_max_mem_per_node_gb }}GB
query.max-memory={{ mul .java_query_max_mem_per_node_gb .number_of_workers }}GB

The {{ template "prelude" . }} on line 1 invokes the prelude to compute derived values before the rest of the template runs. See the next section for details.

The `.prelude` Template: Computed Values

Many config values (heap size, query memory limits, broadcast table size) are derived from the input parameters through formulas. Rather than hardcoding these calculations in Go, they live in a special template file called .prelude.

How it works

The .prelude file defines a named template block called "prelude"
Inside, it uses the set function to compute values and add them to the data map
Any template that needs computed values includes {{ template "prelude" . }} at the top
After the prelude runs, the computed keys are available via .key_name like any other value
Files prefixed with . are never written to the output directory

The `set` function

set takes a map, a key name, and a value. It adds the key-value pair to the map and returns an empty string (so it produces no output):

{{ set . "heap_size_gb" (floor (mul .container_memory_gb .heap_size_percent_of_container_mem)) -}}

This computes floor(container_memory_gb * heap_size_percent_of_container_mem) and stores the result as heap_size_gb in the map. The - before the closing }} trims the trailing newline.
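The interplay between set and the prelude can be reproduced with plain text/template. The sketch below is an approximation under stated assumptions: the function map is a reduced stand-in (two-argument mul, float64-only inputs), the prelude holds a single set line, and names such as `render` and `jvmConfig` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"text/template"
)

var funcs = template.FuncMap{
	// set stores value under key in the data map and renders as "".
	"set": func(m map[string]any, key string, value any) string {
		m[key] = value
		return ""
	},
	"mul":   func(a, b float64) float64 { return a * b },
	"floor": func(x float64) int { return int(math.Floor(x)) },
}

const prelude = `{{ define "prelude" -}}
{{ set . "heap_size_gb" (floor (mul .container_memory_gb .heap_size_percent_of_container_mem)) -}}
{{ end }}`

const jvmConfig = `{{ template "prelude" . -}}
-Xmx{{ .heap_size_gb }}G
`

// render parses the prelude and one output template, then executes the latter.
func render(data map[string]any) (string, error) {
	root := template.New("root").Funcs(funcs)
	if _, err := root.New(".prelude").Parse(prelude); err != nil {
		return "", err
	}
	if _, err := root.New("jvm.config").Parse(jvmConfig); err != nil {
		return "", err
	}
	var out strings.Builder
	if err := root.ExecuteTemplate(&out, "jvm.config", data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	data := map[string]any{
		"container_memory_gb":                60.0,
		"heap_size_percent_of_container_mem": 0.9,
	}
	out, err := render(data)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)                    // -Xmx54G
	fmt.Println(data["heap_size_gb"]) // 54
}
```

Because maps are reference types in Go, the value written inside the "prelude" block is visible to the calling template and to the caller's map afterwards.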

Full `.prelude` source

clusters/templates/.prelude:

{{ define "prelude" -}}
{{ set . "container_memory_gb" (sub .memory_per_node_gb (ceil (min .sys_reserved_mem_cap_gb (mul .memory_per_node_gb .sys_reserved_mem_percent)))) -}}
{{ set . "heap_size_gb" (floor (mul .container_memory_gb .heap_size_percent_of_container_mem)) -}}
{{ set . "headroom_gb" (ceil (mul .heap_size_gb .headroom_percent_of_heap)) -}}
{{ set . "java_query_max_total_mem_per_node_gb" (floor (mul .heap_size_gb .query_max_total_mem_per_node_percent_of_heap)) -}}
{{ set . "java_query_max_mem_per_node_gb" (floor (mul .java_query_max_total_mem_per_node_gb .query_max_mem_per_node_percent_of_total)) -}}
{{ set . "native_proxygen_mem_gb" (ceil (min .proxygen_mem_cap_gb (mul .proxygen_mem_per_worker_gb .number_of_workers))) -}}
{{ set . "native_buffer_mem_gb" (ceil (min .native_buffer_mem_cap_gb (mul .container_memory_gb .native_buffer_mem_percent))) -}}
{{ set . "native_system_mem_gb" (sub .container_memory_gb (add .native_buffer_mem_gb .native_proxygen_mem_gb)) -}}
{{ set . "native_query_mem_gb" (floor (mul .native_system_mem_gb .native_query_mem_percent_of_sys_mem)) -}}
{{ set . "join_max_broadcast_table_size_mb" (ceil (mul .container_memory_gb .join_max_bcast_size_percent_of_container_mem 1024)) -}}
{{ set . "fragment_result_cache_size_gb" (ceil (mul .memory_per_node_gb 2 0.95)) -}}
{{ set . "data_cache_size_gb" (ceil (mul .memory_per_node_gb 3 0.95)) -}}
{{ end }}

Formulas explained

Starting from memory_per_node_gb (e.g., 62 GB for the small cluster):

Computed Value	Formula	Example (small)
`container_memory_gb`	`memory_per_node - ceil(min(sys_reserved_cap, memory_per_node * sys_reserved_percent))`	`62 - ceil(min(2, 62 * 0.05))` = `62 - 2` = 60
`heap_size_gb`	`floor(container_memory * heap_percent)`	`floor(60 * 0.9)` = 54
`headroom_gb`	`ceil(heap_size * headroom_percent)`	`ceil(54 * 0.2)` = 11
`java_query_max_total_mem_per_node_gb`	`floor(heap_size * query_max_total_percent)`	`floor(54 * 0.8)` = 43
`java_query_max_mem_per_node_gb`	`floor(java_query_max_total * query_max_percent)`	`floor(43 * 0.9)` = 38
`native_proxygen_mem_gb`	`ceil(min(proxygen_cap, proxygen_per_worker * workers))`	`ceil(min(2, 0.125 * 4))` = 1
`native_buffer_mem_gb`	`ceil(min(buffer_cap, container_memory * buffer_percent))`	`ceil(min(32, 60 * 0.05))` = 3
`native_system_mem_gb`	`container_memory - native_buffer - native_proxygen`	`60 - 3 - 1` = 56
`native_query_mem_gb`	`floor(native_system * query_percent)`	`floor(56 * 0.95)` = 53
`join_max_broadcast_table_size_mb`	`ceil(container_memory * broadcast_percent * 1024)`	`ceil(60 * 0.01 * 1024)` = 615
`fragment_result_cache_size_gb`	`ceil(memory_per_node * 2 * 0.95)`	`ceil(62 * 2 * 0.95)` = 118
`data_cache_size_gb`	`ceil(memory_per_node * 3 * 0.95)`	`ceil(62 * 3 * 0.95)` = 177
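The table can be checked by transcribing the formulas into plain Go. The sketch below hard-codes the default parameter values from clusters/params.json; the function name `derive` and the use of a map for the results are choices made for this illustration only.

```go
package main

import (
	"fmt"
	"math"
)

// derive mirrors the prelude formulas with the default parameters hard-coded.
// Keys match the computed value names in the table above.
func derive(memPerNodeGB, workers float64) map[string]float64 {
	container := memPerNodeGB - math.Ceil(math.Min(2, memPerNodeGB*0.05))
	heap := math.Floor(container * 0.9)
	queryTotal := math.Floor(heap * 0.8)
	proxygen := math.Ceil(math.Min(2, 0.125*workers))
	buffer := math.Ceil(math.Min(32, container*0.05))
	system := container - buffer - proxygen
	return map[string]float64{
		"container_memory_gb":                  container,
		"heap_size_gb":                         heap,
		"headroom_gb":                          math.Ceil(heap * 0.2),
		"java_query_max_total_mem_per_node_gb": queryTotal,
		"java_query_max_mem_per_node_gb":       math.Floor(queryTotal * 0.9),
		"native_proxygen_mem_gb":               proxygen,
		"native_buffer_mem_gb":                 buffer,
		"native_system_mem_gb":                 system,
		"native_query_mem_gb":                  math.Floor(system * 0.95),
		"join_max_broadcast_table_size_mb":     math.Ceil(container * 0.01 * 1024),
		"fragment_result_cache_size_gb":        math.Ceil(memPerNodeGB * 2 * 0.95),
		"data_cache_size_gb":                   math.Ceil(memPerNodeGB * 3 * 0.95),
	}
}

func main() {
	d := derive(62, 4) // the small cluster: 62 GB per node, 4 workers
	for _, k := range []string{"container_memory_gb", "heap_size_gb", "native_query_mem_gb"} {
		fmt.Printf("%s = %v\n", k, d[k])
	}
	// container_memory_gb = 60
	// heap_size_gb = 54
	// native_query_mem_gb = 53
}
```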

Adding your own computed values

You can define computed values in two places:

In .prelude (shared across all templates): Add a set line to clusters/templates/.prelude. The value is computed once per cluster and available to every template that includes the prelude.

{{ set . "my_derived_value" (floor (mul .memory_per_node_gb 0.5)) -}}

Then reference it in any template that includes {{ template "prelude" . }}:

my-config-property={{ .my_derived_value }}GB

Directly in a template (local to that template): You can also use set directly inside a template file. This is useful for intermediate calculations that only that template needs — the value is not shared with other templates.

{{ set . "local_buffer_size" (ceil (div .container_memory_gb 4)) -}}
buffer-size={{ .local_buffer_size }}GB

Because set mutates the map, computed values can reference other computed values defined earlier — just define them in the right order:

{{ set . "a" (mul .memory_per_node_gb 0.8) -}}
{{ set . "b" (floor (mul .a 0.5)) -}}

Templates that don't need computed values

If a template only uses raw input values (from params or config.json), it does not need to include the prelude. For example, static catalog config files like catalog/hive-native.properties don't include {{ template "prelude" . }} because they don't reference any computed values.

Template Functions Reference

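All numeric functions accept any numeric type (int, float64, etc.) and convert automatically, so values decoded from JSON (which arrive as float64) can be mixed freely with integer literals written in a template.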

Function	Signature	Description	Example
`add`	`add(a, b) → float64`	Addition	`{{ add .heap_size_gb 10 }}` → `64`
`sub`	`sub(a, b) → float64`	Subtraction	`{{ sub .container_memory_gb 5 }}` → `55`
`mul`	`mul(args...) → float64`	Variadic multiplication	`{{ mul .heap_size_gb .number_of_workers }}` → `216`
`div`	`div(a, b) → float64`	Division	`{{ div .memory_per_node_gb 2 }}` → `31`
`min`	`min(a, b) → float64`	Minimum of two values	`{{ min .sys_reserved_mem_cap_gb 10 }}` → `2`
`max`	`max(a, b) → float64`	Maximum of two values	`{{ max .heap_size_gb 32 }}` → `54`
`floor`	`floor(x) → int`	Round down to integer	`{{ floor 54.7 }}` → `54`
`ceil`	`ceil(x) → int`	Round up to integer	`{{ ceil 3.1 }}` → `4`
`dec`	`dec(x) → int`	Decrement by 1 (integer)	`{{ dec .vcpu_per_worker }}` → `7`
`set`	`set(map, key, value) → ""`	Set a key in the data map	`{{ set . "foo" 42 }}`
`default`	`default(val, fallback) → any`	Fallback if val is nil	`{{ default .optional_key 10 }}`
`hasPrefix`	`hasPrefix(s, prefix) → bool`	String starts with prefix	`{{ if hasPrefix .worker_instance_type "r6i" }}...{{ end }}`
`hasSuffix`	`hasSuffix(s, suffix) → bool`	String ends with suffix	`{{ if hasSuffix .worker_instance_type "xlarge" }}...{{ end }}`
`contains`	`contains(s, substr) → bool`	String contains substring	`{{ if contains .worker_instance_type "gpu" }}...{{ end }}`
`seq`	`seq(start, end) → chan int`	Integer sequence [start, end]	`{{ range $i := seq 1 3 }}...{{ end }}`

Notes:

floor and ceil return int, so they render without a decimal point (e.g., 54 not 54.0)
mul is variadic — {{ mul .a .b .c }} computes a * b * c
set returns an empty string, so {{ set . "key" value }} produces no output
Whole-number float64 values also render without a decimal point in Go templates (e.g., 60 not 60.0)

Nesting functions

Go template functions nest using parentheses. The innermost expression is evaluated first:

{{ floor (mul .container_memory_gb .heap_size_percent_of_container_mem) }}

This computes mul(60, 0.9) → 54.0, then floor(54.0) → 54.

Deeper nesting example from the prelude:

{{ set . "container_memory_gb" (sub .memory_per_node_gb (ceil (min .sys_reserved_mem_cap_gb (mul .memory_per_node_gb .sys_reserved_mem_percent)))) }}

Reading inside-out:

mul .memory_per_node_gb .sys_reserved_mem_percent → 62 * 0.05 = 3.1
min .sys_reserved_mem_cap_gb 3.1 → min(2, 3.1) = 2
ceil 2 → 2
sub .memory_per_node_gb 2 → 62 - 2 = 60
set . "container_memory_gb" 60 → stores 60 in the map

Template Directory Structure

The built-in template directory (clusters/templates/) mirrors the output directory structure:

clusters/templates/
  .prelude                             # Computed value definitions (not output)
  README.md                            # Cluster README (template)
  docker-stack-java.yaml               # Docker Swarm config for Java workers
  docker-stack-native.yaml             # Docker Swarm config for native (Prestissimo) workers
  docker-stack-spark.yaml              # Docker Swarm config for Spark
  coordinator/
    config.properties                  # Java coordinator config
    config-native.properties           # Native (Prestissimo) coordinator config
    config-trino.properties            # Trino coordinator config
    jvm.config                         # JVM settings
    jvm-trino.config                   # Trino JVM settings
    node.properties                    # Static node properties
    session-property-config.json       # Session property config
    session-property-config.properties # Session property config
  workers/
    config.properties                  # Java worker config
    config-native.properties           # Native (Prestissimo) worker config
    config-trino.properties            # Trino worker config
    jvm.config                         # Worker JVM settings
    jvm-trino.config                   # Trino worker JVM settings
    node.properties                    # Static node properties
  catalog/
    hive.properties                    # Hive catalog (Java)
    hive-native.properties             # Hive catalog (native)
    hive-trino.properties              # Hive catalog (Trino)
    tpcds.properties                   # TPC-DS catalog
    ...                                # Other catalog files (static, no templates)

Files prefixed with . (like .prelude) are parsed but never written to the output.
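The mirroring rule amounts to joining each template's relative path onto the cluster directory and dropping dot-prefixed files. A small Go sketch of that idea, where `outputPaths` is a hypothetical helper and not part of pbench:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// outputPaths maps template paths (relative to the template directory) onto a
// cluster directory, dropping dot-prefixed files such as .prelude.
func outputPaths(templates []string, clusterDir string) []string {
	var out []string
	for _, rel := range templates {
		if strings.HasPrefix(path.Base(rel), ".") {
			continue // parsed for its definitions, never written
		}
		out = append(out, path.Join(clusterDir, rel))
	}
	return out
}

func main() {
	templates := []string{".prelude", "coordinator/config.properties", "workers/jvm.config"}
	fmt.Println(outputPaths(templates, "clusters/small"))
	// [clusters/small/coordinator/config.properties clusters/small/workers/jvm.config]
}
```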

Customization Guide

Scenario 1: Change a tuning parameter for all clusters

Edit clusters/params.json (or pass a separate override file with -p):

{
  "heap_size_percent_of_container_mem": 0.85
}

pbench genconfig -p clusters/params.json -p tuning-override.json clusters

Scenario 2: Add a new config property to all clusters

Add the parameter to clusters/params.json:
```
{ "my_new_timeout_minutes": 45 }
```

Reference it in the relevant template file(s):

my-config.timeout={{ .my_new_timeout_minutes }}m

Regenerate: make clusters

Scenario 3: Add a per-cluster override

Add the key directly to that cluster's config.json. It will override the same key from params:

{
    "cluster_size": "xlarge",
    "number_of_workers": 16,
    "memory_per_node_gb": 480,
    "my_new_timeout_minutes": 90
}

Scenario 4: Add a new computed value

Add input parameters to params.json if needed
Add a set line to clusters/templates/.prelude (if it should be shared), or directly in the template that needs it:
```
{{ set . "my_computed_value" (floor (div .memory_per_node_gb .number_of_workers)) -}}
```
Reference in templates: {{ .my_computed_value }}

Scenario 5: Use a completely custom template directory

pbench genconfig -p my-params.json -t my-templates/ my-output/

Your template directory can have any structure. Each template file is rendered for every config.json found in the output directory.

Quick Start

Create an output directory and copy a config.json from an existing cluster size:
```
mkdir my-cluster
cp clusters/small/config.json my-cluster/
```

Edit my-cluster/config.json to match your cluster's hardware:

{
    "cluster_size": "my-cluster",
    "number_of_workers": 8,
    "memory_per_node_gb": 128,
    "vcpu_per_worker": 16,
    "fragment_result_cache_enabled": true,
    "data_cache_enabled": true
}

Generate the configs:

pbench genconfig -p clusters/params.json -t clusters/templates my-cluster

The output directory now contains all the generated config files:

my-cluster/
  config.json
  README.md
  docker-stack-native.yaml
  coordinator/config.properties
  coordinator/jvm.config
  workers/config-native.properties
  ...
