Generating Benchmark Configurations

Yiqun (Ethan) Zhang edited this page Feb 24, 2026 · 18 revisions

Overview

pbench genconfig generates cluster configuration files from Go templates. It takes:

  1. Parameter files (-p) — global settings (memory percentages, thresholds) that apply to all cluster sizes
  2. Template directory (-t) — Go template files that define the output file structure
  3. Output directory (positional arg) — directory tree to search for config.json files; each config.json defines one cluster size

For each config.json found, the parameters and cluster config are merged into a single flat map[string]any, then each template is executed against that map to produce the output files. After generation, any files in the output directories that were not generated (and are not config.json) are automatically removed. This ensures that deleting or renaming a template does not leave stale output files behind.
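The cleanup step can be pictured as a set difference between the files already on disk and the files produced by the current run. The following Go sketch is an illustration under that assumption, not pbench's actual code; the helper name staleFiles and the sample paths are hypothetical.

```go
package main

import (
	"fmt"
	"path"
	"sort"
)

// staleFiles returns the entries of existing that the current run did not
// generate. config.json is always kept because it is an input, not an output.
// (Hypothetical helper for illustration; not taken from pbench.)
func staleFiles(existing []string, generated map[string]bool) []string {
	var stale []string
	for _, p := range existing {
		if path.Base(p) == "config.json" || generated[p] {
			continue
		}
		stale = append(stale, p)
	}
	sort.Strings(stale)
	return stale
}

func main() {
	existing := []string{
		"small/config.json",
		"small/coordinator/jvm.config",
		"small/coordinator/old-name.properties", // left behind by a renamed template
	}
	generated := map[string]bool{"small/coordinator/jvm.config": true}
	fmt.Println(staleFiles(existing, generated)) // [small/coordinator/old-name.properties]
}
```

Only the leftover file from the renamed template is reported; config.json and the freshly generated file are kept.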

Data Flow

params.json + overrides.json   (global parameters, merged left-to-right)
        │
        ▼
   merged params ─── + ─── config.json   (per-cluster values override params)
                     │
                     ▼
              map[string]any   (single flat map)
                     │
                     ▼
              .prelude template   (computes derived values, adds them to the map)
                     │
                     ▼
              template execution   (each template file renders using the map)
                     │
                     ▼
              output files   (config.properties, jvm.config, etc.)

CLI Reference

pbench genconfig [flags] [directory]

Flags:
  -p, --parameter-file stringArray   Parameter file (repeatable; later files override earlier ones)
  -t, --template-dir string          Template directory (default: built-in templates)
  -h, --help                         Help for genconfig

Subcommands:
  default     Print the built-in default parameter file

Examples

# Use built-in params and templates, generate for all clusters/ subdirectories
pbench genconfig clusters

# Explicit params and templates
pbench genconfig -p clusters/params.json -t clusters/templates clusters

# Stack parameter files (base + overrides)
pbench genconfig -p clusters/params.json -p my-overrides.json clusters

# Print the default parameter file
pbench genconfig default

Parameter Files (-p)

Parameter files are JSON files containing key-value pairs. All values are loaded into a generic map[string]any — there is no fixed schema, so you can add any keys you want.

The built-in default is clusters/params.json:

{
  "sys_reserved_mem_percent": 0.05,
  "sys_reserved_mem_cap_gb": 2,
  "heap_size_percent_of_container_mem": 0.9,
  "headroom_percent_of_heap": 0.2,
  "query_max_total_mem_per_node_percent_of_heap": 0.8,
  "query_max_mem_per_node_percent_of_total": 0.9,
  "proxygen_mem_per_worker_gb": 0.125,
  "proxygen_mem_cap_gb": 2,
  "native_buffer_mem_percent": 0.05,
  "native_buffer_mem_cap_gb": 32,
  "native_query_mem_percent_of_sys_mem": 0.95,
  "join_max_bcast_size_percent_of_container_mem": 0.01,
  "memory_push_back_start_below_limit_gb": 5,
  "trino": "${PROVISION_TRINO}"
}

Stacking parameter files

When multiple -p flags are provided, files are merged left-to-right. Later files override keys from earlier ones. Keys not present in the override file are kept from the base.

# my-overrides.json: { "heap_size_percent_of_container_mem": 0.85, "my_custom_key": "hello" }
pbench genconfig -p clusters/params.json -p my-overrides.json clusters

Result: the merged map has all keys from params.json, with heap_size_percent_of_container_mem overridden to 0.85, plus the new key my_custom_key set to "hello".

Adding custom parameters

Because parameters are loaded into a generic map, you can add any key to a parameter file and reference it in templates without changing any Go code.

For example, add "my_timeout_minutes": 60 to a parameter file, then use {{ .my_timeout_minutes }} in a template.

Cluster Config Files (config.json)

Each cluster size directory contains a config.json file that defines cluster-specific values like node count, memory, and instance types. These values are merged on top of the parameter file values, so cluster-specific keys override global parameters with the same name.

Example clusters/small/config.json:

{
    "cluster_size": "small",
    "coordinator_instance_type": "r6i.2xlarge",
    "coordinator_instance_ebs_size": 50,
    "worker_instance_type": "r6i.2xlarge",
    "worker_instance_ebs_size": 50,
    "number_of_workers": 4,
    "memory_per_node_gb": 62,
    "vcpu_per_worker": 8,
    "fragment_result_cache_enabled": true,
    "data_cache_enabled": true
}

Skipping directories with .genconfigignore

If a cluster directory contains a .genconfigignore file, pbench genconfig will skip it entirely — no files are generated or cleaned up. This is useful for directories that contain a config.json but are manually maintained rather than auto-generated.

# Create the marker file to opt out of generation
touch my-manual-cluster/.genconfigignore

The .genconfigignore file can be empty; only its presence matters.
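As a rough model of the discovery rule, a directory qualifies when it holds a config.json and no .genconfigignore marker. The Go sketch below works on a plain list of slash-separated paths so that it stays self-contained; the function name clusterDirs and the sample paths are hypothetical, and the real tool may walk the filesystem differently.

```go
package main

import (
	"fmt"
	"path"
	"sort"
)

// clusterDirs returns every directory that contains a config.json and does
// not contain a .genconfigignore marker. Paths use forward slashes.
// (Hypothetical helper for illustration; not taken from pbench.)
func clusterDirs(files []string) []string {
	hasConfig := map[string]bool{}
	ignored := map[string]bool{}
	for _, f := range files {
		switch path.Base(f) {
		case "config.json":
			hasConfig[path.Dir(f)] = true
		case ".genconfigignore":
			ignored[path.Dir(f)] = true
		}
	}
	var dirs []string
	for dir := range hasConfig {
		if !ignored[dir] {
			dirs = append(dirs, dir)
		}
	}
	sort.Strings(dirs)
	return dirs
}

func main() {
	files := []string{
		"clusters/small/config.json",
		"clusters/my-manual-cluster/config.json",
		"clusters/my-manual-cluster/.genconfigignore",
	}
	fmt.Println(clusterDirs(files)) // [clusters/small]
}
```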

Merge order

The final map for each cluster is:

base params (-p files merged left-to-right)  →  config.json values on top

If params.json has "foo": 1 and config.json has "foo": 2, the template sees "foo": 2.
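The merge order amounts to copying each layer into one map in sequence, so later layers win. A minimal Go sketch of that idea follows; mergeMaps is a hypothetical name, and the sample values are taken from the examples in this page.

```go
package main

import "fmt"

// mergeMaps copies each layer into a new map in order, so a key in a later
// layer replaces the same key from an earlier layer.
// (Hypothetical helper for illustration; not taken from pbench.)
func mergeMaps(layers ...map[string]any) map[string]any {
	merged := map[string]any{}
	for _, layer := range layers {
		for k, v := range layer {
			merged[k] = v
		}
	}
	return merged
}

func main() {
	params := map[string]any{"foo": 1, "heap_size_percent_of_container_mem": 0.9}
	overrides := map[string]any{"heap_size_percent_of_container_mem": 0.85, "my_custom_key": "hello"}
	config := map[string]any{"foo": 2, "cluster_size": "small"}

	// -p files first (left-to-right), then config.json on top.
	merged := mergeMaps(params, overrides, config)
	fmt.Println(merged["foo"], merged["heap_size_percent_of_container_mem"], merged["my_custom_key"])
	// prints: 2 0.85 hello
}
```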

Available cluster sizes

| Cluster | config.json | README |
| --- | --- | --- |
| small | config.json | README |
| medium | config.json | README |
| medium-ssd | config.json | README |
| medium-spill | config.json | README |
| large | config.json | README |
| starburst-comp | config.json | README |

Writing Templates

Templates use the Go text/template syntax. The data context (.) is the merged map[string]any containing all parameter and cluster config values.

Referencing values

Use .key_name to access any value from the merged map. Keys use snake_case matching the JSON field names.

Simple value substitution:

-Xmx{{ .heap_size_gb }}G

If heap_size_gb is 55, this renders as -Xmx55G.

Inline expressions:

query.max-total-memory={{ mul .java_query_max_total_mem_per_node_gb .number_of_workers }}GB

If java_query_max_total_mem_per_node_gb is 43 and number_of_workers is 4, this renders as query.max-total-memory=172GB.

Nested function calls:

system-mem-limit-gb={{ floor (mul .container_memory_gb 0.95) }}
native-buffer-mem={{ ceil (min .native_buffer_mem_cap_gb (mul .container_memory_gb .native_buffer_mem_percent)) }}GB
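To see how such expressions evaluate, the Go sketch below registers simplified stand-ins for mul, min, floor and ceil in a text/template FuncMap and renders two of the expressions shown in this section. The function bodies and the toFloat helper are assumptions made for illustration; pbench's own implementations may differ.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"text/template"
)

// toFloat converts the numeric types used in this sketch to float64.
func toFloat(v any) float64 {
	switch n := v.(type) {
	case int:
		return float64(n)
	case float64:
		return n
	}
	panic(fmt.Sprintf("not a number: %v", v))
}

// funcs holds assumed, simplified versions of the template functions.
var funcs = template.FuncMap{
	"mul": func(args ...any) float64 {
		product := 1.0
		for _, a := range args {
			product *= toFloat(a)
		}
		return product
	},
	"min":   func(a, b any) float64 { return math.Min(toFloat(a), toFloat(b)) },
	"floor": func(x any) int { return int(math.Floor(toFloat(x))) },
	"ceil":  func(x any) int { return int(math.Ceil(toFloat(x))) },
}

// render executes src against data and returns the output.
func render(src string, data map[string]any) string {
	tmpl := template.Must(template.New("example").Funcs(funcs).Parse(src))
	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		panic(err)
	}
	return out.String()
}

func main() {
	data := map[string]any{
		"java_query_max_total_mem_per_node_gb": 43,
		"number_of_workers":                    4,
		"container_memory_gb":                  60,
		"heap_size_percent_of_container_mem":   0.9,
	}
	// prints: query.max-total-memory=172GB
	fmt.Println(render(`query.max-total-memory={{ mul .java_query_max_total_mem_per_node_gb .number_of_workers }}GB`, data))
	// prints: -Xmx54G
	fmt.Println(render(`-Xmx{{ floor (mul .container_memory_gb .heap_size_percent_of_container_mem) }}G`, data))
}
```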

Conditional sections

Use {{ if }} to conditionally include blocks. A key that is missing from the map, or whose value is false, 0, or an empty string, evaluates as false.

{{ if .spill_enabled -}}
experimental.spiller-spill-path=/opt/presto-server/spilled_data
{{ end -}}

{{ if .ssd_cache_size -}}
async-cache-ssd-gb={{ .ssd_cache_size }}
async-cache-ssd-path=/opt/presto-server/async_data_cache/
{{ end -}}

The - after {{ and before }} trims whitespace, keeping the output clean. See the Go template docs on trimming for details.
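The effect of the trim markers is easiest to see by rendering the template and printing the raw output. The Go sketch below uses only the standard text/template package; the data values are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// src mirrors the conditional blocks above, with a trim marker before each
// closing delimiter.
const src = `{{ if .spill_enabled -}}
experimental.spiller-spill-path=/opt/presto-server/spilled_data
{{ end -}}
{{ if .ssd_cache_size -}}
async-cache-ssd-gb={{ .ssd_cache_size }}
{{ end -}}
`

// render executes src against data and returns the output.
func render(data map[string]any) string {
	tmpl := template.Must(template.New("config").Parse(src))
	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		panic(err)
	}
	return out.String()
}

func main() {
	// spill_enabled is missing, so only the SSD cache block is rendered.
	fmt.Printf("%q\n", render(map[string]any{"ssd_cache_size": 100})) // "async-cache-ssd-gb=100\n"
	// Both keys missing: every block is skipped and nothing is emitted.
	fmt.Printf("%q\n", render(map[string]any{})) // ""
}
```

Without the trim markers, each skipped block would leave a blank line behind in the output.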

Conditionally skipping entire files

If a template's output is empty (or whitespace-only), the file is not created at all. This lets you conditionally skip entire files by wrapping the full template content in an {{ if }} block:

{{ if .spark_enabled -}}
spark.master=yarn
spark.driver.memory={{ .heap_size_gb }}g
{{ end -}}

If spark_enabled is absent or false in the cluster's config.json, the template renders to nothing and the corresponding output file (spark-defaults.conf in this example) is not generated for that cluster.
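One plausible way to implement the skip rule is to render into a buffer first and only write the file when the trimmed result is non-empty. The Go sketch below assumes that approach; renderFile is a hypothetical helper, not pbench's API.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderFile executes src against data and reports whether the result should
// be written. Whitespace-only output means "do not create the file".
// (Hypothetical helper for illustration; not taken from pbench.)
func renderFile(src string, data map[string]any) (content string, write bool) {
	tmpl := template.Must(template.New("file").Parse(src))
	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		panic(err)
	}
	content = out.String()
	return content, strings.TrimSpace(content) != ""
}

func main() {
	src := "{{ if .spark_enabled -}}\nspark.master=yarn\n{{ end -}}\n"

	_, write := renderFile(src, map[string]any{})
	fmt.Println(write) // false: the whole template collapsed to nothing

	_, write = renderFile(src, map[string]any{"spark_enabled": true})
	fmt.Println(write) // true
}
```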

Using the default function

Use default to provide a fallback when a key might not exist in the map:

timeout={{ default .my_custom_timeout 30 }}m

If my_custom_timeout is not in the map, renders as timeout=30m.
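A function with this behaviour can be written in a few lines. The Go sketch below is an assumed implementation that matches the default(val, fallback) argument order used here; it relies on text/template passing a missing map key to an any parameter as nil.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

var funcs = template.FuncMap{
	// default returns fallback when val is nil, which is what a missing
	// map key becomes when it is passed as a function argument.
	// (Assumed implementation for illustration; not taken from pbench.)
	"default": func(val, fallback any) any {
		if val == nil {
			return fallback
		}
		return val
	},
}

const src = `timeout={{ default .my_custom_timeout 30 }}m`

// render executes src against data and returns the output.
func render(data map[string]any) string {
	tmpl := template.Must(template.New("example").Funcs(funcs).Parse(src))
	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		panic(err)
	}
	return out.String()
}

func main() {
	fmt.Println(render(map[string]any{}))                        // timeout=30m
	fmt.Println(render(map[string]any{"my_custom_timeout": 45})) // timeout=45m
}
```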

Iterating with seq

Use seq to generate a sequence of integers (useful for generating numbered entries):

{{ range $i := seq 0 (dec .number_of_workers) -}}
worker-{{ $i }}.host=worker{{ $i }}
{{ end -}}
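For reference, a range action in text/template can iterate over a channel, which is one way a seq function could be built. The Go sketch below is an assumed implementation; for brevity its dec and seq accept only int, so number_of_workers is supplied as an int rather than the float64 that JSON decoding would produce.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

var funcs = template.FuncMap{
	// dec is simplified to int here.
	"dec": func(x int) int { return x - 1 },
	// seq yields start, start+1, ..., end (inclusive) on a closed, buffered
	// channel that a range action can drain.
	// (Assumed implementation for illustration; not taken from pbench.)
	"seq": func(start, end int) <-chan int {
		n := end - start + 1
		if n < 0 {
			n = 0
		}
		ch := make(chan int, n)
		for i := start; i <= end; i++ {
			ch <- i
		}
		close(ch)
		return ch
	},
}

const src = `{{ range $i := seq 0 (dec .number_of_workers) -}}
worker-{{ $i }}.host=worker{{ $i }}
{{ end -}}
`

// render executes src against data and returns the output.
func render(data map[string]any) string {
	tmpl := template.Must(template.New("workers").Funcs(funcs).Parse(src))
	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		panic(err)
	}
	return out.String()
}

func main() {
	// Prints three lines, worker-0 through worker-2.
	fmt.Print(render(map[string]any{"number_of_workers": 3}))
}
```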

Real-world template example

Here is a simplified version of coordinator/config.properties:

{{ template "prelude" . -}}
coordinator=true
http-server.http.port=8080

memory.heap-headroom-per-node={{ .headroom_gb }}GB
join-max-broadcast-table-size={{ .join_max_broadcast_table_size_mb }}MB

query.max-total-memory-per-node={{ .java_query_max_total_mem_per_node_gb }}GB
query.max-total-memory={{ mul .java_query_max_total_mem_per_node_gb .number_of_workers }}GB
query.max-memory-per-node={{ .java_query_max_mem_per_node_gb }}GB
query.max-memory={{ mul .java_query_max_mem_per_node_gb .number_of_workers }}GB

The {{ template "prelude" . }} on line 1 invokes the prelude to compute derived values before the rest of the template runs. See the next section for details.

The .prelude Template: Computed Values

Many config values (heap size, query memory limits, broadcast table size) are derived from the input parameters through formulas. Rather than hardcoding these calculations in Go, they live in a special template file called .prelude.

How it works

  1. The .prelude file defines a named template block called "prelude"
  2. Inside, it uses the set function to compute values and add them to the data map
  3. Any template that needs computed values includes {{ template "prelude" . }} at the top
  4. After the prelude runs, the computed keys are available via .key_name like any other value
  5. Files prefixed with . are never written to the output directory

The set function

set takes a map, a key name, and a value. It adds the key-value pair to the map and returns an empty string (so it produces no output):

{{ set . "heap_size_gb" (floor (mul .container_memory_gb .heap_size_percent_of_container_mem)) -}}

This computes floor(container_memory_gb * heap_size_percent_of_container_mem) and stores the result as heap_size_gb in the map. The - before the closing }} trims the trailing newline.
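The mechanism works because the prelude receives the same map as the calling template, so a key written by set is visible afterwards. The Go sketch below reproduces the pattern with one prelude line and a one-line jvm.config template; the function bodies are assumed, simplified versions (float64 only), and the way the two sources are parsed into one template set is an illustration rather than pbench's loader.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"text/template"
)

var funcs = template.FuncMap{
	"mul":   func(a, b float64) float64 { return a * b },
	"floor": func(x float64) int { return int(math.Floor(x)) },
	// set stores value under key in the shared map and returns "" so the
	// action itself renders nothing.
	"set": func(m map[string]any, key string, value any) string {
		m[key] = value
		return ""
	},
}

// prelude defines the named template "prelude" and produces no output.
const prelude = `{{ define "prelude" -}}
{{ set . "heap_size_gb" (floor (mul .container_memory_gb .heap_size_percent_of_container_mem)) -}}
{{ end }}`

// jvmConfig runs the prelude first, then reads the computed key.
const jvmConfig = `{{ template "prelude" . -}}
-Xmx{{ .heap_size_gb }}G
`

// render parses both sources into one template set and executes jvm.config.
func render(data map[string]any) string {
	root := template.New("root").Funcs(funcs)
	template.Must(root.New(".prelude").Parse(prelude))
	tmpl := template.Must(root.New("jvm.config").Parse(jvmConfig))
	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		panic(err)
	}
	return out.String()
}

func main() {
	data := map[string]any{
		"container_memory_gb":                60.0,
		"heap_size_percent_of_container_mem": 0.9,
	}
	fmt.Print(render(data))           // -Xmx54G
	fmt.Println(data["heap_size_gb"]) // 54, written into the map by set
}
```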

Full .prelude source

clusters/templates/.prelude:

{{ define "prelude" -}}
{{ set . "container_memory_gb" (sub .memory_per_node_gb (ceil (min .sys_reserved_mem_cap_gb (mul .memory_per_node_gb .sys_reserved_mem_percent)))) -}}
{{ set . "heap_size_gb" (floor (mul .container_memory_gb .heap_size_percent_of_container_mem)) -}}
{{ set . "headroom_gb" (ceil (mul .heap_size_gb .headroom_percent_of_heap)) -}}
{{ set . "java_query_max_total_mem_per_node_gb" (floor (mul .heap_size_gb .query_max_total_mem_per_node_percent_of_heap)) -}}
{{ set . "java_query_max_mem_per_node_gb" (floor (mul .java_query_max_total_mem_per_node_gb .query_max_mem_per_node_percent_of_total)) -}}
{{ set . "native_proxygen_mem_gb" (ceil (min .proxygen_mem_cap_gb (mul .proxygen_mem_per_worker_gb .number_of_workers))) -}}
{{ set . "native_buffer_mem_gb" (ceil (min .native_buffer_mem_cap_gb (mul .container_memory_gb .native_buffer_mem_percent))) -}}
{{ set . "native_system_mem_gb" (sub .container_memory_gb (add .native_buffer_mem_gb .native_proxygen_mem_gb)) -}}
{{ set . "native_query_mem_gb" (floor (mul .native_system_mem_gb .native_query_mem_percent_of_sys_mem)) -}}
{{ set . "join_max_broadcast_table_size_mb" (ceil (mul .container_memory_gb .join_max_bcast_size_percent_of_container_mem 1024)) -}}
{{ set . "fragment_result_cache_size_gb" (ceil (mul .memory_per_node_gb 2 0.95)) -}}
{{ set . "data_cache_size_gb" (ceil (mul .memory_per_node_gb 3 0.95)) -}}
{{ end }}

Formulas explained

Starting from memory_per_node_gb (e.g., 62 GB for the small cluster):

| Computed Value | Formula | Example (small) |
| --- | --- | --- |
| container_memory_gb | memory_per_node - ceil(min(sys_reserved_cap, memory_per_node * sys_reserved_percent)) | 62 - ceil(min(2, 62 * 0.05)) = 62 - 2 = 60 |
| heap_size_gb | floor(container_memory * heap_percent) | floor(60 * 0.9) = 54 |
| headroom_gb | ceil(heap_size * headroom_percent) | ceil(54 * 0.2) = 11 |
| java_query_max_total_mem_per_node_gb | floor(heap_size * query_max_total_percent) | floor(54 * 0.8) = 43 |
| java_query_max_mem_per_node_gb | floor(java_query_max_total * query_max_percent) | floor(43 * 0.9) = 38 |
| native_proxygen_mem_gb | ceil(min(proxygen_cap, proxygen_per_worker * workers)) | ceil(min(2, 0.125 * 4)) = 1 |
| native_buffer_mem_gb | ceil(min(buffer_cap, container_memory * buffer_percent)) | ceil(min(32, 60 * 0.05)) = 3 |
| native_system_mem_gb | container_memory - native_buffer - native_proxygen | 60 - 3 - 1 = 56 |
| native_query_mem_gb | floor(native_system * query_percent) | floor(56 * 0.95) = 53 |
| join_max_broadcast_table_size_mb | ceil(container_memory * broadcast_percent * 1024) | ceil(60 * 0.01 * 1024) = 615 |
| fragment_result_cache_size_gb | ceil(memory_per_node * 2 * 0.95) | ceil(62 * 2 * 0.95) = 118 |
| data_cache_size_gb | ceil(memory_per_node * 3 * 0.95) | ceil(62 * 3 * 0.95) = 177 |
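The Java-side figures for the small cluster can be cross-checked with plain arithmetic. The Go sketch below hardcodes the default values from clusters/params.json and recomputes a subset of the values; the derived struct and derive function are hypothetical names used only for this check.

```go
package main

import (
	"fmt"
	"math"
)

// derived holds a subset of the computed values.
type derived struct {
	containerGB, heapGB, headroomGB, queryTotalGB, queryGB, broadcastMB int
}

// derive applies the prelude formulas with the default parameter values
// (0.05 and 2 for reserved memory, 0.9 heap, 0.2 headroom, 0.8 and 0.9 for
// query memory, 0.01 for broadcast size).
func derive(memoryPerNodeGB float64) derived {
	container := memoryPerNodeGB - math.Ceil(math.Min(2, memoryPerNodeGB*0.05))
	heap := math.Floor(container * 0.9)
	queryTotal := math.Floor(heap * 0.8)
	return derived{
		containerGB:  int(container),
		heapGB:       int(heap),
		headroomGB:   int(math.Ceil(heap * 0.2)),
		queryTotalGB: int(queryTotal),
		queryGB:      int(math.Floor(queryTotal * 0.9)),
		broadcastMB:  int(math.Ceil(container * 0.01 * 1024)),
	}
}

func main() {
	// memory_per_node_gb is 62 for the small cluster.
	fmt.Printf("%+v\n", derive(62))
	// {containerGB:60 heapGB:54 headroomGB:11 queryTotalGB:43 queryGB:38 broadcastMB:615}
}
```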

Adding your own computed values

You can define computed values in two places:

In .prelude (shared across all templates): Add a set line to clusters/templates/.prelude. The value is computed once per cluster and available to every template that includes the prelude.

{{ set . "my_derived_value" (floor (mul .memory_per_node_gb 0.5)) -}}

Then reference it in any template that includes {{ template "prelude" . }}:

my-config-property={{ .my_derived_value }}GB

Directly in a template (local to that template): You can also use set directly inside a template file. This is useful for intermediate calculations that only that template needs — the value is not shared with other templates.

{{ set . "local_buffer_size" (ceil (div .container_memory_gb 4)) -}}
buffer-size={{ .local_buffer_size }}GB

Because set mutates the map, computed values can reference other computed values defined earlier — just define them in the right order:

{{ set . "a" (mul .memory_per_node_gb 0.8) -}}
{{ set . "b" (floor (mul .a 0.5)) -}}

Templates that don't need computed values

If a template only uses raw input values (from params or config.json), it does not need to include the prelude. For example, static catalog config files like catalog/hive-native.properties don't include {{ template "prelude" . }} because they don't reference any computed values.

Template Functions Reference

All numeric functions accept any numeric type (int, float64, etc.) and convert automatically.

| Function | Signature | Description | Example |
| --- | --- | --- | --- |
| add | add(a, b) → float64 | Addition | {{ add .heap_size_gb 10 }} → 64 |
| sub | sub(a, b) → float64 | Subtraction | {{ sub .container_memory_gb 5 }} → 55 |
| mul | mul(args...) → float64 | Variadic multiplication | {{ mul .heap_size_gb .number_of_workers }} → 216 |
| div | div(a, b) → float64 | Division | {{ div .memory_per_node_gb 2 }} → 31 |
| min | min(a, b) → float64 | Minimum of two values | {{ min .sys_reserved_mem_cap_gb 10 }} → 2 |
| max | max(a, b) → float64 | Maximum of two values | {{ max .heap_size_gb 32 }} → 54 |
| floor | floor(x) → int | Round down to integer | {{ floor 54.7 }} → 54 |
| ceil | ceil(x) → int | Round up to integer | {{ ceil 3.1 }} → 4 |
| dec | dec(x) → int | Decrement by 1 (integer) | {{ dec .vcpu_per_worker }} → 7 |
| set | set(map, key, value) → "" | Set a key in the data map | {{ set . "foo" 42 }} |
| default | default(val, fallback) → any | Fallback if val is nil | {{ default .optional_key 10 }} |
| hasPrefix | hasPrefix(s, prefix) → bool | String starts with prefix | {{ if hasPrefix .worker_instance_type "r6i" }}...{{ end }} |
| hasSuffix | hasSuffix(s, suffix) → bool | String ends with suffix | {{ if hasSuffix .worker_instance_type "xlarge" }}...{{ end }} |
| contains | contains(s, substr) → bool | String contains substring | {{ if contains .worker_instance_type "gpu" }}...{{ end }} |
| seq | seq(start, end) → chan int | Integer sequence [start, end] | {{ range $i := seq 1 3 }}...{{ end }} |

Notes:

  • floor and ceil return int, so they render without a decimal point (e.g., 54 not 54.0)
  • mul is variadic — {{ mul .a .b .c }} computes a * b * c
  • set returns an empty string, so {{ set . "key" value }} produces no output
  • Whole-number float64 values also render without a decimal point in Go templates (e.g., 60 not 60.0)

Nesting functions

Go template functions nest using parentheses. The innermost expression is evaluated first:

{{ floor (mul .container_memory_gb .heap_size_percent_of_container_mem) }}

This computes mul(60, 0.9) → 54.0, then floor(54.0) → 54.

Deeper nesting example from the prelude:

{{ set . "container_memory_gb" (sub .memory_per_node_gb (ceil (min .sys_reserved_mem_cap_gb (mul .memory_per_node_gb .sys_reserved_mem_percent)))) }}

Reading inside-out:

  1. mul .memory_per_node_gb .sys_reserved_mem_percent → 62 * 0.05 = 3.1
  2. min .sys_reserved_mem_cap_gb 3.1 → min(2, 3.1) = 2
  3. ceil 2 → 2
  4. sub .memory_per_node_gb 2 → 62 - 2 = 60
  5. set . "container_memory_gb" 60 → stores 60 in the map

Template Directory Structure

The built-in template directory (clusters/templates/) mirrors the output directory structure:

clusters/templates/
  .prelude                             # Computed value definitions (not output)
  README.md                            # Cluster README (template)
  docker-stack-java.yaml               # Docker Swarm config for Java workers
  docker-stack-native.yaml             # Docker Swarm config for native (Prestissimo) workers
  docker-stack-spark.yaml              # Docker Swarm config for Spark
  coordinator/
    config.properties                  # Java coordinator config
    config-native.properties           # Native (Prestissimo) coordinator config
    config-trino.properties            # Trino coordinator config
    jvm.config                         # JVM settings
    jvm-trino.config                   # Trino JVM settings
    node.properties                    # Static node properties
    session-property-config.json       # Session property config
    session-property-config.properties # Session property config
  workers/
    config.properties                  # Java worker config
    config-native.properties           # Native (Prestissimo) worker config
    config-trino.properties            # Trino worker config
    jvm.config                         # Worker JVM settings
    jvm-trino.config                   # Trino worker JVM settings
    node.properties                    # Static node properties
  catalog/
    hive.properties                    # Hive catalog (Java)
    hive-native.properties             # Hive catalog (native)
    hive-trino.properties              # Hive catalog (Trino)
    tpcds.properties                   # TPC-DS catalog
    ...                                # Other catalog files (static, no templates)

Files prefixed with . (like .prelude) are parsed but never written to the output.

Customization Guide

Scenario 1: Change a tuning parameter for all clusters

Edit clusters/params.json directly, or put the change in a separate override file (tuning-override.json in this example) and pass it with a second -p:

{
  "heap_size_percent_of_container_mem": 0.85
}

pbench genconfig -p clusters/params.json -p tuning-override.json clusters

Scenario 2: Add a new config property to all clusters

  1. Add the parameter to clusters/params.json:

    { "my_new_timeout_minutes": 45 }

  2. Reference it in the relevant template file(s):

    my-config.timeout={{ .my_new_timeout_minutes }}m
    
  3. Regenerate: make clusters

Scenario 3: Add a per-cluster override

Add the key directly to that cluster's config.json. It will override the same key from params:

{
    "cluster_size": "xlarge",
    "number_of_workers": 16,
    "memory_per_node_gb": 480,
    "my_new_timeout_minutes": 90
}

Scenario 4: Add a new computed value

  1. Add input parameters to params.json if needed
  2. Add a set line to clusters/templates/.prelude (if it should be shared), or directly in the template that needs it:
    {{ set . "my_computed_value" (floor (div .memory_per_node_gb .number_of_workers)) -}}
    
  3. Reference in templates: {{ .my_computed_value }}

Scenario 5: Use a completely custom template directory

pbench genconfig -p my-params.json -t my-templates/ my-output/

Your template directory can have any structure. Each template file is rendered for every config.json found in the output directory.

Quick Start

  1. Create an output directory and copy a config.json from an existing cluster size:

    mkdir my-cluster
    cp clusters/small/config.json my-cluster/

  2. Edit my-cluster/config.json to match your cluster's hardware:

    {
        "cluster_size": "my-cluster",
        "number_of_workers": 8,
        "memory_per_node_gb": 128,
        "vcpu_per_worker": 16,
        "fragment_result_cache_enabled": true,
        "data_cache_enabled": true
    }

  3. Generate the configs:

    pbench genconfig -p clusters/params.json -t clusters/templates my-cluster

  4. The output directory now contains all the generated config files:

    my-cluster/
      config.json
      README.md
      docker-stack-native.yaml
      coordinator/config.properties
      coordinator/jvm.config
      workers/config-native.properties
      ...
    
