
Parameters

Yiqun (Ethan) Zhang edited this page Mar 25, 2026 · 85 revisions

PBench uses JSON files to define each stage of a benchmark. Each stage file is a JSON object whose fields are the parameters documented on this page.

Use the JSON parameters defined here to write stage files. For more information about stage files, see Creating a Stage File.

abort_on_error

Format

"abort_on_error": Boolean

Definition

Set abort_on_error to true to abort all running and future stages of the benchmark when an error occurs.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.

Example:

"abort_on_error": true
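As a sketch of the inheritance behavior (the file names `parent.json` and `child.json` are hypothetical), a parent stage can enable the flag and a child stage can unset the inherited value with null:

```json
{
  "description": "parent.json: abort the whole benchmark on any error.",
  "abort_on_error": true,
  "next": ["child.json"]
}
```

```json
{
  "description": "child.json: null unsets the abort_on_error value inherited from parent.json.",
  "abort_on_error": null,
  "queries": ["select 'query 1'"]
}
```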

catalog

Format

"catalog": "catalog-name"

Definition

Set the catalog for queries in queries and query_files.

catalog and schema cannot be set to null.

This parameter and its value are inherited by child stages.

If a child stage sets a different catalog, schema, or timezone than what the inherited client has, a new client is automatically created. You can also use start_on_new_client = true to force a new client (e.g., to get fresh session params).

Example:

"catalog": "iceberg"
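The new-client behavior can be sketched with two hypothetical stage files (file, schema, and table names are illustrative). The parent queries the iceberg catalog:

```json
{
  "description": "Parent stage: all queries run against the iceberg catalog.",
  "catalog": "iceberg",
  "schema": "tpcds_sf1000",
  "queries": ["select 'query 1'"],
  "next": ["hive_stage.json"]
}
```

The child overrides the inherited catalog, so PBench automatically creates a new client for it:

```json
{
  "description": "hive_stage.json: overriding the inherited catalog triggers a new client.",
  "catalog": "hive",
  "queries": ["select count(*) from store_sales"]
}
```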

cold_runs

Format

"cold_runs": integer

Definition

The number of cold runs to run to populate the cache. If not set, defaults to 0. However, if both cold_runs and warm_runs are 0 (neither is explicitly set), cold_runs is automatically set to 1 so that each query runs at least once.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.

A cycle is defined as the total of (cold_runs + warm_runs) for a query.

Example:

"cold_runs": 1
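For example, combining cold_runs and warm_runs in one stage (the query file path is illustrative) gives a cycle of three runs per query, 1 cold followed by 2 warm:

```json
{
  "cold_runs": 1,
  "warm_runs": 2,
  "query_files": ["queries/query_01.sql"]
}
```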

description

Format

"description": "Description of the JSON file."

Definition

JSON does not support comments, so comments in PBench are formatted as data pairs. For more information see Comments Inside JSON - Commenting in a JSON File.

Begin every stage JSON file with a description.

description is a recognized field on the stage struct. It is not used during execution but is preserved when the stage is parsed.

Example:

"description": "Specifies the catalog and the schema for TPC-DS Iceberg scale factor 1 TB partitioned."

expected_row_counts

Format

"expected_row_counts": {
  "file1": [
    1,
    1
  ],
  "file2": [
    1,
    1
  ]
}

Definition

A map from [catalog.schema] keys to arrays of integers giving the expected row counts for the queries run under each schema.

The key of this map can be either:

  • [schema] - match the schema name regardless of the catalog it is under

  • [catalog.schema] - match both catalog and schema

  • [regular expression] - used to match [catalog.schema]

List the expected row counts for the inline queries in queries first, followed by the counts for the queries in each file listed in query_files.

Example:

"expected_row_counts": {
  "tpcds_sf10000_": [
    100,
    100
  ],
  "tpcds_sf1000_": [
    100,
    100
  ]
}

Use regular expressions to match multiple [catalog.schema] pairs. In this example, .*\\.tpcds_sf10000 matches hive.tpcds_sf10000 and iceberg.tpcds_sf10000.

"expected_row_counts": {
  ".*\\.tpcds_sf10000": [
    100,
    100
  ],
  "tpcds_sf1000_": [
    100,
    100
  ]
}
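The ordering rule can be illustrated with a hypothetical stage (schema, table names, and counts are illustrative) that runs two inline queries and one query file containing a single query. The array lists the two inline counts first, then the file's count:

```json
{
  "catalog": "iceberg",
  "schema": "tpcds_sf1000",
  "queries": [
    "select count(*) from store",
    "select count(*) from item"
  ],
  "query_files": ["queries/query_01.sql"],
  "expected_row_counts": {
    "tpcds_sf1000": [1, 1, 100]
  }
}
```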

next

Format

"next": [
  "stage_2.json",
  "stage_3.json"
]

Definition

Specifies one or more child stages of the current stage. Child stages start after the parent stage finishes, and in parallel with each other. Child stages inherit some parameters from the parent stage if those parameters are not explicitly set in the child stage.

Example:

"next": [
  "stage_2.json",
  "stage_3.json"
]

queries

Format

"queries": [
  "query_string"
]

Definition

Run the SQL query in query_string. If a query is long or complex, or there are several queries, consider saving the queries in a SQL file to be run using query_files.

Do not end the SQL query in query_string with a semicolon.

SQL queries in queries are executed first, then SQL queries in files listed in query_files are read and executed, then external commands in post_stage_scripts are run.

Example:

"queries": [
  "select 'query 1'"
]

query_files

Format

"query_files": [
  "file1",
  "file2",
  "directory"
]

Definition

One or more files or directories containing SQL queries.

SQL queries in queries are executed first, then SQL queries in files listed in query_files are read and executed, then external commands in post_stage_scripts are run.

A relative file path in the query_files array is evaluated based on the location of the stage JSON file.

When an entry is a directory, PBench expands it to all files in the directory (non-recursive), sorted by filename. Subdirectories within the directory are ignored. Directory expansion happens at execution time (after pre_stage_scripts run), so scripts can generate query files into a directory before they are executed.

Examples:

"query_files": [
  "queries/query_01.sql",
  "queries/query_02.sql"
]

Using a directory with pre_stage_scripts to generate queries:

{
  "pre_stage_scripts": [
    "dsqgen -DIRECTORY templates -SCALE 1000 -RNGSEED 12345 -OUTPUT_DIR generated/"
  ],
  "query_files": ["generated"]
}

random_execution

Format

"random_execution": Boolean

Definition

When random_execution is set to false, PBench runs the queries in queries and query_files sequentially.

When random_execution is set to true, PBench runs the queries and query_files randomly, until the duration or integer set using randomly_execute_until is met.

Each query file counts as 1 regardless of the number of queries in that query file. For example, a stage has:

  • 3 queries in queries
  • 2 query files in query_files, with 3 queries in each file

random_execution selects from 5 (3 queries + 2 query files), not 9 (3 queries + 3 queries in one file + 3 queries in the other file).

If a query file is selected, all of the queries in the file are executed and it is counted as 1 selection towards the integer specified in randomly_execute_until.

Expected row counts are ignored when random_execution is set to true.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.

The default value of random_execution is false.

Example:

"random_execution": true
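The counting rule above can be sketched as follows (queries and file names are illustrative): with three inline queries and two query files, each of the 50 random selections picks one of five items, and selecting a file runs every query inside it:

```json
{
  "random_execution": true,
  "randomly_execute_until": "50",
  "queries": [
    "select 'q1'",
    "select 'q2'",
    "select 'q3'"
  ],
  "query_files": [
    "queries/file_a.sql",
    "queries/file_b.sql"
  ]
}
```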

randomly_execute_until

Format

"randomly_execute_until": "duration"

"randomly_execute_until": "integer"

Definition

Specify either

  • a duration like 15m, 1h, 120h
  • an integer as the number of queries

to randomly run SQL queries. Valid duration units are h, m, s, ms, us, and ns (Go's time.ParseDuration format). Note that d (days) is not a valid unit; use 120h instead of 5d for 5 days.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.

Example:

"randomly_execute_until": "15m"

"randomly_execute_until": "700"

no_random_duplicates

Format

"no_random_duplicates": Boolean

Definition

When no_random_duplicates is set to true, PBench shuffles all queries and executes each one once before any query is repeated. When all queries have been executed, the pool resets and shuffles again if more executions are needed (based on randomly_execute_until).

This only takes effect when random_execution is true.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.

The default value is false.

Example:

{
  "random_execution": true,
  "randomly_execute_until": "20",
  "no_random_duplicates": true,
  "query_files": [
    "queries/q01.sql",
    "queries/q02.sql",
    "queries/q03.sql"
  ]
}

save_column_metadata

Format

"save_column_metadata": Boolean

Definition

Save a JSON file of the query's column metadata in the columns field of Presto's query API response.

Column metadata is saved once for a query on its first run, regardless of the number of cold_runs and warm_runs.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.

File names follow the naming process described in PBench Output File Name Format.

Example:

"save_column_metadata": true

save_json

Format

"save_json": Boolean

Definition

Set save_json to true to save a successful query's JSON after the query is executed. The file name is [query_name].json. For example, ds_power_query_59.json. This file is valuable when debugging a problem with a run of PBench.

A failed query also saves the error information for the query in a file named [query_name].error.json.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.

Example:

"save_json": true

save_output

Format

"save_output": Boolean

Definition

Set save_output to true to save the query result to files in raw form.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.

File names follow the naming process described in PBench Output File Name Format.

Example:

"save_output": true

schema

Format

"schema": "schema-name"

Definition

Set the schema for queries in queries and query_files.

catalog and schema cannot be set to null.

This parameter and its value are inherited by child stages.

If a child stage sets a different catalog, schema, or timezone than what the inherited client has, a new client is automatically created. You can also use start_on_new_client = true to force a new client (e.g., to get fresh session params).

Example:

"schema": "sf1"

session_params

Format

"session_params": {
  "session-property-name": "session-property-value"
}

Definition

Session properties passed to Presto.

This parameter and its value are inherited by child stages.

Set a session parameter to null to unset the value inherited from a parent stage.

If a child stage sets a different catalog, schema, or timezone than what the inherited client has, a new client is automatically created. Session params are always applied additively to the client when a new client is created. You can use start_on_new_client = true to force a new client with fresh session params.

Example:

"session_params": {
  "iceberg.hive_statistics_merge_strategy": "USE_NULLS_FRACTION_AND_NDV",
  "hive.pushdown_filter_enabled": false
}

Alternative: SET SESSION in query files

You can also set session parameters using SET SESSION statements directly in query files. The Presto client handles the server's response header, so the parameter is applied to all subsequent queries in the same client session:

SET SESSION join_reordering_strategy = 'NONE';
SELECT l_orderkey, l_partkey FROM lineitem;

This is useful when a session parameter applies to a specific query rather than an entire stage. See Setting Session Parameters in Query Files for details.

post_query_cycle_scripts

Format

"post_query_cycle_scripts": [
  "shell_command"
]

Definition

Run shell scripts after all runs (cold + warm) of each individual query have completed.

Example:

"post_query_cycle_scripts": [
  "echo \"all runs of this query are done\"",
  "python3 cleanup.py"
]

pre_query_scripts

Format

"pre_query_scripts": [
  "shell_command"
]

Definition

Run shell scripts before each individual query execution starts (once per cold/warm run).

Example:

"pre_query_scripts": [
  "echo \"query starting\"",
  "python3 prepare_query.py"
]

post_query_scripts

Format

"post_query_scripts": [
  "shell_command"
]

Definition

Run shell scripts after each individual query execution completes (once per cold/warm run).

Example:

"post_query_scripts": [
  "echo \"query finished\"",
  "python3 record_result.py"
]

post_stage_scripts

Format

"post_stage_scripts": [
  "shell_command"
]

Definition

Run shell scripts after executing all SQL queries in queries and query_files.

SQL queries in queries are executed first, then SQL queries in files listed in query_files are read and executed, then external commands in post_stage_scripts are run.

Example:

"post_stage_scripts": [
  "echo \"this is a script\"",
  "python3 test_script.py",
  "ls -l"
]

pre_query_cycle_scripts

Format

"pre_query_cycle_scripts": [
  "shell_command"
]

Definition

Run shell scripts before starting all runs (cold + warm) of each individual query.

Example:

"pre_query_cycle_scripts": [
  "echo \"starting runs for next query\"",
  "python3 prepare.py"
]

pre_stage_scripts

Format

"pre_stage_scripts": [
  "shell_command"
]

Definition

Run shell scripts before starting the execution of queries in a stage.

Example:

"pre_stage_scripts": [
  "echo \"stage is starting\"",
  "python3 setup.py"
]

Script Execution Order

The six script hooks execute in the following order during a stage:

  1. pre_stage_scripts — once, before any queries
  2. For each query:
    1. pre_query_cycle_scripts — before starting all runs of the query
    2. For each run (cold/warm):
      1. pre_query_scripts — before the run starts
      2. post_query_scripts — after the run completes
    3. post_query_cycle_scripts — after all runs of the query
  3. post_stage_scripts — once, after all queries
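The ordering above can be seen in a single hypothetical stage that sets all six hooks (the echo commands are placeholders). With cold_runs and warm_runs both set to 1, the stage scripts run once, the cycle scripts run once per query, and the per-run scripts run twice per query:

```json
{
  "cold_runs": 1,
  "warm_runs": 1,
  "queries": ["select 'query 1'"],
  "pre_stage_scripts": ["echo 'stage start'"],
  "pre_query_cycle_scripts": ["echo 'cycle start'"],
  "pre_query_scripts": ["echo 'run start'"],
  "post_query_scripts": ["echo 'run done'"],
  "post_query_cycle_scripts": ["echo 'cycle done'"],
  "post_stage_scripts": ["echo 'stage done'"]
}
```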

Script Behavior on Query Failure

When a query fails, whether subsequent hooks run depends on abort_on_error:

| Hook | abort_on_error: true | abort_on_error: false |
| --- | --- | --- |
| pre_query_scripts | Already ran before the query | Already ran before the query |
| post_query_scripts | Runs (error is joined with query error) | Runs (error is joined with query error) |
| post_query_cycle_scripts | Runs (teardown for pre_query_cycle_scripts) | Runs after all runs complete |
| Remaining queries in stage | Skipped | Continue running |
| post_stage_scripts | Runs | Runs |

Notes:

  • post_query_scripts always executes after each query run, even on failure. If both the query and the post-query script fail, their errors are combined via errors.Join.
  • post_query_cycle_scripts always runs as the teardown counterpart to pre_query_cycle_scripts, even when a query fails and abort_on_error is set. With abort_on_error: true, no further queries are executed after the cycle completes.
  • post_stage_scripts always runs before the stage exits.
  • If a script itself fails with abort_on_error: false, the failure is logged but execution continues. With abort_on_error: true, the stage aborts.

Script Environment Variables

All shell script hooks receive PBENCH_* environment variables with context about the current stage and query. These are injected automatically in addition to the process's existing environment.

| Variable | Available in | Description |
| --- | --- | --- |
| PBENCH_STAGE_ID | All hooks | The ID of the current stage |
| PBENCH_OUTPUT_DIR | All hooks | Path to the output directory for this run |
| PBENCH_QUERY_FILE | query-cycle and query hooks | Query file path (unset for inline queries) |
| PBENCH_QUERY_INDEX | query-cycle and query hooks | Zero-based index of the query within the batch |
| PBENCH_QUERY_SEQ | query hooks only | Sequence number within the cycle (0, 1, 2, ...) |
| PBENCH_QUERY_COLD_RUN | query hooks only | true if this is a cold run, false for warm |
| PBENCH_QUERY_ID | post_query_scripts only | Presto query ID (e.g., 20260101_abc123) |
| PBENCH_QUERY_ERROR | post_query_scripts only | Error message when the query fails (unset on success) |

Example: Upload failed query JSON to S3

"save_json": true,
"post_query_scripts": [
  "[ -n \"$PBENCH_QUERY_ERROR\" ] && aws s3 cp ${PBENCH_OUTPUT_DIR}/${PBENCH_STAGE_ID}_*.json s3://${MY_BUCKET}/queries/${PBENCH_QUERY_ID}.json || true"
]

start_on_new_client

Format

"start_on_new_client": Boolean

Definition

Set start_on_new_client to true to have this stage create a new client to execute itself. Each client has its own client information, tags, session properties, user credentials, and other parameters.

Example:

"start_on_new_client": true

stream_count

Format

"stream_count": integer

Definition

Specifies how many parallel instances of this stage should run. Each stream instance executes the same queries independently, with a deterministically derived random seed (RandSeed + i * 1000) for reproducible randomization.

Each stream gets a unique stage ID suffix (e.g., stage_stream_1, stage_stream_2) so output files don't collide.

This parameter is not inherited by child stages. Child stages only start after all stream instances complete.

The entire execution is reproducible from the single run-level --rand-seed value.

Example:

{
  "random_execution": true,
  "randomly_execute_until": "100",
  "no_random_duplicates": true,
  "stream_count": 3,
  "query_files": [
    "queries/q01.sql",
    "queries/q02.sql",
    "queries/q03.sql"
  ]
}

timezone

Format

"timezone": "timezone_string"

Definition

The value of timezone_string can be any value in the Time Zone ID column of the Time Zone ID table.

The default value of timezone is the user's local timezone.

If a child stage sets a different catalog, schema, or timezone than what the inherited client has, a new client is automatically created. You can also use start_on_new_client = true to force a new client (e.g., to get fresh session params).

This parameter and its value are inherited by child stages.

Example:

"timezone": "America/Los_Angeles"

warm_runs

Format

"warm_runs": integer

Definition

The number of query runs to perform after the number of cold runs. The default value is 0.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.

A cycle is defined as the total of (cold_runs + warm_runs) for a query.

Example:

"warm_runs": 2
