Parameters
PBench uses JSON files to define each stage of a benchmark. These stage files contain JSON format parameters.
Use the JSON parameters defined here to write stage files. For more information about stage files, see Creating a Stage File.
Format
"abort_on_error": Boolean
Definition
Set abort_on_error to true to abort all running and future stages of the benchmark when an error occurs.
This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.
Example:
"abort_on_error": true
Format
"catalog": "catalog-name"
Definition
Set the catalog for queries in queries and query_files.
catalog and schema cannot be set to null.
This parameter and its value are inherited by child stages.
If a child stage sets a different catalog, schema, or timezone than what the inherited client has, a new client is automatically created. You can also use start_on_new_client = true to force a new client (e.g., to get fresh session params).
Example:
"catalog": "iceberg"
Format
"cold_runs": integer
Definition
The number of cold runs to run to populate the cache. If not set, defaults to 0. However, if both cold_runs and warm_runs are 0 (neither is explicitly set), cold_runs is automatically set to 1 so that each query runs at least once.
This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.
A cycle is defined as the total of (cold_runs + warm_runs) for a query.
Example:
"cold_runs": 1
Format
"description": “Description of the JSON file.”
Definition
JSON does not support comments, so comments in PBench are formatted as data pairs. For more information see Comments Inside JSON - Commenting in a JSON File.
Begin every stage JSON file with a description.
description is a recognized field on the stage struct. It is not used during execution but is preserved when the stage is parsed.
Example:
"description": "Specifies the catalog and the schema for TPC-DS Iceberg scale factor 1 TB partitioned."
Format
"expected_row_counts": {
"file1": [
1,
1
],
"file2": [
1,
1
]
}
Definition
A map from [catalog.schema] keys to arrays of integers giving the expected row counts for the queries run under different schemas.
The key of this map can be one of:
- [schema] - matches the schema name regardless of the catalog it is under
- [catalog.schema] - matches both catalog and schema
- [regular expression] - matched against [catalog.schema]
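The three key forms can be sketched as a lookup function (a hypothetical Python illustration of the documented matching, not PBench's implementation):

```python
import re

def find_row_counts(expected_row_counts, catalog, schema):
    """Look up expected row counts for a catalog.schema pair.
    A key may be a bare schema name, a catalog.schema pair, or a
    regular expression matched against "catalog.schema"."""
    full = f"{catalog}.{schema}"
    for key, counts in expected_row_counts.items():
        if key == schema or key == full or re.fullmatch(key, full):
            return counts
    return None

counts = {r".*\.tpcds_sf10000": [100, 100], "tpcds_sf1000": [10, 10]}
print(find_row_counts(counts, "hive", "tpcds_sf10000"))   # regex key matches
print(find_row_counts(counts, "iceberg", "tpcds_sf1000")) # bare schema key matches
```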
List the expected row counts for queries first, then list the expected row counts for the queries in each query file listed in query_files.
Example:
"expected_row_counts": {
"tpcds_sf10000_": [
100,
100
],
"tpcds_sf1000_": [
100,
100
]
}
Use regular expressions to match multiple [catalog.schema] pairs. In this example, .*\\.tpcds_sf10000 matches hive.tpcds_sf10000 and iceberg.tpcds_sf10000.
"expected_row_counts": {
".*\\.tpcds_sf10000": [
100,
100
],
"tpcds_sf1000_": [
100,
100
]
}
Format
"next": [
"stage_2.json",
"stage_3.json"
]
Definition
Specifies one or more child stages of the current stage. Child stages start after the parent stage finishes, and in parallel with each other. Child stages inherit some parameters from the parent stage if those parameters are not explicitly set in the child stage.
Example:
"next": [
"stage_2.json",
"stage_3.json"
]
Format
"queries": [
"query_string"
]
Definition
Run the SQL query in query_string. If a query is long or complex, or there are several queries, consider saving the queries in a SQL file to be run using query_files.
Do not end the SQL query in query_string with a semicolon.
SQL queries in queries are executed first, then SQL queries in files listed in query_files are read and executed, then external commands in post_stage_scripts are run.
Example:
"queries": [
"select 'query 1'"
]
Format
"query_files": [
"file1",
"file2",
"directory"
]
Definition
One or more files or directories containing SQL queries.
SQL queries in queries are executed first, then SQL queries in files listed in query_files are read and executed, then external commands in post_stage_scripts are run.
A relative file path in the query_files array is evaluated based on the location of the stage JSON file.
When an entry is a directory, PBench expands it to all files in the directory (non-recursive), sorted by filename. Subdirectories within the directory are ignored. Directory expansion happens at execution time (after pre_stage_scripts run), so scripts can generate query files into a directory before they are executed.
Examples:
"query_files": [
"queries/query_01.sql",
"queries/query_02.sql"
]
Using a directory with pre_stage_scripts to generate queries:
{
"pre_stage_scripts": [
"dsqgen -DIRECTORY templates -SCALE 1000 -RNGSEED 12345 -OUTPUT_DIR generated/"
],
"query_files": ["generated"]
}
Format
"random_execution": Boolean
Definition
When random_execution is set to false, PBench runs the queries in queries and query_files sequentially.
When random_execution is set to true, PBench runs the queries in queries and query_files in random order, until the duration or query count set using randomly_execute_until is reached.
Each query file counts as 1 regardless of the number of queries in that query file. For example, a stage has:
- 3 queries in queries
- 2 query files in query_files, with 3 queries in each file
random_execution selects from 5 (3 queries + 2 query files), not 9 (3 queries + 3 queries in one file + 3 queries in the other file).
If a query file is selected, all of the queries in the file are executed and it is counted as 1 selection towards the integer specified in randomly_execute_until.
Expected row counts are ignored when random_execution is set to true.
This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.
The default value of random_execution is false.
Example:
"random_execution": true
Format
"randomly_execute_until": "duration"
"randomly_execute_until": "integer"
Definition
Specify either
- a duration, like 15m, 1h, or 120h
- an integer, as the number of queries
to randomly run SQL queries. Valid duration units are h, m, s, ms, us, and ns (Go's time.ParseDuration format). Note that d (days) is not a valid unit; use 120h instead of 5d for 5 days.
This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.
Example:
"randomly_execute_until": "15m"
"randomly_execute_until": "700"
Format
"no_random_duplicates": Boolean
Definition
When no_random_duplicates is set to true, PBench shuffles all queries and executes each one once before any query is repeated. When all queries have been executed, the pool resets and shuffles again if more executions are needed (based on randomly_execute_until).
This only takes effect when random_execution is true.
This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.
The default value is false.
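The shuffle-and-reset behavior can be sketched as follows (a hypothetical Python illustration, not PBench's implementation):

```python
import random

def no_duplicate_stream(items, total, rng):
    """Yield `total` selections; every item appears once per
    shuffled pass before any item repeats."""
    pool = []
    for _ in range(total):
        if not pool:              # pool exhausted: reshuffle and reset
            pool = list(items)
            rng.shuffle(pool)
        yield pool.pop()

rng = random.Random(42)
picks = list(no_duplicate_stream(["q01", "q02", "q03"], 7, rng))
# Each consecutive group of 3 contains every query exactly once.
print(sorted(picks[0:3]), sorted(picks[3:6]), picks[6:])
```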
Example:
{
"random_execution": true,
"randomly_execute_until": "20",
"no_random_duplicates": true,
"query_files": [
"queries/q01.sql",
"queries/q02.sql",
"queries/q03.sql"
]
}
Format
"save_column_metadata": Boolean
Definition
Save a JSON file of the query's column metadata, as reported in the columns field of Presto's query API response.
Column metadata is saved once for a query on its first run, regardless of the number of cold_runs and warm_runs.
This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.
The file name format uses the naming process as described in PBench Output File Name Format.
Example:
"save_column_metadata": true
Format
"save_json": Boolean
Definition
Set save_json to true to save a successful query's JSON after the query is executed. The file name is [query_name].json. For example, ds_power_query_59.json. This file is valuable when debugging a problem with a run of PBench.
A failed query also saves the error information for the query in a file named [query_name].error.json.
This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.
Example:
"save_json": true
Format
"save_output": Boolean
Definition
Set save_output to true to save the query result to files in raw form.
Set the parameter to null in a stage to unset the value inherited from a parent stage.
This parameter and its value are inherited by child stages.
The file name format uses the naming process as described in PBench Output File Name Format.
Example:
"save_output": true
Format
"schema": "schema-name"
Definition
Set the schema for queries in queries and query_files.
catalog and schema cannot be set to null.
This parameter and its value are inherited by child stages.
If a child stage sets a different catalog, schema, or timezone than what the inherited client has, a new client is automatically created. You can also use start_on_new_client = true to force a new client (e.g., to get fresh session params).
Example:
"schema": "sf1"
Format
"session_params": {
"session-property-name": "session-property-value"
}
Definition
Session properties passed to Presto.
This parameter and its value are inherited by child stages.
Set a session parameter to null to unset the value inherited from a parent stage.
If a child stage sets a different catalog, schema, or timezone than what the inherited client has, a new client is automatically created. Session params are always applied additively to the client when a new client is created. You can use start_on_new_client = true to force a new client with fresh session params.
Example:
"session_params": {
"iceberg.hive_statistics_merge_strategy": "USE_NULLS_FRACTION_AND_NDV",
"hive.pushdown_filter_enabled": false
}
Alternative: SET SESSION in query files
You can also set session parameters using SET SESSION statements directly in query files. The Presto client handles the server's response header, so the parameter is applied to all subsequent queries in the same client session:
SET SESSION join_reordering_strategy = 'NONE';
SELECT l_orderkey, l_partkey FROM lineitem;
This is useful when a session parameter applies to a specific query rather than an entire stage. See Setting Session Parameters in Query Files for details.
Format
"post_query_cycle_scripts": [
"shell_command"
]
Definition
Run shell scripts after all runs (cold + warm) of each individual query have completed.
Example:
"post_query_cycle_scripts": [
"echo \"all runs of this query are done\"",
"python3 cleanup.py"
]
Format
"pre_query_scripts": [
"shell_command"
]
Definition
Run shell scripts before each individual query execution starts (once per cold/warm run).
Example:
"pre_query_scripts": [
"echo \"query starting\"",
"python3 prepare_query.py"
]
Format
"post_query_scripts": [
"shell_command"
]
Definition
Run shell scripts after each individual query execution completes (once per cold/warm run).
Example:
"post_query_scripts": [
"echo \"query finished\"",
"python3 record_result.py"
]
Format
"post_stage_scripts": [
"shell_command"
]
Definition
Run shell scripts after executing all SQL queries in queries and query_files.
SQL queries in queries are executed first, then SQL queries in files listed in query_files are read and executed, then external commands in post_stage_scripts are run.
Example:
"post_stage_scripts": [
"echo \"this is a script\"",
"python3 test_script.py",
"ls -l"
]
Format
"pre_query_cycle_scripts": [
"shell_command"
]
Definition
Run shell scripts before starting all runs (cold + warm) of each individual query.
Example:
"pre_query_cycle_scripts": [
"echo \"starting runs for next query\"",
"python3 prepare.py"
]
Format
"pre_stage_scripts": [
"shell_command"
]
Definition
Run shell scripts before starting the execution of queries in a stage.
Example:
"pre_stage_scripts": [
"echo \"stage is starting\"",
"python3 setup.py"
]
The six script hooks execute in the following order during a stage:
- pre_stage_scripts — once, before any queries
- For each query:
  - pre_query_cycle_scripts — before starting all runs of the query
  - For each run (cold/warm):
    - pre_query_scripts — before the run starts
    - post_query_scripts — after the run completes
  - post_query_cycle_scripts — after all runs of the query
- post_stage_scripts — once, after all queries
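The nesting above can be sketched as a small driver (a hypothetical Python illustration; the hook names match the parameters documented here):

```python
def hook_order(queries, cold_runs, warm_runs):
    """Return the sequence of script hooks for a stage, following
    the documented nesting (stage -> query cycle -> query run)."""
    events = ["pre_stage_scripts"]
    for q in queries:
        events.append(f"pre_query_cycle_scripts({q})")
        for _ in range(cold_runs + warm_runs):
            events.append(f"pre_query_scripts({q})")
            events.append(f"post_query_scripts({q})")
        events.append(f"post_query_cycle_scripts({q})")
    events.append("post_stage_scripts")
    return events

for event in hook_order(["q1", "q2"], cold_runs=1, warm_runs=1):
    print(event)
```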
When a query fails, whether subsequent hooks run depends on abort_on_error:
| Hook | abort_on_error: true | abort_on_error: false |
|---|---|---|
| pre_query_scripts | Already ran before the query | Already ran before the query |
| post_query_scripts | Runs (error is joined with query error) | Runs (error is joined with query error) |
| post_query_cycle_scripts | Runs (teardown for pre_query_cycle_scripts) | Runs after all runs complete |
| Remaining queries in stage | Skipped | Continue running |
| post_stage_scripts | Runs | Runs |
Notes:
- post_query_scripts always executes after each query run, even on failure. If both the query and the post-query script fail, their errors are combined via errors.Join.
- post_query_cycle_scripts always runs as the teardown counterpart to pre_query_cycle_scripts, even when a query fails and abort_on_error is set. With abort_on_error: true, no further queries are executed after the cycle completes.
- post_stage_scripts always runs before the stage exits.
- If a script itself fails with abort_on_error: false, the failure is logged but execution continues. With abort_on_error: true, the stage aborts.
All shell script hooks receive PBENCH_* environment variables with context about the current stage and query. These are injected automatically in addition to the process's existing environment.
| Variable | Available in | Description |
|---|---|---|
| PBENCH_STAGE_ID | All hooks | The ID of the current stage |
| PBENCH_OUTPUT_DIR | All hooks | Path to the output directory for this run |
| PBENCH_QUERY_FILE | query-cycle and query hooks | Query file path (unset for inline queries) |
| PBENCH_QUERY_INDEX | query-cycle and query hooks | Zero-based index of the query within the batch |
| PBENCH_QUERY_SEQ | query hooks only | Sequence number within the cycle (0, 1, 2, ...) |
| PBENCH_QUERY_COLD_RUN | query hooks only | true if this is a cold run, false for warm |
| PBENCH_QUERY_ID | post_query_scripts only | Presto query ID (e.g., 20260101_abc123) |
| PBENCH_QUERY_ERROR | post_query_scripts only | Error message when the query fails (unset on success) |
Example: Upload failed query JSON to S3
"save_json": true,
"post_query_scripts": [
"[ -n \"$PBENCH_QUERY_ERROR\" ] && aws s3 cp ${PBENCH_OUTPUT_DIR}/${PBENCH_STAGE_ID}_*.json s3://${MY_BUCKET}/queries/${PBENCH_QUERY_ID}.json || true"
]
Format
"start_on_new_client": Boolean
Definition
Set start_on_new_client to true to make this stage create a new client to execute itself. Each client has its own set of client information, tags, session properties, user credentials, and other parameters.
Example:
"start_on_new_client": true
Format
"stream_count": integer
Definition
Specifies how many parallel instances of this stage should run. Each stream instance executes the same queries independently, with a deterministically derived random seed (RandSeed + i * 1000) for reproducible randomization.
Each stream gets a unique stage ID suffix (e.g., stage_stream_1, stage_stream_2) so output files don't collide.
This parameter is not inherited by child stages. Child stages only start after all stream instances complete.
The entire execution is reproducible from the single run-level --rand-seed value.
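The per-stream seed derivation can be sketched as follows (a hypothetical Python illustration; the formula RandSeed + i * 1000 comes from the description above, and zero-based stream indexing is an assumption):

```python
def stream_seeds(rand_seed, stream_count):
    """Each stream i derives a deterministic seed so the whole run
    is reproducible from the single run-level --rand-seed value.
    Zero-based indexing here is an assumption for illustration."""
    return [rand_seed + i * 1000 for i in range(stream_count)]

print(stream_seeds(12345, 3))  # [12345, 13345, 14345]
```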
Example:
{
"random_execution": true,
"randomly_execute_until": "100",
"no_random_duplicates": true,
"stream_count": 3,
"query_files": [
"queries/q01.sql",
"queries/q02.sql",
"queries/q03.sql"
]
}
Format
"timezone": timezone_string
Definition
The value of timezone_string can be any value in the Time Zone ID column of Time Zone ID.
The default value of timezone is the user's local timezone.
If a child stage sets a different catalog, schema, or timezone than what the inherited client has, a new client is automatically created. You can also use start_on_new_client = true to force a new client (e.g., to get fresh session params).
This parameter and its value are inherited by child stages.
Example:
"timezone": "America/Los_Angeles"
Format
"warm_runs": integer
Definition
The number of query runs to perform after the number of cold runs. The default value is 0.
This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.
A cycle is defined as the total of (cold_runs + warm_runs) for a query.
Example:
"warm_runs": 2