Command Reference
This page presents the online help for all pbench commands.
Run ./pbench --help to see the online help for pbench.
Tool for running Presto benchmarks
Usage:
pbench [command]
Available Commands:
cmp Compare two query result directories
completion Generate the autocompletion script for the specified shell
forward Watch incoming query workloads from the first Presto cluster (cluster 0) and forward them to the rest clusters.
genconfig Generate benchmark cluster configurations
genddl Generate DDL scripts based on a config file
help Help about any command
loadjson Load query JSON files into event listener database and run recorders
queryplan Parse query plan
replay Replay workload from a CSV file
round Round the decimal values in the benchmark query output files for easier comparison.
run Run a benchmark
save Save table information for recreating the schema and data
Flags:
-h, --help help for pbench
Use "pbench [command] --help" for more information about a command.
For more information about pbench cmp see Comparing Benchmarks.
Run ./pbench cmp --help to see the online help for pbench cmp.
Compare two query result directories
Usage:
pbench cmp [flags] [directory 1] [directory 2]
Flags:
-r, --file-id-regex string regex to extract file id from file names in two directories to find matching files to compare (default ".*(query_\\d+)(?:_c0)?(?:_ordered)?\\.output")
-h, --help help for cmp
-o, --output-path string diff output path (default "./diff")
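To check which result files the default --file-id-regex pairs up, here is a quick sketch with grep. The non-capturing (?:...) groups and \d from the Go regex are rewritten to POSIX syntax, since grep -E supports neither:

```shell
# Sketch: filenames matched by the default --file-id-regex. grep -E has no
# \d or (?:...), so [0-9] and plain groups stand in for them here.
printf '%s\n' query_1.output query_1_c0_ordered.output notes.txt \
  | grep -E '.*(query_[0-9]+)(_c0)?(_ordered)?\.output'
# prints query_1.output and query_1_c0_ordered.output; both share file id query_1
```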
Compare the output of two benchmark runs:
pbench cmp results/run_baseline results/run_experiment
Compare with a custom regex that matches files like q1.csv, q2.csv:
pbench cmp -r ".*(q\d+)\.csv" results/baseline results/experiment -o my_diff
Run ./pbench completion --help to see the online help for pbench completion.
Generate the autocompletion script for pbench for the specified shell.
See each sub-command's help for details on how to use the generated script.
Usage:
pbench completion [command]
Available Commands:
bash Generate the autocompletion script for bash
fish Generate the autocompletion script for fish
powershell Generate the autocompletion script for powershell
zsh Generate the autocompletion script for zsh
Flags:
-h, --help help for completion
Use "pbench completion [command] --help" for more information about a command.
Run ./pbench forward --help to see the online help for pbench forward.
Watch incoming query workloads from the first Presto cluster (cluster 0) and forward them to the rest clusters.
Usage:
pbench forward [flags]
Flags:
--dry-run Turning on dry run will only show the queries but not sending them to the target server.
-x, --exclude stringArray Regular expressions to filter queries to forward
--force-https bools Force all API requests to use HTTPS (default [false])
-h, --help help for forward
-n, --name string Assign a name to this run. (default: "forward_<current time>")
-o, --output-path string Output directory path (default "current directory")
-p, --password stringArray Presto user password (optional)
-i, --poll-interval duration Interval between polls to the source cluster (default 5s)
-r, --replace stringArray Pairs of regular expressions to match pattern in the query and the replacement expression.
Use $1, $2, ... to reference capture groups. This will be applied after filters.
-m, --schema-mapping strings Pairs of schema names to establish schema mapping relationships for schema replacement
when forwarding queries. You can specify something like -m schema1,schema2
-s, --server stringArray Presto server address (default [http://127.0.0.1:8080])
-u, --user stringArray Presto user name (default [pbench])
The forward command monitors query workloads on a source Presto cluster and replays them on one or more target clusters. The first --server is the source cluster (cluster 0); additional --server entries are target clusters.
Use --exclude to filter out queries matching specific patterns. Use --replace to modify queries before forwarding (e.g., changing table names). Use --schema-mapping to remap schema names between clusters.
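A --replace substitution can be prototyped locally before a run. This sed sketch (not pbench itself) shows the same capture-group idea; note that pbench takes Go regexp syntax with $1-style references, while sed uses \1:

```shell
# Illustration only: rewrite a table reference with a capture group, the
# same mechanism --replace applies to forwarded queries.
echo 'SELECT * FROM prod_schema.orders' \
  | sed -E 's/FROM ([a-z_]+)\.orders/FROM \1_copy.orders/'
# prints: SELECT * FROM prod_schema_copy.orders
```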
Forward all queries from a source cluster to two target clusters:
pbench forward \
-s http://source-cluster:8080 \
-s http://target-cluster-1:8080 \
-s http://target-cluster-2:8080
Forward with schema mapping (queries referencing prod_schema on the source will use test_schema on targets):
pbench forward \
-s http://source:8080 \
-s http://target:8080 \
-m prod_schema,test_schema
Dry run to preview which queries would be forwarded, excluding EXPLAIN queries:
pbench forward \
-s http://source:8080 \
-s http://target:8080 \
-x "^EXPLAIN" \
--dry-run
For more information about pbench genconfig see Generating Benchmark Configurations.
Run ./pbench genconfig --help to see the online help for pbench genconfig.
Generate benchmark cluster configurations
Usage:
pbench genconfig [flags] [directory to search recursively for config.json]
pbench genconfig [command]
Available Commands:
default Print the built-in default generator parameter file.
Flags:
-h, --help help for genconfig
-p, --parameter-file stringArray Specifies a parameter file. Can be repeated; later files override earlier ones. Use built-in defaults if not specified.
-t, --template-dir string Specifies the template directory. Use built-in template if not specified.
Use "pbench genconfig [command] --help" for more information about a command.
Print the built-in default generator parameters:
pbench genconfig default
Generate configurations for all clusters found in a directory:
pbench genconfig /path/to/clusters/
Stack multiple parameter files (later overrides earlier):
pbench genconfig -p base-params.json -p overrides.json /path/to/clusters/
Run ./pbench genddl --help to see the online help for pbench genddl.
Generate DDL scripts based on a config file
Usage:
pbench genddl [config file]
Flags:
-h, --help help for genddl
All paths (table definitions, templates, output directories) are resolved relative to the config file's directory, so you can run pbench genddl from any working directory.
The config file specifies the workload, scale factor, file format, and compression method. An example config (cmd/genddl/config.json):
{
"scale_factor": "10",
"file_format": "parquet",
"compression_method": "uncompressed"
}
This defaults to TPC-DS. To generate DDL for a different workload (e.g., TPC-H), add the workload and workload_definition fields:
{
"workload": "tpch",
"workload_definition": "tpc-h",
"scale_factor": "100",
"file_format": "parquet",
"compression_method": "uncompressed"
}
| Field | Description | Default |
|---|---|---|
| scale_factor | Data scale factor (e.g., "10", "1k") | (required) |
| file_format | Storage format (e.g., "parquet") | (required) |
| compression_method | "uncompressed" or "zstd" | (required) |
| workload | Presto connector/catalog name used in INSERT statements and schema naming | "tpcds" |
| workload_definition | Subdirectory name under definition/ containing table JSON files | "tpc-ds" |
The command generates CREATE TABLE and INSERT scripts for all four schema variants (Hive/Iceberg x non-partitioned/partitioned). Pre-generated examples are available in cmd/genddl/generated-examples/.
The config file expects the following sibling directories:
<config-dir>/
config.json # the config file
definition/<workload>/ # table definition JSON files (one per table)
*.tmpl # SQL/shell templates
out/ # generated output (cleaned on each run)
generated-examples/ # named output for version control
pbench genddl cmd/genddl/config.json
Output is written to <config-dir>/out/ and also to a named subdirectory under <config-dir>/generated-examples/ (e.g., tpcds-sf10-parquet/).
Run ./pbench help --help to see the online help for pbench help.
Help provides help for any command in the application.
Simply type pbench help [path to command] for full details.
Usage:
pbench help [command] [flags]
Flags:
-h, --help help for help
For example, the two commands
pbench genconfig default --help
pbench help genconfig default
return the same output:
Print the built-in default generator parameter file.
Usage:
pbench genconfig default
Flags:
-h, --help help for default
Run ./pbench loadjson --help to see the online help for pbench loadjson.
Load query JSON files into event listener database and run recorders
Usage:
pbench loadjson [flags] [list of files or directories to process]
Flags:
-c, --comment string Add a comment to this run (optional)
-x, --extract-plan Extract the plan JSON from query JSON then save them to the output path
-h, --help help for loadjson
--influx string InfluxDB connection config for run recorder (optional)
--mysql string MySQL connection config for event listener and run recorder (optional)
-n, --name string Assign a name to this run. (default: "load_<current time>")
-o, --output-path string Output directory path (default "current directory")
-P, --parallel int Number of parallel threads to load json files (default 10)
-r, --record-run Record all the loaded JSON as a run
The default value of -P shown in the help output reflects the machine the help was generated on; at runtime it defaults to the number of CPU cores on the system.
Use --extract-plan to extract the query plan JSON from each query info file and save it as a separate .plan.json file in the output directory. This is useful for offline plan analysis without needing a MySQL database.
The input files are Presto query info JSON files, which are the responses from the Presto /v1/query/{queryId} API endpoint. These are also the files saved by pbench run when save_json is enabled.
Load query JSON files into a MySQL database and record the run:
pbench loadjson --mysql mysql.json --record-run -c "baseline run" /path/to/query_jsons/
Extract query plans from JSON files without a database:
pbench loadjson --extract-plan -o plans/ /path/to/query_jsons/
Run ./pbench queryplan --help to see the online help for pbench queryplan.
Read a CSV file, parse the "query plan" column, and write the JOIN information into a JSON file
Usage:
pbench queryplan [flags] <CSV file>
Flags:
-c, --column int The column index for the Query Plans in the CSV file (index starts with 0)
-H, --has-header contain the header line or not (default true)
-h, --help help for queryplan
-o, --output string Output JSON file (default "queryplan.json")
Parse query plans from the third column of a CSV file:
pbench queryplan -c 2 -o join_analysis.json exported_queries.csv
Run ./pbench replay --help to see the online help for pbench replay.
Replay workload from a CSV file
The fields in the CSV file are:
"query_id","create_time","wall_time_millis","output_rows","written_output_rows","catalog","schema","session_properties","query"
We also expect the queries in this CSV file are sorted by "create_time" in ascending order.
Usage:
pbench replay [flags] [workload csv file]
Flags:
--force-https Force all API requests to use HTTPS
-h, --help help for replay
-n, --name string Assign a name to this run. (default: "replay_<current time>")
-o, --output-path string Output directory path (default "current directory")
-P, --parallel int Maximum number of concurrent queries to replay (default 150)
-p, --password string Presto user password (optional)
-s, --server string Presto server address (default "http://127.0.0.1:8080")
-u, --user string Presto user name (default "pbench")
NOTE: Unlike pbench loadjson and pbench save where -P defaults to the number of CPU cores, pbench replay defaults to 150 concurrent queries.
NOTE: The default value for -P (150) is suitable for most workloads. If the recorded customer workload has a higher server concurrency level, you should increase this value to match. Otherwise, the replay will artificially throttle the workload and the timing will not accurately reflect the original traffic pattern.
The input CSV file is typically exported from a Presto event listener database. The CSV must be sorted by create_time in ascending order so that queries are replayed at the same relative timing as the original workload.
The header fields are query_id, create_time, wall_time_millis, output_rows, written_output_rows, catalog, schema, session_properties, query. Newlines within SQL queries are encoded as <<>>.
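If an export is not already in ascending create_time order, it can be re-sorted before replay. A sketch with sample rows (invented for illustration); this works because the first two fields never contain commas, and the fixed-width timestamp format makes lexical order match chronological order:

```shell
# Sketch: sort a workload CSV by the quoted create_time column (field 2),
# keeping the header line first. The sample rows are invented.
cat > workload.csv <<'EOF'
"query_id","create_time","wall_time_millis","output_rows","written_output_rows","catalog","schema","session_properties","query"
"q2","2024-04-16 00:01:30.789 UTC","2","1","0","hive","s","{}","SELECT 2"
"q1","2024-04-16 00:01:17.466 UTC","1","1","0","hive","s","{}","SELECT 1"
EOF
{ head -n 1 workload.csv; tail -n +2 workload.csv | sort -t, -k2,2; } > workload_sorted.csv
```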
"query_id","create_time","wall_time_millis","output_rows","written_output_rows","catalog","schema","session_properties","query"
"20240416_000117_00004_jgufv","2024-04-16 00:01:17.466 UTC","14","1","0","hive","tpcds_sf1000","{query_max_execution_time=3h, join_distribution_type=AUTOMATIC}","SELECT count(*) FROM store_sales"
"20240416_000130_00005_jgufv","2024-04-16 00:01:30.789 UTC","2500","100","0","hive","tpcds_sf1000","{}","SELECT * FROM customer<<>>WHERE c_customer_sk < 100"
Replay the workload against a target cluster with up to 200 concurrent queries:
pbench replay workload.csv -s http://presto-server:8080 -P 200
Run ./pbench round --help to see the online help for pbench round.
Note: pbench round is an experimental feature that is not included in the default builds in Releases.
The program matches every column of the first row to determine which columns contain long decimals.
After the first row, it inspects only those matched columns, so if an overly long decimal first appears in the second row or later, it may not be handled properly.
A PR was opened to fix the native/Java decimal precision discrepancy but so far it does not work quite well:
https://github.com/facebookincubator/velox/pull/7944
Usage:
pbench round [flags] [list of files or directories to process]
Flags:
-e, --file-extension stringArray Specifies the file extensions to include for processing (including the dot). You can specify multiple file extensions. (default [.output])
-f, --format string Specifies the format of the files. Accepted values are: "csv" or "json" which is the output file from the "run" command (default "json")
-h, --help help for round
-p, --precision int Decimal precision to preserve. (default 12)
-r, --recursive Recursively walk a path if a directory is provided in the arguments.
-i, --rewrite-in-place When turned on, we will rewrite the file in-place. Otherwise, we save the rewritten file separately.
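The underlying idea can be illustrated with awk; this is not pbench's implementation, just the rounding concept:

```shell
# Illustration of the rounding concept: trim an over-precise decimal to 12
# digits so native and Java outputs compare equal after rounding.
echo '3.14159265358979312' | awk '{ printf "%.12f\n", $1 }'
# prints: 3.141592653590
```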
Round all .output files in a benchmark result directory to 10 decimal places, in-place:
pbench round -p 10 -i -r results/my_run/
Round specific files and save the rounded copies separately:
pbench round -p 12 results/my_run/query_01.output results/my_run/query_02.output
For more information about pbench run, see The Run Command.
NOTE: To set Presto session properties for pbench run, use the session_params field in a stage JSON file rather than a CLI flag. This allows session properties to be inherited by child stages and scoped per stage. See Parameters - session_params for details.
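As an illustration, a stage-file fragment setting session properties might look like the following; the map-of-strings shape is an assumption (the property names are taken from the replay example on this page), and the linked Parameters page is authoritative:

```json
{
  "session_params": {
    "query_max_execution_time": "3h",
    "join_distribution_type": "AUTOMATIC"
  }
}
```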
Run the TPC-H power test at scale factor 1 against a local Presto server (uses the built-in tpch connector):
pbench run benchmarks/tpch/sf1.json benchmarks/tpch/tpch.json
Run TPC-DS at scale factor 10 with the Java OSS engine, save query info JSON files, and compare output for correctness:
pbench run \
benchmarks/java_oss.json \
benchmarks/tpc-ds/sf10.json \
benchmarks/tpc-ds/ds_power.json \
benchmarks/save_json.json \
benchmarks/save_output.json
Run TPC-H throughput test (40 randomized streams) with 1 cold run and 2 warm runs:
pbench run \
benchmarks/tpch/sf1.json \
benchmarks/tpch/streams/stream_01.json \
benchmarks/c1w2.json \
-s http://presto-server:8080
The stage JSON files are composable building blocks. See the benchmarks/ directory for the full set of available configurations, including:
- benchmarks/tpch/ - TPC-H queries and scale factors
- benchmarks/tpc-ds/ - TPC-DS queries, power runs, and throughput streams
- benchmarks/test/ - Simple test stages demonstrating stage structure and scripts
Run ./pbench save --help to see the online help for pbench save.
Save table information for recreating the schema and data
Usage:
pbench save [flags] [list of table names]
Flags:
--catalog string Catalog name
-f, --file string CSV file to read catalog,schema,table
--force-https Force all API requests to use HTTPS
-h, --help help for save
-n, --no-analyze Do not run additional queries to analyze table when stats were missing.
-o, --output-path string Output directory path (default "current directory")
-P, --parallel int Number of parallel threads to save table summaries. (default 10)
-p, --password string Presto user password (optional)
--schema string Schema name
-s, --server string Presto server address (default "http://127.0.0.1:8080")
--session stringArray Session property (property can be used multiple times; format is
key=value; use 'SHOW SESSION' in Presto CLI to see available properties)
--trino Use Trino protocol
-u, --user string Presto user name (default "pbench")
The default value of -P shown in the help output reflects the machine the help was generated on; at runtime it defaults to the number of CPU cores on the system.
Save metadata for specific tables:
pbench save -s http://presto-server:8080 --catalog hive --schema tpcds_sf1000 \
customer store_sales date_dim
Save tables listed in a CSV file (each line: catalog,schema,table):
pbench save -s http://presto-server:8080 -f tables.csv -o saved_schemas/
Save with a Trino server, skipping additional analyze queries:
pbench save -s http://trino-server:8080 --trino --no-analyze \
--catalog iceberg --schema benchmark customer store_sales
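The -f file is a plain CSV of catalog,schema,table triples (assumed headerless, per the example lead-in above). A minimal sketch of building one, reusing table names from the examples on this page:

```shell
# Sketch: a tables.csv for pbench save -f, one catalog,schema,table per line.
cat > tables.csv <<'EOF'
hive,tpcds_sf1000,customer
hive,tpcds_sf1000,store_sales
iceberg,benchmark,date_dim
EOF
```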