Command Reference
This page presents the online help for all pbench commands.
Run ./pbench --help to see the online help for pbench.
Tool for running Presto benchmarks
Usage:
pbench [command]
Available Commands:
cmp Compare two query result directories
completion Generate the autocompletion script for the specified shell
forward Watch incoming query workloads from the first Presto cluster (cluster 0) and forward them to the rest clusters.
genconfig Generate benchmark cluster configurations
genddl Generate DDL scripts based on a config file
help Help about any command
loadjson Load query JSON files into event listener database and run recorders
queryplan Parse query plan
replay Replay workload from a CSV file
round Round the decimal values in the benchmark query output files for easier comparison.
run Run a benchmark
save Save table information for recreating the schema and data
Flags:
-h, --help help for pbench
Use "pbench [command] --help" for more information about a command.
For more information about pbench cmp see Comparing Benchmarks.
Run ./pbench cmp --help to see the online help for pbench cmp.
Compare two query result directories
Usage:
pbench cmp [flags] [directory 1] [directory 2]
Flags:
-r, --file-id-regex string regex to extract file id from file names in two directories to find matching files to compare (default ".*(query_\\d+)(?:_c0)?(?:_ordered)?\\.output")
-h, --help help for cmp
-o, --output-path string diff output path (default "./diff")
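To check which result files the default --file-id-regex pairs up, here is a quick sketch with grep. The non-capturing (?:...) groups and \d from the Go regex are rewritten to POSIX syntax, since grep -E supports neither:

```shell
# Sketch: filenames matched by the default --file-id-regex. grep -E has no
# \d or (?:...), so [0-9] and plain groups stand in for them here.
printf '%s\n' query_1.output query_1_c0_ordered.output notes.txt \
  | grep -E '.*(query_[0-9]+)(_c0)?(_ordered)?\.output'
# prints query_1.output and query_1_c0_ordered.output; both share file id query_1
```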
Compare the output of two benchmark runs:
pbench cmp results/run_baseline results/run_experiment
Compare with a custom regex that matches files like q1.csv, q2.csv:
pbench cmp -r ".*(q\d+)\.csv" results/baseline results/experiment -o my_diff
Run ./pbench completion --help to see the online help for pbench completion.
Generate the autocompletion script for pbench for the specified shell.
See each sub-command's help for details on how to use the generated script.
Usage:
pbench completion [command]
Available Commands:
bash Generate the autocompletion script for bash
fish Generate the autocompletion script for fish
powershell Generate the autocompletion script for powershell
zsh Generate the autocompletion script for zsh
Flags:
-h, --help help for completion
Use "pbench completion [command] --help" for more information about a command.
Run ./pbench forward --help to see the online help for pbench forward.
Watch incoming query workloads from the first Presto cluster (cluster 0) and forward them to the rest clusters.
Usage:
pbench forward [flags]
Flags:
--dry-run Turning on dry run will only show the queries but not sending them to the target server.
-x, --exclude stringArray Regular expressions to filter queries to forward
--force-https bools Force all API requests to use HTTPS (default [false])
-h, --help help for forward
-n, --name string Assign a name to this run. (default: "forward_<current time>")
-o, --output-path string Output directory path (default "current directory")
-p, --password stringArray Presto user password (optional)
-i, --poll-interval duration Interval between polls to the source cluster (default 5s)
-r, --replace stringArray Pairs of regular expressions to match pattern in the query and the replacement expression.
Use $1, $2, ... to reference capture groups. This will be applied after filters.
-m, --schema-mapping strings Pairs of schema names to establish schema mapping relationships for schema replacement
when forwarding queries. You can specify something like -m schema1,schema2
-s, --server stringArray Presto server address (default [http://127.0.0.1:8080])
-u, --user stringArray Presto user name (default [pbench])
The forward command monitors query workloads on a source Presto cluster and replays them on one or more target clusters. The first --server is the source cluster (cluster 0); additional --server entries are target clusters.
Use --exclude to filter out queries matching specific patterns. Use --replace to modify queries before forwarding (e.g., changing table names). Use --schema-mapping to remap schema names between clusters.
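A --replace substitution can be prototyped locally before a run. This sed sketch (not pbench itself) shows the same capture-group idea; note that pbench takes Go regexp syntax with $1-style references, while sed uses \1:

```shell
# Illustration only: rewrite a table reference with a capture group, the
# same mechanism --replace applies to forwarded queries.
echo 'SELECT * FROM prod_schema.orders' \
  | sed -E 's/FROM ([a-z_]+)\.orders/FROM \1_copy.orders/'
# prints: SELECT * FROM prod_schema_copy.orders
```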
Forward all queries from a source cluster to two target clusters:
pbench forward \
-s http://source-cluster:8080 \
-s http://target-cluster-1:8080 \
-s http://target-cluster-2:8080
Forward with schema mapping (queries referencing prod_schema on the source will use test_schema on targets):
pbench forward \
-s http://source:8080 \
-s http://target:8080 \
-m prod_schema,test_schema
Dry run to preview which queries would be forwarded, excluding EXPLAIN queries:
pbench forward \
-s http://source:8080 \
-s http://target:8080 \
-x "^EXPLAIN" \
--dry-run
For more information about pbench genconfig see Generating Benchmark Configurations.
Run ./pbench genconfig --help to see the online help for pbench genconfig.
Generate benchmark cluster configurations
Usage:
pbench genconfig [flags] [directory to search recursively for config.json]
pbench genconfig [command]
Available Commands:
default Print the built-in default generator parameter file.
Flags:
-h, --help help for genconfig
-p, --parameter-file stringArray Specifies a parameter file. Can be repeated; later files override earlier ones. Use built-in defaults if not specified.
-t, --template-dir string Specifies the template directory. Use built-in template if not specified.
Use "pbench genconfig [command] --help" for more information about a command.
Print the built-in default generator parameters:
pbench genconfig default
Generate configurations for all clusters found in a directory:
pbench genconfig /path/to/clusters/
Stack multiple parameter files (later overrides earlier):
pbench genconfig -p base-params.json -p overrides.json /path/to/clusters/
Run ./pbench genddl --help to see the online help for pbench genddl.
Generate DDL scripts based on a config file
Usage:
pbench genddl [config file]
Flags:
-h, --help help for genddl
All paths (table definitions, templates, output directories) are resolved relative to the config file's directory, so you can run pbench genddl from any working directory.
The config file specifies the workload, scale factor, file format, and compression method. An example config (cmd/genddl/config.json):
{
"scale_factor": "10",
"file_format": "parquet",
"compression_method": "uncompressed"
}
This defaults to TPC-DS. To generate DDL for a different workload (e.g., TPC-H), add the workload and workload_definition fields:
{
"workload": "tpch",
"workload_definition": "tpc-h",
"scale_factor": "100",
"file_format": "parquet",
"compression_method": "uncompressed"
}
| Field | Description | Default |
|---|---|---|
| scale_factor | Data scale factor (e.g., "10", "1k") | (required) |
| file_format | Storage format (e.g., "parquet") | (required) |
| compression_method | "uncompressed" or "zstd" | (required) |
| workload | Presto connector/catalog name used in INSERT statements and schema naming | "tpcds" |
| workload_definition | Subdirectory name under definition/ containing table JSON files | "tpc-ds" |
The command generates CREATE TABLE and INSERT scripts for all four schema variants (Hive/Iceberg x non-partitioned/partitioned). Pre-generated examples are available in cmd/genddl/generated-examples/.
The config file expects the following sibling directories:
<config-dir>/
config.json # the config file
definition/<workload>/ # table definition JSON files (one per table)
*.tmpl # SQL/shell templates
out/ # generated output (cleaned on each run)
generated-examples/ # named output for version control
pbench genddl cmd/genddl/config.json
Output is written to <config-dir>/out/ and also to a named subdirectory under <config-dir>/generated-examples/ (e.g., tpcds-sf10-parquet/).
Run ./pbench help --help to see the online help for pbench help.
Help provides help for any command in the application.
Simply type pbench help [path to command] for full details.
Usage:
pbench help [command] [flags]
Flags:
-h, --help help for help
For example, the two commands
pbench genconfig default --help
pbench help genconfig default
return the same output:
Print the built-in default generator parameter file.
Usage:
pbench genconfig default
Flags:
-h, --help help for default
Run ./pbench loadjson --help to see the online help for pbench loadjson.
Load query JSON files into event listener database and run recorders
Usage:
pbench loadjson [flags] [list of files or directories to process]
Flags:
-c, --comment string Add a comment to this run (optional)
-x, --extract-plan Extract the plan JSON from query JSON then save them to the output path
-h, --help help for loadjson
--influx string InfluxDB connection config for run recorder (optional)
--mysql string MySQL connection config for event listener and run recorder (optional)
-n, --name string Assign a name to this run. (default: "load_<current time>")
-o, --output-path string Output directory path (default "current directory")
-P, --parallel int Number of parallel threads to load json files (default 10)
-r, --record-run Record all the loaded JSON as a run
The default value of -P shown in the help output reflects the machine the help was generated on; at runtime it defaults to the number of CPU cores on the system.
Use --extract-plan to extract the query plan JSON from each query info file and save it as a separate .plan.json file in the output directory. This is useful for offline plan analysis without needing a MySQL database.
The input files are Presto query info JSON files, which are the responses from the Presto /v1/query/{queryId} API endpoint. These are also the files saved by pbench run when save_json is enabled.
Load query JSON files into a MySQL database and record the run:
pbench loadjson --mysql mysql.json --record-run -c "baseline run" /path/to/query_jsons/
Extract query plans from JSON files without a database:
pbench loadjson --extract-plan -o plans/ /path/to/query_jsons/
Run ./pbench queryplan --help to see the online help for pbench queryplan.
Read a CSV file, parse the "query plan" column, and write the JOIN information into a JSON file
Usage:
pbench queryplan [flags] <CSV file>
Flags:
-c, --column int The column index for the Query Plans in the CSV file (index starts with 0)
-H, --has-header contain the header line or not (default true)
-h, --help help for queryplan
-o, --output string Output JSON file (default "queryplan.json")
Parse query plans from the third column of a CSV file:
pbench queryplan -c 2 -o join_analysis.json exported_queries.csv
Run ./pbench replay --help to see the online help for pbench replay.
Replay workload from a CSV file
The fields in the CSV file are:
"query_id","create_time","wall_time_millis","output_rows","written_output_rows","catalog","schema","session_properties","query"
We also expect the queries in this CSV file are sorted by "create_time" in ascending order.
Usage:
pbench replay [flags] [workload csv file]
Flags:
--force-https Force all API requests to use HTTPS
-h, --help help for replay
-n, --name string Assign a name to this run. (default: "replay_<current time>")
-o, --output-path string Output directory path (default "current directory")
-P, --parallel int Maximum number of concurrent queries to replay (default 150)
-p, --password string Presto user password (optional)
-s, --server string Presto server address (default "http://127.0.0.1:8080")
-u, --user string Presto user name (default "pbench")
NOTE: Unlike pbench loadjson and pbench save where -P defaults to the number of CPU cores, pbench replay defaults to 150 concurrent queries.
NOTE: The default value for -P (150) is suitable for most workloads. If the recorded customer workload has a higher server concurrency level, you should increase this value to match. Otherwise, the replay will artificially throttle the workload and the timing will not accurately reflect the original traffic pattern.
The input CSV file is typically exported from a Presto event listener database. The CSV must be sorted by create_time in ascending order so that queries are replayed at the same relative timing as the original workload.
The header fields are query_id, create_time, wall_time_millis, output_rows, written_output_rows, catalog, schema, session_properties, query. Newlines within SQL queries are encoded as <<>>.
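If an export is not already in ascending create_time order, it can be re-sorted before replay. A sketch with sample rows (invented for illustration); this works because the first two fields never contain commas, and the fixed-width timestamp format makes lexical order match chronological order:

```shell
# Sketch: sort a workload CSV by the quoted create_time column (field 2),
# keeping the header line first. The sample rows are invented.
cat > workload.csv <<'EOF'
"query_id","create_time","wall_time_millis","output_rows","written_output_rows","catalog","schema","session_properties","query"
"q2","2024-04-16 00:01:30.789 UTC","2","1","0","hive","s","{}","SELECT 2"
"q1","2024-04-16 00:01:17.466 UTC","1","1","0","hive","s","{}","SELECT 1"
EOF
{ head -n 1 workload.csv; tail -n +2 workload.csv | sort -t, -k2,2; } > workload_sorted.csv
```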
"query_id","create_time","wall_time_millis","output_rows","written_output_rows","catalog","schema","session_properties","query"
"20240416_000117_00004_jgufv","2024-04-16 00:01:17.466 UTC","14","1","0","hive","tpcds_sf1000","{query_max_execution_time=3h, join_distribution_type=AUTOMATIC}","SELECT count(*) FROM store_sales"
"20240416_000130_00005_jgufv","2024-04-16 00:01:30.789 UTC","2500","100","0","hive","tpcds_sf1000","{}","SELECT * FROM customer<<>>WHERE c_customer_sk < 100"
Replay the workload against a target cluster with up to 200 concurrent queries:
pbench replay workload.csv -s http://presto-server:8080 -P 200
Run ./pbench round --help to see the online help for pbench round.
Note: pbench round is an experimental feature that is not included in the default builds in Releases.
The program matches every column of the first row to determine which columns contain long decimals.
After the first row, it inspects only those matched columns, so if an overly long decimal first appears in the second row or later, it may not be handled properly.
A PR was opened to fix the native/Java decimal precision discrepancy but so far it does not work quite well:
https://github.com/facebookincubator/velox/pull/7944
Usage:
pbench round [flags] [list of files or directories to process]
Flags:
-e, --file-extension stringArray Specifies the file extensions to include for processing (including the dot). You can specify multiple file extensions. (default [.output])
-f, --format string Specifies the format of the files. Accepted values are: "csv" or "json" which is the output file from the "run" command (default "json")
-h, --help help for round
-p, --precision int Decimal precision to preserve. (default 12)
-r, --recursive Recursively walk a path if a directory is provided in the arguments.
-i, --rewrite-in-place When turned on, we will rewrite the file in-place. Otherwise, we save the rewritten file separately.
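The underlying idea can be illustrated with awk; this is not pbench's implementation, just the rounding concept:

```shell
# Illustration of the rounding concept: trim an over-precise decimal to 12
# digits so native and Java outputs compare equal after rounding.
echo '3.14159265358979312' | awk '{ printf "%.12f\n", $1 }'
# prints: 3.141592653590
```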
Round all .output files in a benchmark result directory to 10 decimal places, in-place:
pbench round -p 10 -i -r results/my_run/
Round specific files and save the rounded copies separately:
pbench round -p 12 results/my_run/query_01.output results/my_run/query_02.output
For more information about pbench run, see The Run Command.
NOTE: To set Presto session properties for pbench run, use the session_params field in a stage JSON file rather than a CLI flag. This allows session properties to be inherited by child stages and scoped per stage. See Parameters - session_params for details.
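As an illustration, a stage-file fragment setting session properties might look like the following; the map-of-strings shape is an assumption (the property names are taken from the replay example on this page), and the linked Parameters page is authoritative:

```json
{
  "session_params": {
    "query_max_execution_time": "3h",
    "join_distribution_type": "AUTOMATIC"
  }
}
```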
Run the TPC-H power test at scale factor 1 against a local Presto server (uses the built-in tpch connector):
pbench run benchmarks/tpch/sf1.json benchmarks/tpch/tpch.json
Run TPC-DS at scale factor 10 with the Java OSS engine, save query info JSON files, and compare output for correctness:
pbench run \
benchmarks/java_oss.json \
benchmarks/tpc-ds/sf10.json \
benchmarks/tpc-ds/ds_power.json \
benchmarks/save_json.json \
benchmarks/save_output.json
Run TPC-H throughput test (40 randomized streams) with 1 cold run and 2 warm runs:
pbench run \
benchmarks/tpch/sf1.json \
benchmarks/tpch/streams/stream_01.json \
benchmarks/c1w2.json \
-s http://presto-server:8080
The stage JSON files are composable building blocks. See the benchmarks/ directory for the full set of available configurations, including:
- benchmarks/tpch/ - TPC-H queries and scale factors
- benchmarks/tpc-ds/ - TPC-DS queries, power runs, and throughput streams
- benchmarks/test/ - Simple test stages demonstrating stage structure and scripts
Run ./pbench save --help to see the online help for pbench save.
Save table information for recreating the schema and data
Usage:
pbench save [flags] [list of table names]
Flags:
--catalog string Catalog name
-f, --file string CSV file to read catalog,schema,table
--force-https Force all API requests to use HTTPS
-h, --help help for save
-n, --no-analyze Do not run additional queries to analyze table when stats were missing.
-o, --output-path string Output directory path (default "current directory")
-P, --parallel int Number of parallel threads to save table summaries. (default 10)
-p, --password string Presto user password (optional)
--schema string Schema name
-s, --server string Presto server address (default "http://127.0.0.1:8080")
--session stringArray Session property (property can be used multiple times; format is
key=value; use 'SHOW SESSION' in Presto CLI to see available properties)
--trino Use Trino protocol
-u, --user string Presto user name (default "pbench")
The default value of -P shown in the help output reflects the machine the help was generated on; at runtime it defaults to the number of CPU cores on the system.
Save metadata for specific tables:
pbench save -s http://presto-server:8080 --catalog hive --schema tpcds_sf1000 \
customer store_sales date_dim
Save tables listed in a CSV file (each line: catalog,schema,table):
pbench save -s http://presto-server:8080 -f tables.csv -o saved_schemas/
Save with a Trino server, skipping additional analyze queries:
pbench save -s http://trino-server:8080 --trino --no-analyze \
--catalog iceberg --schema benchmark customer store_sales
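The -f file is a plain CSV of catalog,schema,table triples (assumed headerless, per the example lead-in above). A minimal sketch of building one, reusing table names from the examples on this page:

```shell
# Sketch: a tables.csv for pbench save -f, one catalog,schema,table per line.
cat > tables.csv <<'EOF'
hive,tpcds_sf1000,customer
hive,tpcds_sf1000,store_sales
iceberg,benchmark,date_dim
EOF
```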