Based on TPC-DS v1.1.0 found at http://www.tpc.org/tpcds/ and the Cloudera Impala benchmark
NOTICE: This repo contains modifications to the official TPC-DS specification, so results produced with it are not comparable to officially audited results.
The following tables are currently used:
Dimension Tables:
- DATE_DIM
- TIME_DIM
- CUSTOMER
- CUSTOMER_ADDRESS
- CUSTOMER_DEMOGRAPHICS
- HOUSEHOLD_DEMOGRAPHICS
- ITEM
- PROMOTION
- STORE
Fact Tables:
- STORE_SALES
These steps set up your environment to perform distributed data generation at the chosen scale factor.
The scripts assume passwordless SSH from the master node (where you will clone the repo) to every DataNode in your cluster.
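If passwordless SSH is not yet configured, a minimal sketch of the setup looks like this, using standard OpenSSH tooling and assuming `dn.txt` (described in the steps below) already lists your DataNodes:

```bash
# On the master node: create a key pair if one does not exist, then
# push the public key to every DataNode listed in dn.txt.
# ssh-copy-id prompts for each node's password once.
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in $(< dn.txt); do
  ssh-copy-id "$host"
done
```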
- `cd $HOME` and clone this repo: `git clone https://github.com/sohan/vertica-tpcds-kit`
- `cd vertica-tpcds-kit`, then `cp tpcds-env-sample.sh tpcds-env.sh` and modify `tpcds-env.sh` as needed.
- Edit `dn.txt` and put one DataNode hostname per line, with no blank lines (see the example after this list).
- Run `push-bits.sh`, which will scp `vertica-tpcds-kit` to, and build `tpcds-kit` on, each DataNode listed in `dn.txt`.
- Run `set-nodenum.sh`. This creates `vertica-tpcds-kit/nodenum.sh` on every DataNode with that node's number; the number determines which portion of the distributed data generation each node performs (see the sketch after this list).
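For concreteness, `dn.txt` is a plain list of hostnames, and the per-node number can be understood as selecting one slice of dsdgen's built-in parallel generation. The following is only a sketch: dsdgen's `-parallel`/`-child` flags are real, but `nodenum.sh` setting a `NODENUM` variable is an assumption about the scripts' conventions.

```bash
# Example dn.txt -- one DataNode hostname per line, no blank lines:
#   datanode01.example.com
#   datanode02.example.com
#   datanode03.example.com

# Sketch: map this node's number onto dsdgen's parallel mode.
# dsdgen -parallel <total> -child <n> generates chunk n of <total>.
source ./nodenum.sh                  # assumed to set NODENUM (1-based)
TOTAL_NODES=$(wc -l < dn.txt)
./dsdgen -table store_sales -scale 10 \
         -parallel "$TOTAL_NODES" -child "$NODENUM"
```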
Data is landed directly in Vertica, so no local storage is required. Run the following in order:
- `pip install -r requirements.txt` - downloads the Python dependencies (recommended: use virtualenv).
- `vertica-create-tables.sh` - creates the tables in Vertica.
- `run-gen-dims.sh` - runs `gen-dims.sh` on the first node in `dn.txt` to generate the dimension data and insert it into Vertica.
- `run-gen-facts.sh` - runs `gen-facts.sh` on each DataNode via ssh to generate the `store_sales` table (a sketch of the underlying load pipeline follows this list).
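Since the data never touches local disk, the underlying pattern is dsdgen streaming to stdout and Vertica reading it with `COPY ... FROM STDIN`. A minimal sketch of that pipe, assuming connection settings like `VERTICA_HOST` come from `tpcds-env.sh` (the actual gen scripts may differ in detail):

```bash
# Illustrative pipe, no intermediate files: -filter Y sends dsdgen's
# output to stdout, and -terminate N drops the trailing '|' so the
# field count matches the table. DIRECT loads straight to disk.
./dsdgen -table store_sales -scale 10 \
         -parallel "$TOTAL_NODES" -child "$NODENUM" \
         -filter Y -terminate N |
  vsql -h "$VERTICA_HOST" -U "$VERTICA_USER" -w "$VERTICA_PASS" \
       -c "COPY store_sales FROM STDIN DELIMITER '|' NULL '' DIRECT;"
```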
You can run queries in a single-user or multi-user (concurrent) setting. See below for usage.
```bash
./vertica-run-queries.sh --help

# Single-user run: queries are shuffled, then run sequentially.
./vertica-run-queries.sh --interactive --num-users 1 --num-runs 3

# Multi-user run: within each run, each user concurrently runs its own
# shuffled copy of the query set, one query at a time.
./vertica-run-queries.sh --interactive --num-users 5 --num-runs 3
```
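Conceptually, the multi-user mode is a fan-out/fan-in: each user becomes a concurrent worker running its own shuffled query stream, and a run finishes when the slowest user does. A sketch of that pattern in shell (not the script's actual implementation; `run_query` is a hypothetical helper that times one query):

```bash
NUM_USERS=5
QUERIES=(q19 q42 q52 q68 q73 q98)

for user in $(seq 1 "$NUM_USERS"); do
  (
    # Each user shuffles independently, then runs queries sequentially.
    for q in $(printf '%s\n' "${QUERIES[@]}" | shuf); do
      run_query "$q" "$user"    # hypothetical: run and time one query
    done
  ) &
done
wait    # the run ends when the slowest user finishes
```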
```bash
cat results/*-scale-5-users-3-runs/metadata.json
```
```json
{
  "metadata": {
    "num_users": 5,
    "num_runs": 3,
    "start_time": "2014-09-04 15:50:38.380715",
    "scale_factor": "10",
    "queries": [
      "q19",
      ...
    ]
  }
}
```
```bash
cat results/*-scale-5-users-3-runs/query_times.csv
```
```csv
"run_number","user","query","time_s"
"1","1","q42","0.17443999999999998"
"1","1","q52","0.432267"
"1","2","q19","0.763254"
"1","3","q52","0.467401"
"1","2","q42","0.509011"
"1","3","q42","0.760072"
"1","1","q19","1.065039"
"1","3","q73","0.469087"
"1","4","q68","1.527678"
"1","1","q98","1.8634000000000002"
"1","4","q98","2.42102"
"1","5","q98","2.111182"
...
```
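To summarize the raw timings, a small awk pipeline over `query_times.csv` gives the mean time per query across all users and runs (a convenience sketch, not part of the kit):

```bash
# Strip CSV quoting, skip header lines, and average column 4 (time_s)
# per query name in column 3.
cat results/*-scale-5-users-3-runs/query_times.csv |
  tr -d '"' |
  awk -F, '$1 ~ /^[0-9]+$/ { sum[$3] += $4; n[$3]++ }
           END { for (q in sum) printf "%s  %.3f s avg\n", q, sum[q]/n[q] }'
```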