Useful commands/scripts for use on Google Cloud Platform.
- Definition
- Start web console for BigQuery
- Query large datasets in seconds (Examples)
- Load and Export Data
- Architecting a BigQuery project
- More links
BigQuery is the petabyte-scale data warehouse on GCP (for immutable, very large datasets).
Use one of these:
- https://bigquery.cloud.google.com
- https://console.cloud.google.com/ then, from the 'hamburger' menu, look for BigQuery under the Big Data section.
#standardSQL
SELECT
nppes_provider_state AS state,
ROUND(SUM(total_claim_count) / 1e6) AS total_claim_count_millions
FROM
`bigquery-public-data.medicare.part_d_prescriber_2014`
GROUP BY
state
ORDER BY
total_claim_count_millions DESC
LIMIT
5;

SELECT
airline,
date,
departure_delay
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
departure_delay > 0
AND departure_airport = 'LGA'
LIMIT
100

To write a query that returns both the number of flights delayed and the total number of flights in a single query:
SELECT
f.airline,
COUNT(f.departure_delay) AS total_flights,
SUM(IF(f.departure_delay > 0, 1, 0)) AS num_delayed
FROM
`bigquery-samples.airline_ontime_data.flights` AS f
WHERE
f.departure_airport = 'LGA' AND f.date = '2008-05-13'
GROUP BY
f.airline
LIMIT
100

SELECT
airline,
COUNT(departure_delay)
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
departure_airport = 'LGA'
AND date = '2008-05-13'
GROUP BY
airline
ORDER BY airline

SELECT
airline,
COUNT(departure_delay)
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
departure_delay > 0 AND
departure_airport = 'LGA'
AND date = '2008-05-13'
GROUP BY
airline
ORDER BY airline

SELECT
CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
FROM
`bigquery-samples.weather_geo.gsod`
WHERE
station_number = 725030
AND total_precipitation > 0

SELECT
f.airline,
SUM(IF(f.arrival_delay > 0, 1, 0)) AS num_delayed,
COUNT(f.arrival_delay) AS total_flights
FROM
`bigquery-samples.airline_ontime_data.flights` AS f
JOIN (
SELECT
CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
FROM
`bigquery-samples.weather_geo.gsod`
WHERE
station_number = 725030
AND total_precipitation > 0) AS w
ON
w.rainyday = f.date
WHERE f.arrival_airport = 'LGA'
GROUP BY f.airline

SELECT
airline,
num_delayed,
total_flights,
num_delayed / total_flights AS frac_delayed
FROM (
SELECT
f.airline AS airline,
SUM(IF(f.arrival_delay > 0, 1, 0)) AS num_delayed,
COUNT(f.arrival_delay) AS total_flights
FROM
`bigquery-samples.airline_ontime_data.flights` AS f
JOIN (
SELECT
CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
FROM
`bigquery-samples.weather_geo.gsod`
WHERE
station_number = 725030
AND total_precipitation > 0) AS w
ON
w.rainyday = f.date
WHERE f.arrival_airport = 'LGA'
GROUP BY f.airline
)
ORDER BY
frac_delayed ASC

Acknowledgment: This lab is based on an article by Felipe Hoffa: https://medium.com/@hoffa/the-top-weekend-languages-according-to-githubs-code-6022ea2e33e8#.8oj2rp804
- From the web GUI
- Go to https://bigquery.cloud.google.com
- You will need to use/create a dataset (within a given project) where the data will be stored.
- Click on the + next to your dataset and follow the instructions (table names in CAPITALS by convention).
- From the command line tool: bq
- Open Cloud Shell from the Google Cloud Console, which already has bq installed.
- Download or create a schema for your table.
You can check that file online: https://storage.googleapis.com/cloud-training/CPB200/BQ/lab4/schema_flight_performance.json
cat schema_flight_performance.json
[
{
"name": "YEAR",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "QUARTER",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "MONTH",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "DAY_OF_MONTH",
"type": "INTEGER"
},
{
"name": "FULL_DATE",
"type": "STRING"
},
...
{
"name": "SECURITY_DELAY",
"type": "INTEGER"
},
{
"name": "LATE_AIRCRAFT_DELAY",
"type": "INTEGER"
}
]

Type the following command to download schema_flight_performance.json (the schema file for the table in this example) to your working directory:

curl https://storage.googleapis.com/cloud-training/CPB200/BQ/lab4/schema_flight_performance.json -o schema_flight_performance.json

The JSON files containing the data for your table are now stored in a Google Cloud Storage bucket. They have URIs like the following: gs://cloud-training/CPB200/BQ/lab4/domestic_2014_flights_*.json
- Use bq: type the following command to create a table named flights_2014 in the cpb101_flight_data dataset, using data from files in Google Cloud Storage and the schema file stored on your virtual machine.
Note that your Project ID is stored as a variable in Cloud Shell ($DEVSHELL_PROJECT_ID) so there's no need for you to remember it. If you require it, you can view your Project ID in the command line to the right of your username (after the @ symbol).
bq load --source_format=NEWLINE_DELIMITED_JSON $DEVSHELL_PROJECT_ID:cpb101_flight_data.flights_2014 gs://cloud-training/CPB200/BQ/lab4/domestic_2014_flights_*.json ./schema_flight_performance.json

Note: there are multiple JSON files in the bucket, named according to the convention domestic_2014_flights_*.json. The wildcard (*) character is used to include all of the .json files in the bucket.
More on wildcards for better performance in this video.
Performance can also be improved at load time with table partitioning (time-partitioned tables). Declare the table as partitioned at creation time using the --time_partitioning_type flag, then filter your selection on the _PARTITIONTIME pseudo-column like so:
SELECT ...
FROM `sales`
WHERE _PARTITIONTIME
BETWEEN TIMESTAMP("2016-01-01")
AND TIMESTAMP("2016-01-31")

You can then analyze the performance of your query using Google Stackdriver:
- check the multiple (sequential) stages of your query
- check the number of rows processed at each stage, which you want to see decreasing rapidly (with adequate filtering)

The Wait/Read/Compute/Write breakdown helps you understand whether a stage gets stuck or suffers from tail skew (i.e. some workers taking much longer than average).
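The partitioned table itself can be created from the command line. A minimal sketch; the mydataset dataset and sales table names are assumptions for illustration:

```sh
# Create a day-partitioned table (dataset/table names are examples).
# Queries filtering on _PARTITIONTIME then scan only matching partitions.
bq mk --table \
  --time_partitioning_type=DAY \
  $DEVSHELL_PROJECT_ID:mydataset.sales
```

This requires an active GCP project, so it cannot be run outside Cloud Shell or an authenticated environment.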
Once the table is created, type the following command to verify table flights_2014 exists in dataset cpb101_flight_data.
bq ls $DEVSHELL_PROJECT_ID:cpb101_flight_data

The output should look like the following:
pme@cloudshell:~ (southern-flash-208711)$ bq ls $DEVSHELL_PROJECT_ID:cpb101_flight_data
tableId Type Labels Time Partitioning
-------------- ------- -------- -------------------
AIRPORTS TABLE
flights_2014 TABLE
- If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique (e.g. demo-export-table-bigquery).
- Back in the Google Cloud Console (in the incognito window), use the menu to navigate to the BigQuery web UI.
- Select the AIRPORTS table that you created recently and, using the "down" button to its right, select the option for Export Table.
- In the dialog, specify gs://<your-bucket-name>/bq/airports.csv and click OK (for me, gs://demo-export-table-bigquery/bq/airports.csv).
- Browse to your bucket and ensure that the .csv file has been created.
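The same export can be done with the bq command line tool. A sketch, reusing the example bucket name from above:

```sh
# Export the AIRPORTS table to a CSV file in Cloud Storage.
bq extract --destination_format=CSV \
  $DEVSHELL_PROJECT_ID:cpb101_flight_data.AIRPORTS \
  gs://demo-export-table-bigquery/bq/airports.csv
```

Like the web UI export, this needs the dataset and a writable bucket to exist in your project.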
- Test using a query:
SELECT *
FROM `southern-flash-208711.cpb101_flight_data.AIRPORTS`

WITH WashingtonStations AS (
SELECT ...
...
)
SELECT washington_stations.name, ...
FROM WashingtonStations AS washington_stations
ORDER BY rainy_days DESC;

- EXTRACT to extract a date part from a timestamp:
SELECT
author.email,
diff.new_path AS path,
author.date
FROM
`bigquery-public-data.github_repos.commits`,
UNNEST(difference) diff
WHERE
EXTRACT(YEAR
FROM
author.date)=2016
LIMIT 10

- using ARRAY and STRUCT to build, for example, a table with several entries for a given date
- STARTS_WITH to find words that start with a prefix
- JOIN condition example
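A minimal sketch combining ARRAY/STRUCT with STARTS_WITH; the inline data is made up for illustration:

```sql
#standardSQL
-- Build a small table of (day, words) rows from an ARRAY of STRUCTs,
-- flatten the word arrays, and keep words starting with 'rai'.
SELECT
  day,
  word
FROM
  UNNEST([
    STRUCT('2018-01-01' AS day, ['rain', 'rainbow', 'sun'] AS words),
    STRUCT('2018-01-02' AS day, ['snow', 'rail'] AS words)]),
  UNNEST(words) AS word
WHERE
  STARTS_WITH(word, 'rai')
```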
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators
- Aggregate functions
- String functions (e.g. REGEXP_CONTAINS)
- Analytic (window) functions
- Standard aggregations (SUM, AVG, MIN, MAX, COUNT, ...)
- Navigation functions (LEAD, LAG, NTH_VALUE, ...)
- Ranking and numbering functions (CUME_DIST, DENSE_RANK, ROW_NUMBER, RANK, PERCENT_RANK, ...)
- Datetime functions
- Array functions
- Other functions and operators
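As an illustration of an analytic (window) function, a sketch that ranks airlines at LGA by flight count, reusing the flights sample table from above:

```sql
#standardSQL
-- RANK() is evaluated over the aggregated rows, so airlines with the
-- same flight count share a rank.
SELECT
  airline,
  COUNT(*) AS total_flights,
  RANK() OVER (ORDER BY COUNT(*) DESC) AS airline_rank
FROM
  `bigquery-samples.airline_ontime_data.flights`
WHERE
  departure_airport = 'LGA'
GROUP BY
  airline
ORDER BY
  airline_rank
```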
WITH
commits AS (
SELECT
author.email,
EXTRACT(DAYOFWEEK
FROM
author.date) BETWEEN 2
AND 6 is_weekday,
LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^\./\(~_ \- #]*)$')) lang,
diff.new_path AS path,
author.date
FROM
`bigquery-public-data.github_repos.commits`,
UNNEST(difference) diff
WHERE
EXTRACT(YEAR
FROM
author.date)=2016)
SELECT
lang,
is_weekday,
COUNT(path) AS numcommits
FROM
commits
WHERE
LENGTH(lang) < 8
AND lang IS NOT NULL
AND REGEXP_CONTAINS(lang, '[a-zA-Z]')
GROUP BY
lang,
is_weekday
HAVING
numcommits > 100
ORDER BY
numcommits DESC

ref. The top weekend programming languages — based on GitHub’s activity
Multiple ways to create a date:
- DATE(year, month, day)
- DATE(timestamp), which supports an optional time zone
- DATETIME(date, time), using DATE and TIME objects
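A sketch showing the three constructors side by side; the values are arbitrary:

```sql
#standardSQL
SELECT
  DATE(2016, 1, 15) AS from_parts,
  DATE(TIMESTAMP "2016-01-15 12:00:00+00", "America/New_York") AS from_ts,
  DATETIME(DATE "2016-01-15", TIME "12:00:00") AS from_date_and_time
```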
User-defined functions (UDFs), written in SQL or JavaScript, are temporary (they live only for the duration of the query) and come with constraints. They are useful for:
- programming logic
- loops, complex conditions, and non-trivial string parsing (not supported by standard SQL)
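A minimal temporary JavaScript UDF sketch; the greeting function is made up for illustration:

```sql
#standardSQL
-- Temporary UDF: exists only for this query's duration.
CREATE TEMPORARY FUNCTION greeting(name STRING)
RETURNS STRING
LANGUAGE js AS """
  return 'Hello, ' + name + '!';
""";

SELECT greeting('BigQuery') AS message;
```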
Building a Mobile Gaming Analytics Platform - a Reference Architecture