Skip to content

alexfaye/pirlocalligraphy

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

99 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Pirlo Calligraphy eCommerce GA4 Analysis with Dataform & Data Studio

This is GA4 analysis of pirlocalligraphy.com which is a Woocommerce-based ecommerce website selling Pirlo's Chinese calligaphy artworks. The events_**** tables were automatically integrated from GA4 to BigQuery on a daily basis since 30/03/2026.

03_outputs/ga4_events.sqlx

After this ga4_events table finishes running, what I get is a perfect event table — no nested arrays requiring manual UNNEST, extremely accurate traffic-source attribution, absolutely no duplicate data, and built-in local timezone handling plus unique IDs.

pre_operations

According to the official GA4 documentation, event data may still be updated within 72 hours (3 days) after the event occurs. After that 72-hour window, the data is generally considered stable and unlikely to change. That’s why a 3-day threshold is used as the benchmark: data older than 3 days is treated as final, while the latest 3 days of data are reinserted to capture any updates or corrections.

source_ga4_export

This CTE excludes intraday tables (same-day realtime tables, which are highly unstable), and uses JavaScript helper functions to filter out all blacklist rules defined in the configuration file — such as test events I do not want to track or internal testing hostnames.

stage_unnest_params

GA4 raw tables are filled with large numbers of RECORD and ARRAY fields. This step uses the generators in helpers.js to extract information from event_params, user_properties, and even the items array, then reorganizes everything into logically structured Structs.

stage_repack_structs

GA4 data often contains messy values — for example, the ecommerce revenue fields may include NaN (Not a Number) values or annoying (not set) strings of transation IDs. This step cleans them all up by converting them into proper 0 or NULL values. In addition, starting in July 2024, GA4 moved the batch_* fields from event parameters into standalone columns, and here they are also packaged into a Struct.

stage_url_parsing_and_event_id

GA4 raw data does not contain an Event ID (a unique event identifier). Here, BigQuery’s hashing function FARM_FINGERPRINT is used to combine multiple dimensions — such as timestamp, event name, and user ID — into a unique fingerprint (hash value), which serves as the event’s primary key event_id.

It also directly parses the URL parameters from event_params.page_location (the full URL of the current page), generating three separate Structs: standard URL parameters (url_params), custom URL parameters (url_params_custom), and extracted click IDs (click_ids).

stage_discard_synthetic_bundle_events

GA4’s “Audience Triggers” generate system-created audience trigger events. These events are not triggered by actual user interactions on the website, such as clicking buttons, but are instead synthetically generated by Google’s servers through scheduled background processing. They can artificially inflate overall event counts and session counts.

Therefore, they should be excluded to ensure that the filtered dataset represents a 100% clean clickstream generated only by real user actions on the website — such as clicks, scrolling, and purchases.

stage_traffic_source_coalesce_and_click_ids

Traffic source information is scattered across official GA4 fields, event parameters, and URL parameters. Here, COALESCE is used to consolidate them into a single source of truth, following the priority order: official fields > event parameters > URL parameters.

In addition, subsequent events bundled together on the same page may incorrectly inherit UTM parameters from the URL repeatedly. Therefore, when batch_ordering_id > 1, the UTM values extracted from the URL are forcibly cleared, ensuring that only the first touchpoint event (hit) in the batch is assigned the traffic source attribution.

The click_ids are then regenerated. Whenever specific Click IDs are detected either in the URL or in GA4’s underlying fields, and GA4 currently classifies the traffic medium as organic or null, the system forcibly overrides it using the default values defined in config.js. This helps resolve attribution issues caused by redirects, cross-domain navigation, browser privacy restrictions, or other situations where standard UTM parameters are lost or blocked.

Finally, the condition WHERE event_date >= date_checkpoint is intentionally placed in this CTE. One reason is to preserve the reusability and general-purpose nature of the upstream foundational modules. Another reason is that in BigQuery, as long as the CTE chain remains a purely linear data flow (without complex aggregations such as GROUP BY operations or window functions interrupting the execution plan), placing the WHERE clause at the first layer or at the final layer results in exactly the same underlying compute cost and scanned bytes (billing).

stage_create_fixed_traffic_source

Fixes several attribution issues: ad traffic containing gclid is often broken and misclassified by GA4 as organic or referral traffic; traffic coming from the Google app or social media apps may appear as android-app://com... package names; and payment redirects such as PayPal, if not excluded, can interrupt the original traffic session.

stage_add_meta_info

Resolves the following issues: generates local time fields timestamp_local and date_local directly based on configuration variables; filters out Measurement Protocol server-side events that can distort frontend engagement analysis such as bounce rate; combines the user ID and session timestamp into a hash to generate a truly globally unique Session ID; and since page_location contains both the domain and URL parameters — making it unsuitable for direct page traffic aggregation — uses an RFC 3986–compliant regular expression to extract a clean page path field (path).

stage_dedupe_by_event_id

Resolves the following issues: network retries can cause BigQuery to receive completely identical duplicate events, so a QUALIFY ROW_NUMBER() ... = 1 mechanism is used to precisely deduplicate events based on the event ID.

In addition, fields that have already been extracted (such as page URL and page title) may still remain inside the event_params array and continue occupying storage space. A REPLACE ... EXCEPT operation is used to remove these processed fields from the original array.

final_add_row_numbers

Apply sequence numbers for hits and pages partitioned by session.

About

This is GA4 analysis of pirlocalligraphy.com which is a Woocommerce-based ecommerce website selling Pirlo's Chinese calligaphy artworks.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors