This project delivers an end-to-end unsupervised learning pipeline designed to eliminate human cognitive bias from enterprise retail data. Utilizing transactional data from Online Retail.xlsx, the architecture processes, standardizes, compresses, and clusters behavior into mathematically isolated, strategically actionable market segments. By pairing Principal Component Analysis (PCA) with a validated K-Means algorithm, we successfully compressed over 20 dimensions of retail behavior down to 3 major customer personas.
The data goes through a strict chronological pipeline following the Input-Process-Output blueprint:
-
SCALE (Standardization): Continuous features are mapped to a common geometric range (
$z$ -score scaling where$\mu = 0, \sigma = 1$ ) to eliminate magnitude-induced bias and ensure equal mathematical voting power. -
COMPRESS (PCA): High-dimensional cloud data (
$D > 20$ ) is transformed onto orthogonal Principal Components to dodge the curse of dimensionality while maximizing variance preservation. - CLUSTER (K-Means): Modeled to minimize the Within-Cluster Sum of Squares (WCSS), pulling customer metrics into highly cohesive clusters.
- TRANSLATE (Inverse Projection): PCA-space centroid coordinates are inverse-transformed back into human-centric, real-world metrics to build marketing personas.
The script isolated 3 orthogonal components capturing an exceptional cumulative spread of original consumer behavior:
-
PC1:
$79.10%$ explained variance ratio -
PC2:
$18.33%$ explained variance ratio -
PC3:
$2.57%$ explained variance ratio
Total Preserved Variance:
The target tier framework enforced a cluster count of
- The Elbow Method: Identified the precise inflection point of diminishing returns for WCSS reduction.
- The Silhouette Score: Confirmed strong cluster cohesion versus neighbor separation, ensuring no natural customer groups were artificially split.
By executing an inverse scaling transformation on the cluster centroids, the abstract coordinates were mapped back to their original physical medians:
| Cluster ID | Assigned Corporate Persona | Median Frequency | Median Total Spend | Median Quantity | Strategic Business Action |
|---|---|---|---|---|---|
| Cluster 1 | CLUSTER A: HIGH-VALUE ENGAGERS | 46.0 | $143,825.06 | 74,215.0 | Deploy high-touch VIP support, exclusive perks, and experiential marketing. |
| Cluster 2 | CLUSTER B: MID-TIER EXPLORERS | 34.0 | $27,498.04 | 15,752.0 | Push targeted loyalty programs, product bundles, and flash sales. |
| Cluster 0 | CLUSTER C: LOW-ACTIVITY CHURN RISK | 2.0 | $660.00 | 370.0 | Execute drop-off re-engagement, clear value options, and win-back campaigns. |