A beginner-to-intermediate data analysis project built with Python in a Jupyter Notebook. It loads mall customer data, cleans it, and produces a single multi-panel matplotlib/seaborn dashboard saved as final_mall_dashboard.png.
- Dataset Overview
- Project Structure
- What the Notebook Does — Step by Step
- Dashboard Panels Explained
- Key Insights from the Current Analysis
- What Interns Can Improve
- New Things to Practice for a Better Dashboard
- How to Run
File: da_mall_customer.csv
Rows: 200 customers
Columns: 7
| Column | Type | Description |
|---|---|---|
| `CustomerID` | Integer | Unique identifier for each customer |
| `Gender` | Categorical (M / F) | Customer gender |
| `Age` | Integer | Customer age (range ~18–70) |
| `Education` | Categorical | Highest education level: High School, Graduate, College, Post-Graduate, Doctorate, Uneducated, Unknown |
| `Marital Status` | Categorical | Married, Single, Divorced, Unknown |
| `Annual Income (k$)` | Integer | Annual income in thousands of USD |
| `Spending Score (1-100)` | Integer | Mall-assigned score based on customer spending behavior (1 = lowest, 100 = highest) |
Notable data quality issues (handled in the notebook):
- The `Education` column had trailing whitespace in its name (`"Education "`), which is stripped during cleaning.
- Both `Education` and `Marital Status` contain `"Unknown"` entries that are relabeled to `"Not Specified"` for cleaner chart labels.
da_dashbrd/
├── anailyze.ipynb # Main analysis notebook
├── da_mall_customer.csv # Raw dataset (200 mall customers)
├── final_mall_dashboard.png # Output dashboard image (auto-generated)
├── requirements.txt # Python dependencies
└── README.md # This file
import pandas as pd
import numpy as np

Loads pandas for data manipulation and numpy for numerical operations.
df = pd.read_csv('da_mall_customer.csv')
df.head()

Reads the CSV into a DataFrame and displays the first 5 rows to verify it loaded correctly.
import matplotlib.pyplot as plt
import seaborn as sns

Loads matplotlib (base charts) and seaborn (statistical visualizations on top of matplotlib).
df.columns = df.columns.str.strip()

Removes leading and trailing whitespace from all column headers. Without this, `df['Education']` would raise a `KeyError`, because the header in the CSV is actually `"Education "` (with a trailing space).
df['Education'] = df['Education'].replace('Unknown', 'Not Specified')
df['Marital Status'] = df['Marital Status'].replace('Unknown', 'Not Specified')

Replaces the literal string "Unknown" with "Not Specified" in both categorical columns for cleaner pie chart labels and better readability.
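As a rough illustration of the two cleaning steps, the sketch below applies them to a tiny made-up DataFrame. The rows are invented for the example; only the column names and the `"Unknown"` label come from the dataset description.

```python
import pandas as pd

# Made-up rows for illustration; note the trailing space in "Education ".
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Education ": ["Graduate", "Unknown", "College"],
    "Marital Status": ["Single", "Married", "Unknown"],
})

# Before cleaning, the tidy name is not among the headers.
found_before = "Education" in toy.columns

# Step 1: strip whitespace from every column header.
toy.columns = toy.columns.str.strip()

# Step 2: relabel "Unknown" in both categorical columns.
for col in ["Education", "Marital Status"]:
    toy[col] = toy[col].replace("Unknown", "Not Specified")

print(found_before)  # False
print(toy)
```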
Creates an 18 × 12 inch figure with a 3 × 3 GridSpec layout containing 6 panels. The dashboard is saved as final_mall_dashboard.png and rendered inline.
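A minimal sketch of such a layout is shown here. The figure size and the 3 × 3 grid come from the description above, but the placement of each panel and the `ax_*` names are assumptions, not the notebook's actual arrangement.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside Jupyter
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

fig = plt.figure(figsize=(18, 12))
gs = GridSpec(3, 3, figure=fig)

# Assumed arrangement: five single-cell panels plus one panel
# spanning a 2 x 2 block for the segmentation scatter.
ax_kpi = fig.add_subplot(gs[0, 0])
ax_gender = fig.add_subplot(gs[0, 1])
ax_education = fig.add_subplot(gs[0, 2])
ax_age_spend = fig.add_subplot(gs[1, 0])
ax_income = fig.add_subplot(gs[2, 0])
ax_segments = fig.add_subplot(gs[1:, 1:])

fig.tight_layout()
```

In a notebook, calling `fig.savefig("final_mall_dashboard.png")` after drawing on each axis would write the image, and `plt.show()` would render it inline.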
A text-only panel displaying three headline numbers:
- Total Shoppers: 200
- Avg. Income: mean of `Annual Income (k$)`
- Avg. Spend Score: mean of `Spending Score (1-100)`
This gives an at-a-glance executive summary without any chart overhead.
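One way such a text-only panel might be built is sketched below. The data values, text positions, and number formatting are assumptions made for illustration; the real notebook computes the figures from the 200-row CSV.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Annual Income (k$)": [40, 60, 80],
    "Spending Score (1-100)": [30, 50, 70],
})

kpis = {
    "Total Shoppers": f"{len(toy)}",
    "Avg. Income": f"${toy['Annual Income (k$)'].mean():.1f}k",
    "Avg. Spend Score": f"{toy['Spending Score (1-100)'].mean():.1f}",
}

fig, ax = plt.subplots(figsize=(4, 3))
ax.axis("off")  # text only: hide ticks, spines, and background

# Stack the three headline numbers vertically in axes coordinates.
for i, (label, value) in enumerate(kpis.items()):
    ax.text(0.5, 0.8 - 0.3 * i, f"{label}: {value}",
            ha="center", va="center", fontsize=14, transform=ax.transAxes)
```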
A seaborn.countplot showing how many Male vs Female customers are in the dataset. Quick check on demographic composition.
A matplotlib pie chart showing the percentage split across education levels (Graduate, High School, College, Post-Graduate, Doctorate, Uneducated, Not Specified).
A seaborn.regplot (scatter + regression line) plotting Age on the X-axis against Spending Score on the Y-axis. The red regression line shows the overall trend — generally, spending score decreases slightly as age increases.
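The trend that the regression line summarizes can also be checked numerically. The sketch below fits a straight line with `np.polyfit` and computes the Pearson correlation on made-up points that lie exactly on a falling line, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up points chosen to lie exactly on the line score = 100 - age.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "Spending Score (1-100)": [80, 70, 60, 50, 40],
})

# Degree-1 least-squares fit: the kind of line regplot draws by default.
slope, intercept = np.polyfit(toy["Age"], toy["Spending Score (1-100)"], deg=1)

# Pearson correlation: sign and strength of the linear relationship.
r = toy["Age"].corr(toy["Spending Score (1-100)"])

print(f"slope={slope:.2f}, intercept={intercept:.1f}, r={r:.2f}")
```

A negative slope together with a negative `r` would support the "spending decreases with age" reading; a value of `r` close to zero would suggest the trend is weak.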
A seaborn.histplot with KDE (Kernel Density Estimate) overlay for Annual Income (k$). Reveals whether income is normally distributed, skewed, or bimodal.
The most strategic panel. A seaborn.scatterplot mapping:
- X-axis: `Annual Income (k$)`
- Y-axis: `Spending Score (1-100)`
- Color (hue): `Gender`
- Point size: `Age`
Two dashed grey lines divide the space into quadrants:
- Vertical line at income = $60k
- Horizontal line at spending score = 50
Three quadrant labels are annotated:
| Label | Quadrant | Meaning |
|---|---|---|
| IMPULSIVE | Low Income, High Spending | Spends a lot despite earning little |
| TARGET GROUP | High Income, High Spending | Best customers — high value, high engagement |
| HIGH POTENTIAL | High Income, Low Spending | Earns a lot but doesn't spend — opportunity for upselling |
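The segmentation panel described above could be sketched roughly as follows. The customers are made up, and the label coordinates and `sizes` range are guesses; only the axis mappings, the two thresholds, and the three label names come from the description.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up customers spread across the quadrants.
toy = pd.DataFrame({
    "Annual Income (k$)": [20, 30, 90, 100, 95, 25],
    "Spending Score (1-100)": [85, 20, 90, 15, 75, 70],
    "Gender": ["F", "M", "F", "M", "M", "F"],
    "Age": [22, 45, 31, 58, 36, 27],
})

fig, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(data=toy, x="Annual Income (k$)", y="Spending Score (1-100)",
                hue="Gender", size="Age", sizes=(40, 200), ax=ax)

# Quadrant dividers at income = 60 (k$) and spending score = 50.
ax.axvline(60, color="grey", linestyle="--")
ax.axhline(50, color="grey", linestyle="--")

# Label positions are guesses; adjust them to the real axis ranges.
ax.text(25, 95, "IMPULSIVE", ha="center", fontweight="bold")
ax.text(95, 95, "TARGET GROUP", ha="center", fontweight="bold")
ax.text(95, 5, "HIGH POTENTIAL", ha="center", fontweight="bold")
```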
- The dataset is relatively balanced but slightly female-dominant.
- Graduate-level customers form the largest education segment.
- There is a mild negative correlation between age and spending score — younger shoppers tend to spend more.
- Income follows a roughly normal distribution with most customers earning $40k–$80k.
- The segmentation scatter reveals a clear cluster in the Target Group (high income + high spending) that warrants focused marketing.
- The High Potential segment (high income, low spending) is the biggest untapped opportunity.
These are concrete improvements to the existing analysis:
- Handle the `"Uneducated"` and `"Not Specified"` labels more carefully: consider grouping them or excluding them from percentage calculations.
- Check for and handle duplicate `CustomerID` values.
- Validate the `Spending Score` column for out-of-range values (should be 1–100).
- Add `df.info()` and `df.describe()` cells to document data types and basic statistics.
- Add value labels on top of the gender countplot bars (e.g., show exact counts).
- Add a fourth quadrant label ("SAVERS" — Low Income, Low Spending) to complete the segmentation story.
- Use consistent color palettes across all panels for a more professional look.
- Add axis labels and units to every chart (e.g., "Annual Income (USD thousands)").
- Increase font sizes for tick labels — currently hard to read at smaller display sizes.
- Add a Marital Status panel — it's in the dataset but not visualized at all.
- Break down Spending Score by Education using a boxplot to see which education group spends the most.
- Add a correlation heatmap (`sns.heatmap`) for numeric columns: Age, Income, Spending Score.
- Add a Gender × Income grouped bar chart to compare income levels between male and female customers.
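A few of the suggestions above need only pandas. The sketch below shows one possible way to flag duplicate IDs, find out-of-range scores, and build the correlation matrix that a heatmap would display. The rows are made up and deliberately contain one duplicate and one invalid score.

```python
import pandas as pd

# Made-up rows containing one duplicate ID and one out-of-range score.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 2, 4],
    "Age": [20, 35, 35, 60],
    "Annual Income (k$)": [30, 60, 60, 90],
    "Spending Score (1-100)": [80, 55, 55, 120],
})

# Rows whose CustomerID already appeared earlier in the frame.
duplicate_rows = toy[toy["CustomerID"].duplicated()]

# Rows whose score falls outside the documented 1-100 range.
bad_scores = toy[~toy["Spending Score (1-100)"].between(1, 100)]

# Pairwise Pearson correlations between the numeric columns.
numeric_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
corr = toy[numeric_cols].corr()

print(len(duplicate_rows), len(bad_scores))
print(corr.round(2))
```

Passing `corr` to `sns.heatmap(corr, annot=True, vmin=-1, vmax=1)` would then render the matrix as a panel.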
These are skills and tools interns can learn to significantly level up the dashboard:
| Skill | What to Practice |
|---|---|
| `groupby` + `agg` | Calculate average spending score per education level, per gender, per age group |
| `pd.cut` / `pd.qcut` | Create age buckets (18–25, 26–35, 36–50, 50+) and analyze each group separately |
| `value_counts(normalize=True)` | Convert counts to percentages for cleaner reporting |
| Boolean filtering | Isolate the "Target Group" customers and profile them separately |
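The four pandas skills in this table can be tried on a small made-up frame, as in the sketch below. The bin edges and the strict `>` comparisons for the Target Group are assumptions; the dashboard thresholds (income 60, score 50) come from the segmentation panel.

```python
import pandas as pd

# Made-up customers for illustration only.
toy = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F", "M"],
    "Age": [22, 30, 41, 50, 63, 19],
    "Annual Income (k$)": [20, 70, 85, 40, 95, 65],
    "Spending Score (1-100)": [80, 60, 90, 30, 20, 75],
})

# groupby + agg: mean and count of spending score per gender.
by_gender = toy.groupby("Gender")["Spending Score (1-100)"].agg(["mean", "count"])

# pd.cut: bins are right-inclusive, so age 50 lands in "36-50".
toy["Age Group"] = pd.cut(toy["Age"], bins=[17, 25, 35, 50, 120],
                          labels=["18-25", "26-35", "36-50", "50+"])

# value_counts(normalize=True): shares instead of raw counts.
gender_share = toy["Gender"].value_counts(normalize=True)

# Boolean filtering: isolate the "Target Group" quadrant.
target = toy[(toy["Annual Income (k$)"] > 60) & (toy["Spending Score (1-100)"] > 50)]

print(by_gender)
print(toy[["Age", "Age Group"]])
print(gender_share)
print(target)
```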
| Technique | Purpose |
|---|---|
| K-Means Clustering (`sklearn.cluster.KMeans`) | Automatically find customer segments from Income + Spending Score instead of using manual quadrant lines |
| Elbow Method | Determine the optimal number of clusters k |
| PCA (Principal Component Analysis) | Reduce dimensions for visualizing clusters when more features are added |
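As a starting point for the first two techniques, the sketch below runs K-Means for several values of k and records the inertia for an elbow plot. The points are synthetic, generated around three made-up centres, and the choices `n_init=10` and `random_state=0` are assumptions for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (income, spending score) points around three made-up centres.
rng = np.random.default_rng(0)
centres = np.array([[25, 80], [60, 50], [90, 20]])
X = np.vstack([c + rng.normal(scale=4, size=(30, 2)) for c in centres])

# Elbow method: fit K-Means for several k and record the inertia
# (within-cluster sum of squared distances).
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))

# Final model at the k where the curve flattens (3 for this synthetic data).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

On the real data, `X` would be `df[["Annual Income (k$)", "Spending Score (1-100)"]].to_numpy()`, and the elbow may be less clear-cut than with these well-separated synthetic blobs.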
| Tool | Use Case |
|---|---|
| Plotly Express (`plotly.express`) | High-level plotting API similar in spirit to seaborn; produces interactive charts with hover tooltips |
| Dash by Plotly | Build a full web dashboard app in Python with dropdown filters and sliders |
| Streamlit | Fastest way to turn a notebook into a shareable web app — add a gender/income filter sidebar |
| Power BI / Tableau | Industry-standard BI tools — practice connecting CSV data and building the same dashboard visually |
| Practice | Why It Matters |
|---|---|
| Write reusable functions for each chart panel | Keeps code DRY and easier to maintain |
| Add Markdown cells between code cells | Documents your reasoning — essential for professional notebooks |
| Pin library versions in `requirements.txt` (e.g., `pandas==2.2.0`) | Ensures the notebook runs identically on any machine |
| Use `pathlib.Path` instead of raw strings for file paths | Cross-platform compatibility (Windows vs Mac/Linux) |
- Structure the notebook as a story: Business Question → Data → Finding → Recommendation
- Add a final Markdown cell with a written summary of the 3 most important findings and their business implications
- Export the notebook as a PDF or HTML report (`jupyter nbconvert`) to share without requiring Python
Prerequisites: Python 3.8+
# 1. Install dependencies
pip install -r requirements.txt
# 2. Open the notebook
jupyter notebook anailyze.ipynb
# or in VS Code: open anailyze.ipynb directly
# 3. Run all cells (Kernel > Restart & Run All)
# Output: final_mall_dashboard.png will be created in the same folder

Dataset: Mall Customer Segmentation, commonly used for beginner clustering and EDA practice.