Author: Duncan Chicho (NC-Dan)
Tool: Microsoft SQL Server (T-SQL)
Dataset: Credit Card Fraud Detection — ULB Machine Learning Group (Kaggle)
Scale: 283,726 clean transactions | 473 confirmed fraud cases
Complexity: Advanced — Views, Stored Procedures, Indexes, Anomaly Detection
Credit card fraud costs the global financial industry billions annually. This project analyses 283,726 real European credit card transactions to:
- Detect — identify fraud patterns by amount, time and behaviour
- Profile — build a statistical profile of fraudulent transactions
- Monitor — create reusable Views and Stored Procedures for ongoing fraud surveillance
- Source: Real anonymised transactions from European cardholders (2 days)
- Features: V1-V28 are PCA-transformed to protect cardholder privacy
- Challenge: Only 0.1667% of transactions are fraudulent — finding 473 cases among 283,253 legitimate transactions is the core analytical challenge
- Industry relevance: This imbalance mirrors real production fraud data
| Issue | Detail | Action Taken |
|---|---|---|
| NULLs in V2-V27 | 21 per column — import precision loss | Retained — imputation would corrupt anonymised PCA features |
| NULLs in V28 | 237 values | Retained — same reason |
| Duplicate rows | 1,081 identified | Deleted — ROW_NUMBER() partition method |
| Fraud count corrected | 492 → 473 after deduplication | 19 duplicate fraud rows removed |
| Fraud rate corrected | 0.173% → 0.1667% | Recalculated post-cleaning |
Cleaning method: ROW_NUMBER() OVER (PARTITION BY Time, Amount, Class, V1, V2, V3 ORDER BY TransactionID) — kept lowest TransactionID per duplicate group.
- 473 fraud cases hidden among 283,253 legitimate transactions
- Fraud rate: 0.1667% — 1 in every 600 transactions
- A model predicting everything as legitimate would be 99.83% accurate — and completely useless
- This is why statistical anomaly detection is essential

- Fraud avg amount: $123.87 vs Legitimate avg: $88.41 — 40% higher
- Fraud MaxAmount: $2,125.87 vs Legitimate MaxAmount: $25,691.16
- Fraudsters deliberately avoid extremely large transactions — they stay below detection thresholds
- This behaviour is known as structuring in financial crime analysis
- 73.6% of all fraud (348 cases) involves amounts under $100
- Micro transactions (<$10): second highest fraud rate at 0.2457%
- Small transactions ($10-$100): lowest fraud rate at 0.0849%
- Pattern: criminals use tiny amounts to test stolen cards before escalating
- Large transactions (>$1,000): 0.2937% fraud rate — highest of all bands
- Only 3,064 large transactions exist — but 9 are fraudulent
- Combined with micro test transactions — fraud follows a clear two-stage pattern

- Hour 2am: 1.4550% fraud rate — nearly 10x the dataset average
- Hour 4am: 1.0459% — second highest
- Top 3 fraud hours are all between midnight and 5am
- Fraudsters operate when human monitoring is at its lowest

- Hour 0 (midnight): only 6 fraud cases but avg amount $303.35 — highest night avg
- Hour 2am: 48 cases but avg only $78.19 — high volume, lower amounts
- Two distinct night fraud profiles: bold midnight operators vs high-volume 2am attackers
- 93.9% of fraud (444 of 473 cases) occurs in Normal transactions (Z-Score ≤ 2)
- Only 29 fraud cases appear in Anomalous (high Z-Score) transactions
- Z-Score detection alone would miss 94% of fraud
- Multi-factor detection combining amount, time and behavioural signals is required
- $100-$1,000 fraud: 116 transactions | $38,497.60 total value — 66% of all fraud value
- $0-$100 fraud: 348 transactions | $6,856.46 total value — high volume, low value
- Total fraud value across all 473 cases: $58,591.39

A typical fraud transaction is:
- Amount between $0-$100 (73.6% probability)
- Occurring between midnight and 5am (highest risk window)
- Appearing statistically normal (93.9% of cases)
- Part of a sequence — micro test followed by mid-range escalation
- Standard accuracy metrics are misleading on 0.1667% fraud rate data
- Precision, Recall and F1-Score are required for proper fraud model evaluation
- SQL anomaly detection (Z-Score) is a first-pass filter — not a complete solution
- Recommended next step: Python scikit-learn with SMOTE oversampling for full model
1. Implement real-time night monitoring (midnight–5am) Fraud rate at 2am is 10x the daytime average. Automated alerts for transactions in this window — especially amounts between $50-$500 — would flag the highest risk period with minimal false positives.
2. Flag micro-transaction sequences A card making multiple transactions under $10 within a short window is exhibiting test transaction behaviour. Rule-based flagging of 3+ micro-transactions within 1 hour from the same origin would catch this pattern early.
3. Build a multi-factor fraud score Z-Score alone catches only 6% of fraud. A composite score combining:
- Transaction amount relative to card history
- Time of day risk weight
- Micro-transaction sequence detection
- V1, V2, V14 feature values (strongest PCA signals in literature) would dramatically improve detection rate.
4. Prioritise mid-range fraud recovery $100-$1,000 fraud represents 66% of total fraud value at lower transaction volumes. Chargeback and recovery efforts should prioritise this band for maximum financial impact.
- Views — 3 production-ready Views for fraud monitoring
- Stored Procedures — 2 parameterised procedures for analyst self-service
- Indexes — Clustered PK + 2 Nonclustered indexes for query performance
- Anomaly Detection — Z-Score calculation using CTEs and CROSS JOIN
- Data Cleaning — ROW_NUMBER() deduplication, NULL audit and documentation
- Window Functions — PERCENTILE_CONT, ROW_NUMBER, SUM OVER
- Multi-level CTEs — chained CTEs for statistical calculations
- IDENTITY columns — auto-incrementing Primary Key
- DROP PROCEDURE IF EXISTS — safe procedure management
| File | Description |
|---|---|
| 01_setup_and_data_quality.sql | Import verification, NULL audit, deduplication, PK, Indexes |
| 02_views_and_setup.sql | 3 Views: vw_FraudTransactions, vw_LegitTransactions, vw_TransactionSummary |
| 03_fraud_analysis.sql | Amount comparison, fraud by band, fraud by hour using Views |
| 04_stored_procedures.sql | sp_FraudByAmountThreshold, sp_FraudByHourRange, Z-Score anomaly detection |
Credit Card Fraud Detection
ULB Machine Learning Group — real anonymised European cardholder transactions
Connect on LinkedIn 🔗 linkedin.com/in/duncanalyst