
Process Decisions Optimization — Decision Tree

Open in Colab

"A new operator running a high-speed setup on a difficult material batch without completing the pre-run checklist. These factors interact in non-linear ways: velocity alone does not cause scrap, but that combination almost always does."


🎯 Business Problem

Scrap in a stamping press line doesn't announce itself. By the time it's counted in the bin at end of shift, the material is already lost — along with the machine time, tooling wear, and downstream scheduling impact that came with it. The standard response is reactive: log the defect, investigate the cause, issue a corrective action. Repeat next week.

This project reframes the question: instead of investigating scrap after it happens, can a model score the risk of a bad run before the press starts? The answer is yes — because the conditions that produce scrap (operator experience, checklist completion, supplier lot quality, press speed) are all known at setup time. They exist in the data. They just haven't been connected to an actionable risk score.

A Decision Tree doesn't find hidden patterns in this problem. It formalizes what experienced engineers already know — and makes those rules auditable, transferable, and documentable.


📊 Dataset

  • 2,000 stamping press production records from a manufacturing environment
  • Target: scrap_risk — three classes: Low / Medium / High
  • Class distribution: Low 19.6% · Medium 50.1% · High 30.3%
  • Source: Simulated operational data reflecting real stamping process factor interactions
| Layer | Feature | Description |
|---|---|---|
| Machine | press_speed_spm | Press speed in strokes per minute |
| Machine | raw_material_hardness_hrb | Material hardness on the HRB scale |
| Operator | operator_experience_yrs | Years of operator experience |
| Operator | shift | Day / Night / Early_Morning |
| Material | critical_supplier_lot | Flag: 1 = lot from critical supplier |
| Environment | ambient_temp_c | Shop floor temperature at run start |
| Process | recent_model_change | Flag: 1 = model change in last 48h |
| Process | setup_checklist_complete | Flag: 1 = pre-run checklist completed |

Key EDA findings:

  • Critical supplier lots: 54.0% High Risk vs 19.9% for standard lots — the largest single structural gap
  • Incomplete checklist: 42.5% High Risk vs 17.8% when complete — a process control lever, not a luck factor
  • Night shift: 38.7% High Risk vs 23.2% Day — partially explained by operator experience distribution

🤖 Model

Algorithm: Decision Tree (Gini, max_depth=5) — sklearn.tree.DecisionTreeClassifier

Decision Trees are the right model here for a reason that goes beyond performance: the output is a set of if-then-else rules that can be printed and posted at the press. The model doesn't just classify — it generates process documentation. A LinearSVC coefficient communicates direction; a Decision Tree rule communicates the exact threshold and the path to the decision.

This is a multiclass problem (three risk levels), so macro-averaged F1 is the primary metric — it penalizes poor performance on any class equally, regardless of frequency.
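The accuracy-vs-macro-F1 point can be sketched with a toy example (fabricated labels with roughly the project's class mix, not the project data):

```python
from sklearn.metrics import accuracy_score, f1_score

# A classifier that never predicts the minority class "Low" still looks
# fine on accuracy, but macro F1 exposes the failure.
y_true = ["Medium"] * 50 + ["High"] * 30 + ["Low"] * 20
y_pred = ["Medium"] * 50 + ["High"] * 30 + ["Medium"] * 20  # every Low mislabeled

print(accuracy_score(y_true, y_pred))             # 0.80
print(f1_score(y_true, y_pred, average="macro"))  # ~0.61 — Low's F1 is 0.0
```

Macro averaging gives each class an equal vote in the final score, which is what makes the 0.0 on Low impossible to hide.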

Why max_depth=5, min_samples_leaf=50: Deliberately constrained. Each leaf must represent at least 50 production runs — not a single outlier. The tree is slightly less accurate than an unconstrained version, and significantly more generalizable. In manufacturing, that trade is always worth it.

Preprocessing: OneHotEncoder on shift (three levels), passthrough on everything else. No scaling — trees split on thresholds, not distances.
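A minimal sketch of that setup, assuming the column name from the dataset table above; the notebook's exact code may differ:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# One-hot encode the categorical `shift` column; pass every other
# feature through untouched — trees split on thresholds, so no scaling.
preprocess = ColumnTransformer(
    transformers=[("shift", OneHotEncoder(handle_unknown="ignore"), ["shift"])],
    remainder="passthrough",
)

model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(
        criterion="gini",        # Gini impurity, as stated above
        max_depth=5,             # deliberately shallow
        min_samples_leaf=50,     # each leaf = at least 50 production runs
        random_state=42,
    )),
])
```

Keeping the encoder inside the pipeline means the same object can be fit, cross-validated, and shipped without a separate preprocessing step.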


📈 Key Results

| Metric | Value |
|---|---|
| Test Accuracy | 66.3% |
| Train Accuracy | 72.1% (small gap — well controlled) |
| F1 Macro | 63.8% |
| F1 Weighted | 65.7% |
| CV Accuracy (5-fold) | 68.8% ± 1.5% |
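A figure like "68.8% ± 1.5%" comes from 5-fold cross-validation. A sketch on synthetic data (not the project dataset), using the same tree constraints:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in for the scrap-risk data.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

scores = cross_val_score(
    DecisionTreeClassifier(max_depth=5, min_samples_leaf=50, random_state=0),
    X, y, cv=5, scoring="accuracy",
)
print(f"{scores.mean():.1%} ± {scores.std():.1%}")  # mean ± std across 5 folds
```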

Per-class performance:

| Class | Precision | Recall | F1 |
|---|---|---|---|
| High | 0.64 | 0.70 | 0.67 |
| Low | 0.89 | 0.41 | 0.56 |
| Medium | 0.64 | 0.74 | 0.69 |

Honest note: Low class recall is 41% — the tree frequently confuses Low with Medium. Operationally, this is the less critical error: sending a Low-risk run through Medium-risk protocols wastes some caution, but doesn't allow scrap to happen undetected. The model prioritizes High-risk detection, which it handles well (70% recall).

Confusion matrix (600 test runs):

| | Pred: High | Pred: Low | Pred: Medium |
|---|---|---|---|
| Actual: High | 128 ✅ | 0 | 54 |
| Actual: Low | 0 | 48 ✅ | 70 |
| Actual: Medium | 72 | 6 | 222 ✅ |
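A labeled matrix like this can be built by fixing the class order explicitly. Toy illustration with six fabricated runs, not the 600-run test set:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Pinning `labels` keeps rows/columns in a known order (sklearn would
# otherwise sort the observed labels).
labels = ["High", "Low", "Medium"]
y_test = ["High", "High", "Low", "Medium", "Medium", "Low"]
y_pred = ["High", "Medium", "Low", "Medium", "High", "Medium"]

cm = confusion_matrix(y_test, y_pred, labels=labels)
print(pd.DataFrame(cm,
                   index=[f"Actual: {c}" for c in labels],
                   columns=[f"Pred: {c}" for c in labels]))
```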

🔍 Feature Importance (Gini)

| Feature | Importance | What it means |
|---|---|---|
| operator_experience_yrs | 30.4% | Strongest driver — experience compensates for difficult conditions |
| setup_checklist_complete | 21.9% | Process control lever — the most actionable single intervention |
| critical_supplier_lot | 20.1% | Material quality — triggers a mandatory change in run conditions |
| press_speed_spm | 19.6% | Speed interacts with experience — not dangerous alone |
| recent_model_change | 7.9% | Setup instability signal |
| shift, hardness, temp | 0.0% | Zero Gini importance in this tree configuration |

The top four features account for 92% of the model's decision power. The tree's top split is operator experience at 2 years — a threshold that any HR or production planning system already tracks.
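Gini importances come straight off the fitted estimator. A sketch on synthetic data (not the project dataset, so the numbers will differ):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Fit a constrained tree on synthetic data, then rank features by
# Gini importance — importances across all features sum to 1.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=50,
                              random_state=0).fit(X, y)

names = [f"f{i}" for i in range(5)]  # placeholder feature names
for name, imp in sorted(zip(names, tree.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.1%}")
```

Note that a feature can show 0.0% importance simply because the constrained tree never needed it for a split, not because it is uninformative.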


🗂️ Repository Structure

Process_Decisions_Optimization/
├── 05_DT_Process_Decisions_Optimization.ipynb  # Notebook (no outputs)
├── scrap_risk_data.csv                         # Sample dataset (250 rows)
├── README.md
└── requirements.txt

📦 Full Project Pack — complete dataset (2,000 rows), notebook with full outputs including tree visualization and text rules, presentation deck (PPTX + PDF), and app.py pre-run risk simulator available on Gumroad.


🚀 How to Run

Option 1 — Google Colab: Click the badge above.

Option 2 — Local:

pip install -r requirements.txt
jupyter notebook 05_DT_Process_Decisions_Optimization.ipynb

💡 Key Learnings

  1. The model output is the SOP — export_text() produces if-then-else rules that can be transcribed directly into process control documents. No translation needed between model and practice.
  2. Multiclass F1 Macro is not optional — with three classes and class imbalance, accuracy is misleading. A model that ignores Low entirely can still score 80% accuracy. Macro F1 prevents that illusion.
  3. Controlling complexity is a design decision — max_depth=5 and min_samples_leaf=50 are not limitations imposed by the data. They're choices made to produce a model that generalizes to next week's production, not just last week's.
  4. Zero-importance features tell the story too — shift, hardness, and ambient temperature carry no Gini importance. This doesn't mean they're irrelevant to scrap — it means their effect is already captured by the features that matter (experience, checklist, supplier lot).
  5. The interaction structure matters more than individual variables — press speed at 55 spm with an experienced operator is manageable. At 55 spm with a 6-month operator and an incomplete checklist, it isn't. Trees capture this logic without feature engineering.
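The first learning above can be sketched in a few lines. This uses sklearn's toy iris dataset rather than the scrap-risk data, but export_text works the same way on any fitted tree:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree, then dump it as human-readable if-then rules.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

rules = export_text(tree, feature_names=["sepal_len", "sepal_wid",
                                         "petal_len", "petal_wid"])
print(rules)  # indented rules with exact split thresholds, one path per leaf
```

The printed output is exactly the artifact the text describes: thresholds and paths that can be posted at the press with no translation layer.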

👤 Author

Luis Lozano | Operational Excellence Manager · Master Black Belt · Machine Learning
GitHub: LozanoLsa · Gumroad: lozanolsa.gumroad.com

Turning Operations into Predictive Systems — Clone it. Fork it. Improve it.
