dbestdataguy/Well-Log-Analysis

Well Log Analysis + ML — Lithology Classification & Porosity Prediction

Supervised machine learning pipeline for subsurface reservoir characterisation using wireline well logs from the FORCE 2020 benchmark dataset. Two tasks are solved from the same data: classifying rock type (lithology) across 11 classes, and predicting continuous porosity values from indirect log curves.

Dataset: FORCE 2020 Lithofacies Prediction Benchmark — 98 wells, Norwegian North Sea
Models: Random Forest and XGBoost (both tasks)
Evaluation: Split by well — 79 training wells, 19 held-out test wells


Results

Task 1 — Lithology Classification (11 classes)

| Model | Accuracy | Macro F1 |
| --- | --- | --- |
| Random Forest | 75.0% | 0.468 |
| XGBoost | 67.9% | 0.429 |
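
The gap between accuracy and macro F1 comes from class imbalance: accuracy is dominated by the majority class, while macro F1 averages the per-class F1 scores with equal weight. The sketch below illustrates this on invented toy labels (not the FORCE data); the helper name `per_class_f1` is hypothetical.

```python
def per_class_f1(y_true, y_pred):
    """Return {class: F1}, computed one-vs-rest for every class seen."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


# Toy example: a model that always predicts the majority class.
y_true = ["Shale"] * 8 + ["Dolomite"] * 2
y_pred = ["Shale"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1.values()) / len(f1)  # unweighted mean over classes

print(f"accuracy={accuracy:.3f}  macro_f1={macro_f1:.3f}")
# prints: accuracy=0.800  macro_f1=0.444
```

A rare class that is never predicted contributes an F1 of zero and pulls the macro average down, even when overall accuracy looks healthy.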

Per-class F1 on held-out test wells (Anhydrite and Tuff are not reported):

| Lithology | RF F1 | XGB F1 | Notes |
| --- | --- | --- | --- |
| Shale | 0.861 | 0.802 | Dominant class, well separated by GR |
| Halite | 0.864 | 0.778 | Near-zero GR, highly distinctive |
| Sandstone | 0.711 | 0.723 | XGBoost slightly better |
| Limestone | 0.610 | 0.520 | RF advantage |
| Coal | 0.587 | 0.413 | Low density signature |
| Marl | 0.373 | 0.352 | Transitional, overlaps with Shale |
| Sandstone/Shale | 0.340 | 0.398 | Mixed facies, hard by definition |
| Chalk | 0.028 | 0.123 | XGBoost better on rare classes |
| Dolomite | 0.000 | 0.015 | Insufficient samples |

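Key finding: Random Forest scores higher on five of the nine classes listed (Shale, Halite, Limestone, Coal, Marl). XGBoost is slightly ahead on Sandstone and on the mixed and rare classes (Sandstone/Shale, Chalk, Dolomite), which is consistent with boosting's focus on hard examples, although both models stay below 0.13 F1 on Chalk and Dolomite.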

Task 2 — Porosity Regression

| Model | RMSE | MAE | R² |
| --- | --- | --- | --- |
| Random Forest | 0.0489 | 0.0375 | 0.866 |
| XGBoost | 0.0523 | 0.0388 | 0.847 |

Target: density porosity (PHIT) derived from RHOB using the standard petrophysical formula. RHOB, NPHI, and their derivatives were excluded from the feature set to prevent data leakage — the model predicts porosity from independent logs only (DTC, GR, resistivity, SP, caliper).

Key finding: DTC (sonic travel time) is the dominant predictor (importance 0.36 for RF, 0.55 for XGBoost). This agrees with the Wyllie time-average equation, which relates sonic travel time directly to porosity, so the feature ranking learned by the models matches long-established petrophysical theory.
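
For reference, the Wyllie time-average relation can be written as a one-line function. This is a minimal sketch and is not taken from the project notebooks: the matrix and fluid travel times below are typical textbook values for sandstone and fresh mud filtrate, assumed here for illustration, and the function name is hypothetical.

```python
def wyllie_porosity(dtc, dt_matrix=55.5, dt_fluid=189.0):
    """Sonic porosity from the Wyllie time-average equation.

    dtc, dt_matrix and dt_fluid are travel times in microseconds per foot.
    The defaults are ASSUMED textbook values (sandstone matrix, fresh
    mud filtrate), not values used by this project.
    """
    return (dtc - dt_matrix) / (dt_fluid - dt_matrix)


# A reading equal to the matrix travel time means zero porosity;
# slower (larger) travel times map to higher porosity.
print(wyllie_porosity(55.5))            # 0.0
print(round(wyllie_porosity(90.0), 3))  # 0.258
```

Because the relation is linear and monotonic in DTC, a tree model can recover most of it from DTC alone, which would explain the high importance score.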


Predicted lithology log

Five-track plot showing GR, resistivity, expert labels, RF predictions, and XGBoost predictions for a held-out well. The model correctly identifies major stratigraphic boundaries and reservoir sand zones.


Dataset

FORCE 2020 Lithofacies Benchmark

Bormann, P., Aursand, P., Dilib, F., Dischington, P., & Garland, C. (2020). FORCE Machine Learning Competition. https://github.com/bolgebrygg/Force-2020-Machine-Learning-competition

Available at: https://drive.google.com/drive/folders/0B7brcf-eGK8CRUhfRW9rSG91bW8

The dataset contains wireline well logs and expert lithofacies labels for 98 wells in the Norwegian North Sea, released for the FORCE 2020 machine learning competition. Labels were assigned manually by experienced geoscientists.

Why FORCE 2020 instead of Volve?

This project was originally scoped around the Volve field dataset released by Equinor. Volve is a landmark open dataset, but its download is unreliable and the raw files need substantial format conversion before ML can be applied. The FORCE 2020 dataset is a peer-reviewed benchmark designed specifically for ML: it provides clean, labelled CSV data, results on it are directly comparable to published baselines, and it is widely used in the well-log ML community.

Log curves used

| Log | Measures | Missing % |
| --- | --- | --- |
| GR | Natural radioactivity (primary shale indicator) | 0.0% |
| RDEP | Deep resistivity (hydrocarbon indicator) | 0.9% |
| RMED | Medium resistivity | 3.3% |
| DTC | Compressional sonic travel time | 6.9% |
| CALI | Borehole diameter (quality control) | 7.5% |
| RHOB | Bulk density | 13.8% |
| SP | Spontaneous potential | 26.2% |
| NPHI | Neutron porosity | 34.6% |
| RSHA | Shallow resistivity | 46.1% |

Seven logs with >70% missingness were dropped (SGR, DTS, RMIC, ROPA, DCAL, MUDWEIGHT, RXO). BS was dropped as redundant with CALI (correlation = −0.90).
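
The missingness screen can be sketched as follows. This is an illustrative version on toy data, assuming sentinel values have already been converted to NaN; the function name `drop_sparse_logs` is hypothetical and the toy values are invented.

```python
import numpy as np
import pandas as pd


def drop_sparse_logs(df, log_cols, max_missing=0.70):
    """Drop log columns whose NaN fraction exceeds max_missing.

    Returns the reduced frame and the per-column missing fraction.
    """
    missing = df[log_cols].isna().mean()
    to_drop = missing[missing > max_missing].index.tolist()
    return df.drop(columns=to_drop), missing.sort_values()


# Toy frame: SGR is 80% missing, NPHI is 30% missing, GR is complete.
toy = pd.DataFrame({
    "GR": np.arange(10, dtype=float),
    "NPHI": [0.2] * 7 + [np.nan] * 3,
    "SGR": [30.0, 35.0] + [np.nan] * 8,
})

kept, missing = drop_sparse_logs(toy, ["GR", "NPHI", "SGR"])
print(missing)
print(list(kept.columns))  # SGR is removed, GR and NPHI remain
```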

Lithology classes

| Class | Geological character | O&G significance |
| --- | --- | --- |
| Sandstone | High porosity clastic | Primary reservoir |
| Shale | Clay-rich, radioactive | Seal rock |
| Sandstone/Shale | Interbedded | Reduced reservoir quality |
| Limestone | Carbonate | Carbonate reservoir |
| Marl | Carbonate-clay mix | Variable |
| Chalk | Soft carbonate | North Sea specific |
| Halite | Evaporite salt | Perfect seal |
| Anhydrite | Evaporite | Seal rock |
| Tuff | Volcanic ash | Marker horizon |
| Coal | Organic-rich | Source rock indicator |
| Dolomite | Dolomitised carbonate | Fractured reservoir |

Methodology

Why split by well — not by row?

This is the most important methodological decision in the project. A random row split could, for example, put 18,000 rows from a well in the training set and 2,000 rows from the same well in the test set. The model would already have learned that well's depth trends, GR baseline, and porosity regime, which makes the test set trivially easy to predict and inflates the reported scores.

Splitting by entire wells forces the model to generalise to wells it has never seen, which is the real-world task: a petrophysicist calibrates on offset wells and then interprets lithology in a newly logged well.
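
A well-level split can be sketched in a few lines. This is not the notebook code: the column name `WELL`, the 20% test fraction, and the seed are assumptions for illustration, and the toy frame is invented. scikit-learn's `GroupShuffleSplit` offers the same behaviour if you prefer a library splitter.

```python
import numpy as np
import pandas as pd


def split_by_well(df, well_col="WELL", test_fraction=0.2, seed=42):
    """Return (train_df, test_df) so that no well appears in both."""
    wells = df[well_col].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(wells)
    n_test = max(1, int(round(test_fraction * len(wells))))
    test_wells = list(shuffled[:n_test])
    is_test = df[well_col].isin(test_wells)
    return df[~is_test], df[is_test]


# Toy frame: five wells with four samples each.
toy = pd.DataFrame({
    "WELL": np.repeat(["A", "B", "C", "D", "E"], 4),
    "GR": np.linspace(20.0, 150.0, 20),
})

train_df, test_df = split_by_well(toy)
print(sorted(train_df["WELL"].unique()), sorted(test_df["WELL"].unique()))
```

The key property is that the sets of training and test wells are disjoint; every row of a given well lands on the same side of the split.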

Preprocessing

  • Sentinel values (−999.25, −9999) replaced with NaN before imputation
  • Physical impossibilities clipped to instrument range (e.g. GR < 0, RHOB < 1.0)
  • Basement merged into Shale (103 samples, too few to classify)
  • Per-well median imputation — not global median — to preserve local geological character
  • Class weights applied (Shale = 61.6% of data)
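
The cleaning steps above might look like this in pandas. It is a sketch under assumptions, not the notebook code: the `WELL` column name and the function name are hypothetical, only the two clipping bounds quoted above are applied, and the toy values are invented.

```python
import numpy as np
import pandas as pd

SENTINELS = [-999.25, -9999.0]


def clean_logs(df, log_cols, well_col="WELL"):
    """Replace sentinels, clip impossible values, impute per well."""
    out = df.copy()

    # 1. Sentinel values become NaN so they cannot distort the medians.
    out[log_cols] = out[log_cols].replace(SENTINELS, np.nan)

    # 2. Clip physically impossible readings (bounds from the list above).
    if "GR" in log_cols:
        out["GR"] = out["GR"].clip(lower=0.0)
    if "RHOB" in log_cols:
        out["RHOB"] = out["RHOB"].clip(lower=1.0)

    # 3. Fill remaining gaps with the median of the same well.
    well_medians = out.groupby(well_col)[log_cols].transform("median")
    out[log_cols] = out[log_cols].fillna(well_medians)
    return out


toy = pd.DataFrame({
    "WELL": ["A", "A", "A", "B", "B", "B"],
    "GR": [50.0, -999.25, 70.0, -5.0, 120.0, 100.0],
    "RHOB": [2.3, 2.5, -9999.0, 0.5, 2.6, 2.4],
})

cleaned = clean_logs(toy, ["GR", "RHOB"])
print(cleaned)
```

Note that a log that is entirely missing in a well has no per-well median, so it stays NaN after this step and needs separate handling.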

Feature engineering

Seven domain-informed features derived from standard petrophysical relationships:

| Feature | Formula / method | Geological meaning |
| --- | --- | --- |
| DPHI | (2.65 − RHOB) / 1.65 | Density porosity |
| NPHI_RHOB_SEP | NPHI − DPHI | Gas crossover indicator |
| RES_RATIO | log10(RDEP / RSHA) | Invasion ratio (hydrocarbon flag) |
| VSHALE | (GR − GR_clean) / (GR_shale − GR_clean) | Volume of shale |
| GR_NORM | Per-well z-score of GR | Removes well-to-well baseline and scale differences |
| GR_ROLLING_MEAN | 5-sample window mean | Bed thickness context |
| GR_ROLLING_STD | 5-sample window std | Boundary detection |
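
The seven features can be sketched in pandas as below. Several details are assumptions, since the table does not specify them: GR_clean and GR_shale are taken as the per-well 5th and 95th GR percentiles, VSHALE is clipped to [0, 1], the rolling window is centred, and rows are assumed to be sorted by depth within each well. The `WELL` column name and the toy values are also illustrative.

```python
import numpy as np
import pandas as pd


def add_features(df, well_col="WELL", window=5):
    """Add the seven engineered features to a copy of df."""
    out = df.copy()
    gr_by_well = out.groupby(well_col)["GR"]

    out["DPHI"] = (2.65 - out["RHOB"]) / 1.65
    out["NPHI_RHOB_SEP"] = out["NPHI"] - out["DPHI"]
    out["RES_RATIO"] = np.log10(out["RDEP"] / out["RSHA"])

    # ASSUMPTION: clean-sand and shale baselines from per-well percentiles.
    gr_clean = gr_by_well.transform(lambda s: s.quantile(0.05))
    gr_shale = gr_by_well.transform(lambda s: s.quantile(0.95))
    vshale = (out["GR"] - gr_clean) / (gr_shale - gr_clean)
    out["VSHALE"] = vshale.clip(0.0, 1.0)  # ASSUMPTION: clipped to [0, 1]

    # Per-well z-score of GR.
    mean, std = gr_by_well.transform("mean"), gr_by_well.transform("std")
    out["GR_NORM"] = (out["GR"] - mean) / std

    # ASSUMPTION: centred window; the rolling pass never crosses wells.
    def roll(s):
        return s.rolling(window, center=True, min_periods=1)

    out["GR_ROLLING_MEAN"] = gr_by_well.transform(lambda s: roll(s).mean())
    out["GR_ROLLING_STD"] = gr_by_well.transform(lambda s: roll(s).std())
    return out


toy = pd.DataFrame({
    "WELL": ["A"] * 5,
    "GR": [20.0, 40.0, 60.0, 80.0, 100.0],
    "RHOB": [2.65, 2.32, 2.485, 2.65, 2.65],
    "NPHI": [0.05, 0.25, 0.12, 0.03, 0.02],
    "RDEP": [10.0, 20.0, 5.0, 2.0, 2.0],
    "RSHA": [1.0, 2.0, 5.0, 2.0, 1.0],
})

feats = add_features(toy)
print(feats.round(3))
```

Grouping by well before the rolling and z-score steps keeps each statistic local to one borehole, so values from adjacent wells in the table never bleed into each other.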

Models

Random Forest: 300 trees, max depth 25 (classification) / 20 (regression), sqrt features per split, all CPU cores; balanced class weights for classification.

XGBoost: 300 rounds, max depth 6, learning rate 0.1, 80% row and column subsampling; sample weights to offset class imbalance in classification.
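
Expressed as keyword arguments, the settings above might look like the dictionaries below. The parameter names follow the scikit-learn and XGBoost estimator APIs, but the mapping is an assumption: in particular, "column subsampling" is taken to mean `colsample_bytree`, and the `balanced_sample_weights` helper is a hypothetical stand-in that reproduces the usual "balanced" heuristic, n_samples / (n_classes × class_count).

```python
from collections import Counter

# Intended for sklearn.ensemble.RandomForestClassifier(**RF_CLASSIFIER_PARAMS)
RF_CLASSIFIER_PARAMS = dict(
    n_estimators=300, max_depth=25, max_features="sqrt",
    class_weight="balanced", n_jobs=-1,
)

# Intended for sklearn.ensemble.RandomForestRegressor(**RF_REGRESSOR_PARAMS)
RF_REGRESSOR_PARAMS = dict(
    n_estimators=300, max_depth=20, max_features="sqrt", n_jobs=-1,
)

# Intended for xgboost.XGBClassifier / XGBRegressor.
# ASSUMPTION: column subsampling is applied per tree.
XGB_PARAMS = dict(
    n_estimators=300, max_depth=6, learning_rate=0.1,
    subsample=0.8, colsample_bytree=0.8,
)


def balanced_sample_weights(labels):
    """Per-sample weights: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]


labels = ["Shale"] * 6 + ["Sandstone"] * 3 + ["Coal"]
weights = balanced_sample_weights(labels)
print([round(w, 3) for w in weights])
```

With these weights each class contributes the same total weight to the loss, so a rare class such as Coal is not drowned out by Shale. The weights would be passed as `sample_weight` when fitting the XGBoost classifier.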

Porosity regression — leakage prevention

Density porosity (PHIT) is derived from RHOB. Including RHOB as a feature would allow the model to trivially invert the formula, producing R² ≈ 1.000. RHOB, NPHI, DPHI, NPHI_RHOB_SEP, and DRHO are all excluded from the regression feature set. The model predicts porosity from DTC, resistivity logs, GR, and engineered features only — the geologically honest approach.
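
The exclusion rule can be sketched as a simple column filter, together with a check of why RHOB leaks. Two assumptions are made for illustration: the target is taken to use the same density-porosity formula as DPHI, and the column list and function name are hypothetical.

```python
import numpy as np

# Columns derived from, or inputs to, the density-porosity target.
LEAKY = {"RHOB", "NPHI", "DPHI", "NPHI_RHOB_SEP", "DRHO"}


def regression_feature_columns(columns, target="PHIT", extra_exclude=("WELL",)):
    """Keep only columns that are safe to use as regression features."""
    blocked = LEAKY | {target} | set(extra_exclude)
    return [c for c in columns if c not in blocked]


columns = ["WELL", "GR", "RDEP", "RMED", "DTC", "CALI", "SP",
           "RHOB", "NPHI", "DPHI", "GR_NORM", "PHIT"]
features = regression_feature_columns(columns)
print(features)

# Why RHOB leaks: the target is an exact linear function of it.
# ASSUMPTION: PHIT = (2.65 - RHOB) / 1.65, the same formula as DPHI.
rhob = np.array([2.0, 2.2, 2.4, 2.6])
phit = (2.65 - rhob) / 1.65
r = np.corrcoef(rhob, phit)[0, 1]
print(round(r, 6))  # -1.0: a perfect linear relationship
```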


Project structure

Well-Log-Analysis/
├── data/
│   ├── raw/                         ← CSV_train.csv, CSV_test.csv (not tracked)
│   └── processed/                   ← numpy arrays and dataframes (not tracked)
├── notebooks/
│   ├── 01_eda.ipynb                 ← data loading, visualisation, QC
│   ├── 02_preprocessing.ipynb       ← cleaning, imputation, feature engineering
│   ├── 03_lithology_classification.ipynb  ← RF + XGBoost classification
│   ├── 04_porosity_regression.ipynb ← RF + XGBoost regression
│   └── 05_evaluation.ipynb          ← comparison figures and summary
├── outputs/
│   ├── figures/                     ← all saved plots
│   └── models/                      ← saved model files (not tracked)
├── requirements.txt
├── .gitignore
└── README.md

How to run

1. Get the data

Download CSV_train.csv and CSV_test.csv from the FORCE 2020 Google Drive folder:
https://drive.google.com/drive/folders/0B7brcf-eGK8CRUhfRW9rSG91bW8

Place both files in data/raw/.

2. Set up the environment

conda create -n well-log-analysis python=3.10 -y
conda activate well-log-analysis
pip install -r requirements.txt
python -m ipykernel install --user --name well-log-analysis --display-name "Well Log Analysis"

3. Run notebooks in order

01_eda.ipynb                    ← explore the data (run first)
02_preprocessing.ipynb          ← clean and engineer features
03_lithology_classification.ipynb  ← train and evaluate classifiers
04_porosity_regression.ipynb    ← train and evaluate regressors
05_evaluation.ipynb             ← generate all portfolio figures

Key figures

| Figure | Description |
| --- | --- |
| outputs/figures/lith_pred_*.png | Predicted lithology log vs expert labels |
| outputs/figures/final_per_class_f1.png | Per-class F1 comparison RF vs XGBoost |
| outputs/figures/final_summary_bars.png | Cross-task performance summary |
| outputs/figures/final_rf_confusion.png | RF confusion matrix |
| outputs/figures/porosity_scatter_clean.png | Porosity predicted vs actual |
| outputs/figures/porosity_pred_*.png | Porosity prediction log track |
| outputs/figures/log_correlation.png | Log curve correlation matrix |
| outputs/figures/porosity_by_lithology.png | Median porosity per rock type |

Reference

Bormann, P., Aursand, P., Dilib, F., Dischington, P., & Garland, C. (2020). FORCE Machine Learning Competition — Lithofacies prediction from well logs. Norwegian Petroleum Directorate. https://github.com/bolgebrygg/Force-2020-Machine-Learning-competition


Author: Olasunkanmi
Tasks: Lithology classification (11 classes) + porosity regression
Dataset: FORCE 2020 Lithofacies Benchmark — 98 Norwegian North Sea wells
