benjmoh/asteroid-pha-classification


Asteroid PHA Classification using Machine Learning

Overview

This project applies machine learning techniques to classify Potentially Hazardous Asteroids (PHAs) using NASA asteroid data.

Near-Earth Objects (NEOs) are asteroids and comets whose orbits bring them close to Earth. NASA defines a Potentially Hazardous Asteroid (PHA) as an asteroid with:

  • Minimum Orbit Intersection Distance (MOID) of 0.05 AU or less
  • Absolute magnitude (H) of 22.0 or brighter, which corresponds to a diameter larger than roughly 140 meters

The goal of this project is to train machine learning models to classify whether an asteroid is hazardous based on its orbital and physical properties.
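As a rough illustration, the two-threshold rule can be written as a small function. This is a minimal sketch, not code from the notebook: the function names are hypothetical, the rule is expressed in its simplified diameter form, and the magnitude-to-diameter conversion uses the standard relation with an assumed albedo of 0.14.

```python
import math

MOID_LIMIT_AU = 0.05      # MOID threshold from the definition above
DIAMETER_LIMIT_KM = 0.14  # 140 meters expressed in kilometers


def diameter_km_from_h(abs_magnitude: float, albedo: float = 0.14) -> float:
    """Estimate diameter (km) from absolute magnitude H.

    Uses D = 1329 / sqrt(albedo) * 10 ** (-H / 5).
    The default albedo of 0.14 is an assumption.
    """
    return 1329.0 / math.sqrt(albedo) * 10 ** (-abs_magnitude / 5)


def is_pha(moid_au: float, diameter_km: float) -> bool:
    """Apply the two-threshold rule to a single asteroid.

    Treating the MOID boundary as inclusive is an assumption.
    """
    return moid_au <= MOID_LIMIT_AU and diameter_km > DIAMETER_LIMIT_KM


print(is_pha(0.03, 0.5))   # close orbit, large body
print(is_pha(0.20, 0.5))   # large body, but orbit never comes close
print(round(diameter_km_from_h(22.0), 3))
```

Because the label follows such a rule, a model mainly has to recover two cut-off values from the features.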

Three machine learning models were implemented and compared:

  • Logistic Regression
  • Random Forest
  • Neural Network (Multi-Layer Perceptron)

The models were evaluated using precision, recall, and F1-score, with particular focus on recall since missing a hazardous asteroid could have severe consequences.


Dataset

The dataset used is the NASA Asteroid Classification dataset from Kaggle.

Dataset characteristics:

  • 4687 asteroid observations
  • 40 columns
  • 39 input features
  • 1 target variable (Hazardous)

The features describe orbital and physical asteroid properties such as:

  • eccentricity
  • semi-major axis
  • perihelion distance
  • absolute magnitude
  • estimated diameter

The target variable:

  • Hazardous = True → Potentially Hazardous Asteroid
  • Hazardous = False → Non-hazardous asteroid

Data Preprocessing

Several preprocessing steps were applied before training the models.

Removed identifier features

These columns contain identifiers rather than predictive information:

  • Neo Reference ID
  • Orbit ID

Removed constant features

These features contained a single value across the dataset and therefore could not help separate classes:

  • Equinox (J2000)
  • Orbiting Body (Earth)

Removed redundant features

Estimated diameter appeared in multiple units:

  • kilometers
  • meters
  • miles
  • feet

Only the kilometer measurements were kept.

Removed timestamp features

These columns relate to observation timing rather than asteroid properties:

  • Close Approach Date
  • Perihelion Time
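The column removal described above could look roughly like the following sketch. The helper name is hypothetical, and the column names (including the "Est Dia" prefix for the diameter columns) are assumptions based on the lists above; they may not match the CSV headers exactly.

```python
import pandas as pd


def drop_uninformative_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop identifier, constant, duplicate-unit and timestamp columns."""
    # Identifier and timestamp columns, named as in the lists above (assumed spellings).
    named = ["Neo Reference ID", "Orbit ID", "Close Approach Date", "Perihelion Time"]
    # A column with a single distinct value cannot help separate the classes.
    constant = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
    # Keep only the kilometer version of the diameter estimates.
    other_units = [
        c for c in df.columns
        if c.startswith("Est Dia") and "KM" not in c.upper()
    ]
    to_drop = set(named) | set(constant) | set(other_units)
    # Filtering against df.columns preserves column order and ignores absent names.
    return df.drop(columns=[c for c in df.columns if c in to_drop])
```

Detecting constant columns with `nunique` rather than hard-coding their names keeps the helper usable if the dataset gains or loses such columns.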

Feature scaling

All features were standardized to zero mean and unit variance.
This is particularly important for models sensitive to feature scale such as:

  • Logistic Regression
  • Neural Networks

No missing values were present in the dataset.
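A minimal sketch of standardization with scikit-learn's `StandardScaler` follows. The data here is synthetic, and fitting the scaler on the training rows only is an assumption; the README does not say whether the notebook scales before or after splitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for two features on very different scales.
X_train = rng.normal(loc=[0.2, 20.0], scale=[0.1, 3.0], size=(800, 2))
X_test = rng.normal(loc=[0.2, 20.0], scale=[0.1, 3.0], size=(200, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

Reusing the training statistics on the test rows keeps information about the test set out of the preprocessing step.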


Train/Test Split

The dataset was split into:

  • 80% training
  • 20% testing

Stratified sampling was used to maintain the original class distribution.
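The split can be sketched with `train_test_split` and its `stratify` argument. The data below is synthetic, and the 16% positive rate and `random_state` value are assumptions used only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:160] = 1  # assumed class imbalance: 16% positives

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both parts should show roughly the same share of positives as the full set.
print(y.mean(), y_train.mean(), y_test.mean())
```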


Models

Logistic Regression

Logistic Regression was used as a baseline model.

It assumes a linear relationship between features and the target variable, making it a useful reference point when comparing more complex models.
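To make the linearity assumption concrete, the sketch below computes a logistic regression prediction by hand: a weighted sum of the features passed through a sigmoid. The function name, weights and bias are hypothetical, not values learned in the notebook.

```python
import math


def hazard_probability(features, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for two standardized features (MOID, diameter).
weights = [-2.0, 1.0]
bias = -1.0

# A small MOID (negative after scaling) and above-average diameter.
p = hazard_probability([-1.5, 0.5], weights, bias)
print(round(p, 3))
```

The decision boundary is the set of points where the weighted sum equals zero, which is a straight line (or hyperplane) in feature space.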


Random Forest

Random Forest is an ensemble of decision trees and performs particularly well on structured tabular datasets.

Parameters used:

  • n_estimators = 300
  • n_jobs = -1
  • random_state = 42

Averaging over 300 trees reduces the variance of the predictions, and n_jobs = -1 builds the trees in parallel on all available CPU cores, which keeps training time low.
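With the parameters listed above, the classifier could be constructed as in this sketch. The training data is a synthetic stand-in generated with `make_classification`, not the asteroid dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the notebook uses the preprocessed asteroid features.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

forest = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)
forest.fit(X, y)

train_accuracy = forest.score(X, y)
print(len(forest.estimators_), train_accuracy)
```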


Neural Network (MLP)

A Multi-Layer Perceptron neural network was implemented to model more complex non-linear relationships.

Configuration:

  • hidden_layer_sizes = (64, 32)
  • activation = relu
  • solver = adam
  • learning_rate = adaptive (scikit-learn only uses this setting when solver = sgd, so it has no effect with adam)
  • max_iter = 500
  • early_stopping = True
  • n_iter_no_change = 20
  • random_state = 42

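Early stopping was used to prevent overfitting: scikit-learn holds out part of the training data as a validation set (10% by default) and stops training once the validation score has not improved for n_iter_no_change = 20 consecutive epochs.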
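The configuration above maps onto `MLPClassifier` roughly as follows. This is a sketch on synthetic data, not the notebook's code; only the constructor arguments come from the list above.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data; the notebook uses the standardized asteroid features.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    activation="relu",
    solver="adam",
    learning_rate="adaptive",  # accepted, but only used when solver="sgd"
    max_iter=500,
    early_stopping=True,       # holds out part of the training data for validation
    n_iter_no_change=20,
    random_state=42,
)
mlp.fit(X, y)

# loss_curve_ holds one training-loss value per epoch and can be plotted directly.
print(mlp.n_iter_, len(mlp.loss_curve_))
```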


Model Performance

| Model | Precision | Recall | F1-score | False Negatives | False Positives |
|---|---|---|---|---|---|
| Logistic Regression | 85.5% | 86.1% | 85.8% | 21 | 22 |
| Random Forest | 99.3% | 99.3% | 99.3% | 1 | 1 |
| Neural Network (MLP) | 96.4% | 88.1% | 92.0% | 18 | 5 |
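The three metrics follow directly from the confusion-matrix counts. In the sketch below, the false-positive and false-negative counts come from the table, while the true-positive counts are an assumption: they are back-calculated from the reported recall and are not stated in this README.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (tp, fp, fn) per model; tp values are inferred, not reported.
counts = {
    "Logistic Regression": (130, 22, 21),
    "Random Forest": (150, 1, 1),
    "Neural Network (MLP)": (133, 5, 18),
}

for name, (tp, fp, fn) in counts.items():
    p, r, f1 = precision_recall_f1(tp, fp, fn)
    print(f"{name}: precision={p:.1%} recall={r:.1%} f1={f1:.1%}")
```

Recall depends only on true positives and false negatives, which is why it is the metric that tracks missed hazardous asteroids.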

Results

Logistic Regression

Logistic Regression performed well as a baseline model, achieving balanced precision (85.5%) and recall (86.1%).

However, it made 43 errors in total (21 false negatives and 22 false positives), compared with 23 for the Neural Network and 2 for Random Forest, because linear decision boundaries cannot fully capture the non-linear relationships present in the dataset.


Random Forest

Random Forest achieved the best performance by a large margin, with:

  • Precision: 99.3%
  • Recall: 99.3%
  • F1-score: 99.3%

The confusion matrix showed only:

  • 1 false positive
  • 1 false negative

Training performance was 100%, which is expected for Random Forest: no depth limit appears among the parameters listed above, so each tree can keep splitting until it fits its training sample.

Test performance (99.3%) is close to the training performance, which suggests that the model is not overfitting and that the classes are highly separable in the feature space.


Confusion Matrices

The confusion matrices below show the classification performance of each model on the test dataset.

Logistic Regression

[Figure: Logistic Regression confusion matrix]

Random Forest

[Figure: Random Forest confusion matrix]

Neural Network (MLP)

[Figure: Neural Network confusion matrix]
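For reference, the four cells of a binary confusion matrix can be extracted with scikit-learn as in this sketch. The labels below are a toy example, not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = hazardous, 0 = non-hazardous.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# With labels=[0, 1], ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)
```

Here `fn` counts hazardous asteroids predicted as non-hazardous, the error type this project focuses on.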


Feature Importance

Random Forest feature importance analysis showed that Minimum Orbit Intersection Distance (MOID) dominated the model.

Most important features:

  1. Minimum Orbit Intersection
  2. Absolute Magnitude
  3. Estimated Diameter (km max)
  4. Estimated Diameter (km min)
  5. Perihelion Distance

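This is consistent with the PHA definition: orbital proximity (MOID) and asteroid size are the most critical indicators of hazardous asteroids. Absolute Magnitude and the two Estimated Diameter columns all act as measures of size.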

[Figure: Random Forest feature importance]
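A ranking like the one above can be read from the fitted model's `feature_importances_` attribute. The sketch below uses synthetic data labeled by the two-threshold rule plus one pure-noise feature; the feature names, value ranges and tree count are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 2000
moid = rng.uniform(0.0, 0.1, n)      # AU
diameter = rng.uniform(0.0, 1.0, n)  # km
noise = rng.normal(size=n)           # unrelated to the label

X = np.column_stack([moid, diameter, noise])
y = ((moid < 0.05) & (diameter > 0.14)).astype(int)
names = ["moid", "diameter", "noise"]

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important.
ranking = sorted(zip(names, forest.feature_importances_), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```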


Neural Network (MLP)

The Neural Network performed strongly with an F1-score of 92%, outperforming Logistic Regression.

However, it produced 18 false negatives (88.1% recall), far more than the 1 produced by Random Forest and only slightly fewer than the 21 produced by Logistic Regression.

In the context of planetary defence, false negatives are particularly dangerous because they represent hazardous asteroids that were not detected.

The training loss curve showed successful convergence.

[Figure: MLP training loss curve]


Why Random Forest Performed Best

Random Forest significantly outperformed the other models, most plausibly because tree-based models are well suited to structured tabular data with threshold-like decision rules.

The NASA definition of a hazardous asteroid effectively creates a threshold-based rule:

  • MOID ≤ 0.05 AU
  • Diameter above roughly 140 m (absolute magnitude H ≤ 22.0)

Decision trees naturally discover such threshold rules. Random Forest ensembles combine many trees, allowing them to capture these relationships extremely effectively.

Neural networks typically perform best on:

  • very large datasets
  • highly complex patterns

In this case, the dataset was relatively small (4687 samples) and the decision boundaries were largely threshold-driven, which strongly favours tree-based models.
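The argument can be illustrated on synthetic data labeled by two thresholds. This sketch is not a reproduction of the notebook's results: the feature ranges are chosen so that both thresholds matter, and Logistic Regression is used with its default settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 4000
moid = rng.uniform(0.0, 0.10, n)      # AU, half the samples fall below 0.05
diameter = rng.uniform(0.0, 0.28, n)  # km, half the samples exceed 0.14
X = np.column_stack([moid, diameter])
y = ((moid < 0.05) & (diameter > 0.14)).astype(int)  # two-threshold label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

linear = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42).fit(X_train, y_train)

linear_acc = linear.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"logistic regression: {linear_acc:.3f}, random forest: {forest_acc:.3f}")
```

A single straight line cannot isolate the rectangular region defined by the two thresholds, whereas a tree needs only two axis-aligned splits.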


Limitations

One limitation is that the Hazardous label in the dataset is derived from NASA’s predefined rules.

This means the models are largely rediscovering existing thresholds rather than discovering new relationships.

Additionally:

  • the dataset is relatively small
  • feature importance was dominated by MOID
  • only three models were tested

Future Work

Future improvements could include:

  • testing gradient boosting models such as XGBoost
  • hyperparameter tuning for Random Forest and MLP
  • training on much larger NASA NEO datasets
  • predicting asteroid physical properties such as diameter instead of binary classification
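As one possible starting point for the hyperparameter tuning item, the sketch below runs a small grid search that optimizes recall. The grid values, fold count, tree count and synthetic data are all hypothetical choices, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (about 16% positives, an assumption).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.84, 0.16], random_state=42
)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid={"max_depth": [None, 5], "min_samples_leaf": [1, 5]},
    scoring="recall",  # prioritize catching positives, as in the evaluation above
    cv=3,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```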

Technologies Used

  • Python
  • Pandas
  • NumPy
  • Scikit-learn
  • Matplotlib
  • Seaborn
  • Jupyter Notebook

How to Run

1. Clone the repository

git clone https://github.com/benjmoh/asteroid-pha-classification.git
cd asteroid-pha-classification

2. Install dependencies

pip install -r requirements.txt

3. Launch Jupyter Notebook

jupyter notebook

4. Open the notebook

Open the file:

asteroid-pha-classification.ipynb

Run the notebook cells from top to bottom to reproduce the full machine learning workflow, including:

  • data exploration
  • preprocessing and feature selection
  • train/test splitting
  • model training (Logistic Regression, Random Forest, Neural Network)
  • model evaluation using confusion matrices
  • feature importance analysis
  • neural network training loss visualisation

