Asteroid PHA Classification using Machine Learning

Overview

This project applies machine learning techniques to classify Potentially Hazardous Asteroids (PHAs) using NASA asteroid data.

Near-Earth Objects (NEOs) are asteroids and comets whose orbits bring them close to Earth. NASA defines a Potentially Hazardous Asteroid (PHA) as an asteroid with:

Minimum Orbit Intersection Distance (MOID) of 0.05 AU or less
Absolute magnitude (H) of 22.0 or brighter, which corresponds to a diameter of roughly 140 meters or more

The goal of this project is to train machine learning models to classify whether an asteroid is hazardous based on its orbital and physical properties.

Three machine learning models were implemented and compared:

Logistic Regression
Random Forest
Neural Network (Multi-Layer Perceptron)

The models were evaluated using precision, recall, and F1-score, with particular focus on recall since missing a hazardous asteroid could have severe consequences.
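For reference, all three metrics follow directly from confusion-matrix counts. The sketch below uses a hypothetical helper, `prf_from_counts`, and made-up counts chosen only for illustration; they are not the project's results.

```python
def prf_from_counts(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, not taken from the notebook.
precision, recall, f1 = prf_from_counts(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Recall is the only one of the three that is penalised by false negatives alone, which is why it is the metric of most interest here.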

Dataset

The dataset used is the NASA Asteroid Classification dataset from Kaggle.

Dataset characteristics:

4687 asteroid observations
40 columns
39 input features
1 target variable (Hazardous)

The features describe orbital and physical asteroid properties such as:

eccentricity
semi-major axis
perihelion distance
absolute magnitude
estimated diameter

The target variable:

Hazardous = True → Potentially Hazardous Asteroid
Hazardous = False → Non-hazardous asteroid

Data Preprocessing

Several preprocessing steps were applied before training the models.

Removed identifier features

These columns contain identifiers rather than predictive information:

Neo Reference ID
Orbit ID

Removed constant features

These features contained a single value across the dataset and therefore could not help separate classes:

Equinox (J2000)
Orbiting Body (Earth)

Removed redundant features

Estimated diameter appeared in multiple units:

kilometers
meters
miles
feet

Only the kilometer measurements were kept.

Removed timestamp features

These columns relate to observation timing rather than asteroid properties:

Close Approach Date
Perihelion Time
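The column removals described above could be sketched with pandas roughly as follows. The helper name `drop_uninformative`, the exact CSV header spellings (for example `Est Dia in M(max)`), and the toy frame are assumptions made for illustration; the notebook may do this differently.

```python
import pandas as pd

# Header spellings are assumptions about nasa.csv, not confirmed by this README.
ID_COLS = ["Neo Reference ID", "Orbit ID"]
TIME_COLS = ["Close Approach Date", "Perihelion Time"]
REDUNDANT_UNITS = ("in M(", "in Miles(", "in Feet(")  # keep only the "in KM(" columns


def drop_uninformative(df: pd.DataFrame, target: str = "Hazardous") -> pd.DataFrame:
    """Drop identifier, timestamp, duplicate-unit and constant columns."""
    to_drop = [c for c in ID_COLS + TIME_COLS if c in df.columns]
    to_drop += [c for c in df.columns if any(u in c for u in REDUNDANT_UNITS)]
    # Constant columns (such as Equinox or Orbiting Body) have a single unique value.
    to_drop += [c for c in df.columns
                if c != target and df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=sorted(set(to_drop)))


# Tiny made-up frame standing in for the real dataset.
toy = pd.DataFrame({
    "Neo Reference ID": [1, 2, 3],
    "Est Dia in KM(max)": [0.3, 0.1, 0.9],
    "Est Dia in M(max)": [300.0, 100.0, 900.0],
    "Equinox": ["J2000", "J2000", "J2000"],
    "Close Approach Date": ["1995-01-01", "1995-01-08", "1995-01-15"],
    "Hazardous": [True, False, True],
})
cleaned = drop_uninformative(toy)
print(list(cleaned.columns))
```

Detecting constant columns with `nunique` rather than hard-coding them keeps the step valid if the dataset is later replaced with a larger export.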

Feature scaling

All features were standardized to zero mean and unit variance.
This is particularly important for models sensitive to feature scale such as:

Logistic Regression
Neural Networks

No missing values were present in the dataset.
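A minimal sketch of standardization with scikit-learn's `StandardScaler` is shown below. The arrays are made up, and fitting the scaler on the training rows only is an assumption about good practice (it avoids leaking test-set statistics), not a statement of what the notebook does.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two features on very different scales.
X_train = np.array([[0.02, 120.0], [0.30, 450.0], [0.10, 900.0], [0.45, 60.0]])
X_test = np.array([[0.05, 300.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# Each training column now has mean 0 and standard deviation 1.
print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```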

Train/Test Split

The dataset was split into:

80% training
20% testing

Stratified sampling was used to maintain the original class distribution.
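As an illustration, a stratified 80/20 split can be produced with scikit-learn's `train_test_split`. The data below is synthetic, and `random_state=42` for the split is an assumption (the README only states the seed used for the models).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 20% positive.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 200 + [0] * 800)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The positive rate should be the same in both partitions.
print(len(y_train), len(y_test), y_train.mean(), y_test.mean())
```

Without `stratify`, a small test set can end up with noticeably fewer hazardous examples than expected, which makes recall estimates noisier.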

Models

Logistic Regression

Logistic Regression was used as a baseline model.

It models the log-odds of the target as a linear function of the features, making it a useful reference point when comparing more complex models.
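A baseline of this kind might look like the sketch below. It runs on synthetic data from `make_classification`, and `max_iter=1000` is an assumption, since the README does not list the baseline's settings. The last lines check that the model's decision function really is a linear function of the scaled features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real features come from nasa.csv.
X, y = make_classification(n_samples=500, n_features=8, weights=[0.8], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# max_iter=1000 is an assumed setting, not taken from the notebook.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
test_accuracy = baseline.score(X_test, y_test)

# The log-odds are a weighted sum of the scaled features plus an intercept.
scaler = baseline.named_steps["standardscaler"]
clf = baseline.named_steps["logisticregression"]
log_odds = scaler.transform(X_test) @ clf.coef_.ravel() + clf.intercept_[0]
print(test_accuracy, np.allclose(log_odds, baseline.decision_function(X_test)))
```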

Random Forest

Random Forest is an ensemble of decision trees and performs particularly well on structured tabular datasets.

Parameters used:

n_estimators = 300
n_jobs = -1
random_state = 42

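Using 300 trees reduces the variance of the ensemble's predictions while keeping training time manageable. Setting n_jobs = -1 trains the trees in parallel on all available CPU cores, and random_state = 42 makes the results reproducible.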
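With the parameters listed above, the model could be set up as in this sketch. The training data is synthetic and only stands in for the preprocessed asteroid features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the NASA dataset.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, weights=[0.8], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Parameters as listed in this README.
rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

rf_recall = recall_score(y_test, rf.predict(X_test))
print(len(rf.estimators_), rf_recall)
```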

Neural Network (MLP)

A Multi-Layer Perceptron neural network was implemented to model more complex non-linear relationships.

Configuration:

hidden_layer_sizes = (64, 32)
activation = relu
solver = adam
learning_rate = adaptive
max_iter = 500
early_stopping = True
n_iter_no_change = 20
random_state = 42

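Early stopping was used to limit overfitting: part of the training data is held out as a validation set, and training stops once the validation score has not improved for 20 consecutive iterations (n_iter_no_change = 20). Note that in scikit-learn the learning_rate setting only takes effect with the sgd solver, so it has no effect when the solver is adam.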
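The configuration above maps onto scikit-learn's `MLPClassifier` roughly as follows. The data is synthetic, and scaling inside a pipeline is an assumption about how the inputs are prepared.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the NASA dataset.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.8], random_state=42)

mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    activation="relu",
    solver="adam",
    learning_rate="adaptive",  # only used by the sgd solver; ignored with adam
    max_iter=500,
    early_stopping=True,       # holds out part of the training data for validation
    n_iter_no_change=20,
    random_state=42,
)
model = make_pipeline(StandardScaler(), mlp)
model.fit(X, y)

# loss_curve_ holds the training loss per iteration and can be plotted.
print(mlp.n_iter_, len(mlp.loss_curve_))
```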

Model Performance

Model	Precision	Recall	F1-score	False Negatives	False Positives
Logistic Regression	85.5%	86.1%	85.8%	21	22
Random Forest	99.3%	99.3%	99.3%	1	1
Neural Network (MLP)	96.4%	88.1%	92.0%	18	5

Results

Logistic Regression

Logistic Regression performed well as a baseline model, achieving balanced precision and recall around 86%.

However, the model misclassified a larger number of asteroids compared to the other models because linear decision boundaries cannot fully capture the non-linear relationships present in the dataset.

Random Forest

Random Forest achieved the best performance by a large margin, with:

Precision: 99.3%
Recall: 99.3%
F1-score: 99.3%

The confusion matrix showed only:

1 false positive
1 false negative

Training performance was 100%, which is expected for Random Forest because fully grown decision trees can fit the training data exactly.

The small gap between training (100%) and test (99.3%) performance suggests that the model is not overfitting and that the classes are highly separable in the feature space.

Confusion Matrices

The confusion matrices below show the classification performance of each model on the test dataset.
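As a small illustration of how false negatives and false positives are read off a confusion matrix, the sketch below uses scikit-learn's `confusion_matrix` on made-up labels and predictions (not the project's outputs).

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions (1 = hazardous, 0 = non-hazardous).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# With labels=[0, 1], rows are true classes and columns are predicted classes,
# so ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```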

Logistic Regression

Random Forest

Neural Network (MLP)

Feature Importance

Random Forest feature importance analysis showed that Minimum Orbit Intersection Distance (MOID) dominated the model.

Most important features:

Minimum Orbit Intersection
Absolute Magnitude
Estimated Diameter (km max)
Estimated Diameter (km min)
Perihelion Distance

This is consistent with the PHA definition: orbital proximity (MOID) and asteroid size (absolute magnitude and estimated diameter) are the most critical indicators of hazardous asteroids.
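A ranking like the one above can be obtained from a fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data with hypothetical column names (`moid_au`, `abs_magnitude`, `noise`) and a label built from two thresholds, so it illustrates the mechanism only. It uses 100 trees to keep the example quick.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features; the column names are hypothetical.
rng = np.random.default_rng(0)
n = 2000
features = pd.DataFrame({
    "moid_au": rng.uniform(0.0, 0.5, n),
    "abs_magnitude": rng.uniform(15.0, 30.0, n),
    "noise": rng.normal(size=n),
})
# Label built from two thresholds, mimicking a rule-based definition.
labels = ((features["moid_au"] <= 0.05) & (features["abs_magnitude"] <= 22.0)).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(features, labels)

# Impurity-based importances sum to 1; sort them from most to least important.
ranking = pd.Series(rf.feature_importances_, index=features.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

Impurity-based importances can be skewed when features are strongly correlated, as the minimum and maximum diameter estimates are likely to be, so the ordering between such features should be read with some caution.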

Neural Network (MLP)

The Neural Network performed strongly with an F1-score of 92%, outperforming Logistic Regression.

However, it produced 18 false negatives, compared with 1 for Random Forest and 21 for Logistic Regression.

In the context of planetary defence, false negatives are particularly dangerous because they represent hazardous asteroids that were not detected.

The training loss curve showed successful convergence.

Why Random Forest Performed Best

Random Forest significantly outperformed the other models, most likely because tree-based models are well suited to structured tabular datasets such as this one.

The NASA definition of a hazardous asteroid effectively creates a threshold-based rule:

MOID ≤ 0.05 AU
Absolute magnitude H ≤ 22.0 (a diameter of roughly 140 m or more)

Decision trees naturally discover such threshold rules. Random Forest ensembles combine many trees, allowing them to capture these relationships extremely effectively.
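To illustrate how a tree recovers threshold rules, the sketch below labels synthetic points with a two-threshold rule (MOID below 0.05 AU and diameter above 0.14 km) and fits a shallow `DecisionTreeClassifier`. The data, feature names and depth are assumptions made for the example; this is not the project's model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic points labelled by a two-threshold rule.
rng = np.random.default_rng(0)
n = 5000
moid_au = rng.uniform(0.0, 0.5, n)
diameter_km = rng.uniform(0.0, 1.0, n)
X = np.column_stack([moid_au, diameter_km])
y = ((moid_au < 0.05) & (diameter_km > 0.14)).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
names = ["moid_au", "diameter_km"]
print(export_text(tree, feature_names=names))

# Collect the split threshold learned for each feature (leaf nodes have feature < 0).
learned = {
    names[f]: float(t)
    for f, t in zip(tree.tree_.feature, tree.tree_.threshold)
    if f >= 0
}
print(learned)
```

The learned split points should land close to the thresholds used to generate the labels, which is the behaviour a linear model cannot reproduce with a single decision boundary.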

Neural networks typically perform best on:

very large datasets
highly complex patterns

In this case, the dataset was relatively small (4687 samples) and the decision boundaries were largely threshold-driven, which strongly favours tree-based models.

Limitations

One limitation is that the Hazardous label in the dataset is derived from NASA’s predefined rules.

This means the models are largely rediscovering existing thresholds rather than discovering new relationships.

Additionally:

the dataset is relatively small
feature importance was dominated by MOID
only three models were tested

Future Work

Future improvements could include:

testing gradient boosting models such as XGBoost
hyperparameter tuning for Random Forest and MLP
training on much larger NASA NEO datasets
predicting asteroid physical properties such as diameter instead of binary classification

Technologies Used

Python
Pandas
NumPy
Scikit-learn
Matplotlib
Seaborn
Jupyter Notebook

How to Run

1. Clone the repository

git clone https://github.com/benjmoh/asteroid-pha-classification.git
cd asteroid-pha-classification

2. Install dependencies

pip install -r requirements.txt

3. Launch Jupyter Notebook

jupyter notebook

4. Open the notebook

Open the file:

asteroid-pha-classification.ipynb

Run the notebook cells from top to bottom to reproduce the full machine learning workflow, including:

data exploration
preprocessing and feature selection
train/test splitting
model training (Logistic Regression, Random Forest, Neural Network)
model evaluation using confusion matrices
feature importance analysis
neural network training loss visualisation
