This project applies machine learning techniques to classify Potentially Hazardous Asteroids (PHAs) using NASA asteroid data.
Near Earth Objects (NEOs) are asteroids and comets whose orbits bring them close to Earth. NASA defines a Potentially Hazardous Asteroid (PHA) as an asteroid with:
- Minimum Orbit Intersection Distance (MOID) < 0.05 AU
- Diameter greater than roughly 140 meters (formally, an absolute magnitude H of 22.0 or brighter)
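The two criteria above amount to a simple rule. A minimal sketch, assuming the diameter is given in meters and the MOID in AU (the function and constant names are illustrative, not taken from the notebook):

```python
# Thresholds quoted in the definition above.
MOID_THRESHOLD_AU = 0.05
DIAMETER_THRESHOLD_M = 140.0


def is_potentially_hazardous(moid_au: float, diameter_m: float) -> bool:
    """Return True only when both criteria are met."""
    return moid_au < MOID_THRESHOLD_AU and diameter_m > DIAMETER_THRESHOLD_M


# Hypothetical asteroids: (MOID in AU, diameter in meters).
examples = [
    (0.02, 300.0),  # close orbit, large body
    (0.02, 50.0),   # close orbit, small body
    (0.30, 300.0),  # distant orbit, large body
]
labels = [is_potentially_hazardous(moid, diameter) for moid, diameter in examples]
print(labels)
```

Only the first example satisfies both conditions, which is why a classifier needs both orbital and size information.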
The goal of this project is to train machine learning models to classify whether an asteroid is hazardous based on its orbital and physical properties.
Three machine learning models were implemented and compared:
- Logistic Regression
- Random Forest
- Neural Network (Multi-Layer Perceptron)
The models were evaluated using precision, recall, and F1-score, with particular focus on recall since missing a hazardous asteroid could have severe consequences.
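For reference, the three metrics can be computed directly from confusion-matrix counts. The sketch below uses hypothetical counts, not results from this project, and the helper name is illustrative:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts chosen only to illustrate the arithmetic.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Recall is the only one of the three that is penalised by false negatives alone, which is why it is the metric of most interest here.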
The dataset used is the NASA Asteroid Classification dataset from Kaggle.
Dataset characteristics:
- 4687 asteroid observations
- 40 columns
- 39 input features
- 1 target variable (`Hazardous`)
The features describe orbital and physical asteroid properties such as:
- eccentricity
- semi-major axis
- perihelion distance
- absolute magnitude
- estimated diameter
The target variable:
- `Hazardous = True` → Potentially Hazardous Asteroid
- `Hazardous = False` → Non-hazardous asteroid
Several preprocessing steps were applied before training the models.
These columns contain identifiers rather than predictive information:
- Neo Reference ID
- Orbit ID
These features contained a single value across the dataset and therefore could not help separate classes:
- Equinox (`J2000`)
- Orbiting Body (`Earth`)
Estimated diameter appeared in multiple units:
- kilometers
- meters
- miles
- feet
Only the kilometer measurements were kept.
These columns relate to observation timing rather than asteroid properties:
- Close Approach Date
- Perihelion Time
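The column-removal steps above could be expressed with pandas roughly as follows. This is a sketch on a tiny made-up frame: the exact column names (for example `Est Dia in KM(min)`) are assumptions about the CSV headers, and the constant-column check is one possible way to find single-valued features, not necessarily how the notebook does it.

```python
import pandas as pd

# Tiny illustrative frame; values and exact column names are assumptions.
df = pd.DataFrame({
    "Neo Reference ID": [1, 2, 3],
    "Orbit ID": [10, 11, 12],
    "Equinox": ["J2000"] * 3,
    "Orbiting Body": ["Earth"] * 3,
    "Est Dia in KM(min)": [0.1, 0.2, 0.3],
    "Est Dia in M(min)": [100.0, 200.0, 300.0],
    "Close Approach Date": ["1995-01-01", "1995-01-08", "1995-01-15"],
    "Perihelion Time": [2458000.1, 2458100.2, 2458200.3],
    "Minimum Orbit Intersection": [0.02, 0.30, 0.04],
    "Hazardous": [True, False, True],
})

id_cols = ["Neo Reference ID", "Orbit ID"]
time_cols = ["Close Approach Date", "Perihelion Time"]
# Columns holding a single value cannot help separate the classes.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
# Keep only the kilometer version of the estimated diameter.
redundant_unit_cols = [
    c for c in df.columns if c.startswith("Est Dia") and "KM" not in c
]

cleaned = df.drop(columns=id_cols + time_cols + constant_cols + redundant_unit_cols)
print(list(cleaned.columns))
```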
All features were standardized to zero mean and unit variance.
This is particularly important for models sensitive to feature scale such as:
- Logistic Regression
- Neural Networks
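Standardization of this kind is typically done with scikit-learn's `StandardScaler`. The sketch below assumes the scaler is fitted on the training data only and then reused on the test data, which is common practice to avoid leakage; the arrays are made-up values, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature rows (two features on very different scales).
X_train = np.array([[0.01, 20.0], [0.05, 22.0], [0.30, 25.0], [0.10, 18.0]])
X_test = np.array([[0.02, 21.0]])

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same training statistics to the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```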
No missing values were present in the dataset.
The dataset was split into:
- 80% training
- 20% testing
Stratified sampling was used to maintain the original class distribution.
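An 80/20 stratified split can be sketched with `train_test_split` and its `stratify` argument. The data here is synthetic, the class ratio is hypothetical, and `random_state=42` for the split is an assumption made for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))          # synthetic features
y = np.array([1] * 160 + [0] * 840)     # hypothetical imbalanced labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 80% training, 20% testing
    stratify=y,         # keep the class ratio the same in both parts
    random_state=42,    # assumed seed
)

print(y.mean(), y_train.mean(), y_test.mean())
```

With stratification, the share of positive labels in the training and test parts stays close to the share in the full dataset, which matters when the hazardous class is a minority.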
Logistic Regression was used as a baseline model.
It assumes a linear relationship between features and the target variable, making it a useful reference point when comparing more complex models.
Random Forest is an ensemble of decision trees and performs particularly well on structured tabular datasets.
Parameters used:
- `n_estimators = 300`
- `n_jobs = -1`
- `random_state = 42`
Using 300 trees makes the averaged prediction more stable than a smaller forest would, while training remains fast on a dataset of this size.
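As a sketch, the parameters listed above map onto scikit-learn's `RandomForestClassifier` as shown below. The data is synthetic (generated with `make_classification`) and stands in for the asteroid features, so the printed score says nothing about the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the asteroid dataset).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4,
    weights=[0.84, 0.16], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(
    n_estimators=300,   # number of trees
    n_jobs=-1,          # use all available CPU cores
    random_state=42,    # reproducible results
)
rf.fit(X_train, y_train)

test_recall = recall_score(y_test, rf.predict(X_test))
print(f"recall on synthetic test data: {test_recall:.3f}")
```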
A Multi-Layer Perceptron neural network was implemented to model more complex non-linear relationships.
Configuration:
- `hidden_layer_sizes = (64, 32)`
- `activation = relu`
- `solver = adam`
- `learning_rate = adaptive`
- `max_iter = 500`
- `early_stopping = True`
- `n_iter_no_change = 20`
- `random_state = 42`
Early stopping was used to prevent overfitting.
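The configuration above corresponds to scikit-learn's `MLPClassifier` roughly as follows. This is a sketch on synthetic data; wrapping the model in a pipeline with `StandardScaler` is an assumption about how scaling and training could be combined, not a description of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the asteroid dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # two hidden layers
    activation="relu",
    solver="adam",
    learning_rate="adaptive",
    max_iter=500,
    early_stopping=True,          # hold out part of the training data for validation
    n_iter_no_change=20,          # patience before stopping
    random_state=42,
)
model = make_pipeline(StandardScaler(), mlp)
model.fit(X, y)

# Per-epoch training loss, usable for a convergence plot.
loss_curve = model[-1].loss_curve_
print(f"epochs run: {len(loss_curve)}, final loss: {loss_curve[-1]:.4f}")
```

According to the scikit-learn documentation, `learning_rate` only takes effect when `solver="sgd"`, so with `adam` the `adaptive` setting is accepted but unused.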
| Model | Precision | Recall | F1-score | False Negatives | False Positives |
|---|---|---|---|---|---|
| Logistic Regression | 85.5% | 86.1% | 85.8% | 21 | 22 |
| Random Forest | 99.3% | 99.3% | 99.3% | 1 | 1 |
| Neural Network (MLP) | 96.4% | 88.1% | 92.0% | 18 | 5 |
Logistic Regression performed well as a baseline model, achieving balanced precision and recall around 86%.
However, it misclassified more asteroids than either of the other models (21 false negatives and 22 false positives), most likely because linear decision boundaries cannot fully capture the non-linear relationships present in the dataset.
Random Forest achieved the best performance by a large margin, with:
- Precision: 99.3%
- Recall: 99.3%
- F1-score: 99.3%
The confusion matrix showed only:
- 1 false positive
- 1 false negative
Training performance was 100%, which is expected for Random Forest because its fully grown decision trees can fit the training data exactly.
Test performance stayed close at 99.3%, which suggests the model is not overfitting and that the classes are highly separable in the feature space.
The confusion matrices below show the classification performance of each model on the test dataset.
Random Forest feature importance analysis showed that Minimum Orbit Intersection Distance (MOID) dominated the model.
Most important features:
- Minimum Orbit Intersection
- Absolute Magnitude
- Estimated Diameter (km max)
- Estimated Diameter (km min)
- Perihelion Distance
This confirms that orbital proximity and asteroid size are the most critical indicators of hazardous asteroids.
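A ranking like the one above can be read from the fitted model's `feature_importances_` attribute. The sketch below uses synthetic data in which the label is built from a MOID-like and a diameter-like column plus an irrelevant noise column; the column names and value ranges are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 1000
X = pd.DataFrame({
    "moid_au": rng.uniform(0.0, 0.5, n),       # hypothetical MOID values
    "diameter_km": rng.uniform(0.01, 1.0, n),  # hypothetical diameters
    "noise": rng.normal(size=n),               # feature unrelated to the label
})
# Synthetic label built from the two thresholds in the PHA definition.
y = ((X["moid_au"] < 0.05) & (X["diameter_km"] > 0.14)).astype(int)

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)
rf.fit(X, y)

# Impurity-based importances, one value per feature, summing to 1.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```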
The Neural Network performed strongly with an F1-score of 92%, outperforming Logistic Regression.
However, it produced 18 false negatives, which is significantly worse than Random Forest.
In the context of planetary defence, false negatives are particularly dangerous because they represent hazardous asteroids that were not detected.
The training loss curve showed successful convergence.
Random Forest significantly outperformed the other models because tree-based models excel on structured tabular datasets.
The NASA definition of a hazardous asteroid effectively creates a threshold-based rule:
- MOID < 0.05 AU
- Diameter > 140 m
Decision trees naturally discover such threshold rules. Random Forest ensembles combine many trees, allowing them to capture these relationships extremely effectively.
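To illustrate how a tree picks up such a rule, the sketch below fits a single depth-1 decision tree to synthetic MOID values labelled with the 0.05 AU cutoff and reads back the split point it learned. The data is made up for the illustration and only one of the two criteria is modelled.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic MOID values in AU, labelled with the threshold from the definition.
moid = rng.uniform(0.0, 0.5, size=2000).reshape(-1, 1)
y = (moid[:, 0] < 0.05).astype(int)

# A depth-1 tree can express exactly one threshold rule.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(moid, y)

learned_threshold = stump.tree_.threshold[0]  # split value at the root node
train_accuracy = stump.score(moid, y)
print(f"learned threshold: {learned_threshold:.4f} AU, accuracy: {train_accuracy:.3f}")
```

The learned split lands next to 0.05 AU because that is the only cut that separates the two synthetic classes cleanly. A linear model has to approximate the same step with a smooth function of a weighted sum, which is harder once two thresholds must hold together.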
Neural networks typically perform best on:
- very large datasets
- highly complex patterns
In this case, the dataset was relatively small (4687 samples) and the decision boundaries were largely threshold-driven, which strongly favours tree-based models.
One limitation is that the Hazardous label in the dataset is derived from NASA’s predefined rules.
This means the models are largely rediscovering existing thresholds rather than discovering new relationships.
Additionally:
- the dataset is relatively small
- feature importance was dominated by MOID
- only three models were tested
Future improvements could include:
- testing gradient boosting models such as XGBoost
- hyperparameter tuning for Random Forest and MLP
- training on much larger NASA NEO datasets
- predicting asteroid physical properties such as diameter instead of binary classification
- Python
- Pandas
- NumPy
- Scikit-learn
- Matplotlib
- Seaborn
- Jupyter Notebook
git clone https://github.com/benjmoh/asteroid-pha-classification.git
cd asteroid-pha-classification
pip install -r requirements.txt
jupyter notebook
Open the file:
asteroid-pha-classification.ipynb
Run the notebook cells from top to bottom to reproduce the full machine learning workflow, including:
- data exploration
- preprocessing and feature selection
- train/test splitting
- model training (Logistic Regression, Random Forest, Neural Network)
- model evaluation using confusion matrices
- feature importance analysis
- neural network training loss visualisation




