DATA ANALYTICS NSSC’24
COSMIC COLLISION:
ANALYSING ASTEROID RISKS WITH DATA
TEAM KAJUBADAM
Archisha Ahuja | Dharnesh Bala | Gargi Bhagwat
EXPLORATORY DATA ANALYSIS
DATA INSPECTION
The data has 24 columns.
It has 25378 missing values.
Name, Relative velocity km per sec, Orbital period and Orbit uncertainty are categorical; all others are numerical.
Basic statistics were calculated for each numerical feature.
The feature ‘Name’ is dropped as it is not an actual feature and is only used for indexing.
FILLING MISSING VALUES
Numerical features are imputed with the mean and categorical features with the mode.
The feature ‘Miss Dist.(kilometers)’ was filled differently because 3 other columns hold the miss distance in other units (AU, miles, lunar). Missing values in the kilometers column were replaced with the converted values from these columns; any values still missing were then filled with the mean.
A similar method was applied to the feature 'Relative Velocity km per hr'.
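The unit-based fill described above can be sketched as follows. This is a minimal illustration, not the team's actual code: the three source column names and the helper `fill_miss_distance_km` are assumptions, and the lunar-distance factor is the commonly used approximate value.

```python
import numpy as np
import pandas as pd

KM_PER_AU = 149_597_870.7     # IAU definition of the astronomical unit
KM_PER_MILE = 1.609344        # international mile
KM_PER_LUNAR = 384_400.0      # approximate mean Earth-Moon distance

def fill_miss_distance_km(df, km_col="Miss Dist.(kilometers)", converters=None):
    """Fill missing kilometre values from other unit columns, then with the mean."""
    if converters is None:
        # Assumed column names; adjust to the real dataset.
        converters = {
            "Miss Dist.(Astronomical)": KM_PER_AU,
            "Miss Dist.(miles)": KM_PER_MILE,
            "Miss Dist.(lunar)": KM_PER_LUNAR,
        }
    out = df.copy()
    for col, factor in converters.items():
        if col in out.columns:
            # fillna aligns on the index, so only missing rows are touched.
            out[km_col] = out[km_col].fillna(out[col] * factor)
    # Whatever is still missing falls back to the column mean.
    out[km_col] = out[km_col].fillna(out[km_col].mean())
    return out

demo = pd.DataFrame({
    "Miss Dist.(kilometers)": [1_000_000.0, np.nan, np.nan, np.nan],
    "Miss Dist.(Astronomical)": [np.nan, 0.01, np.nan, np.nan],
    "Miss Dist.(miles)": [np.nan, np.nan, 1_000_000.0, np.nan],
    "Miss Dist.(lunar)": [np.nan, np.nan, np.nan, np.nan],
})
filled = fill_miss_distance_km(demo)
print(filled["Miss Dist.(kilometers)"])
```

Converting before falling back to the mean keeps as much measured information as possible; only rows with no distance in any unit receive the mean.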
HISTOGRAMS TO CHECK SKEWNESS
The dataset is heavily skewed with a lot of outliers. To tackle this, we used z-normalization.
Z-SCORE CALCULATION FOR OUTLIERS
A total of 148 outliers were found across Miles per hour, Relative Velocity km per hr and Aphelion distance.
These were handled by normalising the dataset using z-normalisation.
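A z-score outlier check of this kind can be sketched as below. The cutoff of 3 standard deviations is an assumption (the slides do not state the threshold used), and `zscore_outlier_mask` is a hypothetical helper run on synthetic data.

```python
import numpy as np

def zscore_outlier_mask(values, threshold=3.0):
    """Return a boolean mask marking values with |z| above the threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # A constant column has no outliers by this definition.
        return np.zeros(values.shape, dtype=bool)
    z = (values - values.mean()) / std
    return np.abs(z) > threshold

# Synthetic velocities with one extreme value appended at the end.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(50_000, 5_000, size=500), [150_000.0]])
mask = zscore_outlier_mask(sample)
print("outliers found:", int(mask.sum()))
```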
CORRELATION MATRIX HEAT MAP
There are 15 highly correlated pairs of features, with either direct or inverse correlation.
These can cause issues for the model and are therefore dealt with either by removing one feature of each pair or by applying Principal Component Analysis, which reduces dimensionality and transforms groups of correlated features into uncorrelated components.
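Listing highly correlated pairs from a correlation matrix might look like the sketch below. The 0.9 cutoff and the helper name `highly_correlated_pairs` are assumptions; the slides do not say which threshold produced the 15 pairs.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """List (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, no duplicates
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2.0 * a,               # direct correlation with a
    "c": -a,                    # inverse correlation with a
    "d": rng.normal(size=200),  # unrelated
})
pairs = highly_correlated_pairs(demo)
for left, right, r in pairs:
    print(f"{left} ~ {right}: r = {r:+.2f}")
```

Both direct (r near +1) and inverse (r near -1) pairs are reported because the check uses the absolute value of the coefficient.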
CLASS IMBALANCE
There is a clear class imbalance, with a ratio of approximately 5:1.
Because of this imbalance, we use the confusion matrix and F1 score to evaluate the performance of the model.
SCATTER PLOTS
Epoch Osculation vs Epoch Date Close Approach: linear relationship
Jupiter Tisserand Invariant vs Semi Major Axis: non-linear relationship
Epoch Date Close Approach vs Perihelion Arg: no correlation
Inference:
Linear relationships: If the points form a line (positively or negatively sloped), it indicates a strong
linear correlation between the two variables.
Non-linear relationships: Curves or other patterns might indicate more complex relationships.
Clusters: Separate groups of points might indicate natural clusters or categories in the data.
Outliers: Points far away from the main distribution might be outliers.
Positive correlation: If the values of one feature increase as another feature increases, the
scatter plot will show an upward slope.
Negative correlation: If one feature decreases as another increases, it will show a downward
slope.
No correlation: If the points are scattered randomly, there may not be any clear relationship
between the two features.
NUMERICAL INTERPRETATION
AND STATISTICAL ANALYSIS
FEATURE ENGINEERING
All engineered features were calculated from the non-normalised dataset, because normalised values can be negative and would cause errors when taking square roots.
Initial features were selected to represent key data characteristics; irrelevant and redundant features were
removed for a cleaner dataset.
Cluster labels from K-Means were assigned as new categorical features, improving anomaly differentiation
by capturing underlying data groupings.
Distance metrics to cluster centroids were added, enabling precise anomaly flagging by highlighting
outliers based on proximity to typical data patterns.
A percentile-based threshold was applied to distance metrics to classify anomalies effectively, setting a
distinct separation between normal and anomalous instances.
After feature transformation, all engineered features were normalized individually to bring them onto a
common scale, ensuring model compatibility and stability.
Principal Component Analysis (PCA) was applied to enhance visualization by reducing dimensionality
while preserving core variance, aiding in cluster and anomaly interpretation.
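The last two steps (per-feature normalisation followed by PCA for visualisation) can be sketched with scikit-learn as below. The synthetic feature matrix and the choice of two components are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered features (200 rows, 6 columns),
# with two columns made strongly dependent on the first.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 6))
features[:, 1] = 3.0 * features[:, 0] + rng.normal(scale=0.1, size=200)
features[:, 2] = -2.0 * features[:, 0] + rng.normal(scale=0.1, size=200)

# Normalise each feature individually to zero mean and unit variance.
scaled = StandardScaler().fit_transform(features)

# Project onto two principal components for plotting.
pca = PCA(n_components=2)
projected = pca.fit_transform(scaled)

print("explained variance ratio:", pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` values indicate how much of the total variance the 2-D plot preserves, which is worth checking before reading cluster structure off the projection.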
ADDITIONAL FEATURES
Class: To create the 'class' feature, asteroids were grouped based on their semi-major axis.
This helps segment asteroids by their orbital distances, making it easier for the model to detect
patterns or anomalies within each group.
Perihelion Longitude: This feature is the sum of the perihelion argument and the ascending node longitude, and gives the direction of the asteroid’s closest point to the Sun. It helps the model understand orbital positions and identify anomalies based on spatial distribution.
Distance from Sun (Perihelion Distance): Calculated using the semi-major axis and eccentricity,
this feature measures how close the asteroid is to the Sun at its closest point. It helps the
model identify asteroids that behave unusually in their orbit.
Eccentricity: This feature measures how elongated an asteroid’s orbit is. It helps the model
detect anomalies in asteroid trajectories that deviate from standard elliptical orbits.
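Two of these features follow directly from standard orbital relations, sketched below. The slides do not show the exact formulas used, so the function names, the wrap to the range 0 to 360 degrees, and the use of q = a(1 - e) are assumptions based on textbook definitions.

```python
def perihelion_longitude_deg(asc_node_longitude_deg, perihelion_arg_deg):
    """Longitude of perihelion: ascending node longitude plus perihelion argument."""
    # Wrapping to [0, 360) is an assumption; the slides only mention the sum.
    return (asc_node_longitude_deg + perihelion_arg_deg) % 360.0

def perihelion_distance_au(semi_major_axis_au, eccentricity):
    """Closest distance to the Sun for an elliptical orbit: q = a * (1 - e)."""
    if not 0.0 <= eccentricity < 1.0:
        raise ValueError("this relation is used here for elliptical orbits only")
    return semi_major_axis_au * (1.0 - eccentricity)

print(perihelion_longitude_deg(300.0, 100.0))  # sum of 400 wraps to 40.0
print(perihelion_distance_au(2.0, 0.25))       # 2.0 * 0.75 = 1.5
```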
HANDLING BINNED VALUES
Relative Velocity Encoding: The Relative Velocity km per sec column is encoded into four
categories: 'Slow', 'Very Slow', 'Fast', and 'Very Fast', mapped to values 0 through 3.
Orbital Period and Orbit Uncertainty Encoding: These two columns are encoded to reflect
three levels of measurement: Low, Medium, and High, mapped to values 0, 1, and 2,
respectively.
Classification Encoding: The classification column, which categorizes objects as Main Belt
Asteroid, Near Earth Object, Centaur, or Other, is mapped to values 0 through 3.
The data was cleaned and unnecessary features were dropped.
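The encodings above amount to dictionary lookups, as in the sketch below. The slides only say each set of labels maps to consecutive integers, so the specific label-to-integer order (for example 'Very Slow' = 0) and the column names are assumptions.

```python
import pandas as pd

# Assumed ordinal orders; the slides give the value ranges, not the exact mapping.
VELOCITY_LEVELS = {"Very Slow": 0, "Slow": 1, "Fast": 2, "Very Fast": 3}
THREE_LEVELS = {"Low": 0, "Medium": 1, "High": 2}
CLASS_LEVELS = {
    "Main Belt Asteroid": 0,
    "Near Earth Object": 1,
    "Centaur": 2,
    "Other": 3,
}

def encode_binned(df, mappings):
    """Replace binned labels with integers, failing loudly on unknown labels."""
    out = df.copy()
    for col, mapping in mappings.items():
        unknown = set(out[col].dropna().unique()) - set(mapping)
        if unknown:
            raise ValueError(f"unmapped labels in {col!r}: {sorted(unknown)}")
        out[col] = out[col].map(mapping)
    return out

demo = pd.DataFrame({
    "Relative Velocity km per sec": ["Slow", "Very Slow", "Very Fast"],
    "Orbit Uncertainty": ["Low", "High", "Medium"],
    "Classification": ["Centaur", "Other", "Near Earth Object"],
})
encoded = encode_binned(demo, {
    "Relative Velocity km per sec": VELOCITY_LEVELS,
    "Orbit Uncertainty": THREE_LEVELS,
    "Classification": CLASS_LEVELS,
})
print(encoded)
```

Raising on unknown labels avoids silently turning a misspelled category into a missing value.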
HAZARDOUS CLASSIFICATION
MODEL TESTING
SVM, an artificial neural network and Random Forest were tried.
Random Forest was chosen as it yielded the best accuracy and F2 score.
1. SVM
2. Artificial Neural Network
3. Random Forest
RANDOM FOREST K-FOLD FOR K=2 TO K=11
The best performance was seen for K = 9, where both accuracy and F2 score are maximum.
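A sweep over the number of folds like the one described can be sketched as follows. The synthetic imbalanced data, the stratified shuffled folds, and the small forest size are assumptions made to keep the example short; the results will not match the figures on the slide.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in with roughly a 5:1 class ratio.
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.17, 0.83], random_state=0)

# F2 weights recall more heavily than precision.
f2_scorer = make_scorer(fbeta_score, beta=2)

results = {}
for k in range(2, 12):  # K = 2 .. 11
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    model = RandomForestClassifier(n_estimators=30, random_state=0)
    scores = cross_validate(model, X, y, cv=cv,
                            scoring={"accuracy": "accuracy", "f2": f2_scorer})
    results[k] = (scores["test_accuracy"].mean(), scores["test_f2"].mean())

best_k = max(results, key=lambda k: results[k][1])  # rank by mean F2
for k, (acc, f2) in results.items():
    print(f"K={k:2d}  accuracy={acc:.4f}  F2={f2:.4f}")
print("best K by F2:", best_k)
```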
HYPERPARAMETER TUNING
1. GridSearchCV
The best parameters selected were:
n_estimators: 300, min_samples_split: 10, min_samples_leaf: 2, max_features: 'log2', max_depth: 20, and bootstrap: False.
Accuracy: 0.8137, F2 score: 0.8155.
The tuned model appeared to overfit and generalise less well.
2. Random Search
The search covered n_estimators (e.g., 50 to 500), max_depth (e.g., 5 to 50), and min_samples_split (e.g., 2 to 10).
As hyperparameter tuning did not improve the results, we kept the simpler model, as it worked best.
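A random search over the ranges quoted above could be set up as in this sketch. The synthetic data, `n_iter=5`, the 3-fold split and the F2 scorer are assumptions chosen so the example runs quickly; they are not the settings reported on the slide.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.17, 0.83], random_state=0)

# randint(low, high) samples integers in [low, high), hence the +1 on upper bounds.
param_distributions = {
    "n_estimators": randint(50, 501),
    "max_depth": randint(5, 51),
    "min_samples_split": randint(2, 11),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                       # number of sampled configurations
    scoring=make_scorer(fbeta_score, beta=2),
    cv=3,
    random_state=0,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best mean F2:", round(search.best_score_, 4))
```

Comparing `search.best_score_` against the cross-validated score of the default model is one way to check whether tuning actually helps before adopting the tuned parameters.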
ROC CURVE
Inference: The model is distinguishing
between the positive and negative classes
very well, even at low FPR.
AUC close to 1.0 suggests that the model is
highly accurate.
CONFUSION MATRIX
True Positive (TP): 761
False Negative (FN): 7
False Positive (FP): 125
True Negative (TN): 14
This means the model predicted 761 + 14 = 775 out of 907 values correctly (about 85.45%).
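The arithmetic behind these counts can be checked with a few lines of plain Python. The precision, recall and F-beta values below are derived here from the four counts on the slide, using the slide's own labelling of the positive class; they are not figures reported by the team.

```python
# Counts taken from the confusion matrix above.
tp, fn, fp, tn = 761, 7, 125, 14

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# General F-beta; beta = 2 weights recall four times as much as precision.
beta = 2.0
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f"total     = {total}")
print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"F2        = {f_beta:.4f}")
```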
FEATURE IMPORTANCE ANALYSIS
Figures: Permutation Importance, SHAP Values
Permutation Importance: The most impactful feature is ratio, while other features such as Jupiter Tessard Distance and rel_vel_kmps_labelled negatively affect our predictions.
SHAP: The color of each point on the graph represents the value of the corresponding feature, with red indicating high values and blue indicating low values.
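Permutation importance can be computed with scikit-learn as in the sketch below, shown on synthetic data in which only the first two columns carry signal (an assumption made for illustration). A negative mean importance means that shuffling the feature did not hurt, or even slightly improved, the held-out score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the two informative features in columns 0 and 1.
X, y = make_classification(n_samples=400, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature 10 times on held-out data and record the score drop.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)

ranking = result.importances_mean.argsort()[::-1]  # most important first
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:+.4f} "
          f"(std {result.importances_std[idx]:.4f})")
```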
ANOMALY DETECTION
K MEANS CLUSTERING
This method groups the data into clusters and calculates the distance of each point from its cluster center. By
setting a threshold based on these distances, we identified which points were anomalies.
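A minimal sketch of this distance-to-centroid approach is given below. The number of clusters, the 95th-percentile cutoff and the helper name `kmeans_anomalies` are assumptions; the slides do not state the values used.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_anomalies(X, n_clusters=3, percentile=95.0, random_state=0):
    """Flag points whose distance to their own centroid exceeds a percentile."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    # Distance of each point to the centre of the cluster it was assigned to.
    distances = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    threshold = np.percentile(distances, percentile)
    return km.labels_, distances, distances > threshold

# Three synthetic blobs standing in for the normalised asteroid features.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(100, 2)) for c in centers])

labels, distances, is_anomaly = kmeans_anomalies(X)
print("anomalies flagged:", int(is_anomaly.sum()), "of", len(X))
```

With a percentile threshold the share of flagged points is fixed in advance (about 5% here), so the count reflects the chosen cutoff as much as the data.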
Number of anomalies detected: 363
KNN
K MEANS VS KNN
THANK YOU!
APPENDIX
FORMULAE FOR FEATURES IN FEATURE ENGINEERING
1. Day of the year: the asteroid's approach day, derived from existing columns.
2. Eccentricity
3. Average Orbital Velocity
4. Time Period days
5. Heliocentric Distance
6. Escape Velocity
7. Specific Orbital Energy
8. Specific Angular Momentum
9. V_Aphelion
10. V_Perihelion
11. Mean Motion
12. Synodic Period
13. Ratio
14. Perihelion Longitude
15. Class