NSSC-Data Analytics

DATA ANALYTICS NSSC’24

COSMIC COLLISION:
ANALYSING ASTEROID RISKS WITH DATA
TEAM KAJUBADAM
Archisha Ahuja | Dharnesh Bala | Gargi Bhagwat
EXPLORATORY DATA ANALYSIS

DATA INSPECTION

The data has 24 columns.
It has 25378 missing values.
Name, Relative velocity km per sec, Orbital period and Orbit uncertainty are categorical; all others are numerical.
Basic statistics were calculated for each numerical feature.
The feature ‘Name’ is dropped, as it is not an actual feature but is only used for indexing.

FILLING MISSING VALUES


The numerical features are filled with the mean and the categorical features with the mode through imputation.
The feature ‘Miss Dist.(kilometers)’ was filled differently because three other columns give the miss distance in other units (AU, miles, lunar). Missing values in the kilometers column were replaced with the values from these columns after conversion. Any values still missing were then filled with the mean.
A similar method was applied to the feature 'Relative Velocity km per hr'.
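The filling strategy above can be sketched in pandas. This is a minimal illustration, not the team's actual code: the column names (`miss_dist_km`, `miss_dist_au`, `miss_dist_miles`, `miss_dist_lunar`) are hypothetical, and the lunar-distance factor is an assumed mean Earth-Moon distance that may differ from the convention used in the dataset.

```python
import numpy as np
import pandas as pd

# Conversion factors to kilometres. The lunar value is an assumed mean
# Earth-Moon distance; the dataset may use a slightly different figure.
KM_PER_AU = 149_597_870.7
KM_PER_MILE = 1.609344
KM_PER_LUNAR = 384_400.0


def fill_miss_distance_km(df):
    """Fill missing km values from the other unit columns, then with the mean."""
    km = df["miss_dist_km"].copy()
    for column, factor in [
        ("miss_dist_au", KM_PER_AU),
        ("miss_dist_miles", KM_PER_MILE),
        ("miss_dist_lunar", KM_PER_LUNAR),
    ]:
        # fillna aligns on the index, so only rows still missing are filled.
        km = km.fillna(df[column] * factor)
    return km.fillna(km.mean())


def impute_simple(df, numeric_cols, categorical_cols):
    """Mean imputation for numeric columns, mode imputation for categorical ones."""
    out = df.copy()
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].mean())
    for col in categorical_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Hypothetical rows: each one is missing the km value in a different way.
example = pd.DataFrame({
    "miss_dist_km": [np.nan, 1_000_000.0, np.nan, np.nan],
    "miss_dist_au": [0.01, np.nan, np.nan, np.nan],
    "miss_dist_miles": [np.nan, np.nan, 1_000_000.0, np.nan],
    "miss_dist_lunar": [np.nan, np.nan, np.nan, np.nan],
})
filled = fill_miss_distance_km(example)
```

Converting before falling back to the mean keeps real measurements wherever any unit column has one, so the mean is only used for rows with no distance information at all.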

HISTOGRAMS TO CHECK SKEWNESS

The dataset is heavily skewed, with a lot of outliers. To tackle this, we used z-normalization.

Z-SCORE CALCULATION FOR OUTLIERS

A total of 148 outliers were found in Miles per hour, Relative Velocity km per hr and Aphelion distance.
These were handled by normalising the dataset using z-normalisation.
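A z-score outlier check of this kind might look like the sketch below. The cut-off of 3 standard deviations is an assumption (the slides do not state the threshold used), and the speeds are synthetic values generated only for illustration.

```python
import numpy as np


def zscore_outlier_mask(values, threshold=3.0):
    """Return a boolean mask marking values whose |z-score| exceeds threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # A constant column has no outliers under this definition.
        return np.zeros(values.shape, dtype=bool)
    z = (values - values.mean()) / std
    return np.abs(z) > threshold


# Synthetic speeds with one extreme value appended at the end.
rng = np.random.default_rng(0)
speeds = np.append(rng.normal(50_000, 5_000, size=500), 200_000.0)
mask = zscore_outlier_mask(speeds)
n_outliers = int(mask.sum())
```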

CORRELATION MATRIX HEAT MAP


There are 15 highly correlated pairs of features, with either direct or inverse correlation.
These can cause issues with the model and are therefore dealt with either by removing one feature of each pair or by applying Principal Component Analysis, which reduces the dimensionality and transforms groups of correlated features into uncorrelated components.
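One way to list such pairs from a correlation matrix is sketched here. The 0.9 cut-off for "highly correlated" is an assumption, since the slides do not give the value used, and the demo columns are made up.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """List (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    # Walk only the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Hypothetical demo: the same distance in two units, plus unrelated noise.
rng = np.random.default_rng(0)
km = rng.uniform(1e5, 1e7, size=200)
demo = pd.DataFrame({
    "miss_dist_km": km,
    "miss_dist_miles": km / 1.609344,
    "noise": rng.normal(size=200),
})
pairs = highly_correlated_pairs(demo)
```

Columns that are unit conversions of one another, as in the demo, correlate perfectly, so dropping all but one of them loses no information.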

CLASS IMBALANCE

There is a clear class imbalance, with a ratio of approximately 5:1.
Because of this imbalance, we use the confusion matrix and F1 score to compare the performance of the models.

SCATTER PLOTS

Epoch Osculation vs Epoch Date Close Approach: Linear Relationship
Jupiter Tisserand Invariant vs Semi Major Axis: Non-Linear Relationship
Epoch Date Close Approach vs Perihelion Arg: No Correlation


Inference:
Linear relationships: If the points form a line (positively or negatively sloped), it indicates a strong
linear correlation between the two variables.
Non-linear relationships: Curves or other patterns might indicate more complex relationships.
Clusters: Separate groups of points might indicate natural clusters or categories in the data.
Outliers: Points far away from the main distribution might be potential outliers.
Positive correlation: If the values of one feature increase as another feature increases, the
scatter plot will show an upward slope.
Negative correlation: If one feature decreases as another increases, it will show a downward
slope.
No correlation: If the points are scattered randomly, there may not be any clear relationship
between the two features.
NUMERICAL INTERPRETATION
AND STATISTICAL ANALYSIS

FEATURE ENGINEERING
All engineered features were calculated from the non-normalised dataset to avoid errors from taking the square root of negative values.
Initial features were selected to represent key data characteristics; irrelevant and redundant features were
removed for a cleaner dataset.
Cluster labels from K-Means were assigned as new categorical features, improving anomaly differentiation
by capturing underlying data groupings.
Distance metrics to cluster centroids were added, enabling precise anomaly flagging by highlighting
outliers based on proximity to typical data patterns.
A percentile-based threshold was applied to distance metrics to classify anomalies effectively, setting a
distinct separation between normal and anomalous instances.
After feature transformation, all engineered features were normalized individually to bring them onto a
common scale, ensuring model compatibility and stability.
Principal Component Analysis (PCA) was applied to enhance visualization by reducing dimensionality
while preserving core variance, aiding in cluster and anomaly interpretation.
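The clustering-based features and the PCA projection described above could be assembled roughly as follows. This is a sketch on synthetic data using scikit-learn; the number of clusters and the number of principal components are assumptions, not values taken from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asteroid feature table: two loose groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 4)),
    rng.normal(6.0, 1.0, size=(100, 4)),
])

X_scaled = StandardScaler().fit_transform(X)

# n_clusters=2 is an assumption chosen to match the synthetic data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
cluster_label = kmeans.labels_
# transform() gives the distance to every centroid; the minimum is the
# distance to the centroid of the cluster the point was assigned to.
centroid_distance = kmeans.transform(X_scaled).min(axis=1)

# Append the engineered columns, rescale each column, then project to 2-D.
features = np.column_stack([X_scaled, cluster_label, centroid_distance])
features = StandardScaler().fit_transform(features)
embedding = PCA(n_components=2).fit_transform(features)
```

The two-dimensional `embedding` is only for plotting; any model would be trained on `features`.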

ADDITIONAL FEATURES
Class: To create the 'class' feature, asteroids were grouped based on their semi-major axis.
This helps segment asteroids by their orbital distances, making it easier for the model to detect
patterns or anomalies within each group.
Perihelion Longitude: This feature is the sum of the perihelion argument and ascending node
longitude, giving the direction of the asteroid’s closest point to the Sun. It helps the model understand orbital
positions and identify anomalies based on spatial distribution.
Distance from Sun (Perihelion Distance): Calculated using the semi-major axis and eccentricity,
this feature measures how close the asteroid is to the Sun at its closest point. It helps the
model identify asteroids that behave unusually in their orbit.
Eccentricity: This feature measures how elongated an asteroid’s orbit is. It helps the model
detect anomalies in asteroid trajectories that deviate from standard elliptical orbits.
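The orbital features above follow from standard relations for elliptical orbits, sketched below. The function and argument names are illustrative, wrapping the longitude into [0, 360) is an assumption, and deriving eccentricity from the apsides is only one common option, since the slides do not say how it was obtained.

```python
def perihelion_distance(semi_major_axis, eccentricity):
    """Closest distance to the Sun: q = a * (1 - e), in the units of a."""
    return semi_major_axis * (1.0 - eccentricity)


def perihelion_longitude(perihelion_arg_deg, asc_node_longitude_deg):
    """Sum of argument of perihelion and ascending node longitude, in degrees.

    Wrapping into [0, 360) is an assumption; the slides only mention the sum.
    """
    return (perihelion_arg_deg + asc_node_longitude_deg) % 360.0


def eccentricity_from_apsides(aphelion_distance, perihelion_dist):
    """Eccentricity from the two apsides: e = (Q - q) / (Q + q)."""
    return (aphelion_distance - perihelion_dist) / (aphelion_distance + perihelion_dist)


# Example: an orbit with a = 2 AU and e = 0.5 has q = 1 AU and Q = 3 AU.
q = perihelion_distance(2.0, 0.5)
e = eccentricity_from_apsides(3.0, q)
```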

HANDLING BINNED VALUES


Relative Velocity Encoding: The Relative Velocity km per sec column is encoded into four
categories: 'Slow', 'Very Slow', 'Fast', and 'Very Fast', mapped to values 0 through 3.

Orbital Period and Orbit Uncertainty Encoding: These two columns are encoded to reflect
three levels of measurement: Low, Medium, and High, mapped to values 0, 1, and 2,
respectively.

Classification Encoding: The classification column, which categorizes objects as Main Belt
Asteroid, Near Earth Object, Centaur, or Other, is mapped to values 0 through 3.

The data was cleaned and unnecessary features were dropped.
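The encodings above amount to dictionary lookups. In this sketch the column names are hypothetical, and the assignment of codes within each mapping is an assumption: the slides give the categories and the code range but not which label receives which integer.

```python
import pandas as pd

# Assumed ordinal ordering; only the 0-3 and 0-2 ranges come from the slides.
VELOCITY_CODES = {"Very Slow": 0, "Slow": 1, "Fast": 2, "Very Fast": 3}
LEVEL_CODES = {"Low": 0, "Medium": 1, "High": 2}


def encode_column(series, codes):
    """Map labels to integers, failing loudly on labels outside the mapping."""
    unknown = set(series.dropna().unique()) - set(codes)
    if unknown:
        raise ValueError(f"unmapped labels: {sorted(unknown)}")
    return series.map(codes)


# Hypothetical column names and rows.
demo = pd.DataFrame({
    "rel_vel_kmps": ["Very Fast", "Slow", "Very Slow", "Fast"],
    "orbit_uncertainty": ["Low", "High", "Medium", "Low"],
})
velocity_encoded = encode_column(demo["rel_vel_kmps"], VELOCITY_CODES)
uncertainty_encoded = encode_column(demo["orbit_uncertainty"], LEVEL_CODES)
```

Raising on unknown labels avoids silently turning a misspelled category into a missing value.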


HAZARDOUS CLASSIFICATION

MODEL TESTING
Models such as SVM, artificial neural networks and Random Forest were tried.
Random Forest was chosen as it yielded the best accuracy and F2 score.

1. SVM:
2. Artificial Neural Network:
3. Random Forest:

RANDOM FOREST K-FOLD FOR K=2 TO K=11

We can see that the best performance was seen for K = 9, where both Accuracy and F2 Score are maximum.
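A sweep over the number of folds like the one summarised above might be run as in this sketch. It uses synthetic data with a roughly 5:1 class ratio and a small forest so that it finishes quickly; the dataset, forest size and seed are assumptions and will not reproduce the K = 9 result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced stand-in for the asteroid data (about 5:1).
X, y = make_classification(
    n_samples=300, n_features=10, weights=[1 / 6, 5 / 6], random_state=0
)

# F2 weights recall more heavily than precision.
scoring = {"accuracy": "accuracy", "f2": make_scorer(fbeta_score, beta=2)}

results = {}
for k in range(2, 12):  # K = 2 .. 11
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    # An integer cv uses stratified folds for classifiers.
    scores = cross_validate(model, X, y, cv=k, scoring=scoring)
    results[k] = (scores["test_accuracy"].mean(), scores["test_f2"].mean())

# Pick the K with the highest mean F2 score.
best_k = max(results, key=lambda k: results[k][1])
```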

HYPERPARAMETER TUNING
1. GridSearchCV
The best parameters selected were:
n_estimators: 300, min_samples_split: 10, min_samples_leaf: 2, max_features: 'log2', max_depth: 20,
and bootstrap: False.
Accuracy : 0.8137, F2 score : 0.8155.
The tuning led to overfitting or poorer generalization.

2. Random Search
The parameters sampled were n_estimators (e.g., 50 to 500), max_depth (e.g., 5 to 50), and min_samples_split (e.g., 2 to 10).

As hyperparameter tuning did not give the best results, we kept the simpler model, which worked best.
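For reference, a random search over the ranges quoted above can be set up with scikit-learn's `RandomizedSearchCV`. The synthetic data, the number of sampled configurations (`n_iter=3`), the fold count and the use of F2 as the search metric are all assumptions chosen to keep the sketch small.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(
    n_samples=200, n_features=10, weights=[1 / 6, 5 / 6], random_state=0
)

# randint's upper bound is exclusive, so 501 allows n_estimators = 500.
param_distributions = {
    "n_estimators": randint(50, 501),
    "max_depth": randint(5, 51),
    "min_samples_split": randint(2, 11),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=3,  # assumed small budget for illustration
    cv=3,
    scoring=make_scorer(fbeta_score, beta=2),
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

Comparing `search.best_score_` against the cross-validated score of the untuned forest is one way to check whether tuning actually helps.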

ROC CURVE

Inference: The model distinguishes between the positive and negative classes very well, even at low FPR.
An AUC close to 1.0 suggests that the model is highly accurate.

CONFUSION MATRIX

True Positive (TP): 761
False Negative (FN): 7
False Positive (FP): 125
True Negative (TN): 14
This means the model predicted 761 + 14 = 775 out of 907 values (85.45%) correctly.
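The counts above are enough to derive the usual summary metrics. The sketch below uses only the four reported numbers; the variable names are illustrative.

```python
# Counts as reported on the confusion-matrix slide.
tp, fn, fp, tn = 761, 7, 125, 14

total = tp + fn + fp + tn
accuracy = (tp + tn) / total      # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are right
recall = tp / (tp + fn)           # share of actual positives that are found
specificity = tn / (tn + fp)      # share of actual negatives that are found

# F-beta with beta = 2 weights recall four times as heavily as precision.
beta = 2
f2 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

Specificity is worth reading alongside accuracy here, because with imbalanced classes a high accuracy can coexist with many false positives.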

FEATURE IMPORTANCE ANALYSIS

[Figures: Permutation Importance; SHAP Values]

Permutation Importance: the most impactful feature is ratio, while other features like Jupiter Tisserand Distance and rel_vel_kmps_labelled are negatively affecting our predictions.

SHAP: The color of each point on the graph represents the value of the corresponding feature, with red indicating high values and blue indicating low values.
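Permutation importance can be computed with scikit-learn as sketched below, on synthetic data rather than the asteroid features. A negative mean importance means the score improved when that feature was shuffled, which is what "negatively affecting" corresponds to in this method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 3 informative, 2 redundant and 1 noise feature.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature on held-out data and record the drop in score.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
```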
ANOMALY DETECTION

K MEANS CLUSTERING

This method groups the data into clusters and calculates the distance of each point from its cluster center. By
setting a threshold based on these distances, we identified which points were anomalies.
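A minimal version of this distance-to-centroid rule is sketched here on synthetic points. The number of clusters and the 95th-percentile cut-off are assumptions; the slides do not state the values used.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic clusters plus one point placed far from both.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.normal(8.0, 1.0, size=(200, 2)),
    [[4.0, 20.0]],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Euclidean distance from each point to the centre of its own cluster.
own_centres = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(X - own_centres, axis=1)

# Flag the points whose distance exceeds the assumed percentile threshold.
threshold = np.percentile(distances, 95)
is_anomaly = distances > threshold
n_anomalies = int(is_anomaly.sum())
```

A percentile threshold always flags a fixed share of the data, so the anomaly count depends on the chosen percentile as much as on the data.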
Number of anomalies detected: 363

KNN

K MEANS VS KNN
THANK YOU!
APPENDIX
FORMULAE FOR FEATURES IN FEATURE ENGINEERING
1. Day of the year:
This shows the approach day of the asteroid, formed through some manipulation of the columns.
2. Eccentricity:
3. Average Orbital Velocity:
4. Time Period days:
5. Heliocentric Distance:
6. Escape Velocity:
7. Specific Orbital Energy:
8. Specific Angular Momentum:
9. V_Aphelion:
10. V_Perihelion:
11. Mean Motion:
12. Synodic Period:
13. Ratio:
14. Perihelion Longitude:
15. Class:
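The formula images for these features did not survive in this copy. For orientation only, the sketch below gives standard textbook two-body definitions for several of the listed quantities; they are not recovered from the slides and may differ from what the team used. All function names, the constants and the choice of units are assumptions.

```python
import math

GM_SUN = 1.32712440018e20    # Sun's gravitational parameter, m^3 / s^2
AU_M = 1.495978707e11        # one astronomical unit in metres
EARTH_PERIOD_DAYS = 365.256  # Earth's sidereal orbital period


def orbital_period_days(a_au):
    """Kepler's third law for a small body orbiting the Sun."""
    return EARTH_PERIOD_DAYS * a_au ** 1.5


def mean_motion_deg_per_day(period_days):
    return 360.0 / period_days


def synodic_period_days(period_days):
    """Time between successive similar alignments with Earth."""
    rate = abs(1.0 / EARTH_PERIOD_DAYS - 1.0 / period_days)
    return math.inf if rate == 0 else 1.0 / rate


def vis_viva_speed(r_au, a_au):
    """Orbital speed in m/s at heliocentric distance r on an orbit of size a."""
    return math.sqrt(GM_SUN * (2.0 / (r_au * AU_M) - 1.0 / (a_au * AU_M)))


def escape_velocity(r_au):
    return math.sqrt(2.0 * GM_SUN / (r_au * AU_M))


def specific_orbital_energy(a_au):
    return -GM_SUN / (2.0 * a_au * AU_M)


def specific_angular_momentum(a_au, e):
    return math.sqrt(GM_SUN * a_au * AU_M * (1.0 - e ** 2))


def apsis_speeds(a_au, e):
    """(speed at perihelion, speed at aphelion) in m/s."""
    return (
        vis_viva_speed(a_au * (1.0 - e), a_au),
        vis_viva_speed(a_au * (1.0 + e), a_au),
    )
```

For a circular 1 AU orbit, `vis_viva_speed(1.0, 1.0)` gives a speed close to Earth's mean orbital speed, which is a quick sanity check on the constants.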
