NSSC-Data Analytics

Uploaded by

Dharnesh Balq

DATA ANALYTICS NSSC’24

COSMIC COLLISION:
ANALYSING ASTEROID RISKS WITH DATA
TEAM KAJUBADAM
Archisha Ahuja | Dharnesh Bala | Gargi Bhagwat
EXPLORATORY DATA ANALYSIS

DATA INSPECTION

The data has 24 columns.
It has 25378 missing values.
Name, Relative velocity km per sec, Orbital period, and Orbit uncertainty are categorical; all other features are numerical.
Basic statistics were calculated for each numerical feature.
The feature ‘Name’ is dropped, as it is not an actual feature and is used only for indexing.
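The inspection steps above can be sketched with pandas. The miniature frame and its column names are hypothetical stand-ins for the real 24-column dataset, which is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; the real one has 24 columns.
df = pd.DataFrame({
    "Name": ["A1", "A2", "A3", "A4"],
    "Orbit Uncertainty": ["Low", None, "High", "Low"],
    "Aphelion Dist": [1.2, np.nan, 3.4, 2.2],
    "Eccentricity": [0.10, 0.35, np.nan, 0.52],
})

n_columns = df.shape[1]                       # number of columns
n_missing = int(df.isna().sum().sum())        # total missing cells
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = [c for c in df.columns if c not in numerical]
stats = df[numerical].describe()              # count, mean, std, quartiles

# 'Name' only identifies rows, so it is dropped before modelling.
df = df.drop(columns=["Name"])
```

`describe()` skips missing values, so the statistics here are computed on the observed entries only.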

FILLING MISSING VALUES

The numerical features are imputed with the mean and the categorical features with the mode.
The feature ‘Miss Dist.(kilometers)’ was filled differently because 3 other columns give the miss distance in other units (AU, miles, lunar distances). Missing values in the kilometers column were first replaced with converted values from these columns; any values still missing were then filled with the mean.
A similar method was applied to the feature 'Relative Velocity km per hr'.
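A minimal sketch of the unit-based fill, assuming hypothetical column names and standard conversion constants (neither is given on the slide). Values are taken from the other unit columns first, and the column mean is used only for rows with no distance in any unit.

```python
import numpy as np
import pandas as pd

# Conversion factors to kilometres (assumed constants, not from the slides).
KM_PER_AU = 149_597_870.7
KM_PER_MILE = 1.609344
KM_PER_LUNAR = 384_400.0

# Hypothetical column names and values.
df = pd.DataFrame({
    "miss_km": [np.nan, 3.0e6, np.nan, np.nan],
    "miss_au": [0.02, np.nan, np.nan, np.nan],
    "miss_miles": [np.nan, np.nan, 1.0e6, np.nan],
    "miss_lunar": [np.nan, np.nan, np.nan, np.nan],
})

# Fill from the other unit columns first, in a fixed order of preference.
km = df["miss_km"]
km = km.fillna(df["miss_au"] * KM_PER_AU)
km = km.fillna(df["miss_miles"] * KM_PER_MILE)
km = km.fillna(df["miss_lunar"] * KM_PER_LUNAR)

# Rows with no distance in any unit fall back to the column mean.
df["miss_km"] = km.fillna(km.mean())
```

The same pattern would apply to the velocity columns with the appropriate speed conversions.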

HISTOGRAMS TO CHECK SKEWNESS

The dataset is heavily skewed and contains many outliers. To handle this, we used z-score normalization.

Z-SCORE CALCULATION FOR OUTLIERS

A total of 148 outliers were found in Miles per hour, Relative Velocity km per hr, and Aphelion distance.
These were handled by normalising the dataset using z-score normalisation.
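The z-score check can be sketched as follows. The cutoff of 3 standard deviations is an assumption (the slide does not state the threshold used), and the speed values are synthetic.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return z-scores and a boolean mask of points with |z| > threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()   # population std (ddof=0)
    return z, np.abs(z) > threshold

rng = np.random.default_rng(0)
speeds = rng.normal(50_000.0, 5_000.0, size=500)  # hypothetical km/h values
speeds[:3] = [150_000.0, 160_000.0, 170_000.0]    # three injected extremes

z, is_outlier = zscore_outliers(speeds)
n_outliers = int(is_outlier.sum())
```

The same `z` array is the z-normalised feature: it has zero mean and unit variance, with the extreme points still present but expressed in standard-deviation units.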

CORRELATION MATRIX HEAT MAP

There are 15 highly correlated pairs of features, with either direct or inverse correlation.
These can cause issues with the model and are therefore dealt with either by removing one feature of each pair or by applying Principal Component Analysis, which reduces the dimensionality and transforms the correlated feature groups into uncorrelated features.
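One way to list highly correlated pairs from the correlation matrix is sketched below. The 0.9 cutoff and the column names are assumptions for illustration; the slide does not say what counts as "highly correlated".

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True)
    # Keep only the upper triangle so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

rng = np.random.default_rng(0)
km = rng.uniform(1e6, 5e7, size=200)
df = pd.DataFrame({
    "miss_km": km,
    "miss_miles": km / 1.609344,     # same quantity in other units
    "neg_km": -km,                   # inverse correlation
    "noise": rng.normal(size=200),
})
pairs = correlated_pairs(df)
```

Using the absolute value catches both direct and inverse correlation; one feature from each reported pair is then a candidate for removal.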

CLASS IMBALANCE

There is a clear class imbalance, with a ratio of approximately 5:1.
Because of this imbalance, we use the Confusion Matrix and F1 score to compare the performance of the models.

SCATTER PLOTS

Epoch Osculation vs Epoch Date Close Approach: Linear Relationship
Jupiter Tisserand Invariant vs Semi Major Axis: Non-Linear Relationship
Epoch Date Close Approach vs Perihelion Arg: No Correlation

Inference:
Linear relationships: If the points form a line (positively or negatively sloped), it indicates a strong
linear correlation between the two variables.
Non-linear relationships: Curves or other patterns might indicate more complex relationships.
Clusters: Separate groups of points might indicate natural clusters or categories in the data.
Outliers: Points far from the main distribution are potential outliers.
Positive correlation: If the values of one feature increase as another feature increases, the
scatter plot will show an upward slope.
Negative correlation: If one feature decreases as another increases, it will show a downward
slope.
No correlation: If the points are scattered randomly, there may not be any clear relationship
between the two features.
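The three cases can be made concrete with Pearson's correlation coefficient on synthetic data (not the asteroid data). Pearson's r measures only linear association, so a clearly curved pattern can still give a value near zero, which is why the scatter plot is worth inspecting alongside the number.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)
rng = np.random.default_rng(0)

y_linear = 2.0 * x + 1.0             # straight line
y_curved = x ** 2                    # symmetric curve: strong pattern, no linear trend
y_random = rng.normal(size=x.size)   # unrelated to x

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)
r_random = pearson_r(x, y_random)
```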
NUMERICAL INTERPRETATION
AND STATISTICAL ANALYSIS

FEATURE ENGINEERING
All engineered features were calculated from the non-normalised dataset to avoid errors from taking square roots of negative values.
Initial features were selected to represent key data characteristics; irrelevant and redundant features were
removed for a cleaner dataset.
Cluster labels from K-Means were assigned as new categorical features, improving anomaly differentiation
by capturing underlying data groupings.
Distance metrics to cluster centroids were added, enabling precise anomaly flagging by highlighting
outliers based on proximity to typical data patterns.
A percentile-based threshold was applied to distance metrics to classify anomalies effectively, setting a
distinct separation between normal and anomalous instances.
After feature transformation, all engineered features were normalized individually to bring them onto a
common scale, ensuring model compatibility and stability.
Principal Component Analysis (PCA) was applied to enhance visualization by reducing dimensionality
while preserving core variance, aiding in cluster and anomaly interpretation.
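The normalise-then-project step can be sketched with NumPy alone. This is a generic PCA via singular value decomposition on synthetic features, not the team's code; the four columns and their two underlying factors are made up for illustration.

```python
import numpy as np

def pca_2d(X):
    """Standardise columns, then project onto the first two principal components."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)        # z-score each feature
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)                 # variance ratio per component
    return Z @ Vt[:2].T, explained[:2]

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))
# Hypothetical engineered features: four columns driven by two latent factors.
X = np.column_stack([
    base[:, 0],
    2.0 * base[:, 0] + 0.1 * rng.normal(size=300),
    base[:, 1],
    -base[:, 1] + 0.1 * rng.normal(size=300),
])
coords, explained = pca_2d(X)
```

`coords` holds the 2-D coordinates for plotting clusters and anomalies, and `explained` shows how much of the total variance the two components retain.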

ADDITIONAL FEATURES
Class: To create the 'class' feature, asteroids were grouped based on their semi-major axis.
This helps segment asteroids by their orbital distances, making it easier for the model to detect
patterns or anomalies within each group.
Perihelion Longitude: This feature is the sum of the perihelion argument and the ascending node
longitude, indicating the direction of the asteroid’s closest point to the Sun. It helps the model
understand orbital positions and identify anomalies based on spatial distribution.
Distance from Sun (Perihelion Distance): Calculated using the semi-major axis and eccentricity,
this feature measures how close the asteroid is to the Sun at its closest point. It helps the
model identify asteroids that behave unusually in their orbit.
Eccentricity: This feature measures how elongated an asteroid’s orbit is. It helps the model
detect anomalies in asteroid trajectories that deviate from standard elliptical orbits.
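Two of these features follow directly from standard orbital-element relations and can be sketched as below. The function names are hypothetical, and wrapping the longitude to the range 0 to 360 degrees is an assumption; the slide only describes the sum.

```python
def perihelion_distance(semi_major_axis_au, eccentricity):
    """Closest distance to the Sun for an elliptical orbit: q = a * (1 - e)."""
    return semi_major_axis_au * (1.0 - eccentricity)

def perihelion_longitude(asc_node_deg, perihelion_arg_deg):
    """Longitude of perihelion: ascending-node longitude plus perihelion argument."""
    # Wrapping to [0, 360) is an assumption, not something the slides state.
    return (asc_node_deg + perihelion_arg_deg) % 360.0

q = perihelion_distance(2.5, 0.2)          # a = 2.5 AU, e = 0.2
lon = perihelion_longitude(300.0, 100.0)   # degrees
```

Both functions also work element-wise on NumPy arrays or pandas columns, since they use only arithmetic operators.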

HANDLING BINNED VALUES

Relative Velocity Encoding: The Relative Velocity km per sec column is encoded into four
categories: 'Slow', 'Very Slow', 'Fast', and 'Very Fast', mapped to values 0 through 3.

Orbital Period and Orbit Uncertainty Encoding: These two columns are encoded to reflect
three levels of measurement: Low, Medium, and High, mapped to values 0, 1, and 2,
respectively.

Classification Encoding: The classification column, which categorizes objects as Main Belt
Asteroid, Near Earth Object, Centaur, or Other, is mapped to values 0 through 3.

The data was cleaned and unnecessary features were dropped.
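The encodings above amount to dictionary lookups. In this sketch the exact label-to-code assignments are assumptions (the slides give the labels and the code ranges, not the pairing), and the rows are hypothetical.

```python
import pandas as pd

# Assumed ordinal orderings; the slides give only the labels and code ranges.
VELOCITY_CODES = {"Very Slow": 0, "Slow": 1, "Fast": 2, "Very Fast": 3}
LEVEL_CODES = {"Low": 0, "Medium": 1, "High": 2}

df = pd.DataFrame({   # hypothetical rows
    "Relative Velocity km per sec": ["Fast", "Very Slow", "Slow", "Very Fast"],
    "Orbital Period": ["Low", "High", "Medium", "Low"],
    "Orbit Uncertainty": ["Medium", "Medium", "High", "Low"],
})

df["rel_vel_kmps_labelled"] = df["Relative Velocity km per sec"].map(VELOCITY_CODES)
for col in ["Orbital Period", "Orbit Uncertainty"]:
    df[col] = df[col].map(LEVEL_CODES)
```

`map` leaves any label missing from the dictionary as NaN, which makes unexpected categories easy to spot.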

HAZARDOUS CLASSIFICATION

MODEL TESTING
SVM, an artificial neural network, and Random Forest were tried.
Random Forest was chosen as it yielded the best accuracy and F2 score.

1. SVM
2. Artificial Neural Network
3. Random Forest

RANDOM FOREST K-FOLD FOR K=2 TO K=11

We can see that the best performance was seen for K = 9, where both Accuracy and F2 Score are at their maximum.
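A sweep over K of this kind can be sketched with scikit-learn. The data here is synthetic with roughly a 5:1 class ratio, the forest size is arbitrary, and stratified folds are an assumption made to keep both classes in every fold; the slides only say "K-fold".

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data with roughly a 5:1 class ratio.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.83, 0.17],
                           random_state=0)
scoring = {"accuracy": "accuracy", "f2": make_scorer(fbeta_score, beta=2)}

results = {}
for k in range(2, 12):                       # K = 2 ... 11
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    model = RandomForestClassifier(n_estimators=30, random_state=0)
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[k] = (scores["test_accuracy"].mean(), scores["test_f2"].mean())

best_k = max(results, key=lambda k: results[k][1])   # highest mean F2
```

`results` maps each K to its mean accuracy and mean F2 score, which is the information needed to draw the comparison plot.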

HYPERPARAMETER TUNING
1. GridSearchCV
The best parameters selected were:
n_estimators: 300, min_samples_split: 10, min_samples_leaf: 2, max_features: 'log2', max_depth: 20,
and bootstrap: False.
Accuracy : 0.8137, F2 score : 0.8155.
The tuning led to overfitting or less generalization.

2. Random Search
The search ranges were n_estimators (e.g., 50 to 500), max_depth (e.g., 5 to 50), and min_samples_split (e.g., 2 to 10).

As hyperparameter tuning did not give the best results, we stuck with the simpler model, as it worked best.
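A random search over the ranges quoted above might look like the following sketch. The data is synthetic, the number of sampled candidates and folds are kept small for speed, and scoring by F2 is an assumption carried over from the model comparison.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=8, weights=[0.83, 0.17],
                           random_state=0)

# Ranges follow the slide; randint's upper bound is exclusive.
param_distributions = {
    "n_estimators": randint(50, 501),
    "max_depth": randint(5, 51),
    "min_samples_split": randint(2, 11),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,                                  # kept small for a quick run
    scoring=make_scorer(fbeta_score, beta=2),
    cv=3,
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

Comparing `search.best_score_` with the untuned model's cross-validated score on the same folds is one way to check whether tuning actually helps.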

ROC CURVE

Inference: The model distinguishes between the positive and negative classes very well, even at low FPR.
An AUC close to 1.0 suggests that the model is highly accurate.

CONFUSION MATRIX

True Positive (TP): 761
False Negative (FN): 7
False Positive (FP): 125
True Negative (TN): 14
This means the model predicted 761+14=775 out of 907 (85.45%) values correctly.
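The counts above are enough to recompute accuracy and the other metrics used in this deck. The helper below is a plain-Python sketch with a hypothetical function name; it uses the standard definitions of precision, recall, specificity, and F-beta.

```python
def metrics_from_counts(tp, fn, fp, tn, beta=2.0):
    """Accuracy, precision, recall, specificity and F-beta from confusion counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f_beta": (1 + b2) * precision * recall / (b2 * precision + recall),
    }

m = metrics_from_counts(tp=761, fn=7, fp=125, tn=14)
```

With these counts, recall is 761/768 while specificity is 14/139, so reading accuracy together with the per-class rates gives a fuller picture under class imbalance.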
FEATURE IMPORTANCE ANALYSIS

[Figure panels: Permutation Importance, SHAP Values]

Permutation Importance: The most impactful feature is ratio, while other features such as Jupiter Tisserand Invariant and rel_vel_kmps_labelled are negatively affecting our predictions.

SHAP: The color of each point on the graph represents the value of the corresponding feature, with red indicating high values and blue indicating low values.
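Permutation importance can be computed with scikit-learn as sketched below, on synthetic data where the informative columns are known in advance. A negative importance means the score was, on average, slightly higher after that feature was shuffled, which is the sense in which a feature is "negatively affecting" predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False the first two columns are informative, the rest are noise.
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Mean drop in accuracy when each column is shuffled on the held-out set.
importances = result.importances_mean
```

Noise columns typically land near zero, sometimes slightly below it, while informative columns show a clear drop in score when shuffled.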
ANOMALY DETECTION

K MEANS CLUSTERING

This method groups the data into clusters and calculates the distance of each point from its cluster center. By
setting a threshold based on these distances, we identified which points were anomalies.
Number of anomalies detected: 363
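The distance-to-centroid approach can be sketched with scikit-learn's `KMeans`. The data is synthetic, and both the number of clusters and the 95th-percentile cutoff are assumptions; the slides do not state the values used.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical scaled features: three blobs of 130 points plus 10 scattered points.
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(130, 2)) for c in centers]
              + [rng.uniform(-15.0, 15.0, size=(10, 2))])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# transform() gives the distance to every centroid; the minimum is the
# distance to the point's own (nearest) centroid.
dist = km.transform(X).min(axis=1)

threshold = np.percentile(dist, 95)      # assumed cutoff: top 5% of distances
is_anomaly = dist > threshold
n_anomalies = int(is_anomaly.sum())
```

A percentile cutoff fixes the share of points flagged, so the anomaly count scales with the dataset size rather than with how unusual the points are.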

KNN

K MEANS VS KNN
THANK YOU!
APPENDIX
FORMULAE FOR FEATURES IN FEATURE ENGINEERING
1. Day of the year: the approach day of the asteroid, derived by manipulating existing columns.
2. Eccentricity
3. Average Orbital Velocity
4. Time Period days
5. Heliocentric Distance
6. Escape Velocity
7. Specific Orbital Energy
8. Specific Angular Momentum
9. V_Aphelion
10. V_Perihelion
11. Mean Motion
12. Synodic Period
13. Ratio
14. Perihelion Longitude
15. Class
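For orientation, several of the listed quantities have standard two-body expressions, sketched below for a heliocentric ellipse. These are textbook forms and may differ from the formulae the team used, which are not reproduced here; the constants and function names are assumptions.

```python
import math

MU_SUN = 1.32712440018e20      # Sun's gravitational parameter, m^3 / s^2
AU_M = 1.495978707e11          # one astronomical unit in metres

def orbit_quantities(a_au, e):
    """Standard two-body quantities for a heliocentric ellipse (SI units)."""
    a = a_au * AU_M
    period_s = 2.0 * math.pi * math.sqrt(a**3 / MU_SUN)    # Kepler's third law
    return {
        "period_days": period_s / 86_400.0,
        "mean_motion_rad_s": 2.0 * math.pi / period_s,
        "specific_energy": -MU_SUN / (2.0 * a),
        "specific_angular_momentum": math.sqrt(MU_SUN * a * (1.0 - e**2)),
        "v_perihelion": math.sqrt(MU_SUN / a * (1.0 + e) / (1.0 - e)),
        "v_aphelion": math.sqrt(MU_SUN / a * (1.0 - e) / (1.0 + e)),
    }

def synodic_period_days(period_days, earth_period_days=365.25):
    """Synodic period relative to Earth: 1/S = |1/P_earth - 1/P|."""
    return 1.0 / abs(1.0 / earth_period_days - 1.0 / period_days)

earth_like = orbit_quantities(1.0, 0.0)   # circular orbit at 1 AU
```

The synodic period is undefined when the two periods are equal, so that case would need separate handling on real data.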
