DMcase 2

This case study aims to build a classification model to predict flight delays from Washington, DC to New York using the FlightDelays.csv dataset. The analysis involves data preprocessing, developing a classification tree, and optimizing it with GridSearchCV to improve accuracy. The final model demonstrates an accuracy of 86.85% on validation data, making it the recommended choice for predicting flight status.

Uploaded by

kiranodedra1999

Flight Delay Classification Case

Introduction
This case study focuses on building a classification model to predict whether a commercial flight from
the Washington, DC area to New York will be delayed or ontime. Using the FlightDelays.csv dataset,
which includes flight details such as departure time, carrier, distance, weather, and more, the goal is
to analyze these factors and apply a classification tree to determine the flight status (FL_STATUS).

A delay is defined as an arrival 15 minutes or more later than scheduled. The final model aims to
support better decision-making by identifying key patterns that lead to delays.

Question 1. Upload, explore, clean, and preprocess data for classification tree.

1(a). Create the flight_df data frame by uploading the original data set into Python. Determine and
present in this report the data frame dimensions, i.e., number of rows and columns.

Ans. The flight_df data frame was created by loading the dataset into Python. It contains 2201 rows
(consistent with the 1760 training and 441 validation records used later in this report) and 19
columns, representing flight details such as times, distance, carrier, weather, and status. This
gives a basic overview of the dataset before preprocessing.

1(b). Remove ‘DEST’ and ‘ORIGIN’ variables from the flight_df data frame. Then, display the column
data types in flight_df, provide and briefly explain them in your report.

Ans. We removed the columns 'DEST' and 'ORIGIN' from the flight_df data frame using the drop()
function. After removal, we displayed the data types of the remaining columns using .dtypes.

The data types include int64 for numerical values such as SCH_TIME, DEP_TIME, and DISTANCE, and
object (string) for categorical variables like CARRIER and FL_STATUS. This helps in identifying which
columns need to be encoded before model training.
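
The loading, column removal, and data type steps from 1(a) and 1(b) can be sketched as follows. This is a minimal illustration, not the report's actual code: the three rows are made up and stand in for FlightDelays.csv, and only the column names are taken from the report.

```python
import io
import pandas as pd

# Made-up rows standing in for FlightDelays.csv (assumption: the real file
# would be read with pd.read_csv("FlightDelays.csv") instead).
csv_text = """SCH_TIME,CARRIER,DEP_TIME,DEST,DISTANCE,ORIGIN,WEATHER,FL_STATUS
1455,OH,1455,JFK,184,BWI,0,ontime
1640,DH,1640,JFK,213,DCA,0,ontime
1245,DL,1245,LGA,229,IAD,1,delayed
"""
flight_df = pd.read_csv(io.StringIO(csv_text))
print(flight_df.shape)  # (number of rows, number of columns)

# Remove the two location columns, then list the remaining data types.
flight_df = flight_df.drop(columns=["DEST", "ORIGIN"])
print(flight_df.dtypes)
```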

1(c). You leave the outcome variable ‘FL_STATUS’ unchanged in flight_df. However, for the ‘CARRIER’
predictor variable, you need to convert it into the binary variables and avoid using the Boolean
(‘bool’) values. Display in Python the modified column data types, provide and briefly explain them in
your report.

Ans. The outcome variable 'FL_STATUS' was left unchanged in its categorical form. The predictor
'CARRIER' was converted into binary (dummy) variables using pd.get_dummies(), and the bool data
types were converted to int (0 or 1) using .astype(int) to ensure compatibility with modeling
algorithms.

The modified data types were displayed, confirming that all binary columns were now int64, which is
required for proper numeric processing during model training.
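
A small sketch of the dummy-variable step, assuming a made-up four-row data frame in place of flight_df. Passing columns=["CARRIER"] limits the encoding to that predictor, so FL_STATUS keeps its original string labels.

```python
import pandas as pd

# Made-up sample; the real flight_df has more rows and columns.
flight_df = pd.DataFrame({
    "CARRIER": ["DL", "MQ", "UA", "DL"],
    "DISTANCE": [229, 214, 199, 229],
    "FL_STATUS": ["ontime", "delayed", "ontime", "ontime"],
})

# Expand CARRIER into one indicator column per carrier code.
flight_df = pd.get_dummies(flight_df, columns=["CARRIER"])

# Convert the indicator columns to integer 0/1 values instead of booleans.
carrier_cols = [c for c in flight_df.columns if c.startswith("CARRIER_")]
flight_df[carrier_cols] = flight_df[carrier_cols].astype(int)
print(flight_df.dtypes)
```
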

1(d). Display in Python and provide in your report the first 10 records of the modified flight_df data
frame. Briefly explain the outcome and predictors in this case.

Ans. We displayed the first 10 records of the modified flight_df using flight_df.head(10). These
records show the updated structure of the data, where original categorical predictors like CARRIER
have been expanded into separate columns (e.g., CARRIER_DL, CARRIER_UA, etc.) with 0s and 1s
indicating presence.

The outcome variable remains as FL_STATUS, which identifies whether a flight was delayed or
ontime. The predictors now include only numerical values, making the dataset ready for machine
learning algorithms.

Question 2. Develop a classification tree for the Flight Delays case.

2(a). Develop in Python the predictor variables (14 variables) and outcome variable (‘FL_STATUS’),
partition the data set (80% for training and 20% for validation partitions, random_state=1). Train a
classification tree model using DecisionTreeClassifier() with the training data set and the following
tree control parameters: (a) maximum depth (number of splits) equals 4; (b) minimum impurity
decrease per split of 0.001; and (c) minimum number of node records (samples) to split equals to 30.
Use plotDecisionTree() with the feature_names and class_names parameters to display the
classification tree in Python and present it in your report.

Ans. A classification tree was developed using DecisionTreeClassifier() with 14 predictor variables
and FL_STATUS as the outcome. The dataset was split into 80% training and 20% validation sets
(random_state=1). The tree was trained with parameters: max_depth=4,
min_impurity_decrease=0.001, and min_samples_split=30.

The tree was visualized using plotDecisionTree() to show key decision paths.
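
The partitioning and training step might look like the sketch below. It rests on several assumptions: the data are synthetic (2201 rows, chosen so the partition sizes match the 1760 and 441 totals in the confusion matrices reported later), only three illustrative predictors are used instead of the 14 in the report, and the delay rule is invented. Because plotDecisionTree() is not part of scikit-learn (it appears to come from a separate helper package), scikit-learn's export_text() is used here as a plain-text stand-in for the plot.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the preprocessed predictors (assumption).
rng = np.random.default_rng(1)
n = 2201
X = pd.DataFrame({
    "WEATHER": (rng.random(n) < 0.1).astype(int),
    "DEP_TIME": rng.integers(600, 2200, n),
    "DISTANCE": rng.integers(180, 230, n),
})
# Invented rule, used only to give the tree something to learn.
y = np.where((X["WEATHER"] == 1) | (X["DEP_TIME"] > 1900), "delayed", "ontime")

# 80% training, 20% validation, as in the report.
train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Tree control parameters from the report; random_state is added here
# only to make the sketch reproducible.
tree = DecisionTreeClassifier(
    max_depth=4,
    min_impurity_decrease=0.001,
    min_samples_split=30,
    random_state=1,
)
tree.fit(train_X, train_y)
print(len(train_X), len(valid_X))
print(export_text(tree, feature_names=list(X.columns)))
```
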

2(b). Using the classification tree, explain the outcome (‘FL_STATUS’) of a flight if the weather
(‘WEATHER’) is in good flying condition, departure time (‘DEP_TIME’) is 1450 (2:50 pm), and
scheduled time (‘SCH_TIME’) is 1435 (2:35 pm).

Ans. Based on the classification tree, the flight is predicted to be on time. The weather is good
(WEATHER = 0) and the departure time (1450) falls below the 1500.5 split threshold, so the record
follows a path whose terminal node contains mostly on-time flights. The model therefore classifies
this flight as ontime.
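
To illustrate how a single record is routed through a fitted tree, here is a sketch on a toy tree. The training data and the rule behind them (delayed when the weather is bad or the departure is after 1500) are invented for the example, so the learned thresholds are not those of the report's tree; only the scored record and the tree control parameters come from the report.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Deterministic toy data (assumption): one flight every 5 time units,
# bad weather on every 8th flight.
rows = []
for i, dep in enumerate(range(600, 2200, 5)):
    weather = 1 if i % 8 == 0 else 0
    status = "delayed" if weather == 1 or dep > 1500 else "ontime"
    rows.append({"WEATHER": weather, "DEP_TIME": dep,
                 "SCH_TIME": dep - 15, "FL_STATUS": status})
toy_df = pd.DataFrame(rows)

predictors = ["WEATHER", "DEP_TIME", "SCH_TIME"]
tree = DecisionTreeClassifier(max_depth=4, min_impurity_decrease=0.001,
                              min_samples_split=30, random_state=1)
tree.fit(toy_df[predictors], toy_df["FL_STATUS"])

# The flight described in the question.
new_flight = pd.DataFrame([{"WEATHER": 0, "DEP_TIME": 1450, "SCH_TIME": 1435}])
prediction = tree.predict(new_flight)[0]

# Node ids visited from the root to the terminal node.
visited = list(tree.decision_path(new_flight).indices)
print(prediction, visited)
```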

2(c). Identify and display in Python confusion matrices for training and validation partitions. Present
them in your report and comment on accuracy (misclassification) rate for both partitions and explain
if there is a possibility of overfitting.

Ans. The confusion matrices for both the training and validation partitions were generated. With
'ontime' treated as the positive class, the training confusion matrix shows 86 true negatives, 1402
true positives, 260 false positives, and 12 false negatives. The validation confusion matrix shows
26 true negatives, 353 true positives, 56 false positives, and 6 false negatives.

The training accuracy is 84.55%, and the validation accuracy is 85.94%. The misclassification rates
are 15.45% for training and 14.06% for validation.

Since the accuracy values are fairly close between the training and validation sets, and the model
performs slightly better on validation than training, there is no strong evidence of overfitting. The
model generalizes well and performs consistently across both partitions.
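
The accuracy and misclassification rates above follow directly from the four counts in each confusion matrix. A short check, where summarize() is a hypothetical helper written for this illustration:

```python
def summarize(tn, fp, fn, tp):
    """Return (total records, accuracy %, misclassification %)."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    return total, round(100 * accuracy, 2), round(100 * (1 - accuracy), 2)

# Counts reported for the initial classification tree.
train = summarize(tn=86, fp=260, fn=12, tp=1402)
valid = summarize(tn=26, fp=56, fn=6, tp=353)
print(train)  # (1760, 84.55, 15.45)
print(valid)  # (441, 85.94, 14.06)
```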

2(d). Using the trained classification tree, classify the flight status (‘delayed’ or ‘ontime’) for
the following two new flight records

Ans. The classification tree predicted the first flight to be delayed, likely because it is scheduled
late at night and uses CARRIER_MQ, which the model associates with delays. The second flight was
predicted to be ontime, as it departs earlier and involves different carriers. Because the model's
accuracy is similar on the training (84.55%) and validation (85.94%) partitions, with no sign of
overfitting, these predictions are reasonably reliable.

Question 3. Apply grid search to improve classification results.

3(a). Use the GridSearchCV() algorithm in Python to improve (optimize) the classification tree
control parameters. Consider the following control parameters: (a) maximum depth (number of
splits) in the range from 2 to 30; (b) minimum impurity decrease per split of 0, 0.0005, and 0.001;
and (c) minimum number of node records (samples) to split in the range from 5 to 30 (cv=5,
n_jobs=-1). Do not use the initial guess grid search, and directly apply the improved grid search. In
your report, provide the improved parameters and the associated classification tree. Display the
confusion matrices for training and validation partitions for the improved classification tree.
Present them in your report and comment on accuracy (misclassification) rate for both partitions
and explain if there is a possibility of overfitting.

Ans. GridSearchCV was used to improve classification performance.

Parameters tested:
 max_depth: 2 to 30
 min_impurity_decrease: 0, 0.0005, 0.001
 min_samples_split: 5 to 30

Best parameters:
 max_depth = 9
 min_impurity_decrease = 0.001
 min_samples_split = 20

Cross-validated score: 0.8858

Training confusion matrix: [[193, 153], [30, 1384]]

 Accuracy: 89.60%
 Misclassification: 10.40%

Validation confusion matrix: [[42, 40], [18, 341]]

 Accuracy: 86.85%
 Misclassification: 13.15%

Conclusion: The model performs well with no clear overfitting. Training accuracy (89.60%) exceeds
validation accuracy (86.85%) by less than three percentage points.
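
A sketch of how the grid described in 3(a) could be set up with GridSearchCV. The parameter ranges and cv=5 come from the report; everything else is an assumption made for illustration: the data are synthetic, the search runs on a thinned copy of the grid so the example finishes quickly, and n_jobs=1 is used in place of the report's n_jobs=-1.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeClassifier

# Full grid described in the report.
param_grid = {
    "max_depth": list(range(2, 31)),
    "min_impurity_decrease": [0.0, 0.0005, 0.001],
    "min_samples_split": list(range(5, 31)),
}
n_candidates = len(ParameterGrid(param_grid))  # 29 * 3 * 26 combinations

# Thinned grid (assumption) to keep this sketch fast;
# pass param_grid instead to run the full search.
demo_grid = dict(
    param_grid,
    max_depth=param_grid["max_depth"][::7],
    min_samples_split=param_grid["min_samples_split"][::5],
)

# Synthetic stand-in data with an invented delay rule (assumption).
rng = np.random.default_rng(1)
X = np.column_stack([rng.integers(0, 2, 400), rng.integers(600, 2200, 400)])
y = np.where((X[:, 0] == 1) | (X[:, 1] > 1900), "delayed", "ontime")

search = GridSearchCV(DecisionTreeClassifier(random_state=1),
                      demo_grid, cv=5, n_jobs=1)
search.fit(X, y)
print(n_candidates, search.best_params_, round(search.best_score_, 4))
```

After fitting, search.best_estimator_ holds the tree refit on the full training data with the best parameters, which is the model whose confusion matrices would be reported.
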

3(b). Present and compare in your report the validation confusion matrices for the classification
results in questions 2c and 3a. Using the accuracy value (misclassification rate), which classification
tree model would you recommend using for making predictions in this case of flight status (‘delayed’
or ‘on time’)? Briefly explain your answer.

Ans. Based on the validation confusion matrices and accuracy values obtained from Questions 2c and
3a, it is evident that the improved model from Question 3a performs slightly better than the initial
model. The initial model (Q2c) achieved an accuracy of 85.94% with a misclassification rate of
14.06%, while the improved model (Q3a), which was optimized using GridSearchCV, achieved a
higher accuracy of 86.85% and a lower misclassification rate of 13.15%.

Although the improvement is modest (about 0.9 percentage points), the grid search-optimized model
generalizes slightly better on the validation data. Therefore, for making future predictions on flight
status (either 'delayed' or 'ontime'), the improved classification tree model from Question 3a is
recommended. This recommendation is supported by its higher accuracy, lower error rate, and more
balanced classification: it correctly identifies 42 of the 82 delayed validation flights, compared
with 26 for the initial model.
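
The "more balanced classification" point can be checked from the first row of each validation confusion matrix, which holds the actual delayed flights. A minimal sketch, where delayed_recall() is a hypothetical helper and the rows are taken from the matrices reported in 2(c) and 3(a):

```python
def delayed_recall(row):
    """Share (%) of actual delayed flights that were predicted as delayed."""
    caught, missed = row
    return round(100 * caught / (caught + missed), 1)

initial = delayed_recall([26, 56])   # initial tree, validation partition
improved = delayed_recall([42, 40])  # grid-search tree, validation partition
print(initial, improved)  # 31.7 51.2
```

Both trees still miss many delayed flights, but the grid-search tree catches roughly half of them instead of about a third.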
