Flight Delay Classification Case
Introduction
This case study builds a classification model to predict whether a commercial flight from the
Washington, DC area to New York will be delayed or on time. The FlightDelays.csv dataset includes
flight details such as departure time, carrier, distance, and weather. The goal is to analyze these
factors and apply a classification tree to predict the flight status (FL_STATUS).
A delay is defined as an arrival 15 minutes or more later than scheduled. The final model aims to
support better decision-making by identifying key patterns that lead to delays.
Questions
1. Upload, explore, clean, and preprocess data for classification tree.
1(a). Create the flight_df data frame by uploading the original data set into Python. Determine and
present in this report the data frame dimensions, i.e., number of rows and columns.
Ans. The flight_df data frame was created by loading the dataset into Python. It contains 2,201 rows
and 19 columns, representing flight details such as times, distance, carrier, weather, and status. This
gives a basic overview of the dataset before preprocessing.
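The loading step can be sketched as follows. This is a minimal illustration, not the report's actual notebook: the three-row in-memory CSV and its values are hypothetical stand-ins so the code runs without the real file, and only a subset of the columns is shown.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for FlightDelays.csv so the sketch runs on its own.
# With the real file, use: flight_df = pd.read_csv("FlightDelays.csv")
sample_csv = io.StringIO(
    "SCH_TIME,CARRIER,DEP_TIME,DEST,DISTANCE,ORIGIN,WEATHER,FL_STATUS\n"
    "1455,OH,1455,JFK,184,BWI,0,ontime\n"
    "1640,DH,1640,JFK,213,DCA,0,ontime\n"
    "1245,DH,1245,LGA,229,IAD,0,delayed\n"
)
flight_df = pd.read_csv(sample_csv)

# shape returns (number of rows, number of columns)
n_rows, n_cols = flight_df.shape
print(f"Rows: {n_rows}, Columns: {n_cols}")
```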
1(b). Remove ‘DEST’ and ‘ORIGIN’ variables from the flight_df data frame. Then, display the column
data types in flight_df, provide and briefly explain them in your report.
Ans. We removed the columns 'DEST' and 'ORIGIN' from the flight_df data frame using the drop()
method. After removal, we displayed the data types of the remaining columns using .dtypes.
The data types include int64 for numerical values such as SCH_TIME, DEP_TIME, and DISTANCE, and
object (string) for categorical variables like CARRIER and FL_STATUS. This helps in identifying which
columns need to be encoded before model training.
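A short sketch of this step is given below. The two-row data frame is a hypothetical stand-in for flight_df; only the column names come from the report.

```python
import pandas as pd

# Hypothetical two-row stand-in for flight_df (values are illustrative only).
flight_df = pd.DataFrame({
    "SCH_TIME": [1455, 1640],
    "CARRIER": ["OH", "DH"],
    "DEP_TIME": [1455, 1640],
    "DEST": ["JFK", "JFK"],
    "DISTANCE": [184, 213],
    "ORIGIN": ["BWI", "DCA"],
    "WEATHER": [0, 0],
    "FL_STATUS": ["ontime", "ontime"],
})

# Drop the two location columns, then list the type of every remaining column.
flight_df = flight_df.drop(columns=["DEST", "ORIGIN"])
print(flight_df.dtypes)
```

Text columns such as CARRIER and FL_STATUS are listed with a string-like type (object in many pandas versions), which marks them as candidates for encoding.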
1(c). You leave the outcome variable ‘FL_STATUS’ unchanged in flight_df. However, for the ‘CARRIER’
predictor variable, you need to convert it into the binary variables and avoid using the Boolean
(‘bool’) values. Display in Python the modified column data types, provide and briefly explain them in
your report.
Ans. The outcome variable 'FL_STATUS' was left unchanged in its categorical form. The predictor
'CARRIER' was converted into binary (dummy) variables using pd.get_dummies(), and the resulting
bool columns were converted to integers (0 or 1) using .astype(int), as the question requires.
The modified data types were displayed, confirming that all dummy columns are now int64, so every
predictor is stored as a number.
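The conversion can be sketched as below, assuming a small hypothetical data frame in place of flight_df. The carrier codes DL, UA, and MQ are the ones mentioned in this report; the other values are made up.

```python
import pandas as pd

# Hypothetical stand-in rows; only the column names follow the report.
flight_df = pd.DataFrame({
    "CARRIER": ["DL", "UA", "MQ", "DL"],
    "DISTANCE": [213, 229, 214, 213],
    "FL_STATUS": ["ontime", "delayed", "delayed", "ontime"],
})

# One dummy column per carrier; FL_STATUS is not listed, so it stays unchanged.
flight_df = pd.get_dummies(flight_df, columns=["CARRIER"])

# Recent pandas versions create bool dummies, so cast them to 0/1 integers.
dummy_cols = [col for col in flight_df.columns if col.startswith("CARRIER_")]
flight_df[dummy_cols] = flight_df[dummy_cols].astype(int)

print(flight_df.dtypes)
```

Passing dtype=int to pd.get_dummies() would give the same result in one call; the two-step version mirrors the description above.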
1(d). Display in Python and provide in your report the first 10 records of the modified flight_df data
frame. Briefly explain the outcome and predictors in this case.
Ans. We displayed the first 10 records of the modified flight_df using flight_df.head(10). These
records show the updated structure of the data, where the categorical predictor CARRIER has been
expanded into separate columns (e.g., CARRIER_DL, CARRIER_UA, etc.) with 0s and 1s indicating
which carrier operated the flight.
The outcome variable remains FL_STATUS, which identifies whether a flight was 'delayed' or
'ontime'. The predictors now include only numerical values, making the dataset ready for machine
learning algorithms.
2. Develop a classification tree for the Flight Delays case.
2(a). Develop in Python the predictor
variables (14 variables) and outcome variable (‘FL_STATUS’), partition the data set (80% for training
and 20% for validation partitions, random_state=1). Train a classification tree model using
DecisionTreeClassifier() with the training data set and the following tree control parameters: (a)
maximum depth (number of splits) equals 4; (b) minimum impurity decrease per split of 0.001; and
(c) minimum number of node records (samples) to split equals to 30. Use plotDecisionTree() with the
feature_names and class_names parameters to display the classification tree in Python and present
it in your report.
Ans. A classification tree was developed using DecisionTreeClassifier() with 14 predictor variables
and FL_STATUS as the outcome. The dataset was split into 80% training and 20% validation sets
(random_state=1). The tree was trained with parameters: max_depth=4,
min_impurity_decrease=0.001, and min_samples_split=30.
The tree was visualized using plotDecisionTree() to show key decision paths.
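The partitioning and training steps can be sketched as follows. The data here is synthetic (a hypothetical rule on WEATHER and SCH_TIME), so the resulting tree does not reproduce the one in this report. Because plotDecisionTree() comes from a separate helper package, the sketch prints the tree with scikit-learn's export_text() as a stand-in.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (hypothetical values), used only to make the sketch runnable.
rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "SCH_TIME": rng.integers(6, 22, n) * 100 + rng.integers(0, 60, n),  # HHMM format
    "WEATHER": (rng.random(n) < 0.1).astype(int),                       # 1 = bad weather
    "DISTANCE": rng.integers(169, 230, n),
})
# Made-up labeling rule: bad weather or an evening schedule means a delay.
y = np.where((X["WEATHER"] == 1) | (X["SCH_TIME"] >= 1900), "delayed", "ontime")

# 80% training, 20% validation, as specified in the question.
train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Tree control parameters from the question.
class_tree = DecisionTreeClassifier(
    max_depth=4, min_impurity_decrease=0.001, min_samples_split=30
)
class_tree.fit(train_X, train_y)

# Text rendering of the fitted tree (stand-in for plotDecisionTree()).
print(export_text(class_tree, feature_names=list(train_X.columns)))
```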
2(b). Using the classification tree, explain the outcome (‘FL_STATUS’) of a flight if the weather
(‘WEATHER’) is in good flying condition, departure time (‘DEP_TIME’) is 1450 (2:50 pm), and
scheduled time (‘SCH_TIME’) is 1435 (2:35 pm).
Ans. Based on the classification tree, the flight is predicted to be on time. The weather is good
(WEATHER = 0) and the departure time (1450) falls below the 1500.5 split, so the record follows a
path whose leaf node contains a majority of on-time flights. The tree therefore classifies this flight
as 'ontime'.
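Following a single record through a fitted tree can also be done in code. The sketch below is an illustration under stated assumptions: the tree is trained on synthetic data with a made-up rule, so its splits differ from the report's tree (for example, the 1500.5 threshold will not appear). Only the record's values come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data (hypothetical): delays occur in bad weather or after 19:00.
rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({
    "WEATHER": (rng.random(n) < 0.1).astype(int),
    "DEP_TIME": rng.integers(6, 22, n) * 100 + rng.integers(0, 60, n),
    "SCH_TIME": rng.integers(6, 22, n) * 100 + rng.integers(0, 60, n),
})
y = np.where((X["WEATHER"] == 1) | (X["DEP_TIME"] >= 1900), "delayed", "ontime")
tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

# The record from the question, with columns in the same order as the training data.
record = pd.DataFrame([{"WEATHER": 0, "DEP_TIME": 1450, "SCH_TIME": 1435}])
record = record[list(X.columns)]

# decision_path() lists the nodes the record visits, from the root down to its leaf.
fitted = tree.tree_
for node in tree.decision_path(record).indices:
    if fitted.children_left[node] == -1:  # -1 marks a leaf node
        break
    name = X.columns[fitted.feature[node]]
    value = record.at[0, name]
    side = "<=" if value <= fitted.threshold[node] else ">"
    print(f"{name} = {value} {side} {fitted.threshold[node]:.1f}")

prediction = tree.predict(record)[0]
print("Predicted FL_STATUS:", prediction)
```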
2(c). Identify and display in Python confusion matrices for training and validation partitions. Present
them in your report and comment on accuracy (misclassification) rate for both partitions and explain
if there is a possibility of overfitting.
Ans. The confusion matrices for both the training and validation partitions were generated. Treating
'ontime' as the positive class, the training confusion matrix shows 86 true negatives, 1402 true
positives, 260 false positives, and 12 false negatives. The validation confusion matrix shows 26 true
negatives, 353 true positives, 56 false positives, and 6 false negatives.
The training accuracy is 84.55%, and the validation accuracy is 85.94%. The misclassification rates
are 15.45% for training and 14.06% for validation.
Since the accuracy values are close between the two partitions, and the model performs slightly
better on validation than on training, there is no evidence of overfitting. The model performs
consistently across both partitions. However, most errors are delayed flights classified as on time
(260 of 346 delayed flights in training and 56 of 82 in validation), so the overall accuracy overstates
how well the tree detects delays.
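The accuracy and misclassification rates follow directly from the counts above. In the sketch below, the matrix layout (rows are actual classes, columns are predicted classes, both in the order delayed, ontime) is an assumption; the accuracy itself does not depend on it.

```python
import numpy as np

# Counts reported above; assumed layout: rows = actual, columns = predicted,
# class order (delayed, ontime).
train_cm = np.array([[86, 260], [12, 1402]])
valid_cm = np.array([[26, 56], [6, 353]])

def accuracy(cm):
    """Share of records on the diagonal, i.e. correctly classified."""
    return np.trace(cm) / cm.sum()

for name, cm in [("training", train_cm), ("validation", valid_cm)]:
    acc = accuracy(cm)
    print(f"{name}: n = {cm.sum()}, accuracy = {acc:.2%}, misclassification = {1 - acc:.2%}")
```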
2(d). Using the trained classification tree, make classification of flight status (‘delayed’ or ‘ontime’) for
the following two new flight records
Ans. The classification tree predicted the first flight to be delayed, likely because it is scheduled late
at night and uses CARRIER_MQ, which the model associates with delays. The second flight was
predicted to be ontime, as it departs earlier and uses a different carrier. Since the model achieves
similar accuracy on the training (84.55%) and validation (85.94%) data, these predictions are
reasonably reliable.
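Scoring new records requires the same columns, in the same order, as the training predictors. The sketch below shows one way to do this with reindex(). The two new flights and the tiny training set are hypothetical (the actual records are not reproduced here), so the output only illustrates the mechanics.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical training set with carrier dummies already created.
train_X = pd.DataFrame({
    "SCH_TIME":   [700, 900, 1300, 1700, 2100, 2130],
    "WEATHER":    [0, 0, 0, 0, 0, 0],
    "CARRIER_DL": [1, 0, 0, 1, 0, 0],
    "CARRIER_MQ": [0, 0, 1, 0, 1, 1],
    "CARRIER_UA": [0, 1, 0, 0, 0, 0],
})
train_y = ["ontime", "ontime", "ontime", "ontime", "delayed", "delayed"]
tree = DecisionTreeClassifier(random_state=1).fit(train_X, train_y)

# Two hypothetical new flights in their raw form (CARRIER still a text column).
new_flights = pd.DataFrame({
    "SCH_TIME": [2120, 800],
    "WEATHER": [0, 0],
    "CARRIER": ["MQ", "DL"],
})

# Encode CARRIER, then align to the training columns; dummies that do not
# occur in the new records (here CARRIER_UA) are added and filled with 0.
new_X = pd.get_dummies(new_flights, columns=["CARRIER"], dtype=int)
new_X = new_X.reindex(columns=train_X.columns, fill_value=0)

predictions = tree.predict(new_X)
print(list(predictions))
```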
3. Apply grid search to improve classification results.
3(a). Use the GridSearchCV() algorithm in
Python to improve (optimize) the classification tree control parameters. Consider the following
control parameters: (a) maximum depth (number of splits) in the range from 2 to 30; (b) minimum
impurity decrease per split of 0, 0.0005, and 0.001; and (c) minimum number of node records
(samples) to split in the range from 5 to 30 (cv=5, n_jobs=-1). Do not use the initial guess grid search,
and directly apply the improved grid search. In your report, provide the improved parameters and
the associated classification tree. Display the confusion matrices for training and validation partitions
for the improved classification tree. Present them in your report and comment on accuracy
(misclassification) rate for both partitions and explain if there is a possibility of overfitting.
Ans. GridSearchCV was used to improve classification performance.
Parameters tested:
max_depth: 2 to 30
min_impurity_decrease: 0, 0.0005, 0.001
min_samples_split: 5 to 30
Best parameters:
max_depth = 9
min_impurity_decrease = 0.001
min_samples_split = 20
Cross-validated score: 0.8858
Training confusion matrix: [[193, 153], [30, 1384]]
Accuracy: 89.60%
Misclassification: 10.40%
Validation confusion matrix: [[42, 40], [18, 341]]
Accuracy: 86.85%
Misclassification: 13.15%
Conclusion: The model performs well with no clear sign of overfitting. Training accuracy (89.60%) is
only about 2.75 percentage points above validation accuracy (86.85%).
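The grid search setup can be sketched as below. The full parameter grid matches the ranges listed above. To keep the example quick, the fit itself uses synthetic data from make_classification(), a coarser demo grid, and a single process; all three are assumptions of this sketch, while the report's run used the full grid with n_jobs=-1.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeClassifier

# Full grid described in the report.
param_grid = {
    "max_depth": list(range(2, 31)),             # 2 to 30
    "min_impurity_decrease": [0, 0.0005, 0.001],
    "min_samples_split": list(range(5, 31)),     # 5 to 30
}
n_candidates = len(ParameterGrid(param_grid))
print("Parameter combinations in the full grid:", n_candidates)

# Synthetic data and a coarser grid (both hypothetical) keep this demo fast.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
demo_grid = {
    "max_depth": [2, 4, 8],
    "min_impurity_decrease": [0, 0.0005, 0.001],
    "min_samples_split": [5, 15, 30],
}

N_JOBS = 1  # the report used n_jobs=-1 (all cores); 1 keeps the sketch portable
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=1), demo_grid, cv=5, n_jobs=N_JOBS
)
grid_search.fit(X, y)

print("Best parameters:", grid_search.best_params_)
print("Cross-validated score:", round(grid_search.best_score_, 4))
best_tree = grid_search.best_estimator_  # refit on all of X, y with the best parameters
```

With 5-fold cross-validation, the full grid requires fitting each combination five times, which is why parallel execution is useful for the real run.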
3(b). Present and compare in your report the validation confusion matrices for the classification
results in questions 2c and 3a. Using the accuracy value (misclassification rate), which classification
tree model would you recommend using for making predictions in this case of flight status (‘delayed’
or ‘on time’)? Briefly explain your answer.
Ans. Based on the validation confusion matrices and accuracy values from Questions 2c and 3a, the
improved model from Question 3a performs slightly better than the initial model. The initial model
(Q2c) achieved an accuracy of 85.94% with a misclassification rate of 14.06%, while the improved
model (Q3a), optimized using GridSearchCV, achieved a higher accuracy of 86.85% and a lower
misclassification rate of 13.15%.
Although the improvement is modest (0.91 percentage points), the grid search-optimized model
generalizes slightly better on the validation data. Therefore, for making future predictions on flight
status (either 'delayed' or 'ontime'), the improved classification tree model from Question 3a is
recommended. This recommendation is supported by its higher accuracy, lower error rate, and more
balanced classification: it correctly identifies 42 of the 82 delayed flights in the validation partition,
compared with 26 for the initial model.
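The difference in how the two trees handle delayed flights can be checked with a short calculation. The sketch below uses only the delayed-flight counts from the two validation confusion matrices; reading the first matrix row as actual delayed flights, split into predicted delayed and predicted ontime, is an assumption about the layout.

```python
# Validation counts for actual delayed flights: (predicted delayed, predicted ontime).
delayed_rows = {
    "initial tree (2c)": (26, 56),
    "grid search tree (3a)": (42, 40),
}

delayed_recall = {}
for name, (caught, missed) in delayed_rows.items():
    # Share of actual delayed flights that the tree labels as delayed.
    delayed_recall[name] = caught / (caught + missed)
    print(f"{name}: {caught} of {caught + missed} delayed flights identified "
          f"({delayed_recall[name]:.1%})")
```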