MATH1318 Time Series Analysis - Final Project Written report and presentation
STUDENT NAME: Namratha Desai s3950476
Assignment-3:
INTRODUCTION
Time series forecasting is a crucial aspect of data analysis and prediction in various fields,
ranging from finance and economics to weather forecasting and sales forecasting. It involves
analyzing and predicting patterns, trends, and future values based on historical data points
collected over a specific period. Time series forecasting techniques are utilized to make informed
decisions, plan resources, optimize operations, and anticipate future events. This report provides
an introduction and background to time series forecasting, exploring its significance, challenges,
and common techniques. We will delve into the fundamental concepts, methods, and tools
employed in this domain, as well as discuss its real-world applications. Understanding the
principles and approaches of time series forecasting will empower individuals and organizations
to harness the power of historical data and make accurate predictions (Natras et al., 2022).
BACKGROUND
Time series data refers to a collection of observations recorded at regular intervals over a
specified period. The data points in a time series possess an inherent chronological order, making
it distinct from cross-sectional or panel data. This time-based dimension enables the
identification of patterns, trends, and dependencies within the data, which can be leveraged for
forecasting future values (Natras et al., 2022).
The analysis of time series data involves studying its key components, which include trend,
cyclicality, seasonality and irregular fluctuations. The trend is the long-term movement in
the data, indicating whether it is increasing, decreasing, or following a particular pattern.
Seasonality refers to repetitive patterns that occur within fixed, shorter time frames, such as daily,
weekly, or yearly cycles. Cyclicality denotes longer-term patterns that are not as regular as
seasonal patterns, often spanning multiple years. Lastly, irregular fluctuations, also known as
residual or random components, represent the unpredictable variation in the data
that cannot be explained by the trend, seasonality, or cyclicality (Meng et al., 2021).
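As a minimal sketch of how these components can be separated, the Python example below builds a
synthetic hourly series and applies a simple classical decomposition. The data, the 24-hour period,
and all variable names are assumptions made for illustration and are not taken from the project code.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series: linear trend + daily cycle + noise (illustrative only).
rng = np.random.default_rng(0)
n_hours = 24 * 30
index = pd.date_range("2018-01-01", periods=n_hours, freq=pd.Timedelta(hours=1))
t = np.arange(n_hours)
series = pd.Series(
    100 + 0.05 * t + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, n_hours),
    index=index,
)

period = 24
# Trend: centred 24-hour moving average, so the daily cycle averages out.
trend = series.rolling(window=period, center=True).mean()

# Seasonal component: mean detrended value for each hour of the day.
detrended = series - trend
seasonal_by_hour = detrended.groupby(detrended.index.hour).mean()
seasonal = pd.Series(
    seasonal_by_hour.loc[series.index.hour].to_numpy(), index=series.index
)

# Irregular component: what remains after removing trend and seasonality.
residual = series - trend - seasonal
print(seasonal_by_hour.round(2))
```

The first and last few values of `trend` and `residual` are missing because the moving average
needs a full window on both sides.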
Time series forecasting aims to model and predict future values based on historical data patterns.
Accurate forecasts enable businesses and organizations to anticipate demand, optimize inventory,
plan resources, and make informed decisions. Furthermore, time series forecasting plays a
crucial role in various domains, such as finance, economics, weather forecasting, energy
consumption, stock market analysis, and sales forecasting (Meng et al., 2021).
Forecasting time series data poses several challenges due to its inherent characteristics. One such
challenge is the presence of noise and outliers, which can distort the patterns and affect the
accuracy of predictions. Handling missing data is another challenge, as the absence of values at
certain time points can impact the continuity and reliability of the series. Moreover, time series
data often exhibit non-stationary behaviour, where the statistical properties change over time,
making it difficult to model using traditional methods. These challenges necessitate the
utilization of specialized techniques and algorithms designed for time series forecasting (Tan et
al., 2021).
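Two of these challenges, missing observations and a changing mean, can be illustrated with a small
sketch. The series below is made up, and the choice of time-based linear interpolation and first
differencing is an assumption for illustration; the project code may handle these issues differently.

```python
import numpy as np
import pandas as pd

# Small illustrative hourly series with two timestamps deliberately removed.
full_index = pd.date_range("2018-01-01", periods=12, freq=pd.Timedelta(hours=1))
values = pd.Series(np.arange(12, dtype=float) * 2.0 + 50.0, index=full_index)
observed = values.drop(full_index[[4, 5]])

# Restore the regular hourly grid; the missing hours appear as NaN.
regular = observed.reindex(full_index)
n_missing = int(regular.isna().sum())

# Fill the gaps by linear interpolation in time.
filled = regular.interpolate(method="time")

# First differencing removes a linear trend, a common step towards stationarity.
differenced = filled.diff().dropna()
print(n_missing, differenced.unique())
```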
A variety of time series forecasting techniques have been developed to tackle these challenges
and generate accurate predictions. These techniques can be broadly categorized into two main
approaches: statistical methods and machine learning methods. Statistical methods, such as
ARIMA (AutoRegressive Integrated Moving Average) as well as exponential smoothing, rely on
statistical models to capture the patterns and dependencies within the data. On the other hand,
machine learning methods, including random forests, support vector machines and neural
networks, leverage algorithms that learn from the data to make predictions. These methods often
require large amounts of data and perform well when dealing with complex patterns and
nonlinear relationships (Tan et al., 2021).
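To make the statistical family concrete, the sketch below implements simple exponential smoothing
by hand, where each smoothed level is a weighted average of the newest observation and the
previous level. The function name, the input values, and the smoothing weight `alpha` are
illustrative assumptions; this method is not part of the project's modelling.

```python
import numpy as np

def simple_exponential_smoothing(y, alpha):
    """Return the smoothed levels s_t = alpha * y_t + (1 - alpha) * s_{t-1}."""
    y = np.asarray(y, dtype=float)
    level = np.empty_like(y)
    level[0] = y[0]  # initialise the level with the first observation
    for t in range(1, len(y)):
        level[t] = alpha * y[t] + (1 - alpha) * level[t - 1]
    return level

history = [10.0, 12.0, 11.0, 13.0]  # made-up observations
levels = simple_exponential_smoothing(history, alpha=0.5)

# The one-step-ahead forecast is the last smoothed level.
forecast_next = levels[-1]
print(levels, forecast_next)
```

A larger `alpha` makes the forecast react faster to recent observations, while a smaller one
smooths more heavily.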
DATASET DESCRIPTION
The Hourly Energy Consumption dataset from Kaggle provides valuable insights into power
consumption patterns over 16 years (2002-2018). This dataset is sourced from PJM
Interconnection LLC, a regional transmission organization (RTO) in the United States. It
contains hourly power consumption data, measured in megawatts (MW), and offers an
opportunity for time series forecasting and historical trend analysis.
The dataset provides three key pieces of information:
Date: This represents the date of the power consumption measurement, following the
YYYY-MM-DD format.
Time: The time component signifies the hour, minute, and second at which the
power consumption measurement was recorded, using the HH:MM:SS format.
Power Consumption: This provides the hourly power consumption values in megawatts
(MW). These values serve as the target variable for time series forecasting.
In the "PJME_hourly.csv" file analysed in this report, the date and time are stored together in a
single "Datetime" column, and the consumption values are stored in the "PJME_MW" column.
The Hourly Energy Consumption dataset is particularly valuable for forecasting future power
consumption trends. Various time series forecasting methods can be applied to this dataset, such
as ARIMA, Exponential Smoothing, and Prophet models. By leveraging historical data and the
temporal patterns within the dataset, accurate predictions can be made about future power
consumption levels.
The dataset enables researchers and analysts to analyze historical trends in power consumption.
Plotting the data over time or utilizing statistical methods like regression analysis allows for a
deeper understanding of consumption patterns and potential factors influencing them.
However, because the data end in 2018, the dataset may not capture recent trends or changes in
power consumption patterns.
In conclusion, the Hourly Energy Consumption dataset is a valuable resource for researchers,
practitioners, and analysts interested in forecasting power consumption or analyzing historical
trends. While it offers a large and comprehensive dataset with organized information, users
should be mindful of potential accuracy issues and the dataset's limited coverage. Overall, this
dataset serves as a valuable tool for gaining insights into power consumption dynamics and
informing decision-making processes.
DESCRIPTIVE ANALYSIS
The provided code performs a descriptive analysis and time series forecasting using the Hourly
Energy Consumption dataset. Let's break down the analysis and highlight the key steps and
findings:
Data Loading and Exploration:
The code begins by importing the necessary libraries, such as pandas, numpy, matplotlib.pyplot,
seaborn, xgboost, and scikit-learn.
The dataset, stored in the "PJME_hourly.csv" file, is loaded into a pandas DataFrame (df) and
indexed by the "Datetime" column.
Basic exploration of the dataset is conducted, displaying the first few rows using the `head()`
function and plotting the hourly energy usage over the entire dataset using `df.plot()`.
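The loading step can be sketched as follows. The column names "Datetime" and "PJME_MW" come from
this report, while the helper name `load_hourly_csv` and the three sample rows are made up so that
the example runs without the real file.

```python
import io
import pandas as pd

def load_hourly_csv(source):
    """Load an hourly consumption CSV, indexed by its parsed "Datetime" column."""
    df = pd.read_csv(source, index_col="Datetime", parse_dates=["Datetime"])
    return df.sort_index()  # keep the rows in chronological order

# Stand-in for "PJME_hourly.csv": three made-up rows in the assumed layout.
sample_csv = io.StringIO(
    "Datetime,PJME_MW\n"
    "2018-01-01 02:00:00,300.0\n"
    "2018-01-01 00:00:00,100.0\n"
    "2018-01-01 01:00:00,200.0\n"
)
df = load_hourly_csv(sample_csv)
print(df.head())
```

With the real data, the same helper could be called as `load_hourly_csv("PJME_hourly.csv")`,
followed by `df.plot()` to draw the full series.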
Feature Engineering:
The code then creates additional time series features from the index of the
DataFrame. Features such as hour, day of the week, quarter, month, year, day of the year, and week
of the year are added using the `create_features()` function.
Line plots and box plots are then used to explore how energy usage varies by month and by
year.
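A minimal sketch of such a feature function is shown below. The function name `create_features()`
is taken from this report, but its body and the column names are assumptions about how the
features could be derived from a pandas DatetimeIndex.

```python
import pandas as pd

def create_features(df):
    """Return a copy of df with calendar features derived from its DatetimeIndex."""
    df = df.copy()
    df["hour"] = df.index.hour
    df["dayofweek"] = df.index.dayofweek  # Monday = 0, Sunday = 6
    df["quarter"] = df.index.quarter
    df["month"] = df.index.month
    df["year"] = df.index.year
    df["dayofyear"] = df.index.dayofyear
    # ISO week number; converted to a plain integer array before assignment.
    df["weekofyear"] = df.index.isocalendar().week.astype(int).to_numpy()
    return df

# Two made-up rows to show the derived columns.
sample = pd.DataFrame(
    {"PJME_MW": [100.0, 200.0]},
    index=pd.to_datetime(["2018-01-01 05:00:00", "2018-07-04 17:00:00"]),
)
features = create_features(sample)
print(features)
```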
Model Training:
Next, the features (X) and the target variable (y) for the forecasting models are
defined.
The features and target values (energy consumption) are used to train two models, the
XGBoost Regressor and the Random Forest Regressor.
The XGBoost Regressor is trained with specific hyperparameters, such as the number of estimators,
early stopping rounds, objective, and learning rate.
The Random Forest Regressor is trained with different hyperparameters, including the number of
estimators, the maximum depth, the minimum samples per split, and the minimum samples per leaf.
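The training step for the Random Forest Regressor might look like the sketch below. It uses
synthetic data in place of the real series, a reduced feature list, and small placeholder
hyperparameter values chosen for speed, so all numbers and the names `FEATURES`, `TARGET`, and
`rf_model` are assumptions. Only scikit-learn is used here; the XGBoost model is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the hourly data: a daily cycle plus noise (illustrative).
rng = np.random.default_rng(42)
index = pd.date_range("2018-01-01", periods=24 * 60, freq=pd.Timedelta(hours=1))
hours = index.hour.to_numpy()
df = pd.DataFrame(
    {"PJME_MW": 1000 + 200 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 20, len(index))},
    index=index,
)

# Calendar features derived from the index.
df["hour"] = hours
df["dayofweek"] = index.dayofweek.to_numpy()
df["month"] = index.month.to_numpy()
df["dayofyear"] = index.dayofyear.to_numpy()

FEATURES = ["hour", "dayofweek", "month", "dayofyear"]
TARGET = "PJME_MW"
X, y = df[FEATURES], df[TARGET]

# Hyperparameter values are small placeholders, not the report's settings.
rf_model = RandomForestRegressor(
    n_estimators=50,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=0,
)
rf_model.fit(X, y)
train_predictions = rf_model.predict(X)
print(train_predictions[:5])
```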
Forecasting:
Next, the code generates forecasts for the next 10 months using both the XGBoost Regressor and
Random Forest Regressor models.
A DataFrame (next_10_months_df) is created to store the forecasted values, and the models are
used to predict the electricity usage for future periods.
Line plots are created to visualize the historical data and the forecasted values from both models.
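One way to build the future feature frame is sketched below. The name `next_10_months_df` comes
from this report, while the helper `make_future_frame`, the feature column names, and the last
observed timestamp are assumptions made for illustration.

```python
import pandas as pd

def make_future_frame(last_timestamp, months=10):
    """Build an hourly feature frame covering `months` months after last_timestamp."""
    start = last_timestamp + pd.Timedelta(hours=1)
    end = last_timestamp + pd.DateOffset(months=months)
    future_index = pd.date_range(start=start, end=end, freq=pd.Timedelta(hours=1))
    future = pd.DataFrame(index=future_index)
    future["hour"] = future_index.hour
    future["dayofweek"] = future_index.dayofweek
    future["quarter"] = future_index.quarter
    future["month"] = future_index.month
    future["year"] = future_index.year
    future["dayofyear"] = future_index.dayofyear
    return future

# Assumed last observed timestamp; replace with df.index.max() on real data.
last_observed = pd.Timestamp("2018-08-03 00:00:00")
next_10_months_df = make_future_frame(last_observed, months=10)
print(next_10_months_df.head())
```

A fitted model, here called `model`, could then produce the forecast with
`model.predict(next_10_months_df[feature_columns])`, provided `feature_columns` matches the
columns used in training.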
Model Evaluation:
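The XGBoost Regressor and Random Forest Regressor models are evaluated using the `score()`
method, which for regressors returns the coefficient of determination (R²) rather than a
classification accuracy.
The resulting scores, referred to as accuracy in this report, are displayed to compare the
performance of the two models.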
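The sketch below shows what `score()` measures for a regressor and how it relates to the mean
squared error. It assumes a chronological hold-out split on synthetic data; the report does not
state which data the scores were computed on, so the split and all values are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic hourly data with a daily cycle (illustrative only).
rng = np.random.default_rng(1)
index = pd.date_range("2018-01-01", periods=24 * 40, freq=pd.Timedelta(hours=1))
hours = index.hour.to_numpy()
X = pd.DataFrame({"hour": hours, "dayofweek": index.dayofweek.to_numpy()}, index=index)
y = pd.Series(
    1000 + 200 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 20, len(index)),
    index=index,
)

# Chronological split: train on the first 80%, evaluate on the most recent 20%.
split = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

model = RandomForestRegressor(n_estimators=50, min_samples_leaf=5, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

r2 = model.score(X_test, y_test)  # R² on the held-out period
mse = mean_squared_error(y_test, predictions)
rmse = float(np.sqrt(mse))  # same units as the target (MW)
print(f"R2={r2:.3f}  MSE={mse:.1f}  RMSE={rmse:.1f}")
```

Splitting by time rather than at random keeps future observations out of the training set, which
gives a more realistic estimate of forecasting performance.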
FINAL FORECASTING USING RANDOM FOREST REGRESSOR:
Based on the accuracy comparison, the Random Forest Regressor is selected for the final
forecasting.
Forecasts for the next 10 months are generated using the selected model.
The forecasts, including the year, month, and predicted energy consumption, are stored in the
DataFrame "forecast_data."
In summary, the analysis of the Hourly Energy Consumption dataset covers exploratory data analysis,
feature engineering, model training, and time series forecasting. The Random Forest Regressor model is
selected as the preferred model for predicting future energy consumption. The forecasted values
are stored and displayed for further analysis and decision-making processes.
Model Specification
We use the XGBoost Regressor and the Random Forest Regressor as our two machine learning
models. Based on historical data, these models try to forecast how much electricity will be used
over the next 10 months.
The day of the year, hour, day of the week, quarter, month, and year are the features that are used
to construct the feature matrix `X`. The "PJME_MW" column, which denotes the amount of
electricity used in megawatts, is set as the target variable `y`.
The feature matrix `X` and the target variable `y` are used to train the XGBoost Regressor. To
enhance the performance of the model, the hyperparameters of the XGBoost Regressor are
specified, including the number of estimators, early stopping rounds, maximum depth, and
learning rate. The mean squared error (MSE) is used to assess the model, and the accuracy is
shown.
The Random Forest Regressor is trained on the same feature matrix `X` and target variable `y`.
The hyperparameters of the Random Forest Regressor, including the number of estimators,
maximum depth, minimum samples split, and minimum samples leaf, are set to achieve better
accuracy. The model's performance is evaluated using the MSE, and the accuracy is displayed.
Based on the accuracy comparison, the Random Forest Regressor is selected for forecasting the
electricity usage for the next 10 months. The Random Forest Regressor is utilized to predict the
electricity usage using the feature matrix `next_10_months_df`, which consists of the features for
the next 10 months. The predictions from both the XGBoost Regressor and the Random Forest
Regressor are plotted against the historical data using line plots. The plots visualize the
forecasted electricity usage and provide a comparison with the actual historical data.
Model Fitting
Two different machine learning models were fitted to the data: XGBoost Regressor and Random
Forest Regressor. The XGBoost Regressor has 600 decision trees, a maximum depth of 3, and a
learning rate of 0.01. The Random Forest Regressor has 1000 decision trees, a maximum depth
of 30, and a minimum samples split of 30.
Tree ensembles are not trained in epochs: the XGBoost Regressor adds one tree per boosting
round, up to its 600 estimators unless early stopping halts training sooner, while the Random
Forest Regressor fits its 1000 trees independently of one another. The XGBoost Regressor achieved
a `score()` value of 90%, while the Random Forest Regressor achieved 95%. For regressors,
`score()` returns R² rather than a classification accuracy.
Image 1: Accuracy of the XGBoost Regressor and Random Forest Regressor
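The role of the number of trees and early stopping in a boosted model can be sketched as follows.
This example uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost, so that it
depends only on scikit-learn, and it runs on synthetic data. The 600 trees, depth of 3, and
learning rate of 0.01 mirror the values stated above, while the validation fraction and patience
are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: hour of day as the only feature, daily cycle as the target.
rng = np.random.default_rng(0)
hours = np.tile(np.arange(24), 40)
X = hours.reshape(-1, 1)
y = 1000 + 200 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 20, len(hours))

# Stand-in for XGBoost: n_estimators is an upper bound on the boosting rounds.
gbr = GradientBoostingRegressor(
    n_estimators=600,
    max_depth=3,
    learning_rate=0.01,
    validation_fraction=0.2,  # rows held out to monitor the validation loss
    n_iter_no_change=50,      # stop if no improvement for 50 consecutive rounds
    random_state=0,
)
gbr.fit(X, y)

# Number of trees actually fitted (at most 600).
rounds_used = gbr.n_estimators_
print(rounds_used)
```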
RESULT ANALYSIS
Image 2 plots energy use in the PJM Interconnection LLC data from 2002 to 2018. Over these 16
years, energy use has steadily increased. The daily energy consumption varies greatly, with an average
of about 100 megawatts (MW). Seasonal variations also exist, with energy use rising in both
winter and summer.
Image 2: The plot shows that energy usage has increased steadily over the past 16 years.
The coldest months of the year are the winter ones, which last from December to March, when
heating demand drives energy use to a peak. The warmest months of the year are the summer
ones, which run from June to August, when cooling demand drives energy use to a second peak.
Additionally, there is a slight increase in energy use in the spring and autumn.
Image 3: The plot shows that energy usage varies significantly by month.
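A monthly profile of this kind can be computed by grouping the hourly values by calendar month.
The sketch below uses made-up demand values with peaks placed in January and July purely for
illustration; only the column name "PJME_MW" is taken from this report.

```python
import numpy as np
import pandas as pd

# One year of hourly timestamps with made-up demand peaking in January and July.
index = pd.date_range("2017-01-01", "2017-12-31 23:00:00", freq=pd.Timedelta(hours=1))
month = index.month.to_numpy()
demand = 1000 + 150 * np.cos(2 * np.pi * (month - 1) / 6)
df = pd.DataFrame({"PJME_MW": demand}, index=index)

# Average consumption for each calendar month.
monthly_mean = df.groupby(df.index.month)["PJME_MW"].mean()

# The two months with the highest average consumption.
peak_months = sorted(monthly_mean.nlargest(2).index.tolist())
print(monthly_mean.round(1))
print(peak_months)
```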
Residual analysis examines the discrepancies between the observed data and the values predicted
by a model. It can be used to spot any overfitting or underfitting issues that
might exist with a model.
Images 4 and 5 show the historical data together with the forecasts from the XGBoost Regressor
and Random Forest Regressor models. The XGBoost Regressor model does not appear to fit
the data as well as the Random Forest Regressor model, whose residuals (the differences
between the observed data and the predicted values) are smaller.
This suggests that, compared to the XGBoost Regressor model, the Random Forest Regressor
model is more accurate.
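The comparison of residuals can be sketched as follows. The observed values and the two sets of
predictions are hypothetical numbers, and the helper name `residual_summary` is an assumption; the
point is only to show how residuals, their mean, and their root mean squared error are computed.

```python
import numpy as np

def residual_summary(observed, predicted):
    """Return residuals (observed - predicted), their mean, and their RMSE."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = observed - predicted
    bias = float(residuals.mean())
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    return residuals, bias, rmse

observed = [100.0, 110.0, 120.0, 130.0]
model_a = [98.0, 112.0, 119.0, 131.0]  # hypothetical predictions, small errors
model_b = [90.0, 120.0, 110.0, 140.0]  # hypothetical predictions, larger errors

res_a, bias_a, rmse_a = residual_summary(observed, model_a)
res_b, bias_b, rmse_b = residual_summary(observed, model_b)
print(rmse_a, rmse_b)
```

A mean residual far from zero points to systematic over- or under-prediction, while a smaller RMSE
indicates a closer fit to the observed values.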
Image 4: Historical data and the predictions made by the XGBoost Regressor
model.
Image 5: Historical data and the predictions made by the Random Forest
Regressor model.
Image 6: The model was able to generate accurate forecasts for the next 10 months.
The Random Forest Regressor model proved to be a highly effective tool for time series
forecasting, specifically in the context of predicting electricity usage in the PJM East Region.
The model demonstrated its capability to achieve high accuracy on the dataset, which is a crucial
aspect of successful forecasting.
The generated forecasts for the next 10 months provide valuable insights into the future
electricity usage trends in the PJM East Region. These forecasts are derived from a combination
of historical data and current trends, leveraging the patterns observed in the dataset. By
incorporating relevant time series features and utilizing the Random Forest Regressor's
capabilities, the model can make reliable predictions for the upcoming months.
CONCLUSION
This report shows that time series forecasting, particularly using the Random Forest Regressor model, is an
effective tool for predicting electricity usage in the PJM East Region. The insights gained from
accurate forecasts can aid decision-making processes and provide valuable information for
resource planning and optimization. However, it is essential to be aware of the limitations and
potential inaccuracies associated with time series forecasting.
REFERENCES
Natras, R., Soja, B., & Schmidt, M. (2022). Ensemble Machine Learning of Random Forest,
AdaBoost and XGBoost for Vertical Total Electron Content Forecasting. Remote
Sensing, 14(15), 3547.
Link: https://www.mdpi.com/2072-4292/14/15/3547
Meng, D., Xu, J., & Zhao, J. (2021). Analysis and prediction of hand, foot and mouth disease
incidence in China using Random Forest and XGBoost. PLOS ONE, 16(12), e0261629.
Link: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0261629
Tan, C. W., Bergmeir, C., Petitjean, F., & Webb, G. I. (2021). Time series extrinsic
regression: Predicting numeric values from time series data. Data Mining and Knowledge
Discovery, 35, 1032-1060.
Link: https://link.springer.com/article/10.1007/s10618-021-00745-9