Ministry of Education
Humber College
BIA 5302-Machine Learning and Programming 2
Week 06: Regression-Based Forecasting
Dr. Raed Karim
Dr. Salam Ismaeel
Agenda
• Basic Ideas
• Regression-Based Forecasting
✓ Linear Trend
✓ Exponential Trend
✓ Polynomial Trend
✓ Handling Seasonality
• Summary
• Next Week's Midterm
Basic Idea
• Modeling time series data is done for either descriptive or predictive purposes.
• In descriptive modeling, or time series analysis, a time series is modeled to determine its components
in terms of seasonal patterns, trends, relation to external factors, etc.
✓ These can then be used for decision-making and policy formulation.
• In contrast, time series forecasting uses the information in a time series (and perhaps other
information) to forecast the future values of that series.
• The difference between the goals of time series analysis and time series forecasting leads to differences
in the type of methods used and in the modeling process itself.
Basic Idea (cont.)
• Time series forecasting uses the information in a time series to forecast future values of that series.
• Time series analysis models a time series to determine its components in terms of seasonal patterns,
trends, relation to external factors, etc.
• Time series forecasting methods: regression models vs. smoothing (data-driven) models.
• With both types of methods, and in general, a time series can be described in terms of four components: level,
trend, seasonality, and noise.
• Regression-based forecasting fits a linear trend, using time as a predictor
• Modify & use also for non-linear trends
✓ Exponential
✓ Polynomial
• Can also capture seasonality
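To make the four components concrete, here is a minimal sketch that builds a synthetic monthly series by adding a level, a trend, a seasonal pattern, and noise. Every number in it is made up for illustration and has nothing to do with the Amtrak data; the additive combination is also an assumption of the sketch.

```python
import numpy as np

# Hypothetical monthly series: every number below is made up for illustration.
rng = np.random.default_rng(0)
n_months = 48
t = np.arange(1, n_months + 1)                    # time index 1, 2, ..., 48

level = 1700.0                                    # average value of the series
trend = 2.0 * t                                   # steady change from one period to the next
seasonality = 100.0 * np.sin(2 * np.pi * t / 12)  # pattern that repeats every 12 months
noise = rng.normal(0.0, 25.0, size=n_months)      # random variation

# Additive combination of the four components (an assumption of this sketch).
series = level + trend + seasonality + noise
```

A regression-based forecast tries to recover the first three pieces from the observed series and treats whatever is left over as noise.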
Import Required Packages
import math
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
import statsmodels.formula.api as sm
from statsmodels.tsa import tsatools, stattools
# ARIMA moved: the old statsmodels.tsa.arima_model module is gone from recent statsmodels releases
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics import tsaplots
A Model with Trend - Linear Trend
• To create a linear regression model that captures a time series with a global linear trend, set the
outcome variable (Y) to the time series values (or some function of them) and the predictor (X) to a
time index.
Example1: fitting a linear trend to the Amtrak ridership data.
Example1: Ridership on Amtrak Trains Data
• Amtrak, a US railway company, routinely collects data on ridership.
• The data contain a series of monthly ridership between January 1991 and March 2004.
• Here we focus on forecasting future ridership using this monthly series.
# Load the Amtrak data and preview the first rows
# (squeeze=True dropped: the argument was removed in pandas 2.0 and is not needed for a multi-column file)
Amtrak_df = pd.read_csv('Amtrak.csv')
Amtrak_df.head(9)
print(Amtrak_df)
Example1: (cont.)
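# Create column 'Date' that is a date data type
Amtrak_df['Date'] = pd.to_datetime(Amtrak_df.Month, format='%d/%m/%Y')
Amtrak_df.head(9)
# Convert to a time series indexed by date, then plot it (pandas version)
ridership_ts = pd.Series(Amtrak_df.Ridership.values, index=Amtrak_df.Date, name='Ridership')
ridership_ts.plot(ylim=[1300, 2300], legend=False)
plt.xlabel('Year'); plt.ylabel('Ridership (in 000s)')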
Example1: Linear Trend
A linear fit to Amtrak ridership data
(Doesn’t fit too well – more later)
The Regression Model
Ridership Y is a function of time (t) and noise (error = e)
Yt = B0 + B1*t + e
Thus we model 3 of the 4 components:
✓ Level (B0)
✓ Trend* (B1)
✓ Noise (e)
*Our trend model is linear, which we can see from the graph is not a good fit (more later)
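As a rough illustration of how the level (B0), trend (B1), and noise (e) are estimated, the sketch below fits the linear trend model to a synthetic series. It uses NumPy least squares rather than statsmodels so that it runs without the data file; the series, its true coefficients, and all variable names are assumptions made for the example.

```python
import numpy as np

# Synthetic series with a known level (1750) and slope (0.5); values are made up.
rng = np.random.default_rng(1)
t = np.arange(1, 121)                          # time index used as the single predictor
y = 1750.0 + 0.5 * t + rng.normal(0.0, 20.0, size=t.size)

# Design matrix: a column of ones (for B0) and the time index (for B1).
X = np.column_stack([np.ones(t.size), t])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = b0 + b1 * t                           # estimated level + trend
residuals = y - fitted                         # estimate of the noise term e
```

Because the model includes an intercept, the residuals sum to zero up to rounding, which is a quick sanity check on the fit.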
Example1: Regression Model
# load data and convert to time series
Amtrak_df = pd.read_csv('Amtrak.csv')
Amtrak_df['Date'] = pd.to_datetime(Amtrak_df.Month, format='%d/%m/%Y')
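# name the series so that add_trend produces a 'Ridership' column for the formula below
ridership_ts = pd.Series(Amtrak_df.Ridership.values, index=Amtrak_df.Date, name='Ridership')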
# fit a linear trend model to the time series
ridership_df = tsatools.add_trend(ridership_ts, trend='ct')
ridership_lm = sm.ols(formula='Ridership ~ trend', data=ridership_df).fit()
# plot the time series
ax = ridership_ts.plot()
ax.set_xlabel('Time')
ax.set_ylabel('Ridership (in 000s)')
ax.set_ylim(1300, 2300)
ridership_lm.predict(ridership_df).plot(ax=ax)
plt.show()
https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html
Applying the model to partitioned data
# fit a linear model using the training set and predict on the validation set
# (train_df and valid_df are the training and validation partitions of ridership_df)
ridership_lm = sm.ols(formula='Ridership ~ trend', data=train_df).fit()
predict_df = ridership_lm.predict(valid_df)
Figure: Ridership (top panel) and forecast errors (bottom panel). The trend based on training data underestimates the validation period.
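The partitioning idea can be sketched on synthetic data: hold out the most recent periods as a validation set, fit the trend on the earlier periods only, and measure the forecast errors on the held-out part. The series, the 20-period validation length, and the use of RMSE as the error summary are all assumptions of this sketch, not values taken from the Amtrak example.

```python
import numpy as np

# Synthetic series, made up for illustration.
rng = np.random.default_rng(2)
t = np.arange(1, 101)
y = 1500.0 + 3.0 * t + rng.normal(0.0, 15.0, size=t.size)

n_valid = 20                      # hypothetical length of the validation period
n_train = t.size - n_valid        # earlier observations form the training period

t_train, y_train = t[:n_train], y[:n_train]
t_valid, y_valid = t[n_train:], y[n_train:]

# Fit the linear trend on the training period only.
b1, b0 = np.polyfit(t_train, y_train, deg=1)   # polyfit returns the highest power first

# Forecast the validation period and measure the forecast errors.
forecast = b0 + b1 * t_valid
errors = y_valid - forecast
rmse = np.sqrt(np.mean(errors ** 2))
```

The split is by time rather than at random, so the validation period always comes after the training period, as it would in a real forecast.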
Implementation
def singleGraphLayout(ax, ylim, train_df, valid_df):
    ax.set_xlim('1990', '2004-6')
    ax.set_ylim(*ylim)
    ax.set_xlabel('Time')
    one_month = pd.Timedelta('31 days')
    xtrain = (min(train_df.index), max(train_df.index) - one_month)
    xvalid = (min(valid_df.index) + one_month, max(valid_df.index) - one_month)
    xtv = xtrain[1] + 0.5 * (xvalid[0] - xtrain[1])
    ypos = 0.9 * ylim[1] + 0.1 * ylim[0]
    ax.add_line(plt.Line2D(xtrain, (ypos, ypos), color='black', linewidth=0.5))
    ax.add_line(plt.Line2D(xvalid, (ypos, ypos), color='black', linewidth=0.5))
    ax.axvline(x=xtv, ymin=0, ymax=1, color='black', linewidth=0.5)
    ypos = 0.925 * ylim[1] + 0.075 * ylim[0]
    ax.text('1995', ypos, 'Training')
    ax.text('2002-3', ypos, 'Validation')
Implementation (cont.)
def graphLayout(axes, train_df, valid_df):
    singleGraphLayout(axes[0], [1300, 2550], train_df, valid_df)
    singleGraphLayout(axes[1], [-550, 550], train_df, valid_df)
    train_df.plot(y='Ridership', ax=axes[0], color='C0', linewidth=0.75)
    valid_df.plot(y='Ridership', ax=axes[0], color='C0', linestyle='dashed',
                  linewidth=0.75)
    axes[1].axhline(y=0, xmin=0, xmax=1, color='black', linewidth=0.5)
    axes[0].set_xlabel('')
    axes[0].set_ylabel('Ridership (in 000s)')
    axes[1].set_ylabel('Forecast Errors')
    if axes[0].get_legend():
        axes[0].get_legend().remove()
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(9, 7.5))
ridership_lm.predict(train_df).plot(ax=axes[0], color='C1')
ridership_lm.predict(valid_df).plot(ax=axes[0], color='C1', linestyle='dashed')
residual = train_df.Ridership - ridership_lm.predict(train_df)
residual.plot(ax=axes[1], color='C1')
residual = valid_df.Ridership - ridership_lm.predict(valid_df)
residual.plot(ax=axes[1], color='C1', linestyle='dashed')
graphLayout(axes, train_df, valid_df)
plt.tight_layout()
plt.show()
Example1: Summary
Linear model output (training data)
ridership_lm.summary()
Partial output
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   1750.3595     29.073     60.206      0.000    1692.802    1807.917
trend          0.3514      0.407      0.864      0.390      -0.454       1.157
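The columns of the partial output are related by simple arithmetic, which the following sketch reproduces for the trend row. The coefficient and standard error are copied from the output above. The critical value 1.98 is an assumed approximation: the exact value depends on the residual degrees of freedom, which the partial output does not show.

```python
# Values copied from the trend row of the partial output.
coef = 0.3514
std_err = 0.407

# The "t" column is the coefficient divided by its standard error
# (small differences from the table come from rounding of the inputs).
t_stat = coef / std_err

# Assumed approximate 97.5% t quantile; the exact value depends on the
# residual degrees of freedom, which are not shown in the partial output.
t_crit = 1.98
ci_low = coef - t_crit * std_err
ci_high = coef + t_crit * std_err

# An interval that contains zero means the trend is not significant at the 5% level.
includes_zero = ci_low < 0 < ci_high
```

The interval for the trend coefficient spans zero, which agrees with the reported p-value of 0.390: on the training data the linear trend is not statistically distinguishable from no trend.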
Exponential Trend
• Appropriate model when the increase/decrease in the series over time is multiplicative,
e.g., the value at t1 is x% more than at t0, the value at t2 is x% more than at t1, …
• Replace Y with log(Y), then fit a linear regression
log(Yt) = B0 + B1*t + e
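A minimal sketch of the log-transform approach on synthetic data follows. The series is constructed to grow by about 1% per period; that rate, the noise level, and the variable names are assumptions of the example. The key step is the back-transformation: the fitted model predicts log(Y), so the predictions are exponentiated to return to the original units.

```python
import numpy as np

# Synthetic series that grows by about 1% per period; values are made up.
rng = np.random.default_rng(3)
t = np.arange(1, 61)
y = 1000.0 * np.exp(0.01 * t) * np.exp(rng.normal(0.0, 0.02, size=t.size))

# Fit a straight line to log(Y): log(Y) = B0 + B1*t + e
b1, b0 = np.polyfit(t, np.log(y), deg=1)   # polyfit returns the highest power first

# The model predicts log(Y), so exponentiate to return to the original units.
fitted = np.exp(b0 + b1 * t)

# exp(B1) - 1 is the estimated proportional change per period.
growth_rate = np.exp(b1) - 1
```

Reading the slope through exp(B1) - 1 gives the estimated percentage change per period, which is the "x%" in the multiplicative description.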
Example 1: Exponential Trend
Fitting the exponential trend model, making predictions
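# linear trend model (for comparison)
ridership_lm_linear = sm.ols(formula='Ridership ~ trend', data=train_df).fit()
predict_df_linear = ridership_lm_linear.predict(valid_df)
# exponential trend model: regress log(Ridership) on the time index
ridership_lm_expo = sm.ols(formula='np.log(Ridership) ~ trend', data=train_df).fit()
# predictions are on the log scale, so exponentiate to get ridership units
predict_df_expo = np.exp(ridership_lm_expo.predict(valid_df))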
Figure: Exponential trend (green) is very similar to linear trend (orange); neither copes well with an initial period of decline followed by a growth period.
Polynomial Trend
• Add additional predictors as appropriate
• For example, for quadratic relationships add a t^2 predictor
• Fit linear regression using both t and t^2
Example: Fitting a quadratic model
ridership_lm_poly = sm.ols(formula='Ridership ~ trend + np.square(trend)',
data=train_df).fit()
Figure: The quadratic trend does a better job capturing the trend, though it over-forecasts in the validation period.
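The quadratic fit can be sketched on synthetic data with a decline followed by growth. The model stays linear in its coefficients; only the predictors change (t and t^2). The series, its coefficients, and the comparison by sum of squared errors are assumptions made for this illustration.

```python
import numpy as np

# Synthetic U-shaped series: decline followed by growth; values are made up.
rng = np.random.default_rng(4)
t = np.arange(1, 101)
y = 1900.0 - 8.0 * t + 0.08 * t ** 2 + rng.normal(0.0, 10.0, size=t.size)

# Predictors: intercept, t, and t^2. The model is still linear in its coefficients.
X = np.column_stack([np.ones(t.size), t, t ** 2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef

fitted = X @ coef

# Compare with a straight-line fit on the same data.
linear_fit = np.polyval(np.polyfit(t, y, deg=1), t)
sse_quadratic = np.sum((y - fitted) ** 2)
sse_linear = np.sum((y - linear_fit) ** 2)
```

On training data the quadratic model can never fit worse than the straight line, so the comparison that matters for forecasting is the one on the validation period.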
Handling Seasonality
• Seasonality is any recurring cyclical pattern of consistently higher or lower values (daily, weekly,
monthly, quarterly, etc.)
• Handle in regression by adding a categorical variable for the season, e.g., 11 dummies for the month
(using all 12 together with an intercept would make the predictors perfectly collinear)
Adding seasonality
ridership_df = tsatools.add_trend(ridership_ts, trend='c')
ridership_df['Month'] = ridership_df.index.month
# partition the data (nTrain, the number of observations in the training period, must be set beforehand)
train_df = ridership_df[:nTrain]
valid_df = ridership_df[nTrain:]
ridership_lm_season = sm.ols(formula='Ridership ~ C(Month)', data=train_df).fit()
ridership_lm_season.summary()
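To show why 11 dummies are used rather than 12, the sketch below builds the month indicators by hand on synthetic data and checks the rank of the design matrix. The monthly effects, the noise level, and the choice of January as the reference month are assumptions of the example; in the formula above, `C(Month)` handles this encoding automatically.

```python
import numpy as np

# Synthetic monthly series with a fixed effect for each calendar month; values are made up.
rng = np.random.default_rng(5)
n_years = 6
month = np.tile(np.arange(1, 13), n_years)          # 1..12 repeated for each year
month_effect = np.array([0, -40, 30, 35, 60, 20, 90, 140, -10, 40, 30, 50], dtype=float)
y = 1600.0 + month_effect[month - 1] + rng.normal(0.0, 10.0, size=month.size)

# One indicator column per month, then drop January so it becomes the reference level.
all_dummies = (month[:, None] == np.arange(1, 13)).astype(float)   # shape (72, 12)
dummies = all_dummies[:, 1:]                                       # shape (72, 11)

# Intercept plus 11 dummies has full column rank; intercept plus all 12 does not.
X = np.column_stack([np.ones(month.size), dummies])
X_redundant = np.column_stack([np.ones(month.size), all_dummies])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
january_mean = coef[0]            # intercept = estimated January average
august_vs_january = coef[7]       # coefficient on the August dummy
```

Each dummy coefficient is read relative to the dropped month: here the August coefficient estimates how much higher August is than January on average.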
Example 1: Model with Seasonality
Summary
Regression-Based Forecasting:
• Can use linear regression for exponential trends (model log(Y)) and polynomial trends (add powers of t as predictors)
• For seasonality, use a categorical variable (make dummies)
Week 07: Midterm (In-Person Only)