
01

Quantitative Methods
Quantitative Methods 2024 Level II High Yield Notes

LM01 Basics of Multiple Regression and Underlying Assumptions


Uses of multiple linear regression
Multiple regression allows us to determine the effect of more than one independent
variable on a particular dependent variable.
Multiple regression can be used:
• To explain the relationships between financial variables: e.g., the relationship
between inflation, GDP growth rates and interest rates.
• To test existing theories – e.g., whether equity returns are impacted by a stock's market
cap and value/growth factors.
• To make forecasts – e.g., using variables such as financial leverage, profitability,
revenue growth, and changes in market share to predict whether a company will face
financial distress.

The basics of multiple linear regression


A multiple linear regression model has the general form:
Yi = b0 + b1 X1i + b2 X2i + ⋯ + bk Xki + εi , i = 1, 2, …, n
The slope coefficient bj measures how much the dependent variable Y changes when the
independent variable, Xj, changes by one unit holding all other independent variables
constant.
The intercept coefficient b0 represents the expected value of Y if all independent variables
are zero.
For example, consider the following regression equation:
Y = 0.2 + 0.6X1 + 0.5 X2 + ϵ
If X1 changes by 1 unit and X2 remains constant, then Y will change by 0.6 units. Similarly, if
X1 remains constant and X2 changes by 1 unit, then Y will change by 0.5 units. If X1 and X2
are each zero, then the expected value of Y is 0.2.
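For readers who want to experiment, the sketch below (not part of the notes) fits a two-variable regression of this form with Python's statsmodels; the data are simulated and the coefficients are chosen only to mirror the example above.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: Y depends on two explanatory variables plus noise
rng = np.random.default_rng(42)
X1 = rng.normal(size=200)
X2 = rng.normal(size=200)
Y = 0.2 + 0.6 * X1 + 0.5 * X2 + rng.normal(scale=0.3, size=200)

X = sm.add_constant(np.column_stack([X1, X2]))  # adds the intercept column for b0
results = sm.OLS(Y, X).fit()

print(results.params)                    # estimated b0, b1, b2
print(results.rsquared, results.rsquared_adj)
```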

Assumptions underlying multiple linear regression


The five main assumptions underlying multiple regression models are:
1. Linearity: The relationship between the dependent variable and the independent
variables is linear.
2. Homoskedasticity: The variance of the regression residuals is the same for all
observations.
3. Independence of errors: The observations are independent of one another. This
implies the regression residuals are uncorrelated across observations.


4. Normality: The regression residuals are normally distributed.


5. Independence of independent variables:
5a. Independent variables are not random.
5b. There is no exact linear relation between two or more of the independent
variables or combinations of the independent variables.
Commonly used diagnostic plots
Scatterplots of dependent and independent variables are used to check if the assumptions
of ‘linearity’ and ‘independence of independent variables’ have been violated.
Scatterplots of residuals are used to check if the assumptions of ‘homoskedasticity’ and
‘independence of errors’ have been violated.
A ‘Q-Q’ plot of residuals is used to check if the assumption of ‘normality’ has been violated.
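As an illustration only, the following sketch produces the three diagnostic plots on a simulated regression; it assumes statsmodels and matplotlib, which are not prescribed by the notes.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ [0.2, 0.6, 0.5] + rng.normal(scale=0.3, size=200)
results = sm.OLS(y, X).fit()

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(X[:, 1], y)                           # linearity: dependent vs. independent variable
axes[1].scatter(results.fittedvalues, results.resid)  # homoskedasticity / independence of errors
sm.qqplot(results.resid, line="45", fit=True, ax=axes[2])  # normality of residuals (Q-Q plot)
plt.show()
```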


LM02 Evaluating Regression Model Fit and Interpreting Model Results


Goodness of fit
R2 and adjusted R2
R2 measures the percentage of variation in Y that is explained by the independent variables.
In multiple regression, R2 increases as we add new independent variables, even if the
amount of variation explained by them is not statistically significant. Hence, the adjusted R2
is used because it does not automatically increase as independent variables are added to
the model.
AIC and BIC
When evaluating a collection of models that explain the same dependent variable, we
cannot rely on the adjusted R2 alone. Two commonly used statistics for this purpose are
Akaike’s information criterion (AIC) and Schwarz’s Bayesian information criterion (BIC).
These statistics are often provided as part of the regression software output.
Lower values of both measures are better.
When do we prefer one measure over the other?
• AIC is preferred if the model is used for prediction purposes
• BIC is preferred when the best goodness of fit is the goal
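A brief illustrative sketch (simulated data, statsmodels assumed) of comparing two candidate models on adjusted R2, AIC, and BIC; statsmodels exposes these statistics on the fitted results object.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X_all = rng.normal(size=(200, 4))
y = 1.0 + X_all[:, 0] + 0.5 * X_all[:, 1] + rng.normal(size=200)

# Two models explaining the same dependent variable
small = sm.OLS(y, sm.add_constant(X_all[:, :2])).fit()
large = sm.OLS(y, sm.add_constant(X_all)).fit()

for name, res in [("2 variables", small), ("4 variables", large)]:
    print(name, "adj R2:", round(res.rsquared_adj, 3),
          "AIC:", round(res.aic, 1), "BIC:", round(res.bic, 1))
# Lower AIC/BIC is better; BIC penalizes additional variables more heavily.
```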

Testing joint hypothesis for coefficients


Hypothesis tests of a single coefficient
The hypothesis tests of a single coefficient in a multiple regression are identical to those in
a simple regression.
If we are testing simply whether a variable is significant in explaining the dependent
variable’s variation, the hypotheses are:
H0: bj = 0
Ha: bj ≠ 0
The statistical software produces the t-statistics and p-values for a test of each slope
coefficient against zero. A t-statistic whose absolute value exceeds the critical value
indicates that the variable is significant.
Joint F-test
The joint F-test is used to jointly test a subset of variables in a multiple regression, where
the “restricted” model is based on a narrower set of independent variables nested in the
broader “unrestricted” model. The null hypothesis is that the slope coefficients of all
independent variables outside the restricted model are zero.
It is calculated as:


F = [(Sum of squares error of restricted model − Sum of squares error of unrestricted model)/q] /
[Sum of squares error of unrestricted model/(n − k − 1)]
where: q is the number of restrictions, i.e. the number of variables omitted
If the calculated F-stat exceeds the critical F-value, we can conclude that at least one of the
omitted variables is statistically significant.
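For illustration, a sketch of the joint F-test on nested models; it assumes statsmodels, whose compare_f_test method performs this restricted-versus-unrestricted comparison.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = 0.5 + 1.2 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=200)

unrestricted = sm.OLS(y, sm.add_constant(X)).fit()       # all four variables
restricted = sm.OLS(y, sm.add_constant(X[:, :2])).fit()  # q = 2 variables omitted

f_stat, p_value, q = unrestricted.compare_f_test(restricted)
print(f_stat, p_value, q)  # small p-value: at least one omitted coefficient differs from 0
```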
General linear F-test
The general linear F-test is an extension of the joint F-test, where we test the significance of
the whole regression equation. The null hypothesis is that the slope coefficients on all
independent variables are 0, against the alternative that at least one coefficient is different
from 0.
The F-stat is calculated as:
F = mean regression sum of squares/mean squared error = MSR/MSE
If the calculated F-stat exceeds the critical F-value we can conclude that at least one of the
slope coefficients is different from 0.

Forecasting using multiple regression


To forecast the value of a dependent variable using a multiple linear regression model,
follow these three steps:
1. Obtain estimates b̂0, b̂1, b̂2, …, b̂k of the regression parameters b0, b1, b2, …, bk. The "^"
symbol indicates that the values are estimates.
2. Determine the assumed values of the independent variables X̂1i, X̂2i, …, X̂ki.
3. Compute the predicted value of the dependent variable, Ŷi, using the following equation:
Ŷi = b̂0 + b̂1X̂1i + b̂2X̂2i + ⋯ + b̂kX̂ki
The level of uncertainty around the forecast of the dependent variable is called the
standard error of the forecast. This forecast error depends on how well the independent
variables (X1, X2, X3, …) were forecasted (the sampling error), as well as on the model error.
The larger the sampling error, the larger the standard error of the forecast of Y and the
wider the confidence interval.


LM03 Model Misspecification


Misspecified functional form
Whenever we estimate a regression, we must assume that the regression has the correct
functional form. This assumption can fail in several ways as shown in Exhibit 2.
Failures in regression functional form (Exhibit 2):
• Omitted variables: One or more important variables are omitted from the regression.
May lead to heteroskedasticity or serial correlation.
• Inappropriate form of variables: Ignoring a nonlinear relationship between the
dependent and independent variable. May lead to heteroskedasticity.
• Inappropriate variable scaling: One or more regression variables may need to be
transformed before estimating the regression. May lead to heteroskedasticity or
multicollinearity.
• Inappropriate data pooling: The regression model pools data from different samples
that should not be pooled. May lead to heteroskedasticity or serial correlation.

Heteroskedasticity
There are two types of heteroskedasticity:
• Unconditional heteroskedasticity: When heteroskedasticity of the error variance is
not correlated with the independent variables. Unconditional heteroskedasticity is
not a problem.
• Conditional heteroskedasticity: The error variance is correlated with the values of
the independent variables. Conditional heteroskedasticity is a problem. It results in
underestimation of standard errors, so t-statistics are inflated and Type I errors are
more likely.
The figure below illustrates conditional heteroskedasticity.


Conditional heteroskedasticity can be detected using the Breusch–Pagan (BP) test. It can be
corrected by computing ‘robust standard errors’.
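The sketch below is illustrative rather than part of the curriculum: it runs a Breusch–Pagan test on simulated heteroskedastic data with statsmodels and then refits with heteroskedasticity-robust (HC1) standard errors.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
x = rng.uniform(1, 5, size=300)
y = 2 + 3 * x + rng.normal(scale=x, size=300)   # error variance grows with x
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols.resid, X)
print("BP test p-value:", lm_pvalue)            # small p-value: conditional heteroskedasticity

robust = sm.OLS(y, X).fit(cov_type="HC1")       # robust (White-type) standard errors
print(ols.bse, robust.bse)
```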

Serial correlation
In serial correlation, regression errors in one period are correlated with errors from
previous periods. It is often found in time-series regressions.
The following figure demonstrates the presence of serial correlation. When the previous
error term is positive, the next error term is also most likely to be positive, and vice versa.

Consequences:
• If no independent variable is a lagged value of the dependent variable: coefficient
estimates remain valid, but standard error estimates are invalid.
• If an independent variable is a lagged value of the dependent variable: both coefficient
estimates and standard error estimates are invalid.


Serial correlation can be detected using the Breusch–Godfrey (BG) test. It can be corrected
by computing serial-correlation consistent (‘robust’) standard errors.

Multicollinearity
Multicollinearity may occur when two or more independent variables are highly correlated
or when there is an approximate linear relationship among independent variables.
Consequences: The standard errors are inflated, so the t-stats of the coefficients are
artificially small and we may fail to reject the null hypothesis.
Multicollinearity can be detected using the variance inflation factor (VIF). VIF values
above 5 warrant further investigation, while VIF values above 10 indicate serious
multicollinearity problems.
It can be corrected by dropping one or more of the regression variables, using a different
proxy for one of the variables, or increasing the sample size.
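A minimal sketch, assuming statsmodels, of computing VIFs on a deliberately collinear simulated design matrix.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)   # x2 is nearly collinear with x1
x3 = rng.normal(size=300)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF for each independent variable (column 0 is the constant)
for i in range(1, X.shape[1]):
    print(f"VIF for x{i}:", variance_inflation_factor(X, i))
# Values above 5 warrant investigation; values above 10 indicate serious multicollinearity.
```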
Exhibit 14 summarizes the three issues arising from regression assumption violations.
• Homoskedastic error terms violated by heteroskedastic error terms. Issue: biased
estimates of coefficients’ standard errors. Detection: visual inspection of residuals;
Breusch–Pagan test. Correction: revise the model; use robust standard errors.
• Independence of observations violated by serial correlation. Issue: inconsistent
estimates of coefficients and biased standard errors. Detection: Breusch–Godfrey test.
Correction: revise the model; use serial-correlation consistent standard errors.
• Independence of independent variables violated by multicollinearity. Issue: inflated
standard errors. Detection: variance inflation factor. Correction: revise the model;
increase the sample size.


LM04 Extensions of Multiple Regression


Influence analysis
An influential observation is an observation whose inclusion may significantly alter
regression results. Two kinds of observations may potentially influence regression results:
• A high-leverage point: A data point having an extreme value of an independent
variable (X).
• An outlier: A data point having an extreme value of the dependent variable (Y).
Detecting Influential Points
Leverage (hii)
A high-leverage point can be identified using a measure called leverage (hii). Leverage
measures the distance between the value of the ith observation of an independent variable
and the mean value of that variable across all n observations. Leverage ranges from 0 to 1;
the higher the leverage, the more distant the observation’s value is from the mean, and
hence the more influence it can exert on the estimated regression line. Statistical software
packages can easily calculate the leverage measure.
Interpretation:
If leverage hii > 3(k + 1)/n, the observation is potentially influential,
where k is the number of independent variables.
Studentized residual (ti*)
An outlier can be identified using a measure called studentized residuals. Statistical
software packages can calculate and present this measure.
Interpretation:
If |ti*| > 3, flag the observation as an outlier.
If |ti*| exceeds the critical value of the t-statistic with n − k − 2 degrees of freedom, flag the
outlier observation as potentially influential.
Cook’s distance
Cook’s distance, or Cook’s D (Di), is a metric for identifying influential data points. It
measures how much the estimated values of the regression change if observation i is
deleted from the sample.
Interpretation:
If Di > 0.5, the ith observation may be influential and merits further investigation.
If Di > 1.0, the ith observation is highly likely to be an influential data point.
If Di > 2√(k/n), the ith observation is highly likely to be an influential data point.
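For illustration, a sketch that flags potentially influential observations using leverage, studentized residuals, and Cook's distance from statsmodels' influence diagnostics; the data are simulated and the cutoffs follow the rules of thumb above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
x[0], y[0] = 6.0, -20.0                      # plant a high-leverage outlier

results = sm.OLS(y, sm.add_constant(x)).fit()
infl = results.get_influence()

leverage = infl.hat_matrix_diag              # h_ii
student_resid = infl.resid_studentized_external
cooks_d = infl.cooks_distance[0]

k, n = 1, len(y)
flagged = (leverage > 3 * (k + 1) / n) | (np.abs(student_resid) > 3) | (cooks_d > 1.0)
print("Potentially influential observations:", np.where(flagged)[0])
```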

Dummy variables in multiple regression


Dummy (or indicator) variables represent qualitative independent variables. They take on
a value of 1 if a particular condition is true and 0 if that condition is false. To distinguish
among n categories, the model must include n – 1 dummy variables.
An intercept dummy adds to or reduces the original intercept if a specific condition is met.
When the intercept dummy is 1, the regression line shifts up or down parallel to the base
regression line.
Yi = b0 + d0Di + b1Xi + εi.
A slope dummy allows for the slope of the regression line to change if a specific condition is
met.
Yi = b0 + b1Xi + d1DiXi + εi.
It is also possible for a regression model to use both intercept and slope dummy variables.
Yi = b0 + d0Di + b1Xi + d1DiXi + εi.
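A small illustrative sketch (simulated data) of estimating a regression that includes both an intercept dummy and a slope dummy.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=200)
d = (rng.uniform(size=200) > 0.5).astype(float)   # 1 if the condition is true, 0 otherwise
y = 1.0 + 0.5 * d + 2.0 * x + 0.8 * d * x + rng.normal(scale=0.3, size=200)

# Columns: intercept dummy D, X, and slope dummy D*X
X = sm.add_constant(np.column_stack([d, x, d * x]))
results = sm.OLS(y, X).fit()
print(results.params)   # estimates of b0, d0, b1, d1
```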
Exhibit 11, Panel A, shows the effect of an intercept dummy variable.

Exhibit 11, Panel B, shows the effect of a slope dummy variable.


Exhibit 11, Panel C, shows the combined effect of an intercept and slope dummy variable.

Logistic regression model


Qualitative dependent variables are outcome variables describing data that fit into
categories (e.g. bankrupt or not bankrupt). A logistic transformation is applied when the
model contains qualitative dependent variables. The logistic transformation is:
ln[P/(1 − P)]
The logit transformation linearizes the relation between the transformed dependent
variable and the independent variables.


ln(P/(1 − P)) = b0 + b1X1 + b2X2 + b3X3 + ε
The natural logarithm (ln) of the odds of an event happening is the log odds which is also
called the logit function.
Logistic regression coefficients are typically estimated using the maximum likelihood
estimation (MLE) method rather than by least squares.
In a logit model, slope coefficients are interpreted as the change in the log odds that the
event happens per unit change in the independent variable, holding all other independent
variables constant.
A likelihood ratio (LR) test is a method to assess the fit of logistic regression models. The
test is similar to the joint F-test. It compares the fit of the restricted and unrestricted
models.
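An illustrative sketch, assuming statsmodels, of a logit model estimated by maximum likelihood; the llr statistic reported below is statsmodels' likelihood ratio test of the fitted model against an intercept-only model.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = sm.add_constant(rng.normal(size=(500, 2)))
log_odds = X @ [-0.5, 1.2, -0.8]
p = 1 / (1 + np.exp(-log_odds))
y = rng.binomial(1, p)                      # binary outcome, e.g., distressed vs. not distressed

logit = sm.Logit(y, X).fit(disp=0)          # coefficients estimated by MLE
print(logit.params)                         # change in log odds per unit change in each X
print(logit.llr, logit.llr_pvalue)          # likelihood ratio test statistic and p-value
```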


LM05 Time-Series Analysis


Time series
A time series is a set of observations on a variable measured over different time periods. A
time series model allows us to make predictions about the future values of a variable.

Linear vs log-linear trend models


• When the dependent variable changes at a constant amount with time, a linear trend
model is used.
The linear trend equation is given by yt = b0 + b1 t + εt , t = 1, 2, … , T
• When the dependent variable changes at a constant rate (grows exponentially), a log-
linear trend model is used.
The log-linear trend equation is given by ln yt = b0 + b1t + εt, t = 1, 2, …, T
• A limitation of trend models is that, by their nature, they tend to exhibit serial correlation
in the errors, which makes them unreliable.
• The Durbin-Watson statistic is used to test for serial correlation. If this statistic
differs significantly from 2, we conclude that serial correlation is present in the
errors. To overcome this problem, we use autoregressive (AR) time-series models.

Requirements for a time series to be covariance stationary


AR models can only be used for time series that are covariance stationary.
A time-series is covariance stationary if it meets the following three conditions:
• The expected value of the time series (its mean) must be constant and finite in all
periods.
• The variance must be constant and finite in all periods.
• The covariance of the time series with its own past or future values must also be
constant and finite in all periods.
Autoregressive (AR) models
An autoregressive time series model is a linear model that predicts its current value using
its most recent past value as the independent variable. An AR model of order p, denoted by
AR(p) uses p lags of a time series to predict its current value.
xt = b0 + b1 x(t–1) + b2 x(t–2) + … + bp x(t– p) + εt
The chain rule of forecasting is used to predict successive forecasts.
The one-period-ahead forecast of xt from an AR(1) model is x̂t+1 = b̂0 + b̂1xt
x̂t+1 can then be used to forecast the two-period-ahead value: x̂t+2 = b̂0 + b̂1x̂t+1
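A short sketch (simulated AR(1) data, statsmodels' AutoReg assumed) of the chain rule of forecasting.

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(8)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 1.0 + 0.6 * x[t - 1] + rng.normal()   # simulated covariance-stationary AR(1)

res = AutoReg(x, lags=1).fit()
b0, b1 = res.params

x_hat_1 = b0 + b1 * x[-1]        # one-period-ahead forecast
x_hat_2 = b0 + b1 * x_hat_1      # chain rule: two-period-ahead uses the first forecast
print(x_hat_1, x_hat_2)
```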

Autocorrelations of the residuals


• If an AR model has been correctly specified, the residual terms will not exhibit serial
correlation. We cannot use the Durbin–Watson statistic to test for serial correlation in
AR models. Instead, we check the autocorrelations of the residuals.
• The autocorrelations of residuals are the correlations of the residuals with their own
past values. The autocorrelation between one residual and another one at lag k is
known as the kth order autocorrelation.
• If the model is correctly specified, the autocorrelation at all lags must be equal to 0.
• A t-test is used to test whether the error terms in a time series are serially correlated.
Test statistic = residual autocorrelation/standard error
Mean reversion
A time series is said to be mean-reverting if it tends to fall when its level is above its mean
and rise when its level is below its mean. If a time series is covariance stationary, then it
will be mean-reverting.
The mean-reverting level is calculated as:
xt = b0/(1 − b1)

In- sample and out- of- sample forecasts, root mean squared error criterion (RMSE)
There are two types of forecasting errors based on the period used to predict values:
• In-sample forecast errors: these are residuals from a fitted time-series model, used to
predict values within the sample period.
• Out-of-sample forecast errors: these are regression errors from the estimated
model used to predict values outside the sample period. Out-of-sample analysis is a
realistic way of testing the forecasting accuracy of a model and aids in the selection
of a model.
Root mean squared error (RMSE), square root of the average squared forecast error, is
used to compare the out-of-sample forecasting performance of the models. If two models
are being compared, the one with the lower RMSE for out-of-sample forecasts has better
forecast accuracy.

Instability of coefficients
Estimates of regression coefficients of the time-series model can change substantially
across different sample periods used for estimating the model. When selecting a time
period:
• Determine whether economics or environment have changed.
• Look at graphs of the data to see if the time series looks stationary.
Most economic and financial time series data are not stationary.


Random walk
A random walk is a time series in which the value of the series in one period is the value of
the series in the previous period plus an unpredictable random error.
The equation for a random walk without a drift is:
xt = xt−1 + εt
The equation for a random walk with a drift is:
xt = b0 + xt−1 + εt
Random walks do not have a mean-reverting level and are therefore not covariance
stationary. Currency exchange rates are a common example.

Unit root
• For an AR (1) model to be covariance stationary, the absolute value of the lag coefficient
b1 must be less than 1. When the absolute value of b1 is 1, the time series is said to have
a unit root.
• All random walks have unit roots. If the time series has a unit root, then it will not be
covariance stationary.
• A random-walk time series can be transformed into one that is covariance stationary by
first differencing the time series. We define a new variable y as follows:
yt = xt − xt−1 = εt, where E(εt) = 0, E(εt²) = σ², E(εtεs) = 0 if t ≠ s
• We can then use an AR model on the first-differenced series.

Unit root test


We can detect the unit root problem by using the Dickey-Fuller test.
It is a unit root test based on a transformed version of the AR(1) model xt = b0 + b1xt−1 + εt.
Subtracting xt−1 from both sides, we get:
xt − xt−1 = b0 + (b1 − 1)xt−1 + εt, or
xt − xt−1 = b0 + g1xt−1 + εt
where g1 = b1 − 1.
If b1 = 1, then g1 = 0. This means there is a unit root in the model.
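For illustration, a sketch that applies statsmodels' adfuller (an augmented Dickey-Fuller test) to a simulated random walk and to its first difference.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(9)
random_walk = np.cumsum(rng.normal(size=500))      # x_t = x_(t-1) + e_t, has a unit root

stat, pvalue, *rest = adfuller(random_walk)
print("levels p-value:", pvalue)                   # large p-value: cannot reject a unit root

stat, pvalue, *rest = adfuller(np.diff(random_walk))   # first-differenced series
print("first-difference p-value:", pvalue)             # small p-value: unit root rejected
```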

Seasonality
If the error term of a time-series model shows significant serial correlation at seasonal lags,
the time series has significant seasonality. This means there is significant information in the
error terms that is not being captured by the model.
Seasonality can be corrected by including a seasonal lag in the model. For instance, to
correct seasonality in the quarterly time series, modify the AR (1) model to include a
seasonal lag 4:


xt = b0 + b1 x(t–1) + b2 x(t−4) + εt
If the seasonal autocorrelations of the residuals in the revised model are no longer
statistically significant, then the model has been corrected for seasonality.

Autoregressive conditional heteroskedasticity (ARCH)


• If the variance of the error in a time series depends on the variance of previous
errors, then this condition is called autoregressive conditional heteroskedasticity
(ARCH).
• If ARCH exists, the standard errors for the regression parameters will not be correct.
We will have to use generalized least squares or other methods that correct for
heteroskedasticity.
• To test for first-order ARCH, we regress the squared residual on the squared residual
from the previous period: ε̂t² = a0 + a1ε̂t−1² + ut
If the coefficient a1 is statistically significant, the time-series model has ARCH(1)
errors.
• If a time-series model has significant ARCH, then we can predict the next-period error
variance using the formula:
σ̂t+1² = â0 + â1ε̂t²
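A minimal sketch of the ARCH(1) test described above: regress the squared residuals on their own lag and inspect the coefficient. The residual series here is simulated white noise, so no ARCH should be detected.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
resid = rng.normal(size=500)          # residuals from a fitted time-series model (simulated)

sq = resid ** 2
X = sm.add_constant(sq[:-1])          # lagged squared residual
test = sm.OLS(sq[1:], X).fit()

a0, a1 = test.params
print("t-stat on a1:", test.tvalues[1])        # significant a1 would indicate ARCH(1) errors
print("next-period variance forecast:", a0 + a1 * sq[-1])
```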

Working with two time series


If a linear regression is used to model the relationship between two time series, a test such
as the Dickey-Fuller test should be performed to determine whether either time series has
a unit root.
• If neither of the time series has a unit root, then we can safely use linear regression.
• If one of the two time series has a unit root, then we should not use linear
regression.
• If both time series have a unit root and they are cointegrated (exposed to the same
macroeconomic variables), we may safely use linear regression.
• If both time series have a unit root but are not cointegrated, then we cannot use
linear regression.
The Engle–Granger/Dickey–Fuller test is used to determine whether the two time series are cointegrated.

Selecting an appropriate time-series model


Section 16.1 from the curriculum provides a step-by-step guide on selecting an appropriate
time-series model.

1. Understand the investment problem you have, and make an initial choice of model. One
alternative is a regression model that predicts the future behavior of a variable based
on hypothesized causal relationships with other variables. Another is a time-series
model that attempts to predict the future behavior of a variable based on the past
behavior of the same variable.
2. If you have decided to use a time-series model, compile the time series and plot it to see
whether it looks covariance stationary. The plot might show important deviations from
covariance stationarity, including the following:
• a linear trend;
• an exponential trend;
• seasonality; or
• a significant shift in the time series during the sample period (for example, a change
in mean or variance).
3. If you find no significant seasonality or shift in the time series, then perhaps either a
linear trend or an exponential trend will be sufficient to model the time series. In that
case, take the following steps:
• Determine whether a linear or exponential trend seems most reasonable (usually by
plotting the series).
• Estimate the trend.
• Compute the residuals.
• Use the Durbin–Watson statistic to determine whether the residuals have significant
serial correlation. If you find no significant serial correlation in the residuals, then
the trend model is sufficient to capture the dynamics of the time series and you can
use that model for forecasting.
4. If you find significant serial correlation in the residuals from the trend model, use a
more complex model, such as an autoregressive model. First, however, reexamine
whether the time series is covariance stationary. The following is a list of violations of
stationarity, along with potential methods to adjust the time series to make it
covariance stationary:
• If the time series has a linear trend, first-difference the time series.
• If the time series has an exponential trend, take the natural log of the time series
and then first-difference it.
• If the time series shifts significantly during the sample period, estimate different
time-series models before and after the shift.
• If the time series has significant seasonality, include seasonal lags (discussed in Step
7).
5. After you have successfully transformed a raw time series into a covariance-stationary
time series, you can usually model the transformed series with a short autoregression.
To decide which autoregressive model to use, take the following steps:
• Estimate an AR(1) model.
• Test to see whether the residuals from this model have significant serial correlation.


• If you find no significant serial correlation in the residuals, you can use the AR(1)
model to forecast.
6. If you find significant serial correlation in the residuals, use an AR(2) model and test for
significant serial correlation of the residuals of the AR(2) model.
• If you find no significant serial correlation, use the AR(2) model.
• If you find significant serial correlation of the residuals, keep increasing the order of
the AR model until the residual serial correlation is no longer significant.
7. Your next move is to check for seasonality. You can use one of two approaches:
• Graph the data and check for regular seasonal patterns.
• Examine the data to see whether the seasonal autocorrelations of the residuals from
an AR model are significant (for example, the fourth autocorrelation for quarterly
data) and whether the autocorrelations before and after the seasonal
autocorrelations are significant. To correct for seasonality, add seasonal lags to your
AR model. For example, if you are using quarterly data, you might add the fourth lag
of a time series as an additional variable in an AR(1) or an AR(2) model.
8. Next, test whether the residuals have autoregressive conditional heteroskedasticity. To
test for ARCH(1), for example, do the following:
• Regress the squared residual from your time-series model on a lagged value of the
squared residual.
• Test whether the coefficient on the squared lagged residual differs significantly from
0.
• If the coefficient on the squared lagged residual does not differ significantly from 0,
the residuals do not display ARCH and you can rely on the standard errors from
your time-series estimates.
• If the coefficient on the squared lagged residual does differ significantly from 0, use
generalized least squares or other methods to correct for ARCH.
9. Finally, you may also want to perform tests of the model’s out-of-sample forecasting
performance to see how the model’s out-of-sample performance compares to its in-
sample performance.
Using these steps in sequence, you can be reasonably sure that your model is correctly
specified.


LM06 Machine Learning


Supervised machine learning, unsupervised machine learning, and deep learning
Supervised machine learning makes use of labeled training data. It can be divided into
two categories:
• Regression: The target variable is continuous.
• Classification: The target variable is categorical or ordinal.
Unsupervised machine learning does not make use of labelled training data. The ML
program has to discover structure within the data on its own. Two important types of
problems well suited to unsupervised ML are:
• Dimension reduction: Reducing the number of features (X variables).
• Clustering: Sorting observations into groups.
Deep learning refers to sophisticated algorithms which are used for highly complex tasks
such as image classification, face recognition, speech recognition, and natural language
processing.

Overfitting
Overfitting refers to an issue where the model fits training data perfectly but does not work
well with out-of-sample data.
There are two methods to reduce overfitting:
• Preventing the algorithm from getting too complex: This is based on the principle
that the simplest solution often tends to be the correct one.
• Cross-validation: This is based on the principle of avoiding sampling bias. A
commonly used technique is k-fold cross-validation. Here the data is shuffled
randomly and then divided into k equal sub-samples, with k-1 samples used as
training samples and one sample, the kth, used as a validation sample.
The total out-of-sample error can be decomposed into:
• Bias error: refers to the degree to which a model fits the training data. Underfitted
models have high bias errors.
• Variance error: refers to how much the model’s results change in response to new
data. Overfitted models have high variance errors.
• Base error: refers to errors due to randomness in the data.
An overfitted model will have a low bias error but a high variance error.
Generalization refers to the degree to which a model retains its explanatory power when
predicting out-of-sample. A model that generalizes well has low variance error.
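An illustrative sketch of k-fold cross-validation with scikit-learn; the model, data, and scoring metric are placeholders chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(11)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# k = 5: data shuffled and split into 5 folds; each fold serves once as the validation sample
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error")
print("out-of-sample MSE per fold:", -scores)
```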


Supervised machine learning algorithms


In penalized regression, the regression coefficients are chosen to minimize the sum of
squared residuals plus a penalty term that increases with the number of included variables.
So, a feature must make a sufficient contribution to model fit to offset the penalty from
including it. Because of this penalty, the model remains parsimonious and only the most
important variables for explaining Y remain in the model. A popular type of penalized
regression is LASSO.
Support vector machine (SVM) is a linear classifier that aims to seek the optimal
hyperplane – the one that separates the two sets of data points by the maximum margin.
K-nearest neighbor (KNN) classifies a new observation by finding similarities
(“nearness”) between it and its k-nearest neighbors in the existing data set.
Classification and regression tree (CART) can be applied to predict a categorical variable
or a continuous target variable. A binary CART tree is a combination of an initial root node,
decision nodes, and terminal nodes. The root node and each decision node represent a
single feature (f) and a cutoff value (c) for that feature. The CART algorithm iteratively
partitions the data into sub-groups until terminal nodes are formed that contain the
predicted label.
A random forest classifier is a collection of many decision trees generated by a bagging
method or by randomly reducing the number of features available during training.
In ensemble learning, we combine predictions from a collection of models. This method
typically produces more accurate and more stable predictions than the best single model.

Unsupervised machine learning algorithms


Principal components analysis (PCA) is used to reduce highly correlated features into a
few uncorrelated composite variables. A composite variable is a variable that combines two
or more variables that are statistically strongly related to each other.
K-means algorithm repeatedly partitions observations into k non-overlapping clusters.
The number of clusters, k, is a hyperparameter whose value must be set by the researcher
before learning begins. Each cluster is characterized by its centroid and each observation is
assigned to the cluster with the centroid to which that observation is closest.
Hierarchical clustering algorithms create intermediate rounds of clusters in increasing or
decreasing size until a final clustering is reached. Agglomerative clustering (or bottom-up)
hierarchical clustering begins with each observation being treated as its own cluster.
Divisive clustering (or top-down) hierarchical clustering starts with all the observations
belonging to a single cluster.
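A brief sketch, assuming scikit-learn, that chains the two unsupervised steps described above: PCA for dimension reduction followed by k-means clustering on the resulting components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(12)
features = rng.normal(size=(300, 10))            # e.g., security characteristics

# Dimension reduction: keep the first three principal components
pcs = PCA(n_components=3).fit_transform(features)

# Clustering: k is a hyperparameter set before learning begins
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(pcs)
print(np.bincount(labels))                       # number of observations in each cluster
```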


Neural networks, deep learning nets, and reinforcement learning


Neural networks have layers of nodes connected by links. The three types of layers are:
input layer, hidden layer and output layer. Learning takes place in the hidden layer through
improvements in the weights applied to nodes.
Neural networks with many hidden layers (at least 3 but often more than 20 hidden layers)
are known as deep learning nets (DLNs). NNs and DLNs have been successfully applied to
a wide variety of complex tasks characterized by non-linearities and interactions among
features, particularly pattern recognition problems.
A reinforcement learning (RL) algorithm involves an agent that performs actions
that will maximize its reward over time, taking into consideration the constraints of its
environment.


LM07 Big Data Projects


Steps in a Data Analysis Project
Exhibit 1 from the curriculum shows the steps involved in executing a data analysis project.
The steps vary depending on whether we are working with structured data (data in tabular
format) or unstructured data (textual data).

Conceptualization: Define what the output of the model should be (for example, will the
stock price go up or down a week from now), how the model will be used, who will use it,
and how it will become part of the existing business process.
Text problem formulation: Define the text classification problem, identify the exact inputs
and outputs of the model. For example, computing sentiment scores (positive, negative,
neutral) from the text data.
Data Prep & Wrangling
Structured data
For structured data, data preparation and wrangling involve data cleansing and data
preprocessing.
Data cleansing involves resolving:
• Incompleteness errors: Data is missing.
• Invalidity errors: Data is outside a meaningful range.
• Inaccuracy errors: Data is not a measure of true value.
• Inconsistency errors: Data conflicts with the corresponding data points or reality.
• Non-uniformity errors: Data is not present in an identical format.
• Duplication errors: Duplicate observations are present.


Data preprocessing involves performing the following transformations:


• Extraction: A new variable is extracted from the current variable for ease of
analyzing and using for training the ML model.
• Aggregation: Two or more variables are consolidated into a single variable.
• Filtration: Data rows not required for the project are removed.
• Selection: Data columns not required for the project are removed.
• Conversion: The variables in the dataset are converted into appropriate types to
further process and analyze them correctly.
Scaling is the process of adjusting the range of a feature by shifting and changing the
scale of data. Two common methods used for scaling are:
o Normalization: Xi(normalized) = (Xi − Xmin)/(Xmax − Xmin)
o Standardization: Xi(standardized) = (Xi − μ)/σ
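A small illustrative sketch of both scaling methods in plain numpy.

```python
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0, 85.0])

normalized = (x - x.min()) / (x.max() - x.min())   # rescales values to the [0, 1] range
standardized = (x - x.mean()) / x.std()            # centers at 0 with unit standard deviation
print(normalized)
print(standardized)
```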

Unstructured data
For unstructured data, data preparation and wrangling involve a set of text-specific
cleansing and preprocessing tasks.
Text cleansing involves removing the following unnecessary elements from the raw text:
• HTML tags
• Most punctuation
• Most numbers
• White spaces
Text preprocessing involves performing the following transformations:
• Tokenization: the process of splitting a given text into separate tokens where each
token is equivalent to a word.
• Normalization: The normalization process involves the following actions:
o Lowercasing – Removes differences among the same words due to upper and
lower cases.
o Removing stop words – Stop words are commonly used words such as ‘the’,
‘is’, and ‘a’.
o Stemming – Converting inflected forms of a word into its base word.
o Lemmatization – Converting inflected forms of a word into its morphological
root (known as lemma). Lemmatization is a more sophisticated approach as
compared to stemming and is difficult and expensive to perform.
• Creating bag-of-words (BOW): It is a collection of distinct set of tokens that does not
capture the position or sequence of the words in the text.
• Organizing the BOW into a Document term matrix (DTM): It is a table, where each
row of the matrix belongs to a document (or text file), and each column represents a
token (or term). The number of rows is equal to the number of documents in the
sample dataset. The number of columns is equal to the number of tokens in the final
BOW. The cells contain the counts of the number of times a token is present in each
document.
• N-grams and n-gram BOW: In some cases, a sequence of words may convey more
meaning than individual words. An n-gram is a representation of word sequences. The
length of a sequence varies from 1 to n. A one-word sequence is a unigram; a two-
word sequence is a bigram; a three-word sequence is a trigram; and so on.
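For illustration, a sketch that builds a unigram-plus-bigram bag-of-words and the corresponding document term matrix with scikit-learn's CountVectorizer; the documents are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["revenue growth was strong",
        "revenue declined and margins were weak",
        "strong margins and strong growth"]

# Unigram and bigram bag-of-words; stop words are removed during preprocessing
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
dtm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the tokens (columns of the DTM)
print(dtm.toarray())                        # rows = documents, cells = token counts
```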

Data Exploration
Exploratory data analysis (EDA) is a preliminary step in data exploration that involves
summarizing and observing data. The objectives of EDA include:
• Serving as a communication medium among project stakeholders
• Understanding data properties
• Finding patterns and relationships in data
• Inspecting basic questions and hypotheses
• Documenting data distributions and other characteristics
• Planning modeling strategies for the next steps
Visualization techniques for EDA include: histograms, box plots, scatterplots, word clouds,
etc.
In general, EDA helps identify the general trends in the data as well as the relationships
between data. These relationships and trends can be used for feature selection and feature
engineering.
Feature selection is the process of selecting only the relevant features from the data set to
reduce model complexity. Feature selection methods used for text data include:
• Term frequency (TF): Features with very high or very low term frequencies are
removed. These represent noisy features.
• Document frequency (DF): The DF of a token is defined as the number of documents
(texts) that contain the respective token divided by the total number of documents.
This measure also helps in identifying and removing noisy features.
• Chi-square test: This test allows us to rank tokens by their usefulness to each class
in text classification problems. Features with high scores can be retained whereas
features with low scores can be removed.
• Mutual information measure: This measure tells us how much information a token
contributes to a class of texts. The MI score ranges from 0 to 1. A score close to 0
indicates that the token is not very useful and can be removed, whereas a score close
to 1 indicates that the token is closely associated with a particular class and should be
retained.
Feature engineering is the process of creating new features by changing or transforming
existing features. Feature engineering for text data includes:


• Converting numbers into tokens: To preserve the meaning conveyed by numbers of
different lengths, they are converted into different tokens. For example, four-digit
numbers are replaced with “/number4/” and 10-digit numbers with “/number10/”.
• Creating n-grams: Sometimes multi-word sequences can carry more meaning than
an individual word. N-grams are used to keep the connection between words intact.
• Using named entity recognition (NER): NER is an algorithm that identifies the names
of organizations, dates, money amounts, times, etc., and tags the corresponding tokens
with these names.
• Using parts of speech (POS): Similar to NER, POS algorithms allow us to tag each
token with the corresponding part of speech, such as noun, verb, adjective, or proper
noun.

Model Training
Model training consists of three major tasks: method selection, performance evaluation,
and model tuning.
Method selection: This decision is based on the following factors:
• Whether the data project involves labeled data (supervised learning) or unlabeled
data (unsupervised learning)
• Type of data: numerical, continuous, or categorical; text data; image data; speech
data; etc.
• Size of the dataset
Performance evaluation: Commonly used techniques are:
• Error analysis using confusion matrix: A confusion matrix is created with four
categories - true positives, false positives, true negatives and false negatives.
The following metrics are used to evaluate a confusion matrix:
Precision (P) = TP/(TP + FP)
Recall (R) = TP/(TP + FN)
Accuracy = (TP + TN)/(TP + FP + TN + FN)
F1 score = (2 * P * R)/(P + R)
The higher the accuracy and the F1 score, the better the model performance.
• Receiver operating characteristic (ROC): ROC curves and the area under the curve
(AUC) of various models are calculated and compared. An AUC close to 1 indicates a
perfect model, whereas an AUC of 0.5 indicates random guessing. In other words, a
more convex curve indicates better model performance.
• Calculating root mean squared error (RMSE): The root mean squared error is
computed by finding the square root of the mean of the squared differences
between the actual values and the model’s predicted values (error). The model with
the smallest RMSE is the most accurate model.
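A short sketch, assuming scikit-learn, that computes the metrics above from hypothetical predictions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])                 # actual classes
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])                 # predicted classes
y_prob = np.array([.9, .2, .8, .4, .3, .6, .7, .1, .95, .25])     # predicted probabilities

print("precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("accuracy: ", accuracy_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_prob))       # area under the ROC curve
```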


Model tuning
• Involves managing the trade-off between model bias error (which is associated with
underfitting) and model variance error (which is associated with overfitting).
• ‘Grid search’ is a method of systematically training an ML model by using different
hyperparameter values to determine which values lead to best model performance.
• A fitting curve of in-sample error and out-of-sample error on the y-axis versus
model complexity on the x-axis is useful for managing the trade-off between bias
and variance errors.

Evaluating the Fit of a ML algorithm


Fitting describes the degree to which an ML model can be generalized to new data.
• Underfit: Model does not fit the training data well.
• Overfit: Model is complex and fits the training data too well. It is unlikely to perform
well on out-of-sample data.
• Good fit: Model fits the training data well and yet is simple enough so that it can
generalize well to out-of-sample data.

© IFT. All rights reserved.