Quantitative Methods 2024 Level II High Yield Notes
Heteroskedasticity
There are two types of heteroskedasticity:
• Unconditional heteroskedasticity: The error variance is not correlated with the
independent variables. Unconditional heteroskedasticity is not a problem.
• Conditional heteroskedasticity: The error variance is correlated with the values of
the independent variables. Conditional heteroskedasticity is a problem. It results in
underestimation of standard errors, so t-statistics are inflated and Type I errors are
more likely.
The figure below illustrates conditional heteroskedasticity.
Conditional heteroskedasticity can be detected using the Breusch–Pagan (BP) test. It can be
corrected by computing ‘robust standard errors’.
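For illustration, here is a minimal Python sketch (assuming statsmodels and NumPy are available) that simulates a regression with conditionally heteroskedastic errors, runs the Breusch–Pagan test, and refits with heteroskedasticity-robust (HC1) standard errors. The data, coefficients, and seed are hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Hypothetical data: error variance grows with the first regressor
rng = np.random.default_rng(42)
x = rng.normal(size=(200, 2))
y = 1.0 + x @ np.array([0.5, -0.3]) + rng.normal(size=200) * (1 + np.abs(x[:, 0]))

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

# Breusch-Pagan test: regresses squared residuals on the independent variables
lm_stat, lm_pvalue, _, _ = het_breuschpagan(ols.resid, X)
print(f"BP LM statistic = {lm_stat:.2f}, p-value = {lm_pvalue:.4f}")

# Correction: refit with heteroskedasticity-robust standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")
print(robust.bse)  # robust standard errors
```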
Serial correlation
In serial correlation, regression errors in one period are correlated with errors from
previous periods. It is often found in time-series regressions.
The following figure demonstrates the presence of serial correlation: when the previous
error term is positive, the next error term is also most likely positive, and vice versa.
Consequences:
Independent variable is lagged value of dependent variable | Invalid coefficient estimates | Invalid standard error estimates
No | No | Yes
Yes | Yes | Yes
Serial correlation can be detected using the Breusch–Godfrey (BG) test. It can be corrected by
computing serial-correlation consistent standard errors (also called robust standard errors).
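As a rough sketch (again assuming statsmodels), the Breusch–Godfrey test can be applied to a fitted OLS model, with Newey–West (HAC) standard errors as the serial-correlation consistent correction. The AR(1) error process and the lag choice below are illustrative only.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# Hypothetical regression whose errors follow an AR(1) process
rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal()  # positive errors tend to follow positive errors
y = 2.0 + 0.5 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()

# Breusch-Godfrey test for serial correlation up to lag 1
lm_stat, lm_pvalue, _, _ = acorr_breusch_godfrey(fit, nlags=1)
print(f"BG LM statistic = {lm_stat:.2f}, p-value = {lm_pvalue:.4f}")

# Correction: serial-correlation consistent (Newey-West / HAC) standard errors
hac = sm.OLS(y, sm.add_constant(x)).fit(cov_type="HAC", cov_kwds={"maxlags": 1})
print(hac.bse)
```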
Multicollinearity
Multicollinearity may occur when two or more independent variables are highly correlated
or when there is an approximate linear relationship among independent variables.
Consequences: The standard errors are inflated, so the t-statistics of the coefficients are
artificially small, making it harder to reject the null hypothesis that a coefficient equals zero
(Type II errors are more likely).
Multicollinearity can be detected using the variance inflation factor (VIF). VIF values
above 5 warrant further investigation, while VIF values above 10 indicate serious
multicollinearity problems.
It can be corrected by dropping one or more of the regression variables, using a different
proxy for one of the variables, or increasing the sample size.
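A minimal sketch of the VIF calculation, assuming statsmodels; x2 is deliberately constructed to be nearly collinear with x1, so its VIF should come out large. The data are hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical regressors: x2 is almost a linear function of x1
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF for each independent variable (column 0 is the constant, so start at 1)
for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(f"VIF({name}) = {variance_inflation_factor(X, i):.1f}")
# Values above 5 warrant further investigation; values above 10 signal serious multicollinearity
```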
Exhibit 14 summarizes the three issues arising from regression assumption violations.
Assumption | Violation | Issue | Detection | Correction
Homoskedastic error terms | Heteroskedastic error terms | Biased estimates of coefficients' standard errors | Visual inspection of residuals; Breusch–Pagan test | Revise model; use robust standard errors
Independence of observations | Serial correlation | Inconsistent estimates of coefficients and biased standard errors | Breusch–Godfrey test | Revise model; use serial-correlation consistent standard errors
Independence of independent variables | Multicollinearity | Inflated standard errors | Variance inflation factor | Revise model; increase sample size
Influential observations can be identified using Cook's distance (Di):
Di > 1.0: the ith observation is highly likely to be an influential data point.
Di > 2√(k/n): the ith observation is highly likely to be an influential data point (k = number of independent variables, n = number of observations).
Exhibit 11, Panel C, shows the combined effect of an intercept dummy and a slope dummy variable.
ln(P / (1 − P)) = b0 + b1X1 + b2X2 + b3X3 + ε
The natural logarithm (ln) of the odds of an event happening is the log odds, which is also
called the logit function.
Logistic regression coefficients are typically estimated using the maximum likelihood
estimation (MLE) method rather than by least squares.
In a logit model, slope coefficients are interpreted as the change in the log odds that the
event happens per unit change in the independent variable, holding all other independent
variables constant.
A likelihood ratio (LR) test is a method to assess the fit of logistic regression models. The
test is similar to the joint F-test. It compares the fit of the restricted and unrestricted
models.
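The sketch below, assuming statsmodels and SciPy, fits a logit model by MLE on hypothetical data and runs a likelihood ratio test of an unrestricted model (three regressors) against a restricted model (two regressors). The data-generating process and the single restriction are made up for illustration.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

# Hypothetical binary outcome driven by the first two of three candidate variables
rng = np.random.default_rng(1)
n = 500
X_full = rng.normal(size=(n, 3))
logit_p = -0.5 + 1.2 * X_full[:, 0] - 0.8 * X_full[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

unrestricted = sm.Logit(y, sm.add_constant(X_full)).fit(disp=False)
restricted = sm.Logit(y, sm.add_constant(X_full[:, :2])).fit(disp=False)

# Slope coefficients: change in the log odds per unit change in X, other variables held constant
print(unrestricted.params)

# Likelihood ratio test: LR = 2 * (logL_unrestricted - logL_restricted), chi-square with 1 df here
lr_stat = 2 * (unrestricted.llf - restricted.llf)
print(f"LR statistic = {lr_stat:.2f}, p-value = {chi2.sf(lr_stat, df=1):.4f}")
```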
Autoregressive (AR) models
We cannot use the Durbin–Watson statistic to test for serial correlation in AR models.
Instead, we check the autocorrelations of the residuals.
• The autocorrelations of residuals are the correlations of the residuals with their own
past values. The autocorrelation between one residual and another one at lag k is
known as the kth order autocorrelation.
• If the model is correctly specified, the autocorrelation at all lags must be equal to 0.
• A t-test is used to test whether the error terms in a time series are serially correlated.
t-statistic = residual autocorrelation / standard error of the residual autocorrelation,
where the standard error equals 1/√T (T = number of observations).
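A small sketch of this t-test, assuming statsmodels; the AR(1) series is simulated, and the standard error of each residual autocorrelation is taken as 1/√T. All numbers are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical AR(1) series: x_t = 1.0 + 0.6 * x_{t-1} + e_t
rng = np.random.default_rng(3)
T = 200
x = np.zeros(T)
for t in range(1, T):
    x[t] = 1.0 + 0.6 * x[t - 1] + rng.normal()

# Fit the AR(1) model by regressing x_t on x_{t-1}
fit = sm.OLS(x[1:], sm.add_constant(x[:-1])).fit()
resid = fit.resid

def autocorr(e, k):
    # kth-order autocorrelation of the residuals
    return np.corrcoef(e[k:], e[:-k])[0, 1]

# t-statistic = residual autocorrelation / (1 / sqrt(T))
for k in (1, 2, 3, 4):
    rho = autocorr(resid, k)
    print(f"lag {k}: autocorrelation = {rho:+.3f}, t-stat = {rho * np.sqrt(len(resid)):+.2f}")
```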
Mean reversion
A time series is said to be mean-reverting if it tends to fall when its level is above its mean
and rise when its level is below its mean. If a time series is covariance stationary, then it
will be mean-reverting.
The mean-reverting level is calculated as:
Mean-reverting level: xt = b0 / (1 − b1)
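For example, with hypothetical estimates b0 = 2.0 and b1 = 0.6, the mean-reverting level is 2.0 / (1 − 0.6) = 5.0: the series is expected to fall when xt is above 5.0 and rise when xt is below 5.0.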
In-sample and out-of-sample forecasts, root mean squared error criterion (RMSE)
There are two types of forecasting errors based on the period used to predict values:
• In-sample forecast errors: these are residuals from a fitted series model used to
predict values within the sample period.
• Out-of-sample forecast errors: these are regression errors from the estimated
model used to predict values outside the sample period. Out-of-sample analysis is a
realistic way of testing the forecasting accuracy of a model and aids in the selection
of a model.
Root mean squared error (RMSE), square root of the average squared forecast error, is
used to compare the out-of-sample forecasting performance of the models. If two models
are being compared, the one with the lower RMSE for out-of-sample forecasts has better
forecast accuracy.
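A quick sketch of the RMSE comparison in Python, using made-up out-of-sample actuals and forecasts from two hypothetical models:

```python
import numpy as np

# Hypothetical out-of-sample actual values and forecasts from two competing models
actual = np.array([2.1, 1.8, 2.4, 2.0, 1.9])
model_a = np.array([2.0, 1.9, 2.2, 2.1, 1.8])
model_b = np.array([2.4, 1.5, 2.8, 1.6, 2.3])

def rmse(y, y_hat):
    # Square root of the average squared forecast error
    return np.sqrt(np.mean((y - y_hat) ** 2))

print(f"RMSE model A = {rmse(actual, model_a):.3f}")
print(f"RMSE model B = {rmse(actual, model_b):.3f}")  # the lower RMSE wins on forecast accuracy
```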
Instability of coefficients
Estimates of regression coefficients of the time-series model can change substantially
across different sample periods used for estimating the model. When selecting a time
period:
• Determine whether the economic or market environment has changed.
• Look at graphs of the data to see if the time series looks stationary.
Most economic and financial time series data are not stationary.
Random walk
A random walk is a time series in which the value of the series in one period is the value of
the series in the previous period plus an unpredictable random error.
The equation for a random walk without a drift is:
xt = xt−1 + εt
The equation for a random walk with a drift is:
xt = b0 + xt−1 + εt
Random walks do not have a mean-reverting level and are therefore not covariance
stationary. Currency exchange rates are a common example.
Unit root
• For an AR (1) model to be covariance stationary, the absolute value of the lag coefficient
b1 must be less than 1. When the absolute value of b1 is 1, the time series is said to have
a unit root.
• All random walks have unit roots. If the time series has a unit root, then it will not be
covariance stationary.
• A random-walk time series can be transformed into one that is covariance stationary by
first differencing the time series. We define a new variable y as follows:
yt = xt − xt−1 = εt, where E(εt) = 0, E(εt²) = σ², and E(εtεs) = 0 for t ≠ s
• We can then use an AR model on the first-differenced series.
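The sketch below simulates a random walk with drift and applies the Augmented Dickey–Fuller test (the standard statistical test for a unit root) before and after first differencing. It assumes statsmodels; the drift and seed are hypothetical.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Hypothetical random walk with drift: x_t = 0.1 + x_{t-1} + e_t
rng = np.random.default_rng(5)
x = np.cumsum(0.1 + rng.normal(size=500))

# Augmented Dickey-Fuller test: the null hypothesis is that the series has a unit root
adf_stat, p_value, *_ = adfuller(x)
print(f"levels:       ADF = {adf_stat:.2f}, p-value = {p_value:.3f}")  # fail to reject -> unit root

# First difference: y_t = x_t - x_{t-1}, which should be covariance stationary
y = np.diff(x)
adf_stat, p_value, *_ = adfuller(y)
print(f"differences:  ADF = {adf_stat:.2f}, p-value = {p_value:.3f}")  # reject -> stationary
```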
Seasonality
If the error term of a time-series model shows significant serial correlation at seasonal lags,
the time series has significant seasonality. This means the error terms contain information
that the model is not yet capturing.
Seasonality can be corrected by including a seasonal lag in the model. For instance, to
correct seasonality in the quarterly time series, modify the AR (1) model to include a
seasonal lag 4:
xt = b0 + b1xt−1 + b2xt−4 + εt
If the seasonal autocorrelation of the residuals in the revised model is no longer statistically
significant, the model has been corrected for seasonality.
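A minimal sketch (assuming statsmodels) of estimating an AR(1) model with a seasonal lag 4 on a simulated quarterly series; the coefficients used to generate the data are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical quarterly series with an AR(1) component plus a seasonal (lag 4) effect
rng = np.random.default_rng(11)
T = 200
x = np.zeros(T)
for t in range(4, T):
    x[t] = 0.3 + 0.5 * x[t - 1] + 0.3 * x[t - 4] + rng.normal(scale=0.5)

# Regress x_t on x_{t-1} and x_{t-4}
y = x[4:]
X = sm.add_constant(np.column_stack([x[3:-1], x[:-4]]))
fit = sm.OLS(y, X).fit()
print(fit.params)  # b0, b1 (lag 1), b2 (seasonal lag 4)
```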
Steps in time-series forecasting
1. Understand the investment problem you have, and make an initial choice of model. One
alternative is a regression model that predicts the future behavior of a variable based
on hypothesized causal relationships with other variables. Another is a time-series
model that attempts to predict the future behavior of a variable based on the past
behavior of the same variable.
2. If you have decided to use a time-series model, compile the time series and plot it to see
whether it looks covariance stationary. The plot might show important deviations from
covariance stationarity, including the following:
• a linear trend;
• an exponential trend;
• seasonality; or
• a significant shift in the time series during the sample period (for example, a change
in mean or variance).
3. If you find no significant seasonality or shift in the time series, then perhaps either a
linear trend or an exponential trend will be sufficient to model the time series. In that
case, take the following steps:
• Determine whether a linear or exponential trend seems most reasonable (usually by
plotting the series).
• Estimate the trend.
• Compute the residuals.
• Use the Durbin–Watson statistic to determine whether the residuals have significant
serial correlation. If you find no significant serial correlation in the residuals, then
the trend model is sufficient to capture the dynamics of the time series and you can
use that model for forecasting.
4. If you find significant serial correlation in the residuals from the trend model, use a
more complex model, such as an autoregressive model. First, however, reexamine
whether the time series is covariance stationary. The following is a list of violations of
stationarity, along with potential methods to adjust the time series to make it
covariance stationary:
• If the time series has a linear trend, first-difference the time series.
• If the time series has an exponential trend, take the natural log of the time series
and then first-difference it.
• If the time series shifts significantly during the sample period, estimate different
time-series models before and after the shift.
• If the time series has significant seasonality, include seasonal lags (discussed in Step
7).
5. After you have successfully transformed a raw time series into a covariance-stationary
time series, you can usually model the transformed series with a short autoregression.
To decide which autoregressive model to use, take the following steps:
• Estimate an AR(1) model.
• Test to see whether the residuals from this model have significant serial correlation.
• If you find no significant serial correlation in the residuals, you can use the AR(1)
model to forecast.
6. If you find significant serial correlation in the residuals, use an AR(2) model and test for
significant serial correlation of the residuals of the AR(2) model.
• If you find no significant serial correlation, use the AR(2) model.
• If you find significant serial correlation of the residuals, keep increasing the order of
the AR model until the residual serial correlation is no longer significant.
7. Your next move is to check for seasonality. You can use one of two approaches:
• Graph the data and check for regular seasonal patterns.
• Examine the data to see whether the seasonal autocorrelations of the residuals from
an AR model are significant (for example, the fourth autocorrelation for quarterly
data) and whether the autocorrelations before and after the seasonal
autocorrelations are significant. To correct for seasonality, add seasonal lags to your
AR model. For example, if you are using quarterly data, you might add the fourth lag
of a time series as an additional variable in an AR(1) or an AR(2) model.
8. Next, test whether the residuals have autoregressive conditional heteroskedasticity. To
test for ARCH(1), for example, do the following (a short sketch of this regression appears after the list):
• Regress the squared residual from your time-series model on a lagged value of the
squared residual.
• Test whether the coefficient on the squared lagged residual differs significantly from
0.
• If the coefficient on the squared lagged residual does not differ significantly from 0,
the residuals do not display ARCH and you can rely on the standard errors from
your time-series estimates.
• If the coefficient on the squared lagged residual does differ significantly from 0, use
generalized least squares or other methods to correct for ARCH.
9. Finally, you may also want to perform tests of the model’s out-of-sample forecasting
performance to see how the model’s out-of-sample performance compares to its in-
sample performance.
Using these steps in sequence, you can be reasonably sure that your model is correctly
specified.
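As a companion to step 8, here is a rough Python sketch of the ARCH(1) test, regressing squared residuals on their own first lag; the residual series is simulated with deliberate volatility clustering, and all numbers are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical residuals from a time-series model, simulated with ARCH(1) volatility clustering
rng = np.random.default_rng(13)
T = 500
resid = np.zeros(T)
for t in range(1, T):
    sigma2 = 0.5 + 0.6 * resid[t - 1] ** 2  # conditional variance depends on the last squared error
    resid[t] = rng.normal(scale=np.sqrt(sigma2))

# ARCH(1) test: regress the squared residual on its first lag
sq = resid ** 2
fit = sm.OLS(sq[1:], sm.add_constant(sq[:-1])).fit()
print(f"coefficient on lagged squared residual = {fit.params[1]:.3f}, t-stat = {fit.tvalues[1]:.2f}")
# A coefficient significantly different from 0 signals ARCH; otherwise the standard errors can be relied on
```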
Overfitting
Overfitting refers to an issue where the model fits training data perfectly but does not work
well with out-of-sample data.
There are two methods to reduce overfitting:
• Preventing the algorithm from getting too complex: This is based on the principle
that the simplest solution often tends to be the correct one.
• Cross-validation: This is based on the principle of avoiding sampling bias. A
commonly used technique is k-fold cross-validation. Here the data is shuffled
randomly and then divided into k equal sub-samples, with k-1 samples used as
training samples and one sample, the kth, used as a validation sample.
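A minimal k-fold cross-validation sketch, assuming scikit-learn; the dataset, the logistic-regression learner, and the choice of k = 5 are arbitrary illustrations, not prescriptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical labeled dataset
rng = np.random.default_rng(21)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=300) > 0).astype(int)

# Shuffle the data, split it into k = 5 folds, train on k-1 folds and validate on the remaining fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)
print(f"mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f}")
```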
The total out-of-sample error can be decomposed into:
• Bias error: refers to the degree to which a model fits the training data. Underfitted
models have high bias errors.
• Variance error: refers to how much the model's results change in response to new
data. Overfitted models have high variance errors.
• Base error: refers to errors due to randomness in the data.
An overfitted model will have a low bias error but a high variance error.
Generalization refers to the degree to which a model retains its explanatory power when
predicting out-of-sample. A model that generalizes well has low variance error.
Conceptualization: Define what the output of the model should be (for example, whether
stock prices will go up or down a week from now), how the model will be used, who will use
it, and how it will become part of the existing business process.
Text problem formulation: Define the text classification problem and identify the exact
inputs and outputs of the model, for example, computing sentiment scores (positive,
negative, neutral) from the text data.
Data Prep & Wrangling
Structured data
For structured data, data preparation and wrangling involve data cleansing and data
preprocessing.
Data cleansing involves resolving:
• Incompleteness errors: Data is missing.
• Invalidity errors: Data is outside a meaningful range.
• Inaccuracy errors: Data is not a measure of true value.
• Inconsistency errors: Data conflicts with the corresponding data points or reality.
• Non-uniformity errors: Data is not present in an identical format.
• Duplication errors: Duplicate observations are present.
Unstructured data
For unstructured data, data preparation and wrangling involve a set of text-specific
cleansing and preprocessing tasks.
Text cleansing involves removing the following unnecessary elements from the raw text:
• HTML tags
• Most punctuation
• Most numbers
• White spaces
Text preprocessing involves performing the following transformations:
• Tokenization: the process of splitting a given text into separate tokens where each
token is equivalent to a word.
• Normalization: The normalization process involves the following actions:
o Lowercasing – Removes differences among the same words due to upper and
lower cases.
o Removing stop words – Stop words are commonly used words such as ‘the’, ‘is’, and ‘a’.
o Stemming – Converting inflected forms of a word into its base word.
o Lemmatization – Converting inflected forms of a word into its morphological
root (known as lemma). Lemmatization is a more sophisticated approach as
compared to stemming and is difficult and expensive to perform.
• Creating a bag-of-words (BOW): A BOW is the collection of distinct tokens from all the
texts in the sample; it does not capture the position or sequence of the words in the text.
• Organizing the BOW into a Document term matrix (DTM): It is a table, where each
row of the matrix belongs to a document (or text file), and each column represents a
token (or term). The number of rows is equal to the number of documents in the
sample dataset. The number of columns is equal to the number of tokens in the final
BOW. The cells contain the counts of the number of times a token is present in each
document.
• N-grams and n-grams BOW: In some cases, a sequence of words conveys more meaning
than the individual words. An n-gram is a representation of word sequences, where the
length of a sequence varies from 1 to n. A one-word sequence is a unigram, a two-word
sequence is a bigram, a three-word sequence is a trigram, and so on (a short sketch follows this list).
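The sketch below builds a unigram BOW and document term matrix, plus a bigram BOW, using CountVectorizer from a recent version of scikit-learn; the library choice and the three toy documents are assumptions for illustration, not part of the notes.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three hypothetical, already cleansed and lowercased documents
docs = [
    "the stock price rose sharply",
    "the stock price fell",
    "analysts expect the price to rise",
]

# Unigram bag-of-words -> document term matrix (rows = documents, columns = tokens, cells = counts)
bow = CountVectorizer(stop_words="english")
dtm = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(dtm.toarray())

# Bigram BOW: two-word sequences, which can carry more meaning than single words
bigrams = CountVectorizer(ngram_range=(2, 2))
print(bigrams.fit_transform(docs).shape)
```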
Data Exploration
Exploratory data analysis (EDA) is a preliminary step in data exploration that involves
summarizing and observing data. The objectives of EDA include:
• Serving as a communication medium among project stakeholders
• Understanding data properties
• Finding patterns and relationships in data
• Inspecting basic questions and hypotheses
• Documenting data distributions and other characteristics
• Planning modeling strategies for the next steps
Visualization techniques for EDA include: histograms, box plots, scatterplots, word clouds,
etc.
In general, EDA helps identify the general trends in the data as well as the relationships
between data. These relationships and trends can be used for feature selection and feature
engineering.
Feature selection is the process of selecting only the relevant features from the data set to
reduce model complexity. Feature selection methods used for text data include:
• Term frequency (TF): Features with very high or very low term frequencies are
removed. These represent noisy features.
• Document frequency (DF): The DF of a token is defined as the number of documents
(texts) that contain the respective token divided by the total number of documents.
This measure also helps in identifying and removing noisy features.
• Chi-square test: This test allows us to rank tokens by their usefulness to each class
in text classification problems. Features with high scores can be retained whereas
features with low scores can be removed.
• Mutual information (MI) measure: This measure tells us how much information a token
contributes to a class of texts. The MI score ranges from 0 to 1. A score close to 0
indicates that the token is not very useful and can be removed, whereas a score close to
1 indicates that the token is closely associated with a particular class and should be
retained (see the sketch after this list).
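A rough sketch of scoring tokens with the chi-square test and mutual information, assuming scikit-learn; note that sklearn's mutual_info_classif is not normalized to the 0-to-1 range described above, but near-zero scores still flag uninformative tokens. The toy documents and labels are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2, mutual_info_classif

# Hypothetical labeled sentences (1 = positive sentiment, 0 = negative)
docs = ["great earnings growth", "strong profit outlook",
        "weak sales decline", "profit warning issued"]
labels = np.array([1, 1, 0, 0])

vec = CountVectorizer()
dtm = vec.fit_transform(docs)

chi2_scores, _ = chi2(dtm, labels)                                 # higher score = more useful token
mi_scores = mutual_info_classif(dtm, labels, discrete_features=True)
for token, c, m in zip(vec.get_feature_names_out(), chi2_scores, mi_scores):
    print(f"{token:10s} chi2 = {c:.2f}  MI = {m:.2f}")             # near-zero scores: candidates to drop
```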
Feature engineering is the process of creating new features by changing or transforming
existing features. Feature engineering for text data includes converting numbers into tokens,
creating n-grams, and applying name entity recognition (NER) and parts-of-speech (POS) tagging.
Model Training
Model training consists of three major tasks: method selection, performance evaluation,
and model tuning.
Method selection: This decision is based on the following factors:
• Whether the data project involves labeled data (supervised learning) or unlabeled
data (unsupervised learning)
• Type of data: numerical, continuous, or categorical; text data; image data; speech
data; etc.
• Size of the dataset
Performance evaluation: Commonly used techniques are:
• Error analysis using a confusion matrix: A confusion matrix is created with four
categories: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
The following metrics are used to evaluate a confusion matrix:
Precision (P) = TP/(TP + FP)
Recall (R) = TP/(TP + FN)
Accuracy = (TP + TN)/(TP + FP + TN + FN)
F1 score = (2 * P * R)/(P + R)
The higher the accuracy and the F1 score, the better the model performance (a short numeric example follows this list).
• Receiver operating characteristic (ROC): ROC curves and the area under the curve
(AUC) of various models are calculated and compared. An AUC close to 1 indicates
near-perfect prediction, whereas an AUC of 0.5 indicates random guessing. In other
words, a more convex ROC curve indicates better model performance.
• Calculating root mean squared error (RMSE): The root mean squared error is
computed by finding the square root of the mean of the squared differences
between the actual values and the model’s predicted values (error). The model with
the smallest RMSE is the most accurate model.
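A short numeric example of the confusion-matrix metrics above, using hypothetical counts:

```python
# Hypothetical confusion-matrix counts
TP, FP, TN, FN = 80, 20, 85, 15

precision = TP / (TP + FP)                   # 80 / 100 = 0.800
recall = TP / (TP + FN)                      # 80 / 95  ~ 0.842
accuracy = (TP + TN) / (TP + FP + TN + FN)   # 165 / 200 = 0.825
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
print(f"accuracy  = {accuracy:.3f}, F1 = {f1:.3f}")
```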
Model tuning
• Involves managing the trade-off between model bias error (which is associated with
underfitting) and model variance error (which is associated with overfitting).
• ‘Grid search’ is a method of systematically training an ML model with different
combinations of hyperparameter values to determine which values lead to the best model performance.
• A fitting curve of in-sample error and out-of-sample error on the y-axis versus
model complexity on the x-axis is useful for managing the trade-off between bias
and variance errors.