Building a Robust Geodemographic Segmentation Model
Transforming Independent Variables
• Transformations re-express data on a different scale; the form of the
transformation determines how the scale of the untransformed variable is
affected.
• In modeling and statistical applications, transformations are often
used to improve the compatibility of the data with the assumptions
underlying a modeling process, to linearize a non-linear relationship
between two variables, or to modify the range of values of a variable.
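As a minimal sketch of the idea (assuming Python with numpy, and an invented income variable), a log transform compresses a long right tail and can linearize a relationship that is multiplicative on the original scale:

    import numpy as np

    # Hypothetical right-skewed variable, e.g. household income
    income = np.array([12_000.0, 25_000.0, 40_000.0, 90_000.0, 250_000.0])

    # The log transform compresses the upper tail, pulling extreme values
    # closer to the rest of the distribution
    log_income = np.log(income)

    print(log_income)  # values now span a much narrower range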
What is Multicollinearity?
• Multicollinearity occurs when independent variables in a regression
model are correlated.
• This correlation is a problem because independent variables should
be independent. If the degree of correlation between variables is
high enough, it can cause problems when you fit the model and
interpret the results.
• In other words, one independent variable can be predicted from another
independent variable in the regression model.
The Problem with having Multicollinearity
• Multicollinearity is a problem in a regression model because it
prevents us from distinguishing the individual effects of the
independent variables on the dependent variable.
• For example, consider the following linear equation:
Y = W0 + W1*X1 + W2*X2
• Coefficient W1 is the increase in Y for a unit increase in X1 while
keeping X2 constant. But if X1 and X2 are highly correlated, changes in
X1 would also be accompanied by changes in X2, and we would not be able
to isolate their individual effects on Y.
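A small simulation makes this concrete. The sketch below (Python with numpy; the data and coefficient values are invented for illustration) refits the regression on bootstrap resamples and shows that when X1 and X2 are nearly collinear, the individual coefficient estimates swing while their sum stays stable:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.05, size=n)  # x2 is nearly identical to x1
    y = 3 * x1 + 2 * x2 + rng.normal(size=n)  # true effects: W1 = 3, W2 = 2

    # Refit on bootstrap resamples: the individual coefficients vary widely,
    # but their sum (the combined effect) is estimated precisely
    for _ in range(3):
        idx = rng.integers(0, n, size=n)
        X = np.column_stack([np.ones(n), x1[idx], x2[idx]])
        w, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
        print(f"W1={w[1]:+.2f}  W2={w[2]:+.2f}  W1+W2={w[1]+w[2]:+.2f}")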
What causes Multicollinearity?
• Multicollinearity can be built into a dataset at the time of its
creation. This can result from poorly designed experiments, purely
observational data, or an inability to manipulate the data:
• For example, consider predicting the electricity consumption of a household
from the household income and the number of electrical appliances. Here, we
know that the number of electrical appliances in a household tends to increase
with household income; this correlation cannot be removed from the dataset
• Multicollinearity can also occur when new variables are created that
depend on other variables:
• For example, creating a BMI variable from the height and weight variables
would add redundant information to the model
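As an illustrative sketch (Python with pandas; the column names and values are hypothetical), deriving BMI from height and weight makes the redundancy visible directly in the correlation matrix:

    import pandas as pd

    # Hypothetical data; column names are illustrative
    df = pd.DataFrame({"height_m": [1.6, 1.7, 1.8, 1.9],
                       "weight_kg": [55.0, 70.0, 80.0, 95.0]})

    # BMI is a deterministic function of height and weight, so using all
    # three columns as regressors introduces redundant information
    df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

    print(df.corr().round(2))  # bmi correlates strongly with its inputs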
What causes Multicollinearity?
• Including effectively identical variables in the dataset:
• For example, including both temperature in Fahrenheit and
temperature in Celsius
• Improper use of dummy variables can also cause a multicollinearity
problem. This is called the dummy variable trap (see the sketch after
this list):
• For example, consider a dataset with a marital-status variable that has two
unique values: ‘married’ and ‘single’. Creating dummy variables for both of
them would include redundant information; a single variable containing 0/1
for ‘married’/’single’ status is enough.
• Insufficient data can, in some cases, also cause multicollinearity
problems
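A minimal sketch of the dummy variable trap and its fix, assuming Python with pandas (the data are invented for illustration):

    import pandas as pd

    status = pd.DataFrame(
        {"marital_status": ["married", "single", "single", "married"]})

    # Dummies for BOTH categories always sum to 1, so together they are
    # perfectly collinear with the intercept: the dummy variable trap
    both = pd.get_dummies(status["marital_status"])

    # drop_first=True keeps a single 0/1 column, which carries all the
    # information without the redundancy
    one = pd.get_dummies(status["marital_status"], drop_first=True)

    print(one)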
Detecting Multicollinearity using VIF
(Variance Inflation Factor)
• VIF measures the strength of the correlation between the independent
variables. It is calculated by taking each independent variable in turn
and regressing it against all of the other independent variables.
• The VIF score of an independent variable represents how well that
variable is explained by the other independent variables.
• The R^2 value of this auxiliary regression tells us how well an
independent variable is described by the other independent variables. A
high value of R^2 means that the variable is highly correlated with the
other variables. This is captured by the VIF, which is given by the
formula below:
• VIF = 1 / (1 - R^2)
• VIF starts at 1 and has no upper limit
• VIF = 1: no correlation between the independent variable and the
other variables
• VIF exceeding 5 or 10 indicates high multicollinearity between this
independent variable and the others
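As a sketch of the calculation (Python with pandas and statsmodels; the data and column names are invented for illustration), statsmodels provides a variance_inflation_factor helper that runs the auxiliary regressions described above:

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    # Hypothetical feature matrix
    X = pd.DataFrame({"income": [30, 45, 60, 80, 100, 120],
                      "appliances": [3, 5, 6, 8, 10, 11],
                      "household_size": [2, 3, 1, 4, 5, 3]})

    # Add an intercept so each auxiliary regression includes a constant,
    # then compute VIF for each feature (index 0 is the constant itself)
    X_const = add_constant(X)
    vif = pd.Series([variance_inflation_factor(X_const.values, i)
                     for i in range(1, X_const.shape[1])],
                    index=X.columns)

    print(vif.round(2))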
Multicollinearity
• What does it mean?
• A high degree of correlation amongst the explanatory variables
• What are its consequences?
• It may be difficult to separate out the effects of the individual regressors.
Standard errors may be inflated and t-values depressed.
• Note: a symptom may be high R^2 but low t-values
• How can you detect the problem?
• Examine the correlation matrix of the regressors; also carry out auxiliary
regressions amongst the regressors.
What is a Correlation Matrix?
• A correlation matrix is a table showing correlation coefficients
between variables. Each cell in the table shows the correlation
between two variables.
• A correlation matrix is used to summarize data, as an input into a
more advanced analysis, and as a diagnostic for advanced analyses.
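A minimal sketch of inspecting a correlation matrix in Python with pandas (the data are simulated for illustration):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    df = pd.DataFrame({"x1": rng.normal(size=100)})
    df["x2"] = 0.9 * df["x1"] + rng.normal(scale=0.3, size=100)  # correlated with x1
    df["x3"] = rng.normal(size=100)                              # independent

    # Pairwise Pearson correlations; entries near +/-1 flag potential
    # multicollinearity worth investigating further
    print(df.corr().round(2))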
What Problems Do Multicollinearity Cause?
• The coefficient estimates can swing wildly based on which other
independent variables are in the model. The coefficients become very
sensitive to small changes in the model.
• Multicollinearity reduces the precision of the estimated coefficients,
which weakens the statistical power of your regression model. You
might not be able to trust the p-values to identify independent
variables that are statistically significant.
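As a final illustration (Python with numpy and statsmodels; simulated data), comparing a model that includes a nearly duplicate regressor against one that omits it shows both effects: the standard error of the coefficient inflates, and its p-value may no longer indicate significance:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly duplicates x1
    y = 2 * x1 + rng.normal(size=n)           # only x1 truly affects y

    # Fit with and without the collinear regressor
    full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    reduced = sm.OLS(y, sm.add_constant(x1)).fit()

    # With x2 included, x1's standard error balloons and its p-value can
    # suggest (wrongly) that x1 is not significant
    print("x1 in full model   : se =", round(full.bse[1], 3),
          " p =", round(full.pvalues[1], 4))
    print("x1 in reduced model: se =", round(reduced.bse[1], 3),
          " p =", round(reduced.pvalues[1], 4))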