Advanced Regression


Why Generalized Linear Models?

Welcome to the course on Advanced Regression Models.

In the previous course on Regression, you saw how to predict a dependent variable from one or more independent variables.

The dependent variable was numeric, and the errors were assumed to be normally distributed.

What if your dependent variable is discrete (0 or 1) or count data, and the error terms do not follow a normal distribution?

Do you think your prediction would be good with the Linear Model?

Limitations of Linear Models

Linear Models are the most widely used models in Statistics.

But they come with their own limitations.

They are not proficient in handling binary data.

They are not accurate when count data (number of footfalls, number of pages visited, etc.) is involved.

Some variables are constrained to be strictly positive.

To fix some of these problems we can go for transformation.

In some scenarios transformation reduces interpretability, so we have to look for other alternatives.

GLM
To overcome some of the limitations of Linear Models, we can go for Generalized Linear Models (GLMs).

In GLMs the modeling is done on the scale in which the data was recorded.

GLMs honor the known assumptions of the data.

GLM Components
GLMs comprise three components:

Random Component: the probability distribution that describes the randomness / errors in the data.

Systematic Component: the linear predictor, built from the covariates and their coefficients.

Link Function: connects the mean of the response to the linear predictor.

GLM Representation

The first equation describes the Random Component; here it is the Gaussian distribution.

The second equation is the Systematic Component, which has the covariates and the coefficients. This is the linear predictor.

The third equation links the mean of the random component to the linear predictor through the link function.

Together, this set of equations is a generic representation of the Generalized Linear Model.
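
The equations themselves did not survive extraction; the following is a standard reconstruction of the Gaussian case described above:

Yi ∼ N(μi, σ²)                        (Random Component)
ηi = β0 + β1xi1 + … + βkxik           (Systematic Component / linear predictor)
g(μi) = ηi                            (Link Function; for the Gaussian case g is the identity, g(μ) = μ)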

Types of Generalized Models

Logistic Regression: used for predicting binary outcomes.

Poisson Regression: used for predicting count data (# of footfalls, # of hits on a website).

Next in GLM ...

In the following topics you will understand GLMs in detail.

You will understand Logistic and Poisson Regression with some applications.

You will first learn the statistical aspects and later understand them from a Machine Learning perspective.

You can try out some examples in Python.

Understanding the Logistic Function

Positive values of the coefficients push the prediction towards class 1: they increase the linear predictor, thereby increasing the probability of y = 1.

Negative values of the coefficients push the prediction towards class 0: they decrease the linear predictor, thereby decreasing the probability of y = 1.
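
For reference, the logistic function itself (a standard formula, consistent with the logit equations in the cards that follow) maps the linear predictor to a probability:

P(y=1) = 1 / (1 + e^−(β0 + β1x1 + … + βkxk))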

Regression Coefficients
The coefficients β0, β1 and β2 are selected in such a way that they

predict a high probability for cases where y = 1, and

predict a low probability for cases where y = 0.

Odds
Odds = p(y=1) / p(y=0)

Odds > 1 if y = 1 is more likely

Odds < 1 if y = 0 is more likely

Odds = 1 if both outcomes are equally likely

The Logit

Odds = e^(β0 + β1x1 + … + βkxk)

log(Odds) = β0 + β1x1 + … + βkxk

This quantity is called the logit, and it looks like linear regression.

The bigger the logit, the bigger P(y=1).

Positive beta values increase the logit, increasing the odds.

Negative beta values decrease the logit, decreasing the odds.
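
As a quick numeric illustration (hypothetical coefficient values, chosen only for this sketch), the snippet below converts a linear predictor into odds and a probability:

import numpy as np

beta0, beta1 = -2.0, 0.5     # hypothetical fitted coefficients
x = 5.0
logit = beta0 + beta1 * x    # log-odds = 0.5
odds = np.exp(logit)         # ≈ 1.65, so y = 1 is more likely
p = odds / (1 + odds)        # P(y = 1) ≈ 0.62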


Baseline Method
For Logistic Regression / Binary Classification, the baseline method is to predict the most frequent outcome.

The model's output is a probability value. To separate the 1s and 0s you have to identify a threshold value.

Values above the threshold will be marked 1 and values below will be marked 0, as in the sketch below.

Choosing the right threshold value is important.
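
A minimal sketch of applying a threshold (the probability values here are made up for illustration):

import numpy as np

probs = np.array([0.20, 0.70, 0.55, 0.40])   # hypothetical predicted probabilities
threshold = 0.5
labels = (probs >= threshold).astype(int)    # -> array([0, 1, 1, 0])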

ROC Curve

ROC stands for Receiver Operating Characteristic.

It is a graphical way of assessing how well the model has been fit.

The true positive rate is plotted on the vertical axis.

The false positive rate is plotted on the horizontal axis.

ROC Curve Properties

The ROC curve captures all the threshold values.

It also helps to ...

Choose the best threshold for the best trade-off

Weigh the cost of failing to detect positives (false negatives)

Weigh the cost of raising false alarms (false positives)
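
A minimal sketch of plotting the curve (assuming a fitted scikit-learn classifier clf and a held-out set X_test, y_test, as in the Python examples later in this course):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

probs = clf.predict_proba(X_test)[:, 1]           # scores for the positive class
fpr, tpr, thresholds = roc_curve(y_test, probs)   # one point per threshold
print(roc_auc_score(y_test, probs))               # area under the curve
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()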

Confusion Matrix

A Confusion Matrix is a tabular way of representing model performance. It has

Actual outcomes along the rows

Predicted outcomes along the columns

Many metrics can be derived from the values in a confusion matrix.

Sensitivity is the true positive (TP) rate.

sensitivity = TP/(TP+FN)

Specificity is the true negative (TN) rate.

specificity = TN/(TN+FP)

The Story So Far ...

You have learnt what Generalized Linear Models are.

You have learnt how the Random Component is linked to the predictors using the Link Function.

What is the logit? How do the parameters affect the logit?

How to measure the accuracy of a logistic regression model.

You will now learn how to implement these ideas using Python.

Data and Code

The code below is a simple demonstration of how GLMs are implemented in Python.

A dataset is created with the scores a team got and whether it won or lost the respective game.

This is to illustrate how the score helps us predict the binary outcome win/lose.

from __future__ import print_function  # must come first; only needed on Python 2
import pandas as pd
import statsmodels.api as sm

Scores = [(200,1),(100,0),(150,1),(320,1),(270,1),(134,0),(322,1),(140,0),(210,0),(199,0)]
Labels = ['Score','Win']
df = pd.DataFrame.from_records(Scores, columns=Labels)
# Note: no intercept is included here; sm.add_constant(df.Score) would add one
glm_binom = sm.GLM(df.Win, df.Score, family=sm.families.Binomial())
res = glm_binom.fit()
print(res.summary())

Sample Output

The coefficient on Score tells us to what extent the score predicts the likelihood of winning a game.

The rest of the values are the standard output of a regression fit.

Sample Data
In the previous example you have seen how to fit a GLM using the statsmodels package.

In this example you will learn how to fit a Logistic Regression using scikit-learn.

The previous example was a statistical perspective. The current example will give a machine learning perspective.

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=2,
                           n_informative=2, n_redundant=0,
                           n_clusters_per_class=1,
                           class_sep=2.0, random_state=101)

The above code creates a sample dataset for a binary classification problem.

Two features are created, and the two classes are generated from those features.

Plotting the Data

%matplotlib inline
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], marker='o', c=y,
            linewidth=0, edgecolor=None)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

The above code plots the input data, colored by class.
Splitting the Data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y.astype(float), test_size=0.33, random_state=101)

In this code we split the data into training and test sets.

Logistic Regression Model Fit

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report  # needed for the report below

clf = LogisticRegression()
clf.fit(X_train, y_train.astype(int))
y_clf = clf.predict(X_test)
print(classification_report(y_test, y_clf))

The code above fits a logistic regression model and prints the classification report.

Results of Model

Running the code above produces the following report:

             precision    recall  f1-score   support

        0.0       1.00      0.93      0.97        15
        1.0       0.95      1.00      0.97        18
avg / total       0.97      0.97      0.97        33

Precision = TP/(TP + FP)

Precision is similar to accuracy but looks only at the cases predicted positive: of those, how many are actually positive.

Recall = TP/(TP + FN)

Recall looks only at the cases that are actually positive: of those, how many were correctly identified.

Confusion Matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_clf)

array([[14,  1],
       [ 0, 18]])

Accuracy for this model is (14+18) / (14+1+0+18) = 0.9697

Sensitivity and Specificity

Sensitivity for the model is 100% (18/18).

Specificity for the model is about 93% (14/15).

Based on these numbers we can say that the model cleanly separates the data into the two classes, and correctly labels the points that do not belong to a class as negative.
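
As a short follow-up, the two rates can be computed directly from the confusion matrix (a minimal sketch using the y_test and y_clf arrays from the earlier cards; ravel() is a standard way to unpack the 2x2 matrix):

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_clf).ravel()
sensitivity = tp / (tp + fn)   # 18 / 18 = 1.00
specificity = tn / (tn + fp)   # 14 / 15 ≈ 0.93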

Data Prep for Hands On

from sklearn import datasets

iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
print(iris_X)
print(iris_y)

Use the above commands to load the dataset and the required variables.

Why Poisson Regression?

One of the underlying assumptions of Linear Regression is that the error terms follow a normal distribution.

When the error terms do not follow a normal distribution, we go for other types of Regression.

When we try to model count data (number of footfalls, traffic on a website), we go for Poisson Regression.

Where is Poisson Regression Used?

Poisson Regression is used to model count data:

number of footfalls, number of call drops, etc.

In mathematical terms, Poisson Regression models the logarithm of the expected count.

Variables in the Poisson Regression Equation

The dependent variable Y represents a count; sometimes Y/t is used, signifying a rate.

The independent variables are categorical or continuous variables, depending on the dataset.

GLM
Link function: g(μ) = β0 + β1x1 + β2x2 + … + βkxk = xiᵀβ

Random component: the response Y has a Poisson distribution, that is yi ∼ Poisson(μi) for i = 1, ..., N, where the expected count of yi is E(Yi) = μi.

Systematic component: any set X = (X1, X2, …, Xk) of independent variables.

Link Function
Identity link: μ = β0 + β1x1
On some occasions the identity link function is used in Poisson regression. Here the random component is the Poisson distribution.

Natural log link: log(μ) = β0 + β1x1

The Poisson regression model for counts is occasionally referred to as a "Poisson loglinear model".

For simplicity, with a single independent variable, we can write: log(μ) = α + βx. This is equivalent to: μ = exp(α + βx) = exp(α) exp(βx)
Interpreting Parameters
Interpreting the estimated parameters:

exp(α) = the mean of Y when X = 0

exp(β) = the multiplicative effect on the mean of Y, that is μ, of every unit increase in X

If β = 0, then exp(β) = 1, the expected count is μ = E(Y) = exp(α), and Y and X are not related.

If β > 0, then exp(β) > 1, and the expected count μ = E(Y) is exp(β) times larger than when X = 0.

If β < 0, then exp(β) < 1, and the expected count μ = E(Y) is exp(β) times smaller than when X = 0.
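
As a hypothetical numeric illustration (the value of β here is made up):

import numpy as np

beta = 0.7
print(np.exp(beta))   # ≈ 2.01: each unit increase in X roughly doubles the expected count E(Y)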

Poisson Regression for Rate

The set of equations mentioned in the above cards is also applicable to rate data, Y/t,

where Y is the count and t is the exposure time. See the sketch below.
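
The standard way to handle rate data is to include log(t) as an offset, so that log(μ) = log(t) + α + βx. A minimal sketch (assuming 1-D arrays y of counts, t of exposure times and x of predictor values are already available):

import numpy as np
import statsmodels.api as sm

X = sm.add_constant(x)                     # adds the intercept term α
rate_glm = sm.GLM(y, X, family=sm.families.Poisson(), offset=np.log(t))
rate_res = rate_glm.fit()
print(rate_res.summary())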

Invoking the Required Libraries

import numpy as np
import pandas as pd
import statsmodels.api as sm

For the following code we will need the pandas, numpy and statsmodels libraries.

Create Data Frame

dataset = pd.DataFrame({'A': np.random.rand(100)*1000,
                        'B': np.random.rand(100)*100,
                        'C': np.random.rand(100)*10,
                        'target': np.random.randint(0, 5, 100)})

We are creating a sample dataframe. The variables are random numbers.

The dependent variable signifies count data.

The independent variables are random numbers.

This exercise is just to familiarize you with Poisson Regression.

Split the Variables

X = dataset[['A','B','C']].copy()
X['constant'] = 1   # adds an intercept column
y = dataset['target']

The code splits the dependent and the independent variables, and adds a constant column so the model has an intercept.

Model Fitting

fam = sm.families.Poisson()
pois_glm = sm.GLM(y, X, family=fam)
pois_res = pois_glm.fit()
pois_res.summary()

Using the above code, we fit the model and view the results.

Interpreting the Results

Viewing the coefficient values tells us to what extent each predictor explains the log of the expected count, i.e. the dependent variable on the log scale.

The rest of the values are the standard output of a regression fit.
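
Since the coefficients act on the log scale, exponentiating them gives the multiplicative effect on the expected count (a short follow-up using the pois_res object fitted above):

import numpy as np

# exp(coef): multiplicative effect of a one-unit increase on the expected count
print(np.exp(pois_res.params))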

Advanced Models
In this topic you will learn some advanced regression models.

You will understand when to and when not to apply a specific regression model.

Bayesian vs Linear Regression

Bayesian Regression is similar to Linear Regression in many ways.

In Linear Regression the output is a single number / value.

In Bayesian Regression the output is also a value, but the model additionally returns the entire probability distribution.

How is the Probability Distribution Constructed?

Here, both the predicted value and a variance value are returned.

With the predicted value as the mean and the square root of the variance as the standard deviation, the probability distribution can be constructed (see the sketch after the Python example below).

Bayesian Regression in Python

from sklearn import linear_model   # import needed for the call below

regr = linear_model.BayesianRidge()
regr.fit(X, y)   # assumes a feature matrix X and target y are already defined
Out:
BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False,
              copy_X=True, fit_intercept=True, lambda_1=1e-06,
              lambda_2=1e-06, n_iter=300, normalize=False,
              tol=0.001, verbose=False)

The above code is a sample that shows how to fit a Bayesian Regression model using Python.
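
To construct the probability distribution described earlier, the fitted model can also return the predictive standard deviation (a minimal sketch; return_std is a standard option of BayesianRidge.predict):

y_mean, y_std = regr.predict(X, return_std=True)
# y_mean is the point prediction; with y_std as the standard deviation,
# the predictive distribution for each observation can be constructed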

Pros and Cons of BR

Pros

It is robust to Gaussian noise.

It works well when the number of features and the number of observations in the dataset are comparable.

Cons

It is really time-consuming.

CART Algorithm
Classification and Regression Trees (CART) are a set of non-linear learning algorithms that can be used with numerical as well as categorical features.

Here the tree has a set of nodes that split a branch into children.

In turn, each branch can lead to another node or stay as a leaf, along with the forecasted value or the predicted class.

Why Trees?
Performing the prediction task is quick.

The principal task is traversal of the tree from the root node to a leaf node, checking at each point whether the respective feature is above or below a threshold.

The concept of variance reduction is used when training this algorithm.

In each node a search is performed along all the features, across all levels of each feature.

The split that yields the greatest variance reduction is selected as the best, as in the sketch below.
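
A minimal sketch of the variance-reduction criterion (a hypothetical helper written for illustration, not scikit-learn's internals):

import numpy as np

def variance_reduction(y, left_mask):
    # Variance reduction achieved by splitting targets y into left/right groups
    y = np.asarray(y, dtype=float)
    left, right = y[left_mask], y[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    weighted = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - weighted

# Example: a split on x < 20 cleanly separates low and high targets
y = np.array([1.0, 1.2, 0.9, 5.0, 5.5])
x = np.array([10, 12, 11, 40, 42])
print(variance_reduction(y, x < 20))   # large reduction -> good split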

Regression Trees with Python

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error   # import needed for the call below

regr = DecisionTreeRegressor(random_state=101)
regr.fit(X_train, y_train)

mean_absolute_error(y_test, regr.predict(X_test))

The syntax is similar to applying any regression model using scikit-learn.

Pros and Cons of Trees

Pros ...

Trees are the go-to algorithms for modeling non-linear behavior.

They can be used for both categorical and numeric datatypes without performing any kind of normalization.

The training time and prediction time are fast.

They leave a very small memory footprint.

Cons

CART belongs to a class of greedy algorithms: it does not optimize the entire solution, it just optimizes specific choices.

If there is a significant number of features, it does not perform well.

The leaf nodes can be very specific, sometimes leading to overfitting. In that case those nodes can be pruned.

Bagging and Boosting

Bagging and Boosting are techniques used for combining multiple models to improve overall accuracy.

The final combination is a non-linear model containing a set of linear models.

Bagging is short for Bootstrap Aggregation.

The main objective of this technique is to reduce the overall variance by aggregating the models.

How is Bagging Done?

Each model is trained on a bootstrap sample of the training data, drawn with replacement (optionally on a random subset of the features).

At prediction time, each of the models makes its own prediction; the results are collected and averaged to form the ensemble prediction.

Bagging Tip

Training and prediction happen at the individual model level. This gives the flexibility to parallelize the operation over multiple CPUs.

Bagging in Python
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import SGDRegressor    # base estimator used below
from sklearn.metrics import mean_absolute_error

# Bagging an SGD regressor, each estimator seeing a random 80% of the features
bagging = BaggingRegressor(SGDRegressor(), n_jobs=-1,
                           n_estimators=1000, random_state=101,
                           max_features=0.8)
bagging.fit(X_train, y_train)
mean_absolute_error(y_test, bagging.predict(X_test))

# Random Forests are a popular bagged-tree ensemble
from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(n_estimators=100,
                             n_jobs=-1, random_state=101)
regr.fit(X_train, y_train)
mean_absolute_error(y_test, regr.predict(X_test))

The sample code above shows how to implement bagging using Python.

Boosting
Boosting is another way of combining multiple learning models.

The objective of boosting is to reduce the prediction bias.

In boosting the models form a sequence, cascaded with each other: the output of one is the input of the next.

Boosting Algorithm
During training, the output of the cascade so far is predicted.

The error relative to the actual value is calculated.

This error is multiplied by the learning rate.

A new model is trained on that error and appended at the final stage of the cascade.

The output value at each stage is the prediction so far plus the learning rate times the prediction of the current stage, as in the sketch below.
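
The cascade can be sketched by hand with shallow regression trees fitted to successive residuals (a simplified illustration of the idea above, assuming a feature matrix X and target y; it is not the exact internals of scikit-learn's implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def manual_boost(X, y, n_stages=50, learning_rate=0.1):
    pred = np.full(len(y), y.mean())           # stage 0: predict the mean
    stages = []
    for _ in range(n_stages):
        residual = y - pred                    # error of the cascade so far
        stage = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        pred = pred + learning_rate * stage.predict(X)   # scaled correction
        stages.append(stage)
    return stages, pred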

Boosting Sample Code

from sklearn.ensemble import GradientBoostingRegressor

regr = GradientBoostingRegressor(n_estimators=500,
                                 learning_rate=0.01,
                                 random_state=101)
regr.fit(X_train, y_train)

Pros and Cons

Pros

We can build very good and robust models by combining weak models.

They support stochastic learning.

The robustness of the solution comes from the stochastic or random nature of the model.

Cons

The time taken for training is very high, and there is a high memory footprint.

The steps in model building can be tricky because of the stochastic nature.

Application Areas
In this course so far you have seen different types of regression models; now you will learn in what kinds of scenarios all these models are applied.

Some of the application areas include

Prediction Problems

Binary and Multi-Class Classification

Time Series Analysis

Ranking Problems

A Regression Problem
Consider a dataset from the Music Industry.

The descriptors of a particular song are given, along with the year the song was produced.

Can this data be modeled as a Regression Problem to predict the year given the descriptors?

Problem Approach
For the question raised in the previous card, the answer is yes: we can predict the year of production based on the descriptors.

The features should be identified based on their relevance to the context.

Once the features are extracted, a model can be trained with the features as inputs and the year of production as output.

The model can be evaluated using the Mean Absolute Error between actual and predicted values.

The ultimate objective is to minimize this error.

Classification Problem
The previous problem can also be modeled as a Multi-Class Classification problem.

The features and the descriptors remain the same.

The output will be one of the classes from the range of years provided.

Mean absolute error can still be used for validating the accuracy of the prediction.

Ranking Problem
Consider a dataset with some features related to a car, along with its price.

Insurance companies would want to assess, on a given scale, how risky the car is to sell / buy.

How do you think you would design this problem?

Ranking Problem Approach

The above problem can be modeled as a regression problem where we predict the risk on a scale.

The methodology to assess the prediction, however, is different.

In this scenario we can go for the label ranking loss, a metric that indicates how well the predicted ranking matches the true ranking.

Mean Absolute and Mean Squared Errors are not applicable in this scenario.

Another way to measure the prediction accuracy is Label Ranking Average Precision.

Time Series Problem

So far you have seen problems where the features and the target variables are different.

In scenarios where you analyze stock prices, currency fluctuations, or the support ticket trend over a period of time, the variables themselves can be both the features and the targets.

These problems fall under Time Series Analysis.

In time series analysis, the data at time t+k can be the target and the data at time t can be the feature. The concept of autoregression is applied in these scenarios.
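
A minimal sketch of framing autoregression as supervised learning (the series values here are made up; the lag k is a modeling choice):

import numpy as np

series = np.array([1.0, 1.3, 1.2, 1.5, 1.7, 1.6, 1.9])   # hypothetical time series
k = 1                                                     # lag
X = series[:-k].reshape(-1, 1)   # data at time t becomes the feature
y = series[k:]                   # data at time t + k becomes the target
# X and y can now be fed to any regressor from the earlier cards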

Summary of the Course

In this course,

You have understood the limitations of Linear Models and how they can be overcome with Generalized Linear Models

How to represent Generalized Linear Models

Logistic Regression from a GLM and a Machine Learning perspective

Poisson Regression, its representation and how to apply it in a real-world application

Advanced Regression Models

Real-world applications of Regression Analysis

Hope you had fun taking this course.
