Regression Analysis in Machine Learning

Context:

In order to understand the motivation behind regression, let's consider the following simple example. The scatter plot below shows the number of college graduates in the US from the year 2001 to 2012.

Now, based on the available data, suppose someone asks you how many college graduates with master's degrees there will be in the year 2018. It can be seen that the number of college graduates with master's degrees increases almost linearly with the year. So by simple visual analysis, we can get a rough estimate of that number as being between 2.0 and 2.1 million. Let's look at the actual numbers. The graph below plots the same variable from the year 2001 to the year 2018. It can be seen that our predicted number was in the ballpark of the actual value. Since this was a simple problem (fitting a line to data), our mind was easily able to do it. This process of fitting a function to a set of data points is known as regression analysis.

What is Regression Analysis?

Regression analysis is the process of estimating the relationship between a dependent variable and independent variables. In simpler words, it means fitting a function from a selected family of functions to the sampled data under some error function. Regression analysis is one of the most basic tools in machine learning used for prediction. Using regression, you fit a function to the available data and try to predict the outcome for future or held-out data points. This fitting of a function serves two purposes:

1. You can estimate missing data within your data range (interpolation).
2. You can estimate future data outside your data range (extrapolation).

Some real-world examples of regression analysis include predicting the price of a house given its features, predicting the impact of SAT/GRE scores on college admissions, predicting sales based on input parameters, predicting the weather, etc.

Let's consider the previous example of college graduates.

1. Interpolation: Let's assume we have access to somewhat sparse data, where we know the number of college graduates only every 4 years, as shown in the scatter plot below.

We want to estimate the number of college graduates for all the missing years in between. We can do this by fitting a line to the limited available data points. This process is called interpolation.
2. Extrapolation: Let's assume we have access to limited data from the year 2001 to the year 2012, and we want to predict the number of college graduates from the year 2013 to 2018.

It can be seen that the number of college graduates with master's degrees increases almost linearly with the year. Hence, it makes sense to fit a line to the dataset. Fitting a line using the 12 available points and then testing its prediction on the 6 future points, it can be seen that the prediction is very close.
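As a rough sketch of this experiment in Python (the graduate counts below are synthetic placeholders generated from a straight line plus noise, not the article's actual data), we can fit a line on the 2001-2012 points with NumPy and check its predictions for 2013-2018:

```python
# Sketch of the extrapolation experiment: fit on 2001-2012, predict 2013-2018.
# The graduate counts (in millions) are made-up placeholders for illustration.
import numpy as np

rng = np.random.default_rng(0)
train_years = np.arange(2001, 2013)
train_grads = 0.046 * train_years - 90.798 + rng.normal(0, 0.01, 12)
test_years = np.arange(2013, 2019)

slope, intercept = np.polyfit(train_years, train_grads, deg=1)  # fit a line
predictions = slope * test_years + intercept
for year, pred in zip(test_years, predictions):
    print(f"{year}: {pred:.3f} million (predicted)")
```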

Mathematically speaking, given data points $(x_i, y_i)$, $i = 1, \dots, N$, regression fits a function $f_\beta$ from a chosen family of functions by finding the parameters $\beta$ that minimize a loss function $l$:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} l\big(f_\beta(x_i), y_i\big)$$

Types of regression analysis

Now let's talk about the different ways in which we can carry out regression. Based on the family of functions $f_\beta$ and the loss function $l$ used, we can categorize regression as follows.
1. Linear Regression

In linear regression, the objective is to fit a hyperplane (a line for 2D data points) by minimizing the sum of squared errors over the data points.

Mathematically speaking, linear regression solves the following problem:

$$\hat{\beta} = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{N} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2$$

Hence we need to find the 2 variables, denoted by $\beta_0$ and $\beta_1$, that parameterize the linear function $f(\cdot)$. An example of linear regression can be seen in Figure 4 above, where P = 5. The figure also shows the fitted linear function with $\beta_0 = -90.798$ and $\beta_1 = 0.046$.
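As a minimal sketch, a similar fit can be reproduced with scikit-learn's LinearRegression; the data below is generated from the reported line plus a little noise, since the article's raw numbers are not included here.

```python
# Minimal linear regression sketch with scikit-learn.
# Synthetic (year, graduates) pairs stand in for the article's data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
years = np.arange(2001, 2013).reshape(-1, 1)                       # x, shape (N, 1)
grads = 0.046 * years.ravel() - 90.798 + rng.normal(0, 0.01, 12)   # y

model = LinearRegression().fit(years, grads)
print("beta_1 (slope):    ", round(model.coef_[0], 3))
print("beta_0 (intercept):", round(model.intercept_, 3))
print("prediction for 2018:", round(model.predict([[2018]])[0], 3))
```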

2. Polynomial Regression

Linear regression assumes that the relationship between the dependent variable (y) and the independent variable (x) is linear. It fails to fit the data points when the relationship between them is not linear. Polynomial regression expands the fitting capabilities of linear regression by instead fitting a polynomial of degree m to the data points. The richer the family of functions under consideration, the better (in general) its fitting capabilities. Mathematically speaking, polynomial regression solves the following problem:

$$\hat{\beta} = \arg\min_{\beta_0, \dots, \beta_m} \sum_{i=1}^{N} \Big(y_i - \sum_{j=0}^{m} \beta_j x_i^j\Big)^2$$

Hence we need to find the (m+1) variables denoted by $\beta_0, \dots, \beta_m$. It can be seen that linear regression is a special case of polynomial regression with degree m = 1.

Consider the following set of data points plotted as a scatter plot. If we use linear regression, we get a fit that clearly fails to estimate the data points. But if we use polynomial regression with degree 6, we get a much better fit, as shown below.

[Left] Scatter plot of data — [Center] Linear regression on data — [Right] Polynomial regression of degree 6

Since the data points did not have a linear relationship between the dependent and independent variables, linear regression failed to estimate a good fitting function. Polynomial regression, on the other hand, was able to capture the non-linear relationship.
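A small sketch of this comparison, assuming synthetic sine-shaped data (the article's actual points aren't reproduced here), using scikit-learn's PolynomialFeatures to expand x before a linear fit:

```python
# Polynomial regression: expand x into [1, x, x^2, ..., x^m], then fit linearly.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 30)   # non-linear data

for degree in (1, 6):   # degree 1 is plain linear regression
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    print(f"degree {degree}: training R^2 = {model.score(x, y):.3f}")
```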

3. Ridge Regression

Ridge regression addresses the issue of overfitting in regression analysis. To understand this, consider the same example as above. When a polynomial of degree 25 is fit to the data with only 10 training points, it can be seen that it fits the red data points perfectly (center figure below). But in doing so, it compromises the points in between (note the spike between the last two data points). Ridge regression tries to address this issue: it tries to minimize the generalization error by compromising the fit on the training points.

[Left] Scatter plot of data — [Center] Polynomial regression of degree 25 — [Right] Polynomial ridge regression of degree 25

Mathematically speaking, ridge regression solves the following problem by modifying the loss function:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \big(y_i - f_\beta(x_i)\big)^2 + \alpha \|\beta\|_2^2$$

The function f(x) can be either linear or polynomial. In the absence of ridge regularization, when the function overfits the data points, the learned weights tend to be quite large. Ridge regression avoids overfitting by limiting the norm of the learned weights: it adds the scaled L2 norm of the weights (beta) to the loss function. The trained model therefore trades off between fitting the data points perfectly (which requires a large weight norm) and keeping the norm of the weights small. The scaling constant alpha > 0 controls this trade-off. A small value of alpha results in larger-norm weights and overfitting of the training data points. A large value of alpha, on the other hand, results in a function that fits the training data points poorly but has a very small weight norm. Choosing the value of alpha carefully yields the best trade-off.
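A minimal sketch of this trade-off, assuming the same kind of synthetic sine-shaped data as in the earlier examples: fit a degree-25 polynomial with ridge regularization and watch the weight norm shrink as alpha grows.

```python
# Ridge regression on a degree-25 polynomial fit with only 10 training points.
# Larger alpha shrinks the weight norm at the cost of a looser training fit.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 10)

for alpha in (1e-6, 1e-3, 1.0):
    model = make_pipeline(PolynomialFeatures(degree=25), Ridge(alpha=alpha))
    model.fit(x, y)
    coefs = model.named_steps["ridge"].coef_
    print(f"alpha={alpha:g}: ||beta||_2 = {np.linalg.norm(coefs):.2f}, "
          f"training R^2 = {model.score(x, y):.4f}")
```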

4. LASSO regression

LASSO regression is similar to ridge regression, as both are used as regularizers against overfitting on the training data points. But LASSO comes with an additional benefit: it enforces sparsity on the learned weights.

Ridge regression enforces the norm of the learned weights to be small, yielding a set of weights whose total norm is reduced; most of the weights (if not all) will be non-zero. LASSO, on the other hand, tries to find a set of weights in which most are exactly (or very close to) zero. This yields a sparse weight vector whose implementation can be much more energy-efficient than a non-sparse one while maintaining similar accuracy in fitting the data points.

The figure below tries to visualize this idea on the same example as above. The data points are fit using both ridge and LASSO regression, and the corresponding fits and weights are plotted in ascending order. It can be seen that most of the weights in the LASSO regression are very close to zero.

Mathematically speaking, LASSO regression solves the following problem by modifying the loss function:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \big(y_i - f_\beta(x_i)\big)^2 + \alpha \|\beta\|_1$$

The difference between LASSO and ridge regression is that LASSO uses the L1 norm of the weights instead of the L2 norm. This L1 norm in the loss function tends to increase the sparsity of the learned weights.

The constant alpha > 0 controls the trade-off between the fit and the sparsity of the learned weights. A large value of alpha results in a poorer fit but a sparser set of learned weights. A small value of alpha, on the other hand, results in a tight fit on the training data points (which might lead to overfitting) but a less sparse set of weights.
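To see the sparsity effect described above, here is a small sketch (same synthetic setup as the ridge example) that counts how many polynomial weights each method leaves non-zero:

```python
# Ridge vs LASSO on the same degree-25 polynomial fit: LASSO zeroes out most weights.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 10)

for name, reg in [("ridge", Ridge(alpha=1e-3)),
                  ("lasso", Lasso(alpha=1e-3, max_iter=200_000))]:
    model = make_pipeline(PolynomialFeatures(degree=25), reg)
    model.fit(x, y)
    coefs = model.named_steps[name].coef_
    nonzero = np.count_nonzero(np.abs(coefs) > 1e-8)
    print(f"{name}: {nonzero} of {coefs.size} weights are non-zero")
```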

5. ElasticNet Regression

ElasticNet regression is a combination of ridge and LASSO regression: the loss term includes both the L1 and L2 norms of the weights, each with its own scaling constant. It is often used to address limitations of LASSO regression, for example that its objective is not strictly convex, so the solution may not be unique when features are highly correlated; the added quadratic penalty on the weights makes the problem strictly convex.

Mathematically speaking, ElasticNet regression solves the following problem by modifying the loss function:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \big(y_i - f_\beta(x_i)\big)^2 + \alpha_1 \|\beta\|_1 + \alpha_2 \|\beta\|_2^2$$
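In scikit-learn the two penalties are exposed through a single alpha and an l1_ratio mixing parameter rather than two separate constants; a minimal sketch on synthetic data:

```python
# ElasticNet: alpha scales the overall penalty, l1_ratio mixes L1 vs L2
# (l1_ratio=1 is pure LASSO, l1_ratio=0 is a pure ridge-style penalty).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 50)   # only 2 informative features

model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("learned weights:", np.round(model.coef_, 3))   # uninformative weights shrink toward zero
```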

6. Bayesian Regression

For the regression methods discussed above (the frequentist approach), the goal is to find a single set of deterministic values for the weights (beta) that explain the data. In Bayesian regression, instead of finding one value for each weight, we instead try to find the distribution of these weights, assuming a prior.

So we start off with an initial distribution of the weights and, based on the data, nudge the distribution in the right direction using Bayes' theorem, which relates the prior distribution to the posterior distribution through the likelihood and the evidence. As the number of data points goes to infinity, the posterior distribution of the weights becomes an impulse at the ordinary least squares solution, i.e. the variance approaches zero.

Finding the distribution of the weights instead of a single set of deterministic values serves two purposes:

1. It naturally guards against the issue of overfitting, hence acting as a regularizer.

2. It provides a confidence range for the weights, which makes more logical sense than returning just one value.

Let us mathematically formulate the problem and state its solution. Assume a Gaussian prior on the weights with mean μ and covariance Σ, i.e.

$$\beta \sim \mathcal{N}(\mu, \Sigma)$$

Based on the available data D, we update this distribution. For the problem at hand (a linear model with Gaussian observation noise of variance $\sigma^2$), the posterior is again a Gaussian distribution, with parameters

$$\Sigma_{\text{post}} = \left(\Sigma^{-1} + \frac{1}{\sigma^2} X^\top X\right)^{-1}, \qquad \mu_{\text{post}} = \Sigma_{\text{post}}\left(\Sigma^{-1}\mu + \frac{1}{\sigma^2} X^\top y\right)$$

where X is the matrix whose rows are the inputs $x_i$ and y is the vector of targets.
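scikit-learn's BayesianRidge implements a closely related model (it additionally places hyperpriors on the noise and weight precisions); a minimal sketch on synthetic data, showing the per-prediction uncertainty the posterior provides:

```python
# Bayesian linear regression sketch with BayesianRidge: the model returns a
# predictive mean and standard deviation instead of a single point estimate.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.3, 40)

model = BayesianRidge().fit(X, y)
mean, std = model.predict([[1.5]], return_std=True)
print(f"prediction at x=1.5: {mean[0]:.3f} +/- {std[0]:.3f}")
print("posterior mean of weight:", round(model.coef_[0], 3),
      "intercept:", round(model.intercept_, 3))
```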
7. Logistic Regression

Logistic regression comes in handy for classification tasks, where the output needs to be the conditional probability of the output class given the input. Mathematically speaking, logistic regression solves the following problem:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} -\left[ y_i \log \sigma\big(f_\beta(x_i)\big) + (1 - y_i) \log\Big(1 - \sigma\big(f_\beta(x_i)\big)\Big) \right], \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Consider the following example where the data points belong to one of two categories, {0 (red), 1 (yellow)}, as shown in the scatter plot below.

[Left] Scatter plot of data points — [Right] Logistic regression trained on the data points, plotted in blue

Logistic regression uses a sigmoid function at the output of the linear or polynomial function to map the output from (-∞, ∞) to (0, 1). A threshold (usually 0.5) is then used to categorize the test data into one of the two categories.
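A minimal sketch with scikit-learn on synthetic one-dimensional data, showing the sigmoid probability output and the 0.5-threshold class decision:

```python
# Logistic regression: predict_proba gives the sigmoid output in (0, 1);
# predict applies the default 0.5 threshold to pick a class in {0, 1}.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = (x.ravel() + rng.normal(0, 0.5, 100) > 0).astype(int)   # synthetic labels

clf = LogisticRegression().fit(x, y)
print("P(y=1 | x=0.3):", round(clf.predict_proba([[0.3]])[0, 1], 3))
print("predicted class:", clf.predict([[0.3]])[0])
```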

This may make it seem like logistic regression is not regression but a classification algorithm. But that is not the case. You can find out more about it in Adrian's post here.

https://towardsdatascience.com/a-beginners-guide-to-regression-analysis-in-machine-learning-8a828b491bbf
