MODULE 5: Machine Learning 18AI61

MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES

Consider the problem of learning a continuous-valued target function, a setting that arises in neural network learning, linear regression, and polynomial curve fitting.

A straightforward Bayesian analysis shows that, under certain assumptions, any learning algorithm that minimizes the squared error between the predictions of the output hypothesis and the training data will output a maximum likelihood (ML) hypothesis.

• Learner L considers an instance space X and a hypothesis space H consisting of some class of real-valued functions defined over X, i.e., (∀h ∈ H)[h : X → R], and training examples of the form <x_i, d_i>.
• The problem faced by L is to learn an unknown target function f : X → R.
• A set of m training examples is provided, where the target value of each example is corrupted by random noise drawn according to a Normal probability distribution with zero mean.
• Each training example is a pair of the form (x_i, d_i), where d_i = f(x_i) + e_i.
  – Here f(x_i) is the noise-free value of the target function and e_i is a random variable representing the noise.
  – It is assumed that the values of the e_i are drawn independently and that they are distributed according to a Normal distribution with zero mean.
• The task of the learner is to output a maximum likelihood hypothesis, or equivalently a MAP hypothesis, assuming all hypotheses are equally probable a priori. (A data-generation sketch follows this list.)
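The following is a minimal Python sketch of this data model. The linear target f, the number of examples m, and the noise level sigma are illustrative assumptions, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical noise-free target function (assumed for illustration)
    return 2.0 * x + 1.0

m = 50          # number of training examples
sigma = 0.3     # standard deviation of the zero-mean Normal noise

x = rng.uniform(0.0, 1.0, size=m)    # instances x_i drawn from X
e = rng.normal(0.0, sigma, size=m)   # independent noise terms e_i ~ N(0, sigma^2)
d = f(x) + e                         # observed targets d_i = f(x_i) + e_i

training_examples = list(zip(x, d))  # training pairs <x_i, d_i>
```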

Using the definition of h_ML we have

    h_ML = argmax_{h ∈ H} p(D|h)

Assuming the training examples are mutually independent given h, we can write P(D|h) as the product of the individual p(d_i|h):

    h_ML = argmax_{h ∈ H} ∏_{i=1}^{m} p(d_i|h)

Given that the noise e_i obeys a Normal distribution with zero mean and unknown variance σ², each d_i must also obey a Normal distribution with variance σ² centered around the true target value f(x_i). Because we are writing the expression for P(D|h), we assume h is the correct description of f; hence µ = f(x_i) = h(x_i), and we have

    h_ML = argmax_{h ∈ H} ∏_{i=1}^{m} (1/√(2πσ²)) e^{−(d_i − h(x_i))² / (2σ²)}


We instead maximize the less complicated logarithm of this expression, which is justified because ln p is a monotonic function of p:

    h_ML = argmax_{h ∈ H} ∑_{i=1}^{m} [ ln(1/√(2πσ²)) − (d_i − h(x_i))² / (2σ²) ]

The first term in this expression is a constant independent of h and can therefore be discarded, yielding

    h_ML = argmax_{h ∈ H} ∑_{i=1}^{m} −(d_i − h(x_i))² / (2σ²)

Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity:

    h_ML = argmin_{h ∈ H} ∑_{i=1}^{m} (d_i − h(x_i))² / (2σ²)

Finally, we discard the constant 1/(2σ²), which is independent of h:

    h_ML = argmin_{h ∈ H} ∑_{i=1}^{m} (d_i − h(x_i))²

Thus, the above equation shows that the maximum likelihood hypothesis h_ML is the one that minimizes the sum of the squared errors between the observed training values d_i and the hypothesis predictions h(x_i).
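The equivalence can be checked numerically. The sketch below searches a grid of hypothetical linear hypotheses h(x) = w·x + b and confirms that the hypothesis minimizing the sum of squared errors is also the one maximizing the Gaussian log-likelihood; the data, the grid, and the parameter ranges are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3
x = rng.uniform(0.0, 1.0, 50)
d = 2.0 * x + 1.0 + rng.normal(0.0, sigma, 50)   # noisy targets d_i

best_sse = (np.inf, None)     # (sum of squared errors, (w, b))
best_ll = (-np.inf, None)     # (log-likelihood, (w, b))

for w in np.linspace(0.0, 4.0, 81):
    for b in np.linspace(-1.0, 3.0, 81):
        resid = d - (w * x + b)
        sse = np.sum(resid ** 2)
        # Log of the product of Normal densities p(d_i | h)
        ll = np.sum(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                    - resid ** 2 / (2.0 * sigma ** 2))
        if sse < best_sse[0]:
            best_sse = (sse, (w, b))
        if ll > best_ll[0]:
            best_ll = (ll, (w, b))

# Both criteria select the same hypothesis (w, b)
print("argmin SSE           :", best_sse[1])
print("argmax log-likelihood:", best_ll[1])
```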

Note:
Why is it reasonable to choose the Normal distribution to characterize noise?
• It is a good approximation of many types of noise in physical systems.
• The Central Limit Theorem shows that the sum of a sufficiently large number of independent, identically distributed random variables itself obeys a distribution that is approximately Normal, whatever the underlying distribution. (A quick simulation appears below.)
A limitation of this analysis is that only noise in the target value is considered, not noise in the attributes describing the instances themselves.
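The following is a small simulation of the Central Limit Theorem; the choice of Uniform(0, 1) noise sources and the sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, trials = 30, 100_000
# Sum n iid Uniform(0, 1) variables, repeated over many trials
sums = rng.uniform(0.0, 1.0, size=(trials, n)).sum(axis=1)

# The CLT predicts an approximately Normal distribution with
# mean n/2 and variance n/12 (moments of Uniform(0, 1) scaled by n)
mu, sd = n / 2.0, np.sqrt(n / 12.0)
print("empirical mean:", sums.mean(), " predicted:", mu)
print("empirical std :", sums.std(), " predicted:", sd)

# For a Normal distribution, ~68.3% of the mass lies within one sd
print("within 1 sd   :", np.mean(np.abs(sums - mu) < sd))
```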
