MODULE 5: Machine Learning 18AI61
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES
Consider the problem of learning a continuous-valued target function, a problem that arises in neural network learning, linear regression, and polynomial curve fitting.
A straightforward Bayesian analysis shows that, under certain assumptions, any learning algorithm that minimizes the squared error between the hypothesis predictions and the training data will output a maximum likelihood (ML) hypothesis.
• Learner L considers an instance space X and a hypothesis space H consisting of some class of real-valued functions defined over X, i.e., (∀ h ∈ H)[h : X → R], and training examples of the form <xi, di>.
• The problem faced by L is to learn an unknown target function f : X → R
• A set of m training examples is provided, where the target value of each example is
corrupted by random noise drawn according to a Normal probability distribution with
zero mean (di = f(xi) + ei)
• Each training example is a pair of the form (xi, di), where di = f(xi) + ei.
– Here f(xi) is the noise-free value of the target function and ei is a random variable
representing the noise.
– It is assumed that the values of the ei are drawn independently and that they are
distributed according to a Normal distribution with zero mean.
• The task of the learner is to output a maximum likelihood hypothesis or, equivalently, a MAP hypothesis, assuming all hypotheses are equally probable a priori.
Using the definition of hML we have
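$$ h_{ML} = \arg\max_{h \in H} \; p(D \mid h) $$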
Assuming the training examples are mutually independent given h, we can write P(D|h) as the product of the individual p(di|h):
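$$ h_{ML} = \arg\max_{h \in H} \; \prod_{i=1}^{m} p(d_i \mid h) $$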
Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ2, each di must also obey a Normal distribution with variance σ2 centered around the true target value f(xi) rather than around zero. Because we are writing the expression for P(D|h), we also assume h is the correct description of f.
Hence, µ = f(xi) = h(xi), and we can write
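$$ h_{ML} = \arg\max_{h \in H} \; \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^{2}}} \, e^{-\frac{1}{2\sigma^{2}}\left(d_i - h(x_i)\right)^{2}} $$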
Rather than maximizing this complicated expression, we maximize its (less complicated) logarithm; this is justified because ln p is a monotonic function of p:
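$$ h_{ML} = \arg\max_{h \in H} \; \sum_{i=1}^{m} \left[ \ln\frac{1}{\sqrt{2\pi\sigma^{2}}} - \frac{1}{2\sigma^{2}}\left(d_i - h(x_i)\right)^{2} \right] $$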
The first term in this expression is a constant independent of h, and can therefore be
discarded, yielding
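$$ h_{ML} = \arg\max_{h \in H} \; \sum_{i=1}^{m} -\frac{1}{2\sigma^{2}}\left(d_i - h(x_i)\right)^{2} $$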
Maximizing this negative quantity is equivalent to minimizing the corresponding positive
quantity
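$$ h_{ML} = \arg\min_{h \in H} \; \sum_{i=1}^{m} \frac{1}{2\sigma^{2}}\left(d_i - h(x_i)\right)^{2} $$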
Finally, we can again discard constants that are independent of h:
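$$ h_{ML} = \arg\min_{h \in H} \; \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^{2} $$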
Thus, the above equation shows that the maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors between the observed training values di and the hypothesis predictions h(xi).
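To make this result concrete, the following minimal Python sketch compares the two criteria on synthetic data; the linear hypothesis class h(x) = w·x + b, the target f(x) = 3x + 1, the noise level σ = 0.5, and the grid of candidate hypotheses are all illustrative assumptions, not part of the derivation above. Because ln P(D|h) is a constant minus the sum of squared errors scaled by 1/(2σ2), both criteria select the same hypothesis.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic training examples: di = f(xi) + ei, with an assumed target
    # f(x) = 3x + 1 and Gaussian noise ei with zero mean and variance sigma^2.
    m, sigma = 50, 0.5
    x = rng.uniform(-1, 1, size=m)
    d = 3.0 * x + 1.0 + rng.normal(0.0, sigma, size=m)

    def sum_squared_error(w, b):
        """Sum over the training examples of (di - h(xi))^2 for h(x) = w*x + b."""
        return np.sum((d - (w * x + b)) ** 2)

    def log_likelihood(w, b):
        """ln P(D|h) assuming di ~ Normal(h(xi), sigma^2), independently."""
        resid = d - (w * x + b)
        return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

    # Search a small grid of candidate hypotheses: the minimizer of the squared
    # error and the maximizer of the log-likelihood coincide, as derived above.
    grid = [(w, b) for w in np.linspace(2, 4, 41) for b in np.linspace(0, 2, 41)]
    h_lse = min(grid, key=lambda p: sum_squared_error(*p))
    h_ml = max(grid, key=lambda p: log_likelihood(*p))
    print("least-squared-error hypothesis:", h_lse)
    print("maximum likelihood hypothesis: ", h_ml)  # same (w, b) as h_lse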
Note:
Why is it reasonable to choose the Normal distribution to characterize noise?
• Good approximation of many types of noise in physical systems
• The Central Limit Theorem shows that the sum of a sufficiently large number of independent, identically distributed random variables itself obeys a Normal distribution, regardless of the distributions of the individual variables.
The above analysis considers only noise in the target value of the training examples; it does not consider noise in the attributes describing the instances themselves.