INSTANCE-BASED
LEARNING
Introduction - KNN Learning - Locally Weighted Regression -
Radial Basis Functions - Case-Based Reasoning
5.1 INTRODUCTION
In contrast to learning methods that construct a general, explicit description
of the target function when training examples are provided, instance-based
learning methods simply store the training examples. Generalizing beyond these
examples is postponed until a new instance must be classified. Each time a new
query instance is encountered, its relationship to the previously stored
examples is examined in order to assign a target function value for the new
instance. Instance-based learning includes:
1. k-Nearest neighbor classification
2. Locally weighted regression methods
Both of these methods assume instances can be represented as points in a
Euclidean space.
3. Case-based reasoning methods that use more complex, symbolic
representations for instances.
Instance-based methods are sometimes referred to as "lazy" learning
methods because they delay processing until a new instance must be classified.
The key advantage of this kind of delayed, or lazy, learning is that instead of
estimating the target function once for the entire instance space, these methods
can estimate it locally and differently for each new instance to be classified.
Instance-based learning methods such as nearest neighbor and locally
weighted regression are conceptually straightforward approaches to
approximating real-valued or discrete-valued target functions. Learning in
these algorithms consists of simply storing the presented training data. When
a new query instance is encountered, a set of similar related instances is
retrieved from memory and used to classify the new query instance.
Instance-based approaches can construct a different approximation to the
target function for each distinct query instance that must be classified. In fact,
many instance-based techniques (in contrast to eager learners such as decision
trees, naive Bayes classifiers, and ANNs) construct only a local
approximation to the target function that applies in the neighborhood of the
new query instance, and never construct an approximation designed to
perform well over the entire instance space. This has significant advantages
when the target function is very complex, but can still be described by a
collection of less complex local approximations.
One disadvantage of instance-based approaches is that the cost of
classifying new instances can be high. This is due to the fact that nearly all
computation takes place at classification time rather than when the training
examples are first encountered. Therefore, techniques for efficiently indexing
training examples are a significant practical issue in reducing the computation
required at query time.
5.2 K-NEAREST NEIGHBOR LEARNING (KNN Classifier)
The most basic instance-based method is the k-NEAREST NEIGHBOR algorithm.
This algorithm assumes all instances correspond to points in the n-dimensional
space R^n.
The nearest neighbors of an instance are defined in terms of the standard Euclidean
distance. More precisely, let an arbitrary instance x be described by the feature
vector

    < a1(x), a2(x), ..., an(x) >

where ar(x) denotes the value of the rth attribute of instance x. Then the distance
between two instances xi and xj is defined to be d(xi, xj), where

    d(xi, xj) = sqrt( sum_{r=1..n} ( ar(xi) - ar(xj) )^2 )
The following Figure illustrates the operation of the KNN algorithm for the case
where the instances are points in a two-dimensional space and where the target
function is boolean valued. The positive and negative training examples are shown
by "+" and "-“ respectively. A query point Xq is shown as well. Note the
1-NEAREST NEIGHBOR algorithm classifies x, as a positive example in this figure,
whereas the 5-NEAREST NEIGHBOR algorithm classifies it as a negative example.
[Figure: left, training instances and the query point Xq in the Euclidean space;
right, the Voronoi diagram induced by 1-NEAREST NEIGHBOR.]
Voronoi diagram - Describes the areas that are nearest to any given point, given
a set of data. Each line segment is equidistant between two points of opposite
class.
The diagram on the right side of this figure shows the shape of the decision
surface induced by 1-NEAREST NEIGHBOR over the entire instance space. The
decision surface is a combination of convex polyhedra surrounding each of the
training examples. For every training example, the polyhedron indicates the set
of query points whose classification will be completely determined by that
training example. Query points outside the polyhedron are closer to some other
training example. This kind of diagram is often called the Voronoi diagram of the
set of training examples.
The k-NEAREST NEIGHBOR algorithm is given below.
KNN: Steps
Training algorithm:
• For each training example <x, f(x)>, add the example to the list
  training_examples.
Classification algorithm:
• Given a query instance Xq to be classified:
  • Let x1 ... xk denote the k instances from training_examples that are nearest
    to Xq.
  • Return f^(Xq), the most common value of f among x1 ... xk.
As shown there, the value f^(Xq) returned by this algorithm as its estimate of
f(Xq) is just the most common value of f among the k training examples nearest to
Xq. If we choose k = 1, then the 1-NEAREST NEIGHBOR algorithm assigns to f^(Xq)
the value f(xi), where xi is the training instance nearest to Xq. For larger values
of k, the algorithm assigns the most common value among the k nearest training
examples.
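To make these steps concrete, here is a minimal Python sketch of the
discrete-valued k-NEAREST NEIGHBOR classifier described above; the function names
and the toy data are illustrative, not part of the original text.

import math
from collections import Counter

def euclidean(x1, x2):
    # Standard Euclidean distance between two feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(training_examples, xq, k=5):
    # training_examples: list of (feature_vector, label) pairs that were simply stored.
    # Keep the k stored examples nearest to the query instance Xq.
    neighbors = sorted(training_examples, key=lambda ex: euclidean(ex[0], xq))[:k]
    # Return the most common label (majority vote) among the k nearest neighbors.
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

# Example: the 1-NN and 5-NN answers can differ, as in the figure above.
data = [((1.0, 1.0), '+'), ((1.2, 0.8), '-'), ((0.9, 1.3), '-'),
        ((3.0, 3.0), '-'), ((0.5, 0.5), '-'), ((1.1, 1.1), '+')]
xq = (1.0, 1.05)
print(knn_classify(data, xq, k=1))  # nearest single neighbor decides: '+'
print(knn_classify(data, xq, k=5))  # majority of the 5 nearest decides: '-'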
Distance-Weighted NEAREST NEIGHBOR Algorithm
One refinement to the k-NEAREST NEIGHBOR algorithm is to weight the
contribution of each of the k neighbors according to its distance from the query
point Xq, giving greater weight to closer neighbors. For the discrete-valued
version of the algorithm, we might weight the vote of each neighbor according to
the inverse square of its distance from Xq:

    f^(Xq) = argmax over v in V of  sum_{i=1..k} wi * delta(v, f(xi)),
             where wi = 1 / d(Xq, xi)^2

and delta(v, f(xi)) = 1 if v = f(xi) and 0 otherwise. To approximate a real-valued
target function, we can instead return the distance-weighted average

    f^(Xq) = ( sum_{i=1..k} wi * f(xi) ) / ( sum_{i=1..k} wi )

Here the denominator in the above equation is a constant that normalizes the
contributions of the various weights.
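A minimal sketch of the distance-weighted variant for a real-valued target
function, following the weighted-average formula above (illustrative Python; as a
simplifying assumption, a query that coincides exactly with a stored instance
simply returns that instance's value, avoiding division by zero):

import math

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def distance_weighted_knn(training_examples, xq, k=5):
    # training_examples: list of (feature_vector, real_value) pairs.
    neighbors = sorted(training_examples, key=lambda ex: euclidean(ex[0], xq))[:k]
    num, den = 0.0, 0.0
    for x, fx in neighbors:
        d = euclidean(x, xq)
        if d == 0.0:
            return fx            # query matches a stored instance exactly
        w = 1.0 / d ** 2         # weight each neighbor by inverse squared distance
        num += w * fx
        den += w
    return num / den             # denominator normalizes the weight contributions

data = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 4.0), ((3.0,), 9.0)]
print(distance_weighted_knn(data, (1.4,), k=3))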
5.3 LOCALLY WEIGHTED REGRESSION
The nearest-neighbor approaches described in the previous
section can be thought of as approximating the target function f (x)
at the single query point x = Xq. Locally weighted regression is a
generalization of this approach. It constructs an explicit
approximation to f over a local region surrounding Xq. Locally
weighted regression uses nearby or distance-weighted training
examples to form this local approximation to f.
For example, we might approximate the target function in
the neighborhood surrounding Xq, using a linear function, a
quadratic function, a multilayer neural network, or some other
functional form. The phrase "locally weighted regression" is so called: local
because the function is approximated based only on data near the query point,
weighted because the contribution of each training example is weighted by its
distance from the query point,
and regression because this is the term used widely in the statistical
learning community for the problem of approximating real-valued
functions.
Given a new query instance Xq, the general approach in locally weighted
regression is to construct an approximation f^ that fits the training examples in
the neighborhood surrounding Xq. This approximation is then used to calculate
the value f^(Xq), which is output as the estimated target value for the query
instance. The description of f^ may then be deleted, because a different local
approximation will be calculated for each distinct query instance.
Locally Weighted Linear Regression
Consider the case of locally weighted regression in which the target function
f is approximated near Xq using a linear function of the form

    f^(x) = w0 + w1*a1(x) + ... + wn*an(x)

As before, ai(x) denotes the value of the ith attribute of the instance x.
In the discussion of ANNs we used methods such as gradient descent to find the
coefficients w0 ... wn that minimize the error in fitting such linear functions to
a given set of training examples. There we were interested in a global
approximation to the target function, so we chose weights that minimize the
squared error summed over the set D of training examples:

    E = (1/2) * sum_{x in D} ( f(x) - f^(x) )^2

which leads to the gradient descent training rule

    delta_wj = eta * sum_{x in D} ( f(x) - f^(x) ) * aj(x)

where eta is a constant learning rate, and where the training rule has been
re-expressed from the ANN notation to fit our current notation.
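For reference, a small sketch of that global procedure: batch gradient descent for
the linear form f^(x) = w0 + w1*a1(x) + ... + wn*an(x) over all of D (illustrative
Python; the learning rate eta and the iteration count are arbitrary choices, not
values from the text).

def fit_linear_gradient_descent(examples, n_attrs, eta=0.01, iterations=1000):
    # examples: list of (attribute_vector, target_value) pairs; w[0] is the bias w0.
    w = [0.0] * (n_attrs + 1)
    for _ in range(iterations):
        # Accumulate the gradient of the squared error over the whole training set D.
        grad = [0.0] * (n_attrs + 1)
        for a, fx in examples:
            f_hat = w[0] + sum(w[j + 1] * a[j] for j in range(n_attrs))
            err = fx - f_hat
            grad[0] += err                      # a0(x) = 1 for the bias term
            for j in range(n_attrs):
                grad[j + 1] += err * a[j]
        # delta_wj = eta * sum_x (f(x) - f^(x)) * aj(x)
        for j in range(n_attrs + 1):
            w[j] += eta * grad[j]
    return w

# Fit f(x) = 2*x + 1 from a few noiseless examples; result is close to [1.0, 2.0].
examples = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0)]
print(fit_linear_gradient_descent(examples, n_attrs=1))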
To derive a local rather than a global approximation, the simple way is to
redefine the error criterion E to emphasize fitting the local training examples.
Three possible criteria are given below. Note we write the error E(Xq) to
emphasize the fact that now the error is being defined as a function of the query
point Xq.

1. Minimize the squared error over just the k nearest neighbors:

    E1(Xq) = (1/2) * sum_{x in k nearest nbrs of Xq} ( f(x) - f^(x) )^2

2. Minimize the squared error over the entire set D of training examples, while
   weighting the error of each training example by some decreasing function K of
   its distance from Xq:

    E2(Xq) = (1/2) * sum_{x in D} ( f(x) - f^(x) )^2 * K( d(Xq, x) )

3. Combine 1 and 2:

    E3(Xq) = (1/2) * sum_{x in k nearest nbrs of Xq} ( f(x) - f^(x) )^2 * K( d(Xq, x) )

Criterion two allows every training example to contribute, but it requires
computation that grows linearly with the number of training examples. Criterion
three is a good approximation to criterion two and has the advantage that
computational cost is independent of the total number of training examples; its
cost depends only on the number k of neighbors considered.
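The following sketch implements locally weighted linear regression roughly in the
spirit of criterion three: fit a linear model to the k nearest neighbors of Xq,
weight each neighbor's squared error by a Gaussian kernel of its distance, and
solve the resulting weighted least-squares problem in closed form (illustrative
Python with NumPy; the kernel width and the use of a closed-form solver rather
than gradient descent are our own assumptions).

import numpy as np

def locally_weighted_linear_regression(X, y, xq, k=5, width=1.0):
    # X: (m, n) array of training instances, y: (m,) array of target values.
    X, y, xq = np.asarray(X, float), np.asarray(y, float), np.asarray(xq, float)
    d = np.linalg.norm(X - xq, axis=1)                # distances to the query point Xq
    nearest = np.argsort(d)[:k]                       # indices of the k nearest neighbors
    K = np.exp(-d[nearest] ** 2 / (2 * width ** 2))   # kernel weights K(d(Xq, x))
    A = np.hstack([np.ones((k, 1)), X[nearest]])      # design matrix with bias column (w0)
    # Weighted least squares: minimize sum K * (f(x) - f^(x))^2 over the k neighbors.
    W = np.sqrt(K)[:, None]
    w, *_ = np.linalg.lstsq(W * A, np.sqrt(K) * y[nearest], rcond=None)
    return w[0] + w[1:] @ xq                          # f^(Xq); the local fit is then discarded

# Approximate f(x) = x^2 near the query point 1.5 using a local linear model.
X = [[x] for x in np.linspace(0, 3, 13)]
y = [x[0] ** 2 for x in X]
print(locally_weighted_linear_regression(X, y, [1.5], k=5, width=0.5))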
5.4 RADIAL BASIS FUNCTIONS
One approach to function approximation that is closely related
to distance-weighted regression and also to artificial neural
networks is learning with radial basis functions. In this approach,
the learned hypothesis is a function of the form

    f^(x) = w0 + sum_{u=1..k} wu * Ku( d(Xu, x) )

where each Xu is an instance from X and where the kernel function Ku(d(Xu, x)) is
defined so that it decreases as the distance d(Xu, x) increases. Here k is a
user-provided constant that specifies the number of kernel functions to be
included. Even though f^(x) is a global approximation to f(x), the contribution
from each of the Ku(d(Xu, x)) terms is localized to a region near the point Xu. It
is common to choose each function Ku(d(Xu, x)) to be a Gaussian function centered
at the point Xu with some variance sigma_u^2:

    Ku( d(Xu, x) ) = exp( - d(Xu, x)^2 / (2 * sigma_u^2) )
The function given above can be viewed as describing a two-layer
network where the first layer of units computes the values of the various
Ku(d(Xu, X)) and where the second layer computes a linear combination of these
first-layer unit values. An example radial basis function (RBF) network is illustrated
below.
Each hidden unit produces an activation determined by a Gaussian function
centered at some instance Xu. Therefore, its activation will be close to zero
unless the input x is near Xu. The output unit produces a linear combination of
the hidden unit activations. Although the network shown here has just one output
unit, multiple output units can also be included.
Given a set of training examples of the target function, RBF networks are
typically trained in a two-stage process. First, the number k of hidden units is
determined and each hidden unit u is defined by choosing the values of Xu and
the variance sigma_u^2 that define its kernel function Ku(d(Xu, x)). Second, the
weights Wu are trained to maximize the fit of the network to the training data,
using the global error criterion. Because the kernel functions are held fixed
during this second stage, the linear weight values Wu can be trained very
efficiently.
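A sketch of this two-stage procedure (illustrative Python with NumPy; choosing the
centers Xu as evenly spaced points and sharing a single width sigma are simplifying
assumptions): the kernels are fixed first, then the linear weights Wu are fit
efficiently by least squares.

import numpy as np

class RBFNetwork:
    def __init__(self, centers, sigma):
        self.centers = np.asarray(centers, float)  # the chosen instances Xu
        self.sigma = sigma                         # shared kernel width
        self.w = None                              # linear weights w0 ... wk

    def _activations(self, X):
        # First layer: Gaussian kernel Ku(d(Xu, x)) for every center, plus a bias column.
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        K = np.exp(-d2 / (2 * self.sigma ** 2))
        return np.hstack([np.ones((len(X), 1)), K])

    def fit(self, X, y):
        # Second stage: with the kernels held fixed, fitting the weights is a linear problem.
        Phi = self._activations(np.asarray(X, float))
        self.w, *_ = np.linalg.lstsq(Phi, np.asarray(y, float), rcond=None)
        return self

    def predict(self, X):
        # Second layer: linear combination of the hidden-unit activations.
        return self._activations(np.asarray(X, float)) @ self.w

# Approximate f(x) = sin(x) on [0, 6] with k = 8 kernels at evenly spaced centers.
X = np.linspace(0, 6, 40)[:, None]
y = np.sin(X[:, 0])
net = RBFNetwork(centers=np.linspace(0, 6, 8)[:, None], sigma=0.8).fit(X, y)
print(net.predict(np.array([[1.5], [3.0]])))   # should be close to sin(1.5), sin(3.0)

If instead every training instance is used as a kernel center (one Gaussian per
example), the same least-squares step can fit the training data exactly, as
discussed next.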
Several alternative methods have been proposed for choosing an appropriate
number of hidden units or, equivalently, kernel functions. One approach is to
allocate a Gaussian kernel function for each training example
<xi, f(xi)>, centering this Gaussian at the point xi. Each of these kernels may be
assigned the same width (variance) sigma^2.
Given this approach, the RBF network learns a global approximation to the
target function in which each training example <xi, f(xi)> can influence the value
of f^ only in the neighborhood of xi. One advantage of this choice of kernel
functions is that it allows the RBF network to fit the training data exactly. That
is, for any set of m training examples, the weights w0 ... wm for combining the m
Gaussian kernel functions can be set so that f^(xi) = f(xi) for each training
example <xi, f(xi)>.
5.5 CASE-BASED REASONING
Instance-based methods such as k-NEAREST NEIGHBOR and locally weighted
regression share three key properties.
• First, they are lazy learning methods in that they defer the decision of how to
generalize beyond the training data until a new query instance is observed.
• Second, they classify new query instances by analyzing similar instances while
ignoring instances that are very different from the query.
• Third, they represent instances as real-valued points in an n-dimensional
Euclidean space.
Case-based reasoning (CBR) is a learning paradigm based on the first two of
these principles, but not the third. In CBR, instances are typically represented
using richer symbolic descriptions, and the methods used to retrieve similar
instances are correspondingly more elaborate. CBR has been applied to problems
such as conceptual design of mechanical devices based on a stored library of
previous designs, reasoning about new legal cases based on previous rulings, and
solving planning and scheduling problems by reusing and combining portions of
previous solutions to similar problems.
Let us consider a prototypical example of a case-based reasoning system. The
CADET system employs case-based reasoning to assist in the conceptual
design of simple mechanical devices such as water faucets. It uses a library
containing approximately 75 previous designs and design fragments to suggest
conceptual designs to meet the specifications of new design problems. Each
instance stored in memory (e.g., a water pipe) is represented by describing both
its structure and its qualitative function. New design problems are then
presented by specifying the desired function and requesting the corresponding
structure. This problem setting is illustrated in the figure below.
This figure shows the description of a typical stored case called a
T-junction pipe. Its function is represented in terms of the qualitative
relationships among the water flow levels and temperatures at its inputs and
outputs. In the functional description at its right, an arrow with a "+" label
indicates that the variable at the arrowhead increases with the variable at its tail.
For example, the output waterflow Q3 increases with increasing input waterflow
Q1.
Similarly a "-" label indicates that the variable at the head decreases with the
variable at the tail.
The bottom half of this figure depicts a new design problem described by its
desired function.
This particular function describes the required behavior of one type of water
faucet. Here Qc refers to the flow of cold water into the faucet, Qh to the input
flow of hot water, and Qm to the single mixed flow out of the faucet. Similarly, Tc,
Th, and Tm refer to the temperatures of the cold water, hot water, and mixed
water respectively. The variable Ct denotes the control signal for temperature that
is input to the faucet, and Cf denotes the control signal for waterflow. Note the
description of the desired function specifies that these controls Ct and Cf are to
influence the water flows Qc and Qh, thereby indirectly influencing the faucet
output flow Qm and temperature Tm.
Given this functional specification for the new design problem, CADET
searches its library for stored cases whose functional descriptions match the
design problem. If an exact match is found, indicating that some stored case
implements exactly the desired function, then this case can be returned as a
suggested solution to the design problem. If no exact match occurs, CADET may
find cases that match various sub graphs of the desired functional specification.
In the figure above, for example, the T-junction function matches a subgraph of
the water faucet function graph. More generally, CADET searches for subgraph
isomorphisms between the two function graphs, so that parts of a case can be found to
match parts of the design specification. Furthermore, the system may elaborate the original
function specification graph in order to create functionally equivalent graphs that may
match still more cases. It uses general knowledge about physical influences to create these
elaborated function graphs. For example, it uses a rewrite rule that allows it to
rewrite a single "+" influence from a quantity A to a quantity B as a chain of "+"
influences from A to some intermediate quantity x and from x to B, since if B must
increase with A, it is sufficient that x increase with A and B increase with x.
By retrieving multiple cases that match different subgraphs, the entire design can
sometimes be pieced together. In general, the process of producing a final solution
from multiple retrieved cases can be very complex. It may require designing
portions of the system from first principles, in addition to merging retrieved
portions from stored cases. It may also require backtracking on earlier choices of
design subgoals and, therefore, rejecting cases that were previously retrieved.
CADET has very limited capabilities for combining and adapting multiple retrieved
cases to form the final design and relies heavily on the user for this adaptation stage
of the process.