Well-posed learning problem: "A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance at tasks in T, as measured
by P, improves with experience E." (Tom Mitchell)
"Field of study that gives computers the ability to learn without being explicitly programmed".
- T : Play checkers
- P : % of games won in world tournament
- E: opportunity to play against self
Examples of tasks that are best solved by using a learning algorithm:
• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
• Generating patterns:
– Generating images or motion sequences
• Recognizing anomalies:
– Unusual sequences of credit card transactions
– Unusual patterns of sensor readings in a nuclear power plant or unusual sound in
your car engine.
• Prediction:
– Future stock prices or currency exchange rates
• The web contains a lot of data. Tasks with very big datasets often use machine learning
– especially if the data is noisy or non-stationary.
• Spam filtering, fraud detection:
– The enemy adapts so we must adapt too.
• Recommendation systems:
– Lots of noisy data (e.g., the Netflix million-dollar prize).
• Information retrieval:
– Find documents or images with similar content.
• Data Visualization:
– Display a huge database in a revealing way
• Supervised learning
o Training data includes desired outputs. Examples include:
o Prediction
o Classification (discrete labels), Regression (real values)
• Unsupervised learning
o Training data does not include desired outputs. Examples include:
o Clustering
o Probability distribution estimation
o Finding association (in features)
o Dimension reduction
• Semi-supervised learning
o Training data includes a few desired outputs
• Reinforcement learning
o Rewards from sequence of actions
o Policies: what actions should an agent take in a particular situation
o Utility estimation: how good is a state (→used by policy)
o No supervised output but delayed reward
o Credit assignment problem (what was responsible for the outcome)
o Applications:
o Game playing
o Robot in a maze
o Multiple agents, partial observability, ...
Hypothesis Space
• One way to think about a supervised learning machine is as a device that explores a
“hypothesis space”.
– Each setting of the parameters in the machine is a different hypothesis about the
function that maps input vectors to output vectors.
– If the data is noise-free, each training example rules out a region of hypothesis
space.
– If the data is noisy, each training example scales the posterior probability of each
point in the hypothesis space in proportion to how likely the training example is
given that hypothesis.
• The art of supervised machine learning is in:
– Deciding how to represent the inputs and outputs
– Selecting a hypothesis space that is powerful enough to represent the relationship
between inputs and outputs but simple enough to be searched.
Generalization
• The real aim of supervised learning is to do well on test data that is not known during
learning.
• Choosing the values for the parameters that minimize the loss function on the training data
is not necessarily the best policy.
• We want the learning machine to model the true regularities in the data and to ignore the
noise in the data.
– But the learning machine does not know which regularities are real and which are
accidental quirks of the particular set of training examples we happen to pick.
• So how can we be sure that the machine will generalize correctly to new data?
• A standard approach is to hold out part of the data as a validation set: train on the remainder and use the error on the held-out data as an unbiased estimate of the true error rate.
• We could then re-divide the total dataset to get another unbiased estimate of the true error
rate.
Learning Associations:
• Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are
products/services.
Example: P(chips | beer) = 0.7, i.e., 70 percent of customers who buy beer also buy chips.
We may want to make a distinction among customers and toward this, estimate P(Y|X,D) where
D is the set of customer attributes, for example, gender, age, marital status, and so on, assuming
that we have access to this information. If this is a bookseller instead of a supermarket, products
can be books or authors. In the case of a Web portal, items correspond to links to Web pages, and
we can estimate the links a user is likely to click and use this information to download such
pages in advance for faster access.
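Estimating such a conditional probability from transaction data is simple counting. A minimal sketch in Python; the baskets below are invented for illustration:

# Estimate P(Y | X) from a list of market baskets (invented data).
transactions = [
    {"beer", "chips"}, {"beer", "chips", "salsa"}, {"beer"},
    {"chips"}, {"beer", "chips"}, {"milk", "bread"},
]

def conditional_prob(transactions, x, y):
    # P(y | x) = #baskets containing both x and y / #baskets containing x
    with_x = [t for t in transactions if x in t]
    return sum(1 for t in with_x if y in t) / len(with_x) if with_x else 0.0

print(conditional_prob(transactions, "beer", "chips"))  # 0.75 on this data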
Classification:
Prediction - Regression:
y = g(x | θ)
A training dataset of used cars and the function fitted. For simplicity, mileage is taken as the only
input attribute and a linear model is used.
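A least-squares fit of this linear model is a one-liner; a quick sketch with invented mileage/price data:

import numpy as np

# Invented used-car data: mileage x (in 1000 km) vs. price y (in $1000).
x = np.array([20., 40., 60., 80., 100.])
y = np.array([18.5, 15.2, 12.8, 10.1, 7.9])

# Fit the linear model g(x | w, w0) = w*x + w0 by least squares.
w, w0 = np.polyfit(x, y, deg=1)
print(f"price ~ {w:.3f} * mileage + {w0:.2f}")
print("predicted price at 70k km:", round(w * 70 + w0, 2))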
Supervised Learning:
• Input representation:
x1: price, x2: engine power
Probably Approximately Correct (PAC):
• Consider a concept class C defined over an instance space X (containing instances of length
n), and a learner L using a hypothesis space H.
• C is PAC learnable by L using H if, for all f ∈ C, for all distributions D over X, and for all
0 < ε, δ < 1/2, the learner L, given a number of random examples polynomial in 1/ε and 1/δ,
outputs with probability at least 1 − δ a hypothesis h ∈ H with error at most ε.
• Is there an efficient algorithm that can process the sample and produce a good
hypothesis h?
• To be PAC learnable, there must be a hypothesis h ∈ H with arbitrarily small error for every
f ∈ C. We generally assume H ⊇ C. (Properly PAC learnable if H = C.)
Occam’s Razor:
Occam's razor: among hypotheses with comparable empirical error, prefer the simpler one; unnecessary complexity should be shaved off.
Noise and Model Complexity:
Noise is any unwanted anomaly in the data. Due to noise, the class may be more difficult to
learn, and zero error may be infeasible with a simple hypothesis class. There are several
interpretations of noise:
• There may be imprecision in recording the input attributes, which may shift the data points
in the input space.
• There may be errors in labeling the data points, which may relabel positive instances as
negative and vice versa. This is sometimes called teacher noise.
• There may be additional attributes, which we have not taken into account, that affect the
label of an instance. Such attributes may be hidden or latent in that they may be
unobservable. The effect of these neglected attributes is thus modeled as a random
component and is included in “noise.”
Given two models with comparable empirical error, use the simpler one because it is
• Simpler to use (lower computational complexity)
• Easier to train (lower space complexity)
• Easier to explain (more interpretable)
• Generalizes better (lower variance - Occam’s razor)
Learning Multiple Classes:
In our example of learning a family car, we have positive examples belonging to the class family
car and the negative examples belonging to all other cars. This is a two-class problem. In the
general case, we have K classes denoted as Ci, i = 1, . . . , K, and an input instance belongs to one
and exactly one of them. The training set is now of the form X = {x^t, r^t}_{t=1}^{N}, where r^t
is a K-dimensional vector with r_i^t = 1 if x^t ∈ C_i and 0 otherwise.
There are three classes: family car, sports car, and luxury sedan. There are three hypotheses
induced, each one covering the instances of one class and leaving outside the instances of the
other two classes. '?' marks reject regions where no, or more than one, class is chosen.
Model Selection & Generalization:
Bayesian networks
A Bayesian network, Bayes network, belief network, Bayes(ian)
model or probabilistic directed acyclic graphical model is a probabilistic graphical
model (a type of statistical model) that represents a set of variables and
their conditional dependencies via a directed acyclic graph (DAG). For example, a
Bayesian network could represent the probabilistic relationships between diseases
and symptoms. Given symptoms, the network can be used to compute the
probabilities of the presence of various diseases.
Bayesian Net Example:
Consider the following Bayesian network (figure not reproduced). The worked example computes
P(D = T) by summing the joint probability over all configurations of D's ancestors, and
P(A = T | C = T) via Bayes' rule as P(A = T, C = T) / P(C = T).
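Since the original network figure is not shown, here is a sketch of inference by enumeration on a small invented chain A → C → D; the structure and all CPT values are placeholders, not the figure's numbers:

from itertools import product

# Tiny invented chain A -> C -> D; CPT numbers are placeholders.
p_a = 0.3                                # P(A=T)
p_c_given_a = {True: 0.8, False: 0.1}    # P(C=T | A)
p_d_given_c = {True: 0.9, False: 0.2}    # P(D=T | C)

def joint(a, c, d):
    # Chain rule over the DAG: P(a, c, d) = P(a) P(c|a) P(d|c)
    pa = p_a if a else 1 - p_a
    pc = p_c_given_a[a] if c else 1 - p_c_given_a[a]
    pd = p_d_given_c[c] if d else 1 - p_d_given_c[c]
    return pa * pc * pd

# P(D=T): marginalize over A and C.
p_d = sum(joint(a, c, True) for a, c in product([True, False], repeat=2))

# P(A=T | C=T) = P(A=T, C=T) / P(C=T).
num = sum(joint(True, True, d) for d in [True, False])
den = sum(joint(a, True, d) for a, d in product([True, False], repeat=2))
print("P(D=T) =", round(p_d, 4), " P(A=T|C=T) =", round(num / den, 4))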
Association rule
Association rule mining is explained using the Apriori algorithm.
Apriori Algorithm: a level-wise search for frequent itemsets; candidate itemsets of size k are
generated from frequent itemsets of size k − 1 and pruned using the property that every subset of
a frequent itemset must itself be frequent (i.e., meet a minimum support threshold).
Confidence: confidence(X → Y) = support(X ∪ Y) / support(X), an estimate of P(Y | X).
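A compact sketch of the level-wise Apriori search and a rule-confidence computation; the transactions and minimum-support threshold are invented:

# Minimal Apriori sketch on invented transactions.
transactions = [{"beer", "chips"}, {"beer", "chips", "salsa"},
                {"beer"}, {"chips", "salsa"}, {"beer", "chips"}]
min_support = 0.4

def support(itemset):
    # Fraction of transactions containing the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
all_frequent, k = list(frequent), 2
while frequent:
    # Generate size-k candidates from frequent (k-1)-itemsets, then prune.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent += frequent
    k += 1

# confidence(beer -> chips) = support({beer, chips}) / support({beer})
conf = support(frozenset({"beer", "chips"})) / support(frozenset({"beer"}))
print(all_frequent, " confidence(beer->chips) =", round(conf, 2))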
Parametric Methods
Parametric Estimation
• X = {x^t}_{t=1}^{N} where x^t ~ p(x)
• Parametric estimation:
Assume a form for p(x | θ) and estimate θ, its sufficient statistics,
using X,
e.g., N(μ, σ^2) where θ = {μ, σ^2}
• Log likelihood
L(θ | X) = log l(θ | X) = \sum_t log p(x^t | θ)
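For the Gaussian case, maximizing this log likelihood gives the sample mean and the (biased, divide-by-N) sample variance; a quick numerical check on synthetic data:

import numpy as np

# MLE for theta = {mu, sigma^2} of N(mu, sigma^2) on a synthetic sample.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=1000)

mu_hat = X.mean()                     # maximizes L(theta | X)
var_hat = ((X - mu_hat) ** 2).mean()  # note: divides by N, not N - 1

log_lik = (-0.5 * len(X) * np.log(2 * np.pi * var_hat)
           - ((X - mu_hat) ** 2).sum() / (2 * var_hat))
print(round(mu_hat, 3), round(var_hat, 3), round(log_lik, 1))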
Classification
Regression
Linear Regression:
Multivariate Parameters:
Parameter Estimation
Classification
Estimating the mean and variance (maximum likelihood estimates): m = (1/N) \sum_t x^t, s^2 = (1/N) \sum_t (x^t − m)^2
Quadratic Discriminant:
Tuning Complexity
Discrete Features
Dimensionality Reduction
Necessity:
1. Reduces time complexity: less computation
2. Reduces space complexity: fewer parameters
3. Saves the cost of observing the feature
4. Simpler models are more robust on small datasets
5. More interpretable; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) if plotted in 2
or 3 dimensions.
Feature Selection and Extraction:
Feature selection chooses a subset of the original d dimensions and discards the rest (e.g., forward
selection or backward elimination); feature extraction computes new features as combinations of
the original ones (e.g., PCA for unsupervised, LDA for supervised reduction).
UNIT III CLUSTERING & REGRESSION
Clustering - Mixture densities - k-means clustering - Supervised Learning after clustering - Hierarchical
clustering - Nonparametric Methods - Density estimation - Generalization of multivariate data -
Classification -Regression - Smoothing models - Decision Trees - Univariate trees - Multivariate trees -
Learning rules from data - Linear Discrimination.
Clustering
• Cluster: A collection of data objects
o similar (or related) to one another within the same group
o dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
o Finding similarities between data according to the characteristics found in the data
and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning
by examples: supervised)
• A good clustering method will produce high quality clusters
o high intra-class similarity: cohesive within clusters
o low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
o the similarity measure used by the method
o its implementation, and
o its ability to discover some or all of the hidden patterns.
• Partitioning approach:
o Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
o Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
o Create a hierarchical decomposition of the set of data (or objects) using some
criterion
o Typical methods: DIANA, AGNES, BIRCH, CAMELEON
Mixture densities
p(x) = \sum_{i=1}^{k} p(x | G_i) P(G_i), where G_i are the groups (mixture components),
p(x | G_i) are the component densities, and P(G_i) are the mixture proportions.
Partitioning method
• Partitioning a database D of n objects into a set of k clusters, such that the sum of squared
distances is minimized (where ci is the centroid or medoid of cluster Ci)
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
Minkowski distance (order p):
d_m(x^r, x^s) = [ \sum_{j=1}^{d} (x_j^r − x_j^s)^p ]^{1/p}
City-block distance (p = 1):
d_cb(x^r, x^s) = \sum_{j=1}^{d} |x_j^r − x_j^s|
• Two important paradigms:
– Bottom-up agglomerative clustering(Agglomerative Nesting-AGNES)
– Top-down divisive clustering(Divisive Analysis-DIANA)
Agglomerative Clustering:
Start with N groups each with one instance and merge two closest groups at each
iteration
Distance between two groups Gi and Gj:
Single-link clustering:
d(G_i, G_j) = min_{x^r ∈ G_i, x^s ∈ G_j} d(x^r, x^s)
Complete-link clustering:
d(G_i, G_j) = max_{x^r ∈ G_i, x^s ∈ G_j} d(x^r, x^s)
1. Given a dataset, first find the distance matrix using the Euclidean distance measure.
2. After finding the distance matrix, find the smallest element in the distance matrix.
3. Merge these two points to form a cluster.
4. Next find the minimum distance between the new cluster and all other points.
5. Now frame the distance matrix and find the smallest value in the obtained distance matrix,
and then merge these points to form a cluster. The process repeats until all points are
grouped into clusters.
6. Finally draw the dendrogram.
Example:
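The worked example's dataset is not reproduced, so here is a sketch of the same procedure (steps 1-6 above) on invented 2-D points, using SciPy's single-link agglomerative clustering:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Invented 2-D points standing in for the example's dataset.
X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0],
              [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])

D = pdist(X, metric="euclidean")   # step 1: pairwise distance matrix
Z = linkage(D, method="single")    # steps 2-5: repeatedly merge the closest pair
print(Z)                           # each row: cluster i, cluster j, distance, size
dendrogram(Z)                      # step 6: draw the dendrogram (needs matplotlib)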
Nonparametric Methods
Density Estimation
Histogram (naive) estimator with bin width 2h:
\hat{p}(x) = #{x − h < x^t ≤ x + h} / (2Nh)
Equivalently, as a sum of weight functions:
\hat{p}(x) = (1/Nh) \sum_{t=1}^{N} w((x − x^t)/h), where w(u) = 1/2 if |u| < 1, and 0 otherwise.
Kernel Estimator:
\hat{p}(x) = (1/Nh) \sum_{t=1}^{N} K((x − x^t)/h)
Instead of fixing bin width h and counting the number of instances, fix the instances
(neighbors) k and check bin width,
\hat{p}(x) = k / (2N d_k(x))
The nearest neighbor class of estimators adapts the amount of smoothing to the local density of
data. The degree of smoothing is controlled by k, the number of neighbors taken into account,
which is much smaller than N, the sample size. Let us define a distance between a and b, for
example, |a − b|, and for each x, define d_k(x) as the distance from x to its k-th nearest sample,
so that d_1(x) ≤ d_2(x) ≤ · · · ≤ d_N(x).
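A sketch of both estimators in one dimension, on a synthetic N(0, 1) sample; the values of h and k are invented:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=500)
h, k = 0.3, 25

def kde(x):
    # p_hat(x) = (1 / Nh) * sum_t K((x - x^t) / h), with a Gaussian kernel K
    u = (x - X) / h
    return np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)) / h

def knn_density(x):
    # p_hat(x) = k / (2 N d_k(x)), d_k(x) = distance to the k-th neighbor
    d_k = np.sort(np.abs(x - X))[k - 1]
    return k / (2 * len(X) * d_k)

print(kde(0.0), knn_density(0.0))  # both near 1/sqrt(2*pi) ~ 0.399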
Generalization of multivariate data
\hat{p}(x) = (1/(N h^d)) \sum_{t=1}^{N} K((x − x^t)/h)
• Multivariate Gaussian kernel
– Spheric: K(u) = (1/(2π)^{d/2}) exp(−‖u‖^2 / 2)
– Ellipsoid: K(u) = (1/((2π)^{d/2} |S|^{1/2})) exp(−(1/2) u^T S^{-1} u)
where S is the sample covariance matrix. This corresponds to using Mahalanobis distance instead
of the Euclidean distance. It is also possible to have the distance metric local where S is
calculated from instances in the vicinity of x, for example, some k closest instances.
Note that S calculated locally may be singular and PCA (or LDA, in the case of classification)
may be needed.
Hamming distance:
When the inputs are discrete, we can use Hamming distance, which counts the number of
nonmatching attributes.
Nonparametric Classification
• Kernel estimator:
\hat{p}(x | C_i) = (1/(N_i h^d)) \sum_{t=1}^{N} K((x − x^t)/h) r_i^t, with \hat{P}(C_i) = N_i / N
g_i(x) = \hat{p}(x | C_i) \hat{P}(C_i) = (1/(N h^d)) \sum_{t=1}^{N} K((x − x^t)/h) r_i^t
• k-NN estimator:
\hat{p}(x | C_i) = k_i / (N_i V^k(x)), and \hat{P}(C_i | x) = \hat{p}(x | C_i) \hat{P}(C_i) / \hat{p}(x) = k_i / k
where k_i is the number of the k nearest neighbors that belong to C_i and V^k(x) is the volume of
the hypersphere centered at x with radius d_k(x).
The k-nn classifier assigns the input to the class having most examples among the k neighbors of
the input. All neighbors have equal vote, and the class having the maximum number of voters
among the k neighbors is chosen. Ties are broken arbitrarily or a weighted vote is taken. k is
generally taken to be an odd number to minimize ties: confusion is generally between two
neighboring classes. Again, the use of Euclidean distance corresponds to assuming uncorrelated
inputs with equal variances, and when this is not the case a suitable discriminant metric should
be used. One example is discriminant adaptive nearest neighbor where the optimal distance to
separate classes is estimated locally.
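A minimal k-NN classifier with Euclidean distance and majority voting; the training points are invented:

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    d = np.linalg.norm(X_train - x, axis=1)    # Euclidean distances to x
    nearest = y_train[np.argsort(d)[:k]]       # labels of the k nearest
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]           # majority vote

X_train = np.array([[1., 1.], [1.2, 0.8], [4., 4.], [4.2, 3.9], [3.8, 4.1]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([3.5, 3.5]), k=3))  # -> 1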
Nonparametric Regression
Instead of taking an average and giving a constant fit at a point, we can take into account one
more term in the Taylor expansion and calculate a linear fit. In the running line smoother, we can
use the data points in the neighborhood, as defined by h or k, and fit a local regression line. In
the locally weighted running line smoother, known as loess, instead of a hard definition of
neighborhoods, we use kernel weighting such that distant points have less effect on error.
Decision Trees
Algorithm: a tree is grown greedily; at each node the attribute that best separates the classes is
chosen, the data is split on that attribute's values, and the procedure recurses until the node is
pure enough.
Attribute Selection Measure: information gain. For a node with class proportions p_i,
Entropy(S) = −\sum_i p_i log_2 p_i, and Gain(S, A) = Entropy(S) − \sum_v (|S_v| / |S|) Entropy(S_v).
Example: after computing the gain of each attribute, the highest-gain attribute becomes the root;
after further iterations, the final decision tree is obtained.
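A sketch of the information-gain computation that drives attribute selection; the four-row toy table is invented:

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum_i p_i log2 p_i
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    n, total = len(labels), entropy(labels)
    for v in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == v]
        total -= len(subset) / n * entropy(subset)
    return total

rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(info_gain(rows, labels, "outlook"))  # 1.0: this split is perfect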
Univariate Trees: each internal node tests a single attribute, e.g., x_j > w_0 for a numeric
attribute, so the splits are axis-aligned.
Multivariate Trees: each internal node tests a combination of attributes, e.g., the linear split
w^T x + w_0 > 0.
Likelihood-based: Assume a model for p(x|Ci) and use Bayes' rule to calculate P(Ci|x).
Discriminant-based: Assume a model for the discriminant gi(x|Φi) directly.
Estimating the boundaries is enough; there is no need to accurately estimate the densities inside
the boundaries.
Linear Discriminant:
g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}
Advantages:
Optimal when the p(x|Ci) are Gaussian with a shared covariance matrix; useful when the classes
are (almost) linearly separable.
Quadratic discriminant:
g_i(x | W_i, w_i, w_{i0}) = x^T W_i x + w_i^T x + w_{i0}
Map from x to z using nonlinear basis functions and use a linear discriminant in z-space:
g_i(x) = \sum_{j=1}^{k} w_{ij} φ_j(x)
Two Classes:
g(x) = g_1(x) − g_2(x)
     = (w_1^T x + w_{10}) − (w_2^T x + w_{20})
     = (w_1 − w_2)^T x + (w_{10} − w_{20})
     = w^T x + w_0
Choose C_1 if g(x) > 0, C_2 otherwise.
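A sketch of this two-class decision rule; the weight values below are invented for illustration:

import numpy as np

# Two-class decision with a linear discriminant g(x) = w^T x + w0.
w = np.array([0.5, -1.0])
w0 = 0.2

def choose(x):
    g = w @ x + w0              # g(x) = g1(x) - g2(x)
    return "C1" if g > 0 else "C2"

print(choose(np.array([2.0, 0.5])), choose(np.array([0.0, 1.0])))  # C1 C2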
Multiple Classes:
With K > 2 classes there are K discriminants g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0}, and we
choose C_i if g_i(x) = max_j g_j(x).
UNIT IV MULTILAYER PERCEPTRONS
The soma is the body of the neuron. Attached to the soma are long, irregularly shaped filaments,
called dendrites. These nerve processes are often less than a micron in diameter, and have
complex branching shapes. Their intricate shape resembles that of a tree in winter, without
leaves, whose branches fork and fork again into finer structure. The dendrites act as the
connections through which all the inputs to the neuron arrive. These cells are able to perform
more complex functions than simple addition on the inputs they receive, but considering a simple
summation is a reasonable approximation.
The basic features of a biological neuron
Another type of nerve process attached to the soma is called an axon. This is electrically active,
unlike the dendrite, and serves as the output channel of the neuron. Axons always appear on
output cells, but are often absent from interneurons, which have both inputs and outputs on
dendrites. The axon is a non-linear threshold device, producing a voltage pulse, called an action
potential, that lasts about 1 millisecond (10^-3 s) when the resting potential within the soma rises
above a certain critical threshold. This action potential is in fact a series of rapid voltage spikes.
The axon terminates in a specialised contact called a synapse that couples the axon with the
dendrite of another cell. There is no direct linkage across the junction; rather, it is a temporary
chemical one. The synapse releases chemicals called neurotransmitters when its potential is
raised sufficiently by the action potential. It may take the arrival of more than one action
potential before the synapse is triggered. The neurotransmitters that are released by the synapse
diffuse across the gap, and chemically activate gates on the dendrites, which, when open, allow
charged ions to flow. It is this flow of ions that alters the dendritic potential, and provides a
voltage pulse on the dendrite, which is then conducted along into the next neuron body. Each
dendrite may have many synapses acting on it, allowing massive interconnectivity to be
achieved. At the synaptic junction, the number of gates opened on the dendrite depends on the
number of neurotransmitters released. It also appears that some synapses excite the dendrite they
affect, whilst others serve to inhibit it. This corresponds to altering the local potential of the
dendrite in a positive or negative direction. A single neuron will have many synaptic inputs on its
dendrites, and may have many synaptic outputs connecting it to other cells.
It performs a weighted sum of its inputs, compares this to some internal threshold level, and
turns on only if this level is exceeded. If not, it stays off. Because the inputs are passed through
the model neuron to produce the output, the system is known as a feedforward one. We need to
formulate this mathematically. If there are n inputs, then there are n associated weights on the
input lines. The model neuron calculates the weighted sum of its inputs; it takes the first input,
multiplies it by the weight on that input line, then does the same for the next input, and so on,
adding them all up at the end. This can be written as a = \sum_{i=1}^{n} w_i x_i, where the x_i are
the inputs and the w_i the corresponding weights.
This sum then has to be compared to a certain value in the neuron, the threshold value. This
thresholding process is accomplished by comparison; if the sum is greater than the threshold
value, then output a 1, if less, output a 0. This can be seen graphically in the figure given below,
where the x-axis represents the input, and the y-axis the output.
The thresholding function
Equivalently, the threshold value can be subtracted from the weighted sum, and the resulting
value compared to zero; if the result is positive, then output a 1, else output a 0. Notice that the
shape of the function is the same, but now the jump occurs at zero. The threshold effectively
adds an offset to the weighted sum. An alternative way of achieving the same effect is to take the
threshold out of the body of the model neuron and connect it to an extra input value that is fixed
to be “on” all the time. In this case, rather than subtracting the threshold value from the weighted
sum, the extra input of +1 is multiplied by a weight equal to minus the threshold value, −θ, and
added in along with all the other inputs; this is known as biasing the neuron. The value of −θ is
therefore known as the neuron's bias or offset. Both approaches are equivalent, and either is
acceptable.
Notice that the lower limit of the summation has changed from 1 to 0, so the sum becomes
a = \sum_{i=0}^{n} w_i x_i, and that the value of the input x_0 is always set to 1. This model of
the neuron, shown in the below figure, was proposed
in 1943 by McCulloch and Pitts. Their model came about in much the same way as we have
developed ours, and stemmed from their research into the behaviour of the neurons in the brain.
It is important to look at the features of this McCulloch-Pitts neuron. It is a simple enough unit,
thresholding a weighted sum of its inputs to get an output. It specifically does not take any
account of the complex patterns and timings of actual nervous activity in real neural systems, nor
does it have any of the complicated features found in the body of biological neurons.
Details of basic Model
This ensures its status as a model, and not a copy, of a real neuron, and makes it possible to
implement on a digital computer. This is the strength of the model-now we need to investigate
what can be achieved using this simple design. The arrangement of the connections between the
neurons is important, but, continuing our trend of choosing simple models to get an idea of what
is happening in a complicated real-world situation, we shall for the time being consider only one
layer of neurons, where we study the outputs of the neurons under a known set of inputs. The
model neurons, connected up in a simple fashion, were given the name “perceptrons” by Frank
Rosenblatt in 1962. He pioneered the simulation of neural networks on digital computers, as well
as their formal analysis. In his book “Principles of Neurodynamics ”, he describes these
perceptrons as simplified networks in which certain properties of real nervous systems are
exaggerated whilst others are ignored. He stated that they are not intended to serve as detailed
copies of any real nervous system; in other words, he realised at this early stage that he was
dealing with a basic model. This fact is often lost in the popular press as the idea of computer
“brains”, based on these techniques, grabs the imagination. We are not attempting to build
computer brains, nor are we trying to mimic parts of real brains; rather we are aiming to discover
the properties of models that take their behaviour from extremely simplified versions of natural
neural systems, usually on a massively reduced scale as well. Whereas the brain has at least 10^11
neurons, each connected to 10^4 others, we are concerned here with maybe a few hundred neurons
at most, connected to a few thousand input lines.
The figure given below shows a multilayer feed-forward neural network. Let the learning rate be
0.9. The initial weight and bias values of the network are given in Table 1, along with the first
training tuple, X= (1, 0, 1), with a class label of 1 (target value Tj). This example shows the
calculations for backpropagation, given the first training tuple, X. The tuple is fed into the
network, and the net input and output of each unit are computed. The error of each unit is
computed and propagated backward. The error values and the weight and bias updates are
calculated.
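A sketch of one such forward/backward pass on a 3-2-1 sigmoid network. The initial weights below follow a commonly used textbook version of this example, but since Table 1 is not reproduced here, treat them as placeholders:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 1.0]); target = 1.0; eta = 0.9
W1 = np.array([[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]])   # input -> hidden
b1 = np.array([-0.4, 0.2])
W2 = np.array([-0.3, -0.2]); b2 = 0.1                   # hidden -> output

# Forward pass: net input and output of each unit.
h = sigmoid(x @ W1 + b1)
o = sigmoid(h @ W2 + b2)

# Backward pass: error term of each sigmoid unit.
err_o = o * (1 - o) * (target - o)
err_h = h * (1 - h) * (W2 * err_o)

# Weight/bias updates: w += eta * err_j * input_i.
W2 += eta * err_o * h;  b2 += eta * err_o
W1 += eta * np.outer(x, err_h);  b1 += eta * err_h
print("output:", o, "updated hidden->output weights:", W2)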
Training Procedures
Improving Convergence:
Momentum:
Δw_i^t = −η ∂E^t/∂w_i + α Δw_i^{t−1}
Adaptive learning rate:
Δη = +a if E^{t+τ} < E^t; −bη otherwise
Overfitting/Overtraining:
As model complexity increases, the training error keeps decreasing, but the validation error starts
to increase and the network starts to overfit. Likewise, as training continues, the training error
keeps decreasing, but the validation error starts to increase and the network starts to overtrain.
Hints:
Tuning the Network Size
Learning Time
Applications:
Sequence recognition: Speech recognition
Sequence reproduction: Time-series prediction
Sequence association
Network architectures
Time-delay networks (Waibel et al., 1989)
Recurrent networks (Rumelhart et al., 1986)
In a time delay neural network, previous inputs are delayed in time so as to synchronize with the
final input, and all are fed together as input to the system. Backpropagation can then be used to
train the weights. To extract features local in time, one can have layers of structured connections
and weight sharing to get translation invariance in time. The main restriction of this architecture
is that the size of the time window we slide over the sequence should be fixed a priori.
Inputs in a time window of length T are delayed in time until we can feed all T inputs as the
input vector to the MLP.
Recurrent Networks:
Examples of MLP with partial recurrency. Recurrent connections are shown with dashed lines:
(a) self-connections in the hidden layer, (b) self-connections in the output layer, and (c)
connections from the output to the hidden layer. Combinations of these are also possible.
Unfolding in Time:
If the sequences have a small maximum length, then unfolding in time can be used to convert an
arbitrary recurrent network to an equivalent feedforward network. A separate unit and
connection is created for copies at different times. The resulting network can be trained with
backpropagation with the additional requirement that all copies of each connection should
remain identical. The solution, as in weight sharing, is to sum up the different weight changes in
time and change the weight by the average. This is called backpropagation through time
(Rumelhart, Hinton, and Williams 1986b). The problem with this approach is the memory
requirement if the length of the sequence is large. Real-time recurrent learning (Williams and
Zipser 1989) is a learning algorithm for training recurrent networks without unfolding and has
the advantage that it can use sequences of arbitrary length.
UNIT V LOCAL MODELS
Competitive learning - Adaptive resonance theory - Self organizing map - Basis functions - Learning
vector quantization - Assessing and Comparing Classification Algorithms - Combining Multiple Learners –
Reinforcement Learning.
Introduction
Divide the input space into local regions and learn simple (constant/linear) models in each
patch
Competitive learning
The term competitive learning is used because it is as if these groups, or rather the units
representing these groups, compete among themselves to be the one responsible for
representing an instance. The model is also called winner-take-all; it is as if one group wins and
gets updated, and the others are not updated at all.
An online method has the usual advantages that (1) we do not need extra memory to store
the whole training set; (2) updates at each step are simple to implement, for example, in
hardware; and (3) the input distribution may change in time and the model adapts itself to these
changes automatically. If we were to use a batch algorithm, we would need to collect a new
sample and run the batch method from scratch over the whole sample.
Online k-Means:
Reconstruction error:
E({m_i}_{i=1}^{k} | X) = \sum_t \sum_i b_i^t ‖x^t − m_i‖^2
b_i^t = 1 if ‖x^t − m_i‖ = min_l ‖x^t − m_l‖; 0 otherwise
Batch k-means:
m_i = \sum_t b_i^t x^t / \sum_t b_i^t
Online k-means (update after each single instance x^t):
Δm_{ij} = −η ∂E^t/∂m_{ij} = η b_i^t (x_j^t − m_{ij})
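A sketch of the online update rule on invented 2-D data: for each instance only the winning center moves, by Δm_i = η (x^t − m_i):

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, .5, (100, 2)), rng.normal(4, .5, (100, 2))])
k, eta = 2, 0.1
m = X[rng.choice(len(X), k, replace=False)].copy()    # initialize centers

for epoch in range(10):
    for x in rng.permutation(X):
        i = np.argmin(np.linalg.norm(x - m, axis=1))  # winner: b_i^t = 1
        m[i] += eta * (x - m[i])                      # move only the winner
print(m)  # ends up near (0, 0) and (4, 4)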
The values are hard, 0/1, values. One can use soft values in (0, 1) and get a more informative
encoding. In the local representation, this is done by the Gaussian RBF that uses the distance to
the center, mi , and in the distributed representation, this is done by the sigmoid that uses the
distance to the hyperplane, wi .
Regression:
E({m_h, s_h, w_{ih}}_{i,h} | X) = (1/2) \sum_t \sum_i (r_i^t − y_i^t)^2
y_i^t = \sum_{h=1}^{H} w_{ih} p_h^t + w_{i0}
Gradient-descent updates:
Δw_{ih} = η \sum_t (r_i^t − y_i^t) p_h^t
Δm_{hj} = η \sum_t [ \sum_i (r_i^t − y_i^t) w_{ih} ] p_h^t (x_j^t − m_{hj}) / s_h^2
Δs_h = η \sum_t [ \sum_i (r_i^t − y_i^t) w_{ih} ] p_h^t ‖x^t − m_h‖^2 / s_h^3
Training RBF:
• Hybrid learning:
o First layer centers and spreads:
Unsupervised k-means
o Second layer weights:
Supervised gradient-descent
• Fully supervised
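A sketch of the hybrid scheme for 1-D regression. For brevity, a random pick of centers stands in for the unsupervised k-means step, and the second-layer weights are solved by least squares rather than gradient descent; the data and H are invented:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, 80)
r = np.sin(X) + rng.normal(0, 0.1, 80)

H = 8
m = np.sort(rng.choice(X, H, replace=False))   # stand-in for k-means centers
s = (m.max() - m.min()) / H                    # one shared spread

# First layer: Gaussian RBF activations p_h^t, plus a bias column for w_0.
P = np.exp(-0.5 * ((X[:, None] - m[None, :]) / s) ** 2)
P = np.hstack([P, np.ones((len(X), 1))])

# Second layer: supervised fit of the weights by linear least squares.
w, *_ = np.linalg.lstsq(P, r, rcond=None)
print("weights:", np.round(w, 2))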
Classification:
E({m_h, s_h, w_{ih}}_{i,h} | X) = −\sum_t \sum_i r_i^t log y_i^t
y_i^t = exp[ \sum_h w_{ih} p_h^t + w_{i0} ] / \sum_k exp[ \sum_h w_{kh} p_h^t + w_{k0} ]
With a gating (mixture of experts) formulation, the log likelihood is
L({m_h, s_h, w_{ih}}_{i,h} | X) = \sum_t log \sum_h g_h^t \prod_i (y_{ih}^t)^{r_i^t}
 = \sum_t log \sum_h g_h^t exp[ \sum_i r_i^t log y_{ih}^t ]
y_{ih}^t = exp(w_{ih}) / \sum_k exp(w_{kh}), or, with local experts,
y_{ih}^t = exp(v_{ih}^T x^t) / \sum_k exp(v_{kh}^T x^t)
Combining Multiple Learners
• Early integration: Concatenate all features and train a single learner
• Late integration: With each feature set, train one learner, then either use a fixed rule or
stacking to combine decisions
• Intermediate integration: With each feature set, calculate a kernel, then use a single SVM
with multiple kernels
• Combining features vs decisions vs kernels
Rationale:
No Free Lunch Theorem: There is no algorithm that is always the most accurate
Generate a group of base-learners which when combined has higher accuracy
Different learners use different
Algorithms
Hyperparameters
Representations /Modalities/Views
Training sets
Subproblems
Diversity vs accuracy
Voting: the ensemble output is a weighted combination of the base-learner outputs,
y_i = \sum_j w_j d_{ji} with w_j ≥ 0 and \sum_j w_j = 1; simple voting takes w_j = 1/L.
Fixed Combination Rules: sum, weighted sum, median, minimum, maximum, and product of the
base-learner outputs d_{ji}.
Error-Correcting Output Codes: each class is assigned a codeword over the L base-learners (rows
of a K × L code matrix with entries ±1); a new instance is assigned to the class whose codeword
is closest to the vector of base-learner outputs, so individual learner errors can be corrected.
Bagging:
Use bootstrapping to generate L training sets and train one base-learner with each
(Breiman, 1996)
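A sketch of bagging with decision stumps; scikit-learn is assumed to be available, and the dataset is invented:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

L = 11
learners = []
for _ in range(L):
    idx = rng.integers(0, len(X), len(X))   # bootstrap: sample with replacement
    stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
    learners.append(stump)

votes = np.array([t.predict(X) for t in learners])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)   # majority vote
print("ensemble training accuracy:", (y_hat == y).mean())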
Given an ensemble of dependent classifiers, do not use it as is; try to get independence:
1. Subset selection: Forward (growing)/Backward (pruning) approaches to improve
accuracy/diversity/independence
2. Train metaclassifiers: From the output of correlated classifiers, extract new
combinations that are uncorrelated. Using PCA, we get “eigenlearners.”
Similar to feature selection vs feature extraction
Cascading: use a sequence of base-learners sorted in increasing order of complexity; a later,
more costly learner is consulted only when the earlier ones are not confident enough in their
decision.
Value Iteration: iterate the Bellman optimality update
V(s) ← max_a \sum_{s'} P(s' | s, a) [E[r | s, a, s'] + γ V(s')]
for all states until the values converge; the optimal policy is then greedy with respect to V.
Policy Iteration: alternate policy evaluation (compute V^π for the current policy π) and policy
improvement (make π greedy with respect to V^π) until the policy no longer changes.
Q-learning: an off-policy temporal-difference method with the update
Q(s, a) ← Q(s, a) + η [r + γ max_{a'} Q(s', a') − Q(s, a)]
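A sketch of tabular Q-learning on an invented five-state corridor with a reward of 1 at the rightmost state:

import numpy as np

# States 0..4, reward 1 on reaching state 4; actions: 0 = left, 1 = right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
eta, gamma, eps = 0.5, 0.9, 0.3
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != 4:
        # Epsilon-greedy action selection.
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == 4 else 0.0
        # Q(s,a) <- Q(s,a) + eta [r + gamma max_a' Q(s',a') - Q(s,a)]
        Q[s, a] += eta * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2
print(np.round(Q, 2))  # "right" actions dominate along the corridor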
Sarsa: the on-policy counterpart; the update uses the action a' actually taken in the next state:
Q(s, a) ← Q(s, a) + η [r + γ Q(s', a') − Q(s, a)]