
UNIT I

INTRODUCTION TO MACHINE LEARNING

Machine learning - examples of machine learning applications - Learning associations - Classification -


Regression - Unsupervised learning - Supervised Learning - Learning class from examples - PAC learning -
Noise, model selection and generalization - Dimension of supervised machine learning algorithm.

Definition of Machine learning:

Well posed learning problem: "A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance at tasks in T, as measured
by P, improves with experience E." (Tom Mitchell)

"Field of study that gives computers the ability to learn without being explicitly programmed".

Learning = Improving with experience at some task

- Improve over task T,


- with respect to performance measure P,
- based on experience E.

E.g., Learn to play checkers

- T : Play checkers
- P : % of games won in world tournament
- E: opportunity to play against self

Examples of tasks that are best solved by using a learning algorithm:

• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
• Generating patterns:
– Generating images or motion sequences (demo)
• Recognizing anomalies:
– Unusual sequences of credit card transactions
– Unusual patterns of sensor readings in a nuclear power plant or unusual sound in
your car engine.
• Prediction:
– Future stock prices or currency exchange rates
• The web contains a lot of data. Tasks with very big datasets often use machine learning
– especially if the data is noisy or non-stationary.
• Spam filtering, fraud detection:
– The enemy adapts so we must adapt too.
• Recommendation systems:
– Lots of noisy data. Million dollar prize!
• Information retrieval:
– Find documents or images with similar content.
• Data Visualization:
- Display a huge database in a revealing way

Types of learning algorithms:

• Supervised learning
o Training data includes desired outputs. Examples include,
o Prediction
o Classification (discrete labels), Regression (real values)

• Unsupervised learning
o Training data does not include desired outputs. Examples include,
o Clustering
o Probability distribution estimation
o Finding association (in features)
o Dimension reduction
• Semi-supervised learning
o Training data includes a few desired outputs
• Reinforcement learning
o Rewards from sequence of actions
o Policies: what actions should an agent take in a particular situation
o Utility estimation: how good is a state (→used by policy)
o No supervised output but delayed reward
o Credit assignment problem (what was responsible for the outcome)
o Applications:
o Game playing
o Robot in a maze
o Multiple agents, partial observability, ...

Hypothesis Space
• One way to think about a supervised learning machine is as a device that explores a
“hypothesis space”.
– Each setting of the parameters in the machine is a different hypothesis about the
function that maps input vectors to output vectors.
– If the data is noise-free, each training example rules out a region of hypothesis
space.
– If the data is noisy, each training example scales the posterior probability of each
point in the hypothesis space in proportion to how likely the training example is
given that hypothesis.
• The art of supervised machine learning is in:
– Deciding how to represent the inputs and outputs
– Selecting a hypothesis space that is powerful enough to represent the relationship
between inputs and outputs but simple enough to be searched.

Generalization
• The real aim of supervised learning is to do well on test data that is not known during
learning.
• Choosing the values for the parameters that minimize the loss function on the training data
is not necessarily the best policy.
• We want the learning machine to model the true regularities in the data and to ignore the
noise in the data.
– But the learning machine does not know which regularities are real and which are
accidental quirks of the particular set of training examples we happen to pick.
• So how can we be sure that the machine will generalize correctly to new data?

Training set, Test set and Validation set


• Divide the total dataset into three subsets:
– Training data is used for learning the parameters of the model.
– Validation data is not used for learning but is used for deciding what type of model
and what amount of regularization works best.
– Test data is used to get a final, unbiased estimate of how well the network works.
We expect this estimate to be worse than on the validation data.

• We could then re-divide the total dataset to get another unbiased estimate of the true error
rate.

Learning Associations:

• Basket analysis:
P (Y | X ) probability that somebody who buys X also buys Y where X and Y are
products/services.

Example: P ( chips | beer ) = 0.7 // 70 percent of customers who buy beer also buy chips.

We may want to make a distinction among customers and toward this, estimate P(Y|X,D) where
D is the set of customer attributes, for example, gender, age, marital status, and so on, assuming
that we have access to this information. If this is a bookseller instead of a supermarket, products
can be books or authors. In the case of a Web portal, items correspond to links to Web pages, and
we can estimate the links a user is likely to click and use this information to download such
pages in advance for faster access.
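As a concrete illustration, P(Y | X) can be estimated directly from transaction data. The following is a minimal sketch, not from the original notes; the baskets and the helper name `confidence` are made up for illustration.

```python
# Minimal sketch: estimate P(Y | X) from market-basket data.
def confidence(transactions, x, y):
    """Fraction of baskets containing x that also contain y, i.e. P(y | x)."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)

baskets = [
    {"beer", "chips"},
    {"beer", "chips", "milk"},
    {"beer", "diapers"},
    {"milk", "bread"},
]
print(confidence(baskets, "beer", "chips"))  # 2 of 3 beer baskets -> 0.67
```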

Classification:

• Example: Credit scoring


• Differentiating between low-risk and high-risk customers from their income and savings
• Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
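A sketch of this discriminant as code follows; the threshold values below are illustrative stand-ins for θ1 and θ2, which in practice would be learned from data.

```python
# Hypothetical thresholds; in practice theta1 and theta2 are learned from data.
THETA1 = 30_000   # income threshold (assumed units)
THETA2 = 10_000   # savings threshold (assumed units)

def credit_risk(income, savings):
    # The IF-THEN discriminant from the text.
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

print(credit_risk(45_000, 15_000))  # low-risk
print(credit_risk(45_000, 5_000))   # high-risk
```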

Prediction - Regression:

• Example: Predict Price of a used car


• x : car attributes
y : price

y = g (x | θ )

where, g ( ) is the model, θ are the parameters

A training dataset of used cars and the fitted function. For simplicity, mileage is taken as the only
input attribute and a linear model is used.
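A minimal sketch of fitting such a linear model y = g(x | θ) by least squares; the mileage/price pairs below are synthetic, made up for illustration.

```python
import numpy as np

# Synthetic training data: mileage (x) and price (y) of used cars.
mileage = np.array([20_000, 50_000, 80_000, 120_000], dtype=float)
price = np.array([18_000, 14_500, 11_000, 7_500], dtype=float)

# theta = (w, w0): least-squares fit of the linear model y = w*x + w0.
w, w0 = np.polyfit(mileage, price, deg=1)
print(f"predicted price at 60,000 miles: {w * 60_000 + w0:.0f}")
```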

Supervised Learning:

Learning a Class from Examples:

• Class C of a “family car”

o Prediction: Is car x a family car?


o Knowledge extraction: What do people expect from a family car?
• Output:

o Positive (+) and negative (–) examples

• Input representation:
x1: price, x2 : engine power

Probably Approximately Correct (PAC):

• Cannot expect a learner to learn a concept exactly.


• Cannot always expect to learn a close approximation to the target concept
• Therefore, the only realistic expectation of a good learner is that with high probability it
will learn a close approximation to the target concept.
• In Probably Approximately Correct (PAC) learning, one requires that given small
parameters ε and δ, with probability at least (1- δ) a learner produces a hypothesis with
error at most ε.
• The reason we can hope for this is the consistent distribution assumption: training and future examples are drawn independently from the same fixed distribution D.
PAC Learnability

• Consider a concept class C defined over an instance space X (containing instances of length
n), and a learner L using a hypothesis space H.
• C is PAC learnable by L using H if

• for all f ∈ C,

• for all distributions D over X, and fixed 0 < ε, δ < 1,

• L, given a collection of m examples sampled independently according to D, produces

• with probability at least (1 − δ) a hypothesis h ∈ H with error at most ε, where ErrorD = PrD[f(x) ≠ h(x)],

• where m is polynomial in 1/ε, 1/δ, n and size(H)
• C is efficiently learnable if L can produce the hypothesis in time polynomial in 1/ε, 1/δ, n
and size(H)
• We impose two limitations:
• Polynomial sample complexity (information theoretic constraint)

• Is there enough information in the sample to distinguish a hypothesis h that
approximates f?
• Polynomial time complexity (computational complexity)

• Is there an efficient algorithm that can process the sample and produce a good
hypothesis h ?

• To be PAC learnable, there must be a hypothesis h ∈ H with arbitrarily small error for every
f ∈ C. We generally assume H ⊇ C. (Properly PAC learnable if H = C)
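For a finite hypothesis space and a consistent learner, the standard sample-complexity bound m ≥ (1/ε)(ln |H| + ln(1/δ)) makes the definition concrete. This bound is standard PAC material but is not derived in these notes, so the sketch below is supplementary.

```python
import math

def pac_sample_size(h_size, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)) for a consistent learner and finite H."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# e.g., |H| = 2**10 Boolean hypotheses, eps = 0.1, delta = 0.05
print(pac_sample_size(2**10, eps=0.1, delta=0.05))  # about 100 examples
```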
Occam’s Razor:

Given two hypotheses that explain the data equally well, prefer the simpler one; simpler
hypotheses are less likely to fit accidental quirks of the training set.
Noise and Model Complexity:

Noise is any unwanted anomaly in the data and due to noise, the class may be more difficult to
learn and zero error may be infeasible with a simple hypothesis class. There are several
interpretations of noise:
• There may be imprecision in recording the input attributes, which may shift the data points
in the input space.
• There may be errors in labeling the data points, which may relabel positive instances as
negative and vice versa. This is sometimes called teacher noise.
• There may be additional attributes, which we have not taken into account, that affect the
label of an instance. Such attributes may be hidden or latent in that they may be
unobservable. The effect of these neglected attributes is thus modeled as a random
component and is included in “noise.”
Given comparable empirical error, use the simpler model because it is
• Simpler to use (lower computational complexity)
• Easier to train (lower space complexity)
• Easier to explain (more interpretable)
• Generalizes better (lower variance - Occam’s razor)

Learning Multiple Classes:

In our example of learning a family car, we have positive examples belonging to the class family
car and the negative examples belonging to all other cars. This is a two-class problem. In the
general case, we have K classes denoted as Ci, i = 1, . . . , K, and an input instance belongs to one
and exactly one of them. The training set is now of the form,

There are three classes: family car, sports car, and luxury sedan. There are three hypotheses
induced, each one covering the instances of one class and leaving outside the instances of the
other two classes. ‘?’ marks reject regions where no class, or more than one class, is chosen.

Model Selection & Generalization:

• Learning is an ill-posed problem; data is not sufficient to find a unique solution


• The need for inductive bias, assumptions about H
• Generalization: How well a model performs on new data
• Overfitting: H more complex than C or f
• Underfitting: H less complex than C or f
Triple Trade-Off:
• There is a trade-off between three factors (Dietterich, 2003):
1. Complexity of H, c (H),
2. Training set size, N,
3. Generalization error, E, on new data
As N increases, E decreases.
As c(H) increases, E first decreases and then increases.
Cross-Validation:
• To estimate generalization error, we need data unseen during training. We split the data as

• Training set (50%)


• Validation set (25%)
• Test (publication) set (25%)
• Resampling when there is little data
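A minimal sketch of the 50/25/25 split described above, assuming X and y are NumPy arrays of equal length; the function name is illustrative.

```python
import numpy as np

def split_50_25_25(X, y, seed=0):
    # Shuffle indices so the three subsets are random and disjoint.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_tr, n_va = len(X) // 2, len(X) // 4
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```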

Dimensions of a Supervised Learner:

1. Model: g(x | θ)
2. Loss function: E(θ | X) = Σt L(rt, g(xt | θ))
3. Optimization procedure: θ* = arg minθ E(θ | X)
UNIT II

DECISION THEORY

Bayesian Decision Theory - Introduction - Classification - Discriminant function - Bayesian networks -


Association rule - Parametric Methods - Introduction - Estimation - Classification - Regression -
Multivariate Methods - Data Parameter estimation - Classification - Complexity - Features -
Dimensionality Reduction - Analysis - Multidimensional scaling - Linear discriminant analysis.

Bayesian Decision Theory


Bayesian framework assumes that we always have a prior distribution for everything.
– The prior may be very vague.
– When we see some data, we combine our prior distribution with a likelihood term
to get a posterior distribution.
– The likelihood term takes into account how probable the observed data is given
the parameters of the model.
• It favors parameter settings that make the data likely.
• It fights the prior
• With enough data the likelihood terms always win.
Given database:
Example:
Losses and Risks:
Discriminant Functions
Classification can also be seen as implementing a set of discriminant functions,
gi(x), i = 1, . . . , K, such that we choose Ci if gi(x) = maxk gk(x).
Bayesian networks
A Bayesian network, Bayes network, belief network, Bayes(ian)
model or probabilistic directed acyclic graphical model is a probabilistic graphical
model (a type of statistical model) that represents a set of variables and
their conditional dependencies via a directed acyclic graph (DAG). For example, a
Bayesian network could represent the probabilistic relationships between diseases
and symptoms. Given symptoms, the network can be used to compute the
probabilities of the presence of various diseases.
Bayesian Net Example:
Consider the following Bayesian network:

Thus, the independence expressed in this Bayesian net are that


A and B are (absolutely) independent.
C is independent of B given A.
D is independent of C given A and B.
E is independent of A, B, and D given C.

Suppose that the net further records the following probabilities:


Prob(A=T) = 0.3
Prob(B=T) = 0.6
Prob(C=T|A=T) = 0.8
Prob(C=T|A=F) = 0.4
Prob(D=T|A=T,B=T) = 0.7
Prob(D=T|A=T,B=F) = 0.8
Prob(D=T|A=F,B=T) = 0.1
Prob(D=T|A=F,B=F) = 0.2
Prob(E=T|C=T) = 0.7
Prob(E=T|C=F) = 0.2

Some sample computations:

Prob(D=T):

P(D=T) =

P(D=T,A=T,B=T) + P(D=T,A=T,B=F) + P(D=T,A=F,B=T) + P(D=T,A=F,B=F) =


P(D=T|A=T,B=T) P(A=T,B=T) + P(D=T|A=T,B=F) P(A=T,B=F) +
P(D=T|A=F,B=T) P(A=F,B=T) + P(D=T|A=F,B=F) P(A=F,B=F) =
(since A and B are independent absolutely)

P(D=T|A=T,B=T) P(A=T) P(B=T) + P(D=T|A=T,B=F) P(A=T) P(B=F) +


P(D=T|A=F,B=T) P(A=F) P(B=T) + P(D=T|A=F,B=F) P(A=F) P(B=F) =

0.7*0.3*0.6 + 0.8*0.3*0.4 + 0.1*0.7*0.6 + 0.2*0.7*0.4 = 0.32

Prob(A=T|C=T):

P(A=T|C=T) = P(C=T|A=T)P(A=T) / P(C=T).

Now, P(C=T) = P(C=T,A=T) + P(C=T,A=F) =


P(C=T|A=T)P(A=T) + P(C=T|A=F)P(A=F) =
0.8*0.3+ 0.4*0.7 = 0.52

So P(C=T|A=T)P(A=T) / P(C=T) = 0.8*0.3/0.52= 0.46.
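Both results can be checked by brute-force enumeration over the joint distribution that the network factorizes. A minimal sketch (variable names are illustrative):

```python
from itertools import product

P_A, P_B = 0.3, 0.6
P_C = {True: 0.8, False: 0.4}                    # P(C=T | A)
P_D = {(True, True): 0.7, (True, False): 0.8,
       (False, True): 0.1, (False, False): 0.2}  # P(D=T | A, B)
P_E = {True: 0.7, False: 0.2}                    # P(E=T | C)

def joint(a, b, c, d, e):
    # Product of the conditional factors given by the DAG structure.
    p = (P_A if a else 1 - P_A) * (P_B if b else 1 - P_B)
    p *= P_C[a] if c else 1 - P_C[a]
    p *= P_D[a, b] if d else 1 - P_D[a, b]
    p *= P_E[c] if e else 1 - P_E[c]
    return p

states = list(product([True, False], repeat=5))
p_d = sum(joint(*s) for s in states if s[3])            # P(D=T)
p_c = sum(joint(*s) for s in states if s[2])            # P(C=T)
p_ac = sum(joint(*s) for s in states if s[0] and s[2])  # P(A=T, C=T)
print(round(p_d, 2), round(p_ac / p_c, 2))              # 0.32 0.46
```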

Association rule
Association rule mining is explained using the Apriori Algorithm.
Apriori Algorithm:
Confidence:

Parametric Methods
Parametric Estimation

• X = { xt }t where xt ~ p (x)
• Parametric estimation:
Assume a form for p(x | θ) and estimate θ, its sufficient statistics, using X;
e.g., N(μ, σ²) where θ = {μ, σ²}

Maximum Likelihood Estimation:

• Likelihood of θ given the sample X


l (θ|X) = p (X |θ) = ∏t p (xt|θ)

• Log likelihood
L(θ|X) = log l (θ|X) = ∑t log p (xt|θ)

• Maximum likelihood estimator (MLE)


θ* = argmaxθ L(θ|X)
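For the Gaussian case this maximization has a closed form: the sample mean and the biased (divide-by-N) sample variance. A minimal check in code, on a made-up sample:

```python
import numpy as np

x = np.array([2.1, 2.9, 3.2, 3.8, 4.0])   # made-up sample
mu_hat = x.mean()                          # MLE of mu
var_hat = ((x - mu_hat) ** 2).mean()       # MLE of sigma^2 (divides by N, not N-1)
print(mu_hat, var_hat)
```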
Examples: Bernoulli/Multinomial:

Gaussian (Normal) Distribution:


Bias and Variance:

Classification
Regression
Linear Regression:

Other Error Measures:


Multivariate Methods
Data:

• Multiple measurements (sensors)


• d inputs/features/attributes: d-variate
• N instances/observations/examples

Multivariate Parameters:
Parameter Estimation

Classification
Estimating the mean and Variance,

Quadratic Discriminant:
Tuning Complexity

Discrete Features
Dimensionality Reduction
Necessity:
1. Reduces time complexity: less computation
2. Reduces space complexity: fewer parameters
3. Saves the cost of observing the feature
4. Simpler models are more robust on small datasets
5. More interpretable; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions
Feature Selection and Extraction:

Principal Component Analysis(PCA)


Factor Analysis
Multidimensional Scaling

Linear Discriminant Analysis


Fisher’s Linear Discriminant:
For K>2 Classes,
UNIT III

CLUSTERING & REGRESSION

Clustering - Mixture densities - k-means clustering - Supervised Learning after clustering - Hierarchical
clustering - Nonparametric Methods - Density estimation - Generalization of multivariate data -
Classification -Regression - Smoothing models - Decision Trees - Univariate trees - Multivariate trees -
Learning rules from data - Linear Discrimination.

Clustering
• Cluster: A collection of data objects
o similar (or related) to one another within the same group
o dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
o Finding similarities between data according to the characteristics found in the data
and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning
by examples: supervised)
• A good clustering method will produce high quality clusters
o high intra-class similarity: cohesive within clusters
o low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
o the similarity measure used by the method
o its implementation, and
o its ability to discover some or all of the hidden patterns.

Major Clustering Approaches

• Partitioning approach:
o Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
o Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
o Create a hierarchical decomposition of the set of data (or objects) using some
criterion
o Typical methods: DIANA, AGNES, BIRCH, CAMELEON
Mixture densities
Partitioning method
• Partitioning a database D of n objects into a set of k clusters, such that the sum of squared
distances is minimized (where ci is the centroid or medoid of cluster Ci)
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion

The K-Means Clustering Method


• Given k, the k-means algorithm is implemented in four steps:
o Partition objects into k nonempty subsets
o Compute seed points as the centroids of the clusters of the current partitioning
(the centroid is the center, i.e., mean point, of the cluster)
o Assign each object to the cluster with the nearest seed point
o Go back to Step 2, stop when the assignment does not change
Example:
The iteration stops as soon as we get the same set of clusters as in the previous step.
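A compact NumPy sketch of the four steps above; X is assumed to be an (N, d) array, the first k points seed the clusters for simplicity, and no cluster is assumed to go empty.

```python
import numpy as np

def k_means(X, k, n_iter=100):
    centroids = X[:k].copy()                          # step 1: initial seeds
    for _ in range(n_iter):
        # step 3: assign each object to the nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 2: recompute centroids as cluster means (assumes no empty cluster)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):               # step 4: stop when unchanged
            break
        centroids = new
    return labels, centroids
```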
Hierarchical Clustering
Cluster based on similarities/distances
Distance measure between instances xr and xs

Minkowski distance (Lp) (Euclidean for p = 2):

$$d_m(x^r, x^s) = \left[ \sum_{j=1}^{d} \left( x_j^r - x_j^s \right)^p \right]^{1/p}$$

City-block distance:

$$d_{cb}(x^r, x^s) = \sum_{j=1}^{d} \left| x_j^r - x_j^s \right|$$
• Two important paradigms:
– Bottom-up agglomerative clustering(Agglomerative Nesting-AGNES)
– Top-down divisive clustering(Divisive Analysis-DIANA)
Agglomerative Clustering:

Start with N groups each with one instance and merge two closest groups at each
iteration
Distance between two groups Gi and Gj:
Single-link clustering:

$$d(G_i, G_j) = \min_{x^r \in G_i,\ x^s \in G_j} d(x^r, x^s)$$

Complete-link clustering:

$$d(G_i, G_j) = \max_{x^r \in G_i,\ x^s \in G_j} d(x^r, x^s)$$
Average-link clustering, centroid
Dendrogram: decompose the data objects into several levels of nested partitioning
(a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the
desired level; each connected component then forms a cluster.

Example of Single Linkage Clustering:

1. Given a datset, first find the distance matrix using Euclidean distance measure.
2. After finding the distance matrix, find the smallest element in the distance matix.
3. Merge these two points to form a cluster.
4. Next find the minium distance between the new cluster obtained with all other points.
5. Now frame the distance matrix and find the smallest value in the obtained distance matrix
and then merge these points to form a cluster. The process repeats until all points are
grouped into clusters.
6. Finally draw the dendogram.
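A hedged sketch of these steps using SciPy's hierarchical clustering routines; the five 2-D points are made up, and drawing the dendrogram requires matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [3.0, 4.0], [4.0, 4.0]])

# Steps 1-5: single linkage repeatedly merges the closest pair of clusters.
Z = linkage(X, method="single", metric="euclidean")

# Step 6: draw the dendrogram.
dendrogram(Z)
```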

Example:
Nonparametric Methods

Parametric (single global model), semiparametric (small number of local models)


Nonparametric: Similar inputs have similar outputs
Functions (pdf, discriminant, regression) change smoothly
Keep the training data; "let the data speak for itself"
Given x, find a small number of closest training instances and interpolate from these
Aka lazy/memory-based/case-based/instance-based learning

Density Estimation

Given the training set X={xt}t drawn iid from p(x)


Divide data into bins of size h
Histogram:

$$\hat{p}(x) = \frac{\#\{x^t \text{ in the same bin as } x\}}{Nh}$$

Naive estimator:

$$\hat{p}(x) = \frac{\#\{x - h < x^t \le x + h\}}{2Nh}$$

or equivalently,

$$\hat{p}(x) = \frac{1}{Nh} \sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right)$$

with the weight function defined as

$$w(u) = \begin{cases} 1/2 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$$

Kernel Estimator:

Kernel function, e.g., Gaussian kernel:

$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right)$$

Kernel estimator (Parzen windows):

$$\hat{p}(x) = \frac{1}{Nh} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)$$
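A minimal sketch of this estimator in code, following the formula above; h controls the amount of smoothing, and the sample is made up.

```python
import numpy as np

def parzen(x, data, h):
    """Estimate p(x) at each query point with Gaussian Parzen windows."""
    u = (x[:, None] - data[None, :]) / h            # scaled distances, (m, N)
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return K.sum(axis=1) / (len(data) * h)

data = np.array([1.0, 1.2, 3.0, 3.1, 3.3])          # made-up sample
print(parzen(np.linspace(0.0, 4.0, 5), data, h=0.5))
```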

k-Nearest Neighbor Estimator:

Instead of fixing bin width h and counting the number of instances, fix the instances
(neighbors) k and check bin width,
$$\hat{p}(x) = \frac{k}{2N d_k(x)}$$

dk(x), distance to kth closest instance to x.

The nearest neighbor class of estimators adapts the amount of smoothing to the local density of
data. The degree of smoothing is controlled by k, the number of neighbors taken into account,
which is much smaller than N, the sample size. Let us define a distance between a and b, for
example, |a − b|, and for each x define d1(x) ≤ d2(x) ≤ · · · ≤ dN(x) as the distances of the
training instances to x, sorted in increasing order.
Generalization of multivariate data

• Kernel density estimator:

$$\hat{p}(\mathbf{x}) = \frac{1}{Nh^d} \sum_{t=1}^{N} K\!\left(\frac{\mathbf{x} - \mathbf{x}^t}{h}\right)$$

• Multivariate Gaussian kernel

Spheric:

$$K(\mathbf{u}) = \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{\|\mathbf{u}\|^2}{2}\right)$$

Ellipsoid:

$$K(\mathbf{u}) = \frac{1}{(2\pi)^{d/2} |\mathbf{S}|^{1/2}} \exp\left(-\frac{1}{2} \mathbf{u}^T \mathbf{S}^{-1} \mathbf{u}\right)$$
where S is the sample covariance matrix. This corresponds to using Mahalanobis distance instead
of the Euclidean distance. It is also possible to have the distance metric local where S is
calculated from instances in the vicinity of x, for example, some k closest instances.
Note that S calculated locally may be singular and PCA (or LDA, in the case of classification)
may be needed.

Hamming distance:

When the inputs are discrete, we can use Hamming distance, which counts the number of
nonmatching attributes.
Nonparametric Classification

• Estimate p(x|Ci) and use Bayes’ rule

• Kernel estimator:

$$\hat{p}(\mathbf{x} \mid C_i) = \frac{1}{N_i h^d} \sum_{t=1}^{N} K\!\left(\frac{\mathbf{x} - \mathbf{x}^t}{h}\right) r_i^t, \qquad \hat{P}(C_i) = \frac{N_i}{N}$$

$$g_i(\mathbf{x}) = \hat{p}(\mathbf{x} \mid C_i)\,\hat{P}(C_i) = \frac{1}{Nh^d} \sum_{t=1}^{N} K\!\left(\frac{\mathbf{x} - \mathbf{x}^t}{h}\right) r_i^t$$

• k-NN estimator:

$$\hat{p}(\mathbf{x} \mid C_i) = \frac{k_i}{N_i V^k(\mathbf{x})}, \qquad \hat{P}(C_i \mid \mathbf{x}) = \frac{\hat{p}(\mathbf{x} \mid C_i)\,\hat{P}(C_i)}{\hat{p}(\mathbf{x})} = \frac{k_i}{k}$$

where k_i is the number of the k nearest neighbors that belong to C_i and V^k(x) is the volume
of the hypersphere centered at x that contains the k nearest instances.
The k-NN classifier assigns the input to the class having the most examples among the k neighbors of
the input. All neighbors have equal vote, and the class having the maximum number of voters
among the k neighbors is chosen. Ties are broken arbitrarily, or a weighted vote is taken. k is
generally taken to be an odd number to minimize ties; confusion is generally between two
neighboring classes. Again, the use of Euclidean distance corresponds to assuming uncorrelated
inputs with equal variances, and when this is not the case a suitable discriminant metric should
be used. One example is discriminant adaptive nearest neighbor, where the optimal distance to
separate classes is estimated locally.
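A minimal sketch of this classifier: Euclidean distance, equal votes, odd k to reduce ties. The toy data is made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every instance
    nearest = np.argsort(d)[:k]               # indices of the k closest instances
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

X_train = np.array([[0, 0], [0, 1], [5, 5], [6, 5]], dtype=float)
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([5.0, 4.0])))  # -> 1
```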

Nonparametric Regression

Aka smoothing models


Regressogram
Running Mean/Kernel Smoother:

Instead of taking an average and giving a constant fit at a point, we can take into account one
more term in the Taylor expansion and calculate a linear fit. In the running line smoother, we can
use the data points in the neighborhood, as defined by h or k, and fit a local regression line. In
the locally weighted running line smoother, known as loess, instead of a hard definition of
neighborhoods, we use kernel weighting such that distant points have less effect on error.

Decision Trees
Algorithm:
Attribute Selection Measure:

Example:
After further iterations, the final decision tree would be,
Univariate Trees
Multivariate Trees

Learning rules from data

From the decision tree, the following rules are identified.

If age=youth and student=no then buys_computer=no.


If age=youth and student=yes then buys_computer=yes.
If age=middle_aged then buys_computer=yes.
If age=senior and credit_rating=fair then buys_computer=yes.
If age=senior and credit_rating=excellent then buys_computer=no.

• Rule induction is similar to tree induction but


o tree induction is breadth-first,
o rule induction is depth-first; one rule at a time
• Rule set contains rules; rules are conjunctions of terms
• Rule covers an example if all terms of the rule evaluate to true for the example
• Sequential covering: Generate rules one at a time until all positive examples are covered.
Linear Discrimination

Likelihood- vs. Discriminant-based Classification:

Likelihood-based: Assume a model for p(x|Ci), use Bayes’ rule to calculate P(Ci|x)

gi(x) = log P(Ci|x)

Discriminant-based: Assume a model for gi(x|Φi); no density estimation

Estimating the boundaries is enough; no need to accurately estimate the densities inside
the boundaries

Linear Discriminant:

Linear discriminant:

$$g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T \mathbf{x} + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}$$

Advantages:

Simple: O(d) space/computation

Knowledge extraction: Weighted sum of attributes; positive/negative weights,


magnitudes (credit scoring)

Optimal when p(x|Ci) are Gaussian with shared cov matrix; useful when classes
are (almost) linearly separable

Generalized Linear Model:

Quadratic discriminant:

$$g_i(\mathbf{x} \mid \mathbf{W}_i, \mathbf{w}_i, w_{i0}) = \mathbf{x}^T \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$

Higher-order (product) terms:

$$z_1 = x_1, \quad z_2 = x_2, \quad z_3 = x_1^2, \quad z_4 = x_2^2, \quad z_5 = x_1 x_2$$

Map from x to z using nonlinear basis functions and use a linear discriminant in z-space:

$$g_i(\mathbf{x}) = \sum_{j=1}^{k} w_{ij} \phi_j(\mathbf{x})$$
Two Classes:

$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x}) = (\mathbf{w}_1^T \mathbf{x} + w_{10}) - (\mathbf{w}_2^T \mathbf{x} + w_{20}) = (\mathbf{w}_1 - \mathbf{w}_2)^T \mathbf{x} + (w_{10} - w_{20}) = \mathbf{w}^T \mathbf{x} + w_0$$

Choose C1 if g(x) > 0, and C2 otherwise.
Multiple Classes:

Sigmoid (Logistic) Function:


Gradient-Descent:
Logistic Discrimination:
UNIT IV

MULTILAYER PERCEPTRONS

Structure of brain - Neural networks as a parallel processing - Perceptron - Multilayer perceptron -


Backpropagation - Training procedures - Tuning the network size - Learning time.

Structure of the Brain


The human brain is one of the most complicated things that we have studied in detail, and it is, on
the whole, poorly understood. We do not have satisfactory answers to the most fundamental of
questions such as "what is my mind?" and "how do I think?". Nevertheless, we do have a basic
understanding of the operation of the brain at a low level. It contains approximately ten thousand
million (10^11) basic units, called neurons. Each of these neurons is connected to about ten
thousand (10^4) others. To put this in perspective, imagine an Olympic-sized swimming pool,
empty. The number of raindrops that it would take to fill the pool is approximately 10^11. You
would also need at least a dozen full address books to be able to contact 10^4 other people.
The neuron is the basic unit of the brain, and is a stand-alone analogue logical processing unit.
The neurons form two main types, local processing interneuron cells that have their input and
output connections over about 100 microns, and output cells that connect different regions of the
brain to each other, connect the brain to muscle, or connect from sensory organs into the brain.
The operation of the neuron is a complicated and not fully understood process on a microscopic
level, although the basic details are relatively clear. The neuron accepts many inputs, which are
all added up in some fashion. If enough active inputs are received at once, then the neuron will
be activated and “fire”; if not, then the neuron will remain in its inactive, quiet state. A
representation of the basic features of a neuron is shown in the below figure.

The soma is the body of the neuron. Attached to the soma are long, irregularly shaped filaments,
called dendrites. These nerve processes are often less than a micron in diameter, and have
complex branching shapes. Their intricate shape resembles that of a tree in winter, without
leaves, whose branches fork and fork again into finer structure. The dendrites act as the
connections through which all the inputs to the neuron arrive. These cells are able to perform
more complex functions than simple addition on the inputs they receive, but considering a simple
summation is a reasonable approximation.
The basic features of a biological neuron

Another type of nerve process attached to the soma is called an axon. This is electrically active,
unlike the dendrite, and serves as the output channel of the neuron. Axons always appear on
output cells, but are often absent from interneurons, which have both inputs and outputs on
dendrites. The axon is a non-linear threshold device, producing a voltage pulse, called an action
potential, that lasts about 1 millisecond (10^-3 s) when the resting potential within the soma rises
above a certain critical threshold. This action potential is in fact a series of rapid voltage spikes.
The axon terminates in a specialised contact called a synapse that couples the axon with the
dendrite of another cell. There is no direct linkage across the junction; rather, it is a temporary
chemical one. The synapse releases chemicals called neurotransmitters when its potential is
raised sufficiently by the action potential. It may take the arrival of more than one action
potential before the synapse is triggered. The neurotransmitters that are released by the synapse
diffuse across the gap, and chemically activate gates on the dendrites, which, when open, allow
charged ions to flow. It is this flow of ions that alters the dendritic potential, and provides a
voltage pulse on the dendrite, which is then conducted along into the next neuron body. Each
dendrite may have many synapses acting on it, allowing massive interconnectivity to be
achieved. At the synaptic junction, the number of gates opened on the dendrite depends on the
number of neurotransmitters released. It also appears that some synapses excite the dendrite they
affect, whilst others serve to inhibit it. This corresponds to altering the local potential of the
dendrite in a positive or negative direction. A single neuron will have many synaptic inputs on its
dendrites, and may have many synaptic outputs connecting it to other cells.

Neural networks as parallel processing


Perceptron

It performs a weighted sum of its inputs, compares this to some internal threshold level, and
turns on only if this level is exceeded. If not, it stays off. Because the inputs are passed through
the model neuron to produce the output, the system is known as a feedforward one. We need to
formulate this mathematically. If there are n inputs, then there are n associated weights on the
input lines. The model neuron calculates the weighted sum of its inputs; it takes the first input,
multiplies it by the weight on that input line, then does the same for the next input, and so on,
adding them all up at the end. This can be written as,

Outline of the basic model.

This sum then has to be compared to a certain value in the neuron, the threshold value. This
thresholding process is accomplished by comparison; if the sum is greater than the threshold
value, then output a 1; if less, output a 0. This can be seen graphically in the figure given below,
where the x-axis represents the input, and the y-axis the output.
The thresholding function

Equivalently, the threshold value can be subtracted from the weighted sum, and the resulting
value compared to zero; if the result is positive, then output a 1, else output a 0. Notice that the
shape of the function is the same, but now the jump occurs at zero. The threshold effectively
adds an offset to the weighted sum. An alternative way of achieving the same effect is to take the
threshold out of the body of the model neuron and connect it to an extra input value that is fixed
to be "on" all the time. In this case, rather than subtracting the threshold value from the weighted
sum, the extra input of +1 is multiplied by a weight equal to minus the threshold value, −θ, and
added in along with all the other inputs; this is known as biasing the neuron. The value of −θ is
therefore known as the neuron's bias or offset. Both approaches are equivalent, and either is
acceptable.
Notice that the lower limit of the summation has changed from 1 to 0, and that the value of the
input x0 is always set to 1. This model of the neuron, shown in the figure below, was proposed
in 1943 by McCulloch and Pitts. Their model came about in much the same way as we have
developed ours, and stemmed from their research into the behaviour of the neurons in the brain.
It is important to look at the features of this McCulloch-Pitts neuron. It is a simple enough unit,
thresholding a weighted sum of its inputs to get an output. It specifically does not take any
account of the complex patterns and timings of actual nervous activity in real neural systems, nor
does it have any of the complicated features found in the body of biological neurons.
Details of basic Model

This ensures its status as a model, and not a copy, of a real neuron, and makes it possible to
implement on a digital computer. This is the strength of the model; now we need to investigate
what can be achieved using this simple design. The arrangement of the connections between the
neurons is important, but, continuing our trend of choosing simple models to get an idea of what
is happening in a complicated real-world situation, we shall for the time being consider only one
layer of neurons, where we study the outputs of the neurons under a known set of inputs. The
model neurons, connected up in a simple fashion, were given the name "perceptrons" by Frank
Rosenblatt in 1962. He pioneered the simulation of neural networks on digital computers, as well
as their formal analysis. In his book "Principles of Neurodynamics", he describes these
perceptrons as simplified networks in which certain properties of real nervous systems are
exaggerated whilst others are ignored. He stated that they are not intended to serve as detailed
copies of any real nervous system; in other words, he realised at this early stage that he was
dealing with a basic model. This fact is often lost in the popular press as the idea of computer
"brains", based on these techniques, grabs the imagination. We are not attempting to build
computer brains, nor are we trying to mimic parts of real brains; rather, we are aiming to discover
the properties of models that take their behaviour from extremely simplified versions of natural
neural systems, usually on a massively reduced scale as well. Whereas the brain has at least 10^11
neurons, each connected to 10^4 others, we are concerned here with maybe a few hundred neurons
at most, connected to a few thousand input lines.

The learning paradigm can be summarised as follows:

• set the weights and thresholds randomly

• present an input
• calculate the actual output by taking the thresholded value of the weighted sum of the
inputs
• alter the weights to reinforce correct decisions and discourage incorrect decisions, i.e.,
reduce the error
• present the next input, etc.

The perceptron learning algorithm
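A minimal sketch of this paradigm in code, with the bias folded in as the constant input x0 = 1 as described above; the learning rate, epoch count, and toy task are illustrative.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    X = np.hstack([np.ones((len(X), 1)), X])    # x0 = 1 carries the bias weight
    w = rng.normal(scale=0.1, size=X.shape[1])  # set the weights randomly
    for _ in range(epochs):
        for xt, rt in zip(X, y):
            out = 1 if w @ xt > 0 else 0        # thresholded weighted sum
            w += eta * (rt - out) * xt          # reinforce/discourage to reduce error
    return w

# Learning the Boolean AND function (linearly separable, so this converges).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
w = train_perceptron(X, np.array([0, 0, 0, 1]))
print([1 if w @ np.r_[1.0, x] > 0 else 0 for x in X])  # [0, 0, 0, 1]
```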


Learning Boolean Functions:
Multilayer perceptron
Backpropagation
Backpropagation learns by iteratively processing a data set of training tuples, comparing the
network’s prediction for each tuple with the actual known target value. The target value may be
the known class label of the training tuple (for classification problems) or a continuous value (for
numeric prediction). For each training tuple, the weights are modified so as to minimize the
mean-squared error between the network’s prediction and the actual target value. These
modifications are made in the “backwards” direction (i.e., from the output layer) through each
hidden layer down to the first hidden layer (hence the name backpropagation). Although it is not
guaranteed, in general the weights will eventually converge, and the learning process stops.
Backpropagation Algorithm
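A condensed sketch of one update step for a single-hidden-layer network with sigmoid units and squared error, matching the backwards order described above; the shapes and names (W1, b1, w2, b2) are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, target, W1, b1, w2, b2, eta=0.9):
    # Forward pass: hidden activations, then the network output.
    z = sigmoid(W1 @ x + b1)                 # hidden layer, shape (H,)
    y = sigmoid(w2 @ z + b2)                 # scalar prediction
    # Backward pass: output error first, then propagate to the hidden layer.
    err_out = (target - y) * y * (1.0 - y)
    err_hid = z * (1.0 - z) * (err_out * w2)
    # Update weights and biases (gradient step on the squared error).
    w2 += eta * err_out * z
    b2 += eta * err_out
    W1 += eta * np.outer(err_hid, x)
    b1 += eta * err_hid
    return y
```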
Example:

The figure given below shows a multilayer feed-forward neural network. Let the learning rate be
0.9. The initial weight and bias values of the network are given in Table 1, along with the first
training tuple, X = (1, 0, 1), with a class label of 1 (target value Tj). This example shows the
calculations for backpropagation, given the first training tuple, X. The tuple is fed into the
network, and the net input and output of each unit are computed. The error of each unit is
computed and propagated backward, and the error values and the weight and bias updates are
calculated.
Training Procedures

Improving Convergence:

Momentum:

$$\Delta w_i^t = -\eta \frac{\partial E^t}{\partial w_i} + \alpha\, \Delta w_i^{t-1}$$

Adaptive learning rate:

$$\Delta \eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\eta & \text{otherwise} \end{cases}$$

Overfitting/Overtraining:

• Number of weights: H (d+1)+(H+1)K

As complexity increases, training error is fixed but the validation error starts to increase and the
network starts to overfit.
As training continues, the validation error starts to increase and the network starts to overfit.

Hints:
Tuning the Network Size

Learning Time

Applications:
Sequence recognition: Speech recognition
Sequence reproduction: Time-series prediction
Sequence association
Network architectures
Time-delay networks (Waibel et al., 1989)
Recurrent networks (Rumelhart et al., 1986)

Time-Delay Neural Networks:

In a time delay neural network, previous inputs are delayed in time so as to synchronize with the
final input, and all are fed together as input to the system. Backpropagation can then be used to
train the weights. To extract features local in time, one can have layers of structured connections
and weight sharing to get translation invariance in time. The main restriction of this architecture
is that the size of the time window we slide over the sequence should be fixed a priori.
Inputs in a time window of length T are delayed in time until we can feed all T inputs as the
input vector to the MLP.

Recurrent Networks:
Examples of MLP with partial recurrency. Recurrent connections are shown with dashed lines:
(a) self-connections in the hidden layer, (b) self-connections in the output layer, and (c)
connections from the output to the hidden layer. Combinations of these are also possible.

Unfolding in Time:

If the sequences have a small maximum length, then unfolding in time can be used to convert an
arbitrary recurrent network to an equivalent feedforward network. A separate unit and
connection is created for copies at different times. The resulting network can be trained with
backpropagation with the additional requirement that all copies of each connection should
remain identical. The solution, as in weight sharing, is to sum up the different weight changes in
time and change the weight by the average. This is called backpropagation through time
(Rumelhart, Hinton, and Williams 1986b). The problem with this approach is the memory
requirement if the length of the sequence is large. Real time recurrent learning (Williams and
Zipser 1989) is a learning algorithm for training recurrent networks without unfolding and has
the advantage that it can use sequences of arbitrary length.
UNIT V

LOCAL MODELS

Competitive learning - Adaptive resonance theory - Self organizing map - Basis functions - Learning
vector quantization - Assessing and Comparing Classification Algorithms - Combining Multiple Learners –
Reinforcement Learning.

Introduction
Divide the input space into local regions and learn simple (constant/linear) models in each
patch

Unsupervised: Competitive, online clustering


Supervised: Radial-basis functions, mixture of experts

Competitive learning
The term competitive learning is used because it is as if these groups, or rather the units
representing these groups, compete among themselves to be the one responsible for
representing an instance. The model is also called winner-take-all; it is as if one group wins and
gets updated, and the others are not updated at all.

An online method has the usual advantages that (1) we do not need extra memory to store
the whole training set; (2) updates at each step are simple to implement, for example, in
hardware; and (3) the input distribution may change in time and the model adapts itself to these
changes automatically. If we were to use a batch algorithm, we would need to collect a new
sample and run the batch method from scratch over the whole sample.
Online k-Means:

The reconstruction error is

$$E\left(\{\mathbf{m}_i\}_{i=1}^{k} \mid X\right) = \sum_t \sum_i b_i^t \left\| \mathbf{x}^t - \mathbf{m}_i \right\|^2$$

$$b_i^t = \begin{cases} 1 & \text{if } \|\mathbf{x}^t - \mathbf{m}_i\| = \min_l \|\mathbf{x}^t - \mathbf{m}_l\| \\ 0 & \text{otherwise} \end{cases}$$

Batch k-means:

$$\mathbf{m}_i = \frac{\sum_t b_i^t \mathbf{x}^t}{\sum_t b_i^t}$$

Online k-means:

$$\Delta m_{ij} = -\eta \frac{\partial E^t}{\partial m_{ij}} = \eta\, b_i^t \left( x_j^t - m_{ij} \right)$$

Online k-means algorithm.
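A sketch of this update in code: for each incoming x, the winning (closest) center moves a fraction η toward x. The initialization near the first point is one illustrative choice among many.

```python
import numpy as np

def online_kmeans(stream, k, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    m = None
    for x in stream:
        if m is None:  # initialize k centers near the first point (one choice of many)
            m = np.tile(x, (k, 1)) + rng.normal(scale=0.01, size=(k, len(x)))
        i = int(np.argmin(np.linalg.norm(m - x, axis=1)))  # winner-take-all
        m[i] += eta * (x - m[i])                           # move winner toward x
    return m
```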


Winner-take-all network:
Adaptive Resonance Theory
Self-Organizing Maps
Radial Basis Functions
Local vs Distributed Representation:

The difference between local and distributed representations.

The values are hard, 0/1, values. One can use soft values in (0, 1) and get a more informative
encoding. In the local representation, this is done by the Gaussian RBF that uses the distance to
the center, mi , and in the distributed representation, this is done by the sigmoid that uses the
distance to the hyperplane, wi .

Regression:

$$E\left(\{\mathbf{m}_h, s_h, w_{ih}\}_{i,h} \mid X\right) = \frac{1}{2} \sum_t \sum_i \left( r_i^t - y_i^t \right)^2$$

$$y_i^t = \sum_{h=1}^{H} w_{ih}\, p_h^t + w_{i0}$$

$$\Delta w_{ih} = \eta \sum_t \left( r_i^t - y_i^t \right) p_h^t$$

$$\Delta m_{hj} = \eta \sum_t \left[ \sum_i \left( r_i^t - y_i^t \right) w_{ih} \right] p_h^t\, \frac{x_j^t - m_{hj}}{s_h^2}$$

$$\Delta s_h = \eta \sum_t \left[ \sum_i \left( r_i^t - y_i^t \right) w_{ih} \right] p_h^t\, \frac{\left\| \mathbf{x}^t - \mathbf{m}_h \right\|^2}{s_h^3}$$
Training RBF:

• Hybrid learning:
o First layer centers and spreads:
Unsupervised k-means
o Second layer weights:
Supervised gradient-descent
• Fully supervised

Classification:

$$E\left(\{\mathbf{m}_h, s_h, w_{ih}\}_{i,h} \mid X\right) = -\sum_t \sum_i r_i^t \log y_i^t$$

$$y_i^t = \frac{\exp\left[ \sum_h w_{ih}\, p_h^t + w_{i0} \right]}{\sum_k \exp\left[ \sum_h w_{kh}\, p_h^t + w_{k0} \right]}$$

Rules and Exceptions:


Learning Vector Quantization

Assessing and Comparing Classification Algorithms

$$L\left(\{\mathbf{m}_h, s_h, w_{ih}\}_{i,h} \mid X\right) = \sum_t \log \sum_h g_h^t \prod_i \left( y_{ih}^t \right)^{r_i^t} = \sum_t \log \sum_h g_h^t \exp\left[ \sum_i r_i^t \log y_{ih}^t \right]$$

$$y_{ih}^t = \frac{\exp w_{ih}}{\sum_k \exp w_{kh}} \quad \text{or, with linear experts,} \quad y_{ih}^t = \frac{\exp \mathbf{v}_{ih}^T \mathbf{x}^t}{\sum_k \exp \mathbf{v}_{kh}^T \mathbf{x}^t}$$
Combining Multiple Learners
• Early integration: Concat all features and train a single learner
• Late integration: With each feature set, train one learner, then either use a fixed rule or
stacking to combine decisions
• Intermediate integration: With each feature set, calculate a kernel, then use a single SVM
with multiple kernels
• Combining features vs decisions vs kernels

Rationale:

No Free Lunch Theorem: There is no algorithm that is always the most accurate
Generate a group of base-learners which when combined has higher accuracy
Different learners use different
Algorithms
Hyperparameters
Representations /Modalities/Views
Training sets
Subproblems
Diversity vs accuracy

Voting:
Fixed Combination Rules:
Error-Correcting Output Codes:

Bagging:

Use bootstrapping to generate L training sets and train one base-learner with each
(Breiman, 1996)

Use voting (Average or median with regression)

Unstable algorithms profit from bagging
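A hedged sketch of bagging with majority voting; `make_learner` is a placeholder for any base-learner factory with scikit-learn-style fit/predict methods, and class labels are assumed to be nonnegative integers.

```python
import numpy as np

def bagging_fit(X, y, make_learner, L=10, seed=0):
    rng = np.random.default_rng(seed)
    learners = []
    for _ in range(L):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap: sample with replacement
        learners.append(make_learner().fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    votes = np.stack([l.predict(X) for l in learners]).astype(int)  # (L, n)
    # Majority vote per instance; labels assumed to be 0..K-1.
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```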


AdaBoost:

Generate a sequence of base-learners each focusing on previous one’s errors

(Freund and Schapire, 1996).


Fine-Tuning an Ensemble:

Given an ensemble of dependent classifiers, do not use it as is, try to get independence
1. Subset selection: Forward (growing)/Backward (pruning) approaches to improve
accuracy/diversity/independence
2. Train metaclassifiers: From the output of correlated classifiers, extract new
combinations that are uncorrelated. Using PCA, we get “eigenlearners.”
Similar to feature selection vs feature extraction

Cascading:

• Use dj only if preceding ones are not confident


• Cascade learners in order of complexity
Reinforcement Learning

Elements of RL (Markov Decision Processes):

st : State of agent at time t


at: Action taken at time t
In st, action at is taken, clock ticks and reward rt+1 is received and state changes to st+1
Next state prob: P (st+1 | st , at )
Reward prob: p (rt+1 | st , at )
Initial state(s), goal state(s)
Episode (trial) of actions from initial state to goal
(Sutton and Barto, 1998; Kaelbling et al., 1996)

Policy and Cumulative Reward:


Model-Based Learning:

Value Iteration:
Policy Iteration:

Q-learning
Sarsa:
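A minimal sketch of the tabular updates for both algorithms: Q-learning bootstraps from the greedy value max over next actions, while Sarsa uses the action actually taken. The table sizes and values are illustrative.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, eta=0.1, gamma=0.9):
    # Q-learning: off-policy, uses the max over next actions.
    Q[s, a] += eta * (r + gamma * Q[s_next].max() - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, eta=0.1, gamma=0.9):
    # Sarsa: on-policy, uses the action a_next actually taken in s_next.
    Q[s, a] += eta * (r + gamma * Q[s_next, a_next] - Q[s, a])

Q = np.zeros((5, 2))                    # toy table: 5 states, 2 actions
q_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])                          # 0.1
```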
