INTRODUCTION TO
MACHINE LEARNING
Designing a Learning System
⚫ According to Tom Mitchell, “A computer program is said
to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E.”
⚫ Example: In Spam E-Mail detection,
⚫ Task, T: To classify mails into Spam or Not Spam.
⚫ Performance measure, P: The percentage of mails correctly
classified as “Spam” or “Not Spam”.
⚫ Experience, E: A set of mails with labels “Spam” or “Not Spam”.
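As a minimal sketch of how T, P, and E fit together in code (the keyword rule and the two labeled mails below are toy placeholders, not a real learned model):

# Toy stand-in for a learned spam classifier; a real model would be induced from E.
def classify(mail):
    return "Spam" if "free money" in mail.lower() else "Not Spam"

# Experience E: mails with known labels.
labeled_mails = [
    ("Free money, claim now!", "Spam"),
    ("Meeting moved to 3pm", "Not Spam"),
]

# Performance measure P: percent of mails classified correctly (task T).
correct = sum(classify(mail) == label for mail, label in labeled_mails)
print(f"P = {100 * correct / len(labeled_mails):.1f}% correctly classified")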
Designing a Learning System
1.2.1 Choose the training experience.
1.2.2 Choose exactly what is to be learned, i.e.
the target function.
1.2.3 Choose how to represent the target
function.
1.2.4 Choose a learning algorithm to infer the
target function from the experience.
1.2.5 The final design.
[Figure: the learning cycle. The Environment/Experience feeds the Learner, which produces Knowledge used by the Performance Element.]
1.2.1 Choosing the Training Experience:
□ The first and most important task is to choose the training data or training
experience that will be fed to the machine learning algorithm.
□ The data or experience we feed to the algorithm has a significant impact on
the success or failure of the model, so the training data or experience
should be chosen wisely.
Below are the attributes of the training experience that impact the success
or failure of the model:
⚫ Whether the training experience provides direct or indirect feedback
regarding the choices made. For example: while playing chess, the training
experience can provide feedback such as “if this move is chosen instead of
that one, the chances of success increase.”
⚫ The second important attribute is the degree to which the learner
controls the sequence of training examples.
For example: when training data is first fed to the machine, accuracy
is very low, but as the machine gains experience by playing again and
again against itself or an opponent, the algorithm receives feedback and
controls the chess game accordingly.
⚫ The third important attribute is how well the training experience represents
the distribution of examples over which the final system performance will
be measured.
For example, a machine learning algorithm gains experience by working
through a number of different cases and examples. The more representative
those examples are of the situations the final system will face, the better
its measured performance will be.
A checkers learning problem:
⚫ Task T: playing checkers
⚫ Performance measure P: percent of games won in the world tournament
⚫ Training experience E: games played against itself
In order to complete the design of the learning system, we
must now choose
1. the exact type of knowledge to be learned
2. a representation for this target knowledge
3. a learning mechanism
1.2.2 Choosing the Target Function:
The next important step is choosing the target function.
⚫ This means that, based on the knowledge fed to the algorithm, the
learner chooses a NextMove function that describes which legal
move should be taken.
⚫ For example:
While playing chess against an opponent, whenever the opponent moves,
the machine learning algorithm decides which of the possible legal
moves to take in order to succeed.
⚫ What function is to be learned and how will it
be used by the performance system?
⚫ For checkers, assume we are given a function for
generating the legal moves for a given board position
and want to decide the best move.
⚫ Could learn a function:
ChooseMove(board, legal-moves) → best-move
⚫ Or could learn an evaluation function, V(board) → R, that
gives each board position a score for how favorable it is. V
can be used to pick a move by applying each legal move,
scoring the resulting board position, and choosing the
move that results in the highest scoring board position.
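A one-function sketch of this move-selection rule, assuming hypothetical helpers legal_moves(board) and apply_move(board, move) plus a learned scorer V:

# Pick the legal move whose resulting board position scores highest under V.
def choose_move(board, legal_moves, apply_move, V):
    return max(legal_moves(board), key=lambda move: V(apply_move(board, move)))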
Ideal Definition of V(b)
⚫ If b is a final winning board, then V(b) = 100
⚫ If b is a final losing board, then V(b) = –100
⚫ If b is a final draw board, then V(b) = 0
⚫ Otherwise, if b is not a final state then V(b) = V(b´),
where b´ is the highest scoring final board position
that is achieved starting from b and playing
optimally until the end of the game (assuming the
opponent plays optimally as well).
⚫ Can be computed using complete mini-max search of
the finite game tree.
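A minimal minimax sketch of this ideal (non-operational) definition, assuming hypothetical helpers is_final(b), final_value(b) (returning +100, –100, or 0), and successors(b):

# Exhaustive minimax over the complete game tree: correct but intractable.
def ideal_V(b, is_final, final_value, successors, our_turn=True):
    if is_final(b):
        return final_value(b)  # +100 win, -100 loss, 0 draw
    values = [ideal_V(s, is_final, final_value, successors, not our_turn)
              for s in successors(b)]
    # we pick the best outcome on our turn; the opponent picks the worst for us
    return max(values) if our_turn else min(values)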
Approximating V(b)
⚫ Computing V(b) is intractable since it involves
searching the complete exponential game tree.
⚫ Therefore, this definition is said to be non-operational.
⚫ An operational definition can be computed in reasonable
(polynomial) time.
⚫ Need to learn an operational approximation to the ideal
evaluation function.
1.2.3 Choosing a Representation for the Target Function:
⚫ Once the machine knows all the possible legal moves, the next
step is to choose a representation for the function that selects
the optimized move, e.g. linear equations, a hierarchical graph
representation, tabular form, etc.
⚫ The NextMove function then selects the target move, i.e. the move
among the candidates that offers the highest success rate.
⚫ For example: while playing chess, if the machine has 4 possible
moves, it will choose the optimized move that leads it to success.
Linear Function for Representing
V(b)
⚫ In checkers, use a linear approximation of
the evaluation function.
⚫ x1: the number of black pieces on the board
⚫ x2: the number of red pieces on the board
⚫ x3: the number of black kings on the board
⚫ x4: the number of red kings on the board
⚫ x5: the number of black pieces threatened by red
(i.e., which can be captured on red's next turn)
⚫ x6: the number of red pieces threatened by black
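With these features, the evaluation function takes the linear form V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6, where the weights w0 through w6 are what the learner must tune. A minimal sketch of evaluating this representation:

# Linear evaluation function: weights = [w0, ..., w6], features = [x1, ..., x6].
def V_hat(weights, features):
    w0, *rest = weights
    return w0 + sum(w * x for w, x in zip(rest, features))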
Partial design of a checkers
learning program:
⚫ Task T: playing checkers
⚫ Performance measure P: percent of games won in the
world tournament
⚫ Training experience E: games played against itself
⚫ Target function: V : Board → ℝ
⚫ Target function representation: V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
The first three items above correspond to the specification of the learning task,
whereas the final two items constitute design choices for the implementation of
the learning program.
1.2.4 Choosing a Function Approximation Algorithm
⚫ In order to learn the target function f we require a set of
training examples, each describing a specific board state
b and the training value Vtrain(b) for b.
⚫ In other words, each training example is an ordered pair
of the form (b, Vtrain(b)).
⚫ For instance, the following training example describes a
board state b in which black has won the game (note x2 =
0 indicates that red has no remaining pieces) and for
which the target function value Vtrain(b) is therefore +100:
⟨⟨x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0⟩, +100⟩
1.2.4.1 ESTIMATING
TRAINING VALUES
⚫ Despite the ambiguity inherent in estimating training
values for intermediate board states, one simple
approach has been found to be surprisingly successful.
⚫ This approach is to set the training value Vtrain(b) for any
intermediate board state b to V̂(Successor(b)), where V̂ is the
learner's current approximation to V and where Successor(b)
denotes the next board state following b for which it is again
the program's turn to move (i.e., the board state following the
program's move and the opponent's response):
Vtrain(b) ← V̂(Successor(b))
Temporal Difference Learning
⚫ Estimate training values for intermediate
(non-terminal) board positions by the estimated
value of their successor in an actual game trace:
Vtrain(b) ← V̂(successor(b))
where successor(b) is the next board position at
which it is the program’s move in actual play.
⚫ Values towards the end of the game are initially
more accurate and continued training slowly
“backs up” accurate values to earlier board
positions.
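A sketch of this backing-up scheme over one self-play game trace, assuming boards lists (in order) the positions at which it was the program's turn to move, V_hat_fn scores a board with the current weights, and final_value is the known terminal outcome:

# Build (b, Vtrain(b)) pairs: each intermediate position is valued by its successor.
def training_values(boards, V_hat_fn, final_value):
    examples = [(b, V_hat_fn(successor))        # Vtrain(b) <- V_hat(successor(b))
                for b, successor in zip(boards, boards[1:])]
    examples.append((boards[-1], final_value))  # terminal position: +100 / -100 / 0
    return examples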
Learning Algorithm
⚫ Uses training values for the target function to
induce a hypothesized definition that fits these
examples and hopefully generalizes to unseen
examples.
⚫ In statistics, learning to approximate a
continuous function is called regression.
⚫ Attempts to minimize some measure of error (loss
function), such as the squared error between the training
values and the values predicted by the hypothesis:
E ≡ Σ (Vtrain(b) - V̂(b))², summed over all training examples ⟨b, Vtrain(b)⟩
Least Mean Squares (LMS)
Algorithm
⚫ A gradient descent algorithm that incrementally
updates the weights of a linear function in an
attempt to minimize the mean squared error
Until the weights converge:
For each training example ⟨b, Vtrain(b)⟩ do:
1) Use the current weights to compute V̂(b) and the error:
error(b) = Vtrain(b) - V̂(b)
2) For each board feature xi, update its weight wi:
wi ← wi + c · xi · error(b)
for some small constant (learning rate) c
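A compact sketch of one LMS update, assuming each training example pairs a feature vector [x1, ..., x6] with its training value Vtrain(b):

# One LMS weight update for a single training example.
def lms_update(weights, features, v_train, c=0.1):
    xs = [1.0] + list(features)   # prepend x0 = 1 so w0 is updated like the rest
    error = v_train - sum(w * x for w, x in zip(weights, xs))  # Vtrain(b) - V_hat(b)
    return [w + c * x * error for w, x in zip(weights, xs)]    # wi <- wi + c*xi*error

Repeating this update over all training examples until the weights stop changing implements the “until the weights converge” loop above.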
1.2.5 The Final Design
The final design of our checkers learning system can
be naturally described by four distinct program
modules
⚫ The Performance System is the module that must solve
the given performance task. It takes an instance of a
new problem (a new game) as input and produces a trace
of its solution (the game history) as output.
⚫ The Critic takes as input the history or trace of the
game and produces as output a set of training examples
of the target function.
⚫ The Generalizer takes as input the training examples
and produces an output hypothesis that is its estimate of
the target function.
⚫ The Experiment Generator takes as input the current
hypothesis (currently learned function) and outputs a
new problem (i.e., initial board state) for the Performance
System to explore. Its role is to pick new practice
problems that will maximize the learning rate of the
overall system.
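A sketch of how the four modules could fit together in one training loop; each module name below is a hypothetical stand-in function for the corresponding component:

# One pass of the final design: generate a problem, play it, critique it, generalize.
def train(hypothesis, n_games, experiment_generator, performance_system,
          critic, generalizer):
    for _ in range(n_games):
        problem = experiment_generator(hypothesis)       # new initial board state
        trace = performance_system(problem, hypothesis)  # solution trace (game history)
        examples = critic(trace)                         # training examples (b, Vtrain(b))
        hypothesis = generalizer(examples)               # updated estimate of V
    return hypothesis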
PERSPECTIVES AND ISSUES IN
MACHINE LEARNING
□ Machine learning involves searching a very large space of possible hypotheses to
determine one that best fits the observed data and any prior knowledge held by the learner.
□ For example, consider the space of hypotheses that could in principle be output by
the above checkers learner.
□ This hypothesis space consists of all evaluation functions that can be represented by
some choice of values for the weights w0 through w6.
□ The learner's task is thus to search through this vast space to locate the hypothesis
that is most consistent with the available training examples.
□ The LMS algorithm for fitting weights achieves this goal by iteratively tuning the
weights, adding a correction to each weight each time the hypothesized evaluation
function predicts a value that differs from the training value.
□ This algorithm works well when the hypothesis representation considered by the
learner defines a continuously parameterized space of potential hypotheses.
Issues in Machine Learning
⚫ What algorithms exist for learning general target
functions from specific training examples?
⚫ How much training data is sufficient?
⚫ When and how can prior knowledge held by the
learner guide the process of generalizing from
examples?
⚫ What is the best strategy for choosing a useful
next training experience?
⚫ What is the best way to reduce the learning task to one
or more function approximation problems?