CS 9633 Machine Learning
Concept Learning
References:
• Machine Learning, Tom Mitchell, 1997, Chapter 2
• Artificial Intelligence: A Modern Approach, Russell and Norvig, Second Edition, 2003, pp. 678–686
• Elements of Machine Learning, Pat Langley, 1996, Chapter 2
Concept Learning
• Inferring a Boolean-valued function from
training examples
• Training examples are labeled as
members or non-members of the
concept
Concept Learning Task
Defined By
• Set of instances over which target
function is defined
• Target function
• Set of candidate hypotheses considered
by the learner
• Set of available training examples
Example Concept
• Days when you would enjoy water
sports
Instances X
Possible days, each described by the
attributes
– Sky (Sunny, Cloudy, Rainy)
– AirTemp (Warm, Cold)
– Humidity (Normal, High)
– Wind (Strong, Weak)
– Water (Warm, Cold)
– Forecast (Same, Change)
Hypotheses H
• Each hypothesis is a vector of six constraints, one per attribute
• For each attribute, the constraint is one of:
– ? if any value is acceptable for this attribute
– A single required value for the attribute
– 0 if no value is acceptable
• Sample hypothesis:
(Rainy, ?, ?, ?, Warm, ?)
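A minimal sketch of this representation in Python (the encoding of hypotheses as tuples of strings, with '?' and '0' as special tokens, and the helper name satisfies are my own choices):

```python
def satisfies(hypothesis, instance):
    """True iff every attribute constraint accepts the instance's value.
    '?' accepts anything; '0' accepts nothing (it never equals a real value)."""
    return all(c == '?' or c == v for c, v in zip(hypothesis, instance))

h = ('Rainy', '?', '?', '?', 'Warm', '?')   # the sample hypothesis above
x = ('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Same')
print(satisfies(h, x))   # True
```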
General and Specific
Hypotheses
• Most general hypothesis
(?, ?, ?, ?, ?, ?)
• Most specific hypothesis
(0, 0, 0, 0, 0, 0)
Target Concept c
EnjoySport: X → {0, 1}
Training Examples D
Example | Sky | AirTemp | Humidity | Wind | Water | Forecast | EnjoySport
1 | Sunny | Warm | Normal | Strong | Warm | Same | Yes
2 | Sunny | Warm | High | Strong | Warm | Same | Yes
3 | Rainy | Cold | High | Strong | Warm | Change | No
4 | Sunny | Warm | High | Strong | Cool | Change | Yes
Determine:
A hypothesis h in H such that
h(x) = c(x)
for all x in X
Inductive Learning
Hypothesis
Any hypothesis found to approximate
the target function well over a
sufficiently large set of training
examples will also approximate the
target function well over other
unobserved examples.
Concept Learning as
Search
• Concept learning can be viewed as
searching through a large space of
hypotheses implicitly defined by the
hypothesis representation.
Instance and hypothesis
space size
How many instances?
How many hypotheses?
How many semantically distinct hypotheses? (See the sketch after the attribute list below.)
• Sky (Sunny, Cloudy, Rainy)
• AirTemp (Warm, Cold)
• Humidity (Normal, High)
• Wind (Strong, Weak)
• Water (Warm, Cold)
• Forecast (Same, Change)
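A quick sketch that computes the answers from the value counts above: each attribute additionally admits '?' and '0', and any hypothesis containing a '0' classifies every instance as negative, so all such hypotheses are semantically equivalent:

```python
values = [3, 2, 2, 2, 2, 2]        # number of values per attribute (Sky has 3)

instances = 1
syntactic = 1
semantic = 1
for n in values:
    instances *= n                  # 3*2*2*2*2*2
    syntactic *= n + 2              # each attribute also allows '?' and '0'
    semantic  *= n + 1              # '?' or one value; '0' handled below
semantic += 1                       # all '0'-containing hypotheses collapse to one

print(instances, syntactic, semantic)   # 96 5120 973
```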
Searching hypothesis
space
• Goal is to efficiently search hypothesis
space to find the hypothesis that best
fits the training data
• Hypothesis space is potentially
– Very large
– Possibly infinite
General to Specific
Ordering of Hypotheses
• It is often possible to use a natural
general-to-specific ordering of the
hypothesis space to organize the
search
• Can often exhaustively search all of the
space without explicitly enumerating all
hypotheses
Example
h1 = <Sunny, ?, ?, Strong, ?, ?>
h2 = <Sunny, ?, ?, ?, ?, ?>
Which is more general?
Notation
• For any instance x in X and hypothesis
h in H, we say that x satisfies h if and
only if h(x) = 1.
Definition
• Let h_j and h_k be Boolean-valued functions defined over X. Then h_j is more general than or equal to h_k iff
h_j \geq_g h_k \iff (\forall x \in X)\,[(h_k(x) = 1) \rightarrow (h_j(x) = 1)]
• Definition of strictly more general than:
h_j >_g h_k \iff (h_j \geq_g h_k) \wedge \neg(h_k \geq_g h_j)
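A brute-force check of this definition, sketched in Python; it also answers the "which is more general?" question above. Enumerating X is feasible because |X| = 96 (satisfies is as in the earlier sketch):

```python
from itertools import product

VALUES = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
          ('Strong', 'Weak'), ('Warm', 'Cold'), ('Same', 'Change')]
X = list(product(*VALUES))          # all 96 instances

def satisfies(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    """hj >=_g hk iff every instance that satisfies hk also satisfies hj."""
    return all(satisfies(hj, x) for x in X if satisfies(hk, x))

h1 = ('Sunny', '?', '?', 'Strong', '?', '?')
h2 = ('Sunny', '?', '?', '?', '?', '?')
print(more_general_or_equal(h2, h1))   # True
print(more_general_or_equal(h1, h2))   # False: h2 is strictly more general
```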
Partially Ordered Sets
• Properties of a partial order
– Reflexive
– Transitive
– Antisymmetric
• Form a lattice
Important Point
• The ≥g and >g relations depend only on
which instances satisfy the hypotheses,
not on the target concept.
• We will now consider algorithms that
take advantage of this partial order
among hypotheses to organize the
search space
FIND-S Algorithm
• Approach: start with most specific
hypothesis and then generalize the
hypothesis when it does not cover a
training example
• A hypothesis “covers” a training example
when it correctly classifies the example
as true
FIND-S
1. Initialize h to the most specific hypothesis in
H
2. For each positive training instance x
For each attribute constraint ai in h
If the constraint ai is satisfied by x then
Do nothing
Else
Replace ai in h by the next more general
constraint that is satisfied by x
3. Output hypothesis h
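A sketch of FIND-S under the tuple encoding used earlier (None stands in for the most specific hypothesis <0, 0, 0, 0, 0, 0>, which no instance satisfies):

```python
def find_s(examples):
    """examples: list of (instance, label) pairs with Boolean labels."""
    h = None                                   # plays the role of <0, 0, 0, 0, 0, 0>
    for x, label in examples:
        if not label:
            continue                           # FIND-S ignores negative examples
        if h is None:
            h = tuple(x)                       # first positive: copy it exactly
        else:                                  # generalize each violated constraint
            h = tuple(hi if hi == xi else '?' for hi, xi in zip(h, x))
    return h

D = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
     (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), True),
     (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
     (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True)]
print(find_s(D))   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```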
Apply to Training Examples
D
Example | Sky | AirTemp | Humidity | Wind | Water | Forecast | EnjoySport
1 | Sunny | Warm | Normal | Strong | Warm | Same | Yes
2 | Sunny | Warm | High | Strong | Warm | Same | Yes
3 | Rainy | Cold | High | Strong | Warm | Change | No
4 | Sunny | Warm | High | Strong | Cool | Change | Yes
Traversing the lattice
[Diagram: FIND-S moves from the specific end of the partial order toward the general end]
Properties of FIND-S
• Each hypothesis is a conjunction of
attribute constraints
• Guaranteed to output the most specific
hypothesis within H that is consistent with
training examples.
• Final hypothesis is also consistent with
negative examples if:
– Correct target concept is contained in H
– Training examples are correct
Consider this example
Attribute 1 (values X, Y) | Attribute 2 (values A, B, C) | Label
X | A | Yes
X | B | Yes
X | C | No
FIND-S outputs <X, ?> after the two positives, but <X, ?> also covers the negative example <X, C>: the target concept is not in this conjunctive hypothesis space.
Issues
• Has the learner converged to the
correct target concept? Are there other
consistent hypotheses?
• Why prefer the most specific
hypothesis?
• Are the training examples consistent?
• What if there are several maximally
specific consistent hypotheses?
Candidate Elimination
Algorithm
• Goal is to output a description of all
hypotheses consistent with the training
examples.
• Computes description without explicitly
enumerating all members.
• Is also called Least Commitment Search.
• Like FIND-S, it uses more-general-than
partial ordering
Definition
• A hypothesis h is consistent with a set
of training examples D iff h(x) = c(x) for
each example <x, c(x)> in D.
Consistent(h, D) \equiv (\forall \langle x, c(x) \rangle \in D)\; h(x) = c(x)
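The same definition as a short predicate, assuming the earlier encoding where satisfies(h, x) plays the role of h(x) = 1:

```python
def satisfies(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def consistent(h, D):
    """h is consistent with D iff it reproduces the label of every example."""
    return all(satisfies(h, x) == label for x, label in D)
```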
Definition
• The version space, denoted VSH,D, with
respect to hypothesis space H and
training examples D, is the subset of
hypotheses from H consistent with the
training examples in D
VS_{H,D} \equiv \{ h \in H \mid Consistent(h, D) \}
List-Then-Eliminate
Algorithm
1. VersionSpace ← a list containing every hypothesis in H
2. For each training example <x, c(x)>
Remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in
VersionSpace
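A sketch of List-Then-Eliminate on the EnjoySport space; enumerating all 5120 syntactic hypotheses is still feasible here:

```python
from itertools import product

def satisfies(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

VALUES = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
          ('Strong', 'Weak'), ('Warm', 'Cold'), ('Same', 'Change')]
H = list(product(*[vals + ('?', '0') for vals in VALUES]))   # 5120 hypotheses

def list_then_eliminate(H, D):
    vs = list(H)
    for x, label in D:                       # drop hypotheses that misclassify x
        vs = [h for h in vs if satisfies(h, x) == label]
    return vs

D = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
     (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), True),
     (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
     (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True)]
print(len(list_then_eliminate(H, D)))        # 6 hypotheses survive
```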
More compact
representation of VS
• Candidate-elimination algorithm uses a
more compact representation of VS
• VS represented by most general and
specific members.
• These members form a boundary that
delimits the version space within the
partially ordered hypothesis space.
[Diagram: a version space. S: {<Sunny, Warm, ?, Strong, ?, ?>}; between the boundaries lie <Sunny, ?, ?, Strong, ?, ?>, <Sunny, Warm, ?, ?, ?, ?>, and <?, Warm, ?, Strong, ?, ?>; G: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}]
[Diagram: the boundary sets G = {G1, G2, ..., Gm} and S = {S1, S2, ..., Sn} delimit the version space; hypotheses more general than G or more specific than S lie in the inconsistent regions]
Definitions of general and specific boundaries
• Definition: The general boundary G, with respect to hypothesis space H and training data D, is the set of maximally general members of H consistent with D.
G \equiv \{ g \in H \mid Consistent(g, D) \wedge (\neg\exists g' \in H)\,[(g' >_g g) \wedge Consistent(g', D)] \}
• Definition: The specific boundary S, with respect to hypothesis space H and training data D, is the set of minimally general (maximally specific) members of H consistent with D.
S \equiv \{ s \in H \mid Consistent(s, D) \wedge (\neg\exists s' \in H)\,[(s >_g s') \wedge Consistent(s', D)] \}
Theorem 2.1
Version space representation theorem
X is an arbitrary set of instances
H is a set of Boolean-valued hypotheses defined over X
c: X → {0, 1} is an arbitrary concept defined over X
D is an arbitrary set of training examples {<x, c(x)>}
For all X, H, c, and D such that S and G are well defined,
VS_{H,D} = \{ h \in H \mid (\exists s \in S)(\exists g \in G)\,(g \geq_g h \geq_g s) \}
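An empirical check of the theorem, sketched with the earlier encoding. The boundary sets below are the final S and G computed for the EnjoySport data in the trace that follows; hypotheses containing '0' are omitted since none can be consistent with a positive example:

```python
from itertools import product

def satisfies(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general(h1, h2):
    return all(a == '?' or a == b for a, b in zip(h1, h2))

VALUES = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
          ('Strong', 'Weak'), ('Warm', 'Cold'), ('Same', 'Change')]
H = list(product(*[v + ('?',) for v in VALUES]))
D = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
     (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), True),
     (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
     (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True)]
S = [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
G = [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]

vs_direct   = {h for h in H if all(satisfies(h, x) == y for x, y in D)}
vs_boundary = {h for h in H if any(more_general(h, s) for s in S)
                           and any(more_general(g, h) for g in G)}
print(vs_direct == vs_boundary, len(vs_direct))   # True 6
```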
Candidate-Elimination
Learning Algorithm
• Initialize G to most general and S to
most specific
• Use examples to refine
Initialize G to the set of maximally general hypotheses in H
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do
• If d is a positive example
  • Remove from G any hypothesis inconsistent with d
  • For each hypothesis s in S that is not consistent with d
    • Remove s from S
    • Add to S all minimal generalizations h of s such that h is consistent with d, and some member of G is more general than h
    • Remove from S any hypothesis that is more general than another hypothesis in S
• If d is a negative example
  • Remove from S any hypothesis inconsistent with d
  • For each hypothesis g in G that is not consistent with d
    • Remove g from G
    • Add to G all minimal specializations h of g such that h is consistent with d, and some member of S is more specific than h
    • Remove from G any hypothesis that is less general than another hypothesis in G
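A sketch of the algorithm specialized to this conjunctive hypothesis space, where S always contains at most one hypothesis; noise-free data are assumed, and the helper names are my own:

```python
def satisfies(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general(h1, h2):
    """h1 >=_g h2, attribute-wise; None plays the role of <0, ..., 0>."""
    if h2 is None:
        return True
    if h1 is None:
        return False
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def min_generalize(s, x):
    """The minimal generalization of s that covers positive example x."""
    if s is None:
        return tuple(x)
    return tuple(si if si == xi else '?' for si, xi in zip(s, x))

def min_specialize(g, x, values):
    """All minimal specializations of g that exclude negative example x."""
    return [g[:i] + (v,) + g[i+1:]
            for i, (gi, xi) in enumerate(zip(g, x)) if gi == '?'
            for v in values[i] if v != xi]

def candidate_elimination(D, values):
    s = None                                   # most specific boundary (one member)
    G = [('?',) * len(values)]                 # most general boundary
    for x, label in D:
        if label:                              # positive example
            G = [g for g in G if satisfies(g, x)]
            s = min_generalize(s, x)
        else:                                  # negative example (s never covers it
            new_G = []                         # when the data are noise-free)
            for g in G:
                if not satisfies(g, x):
                    new_G.append(g)            # already excludes x
                else:                          # specialize, staying above s
                    new_G += [h for h in min_specialize(g, x, values)
                              if more_general(h, s)]
            G = [g for g in new_G              # keep only maximally general members
                 if not any(g2 != g and more_general(g2, g) for g2 in new_G)]
    return s, G

VALUES = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
          ('Strong', 'Weak'), ('Warm', 'Cold'), ('Same', 'Change')]
D = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
     (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), True),
     (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
     (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True)]
s, G = candidate_elimination(D, VALUES)
print(s)   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
print(G)   # [('Sunny','?','?','?','?','?'), ('?','Warm','?','?','?','?')]
```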
S0: {<0, 0, 0, 0, 0, 0>}
G0: {<?, ?, ?, ?, ?, ?>}
Training Example 1: <Sunny, Warm, Normal, Strong, Warm, Same>, EnjoySport = Yes
S1: {<Sunny, Warm, Normal, Strong, Warm, Same>}
G1: {<?, ?, ?, ?, ?, ?>}
Training Example 2: <Sunny, Warm, High, Strong, Warm, Same>, EnjoySport = Yes
S2: {<Sunny, Warm, ?, Strong, Warm, Same>}
G2: {<?, ?, ?, ?, ?, ?>}
Training Example 3: <Rainy, Cold, High, Strong, Warm, Change>, EnjoySport = No
S3: {<Sunny, Warm, ?, Strong, Warm, Same>}
G3: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}
Training Example 4: <Sunny, Warm, High, Strong, Cool, Change>, EnjoySport = Yes
S4: {<Sunny, Warm, ?, Strong, ?, ?>}
G4: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
Questions
• Does the order of presentation of examples
matter?
• How will you know when the concept has
been learned?
• How do you know when you have presented
enough training data?
• What happens if incorrectly labeled examples
are presented?
• If the learner can request examples, which
example should be requested next?
[Diagram: the final version space. S: {<Sunny, Warm, ?, Strong, ?, ?>}; <Sunny, ?, ?, Strong, ?, ?>, <Sunny, Warm, ?, ?, ?, ?>, <?, Warm, ?, Strong, ?, ?>; G: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}]
Select an example that would be classified as positive by some hypotheses and negative by others.
Training Example Possibility:
<Sunny, Warm, Normal, Light, Warm, Same> Positive
Partially Learned Concepts
• Can we classify unseen examples even
though we still have multiple
hypotheses?
• For some instances, yes, as the sketch after the query examples below shows.
Optimal strategy
• Generate instances that satisfy exactly half of the hypotheses in the current version space.
• The correct target concept can then be found in ⌈log2 |VS|⌉ experiments. (A sketch follows.)
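A sketch of the strategy on the running example: the version space has the six members shown in the diagram, and the query proposed earlier, <Sunny, Warm, Normal, Light, Warm, Same>, splits them exactly 3 versus 3:

```python
def satisfies(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

VS = [('Sunny', 'Warm', '?', 'Strong', '?', '?'),
      ('Sunny', '?', '?', 'Strong', '?', '?'),
      ('Sunny', 'Warm', '?', '?', '?', '?'),
      ('?', 'Warm', '?', 'Strong', '?', '?'),
      ('Sunny', '?', '?', '?', '?', '?'),
      ('?', 'Warm', '?', '?', '?', '?')]

query = ('Sunny', 'Warm', 'Normal', 'Light', 'Warm', 'Same')
votes = sum(satisfies(h, query) for h in VS)
print(votes, 'of', len(VS))   # 3 of 6: either answer halves the version space
```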
[The version space diagram above is repeated for each query below.]
A: <Sunny, Warm, Normal, Strong, Cool, Change> ?
B: <Rainy, Cold, Normal, Light, Warm, Same> ?
C: <Sunny, Warm, Normal, Light, Warm, Same> ?
D: <Sunny, Cold, Normal, Strong, Warm, Same> ?
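A sketch of how the boundary sets alone answer these queries: an instance that satisfies every member of S is classified positive by the whole version space; one that satisfies no member of G is classified negative by all of it; anything else is ambiguous:

```python
def satisfies(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

S = [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
G = [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]

def classify(x):
    if all(satisfies(s, x) for s in S):
        return 'positive'        # every hypothesis in the version space covers x
    if not any(satisfies(g, x) for g in G):
        return 'negative'        # no hypothesis in the version space covers x
    return 'unknown'             # the version space members disagree

for name, x in [('A', ('Sunny', 'Warm', 'Normal', 'Strong', 'Cool', 'Change')),
                ('B', ('Rainy', 'Cold', 'Normal', 'Light', 'Warm', 'Same')),
                ('C', ('Sunny', 'Warm', 'Normal', 'Light', 'Warm', 'Same')),
                ('D', ('Sunny', 'Cold', 'Normal', 'Strong', 'Warm', 'Same'))]:
    print(name, classify(x))     # A positive, B negative, C unknown, D unknown
```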
Inductive Bias
• Candidate Elimination converges toward the true target concept if
– It is given enough training examples
– Its hypothesis space contains the target concept
– The training data contain no errors (noise)
• Fundamental questions
– What if the target concept is not contained in the hypothesis
space?
– Can we include every possible hypothesis?
– How does the size of the hypothesis space influence ability
to generalize?
– How does the size of the hypothesis space influence
number of examples we need?
Biased Hypothesis Space
• The hypothesis space we have been
considering biases the concepts we can
learn.
• An alternative is to use a hypothesis
space that can represent every
teachable concept (all possible subsets
of X).
Unbiased Hypothesis Space
• We need to be able to represent all
possible subsets of instances.
• If set X has |X| elements, what is the
size of the power set of X?
• For our example, the size of X is 96.
What is the size of the power set?
• What was the size of the hypothesis
space we have been considering?
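A quick computation of these sizes (96 instances as counted earlier; compare the 973 semantically distinct conjunctive hypotheses):

```python
instances = 3 * 2 * 2 * 2 * 2 * 2    # |X| = 96
subsets = 2 ** instances             # every subset of X is a candidate concept
print(subsets)                       # 79228162514264337593543950336, about 7.9e28
```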
Consider an Alternate
Definition of EnjoySport
• Want to be able to represent every subset of
instances with new hypothesis space H’.
• Let H’ be the power set of X.
• One method: allow arbitrary disjunctions,
conjunctions, and negations of earlier
hypotheses.
• The target concept "Sky = Sunny or Sky = Cloudy" would be:
<Sunny, ?, ?, ?, ?, ?> ∨ <Cloudy, ?, ?, ?, ?, ?>
Alternate representation
Positive examples: x1, x2, and x3
Negative examples: x4 and x5
Most general hypothesis: {¬(x4 ∨ x5)}
Most specific hypothesis: {(x1 ∨ x2 ∨ x3)}
What can we classify unambiguously? Only the training examples themselves: every unseen instance is classified positive by exactly half of the version space.
The Conclusion: Problem of
Induction
• Generalizing from any set of observations is never
logically justified, since there always exist many
hypotheses that could account for the observed data.
• Learning system MUST limit or direct its search
through the space of possible knowledge structures.
• This is called the bias of the system
• Two kinds of bias
– Representational bias (limits the hypotheses that can be expressed)
– Search bias (considers all possible concept descriptions but examines some earlier than others)
Inductive Learning
• All inductive learning systems use
inductive bias
• Learning algorithms can be
characterized by the bias they use
• Useful point of comparison of learning
algorithms
Notation
y \succ z : z inductively follows from y
y \vdash z : z deductively follows from y
Notation
• L is an arbitrary learning algorithm
• Dc is an arbitrary set of training data for an arbitrary
concept c
• After learning L is asked to classify a new instance
xi
• Let L(xi, Dc) be the classification
• Inductive inference step
(D_c \wedge x_i) \succ L(x_i, D_c)
Inductive Bias Definition
Consider a concept learning algorithm L for the set of
instance X. Let c be an arbitrary concept defined over
X, and let Dc={<x,c(x)>} be an arbitrary set of training
examples of c. Let L(xi,Dc) denote the classification
assigned to the instance xi by L after training on data
Dc. The inductive bias of L is any minimal set of
assertions B such that for any target concept c and
corresponding training examples Dc
(\forall x_i \in X)\,[(B \wedge D_c \wedge x_i) \vdash L(x_i, D_c)]
Inductive System
[Diagram: Training Examples and a New Instance feed the Candidate Elimination Algorithm (using hypothesis space H), which outputs the classification of the new instance, or "don't know".]
Equivalent Deductive System
[Diagram: Training Examples, a New Instance, and the assertion "H contains the target concept" feed a Theorem Prover, which outputs the classification of the new instance, or "don't know".]
Conclusions
• Concept learning as search
• General to specific ordering of hypotheses
provides useful structure for guiding search
• Version spaces and candidate elimination
provide useful framework for studying
machine learning issues
• Inductive learning must use representational
or search bias