Class announcements
• Recitations Th, F 4 PM – 46-3189
– This week: review of basic Bayes
• PSet 1 out today, due Oct 3. Other psets due
approximately every two weeks thereafter.
• Classes next week are virtual: We will have a guest
lecture from Vikash Mansinghka on Thursday that
you can watch asynchronously, and I may give one
virtual lecture (depending on where we end up
today).
Plan for today
Basic Bayesian cognition
– The number game
The number game
– 60: diffuse similarity
– 60, 80, 10, 30: rule ("multiples of 10")
– 60, 52, 57, 55: focused similarity (numbers near 50-60)
Main phenomena to explain:
– Generalization can appear either similarity-based (graded) or rule-based (all-or-none).
– Learning from just a few positive examples.
A single unifying account of (number) concept learning?
• We’re going to use this to introduce Bayesian
approaches, but first consider ...
– The “naïve programmer” approach?
– The “modern neural network” approach?
Traditional (algorithmic level) cognitive models
• Multiple representational systems: rules and
similarity
– Categorization, language (past tense), reasoning
• Questions this leaves open:
– How does each system work? How far, and in what ways, should a learner
generalize as a function of the examples observed?
• Which rule to choose?
– E.g., X = {60, 80, 10, 30}: multiples of 10 vs. even numbers?
• Which similarity metric?
– E.g., X = {60, 53} vs. {60, 20}?
– Why these two systems?
– When and why does a learner switch between them?
Reverse-engineering a cognitive system:
Marr’s three levels
• Level 1: Computational theory
– What are the inputs and outputs to the computation,
what is its goal, and what is the logic by which it is
carried out?
• Level 2: Representation and algorithm
– How is information represented and processed to
achieve the computational goal?
• Level 3: Hardware implementation
– How is the computation realized in physical or
biological hardware?
Bayesian model
• H: Hypothesis space of possible concepts:
– h1 = {2, 4, 6, 8, 10, 12, …, 96, 98, 100} (“even numbers”)
– h2 = {10, 20, 30, 40, …, 90, 100} (“multiples of 10”)
– h3 = {2, 4, 8, 16, 32, 64} (“powers of 2”)
– h4 = {50, 51, 52, …, 59, 60} (“numbers between 50 and 60”)
– ...
Representational interpretations for H:
– Candidate rules
– Features for similarity
– “Consequential subsets” (Shepard, 1987)
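For concreteness, the hypotheses above can be written down directly as sets of numbers, i.e. as extensions over 1-100 (a sketch in Python; the variable names are ours):

```python
# Extensions of the example hypotheses over the numbers 1-100.
h1 = set(range(2, 101, 2))      # "even numbers" (50 elements)
h2 = set(range(10, 101, 10))    # "multiples of 10" (10 elements)
h3 = {2, 4, 8, 16, 32, 64}      # "powers of 2"
h4 = set(range(50, 61))         # "numbers between 50 and 60"
```

Treating each hypothesis as a subset of 1-100 is what lets candidate rules, similarity features, and "consequential subsets" all live in one hypothesis space.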
Three hypothesis subspaces for number
concepts
• Mathematical properties (24 hypotheses):
– Odd, even, square, cube, prime numbers
– Multiples of small integers
– Powers of small integers
• Raw magnitude (5050 hypotheses):
– All intervals of integers with endpoints between 1 and
100.
• Approximate magnitude (10 hypotheses):
– Decades (1-10, 10-20, 20-30, …)
Bayesian model
• H: Hypothesis space of possible concepts:
– Mathematical properties: even, odd, square, prime, . . . .
– Approximate magnitude: {1-10}, {10-20}, {20-30}, . . . .
– Raw magnitude: all intervals between 1 and 100.
• X = {x1, . . . , xn}: n examples of a concept C.
• Evaluate hypotheses given data:
p(h | X) = p(X | h) p(h) / Σ_{h' ∈ H} p(X | h') p(h')
– p(h) [“prior”]: domain knowledge, pre-existing biases
– p(X|h) [“likelihood”]: statistical information in examples.
– p(h|X) [“posterior”]: degree of belief that h is the true extension of C.
Likelihood: p(X|h)
• Size principle: Smaller hypotheses receive greater
likelihood, and exponentially more so as n increases.
p(X | h) = [1 / size(h)]^n   if x1, …, xn ∈ h
         = 0                 if any xi ∉ h
• Captures the intuition of a “representative” sample, versus
a “suspicious coincidence”.
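A minimal sketch of this likelihood (the function name is ours, not from the original model code):

```python
def size_principle_likelihood(examples, hypothesis):
    """p(X | h) under strong sampling: each example is drawn
    independently and uniformly from the extension of h."""
    if not all(x in hypothesis for x in examples):
        return 0.0  # any example outside h rules it out
    return (1.0 / len(hypothesis)) ** len(examples)

multiples_of_10 = set(range(10, 101, 10))   # size 10
even_numbers = set(range(2, 101, 2))        # size 50
X = [60, 80, 10, 30]

# The smaller hypothesis gets exponentially greater likelihood:
# (1/10)^4 = 1e-4 versus (1/50)^4 = 1.6e-7.
```

With one example the two likelihoods differ by a factor of 5; with four examples, by a factor of 625 — the "exponentially more so as n increases" part of the size principle.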
Illustrating the size principle
[Figure: a grid of the even numbers 2-100, with h1 = even numbers (50 elements) and h2 = multiples of 10 (10 elements) marked, shown with successively larger example sets.]
– With few examples: data slightly more of a coincidence under h1.
– With more examples: data much more of a coincidence under h1.
Likelihood: p(X|h)
• Size principle: Smaller hypotheses receive greater
likelihood, and exponentially more so as n increases.
p(X | h) = [1 / size(h)]^n   if x1, …, xn ∈ h
         = 0                 if any xi ∉ h
• Captures the intuition of a “representative” sample, versus
a “suspicious coincidence”.
• A special case of the law of “conservation of belief”:
Σ_x p(X = x | Y = y) = 1
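As a quick numeric check of conservation of belief in the single-example case (a sketch; the choice of hypothesis is arbitrary):

```python
# For one example, the size-principle likelihood defines a proper
# distribution over x: p(x | h) = 1/size(h) for x in h, else 0.
h = set(range(10, 101, 10))  # "multiples of 10"
p = {x: (1 / len(h) if x in h else 0.0) for x in range(1, 101)}
total = sum(p.values())  # should be 1: belief is conserved
```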
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Do we need this? Why not allow all logically possible
hypotheses, with uniform priors, and let the data sort
them out (via the likelihood)?
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural
hypotheses, e.g. “multiples of 10 except 50 and 70”.
e.g., X = {60, 80, 10, 30}:
p(X | multiples of 10) = (1/10)^4 = 0.0001
p(X | multiples of 10 except 50, 70) = (1/8)^4 ≈ 0.00024
Posterior: p(h | X) = p(X | h) p(h) / Σ_{h' ∈ H} p(X | h') p(h')
• X = {60, 80, 10, 30}
• Why prefer “multiples of 10” over “even
numbers”? p(X|h).
• Why prefer “multiples of 10” over “multiples of
10 except 50 and 70”? p(h).
• Why does a good generalization need both high
prior and high likelihood? p(h|X) ~ p(X|h) p(h)
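Putting the pieces together, a toy posterior computation over three hypotheses shows both effects. The prior weights below are invented illustrative numbers, not values from the original model:

```python
def posterior(X, hypotheses, prior):
    """p(h | X) proportional to p(X | h) p(h), size-principle likelihood."""
    def likelihood(h):
        if not all(x in h for x in X):
            return 0.0
        return (1.0 / len(h)) ** len(X)
    scores = {name: likelihood(h) * prior[name]
              for name, h in hypotheses.items()}
    Z = sum(scores.values())  # normalizing constant
    return {name: s / Z for name, s in scores.items()}

mult10 = frozenset(range(10, 101, 10))
hypotheses = {
    "multiples of 10": mult10,
    "even numbers": frozenset(range(2, 101, 2)),
    "multiples of 10 except 50, 70": mult10 - {50, 70},
}
# Illustrative prior: the unnatural exception-laden hypothesis gets low weight.
prior = {"multiples of 10": 0.45,
         "even numbers": 0.45,
         "multiples of 10 except 50, 70": 0.10}

post = posterior([60, 80, 10, 30], hypotheses, prior)
# The likelihood favors "multiples of 10" over "even numbers";
# the prior rescues it from "multiples of 10 except 50, 70".
```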
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural
hypotheses, e.g. “multiples of 10 except 50 and 70”.
• p(h) encodes relative weights of alternative theories:
H: Total hypothesis space
– p(H1) = 1/5, p(H2) = 3/5, p(H3) = 1/5
– H1: Math properties (24 hypotheses): even numbers, powers of two, multiples of three, …; each p(h) = p(H1) / 24
– H2: Raw magnitude (5050 hypotheses): 10-15, 20-32, 37-54, …; each p(h) = p(H2) / 5050
– H3: Approx. magnitude (10 hypotheses): 10-20, 20-30, 30-40, …; each p(h) = p(H3) / 10
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural
hypotheses, e.g. “multiples of 10 except 50 and 70”.
• p(h) encodes relative plausibility of alternative theories:
– Mathematical properties: p(h) ~ 1/120
– Approximate magnitude: p(h) ~ 1/50
– Raw magnitude: p(h) ~ 1/8500 (on average)
• Also degrees of plausibility within a theory,
e.g., for magnitude intervals of size s:
[Figure: prior probability p(s) as a function of interval size s.]
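The per-hypothesis priors quoted above follow from dividing each subspace's weight uniformly among its hypotheses; a quick check (assuming uniform weighting within subspaces, which the "on average" caveat for raw magnitude relaxes):

```python
# Count the interval hypotheses: all [lo, hi] with 1 <= lo <= hi <= 100.
n_intervals = sum(1 for lo in range(1, 101) for hi in range(lo, 101))

# Subspace weights and sizes from the slides.
subspaces = {
    "math": {"p_sub": 1 / 5, "n_hyp": 24},
    "approx magnitude": {"p_sub": 1 / 5, "n_hyp": 10},
    "raw magnitude": {"p_sub": 3 / 5, "n_hyp": n_intervals},
}
per_hyp = {k: v["p_sub"] / v["n_hyp"] for k, v in subspaces.items()}
# math: 1/120; approx: 1/50; raw: (3/5)/5050, roughly 1/8400
```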
Generalizing to new objects
From hypotheses to predictions:
How do we compute the probability that C
applies to some new object y, given the posterior
p(h|X)?
Hypothesis averaging
In general, we have the law of total probability:
p(A = a) = Σ_z p(A = a | Z = z) p(Z = z)
p(A = a | B = b) = Σ_z p(A = a | Z = z, B = b) p(Z = z | B = b)
…especially useful if A and B are independent conditioned on Z:
p(A = a | B = b) = Σ_z p(A = a | Z = z) p(Z = z | B = b)
Hypothesis averaging
Another example: what is the probability that the republican will
win the election, given that the weather man predicts rain?
p(Republican wins | Weather report: “Rain storm”) =
Σ_{w ∈ weather conditions} p(Republican wins | W = w) p(W = w | Weather report: “Rain storm”)
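The election example can be checked numerically; all probabilities below are invented purely for illustration:

```python
# Conditioning on the latent variable W (actual weather), which screens
# off the report from the election outcome.
p_w_given_report = {"rain": 0.7, "sun": 0.3}   # p(W = w | report)
p_win_given_w = {"rain": 0.6, "sun": 0.4}      # p(win | W = w)

p_win_given_report = sum(p_win_given_w[w] * p_w_given_report[w]
                         for w in p_w_given_report)
# 0.6 * 0.7 + 0.4 * 0.3 = 0.54
```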
Generalizing to new objects
Hypothesis averaging:
Compute the probability that C applies to some
new object y by averaging the predictions of all
hypotheses h, weighted by p(h|X):
p(y ∈ C | X) = Σ_{h ∈ H} p(y ∈ C | h) p(h | X)
where p(y ∈ C | h) = 1 if y ∈ h, and 0 if y ∉ h, so this reduces to
p(y ∈ C | X) = Σ_{h ⊇ {y, X}} p(h | X)
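A sketch of hypothesis averaging over a two-hypothesis toy space (the names and the uniform prior are ours, for illustration only):

```python
def p_in_concept(y, X, hypotheses, prior):
    """p(y in C | X): sum of p(h | X) over hypotheses h containing y,
    with the size-principle likelihood."""
    def likelihood(h):
        return (1.0 / len(h)) ** len(X) if all(x in h for x in X) else 0.0
    scores = {n: likelihood(h) * prior[n] for n, h in hypotheses.items()}
    Z = sum(scores.values())
    return sum(s / Z for n, s in scores.items() if y in hypotheses[n])

hypotheses = {
    "multiples of 10": frozenset(range(10, 101, 10)),
    "even numbers": frozenset(range(2, 101, 2)),
}
prior = {"multiples of 10": 0.5, "even numbers": 0.5}
X = [60, 80, 10, 30]
# 50 is in both hypotheses, so p(50 in C | X) = 1;
# 52 is only in "even numbers", so its probability is small but
# nonzero -- generalization is graded, not all-or-none.
```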
Examples: 16
Examples: 16, 8, 2, 64
Examples: 16, 23, 19, 20
[Figure: for each example set — {60}, {60, 80, 10, 30}, {60, 52, 57, 55}, {16}, {16, 8, 2, 64}, {16, 23, 19, 20} — human generalization judgments alongside Bayesian model predictions.]
Summary of the Bayesian model
• How do the statistics of the examples interact with
prior knowledge to guide generalization?
posterior ∝ likelihood × prior
• Why does generalization appear rule-based or
similarity-based?
hypothesis averaging + size principle
broad p(h|X): similarity gradient
narrow p(h|X): all-or-none rule
Summary of the Bayesian model
• How do the statistics of the examples interact with
prior knowledge to guide generalization?
posterior ∝ likelihood × prior
• Why does generalization appear rule-based or
similarity-based?
hypothesis averaging + size principle
broad p(h|X): many h of similar size, or very few examples (e.g., just one)
narrow p(h|X): one h much smaller than the rest
Model variants
1. Bayes with “weak sampling” (no size principle):
posterior ∝ likelihood × prior, with hypothesis averaging, but the likelihood carries no size information:
p(X | h) ∝ 1 if x1, …, xn ∈ h
         = 0 if any xi ∉ h
2. Maximum a posteriori (MAP) / maximum likelihood / subset principle (no hypothesis averaging):
posterior ∝ likelihood × prior, with the size principle, but generalizing from only the single best hypothesis:
p(y ∈ C | X) = 1 if y ∈ h*, where h* = argmax_{h ∈ H} p(h | X)
             = 0 if y ∉ h*
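The variants differ only in the likelihood used and in how predictions are read out of the posterior; a side-by-side sketch on a toy space (hypothetical names, uniform prior):

```python
hypotheses = {
    "multiples of 10": frozenset(range(10, 101, 10)),
    "even numbers": frozenset(range(2, 101, 2)),
}
prior = {n: 0.5 for n in hypotheses}
X = [60, 80, 10, 30]

def posterior(likelihood):
    scores = {n: likelihood(h) * prior[n] for n, h in hypotheses.items()}
    Z = sum(scores.values())
    return {n: s / Z for n, s in scores.items()}

# Strong sampling (size principle) vs. weak sampling (0/1 consistency).
strong = lambda h: (1 / len(h)) ** len(X) if all(x in h for x in X) else 0.0
weak = lambda h: 1.0 if all(x in h for x in X) else 0.0

def generalize(y, post):       # full Bayes: hypothesis averaging
    return sum(p for n, p in post.items() if y in hypotheses[n])

def generalize_map(y, post):   # MAP / subset principle: winner-take-all
    best = max(post, key=post.get)
    return 1.0 if y in hypotheses[best] else 0.0

post_weak = posterior(weak)      # both consistent hypotheses tie at 0.5
post_strong = posterior(strong)  # mass concentrates on "multiples of 10"
```

Weak sampling cannot prefer the smaller consistent hypothesis; MAP on top of strong sampling predicts all-or-none generalization with no similarity gradient.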
[Figure: human generalization compared with the full Bayesian model, Bayes with weak sampling (no size principle), and MAP / subset principle (no hypothesis averaging).]
Taking stock
• A model of high-level, knowledge-driven inductive reasoning
that makes strong quantitative predictions with minimal free
parameters.
(r2 > 0.9 for mean judgments on 180 generalization stimuli, with 3 free
numerical parameters)
• Explains qualitatively different patterns of generalization
(rules, similarity) as the output of a single general-purpose
rational inference engine.
– Marr level 1 (Computational theory) explanation of phenomena that
have traditionally been treated only at Marr level 2 (Representation
and algorithm).
Looking forward
• Can we see these ideas at work in more natural cognitive
functions, not just toy problems and games?
– What differently structured hypothesis spaces, likelihood
functions, or priors might be needed?
• Can we move from ‘weak rational analysis’ to ‘strong
rational analysis’ in the priors, as with the likelihood?
– “Weak”: behavior consistent with some reasonable prior.
– “Strong”: behavior consistent with the “correct” prior given the
structure of the world.
• Can we work with more flexible priors, not just restricted to
a small subset of all logically possible concepts?
– Would like to be able to learn any concept, even very complex ones,
given enough data (a non-dogmatic prior).
• Can we describe formally how these hypothesis spaces and
priors are generated by abstract knowledge or theories?
• Can we explain how people learn these rich priors?
Learning more natural concepts
[Figure: three pictures labeled “horse”, and three examples of a novel word, “tufa” — learning a new word from a few positive examples.]
Learning rectangle concepts
Weighting different rectangle
hypotheses based on the size principle:
p(X | h) = [1 / size(h)]^n   if x1, …, xn ∈ h
         = 0                 if any xi ∉ h
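A sketch of that weighting for axis-aligned rectangle hypotheses, where size(h) is the rectangle's area (the function and variable names are ours):

```python
def rect_likelihood(points, rect):
    """p(X | h) for h = (x0, x1, y0, y1): strong sampling from the
    rectangle's area, zero if any example falls outside it."""
    x0, x1, y0, y1 = rect
    if not all(x0 <= px <= x1 and y0 <= py <= y1 for px, py in points):
        return 0.0
    area = (x1 - x0) * (y1 - y0)
    return (1.0 / area) ** len(points)

points = [(2, 2), (3, 4), (4, 3)]
tight = (2, 4, 2, 4)    # area 4, just covers the examples
loose = (0, 10, 0, 10)  # area 100
# (1/4)^3 vs. (1/100)^3: the tight rectangle is exponentially favored,
# so generalization contracts toward the examples as n grows.
```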
Generalization gradients
[Figure: generalization gradients for full Bayes, the subset principle (MAP Bayes), and Bayes without the size principle (0/1 likelihoods).]
Modeling word learning (Xu & Tenenbaum, 2007)
[Figure: children’s generalizations compared with Bayesian concept learning over a tree-structured hypothesis space.]
Exploring different models
• Priors, likelihoods derived from simple assumptions.
What about more complex cases?
• Different likelihoods?
– Suppose the examples are sampled by a different process,
such as active learning, or active pedagogy.
• Different priors?
– More complex language-like hypothesis spaces, allowing
exceptions, compound concepts, and much more…