CSCE 970 Lecture 6: System Evaluation and Combining Classifiers

Outline

• Sample error vs. true error

• Estimators

• Binomial distribution, Normal distribution, Central Limit Theorem

• Paired t tests

• Comparing learning methods

Introduction

• Can't use error on the training set to estimate the ability to generalize, because it is too optimistic

• Can use statistical hypothesis testing techniques to:

  – Give confidence intervals for an error estimate

  – Contrast the performance of two classifiers (see if the difference in their error estimates is statistically significant)

• Sometimes need to train and test with a small data set
Two Definitions of Error

• The sample error of hypothesis h with respect to target function f and data sample S is the proportion of examples h misclassifies:

    error_S(h) ≡ (1/|S|) Σ_{x∈S} δ(f(x) ≠ h(x))

  where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise

• The true error of h with respect to f and distribution D is the probability that h misclassifies an instance drawn at random according to D:

    error_D(h) ≡ Pr_{x∼D} [f(x) ≠ h(x)]

For an unbiased estimate, h and S must be chosen independently

• Variance: Even with an unbiased S, error_S(h) may still vary from error_D(h)
Estimators

Experiment:

1. Choose sample S of size n according to distribution D

2. Measure error_S(h)

error_S(h) is a random variable (i.e., the result of an experiment)

error_S(h) is an unbiased estimator for error_D(h)

Given an observed error_S(h), what can we conclude about error_D(h)?

Confidence Intervals

If

• S contains n examples (feature vectors), drawn independently of h and each other

Then

• With approximately 95% probability, error_D(h) lies in the interval

    error_S(h) ± 1.96 √( error_S(h) (1 − error_S(h)) / n )

E.g., hypothesis h misclassifies 12 of the 40 examples in test set S:

    error_S(h) = 12/40 = 0.30

Then with approx. 95% confidence, error_D(h) ∈ [0.158, 0.442]
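A minimal Python check of this example (the helper error_ci is ours, not from the lecture):

```python
import math

def error_ci(n_errors, n, z=1.96):
    """Two-sided confidence interval for error_D(h) from a test sample.

    z = 1.96 corresponds to ~95% confidence (see the z_N table below).
    """
    err = n_errors / n                            # error_S(h)
    half = z * math.sqrt(err * (1 - err) / n)
    return err, (err - half, err + half)

# The slide's example: h misclassifies 12 of 40 test examples.
err, (lo, hi) = error_ci(12, 40)
print(err, lo, hi)                                # 0.30  ~0.158  ~0.442
```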
Confidence Intervals (cont'd)

If

• S contains n examples, drawn independently of h and each other

• n ≥ 30

Then

• With approximately N% probability, error_D(h) lies in the interval

    error_S(h) ± z_N √( error_S(h) (1 − error_S(h)) / n )

  where z_N is given by the table on the "Normal Probability Distribution (cont'd)" slide below

error_S(h) is a Random Variable

Repeatedly run the experiment, each time with a different randomly drawn S (each of size n)

Probability of observing r misclassified examples:

[Figure: binomial distribution for n = 40, p = 0.3, plotting P(r) for r = 0, ..., 40]

    P(r) = (n choose r) error_D(h)^r (1 − error_D(h))^{n−r}
Binomial Probability Distribution

[Figure: binomial distribution P(r), as above]

    P(r) = (n choose r) p^r (1 − p)^{n−r} = [n! / (r! (n − r)!)] p^r (1 − p)^{n−r}

Probability P(r) of r heads in n coin flips, if p = Pr(heads)

• Expected, or mean, value of X:

    E[X] ≡ Σ_{i=0}^{n} i P(i) = n p

error_S(h) follows a binomial distribution, with

• mean µ_{error_S(h)} = error_D(h) (i.e., an unbiased estimator)

• standard deviation

    σ_{error_S(h)} = √( error_D(h) (1 − error_D(h)) / n )

  (i.e., increasing n decreases variance)

Want to compute a confidence interval: an interval centered at error_D(h) containing N% of the weight under the distribution (difficult for the binomial)

Approximate the binomial by the normal (Gaussian) distribution:
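A short Python check of these binomial facts, using the running example n = 40, p = 0.3:

```python
from math import comb, sqrt

n, p = 40, 0.3                                   # n draws, p = error_D(h)

# Binomial pmf: P(r) = C(n, r) p^r (1 - p)^(n - r)
P = [comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]

mean = sum(r * P[r] for r in range(n + 1))       # E[X] = n p = 12
var = sum((r - mean) ** 2 * P[r] for r in range(n + 1))
print(mean, sqrt(var), sqrt(n * p * (1 - p)))    # 12.0  ~2.898  ~2.898
```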
Normal Probability Distribution

[Figure: normal distribution with mean 0, standard deviation 1]

    p(x) = (1 / √(2πσ²)) exp( −(1/2) ((x − µ) / σ)² )

• Defined completely by µ and σ

• The probability that X will fall into the interval (a, b) is given by

    ∫_a^b p(x) dx

• Variance of X is Var(X) = σ²

• Standard deviation of X, σ_X, is σ_X = σ

Normal Probability Distribution (cont'd)

[Figure: standard normal density; 80% of the area (probability) lies in µ ± 1.28σ]

N% of the area (probability) lies in µ ± z_N σ:

    N%:  50%   68%   80%   90%   95%   98%   99%
    z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58

Can also have one-sided bounds:

[Figure: standard normal density with a one-sided bound]

N% of the area lies < µ + z'_N σ or > µ − z'_N σ, where

    z'_N = z_{100 − 2(100 − N)}

    N%:   50%   68%   80%   90%   95%   98%   99%
    z'_N: 0.0   0.47  0.84  1.28  1.64  2.05  2.33
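Both tables can be reproduced from the standard normal CDF Φ; a minimal Python check using only the math module:

```python
from math import erf, sqrt

def two_sided(z):
    """Fraction of normal probability mass within mu +/- z*sigma."""
    return erf(z / sqrt(2))                      # Phi(z) - Phi(-z)

def one_sided(z):
    """Fraction of normal probability mass below mu + z*sigma."""
    return 0.5 * (1 + erf(z / sqrt(2)))          # Phi(z)

for z in (0.67, 1.00, 1.28, 1.64, 1.96, 2.33, 2.58):
    print(z, round(100 * two_sided(z)))          # 50 68 80 90 95 98 99 (%)

for z in (0.0, 0.47, 0.84, 1.28, 1.64, 2.05, 2.33):
    print(z, round(100 * one_sided(z)))          # 50 68 80 90 95 98 99 (%)
```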
General Approach for Deriving Confidence Intervals

1. Pick the parameter to estimate

   • error_D(h)

2. Choose an estimator

   • error_S(h)

3. Determine the probability distribution that governs the estimator

   • error_S(h) is governed by the binomial distribution, approximated by the normal when n ≥ 30

4. Find the interval (L, U) such that N% of the probability mass falls in the interval

   • Could have L = −∞ or U = ∞

   • Use the table of z_N or z'_N values

Difference Between Two Hypotheses

Test h1 on sample S1 (of size n1) and h2 on sample S2 (of size n2)

1. Pick the parameter to estimate: d ≡ error_D(h1) − error_D(h2)

2. Choose an estimator: d̂ ≡ error_S1(h1) − error_S2(h2)

3. Determine the probability distribution that governs the estimator (the difference between two normals is also normal, and the variances add):

    σ_d̂ ≈ √( error_S1(h1) (1 − error_S1(h1)) / n1  +  error_S2(h2) (1 − error_S2(h2)) / n2 )

4. Find the interval (L, U) such that N% of the probability mass falls in the interval:

    d̂ ± z_N σ_d̂
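A minimal sketch of this interval in Python (the error rates and sample sizes in the example call are made up, just to exercise the formula):

```python
import math

def diff_ci(e1, n1, e2, n2, z=1.96):
    """N% confidence interval for d = error_D(h1) - error_D(h2).

    e1 = error_S1(h1), e2 = error_S2(h2); z = z_N (1.96 for ~95%).
    """
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z * sigma, d_hat + z * sigma

print(diff_ci(0.30, 100, 0.20, 100))   # interval straddles 0 here, so the
                                       # difference is not significant at 95%
```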
Comparing Learning Methods

To compare learning algorithms LA and LB, given data set D0:

1. Partition D0 into k disjoint test sets T1, ..., Tk of equal size

2. For i from 1 to k, use Ti as the test set and the remaining data as training set Si, then:

   • hA ← LA(Si)

   • hB ← LB(Si)

   • δi ← error_Ti(hA) − error_Ti(hB)

3. Return the value δ̄, where

    δ̄ ≡ (1/k) Σ_{i=1}^{k} δi

Note that this procedure estimates

    E_{S⊂D0} [ error_D(LA(S)) − error_D(LB(S)) ]

instead of

    E_{S⊂D} [ error_D(LA(S)) − error_D(LB(S)) ]

• But even this approximation is better than no comparison
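A sketch of this procedure (learn_A, learn_B, and the data format are assumptions: examples are (x, y) pairs, and a learner maps a training set to a hypothesis h with h(x) returning a predicted label):

```python
import random
import statistics

def compare_learners(learn_A, learn_B, data, k=10):
    """k-fold paired comparison; returns delta_bar, the mean of the
    per-fold differences in test-set error."""
    data = data[:]                         # don't mutate the caller's list
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    deltas = []
    for i in range(k):
        T_i = folds[i]                     # test set for this fold
        S_i = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h_A, h_B = learn_A(S_i), learn_B(S_i)
        err_A = sum(h_A(x) != y for x, y in T_i) / len(T_i)
        err_B = sum(h_B(x) != y for x, y in T_i) / len(T_i)
        deltas.append(err_A - err_B)       # delta_i
    return statistics.mean(deltas)         # delta_bar
```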
Combining Classifiers

• Each classifier (or expert) in the pool has its own weight

Weighted Majority Algorithm (WM)
[Mitchell, Sec. 7.5.4]
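A sketch of WM as described in the cited section: every expert starts with weight 1, the pool predicts by weighted vote, and each expert that predicts incorrectly has its weight multiplied by β ∈ [0, 1):

```python
def weighted_majority(experts, stream, beta=0.5):
    """Weighted Majority (WM), after Mitchell Sec. 7.5.4.

    experts: list of prediction functions x -> label in {0, 1}
    stream:  iterable of (x, y) pairs, processed on-line
    beta:    demotion factor in [0, 1)
    Returns the final weights and the number of mistakes WM made.
    """
    w = [1.0] * len(experts)
    mistakes = 0
    for x, y in stream:
        votes = [e(x) for e in experts]
        q0 = sum(wi for wi, v in zip(w, votes) if v == 0)
        q1 = sum(wi for wi, v in zip(w, votes) if v == 1)
        pred = 1 if q1 > q0 else 0           # weighted majority vote
        mistakes += int(pred != y)
        # Demote every expert that was wrong on this example
        w = [wi * beta if v != y else wi for wi, v in zip(w, votes)]
    return w, mistakes
```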
Weighted Majority Mistake Bound (On-Line Model)

• Implicitly agnostic

• Other results:

  – Bounds hold for general values of β ∈ [0, 1)

Bagging

• Create Di by drawing m examples uniformly at random with replacement from D

• Expect Di to omit ≈ 37% of the examples in D
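A quick sketch of the bootstrap draw, plus a numerical check of the ≈ 37% figure (the probability a given example is omitted is (1 − 1/m)^m → 1/e ≈ 0.368):

```python
import random

def bootstrap_sample(D):
    """Draw |D| examples uniformly at random, with replacement, from D."""
    return [random.choice(D) for _ in range(len(D))]

m = 10_000
D = list(range(m))
omitted = len(set(D) - set(bootstrap_sample(D))) / m
print(omitted)                               # ~0.37
```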
Bagging Experiment
[Breiman, ML Journal, '96]

1. Divide S randomly into test set T (10%) and training set D (90%)

2. Learn decision tree from D and let eS be its error rate on T

Same experiment, but using a nearest neighbor classifier, where the prediction of a new feature vector x's label is that of x's nearest neighbor in the training set, and distance is e.g. Euclidean distance
Bagging helps when the learner is unstable, i.e., when a small change in the training set causes a large change in the hypothesis produced:

• Decision trees, neural networks

• Not nearest neighbor

Boosting Classifiers
[Freund & Schapire, ICML '96; many more]

The final classifier is a weighted vote of the learned classifiers, but not a uniform one; instead, the weight of classifier ht depends on its performance on the data it was trained on.

Repeat for t = 1, ..., T:

1. Run the learning algorithm on examples randomly drawn from training set D according to distribution Dt (D1 = uniform)

2. Increase the weight, under Dt+1, of the examples that ht misclassifies, so that later classifiers concentrate on them
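A compact Python sketch of this loop using AdaBoost's particular weighting scheme (the α_t and reweighting formulas are AdaBoost's from the cited paper; weak_learn is an assumed black-box learner):

```python
import math
import random

def adaboost(examples, weak_learn, T):
    """AdaBoost-style boosting sketch.

    examples:   list of (x, y) pairs with y in {-1, +1}
    weak_learn: function mapping a sample to a hypothesis h, h(x) in {-1, +1}
    Returns H(x) = sign(sum_t alpha_t * h_t(x)).
    """
    n = len(examples)
    w = [1.0 / n] * n                            # D_1 = uniform
    hyps, alphas = [], []
    for t in range(T):
        # Draw a training sample from D according to D_t
        sample = random.choices(examples, weights=w, k=n)
        h = weak_learn(sample)
        # Weighted error of h_t under D_t
        eps = sum(wi for wi, (x, y) in zip(w, examples) if h(x) != y)
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # guard against 0 and 1
        alpha = 0.5 * math.log((1 - eps) / eps)  # this classifier's weight
        # Increase the weight of misclassified examples, then renormalize
        w = [wi * math.exp(-alpha * y * h(x))
             for wi, (x, y) in zip(w, examples)]
        Z = sum(w)
        w = [wi / Z for wi in w]
        hyps.append(h)
        alphas.append(alpha)

    def H(x):   # final classifier: weighted vote of the h_t
        return 1 if sum(a * h(x) for a, h in zip(alphas, hyps)) >= 0 else -1
    return H
```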
Boosting Miscellany

• Empirically, generalization sometimes improves if training continues after H(·)'s error on D drops to 0

  – Contrary to intuition; we would expect overfitting

  – Related to increasing the combined classifier's margin (its confidence in its predictions)