10-701 Midterm Exam Solutions, Spring 2007
1. Personal info:
• Name:
• Andrew account:
• E-mail address:
2. There should be 16 numbered pages in this exam (including this cover sheet).
3. You can use any material you brought: any book, class notes, your print outs of class
materials that are on the class website, including my annotated slides and relevant
readings, and Andrew Moore’s tutorials. You cannot use materials brought by other
students. Calculators are allowed, but no laptops, PDAs, phones or Internet access.
4. If you need more room to work out your answer to a question, use the back of the page
and clearly mark on the front of the page if we are to look at what’s on the back.
5. Work efficiently. Some questions are easier, some more difficult. Be sure to give yourself
time to answer all of the easy ones, and avoid getting bogged down in the more difficult
ones before you have answered the easier ones.
6. Note there are extra-credit sub-questions. The grade curve will be made without
considering students’ extra credit points. The extra credit will then be used to try to
bump your grade up without affecting anyone else’s grade.
8. Good luck!
1 [21 Points] Short Questions
The following short questions should be answered with at most two sentences, and/or a picture. For the (true/false) questions, answer true or false. If you answer true, provide a short justification; if you answer false, explain why or provide a small counterexample.
1. [1 point] true/false A classifier trained on less training data is less likely to overfit.
⋆ SOLUTION: This is false. A classifier with some fixed model complexity is more likely to fit the noise in the training data when there is less training data, and is therefore more likely to overfit.
2. [1 point] true/false Given m data points, the training error converges to the true
error as m → ∞.
⋆ SOLUTION: This is true, if we assume that the data points are i.i.d. A few students
pointed out that this might not be the case.
3. [1 point] true/false The maximum likelihood model parameters (α) can be learned using linear regression for the model: $y_i = \alpha_1 x_1 x_2^3 + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$ is i.i.d. noise.
4. [2 points] true/false The maximum likelihood model parameters (α) can be learned using linear regression for the model: $y_i = x_1^{\alpha_1} e^{\alpha_2} + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$ is i.i.d. noise.
5. [2 points] true/false The maximum likelihood model parameters (α) can be learned using linear regression for the model: $y_i = \log(x_1^{\alpha_1} e^{\alpha_2}) + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$ is i.i.d. noise.
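A quick check for questions 3 through 5 (an editorial sketch, not the original answer key) is whether each model can be rewritten so that it is linear in the parameters with additive Gaussian noise:

$$y_i = \alpha_1 (x_1 x_2^3) + \epsilon_i \quad \text{is linear in } \alpha_1 \text{ (treat } x_1 x_2^3 \text{ as a single regressor);}$$
$$y_i = x_1^{\alpha_1} e^{\alpha_2} + \epsilon_i \quad \text{is nonlinear in } \alpha_1, \text{ and taking logs would distort the additive noise term;}$$
$$y_i = \log(x_1^{\alpha_1} e^{\alpha_2}) + \epsilon_i = \alpha_1 \log x_1 + \alpha_2 + \epsilon_i \quad \text{is linear in } (\alpha_1, \alpha_2).$$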
7. [2 points] true/false In AdaBoost, the weighted training error $\epsilon_t$ of the t-th weak classifier on training data with weights $D_t$ tends to increase as a function of t.
⋆ SOLUTION: True. In the course of the boosting iterations the weak classifiers are forced to try to classify more difficult examples. The weights will increase for examples that are repeatedly misclassified by the weak classifiers. The weighted training error $\epsilon_t$ of the t-th weak classifier on the training data therefore tends to increase.
8. [2 points] true/false AdaBoost will eventually give zero training error regardless of
the type of weak classifier it uses, provided enough iterations are performed.
⋆ SOLUTION: Not if the data in the training set cannot be separated by a linear combination of the specific type of weak classifiers we are using. For example, consider the EXOR example (in HW2 we worked with a rotated EXOR toy dataset) with decision stumps as weak classifiers. No matter how many iterations are performed, zero training error will not be achieved.
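As a concrete illustration (an editorial sketch, not part of the original solutions), a minimal AdaBoost loop over axis-aligned decision stumps stalls on the canonical four-point XOR dataset: under uniform weights every stump has weighted error exactly 1/2, so boosting cannot make progress and the training error never reaches zero.

    # Editorial sketch: AdaBoost with decision stumps on four-point XOR data.
    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([-1, +1, +1, -1])            # XOR labels

    def all_stumps(X):
        """Every axis-aligned stump on these binary features (threshold 0.5, both signs)."""
        stumps = []
        for j in range(X.shape[1]):
            base = np.where(X[:, j] >= 0.5, 1, -1)
            stumps.extend([base, -base])
        return stumps

    D = np.full(len(y), 1.0 / len(y))         # initial weights are uniform
    F = np.zeros(len(y))                      # boosted score
    for t in range(10):
        preds = all_stumps(X)
        errs = [np.sum(D[p != y]) for p in preds]
        best = int(np.argmin(errs))
        eps = errs[best]
        if eps >= 0.5:                        # no stump beats random guessing on XOR
            print(f"round {t}: best weighted error = {eps:.2f}; boosting stops")
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        F += alpha * preds[best]
        D *= np.exp(-alpha * y * preds[best])
        D /= D.sum()

    pred = np.where(F >= 0, 1, -1)
    print("training error:", np.mean(pred != y))   # stays at 0.5, never 0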
9. [2 points] Consider a point that is correctly classified and distant from the decision
boundary. Why would SVM’s decision boundary be unaffected by this point, but the
one learned by logistic regression be affected?
⋆ SOLUTION: The hinge loss used by SVMs gives zero weight to these points while
the log-loss used by logistic regression gives a little bit of weight to these points.
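In terms of the margin $m = y f(x)$ of such a point (a sketch added here for concreteness, using the margin-based form of the two losses):

$$\ell_{\text{hinge}}(m) = \max(0,\ 1 - m) = 0 \quad\text{and}\quad \ell'_{\text{hinge}}(m) = 0 \quad \text{for } m > 1,$$
$$\ell_{\text{log}}(m) = \log\!\left(1 + e^{-m}\right) > 0 \quad\text{and}\quad \ell'_{\text{log}}(m) = -\frac{e^{-m}}{1 + e^{-m}} \ne 0 \quad \text{for all } m,$$

so a distant, correctly classified point still exerts a small pull on the logistic regression boundary but none on the SVM boundary.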
10. [2 points] Why does the kernel trick allow us to solve SVMs with high dimensional
feature spaces, without significantly increasing the running time?
⋆ SOLUTION: In the dual formulation of the SVM, features only appear as dot
products which can be represented compactly by kernels.
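For instance (an editorial sketch, assuming a degree-2 polynomial kernel $k(x, z) = (1 + x \cdot z)^2$ on 2-D inputs), the kernel value equals an inner product in a 6-dimensional feature space that never has to be built explicitly:

    # Editorial sketch: a degree-2 polynomial kernel equals a dot product of
    # explicit degree-2 feature vectors, so the dual SVM can use the kernel alone.
    import numpy as np

    def phi(x):
        """One explicit feature map whose inner product gives (1 + x.z)^2 in 2-D."""
        x1, x2 = x
        return np.array([1.0,
                         np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2,
                         np.sqrt(2) * x1 * x2])

    x = np.array([1.5, -2.0])
    z = np.array([0.3, 4.0])
    print((1 + x @ z) ** 2)    # kernel evaluation, O(d) work in the input dimension
    print(phi(x) @ phi(z))     # identical value via the explicit 6-D feature map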
11. [2 points] Consider a learning problem with 2D features. How are the decision tree
and 1-nearest neighbor decision boundaries related?
⋆ SOLUTION: In both cases, the decision boundary is piecewise linear. Decision trees make axis-aligned splits, while 1-NN gives a Voronoi diagram.
12. [2 points] You are a reviewer for the International Mega-Conference on Algorithms
for Radical Learning of Outrageous Stuff, and you read papers with the following
experimental setups. Would you accept or reject each paper? Provide a one sentence
justification. (This conference has short reviews.)
• accept/reject “My algorithm is better than yours. Look at the training error
rates!”
• accept/reject “My algorithm is better than yours. Look at the test error rates!
(Footnote: reported results for λ = 1.789489345672120002.)”
• accept/reject “My algorithm is better than yours. Look at the test error rates!
(Footnote: reported results for best value of λ.)”
• accept/reject “My algorithm is better than yours. Look at the test error rates!
(Footnote: reported results for best value of λ, chosen with 10-fold cross valida-
tion.)”
13. [Extra credit: 0.911 points] You have designed the ultimate learning algorithm that
uses physical and metaphysical knowledge to learn and generalize beyond the quantum
P-NP barrier. You are now given the following test example:
What label will your algorithm output?
[Test example image omitted.]
(a) Watch a cartoon.
(b) Call the anti-terrorism squad.
(c) Support the Boston Red Sox.
(d) All labels have equal probability.
⋆ SOLUTION: Watching a cartoon earned you 0.39 points. 0.2005 points were given
for supporting the Boston Red Sox. 0.666 points were given for calling the anti-terrorism
squad. 0.911 points were given for “all labels have equal probability.”
2 [16 Points] SVMs and the slack penalty C
The goal of this problem is to correctly classify test data points, given a training data set.
You have been warned, however, that the training data comes from sensors which can be
error-prone, so you should avoid trusting any specific point too much.
For this problem, assume that we are training an SVM with a quadratic kernel, that is, our kernel function is a polynomial kernel of degree 2. You are given the data set presented
in Figure 1. The slack penalty C will determine the location of the separating hyperplane.
Please answer the following questions qualitatively. Give a one sentence answer/justification
for each and draw your solution in the appropriate part of the Figure at the end of the
problem.
[Figure 1: the training data set (both axes range from 0 to 12); plot omitted.]
1. [4 points] Where would the decision boundary be for very large values of C (i.e.,
C → ∞)? (remember that we are using an SVM with a quadratic kernel.) Draw on
the figure below. Justify your answer.
⋆ SOLUTION: For large values of C, the penalty for misclassifying points is very
high, so the decision boundary will perfectly separate the data if possible. See below for
the boundary learned using libSVM and C = 100000.
COMMON MISTAKE 1: Some students drew straight lines, which would not be
the result with a quadratic kernel.
COMMON MISTAKE 2: Some students confused the effect of C and thought
that a large C meant that the algorithm would be more tolerant of misclassifications.
2. [4 points] For C ≈ 0, indicate in the figure below where you would expect the decision boundary to be. Justify your answer.
⋆ SOLUTION: The classifier can maximize the margin between most of the points,
while misclassifying a few points, because the penalty is so low. See below for the boundary
learned by libSVM with C = 0.00005.
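The two regimes can be reproduced with any SVM package (an editorial sketch; the data below is a made-up stand-in for Figure 1, and scikit-learn is used here where the original solution quotes libSVM):

    # Editorial sketch: a degree-2 polynomial-kernel SVM under a huge and a tiny C.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([3, 3], 1.0, size=(20, 2)),   # one cloud of points
                   rng.normal([8, 8], 1.0, size=(20, 2))])  # the other cloud
    y = np.array([-1] * 20 + [1] * 20)
    X = np.vstack([X, [[6.0, 6.0]]])                        # one suspect sensor reading
    y = np.append(y, -1)

    for C in (1e5, 5e-5):                                   # the C values quoted above
        clf = SVC(kernel="poly", degree=2, coef0=1, C=C).fit(X, y)
        err = np.mean(clf.predict(X) != y)
        print(f"C = {C:g}: training error = {err:.2f}, "
              f"support vectors = {clf.n_support_.sum()}")
    # Expectation: with huge C the boundary contorts to fit the stray point if it
    # can; with tiny C misclassification is cheap, so the wide margin between the
    # two clouds dominates and the stray point is effectively ignored.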
3. [2 points] Which of the two cases above would you expect to work better in the classification task? Why?
⋆ SOLUTION: We were warned not to trust any specific data point too much, so we
prefer the solution where C ≈ 0, because it maximizes the margin between the dominant
clouds of points.
4. [3 points] Draw a data point which will not change the decision boundary learned for
very large values of C. Justify your answer.
⋆ SOLUTION: We add the point circled below, which is correctly classified by the
original classifier, and will not be a support vector.
5. [3 points] Draw a data point which will significantly change the decision boundary
learned for very large values of C. Justify your answer.
⋆ SOLUTION: Since C is very large, adding a point that would be incorrectly classified
by the original boundary will force the boundary to move.
[Four panels (both axes 0 to 12) referenced by the solutions above; plots omitted.]
$h(X; \theta) = y X_j$, where $\theta = \{j, y\}$: $j$ is the word selector and $y \in \{-1, 1\}$ is the associated class.
More intuitively, each weak learner is a word associated with a class label. For example, if we had the word football and the classes {sports, non-sports}, then we would have two weak learners from this word, namely one that predicts sports whenever football appears and one that predicts non-sports whenever football appears.
1. If there are m words, how many weak learners are there in total?
⋆ SOLUTION: Two weak learners for each word, i.e. 2m weak learners.
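A small illustration of this hypothesis class (an editorial sketch; the vocabulary, documents, and the helper h below are made up for illustration):

    # Editorial sketch: materializing the 2m weak learners h(X; {j, y}) = y * X_j
    # for a toy bag-of-words matrix. Each learner votes y when word j is present
    # and abstains (outputs 0) otherwise.
    import numpy as np

    vocab = ["football", "election", "goal"]        # m = 3 made-up words
    X = np.array([[1, 0, 1],                        # document 1
                  [0, 1, 0],                        # document 2
                  [1, 0, 0]])                       # document 3

    weak_learners = [(j, y) for j in range(len(vocab)) for y in (-1, +1)]
    print(len(weak_learners))                       # 2m = 6 weak learners

    def h(X, j, y):
        """Predictions of the weak learner theta = {j, y} on every document."""
        return y * X[:, j]

    print(h(X, vocab.index("football"), +1))        # [1 0 1]: votes +1 where it occurs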
2. This boosting algorithm can be used for feature selection. We run the algorithm and
select the features in the order in which they were identified by the algorithm.
(a) [4 points] Can this boosting algorithm select the same weak classifier more than
once? Explain.
(b) [4 points] Consider ranking the features based on their individual mutual information with the class variable y, i.e. $\hat{I}(y; X_j)$. Will this ranking be more informative than the ranking returned by AdaBoost? Explain.
• Each data point has two real valued features, the X and Y coordinates.
For each of these datasets, draw the decision boundary that a Gaussian Naive Bayes classifier
will learn.
⋆ SOLUTION: For toydata1 the crucial detail is that GNB learns diagonal covariance matrices, yielding axis-aligned Gaussians. In Figure 4(A) the two circles are the Gaussians learned by GNB, and hence the decision surface is the tangent through the point of contact.
For toydata2, GNB learns two Gaussians: one for the inner circle with small variance, and one for the outer circle with a much larger variance. The decision surface is roughly as shown in Figure 4(B).
[Figure 4: panels (A) and (B) referenced above; plots omitted.]
Remember that a very important piece of information was that all the classes had the same
number of points, and so we don’t have to worry about the prior.
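The axis-aligned behavior can be seen directly from the parameters a Gaussian Naive Bayes model fits (an editorial sketch; the blob data and the scikit-learn usage are illustrative, not the toydata sets from the exam):

    # Editorial sketch: GaussianNB stores one mean and one variance per feature
    # per class, i.e. diagonal covariances, so its class densities are axis-aligned.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    X0 = rng.normal(loc=[2.0, 0.0], scale=[0.5, 2.0], size=(100, 2))  # elongated blob
    X1 = rng.normal(loc=[5.0, 0.0], scale=[0.5, 2.0], size=(100, 2))  # same shape, shifted
    X = np.vstack([X0, X1])
    y = np.array([0] * 100 + [1] * 100)          # equal class sizes: prior plays no role

    gnb = GaussianNB().fit(X, y)
    print("per-class means:\n", gnb.theta_)      # shape (2 classes, 2 features)
    print("per-class variances:\n", gnb.var_)    # diagonal only (sigma_ in older sklearn)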
[Figure: two panels, A and B; plots omitted.]
[Figure: training data for this problem, with x1 on the horizontal axis and x2 on the vertical axis; plot omitted.]
N1 (2 points)
N2 (2 points)
N3 (4 points)
5.2 Regularized Neural Networks
[8 points]
One method for preventing a neural network's weights from overfitting is to add regularization terms. You will now derive the update rules for the regularized neural network.
Note: Y = out(x)
Recall that the non-regularized gradient descent update rule for w1 is:

$$w_1^{t+1} \leftarrow w_1^t + \eta \sum_j \left(y^{(j)} - out(x^{(j)})\right) out(x^{(j)}) \left(1 - out(x^{(j)})\right) V_1(x^{(j)}) \tag{1}$$
[4 points] Derive the update rule for w1 for the regularized neural net loss function, which adds a penalty based on the square of each weight. Use λ to denote the magic regularization parameter.
[4 points] Now, re-express the regularized update rule so that the only difference between
the regularized setting and the unregularized setting above is that the old weight w1t is scaled
by some constant. Explain how this scaling prevents overfitting.
⋆ SOLUTION:

$$w_1^{t+1} \leftarrow w_1^t (1 - 2\eta\lambda) + \eta \sum_j \left(y^{(j)} - out(x^{(j)})\right) out(x^{(j)}) \left(1 - out(x^{(j)})\right) V_1(x^{(j)})$$
At each update the weight is kept closer to zero by the (1 − 2ηλ) term. This prevents the weights
from becoming very large, which corresponds to overfitting.
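The scaling factor comes out of one extra gradient term (a sketch, assuming the penalty added to the loss is $\lambda \sum_i w_i^2$):

$$\frac{\partial}{\partial w_1}\left[\lambda \sum_i w_i^2\right] = 2\lambda w_1
\quad\Longrightarrow\quad
w_1^{t+1} \leftarrow w_1^t - 2\eta\lambda\, w_1^t + \eta \sum_j \left(y^{(j)} - out(x^{(j)})\right) out(x^{(j)}) \left(1 - out(x^{(j)})\right) V_1(x^{(j)}),$$

which is exactly the update above with $w_1^t$ scaled by $(1 - 2\eta\lambda)$.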
⋆ SOLUTION: One solution follows from the observation that the decision boundary for N1 could be x1 = 3.7. In fact, N1 can be removed entirely from the model. This yields a similar decision boundary for N3 except that V1 = x1 ranges from 0 to 6.
Other possible solutions are to change the input feature space (e.g., by adding $x_1^2$ as an input to a single neuron along with $x_1$, $x_2$).
6 [14 Points] The Effect of Irrelevant Features
1. (a) [3 points] Provide a 2D dataset where 1-nearest neighbor (1-NN) has lower leave-
one-out cross validation error (LOO error) than SVMs.
[Grid for the dataset of part 1(a) (both axes 0 to 12); plot omitted.]
(b) [3 points] Provide a 2D dataset where 1-NN has higher LOO error than SVMs.
[Grid for the dataset of part 1(b) (both axes 0 to 12); plot omitted.]
2. [8 points] You will now generate a dataset to illustrate SVMs' robustness to irrelevant features. In particular, create a 2D dataset with features X1 and X2, where X2 will be the irrelevant feature, such that:
• If you only use X1 , 1-NN will have lower LOO error than SVMs,
• but if you use both X1 and X2 , the SVM LOO error will remain the same, but
LOO error for 1-NN will increase significantly.
[Figure 7 plot omitted.]
Figure 7: Here the horizontal axis is X1, and we ignore X2. 1-NN is perfect, and SVM will get the two points on the left wrong in LOOCV.
[Figure 8 plot omitted.]
Figure 8: Here the horizontal axis is X1, and the vertical axis is X2. X2 is the irrelevant feature. 1-NN gets every point wrong in LOOCV, while SVM has the same error.
You will receive extra credit if the 1-NN LOO error before adding the irrelevant feature
is zero, but the error becomes 100% after adding the feature.
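A simplified version of this effect can be checked numerically (an editorial sketch, not the exact points of Figures 7 and 8; it only demonstrates that the irrelevant feature X2 should ruin 1-NN's LOO error while leaving the SVM's unchanged):

    # Editorial sketch: the irrelevant feature X2 pairs each point with a nearby
    # opposite-class point, so 1-NN's LOO error jumps from 0 to 1.0, while a
    # linear SVM keeps relying on X1 and its LOO error stays the same.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    x1 = np.concatenate([np.arange(1, 5), np.arange(7, 11)]).astype(float)
    y = np.array([-1] * 4 + [1] * 4)                 # classes split along X1
    x2 = np.concatenate([100.0 * np.arange(1, 5),    # irrelevant, large-scale values
                         100.0 * np.arange(1, 5) + 1.0])
    X_x1 = x1.reshape(-1, 1)
    X_both = np.column_stack([x1, x2])

    def loo_error_1nn(X, y):
        """Leave-one-out error of 1-NN with Euclidean distance, computed by hand."""
        wrong = 0
        for i in range(len(y)):
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf                            # leave point i out
            wrong += y[np.argmin(d)] != y[i]
        return wrong / len(y)

    for name, X in [("X1 only", X_x1), ("X1 and X2", X_both)]:
        svm_acc = cross_val_score(SVC(kernel="linear", C=1e3), X, y,
                                  cv=LeaveOneOut()).mean()
        print(f"{name}: 1-NN LOO error = {loo_error_1nn(X, y):.2f}, "
              f"SVM LOO error = {1 - svm_acc:.2f}")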
3. [Extra Credit (3 points)] SVMs tend to be robust to irrelevant features. Suppose we run SVMs with features X1, . . . , Xn, and then add an irrelevant feature Xn+1 that cannot help increase the margin. How will SVMs automatically ignore this feature? Justify your answer formally.
⋆ SOLUTION: SVMs will automatically ignore this feature because it cannot possibly increase the margin, so giving it non-zero weight keeps the same margin but increases the regularization penalty. Therefore the solution with zero weight on Xn+1 has a smaller objective value than (and is therefore superior to) every feasible solution of the QP in which this feature has non-zero weight.
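In terms of the standard soft-margin primal (a sketch of the argument, added editorially):

$$\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i\left(w^\top x_i + b\right) \ge 1 - \xi_i,\ \ \xi_i \ge 0.$$

If setting $w_{n+1} = 0$ (keeping all other variables fixed) never violates a margin constraint, which is what "$X_{n+1}$ cannot help increase the margin" amounts to, then zeroing $w_{n+1}$ preserves feasibility and strictly decreases $\frac{1}{2}\|w\|^2$, so the optimal solution must have $w_{n+1} = 0$.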
7 [15 points] Learning Theory
Consider binary data points X in n dimensions, with binary labels Y, i.e. X ∈ {0, 1}^n, Y ∈ {0, 1}. We wish to learn a mapping X → Y using a few different hypothesis classes, but
are concerned about the tradeoff between the expressivity of our hypothesis space and the
number of training examples required to learn the true mapping probably approximately
correctly.
1. Consider the following hypothesis class H: decision stumps that choose a value for Y
based on the value of one of the attributes of X. For example, there are two hypotheses
in H that involve feature i:
$$h_i(X) = \begin{cases} 1 & \text{if } X_i = 1 \\ 0 & \text{otherwise} \end{cases} \qquad h_{\neg i}(X) = \begin{cases} 0 & \text{if } X_i = 1 \\ 1 & \text{otherwise.} \end{cases}$$
(a) What is the size of this hypothesis class?
⋆ SOLUTION: |H| = 2n, since each of the n features contributes two hypotheses, $h_i$ and $h_{\neg i}$.
(b) [3 points] For given ε, δ, how many training examples are needed to yield a decision stump that satisfies the Haussler-PAC bound?
⋆ SOLUTION:

$$|H| e^{-m\epsilon} \le \delta$$
$$m \ge \frac{1}{\epsilon}\left(\log|H| + \log\frac{1}{\delta}\right) = \frac{1}{\epsilon}\left(\log(2n) + \log\frac{1}{\delta}\right)$$
2. Now let us define another hypothesis class H′, where each hypothesis is a majority over a set of simple decision stumps. Specifically, for each feature i, we either use $h_i$ or $h_{\neg i}$, and the output is the result of a majority vote (in the case of a tie, we predict 1). For example, if we have 5 features, and we choose the stumps $\{h_{\neg 1}, h_2, h_3, h_4, h_{\neg 5}\}$, then the resulting hypothesis is:

$$h'(X) = \begin{cases} 1 & \text{if } h_{\neg 1}(X) + h_2(X) + h_3(X) + h_4(X) + h_{\neg 5}(X) \ge \frac{5}{2} \\ 0 & \text{otherwise} \end{cases}$$
(a) [4 points] What is the size of this hypothesis class?
⋆ SOLUTION: |H′| = 2^n, since for each of the n features we choose either $h_i$ or $h_{\neg i}$.
(b) [2 points] For given ε, δ, how many training examples are needed to yield a hypothesis that satisfies the Haussler-PAC bound?
⋆ SOLUTION:

$$m \ge \frac{1}{\epsilon}\left(n \log 2 + \log\frac{1}{\delta}\right)$$
3. [3 points] What can we say about the amount of extra samples necessary to learn this
voting classifier? Is this a concern?
Briefly explain the tradeoff between the expressive power of the hypothesis space and
the number of training samples required for these two classifiers.
⋆ SOLUTION: In the first part of the problem, the required number of samples scales
as O(log n), and in the second part of the problem, it scales as O(n). So we need more
samples to get the PAC bound for the second hypothesis class, but scaling linearly in the
number of features is probably acceptable.
The tradeoff illustrated here is that greater expressive power (as with H ′ ) necessitates more
training samples. The smaller and less expressive class H requires fewer training examples.
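To make the scaling concrete (an editorial sketch; the values of n, ε, and δ below are made-up examples, and natural logarithms are assumed in the bound):

    # Editorial sketch: plugging illustrative numbers into the two Haussler bounds.
    import math

    n, eps, delta = 100, 0.1, 0.05                  # made-up example values
    m_stumps = (math.log(2 * n) + math.log(1 / delta)) / eps   # |H| = 2n
    m_votes = (n * math.log(2) + math.log(1 / delta)) / eps    # |H'| = 2^n
    print(math.ceil(m_stumps), math.ceil(m_votes))  # roughly 83 vs. 724 examples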