Universidad Carlos III de Madrid
Bachelor in Data Science and Engineering
Final Exam, January 2022
Statistical Learning
Time: 2h
1. (2 points) Decide whether the following statements are true or false, arguing your answer:
(a) If a dataset is standardized, the correlations do not change.
(b) As the number of observations goes to infinity in a dataset, a model trained on that data will have lower bias.
(c) In clustering, there is no tool to estimate the number of clusters.
(d) In k-means, the cluster centers do not need to be observed data points.
(e) In a Lasso regression model, if we increase the penalty parameter, the prediction bias is increased and the variance is decreased.
(f) If a linear regression is trained using only half the data, the bias will be smaller.
(g) Support vector machines, like logistic regression models, give a probability distribution over the possible labels given a new observation.
(h) When the number of observations is large, overfitting is more likely.
(i) Holding out 20% of the observations for testing is faster but less accurate than performing 5-fold cross-validation.
(j) Both linear regression and logistic regression are special cases of neural networks.
Solution:
(a) True. Standardization affects the means and the variances, not the correlations.
(b) False. The bias is the same; the variance is smaller.
(c) False. There are several ways to estimate it. For instance, a Gaussian mixture model can be used to estimate the number of groups (see the sketch after this solution).
(d) True. The coordinates of the centroid are the (arithmetic) mean of the variables over all the observations in the cluster. It does not need to be one of the data points.
(e) True.
(f) False. Bias depends on the model used, not on the number of observations.
(g) False. SVMs do not give a probability distribution, just the label predictions.
(h) False. Overfitting is more likely when the number of variables (features) is high.
(i) True.
(j) True.
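A minimal Python sketch (not part of the original exam) illustrating the idea in answer (c): a Gaussian mixture model can be used to estimate the number of groups, here by picking the number of components with the lowest BIC. The synthetic data and the candidate range 1 to 6 are illustrative assumptions.

```python
# Sketch: estimating the number of clusters with a Gaussian mixture model.
# The synthetic data and the range 1..6 are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit mixtures with 1..6 components and keep the BIC of each fit.
bics = []
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics.append(gm.bic(X))

best_k = int(np.argmin(bics)) + 1  # BIC is minimized at the chosen number of groups
print("Estimated number of groups:", best_k)
```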
2. (1 point) The Mahalanobis distance between two observations $x$ and $y$ in a population is defined as
$$d(x, y) = (x - y)^T \Sigma^{-1} (x - y),$$
where $\Sigma$ denotes the covariance matrix of the population.
Prove that this distance is equivalent to the Euclidean distance in a transformed space (indicating the corresponding transformation).
Solution: The Mahalanobis distance is defined as
$$d(x, y) = (x - y)^T \Sigma^{-1} (x - y),$$
which can be expanded in the following way:
$$\begin{aligned}
d(x, y) &= (x - y)^T \Sigma^{-1} (x - y) \\
&= (x - y)^T \Sigma^{-1/2} \Sigma^{-1/2} (x - y) \\
&= \left(\Sigma^{-1/2}(x - y)\right)^T \Sigma^{-1/2}(x - y) \\
&= \left(\Sigma^{-1/2}x - \Sigma^{-1/2}y\right)^T \left(\Sigma^{-1/2}x - \Sigma^{-1/2}y\right).
\end{aligned}$$
The last expression is equivalent to the (squared) Euclidean distance between the vectors $\Sigma^{-1/2}x$ and $\Sigma^{-1/2}y$. Hence, the transformation is $x' = \Sigma^{-1/2}x$.
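A small Python sketch (not part of the original exam) that checks the derivation numerically: the Mahalanobis distance equals the squared Euclidean distance between the points mapped through $\Sigma^{-1/2}$. The covariance matrix and the two points are arbitrary illustrative values.

```python
# Numerical check: Mahalanobis distance = squared Euclidean distance after
# transforming the points with Sigma^(-1/2). Values below are illustrative.
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # assumed population covariance matrix
x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])

d = x - y
mahalanobis_sq = d @ np.linalg.inv(Sigma) @ d          # (x - y)^T Sigma^{-1} (x - y)

vals, vecs = np.linalg.eigh(Sigma)
T = vecs @ np.diag(vals ** -0.5) @ vecs.T              # Sigma^{-1/2}
euclidean_sq = np.sum((T @ x - T @ y) ** 2)            # squared Euclidean distance

print(mahalanobis_sq, euclidean_sq)                    # the two values coincide
```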
3. (2 points)
(a) In what sense are the linear combinations defined by PCA optimal?
(b) Which are the main differences between Principal Component Analysis and Factor Analysis?
(c) When is it more convenient to apply PCA (with respect to FA)?
(d) If PCA is performed on an uncorrelated data set, which are the variances of the principal components? Why?
Solution:
(a) PCA finds the linear combinations of the responses that have most of the variance.
(b) In FA, the focus is to explain the correlations between indicators due to common factors (latent variables), whereas in PCA it is to explain the total variance. In PCA, the factors (components) are uncorrelated, whereas in FA they are not. In PCA the factors (components) are linear combinations of the indicators, whereas in FA the indicators are linear combinations of the factors. That implies FA is more interpretable, but the definition of the factors is not explicit (more black-box).
(c) When the indicators are causing the latent factors.
(d) The variances are equal to the eigenvalues of the correlation matrix. Since the data set is uncorrelated, that matrix is the identity, so all its eigenvalues (and hence the variances of the components) are equal to 1.
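A short Python sketch (not part of the original exam) illustrating answer (d): for simulated uncorrelated data, the variances of the principal components equal the eigenvalues of the correlation matrix and are all close to 1. The simulated data set is an illustrative assumption.

```python
# Sketch for answer (d): PCA on (simulated) uncorrelated data. The component
# variances match the eigenvalues of the correlation matrix, all close to 1.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))           # 4 independent (uncorrelated) variables

Z = StandardScaler().fit_transform(X)    # standardize, i.e. work with correlations
pca = PCA().fit(Z)

eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
print(pca.explained_variance_)           # component variances, all close to 1
print(eigvals)                           # eigenvalues of the correlation matrix
```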
4. (1 point) Which of the following figures correspond to possible values that PCA may return for the first principal component? Explain your answer.
[Figure: several labelled panels showing candidate first principal components; the image did not survive extraction.]
Solution: a) and c), because eigenvectors (or principal components) are unique except for the sign.
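A tiny Python sketch (not part of the original exam) of the reason given in the solution: an eigenvector of the covariance matrix is only determined up to its sign, so a principal component and its negation are both valid. The matrix below is an arbitrary illustrative covariance.

```python
# Sketch: eigenvectors (principal components) are unique only up to the sign.
import numpy as np

S = np.array([[3.0, 1.0],
              [1.0, 2.0]])               # arbitrary illustrative covariance matrix
vals, vecs = np.linalg.eigh(S)
v = vecs[:, -1]                          # first principal direction

# Both v and -v satisfy the eigenvector equation S v = lambda v.
print(np.allclose(S @ v, vals[-1] * v))          # True
print(np.allclose(S @ (-v), vals[-1] * (-v)))    # True
```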
5. (1 point) Which are the advantages of hierarchical clustering with respect to partitioning methods?
Solution: No need to pre-define the number of clusters. Easy to understand the output. Clusters are more interpretable.
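A brief Python sketch (not part of the original exam) of the first advantage: hierarchical clustering builds one merge tree (dendrogram) that can be cut afterwards at any number of clusters, so k does not have to be fixed in advance. The synthetic data are an illustrative assumption.

```python
# Sketch: one hierarchical tree, cut afterwards at different numbers of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

Z = linkage(X, method="ward")            # the full merge tree, computed once

# Cutting the same tree at different levels yields different numbers of clusters.
labels_2 = fcluster(Z, t=2, criterion="maxclust")
labels_3 = fcluster(Z, t=3, criterion="maxclust")
print(len(set(labels_2)), len(set(labels_3)))    # 2 and 3
```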
6. (1 point) Consider the next training set for classification:
[Training-set figure with labelled points; it did not survive extraction.]
What is the prediction for a 3-nearest neighbor classifier at the point (1, 1)?
Solution: +
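A Python sketch (not part of the original exam) of the majority-vote mechanics behind a 3-nearest-neighbor prediction at (1, 1). The original training set was lost in the scan, so the points and labels below are purely hypothetical and chosen only so that the prediction comes out as "+".

```python
# Sketch: mechanics of a 3-nearest-neighbor prediction at (1, 1).
# Hypothetical training points; the exam's actual training set is not available.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0, 1], [1, 0], [2, 2], [3, 3], [-1, -1]])
y_train = np.array(["+", "+", "+", "-", "-"])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[1, 1]]))   # majority label among the 3 closest training points
```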
7. (1 point) For the following confusion matrix, does this classifier have a better accuracy than that of the majority-class benchmark?

                Predicted
                 0     1
    Actual  0   30    12
            1    8    56

Solution: Accuracy of the classifier: (30 + 56)/106 ≈ 81%; accuracy of the naive (majority-class) benchmark: (8 + 56)/106 ≈ 60%. So yes, the classifier is better.
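A small Python sketch (not part of the original exam) reproducing the arithmetic in the solution directly from the confusion matrix given in the statement.

```python
# Sketch: classifier accuracy vs. majority-class benchmark from the confusion matrix.
import numpy as np

#                 predicted 0  predicted 1
cm = np.array([[30, 12],    # actual 0
               [ 8, 56]])   # actual 1

n = cm.sum()                                   # 106 observations
classifier_acc = np.trace(cm) / n              # (30 + 56) / 106 ≈ 0.81
majority_acc = cm.sum(axis=1).max() / n        # (8 + 56) / 106 ≈ 0.60

print(classifier_acc, majority_acc)            # the classifier is better
```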
8. (1 point) Given the following coefficients of a logistic regression model trying to predict the target Diabetes:

    Intercept   SugarInBlood   Age     Weight
    4.6         2.34           -0.12   0.35

write precisely the formula to compute the probability of having Diabetes.
Solution: Let p be the probability of having Diabetes. Given the regressors, the model for p is
$$\log\left(\frac{p}{1-p}\right) = 4.6 + 2.34 \cdot \text{SugarInBlood} - 0.12 \cdot \text{Age} + 0.35 \cdot \text{Weight}.$$
Taking the exponential on both sides and transforming:
$$\frac{p}{1-p} = \exp\left\{4.6 + 2.34 \cdot \text{SugarInBlood} - 0.12 \cdot \text{Age} + 0.35 \cdot \text{Weight}\right\}$$
$$p = (1 - p)\exp\left\{4.6 + 2.34 \cdot \text{SugarInBlood} - 0.12 \cdot \text{Age} + 0.35 \cdot \text{Weight}\right\}$$
$$p = \frac{\exp\left\{4.6 + 2.34 \cdot \text{SugarInBlood} - 0.12 \cdot \text{Age} + 0.35 \cdot \text{Weight}\right\}}{1 + \exp\left\{4.6 + 2.34 \cdot \text{SugarInBlood} - 0.12 \cdot \text{Age} + 0.35 \cdot \text{Weight}\right\}}.$$
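A minimal Python sketch (not part of the original exam) that evaluates this probability using the coefficients from the table; the feature values of the example patient are arbitrary assumptions.

```python
# Sketch: evaluating the fitted probability of Diabetes from the logistic model.
# Coefficients are taken from the statement; the example inputs are assumptions.
import math

def prob_diabetes(sugar_in_blood, age, weight):
    eta = 4.6 + 2.34 * sugar_in_blood - 0.12 * age + 0.35 * weight  # linear predictor
    return math.exp(eta) / (1 + math.exp(eta))                      # logistic transform

print(prob_diabetes(sugar_in_blood=1.0, age=50.0, weight=70.0))
```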