The Elements of Statistical Learning – Chapter 6
oxstar@SJTU
January 6, 2011
Ex. 6.2 Show that $\sum_{i=1}^N (x_i - x_0)\,l_i(x_0) = 0$ for local linear regression. Define $b_j(x_0) = \sum_{i=1}^N (x_i - x_0)^j l_i(x_0)$. Show that $b_0(x_0) = 1$ for local polynomial regression of any degree (including local constants). Show that $b_j(x_0) = 0$ for all $j \in \{1, 2, \ldots, k\}$ for local polynomial regression of degree $k$. What are the implications of this on the bias?
Proof For local linear regression the vector-valued function is $b(x)^T = (1, x)$, and $B$ is the $N \times 2$ regression matrix whose $i$th row is $b(x_i)^T$, i.e. $B = [\mathbf{1}, \mathbf{x}]$. Recall that $\hat f(x_0) = b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0)\, y = \sum_{i=1}^N l_i(x_0) y_i$, so $l(x_0)^T = b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0)$. Multiplying the identity $b(x_0)^T = b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0) B$ out column by column gives
$$(1, x_0) = b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0)\, [\mathbf{1}, \mathbf{x}],$$
i.e.
$$1 = b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0)\, \mathbf{1} = \sum_{i=1}^N l_i(x_0), \qquad x_0 = b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0)\, \mathbf{x} = \sum_{i=1}^N l_i(x_0)\, x_i. \tag{1}$$
Therefore
$$\sum_{i=1}^N (x_i - x_0)\, l_i(x_0) = \sum_{i=1}^N l_i(x_0)\, x_i - x_0 \sum_{i=1}^N l_i(x_0) = x_0 - x_0 \cdot 1 = 0.$$
From (1), we have
$$b_0(x_0) = \sum_{i=1}^N (x_i - x_0)^0\, l_i(x_0) = \sum_{i=1}^N l_i(x_0) = 1.$$
When $j \in \{1, 2, \ldots, k\}$ (local polynomial regression of degree $k$), the vector-valued function is $b(x)^T = (1, x, x^2, \ldots, x^k)$ and $B = [\mathbf{1}, \mathbf{x}, \mathbf{x}^2, \ldots, \mathbf{x}^k]$, where $\mathbf{x}^m$ denotes the vector with entries $x_i^m$.
From (1), we similarly have
$$x_0^j = \sum_{i=1}^N l_i(x_0)\, x_i^j. \tag{2}$$
Expanding $(x_i - x_0)^j$ without combining like terms, each of the $2^j$ resulting terms can be written as $(-1)^b x_i^a x_0^b$, where $a + b = j$. The number of positive terms equals the number of negative terms, i.e. $\sum_{n=1}^{2^j} (-1)^{b_n} = 0$ (this is just $(1-1)^j = 0$ for $j \geq 1$). So each term of $b_j(x_0)$ can be written as
$$\sum_{i=1}^N (-1)^b x_i^a x_0^b\, l_i(x_0) = (-1)^b x_0^b \sum_{i=1}^N l_i(x_0)\, x_i^a = (-1)^b x_0^b \cdot x_0^a = (-1)^b x_0^j \qquad \text{// (2)}$$
$$b_j(x_0) = \sum_{i=1}^N (x_i - x_0)^j\, l_i(x_0) = \sum_{n=1}^{2^j} (-1)^{b_n} x_0^j = x_0^j \sum_{n=1}^{2^j} (-1)^{b_n} = 0.$$
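As a quick numerical sanity check (not part of the original solution), the moment conditions $b_0(x_0) = 1$ and $b_j(x_0) = 0$ for $j = 1, \ldots, k$ can be verified by computing the equivalent-kernel weights $l(x_0) = W(x_0) B (B^T W(x_0) B)^{-1} b(x_0)$ directly. The sketch below assumes a Gaussian kernel and an arbitrary sample and bandwidth; up to floating-point error it prints 1, 0, 0, 0.

    import numpy as np

    def equivalent_kernel(x, x0, degree, bandwidth):
        """Weights l_i(x0) of a degree-`degree` local polynomial fit,
        l(x0) = W(x0) B (B^T W(x0) B)^{-1} b(x0), with a Gaussian kernel."""
        B = np.vander(x, degree + 1, increasing=True)          # i-th row is b(x_i)^T
        b0 = x0 ** np.arange(degree + 1)                       # b(x0)
        w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)         # kernel weights, diagonal of W(x0)
        return w * (B @ np.linalg.solve(B.T @ (w[:, None] * B), b0))

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0.0, 1.0, 100))
    x0, k = 0.37, 3
    l = equivalent_kernel(x, x0, degree=k, bandwidth=0.15)
    for j in range(k + 1):
        print(j, np.sum((x - x0) ** j * l))                    # b_0 = 1, b_1 = ... = b_k = 0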
Hence, Taylor expanding $f(x_i)$ about $x_0$, the bias of a local polynomial fit of degree $k$ is
$$E\hat f(x_0) - f(x_0) = \sum_{i=1}^N l_i(x_0)\, f(x_i) - f(x_0)$$
$$= f(x_0) \sum_{i=1}^N l_i(x_0) - f(x_0) + f'(x_0) \sum_{i=1}^N (x_i - x_0)\, l_i(x_0) + c_2 f''(x_0) \sum_{i=1}^N (x_i - x_0)^2\, l_i(x_0) + \ldots + c_{k+1} f^{(k+1)}(x_0) \sum_{i=1}^N (x_i - x_0)^{k+1}\, l_i(x_0) + \ldots$$
$$= c_{k+1} f^{(k+1)}(x_0) \sum_{i=1}^N (x_i - x_0)^{k+1}\, l_i(x_0) + \ldots$$
where $c_n = 1/n!$ are the coefficients of the Taylor-series expansion. The zeroth-order terms cancel because $b_0(x_0) = 1$, and all terms of degree $1$ through $k$ vanish because $b_j(x_0) = 0$ for $j = 1, \ldots, k$.
Now we see that the bias depends only on the $(k+1)$st-degree and higher-order terms in the expansion of $f$.
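To illustrate the implication numerically (again outside the original solution), note that with noiseless responses $y_i = f(x_i)$ the fitted value satisfies $\hat f(x_0) = \sum_i l_i(x_0) f(x_i) = E\hat f(x_0)$, so $\hat f(x_0) - f(x_0)$ is exactly the bias. The sketch below uses an arbitrary smooth $f$, a Gaussian kernel, and a point near the boundary, where the bias typically shrinks markedly as the degree of the local polynomial grows.

    import numpy as np

    def local_poly_fit(x, y, x0, degree, h):
        """Degree-`degree` locally weighted polynomial fit evaluated at x0."""
        B = np.vander(x, degree + 1, increasing=True)
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)                 # Gaussian kernel weights
        beta = np.linalg.solve(B.T @ (w[:, None] * B), B.T @ (w * y))
        return beta @ x0 ** np.arange(degree + 1)              # b(x0)^T beta

    f = lambda t: np.sin(4.0 * t)                              # an arbitrary smooth target
    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0.0, 1.0, 200))
    x0 = 0.02                                                  # near the boundary, where bias is largest
    for d in range(3):                                         # local constant, linear, quadratic
        print(d, local_poly_fit(x, f(x), x0, d, h=0.2) - f(x0))   # exact bias (noiseless y)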
Ex. 6.3 Show that ||l(x)|| (Section 6.1.2) increases with the degree of the local polynomial.
Proof Preliminary: for a local polynomial of degree $d$, $B$ is the $N \times (d+1)$ regression matrix whose $j$th row is $b(x_j)^T = (1, x_j, x_j^2, \ldots, x_j^d)$; $B$ itself is not invertible, but $B^T B$ is (assuming $B$ has full column rank).
From
$$\hat f(x_j) = b(x_j)^T (B^T W(x_j) B)^{-1} B^T W(x_j)\, y = \sum_{i=1}^N l_i(x_j)\, y_i = l(x_j)^T y$$
we have
$$l(x_j)^T = b(x_j)^T (B^T W(x_j) B)^{-1} B^T W(x_j),$$
$$l(x_j) = W(x_j) B (B^T W(x_j) B)^{-1} b(x_j),$$
$$\|l(x_j)\|^2 = l(x_j)^T l(x_j) = b(x_j)^T (B^T W(x_j) B)^{-1} B^T W(x_j)^2 B (B^T W(x_j) B)^{-1} b(x_j).$$
To see how this quantity scales with the degree, consider the unweighted case $W(x_j) = I$; then
$$\|l(x_j)\|^2 = b(x_j)^T (B^T B)^{-1} B^T B (B^T B)^{-1} b(x_j) = b(x_j)^T (B^T B)^{-1} b(x_j),$$
and summing over the $N$ training points, using $\sum_{j=1}^N b(x_j) b(x_j)^T = B^T B$,
$$\|l(x)\|^2 := \sum_{j=1}^N \|l(x_j)\|^2 = \mathrm{trace}\big((B^T B)^{-1} B^T B\big) \qquad \text{// Prof. Zhang has proved it}$$
$$= \mathrm{trace}(I_{d+1}) = d + 1.$$
Hence $\|l(x)\| = \sqrt{d+1}$, which increases with the degree of the local polynomial.
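The same claim can be checked numerically at a single target point $x_0$ (the quantity Section 6.1.2 actually plots). Everything below is an arbitrary illustrative setup: a Gaussian kernel, a uniform sample, and a fixed bandwidth.

    import numpy as np

    def l_weights(x, x0, degree, h):
        """Equivalent-kernel weights l(x0) = W(x0) B (B^T W(x0) B)^{-1} b(x0)."""
        B = np.vander(x, degree + 1, increasing=True)
        b0 = x0 ** np.arange(degree + 1)
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)
        return w * (B @ np.linalg.solve(B.T @ (w[:, None] * B), b0))

    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(0.0, 1.0, 150))
    x0 = 0.5
    for d in range(4):                                         # degrees 0..3
        print(d, np.linalg.norm(l_weights(x, x0, d, h=0.1)))   # ||l(x0)|| typically grows with d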
Ex. 6.4 Suppose that the $p$ predictors $X$ arise from sampling relatively smooth analog curves at $p$ uniformly spaced abscissa values. Denote by $\mathrm{Cov}(X|Y) = \Sigma$ the conditional covariance matrix of the predictors, and assume this does not change much with $Y$. Discuss the nature of the Mahalanobis choice $A = \Sigma^{-1}$ for the metric in (6.14). How does this compare with $A = I$? How might you construct a kernel $A$ that (a) downweights high-frequency components in the distance metric; (b) ignores them completely?
Answer $D = \sqrt{(x - x_0)^T \Sigma^{-1} (x - x_0)}$ is called the Mahalanobis distance of the point $x$ to $x_0$. It takes the correlations of the data set into account: if the predictors are highly correlated, the Mahalanobis distance is a much more accurate measure of closeness than the Euclidean distance.
If $A = I$, then $d = \sqrt{(x - x_0)^T (x - x_0)}$ is just the Euclidean distance from $x$ to $x_0$. Prior to smoothing, we should standardize each predictor, for example
$$x_i' = \frac{x_i - E(x_i)}{\sqrt{\mathrm{Var}(x_i)}}.$$
Comparing this with $\Sigma^{-1}$, for the standardized predictors we have
$$\mathrm{Cov}(x_i', x_j') = E[(x_i' - E(x_i'))(x_j' - E(x_j'))] = E(x_i' x_j') = E\!\left[\frac{x_i - E(x_i)}{\sqrt{\mathrm{Var}(x_i)}} \cdot \frac{x_j - E(x_j)}{\sqrt{\mathrm{Var}(x_j)}}\right] = \frac{\mathrm{Cov}(x_i, x_j)}{\sqrt{\mathrm{Var}(x_i)}\,\sqrt{\mathrm{Var}(x_j)}} = \rho(x_i, x_j).$$
Then the covariance matrix $\Sigma$ is replaced by its standardized version, the correlation matrix. Hence using $A = I$ amounts to assuming $\rho(x_i, x_j) = 0$ for all $i \neq j$, i.e., treating all dimensions of $x$ as uncorrelated.
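A small illustration with synthetic correlated predictors (everything below is an arbitrary example, not from the text): after standardizing, the sample covariance is indeed the correlation matrix, and the Mahalanobis metric $A = \Sigma^{-1}$ treats a displacement that fights the correlation as much larger than the Euclidean metric $A = I$ does.

    import numpy as np

    rng = np.random.default_rng(3)
    Sigma = np.array([[1.0, 0.9],
                      [0.9, 1.0]])                             # strongly correlated pair
    X = rng.multivariate_normal([0.0, 0.0], Sigma, size=5000)

    Xs = (X - X.mean(axis=0)) / X.std(axis=0)                  # standardized predictors
    print(np.cov(Xs, rowvar=False))                            # ~ correlation matrix of X
    print(np.corrcoef(X, rowvar=False))

    A = np.linalg.inv(np.cov(X, rowvar=False))                 # Mahalanobis choice

    def dist(u, v, A):
        d = u - v
        return np.sqrt(d @ A @ d)

    x0 = np.zeros(2)
    along = np.array([1.0, 1.0])                               # along the correlation
    against = np.array([1.0, -1.0])                            # against the correlation
    print(dist(along, x0, np.eye(2)), dist(against, x0, np.eye(2)))   # equal Euclidean distances
    print(dist(along, x0, A), dist(against, x0, A))                   # Mahalanobis: 'against' is much farther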
(a) To construct a kernel $A$ that downweights the high-frequency components in the distance metric, we can shrink the weight that $A$ places on those components (for instance, by decreasing the corresponding $\mathrm{Cov}(x_i, x_j)$ or $\rho(x_i, x_j)$ entries) in order to suppress their influence.
(b) To construct a kernel $A$ that ignores them completely, we set those weights to zero; one concrete construction is sketched below.
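One concrete way to build such an $A$, offered only as a sketch and not spelled out above: since the predictors sample a curve at $p$ uniform points, take a discrete cosine basis as the notion of "frequency", write $A = U^T \mathrm{diag}(w)\, U$ where row $k$ of $U$ is the frequency-$k$ basis vector, and (a) let $w_k$ decay with $k$ to downweight high frequencies, or (b) set $w_k = 0$ above a cutoff so the metric ignores them entirely (making $A$ rank-deficient). The basis choice, weights, and cutoff below are all illustrative assumptions.

    import numpy as np

    def dct_basis(p):
        """Orthonormal DCT-II basis; row k is the frequency-k pattern."""
        n = np.arange(p)
        U = np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / p)
        U[0] /= np.sqrt(2.0)
        return U * np.sqrt(2.0 / p)

    p = 64
    U = dct_basis(p)                                           # U @ U.T ~ I
    freq = np.arange(p)

    w_down = 1.0 / (1.0 + (freq / 10.0) ** 2)                  # (a) smoothly decaying weights
    A_down = U.T @ np.diag(w_down) @ U

    w_cut = (freq < 10).astype(float)                          # (b) zero weight above a cutoff
    A_cut = U.T @ np.diag(w_cut) @ U

    def dist(x, x0, A):
        v = x - x0
        return np.sqrt(v @ A @ v)

    x0 = np.zeros(p)
    hi = U[50]                                                 # a pure high-frequency direction (unit norm)
    print(dist(hi, x0, np.eye(p)))                             # 1.0 under A = I
    print(dist(hi, x0, A_down))                                # much smaller under (a)
    print(dist(hi, x0, A_cut))                                 # 0 under (b)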