The Elements of Statistical Learning – Chapter 6

oxstar@SJTU
January 6, 2011

Ex. 6.2 Show that $\sum_{i=1}^N (x_i - x_0)\,l_i(x_0) = 0$ for local linear regression. Define $b_j(x_0) = \sum_{i=1}^N (x_i - x_0)^j l_i(x_0)$. Show that $b_0(x_0) = 1$ for local polynomial regression of any degree (including local constants). Show that $b_j(x_0) = 0$ for all $j \in \{1, 2, \ldots, k\}$ for local polynomial regression of degree $k$. What are the implications of this on the bias?

Proof By the definition of the vector-valued function, $b(x)^T = (1, x)$ and $B = [\mathbf{1}, \mathbf{x}]$, so we have

$$b(x_0)^T = b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0) B$$
$$(1, x_0) = b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0)\,[\mathbf{1}, \mathbf{x}]$$

and therefore

$$1 = b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0)\,\mathbf{1} = \sum_{i=1}^N l_i(x_0), \qquad x_0 = b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0)\,\mathbf{x} = \sum_{i=1}^N l_i(x_0)\,x_i. \quad (1)$$

Hence

$$\sum_{i=1}^N (x_i - x_0)\,l_i(x_0) = \sum_{i=1}^N l_i(x_0)\,x_i - x_0 \sum_{i=1}^N l_i(x_0) = x_0 - x_0 \cdot 1 = 0. \qquad \square$$
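
As a quick numerical sanity check of (1) and of the identity just shown, here is a minimal sketch. It builds the equivalent-kernel weights $l_i(x_0)$ for local linear regression and verifies the two sums; the Gaussian kernel, the bandwidth `lam`, the synthetic data and the point `x0` are illustrative assumptions, not part of the text.

```python
# Minimal check of (1) for local linear regression, assuming a Gaussian
# kernel with bandwidth lam and made-up data.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=50))   # N sample points
x0 = 0.37                                     # point of interest
lam = 0.2                                     # kernel bandwidth

# Kernel weights W(x0) and regression matrix B = [1, x]
w = np.exp(-0.5 * ((x - x0) / lam) ** 2)
B = np.column_stack([np.ones_like(x), x])
b0 = np.array([1.0, x0])                      # b(x0)^T = (1, x0)

# l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W, the equivalent-kernel weights
l = b0 @ np.linalg.solve(B.T @ (w[:, None] * B), B.T * w)

print(np.sum(l))              # ~1, i.e. sum_i l_i(x0) = 1
print(np.sum((x - x0) * l))   # ~0, i.e. sum_i (x_i - x0) l_i(x0) = 0
```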

From (1), we have

$$b_0(x_0) = \sum_{i=1}^N (x_i - x_0)^0\, l_i(x_0) = \sum_{i=1}^N l_i(x_0) = 1. \qquad \square$$

When $j \in \{1, 2, \ldots, k\}$, we have the vector-valued function $b(x)^T = (1, x, x^2, \ldots, x^k)$ and $B = [\mathbf{1}, \mathbf{x}, \mathbf{x}^2, \ldots, \mathbf{x}^k]$. From (1), we similarly have

$$x_0^j = \sum_{i=1}^N l_i(x_0)\,x_i^j. \quad (2)$$

Expanding $(x_i - x_0)^j$ without combining similar terms, each term can be written as $(-1)^b x_i^a x_0^b$, where $a + b = j$. Obviously, the number of positive terms equals the number of negative terms, i.e. $\sum_{n=1}^{2^j} (-1)^{b_n} = 0$. By (2), each such term contributes to $b_j(x_0)$ as

$$\sum_{i=1}^N (-1)^b x_i^a x_0^b\, l_i(x_0) = (-1)^b x_0^b \sum_{i=1}^N l_i(x_0)\,x_i^a = (-1)^b x_0^b x_0^a = (-1)^b x_0^j.$$

Hence

$$b_j(x_0) = \sum_{i=1}^N (x_i - x_0)^j l_i(x_0) = \sum_{n=1}^{2^j} (-1)^{b_n} x_0^j = x_0^j \sum_{n=1}^{2^j} (-1)^{b_n} = 0. \qquad \square$$
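
The same check extends numerically to a degree-$k$ fit. The sketch below evaluates $b_j(x_0)$ for $j = 0, \ldots, k$ and should print values close to $1, 0, \ldots, 0$; the Gaussian kernel, bandwidth, data and the helper `local_poly_weights` are illustrative assumptions.

```python
# Numerical check of b_0(x0) = 1 and b_j(x0) = 0, j = 1..k, for a
# degree-k local polynomial fit (assumed Gaussian kernel, synthetic data).
import numpy as np

def local_poly_weights(x, x0, k, lam):
    """Equivalent-kernel weights l_i(x0) for degree-k local polynomial regression."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)      # W(x0), Gaussian kernel
    B = np.vander(x, N=k + 1, increasing=True)    # rows (1, x_i, ..., x_i^k)
    b0 = x0 ** np.arange(k + 1)                   # b(x0)^T = (1, x0, ..., x0^k)
    return b0 @ np.linalg.solve(B.T @ (w[:, None] * B), B.T * w)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, size=60))
x0, lam, k = 0.42, 0.25, 3

l = local_poly_weights(x, x0, k, lam)
for j in range(k + 1):
    print(j, np.sum((x - x0) ** j * l))   # b_0 ~ 1, and b_1, ..., b_k ~ 0
```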

Hence we have the bias

$$E\hat{f}(x_0) - f(x_0) = \sum_{i=1}^N l_i(x_0)\,f(x_i) - f(x_0)$$
$$= \Big( f(x_0) \sum_{i=1}^N l_i(x_0) - f(x_0) \Big) + f'(x_0) \sum_{i=1}^N (x_i - x_0)\,l_i(x_0) + \frac{f''(x_0)}{2!} \sum_{i=1}^N (x_i - x_0)^2 l_i(x_0) + \cdots + \frac{f^{(k+1)}(x_0)}{(k+1)!} \sum_{i=1}^N (x_i - x_0)^{k+1} l_i(x_0) + \cdots$$
$$= \frac{f^{(k+1)}(x_0)}{(k+1)!} \sum_{i=1}^N (x_i - x_0)^{k+1} l_i(x_0) + \cdots$$

where the coefficients $1/n!$ come from the Taylor series expansion of $f$ about $x_0$, and all terms up to degree $k$ vanish because $b_0(x_0) = 1$ and $b_j(x_0) = 0$ for $j = 1, \ldots, k$.

Now we see that the bias depends only on terms of degree $k + 1$ and higher in the expansion of $f$. $\square$

Ex. 6.3 Show that ||l(x)|| (Section 6.1.2) increases with the degree of the local polynomial.

Proof Preliminary: $B$ is an $N \times (d+1)$ regression matrix, so $B$ itself is not invertible, while $B B^T$ is invertible.

From

$$\hat{f}(x_j) = b(x_j)^T (B^T W(x_j) B)^{-1} B^T W(x_j)\,y = \sum_{i=1}^N l_i(x_j)\,y_i = l(x_j)^T y$$

we have

$$l(x_j)^T = b(x_j)^T (B^T W(x_j) B)^{-1} B^T W(x_j)$$
$$l(x_j) = W(x_j) B (B^T W(x_j) B)^{-1} b(x_j)$$

and therefore

$$\|l(x_j)\|^2 = l(x_j)^T l(x_j) = \big[ b(x_j)^T (B^T W(x_j) B)^{-1} B^T W(x_j) \big]\big[ W(x_j) B (B^T W(x_j) B)^{-1} b(x_j) \big]$$
$$= b(x_j)^T (B^T W(x_j) B)^{-1} B^T W(x_j)\,\big[ B B^T (B B^T)^{-1} (B B^T)^{-1} B B^T \big]\, W(x_j) B (B^T W(x_j) B)^{-1} b(x_j)$$
$$= b(x_j)^T B^T (B B^T)^{-1} (B B^T)^{-1} B\, b(x_j)$$

so that

$$\|l(x)\|^2 = \sum_{j=1}^{d+1} \|l(x_j)\|^2 = \sum_{j=1}^{d+1} b(x_j)^T B^T (B B^T)^{-1} (B B^T)^{-1} B\, b(x_j)$$
$$= \operatorname{trace}\big( B B^T (B B^T)^{-1} (B B^T)^{-1} B B^T \big) \quad \text{// Prof. Zhang has proved it}$$
$$= \operatorname{trace}(I_{d+1}) = d + 1$$

Hence $\|l(x)\|^2 = d + 1$, i.e. $\|l(x)\| = \sqrt{d + 1}$, which increases with the degree $d$ of the local polynomial. $\square$
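
The exercise's claim can also be checked pointwise: the sketch below computes $\|l(x_0)\|$ at a fixed target point for local polynomial fits of increasing degree. The Gaussian kernel, bandwidth, data and the helper `local_poly_weights` are illustrative assumptions.

```python
# ||l(x0)|| for local polynomial fits of degree 0..3 at a fixed x0,
# with an assumed Gaussian kernel and synthetic data.
import numpy as np

def local_poly_weights(x, x0, k, lam):
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)
    B = np.vander(x, N=k + 1, increasing=True)
    b0 = x0 ** np.arange(k + 1)
    return b0 @ np.linalg.solve(B.T @ (w[:, None] * B), B.T * w)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 1.0, size=80))
x0, lam = 0.5, 0.3

for d in range(4):                       # local constant, linear, quadratic, cubic
    l = local_poly_weights(x, x0, d, lam)
    print(d, np.linalg.norm(l))          # ||l(x0)|| grows with d, as the exercise asserts
```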

Ex. 6.4 Suppose that the p predictors X arise from sampling relatively smooth analog curves
at p uniformly spaced abscissa values. Denote by Cov(X|Y ) = Σ the conditional covariance matrix
of the predictors, and assume this does not change much with Y . Discuss the nature of Mahalanobis
choice A = Σ−1 for the metric in (6.14). How does this compare with A = I? How might you
construct a kernel A that (a) downweights high-frequency components in the distance metric; (b)
ignores them completely?

Answer $D = \sqrt{(x - x_0)^T \Sigma^{-1} (x - x_0)}$ is called the Mahalanobis distance of the point $x$ to $x_0$. It takes the correlations of the data set into account: if the predictors are highly correlated, the Mahalanobis distance is much more accurate than the Euclidean distance.
If $A = I$, then $d = \sqrt{(x - x_0)^T (x - x_0)}$ equals the Euclidean distance of the point $x$ to $x_0$. Prior to smoothing, we should standardize each predictor, for example

$$x_i' = \frac{x_i - E(x_i)}{\sqrt{\operatorname{Var}(x_i)}}$$

When comparing this with $\Sigma^{-1}$ and using the standardized predictors, we have

$$\operatorname{Cov}(x_i', x_j') = E[(x_i' - E(x_i'))(x_j' - E(x_j'))] = E(x_i' x_j') = E\left[ \left( \frac{x_i - E(x_i)}{\sqrt{\operatorname{Var}(x_i)}} \right)\left( \frac{x_j - E(x_j)}{\sqrt{\operatorname{Var}(x_j)}} \right) \right]$$
$$= \frac{E[(x_i - E(x_i))(x_j - E(x_j))]}{\sqrt{\operatorname{Var}(x_i)}\,\sqrt{\operatorname{Var}(x_j)}} = \frac{\operatorname{Cov}(x_i, x_j)}{\sqrt{\operatorname{Var}(x_i)}\,\sqrt{\operatorname{Var}(x_j)}} = \rho(x_i, x_j)$$

Then the covariance matrix $\Sigma$ will change to its standardized version (the correlation matrix). Hence $A = I$ means $\rho(x_i, x_j) = 0$ for all $i \neq j$, i.e., the dimensions of $x$ are mutually uncorrelated.
(a) If we want to construct a kernel $A$ that downweights high-frequency components (the $x_i$'s) in the distance metric, we can simply decrease the corresponding $\operatorname{Cov}(x_i, x_j)$ or $\rho(x_i, x_j)$ entries in order to suppress their influence.
(b) If we want to construct a kernel $A$ that ignores them completely, we can simply set the corresponding $\operatorname{Cov}(x_i, x_j)$ or $\rho(x_i, x_j)$ entries to 0.
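
A small numerical sketch of the comparison above: it computes the Euclidean ($A = I$) and Mahalanobis ($A = \Sigma^{-1}$) distances for correlated predictors, and confirms that standardizing the predictors turns $\Sigma$ into the correlation matrix. The covariance matrix and sample size below are illustrative assumptions.

```python
# Euclidean vs. Mahalanobis distance for correlated predictors, plus a check
# that standardized predictors have the correlation matrix as covariance.
import numpy as np

rng = np.random.default_rng(4)
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])                     # highly correlated predictors
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=2000)

x, x0 = X[0], np.zeros(2)
A = np.linalg.inv(np.cov(X, rowvar=False))         # A = Sigma^{-1} (estimated)

d_euclid = np.sqrt((x - x0) @ (x - x0))            # A = I
d_mahal = np.sqrt((x - x0) @ A @ (x - x0))         # Mahalanobis distance
print(d_euclid, d_mahal)

# Standardized predictors: their covariance is the correlation matrix rho(x_i, x_j)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.cov(Z, rowvar=False))                     # off-diagonal ~ 0.9 = rho
```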
