## PCA Visualization in R

Visualize Principal Component Analysis (PCA) of your high-dimensional data in R with Plotly.

This page first shows how to visualize high-dimensional data using various Plotly figures combined with dimensionality reduction (aka projection). Then, we dive into the specific details of our projection algorithm.

We will use [Tidymodels](https://www.tidymodels.org/) or [Caret](https://cran.r-project.org/web/packages/caret/vignettes/caret.html#) to load one of the datasets and apply dimensionality reduction. Tidymodels is a popular Machine Learning (ML) library that offers various tools for creating and training ML algorithms, feature engineering, data cleaning, and evaluating and testing models.


## High-dimensional PCA Analysis with `splom`

The dimensionality reduction technique we will be using is called [Principal Component Analysis (PCA)](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/prcomp). It is a powerful technique that arises from linear algebra and probability theory. In essence, it computes a matrix that represents the variation of your data ([covariance matrix/eigenvectors][covmatrix]) and ranks the resulting directions by their relevance (explained variance/eigenvalues).

[covmatrix]: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues#:~:text=As%20it%20is%20a%20square%20symmetric%20matrix%2C%20it%20can%20be%20diagonalized%20by%20choosing%20a%20new%20orthogonal%20coordinate%20system%2C%20given%20by%20its%20eigenvectors%20(incidentally%2C%20this%20is%20called%20spectral%20theorem)%3B%20corresponding%20eigenvalues%20will%20then%20be%20located%20on%20the%20diagonal.%20In%20this%20new%20coordinate%20system%2C%20the%20covariance%20matrix%20is%20diagonal%20and%20looks%20like%20that%3A
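
To make the connection concrete, here is a minimal sketch (using the same Iris data as the examples below) showing that the variances reported by `prcomp()` are exactly the eigenvalues of the covariance matrix:

```{r}
# Minimal sketch: PCA and the eigendecomposition of the covariance matrix.
library(stats)

data(iris)
X <- subset(iris, select = -c(Species))

pca <- prcomp(X)        # principal components (computed via SVD internally)
eig <- eigen(cov(X))    # eigendecomposition of the sample covariance matrix

# The squared standard deviations from prcomp() are the eigenvalues,
# i.e. the variance explained by each principal component.
all.equal(pca$sdev^2, eig$values)
```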

### Visualize all the original dimensions

First, let's plot all the features and see how the `species` in the Iris dataset are grouped. In a [Scatter Plot Matrix (splom)](https://plot.ly/r/splom/), each subplot displays a feature against another, so if we have $N$ features we have an $N \times N$ matrix.

In our example, we are plotting all 4 features from the Iris dataset, so we can see how `sepal_width` compares against `sepal_length`, then against `petal_width`, and so forth. Keep in mind that some pairs of features separate the species more cleanly than others.

```{r}
library(plotly)

data(iris)

# shared axis styling reused for every subplot in the matrix
axis = list(showline=FALSE,
            zeroline=FALSE,
            gridcolor='#ffff',
            ticklen=4,
            titlefont=list(size=13))

fig <- iris %>%
  plot_ly()
fig <- fig %>%
  add_trace(
    type = 'splom',
    dimensions = list(
      list(label='sepal length', values=~Sepal.Length),
      list(label='sepal width', values=~Sepal.Width),
      list(label='petal length', values=~Petal.Length),
      list(label='petal width', values=~Petal.Width)
    ),
    color = ~Species, colors = c('#636EFA','#EF553B','#00CC96'),
    marker = list(
      size = 7,
      line = list(
        width = 1,
        color = 'rgb(230,230,230)'
      )
    )
  )
# hide the diagonal (each feature plotted against itself)
fig <- fig %>% style(diagonal = list(visible = FALSE))
fig <- fig %>%
  layout(
    hovermode = 'closest',
    dragmode = 'select',
    plot_bgcolor = 'rgba(240,240,240, 0.95)',
    xaxis = list(domain=NULL, showline=F, zeroline=F, gridcolor='#ffff', ticklen=4),
    yaxis = list(domain=NULL, showline=F, zeroline=F, gridcolor='#ffff', ticklen=4),
    xaxis2 = axis,
    xaxis3 = axis,
    xaxis4 = axis,
    yaxis2 = axis,
    yaxis3 = axis,
    yaxis4 = axis
  )

fig
```

### Visualize all the principal components

Now, we apply PCA to the same dataset and retrieve **all** the components. We use the same `splom` trace to display our results, but this time our features are the resulting *principal components*, ordered by how much variance they are able to explain.

The importance of explained variance is demonstrated in the example below. The subplot between PC3 and PC4 is clearly unable to separate each class, whereas the subplot between PC1 and PC2 shows a clear separation between each species.

```{r}
library(plotly)
library(stats)

data(iris)

X <- subset(iris, select = -c(Species))
prin_comp <- prcomp(X)

# per-component explained variance, as a percentage
explained_variance_ratio <- summary(prin_comp)[["importance"]]['Proportion of Variance',]
explained_variance_ratio <- 100 * explained_variance_ratio

components <- prin_comp[["x"]]
components <- data.frame(components)
components <- cbind(components, iris$Species)
# the sign of a principal component is arbitrary; flip for display
components$PC3 <- -components$PC3
components$PC2 <- -components$PC2

axis = list(showline=FALSE,
            zeroline=FALSE,
            gridcolor='#ffff',
            ticklen=4,
            titlefont=list(size=13))

fig <- components %>%
  plot_ly() %>%
  add_trace(
    type = 'splom',
    dimensions = list(
      list(label=paste('PC 1 (',toString(round(explained_variance_ratio[1],1)),'%)',sep = ''), values=~PC1),
      list(label=paste('PC 2 (',toString(round(explained_variance_ratio[2],1)),'%)',sep = ''), values=~PC2),
      list(label=paste('PC 3 (',toString(round(explained_variance_ratio[3],1)),'%)',sep = ''), values=~PC3),
      list(label=paste('PC 4 (',toString(round(explained_variance_ratio[4],1)),'%)',sep = ''), values=~PC4)
    ),
    color = ~iris$Species, colors = c('#636EFA','#EF553B','#00CC96')
  ) %>%
  style(diagonal = list(visible = FALSE)) %>%
  layout(
    legend = list(title=list(text='color')),
    hovermode = 'closest',
    dragmode = 'select',
    plot_bgcolor = 'rgba(240,240,240, 0.95)',
    xaxis = list(domain=NULL, showline=F, zeroline=F, gridcolor='#ffff', ticklen=4),
    yaxis = list(domain=NULL, showline=F, zeroline=F, gridcolor='#ffff', ticklen=4),
    xaxis2 = axis,
    xaxis3 = axis,
    xaxis4 = axis,
    yaxis2 = axis,
    yaxis3 = axis,
    yaxis4 = axis
  )

fig
```

### Visualize a subset of the principal components

When you have too many features to visualize, you might be interested in only visualizing the most relevant components. Those components often capture a majority of the [explained variance](https://en.wikipedia.org/wiki/Explained_variation), which is a good way to tell whether those components are sufficient for modelling this dataset.

In the example below, our dataset (the Boston housing data from the `MASS` package) contains 14 variables, but we only keep the first 4 principal components, since they explain over 99% of the total variance.
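
If you are not sure how many components you need, a minimal sketch like the one below picks the smallest number of components whose cumulative explained variance reaches a chosen threshold (99% here):

```{r}
# Minimal sketch: smallest number of components reaching a variance threshold.
library(stats)
library(MASS)

pca <- prcomp(Boston)
prop_var <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance per PC
n_components <- which(cumsum(prop_var) >= 0.99)[1]
n_components
```
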
```{r}
library(plotly)
library(stats)
library(MASS)

db = Boston

# keep only the first 4 principal components
prin_comp <- prcomp(db, rank. = 4)

components <- prin_comp[["x"]]
components <- data.frame(components)
components <- cbind(components, db$medv)
components$PC2 <- -components$PC2
colnames(components)[5] = 'Median_Price'

tot_explained_variance_ratio <- summary(prin_comp)[["importance"]]['Proportion of Variance',]
tot_explained_variance_ratio <- 100 * sum(tot_explained_variance_ratio)

tit = 'Total Explained Variance = 99.56'

axis = list(showline=FALSE,
            zeroline=FALSE,
            gridcolor='#ffff',
            ticklen=4)

fig <- components %>%
  plot_ly() %>%
  add_trace(
    type = 'splom',
    dimensions = list(
      list(label='PC1', values=~PC1),
      list(label='PC2', values=~PC2),
      list(label='PC3', values=~PC3),
      list(label='PC4', values=~PC4)
    ),
    # color the points by the median house price (medv)
    color = ~Median_Price,
    marker = list(
      size = 7
    )
  ) %>%
  style(diagonal = list(visible = F)) %>%
  layout(
    title = tit,
    hovermode = 'closest',
    dragmode = 'select',
    plot_bgcolor = 'rgba(240,240,240, 0.95)',
    xaxis = list(domain=NULL, showline=F, zeroline=F, gridcolor='#ffff', ticklen=4),
    yaxis = list(domain=NULL, showline=F, zeroline=F, gridcolor='#ffff', ticklen=4),
    xaxis2 = axis,
    xaxis3 = axis,
    xaxis4 = axis,
    yaxis2 = axis,
    yaxis3 = axis,
    yaxis4 = axis
  )

options(warn = -1)
fig
```

## 2D PCA Scatter Plot

In the previous examples, you saw how to visualize high-dimensional PCs. In this example, we show how to visualize only the first two principal components of a PCA, reducing a dataset of 4 dimensions to 2D.

```{r}
library(plotly)
library(stats)

data(iris)

X <- subset(iris, select = -c(Species))
prin_comp <- prcomp(X, rank. = 2)
components <- prin_comp[["x"]]
components <- data.frame(components)
components <- cbind(components, iris$Species)
components$PC2 <- -components$PC2

fig <- plot_ly(components, x = ~PC1, y = ~PC2, color = ~iris$Species, colors = c('#636EFA','#EF553B','#00CC96'), type = 'scatter', mode = 'markers') %>%
  layout(
    legend = list(title=list(text='color')),
    xaxis = list(title = "PC1"),
    yaxis = list(title = "PC2"))

fig
```

## Visualize PCA with scatter3d

With scatter3d, you can visualize an additional dimension, which lets you capture even more variance.

```{r}
library(plotly)

data("iris")

X <- subset(iris, select = -c(Species))

prin_comp <- prcomp(X, rank. = 3)

components <- prin_comp[["x"]]
components <- data.frame(components)
components$PC2 <- -components$PC2
components$PC3 <- -components$PC3
components = cbind(components, iris$Species)

tot_explained_variance_ratio <- summary(prin_comp)[["importance"]]['Proportion of Variance',]
tot_explained_variance_ratio <- 100 * sum(tot_explained_variance_ratio)

tit = 'Total Explained Variance = 99.48'

fig <- plot_ly(components, x = ~PC1, y = ~PC2, z = ~PC3, color = ~iris$Species, colors = c('#636EFA','#EF553B','#00CC96')) %>%
  add_markers(size = 12)

fig <- fig %>%
  layout(title = tit)

fig
```

## Plotting explained variance

Often, you might be interested in seeing how much variance PCA is able to explain as you increase the number of components, in order to decide how many dimensions to ultimately keep or analyze. This example shows you how to quickly plot the cumulative sum of explained variance for a high-dimensional dataset like [PimaIndiansDiabetes](https://rdrr.io/cran/mlbench/man/PimaIndiansDiabetes.html).

With a higher explained variance, you are able to capture more variability in your dataset, which could potentially lead to better performance when training your model. For a more mathematical explanation, see this [Q&A thread](https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained).

```{r}
library(plotly)
library(stats)
library(mlbench)

data(PimaIndiansDiabetes)

X <- subset(PimaIndiansDiabetes, select = -c(diabetes))
prin_comp <- prcomp(X)
explained_variance_ratio <- summary(prin_comp)[["importance"]]['Proportion of Variance',]

# cumulative share of variance explained by the first k components
cum_var <- cumsum(explained_variance_ratio)
data <- data.frame(cum_var, seq(1, length(cum_var), 1))
colnames(data) <- c('Explained_Variance', 'Components')

fig <- plot_ly(data = data, x = ~Components, y = ~Explained_Variance, type = 'scatter', mode = 'lines', fill = 'tozeroy') %>%
  layout(
    xaxis = list(title = "# Components", tickvals = seq(1, length(cum_var), 1)),
    yaxis = list(title = "Explained Variance"))

fig
```

## Visualize Loadings

It is also possible to visualize loadings using `shapes`, and to use `annotations` to indicate which feature a certain loading originally belongs to. Here, we define loadings as:

$$
\text{loadings} = \text{eigenvectors} \cdot \sqrt{\text{eigenvalues}}
$$

For more details about the linear algebra behind eigenvectors and loadings, see this [Q&A thread](https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another).

```{r}
library(plotly)
library(stats)

data(iris)

X <- subset(iris, select = -c(Species))
prin_comp <- prcomp(X, rank. = 2)
components <- prin_comp[["x"]]
components <- data.frame(components)
components <- cbind(components, iris$Species)
components$PC2 <- -components$PC2

# prcomp() reports sdev, the square roots of the eigenvalues
sdev <- summary(prin_comp)[["sdev"]]
sdev <- sdev[1:2]

comp <- prin_comp[["rotation"]]
comp[,'PC2'] <- -comp[,'PC2']
loadings <- comp
for (i in seq(sdev)){
  # loadings = eigenvectors * sqrt(eigenvalues)
  loadings[,i] <- comp[,i] * sdev[i]
}

features = c('sepal_length', 'sepal_width', 'petal_length', 'petal_width')

fig <- plot_ly(components, x = ~PC1, y = ~PC2, color = ~iris$Species, colors = c('#636EFA','#EF553B','#00CC96'), type = 'scatter', mode = 'markers') %>%
  layout(
    legend = list(title=list(text='color')),
    xaxis = list(title = "PC1"),
    yaxis = list(title = "PC2"))

# draw one segment plus one annotation per original feature
for (i in seq(4)){
  fig <- fig %>%
    add_segments(x = 0, xend = loadings[i, 1], y = 0, yend = loadings[i, 2], line = list(color = 'black'), inherit = FALSE, showlegend = FALSE) %>%
    add_annotations(x = loadings[i, 1], y = loadings[i, 2], ax = 0, ay = 0, text = features[i], xanchor = 'center', yanchor = 'bottom')
}

fig
```
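
The loop above scales each eigenvector (each column of the rotation matrix) by the corresponding standard deviation, i.e. the square root of its eigenvalue. As a minimal sketch, the same loadings can be computed in a single vectorized step with `sweep()`:

```{r}
# Minimal sketch: vectorized equivalent of the loadings loop above.
# prcomp()$rotation holds the eigenvectors and prcomp()$sdev the square
# roots of the eigenvalues, so loadings = eigenvectors * sqrt(eigenvalues).
pca <- prcomp(subset(iris, select = -c(Species)))
loadings_all <- sweep(pca$rotation, 2, pca$sdev, "*")
loadings_all
```
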

## References

Learn more about `scatter3d` and `splom` here:

* https://plot.ly/r/3d-scatter-plots/

* https://plot.ly/r/splom/

The following resources offer an in-depth overview of PCA and explained variance:

* https://en.wikipedia.org/wiki/Explained_variation

* https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579

* https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another

* https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained
