Completed AI/ML section and started on Fundamentals Section by kvdesai · Pull Request #70 · plotly/plotly.r-docs
Merged: 38 commits, Aug 17, 2021
da8f391
ml-knn page with make_moons.csv data
Jul 24, 2021
858b82a
Fixing text
Jul 26, 2021
c81963c
Adding ML ROC & PR page draft
Jul 27, 2021
5f3f90b
Merge branch 'rpy-parity-dev' of https://github.com/plotly/plotly.r-d…
Jul 28, 2021
739b1c1
Adding R page for PCA
Jul 28, 2021
eb138bc
ML page for t-SNE and UMAP
Jul 29, 2021
ea5d15b
Adding fundamentals/multiple-chart-types page draft
Aug 1, 2021
e41bacd
Adding page for Styling in Plotly with R
Aug 3, 2021
39bc7d8
committing horiz-vert shapes and figure label pages, without Dash
Aug 4, 2021
4248f8e
Adding dependencies=TRUE for Anglr
Aug 5, 2021
dfd7011
Merge branch 'rpy-parity' of https://github.com/plotly/plotly.r-docs …
Aug 5, 2021
8c4a8ed
Added Dash code
Aug 5, 2021
d9c481b
Merge pull request #69 from plotly/rpy-parity-dev
kvdesai Aug 5, 2021
5d68cc0
Build fix with Anglr issue
Aug 5, 2021
477cced
Build fix2 with Anglr issue
Aug 5, 2021
3dd73c3
Build fix3 with Anglr issue
Aug 5, 2021
5d65e56
Build fix4 for Anglr
kvdesai Aug 5, 2021
7bd525c
Added the Dash code for Figure-Labels page
Aug 6, 2021
cfca991
Merge branch 'rpy-parity' of https://github.com/plotly/plotly.r-docs …
Aug 6, 2021
c5d254c
Fixing the broken data URL
kvdesai Aug 6, 2021
90f139a
Update r/2021-07-27-ml-pca.rmd
kvdesai Aug 16, 2021
d9d083c
Update r/2021-07-27-ml-pca.rmd
kvdesai Aug 16, 2021
2026b0d
Update r/2021-07-28-ml-tsne-umap.rmd
kvdesai Aug 16, 2021
c4f782b
Update r/2021-07-28-ml-tsne-umap.rmd
kvdesai Aug 16, 2021
9477afc
Update r/2021-08-02-styling-plotly-in-r.rmd
kvdesai Aug 16, 2021
697b63d
Update r/2021-08-04-figure-labels.rmd
kvdesai Aug 16, 2021
24d5d26
Merge branch 'master' of https://github.com/plotly/plotly.r-docs into…
Aug 17, 2021
567ea3b
Merge branch 'rpy-parity' of https://github.com/plotly/plotly.r-docs …
Aug 17, 2021
75082eb
Improving aesthetics for plot background and grid lines
Aug 17, 2021
ff0008b
Adding front-matter tags and cleanup
HammadTheOne Aug 17, 2021
dc9ebd6
Renaming Rmd files
HammadTheOne Aug 17, 2021
1d91867
Adding explicit tsne install
HammadTheOne Aug 17, 2021
de2521a
Added explicit umap install
HammadTheOne Aug 17, 2021
0169502
Explicitly install rsvd
HammadTheOne Aug 17, 2021
e824a2d
Explicitly install dash
HammadTheOne Aug 17, 2021
5132343
Fixing dependencies
HammadTheOne Aug 17, 2021
8cf5d02
Fixing order and skipping Dash chunk eval
HammadTheOne Aug 17, 2021
f3175c6
Deleting old figure-labels page
HammadTheOne Aug 17, 2021
Improving aesthetics for plot background and grid lines
Kalpit Desai committed Aug 17, 2021
commit 75082eb7b63f0df483a3322ed62f6978f56709e5
394 changes: 202 additions & 192 deletions r/2021-07-28-ml-tsne-umap.rmd
@@ -1,192 +1,202 @@
## t-SNE and UMAP projections in R


This page presents various ways to visualize two popular dimensionality reduction techniques, namely the [t-distributed stochastic neighbor embedding](https://lvdmaaten.github.io/tsne/) (t-SNE) and [Uniform Manifold Approximation and Projection](https://umap-learn.readthedocs.io/en/latest/index.html) (UMAP). They are needed whenever you want to visualize data with more than two or three features (i.e. dimensions).

We first show how to visualize data with more than three features using the [scatter plot matrix](https://plotly.com/r/splom/#:~:text=The%20Plotly%20splom%20trace%20implementation,array%2Fvariable%20represents%20a%20dimension). We then apply dimensionality reduction techniques to get a 2D or 3D representation of our data, and visualize the results with [scatter plots](https://plotly.com/r/line-and-scatter/) and [3D scatter plots](https://plotly.com/r/3d-scatter-plots/).


## Basic t-SNE projections

t-SNE is a popular dimensionality reduction algorithm rooted in probability theory. Simply put, it projects high-dimensional data points (sometimes with hundreds of features) into 2D or 3D by arranging the projected points so that their pairwise-similarity distribution resembles that of the original points. It does this by minimizing the [KL divergence](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-kl-divergence-2b382ca2b2a8) between the two distributions.

Compared to a method like Principal Component Analysis (PCA), it takes significantly more time to converge, but it often presents clearer insights when visualized. For example, when projecting the features of flowers, it is able to distinctly group the different species.



### Visualizing high-dimensional data with `splom`

First, let's visualize every feature of the [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris), coloring each point by species. We will use the scatter plot matrix ([splom](https://plotly.com/r/splom/#:~:text=The%20Plotly%20splom%20trace%20implementation,array%2Fvariable%20represents%20a%20dimension)), which plots each feature against every other feature. This is convenient when your dataset has more than 3 dimensions.

```{r}
library(plotly)
library(stats)
data(iris)
X <- subset(iris, select = -c(Species))
axis = list(showline=FALSE,
zeroline=FALSE,
gridcolor='#ffff',
ticklen=4)
fig <- iris %>%
plot_ly() %>%
add_trace(
type = 'splom',
dimensions = list(
list(label = 'sepal_width',values=~Sepal.Width),
list(label = 'sepal_length',values=~Sepal.Length),
list(label ='petal_width',values=~Petal.Width),
list(label = 'petal_length',values=~Petal.Length)),
color = ~Species, colors = c('#636EFA','#EF553B','#00CC96')
)
fig <- fig %>%
layout(
legend=list(title=list(text='species')),
hovermode='closest',
dragmode= 'select',
plot_bgcolor='rgba(240,240,240,0.95)',
xaxis=list(domain=NULL, showline=F, zeroline=F, gridcolor='#ffff', ticklen=4),
yaxis=list(domain=NULL, showline=F, zeroline=F, gridcolor='#ffff', ticklen=4),
xaxis2=axis,
xaxis3=axis,
xaxis4=axis,
yaxis2=axis,
yaxis3=axis,
yaxis4=axis
)
fig

```

### Project data into 2D with t-SNE and `plot_ly`

Now, let's use the t-SNE algorithm to project the data shown above into two dimensions. Notice how the points of each species cluster apart from those of the other species.

```{r}
library(tsne)
library(plotly)
data("iris")

features <- subset(iris, select = -c(Species))

set.seed(0)
tsne <- tsne(features, initial_dims = 2)
tsne <- data.frame(tsne)
pdb <- cbind(tsne,iris$Species)
options(warn = -1)
fig <- plot_ly(data = pdb ,x = ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~iris$Species)
fig

```

### Project data into 3D with t-SNE and `plot_ly`

t-SNE can reduce your data to any number of dimensions you want! Here, we show you how to project it to 3D and visualize with a 3D scatter plot.

```{r}
library(tsne)
library(plotly)
data("iris")

features <- subset(iris, select = -c(Species))

#set.seed(0)
tsne <- tsne(features, initial_dims = 3, k =3)
tsne <- data.frame(tsne)
pdb <- cbind(tsne,iris$Species)
options(warn = -1)
fig <- plot_ly(data = pdb ,x = ~X1, y = ~X2, z = ~X3, color = ~iris$Species, colors = c('#636EFA','#EF553B','#00CC96') ) %>%
add_markers(size = 8)
fig

```

## Projections with UMAP

Just like t-SNE, [UMAP](https://umap-learn.readthedocs.io/en/latest/index.html) is a dimensionality reduction technique specifically designed for visualizing complex data in low dimensions (2D or 3D). As the number of data points increases, UMAP becomes more time-efficient than t-SNE.

In the example below, we see how easy it is to use UMAP in R.

```{r}

library(plotly)
library(umap)
iris.data = iris[, grep("Sepal|Petal", colnames(iris))]
iris.labels = iris[, "Species"]
iris.umap = umap(iris.data, n_components = 2, random_state = 15)
layout <- iris.umap[["layout"]]
layout <- data.frame(layout)
final <- cbind(layout, iris$Species)

fig <- plot_ly(final, x = ~X1, y = ~X2, color = ~iris$Species, colors = c('#636EFA','#EF553B','#00CC96'), type = 'scatter', mode = 'markers')%>%
layout(
legend=list(title=list(text='species')),
xaxis = list(
title = "0"),
yaxis = list(
title = "1"))

iris.umap = umap(iris.data, n_components = 3, random_state = 15)
layout <- iris.umap[["layout"]]
layout <- data.frame(layout)
final <- cbind(layout, iris$Species)

fig2 <- plot_ly(final, x = ~X1, y = ~X2, z = ~X3, color = ~iris$Species, colors = c('#636EFA','#EF553B','#00CC96'))
fig2 <- fig2 %>% add_markers()
fig2 <- fig2 %>% layout(scene = list(xaxis = list(title = '0'),
yaxis = list(title = '1'),
zaxis = list(title = '2')))

fig
fig2
```

## Visualizing image datasets

In the following example, we show how to visualize large image datasets using UMAP.

Although there are over 1000 data points, and many more dimensions than in the previous example, it is still extremely fast. This is because UMAP is optimized for speed, both from a theoretical perspective and in the way it is implemented. Learn more in [this comparison post](https://umap-learn.readthedocs.io/en/latest/benchmarking.html).

```{r}
library(rsvd)
library(plotly)
library(umap)
data('digits')
digits.data = digits[, grep("pixel", colnames(digits))]
digits.labels = digits[, "label"]
digits.umap = umap(digits.data, n_components = 2, k = 10)
layout <- digits.umap[["layout"]]
layout <- data.frame(layout)
final <- cbind(layout, digits[,'label'])
colnames(final) <- c('X1', 'X2', 'label')

fig <- plot_ly(final, x = ~X1, y = ~X2, split = ~label, type = 'scatter', mode = 'markers')%>%
layout(
legend=list(title=list(text='digit')),
xaxis = list(
title = "0"),
yaxis = list(
title = "1"))
fig

```

<!-- #region -->
## Reference

Plotly figures:
* https://plotly.com/r/line-and-scatter/

* https://plotly.com/r/3d-scatter-plots/

* https://plotly.com/r/splom/


Details about algorithms:
* UMAP library: https://umap-learn.readthedocs.io/en/latest/

* t-SNE User guide: https://cran.r-project.org/web/packages/tsne/tsne.pdf

* t-SNE paper: https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

* MNIST: http://yann.lecun.com/exdb/mnist/
<!-- #endregion -->
## t-SNE and UMAP projections in R


This page presents various ways to visualize two popular dimensionality reduction techniques, namely the [t-distributed stochastic neighbor embedding](https://lvdmaaten.github.io/tsne/) (t-SNE) and [Uniform Manifold Approximation and Projection](https://umap-learn.readthedocs.io/en/latest/index.html) (UMAP). They are needed whenever you want to visualize data with more than two or three features (i.e. dimensions).

We first show how to visualize data with more than three features using the [scatter plot matrix](https://plotly.com/r/splom/#:~:text=The%20Plotly%20splom%20trace%20implementation,array%2Fvariable%20represents%20a%20dimension). We then apply dimensionality reduction techniques to get a 2D or 3D representation of our data, and visualize the results with [scatter plots](https://plotly.com/r/line-and-scatter/) and [3D scatter plots](https://plotly.com/r/3d-scatter-plots/).


## Basic t-SNE projections

t-SNE is a popular dimensionality reduction algorithm rooted in probability theory. Simply put, it projects high-dimensional data points (sometimes with hundreds of features) into 2D or 3D by arranging the projected points so that their pairwise-similarity distribution resembles that of the original points. It does this by minimizing the [KL divergence](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-kl-divergence-2b382ca2b2a8) between the two distributions.
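
For reference, the cost function minimized by t-SNE, as given in the t-SNE paper listed in the Reference section, can be written as below. Here $p_{ij}$ denotes the pairwise similarity of points $i$ and $j$ in the original space, $q_{ij}$ the corresponding similarity in the low-dimensional map, and $y_i$ the position of point $i$ in that map. This is a summary of the published formulation, not a description of how the R `tsne` package implements it internally.

$$
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
$$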

Compared to a method like Principal Component Analysis (PCA), it takes significantly more time to converge, but it often presents clearer insights when visualized. For example, when projecting the features of flowers, it is able to distinctly group the different species.
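
To make the comparison with PCA concrete, here is a minimal sketch of a PCA baseline on the same Iris measurements, using only base R (`prcomp`). The variable names `scores` and `explained` are illustrative and do not appear elsewhere on this page.

```r
# Baseline: linear projection of the four iris measurements with PCA (base R only)
data(iris)
features <- subset(iris, select = -c(Species))

# center and scale so that each feature contributes comparably
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# scores on the first two principal components, with the species label attached
scores <- data.frame(pca$x[, 1:2], Species = iris$Species)

# share of the total variance captured by each component
explained <- pca$sdev^2 / sum(pca$sdev^2)
round(explained, 3)
```

The `scores` data frame can be plotted in the same way as the t-SNE output on this page, for example with `plot_ly(scores, x = ~PC1, y = ~PC2, color = ~Species, type = 'scatter', mode = 'markers')`.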



### Visualizing high-dimensional data with `splom`

First, let's visualize every feature of the [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris), coloring each point by species. We will use the scatter plot matrix ([splom](https://plotly.com/r/splom/#:~:text=The%20Plotly%20splom%20trace%20implementation,array%2Fvariable%20represents%20a%20dimension)), which plots each feature against every other feature. This is convenient when your dataset has more than 3 dimensions.

```{r}
library(plotly)
data(iris)

# shared styling for the secondary axes of the scatter plot matrix
axis <- list(showline = FALSE,
             zeroline = FALSE,
             gridcolor = '#ffff',
             ticklen = 4)

# one splom dimension per numeric feature, colored by species
fig <- iris %>%
  plot_ly() %>%
  add_trace(
    type = 'splom',
    dimensions = list(
      list(label = 'sepal_width', values = ~Sepal.Width),
      list(label = 'sepal_length', values = ~Sepal.Length),
      list(label = 'petal_width', values = ~Petal.Width),
      list(label = 'petal_length', values = ~Petal.Length)),
    color = ~Species, colors = c('#636EFA','#EF553B','#00CC96')
  )
fig <- fig %>%
  layout(
    legend = list(title = list(text = 'species')),
    hovermode = 'closest',
    dragmode = 'select',
    plot_bgcolor = 'rgba(240,240,240,0.95)',
    xaxis = list(domain = NULL, showline = FALSE, zeroline = FALSE, gridcolor = '#ffff', ticklen = 4),
    yaxis = list(domain = NULL, showline = FALSE, zeroline = FALSE, gridcolor = '#ffff', ticklen = 4),
    xaxis2 = axis,
    xaxis3 = axis,
    xaxis4 = axis,
    yaxis2 = axis,
    yaxis3 = axis,
    yaxis4 = axis
  )
fig

```

### Project data into 2D with t-SNE and `plot_ly`

Now, let's use the t-SNE algorithm to project the data shown above into two dimensions. Notice how the points of each species cluster apart from those of the other species.

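```{r}
library(tsne)
library(plotly)
data("iris")

# keep only the four numeric measurements
features <- subset(iris, select = -c(Species))

set.seed(0)  # t-SNE starts from a random configuration; fix the seed for reproducibility
# store the result under a new name so the tsne() function is not shadowed
tsne_out <- tsne(features, initial_dims = 2)
tsne_df <- data.frame(tsne_out)  # columns X1 and X2
pdb <- cbind(tsne_df, Species = iris$Species)
options(warn = -1)  # suppress warnings in the rendered page
fig <- plot_ly(data = pdb, x = ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~Species)
fig

```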

### Project data into 3D with t-SNE and `plot_ly`

t-SNE can reduce your data to any number of dimensions you want! Here, we show you how to project it to 3D and visualize with a 3D scatter plot.

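```{r}
library(tsne)
library(plotly)
data("iris")

# keep only the four numeric measurements
features <- subset(iris, select = -c(Species))

#set.seed(0)
# k = 3 requests a three-dimensional embedding
tsne_out <- tsne(features, initial_dims = 3, k = 3)
tsne_df <- data.frame(tsne_out)  # columns X1, X2 and X3
pdb <- cbind(tsne_df, Species = iris$Species)
options(warn = -1)  # suppress warnings in the rendered page

# background and grid styling shared by the three axes;
# in a 3D plot the axes are configured under `scene`, not under the top-level layout
axis3d <- list(
  showbackground = TRUE,
  backgroundcolor = '#e5ecf6',
  zerolinecolor = '#ffffff',
  zerolinewidth = 2,
  gridcolor = '#ffffff')

fig <- plot_ly(data = pdb, x = ~X1, y = ~X2, z = ~X3, color = ~Species, colors = c('#636EFA','#EF553B','#00CC96')) %>%
  add_markers(size = 8) %>%
  layout(scene = list(xaxis = axis3d, yaxis = axis3d, zaxis = axis3d))
fig

```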

## Projections with UMAP

Just like t-SNE, [UMAP](https://umap-learn.readthedocs.io/en/latest/index.html) is a dimensionality reduction technique specifically designed for visualizing complex data in low dimensions (2D or 3D). As the number of data points increases, UMAP becomes more time-efficient than t-SNE.

In the example below, we see how easy it is to use UMAP in R.

```{r}

library(plotly)
library(umap)
iris.data <- iris[, grep("Sepal|Petal", colnames(iris))]
iris.labels <- iris[, "Species"]

# 2D embedding
iris.umap <- umap(iris.data, n_components = 2, random_state = 15)
embedding <- data.frame(iris.umap[["layout"]])  # columns X1 and X2
final <- cbind(embedding, Species = iris.labels)

fig <- plot_ly(final, x = ~X1, y = ~X2, color = ~Species, colors = c('#636EFA','#EF553B','#00CC96'), type = 'scatter', mode = 'markers') %>%
  layout(
    legend = list(title = list(text = 'species')),
    xaxis = list(title = "0"),
    yaxis = list(title = "1"))

# 3D embedding
iris.umap <- umap(iris.data, n_components = 3, random_state = 15)
embedding <- data.frame(iris.umap[["layout"]])  # columns X1, X2 and X3
final <- cbind(embedding, Species = iris.labels)

fig2 <- plot_ly(final, x = ~X1, y = ~X2, z = ~X3, color = ~Species, colors = c('#636EFA','#EF553B','#00CC96'))
fig2 <- fig2 %>% add_markers()
fig2 <- fig2 %>% layout(scene = list(xaxis = list(title = '0'),
                                     yaxis = list(title = '1'),
                                     zaxis = list(title = '2')))

fig
fig2
```
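
The calls above pass settings such as `n_components` and `random_state` directly to `umap()`. The `umap` package also exposes its default settings as the `umap.defaults` object, which can be copied, modified, and passed through the `config` argument. The sketch below assumes that approach; the values chosen for `n_neighbors` and `min_dist` are purely illustrative, not recommendations.

```r
library(umap)

iris.data <- iris[, grep("Sepal|Petal", colnames(iris))]

# start from the package defaults and override selected settings
custom.config <- umap.defaults
custom.config$n_neighbors <- 10    # illustrative value
custom.config$min_dist <- 0.3      # illustrative value
custom.config$random_state <- 15   # same seed as in the example above

iris.umap <- umap(iris.data, config = custom.config)
dim(iris.umap$layout)
```

Keeping the settings in one configuration object makes it easier to reuse the same parameters across several data sets.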

## Visualizing image datasets

In the following example, we show how to visualize large image datasets using UMAP.

Although there are over 1000 data points, and many more dimensions than in the previous example, it is still extremely fast. This is because UMAP is optimized for speed, both from a theoretical perspective and in the way it is implemented. Learn more in [this comparison post](https://umap-learn.readthedocs.io/en/latest/benchmarking.html).

```{r}
library(rsvd)   # provides the digits data set
library(plotly)
library(umap)
data('digits')
digits.data <- digits[, grep("pixel", colnames(digits))]
digits.labels <- digits[, "label"]
digits.umap <- umap(digits.data, n_components = 2, k = 10)
embedding <- data.frame(digits.umap[["layout"]])
final <- cbind(embedding, digits.labels)
colnames(final) <- c('X1', 'X2', 'label')

fig <- plot_ly(final, x = ~X1, y = ~X2, split = ~label, type = 'scatter', mode = 'markers') %>%
  layout(
    legend = list(title = list(text = 'digit')),
    xaxis = list(title = "0"),
    yaxis = list(title = "1"))
fig

```
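
To check the speed claim on your own machine, one option is to wrap the UMAP call in base R's `system.time()`. This is a rough sketch that reuses the `digits` data from the example above; the measured time depends on hardware and package versions, so no particular value should be expected.

```r
library(rsvd)   # provides the digits data set used above
library(umap)

data('digits')
digits.data <- digits[, grep("pixel", colnames(digits))]

# wall-clock time of a single UMAP run
timing <- system.time(
  digits.umap <- umap(digits.data, n_components = 2)
)
timing[["elapsed"]]
dim(digits.umap$layout)
```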

<!-- #region -->
## Reference

Plotly figures:
* https://plotly.com/r/line-and-scatter/

* https://plotly.com/r/3d-scatter-plots/

* https://plotly.com/r/splom/


Details about algorithms:
* UMAP library: https://umap-learn.readthedocs.io/en/latest/

* t-SNE User guide: https://cran.r-project.org/web/packages/tsne/tsne.pdf

* t-SNE paper: https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

* MNIST: http://yann.lecun.com/exdb/mnist/
<!-- #endregion -->