R-Py parity for ml-regression page by kvdesai · Pull Request #63 · plotly/plotly.r-docs · GitHub

R-Py parity for ml-regression page #63


Merged: 20 commits, Jul 26, 2021. Changes shown from 12 commits.
4 changes: 2 additions & 2 deletions .circleci/config.yml
@@ -24,8 +24,8 @@ jobs:
      - run:
          name: install application-level dependencies
          command: |
-           sudo apt-get install -y pandoc libudunits2-dev libgdal-dev libxt-dev libglu1-mesa-dev libfftw3-dev libglpk40
-           sudo R -e 'install.packages(c("curl", "devtools", "mvtnorm", "hexbin")); devtools::install_github("hypertidy/anglr"); devtools::install_github("ropensci/plotly"); devtools::install_github("johannesbjork/LaCroixColoR"); install.packages("BiocManager"); BiocManager::install("EBImage"); devtools::install_deps(dependencies = TRUE) '
+           sudo apt-get install -y pandoc libudunits2-dev libgdal-dev libxt-dev libglu1-mesa-dev libfftw3-dev libglpk40 libxml2-dev libcurl4-openssl-dev apt-transport-https software-properties-common
+           sudo R -e 'install.packages(c("curl", "devtools", "mvtnorm", "hexbin", "tidyverse", "tidymodels", "kknn", "kernlab", "pracma", "reshape2", "ggplot2", "datasets")); devtools::install_github("hypertidy/anglr"); devtools::install_github("ropensci/plotly"); devtools::install_github("johannesbjork/LaCroixColoR"); install.packages("BiocManager"); BiocManager::install("EBImage"); devtools::install_deps(dependencies = TRUE) '
      - save_cache:
          key: cache4
          paths:
312 changes: 312 additions & 0 deletions r/2021-07-08-ml-regression.Rmd
@@ -0,0 +1,312 @@
<!-- #region -->
This page shows how to use Plotly charts to display various types of regression models, starting from simple models like [Linear Regression](https://parsnip.tidymodels.org/reference/linear_reg.html) and progressively moving towards models like Decision Tree and Polynomial Features. We highlight various capabilities of Plotly, such as comparative analysis of the same model with different parameters, displaying LaTeX, and [surface plots](https://plotly.com/r/3d-surface-plots/) for 3D data.

We will use [tidymodels](https://tidymodels.tidymodels.org/) to split and preprocess our data and train various regression models. Tidymodels is a popular Machine Learning (ML) library in R that is compatible with tidyverse concepts, and it offers various tools for creating and training ML algorithms, feature engineering, data cleaning, and evaluating and testing models. It is the next-generation version of the popular [caret](http://topepo.github.io/caret/index.html) library for R.

<!-- #endregion -->

## Basic linear regression plots

In this section, we show you how to apply a simple regression model for predicting the tips a server will receive based on various client attributes (such as sex, time of the week, and whether they are a smoker).

We will be using [Linear Regression][lr], a simple model that fits an intercept (the mean tip received by a server) and adds a slope for each feature we use, such as the value of the total bill.

[lr]: https://parsnip.tidymodels.org/reference/linear_reg.html

### Linear Regression with R

```{r}
library(reshape2) # to load tips data
library(tidyverse)
library(tidymodels) # for the fit() function
library(plotly)
data(tips)

y <- tips$tip
X <- tips$total_bill

lm_model <- linear_reg() %>%
  set_engine('lm') %>%
  set_mode('regression') %>%
  fit(tip ~ total_bill, data = tips)

x_range <- seq(min(X), max(X), length.out = 100)
x_range <- matrix(x_range, nrow=100, ncol=1)
x_range <- data.frame(x_range)
colnames(x_range) <- c('total_bill')

y_range <- lm_model %>% predict(x_range)

colnames(y_range) <- c('tip')
xy <- data.frame(x_range, y_range)
# Review comment (collaborator): "One suggestion here is to rename the
# dataframe variable to df for this example and other ones (unless there's
# multiple dataframes)."


fig <- plot_ly(tips, x = ~total_bill, y = ~tip, type = 'scatter', alpha = 0.65, mode = 'markers', name = 'Tips')
fig <- fig %>% add_trace(data = xy, x = ~total_bill, y = ~tip, name = 'Regression Fit', mode = 'lines', alpha = 1)
fig
```
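
Since the fitted line is just an intercept plus one slope per feature, you can read those values straight off the model. The short block below is a sketch added for illustration (not part of the original page); `tidy()` comes from the broom package, which tidymodels attaches.

```{r}
# Hedged sketch: print the estimated intercept and total_bill slope that
# the regression line above is drawn from.
tidy(lm_model)
```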
## Model generalization on unseen data

With `add_trace()`, you can easily color your plot based on a predefined data split. By coloring the training and testing data points with different colors, you can easily see whether the model generalizes well to the test data.

```{r}
library(reshape2)
library(tidyverse)
library(tidymodels)
library(plotly)
data(tips)

y <- tips$tip
X <- tips$total_bill

set.seed(123)
tips_split <- initial_split(tips)
tips_training <- tips_split %>%
  training()
tips_test <- tips_split %>%
  testing()

lm_model <- linear_reg() %>%
  set_engine('lm') %>%
  set_mode('regression') %>%
  fit(tip ~ total_bill, data = tips_training)

x_range <- seq(min(X), max(X), length.out = 100)
x_range <- matrix(x_range, nrow=100, ncol=1)
x_range <- data.frame(x_range)
colnames(x_range) <- c('total_bill')

y_range <- lm_model %>%
  predict(x_range)

colnames(y_range) <- c('tip')
xy <- data.frame(x_range, y_range)

fig <- plot_ly(data = tips_training, x = ~total_bill, y = ~tip, type = 'scatter', name = 'train', mode = 'markers', alpha = 0.65) %>%
  add_trace(data = tips_test, x = ~total_bill, y = ~tip, type = 'scatter', name = 'test', mode = 'markers', alpha = 0.65) %>%
  add_trace(data = xy, x = ~total_bill, y = ~tip, name = 'prediction', mode = 'lines', alpha = 1)
fig
```
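
To put a number on the visual train/test comparison, a small sketch (an addition, not from the original page) scores the model on the held-out split with `rmse()` from yardstick, which tidymodels also attaches:

```{r}
# Hedged sketch: predict() on a parsnip fit returns a tibble with a .pred
# column; bind it to the test set and compute the root-mean-squared error.
test_results <- tips_test %>%
  bind_cols(predict(lm_model, tips_test))
rmse(test_results, truth = tip, estimate = .pred)
```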

## Comparing different kNN model parameters

In addition to linear regression, it's possible to fit the same data using [k-Nearest Neighbors][knn]. When you perform a prediction on a new sample, this model takes either the weighted or the un-weighted average of the neighbors. To see the difference between those two averaging options, we train a kNN model with each weighting scheme and plot both in the same way as the previous graph.

Notice how we can combine scatter points with lines using Plotly. You can learn more about [multiple chart types](https://plotly.com/r/graphing-multiple-chart-types/).

[knn]: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
> **Review comment (collaborator):** We'll want to replace this with the R example instead.


```{r}
library(reshape2)
library(tidyverse)
library(tidymodels)
library(plotly)
library(kknn)
data(tips)

y <- tips$tip
X <- tips$total_bill

knn_dist <- nearest_neighbor(neighbors = 10, weight_func = 'inv') %>%
  set_engine('kknn') %>%
  set_mode('regression') %>%
  fit(tip ~ total_bill, data = tips)
knn_uni <- nearest_neighbor(neighbors = 10, weight_func = 'rectangular') %>%
  set_engine('kknn') %>%
  set_mode('regression') %>%
  fit(tip ~ total_bill, data = tips)

x_range <- seq(min(X), max(X), length.out = 100)
x_range <- matrix(x_range, nrow=100, ncol=1)
x_range <- data.frame(x_range)
colnames(x_range) <- c('total_bill')

y_dist <- knn_dist %>%
  predict(x_range)
y_uni <- knn_uni %>%
  predict(x_range)

colnames(y_dist) <- c('dist')
colnames(y_uni) <- c('uni')
xy <- data.frame(x_range, y_dist, y_uni)

fig <- plot_ly(tips, type = 'scatter', mode = 'markers', colors = c("#FF7F50", "#6495ED")) %>%
  add_trace(data = tips, x = ~total_bill, y = ~tip, type = 'scatter', mode = 'markers', color = ~sex, alpha = 0.65) %>%
  add_trace(data = xy, x = ~total_bill, y = ~dist, name = 'Weights: Distance', mode = 'lines', alpha = 1) %>%
  add_trace(data = xy, x = ~total_bill, y = ~uni, name = 'Weights: Uniform', mode = 'lines', alpha = 1)
fig
```
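
To make the two averaging options concrete, here is a hand-rolled sketch (added for illustration; kknn rescales distances internally, so the exact 'inv' numbers will differ from the fitted model) of both averages for a single query point `x0`:

```{r}
# Hedged sketch: the un-weighted and inverse-distance averages of the 10
# nearest neighbors of one query point, computed by hand.
x0 <- 20
nn <- tips %>%
  mutate(dist = abs(total_bill - x0)) %>%
  arrange(dist) %>%
  slice_head(n = 10)
mean(nn$tip)                                 # 'rectangular' (un-weighted)
weighted.mean(nn$tip, 1 / (nn$dist + 1e-9))  # 'inv' (inverse-distance)
```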

## 3D regression surface with `mesh3d` and `add_surface`

Visualize the regression surface of your model whenever you have more than one variable in your input data. Here, we will use [`svm_rbf`](https://parsnip.tidymodels.org/reference/svm_rbf.html) with the [`kernlab`](https://cran.r-project.org/web/packages/kernlab/index.html) engine in `regression` mode. To generate the 2D mesh for the surface, we use the [`pracma`](https://cran.r-project.org/web/packages/pracma/index.html) package.

```{r}
library(reshape2)
library(tidyverse)
library(tidymodels)
library(plotly)
library(kernlab)
library(pracma) #For meshgrid()
data(iris)

mesh_size <- .02
margin <- 0
X <- iris %>% select(Sepal.Width, Sepal.Length)
y <- iris %>% select(Petal.Width)

model <- svm_rbf(cost = 1.0) %>%
  set_engine("kernlab") %>%
  set_mode("regression") %>%
  fit(Petal.Width ~ Sepal.Width + Sepal.Length, data = iris)

x_min <- min(X$Sepal.Width) - margin
x_max <- max(X$Sepal.Width) + margin
y_min <- min(X$Sepal.Length) - margin
y_max <- max(X$Sepal.Length) + margin
xrange <- seq(x_min, x_max, mesh_size)
yrange <- seq(y_min, y_max, mesh_size)
xy <- meshgrid(x = xrange, y = yrange)
xx <- xy$X
yy <- xy$Y
dim_val <- dim(xx)
xx1 <- matrix(xx, length(xx), 1)
yy1 <- matrix(yy, length(yy), 1)
final <- cbind(xx1, yy1)
pred <- model %>%
  predict(final)

pred <- pred$.pred
pred <- matrix(pred, dim_val[1], dim_val[2])

dim(pred)
# Review comment (collaborator): "Do we need this? If not, I would ✂️"

fig <- plot_ly(iris, x = ~Sepal.Width, y = ~Sepal.Length, z = ~Petal.Width) %>%
  add_markers(size = 5) %>%
  add_surface(x = xrange, y = yrange, z = pred, alpha = 0.65, type = 'mesh3d', name = 'pred_surface')
fig

```
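
As an aside, the same prediction grid can be built without `pracma`: base R's `expand.grid()` varies its first argument fastest, which matches the column-major flattening used above. A sketch of the equivalent construction (an addition, not from the original page):

```{r}
# Hedged sketch: expand.grid() with Sepal.Length first reproduces the
# (Sepal.Width, Sepal.Length) ordering of `final`, so the reshaped
# predictions match `pred`.
grid <- expand.grid(Sepal.Length = yrange, Sepal.Width = xrange)
pred2 <- predict(model, grid)$.pred
pred2 <- matrix(pred2, nrow = length(yrange), ncol = length(xrange))
```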


## Enhanced prediction error analysis using `ggplotly`

Add marginal histograms to quickly diagnose any prediction bias your model might have.

```{r}
library(plotly)
library(ggplot2)
library(tidyverse)
library(tidymodels)
data(iris)

X <- iris %>% select(Sepal.Width, Sepal.Length)
y <- iris %>% select(Petal.Width)

set.seed(0)
iris_split <- initial_split(iris, prop = 3/4)
iris_training <- iris_split %>%
  training()
iris_test <- iris_split %>%
  testing()

train_index <- as.integer(rownames(iris_training))
test_index <- as.integer(rownames(iris_test))

iris[train_index,'split'] = 'train'
iris[test_index,'split'] = 'test'

lm_model <- linear_reg() %>%
  set_engine('lm') %>%
  set_mode('regression') %>%
  fit(Petal.Width ~ Sepal.Width + Sepal.Length, data = iris_training)

prediction <- lm_model %>%
  predict(X)
colnames(prediction) <- c('prediction')
iris = cbind(iris, prediction)

hist_top <- ggplot(iris, aes(x = Petal.Width)) +
  geom_histogram(data = subset(iris, split == 'train'), fill = "red", alpha = 0.2, bins = 6) +
  geom_histogram(data = subset(iris, split == 'test'), fill = "blue", alpha = 0.2, bins = 6) +
  theme(axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank())
hist_top <- ggplotly(p = hist_top)

scatter <- ggplot(iris, aes(x = Petal.Width, y = prediction, color = split)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
scatter <- ggplotly(p = scatter, type = 'scatter')

hist_right <- ggplot(iris, aes(x = prediction)) +
  geom_histogram(data = subset(iris, split == 'train'), fill = "red", alpha = 0.2, bins = 13) +
  geom_histogram(data = subset(iris, split == 'test'), fill = "blue", alpha = 0.2, bins = 13) +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  coord_flip()
hist_right <- ggplotly(p = hist_right)

s <- subplot(
  hist_top,
  plotly_empty(),
  scatter,
  hist_right,
  nrows = 2, heights = c(0.2, 0.8), widths = c(0.8, 0.2), margin = 0,
  shareX = TRUE, shareY = TRUE, titleX = TRUE, titleY = TRUE
)
layout(s, showlegend = FALSE)

```
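
If the marginal histograms suggest a shift, one quick numeric check (a sketch added here, not part of the original page) is the mean prediction error per split:

```{r}
# Hedged sketch: mean prediction error per split; values near zero for
# both splits suggest no systematic over- or under-prediction.
iris %>%
  group_by(split) %>%
  summarise(mean_error = mean(prediction - Petal.Width))
```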
## Residual plots

Just like prediction error plots, it's easy to visualize your prediction residuals in just a few lines of code using `ggplotly` built-in capabilities.
```{r}
library(plotly)
library(ggplot2)
library(tidyverse)
library(tidymodels)

data(iris)

X <- iris %>% select(Sepal.Width, Sepal.Length)
y <- iris %>% select(Petal.Width)

set.seed(0)
iris_split <- initial_split(iris, prop = 3/4)
iris_training <- iris_split %>%
  training()
iris_test <- iris_split %>%
  testing()

train_index <- as.integer(rownames(iris_training))
test_index <- as.integer(rownames(iris_test))

iris[train_index,'split'] = 'train'
iris[test_index,'split'] = 'test'

lm_model <- linear_reg() %>%
  set_engine('lm') %>%
  set_mode('regression') %>%
  fit(Petal.Width ~ Sepal.Width + Sepal.Length, data = iris_training)

prediction <- lm_model %>%
  predict(X)
colnames(prediction) <- c('prediction')
iris = cbind(iris, prediction)
residual <- prediction - iris$Petal.Width
colnames(residual) <- c('residual')
iris = cbind(iris, residual)

scatter <- ggplot(iris, aes(x = prediction, y = residual, color = split)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)

scatter <- ggplotly(p = scatter, type = 'scatter')

violin <- iris %>%
  plot_ly(x = ~split, y = ~residual, split = ~split, type = 'violin')

s <- subplot(
  scatter,
  violin,
  nrows = 1, heights = c(1), widths = c(0.65, 0.35), margin = 0.01,
  shareX = TRUE, shareY = TRUE, titleX = TRUE, titleY = TRUE
)

layout(s, showlegend = FALSE)
```
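
To back the violin plot with numbers (again a sketch added here, not part of the original page), summarise the spread of the residuals per split:

```{r}
# Hedged sketch: per-split residual spread; a noticeably larger sd on the
# test split would indicate the model fits the training data too closely.
iris %>%
  group_by(split) %>%
  summarise(sd_residual = sd(residual), iqr_residual = IQR(residual))
```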