From 8cd68470f61095c8997c46ddf2842368e5055180 Mon Sep 17 00:00:00 2001 From: Kalpit Desai Date: Mon, 19 Jul 2021 20:32:34 +0530 Subject: [PATCH 01/20] ml-regression plot, first version --- r/2021-07-08-ml-regression.Rmd | 184 +++++++++++++++++++++++++++++++++ 1 file changed, 184 insertions(+) create mode 100644 r/2021-07-08-ml-regression.Rmd diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd new file mode 100644 index 00000000..57b92f6e --- /dev/null +++ b/r/2021-07-08-ml-regression.Rmd @@ -0,0 +1,184 @@ + +This page shows how to use Plotly charts for displaying various types of regression models, starting from simple models like Linear Regression and progressively move towards models like Decision Tree and Polynomial Features. We highlight various capabilities of plotly, such as comparative analysis of the same model with different parameters, displaying Latex, and [surface plots](https://plotly.com/r/3d-surface-plots/) for 3D data. + +We will use [tidymodels](https://tidymodels.tidymodels.org/) to split and preprocess our data and train various regression models. Tidymodels is a popular Machine Learning (ML) library in R that is compatible with the "tidyverse" concepts, and offers various tools for creating and training ML algorithms, feature engineering, data cleaning, and evaluating and testing models. It is the next-gen version of the popular [caret](http://topepo.github.io/caret/index.html) library in R. + + + +## Basic linear regression plots + +In this section, we show you how to apply a simple regression model for predicting tips a server will receive based on various client attributes (such as sex, time of the week, and whether they are a smoker). + +We will be using the [Linear Regression][lr], which is a simple model that fit an intercept (the mean tip received by a server), and add a slope for each feature we use, such as the value of the total bill. 
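To make the intercept-plus-slope idea concrete, here is a minimal base-R sketch on a few made-up (total_bill, tip) pairs; it uses plain `lm()` rather than the tidymodels interface used below, and the numbers are invented for illustration:

```r
# Toy data chosen so the fit is exact: tip = 1 + 0.1 * total_bill
toy <- data.frame(
  total_bill = c(10, 20, 30, 40),
  tip        = c(2, 3, 4, 5)
)
fit <- lm(tip ~ total_bill, data = toy)
coef(fit)
# (Intercept)  total_bill
#         1.0         0.1
```

The fitted intercept is the baseline tip, and each extra dollar on the bill adds the slope to the prediction.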
+ +[lr]: https://parsnip.tidymodels.org/reference/linear_reg.html + +### Linear Regression with R + +```{r} +library(plotly) +data(tips) + +y <- tips$tip +X <- tips$total_bill + +lm_model <- linear_reg() %>% + set_engine('lm') %>% + set_mode('regression') %>% + fit(tip ~ total_bill, data = tips) + +x_range <- seq(min(X), max(X), length.out = 100) +x_range <- matrix(x_range, nrow=100, ncol=1) +x_range <- data.frame(x_range) +colnames(x_range) <- c('total_bill') + +y_range <- lm_model %>% predict(x_range) + +colnames(y_range) <- c('tip') +xy <- data.frame(x_range, y_range) + +fig <- plot_ly(tips, x = ~total_bill, y = ~tip, type = 'scatter', alpha = 0.65, mode = 'markers', name = 'Tips') +fig <- fig %>% add_trace(data = xy, x = ~total_bill, y = ~tip, name = 'Regression Fit', mode = 'lines', alpha = 1) +fig +``` +## Model generalization on unseen data + +With `add_trace()`, you can easily color your plot based on a predefined data split. By coloring the training and the testing data points with different colors, you can easily see if whether the model generalizes well to the test data or not. 
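Behind the scenes, `initial_split()` from tidymodels (the rsample package) simply holds out a random subset of rows, by default 3/4 for training. A base-R sketch of the same bookkeeping (the row count matches the tips dataset, but the sampling here is illustrative, not the rsample implementation):

```r
# Base-R sketch of a 75/25 train/test split, which is what
# initial_split() does by default.
set.seed(123)
n <- 244                                   # number of rows in tips
train_rows <- sample(n, size = floor(0.75 * n))
test_rows  <- setdiff(seq_len(n), train_rows)
c(train = length(train_rows), test = length(test_rows))
# train  test
#   183    61
```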
+ +```{r} +library(plotly) +data(tips) + +y <- tips$tip +X <- tips$total_bill + +set.seed(123) +tips_split <- initial_split(tips) +tips_training <- tips_split %>% + training() +tips_test <- tips_split %>% + testing() + +lm_model <- linear_reg() %>% + set_engine('lm') %>% + set_mode('regression') %>% + fit(tip ~ total_bill, data = tips_training) + +x_range <- seq(min(X), max(X), length.out = 100) +x_range <- matrix(x_range, nrow=100, ncol=1) +x_range <- data.frame(x_range) +colnames(x_range) <- c('total_bill') + +y_range <- lm_model %>% + predict(x_range) + +colnames(y_range) <- c('tip') +xy <- data.frame(x_range, y_range) + +fig <- plot_ly(data = tips_training, x = ~total_bill, y = ~tip, type = 'scatter', name = 'train', mode = 'markers', alpha = 0.65) %>% + add_trace(data = tips_test, x = ~total_bill, y = ~tip, type = 'scatter', name = 'test', mode = 'markers', alpha = 0.65 ) %>% + add_trace(data = xy, x = ~total_bill, y = ~tip, name = 'prediction', mode = 'lines', alpha = 1) +fig +``` + +## Comparing different kNN models parameters + +In addition to linear regression, it's possible to fit the same data using [k-Nearest Neighbors][knn]. When you perform a prediction on a new sample, this model either takes the weighted or un-weighted average of the neighbors. In order to see the difference between those two averaging options, we train a kNN model with both of those parameters, and we plot them in the same way as the previous graph. + +Notice how we can combine scatter points with lines using Plotly. You can learn more about [multiple chart types](https://plotly.com/r/graphing-multiple-chart-types/). 
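Before fitting the models, it helps to see what the two weighting schemes do numerically. The sketch below is hand-rolled on invented numbers, not the `kknn` implementation: for one query point, the prediction is either a plain average of the k nearest targets ("rectangular") or an inverse-distance-weighted average:

```r
# One query point, k = 3 neighbors, two averaging rules.
x  <- c(1, 2, 3, 10)     # invented feature values
y  <- c(1, 2, 3, 10)     # invented targets
x0 <- 2.5                # query point
d  <- abs(x - x0)        # distances to the query
nn <- order(d)[1:3]      # indices of the 3 nearest neighbors
mean(y[nn])                           # unweighted average of neighbors
weighted.mean(y[nn], w = 1 / d[nn])   # closer neighbors count more
```

With these numbers the unweighted rule gives 2, while inverse-distance weighting pulls the prediction toward the two closest points.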
+ +[knn]: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html + +```{r} +library(plotly) +library(kknn) +data(tips) + +y <- tips$tip +X <- tips$total_bill + +knn_dist <- nearest_neighbor(neighbors = 10, weight_func = 'inv') %>% + set_engine('kknn') %>% + set_mode('regression') %>% + fit(tip ~ total_bill, data = tips) +knn_uni <- nearest_neighbor(neighbors = 10, weight_func = 'rectangular') %>% + set_engine('kknn') %>% + set_mode('regression') %>% + fit(tip ~ total_bill, data = tips) + +x_range <- seq(min(X), max(X), length.out = 100) +x_range <- matrix(x_range, nrow=100, ncol=1) +x_range <- data.frame(x_range) +colnames(x_range) <- c('total_bill') + +y_dist <- knn_dist %>% + predict(x_range) +y_uni <- knn_uni %>% + predict(x_range) + +colnames(y_dist) <- c('dist') +colnames(y_uni) <- c('uni') +xy <- data.frame(x_range, y_dist, y_uni) + +fig <- plot_ly(tips, type = 'scatter', mode = 'markers', colors = c("#FF7F50", "#6495ED")) %>% + add_trace(data = tips, x = ~total_bill, y = ~tip, type = 'scatter', mode = 'markers', color = ~sex, alpha = 0.65) %>% + add_trace(data = xy, x = ~total_bill, y = ~dist, name = 'Weights: Distance', mode = 'lines', alpha = 1) %>% + add_trace(data = xy, x = ~total_bill, y = ~uni, name = 'Weights: Uniform', mode = 'lines', alpha = 1) +fig +``` + +## 3D regression surface with `mesh3d` and `add_surface` + +Visualize the decision plane of your model whenever you have more than one variable in your input data. Here, we will use [`svm_rbf`](https://parsnip.tidymodels.org/reference/svm_rbf.html) with [`kernlab`](https://cran.r-project.org/web/packages/kernlab/index.html) engine in `regression` mode. 
For generating the 2D mesh on the surface, we use the package [`pracma`](https://cran.r-project.org/web/packages/pracma/index.html) + +```{r} +library(plotly) +library(kernlab) +library(pracma) #For meshgrid() +data(iris) + +mesh_size <- .02 +margin <- 0 +X <- iris %>% select(Sepal.Width, Sepal.Length) +y <- iris %>% select(Petal.Width) + +model <- svm_rbf(cost = 1.0) %>% + set_engine("kernlab") %>% + set_mode("regression") %>% + fit(Petal.Width ~ Sepal.Width + Sepal.Length, data = iris) + +x_min <- min(X$Sepal.Width) - margin +x_max <- max(X$Sepal.Width) - margin +y_min <- min(X$Sepal.Length) - margin +y_max <- max(X$Sepal.Length) - margin +xrange <- seq(x_min, x_max, mesh_size) +yrange <- seq(y_min, y_max, mesh_size) +xy <- meshgrid(x = xrange, y = yrange) +xx <- xy$X +yy <- xy$Y +dim_val <- dim(xx) +xx1 <- matrix(xx, length(xx), 1) +yy1 <- matrix(yy, length(yy), 1) +final <- cbind(xx1, yy1) +pred <- model %>% + predict(final) + +pred <- pred$.pred +pred <- matrix(pred, dim_val[1], dim_val[2]) + +dim(pred) +fig <- plot_ly(iris, x = ~Sepal.Width, y = ~Sepal.Length, z = ~Petal.Width ) %>% + add_markers(size = 5) %>% + add_surface(x=xrange, y=yrange, z=pred, alpha = 0.65, type = 'mesh3d', name = 'pred_surface') +fig + +``` + + +```{r} + +``` + +```{r} + +``` From c0102499e9f409bc476fbadd16b274238eb2c304 Mon Sep 17 00:00:00 2001 From: kvdesai Date: Mon, 19 Jul 2021 20:59:50 +0530 Subject: [PATCH 02/20] config.yml updated for ml-regression deps --- .circleci/config.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index c022be7a..1172b9ee 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -24,8 +24,8 @@ jobs: - run: name: install application-level dependencies command: | - sudo apt-get install -y pandoc libudunits2-dev libgdal-dev libxt-dev libglu1-mesa-dev libfftw3-dev libglpk40 - sudo R -e 'install.packages(c("curl", "devtools", "mvtnorm", "hexbin")); 
devtools::install_github("hypertidy/anglr"); devtools::install_github("ropensci/plotly"); devtools::install_github("johannesbjork/LaCroixColoR"); install.packages("BiocManager"); BiocManager::install("EBImage"); devtools::install_deps(dependencies = TRUE) ' + sudo apt-get install -y pandoc libudunits2-dev libgdal-dev libxt-dev libglu1-mesa-dev libfftw3-dev libglpk40 libxml2-dev libcurl4-openssl-dev apt-transport-https software-properties-common + sudo R -e 'install.packages(c("curl", "devtools", "mvtnorm", "hexbin", "tidyverse", "tidymodels", "kernlab", "pracma")); devtools::install_github("hypertidy/anglr"); devtools::install_github("ropensci/plotly"); devtools::install_github("johannesbjork/LaCroixColoR"); install.packages("BiocManager"); BiocManager::install("EBImage"); devtools::install_deps(dependencies = TRUE) ' - save_cache: key: cache4 paths: From 8673f4029035988cc7c300c9bc69f28f01d30b26 Mon Sep 17 00:00:00 2001 From: Kalpit Desai Date: Mon, 19 Jul 2021 21:32:04 +0530 Subject: [PATCH 03/20] Fixing the CI error due to tips dataset not available --- .circleci/config.yml | 2 +- r/2021-07-08-ml-regression.Rmd | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index 1172b9ee..5a630955 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -25,7 +25,7 @@ jobs: name: install application-level dependencies command: | sudo apt-get install -y pandoc libudunits2-dev libgdal-dev libxt-dev libglu1-mesa-dev libfftw3-dev libglpk40 libxml2-dev libcurl4-openssl-dev apt-transport-https software-properties-common - sudo R -e 'install.packages(c("curl", "devtools", "mvtnorm", "hexbin", "tidyverse", "tidymodels", "kernlab", "pracma")); devtools::install_github("hypertidy/anglr"); devtools::install_github("ropensci/plotly"); devtools::install_github("johannesbjork/LaCroixColoR"); install.packages("BiocManager"); BiocManager::install("EBImage"); devtools::install_deps(dependencies = TRUE) ' + sudo R -e 
'install.packages(c("curl", "devtools", "mvtnorm", "hexbin", "tidyverse", "tidymodels", "kernlab", "pracma", "reshape2")); devtools::install_github("hypertidy/anglr"); devtools::install_github("ropensci/plotly"); devtools::install_github("johannesbjork/LaCroixColoR"); install.packages("BiocManager"); BiocManager::install("EBImage"); devtools::install_deps(dependencies = TRUE) ' - save_cache: key: cache4 paths: diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd index 57b92f6e..7c878f02 100644 --- a/r/2021-07-08-ml-regression.Rmd +++ b/r/2021-07-08-ml-regression.Rmd @@ -16,6 +16,7 @@ We will be using the [Linear Regression][lr], which is a simple model that fit a ### Linear Regression with R ```{r} +library(reshape2) #to load tips data library(plotly) data(tips) From 1f3b815c7d516c42ab87c47c72b6172ee41971e9 Mon Sep 17 00:00:00 2001 From: Kalpit Desai Date: Mon, 19 Jul 2021 22:08:27 +0530 Subject: [PATCH 04/20] Explicit loading of tidymodels to fix CI --- r/2021-07-08-ml-regression.Rmd | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd index 7c878f02..844cb658 100644 --- a/r/2021-07-08-ml-regression.Rmd +++ b/r/2021-07-08-ml-regression.Rmd @@ -16,7 +16,9 @@ We will be using the [Linear Regression][lr], which is a simple model that fit a ### Linear Regression with R ```{r} -library(reshape2) #to load tips data +library(reshape2) # to load tips data +library(tidyverse) +library(tidymodels) # for the fit() function library(plotly) data(tips) From 4bbf84aa43584ce21b9acb1cce704285a41de904 Mon Sep 17 00:00:00 2001 From: Kalpit Desai Date: Mon, 19 Jul 2021 22:39:11 +0530 Subject: [PATCH 05/20] CI Fix, load kknn --- .circleci/config.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index 5a630955..d9495702 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -25,7 +25,7 @@ jobs: name: install 
application-level dependencies command: | sudo apt-get install -y pandoc libudunits2-dev libgdal-dev libxt-dev libglu1-mesa-dev libfftw3-dev libglpk40 libxml2-dev libcurl4-openssl-dev apt-transport-https software-properties-common - sudo R -e 'install.packages(c("curl", "devtools", "mvtnorm", "hexbin", "tidyverse", "tidymodels", "kernlab", "pracma", "reshape2")); devtools::install_github("hypertidy/anglr"); devtools::install_github("ropensci/plotly"); devtools::install_github("johannesbjork/LaCroixColoR"); install.packages("BiocManager"); BiocManager::install("EBImage"); devtools::install_deps(dependencies = TRUE) ' + sudo R -e 'install.packages(c("curl", "devtools", "mvtnorm", "hexbin", "tidyverse", "tidymodels", "kknn", "kernlab", "pracma", "reshape2")); devtools::install_github("hypertidy/anglr"); devtools::install_github("ropensci/plotly"); devtools::install_github("johannesbjork/LaCroixColoR"); install.packages("BiocManager"); BiocManager::install("EBImage"); devtools::install_deps(dependencies = TRUE) ' - save_cache: key: cache4 paths: From 74ab5a4e9c82fdaffe1769c83ce2710eca19ee6f Mon Sep 17 00:00:00 2001 From: Kalpit Desai Date: Mon, 19 Jul 2021 23:12:32 +0530 Subject: [PATCH 06/20] Adding package load statements to all examples, so that each example can be copied and run independently --- r/2021-07-08-ml-regression.Rmd | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd index 844cb658..d87ddee0 100644 --- a/r/2021-07-08-ml-regression.Rmd +++ b/r/2021-07-08-ml-regression.Rmd @@ -49,6 +49,9 @@ fig With `add_trace()`, you can easily color your plot based on a predefined data split. By coloring the training and the testing data points with different colors, you can easily see if whether the model generalizes well to the test data or not.
```{r} +library(reshape2) +library(tidyverse) +library(tidymodels) library(plotly) data(tips) @@ -93,6 +96,9 @@ Notice how we can combine scatter points with lines using Plotly. You can learn [knn]: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html ```{r} +library(reshape2) +library(tidyverse) +library(tidymodels) library(plotly) library(kknn) data(tips) @@ -135,6 +141,9 @@ fig Visualize the decision plane of your model whenever you have more than one variable in your input data. Here, we will use [`svm_rbf`](https://parsnip.tidymodels.org/reference/svm_rbf.html) with [`kernlab`](https://cran.r-project.org/web/packages/kernlab/index.html) engine in `regression` mode. For generating the 2D mesh on the surface, we use the package [`pracma`](https://cran.r-project.org/web/packages/pracma/index.html) ```{r} +library(reshape2) +library(tidyverse) +library(tidymodels) library(plotly) library(kernlab) library(pracma) #For meshgrid() From a9c40b2964e42753f0d8990f3a5b61dc7cf9dd71 Mon Sep 17 00:00:00 2001 From: Kalpit Desai Date: Tue, 20 Jul 2021 16:28:15 +0530 Subject: [PATCH 07/20] ml-regression one more plot --- .circleci/config.yml | 2 +- r/2021-07-08-ml-regression.Rmd | 63 ++++++++++++++++++++++++++++++++++ 2 files changed, 64 insertions(+), 1 deletion(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index d9495702..75184656 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -25,7 +25,7 @@ jobs: name: install application-level dependencies command: | sudo apt-get install -y pandoc libudunits2-dev libgdal-dev libxt-dev libglu1-mesa-dev libfftw3-dev libglpk40 libxml2-dev libcurl4-openssl-dev apt-transport-https software-properties-common - sudo R -e 'install.packages(c("curl", "devtools", "mvtnorm", "hexbin", "tidyverse", "tidymodels", "kknn", "kernlab", "pracma", "reshape2")); devtools::install_github("hypertidy/anglr"); devtools::install_github("ropensci/plotly"); 
devtools::install_github("johannesbjork/LaCroixColoR"); install.packages("BiocManager"); BiocManager::install("EBImage"); devtools::install_deps(dependencies = TRUE) ' + sudo R -e 'install.packages(c("curl", "devtools", "mvtnorm", "hexbin", "tidyverse", "tidymodels", "kknn", "kernlab", "pracma", "reshape2", "ggplot2", "datasets")); devtools::install_github("hypertidy/anglr"); devtools::install_github("ropensci/plotly"); devtools::install_github("johannesbjork/LaCroixColoR"); install.packages("BiocManager"); BiocManager::install("EBImage"); devtools::install_deps(dependencies = TRUE) ' - save_cache: key: cache4 paths: diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd index d87ddee0..1aa0e3ca 100644 --- a/r/2021-07-08-ml-regression.Rmd +++ b/r/2021-07-08-ml-regression.Rmd @@ -187,7 +187,70 @@ fig ``` +### Enhanced prediction error analysis using `ggplotly` + +Add marginal histograms to quickly diagnose any prediction bias your model might have. + +```{r} +library(plotly) +library(ggplot2) +library(tidyverse) +library(tidymodels) +data(iris) + +X <- iris %>% select(Sepal.Width, Sepal.Length) +y <- iris %>% select(Petal.Width) + +set.seed(0) +iris_split <- initial_split(iris, prop = 3/4) +iris_training <- iris_split %>% + training() +iris_test <- iris_split %>% + testing() + +train_index <- as.integer(rownames(iris_training)) +test_index <- as.integer(rownames(iris_test)) + +iris[train_index,'split'] = 'train' +iris[test_index,'split'] = 'test' + +lm_model <- linear_reg() %>% + set_engine('lm') %>% + set_mode('regression') %>% + fit(Petal.Width ~ Sepal.Width + Sepal.Length, data = iris_training) + +prediction <- lm_model %>% + predict(X) +colnames(prediction) <- c('prediction') +iris = cbind(iris, prediction) + +hist_top <- ggplot(iris,aes(x=Petal.Width)) + + geom_histogram(data=subset(iris,split == 'train'),fill = "red", alpha = 0.2, bins = 6) + + geom_histogram(data=subset(iris,split == 'test'),fill = "blue", alpha = 0.2, bins = 6) + +
theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank()) +hist_top <- ggplotly(p = hist_top) + +scatter <- ggplot(iris, aes(x = Petal.Width, y = prediction, color = split)) + + geom_point() + + geom_smooth(method=lm, se=FALSE) +scatter <- ggplotly(p = scatter, type = 'scatter') + +hist_right <- ggplot(iris,aes(x=prediction)) + + geom_histogram(data=subset(iris,split == 'train'),fill = "red", alpha = 0.2, bins = 13) + + geom_histogram(data=subset(iris,split == 'test'),fill = "blue", alpha = 0.2, bins = 13) + + theme(axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank())+ + coord_flip() +hist_right <- ggplotly(p = hist_right) + +s <- subplot( + hist_top, + plotly_empty(), + scatter, + hist_right, + nrows = 2, heights = c(0.2, 0.8), widths = c(0.8, 0.2), margin = 0, + shareX = TRUE, shareY = TRUE, titleX = TRUE, titleY = TRUE +) +layout(s, showlegend = FALSE) ``` From 1759ebf4952ed662e71f6013a64bfd35b18b52bb Mon Sep 17 00:00:00 2001 From: Kalpit Desai Date: Tue, 20 Jul 2021 17:25:14 +0530 Subject: [PATCH 08/20] ml-regression added residual plot violin --- r/2021-07-08-ml-regression.Rmd | 55 +++++++++++++++++++++++++++++++++- 1 file changed, 54 insertions(+), 1 deletion(-) diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd index 1aa0e3ca..033a07a5 100644 --- a/r/2021-07-08-ml-regression.Rmd +++ b/r/2021-07-08-ml-regression.Rmd @@ -253,7 +253,60 @@ s <- subplot( layout(s, showlegend = FALSE) ``` - +## Residual plots +Just like prediction error plots, it's easy to visualize your prediction residuals in just a few lines of codes using `ggplotly` built-in capabilities. 
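The residual used below is simply the prediction minus the observed value. As a quick sanity check in base R (plain `lm()` on the built-in iris data, standing in for the tidymodels fit used in this section):

```r
# Residual = prediction - observed; for an OLS fit with an intercept
# the residuals average out to (numerically) zero on the training data.
data(iris)
fit <- lm(Petal.Width ~ Sepal.Width + Sepal.Length, data = iris)
residual <- predict(fit) - iris$Petal.Width
round(mean(residual), 12)
# 0
```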
```{r} +library(plotly) +library(ggplot2) +library(tideyverse) +library(tidymodels) + +data(iris) + +X <- iris %>% select(Sepal.Width, Sepal.Length) +y <- iris %>% select(Petal.Width) + +set.seed(0) +iris_split <- initial_split(iris, prop = 3/4) +iris_training <- iris_split %>% + training() +iris_test <- iris_split %>% + testing() + +train_index <- as.integer(rownames(iris_training)) +test_index <- as.integer(rownames(iris_test)) + +iris[train_index,'split'] = 'train' +iris[test_index,'split'] = 'test' +lm_model <- linear_reg() %>% + set_engine('lm') %>% + set_mode('regression') %>% + fit(Petal.Width ~ Sepal.Width + Sepal.Length, data = iris_training) + +prediction <- lm_model %>% + predict(X) +colnames(prediction) <- c('prediction') +iris = cbind(iris, prediction) +residual <- prediction - iris$Petal.Width +colnames(residual) <- c('residual') +iris = cbind(iris, residual) + +scatter <- ggplot(iris, aes(x = prediction, y = residual, color = split)) + + geom_point() + + geom_smooth(method=lm, se=FALSE) + +scatter <- ggplotly(p = scatter, type = 'scatter') + +violin <- iris %>% + plot_ly(x = ~split, y = ~residual, split = ~split, type = 'violin' ) + +s <- subplot( + scatter, + violin, + nrows = 1, heights = c(1), widths = c(0.65, 0.35), margin = 0.01, + shareX = TRUE, shareY = TRUE, titleX = TRUE, titleY = TRUE +) + +layout(s, showlegend = FALSE) ``` From 9e08d38138ec134ab0a4780493315e68b75947eb Mon Sep 17 00:00:00 2001 From: Kalpit Desai Date: Tue, 20 Jul 2021 20:31:54 +0530 Subject: [PATCH 09/20] fixing typo --- r/2021-07-08-ml-regression.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd index 033a07a5..0f0f7c75 100644 --- a/r/2021-07-08-ml-regression.Rmd +++ b/r/2021-07-08-ml-regression.Rmd @@ -258,7 +258,7 @@ Just like prediction error plots, it's easy to visualize your prediction residua ```{r} library(plotly) library(ggplot2) -library(tideyverse) +library(tidyverse) 
library(tidymodels) data(iris) From da9ba1c82a8dc5f5a9f5fb72e39a665252b0e13f Mon Sep 17 00:00:00 2001 From: kvdesai Date: Wed, 21 Jul 2021 22:26:12 +0530 Subject: [PATCH 10/20] Update r/2021-07-08-ml-regression.Rmd Co-authored-by: HammadTheOne <30986043+HammadTheOne@users.noreply.github.com> --- r/2021-07-08-ml-regression.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd index 0f0f7c75..69b3257b 100644 --- a/r/2021-07-08-ml-regression.Rmd +++ b/r/2021-07-08-ml-regression.Rmd @@ -1,7 +1,7 @@ This page shows how to use Plotly charts for displaying various types of regression models, starting from simple models like Linear Regression and progressively move towards models like Decision Tree and Polynomial Features. We highlight various capabilities of plotly, such as comparative analysis of the same model with different parameters, displaying Latex, and [surface plots](https://plotly.com/r/3d-surface-plots/) for 3D data. -We will use [tidymodels](https://tidymodels.tidymodels.org/) to split and preprocess our data and train various regression models. Tidymodels is a popular Machine Learning (ML) library in R that is compatible with the "tidyverse" concepts, and offers various tools for creating and training ML algorithms, feature engineering, data cleaning, and evaluating and testing models. It is the next-gen version of the popular [caret](http://topepo.github.io/caret/index.html) library in R. +We will use [tidymodels](https://tidymodels.tidymodels.org/) to split and preprocess our data and train various regression models. Tidymodels is a popular Machine Learning (ML) library in R that is compatible with the "tidyverse" concepts, and offers various tools for creating and training ML algorithms, feature engineering, data cleaning, and evaluating and testing models. It is the next-gen version of the popular [caret](http://topepo.github.io/caret/index.html) library for R. 
From bf2b0b08cab8e309baecbeaf904e2f96e1960c3b Mon Sep 17 00:00:00 2001 From: kvdesai Date: Wed, 21 Jul 2021 22:26:32 +0530 Subject: [PATCH 11/20] Update r/2021-07-08-ml-regression.Rmd Co-authored-by: HammadTheOne <30986043+HammadTheOne@users.noreply.github.com> --- r/2021-07-08-ml-regression.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd index 69b3257b..128b89bc 100644 --- a/r/2021-07-08-ml-regression.Rmd +++ b/r/2021-07-08-ml-regression.Rmd @@ -9,7 +9,7 @@ We will use [tidymodels](https://tidymodels.tidymodels.org/) to split and prepro In this section, we show you how to apply a simple regression model for predicting tips a server will receive based on various client attributes (such as sex, time of the week, and whether they are a smoker). -We will be using the [Linear Regression][lr], which is a simple model that fit an intercept (the mean tip received by a server), and add a slope for each feature we use, such as the value of the total bill. +We will be using the [Linear Regression][lr], which is a simple model that fits an intercept (the mean tip received by a server), and adds a slope for each feature we use, such as the value of the total bill. 
[lr]: https://parsnip.tidymodels.org/reference/linear_reg.html From 8e88395833bfa407b17b7fa14e6e274a7dd5bc3e Mon Sep 17 00:00:00 2001 From: kvdesai Date: Wed, 21 Jul 2021 22:26:53 +0530 Subject: [PATCH 12/20] Update r/2021-07-08-ml-regression.Rmd Co-authored-by: HammadTheOne <30986043+HammadTheOne@users.noreply.github.com> --- r/2021-07-08-ml-regression.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd index 128b89bc..1648862a 100644 --- a/r/2021-07-08-ml-regression.Rmd +++ b/r/2021-07-08-ml-regression.Rmd @@ -1,5 +1,5 @@ -This page shows how to use Plotly charts for displaying various types of regression models, starting from simple models like Linear Regression and progressively move towards models like Decision Tree and Polynomial Features. We highlight various capabilities of plotly, such as comparative analysis of the same model with different parameters, displaying Latex, and [surface plots](https://plotly.com/r/3d-surface-plots/) for 3D data. +This page shows how to use Plotly charts for displaying various types of regression models, starting from simple models like [Linear Regression](https://parsnip.tidymodels.org/reference/linear_reg.html) and progressively moving towards models like Decision Tree and Polynomial Features. We highlight various capabilities of plotly, such as comparative analysis of the same model with different parameters, displaying Latex, and [surface plots](https://plotly.com/r/3d-surface-plots/) for 3D data. We will use [tidymodels](https://tidymodels.tidymodels.org/) to split and preprocess our data and train various regression models. Tidymodels is a popular Machine Learning (ML) library in R that is compatible with the "tidyverse" concepts, and offers various tools for creating and training ML algorithms, feature engineering, data cleaning, and evaluating and testing models.
It is the next-gen version of the popular [caret](http://topepo.github.io/caret/index.html) library for R. From b12757250cde9d3e15e47df6fcc4234759638847 Mon Sep 17 00:00:00 2001 From: kvdesai Date: Thu, 22 Jul 2021 18:23:31 +0530 Subject: [PATCH 13/20] Update r/2021-07-08-ml-regression.Rmd Co-authored-by: HammadTheOne <30986043+HammadTheOne@users.noreply.github.com> --- r/2021-07-08-ml-regression.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd index 1648862a..2ef3989d 100644 --- a/r/2021-07-08-ml-regression.Rmd +++ b/r/2021-07-08-ml-regression.Rmd @@ -232,7 +232,7 @@ hist_top <- ggplotly(p = hist_top) scatter <- ggplot(iris, aes(x = Petal.Width, y = prediction, color = split)) + geom_point() + - geom_smooth(method=lm, se=FALSE) + geom_smooth(formula=y ~ x, method=lm, se=FALSE) scatter <- ggplotly(p = scatter, type = 'scatter') hist_right <- ggplot(iris,aes(x=prediction)) + From 651d71a616c28475f0f45e28a37e36c2800bf25f Mon Sep 17 00:00:00 2001 From: kvdesai Date: Thu, 22 Jul 2021 18:23:43 +0530 Subject: [PATCH 14/20] Update r/2021-07-08-ml-regression.Rmd Co-authored-by: HammadTheOne <30986043+HammadTheOne@users.noreply.github.com> --- r/2021-07-08-ml-regression.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd index 2ef3989d..5c1b043a 100644 --- a/r/2021-07-08-ml-regression.Rmd +++ b/r/2021-07-08-ml-regression.Rmd @@ -294,7 +294,7 @@ iris = cbind(iris, residual) scatter <- ggplot(iris, aes(x = prediction, y = residual, color = split)) + geom_point() + - geom_smooth(method=lm, se=FALSE) + geom_smooth(formula=y ~ x, method=lm, se=FALSE) scatter <- ggplotly(p = scatter, type = 'scatter') From dd411628c131ff4ad8b366d9515adc4ef691a7ac Mon Sep 17 00:00:00 2001 From: kvdesai Date: Thu, 22 Jul 2021 18:23:58 +0530 Subject: [PATCH 15/20] Update r/2021-07-08-ml-regression.Rmd Co-authored-by: HammadTheOne 
<30986043+HammadTheOne@users.noreply.github.com> --- r/2021-07-08-ml-regression.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd index 5c1b043a..c4f25490 100644 --- a/r/2021-07-08-ml-regression.Rmd +++ b/r/2021-07-08-ml-regression.Rmd @@ -46,7 +46,7 @@ fig ``` ## Model generalization on unseen data -With `add_trace()`, you can easily color your plot based on a predefined data split. By coloring the training and the testing data points with different colors, you can easily see if whether the model generalizes well to the test data or not. +With `add_trace()`, you can easily color your plot based on a predefined data split. By coloring the training and the testing data points with different colors, you can easily see if the model generalizes well to the test data or not. ```{r} library(reshape2) From 56ce3dc6c034e905f93d49566789688e42217b10 Mon Sep 17 00:00:00 2001 From: kvdesai Date: Thu, 22 Jul 2021 18:24:45 +0530 Subject: [PATCH 16/20] Update r/2021-07-08-ml-regression.Rmd Co-authored-by: HammadTheOne <30986043+HammadTheOne@users.noreply.github.com> --- r/2021-07-08-ml-regression.Rmd | 3 +++ 1 file changed, 3 insertions(+) diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd index c4f25490..95f470ce 100644 --- a/r/2021-07-08-ml-regression.Rmd +++ b/r/2021-07-08-ml-regression.Rmd @@ -106,10 +106,13 @@ data(tips) y <- tips$tip X <- tips$total_bill +# Model #1 knn_dist <- nearest_neighbor(neighbors = 10, weight_func = 'inv') %>% set_engine('kknn') %>% set_mode('regression') %>% fit(tip ~ total_bill, data = tips) + +# Model #2 knn_uni <- nearest_neighbor(neighbors = 10, weight_func = 'rectangular') %>% set_engine('kknn') %>% set_mode('regression') %>% From 3de3fdb82970405bcbbfc6ab1695882ab495b200 Mon Sep 17 00:00:00 2001 From: kvdesai Date: Thu, 22 Jul 2021 18:25:16 +0530 Subject: [PATCH 17/20] Update r/2021-07-08-ml-regression.Rmd Co-authored-by: HammadTheOne 
<30986043+HammadTheOne@users.noreply.github.com> --- r/2021-07-08-ml-regression.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd index 95f470ce..75dad11d 100644 --- a/r/2021-07-08-ml-regression.Rmd +++ b/r/2021-07-08-ml-regression.Rmd @@ -141,7 +141,7 @@ fig ## 3D regression surface with `mesh3d` and `add_surface` -Visualize the decision plane of your model whenever you have more than one variable in your input data. Here, we will use [`svm_rbf`](https://parsnip.tidymodels.org/reference/svm_rbf.html) with [`kernlab`](https://cran.r-project.org/web/packages/kernlab/index.html) engine in `regression` mode. For generating the 2D mesh on the surface, we use the package [`pracma`](https://cran.r-project.org/web/packages/pracma/index.html) +Visualize the decision plane of your model whenever you have more than one variable in your input data. Here, we will use [`svm_rbf`](https://parsnip.tidymodels.org/reference/svm_rbf.html) with [`kernlab`](https://cran.r-project.org/web/packages/kernlab/index.html) engine in `regression` mode. For generating the 2D mesh on the surface, we use the [`pracma`](https://cran.r-project.org/web/packages/pracma/index.html) package. 
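`pracma::meshgrid()` follows the MATLAB convention: given vectors `x` and `y`, it returns matrices `X` and `Y` whose aligned entries enumerate every grid point. A base-R sketch of that behavior (assumed semantics, shown without the package itself):

```r
# Base-R sketch of the meshgrid convention: X repeats x along each row,
# Y repeats y down each column, so (X[i,j], Y[i,j]) covers the full grid.
x <- c(1, 2, 3)
y <- c(10, 20)
X <- matrix(x, nrow = length(y), ncol = length(x), byrow = TRUE)
Y <- matrix(y, nrow = length(y), ncol = length(x))
cbind(as.vector(X), as.vector(Y))   # all 6 (x, y) pairs of the grid
```

Flattening both matrices the same way is exactly what the 3D example does before calling `predict()` on the grid.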
 
 ```{r}
 library(reshape2)

From 323b75b317872694fc14a13f4571edc093e97492 Mon Sep 17 00:00:00 2001
From: kvdesai
Date: Thu, 22 Jul 2021 18:35:36 +0530
Subject: [PATCH 18/20] Update r/2021-07-08-ml-regression.Rmd

Co-authored-by: HammadTheOne <30986043+HammadTheOne@users.noreply.github.com>
---
 r/2021-07-08-ml-regression.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd
index 75dad11d..df1eb1e7 100644
--- a/r/2021-07-08-ml-regression.Rmd
+++ b/r/2021-07-08-ml-regression.Rmd
@@ -257,7 +257,7 @@ layout(s, showlegend = FALSE)
 ```
 
 ## Residual plots
-Just like prediction error plots, it's easy to visualize your prediction residuals in just a few lines of codes using `ggplotly` built-in capabilities.
+Just like prediction error plots, it's easy to visualize your prediction residuals in just a few lines of code using `ggplotly` and `tidymodels` capabilities.
 ```{r}
 library(plotly)
 library(ggplot2)

From a93d8f84130517cec30017739ff65c5e1172eeb7 Mon Sep 17 00:00:00 2001
From: Kalpit Desai
Date: Fri, 23 Jul 2021 22:04:51 +0530
Subject: [PATCH 19/20] resolving the PR review comments.
Also added a missing snippet plot
---
 r/2021-07-08-ml-regression.Rmd | 78 ++++++++++++++++++++++++++--------
 1 file changed, 60 insertions(+), 18 deletions(-)

diff --git a/r/2021-07-08-ml-regression.Rmd b/r/2021-07-08-ml-regression.Rmd
index df1eb1e7..fc8657e3 100644
--- a/r/2021-07-08-ml-regression.Rmd
+++ b/r/2021-07-08-ml-regression.Rmd
@@ -32,13 +32,13 @@ lm_model <- linear_reg() %>%
 
 x_range <- seq(min(X), max(X), length.out = 100)
 x_range <- matrix(x_range, nrow=100, ncol=1)
-x_range <- data.frame(x_range)
-colnames(x_range) <- c('total_bill')
+xdf <- data.frame(x_range)
+colnames(xdf) <- c('total_bill')
 
-y_range <- lm_model %>% predict(x_range)
+ydf <- lm_model %>% predict(xdf)
 
-colnames(y_range) <- c('tip')
-xy <- data.frame(x_range, y_range)
+colnames(ydf) <- c('tip')
+xy <- data.frame(xdf, ydf)
 
 fig <- plot_ly(tips, x = ~total_bill, y = ~tip, type = 'scatter', alpha = 0.65, mode = 'markers', name = 'Tips')
 fig <- fig %>% add_trace(data = xy, x = ~total_bill, y = ~tip, name = 'Regression Fit', mode = 'lines', alpha = 1)
@@ -72,14 +72,14 @@ lm_model <- linear_reg() %>%
 
 x_range <- seq(min(X), max(X), length.out = 100)
 x_range <- matrix(x_range, nrow=100, ncol=1)
-x_range <- data.frame(x_range)
-colnames(x_range) <- c('total_bill')
+xdf <- data.frame(x_range)
+colnames(xdf) <- c('total_bill')
 
-y_range <- lm_model %>%
-  predict(x_range)
+ydf <- lm_model %>%
+  predict(xdf)
 
-colnames(y_range) <- c('tip')
-xy <- data.frame(x_range, y_range)
+colnames(ydf) <- c('tip')
+xy <- data.frame(xdf, ydf)
 
 fig <- plot_ly(data = tips_training, x = ~total_bill, y = ~tip, type = 'scatter', name = 'train', mode = 'markers', alpha = 0.65) %>%
   add_trace(data = tips_test, x = ~total_bill, y = ~tip, type = 'scatter', name = 'test', mode = 'markers', alpha = 0.65 ) %>%
@@ -93,7 +93,7 @@ In addition to linear regression, it's possible to fit the same data using [k-Nearest Neighbors][knn].
 Notice how we can combine scatter points with lines using Plotly.
 You can learn more about [multiple chart types](https://plotly.com/r/graphing-multiple-chart-types/).
 
-[knn]: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
+[knn]: http://klausvigo.github.io/kknn/
 
 ```{r}
 library(reshape2)
@@ -120,17 +120,17 @@ knn_uni <- nearest_neighbor(neighbors = 10, weight_func = 'rectangular') %>%
 
 x_range <- seq(min(X), max(X), length.out = 100)
 x_range <- matrix(x_range, nrow=100, ncol=1)
-x_range <- data.frame(x_range)
-colnames(x_range) <- c('total_bill')
+xdf <- data.frame(x_range)
+colnames(xdf) <- c('total_bill')
 
 y_dist <- knn_dist %>%
-  predict(x_range)
+  predict(xdf)
 y_uni <- knn_uni %>%
-  predict(x_range)
+  predict(xdf)
 
 colnames(y_dist) <- c('dist')
 colnames(y_uni) <- c('uni')
-xy <- data.frame(x_range, y_dist, y_uni)
+xy <- data.frame(xdf, y_dist, y_uni)
 
 fig <- plot_ly(tips, type = 'scatter', mode = 'markers', colors = c("#FF7F50", "#6495ED")) %>%
   add_trace(data = tips, x = ~total_bill, y = ~tip, type = 'scatter', mode = 'markers', color = ~sex, alpha = 0.65) %>%
@@ -181,14 +181,56 @@ pred <- model %>%
 pred <- pred$.pred
 pred <- matrix(pred, dim_val[1], dim_val[2])
-dim(pred)
 
 fig <- plot_ly(iris, x = ~Sepal.Width, y = ~Sepal.Length, z = ~Petal.Width ) %>%
   add_markers(size = 5) %>%
   add_surface(x=xrange, y=yrange, z=pred, alpha = 0.65, type = 'mesh3d', name = 'pred_surface')
 fig
 ```
 
+## Prediction error plots
+
+When you are working with very high-dimensional data, it is inconvenient to plot every input dimension against your output `y`. Instead, you can use methods such as prediction error plots, which let you visualize how well your model does compared to the ground truth.
+
+### Simple actual vs predicted plot
+
+This example shows you the simplest way to compare the predicted output vs. the actual output. A good model will have most of the scatter dots near the diagonal black line.
+
+```{r}
+library(tidyverse)
+library(tidymodels)
+library(plotly)
+library(ggplot2)
+
+data("iris")
+
+X <- data.frame(Sepal.Width = iris$Sepal.Width, Sepal.Length = iris$Sepal.Length)
+y <- iris$Petal.Width
+
+lm_model <- linear_reg() %>%
+  set_engine('lm') %>%
+  set_mode('regression') %>%
+  fit(Petal.Width ~ Sepal.Width + Sepal.Length, data = iris)
+
+y_pred <- lm_model %>%
+  predict(X)
+
+db <- cbind(iris, y_pred)
+
+# Petal.Width (column 4) is the ground truth; .pred (column 6) is the model output
+colnames(db)[4] <- "Ground_truth"
+colnames(db)[6] <- "prediction"
+
+# Endpoints of the diagonal y = x reference line
+x0 <- min(y)
+y1 <- max(y)
+
+p1 <- ggplot(db, aes(x = Ground_truth, y = prediction)) +
+  geom_point(color = "blue") +
+  geom_segment(aes(x = x0, y = x0, xend = y1, yend = y1), linetype = 2)
+
+p1 <- ggplotly(p1)
+p1
+```
 
 ### Enhanced prediction error analysis using `ggplotly`

From 036b474a55bdec678c531cbf8ddccf471309ceea Mon Sep 17 00:00:00 2001
From: Kalpit Desai
Date: Mon, 26 Jul 2021 17:59:07 +0530
Subject: [PATCH 20/20] attempt to fix build error with anglr

---
 .circleci/config.yml | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/.circleci/config.yml b/.circleci/config.yml
index 75184656..a9b88a8d 100644
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -25,7 +25,8 @@ jobs:
           name: install application-level dependencies
           command: |
             sudo apt-get install -y pandoc libudunits2-dev libgdal-dev libxt-dev libglu1-mesa-dev libfftw3-dev libglpk40 libxml2-dev libcurl4-openssl-dev apt-transport-https software-properties-common
-            sudo R -e 'install.packages(c("curl", "devtools", "mvtnorm", "hexbin", "tidyverse", "tidymodels", "kknn", "kernlab", "pracma", "reshape2", "ggplot2", "datasets")); devtools::install_github("hypertidy/anglr"); devtools::install_github("ropensci/plotly"); devtools::install_github("johannesbjork/LaCroixColoR"); install.packages("BiocManager"); BiocManager::install("EBImage"); devtools::install_deps(dependencies = TRUE) '
+            sudo R -e 'install.packages(c("curl", "devtools", "mvtnorm", "hexbin", "tidyverse", "tidymodels", "kknn", "kernlab", "pracma", "reshape2", "ggplot2", "datasets")); devtools::install_github("ropensci/plotly"); devtools::install_github("johannesbjork/LaCroixColoR"); install.packages("BiocManager"); BiocManager::install("EBImage"); devtools::install_deps(dependencies = TRUE) '
+            sudo R -e 'install.packages("https://github.com/hypertidy/anglr/archive/refs/tags/v0.7.0.tar.gz", repos=NULL, type="source"); devtools::install_deps(dependencies = TRUE) '
       - save_cache:
           key: cache4
          paths: