
07 - Machine Learning with tidymodels

Data Science with R · Summer 2021


Uli Niemann · Knowledge Management & Discovery Lab
https://brain.cs.uni-magdeburg.de/kmd/DataSciR/
📷 Photo courtesy of Ulrich Arendt
tidymodels

2 / 78
3 / 78
[Figure: the mlr3 package ecosystem (source: https://mlr3.mlr-org.com/). The core package mlr3 provides tasks, learners, train-test-eval, resample and benchmark. Around it sit extension packages: mlr3learners and mlr3extralearners (core and additional learners), mlr3keras (connecting Keras to mlr3), mlr3filters and mlr3fselect (filter- and wrapper-based feature selection), mlr3tuning, mlr3hyperband and mlr3mbo together with paradox and bbotk (hyperparameter tuning, Hyperband, Bayesian and black-box optimization, parameter sets), mlr3pipelines (preprocessing, pipelines & ensembles), mlr3measures (performance measures), mlr3viz (visualization), mlr3db, mlr3data and mlr3oml (database backends, example datasets, OpenML connection), mlr3verse (meta-package), mlr3benchmark and mlr3batchmark (tools for benchmarking, batchtools connector), mlr3misc (helper functions), and task-specific packages mlr3spatiotemporal, mlr3raster, mlr3ordinal, mlr3multioutput, mlr3cluster, mlr3forecasting and mlr3proba (spatiotemporal resampling, enhanced spatial prediction, ordinal targets, multiple targets, cluster analysis, time-series forecasting & resampling, probabilistic learning & survival analysis). Packages are marked as stable, maturing, or planned.]
4 / 78
This tutorial is a condensed version of the 2-day workshop "Introduction to Machine Learning with the
Tidyverse" held by Dr. Alison Hill at the rstudio::conf 2020.

5 / 78
Setup
library(tidyverse)
library(tidymodels)

## -- Attaching packages ------------------------------------------------ tidymodels 0.1.2 --

## v broom 0.7.6 v recipes 0.1.15


## v dials 0.0.9 v rsample 0.0.9
## v infer 0.5.4 v tune 0.1.2
## v modeldata 0.1.0 v workflows 0.2.2
## v parsnip 0.1.5 v yardstick 0.0.7

## -- Conflicts --------------------------------------------------- tidymodels_conflicts() --


## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x kableExtra::group_rows() masks dplyr::group_rows()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()

6 / 78
Ames Iowa Housing Dataset
library(AmesHousing)
(ames <- make_ames() %>% select(-matches("Qu")))
"Data set contains
## # A tibble: 2,930 x 74
information from the Ames ## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
Assessor’s Office used in ## <fct> <fct> <dbl> <int> <fct> <fct> <fct>
computing assessed values ## 1 One_Story_1~ Resident~ 141 31770 Pave No_A~ Slightly~
for individual residential ## 2 One_Story_1~ Resident~ 80 11622 Pave No_A~ Regular
## 3 One_Story_1~ Resident~ 81 14267 Pave No_A~ Slightly~
properties sold in Ames, IA ## 4 One_Story_1~ Resident~ 93 11160 Pave No_A~ Regular
from 2006 to 2010." — ## 5 Two_Story_1~ Resident~ 74 13830 Pave No_A~ Slightly~
Dataset documentation ## 6 Two_Story_1~ Resident~ 78 9978 Pave No_A~ Slightly~
## 7 One_Story_P~ Resident~ 41 4920 Pave No_A~ Regular
## 8 One_Story_P~ Resident~ 43 5005 Pave No_A~ Slightly~
## 9 One_Story_P~ Resident~ 39 5389 Pave No_A~ Slightly~
## 10 Two_Story_1~ Resident~ 60 7500 Pave No_A~ Regular
## # ... with 2,920 more rows, and 67 more variables:
De Cock, Dean. "Ames, Iowa: Alternative ## # Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>,
to the Boston housing data as an end of ## # Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>,
semester regression project." Journal of ## # Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
Statistics Education 19.3 (2011). URL ## # Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>,
## # Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>,
## # Exterior_2nd <fct>, Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>,
## # Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>,
## # Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>,
## # BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
## # Total_Bsmt_SF <dbl>, Heating <fct>, Heating_QC <fct>,
## # Central_Air <fct>, Electrical <fct>, First_Flr_SF <int>,
## # Second_Flr_SF <int>, Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>,
7 / 78
Specify a model with parsnip

8 / 78
Specify a model with parsnip
1. Pick a model
2. Set the engine
3. Set the mode (if needed)

decision_tree() %>%            # model
  set_engine("rpart") %>%      # engine
  set_mode("classification")   # mode

## Decision Tree Model Specification (classification)
##
## Computational engine: rpart

nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression")

## K-Nearest Neighbor Model Specification (regression)
##
## Computational engine: kknn

11 / 78
All available models are listed at https://www.tidymodels.org/find/parsnip/#models.

"To learn about the parsnip package, see Get Started: Build a Model. Use the tables below to find model types and engines and to explore model arguments." (tidymodels.org)

[Screenshot: the searchable "Explore models" table on tidymodels.org, listing each model type together with its package, mode, and available engines.]
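Besides the website, the available engines can also be queried from the console. A minimal sketch, assuming show_engines() is available in your installed parsnip version (it was added in a 0.1.x release):

# parsnip is attached via library(tidymodels)
show_engines("decision_tree")   # tibble of registered engines and their modes
show_engines("linear_reg")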
12 / 78
1. Pick a model: linear_reg()

Specify a model that uses linear regression:

linear_reg(
  mode = "regression", # type of model (only "regression" here)
  penalty = NULL,      # amount of regularization
  mixture = NULL       # proportion of L1 regularization
)

13 / 78
2. Set the engine: set_engine()

Add an engine to power or implement the model:

linear_reg() %>%
  set_engine(engine = "lm", ...)

Available engines for linear_reg():

R: "lm" (the default) or "glmnet"
Stan: "stan"
Spark: "spark"
keras: "keras"

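As a hedged illustration of a non-default engine, a lasso spec could be written as follows (the penalty and mixture values are chosen purely for illustration and are not from the slides):

lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) %>% # mixture = 1: pure L1 (lasso)
  set_engine("glmnet")
lasso_spec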
14 / 78
3. Set the mode: set_mode()

Set the model mode, either "regression" or "classification". Not necessary if the mode is already set in step 1.

linear_reg() %>%
  set_engine(engine = "lm") %>%
  set_mode(mode = "regression")

15 / 78
fit()
fit(): fit a simple linear regression model to predict sale price based on above ground living area.

lm_spec <- linear_reg() %>%
  set_engine(engine = "lm") %>%
  set_mode(mode = "regression")

m <- fit(
  lm_spec,                  # parsnip model spec
  Sale_Price ~ Gr_Liv_Area, # formula
  ames                      # data frame
)
m

## parsnip model object
##
## Fit time: 10ms
##
## Call:
## stats::lm(formula = Sale_Price ~ Gr_Liv_Area, data = data)
##
## Coefficients:
## (Intercept) Gr_Liv_Area
## 13289.6 111.7

16 / 78
predict()
predict(): use a fitted model to predict new response values from data. Returns a tibble.

p <- predict(m, new_data = ames)
p

## # A tibble: 2,930 x 1
## .pred
## <dbl>
## 1 198255.
## 2 113367.
## 3 161731.
## 4 248964.
## 5 195239.
## 6 192447.
## 7 162736.
## 8 156258.
## 9 193787.
## 10 214786.
## # ... with 2,920 more rows

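Depending on the engine, predict() also supports other prediction types. A small sketch, assuming the "lm" engine (which supports confidence intervals for the predicted mean); output omitted:

predict(m, new_data = ames, type = "conf_int", level = 0.95) # returns .pred_lower / .pred_upper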
17 / 78
Measure model performance
with yardstick

18 / 78
Measure the model performance with yardstick::rmse()
Residuals: the difference between observed and predicted values, $y_i - \hat{y}_i$.

Mean Absolute Error: $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$

Root Mean Squared Error: $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

Calculate the RMSE based on two columns in a data frame:

truth: the observed values $y_i$
estimate: the predicted values $\hat{y}_i$

lm_spec <- linear_reg() %>%
  set_engine(engine = "lm") %>%
  set_mode(mode = "regression")

lm_fit <- fit(object = lm_spec, formula = Sale_Price ~ Gr_Liv_Area, data = ames)

price_pred <- lm_fit %>%
  predict(new_data = ames) %>%
  mutate(truth = ames$Sale_Price)

rmse(price_pred, truth = truth, estimate = .pred)

## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 56505.

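Other regression metrics in yardstick follow the same interface; a short sketch computing the mean absolute error and R² for the same predictions (output omitted):

mae(price_pred, truth = truth, estimate = .pred)
rsq(price_pred, truth = truth, estimate = .pred)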
20 / 78
Available metrics in yardstick
https://yardstick.tidymodels.org/articles/metric-types.html#metrics

"Below is a table of all of the metrics available in yardstick, grouped by type." (excerpt)

type   metric
class  accuracy()
class  bal_accuracy()
class  detection_prevalence()
class  f_meas()
class  j_index()
class  kap()
...
21 / 78
Perform resampling with
rsample

22 / 78
initial_split()
initial_split(): partition data randomly into a single training and a single test set.

set.seed(123)
(ames_split <- initial_split(ames, prop = 3/4)) # prop = proportion of training instances

## <Analysis/Assess/Total>
## <2198/732/2930>

23 / 78
training() and testing()
Extract training and testing sets from an rsplit object:

training(ames_split)

## # A tibble: 2,198 x 74
##    MS_SubClass      MS_Zoning    Lot_Frontage
##    <fct>            <fct>               <dbl>
##  1 One_Story_1946_~ Residential~          141
##  2 One_Story_1946_~ Residential~           80
##  3 One_Story_1946_~ Residential~           81
##  4 One_Story_1946_~ Residential~           93
##  5 Two_Story_1946_~ Residential~           74
##  6 Two_Story_1946_~ Residential~           78
##  7 One_Story_PUD_1~ Residential~           41
##  8 Two_Story_1946_~ Residential~           75
##  9 One_Story_1946_~ Residential~            0
## 10 One_Story_1946_~ Residential~           85
## # ... with 2,188 more rows, and 71 more variables: Lot_Area <int>, Street <fct>,
## #   Alley <fct>, Lot_Shape <fct>, Land_Contour <fct>, Utilities <fct>,
## #   Lot_Config <fct>, Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, ...

testing(ames_split)

## # A tibble: 732 x 74
##    MS_SubClass      MS_Zoning    Lot_Frontage
##    <fct>            <fct>               <dbl>
##  1 One_Story_PUD_1~ Residential~           43
##  2 One_Story_PUD_1~ Residential~           39
##  3 Two_Story_1946_~ Residential~           60
##  4 Two_Story_1946_~ Residential~           63
##  5 Two_Story_1946_~ Residential~           47
##  6 One_Story_1946_~ Residential~           88
##  7 One_Story_1946_~ Residential~            0
##  8 Two_Story_PUD_1~ Residential~           21
##  9 One_Story_1946_~ Residential~           95
## 10 One_Story_1946_~ Residential~           70
## # ... with 722 more rows, and 71 more variables: Lot_Area <int>, Street <fct>,
## #   Alley <fct>, Lot_Shape <fct>, Land_Contour <fct>, Utilities <fct>, ...

24 / 78
Stratified sampling
initial_split(ames, strata = Sale_Price, breaks = 6)

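A quick way to sanity-check the stratification is to compare the Sale_Price quartiles of both partitions; a minimal sketch (the seed is chosen only for reproducibility):

set.seed(123)
ames_strat <- initial_split(ames, prop = 3/4, strata = Sale_Price, breaks = 6)

quantile(training(ames_strat)$Sale_Price) # similar quartiles ...
quantile(testing(ames_strat)$Sale_Price)  # ... in both partitions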
25 / 78
Cross-validation with vfold_cv()
General syntax:

vfold_cv(data, v = 10, repeats = 1, strata = NULL, breaks = 4, ...)

Example: 5-fold CV on ames data:

set.seed(123)
(folds <- vfold_cv(ames, v = 5))

## # 5-fold cross-validation
## # A tibble: 5 x 2
## splits id
## <list> <chr>
## 1 <split [2344/586]> Fold1
## 2 <split [2344/586]> Fold2
## 3 <split [2344/586]> Fold3
## 4 <split [2344/586]> Fold4
## 5 <split [2344/586]> Fold5

Check whether mean y is approx. equal in each training fold:

map_dbl(folds$splits, ~ mean(.x$data$Sale_Price[.x$in_id]))

## [1] 181310.8 180991.0 180840.0 181268.6 179569.9
27 / 78
Calculate the model performance on multiple resamples
with fit_resamples()
res <- fit_resamples(lm_spec, Sale_Price ~ Gr_Liv_Area, resamples = folds)
res

## # Resampling results
## # 5-fold cross-validation
## # A tibble: 5 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [2344/586]> Fold1 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 2 <split [2344/586]> Fold2 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 3 <split [2344/586]> Fold3 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 4 <split [2344/586]> Fold4 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 5 <split [2344/586]> Fold5 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>

28 / 78
Collapse performance results across resamples with
collect_metrics()
res %>% collect_metrics()

## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 56486. 5 1866. Preprocessor1_Model1
## 2 rsq standard 0.504 5 0.0193 Preprocessor1_Model1

res %>% collect_metrics(summarize = FALSE)

## # A tibble: 10 x 5
## id .metric .estimator .estimate .config
## <chr> <chr> <chr> <dbl> <chr>
## 1 Fold1 rmse standard 51064. Preprocessor1_Model1
## 2 Fold1 rsq standard 0.542 Preprocessor1_Model1
## 3 Fold2 rmse standard 57206. Preprocessor1_Model1
## 4 Fold2 rsq standard 0.464 Preprocessor1_Model1
## 5 Fold3 rmse standard 53526. Preprocessor1_Model1
## 6 Fold3 rsq standard 0.557 Preprocessor1_Model1
## 7 Fold4 rmse standard 61210. Preprocessor1_Model1
## 8 Fold4 rsq standard 0.468 Preprocessor1_Model1
## 9 Fold5 rmse standard 59422. Preprocessor1_Model1
## 10 Fold5 rsq standard 0.488 Preprocessor1_Model1

29 / 78
metric_set()
metric_set(): a helper function for selecting yardstick metric functions.

fit_resamples(
  object,
  resamples,
  ...,
  metrics = metric_set(rmse, rsq),
  control = control_resamples()
)

If metrics = NULL:

regression: metric_set(rmse, rsq)
classification: metric_set(accuracy, roc_auc)

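A sketch of passing a custom metric set to fit_resamples(), reusing lm_spec and folds from above (the metric choice is only illustrative):

fit_resamples(lm_spec, Sale_Price ~ Gr_Liv_Area,
              resamples = folds,
              metrics = metric_set(rmse, mae, rsq)) %>%
  collect_metrics()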
30 / 78
Other resampling methods
loo_cv(): leave-one-out CV
mc_cv(): repeated holdout / Monte Carlo (random) CV: test sets sampled without replacement
bootstraps(): test sets sampled with replacement

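Hedged sketches of the corresponding rsample calls (the prop and times values are illustrative only):

set.seed(123)
mc_cv(ames, prop = 3/4, times = 25) # 25 random 75/25 train-test splits
bootstraps(ames, times = 25)        # 25 bootstrap resamples
# loo_cv(ames)                      # one resample per row -- expensive for 2,930 rows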
31 / 78
A classification example
stackoverflow <- read_rds(here::here("data/stackoverflow.rds"))
glimpse(stackoverflow)

## Rows: 1,150
## Columns: 21
## $ country <fct> United States, United States, United Kingdo~
## $ salary <dbl> 63750.00, 93000.00, 40625.00, 45000.00, 100~
## $ years_coded_job <int> 4, 9, 8, 3, 8, 12, 20, 17, 20, 4, 3, 13, 16~
## $ open_source <dbl> 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1~
## $ hobby <dbl> 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1~
## $ company_size_number <dbl> 20, 1000, 10000, 1, 10, 100, 20, 500, 1, 20~
## $ remote <fct> Remote, Remote, Remote, Remote, Remote, Rem~
## $ career_satisfaction <int> 8, 8, 5, 10, 8, 10, 9, 7, 8, 7, 9, 8, 8, 7,~
## $ data_scientist <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ database_administrator <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0~
## $ desktop_applications_developer <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0~
## $ developer_with_stats_math_background <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ dev_ops <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0~
## $ embedded_developer <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0~
## $ graphic_designer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ graphics_programming <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ machine_learning_specialist <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ mobile_developer <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1~
## $ quality_assurance_engineer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ systems_administrator <dbl> 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ web_developer <dbl> 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1~

Data source: Stack Overflow Annual Developer Survey


32 / 78
Specify a classification model
1. Pick a model
2. Set the engine
3. Set the mode

Specify a decision tree model with default parameter settings:

vanilla_tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

34 / 78
Measure the performance of a vanilla decision tree model using 5-fold CV:

set.seed(100)
so_cv <- vfold_cv(stackoverflow, v = 5)
(fit_van_res <- fit_resamples(vanilla_tree_spec, remote ~ ., resamples = so_cv) %>%
collect_metrics())

## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.639 5 0.00870 Preprocessor1_Model1
## 2 roc_auc binary 0.663 5 0.0155 Preprocessor1_Model1

🤔 "Can we improve the performance by tuning the algorithm parameters?"

🤔 "Which parameters can we tune?"

37 / 78
args()
args() prints the arguments for a parsnip model specification:

args(decision_tree)

## function (mode = "unknown", cost_complexity = NULL, tree_depth = NULL,
##     min_n = NULL)
## NULL

Arguments of decision_tree():

cost_complexity: minimum fit improvement of a split (0 < cost_complexity ≤ 1)
tree_depth: maximum number of levels in the tree
min_n: minimum number of observations in a node in order for a split to be attempted

39 / 78
decision_tree(
cost_complexity = 0.01, # min. fit improvement of a split (0 < cp <=1)
tree_depth = 30, # max. number of levels in the tree
min_n = 20 # min. number of observations in a node in order for a split to be attempted
)

## Decision Tree Model Specification (unknown)
##
## Main Arguments:
##   cost_complexity = 0.01
##   tree_depth = 30
##   min_n = 20

If the arguments are left at their defaults (NULL), the engine's underlying model function's default values are used.

For example, rpart is the default engine. Its default parameters are:

args(rpart::rpart.control) # cost_complexity -> cp; tree_depth -> maxdepth; min_n -> minsplit

## function (minsplit = 20L, minbucket = round(minsplit/3), cp = 0.01,
##     maxcompete = 4L, maxsurrogate = 5L, usesurrogate = 2L, xval = 10L,
##     surrogatestyle = 0L, maxdepth = 30L, ...)
## NULL

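parsnip's translate() shows how the specification arguments map onto the engine's own arguments; a short sketch:

decision_tree(cost_complexity = 0.01, tree_depth = 30, min_n = 20) %>%
  set_engine("rpart") %>%
  set_mode("classification") %>%
  translate() # prints the rpart::rpart() fit template with cp, maxdepth and minsplit filled in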
41 / 78
set_args()
set_args(): change the arguments for a parsnip model specification:

dt_spec <- decision_tree()

dt_spec %>% set_args(tree_depth = 3)

## Decision Tree Model Specification (unknown)
##
## Main Arguments:
##   tree_depth = 3

... which is equivalent to:

dt_spec <- decision_tree(tree_depth = 3)
dt_spec

## Decision Tree Model Specification (unknown)
##
## Main Arguments:
##   tree_depth = 3

An example spec combining model, engine, mode and tree depth:

decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification") %>%
  set_args(tree_depth = 3)

## Decision Tree Model Specification (classification)
##
## Main Arguments:
##   tree_depth = 3
##
## Computational engine: rpart
44 / 78
45 / 78
46 / 78
[Figures: an overfitted tree (cost_complexity = 0.0008) vs. the optimal tree (cost_complexity = 0.0093)]

47 / 78
workflow()
Create a workflow with workflow().

add_formula()
Add a formula to a workflow:

workflow() %>% add_formula(Sale_Price ~ Year)

add_model()
Add a parsnip model spec to a workflow:

workflow() %>% add_model(lm_spec)

50 / 78
Example workflow
wf <- workflow() %>%
add_formula(remote ~ .) %>%
add_model(decision_tree() %>% set_engine("rpart") %>% set_mode("classification"))

wf %>% fit_resamples(so_cv)

## # Resampling results
## # 5-fold cross-validation
## # A tibble: 5 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [920/230]> Fold1 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 2 <split [920/230]> Fold2 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 3 <split [920/230]> Fold3 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 4 <split [920/230]> Fold4 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 5 <split [920/230]> Fold5 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>

51 / 78
update_formula()
Replace a workflow formula with a new one:

workflow() %>%
add_formula(remote ~ .) %>%
update_formula(remote ~ salary + open_source)

## == Workflow ==============================================================================
## Preprocessor: Formula
## Model: None
##
## -- Preprocessor --------------------------------------------------------------------------
## remote ~ salary + open_source

52 / 78
update_model()
Replaces a workflow model spec with a new one:

workflow() %>%
add_model(nearest_neighbor()) %>%
update_model(decision_tree())

## == Workflow ==============================================================================
## Preprocessor: None
## Model: decision_tree()
##
## -- Model ---------------------------------------------------------------------------------
## Decision Tree Model Specification (unknown)

53 / 78
Tune model hyperparameters
with tune

54 / 78
tune()
tune() is a placeholder for hyperparameters that are to be tuned:

decision_tree(cost_complexity = tune())

## Decision Tree Model Specification (unknown)


##
## Main Arguments:
## cost_complexity = tune()

55 / 78
tune_grid()
A version of fit_resamples() that performs a grid search for the best combination of tuned hyper-parameters.

tune_grid(
object, # a model workflow, R formula or recipe object.
resamples, # a resampling object, e.g. the output of vfold_cv()
...,
  grid = 10, # number of candidate parameter combinations to generate, or a data frame of combinations (tuning grid)
metrics = NULL, # yardstick::metric_set() or NULL
control = control_grid() # An object used to modify the tuning process
)

56 / 78
expand_grid()
tidyr::expand_grid(): takes one or more vectors, and returns a data frame holding all combinations of their
values.

expand_grid(cost_complexity = 10^(0:-5), min_n = seq(4,20,4))

## # A tibble: 30 x 2
## cost_complexity min_n
## <dbl> <dbl>
## 1 1 4
## 2 1 8
## 3 1 12
## 4 1 16
## 5 1 20
## 6 0.1 4
## 7 0.1 8
## 8 0.1 12
## 9 0.1 16
## 10 0.1 20
## # ... with 20 more rows

expand_grid() is a re-implementation of the base expand.grid().

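As an alternative to a hand-built grid, the dials package (loaded with tidymodels) can generate grids from parameter objects; a hedged sketch with illustrative sizes:

grid_regular(cost_complexity(), tree_depth(), levels = 5) # 5 x 5 regular grid

set.seed(123)
grid_random(cost_complexity(), tree_depth(), size = 20)   # 20 random combinations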
57 / 78
dt_spec <- decision_tree(
cost_complexity = tune(),
tree_depth = tune()
) %>%
set_engine("rpart") %>%
set_mode("classification")

dt_wf <- workflow() %>%
  add_model(dt_spec) %>%
  add_formula(remote ~ .)

dt_res <- dt_wf %>%
  tune_grid(resamples = so_cv,
            grid = expand_grid(cost_complexity = 10^-(1:5), tree_depth = 1:6))
dt_res

## # Tuning results
## # 5-fold cross-validation
## # A tibble: 5 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [920/230]> Fold1 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]>
## 2 <split [920/230]> Fold2 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]>
## 3 <split [920/230]> Fold3 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]>
## 4 <split [920/230]> Fold4 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]>
## 5 <split [920/230]> Fold5 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]>

58 / 78
dt_res %>%
collect_metrics() %>%
filter(.metric == "accuracy") %>%
arrange(desc(mean))

## # A tibble: 30 x 8
## cost_complexity tree_depth .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model~
## 2 0.0001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model~
## 3 0.00001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model~
## 4 0.01 2 accuracy binary 0.656 5 0.0142 Preprocessor1_Model~
## 5 0.01 3 accuracy binary 0.649 5 0.0142 Preprocessor1_Model~
## 6 0.001 5 accuracy binary 0.646 5 0.00488 Preprocessor1_Model~
## 7 0.001 6 accuracy binary 0.646 5 0.00918 Preprocessor1_Model~
## 8 0.0001 5 accuracy binary 0.646 5 0.00488 Preprocessor1_Model~
## 9 0.0001 6 accuracy binary 0.646 5 0.00918 Preprocessor1_Model~
## 10 0.00001 5 accuracy binary 0.646 5 0.00488 Preprocessor1_Model~
## # ... with 20 more rows

59 / 78
show_best()
show_best(): display the n best hyperparameters combinations according to a metric:

dt_res %>%
show_best(metric = "accuracy", n = 5)

## # A tibble: 5 x 8
## cost_complexity tree_depth .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model14
## 2 0.0001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model20
## 3 0.00001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model26
## 4 0.01 2 accuracy binary 0.656 5 0.0142 Preprocessor1_Model08
## 5 0.01 3 accuracy binary 0.649 5 0.0142 Preprocessor1_Model09

60 / 78
autoplot()
autoplot(): quickly visualize tuning results

dt_res %>% autoplot()

61 / 78
select_best()
select_best() returns the best combination of hyperparameters according to a metric:

so_best <- dt_res %>% select_best(metric = "roc_auc")
so_best

## # A tibble: 1 x 3
## cost_complexity tree_depth .config
## <dbl> <int> <chr>
## 1 0.001 2 Preprocessor1_Model14

62 / 78
finalize_workflow()
finalize_workflow(): replaces tune() placeholders in a model/recipe/workflow with a set of hyper-parameter
values.

dt_wf_final <- dt_wf %>% finalize_workflow(so_best)
dt_wf_final

## == Workflow ==============================================================================
## Preprocessor: Formula
## Model: decision_tree()
##
## -- Preprocessor --------------------------------------------------------------------------
## remote ~ .
##
## -- Model ---------------------------------------------------------------------------------
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = 0.001
## tree_depth = 2
##
## Computational engine: rpart

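What usually comes next is fitting the finalized workflow; a minimal sketch that simply fits on the full stackoverflow data (in practice you would fit on a training set and evaluate on held-out data):

dt_final_fit <- fit(dt_wf_final, data = stackoverflow)

predict(dt_final_fit, new_data = stackoverflow) %>% head()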
63 / 78
Preprocessing with recipes

64 / 78
1. Create a recipe()

2. Define the predictor and outcome variables

3. Add one or more preprocessing step specifications

4. Calculate statistics from the training set

5. Apply preprocessing to datasets

65 / 78
recipe()
recipe(): create a recipe by specifying predictors, responses and a reference (template) data frame.

recipe(Sale_Price ~ ., data = ames)

## Data Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor         73
66 / 78
step_*()
step_*(): add preprocessing step specifications in the order they will be performed.

recipe(Sale_Price ~ ., data = ames) %>%
  # step_novel(): assign a previously unseen factor level to a new value
  step_novel(all_nominal()) %>%
  # step_zv(): zero-variance filter: remove variables that contain only a single value
  step_zv(all_predictors())

## Data Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor         73
##
## Operations:
##
## Novel factor level assignment for all_nominal()
## Zero variance filter on all_predictors()

67 / 78
step_*()
Complete list at: https://recipes.tidymodels.org/reference/index.html

Role functions: add_role(), update_role(), remove_role() manually alter variable roles.

Imputation steps (excerpt): step_impute_bag() (imputation via bagged trees), step_impute_knn() (imputation via k-nearest neighbors), step_impute_linear() (imputation of numeric variables via a linear model), ...
68 / 78
Selectors
Selectors, e.g. all_nominal() and all_predictors(), are helper functions for selecting sets of variables; they behave similarly to the select helpers from dplyr.

rec %>%
step_novel(all_nominal()) %>%
step_zv(all_predictors())

selector description
all_predictors() Each x variable (right side of ~)
all_outcomes() Each y variable (left side of ~)
all_numeric() Each numeric variable
all_nominal() Each categorical variable (e.g. factor, string)
dplyr::select() helpers starts_with('Lot_'), etc.

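Selectors can be combined and negated within a step; a small sketch (the steps are chosen only for illustration):

recipe(Sale_Price ~ ., data = ames) %>%
  step_center(all_numeric(), -all_outcomes()) %>% # all numeric predictors, not the outcome
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal())                       # every factor/string predictor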
70 / 78
prep()
prep() "trains" a recipe, i.e., calculates statistics from the training data

1. Create a recipe() recipe(Sale_Price ~ ., data = ames) %>%


step_novel(all_nominal()) %>%
step_zv(all_predictors()) %>%
2. Define the predictor prep(training = training(ames_split))

and outcome variables ## Data Recipe


##
## Inputs:
3. Add one or more ##
preprocessing step ##
##
role #variables
outcome 1
specifications ## predictor 73
##
## Training data contained 2198 data points and no missing data.
4. Calculate statistics ##
## Operations:
from the training set ##
## Novel factor level assignment for MS_SubClass, MS_Zoning, Street, Alley, ... [tra
## Zero variance filter removed no terms [trained]
5. Apply preprocessing
to datasets

71 / 78
bake()
bake() transforms data with the prepped recipe.

recipe(Sale_Price ~ ., data = ames) %>%
  step_novel(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  prep(training = training(ames_split)) %>%
  bake(new_data = testing(ames_split)) # or training(ames_split)

## # A tibble: 732 x 74
##    MS_SubClass     MS_Zoning   Lot_Frontage Lot_Area Street Alley  Lot_Shape  Lan~
##    <fct>           <fct>              <dbl>    <int> <fct>  <fct>  <fct>      <fc~
##  1 One_Story_PUD_~ Residentia~           43     5005 Pave   No_Al~ Slightly_~ HLS
##  2 One_Story_PUD_~ Residentia~           39     5389 Pave   No_Al~ Slightly_~ Lvl
##  3 Two_Story_1946~ Residentia~           60     7500 Pave   No_Al~ Regular    Lvl
##  4 Two_Story_1946~ Residentia~           63     8402 Pave   No_Al~ Slightly_~ Lvl
##  5 Two_Story_1946~ Residentia~           47    53504 Pave   No_Al~ Moderatel~ HLS
##  6 One_Story_1946~ Residentia~           88    11394 Pave   No_Al~ Regular    Lvl
##  7 One_Story_1946~ Residentia~            0    11241 Pave   No_Al~ Slightly_~ Lvl
##  8 Two_Story_PUD_~ Residentia~           21     1680 Pave   No_Al~ Regular    Lvl
##  9 One_Story_1946~ Residentia~           95    12182 Pave   No_Al~ Regular    Lvl
## 10 One_Story_1946~ Residentia~           70    10171 Pave   No_Al~ Slightly_~ Lvl
## # ... with 722 more rows, and 66 more variables: Utilities <fct>, Lot_Config <fct>,
## #   Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
## #   Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <int>,
## #   Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>,
## #   Exterior_2nd <fct>, Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, Exter_Cond <fct>,
## #   Foundation <fct>, Bsmt_Cond <fct>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>,
## #   BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>, ...

72 / 78
73 / 78
juice()
juice() returns the preprocessed training data from a prepped recipe, without having to rerun the preprocessing steps on the training data.

rec <- recipe(Sale_Price ~ ., data = ames) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())

rec %>%
  prep(training = training(ames_split),
       retain = TRUE) %>%
  juice()

## # A tibble: 2,198 x 74
## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape Land_Contour
## <fct> <fct> <dbl> <dbl> <fct> <fct> <fct> <fct>
## 1 One_Story_1946~ Residentia~ 2.46 2.64 Pave No_Al~ Slightly_~ Lvl
## 2 One_Story_1946~ Residentia~ 0.658 0.185 Pave No_Al~ Regular Lvl
## 3 One_Story_1946~ Residentia~ 0.687 0.507 Pave No_Al~ Slightly_~ Lvl
## 4 One_Story_1946~ Residentia~ 1.04 0.128 Pave No_Al~ Regular Lvl
## 5 Two_Story_1946~ Residentia~ 0.480 0.454 Pave No_Al~ Slightly_~ Lvl
## 6 Two_Story_1946~ Residentia~ 0.598 -0.0156 Pave No_Al~ Slightly_~ Lvl
## 7 One_Story_PUD_~ Residentia~ -0.496 -0.632 Pave No_Al~ Regular Lvl
## 8 Two_Story_1946~ Residentia~ 0.510 -0.0129 Pave No_Al~ Slightly_~ Lvl
## 9 One_Story_1946~ Residentia~ -1.71 -0.259 Pave No_Al~ Slightly_~ Lvl
## 10 One_Story_1946~ Residentia~ 0.805 0.00851 Pave No_Al~ Regular Lvl
## # ... with 2,188 more rows, and 66 more variables: Utilities <fct>, Lot_Config <fct>,
## # Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
## # Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <dbl>,
## #   Year_Remod_Add <dbl>, Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, ...

74 / 78
A full workflow
set.seed(123)
so_cv <- vfold_cv(stackoverflow, v = 5)
so_rec <- recipe(remote ~ ., data = stackoverflow) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_corr(all_predictors(), threshold = 0.5)

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

so_wf <- workflow() %>%
  add_model(tree_spec) %>%
  add_recipe(so_rec)

fit_resamples(so_wf, # note: workflow object instead of model spec
              resamples = so_cv,
              metrics = metric_set(accuracy, sens, spec),
              control = control_resamples(save_pred = TRUE)) %>%
  # collect_metrics() %>%
  collect_predictions() %>%
  conf_mat(remote, .pred_class)

## Truth
## Prediction Remote Not remote
## Remote 381 224
## Not remote 194 351

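The confusion matrix object can also be expanded into a whole set of class metrics with yardstick's summary() method; a sketch reusing the pipeline above (output omitted):

so_cm <- fit_resamples(so_wf, resamples = so_cv,
                       control = control_resamples(save_pred = TRUE)) %>%
  collect_predictions() %>%
  conf_mat(remote, .pred_class)

summary(so_cm) # accuracy, kap, sens, spec, ppv, npv, ...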
75 / 78
You can tune models and recipes!

pca_tuner <- recipe(Sale_Price ~ ., data = ames) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = tune())

pca_twf <- workflow() %>%
  add_recipe(pca_tuner) %>%
  add_model(nearest_neighbor(neighbors = tune()) %>%
              set_engine("kknn") %>% set_mode("regression"))
tg <- expand_grid(num_comp = 2:10, neighbors = seq(1, 15, 4))
set.seed(100)
cv_folds <- vfold_cv(ames, v = 5, strata = Sale_Price, breaks = 4)
set.seed(100)
pca_results <- pca_twf %>%
tune_grid(resamples = cv_folds, grid = tg)
pca_results %>% show_best(metric = "rmse")

## # A tibble: 5 x 8
## neighbors num_comp .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 9 7 rmse standard 31793. 5 968. Preprocessor6_Model3
## 2 13 7 rmse standard 31961. 5 1157. Preprocessor6_Model4
## 3 9 8 rmse standard 31963. 5 1099. Preprocessor7_Model3
## 4 9 5 rmse standard 32141. 5 951. Preprocessor4_Model3
## 5 13 8 rmse standard 32180. 5 1234. Preprocessor7_Model4

76 / 78
Session info
## setting value
## version R version 4.0.5 (2021-03-31)
## os Windows 10 x64
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.1252
## ctype English_United States.1252
## tz Europe/Berlin
## date 2021-05-10

package version date source package version date source


AmesHousing 0.0.4 2020-06-23 CRAN (R 4.0.3) recipes 0.1.15 2020-11-11 CRAN (R 4.0.3)
broom 0.7.6 2021-04-05 CRAN (R 4.0.5) rlang 0.4.11 2021-04-30 CRAN (R 4.0.5)
dials 0.0.9 2020-09-16 CRAN (R 4.0.3) rpart 4.1.15 2019-04-12 CRAN (R 4.0.5)
dplyr 1.0.5 2021-03-05 CRAN (R 4.0.4) rpart.plot 3.0.9 2020-09-17 CRAN (R 4.0.2)
forcats 0.5.1 2021-01-27 CRAN (R 4.0.3) rsample 0.0.9 2021-02-17 CRAN (R 4.0.4)
ggplot2 3.3.3 2020-12-30 CRAN (R 4.0.3) scales 1.1.1 2020-05-11 CRAN (R 4.0.2)
infer 0.5.4 2021-01-13 CRAN (R 4.0.3) stringr 1.4.0 2019-02-10 CRAN (R 4.0.2)
kableExtra 1.3.4 2021-02-20 CRAN (R 4.0.3) tibble 3.1.1 2021-04-18 CRAN (R 4.0.5)
kknn 1.3.1 2016-03-26 CRAN (R 4.0.4) tidymodels 0.1.2 2020-11-22 CRAN (R 4.0.3)
knitr 1.31 2021-01-27 CRAN (R 4.0.3) tidyr 1.1.3 2021-03-03 CRAN (R 4.0.4)
modeldata 0.1.0 2020-10-22 CRAN (R 4.0.5) tidyverse 1.3.0 2019-11-21 CRAN (R 4.0.2)
parsnip 0.1.5 2021-01-19 CRAN (R 4.0.3) tune 0.1.2 2020-11-17 CRAN (R 4.0.3)
patchwork 1.1.1 2020-12-17 CRAN (R 4.0.3) vctrs 0.3.8 2021-04-29 CRAN (R 4.0.5)
purrr 0.3.4 2020-04-17 CRAN (R 4.0.2) workflows 0.2.2 2021-03-10 CRAN (R 4.0.4)
readr 1.4.0 2020-10-05 CRAN (R 4.0.3) yardstick 0.0.7 2020-07-13 CRAN (R 4.0.3)

77 / 78
Thank you! Questions?
📷 Photo courtesy of Stefan Berger
