07 Tidymodels
[Figure: overview of the mlr3 package ecosystem, grouped by area. Core: mlr3, mlr3misc (devel helper functions), mlr3data (example datasets), mlr3db (database backends), OpenML connection. Learners: mlr3learners (core learners for mlr3), mlr3extralearners (additional learners). Feature selection: mlr3filters (filter-based FS), mlr3fselect (wrapper-based FS). Tuning: mlr3tuning (hyperparameter tuning), mlr3hyperband (Hyperband parameter tuning), bbotk (black-box optimization). Pipelines: mlr3pipelines. Tasks: mlr3spatiotemporal (spatiotemporal resampling), mlr3ordinal (ordinal targets), mlr3multioutput (multiple targets), mlr3cluster (cluster analysis). Utilities: mlr3verse (meta-package), mlr3benchmark (benchmarking tools), mlr3batchmark (mlr3-batchtools connector). Visualization: mlr3viz. Maturity levels: stable, maturing, planned.]
4 / 78
This tutorial is a condensed version of the two-day workshop "Introduction to Machine Learning with the
Tidyverse" held by Dr. Alison Hill at rstudio::conf 2020.
5 / 78
Setup
library(tidyverse)
library(tidymodels)
6 / 78
Ames Iowa Housing Dataset
library(AmesHousing)
(ames <- make_ames() %>% select(-matches("Qu")))
"Data set contains
information from the Ames Assessor's Office used in computing assessed values for individual residential
properties sold in Ames, IA from 2006 to 2010." (Dataset documentation)

De Cock, Dean. "Ames, Iowa: Alternative to the Boston housing data as an end of semester regression
project." Journal of Statistics Education 19.3 (2011).

## # A tibble: 2,930 x 74
##    MS_SubClass  MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
##    <fct>        <fct>            <dbl>    <int> <fct>  <fct> <fct>
##  1 One_Story_1~ Resident~          141    31770 Pave   No_A~ Slightly~
##  2 One_Story_1~ Resident~           80    11622 Pave   No_A~ Regular
##  3 One_Story_1~ Resident~           81    14267 Pave   No_A~ Slightly~
##  4 One_Story_1~ Resident~           93    11160 Pave   No_A~ Regular
##  5 Two_Story_1~ Resident~           74    13830 Pave   No_A~ Slightly~
##  6 Two_Story_1~ Resident~           78     9978 Pave   No_A~ Slightly~
##  7 One_Story_P~ Resident~           41     4920 Pave   No_A~ Regular
##  8 One_Story_P~ Resident~           43     5005 Pave   No_A~ Slightly~
##  9 One_Story_P~ Resident~           39     5389 Pave   No_A~ Slightly~
## 10 Two_Story_1~ Resident~           60     7500 Pave   No_A~ Regular
## # ... with 2,920 more rows, and 67 more variables: Land_Contour <fct>, Utilities <fct>,
## #   Lot_Config <fct>, Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
## #   Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>,
## #   Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
## #   Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>, Bsmt_Exposure <fct>,
## #   BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>,
## #   Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, Heating <fct>, Heating_QC <fct>, Central_Air <fct>,
## #   Electrical <fct>, First_Flr_SF <int>, Second_Flr_SF <int>, Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>,
7 / 78
Specify a model with parsnip
8 / 78
Specify a model with parsnip
1. Pick a model
2. Set the engine
3. Set the mode (if needed)
## Decision Tree Model Specification (classification)
##
## Computational engine: rpart

## K-Nearest Neighbor Model Specification (regression)
##
## Computational engine: kknn
11 / 78
All available models are listed at https://www.tidymodels.org/find/parsnip/#models.
To learn about the parsnip package, see Get Started: Build a Model. Use the tables on that page to find
model types and engines and to explore model arguments.
12 / 78
linear_reg()
1. Pick a model: specify a model that uses linear regression.
13 / 78
set_engine()
2. Set the engine: add an engine to power or implement the model.
14 / 78
set_mode()
3. Set the mode: set the model mode, either "regression" or "classification". Not necessary if the mode
was already set in step 1.
linear_reg() %>%
set_engine(engine = "lm") %>%
set_mode(mode = "regression")
15 / 78
fit()
fit(): fit a simple linear regression model to predict sale price based on above ground living area.
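A minimal sketch of that fit, with assumed object names (lm_spec and lm_fit are reused in the sketches on later slides):

lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
lm_fit <- fit(lm_spec, Sale_Price ~ Gr_Liv_Area, data = ames)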
16 / 78
predict()
predict(): use a fitted model to predict new response values from data. Returns a tibble.
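Assuming the fitted model lm_fit from the previous sketch, output like the tibble below would come from:

predict(lm_fit, new_data = ames)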
## # A tibble: 2,930 x 1
## .pred
## <dbl>
## 1 198255.
## 2 113367.
## 3 161731.
## 4 248964.
## 5 195239.
## 6 192447.
## 7 162736.
## 8 156258.
## 9 193787.
## 10 214786.
## # ... with 2,920 more rows
17 / 78
Measure model performance
with yardstick
18 / 78
Measure the model performance with yardstick::rmse()
Residuals: the difference between observed and predicted values, $\hat{y}_i - y_i$.

Mean Absolute Error: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |\hat{y}_i - y_i|$.

Root Mean Squared Error: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$.

In yardstick terms, the observed value $y_i$ is the truth and the predicted value $\hat{y}_i$ is the estimate.
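A sketch of a call that produces output of the form below, again assuming lm_fit from the fit() sketch:

ames %>%
  bind_cols(predict(lm_fit, new_data = ames)) %>%  # add the .pred column to the data
  rmse(truth = Sale_Price, estimate = .pred)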
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 56505.
20 / 78
Available metrics in yardstick
https://yardstick.tidymodels.org/articles/metric-types.html#metrics
Below is a table of all of the metrics available in yardstick, grouped by type, for example:

class  accuracy()
class  bal_accuracy()
class  detection_prevalence()
class  f_meas()
class  j_index()
class  kap()
21 / 78
Perform resampling with
rsample
22 / 78
initial_split()
initial_split(): partition data randomly into a single training and a single test set.
set.seed(123)
(ames_split <- initial_split(ames, prop = 3/4)) # prop = proportion of training instances
## <Analysis/Assess/Total>
## <2198/732/2930>
23 / 78
training() and testing()
Extract training and testing sets from an rsplit object:
training(ames_split)
testing(ames_split)
25 / 78
Cross-validation with vfold_cv()
General syntax:
set.seed(123)
(folds <- vfold_cv(ames, v = 5))
## # 5-fold cross-validation
## # A tibble: 5 x 2
## splits id
## <list> <chr>
## 1 <split [2344/586]> Fold1
## 2 <split [2344/586]> Fold2
## 3 <split [2344/586]> Fold3
## 4 <split [2344/586]> Fold4
## 5 <split [2344/586]> Fold5
map_dbl(folds$splits, ~mean(.x$data$Sale_Price[.x$in_id])) # mean Sale_Price in each analysis (training) set
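The resampling results below are presumably the output of fit_resamples(); a minimal sketch, assuming the lm_spec specification from the earlier fit() sketch and the object name res used on the next slide:

res <- fit_resamples(lm_spec, Sale_Price ~ Gr_Liv_Area, resamples = folds)
res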
## # Resampling results
## # 5-fold cross-validation
## # A tibble: 5 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [2344/586]> Fold1 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 2 <split [2344/586]> Fold2 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 3 <split [2344/586]> Fold3 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 4 <split [2344/586]> Fold4 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 5 <split [2344/586]> Fold5 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
28 / 78
Collapse performance results across resamples with
collect_metrics()
res %>% collect_metrics()
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 56486. 5 1866. Preprocessor1_Model1
## 2 rsq standard 0.504 5 0.0193 Preprocessor1_Model1
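The unaggregated, per-fold values below presumably come from turning off summarization:

res %>% collect_metrics(summarize = FALSE)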
## # A tibble: 10 x 5
## id .metric .estimator .estimate .config
## <chr> <chr> <chr> <dbl> <chr>
## 1 Fold1 rmse standard 51064. Preprocessor1_Model1
## 2 Fold1 rsq standard 0.542 Preprocessor1_Model1
## 3 Fold2 rmse standard 57206. Preprocessor1_Model1
## 4 Fold2 rsq standard 0.464 Preprocessor1_Model1
## 5 Fold3 rmse standard 53526. Preprocessor1_Model1
## 6 Fold3 rsq standard 0.557 Preprocessor1_Model1
## 7 Fold4 rmse standard 61210. Preprocessor1_Model1
## 8 Fold4 rsq standard 0.468 Preprocessor1_Model1
## 9 Fold5 rmse standard 59422. Preprocessor1_Model1
## 10 Fold5 rsq standard 0.488 Preprocessor1_Model1
29 / 78
metric_set()
metric_set(): a helper function for selecting yardstick metric functions.
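A minimal sketch with an arbitrary choice of regression metrics (the object name and metric choice are assumptions):

reg_metrics <- metric_set(rmse, rsq, mae)
fit_resamples(lm_spec, Sale_Price ~ Gr_Liv_Area, resamples = folds, metrics = reg_metrics)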
30 / 78
Other resampling methods
loo_cv(): leave-one-out CV
mc_cv(): repeated holdout / Monte Carlo (random) CV: test sets sampled without replacement
bootstraps(): test sets sampled with replacement
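For illustration, each function takes the data set plus method-specific arguments:

loo_cv(ames)                        # leave-one-out CV
mc_cv(ames, prop = 3/4, times = 25) # 25 random 75/25 train/test splits
bootstraps(ames, times = 25)        # 25 bootstrap resamples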
31 / 78
A classification example
stackoverflow <- read_rds(here::here("data/stackoverflow.rds"))
glimpse(stackoverflow)
## Rows: 1,150
## Columns: 21
## $ country <fct> United States, United States, United Kingdo~
## $ salary <dbl> 63750.00, 93000.00, 40625.00, 45000.00, 100~
## $ years_coded_job <int> 4, 9, 8, 3, 8, 12, 20, 17, 20, 4, 3, 13, 16~
## $ open_source <dbl> 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1~
## $ hobby <dbl> 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1~
## $ company_size_number <dbl> 20, 1000, 10000, 1, 10, 100, 20, 500, 1, 20~
## $ remote <fct> Remote, Remote, Remote, Remote, Remote, Rem~
## $ career_satisfaction <int> 8, 8, 5, 10, 8, 10, 9, 7, 8, 7, 9, 8, 8, 7,~
## $ data_scientist <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ database_administrator <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0~
## $ desktop_applications_developer <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0~
## $ developer_with_stats_math_background <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ dev_ops <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0~
## $ embedded_developer <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0~
## $ graphic_designer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ graphics_programming <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ machine_learning_specialist <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ mobile_developer <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1~
## $ quality_assurance_engineer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ systems_administrator <dbl> 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ web_developer <dbl> 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1~
34 / 78
Measure the performance of a vanilla decision tree model using 5-fold CV:
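The vanilla_tree_spec used below is not defined on this slide; presumably it is a decision tree specification with all arguments left at their defaults, e.g.:

vanilla_tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")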
set.seed(100)
so_cv <- vfold_cv(stackoverflow, v = 5)
(fit_van_res <- fit_resamples(vanilla_tree_spec, remote ~ ., resamples = so_cv) %>%
collect_metrics())
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.639 5 0.00870 Preprocessor1_Model1
## 2 roc_auc binary 0.663 5 0.0155 Preprocessor1_Model1
37 / 78
args()
args() prints the arguments for a parsnip model specification:
args(decision_tree)
Arguments of decision_tree():
39 / 78
decision_tree(
cost_complexity = 0.01, # min. fit improvement of a split (0 < cp <=1)
tree_depth = 30, # max. number of levels in the tree
min_n = 20 # min. number of observations in a node in order for a split to be attempted
)
If the arguments are left at their defaults (NULL), the engine's underlying model function supplies the
default values.
For example, rpart is the default engine; its default parameters are:
args(rpart::rpart.control) # cost_complexity -> cp; tree_depth -> maxdepth; min_n -> minsplit
41 / 78
set_args()
set_args(): change the arguments for a parsnip model specification:
An example spec with model, engine, mode, and tree depth:

decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification") %>%
  set_args(tree_depth = 3)

## Decision Tree Model Specification (classification)
##
## Main Arguments:
##   tree_depth = 3
##
## Computational engine: rpart

... which is equivalent to setting the argument directly in the constructor:

dt_spec <- decision_tree(tree_depth = 3)
dt_spec

## Decision Tree Model Specification (unknown)
##
## Main Arguments:
##   tree_depth = 3
44 / 78
45 / 78
46 / 78
Overfitted tree (cost_complexity = 0.0008) vs. optimal tree (cost_complexity = 0.0093): [tree plots not shown]
47 / 78
workflow()
Create a workflow with workflow().
add_formula()
Add a formula to a workflow:
add_model()
Add a parsnip model spec to a workflow:
50 / 78
Example workflow
wf <- workflow() %>%
add_formula(remote ~ .) %>%
add_model(decision_tree() %>% set_engine("rpart") %>% set_mode("classification"))
wf %>% fit_resamples(so_cv)
## # Resampling results
## # 5-fold cross-validation
## # A tibble: 5 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [920/230]> Fold1 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 2 <split [920/230]> Fold2 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 3 <split [920/230]> Fold3 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 4 <split [920/230]> Fold4 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
## 5 <split [920/230]> Fold5 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]>
51 / 78
update_formula()
Replace a workflow formula with a new one:
workflow() %>%
add_formula(remote ~ .) %>%
update_formula(remote ~ salary + open_source)
## == Workflow ==============================================================================
## Preprocessor: Formula
## Model: None
##
## -- Preprocessor --------------------------------------------------------------------------
## remote ~ salary + open_source
52 / 78
update_model()
Replace a workflow model spec with a new one:
workflow() %>%
add_model(nearest_neighbor()) %>%
update_model(decision_tree())
## == Workflow ==============================================================================
## Preprocessor: None
## Model: decision_tree()
##
## -- Model ---------------------------------------------------------------------------------
## Decision Tree Model Specification (unknown)
53 / 78
Tune model hyperparameters
with tune
54 / 78
tune()
tune() is a placeholder for hyperparameters that are to be tuned:
decision_tree(cost_complexity = tune())
55 / 78
tune_grid()
A version of fit_resamples() that performs a grid search for the best combination of tuned hyperparameters.
tune_grid(
  object,                  # a workflow, or a parsnip model spec (then also supply a formula/recipe preprocessor)
  resamples,               # a resampling object, e.g. the output of vfold_cv()
  ...,
  grid = 10,               # number of candidate combinations to generate, or a data frame of combinations (tuning grid)
  metrics = NULL,          # yardstick::metric_set() or NULL
  control = control_grid() # an object used to modify the tuning process
)
56 / 78
expand_grid()
tidyr::expand_grid(): takes one or more vectors, and returns a data frame holding all combinations of their
values.
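The grid below could have been produced by a call along these lines (the exact values are assumptions inferred from the printed output):

expand_grid(
  cost_complexity = 10^(0:-5),   # 1, 0.1, ..., 0.00001
  min_n = seq(4, 20, by = 4)     # 4, 8, 12, 16, 20
)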
## # A tibble: 30 x 2
## cost_complexity min_n
## <dbl> <dbl>
## 1 1 4
## 2 1 8
## 3 1 12
## 4 1 16
## 5 1 20
## 6 0.1 4
## 7 0.1 8
## 8 0.1 12
## 9 0.1 16
## 10 0.1 20
## # ... with 20 more rows
57 / 78
dt_spec <- decision_tree(
cost_complexity = tune(),
tree_depth = tune()
) %>%
set_engine("rpart") %>%
set_mode("classification")
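The tuning results below, stored in dt_res and reused on the following slides, are presumably produced by tune_grid(); a sketch in which the grid values are assumptions inferred from the later metric tables:

dt_res <- dt_spec %>%
  tune_grid(
    remote ~ .,
    resamples = so_cv,
    grid = expand_grid(cost_complexity = 10^(-1:-5), tree_depth = 1:6)
  )
dt_res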
## # Tuning results
## # 5-fold cross-validation
## # A tibble: 5 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [920/230]> Fold1 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]>
## 2 <split [920/230]> Fold2 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]>
## 3 <split [920/230]> Fold3 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]>
## 4 <split [920/230]> Fold4 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]>
## 5 <split [920/230]> Fold5 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]>
58 / 78
dt_res %>%
collect_metrics() %>%
filter(.metric == "accuracy") %>%
arrange(desc(mean))
## # A tibble: 30 x 8
## cost_complexity tree_depth .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model~
## 2 0.0001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model~
## 3 0.00001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model~
## 4 0.01 2 accuracy binary 0.656 5 0.0142 Preprocessor1_Model~
## 5 0.01 3 accuracy binary 0.649 5 0.0142 Preprocessor1_Model~
## 6 0.001 5 accuracy binary 0.646 5 0.00488 Preprocessor1_Model~
## 7 0.001 6 accuracy binary 0.646 5 0.00918 Preprocessor1_Model~
## 8 0.0001 5 accuracy binary 0.646 5 0.00488 Preprocessor1_Model~
## 9 0.0001 6 accuracy binary 0.646 5 0.00918 Preprocessor1_Model~
## 10 0.00001 5 accuracy binary 0.646 5 0.00488 Preprocessor1_Model~
## # ... with 20 more rows
59 / 78
show_best()
show_best(): display the n best hyperparameter combinations according to a metric:
dt_res %>%
show_best(metric = "accuracy", n = 5)
## # A tibble: 5 x 8
## cost_complexity tree_depth .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model14
## 2 0.0001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model20
## 3 0.00001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model26
## 4 0.01 2 accuracy binary 0.656 5 0.0142 Preprocessor1_Model08
## 5 0.01 3 accuracy binary 0.649 5 0.0142 Preprocessor1_Model09
60 / 78
autoplot()
autoplot(): quickly visualize tuning results
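For example, assuming the tuning results dt_res from above:

autoplot(dt_res)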
61 / 78
select_best()
select_best() returns the best combination of hyperparameters according to a metric:
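The output below presumably comes from:

dt_res %>% select_best(metric = "accuracy")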
## # A tibble: 1 x 3
## cost_complexity tree_depth .config
## <dbl> <int> <chr>
## 1 0.001 2 Preprocessor1_Model14
62 / 78
finalize_workflow()
finalize_workflow(): replace tune() placeholders in a model/recipe/workflow with a set of hyperparameter
values.
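A sketch of how the finalized workflow printed below could be built, assuming a workflow that combines the remote ~ . formula with the tunable dt_spec (the object name dt_wf is an assumption):

dt_wf <- workflow() %>%
  add_formula(remote ~ .) %>%
  add_model(dt_spec)
dt_wf %>%
  finalize_workflow(select_best(dt_res, metric = "accuracy"))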
## == Workflow ==============================================================================
## Preprocessor: Formula
## Model: decision_tree()
##
## -- Preprocessor --------------------------------------------------------------------------
## remote ~ .
##
## -- Model ---------------------------------------------------------------------------------
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = 0.001
## tree_depth = 2
##
## Computational engine: rpart
63 / 78
Preprocessing with recipes
64 / 78
The recipes workflow:
1. Create a recipe()
2. Define the predictor and outcome variables
3. Add one or more preprocessing step specifications
4. Calculate statistics from the training set
5. Apply preprocessing to datasets
65 / 78
recipe()
recipe(): create a recipe by specifying the predictors, the response, and a reference (template) data frame.
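For example, a recipe for the ames data might be created as follows (the object name rec is also used on the selectors slide below):

rec <- recipe(Sale_Price ~ ., data = ames)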
66 / 78
step_*()
step_*(): add preprocessing step specifications in the order they will be performed.

recipe(Sale_Price ~ ., data = ames) %>%
  # step_novel(): assign a previously unseen factor level to a new value
  step_novel(all_nominal()) %>%
  # step_zv(): zero variance filter: remove vars that contain only a single value
  step_zv(all_predictors())

## Data Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor         73
##
## Operations:
##
## Novel factor level assignment for all_nominal()
## Zero variance filter on all_predictors()
67 / 78
step_*()
Complete list at: https://recipes.tidymodels.org/reference/index.html
For example, imputation steps such as step_impute_bag() (imputation via bagged trees) and
step_impute_knn() (imputation via k-nearest neighbors).
68 / 78
Selectors
Selectors such as all_nominal() and all_predictors() are helper functions for selecting sets of variables;
they behave similarly to the select() helpers from dplyr.
rec %>%
step_novel(all_nominal()) %>%
step_zv(all_predictors())
selector                   description
all_predictors()           Each x variable (right side of ~)
all_outcomes()             Each y variable (left side of ~)
all_numeric()              Each numeric variable
all_nominal()              Each categorical variable (e.g. factor, string)
dplyr::select() helpers    starts_with('Lot_'), etc.
70 / 78
prep()
prep() "trains" a recipe, i.e., calculates statistics from the training data.
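A minimal sketch (the object names and the normalization step are assumptions for illustration):

ames_train <- training(ames_split)
rec_prepped <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%  # assumed step, for illustration
  prep(training = ames_train)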
71 / 78
bake()
bake() transforms data with the prepped recipe
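The code behind the output below is not shown; assuming rec_prepped and ames_train from the previous sketch, the call would look like:

bake(rec_prepped, new_data = ames_train)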
## # A tibble: 2,198 x 74
## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape Land_Contour
## <fct> <fct> <dbl> <dbl> <fct> <fct> <fct> <fct>
## 1 One_Story_1946~ Residentia~ 2.46 2.64 Pave No_Al~ Slightly_~ Lvl
## 2 One_Story_1946~ Residentia~ 0.658 0.185 Pave No_Al~ Regular Lvl
## 3 One_Story_1946~ Residentia~ 0.687 0.507 Pave No_Al~ Slightly_~ Lvl
## 4 One_Story_1946~ Residentia~ 1.04 0.128 Pave No_Al~ Regular Lvl
## 5 Two_Story_1946~ Residentia~ 0.480 0.454 Pave No_Al~ Slightly_~ Lvl
## 6 Two_Story_1946~ Residentia~ 0.598 -0.0156 Pave No_Al~ Slightly_~ Lvl
## 7 One_Story_PUD_~ Residentia~ -0.496 -0.632 Pave No_Al~ Regular Lvl
## 8 Two_Story_1946~ Residentia~ 0.510 -0.0129 Pave No_Al~ Slightly_~ Lvl
## 9 One_Story_1946~ Residentia~ -1.71 -0.259 Pave No_Al~ Slightly_~ Lvl
## 10 One_Story_1946~ Residentia~ 0.805 0.00851 Pave No_Al~ Regular Lvl
## # ... with 2,188 more rows, and 66 more variables: Utilities <fct>, Lot_Config <fct>,
## # Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
## # Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <dbl>,
## #   Year_Remod_Add <dbl>, Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>,
74 / 78
A full workflow
set.seed(123)
so_cv <- vfold_cv(stackoverflow, v = 5)
so_rec <- recipe(remote ~ ., data = stackoverflow) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_corr(all_predictors(), threshold = 0.5)
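The confusion matrix below is presumably obtained by resampling a workflow that combines so_rec with a decision tree, saving the out-of-sample predictions, and tabulating them; a sketch with assumed object names:

so_wf <- workflow() %>%
  add_recipe(so_rec) %>%
  add_model(vanilla_tree_spec)
so_res <- fit_resamples(so_wf, resamples = so_cv,
                        control = control_resamples(save_pred = TRUE))
collect_predictions(so_res) %>%
  conf_mat(truth = remote, estimate = .pred_class)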
## Truth
## Prediction Remote Not remote
## Remote 381 224
## Not remote 194 351
75 / 78
You can tune models and recipes!
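The results below appear to come from tuning a nearest-neighbor model together with a recipe containing a tunable PCA step; the original code is not shown, so the following is only a sketch of the pattern (data set, steps, grid, and object names are assumptions):

knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")
pca_rec <- recipe(Sale_Price ~ ., data = ames) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = tune())
workflow() %>%
  add_recipe(pca_rec) %>%
  add_model(knn_spec) %>%
  tune_grid(resamples = folds, grid = 20) %>%
  show_best(metric = "rmse", n = 5)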
## # A tibble: 5 x 8
## neighbors num_comp .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 9 7 rmse standard 31793. 5 968. Preprocessor6_Model3
## 2 13 7 rmse standard 31961. 5 1157. Preprocessor6_Model4
## 3 9 8 rmse standard 31963. 5 1099. Preprocessor7_Model3
## 4 9 5 rmse standard 32141. 5 951. Preprocessor4_Model3
## 5 13 8 rmse standard 32180. 5 1234. Preprocessor7_Model4
76 / 78
Session info
## setting value
## version R version 4.0.5 (2021-03-31)
## os Windows 10 x64
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.1252
## ctype English_United States.1252
## tz Europe/Berlin
## date 2021-05-10