Building better training data to predict children in hotel bookings
Suggested answers
Application exercise
Answers
Your Turn 1
Unscramble! You have all the steps from our knn_rec; your challenge is to unscramble them into the right order! Save the result as knn_rec.
step_normalize(all_numeric())
recipe(children ~ ., data = hotels)
step_rm(arrival_date)
step_date(arrival_date)
step_downsample(children)
step_holiday(arrival_date, holidays = holidays)
step_dummy(all_nominal_predictors())
step_zv(all_predictors())
Answer:
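knn_rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date) |>                          # derive date features while the raw date still exists
  step_holiday(arrival_date, holidays = holidays) |>  # holiday indicators also need the raw date
  step_rm(arrival_date) |>                            # drop the raw date once its features are extracted
  step_dummy(all_nominal_predictors()) |>             # dummy-code factors, including those from step_date()
  step_zv(all_predictors()) |>                        # remove constant columns before scaling
  step_normalize(all_numeric()) |>                    # center and scale so KNN distances are comparable
  step_downsample(children)                           # balance the outcome classes in the training data
knn_rec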
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 21
── Operations
• Date features from: arrival_date
• Holiday features from: arrival_date
• Variables removed: arrival_date
• Dummy variables from: all_nominal_predictors()
• Zero variance filter on: all_predictors()
• Centering and scaling for: all_numeric()
• Down-sampling based on: children
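To see what these steps do to actual data, a recipe can be prepped and baked. The following is a minimal sketch on a hypothetical four-row data frame named toy (not the hotels data); the column names meal and adr and the single holiday "ChristmasDay" are assumptions chosen for illustration, and step_downsample() is left out because it needs the themis package and more rows to be meaningful.

```r
library(recipes)

# Hypothetical toy data standing in for hotels (assumption, not the real data)
toy <- data.frame(
  children     = factor(c("none", "children", "none", "none")),
  arrival_date = as.Date(c("2017-01-01", "2017-07-04", "2017-12-25", "2017-03-15")),
  meal         = factor(c("BB", "HB", "BB", "BB")),
  adr          = c(80, 120, 95, 60)
)

toy_rec <- recipe(children ~ ., data = toy) |>
  step_date(arrival_date) |>                              # adds day-of-week, month, year
  step_holiday(arrival_date, holidays = "ChristmasDay") |> # adds a 0/1 holiday indicator
  step_rm(arrival_date) |>                                # raw date no longer needed
  step_dummy(all_nominal_predictors()) |>                 # factors -> 0/1 columns
  step_zv(all_predictors()) |>                            # drops constant columns (e.g. year)
  step_normalize(all_numeric_predictors())                # center and scale what remains

# prep() estimates each step from the data; bake(new_data = NULL) returns
# the processed training set
baked <- toy_rec |>
  prep() |>
  bake(new_data = NULL)

names(baked)
```

Because every date in toy falls in 2017, the year feature is constant and step_zv() removes it before step_normalize() runs, which is one reason the zero-variance filter sits ahead of normalization.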
Your Turn 2
Fill in the blanks to make a workflow that combines knn_rec with knn_mod.
knn_wf <- ______ |>
  ______(knn_rec) |>
  ______(knn_mod)
knn_wf
Answer:
knn_wf <- workflow() |>
  add_recipe(knn_rec) |>
  add_model(knn_mod)
knn_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()
── Preprocessor ────────────────────────────────────────────────────────────────
7 Recipe Steps
• step_date()
• step_holiday()
• step_rm()
• step_dummy()
• step_zv()
• step_normalize()
• step_downsample()
── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (classification)
Computational engine: kknn
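A workflow bundles the preprocessor and the model so that both are estimated together when it is fit. As a rough sketch of that idea, the example below builds and fits a small workflow on mtcars rather than hotels; the toy_ objects and neighbors = 3 are assumptions for illustration, not values taken from the exercise.

```r
library(tidymodels)

# Hypothetical toy data: predict transmission type from three numeric columns
toy <- mtcars |>
  dplyr::mutate(am = factor(am, labels = c("automatic", "manual"))) |>
  dplyr::select(am, mpg, wt, hp)

toy_rec <- recipe(am ~ ., data = toy) |>
  step_normalize(all_numeric_predictors())

toy_mod <- nearest_neighbor(neighbors = 3) |>  # neighbors = 3 is an arbitrary choice
  set_engine("kknn") |>
  set_mode("classification")

toy_wf <- workflow() |>
  add_recipe(toy_rec) |>
  add_model(toy_mod)

# fit() preps the recipe on the data, then trains the model on the result
toy_fit <- fit(toy_wf, data = toy)

# predict() applies the same trained recipe to new data before predicting
toy_preds <- predict(toy_fit, new_data = head(toy))
toy_preds
```

The new data passed to predict() is raw: the fitted workflow applies the trained recipe itself, so there is no separate bake() call to remember.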
Your Turn 3
Edit the code chunk below to fit the entire knn_wf workflow instead of just knn_mod.
set.seed(100)
knn_mod |>
  fit_resamples(children ~ .,
    resamples = hotels_folds,
    # print progress of model fitting
    control = control_resamples(verbose = TRUE)
  ) |>
  collect_metrics()
Answer:
set.seed(100)
knn_wf |>
  fit_resamples(
    resamples = hotels_folds,
    control = control_resamples(verbose = TRUE)
  ) |>
  collect_metrics()
i Fold01: preprocessor 1/1
✓ Fold01: preprocessor 1/1
i Fold01: preprocessor 1/1, model 1/1
✓ Fold01: preprocessor 1/1, model 1/1
i Fold01: preprocessor 1/1, model 1/1 (extracts)
i Fold01: preprocessor 1/1, model 1/1 (predictions)
i Fold02: preprocessor 1/1
✓ Fold02: preprocessor 1/1
i Fold02: preprocessor 1/1, model 1/1
✓ Fold02: preprocessor 1/1, model 1/1
i Fold02: preprocessor 1/1, model 1/1 (extracts)
i Fold02: preprocessor 1/1, model 1/1 (predictions)
i Fold03: preprocessor 1/1
✓ Fold03: preprocessor 1/1
i Fold03: preprocessor 1/1, model 1/1
✓ Fold03: preprocessor 1/1, model 1/1
i Fold03: preprocessor 1/1, model 1/1 (extracts)
i Fold03: preprocessor 1/1, model 1/1 (predictions)
i Fold04: preprocessor 1/1
✓ Fold04: preprocessor 1/1
i Fold04: preprocessor 1/1, model 1/1
✓ Fold04: preprocessor 1/1, model 1/1
i Fold04: preprocessor 1/1, model 1/1 (extracts)
i Fold04: preprocessor 1/1, model 1/1 (predictions)
i Fold05: preprocessor 1/1
✓ Fold05: preprocessor 1/1
i Fold05: preprocessor 1/1, model 1/1
✓ Fold05: preprocessor 1/1, model 1/1
i Fold05: preprocessor 1/1, model 1/1 (extracts)
i Fold05: preprocessor 1/1, model 1/1 (predictions)
i Fold06: preprocessor 1/1
✓ Fold06: preprocessor 1/1
i Fold06: preprocessor 1/1, model 1/1
✓ Fold06: preprocessor 1/1, model 1/1
i Fold06: preprocessor 1/1, model 1/1 (extracts)
i Fold06: preprocessor 1/1, model 1/1 (predictions)
i Fold07: preprocessor 1/1
✓ Fold07: preprocessor 1/1
i Fold07: preprocessor 1/1, model 1/1
✓ Fold07: preprocessor 1/1, model 1/1
i Fold07: preprocessor 1/1, model 1/1 (extracts)
i Fold07: preprocessor 1/1, model 1/1 (predictions)
i Fold08: preprocessor 1/1
✓ Fold08: preprocessor 1/1
i Fold08: preprocessor 1/1, model 1/1
✓ Fold08: preprocessor 1/1, model 1/1
i Fold08: preprocessor 1/1, model 1/1 (extracts)
i Fold08: preprocessor 1/1, model 1/1 (predictions)
i Fold09: preprocessor 1/1
✓ Fold09: preprocessor 1/1
i Fold09: preprocessor 1/1, model 1/1
✓ Fold09: preprocessor 1/1, model 1/1
i Fold09: preprocessor 1/1, model 1/1 (extracts)
i Fold09: preprocessor 1/1, model 1/1 (predictions)
i Fold10: preprocessor 1/1
✓ Fold10: preprocessor 1/1
i Fold10: preprocessor 1/1, model 1/1
✓ Fold10: preprocessor 1/1, model 1/1
i Fold10: preprocessor 1/1, model 1/1 (extracts)
i Fold10: preprocessor 1/1, model 1/1 (predictions)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>
1 accuracy    binary     0.739    10 0.00192 Preprocessor1_Model1
2 brier_class binary     0.175    10 0.00143 Preprocessor1_Model1
3 roc_auc     binary     0.830    10 0.00352 Preprocessor1_Model1
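The log above shows the preprocessor and the model being estimated once per fold, and the table then averages the per-fold metrics. A base R sketch of that loop, assuming mtcars as stand-in data, a simple logistic regression in place of the KNN workflow, and 4 folds instead of 10, might look like this:

```r
set.seed(100)

# Hypothetical stand-in data: transmission type as a two-level outcome
dat <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

v <- 4
fold_id <- sample(rep(seq_len(v), length.out = nrow(dat)))

fold_accuracy <- vapply(seq_len(v), function(k) {
  analysis   <- dat[fold_id != k, ]
  assessment <- dat[fold_id == k, ]

  # "Preprocessor": centering and scaling learned from the analysis set only
  ctr <- mean(analysis$mpg)
  scl <- sd(analysis$mpg)
  analysis$mpg_z   <- (analysis$mpg - ctr) / scl
  assessment$mpg_z <- (assessment$mpg - ctr) / scl

  # "Model": fit on the analysis set, predict the held-out assessment set
  fit  <- glm(am ~ mpg_z, data = analysis, family = binomial())
  prob <- predict(fit, newdata = assessment, type = "response")
  pred <- ifelse(prob >= 0.5, "manual", "automatic")

  mean(pred == assessment$am)
}, numeric(1))

# Summarize across folds the way the mean, n, and std_err columns do
cv_summary <- c(
  mean    = mean(fold_accuracy),
  n       = v,
  std_err = sd(fold_accuracy) / sqrt(v)
)
cv_summary
```

Estimating the centering and scaling inside each fold is the point of resampling the whole workflow rather than the model alone: the held-out rows never influence the preprocessing.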
Your Turn 4
Turns out, the same knn_rec recipe can also be used to fit a penalized logistic regression model using the lasso. Let’s try it out!
plr_mod <- logistic_reg(penalty = .01, mixture = 1) |>
  set_engine("glmnet") |>
  set_mode("classification")
plr_mod |>
  translate()
Logistic Regression Model Specification (classification)
Main Arguments:
penalty = 0.01
mixture = 1
Computational engine: glmnet
Model fit template:
glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
alpha = 1, family = "binomial")
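For reference, the translated call maps mixture to glmnet's alpha argument. Writing penalty as $\lambda$ and mixture as $\alpha$ (symbols introduced here, not used in the original), the elastic-net objective that glmnet documents for the binomial family is, as best I can state it:

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}
\Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
\;+\;
\lambda\Bigl[\, \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
      + \alpha\,\lVert\beta\rVert_1 \Bigr]
```

With mixture = 1 the ridge term vanishes and only the $\ell_1$ (lasso) penalty remains, scaled here by penalty = 0.01.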
Answer:
glmnet_wf <- knn_wf |>
  update_model(plr_mod)
glmnet_wf |>
  fit_resamples(resamples = hotels_folds) |>
  collect_metrics()
# A tibble: 3 × 6
  .metric     .estimator  mean     n  std_err .config
  <chr>       <chr>      <dbl> <int>    <dbl> <chr>
1 accuracy    binary     0.828    10 0.00215  Preprocessor1_Model1
2 brier_class binary     0.139    10 0.000876 Preprocessor1_Model1
3 roc_auc     binary     0.873    10 0.00209  Preprocessor1_Model1
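To make the three metrics in these tables more concrete, here is a base R sketch of the standard definitions on five hypothetical predictions. The truth and prob vectors are made up for illustration, and yardstick's own implementations may differ in details such as event-level handling.

```r
# Hypothetical outcomes (1 = event) and predicted event probabilities
truth <- c(1, 0, 1, 0, 1)
prob  <- c(0.9, 0.2, 0.4, 0.3, 0.8)

# Accuracy: share of correct hard predictions at a 0.5 cutoff
pred_class <- as.numeric(prob >= 0.5)
accuracy   <- mean(pred_class == truth)

# Brier score: mean squared gap between outcome and predicted probability
# (lower is better)
brier <- mean((truth - prob)^2)

# ROC AUC via the rank formula: the probability that a randomly chosen
# event is scored higher than a randomly chosen non-event
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc   <- (sum(rank(prob)[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(accuracy = accuracy, brier = brier, roc_auc = auc)
```

In this toy case every event is ranked above every non-event, so the AUC is perfect even though one event falls below the 0.5 cutoff and is misclassified. Accuracy depends on the threshold, while ROC AUC depends only on the ordering of the probabilities.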
Acknowledgments
- Materials derived from Tidymodels, Virtually: An Introduction to Machine Learning with Tidymodels by Alison Hill.
- Dataset and some modeling steps derived from A predictive modeling case study and licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA) License.
Session information
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.1 (2024-06-14)
os macOS Sonoma 14.6.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2024-11-07
pandoc 3.4 @ /usr/local/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P backports 1.5.0 2024-05-23 [?] CRAN (R 4.4.0)
P bit 4.0.5 2022-11-15 [?] CRAN (R 4.3.0)
P bit64 4.0.5 2020-08-30 [?] CRAN (R 4.3.0)
P broom * 1.0.6 2024-05-17 [?] CRAN (R 4.4.0)
P class 7.3-22 2023-05-03 [?] CRAN (R 4.4.0)
cli 3.6.3 2024-06-21 [1] RSPM (R 4.4.0)
P clock 0.7.0 2023-05-15 [?] CRAN (R 4.3.0)
P codetools 0.2-20 2024-03-31 [?] CRAN (R 4.4.1)
P colorspace 2.1-0 2023-01-23 [?] CRAN (R 4.3.0)
P crayon 1.5.3 2024-06-20 [?] CRAN (R 4.4.0)
P data.table 1.15.4 2024-03-30 [?] CRAN (R 4.3.1)
P dials * 1.2.1 2024-02-22 [?] CRAN (R 4.3.1)
P DiceDesign 1.10 2023-12-07 [?] CRAN (R 4.3.1)
P digest 0.6.35 2024-03-11 [?] CRAN (R 4.3.1)
P dplyr * 1.1.4 2023-11-17 [?] CRAN (R 4.3.1)
P ellipsis 0.3.2 2021-04-29 [?] CRAN (R 4.3.0)
P evaluate 0.24.0 2024-06-10 [?] CRAN (R 4.4.0)
P fansi 1.0.6 2023-12-08 [?] CRAN (R 4.3.1)
P fastmap 1.2.0 2024-05-15 [?] CRAN (R 4.4.0)
P forcats * 1.0.0 2023-01-29 [?] CRAN (R 4.3.0)
P foreach 1.5.2 2022-02-02 [?] CRAN (R 4.3.0)
P furrr 0.3.1 2022-08-15 [?] CRAN (R 4.3.0)
P future 1.33.2 2024-03-26 [?] CRAN (R 4.3.1)
P future.apply 1.11.2 2024-03-28 [?] CRAN (R 4.3.1)
P generics 0.1.3 2022-07-05 [?] CRAN (R 4.3.0)
P ggplot2 * 3.5.1 2024-04-23 [?] CRAN (R 4.3.1)
P glmnet * 4.1-8 2023-08-22 [?] CRAN (R 4.3.0)
P globals 0.16.3 2024-03-08 [?] CRAN (R 4.3.1)
glue 1.8.0 2024-09-30 [1] RSPM (R 4.4.0)
P gower 1.0.1 2022-12-22 [?] CRAN (R 4.3.0)
P GPfit 1.0-8 2019-02-08 [?] CRAN (R 4.3.0)
P gtable 0.3.5 2024-04-22 [?] CRAN (R 4.3.1)
P hardhat 1.4.0 2024-06-02 [?] CRAN (R 4.4.0)
P here 1.0.1 2020-12-13 [?] CRAN (R 4.3.0)
P hms 1.1.3 2023-03-21 [?] CRAN (R 4.3.0)
P htmltools 0.5.8.1 2024-04-04 [?] CRAN (R 4.3.1)
P htmlwidgets 1.6.4 2023-12-06 [?] CRAN (R 4.3.1)
P igraph 2.0.3 2024-03-13 [?] CRAN (R 4.4.0)
P infer * 1.0.7 2024-03-25 [?] CRAN (R 4.3.1)
P ipred 0.9-14 2023-03-09 [?] CRAN (R 4.3.0)
P iterators 1.0.14 2022-02-05 [?] CRAN (R 4.3.0)
P jsonlite 1.8.8 2023-12-04 [?] CRAN (R 4.3.1)
P kknn * 1.3.1 2016-03-26 [?] CRAN (R 4.3.0)
P knitr 1.47 2024-05-29 [?] CRAN (R 4.4.0)
P lattice 0.22-6 2024-03-20 [?] CRAN (R 4.4.0)
P lava 1.8.0 2024-03-05 [?] CRAN (R 4.3.1)
P lhs 1.1.6 2022-12-17 [?] CRAN (R 4.3.0)
P lifecycle 1.0.4 2023-11-07 [?] CRAN (R 4.3.1)
P listenv 0.9.1 2024-01-29 [?] CRAN (R 4.3.1)
P lubridate * 1.9.3 2023-09-27 [?] CRAN (R 4.3.1)
P magrittr 2.0.3 2022-03-30 [?] CRAN (R 4.3.0)
P MASS 7.3-61 2024-06-13 [?] CRAN (R 4.4.0)
P Matrix * 1.7-0 2024-03-22 [?] CRAN (R 4.4.0)
P modeldata * 1.4.0 2024-06-19 [?] CRAN (R 4.4.0)
P munsell 0.5.1 2024-04-01 [?] CRAN (R 4.3.1)
P nnet 7.3-19 2023-05-03 [?] CRAN (R 4.4.0)
P parallelly 1.37.1 2024-02-29 [?] CRAN (R 4.3.1)
P parsnip * 1.2.1 2024-03-22 [?] CRAN (R 4.3.1)
P pillar 1.9.0 2023-03-22 [?] CRAN (R 4.3.0)
P pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 4.3.0)
P prodlim 2023.08.28 2023-08-28 [?] CRAN (R 4.3.0)
P purrr * 1.0.2 2023-08-10 [?] CRAN (R 4.3.0)
P R6 2.5.1 2021-08-19 [?] CRAN (R 4.3.0)
P Rcpp 1.0.12 2024-01-09 [?] CRAN (R 4.3.1)
P readr * 2.1.5 2024-01-10 [?] CRAN (R 4.3.1)
P recipes * 1.0.10 2024-02-18 [?] CRAN (R 4.3.1)
renv 1.0.7 2024-04-11 [1] CRAN (R 4.4.0)
P rlang 1.1.4 2024-06-04 [?] CRAN (R 4.3.3)
P rmarkdown 2.27 2024-05-17 [?] CRAN (R 4.4.0)
P ROSE 0.0-4 2021-06-14 [?] CRAN (R 4.3.0)
P rpart 4.1.23 2023-12-05 [?] CRAN (R 4.4.0)
P rprojroot 2.0.4 2023-11-05 [?] CRAN (R 4.3.1)
P rsample * 1.2.1 2024-03-25 [?] CRAN (R 4.3.1)
P rstudioapi 0.16.0 2024-03-24 [?] CRAN (R 4.3.1)
P scales * 1.3.0.9000 2024-05-07 [?] Github (r-lib/scales@c0f79d3)
P sessioninfo 1.2.2 2021-12-06 [?] CRAN (R 4.3.0)
P shape 1.4.6.1 2024-02-23 [?] CRAN (R 4.3.1)
P stringi 1.8.4 2024-05-06 [?] CRAN (R 4.3.1)
P stringr * 1.5.1 2023-11-14 [?] CRAN (R 4.3.1)
P survival 3.7-0 2024-06-05 [?] CRAN (R 4.4.0)
P themis * 1.0.2 2023-08-14 [?] CRAN (R 4.3.0)
P tibble * 3.2.1 2023-03-20 [?] CRAN (R 4.3.0)
P tidymodels * 1.2.0 2024-03-25 [?] CRAN (R 4.3.1)
P tidyr * 1.3.1 2024-01-24 [?] CRAN (R 4.3.1)
P tidyselect 1.2.1 2024-03-11 [?] CRAN (R 4.3.1)
P tidyverse * 2.0.0 2023-02-22 [?] CRAN (R 4.3.0)
P timechange 0.3.0 2024-01-18 [?] CRAN (R 4.3.1)
P timeDate 4032.109 2023-12-14 [?] CRAN (R 4.3.1)
P tune * 1.2.1 2024-04-18 [?] CRAN (R 4.3.1)
P tzdb 0.4.0 2023-05-12 [?] CRAN (R 4.3.0)
P utf8 1.2.4 2023-10-22 [?] CRAN (R 4.3.1)
P vctrs 0.6.5 2023-12-01 [?] CRAN (R 4.3.1)
P vroom 1.6.5 2023-12-05 [?] CRAN (R 4.3.1)
withr 3.0.1 2024-07-31 [1] RSPM (R 4.4.0)
P workflows * 1.1.4 2024-02-19 [?] CRAN (R 4.3.1)
P workflowsets * 1.1.0 2024-03-21 [?] CRAN (R 4.3.1)
P xfun 0.45 2024-06-16 [?] CRAN (R 4.4.0)
P yaml 2.3.8 2023-12-11 [?] CRAN (R 4.3.1)
P yardstick * 1.3.1 2024-03-21 [?] CRAN (R 4.3.1)
[1] /Users/soltoffbc/Projects/info-5001/course-site/renv/library/macos/R-4.4/aarch64-apple-darwin20
[2] /Users/soltoffbc/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.4/aarch64-apple-darwin20/f7156815
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────