Predicting children in hotel bookings

Suggested answers

Application exercise
Answers
Modified: September 12, 2024

Your Turn 1

Run the chunk below and look at the output. Then, copy/paste the code and edit to create:

  • a decision tree model for classification

  • that uses the C5.0 engine.

Save it as tree_mod and look at the object. What is different about the output?

Hint: you’ll need https://www.tidymodels.org/find/parsnip/

# Logistic regression specification fit with the glm engine
lr_mod <- logistic_reg() |>
  set_engine(engine = "glm") |>
  set_mode("classification")
lr_mod

Logistic Regression Model Specification (classification)

Computational engine: glm 

# Decision tree specification fit with the C5.0 engine
tree_mod <- decision_tree() |>
  set_engine(engine = "C5.0") |>
  set_mode("classification")
tree_mod

Decision Tree Model Specification (classification)

Computational engine: C5.0 
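
The two printed specifications differ only in the model type shown in the header and in the computational engine; nothing has been fit yet. As a small aside, the sketch below uses parsnip's show_engines() to list the engines registered for decision trees, then points the same model type at a different engine. It assumes the parsnip package is installed; the object names tree_engines and rpart_mod are illustrative and not part of the exercise.

```r
library(parsnip)

# List the engines registered for decision trees and the modes each supports.
tree_engines <- show_engines("decision_tree")
tree_engines

# The same model type can be paired with a different engine without
# changing the rest of the specification (rpart_mod is an illustrative name).
rpart_mod <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode("classification")
rpart_mod
```

Because the specification is separate from the engine, swapping engines only changes the set_engine() line.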

Your Turn 2

Fill in the blanks.

Use initial_split(), training(), and testing() to:

  1. Split hotels into training and test sets. Save the rsplit!

  2. Extract the training and testing data.

  3. Check the proportions of the outcome variable, children, in each set.

Keep set.seed(100) at the start of your code.

Hint: Be sure to remove every _ before running the code!

set.seed(100) # Important!

hotels_split <- initial_split(data = hotels, prop = 3 / 4)
hotels_train <- training(hotels_split)
hotels_test <- testing(hotels_split)

# check distribution
count(x = hotels_train, children) |>
  mutate(prop = n / sum(n))

# A tibble: 2 × 3
  children     n  prop
  <fct>    <int> <dbl>
1 children  1503 0.501
2 none      1497 0.499

count(x = hotels_test, children) |>
  mutate(prop = n / sum(n))

# A tibble: 2 × 3
  children     n  prop
  <fct>    <int> <dbl>
1 children   497 0.497
2 none       503 0.503
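
The class proportions above are close to 50/50 in both sets because the data are nearly balanced and the split is random. With a less balanced outcome, one option is the strata argument of initial_split(), which samples within each outcome level. The sketch below is only an illustration under stated assumptions: toy_hotels is a made-up stand-in for the hotels data (it is not the course dataset), and the toy_ names are hypothetical.

```r
library(rsample)

# Toy stand-in for hotels (hypothetical data, NOT the course dataset):
# 4,000 rows with a balanced two-level outcome.
set.seed(100)
toy_hotels <- data.frame(
  children = factor(rep(c("children", "none"), each = 2000)),
  average_daily_rate = runif(4000, min = 50, max = 250)
)

# strata = children asks rsample to sample within each outcome level,
# so both sets keep roughly the same class proportions.
toy_split <- initial_split(data = toy_hotels, prop = 3 / 4, strata = children)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Class proportions in each set
prop.table(table(toy_train$children))
prop.table(table(toy_test$children))
```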

Your Turn 3

Run the code below. What does it return?

set.seed(100)
hotels_folds <- vfold_cv(data = hotels_train, v = 10)
hotels_folds

#  10-fold cross-validation 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [2700/300]> Fold01
 2 <split [2700/300]> Fold02
 3 <split [2700/300]> Fold03
 4 <split [2700/300]> Fold04
 5 <split [2700/300]> Fold05
 6 <split [2700/300]> Fold06
 7 <split [2700/300]> Fold07
 8 <split [2700/300]> Fold08
 9 <split [2700/300]> Fold09
10 <split [2700/300]> Fold10
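
vfold_cv() returns a tibble with one row per fold. Each entry in the splits column is an rsplit object that records which rows are used for fitting (2,700 here) and which are held out for evaluation (300 here). The sketch below shows one way to inspect a single fold with rsample's analysis() and assessment() functions. It uses a made-up 3,000-row data frame, toy_train, as a stand-in for hotels_train; the data and the toy_ and fold_ names are hypothetical.

```r
library(rsample)

# Toy stand-in for hotels_train (hypothetical data): 3,000 rows.
set.seed(100)
toy_train <- data.frame(id = seq_len(3000), value = rnorm(3000))
toy_folds <- vfold_cv(data = toy_train, v = 10)

# Pull out the first resample and its two partitions.
first_split <- toy_folds$splits[[1]]
fold_analysis <- analysis(first_split)      # rows used to fit the model
fold_assessment <- assessment(first_split)  # rows held out for evaluation

nrow(fold_analysis)
nrow(fold_assessment)

# Count rows that appear in both partitions (there should be none).
length(intersect(fold_analysis$id, fold_assessment$id))
```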

Your Turn 4

Add autoplot() to visualize the ROC curve. How well does the model perform?

tree_preds <- tree_mod |>
  fit_resamples(
    children ~ average_daily_rate + stays_in_weekend_nights,
    resamples = hotels_folds,
    control = control_resamples(save_pred = TRUE)
  )

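# Pool the held-out predictions from all folds and compute the ROC AUC
tree_preds |>
  collect_predictions() |>
  roc_auc(truth = children, .pred_children)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.670

# Compute the ROC curve from the same predictions and plot it
tree_preds |>
  collect_predictions() |>
  roc_curve(truth = children, .pred_children) |>
  autoplot()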

The model is moderately successful. Its ROC AUC of about 0.67 is better than the \(0.5\) expected from random guessing, but it leaves a lot of room for improvement.
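
One way to read the ROC AUC is as the probability that a randomly chosen booking with children receives a higher predicted probability of children than a randomly chosen booking without. The base R sketch below computes that quantity directly on a handful of made-up scores. It is an illustration of the interpretation, not yardstick's implementation, and the score values and variable names are hypothetical.

```r
# Toy predicted probabilities of the "children" class (hypothetical values).
scores_children <- c(0.9, 0.8, 0.4)  # bookings that truly include children
scores_none <- c(0.7, 0.3, 0.2)      # bookings that truly have none

# Form every (positive, negative) pair of scores.
pairs <- expand.grid(pos = scores_children, neg = scores_none)

# A pair counts as 1 when the positive score is higher and 0.5 when tied;
# the mean over all pairs is the AUC.
auc <- mean((pairs$pos > pairs$neg) + 0.5 * (pairs$pos == pairs$neg))
auc
```

Under this reading, a value of \(0.5\) means the model ranks such pairs no better than a coin flip, and \(1\) means it ranks every pair correctly.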

Acknowledgments

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.1 (2024-06-14)
 os       macOS Sonoma 14.6.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2024-11-04
 pandoc   3.4 @ /usr/local/bin/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package      * version    date (UTC) lib source
 P backports      1.5.0      2024-05-23 [?] CRAN (R 4.4.0)
 P bit            4.0.5      2022-11-15 [?] CRAN (R 4.3.0)
 P bit64          4.0.5      2020-08-30 [?] CRAN (R 4.3.0)
 P broom        * 1.0.6      2024-05-17 [?] CRAN (R 4.4.0)
 P C50          * 0.1.8      2023-02-08 [?] RSPM
 P class          7.3-22     2023-05-03 [?] CRAN (R 4.4.0)
   cli            3.6.3      2024-06-21 [1] RSPM (R 4.4.0)
 P codetools      0.2-20     2024-03-31 [?] CRAN (R 4.4.1)
 P colorspace     2.1-0      2023-01-23 [?] CRAN (R 4.3.0)
 P crayon         1.5.3      2024-06-20 [?] CRAN (R 4.4.0)
 P Cubist         0.4.4      2024-07-02 [?] RSPM
 P data.table     1.15.4     2024-03-30 [?] CRAN (R 4.3.1)
 P dials        * 1.2.1      2024-02-22 [?] CRAN (R 4.3.1)
 P DiceDesign     1.10       2023-12-07 [?] CRAN (R 4.3.1)
 P digest         0.6.35     2024-03-11 [?] CRAN (R 4.3.1)
 P dplyr        * 1.1.4      2023-11-17 [?] CRAN (R 4.3.1)
 P evaluate       0.24.0     2024-06-10 [?] CRAN (R 4.4.0)
 P fansi          1.0.6      2023-12-08 [?] CRAN (R 4.3.1)
 P farver         2.1.2      2024-05-13 [?] CRAN (R 4.3.3)
 P fastmap        1.2.0      2024-05-15 [?] CRAN (R 4.4.0)
 P forcats      * 1.0.0      2023-01-29 [?] CRAN (R 4.3.0)
 P foreach        1.5.2      2022-02-02 [?] CRAN (R 4.3.0)
 P Formula        1.2-5      2023-02-24 [?] CRAN (R 4.3.0)
 P furrr          0.3.1      2022-08-15 [?] CRAN (R 4.3.0)
 P future         1.33.2     2024-03-26 [?] CRAN (R 4.3.1)
 P future.apply   1.11.2     2024-03-28 [?] CRAN (R 4.3.1)
 P generics       0.1.3      2022-07-05 [?] CRAN (R 4.3.0)
 P ggplot2      * 3.5.1      2024-04-23 [?] CRAN (R 4.3.1)
 P globals        0.16.3     2024-03-08 [?] CRAN (R 4.3.1)
   glue           1.8.0      2024-09-30 [1] RSPM (R 4.4.0)
 P gower          1.0.1      2022-12-22 [?] CRAN (R 4.3.0)
 P GPfit          1.0-8      2019-02-08 [?] CRAN (R 4.3.0)
 P gtable         0.3.5      2024-04-22 [?] CRAN (R 4.3.1)
 P hardhat        1.4.0      2024-06-02 [?] CRAN (R 4.4.0)
 P here           1.0.1      2020-12-13 [?] CRAN (R 4.3.0)
 P hms            1.1.3      2023-03-21 [?] CRAN (R 4.3.0)
 P htmltools      0.5.8.1    2024-04-04 [?] CRAN (R 4.3.1)
 P htmlwidgets    1.6.4      2023-12-06 [?] CRAN (R 4.3.1)
 P infer        * 1.0.7      2024-03-25 [?] CRAN (R 4.3.1)
 P inum           1.0-5      2023-03-09 [?] CRAN (R 4.3.0)
 P ipred          0.9-14     2023-03-09 [?] CRAN (R 4.3.0)
 P iterators      1.0.14     2022-02-05 [?] CRAN (R 4.3.0)
 P jsonlite       1.8.8      2023-12-04 [?] CRAN (R 4.3.1)
 P knitr          1.47       2024-05-29 [?] CRAN (R 4.4.0)
 P labeling       0.4.3      2023-08-29 [?] CRAN (R 4.3.0)
 P lattice        0.22-6     2024-03-20 [?] CRAN (R 4.4.0)
 P lava           1.8.0      2024-03-05 [?] CRAN (R 4.3.1)
 P lhs            1.1.6      2022-12-17 [?] CRAN (R 4.3.0)
 P libcoin        1.0-10     2023-09-27 [?] CRAN (R 4.3.1)
 P lifecycle      1.0.4      2023-11-07 [?] CRAN (R 4.3.1)
 P listenv        0.9.1      2024-01-29 [?] CRAN (R 4.3.1)
 P lubridate    * 1.9.3      2023-09-27 [?] CRAN (R 4.3.1)
 P magrittr       2.0.3      2022-03-30 [?] CRAN (R 4.3.0)
 P MASS           7.3-61     2024-06-13 [?] CRAN (R 4.4.0)
 P Matrix         1.7-0      2024-03-22 [?] CRAN (R 4.4.0)
 P modeldata    * 1.4.0      2024-06-19 [?] CRAN (R 4.4.0)
 P modelenv       0.1.1      2023-03-08 [?] CRAN (R 4.3.0)
 P munsell        0.5.1      2024-04-01 [?] CRAN (R 4.3.1)
 P mvtnorm        1.2-5      2024-05-21 [?] CRAN (R 4.4.0)
 P nnet           7.3-19     2023-05-03 [?] CRAN (R 4.4.0)
 P parallelly     1.37.1     2024-02-29 [?] CRAN (R 4.3.1)
 P parsnip      * 1.2.1      2024-03-22 [?] CRAN (R 4.3.1)
 P partykit       1.2-20     2023-04-14 [?] CRAN (R 4.3.0)
 P pillar         1.9.0      2023-03-22 [?] CRAN (R 4.3.0)
 P pkgconfig      2.0.3      2019-09-22 [?] CRAN (R 4.3.0)
 P plyr           1.8.9      2023-10-02 [?] CRAN (R 4.3.1)
 P prodlim        2023.08.28 2023-08-28 [?] CRAN (R 4.3.0)
 P purrr        * 1.0.2      2023-08-10 [?] CRAN (R 4.3.0)
 P R6             2.5.1      2021-08-19 [?] CRAN (R 4.3.0)
 P Rcpp           1.0.12     2024-01-09 [?] CRAN (R 4.3.1)
 P readr        * 2.1.5      2024-01-10 [?] CRAN (R 4.3.1)
 P recipes      * 1.0.10     2024-02-18 [?] CRAN (R 4.3.1)
   renv           1.0.7      2024-04-11 [1] CRAN (R 4.4.0)
 P reshape2       1.4.4      2020-04-09 [?] CRAN (R 4.3.0)
 P rlang          1.1.4      2024-06-04 [?] CRAN (R 4.3.3)
 P rmarkdown      2.27       2024-05-17 [?] CRAN (R 4.4.0)
 P rpart          4.1.23     2023-12-05 [?] CRAN (R 4.4.0)
 P rprojroot      2.0.4      2023-11-05 [?] CRAN (R 4.3.1)
 P rsample      * 1.2.1      2024-03-25 [?] CRAN (R 4.3.1)
 P rstudioapi     0.16.0     2024-03-24 [?] CRAN (R 4.3.1)
 P scales       * 1.3.0.9000 2024-05-07 [?] Github (r-lib/scales@c0f79d3)
 P sessioninfo    1.2.2      2021-12-06 [?] CRAN (R 4.3.0)
 P stringi        1.8.4      2024-05-06 [?] CRAN (R 4.3.1)
 P stringr      * 1.5.1      2023-11-14 [?] CRAN (R 4.3.1)
 P survival       3.7-0      2024-06-05 [?] CRAN (R 4.4.0)
 P tibble       * 3.2.1      2023-03-20 [?] CRAN (R 4.3.0)
 P tidymodels   * 1.2.0      2024-03-25 [?] CRAN (R 4.3.1)
 P tidyr        * 1.3.1      2024-01-24 [?] CRAN (R 4.3.1)
 P tidyselect     1.2.1      2024-03-11 [?] CRAN (R 4.3.1)
 P tidyverse    * 2.0.0      2023-02-22 [?] CRAN (R 4.3.0)
 P timechange     0.3.0      2024-01-18 [?] CRAN (R 4.3.1)
 P timeDate       4032.109   2023-12-14 [?] CRAN (R 4.3.1)
 P tune         * 1.2.1      2024-04-18 [?] CRAN (R 4.3.1)
 P tzdb           0.4.0      2023-05-12 [?] CRAN (R 4.3.0)
 P utf8           1.2.4      2023-10-22 [?] CRAN (R 4.3.1)
 P vctrs          0.6.5      2023-12-01 [?] CRAN (R 4.3.1)
 P vroom          1.6.5      2023-12-05 [?] CRAN (R 4.3.1)
   withr          3.0.1      2024-07-31 [1] RSPM (R 4.4.0)
 P workflows    * 1.1.4      2024-02-19 [?] CRAN (R 4.3.1)
 P workflowsets * 1.1.0      2024-03-21 [?] CRAN (R 4.3.1)
 P xfun           0.45       2024-06-16 [?] CRAN (R 4.4.0)
 P yaml           2.3.8      2023-12-11 [?] CRAN (R 4.3.1)
 P yardstick    * 1.3.1      2024-03-21 [?] CRAN (R 4.3.1)

 [1] /Users/soltoffbc/Projects/info-5001/course-site/renv/library/macos/R-4.4/aarch64-apple-darwin20
 [2] /Users/soltoffbc/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.4/aarch64-apple-darwin20/f7156815

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────