Tune better models to predict children in hotel bookings

Application exercise

Modified: September 12, 2024

Your Turn 1

Fill in the blanks to return the accuracy and ROC AUC for this model using 10-fold cross-validation.

tree_mod <- decision_tree(engine = "rpart") |>
  set_mode("classification")

tree_wf <- workflow() |>
  add_formula(children ~ .) |>
  add_model(tree_mod)

set.seed(100)
______ |> 
  ______(resamples = hotels_folds) |> 
  ______
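The code above assumes that hotels_folds already exists. As a reminder of where such an object comes from, here is a minimal sketch on a toy data frame. The toy data and the names toy, toy_split, toy_train, toy_test, and toy_folds are hypothetical stand-ins, and the 75/25 split proportion is an assumption, not necessarily what was used for the hotels data.

```r
library(tidymodels)

# Hypothetical toy data standing in for the hotels data
set.seed(100)
n <- 400
toy <- tibble(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n),
  children = factor(
    ifelse(x1 + 0.5 * x2 + rnorm(n) > 0, "children", "none"),
    levels = c("children", "none")
  )
)

# Hold out a test set first (the 75/25 proportion is an assumption)
toy_split <- initial_split(toy, prop = 0.75, strata = children)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# 10-fold cross-validation is built from the training set only
toy_folds <- vfold_cv(toy_train, v = 10, strata = children)
toy_folds
```

Each row of toy_folds holds one analysis/assessment split of the training data; the test set is never touched during resampling.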

Your Turn 2

Create a new parsnip model called rf_mod, which will learn an ensemble of classification trees from our training data using the ranger package. Update your tree_wf with this new model.

Fit your workflow with 10-fold cross-validation and compare the ROC AUC of the random forest to that of your single decision tree model. Which one predicts the held-out assessment folds better?

Hint: you’ll need https://www.tidymodels.org/find/parsnip/

# model
rf_mod <- _____ |> 
  _____("ranger") |> 
  _____("classification")

# workflow
rf_wf <- tree_wf |> 
  update_model(_____)

# fit with cross-validation
set.seed(100)
_____ |> 
  fit_resamples(resamples = hotels_folds) |> 
  collect_metrics()

Your Turn 3

Challenge: Fit three more random forest models that use 5, 12, and 21 variables, respectively, at each split (the mtry argument). Update your rf_wf with each new model in turn. Which value of mtry maximizes the area under the ROC curve?

rf5_mod <- rf_mod |> 
  set_args(mtry = 5) 

rf12_mod <- rf_mod |> 
  set_args(mtry = 12) 

rf21_mod <- rf_mod |> 
  set_args(mtry = 21)

Do this for each model above:

_____ <- rf_wf |> 
  update_model(_____)

set.seed(100)
_____ |> 
  fit_resamples(resamples = hotels_folds) |> 
  collect_metrics()
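Repeating the same steps for every model invites copy-and-paste mistakes, so one option is to wrap them in a small function. The sketch below does this on a toy data frame rather than the hotels data. The helper cv_roc_auc(), the toy data, and the mtry values 1, 2, and 4 are all hypothetical, chosen so that mtry never exceeds the four toy predictors.

```r
library(tidymodels)

# Hypothetical toy data standing in for the hotels training set
set.seed(100)
n <- 300
toy <- tibble(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n),
  x4 = rnorm(n),
  children = factor(
    ifelse(x1 + 0.5 * x2 + rnorm(n) > 0, "children", "none"),
    levels = c("children", "none")
  )
)
toy_folds <- vfold_cv(toy, v = 5, strata = children)

# Fewer trees than the default, only to keep the sketch fast
rf_mod <- rand_forest(engine = "ranger", trees = 100) |>
  set_mode("classification")

rf_wf <- workflow() |>
  add_formula(children ~ .) |>
  add_model(rf_mod)

# Hypothetical helper: swap in one mtry value, resample, return the ROC AUC
cv_roc_auc <- function(mtry_value) {
  # `!!` inlines the current value into the model specification
  wf <- rf_wf |>
    update_model(rf_mod |> set_args(mtry = !!mtry_value))

  set.seed(100)
  wf |>
    fit_resamples(resamples = toy_folds, metrics = metric_set(roc_auc)) |>
    collect_metrics() |>
    mutate(mtry = mtry_value)
}

mtry_results <- bind_rows(lapply(c(1, 2, 4), cv_roc_auc))
mtry_results
```

Resetting the seed inside the helper mirrors the set.seed(100) call in the exercise, so each model is fit under the same random number stream.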

Your Turn 4

Edit the random forest model to tune the mtry and min_n hyper-parameters; call the new model spec rf_tuner.

Update your workflow to use the new model spec.

Then use tune_grid() to find the best combination of hyper-parameters to maximize roc_auc; let tune set up the grid for you.

How does the best tuned ROC AUC compare to the average ROC AUC across folds from fit_resamples()?

rf_mod <- rand_forest(engine = "ranger") |> 
  set_mode("classification")

rf_wf <- workflow() |> 
  add_formula(children ~ .) |> 
  add_model(rf_mod)

set.seed(100) # Important!
rf_results <- rf_wf |> 
  fit_resamples(resamples = hotels_folds,
                metrics = metric_set(roc_auc),
                # change me to control_grid() with tune_grid
                control = control_resamples(verbose = TRUE,
                                            save_workflow = TRUE))

rf_results |> 
  collect_metrics()

# your code here
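If the tuning pattern is unfamiliar, the sketch below shows it on a different model so that the exercise itself stays open: a decision tree with two tuned arguments, fit to a toy data frame. The toy data and the names tree_tuner, tree_tuner_wf, and tree_tuned are hypothetical, and grid = 5 is an arbitrary choice that asks tune to generate five candidate combinations.

```r
library(tidymodels)

# Hypothetical toy data standing in for the hotels training set
set.seed(100)
n <- 300
toy <- tibble(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n),
  children = factor(
    ifelse(x1 + 0.5 * x2 + rnorm(n) > 0, "children", "none"),
    levels = c("children", "none")
  )
)
toy_folds <- vfold_cv(toy, v = 5, strata = children)

# tune() marks an argument as a placeholder to be filled in by tune_grid()
tree_tuner <- decision_tree(
  engine = "rpart",
  cost_complexity = tune(),
  tree_depth = tune()
) |>
  set_mode("classification")

tree_tuner_wf <- workflow() |>
  add_formula(children ~ .) |>
  add_model(tree_tuner)

set.seed(100)
tree_tuned <- tree_tuner_wf |>
  tune_grid(
    resamples = toy_folds,
    grid = 5,  # an integer lets tune build the grid itself
    metrics = metric_set(roc_auc),
    control = control_grid(save_workflow = TRUE)
  )

# Top candidates ranked by cross-validated ROC AUC
best_candidates <- tree_tuned |>
  show_best(metric = "roc_auc", n = 3)
best_candidates
```

Note that control_resamples() is replaced by control_grid() here, as the comment in the exercise code suggests.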

Your Turn 5

Use fit_best() to refit the workflow on the training set with the best combination of hyper-parameters from rf_results, then use the fitted workflow to predict the test set.

How does our actual test ROC AUC compare to our cross-validated estimate?

# your code here
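As a rough illustration of the final step, the sketch below tunes a decision tree on a toy data frame, refits the best candidate with fit_best(), and scores it on a held-out test set. It deliberately uses a different model from the exercise. The toy data and every object name here are hypothetical, and fit_best() only works because the tuning results were saved with save_workflow = TRUE.

```r
library(tidymodels)

# Hypothetical toy data standing in for the hotels data
set.seed(100)
n <- 400
toy <- tibble(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n),
  children = factor(
    ifelse(x1 + 0.5 * x2 + rnorm(n) > 0, "children", "none"),
    levels = c("children", "none")
  )
)
toy_split <- initial_split(toy, prop = 0.75, strata = children)
toy_test  <- testing(toy_split)
toy_folds <- vfold_cv(training(toy_split), v = 5, strata = children)

tree_wf <- workflow() |>
  add_formula(children ~ .) |>
  add_model(
    decision_tree(engine = "rpart", tree_depth = tune()) |>
      set_mode("classification")
  )

set.seed(100)
tree_tuned <- tree_wf |>
  tune_grid(
    resamples = toy_folds,
    grid = 4,
    metrics = metric_set(roc_auc),
    control = control_grid(save_workflow = TRUE)  # required by fit_best()
  )

# Refit on the full training set using the best candidate
best_fit <- fit_best(tree_tuned, metric = "roc_auc")

# Score the test set; "children" is the first factor level, so it is the event
test_auc <- best_fit |>
  augment(new_data = toy_test) |>
  roc_auc(truth = children, .pred_children)
test_auc

# Cross-validated estimate for the same candidate, for comparison
tree_tuned |> show_best(metric = "roc_auc", n = 1)
```

The two numbers will rarely match exactly; a test estimate that is far below the cross-validated one is usually worth investigating.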

Acknowledgments

Materials derived from Tidymodels, Virtually: An Introduction to Machine Learning with Tidymodels by Alison Hill.
Dataset and some modeling steps derived from A predictive modeling case study and licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA) License.