Build better training data

Lecture 20

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2023

2023-11-06

Announcements

  • Exploration feedback forthcoming

Application exercise

ae-18

  • Go to the course GitHub org and find your ae-18 repo (the repo name is suffixed with your GitHub username).
  • Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow

Import data

hotels <- read_csv("data/hotels.csv") |>
  mutate(across(where(is.character), as.factor))

count(hotels, children)
# A tibble: 2 × 2
  children     n
  <fct>    <int>
1 children  4039
2 none     45961

👩🏼‍🍳 Build a better training set with recipes

recipes

Preprocessing options

  • Encode categorical predictors
  • Center and scale variables
  • Handle class imbalance
  • Impute missing data
  • Perform dimensionality reduction
  • A lot more!

To build a recipe

  1. Start the recipe()
  2. Define the variables involved
  3. Describe preprocessing step-by-step

recipe()

Creates a recipe for a set of variables

recipe(children ~ ., data = hotels)


rec <- recipe(children ~ ., data = hotels)

step_*()

Adds a single transformation to a recipe. Transformations are replayed in order when the recipe is run on data.

rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date)

Before recipe

# A tibble: 45,000 × 1
   arrival_date
   <date>      
 1 2016-04-28  
 2 2016-12-29  
 3 2016-10-17  
 4 2016-05-22  
 5 2016-03-02  
 6 2016-06-16  
 7 2017-02-13  
 8 2017-08-20  
 9 2017-08-22  
10 2017-05-18  
# ℹ 44,990 more rows

After recipe

# A tibble: 45,000 × 4
   arrival_date arrival_date_dow arrival_date_month
   <date>       <fct>            <fct>             
 1 2016-04-28   Thu              Apr               
 2 2016-12-29   Thu              Dec               
 3 2016-10-17   Mon              Oct               
 4 2016-05-22   Sun              May               
 5 2016-03-02   Wed              Mar               
 6 2016-06-16   Thu              Jun               
 7 2017-02-13   Mon              Feb               
 8 2017-08-20   Sun              Aug               
 9 2017-08-22   Tue              Aug               
10 2017-05-18   Thu              May               
# ℹ 44,990 more rows
# ℹ 1 more variable: arrival_date_year <int>

step_*()

Complete list at: https://recipes.tidymodels.org/reference/index.html

step_holiday() + step_rm()

Generate a set of indicator variables for specific holidays, then remove the original date column.

holidays <- c("AllSouls", "AshWednesday", "ChristmasEve", "Easter", 
              "ChristmasDay", "GoodFriday", "NewYearsDay", "PalmSunday")

rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date)

step_holiday() + step_rm()

Rows: 45,000
Columns: 11
$ arrival_date_dow          <fct> Thu, Thu, Mon, Sun, Wed,…
$ arrival_date_month        <fct> Apr, Dec, Oct, May, Mar,…
$ arrival_date_year         <int> 2016, 2016, 2016, 2016, …
$ arrival_date_AllSouls     <int> 0, 0, 0, 0, 0, 0, 0, 0, …
$ arrival_date_AshWednesday <int> 0, 0, 0, 0, 0, 0, 0, 0, …
$ arrival_date_ChristmasEve <int> 0, 0, 0, 0, 0, 0, 0, 0, …
$ arrival_date_Easter       <int> 0, 0, 0, 0, 0, 0, 0, 0, …
$ arrival_date_ChristmasDay <int> 0, 0, 0, 0, 0, 0, 0, 0, …
$ arrival_date_GoodFriday   <int> 0, 0, 0, 0, 0, 0, 0, 0, …
$ arrival_date_NewYearsDay  <int> 0, 0, 0, 0, 0, 0, 0, 0, …
$ arrival_date_PalmSunday   <int> 0, 0, 0, 0, 0, 0, 0, 0, …

K Nearest Neighbors (KNN)

To predict the outcome of a new data point:

  • Find the K most similar old data points
  • Take the average/mode/etc. outcome
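
For intuition, here is a minimal sketch of that idea in R, using a made-up toy dataset and k = 3 (both hypothetical, purely for illustration):

library(dplyr)

# Toy data: three points of class "a" near the origin, three of class "b" far away
old_points <- tibble(
  x1    = c(1, 2, 3, 10, 11, 12),
  x2    = c(1, 2, 3, 10, 11, 12),
  class = c("a", "a", "a", "b", "b", "b")
)
new_point <- c(x1 = 2.5, x2 = 2.5)

old_points |>
  mutate(dist = sqrt((x1 - new_point["x1"])^2 + (x2 - new_point["x2"])^2)) |>
  slice_min(dist, n = 3) |>   # the 3 most similar old data points
  count(class)                # majority vote: class "a" wins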

To specify a model with parsnip

  1. Pick a model
  2. Set the engine
  3. Set the mode (if needed)

To specify a KNN model with parsnip

knn_mod <- nearest_neighbor() |>              
  set_engine("kknn") |>             
  set_mode("classification")        

Fact

KNN requires all predictors to be numeric, and they all need to be centered and scaled.

What does that mean?
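
One way to see it, with made-up numbers: when predictors live on different scales, the one with larger units dominates the distance that KNN relies on.

# Hypothetical: two stays differ by $1 in daily rate but by 300 days in lead time
rate <- c(100, 101)   # dollars
lead <- c(5, 305)     # days

# Unscaled: the distance is almost entirely lead time
sqrt(diff(rate)^2 + diff(lead)^2)
[1] 300.0017

# Centered and scaled: both predictors contribute equally
rate_z <- (rate - mean(rate)) / sd(rate)
lead_z <- (lead - mean(lead)) / sd(lead)
sqrt(diff(rate_z)^2 + diff(lead_z)^2)
[1] 2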

Quiz

Why do you need to “train” a recipe?

Imagine “scaling” a new data point. What do you subtract from it? What do you divide it by?
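
That is why recipes are trained: prep() estimates these statistics from the training set, and bake() applies them to new data. A minimal sketch, assuming the hotels_train/hotels_test split used later in these slides:

# prep() "trains" the recipe: the mean and sd used for normalization
# are estimated from the training set only
rec_trained <- recipe(children ~ average_daily_rate, data = hotels_train) |>
  step_normalize(all_numeric_predictors()) |>
  prep()

# bake() applies those stored training statistics to any data,
# including data points the recipe has never seen
bake(rec_trained, new_data = hotels_test)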

Guess

# A tibble: 5 × 1
  meal     
  <fct>    
1 SC       
2 BB       
3 HB       
4 Undefined
5 FB       
# A tibble: 50,000 × 5
      SC    BB    HB Undefined    FB
   <dbl> <dbl> <dbl>     <dbl> <dbl>
 1     1     0     0         0     0
 2     0     1     0         0     0
 3     0     1     0         0     0
 4     0     1     0         0     0
 5     0     1     0         0     0
 6     0     1     0         0     0
 7     0     0     1         0     0
 8     0     1     0         0     0
 9     0     0     1         0     0
10     1     0     0         0     0
# ℹ 49,990 more rows

Dummy Variables

logistic_reg() |>
  fit(children ~ meal, data = hotels) |> 
  broom::tidy()
# A tibble: 5 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      2.38     0.0183    130.   0       
2 mealFB          -1.15     0.165      -6.98 2.88e-12
3 mealHB          -0.118    0.0465     -2.54 1.12e- 2
4 mealSC           1.43     0.104      13.7  1.37e-42
5 mealUndefined    0.570    0.188       3.03 2.47e- 3

step_dummy()

Converts nominal data into numeric dummy variables, needed as predictors for models like KNN.

rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |> 
  step_dummy(all_nominal_predictors())
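
To peek at what step_dummy() produces, here is a small sketch on just the meal column; prep() trains the recipe and bake(new_data = NULL) returns the processed training data:

recipe(~ meal, data = hotels) |>
  step_dummy(meal) |>
  prep() |>
  bake(new_data = NULL)   # meal_FB, meal_HB, meal_SC, meal_Undefined (BB is the reference level)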

Quiz

How does recipes know which variables are numeric and which are nominal?

rec <- recipe(
  children ~ ., 
  data = hotels
  )

Quiz

How does recipes know what is a predictor and what is an outcome?

rec <- recipe(
  children ~ .,
  data = hotels
  )

The formula → indicates outcomes vs. predictors

The data → is only used to catalog the names and types of each variable
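
You can inspect that catalog yourself: calling summary() on an (untrained) recipe returns one row per variable with its name, type, and role.

summary(rec)   # columns: variable, type, role, source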

Selectors

Helper functions for selecting sets of variables

rec |> 
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

Some common selector functions

  • all_predictors(): each x variable (right side of ~)
  • all_outcomes(): each y variable (left side of ~)
  • all_numeric(): each numeric variable
  • all_nominal(): each categorical variable (e.g. factor, string)
  • all_nominal_predictors(): each categorical variable (e.g. factor, string) that is defined as a predictor
  • all_numeric_predictors(): each numeric variable that is defined as a predictor
  • dplyr::select() helpers: starts_with('NY_'), etc.

Guess

What would happen if you try to normalize a variable that doesn’t vary?

Error! You’d be dividing by zero!
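
A tiny base R illustration:

x <- c(5, 5, 5, 5)      # a zero-variance "variable"
sd(x)
[1] 0
(x - mean(x)) / sd(x)   # 0 / 0
[1] NaN NaN NaN NaN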

step_zv()

Removes zero-variance variables (variables that contain only a single value)

rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |> 
  step_dummy(all_nominal_predictors()) |> 
  step_zv(all_predictors())

step_normalize()

Centers then scales numeric variables (mean = 0, sd = 1)

rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |> 
  step_dummy(all_nominal_predictors()) |> 
  step_zv(all_predictors()) |> 
  step_normalize(all_numeric())
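
As a sanity check (a sketch), prep() the recipe and bake() the training data: every numeric column should come back with mean ≈ 0 and sd ≈ 1.

rec |>
  prep() |>
  bake(new_data = NULL) |>
  summarize(across(where(is.numeric), list(mean = mean, sd = sd)))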

Imbalanced outcome

step_downsample()

library(themis)

rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |> 
  step_dummy(all_nominal_predictors()) |> 
  step_zv(all_predictors()) |> 
  step_normalize(all_numeric()) |>
  step_downsample(children)

After downsampling
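
One way to see the effect in code (a sketch): count the classes in the processed training data. By default, step_downsample() is only applied while the recipe is being trained, so it shows up with bake(new_data = NULL) but leaves new data untouched.

rec |>
  prep() |>
  bake(new_data = NULL) |>
  count(children)   # both classes now have the same number of rows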

⏱️ Your Turn 1

Unscramble! You have all the steps from our knn_rec; your challenge is to unscramble them into the right order!

Save the result as knn_rec

knn_rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |> 
  step_dummy(all_nominal_predictors()) |> 
  step_zv(all_predictors()) |> 
  step_normalize(all_numeric()) |>
  step_downsample(children)
knn_rec

📜 Create boilerplate code using usemodels

usemodels

https://tidymodels.github.io/usemodels/

library(usemodels)
use_kknn(children ~ ., data = hotels, verbose = TRUE, tune = FALSE)
kknn_recipe <- 
  recipe(formula = children ~ ., data = hotels) %>% 
  ## Since distance calculations are used, the predictor variables should 
  ## be on the same scale. Before centering and scaling the numeric 
  ## predictors, any predictors with a single unique value are filtered 
  ## out. 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) 

kknn_spec <- 
  nearest_neighbor() %>% 
  set_mode("classification") %>% 
  set_engine("kknn") 

kknn_workflow <- 
  workflow() %>% 
  add_recipe(kknn_recipe) %>% 
  add_model(kknn_spec) 

use_glmnet(children ~ ., data = hotels, verbose = TRUE, tune = FALSE)
glmnet_recipe <- 
  recipe(formula = children ~ ., data = hotels) %>% 
  ## Regularization methods sum up functions of the model slope 
  ## coefficients. Because of this, the predictor variables should be on 
  ## the same scale. Before centering and scaling the numeric predictors, 
  ## any predictors with a single unique value are filtered out. 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) 

glmnet_spec <- 
  logistic_reg() %>% 
  set_mode("classification") %>% 
  set_engine("glmnet") 

glmnet_workflow <- 
  workflow() %>% 
  add_recipe(glmnet_recipe) %>% 
  add_model(glmnet_spec) 

Note

usemodels generates boilerplate code using the older magrittr pipe operator (%>%) rather than the base pipe (|>) used elsewhere in these slides

Now we’ve built a recipe.

But how do we use a recipe?

Axiom

Feature engineering and modeling are two halves of a single predictive workflow.

🪢🪵 Bundling machine learning workflows with workflow()

workflow()

Creates a workflow to which you can add a model (and more)

workflow()

add_formula()

Adds a formula to a workflow (an alternative to adding a recipe; a workflow uses either a formula or a recipe as its preprocessor, not both)

workflow() |> add_formula(children ~ average_daily_rate)

add_model()

Adds a parsnip model spec to a workflow

workflow() |> add_model(knn_mod)

Guess

If we use add_model() to add a model to a workflow, what would we use to add a recipe?

Let’s see!

⏱️ Your Turn 2

Fill in the blanks to make a workflow that combines knn_rec with knn_mod.

knn_wf <- workflow() |> 
  add_recipe(knn_rec) |> 
  add_model(knn_mod)
knn_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
7 Recipe Steps

• step_date()
• step_holiday()
• step_rm()
• step_dummy()
• step_zv()
• step_normalize()
• step_downsample()

── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (classification)

Computational engine: kknn 

add_recipe()

Adds a recipe to a workflow.

knn_wf <- workflow() |>
  add_recipe(knn_rec) |>
  add_model(knn_mod)

Guess

Do you need to add a formula if you have a recipe?

Nope!

rec <- recipe(
  children ~ .,
  data = hotels
)

fit()

Fit a workflow that bundles a recipe (or formula) and a model.

knn_wf |> 
  fit(data = hotels_train) |> 
  predict(hotels_test)

Preprocess k-fold resamples?

set.seed(100)
hotels_folds <- vfold_cv(hotels_train, v = 10,
                         strata = children)

fit_resamples()

Fit a workflow that bundles a recipe (or formula) and a model, with resampling.

knn_wf |> 
  fit_resamples(resamples = hotels_folds)

⏱️ Your Turn 3

Run the first chunk. Then try our KNN workflow on hotels_folds. What is the ROC AUC?

set.seed(100)
hotels_folds <- vfold_cv(hotels_train, v = 10, strata = children)

knn_wf |> 
  fit_resamples(resamples = hotels_folds) |> 
  collect_metrics()
# A tibble: 2 × 6
  .metric  .estimator  mean     n std_err .config           
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
1 accuracy binary     0.741    10 0.00227 Preprocessor1_Mod…
2 roc_auc  binary     0.833    10 0.00325 Preprocessor1_Mod…

Feature Engineering

update_recipe()

Replace the recipe in a workflow.

knn_wf |>
  update_recipe(glmnet_rec)

update_model()

Replace the model in a workflow.

knn_wf |>
  update_model(tree_mod)

⏱️ Your Turn 4

Turns out, the same knn_rec recipe can also be used to fit a penalized logistic regression model. Let’s try it out!

plr_mod <- logistic_reg(penalty = .01, mixture = 1) |> 
  set_engine("glmnet") |> 
  set_mode("classification")

plr_mod |> 
  translate()
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = 0.01
  mixture = 1

Computational engine: glmnet 

Model fit template:
glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
    alpha = 1, family = "binomial")
glmnet_wf <- knn_wf |> 
  update_model(plr_mod)

glmnet_wf |> 
  fit_resamples(resamples = hotels_folds) |> 
  collect_metrics() 
# A tibble: 2 × 6
  .metric  .estimator  mean     n std_err .config           
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
1 accuracy binary     0.829    10 0.00109 Preprocessor1_Mod…
2 roc_auc  binary     0.874    10 0.00208 Preprocessor1_Mod…

Recap

  • Feature engineering defines a series of pre-processing steps to prepare for modeling the outcome of interest
    • Some feature engineering steps are required for specific types of models
    • Others are dependent on specific types of variables/data structures
  • Feature engineering and modeling are two halves of a single predictive workflow
  • Feature engineering requires training, just like the model
  • Implement feature engineering using recipes
  • Leverage workflow() to create explicit, logical pipelines for training a machine learning model

TV recommendation