Text analysis: supervised classification

Lecture 23

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2023

2023-11-15

Announcements

  • Lab 06 oopsie
  • Homework 06
  • Lab on Friday
  • Extra credit assignment

Supervised text classification

Supervised learning

  1. Hand-code a small set of documents (\(N = 1{,}000\))
  2. Train a machine learning model on the hand-coded data
  3. Evaluate the effectiveness of the machine learning model
  4. Apply the final model to the remaining set of documents (\(N = 1{,}000{,}000\))

USCongress

Rows: 4,449
Columns: 7
$ ID       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ cong     <dbl> 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 1…
$ billnum  <dbl> 4499, 4500, 4501, 4502, 4503, 4504, 4505, 4506, 4507, 4508, 4…
$ h_or_sen <chr> "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "…
$ major    <dbl> 18, 18, 18, 18, 5, 21, 15, 18, 18, 18, 18, 16, 18, 12, 2, 3, …
$ text     <chr> "To suspend temporarily the duty on Fast Magenta 2 Stage.", "…
$ label    <fct> "Foreign trade", "Foreign trade", "Foreign trade", "Foreign t…
[1] "To suspend temporarily the duty on Fast Magenta 2 Stage."
[2] "To suspend temporarily the duty on Fast Black 286 Stage."
[3] "To suspend temporarily the duty on mixtures of Fluazinam."
[4] "To reduce temporarily the duty on Prodiamine Technical."
[5] "To amend the Immigration and Nationality Act in regard to Caribbean-born immigrants."
[6] "To amend title 38, United States Code, to extend the eligibility for housing loans guaranteed by the Secretary of Veterans Affairs under the Native American Housing Loan Pilot Program to veterans who are married to Native Americans."

Split the data set

set.seed(123)

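# convert response variable to a factor labeled with the topic names
# (the full major and label columns are passed, pairing each code with its topic)
congress <- congress |>
  mutate(major = factor(x = major, levels = major, labels = label))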

# split into training and testing sets
congress_split <- initial_split(data = congress, strata = major, prop = .8)
congress_split
<Training/Testing/Total>
<3558/891/4449>
congress_train <- training(congress_split)
congress_test <- testing(congress_split)

# generate cross-validation folds
congress_folds <- vfold_cv(data = congress_train, strata = major)
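
To make the role of `strata = major` more concrete, here is a minimal base R sketch of a stratified 80/20 split on toy data. It is an illustration of the idea only, not how rsample implements `initial_split()`, and the `toy` data frame and its values are hypothetical.

```r
# Toy illustration of a stratified 80/20 split (not rsample's actual algorithm).
# `toy` and its columns are hypothetical stand-ins for the congress data.
set.seed(123)
toy <- data.frame(
  id = 1:100,
  major = rep(c("Health", "Defense", "Foreign trade"), times = c(60, 30, 10))
)

# sample 80% of the row ids within each class
train_idx <- unlist(lapply(
  split(toy$id, toy$major),
  function(ids) ids[sample.int(length(ids), size = round(0.8 * length(ids)))]
))

toy_train <- toy[toy$id %in% train_idx, ]
toy_test <- toy[!toy$id %in% train_idx, ]

# class proportions are the same in both partitions
prop.table(table(toy_train$major))
prop.table(table(toy_test$major))
```

Because the sampling happens within each class, rare topics are guaranteed to appear in both the training and testing sets in roughly their original proportions.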

Class imbalance
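
Class imbalance means some topics have far more bills than others. One common response, and the idea behind the `step_downsample()` call in the recipe, is to sample every class down to the size of the rarest one. The following is a minimal base R sketch on hypothetical labels, not the themis implementation.

```r
# Toy illustration of downsampling (hypothetical labels, not themis internals).
set.seed(123)
labels <- rep(c("Health", "Defense", "Foreign trade"), times = c(60, 30, 10))

# class counts before downsampling
table(labels)

# keep as many observations per class as the rarest class has
n_min <- min(table(labels))
keep <- unlist(lapply(
  split(seq_along(labels), labels),
  function(idx) idx[sample.int(length(idx), size = n_min)]
))

# class counts after downsampling: every class now has n_min observations
table(labels[keep])
```

The trade-off is that observations from the larger classes are discarded, so the model trains on less data in exchange for balanced classes.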

Preprocessing the data frame

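library(textrecipes)
library(themis) # provides step_downsample()
congress_rec <- recipe(major ~ text, data = congress_train) |>
  # split each bill title into word tokens
  step_tokenize(text) |>
  # remove common stop words
  step_stopwords(text) |>
  # keep only the 500 most frequent tokens
  step_tokenfilter(text, max_tokens = 500) |>
  # convert the tokens to tf-idf weighted features
  step_tfidf(text) |>
  # downsample so every class is equally represented
  step_downsample(major)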
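
To show what the tokenize, stop word, and tf-idf steps do to the text, here is a rough base R sketch on three of the bill titles shown earlier. It uses the unsmoothed textbook formula (term frequency times the log of the inverse document frequency) and a tiny hypothetical stop word list, so the numbers will likely differ from what textrecipes produces with its defaults.

```r
# Toy tf-idf computation in base R (a sketch of the idea, not textrecipes' code).
docs <- c(
  "To suspend temporarily the duty on Fast Magenta 2 Stage.",
  "To suspend temporarily the duty on Fast Black 286 Stage.",
  "To reduce temporarily the duty on Prodiamine Technical."
)

# tokenize: lowercase, then split on anything that is not a letter or digit
tokens <- lapply(strsplit(tolower(docs), "[^a-z0-9]+"), function(x) x[x != ""])

# drop a tiny, hypothetical stop word list
stop_words <- c("to", "the", "on")
tokens <- lapply(tokens, function(x) x[!x %in% stop_words])

# term frequency: each token's share of its document, one row per document
vocab <- sort(unique(unlist(tokens)))
tf <- t(vapply(
  tokens,
  function(x) as.numeric(table(factor(x, levels = vocab))) / length(x),
  numeric(length(vocab))
))
colnames(tf) <- vocab

# inverse document frequency: log(number of documents / documents containing the term)
idf <- log(length(docs) / colSums(tf > 0))

# tf-idf: weight each column of term frequencies by that term's idf
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 2)
```

Tokens that appear in every document, such as "duty", receive a weight of zero, while tokens unique to one title, such as "magenta", receive the highest weights.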

Establish a baseline

null_classification <- null_model() |>
  set_engine("parsnip") |>
  set_mode("classification")

null_cv <- workflow() |>
  add_recipe(congress_rec) |>
  add_model(null_classification) |>
  fit_resamples(
    congress_folds
  )

null_cv |>
  collect_metrics()
# A tibble: 2 × 6
  .metric  .estimator   mean     n std_err .config             
  <chr>    <chr>       <dbl> <int>   <dbl> <chr>               
1 accuracy multiclass 0.0899    10 0.00396 Preprocessor1_Model1
2 roc_auc  hand_till  0.5       10 0       Preprocessor1_Model1
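
A null model ignores the predictors and assigns every observation to the most frequent class in the training data. The base R sketch below mimics that idea on hypothetical labels to show where a baseline accuracy comes from. It is not parsnip's implementation.

```r
# Toy majority-class baseline (hypothetical labels; mimics the idea of null_model()).
train_labels <- rep(c("Health", "Defense", "Foreign trade"), times = c(50, 30, 20))
test_labels <- rep(c("Health", "Defense", "Foreign trade"), times = c(12, 8, 5))

# "fit": find the most frequent class in the training data
majority_class <- names(which.max(table(train_labels)))

# "predict": assign that class to every test observation
predictions <- rep(majority_class, length(test_labels))

# accuracy equals the share of the majority class in the test set
baseline_accuracy <- mean(predictions == test_labels)
baseline_accuracy
```

Because such a model gives every observation the same predicted probabilities, it cannot rank observations, which is consistent with the ROC AUC of 0.5 reported above. Any useful classifier should beat both numbers.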

Define the model

tree_spec <- decision_tree() |>
  set_mode("classification") |>
  set_engine("C5.0")

tree_spec
Decision Tree Model Specification (classification)

Computational engine: C5.0 

Train the model

tree_wf <- workflow() |>
  add_recipe(congress_rec) |>
  add_model(tree_spec)
set.seed(123)

tree_cv <- fit_resamples(
  tree_wf,
  congress_folds,
  control = control_resamples(
    save_pred = TRUE,
    save_workflow = TRUE
  )
)
tree_cv_metrics <- collect_metrics(tree_cv)
tree_cv_predictions <- collect_predictions(tree_cv)
tree_cv_metrics
# A tibble: 2 × 6
  .metric  .estimator  mean     n std_err .config             
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy multiclass 0.432    10 0.00689 Preprocessor1_Model1
2 roc_auc  hand_till  0.766    10 0.00706 Preprocessor1_Model1

Confusion matrix
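
A confusion matrix cross-tabulates predicted classes against true classes. The base R sketch below builds one from hypothetical predictions. With the resampled predictions saved in `tree_cv_predictions`, the same kind of table could presumably be produced with `conf_mat()` from yardstick, using `major` as the truth and `.pred_class` as the estimate.

```r
# Toy confusion matrix (hypothetical predictions, not the lecture's results).
lvls <- c("Defense", "Foreign trade", "Health")
truth <- factor(
  c("Health", "Health", "Health", "Defense", "Defense", "Foreign trade"),
  levels = lvls
)
estimate <- factor(
  c("Health", "Health", "Defense", "Defense", "Health", "Foreign trade"),
  levels = lvls
)

# rows are predicted classes, columns are true classes
cm <- table(Prediction = estimate, Truth = truth)
cm

# overall accuracy is the share of observations on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

Off-diagonal cells show which topics the model confuses with one another, which overall accuracy alone hides.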

Feature importance

Application exercise

ae-21

  • Go to the course GitHub org and find your ae-21 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow

Recap

  • Supervised ML with text features follows the same workflow as ML with any other features: split, preprocess, fit, and evaluate
  • Text must first be converted into numeric variables through tokenization and weighting such as tf-idf
  • Tokenization can produce a very large number of features – unless computing power is unlimited, restrict the number of variables, for example by keeping only the most frequent tokens
  • Use variable importance measures to assess how much specific tokens contribute to the predictive model

Family photos (2023)

My family standing in front of trees. From left to right, Amanda, Jacob, Rosemarie, Benjamin, and Beverly.