Lab 05 - Predicting attitudes on marijuana legalization
This homework is due November 11 at 11:59pm ET.
Getting started
Go to the info5001-fa24 organization on GitHub. Click on the repo with the prefix lab-05. It contains the starter documents you need to complete the lab.
Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.
First, open the Quarto document lab-05.qmd and render it. Make sure it compiles without errors.
All team members should clone the team GitHub repository for the lab. Then, one team member should edit the document YAML by adding the team name to the subtitle field and adding the names of the team members contributing to the lab to the author field. Hopefully that’s everyone, but if someone doesn’t contribute during the lab session or throughout the week before the deadline, their name should not be added. If you have 4 members in your team, you can delete the line for the 5th team member. Then, this team member should render the document and commit and push the changes. All others should not touch the document at this stage.
title: "Lab 05 - Predicting attitudes on marijuana legalization"
subtitle: "Team name"
author:
  - "Team member 1 (netID)"
  - "Team member 2 (netID)"
  - "Team member 3 (netID)"
date: today
format:
  typst:
    fig-format: png
Warm up
Before we introduce the data, let’s warm up with some simple exercises.
- Update the YAML, changing the author name to your name, and render the document.
- Commit your changes with a meaningful commit message.
- Push your changes to GitHub.
- Go to your repo on GitHub and confirm that your changes are visible in your .qmd and .pdf files. If anything is missing, render, commit, and push again.
You will estimate a series of machine learning models for this homework assignment. I strongly encourage you to make use of code caching in the Quarto document to decrease the rendering time for the document.
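As a minimal sketch of what caching can look like, the lines below are the contents of one R code chunk in lab-05.qmd. The `#|` lines are chunk options; the label `fit-models` and the placeholder computation are hypothetical and only stand in for slow model-fitting code.

```r
#| label: fit-models
#| cache: true

# Placeholder for a slow computation such as tuning a model.
# With cache: true, the stored result is reused on later renders
# as long as the code in this chunk has not changed.
slow_result <- mean(replicate(1000, mean(rnorm(1e4))))
slow_result
```

If you change upstream code that a cached chunk depends on (for example the data split), you may need to clear the cache so that stale results are not reused.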
Data and packages
We’ll use the tidyverse and tidymodels packages for this assignment.
The General Social Survey is a biennial survey of the American public.
Over the past twenty years, American attitudes towards marijuana have softened extensively. In the early 2010s, the number of Americans who believed marijuana should be legal began to outnumber those who thought it should not be legal.
data/gss.rds contains a selection of variables from the 2022 GSS. The outcome of interest, grassv, is a factor variable coded as either "should be legal" (respondent believes marijuana should be legal) or "should not be legal" (respondent believes marijuana should not be legal).
Name | gss |
Number of rows | 3319 |
Number of columns | 25 |
_______________________ | |
Column type frequency: | |
factor | 22 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
colath | 1165 | 0.65 | FALSE | 2 | yes: 1444, not: 710 |
colmslm | 1172 | 0.65 | FALSE | 2 | not: 1435, yes: 712 |
degree | 2 | 1.00 | TRUE | 5 | hig: 1533, bac: 685, gra: 476, les: 330 |
fear | 1141 | 0.66 | FALSE | 2 | no: 1311, yes: 867 |
grassv | 2795 | 0.16 | FALSE | 2 | sho: 391, sho: 133 |
gunlaw | 1154 | 0.65 | FALSE | 2 | fav: 1560, opp: 605 |
happy | 19 | 0.99 | TRUE | 3 | pre: 1819, not: 759, ver: 722 |
health | 4 | 1.00 | FALSE | 4 | goo: 1724, fai: 804, exc: 628, poo: 159 |
hispanic | 36 | 0.99 | FALSE | 22 | not: 2666, mex: 326, pue: 81, spa: 46 |
income16 | 377 | 0.89 | TRUE | 26 | $17: 324, $60: 294, $90: 239, $50: 230 |
letdie1 | 2229 | 0.33 | FALSE | 2 | yes: 815, no: 275 |
owngun | 1142 | 0.66 | FALSE | 3 | no: 1415, yes: 725, ref: 37 |
partyid | 27 | 0.99 | TRUE | 8 | ind: 765, str: 570, not: 468, ind: 380 |
polviews | 96 | 0.97 | TRUE | 7 | mod: 1228, con: 455, lib: 451, sli: 396 |
pray | 37 | 0.99 | FALSE | 6 | sev: 965, nev: 669, onc: 623, les: 424 |
pres20 | 1057 | 0.68 | FALSE | 4 | bid: 1378, tru: 801, oth: 68, did: 15 |
race | 46 | 0.99 | FALSE | 3 | whi: 2124, bla: 617, oth: 532 |
region | 0 | 1.00 | FALSE | 9 | sou: 755, eas: 537, pac: 536, wes: 361 |
sex | 16 | 1.00 | FALSE | 2 | fem: 1780, mal: 1523 |
sexfreq | 1883 | 0.43 | TRUE | 7 | not: 447, 2 o: 209, 2 o: 196, abo: 194 |
wrkstat | 6 | 1.00 | FALSE | 8 | wor: 1526, ret: 691, wor: 331, kee: 245 |
zodiac | 262 | 0.92 | FALSE | 12 | pis: 280, aqu: 276, lib: 274, gem: 264 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1.00 | 2091.62 | 1195.82 | 1 | 1053.5 | 2104 | 3130.5 | 4150 | ▇▇▇▇▇ |
age | 205 | 0.94 | 48.47 | 17.73 | 18 | 33.0 | 47 | 63.0 | 89 | ▇▇▆▆▂ |
hrs1 | 1468 | 0.56 | 40.14 | 14.06 | 0 | 36.0 | 40 | 45.0 | 89 | ▁▂▇▁▁ |
You can find the documentation for each of the available variables using the GSS Data Explorer. Just search by the column name to find the associated description.
Exercises
Pick a member of the team to complete Exercises 1-3. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.
Exercise 1
Selecting potential features. For each of the variables below, explain whether or not you think they would be useful predictors for grassv and why.
degree
happy
zodiac
id
Exercise 2
Partitioning your data. Reproducibly split your data into training and test sets. Allocate 75% of observations to training, and 25% to testing. Partition the training set into 10 distinct folds for model fitting. Unless otherwise stated, you will use these sets for all the remaining exercises.
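One way the partitioning step might be sketched with rsample is shown below. The seed value 123 and the object names are arbitrary choices for illustration, and the code assumes it is run from the project root so that data/gss.rds can be found.

```r
library(tidymodels)

# Import the survey data (path taken from the lab instructions)
gss <- readRDS("data/gss.rds")

# Set a seed so the random split and folds are reproducible
set.seed(123)

# 75% of rows for training, 25% for testing
gss_split <- initial_split(gss, prop = 0.75)
gss_train <- training(gss_split)
gss_test <- testing(gss_split)

# 10-fold cross-validation on the training set only
gss_folds <- vfold_cv(gss_train, v = 10)
gss_folds
```

The folds are built from the training set alone, so the test set stays untouched until the final evaluation.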
Exercise 3
Fit a null model. To establish a baseline for evaluating model performance, we want to estimate a null model. This is a model with zero predictors. In the absence of predictors, our best guess for a classification model is to predict the modal outcome for all observations (e.g. if a majority of respondents in the training set believe marijuana should be legal, then we would predict that outcome for every respondent).
The parsnip package includes a model specification for the null model. Fit the null model using the cross-validated folds. Report the accuracy and ROC AUC values for this model. How does the null model perform?
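A rough sketch of fitting the null model across resamples follows. It assumes the data path from the lab, an arbitrary seed, and hypothetical object names. Because grassv is missing for many respondents, this sketch also assumes a small recipe that drops rows with a missing outcome during training; that is one possible approach, not necessarily the intended one.

```r
library(tidymodels)

gss <- readRDS("data/gss.rds")

set.seed(123)  # arbitrary seed for reproducibility
gss_train <- training(initial_split(gss, prop = 0.75))
gss_folds <- vfold_cv(gss_train, v = 10)

# Drop training rows with a missing outcome; the null model ignores predictors
null_rec <- recipe(grassv ~ ., data = gss_train) |>
  step_naomit(grassv, skip = TRUE)

# Null model specification from parsnip
null_spec <- null_model() |>
  set_engine("parsnip") |>
  set_mode("classification")

null_wf <- workflow() |>
  add_recipe(null_rec) |>
  add_model(null_spec)

# Estimate performance using the cross-validated folds
null_rs <- fit_resamples(
  null_wf,
  resamples = gss_folds,
  metrics = metric_set(accuracy, roc_auc)
)
null_metrics <- collect_metrics(null_rs)
null_metrics
```

Since the null model assigns every observation the same predicted probability, it cannot rank respondents, which is worth keeping in mind when interpreting its ROC AUC.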
After the team member working on Exercises 1-3 renders, commits, and pushes, another team member should pull their changes and run the script again to verify correct output. Then, they should write the answer to Exercise 4. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.
Exercise 4
Fit a basic random forest model. Estimate a random forest model to predict grassv as a function of all the other variables in the dataset (except id). In order to do this, you need to impute missing values for all the predictor columns. This means replacing missing values (NA) with plausible values given what we know about the other observations.
To do this you should build a feature engineering recipe that does the following:
- Omits the id column as a predictor
- Removes rows with an NA for grassv (we want to omit observations with missing values for outcomes, not impute them)
- Uses median imputation for numeric predictors
- Uses modal imputation for nominal predictors
Fit the model using the cross-validated folds and the ranger engine, and report the ROC AUC values for this model. How does this model perform?
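The recipe steps listed above could be sketched roughly as follows. Object names and the seed are hypothetical, and using update_role() to keep id out of the predictors is one option among several (removing the column is another).

```r
library(tidymodels)

gss <- readRDS("data/gss.rds")

set.seed(123)  # arbitrary seed for reproducibility
gss_train <- training(initial_split(gss, prop = 0.75))
gss_folds <- vfold_cv(gss_train, v = 10)

rf_rec <- recipe(grassv ~ ., data = gss_train) |>
  # keep id in the data but stop treating it as a predictor
  update_role(id, new_role = "id variable") |>
  # drop training rows with a missing outcome
  step_naomit(grassv, skip = TRUE) |>
  # fill in missing predictor values
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors())

# Random forest with default hyperparameters, ranger engine
rf_spec <- rand_forest() |>
  set_engine("ranger") |>
  set_mode("classification")

rf_wf <- workflow() |>
  add_recipe(rf_rec) |>
  add_model(rf_spec)

rf_rs <- fit_resamples(
  rf_wf,
  resamples = gss_folds,
  metrics = metric_set(roc_auc)
)
collect_metrics(rf_rs)
```

Because the imputation lives inside the recipe, the medians and modes are re-estimated within each fold rather than computed once on the full training set.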
After the team member working on Exercise 4 renders, commits, and pushes, another team member should pull their changes and run the script again to verify correct output. Then, they should write the answer to Exercise 5. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.
Exercise 5
Fit a penalized logistic regression model. Estimate a penalized logistic regression model to predict grassv as a function of all the other variables in the dataset (except id). Use recipes to pre-process the data as necessary to train a penalized regression model. Be sure to also perform the same pre-processing as for the random forest model (e.g. omitting NA outcomes, imputation). Make sure your step order is correct for the recipe.
Tune the model over its two hyperparameters: penalty and mixture. Create a data frame containing combinations of values for each of these parameters. penalty should be tested at the values 10^seq(-6, -1, length.out = 20), while mixture should be tested at values c(0, 0.2, 0.4, 0.6, 0.8, 1).
Tune the model using the cross-validated folds and report the ROC AUC values for the five best models. Use autoplot() to inspect the performance of the models. How do these models perform?
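The tuning workflow described in this exercise might look something like the sketch below. The additional pre-processing steps (dummy encoding, zero-variance filter, normalization) and their order are an assumption about what a glmnet model needs, not a prescribed answer. Object names and the seed are hypothetical.

```r
library(tidymodels)

gss <- readRDS("data/gss.rds")

set.seed(123)  # arbitrary seed for reproducibility
gss_train <- training(initial_split(gss, prop = 0.75))
gss_folds <- vfold_cv(gss_train, v = 10)

glmnet_rec <- recipe(grassv ~ ., data = gss_train) |>
  update_role(id, new_role = "id variable") |>
  step_naomit(grassv, skip = TRUE) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  # glmnet needs numeric predictors on a common scale
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

glmnet_spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

# Every combination of the 20 penalty and 6 mixture values
glmnet_grid <- tidyr::expand_grid(
  penalty = 10^seq(-6, -1, length.out = 20),
  mixture = c(0, 0.2, 0.4, 0.6, 0.8, 1)
)

glmnet_wf <- workflow() |>
  add_recipe(glmnet_rec) |>
  add_model(glmnet_spec)

glmnet_tune <- tune_grid(
  glmnet_wf,
  resamples = gss_folds,
  grid = glmnet_grid,
  metrics = metric_set(roc_auc)
)

show_best(glmnet_tune, metric = "roc_auc", n = 5)
autoplot(glmnet_tune)
```

Imputation comes before dummy encoding so that the modal category is filled in while the variable is still a factor, and normalization comes last so that it applies to the final set of numeric columns.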
After the team member working on Exercise 5 renders, commits, and pushes, another team member should pull their changes and run the script again to verify correct output. Then, they should write the answer to Exercises 6-7. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.
Exercise 6
Tune the random forest model. Revisit the random forest model used previously. This time, implement hyperparameter tuning over mtry and min_n to find the optimal settings. Use at least ten combinations of hyperparameter values. Report the best five combinations of values and their ROC AUC values. How do these models perform?
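A possible sketch of the random forest tuning is given below. It assumes the same recipe as the earlier random forest exercise and lets tune_grid() generate ten candidate combinations via grid = 10. An explicit grid would work equally well. Object names and the seed are hypothetical.

```r
library(tidymodels)

gss <- readRDS("data/gss.rds")

set.seed(123)  # arbitrary seed for reproducibility
gss_train <- training(initial_split(gss, prop = 0.75))
gss_folds <- vfold_cv(gss_train, v = 10)

rf_rec <- recipe(grassv ~ ., data = gss_train) |>
  update_role(id, new_role = "id variable") |>
  step_naomit(grassv, skip = TRUE) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors())

# Mark mtry and min_n as parameters to be tuned
rf_tune_spec <- rand_forest(mtry = tune(), min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("classification")

rf_tune_wf <- workflow() |>
  add_recipe(rf_rec) |>
  add_model(rf_tune_spec)

# grid = 10 asks tune to generate ten candidate combinations
rf_tune <- tune_grid(
  rf_tune_wf,
  resamples = gss_folds,
  grid = 10,
  metrics = metric_set(roc_auc)
)

show_best(rf_tune, metric = "roc_auc", n = 5)
```

The upper bound of mtry depends on the number of predictors, which tune works out from the data when it builds the grid.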
Exercise 7
Pick the best performing model. Select the best performing model. Train that recipe + model using the full training set and report the accuracy and ROC AUC using the held-out test set of data. Visualize the ROC curve. How would you describe this model’s performance at predicting attitudes towards the legalization of marijuana?
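The final evaluation step might be sketched as follows. Purely for illustration, it assumes the chosen model is a random forest with placeholder hyperparameter values (mtry = 5, min_n = 10). In practice you would plug in whichever workflow performed best. The ROC curve call also assumes "should be legal" is the first level of grassv; if it is not, yardstick's event_level argument can be set to "second".

```r
library(tidymodels)

gss <- readRDS("data/gss.rds")

set.seed(123)  # arbitrary seed for reproducibility
gss_split <- initial_split(gss, prop = 0.75)
gss_train <- training(gss_split)

rec <- recipe(grassv ~ ., data = gss_train) |>
  update_role(id, new_role = "id variable") |>
  step_naomit(grassv, skip = TRUE) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors())

# Placeholder hyperparameters, not tuned values
best_spec <- rand_forest(mtry = 5, min_n = 10) |>
  set_engine("ranger") |>
  set_mode("classification")

best_wf <- workflow() |>
  add_recipe(rec) |>
  add_model(best_spec)

# Fit on the full training set, evaluate once on the test set
final_fit <- last_fit(
  best_wf,
  split = gss_split,
  metrics = metric_set(accuracy, roc_auc)
)
collect_metrics(final_fit)

# ROC curve from the held-out test set predictions
collect_predictions(final_fit) |>
  roc_curve(truth = grassv, `.pred_should be legal`) |>
  autoplot()
```

last_fit() trains on the training portion of the split and predicts the test portion in a single call, which helps avoid accidentally evaluating on data the model has already seen.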
After the team member working on Exercise 7 renders, commits, and pushes, all other team members should pull the changes and render the document. Finally, a team member different than the one responsible for typing up responses to Exercise 7 should do the last task outlined below (if your team wants to compete for extra credit).
Bonus (optional) - Battle Royale
For those looking for a challenge (and a slight amount of extra credit for this assignment), train a high-performing model to predict grassv. You must use tidymodels to train this model.
To evaluate your model’s effectiveness, you will generate predictions for a held-back secret test set of respondents from the survey. These can be found in data/gss-test.rds. The data frame has an identical structure to gss.rds; however, I have not included the grassv column. You will have no way of judging the effectiveness of your model on the test set itself.
To evaluate your model’s performance, you must create a CSV file that contains your predicted probabilities for grassv. This CSV should have three columns: id (the id value for the respondent), .pred_should be legal, and .pred_should not be legal. You can generate this CSV file using the code below:
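# combine the secret test set with its predicted class probabilities
bind_cols(
  gss_secret_test,
  predict(best_fit, new_data = gss_secret_test, type = "prob")
) |>
  # keep only the respondent id and the probability columns
  select(id, starts_with(".pred")) |>
  write_csv(file = "data/gss-preds.csv")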
where gss_secret_test is a data frame imported from data/gss-test.rds and best_fit is the final model fitted using the entire training set.
Your CSV file must
- Be structured exactly as I specified above.
- Be stored in the data folder and named "gss-preds.csv".
If it does not meet these requirements, then you are not eligible to win this challenge.
The team with the highest ROC AUC as calculated using their secret test set predictions will earn an extra (uncapped) 10 points on this lab assignment. For instance, if a team earned 45/50 points on the other components and had the best performing model, then they would earn a 55/50 for this lab assignment.
After the team member working on the bonus problem renders, commits, and pushes, all other team members should pull the changes and render the document. Finally, a team member different than the one responsible for typing up responses to the bonus problem should do the last task outlined below.
Wrap up
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 5001 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
- Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
- Exercise 1: 2 points
- Exercise 2: 2 points
- Exercise 3: 6 points
- Exercise 4: 8 points
- Exercise 5: 12 points
- Exercise 6: 12 points
- Exercise 7: 4 points
- Bonus: 0 points (extra credit)
- Workflow & formatting: 4 points
- Total: 50 points
The “Workflow & formatting” component assesses the reproducible workflow. This includes:
- Having at least 3 informative commit messages
- Each team member contributing to the repo with commits at least once
- Following tidyverse code style
- Ensuring reproducibility by setting a random seed value.