HW 06 - Predicting Kickstarter funding

Homework

Modified

November 14, 2024

Important

This homework is due November 20 at 11:59pm ET.

Getting started

Go to the info5001-fa24 organization on GitHub. Click on the repo with the prefix hw-06. It contains the starter documents you need to complete the homework.
Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.

Workflow + formatting

Make sure to

Update author name on your document.
Label all code chunks informatively and concisely.
Follow the Tidyverse code style guidelines.
Make at least 3 commits.
Resize figures where needed; avoid tiny or huge plots.
Use informative labels for plot axes, titles, etc.
Consider aesthetic choices such as color, legend position, etc.
Turn in an organized, well-formatted document.

Tip

You will estimate a series of machine learning models for this homework assignment. I strongly encourage you to make use of code caching in the Quarto document to decrease the rendering time for the document.
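As a minimal sketch of chunk-level caching, the chunk below uses the `cache` option. The label `slow-step` and the object `slow_result` are hypothetical, and in the Quarto document itself the fence would be written as `{r}`. The short pause stands in for an expensive step such as model tuning.

```r
#| label: slow-step
#| cache: true

# With caching on, this chunk is re-run only when its code or options change.
slow_result <- {
  Sys.sleep(0.1) # placeholder for an expensive computation
  sum(1:10)
}
slow_result
```

Caching can also be switched on for the whole document with `cache: true` under `execute` in the YAML header.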

Data and packages

We’ll use the tidyverse and tidymodels packages for this assignment. You are welcome and encouraged to load additional packages if you desire.

library(tidyverse)
library(tidymodels)

Kickstarter

Kickstarter is an American public benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity.¹ Individuals can propose various types of projects on the platform, with specified funding goals and campaign durations. If projects receive enough funding, they are considered “funded” and the project creator receives the funds. If projects do not receive enough funding, they are considered “not funded” and the project creator does not receive the funds.

¹ Source: Wikipedia

² The dataset comes from Supervised Machine Learning for Text Analysis in R by Emil Hvitfeldt and Julia Silge and has been lightly modified for use in this assignment.

In this assignment you will estimate a series of machine learning models to predict whether or not a project will be funded based on short “blurbs” written by the project’s creators (typically between 30 and 130 characters in length).²

There are two files in the data folder:

kickstarter-train.rds - this contains the training set of observations
kickstarter-test.rds - this contains the test set of observations

Tip

We have already split the data into training/test sets for you. You do not need to use initial_split() to partition the data. Unless otherwise specified, all models should be fit using 10-fold cross-validation.
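The folds can be sketched as follows. This example uses a small toy data frame instead of the real training set, and the outcome column name `funded` is an assumption for illustration; substitute the actual column names from `kickstarter-train.rds`.

```r
library(tidymodels)

# Toy stand-in for the training set; `funded` is a hypothetical outcome name.
toy_train <- tibble(
  .id = 1:100,
  funded = factor(rep(c("yes", "no"), times = 50))
)

# Set a seed so the fold assignment is reproducible.
set.seed(123)
toy_folds <- vfold_cv(toy_train, v = 10)
toy_folds
```

The resulting object can be passed as the `resamples` argument to `fit_resamples()` or `tune_grid()`.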

Exercises

Exercise 1

Fit a null model to predict whether or not a project is funded. Report the accuracy and ROC AUC, and interpret them in the context of the model.

Tip

When fitting a null model, parsnip doesn’t actually use the specified predictor(s) to fit the model. However, you still need to explicitly provide a model formula to the fit() function. Since the blurb is a character vector, I recommend using only .id as the “predictor” variable.
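A null model specification along these lines might look like the sketch below. It runs on toy data, the outcome name `funded` is hypothetical, and stratifying the folds is an optional choice made here so that every fold contains both classes.

```r
library(tidymodels)

# Toy stand-in for the training set; `funded` is a hypothetical outcome name.
toy_train <- tibble(
  .id = 1:100,
  funded = factor(rep(c("yes", "no"), times = 50))
)

set.seed(123)
toy_folds <- vfold_cv(toy_train, v = 10, strata = funded)

# The null model ignores predictors and always predicts the majority class.
null_spec <- null_model() |>
  set_mode("classification") |>
  set_engine("parsnip")

# A formula is still required, so `.id` acts as a placeholder predictor.
null_rs <- fit_resamples(
  null_spec,
  funded ~ .id,
  resamples = toy_folds,
  metrics = metric_set(accuracy, roc_auc)
)

null_metrics <- collect_metrics(null_rs)
null_metrics
```

These metrics give a baseline that every later model should beat.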

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 2

Fit a naive Bayes model. Use the text description from blurb to fit a naive Bayes model predicting whether or not the project was funded.³

³ Don’t recognize this model type? Look back at the preparation materials for this week.

At minimum, create an appropriate feature engineering recipe to tokenize the data, retain the 5000 most frequently occurring tokens, and calculate tf-idf scores. You are encouraged to go beyond the minimum (and will likely be rewarded for doing so) by using additional pre-processing steps based on your understanding of the data.
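The minimum recipe described above could be sketched as follows, assuming the textrecipes and discrim packages are installed. The toy data and the outcome name `funded` are hypothetical, and the workflow is only assembled here, not fit.

```r
library(tidymodels)
library(textrecipes)
library(discrim)

# Toy stand-in for the training set; `funded` is a hypothetical outcome name.
toy_train <- tibble(
  .id = 1:4,
  blurb = c(
    "A cooperative board game for the whole family",
    "An indie documentary about urban beekeeping",
    "Handmade ceramic mugs inspired by the ocean",
    "A mobile app that teaches kids to code"
  ),
  funded = factor(c("yes", "no", "yes", "no"))
)

# Tokenize, keep the most frequent tokens, then compute tf-idf scores.
nb_rec <- recipe(funded ~ blurb, data = toy_train) |>
  step_tokenize(blurb) |>
  step_tokenfilter(blurb, max_tokens = 5000) |>
  step_tfidf(blurb)

# Naive Bayes specification using the naivebayes engine from discrim.
nb_spec <- naive_Bayes() |>
  set_mode("classification") |>
  set_engine("naivebayes")

nb_wf <- workflow() |>
  add_recipe(nb_rec) |>
  add_model(nb_spec)
nb_wf
```

The workflow can then be passed to `fit_resamples()` together with the cross-validation folds.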

Report the accuracy and ROC AUC values for this model, along with the confusion matrix for the predictions. How does this model perform? Which outcome is it more likely to predict incorrectly?

Exercise 3

Fit a lasso regression model. Estimate a lasso logistic regression model to predict whether or not the project is funded.

Tip

A lasso regression model is a form of penalized regression where the mixture hyperparameter is set to 1.

At minimum, create an appropriate feature engineering recipe to tokenize the data, retain the 500 most frequently occurring tokens, and calculate tf-idf scores.⁴ You are encouraged to go beyond the minimum (and will likely be rewarded for doing so) by using additional pre-processing steps based on your understanding of the data.

⁴ Lasso regression requires all features to be scaled and normalized so they have the same mean and variance. Fortunately for us, by definition, tf-idf scores are already normalized, so you do not have to explicitly perform this feature engineering step if all features are tf-idf scores.

Tune the model over the penalty hyperparameter using a regular grid of at least 30 values.

Tip

Check out dials::grid_regular().
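A minimal sketch of the lasso specification and a regular penalty grid is shown below. The object names are hypothetical, and the grid uses the default range of `penalty()`, which you may want to adjust.

```r
library(tidymodels)

# Lasso: penalized logistic regression with mixture fixed at 1.
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_mode("classification") |>
  set_engine("glmnet")

# Regular grid of 30 candidate penalty values.
penalty_grid <- grid_regular(penalty(), levels = 30)
penalty_grid
```

The grid would be supplied through the `grid` argument of `tune_grid()`, and `show_best()` with `metric = "roc_auc"` and `n = 5` lists the top five candidates afterwards.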

Tune the model using the cross-validated folds and the glmnet engine, and report the ROC AUC values for the five best models. Use autoplot() to inspect the performance of the models. How do these models perform?

Exercise 4

Use sparse encoding to improve model fitting efficiency. Review Section 7.5, “Case study: sparse encoding,” from your preparation materials. Use sparse encoding to improve the efficiency of fitting the lasso regression model from Exercise 3.
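One way to request sparse encoding is through a hardhat blueprint attached to the recipe, as in the sketch below. The toy data, the outcome name `funded`, and the object names are hypothetical, and the workflow is only assembled, not fit.

```r
library(tidymodels)
library(textrecipes)

# Toy stand-in for the training set; `funded` is a hypothetical outcome name.
toy_train <- tibble(
  .id = 1:4,
  blurb = c(
    "A cooperative board game for the whole family",
    "An indie documentary about urban beekeeping",
    "Handmade ceramic mugs inspired by the ocean",
    "A mobile app that teaches kids to code"
  ),
  funded = factor(c("yes", "no", "yes", "no"))
)

lasso_rec <- recipe(funded ~ blurb, data = toy_train) |>
  step_tokenize(blurb) |>
  step_tokenfilter(blurb, max_tokens = 500) |>
  step_tfidf(blurb)

lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_mode("classification") |>
  set_engine("glmnet")

# Ask hardhat to hand the processed predictors to the model as a sparse matrix.
sparse_bp <- hardhat::default_recipe_blueprint(composition = "dgCMatrix")

sparse_wf <- workflow() |>
  add_recipe(lasso_rec, blueprint = sparse_bp) |>
  add_model(lasso_spec)
sparse_wf
```

Only the blueprint differs from a dense workflow, so the same grid and folds can be reused when comparing runtimes.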

Perform the same tuning process again using the sparse-encoded dataset, and report the ROC AUC values for the five best models. Use autoplot() to inspect the performance of the models. How do these models perform compared to the models from Exercise 3? What, if anything, did you notice about the runtime?

Exercise 5

Build a better recipe. Revise the feature engineering recipe from the lasso model to improve its performance. At minimum, you should:

Remove stop words
Stem the tokens
Calculate all possible 1-grams, 2-grams, and 3-grams
Generate additional text features using step_textfeature(). This creates a series of numeric features based on the original character strings.
Ensure all predictors are normalized to the same mean and variance
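The steps listed above can be sketched as a single recipe. This is one possible ordering rather than a required one: the toy data, the outcome name `funded`, and the `blurb_copy` helper column are hypothetical, the token limit of 500 is carried over from Exercise 3 as an assumption, and `step_zv()` is an optional addition that drops constant columns before normalizing. The recipe is only defined here, not prepped.

```r
library(tidymodels)
library(textrecipes)

# Toy stand-in for the training set; `funded` is a hypothetical outcome name.
toy_train <- tibble(
  .id = 1:4,
  blurb = c(
    "A cooperative board game for the whole family",
    "An indie documentary about urban beekeeping",
    "Handmade ceramic mugs inspired by the ocean",
    "A mobile app that teaches kids to code"
  ),
  funded = factor(c("yes", "no", "yes", "no"))
)

better_rec <- recipe(funded ~ blurb, data = toy_train) |>
  # Copy the raw text so text features use the original strings.
  step_mutate(blurb_copy = blurb) |>
  step_textfeature(blurb_copy) |>
  step_tokenize(blurb) |>
  step_stopwords(blurb) |>
  step_stem(blurb) |>
  # Build all 1-grams, 2-grams, and 3-grams from the cleaned tokens.
  step_ngram(blurb, num_tokens = 3, min_num_tokens = 1) |>
  step_tokenfilter(blurb, max_tokens = 500) |>
  step_tfidf(blurb) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())
better_rec
```

Removing stop words and stemming before `step_ngram()` means the n-grams are built from the cleaned tokens.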

You are encouraged to go beyond the minimum (and will likely be rewarded for doing so) by using additional pre-processing steps based on your understanding of the data.

Exercise 6

Fit a model of your own choosing. Fit a model of your own choosing to predict whether or not the project is funded. You are responsible for implementing appropriate feature engineering and/or hyperparameter tuning for the model.

Briefly summarize how you decided on the workflow for each model (e.g. feature engineering + model specification). How does this model perform compared to the previous models? Report relevant metrics and plots to support your conclusions.

Tip

Credit will be earned based on the effort applied to fitting appropriate models and utilizing the techniques taught in this class. Do the minimum and you can expect to earn minimal credit.

Exercise 7

Pick the best performing model. Select a single model to train using the entire training set and provide a brief written justification as to why you chose this specific model.

Fit the recipe + model using the full training set. Report the accuracy and ROC AUC values for this model, along with the ROC curve and confusion matrix for the predictions. How does this model perform? Does it perform equally well for projects that were and were not funded, or does it have a built-in bias towards one specific outcome?
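The reporting step can be sketched with yardstick. The example below uses a small hand-made predictions table in place of real model output, and the column names `funded`, `.pred_class`, and `.pred_yes` are assumptions about how the predictions would be labeled.

```r
library(tidymodels)

# Hypothetical predictions standing in for real model output.
toy_preds <- tibble(
  funded = factor(
    c("yes", "yes", "no", "no", "yes", "no"),
    levels = c("yes", "no")
  ),
  .pred_class = factor(
    c("yes", "no", "no", "no", "yes", "yes"),
    levels = c("yes", "no")
  ),
  .pred_yes = c(0.9, 0.4, 0.2, 0.3, 0.7, 0.6)
)

# Overall performance metrics.
toy_accuracy <- accuracy(toy_preds, truth = funded, estimate = .pred_class)
toy_auc <- roc_auc(toy_preds, truth = funded, .pred_yes)

# The confusion matrix shows which outcome is misclassified more often.
toy_cm <- conf_mat(toy_preds, truth = funded, estimate = .pred_class)

# ROC curve across all classification thresholds.
roc_curve(toy_preds, truth = funded, .pred_yes) |>
  autoplot()
```

Comparing the two off-diagonal cells of the confusion matrix is one way to judge whether the model favors one outcome.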

Finally, report the top 20 most relevant features of the model.⁵ What features were most relevant? Do these features make sense?

⁵ If the final model uses penalized regression, report the top 20 most relevant features for both outcomes. For all other model types, feature importance is measured the same for both outcomes.

Render, commit, and push one last time. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up

Submission

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials → Cornell University NetID and log in using your NetID credentials.
Click on your INFO 5001 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.

Grading

Exercise 1: 4 points
Exercise 2: 8 points
Exercise 3: 8 points
Exercise 4: 4 points
Exercise 5: 8 points
Exercise 6: 8 points
Exercise 7: 6 points
Workflow + formatting: 4 points
Total: 50 points

Note

The “Workflow & formatting” component assesses the reproducible workflow. This includes:

Following tidyverse code style
All code being visible in the rendered PDF (lines no longer than 80 characters)
Appropriate figure sizing, and figures with informative labels and legends
Ensuring reproducibility by setting a random seed value.
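As a small illustration of why the seed matters, the sketch below draws the same random sample twice after setting the same seed. The seed value and object names are arbitrary.

```r
# Setting the same seed before each draw yields identical results.
set.seed(123)
first_draw <- sample(1:100, size = 5)

set.seed(123)
second_draw <- sample(1:100, size = 5)

identical(first_draw, second_draw)
```

The same idea applies before any step that involves randomness, such as creating cross-validation folds.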