HW 05 - Predicting song danceability

Homework
Important

This homework is due November 15 at 11:59pm ET.

Getting started

  • Go to the info5001-fa23 organization on GitHub. Click on the repo with the prefix hw-05. It contains the starter documents you need to complete the homework.

  • Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.

Workflow + formatting

Make sure to

  • Update author name on your document.
  • Label all code chunks informatively and concisely.
  • Follow the Tidyverse code style guidelines.
  • Make at least 3 commits.
  • Resize figures where needed; avoid tiny or huge plots.
  • Use informative labels for plot axes, titles, etc.
  • Consider aesthetic choices such as color, legend position, etc.
  • Turn in an organized, well-formatted document.

Tip

You will estimate a series of machine learning models for this homework assignment. I strongly encourage you to make use of code caching in the Quarto document to decrease the rendering time for the document.
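As one possible way to do this, the sketch below shows the body of an R code chunk with caching turned on through the `cache` chunk option. The chunk label and the placeholder computation are hypothetical; in practice the chunk would hold a slow step such as model tuning.

```r
#| label: slow-model-tuning
#| cache: true

# Hypothetical stand-in for an expensive step such as tune_grid().
# With cache: true, knitr stores the chunk's results and reuses them on
# later renders as long as the chunk's code and options are unchanged.
slow_result <- sum(sqrt(seq_len(1e6)))
slow_result
```

A cached chunk is not automatically re-run when an earlier chunk changes, so it may be worth clearing the cache before the final render.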

Data and packages

We’ll use the tidyverse and tidymodels packages for this assignment. You are welcome and encouraged to load additional packages if you desire.

library(tidyverse)
library(tidymodels)

Dance Mode from Bluey

In this part, you will estimate a series of machine learning models to predict whether or not a song is “danceable” as determined by Spotify.

The data come from Spotify and contain detailed song-level information for every song in a playlist created by or liked by the instructor. There are two files in the data folder:

  • spotify-train.rds - this contains the training set of observations
  • spotify-test.rds - this contains the test set of observations

Tip

We have already split the data into training/test sets for you. You do not need to use initial_split() to partition the data. Unless otherwise specified, all models should be fit using 10-fold cross-validation.
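A minimal sketch of setting up the 10-fold cross-validation is shown below. To stay runnable on its own it uses a small simulated data frame (`toy_train`, a hypothetical stand-in with made-up values); in the homework the training set would instead be imported from the data folder, as noted in the comment. Stratifying the folds by the outcome is an optional choice, not a requirement stated in the assignment.

```r
library(tidymodels)

# In the homework, import the provided file instead, e.g.
# spotify_train <- readr::read_rds("data/spotify-train.rds")
# The simulated data below is only a stand-in for illustration.
set.seed(123)
n_songs <- 200
toy_train <- tibble(
  energy = runif(n_songs),
  valence = runif(n_songs),
  danceability = factor(if_else(
    valence + rnorm(n_songs, sd = 0.3) > 0.5,
    "Danceable",
    "Not danceable"
  ))
)

# 10-fold cross-validation; strata keeps the class balance similar per fold
toy_folds <- vfold_cv(toy_train, v = 10, strata = danceability)
toy_folds
```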

These files contain the following variables:

| Column name | Variable description |
| --- | --- |
| .id | Unique identification number for each song in the dataset. |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| album | Name of the album from which the song originates. |
| artist | The artist who recorded the song. |
| danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. This version of the variable is a factor which classifies each song as ‘Danceable’ or ‘Not danceable’. |
| energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| explicit | A logical value which indicates whether or not the song contains explicit lyrics. |
| instrumentalness | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater the likelihood that the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| key_name | The key the track is in. |
| liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 indicates a strong likelihood that the track is live. |
| loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode_name | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. |
| playlist_name | The (anonymized) name of the Spotify playlist where the song is included. |
| speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| time_signature | An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The value ranges from 3 to 7, indicating time signatures of “3/4” to “7/4”. |
| track | Name of the song. |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |

Exercises

Exercise 1

Fit a null model. To establish a baseline for evaluating model performance, we want to estimate a null model. This is a model with zero predictors. In the absence of predictors, our best guess for a classification model is to predict the modal outcome for all observations (e.g. if a majority of songs are danceable, then we would predict that outcome for every song).

The parsnip package includes a model specification for the null model. Fit the null model using the cross-validated folds. Report the accuracy and ROC AUC values for this model. How does the null model perform?
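The following is a rough sketch of how a null model could be evaluated with resampling, not a definitive solution. It runs on simulated data (`toy_train` and `toy_folds` are hypothetical stand-ins for the real training set and its folds), and the object names are illustrative.

```r
library(tidymodels)

# Simulated stand-in for the training data (made-up values)
set.seed(123)
n_songs <- 200
toy_train <- tibble(
  energy = runif(n_songs),
  valence = runif(n_songs),
  danceability = factor(if_else(
    valence + rnorm(n_songs, sd = 0.3) > 0.5,
    "Danceable",
    "Not danceable"
  ))
)
toy_folds <- vfold_cv(toy_train, v = 10, strata = danceability)

# Null model specification from parsnip: ignores all predictors
null_spec <- null_model() |>
  set_engine("parsnip") |>
  set_mode("classification")

null_wf <- workflow() |>
  add_formula(danceability ~ .) |>
  add_model(null_spec)

# Estimate performance across the cross-validation folds
null_res <- fit_resamples(
  null_wf,
  resamples = toy_folds,
  metrics = metric_set(accuracy, roc_auc)
)

null_metrics <- collect_metrics(null_res)
null_metrics
```

Because the null model assigns the same predicted probability to every song, its ROC AUC should sit at 0.5, and its accuracy should be roughly the share of the modal class.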

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 2

Fit a basic logistic regression model. Use relevant features in the dataset (based on your personal expertise) and fit the model using an appropriate model specification and feature engineering recipe.

Report the ROC AUC values for this model. How does this model perform?
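A minimal sketch of this workflow is given below, assuming simulated data in place of the real training set. The choice of `energy` and `valence` as predictors is purely illustrative and is not a recommendation; `toy_train`, `toy_folds`, and the other object names are hypothetical.

```r
library(tidymodels)

# Simulated stand-in for the training data (made-up values)
set.seed(123)
n_songs <- 200
toy_train <- tibble(
  energy = runif(n_songs),
  valence = runif(n_songs),
  danceability = factor(if_else(
    valence + rnorm(n_songs, sd = 0.3) > 0.5,
    "Danceable",
    "Not danceable"
  ))
)
toy_folds <- vfold_cv(toy_train, v = 10, strata = danceability)

# Feature engineering recipe: pick predictors and put them on one scale
log_rec <- recipe(danceability ~ energy + valence, data = toy_train) |>
  step_normalize(all_numeric_predictors())

# Ordinary (unpenalized) logistic regression
log_spec <- logistic_reg() |>
  set_engine("glm")

log_wf <- workflow() |>
  add_recipe(log_rec) |>
  add_model(log_spec)

log_res <- fit_resamples(
  log_wf,
  resamples = toy_folds,
  metrics = metric_set(roc_auc)
)

log_metrics <- collect_metrics(log_res)
log_metrics
```

By default yardstick treats the first factor level as the event of interest, so it is worth checking the level order of `danceability` in the real data before interpreting the ROC AUC.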

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 3

Fit a nearest neighbors model. Estimate a nearest neighbors model to predict danceability. Use recipes to pre-process the data as necessary to train a nearest neighbors model. At minimum, perform the required pre-processing for a nearest neighbors model. You are encouraged to go beyond the minimum (and will likely be rewarded for doing so) by using additional pre-processing steps based on your understanding of the data. Make sure your step order is correct for the recipe.

To determine the optimal number of neighbors, tune over at least 10 possible values.

Tune the model using the cross-validated folds and report the ROC AUC values for the five best models. How do these models perform?
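The tuning procedure described above might be sketched as follows. This is an illustration on simulated data, not the expected answer: the grid of neighbor values is an arbitrary choice, the kknn engine requires the kknn package to be installed, and all object names are hypothetical.

```r
library(tidymodels)

# Simulated stand-in for the training data (made-up values)
set.seed(123)
n_songs <- 200
toy_train <- tibble(
  energy = runif(n_songs),
  valence = runif(n_songs),
  tempo = rnorm(n_songs, mean = 120, sd = 25),
  danceability = factor(if_else(
    valence + rnorm(n_songs, sd = 0.3) > 0.5,
    "Danceable",
    "Not danceable"
  ))
)
toy_folds <- vfold_cv(toy_train, v = 10, strata = danceability)

# Distance-based models need predictors on a common scale
knn_rec <- recipe(danceability ~ ., data = toy_train) |>
  step_normalize(all_numeric_predictors())

knn_spec <- nearest_neighbor(neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_wf <- workflow() |>
  add_recipe(knn_rec) |>
  add_model(knn_spec)

# Ten candidate values for the number of neighbors: 1, 5, ..., 37
knn_grid <- tibble(neighbors = seq(1, 37, by = 4))

knn_res <- tune_grid(
  knn_wf,
  resamples = toy_folds,
  grid = knn_grid,
  metrics = metric_set(roc_auc)
)

knn_best <- show_best(knn_res, metric = "roc_auc", n = 5)
knn_best
```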

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 4

Fit a penalized logistic regression model. Estimate a penalized logistic regression model to predict danceability. Use recipes to pre-process the data. At minimum, perform the required pre-processing for a penalized logistic regression model. You are encouraged to go beyond the minimum (and will likely be rewarded for doing so) by using additional pre-processing steps based on your understanding of the data. Make sure your step order is correct for the recipe.

Tune the model over its two hyperparameters, penalty and mixture, using an appropriate grid search method.

Tune the model using the cross-validated folds and the glmnet engine, and report the ROC AUC values for the five best models. Use autoplot() to inspect the performance of the models. How do these models perform?
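One way this could look is sketched below on simulated data. The regular grid with 10 penalty values and 5 mixture values is only one of several reasonable grid designs, the recipe steps are a plausible minimum rather than a prescribed answer, and all object names are hypothetical. The glmnet package must be installed.

```r
library(tidymodels)

# Simulated stand-in for the training data (made-up values)
set.seed(123)
n_songs <- 200
toy_train <- tibble(
  energy = runif(n_songs),
  valence = runif(n_songs),
  mode_name = factor(
    sample(c("major", "minor"), n_songs, replace = TRUE)
  ),
  danceability = factor(if_else(
    valence + rnorm(n_songs, sd = 0.3) > 0.5,
    "Danceable",
    "Not danceable"
  ))
)
toy_folds <- vfold_cv(toy_train, v = 10, strata = danceability)

# glmnet needs numeric predictors on a common scale:
# dummy-encode first, drop constant columns, then normalize
glmnet_rec <- recipe(danceability ~ ., data = toy_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

glmnet_spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

glmnet_wf <- workflow() |>
  add_recipe(glmnet_rec) |>
  add_model(glmnet_spec)

# Regular grid: 10 penalty values crossed with 5 mixture values
glmnet_grid <- grid_regular(penalty(), mixture(), levels = c(10, 5))

glmnet_res <- tune_grid(
  glmnet_wf,
  resamples = toy_folds,
  grid = glmnet_grid,
  metrics = metric_set(roc_auc)
)

glmnet_best <- show_best(glmnet_res, metric = "roc_auc", n = 5)
glmnet_best

# Plot performance across the grid
autoplot(glmnet_res)
```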

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 5

Fit a random forest model. Estimate a random forest model to predict danceability. Use recipes to pre-process the data. At minimum, perform the required pre-processing for a random forest model. You are encouraged to go beyond the minimum (and will likely be rewarded for doing so) by using additional pre-processing steps based on your understanding of the data. Make sure your step order is correct for the recipe.

Implement hyperparameter tuning over mtry and min_n to find the optimal settings. Use at least ten combinations of hyperparameter values. Report the best five combinations of values and their ROC AUC values. How do these models perform?
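A hedged sketch of such a tuning run follows, again on simulated data. The explicit grid of 12 combinations, the number of trees, and the ranger engine are illustrative assumptions (ranger must be installed); with the real data the sensible range for `mtry` depends on how many predictors the recipe produces.

```r
library(tidymodels)

# Simulated stand-in for the training data (made-up values)
set.seed(123)
n_songs <- 200
toy_train <- tibble(
  energy = runif(n_songs),
  valence = runif(n_songs),
  tempo = rnorm(n_songs, mean = 120, sd = 25),
  danceability = factor(if_else(
    valence + rnorm(n_songs, sd = 0.3) > 0.5,
    "Danceable",
    "Not danceable"
  ))
)
toy_folds <- vfold_cv(toy_train, v = 10, strata = danceability)

# Tree-based models need little pre-processing; the recipe only
# declares the outcome and predictors here
rf_rec <- recipe(danceability ~ ., data = toy_train)

rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 200) |>
  set_engine("ranger") |>
  set_mode("classification")

rf_wf <- workflow() |>
  add_recipe(rf_rec) |>
  add_model(rf_spec)

# 3 x 4 = 12 combinations; mtry cannot exceed the number of predictors
rf_grid <- crossing(mtry = 1:3, min_n = c(2, 5, 10, 20))

rf_res <- tune_grid(
  rf_wf,
  resamples = toy_folds,
  grid = rf_grid,
  metrics = metric_set(roc_auc)
)

rf_best <- show_best(rf_res, metric = "roc_auc", n = 5)
rf_best
```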

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 6

Select a final predictive model. Select a single model to train using the entire training set and provide a brief written justification as to why you chose this specific model. Generate predicted probabilities of danceability for observations in the test set.

To evaluate your model’s performance, you must create a CSV file that contains your predicted values for danceability. This CSV should have three columns: .id (the ID value for the song), .pred_Danceable, and .pred_Not Danceable. You can generate this CSV file using the code below:

bind_cols(
  spotify_test,
  predict(final_mod, new_data = spotify_test, type = "prob")
) |>
  select(.id, starts_with(".pred")) |>
  write_csv(file = "data/spotify-preds.csv")

where spotify_test is a data frame imported from data/spotify-test.rds and final_mod is the final model fitted using the entire training set.
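As a rough illustration of where `final_mod` could come from, the sketch below fits a workflow to an entire (simulated) training set and predicts class probabilities for a held-out set. The data, the choice of a plain logistic regression, and all `toy_` names are hypothetical. For a tuned workflow, the best hyperparameters would first be plugged in with `finalize_workflow()` and `select_best()`.

```r
library(tidymodels)

# Simulated stand-ins for the provided training and test sets
set.seed(123)
n_songs <- 250
toy_songs <- tibble(
  .id = seq_len(n_songs),
  energy = runif(n_songs),
  valence = runif(n_songs),
  danceability = factor(if_else(
    valence + rnorm(n_songs, sd = 0.3) > 0.5,
    "Danceable",
    "Not danceable"
  ))
)
toy_train <- slice_head(toy_songs, n = 200)
toy_test <- slice_tail(toy_songs, n = 50)

final_wf <- workflow() |>
  add_recipe(
    recipe(danceability ~ energy + valence, data = toy_train)
  ) |>
  add_model(logistic_reg() |> set_engine("glm"))

# Fit once on the full training set (no resampling at this stage)
final_mod <- fit(final_wf, data = toy_train)

# One probability column per outcome level
toy_preds <- predict(final_mod, new_data = toy_test, type = "prob")
names(toy_preds)
```

The probability columns are named after the levels of the outcome factor, so the exact column names in the CSV follow whatever spelling and capitalization the levels have in the provided data.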

Warning

Your CSV file must

  • Be structured exactly as I specified above.
  • Be stored in the data folder and named "spotify-preds.csv".

If it does not meet these requirements, then we will not be able to evaluate your test set performance.

Tip

Credit will be earned based on the model’s test set performance and how much you are able to improve its ROC AUC compared to the null model.

Render, commit, and push one last time. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up

Submission

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials → Cornell University NetID and log in using your NetID credentials.
  • Click on your INFO 5001 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
  • Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.

Grading

  • Exercise 1: 4 points
  • Exercise 2: 4 points
  • Exercise 3: 10 points
  • Exercise 4: 10 points
  • Exercise 5: 10 points
  • Exercise 6: 8 points
  • Workflow + formatting: 4 points
  • Total: 50 points

Note

The “Workflow & formatting” component assesses the reproducible workflow. This includes:

  • Following tidyverse code style
  • All code being visible in the rendered PDF (lines of no more than 80 characters)
  • Appropriate figure sizing, and figures with informative labels and legends
  • Ensuring reproducibility by setting a random seed value.
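
To illustrate the last point, the short sketch below shows that setting the same seed before a random step, here the creation of cross-validation folds, reproduces the same result. The data frame and the seed value are arbitrary placeholders.

```r
library(tidymodels)

# Hypothetical stand-in data; only the row IDs matter here
toy_songs <- tibble(.id = 1:100)

set.seed(2023)
folds_first <- vfold_cv(toy_songs, v = 10)

set.seed(2023)
folds_second <- vfold_cv(toy_songs, v = 10)

# Compare the analysis rows of the first fold from each run
same_folds <- identical(
  analysis(folds_first$splits[[1]]),
  analysis(folds_second$splits[[1]])
)
same_folds
```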