HW 03 - Import + clean data


This homework is due October 4 at 11:59pm ET.

Getting started

  • Go to the info5001-fa23 organization on GitHub. Click on the repo with the prefix hw-03. It contains the starter documents you need to complete the homework.

  • Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.

Workflow + formatting

Make sure to

  • Update author name on your document.
  • Label all code chunks informatively and concisely.
  • Follow the Tidyverse code style guidelines.
  • Make at least 3 commits.
  • Resize figures where needed, avoid tiny or huge plots.
  • Use informative labels for plot axes, titles, etc.
  • Consider aesthetic choices such as color, legend position, etc.
  • Turn in an organized, well formatted document.


We’ll use the tidyverse package for much of the data wrangling and visualization and googlesheets4 for importing Google Sheets. You’re welcome to also load other packages as needed.


Mass shootings in the United States

The United States experiences far more mass shooting events than any other developed country in the world. While policymakers, politicians, the media, activists, and the general public recognize the widespread prevalence of these tragic events, policies intended to stop these events should be grounded in evidence and empirical data. Regrettably, mass shootings are not well-documented in the United States, and generalizable data is difficult to collect.

In July 2012, in the aftermath of a mass shooting in a movie theater in Aurora, Colorado, Mother Jones published a report on mass shootings in the United States since 1982. Importantly, they provided the underlying data set as an open-source database for anyone interested in studying and understanding this criminal behavior.

Exercise 1 - Import the data

Import the spreadsheet from Google Sheets. The entire dataset can be found here. Use googlesheets4 to import the dataset. Do not download the file as a CSV and then import into R. You must use googlesheets4 to import the data.

Exercise 2 - Clean the data

Mother Jones updates the dataset by adding a new row to the top of the spreadsheet each time a new mass shooting occurs. Data entry is performed manually, so there are many inconsistencies in the variables. Clean the data frame to standardize the dataset for analysis.

Specifically, you should (at minimum) do the following:

  • Remove columns that do not provide substantive information for analysis. This includes columns that are open-ended text fields that are unique to each row (e.g. details and sourcing columns). But do not remove columns that contain relevant ID information (specifically case, location, and summary).
  • Rename columns so they have unique, descriptive, syntactic names.
  • Ensure variables are stored as appropriate types and classes (e.g. numeric variables stored as a numeric type, categorical variables stored as character or factor columns, dates stored as a date class).
  • Extract date information so you have separate variables for month and year.
  • Standardize categorical variables so they use consistent case (e.g. lower, upper, or title case) and missing values (e.g. “unknown”, “TBD”) are defined as NA values in R.
  • Collapse the gender column to three categories: “Male”, “Female”, and “Transgender”. If the incident had multiple shooters with separate genders, classify the incident’s gender as NA.

Print the output of glimpse() of your data frame in the PDF document so we can see the data frame’s structure.


To assist you in this process, I have included a copy of the cleaned dataset in your repo as data/mass-shootings.rds. You can compare your cleaned dataset to this one using

mass_shootings_precleaned <- read_rds(file = "data/mass-shootings.rds")

waldo::compare(x = mass_shootings_precleaned, y = mass_shootings,
               tolerance = 1e-4)

where mass_shootings is your cleaned data frame. waldo::compare() will report on any differences between your cleaned dataset and the pre-cleaned example included in your repo. You do not have to perfectly match the pre-cleaned example in order to earn full credit. This is provided merely as an aid to check your work.

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 3 - Mass shootings by location

Generate a boxplot visualizing the number of total victims, by type of location. Redraw the same plot, but remove the Las Vegas Strip massacre from the visualization. Include a written interpretation of your visualizations.

Exercise 4 - Very specific subset

How many white individuals with prior signs of mental illness initiated a mass shooting after 2000? Report the number of incidents perpetrated by these individuals.

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 5 - Mass shootings by state

Which states have the most mass shootings? Generate a visualization reporting the number of incidents per state. Comment on interesting or relevant trends.

Exercise 6 - Age of mass shooters

Are recent mass shootings perpetrated by younger individuals? How has the typical age of a mass shooter changed over time? Generate an appropriate visualization and interpret.

Render, commit, and push one last time. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up


  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials \(\rightarrow\) Cornell University NetID and log in using your NetID credentials.
  • Click on your INFO 5001 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
  • Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.


  • Exercise 1: 5 points
  • Exercise 2: 20 points
  • Exercise 3: 5 points
  • Exercise 4: 5 points
  • Exercise 5: 5 points
  • Exercise 6: 5 points
  • Workflow + formatting: 5 points
  • Total: 50 points

The “Workflow & formatting” component assesses the reproducible workflow. This includes:

  • At least 3 informative commit messages
  • Following tidyverse code style
  • All code being visible in rendered PDF (no more than 80 characters)