Grammar of data wrangling

Lecture 4

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2024

September 10, 2024

Announcements

Clarification for hw-01

Home age variable

# Save the modified data frame as tompkins
tompkins <- tompkins |>
  # Create a new column of data
  mutate(
    home_age = if_else(year_built < 1960, "Before 1960", "Newer than 1960")
  )
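
To see how the condition sorts individual values, here is a minimal sketch on a made-up data frame. The name homes and its four year_built values are assumptions for illustration only; they are not the hw-01 data.

```r
library(dplyr)

# Hypothetical toy data; the real tompkins data frame has many more columns.
homes <- tibble(year_built = c(1925, 1959, 1960, 2004))

# Overwrite homes with a version that includes the new home_age column.
homes <- homes |>
  mutate(
    home_age = if_else(year_built < 1960, "Before 1960", "Newer than 1960")
  )

homes$home_age
#> [1] "Before 1960"     "Before 1960"     "Newer than 1960" "Newer than 1960"
```

Because the comparison is a strict <, a home built in exactly 1960 falls into the second category.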

Ordering of health categories

# Save the modified data frame as brfss
brfss <- brfss |>
  # Modify a column of data
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(general_health, "Excellent", "Very good",
                                 "Good", "Fair", "Poor")
  )
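
The reason for the extra fct_relevel() step is that as.factor() on a character vector sorts the levels alphabetically, which is not the natural order for these categories. A small sketch, assuming forcats is installed; the responses vector is made up and is not the brfss data.

```r
library(forcats)

# Hypothetical responses, one per category.
responses <- c("Good", "Poor", "Excellent", "Fair", "Very good")

# as.factor() orders the levels alphabetically.
alphabetical <- as.factor(responses)
levels(alphabetical)

# fct_relevel() moves the named levels to the front, in the order given.
ordered_health <- fct_relevel(
  alphabetical,
  "Excellent", "Very good", "Good", "Fair", "Poor"
)
levels(ordered_health)
#> [1] "Excellent" "Very good" "Good"      "Fair"      "Poor"
```

Plots and tables built from ordered_health then follow this level order rather than the alphabetical one.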

Coding style + workflow

  • Avoid long lines of code

    • We should be able to see all of your code in the PDF document you submit.
    • Do not rely on automatic line wrapping. It is not consistent.
  • Label code chunks

    • Do not put spaces in the code-chunk labels.
  • Use the tidyverse style guide and styler

  • Render, commit, and push regularly

    • Think of it like saving regularly as you type a report.
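
As a rough illustration of what styler does, the sketch below restyles one badly spaced line. It assumes the styler package is installed; the messy string is a made-up example.

```r
library(styler)

# Hypothetical, poorly formatted line of code.
messy <- "x<-c(1,2,3)"

# style_text() restyles code supplied as text, using the tidyverse style
# guide by default.
tidy <- style_text(messy)
tidy
#> x <- c(1, 2, 3)
```

In practice you would more likely restyle a whole document, for example with styler::style_file(), than pass strings around like this.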

Why data wrangling matters

Application exercise

ae-02

  • Go to the course GitHub org and find your ae-02 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline (end of the day).

Recap of AE

  • The pipe operator, |>, can be read as “and then”.
  • The pipe operator passes what comes before it into the function that comes after it as the first argument in that function.
sum(1, 2)
#> [1] 3
1 |>
  sum(2)
#> [1] 3
  • Always use a line break after the pipe, and indent the next line of code.
    • This mirrors ggplot2 code: put a line break after each + between layers, and indent the next line.
  • Use dplyr functions to transform your data
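
The "and then" reading becomes clearer with more than one step. The sketch below uses only base R (the native pipe requires R 4.1 or later); the values vector is a made-up example.

```r
# Hypothetical input values.
values <- c(4, 9, 16)

# Nested calls are read from the inside out.
nested <- round(mean(sqrt(values)), 1)

# The same steps with the pipe are read top to bottom:
# take values, and then take square roots, and then average, and then round.
piped <- values |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)
#> [1] TRUE
```

Each step receives the previous result as its first argument, so round(1) here is the same as round(<previous result>, 1).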

Riding the thermals