Grammar of data wrangling

Lecture 5

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2023




Clarification for hw-01

Home age variable

tompkins <- tompkins |>
  mutate(home_age = if_else(year_built < 1960, "Before 1960", "Newer than 1960"))
Create a new column of data
Save the modified data frame as tompkins

Ordering of health categories

brfss <- brfss |>
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(general_health, "Excellent", "Very good",
                                 "Good", "Fair", "Poor")
  )
Modify a column of data
Save the modified data frame as brfss
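
To see why the releveling step matters, here is a minimal sketch on a made-up vector of responses (the `health` values are hypothetical, not the BRFSS data). It assumes the forcats package is installed. `as.factor()` orders levels alphabetically, while `fct_relevel()` puts the named levels in the order given.

```r
library(forcats)

# Hypothetical responses, standing in for the general_health column
health <- c("Good", "Poor", "Excellent", "Fair", "Very good", "Good")

# as.factor() sorts the levels alphabetically
health_default <- as.factor(health)
levels(health_default)

# fct_relevel() moves the named levels to the front, in the order given
health_ordered <- fct_relevel(
  health_default,
  "Excellent", "Very good", "Good", "Fair", "Poor"
)
levels(health_ordered)
```

Plots and tables that use the factor will then list the categories from best to worst rather than alphabetically.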

Coding style + workflow

  • Avoid long lines of code

    • We should be able to see all of your code in the PDF document you submit.
  • Label code chunks

    • Do not put spaces in the code-chunk labels.
  • Use the tidyverse style guide and styler

  • Render, commit, and push regularly

    • Think about it like clicking to save regularly as you type a report
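
As a rough illustration of the first two points, the sketch below shows what the contents of a labeled chunk might look like. The label `flag-old-homes` and the `homes` data frame are hypothetical; the label uses hyphens instead of spaces, and each argument sits on its own line so nothing runs off the page.

```r
#| label: flag-old-homes

library(dplyr)

# Hypothetical toy data standing in for the hw-01 data frame
homes <- tibble(year_built = c(1925, 1975, 2004))

# One argument per line keeps every line short enough to fit in the PDF
homes <- homes |>
  mutate(
    home_age = if_else(
      year_built < 1960,
      "Before 1960",
      "Newer than 1960"
    )
  )
```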

Why data wrangling matters

Questions from the prepare materials?

Application exercise


  • Go to the course GitHub org and find your ae-03 repo (the repo name is suffixed with your GitHub username).
  • Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow.

Recap of AE

  • The pipe operator, |>, can be read as “and then”.
  • The pipe operator passes what comes before it into the function that comes after it as the first argument in that function.
sum(1, 2)
[1] 3
1 |>
  sum(2)
[1] 3
  • Always use a line break after the pipe, and indent the next line of code.
    • This mirrors ggplot2 style: always use a line break between layers, after +, and indent the next line.
  • Use dplyr functions to transform your data
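
The points above can be sketched with the built-in `mtcars` data frame (chosen here only for illustration; it is not part of the AE). The nested call and the piped call do the same work, but the piped version reads top to bottom as "and then".

```r
library(dplyr)

# Nested call: read from the inside out
nested <- summarize(filter(mtcars, cyl == 4), mean_mpg = mean(mpg))

# Piped version: take mtcars, and then filter, and then summarize
piped <- mtcars |>
  filter(cyl == 4) |>
  summarize(mean_mpg = mean(mpg))
```

Each pipe passes the data frame on its left as the first argument of the function on its right, so both objects hold the same one-row result.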

Why my daughter is afraid of Canada

Reveal below for code from the previous plot.


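# packages used below (assumed; the slide does not show them being loaded)
library(tidyverse)
library(showtext)

# custom font for plot
font_add_google("Atkinson Hyperlegible")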

# definition of AQI ranges and air quality
aqi_levels <- tribble(
  ~aqi_min, ~aqi_max, ~color,    ~level,
  0,        50,       "#D8EEDA", "Good",
  51,       100,      "#F1E7D4", "Moderate",
  101,      150,      "#F8E4D8", "Unhealthy for sensitive groups",
  151,      200,      "#FEE2E1", "Unhealthy",
  201,      300,      "#F4E3F7", "Very unhealthy",
  301,      400,      "#F9D0D4", "Hazardous"
)

# get AQI data
# source:
syr_2023 <- read_csv(
  file = "data/aqi-2023-syracuse.csv",
  na = c(".", "")
) |>
  janitor::clean_names() |>
  mutate(date = mdy(date))

# find the midpoint
aqi_levels <- aqi_levels |>
  mutate(aqi_mid = ((aqi_min + aqi_max) / 2))

# draw the graph
syr_2023 |>
  ggplot(aes(x = date, y = aqi_value, group = 1)) +
  # shade in background with colors based on AQI guide
  geom_rect(
    data = aqi_levels,
    aes(
      ymin = aqi_min, ymax = aqi_max,
      xmin = as.Date(-Inf), xmax = as.Date(Inf),
      fill = color, y = NULL, x = NULL
    )
  ) +
  # use the hexadecimal colors from the dataset for the palette
  scale_fill_identity() +
  # format the x-axis for dates
  scale_x_date(
    name = NULL, date_labels = "%b %Y",
    limits = c(ymd("2023-01-01"), ymd("2023-10-01"))
  ) +
  # add text labels for each AQI category
  geom_text(
    data = aqi_levels,
    aes(x = ymd("2023-10-01"), y = aqi_mid, label = level),
    hjust = 1, size = 6, fontface = "bold", color = "white",
    family = "Atkinson Hyperlegible"
  ) +
  # plot the AQI in Syracuse
  geom_line(linewidth = 1, alpha = 0.5) +
  # human-readable labels
  labs(
    x = NULL, y = "AQI",
    title = "Ozone and PM2.5 Daily AQI Values",
    subtitle = "Syracuse, NY",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  # don't like the default theme
  theme_minimal(base_size = 12, base_family = "Atkinson Hyperlegible") +
  theme(
    plot.title.position = "plot",
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )