Grammar of data wrangling

Lecture 5

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2023

2023-09-06

Announcements

Clarification for hw-01

Home age variable

# save the modified data frame as tompkins
tompkins <- tompkins |>
  # create a new column of data
  mutate(
    home_age = if_else(year_built < 1960, "Before 1960", "Newer than 1960")
  )
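
To see what the if_else() call produces, here is a minimal sketch on a made-up data frame. The name toy_homes and its values are hypothetical, not the hw-01 data. Note that a home built in exactly 1960 lands in the second category, and a missing year stays missing.

```r
library(dplyr)

# hypothetical data standing in for the tompkins data frame
toy_homes <- tibble(year_built = c(1925, 1959, 1960, 2004, NA))

# overwrite toy_homes with a copy that has the new home_age column
toy_homes <- toy_homes |>
  mutate(
    home_age = if_else(year_built < 1960, "Before 1960", "Newer than 1960")
  )

toy_homes$home_age
```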

Ordering of health categories

# save the modified data frame as brfss
brfss <- brfss |>
  # modify a column of data
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(general_health, "Excellent", "Very good",
                                 "Good", "Fair", "Poor")
  )
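
The effect of the releveling is easiest to see by comparing the factor levels before and after. This is a small sketch on a made-up vector; the name responses and its values are hypothetical, not the brfss data.

```r
library(forcats)

# hypothetical survey responses for illustration
responses <- c("Good", "Poor", "Excellent", "Fair", "Very good", "Good")

# as.factor() sorts the levels alphabetically
alphabetical <- as.factor(responses)
levels(alphabetical)

# fct_relevel() puts the named levels first, in the order given
ordered_health <- fct_relevel(
  alphabetical,
  "Excellent", "Very good", "Good", "Fair", "Poor"
)
levels(ordered_health)
```

Only the order of the levels changes. The values stored in the column stay the same, but plots and tables now list the categories from best to worst.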

Coding style + workflow

  • Avoid long lines of code

    • We should be able to see all of your code in the PDF document you submit.
  • Label code chunks

    • Do not put spaces in the code-chunk labels.
  • Use the tidyverse style guide and styler

  • Render, commit, and push regularly

    • Think of it like saving regularly as you type a report.
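
As a rough sketch of what styler does, style_text() takes code as a character string and returns a restyled version that follows the tidyverse style guide. The messy line below is a made-up example.

```r
library(styler)

# a deliberately messy line of code, stored as text (hypothetical example)
messy <- "x=c(1,2,3)"

# restyle the text according to the tidyverse style guide
styled <- style_text(messy)
styled
```

In day-to-day work it is usually more convenient to style a whole file at once, for example through the styler addin in RStudio.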

Why data wrangling matters

Questions from the prepare materials?

Application exercise

ae-03

  • Go to the course GitHub org and find your ae-03 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow.

Recap of AE

  • The pipe operator, |>, can be read as “and then”.
  • The pipe operator passes what comes before it into the function that comes after it as the first argument in that function.
sum(1, 2)
#> [1] 3
1 |>
  sum(2)
#> [1] 3
  • Always use a line break after the pipe, and indent the next line of code.
    • Likewise, in ggplot2 code, always use a line break after the + between layers, and indent the next line.
  • Use dplyr functions to transform your data
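
The recap points above can be sketched by writing one transformation two ways: as nested function calls and as a pipeline. The data frame toy and its values are hypothetical, chosen only for illustration.

```r
library(dplyr)

# hypothetical data for illustration
toy <- tibble(
  town = c("Ithaca", "Ithaca", "Dryden", "Dryden"),
  price = c(300, 250, 200, 180)
)

# nested calls: must be read from the inside out
nested <- summarize(
  group_by(filter(toy, price > 190), town),
  mean_price = mean(price)
)

# piped: reads top to bottom as "take toy, and then filter, and then ..."
piped <- toy |>
  filter(price > 190) |>
  group_by(town) |>
  summarize(mean_price = mean(price))

piped
```

Both versions return the same result, because each pipe passes the data frame on its left in as the first argument of the function on its right.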

Why my daughter is afraid of Canada

Code for the previous plot:

library(tidyverse)
library(showtext)

# custom font for plot
font_add_google("Atkinson Hyperlegible")

# definition of AQI ranges and air quality
aqi_levels <- tribble(
  ~aqi_min, ~aqi_max, ~color,    ~level,
  0,        50,       "#D8EEDA", "Good",
  51,       100,      "#F1E7D4", "Moderate",
  101,      150,      "#F8E4D8", "Unhealthy for sensitive groups",
  151,      200,      "#FEE2E1", "Unhealthy",
  201,      300,      "#F4E3F7", "Very unhealthy",
  301,      400,      "#F9D0D4", "Hazardous"
)

# get AQI data
# source: https://www.epa.gov/outdoor-air-quality-data/air-data-daily-air-quality-tracker
syr_2023 <- read_csv(
  file = "data/aqi-2023-syracuse.csv",
  na = c(".", "")
) |>
  janitor::clean_names() |>
  mutate(date = mdy(date))

# find the midpoint
aqi_levels <- aqi_levels |>
  mutate(aqi_mid = ((aqi_min + aqi_max) / 2))

# draw the graph
syr_2023 |>
  ggplot(aes(x = date, y = aqi_value, group = 1)) +
  # shade in background with colors based on AQI guide
  geom_rect(
    data = aqi_levels,
    aes(
      ymin = aqi_min, ymax = aqi_max,
      xmin = as.Date(-Inf), xmax = as.Date(Inf),
      fill = color, y = NULL, x = NULL
    )
  ) +
  # use the hexadecimal colors from the dataset for the palette
  scale_fill_identity() +
  # format the x-axis for dates
  scale_x_date(
    name = NULL, date_labels = "%b %Y",
    limits = c(ymd("2023-01-01"), ymd("2023-10-01"))
  ) +
  # add text labels for each AQI category
  geom_text(
    data = aqi_levels,
    aes(x = ymd("2023-10-01"), y = aqi_mid, label = level),
    hjust = 1, size = 6, fontface = "bold", color = "white",
    family = "Atkinson Hyperlegible"
  ) +
  # plot the AQI in Syracuse
  geom_line(linewidth = 1, alpha = 0.5) +
  # human-readable labels
  labs(
    x = NULL, y = "AQI",
    title = "Ozone and PM2.5 Daily AQI Values",
    subtitle = "Syracuse, NY",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  # don't like the default theme
  theme_minimal(base_size = 12, base_family = "Atkinson Hyperlegible") +
  theme(
    plot.title.position = "plot",
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )
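
The shaded background in the plot above rests on two ideas: drawing rectangles from a separate data frame of ranges, and using scale_fill_identity() so the hex codes stored in the data are used directly as fill colors. Here is a stripped-down sketch of that technique with made-up data. The data frames bands and readings are hypothetical, and a numeric x-axis stands in for dates to keep the example short.

```r
library(ggplot2)

# hypothetical bands and readings, not the EPA data
bands <- data.frame(
  lo = c(0, 50),
  hi = c(50, 100),
  color = c("#D8EEDA", "#F1E7D4")
)
readings <- data.frame(day = 1:5, value = c(20, 35, 60, 80, 45))

p <- ggplot() +
  # one rectangle per band, stretched across the full x range
  geom_rect(
    data = bands,
    aes(xmin = -Inf, xmax = Inf, ymin = lo, ymax = hi, fill = color)
  ) +
  # treat the values in the color column as literal colors
  scale_fill_identity() +
  # draw the series on top of the bands
  geom_line(data = readings, aes(x = day, y = value))

p
```

Because each layer here receives its own data and aesthetics, there is nothing inherited to switch off. The full plot instead sets x = NULL and y = NULL inside geom_rect() to drop the aesthetics inherited from ggplot().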