Tidying data

Lecture 7

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2023

2023-09-13

Announcements

  • Regrade requests should be submitted within one week of the assignment grade being published
  • Regrade requests can be submitted starting at noon the day after the assignment grade is published
  • Regrade requests are for cases where you believe a mistake was made in grading your submission
  • Be specific and polite in your request. We all make mistakes. If we made a mistake grading your submission, we want to correct it.

Common questions at this point

  • What is the difference between rendering and saving a document?

    • Saving a Quarto file saves your changes to the source .qmd file, but those changes are not reflected in your output HTML or PDF file
    • When you render the document, the output is also updated to reflect those changes
    • When you click “render”, RStudio automatically saves your Quarto file first and then renders it
    • Render early and often
      • Saves your changes
      • Identifies any errors early
  • What does it mean to commit and push something?

    • Commit stores a snapshot of the files in your local repository (i.e. the files saved on the university server)
    • Push gets those changes to the remote repository (i.e. your repository on GitHub)

Tidying datasets

What makes a dataset “tidy”?

  1. One column per variable
  2. One row per observation
  3. One cell per value

Untidy/messy is not inherently bad, but it makes certain types of analyses more challenging.
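
To make the three rules concrete, here is a minimal sketch using a small made-up table (the `scores_wide` data, its column names, and its values are hypothetical and not from the course materials). It assumes each student-assignment pair counts as one observation, so the wide layout spreads one variable (the assignment) across several columns.

```r
library(tidyr)
library(tibble)

# hypothetical untidy data: one row per student, one column per assignment
scores_wide <- tribble(
  ~student, ~hw01, ~hw02,
  "A",      90,    85,
  "B",      72,    88
)

# tidy version: one column per variable (student, assignment, score)
# and one row per student-assignment observation
scores_long <- pivot_longer(
  scores_wide,
  cols = c(hw01, hw02),
  names_to = "assignment",
  values_to = "score"
)

scores_long
```

Whether the wide or the long layout is the tidy one depends on what you treat as an observation. If the unit of analysis were the student, the wide table would already satisfy the rules.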

Application exercise

ae-05

  • Go to the course GitHub org and find your ae-05 repo (the repo name is suffixed with your GitHub username).
  • Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow.

Recap of AE

  • Data sets should not be labeled as wide or long, but they can be made wider or longer when an analysis requires a particular format
  • When pivoting longer, variable names that turn into values are characters by default. If you need them in another format, you must make that transformation explicitly, which you can do within the pivot_longer() function.
  • You can tweak a plot forever, but at some point the tweaks are likely not very productive. However, you should always be critical of defaults (however pretty they might be) and see if you can improve the plot to better portray your data / results / what you want to communicate.
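
The second point above can be sketched with the `names_transform` argument of `pivot_longer()`. The `enrollment_wide` table, its column names, and its values are hypothetical, chosen only so that the column names look like years. This is one way to do the conversion; a `mutate()` after the pivot would work as well.

```r
library(tidyr)
library(tibble)

# hypothetical data: the year is stored in the column names
enrollment_wide <- tribble(
  ~course,    ~`2021`, ~`2022`, ~`2023`,
  "course_a", 40,      55,      60
)

# default behavior: the new `year` column is a character vector
default_long <- pivot_longer(
  enrollment_wide,
  cols = -course,
  names_to = "year",
  values_to = "students"
)

# names_transform converts the former column names during the pivot
typed_long <- pivot_longer(
  enrollment_wide,
  cols = -course,
  names_to = "year",
  names_transform = list(year = as.integer),
  values_to = "students"
)

class(default_long$year)
class(typed_long$year)
```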

Using R to settle family disputes

A relative frequency bar chart reporting the simulated frequency of the order in which each family member says good bye in the mornings before I go to work. With a random draw for order, Mama, Jacob, Beverly, and Rosemarie each say good bye first, second, third, and fourth roughly 25% of the time.

Code for the previous plot:

# load required packages
library(tidyverse)
library(scales)

# use custom font for plot
library(showtext)
font_add_google("Atkinson Hyperlegible")
# turn on showtext rendering so the font registered above is used in plots
showtext_auto()

# set seed for reproducibility
set.seed(123)

# simulate draws 100,000 times; each draw is a random permutation of 1:4
# (the iteration index i is unused)
draw_order_wide <- map(1:1e5, \(i) sample.int(n = 4)) |>
  enframe(
    name = ".id",
    value = "person"
  ) |>
  unnest_wider(col = person, names_sep = "_")

# how often each person picks in each spot
draw_order_wide |>
  # pivot to long form for plotting
  pivot_longer(
    cols = starts_with("person"),
    names_to = "person",
    values_to = "order"
  ) |>
  mutate(
    # replace generic values with individual names
    person = case_match(
      .x = person,
      "person_1" ~ "Jacob",
      "person_2" ~ "Beverly",
      "person_3" ~ "Rosemarie",
      "person_4" ~ "Mama"
    ),
    person = fct_relevel(.f = person, "Mama", "Jacob", "Beverly", "Rosemarie"),
    # convert to factor for plotting purposes
    order = factor(x = order,
                   levels = 1:4,
                   labels = c(
      "First", "Second",
      "Third", "Fourth"
    ))
  ) |>
  # draw a relative frequency bar chart
  ggplot(mapping = aes(x = person, fill = order)) +
  geom_bar(position = "fill") +
  # clean up scales and labels
  scale_y_continuous(labels = label_percent()) +
  scale_fill_viridis_d(end = 0.8) +
  labs(
    title = "Daddy draw for order is fair",
    subtitle = "On average every person draws the same spots with the same frequency",
    x = NULL,
    y = "Percent of morning workday departures",
    fill = "Good bye\norder",
    caption = "Source: probability and simulations"
  ) +
  # increase font size and use accessible font
  theme_minimal(
    base_size = 13,
    base_family = "Atkinson Hyperlegible"
  ) +
  # move title and subtitle to left side of plot window
  theme(
    plot.title.position = "plot"
  )
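
As a numerical companion to the plot, the sketch below tabulates the same kind of simulation in base R, so the "roughly 25%" claim can be read off as numbers. It is not the code behind the figure: the object names `draws` and `position_share` are introduced here for illustration, and the mapping from position 1 to 4 to the four names follows the `case_match()` call above.

```r
# set seed for reproducibility
set.seed(123)

n_draws <- 1e5

# each column is one simulated morning; row i holds person i's place in the order
draws <- replicate(n_draws, sample.int(n = 4))

# share of mornings in which each person (row) lands in each place (column)
position_share <- t(apply(draws, 1, \(x) tabulate(x, nbins = 4))) / n_draws
dimnames(position_share) <- list(
  person = c("Jacob", "Beverly", "Rosemarie", "Mama"),
  order = c("First", "Second", "Third", "Fourth")
)

round(position_share, 3)
```

Each row sums to 1 because every person takes exactly one place per morning. If the draw is fair, every cell should sit close to 0.25.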