AE 05: Pivoting Cornell Degrees

Suggested answers

Goal

Our ultimate goal in this application exercise is to make the following data visualization.

Line plot of the percentage of Cornell degrees awarded in six fields of study from 2001 to 2020.

  • Your turn (3 minutes): Take a close look at the plot and describe what it shows in 2-3 sentences.

Add your response here.

Data

The data come from the Department of Education’s College Scorecard.

They make the data available through online dashboards and an API, but I’ve prepared the data for you in a CSV file. Let’s load that in.

library(tidyverse)
library(scales)

cornell_deg <- read_csv("data/cornell-degrees.csv")

And let’s take a look at the data.

cornell_deg
# A tibble: 6 × 21
  field_of_study  `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009`
  <chr>            <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 academics.prog… 0.239  0.290  0.173  0.161   0.168 0.170  0.181  0.183  0.181 
2 academics.prog… 0.112  0.0979 0.110  0.198   0.157 0.168  0.151  0.136  0.148 
3 academics.prog… 0.0859 0.0745 0.0463 0.0327  0.032 0.0221 0.0263 0.0262 0.0264
4 academics.prog… 0.071  0.0709 0.112  0.100   0.109 0.107  0.116  0.122  0.117 
5 academics.prog… 0      0      0.122  0.112   0.109 0.110  0.126  0.134  0.128 
6 academics.prog… 0.161  0.160  0.105  0.0973  0.113 0.099  0.102  0.0975 0.0983
# ℹ 11 more variables: `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>,
#   `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>,
#   `2019` <dbl>, `2020` <dbl>

The dataset has 6 rows and 21 columns. The first column (variable) is field_of_study, which lists the 6 most frequent fields of study for students graduating in 2020 (for the sake of this exercise, I omitted the other 32 possible fields of study). The remaining columns show the proportion of degrees awarded in each field for each year from 2001 to 2020.

  • Your turn (4 minutes): Take a look at the plot we aim to make and sketch the data frame we need to make the plot. Determine what each row and each column of the data frame should be. Hint: We need data to be in columns to map to aesthetic elements of the plot.
    • Columns: year, pct, field_of_study

    • Rows: Combination of year and field of study
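
One way to make the sketch concrete is to type a few rows of the target layout by hand. The snippet below is only an illustration: the values are copied from the first two rows of the printout above, and `field_1` and `field_2` are placeholder names, since the real field names are truncated there.

```r
library(tibble)

# Hand-typed sketch of the "long" layout we are aiming for.
# field_1 / field_2 are placeholders (assumption), not the real field names.
target_sketch <- tribble(
  ~field_of_study, ~year, ~pct,
  "field_1",        2001, 0.239,
  "field_1",        2002, 0.290,
  "field_2",        2001, 0.112,
  "field_2",        2002, 0.0979
)

# One row per field / year combination, one column per plot aesthetic
target_sketch
```

Each of the three columns can then be mapped to one aesthetic: year to x, pct to y, and the field to color.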

Pivoting

  • Demo: Pivot the cornell_deg data frame longer such that each row represents a field of study / year combination, and year and the proportion of degrees awarded that year (pct) are columns in the data frame.
cornell_deg |>
  pivot_longer(
    cols = -field_of_study,
    names_to = "year",
    values_to = "pct"
  )
# A tibble: 120 × 3
   field_of_study                           year    pct
   <chr>                                    <chr> <dbl>
 1 academics.program_percentage.engineering 2001  0.239
 2 academics.program_percentage.engineering 2002  0.290
 3 academics.program_percentage.engineering 2003  0.173
 4 academics.program_percentage.engineering 2004  0.161
 5 academics.program_percentage.engineering 2005  0.168
 6 academics.program_percentage.engineering 2006  0.170
 7 academics.program_percentage.engineering 2007  0.181
 8 academics.program_percentage.engineering 2008  0.183
 9 academics.program_percentage.engineering 2009  0.181
10 academics.program_percentage.engineering 2010  0.179
# ℹ 110 more rows
  • Question: What is the type of the year variable? Why? What should it be?

It’s a character (chr) variable since the information came from the columns of the original data frame and R cannot know that these character strings represent years. The variable type should be numeric.
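
To check this on a small scale, the sketch below pivots a toy one-row data frame and inspects the type of the resulting column. The toy data frame and the placeholder name `field_1` are assumptions made for illustration; converting with as.numeric() after the fact is just one possible fix.

```r
library(tidyr)
library(tibble)

# Toy wide data frame; field_1 is a placeholder name (assumption)
wide <- tibble(field_of_study = "field_1", `2001` = 0.239, `2002` = 0.290)

long <- wide |>
  pivot_longer(cols = -field_of_study, names_to = "year", values_to = "pct")

# Column names are strings, so the new year column is character
class(long$year)

# Converting after the fact gives numeric years
year_num <- as.numeric(long$year)
class(year_num)
```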

  • Demo: Start over with pivoting, and this time also make sure year is a numerical variable in the resulting data frame.
cornell_deg |>
  pivot_longer(
    cols = -field_of_study,
    names_to = "year",
    names_transform = parse_number,
    values_to = "pct"
  )
# A tibble: 120 × 3
   field_of_study                            year   pct
   <chr>                                    <dbl> <dbl>
 1 academics.program_percentage.engineering  2001 0.239
 2 academics.program_percentage.engineering  2002 0.290
 3 academics.program_percentage.engineering  2003 0.173
 4 academics.program_percentage.engineering  2004 0.161
 5 academics.program_percentage.engineering  2005 0.168
 6 academics.program_percentage.engineering  2006 0.170
 7 academics.program_percentage.engineering  2007 0.181
 8 academics.program_percentage.engineering  2008 0.183
 9 academics.program_percentage.engineering  2009 0.181
10 academics.program_percentage.engineering  2010 0.179
# ℹ 110 more rows
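
As a variation, names_transform also accepts a named list of column name and function pairs, which is useful when only some of the new columns should be converted. The sketch below uses a toy data frame (an assumption, with the placeholder name `field_1`) and as.integer() instead of parse_number(), so the years come out as integers rather than doubles.

```r
library(tidyr)
library(tibble)

# Toy wide data frame; field_1 is a placeholder name (assumption)
wide <- tibble(field_of_study = "field_1", `2001` = 0.239, `2002` = 0.290)

long_int <- wide |>
  pivot_longer(
    cols = -field_of_study,
    names_to = "year",
    names_transform = list(year = as.integer), # named list: column -> function
    values_to = "pct"
  )

# The year column is now an integer vector
class(long_int$year)
```

Either type works for a continuous x-axis; the choice mostly affects how the column prints.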
  • Demo: In our plot, each field of study is labeled with just its name. This information is in the field_of_study column of our dataset, but that column also contains additional characters we don’t need. Create a new column called field with levels Engineering, Business Marketing, Computer, Biological, Agriculture, and Social Science (in this order) based on field_of_study. Do this by adding on to your pipeline from earlier.
cornell_deg |>
  pivot_longer(
    cols = -field_of_study,
    names_to = "year",
    names_transform = parse_number,
    values_to = "pct"
  ) |>
  # split on literal dots and keep only the third piece as field
  separate(col = field_of_study, into = c(NA, NA, "field"), sep = "\\.") |>
  mutate(
    # recode the raw values to display labels
    field = case_match(
      .x = field,
      "business_marketing" ~ "Business Marketing",
      "engineering" ~ "Engineering",
      "computer" ~ "Computer",
      "biological" ~ "Biological",
      "social_science" ~ "Social Science",
      "agriculture" ~ "Agriculture"
    ),
    # convert to a factor with levels in the desired order
    field = fct_relevel(
      field, "Engineering", "Business Marketing", "Computer",
      "Biological", "Agriculture", "Social Science"
    ),
    .before = everything()
  )
# A tibble: 120 × 3
   field        year   pct
   <fct>       <dbl> <dbl>
 1 Engineering  2001 0.239
 2 Engineering  2002 0.290
 3 Engineering  2003 0.173
 4 Engineering  2004 0.161
 5 Engineering  2005 0.168
 6 Engineering  2006 0.170
 7 Engineering  2007 0.181
 8 Engineering  2008 0.183
 9 Engineering  2009 0.181
10 Engineering  2010 0.179
# ℹ 110 more rows
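
To see what the relevel step contributes, the sketch below runs the same three operations on a three-row toy data frame and compares the default factor levels with the releveled ones. Only the engineering string appears in full in the output above; the other two full strings are assumed to follow the same pattern.

```r
library(dplyr)
library(tidyr)
library(forcats)

# Toy input; the biological and computer strings are assumed, not printed above
toy <- tibble(
  field_of_study = c(
    "academics.program_percentage.biological",
    "academics.program_percentage.engineering",
    "academics.program_percentage.computer"
  )
)

recoded <- toy |>
  # split on literal dots and keep only the third piece
  separate(col = field_of_study, into = c(NA, NA, "field"), sep = "\\.") |>
  mutate(
    field = case_match(
      field,
      "biological" ~ "Biological",
      "engineering" ~ "Engineering",
      "computer" ~ "Computer"
    )
  )

# Without releveling, factor levels are sorted alphabetically
default_levels <- levels(factor(recoded$field))
default_levels

# fct_relevel() puts the named levels first, in the order given
plot_levels <- levels(fct_relevel(recoded$field, "Engineering", "Computer", "Biological"))
plot_levels
```

The level order matters later because ggplot2 uses it to order the legend entries and to assign colors.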
  • Your turn (5 minutes): Now we start making our plot, but let’s not get too fancy right away. Create the following plot, which will serve as the “first draft” on the way to our Goal. Do this by adding on to your pipeline from earlier.

Line plot of the percentage of Cornell degrees awarded in six fields of study from 2001 to 2020.

cornell_deg |>
  pivot_longer(
    cols = -field_of_study,
    names_to = "year",
    names_transform = parse_number,
    values_to = "pct"
  ) |>
  separate(col = field_of_study, into = c(NA, NA, "field"), sep = "\\.") |>
  mutate(
    field = case_match(
      .x = field,
      "business_marketing" ~ "Business Marketing",
      "engineering" ~ "Engineering",
      "computer" ~ "Computer",
      "biological" ~ "Biological",
      "social_science" ~ "Social Science",
      "agriculture" ~ "Agriculture"
    ),
    field = fct_relevel(
      field, "Engineering", "Business Marketing", "Computer",
      "Biological", "Agriculture", "Social Science"
    ),
    .before = everything()
  ) |>
  ggplot(aes(x = year, y = pct, color = field)) +
  geom_point() +
  geom_line()

  • Your turn (4 minutes): What aspects of the plot need to be updated to go from the draft you created above to the Goal plot at the beginning of this application exercise?
    • x-axis scale: needs to go from 2000 to 2020 in increments of 4 years

    • y-axis scale: percentage labeling

    • line colors

    • plot labels: title, subtitle, x, y, caption

    • theme

    • legend position and border

  • Demo: Update the x-axis scale so that the years displayed go from 2000 to 2020 in increments of 4 years. Update the y-axis scale so it uses percentage formatting. Do this by adding on to your pipeline from earlier.
cornell_deg |>
  pivot_longer(
    cols = -field_of_study,
    names_to = "year",
    names_transform = parse_number,
    values_to = "pct"
  ) |>
  separate(col = field_of_study, into = c(NA, NA, "field"), sep = "\\.") |>
  mutate(
    field = case_match(
      .x = field,
      "business_marketing" ~ "Business Marketing",
      "engineering" ~ "Engineering",
      "computer" ~ "Computer",
      "biological" ~ "Biological",
      "social_science" ~ "Social Science",
      "agriculture" ~ "Agriculture"
    ),
    field = fct_relevel(
      field, "Engineering", "Business Marketing", "Computer",
      "Biological", "Agriculture", "Social Science"
    ),
    .before = everything()
  ) |>
  ggplot(aes(x = year, y = pct, color = field)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(limits = c(2000, 2020), breaks = seq(2000, 2020, 4)) +
  scale_y_continuous(labels = label_percent())
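
The two scale arguments can be inspected on their own, outside of a plot. The sketch below prints the break positions produced by seq() and applies the labeling function returned by label_percent() to a few proportions; the input values are arbitrary examples chosen for illustration.

```r
library(scales)

# Break positions for the x-axis: every 4 years from 2000 to 2020
year_breaks <- seq(2000, 2020, 4)
year_breaks

# label_percent() returns a function that turns proportions into
# percentage strings; ggplot2 calls it on the y-axis break values
pct_labels <- label_percent()(c(0.05, 0.1, 0.25))
pct_labels
```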

  • Demo: Update line colors using the scale_color_colorblind() palette from ggthemes. Once again, do this by adding on to your pipeline from earlier.
library(ggthemes)

cornell_deg |>
  pivot_longer(
    cols = -field_of_study,
    names_to = "year",
    names_transform = parse_number,
    values_to = "pct"
  ) |>
  separate(col = field_of_study, into = c(NA, NA, "field"), sep = "\\.") |>
  mutate(
    field = case_match(
      .x = field,
      "business_marketing" ~ "Business Marketing",
      "engineering" ~ "Engineering",
      "computer" ~ "Computer",
      "biological" ~ "Biological",
      "social_science" ~ "Social Science",
      "agriculture" ~ "Agriculture"
    ),
    field = fct_relevel(
      field, "Engineering", "Business Marketing", "Computer",
      "Biological", "Agriculture", "Social Science"
    ),
    .before = everything()
  ) |>
  ggplot(aes(x = year, y = pct, color = field)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(limits = c(2000, 2020), breaks = seq(2000, 2020, 4)) +
  scale_y_continuous(labels = label_percent()) +
  scale_color_colorblind()
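
To preview the colors this scale draws from, ggthemes also exposes the underlying palette function, colorblind_pal(). The sketch below asks it for six colors, one per field; the colors should be assigned to the levels of field in order, though that is worth confirming against the rendered legend.

```r
library(ggthemes)

# colorblind_pal() returns a palette function; calling it with n gives n colors
palette_six <- colorblind_pal()(6)

# Six hex color codes, one per field of study
palette_six
```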

  • Your turn (4 minutes): Update the plot labels (title, subtitle, x, y, and caption) and use theme_minimal(). Once again, do this by adding on to your pipeline from earlier.
cornell_deg |>
  pivot_longer(
    cols = -field_of_study,
    names_to = "year",
    names_transform = parse_number,
    values_to = "pct"
  ) |>
  separate(col = field_of_study, into = c(NA, NA, "field"), sep = "\\.") |>
  mutate(
    field = case_match(
      .x = field,
      "business_marketing" ~ "Business Marketing",
      "engineering" ~ "Engineering",
      "computer" ~ "Computer",
      "biological" ~ "Biological",
      "social_science" ~ "Social Science",
      "agriculture" ~ "Agriculture"
    ),
    field = fct_relevel(
      field, "Engineering", "Business Marketing", "Computer",
      "Biological", "Agriculture", "Social Science"
    ),
    .before = everything()
  ) |>
  ggplot(aes(x = year, y = pct, color = field)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(limits = c(2000, 2020), breaks = seq(2000, 2020, 4)) +
  scale_color_colorblind() +
  scale_y_continuous(labels = label_percent()) +
  labs(
    x = "Graduation year",
    y = "Percent of degrees awarded",
    color = "Field of study",
    title = "Cornell University degrees awarded from 2001-2020",
    subtitle = "Only the top six fields as of 2020",
    caption = "Source: Department of Education\nhttps://collegescorecard.ed.gov/"
  ) +
  theme_minimal()

  • Demo: Finally, set fig-width: 7 and fig-height: 5 for your plot in the chunk options.
#| fig-width: 7
#| fig-height: 5

cornell_deg |>
  pivot_longer(
    cols = -field_of_study,
    names_to = "year",
    names_transform = parse_number,
    values_to = "pct"
  ) |>
  separate(col = field_of_study, into = c(NA, NA, "field"), sep = "\\.") |>
  mutate(
    field = case_match(
      .x = field,
      "business_marketing" ~ "Business Marketing",
      "engineering" ~ "Engineering",
      "computer" ~ "Computer",
      "biological" ~ "Biological",
      "social_science" ~ "Social Science",
      "agriculture" ~ "Agriculture"
    ),
    field = fct_relevel(
      field, "Engineering", "Business Marketing", "Computer",
      "Biological", "Agriculture", "Social Science"
    ),
    .before = everything()
  ) |>
  ggplot(aes(x = year, y = pct, color = field)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(limits = c(2000, 2020), breaks = seq(2000, 2020, 4)) +
  scale_color_colorblind() +
  scale_y_continuous(labels = label_percent()) +
  labs(
    x = "Graduation year",
    y = "Percent of degrees awarded",
    color = "Field of study",
    title = "Cornell University degrees awarded from 2001-2020",
    subtitle = "Only the top six fields as of 2020",
    caption = "Source: Department of Education\nhttps://collegescorecard.ed.gov/"
  ) +
  theme_minimal()
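
The chunk options only apply when the document is rendered. Outside of a rendered document, a similar result can be had by saving the plot with ggsave() at the same dimensions. The sketch below is an illustration under assumptions: it uses a toy plot in place of the finished figure and writes to a temporary file.

```r
library(ggplot2)

# Toy plot standing in for the finished figure (values are illustrative)
toy_plot <- ggplot(
  data.frame(year = c(2001, 2002, 2003), pct = c(0.239, 0.290, 0.173)),
  aes(x = year, y = pct)
) +
  geom_line()

# Hypothetical output path; a temporary file keeps the example self-contained
out_file <- tempfile(fileext = ".pdf")

# width and height are in inches by default, like fig-width and fig-height
ggsave(out_file, plot = toy_plot, width = 7, height = 5)

file.exists(out_file)
```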

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       macOS Ventura 13.4.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2023-09-15
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 bit           4.0.5   2022-11-15 [1] CRAN (R 4.3.0)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.3.0)
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
 colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.3.0)
 digest        0.6.31  2022-12-11 [1] CRAN (R 4.3.0)
 dplyr       * 1.1.2   2023-04-20 [1] CRAN (R 4.3.0)
 evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.0)
 fansi         1.0.4   2023-01-22 [1] CRAN (R 4.3.0)
 farver        2.1.1   2022-07-06 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.2   2023-04-03 [1] CRAN (R 4.3.0)
 ggthemes    * 4.2.4   2021-01-20 [1] CRAN (R 4.3.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
 gtable        0.3.3   2023-03-21 [1] CRAN (R 4.3.0)
 here          1.0.1   2020-12-13 [1] CRAN (R 4.3.0)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
 htmltools     0.5.5   2023-03-23 [1] CRAN (R 4.3.0)
 htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.3.0)
 jsonlite      1.8.5   2023-06-05 [1] CRAN (R 4.3.0)
 knitr         1.43    2023-05-25 [1] CRAN (R 4.3.0)
 labeling      0.4.2   2020-10-20 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
 lubridate   * 1.9.2   2023-02-10 [1] CRAN (R 4.3.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 purrr       * 1.0.1   2023-01-10 [1] CRAN (R 4.3.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 ragg          1.2.5   2023-01-12 [1] CRAN (R 4.3.0)
 readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.3.0)
 rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
 rmarkdown     2.22    2023-06-01 [1] CRAN (R 4.3.0)
 rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.3.0)
 rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.3.0)
 scales      * 1.2.1   2022-08-20 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 stringi       1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
 systemfonts   1.0.4   2022-02-11 [1] CRAN (R 4.3.0)
 textshaping   0.3.6   2021-10-13 [1] CRAN (R 4.3.0)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.0)
 tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
 utf8          1.2.3   2023-01-31 [1] CRAN (R 4.3.0)
 vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.3.0)
 vroom         1.6.3   2023-04-28 [1] CRAN (R 4.3.0)
 withr         2.5.0   2022-03-03 [1] CRAN (R 4.3.0)
 xfun          0.39    2023-04-20 [1] CRAN (R 4.3.0)
 yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────