AE 03: Wrangling professor evaluations

Suggested answers

Important

These are suggested answers. This document should be used as a reference only; it is not designed to be an exhaustive key.

To demonstrate data wrangling, we will use evals. It contains anonymized information on end-of-semester student evaluations for 463 courses taught by a sample of 94 professors from the University of Texas at Austin.[1]

[1] Source: Daniel S. Hamermesh and Amy Parker, "Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity", Economics of Education Review, Volume 24, Issue 4, 2005, and OpenIntro.

library(tidyverse)
evals <- read_csv("data/course-evals.csv")

The data frame has 463 observations (rows), so we will not view the entire data frame. Instead, we’ll use the commands below to explore the data.

glimpse(evals)
Rows: 463
Columns: 23
$ course_id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ prof_id       <dbl> 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,…
$ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4…
$ rank          <chr> "tenure track", "tenure track", "tenure track", "tenure …
$ ethnicity     <chr> "minority", "minority", "minority", "minority", "not min…
$ gender        <chr> "female", "female", "female", "female", "male", "male", …
$ language      <chr> "english", "english", "english", "english", "english", "…
$ age           <dbl> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, …
$ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000, 87.500…
$ cls_did_eval  <dbl> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, 14,…
$ cls_students  <dbl> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 20, …
$ cls_level     <chr> "upper", "upper", "upper", "upper", "upper", "upper", "u…
$ cls_profs     <chr> "single", "single", "single", "single", "multiple", "mul…
$ cls_credits   <chr> "multi credit", "multi credit", "multi credit", "multi c…
$ bty_f1lower   <dbl> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 7, 7,…
$ bty_f1upper   <dbl> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 9, 9,…
$ bty_f2upper   <dbl> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 9, 9,…
$ bty_m1lower   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7, 7,…
$ bty_m1upper   <dbl> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 6,…
$ bty_m2upper   <dbl> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 6, 6,…
$ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, …
$ pic_outfit    <chr> "not formal", "not formal", "not formal", "not formal", …
$ pic_color     <chr> "color", "color", "color", "color", "color", "color", "c…
names(evals)
 [1] "course_id"     "prof_id"       "score"         "rank"         
 [5] "ethnicity"     "gender"        "language"      "age"          
 [9] "cls_perc_eval" "cls_did_eval"  "cls_students"  "cls_level"    
[13] "cls_profs"     "cls_credits"   "bty_f1lower"   "bty_f1upper"  
[17] "bty_f2upper"   "bty_m1lower"   "bty_m1upper"   "bty_m2upper"  
[21] "bty_avg"       "pic_outfit"    "pic_color"    
head(evals)
# A tibble: 6 × 23
  course_id prof_id score rank     ethnicity gender language   age cls_perc_eval
      <dbl>   <dbl> <dbl> <chr>    <chr>     <chr>  <chr>    <dbl>         <dbl>
1         1       1   4.7 tenure … minority  female english     36          55.8
2         2       1   4.1 tenure … minority  female english     36          68.8
3         3       1   3.9 tenure … minority  female english     36          60.8
4         4       1   4.8 tenure … minority  female english     36          62.6
5         5       2   4.6 tenured  not mino… male   english     59          85  
6         6       2   4.3 tenured  not mino… male   english     59          87.5
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
#   cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
#   bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
#   bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>

The head() function prints the header “A tibble: 6 × 23” (6 rows, 23 columns) followed by the first six rows of the evals data.

Tibble vs. data frame

A tibble is an opinionated version of the R data frame. In other words, all tibbles are data frames, but not all data frames are tibbles!

There are two main differences between a tibble and a data frame:

  1. When you print a tibble, the first ten rows and all of the columns that fit on the screen will display, along with the type of each column.

    Let’s look at the differences in the output when we type evals (tibble) in the console versus typing cars (data frame) in the console.

  2. Tibbles are somewhat stricter than data frames when it comes to subsetting data. If you try to access a variable that doesn’t exist, a tibble returns NULL with a warning message, while a data frame silently returns NULL.

evals$apple
Warning: Unknown or uninitialised column: `apple`.
NULL
cars$apple
NULL
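
The two differences above can be checked on a small scale. The following is a minimal sketch that uses only the built-in cars data frame; the name cars_tbl is an assumption introduced for illustration and does not appear in the exercise.

```r
library(tibble)

# cars is a built-in base R data frame; convert a copy to a tibble
cars_tbl <- as_tibble(cars)

# All tibbles are data frames, but not all data frames are tibbles
is.data.frame(cars_tbl)
is_tibble(cars_tbl)
is_tibble(cars)

# Accessing a column that does not exist:
# the data frame returns NULL silently, the tibble returns NULL with a warning
cars$apple
cars_tbl$apple
```

Printing cars_tbl and cars one after the other in the console also shows the difference in how the two are displayed.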

Data wrangling with dplyr

dplyr is the primary package in the tidyverse for data wrangling. For more detail, see the dplyr reference page and the data transformation cheatsheet.

Quick summary of key dplyr functions:

Rows:

  • filter(): chooses rows based on column values.
  • slice(): chooses rows based on location.
  • arrange(): changes the order of the rows.
  • sample_n(): takes a random subset of the rows (superseded by slice_sample() in recent versions of dplyr).

Columns:

  • select(): changes whether or not a column is included.
  • rename(): changes the name of columns.
  • mutate(): changes the values of columns and creates new columns.

Groups of rows:

  • summarize(): collapses a group into a single row.
  • count(): counts the unique values of one or more variables.
  • group_by(): sets up calculations to be performed separately for each value of a variable.

select()

  • Demo: Make a data frame that only contains the variables score and cls_students.
evals |>
  select(score, cls_students)
# A tibble: 463 × 2
   score cls_students
   <dbl>        <dbl>
 1   4.7           43
 2   4.1          125
 3   3.9          125
 4   4.8          123
 5   4.6           20
 6   4.3           40
 7   2.8           44
 8   4.1           55
 9   3.4          195
10   4.5           46
# ℹ 453 more rows
  • Demo: Make a data frame that keeps every variable except cls_students.
evals |>
  select(-cls_students)
# A tibble: 463 × 22
   course_id prof_id score rank    ethnicity gender language   age cls_perc_eval
       <dbl>   <dbl> <dbl> <chr>   <chr>     <chr>  <chr>    <dbl>         <dbl>
 1         1       1   4.7 tenure… minority  female english     36          55.8
 2         2       1   4.1 tenure… minority  female english     36          68.8
 3         3       1   3.9 tenure… minority  female english     36          60.8
 4         4       1   4.8 tenure… minority  female english     36          62.6
 5         5       2   4.6 tenured not mino… male   english     59          85  
 6         6       2   4.3 tenured not mino… male   english     59          87.5
 7         7       2   2.8 tenured not mino… male   english     59          88.6
 8         8       3   4.1 tenured not mino… male   english     51         100  
 9         9       3   3.4 tenured not mino… male   english     51          56.9
10        10       4   4.5 tenured not mino… female english     40          87.0
# ℹ 453 more rows
# ℹ 13 more variables: cls_did_eval <dbl>, cls_level <chr>, cls_profs <chr>,
#   cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>, bty_f2upper <dbl>,
#   bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>, bty_avg <dbl>,
#   pic_outfit <chr>, pic_color <chr>
  • Demo: Make a data frame that includes all variables from score through age (inclusive).
evals |>
  select(score:age)
# A tibble: 463 × 6
   score rank         ethnicity    gender language   age
   <dbl> <chr>        <chr>        <chr>  <chr>    <dbl>
 1   4.7 tenure track minority     female english     36
 2   4.1 tenure track minority     female english     36
 3   3.9 tenure track minority     female english     36
 4   4.8 tenure track minority     female english     36
 5   4.6 tenured      not minority male   english     59
 6   4.3 tenured      not minority male   english     59
 7   2.8 tenured      not minority male   english     59
 8   4.1 tenured      not minority male   english     51
 9   3.4 tenured      not minority male   english     51
10   4.5 tenured      not minority female english     40
# ℹ 453 more rows
  • Demo: Use the select helper contains() to make a data frame that includes the variables associated with the class, i.e., contains the string "cls_" in the name.
evals |>
  select(contains("cls_"))
# A tibble: 463 × 6
   cls_perc_eval cls_did_eval cls_students cls_level cls_profs cls_credits 
           <dbl>        <dbl>        <dbl> <chr>     <chr>     <chr>       
 1          55.8           24           43 upper     single    multi credit
 2          68.8           86          125 upper     single    multi credit
 3          60.8           76          125 upper     single    multi credit
 4          62.6           77          123 upper     single    multi credit
 5          85             17           20 upper     multiple  multi credit
 6          87.5           35           40 upper     multiple  multi credit
 7          88.6           39           44 upper     multiple  multi credit
 8         100             55           55 upper     single    multi credit
 9          56.9          111          195 upper     single    multi credit
10          87.0           40           46 upper     single    multi credit
# ℹ 453 more rows
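
contains() is one of several select helpers. The sketch below shows two others, starts_with() and where(), on a tiny hypothetical stand-in for evals (evals_small is invented here so that the example runs without the course data file).

```r
library(dplyr)

# Hypothetical miniature of evals, for illustration only
evals_small <- tibble(
  score = c(4.7, 4.1),
  cls_students = c(43, 125),
  cls_level = c("upper", "upper"),
  bty_avg = c(5, 5)
)

# starts_with() matches a prefix; contains() matches anywhere in the name
evals_small |>
  select(starts_with("cls_"))

# where() selects columns by a property of their contents, here numeric type
evals_small |>
  select(where(is.numeric))
```

For the evals columns, starts_with("cls_") and contains("cls_") would pick the same variables, because "cls_" only ever appears at the start of a name.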

The pipe

Before working with more data wrangling functions, let’s formally introduce the pipe. The pipe, |>, is an operator (a tool) for passing information from one process to another. We will use |> mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.

When reading code “in English”, say “and then” whenever you see a pipe.

  • Your turn (4 minutes): Run the following chunk and observe its output. Then, come up with a different way of obtaining the same output.
evals |>
  select(score, rank) |>
  head()
# A tibble: 6 × 2
  score rank        
  <dbl> <chr>       
1   4.7 tenure track
2   4.1 tenure track
3   3.9 tenure track
4   4.8 tenure track
5   4.6 tenured     
6   4.3 tenured     
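
One possible answer is to nest the function calls, or to store the intermediate result in a named object. The sketch below checks that all three approaches agree; it uses a small hypothetical stand-in for evals (evals_small), so the values are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for evals, for illustration only
evals_small <- tibble(
  score = c(4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8),
  rank = c(
    "tenure track", "tenure track", "tenure track", "tenure track",
    "tenured", "tenured", "tenured"
  ),
  age = c(36, 36, 36, 36, 59, 59, 59)
)

# Option 1: the pipeline from the exercise
piped <- evals_small |>
  select(score, rank) |>
  head()

# Option 2: nested calls, read from the inside out
nested <- head(select(evals_small, score, rank))

# Option 3: an intermediate object for each step
selected <- select(evals_small, score, rank)
stepwise <- head(selected)

# All three produce the same tibble
identical(piped, nested)
identical(piped, stepwise)
```

The piped version reads top to bottom in the order the steps happen, which is the main reason to prefer it as pipelines grow longer.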

slice()

Look at the documentation for slice()[3] or the data transformation cheat sheet. What variations of slice_*() might be useful here?

[3] Run ?slice in the console.

  • Your turn: Display the first five rows of the evals data frame.
evals |>
  slice(1:5)
# A tibble: 5 × 23
  course_id prof_id score rank     ethnicity gender language   age cls_perc_eval
      <dbl>   <dbl> <dbl> <chr>    <chr>     <chr>  <chr>    <dbl>         <dbl>
1         1       1   4.7 tenure … minority  female english     36          55.8
2         2       1   4.1 tenure … minority  female english     36          68.8
3         3       1   3.9 tenure … minority  female english     36          60.8
4         4       1   4.8 tenure … minority  female english     36          62.6
5         5       2   4.6 tenured  not mino… male   english     59          85  
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
#   cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
#   bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
#   bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
# with slice_head()
evals |>
  slice_head(n = 5)
# A tibble: 5 × 23
  course_id prof_id score rank     ethnicity gender language   age cls_perc_eval
      <dbl>   <dbl> <dbl> <chr>    <chr>     <chr>  <chr>    <dbl>         <dbl>
1         1       1   4.7 tenure … minority  female english     36          55.8
2         2       1   4.1 tenure … minority  female english     36          68.8
3         3       1   3.9 tenure … minority  female english     36          60.8
4         4       1   4.8 tenure … minority  female english     36          62.6
5         5       2   4.6 tenured  not mino… male   english     59          85  
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
#   cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
#   bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
#   bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
  • Your turn: Display the last two rows of the evals data frame.
evals |>
  slice((n() - 1):n())
# A tibble: 2 × 23
  course_id prof_id score rank     ethnicity gender language   age cls_perc_eval
      <dbl>   <dbl> <dbl> <chr>    <chr>     <chr>  <chr>    <dbl>         <dbl>
1       462      94   4.4 tenure … minority  female non-eng…    42          81.8
2       463      94   4.1 tenure … minority  female non-eng…    42          80  
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
#   cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
#   bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
#   bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
# with slice_tail()
evals |>
  slice_tail(n = 2)
# A tibble: 2 × 23
  course_id prof_id score rank     ethnicity gender language   age cls_perc_eval
      <dbl>   <dbl> <dbl> <chr>    <chr>     <chr>  <chr>    <dbl>         <dbl>
1       462      94   4.4 tenure … minority  female non-eng…    42          81.8
2       463      94   4.1 tenure … minority  female non-eng…    42          80  
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
#   cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
#   bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
#   bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
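
Beyond slice_head() and slice_tail(), the slice_*() family also includes slice_max(), slice_min(), and slice_sample(). The following is a minimal sketch on a hypothetical six-row tibble (evals_small is an assumption, not the course data).

```r
library(dplyr)

# Hypothetical stand-in for evals, for illustration only
evals_small <- tibble(
  course_id = 1:6,
  score = c(4.7, 4.1, 3.9, 4.8, 4.6, 2.8)
)

# The two rows with the highest score, highest first
top_two <- evals_small |>
  slice_max(score, n = 2)

# The row with the lowest score
lowest <- evals_small |>
  slice_min(score, n = 1)

# Three rows chosen at random (the result changes from run to run)
random_three <- evals_small |>
  slice_sample(n = 3)

top_two
lowest
random_three
```

By default slice_max() and slice_min() keep ties, so they can return more than n rows when several rows share the cutoff value.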

arrange()

  • Your turn: Let’s arrange the data by score, so the courses with the lowest scores will be at the top of the data frame.
evals |>
  arrange(score)
# A tibble: 463 × 23
   course_id prof_id score rank    ethnicity gender language   age cls_perc_eval
       <dbl>   <dbl> <dbl> <chr>   <chr>     <chr>  <chr>    <dbl>         <dbl>
 1       162      30   2.3 tenure… not mino… female english     41          83.3
 2       335      68   2.4 tenured not mino… male   english     60          71.9
 3        40       8   2.5 tenured not mino… female english     51          80  
 4       337      68   2.5 tenured not mino… male   english     60          62.5
 5       329      66   2.7 tenured not mino… male   english     64          81.8
 6       376      76   2.7 tenured minority  female english     43          48.9
 7         7       2   2.8 tenured not mino… male   english     59          88.6
 8       185      34   2.8 tenure… minority  female english     47          92.3
 9       434      88   2.8 tenured not mino… male   english     62          40.9
10        79      15   2.9 tenure… not mino… female english     37          82.1
# ℹ 453 more rows
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
#   cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
#   bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
#   bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
  • Your turn: Now let’s arrange the data by descending score, so the evals with the highest scores will be at the top.
evals |>
  arrange(desc(score))
# A tibble: 463 × 23
   course_id prof_id score rank    ethnicity gender language   age cls_perc_eval
       <dbl>   <dbl> <dbl> <chr>   <chr>     <chr>  <chr>    <dbl>         <dbl>
 1        54      10     5 teachi… not mino… male   english     47          90.9
 2        57      10     5 teachi… not mino… male   english     47          83.3
 3        59      10     5 teachi… not mino… male   english     47          80  
 4       103      19     5 tenured not mino… female english     46          93.3
 5       108      19     5 tenured not mino… female english     46         100  
 6       349      71     5 teachi… minority  male   english     50          90.9
 7       356      71     5 teachi… minority  male   english     50          95.2
 8       406      82     5 tenured not mino… male   english     57          40  
 9       420      85     5 teachi… not mino… male   english     58         100  
10       421      85     5 teachi… not mino… male   english     58          85.7
# ℹ 453 more rows
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
#   cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
#   bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
#   bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
  • Your turn (5 minutes): Create a data frame that only includes the evaluation score (score), faculty rank (rank), and average beauty rating of the professor (bty_avg) for the course with the highest evaluation score, using the highest average beauty rating to break ties. What is the average beauty rating (bty_avg) for this professor?
evals |>
  select(score, rank, bty_avg) |>
  arrange(desc(score), desc(bty_avg)) |>
  slice(1)
# A tibble: 1 × 3
  score rank     bty_avg
  <dbl> <chr>      <dbl>
1     5 teaching    7.83
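
When arrange() receives more than one variable, later variables only break ties in the earlier ones. The sketch below illustrates this on a hypothetical four-row tibble (toy is invented for this example).

```r
library(dplyr)

# Hypothetical data with a three-way tie on score
toy <- tibble(
  score = c(5, 4.9, 5, 5),
  bty_avg = c(3.2, 8.1, 7.8, 4.5)
)

# Sort by score first; bty_avg only orders rows that share the same score
sorted <- toy |>
  arrange(desc(score), desc(bty_avg))

sorted
```

The row with bty_avg 8.1 ends up last even though it has the highest beauty rating, because its score is lower than the others.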
Note

Feel free to work ahead on the remaining exercises but we will pause to check in at this point.

filter()

  • Demo: Filter the data frame by selecting the rows where the faculty is on the teaching-track.
evals |>
  filter(rank == "teaching")
# A tibble: 102 × 23
   course_id prof_id score rank    ethnicity gender language   age cls_perc_eval
       <dbl>   <dbl> <dbl> <chr>   <chr>     <chr>  <chr>    <dbl>         <dbl>
 1        50      10   4   teachi… not mino… male   english     47          84.2
 2        51      10   4.3 teachi… not mino… male   english     47          75  
 3        52      10   4.4 teachi… not mino… male   english     47          93.3
 4        53      10   4.5 teachi… not mino… male   english     47          95.7
 5        54      10   5   teachi… not mino… male   english     47          90.9
 6        55      10   4.9 teachi… not mino… male   english     47          58.6
 7        56      10   4.6 teachi… not mino… male   english     47          76.2
 8        57      10   5   teachi… not mino… male   english     47          83.3
 9        58      10   4.7 teachi… not mino… male   english     47          84.2
10        59      10   5   teachi… not mino… male   english     47          80  
# ℹ 92 more rows
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
#   cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
#   bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
#   bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
  • Demo: We can also filter using more than one condition. Here we select all rows where the faculty is teaching-track and the evaluation score is greater than 3.5.
evals |>
  filter(rank == "teaching", score > 3.5)
# A tibble: 87 × 23
   course_id prof_id score rank    ethnicity gender language   age cls_perc_eval
       <dbl>   <dbl> <dbl> <chr>   <chr>     <chr>  <chr>    <dbl>         <dbl>
 1        50      10   4   teachi… not mino… male   english     47          84.2
 2        51      10   4.3 teachi… not mino… male   english     47          75  
 3        52      10   4.4 teachi… not mino… male   english     47          93.3
 4        53      10   4.5 teachi… not mino… male   english     47          95.7
 5        54      10   5   teachi… not mino… male   english     47          90.9
 6        55      10   4.9 teachi… not mino… male   english     47          58.6
 7        56      10   4.6 teachi… not mino… male   english     47          76.2
 8        57      10   5   teachi… not mino… male   english     47          83.3
 9        58      10   4.7 teachi… not mino… male   english     47          84.2
10        59      10   5   teachi… not mino… male   english     47          80  
# ℹ 77 more rows
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
#   cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
#   bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
#   bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>

We can do more complex tasks using logical operators:

operator       definition
<              is less than?
<=             is less than or equal to?
>              is greater than?
>=             is greater than or equal to?
==             is exactly equal to?
!=             is not equal to?
x & y          is x AND y?
x | y          is x OR y?
is.na(x)       is x NA?
!is.na(x)      is x not NA?
x %in% y       is x in y?
!(x %in% y)    is x not in y?
!x             is not x?

The final operator only makes sense if x is logical (TRUE / FALSE).
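
To see what these operators return, it can help to apply them to short vectors outside of filter(). This is a minimal base R sketch; the vectors scores and faculty_rank are hypothetical.

```r
# Hypothetical vectors, for illustration only
scores <- c(4.2, NA, 3.1, 4.8)
faculty_rank <- c("teaching", "tenured", "tenure track", "tenured")

# Which scores are missing?
missing_score <- is.na(scores)

# Which scores are observed AND above 4?
high_score <- !is.na(scores) & scores > 4

# Which ranks are in the given set, and which are not?
on_tenure_path <- faculty_rank %in% c("tenure track", "tenured")
not_on_tenure_path <- !(faculty_rank %in% c("tenure track", "tenured"))

missing_score
high_score
on_tenure_path
not_on_tenure_path
```

Each expression returns one TRUE or FALSE per element, which is the form filter() expects for its conditions.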

  • Your turn (4 minutes): Describe what the code is doing in words.
evals |>
  filter(
    rank %in% c("tenure track", "tenured"),
    score > 3.5, bty_avg > 6
  )
# A tibble: 55 × 23
   course_id prof_id score rank    ethnicity gender language   age cls_perc_eval
       <dbl>   <dbl> <dbl> <chr>   <chr>     <chr>  <chr>    <dbl>         <dbl>
 1        18       5   4.8 tenure… not mino… female english     31          87.5
 2        19       5   4.6 tenure… not mino… female english     31          90.9
 3        20       5   4.6 tenure… not mino… female english     31          79.2
 4        21       5   4.9 tenure… not mino… female english     31          88.9
 5        22       5   4.6 tenure… not mino… female english     31          88.1
 6        23       5   4.5 tenure… not mino… female english     31          56.3
 7       140      25   4.8 tenure… not mino… female english     34          76.9
 8       141      25   4.1 tenure… not mino… female english     34          82.5
 9       194      36   3.9 tenured minority  female english     44          54.5
10       196      36   4   tenured minority  female english     44         100  
# ℹ 45 more rows
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
#   cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
#   bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
#   bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
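
One way to put this in words: the code keeps the courses taught by tenure-track or tenured faculty that have an evaluation score above 3.5 and an average beauty rating above 6. The sketch below, on a hypothetical four-row tibble (toy), shows that conditions separated by commas behave like conditions joined with &, and that %in% stands in for a chain of == comparisons joined with |.

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- tibble(
  rank = c("teaching", "tenured", "tenure track", "tenured"),
  score = c(4.9, 4.0, 3.2, 4.5),
  bty_avg = c(7, 6.5, 8, 5)
)

# Conditions separated by commas must all be TRUE for a row to be kept
with_commas <- toy |>
  filter(
    rank %in% c("tenure track", "tenured"),
    score > 3.5, bty_avg > 6
  )

# The same filter written with explicit | and & operators
with_operators <- toy |>
  filter(
    (rank == "tenure track" | rank == "tenured") &
      score > 3.5 & bty_avg > 6
  )

identical(with_commas, with_operators)
```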

count()

  • Demo: Create a frequency table of the ethnicity of the evaluated professors.
evals |>
  count(ethnicity)
# A tibble: 2 × 2
  ethnicity        n
  <chr>        <int>
1 minority        64
2 not minority   399
  • Demo: Which faculty rank had the fewest evals? How many evals were there for that group?
evals |>
  count(rank) |>
  filter(n == min(n))
# A tibble: 1 × 2
  rank         n
  <chr>    <int>
1 teaching   102
  • Your turn (5 minutes): Which type of faculty (based on rank, gender, and ethnicity) is most highly represented in this dataset? How many courses did they teach in this sample?
evals |>
  count(rank, gender, ethnicity) |>
  filter(n == max(n))
# A tibble: 1 × 4
  rank    gender ethnicity        n
  <chr>   <chr>  <chr>        <int>
1 tenured male   not minority   162
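
count() can be thought of as a shortcut for group_by() followed by summarize() with n(). The following sketch compares the two on a hypothetical five-row tibble (toy) and also shows the sort argument, which orders the result from most to least frequent.

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- tibble(
  rank = c("teaching", "tenured", "tenured", "tenure track", "tenured")
)

# Frequency table with count()
via_count <- toy |>
  count(rank)

# The same table built by hand
via_summarize <- toy |>
  group_by(rank) |>
  summarize(n = n())

# sort = TRUE puts the largest group first
sorted_counts <- toy |>
  count(rank, sort = TRUE)

via_count
via_summarize
sorted_counts
```

With sort = TRUE, the most represented group is in the first row, which is an alternative to filtering on n == max(n).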

mutate()

Use mutate() to create a new variable.

  • Demo: In the code chunk below, we calculate the difference in the average beauty ratings by gender of the rater (bty_f* vs bty_m*).
evals |>
  mutate(
    bty_avg_f = (bty_f1lower + bty_f1upper + bty_f2upper) / 3,
    bty_avg_m = (bty_m1lower + bty_m1upper + bty_m2upper) / 3,
    bty_avg_diff = bty_avg_f - bty_avg_m
  ) |>
  select(score, bty_avg_f, bty_avg_m, bty_avg_diff)
# A tibble: 463 × 4
   score bty_avg_f bty_avg_m bty_avg_diff
   <dbl>     <dbl>     <dbl>        <dbl>
 1   4.7      6         4           2    
 2   4.1      6         4           2    
 3   3.9      6         4           2    
 4   4.8      6         4           2    
 5   4.6      3.33      2.67        0.667
 6   4.3      3.33      2.67        0.667
 7   2.8      3.33      2.67        0.667
 8   4.1      4         2.67        1.33 
 9   3.4      4         2.67        1.33 
10   4.5      3.67      2.67        1    
# ℹ 453 more rows
  • Your turn (4 minutes): Create a new variable to calculate the percentage of evals for each faculty rank. What percentage of evals were for teaching-track faculty?
evals |>
  count(rank) |>
  mutate(perc = n / sum(n) * 100)
# A tibble: 3 × 3
  rank             n  perc
  <chr>        <int> <dbl>
1 teaching       102  22.0
2 tenure track   108  23.3
3 tenured        253  54.6

summarize()

summarize() collapses the rows into summary statistics and removes columns irrelevant to the calculation.

Be sure to name your columns!

evals |>
  summarize(mean_score = mean(score))
# A tibble: 1 × 1
  mean_score
       <dbl>
1         NA

Question: Why did this code return NA?

Let’s fix it!

evals |>
  summarize(mean_score = mean(score, na.rm = TRUE))
# A tibble: 1 × 1
  mean_score
       <dbl>
1       4.18
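
The first attempt returned NA because mean() returns NA whenever any of its input values is missing, and score contains missing values. A minimal sketch of the same behavior on a hypothetical three-element vector (scores):

```r
# Hypothetical vector with one missing value
scores <- c(4.5, NA, 3.5)

# Any NA in the input makes the mean NA
mean_with_na <- mean(scores)

# Count the missing values
n_missing <- sum(is.na(scores))

# na.rm = TRUE drops the missing values before averaging
mean_without_na <- mean(scores, na.rm = TRUE)

mean_with_na
n_missing
mean_without_na
```

Note that na.rm = TRUE averages over the observed values only, so it is worth checking how many values were missing before interpreting the result.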

group_by()

group_by() is used for grouped operations. It’s very powerful when paired with summarize() to calculate summary statistics by group.

Here we find the mean and standard deviation of evaluation scores for each professor in the sample.

evals |>
  group_by(prof_id) |>
  summarize(
    mean_score = mean(score, na.rm = TRUE),
    sd_score = sd(score, na.rm = TRUE)
  )
# A tibble: 94 × 3
   prof_id mean_score sd_score
     <dbl>      <dbl>    <dbl>
 1       1       4.38    0.443
 2       2       3.9     0.964
 3       3       3.75    0.495
 4       4       4.3     0.321
 5       5       4.67    0.151
 6       6       4.63    0.180
 7       7       4.1     0.354
 8       8       4       0.766
 9       9       4.61    0.177
10      10       4.64    0.344
# ℹ 84 more rows
  • Your turn (4 minutes): What is the median evaluation score for each faculty rank? Which type of faculty has the lowest median evaluation score?
evals |>
  group_by(rank) |>
  summarize(
    med_score = median(score, na.rm = TRUE)
  )
# A tibble: 3 × 2
  rank         med_score
  <chr>            <dbl>
1 teaching          4.4 
2 tenure track      4.35
3 tenured           4.2 
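
To read off the group with the lowest median directly, the grouped summary can be sorted and trimmed to one row. This is a sketch on a hypothetical five-row tibble (toy), not the course data; n_scored is a name introduced here for the number of non-missing scores per group.

```r
library(dplyr)

# Hypothetical data with one missing score, for illustration only
toy <- tibble(
  rank = c("teaching", "teaching", "tenured", "tenured", "tenured"),
  score = c(4.4, 4.8, NA, 4.0, 4.2)
)

lowest_median <- toy |>
  group_by(rank) |>
  summarize(
    med_score = median(score, na.rm = TRUE),
    # number of non-missing scores that went into each median
    n_scored = sum(!is.na(score))
  ) |>
  # smallest median first, then keep only that row
  arrange(med_score) |>
  slice_head(n = 1)

lowest_median
```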

Additional Practice

Note

We will work on these only if we have enough time in class. You do not need to complete these for credit.

  1. Create a new dataset that only contains evals that do not have a missing evaluation score. Include the columns prof_id, score, rank, age, bty_avg, and bty_avg_diff (the difference in the average beauty score for female and male raters). Hint: You may need to use mutate() to make one or more of these variables.
evals |>
  # drop rows with NAs for score
  drop_na(score) |>
  # create required variable
  mutate(
    bty_avg_f = (bty_f1lower + bty_f1upper + bty_f2upper) / 3,
    bty_avg_m = (bty_m1lower + bty_m1upper + bty_m2upper) / 3,
    bty_avg_diff = bty_avg_f - bty_avg_m
  ) |>
  # keep only requested columns
  select(prof_id, score, rank, age, bty_avg, bty_avg_diff)
# A tibble: 449 × 6
   prof_id score rank           age bty_avg bty_avg_diff
     <dbl> <dbl> <chr>        <dbl>   <dbl>        <dbl>
 1       1   4.7 tenure track    36    5           2    
 2       1   4.1 tenure track    36    5           2    
 3       1   3.9 tenure track    36    5           2    
 4       1   4.8 tenure track    36    5           2    
 5       2   4.6 tenured         59    3           0.667
 6       2   4.3 tenured         59    3           0.667
 7       2   2.8 tenured         59    3           0.667
 8       3   4.1 tenured         51    3.33        1.33 
 9       3   3.4 tenured         51    3.33        1.33 
10       4   4.5 tenured         40    3.17        1    
# ℹ 439 more rows
  2. For each professor (uniquely identified by prof_id), use group_by() paired with summarize() to find the sample size, mean, and standard deviation of evaluation scores. Then include only the top 5 and bottom 5 professors in terms of mean scores in the final data frame.
# calculate requested summary statistics
prof_scores <- evals |>
  # drop rows with NAs for score
  drop_na(score) |>
  group_by(prof_id) |>
  summarize(
    mean_score = mean(score, na.rm = TRUE),
    sd_score = sd(score, na.rm = TRUE),
    sample_size = n()
  ) |>
  # sort rows by mean_score from high to low
  arrange(desc(mean_score))

# combine the top 5 and bottom 5 rows into a single data frame
bind_rows(
  slice_head(.data = prof_scores, n = 5),
  slice_tail(.data = prof_scores, n = 5)
)
# A tibble: 10 × 4
   prof_id mean_score sd_score sample_size
     <dbl>      <dbl>    <dbl>       <int>
 1      85       4.87   0.150            7
 2      73       4.82   0.0447           5
 3      71       4.81   0.179           10
 4      52       4.74   0.113            7
 5      50       4.73   0.163            6
 6      15       3.18   0.189            4
 7      60       3.13   0.208            3
 8      69       3     NA                1
 9      68       2.67   0.379            3
10      30       2.3   NA                1
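
An alternative to bind_rows() is a single filter() on row positions once the data are sorted. The sketch below keeps the top 2 and bottom 2 rows of a hypothetical eight-row summary (prof_scores_toy is invented for this example); replacing 2 with 5 would mirror the exercise.

```r
library(dplyr)

# Hypothetical per-professor summary, sorted from high to low
prof_scores_toy <- tibble(
  prof_id = 1:8,
  mean_score = c(4.9, 3.1, 4.2, 2.7, 4.6, 3.8, 4.4, 3.5)
) |>
  arrange(desc(mean_score))

# Keep rows whose position is in the first 2 or the last 2
top_and_bottom <- prof_scores_toy |>
  filter(row_number() <= 2 | row_number() > n() - 2)

top_and_bottom
```

This relies on the data already being sorted by mean_score, just as the bind_rows() version does.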
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       macOS Ventura 13.4.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2023-09-08
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 bit           4.0.5   2022-11-15 [1] CRAN (R 4.3.0)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.3.0)
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
 colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.3.0)
 digest        0.6.31  2022-12-11 [1] CRAN (R 4.3.0)
 dplyr       * 1.1.2   2023-04-20 [1] CRAN (R 4.3.0)
 evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.0)
 fansi         1.0.4   2023-01-22 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.2   2023-04-03 [1] CRAN (R 4.3.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
 gtable        0.3.3   2023-03-21 [1] CRAN (R 4.3.0)
 here          1.0.1   2020-12-13 [1] CRAN (R 4.3.0)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
 htmltools     0.5.5   2023-03-23 [1] CRAN (R 4.3.0)
 htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.3.0)
 jsonlite      1.8.5   2023-06-05 [1] CRAN (R 4.3.0)
 knitr         1.43    2023-05-25 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
 lubridate   * 1.9.2   2023-02-10 [1] CRAN (R 4.3.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 purrr       * 1.0.1   2023-01-10 [1] CRAN (R 4.3.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.3.0)
 rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
 rmarkdown     2.22    2023-06-01 [1] CRAN (R 4.3.0)
 rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.3.0)
 rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.3.0)
 scales        1.2.1   2022-08-20 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 stringi       1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.0)
 tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
 utf8          1.2.3   2023-01-31 [1] CRAN (R 4.3.0)
 vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.3.0)
 vroom         1.6.3   2023-04-28 [1] CRAN (R 4.3.0)
 withr         2.5.0   2022-03-03 [1] CRAN (R 4.3.0)
 xfun          0.39    2023-04-20 [1] CRAN (R 4.3.0)
 yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────