AE 03: Wrangling professor evaluations

Application exercise
Important

Go to the course GitHub organization and locate the repo titled ae-03-YOUR_GITHUB_USERNAME to get started.

This AE is due September 7 at 11:59pm.

To demonstrate data wrangling we will use evals. It contains anonymized information on end-of-semester student evaluations for 463 courses taught by a sample of 94 professors from the University of Texas at Austin.1

1 Source: Daniel S. Hamermesh, Amy Parker, Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity, Economics of Education Review, Volume 24, Issue 4, 2005 and OpenIntro.

library(tidyverse)
evals <- read_csv("data/course-evals.csv")

The data frame has 463 observations (rows), so we will not view the entire data frame. Instead, we’ll use the commands below to help us explore the data.

glimpse(evals)
Rows: 463
Columns: 23
$ course_id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ prof_id       <dbl> 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,…
$ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4…
$ rank          <chr> "tenure track", "tenure track", "tenure track", "tenure …
$ ethnicity     <chr> "minority", "minority", "minority", "minority", "not min…
$ gender        <chr> "female", "female", "female", "female", "male", "male", …
$ language      <chr> "english", "english", "english", "english", "english", "…
$ age           <dbl> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, …
$ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000, 87.500…
$ cls_did_eval  <dbl> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, 14,…
$ cls_students  <dbl> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 20, …
$ cls_level     <chr> "upper", "upper", "upper", "upper", "upper", "upper", "u…
$ cls_profs     <chr> "single", "single", "single", "single", "multiple", "mul…
$ cls_credits   <chr> "multi credit", "multi credit", "multi credit", "multi c…
$ bty_f1lower   <dbl> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 7, 7,…
$ bty_f1upper   <dbl> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 9, 9,…
$ bty_f2upper   <dbl> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 9, 9,…
$ bty_m1lower   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7, 7,…
$ bty_m1upper   <dbl> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 6,…
$ bty_m2upper   <dbl> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 6, 6,…
$ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, …
$ pic_outfit    <chr> "not formal", "not formal", "not formal", "not formal", …
$ pic_color     <chr> "color", "color", "color", "color", "color", "color", "c…
names(evals)
 [1] "course_id"     "prof_id"       "score"         "rank"         
 [5] "ethnicity"     "gender"        "language"      "age"          
 [9] "cls_perc_eval" "cls_did_eval"  "cls_students"  "cls_level"    
[13] "cls_profs"     "cls_credits"   "bty_f1lower"   "bty_f1upper"  
[17] "bty_f2upper"   "bty_m1lower"   "bty_m1upper"   "bty_m2upper"  
[21] "bty_avg"       "pic_outfit"    "pic_color"    
head(evals)
# A tibble: 6 × 23
  course_id prof_id score rank     ethnicity gender language   age cls_perc_eval
      <dbl>   <dbl> <dbl> <chr>    <chr>     <chr>  <chr>    <dbl>         <dbl>
1         1       1   4.7 tenure … minority  female english     36          55.8
2         2       1   4.1 tenure … minority  female english     36          68.8
3         3       1   3.9 tenure … minority  female english     36          60.8
4         4       1   4.8 tenure … minority  female english     36          62.6
5         5       2   4.6 tenured  not mino… male   english     59          85  
6         6       2   4.3 tenured  not mino… male   english     59          87.5
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
#   cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
#   bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
#   bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>

The head() function returns “A tibble: 6 x 23” and then the first six rows of the evals data.

Tibble vs. data frame

A tibble is an opinionated version of the R data frame. In other words, all tibbles are data frames, but not all data frames are tibbles!

There are two main differences between a tibble and a data frame:

  1. When you print a tibble, the first ten rows and all of the columns that fit on the screen will display, along with the type of each column.

    Let’s look at the differences in the output when we type evals (tibble) in the console versus typing cars (data frame) in the console.

  2. Tibbles are somewhat stricter than data frames when it comes to subsetting data. If you try to access a variable that doesn’t exist, a tibble returns NULL along with a warning message, while a data frame silently returns NULL.

evals$apple
Warning: Unknown or uninitialised column: `apple`.
NULL
cars$apple
NULL
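
To see the printing difference without the course data, here is a minimal sketch that converts the built-in cars data frame to a tibble. The name cars_tbl is just an illustrative choice, and the sketch assumes the tibble package is installed (it is loaded as part of the tidyverse).

```r
library(tibble)

# cars ships with R as a plain data frame; as_tibble() converts it
cars_tbl <- as_tibble(cars)

class(cars)      # "data.frame"
class(cars_tbl)  # "tbl_df" "tbl" "data.frame"

# Printing the tibble shows the first ten rows plus each column's type;
# printing the data frame shows every row and no type information
cars_tbl
cars
```

Because "data.frame" is among the classes of cars_tbl, functions that expect a data frame will generally accept the tibble as well.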

Data wrangling with dplyr

dplyr is the primary package in the tidyverse for data wrangling. See the dplyr reference page and the data transformation cheat sheet for details.

Quick summary of key dplyr functions:

Rows:

  • filter(): chooses rows based on column values.
  • slice(): chooses rows based on location.
  • arrange(): changes the order of the rows.
  • sample_n(): takes a random subset of the rows (superseded by slice_sample() in current versions of dplyr).

Columns:

  • select(): changes whether or not a column is included.
  • rename(): changes the name of columns.
  • mutate(): changes the values of columns and creates new columns.

Groups of rows:

  • summarize(): collapses a group into a single row.
  • count(): counts unique values of one or more variables.
  • group_by(): performs calculations separately for each value of a variable.

select()

  • Demo: Make a data frame that only contains the variables score and cls_students.
# add code here
  • Demo: Make a data frame that keeps every variable except cls_students.
# add code here
  • Demo: Make a data frame that includes all variables from score through age (inclusive).
# add code here
  • Demo: Use the select helper contains() to make a data frame that includes the variables associated with the class, i.e., contains the string "cls_" in the name.
# add code here
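
The four styles of column selection described above can be sketched on a small stand-in data set. The tibble toy_evals below uses made-up values (not rows from course-evals.csv) and borrows a few column names from evals. The sketch assumes dplyr is installed; library(tidyverse) would work equally well.

```r
library(dplyr)

# Made-up rows for illustration only; not values from course-evals.csv
toy_evals <- tibble(
  prof_id      = c(1, 1, 2, 2, 3, 3),
  score        = c(4.7, 4.1, 3.2, 3.6, 4.9, 3.8),
  rank         = c("tenured", "tenured", "teaching", "teaching",
                   "tenure track", "tenure track"),
  age          = c(52, 52, 38, 38, 45, 45),
  cls_students = c(43, 125, 20, 40, 55, 195),
  cls_level    = c("upper", "lower", "upper", "upper", "lower", "upper")
)

# Keep only the named columns
two_cols <- toy_evals |> select(score, cls_students)

# Keep everything except one column
no_students <- toy_evals |> select(-cls_students)

# Keep a contiguous range of columns, endpoints included
score_to_age <- toy_evals |> select(score:age)

# Keep columns whose names contain a given string
cls_cols <- toy_evals |> select(contains("cls_"))

names(score_to_age)  # "score" "rank" "age"
names(cls_cols)      # "cls_students" "cls_level"
```

Note that select() never changes the number of rows; it only changes which columns are carried forward.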

The pipe

Before working with more data wrangling functions, let’s formally introduce the pipe. The pipe, |>, is an operator (a tool) for passing information from one process to another. We will use |> mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.

When reading code “in English”, say “and then” whenever you see a pipe.
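
As a small illustration of “and then”, the sketch below computes the same quantity twice on a plain numeric vector: once with nested function calls and once with the pipe. The names x, nested, and piped are illustrative, and the native pipe |> assumes R 4.1 or later.

```r
x <- c(1, 4, 9)

# Nested calls: read from the inside out
nested <- round(sum(sqrt(x)), 1)

# Piped calls: take x, and then sqrt(), and then sum(), and then round()
piped <- x |>
  sqrt() |>
  sum() |>
  round(1)

identical(nested, piped)  # TRUE
```

Each pipe passes the result on its left as the first argument of the function on its right, so x |> sqrt() is the same call as sqrt(x).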

  • Your turn (4 minutes): Run the following chunk and observe its output. Then, come up with a different way of obtaining the same output.
# add code here

evals |>
  select(score, rank) |>
  head()
# A tibble: 6 × 2
  score rank        
  <dbl> <chr>       
1   4.7 tenure track
2   4.1 tenure track
3   3.9 tenure track
4   4.8 tenure track
5   4.6 tenured     
6   4.3 tenured     

slice()

Look at the documentation for slice()3 or the data transformation cheat sheet. What variations of slice_*() might be useful here?

3 Run ?slice in the console.

  • Your turn: Display the first five rows of the evals data frame.
# add code here
  • Your turn: Display the last two rows of the evals data frame.
# add code here
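
For reference, here is a minimal sketch of three slice variants on a made-up tibble (toy_evals is a stand-in with invented values, not the course data). It assumes dplyr 1.0.0 or later, where slice_head() and slice_tail() are available.

```r
library(dplyr)

# Made-up rows for illustration only; not values from course-evals.csv
toy_evals <- tibble(
  prof_id = c(1, 1, 2, 2, 3, 3),
  score   = c(4.7, 4.1, 3.2, 3.6, 4.9, 3.8)
)

# First three rows, in the order they are stored
first_three <- toy_evals |> slice_head(n = 3)

# Last row only
last_one <- toy_evals |> slice_tail(n = 1)

# Rows picked by position: the second and the fourth
by_position <- toy_evals |> slice(2, 4)
```

All three choose rows by where they sit in the data frame, not by their values, which is what distinguishes slice() from filter().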

arrange()

  • Your turn: Let’s arrange the data by score, so the courses with the lowest scores will be at the top of the data frame.
# add code here
  • Your turn: Now let’s arrange the data by descending score, so the evals with the highest scores will be at the top.
# add code here
  • Your turn (5 minutes): Create a data frame that only includes the evaluation score (score), faculty rank (rank), and average beauty rating of the professor (bty_avg) for the course with the highest evaluation score and the faculty member with the highest average beauty rating. What is the average beauty rating (bty_avg) for this professor?
# add code here
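
The sorting patterns used in this section can be sketched on a made-up tibble as follows. toy_evals holds invented values, and the result names are illustrative; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Made-up rows for illustration only; not values from course-evals.csv
toy_evals <- tibble(
  prof_id = c(1, 1, 2, 2, 3, 3),
  score   = c(4.7, 4.1, 3.2, 3.6, 4.9, 3.8)
)

# Ascending order is the default
lowest_first <- toy_evals |> arrange(score)

# Wrap a column in desc() to sort from largest to smallest
highest_first <- toy_evals |> arrange(desc(score))

# With several columns, later columns break ties in earlier ones
by_prof_then_score <- toy_evals |> arrange(prof_id, desc(score))

lowest_first$score        # 3.2 3.6 3.8 4.1 4.7 4.9
by_prof_then_score$score  # 4.7 4.1 3.6 3.2 4.9 3.8
```

arrange() only reorders rows, so pairing it with select() and slice_head() is one way to pull out the top few rows for a handful of columns.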
Note

Feel free to work ahead on the remaining exercises but we will pause to check in at this point.

filter()

  • Demo: Filter the data frame to keep only the rows where the faculty member is on the teaching track.
# add code here
  • Demo: We can also filter using more than one condition. Here we select all rows where the faculty member is on the teaching track and the evaluation score is greater than 3.5.
# add code here

We can do more complex tasks using logical operators:

operator       definition
<              is less than?
<=             is less than or equal to?
>              is greater than?
>=             is greater than or equal to?
==             is exactly equal to?
!=             is not equal to?
x & y          is x AND y?
x | y          is x OR y?
is.na(x)       is x NA?
!is.na(x)      is x not NA?
x %in% y       is x in y?
!(x %in% y)    is x not in y?
!x             is not x?

The final operator only makes sense if x is logical (TRUE / FALSE).
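
A few of these operators are sketched below inside filter(), using a made-up tibble with one missing score. toy_evals and the result names are illustrative and the values are invented; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Made-up rows for illustration only; not values from course-evals.csv
toy_evals <- tibble(
  score = c(4.7, 4.1, 3.2, NA, 4.9, 3.8),
  rank  = c("tenured", "tenured", "teaching", "teaching",
            "tenure track", "tenure track")
)

# Conditions separated by commas are combined with AND
teaching_high <- toy_evals |> filter(rank == "teaching", score > 3)

# | keeps rows where at least one condition is TRUE
either <- toy_evals |> filter(rank == "teaching" | score > 4.5)

# %in% checks membership in a set of values
tenure_line <- toy_evals |> filter(rank %in% c("tenure track", "tenured"))

# !is.na() keeps rows where the value is present
has_score <- toy_evals |> filter(!is.na(score))

nrow(teaching_high)  # 1: the teaching row with a missing score is dropped
nrow(either)         # 4
nrow(has_score)      # 5
```

filter() keeps a row only when the condition evaluates to TRUE, so a comparison involving NA drops the row unless another condition joined by | is TRUE.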

  • Your turn (4 minutes): Describe what the code is doing in words.
evals |>
  filter(
    rank %in% c("tenure track", "tenured"),
    score > 3.5, bty_avg > 6
  )
# A tibble: 55 × 23
   course_id prof_id score rank    ethnicity gender language   age cls_perc_eval
       <dbl>   <dbl> <dbl> <chr>   <chr>     <chr>  <chr>    <dbl>         <dbl>
 1        18       5   4.8 tenure… not mino… female english     31          87.5
 2        19       5   4.6 tenure… not mino… female english     31          90.9
 3        20       5   4.6 tenure… not mino… female english     31          79.2
 4        21       5   4.9 tenure… not mino… female english     31          88.9
 5        22       5   4.6 tenure… not mino… female english     31          88.1
 6        23       5   4.5 tenure… not mino… female english     31          56.3
 7       140      25   4.8 tenure… not mino… female english     34          76.9
 8       141      25   4.1 tenure… not mino… female english     34          82.5
 9       194      36   3.9 tenured minority  female english     44          54.5
10       196      36   4   tenured minority  female english     44         100  
# ℹ 45 more rows
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
#   cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
#   bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
#   bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>

count()

  • Demo: Create a frequency table of the ethnicity of the evaluated professors.
# add code here
  • Demo: Which faculty rank had the fewest evals? How many evals were there for that group?
# add code here
  • Your turn (5 minutes): Which type of faculty (based on rank, gender, and ethnicity) is most highly represented in this dataset? How many courses did they teach in this sample?
# add code here
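
Here is a minimal sketch of count() with one variable and with several, again on made-up data. toy_evals holds invented values and the result names are illustrative; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Made-up rows for illustration only; not values from course-evals.csv
toy_evals <- tibble(
  rank      = c("tenured", "tenured", "teaching", "teaching",
                "tenure track", "tenure track"),
  cls_level = c("upper", "lower", "upper", "upper", "lower", "upper")
)

# One row per distinct rank, with the number of rows in column n
rank_counts <- toy_evals |> count(rank)

# One row per distinct combination; sort = TRUE puts the largest n first
combo_counts <- toy_evals |> count(rank, cls_level, sort = TRUE)

combo_counts |> slice_head(n = 1)  # teaching / upper, n = 2
```

Since count() returns an ordinary tibble, its output can be piped into arrange(), filter(), or mutate() like any other data frame.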

mutate()

Use mutate() to create a new variable.

  • Demo: In the code chunk below, we calculate the difference in the average beauty ratings by gender of the rater (bty_f* vs. bty_m*).
# add code here
  • Your turn (4 minutes): Create a new variable to calculate the percentage of evals for each faculty rank. What percentage of evals were for teaching-track faculty?
# add code here
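
The sketch below shows two common uses of mutate() on made-up data: computing new columns from existing ones, and turning counts into percentages. toy_evals holds invented values, and the new column names (cls_not_eval, pct_eval, pct) are hypothetical choices for this example; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Made-up rows for illustration only; not values from course-evals.csv
toy_evals <- tibble(
  rank         = c("tenured", "tenured", "teaching", "teaching",
                   "tenure track", "tenure track"),
  cls_did_eval = c(20, 80, 15, 30, 50, 100),
  cls_students = c(40, 100, 20, 40, 50, 200)
)

# New columns are computed row by row from existing columns
with_new_cols <- toy_evals |>
  mutate(
    cls_not_eval = cls_students - cls_did_eval,
    pct_eval     = cls_did_eval / cls_students * 100
  )

# After count(), sum(n) is the total number of rows, so n / sum(n)
# is each group's share of the data
rank_share <- toy_evals |>
  count(rank) |>
  mutate(pct = n / sum(n) * 100)
```

mutate() keeps every existing column and adds the new ones at the end, so the number of rows stays the same.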

summarize()

summarize() collapses the rows into summary statistics and removes columns irrelevant to the calculation.

Be sure to name your columns!

# add code here

Question: Why did this code return NA?

Let’s fix it!

# add code here
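
One common reason a summary comes back as NA is a missing value in the column being summarized. The sketch below reproduces that situation on made-up data and shows one possible fix; toy_evals holds invented values, and whether this matches the cause in evals is something to check against the actual data. It assumes dplyr is installed.

```r
library(dplyr)

# Made-up scores with one missing value; not values from course-evals.csv
toy_evals <- tibble(score = c(4.7, 4.1, 3.2, NA, 4.9, 3.8))

# mean() returns NA as soon as any input is NA
with_na <- toy_evals |>
  summarize(mean_score = mean(score))

# na.rm = TRUE drops missing values before computing the mean;
# counting the NAs alongside makes the omission visible
fixed <- toy_evals |>
  summarize(
    mean_score = mean(score, na.rm = TRUE),
    n_missing  = sum(is.na(score))
  )

with_na$mean_score  # NA
fixed$mean_score    # 4.14
```

Reporting n_missing next to the mean is a design choice rather than a requirement; it simply records how many rows were left out of the calculation.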

group_by()

group_by() is used for grouped operations. It’s very powerful when paired with summarize() to calculate summary statistics by group.

Here we find the mean and standard deviation of evaluation scores for each professor in the sample.

# add code here
  • Your turn (4 minutes): What is the median evaluation score for each faculty rank? Which type of faculty has the lowest median evaluation score?
# add code here
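
The group_by() and summarize() pairing can be sketched on made-up data as follows. toy_evals holds invented values and the summary column names are illustrative; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Made-up rows for illustration only; not values from course-evals.csv
toy_evals <- tibble(
  prof_id = c(1, 1, 2, 2, 3, 3),
  score   = c(4.7, 4.1, 3.2, 3.6, 4.9, 3.8)
)

# group_by() marks the groups; summarize() then returns one row per group
by_prof <- toy_evals |>
  group_by(prof_id) |>
  summarize(
    n_evals    = n(),
    mean_score = mean(score),
    sd_score   = sd(score)
  )

by_prof$mean_score  # 4.40 3.40 4.35
```

On its own, group_by() does not change the values in the data; it only changes how the verbs that follow it are applied.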

Additional Practice

Note

We will work on these only if we have enough time in class. You do not need to complete these for credit.

  1. Create a new dataset that contains only the evals that do not have a missing evaluation score. Include the columns prof_id, score, rank, age, bty_avg, and bty_avg_diff (the difference in the average beauty score for female and male raters). Hint: You may need to use mutate() to make one or more of these variables.
# add code here
  2. For each professor (uniquely identified by prof_id), use a group_by() paired with summarize() to find the sample size, mean, and standard deviation of evaluation scores. Then include only the top 5 and bottom 5 professors in terms of mean scores in the final data frame.
# add code here