AE 03: Wrangling professor evaluations

Application exercise

Important

Go to the course GitHub organization and locate the repo titled ae-03-YOUR_GITHUB_USERNAME to get started.

This AE is due September 7 at 11:59pm.

To demonstrate data wrangling we will use evals. It contains anonymized information on end-of-semester student evaluations for 463 courses taught by a sample of 94 professors from the University of Texas at Austin.¹

¹ Source: Daniel S. Hamermesh, Amy Parker, Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity, Economics of Education Review, Volume 24, Issue 4, 2005 and OpenIntro.

library(tidyverse)

evals <- read_csv("data/course-evals.csv")

The data frame has 463 observations (rows), so we will not view the entire data frame. Instead, we’ll use the commands below to explore the data.

glimpse(evals)

Rows: 463
Columns: 23
$ course_id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ prof_id       <dbl> 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,…
$ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4…
$ rank          <chr> "tenure track", "tenure track", "tenure track", "tenure …
$ ethnicity     <chr> "minority", "minority", "minority", "minority", "not min…
$ gender        <chr> "female", "female", "female", "female", "male", "male", …
$ language      <chr> "english", "english", "english", "english", "english", "…
$ age           <dbl> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, …
$ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000, 87.500…
$ cls_did_eval  <dbl> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, 14,…
$ cls_students  <dbl> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 20, …
$ cls_level     <chr> "upper", "upper", "upper", "upper", "upper", "upper", "u…
$ cls_profs     <chr> "single", "single", "single", "single", "multiple", "mul…
$ cls_credits   <chr> "multi credit", "multi credit", "multi credit", "multi c…
$ bty_f1lower   <dbl> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 7, 7,…
$ bty_f1upper   <dbl> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 9, 9,…
$ bty_f2upper   <dbl> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 9, 9,…
$ bty_m1lower   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7, 7,…
$ bty_m1upper   <dbl> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 6,…
$ bty_m2upper   <dbl> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 6, 6,…
$ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, …
$ pic_outfit    <chr> "not formal", "not formal", "not formal", "not formal", …
$ pic_color     <chr> "color", "color", "color", "color", "color", "color", "c…

names(evals)

 [1] "course_id"     "prof_id"       "score"         "rank"         
 [5] "ethnicity"     "gender"        "language"      "age"          
 [9] "cls_perc_eval" "cls_did_eval"  "cls_students"  "cls_level"    
[13] "cls_profs"     "cls_credits"   "bty_f1lower"   "bty_f1upper"  
[17] "bty_f2upper"   "bty_m1lower"   "bty_m1upper"   "bty_m2upper"  
[21] "bty_avg"       "pic_outfit"    "pic_color"

head(evals)

# A tibble: 6 × 23
  course_id prof_id score rank     ethnicity gender language   age cls_perc_eval
      <dbl>   <dbl> <dbl> <chr>    <chr>     <chr>  <chr>    <dbl>         <dbl>
1         1       1   4.7 tenure … minority  female english     36          55.8
2         2       1   4.1 tenure … minority  female english     36          68.8
3         3       1   3.9 tenure … minority  female english     36          60.8
4         4       1   4.8 tenure … minority  female english     36          62.6
5         5       2   4.6 tenured  not mino… male   english     59          85  
6         6       2   4.3 tenured  not mino… male   english     59          87.5
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
#   cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
#   bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
#   bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>

The head() function returns “A tibble: 6 × 23” and then the first six rows of the evals data.

Tibble vs. data frame

A tibble is an opinionated version of the R data frame. In other words, all tibbles are data frames, but not all data frames are tibbles!

There are two main differences between a tibble and a data frame:

First, when you print a tibble, the first ten rows and all of the columns that fit on the screen will display, along with the type of each column.

Let’s look at the differences in the output when we type evals (tibble) in the console versus typing cars (data frame) in the console.

Second, tibbles are somewhat more strict than data frames when it comes to subsetting data. If you try to access a variable that doesn’t exist in a tibble, you will get a warning message along with NULL. If you try to access a variable that doesn’t exist in a data frame, you will get NULL with no warning.

evals$apple

Warning: Unknown or uninitialised column: `apple`.

NULL

cars$apple

NULL
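
The two behaviors can also be checked programmatically. The sketch below uses a small made-up tibble (toy_tbl and toy_df are hypothetical names, not the course data) and records whether accessing a missing column signals a warning.

```r
library(tibble)

# Hypothetical toy data for illustration (not the course data)
toy_tbl <- tibble(score = c(4.7, 4.1), rank = c("tenure track", "tenured"))
toy_df <- as.data.frame(toy_tbl)

# Every tibble is a data frame, but a plain data frame is not a tibble
tbl_is_df <- is.data.frame(toy_tbl)
df_is_tbl <- is_tibble(toy_df)

# Access a column that does not exist and record whether a warning is signaled
tbl_warned <- tryCatch({ toy_tbl$apple; FALSE }, warning = function(w) TRUE)
df_warned <- tryCatch({ toy_df$apple; FALSE }, warning = function(w) TRUE)

c(tbl_is_df = tbl_is_df, df_is_tbl = df_is_tbl,
  tbl_warned = tbl_warned, df_warned = df_warned)
```

Here tryCatch() is used only to turn the warning into a TRUE/FALSE value so the difference is easy to see side by side.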

Data wrangling with `dplyr`

dplyr is the primary package in the tidyverse for data wrangling. See the dplyr reference page and the data transformation cheatsheet for details.

Quick summary of key dplyr functions²:

² From dplyr vignette

Rows:

filter(): chooses rows based on column values.
slice(): chooses rows based on location.
arrange(): changes the order of the rows.
sample_n(): takes a random subset of the rows (superseded by slice_sample() in recent versions of dplyr).

Columns:

select(): changes whether or not a column is included.
rename(): changes the name of columns.
mutate(): changes the values of columns and creates new columns.

Groups of rows:

summarize(): collapses a group into a single row.
count(): counts unique values of one or more variables.
group_by(): performs calculations separately for each value of a variable.

`select()`

Demo: Make a data frame that only contains the variables score and cls_students.

# add code here

Demo: Make a data frame that keeps every variable except cls_students.

# add code here

Demo: Make a data frame that includes all variables from score through age (inclusive).

# add code here

Demo: Use the select helper contains() to make a data frame that includes the variables associated with the class, i.e., contains the string "cls_" in the name.

# add code here
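
As a reference, here is a minimal sketch of the four selection styles above, run on a made-up tibble rather than evals. The object toy and its column names are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy tibble (column names are made up for illustration)
toy <- tibble(
  id = 1:3,
  score = c(4.7, 4.1, 3.9),
  age = c(36, 59, 51),
  cls_size = c(43, 125, 20),
  cls_level = c("upper", "upper", "lower")
)

by_name <- toy |> select(score, cls_size)       # keep only the named columns
dropped <- toy |> select(-cls_size)             # keep everything except one column
in_range <- toy |> select(score:age)            # consecutive columns, inclusive
by_string <- toy |> select(contains("cls_"))    # columns whose names contain a string

names(by_string)
```

The same patterns should carry over to evals by swapping in its column names.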

The pipe

Before working with more data wrangling functions, let’s formally introduce the pipe. The pipe, |>, is an operator (a tool) for passing information from one process to another. We will use |> mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.

When reading code “in English”, say “and then” whenever you see a pipe.
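
A small sketch of the idea using only base R (the native pipe requires R 4.1 or later). The numbers are arbitrary and the object names nested and piped are made up for illustration.

```r
# Nested call: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Piped call: take the numbers, and then take square roots, and then add them up
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum()

identical(nested, piped)
```

Both versions compute the same value; the piped version simply lists the steps in the order they happen.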

Your turn (4 minutes): Run the following chunk and observe its output. Then, come up with a different way of obtaining the same output.

# add code here

evals |>
  select(score, rank) |>
  head()

# A tibble: 6 × 2
  score rank        
  <dbl> <chr>       
1   4.7 tenure track
2   4.1 tenure track
3   3.9 tenure track
4   4.8 tenure track
5   4.6 tenured     
6   4.3 tenured

`slice()`

Look at the documentation for slice()³ or the data transformation cheat sheet. What variations of slice_*() might be useful here?

³ Run ?slice in the console.

Your turn: Display the first five rows of the evals data frame.

# add code here

Your turn: Display the last two rows of the evals data frame.

# add code here
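
For comparison after you have tried the exercises, here is a sketch of a few slice variants on a made-up tibble. The object toy and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy tibble for illustration
toy <- tibble(id = 1:6, score = c(4.7, 4.1, 3.9, 4.8, 4.6, 4.3))

rows_2_to_3 <- toy |> slice(2:3)              # rows by position
first_two <- toy |> slice_head(n = 2)         # first rows
last_one <- toy |> slice_tail(n = 1)          # last rows
top_score <- toy |> slice_max(score, n = 1)   # row with the largest score

top_score
```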

`arrange()`

Your turn: Let’s arrange the data by score, so the courses with the lowest scores will be at the top of the data frame.

# add code here

Your turn: Now let’s arrange the data by descending score, so the evals with the highest scores will be at the top.

# add code here
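
A minimal sketch of ascending versus descending order on a made-up tibble. The object toy and the columns course and enrolled are hypothetical.

```r
library(dplyr)

# Hypothetical toy tibble for illustration
toy <- tibble(
  course = c("A", "B", "C", "D"),
  enrolled = c(43, 125, 20, 55)
)

ascending <- toy |> arrange(enrolled)          # smallest values first (the default)
descending <- toy |> arrange(desc(enrolled))   # largest values first

descending
```

arrange() also accepts several columns, in which case later columns are used to break ties in earlier ones.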

Your turn (5 minutes): Create a data frame that only includes the evaluation score (score), faculty rank (rank), and average beauty rating of the professor (bty_avg) for the course with the highest evaluation score and the faculty member with the highest average beauty rating. What is the average beauty rating (bty_avg) for this professor?

# add code here

Note

Feel free to work ahead on the remaining exercises but we will pause to check in at this point.

`filter()`

Demo: Filter the data frame to keep only the rows where the faculty member is teaching-track.

# add code here

Demo: We can also filter using more than one condition. Here we keep all rows where the faculty member is teaching-track and the evaluation score is greater than 3.5.

# add code here

We can do more complex tasks using logical operators:

operator	definition
`<`	is less than?
`<=`	is less than or equal to?
`>`	is greater than?
`>=`	is greater than or equal to?
`==`	is exactly equal to?
`!=`	is not equal to?
`x & y`	is x AND y?
`x | y`	is x OR y?
`is.na(x)`	is x NA?
`!is.na(x)`	is x not NA?
`x %in% y`	is x in y?
`!(x %in% y)`	is x not in y?
`!x`	is not x?

The final operator only makes sense if x is logical (TRUE / FALSE).
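
To see a few of these operators in action, here is a sketch on a made-up tibble with one missing score. The object toy and its values (including the rank labels) are hypothetical, not taken from evals.

```r
library(dplyr)

# Hypothetical toy tibble for illustration
toy <- tibble(
  id = 1:5,
  rank = c("tenured", "tenure track", "teaching", "tenured", "teaching"),
  score = c(4.7, 3.2, NA, 3.9, 4.5)
)

# Conditions separated by commas are combined with AND
and_rows <- toy |> filter(rank == "tenured", score > 4)

# | keeps rows where at least one condition is TRUE
or_rows <- toy |> filter(rank == "teaching" | score < 3.5)

# %in% checks membership in a set of values
in_rows <- toy |> filter(rank %in% c("tenured", "tenure track"))

# !is.na() keeps rows where the value is not missing
known_rows <- toy |> filter(!is.na(score))

or_rows
```

Note that filter() drops rows where the condition evaluates to NA, so a missing score never satisfies a comparison such as score > 4.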

Your turn (4 minutes): Describe what the code is doing in words.

evals |>
  filter(
    rank %in% c("tenure track", "tenured"),
    score > 3.5, bty_avg > 6
  )

# A tibble: 55 × 23
   course_id prof_id score rank    ethnicity gender language   age cls_perc_eval
       <dbl>   <dbl> <dbl> <chr>   <chr>     <chr>  <chr>    <dbl>         <dbl>
 1        18       5   4.8 tenure… not mino… female english     31          87.5
 2        19       5   4.6 tenure… not mino… female english     31          90.9
 3        20       5   4.6 tenure… not mino… female english     31          79.2
 4        21       5   4.9 tenure… not mino… female english     31          88.9
 5        22       5   4.6 tenure… not mino… female english     31          88.1
 6        23       5   4.5 tenure… not mino… female english     31          56.3
 7       140      25   4.8 tenure… not mino… female english     34          76.9
 8       141      25   4.1 tenure… not mino… female english     34          82.5
 9       194      36   3.9 tenured minority  female english     44          54.5
10       196      36   4   tenured minority  female english     44         100  
# ℹ 45 more rows
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
#   cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
#   bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
#   bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>

`count()`

Demo: Create a frequency table of the ethnicity of the evaluated professors.

# add code here

Demo: Which faculty rank had the fewest evals? How many evals were there for that group?

# add code here
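
A sketch of a one-variable frequency table on made-up data. The object toy and its rank labels are hypothetical. The sort argument orders the table from most to least frequent.

```r
library(dplyr)

# Hypothetical toy tibble for illustration
toy <- tibble(
  rank = c("tenured", "teaching", "tenured", "tenure track", "tenured", "teaching")
)

rank_counts <- toy |> count(rank)                # one row per value, counts in column n
rank_sorted <- toy |> count(rank, sort = TRUE)   # largest count first

rank_sorted
```

count() accepts more than one variable, in which case it counts each combination of values.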

Your turn (5 minutes): Which type of faculty (based on rank, gender, and ethnicity) is most highly represented in this dataset? How many courses did they teach in this sample?

# add code here

`mutate()`

Use mutate() to create a new variable.
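
As a quick illustration, the sketch below adds a percentage column computed from two existing columns. The object toy and its columns did_eval, students, and pct_eval are hypothetical names on made-up data.

```r
library(dplyr)

# Hypothetical toy tibble for illustration
toy <- tibble(
  did_eval = c(24, 30, 10),
  students = c(48, 40, 20)
)

# mutate() keeps the existing columns and appends the new one
toy_pct <- toy |>
  mutate(pct_eval = did_eval / students * 100)

toy_pct
```

The number of rows is unchanged; only a column is added.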

Demo: In the code chunk below, we calculate the difference in the average beauty ratings by gender of the rater (bty_f* vs bty_m*).

# add code here

Your turn (4 minutes): Create a new variable to calculate the percentage of evals for each faculty rank. What percentage of evals were for teaching-track faculty?

# add code here

`summarize()`

summarize() collapses the rows into summary statistics and removes columns irrelevant to the calculation.

Be sure to name your columns!

# add code here

Question: Why did this code return NA?

Let’s fix it!

# add code here
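
The same behavior can be reproduced on a tiny made-up example. The object toy is hypothetical; it contains one missing score so that the effect of the na.rm argument is visible.

```r
library(dplyr)

# Hypothetical toy tibble with one missing value
toy <- tibble(score = c(4, 3, NA, 5))

# mean() returns NA when any input is missing
with_na <- toy |>
  summarize(mean_score = mean(score))

# na.rm = TRUE drops missing values before averaging
without_na <- toy |>
  summarize(mean_score = mean(score, na.rm = TRUE))

without_na
```

Naming the result (mean_score here) gives the summary column a readable name instead of one derived from the expression.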

`group_by()`

group_by() is used for grouped operations. It’s very powerful when paired with summarize() to calculate summary statistics by group.

Here we find the mean and standard deviation of evaluation scores for each professor in the sample.

# add code here
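
A minimal sketch of the group_by() plus summarize() pattern on made-up data. The object toy and the columns prof, mean_score, and sd_score are hypothetical names.

```r
library(dplyr)

# Hypothetical toy tibble: two professors with a few scores each
toy <- tibble(
  prof = c("a", "a", "b", "b", "b"),
  score = c(4, 5, 3, 3, 3)
)

# One output row per group, with one column per summary statistic
by_prof <- toy |>
  group_by(prof) |>
  summarize(
    mean_score = mean(score),
    sd_score = sd(score)
  )

by_prof
```

Without the group_by() step, summarize() would return a single row for the whole data frame.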

Your turn (4 minutes): What is the median evaluation score for each faculty rank? Which type of faculty has the lowest median evaluation score?

# add code here

Additional Practice

Note

Only if we have enough time in class. You do not need to complete these for credit.

Create a new dataset that only contains evals that do not have a missing evaluation score. Include the columns prof_id, score, rank, age, bty_avg, and bty_avg_diff (the difference in the average beauty score for female and male raters). Hint: You may need to use mutate() to create one or more of these variables.

# add code here

For each professor (uniquely identified by prof_id), use a group_by() paired with summarize() to find the sample size, mean, and standard deviation of evaluation scores. Then include only the top 5 and bottom 5 professors in terms of mean scores in the final data frame.

# add code here
