library(tidyverse)
AE 03: Wrangling professor evaluations
Go to the course GitHub organization and locate the repo titled ae-03-YOUR_GITHUB_USERNAME
to get started.
This AE is due September 7 at 11:59pm.
To demonstrate data wrangling, we will use evals. It contains anonymized information on end-of-semester student evaluations for 463 courses taught by a sample of 94 professors from the University of Texas at Austin.
Source: Daniel S. Hamermesh, Amy Parker, Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity, Economics of Education Review, Volume 24, Issue 4, 2005 and OpenIntro.
evals <- read_csv("data/course-evals.csv")
The data frame has 463 observations (rows), so we will not view the entire data frame. Instead, we’ll use the commands below to help us explore the data.
glimpse(evals)
Rows: 463
Columns: 23
$ course_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ prof_id <dbl> 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,…
$ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4…
$ rank <chr> "tenure track", "tenure track", "tenure track", "tenure …
$ ethnicity <chr> "minority", "minority", "minority", "minority", "not min…
$ gender <chr> "female", "female", "female", "female", "male", "male", …
$ language <chr> "english", "english", "english", "english", "english", "…
$ age <dbl> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, …
$ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000, 87.500…
$ cls_did_eval <dbl> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, 14,…
$ cls_students <dbl> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 20, …
$ cls_level <chr> "upper", "upper", "upper", "upper", "upper", "upper", "u…
$ cls_profs <chr> "single", "single", "single", "single", "multiple", "mul…
$ cls_credits <chr> "multi credit", "multi credit", "multi credit", "multi c…
$ bty_f1lower <dbl> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 7, 7,…
$ bty_f1upper <dbl> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 9, 9,…
$ bty_f2upper <dbl> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 9, 9,…
$ bty_m1lower <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7, 7,…
$ bty_m1upper <dbl> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 6,…
$ bty_m2upper <dbl> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 6, 6,…
$ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, …
$ pic_outfit <chr> "not formal", "not formal", "not formal", "not formal", …
$ pic_color <chr> "color", "color", "color", "color", "color", "color", "c…
names(evals)
[1] "course_id" "prof_id" "score" "rank"
[5] "ethnicity" "gender" "language" "age"
[9] "cls_perc_eval" "cls_did_eval" "cls_students" "cls_level"
[13] "cls_profs" "cls_credits" "bty_f1lower" "bty_f1upper"
[17] "bty_f2upper" "bty_m1lower" "bty_m1upper" "bty_m2upper"
[21] "bty_avg" "pic_outfit" "pic_color"
head(evals)
# A tibble: 6 × 23
course_id prof_id score rank ethnicity gender language age cls_perc_eval
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1 1 4.7 tenure … minority female english 36 55.8
2 2 1 4.1 tenure … minority female english 36 68.8
3 3 1 3.9 tenure … minority female english 36 60.8
4 4 1 4.8 tenure … minority female english 36 62.6
5 5 2 4.6 tenured not mino… male english 59 85
6 6 2 4.3 tenured not mino… male english 59 87.5
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
# cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
# bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
# bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
The head() function returns “A tibble: 6 × 23” and then the first six rows of the evals data.
Tibble vs. data frame
A tibble is an opinionated version of the R data frame. In other words, all tibbles are data frames, but not all data frames are tibbles!
There are two main differences between a tibble and a data frame:
- First, when you print a tibble, the first ten rows and all of the columns that fit on the screen will display, along with the type of each column. Let’s look at the differences in the output when we type evals (tibble) in the console versus typing cars (data frame) in the console.
- Second, tibbles are somewhat more strict than data frames when it comes to subsetting data. You will get a warning message (along with NULL) if you try to access a variable that doesn’t exist in a tibble. You will get NULL, with no warning, if you try to access a variable that doesn’t exist in a data frame.
evals$apple
Warning: Unknown or uninitialised column: `apple`.
NULL
cars$apple
NULL
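The same contrast can be reproduced without the course data. The sketch below uses two small made-up objects (toy_tbl and toy_df are hypothetical names, not part of this AE) and captures the tibble warning with tryCatch() so it can be inspected.

```r
library(tibble)

# Hypothetical toy objects (not the course data)
toy_tbl <- tibble(fruit = c("pear", "plum"))
toy_df <- data.frame(fruit = c("pear", "plum"))

# Data frame: asking for a column that does not exist silently returns NULL
df_result <- toy_df$apple

# Tibble: the same request also yields NULL, but a warning is raised first.
# tryCatch() intercepts that warning and returns its message as text.
tbl_warning <- tryCatch(
  toy_tbl$apple,
  warning = function(w) conditionMessage(w)
)

is.null(df_result)
tbl_warning
```

The warning is the useful part: a misspelled column name in a tibble is flagged right away instead of quietly turning into NULL.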
Data wrangling with dplyr
dplyr is the primary package in the tidyverse for data wrangling. See the dplyr reference page and the data transformation cheatsheet for details.
Quick summary of key dplyr functions (from the dplyr vignette):
Rows:
- filter(): chooses rows based on column values.
- slice(): chooses rows based on location.
- arrange(): changes the order of the rows.
- sample_n(): takes a random subset of the rows.
Columns:
- select(): changes whether or not a column is included.
- rename(): changes the name of columns.
- mutate(): changes the values of columns and creates new columns.
Groups of rows:
- summarize(): collapses a group into a single row.
- count(): counts unique values of one or more variables.
- group_by(): performs calculations separately for each value of a variable.
select()
- Demo: Make a data frame that only contains the variables score and cls_students.
# add code here
- Demo: Make a data frame that keeps every variable except cls_students.
# add code here
- Demo: Make a data frame that includes all variables from score through age (inclusive).
# add code here
- Demo: Use the select helper contains() to make a data frame that includes the variables associated with the class, i.e., those that contain the string "cls_" in the name.
# add code here
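As a reference for the select() patterns above, here is a minimal sketch on a made-up tibble. The object toy and its columns are hypothetical stand-ins, not the evals data, so the demos still need to be adapted.

```r
library(dplyr)

# Hypothetical toy data (not the course data)
toy <- tibble(
  id = 1:3,
  score = c(4.2, 3.8, 4.9),
  age = c(40, 52, 35),
  cls_size = c(30, 120, 18),
  cls_level = c("upper", "lower", "upper")
)

only_two <- select(toy, id, score)         # keep only the named columns
drop_one <- select(toy, -age)              # keep everything except age
a_range <- select(toy, score:cls_size)     # consecutive columns, inclusive
cls_cols <- select(toy, contains("cls_"))  # names containing "cls_"

names(a_range)
names(cls_cols)
```

The colon form depends on column order in the data frame, so it is worth checking names() first.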
The pipe
Before working with more data wrangling functions, let’s formally introduce the pipe. The pipe, |>, is an operator (a tool) for passing information from one process to another. We will use |> mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.
When reading code “in English”, say “and then” whenever you see a pipe.
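To see what "passing the output as the first input" means, here is a small sketch using only base R on a made-up numeric vector. It assumes R 4.1 or later, where the native pipe |> is available.

```r
# Nested calls: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Piped calls: read top to bottom as
# "take the vector, and then take square roots, and then sum"
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum()

identical(nested, piped)
```

Both forms compute the same value; the piped version just lists the steps in the order they happen.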
- Your turn (4 minutes): Run the following chunk and observe its output. Then, come up with a different way of obtaining the same output.
# add code here
evals |>
  select(score, rank) |>
  head()
# A tibble: 6 × 2
score rank
<dbl> <chr>
1 4.7 tenure track
2 4.1 tenure track
3 3.9 tenure track
4 4.8 tenure track
5 4.6 tenured
6 4.3 tenured
slice()
Look at the documentation for slice() (run ?slice in the console) or the data transformation cheat sheet. What variations of slice_*() might be useful here?
- Your turn: Display the first five rows of the evals data frame.
# add code here
- Your turn: Display the last two rows of the evals data frame.
# add code here
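For orientation, here is a sketch of a few slice variants on a made-up tibble. The object toy is hypothetical; the functions shown are a selection from the dplyr slice family, not necessarily the ones intended for the exercises above.

```r
library(dplyr)

# Hypothetical toy data (not the course data)
toy <- tibble(
  id = 1:6,
  score = c(4.2, 3.8, 4.9, 2.7, 4.4, 3.1)
)

first_two <- toy |> slice_head(n = 2)        # rows 1 and 2
last_two <- toy |> slice_tail(n = 2)         # rows 5 and 6
by_position <- toy |> slice(3:4)             # rows chosen by position
top_score <- toy |> slice_max(score, n = 1)  # row with the largest score

first_two
top_score
```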
arrange()
- Your turn: Let’s arrange the data by score, so the courses with the lowest scores will be at the top of the data frame.
# add code here
- Your turn: Now let’s arrange the data by descending score, so the evals with the highest scores will be at the top.
# add code here
- Your turn (5 minutes): Create a data frame that only includes the evaluation score (score), faculty rank (rank), and average beauty rating of the professor (bty_avg) for the course with the highest evaluation score and the faculty member with the highest average beauty rating. What is the average beauty rating (bty_avg) for this professor?
# add code here
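The ordering behavior of arrange() can be sketched on a made-up tibble. The object toy and its columns are hypothetical; the last call shows one possible way to break ties with a second variable.

```r
library(dplyr)

# Hypothetical toy data (not the course data)
toy <- tibble(
  prof = c("A", "B", "C", "D"),
  score = c(4.2, 3.8, 4.9, 3.8),
  bty = c(5, 7, 3, 2)
)

ascending <- toy |> arrange(score)         # lowest score first
descending <- toy |> arrange(desc(score))  # highest score first

# Ties in score are ordered by bty, largest first
tie_break <- toy |> arrange(desc(score), desc(bty))

tie_break
```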
Feel free to work ahead on the remaining exercises, but we will pause to check in at this point.
filter()
- Demo: Filter the data frame by selecting the rows where the faculty member is on the teaching track.
# add code here
- Demo: We can also filter using more than one condition. Here we select all rows where the faculty member is on the teaching track and the evaluation score is greater than 3.5.
# add code here
We can do more complex tasks using logical operators:
operator | definition |
---|---|
< | is less than? |
<= | is less than or equal to? |
> | is greater than? |
>= | is greater than or equal to? |
== | is exactly equal to? |
!= | is not equal to? |
x & y | is x AND y? |
x \| y | is x OR y? |
is.na(x) | is x NA? |
!is.na(x) | is x not NA? |
x %in% y | is x in y? |
!(x %in% y) | is x not in y? |
!x | is not x? |
The final operator only makes sense if x is logical (TRUE / FALSE).
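A few of these operators are sketched below inside filter(), using a made-up tibble (toy is hypothetical and includes one missing rank on purpose). Note that filter() keeps only rows where the condition is TRUE, so rows where it evaluates to NA are dropped.

```r
library(dplyr)

# Hypothetical toy data (not the course data); one rank is missing
toy <- tibble(
  rank = c("teaching", "tenured", "tenure track", "tenured", NA),
  score = c(4.5, 3.2, 4.1, 4.8, 3.9)
)

# Conditions separated by commas are combined with AND
and_rows <- toy |> filter(rank == "tenured", score > 4)

# | keeps rows where at least one condition is TRUE
or_rows <- toy |> filter(rank == "teaching" | score > 4.7)

# %in% checks membership in a set of values
in_rows <- toy |> filter(rank %in% c("tenure track", "tenured"))

# !is.na() keeps rows where rank is not missing
not_na <- toy |> filter(!is.na(rank))

nrow(and_rows)
nrow(in_rows)
```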
- Your turn (4 minutes): Describe what the code is doing in words.
evals |>
  filter(
    rank %in% c("tenure track", "tenured"),
    score > 3.5, bty_avg > 6
  )
# A tibble: 55 × 23
course_id prof_id score rank ethnicity gender language age cls_perc_eval
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 18 5 4.8 tenure… not mino… female english 31 87.5
2 19 5 4.6 tenure… not mino… female english 31 90.9
3 20 5 4.6 tenure… not mino… female english 31 79.2
4 21 5 4.9 tenure… not mino… female english 31 88.9
5 22 5 4.6 tenure… not mino… female english 31 88.1
6 23 5 4.5 tenure… not mino… female english 31 56.3
7 140 25 4.8 tenure… not mino… female english 34 76.9
8 141 25 4.1 tenure… not mino… female english 34 82.5
9 194 36 3.9 tenured minority female english 44 54.5
10 196 36 4 tenured minority female english 44 100
# ℹ 45 more rows
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
# cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
# bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
# bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
count()
- Demo: Create a frequency table of the ethnicity of the evaluated professors.
# add code here
- Demo: Which faculty rank had the fewest number of evals? How many evals were there for that group?
# add code here
- Your turn (5 minutes): Which type of faculty (based on rank, gender, and ethnicity) is most highly represented in this dataset? How many courses did they teach in this sample?
# add code here
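As a sketch of how count() behaves with one and with several variables, here is an example on a made-up tibble (toy is hypothetical). The sort = TRUE argument puts the largest groups first.

```r
library(dplyr)

# Hypothetical toy data (not the course data)
toy <- tibble(
  rank = c("teaching", "tenured", "tenured", "tenure track", "tenured", "teaching"),
  gender = c("female", "male", "male", "female", "female", "male")
)

# One row per rank, with the number of rows in column n
rank_counts <- toy |> count(rank, sort = TRUE)

# One row per observed rank-gender combination
combo_counts <- toy |> count(rank, gender, sort = TRUE)

rank_counts
combo_counts
```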
mutate()
Use mutate() to create a new variable.
- Demo: In the code chunk below, we calculate the difference in the average beauty ratings by gender of the rater (bty_f* vs bty_m*).
# add code here
- Your turn (4 minutes): Create a new variable to calculate the percentage of evals for each faculty rank. What percentage of evals were for teaching-track faculty?
# add code here
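Two common uses of mutate() are sketched below on a made-up tibble: a row-wise difference of two columns, and a percentage computed from counts. The names toy, bty_f, bty_m, bty_diff, and pct are hypothetical and chosen only for this illustration.

```r
library(dplyr)

# Hypothetical toy data (not the course data)
toy <- tibble(
  rank = c("teaching", "tenured", "tenured", "tenure track"),
  bty_f = c(6, 4, 5, 7),
  bty_m = c(4, 4, 6, 3)
)

# New column computed from existing columns, one value per row
with_diff <- toy |> mutate(bty_diff = bty_f - bty_m)

# Count rows per rank, then turn the counts into percentages
rank_pct <- toy |>
  count(rank) |>
  mutate(pct = n / sum(n) * 100)

with_diff
rank_pct
```

Because sum(n) is evaluated over the whole counted table, the pct column adds up to 100.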
summarize()
summarize() collapses the rows into summary statistics and removes columns irrelevant to the calculation.
Be sure to name your columns!
# add code here
Question: Why did this code return NA?
Let’s fix it!
# add code here
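One common reason a summary returns NA is a missing value in the column being summarized. The sketch below assumes that is the situation here and illustrates it on a made-up tibble (toy and mean_score are hypothetical names). Whether it matches the chunk above depends on the code written there.

```r
library(dplyr)

# Hypothetical toy data with one missing score (not the course data)
toy <- tibble(score = c(4.5, NA, 3.5))

# mean() returns NA when any input value is NA
with_na <- toy |> summarize(mean_score = mean(score))

# na.rm = TRUE drops missing values before computing the mean
fixed <- toy |> summarize(mean_score = mean(score, na.rm = TRUE))

with_na
fixed
```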
group_by()
group_by() is used for grouped operations. It’s very powerful when paired with summarize() to calculate summary statistics by group.
Here we find the mean and standard deviation of evaluation scores for each professor in the sample.
# add code here
- Your turn (4 minutes): What is the median evaluation score for each faculty rank? Which type of faculty has the lowest median evaluation score?
# add code here
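The group_by() plus summarize() pattern can be sketched on a made-up tibble with two professors (toy and the summary column names are hypothetical). Each group collapses to a single row of summary statistics.

```r
library(dplyr)

# Hypothetical toy data (not the course data)
toy <- tibble(
  prof_id = c(1, 1, 2, 2, 2),
  score = c(4, 5, 3, 3.5, 4)
)

by_prof <- toy |>
  group_by(prof_id) |>
  summarize(
    n = n(),                   # number of rows in the group
    mean_score = mean(score),  # group mean
    sd_score = sd(score)       # group standard deviation
  )

by_prof
```

Swapping mean() for median(), or prof_id for another grouping variable, follows the same shape.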
Additional Practice
Only if we have enough time in class. You do not need to complete these for credit.
- Create a new dataset that only contains evals that do not have a missing evaluation score. Include the columns prof_id, score, rank, age, bty_avg, and bty_avg_diff (the difference in the average beauty score for female and male raters). Hint: Note you may need to use mutate() to make one or more of these variables.
# add code here
- For each professor (uniquely identified by prof_id), use group_by() paired with summarize() to find the sample size, mean, and standard deviation of evaluation scores. Then include only the top 5 and bottom 5 professors in terms of mean scores in the final data frame.
# add code here