library(tidyverse)
AE 03: Wrangling professor evaluations
Suggested answers
These are suggested answers. This document should be used as a reference only; it’s not designed to be an exhaustive key.
To demonstrate data wrangling, we will use evals. It contains anonymized information on end-of-semester student evaluations for 463 courses taught by a sample of 94 professors from the University of Texas at Austin.1
1 Source: Daniel S. Hamermesh, Amy Parker, Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity, Economics of Education Review, Volume 24, Issue 4, 2005 and OpenIntro.
evals <- read_csv("data/course-evals.csv")
The data frame has 463 observations (rows), so we will not view the entire data frame. Instead, we’ll use the commands below to explore the data.
glimpse(evals)
Rows: 463
Columns: 23
$ course_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ prof_id <dbl> 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,…
$ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4…
$ rank <chr> "tenure track", "tenure track", "tenure track", "tenure …
$ ethnicity <chr> "minority", "minority", "minority", "minority", "not min…
$ gender <chr> "female", "female", "female", "female", "male", "male", …
$ language <chr> "english", "english", "english", "english", "english", "…
$ age <dbl> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, …
$ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000, 87.500…
$ cls_did_eval <dbl> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, 14,…
$ cls_students <dbl> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 20, …
$ cls_level <chr> "upper", "upper", "upper", "upper", "upper", "upper", "u…
$ cls_profs <chr> "single", "single", "single", "single", "multiple", "mul…
$ cls_credits <chr> "multi credit", "multi credit", "multi credit", "multi c…
$ bty_f1lower <dbl> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 7, 7,…
$ bty_f1upper <dbl> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 9, 9,…
$ bty_f2upper <dbl> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 9, 9,…
$ bty_m1lower <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7, 7,…
$ bty_m1upper <dbl> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 6,…
$ bty_m2upper <dbl> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 6, 6,…
$ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, …
$ pic_outfit <chr> "not formal", "not formal", "not formal", "not formal", …
$ pic_color <chr> "color", "color", "color", "color", "color", "color", "c…
names(evals)
[1] "course_id" "prof_id" "score" "rank"
[5] "ethnicity" "gender" "language" "age"
[9] "cls_perc_eval" "cls_did_eval" "cls_students" "cls_level"
[13] "cls_profs" "cls_credits" "bty_f1lower" "bty_f1upper"
[17] "bty_f2upper" "bty_m1lower" "bty_m1upper" "bty_m2upper"
[21] "bty_avg" "pic_outfit" "pic_color"
head(evals)
# A tibble: 6 × 23
course_id prof_id score rank ethnicity gender language age cls_perc_eval
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1 1 4.7 tenure … minority female english 36 55.8
2 2 1 4.1 tenure … minority female english 36 68.8
3 3 1 3.9 tenure … minority female english 36 60.8
4 4 1 4.8 tenure … minority female english 36 62.6
5 5 2 4.6 tenured not mino… male english 59 85
6 6 2 4.3 tenured not mino… male english 59 87.5
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
# cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
# bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
# bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
The head() function returns “A tibble: 6 x 23” and then the first six rows of the evals data.
Tibble vs. data frame
A tibble is an opinionated version of the R data frame. In other words, all tibbles are data frames, but not all data frames are tibbles!
There are two main differences between a tibble and a data frame:
- First, when you print a tibble, the first ten rows and all of the columns that fit on the screen will display, along with the type of each column. Let’s look at the differences in the output when we type evals (tibble) in the console versus typing cars (data frame) in the console.
- Second, tibbles are somewhat stricter than data frames when it comes to subsetting data. If you try to access a variable that doesn’t exist, a tibble returns NULL along with a warning message, whereas a data frame returns NULL silently.
evals$apple
Warning: Unknown or uninitialised column: `apple`.
NULL
cars$apple
NULL
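The difference in subsetting behavior can be reproduced without the evals data. The sketch below uses two hypothetical toy objects, df and tb (not part of the original exercise), and records whether accessing a missing column signals a warning.

```r
library(tibble)

df <- data.frame(x = 1:3)  # base R data frame (hypothetical toy data)
tb <- tibble(x = 1:3)      # tibble with the same contents

# A data frame returns NULL silently for a column that does not exist.
df_result <- df$apple

# A tibble also returns NULL, but it signals a warning first.
tb_warned <- FALSE
tb_result <- withCallingHandlers(
  tb$apple,
  warning = function(w) {
    tb_warned <<- TRUE              # record that a warning was raised
    invokeRestart("muffleWarning")  # suppress the console message
  }
)

is.null(df_result)
is.null(tb_result)
tb_warned
```

The warning is the tibble’s way of flagging a likely typo in a column name, which the base data frame would let pass unnoticed.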
Data wrangling with dplyr
dplyr is the primary package in the tidyverse for data wrangling. The dplyr reference page and the data transformation cheatsheet are useful companions.
Quick summary of key dplyr functions2:
2 From dplyr vignette
Rows:
- filter(): chooses rows based on column values.
- slice(): chooses rows based on location.
- arrange(): changes the order of the rows.
- sample_n(): takes a random subset of the rows.
Columns:
- select(): changes whether or not a column is included.
- rename(): changes the name of columns.
- mutate(): changes the values of columns and creates new columns.
Groups of rows:
- summarize(): collapses a group into a single row.
- count(): counts unique values of one or more variables.
- group_by(): performs calculations separately for each value of a variable.
select()
- Demo: Make a data frame that only contains the variables score and cls_students.
evals |>
  select(score, cls_students)
# A tibble: 463 × 2
score cls_students
<dbl> <dbl>
1 4.7 43
2 4.1 125
3 3.9 125
4 4.8 123
5 4.6 20
6 4.3 40
7 2.8 44
8 4.1 55
9 3.4 195
10 4.5 46
# ℹ 453 more rows
- Demo: Make a data frame that keeps every variable except cls_students.
evals |>
  select(-cls_students)
# A tibble: 463 × 22
course_id prof_id score rank ethnicity gender language age cls_perc_eval
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1 1 4.7 tenure… minority female english 36 55.8
2 2 1 4.1 tenure… minority female english 36 68.8
3 3 1 3.9 tenure… minority female english 36 60.8
4 4 1 4.8 tenure… minority female english 36 62.6
5 5 2 4.6 tenured not mino… male english 59 85
6 6 2 4.3 tenured not mino… male english 59 87.5
7 7 2 2.8 tenured not mino… male english 59 88.6
8 8 3 4.1 tenured not mino… male english 51 100
9 9 3 3.4 tenured not mino… male english 51 56.9
10 10 4 4.5 tenured not mino… female english 40 87.0
# ℹ 453 more rows
# ℹ 13 more variables: cls_did_eval <dbl>, cls_level <chr>, cls_profs <chr>,
# cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>, bty_f2upper <dbl>,
# bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>, bty_avg <dbl>,
# pic_outfit <chr>, pic_color <chr>
- Demo: Make a data frame that includes all variables from score through age (inclusive).
evals |>
  select(score:age)
# A tibble: 463 × 6
score rank ethnicity gender language age
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 4.7 tenure track minority female english 36
2 4.1 tenure track minority female english 36
3 3.9 tenure track minority female english 36
4 4.8 tenure track minority female english 36
5 4.6 tenured not minority male english 59
6 4.3 tenured not minority male english 59
7 2.8 tenured not minority male english 59
8 4.1 tenured not minority male english 51
9 3.4 tenured not minority male english 51
10 4.5 tenured not minority female english 40
# ℹ 453 more rows
- Demo: Use the select helper contains() to make a data frame that includes the variables associated with the class, i.e., contains the string "cls_" in the name.
evals |>
  select(contains("cls_"))
# A tibble: 463 × 6
cls_perc_eval cls_did_eval cls_students cls_level cls_profs cls_credits
<dbl> <dbl> <dbl> <chr> <chr> <chr>
1 55.8 24 43 upper single multi credit
2 68.8 86 125 upper single multi credit
3 60.8 76 125 upper single multi credit
4 62.6 77 123 upper single multi credit
5 85 17 20 upper multiple multi credit
6 87.5 35 40 upper multiple multi credit
7 88.6 39 44 upper multiple multi credit
8 100 55 55 upper single multi credit
9 56.9 111 195 upper single multi credit
10 87.0 40 46 upper single multi credit
# ℹ 453 more rows
The pipe
Before working with more data wrangling functions, let’s formally introduce the pipe. The pipe, |>, is an operator (a tool) for passing information from one process to another. We will use |> mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.
When reading code “in English”, say “and then” whenever you see a pipe.
- Your turn (4 minutes): Run the following chunk and observe its output. Then, come up with a different way of obtaining the same output.
evals |>
  select(score, rank) |>
  head()
# A tibble: 6 × 2
score rank
<dbl> <chr>
1 4.7 tenure track
2 4.1 tenure track
3 3.9 tenure track
4 4.8 tenure track
5 4.6 tenured
6 4.3 tenured
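One possible answer to the “different way” question is to write the same steps without the pipe, either as nested function calls or with an intermediate object. The sketch below uses a hypothetical toy tibble in place of evals so it runs on its own; the names toy, selected, nested, and stepwise are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for evals.
toy <- tibble(
  score = c(4.7, 4.1, 3.9),
  rank = c("tenure track", "tenured", "teaching")
)

# Pipeline version: read as "take toy, and then select, and then head".
piped <- toy |>
  select(score, rank) |>
  head()

# Nested version: the innermost call runs first.
nested <- head(select(toy, score, rank))

# Step-by-step version with an intermediate object.
selected <- select(toy, score, rank)
stepwise <- head(selected)

identical(piped, nested)
identical(piped, stepwise)
```

All three forms give the same result; the pipe mainly helps readability as the number of steps grows.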
slice()
Look at the documentation for slice()3 or the data transformation cheat sheet. What variations of slice_*() might be useful here?
3 Run ?slice in the console.
- Your turn: Display the first five rows of the evals data frame.
evals |>
  slice(1:5)
# A tibble: 5 × 23
course_id prof_id score rank ethnicity gender language age cls_perc_eval
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1 1 4.7 tenure … minority female english 36 55.8
2 2 1 4.1 tenure … minority female english 36 68.8
3 3 1 3.9 tenure … minority female english 36 60.8
4 4 1 4.8 tenure … minority female english 36 62.6
5 5 2 4.6 tenured not mino… male english 59 85
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
# cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
# bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
# bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
# with slice_head()
evals |>
  slice_head(n = 5)
# A tibble: 5 × 23
course_id prof_id score rank ethnicity gender language age cls_perc_eval
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1 1 4.7 tenure … minority female english 36 55.8
2 2 1 4.1 tenure … minority female english 36 68.8
3 3 1 3.9 tenure … minority female english 36 60.8
4 4 1 4.8 tenure … minority female english 36 62.6
5 5 2 4.6 tenured not mino… male english 59 85
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
# cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
# bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
# bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
- Your turn: Display the last two rows of the evals data frame.
evals |>
  slice((n() - 1):n())
# A tibble: 2 × 23
course_id prof_id score rank ethnicity gender language age cls_perc_eval
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 462 94 4.4 tenure … minority female non-eng… 42 81.8
2 463 94 4.1 tenure … minority female non-eng… 42 80
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
# cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
# bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
# bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
# with slice_tail()
evals |>
  slice_tail(n = 2)
# A tibble: 2 × 23
course_id prof_id score rank ethnicity gender language age cls_perc_eval
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 462 94 4.4 tenure … minority female non-eng… 42 81.8
2 463 94 4.1 tenure … minority female non-eng… 42 80
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
# cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
# bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
# bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
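Beyond slice_head() and slice_tail(), the slice_*() family also includes slice_min(), slice_max(), and slice_sample(), which choose rows by value or at random instead of by position. The following is a minimal sketch on a hypothetical toy tibble (toy is not part of the original exercise).

```r
library(dplyr)

# Hypothetical toy data with distinct scores, so there are no ties.
toy <- tibble(
  course_id = 1:5,
  score = c(4.7, 4.1, 3.9, 4.8, 4.6)
)

# Two rows with the smallest scores, returned in increasing order of score.
lowest_two <- toy |>
  slice_min(score, n = 2)

# The single row with the largest score.
highest_one <- toy |>
  slice_max(score, n = 1)

# Two rows chosen at random; the result changes from run to run.
random_two <- toy |>
  slice_sample(n = 2)

lowest_two
highest_one
```

Note that slice_min() and slice_max() keep tied rows by default, so they can return more than n rows when scores are tied.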
arrange()
- Your turn: Let’s arrange the data by score, so the courses with the lowest scores will be at the top of the data frame.
evals |>
  arrange(score)
# A tibble: 463 × 23
course_id prof_id score rank ethnicity gender language age cls_perc_eval
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 162 30 2.3 tenure… not mino… female english 41 83.3
2 335 68 2.4 tenured not mino… male english 60 71.9
3 40 8 2.5 tenured not mino… female english 51 80
4 337 68 2.5 tenured not mino… male english 60 62.5
5 329 66 2.7 tenured not mino… male english 64 81.8
6 376 76 2.7 tenured minority female english 43 48.9
7 7 2 2.8 tenured not mino… male english 59 88.6
8 185 34 2.8 tenure… minority female english 47 92.3
9 434 88 2.8 tenured not mino… male english 62 40.9
10 79 15 2.9 tenure… not mino… female english 37 82.1
# ℹ 453 more rows
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
# cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
# bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
# bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
- Your turn: Now let’s arrange the data by descending score, so the evals with the highest scores will be at the top.
evals |>
  arrange(desc(score))
# A tibble: 463 × 23
course_id prof_id score rank ethnicity gender language age cls_perc_eval
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 54 10 5 teachi… not mino… male english 47 90.9
2 57 10 5 teachi… not mino… male english 47 83.3
3 59 10 5 teachi… not mino… male english 47 80
4 103 19 5 tenured not mino… female english 46 93.3
5 108 19 5 tenured not mino… female english 46 100
6 349 71 5 teachi… minority male english 50 90.9
7 356 71 5 teachi… minority male english 50 95.2
8 406 82 5 tenured not mino… male english 57 40
9 420 85 5 teachi… not mino… male english 58 100
10 421 85 5 teachi… not mino… male english 58 85.7
# ℹ 453 more rows
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
# cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
# bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
# bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
- Your turn (5 minutes): Create a data frame that only includes the evaluation score (score), faculty rank (rank), and average beauty rating of the professor (bty_avg) for the course with the highest evaluation score and the faculty member with the highest average beauty rating. What is the average beauty rating (bty_avg) for this professor?
evals |>
  select(score, rank, bty_avg) |>
  arrange(desc(score), desc(bty_avg)) |>
  slice(1)
# A tibble: 1 × 3
score rank bty_avg
<dbl> <chr> <dbl>
1 5 teaching 7.83
Feel free to work ahead on the remaining exercises but we will pause to check in at this point.
filter()
- Demo: Filter the data frame by selecting the rows where the faculty member is teaching-track.
evals |>
  filter(rank == "teaching")
# A tibble: 102 × 23
course_id prof_id score rank ethnicity gender language age cls_perc_eval
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 50 10 4 teachi… not mino… male english 47 84.2
2 51 10 4.3 teachi… not mino… male english 47 75
3 52 10 4.4 teachi… not mino… male english 47 93.3
4 53 10 4.5 teachi… not mino… male english 47 95.7
5 54 10 5 teachi… not mino… male english 47 90.9
6 55 10 4.9 teachi… not mino… male english 47 58.6
7 56 10 4.6 teachi… not mino… male english 47 76.2
8 57 10 5 teachi… not mino… male english 47 83.3
9 58 10 4.7 teachi… not mino… male english 47 84.2
10 59 10 5 teachi… not mino… male english 47 80
# ℹ 92 more rows
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
# cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
# bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
# bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
- Demo: We can also filter using more than one condition. Here we select all rows where the faculty member is teaching-track and the evaluation score is greater than 3.5.
evals |>
  filter(rank == "teaching", score > 3.5)
# A tibble: 87 × 23
course_id prof_id score rank ethnicity gender language age cls_perc_eval
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 50 10 4 teachi… not mino… male english 47 84.2
2 51 10 4.3 teachi… not mino… male english 47 75
3 52 10 4.4 teachi… not mino… male english 47 93.3
4 53 10 4.5 teachi… not mino… male english 47 95.7
5 54 10 5 teachi… not mino… male english 47 90.9
6 55 10 4.9 teachi… not mino… male english 47 58.6
7 56 10 4.6 teachi… not mino… male english 47 76.2
8 57 10 5 teachi… not mino… male english 47 83.3
9 58 10 4.7 teachi… not mino… male english 47 84.2
10 59 10 5 teachi… not mino… male english 47 80
# ℹ 77 more rows
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
# cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
# bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
# bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
We can do more complex tasks using logical operators:
operator | definition |
---|---|
< | is less than? |
<= | is less than or equal to? |
> | is greater than? |
>= | is greater than or equal to? |
== | is exactly equal to? |
!= | is not equal to? |
x & y | is x AND y? |
x \| y | is x OR y? |
is.na(x) | is x NA? |
!is.na(x) | is x not NA? |
x %in% y | is x in y? |
!(x %in% y) | is x not in y? |
!x | is not x? |
The final operator only makes sense if x is logical (TRUE / FALSE).
- Your turn (4 minutes): Describe what the code is doing in words.
evals |>
  filter(
    rank %in% c("tenure track", "tenured"),
    score > 3.5, bty_avg > 6
  )
# A tibble: 55 × 23
course_id prof_id score rank ethnicity gender language age cls_perc_eval
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 18 5 4.8 tenure… not mino… female english 31 87.5
2 19 5 4.6 tenure… not mino… female english 31 90.9
3 20 5 4.6 tenure… not mino… female english 31 79.2
4 21 5 4.9 tenure… not mino… female english 31 88.9
5 22 5 4.6 tenure… not mino… female english 31 88.1
6 23 5 4.5 tenure… not mino… female english 31 56.3
7 140 25 4.8 tenure… not mino… female english 34 76.9
8 141 25 4.1 tenure… not mino… female english 34 82.5
9 194 36 3.9 tenured minority female english 44 54.5
10 196 36 4 tenured minority female english 44 100
# ℹ 45 more rows
# ℹ 14 more variables: cls_did_eval <dbl>, cls_students <dbl>, cls_level <chr>,
# cls_profs <chr>, cls_credits <chr>, bty_f1lower <dbl>, bty_f1upper <dbl>,
# bty_f2upper <dbl>, bty_m1lower <dbl>, bty_m1upper <dbl>, bty_m2upper <dbl>,
# bty_avg <dbl>, pic_outfit <chr>, pic_color <chr>
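One way to read the code above in words: keep the courses taught by tenure-track or tenured faculty whose evaluation score is above 3.5 and whose average beauty rating is above 6. Conditions separated by commas inside filter() are combined with AND, so the same filter can be spelled out with & and |. The sketch below checks this equivalence on a hypothetical toy tibble (toy is not part of the original exercise).

```r
library(dplyr)

# Hypothetical toy data; only the first row satisfies all three conditions.
toy <- tibble(
  rank = c("tenured", "teaching", "tenure track", "tenured"),
  score = c(4.8, 4.9, 3.2, 4.0),
  bty_avg = c(6.5, 7.0, 6.8, 5.0)
)

# Comma-separated conditions are combined with AND.
comma_version <- toy |>
  filter(
    rank %in% c("tenure track", "tenured"),
    score > 3.5, bty_avg > 6
  )

# The same logic written out with explicit logical operators.
and_version <- toy |>
  filter(
    (rank == "tenure track" | rank == "tenured") &
      score > 3.5 &
      bty_avg > 6
  )

identical(comma_version, and_version)
```

The %in% form tends to scale better than chained == comparisons as the list of allowed values grows.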
count()
- Demo: Create a frequency table of the ethnicity of the evaluated professors.
evals |>
  count(ethnicity)
# A tibble: 2 × 2
ethnicity n
<chr> <int>
1 minority 64
2 not minority 399
- Demo: Which faculty rank had the fewest evals? How many evals were there for that group?
evals |>
  count(rank) |>
  filter(n == min(n))
# A tibble: 1 × 2
rank n
<chr> <int>
1 teaching 102
- Your turn (5 minutes): Which type of faculty (based on rank, gender, and ethnicity) is most highly represented in this dataset? How many courses did they teach in this sample?
evals |>
  count(rank, gender, ethnicity) |>
  filter(n == max(n))
# A tibble: 1 × 4
rank gender ethnicity n
<chr> <chr> <chr> <int>
1 tenured male not minority 162
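An alternative to filtering on max(n) is to let count() sort the groups by size with its sort argument and then look at the top row. This is a sketch on a hypothetical toy tibble (toy is not part of the original exercise), not a replacement for the answer above.

```r
library(dplyr)

# Hypothetical toy data: tenured male instructors appear three times.
toy <- tibble(
  rank = c("tenured", "tenured", "tenured", "teaching", "teaching", "tenure track"),
  gender = c("male", "male", "male", "male", "female", "female")
)

# sort = TRUE orders the groups from largest to smallest n.
by_size <- toy |>
  count(rank, gender, sort = TRUE)

# The first row is the most highly represented group.
top_group <- by_size |>
  slice_head(n = 1)

top_group
```

One design difference: slice_head(n = 1) always returns a single row, whereas filter(n == max(n)) returns every group tied for the maximum.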
mutate()
Use mutate() to create a new variable.
- Demo: In the code chunk below, we calculate the difference in the average beauty ratings by gender of the rater (bty_f* vs bty_m*).
evals |>
  mutate(
    bty_avg_f = (bty_f1lower + bty_f1upper + bty_f2upper) / 3,
    bty_avg_m = (bty_m1lower + bty_m1upper + bty_m2upper) / 3,
    bty_avg_diff = bty_avg_f - bty_avg_m
  ) |>
  select(score, bty_avg_f, bty_avg_m, bty_avg_diff)
# A tibble: 463 × 4
score bty_avg_f bty_avg_m bty_avg_diff
<dbl> <dbl> <dbl> <dbl>
1 4.7 6 4 2
2 4.1 6 4 2
3 3.9 6 4 2
4 4.8 6 4 2
5 4.6 3.33 2.67 0.667
6 4.3 3.33 2.67 0.667
7 2.8 3.33 2.67 0.667
8 4.1 4 2.67 1.33
9 3.4 4 2.67 1.33
10 4.5 3.67 2.67 1
# ℹ 453 more rows
- Your turn (4 minutes): Create a new variable to calculate the percentage of evals for each faculty rank. What percentage of evals were for teaching-track faculty?
evals |>
  count(rank) |>
  mutate(perc = n / sum(n) * 100)
# A tibble: 3 × 3
rank n perc
<chr> <int> <dbl>
1 teaching 102 22.0
2 tenure track 108 23.3
3 tenured 253 54.6
summarize()
summarize() collapses the rows into summary statistics and removes columns irrelevant to the calculation.
Be sure to name your columns!
evals |>
  summarize(mean_score = mean(score))
# A tibble: 1 × 1
mean_score
<dbl>
1 NA
Question: Why did this code return NA?
Let’s fix it!
evals |>
  summarize(mean_score = mean(score, na.rm = TRUE))
# A tibble: 1 × 1
mean_score
<dbl>
1 4.18
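The NA appears because score contains missing values, and in R a single NA makes the mean of the whole vector NA unless it is removed first. The sketch below shows this on a hypothetical three-element vector (scores is illustrative, not taken from evals).

```r
# Hypothetical scores with one missing value.
scores <- c(4, NA, 5)

# The missing value propagates, so the result is NA.
mean_with_na <- mean(scores)

# na.rm = TRUE drops the NA before averaging the remaining values.
mean_without_na <- mean(scores, na.rm = TRUE)

# Counting the missing values shows how many rows are affected.
n_missing <- sum(is.na(scores))

mean_with_na     # NA
mean_without_na  # 4.5
n_missing        # 1
```

The same sum(is.na(...)) idiom can be used inside summarize() to count the missing scores in evals before deciding how to handle them.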
group_by()
group_by() is used for grouped operations. It’s very powerful when paired with summarize() to calculate summary statistics by group.
Here we find the mean and standard deviation of evaluation scores for each professor in the sample.
evals |>
  group_by(prof_id) |>
  summarize(
    mean_score = mean(score, na.rm = TRUE),
    sd_score = sd(score, na.rm = TRUE)
  )
# A tibble: 94 × 3
prof_id mean_score sd_score
<dbl> <dbl> <dbl>
1 1 4.38 0.443
2 2 3.9 0.964
3 3 3.75 0.495
4 4 4.3 0.321
5 5 4.67 0.151
6 6 4.63 0.180
7 7 4.1 0.354
8 8 4 0.766
9 9 4.61 0.177
10 10 4.64 0.344
# ℹ 84 more rows
- Your turn (4 minutes): What is the median evaluation score for each faculty rank? Which type of faculty has the lowest median evaluation score?
evals |>
  group_by(rank) |>
  summarize(
    med_score = median(score, na.rm = TRUE)
  )
# A tibble: 3 × 2
rank med_score
<chr> <dbl>
1 teaching 4.4
2 tenure track 4.35
3 tenured 4.2
Additional Practice
Only if we have enough time in class. You do not need to complete these for credit.
- Create a new dataset that only contains evals that do not have a missing evaluation score. Include the columns prof_id, score, rank, age, bty_avg, and bty_avg_diff (the difference in the average beauty score for female and male raters). Hint: You may need to use mutate() to make one or more of these variables.
evals |>
  # drop rows with NAs for score
  drop_na(score) |>
  # create required variable
  mutate(
    bty_avg_f = (bty_f1lower + bty_f1upper + bty_f2upper) / 3,
    bty_avg_m = (bty_m1lower + bty_m1upper + bty_m2upper) / 3,
    bty_avg_diff = bty_avg_f - bty_avg_m
  ) |>
  # keep only requested columns
  select(prof_id, score, rank, age, bty_avg, bty_avg_diff)
# A tibble: 449 × 6
prof_id score rank age bty_avg bty_avg_diff
<dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 4.7 tenure track 36 5 2
2 1 4.1 tenure track 36 5 2
3 1 3.9 tenure track 36 5 2
4 1 4.8 tenure track 36 5 2
5 2 4.6 tenured 59 3 0.667
6 2 4.3 tenured 59 3 0.667
7 2 2.8 tenured 59 3 0.667
8 3 4.1 tenured 51 3.33 1.33
9 3 3.4 tenured 51 3.33 1.33
10 4 4.5 tenured 40 3.17 1
# ℹ 439 more rows
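drop_na(score) comes from tidyr; the same rows can be kept with filter() and the !is.na() operator from the logical operator table. The sketch below compares the two on a hypothetical toy tibble (toy is not part of the original exercise).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing score.
toy <- tibble(
  prof_id = c(1, 1, 2),
  score = c(4.7, NA, 4.1)
)

# tidyr approach: drop rows where score is NA.
with_drop_na <- toy |>
  drop_na(score)

# dplyr approach: keep rows where score is not NA.
with_filter <- toy |>
  filter(!is.na(score))

all.equal(with_drop_na, with_filter)
```

Either form works here; drop_na() is shorter when several columns must be complete, while filter() makes the condition explicit.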
- For each professor (uniquely identified by prof_id), use a group_by() paired with summarize() to find the sample size, mean, and standard deviation of evaluation scores. Then include only the top 5 and bottom 5 professors in terms of mean scores in the final data frame.
# calculate requested summary statistics
prof_scores <- evals |>
  # drop rows with NAs for score
  drop_na(score) |>
  group_by(prof_id) |>
  summarize(
    mean_score = mean(score, na.rm = TRUE),
    sd_score = sd(score, na.rm = TRUE),
    sample_size = n()
  ) |>
  # sort rows by mean_score from high to low
  arrange(desc(mean_score))
# need to get top 5 and bottom 5 rows for each in a single data frame
bind_rows(
slice_head(.data = prof_scores, n = 5),
slice_tail(.data = prof_scores, n = 5)
)
# A tibble: 10 × 4
prof_id mean_score sd_score sample_size
<dbl> <dbl> <dbl> <int>
1 85 4.87 0.150 7
2 73 4.82 0.0447 5
3 71 4.81 0.179 10
4 52 4.74 0.113 7
5 50 4.73 0.163 6
6 15 3.18 0.189 4
7 60 3.13 0.208 3
8 69 3 NA 1
9 68 2.67 0.379 3
10 30 2.3 NA 1
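The NA values in sd_score above occur for professors with a sample size of 1. The sample standard deviation divides by n - 1, so it is undefined for a single observation and R returns NA. A minimal base R sketch, using made-up values:

```r
# A single observation: the standard deviation is undefined, so R returns NA.
sd_single <- sd(3)

# Two or more observations: the standard deviation can be computed.
sd_pair <- sd(c(2.5, 3.5))

is.na(sd_single)
is.na(sd_pair)
```

This NA is not caused by missing scores, so na.rm = TRUE does not remove it; filtering on sample_size > 1 would be one way to exclude such professors if needed.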
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.1 (2023-06-16)
os macOS Ventura 13.4.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2023-09-08
pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)
digest 0.6.31 2022-12-11 [1] CRAN (R 4.3.0)
dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)
evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)
fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)
here 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)
jsonlite 1.8.5 2023-06-05 [1] CRAN (R 4.3.0)
knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
rmarkdown 2.22 2023-06-01 [1] CRAN (R 4.3.0)
rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.3.0)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)
timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)
vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)
vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)
xfun 0.39 2023-04-20 [1] CRAN (R 4.3.0)
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
[1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
──────────────────────────────────────────────────────────────────────────────