Lecture 12
Cornell University
INFO 5001 - Fall 2024
October 4, 2023
What are some functions you’ve learned? What are their inputs, what are their outputs?
mean()
mutate()
ggplot()
mean()
rescale01()
temp_convert()
Repeat yourself:
Many functions in R are vectorized, meaning they automatically operate on all elements of a vector without needing to explicitly iterate through and act on each element individually.
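A minimal sketch of vectorization in base R, comparing one vectorized expression with the equivalent explicit loop. The vector `temps_f` and the Fahrenheit-to-Celsius formula are illustrative assumptions, not taken from the lecture.

```r
# Hypothetical example data: temperatures in Fahrenheit
temps_f <- c(32, 50, 212)

# Vectorized: the arithmetic is applied to every element at once
temps_c_vec <- (temps_f - 32) * 5 / 9

# Equivalent explicit loop, acting on one element at a time
temps_c_loop <- numeric(length(temps_f))
for (i in seq_along(temps_f)) {
  temps_c_loop[i] <- (temps_f[i] - 32) * 5 / 9
}

# Both approaches give the same values
all.equal(temps_c_vec, temps_c_loop)
```

The vectorized form is shorter and leaves the iteration to R itself.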
{ }
Accessing data variables as if they were variables in the environment: my_variable instead of df$my_variable
Base R requires identifying the location of variables in data frames
Want to access a variable without referring to the data frame as well?
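A small sketch of the contrast, assuming dplyr is installed. The tibble `toy` is a made-up example; only the column name `my_variable` comes from the slide.

```r
library(dplyr)

# Hypothetical example data
toy <- tibble(my_variable = c(1, 5, 9))

# Base R: the data frame must be named each time a column is used
base_result <- toy[toy$my_variable > 2, ]

# dplyr (data-masking): the column is referred to by its bare name
masked_result <- toy |> filter(my_variable > 2)
```

Both calls keep the same two rows. The difference is only in how the column is referenced.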
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean({{ mean_var }}, na.rm = TRUE))
}
penguins |> grouped_mean(species, bill_length_mm)
# A tibble: 3 × 2
species `mean(bill_length_mm, na.rm = TRUE)`
<fct> <dbl>
1 Adelie 38.8
2 Chinstrap 48.8
3 Gentoo 47.5
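To see why the embracing in grouped_mean() matters, the sketch below compares a version without {{ }} against an embraced one. The tibble `toy` and the names `grouped_mean_naive` and `grouped_mean_embraced` are hypothetical, and dplyr is assumed to be installed. The exact error message depends on the dplyr version, so it is caught rather than printed.

```r
library(dplyr)

# Hypothetical example data
toy <- tibble(g = c("a", "a", "b"), x = c(1, 3, 10))

# Without embracing: dplyr treats `group_var` as a literal name, not as
# the column the caller supplied, so the call fails
grouped_mean_naive <- function(df, group_var, mean_var) {
  df |>
    group_by(group_var) |>
    summarize(mean(mean_var, na.rm = TRUE))
}

# tryCatch() converts the failure into a plain string for inspection
naive_result <- tryCatch(
  grouped_mean_naive(toy, g, x),
  error = function(e) "error"
)

# With embracing: {{ }} forwards the caller's column into the dplyr verb
grouped_mean_embraced <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean({{ mean_var }}, na.rm = TRUE))
}

embraced_result <- grouped_mean_embraced(toy, g, x)
```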
Data-masking: this is used in functions like arrange(), filter(), and summarize() that compute with variables.
Tidy-selection: this is used for functions like select(), relocate(), and rename() that select variables.
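The two kinds of tidy evaluation can be sketched side by side, assuming dplyr is installed. The helpers `keep_above()` and `keep_cols()` and the tibble `toy` are hypothetical names introduced only for illustration.

```r
library(dplyr)

# Hypothetical example data
toy <- tibble(
  id    = 1:4,
  score = c(10, 20, 30, 40),
  note  = c("a", "b", "c", "d")
)

# Data-masking: the embraced variable is computed with inside filter()
keep_above <- function(df, var, cutoff) {
  df |> filter({{ var }} > cutoff)
}

# Tidy-selection: the embraced argument names columns inside select()
keep_cols <- function(df, cols) {
  df |> select({{ cols }})
}

above  <- keep_above(toy, score, 25)   # rows where score > 25
picked <- keep_cols(toy, c(id, score)) # only the id and score columns
```

In both helpers the embrace is written the same way. What changes is whether the verb computes on the column's values or only selects the column by name.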
summary6 <- function(data, var) {
  data |> summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}
penguins |> summary6(body_mass_g)
# A tibble: 1 × 6
min mean median max n n_miss
<int> <dbl> <dbl> <int> <int> <int>
1 2700 4202. 4050 6300 344 2
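Because summary6() wraps summarize(), it should also work on grouped input, which is where .groups = "drop" has a visible effect. The sketch below assumes dplyr is installed and uses a made-up tibble `toy` rather than the penguins data.

```r
library(dplyr)

# Same definition as on the slide, repeated so the example is self-contained
summary6 <- function(data, var) {
  data |> summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}

# Hypothetical example data with one missing value in group "a"
toy <- tibble(g = c("a", "a", "b"), x = c(1, NA, 5))

# One row of statistics per group; the result is no longer grouped
stats <- toy |>
  group_by(g) |>
  summary6(x)
```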
count_missing <- function(df, group_vars, x_var) {
  df |>
    group_by({{ group_vars }}) |>
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
}
penguins |>
  count_missing(c(species, island), bill_length_mm)
Error in `group_by()`:
ℹ In argument: `c(species, island)`.
Caused by error:
! `c(species, island)` must be size 344 or 1, not 688.
pick() for tidy-selection
count_missing <- function(df, group_vars, x_var) {
  df |>
    group_by(pick({{ group_vars }})) |>
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
}
penguins |>
  count_missing(c(species, island), bill_length_mm)
# A tibble: 5 × 3
species island n_miss
<fct> <fct> <int>
1 Adelie Biscoe 0
2 Adelie Dream 0
3 Adelie Torgersen 1
4 Chinstrap Dream 0
5 Gentoo Biscoe 1
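Since pick() uses tidy-selection, the grouping argument of count_missing() can also be a selection helper such as starts_with(). The sketch below assumes dplyr 1.1.0 or later (the release that added pick()) and uses a made-up tibble `toy`.

```r
library(dplyr)

# Same definition as on the slide, repeated so the example is self-contained
count_missing <- function(df, group_vars, x_var) {
  df |>
    group_by(pick({{ group_vars }})) |>
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
}

# Hypothetical example data
toy <- tibble(
  g1 = c("a", "a", "b"),
  g2 = c("u", "v", "v"),
  x  = c(NA, 1, NA)
)

# Grouping columns given by name ...
by_names <- count_missing(toy, c(g1, g2), x)

# ... or through a tidy-select helper that matches the same columns
by_helper <- count_missing(toy, starts_with("g"), x)
```

Both calls group by g1 and g2, so they produce the same counts.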
ae-10
ae-10 (repo name will be suffixed with your GitHub name).