Functions

Lecture 11

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2024

October 3, 2024

Announcements

  • Project proposals

Functions

Functions in R

What are some functions you’ve learned? What are their inputs, what are their outputs?

  • mean()
  • mutate()
  • ggplot()

mean()

x <- c(1, 2, 3, 4, 5)
mean(x)
[1] 3
library(tidyverse) # provides tibble(), summarize(), mutate(), and across()

df <- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5),
)

df |> summarize(a = mean(a), b = mean(b), c = mean(c), d = mean(d))
# A tibble: 1 × 4
      a       b     c     d
  <dbl>   <dbl> <dbl> <dbl>
1 0.194 -0.0443 0.308 0.109
df |> summarize(across(.cols = everything(), .fns = mean))
# A tibble: 1 × 4
      a       b     c     d
  <dbl>   <dbl> <dbl> <dbl>
1 0.194 -0.0443 0.308 0.109

Function components

name <- function(arguments) {
  body
}
  • Name
  • Arguments
  • Body
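
To make the three components concrete, here is a minimal sketch. The function `add_one()` is a hypothetical example, not one used elsewhere in the lecture.

```r
# Name: add_one (hypothetical, for illustration only)
# Arguments: x, a numeric vector
# Body: the code between the braces; the value of the last
#       evaluated expression is what the function returns
add_one <- function(x) {
  x + 1
}

add_one(c(1, 2, 3))
```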

rescale01()

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
df |> mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
  d = rescale01(d),
)
# A tibble: 5 × 4
      a     b     c     d
  <dbl> <dbl> <dbl> <dbl>
1 0     1     1     1    
2 0.156 0.579 0.514 0.657
3 1     0     0.537 0    
4 0.298 0.194 0.374 0.711
5 0.325 0.275 0     0.398
# or with across()
df |>
  mutate(across(
    .cols = everything(),
    .fns = rescale01
  ))
# A tibble: 5 × 4
      a     b     c     d
  <dbl> <dbl> <dbl> <dbl>
1 0     1     1     1    
2 0.156 0.579 0.514 0.657
3 1     0     0.537 0    
4 0.298 0.194 0.374 0.711
5 0.325 0.275 0     0.398
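
It can help to check a new function on small inputs where the answer is easy to verify by hand. The sketch below repeats the definition of rescale01() and tries three hand-picked vectors (the vectors are illustrative assumptions, not course data). Note the constant vector: its range has zero width, so the division is 0 / 0.

```r
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Evenly spaced values map onto 0, 0.5, 1
evenly_spaced <- rescale01(c(0, 5, 10))

# Missing values are ignored when computing the range but stay missing
with_missing <- rescale01(c(1, NA, 3))

# A constant vector has max == min, so every element is 0 / 0 (NaN)
constant <- rescale01(c(2, 2, 2))
```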

Custom function: temp_convert()

  • Goal: Convert temperatures from degrees Celsius to degrees Fahrenheit by multiplying by \(\frac{9}{5}\) and adding 32.
  • Number and type of inputs: 1 (a numeric vector of length 1)
  • Number and type of outputs: 1 (a numeric vector of length 1)
temp_convert <- function(temp_c) {
  (temp_c * 9 / 5) + 32
}

Test out the function

temp_convert(0) # freezing point
[1] 32
temp_convert(220) # bread baking temperature
[1] 428
temp_convert(100) # boiling point
[1] 212
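
One possible extension is to make the function fail early when it receives the wrong type of input. The sketch below is an assumption about how that could look, not part of the original function; `temp_convert_checked()` is a hypothetical name.

```r
# Hypothetical variant of temp_convert() that validates its input
temp_convert_checked <- function(temp_c) {
  # Stop with an error unless the input is numeric
  stopifnot(is.numeric(temp_c))
  (temp_c * 9 / 5) + 32
}

# Numeric input is converted as before
body_temp_f <- temp_convert_checked(37)

# Character input triggers an error, which is caught here for demonstration
bad_input <- tryCatch(
  temp_convert_checked("37"),
  error = function(e) "not numeric"
)
```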

Why do we need functions?

Repeat yourself:

# freezing point
(0 * 9 / 5) + 32
[1] 32
# bread baking temperature
(220 * 9 / 5) + 32
[1] 428
# boiling point
(100 * 9 / 5) + 32
[1] 212

Do not repeat yourself (DRY):

# freezing point
temp_convert(0)
[1] 32
# bread baking temperature
temp_convert(220)
[1] 428
# boiling point
temp_convert(100)
[1] 212

Vectorized functions in R

Many functions in R are vectorized, meaning they automatically operate on all elements of a vector without needing to explicitly iterate through and act on each element individually.

x <- c(0, 100, 220)

temp_convert(x)
[1]  32 212 428
tibble(temp_c = x) |> mutate(temp_f = temp_convert(temp_c))
# A tibble: 3 × 2
  temp_c temp_f
   <dbl>  <dbl>
1      0     32
2    100    212
3    220    428
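
For contrast, here is a sketch of the explicit iteration that vectorization saves us from writing. It repeats the definition of temp_convert() so it runs on its own; the name `temp_f_loop` is just illustrative.

```r
temp_convert <- function(temp_c) {
  (temp_c * 9 / 5) + 32
}

x <- c(0, 100, 220)

# Explicit iteration: convert one element at a time
temp_f_loop <- numeric(length(x))
for (i in seq_along(x)) {
  temp_f_loop[i] <- temp_convert(x[i])
}

# Vectorized call: convert every element in one step
temp_f_vec <- temp_convert(x)

# Both approaches give the same values
identical(temp_f_loop, temp_f_vec)
```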

Data frame functions

  • Work like dplyr verbs
    • First argument is a data frame
    • Additional arguments about what to do with the data frame
    • Output is a data frame or vector
  • Require indirection: embrace arguments that name columns with {{ }}

Indirection and tidy evaluation

library(palmerpenguins)

grouped_mean <- function(df, group_var, mean_var) {
  df |> 
    group_by(group_var) |> 
    summarize(mean(mean_var, na.rm = TRUE))
}

penguins |> grouped_mean(species, bill_length_mm)
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `group_var` is not found.

Data masking

  • Accessing data variables as if they were variables in the environment

    my_variable instead of df$my_variable

  • Base R requires identifying the location of variables in data frames

# base R
penguins[penguins$species == "Adelie" & penguins$body_mass_g >= 4000, ]

# tidyverse + data masking
penguins |> filter(species == "Adelie", body_mass_g >= 4000)
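
Data masking can be ambiguous when a column and an environment variable share a name. The sketch below uses a small made-up tibble (`toy` is an assumption, not course data) and the `.data` and `.env` pronouns from rlang, which are available inside dplyr verbs, to make each lookup explicit.

```r
library(dplyr)

toy <- tibble(x = c(1, 5, 10))
x <- 4 # environment variable that shares its name with the column

# A bare `x` inside filter() would refer to the column.
# .data$x asks for the column; .env$x asks for the environment variable.
result <- toy |> filter(.data$x > .env$x)
result
```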

Embracing

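Want a function argument to stand in for a column name? Embrace it with {{ }} so dplyr looks up the column the user supplied rather than a column literally named group_var.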

grouped_mean <- function(df, group_var, mean_var) {
  df |> 
    group_by({{ group_var }}) |> 
    summarize(mean({{ mean_var }}, na.rm = TRUE))
}

penguins |> grouped_mean(species, bill_length_mm)
# A tibble: 3 × 2
  species   `mean(bill_length_mm, na.rm = TRUE)`
  <fct>                                    <dbl>
1 Adelie                                    38.8
2 Chinstrap                                 48.8
3 Gentoo                                    47.5
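
The summary column above ends up with an unwieldy name. One way to tidy it, sketched here under the assumption of a reasonably recent dplyr/rlang, is to embrace the argument inside the name on the left-hand side of `:=`. The function name `grouped_mean_named()` and the `toy` data are hypothetical.

```r
library(dplyr)

# Hypothetical variant of grouped_mean() that names its output column
grouped_mean_named <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    # "mean_{{ mean_var }}" builds the column name from the argument
    summarize("mean_{{ mean_var }}" := mean({{ mean_var }}, na.rm = TRUE))
}

# Small made-up data set so the result is easy to check by hand
toy <- tibble(
  group = c("a", "a", "b"),
  value = c(1, 3, 10)
)

result <- toy |> grouped_mean_named(group, value)
result
```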

When to embrace

  • Data-masking: this is used in functions like arrange(), filter(), and summarize() that compute with variables.

  • Tidy-selection: this is used for functions like select(), relocate(), and rename() that select variables.
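
As a rough illustration of both cases in one function, the sketch below embraces one argument inside a data-masking verb and another inside a tidy-selection verb. `top_rows()` and the `toy` tibble are hypothetical names used only for this example.

```r
library(dplyr)

# Hypothetical helper: sort by one column, then keep a chosen set of columns
top_rows <- function(df, sort_var, keep_cols) {
  df |>
    arrange(desc({{ sort_var }})) |> # data-masking: computes with a column
    select({{ keep_cols }})          # tidy-selection: chooses columns
}

toy <- tibble(
  id = 1:3,
  score = c(10, 30, 20),
  note = c("x", "y", "z")
)

result <- toy |> top_rows(score, c(id, score))
result
```

Because select() uses tidy-selection, c(id, score) works there directly; the same argument embraced inside a data-masking verb would be treated as a computation instead.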

Common use cases

summary6 <- function(data, var) {
  data |> summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}

penguins |> summary6(body_mass_g)
# A tibble: 1 × 6
    min  mean median   max     n n_miss
  <int> <dbl>  <dbl> <int> <int>  <int>
1  2700 4202.   4050  6300   344      2
penguins |> group_by(species) |> summary6(body_mass_g)
# A tibble: 3 × 7
  species     min  mean median   max     n n_miss
  <fct>     <int> <dbl>  <dbl> <int> <int>  <int>
1 Adelie     2850 3701.   3700  4775   152      1
2 Chinstrap  2700 3733.   3700  4800    68      0
3 Gentoo     3950 5076.   5000  6300   124      1

Data-masking vs. tidy-selection

count_missing <- function(df, group_vars, x_var) {
  df |> 
    group_by({{ group_vars }}) |> 
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
}

penguins |> 
  count_missing(c(species, island), bill_length_mm)
Error in `group_by()`:
ℹ In argument: `c(species, island)`.
Caused by error:
! `c(species, island)` must be size 344 or 1, not 688.

Use pick() for tidy-selection

count_missing <- function(df, group_vars, x_var) {
  df |> 
    group_by(pick({{ group_vars }})) |> 
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
}

penguins |> 
  count_missing(c(species, island), bill_length_mm)
# A tibble: 5 × 3
  species   island    n_miss
  <fct>     <fct>      <int>
1 Adelie    Biscoe         0
2 Adelie    Dream          0
3 Adelie    Torgersen      1
4 Chinstrap Dream          0
5 Gentoo    Biscoe         1

ae-09

  • Go to the course GitHub org and find your ae-09 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline (end of the day).

Recap

  • Writing your own functions is a great way to make your code more readable and reusable.
  • Functions can be vectorized, meaning they operate on all elements of a vector without needing to explicitly iterate through and act on each element individually.
  • Data frame functions work like dplyr verbs and require indirection: embrace arguments with {{ }}.
