AE 06: Data types and classes

Application exercise
Answers

Packages

We will use the following three packages in this application exercise.

  • tidyverse: For data import, wrangling, and visualization.
  • skimr: For summarizing the entire data frame at once.
  • scales: For better axis labels.
library(tidyverse)
library(skimr)
library(scales)

Type coercion

  • Demo: Determine the type of the following vector, then convert it to numeric.
x <- c("1", "2", "3")
typeof(x)
[1] "character"
as.numeric(x)
[1] 1 2 3
parse_number(x)
[1] 1 2 3
  • Demo: Once again, determine the type of the following vector, then convert it to numeric. What is different from the previous exercise?
y <- c("a", "b", "c")
typeof(y)
[1] "character"
as.numeric(y)
Warning: NAs introduced by coercion
[1] NA NA NA
parse_number(y)
Warning: 3 parsing failures.
row col expected actual
  1  -- a number      a
  2  -- a number      b
  3  -- a number      c
[1] NA NA NA
attr(,"problems")
# A tibble: 3 × 4
    row   col expected actual
  <int> <int> <chr>    <chr> 
1     1    NA a number a     
2     2    NA a number b     
3     3    NA a number c     
  • Demo: Once again, determine the type of the following vector, then convert it to numeric. What is different from the previous exercise?
z <- c("1", "2", "three")
typeof(z)
[1] "character"
as.numeric(z)
Warning: NAs introduced by coercion
[1]  1  2 NA
parse_number(z)
Warning: 1 parsing failure.
row col expected actual
  3  -- a number  three
[1]  1  2 NA
attr(,"problems")
# A tibble: 1 × 4
    row   col expected actual
  <int> <int> <chr>    <chr> 
1     3    NA a number three 
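
The three demos differ in how many elements fail to convert: none for x, all for y, and only the third for z. As a minimal base R sketch, the helper below flags which elements would turn into NA before you convert them. The name failed_to_parse() is hypothetical and not part of the exercise.

```r
# Hypothetical helper (not from the exercise): flag elements that are not
# already missing but would become NA when coerced to numeric.
failed_to_parse <- function(v) {
  converted <- suppressWarnings(as.numeric(v))
  !is.na(v) & is.na(converted)
}

x <- c("1", "2", "3")
y <- c("a", "b", "c")
z <- c("1", "2", "three")

failed_x <- failed_to_parse(x) # no element fails
failed_y <- failed_to_parse(y) # every element fails
failed_z <- failed_to_parse(z) # only the third element fails

# Show the offending values so they can be recoded before converting
z[failed_z]
```

Inspecting the failures first makes it easier to decide whether they should be recoded by hand or genuinely treated as missing.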

Recoding survey results

Demo: Suppose you conducted a survey where you asked people how many cars their household owns collectively. And the answers are as follows:

survey_results <- tibble(cars = c(1, 2, "three"))
survey_results
# A tibble: 3 × 1
  cars 
  <chr>
1 1    
2 2    
3 three
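
The cars column is stored as character even though two of the three answers were entered as numbers. This happens because c() creates an atomic vector, and every element of an atomic vector must share one type. A short base R sketch of that implicit coercion:

```r
# c() builds an atomic vector, so mixed inputs are coerced to the most
# flexible type present: logical < integer < double < character.
typeof(c(TRUE, 1L))      # logical and integer  -> "integer"
typeof(c(1L, 2.5))       # integer and double   -> "double"
typeof(c(1, 2, "three")) # double and character -> "character"

# The numbers are silently turned into strings
cars <- c(1, 2, "three")
cars
```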

This is annoying because the third survey taker typed out the number instead of providing it as a numeric value, so the whole cars column is stored as character. You now need to convert the cars variable to numeric, so you do the following:

survey_results |>
  mutate(cars = as.numeric(cars))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `cars = as.numeric(cars)`.
Caused by warning:
! NAs introduced by coercion
# A tibble: 3 × 1
   cars
  <dbl>
1     1
2     2
3    NA

And now things are even more annoying: you get the warning "NAs introduced by coercion" from computing cars = as.numeric(cars), and the response from the third survey taker is now NA (you lost their data). Fix your mutate() call to avoid this warning.

survey_results |>
  mutate(
    cars = if_else(cars == "three", "3", cars),
    cars = as.numeric(cars)
  )
# A tibble: 3 × 1
   cars
  <dbl>
1     1
2     2
3     3
# or with parse_number()
survey_results |>
  mutate(
    cars = if_else(cars == "three", "3", cars),
    cars = parse_number(cars)
  )
# A tibble: 3 × 1
   cars
  <dbl>
1     1
2     2
3     3
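
The fix above hard-codes a single word. If several respondents had typed out their answers, one option is a lookup table that maps words to digits. The sketch below uses base R only. The number_words vector and the recode_number_words() helper are hypothetical names introduced for illustration.

```r
# Hypothetical lookup table: names are the typed-out words, values the digits
number_words <- c(one = "1", two = "2", three = "3", four = "4", five = "5")

# Hypothetical helper: replace known number words with digits, then convert.
# Matching ignores case and surrounding whitespace.
recode_number_words <- function(v, lookup = number_words) {
  key <- tolower(trimws(v))
  is_word <- key %in% names(lookup)
  v[is_word] <- lookup[key[is_word]]
  as.numeric(v)
}

cars <- c("1", "2", "three", "Two")
cars_numeric <- recode_number_words(cars)
cars_numeric
```

Words that are not in the lookup table would still become NA with a coercion warning, which is a useful signal that the table needs another entry.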

Hotel bookings

# From TidyTuesday: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md

hotels <- read_csv("data/hotels-tt.csv")
skim(hotels) # much more useful to run interactively in the console
Data summary
Name hotels
Number of rows 119390
Number of columns 32
_______________________
Column type frequency:
character 13
Date 1
numeric 18
________________________
Group variables None

Variable type: character

skim_variable        n_missing complete_rate min max empty n_unique whitespace
hotel                        0             1  10  12     0        2          0
arrival_date_month           0             1   3   9     0       12          0
meal                         0             1   2   9     0        5          0
country                      0             1   2   4     0      178          0
market_segment               0             1   6  13     0        8          0
distribution_channel         0             1   3   9     0        5          0
reserved_room_type           0             1   1   1     0       10          0
assigned_room_type           0             1   1   1     0       12          0
deposit_type                 0             1  10  10     0        3          0
agent                        0             1   1   4     0      334          0
company                      0             1   1   4     0      353          0
customer_type                0             1   5  15     0        4          0
reservation_status           0             1   7   9     0        3          0

Variable type: Date

skim_variable           n_missing complete_rate min        max        median     n_unique
reservation_status_date         0             1 2014-10-17 2017-09-14 2016-08-07      926

Variable type: numeric

skim_variable                  n_missing complete_rate    mean     sd      p0     p25     p50  p75 p100 hist
is_canceled                            0             1    0.37   0.48    0.00    0.00    0.00    1    1 ▇▁▁▁▅
lead_time                              0             1  104.01 106.86    0.00   18.00   69.00  160  737 ▇▂▁▁▁
arrival_date_year                      0             1 2016.16   0.71 2015.00 2016.00 2016.00 2017 2017 ▃▁▇▁▆
arrival_date_week_number               0             1   27.17  13.61    1.00   16.00   28.00   38   53 ▅▇▇▇▅
arrival_date_day_of_month              0             1   15.80   8.78    1.00    8.00   16.00   23   31 ▇▇▇▇▆
stays_in_weekend_nights                0             1    0.93   1.00    0.00    0.00    1.00    2   19 ▇▁▁▁▁
stays_in_week_nights                   0             1    2.50   1.91    0.00    1.00    2.00    3   50 ▇▁▁▁▁
adults                                 0             1    1.86   0.58    0.00    2.00    2.00    2   55 ▇▁▁▁▁
children                               4             1    0.10   0.40    0.00    0.00    0.00    0   10 ▇▁▁▁▁
babies                                 0             1    0.01   0.10    0.00    0.00    0.00    0   10 ▇▁▁▁▁
is_repeated_guest                      0             1    0.03   0.18    0.00    0.00    0.00    0    1 ▇▁▁▁▁
previous_cancellations                 0             1    0.09   0.84    0.00    0.00    0.00    0   26 ▇▁▁▁▁
previous_bookings_not_canceled         0             1    0.14   1.50    0.00    0.00    0.00    0   72 ▇▁▁▁▁
booking_changes                        0             1    0.22   0.65    0.00    0.00    0.00    0   21 ▇▁▁▁▁
days_in_waiting_list                   0             1    2.32  17.59    0.00    0.00    0.00    0  391 ▇▁▁▁▁
adr                                    0             1  101.83  50.54   -6.38   69.29   94.58  126 5400 ▇▁▁▁▁
required_car_parking_spaces            0             1    0.06   0.25    0.00    0.00    0.00    0    8 ▇▁▁▁▁
total_of_special_requests              0             1    0.57   0.79    0.00    0.00    0.00    1    5 ▇▁▁▁▁

Question: Take a look at the following visualization. How are the months ordered? What would be a better order?
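
The skim() output lists arrival_date_month as a character variable, and ggplot2 places character values on a discrete axis in alphabetical order. The base R sketch below, using a small made-up vector of month names rather than the hotels data, shows where that order comes from and how explicit factor levels change it.

```r
# Made-up example values, not taken from the hotels data
months <- c("July", "August", "January", "December", "July")

# Character values have no inherent order, so they sort alphabetically
sort(unique(months))

# factor() without `levels` also falls back to alphabetical order
levels(factor(months))

# Supplying month.name as the levels gives chronological order instead
months_fct <- factor(months, levels = month.name)
sort(unique(months_fct))
```

Chronological order is the better choice here because the line plot is meant to show how prices change over the course of the year.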

Solve using factors

Demo: Reorder the months on the x-axis (levels of arrival_date_month) in a way that makes more sense. You will want to use functions from the forcats package, see https://forcats.tidyverse.org/reference/index.html for inspiration and help.

month_names <- month.name
names(month_names) <- month.abb

# simple with factor()
hotels |>
  mutate(
    # convert to factor
    arrival_date_month = factor(
      x = arrival_date_month,
      levels = month.name,
      labels = month.abb
    )
  ) |>
  summarize(mean_adr = mean(adr), .by = c(hotel, arrival_date_month)) |>
  ggplot(mapping = aes(
    x = arrival_date_month,
    y = mean_adr,
    group = hotel,
    color = hotel
  )) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Arrival month",
    y = "Mean ADR (average daily rate)",
    title = "Comparison of resort and city hotel prices across months",
    subtitle = "Resort hotel prices soar in the summer while city hotel prices remain relatively constant throughout the year",
    color = "Hotel type"
  )

# more involved with forcats
hotels |>
  mutate(
    # convert to factor
    arrival_date_month = fct(x = arrival_date_month) |>
      # change order to be chronological
      fct_relevel(month.name) |>
      # change labels to be abbreviated for plotting
      fct_recode(!!!month_names)
  ) |>
  summarize(mean_adr = mean(adr), .by = c(hotel, arrival_date_month)) |>
  ggplot(mapping = aes(
    x = arrival_date_month,
    y = mean_adr,
    group = hotel,
    color = hotel
  )) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Arrival month",
    y = "Mean ADR (average daily rate)",
    title = "Comparison of resort and city hotel prices across months",
    subtitle = "Resort hotel prices soar in the summer while city hotel prices remain relatively constant throughout the year",
    color = "Hotel type"
  )

Solve using lubridate

Demo: Reorder the months on the x-axis (levels of arrival_date_month) in a way that makes more sense. You will want to use functions from the lubridate package, see https://lubridate.tidyverse.org/reference/index.html for inspiration and help.

hotels |>
  mutate(
    # create a date column and extract month
    arrival_date = str_glue("{arrival_date_month} {arrival_date_day_of_month}, {arrival_date_year}") |>
      mdy(),
    arrival_date_month = month(arrival_date, label = TRUE, abbr = TRUE),
    .before = arrival_date_year
  ) |>
  summarize(mean_adr = mean(adr), .by = c(hotel, arrival_date_month)) |>
  ggplot(mapping = aes(
    x = arrival_date_month,
    y = mean_adr,
    group = hotel,
    color = hotel
  )) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Arrival month",
    y = "Mean ADR (average daily rate)",
    title = "Comparison of resort and city hotel prices across months",
    subtitle = "Resort hotel prices soar in the summer while city hotel prices remain relatively constant throughout the year",
    color = "Hotel type"
  )

Stretch goal: If you finish the above task before time is up, format the y-axis labels so the values are shown with dollar signs, e.g. $80 instead of 80. You will want to use a function from the scales package, see https://scales.r-lib.org/reference/index.html for inspiration and help.

Additionally, adjust the fig-width code chunk option so that the entire title fits on the plot.

```{r}
#| label: hotels-plot-improve
#| fig-width: 8

# either approach above could be used here
hotels |>
  mutate(
    # convert to factor: levels = month.name sets chronological order,
    # labels = month.abb displays the abbreviated month names
    arrival_date_month = factor(x = arrival_date_month, levels = month.name, labels = month.abb)
  ) |>
  summarize(mean_adr = mean(adr), .by = c(hotel, arrival_date_month)) |>
  ggplot(mapping = aes(
    x = arrival_date_month,
    y = mean_adr,
    group = hotel,
    color = hotel
  )) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Arrival month",
    y = "Mean ADR (average daily rate)",
    title = "Comparison of resort and city hotel prices across months",
    subtitle = "Resort hotel prices soar in the summer while city hotel prices remain relatively constant throughout the year",
    color = "Hotel type"
  ) +
  scale_y_continuous(labels = label_dollar())
```

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       macOS Ventura 13.5.2
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2023-11-01
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 base64enc     0.1-3   2015-07-28 [1] CRAN (R 4.3.0)
 bit           4.0.5   2022-11-15 [1] CRAN (R 4.3.0)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.3.0)
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
 colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.3.0)
 digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.0)
 dplyr       * 1.1.3   2023-09-03 [1] CRAN (R 4.3.0)
 evaluate      0.22    2023-09-29 [1] CRAN (R 4.3.1)
 fansi         1.0.5   2023-10-08 [1] CRAN (R 4.3.1)
 farver        2.1.1   2022-07-06 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.2   2023-04-03 [1] CRAN (R 4.3.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
 gtable        0.3.3   2023-03-21 [1] CRAN (R 4.3.0)
 here          1.0.1   2020-12-13 [1] CRAN (R 4.3.0)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
 htmltools     0.5.6.1 2023-10-06 [1] CRAN (R 4.3.1)
 htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.3.0)
 jsonlite      1.8.7   2023-06-29 [1] CRAN (R 4.3.0)
 knitr         1.44    2023-09-11 [1] CRAN (R 4.3.0)
 labeling      0.4.2   2020-10-20 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
 lubridate   * 1.9.2   2023-02-10 [1] CRAN (R 4.3.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 ragg          1.2.5   2023-01-12 [1] CRAN (R 4.3.0)
 readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.3.0)
 repr          1.1.6   2023-01-26 [1] CRAN (R 4.3.0)
 rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
 rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
 rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.3.0)
 rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.3.0)
 scales      * 1.2.1   2022-08-20 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 skimr       * 2.1.5   2022-12-23 [1] CRAN (R 4.3.0)
 stringi       1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
 systemfonts   1.0.4   2022-02-11 [1] CRAN (R 4.3.0)
 textshaping   0.3.6   2021-10-13 [1] CRAN (R 4.3.0)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.0)
 tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
 utf8          1.2.3   2023-01-31 [1] CRAN (R 4.3.0)
 vctrs         0.6.4   2023-10-12 [1] CRAN (R 4.3.1)
 vroom         1.6.3   2023-04-28 [1] CRAN (R 4.3.0)
 withr         2.5.1   2023-09-26 [1] CRAN (R 4.3.1)
 xfun          0.40    2023-08-09 [1] CRAN (R 4.3.0)
 yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────