AE 06: Data types and classes

Application exercise

Answers

Packages

We will use the following three packages in this application exercise.

tidyverse: For data import, wrangling, and visualization.
skimr: For summarizing the entire data frame at once.
scales: For better axis labels.

library(tidyverse)
library(skimr)
library(scales)

Type coercion

Demo: Determine the type of the following vector. And then, change the type to numeric.

x <- c("1", "2", "3")
typeof(x)

[1] "character"

as.numeric(x)

[1] 1 2 3

parse_number(x)

[1] 1 2 3

Demo: Once again, determine the type of the following vector, then change the type to numeric. What’s different from the previous exercise?

y <- c("a", "b", "c")
typeof(y)

[1] "character"

as.numeric(y)

Warning: NAs introduced by coercion

[1] NA NA NA

parse_number(y)

Warning: 3 parsing failures.
row col expected actual
  1  -- a number      a
  2  -- a number      b
  3  -- a number      c

[1] NA NA NA
attr(,"problems")
# A tibble: 3 × 4
    row   col expected actual
  <int> <int> <chr>    <chr> 
1     1    NA a number a     
2     2    NA a number b     
3     3    NA a number c

Demo: Once again, determine the type of the following vector, then change the type to numeric. What’s different from the previous exercise?

z <- c("1", "2", "three")
typeof(z)

[1] "character"

as.numeric(z)

Warning: NAs introduced by coercion

[1]  1  2 NA

parse_number(z)

Warning: 1 parsing failure.
row col expected actual
  3  -- a number  three

[1]  1  2 NA
attr(,"problems")
# A tibble: 1 × 4
    row   col expected actual
  <int> <int> <chr>    <chr> 
1     3    NA a number three
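In the three demos above, as.numeric() and parse_number() return the same values. A minimal sketch of where they differ, using a made-up vector (prices is hypothetical, not part of the exercise): as.numeric() needs the whole string to be a valid number, while parse_number() pulls out the first number it finds and ignores the surrounding characters.

```r
library(readr)

# hypothetical input: numbers wrapped in other characters
prices <- c("$1,200", "45%", "12 apples")

# as.numeric() cannot convert any of these, so every element becomes NA
as_num <- suppressWarnings(as.numeric(prices))
as_num

# parse_number() keeps the first number in each string: 1200, 45, 12
parsed <- parse_number(prices)
parsed
```

Neither function can turn a spelled-out number such as "three" into 3; that still needs explicit recoding.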

Recoding survey results

Demo: Suppose you conducted a survey where you asked people how many cars their household owns collectively. And the answers are as follows:

survey_results <- tibble(cars = c(1, 2, "three"))
survey_results

# A tibble: 3 × 1
  cars 
  <chr>
1 1    
2 2    
3 three
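The column prints as <chr> because c() builds an atomic vector, which can hold only one type. A short base R sketch of this implicit coercion, where mixed inputs are converted to the most flexible type present:

```r
# numbers mixed with a string: everything becomes character
typeof(c(1, 2, "three"))  # "character"

# logical mixed with integer: everything becomes integer
typeof(c(TRUE, 2L))       # "integer"

# integer mixed with double: everything becomes double
typeof(c(2L, 3.5))        # "double"
```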

This is annoying because the third survey taker typed out the number as a word instead of providing it as a numeric value, which makes the whole column character. So now you need to update the cars variable to be numeric. You do the following:

survey_results |>
  mutate(cars = as.numeric(cars))

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `cars = as.numeric(cars)`.
Caused by warning:
! NAs introduced by coercion

# A tibble: 3 × 1
   cars
  <dbl>
1     1
2     2
3    NA

And now things are even more annoying: you get the warning "NAs introduced by coercion" while computing cars = as.numeric(cars), and the response from the third survey taker is now an NA (you lost their data). Fix your mutate() call to avoid this warning.

survey_results |>
  mutate(
    cars = if_else(cars == "three", "3", cars),
    cars = as.numeric(cars)
  )

# A tibble: 3 × 1
   cars
  <dbl>
1     1
2     2
3     3

# or with parse_number()
survey_results |>
  mutate(
    cars = if_else(cars == "three", "3", cars),
    cars = parse_number(cars)
  )

# A tibble: 3 × 1
   cars
  <dbl>
1     1
2     2
3     3
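The if_else() fix handles one spelled-out value. If several respondents had typed words, one option is case_match(), sketched below under two assumptions: dplyr 1.1.0 or later is installed, and the data look like the hypothetical more_results tibble (it is not part of the exercise).

```r
library(dplyr)

# hypothetical responses with more than one spelled-out number
more_results <- tibble(cars = c("1", "two", "three", "4"))

recoded <- more_results |>
  mutate(
    # replace known words with digits, leave everything else unchanged
    cars = case_match(
      cars,
      "one" ~ "1",
      "two" ~ "2",
      "three" ~ "3",
      .default = cars
    ),
    # every value is now a digit string, so no NAs are introduced
    cars = as.numeric(cars)
  )

recoded
```

Any word missing from the lookup would still become NA with a warning, which is a useful signal that the recoding is incomplete.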

Hotel bookings

# From TidyTuesday: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md

hotels <- read_csv("data/hotels-tt.csv")
skim(hotels) # much more useful to run interactively in the console

Data summary
Name	hotels
Number of rows	119390
Number of columns	32
_______________________
Column type frequency:
character	13
Date	1
numeric	18
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
hotel	1	10	12	2
arrival_date_month	1	3	9	12
meal	1	2	9	5
country	1	2	4	178
market_segment	1	6	13	8
distribution_channel	1	3	9	5
reserved_room_type	1	1	1	10
assigned_room_type	1	1	1	12
deposit_type	1	10	10	3
agent	1	1	4	334
company	1	1	4	353
customer_type	1	5	15	4
reservation_status	1	7	9	3

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
reservation_status_date	0	1	2014-10-17	2017-09-14	2016-08-07	926

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
is_canceled	0	1	0.37	0.48	0.00	0.00	0.00	1	1	▇▁▁▁▅
lead_time	0	1	104.01	106.86	0.00	18.00	69.00	160	737	▇▂▁▁▁
arrival_date_year	0	1	2016.16	0.71	2015.00	2016.00	2016.00	2017	2017	▃▁▇▁▆
arrival_date_week_number	0	1	27.17	13.61	1.00	16.00	28.00	38	53	▅▇▇▇▅
arrival_date_day_of_month	0	1	15.80	8.78	1.00	8.00	16.00	23	31	▇▇▇▇▆
stays_in_weekend_nights	0	1	0.93	1.00	0.00	0.00	1.00	2	19	▇▁▁▁▁
stays_in_week_nights	0	1	2.50	1.91	0.00	1.00	2.00	3	50	▇▁▁▁▁
adults	0	1	1.86	0.58	0.00	2.00	2.00	2	55	▇▁▁▁▁
children	4	1	0.10	0.40	0.00	0.00	0.00	0	10	▇▁▁▁▁
babies	0	1	0.01	0.10	0.00	0.00	0.00	0	10	▇▁▁▁▁
is_repeated_guest	0	1	0.03	0.18	0.00	0.00	0.00	0	1	▇▁▁▁▁
previous_cancellations	0	1	0.09	0.84	0.00	0.00	0.00	0	26	▇▁▁▁▁
previous_bookings_not_canceled	0	1	0.14	1.50	0.00	0.00	0.00	0	72	▇▁▁▁▁
booking_changes	0	1	0.22	0.65	0.00	0.00	0.00	0	21	▇▁▁▁▁
days_in_waiting_list	0	1	2.32	17.59	0.00	0.00	0.00	0	391	▇▁▁▁▁
adr	0	1	101.83	50.54	-6.38	69.29	94.58	126	5400	▇▁▁▁▁
required_car_parking_spaces	0	1	0.06	0.25	0.00	0.00	0.00	0	8	▇▁▁▁▁
total_of_special_requests	0	1	0.57	0.79	0.00	0.00	0.00	1	5	▇▁▁▁▁

Question: Take a look at the following visualization. How are the months ordered? What would be a better order?
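A small sketch of the ordering issue, assuming arrival_date_month is plotted as a character column (as the skim() output suggests): character values sort alphabetically, while a factor sorts by its levels. The arrival_months vector is made up for illustration.

```r
# hypothetical subset of month names
arrival_months <- c("July", "August", "January", "March")

# character vectors sort alphabetically
sorted_chr <- sort(arrival_months)
sorted_chr  # "August" "January" "July" "March"

# a factor with chronological levels sorts by calendar order
arrival_fct <- factor(arrival_months, levels = month.name)
sorted_fct <- as.character(sort(arrival_fct))
sorted_fct  # "January" "March" "July" "August"
```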

Solve using factors

Demo: Reorder the months on the x-axis (levels of arrival_date_month) in a way that makes more sense. You will want to use functions from the forcats package, see https://forcats.tidyverse.org/reference/index.html for inspiration and help.

month_names <- month.name
names(month_names) <- month.abb

# simple with factor()
hotels |>
  mutate(
    # convert to factor
    arrival_date_month = factor(
      x = arrival_date_month,
      levels = month.name,
      labels = month.abb
    )
  ) |>
  summarize(mean_adr = mean(adr), .by = c(hotel, arrival_date_month)) |>
  ggplot(mapping = aes(
    x = arrival_date_month,
    y = mean_adr,
    group = hotel,
    color = hotel
  )) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Arrival month",
    y = "Mean ADR (average daily rate)",
    title = "Comparison of resort and city hotel prices across months",
    subtitle = "Resort hotel prices soar in the summer while city hotel prices remain relatively constant throughout the year",
    color = "Hotel type"
  )

# more involved with forcats
hotels |>
  mutate(
    # convert to factor
    arrival_date_month = fct(x = arrival_date_month) |>
      # change order to be chronological
      fct_relevel(month.name) |>
      # change labels to be abbreviated for plotting
      fct_recode(!!!month_names)
  ) |>
  summarize(mean_adr = mean(adr), .by = c(hotel, arrival_date_month)) |>
  ggplot(mapping = aes(
    x = arrival_date_month,
    y = mean_adr,
    group = hotel,
    color = hotel
  )) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Arrival month",
    y = "Mean ADR (average daily rate)",
    title = "Comparison of resort and city hotel prices across months",
    subtitle = "Resort hotel prices soar in the summer while city hotel prices remain relatively constant throughout the year",
    color = "Hotel type"
  )
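The !!! in fct_recode(!!!month_names) splices a named character vector into the call, so each element is passed as a separate new = "old" argument. A minimal sketch with a two-month lookup (month_lookup and f are hypothetical names), assuming forcats 1.0.0 or later for fct():

```r
library(forcats)

# names are the new labels, values are the existing levels
month_lookup <- c(Jan = "January", Feb = "February")

f <- fct(c("January", "February", "January"))

# splicing the lookup vector...
spliced <- fct_recode(f, !!!month_lookup)

# ...is equivalent to writing each pair out by hand
spelled_out <- fct_recode(f, Jan = "January", Feb = "February")

identical(spliced, spelled_out)  # TRUE
```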

Solve using lubridate

Demo: Reorder the months on the x-axis (levels of arrival_date_month) in a way that makes more sense. You will want to use functions from the lubridate package, see https://lubridate.tidyverse.org/reference/index.html for inspiration and help.

hotels |>
  mutate(
    # create a date column and extract month
    arrival_date = str_glue("{arrival_date_month} {arrival_date_day_of_month}, {arrival_date_year}") |>
      mdy(),
    arrival_date_month = month(arrival_date, label = TRUE, abbr = TRUE),
    .before = arrival_date_year
  ) |>
  summarize(mean_adr = mean(adr), .by = c(hotel, arrival_date_month)) |>
  ggplot(mapping = aes(
    x = arrival_date_month,
    y = mean_adr,
    group = hotel,
    color = hotel
  )) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Arrival month",
    y = "Mean ADR (average daily rate)",
    title = "Comparison of resort and city hotel prices across months",
    subtitle = "Resort hotel prices soar in the summer while city hotel prices remain relatively constant throughout the year",
    color = "Hotel type"
  )
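To see what the two lubridate steps do on a single value, here is a small sketch with a made-up date string. It assumes an English locale, since both parsing month names and the printed labels depend on locale.

```r
library(lubridate)

# mdy() parses "month day, year" text into a Date
arrival <- mdy("July 1, 2015")
class(arrival)  # "Date"

# month(label = TRUE) returns an ordered factor with all 12 months as levels
arrival_month <- month(arrival, label = TRUE, abbr = TRUE)
is.ordered(arrival_month)  # TRUE
nlevels(arrival_month)     # 12
as.integer(arrival_month)  # 7
```

Because the result is already an ordered factor in calendar order, no extra releveling is needed before plotting.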

Stretch goal: If you finish the above task before time is up, change the y-axis label so the values are shown with dollar signs, e.g. $80 instead of 80. You will want to use a function from the scales package, see https://scales.r-lib.org/reference/index.html for inspiration and help.

Additionally, adjust the fig-width code chunk option so that the entire title fits on the plot.

```{r}
#| label: hotels-plot-improve
#| fig-width: 8

# either approach above could be used here
hotels |>
  mutate(
    # convert to factor, use labels argument to create short versions
    arrival_date_month = factor(x = arrival_date_month, levels = month.name, labels = month.abb),
    # explicitly set the level order using month.abb (factor() above already orders the levels this way)
    arrival_date_month = fct_relevel(.f = arrival_date_month, month.abb)
  ) |>
  summarize(mean_adr = mean(adr), .by = c(hotel, arrival_date_month)) |>
  ggplot(mapping = aes(
    x = arrival_date_month,
    y = mean_adr,
    group = hotel,
    color = hotel
  )) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Arrival month",
    y = "Mean ADR (average daily rate)",
    title = "Comparison of resort and city hotel prices across months",
    subtitle = "Resort hotel prices soar in the summer while city hotel prices remain relatively constant throughout the year",
    color = "Hotel type"
  ) +
  scale_y_continuous(labels = label_dollar())
```
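label_dollar() does not format numbers itself; it returns a function that scale_y_continuous() then applies to the axis breaks. A quick sketch of calling that function directly on a few made-up values:

```r
library(scales)

# label_dollar() builds a labelling function
dollar_labeller <- label_dollar()

# applying it to whole numbers gives dollar-prefixed strings
dollar_labels <- dollar_labeller(c(80, 100, 120))
dollar_labels  # "$80" "$100" "$120"
```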

Session information

sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       macOS Ventura 13.5.2
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2023-11-01
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 base64enc     0.1-3   2015-07-28 [1] CRAN (R 4.3.0)
 bit           4.0.5   2022-11-15 [1] CRAN (R 4.3.0)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.3.0)
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
 colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.3.0)
 digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.0)
 dplyr       * 1.1.3   2023-09-03 [1] CRAN (R 4.3.0)
 evaluate      0.22    2023-09-29 [1] CRAN (R 4.3.1)
 fansi         1.0.5   2023-10-08 [1] CRAN (R 4.3.1)
 farver        2.1.1   2022-07-06 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.2   2023-04-03 [1] CRAN (R 4.3.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
 gtable        0.3.3   2023-03-21 [1] CRAN (R 4.3.0)
 here          1.0.1   2020-12-13 [1] CRAN (R 4.3.0)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
 htmltools     0.5.6.1 2023-10-06 [1] CRAN (R 4.3.1)
 htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.3.0)
 jsonlite      1.8.7   2023-06-29 [1] CRAN (R 4.3.0)
 knitr         1.44    2023-09-11 [1] CRAN (R 4.3.0)
 labeling      0.4.2   2020-10-20 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
 lubridate   * 1.9.2   2023-02-10 [1] CRAN (R 4.3.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 ragg          1.2.5   2023-01-12 [1] CRAN (R 4.3.0)
 readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.3.0)
 repr          1.1.6   2023-01-26 [1] CRAN (R 4.3.0)
 rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
 rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
 rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.3.0)
 rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.3.0)
 scales      * 1.2.1   2022-08-20 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 skimr       * 2.1.5   2022-12-23 [1] CRAN (R 4.3.0)
 stringi       1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
 systemfonts   1.0.4   2022-02-11 [1] CRAN (R 4.3.0)
 textshaping   0.3.6   2021-10-13 [1] CRAN (R 4.3.0)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.0)
 tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
 utf8          1.2.3   2023-01-31 [1] CRAN (R 4.3.0)
 vctrs         0.6.4   2023-10-12 [1] CRAN (R 4.3.1)
 vroom         1.6.3   2023-04-28 [1] CRAN (R 4.3.0)
 withr         2.5.1   2023-09-26 [1] CRAN (R 4.3.1)
 xfun          0.40    2023-08-09 [1] CRAN (R 4.3.0)
 yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────