AE 06: Data types and classes
Packages
We will use the following three packages in this application exercise.
- tidyverse: For data import, wrangling, and visualization.
- skimr: For summarizing the entire data frame at once.
- scales: For better axis labels.
library(tidyverse)
library(skimr)
library(scales)
Type coercion
- Demo: Determine the type of the following vector, then change the type to numeric.
x <- c("1", "2", "3")
typeof(x)
[1] "character"
as.numeric(x)
[1] 1 2 3
parse_number(x)
[1] 1 2 3
- Demo: Once again, determine the type of the following vector, then change the type to numeric. What’s different from the previous exercise?
y <- c("a", "b", "c")
typeof(y)
[1] "character"
as.numeric(y)
Warning: NAs introduced by coercion
[1] NA NA NA
parse_number(y)
Warning: 3 parsing failures.
row col expected actual
1 -- a number a
2 -- a number b
3 -- a number c
[1] NA NA NA
attr(,"problems")
# A tibble: 3 × 4
row col expected actual
<int> <int> <chr> <chr>
1 1 NA a number a
2 2 NA a number b
3 3 NA a number c
- Demo: Once again, determine the type of the following vector, then change the type to numeric. What’s different from the previous exercise?
z <- c("1", "2", "three")
typeof(z)
[1] "character"
as.numeric(z)
Warning: NAs introduced by coercion
[1] 1 2 NA
parse_number(z)
Warning: 1 parsing failure.
row col expected actual
3 -- a number three
[1] 1 2 NA
attr(,"problems")
# A tibble: 1 × 4
row col expected actual
<int> <int> <chr> <chr>
1 3 NA a number three
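The three demos show that coercion silently replaces unparseable values with NA. One way to see which values would be lost before overwriting anything is sketched below, using only base R. The names responses, converted, and fails_to_convert are hypothetical and not part of the exercise.

```r
# Hypothetical input mixing digit strings and a spelled-out number
responses <- c("1", "2", "three")

# Attempt the conversion, silencing the coercion warning on purpose
converted <- suppressWarnings(as.numeric(responses))

# TRUE where a non-missing input became NA, i.e. the conversion failed
fails_to_convert <- is.na(converted) & !is.na(responses)

# The original values that would be lost
responses[fails_to_convert]
```

Inspecting the flagged values first makes it easier to decide whether to recode them or accept the NA.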
Recoding survey results
Demo: Suppose you conducted a survey where you asked people how many cars their household owns collectively. The answers are as follows:
survey_results <- tibble(cars = c(1, 2, "three"))
survey_results
# A tibble: 3 × 1
cars
<chr>
1 1
2 2
3 three
This is annoying because of that third survey taker who just had to go and type out the number instead of providing it as a numeric value. So now you need to update the cars variable to be numeric. You do the following:
survey_results |>
  mutate(cars = as.numeric(cars))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `cars = as.numeric(cars)`.
Caused by warning:
! NAs introduced by coercion
# A tibble: 3 × 1
cars
<dbl>
1 1
2 2
3 NA
And now things are even more annoying because you get the warning NAs introduced by coercion, raised while computing cars = as.numeric(cars), and the response from the third survey taker is now an NA (you lost their data). Fix your mutate() call to avoid this warning.
survey_results |>
  mutate(
    cars = if_else(cars == "three", "3", cars),
    cars = as.numeric(cars)
  )
# A tibble: 3 × 1
cars
<dbl>
1 1
2 2
3 3
# or with parse_number()
survey_results |>
  mutate(
    cars = if_else(cars == "three", "3", cars),
    cars = parse_number(cars)
  )
# A tibble: 3 × 1
cars
<dbl>
1 1
2 2
3 3
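The if_else() fix handles exactly one spelled-out value. If several respondents had typed words, a named lookup vector might scale better. The sketch below is only one possible approach; the lookup vector number_words and the result name recoded are hypothetical, and any word missing from the lookup would still become NA with a warning.

```r
library(tidyverse)

survey_results <- tibble(cars = c("1", "2", "three"))

# Hypothetical lookup: spelled-out number (name) to digit string (value)
number_words <- c(one = "1", two = "2", three = "3")

recoded <- survey_results |>
  mutate(
    # use the lookup where the response matches a name, otherwise keep the response
    cars = coalesce(unname(number_words[cars]), cars),
    cars = as.numeric(cars)
  )

recoded
```

Indexing the lookup by a value that is not one of its names returns NA, which is why coalesce() can fall back to the original response.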
Hotel bookings
# From TidyTuesday: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md
hotels <- read_csv("data/hotels-tt.csv")
skim(hotels) # much more useful to run interactively in the console
Name | hotels |
Number of rows | 119390 |
Number of columns | 32 |
_______________________ | |
Column type frequency: | |
character | 13 |
Date | 1 |
numeric | 18 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
hotel | 0 | 1 | 10 | 12 | 0 | 2 | 0 |
arrival_date_month | 0 | 1 | 3 | 9 | 0 | 12 | 0 |
meal | 0 | 1 | 2 | 9 | 0 | 5 | 0 |
country | 0 | 1 | 2 | 4 | 0 | 178 | 0 |
market_segment | 0 | 1 | 6 | 13 | 0 | 8 | 0 |
distribution_channel | 0 | 1 | 3 | 9 | 0 | 5 | 0 |
reserved_room_type | 0 | 1 | 1 | 1 | 0 | 10 | 0 |
assigned_room_type | 0 | 1 | 1 | 1 | 0 | 12 | 0 |
deposit_type | 0 | 1 | 10 | 10 | 0 | 3 | 0 |
agent | 0 | 1 | 1 | 4 | 0 | 334 | 0 |
company | 0 | 1 | 1 | 4 | 0 | 353 | 0 |
customer_type | 0 | 1 | 5 | 15 | 0 | 4 | 0 |
reservation_status | 0 | 1 | 7 | 9 | 0 | 3 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
reservation_status_date | 0 | 1 | 2014-10-17 | 2017-09-14 | 2016-08-07 | 926 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
is_canceled | 0 | 1 | 0.37 | 0.48 | 0.00 | 0.00 | 0.00 | 1 | 1 | ▇▁▁▁▅ |
lead_time | 0 | 1 | 104.01 | 106.86 | 0.00 | 18.00 | 69.00 | 160 | 737 | ▇▂▁▁▁ |
arrival_date_year | 0 | 1 | 2016.16 | 0.71 | 2015.00 | 2016.00 | 2016.00 | 2017 | 2017 | ▃▁▇▁▆ |
arrival_date_week_number | 0 | 1 | 27.17 | 13.61 | 1.00 | 16.00 | 28.00 | 38 | 53 | ▅▇▇▇▅ |
arrival_date_day_of_month | 0 | 1 | 15.80 | 8.78 | 1.00 | 8.00 | 16.00 | 23 | 31 | ▇▇▇▇▆ |
stays_in_weekend_nights | 0 | 1 | 0.93 | 1.00 | 0.00 | 0.00 | 1.00 | 2 | 19 | ▇▁▁▁▁ |
stays_in_week_nights | 0 | 1 | 2.50 | 1.91 | 0.00 | 1.00 | 2.00 | 3 | 50 | ▇▁▁▁▁ |
adults | 0 | 1 | 1.86 | 0.58 | 0.00 | 2.00 | 2.00 | 2 | 55 | ▇▁▁▁▁ |
children | 4 | 1 | 0.10 | 0.40 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
babies | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
is_repeated_guest | 0 | 1 | 0.03 | 0.18 | 0.00 | 0.00 | 0.00 | 0 | 1 | ▇▁▁▁▁ |
previous_cancellations | 0 | 1 | 0.09 | 0.84 | 0.00 | 0.00 | 0.00 | 0 | 26 | ▇▁▁▁▁ |
previous_bookings_not_canceled | 0 | 1 | 0.14 | 1.50 | 0.00 | 0.00 | 0.00 | 0 | 72 | ▇▁▁▁▁ |
booking_changes | 0 | 1 | 0.22 | 0.65 | 0.00 | 0.00 | 0.00 | 0 | 21 | ▇▁▁▁▁ |
days_in_waiting_list | 0 | 1 | 2.32 | 17.59 | 0.00 | 0.00 | 0.00 | 0 | 391 | ▇▁▁▁▁ |
adr | 0 | 1 | 101.83 | 50.54 | -6.38 | 69.29 | 94.58 | 126 | 5400 | ▇▁▁▁▁ |
required_car_parking_spaces | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0 | 8 | ▇▁▁▁▁ |
total_of_special_requests | 0 | 1 | 0.57 | 0.79 | 0.00 | 0.00 | 0.00 | 1 | 5 | ▇▁▁▁▁ |
Question: Take a look at the following visualization. How are the months ordered? What would be a better order?
Solve using factors
Demo: Reorder the months on the x-axis (levels of arrival_date_month) in a way that makes more sense. You will want to use functions from the forcats package; see https://forcats.tidyverse.org/reference/index.html for inspiration and help.
month_names <- month.name
names(month_names) <- month.abb
# simple with factor()
hotels |>
  mutate(
    # convert to factor
    arrival_date_month = factor(
      x = arrival_date_month,
      levels = month.name,
      labels = month.abb
    )
  ) |>
  summarize(mean_adr = mean(adr), .by = c(hotel, arrival_date_month)) |>
  ggplot(mapping = aes(
    x = arrival_date_month,
    y = mean_adr,
    group = hotel,
    color = hotel
  )) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Arrival month",
    y = "Mean ADR (average daily rate)",
    title = "Comparison of resort and city hotel prices across months",
    subtitle = "Resort hotel prices soar in the summer while city hotel prices remain relatively constant throughout the year",
    color = "Hotel type"
  )
# more involved with forcats
hotels |>
  mutate(
    # convert to factor
    arrival_date_month = fct(x = arrival_date_month) |>
      # change order to be chronological
      fct_relevel(month.name) |>
      # change labels to be abbreviated for plotting
      fct_recode(!!!month_names)
  ) |>
  summarize(mean_adr = mean(adr), .by = c(hotel, arrival_date_month)) |>
  ggplot(mapping = aes(
    x = arrival_date_month,
    y = mean_adr,
    group = hotel,
    color = hotel
  )) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Arrival month",
    y = "Mean ADR (average daily rate)",
    title = "Comparison of resort and city hotel prices across months",
    subtitle = "Resort hotel prices soar in the summer while city hotel prices remain relatively constant throughout the year",
    color = "Hotel type"
  )
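To see why converting to a factor changes the axis order, it may help to compare how a character vector and a factor sort. This is a minimal base R sketch on a hypothetical handful of month names (months_chr and months_fct are illustrative names, not columns of hotels).

```r
# Hypothetical mini-sample of month names stored as plain text
months_chr <- c("March", "January", "February", "January")

# Character values sort alphabetically
chr_order <- sort(unique(months_chr))
chr_order

# A factor with explicit levels sorts in level order instead;
# labels swaps the full names for abbreviations
months_fct <- factor(months_chr, levels = month.name, labels = month.abb)
fct_order <- sort(unique(months_fct))
fct_order
```

ggplot2 places discrete values on an axis in the same way: alphabetically for character columns, by level order for factors.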
Solve using lubridate
Demo: Reorder the months on the x-axis (levels of arrival_date_month) in a way that makes more sense. You will want to use functions from the lubridate package; see https://lubridate.tidyverse.org/reference/index.html for inspiration and help.
hotels |>
  mutate(
    # create a date column and extract month
    arrival_date = str_glue("{arrival_date_month} {arrival_date_day_of_month}, {arrival_date_year}") |>
      mdy(),
    arrival_date_month = month(arrival_date, label = TRUE, abbr = TRUE),
    .before = arrival_date_year
  ) |>
  summarize(mean_adr = mean(adr), .by = c(hotel, arrival_date_month)) |>
  ggplot(mapping = aes(
    x = arrival_date_month,
    y = mean_adr,
    group = hotel,
    color = hotel
  )) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Arrival month",
    y = "Mean ADR (average daily rate)",
    title = "Comparison of resort and city hotel prices across months",
    subtitle = "Resort hotel prices soar in the summer while city hotel prices remain relatively constant throughout the year",
    color = "Hotel type"
  )
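The two lubridate steps in the pipeline above can be tried on their own with a few made-up strings. This is a small sketch, assuming an English locale so that the month names parse; arrival_strings, arrival_dates, and arrival_months are hypothetical names.

```r
library(lubridate)

# Hypothetical arrival dates, written the way the glued strings would look
arrival_strings <- c("July 1, 2015", "January 15, 2016", "December 31, 2016")

# mdy() parses month-day-year text into Date values
arrival_dates <- mdy(arrival_strings)

# label = TRUE returns an ordered factor whose levels run from January to December
arrival_months <- month(arrival_dates, label = TRUE, abbr = TRUE)

is.ordered(arrival_months)
as.integer(arrival_months)
```

Because the result is already an ordered factor in calendar order, no separate releveling step is needed before plotting.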
Stretch goal: If you finish the above task before time is up, change the y-axis labels so the values are shown with dollar signs, e.g. $80 instead of 80. You will want to use a function from the scales package; see https://scales.r-lib.org/reference/index.html for inspiration and help.
Additionally, adjust the fig-width code chunk option so that the entire title fits on the plot.
```{r}
#| label: hotels-plot-improve
#| fig-width: 8
# either approach above could be used here
hotels |>
  mutate(
    # convert to factor, use labels argument to create short versions
    arrival_date_month = factor(x = arrival_date_month, levels = month.name, labels = month.abb),
    # adjust the level order using month.abb
    arrival_date_month = fct_relevel(.f = arrival_date_month, month.abb)
  ) |>
  summarize(mean_adr = mean(adr), .by = c(hotel, arrival_date_month)) |>
  ggplot(mapping = aes(
    x = arrival_date_month,
    y = mean_adr,
    group = hotel,
    color = hotel
  )) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Arrival month",
    y = "Mean ADR (average daily rate)",
    title = "Comparison of resort and city hotel prices across months",
    subtitle = "Resort hotel prices soar in the summer while city hotel prices remain relatively constant throughout the year",
    color = "Hotel type"
  ) +
  scale_y_continuous(labels = label_dollar())
```
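label_dollar() does not format numbers itself; it returns a function that does, which is why it is called with parentheses inside scale_y_continuous(). A quick way to check what the axis labels would look like is to call that function directly. The name dollar_labels and the sample values are hypothetical.

```r
library(scales)

# label_dollar() builds a labelling function
dollar_labels <- label_dollar()

# Apply it to a few example axis breaks
formatted <- dollar_labels(c(80, 120, 160))
formatted
```

The same pattern applies to other label_*() helpers in scales, so they can be tested on a small vector before being added to a plot.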
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.1 (2023-06-16)
os macOS Ventura 13.5.2
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2023-11-01
pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.3.0)
bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)
digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)
dplyr * 1.1.3 2023-09-03 [1] CRAN (R 4.3.0)
evaluate 0.22 2023-09-29 [1] CRAN (R 4.3.1)
fansi 1.0.5 2023-10-08 [1] CRAN (R 4.3.1)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)
here 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
htmltools 0.5.6.1 2023-10-06 [1] CRAN (R 4.3.1)
htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)
jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)
knitr 1.44 2023-09-11 [1] CRAN (R 4.3.0)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
ragg 1.2.5 2023-01-12 [1] CRAN (R 4.3.0)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)
repr 1.1.6 2023-01-26 [1] CRAN (R 4.3.0)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1)
rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.3.0)
scales * 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
skimr * 2.1.5 2022-12-23 [1] CRAN (R 4.3.0)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)
systemfonts 1.0.4 2022-02-11 [1] CRAN (R 4.3.0)
textshaping 0.3.6 2021-10-13 [1] CRAN (R 4.3.0)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)
timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)
vctrs 0.6.4 2023-10-12 [1] CRAN (R 4.3.1)
vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)
withr 2.5.1 2023-09-26 [1] CRAN (R 4.3.1)
xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
[1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
──────────────────────────────────────────────────────────────────────────────