AE 02: Visualizing the prognosticators

Suggested answers

Application exercise
Answers
Important

These are suggested answers. This document should be used as reference only, it’s not designed to be an exhaustive key.

Note

Below are the contents of the YAML header. You normally do not see this in the rendered file - we include it so you can reproduce the figure dimensions in the document.

---
title: "AE 02: Visualizing the prognosticators"
subtitle: "Suggested answers"
editor: visual
execute:
  fig-width: 8
  fig-asp: 4
  fig-align: center
  warning: false
---

For all analyses, we’ll use the tidyverse packages.

library(tidyverse)

The dataset we will visualize is called seers.1 It contains summary statistics for all known Groundhog Day forecasters. 2 Let’s glimpse() at it .

1 I would prefer prognosticators, but I had way too many typos preparing these materials to make you all use it.

2 Source: Countdown to Groundhog Day. Application exercise inspired by Groundhogs Do Not Make Good Meteorologists originally published on FiveThirtyEight.

# import data using readr::read_csv()
seers <- read_csv("data/prognosticators-sum-stats.csv")

glimpse(seers)
Rows: 138
Columns: 18
$ name                  <chr> "Allen McButterpants", "Arboretum Annie", "Beard…
$ forecaster_type       <chr> "Groundhog", "Groundhog", "Stuffed Prairie Dog",…
$ forecaster_simple     <chr> "Groundhog", "Groundhog", "Other", "Other", "Oth…
$ alive                 <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ climate_region        <chr> "Northeast", "South", "Northeast", "South", "Ohi…
$ town                  <chr> "Hampton Bays", "Dallas", "Bridgeport", "Bee Cav…
$ state                 <chr> "NY", "TX", "CT", "TX", "OH", "TX", "TX", "AL", …
$ preds_n               <dbl> 1, 3, 12, 13, 8, 7, 1, 1, 12, 2, 4, 13, 10, 1, 4…
$ preds_long_winter     <dbl> 0, 1, 1, 3, 4, 5, 1, 1, 6, 2, 2, 8, 9, 0, 12, 6,…
$ preds_long_winter_pct <dbl> 0.00000000, 0.33333333, 0.08333333, 0.23076923, …
$ preds_correct         <dbl> 1, 2, 9, 9, 5, 3, 0, 0, 5, 0, 2, 5, 2, 0, 23, 0,…
$ preds_rate            <dbl> 1.0000000, 0.6666667, 0.7500000, 0.6923077, 0.62…
$ temp_mean             <dbl> 32.40000, 50.18333, 31.13333, 51.30769, 40.29375…
$ temp_hist             <dbl> 30.14667, 51.32000, 29.56639, 50.97615, 38.88542…
$ temp_sd               <dbl> 4.120782, 3.890902, 4.120782, 3.890902, 4.706475…
$ precip_mean           <dbl> 2.395000, 2.768333, 3.012500, 2.593462, 4.088125…
$ precip_hist           <dbl> 2.9743333, 2.5588889, 3.0695000, 2.5643846, 3.37…
$ precip_sd             <dbl> 0.9620579, 0.9029786, 0.9620579, 0.9029786, 1.18…

The variables are:

3 Prognosticators labeled as Animatronic/Puppet/Statue/Stuffed/Taxidermied are classified as not alive.

4 We adopt the same definition as FiveThirtyEight. An “Early Spring” is defined as any year in which the average temperature in either February or March was higher than the historic average. A “Late Winter” was when the average temperature in both months was lower than or the same as the historical average.

Visualizing prediction success rate - Demo

Single variable

Note

Analyzing the a single variable is called univariate analysis.

Create visualizations of the distribution of preds_rate for the prognosticators.

  1. Make a histogram. Set an appropriate binwidth.
ggplot(seers, aes(x = preds_rate)) +
  geom_histogram(binwidth = 0.02)

  1. Make a boxplot.
ggplot(seers, aes(x = preds_rate)) +
  geom_boxplot()

Two variables - Your turn

Note

Analyzing the relationship between two variables is called bivariate analysis.

Create visualizations of the distribution of preds_rate by alive (whether or not the prognosticator is alive).

  1. Make a single histogram. Set an appropriate binwidth.
ggplot(
  seers,
  aes(x = preds_rate, fill = alive)
) +
  geom_histogram(binwidth = 0.02, alpha = 0.5, color = "black")

  1. Use multiple histograms via faceting, one for each type. Set an appropriate binwidth, add color as you see fit, and turn off legends if not needed.
ggplot(
  seers,
  aes(x = preds_rate, fill = alive)
) +
  geom_histogram(binwidth = 0.02, show.legend = FALSE) +
  facet_wrap(vars(alive), ncol = 1)

  1. Use side-by-side box plots. Add color as you see fit and turn off legends if not needed.
ggplot(
  seers,
  aes(x = alive, y = preds_rate, fill = alive)
) +
  geom_boxplot(show.legend = FALSE)

  1. Use density plots. Add color as you see fit.
ggplot(
  seers,
  aes(x = preds_rate, fill = alive)
) +
  geom_density(alpha = 0.5)

  1. Use violin plots. Add color as you see fit and turn off legends if not needed.
ggplot(
  seers,
  aes(x = alive, y = preds_rate, fill = alive)
) +
  geom_violin(alpha = 0.5, show.legend = FALSE)

  1. Make a jittered scatter plot. Add color as you see fit and turn off legends if not needed.
ggplot(
  seers,
  aes(x = alive, y = preds_rate, color = alive)
) +
  geom_jitter(show.legend = FALSE)

  1. Use beeswarm plots. Add color as you see fit and turn off legends if not needed.
library(ggbeeswarm)

ggplot(
  seers,
  aes(x = alive, y = preds_rate, color = alive)
) +
  geom_beeswarm(show.legend = FALSE)

  1. Demonstration: Use multiple geoms on a single plot. Be deliberate about the order of plotting. Change the theme and the color scale of the plot. Finally, add informative labels.
ggplot(
  seers,
  aes(x = alive, y = preds_rate, color = alive)
) +
  geom_beeswarm(show.legend = FALSE) +
  geom_boxplot(show.legend = FALSE, alpha = 0.5) +
  scale_color_viridis_d(option = "D", end = 0.8) +
  scale_y_continuous(labels = scales::label_percent()) +
  theme_minimal() +
  labs(
    x = "Is the prognosticator alive?",
    y = "Prediction accuracy for late winter/early spring",
    title = "Accuracy of prognosticators predicting the coming season",
    subtitle = "By living status of prognosticator"
  )

Multiple variables - Demo

Note

Analyzing the relationship between three or more variables is called multivariate analysis.

  1. Facet the plot you created in the previous exercise by forecaster_simple. Adjust labels accordingly.
ggplot(
  seers,
  aes(x = alive, y = preds_rate, color = alive)
) +
  geom_beeswarm(show.legend = FALSE) +
  geom_boxplot(show.legend = FALSE, alpha = 0.5) +
  scale_color_viridis_d(option = "D", end = 0.8) +
  scale_y_continuous(labels = scales::label_percent()) +
  facet_wrap(vars(forecaster_simple)) +
  theme_minimal() +
  labs(
    x = "Is the prognosticator alive?",
    y = "Prediction accuracy for late winter/early spring",
    title = "Accuracy of prognosticators predicting the coming season",
    subtitle = "By type and living status of prognosticator"
  )

Before you continue, let’s turn off all warnings the code chunks generate and resize all figures. We’ll do this by editing the YAML.

Visualizing other variables - Your turn!

  1. Pick a single categorical variable from the data set and make a bar plot of its distribution.
# add code here
  1. Pick two categorical variables and make a visualization to visualize the relationship between the two variables. Along with your code and output, provide an interpretation of the visualization.
# add code here

Interpretation goes here…

  1. Make another plot that uses at least three variables. At least one should be numeric and at least one categorical. In 1-2 sentences, describe what the plot shows about the relationships between the variables you plotted. Don’t forget to label your code chunk.
# add code here

Interpretation goes here…

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       macOS Ventura 13.4.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2023-09-05
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 beeswarm      0.4.0   2021-06-01 [1] CRAN (R 4.3.0)
 bit           4.0.5   2022-11-15 [1] CRAN (R 4.3.0)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.3.0)
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
 colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.3.0)
 digest        0.6.31  2022-12-11 [1] CRAN (R 4.3.0)
 dplyr       * 1.1.2   2023-04-20 [1] CRAN (R 4.3.0)
 evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.0)
 fansi         1.0.4   2023-01-22 [1] CRAN (R 4.3.0)
 farver        2.1.1   2022-07-06 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 ggbeeswarm  * 0.7.2   2023-04-29 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.2   2023-04-03 [1] CRAN (R 4.3.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
 gtable        0.3.3   2023-03-21 [1] CRAN (R 4.3.0)
 here          1.0.1   2020-12-13 [1] CRAN (R 4.3.0)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
 htmltools     0.5.5   2023-03-23 [1] CRAN (R 4.3.0)
 htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.3.0)
 jsonlite      1.8.5   2023-06-05 [1] CRAN (R 4.3.0)
 knitr         1.43    2023-05-25 [1] CRAN (R 4.3.0)
 labeling      0.4.2   2020-10-20 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
 lubridate   * 1.9.2   2023-02-10 [1] CRAN (R 4.3.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 purrr       * 1.0.1   2023-01-10 [1] CRAN (R 4.3.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.3.0)
 rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
 rmarkdown     2.22    2023-06-01 [1] CRAN (R 4.3.0)
 rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.3.0)
 rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.3.0)
 scales        1.2.1   2022-08-20 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 stringi       1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.0)
 tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
 utf8          1.2.3   2023-01-31 [1] CRAN (R 4.3.0)
 vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.3.0)
 vipor         0.4.5   2017-03-22 [1] CRAN (R 4.3.0)
 viridisLite   0.4.2   2023-05-02 [1] CRAN (R 4.3.0)
 vroom         1.6.3   2023-04-28 [1] CRAN (R 4.3.0)
 withr         2.5.0   2022-03-03 [1] CRAN (R 4.3.0)
 xfun          0.39    2023-04-20 [1] CRAN (R 4.3.0)
 yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────