AE 01: Visualizing the prognosticators

Application exercise
Modified: September 4, 2024

Important

Go to the course GitHub organization and locate the repo titled ae-01-YOUR_GITHUB_USERNAME to get started.

This AE is due September 5 at 11:59pm.

For all analyses, we’ll use the tidyverse packages, along with the scales package.

library(tidyverse)
library(scales)

Data: The prognosticators

The dataset we will visualize is called seers.1 It contains summary statistics for all known Groundhog Day forecasters.2 Let’s glimpse() at it.

# import data using readr::read_csv()
seers <- read_csv("data/prognosticators-sum-stats.csv")
Rows: 154 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): name, forecaster_type, forecaster_simple, climate_region, town, state
dbl (11): preds_n, preds_long_winter, preds_long_winter_pct, preds_correct, ...
lgl  (1): alive

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# add code here
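As a rough illustration of what glimpse() reports, here is a minimal sketch on a tiny stand-in data frame. The object seers_toy and every value in it are invented for this example; only the column names come from the variable list. With the real data, the call would simply be glimpse(seers).

```r
library(tibble)  # part of the tidyverse; provides tibble() and glimpse()

# Hypothetical stand-in for seers: three invented rows, four of the real column names
seers_toy <- tibble(
  name = c("Example A", "Example B", "Example C"),
  alive = c(TRUE, TRUE, FALSE),
  preds_n = c(10, 25, 4),
  preds_rate = c(0.50, 0.40, 0.75)
)

# glimpse() prints the dimensions, then one line per column
# with the column's type and its first few values
glimpse(seers_toy)
```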

The variables are:

  • name - name of the prognosticator
  • forecaster_type - what kind of animal or thing is the prognosticator?
  • forecaster_simple - a simplified version that lumps together the least-frequently appearing types of prognosticators
  • alive - is the prognosticator an animate (alive) being?3
  • climate_region - the NOAA climate region in which the prognosticator is located
  • town - town where the prognosticator is located
  • state - state (or territory) where the prognosticator is located
  • preds_n - number of predictions in the database
  • preds_long_winter - number of predictions for a “Late Winter” (as opposed to “Early Spring”)
  • preds_long_winter_pct - percentage of predictions for a “Late Winter”
  • preds_correct - number of correct predictions4
  • preds_rate - proportion of predictions that are correct
  • temp_mean - average temperature (in Fahrenheit) in February and March in the climate region across all prognostication years
  • temp_hist - average of the rolling 15-year historic average temperature in February and March across all prognostication years
  • temp_sd - standard deviation of average February and March temperatures across all prognostication years
  • precip_mean - average amount of precipitation in February and March across all prognostication years (measured in inches of rainfall)
  • precip_hist - average of the rolling 15-year historic average precipitation in February and March across all prognostication years
  • precip_sd - standard deviation of average February and March precipitation across all prognostication years

Visualizing prediction success rate - Demo

Single variable

Note

Analyzing a single variable is called univariate analysis.

Create visualizations of the distribution of preds_rate for the prognosticators.

  1. Make a histogram. Set an appropriate binwidth.
# add code here
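One possible shape for this histogram is sketched below. It runs on a small invented data frame (seers_toy, a hypothetical stand-in for seers), and the binwidth of 0.05 is only a starting guess: since preds_rate is a proportion, each bar then covers five percentage points. With the real data, a different binwidth may well read better.

```r
library(ggplot2)

# Hypothetical stand-in for seers: invented success rates, not the course data
seers_toy <- data.frame(
  preds_rate = c(0.20, 0.35, 0.40, 0.45, 0.50, 0.50, 0.55, 0.60, 0.75, 1.00)
)

# binwidth is expressed in the units of the x variable (here, a proportion)
p_hist <- ggplot(seers_toy, aes(x = preds_rate)) +
  geom_histogram(binwidth = 0.05, color = "white") +
  # show the proportion as a percentage on the axis
  scale_x_continuous(labels = scales::label_percent())

p_hist
```

Trying a few values (say 0.02, 0.05, and 0.1) is a quick way to see how much the apparent shape of the distribution depends on the binwidth.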

Two variables - Your turn

Note

Analyzing the relationship between two variables is called bivariate analysis.

Create visualizations of the distribution of preds_rate by alive (whether or not the prognosticator is alive).

  1. Make a single histogram. Set an appropriate binwidth.
# add code here
  2. Use multiple histograms via faceting, one for each value of alive. Set an appropriate binwidth, add color as you see fit, and turn off legends if not needed.
# add code here
  3. Use side-by-side box plots. Add color as you see fit and turn off legends if not needed.
# add code here
  4. Use a density plot. Add color as you see fit.
# add code here
  5. Use a violin plot. Add color as you see fit and turn off legends if not needed.
# add code here
  6. Make a jittered scatter plot. Add color as you see fit and turn off legends if not needed.
# add code here
  7. Use beeswarm plots. Add color as you see fit and turn off legends if not needed.
library(ggbeeswarm)

# add code here
  8. Demonstration: Use multiple geoms on a single plot. Be deliberate about the order of plotting. Change the theme and the color scale of the plot. Finally, add informative labels.
# add code here
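To make two of the ideas from this list concrete, here is a minimal sketch of faceted histograms and of a plot that layers two geoms. It uses an invented data frame (seers_toy, a hypothetical stand-in for seers), and the binwidth, theme, color scale, and label text are illustrative choices rather than the intended answers.

```r
library(ggplot2)

# Hypothetical stand-in for seers: invented values, not the course data
seers_toy <- data.frame(
  alive = rep(c(TRUE, FALSE), each = 6),
  preds_rate = c(0.30, 0.40, 0.45, 0.50, 0.55, 0.70,
                 0.25, 0.50, 0.50, 0.60, 0.80, 1.00)
)

# Faceted histograms: one panel per value of alive. The fill color repeats
# what the facets already show, so the legend is switched off.
p_facet <- ggplot(seers_toy, aes(x = preds_rate, fill = alive)) +
  geom_histogram(binwidth = 0.1, show.legend = FALSE) +
  facet_wrap(vars(alive), ncol = 1)

# Layered plot: layers are drawn in the order they are added, so the
# box plots go first and the jittered points sit on top of them.
p_layers <- ggplot(seers_toy, aes(x = alive, y = preds_rate, color = alive)) +
  geom_boxplot(outlier.shape = NA, show.legend = FALSE) +  # points are drawn by the next layer
  geom_jitter(width = 0.2, height = 0, alpha = 0.7, show.legend = FALSE) +
  scale_color_viridis_d(end = 0.8) +
  theme_minimal() +
  labs(
    x = "Is the prognosticator alive?",
    y = "Proportion of correct predictions",
    title = "Prediction success rate by prognosticator type"
  )

p_facet
p_layers
```

Setting height = 0 in geom_jitter() keeps the points at their true preds_rate values and only spreads them horizontally.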

Multiple variables - Demo

Note

Analyzing the relationship between three or more variables is called multivariate analysis.

  1. Facet the plot you created in the previous exercise by forecaster_simple. Adjust labels accordingly.
# add code here
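A minimal sketch of adding a third variable through faceting follows. The data frame seers_toy is again an invented stand-in for seers, and the two forecaster_simple categories shown are made up for the example; the real column may contain different levels.

```r
library(ggplot2)

# Hypothetical stand-in for seers: invented values and invented category labels
seers_toy <- data.frame(
  forecaster_simple = rep(c("Groundhog", "Other"), each = 6),
  alive = rep(c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE), times = 2),
  preds_rate = c(0.35, 0.45, 0.55, 0.40, 0.50, 0.60,
                 0.30, 0.50, 0.70, 0.25, 0.75, 1.00)
)

# Same layered plot as before, now split into one panel per forecaster type
p_multi <- ggplot(seers_toy, aes(x = alive, y = preds_rate, color = alive)) +
  geom_boxplot(outlier.shape = NA, show.legend = FALSE) +
  geom_jitter(width = 0.2, height = 0, show.legend = FALSE) +
  facet_wrap(vars(forecaster_simple)) +
  labs(
    x = "Is the prognosticator alive?",
    y = "Proportion of correct predictions",
    title = "Prediction success rate by type of forecaster"
  )

p_multi
```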

Before you continue, let’s turn off all warnings the code chunks generate and resize all figures. We’ll do this by editing the YAML header of the document.

Visualizing other variables - Your turn!

  1. Pick a single categorical variable from the data set and make a bar plot of its distribution.
# add code here
  2. Pick two categorical variables and make a visualization to visualize the relationship between the two variables. Along with your code and output, provide an interpretation of the visualization.
# add code here

Interpretation goes here…

  3. Make another plot that uses at least three variables. At least one should be numeric and at least one categorical. In 1-2 sentences, describe what the plot shows about the relationships between the variables you plotted. Don’t forget to label your code chunk.
# add code here

Interpretation goes here…
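If it helps to see the general pattern for categorical variables, here is a minimal sketch of a bar plot for one variable and a filled bar plot for two. The rows in seers_toy are invented for the example, and the choice of climate_region and alive is only one of many possible pairings.

```r
library(ggplot2)

# Hypothetical stand-in for seers: invented rows, not the course data
seers_toy <- data.frame(
  climate_region = c("Northeast", "Northeast", "Northeast", "South", "South", "West"),
  alive = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# One categorical variable: geom_bar() counts the rows in each category.
# Mapping the variable to y gives horizontal bars, which keeps long labels readable.
p_bar <- ggplot(seers_toy, aes(y = climate_region)) +
  geom_bar()

# Two categorical variables: position = "fill" rescales every bar to length 1,
# so the segments show proportions within each region rather than raw counts.
p_fill <- ggplot(seers_toy, aes(y = climate_region, fill = alive)) +
  geom_bar(position = "fill") +
  labs(x = "Proportion", y = "Climate region", fill = "Alive")

p_bar
p_fill
```

Proportions make regions with very different numbers of prognosticators comparable, at the cost of hiding how many prognosticators each bar represents.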

Footnotes

  1. I would prefer prognosticators, but I had way too many typos preparing these materials to make you all use it.

  2. Source: Countdown to Groundhog Day. Application exercise inspired by Groundhogs Do Not Make Good Meteorologists originally published on FiveThirtyEight.

  3. Prognosticators labeled as Animatronic/Puppet/Statue/Stuffed/Taxidermied are classified as not alive.

  4. We adopt the same definition as FiveThirtyEight. An “Early Spring” is any year in which the average temperature in either February or March was higher than the historic average. A “Late Winter” is any year in which the average temperature in both months was lower than or equal to the historic average.