AE 01: Visualizing the prognosticators
Go to the course GitHub organization and locate the repo titled ae-01-YOUR_GITHUB_USERNAME
to get started.
This AE is due September 5 at 11:59pm.
For all analyses, we’ll use the tidyverse packages.
Data: The prognosticators
The dataset we will visualize is called seers.1 It contains summary statistics for all known Groundhog Day forecasters.2 Let’s glimpse() at it.
1 I would prefer prognosticators, but I had way too many typos preparing these materials to make you all use it.
2 Source: Countdown to Groundhog Day. Application exercise inspired by Groundhogs Do Not Make Good Meteorologists originally published on FiveThirtyEight.
# load the tidyverse packages (includes readr and ggplot2)
library(tidyverse)
# import data using readr::read_csv()
seers <- read_csv("data/prognosticators-sum-stats.csv")
Rows: 154 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): name, forecaster_type, forecaster_simple, climate_region, town, state
dbl (11): preds_n, preds_long_winter, preds_long_winter_pct, preds_correct, ...
lgl (1): alive
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# add code here
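As a rough sketch of what glimpse() does, the block below builds a tiny stand-in tibble and glimpses it. The name toy_seers and every value in it are made up for illustration; only the column names are borrowed from the real data.

```r
library(tidyverse)

# Hypothetical stand-in for seers: three made-up rows using a few of the real column names
toy_seers <- tibble(
  name = c("Forecaster A", "Forecaster B", "Forecaster C"),
  alive = c(TRUE, FALSE, TRUE),
  preds_n = c(10, 4, 25),
  preds_rate = c(0.50, 0.25, 0.48)
)

# glimpse() prints the dimensions, then one line per column:
# the column name, its type, and the first few values
glimpse(toy_seers)
```

With the real data loaded, the equivalent call would be glimpse(seers).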
The variables are:
- name - name of the prognosticator
- forecaster_type - what kind of animal or thing is the prognosticator?
- forecaster_simple - a simplified version that lumps together the least-frequently appearing types of prognosticators
- alive - is the prognosticator an animate (alive) being?3
- climate_region - the NOAA climate region in which the prognosticator is located.
- town - self-explanatory
- state - state (or territory) where prognosticator is located
- preds_n - number of predictions in the database
- preds_long_winter - number of predictions for a “Late Winter” (as opposed to “Early Spring”)
- preds_long_winter_pct - percentage of predictions for a “Late Winter”
- preds_correct - number of correct predictions4
- preds_rate - proportion of predictions that are correct
- temp_mean - average temperature (in Fahrenheit) in February and March in the climate region across all prognostication years
- temp_hist - average of the rolling 15-year historic average temperature in February and March across all prognostication years
- temp_sd - standard deviation of average February and March temperatures across all prognostication years
- precip_mean - average amount of precipitation in February and March across all prognostication years (measured in rainfall inches)
- precip_hist - average of the rolling 15-year historic average precipitation in February and March across all prognostication years
- precip_sd - standard deviation of average February and March precipitation across all prognostication years
3 Prognosticators labeled as Animatronic/Puppet/Statue/Stuffed/Taxidermied are classified as not alive.
4 We adopt the same definition as FiveThirtyEight. An “Early Spring” is defined as any year in which the average temperature in either February or March was higher than the historic average. A “Late Winter” was when the average temperature in both months was lower than or the same as the historical average.
Visualizing prediction success rate - Demo
Single variable
Analyzing a single variable is called univariate analysis.
Create visualizations of the distribution of preds_rate for the prognosticators.
- Make a histogram. Set an appropriate binwidth.
# add code here
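One possible shape for the histogram is sketched below. It uses a small made-up data frame (toy_seers, a hypothetical name) so that it runs on its own; the values are not the real success rates, and the binwidth of 0.1 is just a starting guess.

```r
library(ggplot2)

# Hypothetical data: made-up success rates, NOT the real seers values
toy_seers <- data.frame(
  preds_rate = c(0.20, 0.35, 0.40, 0.45, 0.50, 0.50, 0.55, 0.60, 0.65, 0.80)
)

# preds_rate is a proportion between 0 and 1, so binwidths such as
# 0.05 or 0.1 are sensible values to try first
p <- ggplot(toy_seers, aes(x = preds_rate)) +
  geom_histogram(binwidth = 0.1)

p
```

With the real data, swap toy_seers for seers and compare a few binwidths: too narrow gives a spiky plot, too wide hides the shape of the distribution.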
Two variables - Your turn
Analyzing the relationship between two variables is called bivariate analysis.
Create visualizations of the distribution of preds_rate by alive (whether or not the prognosticator is alive).
- Make a single histogram. Set an appropriate binwidth.
# add code here
- Use multiple histograms via faceting, one for each value of alive. Set an appropriate binwidth, add color as you see fit, and turn off legends if not needed.
# add code here
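A minimal sketch of the faceting approach, again on made-up data (toy_seers and its values are hypothetical). The fill color repeats what the facet labels already say, so the legend is switched off.

```r
library(ggplot2)

# Hypothetical data: four made-up forecasters in each group
toy_seers <- data.frame(
  alive = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  preds_rate = c(0.30, 0.45, 0.50, 0.60, 0.25, 0.40, 0.55, 0.70)
)

p <- ggplot(toy_seers, aes(x = preds_rate, fill = alive)) +
  # show.legend = FALSE drops the redundant fill legend
  geom_histogram(binwidth = 0.1, show.legend = FALSE) +
  # one panel per value of alive, stacked so the x-axes line up
  facet_wrap(vars(alive), ncol = 1)

p
```

Stacking the panels in one column is a design choice: it keeps a shared x-axis, which makes the two distributions easier to compare than side-by-side panels.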
- Use side-by-side box plots. Add color as you see fit and turn off legends if not needed.
# add code here
- Use a density plot. Add color as you see fit.
# add code here
- Use a violin plot. Add color as you see fit and turn off legends if not needed.
# add code here
- Make a jittered scatter plot. Add color as you see fit and turn off legends if not needed.
# add code here
- Use beeswarm plots. Add color as you see fit and turn off legends if not needed.
library(ggbeeswarm)
# add code here
- Demonstration: Use multiple geoms on a single plot. Be deliberate about the order of plotting. Change the theme and the color scale of the plot. Finally, add informative labels.
# add code here
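The sketch below shows one way to layer geoms, not necessarily the combination used in the demonstration. It assumes a box plot drawn first with jittered points on top, on hypothetical data (toy_seers and its values are made up), with a changed theme, a changed color scale, and labels.

```r
library(ggplot2)

# Hypothetical data: made-up success rates for two groups
toy_seers <- data.frame(
  alive = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  preds_rate = c(0.30, 0.45, 0.50, 0.60, 0.25, 0.40, 0.55, 0.70)
)

p <- ggplot(toy_seers, aes(x = alive, y = preds_rate, color = alive)) +
  # layers are drawn in the order they are added, so the box plot sits underneath;
  # outlier.shape = NA avoids plotting outliers twice, since the jitter layer
  # already shows every observation
  geom_boxplot(outlier.shape = NA) +
  # jitter horizontally only, so the y values stay exact
  geom_jitter(width = 0.2, height = 0, alpha = 0.6) +
  # discrete viridis palette; guide = "none" hides the redundant legend
  scale_color_viridis_d(end = 0.8, guide = "none") +
  theme_minimal() +
  labs(
    title = "Prediction success rate by forecaster status",
    x = "Is the prognosticator alive?",
    y = "Proportion of correct predictions"
  )

p
```

Reversing the two geom lines would draw the boxes over the points and hide some of them, which is why the order matters.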
Multiple variables - Demo
Analyzing the relationship between three or more variables is called multivariate analysis.
- Facet the plot you created in the previous exercise by forecaster_simple. Adjust labels accordingly.
# add code here
Before you continue, let’s turn off all warnings the code chunks generate and resize all figures. We’ll do this by editing the YAML.
Visualizing other variables - Your turn!
- Pick a single categorical variable from the data set and make a bar plot of its distribution.
# add code here
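As an illustration of the bar plot, the sketch below counts a categorical column in a made-up data frame. The name toy_seers and the region labels are hypothetical and may not match the values stored in the real climate_region column.

```r
library(ggplot2)

# Hypothetical data: made-up region labels for six forecasters
toy_seers <- data.frame(
  climate_region = c("Northeast", "Northeast", "Northeast",
                     "Southeast", "Southeast", "South")
)

# geom_bar() counts the rows in each category; mapping the variable to y
# gives horizontal bars, which keeps long category labels readable
p <- ggplot(toy_seers, aes(y = climate_region)) +
  geom_bar() +
  labs(x = "Number of prognosticators", y = "Climate region")

p
```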
- Pick two categorical variables and create a visualization of the relationship between them. Along with your code and output, provide an interpretation of the visualization.
# add code here
Interpretation goes here…
- Make another plot that uses at least three variables. At least one should be numeric and at least one categorical. In 1-2 sentences, describe what the plot shows about the relationships between the variables you plotted. Don’t forget to label your code chunk.
# add code here
Interpretation goes here…