AE 06: Importing and cleaning lottery data
Some chunks require you to merely fill in the TODOs
with the appropriate code. Other chunks require you to write the entire code block.
Packages
We will use the following four packages in this application exercise.
- tidyverse: For data import, wrangling, and visualization.
- readxl: For importing data from Excel.
- janitor: For cleaning column names.
- scales: For formatting ggplot2 scales.
Powerball
A lottery is form of gambling that involves the drawing of numbers at random for a prize.1 In the United States, Powerball is a popular multi-state lottery played in 45 states, Washington D.C., Puerto Rico, and the US Virgin Islands. 2 The basic rules are (relatively) straightforward :
1 Source: Wikipedia.
2 Source: Powerball.com
- Powerball costs $2 per play.
- Players select five numbers between 1 and 69 for the white balls, then select one number between 1 and 26 for the red Powerball.
- Drawings are held every Monday, Wednesday, and Saturday night.
- The Powerball jackpot grows until it is won. Players win a prize by matching one of the 9 ways to win. The jackpot is won by matching all five white balls in any order and the red Powerball.3
3 For our purposes here, we will only examine the Powerball jackpot results.
4 Drawing history can be obtained from their website.
The Colorado Lottery provides detailed information on Powerball drawings dating back to August 2001.4 For these exercises we will work with a dataset containing every Powerball drawing in the Colorado Lottery’s database.
Import and clean the data
The dataset is available for download as an Excel spreadsheet.
Demo: Import the data file so it looks like below. Store it as powerball_raw
.
# A tibble: 2,577 × 61
`Draw date` `Last Day To Claim` `Winning Numbers` Powerball `Power Play`
<chr> <dttm> <chr> <dbl> <dbl>
1 Monday, 9/23/… 2025-03-22 00:00:00 15 - 21 - 25 - 3… 19 3
2 Saturday, 9/2… 2025-03-20 00:00:00 17 - 19 - 21 - 3… 14 2
3 Wednesday, 9/… 2025-03-17 00:00:00 1 - 11 - 22 - 47… 7 4
4 Monday, 9/16/… 2025-03-15 00:00:00 8 - 9 - 11 - 27 … 17 5
5 Saturday, 9/1… 2025-03-13 00:00:00 29 - 34 - 38 - 4… 16 2
6 Wednesday, 9/… 2025-03-10 00:00:00 10 - 12 - 55 - 6… 3 3
7 Monday, 9/9/24 2025-03-08 00:00:00 1 - 16 - 21 - 47… 5 3
8 Saturday, 9/7… 2025-03-06 00:00:00 14 - 34 - 37 - 5… 20 2
9 Wednesday, 9/… 2025-03-03 00:00:00 7 - 10 - 21 - 33… 20 3
10 Monday, 9/2/24 2025-03-01 00:00:00 8 - 42 - 46 - 48… 22 3
# ℹ 2,567 more rows
# ℹ 56 more variables: Jackpot <dbl>, `Jackpot Cash Value` <dbl>,
# `Jackpot Winners` <dbl>, `Jackpot CO Winners` <dbl>,
# `Match 5 Prize` <dbl>, `Match 5 CO Winners` <dbl>,
# `Match 5 Prize (with Power Play)` <dbl>,
# `Match 5 CO Winners (with Power Play)` <dbl>,
# `Match 4 + Powerball Prize` <dbl>, …
# ℹ Use `print(n = ...)` to see more rows
# add code here
powerball_raw <- TODO
powerball_raw
Your turn: Clean the raw data to fix the following issues:
- Standardize the column names using
snake_case
format. - Create columns with appropriate data types for the date of the drawing as well as the weekday. Append these columns to the beginning of the data frame.
- Our analysis focuses specifically on jackpot outcomes. Drop columns related to other prizes offered through the Powerball lottery (e.g. Match \(N\), Double Play)
Store the cleaned data frame as powerball
.
# add code here
# standardize column names
powerball <- powerball_raw |>
# standardize the column names
TODO() |>
# separate draw_date into two variables, clean both
separate_TODO(
cols = TODO,
delim = "TODO",
names = c(NA, "draw_date")
) |>
# create a new column for the day of the week
mutate(
draw_date = TODO(draw_date),
draw_weekday = TODO(x = draw_date, label = TRUE),
.before = last_day_to_claim
) |>
# keep only a smaller subset to work with
select(draw_date:jackpot_co_winners)
powerball
Why does it seem like everyone is winning big?
Anyone living in the United States in the past few years is likely to have seen news reports whenever the jackpot grows dramatically. The 10 biggest lottery jackpots in the United States have all occurred since 2015. What is driving this trend?
How the jackpot value has changed over time
In order to address this question, let’s start first with a simpler question: how has the jackpot value changed over time? The jackpot amount varies for each drawing depending on the number of tickets sold as well as if the jackpot is rolling over from the previous drawing.
Demo: Create a line graph visualizing the jackpot value for every Powerball drawing over time.
# add code here
ggplot(data = powerball, mapping = aes(x = draw_date, y = jackpot)) +
geom_line() +
scale_y_continuous(labels = label_currency(scale_cut = cut_short_scale())) +
theme_minimal() +
labs(
x = "Date of lottery drawing",
y = "Jackpot value",
title = "Powerball jackpot values have increased dramatically since 2015",
caption = "Source: Colorado Lottery"
)
Your turn: What do you observe from the graph? Add response here.
Distribution of winning numbers
To investigate this further, let’s look at the distribution of the white balls + the red Powerball. Presumably since the numbers are drawn at random, then they should be uniformly distributed.
Your turn: Convert winning_numbers
into numeric values with one row for each drawing for each number. Keep just two columns: the drawing date and the winning numbers. Store this as powerball_white
.
Your turn: Create a similarly structured data frame for the red Powerball called powerball_red
. Ensure it has the same column names as powerball_white
.
# add code here
powerball_red <- powerball |>
select(TODO)
powerball_red
Your turn: Combine the two data frames and create a histogram visualizing the distribution of the winning numbers, faceted between the white balls and the Red Powerballs.
# add code here
bind_rows(
`White balls` = TODO,
Powerball = TODO,
.id = "num_type"
|>
) TODO
Your turn: Visualize the distribution of white balls (numbers drawn) over time using a scatterplot + a smoothing line.
# add code here
Why wrangle data in R
Question: Why should we bother with writing code to clean our data files (e.g. renaming columns, cleaning variables, separating into new columns/rows) instead of opening the Excel file and editing the data in there to prepare it for a clean import?
Your answer here