AE 06: Importing and cleaning lottery data
Suggested answers
Packages
We will use the following four packages in this application exercise.
- tidyverse: For data import, wrangling, and visualization.
- readxl: For importing data from Excel.
- janitor: For cleaning column names.
- scales: For formatting ggplot2 scales.
Powerball
A lottery is form of gambling that involves the drawing of numbers at random for a prize.1 In the United States, Powerball is a popular multi-state lottery played in 45 states, Washington D.C., Puerto Rico, and the US Virgin Islands. 2 The basic rules are (relatively) straightforward :
1 Source: Wikipedia.
2 Source: Powerball.com
- Powerball costs $2 per play.
- Players select five numbers between 1 and 69 for the white balls, then select one number between 1 and 26 for the red Powerball.
- Drawings are held every Monday, Wednesday, and Saturday night.
- The Powerball jackpot grows until it is won. Players win a prize by matching one of the 9 ways to win. The jackpot is won by matching all five white balls in any order and the red Powerball.3
3 For our purposes here, we will only examine the Powerball jackpot results.
4 Drawing history can be obtained from their website.
The Colorado Lottery provides detailed information on Powerball drawings dating back to August 2001.4 For these exercises we will work with a dataset containing every Powerball drawing in the Colorado Lottery’s database.
Import and clean the data
The dataset is available for download as an Excel spreadsheet.
Demo: Import the data file so it looks like below. Store it as powerball_raw
.
# A tibble: 2,577 × 61
`Draw date` `Last Day To Claim` `Winning Numbers` Powerball `Power Play`
<chr> <dttm> <chr> <dbl> <dbl>
1 Monday, 9/23/… 2025-03-22 00:00:00 15 - 21 - 25 - 3… 19 3
2 Saturday, 9/2… 2025-03-20 00:00:00 17 - 19 - 21 - 3… 14 2
3 Wednesday, 9/… 2025-03-17 00:00:00 1 - 11 - 22 - 47… 7 4
4 Monday, 9/16/… 2025-03-15 00:00:00 8 - 9 - 11 - 27 … 17 5
5 Saturday, 9/1… 2025-03-13 00:00:00 29 - 34 - 38 - 4… 16 2
6 Wednesday, 9/… 2025-03-10 00:00:00 10 - 12 - 55 - 6… 3 3
7 Monday, 9/9/24 2025-03-08 00:00:00 1 - 16 - 21 - 47… 5 3
8 Saturday, 9/7… 2025-03-06 00:00:00 14 - 34 - 37 - 5… 20 2
9 Wednesday, 9/… 2025-03-03 00:00:00 7 - 10 - 21 - 33… 20 3
10 Monday, 9/2/24 2025-03-01 00:00:00 8 - 42 - 46 - 48… 22 3
# ℹ 2,567 more rows
# ℹ 56 more variables: Jackpot <dbl>, `Jackpot Cash Value` <dbl>,
# `Jackpot Winners` <dbl>, `Jackpot CO Winners` <dbl>,
# `Match 5 Prize` <dbl>, `Match 5 CO Winners` <dbl>,
# `Match 5 Prize (with Power Play)` <dbl>,
# `Match 5 CO Winners (with Power Play)` <dbl>,
# `Match 4 + Powerball Prize` <dbl>, …
# ℹ Use `print(n = ...)` to see more rows
powerball_raw <- read_excel("data/POWERBALL-from_0001-01-01_to_2024-09-24.xlsx",
col_types = c("text", "date", "text",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "text", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric"))
powerball_raw
# A tibble: 2,577 × 61
`Draw date` `Last Day To Claim` `Winning Numbers` Powerball `Power Play`
<chr> <dttm> <chr> <dbl> <dbl>
1 Monday, 9/23/24 2025-03-22 00:00:00 15 - 21 - 25 - 3… 19 3
2 Saturday, 9/21/… 2025-03-20 00:00:00 17 - 19 - 21 - 3… 14 2
3 Wednesday, 9/18… 2025-03-17 00:00:00 1 - 11 - 22 - 47… 7 4
4 Monday, 9/16/24 2025-03-15 00:00:00 8 - 9 - 11 - 27 … 17 5
5 Saturday, 9/14/… 2025-03-13 00:00:00 29 - 34 - 38 - 4… 16 2
6 Wednesday, 9/11… 2025-03-10 00:00:00 10 - 12 - 55 - 6… 3 3
7 Monday, 9/9/24 2025-03-08 00:00:00 1 - 16 - 21 - 47… 5 3
8 Saturday, 9/7/24 2025-03-06 00:00:00 14 - 34 - 37 - 5… 20 2
9 Wednesday, 9/4/… 2025-03-03 00:00:00 7 - 10 - 21 - 33… 20 3
10 Monday, 9/2/24 2025-03-01 00:00:00 8 - 42 - 46 - 48… 22 3
# ℹ 2,567 more rows
# ℹ 56 more variables: Jackpot <dbl>, `Jackpot Cash Value` <dbl>,
# `Jackpot Winners` <dbl>, `Jackpot CO Winners` <dbl>, `Match 5 Prize` <dbl>,
# `Match 5 CO Winners` <dbl>, `Match 5 Prize (with Power Play)` <dbl>,
# `Match 5 CO Winners (with Power Play)` <dbl>,
# `Match 4 + Powerball Prize` <dbl>, `Match 4 + Powerball CO Winners` <dbl>,
# `Match 4 + Powerball Prize (with Power Play)` <dbl>, …
Your turn: Clean the raw data to fix the following issues:
- Standardize the column names using
snake_case
format. - Create columns with appropriate data types for the date of the drawing as well as the weekday. Append these columns to the beginning of the data frame.
- Our analysis focuses specifically on jackpot outcomes. Drop columns related to other prizes offered through the Powerball lottery (e.g. Match \(N\), Double Play)
Store the cleaned data frame as powerball
.
# standardize column names
powerball <- powerball_raw |>
clean_names() |>
# separate draw_date into two variables, clean both
separate_wider_delim(
cols = draw_date,
delim = ",",
names = c(NA, "draw_date")
) |>
mutate(
draw_date = mdy(draw_date),
draw_weekday = wday(x = draw_date, label = TRUE),
.before = last_day_to_claim
) |>
# keep only a smaller subset to work with
select(draw_date:jackpot_co_winners)
powerball
# A tibble: 2,577 × 10
draw_date draw_weekday last_day_to_claim winning_numbers powerball
<date> <ord> <dttm> <chr> <dbl>
1 2024-09-23 Mon 2025-03-22 00:00:00 15 - 21 - 25 - 37 - 45 19
2 2024-09-21 Sat 2025-03-20 00:00:00 17 - 19 - 21 - 37 - 45 14
3 2024-09-18 Wed 2025-03-17 00:00:00 1 - 11 - 22 - 47 - 68 7
4 2024-09-16 Mon 2025-03-15 00:00:00 8 - 9 - 11 - 27 - 31 17
5 2024-09-14 Sat 2025-03-13 00:00:00 29 - 34 - 38 - 48 - 56 16
6 2024-09-11 Wed 2025-03-10 00:00:00 10 - 12 - 55 - 65 - 67 3
7 2024-09-09 Mon 2025-03-08 00:00:00 1 - 16 - 21 - 47 - 60 5
8 2024-09-07 Sat 2025-03-06 00:00:00 14 - 34 - 37 - 55 - 63 20
9 2024-09-04 Wed 2025-03-03 00:00:00 7 - 10 - 21 - 33 - 59 20
10 2024-09-02 Mon 2025-03-01 00:00:00 8 - 42 - 46 - 48 - 53 22
# ℹ 2,567 more rows
# ℹ 5 more variables: power_play <dbl>, jackpot <dbl>,
# jackpot_cash_value <dbl>, jackpot_winners <dbl>, jackpot_co_winners <dbl>
Why does it seem like everyone is winning big?
Anyone living in the United States in the past few years is likely to have seen news reports whenever the jackpot grows dramatically. The 10 biggest lottery jackpots in the United States have all occurred since 2015. What is driving this trend?
How the jackpot value has changed over time
In order to address this question, let’s start first with a simpler question: how has the jackpot value changed over time? The jackpot amount varies for each drawing depending on the number of tickets sold as well as if the jackpot is rolling over from the previous drawing.
Demo: Create a line graph visualizing the jackpot value for every Powerball drawing over time.
ggplot(data = powerball, mapping = aes(x = draw_date, y = jackpot)) +
geom_line() +
scale_y_continuous(labels = label_currency(scale_cut = cut_short_scale())) +
theme_minimal() +
labs(
x = "Date of lottery drawing",
y = "Jackpot value",
title = "Powerball jackpot values have increased dramatically since 2015",
caption = "Source: Colorado Lottery"
)
Your turn: What do you observe from the graph? Add response here.
There definitely seems to be an increase in the typical Powerball jackpot values since 2015.
Distribution of winning numbers
To investigate this further, let’s look at the distribution of the white balls + the red Powerball. Presumably since the numbers are drawn at random, then they should be uniformly distributed.
Your turn: Convert winning_numbers
into numeric values with one row for each drawing for each number. Keep just two columns: the drawing date and the winning numbers. Store this as powerball_white
.
powerball_white <- powerball |>
separate_longer_delim(
cols = winning_numbers,
delim = " - "
) |>
mutate(winning_numbers = parse_number(x = winning_numbers)) |>
select(draw_date, number = winning_numbers)
powerball_white
# A tibble: 12,885 × 2
draw_date number
<date> <dbl>
1 2024-09-23 15
2 2024-09-23 21
3 2024-09-23 25
4 2024-09-23 37
5 2024-09-23 45
6 2024-09-21 17
7 2024-09-21 19
8 2024-09-21 21
9 2024-09-21 37
10 2024-09-21 45
# ℹ 12,875 more rows
Your turn: Create a similarly structured data frame for the red Powerball called powerball_red
. Ensure it has the same column names as powerball_white
.
powerball_red <- powerball |>
select(draw_date, number = powerball)
powerball_red
# A tibble: 2,577 × 2
draw_date number
<date> <dbl>
1 2024-09-23 19
2 2024-09-21 14
3 2024-09-18 7
4 2024-09-16 17
5 2024-09-14 16
6 2024-09-11 3
7 2024-09-09 5
8 2024-09-07 20
9 2024-09-04 20
10 2024-09-02 22
# ℹ 2,567 more rows
Your turn: Combine the two data frames and create a histogram visualizing the distribution of the winning numbers, faceted between the white balls and the Red Powerballs.
bind_rows(
`White balls` = powerball_white,
Powerball = powerball_red,
.id = "num_type"
) |>
ggplot(mapping = aes(x = number, fill = num_type)) +
geom_histogram(binwidth = 1) +
scale_fill_viridis_d(guide = "none") +
facet_wrap(facets = vars(num_type), scales = "free") +
theme_minimal() +
labs(
x = "Number drawn",
y = NULL,
title = "Higher numbers are drawn less frequently",
subtitle = "Powerball drawings, 2001-24",
caption = "Source: Colorado Lottery"
)
Your turn: Visualize the distribution of white balls (numbers drawn) over time using a scatterplot + a smoothing line.
ggplot(data = powerball_white, mapping = aes(x = draw_date, y = number)) +
geom_point(alpha = 0.06) +
geom_smooth() +
theme_minimal() +
labs(
x = "Drawing date",
y = "White balls (numbers drawn)",
title = "Higher numbers have only been drawn since late 2015",
subtitle = "Powerball drawings, 2001-24",
caption = "Source: Colorado Lottery"
)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Turns out lottery officials have changed the rules of Powerball over time to change the odds of winning the jackpot. By increasing the pool of choices for the white balls, they double the available combinations of white balls and make it harder to win the jackpot. Higher jackpots spur more casual lottery players to purchase tickets, and overall generate more revenue.
Why wrangle data in R
Question: Why should we bother with writing code to clean our data files (e.g. renaming columns, cleaning variables, separating into new columns/rows) instead of opening the Excel file and editing the data in there to prepare it for a clean import?
Because the code allows us to struggle once and re-use for future datasets and leaves a transparent trail of our modifications while manipulating the data in Excel directly is neither reproducible nor reusable.
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.1 (2024-06-14)
os macOS Sonoma 14.6.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2024-09-26
pandoc 3.3 @ /usr/local/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P cellranger 1.1.0 2016-07-27 [?] CRAN (R 4.3.0)
cli 3.6.3 2024-06-21 [1] RSPM (R 4.4.0)
P colorspace 2.1-0 2023-01-23 [?] CRAN (R 4.3.0)
P digest 0.6.35 2024-03-11 [?] CRAN (R 4.3.1)
P dplyr * 1.1.4 2023-11-17 [?] CRAN (R 4.3.1)
P evaluate 0.24.0 2024-06-10 [?] CRAN (R 4.4.0)
P fansi 1.0.6 2023-12-08 [?] CRAN (R 4.3.1)
P farver 2.1.2 2024-05-13 [?] CRAN (R 4.3.3)
P fastmap 1.2.0 2024-05-15 [?] CRAN (R 4.4.0)
P forcats * 1.0.0 2023-01-29 [?] CRAN (R 4.3.0)
P generics 0.1.3 2022-07-05 [?] CRAN (R 4.3.0)
P ggplot2 * 3.5.1 2024-04-23 [?] CRAN (R 4.3.1)
P glue 1.7.0 2024-01-09 [?] CRAN (R 4.3.1)
P gtable 0.3.5 2024-04-22 [?] CRAN (R 4.3.1)
P here 1.0.1 2020-12-13 [?] CRAN (R 4.3.0)
P hms 1.1.3 2023-03-21 [?] CRAN (R 4.3.0)
P htmltools 0.5.8.1 2024-04-04 [?] CRAN (R 4.3.1)
P htmlwidgets 1.6.4 2023-12-06 [?] CRAN (R 4.3.1)
P janitor * 2.2.0 2023-02-02 [?] CRAN (R 4.3.0)
P jsonlite 1.8.8 2023-12-04 [?] CRAN (R 4.3.1)
P knitr 1.47 2024-05-29 [?] CRAN (R 4.4.0)
P labeling 0.4.3 2023-08-29 [?] CRAN (R 4.3.0)
P lattice 0.22-6 2024-03-20 [?] CRAN (R 4.4.0)
P lifecycle 1.0.4 2023-11-07 [?] CRAN (R 4.3.1)
P lubridate * 1.9.3 2023-09-27 [?] CRAN (R 4.3.1)
P magrittr 2.0.3 2022-03-30 [?] CRAN (R 4.3.0)
P Matrix 1.7-0 2024-03-22 [?] CRAN (R 4.4.0)
P mgcv 1.9-1 2023-12-21 [?] CRAN (R 4.4.0)
P munsell 0.5.1 2024-04-01 [?] CRAN (R 4.3.1)
P nlme 3.1-165 2024-06-06 [?] CRAN (R 4.4.0)
P pillar 1.9.0 2023-03-22 [?] CRAN (R 4.3.0)
P pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 4.3.0)
P purrr * 1.0.2 2023-08-10 [?] CRAN (R 4.3.0)
P R6 2.5.1 2021-08-19 [?] CRAN (R 4.3.0)
P readr * 2.1.5 2024-01-10 [?] CRAN (R 4.3.1)
P readxl * 1.4.3 2023-07-06 [?] CRAN (R 4.3.0)
renv 1.0.7 2024-04-11 [1] CRAN (R 4.4.0)
P rlang 1.1.4 2024-06-04 [?] CRAN (R 4.3.3)
P rmarkdown 2.27 2024-05-17 [?] CRAN (R 4.4.0)
P rprojroot 2.0.4 2023-11-05 [?] CRAN (R 4.3.1)
P rstudioapi 0.16.0 2024-03-24 [?] CRAN (R 4.3.1)
P scales * 1.3.0.9000 2024-05-07 [?] Github (r-lib/scales@c0f79d3)
P sessioninfo 1.2.2 2021-12-06 [?] CRAN (R 4.3.0)
P snakecase 0.11.1 2023-08-27 [?] CRAN (R 4.3.0)
P stringi 1.8.4 2024-05-06 [?] CRAN (R 4.3.1)
P stringr * 1.5.1 2023-11-14 [?] CRAN (R 4.3.1)
P tibble * 3.2.1 2023-03-20 [?] CRAN (R 4.3.0)
P tidyr * 1.3.1 2024-01-24 [?] CRAN (R 4.3.1)
P tidyselect 1.2.1 2024-03-11 [?] CRAN (R 4.3.1)
P tidyverse * 2.0.0 2023-02-22 [?] CRAN (R 4.3.0)
P timechange 0.3.0 2024-01-18 [?] CRAN (R 4.3.1)
P tzdb 0.4.0 2023-05-12 [?] CRAN (R 4.3.0)
P utf8 1.2.4 2023-10-22 [?] CRAN (R 4.3.1)
P vctrs 0.6.5 2023-12-01 [?] CRAN (R 4.3.1)
P viridisLite 0.4.2 2023-05-02 [?] CRAN (R 4.3.0)
withr 3.0.1 2024-07-31 [1] RSPM (R 4.4.0)
P xfun 0.45 2024-06-16 [?] CRAN (R 4.4.0)
P yaml 2.3.8 2023-12-11 [?] CRAN (R 4.3.1)
[1] /Users/soltoffbc/Projects/info-5001/course-site/renv/library/macos/R-4.4/aarch64-apple-darwin20
[2] /Users/soltoffbc/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.4/aarch64-apple-darwin20/f7156815
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────