AE 06: Importing and cleaning lottery data

Application exercise
Modified

September 24, 2024

Note

Some chunks require you to merely fill in the TODOs with the appropriate code. Other chunks require you to write the entire code block.

Packages

We will use the following four packages in this application exercise.

  • tidyverse: For data import, wrangling, and visualization.
  • readxl: For importing data from Excel.
  • janitor: For cleaning column names.
  • scales: For formatting ggplot2 scales.

Powerball

A lottery is form of gambling that involves the drawing of numbers at random for a prize.1 In the United States, Powerball is a popular multi-state lottery played in 45 states, Washington D.C., Puerto Rico, and the US Virgin Islands. 2 The basic rules are (relatively) straightforward :

1 Source: Wikipedia.

2 Source: Powerball.com

  • Powerball costs $2 per play.
  • Players select five numbers between 1 and 69 for the white balls, then select one number between 1 and 26 for the red Powerball.
  • Drawings are held every Monday, Wednesday, and Saturday night.
  • The Powerball jackpot grows until it is won. Players win a prize by matching one of the 9 ways to win. The jackpot is won by matching all five white balls in any order and the red Powerball.3

3 For our purposes here, we will only examine the Powerball jackpot results.

4 Drawing history can be obtained from their website.

The Colorado Lottery provides detailed information on Powerball drawings dating back to August 2001.4 For these exercises we will work with a dataset containing every Powerball drawing in the Colorado Lottery’s database.

Import and clean the data

The dataset is available for download as an Excel spreadsheet.

A screenshot of the Powerball spreadsheet opened in Excel.

Demo: Import the data file so it looks like below. Store it as powerball_raw.

# A tibble: 2,577 × 61
   `Draw date`    `Last Day To Claim` `Winning Numbers` Powerball `Power Play`
   <chr>          <dttm>              <chr>                 <dbl>        <dbl>
 1 Monday, 9/23/2025-03-22 00:00:00 15 - 21 - 25 - 319            3
 2 Saturday, 9/22025-03-20 00:00:00 17 - 19 - 21 - 314            2
 3 Wednesday, 9/2025-03-17 00:00:00 1 - 11 - 22 - 477            4
 4 Monday, 9/16/2025-03-15 00:00:00 8 - 9 - 11 - 2717            5
 5 Saturday, 9/12025-03-13 00:00:00 29 - 34 - 38 - 416            2
 6 Wednesday, 9/2025-03-10 00:00:00 10 - 12 - 55 - 63            3
 7 Monday, 9/9/24 2025-03-08 00:00:00 1 - 16 - 21 - 475            3
 8 Saturday, 9/72025-03-06 00:00:00 14 - 34 - 37 - 520            2
 9 Wednesday, 9/2025-03-03 00:00:00 7 - 10 - 21 - 3320            3
10 Monday, 9/2/24 2025-03-01 00:00:00 8 - 42 - 46 - 4822            3
# ℹ 2,567 more rows
# ℹ 56 more variables: Jackpot <dbl>, `Jackpot Cash Value` <dbl>,
#   `Jackpot Winners` <dbl>, `Jackpot CO Winners` <dbl>,
#   `Match 5 Prize` <dbl>, `Match 5 CO Winners` <dbl>,
#   `Match 5 Prize (with Power Play)` <dbl>,
#   `Match 5 CO Winners (with Power Play)` <dbl>,
#   `Match 4 + Powerball Prize` <dbl>, …
# ℹ Use `print(n = ...)` to see more rows
# add code here
powerball_raw <- TODO
powerball_raw

Your turn: Clean the raw data to fix the following issues:

  • Standardize the column names using snake_case format.
  • Create columns with appropriate data types for the date of the drawing as well as the weekday. Append these columns to the beginning of the data frame.
  • Our analysis focuses specifically on jackpot outcomes. Drop columns related to other prizes offered through the Powerball lottery (e.g. Match \(N\), Double Play)

Store the cleaned data frame as powerball.

# add code here
# standardize column names
powerball <- powerball_raw |>
  # standardize the column names
  TODO() |>
  # separate draw_date into two variables, clean both
  separate_TODO(
    cols = TODO,
    delim = "TODO",
    names = c(NA, "draw_date")
  ) |>
  # create a new column for the day of the week
  mutate(
    draw_date = TODO(draw_date),
    draw_weekday = TODO(x = draw_date, label = TRUE),
    .before = last_day_to_claim
  ) |>
  # keep only a smaller subset to work with
  select(draw_date:jackpot_co_winners)
powerball

Why does it seem like everyone is winning big?

Anyone living in the United States in the past few years is likely to have seen news reports whenever the jackpot grows dramatically. The 10 biggest lottery jackpots in the United States have all occurred since 2015. What is driving this trend?

How the jackpot value has changed over time

In order to address this question, let’s start first with a simpler question: how has the jackpot value changed over time? The jackpot amount varies for each drawing depending on the number of tickets sold as well as if the jackpot is rolling over from the previous drawing.

Demo: Create a line graph visualizing the jackpot value for every Powerball drawing over time.

# add code here
ggplot(data = powerball, mapping = aes(x = draw_date, y = jackpot)) +
  geom_line() +
  scale_y_continuous(labels = label_currency(scale_cut = cut_short_scale())) +
  theme_minimal() +
  labs(
    x = "Date of lottery drawing",
    y = "Jackpot value",
    title = "Powerball jackpot values have increased dramatically since 2015",
    caption = "Source: Colorado Lottery"
  )

Your turn: What do you observe from the graph? Add response here.

Distribution of winning numbers

To investigate this further, let’s look at the distribution of the white balls + the red Powerball. Presumably since the numbers are drawn at random, then they should be uniformly distributed.

Your turn: Convert winning_numbers into numeric values with one row for each drawing for each number. Keep just two columns: the drawing date and the winning numbers. Store this as powerball_white.

# add code here
powerball_white <- powerball |>
  # separate into one row per number
  separate_TODO(
    cols = TODO,
    delim = "TODO"
  ) |>
  # ensure winning_numbers is stored as numeric type
  mutate(winning_numbers = TODO(x = winning_numbers)) |>
  select(draw_date, number = winning_numbers)
powerball_white

Your turn: Create a similarly structured data frame for the red Powerball called powerball_red. Ensure it has the same column names as powerball_white.

# add code here
powerball_red <- powerball |>
  select(TODO)
powerball_red

Your turn: Combine the two data frames and create a histogram visualizing the distribution of the winning numbers, faceted between the white balls and the Red Powerballs.

# add code here
bind_rows(
  `White balls` = TODO,
  Powerball = TODO,
  .id = "num_type"
) |>
  TODO

Your turn: Visualize the distribution of white balls (numbers drawn) over time using a scatterplot + a smoothing line.

# add code here

Why wrangle data in R

Question: Why should we bother with writing code to clean our data files (e.g. renaming columns, cleaning variables, separating into new columns/rows) instead of opening the Excel file and editing the data in there to prepare it for a clean import?

Your answer here