AE 12: Rectangling data from the PokéAPI
Suggested answers
Packages
We will use the following packages in this application exercise.
- tidyverse: For data import, wrangling, and visualization.
- jsonlite: For importing JSON files.
Gotta catch ’em all!
Pokémon (also known as Pocket Monsters) is a Japanese media franchise consisting of video games, animated series and films, a trading card game, and other related media.[1] The PokéAPI contains detailed information about each Pokémon, including their name, type, and abilities. In this application exercise, we will use a set of JSON files containing API results from the PokéAPI to explore the Pokémon universe.
[1] Source: Wikipedia
Importing the data
data/pokedex.json and data/types.json contain information about each Pokémon and the different types of Pokémon, respectively. We will use read_json() to import these files.
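The import chunk itself is not shown here. Judging from the object names used in the code further down, it presumably resembles the two commented calls below, but that is an inference rather than the original chunk. The runnable part is a minimal sketch that writes a hypothetical two-entry JSON file to a temporary path and reads it back, to show what read_json() returns.

```r
library(jsonlite)

# Presumed import calls (object names inferred from the code later on this page):
# pokemon <- read_json("data/pokedex.json")
# types <- read_json("data/types.json")

# Hypothetical two-entry stand-in for a pokedex-style file
toy_json <- '[
  {"id": 1, "name": {"english": "Bulbasaur"}, "type": ["Grass", "Poison"]},
  {"id": 4, "name": {"english": "Charmander"}, "type": ["Fire"]}
]'
toy_path <- tempfile(fileext = ".json")
writeLines(toy_json, toy_path)

# read_json() does not simplify by default, so JSON arrays and objects
# come back as nested R lists
toy_pokemon <- read_json(toy_path)

length(toy_pokemon)         # number of entries
toy_pokemon[[1]]$type[[1]]  # first listed type of the first entry
```

Because nothing is simplified, even the type array arrives as a list of strings, which is why the exercises below need rectangling functions rather than plain column operations.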
Your turn: Use View() to interactively explore each list object to identify their structure and the elements contained within each object.
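View() only works in an interactive session. As a non-interactive complement, str() with a limited max.level prints a compact outline of a nested list. The sketch below uses a hypothetical one-entry list; the field names are illustrative and the real objects contain more elements.

```r
# Hypothetical single-entry list shaped loosely like a pokedex entry
toy_pokemon <- list(
  list(
    id = 1L,
    name = list(english = "Bulbasaur"),
    type = list("Grass", "Poison"),
    base = list(HP = 45L, Attack = 49L, Speed = 45L)
  )
)

# Print only the top two levels of nesting
str(toy_pokemon, max.level = 2)

# Names of the elements stored for the first entry
entry_names <- names(toy_pokemon[[1]])
entry_names
```

Raising max.level one step at a time is a quick way to decide which elements need unnest_wider() (named, fixed fields) and which need unnest_longer() (unnamed, variable-length lists).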
Unnesting for analysis
For each of the exercises below, use an appropriate rectangling procedure to unnest_*() one or more lists to extract the required elements for analysis.
How many Pokémon are there for each primary type?
Your turn: Use each Pokémon’s primary type (the first one listed in the data) to determine how many Pokémon there are for each type, then create a bar chart to visualize the distribution.
Examine the contents of each list object to determine how the relevant variables are structured so you can plan your approach.
There are (at least) three ways you could approach this problem.
- Use unnest_wider() twice to extract the primary type from the pokemon list and generate a frequency count.
- Use unnest_wider() and hoist() to extract the primary type from the pokemon list and generate a frequency count.
- Use unnest_wider() and unnest_longer() to extract the primary type from the pokemon list and generate a frequency count.
Pick one and have at it!
Fancy a challenge? Label each Pokémon type in both English and Japanese.
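To see how the three approaches listed above differ, it can help to try them on a list small enough to inspect by eye. The following is a minimal sketch on a hypothetical three-entry list (pokemon_toy and its contents are made up for illustration); all three routes should recover the same primary types.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical three-entry list; the real pokemon object has many more fields
pokemon_toy <- list(
  list(id = 1L, type = list("Grass", "Poison")),
  list(id = 4L, type = list("Fire")),
  list(id = 7L, type = list("Water"))
)

# One column per variable; type is still a list-column
wide <- tibble(pokemon = pokemon_toy) |>
  unnest_wider(pokemon)

# hoist(): pull out only the first element of each type list
via_hoist <- wide |>
  hoist(.col = type, main_type = 1L) |>
  pull(main_type)

# unnest_wider(): one column per listed type (type_1, type_2)
via_wider <- wide |>
  unnest_wider(type, names_sep = "_") |>
  pull(type_1)

# unnest_longer(): one row per listed type, then keep the first row per id
via_longer <- wide |>
  unnest_longer(type) |>
  slice_head(n = 1, by = id) |>
  pull(type)

via_hoist
```

hoist() touches only the element that is needed, whereas the other two routes materialize every listed type before discarding the extras.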
# extract the primary type from the pokemon list and generate a frequency count

## using hoist()
poke_types <- tibble(pokemon) |>
  # expand so each pokemon variable is in its own column
  unnest_wider(pokemon) |>
  # extract the pokemon's primary type from the type column
  hoist(.col = type, main_type = 1L) |>
  # generate frequency count
  count(main_type)

## using unnest_wider() twice
poke_types <- tibble(pokemon) |>
  # expand so each pokemon variable is in its own column
  unnest_wider(pokemon) |>
  # expand the type column so each type is in its own column
  unnest_wider(type, names_sep = "_") |>
  rename(main_type = type_1) |>
  # generate frequency count
  count(main_type)

## using unnest_wider() and unnest_longer()
poke_types <- tibble(pokemon) |>
  # expand so each pokemon variable is in its own column
  unnest_wider(pokemon) |>
  # expand the type column so each type is in its own row
  unnest_longer(type) |>
  # keep just the first row for each pokemon
  slice_head(n = 1, by = id) |>
  # generate frequency count
  count(main_type = type)
# extract english and japanese names for types
types_df <- tibble(types) |>
  unnest_wider(types)

# combine poke_types with types_df and create a name column that includes both
# english and japanese
left_join(x = poke_types, y = types_df, by = join_by(main_type == english)) |>
  mutate(
    name = str_glue("{main_type} ({japanese})"),
    name = fct_reorder(.f = name, .x = n)
  ) |>
  ggplot(mapping = aes(x = n, y = name)) +
  geom_col() +
  labs(
    title = "Water-type Pokémon are the most common",
    x = "Number of Pokémon",
    y = NULL,
    caption = "Source: PokéAPI"
  ) +
  theme_minimal()
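The join above matches two columns with different names. As a small sketch of what that does to the result, the example below joins a hypothetical count table to a hypothetical two-row lookup; the values are illustrative and are not taken from data/types.json.

```r
library(dplyr)
library(stringr)

# Hypothetical frequency counts and a hypothetical type lookup table
counts <- tibble(main_type = c("Fire", "Water"), n = c(2L, 5L))
lookup <- tibble(
  english = c("Fire", "Water"),
  japanese = c("ほのお", "みず")
)

# join_by(main_type == english) matches main_type in x to english in y;
# for an equality join only the key column from x is kept in the output
labelled <- left_join(x = counts, y = lookup, by = join_by(main_type == english)) |>
  mutate(name = str_glue("{main_type} ({japanese})"))

names(labelled)
```

Since the english column is dropped after the join, the label has to be built from main_type, as the solution does.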
Which primary type of Pokémon is strongest based on total number of points?
Your turn: Use each Pokémon’s base stats to determine which primary type of Pokémon is strongest based on the total number of points. Create a boxplot to visualize the distribution of total points for each primary type.
To calculate the sum total of points for each Pokémon’s base stats, there are two approaches you might consider. In either approach you first need to get each Pokémon’s variables into separate columns and extract the primary type.
- Use unnest_wider() to extract the base stats, then calculate the sum of the base stats.
- Use unnest_longer() to extract the base stats, then calculate the sum of the base stats.
# base stats in one column per stat
pokemon_points <- tibble(pokemon) |>
  # one column per variable
  unnest_wider(pokemon) |>
  # extract the pokemon's primary type from the type column
  hoist(.col = type, main_type = 1L) |>
  # expand to get base stats
  unnest_wider(base) |>
  # for each row, calculate the sum of HP:Speed
  rowwise() |>
  mutate(total = sum(c_across(cols = HP:Speed), na.rm = TRUE), .before = HP) |>
  ungroup() |>
  # exclude pokemon with total = 0 - means we don't have stats available
  filter(total != 0) |>
  select(id, main_type, total)
pokemon_points
# A tibble: 809 × 3
      id main_type total
   <int> <chr>     <int>
 1     1 Grass       318
 2     2 Grass       405
 3     3 Grass       525
 4     4 Fire        309
 5     5 Fire        405
 6     6 Fire        534
 7     7 Water       314
 8     8 Water       405
 9     9 Water       530
10    10 Bug         195
# ℹ 799 more rows
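The rowwise() approach above computes one sum per row. A vectorized variant, offered here only as an alternative sketch and not as part of the suggested answer, is to pick() the stat columns and pass them to rowSums(). The stats tibble below is hypothetical and has only three stat columns.

```r
library(dplyr)

# Hypothetical base stats for two Pokémon, already one column per stat
stats <- tibble(
  id = 1:2,
  HP = c(45L, 39L),
  Attack = c(49L, 52L),
  Speed = c(45L, 65L)
)

# rowwise() + c_across(), as in the approach above
by_row <- stats |>
  rowwise() |>
  mutate(total = sum(c_across(cols = HP:Speed), na.rm = TRUE)) |>
  ungroup()

# Vectorized alternative: select the stat columns and sum across each row
vectorised <- stats |>
  mutate(total = rowSums(pick(HP:Speed), na.rm = TRUE))

by_row$total
vectorised$total
```

Both give the same totals; note that rowSums() returns doubles, whereas summing integer columns with sum() keeps the result as an integer.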
# base stats in one row per pokemon per stat
pokemon_points <- tibble(pokemon) |>
  # one column per variable
  unnest_wider(pokemon) |>
  # extract the pokemon's primary type from the type column
  hoist(.col = type, main_type = 1L) |>
  # expand to get base stats
  unnest_longer(base,
    values_to = "points",
    indices_to = "stat"
  ) |>
  # calculate sum of points for each pokemon
  summarize(total = sum(points, na.rm = TRUE), .by = c(id, main_type))
pokemon_points
# A tibble: 809 × 3
      id main_type total
   <int> <chr>     <int>
 1     1 Grass       318
 2     2 Grass       405
 3     3 Grass       525
 4     4 Fire        309
 5     5 Fire        405
 6     6 Fire        534
 7     7 Water       314
 8     8 Water       405
 9     9 Water       530
10    10 Bug         195
# ℹ 799 more rows
pokemon_points |>
  # order the boxplots meaningfully
  mutate(main_type = fct_reorder(.f = main_type, .x = total)) |>
  # generate the plot
  ggplot(mapping = aes(x = total, y = main_type)) +
  geom_boxplot() +
  labs(
    title = "Flying-type Pokémon are the most powerful on average",
    x = "Total points",
    y = NULL,
    caption = "Source: PokéAPI"
  ) +
  theme_minimal()
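fct_reorder() orders the levels by the median of .x unless a different .fun is supplied, so the boxplots above are sorted by median total points. The sketch below uses hypothetical totals, chosen so that the median and the mean disagree, to show how the ordering depends on that choice.

```r
library(dplyr)
library(forcats)

# Hypothetical totals: Bug has one extreme value that pulls its mean up
points_toy <- tibble(
  main_type = c("Bug", "Bug", "Bug", "Fire", "Fire", "Fire"),
  total = c(195, 205, 800, 300, 310, 320)
)

# Median and mean per type, sorted by median
type_summary <- points_toy |>
  summarize(
    median_total = median(total),
    mean_total = mean(total),
    .by = main_type
  ) |>
  arrange(desc(median_total))

# Default ordering uses the median; .fun = mean changes the ranking here
levels_by_median <- levels(fct_reorder(.f = points_toy$main_type, .x = points_toy$total))
levels_by_mean <- levels(
  fct_reorder(.f = points_toy$main_type, .x = points_toy$total, .fun = mean)
)

type_summary
```

A summary table like type_summary is a quick way to check that a plot title describing the "average" matches the statistic actually used for the ordering.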
From what types of eggs do Pokémon hatch? (Bonus!)
Generation II of Pokémon introduced the concept of breeding, whereby Pokémon can produce offspring. The offspring take the form of eggs, which can be hatched to produce a Pokémon.
Use each Pokémon’s egg group to determine from what types of eggs Pokémon hatch. Create a heatmap like the one below to visualize the distribution of egg groups for each primary type.
Consider using hoist() to extract the main type and egg group from each Pokémon.
tibble(pokemon) |>
  unnest_wider(pokemon) |>
  # extract the main type
  hoist(
    .col = type, main_type = 1L
  ) |>
  # extract the type of eggs from which the pokemon can hatch
  hoist(
    .col = profile, egg = "egg"
  ) |>
  select(main_type, egg) |>
  # some pokemon have more than one egg group, need to unnest longer
  unnest_longer(egg) |>
  # count the number of type-egg pairings
  count(main_type, egg) |>
  # draw the plot
  ggplot(mapping = aes(x = egg, y = main_type, fill = n)) +
  geom_tile() +
  geom_text(mapping = aes(label = n), color = "white") +
  scale_fill_viridis_c() +
  theme_minimal() +
  labs(
    title = "A few Pokémon types are more likely to hatch from certain eggs",
    x = "Egg group",
    y = "Main Pokémon type",
    caption = "Source: PokéAPI"
  ) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 30, hjust = 1)
  )
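count() only returns the type-egg pairings that occur at least once, so pairings that never occur appear as empty cells in the heatmap. If explicit zeros are preferred, one option is tidyr's complete(). This is a sketch on hypothetical pairs, not part of the suggested answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical type-egg pairs
pairs <- tibble(
  main_type = c("Fire", "Fire", "Water"),
  egg = c("Field", "Dragon", "Field")
)

# Only observed combinations are counted
counts <- pairs |>
  count(main_type, egg)

# Add the missing combinations and fill their counts with zero
filled <- counts |>
  complete(main_type, egg, fill = list(n = 0L))

filled
```

Whether to show zeros is a design choice: blank tiles emphasize which pairings exist, while filled tiles make the full grid explicit.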
Acknowledgments
- JSON data files obtained from Purukitto/pokemon-data.json