[[1]]
[1] 1 2 3 4
[[2]]
[1] "a"
[[3]]
[1] TRUE
Lecture 14
Cornell University
INFO 5001 - Fall 2024
October 17, 2024
# A tibble: 10 × 25
Title Year Rated Released Runtime Genre Director Writer Actors Plot
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Sharknado 2013 Not … 11 Jul … 86 min Acti… Anthony… Thund… Ian Z… When…
2 Sharknado 2013 Not … 11 Jul … 86 min Acti… Anthony… Thund… Ian Z… When…
3 Sharknado 2:… 2014 TV-14 30 Jul … 95 min Acti… Anthony… Thund… Ian Z… Fin …
4 Sharknado 2:… 2014 TV-14 30 Jul … 95 min Acti… Anthony… Thund… Ian Z… Fin …
5 Sharknado 3:… 2015 TV-14 22 Jul … 93 min Acti… Anthony… Thund… Ian Z… A mo…
6 Sharknado 3:… 2015 TV-14 22 Jul … 93 min Acti… Anthony… Thund… Ian Z… A mo…
7 Sharknado 4:… 2016 TV-14 31 Jul … 95 min Acti… Anthony… Thund… Ian Z… Fin,…
8 Sharknado 4:… 2016 TV-14 31 Jul … 95 min Acti… Anthony… Thund… Ian Z… Fin,…
9 Sharknado 5:… 2017 TV-14 06 Aug … 93 min Acti… Anthony… Thund… Ian Z… With…
10 Sharknado 5:… 2017 TV-14 06 Aug … 93 min Acti… Anthony… Thund… Ian Z… With…
# ℹ 15 more variables: Language <chr>, Country <chr>, Awards <chr>,
# Poster <chr>, Ratings <list>, Metascore <chr>, imdbRating <chr>,
# imdbVotes <chr>, imdbID <chr>, Type <chr>, DVD <chr>, BoxOffice <chr>,
# Production <chr>, Website <chr>, Response <chr>
List of 25
$ Title : chr "Sharknado"
$ Year : chr "2013"
$ Rated : chr "Not Rated"
$ Released : chr "11 Jul 2013"
$ Runtime : chr "86 min"
$ Genre : chr "Action, Adventure, Comedy"
$ Director : chr "Anthony C. Ferrante"
$ Writer : chr "Thunder Levin"
$ Actors : chr "Ian Ziering, Tara Reid, John Heard"
$ Plot : chr "When a freak hurricane swamps Los Angeles, nature's deadliest killer rules sea, land, and air as thousands of s"| __truncated__
$ Language : chr "English"
$ Country : chr "United States"
$ Awards : chr "1 win & 2 nominations"
$ Poster : chr "https://m.media-amazon.com/images/M/MV5BNTNkOTA0NjYtM2VjOC00ZmYzLWI1NzQtMjY4NTU2MGNkZmU0XkEyXkFqcGc@._V1_SX300.jpg"
$ Ratings :List of 2
..$ :List of 2
.. ..$ Source: chr "Internet Movie Database"
.. ..$ Value : chr "3.3/10"
..$ :List of 2
.. ..$ Source: chr "Rotten Tomatoes"
.. ..$ Value : chr "77%"
$ Metascore : chr "N/A"
$ imdbRating: chr "3.3"
$ imdbVotes : chr "53,662"
$ imdbID : chr "tt2724064"
$ Type : chr "movie"
$ DVD : chr "N/A"
$ BoxOffice : chr "N/A"
$ Production: chr "N/A"
$ Website : chr "N/A"
$ Response : chr "True"
Art and craft of taking a deeply nested list and taming it into a tidy data set of rows and columns
unnest_longer()
- each row contains multiple observationsunnest_wider()
- each row contains a single observationunnest_auto()
- make an educated guesshoist()
- extract a specific elementgh_repos
unnest_longer()
# A tibble: 176 × 1
json
<list>
1 <named list [68]>
2 <named list [68]>
3 <named list [68]>
4 <named list [68]>
5 <named list [68]>
6 <named list [68]>
7 <named list [68]>
8 <named list [68]>
9 <named list [68]>
10 <named list [68]>
# ℹ 166 more rows
One row per repo
unnest_wider()
# A tibble: 176 × 68
id name full_name owner private html_url description fork url
<int> <chr> <chr> <list> <lgl> <chr> <chr> <lgl> <chr>
1 6.12e7 after gaborcsa… <named list> FALSE https:/… Run Code i… FALSE http…
2 4.05e7 argu… gaborcsa… <named list> FALSE https:/… Declarativ… FALSE http…
3 3.64e7 ask gaborcsa… <named list> FALSE https:/… Friendly C… FALSE http…
4 3.49e7 base… gaborcsa… <named list> FALSE https:/… Do we get … FALSE http…
5 6.16e7 cite… gaborcsa… <named list> FALSE https:/… Test R pac… TRUE http…
6 3.39e7 clis… gaborcsa… <named list> FALSE https:/… Unicode sy… FALSE http…
7 3.72e7 cmak… gaborcsa… <named list> FALSE https:/… port of cm… TRUE http…
8 6.80e7 cmark gaborcsa… <named list> FALSE https:/… CommonMark… TRUE http…
9 6.32e7 cond… gaborcsa… <named list> FALSE https:/… <NA> TRUE http…
10 2.43e7 cray… gaborcsa… <named list> FALSE https:/… R package … FALSE http…
# ℹ 166 more rows
# ℹ 59 more variables: forks_url <chr>, keys_url <chr>,
# collaborators_url <chr>, teams_url <chr>, hooks_url <chr>,
# issue_events_url <chr>, events_url <chr>, assignees_url <chr>,
# branches_url <chr>, tags_url <chr>, blobs_url <chr>, git_tags_url <chr>,
# git_refs_url <chr>, trees_url <chr>, statuses_url <chr>,
# languages_url <chr>, stargazers_url <chr>, contributors_url <chr>, …
One column per list element
[1] "id" "name" "full_name"
[4] "owner" "private" "html_url"
[7] "description" "fork" "url"
[10] "forks_url" "keys_url" "collaborators_url"
[13] "teams_url" "hooks_url" "issue_events_url"
[16] "events_url" "assignees_url" "branches_url"
[19] "tags_url" "blobs_url" "git_tags_url"
[22] "git_refs_url" "trees_url" "statuses_url"
[25] "languages_url" "stargazers_url" "contributors_url"
[28] "subscribers_url" "subscription_url" "commits_url"
[31] "git_commits_url" "comments_url" "issue_comment_url"
[34] "contents_url" "compare_url" "merges_url"
[37] "archive_url" "downloads_url" "issues_url"
[40] "pulls_url" "milestones_url" "notifications_url"
[43] "labels_url" "releases_url" "deployments_url"
[46] "created_at" "updated_at" "pushed_at"
[49] "git_url" "ssh_url" "clone_url"
[52] "svn_url" "homepage" "size"
[55] "stargazers_count" "watchers_count" "language"
[58] "has_issues" "has_downloads" "has_wiki"
[61] "has_pages" "forks_count" "mirror_url"
[64] "open_issues_count" "forks" "open_issues"
[67] "watchers" "default_branch"
repos |>
unnest_longer(col = json) |>
unnest_wider(col = json) |>
select(id, full_name, owner, description)
# A tibble: 176 × 4
id full_name owner description
<int> <chr> <list> <chr>
1 61160198 gaborcsardi/after <named list [17]> Run Code in the Background
2 40500181 gaborcsardi/argufy <named list [17]> Declarative function argu…
3 36442442 gaborcsardi/ask <named list [17]> Friendly CLI interaction …
4 34924886 gaborcsardi/baseimports <named list [17]> Do we get warnings for un…
5 61620661 gaborcsardi/citest <named list [17]> Test R package and repo f…
6 33907457 gaborcsardi/clisymbols <named list [17]> Unicode symbols for CLI a…
7 37236467 gaborcsardi/cmaker <named list [17]> port of cmake to r
8 67959624 gaborcsardi/cmark <named list [17]> CommonMark parsing and re…
9 63152619 gaborcsardi/conditions <named list [17]> <NA>
10 24343686 gaborcsardi/crayon <named list [17]> R package for colored ter…
# ℹ 166 more rows
repos |>
unnest_longer(col = json) |>
unnest_wider(col = json) |>
select(id, full_name, owner, description) |>
unnest_wider(owner)
Error in `unnest_wider()`:
! Can't duplicate names between the affected columns and the original
data.
✖ These names are duplicated:
ℹ `id`, from `owner`.
ℹ Use `names_sep` to disambiguate using the column name.
ℹ Or use `names_repair` to specify a repair strategy.
repos |>
unnest_longer(col = json) |>
unnest_wider(col = json) |>
select(id, full_name, owner, description) |>
unnest_wider(owner, names_sep = "_")
# A tibble: 176 × 20
id full_name owner_login owner_id owner_avatar_url owner_gravatar_id
<int> <chr> <chr> <int> <chr> <chr>
1 61160198 gaborcsardi… gaborcsardi 660288 https://avatars… ""
2 40500181 gaborcsardi… gaborcsardi 660288 https://avatars… ""
3 36442442 gaborcsardi… gaborcsardi 660288 https://avatars… ""
4 34924886 gaborcsardi… gaborcsardi 660288 https://avatars… ""
5 61620661 gaborcsardi… gaborcsardi 660288 https://avatars… ""
6 33907457 gaborcsardi… gaborcsardi 660288 https://avatars… ""
7 37236467 gaborcsardi… gaborcsardi 660288 https://avatars… ""
8 67959624 gaborcsardi… gaborcsardi 660288 https://avatars… ""
9 63152619 gaborcsardi… gaborcsardi 660288 https://avatars… ""
10 24343686 gaborcsardi… gaborcsardi 660288 https://avatars… ""
# ℹ 166 more rows
# ℹ 14 more variables: owner_url <chr>, owner_html_url <chr>,
# owner_followers_url <chr>, owner_following_url <chr>,
# owner_gists_url <chr>, owner_starred_url <chr>,
# owner_subscriptions_url <chr>, owner_organizations_url <chr>,
# owner_repos_url <chr>, owner_events_url <chr>,
# owner_received_events_url <chr>, owner_type <chr>, …
# A tibble: 30 × 1
json
<list>
1 <named list [18]>
2 <named list [18]>
3 <named list [18]>
4 <named list [18]>
5 <named list [18]>
6 <named list [18]>
7 <named list [18]>
8 <named list [18]>
9 <named list [18]>
10 <named list [18]>
# ℹ 20 more rows
One row per POV character
# A tibble: 30 × 18
url id name gender culture born died alive titles aliases father
<chr> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <list> <list> <chr>
1 https://w… 1022 Theo… Male "Ironb… "In … "" TRUE <chr> <chr> ""
2 https://w… 1052 Tyri… Male "" "In … "" TRUE <chr> <chr> ""
3 https://w… 1074 Vict… Male "Ironb… "In … "" TRUE <chr> <chr> ""
4 https://w… 1109 Will Male "" "" "In … FALSE <chr> <chr> ""
5 https://w… 1166 Areo… Male "Norvo… "In … "" TRUE <chr> <chr> ""
6 https://w… 1267 Chett Male "" "At … "In … FALSE <chr> <chr> ""
7 https://w… 1295 Cres… Male "" "In … "In … FALSE <chr> <chr> ""
8 https://w… 130 Aria… Female "Dorni… "In … "" TRUE <chr> <chr> ""
9 https://w… 1303 Daen… Female "Valyr… "In … "" TRUE <chr> <chr> ""
10 https://w… 1319 Davo… Male "Weste… "In … "" TRUE <chr> <chr> ""
# ℹ 20 more rows
# ℹ 7 more variables: mother <chr>, spouse <chr>, allegiances <list>,
# books <list>, povBooks <list>, tvSeries <list>, playedBy <list>
characters <- chars |>
unnest_wider(col = json)
characters |>
select(id, name, gender, culture, born, died, alive)
# A tibble: 30 × 7
id name gender culture born died alive
<int> <chr> <chr> <chr> <chr> <chr> <lgl>
1 1022 Theon Greyjoy Male "Ironborn" "In 278 AC or 279 AC,… "" TRUE
2 1052 Tyrion Lannister Male "" "In 273 AC, at Caster… "" TRUE
3 1074 Victarion Greyjoy Male "Ironborn" "In 268 AC or before,… "" TRUE
4 1109 Will Male "" "" "In … FALSE
5 1166 Areo Hotah Male "Norvoshi" "In 257 AC or before,… "" TRUE
6 1267 Chett Male "" "At Hag's Mire" "In … FALSE
7 1295 Cressen Male "" "In 219 AC or 220 AC" "In … FALSE
8 130 Arianne Martell Female "Dornish" "In 276 AC, at Sunspe… "" TRUE
9 1303 Daenerys Targaryen Female "Valyrian" "In 284 AC, at Dragon… "" TRUE
10 1319 Davos Seaworth Male "Westeros" "In 260 AC or before,… "" TRUE
# ℹ 20 more rows
# A tibble: 30 × 8
id titles aliases allegiances books povBooks tvSeries playedBy
<int> <list> <list> <list> <list> <list> <list> <list>
1 1022 <chr [2]> <chr [4]> <chr [1]> <chr [3]> <chr [2]> <chr [6]> <chr>
2 1052 <chr [2]> <chr [11]> <chr [1]> <chr [2]> <chr [4]> <chr [6]> <chr>
3 1074 <chr [2]> <chr [1]> <chr [1]> <chr [3]> <chr [2]> <chr [1]> <chr>
4 1109 <chr [1]> <chr [1]> <NULL> <chr [1]> <chr [1]> <chr [1]> <chr>
5 1166 <chr [1]> <chr [1]> <chr [1]> <chr [3]> <chr [2]> <chr [2]> <chr>
6 1267 <chr [1]> <chr [1]> <NULL> <chr [2]> <chr [1]> <chr [1]> <chr>
7 1295 <chr [1]> <chr [1]> <NULL> <chr [2]> <chr [1]> <chr [1]> <chr>
8 130 <chr [1]> <chr [1]> <chr [1]> <chr [4]> <chr [1]> <chr [1]> <chr>
9 1303 <chr [5]> <chr [11]> <chr [1]> <chr [1]> <chr [4]> <chr [6]> <chr>
10 1319 <chr [4]> <chr [5]> <chr [2]> <chr [1]> <chr [3]> <chr [5]> <chr>
# ℹ 20 more rows
# A tibble: 30 × 2
name title
<chr> <list>
1 Theon Greyjoy <chr [2]>
2 Tyrion Lannister <chr [2]>
3 Victarion Greyjoy <chr [2]>
4 Will <chr [1]>
5 Areo Hotah <chr [1]>
6 Chett <chr [1]>
7 Cressen <chr [1]>
8 Arianne Martell <chr [1]>
9 Daenerys Targaryen <chr [5]>
10 Davos Seaworth <chr [4]>
# ℹ 20 more rows
# A tibble: 59 × 2
name title
<chr> <chr>
1 Theon Greyjoy "Prince of Winterfell"
2 Theon Greyjoy "Lord of the Iron Islands (by law of the green lands)"
3 Tyrion Lannister "Acting Hand of the King (former)"
4 Tyrion Lannister "Master of Coin (former)"
5 Victarion Greyjoy "Lord Captain of the Iron Fleet"
6 Victarion Greyjoy "Master of the Iron Victory"
7 Will ""
8 Areo Hotah "Captain of the Guard at Sunspear"
9 Chett ""
10 Cressen "Maester"
# ℹ 49 more rows
# A tibble: 30 × 3
name books tvSeries
<chr> <list> <list>
1 Theon Greyjoy <chr [3]> <chr [6]>
2 Tyrion Lannister <chr [2]> <chr [6]>
3 Victarion Greyjoy <chr [3]> <chr [1]>
4 Will <chr [1]> <chr [1]>
5 Areo Hotah <chr [3]> <chr [2]>
6 Chett <chr [2]> <chr [1]>
7 Cressen <chr [2]> <chr [1]>
8 Arianne Martell <chr [4]> <chr [1]>
9 Daenerys Targaryen <chr [1]> <chr [6]>
10 Davos Seaworth <chr [1]> <chr [5]>
# ℹ 20 more rows
# A tibble: 60 × 3
name media value
<chr> <chr> <list>
1 Theon Greyjoy books <chr [3]>
2 Theon Greyjoy tvSeries <chr [6]>
3 Tyrion Lannister books <chr [2]>
4 Tyrion Lannister tvSeries <chr [6]>
5 Victarion Greyjoy books <chr [3]>
6 Victarion Greyjoy tvSeries <chr [1]>
7 Will books <chr [1]>
8 Will tvSeries <chr [1]>
9 Areo Hotah books <chr [3]>
10 Areo Hotah tvSeries <chr [2]>
# ℹ 50 more rows
# A tibble: 179 × 3
name media value
<chr> <chr> <chr>
1 Theon Greyjoy books A Game of Thrones
2 Theon Greyjoy books A Storm of Swords
3 Theon Greyjoy books A Feast for Crows
4 Theon Greyjoy tvSeries Season 1
5 Theon Greyjoy tvSeries Season 2
6 Theon Greyjoy tvSeries Season 3
7 Theon Greyjoy tvSeries Season 4
8 Theon Greyjoy tvSeries Season 5
9 Theon Greyjoy tvSeries Season 6
10 Tyrion Lannister books A Feast for Crows
# ℹ 169 more rows
omdb_api <- function(title) {
# print a message to track progress
message(str_glue("Scraping {title}..."))
# define request
req <- request(base_url = "http://www.omdbapi.com/") |>
# throttle to avoid overloading server
req_throttle(rate = 15 / 60) |>
# create query
req_url_query(
apikey = Sys.getenv("omdb_key"),
t = title
)
# perform request
req_results <- req |>
req_perform()
# extract results
req_json <- req_results |>
resp_body_json()
# convert to data frame
req_df <- tibble(result = list(req_json)) |>
unnest_wider(col = result)
return(req_df)
}
# titles of films
sharknados <- c(
"Sharknado", "Sharknado 2", "Sharknado 3",
"Sharknado 4", "Sharknado 5"
)
# iterate over titles and query API
sharknados_df <- map(.x = sharknados, .f = omdb_api) |>
list_rbind()
sharknados_df
# A tibble: 5 × 25
Title Year Rated Released Runtime Genre Director Writer Actors Plot Language
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Shar… 2013 Not … 11 Jul … 86 min Acti… Anthony… Thund… Ian Z… When… English
2 Shar… 2014 TV-14 30 Jul … 95 min Acti… Anthony… Thund… Ian Z… Fin … English
3 Shar… 2015 TV-14 22 Jul … 93 min Acti… Anthony… Thund… Ian Z… A mo… English
4 Shar… 2016 TV-14 31 Jul … 95 min Acti… Anthony… Thund… Ian Z… Fin,… English
5 Shar… 2017 TV-14 06 Aug … 93 min Acti… Anthony… Thund… Ian Z… With… English
# ℹ 14 more variables: Country <chr>, Awards <chr>, Poster <chr>,
# Ratings <list>, Metascore <chr>, imdbRating <chr>, imdbVotes <chr>,
# imdbID <chr>, Type <chr>, DVD <chr>, BoxOffice <chr>, Production <chr>,
# Website <chr>, Response <chr>
ae-12
ae-12
(repo name will be suffixed with your GitHub name).renv::restore()
to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.unnest_longer()
and unnest_wider()
are useful for working with list-columnsstr()
or View()
to understand how elements are stored within itImage credit: Bulbagarden