Rectangling data

Lecture 15

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2023

2023-10-18

Announcements

Announcements

  • Homework 04
  • No homework this week
  • Lab on Friday: project exploration

Lists and list-columns

Lists

x1 <- list(1:4, "a", TRUE)
x1
[[1]]
[1] 1 2 3 4

[[2]]
[1] "a"

[[3]]
[1] TRUE

Named lists

x2 <- list(a = 1:2, b = 1:3, c = 1:4)
x2
$a
[1] 1 2

$b
[1] 1 2 3

$c
[1] 1 2 3 4

Structure of lists

str(x1)
List of 3
 $ : int [1:4] 1 2 3 4
 $ : chr "a"
 $ : logi TRUE
str(x2)
List of 3
 $ a: int [1:2] 1 2
 $ b: int [1:3] 1 2 3
 $ c: int [1:4] 1 2 3 4

Nested lists

x3 <- list(list(1, 2), list(3, 4))
str(x3)
List of 2
 $ :List of 2
  ..$ : num 1
  ..$ : num 2
 $ :List of 2
  ..$ : num 3
  ..$ : num 4

List-columns

df <- tibble(
  x = 1:2, 
  y = c("a", "b"),
  z = list(list(1, 2), list(3, 4, 5))
)
df
# A tibble: 2 × 3
      x y     z         
  <int> <chr> <list>    
1     1 a     <list [2]>
2     2 b     <list [3]>

What’s up with the Sharknado results?

Sharknado results from OMDB API

# A tibble: 10 × 25
   Title         Year  Rated Released Runtime Genre Director Writer Actors Plot 
   <chr>         <chr> <chr> <chr>    <chr>   <chr> <chr>    <chr>  <chr>  <chr>
 1 Sharknado     2013  Not … 11 Jul … 86 min  Acti… Anthony… Thund… Ian Z… When…
 2 Sharknado     2013  Not … 11 Jul … 86 min  Acti… Anthony… Thund… Ian Z… When…
 3 Sharknado 2:… 2014  TV-14 30 Jul … 95 min  Acti… Anthony… Thund… Ian Z… Fin …
 4 Sharknado 2:… 2014  TV-14 30 Jul … 95 min  Acti… Anthony… Thund… Ian Z… Fin …
 5 Sharknado 3:… 2015  TV-14 22 Jul … 93 min  Acti… Anthony… Thund… Ian Z… A mo…
 6 Sharknado 3:… 2015  TV-14 22 Jul … 93 min  Acti… Anthony… Thund… Ian Z… A mo…
 7 Sharknado 4:… 2016  TV-14 31 Jul … 95 min  Acti… Anthony… Thund… Ian Z… Fin,…
 8 Sharknado 4:… 2016  TV-14 31 Jul … 95 min  Acti… Anthony… Thund… Ian Z… Fin,…
 9 Sharknado 5:… 2017  TV-14 06 Aug … 93 min  Acti… Anthony… Thund… Ian Z… With…
10 Sharknado 5:… 2017  TV-14 06 Aug … 93 min  Acti… Anthony… Thund… Ian Z… With…
# ℹ 15 more variables: Language <chr>, Country <chr>, Awards <chr>,
#   Poster <chr>, Ratings <list>, Metascore <chr>, imdbRating <chr>,
#   imdbVotes <chr>, imdbID <chr>, Type <chr>, DVD <chr>, BoxOffice <chr>,
#   Production <chr>, Website <chr>, Response <chr>

Sharknado

List of 25
 $ Title     : chr "Sharknado"
 $ Year      : chr "2013"
 $ Rated     : chr "Not Rated"
 $ Released  : chr "11 Jul 2013"
 $ Runtime   : chr "86 min"
 $ Genre     : chr "Action, Adventure, Comedy"
 $ Director  : chr "Anthony C. Ferrante"
 $ Writer    : chr "Thunder Levin"
 $ Actors    : chr "Ian Ziering, Tara Reid, John Heard"
 $ Plot      : chr "When a freak hurricane swamps Los Angeles, nature's deadliest killer rules sea, land, and air as thousands of s"| __truncated__
 $ Language  : chr "English"
 $ Country   : chr "United States"
 $ Awards    : chr "1 win & 2 nominations"
 $ Poster    : chr "https://m.media-amazon.com/images/M/MV5BODcwZWFiNTEtNDgzMC00ZmE2LWExMzYtNzZhZDgzNDc5NDkyXkEyXkFqcGdeQXVyMTQxNzM"| __truncated__
 $ Ratings   :List of 2
  ..$ :List of 2
  .. ..$ Source: chr "Internet Movie Database"
  .. ..$ Value : chr "3.3/10"
  ..$ :List of 2
  .. ..$ Source: chr "Rotten Tomatoes"
  .. ..$ Value : chr "75%"
 $ Metascore : chr "N/A"
 $ imdbRating: chr "3.3"
 $ imdbVotes : chr "52,068"
 $ imdbID    : chr "tt2724064"
 $ Type      : chr "movie"
 $ DVD       : chr "11 Oct 2016"
 $ BoxOffice : chr "N/A"
 $ Production: chr "N/A"
 $ Website   : chr "N/A"
 $ Response  : chr "True"

Rectangling messy data

Rectangling and tidyr

Art and craft of taking a deeply nested list and taming it into a tidy data set of rows and columns

  • unnest_longer() - each row contains multiple observations
  • unnest_wider() - each row contains a single observation
  • unnest_auto() - make an educated guess
  • hoist() - extract a specific element

Unnesting very wide data

gh_repos

library(repurrrsive)
repos <- tibble(json = gh_repos)
repos
# A tibble: 6 × 1
  json       
  <list>     
1 <list [30]>
2 <list [30]>
3 <list [30]>
4 <list [26]>
5 <list [30]>
6 <list [30]>

unnest_longer()

repos |>
  unnest_longer(col = json)
# A tibble: 176 × 1
   json             
   <list>           
 1 <named list [68]>
 2 <named list [68]>
 3 <named list [68]>
 4 <named list [68]>
 5 <named list [68]>
 6 <named list [68]>
 7 <named list [68]>
 8 <named list [68]>
 9 <named list [68]>
10 <named list [68]>
# ℹ 166 more rows

One row per repo

unnest_wider()

repos |>
  unnest_longer(col = json) |>
  unnest_wider(col = json)
# A tibble: 176 × 68
        id name  full_name owner        private html_url description fork  url  
     <int> <chr> <chr>     <list>       <lgl>   <chr>    <chr>       <lgl> <chr>
 1  6.12e7 after gaborcsa… <named list> FALSE   https:/… Run Code i… FALSE http…
 2  4.05e7 argu… gaborcsa… <named list> FALSE   https:/… Declarativ… FALSE http…
 3  3.64e7 ask   gaborcsa… <named list> FALSE   https:/… Friendly C… FALSE http…
 4  3.49e7 base… gaborcsa… <named list> FALSE   https:/… Do we get … FALSE http…
 5  6.16e7 cite… gaborcsa… <named list> FALSE   https:/… Test R pac… TRUE  http…
 6  3.39e7 clis… gaborcsa… <named list> FALSE   https:/… Unicode sy… FALSE http…
 7  3.72e7 cmak… gaborcsa… <named list> FALSE   https:/… port of cm… TRUE  http…
 8  6.80e7 cmark gaborcsa… <named list> FALSE   https:/… CommonMark… TRUE  http…
 9  6.32e7 cond… gaborcsa… <named list> FALSE   https:/… <NA>        TRUE  http…
10  2.43e7 cray… gaborcsa… <named list> FALSE   https:/… R package … FALSE http…
# ℹ 166 more rows
# ℹ 59 more variables: forks_url <chr>, keys_url <chr>,
#   collaborators_url <chr>, teams_url <chr>, hooks_url <chr>,
#   issue_events_url <chr>, events_url <chr>, assignees_url <chr>,
#   branches_url <chr>, tags_url <chr>, blobs_url <chr>, git_tags_url <chr>,
#   git_refs_url <chr>, trees_url <chr>, statuses_url <chr>,
#   languages_url <chr>, stargazers_url <chr>, contributors_url <chr>, …

One column per list element

What are all these columns?

repos |>
  unnest_longer(col = json) |>
  unnest_wider(col = json) |>
  names()
 [1] "id"                "name"              "full_name"        
 [4] "owner"             "private"           "html_url"         
 [7] "description"       "fork"              "url"              
[10] "forks_url"         "keys_url"          "collaborators_url"
[13] "teams_url"         "hooks_url"         "issue_events_url" 
[16] "events_url"        "assignees_url"     "branches_url"     
[19] "tags_url"          "blobs_url"         "git_tags_url"     
[22] "git_refs_url"      "trees_url"         "statuses_url"     
[25] "languages_url"     "stargazers_url"    "contributors_url" 
[28] "subscribers_url"   "subscription_url"  "commits_url"      
[31] "git_commits_url"   "comments_url"      "issue_comment_url"
[34] "contents_url"      "compare_url"       "merges_url"       
[37] "archive_url"       "downloads_url"     "issues_url"       
[40] "pulls_url"         "milestones_url"    "notifications_url"
[43] "labels_url"        "releases_url"      "deployments_url"  
[46] "created_at"        "updated_at"        "pushed_at"        
[49] "git_url"           "ssh_url"           "clone_url"        
[52] "svn_url"           "homepage"          "size"             
[55] "stargazers_count"  "watchers_count"    "language"         
[58] "has_issues"        "has_downloads"     "has_wiki"         
[61] "has_pages"         "forks_count"       "mirror_url"       
[64] "open_issues_count" "forks"             "open_issues"      
[67] "watchers"          "default_branch"   

Pick a few columns

repos |> 
  unnest_longer(col = json) |> 
  unnest_wider(col = json) |> 
  select(id, full_name, owner, description)
# A tibble: 176 × 4
         id full_name               owner             description               
      <int> <chr>                   <list>            <chr>                     
 1 61160198 gaborcsardi/after       <named list [17]> Run Code in the Background
 2 40500181 gaborcsardi/argufy      <named list [17]> Declarative function argu…
 3 36442442 gaborcsardi/ask         <named list [17]> Friendly CLI interaction …
 4 34924886 gaborcsardi/baseimports <named list [17]> Do we get warnings for un…
 5 61620661 gaborcsardi/citest      <named list [17]> Test R package and repo f…
 6 33907457 gaborcsardi/clisymbols  <named list [17]> Unicode symbols for CLI a…
 7 37236467 gaborcsardi/cmaker      <named list [17]> port of cmake to r        
 8 67959624 gaborcsardi/cmark       <named list [17]> CommonMark parsing and re…
 9 63152619 gaborcsardi/conditions  <named list [17]> <NA>                      
10 24343686 gaborcsardi/crayon      <named list [17]> R package for colored ter…
# ℹ 166 more rows

Unnest another column

repos |> 
  unnest_longer(col = json) |> 
  unnest_wider(col = json) |> 
  select(id, full_name, owner, description) |> 
  unnest_wider(owner)
Error in `unnest_wider()`:
! Can't duplicate names between the affected columns and the original
  data.
✖ These names are duplicated:
  ℹ `id`, from `owner`.
ℹ Use `names_sep` to disambiguate using the column name.
ℹ Or use `names_repair` to specify a repair strategy.

Resolve duplicate names

repos |> 
  unnest_longer(col = json) |> 
  unnest_wider(col = json) |> 
  select(id, full_name, owner, description) |> 
  unnest_wider(owner, names_sep = "_")
# A tibble: 176 × 20
         id full_name    owner_login owner_id owner_avatar_url owner_gravatar_id
      <int> <chr>        <chr>          <int> <chr>            <chr>            
 1 61160198 gaborcsardi… gaborcsardi   660288 https://avatars… ""               
 2 40500181 gaborcsardi… gaborcsardi   660288 https://avatars… ""               
 3 36442442 gaborcsardi… gaborcsardi   660288 https://avatars… ""               
 4 34924886 gaborcsardi… gaborcsardi   660288 https://avatars… ""               
 5 61620661 gaborcsardi… gaborcsardi   660288 https://avatars… ""               
 6 33907457 gaborcsardi… gaborcsardi   660288 https://avatars… ""               
 7 37236467 gaborcsardi… gaborcsardi   660288 https://avatars… ""               
 8 67959624 gaborcsardi… gaborcsardi   660288 https://avatars… ""               
 9 63152619 gaborcsardi… gaborcsardi   660288 https://avatars… ""               
10 24343686 gaborcsardi… gaborcsardi   660288 https://avatars… ""               
# ℹ 166 more rows
# ℹ 14 more variables: owner_url <chr>, owner_html_url <chr>,
#   owner_followers_url <chr>, owner_following_url <chr>,
#   owner_gists_url <chr>, owner_starred_url <chr>,
#   owner_subscriptions_url <chr>, owner_organizations_url <chr>,
#   owner_repos_url <chr>, owner_events_url <chr>,
#   owner_received_events_url <chr>, owner_type <chr>, …

Relational data stored as nested lists

An API of Ice and Fire

chars <- tibble(json = got_chars)
chars
# A tibble: 30 × 1
   json             
   <list>           
 1 <named list [18]>
 2 <named list [18]>
 3 <named list [18]>
 4 <named list [18]>
 5 <named list [18]>
 6 <named list [18]>
 7 <named list [18]>
 8 <named list [18]>
 9 <named list [18]>
10 <named list [18]>
# ℹ 20 more rows

One row per POV character

Widen the data frame

chars |> 
  unnest_wider(col = json)
# A tibble: 30 × 18
   url           id name  gender culture born  died  alive titles aliases father
   <chr>      <int> <chr> <chr>  <chr>   <chr> <chr> <lgl> <list> <list>  <chr> 
 1 https://w…  1022 Theo… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
 2 https://w…  1052 Tyri… Male   ""      "In … ""    TRUE  <chr>  <chr>   ""    
 3 https://w…  1074 Vict… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
 4 https://w…  1109 Will  Male   ""      ""    "In … FALSE <chr>  <chr>   ""    
 5 https://w…  1166 Areo… Male   "Norvo… "In … ""    TRUE  <chr>  <chr>   ""    
 6 https://w…  1267 Chett Male   ""      "At … "In … FALSE <chr>  <chr>   ""    
 7 https://w…  1295 Cres… Male   ""      "In … "In … FALSE <chr>  <chr>   ""    
 8 https://w…   130 Aria… Female "Dorni… "In … ""    TRUE  <chr>  <chr>   ""    
 9 https://w…  1303 Daen… Female "Valyr… "In … ""    TRUE  <chr>  <chr>   ""    
10 https://w…  1319 Davo… Male   "Weste… "In … ""    TRUE  <chr>  <chr>   ""    
# ℹ 20 more rows
# ℹ 7 more variables: mother <chr>, spouse <chr>, allegiances <list>,
#   books <list>, povBooks <list>, tvSeries <list>, playedBy <list>

Some of the columns

characters <- chars |> 
  unnest_wider(col = json)
characters |> 
  select(id, name, gender, culture, born, died, alive)
# A tibble: 30 × 7
      id name               gender culture    born                   died  alive
   <int> <chr>              <chr>  <chr>      <chr>                  <chr> <lgl>
 1  1022 Theon Greyjoy      Male   "Ironborn" "In 278 AC or 279 AC,… ""    TRUE 
 2  1052 Tyrion Lannister   Male   ""         "In 273 AC, at Caster… ""    TRUE 
 3  1074 Victarion Greyjoy  Male   "Ironborn" "In 268 AC or before,… ""    TRUE 
 4  1109 Will               Male   ""         ""                     "In … FALSE
 5  1166 Areo Hotah         Male   "Norvoshi" "In 257 AC or before,… ""    TRUE 
 6  1267 Chett              Male   ""         "At Hag's Mire"        "In … FALSE
 7  1295 Cressen            Male   ""         "In 219 AC or 220 AC"  "In … FALSE
 8   130 Arianne Martell    Female "Dornish"  "In 276 AC, at Sunspe… ""    TRUE 
 9  1303 Daenerys Targaryen Female "Valyrian" "In 284 AC, at Dragon… ""    TRUE 
10  1319 Davos Seaworth     Male   "Westeros" "In 260 AC or before,… ""    TRUE 
# ℹ 20 more rows

List-columns in ASOIAF

chars |> 
  unnest_wider(col = json) |> 
  select(id, where(is.list))
# A tibble: 30 × 8
      id titles    aliases    allegiances books     povBooks  tvSeries  playedBy
   <int> <list>    <list>     <list>      <list>    <list>    <list>    <list>  
 1  1022 <chr [2]> <chr [4]>  <chr [1]>   <chr [3]> <chr [2]> <chr [6]> <chr>   
 2  1052 <chr [2]> <chr [11]> <chr [1]>   <chr [2]> <chr [4]> <chr [6]> <chr>   
 3  1074 <chr [2]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [1]> <chr>   
 4  1109 <chr [1]> <chr [1]>  <NULL>      <chr [1]> <chr [1]> <chr [1]> <chr>   
 5  1166 <chr [1]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [2]> <chr>   
 6  1267 <chr [1]> <chr [1]>  <NULL>      <chr [2]> <chr [1]> <chr [1]> <chr>   
 7  1295 <chr [1]> <chr [1]>  <NULL>      <chr [2]> <chr [1]> <chr [1]> <chr>   
 8   130 <chr [1]> <chr [1]>  <chr [1]>   <chr [4]> <chr [1]> <chr [1]> <chr>   
 9  1303 <chr [5]> <chr [11]> <chr [1]>   <chr [1]> <chr [4]> <chr [6]> <chr>   
10  1319 <chr [4]> <chr [5]>  <chr [2]>   <chr [1]> <chr [3]> <chr [5]> <chr>   
# ℹ 20 more rows

Choose your own adventure

Match character’s title to their name

select(
  .data = characters,
  name,
  title = titles
)
# A tibble: 30 × 2
   name               title    
   <chr>              <list>   
 1 Theon Greyjoy      <chr [2]>
 2 Tyrion Lannister   <chr [2]>
 3 Victarion Greyjoy  <chr [2]>
 4 Will               <chr [1]>
 5 Areo Hotah         <chr [1]>
 6 Chett              <chr [1]>
 7 Cressen            <chr [1]>
 8 Arianne Martell    <chr [1]>
 9 Daenerys Targaryen <chr [5]>
10 Davos Seaworth     <chr [4]>
# ℹ 20 more rows

Match character’s title to their name

select(
  .data = characters,
  name,
  title = titles
) |>
  unnest_longer(col = title)
# A tibble: 59 × 2
   name              title                                                 
   <chr>             <chr>                                                 
 1 Theon Greyjoy     "Prince of Winterfell"                                
 2 Theon Greyjoy     "Lord of the Iron Islands (by law of the green lands)"
 3 Tyrion Lannister  "Acting Hand of the King (former)"                    
 4 Tyrion Lannister  "Master of Coin (former)"                             
 5 Victarion Greyjoy "Lord Captain of the Iron Fleet"                      
 6 Victarion Greyjoy "Master of the Iron Victory"                          
 7 Will              ""                                                    
 8 Areo Hotah        "Captain of the Guard at Sunspear"                    
 9 Chett             ""                                                    
10 Cressen           "Maester"                                             
# ℹ 49 more rows

Every appearance per book/season

select(
  .data = characters,
  name, books, tvSeries
)
# A tibble: 30 × 3
   name               books     tvSeries 
   <chr>              <list>    <list>   
 1 Theon Greyjoy      <chr [3]> <chr [6]>
 2 Tyrion Lannister   <chr [2]> <chr [6]>
 3 Victarion Greyjoy  <chr [3]> <chr [1]>
 4 Will               <chr [1]> <chr [1]>
 5 Areo Hotah         <chr [3]> <chr [2]>
 6 Chett              <chr [2]> <chr [1]>
 7 Cressen            <chr [2]> <chr [1]>
 8 Arianne Martell    <chr [4]> <chr [1]>
 9 Daenerys Targaryen <chr [1]> <chr [6]>
10 Davos Seaworth     <chr [1]> <chr [5]>
# ℹ 20 more rows

Every appearance per book/season

select(
  .data = characters,
  name, books, tvSeries
) |>
  pivot_longer(
    cols = c(books, tvSeries),
    names_to = "media",
    values_to = "value"
  )
# A tibble: 60 × 3
   name              media    value    
   <chr>             <chr>    <list>   
 1 Theon Greyjoy     books    <chr [3]>
 2 Theon Greyjoy     tvSeries <chr [6]>
 3 Tyrion Lannister  books    <chr [2]>
 4 Tyrion Lannister  tvSeries <chr [6]>
 5 Victarion Greyjoy books    <chr [3]>
 6 Victarion Greyjoy tvSeries <chr [1]>
 7 Will              books    <chr [1]>
 8 Will              tvSeries <chr [1]>
 9 Areo Hotah        books    <chr [3]>
10 Areo Hotah        tvSeries <chr [2]>
# ℹ 50 more rows

Every appearance per book/season

select(
  .data = characters,
  name, books, tvSeries
) |>
  pivot_longer(
    cols = c(books, tvSeries),
    names_to = "media",
    values_to = "value"
  ) |>
  unnest_longer(col = value)
# A tibble: 179 × 3
   name             media    value            
   <chr>            <chr>    <chr>            
 1 Theon Greyjoy    books    A Game of Thrones
 2 Theon Greyjoy    books    A Storm of Swords
 3 Theon Greyjoy    books    A Feast for Crows
 4 Theon Greyjoy    tvSeries Season 1         
 5 Theon Greyjoy    tvSeries Season 2         
 6 Theon Greyjoy    tvSeries Season 3         
 7 Theon Greyjoy    tvSeries Season 4         
 8 Theon Greyjoy    tvSeries Season 5         
 9 Theon Greyjoy    tvSeries Season 6         
10 Tyrion Lannister books    A Feast for Crows
# ℹ 169 more rows

Fix the OMDB API function

Rectangled version

omdb_api <- function(title) {
  # print a message to track progress
  message(glue::glue("Scraping {title}..."))
  
  # define request
  req <- request(base_url = "http://www.omdbapi.com/") |>
    # throttle to avoid overloading server
    req_throttle(rate = 15 / 60) |>
    # create query
    req_url_query(
      apikey = Sys.getenv("omdb_key"),
      t = title
    )
  
  # perform request
  req_results <- req |>
    req_perform()
  
  # extract results
  req_json <- req_results |>
    resp_body_json()
  
  # convert to data frame
  req_df <- tibble(result = list(req_json)) |>
    unnest_wider(col = result)
  
  return(req_df)
}

Test that it works

# titles of films
sharknados <- c(
  "Sharknado", "Sharknado 2", "Sharknado 3",
  "Sharknado 4", "Sharknado 5"
)

# iterate over titles and query API
sharknados_df <- map(.x = sharknados, .f = omdb_api) |>
  list_rbind()
sharknados_df
# A tibble: 5 × 25
  Title Year  Rated Released Runtime Genre Director Writer Actors Plot  Language
  <chr> <chr> <chr> <chr>    <chr>   <chr> <chr>    <chr>  <chr>  <chr> <chr>   
1 Shar… 2013  Not … 11 Jul … 86 min  Acti… Anthony… Thund… Ian Z… When… English 
2 Shar… 2014  TV-14 30 Jul … 95 min  Acti… Anthony… Thund… Ian Z… Fin … English 
3 Shar… 2015  TV-14 22 Jul … 93 min  Acti… Anthony… Thund… Ian Z… A mo… English 
4 Shar… 2016  TV-14 31 Jul … 95 min  Acti… Anthony… Thund… Ian Z… Fin,… English 
5 Shar… 2017  TV-14 06 Aug … 93 min  Acti… Anthony… Thund… Ian Z… With… English 
# ℹ 14 more variables: Country <chr>, Awards <chr>, Poster <chr>,
#   Ratings <list>, Metascore <chr>, imdbRating <chr>, imdbVotes <chr>,
#   imdbID <chr>, Type <chr>, DVD <chr>, BoxOffice <chr>, Production <chr>,
#   Website <chr>, Response <chr>

Application exercise

ae-13

  • Go to the course GitHub org and find your ae-13 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow

Recap

  • Lists are an efficient data structure in R for storing hierarchical data
  • unnest_longer() and unnest_wider() are useful for working with list-columns
  • Examine the list structure with str() or View() to understand how elements are stored within it

Zapdos