AE 08: Scraping articles from the Cornell Review

Suggested answers

Application exercise
Answers
Modified

September 8, 2025

Packages

We will use the following packages in this application exercise.

  • {tidyverse}: For data import, wrangling, and visualization.
  • {rvest}: For scraping HTML files.
  • {robotstxt}: For verifying if we can scrape a website.

Data scraping

See the code below stored in scrape-cornell-review.R.

# load packages
library(tidyverse)
library(rvest)
library(robotstxt)

# check that we can scrape data from the cornell review
paths_allowed("https://www.thecornellreview.org/")

# read the first page
page <- read_html("https://www.thecornellreview.org/")

# extract desired components
titles <- html_elements(x = page, css = "#main .read-title a") |>
  html_text2()

authors <- html_elements(x = page, css = "#main .byline a") |>
  html_text2()

article_dates <- html_elements(x = page, css = "#main .posts-date") |>
  html_text2()

topics <- html_elements(x = page, css = "#main .cat-links") |>
  html_text2()

abstracts <- html_elements(x = page, css = ".post-description") |>
  html_text2()

post_urls <- html_elements(x = page, css = ".aft-readmore") |>
  html_attr(name = "href")

# create a tibble with this data
review_raw <- tibble(
  title = titles,
  author = authors,
  date = article_dates,
  topic = topics,
  description = abstracts,
  url = post_urls
)

# clean up the data
review <- review_raw |>
  mutate(
    date = mdy(date),
    description = str_remove(string = description, pattern = "\nRead More")
  )

# save to disk
write_csv(x = review, file = "data/cornell-review.csv")
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.5.1 (2025-06-13)
 os       macOS Tahoe 26.0
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2025-09-26
 pandoc   3.4 @ /usr/local/bin/ (via rmarkdown)
 quarto   1.8.24 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 ! package      * version date (UTC) lib source
 P cli            3.6.5   2025-04-23 [?] RSPM (R 4.5.0)
 P digest         0.6.37  2024-08-19 [?] RSPM (R 4.5.0)
 P dplyr        * 1.1.4   2023-11-17 [?] RSPM (R 4.5.0)
 P evaluate       1.0.4   2025-06-18 [?] RSPM (R 4.5.1)
 P farver         2.1.2   2024-05-13 [?] RSPM (R 4.5.0)
 P fastmap        1.2.0   2024-05-15 [?] RSPM (R 4.5.0)
 P forcats      * 1.0.0   2023-01-29 [?] RSPM (R 4.5.0)
 P generics       0.1.4   2025-05-09 [?] RSPM (R 4.5.0)
 P ggplot2      * 3.5.2   2025-04-09 [?] RSPM (R 4.5.0)
 P glue           1.8.0   2024-09-30 [?] RSPM (R 4.5.0)
 P gtable         0.3.6   2024-10-25 [?] RSPM (R 4.5.0)
 P here           1.0.1   2020-12-13 [?] RSPM (R 4.5.0)
 P hms            1.1.3   2023-03-21 [?] RSPM (R 4.5.0)
 P htmltools      0.5.8.1 2024-04-04 [?] RSPM (R 4.5.0)
 P htmlwidgets    1.6.4   2023-12-06 [?] RSPM (R 4.5.0)
 P httr           1.4.7   2023-08-15 [?] RSPM (R 4.5.0)
 P jsonlite       2.0.0   2025-03-27 [?] RSPM (R 4.5.0)
 P knitr          1.50    2025-03-16 [?] RSPM (R 4.5.0)
 P lifecycle      1.0.4   2023-11-07 [?] RSPM (R 4.5.0)
 P lubridate    * 1.9.4   2024-12-08 [?] RSPM (R 4.5.0)
 P magrittr       2.0.3   2022-03-30 [?] RSPM (R 4.5.1)
 P pillar         1.11.0  2025-07-04 [?] RSPM (R 4.5.1)
 P pkgconfig      2.0.3   2019-09-22 [?] RSPM (R 4.5.0)
 P purrr        * 1.1.0   2025-07-10 [?] RSPM (R 4.5.0)
 P R6             2.6.1   2025-02-15 [?] RSPM (R 4.5.0)
 P RColorBrewer   1.1-3   2022-04-03 [?] RSPM (R 4.5.0)
 P readr        * 2.1.5   2024-01-10 [?] RSPM (R 4.5.0)
   renv           1.0.7   2024-04-11 [1] RSPM (R 4.5.1)
 P rlang          1.1.6   2025-04-11 [?] RSPM (R 4.5.0)
 P rmarkdown      2.29    2024-11-04 [?] RSPM
 P robotstxt    * 0.7.15  2024-08-29 [?] RSPM
 P rprojroot      2.1.0   2025-07-12 [?] RSPM (R 4.5.0)
 P rvest        * 1.0.4   2024-02-12 [?] RSPM (R 4.5.0)
 P scales         1.4.0   2025-04-24 [?] RSPM (R 4.5.0)
 P sessioninfo    1.2.3   2025-02-05 [?] RSPM (R 4.5.0)
 P stringi        1.8.7   2025-03-27 [?] RSPM (R 4.5.0)
 P stringr      * 1.5.1   2023-11-14 [?] RSPM (R 4.5.1)
 P tibble       * 3.3.0   2025-06-08 [?] RSPM (R 4.5.0)
 P tidyr        * 1.3.1   2024-01-24 [?] RSPM (R 4.5.0)
 P tidyselect     1.2.1   2024-03-11 [?] RSPM (R 4.5.0)
 P tidyverse    * 2.0.0   2023-02-22 [?] RSPM (R 4.5.0)
 P timechange     0.3.0   2024-01-18 [?] RSPM (R 4.5.0)
 P tzdb           0.5.0   2025-03-15 [?] RSPM (R 4.5.0)
 P vctrs          0.6.5   2023-12-01 [?] RSPM (R 4.5.0)
 P withr          3.0.2   2024-10-28 [?] RSPM (R 4.5.0)
 P xfun           0.52    2025-04-02 [?] RSPM (R 4.5.1)
 P xml2           1.3.8   2025-03-14 [?] RSPM (R 4.5.1)
 P yaml           2.3.10  2024-07-26 [?] RSPM (R 4.5.0)

 [1] /Users/bcs88/Projects/info-5001/course-site/renv/library/macos/R-4.5/aarch64-apple-darwin20
 [2] /Users/bcs88/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.5/aarch64-apple-darwin20/4cd76b74

 * ── Packages attached to the search path.
 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────