AE 08: Scraping articles from the Cornell Review

Application exercise
Modified: October 1, 2024

Packages

We will use the following packages in this application exercise.

  • tidyverse: For data import, wrangling, and visualization.
  • rvest: For scraping HTML files.
  • robotstxt: For checking whether a website permits scraping.

Data scraping

Do the scraping in the scrape-cornell-review.R script, then save the resulting data frame to the data folder.
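Before filling in the blanks in the script, it may help to see how the rvest functions behave on a page we fully control. The sketch below builds a tiny HTML document with minimal_html() and pulls out text and link targets. The class names (demo-title, demo-author) and the example.com URLs are invented for illustration; they are not the Cornell Review's actual markup, so the real selectors still need to be found by inspecting the site.

```r
library(rvest)

# A tiny stand-in page; the class names below are invented for illustration
# and are NOT the Cornell Review's actual markup.
demo_page <- minimal_html('
  <article>
    <h2 class="demo-title"><a href="https://example.com/post-1">First post</a></h2>
    <span class="demo-author">Author One</span>
  </article>
  <article>
    <h2 class="demo-title"><a href="https://example.com/post-2">Second post</a></h2>
    <span class="demo-author">Author Two</span>
  </article>
')

# html_text2() returns the visible text of each matched element
demo_titles <- html_elements(x = demo_page, css = ".demo-title") |>
  html_text2()

demo_authors <- html_elements(x = demo_page, css = ".demo-author") |>
  html_text2()

# html_attr() returns the value of a named attribute, here the link target
demo_urls <- html_elements(x = demo_page, css = ".demo-title a") |>
  html_attr(name = "href")
```

Each call returns a character vector with one entry per matched element, which is why the extracted components can later be combined column-wise into a single data frame.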

# load packages
library(tidyverse)
library(rvest)
library(robotstxt)

# check that robots.txt permits scraping the Cornell Review (TRUE means allowed)
paths_allowed("https://www.thecornellreview.org/")

# read the first page
page <- read_html("https://www.thecornellreview.org/")
# page <- read_html("data/cornell-review-raw.html") # use this local copy if the live site is unavailable

# extract desired components
titles <- html_elements(x = page, css = "______") |>
  html_text2()

authors <- html_elements(x = page, css = "______") |>
  html_text2()

article_dates <- html_elements(x = page, css = "______") |>
  html_text2()

topics <- html_elements(x = page, css = "______") |>
  html_text2()

abstracts <- html_elements(x = page, css = "______") |>
  html_text2()

post_urls <- html_elements(x = page, css = "______") |>
  html_______(______)

# create a tibble with this data and assign it to `review` (saved to disk below)
## add code here

# clean up the data
## add code here

# save to disk
write_csv(x = review, file = "data/cornell-review.csv")
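The last three steps of the script (build a tibble, clean it, save it) can be sketched on placeholder data as follows. This is only an illustration under stated assumptions: the vector contents are invented, the "By " prefix on authors and the "Month day, year" date format are guesses about what scraped text might look like, and the output goes to a temporary file rather than the data folder. The actual clean-up depends on what the scraped columns contain.

```r
library(tidyverse)

# Placeholder vectors standing in for the scraped results (invented values)
titles <- c("Example title A", "Example title B")
authors <- c("By Author One", "By Author Two")
article_dates <- c("October 1, 2024", "September 24, 2024")

# Equal-length vectors become the columns of a tibble
review <- tibble(
  title = titles,
  author = authors,
  date = article_dates
)

# Example clean-up (assumed formats): drop a leading "By " and parse the dates
review <- review |>
  mutate(
    author = str_remove(author, "^By "),
    date = lubridate::mdy(date)
  )

# Write to a temporary file so this sketch does not touch the data folder
out_file <- tempfile(fileext = ".csv")
write_csv(x = review, file = out_file)
```

Note that tibble() requires all column vectors to have the same length (or length 1), so a selector that matches more or fewer elements than the others will cause an error at this step.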