AE 08: Scraping articles from the Cornell Review
Application exercise
Packages
We will use the following packages in this application exercise.
- tidyverse: For data import, wrangling, and visualization.
- rvest: For scraping HTML files.
- robotstxt: For checking whether we are allowed to scrape a website.
Data scraping
Do the scraping in the scrape-cornell-review.R script, then save the resulting data frame in the data folder.
# load packages
library(tidyverse)
library(rvest)
library(robotstxt)
# check that we can scrape data from the cornell review
paths_allowed("https://www.thecornellreview.org/")
# read the first page
page <- read_html("https://www.thecornellreview.org/")
# page <- read_html("data/cornell-review-raw.html") # use this if we break the website
# extract desired components
titles <- html_elements(x = page, css = "______") |>
  html_text2()

authors <- html_elements(x = page, css = "______") |>
  html_text2()

article_dates <- html_elements(x = page, css = "______") |>
  html_text2()

topics <- html_elements(x = page, css = "______") |>
  html_text2()

abstracts <- html_elements(x = page, css = "______") |>
  html_text2()

post_urls <- html_elements(x = page, css = "______") |>
  html_______(______)
# create a tibble with this data
## add code here
# clean up the data
## add code here
# save to disk
write_csv(x = review, file = "data/cornell-review.csv")
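
The extract-then-combine pattern in the script can be tried offline first. The sketch below is a minimal illustration, not the solution to the exercise: the HTML snippet, the class names (post-title, post-author), the example.com URLs, and the toy_ object names are all made up for this example, and the real Cornell Review page will need selectors found by inspecting its markup.

```r
library(rvest)
library(tibble)

# Hypothetical markup for illustration only; the real site's HTML and
# class names will differ and must be found by inspecting the page.
toy_page <- minimal_html('
  <article class="post">
    <h2 class="post-title"><a href="https://example.com/first">First post</a></h2>
    <span class="post-author">A. Writer</span>
  </article>
  <article class="post">
    <h2 class="post-title"><a href="https://example.com/second">Second post</a></h2>
    <span class="post-author">B. Editor</span>
  </article>
')

# html_text2() returns the visible text of each matched element
toy_titles <- html_elements(x = toy_page, css = ".post-title a") |>
  html_text2()

toy_authors <- html_elements(x = toy_page, css = ".post-author") |>
  html_text2()

# html_attr() returns the value of a named attribute, here each link's href
toy_urls <- html_elements(x = toy_page, css = ".post-title a") |>
  html_attr(name = "href")

# equal-length character vectors become the columns of a tibble
toy_review <- tibble(
  title = toy_titles,
  author = toy_authors,
  url = toy_urls
)

toy_review
```

Each selector here matches exactly one element per post, so the vectors line up row by row. If a selector on the real page matches a different number of elements than the others, tibble() will stop with an error about incompatible column sizes, which is a useful signal that the selector needs refining.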