Lecture 10
Cornell University
INFO 5001 - Fall 2024
October 1, 2024
An increasing amount of data is available on the web
These data are often provided in an unstructured format: you can always copy & paste, but it’s time-consuming and prone to errors
Web scraping is the process of extracting this information automatically and transforming it into a structured dataset
Two different scenarios:
Screen scraping: extract data from source code of website, with HTML parser (easy) or regular expression matching (less easy).
Web APIs (application programming interface): website offers a set of structured HTTP requests that return JSON or XML files.
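To make the two scenarios concrete, here is a minimal sketch that runs without network access. The HTML and JSON strings are made-up stand-ins for a page's source and an API response, and the object names are illustrative; the sketch assumes the rvest and jsonlite packages are installed.

```r
library(rvest)
library(jsonlite)

# Made-up page source standing in for a real website
raw_source <- '<ul><li class="item">apple</li><li class="item">banana</li></ul>'

# Screen scraping with an HTML parser: select nodes by CSS selector, extract text
items_parsed <- minimal_html(raw_source) |>
  html_elements("li.item") |>
  html_text2()

# Screen scraping with regular expressions: match the tags, then strip them.
# This works on the toy string but breaks easily when the markup changes.
matches <- regmatches(raw_source, gregexpr("<li[^>]*>[^<]+</li>", raw_source))[[1]]
items_regex <- gsub("<[^>]+>", "", matches)

# Web API: the response is already structured, so it parses straight into a data frame
api_response <- '[{"name": "apple", "price": 1.5}, {"name": "banana", "price": 0.25}]'
items_api <- fromJSON(api_response)
```

With a real API the JSON would come from an HTTP request rather than a string, but the parsing step is the same.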
rvest functions, designed to be chained with the pipe (|>):
read_html() - Read HTML data from a URL or character string
html_element() / html_elements() - Select the specified element(s) from an HTML document
html_table() - Parse an HTML table into a data frame
html_text() - Extract text from an element
html_text2() - Extract text from an element and lightly format it to match how text looks in the browser
html_name() - Extract elements’ names
html_attr() / html_attrs() - Extract a single attribute or all attributes
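The functions listed above can be tried on a small inline document. This is a sketch only: the HTML is invented for illustration, and minimal_html() (also from rvest) is used in place of read_html() on a live URL so the example runs offline.

```r
library(rvest)

# Invented HTML standing in for a page that read_html() would download
page <- minimal_html('
  <h1 id="title">Fruit prices</h1>
  <p id="note">line one<br>line two</p>
  <p>See the <a href="https://example.com/fruit" class="more">full list</a>.</p>
  <table>
    <tr><th>fruit</th><th>price</th></tr>
    <tr><td>apple</td><td>1.5</td></tr>
    <tr><td>banana</td><td>0.25</td></tr>
  </table>
')

# html_element() returns the first match; html_elements() returns all matches
title <- page |> html_element("h1") |> html_text2()
cells <- page |> html_elements("td") |> html_text2()

# html_text() returns the raw text; html_text2() turns <br> into a line break
note <- page |> html_element("#note")
note_raw <- html_text(note)
note_browser <- html_text2(note)

# Element name and attributes of the link
link <- page |> html_element("a")
link_name <- html_name(link)
link_url <- html_attr(link, "href")
link_attrs <- html_attrs(link)  # named character vector of every attribute

# html_table() parses the table into a data frame (a tibble)
prices <- page |> html_element("table") |> html_table()
```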
ae-08
Find your ae-08 repo (repo name will be suffixed with your GitHub name).
Run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
When working in a Quarto document, your analysis is re-run each time you render
If you scrape the web in a Quarto document, you re-scrape the data each time you render, which is undesirable (and not nice to the website)!
An alternative workflow:
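One possible version of such a workflow, offered as an assumption rather than the steps from the lecture, is to scrape once in a separate R script, save the result to disk, and have the Quarto document read the saved file. The helper scrape_once(), the stand-in scraper scrape_fruit(), and the file name below are hypothetical, and the scraper parses an inline string rather than a live site.

```r
library(rvest)

# Hypothetical helper: scrape only when no saved copy exists, then read the saved file
scrape_once <- function(path, scrape_fn) {
  if (!file.exists(path)) {
    write.csv(scrape_fn(), path, row.names = FALSE)
  }
  read.csv(path)
}

# Stand-in scraper: parses an invented inline table instead of a real page
scrape_fruit <- function() {
  minimal_html("
    <table>
      <tr><th>fruit</th><th>price</th></tr>
      <tr><td>apple</td><td>1.5</td></tr>
    </table>
  ") |>
    html_element("table") |>
    html_table()
}

# A temporary path is used here; in a project this would be a file in the repo
cache_path <- file.path(tempdir(), "fruit.csv")

fruit <- scrape_once(cache_path, scrape_fruit)        # scrapes and saves if no file exists yet
fruit_again <- scrape_once(cache_path, scrape_fruit)  # only reads the saved file
```

In this arrangement the Quarto document would contain only the read.csv() step, so rendering never touches the website.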
Two different scenarios for web scraping:
Screen scraping: extract data from source code of website, with HTML parser (easy) or regular expression matching (less easy)
Web APIs (application programming interface): website offers a set of structured HTTP requests that return JSON or XML files