Lab 04 - Money in U.S. politics

Lab

Modified

October 10, 2024

Important

This lab is due October 17 at 11:59pm.

Learning goals

In this lab, you will…

use web scraping to import unstructured data from HTML files
write a function to perform repetitive operations efficiently
create an iterative operation to apply a function to multiple inputs
clean variables for data analysis

Money in politics

Every election cycle brings its own brand of excitement – and lots of money. Political donations are of particular interest to political scientists and other researchers studying politics and voting patterns. They are also of interest to citizens who want to stay informed of how much money their candidates raise and where that money comes from.

In the United States, “only American citizens (and immigrants with green cards) can contribute to federal politics, but the American divisions of foreign companies can form political action committees (PACs) and collect contributions from their American employees.”¹

¹ Source: Open Secrets - Foreign Connected PACs.

In this assignment we will scrape and work with data on foreign-connected PACs that donate to US political campaigns. First, we will get data on foreign-connected PAC contributions in the 2022 election cycle. Then, you will use a similar approach to get data such contributions from previous years so that we can examine trends over time.

In order to complete this assignment you will need a browser with the Selector Gadget extension installed.

Getting started

Go to the info5001-fa24 organization on GitHub. Click on the repo with the prefix lab-04. It contains the starter documents you need to complete the lab.
Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.
First, open the Quarto document lab-04.qmd and Render it.
Make sure it compiles without errors.

Team submission

All team members should clone the team GitHub repository for the lab. Then, one team member should edit the document YAML by adding the team name to the subtitle field and adding the names of the team members contributing to lab to the author field. Hopefully that’s everyone, but if someone doesn’t contribute during the lab session or throughout the week before the deadline, their name should not be added. If you have 4 members in your team, you can delete the line for the 5th team member. Then, this team member should render the document and commit and push the changes. All others should not touch the document at this stage.

title: "Lab 04 - Money in U.S. politics"
subtitle: "Team name"
author: 
  - "Team member 1 (netID)"
  - "Team member 2 (netID)"
  - "Team member 3 (netID)"
date: today
format:
  typst:
    fig-format: png

Warm up

Before we introduce the data, let’s warm up with some simple exercises.

Update the YAML, changing the author name to your name, and render the document.
Commit your changes with a meaningful commit message.
Push your changes to GitHub.
Go to your repo on GitHub and confirm that your changes are visible in your .qmd and .pdf files. If anything is missing, render, commit, and push again.

Packages

We’ll use the tidyverse package for much of the data wrangling, the scales package for better plot labels, and rvest for web scraping. These packages are already installed for you. You can load it by running the following in your Console:

library(tidyverse)
library(scales)
library(rvest)

Data collection via web scraping

Important

Pick a member of the team to complete exercise 1. All others should contribute to the discussion but only one person should type up the answer, run the R script, commit, and push to GitHub. All others should not touch the document.

The data come from OpenSecrets.org, a “website tracking the influence of money on U.S. politics, and how that money affects policy and citizens’ lives”. This website is hosted by The Center for Responsive Politics, which is a nonpartisan, independent nonprofit that “tracks money in U.S. politics and its effect on elections and public policy.”²

² Source: Open Secrets - About.

Before getting started, let’s check that a bot has permissions to access pages on this domain.

library(robotstxt)
paths_allowed("https://www.opensecrets.org")

[1] TRUE

Exercise 1

Our goal is to scrape data for contributions in all election years Open Secrets has data for. Since that means repeating a task many times, let’s first write a function that works on the first page. Confirm it works on a few others. Then iterate it over pages for all years.

Note

Complete the following set of steps in the lab-04-scrape-pacs.R file in your repository. This file already contains some starter code to help you out.

Write a function called scrape_pac() that scrapes information from the Open Secrets webpage for foreign-connected PAC contributions in a given year. This function should
- have one input: the URL of the webpage and should return a data frame.
- rename variables scraped, using snake_case naming.
- clean up the Name variable to remove extraneous whitespace characters.
- add a new column to the data frame for year. We will want this information when we ultimately have data from all years, so this is a good time to keep track of it. Our function doesn’t take a year argument, but the year is embedded in the URL, so we can extract it out of there, and add it as a new column. Use the str_sub() function to extract the last 4 characters from the URL. You will probably want to look at the help for this function to figure out how to specify “last 4 characters”.
Define the URLs for 2022, 2020, and 2000 contributions. Then, test your function using these URLs as inputs. Does the function seem to do what you expected it to do?
Construct a vector called urls that contains the URLs for each webpage that contains information on foreign-connected PAC contributions for a given year. Identify all election years available on Money in Politics.
Map the scrape_pac() function over urls in a way that will result in a data frame called pac_all.
Write the data frame to a csv file called pac-all.csv in the data folder.

After the team member working on Exercise 1 runs the script, commits, and pushes, another team member should pull their changes and run the script again to verify correct output. Then, they should write the answer to Exercise 2-3. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.

Data cleaning

Note

For the remaining exercises, complete them in the lab-04.qmd file in your repository.

In your Quarto document, load pac-all.csv and report its number of observations and variables using inline code.

In this section we clean the pac_all data frame to prepare it for analysis and visualization. We have two goals in data cleaning:

Separate the country_parent into two such that country and parent company appear in different columns for country-level analysis.
Convert contribution amounts in total, dems, and repubs from character strings to numeric values.

The following exercises walk you through how to make these fixes to the data.

Exercise 2

Separate country_parent into country and parent columns. Note that country and parent company names are separated by / (which will need to be specified in your function) and also note that there are some entries where the / sign appears twice and in these cases we want to only split the value at the first occurrence of /. This can be accomplished by setting the too_many argument to "merge" so that the cell is split into only 2 segments, e.g. we want "Denmark/Novo Nordisk A/S" to be split into "Denmark" and "Novo Nordisk A/S". (See help for the appropriate separate_*() function for more on this.) End your code chunk by printing out the top 10 rows of your data frame (if you just type the data frame name it should automatically do this for you).

Exercise 3

Remove the character strings including $ and , signs in the total, dems, and repubs columns and convert these columns to numeric.

parse_number() will need some help here. Be sure to check your output and make sure there were no parsing failures if you use it to convert the column. If so, rewrite your code to avoid them.
The $ character is a special character so it will need to be escaped.

End your code chunk by printing out the top 10 rows of your data frame (if you just type the data frame name it should automatically do this for you).

After the team member working on Exercises 2-3 renders, commits, and pushes, another team member should pull their changes and render the document. Then, they should write the answer to Exercise 4. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.

Data visualization and interpretation

Exercise 4

Create a line plot of total contributions from all foreign-connected PACs in Canada and Mexico over the years. Once you have made the plot, write a brief interpretation of what the graph reveals.

Filter for only Canada and Mexico.
Calculate sum of total contributions from PACs for each year for each country by using a sequence of group_by() then summarize().
Make a plot of year (x-axis) by total contributions (y-axis) where two lines identified by different colors represent each of Canada and Mexico.

After the team member working on Exercises 4 renders, commits, and pushes, another team member should pull their changes and render the document. Then, they should write the answer to Exercise 5. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.

Exercise 5

Recreate the following visualization. Once you have made the plot, write a brief interpretation of what the graph reveals. Note that these are only contributions from the United Kingdom.

Note

Note: The figure you create might look slightly different than this one if the data on the website has been updated recently.

After the team member working on Exercise 5 renders, commits, and pushes, all other team members should pull the changes and render the document. Finally, a team member different than the one responsible for typing up responses to Exercise 5 should do the last task outlined below.

Submission

Once you are finished with the lab, you will submit your final PDF document to Gradescope.

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

You must turn in a PDF file to the Gradescope page by the submission deadline to be considered “on time”.

To submit your assignment:

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials $\rightarrow$ Cornell University NetID and log in using your NetID credentials.
Click on your INFO 5001 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark all the pages associated with exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.

Team submission (Gradescope)

Follow the Gradescope instructions to include all students on the submission.

Grading

Component	Points
Ex 1	15
Ex 2	8
Ex 3	7
Ex 4	6
Ex 5	9
Workflow & formatting	5
Total	50

Note

The “Workflow & formatting” component assesses the reproducible workflow. This includes:

Having at least 3 informative commit messages
Each team member contributing to the repo with commits at least once
Following tidyverse code style
All code being visible in rendered PDF (no more than 80 characters)

Acknowledgments

This assignment is derived from Data Science in a Box and licensed under CC BY-SA 4.0.