HW 02 - Data wrangling + tidying
This homework is due September 17 at 11:59pm ET.
Learning objectives
- Transform data to extract meaning from it
- Pivot untidy data sets
- Join relational data tables
- Implement interpretable and accessible data visualizations
Getting started
Go to the info5001-fa25 organization on GitHub. Click on the repo with the prefix hw-02. It contains the starter documents you need to complete the homework.
Clone the repo and start a new workspace in Positron. See the Homework 0 instructions for details on cloning a repo and starting a new R project.
General guidance
As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There are periodic reminders throughout this assignment to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Make sure to
- Update author name on your document.
- Label all code chunks informatively and concisely.
- Follow the Tidyverse code style guidelines.
- Make at least 3 commits.
- Resize figures where needed; avoid tiny or huge plots.
- Turn in an organized, well formatted document.
Packages
We’ll use the {tidyverse} package for much of the data wrangling and visualization, the {scales} package for better formatting of labels on visualizations, and the {fivethirtyeight} package for some of the data sets.
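If you need a reminder, a minimal setup chunk for this assignment could look like the following (assuming all three packages are already installed):
library(tidyverse)       # data wrangling and visualization
library(scales)          # nicely formatted axis labels
library(fivethirtyeight) # college_recent_grads data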
Part 1: College majors and earnings
The first step in the process of turning information into knowledge is to summarize and describe the raw information - the data. In this part we explore data on college majors and earnings, specifically the data behind the FiveThirtyEight story “The Economic Guide To Picking A College Major”.
These data originally come from the American Community Survey (ACS) 2010-2012 Public Use Microdata Series. While this is outside the scope of this assignment, if you are curious about how raw data from the ACS were cleaned and prepared, see the code FiveThirtyEight authors used.
We should also note that there are many considerations that go into picking a major. Earnings potential and employment prospects are two of them, and they are important, but they don’t tell the whole story. Keep this in mind as you analyze the data.
Data
The data can be found in the {fivethirtyeight} package, and it’s called college_recent_grads. Since the dataset is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package. You can find out more about the dataset by inspecting its documentation, which you can access by running ?college_recent_grads in the Console. You can also find this information here.
You can also take a quick peek at your data frame and view its dimensions with the glimpse() function.
glimpse(college_recent_grads)
Rows: 173
Columns: 21
$ rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,…
$ major_code <int> 2419, 2416, 2415, 2417, 2405, 2418, 6202, …
$ major <chr> "Petroleum Engineering", "Mining And Miner…
$ major_category <chr> "Engineering", "Engineering", "Engineering…
$ total <int> 2339, 756, 856, 1258, 32260, 2573, 3777, 1…
$ sample_size <int> 36, 7, 3, 16, 289, 17, 51, 10, 1029, 631, …
$ men <int> 2057, 679, 725, 1123, 21239, 2200, 2110, 8…
$ women <int> 282, 77, 131, 135, 11021, 373, 1667, 960, …
$ sharewomen <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1073132…
$ employed <int> 1976, 640, 648, 758, 25694, 1857, 2912, 15…
$ employed_fulltime <int> 1849, 556, 558, 1069, 23170, 2038, 2924, 1…
$ employed_parttime <int> 270, 170, 133, 150, 5180, 264, 296, 553, 1…
$ employed_fulltime_yearround <int> 1207, 388, 340, 692, 16697, 1449, 2482, 82…
$ unemployed <int> 37, 85, 16, 40, 1672, 400, 308, 33, 4650, …
$ unemployment_rate <dbl> 0.018380527, 0.117241379, 0.024096386, 0.0…
$ p25th <dbl> 95000, 55000, 50000, 43000, 50000, 50000, …
$ median <dbl> 110000, 75000, 73000, 70000, 65000, 65000,…
$ p75th <dbl> 125000, 90000, 105000, 80000, 75000, 10200…
$ college_jobs <int> 1534, 350, 456, 529, 18314, 1142, 1768, 97…
$ non_college_jobs <int> 364, 257, 176, 102, 4440, 657, 314, 500, 1…
$ low_wage_jobs <int> 193, 50, 0, 0, 972, 244, 259, 220, 3253, 3…
Exercise 1
Which majors have the highest percentage of women? Answer the question using a single data wrangling pipeline. The output should be a tibble with the columns major and sharewomen. Only the five majors with the highest proportions of women should be included, and the major with the highest proportion of women should be at the top. In a few sentences, describe any trends you observe.
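One possible shape for this pipeline uses slice_max() from {dplyr}; other approaches, such as arrange() followed by slice_head(), work just as well:
college_recent_grads |>
  select(major, sharewomen) |>
  slice_max(sharewomen, n = 5)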
Render, commit (with a descriptive and concise commit message), and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 2
How much are college graduates (those who finished undergrad) making?
- Plot the distribution of all median incomes using a histogram with an appropriate binwidth (you will need to determine what is “appropriate”; remember there is not one single value you should use).
- Calculate the mean and median for median income. Based on the shape of the histogram, determine which of these summary statistics is useful for describing the distribution. (See the sketch below.)
- Describe the distribution of median incomes of college graduates across various majors based on your histogram from part (a), incorporating the statistic you chose in part (b) to help your narrative.
Hint: Mention shape, center, spread, and any unusual observations.
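A minimal sketch of parts (a) and (b); the binwidth of 5000 is only a placeholder you should tune, and na.rm = TRUE is a precaution in case of missing values:
# part (a): histogram of median incomes (binwidth is a placeholder)
ggplot(college_recent_grads, aes(x = median)) +
  geom_histogram(binwidth = 5000)
# part (b): mean and median of the median incomes
college_recent_grads |>
  summarize(
    mean_median_income = mean(median, na.rm = TRUE),
    median_median_income = median(median, na.rm = TRUE)
  )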
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 3
How do the distributions of median income compare across major categories?
- Calculate the minimum, median, and maximum median income per major category as well as the number of majors in each category. Your summary statistics should be in decreasing order of median income.
- Create box plots of the distribution of median income by major category (see the sketch after this list).
  - The variable major_category should be on the y-axis and median on the x-axis.
  - The boxes should be sorted meaningfully, with the major category with the largest median income at the top of the chart and the major category with the smallest median income at the bottom of the chart.
  - Use color to enhance your plot, and turn off any legends providing redundant information.
  - Style the x-axis labels such that the values are shown in thousands, e.g., 20000 should show up as $20K.
- In 1-2 sentences, describe how median incomes across various major categories compare. Your description should also touch on where your own intended/declared major (yes, your major at Cornell) falls.
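One possible sketch of parts (a) and (b); fct_reorder() handles the sorting, and label_dollar() with cut_short_scale() from {scales} produces $20K-style labels. Treat this as a starting point, not the only valid approach:
# part (a): summary statistics per major category, sorted by median income
college_recent_grads |>
  group_by(major_category) |>
  summarize(
    min_income = min(median),
    median_income = median(median),
    max_income = max(median),
    n_majors = n()
  ) |>
  arrange(desc(median_income))
# part (b): box plots with the largest median income at the top
ggplot(
  college_recent_grads,
  aes(x = median, y = fct_reorder(major_category, median), fill = major_category)
) +
  geom_boxplot(show.legend = FALSE) +
  scale_x_continuous(labels = label_dollar(scale_cut = cut_short_scale())) +
  labs(x = "Median income", y = "Major category")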
Once again, render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Part 2: Inflation across the world
For this part of the analysis you will work with inflation data from various countries in the world over the last 30 years. The datasets that you will work with in this part come from the Organisation for Economic Co-operation and Development (OECD), stats.oecd.org.
country_inflation <- read_csv("data/country-inflation.csv")
Exercise 4
Describe the data structure. What does each row of the country_inflation dataset represent? What are the columns in the dataset and what do they represent?
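Functions like glimpse(), names(), or count() can help you answer this; for example:
glimpse(country_inflation)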
Exercise 5
Tidy the data set. Reshape country_inflation such that each row represents a country/year combination, with columns country, year, and annual_inflation. Make sure that annual_inflation is a numeric variable. Save the result as a new data frame; give it a concise and informative name.
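A hedged sketch of the reshaping step with pivot_longer(). The column layout of the raw file is an assumption here (the country column is assumed to be named country, with the remaining columns holding one year each), and inflation_tidy is just a placeholder name; converting year to an integer is optional but makes plotting easier:
inflation_tidy <- country_inflation |>
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "annual_inflation",
    # values_transform ensures annual_inflation is numeric even if some
    # year columns were read in as character
    values_transform = as.numeric
  ) |>
  mutate(year = as.integer(year))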
Render, commit (with a descriptive and concise commit message), and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 6
Analyze inflation across the globe. In a single pipeline, filter your reshaped dataset to the countries of your choosing whose inflation rates you want to visualize over the years, and create a plot of annual inflation over time for these countries. Then, in a few sentences, state why you chose these countries and describe the patterns you observe in the plot, particularly focusing on anything you find surprising or not surprising, based on your knowledge (or lack thereof) of these countries’ economies.
- Data should be represented with points as well as lines connecting the points for each country.
- Each country should be represented by a different color line.
- Axes and legend should be properly labeled.
- The plot should have an appropriate title (and optionally a subtitle).
- Axis labels for annual inflation should be shown in percentages (e.g., 25% not 25).
The label_percent() function from the {scales} package will be useful.
ggplot(...) +
... +
scale_y_continuous(labels = label_percent())
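A fuller sketch of one possible plot, assuming the tidied data frame from Exercise 5 is named inflation_tidy and that inflation values are stored as whole-number percentages (hence scale = 1; drop it if they are stored as proportions):
inflation_tidy |>
  filter(country %in% c("...", "...")) |> # replace ... with your chosen countries
  ggplot(aes(x = year, y = annual_inflation, color = country)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  labs(
    x = "Year",
    y = "Annual inflation",
    color = "Country",
    title = "..."
  )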
Render, commit (with a descriptive and concise commit message), and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Part 3: Inflation in the US
The OECD defines inflation as follows:
Inflation is a rise in the general level of prices of goods and services that households acquire for the purpose of consumption in an economy over a period of time.
The main measure of inflation is the annual inflation rate which is the movement of the Consumer Price Index (CPI) from one month/period to the same month/period of the previous year expressed as percentage over time.
Source: OECD CPI FAQ
CPI is broken down into 12 divisions such as food, housing, health, etc. Your goal in this part is to create another time series plot of annual inflation, this time for only the United States.
The data you will need to create this visualization is spread across two files:
- us-inflation.csv: Annual inflation rate for the US for 12 CPI divisions. Each division is identified by an ID number.
- cpi-divisions.csv: A “lookup table” of CPI division ID numbers and their descriptions.
Let’s load both of these files.
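For example, assuming both files live in the same data/ folder as before (the name cpi_divisions is just a suggestion):
us_inflation <- read_csv("data/us-inflation.csv")
cpi_divisions <- read_csv("data/cpi-divisions.csv")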
Exercise 7
Join the data frames. Add a column to the us_inflation dataset called description which has the CPI division description that matches the cpi_division_id, by joining the two datasets.
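One possible shape for the join, assuming the lookup table is loaded as cpi_divisions and that both data frames share the key column cpi_division_id (check the actual column names with glimpse() and adjust the by argument if they differ):
us_inflation <- us_inflation |>
  # if the description column has a different name in the lookup table,
  # rename() it to description after joining
  left_join(cpi_divisions, by = "cpi_division_id")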
Exercise 8
Analyze inflation across divisions. In a single pipeline, filter your joined dataset to include a subset of CPI divisions which you wish to examine, and create a plot of annual inflation over time for these divisions. Then, in a few sentences, state why you chose these divisions and describe the patterns you observe in the plot, particularly focusing on anything you find surprising or not surprising, based on your knowledge (or lack thereof) of inflation rates in the US over the last decade.
- Data should be represented with points as well as lines connecting the points for each division.
- Each division should be represented by a different color line.
- Axes and legend should be properly labeled.
- The plot should have an appropriate title (and optionally a subtitle).
- Axis labels for annual inflation should be shown in percentages (e.g., 25% not 25).
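The plotting code here can mirror the Exercise 6 sketch, mapping color to the division description instead of the country; for example, assuming the joined data has year and annual_inflation columns analogous to Part 2:
us_inflation |>
  filter(description %in% c("...", "...")) |> # replace ... with your chosen divisions
  ggplot(aes(x = year, y = annual_inflation, color = description)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  labs(
    x = "Year",
    y = "Annual inflation",
    color = "CPI division",
    title = "..."
  )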
Once again, render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Wrap up
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 5001 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
- Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
- Exercise 1: 3 points
- Exercise 2: 8 points
- Exercise 3: 8 points
- Exercise 4: 2 points
- Exercise 5: 5 points
- Exercise 6: 6 points
- Exercise 7: 4 points
- Exercise 8: 9 points
- Workflow + formatting: 5 points
- Total: 50 points
The “Workflow & formatting” component assesses the reproducible workflow. This includes:
- At least 3 informative commit messages
- Following {tidyverse} code style
- All code being visible in the rendered PDF without automatic wrapping (lines no longer than 80 characters)
Acknowledgments
- This assignment is derived in part from Data Science in a Box and licensed under CC BY-SA 4.0.
- This assignment is derived in part from STA 199: Introduction to Data Science.