Lab 01 - Data visualization


This lab is due September 5 (Tuesday due to Labor Day) at 11:59pm ET.

Learning goals

In this lab, you will…

  • learn how to effectively visualize numeric and categorical data.
  • continue developing a workflow for reproducible data analysis.

Getting started

  • Go to the info5001-fa23 organization on GitHub. Click on the repo with the prefix lab-01. It contains the starter documents you need to complete the lab.

  • Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.

If you are completing the assignment with a peer

You may complete a lab assignment collaboratively with up to three peers in the class (maximum group size of four). If you choose to do so, please submit the assignment only once in Gradescope. You should also revise the YAML header (the author argument only) in your Quarto document so you can list all the students who worked on the assignment.

author:
  - "Student 1 (netID)"
  - "Student 2 (netID)"
  - "Student 3 (netID)"
  - "Student 4 (netID)"


We will use the tidyverse package to create and customize plots in R, and we will obtain the data from the rcis package.


rcis is not published on CRAN, so it must be installed from GitHub. If you are using RStudio Workbench, it is already installed. If you are using a local installation of R, use the code below to install it.

# check if remotes is installed and install if necessary
if (!require("remotes")) {
  install.packages("remotes")
}

# install rcis from GitHub with remotes::install_github()

Data: Exploring higher education statistics

The data in this lab is in the scorecard data frame. It is part of the rcis R package, so the scorecard data set is automatically loaded when you load the rcis package.

The data contains a selection of variables from the College Scorecard database. It includes every 4-year college or university in the United States.

Because the data set is part of the rcis package, you can read its documentation, including variable definitions, by typing ?scorecard in the console.


As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled with appropriate units indicated (e.g. %, $), and careful consideration should be given to aesthetic choices.
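
As a minimal sketch of these labeling conventions, the example below uses the diamonds data that ships with ggplot2 rather than the lab data. The object name p_labels and the label text are illustrative assumptions. labs() sets the title and axis labels, and scales::label_dollar() formats the axis ticks as dollar amounts.

```r
library(ggplot2)

# Illustrative only: `diamonds` ships with ggplot2; it is not the lab data.
p_labels <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  # format the y axis tick labels as dollar amounts
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(
    title = "Diamond price by weight",
    x = "Weight (carats)",
    y = "Price (US dollars)"
  )

p_labels
```

For a variable already stored in percentage points, scales::label_percent(scale = 1) plays the same role for a % axis.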

In addition, no line of code should exceed 80 characters, so that all of the code can be read when you render to PDF. To help with this, you can display a vertical guide at 80 characters in RStudio: click “Tools” → “Global Options” → “Code” → “Display”, set “Margin Column” to 80, and click “Apply”.

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course. This assignment includes periodic reminders to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

  1. Make a histogram to visualize the net cost of attendance. Set the binwidth to 1,000 and include axis labels and a title.
    • Describe the shape of the distribution.
    • Do there appear to be any outliers? Briefly explain.

For more details and code examples for histograms, see the ggplot2 reference page.
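
As a hedged illustration of setting a bin width, the sketch below draws a histogram of the price variable in ggplot2's built-in diamonds data, not the lab data. The name p_hist and the label text are assumptions made for this example only.

```r
library(ggplot2)

# Illustrative only: `diamonds` ships with ggplot2; it is not the lab data.
p_hist <- ggplot(diamonds, aes(x = price)) +
  # each bar covers a $1,000-wide interval of price
  geom_histogram(binwidth = 1000) +
  scale_x_continuous(labels = scales::label_dollar()) +
  labs(
    title = "Distribution of diamond prices",
    x = "Price (US dollars)",
    y = "Number of diamonds"
  )

p_hist
```

A smaller binwidth shows more detail but can make the shape noisier, so it is worth trying a few values before settling on one.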

  2. Create a scatterplot of the net cost of attendance (netcost) versus average faculty salary (avgfacsal) with points colored by type of university (i.e. public, private non-profit, or private for-profit). Label the axes and legend, and give the plot a title. Use the scale_color_viridis_d() function to apply the viridis color palette to your plot.

See Introduction to the viridis color maps to read more about the viridis R package and see code examples.
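
The sketch below shows one way to color points by a categorical variable and apply the discrete viridis scale. It uses ggplot2's built-in mpg data as a stand-in, so the variables (displ, hwy, drv), the name p_scatter, and the label text are assumptions for illustration and not part of the lab.

```r
library(ggplot2)

# Illustrative only: `mpg` ships with ggplot2; it is not the lab data.
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  # discrete viridis palette, one color per level of `drv`
  scale_color_viridis_d() +
  labs(
    title = "Highway fuel economy by engine size and drive train",
    x = "Engine displacement (liters)",
    y = "Highway fuel economy (miles per gallon)",
    color = "Drive train"
  )

p_scatter
```

Setting color = inside labs() is what renames the legend, since the legend title follows the aesthetic that produced it.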

Render, commit, and push your changes to GitHub with the commit message “Added answer for Ex 1-2”.

Make sure to commit and push all changed files so that your Git pane is empty afterwards.

  3. Describe what you observe in the plot from the previous exercise. In your description, include similarities and differences in the patterns across college types. Is this an effective graph to assess these patterns? Why or why not?

  4. Now, let’s examine the relationship between the same two variables, using a separate plot for each type. Label the axes and give the plot a title. Use geom_smooth() with the argument se = FALSE to add a smooth curve fit to the data. Which plot do you prefer: this plot or the plot in Ex 2? Briefly explain your choice.


se = FALSE removes the confidence bands around the line. These bands show the uncertainty around the smooth curve. We’ll discuss uncertainty around estimates later in the course and bring these bands back then.
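
One way to get a separate panel per group is faceting. The sketch below, again on ggplot2's built-in mpg data and not the lab data, pairs facet_wrap() with geom_smooth(se = FALSE). The name p_facet and the label text are illustrative assumptions, and faceting is only one possible reading of "a separate plot for each type".

```r
library(ggplot2)

# Illustrative only: `mpg` ships with ggplot2; it is not the lab data.
p_facet <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # smooth curve without the confidence band
  geom_smooth(se = FALSE) +
  # one panel per level of `drv`
  facet_wrap(vars(drv)) +
  labs(
    title = "Highway fuel economy by engine size, split by drive train",
    x = "Engine displacement (liters)",
    y = "Highway fuel economy (miles per gallon)"
  )

p_facet
```

Because each panel shares the same axes by default, the shapes of the curves can be compared directly across groups.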

Now is another good time to render, commit, and push your changes to GitHub with a meaningful commit message.

Once again, make sure to commit and push all changed files so that your Git pane is empty afterwards.

  5. Do students leaving some types of colleges tend to have higher debt loads than others? To explore this question, create side-by-side boxplots of median student debt (debt) by type of college (type).
    • Describe what you observe from the plot.
    • Which type of college has the single highest median student debt? How do you know based on the plot?
  6. Are some types of colleges more likely to be located in a city? Create a segmented bar chart with one bar per type of college and the fill determined by the distribution of locale, which identifies whether the college is located in a city, suburb, town, or rural area. The y axis of the segmented bar chart should range from 0 to 1.
    • What do you notice from the plot?

For this exercise, you should begin with the data wrangling code below. We will learn more about data wrangling next week.

scorecard <- scorecard |>
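
The two chart types in the boxplot and segmented bar chart exercises above can be sketched on a different data set. The example uses ggplot2's built-in mpg data, so the variables (class, hwy, drv), the names p_box and p_fill, and the label text are assumptions for illustration only.

```r
library(ggplot2)

# Illustrative only: `mpg` ships with ggplot2; it is not the lab data.

# Side-by-side boxplots: one box of highway mileage per vehicle class.
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  labs(
    title = "Highway fuel economy by vehicle class",
    x = "Vehicle class",
    y = "Highway fuel economy (miles per gallon)"
  )

# Segmented bar chart: position = "fill" rescales every bar to proportions,
# so the y axis runs from 0 to 1.
p_fill <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill") +
  labs(
    title = "Drive train mix within each vehicle class",
    x = "Vehicle class",
    y = "Proportion of models",
    fill = "Drive train"
  )

p_box
p_fill
```

With position = "fill", bar heights no longer show counts, which makes group compositions comparable even when group sizes differ.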

Now is another good time to render, commit, and push your changes to GitHub with a meaningful commit message.

And once again, make sure to commit and push all changed files so that your Git pane is empty afterwards. We keep repeating this because it’s important, and because we see students forget to do this. So take a moment to make sure you’re following along with the instructions around Git.

  7. Recreate the plot below.


Render, commit, and push your final changes to GitHub with a meaningful commit message.

Make sure to commit and push all changed files so that your Git pane is empty afterwards.


Once you are finished with the lab, you will submit your final PDF document to Gradescope.


Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

You must turn in a PDF file to the Gradescope page by the submission deadline to be considered “on time”.

To submit your assignment:

  • Go to Gradescope and click Log in in the top right corner.
  • Click School Credentials → Cornell University NetID and log in using your NetID credentials.
  • Click on your INFO 5001 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark the pages associated with each exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”).
  • Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.
If you worked with one or more other students on the assignment

Grading (50 pts)

Component               Points
Ex 1                    4
Ex 2                    6
Ex 3                    4
Ex 4                    8
Ex 5                    6
Ex 6                    6
Ex 7                    8
Workflow & formatting   8

The “Workflow & formatting” component assesses the reproducible workflow. This includes:

  • Following tidyverse code style
  • All code being visible in the rendered PDF (no line longer than 80 characters)
  • Appropriate figure sizing, and figures with informative labels and legends