Lab 04 - Git workflows

Important

This lab is due October 2 at 11:59pm.

Learning goals

In this lab, you will…

  • review common workflows for data transformation and visualization
  • create branches in Git
  • merge branches through pull requests in GitHub
  • conduct code reviews

Getting started

  • Go to the info5001-fa23 organization on GitHub. Click on the repo with the prefix lab-04. It contains the starter documents you need to complete the lab.

  • Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.

Git: Branches and PRs

So far you have practiced individual workflows in Git on your application exercises and homework assignments. For the labs and project, you have collaborated with your teammates by working in a single repository for each assignment. Individuals pushed commits to GitHub, which required the other team members to pull those commits to their local repository on RStudio Workbench before editing files locally. If you failed to pull first, you may have encountered a merge conflict. We learned how to resolve merge conflicts in a previous lab.

Branches

This approach to collaboration (everyone editing the same copy of the files and hoping to avoid merge conflicts) is inefficient and confusing.

[Figure: Cars attempting to merge on a 10-lane highway, causing one car to overturn.]

A better approach is to leverage Git’s ability to create branches for separate development streams.

[Figure: The Git branching model. A diagram of a Git repository with three distinct branches. The 'little feature' and 'big feature' branches are created as spurs off of the 'main' branch. Source: Bitbucket Tutorials]

Branching means that you take a detour from the main stream of development and do work without changing the main stream. It allows one or many people to work in parallel without overwriting each other’s work. It allows someone working solo to work incrementally on an experimental idea, without jeopardizing the state of the main product.

Branching in Git is very lightweight, which means creating a branch and switching between branches is nearly instantaneous. This means Git encourages workflows which create small branches for exploration or new features, often merging them back together quickly.
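
To make this concrete, here is a minimal sketch of creating and switching branches from R. It assumes the gert package (the Git client that usethis builds on) is installed, and it works in a throwaway repository inside a temporary directory so that nothing in your lab repo is touched. The file name, commit message, author details, and branch name are all hypothetical.

```r
library(gert)

# Create a throwaway repository in a temporary directory.
repo <- tempfile("branch-demo-")
dir.create(repo)
git_init(repo)

# A branch needs at least one commit to start from, so make one.
writeLines("first draft", file.path(repo, "notes.txt"))
git_add("notes.txt", repo = repo)
git_commit(
  "Add notes",
  author = git_signature("Example Student", "student@example.com"),
  repo = repo
)

# Remember the name of the initial branch (typically main or master).
initial_branch <- git_branch(repo = repo)

# Create a new branch and switch to it in one step.
git_branch_create("exercise-1", checkout = TRUE, repo = repo)
git_branch(repo = repo)

# Switch back to the initial branch; the new branch still exists.
git_branch_checkout(initial_branch, repo = repo)
git_branch_list(local = TRUE, repo = repo)
```

Both the create and the checkout steps return almost immediately, which is why short-lived branches for a single exercise or experiment are practical.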

main

Each Git repository initializes with a single branch. By convention that branch is called main.[1] There is nothing inherently special about this branch; it is simply the first branch that is created when a new repository is generated from scratch.

[1] Historically this initial branch was named master. In an effort to use inclusive language, Git made the initial branch name configurable in 2020, and GitHub adopted main as its default for new repositories.

In practice, the main branch represents the primary development line and is often the branch used to publish or deploy a stable version of the software. It is unwise to experiment directly on the main branch because you risk breaking the software while developing new features or content.

Remotes and origin

Remote repositories are versions of your project that are hosted on the Internet or another network. A single project can have 1, 2, or even hundreds of remotes. You pull others’ changes from remotes and push your changes to remotes.

In this class, all repositories are hosted on Cornell’s GitHub server. By convention, origin is the name given to the primary remote.
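
As a rough illustration of what a remote is, the sketch below registers a remote named origin in a throwaway repository and then lists it. It assumes the gert package is installed. The URL is a made-up placeholder, not a real course repository, and no network connection is needed because nothing is fetched or pushed.

```r
library(gert)

# Throwaway repository in a temporary directory.
repo <- tempfile("remote-demo-")
dir.create(repo)
git_init(repo)

# Register a remote under the conventional name "origin".
# The URL is a hypothetical placeholder.
git_remote_add(
  url = "https://github.coecis.cornell.edu/example-org/example-repo.git",
  name = "origin",
  repo = repo
)

# List the remotes Git knows about for this repository.
remotes <- git_remote_list(repo = repo)
remotes
```

In a cloned lab repository the origin remote already exists, and usethis::git_remotes() reports it for the current project.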

Pull requests

To integrate commits from one branch into another (say, from issue-5 into main), you need to merge the branches.[2] This can all be done with Git from the terminal, but doing so is difficult to manage in collaborative environments with multiple contributors to the codebase. Furthermore, open-source software development means anyone should be able to propose changes to a codebase, while project maintainers retain oversight over which changes are incorporated.

[2] Huh, that sounds kind of like merging commits from the origin to your local repository.

To facilitate this process, GitHub implements a system of pull requests (PRs). Per the documentation:

Pull requests let you tell others about changes you’ve pushed to a branch in a repository on GitHub. Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch.

Note

Note the terminology “pull” request: you are requesting that the developer “pull” your changes into their repository and/or branch. Think of it as a form of affirmative consent.

GitHub also provides pull request reviews.

Reviews allow collaborators to comment on the changes proposed in pull requests, approve the changes, or request further changes before the pull request is merged. Repository administrators can require that all pull requests are approved before being merged.

A review has three possible statuses:

  • Comment: Submit general feedback without explicitly approving the changes or requesting additional changes.
  • Approve: Submit feedback and approve merging the changes proposed in the pull request.
  • Request changes: Submit feedback that must be addressed before the pull request can be merged.

Note

Your repositories for this assignment have all been configured to require at least one “approved” review from a team member who did not create the PR before a PR can be merged.

usethis

usethis is a workflow package: it automates repetitive tasks that arise during project setup and development, both for R packages and non-package projects. It includes an extensive set of functions for assisting with pull requests as well as other Git and GitHub operations. Many of the operations you will need to do for this assignment can be performed using the Terminal or on GitHub. Where usethis has a relevant function to perform a required operation in this assignment, we will tell you.

Create a personal access token

Warning

usethis requires you to create a personal access token (PAT) on Cornell’s GitHub server in order to use its functions. Follow the instructions below to do this for the assignment.

Run this code from your R console:

usethis::create_github_token(
  scopes = c("repo", "user", "gist", "workflow"),
  description = "RStudio Workbench",
  host = "https://github.coecis.cornell.edu/"
)

This is a helper function that takes you to the web form to create a PAT.

  • Give the PAT a description (e.g. “PAT for INFO 5001”)
  • Change the Expiration to 120 days. This ensures the PAT remains valid through the end of the course. You can also set the token to never expire, but GitHub will warn you this is not as secure as an expiring token.
  • Leave the remaining options on the pre-filled form selected and click “Generate token”. As the page says, you must store this token somewhere, because you will never be able to see it again once you leave that page or close the window. For now, you can copy it to your clipboard (we will save it in the next step).

If you lose or forget your PAT, just generate a new one.

Store your PAT

To store your PAT so you don’t have to reenter it every time you interact with Git, run the following code:

gitcreds::gitcreds_set(url = "https://github.coecis.cornell.edu/")

When prompted, paste your PAT into the console and press return. Your credential should now be saved on your computer.

Confirm your PAT is saved

Run the following code:

gh::gh_whoami(.api_url = "https://github.coecis.cornell.edu/")

usethis::git_sitrep()

You should see output that provides information about your GitHub account.

Warm up

Packages

We’ll use the tidyverse package for much of the data wrangling, the dsbox package for our dataset, the scales package for formatting axis labels, and the usethis package for our Git/GitHub workflow. These packages are already installed for you. You can load them by running the following in your Console:

library(tidyverse)
library(dsbox)
library(scales)
library(usethis)

Data

This week we’ll do some data gymnastics to refresh and review what we learned over the past few weeks using (simulated) data from Lego sales in 2018 for a sample of customers who bought Legos in the US.

The data can be found in the dsbox package, and it’s called lego_sales. Since the dataset is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package. You can find out more about the dataset by inspecting its documentation, which you can access by running ?lego_sales in the Console or using the Help menu in RStudio to search for lego_sales. The same information is available in the dsbox package’s online documentation.

Exercises

Workflow

Note

Unlike other lab assignments, each of the following exercises should be completed independently by one team member. The “collaboration” on this assignment comes from the extensive Git/GitHub workflow we are practicing.

Tip

For this assignment, you may find it easier to not stage and commit the rendered PDF until all PRs have been merged. Git can (mostly) resolve merge conflicts automatically for plain text files. It cannot do so for binary file formats such as PDFs. Since the PDF is directly produced from the source code Quarto document, you do not have to incorporate the revised version into the repo until all exercises have been completed.

For each exercise:

  1. Create a new branch. You can do this by running pr_init(branch = "<BRANCH-NAME>") from the console. Replace <BRANCH-NAME> with a brief, descriptive, syntactically valid name for the branch (no spaces). Examples include:

    • exercise-1
    • ex1
    • analysis-sales

Tip

All usethis functions should be run from the console, not from the Quarto document. They are not intended to be part of the substantive analysis and you will encounter errors rendering the document if you add them to your .qmd file.

  2. Complete the exercise in the development branch. Remember to follow best practices when you write your code, create visualizations, etc. You are expected to stage and commit just as you would for other assignments.

Note

If you used pr_init() to create the branch, you’ll notice there is no ability to push to GitHub yet. This is expected behavior. pr_init() does not automatically create a branch on the origin server for your work. You will do this in the next step.

  3. Submit your pull request. pr_push() pushes the local changes to your development branch on GitHub and puts you in position to make your pull request. When the browser window opens, click “Create pull request” to make the PR.

  4. Another team member should review the PR. This must be a different person from the one who opened the PR. Use the interface on GitHub to review the changes in the repository. If you want, you can use pr_fetch() to create a local branch that tracks the remote PR. This allows you to test the code or make additional changes.

    In order to merge the PR, a team member must submit a code review in “Approve” status. Remember that your grade on this assignment is based on the quality of your peers’ work on the exercises. If you believe the quality is insufficient, leave a review that “Request[s] changes” and explain what needs to be fixed in order for you to approve the PR.

  5. Once the PR is approved, the original team member can merge the PR. This will incorporate the commits from the exercise branch into the main branch. Do this from the GitHub webpage, then run pr_finish() to delete the local development branch, switch back to main, and pull the latest changes from GitHub.

Tip

After a PR is merged into the main branch, other team members may be unable to merge their own PRs. This is because the changes that exist in main do not exist in the separate development branches. To resolve this conflict, you can run usethis::pr_merge_main() locally while you are working in your development branch. This will sync the updated changes in origin/main with your local development branch. Resolve any merge conflicts manually, then stage/commit/push to GitHub. If you already opened your pull request, the new commits are automatically integrated into the PR. No need to open another one.

Once all exercises are completed and all PRs are merged, confirm everyone is synced again with main and submit.

Exercise 0 (set your YAML header)

Edit the YAML header to include your team name and the names and netIDs of all team members.

title: "Lab 04 - Git workflows"
subtitle: "Team name"
author: 
  - "Team member 1 (netID)"
  - "Team member 2 (netID)"
  - "Team member 3 (netID)"
  - "Team member 4 (netID)"
  - "Team member 5 (netID)"
date: today
format: pdf
editor: visual

Exercise 1

Within the most common theme of Lego sets purchased, what are the most common subthemes? Create a visualization showing the frequency of Lego sets by subtheme for the most common theme in the dataset.
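
One possible shape for the counting step is sketched below on a small invented data frame rather than on lego_sales itself. The toy values are made up, and the column names theme and subtheme are assumed to mirror the real dataset; check ?lego_sales before adapting it.

```r
library(dplyr)

# Toy purchases; the values are invented for illustration.
toy_sets <- tibble(
  theme = c("City", "City", "Castle", "City", "Space"),
  subtheme = c("Police", "Fire", "Knights", "Police", "Rockets")
)

# Find the most frequent theme.
top_theme <- toy_sets |>
  count(theme, sort = TRUE) |>
  slice_head(n = 1) |>
  pull(theme)

# Count subthemes within that theme, most frequent first.
subtheme_counts <- toy_sets |>
  filter(theme == top_theme) |>
  count(subtheme, sort = TRUE)

subtheme_counts
```

The resulting table has one row per subtheme and a count column n, which is a convenient input for a bar chart.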

Exercise 2

Which age group has spent the most money on Legos? Create a new variable called age_group and group the ages into the following categories: “18 and under”, “19 - 25”, “26 - 35”, “36 - 50”, “51 and over”.

Tip

Use the case_when() function to create the categorical variable.

Based on these age groups, determine how much each group has spent on Lego sets and report using a bar chart.
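
As a hedged illustration of how case_when() builds a categorical variable, the sketch below bins a handful of invented ages. The data frame toy_customers is hypothetical. Conditions are checked in order, so each age lands in the first range it satisfies.

```r
library(dplyr)

# Toy ages; the values are invented for illustration.
toy_customers <- tibble(age = c(16, 19, 25, 30, 42, 67))

toy_customers <- toy_customers |>
  mutate(
    # Conditions are evaluated top to bottom; the first match wins.
    age_group = case_when(
      age <= 18 ~ "18 and under",
      age <= 25 ~ "19 - 25",
      age <= 35 ~ "26 - 35",
      age <= 50 ~ "36 - 50",
      age >= 51 ~ "51 and over"
    )
  )

toy_customers
```

Any row that matches none of the conditions (for example, a missing age) receives NA, which makes unexpected values easy to spot.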

Exercise 3

Which Lego theme has made the most money for Lego? Generate a visualization reporting the revenue generated by each Lego set theme in the dataset.

Exercise 4

Which area code has spent the most money on Legos? In the US the area code is the first 3 digits of a phone number. Report the revenue generated from the top-10 area codes based on total sales as a visualization.
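
A rough sketch of the area code step is shown below on invented orders. The phone numbers, prices, and quantities are made up, and the column names phone_number, us_price, and quantity are assumptions about lego_sales that you should confirm with ?lego_sales. It also assumes the area code is the first three characters of the phone number string, and it keeps the top 2 rows only because the toy data is so small.

```r
library(dplyr)
library(stringr)

# Toy orders; all values are invented for illustration.
toy_orders <- tibble(
  phone_number = c(
    "607-555-0101", "607-555-0199", "212-555-0142", "415-555-0177"
  ),
  us_price = c(19.99, 49.99, 9.99, 29.99),
  quantity = c(1, 2, 3, 1)
)

area_revenue <- toy_orders |>
  mutate(
    # Take the first three characters as the area code.
    area_code = str_sub(phone_number, 1, 3),
    revenue = us_price * quantity
  ) |>
  group_by(area_code) |>
  summarize(total_revenue = sum(revenue)) |>
  # Keep the highest-revenue area codes, largest first.
  slice_max(total_revenue, n = 2)

area_revenue
```

If some phone numbers in the real data are missing, you will likely need to decide how to handle those rows before ranking.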

Submission

Once you are finished with the lab, you will submit your final PDF document to Gradescope.

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

You must turn in a PDF file to the Gradescope page by the submission deadline to be considered “on time”.

To submit your assignment:

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials → Cornell University NetID and log in using your NetID credentials.
  • Click on your INFO 5001 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with each exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”).
  • Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.
If you worked with another student(s) on the assignment

Grading

Component               Points
Ex 1                    7
Ex 2                    5
Ex 3                    5
Ex 4                    7
Workflow & formatting   26
Total                   50

Note

The “Workflow & formatting” component assesses the reproducible workflow. This includes:

  • Having at least 3 informative commit messages
  • Each team member contributing to the repo with at least one commit
  • Following tidyverse code style
  • All code being visible in the rendered PDF (lines no longer than 80 characters)
  • Creating all required branches
  • Closing all PRs
  • Using appropriate figure sizing, and figures with informative labels and legends

Acknowledgments