Lab 03 - Git workflows
This lab is due September 30 at 11:59pm.
Learning goals
In this lab, you will…
- review common workflows for data transformation and visualization
- create branches in Git
- merge branches through pull requests in GitHub
- conduct code reviews
Getting started
Go to the info5001-fa24 organization on GitHub. Click on the repo with the prefix lab-03. It contains the starter documents you need to complete the lab.
Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.
Git: Branches and PRs
So far we have practiced individual workflows in Git on your application exercises and homework assignments. For the labs and project, you have collaborated with your teammates by working in a single repository for each assignment. Individuals have pushed commits to GitHub which required the other team members to pull those commits to their local repository on RStudio Workbench before editing files locally. If you failed to pull first, then you may have encountered a merge conflict. We learned how to resolve merge conflicts in a previous lab.
Branches
This approach to collaboration (everyone editing the same copy of the files and hoping to avoid merge conflicts) is inefficient and confusing.
A better approach is to leverage Git’s ability to create branches for separate development streams.
Branching means that you take a detour from the main stream of development and do work without changing the main stream. It allows one or many people to work in parallel without overwriting each other’s work. It allows someone working solo to work incrementally on an experimental idea, without jeopardizing the state of the main product.
Branching in Git is very lightweight, which means creating a branch and switching between branches is nearly instantaneous. This means Git encourages workflows which create small branches for exploration or new features, often merging them back together quickly.
main
Each Git repository initializes with a single branch. By convention that branch is called main.[1] There is nothing inherently special about this branch; it is simply the first branch created when a new repository is generated from scratch.
[1] Historically this initial branch was named master. In an effort to use inclusive language, the Git project shifted away from this term in 2020.
In practice, the main branch represents the primary development line and is often the branch used to publish or deploy a stable version of the software. It is unwise to experiment directly on the main branch, because you risk breaking the software while developing new features or content.
Remotes and origin
Remote repositories are versions of your project that are hosted on the Internet or another network. A single project can have 1, 2, or even hundreds of remotes. You pull others’ changes from remotes and push your changes to remotes.
In this class, all repositories are hosted on Cornell’s GitHub server. By convention, origin is the name given to the primary remote.
Pull requests
In order to integrate commits from one branch to another (say from issue-5 to main), you need to merge the branches.[2] This can all be done within Git from the terminal, but doing so can be difficult to manage in collaborative environments with multiple contributors to the codebase. Furthermore, open-source software development means anyone should be able to propose changes to a codebase, while project maintainers retain oversight over which changes are incorporated.
[2] Huh, that sounds kind of like merging commits from the origin to your local repository.
In order to facilitate this process, GitHub implements a system of pull requests (PR). Per the documentation:
Pull requests let you tell others about changes you’ve pushed to a branch in a repository on GitHub. Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch.
Note the terminology “pull” request: you are requesting that the developer “pull” your changes into their repository and/or branch. Think of it as a form of affirmative consent.
GitHub also provides pull request reviews.
Reviews allow collaborators to comment on the changes proposed in pull requests, approve the changes, or request further changes before the pull request is merged. Repository administrators can require that all pull requests are approved before being merged.
A review has three possible statuses:
- Comment: Submit general feedback without explicitly approving the changes or requesting additional changes.
- Approve: Submit feedback and approve merging the changes proposed in the pull request.
- Request changes: Submit feedback that must be addressed before the pull request can be merged.
Your repositories for this assignment have all been configured to require at least one “approved” review from a team member who did not create the PR before a PR can be merged.
usethis
usethis is a workflow package: it automates repetitive tasks that arise during project setup and development, both for R packages and non-package projects. It includes an extensive set of functions for assisting with pull requests as well as other Git and GitHub operations. Many of the operations you will need to do for this assignment can be performed using the Terminal or on GitHub. Where usethis has a relevant function to perform a required operation in this assignment, we will tell you.
usethis requires you to create a personal access token (PAT) on Cornell’s GitHub server in order to use its functions. If you are using the SSH protocol to authenticate with GitHub, follow the instructions in lab 00 to create a PAT.
Warm up
Packages
We’ll use the tidyverse package for much of the data wrangling, the dsbox package for our dataset, the scales package for formatting axis labels, and the usethis package for our Git/GitHub workflow. These packages are already installed for you. You can load them by running library() on each package name in your Console.
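The four packages named above can be attached with one library() call each. A minimal sketch, assuming all of them are installed as stated:

```r
# Attach the packages used in this lab (run in the Console)
library(tidyverse) # data wrangling and visualization
library(dsbox)     # provides the lego_sales dataset
library(scales)    # axis label formatting
library(usethis)   # Git/GitHub workflow helpers
```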
Data
This week we’ll do some data gymnastics to refresh and review what we learned over the past few weeks using (simulated) data from Lego sales in 2018 for a sample of customers who bought Legos in the US.
The data can be found in the dsbox package, and it’s called lego_sales. Since the dataset is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package. You can find out more about the dataset by inspecting its documentation, which you can access by running ?lego_sales in the Console or using the Help menu in RStudio to search for lego_sales. You can also find this information here.
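Alongside the documentation, a quick look at the data itself can help before starting the exercises. A minimal sketch, assuming dplyr and dsbox are installed:

```r
library(dplyr)
library(dsbox)

# Print each column's name, type, and first few values
glimpse(lego_sales)

# Open the help page (interactive sessions only)
# ?lego_sales
```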
Exercises
Workflow
Unlike other lab assignments, each of the following exercises should be completed independently by one team member. The “collaboration” on this assignment comes from the extensive Git/GitHub workflow we are practicing.
For this assignment, you may find it easier to not stage and commit the rendered PDF until all PRs have been merged. Git can (mostly) resolve merge conflicts automatically for plain text files. It cannot do so for binary file formats such as PDFs. Since the PDF is directly produced from the source code Quarto document, you do not have to incorporate the revised version into the repo until all exercises have been completed.
For each exercise:
- Create a new branch. You can do this by running pr_init(branch = "BRANCH-NAME") from the console. Replace BRANCH-NAME with a brief, syntactic, descriptive name for the branch. Examples include:
  - exercise-1
  - ex1
  - analysis-sales
  All usethis functions should be run from the console, not from the Quarto document. They are not intended to be part of the substantive analysis and you will encounter errors rendering the document if you add them to your .qmd file.
- Complete the exercise in the development branch. Remember to follow best practices when you write your code, create visualizations, etc. You are expected to stage and commit just as you would for other assignments.
- Submit your pull request. pr_push() pushes the local changes to your development branch on GitHub and puts you in position to make your pull request. When the browser window opens, click “Create pull request” to make the PR.
- Another team member should review the PR. This must be a different person from the one who opened the PR. Use the interface on GitHub to review the changes in the repository. If you want, you can use pr_fetch() to create a local branch that tracks the remote PR. This allows you to test the code or make additional changes.
  In order to merge the PR, a team member must submit a code review in “Approve” status. Remember that your grade on this assignment is based on the quality of your peers’ work on the exercises. If you believe the quality is insufficient, leave a review that “Request[s] changes” and explain what needs to be fixed in order for you to approve the PR.
- Once the PR is approved, the original team member can merge the PR. This will incorporate the commits from the exercise branch into the main branch. Do this from the GitHub webpage, then run pr_finish() to delete the local development branch, switch back to main, and pull the latest changes from GitHub.
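The steps above can be sketched as a Console session. This is an outline rather than a script: it assumes you are inside the cloned lab repo with a working PAT, the branch name exercise-1 is only an example, and each call is meant to be run on its own at the matching step.

```r
# Run these one at a time in the Console, from inside the cloned lab repo.
# "exercise-1" is an example branch name.
library(usethis)

# Author: create and switch to a new branch for the exercise
pr_init(branch = "exercise-1")

# ... edit the .qmd file, then stage and commit as usual ...

# Author: push the branch to GitHub and open the PR page in the browser
pr_push()

# Reviewer (optional): check out the PR locally to test it.
# Called without a number, pr_fetch() should offer a choice of open PRs.
pr_fetch()

# Author: after the PR is approved and merged on GitHub, clean up locally
pr_finish()
```

The author and the reviewer run different lines here, so no single team member will execute the whole sequence for a given exercise.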
After a PR is merged into the main branch, other team members may be unable to merge their own PRs. This is because the changes that now exist in main do not exist in the separate development branches. To resolve this, run usethis::pr_merge_main() locally while you are working in your development branch. This merges the latest changes from origin/main into your local development branch. Resolve any merge conflicts manually, then stage, commit, and push to GitHub. If you already opened your pull request, the new commits are automatically added to the PR; there is no need to open another one.
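A short sketch of catching a development branch up with main, assuming the branch is already checked out and was created with pr_init():

```r
# Run in the Console while your development branch is checked out
library(usethis)

# Merge the latest changes from main on GitHub into the current branch
pr_merge_main()

# ... resolve any conflicts, then stage and commit ...

# Push the updated branch; an open PR picks up the new commits
pr_push()
```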
Once all exercises are completed and all PRs are merged, confirm everyone is synced with main again and submit.
Exercise 0 (set your YAML header)
Edit the YAML header to include your team name and the names and netIDs of all team members.
title: "Lab 03 - Git workflows"
subtitle: "Team name"
author:
  - "Team member 1 (netID)"
  - "Team member 2 (netID)"
  - "Team member 3 (netID)"
date: today
format:
  typst:
    fig-format: png
Exercise 1
Within the most common theme of Lego sets purchased, what are the most common subthemes? Create a visualization showing the frequency of Lego sets by subtheme for the most common theme in the dataset.
Exercise 2
Which age group has spent the most money on Legos? Create a new variable called age_group and group the ages into the following categories: “18 and under”, “19 - 25”, “26 - 35”, “36 - 50”, “51 and over”.
Use the case_when() function to create the categorical variable.
Based on these age groups, determine how much each group has spent on Lego sets and report using a bar chart.
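As a reminder of how case_when() works, here is a minimal sketch on a made-up table. The tibble, its columns, and the size categories are hypothetical and unrelated to the lab data; the point is only that conditions are checked in order and the first match wins.

```r
library(dplyr)

# Hypothetical toy data, not the lab dataset
sets <- tibble(name = c("A", "B", "C", "D"), pieces = c(45, 250, 1200, NA))

sets <- sets |>
  mutate(
    size = case_when(
      pieces < 100 ~ "small",
      pieces < 1000 ~ "medium",
      pieces >= 1000 ~ "large"
      # rows matching no condition (here, the NA) get NA
    )
  )
```

Because the conditions are evaluated top to bottom, the second condition does not need to repeat the lower bound of its range.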
Exercise 3
Which area code has spent the most money on Legos? In the US the area code is the first 3 digits of a phone number. Report the revenue generated from the top-10 area codes based on total sales as a visualization.
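Taking the first three characters of a string can be sketched with stringr::str_sub(). The toy table below is hypothetical, and the "XXX-XXX-XXXX" phone format is an assumption; check how phone numbers are actually stored in the dataset before relying on it.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data; the "XXX-XXX-XXXX" format is an assumption
orders <- tibble(
  phone_number = c("607-555-0101", "212-555-0102", "607-555-0103"),
  amount = c(20, 35, 25)
)

revenue <- orders |>
  # keep characters 1 through 3 of each phone number
  mutate(area_code = str_sub(phone_number, 1, 3)) |>
  group_by(area_code) |>
  summarize(total = sum(amount)) |>
  # largest totals first
  arrange(desc(total))
```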
Submission
Once you are finished with the lab, you will submit your final PDF document to Gradescope.
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
You must turn in a PDF file to the Gradescope page by the submission deadline to be considered “on time”.
To submit your assignment:
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 5001 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”).
- Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
| Component | Points |
|---|---|
| Ex 1 | 8 |
| Ex 2 | 6 |
| Ex 3 | 8 |
| Workflow & formatting | 28 |
| Total | 50 |
The “Workflow & formatting” component assesses the reproducible workflow. This includes:
- Having at least 3 informative commit messages
- Each team member contributing to the repo with commits at least once
- Following tidyverse code style
- All code being visible in the rendered PDF (no lines longer than 80 characters)
- Appropriate figure sizing, and figures with informative labels and legends
- Creating all required branches
- Closing all PRs
Acknowledgments
- This assignment is derived from Data Science in a Box and licensed under CC BY-SA 4.0.
- Descriptions of Git and branching workflow are derived from Happy Git and GitHub for the useR and licensed under CC BY-NC 4.0