Reproducible project-based workflows

Lecture 17

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2023

2023-10-25

Announcements

  • Revised project deadlines

  • Exam begins tomorrow at 8am

Organize your life

Be organized as you go, not “tomorrow”

Don’t fret over past mistakes.

Raise the bar for new work.

Be organized



self-explaining >>> wordy, needy explainers

Be organized

>>>

  file salad
  + an out-of-date README

Good enough practices in scientific computing

PLOS Computational Biology

Wilson, Bryan, Cranston, Kitzes, Nederbragt, Teal (2017)

https://doi.org/10.1371/journal.pcbi.1005510

http://bit.ly/good-enuff

Project-oriented workflows

Adopt a project-oriented workflow

Why

  • work on more than 1 thing at a time

  • collaborate, communicate, distribute

  • start and stop

How

  • dedicated directory

  • RStudio Project

  • Git repo, probably syncing to a remote

If the top of your script is

setwd("C:\Users\jenny\path\that\only\I\have")
rm(list = ls())


Jenny will come into your office and SET YOUR COMPUTER ON FIRE 🔥.

Project-oriented workflow designs this away. 🙌

Which persist after rm(list = ls())?

Option                                   Persists?
A. library(dplyr)
B. summary <- head
C. options(stringsAsFactors = FALSE)
D. Sys.setenv(LANGUAGE = "fr")
E. x <- 1:5
F. attach(iris)
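
One way to check the quiz yourself is to set up each kind of state, run rm(list = ls()), and then inspect what is left. This is a minimal sketch, not part of the original slides: it uses splines (which ships with R) in place of dplyr, and the option name demo.persist and the environment variable DEMO_PERSIST are hypothetical stand-ins chosen so the check has no side effects on your session.

```r
# Set up the same kinds of state as the quiz options.
library(splines)                   # like A: attaches a package (splines ships with R)
summary <- head                    # B: a global object that masks base::summary
options(demo.persist = TRUE)       # like C: a session option (hypothetical name)
Sys.setenv(DEMO_PERSIST = "yes")   # like D: an environment variable (hypothetical name)
x <- 1:5                           # E: an ordinary object
attach(iris)                       # F: puts a data frame on the search path

# Remove every object in the current environment, and nothing else.
rm(list = ls())

# Inspect what survived.
still_there <- c(
  package_attached = "package:splines" %in% search(),
  summary_is_head  = identical(summary, head),
  option_set       = isTRUE(getOption("demo.persist")),
  env_var_set      = identical(Sys.getenv("DEMO_PERSIST"), "yes"),
  object_x         = exists("x", inherits = FALSE),
  iris_attached    = "iris" %in% search()
)
print(still_there)

# Clean up the search path after the demonstration.
detach(iris)
```

Only the objects created with <- are removed; everything else is session state that rm(list = ls()) never touches, which is why it is not a substitute for a fresh R session.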

What does it mean to be an RStudio Project?


RStudio leaves notes to itself in foo.Rproj


Open Project = dedicated instance of RStudio

  • dedicated R process

  • file browser pointed at Project directory

  • working directory set to Project directory

Many projects open

Use a “blank slate”


usethis::use_blank_slate()


OR


Tools -> Global Options

Restart R often


Session -> Restart R

Windows

  • Ctrl + Shift + F10

Mac

  • Cmd + Shift + 0

  • Cmd + Shift + F10

Project initiation: the local case

  1. New folder + make it an RStudio Project
     • usethis::create_project("~/i_am_new")

     • File -> New Project -> New Directory -> New Project

  2. Make existing folder into an RStudio Project
     • usethis::create_project("~/i_exist")

     • File -> New Project -> Existing Directory

Safe paths

On reproducibility of code


A large-scale study on research code quality and execution.
Trisovic, A., Lau, M.K., Pasquier, T. et al. 
Sci Data 9, 60 (2022).

Do you know where your files are?

Working directory vs. home directory

  • Working directory is associated with a specific process or running application
  • Home directory is a static, persistent thing

Working directory ≠ home directory
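
The difference can be seen with base R alone. This is a small sketch, not from the original slides. It calls setwd() only to demonstrate the point and restores the original directory immediately, and it uses tempdir() as an arbitrary second location.

```r
# The working directory belongs to the running R process.
wd_before   <- getwd()
# The home directory belongs to the user account.
home_before <- path.expand("~")

# Move the process somewhere else (a temporary directory, for illustration only).
old <- setwd(tempdir())
wd_during   <- getwd()
home_during <- path.expand("~")

# Restore the original working directory.
setwd(old)

# The working directory changed with the process; the home directory did not.
print(c(before = wd_before, during = wd_during))
print(c(before = home_before, during = home_during))
```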

Practice “safe paths”

Relative to a stable base, use file system functions

Packages with file system functions

install.packages("fs")

fs = file path handling

install.packages("here")

here = project-relative paths

Examples of a stable base

Project directory

here::here("data", "raw-data.csv")
here::here("data/raw-data.csv")

Automatically complete paths with Tab.

User’s home directory

file.path("~", ...)
fs::path_home(...)

Absolute paths

Don’t hard-wire them into your scripts.

Instead, form at runtime relative to a stable base

(BAD <- "/Users/soltoffbc/tmp/test.csv")
[1] "/Users/soltoffbc/tmp/test.csv"

(GOOD <- fs::path_home("tmp/test.csv"))
/Users/soltoffbc/tmp/test.csv

Practice safe paths

  • Use the here package to build paths inside a project.

  • Leave working directory at top-level at all times, during development.

  • Absolute paths are formed at runtime.

here example

# form filepath
here::here("figs", "built-barchart.png")
# save to disk
ggsave(here::here("figs", "built-barchart.png"))

  • Works on my machine, works on yours!

  • Works even if working directory is in a sub-folder.

  • Works for RStudio Projects, Git repos, R packages, etc.

  • Works with knitr / Quarto.

here::here()

The here package is designed to work inside a project, where that could mean:

  • RStudio Project

  • Git repo

  • R package

  • Folder with a file named .here

here::here() does not create directories; that’s your job.
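
A minimal sketch of doing that job before writing a file. To keep it runnable anywhere, it uses base R and a temporary folder as a hypothetical stand-in for the project root. In a real project the root would come from here::here() instead. The folder name demo-project and the file name example.csv are illustrative only.

```r
# Hypothetical stand-in for the project root (normally here::here()).
project_root <- file.path(tempdir(), "demo-project")

# Build the target directory path, then create it if it is missing.
out_dir <- file.path(project_root, "data")
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Now writing into the directory is safe.
out_path <- file.path(out_dir, "example.csv")
write.csv(data.frame(id = 1:3), out_path, row.names = FALSE)
file.exists(out_path)
```

fs::dir_create() does the same job and, like the guarded call above, can be run repeatedly without complaint.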

Kinds of paths

Absolute path.

dat <- read_csv("/Users/soltoffbc/Projects/info-5001/ae-15/data/installed-packages.csv")


Relative path to working directory, established by the RStudio Project.

dat <- read_csv("data/installed-packages.csv")


Relative path within the RStudio Project directory.

dat <- read_csv(here::here("data/installed-packages.csv"))
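
The practical difference between the last two shows up when the working directory is not the project root. This is a sketch under stated assumptions: it uses base R only, builds a throwaway project in a temporary folder (paths-demo and analysis are hypothetical names), and stores the root in project_root, which plays the role here::here() would play in a real project.

```r
# Build a throwaway project with data/ and analysis/ sub-folders.
project_root <- file.path(tempdir(), "paths-demo")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_root, "analysis"), showWarnings = FALSE)
write.csv(
  data.frame(pkg = c("fs", "here")),
  file.path(project_root, "data", "installed-packages.csv"),
  row.names = FALSE
)

# From the project root, the plain relative path resolves.
old <- setwd(project_root)
found_from_root <- file.exists("data/installed-packages.csv")

# From a sub-folder, the same relative path no longer resolves ...
setwd("analysis")
found_from_subdir <- file.exists("data/installed-packages.csv")

# ... but a path built from the stable base still does.
found_via_base <- file.exists(
  file.path(project_root, "data", "installed-packages.csv")
)

# Restore the original working directory.
setwd(old)
print(c(root = found_from_root, subdir = found_from_subdir, base = found_via_base))
```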

What if my data can’t live in my project directory?

  1. Are you sure it can’t?

  2. Review the Good Enough Practices paper for tips.

  3. Create a symbolic link to access the data. (fs::link_create(), fs::link_path())

  4. Put the data in an R package.

  5. Use pins.

  6. Explore other data warehousing options.
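
Option 3 can be sketched with base R's file.symlink() and Sys.readlink(), which play roughly the same roles as fs::link_create() and fs::link_path(). Everything here is illustrative: the folders external-storage and link-demo and the file big-data.csv are hypothetical and live under tempdir(). Creating symbolic links may require extra privileges on Windows.

```r
# A folder outside the project that holds the data (hypothetical).
external_dir <- file.path(tempdir(), "external-storage")
dir.create(external_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3), file.path(external_dir, "big-data.csv"),
          row.names = FALSE)

# The project folder (hypothetical).
project_root <- file.path(tempdir(), "link-demo")
dir.create(project_root, showWarnings = FALSE)

# Link <project>/data to the external folder, unless the link already exists.
link <- file.path(project_root, "data")
if (!file.exists(link)) {
  file.symlink(external_dir, link)
}

# Project code can now read the data through a project-relative path.
dat <- read.csv(file.path(link, "big-data.csv"))

# Show where the link points.
Sys.readlink(link)
```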

Project structure

Break logic and output into pieces

Process

Project code

smell-test.R

wrangle.R

model.R

make-figs.R

report.qmd

>>>

    everything.R

Process and code

Project artifacts

raw-data.xlsx

data.csv

fit.rds

ests.csv

>>>

.RData
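
The idea is that each script reads the artifact written by the previous step and writes its own, so any stage can be rerun or inspected on its own. A compressed sketch follows, with the three stages shown in one block for brevity. It is an illustration under assumptions rather than the course's code: mtcars stands in for raw-data.xlsx, the model is arbitrary, and the helper p() and the folder pipeline-demo are hypothetical.

```r
# Hypothetical project root and path helper (here::here() in a real project).
project_root <- file.path(tempdir(), "pipeline-demo")
dir.create(project_root, showWarnings = FALSE)
p <- function(...) file.path(project_root, ...)

# wrangle.R: raw data in, data.csv out.
raw   <- mtcars                                  # stand-in for raw-data.xlsx
clean <- raw[raw$wt < 5, c("mpg", "wt")]
write.csv(clean, p("data.csv"), row.names = FALSE)

# model.R: data.csv in, fit.rds out.
dat <- read.csv(p("data.csv"))
fit <- lm(mpg ~ wt, data = dat)
saveRDS(fit, p("fit.rds"))

# Later step: fit.rds in, ests.csv out.
fit_loaded <- readRDS(p("fit.rds"))
ests <- data.frame(
  term     = names(coef(fit_loaded)),
  estimate = unname(coef(fit_loaded))
)
write.csv(ests, p("ests.csv"), row.names = FALSE)

list.files(project_root)
```

Because every intermediate result is a named file rather than an object in a saved workspace, nothing depends on what happens to be in memory.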

Process and artifacts

A humane API for analysis

Application exercise

ae-15

  • Go to the course GitHub org and find your ae-15 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio Workbench, open the R scripts in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow

Recap

  • Use project-based workflows to easily and reproducibly structure your work
  • Create usable, reproducible file paths using here
  • Split project workflow into separate files (scripts and/or Quarto documents) based on substantive tasks

Acknowledgments

Good luck on the exam!