Reproducible project-based workflows

Lecture 16

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2024

October 24, 2024

Announcements

Exam begins tomorrow at 8am

Project-oriented workflows

Adopt a project-oriented workflow

Why

  • Work on more than 1 thing at a time

  • Collaborate, communicate, distribute

  • Start and stop

How

  • Dedicated directory

  • RStudio Project

  • Git repo, probably syncing to a remote

What does it mean to be an RStudio Project?


RStudio leaves notes to itself in foo.Rproj


Open Project = dedicated instance of RStudio

  • dedicated R process

  • file browser pointed at Project directory

  • working directory set to Project directory

Many projects open

Think of your R processes as livestock, not pets

Pets or cattle?

R Session

  • R process (i.e., a “session”)
  • Treat individual R processes and workspaces as disposable

Workspace

  • Libraries with library()
  • User-created objects

Treat your source code as precious, not the workspace

Save code, not workspace

  • Enforces reproducibility
  • Easy to regenerate on-demand
  • Always save commands
  • Always start R with a blank slate
  • Restart R often

Always start R with a blank slate

usethis::use_blank_slate()

OR

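Tools -> Global Options -> General: uncheck “Restore .RData into workspace at startup” and set “Save workspace to .RData on exit” to “Never”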

Restart R often


Session -> Restart R

Windows

  • Ctrl + Shift + F10

Mac

  • Cmd + Shift + 0

  • Cmd + Shift + F10

Avoid unknown unknowns

Write every script like it's running in a fresh process

Best way to ensure this: write every script in a fresh process
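
One way to check that a script really is self-contained is to run it in a brand-new R process. The sketch below is an illustration under assumptions, not part of the original slides: the helper runs_in_fresh_process() and the two throwaway scripts are hypothetical, and it assumes the Rscript executable sits in R.home("bin").

```r
# Hypothetical helper: run a script in a new R process and report success.
# Assumes the Rscript executable lives in R.home("bin").
runs_in_fresh_process <- function(script) {
  rscript <- file.path(R.home("bin"), "Rscript")
  # --vanilla skips saved workspaces and startup files, so nothing leaks in
  status <- system2(rscript, c("--vanilla", shQuote(script)),
                    stdout = FALSE, stderr = FALSE)
  status == 0
}

# A self-contained script creates everything it uses.
good_script <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10", "mean(x)"), good_script)

# A fragile script silently depends on an object left over in a workspace.
bad_script <- tempfile(fileext = ".R")
writeLines("mean(leftover_object)", bad_script)

good_ok <- runs_in_fresh_process(good_script)
bad_ok <- runs_in_fresh_process(bad_script)
```

The fragile script may appear to work in a long-lived interactive session where leftover_object happens to exist, which is exactly the kind of unknown unknown a fresh process exposes.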

Storing computationally demanding output

  • write_rds() & read_rds()
  • cache: true (Quarto code chunk option)
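
A minimal sketch of the save-once, reload-later pattern follows. It uses base R's saveRDS() and readRDS() so that it runs without extra packages; readr's write_rds() and read_rds() wrap these functions and could be swapped in. The helper cached(), the file name, and the stand-in computation are all hypothetical.

```r
# Hypothetical helper: return a cached result if it exists, otherwise
# compute it once and save it for next time.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

cache_file <- file.path(tempdir(), "slow-result.rds")
unlink(cache_file)  # start clean so the first call has to compute

slow_sum <- function() {
  Sys.sleep(0.1)  # stand-in for an expensive computation
  sum(1:100)
}

first <- cached(cache_file, slow_sum)   # computes and writes the file
second <- cached(cache_file, slow_sum)  # reads the saved file instead
```

Deleting the .rds file is the way to force a recompute, which keeps the source code, not the saved object, as the source of truth.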

Safe paths

On reproducibility of code


A large-scale study on research code quality and execution.
Trisovic, A., Lau, M.K., Pasquier, T. et al. 
Sci Data 9, 60 (2022).

Do you know where your files are?

Working directory vs. home directory

  • Working directory is associated with a specific process or running application
  • Home directory is a static, persistent thing

Working directory ≠ home directory
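
The distinction can be seen directly from the console. This is a small sketch using only base R; the setwd() call is there purely to illustrate the point and is not a recommended habit.

```r
# The working directory belongs to the current R process and can change.
original_wd <- getwd()

# The home directory is a property of the user account, not the process.
home <- path.expand("~")

# Move the process somewhere else (for illustration only):
# the working directory follows...
setwd(tempdir())
moved_wd <- getwd()

# ...but the home directory does not.
home_after <- path.expand("~")

# Restore the original working directory so later code is unaffected.
setwd(original_wd)
```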

Practice “safe paths”

Relative to a stable base, use file system functions

Packages with file system functions

install.packages("fs")

fs = file path handling

install.packages("here")

here = project-relative paths

Examples of a stable base

Project directory

here::here("data", "raw-data.csv")
here::here("data/raw-data.csv")

Automatically complete paths with Tab.

User’s home directory

file.path("~", ...)
fs::path_home(...)

Absolute paths

Don’t hard-wire them into your scripts.

Instead, form at run-time relative to a stable base

(BAD <- "/Users/soltoffbc/tmp/test.csv")
[1] "/Users/soltoffbc/tmp/test.csv"

(GOOD <- fs::path_home("tmp/test.csv"))
/Users/soltoffbc/tmp/test.csv

Practice safe paths

  • Use the here package to build paths inside a project.

  • Leave the working directory at the project's top level at all times during development.

  • Absolute paths are formed at runtime.

here::here()

library(here)
here()
[1] "/Users/soltoffbc/Projects/info-5001/course-site"

Build a file path

here("slides/extras/awesome.txt")
## [1] "/Users/soltoffbc/Projects/info-5001/course-site/slides/extras/awesome.txt"
cat(readLines(here("slides/extras/awesome.txt")))
## OMG this is so awesome!

What if we change the working directory?

setwd(here("slides"))
getwd()
## [1] "/Users/soltoffbc/Projects/info-5001/course-site/slides"
cat(readLines(here("slides/extras/awesome.txt")))
## OMG this is so awesome!

Filepaths and Quarto documents

data/
  scotus.csv
analysis/
  exploratory-analysis.qmd
final-report.qmd
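scotus.Rproj

  • A .qmd file is rendered with its own folder as the working directory
  • read_csv("data/scotus.csv") works from final-report.qmd but fails from analysis/exploratory-analysis.qmd
  • read_csv(here("data/scotus.csv")) works from both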
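
The failure mode can be reproduced without Quarto. The sketch below rebuilds the layout above in a temporary folder and switches the working directory to analysis/, which is where a document in that folder is rendered by default. It uses base R only; the variable project_root is a stand-in for what here() would return in a real project, and the file contents are made up.

```r
# Build a throwaway copy of the layout above inside a temp directory.
project_root <- tempfile("scotus-")
dir.create(file.path(project_root, "data"), recursive = TRUE)
dir.create(file.path(project_root, "analysis"))
writeLines("case,year", file.path(project_root, "data", "scotus.csv"))

# Mimic rendering analysis/exploratory-analysis.qmd: the working directory
# becomes the document's own folder, not the project root.
old_wd <- setwd(file.path(project_root, "analysis"))

# A path relative to the working directory now points at a missing file.
relative_found <- file.exists("data/scotus.csv")

# A path anchored at the project root still resolves.
# In a real project, here("data/scotus.csv") plays this role.
anchored_found <- file.exists(file.path(project_root, "data", "scotus.csv"))

# Put the working directory back.
setwd(old_wd)
```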

What if my data can’t live in my project directory?

  1. Are you sure it can’t?

  2. Review the Good Enough Practices paper for tips.

  3. Create a symbolic link to access the data. (fs::link_create(), fs::link_path())

  4. Put the data in an R package.

  5. Use pins.

  6. Explore other data warehousing options.
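
For option 3, a symbolic link lets project-relative code reach data stored elsewhere. The sketch below uses base R's file.symlink() so it runs without extra packages; fs::link_create() from the list above serves the same purpose. The folder and file names are hypothetical, and on Windows creating symbolic links may require extra privileges.

```r
# Pretend this folder lives outside the project (e.g. on a shared drive).
external_data <- tempfile("external-data-")
dir.create(external_data)
writeLines("id,value", file.path(external_data, "big-file.csv"))

# A throwaway project folder.
project_dir <- tempfile("my-project-")
dir.create(project_dir)

# Link <project>/data to the external folder.
link <- file.path(project_dir, "data")
linked <- file.symlink(external_data, link)

# Project-relative code can now read through the link.
via_link <- readLines(file.path(link, "big-file.csv"))
```

The link itself is machine-specific, so collaborators would each need to create their own.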

Personal R admin

R startup procedures

  • Customized startup
  • .Renviron
  • .Rprofile

.Renviron

  • Define sensitive information
  • Set R-specific environment variables
  • Does not use R code syntax
  • usethis::edit_r_environ()

Example .Renviron

R_HISTSIZE=100000
GITHUB_PAT=abc123
R_LIBS_USER=~/R/%p/%v
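
Values defined in .Renviron are read from R with Sys.getenv(), which keeps secrets out of scripts. A minimal sketch follows; the helper get_env_or_stop() and the variable DEMO_API_TOKEN are made up for illustration, and Sys.setenv() stands in for what R would do at startup after reading the file.

```r
# Hypothetical helper: fetch a required environment variable or fail loudly.
get_env_or_stop <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable ", name, " is not set; add it to .Renviron.")
  }
  value
}

# Simulate a variable that .Renviron would normally define at startup.
# DEMO_API_TOKEN is a made-up name used only for this example.
Sys.setenv(DEMO_API_TOKEN = "abc123")

token <- get_env_or_stop("DEMO_API_TOKEN")
```

Failing early with a clear message is a design choice: an empty token otherwise tends to surface later as a confusing authentication error.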

.Rprofile

  • R code to run when R starts up
  • Runs after .Renviron
  • Multiple .Rprofile files
    • Home directory (~/.Rprofile)
    • Each R Project folder
    • Project .Rprofile overrides home .Rprofile
  • usethis::edit_r_profile(scope = c("user", "project"))

Common items in .Rprofile

  1. Set a default CRAN mirror
  2. Change options, screen width, numeric display
  3. Activate renv
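
Put together, the three items might look like the following project-level .Rprofile. This is a sketch with illustrative values (the mirror URL, width, and digits are assumptions, not recommendations from the slides); renv normally writes the activation line itself, and the file.exists() guard is added here only so the sketch runs in a project without renv.

```r
# Sketch of a project-level .Rprofile (values are illustrative assumptions).

# 1. Set a default CRAN mirror so install.packages() never prompts.
options(repos = c(CRAN = "https://cloud.r-project.org"))

# 2. Adjust console display: line width and significant digits.
options(width = 100, digits = 4)

# 3. Activate renv, but only if this project has been initialised with it.
if (file.exists("renv/activate.R")) {
  source("renv/activate.R")
}
```

Anything placed here affects every session in the project, so it is safest to limit it to settings that do not change the results of an analysis.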

Exam review

Recap

  • Use project-based workflows to easily and reproducibly structure your work
  • Create usable, reproducible file paths using here
  • Split project workflow into separate files (scripts and/or Quarto documents) based on substantive tasks

Acknowledgments

Good luck on the exam!