Importing and recoding data

Lecture 8

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2024

September 24, 2024

Is \(x\) in \(y\)?

library(tidyverse)        # mutate(), filter(), select(), arrange()
library(fivethirtyeight)  # college_recent_grads

stem_categories <- c(
  "Biology & Life Science",
  "Computers & Mathematics",
  "Engineering",
  "Physical Sciences"
)

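# Incorrect: == compares element by element (recycling stem_categories),
# so it does not test whether major_category is one of the STEM categories
college_recent_grads <- college_recent_grads |>
  mutate(major_type = if_else(major_category == stem_categories,
    "STEM", "Not STEM"
  ))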

college_recent_grads |>
  filter(major_type == "STEM", median < 55000) |>
  select(major, median) |>
  arrange(desc(median))
# A tibble: 20 × 2
   major                                                      median
   <chr>                                                       <dbl>
 1 Architectural Engineering                                   54000
 2 Electrical Engineering Technology                           52000
 3 Environmental Engineering                                   50000
 4 Industrial Production Technologies                          46000
 5 Nuclear, Industrial Radiology, And Biological Technologies  46000
 6 Mathematics                                                 45000
 7 Physics                                                     45000
 8 Information Sciences                                        45000
 9 Pharmacology                                                45000
10 Engineering And Industrial Management                       44000
11 Computer Programming And Data Processing                    41300
12 Architecture                                                40000
13 Mechanical Engineering Related Technologies                 40000
14 Microbiology                                                38000
15 Computer Administration Management And Security             37500
16 Environmental Science                                       35600
17 Communication Technologies                                  35000
18 Neuroscience                                                35000
19 Ecology                                                     33000
20 Zoology                                                     26000

# Correct: %in% checks each major_category against every STEM category
college_recent_grads <- college_recent_grads |>
  mutate(major_type = if_else(major_category %in% stem_categories,
    "STEM", "Not STEM"
  ))

college_recent_grads |>
  filter(major_type == "STEM", median < 55000) |>
  select(major, median) |>
  arrange(desc(median))
# A tibble: 47 × 2
   major                                       median
   <chr>                                        <dbl>
 1 Architectural Engineering                    54000
 2 Computer Science                             53000
 3 Electrical Engineering Technology            52000
 4 Materials Engineering And Materials Science  52000
 5 Civil Engineering                            50000
 6 Miscellaneous Engineering                    50000
 7 Environmental Engineering                    50000
 8 Engineering Technologies                     50000
 9 Geological And Geophysical Engineering       50000
10 Industrial Production Technologies           46000
# ℹ 37 more rows
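
The gap between 20 and 47 rows comes from how `==` and `%in%` compare vectors. A minimal sketch, using small made-up vectors (`x` and `y` are hypothetical and not taken from the dataset):

```r
# Hypothetical toy vectors, chosen only to illustrate the comparison
x <- c("Physical Sciences", "Arts", "Engineering", "Engineering")
y <- c("Engineering", "Physical Sciences")

# == compares position by position, recycling y to the length of x:
# x[1] vs y[1], x[2] vs y[2], x[3] vs y[1], x[4] vs y[2]
eq_result <- x == y

# %in% asks, for each element of x, whether it appears anywhere in y
in_result <- x %in% y

print(eq_result)
print(in_result)
```

With `==`, a value counts as a match only when it happens to line up with the same value in the recycled vector, which is why the first version silently misses many STEM majors.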

Data “wrangling”

A screenshot of a New York Times article.

A screenshot of 'Data Carpentry' by David Mimno.

Reading data into R

Reading rectangular data

  • readr:
    • Most commonly: read_csv()
    • Maybe also: read_tsv(), read_delim(), etc.
  • readxl: read_excel()
  • arrow: read_arrow(), read_parquet()
  • haven: read_sas(), read_sav(), read_dta()
  • googlesheets4: read_sheet()
  • data.table: fread()
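
As a rough illustration of the most common case, the sketch below calls read_csv() on a small inline string rather than a file. The column names and values are hypothetical placeholders; in practice the first argument would usually be a file path or URL.

```r
library(readr)

# Hypothetical inline CSV text; I() tells readr to treat the string as
# literal data rather than as a file path
csv_text <- I("major,median\nPhysics,45000\nZoology,26000\n")

# Declaring column types up front makes the import explicit and repeatable
grads <- read_csv(
  csv_text,
  col_types = cols(
    major = col_character(),
    median = col_double()
  )
)

print(grads)
```

Spelling out col_types is optional, since readr guesses types by default, but it guards against a column being parsed differently when the file changes.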

Application exercise

Powerball Lottery

ae-06

  • Go to the course GitHub org and find your ae-06 repo (the repo name is suffixed with your GitHub username).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline (end of the day).

Recap of AE

  • Simplify your life – get the data in as simple a format as possible
  • Examine the file’s structure before attempting to import into R. Use the RStudio interactive menu as necessary.
  • Ensure all data cleaning is reproducible. Do not replace your raw data files.
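
One way to keep cleaning reproducible is to leave the raw file untouched and write the cleaned result to a separate file from a script. The sketch below assumes a hypothetical raw file created in a temporary directory; the file names, column names, and values are illustrative only.

```r
library(readr)
library(dplyr)

# Hypothetical raw file, written to a temporary directory for this sketch
raw_path <- file.path(tempdir(), "example-raw.csv")
write_lines(
  c("Draw Date,Winning Number", "2024-09-21,17", "2024-09-23,NA"),
  raw_path
)

# Read the raw data; the file on disk is never modified
raw <- read_csv(raw_path, show_col_types = FALSE)

# All cleaning happens in code: rename columns, drop missing values
clean <- raw |>
  rename(draw_date = `Draw Date`, winning_number = `Winning Number`) |>
  filter(!is.na(winning_number))

# Save the cleaned data to a separate file, next to the raw one
clean_path <- file.path(tempdir(), "example-clean.csv")
write_csv(clean, clean_path)
```

Because every step lives in the script, rerunning it on the original file reproduces the cleaned data exactly.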