Quiz 01

Quiz

Modified

September 18, 2025

Overview

Quiz 1 will be held in class on September 26th. It will be a 50-minute, in-person, timed quiz.

The quiz will cover all material through the end of week 4 (importing and recoding data). It will consist of a series of short answer and free response questions. Questions are designed to evaluate your understanding of concepts and methods. You may be asked to answer conceptual questions, interpret visualizations, interpret code and output, and/or write code by hand.

Students with SDS accommodations

Accommodations registered with SDS that relate to timed assignments are implemented by the SDS Alternative Testing Program. You will receive separate instructions from SDS about how to take the quiz with your accommodations. If you have any questions about your accommodations, please contact SDS directly.

Students with religious or other accommodations

If you require an accommodation for quiz 01 on the basis of religious observances, athletics, military service, or another accommodation listed on the course syllabus, please contact us at soltoffbc@cornell.edu by September 23. Any accommodation requests received after this date are unlikely to be approved.

Rules & Notes

Academic Integrity

A student shall in no way misrepresent his or her work.
A student shall in no way fraudulently or unfairly advance his or her academic position.
A student shall refuse to be a party to another student’s failure to maintain academic integrity.
A student shall not in any other manner violate the principle of academic integrity.

Source: Cornell University Code of Academic Integrity

This is an individual assignment. Everything in the quiz is for your eyes only.
The quiz will be held in-person. All responses will be written by hand and submitted on paper.
You may not use any electronic devices during the quiz.¹ This includes laptops, tablets, phones, smartwatches, etc.
You may not use any physical materials during the quiz. This includes textbooks, notes, calculators, etc. Any required information will be provided in the quiz.

Submission

All responses will be submitted on paper using the provided forms. Quizzes will be evaluated and returned to you via Gradescope.

Grading

Each quiz is weighted equally. There will be three quizzes in total, so each quiz is worth 5% of your final grade.

Practice problems

Instructions

Below are some practice problems you may complete in order to prepare for the quiz. The suggested solution is hidden below each exercise. Try to solve the problem on your own before looking at the solution.

The following chart was shared by @GraphCrimes on Twitter on September 3, 2022.
1. What is misleading about this graph?
2. Suppose you wanted to recreate this plot, with improvements to avoid its misleading pitfalls from part (a). You would obviously need the data from the survey in order to be able to do that. How many observations would this data have? How many variables (at least) should it have, and what should those variables be?
Suggested solution
1. The bar segments do not visually match the text labels. For example, the “Public Sector” segment for the Royal Mail bar is labeled as 85%, but visually does not fill 85% of the bar.
2. It should have 1858 observations (based on the plot’s caption). It would need to have at least two variables:
  - Service (rail, water, energy, or royal mail)
  - Preference of who should run the service (public sector, private sector, or don’t know)
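As a rough sketch of how data with that structure could be plotted, the code below uses a tiny, made-up `survey` tibble (the rows, the column names `service` and `preference`, and the axis labels are all assumptions for illustration, not the real survey data). Letting `geom_bar()` compute the proportions from the rows keeps the bar segments consistent with the data.

```r
library(ggplot2)
library(tibble)

# Hypothetical toy data: one row per response, two variables.
# The real survey would have 1858 rows; these values are invented.
survey <- tribble(
  ~service,     ~preference,
  "Rail",       "Public sector",
  "Rail",       "Private sector",
  "Water",      "Public sector",
  "Water",      "Don't know",
  "Energy",     "Public sector",
  "Royal Mail", "Public sector"
)

# position = "fill" scales each bar to proportions computed from the data,
# so segment widths cannot drift away from the underlying counts
p <- ggplot(survey, aes(y = service, fill = preference)) +
  geom_bar(position = "fill") +
  labs(x = "Proportion of responses", y = NULL, fill = "Who should run it?")
```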

Suppose we have the following data frame:

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)
df

# A tibble: 5 × 3
      x y     z    
  <int> <chr> <chr>
1     1 a     K    
2     2 b     K    
3     3 a     L    
4     4 a     L    
5     5 b     K

Answer all questions without actually running the code.

What is the difference between these two piped operations?

df |>
  group_by(y)

df |>
  arrange(y)

Suggested solution

df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

# A tibble: 3 × 3
  y     z     mean_x
  <chr> <chr>  <dbl>
1 a     K        1  
2 a     L        3.5
3 b     K        3.5

The output will be the same as part (b), but the grouping information will not be retained in the output (i.e. it will be an ungrouped data frame)
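For the question about `group_by(y)` versus `arrange(y)`, a minimal sketch of the difference is shown below. It rebuilds the same `df` and inspects each result; the names `grouped` and `sorted` are introduced here only for illustration.

```r
library(dplyr)
library(tibble)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# group_by() attaches grouping metadata but leaves the row order alone
grouped <- df |>
  group_by(y)

# arrange() reorders the rows but adds no grouping
sorted <- df |>
  arrange(y)

grouped$x               # 1 2 3 4 5 (original order)
is_grouped_df(grouped)  # TRUE
sorted$x                # 1 3 4 2 5 (rows with y == "a" first)
is_grouped_df(sorted)   # FALSE
```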

We have data tracking the flight performance of dragons across 500 trips.²

Variables include:
1. Average wind speed (mph)
2. Flight distance (miles)
3. Dragon type
  - Fire dragon – Powerful but heavy, struggles in strong winds.
  - Ice dragon – Prefers icy conditions, decent endurance.
  - Storm dragon – Loves turbulent air, flies far in high winds.
  - Forest dragon – Agile but avoids strong winds, moderate endurance.
The data is stored in dragons and looks like this:
```
# A tibble: 500 × 3
   wind_speed flight_distance dragon_type  
        <dbl>           <dbl> <chr>        
 1         15              63 Ice Dragon   
 2         38              85 Storm Dragon 
 3         29              45 Fire Dragon  
 4         24              48 Fire Dragon  
 5          6              94 Fire Dragon  
 6          6             110 Fire Dragon  
 7          2              98 Forest Dragon
 8         35              33 Fire Dragon  
 9         24              52 Forest Dragon
10         28              41 Forest Dragon
# ℹ 490 more rows
```
You believe that Storm Dragons will perform better in high winds compared to the other dragon types.
1. What kind of visualization would you use to compare the flight distance of Storm Dragons to other dragon types at different wind speeds?
2. Write the R code you would use to create this plot.
Suggested solution
1. I would probably use a color-coded scatterplot with wind speed on the x-axis, flight distance on the y-axis, and different colors for each dragon type. Since there are 500 observations, smoothing lines will probably be useful to summarize the trends.
2. Here is the R code to create this plot:
  
  ggplot(
    data = dragons,
    mapping = aes(x = wind_speed, y = flight_distance, color = dragon_type)
  ) +
    geom_point() +
    geom_smooth()
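
One possible variation, sketched here as an assumption rather than as part of the suggested solution, is to recode `dragon_type` into "Storm Dragon" versus "Other" before plotting, so the comparison of interest is the only color contrast. The sketch uses just the ten rows printed above (stored in a hypothetical `dragons_sample`), and the column name `storm` is made up for illustration.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# The ten rows shown in the printed preview of `dragons`
dragons_sample <- tribble(
  ~wind_speed, ~flight_distance, ~dragon_type,
  15,  63, "Ice Dragon",
  38,  85, "Storm Dragon",
  29,  45, "Fire Dragon",
  24,  48, "Fire Dragon",
   6,  94, "Fire Dragon",
   6, 110, "Fire Dragon",
   2,  98, "Forest Dragon",
  35,  33, "Fire Dragon",
  24,  52, "Forest Dragon",
  28,  41, "Forest Dragon"
)

# Recode the four dragon types into two categories
dragons_recoded <- dragons_sample |>
  mutate(storm = if_else(dragon_type == "Storm Dragon", "Storm Dragon", "Other"))

# Same scatterplot as before, colored by the recoded variable.
# With the full 500 rows, geom_smooth() could be added as in the solution.
storm_plot <- ggplot(
  data = dragons_recoded,
  mapping = aes(x = wind_speed, y = flight_distance, color = storm)
) +
  geom_point()
```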
Cleaning coffee orders

You are given a dataset that records orders at a coffee shop. It looks like this:

| order_id | drinks |
|---|---|
| 201 | Latte-Medium, Espresso-Small, Mocha-Large |
| 202 | Cappuccino-Small, Latte-Large |
| 203 | Mocha-Medium, Espresso-Large, Latte-Small |

1. Explain why this dataset is not tidy according to the principles of tidy data.
2. Identify the tidy structure for this data. What information would be stored in rows? What would the columns be?
3. Write the R code to convert the original dataset into the tidy structure you identified in part (b).
Suggested solution
1. The dataset is not tidy because the drinks column contains multiple values in a single cell. Not only does it contain multiple beverages, each beverage is also paired with a drink size.
2. The tidy structure would be:
  - Each drink is in a separate row.
  - Columns include the order ID, the type of drink, and the drink size.
3. Here is the R code to convert the original dataset into the tidy structure:
  
  # original structure
  coffee_orders

  # A tibble: 3 × 2
    order_id drinks
       <dbl> <chr>
  1      201 Latte-Medium, Espresso-Small, Mocha-Large
  2      202 Cappuccino-Small, Latte-Large
  3      203 Mocha-Medium, Espresso-Large, Latte-Small

  # tidy structure
  coffee_orders |>
    separate_longer_delim(
      cols = drinks,
      delim = ","
    ) |>
    separate_wider_delim(
      cols = drinks,
      delim = "-",
      names = c("drink", "size")
    )

  # A tibble: 8 × 3
    order_id drink        size
       <dbl> <chr>        <chr>
  1      201 "Latte"      Medium
  2      201 " Espresso"  Small
  3      201 " Mocha"     Large
  4      202 "Cappuccino" Small
  5      202 " Latte"     Large
  6      203 "Mocha"      Medium
  7      203 " Espresso"  Large
  8      203 " Latte"     Small
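
Note that splitting on `","` leaves a leading space on some drink names (for example `" Espresso"`). One way to avoid this, sketched here as an optional refinement rather than as part of the suggested solution, is to split on `", "` (comma plus space) instead. The name `tidy_orders` is introduced only for this example.

```r
library(tidyr)
library(tibble)

coffee_orders <- tribble(
  ~order_id, ~drinks,
  201, "Latte-Medium, Espresso-Small, Mocha-Large",
  202, "Cappuccino-Small, Latte-Large",
  203, "Mocha-Medium, Espresso-Large, Latte-Small"
)

tidy_orders <- coffee_orders |>
  # split on comma + space so no leading whitespace is left behind
  separate_longer_delim(cols = drinks, delim = ", ") |>
  # then split each "Drink-Size" pair into two columns
  separate_wider_delim(
    cols = drinks,
    delim = "-",
    names = c("drink", "size")
  )
```

Trimming the `drink` column afterwards with `stringr::str_trim()` would be an alternative that also handles inconsistent spacing.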


Footnotes

1. Students with certain SDS accommodations are permitted to use a computer.
2. For all you Empyrean Series fans out there. Also not real data. Thanks ChatGPT.