# A tibble: 5 × 3
x y z
<int> <chr> <chr>
1 1 a K
2 2 b K
3 3 a L
4 4 a L
5 5 b K
Quiz 01
Overview
Quiz 1 will be held on September 26th in-class. It will be a 50 minute in-person, timed quiz.
The quiz will cover all material through the end of week 4 (importing and recoding data). It will consist of a series of short answer and free response questions. Questions are designed to evaluate your understanding of concepts and methods. You may be asked to answer conceptual questions, interpret visualizations, interpret code and output, and/or write code by hand.
Students with SDS accommodations
Registered SDS accommodations related to timed assignments are implemented by the SDS Alternative Testing Program. If you have such accommodations, you will receive separate instructions from SDS about how to take the quiz. If you have any questions about your accommodations, please contact SDS directly.
Students with religious or other accommodations
If you require an accommodation for quiz 01 on the basis of religious observances, athletics, military service, or another accommodation listed on the course syllabus, please contact us at soltoffbc@cornell.edu by September 23. Any accommodation requests received after this date are unlikely to be approved.
Rules & Notes
Academic Integrity
- A student shall in no way misrepresent his or her work.
- A student shall in no way fraudulently or unfairly advance his or her academic position.
- A student shall refuse to be a party to another student’s failure to maintain academic integrity.
- A student shall not in any other manner violate the principle of academic integrity.
- This is an individual assignment. Everything in the quiz is for your eyes only.
- The quiz will be held in-person. All responses will be written by hand and submitted on paper.
- You may not use any electronic devices during the quiz. This includes laptops, tablets, phones, smartwatches, etc.
- You may not use any physical materials during the quiz. This includes textbooks, notes, calculators, etc. Any required information will be provided in the quiz.
Submission
- All responses will be submitted on paper using the provided forms. Quizzes will be evaluated and returned to you via Gradescope.
Grading
- Each quiz is weighted equally. There will be three quizzes in total, so each quiz is worth 5% of your final grade.
Practice problems
Below are some practice problems you may complete in order to prepare for the quiz. The suggested solution is hidden below each exercise. Try to solve the problem on your own before looking at the solution.
- The following chart was shared by @GraphCrimes on Twitter on September 3, 2022.
  - (a) What is misleading about this graph?
  - (b) Suppose you wanted to recreate this plot, with improvements to avoid the misleading pitfalls from part (a). You would need the data from the survey to do that. How many observations would this data have? How many variables (at least) should it have, and what should those variables be?
Suggested solution:
- (a) The bar segments do not visually match the text labels. For example, the “Public Sector” segment for the Royal Mail bar is labeled as 85%, but visually does not fill 85% of the bar.
- (b) It should have 1858 observations (based on the plot’s caption). It would need to have at least two variables:
  - Service (rail, water, energy, or Royal Mail)
  - Preference of who should run the service (public sector, private sector, or don’t know)
- Suppose we have the following data frame, stored as `df`:
# A tibble: 5 × 3
      x y     z
  <int> <chr> <chr>
1     1 a     K
2     2 b     K
3     3 a     L
4     4 a     L
5     5 b     K
Answer all questions without actually running the code.
- (a) What is the difference between these two piped operations?
df |> group_by(y)
df |> arrange(y)
Suggested solution:
df |> group_by(y)
# A tibble: 5 × 3
# Groups:   y [2]
      x y     z
  <int> <chr> <chr>
1     1 a     K
2     2 b     K
3     3 a     L
4     4 a     L
5     5 b     K
df |> arrange(y)
# A tibble: 5 × 3
      x y     z
  <int> <chr> <chr>
1     1 a     K
2     3 a     L
3     4 a     L
4     2 b     K
5     5 b     K
The first operation groups the data frame by the `y` column: the rows stay in their original order, and only the grouping information (`# Groups: y [2]`) is added. The second operation sorts the rows of the data frame by the `y` column and adds no grouping.
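As a way to check the answer after you have worked it out by hand, here is a minimal sketch that rebuilds the example data frame and compares the two results. The object names `grouped` and `sorted` are illustrative and not part of the exercise.

```r
library(dplyr)

# Rebuild the example data frame from this exercise.
df <- tibble::tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

grouped <- df |> group_by(y)
sorted <- df |> arrange(y)

grouped$x            # 1 2 3 4 5: row order is unchanged
group_vars(grouped)  # "y": the grouping is stored as metadata
sorted$x             # 1 3 4 2 5: all "a" rows now come first
group_vars(sorted)   # character(0): arrange() adds no grouping
```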
- (b) What does the following code do?
df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))
Suggested solution:
df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))
`summarise()` has grouped output by 'y'. You can override using the `.groups` argument.
# A tibble: 3 × 3
# Groups:   y [2]
  y     z     mean_x
  <chr> <chr>  <dbl>
1 a     K        1
2 a     L        3.5
3 b     K        3.5
This code groups the data frame by the `y` and `z` columns, and then calculates the mean of the `x` column for each group.
- (c) How will the output of this code be different from part (b)?
df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")
Suggested solution:
df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")
# A tibble: 3 × 3
  y     z     mean_x
  <chr> <chr>  <dbl>
1 a     K        1
2 a     L        3.5
3 b     K        3.5
The summary values will be the same as in part (b), but the grouping information will not be retained in the output (i.e. it will be an ungrouped data frame).
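To see why the retained grouping matters, the sketch below rebuilds the example data frame and applies one further step to each result. The names `by_default`, `dropped`, and `n_rows` are illustrative assumptions, and the follow-up `mutate()` step is an addition that is not part of the exercise.

```r
library(dplyr)

# Rebuild the example data frame from this exercise.
df <- tibble::tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# Part (b): the last grouping variable (z) is peeled off, y remains.
by_default <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))

# Part (c): all grouping is removed.
dropped <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

group_vars(by_default)  # "y"
group_vars(dropped)     # character(0)

# A later step is computed within each y group for the grouped result,
# but across the whole data frame for the ungrouped one.
by_default |> mutate(n_rows = n()) |> pull(n_rows)  # 2 2 1
dropped |> mutate(n_rows = n()) |> pull(n_rows)     # 3 3 3
```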
- We have data tracking the flight performance of dragons across 500 trips.
Variables include:
- Average wind speed (mph)
- Flight distance (miles)
- Dragon type
  - Fire dragon – Powerful but heavy, struggles in strong winds.
  - Ice dragon – Prefers icy conditions, decent endurance.
  - Storm dragon – Loves turbulent air, flies far in high winds.
  - Forest dragon – Agile but avoids strong winds, moderate endurance.
The data is stored in `dragons` and looks like this:
# A tibble: 500 × 3
   wind_speed flight_distance dragon_type
        <dbl>           <dbl> <chr>
 1         15              63 Ice Dragon
 2         38              85 Storm Dragon
 3         29              45 Fire Dragon
 4         24              48 Fire Dragon
 5          6              94 Fire Dragon
 6          6             110 Fire Dragon
 7          2              98 Forest Dragon
 8         35              33 Fire Dragon
 9         24              52 Forest Dragon
10         28              41 Forest Dragon
# ℹ 490 more rows
You believe that Storm Dragons will perform better in high winds compared to the other dragon types.
- What kind of visualization would you use to compare the flight distance of Storm Dragons to other dragon types at different wind speeds?
- Write the R code you would use to create this plot.
Suggested solution:
- I would probably use a color-coded scatterplot with wind speed on the x-axis, flight distance on the y-axis, and different colors for each dragon type. Since there are 500 observations, smoothing lines will probably be useful to summarize the trends.
- Here is the R code to create this plot:
ggplot(
  data = dragons,
  mapping = aes(x = wind_speed, y = flight_distance, color = dragon_type)
) +
  geom_point() +
  geom_smooth()
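Because the hypothesis is specifically about Storm Dragons versus everything else, one possible variation is to recode the dragon type into two categories before plotting. The sketch below is not the suggested solution, only an alternative under stated assumptions: it uses just the ten preview rows printed above (the full `dragons` data is not available here), and the names `dragons_preview`, `dragons_recoded`, and `storm` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Only the ten rows shown in the printed preview; the remaining 490 rows
# are not reproduced here.
dragons_preview <- tibble::tribble(
  ~wind_speed, ~flight_distance, ~dragon_type,
  15,  63, "Ice Dragon",
  38,  85, "Storm Dragon",
  29,  45, "Fire Dragon",
  24,  48, "Fire Dragon",
   6,  94, "Fire Dragon",
   6, 110, "Fire Dragon",
   2,  98, "Forest Dragon",
  35,  33, "Fire Dragon",
  24,  52, "Forest Dragon",
  28,  41, "Forest Dragon"
)

# Collapse the four dragon types into Storm Dragon vs. all others.
dragons_recoded <- dragons_preview |>
  mutate(
    storm = if_else(dragon_type == "Storm Dragon", "Storm Dragon", "Other dragons")
  )

# Scatterplot colored by the two-level indicator.
p <- ggplot(
  data = dragons_recoded,
  mapping = aes(x = wind_speed, y = flight_distance, color = storm)
) +
  geom_point() +
  labs(x = "Wind speed (mph)", y = "Flight distance (miles)", color = NULL)
```

With the full 500-row data, adding `geom_smooth()` to this plot would give one trend line per category; with only ten rows a smoother is not meaningful, so it is omitted here.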
- Cleaning coffee orders
You are given a dataset that records orders at a coffee shop. It looks like this:
order_id  drinks
201       Latte-Medium, Espresso-Small, Mocha-Large
202       Cappuccino-Small, Latte-Large
203       Mocha-Medium, Espresso-Large, Latte-Small
  - (a) Explain why this dataset is not tidy according to the principles of tidy data.
  - (b) Identify the tidy structure for this data. What information would be stored in rows? What would the columns be?
  - (c) Write the R code to convert the original dataset into the tidy structure you identified in part (b).
Suggested solution:
- (a) The dataset is not tidy because the `drinks` column contains multiple values in a single cell. Not only does it contain multiple beverages, but each beverage is also paired with a drink size.
- (b) The tidy structure would be:
  - Each drink is in a separate row.
  - Columns include the order ID, the type of drink, and the drink size.
- (c) Here is the R code to convert the original dataset into the tidy structure:
# original structure
coffee_orders
# A tibble: 3 × 2
  order_id drinks
     <dbl> <chr>
1      201 Latte-Medium, Espresso-Small, Mocha-Large
2      202 Cappuccino-Small, Latte-Large
3      203 Mocha-Medium, Espresso-Large, Latte-Small
# tidy structure
coffee_orders |>
  separate_longer_delim(
    cols = drinks,
    delim = ","
  ) |>
  separate_wider_delim(
    cols = drinks,
    delim = "-",
    names = c("drink", "size")
  )
# A tibble: 8 × 3
  order_id drink        size
     <dbl> <chr>        <chr>
1      201 "Latte"      Medium
2      201 " Espresso"  Small
3      201 " Mocha"     Large
4      202 "Cappuccino" Small
5      202 " Latte"     Large
6      203 "Mocha"      Medium
7      203 " Espresso"  Large
8      203 " Latte"     Small
Note that splitting on "," alone leaves a leading space on every drink after the first in each order, which is why the drink names are printed in quotes. Splitting on ", " (comma followed by a space) avoids this.
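The output above shows a leading space on some drink names because the first split uses "," as the delimiter. A minimal sketch of one way to avoid that, assuming every order separates drinks with a comma followed by exactly one space, is to split on ", " instead. The name `tidy_orders` is illustrative.

```r
library(tidyr)

# Rebuild the coffee orders data shown in the exercise.
coffee_orders <- tibble::tribble(
  ~order_id, ~drinks,
  201, "Latte-Medium, Espresso-Small, Mocha-Large",
  202, "Cappuccino-Small, Latte-Large",
  203, "Mocha-Medium, Espresso-Large, Latte-Small"
)

tidy_orders <- coffee_orders |>
  # One row per drink; ", " consumes the space after each comma.
  separate_longer_delim(cols = drinks, delim = ", ") |>
  # Split each "Drink-Size" value into two columns.
  separate_wider_delim(
    cols = drinks,
    delim = "-",
    names = c("drink", "size")
  )

tidy_orders
```

If the spacing after commas were inconsistent, an alternative would be to keep "," as the delimiter and trim whitespace from `drink` afterwards.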