HW 01 - Data visualization
This homework is due September 11 at 11:59pm ET.
Getting started
Go to the info5001-fa24 organization on GitHub. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.
Packages
library(tidyverse)
Guidelines + tips
As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. This assignment includes periodic reminders to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Note: Do not let R output answer the question for you unless the question specifically asks for just a plot. For example, if the question asks for the number of columns in the data set, please type out the number of columns. You may lose points if you do not.
Workflow + formatting
Make sure to
- Update author name on your document.
- Label all code chunks informatively and concisely.
- Follow the Tidyverse code style guidelines.
- Make at least 3 commits.
- Resize figures where needed; avoid tiny or huge plots.
- Turn in an organized, well-formatted document.
Exercises
Data 1: Tompkins County home sales
Use this dataset for Exercises 1 and 2.
For the following two exercises you will work with data on houses that were sold in Tompkins County, NY in 2022 and 2023 (data source: Redfin).
The variables include:
- property_type: type of property (e.g. single family residential, townhouse, condo)
- address: street address of property
- city: city of property
- state: state of property (all are New York)
- zip_code: ZIP code of property
- price: sale price (in dollars)
- beds: number of bedrooms
- baths: number of bathrooms. Full bathrooms with shower/toilet count as 1, bathrooms with just a toilet count as 0.5.
- area: living area of the home (in square feet)
- lot_size: size of property’s lot (in acres)
- year_built: year home was built
- hoa_month: monthly HOA dues. If the property is not part of an HOA, then the value is NA
The dataset can be found in the data folder of your repo. It is called tompkins-home-sales.csv.
tompkins <- read_csv("data/tompkins-home-sales.csv")
Rows: 1901 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): property_type, address, city, state
dbl (8): zip_code, price, beds, baths, area, lot_size, year_built, hoa_month
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notice that read_csv() produces a bunch of text output. This is just the default behavior of read_csv() as it summarizes how it has imported the data file. We will learn more about data types in a couple of weeks. For now, if you don’t like seeing the message in your rendered document you can disable it using the code chunk option message: false, like below:
```{r}
#| message: false
tompkins <- read_csv("data/tompkins-home-sales.csv")
```
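The import message shown earlier also points to a second option: passing show_col_types = FALSE to read_csv() itself. The sketch below illustrates this with a small inline CSV whose values are made up for illustration (the names toy_csv and toy are hypothetical); it is not the homework file.

```r
library(readr)

# Inline CSV text standing in for a file on disk (made-up values).
# I() tells read_csv() to treat the string as literal data, not a path.
toy_csv <- I("city,price\nIthaca,250000\nDryden,180000\n")

# show_col_types = FALSE suppresses the column specification message
toy <- read_csv(toy_csv, show_col_types = FALSE)
toy
```

The chunk option hides every message a chunk produces, while the function argument only silences the column specification for that one call.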
Exercise 1
Suppose you’re helping some family friends who are looking to buy a house in Tompkins County (good luck!). As they browse Redfin listings, they realize there are many old houses, and they wonder: Does the age of a house make a difference?
Luckily, you can help them answer this question with data visualization!
- Make boxplots of the prices of houses sold in Tompkins County based on whether they were built before or after 1960.
  - In order to do this, you will first need to create a new variable called home_age (with values "Before 1960" and "Newer than 1960").
  - Below is the code for creating this new variable. Here, we mutate() the tompkins data frame to add a new variable called home_age which takes the value "Before 1960" if the year_built variable takes a value of less than 1960 and takes the string "Newer than 1960" if not. (I know, properly it should be "1960 or newer", but that will lead to confusing ordering of the values in the plot and we haven’t covered data wrangling yet.)
# Save the modified data frame as tompkins
tompkins <- tompkins |>
  # Create a new column of data
  mutate(home_age = if_else(year_built < 1960, "Before 1960", "Newer than 1960"))
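To see how if_else() assigns the two labels, here is a minimal sketch on a toy data frame. The name toy_homes and its four years are made up for illustration and are not the homework data. Note that a house built in exactly 1960 fails the year_built < 1960 test and so lands in "Newer than 1960".

```r
library(dplyr)

# Toy data frame with made-up values, not the homework data
toy_homes <- tibble(year_built = c(1925, 1959, 1960, 2005))

toy_homes <- toy_homes |>
  mutate(
    home_age = if_else(year_built < 1960, "Before 1960", "Newer than 1960")
  )

# Tally how many rows fall in each category
toy_homes |>
  count(home_age)
```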
- Then, plot home_age vs. price.
- The sale price variable is heavily skewed due to outlier values. To make the median values more distinctive, log transform the price axis. Use the documentation for a scale_*_*() function from ggplot2 to implement this adjustment.
- Include informative title and axis labels.
- Finally, include a brief (2-3 sentence) narrative comparing the distributions of prices of older and newer homes. Your narrative should touch on whether being built pre/post-1960 “makes a difference” in terms of the price of the house.
Now is a good time to render, commit, and push.
Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 2
It’s expected that within any given market larger houses will be priced higher. It’s also expected that the age of the house will have an effect on the price. However, in some markets new houses might be more expensive, while in others new construction might mean “no character” and hence be less expensive. So your family friends ask: “In Tompkins County, do houses that are bigger and more expensive tend to be newer ones than those that are smaller and cheaper?”
Once again, data visualization skills to the rescue!
- Create a scatter plot to explore the relationship between price and area, conditioning on year_built (the continuous variable, not the one you created for Exercise 1).
- Use geom_smooth() with the argument se = FALSE to add a smooth curve fit to the data and color the points by year_built.
- You may find it difficult to interpret the graph with the original axes scales due to outliers. Consider how you might transform the axes using a scale_*_*() function from ggplot2 to implement this adjustment.
- Include informative title, axis, and legend labels.
- Discuss each of the following claims (1-2 sentences per claim). Your discussion should touch on specific things you observe in your plot as evidence for or against the claims.
- Claim 1: Larger houses are priced higher.
- Claim 2: Newer houses are priced higher.
- Claim 3: Bigger and more expensive houses tend to be newer ones than smaller and cheaper ones.
Now is a good time to render, commit, and push.
Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Data 2: BRFSS
Use this dataset for Exercises 3 to 5.
The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.
Source: cdc.gov/brfss
In the following exercises we will work with data from the 2020 BRFSS survey. The data originally come from the CDC, though we will work with a random sample of responses and a small number of variables from the data provided. These have already been sampled for you and the dataset you’ll use can be found in the data folder of your repo. It’s called brfss.csv.
brfss <- read_csv("data/brfss.csv")
Exercise 3
- How many rows are in the brfss dataset? What does each row represent?
- How many columns are in the brfss dataset? Indicate the type of each variable.
- Include the code and resulting output used to support your answer.
Now is a good time to render, commit, and push.
Exercise 4
Do people who smoke more tend to have worse health conditions?
- Use a segmented bar chart (also known as a standardized or filled bar chart) to visualize the relationship between smoking (smoke_freq) and general health (general_health). Decide on which variable to represent with bars and which variable to fill the color of the bars by.
- Pay attention to the order of the bars and, if need be, use the fct_relevel() function to reorder the levels of the variables.
  - Below is sample code for releveling general_health. Here we first convert general_health to a factor (how R stores categorical data) and then order the levels from Excellent to Poor.
# Save the modified data frame as brfss
brfss <- brfss |>
  # Modify a column of data
  mutate(
    general_health = as.factor(general_health),
    # Ensure specific ordering of general_health when plotted
    general_health = fct_relevel(general_health, "Excellent", "Very good",
                                 "Good", "Fair", "Poor")
  )
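To see what the releveling changes, the sketch below compares factor levels before and after fct_relevel() on a short vector of made-up responses. The names health and health_ordered are hypothetical, and this is not the BRFSS data. By default, as.factor() sorts the levels alphabetically, which is rarely the order you want on a plot.

```r
library(forcats)

# Made-up responses, not the BRFSS data
health <- c("Good", "Poor", "Excellent", "Fair", "Very good", "Good")

# as.factor() orders the levels alphabetically by default
levels(as.factor(health))

# fct_relevel() moves the named levels to the front, in the order given
health_ordered <- fct_relevel(
  as.factor(health),
  "Excellent", "Very good", "Good", "Fair", "Poor"
)
levels(health_ordered)
```

Because ggplot2 draws categories in level order, the second ordering is the one that would carry through to the bars or legend.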
- Include informative title, axis, and legend labels.
- Comment on the motivating question based on evidence from the visualization: Do people who smoke more tend to have worse health conditions?
Now is a good time to render, commit, and push.
Exercise 5
How are sleep and general health associated?
- Create a visualization displaying the relationship between sleep and general_health.
- Include informative title and axis labels.
- Modify your plot to use a different theme than the default.
- Comment on the motivating question based on evidence from the visualization: How are sleep and general health associated?
Now is a good time to render, commit, and push.
Exercise 6
- Fill in the blanks:
- The gg in the name of the package ggplot2 stands for ___.
- If you map the same continuous variable to both x and y aesthetics in a scatterplot, you get a straight ___ line. (Choose between “vertical”, “horizontal”, or “diagonal”.)
- Code style: Fix up the code style by adding spaces and line breaks where needed. Briefly describe your fixes. (Hint: You can refer to the Tidyverse style guide.)
ggplot(data=penguins,mapping=aes(x=species,fill=island))+geom_bar()+scale_fill_viridis_d()
- Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?
Render, commit, and push one last time.
Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Wrap up
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 5001 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
- Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
- Exercise 1: 7 points
- Exercise 2: 8 points
- Exercise 3: 6 points
- Exercise 4: 8 points
- Exercise 5: 8 points
- Exercise 6: 8 points
- Workflow + formatting: 5 points
- Total: 50 points
The “Workflow & formatting” component assesses the reproducible workflow. This includes:
- At least 3 informative commit messages
- Following tidyverse code style
- All code being visible in the rendered PDF (lines no longer than 80 characters)
Acknowledgments
- This assignment is derived from STA 199: Introduction to Data Science