Project description

Modified

November 14, 2024

Important dates

Important

The details will be updated as the project date approaches.

Introduction

TL;DR: Create something related to data science.

This is intentionally vague; part of the challenge is to design a project that best showcases your interests and strengths.

One requirement is that your project should feature some element that you had to learn on your own. This could be a package you use that we didn’t teach in class (e.g., a package for building interactive web applications) or a workflow (e.g., making a package) or anything else.

If you’re not sure if your “new” thing counts, just ask!

Ideas

Identify a goal for your project that leverages the skills you develop in this class. Some possible ideas include:

  • Develop educational content introducing and presenting a technical topic from statistics or mathematics (e.g. gradient descent, neural networks, decision trees) and publish as a Quarto website
  • Create online tutorials for a specific R package or data science technique using WebAssembly and Quarto Live
  • Build a Shiny web application for visualizing and exploring a complex dataset
  • Create an R package that provides enhanced functionality for ggplot2
  • Build an R package to provide a straightforward interface to an API
  • Construct a chatbot and build an API to provide programmatic access
  • Develop a machine learning model and deploy it as an API using plumber

Most importantly, be prepared to brainstorm many ideas and discard most of them before you settle on a topic that everyone on the team is happy with. A good topic showcases what you’ve learned in the class and how you can use it to learn and implement something new.

The project is very open ended. Neatness, coherency, and clarity will count. All analyses must be done in RStudio, using R, and all components of the project must be reproducible.

You will work on the project with your lab teams.

Deliverables

The four primary deliverables for the final project are

  1. A project proposal with three ideas.
  2. A final report that explains the process and results.
  3. A reproducible product in a format based upon the type of project you propose (e.g. R package, interactive web application, custom-built API), with one required draft along the way.
  4. A presentation with slides.

There will be additional submissions throughout the semester to facilitate completion of the final product and presentation.

Organization of files in the repository

The files in your repository are organized as a Quarto Project. This enables easy rendering of all Quarto documents within the project folder with a single command, as well as the ability to share YAML configurations across multiple documents. To render the project, go to the Build tab in RStudio and click on “Render”.
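
As a rough illustration of shared configuration, a Quarto project keeps project-wide options in a _quarto.yml file at the top level of the project folder. The sketch below is hypothetical: the title and format options are assumptions, and the file already in your repository may look different.

```yaml
# _quarto.yml (illustrative only; the file in your repository may differ)
project:
  title: "Project title"   # placeholder

# Options set here are shared by every Quarto document in the project
format:
  html:
    toc: true
```

Running quarto render from a terminal in the project folder should be equivalent to clicking “Render” in the Build tab.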

Teams

Projects will be completed in teams of 3-5 students. Every team member should be involved in all aspects of planning and executing the project. Each team member should make an equal contribution to all parts of the project. The scope of your project is based on the number of contributing team members on your project. If you have 4 contributing team members, we will expect a larger project than a team of 3 contributing team members.

Some lab section meetings will be devoted to work on the project, so all teams will be formed within each lab section (i.e. only students in your lab section can be your team members). The course staff will assign students to teams. To facilitate this process, we will provide a short survey identifying study and communication habits. Once teams are assigned, they cannot be changed.

Team conflicts

Conflict is a healthy part of any team relationship. If your team doesn’t have conflict, then your team members are likely not communicating their issues with each other. Use your team contract (written at the beginning of the project) to help keep your team dynamic healthy.

When you have conflict, you should follow this procedure:

  1. Refer to the team contract and follow it to address the conflict.

  2. If you resolve the conflict without issue, great! Otherwise, update the team contract and try to resolve the conflict yourselves.

  3. If your team is unable to resolve your conflict, please contact soltoffbc@cornell.edu and explain your situation.

    We’ll ask to meet with all the group members and figure out how we can work together to move forward.

  4. Please do not avoid confrontation if you have conflict. If there’s a conflict, the best way to handle it is to bring it into the open and address it.

Project grade adjustments

Remember, do not do the work for a slacking team member. This only rewards their bad behavior. Simply leave their work unfinished. (We will not increase your grade during adjustments for doing more than your fair share.)

Your team will initially receive a final grade assuming that all team members contributed to your project. If you have a 5-person team, but only 3 people contributed, your team will likely receive a lower grade initially because only 3 people’s worth of effort exists for a 5-person project. About a week after the initial project grades are released, adjustments will be made to each individual team member’s group project grade.

We use your project’s Git history (to view the contributions of each team member) and the peer evaluations to adjust each team member’s grade. Adjustments can either increase or decrease your grade, depending on each individual’s contributions.

For example, if you have a 4-person team, but only 3 contributing members, the 3 contributing members may have their grades increased to reflect the effort of only 3 contributing members. The non-contributing member will likely have their grade decreased significantly.

Warning

I am serious about every member of the team equitably contributing to the project. Students who fail to contribute equitably may receive up to a 100% deduction on their project grade.

Please be patient while waiting for the grade adjustments; they take time to do fairly. Please know that I handle this entire process myself, and I take it very seriously. If you think your initial group project grade is unfair, please wait for your grade adjustment before you contact us.

The slacking team member

Please do not cover for a slacking/freeloading team member. Please do not do their work for them! This only rewards their bad behavior. Simply leave their work unfinished. (We will not increase your grade during adjustments for doing more than your fair share.)

Remember, we have your Git history. We can see who contributes to the project and who doesn’t. If a team member rarely commits to Git and only makes very small commits, we can see that they did not contribute their fair share.

All students should make their project contributions through their own GitHub account. Do not commit changes to the repository from another team member’s GitHub account. Your Git history should reflect your individual contributions to the project.

Proposal

There are two main purposes of the project proposal:

  • To help you think about the project early, so you can get a head start on finding data, reading relevant literature, thinking about the questions you wish to answer, etc.
  • To ensure that the topic you wish to analyze, methods you plan to use, and the scope of your analysis are feasible and will allow you to be successful for this project.

Identify 3 topics you’re interested in potentially using for the project. At least two of the three topics must utilize real-world data. If you’re unsure where to find data, you can use the list of potential data sources in the Tips + Resources section as a starting point. It may also help to think of topics you’re interested in investigating and find datasets on those topics.

Write the proposal in the proposal.qmd file in your project repo.

Important

You must use one of the topics in the proposal for the final project, unless instructed otherwise when given feedback.

Criteria for datasets

The datasets should meet the following criteria:

  • At least 500 observations
  • At least 8 columns
  • At least 6 of the columns must be useful and unique explanatory variables.
    • Identifier variables such as “name”, “social security number”, etc. are not useful explanatory variables.
    • If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique explanatory variables.
  • You may not use data that has previously been used in any course materials, or any derivation of data that has been used in course materials.

Warning

You may not use data from a secondary data archive. In plainest terms, do not use datasets you find from Kaggle or the UCI Machine Learning Repository. Your data should come from your own collection process (e.g. API or web scraping) or the primary source (e.g. government agency, research group, etc.).

Please ask a member of the course staff if you’re unsure whether your dataset meets the criteria.

If you set your hearts on a dataset that has fewer observations or variables than what’s suggested here, that might still be okay; use these numbers as guidance for a successful proposal, not as minimum requirements.
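
A quick way to compare a candidate dataset against the suggested sizes is sketched below. The helper name check_dataset_size() is hypothetical (it is not part of any package), and mtcars is used only because it ships with R; it is not an acceptable project dataset. The check covers only row and column counts, so judging whether columns are useful and unique explanatory variables is still up to you.

```r
# Hypothetical helper: compares a data frame against the suggested
# guidelines of at least 500 observations and at least 8 columns.
check_dataset_size <- function(data, min_rows = 500, min_cols = 8) {
  c(
    enough_rows = nrow(data) >= min_rows,
    enough_cols = ncol(data) >= min_cols
  )
}

# Illustration with a built-in dataset (32 rows, 11 columns)
check_dataset_size(mtcars)
#> enough_rows enough_cols
#>       FALSE        TRUE
```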

Questions for project mentor

Include specific, relevant questions you have for the project mentor about your proposed topics. These questions should be about the feasibility of the project, the quality of the data, the potential for interesting analysis, etc.

Resources for datasets

You can find data wherever you like, but here are some recommendations to get you started. You shouldn’t feel constrained to datasets that are already in a tidy format; you can start with data that needs cleaning and tidying, scrape data off the web, or collect your own data.

Proposal components

For each topic, include the following:

Problem or question

What is the problem you will solve?

For each topic, include the following:

  • A well formulated objective. (You may include more than one idea if you want to receive feedback on different ideas for your project. However, one per topic is required.)
  • Statement on why this topic is important.
  • Identify the types of variables you will use. Categorical? Quantitative?
  • What will be the major product(s)? A published website? An interactive web application a la Shiny? An R package? A deployable API?

Introduction and data

For each dataset (if one is provided):

  • Identify the source of the data.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

  • Write a brief description of the observations.

  • Address ethical concerns about the data, if any.

Glimpse of data

For each dataset (if one is provided):

  • Place the file containing your data in the data folder of the project repo.
  • Use the skimr::skim() function to provide a glimpse of the dataset.

Exploration

Settle on a single idea and state your objective(s) clearly. You will carry out most of your data collection and cleaning, compute some relevant summary statistics, and show some plots of your data as applicable to your objective(s).

Write up your explanation in the explore.qmd file in your project repo. It should include the following sections:

  • Objective(s). State the question(s) you are answering or the problem(s) you are solving clearly.
  • Data collection and cleaning [1]. Have an initial draft of your data cleaning appendix. Document every step that turns your raw data file(s) into the analysis-ready dataset that you would submit with your final project. Include a text narrative describing your data collection (downloading, scraping, surveys, etc.) and any additional data curation/cleaning (merging data frames, filtering, transformations of variables, etc.). Include code for data curation/cleaning, but not collection [2].
  • Data description. Have an initial draft of your data description section. Your data description should be about your analysis-ready data.
  • Data limitations. Identify any potential problems with your dataset.
  • Exploratory data analysis. Perform an (initial) exploratory data analysis.
  • Questions for reviewers. List specific questions for your project mentor to answer in giving you feedback on this phase.

[1] This applies if you are using real-world data. If you are generating synthetic data, define the process here.

[2] If you have written code to collect your data (e.g. using an API or web scraping), store this in a separate .qmd file or .R script in the repo.

Note

If your project does not make substantial use of real-world data, you should develop your plan for the products. Who is the audience for your product? What functions or features will you need to incorporate? How will you go about designing and implementing these features?

Warning

Thorough EDA requires substantial review and analysis of your data. You should not expect to complete this phase in a single day. You should expect to iterate through 20-30 charts, sets of summary statistics, etc., to get a good understanding of your data.

Visualizations are not expected to look perfect at this point since they are mainly intended for you and your team members. Standard expectations for visualizations (e.g. clearly labeled charts and axes, optimized color palettes) are not necessary at the EDA stage.

Draft

The purpose of the draft and peer review is to give you an opportunity to get early feedback on your analysis. Therefore, the draft and peer review will focus primarily on the exploratory analysis and initial drafts of the final product(s).

Write the draft write-up in the report.qmd file in your project repo. Be sure to explicitly identify how to access the draft product (e.g. a link to a published web page or Shiny app).

You should have a functional product at this stage, but it is okay to have some incompleteness or partial components. If you have made more progress by this point, then you are likely to receive higher quality feedback.

Peer review

Critically reviewing others’ work is a crucial part of the scientific process, and INFO 5001 is no exception. You will be assigned two teams to review. This feedback is intended to help you create a high quality final project, as well as give you experience reading and constructively critiquing the work of others.

During the peer feedback process, you will be provided read-only access to your partner team’s GitHub repo. You will provide your feedback in the form of GitHub issues to your partner team’s GitHub repo.

Peer review process and questions are outlined in the relevant lab instructions.

Peer reviews will be graded on the extent to which they comprehensively and constructively address the components of the reviewed team’s report. Specifics of peer review grading are also outlined in the relevant lab instructions.

Final product

You will create a functioning, working end-product constructed using a reproducible workflow. The form of your product will vary depending on your objectives. Examples of potential products include (but are not limited to):

  • A multi-page website constructed using Quarto
  • Shiny web application
  • R package with published documentation site
  • Application programming interface (API) constructed using Plumber and deployed publicly

Regardless of format, the product should be accessible to a public audience. That means users should be able to access the content through a web interface (and not have to clone your Git repo to access and run files).

Your final product must be reproducible. All team members should contribute to the GitHub repository, with regular meaningful commits.

Your final product will be evaluated based on degree of difficulty and execution. You will receive feedback during the proposal stage as to the perceived level of difficulty of your project.

Report

Your written report must be completed in the report.qmd file.

Tip

Before you finalize your write up, make sure the printing of code chunks is off with the option echo: false in the YAML.
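
For reference, here is a minimal sketch of where that option could go in the YAML header at the top of report.qmd. The title shown is a placeholder, not a requirement.

```yaml
---
title: "Final report"   # placeholder title
execute:
  echo: false           # do not print code chunks in the rendered report
---
```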

The report should be between 1000 and 2000 words. There is no expectation that you get close to the upper limit; anywhere in that range is fine as long as you have clearly explained yourself. The limits are provided to help you, not to set stressful expectations.

Be selective in what you include in your final write-up. The goal is to write a cohesive narrative that demonstrates a thorough and comprehensive workflow. This includes (but is not limited to) addressing the following items.

Tip

Feel free to add additional sections and/or structure to your report where necessary. We will take that into account when we grade.

Introduction

Identify the project motivation, data, and objectives. What is the context of the work? What problem are you trying to solve? What are your main conclusions?

Justification of approach

Describe the product(s). What did your team create? Who is the intended audience? How will the product(s) meet their needs?

Data description

If using real-world data, describe it. A good model for this is presented in Gebru et al, 2018. Answer any relevant questions from sections 3.1-3.5 of the Gebru et al article, especially the following questions:

  • What are the observations (rows) and the attributes (columns)?
  • Why was this dataset created?
  • Who funded the creation of the dataset?
  • What processes might have influenced what data was observed and recorded and what was not?
  • What preprocessing was done, and how did the data come to be in the form that you are using?
  • If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?

Design process

Summarize your design process for the product(s). Explain the key design challenges you encountered in creating the main product(s). What were the most important considerations your team faced in designing and constructing the final product?

Limitations

Assess the limitations of your work. What hurdles did you fail to overcome? If you had the opportunity to do this again, how would you improve on your product(s)?

Acknowledgments

Recognize any people or online resources that you found helpful. These can be tutorials, software packages, Stack Overflow questions, peers, and data sources. Showing gratitude is a great way to feel happier! But it also has the nice side-effect of reassuring us that you’re not passing off someone else’s work as your own. Crossover with other courses is permitted and encouraged, but it must be clearly stated, and it must be obvious what parts were and were not done for 5001. Copying without attribution robs you of the chance to learn, and wastes our time investigating.

Appendices

You are welcome to include an appendix with additional work at the end of the written report document; however, grading will largely be based on the content in the main body of the report. You should assume the reader will not see the material in the appendix unless prompted to view it in the main body of the report. The appendix should be neatly formatted and easy for the reader to navigate. It is not included in the 1000-2000 word limit.

You should submit your appendix(-ces) in the appendices.qmd file in your project repo.

  • At minimum, you should have an appendix for your data cleaning. Submit an updated version of your data cleaning description from phase II that describes all data cleaning steps performed on your raw data to turn it into the analysis-ready dataset submitted with your final project. When rendered, it should output the dataset you submit as part of your project (e.g. written as a .csv file).
  • (Optional) Other appendices. You will almost certainly feel that you have done a lot of work that didn’t end up in the final report. We want you to edit and focus, but we also want to make sure that there’s a place for work that didn’t work out or that didn’t fit in the final presentation. You may include any analyses you tried but were tangential to the final direction of your main report. Graders may briefly look at these appendices, but they also may not. You want to make your final report interesting enough that the graders don’t feel the need to look at other things you tried. “Interesting” doesn’t necessarily mean that the results in your final report were all statistically significant; it could be that your results were not significant but you were able to interpret them in an interesting and informed way.
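
As a minimal sketch of what the data cleaning appendix does when rendered, the chunk below cleans a small made-up data frame and writes the result to a .csv file. The column names, cleaning steps, and file name are all assumptions for illustration; your appendix would read your actual raw file(s) and would typically write into the data folder rather than a temporary directory.

```r
# Made-up raw data standing in for a real raw file
raw <- data.frame(
  state = c("NY", "ny", NA, "CA"),
  value = c("10", "20", "30", "n/a"),
  stringsAsFactors = FALSE
)

# Step 1: drop rows with a missing state
clean <- raw[!is.na(raw$state), ]

# Step 2: standardize state abbreviations to upper case
clean$state <- toupper(clean$state)

# Step 3: convert value to numeric; entries that are not numbers become NA
clean$value <- suppressWarnings(as.numeric(clean$value))

# Step 4: write the analysis-ready dataset (a temporary directory is used
# here only so the sketch runs anywhere)
out_path <- file.path(tempdir(), "analysis-ready.csv")
write.csv(clean, out_path, row.names = FALSE)
```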

Organization + formatting

While not a separate written section, you will be assessed on the overall presentation and formatting of the written report. A non-exhaustive list of criteria includes:

  • The report is neatly written and organized with clear section headers and appropriately sized figures with informative labels.
  • Numerical results are displayed with a reasonable number of digits, and all visualizations are neatly formatted.
  • All citations and links are properly formatted.
  • If there is an appendix, it is reasonably organized and easy for the reader to find relevant information.
  • All code, warnings, and messages are suppressed.
  • The main body of the written report (not including the appendix) is no longer than 10 pages.

Presentation + slides

Slides

In addition to the written report, your team will also create an oral presentation that summarizes and showcases your project. Using a slide presentation, you will introduce your objective(s) and dataset, showcase visualizations, and discuss the primary outcomes. These slides should serve as a brief visual addition to your written report and will be graded for content and quality.

Your presentation will be created using Quarto, which allows you to write slides using the same reproducible document structure you’re used to.

The slide deck should have no more than 6 content slides + 1 title slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6 slides.

  • Title Slide
  • Slide 1: Introduce the topic and motivation
  • Slide 2: Introduce the data
  • Slide 3: Highlights from EDA
  • Slides 4-5: Inference/modeling/other analysis
  • Slide 6: Conclusions + future work

Presentation

Presentations will take place in class during the last lab of the semester. The presentation must be no longer than 10 minutes.

Evaluation

Presentations will be evaluated by the course staff (15 points) and by your peers in your lab section (5 points). Students will receive access to a Google Form where they will provide (confidential) feedback on their peer groups’ presentations. Students will evaluate their own presentations.

Reproducibility + organization

All written work should be reproducible, and the GitHub repo should be neatly organized.

  • Points for reproducibility + organization will be based on the reproducibility of the entire repository and the organization of the project GitHub repo.
  • The repo should be neatly organized as described above, there should be no extraneous files, and all text in the README should be easily readable.

Teamwork

Every team member should make an equal contribution to all parts of the project. Every team member should have an equal experience designing, coding, testing, etc.

At the completion of the project, you will be asked to fill out a survey where you rate the contribution and teamwork of each team member by assigning a contribution percentage for each team member. Working as a team is every team member’s responsibility.

If you are suggesting that an individual did less than half the expected contribution given your team size (e.g., for a team of four students, if a student contributed less than 12.5% of the total effort), please provide some explanation. If any individual gets an average peer score indicating that they underperformed on the project, we will conduct further analysis and their overall project grade may be adjusted accordingly.
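
The threshold in the example follows from simple arithmetic, assuming the expected contribution is an equal share of the total effort (an assumption, but one consistent with the 12.5% figure given for a team of four). For a team of n students:

```latex
\[
\text{expected share} = \frac{1}{n}, \qquad
\text{threshold} = \frac{1}{2} \cdot \frac{1}{n} = \frac{1}{2n}, \qquad
n = 4 \;\Rightarrow\; \frac{1}{8} = 12.5\%
\]
```

By the same reasoning, the threshold would be roughly 16.7% for a team of three and 10% for a team of five.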

Overall grading

Total 150 pts
Project proposal 10 pts
Exploration 15 pts
Draft 10 pts
Peer review 5 pts
Final report 20 pts
Final product(s) 60 pts
Slides + presentation 15 pts
Slides + presentation (peer) 5 pts
Reproducibility + organization 10 pts

Evaluation criteria

Project proposal

Dataset ideas
  • Less developed projects: Fewer than three topics are included. Topic ideas are vague and impossible or excessively difficult to collect.
  • Typical projects: Three topic ideas are included and all or most datasets could feasibly be collected or accessed by the end of the semester. Each dataset is described alongside a note about availability with a source cited.
  • More developed projects: Three topic ideas are included and all or most datasets could feasibly be collected or accessed by the end of the semester. Each dataset is described alongside a note about availability with (possibly multiple) sources cited. Each dataset could reasonably be part of a data science project, driven by an interesting research question.

Questions for reviewers
  • Less developed projects: The questions for reviewers are vague or unclear.
  • Typical projects: The questions for reviewers are specific to the datasets and are based on group discussions between team members.
  • More developed projects: The questions for reviewers are specific to the datasets and are based on group discussions between team members. Questions for reviewers look toward the next stage of the project.

Exploration

Objective(s)
  • Less developed projects: Objective is not clearly stated or significantly limits potential analysis.
  • Typical projects: Clearly states the objective(s), which have moderate potential for relevant impact.
  • More developed projects: Clearly states complex objective(s) that lead to significant potential for relevant impact.

Data cleaning
  • Less developed projects: Data is minimally cleaned, with little documentation and description of the steps undertaken.
  • Typical projects: Completes all necessary data cleaning for subsequent analyses. Describes cleaning steps with some detail.
  • More developed projects: Completes all necessary data cleaning for subsequent analyses. Describes all cleaning steps in full detail, so that the reader has an excellent grasp of how the raw data was transformed into the analysis-ready dataset.

Data description
  • Less developed projects: Simple description of some aspects of the dataset, little consideration for sources. The description is missing answers to applicable questions detailed in the “Datasheets for Datasets” paper.
  • Typical projects: Answers all relevant questions in the “Datasheets for Datasets” paper.
  • More developed projects: All expectations of typical projects + credits and values data sources.

Data limitations
  • Less developed projects: The limitations are not explained in depth. There is no mention of how these limitations may affect the meaning of results.
  • Typical projects: Identifies potential harms and data gaps, and describes how these could affect the meaning of results.
  • More developed projects: Creatively identifies potential harms and data gaps, and describes how these could affect the meaning of results, and the impact of results on people. It is evident that significant thought has been put into the limitations of the collected data.

Exploratory data analysis
  • Less developed projects: Motivation for choice of analysis methods is unclear. Does not justify decisions to either confirm / update objective and data description.
  • Typical projects: Sufficient plots (20-30) and summary statistics to identify typical values in single variables and connections between pairs of variables. Uses exploratory analysis to confirm/update objectives and data description.
  • More developed projects: All expectations of typical projects + analysis methods are carefully chosen to identify important characteristics of data.

Draft

Functional prototype
  • Less developed projects: Product is non-functional or broken.
  • Typical projects: Product is reasonably functional. It need not be perfect or without errors, but is mostly working and includes most substantive parts.
  • More developed projects: Product is functional and performs without errors. All major components have been incorporated. Still lacks polish and finishing touches.

Progress
  • Less developed projects: It is unclear whether or not the project will be completed by the deadline.
  • Typical projects: The team has made progress on the project at this point and is on track to finish by the deadline.
  • More developed projects: The team has made substantial progress on the project at this point and is on track to finish ahead of the deadline.

Reproducibility
  • Less developed projects: Source code is unclear. Project files are missing or hard to find. Project files cannot be rendered.
  • Typical projects: Source code is easy to read, properly formatted, and appropriately documented. Project files are generally organized in the repository and easy to find. Project files generally render with minimal errors.
  • More developed projects: All expectations of typical projects + all required files are provided. Project files (e.g. Quarto, Shiny apps, R scripts) render without issues and reproduce the necessary outputs.

Peer review

  • Peer review issues are opened
  • Reviews are constructive, actionable, and sufficiently thorough

Final report

TODO

Introduction
  • Less developed projects: Less focused and organized. They may jump to technical details without explaining why results are important. Research questions are not clearly stated and/or results are not clearly summarized at the end of the introduction.
  • Typical projects: Provides background information and context. Introduces key terms and data sources. Outlines research question(s). Ends with a brief summary of findings.
  • More developed projects: All expectations of typical projects + clearly describes why the setting is important and what is at stake in the results of the analysis. Even if the reader doesn’t know much about the subject, they know why they care about the results of your analysis.

Justification of approach

Limitations
  • Less developed projects: The limitations are not explained in depth. There is no mention of how these limitations may affect the meaning of results.
  • Typical projects: Identifies potential harms and data gaps, and describes how these could affect the meaning of results.
  • More developed projects: Creatively identifies potential harms and data gaps, and describes how these could affect the meaning of results, as well as the impact of results on people.

Final product(s)

TODO

Category Less developed projects Typical projects More developed projects
Design + visualization
Functionality
Code
Impact

Slides + presentation

Time management
  • Less developed projects: Only some members speak during the presentation. Team does not manage time wisely (e.g. runs out of time, finishes early without adequately presenting their project).
  • Typical projects: All members speak during the presentation. Team does not exceed the five minute limit.
  • More developed projects: Team maximally uses their five minutes. Clearly communicates their objectives and outcomes from the project.

Professionalism
  • Less developed projects: Presentation is slapped together or haphazard. Seems like independent pieces of work patched together.
  • Typical projects: Presentation appears to be rehearsed. There is cohesion to the presentation.
  • More developed projects: All elements of typical projects + everyone says something meaningful about the project.

Slides
  • Less developed projects: Slides contain excessive text and/or content. Team relies too heavily on slides for their presentation.
  • Typical projects: Slides are well-organized. Slides are used as a tool to assist the oral presentation.
  • More developed projects: All elements of typical projects + graphics and tables follow best-practices (e.g. all text is legible, appropriate use of color and legends). Slides are not crammed full of text.

Creativity/originality
  • Less developed projects: Project meets the minimum requirements but not much else. Project is incomplete or does not meet the team’s objectives.
  • Typical projects: Project appears carefully thought out. Time and effort seem to have gone into the planning and implementation of the project.
  • More developed projects: All elements of typical projects + project goes above and beyond the minimum requirements. Addresses a truly important social issue or noteworthy goal.

Content
  • Less developed projects: Objective is unclear. Product(s) do not clearly address the research question. Limitations are glossed over or ignored entirely.
  • Typical projects: Objective is stated. Product(s) address the objective. Limitations are noted.
  • More developed projects: Objective is clearly stated. Product(s) clearly address the objective. Limitations are carefully considered and articulated.

Slides + presentation (peer)

  • Content: Is/are the objective(s) clearly articulated and can the product(s) accomplish it/them?

  • Content: Did the team effectively meet the objective(s)?

  • Creativity and critical thought: Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?

  • Slides: Are the slides well organized, readable, not full of text, featuring figures with legible labels, legends, etc.?

  • Professionalism: How well did the team present? Does the presentation appear to be well practiced? Are they reading off of a script? Did everyone get a chance to say something meaningful about the project?

Reproducibility + organization

Reproducibility
  • Less developed projects: Required files are missing. Quarto files do not render successfully (except when a package needs to be installed).
  • Typical projects: All required files are provided. Project files (e.g. Quarto, Shiny apps, R scripts) render without issues and reproduce the necessary outputs.

Data documentation
  • Less developed projects: Codebook is missing. No local copies of data files.
  • Typical projects: All datasets are stored in a data folder, a codebook is provided, and a local copy of the data file is used in the code where needed.

File readability
  • Less developed projects: Documents lack a clear structure. There are extraneous materials in the repo and/or files are not clearly organized.
  • Typical projects: Documents (Quarto files and R scripts) are well structured and easy to follow. No extraneous materials.

Issues
  • Less developed projects: Issues have been left open, or are closed mostly without specific commits addressing them.
  • Typical projects: All issues are closed, mostly with specific commits addressing them.

Late work policy

No late work is accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.