Project description

Important dates

Important

The details will be updated as the project date approaches.

Introduction

TL;DR: Create something related to data science.

This is intentionally vague – part of the challenge is to design a project that best showcases your interests and strengths.

One requirement is that your project should feature some element that you had to learn on your own. This could be a package you use that we didn’t teach in class (e.g., a package for building interactive web applications) or a workflow (e.g., making a package) or anything else.

If you’re not sure if your “new” thing counts, just ask!

Ideas

Identify a goal for your project that leverages the skills you develop in this class. Some possible ideas include:

  • Develop educational content introducing and presenting a technical topic from statistics or mathematics (e.g. gradient descent, neural networks, decision trees)
  • Build a Shiny web application for visualizing and exploring a complex dataset
  • Create an R package that provides enhanced functionality for ggplot2
  • Build an R package to provide a straightforward interface to an API
  • Do a deep dive into accessibility for data visualization and build a lesson plan for creating accessible visualizations with ggplot2, Quarto, and generally within the R ecosystem.
  • Build a chatbot and construct an API to provide programmatic access

Most importantly, be prepared to brainstorm many ideas and discard most of them until you settle on a topic that everyone on the team is happy with. A good topic showcases what you’ve learned in the class and how you can build on it to learn something new and implement it in your project.

The project is very open ended. Neatness, coherency, and clarity will count. All analyses must be done in RStudio, using R, and all components of the project must be reproducible.

You will work on the project with your lab teams.

Deliverables

The four primary deliverables for the final project are

  1. A project proposal with three ideas.
  2. A final report that explains the process and results.
  3. A reproducible product in a format based upon the type of project you propose (e.g. R package, interactive web application, custom-built API), with one required draft along the way.
  4. A presentation with slides.

There will be additional submissions throughout the semester to facilitate completion of the final product and presentation.

Warning

You will not be submitting anything on Gradescope for the project. Submission of these deliverables will happen on GitHub and feedback will be provided as GitHub issues that you need to engage with and close. The collection of the documents in your GitHub repo will create a website for your project. To create the website go to the Build tab in RStudio, and click on Render Website. All deliverables must be successfully rendered as part of your published website. We will not check source code (*.qmd) files or clone your repo locally to produce the required documents.

Tip

You can access your published project website using the URL

https://pages.github.coecis.cornell.edu/info5001-fa23/project-TEAM-NAME/

For example, if your team name is awesome-mewtwo then your URL will be

https://pages.github.coecis.cornell.edu/info5001-fa23/project-awesome-mewtwo/

Teams

Projects will be completed in teams of 3-5 students. Every team member should be involved in all aspects of planning and executing the project. Each team member should make an equal contribution to all parts of the project. The scope of your project is based on the number of contributing team members. If you have 4 contributing team members, we will expect a larger project than from a team of 3 contributing team members.

The course staff will assign students to teams. To facilitate this process, we will provide a short survey identifying study and communication habits. Once teams are assigned, they cannot be changed.

Team conflicts

Conflict is a healthy part of any team relationship. If your team doesn’t have conflict, then your team members are likely not communicating their issues with each other. Use your team contract (written at the beginning of the project) to help keep your team dynamic healthy.

When you have conflict, you should follow this procedure:

  1. Refer to the team contract and follow it to address the conflict.

  2. If you resolve the conflict without issue, great! Otherwise, update the team contract and try to resolve the conflict yourselves.

  3. If your team is unable to resolve your conflict, please contact soltoffbc@cornell.edu and explain your situation.

    We’ll ask to meet with all the group members and figure out how we can work together to move forward.

  4. Please do not avoid confrontation if you have conflict. If there’s a conflict, the best way to handle it is to bring it into the open and address it.

Proposal

There are two main purposes of the project proposal:

  • To help you think about the project early, so you can get a head start on finding data, reading relevant literature, thinking about the questions you wish to answer, etc.
  • To ensure that the topic you wish to analyze, methods you plan to use, and the scope of your analysis are feasible and will allow you to be successful for this project.

Identify 3 topics you’re interested in potentially using for the project. At least two of the three topics must utilize real-world data. If you’re unsure where to find data, you can use the list of potential data sources in the Tips + Resources section as a starting point. It may also help to think of topics you’re interested in investigating and find datasets on those topics.

Write the proposal in the proposal.qmd file in your project repo.

Important

You must use one of the topics in the proposal for the final project, unless instructed otherwise when given feedback.

Criteria for datasets

The datasets should meet the following criteria:

  • At least 500 observations
  • At least 8 columns
  • At least 6 of the columns must be useful and unique explanatory variables.
    • Identifier variables such as “name”, “social security number”, etc. are not useful explanatory variables.
    • If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique explanatory variables.
  • You may not use data that has previously been used in any course materials, or any derivation of data that has been used in course materials.
  • We strongly recommend curating at least one of your datasets via an API or web scraping.

Please ask a member of the course staff if you’re unsure whether your dataset meets the criteria.

If you set your hearts on a dataset that has fewer observations or variables than what’s suggested here, that might still be okay; use these numbers as guidance for a successful proposal, not as minimum requirements.
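The size guidance above can be checked quickly before you commit to a dataset. The following is a minimal sketch in base R; the helper name `check_dataset_size()` and the simulated data frame `candidate` are hypothetical and stand in for your own data. It checks only the row and column counts; whether columns are useful and unique explanatory variables still requires your own judgment.

```r
# Hypothetical helper: checks only the size guidance (rows and columns),
# not whether the variables are useful or unique.
check_dataset_size <- function(df, min_rows = 500, min_cols = 8) {
  c(
    enough_rows = nrow(df) >= min_rows,
    enough_cols = ncol(df) >= min_cols
  )
}

# Simulated stand-in for a real dataset (600 rows, 9 columns).
set.seed(1)
candidate <- as.data.frame(matrix(rnorm(600 * 9), nrow = 600, ncol = 9))

# Returns a named logical vector, one entry per check.
check_dataset_size(candidate)
```

Because the numbers are guidance rather than hard minimums, a `FALSE` here is a prompt to talk with the course staff, not an automatic disqualification.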

Resources for datasets

You can find data wherever you like, but here are some recommendations to get you started. You shouldn’t feel constrained to datasets that are already in a tidy format; you can start with data that needs cleaning and tidying, scrape data off the web, or collect your own data.

Warning

No datasets from Kaggle or #TidyTuesday! You can use data found on a secondary host, but always trace it back to the original source to verify you have the complete and up-to-date records.

Proposal components

For each topic, include the following:

Problem or question

What is the problem you will solve? Or, what is the question you will answer?

For each topic, include the following:

  • A well formulated question or objective. (You may include more than one idea if you want to receive feedback on different ideas for your project. However, one per topic is required.)
  • Statement on why this topic is important.
  • Identify the types of variables you will use. Categorical? Quantitative?
  • What will be the major deliverable(s)? A published website? An interactive web application a la Shiny? An R package? A deployable API?

Introduction and data

For each dataset (if one is provided):

  • Identify the source of the data.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

  • Write a brief description of the observations.

  • Address ethical concerns about the data, if any.

Glimpse of data

For each dataset (if one is provided):

  • Place the file containing your data in the data folder of the project repo.
  • Use the skimr::skim() function to provide a glimpse of the dataset.
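As a rough illustration of these two steps, the sketch below reads a file from the data folder and passes it to `skimr::skim()`. The file name `my-data.csv` is a hypothetical placeholder, and the code assumes the skimr package is installed; the guard simply skips the summary if either assumption does not hold.

```r
# Hypothetical path; replace with the file you placed in the data folder.
data_path <- file.path("data", "my-data.csv")

# Only run the summary if the file exists and skimr is installed.
if (file.exists(data_path) && requireNamespace("skimr", quietly = TRUE)) {
  my_data <- read.csv(data_path)
  skimr::skim(my_data)
}
```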

Exploration

Settle on a single idea and state your objective(s) clearly. You will carry out most of your data collection and cleaning, compute some relevant summary statistics, and show some plots of your data as applicable to your objective(s).

Write up your explanation in the explore.qmd file in your project repo. It should include the following sections:

  • Objective(s). State the question(s) you are answering or the problem(s) you are solving clearly.
  • Data collection and cleaning [1]. Have an initial draft of your data cleaning appendix. Document every step that takes your raw data file(s) and turns them into the analysis-ready dataset that you would submit with your final project. Include text narrative describing your data collection (downloading, scraping, surveys, etc.) and any additional data curation/cleaning (merging data frames, filtering, transformations of variables, etc.). Include code for data curation/cleaning, but not collection [2].
  • Data description. Have an initial draft of your data description section. Your data description should be about your analysis-ready data.
  • Data limitations. Identify any potential problems with your dataset.
  • Exploratory data analysis. Perform an (initial) exploratory data analysis.
  • Questions for reviewers. List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.

[1] If you are using real-world data. If you are generating synthetic data, define the process here.

[2] If you have written code to collect your data (e.g. using an API or web scraping), store this in a separate .qmd file or .R script in the repo.

Note

If your project does not make substantial use of real-world data, you should develop your plan for the deliverables. Who is the audience for your deliverable? What functions or features will you need to incorporate? How will you go about designing and implementing these features?

Draft

The purpose of the draft and peer review is to give you an opportunity to get early feedback on your analysis. Therefore, the draft and peer review will focus primarily on the exploratory analysis and initial drafts of the final deliverable(s).

Write the draft write-up in the report.qmd file in your project repo. Be sure to explicitly identify how to access the draft deliverable (e.g. a link to a published web page or Shiny app).

You should have a functional deliverable at this stage, but it is okay to have some incompleteness or partial components. If you have made more progress by this point, then you are likely to receive higher quality feedback.

Peer review

Critically reviewing others’ work is a crucial part of the scientific process, and INFO 5001 is no exception. You will be assigned two teams to review. This feedback is intended to help you create a high quality final project, as well as give you experience reading and constructively critiquing the work of others.

During the peer feedback process, you will be provided read-only access to your partner team’s GitHub repo. You will provide your feedback in the form of GitHub issues to your partner team’s GitHub repo.

Peer review process and questions are outlined in the relevant lab instructions.

Peer reviews will be graded on the extent to which they comprehensively and constructively address the components of the reviewee’s team’s report. Specifics of peer review grading are also outlined in the relevant lab instructions.

Final deliverable

You will create a functioning, working end-product constructed using a reproducible workflow. The form of your deliverable will vary depending on your objectives. Examples of potential deliverables include (but are not limited to):

  • A multi-page website constructed using Quarto
  • Shiny web application
  • R package with published documentation site
  • Application programming interface (API) constructed using Plumber and deployed publicly

Regardless of format, the deliverable should be accessible to a public audience. That means users should be able to access the content through a web interface (and not have to clone your Git repo to access and run files).

Your final deliverable must be reproducible. All team members should contribute to the GitHub repository, with regular meaningful commits.

Your final deliverable will be evaluated based on degree of difficulty and execution. You will receive feedback during the proposal stage as to the perceived level of difficulty of your project.

Report

Your written report must be completed in the report.qmd file.

Tip

Before you finalize your write up, make sure the printing of code chunks is off with the option echo: false in the YAML.

The report should be between 1000 and 2000 words. There is no expectation that you get close to the upper limit; anywhere in that range is fine as long as you have clearly explained yourself. The limits are provided to help you, not to set stressful expectations.

Be selective in what you include in your final write-up. The goal is to write a cohesive narrative that demonstrates a thorough and comprehensive workflow. This includes (but is not limited to) addressing the following items.

Tip

Feel free to add additional sections and/or structure to your report where necessary. We will take that into account when we grade.

Introduction

Identify the project motivation, data, and objectives. What is the context of the work? What problem are you trying to solve? What are your main conclusions?

Justification of approach

Describe the deliverable(s). What did your team create? Who is the intended audience? How will the deliverable(s) meet their needs?

Data description

If using real-world data, describe it. A good model for this is presented in Gebru et al., 2018. Answer any relevant questions from sections 3.1-3.5 of the Gebru et al. article, especially the following questions:

  • What are the observations (rows) and the attributes (columns)?
  • Why was this dataset created?
  • Who funded the creation of the dataset?
  • What processes might have influenced what data was observed and recorded and what was not?
  • What preprocessing was done, and how did the data come to be in the form that you are using?
  • If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?

Design process

Summarize your design process for the deliverable(s). Explain the key design challenges you encountered in creating the main deliverable(s). What were the most important considerations your team faced in designing and constructing the final product?

Limitations

Assess the limitations of your work. What hurdles did you fail to overcome? If you had the opportunity to do this again, how would you improve on your deliverable(s)?

Acknowledgments

Recognize any people or online resources that you found helpful. These can be tutorials, software packages, Stack Overflow questions, peers, and data sources. Showing gratitude is a great way to feel happier! But it also has the nice side-effect of reassuring us that you’re not passing off someone else’s work as your own. Crossover with other courses is permitted and encouraged, but it must be clearly stated, and it must be obvious what parts were and were not done for 5001. Copying without attribution robs you of the chance to learn, and wastes our time investigating.

Appendices

You are welcome to include an appendix with additional work at the end of the written report document; however, grading will largely be based on the content in the main body of the report. You should assume the reader will not see the material in the appendix unless prompted to view it in the main body of the report. The appendix should be neatly formatted and easy for the reader to navigate. It is not included in the 1000-2000 word limit.

You should submit your appendix(-ces) in the appendices.qmd file in your project repo.

  • At minimum, you should have an appendix for your data cleaning. Submit an updated version of your data cleaning description from phase II that describes all data cleaning steps performed on your raw data to turn it into the analysis-ready dataset submitted with your final project. When rendered, it should output the dataset you submit as part of your project (e.g. written as a .csv file).
  • (Optional) Other appendices. You will almost certainly feel that you have done a lot of work that didn’t end up in the final report. We want you to edit and focus, but we also want to make sure that there’s a place for work that didn’t work out or that didn’t fit in the final presentation. You may include any analyses you tried but were tangential to the final direction of your main report. Graders may briefly look at these appendices, but they also may not. You want to make your final report interesting enough that the graders don’t feel the need to look at other things you tried. “Interesting” doesn’t necessarily mean that the results in your final report were all statistically significant; it could be that your results were not significant but you were able to interpret them in an interesting and informed way.
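To make the data cleaning appendix concrete, here is a minimal sketch of a cleaning script that ends by writing the analysis-ready dataset to a .csv file. Everything specific in it is an assumption for illustration: the `raw` data frame, the cleaning steps, and the output file name. It writes to a temporary directory so it can run anywhere; in your project you would write to the data folder of your repo instead.

```r
# Hypothetical raw data standing in for your own file(s).
raw <- data.frame(
  id = 1:5,
  state = c("NY", "ny", "CA", NA, "ca"),
  value = c(10, 12, NA, 7, 9)
)

# Illustrative cleaning steps: drop rows with missing values,
# then standardize the state codes to upper case.
clean <- raw[complete.cases(raw), ]
clean$state <- toupper(clean$state)

# Write the analysis-ready dataset. The temporary directory is used only
# to keep this sketch self-contained; the file name is an assumption.
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
out_path <- file.path(out_dir, "analysis-ready.csv")
write.csv(clean, out_path, row.names = FALSE)
```

Keeping every step in one rendered document means the appendix both documents the cleaning and regenerates the submitted file.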

Organization + formatting

While not a separate written section, you will be assessed on the overall presentation and formatting of the written report. A non-exhaustive list of criteria includes:

  • The report is neatly written and organized with clear section headers and appropriately sized figures with informative labels.
  • Numerical results are displayed with a reasonable number of digits, and all visualizations are neatly formatted.
  • All citations and links are properly formatted.
  • If there is an appendix, it is reasonably organized and easy for the reader to find relevant information.
  • All code, warnings, and messages are suppressed.
  • The main body of the written report (not including the appendix) is no longer than 10 pages.
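One possible way to meet the criterion on suppressing code, warnings, and messages is a setup chunk near the top of the report. This is only a sketch and not a course requirement: it assumes the document is rendered with the knitr engine, and it is an alternative to setting the equivalent options in the document YAML.

```r
# Assumes the knitr engine; sets defaults for all later chunks so that
# code, warnings, and messages are hidden in the rendered report.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
}
```

Individual chunks can still override these defaults if a specific piece of code needs to be shown.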

Presentation + slides

Slides

In addition to the written report, your team will also create an oral presentation that summarizes and showcases your project. Using a slide presentation, you will introduce your objective(s) and dataset, showcase visualizations, and discuss the primary outcomes. These slides should serve as a brief visual addition to your written report and will be graded for content and quality.

Your presentation will be created using Quarto, which allows you to write slides using the same reproducible document structure you’re used to.

The slide deck should have no more than 6 content slides + 1 title slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6 slides.

  • Title Slide
  • Slide 1: Introduce the topic and motivation
  • Slide 2: Introduce the data
  • Slide 3: Highlights from EDA
  • Slides 4-5: Inference/modeling/other analysis
  • Slide 6: Conclusions + future work

Presentation

Presentations will take place in class during the last lab of the semester. The presentation must be no longer than 5 minutes. You can choose to present live in class (recommended) or pre-record a video to be shown in class. Either way you must attend the lab session for the Q&A following your presentation.

If you choose to pre-record your presentation, you may use any platform that works best for your group to record your presentation. Below are a few resources on recording videos:

Once your video is ready, upload the video to Panopto or another video platform (e.g., YouTube), then modify the href to your video in your repo _quarto.yml.

      - text: "Presentation"
        href: presentation.qmd

Render the website, and your navigation bar will now directly link to your recorded presentation.

Tip

To upload your video to Panopto:

  • Click the Panopto tab in the course Canvas site.
  • Click the “+” and select “Upload files”.
  • Locate the video on your computer and click to upload.
  • Once you’ve uploaded the video to Panopto, click to share the video and copy the video’s URL. This is the URL to include in _quarto.yml as described above.

Evaluation

Presentations will be evaluated by the course staff (15 points) and by your peers in your lab section (5 points). Students will receive access to a Google Form where they will provide (confidential) feedback on their peer groups’ presentations. Students will evaluate their own presentations.

Reproducibility + organization

All written work should be reproducible, and the GitHub repo should be neatly organized.

  • Points for reproducibility + organization will be based on the reproducibility of the entire repository and the organization of the project GitHub repo.
  • The repo should be neatly organized as described above, with no extraneous files, and all text in the README should be easily readable.

Teamwork

Every team member should make an equal contribution to all parts of the project. Every team member should have an equal experience designing, coding, testing, etc.

At the completion of the project, you will be asked to fill out a survey where you rate the contribution and teamwork of each team member by assigning a contribution percentage for each team member. Working as a team is every team member’s responsibility.

If you are suggesting that an individual did less than half the expected contribution given your team size (e.g., for a team of four students, if a student contributed less than 12.5% of the total effort), please provide some explanation. If any individual gets an average peer score indicating that they underperformed on the project, we will conduct further analysis and their overall project grade may be adjusted accordingly.
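The threshold in the example above follows from simple arithmetic: the expected share is 100% divided by the team size, and the threshold is half of that. A small sketch in R, where the function name `flag_threshold()` is hypothetical:

```r
# Hypothetical helper: contribution percentage below which an
# explanation is requested, given the team size.
flag_threshold <- function(team_size) {
  expected_share <- 100 / team_size  # equal contribution, in percent
  expected_share / 2                 # half the expected contribution
}

flag_threshold(4)  # 12.5, matching the team-of-four example
```

For a team of five, the same calculation gives an expected share of 20% and a threshold of 10%.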

Grading

Total 130 pts
Project proposal 10 pts
Exploration 15 pts
Draft 10 pts
Peer review 5 pts
Final report 20 pts
Final deliverable(s) 40 pts
Slides + presentation 15 pts
Slides + presentation (peer) 5 pts
Reproducibility + organization 10 pts

Some of the components are further detailed below.

Project proposal

  • Introduction and data (3 points)
  • Problem or question (3 points)
  • Glimpse of data (3 points)
  • Project website is built successfully and accessible via GitHub Pages link (1 point)

Exploration

  • Clearly stated objective(s) (2 points)
  • Data collection and cleaning (4 points)
  • Data description (2 points)
  • Data limitations (2 points)
  • Exploratory analysis (2 points)
  • Questions for reviewers (1 point)

Draft

  • Functional prototype (4 points)
  • Evolution of the project (3 points)
  • Source code is easy to read, properly formatted, and properly documented. (3 points)

Peer review

  • Peer review issues open
  • Reviews are constructive, actionable, and sufficiently thorough

Final report

  • Introduction: The introduction provides a clear explanation of the project objectives. (4 points)
  • Justification of approach: Defines the deliverable(s) constructed for the project. The chosen approach is clearly explained and justified. When used, data is described in sufficient detail. Design process includes not just the final product, but also addresses decision points and alternative paths not taken. (12 points)
  • Limitations: Identifies reasonable limitations to the scope of the work. Addresses potential biases in the data or model assumptions. Proposes potential remedies in future iterations of the project. (4 points)

Final deliverable(s)

  • Design + visualization: The design/visualizations are appropriate, easy to read, and accessible. (10 points)
  • Functionality: Deliverable(s) provide sufficient value to the intended audience. No errors or warnings present in final version. Deliverable(s) are complete. (10 points)
  • Code: Code is efficient, easy to read, properly formatted, and properly documented. (10 points)
  • Impact: Broader impact and usefulness of the deliverable(s) is clear. (10 points)

Slides + presentation

  • Time management: Did the team divide the time well among themselves, or did they get cut off for going over time? (2 points)

  • Professionalism: How well did the team present? Does the presentation appear to be well practiced? Did everyone get a chance to say something meaningful about the project? (2 points)

  • Teamwork: Did the team present a unified story, or did it seem like independent pieces of work patched together? (2 points)

  • Slides: Are the slides (or other presentation medium) well organized, readable, not full of text, featuring figures with legible labels, legends, etc.? (2 points)

  • Creativity / Critical Thought: Is the project carefully thought out? Does it appear that time and effort went into the planning and implementation of the project? (2 points)

  • Content: Including, but not limited to the following: (5 points)

    • Is the objective well articulated in the presentation?
    • Can the deliverable(s) accomplish the objective?
    • Do(es) the deliverable(s) accomplish the objective?
    • Do figures or tables included in the presentation follow good visualization practices?
    • Are the limitations carefully considered and articulated?

Slides + presentation (peer)

  • Content: Is/are the objective(s) clearly articulated and can the deliverable(s) accomplish it/them? (1 point)
  • Content: Did the team effectively meet the objective(s)? (1 point)
  • Creativity and Critical Thought: Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project? (1 point)
  • Slides: Are the slides well organized, readable, not full of text, featuring figures with legible labels, legends, etc.? (1 point)
  • Professionalism: How well did the team present? Does the presentation appear to be well practiced? Are they reading off of a script? Did everyone get a chance to say something meaningful about the project? (1 point)

Reproducibility + organization

  • All required files are provided. Quarto files render without issues and reproduce the necessary outputs. If building a package, the checks pass. (3 points)
  • If there’s a dataset, it’s provided in a data folder, a codebook is provided, and a local copy of the data file is used where needed. (3 points)
  • Documents are well structured and easy to follow. No extraneous materials. (2 points)
  • All issues are closed, mostly with specific commits addressing them. (2 points)

Late work policy

There is no late work accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.