Welcome to INFO 5001

Lecture 1

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2025

August 26, 2025

Agenda

  • Intros
  • What is data science?
  • Software
  • Application exercise
  • Course overview
  • This week’s tasks

Learning objectives

  • Introduce the course staff
  • Define data science and how it will be taught in this course
  • Review course policies

Staff intros

Meet the instructor

Dr. Benjamin Soltoff

Associate Teaching Professor in Information Science

284 CIS Building

Headshot of Dr. Benjamin Soltoff

Meet the course team

  • Philan T.
  • Paul V.
  • Steven X.

Meet each other!

Physically interact with at least 2 people sitting around you. Introduce yourselves to each other and share:

What is data science?

  • Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge.

    [A]n interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains

  • This is a course on computing applications for data science workflows

Data science life cycle

Import

Data science life cycle, with import highlighted

Tidy + transform

Data science life cycle, with tidy and transform highlighted

Visualize

Data science life cycle, with visualize highlighted

Model

Data science life cycle, with model highlighted

Understand

Data science life cycle, with understand highlighted

Communicate

Data science life cycle, with communicate highlighted

Understand + communicate

Data science life cycle, with understand and communicate highlighted

Program

Data science life cycle, with program highlighted
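As a rough illustration of how the stages chain together, here is a minimal sketch in Python using only the standard library. The course itself works in R; the inline data, column names, and function names below are invented for illustration and are not course materials. The sketch covers import, tidy, transform, and communicate only; visualization and modeling are left out.

```python
import csv
import io
import statistics

# Hypothetical toy data; the values are invented for illustration only.
RAW_CSV = """country,year,score
A,2020,1.0
A,2021,2.0
B,2020,
B,2021,4.0
"""


def import_data(text):
    """Import: read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def tidy(rows):
    """Tidy: drop rows with a missing score and convert column types."""
    return [
        {"country": r["country"], "year": int(r["year"]), "score": float(r["score"])}
        for r in rows
        if r["score"] != ""
    ]


def transform(rows):
    """Transform: compute the mean score per country."""
    by_country = {}
    for r in rows:
        by_country.setdefault(r["country"], []).append(r["score"])
    return {country: statistics.mean(scores) for country, scores in by_country.items()}


def communicate(summary):
    """Communicate: format the summary as human-readable lines."""
    return [f"{country}: mean score {mean:.2f}" for country, mean in sorted(summary.items())]


# Program: each stage feeds the next.
report = communicate(transform(tidy(import_data(RAW_CSV))))
print("\n".join(report))
```

Keeping each stage as its own small function mirrors the diagram: any one step can be inspected or replaced without touching the others.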

How we will do this

Excel - not…

An Excel window with data about countries

ChatGPT - no

A ChatGPT window with a conversation about data analysis

Alteryx - closer…

An Alteryx window with a workflow

R and Positron

R logo

  • R is an open-source statistical programming language
  • R is also an environment for statistical computing and graphics

Positron logo

  • Positron is a free, data science IDE for R and Python
  • “Soft-fork” of VS Code, with a focus on data science
  • Another popular IDE is RStudio

Major differences between R and Python

 | R | Python
Syntax | Functional language | Object-oriented language
Statistical learning | Developed by statisticians for statistical analysis | Meh
Machine learning | | {scikit-learn}
Deep learning | |
Visualization | {ggplot2} | {matplotlib} + others
Package management | CRAN | pip/virtualenv/PyPI/Anaconda/uv
Speed | Somewhat slower | Somewhat faster
Community | Academia and industry | Larger (general-purpose programming language)
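To make the syntax comparison concrete, the sketch below writes the same calculation twice in Python: once as plain functions applied to data (closer to how R code is typically written) and once as a method on an object. The names `to_celsius`, `mean_of`, and `Temperatures` are hypothetical, and the contrast is a simplification, since both languages support both styles.

```python
from statistics import mean


# Functional style: plain functions take data in and return new data.
def to_celsius(fahrenheit_values):
    return [(f - 32) * 5 / 9 for f in fahrenheit_values]


def mean_of(values):
    return mean(values)


# Object-oriented style: the data and its behavior live together in a class.
class Temperatures:
    def __init__(self, fahrenheit_values):
        self.fahrenheit_values = list(fahrenheit_values)

    def mean_celsius(self):
        celsius = [(f - 32) * 5 / 9 for f in self.fahrenheit_values]
        return mean(celsius)


readings = [32, 50, 68]  # made-up example values in degrees Fahrenheit

functional_result = mean_of(to_celsius(readings))       # functions wrapped around data
object_result = Temperatures(readings).mean_celsius()   # method called on an object

print(functional_result, object_result)
```

Both calls compute the same mean; the difference is only in where the behavior lives.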

tidyverse

Hex logos for dplyr, ggplot2, forcats, tibble, readr, stringr, tidyr, and purrr

tidyverse.org

  • The {tidyverse} is an opinionated collection of R packages designed for data science
  • All packages share an underlying philosophy and a common grammar

Quarto

  • Fully reproducible documents – each time you render, the analysis is run from the beginning
  • Code goes in chunks – narrative goes outside of chunks
  • A visual editor for a familiar, Google Docs-like editing experience
  • Similar (but IMO superior) to Jupyter Notebooks
  • Supports Python, R, Julia, Observable natively

Example output with title (ggplot2 demo), author (Norah Jones), and date (5/22/2021). Below is a header reading Air Quality followed by body text (Figure 1 further explores the impact of temperature on ozone level.) with a toggleable code field, and figure with caption Figure 1 Temperature and ozone level.

How will we use Quarto?

  • Every assignment / report / project / etc. is a Quarto document
  • You’ll always have a template Quarto document to start with
  • The amount of scaffolding in the template will decrease over the semester

Version control with Git

Git and GitHub

Git logo

  • Git is a version control system – like the “Track Changes” feature in Microsoft Word, on steroids
  • It’s not the only version control system, but it’s a very popular one

GitHub logo

  • GitHub is the home for your Git-based projects on the internet – like Dropbox but much, much better

  • We will use GitHub (Enterprise) as a platform for web hosting and collaboration

Versioning

with human-readable messages

How we use Git and GitHub

Let’s dive in!

Application exercise

Or more like a demo for today…

📋 info5001.infosci.cornell.edu/ae/ae-00-unvotes.html

Course overview

Homepage

https://info5001.infosci.cornell.edu/

  • All course materials
  • Links to Canvas, GitHub, Posit Workbench, etc.
  • Let’s take a tour!

Course toolkit

All linked from the course website:

Important

Make sure you can access RStudio before class on Thursday.

Activities: Prepare, Participate, Practice, Perform

  • Prepare: Introduce new content and prepare for lectures by completing the readings

  • Participate: Attend and actively participate in lectures and labs, office hours, team meetings

  • Practice: Practice applying computational techniques with application exercises during lecture, graded for completion

  • Perform: Put together what you’ve learned to analyze real-world data

    • Homework assignments x 8-10(-ish) (individual)
    • Monthly quizzes
    • Final exam
    • Team project

Activities: Participate

Preparing for and participating in class

Not preparing for class, not actively participating

Cadence

  • Application exercises: Complete by the end of the day
  • HWs: Posted Friday morning, due the following Wednesday at 11:59pm
  • Quizzes: Completed during Friday lab
  • Exam: Written, in-person during finals week
  • Project: Deadlines throughout the semester, with some lab time dedicated to working on them, and most work done in teams outside of class

Grading

Category | Percentage
Project | 30%
Homework | 25%
Exam | 20%
Quizzes | 15%
Application Exercises | 10%

See course syllabus for how the final letter grade will be determined.
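As a rough sketch of how the percentages above combine, the Python snippet below computes a weighted numeric score. It assumes each category is scored from 0 to 100 and that the overall score is a plain weighted average. The function name `weighted_score` and the example scores are hypothetical, and the syllabus, not this sketch, governs how letter grades are actually determined.

```python
# Category weights from the grading table, as fractions of the final score.
WEIGHTS = {
    "Project": 0.30,
    "Homework": 0.25,
    "Exam": 0.20,
    "Quizzes": 0.15,
    "Application Exercises": 0.10,
}


def weighted_score(scores):
    """Return the weighted average of per-category scores (each assumed 0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(WEIGHTS[category] * scores[category] for category in WEIGHTS)


# Hypothetical scores, invented for illustration.
example_scores = {
    "Project": 90,
    "Homework": 80,
    "Exam": 70,
    "Quizzes": 100,
    "Application Exercises": 100,
}

overall = weighted_score(example_scores)
print(round(overall, 1))
```

With these made-up scores the contributions are 27 + 20 + 14 + 15 + 10, so the overall score comes to about 86.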

15-minute rule

Support

  • Attend office hours
  • Ask and answer questions on the discussion forum
  • Use Beebe for generative AI assistance with course content
  • Reserve email for questions on personal matters and/or grades
  • Read the course support page

Diversity + inclusion

  • I want you to feel like you belong in this class and are respected
  • We are committed to full inclusion in education for all persons
  • If you feel that we have failed these goals, please either let us know or report it, and we will address the issue

Accessibility

I want this course to be accessible to students of all abilities. Please feel free to let me know if there are circumstances affecting your ability to participate in class.

Course policies

Late work, waivers, regrades policy

  • We have policies!
  • Read about them in the course syllabus and refer back to them when you need to

Collaboration policy

  • Only work that is clearly assigned as team work should be completed collaboratively.

  • Homework assignments must be completed individually. You may not directly share answers / code with others; however, you are welcome to discuss the problems in general and ask for advice.

  • Quizzes and exams must be completed individually. You may not discuss any aspect of these assignments with peers until the grades are posted.

Sharing / reusing code policy

  • We are aware that a huge volume of code is available on the web, and many tasks may have solutions posted

  • Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism, regardless of source

  • All code must be written by you, the human being

Generative AI

Academic integrity

  1. A student shall in no way misrepresent his or her work.
  2. A student shall in no way fraudulently or unfairly advance his or her academic position.
  3. A student shall refuse to be a party to another student’s failure to maintain academic integrity.
  4. A student shall not in any other manner violate the principle of academic integrity.

Most importantly!

Ask if you’re not sure if something violates a policy!

Before Thursday