Lecture 1
Cornell University
INFO 5001 - Fall 2024
August 27, 2024
Dr. Benjamin Soltoff
Lecturer in Information Science
Gates Hall 216
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge.
[A]n interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains.
We’re going to learn to do this in a tidy way – more on that later!
This is a course on computing applications for data science workflows
Illustration credit: R for Data Science
Feature | R | Python |
---|---|---|
Syntax | Functional language | Object-oriented language |
Statistical learning | Developed by statisticians for statistical analysis | Meh |
Machine learning | | scikit-learn |
Deep learning | | |
Visualization | ggplot2 | matplotlib + others |
Package management | CRAN | pip/virtualenv/PyPI/Anaconda |
Speed | Somewhat slower | Somewhat faster |
Community | Academia and industry | Larger (general-purpose programming language) |
GitHub is the home for your Git-based projects on the internet – like Dropbox but much, much better
We will use GitHub (Enterprise) as a platform for web hosting and collaboration
Or more like a demo for today…
Generated by DALL·E
https://info5001.infosci.cornell.edu/
All linked from the course website:
GitHub organization: github.coecis.cornell.edu/info5001-fa24
RStudio: use the Workbench at rstudio-workbench.infosci.cornell.edu
Communication: GitHub Discussions
Assignment submission and feedback: Gradescope
Important
Make sure you can access RStudio before class on Thursday.
Prepare: Introduce new content and prepare for lectures by completing the readings
Participate: Attend and actively participate in lectures and labs, office hours, team meetings
Practice: Practice applying statistical concepts and computing with application exercises during lecture, graded for completion
Perform: Put together what you’ve learned to analyze real-world data
Category | Percentage |
---|---|
Homework | 30% |
Project | 30% |
Labs | 15% |
Exam | 15% |
Application Exercises | 10% |
See course syllabus for how the final letter grade will be determined.
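To make the weights concrete, here is a minimal sketch of how the category percentages above could combine into a single weighted score. The function name `weighted_score` and the example scores are hypothetical, and the sketch assumes each category is scored from 0 to 100. It does not reproduce how the final letter grade is determined; that is defined in the syllabus.

```python
# Category weights taken from the grading table (they sum to 100%).
WEIGHTS = {
    "Homework": 0.30,
    "Project": 0.30,
    "Labs": 0.15,
    "Exam": 0.15,
    "Application Exercises": 0.10,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-category scores (0-100) into one weighted percentage."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example_scores = {
    "Homework": 90.0,
    "Project": 85.0,
    "Labs": 100.0,
    "Exam": 80.0,
    "Application Exercises": 100.0,
}

print(round(weighted_score(example_scores), 2))
```

With these made-up scores the weighted total comes to 89.5: homework and the project contribute 27 and 25.5 points, labs 15, the exam 12, and application exercises 10.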
I want this course to be accessible to students of all abilities. Please feel free to let me know if there are circumstances affecting your ability to participate in class.
Only work that is clearly assigned as team work should be completed collaboratively.
Homework must be completed individually. You may not directly share answers or code with others; however, you are welcome to discuss the problems in general and ask for advice.
Exams must be completed individually. You may not discuss any aspect of the exam with peers.
We are aware that a huge volume of code is available on the web, and many tasks may have solutions posted
Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism, regardless of source
All code must be written by you, the human being
Use generative AI to facilitate, rather than hinder, learning
✅ GAI tools for reference purposes
❌ GAI tools for writing code/analysis
❌ GAI tools for narrative
You are ultimately responsible for the work you turn in; it should reflect your understanding of the course content
Ask if you’re not sure if something violates a policy!