Lecture 2
Cornell University
INFO 5001 - Fall 2023
2023-08-23
Q - Is this an intro CS course?
A - No – we assume you have completed the equivalent of CS 1110/1112
Q - What data science background does this course assume?
A - None! Sort of…
Q - Is this an intro stat course?
A - No. We presume you have taken one or more undergraduate statistics courses.
While statistics \(\ne\) data science, they are very closely related and have a tremendous amount of overlap.
Q - What computing language will we learn?
A - R.
Q - Why not language X?
A - Come meet with me during office hours and we can talk about it!
Course operation
Doing data science
By the end of the semester, you will…
What does it mean for a data analysis to be “reproducible”?
Near-term goals:
Long-term goals:
Packages: Fundamental units of reproducible R code, including reusable R functions, the documentation that describes how to use them, and sample data
As of January 2023, there are over 19,000 R packages available on CRAN (the Comprehensive R Archive Network)
We’re going to work with a small (but important) subset of these!
R is a functional language
Functions are (most often) verbs, followed by what they will be applied to in parentheses:
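A minimal sketch of the verb-then-object pattern, using only functions that ship with base R. The vector `scores` and its values are made up for illustration and do not come from the course materials.

```r
# Pattern: verb(object, further_arguments)
scores <- c(82, 91, 77)  # hypothetical values, for illustration only

mean_score    <- mean(scores)           # "take the mean of scores"
sorted_scores <- sort(scores)           # "sort scores" (ascending by default)
rounded_mean  <- round(mean_score, 1)   # "round mean_score to 1 decimal place"
```

Reading each call aloud as "verb the object" is often a quick way to check what a line of R is meant to do.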
Packages are installed with the install.packages() function and loaded with the library() function, once per session:
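As a rough sketch of the install-then-load workflow: the package name "tidyverse" below is an assumed example (the slides do not name one), and the install line is commented out so the block runs without a network connection. The stats package ships with R, so it stands in for a package that is already installed.

```r
# Install once per machine (commented out so the sketch runs offline).
# install.packages("tidyverse")  # "tidyverse" is an assumed example package

# Load once per R session. "stats" ships with R, so it is always available.
library(stats)

# After loading, the package's functions can be called directly.
middle_value <- median(c(3, 1, 2))
```

Installing downloads the package to your library on disk, while library() only attaches it for the current session, which is why the second step has to be repeated each time R restarts.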
It is not accurate to say that one programming language is inherently “better” than another, as the choice of language often depends on the specific use case and the individual’s personal preferences. However, R and Python both have their own strengths and weaknesses.
R is particularly well-suited for data analysis and visualization, and it has a large number of libraries and packages specifically designed for these tasks. R’s syntax is also designed to make it easy to manipulate and analyze data.
Python, on the other hand, is a general-purpose programming language that is widely used in a variety of fields, including web development, machine learning, and scientific computing. It has a large and active community that maintains a wide variety of libraries and packages for many different tasks. Python’s simple and easy-to-learn syntax makes it a popular choice for beginners.
Ultimately, the choice between R and Python will depend on the specific task you are trying to accomplish and your personal preferences as a developer. Both languages are powerful and have a lot to offer, and many data scientists use both languages in their work.
| | R | Python |
|---|---|---|
| Syntax | Functional language | Object-oriented language |
| Statistical learning | Developed by statisticians for statistical analysis | Meh |
| Machine learning | | |
| Visualization | ggplot2 | matplotlib + others |
| Package management | CRAN | pip/virtualenv/PyPI/Anaconda |
| Speed | Somewhat slower | Somewhat faster |
| Community | Academia and industry | Larger (general-purpose programming language) |
Important
The environment of your Quarto document is separate from the Console!
Remember this, and expect it to bite you a few times as you’re learning to work with Quarto!
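The separation can be pictured with a small analogy in plain R. This is not how Quarto is implemented. Two unrelated environments simply stand in for the Console and for a rendering document, and the names `console_env`, `render_env`, and `movie_count` are hypothetical.

```r
# Analogy only: two separate environments stand in for the Console and
# for a Quarto document being rendered. All names here are hypothetical.
console_env <- new.env(parent = baseenv())
render_env  <- new.env(parent = baseenv())

# Create an object "in the Console".
assign("movie_count", 100, envir = console_env)

# The object exists where it was created...
in_console <- exists("movie_count", envir = console_env, inherits = FALSE)

# ...but the "document" has never heard of it.
in_document <- exists("movie_count", envir = render_env, inherits = FALSE)
```

In practice this means that anything a document needs (loaded packages, imported data, intermediate objects) should be created by code inside the document itself, not typed into the Console beforehand.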
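In order to pass the Bechdel test, a movie must have at least two named women characters who talk to each other about something other than a man.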
ae-00-bechdel-quarto
Find ae-00-bechdel-quarto and clone the repo to RStudio Workbench.
Open bechdel.qmd, review the document, and fill in the blanks.
Warning
ae-00-bechdel-quarto is hosted on GitHub.com because we have not configured your authentication method for Cornell’s GitHub. We will do this tomorrow in lab.
GitHub is the home for your Git-based projects on the internet – like Dropbox but much, much better
We will use GitHub (Enterprise) as a platform for web hosting and collaboration