Quiz 03

Quiz

Modified: November 19, 2025

Overview

Quiz 03 will be held in class on November 21. It will be a 50-minute, in-person, timed quiz.

The quiz will cover all material through the end of week 12 (programming with LLMs), with an emphasis on content taught since quiz 02. It will consist of a series of short answer and free response questions. Questions are designed to evaluate your understanding of concepts and methods. You may be asked to answer conceptual questions, interpret visualizations, interpret code and output, and/or write code by hand.

Students with SDS accommodations

For students who have registered SDS accommodations related to timed assignments, those accommodations are implemented by the SDS Alternative Testing Program. You will receive separate instructions from SDS about how to take the quiz with your accommodations. If you have any questions about your accommodations, please contact SDS directly.

Students with religious or other accommodations

If you require an accommodation for quiz 03 on the basis of religious observances, athletics, military service, or another accommodation listed on the course syllabus, please contact us at soltoffbc@cornell.edu by November 18. Any accommodation requests received after this date are unlikely to be approved.

Rules & Notes

Academic Integrity

A student shall in no way misrepresent his or her work.
A student shall in no way fraudulently or unfairly advance his or her academic position.
A student shall refuse to be a party to another student’s failure to maintain academic integrity.
A student shall not in any other manner violate the principle of academic integrity.

Source: Cornell University Code of Academic Integrity

This is an individual assignment. Everything in the quiz is for your eyes only.
The quiz will be held in person. All responses will be written by hand and submitted on paper.
You may not use any electronic devices during the quiz.¹ This includes laptops, tablets, phones, smartwatches, etc.
You may not use any physical materials during the quiz. This includes textbooks, notes, calculators, etc. Any required information will be provided in the quiz.

Submission

All responses will be submitted on paper using the provided forms. Quizzes will be evaluated and returned to you via Gradescope.

Grading

Each quiz is weighted equally. There will be three quizzes in total, so each quiz is worth 5% of your final grade.

Practice problems

Instructions

Below are some practice problems you may complete to prepare for the quiz. The suggested solution follows each exercise. Try to solve the problem on your own before looking at the solution.

Given the original dataset, which of the samples are valid bootstraps?
1. Only sample A
2. Only sample B
3. Only sample C
4. Samples A and B
5. Samples B and C
6. None of the samples
7. All of the samples
Suggested solution
4. Samples A and B
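As a rough illustration of what makes a sample a valid bootstrap, the sketch below checks the two usual conditions: the sample has the same number of observations as the original data, and every value is drawn (with replacement) from the original. It is written in Python purely for illustration. The function name `is_valid_bootstrap` and all of the data are hypothetical and are not the dataset or samples from the question.

```python
def is_valid_bootstrap(original, sample):
    """Return True if `sample` could be a bootstrap resample of `original`."""
    # A bootstrap sample has the same number of observations as the original.
    if len(sample) != len(original):
        return False
    # Every value must come from the original data. Repeats are allowed
    # because bootstrap sampling is done with replacement.
    allowed = set(original)
    return all(value in allowed for value in sample)


# Hypothetical data, not the dataset shown in the quiz question.
original = [2, 4, 6, 8]
sample_a = [2, 2, 6, 8]  # same size, one value repeated
sample_b = [8, 6, 4, 2]  # same size, every value drawn once
sample_c = [2, 4, 6]     # too few observations
sample_d = [2, 4, 6, 9]  # contains 9, which is not in the original

for name, sample in [("a", sample_a), ("b", sample_b),
                     ("c", sample_c), ("d", sample_d)]:
    print(name, is_valid_bootstrap(original, sample))
```

Under these assumptions, samples a and b pass the check, while c and d fail it.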
What is the primary difference between a bagging model and a random forest model?

Suggested solution

Both models fit many decision trees to bootstrap samples of the training data. The primary difference is that a bagging model considers every feature at each split, whereas a random forest introduces an additional layer of randomness by considering only a random subset of features at each split. This helps to decorrelate the trees in the forest, which reduces the variance of the combined prediction and helps limit overfitting.
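The difference at a single split can be sketched as follows. This is a minimal Python illustration, not how any particular library implements it. The helper `candidate_features`, the parameter name `mtry`, and the feature names are assumptions made for the example.

```python
import random


def candidate_features(features, mtry=None, rng=random):
    """Return the features a tree may consider at one split.

    mtry=None mimics bagging (all features are candidates);
    an integer mimics a random forest (a random subset of that size).
    """
    if mtry is None:
        return list(features)
    return rng.sample(list(features), mtry)


# Hypothetical feature names.
features = ["credit_score", "income", "debt_to_income", "age"]
rng = random.Random(42)  # seeded only so the sketch is repeatable

bagging_split = candidate_features(features)
forest_split = candidate_features(features, mtry=2, rng=rng)

print("bagging considers:", bagging_split)
print("random forest considers:", forest_split)
```

Because each split in a random forest sees a different subset, a single strong predictor cannot dominate every tree, which is what decorrelates the trees.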
Gallup surveyed 1,000 youths² asking them which Heeler cousin from the hit TV show Bluey is their favourite: Bluey, Bingo, Muffin, or Socks.

This is the confusion matrix for the assessment set using a model predicting individual preferences for Bluey, Bingo, Muffin, or Socks based on other demographic features collected in the survey. Explain what has happened to the model, what might be a contributing factor to these results, and propose a solution to improve the model’s performance.

Suggested solution

The model predicts Bingo as the favourite cousin for every respondent. Class imbalance is a likely contributing factor: an overwhelming majority of youths in the dataset prefer Bingo, so the model can score well on accuracy without ever predicting the other cousins. To resolve this issue, we could oversample the minority classes (Bluey, Muffin, and Socks) or undersample all classes except for Socks so that the classes in the dataset are balanced.
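A minimal sketch of random oversampling, written in Python for illustration only. The class counts are invented and the helper `oversample` is hypothetical. In practice the resampling would be applied to the training data only, never to the assessment set.

```python
import random
from collections import Counter


def oversample(labels, rng):
    """Return row indices that give every class as many rows as the largest class."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    target = max(len(rows) for rows in by_class.values())

    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        # Draw the extra rows with replacement from the same class.
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced


# Hypothetical, heavily imbalanced training labels.
labels = ["Bingo"] * 8 + ["Bluey"] * 2 + ["Muffin"] * 1 + ["Socks"] * 1
rng = random.Random(0)  # seeded only so the sketch is repeatable

indices = oversample(labels, rng)
counts = Counter(labels[i] for i in indices)
print(counts)
```

After oversampling, each class contributes the same number of rows, so always predicting the majority class no longer scores well on the training data.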
Which of the following feature engineering steps is not required for fitting a nearest neighbors model?
1. Normalizing all quantitative variables to the same mean and variance.
2. Converting nominal predictors to quantitative variables.
3. Removing all features with zero variance.
4. Downsampling the training set so there are an equal number of observations for each class.
Suggested solution
4. Downsampling the training set so there are an equal number of observations for each class.
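To see why normalization, unlike downsampling, is needed for a nearest neighbors model, the sketch below compares Euclidean distances before and after standardizing two features measured on very different scales. It is a Python illustration with invented data, and the helper names are hypothetical.

```python
import math
from statistics import mean, pstdev


def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    m, s = mean(column), pstdev(column)
    return [(x - m) / s for x in column]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(points, query_index):
    """Index of the point closest to points[query_index], excluding itself."""
    query = points[query_index]
    others = [i for i in range(len(points)) if i != query_index]
    return min(others, key=lambda i: euclidean(query, points[i]))


# Hypothetical observations: income in dollars, age in years.
income = [30000, 31000, 40000, 120000]
age = [25, 60, 26, 45]

raw = list(zip(income, age))
scaled = list(zip(standardize(income), standardize(age)))

print("nearest to row 0 (raw):   ", nearest(raw, 0))
print("nearest to row 0 (scaled):", nearest(scaled, 0))
```

On the raw data, income dominates the distance, so row 1 (similar income, very different age) is the nearest neighbor of row 0. After standardizing, both features contribute comparably and row 2 becomes the nearest neighbor.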
What is a node in a decision tree?
Suggested solution
A node in a decision tree represents a decision point or a condition based on the features of the data.

There are two main types of nodes:
1. Decision (Internal) Node: This node splits the data based on a condition or rule (e.g., “Is age > 30?”). The data is divided into subsets based on the feature’s value at this node, and this process continues recursively in the tree.
2. Leaf (Terminal) Node: This node provides the final outcome or prediction. It represents the class label (in classification) or a predicted value (in regression) for the data points that fall into that node.
In a decision tree, internal nodes represent decisions or tests, while leaf nodes provide the final prediction or classification.
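The two node types can be sketched as a small data structure. This is a minimal Python illustration under assumed names (`Decision`, `Leaf`, `predict`) with an invented tree. It is not how any particular modeling library represents its trees.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """Terminal node: holds the final prediction."""
    prediction: str


@dataclass
class Decision:
    """Internal node: tests one feature against a threshold."""
    feature: str
    threshold: float
    if_true: "Node"   # followed when row[feature] > threshold
    if_false: "Node"  # followed otherwise


Node = Union[Leaf, Decision]


def predict(node, row):
    """Walk from the root to a leaf, applying one test per decision node."""
    while isinstance(node, Decision):
        if row[node.feature] > node.threshold:
            node = node.if_true
        else:
            node = node.if_false
    return node.prediction


# Hypothetical tree: "Is age > 30?" at the root, then "Is income > 50000?".
tree = Decision(
    "age", 30,
    if_true=Leaf("yes"),
    if_false=Decision("income", 50000, if_true=Leaf("yes"), if_false=Leaf("no")),
)

print(predict(tree, {"age": 45, "income": 20000}))
print(predict(tree, {"age": 22, "income": 20000}))
```

Each observation follows exactly one path from the root to a leaf, and the leaf it lands in supplies the prediction.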
Suppose you are working on a credit risk model for a bank. The goal is to predict whether a loan applicant will default (1) or not (0) based on features such as credit score, income, and debt-to-income ratio. After training your logistic regression model, you find that the training ROC AUC is significantly higher than the test ROC AUC. Explain why this discrepancy might occur.

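Suggested solution

The discrepancy between training and test ROC AUC can be attributed to overfitting: the model has fit patterns specific to the training data, including noise, that do not generalize to applicants it has not seen.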
What is the purpose of set_engine()?

Suggested solution

The set_engine() function specifies the computational engine used to fit a model in the {tidymodels} framework, that is, the underlying package or software that performs the estimation. Multiple engines may be available for the same model, so set_engine() lets you choose among different implementations.
What is the distinction between the user interface (UI) and server components in a Shiny application? Describe the role of each component.

Suggested solution

The UI component defines the visual layout and structure of the application that users see and interact with in their web browser. It specifies input controls (such as sliders, dropdown menus, text boxes) and output placeholders (such as plots, tables, or text displays).

The server component contains the R code that runs on the server to process user inputs and generate outputs. It implements the reactive logic that responds to changes in user inputs and produces the corresponding outputs to be displayed in the UI. The server receives input values from the UI and sends computed results back to update the UI displays.
What is the difference between an LLM provider and an LLM model? Provide one example of each.
Suggested solution
- LLM Provider: A company or organization that hosts and provides access to LLM services through APIs. Examples include OpenAI, Anthropic, Google, or AWS.
- LLM Model: A specific trained language model with particular capabilities, size, and performance characteristics. Examples include GPT-4, Claude 3.5 Sonnet, Gemini, or Llama 3.
A single provider may offer multiple models (e.g., OpenAI provides GPT-3.5, GPT-4, GPT-4o), and choosing the appropriate model depends on factors like task complexity, cost, speed, and capabilities.

Footnotes

1. Students with certain SDS accommodations are permitted to use a computer.
2. Not really - I’m making this up for the purposes of the exam.