Quiz 03

Quiz
Modified: November 17, 2025

Overview

Quiz 03 will be held in class on November 21. It will be a 50-minute, in-person, timed quiz.

The quiz will cover all material through the end of week 12 (programming with LLMs), with an emphasis on content taught since quiz 02. It will consist of a series of short-answer and free-response questions designed to evaluate your understanding of concepts and methods. You may be asked to answer conceptual questions, interpret visualizations, interpret code and output, and/or write code by hand.

Students with SDS accommodations

Registered SDS accommodations related to timed assignments are implemented by the SDS Alternative Testing Program. If you have such accommodations, you will receive separate instructions from SDS about how to take the quiz with them. If you have any questions about your accommodations, please contact SDS directly.

Students with religious or other accommodations

If you require an accommodation for quiz 03 on the basis of religious observances, athletics, military service, or another accommodation listed on the course syllabus, please contact us at soltoffbc@cornell.edu by November 18. Any accommodation requests received after this date are unlikely to be approved.

Rules & Notes

Academic Integrity

  1. A student shall in no way misrepresent his or her work.
  2. A student shall in no way fraudulently or unfairly advance his or her academic position.
  3. A student shall refuse to be a party to another student’s failure to maintain academic integrity.
  4. A student shall not in any other manner violate the principle of academic integrity.
  • This is an individual assignment. Everything in the quiz is for your eyes only.
  • The quiz will be held in person. All responses will be written by hand and submitted on paper.
  • You may not use any electronic devices during the quiz.¹ This includes laptops, tablets, phones, smartwatches, etc.
  • You may not use any physical materials during the quiz. This includes textbooks, notes, calculators, etc. Any required information will be provided in the quiz.

Submission

  • All responses will be submitted on paper using the provided forms. Quizzes will be evaluated and returned to you via Gradescope.

Grading

  • Each quiz is weighted equally. There will be three quizzes in total, so each quiz is worth 5% of your final grade.

Practice problems

Instructions

Below are some practice problems you may complete in order to prepare for the quiz. The suggested solution is hidden below each exercise. Try to solve the problem on your own before looking at the solution.

  1. Given the original dataset, which of the samples are valid bootstraps?

    1. Only sample A
    2. Only sample B
    3. Only sample C
    4. Samples A and B
    5. Samples B and C
    6. None of the samples
    7. All of the samples

    Solution: Samples A and B
  2. What is the primary difference between a bagging model and a random forest model?

    The primary difference is that a random forest adds a layer of randomness: at each split in each tree, it considers only a random subset of the features, whereas a bagging model considers every feature at every split. Both fit each tree to a bootstrap sample of the training data. Restricting the candidate features decorrelates the trees, which reduces the variance of the ensemble's predictions and helps limit overfitting.

  3. Gallup surveyed 1,000 youths² asking them which Heeler cousin from the hit TV show Bluey is their favourite: Bluey, Bingo, Muffin, or Socks.

    This is the confusion matrix for the assessment set using a model predicting individual preferences for Bluey, Bingo, Muffin, or Socks based on other demographic features collected in the survey. Explain what has happened to the model, what might be a contributing factor to these results, and propose a solution to improve the model’s performance.

    The class imbalance in the data is a contributing factor to these results. The model is predicting Bingo as the favourite cousin for all respondents, which is likely due to the overwhelming majority of youths preferring Bingo in the dataset. To resolve this issue, we could use techniques such as oversampling the minority classes (Bluey, Muffin, and Socks) or undersampling all classes except for Socks so that the classes in the training data are balanced.

  4. Which of the following feature engineering steps is not required for fitting a nearest neighbors model?

    1. Normalizing all quantitative variables to the same mean and variance.
    2. Converting nominal predictors to quantitative variables.
    3. Removing all features with zero variance.
    4. Downsampling the training set so there is an equal number of observations for each class.

    Solution: Downsampling the training set so there is an equal number of observations for each class.
  5. What is a node in a decision tree?

    A node in a decision tree represents a decision point or a condition based on the features of the data.

    There are two main types of nodes:

    1. Decision (Internal) Node: This node splits the data based on a condition or rule (e.g., “Is age > 30?”). The data is divided into subsets based on the feature’s value at this node, and this process continues recursively in the tree.

    2. Leaf (Terminal) Node: This node provides the final outcome or prediction. It represents the class label (in classification) or a predicted value (in regression) for the data points that fall into that node.

    In a decision tree, internal nodes represent decisions or tests, while leaf nodes provide the final prediction or classification.

  6. Suppose you are working on a credit risk model for a bank. The goal is to predict whether a loan applicant will default (1) or not (0) based on features such as credit score, income, and debt-to-income ratio. After training your logistic regression model, you find that the training ROC AUC is significantly higher than the test ROC AUC. Explain why this discrepancy might occur.

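    The discrepancy can be attributed to overfitting: the model has learned patterns specific to the training data, including noise, that do not generalize to new applicants, so it performs better on the data it was fit to than on held-out test data.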

  7. What is the purpose of set_engine()?

    The set_engine() function specifies the computational engine, that is, the underlying software package or implementation, used to fit a model in the {tidymodels} framework. Multiple engines may be available for the same type of model, and set_engine() lets you choose among them.
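
As a complement to practice problem 1, the conditions that make a sample a valid bootstrap can be checked in a few lines of base R. This is only a minimal sketch: the `original` vector and the helper `is_valid_bootstrap()` are hypothetical and are not the dataset or samples from the problem. It assumes a bootstrap sample has the same number of observations as the original data and contains only observations drawn, with replacement, from the original data.

```r
# Hypothetical original dataset (not the one shown in the practice problem)
original <- c(2, 4, 6, 8, 10)

# Hypothetical helper: a candidate is treated as a valid bootstrap sample if
# it is the same size as the original and every value comes from the original
is_valid_bootstrap <- function(candidate, original) {
  length(candidate) == length(original) && all(candidate %in% original)
}

# Draw one bootstrap sample: same size as the original, with replacement
set.seed(123)
boot <- sample(original, size = length(original), replace = TRUE)

is_valid_bootstrap(boot, original)                # TRUE
is_valid_bootstrap(c(2, 2, 4, 10, 10), original)  # TRUE: repeats are allowed
is_valid_bootstrap(c(2, 4, 6), original)          # FALSE: too few observations
is_valid_bootstrap(c(2, 4, 6, 8, 12), original)   # FALSE: 12 is not in the original
```

Repeated values are expected because the draws are made with replacement, so some original observations may appear more than once and others not at all.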

Footnotes

  1. Students with certain SDS accommodations are permitted to use a computer.

  2. Not really - I’m making this up for the purposes of the exam.