Lecture 23
Cornell University
INFO 5001 - Fall 2025
November 18, 2025
ae-21
Instructions
Find your ae-21 repo (repo name will be suffixed with your GitHub name).
Run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
🔓 Decrypt the .Renviron.secret → .Renviron
secret.R
info-5001
12_plot-image-1
Instructions
{ellmer} lets you show the model your plots!
Create a basic penguins scatter plot and ask Claude 4 Sonnet to interpret it.
How does it do?
03:00
13_plot-image-2
Instructions
Replace the scatter plot with random noise.
Show this new plot to Claude 4 Sonnet and ask it to interpret it. How does it do this time?
Work with your neighbor to see if you can improve the prompt to get a better answer.
Share your best prompt with the class on this discussion post.
07:00
Did you use the best models?
Did you clearly explain what you want the model to do in the system prompt?
Did you provide examples of what you want?
Short answer: put instructions and background knowledge in the system prompt.
Use LLMs to help draft or improve your prompts.
E.g., this input to Claude’s prompt generator:
Make a data science agent that can run R data analysis code via a tool. Make the agent maniacally focused on data quality issues, such as missing data, misspelled categorical values, inconsistent data types, outlier values, impossible values (like negative physical dimensions), etc.
Generates this prompt:
You are a meticulous data science agent with an obsessive focus on data quality.
You have been given access to an R code execution tool that allows you to run R analysis code.
Your primary mission is to identify and address data quality issues before conducting any requested analysis.
<dataset_description>
{{DATASET_DESCRIPTION}}
</dataset_description>
<analysis_request>
{{ANALYSIS_REQUEST}}
</analysis_request>
You have access to the following function to execute R code:
<function>
<function_name>run_r_code</function_name>
<function_description>Executes R code and returns the output</function_description>
<required_argument>code (str): The R code to execute</required_argument>
<returns>str: The output from executing the R code, including any plots, summaries, or error messages</returns>
<example_call>run_r_code(code="summary(mtcars)")</example_call>
</function>
## Your Data Quality Obsessions
You must be maniacally focused on identifying and documenting these data quality issues:
1. **Missing Data**: Check for NA, NULL, empty strings, spaces-only strings
2. **Data Type Inconsistencies**: Variables that should be numeric but contain text, dates stored as strings, etc.
3. **Categorical Value Issues**: Misspellings, inconsistent capitalization, extra whitespace, similar values that should be the same
4. **Impossible/Illogical Values**: Negative values for physical dimensions, ages over 150, dates in the future when they shouldn't be, etc.
5. **Outliers**: Values that are technically possible but suspiciously extreme
6. **Duplicate Records**: Exact duplicates or near-duplicates that might indicate data entry errors
7. **Inconsistent Formatting**: Mixed date formats, inconsistent decimal places, mixed units
8. **Range Violations**: Values outside expected or logical ranges
## Workflow
1. **MANDATORY FIRST STEP**: Conduct a comprehensive data quality assessment before any analysis
2. **Document Issues**: Create a detailed inventory of all data quality problems found
3. **Propose Solutions**: Suggest specific remediation steps for each issue
4. **Clean Data**: Implement cleaning steps where appropriate
5. **Verify Cleaning**: Confirm that cleaning steps worked as intended
6. **Conduct Analysis**: Only after data quality is addressed, proceed with the requested analysis
7. **Final Validation**: Double-check that results make sense given the data quality context
## R Code Patterns to Use
For your data quality checks, use comprehensive R code such as:
- `summary()`, `str()`, `head()`, `tail()` for initial exploration
- `is.na()`, `complete.cases()` for missing data
- `duplicated()` for duplicate detection
- `table()`, `unique()` for categorical variable inspection
- `range()`, `quantile()` for outlier detection
- `class()`, `typeof()` for data type verification
Use <scratchpad> tags to plan your data quality assessment strategy before executing any code.
Think through what specific issues might be present given the dataset description and what R code you'll need to detect them.
Your final response should include:
1. A comprehensive data quality report with specific issues found
2. The R code used for assessment and cleaning
3. Documentation of any data cleaning steps taken
4. The requested analysis results
5. Caveats about how data quality issues might affect interpretation
Remember: You are OBSESSED with data quality. Do not proceed with analysis until you have thoroughly investigated and documented data quality issues.
If you find serious data quality problems, spend significant effort addressing them before moving to the analysis phase.
Begin your response with <scratchpad> tags to plan your data quality assessment approach, then use function calls to execute R code.
Provide your final comprehensive response covering both data quality findings and the requested analysis.
Get large prompts out of the code and into separate files.
Easier to read (both locally and on GitHub)
Easier to read diffs in version control
We will do this in one of our exercises later
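A minimal sketch of what this can look like in base R, assuming the prompt lives in a Markdown file and uses `{{NAME}}` placeholders like the generated prompt above. The helper `read_prompt()` and the temporary file are hypothetical and only for illustration. ellmer also provides `interpolate_file()` for this kind of templating, so check its documentation before rolling your own.

```r
# Hypothetical helper: read a prompt template from a file and
# fill in {{NAME}} placeholders with the values passed via `...`.
read_prompt <- function(path, ...) {
  values <- list(...)
  prompt <- paste(readLines(path, warn = FALSE), collapse = "\n")
  for (name in names(values)) {
    placeholder <- paste0("{{", name, "}}")
    # fixed = TRUE treats the placeholder as literal text, not a regex
    prompt <- gsub(placeholder, values[[name]], prompt, fixed = TRUE)
  }
  prompt
}

# A temporary file stands in for something like prompts/data-quality.md
prompt_file <- tempfile(fileext = ".md")
writeLines(
  c(
    "You are a meticulous data science agent.",
    "<dataset_description>",
    "{{DATASET_DESCRIPTION}}",
    "</dataset_description>"
  ),
  prompt_file
)

system_prompt <- read_prompt(
  prompt_file,
  DATASET_DESCRIPTION = "Palmer penguins measurements"
)
cat(system_prompt)
```

The resulting string can then be passed as the system prompt when you create the chat object, and the prompt file can be edited and diffed without touching the R code.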
(Advanced) Force the model to say things out loud.
E.g., “Use no more than three rounds of tool calls” => “Before answering, note how many tool calls you have made inside
Anthropic’s Prompt Engineering Overview and the OpenAI Cookbook are excellent and contain lots of tips and examples.
Google’s Prompt Design Strategies may also be useful.
14_quiz-game-1
Instructions
Your job: teach the model to play a quiz game with you:
The user picks a theme from a short list provided by the model.
They then answer multiple choice questions on that theme.
After each question, tell the user if they were right or wrong and why. Then go to the next question.
After 5 questions, end the round and tell the user they won, regardless of their score. Then, start a new round.
Share your best prompt with the class on this discussion post.
12:00
15_coding-assistant
Instructions
Use Claude 3.7 Sonnet to write a function that gets the weather. The first time, use Claude on its own.
Do some basic research for Claude about how to use a specific package to get the weather.
How does Claude do with the same task now?
06:00
Answer: word vector embeddings → turn words into vectors
🤴 - 🧔‍♂️ + 💁‍♀️ = ❓
🤴 - 🧔‍♂️ = 👑
👑 + 💁‍♀️ = 👸
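The emoji arithmetic can be sketched with toy numbers. The three-dimensional vectors below are made up by hand for illustration (real embeddings are learned by a model and have hundreds or thousands of dimensions), and `cosine_similarity()` is a helper defined here, not part of any package.

```r
# Toy, hand-made "embeddings"; the three dimensions loosely stand for
# royalty, masculinity, and femininity. These values are assumptions.
embeddings <- rbind(
  king  = c(0.9, 0.9, 0.1),
  man   = c(0.1, 0.9, 0.1),
  woman = c(0.1, 0.1, 0.9),
  queen = c(0.9, 0.1, 0.9),
  apple = c(0.0, 0.2, 0.2)
)

# Cosine similarity: 1 means the two vectors point in the same direction
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# king - man + woman
target <- embeddings["king", ] - embeddings["man", ] + embeddings["woman", ]

# Compare the result against every word that was not part of the query
candidates <- setdiff(rownames(embeddings), c("king", "man", "woman"))
scores <- sapply(
  candidates,
  function(word) cosine_similarity(target, embeddings[word, ])
)

nearest <- names(which.max(scores))
print(round(scores, 3))
print(nearest)
```

With these toy values the closest remaining word is "queen", which is the same idea as 👑 + 💁‍♀️ = 👸.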
Every prompt you send gets passed through a RAG system and is augmented
The LLM can decide when to call the RAG system
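A rough sketch of the first approach, where every prompt is augmented before it reaches the model. To stay self-contained it swaps embedding similarity for simple word overlap, so it is not how a real vector database ranks text. The helpers `tokenize()`, `retrieve()`, and `augment_prompt()` and the three example chunks are hypothetical.

```r
# Toy document store; a real system would store chunks with embeddings
chunks <- c(
  "Use ggplot() with geom_point() to draw a scatter plot.",
  "Use dplyr::filter() to keep rows that match a condition.",
  "Use readr::read_csv() to import a CSV file."
)

# Lowercase, drop everything except letters, split into unique words
tokenize <- function(x) {
  cleaned <- tolower(gsub("[^A-Za-z ]", " ", x))
  unique(strsplit(cleaned, " +")[[1]])
}

# Rank chunks by the number of words they share with the query
# (a stand-in for similarity search over embeddings)
retrieve <- function(query, chunks, top_k = 1) {
  query_words <- tokenize(query)
  scores <- vapply(
    chunks,
    function(chunk) length(intersect(query_words, tokenize(chunk))),
    integer(1)
  )
  unname(chunks[order(scores, decreasing = TRUE)][seq_len(top_k)])
}

# Paste the retrieved context in front of the user's question
augment_prompt <- function(query, chunks) {
  context <- retrieve(query, chunks)
  paste0(
    "Use this context to answer.\n<context>\n",
    paste(context, collapse = "\n"),
    "\n</context>\n\nQuestion: ", query
  )
}

augmented <- augment_prompt("How do I draw a scatter plot?", chunks)
cat(augmented)
```

In the second approach, `retrieve()` would instead be registered as a tool, and the model would decide whether a given question needs it.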


16_rag
Instructions
Follow the steps in the 16_rag exercise, which are roughly:
First, you’ll create a vector database from R for Data Science (R4DS)
Test out the vector database with a simple query.
Attach a retrieval tool to a chat client and try it in a Shiny app.
15:00
ragnar