AI Activity 2: Tidy vs Untidy Data: When Rules Are Not Enough
NYC Airbnb 2019 neighborhood summary dataset
Tidy data rules are helpful, but real data work often requires choices. There can be more than one reasonable tidy format, depending on the question you want to answer.
You will use an AI tool as a learning partner to investigate a core data science reality: Tidy rules tell you what tidy looks like, but they do not tell you which tidy format is best for a specific analysis goal.
You are graded on how you think, critique, and explain, not on whether AI gives perfect answers.
By the end of this activity, your group will do four things:
- Explain why the provided dataset is not tidy by pointing to specific evidence in the file.
- Create two different tidy versions of the same data:
  - Tidy version A, which is best for modeling or tabular summaries.
  - Tidy version B, which is best for visualization across multiple metrics.
- Show evidence that reshaping did not change the meaning of the data by completing at least two sanity checks.
- Choose which tidy version is better for your investigation question and justify your choice.
Your grade depends on your reasoning, evidence, and explanation, not on whether AI output sounds correct.
- Download the dataset from the Dataset section below.
- Write your investigation question (2 to 3 sentences). Step 1 in Tasks.
- Run 3 to 5 AI prompts that cover tidy criteria, reshaping, and verification. Step 2 in Tasks.
- Reshape the data and complete checklist items 1 to 4 first. Step 4 in Tasks.
- Draft your synthesis, then build slides using the required 5-slide structure. Steps 5 and 6 in Tasks.
Assigned Roles (3 students)
[NOTE:] Each student has a designated role for accountability. However, teammates are encouraged to collaborate, support one another, and learn together in order to produce high-quality, cohesive work.
Prompt Engineer
Responsibilities
- Create 3 to 5 purposeful AI prompts
- Save AI responses and build the AI Interaction Log
- Write annotations for each prompt and response pair
Data Science Auditor
Responsibilities
- Identify hidden assumptions in AI guidance
- Evaluate whether AI produced a tidy result that matches the investigation question
- Design and run verification checks after reshaping
- Provide evidence for at least two meaningful choices or limitations
Synthesizer
Responsibilities
- Write the Human Authored Synthesis in clear course language
- Build slides using the required structure
- Ensure the final work is consistent and concise
Dataset
Required file
Dataset notes
What you will submit
- Group submissions
  - AI Interaction Log
  - Human Authored Synthesis
  - Slides for a 12 to 15 minute presentation
- Individual submission
  - Individual Reflection (150 to 200 words)
Step by step tasks
Step 1 Define your investigation question (before using AI)
Write 2 to 3 sentences that answer these four questions:
- What is the analysis task you want to do (choose one: a summary table, a visualization, or a model input)
- What is the main outcome you care about (examples: mean_price, median_minimum_nights, mean_availability_365, mean_reviews_per_month, n_listings)
- What comparison do you want to make (examples: compare room types within a neighborhood, compare neighborhoods within a borough, compare multiple metrics for one room type)
- Why your task requires a specific tidy format
Your question must be specific enough that you can argue for two different tidy representations and then choose one.
Example investigation questions you may use or adapt:
“I want to make a plot that compares mean_price across room_type within each neighbourhood_group. I think tidy version A is better because it gives one row per neighbourhood_group, neighbourhood, and room_type with a mean_price column ready to plot.”
“I want to compare multiple metrics for each room_type in one visualization. I think tidy version B is better because it has a metric column and a value column, which makes it easy to facet or color by metric.”
“I want to prepare the data for a simple model where mean_price is predicted by room_type and neighbourhood_group. I think tidy version A is better because each row is one observation unit with clear predictor columns.”
Step 2 Use AI strategically (3 to 5 prompts)
[Note:] You may run more prompts than this, but keep only the 3 to 5 most useful and meaningful prompts for reporting.
Your prompts must be purposeful and iterative.
Prompt requirements
- At least 1 prompt asking AI to explain why the provided dataset is untidy, using tidy data rules
- At least 1 prompt asking AI to propose a tidy target for a specific analysis goal you choose
- At least 1 prompt asking AI to reshape the data using a concrete plan, for example pivot_longer then pivot_wider
- At least 1 prompt asking AI for verification checks after reshaping
Suggested prompt starters (you may adapt)
“Here is a dataset where the column names look like mean_price__Private_room and mean_availability_365__Shared_room. Explain why this is untidy, using tidy data rules.”
“Give two different tidy formats for this dataset, one that is best for modeling, and one that is best for visualization. Explain why.”
“Write a step by step plan in R to reshape this dataset so that each row is one neighbourhood_group, neighbourhood, and room_type, with separate columns for mean_price, median_minimum_nights, mean_availability_365, mean_reviews_per_month, and n_listings.”
“After reshaping, what sanity checks can prove that I did not accidentally change the meaning of the data?”
“If some room types are missing in some neighborhoods, how should that appear in a tidy dataset, and how should I report it?”
Step 3 Create the AI Interaction Log
For each prompt, include:
- Prompt goal
- The AI response excerpt you used
- Your annotation:
  - What AI got right
  - What AI assumed
  - What was missing, misleading, or incorrect
  - How you revised your prompt or your plan
Important rule
- You may not paste AI text into your final synthesis verbatim.
Step 4 Reshape the dataset and run required validation checks
Reshape the required file, then complete the validation checklist below. Your evidence can be screenshots, printed outputs, or short summaries of what you observed, as long as each summary reports the actual numbers or outputs. A summary without numbers or outputs does not count as evidence.
1. Rows and columns in the original data
- Confirm the dataset has 15 rows and 17 columns after import.
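The shape check can be scripted instead of read off a spreadsheet viewer. The following is a minimal sketch in plain Python, not the required method: the tiny table and the `table_shape` helper are made up for illustration and stand in for the course file. Run against the real file, the same function should report 15 rows and 17 columns.

```python
import csv
import io

# Hypothetical stand-in for the course file: a tiny wide table with made-up values.
TOY_CSV = """neighbourhood_group,neighbourhood,mean_price__Private_room,mean_price__Shared_room
Brooklyn,Williamsburg,95.0,
Manhattan,Harlem,88.5,60.0
"""


def table_shape(handle):
    """Return (n_rows, n_columns) for a CSV source that has one header line."""
    reader = csv.reader(handle)
    header = next(reader)
    n_rows = sum(1 for _ in reader)
    return n_rows, len(header)


shape = table_shape(io.StringIO(TOY_CSV))
print(shape)  # toy table only; the real file should give (15, 17)
```

To use it on the downloaded file, pass an opened file object (for example `open(path, newline="")`) in place of the `io.StringIO` wrapper.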
2. Identify at least three untidy features
Use tidy data rules and point to concrete evidence from this file. Examples of valid untidy features include:
- Two variables are encoded in the column names
- Some variables are spread across many columns instead of stored in one column
- Missing values appear because some room types are absent in some neighborhoods, which can be hard to see in wide format
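One way to produce concrete evidence for the first two features is to split the column names themselves. The sketch below is in plain Python and assumes the `metric__room_type` pattern quoted in this handout; the exact list of columns and the `split_encoded` helper are illustrative, so compare the output against the header of the real file.

```python
from collections import defaultdict

# Column names that follow the pattern quoted in this handout (metric__room_type).
# The exact set here is an assumption for illustration.
columns = [
    "neighbourhood_group",
    "neighbourhood",
    "mean_price__Private_room",
    "mean_price__Shared_room",
    "mean_availability_365__Private_room",
    "mean_availability_365__Shared_room",
]


def split_encoded(names, sep="__"):
    """Group room types by metric for every column whose name contains sep."""
    by_metric = defaultdict(list)
    for name in names:
        if sep in name:
            metric, room_type = name.split(sep, 1)
            by_metric[metric].append(room_type)
    return dict(by_metric)


encoded = split_encoded(columns)
for metric, room_types in encoded.items():
    print(metric, "->", room_types)
```

If every metric maps to the same set of room types, that is direct evidence that room_type is a variable stored in the column names rather than in a column of its own.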
3. Create tidy version A, best for modeling and tabular summaries
Target structure
- One row per neighbourhood_group, neighbourhood, and room_type
- Separate columns for: mean_price, median_minimum_nights, mean_availability_365, mean_reviews_per_month, n_listings
Evidence to submit
- The row and column count of tidy version A
- The list of column names of tidy version A
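The target structure for tidy version A can be sketched as follows. This is a plain Python illustration of the reshaping logic, not the course's required tool: the two toy rows use made-up values, only two of the five metrics are shown, and `to_tidy_a` is a hypothetical helper name.

```python
# Toy wide rows with made-up values; None marks a room type absent from a neighbourhood.
wide = [
    {"neighbourhood_group": "Brooklyn", "neighbourhood": "Williamsburg",
     "mean_price__Private_room": 95.0, "n_listings__Private_room": 40,
     "mean_price__Shared_room": None, "n_listings__Shared_room": None},
    {"neighbourhood_group": "Manhattan", "neighbourhood": "Harlem",
     "mean_price__Private_room": 88.5, "n_listings__Private_room": 55,
     "mean_price__Shared_room": 60.0, "n_listings__Shared_room": 7},
]
ID_COLS = ("neighbourhood_group", "neighbourhood")


def to_tidy_a(rows, sep="__"):
    """Return one record per id columns + room_type, with one key per metric."""
    tidy = {}
    for row in rows:
        ids = tuple(row[col] for col in ID_COLS)
        for name, value in row.items():
            if sep not in name:
                continue  # id column, not a metric__room_type column
            metric, room_type = name.split(sep, 1)
            record = tidy.setdefault(
                ids + (room_type,),
                dict(zip(ID_COLS, ids), room_type=room_type),
            )
            record[metric] = value
    return list(tidy.values())


tidy_a = to_tidy_a(wide)
print(len(tidy_a), "rows x", len(tidy_a[0]), "columns")
print(list(tidy_a[0]))
```

Note the design choice this sketch makes: a room type that is absent from a neighbourhood is kept as a row whose metrics are all missing. Dropping such rows is also defensible, but it changes the row count you report, so state which option you chose.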
4. Create tidy version B, best for visualization across metrics
Target structure
- One row per neighbourhood_group, neighbourhood, room_type, and metric
- A single value column
Evidence to submit
- The row and column count of tidy version B
- The list of unique values in the metric column
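Tidy version B differs from version A only in that the metric name becomes a value in its own column. A minimal plain Python sketch, again with made-up toy values, only two metrics, and a hypothetical helper name (`to_tidy_b`):

```python
# Toy wide rows with made-up values; None marks a room type absent from a neighbourhood.
wide = [
    {"neighbourhood_group": "Brooklyn", "neighbourhood": "Williamsburg",
     "mean_price__Private_room": 95.0, "n_listings__Private_room": 40,
     "mean_price__Shared_room": None, "n_listings__Shared_room": None},
    {"neighbourhood_group": "Manhattan", "neighbourhood": "Harlem",
     "mean_price__Private_room": 88.5, "n_listings__Private_room": 55,
     "mean_price__Shared_room": 60.0, "n_listings__Shared_room": 7},
]


def to_tidy_b(rows, sep="__"):
    """Return one record per neighbourhood_group, neighbourhood, room_type, and metric."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if sep not in name:
                continue  # id column, not a metric__room_type column
            metric, room_type = name.split(sep, 1)
            long_rows.append({
                "neighbourhood_group": row["neighbourhood_group"],
                "neighbourhood": row["neighbourhood"],
                "room_type": room_type,
                "metric": metric,
                "value": value,
            })
    return long_rows


tidy_b = to_tidy_b(wide)
metrics = sorted({record["metric"] for record in tidy_b})
print(len(tidy_b), "rows x", len(tidy_b[0]), "columns")
print(metrics)
```

Because every metric shares one value column, values with different units (prices, counts, days) sit side by side. That is convenient for faceting a plot by metric but awkward for modeling, which is part of the trade-off you are asked to justify.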
5. Sanity checks that meaning did not change
Provide evidence for at least two checks, for example:
- For one neighborhood, show that a value like mean_price for Private_room is the same before and after reshaping
- Show that, for each metric and room_type combination, the number of non-missing cells in the wide table equals the number of non-missing values for that combination in tidy version A
- Show that the number of missing combinations you report is consistent between formats
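The three example checks above can be sketched in a few lines. This is an illustrative plain Python version under stated assumptions: the toy rows use made-up values, and the reshaped data is built inline, whereas in your own work it would come from whatever reshaping code you actually ran.

```python
SEP = "__"

# Toy wide rows with made-up values; None marks a missing cell.
wide = [
    {"neighbourhood_group": "Brooklyn", "neighbourhood": "Williamsburg",
     "mean_price__Private_room": 95.0, "n_listings__Private_room": 40,
     "mean_price__Shared_room": None, "n_listings__Shared_room": None},
    {"neighbourhood_group": "Manhattan", "neighbourhood": "Harlem",
     "mean_price__Private_room": 88.5, "n_listings__Private_room": 55,
     "mean_price__Shared_room": 60.0, "n_listings__Shared_room": 7},
]
metric_cols = [name for name in wide[0] if SEP in name]

# Reshaped cells keyed by (neighbourhood_group, neighbourhood, room_type, metric).
long_cells = {}
for row in wide:
    for name in metric_cols:
        metric, room_type = name.split(SEP, 1)
        key = (row["neighbourhood_group"], row["neighbourhood"], room_type, metric)
        long_cells[key] = row[name]

# Check 1: one value is identical before and after reshaping.
before = wide[1]["mean_price__Private_room"]
after = long_cells[("Manhattan", "Harlem", "Private_room", "mean_price")]
spot_check_ok = before == after

# Check 2: the count of non-missing cells is preserved.
wide_non_missing = sum(row[name] is not None for row in wide for name in metric_cols)
long_non_missing = sum(value is not None for value in long_cells.values())

# Check 3: the same neighbourhood and room_type combinations are entirely missing.
room_types = sorted({name.split(SEP, 1)[1] for name in metric_cols})
wide_missing = sorted(
    (row["neighbourhood_group"], row["neighbourhood"], rt)
    for row in wide
    for rt in room_types
    if all(row[name] is None for name in metric_cols if name.endswith(SEP + rt))
)
long_missing = sorted(
    {key[:3] for key in long_cells}
    - {key[:3] for key, value in long_cells.items() if value is not None}
)

print(spot_check_ok, wide_non_missing, long_non_missing)
print(wide_missing == long_missing, long_missing)
```

Printed results like these, copied into your log with the real numbers, count as evidence. A sentence saying the checks passed does not.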
6. Choice and justification
Write 3 to 5 sentences answering:
- Which tidy version is better for your investigation question
- Why the other tidy version is less convenient for that goal
- What would make the other tidy version the better choice in a different situation
Step 5 Write the Human Authored Synthesis (group)
Length target: 400 to 600 words.
Your synthesis must include:
- Your investigation question
- At least two meaningful choices, assumptions, or limitations, supported by evidence from your checklist
- A clear explanation of why tidy rules alone were not enough for this case
- A short recommended decision process for choosing a tidy format
Your synthesis must be written in your own words.
Step 6 Presentation slides (group)
Use this exact slide structure:
- Our question
- What AI suggested
- Where AI fell short
- Our corrected understanding, with evidence
- One takeaway for future data science work
Time: 12 to 15 minutes.

Individual Reflection (each student)
Write 150 to 200 words answering:
- What did AI help you learn
- What did AI miss or oversimplify
- What did you contribute as a human thinker
- How will you change your AI use in future data work
Your reflection must match your assigned role.
Participation (non-presenting students)
Please scan the QR code, or go to https://forms.office.com/r/EFnti60KRk to share what you learned from this activity.