AI Activity 3: Dirty Data and Misleading Visualizations: When AI Plots Too Fast
Messy retail orders and returns dataset
A visualization can look polished and still be wrong. If data are dirty, a quick AI-generated plot may quietly mix categories, misread dates, double-count rows, or summarize the wrong values. Before plotting, data scientists must decide what to clean, what to keep, and what each row really represents.
You will use an AI tool as a learning partner to investigate a core data science reality: Data visualization is not just about plotting. It depends on careful cleaning, wrangling, and validation.
You are graded on how you think, critique, and explain, not on whether AI gives perfect answers.
By the end of this activity, your group will do four things:
- Identify data problems in a messy raw dataset that could distort visualizations.
- Use AI to suggest cleaning and plotting steps, then audit where AI is incomplete, risky, or wrong.
- Create one naive plot and one corrected plot to show how dirty data can change the story.
- Explain which cleaning and wrangling decisions made your final visualizations more trustworthy.
Your grade depends on your reasoning, evidence, and explanation, not on whether AI output sounds correct.
- Download the dataset from the Dataset section below.
- Write your investigation question (2 to 3 sentences). Step 1 in Tasks.
- Run 3 to 5 AI prompts that cover cleaning, wrangling, and visualization. Step 2 in Tasks.
- Clean the data and complete checklist items 1 to 6 first. Step 4 in Tasks.
- Build one naive plot and at least two corrected visualizations. Step 4 in Tasks.
- Draft your synthesis, then build slides using the required 5-slide structure. Step 5 and Step 6 in Tasks.
Assigned Roles (3 students)
[NOTE:] Each student has a designated role for accountability. However, teammates are encouraged to collaborate, support one another, and learn together in order to produce high quality, cohesive work.
Prompt Engineer
Responsibilities
- Create 3 to 5 purposeful AI prompts
- Save AI responses and build the AI Interaction Log
- Write annotations for each prompt and response pair
Data Science Auditor
Responsibilities
- Identify hidden assumptions in AI guidance
- Check whether AI cleaning steps would silently change values or groups
- Design and run verification checks after cleaning and wrangling
- Provide evidence for at least two important mistakes, risks, or limitations in the AI suggestions
Synthesizer
Responsibilities
- Write the Human Authored Synthesis in clear course language
- Build slides using the required structure
- Ensure the final work is consistent and concise
Dataset
Required file
Dataset notes
This file is intentionally messy. It includes issues such as duplicate IDs, inconsistent category labels, mixed date formats, numeric values stored as text, incompatible discount formats, and multiple missing value representations.
What you will submit
- Group submissions
- AI Interaction Log
- Human Authored Synthesis
- Slides for a 12 to 15 minute presentation
- Individual submission
- Individual Reflection (150 to 200 words)
Step by step tasks
Step 1 Define your investigation question (before using AI)
Write 2 to 3 sentences answering:
- What data story do you want to visualize?
- Which variables will matter most?
- Which dirty-data problems could change that visual story if handled badly?
Your question must require both cleaning and visualization.
Example investigation questions you may use or adapt:
“We want to compare monthly revenue across sales channels, but mixed date formats, revenue stored as text, and repeated order IDs could distort the trend. We will build a naive plot first, then a corrected plot after cleaning and wrangling.”
“We want to compare total revenue by category, but inconsistent category labels and numeric fields stored as text could split or miscount the bars. Our goal is to show how the visual changes after careful cleaning.”
“We want to compare return rates across customer types, but inconsistent return coding and missing values could create a misleading comparison plot if we trust a quick AI-generated workflow.”
Step 2 Use AI strategically (3 to 5 prompts)
[Note:] You may run more prompts than this, but report only the 3 to 5 most useful and meaningful ones.
Your prompts must be purposeful and iterative.
Prompt requirements
- At least 1 prompt asking AI to identify likely data problems before plotting
- At least 1 prompt about cleaning dates, categories, or numeric fields
- At least 1 prompt about duplicates, repeated IDs, or what one row represents
- At least 1 prompt about how to validate whether a visualization is trustworthy after cleaning
Suggested prompt starters (you may adapt)
“I want to visualize monthly revenue by sales channel, but order_date uses mixed formats and revenue is stored as messy text. What cleaning steps should I take before plotting, and how do I verify them?”
“My dataset has repeated order_id values. How can I tell whether these are valid multi-line orders or duplicate records, and how could a wrong decision affect a plot?”
“category and sales_channel have inconsistent capitalization, spaces, punctuation, and abbreviations. What is a safe way to standardize them and check whether I merged levels correctly?”
“discount contains values like 10%, 0.10, 10 percent, none, and blank cells. What mistakes might AI make here, and how could those mistakes affect a visualization?”
“I want to create a plot from a dirty retail dataset. How can I compare a naive plot with a corrected plot to show why cleaning decisions matter?”
Step 3 Create the AI Interaction Log
For each prompt, include:
Prompt goal
The AI response excerpt you used
Your annotation:
- What AI got right
- What AI assumed
- What was missing, misleading, or incorrect
- How you revised your prompt or your plan
Important rule
- You may not paste AI text into your final synthesis verbatim.
Step 4 Clean the data, build visualizations, and run required validation checks
Import the required file, then complete the validation checklist below. Your evidence can be screenshots, printed outputs, or short summaries that report the numbers or outputs you observed. A summary without numbers or outputs does not count as evidence.
1. Rows and columns
- Confirm the dataset has 480 rows and 15 columns after import.
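One way to produce evidence for this check is sketched below in plain Python, so it does not assume any particular data library. The inline text is a made-up stand-in for the raw file, and table_shape is a hypothetical helper name. With the real file, you would pass an open file handle instead and compare the result against 480 rows and 15 columns.

```python
import csv
import io

# Made-up stand-in for the raw file (illustration only, not the real data).
RAW = """order_id,order_date,revenue
1001,2023-01-05,$120.00
1002,01/07/2023,85
1002,01/07/2023,85
"""

def table_shape(handle):
    """Return (data_rows, columns, ragged_line_numbers) for a CSV handle."""
    reader = csv.reader(handle)
    header = next(reader)
    rows = [row for row in reader if row]  # skip completely blank lines
    # Position (2 = first row after the header) of any kept row whose
    # field count differs from the header's; blank lines are not counted.
    ragged = [i for i, row in enumerate(rows, start=2) if len(row) != len(header)]
    return len(rows), len(header), ragged

n_rows, n_cols, ragged = table_shape(io.StringIO(RAW))
print(n_rows, n_cols, ragged)
```

A non-empty ragged list would suggest that a delimiter or quoting problem shifted values between columns during import, which is worth resolving before any other check.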
2. Raw data quality scan
Identify at least four concrete data problems in the raw file that could distort a plot. Examples may include:
- mixed date formats
- numeric values stored as text
- inconsistent category labels
- repeated order IDs
- exact duplicate rows
- multiple missing value representations
- inconsistent return coding
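For the inconsistent category labels problem, a minimal sketch of a scan is shown here. The label values are invented for illustration, and normalize_label is a hypothetical helper. The point is to record which raw spellings were merged into each cleaned level, so you can check that no distinct categories were collapsed by accident.

```python
from collections import defaultdict

# Invented label values for illustration; the real file will differ.
raw_labels = ["Electronics", "electronics ", "ELECTRONICS",
              "Home & Garden", "home and garden", "Toys", None]

def normalize_label(value):
    """Lowercase, collapse whitespace, and spell out '&' (assumed rules)."""
    if value is None:
        return None
    text = " ".join(value.strip().lower().split())
    return text.replace("&", "and")

# Map each cleaned level to the set of raw spellings it absorbed.
merged = defaultdict(set)
for label in raw_labels:
    key = normalize_label(label)
    if key is not None:
        merged[key].add(label)

raw_levels = len({label for label in raw_labels if label is not None})
clean_levels = len(merged)
print(raw_levels, "raw levels ->", clean_levels, "cleaned levels")
for key, variants in sorted(merged.items()):
    print(key, "<-", sorted(variants))
```

Printing the mapping, not just the final level count, is what lets a reviewer confirm each merge was intended.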
3. Missing values and missing-value representations
- Identify at least three missing-value representations in the file.
- Choose two columns and report how many missing values they contain after your cleaning rules are applied.
- Explain how you decided what should count as missing.
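The counting step could look something like the sketch below. The token list in MISSING_TOKENS is an assumption, not a description of the real file, so confirm each token against the raw data before using it. The sample rows are invented.

```python
# Assumed missing-value tokens; verify each one against the raw file.
MISSING_TOKENS = {"", "na", "n/a", "null", "-", "?"}

def to_missing(value):
    """Return None for values that match a missing token, else the value."""
    if value is None:
        return None
    return None if value.strip().lower() in MISSING_TOKENS else value

# Invented sample rows for illustration.
rows = [
    {"order_id": "1001", "category": "Toys", "unit_price": "19.99"},
    {"order_id": "1002", "category": "N/A", "unit_price": ""},
    {"order_id": "1003", "category": " null ", "unit_price": "NA"},
    {"order_id": "1004", "category": "Home", "unit_price": "5"},
]

def missing_counts(records, columns):
    """Count values treated as missing in each requested column."""
    return {col: sum(to_missing(r[col]) is None for r in records) for col in columns}

counts = missing_counts(rows, ["category", "unit_price"])
print(counts)
```

Note that the token set deliberately leaves out "none". In a column such as discount, "none" may mean "no discount" and not "unknown", so whether it counts as missing is a per-column decision you should explain.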
4. Dates or time-related variables
- Inspect order_date.
- Show evidence that more than one date format exists in the raw data.
- State how you parsed the dates.
- Report how many non-missing entries failed to parse, if any.
- Explain how a bad date parse could distort a time-based plot.
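A minimal sketch of explicit, format-by-format parsing follows. The three formats in CANDIDATE_FORMATS and the sample strings are assumptions for illustration; the formats actually present in the file have to be found by inspection. Trying a fixed list of formats and counting failures gives you the numbers this checklist item asks for.

```python
from datetime import datetime

# Assumed formats; replace with the ones you actually observe in order_date.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]

def parse_order_date(text):
    """Try each known format in order; return a date, or None on failure."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

# Invented sample values for illustration.
raw_dates = ["2023-01-05", "01/07/2023", "07.01.2023", "2023/13/01", "13/02/2023"]
parsed = [parse_order_date(d) for d in raw_dates]
failed = [d for d, p in zip(raw_dates, parsed) if p is None]

print("parsed:", [p.isoformat() if p else None for p in parsed])
print("failed:", failed)
```

Here "01/07/2023" is read as 7 January because the month-first format is the only slash format listed. If the file actually uses day-first dates, the same string means 1 July, and every such order would land in the wrong month of a time-based plot without raising any error. A value like "13/02/2023" failing to parse is a useful hint that day-first dates are present.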
5. Numeric values stored as text
Choose two columns from this list: quantity, unit_price, discount, revenue.
For each chosen column:
- Show at least three raw examples
- Explain the cleaning rule you used
- State one mistake that AI could easily make
- Give one validation check you used after conversion
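As an illustration for revenue and discount, the sketch below shows one possible pair of cleaning rules plus a range check. The discount examples come from the suggested prompt in Step 2; the revenue examples are invented. Treating "none" as a zero discount, and treating bare numbers above 1 as percentages, are assumptions you would need to justify for the real file.

```python
import re

def parse_revenue(text):
    """Strip currency symbols and thousands commas; None if nothing numeric."""
    if text is None:
        return None
    # Assumes '.' is the decimal mark; a value like '1.234,50' would be misread.
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None

def parse_discount(text):
    """Return the discount as a fraction (0.10 for 10%), or None if missing."""
    if text is None or text.strip() == "":
        return None
    value = text.strip().lower()
    if value == "none":
        return 0.0  # assumption: 'none' means no discount, not unknown
    match = re.search(r"\d+(?:\.\d+)?", value)
    if match is None:
        return None
    number = float(match.group())
    # Assumption: percent signs, the word 'percent', or values above 1 mean percent.
    if "%" in value or "percent" in value or number > 1:
        return number / 100
    return number

raw_discounts = ["10%", "0.10", "10 percent", "none", ""]
discounts = [parse_discount(x) for x in raw_discounts]
revenues = [parse_revenue(x) for x in ["$1,234.50", "85", " 120.00 ", "abc"]]

# Validation check: every non-missing discount must be a fraction in [0, 1].
out_of_range = [d for d in discounts if d is not None and not 0 <= d <= 1]
print(discounts, revenues, out_of_range)
```

A common mistake in quick AI-generated code is to strip the percent sign and keep 10 next to 0.10 in the same column, which mixes two scales in one plot. The range check is there to catch exactly that.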
6. Repeated IDs and duplicate records
- Investigate order_id.
- Report whether repeated order IDs appear.
- Determine whether you found any exact duplicate rows.
- Explain what you decided to keep, remove, or flag.
- State how a wrong decision here could change a visualization.
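The distinction between exact duplicate rows and repeated order IDs can be sketched as below. The rows are invented, and each tuple stands for one full raw record, with order_id in the first position. The idea is to count both kinds of repetition separately before deciding what to drop.

```python
from collections import Counter

# Invented rows for illustration: (order_id, order_date, category, revenue).
rows = [
    ("1001", "2023-01-05", "Toys", "20.00"),
    ("1002", "2023-01-07", "Home", "85.00"),
    ("1002", "2023-01-07", "Home", "85.00"),  # identical in every field
    ("1003", "2023-01-09", "Toys", "15.00"),
    ("1003", "2023-01-09", "Home", "40.00"),  # same order, different line item
]

# Rows that appear more than once with every field identical.
exact_duplicates = {row: n for row, n in Counter(rows).items() if n > 1}

# Order IDs that appear on more than one row, for any reason.
id_counts = Counter(row[0] for row in rows)
repeated_ids = sorted(oid for oid, n in id_counts.items() if n > 1)

# Drop exact duplicates, keeping the first occurrence and the original order.
deduped = list(dict.fromkeys(rows))

# IDs that still repeat afterwards look like multi-line orders.
remaining = Counter(row[0] for row in deduped)
multi_line_ids = sorted(oid for oid, n in remaining.items() if n > 1)

print(len(rows), "->", len(deduped), repeated_ids, multi_line_ids)
```

Deduplicating on order_id alone would delete the second line of order 1003 and understate revenue, while keeping the exact copy of order 1002 would overstate it. Whether an identical row is a true duplicate or a legitimate second identical item is a judgment the data alone may not settle, so flagging is sometimes safer than removing.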
7. Build one naive plot
Create one plot from the raw or minimally cleaned data that looks plausible but is not fully trustworthy.
- Explain what the plot seems to say
- Explain exactly why the plot is misleading or incomplete
- Identify which dirty-data issue caused the problem
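Before drawing anything, it can help to compare the numbers a naive bar chart would show with the numbers after cleaning. The sketch below uses invented rows and a hypothetical helper, totals_by_category, and assumes revenue has already been converted to numbers. It only illustrates two of the possible problems: label variants and exact duplicate rows.

```python
from collections import defaultdict

# Invented rows for illustration: (order_id, category, revenue).
rows = [
    ("1001", "Toys", 20.0),
    ("1002", "toys ", 30.0),
    ("1002", "toys ", 30.0),  # exact duplicate row
    ("1003", "Home", 40.0),
]

def totals_by_category(records, clean):
    """Sum revenue per category, with or without basic cleaning."""
    if clean:
        records = dict.fromkeys(records)  # drop exact duplicates, keep order
    totals = defaultdict(float)
    for _, category, revenue in records:
        label = category.strip().lower() if clean else category
        totals[label] += revenue
    return dict(totals)

naive = totals_by_category(rows, clean=False)
corrected = totals_by_category(rows, clean=True)
print("naive:    ", naive)
print("corrected:", corrected)
```

In this toy case the naive version would draw three bars and rank "toys " far above "Toys", while the corrected version has two bars with a smaller gap. Both charts would look equally polished.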
8. Wrangle the cleaned data for visualization
Create one cleaned summary table that will feed your corrected plots.
Your summary table must clearly define:
- what one row represents
- which grouping variables are used
- which summary statistic is computed
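A summary table that meets these three requirements might be built as in the sketch below. It assumes the rows are already cleaned (parsed dates, numeric revenue, standardized sales_channel); the values, including the channel names, are invented. In this example one row of the summary represents one month and sales channel combination, and the statistic is total revenue.

```python
from collections import defaultdict
from datetime import date

# Invented, already-cleaned rows for illustration.
clean_rows = [
    {"order_date": date(2023, 1, 5), "sales_channel": "online", "revenue": 120.0},
    {"order_date": date(2023, 1, 20), "sales_channel": "online", "revenue": 80.0},
    {"order_date": date(2023, 1, 7), "sales_channel": "store", "revenue": 85.0},
    {"order_date": date(2023, 2, 2), "sales_channel": "online", "revenue": 50.0},
]

# Group by (month, sales_channel) and sum revenue.
totals = defaultdict(float)
for row in clean_rows:
    key = (row["order_date"].strftime("%Y-%m"), row["sales_channel"])
    totals[key] += row["revenue"]

summary = [
    {"month": month, "sales_channel": channel, "total_revenue": total}
    for (month, channel), total in sorted(totals.items())
]

# Validation: grouping must not create or lose revenue.
grand_total_before = sum(r["revenue"] for r in clean_rows)
grand_total_after = sum(r["total_revenue"] for r in summary)

for line in summary:
    print(line)
print(grand_total_before, grand_total_after)
```

Comparing the grand total before and after grouping is a cheap check that rows with missing dates or channels were not silently dropped during wrangling.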
9. Build at least two corrected visualizations
Create at least two plots from the cleaned and wrangled data.
Requirements:
- The two plots must answer your investigation question from different angles.
- At least one plot must compare groups.
- At least one plot must require wrangling beyond simply plotting raw rows.
- For each corrected plot, explain how it differs from the naive version or from what AI first suggested.
10. Judgment and caution
Write 4 to 6 sentences answering:
- Which cleaning or wrangling decision mattered most for your visual story
- Where AI guidance was most risky or incomplete
- Why your final visualizations are more trustworthy than a quick AI-generated version
- What additional check you would do before using these visuals in a real report
Step 6 Presentation slides (group)
Use this exact slide structure:
- Our question
- What AI suggested
- Where AI fell short
- Our corrected understanding, with evidence
- One takeaway for future data science work
Time: 12 to 15 minutes.

Individual Reflection (each student)
Write 150 to 200 words answering:
- What did AI help you learn?
- What did AI miss or oversimplify?
- What did you contribute as a human thinker?
- How will you change your AI use in future data work?
Your reflection must match your assigned role.