AI Activity 3: Dirty Data and Misleading Visualizations: When AI Plots Too Fast

Messy retail orders and returns dataset

AI activity

Modified: April 23, 2026

A visualization can look polished and still be wrong. If data are dirty, a quick AI-generated plot may quietly mix categories, misread dates, double-count rows, or summarize the wrong values. Before plotting, data scientists must decide what to clean, what to keep, and what each row really represents.

You will use an AI tool as a learning partner to investigate a core data science reality: data visualization is not just about plotting. It depends on careful cleaning, wrangling, and validation.

You are graded on how you think, critique, and explain, not on whether AI gives perfect answers.

Goal of this activity

By the end of this activity, your group will do four things:

Identify data problems in a messy raw dataset that could distort visualizations.
Use AI to suggest cleaning and plotting steps, then audit where AI is incomplete, risky, or wrong.
Create one naive plot and at least two corrected plots to show how dirty data can change the story.
Explain which cleaning and wrangling decisions made your final visualizations more trustworthy.

Your grade depends on your reasoning, evidence, and explanation, not on whether AI output sounds correct.

To Do

Download the dataset from the Dataset section below.
Write your investigation question (2 to 3 sentences). Step 1 in Tasks.
Run 3 to 5 AI prompts that cover cleaning, wrangling, and visualization. Step 2 in Tasks.
Clean the data and complete checklist items 1 to 6 first. Step 4 in Tasks.
Build one naive plot and at least two corrected visualizations. Step 4 in Tasks.
Draft your synthesis, then build slides using the required 5-slide structure. Step 5 and Step 6 in Tasks.

Assigned Roles (3 students)

Note: Each student has a designated role for accountability. However, teammates are encouraged to collaborate, support one another, and learn together to produce high-quality, cohesive work.

Prompt Engineer

Responsibilities

Create 3 to 5 purposeful AI prompts
Save AI responses and build the AI Interaction Log
Write annotations for each prompt and response pair

Data Science Auditor

Responsibilities

Identify hidden assumptions in AI guidance
Check whether AI cleaning steps would silently change values or groups
Design and run verification checks after cleaning and wrangling
Provide evidence for at least two important mistakes, risks, or limitations in the AI suggestions

Synthesizer

Responsibilities

Write the Human Authored Synthesis in clear course language
Build slides using the required structure
Ensure the final work is consistent and concise

Dataset

Required file

topic3-data.csv

Dataset notes

README_MessyRetailOrders_Topic3.txt

Important note

This file is intentionally messy. It includes issues such as duplicate IDs, inconsistent category labels, mixed date formats, numeric values stored as text, incompatible discount formats, and multiple missing value representations.

What you will submit

Group submissions
- AI Interaction Log
- Human Authored Synthesis
- Slides for a 12 to 15 minute presentation
Individual submission
- Individual Reflection (150 to 200 words)

Step by step tasks

Step 1 Define your investigation question (before using AI)

Write 2 to 3 sentences answering:

What data story do you want to visualize?
Which variables will matter most?
Which dirty-data problems could change that visual story if handled badly?

Your question must require both cleaning and visualization.

Example investigation questions you may use or adapt:

“We want to compare monthly revenue across sales channels, but mixed date formats, revenue stored as text, and repeated order IDs could distort the trend. We will build a naive plot first, then a corrected plot after cleaning and wrangling.”
“We want to compare total revenue by category, but inconsistent category labels and numeric fields stored as text could split or miscount the bars. Our goal is to show how the visual changes after careful cleaning.”
“We want to compare return rates across customer types, but inconsistent return coding and missing values could create a misleading comparison plot if we trust a quick AI-generated workflow.”

Step 2 Use AI strategically (3 to 5 prompts)

Note: You may run more prompts while exploring, but report only the 3 to 5 most useful and meaningful ones.

Your prompts must be purposeful and iterative.

Prompt requirements

At least 1 prompt asking AI to identify likely data problems before plotting
At least 1 prompt about cleaning dates, categories, or numeric fields
At least 1 prompt about duplicates, repeated IDs, or what one row represents
At least 1 prompt about how to validate whether a visualization is trustworthy after cleaning

Suggested prompt starters (you may adapt)

“I want to visualize monthly revenue by sales channel, but order_date uses mixed formats and revenue is stored as messy text. What cleaning steps should I take before plotting, and how do I verify them?”
“My dataset has repeated order_id values. How can I tell whether these are valid multi-line orders or duplicate records, and how could a wrong decision affect a plot?”
“category and sales_channel have inconsistent capitalization, spaces, punctuation, and abbreviations. What is a safe way to standardize them and check whether I merged levels correctly?”
“discount contains values like 10%, 0.10, 10 percent, none, and blank cells. What mistakes might AI make here, and how could those mistakes affect a visualization?”
“I want to create a plot from a dirty retail dataset. How can I compare a naive plot with a corrected plot to show why cleaning decisions matter?”

Step 3 Create the AI Interaction Log

For each prompt, include:

Prompt goal
The AI response excerpt you used
Your annotation:
- What AI got right
- What AI assumed
- What was missing, misleading, or incorrect
- How you revised your prompt or your plan

Important rule

You may not paste AI text into your final synthesis verbatim.

Step 4 Clean the data, build visualizations, and run required validation checks

Import the required file, then complete the validation checklist below. Your evidence can be screenshots, printed outputs, or short summaries that cite the specific numbers or outputs you observed. A summary without numbers or outputs does not count as evidence.

Required cleaning and visualization checklist

1. Rows and columns

Confirm the dataset has 480 rows and 15 columns after import.
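One possible way to run this check is sketched below using only Python's standard library. The small in-memory CSV is a made-up stand-in, not the real file, and `shape_of` is a hypothetical helper name. The sketch counts data records and header columns, and also lists any records whose field count differs from the header, since a ragged row can hide behind a correct total.

```python
import csv
import io

# Made-up stand-in for topic3-data.csv (the real file should report 480 rows, 15 columns).
SAMPLE = """order_id,order_date,category,revenue
1001,2025-01-05,Electronics,"$1,200.00"
1002,01/07/2025,electronics,350
1002,01/07/2025,electronics,350
"""

def shape_of(handle):
    """Return (data_rows, columns, ragged_records) for an open CSV handle."""
    reader = csv.reader(handle)
    header = next(reader)
    n_rows = 0
    ragged = []  # record numbers whose field count differs from the header
    for record_no, row in enumerate(reader, start=2):
        n_rows += 1
        if len(row) != len(header):
            ragged.append(record_no)
    return n_rows, len(header), ragged

n_rows, n_cols, ragged = shape_of(io.StringIO(SAMPLE))
print(n_rows, n_cols, ragged)
```

With the real file, the same helper could be called on `open("topic3-data.csv", newline="")`; the file's encoding is not stated in this handout, so that is something to confirm rather than assume.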

2. Raw data quality scan

Identify at least four concrete data problems in the raw file that could distort a plot. Examples may include:

mixed date formats
numeric values stored as text
inconsistent category labels
repeated order IDs
exact duplicate rows
multiple missing value representations
inconsistent return coding
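As an illustration of scanning for one of these problems (inconsistent category labels), the sketch below groups raw labels by a normalized key and reports every key that has more than one raw spelling. The labels and the `label_key` rule are assumptions made for this example; the real file may need a different rule.

```python
import re
from collections import defaultdict

# Made-up raw labels; the real category values may differ.
raw_labels = [
    "Electronics", "electronics ", "ELECTRONICS",
    "Home & Kitchen", "home and kitchen", "Home-Kitchen", "Toys",
]

def label_key(label):
    """Lowercase, trim, spell out '&', and collapse punctuation and spaces."""
    text = label.strip().lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return " ".join(text.split())

variants = defaultdict(set)
for label in raw_labels:
    variants[label_key(label)].add(label)

# Keys with more than one raw spelling are candidates for merging.
suspicious = {key: sorted(found) for key, found in variants.items() if len(found) > 1}
for key, found in suspicious.items():
    print(key, found)
```

Note that "Home-Kitchen" is not caught by this rule because its key is "home kitchen". Abbreviations and reworded labels usually need a hand-written mapping and a manual review of the distinct values.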

3. Missing values and missing-value representations

Identify at least three missing-value representations in the file.
Choose two columns and report how many missing values they contain after your cleaning rules are applied.
Explain how you decided what should count as missing.
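A minimal sketch of counting missing values under an explicit rule follows. The token list, the sample rows, and the `customer_type` column name are assumptions for illustration; your own list should come from inspecting the raw file.

```python
# Assumed missing-value tokens, compared after trimming and lowercasing.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "-", "?"}

def is_missing(value):
    """True when the cell is absent or matches one of the assumed tokens."""
    return value is None or value.strip().lower() in MISSING_TOKENS

# Made-up rows; column names other than discount are hypothetical.
rows = [
    {"discount": "10%", "customer_type": "Member"},
    {"discount": "none", "customer_type": "N/A"},
    {"discount": "", "customer_type": "guest"},
    {"discount": " NA ", "customer_type": "Member"},
]

def missing_counts(rows, columns):
    """Count missing cells per column under the rule above."""
    return {col: sum(is_missing(r.get(col)) for r in rows) for col in columns}

counts = missing_counts(rows, ["discount", "customer_type"])
print(counts)
```

Whether "none" in `discount` means "missing" or "no discount applied" is a judgment call. This sketch treats it as missing, and the opposite choice would change both the count and any average discount you plot.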

4. Dates or time-related variables

Inspect order_date.
Show evidence that more than one date format exists in the raw data.
State how you parsed the dates.
Report how many non-missing entries failed to parse, if any.
Explain how a bad date parse could distort a time-based plot.
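The following sketch shows one way to parse mixed formats while keeping track of failures. The three candidate formats and the sample strings are assumptions; the formats actually present in `order_date` have to be found by inspecting the raw values.

```python
from datetime import date, datetime

# Assumed candidate formats, tried in order.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]

def parse_order_date(text):
    """Return a date if any candidate format matches, else None."""
    text = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

# Made-up raw values, including one unsupported format and one blank.
raw_dates = ["2025-03-04", "03/04/2025", "4-Mar-2025", "2025/03/04", ""]
parsed = [parse_order_date(d) for d in raw_dates]

# Non-missing entries that failed to parse are reported, not silently dropped.
failed = [raw for raw, value in zip(raw_dates, parsed) if raw.strip() and value is None]
print(parsed)
print("failed:", failed)
```

A value such as "03/04/2025" is ambiguous: this sketch reads it as month first (March 4), while a day-first reading gives April 3. A wrong guess parses without error and quietly moves revenue into the wrong month, which is why the parse rule should be stated and spot-checked.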

5. Numeric values stored as text

Choose two columns from this list: quantity, unit_price, discount, revenue.

For each chosen column:

Show at least three raw examples
Explain the cleaning rule you used
State one mistake that AI could easily make
Give one validation check you used after conversion
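As a sketch of explicit cleaning rules for two of these columns, the helpers below convert `discount` to a fraction and `revenue` to a number. The discount examples come from the prompt starters in Step 2; the currency formatting for `revenue` and the helper names are assumptions. Anything the rule cannot interpret is returned as `None` so it can be counted instead of guessed.

```python
import math
import re

def parse_discount(text):
    """Return the discount as a fraction (0.10 for 10%), or None if missing or unclear."""
    t = text.strip().lower()
    if t in {"", "none", "na", "n/a"}:
        return None
    is_percent = t.endswith("%") or t.endswith("percent")
    number = re.sub(r"(%|percent)$", "", t).strip()
    try:
        value = float(number)
    except ValueError:
        return None
    if is_percent:
        value /= 100
    # A bare "10" is ambiguous (10% or a data error), so out-of-range values are flagged.
    return value if 0 <= value <= 1 else None

def parse_money(text):
    """Strip an assumed '$' sign and thousands separators, then convert to float."""
    t = text.strip().replace("$", "").replace(",", "")
    try:
        value = float(t)
    except ValueError:
        return None
    return value if math.isfinite(value) else None

for raw in ["10%", "0.10", "10 percent", "none", "10"]:
    print(repr(raw), parse_discount(raw))
for raw in ["$1,234.50", " 350 ", "abc"]:
    print(repr(raw), parse_money(raw))
```

One validation check after conversion is to count the raw values that were not missing but came back as `None`, and to confirm that every converted discount lies between 0 and 1. A typical AI shortcut is to strip the "%" sign and convert directly, which puts "10%" and "0.10" on different scales in the same column.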

6. Repeated IDs and duplicate records

Investigate order_id.
Report whether repeated order IDs appear.
Report whether any exact duplicate rows exist.
Explain what you decided to keep, remove, or flag.
State how a wrong decision here could change a visualization.
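The distinction between repeated IDs and exact duplicates can be sketched as follows. The rows are made up, and the `product` field is a hypothetical column used only to make the example readable.

```python
from collections import Counter, defaultdict

# Made-up rows as (order_id, product, quantity); product is a hypothetical column.
rows = [
    ("1001", "Laptop", "1"),
    ("1002", "Mouse", "2"),
    ("1002", "Keyboard", "1"),  # same order_id, different content: plausible multi-line order
    ("1003", "Desk", "1"),
    ("1003", "Desk", "1"),      # identical in every field: exact duplicate
]

# Exact duplicates: whole rows that appear more than once.
row_counts = Counter(rows)
exact_duplicates = {row: n for row, n in row_counts.items() if n > 1}

# Repeated IDs whose rows differ: candidates for valid multi-line orders.
distinct_rows_by_id = defaultdict(set)
for row in rows:
    distinct_rows_by_id[row[0]].add(row)
repeated_ids_distinct = sorted(
    order_id for order_id, found in distinct_rows_by_id.items() if len(found) > 1
)

# Drop exact duplicates only, keeping the first occurrence and the original order.
deduplicated = list(dict.fromkeys(rows))

print(exact_duplicates)
print(repeated_ids_distinct)
print(len(rows), "->", len(deduplicated))
```

Dropping every repeated `order_id` would delete the valid second line of order 1002 and understate revenue, while keeping the exact duplicate of order 1003 would overstate it. Whether identical rows are true duplicates or two genuine identical purchases is a decision to justify, not something the code can settle.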

7. Build one naive plot

Create one plot from the raw or minimally cleaned data that looks plausible but is not fully trustworthy.

Explain what the plot seems to say
Explain exactly why the plot is misleading or incomplete
Identify which dirty-data issue caused the problem

8. Wrangle the cleaned data for visualization

Create one cleaned summary table that will feed your corrected plots.

Your summary table must clearly define:

what one row represents
which grouping variables are used
which summary statistic is computed
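A minimal sketch of such a summary table is shown below, assuming the rows have already been cleaned. The sample values are made up. Here one row of the summary represents one month and sales channel pair, the grouping variables are month and `sales_channel`, and the statistic is total revenue.

```python
import math
from collections import defaultdict
from datetime import date

# Made-up cleaned rows: parsed dates, standardized channels, numeric revenue.
clean_rows = [
    {"order_date": date(2025, 1, 5), "sales_channel": "online", "revenue": 120.0},
    {"order_date": date(2025, 1, 20), "sales_channel": "online", "revenue": 80.0},
    {"order_date": date(2025, 1, 9), "sales_channel": "store", "revenue": 50.0},
    {"order_date": date(2025, 2, 2), "sales_channel": "online", "revenue": 30.5},
]

totals = defaultdict(float)
n_rows = defaultdict(int)
for row in clean_rows:
    key = (row["order_date"].strftime("%Y-%m"), row["sales_channel"])
    totals[key] += row["revenue"]
    n_rows[key] += 1

# One summary row per (month, sales_channel) pair, sorted for stable plotting order.
summary = [
    {"month": month, "sales_channel": channel,
     "total_revenue": totals[(month, channel)], "n_rows": n_rows[(month, channel)]}
    for (month, channel) in sorted(totals)
]

# Validation: grouping must not create or lose revenue or rows.
assert math.isclose(sum(s["total_revenue"] for s in summary),
                    sum(r["revenue"] for r in clean_rows))
assert sum(s["n_rows"] for s in summary) == len(clean_rows)

for s in summary:
    print(s)
```

Carrying `n_rows` alongside the total makes it easier to spot a group whose bar is driven by very few records before it reaches a plot.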

9. Build at least two corrected visualizations

Create at least two plots from the cleaned and wrangled data.

Requirements:

The two plots must answer your investigation question from different angles.
At least one plot must compare groups.
At least one plot must require wrangling beyond simply plotting raw rows.
For each corrected plot, explain how it differs from the naive version or from what AI first suggested.

10. Judgment and caution

Write 4 to 6 sentences answering:

Which cleaning or wrangling decision mattered most for your visual story?
Where was AI guidance most risky or incomplete?
Why are your final visualizations more trustworthy than a quick AI-generated version?
What additional check would you do before using these visuals in a real report?

Step 5 Write the Human Authored Synthesis (group)

Length target: 400 to 600 words.

Your synthesis must include:

Your investigation question
At least two important cleaning or wrangling decisions, supported by evidence from your checklist
A clear explanation of why the naive plot was misleading
A short recommended workflow for making trustworthy visualizations from dirty data

Your synthesis must be written in your own words.

Step 6 Presentation slides (group)

Use this exact slide structure:

Our question
What AI suggested
Where AI fell short
Our corrected understanding, with evidence
One takeaway for future data science work

Time: 12 to 15 minutes.

Individual Reflection (each student)

Write 150 to 200 words answering:

What did AI help you learn?
What did AI miss or oversimplify?
What did you contribute as a human thinker?
How will you change your AI use in future data work?

Your reflection must match your assigned role.