Mini Project 2

Data Wrangling, Reproducibility, and Pipelines

Mini Project

Modified: April 23, 2026

Overview

In Mini Project 1, your team explored a data set and told a short story with visuals. In this second mini project, your team takes on a different role: data rescue team.

Your job is to take a messy, incomplete, or hard-to-use data set and turn it into a clean, trustworthy table that someone else could actually use for analysis.

Think of this as a handoff project. You are not the final analyst. Instead, your team is preparing the data so that another student, researcher, journalist, or organization could use it with confidence.

In this project, your team will practice:

Identifying problems in a raw data set
Building a simple, readable cleaning pipeline
Checking that your cleaned data makes sense
Explaining your choices in plain language
Preparing a cleaned table that is ready for the next step of analysis

This is a data wrangling project. Focus on cleaning, organizing, and documenting your data. Do NOT do complex modeling. Keep any analysis brief and descriptive.

What you will submit

Show the following in your Posit Cloud project:

The rendered HTML report
The completed version of this source .qmd file
Any raw data file you used (only if it is not a built-in package data set)
One exported cleaned data file named data_clean.csv

What you will present

A 10 minute team presentation that explains

What was messy or unclear in the original data
What your cleaning pipeline did
Why the final cleaned table is more trustworthy and useful
Who the cleaned table is for
What someone could do next with the cleaned data

Team Info

Team Name: Your Team Name
Case title: Give your project a short, interesting title
Team members and roles for this project:
1. Project lead (keeps time, coordinates tasks): Your Member Name(s)
2. Pipeline builder (writes and organizes the cleaning code): Your Member Name(s)
3. Quality checker (checks missing values, duplicates, odd values): Your Member Name(s)
4. Documentation writer (explains steps and prepares presentation): Your Member Name(s)

Step 1: Choose a data set

Choose one option below. Two teams may use the same data set.

Warning

Choose a data set that is manageable for a short team project. Your goal is not to clean a huge data set. Your goal is to demonstrate a clear and meaningful workflow.

Aim for a data set where you can complete at least 4 meaningful wrangling tasks.

Data sets

billboard data set from R package tidyr
penguins_raw data set from R package palmerpenguins
who2 data set from R package tidyr
relig_income data set from R package tidyr
Your own data set, with instructor approval

Use the code chunk import-data below to import your data, and call the main raw data table data_raw.

## Load the packages used by the code in this report
library(tidyverse)

## Import your data
## Example:
## data_raw <- tidyr::billboard
data_raw <-

Quick description of your data set

Answer the following questions.

What does one row represent?

Answer:

What makes this data set messy, incomplete, or hard to use right away?

Answer:

What might someone want to do with the cleaned version of this data?

Answer:

If you are using more than one raw table, what does each table contribute?

Answer:

Story lens

Imagine your team is preparing this data for a specific audience. Choose one audience and keep that audience in mind throughout the project.

Possible audiences include:

a journalist trying to explain a pattern to the public
a researcher preparing for a later analysis
a nonprofit or public agency trying to understand a problem
another student who will use your cleaned table next

Who is your audience, and why would they care about this cleaned table?

Answer:

Step 2: Mission briefing

Write short answers. Keep them specific.

What is the goal of your cleaning pipeline?

Answer:

Who is the audience for your cleaned table?

Answer:

Why would this audience care about the cleaned data?

Answer:

List at least 4 wrangling tasks you expect to do.

Answer:

What will your final cleaned table allow someone to do more easily?

Answer:

What is one challenge you expect to face?

Answer:

Step 3: Investigate the raw data

Before cleaning, inspect the data and identify problems.

3.1 First look

## Check your data in your own way
## Suggestions:
## glimpse(data_raw)
## summary(data_raw)
## nrow(data_raw)
## ncol(data_raw)

In 2 to 4 sentences, describe what you noticed about the structure of the raw data.

Answer:

3.2 Missing values

missing_counts <- data_raw |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(cols = everything(), names_to = "variable", values_to = "n_missing") |>
  arrange(desc(n_missing))

missing_counts

What missing value issues do you notice, and how do you plan to handle them?

Answer:

3.3 Duplicate rows or repeated records

## Example ideas:
## nrow(data_raw)
## nrow(distinct(data_raw))
## data_raw |> count(...) |> arrange(desc(n))
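
The ideas above can be sketched on a small made-up table. Here toy_raw, its columns, and its values are hypothetical stand-ins for your own data_raw; the choice of artist and track as the columns that identify one record is also an assumption you would replace with your own.

```r
library(dplyr)

# Hypothetical toy table standing in for data_raw; the second row repeats the first
toy_raw <- tibble::tibble(
  artist = c("A", "A", "B"),
  track  = c("x", "x", "y"),
  rank   = c(1, 1, 5)
)

# Compare total rows with distinct rows; a difference means exact duplicates exist
n_total    <- nrow(toy_raw)
n_distinct <- nrow(distinct(toy_raw))
n_total - n_distinct

# Count records by the columns that should identify one row, and keep the repeats
repeated_keys <- toy_raw |>
  count(artist, track, name = "n_records") |>
  filter(n_records > 1)

repeated_keys
```

The first check only catches rows that are identical in every column. The second check also catches records that share the same identifying columns but differ elsewhere, which often matters more.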

Do you see any duplicate rows, repeated records, or values that need attention?

Answer:

3.4 Other issues

Look for at least one more issue, such as unclear names, wrong data types, dates stored as text, values that need recoding, or columns that should be reshaped.

## Check one more issue here
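
One possible check for wrong data types is sketched below. The table toy_raw and its columns are hypothetical; the idea is to list each column's class and then find text values that would not convert cleanly to numbers.

```r
library(dplyr)

# Hypothetical toy table in which numbers and dates were read in as text
toy_raw <- tibble::tibble(
  id           = c("1", "2", "3"),
  date_entered = c("2000-02-26", "2000-09-02", "2000-04-08"),
  rank         = c("87", "91", "n/a")
)

# List the class of each column to spot numbers or dates stored as text
col_types <- vapply(toy_raw, function(x) class(x)[1], character(1))
col_types

# Find non-missing values that become NA when converted to numbers
bad_rank <- toy_raw |>
  filter(!is.na(rank) & is.na(suppressWarnings(as.numeric(rank))))

bad_rank
```

Looking at the rows in bad_rank before converting helps you decide whether a value such as "n/a" should become a true missing value or needs a different fix.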

Describe the issue and why it matters.

Answer:

Step 4: Build your data rescue pipeline

Create a new object named data that holds your cleaned table and is ready for later analysis.

Your pipeline should include at least 4 meaningful wrangling actions. Possible actions include:

Selecting useful columns
Renaming columns
Filtering rows
Recoding values
Converting data types
Parsing dates or times
Separating or combining columns
Pivoting longer or wider
Handling missing values
Creating one or two derived variables
Removing duplicates
Joining a small lookup table
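
As an illustration only, here is a minimal sketch of how several of these actions might be chained for the billboard data set from tidyr. The object name billboard_long and the specific choices (renaming, reshaping, converting, sorting) are assumptions made for this example, not a required solution; your own pipeline should fit your data and save its result as data.

```r
library(dplyr)
library(tidyr)

billboard_long <- tidyr::billboard |>
  # Renaming: use one consistent naming style for column names
  rename(date_entered = date.entered) |>
  # Pivoting longer: one row per song per week instead of one column per week
  # Handling missing values: drop weeks in which the song was not on the chart
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    values_to = "rank",
    values_drop_na = TRUE
  ) |>
  # Converting data types: the week label is text after pivoting
  mutate(week = as.integer(week)) |>
  # Sorting so each song's weeks appear in order
  arrange(artist, track, week)

billboard_long
```

Putting one action per step, with a short comment above each, is one way to keep a pipeline readable for the next person.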

Before you write code, describe the transformation in one sentence:

We are turning [messy raw data] into [clean table ready for ___].

Answer:

Use the code chunk clean below to clean data_raw, and save the result as data.

## Build your cleaning pipeline here
## Keep it readable with line breaks and comments
data <-

Describe your main cleaning steps

In plain language, explain the main changes you made.

Step 1:
Step 2:
Step 3:
Step 4:
Optional additional step:

Why your pipeline is readable

What did you do to make your code easier for another person to follow?

Answer:

Step 5: Verify the rescue worked

Now verify that your cleaned table makes sense.

5.1 Final structure

glimpse(data)
summary(data)

What looks better or clearer in data compared with data_raw?

Answer:

5.2 Quality checks

Run at least 3 checks that help you trust the cleaned data.

Possible checks:

Missing values after cleaning
No duplicate rows
Valid ranges for key variables
Number of rows before and after cleaning
Category levels look reasonable
Joined rows matched as expected
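
A few of these checks can be written so that the report stops with an error if a check fails. The sketch below uses hypothetical toy tables toy_raw and toy_clean, and the valid range of 0 to 100 is an assumption chosen for the example.

```r
library(dplyr)

# Hypothetical raw table with one duplicate row and one missing score
toy_raw <- tibble::tibble(
  id    = c(1, 2, 2, 3),
  score = c(10, 25, 25, NA)
)

# Hypothetical cleaning step: remove duplicates and rows without a score
toy_clean <- toy_raw |>
  distinct() |>
  filter(!is.na(score))

# Check 1: no missing values remain in the key variable
stopifnot(!anyNA(toy_clean$score))

# Check 2: no exact duplicate rows remain
stopifnot(nrow(toy_clean) == nrow(distinct(toy_clean)))

# Check 3: values fall inside the range assumed to be valid (0 to 100 here)
stopifnot(all(between(toy_clean$score, 0, 100)))

# Check 4: report rows before and after cleaning so any loss is visible
row_counts <- c(before = nrow(toy_raw), after = nrow(toy_clean))
row_counts
```

Using stopifnot() means a failed check halts rendering, so a problem cannot slip into the final report unnoticed. If you prefer to display results instead, print the counts and describe them in your summary.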

## Add at least 3 checks here

Summarize what your checks show.

Answer:

5.3 Export your cleaned data

write_csv(data, "data_clean.csv")

Step 6: Show the payoff

Your final report should show that the cleaned table is easier to understand, easier to use, and more helpful for the next person who receives it.

This is where your team shows that the wrangling work mattered.

6.1 Preview of cleaned data

## Show a small preview of the cleaned data
## Example: slice_head(data, n = 10)

In 2 to 3 sentences, explain why this table is more usable than the raw version.

Answer:

6.2 Mini data dictionary

Create a small table that explains 5 to 8 important variables in your cleaned data.

## Create a tibble with columns like:
## variable, meaning, type
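
For example, a data dictionary for a long-format version of the billboard data might look like the sketch below. The variable names assume a cleaned table with columns artist, track, date_entered, week, and rank; replace them with the variables in your own cleaned data.

```r
library(tibble)

# One row per variable: its name, a plain-language meaning, and its type
data_dictionary <- tribble(
  ~variable,      ~meaning,                                          ~type,
  "artist",       "Name of the performing artist",                   "character",
  "track",        "Title of the song",                               "character",
  "date_entered", "Date the song entered the chart",                 "date",
  "week",         "Number of weeks since the song entered the chart", "integer",
  "rank",         "Chart position of the song in that week",         "numeric"
)

data_dictionary
```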

6.3 One simple summary that the cleaned table now supports

Create one simple summary table or plot that would have been harder to make from the raw data. Keep it descriptive only.

## Example: a grouped summary table or a simple plot
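
As one possible illustration, the sketch below builds a grouped summary from a long-format version of the billboard data. The names billboard_long and song_summary, and the choice of summary measures, are assumptions made for this example.

```r
library(dplyr)
library(tidyr)

# Reshape so each row is one song in one week (rebuilt here so the example runs alone)
billboard_long <- tidyr::billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    values_to = "rank",
    values_drop_na = TRUE
  )

# One row per song: how many weeks it charted and its best (lowest) rank
song_summary <- billboard_long |>
  group_by(artist, track) |>
  summarise(
    weeks_on_chart = n(),
    best_rank = min(rank),
    .groups = "drop"
  ) |>
  arrange(best_rank, desc(weeks_on_chart))

slice_head(song_summary, n = 10)
```

In the raw wide table, the same summary would require working across dozens of week columns at once, which is why the long format makes this kind of question easier to answer.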

What does this summary show, and how does it demonstrate that your cleaned table is useful?

Answer:

6.4 Analyst handoff note

Imagine your team is handing this cleaned table to the next analyst on the project.

Write a short handoff note that helps them understand what you cleaned, what they can now do, and what they should still be careful about.

Write 4 to 6 sentences. Include:

What this cleaned table is for
What one row represents
Which variables are most important
One limitation that still remains
One reasonable next analysis someone could do

Step 7: Pipeline quality checklist

Confirm that your report shows the following:

You clearly explained what was messy in the raw data
Your cleaning code can be rerun from top to bottom
Your cleaned table is saved as data
You exported data_clean.csv
You ran at least 3 checks on the final data
Your explanations are in plain language
You did not overcomplicate the project

Revise your report if needed.

Step 8: Team reflection

Each team member writes 2 to 4 sentences:

What you contributed
One thing you learned about data wrangling
One thing you would improve next time

Member 1: your name

Answer:

Member 2: your name

Answer:

Member 3: your name

Answer:

Member 4: your name (if applicable)

Answer:

Step 9: Presentation plan

Plan a 10 minute talk using the suggested structure below:

About 1 minute: case title, data set, and audience
About 2 minutes: what was messy or unclear in the raw data
About 3 minutes: your main wrangling steps
About 2 minutes: your quality checks and what they showed
About 2 minutes: the cleaned table, handoff note, and takeaway

Presentation order

teams <- c("Team 1", "Superb Statisticians", "The Data Scientists", "Stat Padders", "Data Divers", "Plot Squad")
set.seed(2026)
sample(teams, 6, replace = FALSE)

[1] "Data Divers"          "Team 1"               "Plot Squad"          
[4] "Superb Statisticians" "Stat Padders"         "The Data Scientists"

Grading guide

Total 15 points:

Clear purpose, audience, and identification of raw data issues (3 pts)
Meaningful wrangling pipeline and clear documentation of steps (6 pts)
Reproducible workflow, quality checks, and usable final cleaned table (3 pts)
Presentation clarity, timing, and handoff story (3 pts)