## Import your data
## Example:
## data_raw <- tidyr::billboard
data_raw <-

Mini Project 2
Data Wrangling, Reproducibility, and Pipelines
Overview
In Mini Project 1, your team explored a data set and told a short story with visuals. In this second mini project, your team takes on a different role: data rescue team.
Your job is to take a messy, incomplete, or hard-to-use data set and turn it into a clean, trustworthy table that someone else could actually use for analysis.
Think of this as a handoff project. You are not the final analyst. Instead, your team is preparing the data so that another student, researcher, journalist, or organization could use it with confidence.
In this project, your team will practice:
- Identifying problems in a raw data set
- Building a simple, readable cleaning pipeline
- Checking that your cleaned data makes sense
- Explaining your choices in plain language
- Preparing a cleaned table that is ready for the next step of analysis
This is a data wrangling project. Focus on cleaning, organizing, and documenting your data. Do NOT do complex modeling. Keep any analysis brief and descriptive.
What you will submit
Show the following in your Posit Cloud:
- The rendered HTML report
- The complete version of this source qmd file
- Any raw data file you used, only if it is not a built-in package data set
- One exported cleaned data file named data_clean.csv
What you will present
A 10-minute team presentation that explains:
- What was messy or unclear in the original data
- What your cleaning pipeline did
- Why the final cleaned table is more trustworthy and useful
- Who the cleaned table is for
- What someone could do next with the cleaned data
Team Info
Team Name: Your Team Name
Case title: Give your project a short, interesting title
Team members and roles for this project:
- Project lead (keeps time, coordinates tasks): Your Member Name(s)
- Pipeline builder (writes and organizes the cleaning code): Your Member Name(s)
- Quality checker (checks missing values, duplicates, odd values): Your Member Name(s)
- Documentation writer (explains steps and prepares presentation): Your Member Name(s)
Step 1: Choose a data set
Choose one option below. Two teams may use the same data set.
Choose a data set that is manageable for a short team project. Your goal is not to clean a huge data set. Your goal is to demonstrate a clear and meaningful workflow.
Aim for a data set where you can complete at least 4 meaningful wrangling tasks.
Data sets
- billboard data set from R package tidyr
- penguins_raw data set from R package palmerpenguins
- who2 data set from R package tidyr
- relig_income data set from R package tidyr
- Your own data set, with instructor approval
Use the code chunk import-data below to import your data, and call the main raw data table data_raw.
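As a hedged illustration only, the sketch below fills in an import step under the assumption that your team picked the billboard option. It is not the required solution. The commented-out `read_csv()` line uses a hypothetical file path for teams that bring their own file.

```r
library(tidyr)

# Assumption for illustration: the team chose the built-in billboard data set
data_raw <- tidyr::billboard

# If you bring your own file instead, something like this could work.
# The path below is hypothetical; replace it with your real file.
# data_raw <- readr::read_csv("data/my_raw_file.csv")

# Confirm the import produced a table with rows and columns
dim(data_raw)
```

Keeping the name `data_raw` untouched after this point makes it easy to compare the raw and cleaned tables later.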
Quick description of your data set
Answer the following questions.
- What does one row represent?
Answer:
- What makes this data set messy, incomplete, or hard to use right away?
Answer:
- What might someone want to do with the cleaned version of this data?
Answer:
- If you are using more than one raw table, what does each table contribute?
Answer:
Story lens
Imagine your team is preparing this data for a specific audience. Choose one audience and keep that audience in mind throughout the project.
Possible audiences include:
- a journalist trying to explain a pattern to the public
- a researcher preparing for a later analysis
- a nonprofit or public agency trying to understand a problem
- another student who will use your cleaned table next
Who is your audience, and why would they care about this cleaned table?
Answer:
Step 2: Mission briefing
Write short answers. Keep them specific.
- What is the goal of your cleaning pipeline?
Answer:
- Who is the audience for your cleaned table?
Answer:
- Why would this audience care about the cleaned data?
Answer:
- List at least 4 wrangling tasks you expect to do
Answer:
- What will your final cleaned table allow someone to do more easily?
Answer:
- What is one challenge you expect to face?
Answer:
Step 3: Investigate the raw data
Before cleaning, inspect the data and identify problems.
3.1 First look
## Check your data in your own way
## Suggestions:
## glimpse(data_raw)
## summary(data_raw)
## nrow(data_raw)
## ncol(data_raw)
In 2 to 4 sentences, describe what you noticed about the structure of the raw data.
Answer:
3.2 Missing values
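## Assumes dplyr and tidyr are loaded, for example with library(tidyverse)
## Count missing values in every column, then sort from most to fewest missing
missing_counts <- data_raw |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(cols = everything(), names_to = "variable", values_to = "n_missing") |>
  arrange(desc(n_missing))
missing_counts
What missing value issues do you notice, and how do you plan to handle them?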
Answer:
3.3 Duplicate rows or repeated records
## Example ideas:
## nrow(data_raw)
## nrow(distinct(data_raw))
## data_raw |> count(...) |> arrange(desc(n))
Do you see any duplicate rows, repeated records, or values that need attention?
Answer:
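The duplicate checks suggested above could look like the following minimal sketch. It assumes the billboard option and treats the pair `artist` and `track` as the key that should identify one record; that choice of key is an assumption, and your own data set will need its own.

```r
library(dplyr)
library(tidyr)

# Assumption for illustration: the billboard data set from tidyr
data_raw <- tidyr::billboard

# Fully identical rows: compare the total row count with the distinct row count
n_total <- nrow(data_raw)
n_distinct_rows <- nrow(distinct(data_raw))
n_total - n_distinct_rows

# Repeated records: key combinations that appear more than once
# (artist + track is an assumed key for this example)
repeated_keys <- data_raw |>
  count(artist, track, name = "n_rows") |>
  filter(n_rows > 1) |>
  arrange(desc(n_rows))

repeated_keys
```

An empty `repeated_keys` table suggests that each key appears only once. Any rows that do show up are the records worth inspecting by hand.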
3.4 Other issues
Look for at least one more issue, such as unclear names, wrong data types, dates stored as text, values that need recoding, or columns that should be reshaped.
## Check one more issue here
Describe the issue and why it matters.
Answer:
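One way to look for type and shape problems is sketched below, again assuming the billboard option. The object names `column_types` and `week_columns` are made up for this example.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Assumption for illustration: the billboard data set from tidyr
data_raw <- tidyr::billboard

# One row per column, with the first class of that column
column_types <- tibble(
  variable = names(data_raw),
  type = vapply(data_raw, function(col) class(col)[1], character(1))
)

# How many columns are there of each type?
count(column_types, type, sort = TRUE)

# Columns whose names store a value (the week number) instead of a variable;
# many such columns are a hint that the table should be pivoted longer
week_columns <- names(data_raw)[startsWith(names(data_raw), "wk")]
length(week_columns)
```

Unexpected types, such as numbers stored as text, and long runs of similarly named columns are both issues worth describing in your answer.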
Step 4: Build your data rescue pipeline
Create a new object named data that is ready for later analysis.
Your pipeline should include at least 4 meaningful wrangling actions. Possible actions include:
- Selecting useful columns
- Renaming columns
- Filtering rows
- Recoding values
- Converting data types
- Parsing dates or times
- Separating or combining columns
- Pivoting longer or wider
- Handling missing values
- Creating one or two derived variables
- Removing duplicates
- Joining a small lookup table
Before you write code, describe the transformation in one sentence:
We are turning [messy raw data] into [clean table ready for ___].
Answer:
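To make the idea of a readable pipeline concrete, here is a minimal sketch that combines several of the actions listed above. It assumes the billboard option. The new column names (`date_entered`, `week`, `rank`, `chart_date`) are choices made for this example, not requirements.

```r
library(dplyr)
library(tidyr)

# Assumption for illustration: the billboard data set from tidyr
data_raw <- tidyr::billboard

data <- data_raw |>
  # 1. Rename a column to a consistent snake_case style
  rename(date_entered = date.entered) |>
  # 2. Reshape: one row per song per week instead of one column per week
  # 3. Handle missing values: drop weeks where the song had no chart rank
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    values_to = "rank",
    values_drop_na = TRUE
  ) |>
  # 4. Convert types and derive the calendar date of each chart week
  mutate(
    week = as.integer(week),
    rank = as.integer(rank),
    chart_date = date_entered + 7 * (week - 1)
  ) |>
  # 5. Remove exact duplicate rows, if any, and sort for readability
  distinct() |>
  arrange(artist, track, week)

data
```

Each step sits on its own lines with a short comment, so a reader can match the code to the written explanation of your cleaning steps.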
Use the code chunk clean below to clean data_raw, and save the cleaned table as data.
## Build your cleaning pipeline here
## Keep it readable with line breaks and comments
data <-
Describe your main cleaning steps
In plain language, explain the main changes you made.
- Step 1:
- Step 2:
- Step 3:
- Step 4:
- Optional additional step:
Why your pipeline is readable
What did you do to make your code easier for another person to follow?
Answer:
Step 5: Verify the rescue worked
Now verify that your cleaned table makes sense.
5.1 Final structure
glimpse(data)
summary(data)
What looks better or clearer in data compared with data_raw?
Answer:
5.2 Quality checks
Run at least 3 checks that help you trust the cleaned data.
Possible checks:
- Missing values after cleaning
- No duplicate rows
- Valid ranges for key variables
- Number of rows before and after cleaning
- Category levels look reasonable
- Joined rows matched as expected
## Add at least 3 checks here
Summarize what your checks show.
Answer:
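The checks listed above could be collected in one small table, as in the sketch below. It assumes the billboard option and rebuilds a simple cleaned table so the example runs on its own; your own `data` will differ. The 1 to 100 range for `rank` is an assumption based on the data describing a top-100 chart.

```r
library(dplyr)
library(tidyr)
library(tibble)

data_raw <- tidyr::billboard

# Hypothetical cleaned table, rebuilt here only to keep the example self-contained
data <- data_raw |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    names_transform = list(week = as.integer),
    values_to = "rank",
    values_drop_na = TRUE
  )

# Check 1: no missing values in the key columns
no_missing <- !anyNA(data$week) && !anyNA(data$rank)

# Check 2: each song-week combination appears at most once
n_repeated <- data |>
  count(artist, track, week) |>
  filter(n > 1) |>
  nrow()

# Check 3: ranks fall in the assumed valid range for a top-100 chart
ranks_in_range <- all(data$rank >= 1 & data$rank <= 100)

# Check 4: row accounting; every non-missing week cell should become one row
n_expected <- sum(!is.na(select(data_raw, starts_with("wk"))))

checks <- tibble(
  check = c(
    "no missing week or rank",
    "no repeated song-week rows",
    "rank between 1 and 100",
    "row count matches non-missing raw cells"
  ),
  passed = c(no_missing, n_repeated == 0, ranks_in_range, nrow(data) == n_expected)
)

checks
```

Reporting the results as a table, instead of stopping on the first failure, lets you describe every check in your summary, including any that did not pass.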
5.3 Export your cleaned data
write_csv(data, "data_clean.csv")
Step 6: Show the payoff
Your final report should show that the cleaned table is easier to understand, easier to use, and more helpful for the next person who receives it.
This is where your team shows that the wrangling work mattered.
6.1 Preview of cleaned data
## Show a small preview of the cleaned data
## Example: slice_head(data, n = 10)
In 2 to 3 sentences, explain why this table is more usable than the raw version.
Answer:
6.2 Mini data dictionary
Create a small table that explains 5 to 8 important variables in your cleaned data.
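A data dictionary can be typed in row by row with `tribble()`. The entries below are hypothetical and assume a cleaned billboard table with the columns shown; replace them with the variables in your own `data`.

```r
library(tibble)

# Hypothetical entries for illustration; describe your own variables instead
data_dictionary <- tribble(
  ~variable,    ~meaning,                                              ~type,
  "artist",     "Name of the performing artist",                       "character",
  "track",      "Song title",                                          "character",
  "week",       "Week on the chart, where 1 is the song's first week", "integer",
  "rank",       "Chart position in that week, where 1 is the top",     "integer",
  "chart_date", "Calendar date of that chart week",                    "date"
)

data_dictionary
```

Writing the dictionary as code keeps it in the same file as the pipeline, so it is rerun and updated together with the report.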
## Create a tibble with columns like:
## variable, meaning, type
6.3 One simple summary that the cleaned table now supports
Create one simple summary table or plot that would have been harder to make from the raw data. Keep it descriptive only.
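As one possible example, the sketch below computes a per-song summary that would be awkward to get from the wide raw table. It assumes the billboard option and rebuilds a simple long table so it runs on its own. The names `song_summary`, `weeks_on_chart`, and `best_rank` are made up for this example.

```r
library(dplyr)
library(tidyr)

# Hypothetical cleaned table, rebuilt here only to keep the example self-contained
data <- tidyr::billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    names_transform = list(week = as.integer),
    values_to = "rank",
    values_drop_na = TRUE
  )

# One row per song: how long it charted and the best position it reached
song_summary <- data |>
  group_by(artist, track) |>
  summarise(
    weeks_on_chart = n(),
    best_rank = min(rank),
    .groups = "drop"
  ) |>
  arrange(best_rank, desc(weeks_on_chart))

slice_head(song_summary, n = 10)
```

With one row per song per week, both summaries take a single `summarise()` call. In the raw layout the same numbers would have to be computed across dozens of week columns.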
## Example: a grouped summary table or a simple plot
What does this summary show, and how does it demonstrate that your cleaned table is useful?
Answer:
6.4 Analyst handoff note
Imagine your team is handing this cleaned table to the next analyst on the project.
Write a short handoff note that helps them understand what you cleaned, what they can now do, and what they should still be careful about.
Write 4 to 6 sentences. Include:
- What this cleaned table is for
- What one row represents
- Which variables are most important
- One limitation that still remains
- One reasonable next analysis someone could do
Step 7: Pipeline quality checklist
Confirm that your report shows the following:
- You clearly explained what was messy in the raw data
- Your cleaning code can be rerun from top to bottom
- Your cleaned table is saved as data
- You exported data_clean.csv
- You ran at least 3 checks on the final data
- Your explanations are in plain language
- You did not overcomplicate the project
Revise your report if needed.
Step 8: Team reflection
Each team member writes 2 to 4 sentences:
- What you contributed
- One thing you learned about data wrangling
- One thing you would improve next time
Member 1: your name
Answer:
Member 2: your name
Answer:
Member 3: your name
Answer:
Member 4: your name (if applicable)
Answer:
Step 9: Presentation plan
Plan a 10-minute talk using this suggested structure:
- About 1 minute: case title, data set, and audience
- About 2 minutes: what was messy or unclear in the raw data
- About 3 minutes: your main wrangling steps
- About 2 minutes: your quality checks and what they showed
- About 2 minutes: the cleaned table, handoff note, and takeaway
Presentation order
teams <- c("Team 1", "Superb Statisticians", "The Data Scientists", "Stat Padders", "Data Divers", "Plot Squad")
set.seed(2026)
sample(teams, 6, replace = FALSE)
[1] "Data Divers" "Team 1" "Plot Squad"
[4] "Superb Statisticians" "Stat Padders" "The Data Scientists"
Grading guide
Total 15 points:
- Clear purpose, audience, and identification of raw data issues (3 pts)
- Meaningful wrangling pipeline and clear documentation of steps (6 pts)
- Reproducible workflow, quality checks, and usable final cleaned table (3 pts)
- Presentation clarity, timing, and handoff story (3 pts)