README_MessyRetailOrders_Topic3.txt Dataset title Messy Retail Orders and Returns Dataset Purpose This synthetic dataset was created for AI Activity 3 in MATH/COSC 3570. It is designed to support practice with data cleaning, wrangling, and visualization. The file is intentionally messy. Students should not assume that column types, missing values, categories, dates, or repeated IDs are already clean. Context The file is a raw export from a fictional retail business covering orders from January 2025 through July 2025. Each row is intended to represent one extracted transaction line from the raw export. Because this is a raw extract, some rows may be duplicated exactly, while some repeated order IDs may reflect multi-line orders rather than errors. Students should inspect carefully before deciding what to remove, keep, or combine. Learning goal The main goal is not simply to make a plot. The goal is to show that quick AI-generated code can produce a polished but misleading visualization when the underlying data are dirty. File included - topic3-data.csv Dimensions - 480 rows - 15 columns Column descriptions 1. order_id Order identifier. Repeated values may occur. 2. order_date Date the order was placed. 3. ship_date Date the order was shipped. 4. region Geographic region. 5. city City linked to the order. 6. category Main product category. 7. subcategory Product subcategory. 8. sales_channel How the order was placed. 9. customer_type Type of customer. 10. quantity Quantity ordered. 11. unit_price Unit price for the line item. 12. discount Discount applied to the line item. 13. revenue Revenue recorded for the line item. 14. returned Return flag or return status. 15. notes Free-text operational note. This column may be useful for inspection but is not required for analysis. Important cautions - Do not assume the date columns use one consistent format. - Do not assume numeric columns are already numeric. - Do not assume one missing-value code is used everywhere. - Do not assume repeated order_id means the row is a duplicate. - Do not assume labels such as region, category, sales_channel, and customer_type are standardized. - Do not assume a quick plot from the raw file is trustworthy. Suggested visualization directions Students may choose one main question such as: - How does monthly revenue vary by sales channel? - How does total revenue vary across categories or regions? - How do return rates differ by customer type or category? - Which categories appear most important before and after cleaning? A strong project will compare: 1. a naive or minimally cleaned plot, and 2. a corrected plot built from a documented cleaning and wrangling pipeline. What not to do - Do not trust AI suggestions without checking the raw values. - Do not report a final plot without showing what cleaning decisions affected it. - Do not drop rows automatically without explaining why. Instructor note This dataset is synthetic and was designed for pedagogy, not for real business inference.