+++++++++++++++++++++++++++++++++++++++++++++++
+ NYC Airbnb Open Data (2019) for Topic 1     +
+++++++++++++++++++++++++++++++++++++++++++++++

===============================================
Files included
===============================================
1) insideairbnb_nyc_2019_listings_messy_600.csv
   A 600-row sample of the listings, modified for teaching. It intentionally includes common data-import pitfalls.


===============================================
Data source and attribution
===============================================
The underlying dataset is the NYC Airbnb Open Data 2019 listings dataset (commonly distributed as AB_NYC_2019.csv).
It is derived from Inside Airbnb open data.


===============================================
What was modified in the messy file
===============================================
The messy file is designed to give you realistic, meaningful auditing work during data import.
Examples of issues included:
- price: sometimes formatted as currency text (such as '$1,200', 'USD 149', or '149 dollars')
- last_review: mixed date formats (such as '2019-06-22', '6/22/19', or 'Jun 22, 2019') and multiple missing value tokens ('NA', 'N/A', blank, '.', 'unknown')
- reviews_per_month: decimal comma ('0,21'), textual values ('~0.21', '0.21 per mo'), and multiple missing value tokens
- minimum_nights and availability_365: numeric values mixed with units (such as '2 nights' or '96 days')
- room_type and neighbourhood_group: inconsistent capitalization and extra whitespace
- latitude and longitude: numeric values, some with trailing spaces
- host_name: occasional extra whitespace and a few non-visible characters
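
To make the issues above concrete, here is a minimal Python sketch (standard library only) of parsers for the example values listed. The function names are hypothetical, and two rules are assumptions rather than facts about the file: a comma followed by exactly three digits is treated as a thousands separator, and a date such as '6/22/19' is read as month/day/year. Check both against the actual data before relying on them.

```python
import re
from datetime import datetime

# Missing-value tokens from the list above; compared case-insensitively after stripping.
MISSING_TOKENS = {"", "na", "n/a", ".", "unknown"}

# The three date layouts from the list above: 2019-06-22, 6/22/19, Jun 22, 2019.
# Assumption: the slash format is month/day/two-digit year.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%y", "%b %d, %Y")


def is_missing(raw):
    """True if the cell is None or one of the missing-value tokens."""
    return raw is None or raw.strip().lower() in MISSING_TOKENS


def parse_number(raw):
    """Extract the first number from text such as '$1,200', '0,21' or '2 nights'."""
    if is_missing(raw):
        return None
    text = raw.strip()
    # Assumption: a comma followed by exactly three digits is a thousands separator.
    text = re.sub(r"(?<=\d),(?=\d{3}(?!\d))", "", text)
    # Any other comma between digits is treated as a decimal comma.
    text = re.sub(r"(?<=\d),(?=\d)", ".", text)
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def parse_date(raw):
    """Try each known layout in turn; return a date, or None if nothing fits."""
    if is_missing(raw):
        return None
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def clean_category(raw):
    """Collapse runs of whitespace and lowercase, so '  Private   ROOM ' matches 'private room'."""
    if is_missing(raw):
        return None
    return " ".join(raw.split()).casefold()
```

Both parse_number and parse_date return None for text they cannot interpret, which makes an unparseable value look the same as a missing one. Count the two cases separately before and after cleaning so that the cleaning step does not hide new missing values.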


===============================================
Recommended use in your Topic 1 activity
===============================================
1) Import the data and record the parsing messages and inferred column types
2) Identify hidden assumptions (for example, currency, dates, missing values, and units)
3) Propose and justify a cleaning plan, then implement it
4) Validate the result using checks that are appropriate for the context (ranges, missingness, type checks, and summary comparisons)
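
Steps 1 and 4 can be sketched in Python with the standard library only. This is one possible approach, not a required method. The three-row sample is made up for illustration (it is not taken from the file), the function names are hypothetical, and the 0 to 365 range for availability_365 is an assumption based on the column name.

```python
import csv
import io

# Hypothetical sample in the spirit of the messy file (not real rows).
SAMPLE = '''id,price,availability_365
1,"$1,200",96 days
2,149,365
3,N/A,12
'''


def read_as_text(handle):
    """Read every cell as a string so that no parser silently guesses a type."""
    return list(csv.DictReader(handle))


def audit_columns(rows):
    """Per column, count cells that float() accepts as-is and collect those it rejects."""
    report = {}
    for column in rows[0]:
        numeric, rejected = 0, []
        for row in rows:
            try:
                float(row[column])
                numeric += 1
            except ValueError:
                rejected.append(row[column])
        report[column] = {"numeric": numeric, "rejected": rejected}
    return report


def out_of_range(values, low, high):
    """Return the non-missing values that fall outside [low, high]."""
    return [v for v in values if v is not None and not low <= v <= high]


rows = read_as_text(io.StringIO(SAMPLE))
report = audit_columns(rows)
for column, result in report.items():
    print(column, result)

# Range check on already-cleaned values; 400 is a deliberately bad example value.
print(out_of_range([96, 365, 12, 400, None], 0, 365))
```

The rejected lists serve as a record of what a naive numeric import would have failed on. Note that float() tolerates surrounding spaces, so trailing whitespace in a numeric column will not show up in this particular audit.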