Chapter 3 Data transformation
3.1 The Columns We Chose From Listings.csv
After we downloaded the CSV file, we processed it first directly in excel as it is a great way to look for every column we had. We chose to keep: id
,host_id
,host_since
,host_response_rate
,host_acceptance_rate
, host_listings_count
, host_has_profile_pic
, host_identity_verified
, neighbourhood_group
, latitude
, longitude
, room_type
, accommodates
, price
, availability_30
, availability_365
, number_of_reviews
, last_review
, review_scores_rating
, review_scores_accuracy
, review_scores_cleanliness
, review_scores_checkin
, review_scores_communication
, review_scores_location
, review_scores_value
, and reviews_per_month
as these variables might be meaningful for our further analysis.
The brief understanding of the data is that there are 44187 Airbnb properties collected in this dataset and 26 features that we chose can be grouped into three feature groups: property information, host information, review information
Property information (10 features): id
, host_since
, neighbourhood_group
, latitude
, longitude
, room_type
, accomodates
, price
, availability_30
, availability_365
.
Host information (6 features): host_id
, host_response_rate
, host_acceptance_rate
, host_listings_count
, host_has_profile_pic
, host_identity_verified
.
Review information (10 features): number_of_review
, last_review
, review_scores_rating
, review_scores_accuracy
, review_scores_cleanliness
, review_scores_checkin
, review_scores_communication
, review_scores_location
, review_scores_value
, review_per_month
.
3.2 The year_bar Dataset
For the year_bar.csv file, we transform the hosts’ registration data from the host_since
column in order to see how Airbnb’s hosting members grow over the years. We grouped the data by different boroughs then pivot the dataset wider so we have data filtered out for every year since 2008. We will use this dataset for our interactive bar chart in part 6.
Let’s see the dataset:
## # A tibble: 13 x 7
## year Bronx Brooklyn Manhattan Queens Staten_Island Total
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2008 0 15 7 2 0 24
## 2 2009 1 138 99 14 3 255
## 3 2010 25 427 320 43 4 819
## 4 2011 23 1215 895 111 10 2254
## 5 2012 40 1966 1658 262 20 3946
## 6 2013 58 2283 2290 352 13 4996
## 7 2014 108 2771 3001 599 27 6506
## 8 2015 100 2800 3211 745 33 6889
## 9 2016 158 2065 2596 758 52 5629
## 10 2017 187 1299 1489 768 46 3789
## 11 2018 180 1111 1335 710 45 3381
## 12 2019 197 1213 2073 768 43 4294
## 13 2020 72 433 606 265 12 1388
3.3 Data manipulation
To conduct further analysis and visualizations, we filtered the price
variable into three levels “Expensive”, “Medium”, and “Budget”. We define any property that has a price higher or equal to $160
as an expensive Airbnb listing. The medium price level is from $65
to $159
. The budget properties are those cheaper than $65
. Meanwhile, we also separated the review_scores_rating
variable into three levels, “Excellent”, “Poor”, and “Good”. The definition of an excellent property is that the listing received an overall score rating of 100. We defined the scores from 92 to 99 to be good review scores. Lastly, those properties that received a review rating of less than 92 are poor reviews. We separated three levels of these two variables according to their interquartile ranges.
Price:
## 25% 75%
## 65 160
Overall review scores:
## 25% 75%
## 92 100