Chapter 4 Missing values
We plot a vis_miss graph to investigate the missing-value patterns across all variables.
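A plot of this kind could be produced roughly as follows. This is a minimal sketch rather than the original code: it assumes the visdat package (which provides vis_miss) is installed, and it uses a small made-up data frame, toy_listings, in place of the real listings data.

```r
# Toy stand-in for the listings data (hypothetical values, not the real dataset).
toy_listings <- data.frame(
  id                   = 1:6,
  price                = c(120, 85, 200, 60, 150, 95),
  host_response_rate   = c(NA, 0.9, NA, 1.0, NA, 0.8),
  review_scores_rating = c(4.8, NA, 4.5, 4.9, NA, 4.7)
)

# vis_miss() draws one cell per value, shaded by whether it is missing,
# and labels each column with its share of missing values.
# sort_miss = TRUE orders the columns from most to least missing.
if (requireNamespace("visdat", quietly = TRUE)) {
  miss_plot <- visdat::vis_miss(toy_listings, sort_miss = TRUE)
  print(miss_plot)
}
```

For data frames above visdat's default size threshold, vis_miss should ask for either a downsampled input or the argument warn_large_data = FALSE.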
For row patterns, most rows are complete in the majority of variables, although fewer than half can be complete in every variable, because host_response_rate alone is missing in 52% of rows. For column patterns, 15 columns contain missing values: 12 have substantial gaps, and three host variables (host_since, host_has_profile_pic, and host_identity_verified) are missing in only 17 rows each. The variables host_response_rate and host_acceptance_rate have the most missing values, with 52% and 38% of their values missing, respectively. We believe this is normal for this dataset, because hosts will not respond to every request and certainly will not accept every customer. According to our analysis, most hosts filter out about half of their requests via these two steps. It therefore makes sense that these two variables have a large number of missing values.
Moreover, these two variables will not influence our research, since we will not use them in our further analysis.
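The column shares and the row-level picture discussed above can be computed directly in base R. The following is only a sketch on a made-up data frame (toy_listings and its values are hypothetical), not the code used for the figures in this chapter.

```r
# Toy stand-in for the listings data (hypothetical values, not the real dataset).
toy_listings <- data.frame(
  id                   = 1:6,
  price                = c(120, 85, 200, 60, 150, 95),
  host_response_rate   = c(NA, 0.9, NA, 1.0, NA, 0.8),
  review_scores_rating = c(4.8, NA, 4.5, 4.9, NA, 4.7)
)

# Share of missing values per column, in percent.
# is.na() gives a logical matrix; colMeans() averages TRUE/FALSE per column.
pct_missing <- round(colMeans(is.na(toy_listings)) * 100, 1)
print(pct_missing)

# Share of rows with no missing value in any column, in percent.
pct_complete_rows <- mean(complete.cases(toy_listings)) * 100
print(pct_complete_rows)
```

Because a row is complete only if every column is observed, the share of complete rows can never exceed 100% minus the largest column share.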
Nine further variables, last_review, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, and reviews_per_month, all have about the same share of missing values (around 25%), and they are all related to the same attribute, “reviews”. Thus, they will not influence our analysis of the relationship between other factors and overall review scores.
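One way to check that the review-related variables are missing together is to count, for each row, how many of them are missing and compare that with whether the listing has any reviews. The sketch below uses a made-up data frame (toy_listings and its values are hypothetical) and only three of the nine variables. It is an illustration of the idea, not part of the original analysis.

```r
# Toy stand-in for the listings data (hypothetical values, not the real dataset).
toy_listings <- data.frame(
  number_of_reviews    = c(0, 12, 0, 3, 7, 0),
  last_review          = as.Date(c(NA, "2021-05-01", NA, "2020-11-15",
                                   "2021-08-30", NA)),
  review_scores_rating = c(NA, 4.8, NA, NA, 4.6, NA),
  reviews_per_month    = c(NA, 0.9, NA, 0.2, 0.5, NA)
)

review_cols <- c("last_review", "review_scores_rating", "reviews_per_month")

# Number of review-related fields missing in each row.
n_review_missing <- rowSums(is.na(toy_listings[review_cols]))

# Cross-tabulate against listings that have never been reviewed.
co_missing <- table(
  never_reviewed        = toy_listings$number_of_reviews == 0,
  review_fields_missing = n_review_missing
)
print(co_missing)
```

In the counts reported in this chapter, the score variables are missing slightly more often than last_review, so a check like this would also surface listings that have a last_review date but no rating (row 4 in the toy data).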
The table below shows the exact number of missing values for each variable, sorted in decreasing order.
##          host_response_rate        host_acceptance_rate      review_scores_location
##                       22892                       16754                       11484
##         review_scores_value       review_scores_checkin      review_scores_accuracy
##                       11483                       11479                       11464
## review_scores_communication   review_scores_cleanliness        review_scores_rating
##                       11459                       11451                       11432
##                 last_review           reviews_per_month         host_listings_count
##                       10394                       10394                        5540
##                  host_since        host_has_profile_pic      host_identity_verified
##                          17                          17                          17
##                          id                     host_id         neighbourhood_group
##                           0                           0                           0
##                    latitude                   longitude                   room_type
##                           0                           0                           0
##                accommodates                       price             availability_30
##                           0                           0                           0
##            availability_365           number_of_reviews
##                           0                           0
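A table of this shape can be obtained by counting the missing values in each column and sorting the result. The following is a minimal sketch on a made-up data frame (toy_listings is hypothetical) and is not necessarily the code that produced the output above.

```r
# Toy stand-in for the listings data (hypothetical values, not the real dataset).
toy_listings <- data.frame(
  id                   = 1:6,
  price                = c(120, 85, 200, 60, 150, 95),
  host_response_rate   = c(NA, 0.9, NA, 1.0, NA, 0.8),
  review_scores_rating = c(4.8, NA, 4.5, 4.9, NA, 4.7)
)

# Count missing values per column, then sort from most to least missing.
# The result is a named numeric vector, printed as names above counts.
missing_counts <- sort(colSums(is.na(toy_listings)), decreasing = TRUE)
print(missing_counts)
```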