Chapter 4 Missing values

We use a vis_miss plot to investigate the missing-value patterns across all variables.
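As a minimal sketch of how such a plot can be drawn, the snippet below calls `vis_miss()` from the visdat package on a small made-up data frame. The data frame `listings` and its values are illustrative assumptions, not the dataset analysed in this chapter, and the plotting call is skipped when visdat is not installed.

```r
# Toy data frame standing in for the listings data (values are made up).
listings <- data.frame(
  id = 1:6,
  host_response_rate = c(NA, 0.9, NA, 1.0, NA, 0.8),
  review_scores_rating = c(4.8, NA, 4.5, NA, 5.0, 4.9),
  price = c(120, 85, 60, 200, 150, 95)
)

# Share of missing cells per column, the quantity vis_miss annotates
# next to each column name.
missing_share <- colMeans(is.na(listings))
print(missing_share)

# Draw the missingness map only if visdat is available.
if (requireNamespace("visdat", quietly = TRUE)) {
  p <- visdat::vis_miss(listings, sort_miss = TRUE)
  print(p)
}
```

Setting `sort_miss = TRUE` orders the columns by how much is missing, which makes the most affected variables easy to spot.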

Across rows, most cells are populated, although host_response_rate alone is missing in 52% of the rows, so more than half of the rows contain at least one missing value. Across columns, 15 variables have missing values. Three of them (host_since, host_has_profile_pic, and host_identity_verified) are missing in only 17 rows each, and host_listings_count is missing in 5,540 rows. The variables host_response_rate and host_acceptance_rate have the most missing values, at 52% and 38% respectively. We consider this large share plausible for this dataset, because hosts do not respond to every request and certainly do not accept every guest. According to our analysis, most hosts filter out about half of their requests via these two steps, so it makes sense for these two variables to have many missing values. Moreover, these two variables will not affect our research, since we will not use them in our further analysis.

The other nine variables (last_review, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, and reviews_per_month) all have about the same share of missing values (around 25%), and they all relate to the same attribute, “reviews”. Thus, we do not expect them to influence our analysis of the relationship between other factors and overall review scores.
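One way to check whether the review-related variables are missing together is to count, per row, how many of them are NA and compare that with number_of_reviews. The sketch below does this on a made-up data frame; the values, the subset of columns, and the idea that listings with zero reviews drive the missingness are assumptions for illustration, not results from this chapter.

```r
# Toy data: listings with zero reviews have no review fields (made-up values).
listings <- data.frame(
  number_of_reviews = c(0, 3, 0, 12, 1),
  last_review = c(NA, "2021-05-01", NA, "2021-06-10", "2021-04-02"),
  reviews_per_month = c(NA, 0.4, NA, 1.2, 0.1),
  review_scores_rating = c(NA, 4.7, NA, 4.9, NA)
)

review_cols <- c("last_review", "reviews_per_month", "review_scores_rating")

# Logical matrix of NA flags, then the number of missing review fields per row.
na_flags <- is.na(listings[review_cols])
missing_per_row <- rowSums(na_flags)
print(table(missing_per_row))

# Do listings without any reviews lack every review field?
no_reviews <- listings$number_of_reviews == 0
all_missing_when_no_reviews <-
  all(missing_per_row[no_reviews] == length(review_cols))
print(all_missing_when_no_reviews)
```

If most rows fall at either zero or the full count of review fields, the variables are effectively missing as a block rather than independently.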

The table below shows the exact number of missing values for each variable, sorted from most to least.

##          host_response_rate        host_acceptance_rate      review_scores_location 
##                       22892                       16754                       11484 
##         review_scores_value       review_scores_checkin      review_scores_accuracy 
##                       11483                       11479                       11464 
## review_scores_communication   review_scores_cleanliness        review_scores_rating 
##                       11459                       11451                       11432 
##                 last_review           reviews_per_month         host_listings_count 
##                       10394                       10394                        5540 
##                  host_since        host_has_profile_pic      host_identity_verified 
##                          17                          17                          17 
##                          id                     host_id         neighbourhood_group 
##                           0                           0                           0 
##                    latitude                   longitude                   room_type 
##                           0                           0                           0 
##                accommodates                       price             availability_30 
##                           0                           0                           0 
##            availability_365           number_of_reviews 
##                           0                           0
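A table like the one above can be produced in base R by counting NA values per column and sorting the counts. The following is a sketch on a made-up data frame (`listings` and its values are assumptions), which also converts the counts to percentages of rows.

```r
# Toy data frame standing in for the listings data (values are made up).
listings <- data.frame(
  id = 1:5,
  host_response_rate = c(NA, 0.9, NA, NA, 1.0),
  last_review = c(NA, "2021-05-01", "2021-06-10", NA, "2021-04-02"),
  price = c(120, 85, 60, 200, 150)
)

# Number of missing values per column, largest first.
na_counts <- sort(colSums(is.na(listings)), decreasing = TRUE)
print(na_counts)

# The same counts expressed as a percentage of all rows.
na_percent <- round(100 * na_counts / nrow(listings), 1)
print(na_percent)
```

Reporting counts and percentages side by side makes it easier to relate the exact numbers to the shares quoted in the text.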