Predicting Airbnb Prices

Summary

This project builds a predictive model to estimate Airbnb prices in popular European tourist destinations by analyzing how price relates to factors such as location, property type, and amenities. It uses RandomForestRegressor and XGBRegressor to account for non-linear relationships, and compares the models across several values of n_estimators to find the best-performing configuration. Such a model can help guests make informed decisions and help hosts set competitive prices. The write-up also covers challenges related to data collection dates and computational cost, and identifies future opportunities for time-series modeling and a recommender system.

Dataset source: Link to dataset

Code: Github

Problem Statement

In this project I build a model to predict the prices of Airbnb listings in European cities that are popular among tourists. I want to explore how accommodation prices differ across regions of Europe and which factors affect the price.

The Dataset

This dataset offers a detailed view of Airbnb rates in some of the best-known European cities. Each listing is described by a number of attributes, including the number of bedrooms, room type, cleanliness and satisfaction ratings, and distance from the city center, which together give a thorough picture of Airbnb rates on both weekdays and weekends.

(Published: January 2021 | no information provided about the period over which the data was collected)

Dataset Features:

The dataset above contains the following columns:

realSum: price of accommodation for two people and two nights in EUR
room_type: the type of the accommodation. Type: Categorical (Entire home/apt, Private room, Shared room)
room_shared: If the rooms are shared. Type: Boolean
room_private: If the rooms are private. Type: Boolean
person_capacity: the maximum number of guests. Type: int
host_is_superhost: superhost status. Type: Boolean
multi: Whether the listing is for multiple rooms or not. Type: Boolean
biz: Whether the listing is for business purposes or not. Type: Boolean
cleanliness_rating: cleanliness rating. Type: int
guest_satisfaction_overall: overall rating of the listing. Type: int
bedrooms: number of bedrooms (0 for studios). Type: int
dist: distance from city centre in km. Type: int
metro_dist: distance from nearest metro station in km. Type: int
city: city in Europe. Type: Categorical (London, Rome, Paris, Lisbon, Athens, Budapest, Vienna, Barcelona, Berlin, Amsterdam)
day: Day of the week. Type: Categorical (weekday, weekend)
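
Tree-based models need numeric input, so the categorical columns listed above have to be encoded before modeling. The write-up does not show how this was done; the sketch below assumes one-hot encoding with pandas and uses a tiny made-up sample (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Tiny made-up sample using column names from the list above (values are illustrative only).
listings = pd.DataFrame({
    "realSum": [194.0, 344.2, 264.1, 433.5],
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Entire home/apt"],
    "person_capacity": [2, 4, 2, 4],
    "host_is_superhost": [False, True, False, True],
    "bedrooms": [1, 2, 1, 2],
    "dist": [5.0, 0.5, 5.7, 0.4],
    "city": ["Amsterdam", "Paris", "Rome", "Amsterdam"],
    "day": ["weekday", "weekend", "weekday", "weekend"],
})

# Separate the target (price) from the predictors.
y = listings["realSum"]
X = listings.drop(columns=["realSum"])

# One-hot encode the categorical columns (an assumed preprocessing choice).
X_encoded = pd.get_dummies(X, columns=["room_type", "city", "day"])

# Cast every column, including booleans, to float for a uniformly numeric matrix.
X_encoded = X_encoded.astype(float)

print(X_encoded.columns.tolist())
```

One-hot encoding keeps each city and room type as its own column, which makes the later feature-importance ranking easier to read than an arbitrary integer code would.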

Data Exploration

In this section we will visualize the data, find correlations and check if we have enough data in each category to go ahead with building the model.

The graph shows that the day of the week (weekend/weekday) does not significantly affect the price, except in Amsterdam. Price does vary by city (Amsterdam rates, for instance, are higher than the rest), but the price ranges of the cities overlap considerably.

The visualizations below show that we have enough data in each category; however, some categories are not well balanced, which might affect the model. For instance, in the "room_type" column there are far fewer entries for "Shared room" than for the other room types. Similarly, there is an imbalance in the number of entries for London and Amsterdam.
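
A balance check like the one described above can also be done numerically. This is a minimal sketch on made-up data; the helper name category_shares is hypothetical and the counts are not the real ones.

```python
import pandas as pd

# Made-up sample; the real dataset has far more rows and different proportions.
listings = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 6 + ["Private room"] * 3 + ["Shared room"],
    "city": ["London"] * 5 + ["Rome"] * 3 + ["Amsterdam"] * 2,
})

def category_shares(frame: pd.DataFrame, column: str) -> pd.Series:
    """Return the fraction of rows in each category, largest first."""
    return frame[column].value_counts(normalize=True)

room_shares = category_shares(listings, "room_type")
city_shares = category_shares(listings, "city")

print(room_shares)
print(city_shares)
```

Categories with a very small share are the ones where the model has the least evidence to learn from.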

Lastly, let's take a quick look at the correlation matrix. Based on it, 'person_capacity' and 'bedrooms' look likely to be important features in the model.
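
To rank features by how strongly they correlate with price, one option is to take the price column of the correlation matrix and sort it by absolute value. The sketch below uses synthetic data in which price is deliberately constructed to depend on person_capacity and bedrooms, so it illustrates the technique only and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: price is built from capacity and bedrooms, not from distance.
person_capacity = rng.integers(2, 7, size=n)
bedrooms = np.clip(person_capacity // 2 + rng.integers(-1, 2, size=n), 0, None)
dist = rng.uniform(0, 10, size=n)
real_sum = 60 * person_capacity + 40 * bedrooms + rng.normal(0, 30, size=n)

listings = pd.DataFrame({
    "realSum": real_sum,
    "person_capacity": person_capacity,
    "bedrooms": bedrooms,
    "dist": dist,
})

# Pearson correlation of every feature with the price column.
corr_with_price = listings.corr()["realSum"].drop("realSum")

# Order by strength of the relationship, ignoring the sign.
order = corr_with_price.abs().sort_values(ascending=False).index
corr_with_price = corr_with_price.reindex(order)

print(corr_with_price)
```

Pearson correlation only captures linear relationships, so a feature with a low coefficient can still matter to a tree-based model.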

Data Modeling

In this section we use RandomForestRegressor and XGBRegressor to build models that predict Airbnb prices. We chose these two models because both can capture non-linear relationships between the features and the target variable. This matters for predicting Airbnb prices, as there are likely to be complex interactions between factors such as location, property type, and amenities. Both are also ensemble methods: they combine the predictions of many individual models, which helps reduce the impact of noise and outliers in the data and makes them more robust than simpler models such as linear regression.

For each of the two models, we vary n_estimators and compare the results to see which setting gives the better model.
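
The comparison can be sketched as a loop over n_estimators that records the train score, test score, and RMSE. The example below is an illustration under assumptions: it runs on synthetic data, the split ratio and random_state are arbitrary choices, and evaluate is a hypothetical helper. It is shown with RandomForestRegressor only; xgboost's XGBRegressor follows the same fit/predict/score interface and could be passed to the same helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit a regressor and return (train R^2, test R^2, test RMSE)."""
    model.fit(X_train, y_train)
    rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    return model.score(X_train, y_train), model.score(X_test, y_test), rmse

# Synthetic, non-linear stand-in for the encoded listing features and prices.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 5))
y = 200 + 300 * X[:, 0] * X[:, 1] + 100 * np.sin(6 * X[:, 2]) + rng.normal(0, 20, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

results = {}
for n_estimators in (100, 500, 700):
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    results[n_estimators] = evaluate(model, X_train, X_test, y_train, y_test)
    train_r2, test_r2, rmse = results[n_estimators]
    print(f"n_estimators: {n_estimators}  train score: {train_r2:.3f}  "
          f"test score: {test_r2:.3f}  RMSE: {rmse:.2f}")
```

Fixing random_state keeps the runs comparable, so differences between rows reflect n_estimators rather than a different random split.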

Random Forest Regressor

n_estimators: 100

train score: 0.9258122699643747 test score: 0.7179151192344506

RMSE: 172.20063514996545

n_estimators: 500

train score: 0.9304904541596818 test score: 0.7020610830013977

RMSE: 176.97359033334226

n_estimators: 700

train score: 0.9298596407341069 test score: 0.7064771882366518

RMSE: 175.6571264080737

XGB Regressor

n_estimators: 100

train score: 0.5594230742630443 test score: 0.5151750056299127

RMSE: 225.7549828615818

n_estimators: 500

train score: 0.812236874502541 test score: 0.6329053604149923

RMSE: 196.44178226769043

n_estimators: 700

train score: 0.8469654753698661 test score: 0.660604366552968

RMSE: 188.88521705534265

For the RandomForestRegressor, the RMSE stayed almost the same as n_estimators changed. For the XGBRegressor, the RMSE decreased as n_estimators increased. Even so, the RandomForestRegressor performed better on this data at every setting tried.

Of the models above, the best is the RandomForestRegressor with n_estimators set to 100: it has the lowest RMSE (172.2), train and test scores of 0.93 and 0.72, and is the cheapest of the three random forest configurations to train. That is the final model.
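
For reference, the two metrics reported above can be written as follows. This assumes the train and test scores are the coefficient of determination R^2, which is what the score method of scikit-learn regressors returns by default; the write-up does not state this explicitly. Here y_i is the actual price of listing i, \hat{y}_i the predicted price, \bar{y} the mean actual price, and n the number of listings evaluated.

```latex
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\]
```

RMSE is expressed in the same units as the price (EUR) and weights large errors more heavily, so a few badly mispredicted expensive listings can raise it noticeably.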

The graph plots the predicted prices against the actual prices. The model is fairly accurate in the 200-800 price range, but beyond that range its predictions become much less accurate.

This inaccuracy may be the result of having fewer data points above that range.
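
One way to check this explanation is to group listings by actual price and report the number of listings and the RMSE in each band. The sketch below uses made-up actual and predicted values, and rmse_by_price_band is a hypothetical helper; the band edges simply mirror the 200-800 range mentioned above.

```python
import numpy as np
import pandas as pd

def rmse_by_price_band(y_true, y_pred, edges):
    """Group listings by actual price and report count and RMSE per band."""
    frame = pd.DataFrame({"actual": y_true, "predicted": y_pred})
    frame["band"] = pd.cut(frame["actual"], bins=edges, include_lowest=True)
    frame["sq_err"] = (frame["actual"] - frame["predicted"]) ** 2
    grouped = frame.groupby("band", observed=False)["sq_err"]
    return pd.DataFrame({"n": grouped.size(), "rmse": np.sqrt(grouped.mean())})

# Made-up values for illustration; not results from the actual model.
actual = [250, 300, 450, 600, 750, 900, 1500, 2500]
predicted = [260, 290, 470, 580, 760, 700, 1000, 1400]

report = rmse_by_price_band(actual, predicted, edges=[200, 800, 5000])
print(report)
```

If the high-price band shows both a small n and a large RMSE, that supports the idea that the model has too few expensive listings to learn from.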

Lastly, we look at the importance of the top 10 features in this model.
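
A top-10 ranking like this can be read from a fitted forest's feature_importances_ attribute. The following is a minimal sketch on synthetic data with hypothetical feature names, where only the first two features influence the target; it is not the ranking obtained for the Airbnb model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical feature names standing in for the encoded listing columns.
feature_names = [f"feature_{i}" for i in range(12)]
X = pd.DataFrame(rng.normal(size=(300, 12)), columns=feature_names)

# Only the first two features drive the synthetic target.
y = 5 * X["feature_0"] + 3 * X["feature_1"] + rng.normal(0, 0.5, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one value per column, summing to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
top10 = importances.sort_values(ascending=False).head(10)

print(top10)
```

These are impurity-based importances, which can overstate features with many distinct values, so they are best read as a rough ranking rather than an exact measure of effect.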

Conclusion

Challenges:

There was some ambiguity about when the data for each city was collected, which may have been a major factor in predicting price. In terms of implementation, the data was very clean, so little preprocessing was needed. However, when experimenting with the models' hyperparameters, the RandomForestRegressor was slow to run: it builds hundreds of trees and, with scikit-learn's default settings, trains them one after another on a single core, which is computationally intensive. Training the trees in parallel (the n_jobs parameter) or trying other models such as AdaBoost might have reduced the execution time.
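
On the execution-time point, the trees in a random forest are independent of one another, and scikit-learn's RandomForestRegressor accepts an n_jobs parameter to train them on several cores. The sketch below times both settings on synthetic data; timed_fit is a hypothetical helper, and the actual speed-up depends on the machine and the data size.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the encoded listings.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, size=2000)

def timed_fit(n_jobs):
    """Fit a forest with the given n_jobs and return (model, seconds elapsed)."""
    model = RandomForestRegressor(n_estimators=200, random_state=1, n_jobs=n_jobs)
    start = time.perf_counter()
    model.fit(X, y)
    return model, time.perf_counter() - start

serial_model, serial_seconds = timed_fit(n_jobs=1)      # one core
parallel_model, parallel_seconds = timed_fit(n_jobs=-1)  # all available cores

print(f"n_jobs=1:  {serial_seconds:.2f}s")
print(f"n_jobs=-1: {parallel_seconds:.2f}s")
```

With the same random_state, both forests should produce the same predictions, so parallel training changes the run time and not the model.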

Potential benefits of a model that predicts Airbnb prices could include:

Helping guests make more informed decisions. Guests can use the predicted prices to choose accommodations that suit their budget and preferences.
Supporting hosts in setting prices. Hosts can use the predicted prices to set prices that are competitive and fair for their property.

However, there are also potential harms that need to be considered:

It might reinforce existing inequalities. If the model is trained on biased data, it could perpetuate existing inequalities by favoring certain locations or types of properties over others.
It might also encourage underinvestment in some areas. If the model indicates that prices are low in an area, it could discourage investment there, leading to a lack of options for guests and hindering the growth of the area.

Research question for future work:

If we had all the reviews of the accommodations along with the review dates, we could build a time-series model that would likely predict Airbnb prices more accurately, since price is time-dependent and shows seasonal changes that my model does not account for. This could be taken one step further and built into a recommender that suggests accommodation to users based on their budget and the type of vacation they are looking for. For this we would need additional features.

Shinjini Guha