Anatomy of the Yelp User Review
Yelp.com, founded in 2004, is one of the most popular places to find reviews of local businesses.
A 2016 paper showed that in the restaurant industry, a one-star increase in a restaurant's rating on Yelp.com led to a 5–9% increase in revenue. Because Yelp rounds the displayed average to the nearest half star, each time a restaurant's rating crossed a rounding threshold its displayed rating jumped discontinuously even though nothing else about the business changed; this means it was the displayed rating itself that caused the increased revenue, not an improvement in quality of service or some other factor.
I took a look at the Yelp dataset, which the company provides for students and researchers.
The dataset is made up of six JSON files. I focused on three of them, about 8 gigabytes uncompressed: a user file with 1.3 million observations, a business file with roughly 130,000, and a review file with 5 million.
I merged them in Pandas and then randomly sampled 100,000 observations.
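A minimal sketch of that step, assuming the standard file names from the Yelp Open Dataset:

```python
import pandas as pd

# The Yelp Open Dataset ships newline-delimited JSON, so lines=True.
# File names here assume the standard dataset layout.
users = pd.read_json('user.json', lines=True)
businesses = pd.read_json('business.json', lines=True)
reviews = pd.read_json('review.json', lines=True)

# Attach each review's author and business, then sample 100,000 rows.
merged = (reviews
          .merge(users, on='user_id', suffixes=('', '_user'))
          .merge(businesses, on='business_id', suffixes=('', '_business')))
sample = merged.sample(n=100_000, random_state=42)
```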
Regression: Linear and Random Forest
The initial target was the length of a review: a simple feature to engineer, but a hard one to predict for human and machine alike.
I started with a baseline approach, in which we 'guess' the mean value for every row and score that guess against each actual observation.
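Roughly like this, continuing from the merged sample above and assuming the target was engineered as review length in characters:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Target: review length in characters (an assumption about how the
# "length of a review" feature was engineered from the review text).
y = sample['text'].str.len()
y_train, y_test = train_test_split(y, random_state=42)

# Baseline: predict the training-set mean for every row.
baseline_pred = np.full(len(y_test), y_train.mean())
print(f'Baseline MSE: {mean_squared_error(y_test, baseline_pred):.2f}')
```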
Then I fit a linear regression.
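A sketch of that fit, where X_train and X_test stand in for a numeric feature matrix built from the user and business columns (an assumption, since the exact feature set isn't shown here):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# X_train / X_test: assumed numeric features, split the same way as y.
linreg = LinearRegression()
linreg.fit(X_train, y_train)
pred = linreg.predict(X_test)
print(f'R^2: {r2_score(y_test, pred):.3f}')
print(f'MSE: {mean_squared_error(y_test, pred):.2f}')
```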
The R-squared score was -6.2, which wasn't very good; however, the mean squared error (MSE) was 361.12, which beat the baseline's MSE of 391.67.
Then I fit a RandomForestRegressor model, which had an R-squared score of 0.720 and a mean absolute error (MAE) of 145.24, a considerable improvement.
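A sketch of that fit; the hyperparameters here are assumptions, since the originals aren't stated:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
forest.fit(X_train, y_train)
print(f'Train R^2: {forest.score(X_train, y_train):.3f}')
print(f'Test MAE: {mean_absolute_error(y_test, forest.predict(X_test)):.2f}')
```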
However, on the test data it actually performed worse than the linear regression, which suggests the model was overfitting the training set.
We can get some insight into what the random forest model was doing with some SHAP plots.
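For example, a summary plot from the shap library, run on the forest fit above:

```python
import shap

# TreeExplainer works directly on tree ensembles like this forest.
explainer = shap.TreeExplainer(forest)
shap_values = explainer.shap_values(X_test)

# Summary plot: which features push predicted review length up or down.
shap.summary_plot(shap_values, X_test)
```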
Classification, Permutation Importance
This dataset is essentially limited to a handful of metropolitan areas in the US and Canada. Using a random forest classifier, I predicted whether a review came from Nevada or from one of the Canadian provinces.
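A sketch of that setup, with the target built from the business file's state column; the specific province codes and the feature_cols list are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 'state' is the two-letter code in the Yelp business file; the exact
# codes present in the data are an assumption here.
subset = sample[sample['state'].isin(['NV', 'ON', 'QC', 'AB'])]
yc = (subset['state'] == 'NV').astype(int)  # 1 = Nevada, 0 = Canada
Xc = subset[feature_cols]  # feature_cols: assumed numeric feature columns

Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, stratify=yc, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf.fit(Xc_train, yc_train)
print(f'Test accuracy: {clf.score(Xc_test, yc_test):.3f}')
```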
The test score was slightly better than the baseline in both cases, and again it was interesting to see which columns were most important to the prediction.
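One way to measure that is permutation importance; here is a sketch using scikit-learn's implementation (the original tooling isn't specified):

```python
from sklearn.inspection import permutation_importance

# Shuffle each column in turn and measure the drop in test accuracy;
# a bigger drop means the column mattered more to the model.
result = permutation_importance(clf, Xc_test, yc_test,
                                n_repeats=10, random_state=42)
for idx in result.importances_mean.argsort()[::-1]:
    print(f'{Xc_test.columns[idx]}: {result.importances_mean[idx]:.4f}')
```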