Which aspects of the feedback provided by Yelp customer reviews are associated with restaurant closures?
Online customer reviews play a significant role in shaping how restaurants are perceived by consumers. When searching for new places to eat, consumers may turn to customer review platforms, such as Yelp, to decide which of several options is worth trying. Therefore, the valence of a restaurant's reviews on several different aspects can affect its success and, ultimately, its survival. A restaurant closure is the ultimate sign that a restaurant was unsuccessful as a business, and it can occur for several reasons. The aim of this project is to explore which aspects of customer reviews, if any, are associated with restaurant closures. To do so, a sample of Yelp's restaurant reviews is used.
Investigating the research question "Which aspects of the feedback provided by Yelp customer reviews are associated with restaurant closures?" is crucial, as the insights derived from it may help current and prospective restaurant owners detect potential threats to their establishments' survival. For example, restaurant owners can use our results to identify warning signs that their restaurant's survival might be at risk, establish which key areas warrant improvement to exceed customer expectations, and adjust their business strategies to address early signs of threats before they escalate into critical issues.
The dataset used is the Yelp Open Dataset, a public dataset provided by the review platform Yelp. It was obtained via the following link: Yelp Open Dataset.
The dataset contains 5 subsets of data: business, review, checkin, tip, and user, but only the business, review, and checkin subsets were relevant for our study. The business subset contains general business data including location, attributes, categories, and information on whether restaurants are open or closed; the review subset contains full review texts and metadata; and the checkin subset contains comma-separated timestamps for every logged check-in of each restaurant.
Furthermore, the dataset contains millions of reviews on a variety of establishments, services, and experiences that lie outside the scope of our project. Thus, we constructed a balanced dataset that consists of a random sample of 5000 restaurant reviews, reviews_sampled.rds. This sample includes 100 restaurants (50 open and 50 closed) that each have at least 100 reviews since 2018. For closed restaurants, only reviews up to the date of the last check-in (i.e. while they were still active) were considered. For each restaurant, 50 reviews were randomly selected, resulting in the final sample of 5000 observations.
The table below summarizes the most important variables at this stage of the project:
To answer our research question, which is exploratory in nature, we first conduct a sentiment analysis on the 5000-review sample. For this analysis we created our own dictionary by combining different techniques, such as reviewing clusters from BERTopic and word frequency tables for the themes identified as useful. This allowed us to classify useful themes, variables, and keywords across reviews.
We then use quanteda to compute variables indicating whether each theme appears in a review. Aggregated per restaurant, the sentiment scores and theme scores capture both "what people talk about" (topics) and "how they feel about it" (sentiment). Finally, we test whether, and which of, these aspects are associated with restaurant closures.
If there is enough time available, we could:
- Fit a statistical model
- Validate the model
This integrated approach provides a clear and data-driven way to link review content to business outcomes.
- Describe the gist of your findings (save the details for the final paper!)
- How are the findings/end product of the project deployed?
- Explain the relevance of these findings/product.
- data/ → contains the datasets used in the project.
- reporting/ → contains R Markdown files documenting the project’s progress and results.
- scripts/ → contains both R and Python scripts developed for the project.
- cloudstorage/ → contains a file with the link to the shared Google Drive folder (e.g., Google Docs and other shared resources).
Please follow the installation guides on Tilburg Science Hub
Now, copy and run the following code in R:
required_packages <- c("tidyverse", "data.table", "here", "googledrive", "dplyr")
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    suppressMessages(install.packages(pkg))
  }
}
# Load all the dependencies
invisible(lapply(required_packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))
Here is a concise step-by-step overview:
- Install packages
install.packages(c("rmarkdown","knitr","tidyverse","data.table","here","googledrive"))
- Download data from Yelp
Downloads the Yelp CSVs from Google Drive (public folder) only if they are not already present, and converts them to .rds.
- Load data into the environment
- Create the sample
  - From `business`, keep only businesses with category Restaurants.
  - Extract their `business_id` values and join with the `review` dataset.
  - Compute the number of reviews per restaurant (`n_reviews`).
  - Keep only restaurants with at least 50 reviews.
  - Merge in the `is_open` variable from `business`.
  - Check the proportion of open vs. closed restaurants.
  - Define selection criteria: closed vs. open, where:
    - Closed_ids: closed restaurants with 100+ reviews since Jan 1, 2018, counted only until their last check-in.
    - Open_ids: open restaurants with 100+ reviews since Jan 1, 2018.
  - Sample restaurants: 50 closed restaurants and 50 open restaurants.
  - Sample reviews: for each selected restaurant, randomly sample 50 reviews (after 2018, and before last check-in if closed). Final dataset size: (50 × 50) + (50 × 50) = 5,000 reviews. Add the `is_open` variable to the sampled reviews (label open/closed).
  - Save final dataset
  - Alternatively, download it directly from Google Drive.
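The sampling logic described above can be sketched as follows. This is an illustrative, standard-library Python sketch under stated assumptions, not the project's own script: the function name `sample_reviews`, the dictionary field names, and the default arguments are hypothetical, and dates are assumed to be ISO-formatted strings so that they can be compared as text.

```python
import random
from collections import defaultdict


def sample_reviews(reviews, is_open, last_checkin, n_restaurants=50,
                   n_reviews=50, min_reviews=100, start="2018-01-01", seed=1):
    """Hypothetical helper: balanced review sample from open and closed restaurants.

    reviews      -- list of dicts with keys "business_id" and "date" (ISO string)
    is_open      -- dict mapping business_id to 1 (open) or 0 (closed)
    last_checkin -- dict mapping business_id to the ISO date of its last check-in
    """
    rng = random.Random(seed)

    # Collect the eligible reviews per restaurant.
    eligible = defaultdict(list)
    for review in reviews:
        bid = review["business_id"]
        if review["date"] < start:
            continue  # keep only reviews written on or after the start date
        if is_open[bid] == 0 and review["date"] > last_checkin[bid]:
            continue  # closed restaurant: drop reviews after the last check-in
        eligible[bid].append(review)

    sample = []
    for status in (0, 1):  # 0 = closed, 1 = open
        # Restaurants of this status with enough eligible reviews.
        ids = sorted(bid for bid, revs in eligible.items()
                     if is_open[bid] == status and len(revs) >= min_reviews)
        # random.sample raises ValueError if fewer than n_restaurants qualify.
        for bid in rng.sample(ids, n_restaurants):
            for review in rng.sample(eligible[bid], n_reviews):
                sample.append({**review, "is_open": status})
    return sample
```

With the defaults, the result contains (50 × 50) + (50 × 50) = 5,000 labelled reviews, provided at least 50 open and 50 closed restaurants meet the 100-review threshold.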
- Topic modelling
The goal of this stage is to identify the most relevant topics discussed in the reviews. These topics will later be analyzed, together with sentiment, to assess whether they have an impact on restaurant closures.
In Python:
- We converted reviews into embeddings and clustered them into 18 topics.
- We assigned each review to a main topic, based on the highest probability of that topic occurring in the review.
- We exported the result as a CSV file for further analysis.
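The main-topic assignment in the steps above amounts to taking, for each review, the topic with the highest probability. A minimal sketch of that rule, assuming the topic model has already produced one row of probabilities per review: the function name `assign_main_topics` and the toy matrix (3 topics instead of 18) are hypothetical, and the sketch does not call the topic model itself.

```python
def assign_main_topics(topic_probs):
    """Return, for each review, the index of its most probable topic.

    topic_probs -- list of rows, one per review, each a list of
                   per-topic probabilities (assumed input format).
    Ties are resolved in favour of the lowest topic index.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in topic_probs]


# Toy example: 3 reviews, 3 topics.
probs = [
    [0.10, 0.70, 0.20],
    [0.55, 0.25, 0.20],
    [0.05, 0.15, 0.80],
]
main_topics = assign_main_topics(probs)
print(main_topics)  # [1, 0, 2]
```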
In R: we identified the most useful topics and marked them in a specific column named "utility" for downstream analysis.
Next steps for topic modelling:
- Word Frequency & Dictionary Creation: Count the most frequent words in the reviews to build a dictionary of themes from the useful clusters and frequent words.
- Expanding the Dictionary with keyATM: Use keyATM to identify synonyms or related phrases we might have missed.
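The word-frequency and dictionary step can be illustrated with a small standard-library Python sketch. Everything in it is an assumption made for illustration: the stopword list, the example reviews, the theme dictionary, and the helper names `word_frequencies` and `theme_flags` are hypothetical, and the keyATM expansion is not covered.

```python
import re
from collections import Counter

# Toy stopword list (assumption); a real analysis would use a fuller list.
STOPWORDS = {"the", "a", "and", "was", "is", "to", "of", "but", "were", "very"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(reviews):
    """Count non-stopword tokens across all reviews."""
    counts = Counter()
    for text in reviews:
        counts.update(t for t in tokenize(text) if t not in STOPWORDS)
    return counts


def theme_flags(text, dictionary):
    """Flag, per theme, whether any of its keywords appears in the review."""
    tokens = set(tokenize(text))
    return {theme: bool(tokens & words) for theme, words in dictionary.items()}


# Hypothetical example reviews and theme dictionary.
reviews = [
    "The service was slow but the pizza was great.",
    "Great pizza and friendly staff.",
    "Prices were very high and the service was rude.",
]
theme_dictionary = {
    "service": {"service", "staff", "waiter"},
    "food": {"pizza", "food", "taste"},
    "price": {"price", "prices", "expensive"},
}

freqs = word_frequencies(reviews)
flags = [theme_flags(text, theme_dictionary) for text in reviews]
print(freqs.most_common(5))
print(flags)
```

Frequent words that are not yet covered by any theme are candidates for extending the dictionary, and the per-review flags are the kind of theme indicators that can later be aggregated per restaurant.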
This repository was made by Geert Huissen, Alice Ruggiero, Mathijs Quarles van Ufford, Nigel de Jong, and Maria Orgaz Jimenez as part of the Master's course Data Preparation & Programming Skills at the Department of Marketing, Tilburg University, the Netherlands.