NLP pipeline that flags potential bias and discrimination in Yelp reviews across hospitality businesses (restaurants, bars, hotels, spas, cafes) in Edmonton, Nashville, and New Orleans. Built in the last two weeks of August 2025 as a technical assessment for an internship. The company gave me full ownership of the work, so here it is.
Takes 1M+ Yelp reviews, filters to hospitality businesses, and tries to distinguish reviews describing actual discriminatory experiences from reviews that are just generally negative. The core idea is a dual-condition flag: a review only counts as "bias-flagged" if it contains a bias-related keyword AND has negative VADER sentiment. Either condition alone isn't enough, since people complain about cold food (negative, not bias) and people mention race/gender in perfectly neutral contexts (keyword, not bias).
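A minimal sketch of the dual-condition flag, under stated assumptions: the short pattern below is an illustrative stand-in for the project's 50+ regexes, `is_bias_flagged` is a hypothetical name, and the -0.05 cutoff is VADER's commonly used threshold for a negative compound score rather than a value confirmed by the notebook. The compound score is passed in as an argument so the logic can be checked without the VADER lexicon installed.

```python
import re

# Illustrative stand-in for the real pattern list (assumption, not the project's regexes).
BIAS_PATTERN = re.compile(r"\b(racis[mt]|sexis[mt]|discriminat\w*)\b", re.IGNORECASE)

# Assumed cutoff: VADER's conventional "negative" threshold on the compound score.
NEGATIVE_CUTOFF = -0.05


def is_bias_flagged(text: str, compound: float) -> bool:
    """Return True only if both conditions hold: bias keyword AND negative sentiment."""
    has_keyword = BIAS_PATTERN.search(text) is not None
    is_negative = compound <= NEGATIVE_CUTOFF
    return has_keyword and is_negative
```

With NLTK, the compound score would typically come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` after running `nltk.download("vader_lexicon")`. A cold-food complaint fails the keyword check and a neutral mention of sexism fails the sentiment check, so neither gets flagged.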
From there, KMeans clustering on business coordinates (latitude/longitude) groups flagged reviews into neighborhoods, with silhouette scores picking the cluster count. Folium heatmaps show where bias-flagged reviews concentrate across 10,900+ businesses in the three cities.
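A rough sketch of picking the cluster count by silhouette score. The function name `pick_k`, the search range of 2 to 10, and the fixed seed are assumptions for illustration, not the notebook's actual settings. It also treats latitude/longitude as plain Euclidean coordinates, which is an approximation that is reasonable within a single city.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_k(coords, k_min=2, k_max=10, seed=42):
    """Fit KMeans for each candidate k and return (best_k, best_silhouette)."""
    coords = np.asarray(coords, dtype=float)
    # silhouette_score needs at least 2 clusters and at most n_samples - 1.
    upper = min(k_max, len(coords) - 1)
    best_k, best_score = None, -1.0
    for k in range(k_min, upper + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(coords)
        score = silhouette_score(coords, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```

Running this once per city keeps the clusters from spanning cities that are thousands of kilometers apart.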
- Filter Yelp business data to hospitality categories
- Preprocess review text (lowercase, tokenize, stopword removal)
- Flag reviews containing bias-related keywords (regex patterns for race, gender, disability, religion, etc.)
- Run VADER sentiment on all reviews
- Dual-condition filter: negative sentiment + bias keyword = bias-flagged
- KMeans clustering on lat/long of flagged businesses with dynamic k via silhouette optimization
- Neighborhood-level analysis and Folium heatmap visualization
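The first step above could be sketched roughly like this, using only the standard library. It assumes the business file is JSON Lines with `city` and a comma-separated `categories` string per record. The category names in `HOSPITALITY` and the helper names are hypothetical, and the notebook's actual list may differ.

```python
import json

# Assumed category names; the notebook's exact list may differ.
HOSPITALITY = {"restaurants", "bars", "hotels", "day spas", "cafes"}
CITIES = {"Edmonton", "Nashville", "New Orleans"}


def is_hospitality(business: dict) -> bool:
    """True if the business is in a target city and has a hospitality category."""
    # categories is assumed to be one comma-separated string, and it can be null.
    raw = business.get("categories") or ""
    categories = {c.strip().lower() for c in raw.split(",")}
    return business.get("city") in CITIES and bool(categories & HOSPITALITY)


def filter_businesses(lines):
    """Parse JSON Lines records and keep only hospitality businesses."""
    return [b for b in map(json.loads, lines) if is_hospitality(b)]
```

Passing an open file handle works as `lines`, for example `filter_businesses(open("business.json", encoding="utf-8"))`, since iterating a file yields one record per line.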
I did use ChatGPT for the regex patterns/matching because writing 50+ bias-detection regex expressions by hand sounded about as fun as having Vecna from Stranger Things gouge my eyeballs out.
Python (pandas, numpy), NLTK + VADER, scikit-learn (KMeans), Matplotlib/Seaborn, Folium
- Clone the repo
- Download the Yelp Open Dataset and drop `business.json` and `review.json` in the project folder
  - Note: Yelp periodically updates this dataset and rotates which cities are included, so results may not be exactly reproducible with a newer version. This was built on the version available in August 2025.
- Run the Jupyter notebook