This project is a web scraping and data analysis tool designed to extract customer reviews from Best Buy product pages, analyze their sentiment using a hybrid approach, and visualize key customer insights.
- Advanced Web Scraping: Uses `undetected-chromedriver` and `Selenium` to bypass anti-bot measures and handle dynamic content loading (infinite scrolling/pagination).
- Data Extraction: Parses HTML using `BeautifulSoup` to extract:
  - Review Title & Body
  - Star Rating (1-5)
  - Date of Review
  - Reviewer Name & "Verified Buyer" Status
  - "Recommendation" Status (Yes/No)
- Hybrid Sentiment Analysis: Calculates sentiment scores using a custom weighted algorithm:
  - NLTK VADER: Base sentiment scoring.
  - TextBlob: Noun phrase extraction for topic modeling.
  - Contextual Weighting: Adjusts sentiment scores based on the Star Rating and the user's "Recommended" flag.
- Visualizations: Generates insightful charts using `Matplotlib` and `Seaborn`:
  - Overall Sentiment Distribution (Pie Chart).
  - Top Drivers of Sentiment (Bar Chart of key topics).
  - Average Rating comparison (Verified vs. Unverified Buyers).
- Python 3.11
- Google Chrome Browser (Must be installed on the system for the webdriver to work).
- Clone the repository: `git clone https://github.com/SidoJain/Web-Scraping-Sentiment-Analysis.git`
- Install the required Python packages: `uv pip install -r requirements.txt`
- NLTK Data: The script automatically downloads the necessary NLTK lexicon (`vader_lexicon`) upon first run.
- Open the Jupyter Notebook (`main.ipynb`).
- Locate the `main()` function in the Driver Code cell.
- Update the `target_url` variable with the link to the Reviews Page of the Best Buy product you wish to analyze.
  - Note: Ensure the URL ends with `/review` or points specifically to the review section.
- Set the Chrome version number as follows, replacing `{version_num}` with the major version of your installed Chrome: `driver = uc.Chrome(options=options, version_main={version_num})`
- Run all cells in the notebook.
def main():
    # Example URL
    target_url = "https://www.bestbuy.ca/en-ca/product/apple-macbook-air-13-6-w-touch-id-2025-midnight-apple-m4-16gb-ram-256gb-ssd-english/19205139/review"
    # ... rest of the code
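
The note about the review URL can be checked before launching the browser. The helper below is a minimal sketch, not part of the notebook: `is_review_url` is a hypothetical name, and it only covers the first case in the note (a path ending in `/review`), not other URLs that happen to point at the review section.

```python
from urllib.parse import urlparse


def is_review_url(url: str) -> bool:
    """Return True if the URL path ends with a '/review' segment."""
    # Ignore a trailing slash so '/review/' is accepted as well.
    path = urlparse(url).path.rstrip("/")
    return path.endswith("/review")
```

Calling `is_review_url(target_url)` at the top of `main()` would catch a product-page link pasted by mistake.
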
The Scraper: The script launches a visible (non-headless) Chrome window to avoid detection. It:
- Loads the page and removes cookie/privacy banners.
- Applies the "Relevancy" filter.
- Repeatedly clicks the "Load More" button, pausing for a random delay between clicks to mimic human behavior, until all reviews are loaded.
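
The "Load More" loop described above can be sketched independently of the browser. This is an illustration under assumptions, not the notebook's code: `load_all_reviews`, `click_load_more`, the delay bounds, and the `max_clicks` safety cap are all hypothetical. In the real scraper, `click_load_more` would wrap the Selenium lookup and click, returning `False` once the button can no longer be found.

```python
import random
import time
from typing import Callable


def load_all_reviews(
    click_load_more: Callable[[], bool],
    min_delay: float = 1.0,
    max_delay: float = 3.0,
    max_clicks: int = 500,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Click "Load More" until it disappears; return the number of clicks."""
    clicks = 0
    while clicks < max_clicks:
        # Pause for a random interval so clicks are not evenly spaced.
        sleep(random.uniform(min_delay, max_delay))
        if not click_load_more():
            break  # Button is gone: all reviews are loaded.
        clicks += 1
    return clicks
```

Passing `sleep` as a parameter keeps the loop testable without real waiting, and the cap on clicks guards against a page where the button never disappears.
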
Sentiment Logic: The `analyze_sentiment` function combines several signals instead of relying on the raw VADER output alone. It calculates a compound score based on:
- Text Analysis: VADER polarity score.
- Rating Bias: If the rating is >= 4, the score gets a bonus. If it is <= 2, it gets a penalty.
- Recommendation Bias: If the user clicked "No" on "Would you recommend this?", the score is heavily penalized.
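
The weighting rules above can be sketched as a small function. The weights, the function name `adjust_sentiment`, and the final clamp to [-1, 1] are assumptions made for illustration; the notebook's actual values may differ. The base score is taken as an input here; with NLTK it would typically come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`.

```python
from typing import Optional

# Illustrative weights (assumptions, not the notebook's actual values).
RATING_BONUS = 0.2
RATING_PENALTY = 0.2
NOT_RECOMMENDED_PENALTY = 0.5


def adjust_sentiment(
    vader_compound: float,
    rating: int,
    recommended: Optional[bool],
) -> float:
    """Adjust a VADER compound score using the rating and recommendation."""
    score = vader_compound
    if rating >= 4:
        score += RATING_BONUS  # High star ratings nudge the score up.
    elif rating <= 2:
        score -= RATING_PENALTY  # Low star ratings nudge it down.
    if recommended is False:
        # An explicit "No" is penalized more heavily than a low rating.
        score -= NOT_RECOMMENDED_PENALTY
    # Assumed clamp: keep the result in VADER's usual [-1, 1] range.
    return max(-1.0, min(1.0, score))
```

Using `None` for a missing recommendation keeps reviews without that field from being treated as a "No".
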
Topic Extraction: The script uses TextBlob to extract noun phrases (e.g., "battery life", "screen quality") that identify what the reviewer is talking about, and assigns the review's sentiment score to each extracted topic.
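
One way to turn per-review topics into the "top drivers" chart is to average the sentiment score per topic and rank by magnitude. The sketch below assumes the noun phrases have already been extracted (for example via TextBlob's `noun_phrases` property); the function name `top_sentiment_drivers` and the ranking rule are hypothetical and may not match the notebook.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_sentiment_drivers(
    reviews: Iterable[Tuple[List[str], float]],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Average the sentiment score per topic and return the strongest ones."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for phrases, score in reviews:
        # Count each topic at most once per review, ignoring case.
        for phrase in {p.lower() for p in phrases}:
            totals[phrase] += score
            counts[phrase] += 1
    averages = {topic: totals[topic] / counts[topic] for topic in totals}
    # Rank by absolute average sentiment; break ties alphabetically.
    ranked = sorted(averages.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return ranked[:top_n]
```

Each input item is a pair of (noun phrases, sentiment score) for one review, so strongly negative topics surface alongside strongly positive ones.
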
This tool is for educational and research purposes only. Web scraping may violate the Terms of Service of specific websites. Please respect robots.txt files and scrape responsibly. Do not use this tool to overwhelm servers.