
📚 CodeAlpha_WebScraping

A Python-based web scraping project developed for the CodeAlpha Data Analytics Internship that extracts book information from Books to Scrape and exports clean datasets in CSV and Excel formats.


🌟 Intern: Prinkle Kella | CodeAlpha Data Analytics Internship | June 2026


🎯 Project Objective

The objective of this task was to demonstrate web scraping by extracting live data from a mock e-commerce website. The project covers navigating HTML structures, handling common extraction challenges (such as reading ratings out of CSS class names), performing basic data cleaning, and exporting the structured data to both CSV and Excel for later analysis.

🛠️ Tools & Technologies

  • Python 3.12
  • BeautifulSoup4: For parsing HTML and navigating the DOM tree
  • Requests: For making HTTP requests to the target website
  • Pandas: For data manipulation, cleaning, and export
  • Tabulate: For formatting terminal output into clean Markdown tables
  • Openpyxl: For exporting Pandas DataFrames to Excel (.xlsx)

📊 Dataset Source

  • Website: Books to Scrape
  • Description: A sandbox website specifically designed for developers to practice web scraping. It contains a catalogue of 1000 books across 50 pages.

⚙️ Methodology & Implementation

  1. HTTP Request: Used the requests library to fetch the HTML content of the target webpage and verified a successful connection (Status Code: 200).
  2. HTML Parsing: Utilized BeautifulSoup with the html.parser to organize the raw HTML into a searchable structure.
  3. Data Extraction: Identified that all books are wrapped inside <article class="product_pod"> tags. Looped through these containers to extract:
    • Book Title: Extracted from the title attribute of the <a> tag nested inside each <h3>.
    • Price: Extracted from the text inside <p class="price_color">.
    • Availability: Extracted from <p class="instock availability"> and stripped of extra whitespace.
    • Rating: Extracted the second word from the CSS class in <p class="star-rating [Rating]"> (e.g., class star-rating Three yields "Three").
  4. Data Cleaning: Encountered and resolved a character encoding issue where the British Pound (£) symbol was prefixed with an unwanted Â character. Used Pandas string replacement (.str.replace('Â', '')) to ensure clean price data.
  5. Data Export: Compiled the cleaned data into a Pandas DataFrame and exported it to both:
    • data/books_data.csv (Industry standard for data pipelines)
    • data/books_data.xlsx (Neatly formatted rows and columns for visual presentation)
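
The five steps above can be sketched roughly as follows. This is an illustrative outline, not the contents of src/scrapper.py: the function names (`fetch_html`, `parse_books`, `clean_prices`, `export_books`), the column names, and the `BASE_URL` constant are assumptions. `SAMPLE_HTML` is a hand-written fragment modelled on the markup described above, so the parsing and cleaning logic can be tried without a network connection.

```python
from pathlib import Path

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://books.toscrape.com/"  # assumed target page (site home page)

# Hand-written fragment modelled on the markup described above (not real site output).
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="Example Book">Example ...</a></h3>
  <p class="price_color">Â£51.77</p>
  <p class="instock availability">
      In stock
  </p>
</article>
"""


def fetch_html(url=BASE_URL):
    """Step 1: fetch the page; raise_for_status() fails loudly on non-2xx codes."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_books(html):
    """Steps 2-3: parse the HTML and pull four fields from each product card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for pod in soup.find_all("article", class_="product_pod"):
        # The class attribute is a list such as ["star-rating", "Three"].
        rating = pod.find("p", class_="star-rating")["class"][1]
        rows.append({
            "Title": pod.h3.a["title"],
            "Price": pod.find("p", class_="price_color").get_text(strip=True),
            "Availability": pod.select_one("p.instock.availability").get_text(strip=True),
            "Rating": rating,
        })
    return rows


def clean_prices(df):
    """Step 4: drop the stray 'Â' that precedes the pound sign."""
    df = df.copy()
    df["Price"] = df["Price"].str.replace("Â", "", regex=False)
    return df


def export_books(df, out_dir="data"):
    """Step 5: write CSV and Excel copies (to_excel needs openpyxl installed)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    df.to_csv(out / "books_data.csv", index=False)
    df.to_excel(out / "books_data.xlsx", index=False)


# Offline demonstration on the sample fragment.
books = clean_prices(pd.DataFrame(parse_books(SAMPLE_HTML)))
print(books)
```

Against the live site, the same functions would chain as `export_books(clean_prices(pd.DataFrame(parse_books(fetch_html()))))`.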

📸 Output Previews

Terminal Output

[Screenshots: Terminal Output 1, Terminal Output 2]

Excel Output (.xlsx Format)

[Screenshot: Excel Output]

💡 Key Learnings & Challenges

  • HTML Navigation: Learned how to inspect webpage elements and traverse the DOM tree using BeautifulSoup's .find() and .find_all() methods.
  • Hidden Data Extraction: Discovered that sometimes data (like star ratings) isn't displayed as text but is embedded within CSS class names, requiring creative parsing logic.
  • Data Quality Control: Realized that scraped data is rarely perfect out of the box. Identifying and fixing the Â encoding bug reinforced the importance of data cleaning even at the collection stage.
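
A stray `Â` in front of `£` is a common symptom of UTF-8 bytes being decoded as Latin-1 (ISO-8859-1). The standalone sketch below reproduces the effect and reverses it using only the standard library; the function names are illustrative. Whether this is exactly what happened in the scraper is an assumption, although Requests does fall back to ISO-8859-1 for `text/*` responses whose headers declare no charset.

```python
def mojibake(text):
    """Encode as UTF-8, then wrongly decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(text):
    """Reverse the mistake: recover the original bytes and decode them as UTF-8."""
    return text.encode("latin-1").decode("utf-8")


price = "£51.77"
garbled = mojibake(price)
print(garbled)          # Â£51.77
print(repair(garbled))  # £51.77
```

If that is the cause, setting `response.encoding = "utf-8"` before reading `response.text`, or passing `response.content` (raw bytes) to BeautifulSoup, should avoid the stray character at the source instead of removing it afterwards.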

🚀 How to Run Locally

  1. Clone the repository:
    git clone https://github.com/PrinkleMahshwari/CodeAlpha_WebScraping.git
  2. Navigate to the project directory:
    cd CodeAlpha_WebScraping
  3. Install the required libraries:
    pip install -r requirements.txt
  4. Run the scraper:
    python src/scrapper.py

📂 Project Structure

CodeAlpha_WebScraping/
├── data/                   # Exported dataset files
│   ├── books_data.csv      # Raw comma-separated format
│   └── books_data.xlsx     # Formatted Excel spreadsheet
├── screenshots/            # Output previews for README
│   ├── excel.png
│   ├── terminal(1).png
│   └── terminal(2).png
├── src/                    # Source code directory
│   └── scrapper.py         # Main web scraping script
├── README.md               # Project documentation
└── requirements.txt        # Python dependencies

🙏 Acknowledgements

This project was completed as part of the CodeAlpha Data Analytics Internship Program.

Special thanks to CodeAlpha for providing this internship opportunity and to Books to Scrape for offering a public website specifically designed for web scraping practice and learning.


🔗 Important Links

| Resource | Link |
| --- | --- |
| Dataset Source | Books to Scrape |
| Internship Organization | CodeAlpha |
| GitHub Repository | CodeAlpha_WebScraping |
| GitHub Profile | PrinkleMahshwari |

📈 Skills Gained

Through this project, I gained practical experience in:

  • Web Scraping
  • Data Collection
  • HTML Parsing
  • BeautifulSoup
  • Requests Library
  • Data Cleaning
  • Data Processing
  • Pandas DataFrames
  • CSV Handling
  • Excel Export
  • Git & GitHub Documentation

🚀 Future Improvements

Possible future improvements for this project include:

  • Scraping all available pages instead of a single page
  • Collecting category-wise book information
  • Storing scraped data in a PostgreSQL database
  • Creating interactive dashboards using Power BI or Tableau
  • Automating scheduled scraping tasks
  • Performing additional data analysis on pricing and ratings
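
As a starting point for the first item, here is a rough sketch of multi-page scraping that follows the pager's "next" link until none is left. It assumes the pager is rendered as `<li class="next"><a href="...">`, which is how Books to Scrape appears to mark it up. `iter_pages` and its `fetch` parameter are hypothetical names, and the fetcher is passed in so the loop can be exercised offline with canned pages.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iter_pages(start_url, fetch, max_pages=50):
    """Yield (url, html) pairs, following the pager's "next" link.

    fetch is any callable mapping a URL to an HTML string, for example
    lambda u: requests.get(u, timeout=10).text on the live site.
    """
    url = start_url
    for _ in range(max_pages):  # hard cap guards against a pager loop
        html = fetch(url)
        yield url, html
        next_link = BeautifulSoup(html, "html.parser").select_one("li.next > a")
        if next_link is None:
            break
        # hrefs are relative, so resolve them against the current page URL
        url = urljoin(url, next_link["href"])


# Offline demonstration with two canned pages (hypothetical URLs).
FAKE_SITE = {
    "http://example.test/index.html":
        '<ul class="pager"><li class="next"><a href="catalogue/page-2.html">next</a></li></ul>',
    "http://example.test/catalogue/page-2.html":
        '<ul class="pager"><li class="current">Page 2 of 2</li></ul>',
}

visited = [url for url, _ in iter_pages("http://example.test/index.html", FAKE_SITE.__getitem__)]
print(visited)
```

Each yielded HTML string could then be handed to the same per-page extraction logic used for the single page, with the results concatenated before export.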

🎥 LinkedIn Project Demonstration

As part of the CodeAlpha Internship requirements, a project explanation video will be published on LinkedIn.

Status: ⏳ Recording and publication scheduled

LinkedIn Post Link: To be added after publication.


⭐ Internship Progress

| Task | Status |
| --- | --- |
| Web Scraping | ✅ Completed |
| Exploratory Data Analysis | ⏳ Pending |
| Data Visualization | ⏳ Pending |
| Sentiment Analysis | ⏳ Pending |
| LinkedIn Video Publication | ⏳ Scheduled |

📜 License

This project was developed for educational purposes and as part of the CodeAlpha Data Analytics Internship Program.


✅ Project Status

Status: Completed

Internship Task: Web Scraping

Primary Outcome: Successfully extracted, cleaned, and exported book data into CSV and Excel formats.

Submission Ready: Yes

Documentation Ready: Yes

GitHub Ready: Yes

LinkedIn Publication: Pending


👨‍💻 Author

Prinkle Kella

BS Software Engineering Student | Data Analytics Intern

  • GitHub: PrinkleMahshwari
  • LinkedIn: [Link to be added]
  • Project: CodeAlpha_WebScraping
  • Internship: CodeAlpha Data Analytics Internship

Thank you for visiting this repository. Feedback, suggestions, and improvements are always welcome.

