A Python-based web scraping project developed for the CodeAlpha Data Analytics Internship that extracts book information from Books to Scrape and exports clean datasets in CSV and Excel formats.
Intern: Prinkle Kella | CodeAlpha Data Analytics Internship | June 2026
The objective of this task was to demonstrate web scraping capabilities by extracting data from a live sandbox e-commerce website. The project focuses on navigating HTML structures, handling common data extraction challenges (such as parsing CSS classes for ratings), performing basic data cleaning, and exporting the structured data to both CSV and Excel formats for later analysis.
- Python 3.12
- BeautifulSoup4: For parsing HTML and navigating the DOM tree
- Requests: For making HTTP requests to the target website
- Pandas: For data manipulation, cleaning, and export
- Tabulate: For formatting terminal output into clean Markdown tables
- Openpyxl: For exporting Pandas DataFrames to Excel (.xlsx)
- Website: Books to Scrape
- Description: A sandbox website specifically designed for developers to practice web scraping. It contains a catalogue of 1000 books across 50 pages.
- HTTP Request: Used the `requests` library to fetch the HTML content of the target webpage and verified a successful connection (Status Code: 200).
- HTML Parsing: Utilized `BeautifulSoup` with the `html.parser` to organize the raw HTML into a searchable structure.
- Data Extraction: Identified that all books are wrapped inside `<article class="product_pod">` tags. Looped through these containers to extract:
  - Book Title: Extracted from the `title` attribute within the `<h3>` -> `<a>` tags.
  - Price: Extracted from the text inside `<p class="price_color">`.
  - Availability: Extracted from `<p class="instock availability">` and stripped of extra whitespace.
  - Rating: Extracted the second word from the CSS class in `<p class="star-rating [Rating]">` (e.g., class `star-rating Three` yields "Three").
- Data Cleaning: Encountered and resolved a character encoding issue where the British Pound (£) symbol was prefixed with an unwanted `Â` character. Used Pandas string replacement (`.str.replace('Â', '')`) to ensure clean price data.
- Data Export: Compiled the cleaned data into a Pandas DataFrame and exported it to both:
  - `data/books_data.csv` (Industry standard for data pipelines)
  - `data/books_data.xlsx` (Neatly formatted rows and columns for visual presentation)
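The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `scrapper.py`: the `extract_books` function name and the inline sample HTML (with a made-up title and price) are assumptions, chosen so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Made-up sample markup that mirrors the structure described above.
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="Example Book">Example Bo...</a></h3>
  <p class="price_color">£12.34</p>
  <p class="instock availability">
      In stock
  </p>
</article>
"""


def extract_books(html):
    """Hypothetical helper: return one dict per product_pod container."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for pod in soup.find_all("article", class_="product_pod"):
        rating_tag = pod.find("p", class_="star-rating")
        books.append({
            # The full title lives in the title attribute of <h3> -> <a>.
            "title": pod.h3.a["title"],
            "price": pod.find("p", class_="price_color").get_text(strip=True),
            "availability": pod.find("p", class_="availability").get_text(strip=True),
            # The class attribute is a list, e.g. ["star-rating", "Three"].
            "rating": rating_tag["class"][1],
        })
    return books


books = extract_books(SAMPLE_HTML)
print(books)
```

In a real run, the HTML string would come from the `requests` response instead of `SAMPLE_HTML`, and the resulting list of dicts could be passed to `pandas.DataFrame` before export.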
- HTML Navigation: Learned how to inspect webpage elements and traverse the DOM tree using BeautifulSoup's `.find()` and `.find_all()` methods.
- Hidden Data Extraction: Discovered that sometimes data (like star ratings) isn't displayed as text but is embedded within CSS class names, requiring creative parsing logic.
- Data Quality Control: Realized that scraped data is rarely perfect out of the box. Identifying and fixing the `Â` encoding bug reinforced the importance of data cleaning even at the collection stage.
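One plausible cause of the stray `Â` before the pound sign is that UTF-8 bytes were decoded as Latin-1. This is an assumption, since the project notes do not diagnose the root cause. The sketch below reproduces the effect with a made-up price and shows two ways to recover a clean value; `clean_price` is a hypothetical helper, not a function from the project.

```python
# Assumption: the stray character comes from decoding UTF-8 bytes as Latin-1.
original = "£12.34"  # made-up sample price
garbled = original.encode("utf-8").decode("latin-1")  # becomes "Â£12.34"


def clean_price(text):
    """Hypothetical helper: drop the stray 'Â' and the pound sign, return a float."""
    return float(text.replace("Â", "").replace("£", "").strip())


# Alternative: reverse the wrong decoding instead of deleting characters.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled, clean_price(garbled), repaired)
```

If this is indeed the cause, setting `response.encoding = "utf-8"` before reading `response.text`, or passing `response.content` (raw bytes) to BeautifulSoup, may avoid the problem at the source rather than patching it afterwards.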
- Clone the repository: `git clone https://github.com/PrinkleMahshwari/CodeAlpha_WebScraping.git`
- Navigate to the project directory: `cd CodeAlpha_WebScraping`
- Install the required libraries: `pip install -r requirements.txt`
- Run the scraper: `python src/scrapper.py`
CodeAlpha_WebScraping/
├── data/                  # Exported dataset files
│   ├── books_data.csv     # Raw comma-separated format
│   └── books_data.xlsx    # Formatted Excel spreadsheet
├── screenshots/           # Output previews for README
│   ├── excel.png
│   ├── terminal(1).png
│   └── terminal(2).png
├── src/                   # Source code directory
│   └── scrapper.py        # Main web scraping script
├── README.md              # Project documentation
└── requirements.txt       # Python dependencies
This project was completed as part of the CodeAlpha Data Analytics Internship Program.
- Dataset Source: Books to Scrape
- Internship Organization: CodeAlpha
- Repository: CodeAlpha_WebScraping
Special thanks to CodeAlpha for providing this internship opportunity and to Books to Scrape for offering a public website specifically designed for web scraping practice and learning.
| Resource | Link |
|---|---|
| Dataset Source | Books to Scrape |
| Internship Organization | CodeAlpha |
| GitHub Repository | CodeAlpha_WebScraping |
| GitHub Profile | PrinkleMahshwari |
Through this project, I gained practical experience in:
- Web Scraping
- Data Collection
- HTML Parsing
- BeautifulSoup
- Requests Library
- Data Cleaning
- Data Processing
- Pandas DataFrames
- CSV Handling
- Excel Export
- Git & GitHub Documentation
Possible future improvements for this project include:
- Scraping all available pages instead of a single page
- Collecting category-wise book information
- Storing scraped data in a PostgreSQL database
- Creating interactive dashboards using Power BI or Tableau
- Automating scheduled scraping tasks
- Performing additional data analysis on pricing and ratings
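As a starting point for the first improvement, the page URLs could be generated up front and then fetched one by one. The sketch below is only an outline: the `page-N.html` URL pattern and the `page_urls` helper are assumptions that should be checked against the live site, and the default of 50 pages comes from the site description earlier in this README.

```python
# Assumed URL pattern for the paginated catalogue; verify before relying on it.
BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"


def page_urls(total_pages=50):
    """Hypothetical helper: build the list of catalogue page URLs."""
    return [BASE_URL.format(number) for number in range(1, total_pages + 1)]


urls = page_urls()
print(len(urls), urls[0], urls[-1])
```

Each URL would then go through the same fetch, parse, and extract steps used for the single page, ideally with a short pause between requests.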
As part of the CodeAlpha Internship requirements, a project explanation video will be published on LinkedIn.
Status: Recording and publication scheduled
LinkedIn Post Link: To be added after publication.
| Task | Status |
|---|---|
| Web Scraping | Completed |
| Exploratory Data Analysis | Pending |
| Data Visualization | Pending |
| Sentiment Analysis | Pending |
| LinkedIn Video Publication | Scheduled |
This project was developed for educational purposes and as part of the CodeAlpha Data Analytics Internship Program.
Status: Completed
Internship Task: Web Scraping
Primary Outcome: Successfully extracted, cleaned, and exported book data into CSV and Excel formats.
Submission Ready: Yes
Documentation Ready: Yes
GitHub Ready: Yes
LinkedIn Publication: Pending
Prinkle Kella
BS Software Engineering Student | Data Analytics Intern
- GitHub: PrinkleMahshwari
- LinkedIn: [Link to be added]
- Project: CodeAlpha_WebScraping
- Internship: CodeAlpha Data Analytics Internship
Thank you for visiting this repository. Feedback, suggestions, and improvements are always welcome.