📚 CodeAlpha_WebScraping

A Python-based web scraping project developed for the CodeAlpha Data Analytics Internship that extracts book information from Books to Scrape and exports clean datasets in CSV and Excel formats.

🌟 Intern: Prinkle Kella | CodeAlpha Data Analytics Internship | June 2026

🎯 Project Objective

The objective of this task was to demonstrate web scraping by extracting live data from a mock e-commerce website. The project covers navigating HTML structures, handling common extraction challenges (such as parsing CSS classes for ratings), performing basic data cleaning, and exporting the structured data to both CSV and Excel formats for later analysis.

🛠️ Tools & Technologies

Python 3.12
BeautifulSoup4: For parsing HTML and navigating the DOM tree
Requests: For making HTTP requests to the target website
Pandas: For data manipulation, cleaning, and export
Tabulate: For formatting terminal output into clean Markdown tables
Openpyxl: For exporting Pandas DataFrames to Excel (.xlsx)

📊 Dataset Source

Website: Books to Scrape
Description: A sandbox website specifically designed for developers to practice web scraping. It contains a catalogue of 1000 books across 50 pages.

⚙️ Methodology & Implementation

HTTP Request: Used the requests library to fetch the HTML content of the target webpage and verified a successful connection (Status Code: 200).
HTML Parsing: Utilized BeautifulSoup with the html.parser to organize the raw HTML into a searchable structure.
Data Extraction: Identified that all books are wrapped inside <article class="product_pod"> tags. Looped through these containers to extract:
- Book Title: Extracted from the title attribute within the <h3> -> <a> tags.
- Price: Extracted from the text inside <p class="price_color">.
- Availability: Extracted from <p class="instock availability"> and stripped of extra whitespace.
- Rating: Extracted the second word from the CSS class in <p class="star-rating [Rating]"> (e.g., class star-rating Three yields "Three").
Data Cleaning: Encountered and resolved a character encoding issue where the British Pound (£) symbol was prefixed with an unwanted Â character. Used Pandas string replacement (.str.replace('Â', '')) to ensure clean price data.
Data Export: Compiled the cleaned data into a Pandas DataFrame and exported it to both:
- data/books_data.csv (Plain-text format widely used in data pipelines)
- data/books_data.xlsx (Neatly formatted rows and columns for visual presentation)
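
The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the `SAMPLE_HTML` snippet and the `parse_books` helper are hypothetical, written only to mirror the tag structure listed in the steps, and the sketch parses an inline string so that it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup mirroring the structure described above.
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability">
      In stock
  </p>
</article>
"""


def parse_books(html):
    """Return one dict per <article class="product_pod"> container."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for pod in soup.find_all("article", class_="product_pod"):
        rating_tag = pod.find("p", class_="star-rating")
        books.append({
            # The full title lives in the title attribute of <h3> -> <a>.
            "title": pod.h3.a["title"],
            "price": pod.find("p", class_="price_color").get_text(strip=True),
            "availability": pod.find("p", class_="availability").get_text(strip=True),
            # The class attribute is a list such as ["star-rating", "Three"];
            # the second entry is the rating word.
            "rating": rating_tag["class"][1],
        })
    return books


books = parse_books(SAMPLE_HTML)
print(books)
```

A list of dictionaries like this can be passed directly to `pandas.DataFrame(...)` before export.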

📸 Output Previews

Terminal Output

Excel Output (.xlsx Format)

💡 Key Learnings & Challenges

HTML Navigation: Learned how to inspect webpage elements and traverse the DOM tree using BeautifulSoup's .find() and .find_all() methods.
Hidden Data Extraction: Discovered that some data (such as star ratings) is not displayed as text but is encoded in CSS class names, so it has to be read from the element's class attribute rather than from its text.
Data Quality Control: Realized that scraped data is rarely perfect out of the box. Identifying and fixing the Â encoding bug reinforced the importance of data cleaning even at the collection stage.
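
A stray Â in front of £ is the typical result of UTF-8 bytes being decoded as Latin-1 (ISO-8859-1). The sketch below reproduces the symptom and reverses it using only the standard library. The helper names `mojibake` and `repair` are hypothetical, and whether this is the exact cause in this project is an assumption based on the symptom described above.

```python
def mojibake(text):
    """Simulate UTF-8 bytes being decoded as Latin-1 (ISO-8859-1)."""
    return text.encode("utf-8").decode("latin-1")


def repair(text):
    """Reverse the mis-decoding by round-tripping through Latin-1."""
    return text.encode("latin-1").decode("utf-8")


garbled = mojibake("£51.77")
fixed = repair(garbled)

print(garbled)  # Â£51.77
print(fixed)    # £51.77
```

If this is indeed the cause, an alternative to replacing the character afterwards is to set `response.encoding = "utf-8"` on the Requests response before reading `response.text`, which avoids the mis-decoding in the first place.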

🚀 How to Run Locally

Clone the repository:

git clone https://github.com/PrinkleMahshwari/CodeAlpha_WebScraping.git

Navigate to the project directory:
```
cd CodeAlpha_WebScraping
```
Install the required libraries:
```
pip install -r requirements.txt
```
Run the scraper:
```
python src/scrapper.py
```

📂 Project Structure

CodeAlpha_WebScraping/
├── data/                   # Exported dataset files
│   ├── books_data.csv      # Raw comma-separated format
│   └── books_data.xlsx     # Formatted Excel spreadsheet
├── screenshots/            # Output previews for README
│   ├── excel.png
│   ├── terminal(1).png
│   └── terminal(2).png
├── src/                    # Source code directory
│   └── scrapper.py         # Main web scraping script
├── README.md               # Project documentation
└── requirements.txt        # Python dependencies

🙏 Acknowledgements

This project was completed as part of the CodeAlpha Data Analytics Internship Program.

Dataset Source: Books to Scrape
Internship Organization: CodeAlpha
Repository: CodeAlpha_WebScraping

Special thanks to CodeAlpha for providing this internship opportunity and to Books to Scrape for offering a public website specifically designed for web scraping practice and learning.

🔗 Important Links

| Resource | Link |
| --- | --- |
| Dataset Source | Books to Scrape |
| Internship Organization | CodeAlpha |
| GitHub Repository | CodeAlpha_WebScraping |
| GitHub Profile | PrinkleMahshwari |

📈 Skills Gained

Through this project, I gained practical experience in:

Web Scraping
Data Collection
HTML Parsing
BeautifulSoup
Requests Library
Data Cleaning
Data Processing
Pandas DataFrames
CSV Handling
Excel Export
Git & GitHub Documentation

🚀 Future Improvements

Possible future improvements for this project include:

Scraping all available pages instead of a single page
Collecting category-wise book information
Storing scraped data in a PostgreSQL database
Creating interactive dashboards using Power BI or Tableau
Automating scheduled scraping tasks
Performing additional data analysis on pricing and ratings
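
As a starting point for the first improvement, scraping every page mostly comes down to generating one URL per catalogue page and repeating the existing fetch-and-parse step for each. The sketch below only builds the URL list. The `BASE_URL` pattern and the `page_urls` helper are assumptions for illustration and should be checked against the site's own "next" links before use.

```python
# Assumed URL pattern for catalogue pages; verify against the live site.
BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"


def page_urls(total_pages=50):
    """Build one URL per catalogue page, numbered from 1 to total_pages."""
    return [BASE_URL.format(n) for n in range(1, total_pages + 1)]


urls = page_urls()
print(len(urls))
print(urls[0])
```

Each URL would then go through the same request, parse, and clean steps as the single page, with the results appended to one DataFrame before export.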

🎥 LinkedIn Project Demonstration

As part of the CodeAlpha Internship requirements, a project explanation video will be published on LinkedIn.

Status: ⏳ Recording and publication scheduled

LinkedIn Post Link: To be added after publication.

⭐ Internship Progress

| Task | Status |
| --- | --- |
| Web Scraping | ✅ Completed |
| Exploratory Data Analysis | ⏳ Pending |
| Data Visualization | ⏳ Pending |
| Sentiment Analysis | ⏳ Pending |
| LinkedIn Video Publication | ⏳ Scheduled |

📜 License

This project was developed for educational purposes and as part of the CodeAlpha Data Analytics Internship Program.

✅ Project Status

Status: Completed

Internship Task: Web Scraping

Primary Outcome: Successfully extracted, cleaned, and exported book data into CSV and Excel formats.

Submission Ready: Yes

Documentation Ready: Yes

GitHub Ready: Yes

LinkedIn Publication: Pending

👨‍💻 Author

Prinkle Kella

BS Software Engineering Student | Data Analytics Intern

GitHub: PrinkleMahshwari
LinkedIn: [Link to be added]
Project: CodeAlpha_WebScraping
Internship: CodeAlpha Data Analytics Internship

Thank you for visiting this repository. Feedback, suggestions, and improvements are always welcome.
