- Project Overview
- Data Source
- Python Tools
- Execution and Steps
- Conclusion and Recommendation
- References
## Project Overview

This is a simple Python web scraping project that demonstrates a clear procedure for getting HTML data (tables) from the websites and web pages of organisations that permit this process. The HTML tags were identified by their headers, cleaned, and read into a pandas DataFrame to give the data a proper table structure before exporting it as a CSV file. This also helps data scientists and analysts source original data themselves.
## Data Source

The data was scraped from Wikipedia's "List of largest companies in the United States by revenue" (publicly traded companies, 2023 fiscal year).
## Python Tools

- Requests library: for sending the HTTP request to the web page
- BeautifulSoup: for finding and extracting the HTML tags and data
- Pandas: for organising the rows and columns into a DataFrame
## Execution and Steps

The following actions and steps were executed in this web scraping exercise:
- Importing all libraries and packages
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
```
- Creating a URL variable for the web page
```python
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue'
```
- Sending a GET request to retrieve the page's HTML
```python
page = requests.get(url)
```
- Parsing the HTML and identifying the tags that hold the table needed
```python
soup = BeautifulSoup(page.text, 'html.parser')
```
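The DataFrame step further down references a `world_table_titles` variable that is never built in the snippets shown. A minimal sketch of how those column headers might be collected, using a small inline HTML sample in place of the live Wikipedia page (the `wikitable` class and the table layout here are assumptions about the page's markup):

```python
from bs4 import BeautifulSoup

# Small inline sample standing in for the Wikipedia page (assumption:
# the real table's column headers live in <th> cells, as is typical).
html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th><th>Revenue (USD millions)</th></tr>
  <tr><td>1</td><td>Walmart</td><td>611,289</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# On the real page, soup.find_all('table') returns several tables;
# the one needed has to be picked out by index or class.
table = soup.find('table', class_='wikitable')

# Strip whitespace from each header cell's text.
world_table_titles = [th.text.strip() for th in table.find_all('th')]
print(world_table_titles)
```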
- Cleaning the scraped table headers and storing them as the columns of a pandas DataFrame

```python
df = pd.DataFrame(columns=world_table_titles)
```
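The step above only creates an empty DataFrame; the row data still has to be added. One common way to fill it (a self-contained sketch, assuming each data row sits in a `<tr>` of `<td>` cells, with an inline HTML sample standing in for the scraped page):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline sample standing in for the scraped Wikipedia table (assumption).
html = """
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Walmart</td></tr>
  <tr><td>2</td><td>Amazon</td></tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser').find('table')
world_table_titles = [th.text.strip() for th in table.find_all('th')]
df = pd.DataFrame(columns=world_table_titles)

# Skip the header row, then append each row's cleaned cell text.
for row in table.find_all('tr')[1:]:
    cells = [td.text.strip() for td in row.find_all('td')]
    df.loc[len(df)] = cells

print(df)
```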
```python
df
```

- Exporting the pandas DataFrame as a CSV file to a specified directory

```python
df.to_csv(r'C:\Users\HP ELITEBOOK 840 G3\Desktop\Mandy programs/company.csv', index=False)
```

## Conclusion and Recommendation

Though web scraping is not accepted or allowed on every website and page, it can be a powerful skill for pulling data without depending on other sources or personnel for a dataset. By mastering these libraries and a little coding syntax, the process can be done seamlessly, especially when a web page grants permission. Check out the full Python syntax for this project and exercise.
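The steps above can be combined into one short, runnable sketch. To stay self-contained it parses an inline HTML sample instead of fetching the live Wikipedia page (for the real exercise you would use `requests.get(url).text`), and it writes the CSV to the working directory rather than a hard-coded path:

```python
import pandas as pd
from bs4 import BeautifulSoup

# For the live page, replace this sample with:
#   import requests
#   html = requests.get(url).text
html = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Revenue (USD millions)</th></tr>
  <tr><td>1</td><td>Walmart</td><td>611,289</td></tr>
</table>
"""

# Parse the table, take <th> cells as column names, <td> rows as data.
table = BeautifulSoup(html, 'html.parser').find('table')
titles = [th.text.strip() for th in table.find_all('th')]
df = pd.DataFrame(columns=titles)
for row in table.find_all('tr')[1:]:
    df.loc[len(df)] = [td.text.strip() for td in row.find_all('td')]

# Export without the index column, as in the step above.
df.to_csv('company.csv', index=False)
```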