Data science project analyzing Copenhagen's bicycle market using web-scraped business data, government population statistics, and geospatial analysis. Built regression and SVM models to forecast seasonal demand and identify high-potential areas for bicycle businesses.
- Data collection: scraped bicycle shops from Google Maps with Selenium across 6 Danish search queries, collecting name, rating, review count, category, address, phone, website, and GPS coordinates for each business.
- Data cleaning: imputed missing ratings and review counts with the column median, and extracted 4-digit Danish postal codes from address strings using regex.
- Data merging: joined business data with Copenhagen population statistics from Statistics Denmark on postal code.
- Exploratory analysis: rating distributions, review count vs. rating scatter plots, shops per postal code, category breakdowns, and population by gender for each area.
- Geospatial visualization: mapped shop locations across Copenhagen using GeoPandas and Folium interactive heatmaps.
- Predictive modeling: trained regression and SVM models to forecast seasonal demand and market trends.
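
As an illustration of the cleaning step, here is a minimal sketch of median imputation with pandas. The sample rows, the column names (`rating`, `review_count`), and the helper `impute_median` are assumptions made for this example; they are not taken from `clean_data.py`.

```python
import pandas as pd

# Hypothetical sample rows; the column names are assumed, not the project's schema.
shops = pd.DataFrame({
    "name": ["Shop A", "Shop B", "Shop C", "Shop D"],
    "rating": [4.5, None, 3.9, 4.8],
    "review_count": [120, 15, None, 60],
})


def impute_median(df, columns):
    """Return a copy of df where NaNs in `columns` are replaced by each column's median."""
    out = df.copy()
    for col in columns:
        # Series.median() skips NaN values by default.
        out[col] = out[col].fillna(out[col].median())
    return out


cleaned = impute_median(shops, ["rating", "review_count"])
print(cleaned)
```

The median is less sensitive than the mean to a handful of shops with very large review counts, which is one plausible reason to prefer it here.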
| Layer | Technology |
|---|---|
| Web Scraping | Selenium, BeautifulSoup |
| Data Processing | pandas, NumPy |
| Geospatial Analysis | GeoPandas, Shapely, Folium |
| Machine Learning | scikit-learn (Regression, SVM) |
| Visualization | Matplotlib, Seaborn |
| Data Sources | Google Maps, Statistics Denmark, DBA.dk |

    data-science/
    ├── A 1/
    │   └── Assignment 1/
    │       ├── pandas_code.py              # initial data exploration
    │       ├── process_yelp.ipynb          # Yelp dataset processing
    │       └── schema.sql                  # database schema
    ├── A 2/
    │   ├── Code/
    │   │   ├── google_maps_scraping.py     # Selenium scraper for Google Maps
    │   │   └── clean_data.py               # data cleaning pipeline
    │   └── Dataset for Safety/             # raw scraped datasets
    ├── A 2 Milestone 3/
    │   ├── cleaning_and_conversion_and_unconverted_dataset/
    │   │   ├── google_maps.py              # Google Maps data cleaning
    │   │   ├── merging_data.py             # merge business and population data
    │   │   ├── json_to_csv.py              # DBA.dk JSON to CSV conversion
    │   │   └── extracting_population.py
    │   ├── Datasets/
    │   │   ├── google_maps.csv
    │   │   ├── Copenhagen_Population.xlsx
    │   │   └── merged_business_population.csv
    │   └── copenhagen_analysis.ipynb       # full analysis notebook
    └── README.md

Install the dependencies:

    pip install pandas numpy scikit-learn selenium geopandas shapely folium matplotlib seaborn webdriver-manager openpyxl

For the scraper:

    python "A 2/Code/google_maps_scraping.py"

For the full analysis:

    cd "A 2 Milestone 3"
    jupyter notebook copenhagen_analysis.ipynb

Scraping Google Maps with Selenium is tricky. The page loads results dynamically, and class names change between sessions. I handled stale element exceptions with a retry loop and used scroll detection to know when all results were loaded. The scraper ran 6 different Danish search queries to cover all bicycle-related business categories.
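
The retry loop and scroll detection could look roughly like the sketch below. It is written against plain callables rather than a live browser so it runs without Selenium. The helper names `retry` and `scroll_until_stable`, their parameters, and the stopping rule (stop once the result count has not grown for a few consecutive scrolls) are assumptions for illustration, not the code in `google_maps_scraping.py`.

```python
import time


def retry(action, exceptions, attempts=3, delay=0.5):
    """Call action() until it succeeds; re-raise the error after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)


def scroll_until_stable(count_results, scroll_once, patience=3, pause=1.0, max_scrolls=200):
    """Scroll until the result count stops growing for `patience` consecutive scrolls.

    count_results: callable returning how many results are currently loaded.
    scroll_once:   callable that scrolls the results panel one step.
    """
    last_count = count_results()
    unchanged = 0
    for _ in range(max_scrolls):
        scroll_once()
        time.sleep(pause)  # give dynamically loaded results time to appear
        current = count_results()
        if current > last_count:
            last_count = current
            unchanged = 0
        else:
            unchanged += 1
            if unchanged >= patience:
                break
    return last_count
```

With Selenium, `exceptions` would typically be `StaleElementReferenceException` from `selenium.common.exceptions`, `count_results` would count the result elements currently in the DOM, and `scroll_once` would scroll the results panel. The `max_scrolls` cap is a safety limit so the loop always terminates.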
Merging datasets on postal code sounds simple, but the scraped addresses store postal codes inconsistently: sometimes embedded in a full address string, sometimes standalone. I used regex to extract the first 4-digit sequence from each address field before merging.
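
A minimal sketch of that extraction and merge is shown below. The sample addresses, the population figures (made up), the column names, and the helper `extract_postal_code` are assumptions for illustration. The pattern also adds word boundaries so that four digits inside a longer number, such as a phone number, are not picked up; that refinement is an assumption, not necessarily what `merging_data.py` does.

```python
import re

import pandas as pd

# First standalone 4-digit sequence; \b keeps longer digit runs from matching.
POSTAL_CODE = re.compile(r"\b(\d{4})\b")


def extract_postal_code(address):
    """Return the first standalone 4-digit sequence in address, or None if absent."""
    if not isinstance(address, str):
        return None
    match = POSTAL_CODE.search(address)
    return match.group(1) if match else None


# Hypothetical shop rows covering the embedded, standalone, and missing cases.
shops = pd.DataFrame({
    "name": ["Shop A", "Shop B", "Shop C", "Shop D"],
    "address": ["Nørrebrogade 12, 2200 København N", "2100", "Unknown address", None],
})

# Made-up population figures, for illustration only.
population = pd.DataFrame({
    "postal_code": ["2100", "2200"],
    "population": [1000, 2000],
})

shops["postal_code"] = shops["address"].map(extract_postal_code)

# A left join keeps every shop; the _merge column shows which rows found a match.
merged = shops.merge(population, on="postal_code", how="left", indicator=True)
print(merged[["name", "postal_code", "population", "_merge"]])
```

Keeping postal codes as strings on both sides avoids dtype mismatches in the join, and the `indicator=True` column makes it easy to count shops whose address yielded no usable postal code.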
Geospatial analysis showed clear clustering of bicycle businesses in central Copenhagen postal codes. Population density alone does not predict shop density; tourist areas and proximity to cycling infrastructure matter more.
Built by Abdullah Khalid