Infers the domain name from website screenshots
- Clone the repository
- Install requirements.txt
- Run `infer.py`
- Choose strategy to infer with (SGD, SVM)
- Insert file path
This implementation was trained on the following 10 domains: 'The Guardian', 'Spiegel', 'CNN', 'BBC', 'Amazon', 'Ebay', 'Njuskalo(hr)', 'Google', 'Github', 'Youtube'
The method is explained in detail here
- Scraped 2500+ screenshots from the 10 websites listed above using Selenium (around 250 samples per website), sampling different parts of each website. Please see the link above for the data collection method.
- Extracted image features using HOG, after resizing the images.
- Saved into CSV format for model training.
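
The preparation steps above can be sketched roughly as follows. This is a minimal sketch rather than the repository's code: it assumes scikit-image for resizing and HOG. The 64x64 target size, the HOG parameters, the `extract_hog_features` helper and the CSV layout are all illustrative assumptions.

```python
import csv
import io

import numpy as np
from skimage.feature import hog
from skimage.transform import resize


def extract_hog_features(image, size=(64, 64)):
    """Resize a grayscale image and return its HOG feature vector.

    Hypothetical helper: the target size and HOG parameters are
    assumptions chosen for illustration.
    """
    resized = resize(image, size, anti_aliasing=True)
    return hog(
        resized,
        orientations=9,
        pixels_per_cell=(8, 8),
        cells_per_block=(2, 2),
    )


# Stand-in for a grayscale screenshot; a real run would load image files.
rng = np.random.default_rng(0)
screenshot = rng.random((300, 400))

features = extract_hog_features(screenshot)

# One row per screenshot: the label followed by the feature values.
# An in-memory buffer is used here; a file opened with newline="" works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["label"] + [f"f{i}" for i in range(len(features))])
writer.writerow(["Github"] + features.tolist())
```

Resizing every screenshot to the same size before HOG keeps the feature vector length identical across samples, which is what allows the rows to share one CSV header.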
Data preparation notebook can be inspected here
Sample screenshots can be inspected in the assets folder
Implemented 2 different strategies
- Splitting the data into test and train. Test dataset is kept for model evaluation.
- Dataset scaling (test and train sets were scaled separately to prevent data leakage).
- Reduced dimension (n_features) with RFECV (supervised recursive feature elimination) or PCA (unsupervised), depending on the strategy.
- Training classification models while finding the best hyperparameters with grid search cross-validation.
- Scalers, dimension reducers and classifiers were serialized as a pipeline for later use.
- Evaluating model performances with classification reports and confusion matrices.
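
The training steps listed above can be sketched for the SVM + PCA strategy as follows. This is a minimal sketch assuming scikit-learn and joblib, not the notebook's actual code: the synthetic data stands in for the HOG feature table, and the parameter grid, fold count and file name are assumptions. Here the scaler is fit on the training data only and then applied to the test data, which is one common way to avoid leakage; the notebooks may handle this differently.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the HOG feature table (3 classes instead of 10).
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=10,
    n_classes=3, random_state=0,
)

# Hold out a test set that is only used for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaler, dimension reducer and classifier chained in one pipeline, so each
# step is fit on the training folds only during cross-validation.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", SVC()),
])

# Small illustrative grid; the values are assumptions.
param_grid = {
    "pca__n_components": [5, 10],
    "clf__C": [0.1, 1, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)

# Evaluate the best pipeline on the held-out test set.
y_pred = search.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Serialize the fitted pipeline for later inference, then reload it.
model_path = os.path.join(tempfile.mkdtemp(), "svm_pca_pipeline.joblib")
joblib.dump(search.best_estimator_, model_path)
restored = joblib.load(model_path)
```

Serializing the whole pipeline rather than the classifier alone means the same scaling and dimension reduction are applied automatically at inference time.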
Model training notebook with SGD + RFECV can be inspected here
Model training notebook with SVM + PCA can be inspected here
Input images are not expected to have a scrollbar. There is a scrollbar removal function in the notebooks.
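
One simple way to remove a scrollbar is to crop a fixed-width strip from the right edge of the screenshot. The sketch below illustrates that idea with NumPy. The `remove_scrollbar` helper and the 17-pixel default width are hypothetical, and the function in the notebooks may work differently.

```python
import numpy as np


def remove_scrollbar(image, scrollbar_width=17):
    """Crop a fixed-width strip from the right edge of a screenshot array.

    Hypothetical helper: the 17-pixel default is an assumed scrollbar
    width and may not match the value used in the notebooks.
    """
    if scrollbar_width <= 0:
        return image
    return image[:, :-scrollbar_width]


# Stand-in for a 1024x768 RGB screenshot (height, width, channels).
screenshot = np.zeros((768, 1024, 3), dtype=np.uint8)
cropped = remove_scrollbar(screenshot)
```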
Inference notebook can be inspected here
The program is implemented in `infer.py` and `infer_functions.py`