Website classification

Infers the domain name from website screenshots

How to use

1. Clone the repository
2. Install the dependencies listed in requirements.txt
3. Run infer.py
4. Choose the strategy to infer with (SGD or SVM)
5. Enter the path to the input image

About

This implementation was trained on the following 10 domains: 'The Guardian', 'Spiegel', 'CNN', 'BBC', 'Amazon', 'Ebay', 'Njuskalo(hr)', 'Google', 'Github', 'Youtube'

Data collection

The method is explained in detail here

Data preparation

- Scraped 2,500+ screenshots from the 10 websites using Selenium (around 250 samples per website), sampling different parts of each site. See the link above for the data collection method.
- Extracted image features using HOG (histogram of oriented gradients) after resizing the images.
- Saved the features in CSV format for model training.

Data preparation notebook can be inspected here
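
The resize, HOG, and CSV steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: it assumes scikit-image is used for HOG, and the 64x64 target size, the HOG parameters, the helper names `extract_hog_features` and `features_to_csv`, and the random stand-in images are all assumptions.

```python
import csv
import io

import numpy as np
from skimage.feature import hog
from skimage.transform import resize


def extract_hog_features(image, size=(64, 64)):
    """Resize a grayscale image and return its HOG descriptor as a 1-D array."""
    resized = resize(image, size, anti_aliasing=True)
    return hog(
        resized,
        orientations=9,
        pixels_per_cell=(8, 8),
        cells_per_block=(2, 2),
    )


def features_to_csv(rows, labels):
    """Write one row per sample (label first, then features); return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for label, features in zip(labels, rows):
        writer.writerow([label, *features.tolist()])
    return buffer.getvalue()


# Random arrays stand in for grayscale screenshots (assumption for the sketch).
rng = np.random.default_rng(0)
screenshots = [rng.random((300, 400)) for _ in range(3)]
labels = ["BBC", "CNN", "Github"]

features = [extract_hog_features(img) for img in screenshots]
csv_text = features_to_csv(features, labels)
```

Resizing every screenshot to the same size before calling `hog` keeps the feature vector length identical across samples, which is what lets the features be stored as rows of one CSV table.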

Sample screenshots can be inspected in the assets folder

Model training

Two strategies were implemented:
1. SGD (SVM) + RFECV
2. SVM + PCA

Both include:
- Splitting the data into train and test sets. The test set is held out for model evaluation.
- Scaling the dataset (train and test sets were scaled separately to prevent data leakage).
- Reducing dimensionality (n_features) with RFECV (supervised recursive feature elimination with cross-validation) or PCA (unsupervised).
- Training classification models while searching for the best hyperparameters with grid-search cross-validation.
- Serializing the scalers, dimension reducers, and classifiers as a pipeline for later use.
- Evaluating model performance with classification reports and confusion matrices.
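
The steps above can be sketched with scikit-learn roughly as follows. This is an illustrative outline under stated assumptions, not the notebooks' code: the synthetic data stands in for the HOG feature table (3 classes instead of 10), and the parameter grids, step sizes, and component counts are made up for the example. Here the scaler is fit on the training split only (inside the pipeline) and then applied to the test split, which is one common way to avoid leakage; the notebooks may handle scaling differently.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFECV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the HOG feature table (assumption for the sketch).
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Each strategy is a (pipeline, hyperparameter grid) pair.
strategies = {
    "SGD + RFECV": (
        Pipeline([
            ("scale", StandardScaler()),
            # RFECV uses the labels, so it is a supervised selector.
            ("reduce", RFECV(SGDClassifier(loss="hinge", random_state=0), step=5, cv=3)),
            ("clf", SGDClassifier(loss="hinge", random_state=0)),
        ]),
        {"clf__alpha": [1e-4, 1e-3]},
    ),
    "SVM + PCA": (
        Pipeline([
            ("scale", StandardScaler()),
            ("reduce", PCA(n_components=10, random_state=0)),
            ("clf", SVC()),
        ]),
        {"clf__C": [1, 10], "clf__kernel": ["linear", "rbf"]},
    ),
}

fitted = {}
for name, (pipeline, param_grid) in strategies.items():
    # Grid search cross-validates on the training split only.
    search = GridSearchCV(pipeline, param_grid, cv=3)
    search.fit(X_train, y_train)
    fitted[name] = search.best_estimator_

    # The held-out test split is used once, for the final report.
    y_pred = search.predict(X_test)
    print(name, search.best_params_)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
```

Keeping the scaler and the dimension reducer inside the `Pipeline` means they are refit within each cross-validation fold, so the validation folds never influence the preprocessing.
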

Model training notebook with SGD + RFECV can be inspected here

Model training notebook with SVM + PCA can be inspected here

Inference

Input images are expected not to contain a scrollbar. The notebooks include a scrollbar removal function.
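
As a rough idea of what scrollbar removal can look like, the sketch below simply crops a fixed-width strip from the right edge of the image. The function name `crop_scrollbar` and the 17-pixel default width are hypothetical; the function in the notebooks may detect or remove the scrollbar differently.

```python
import numpy as np


def crop_scrollbar(image, scrollbar_width=17):
    """Return the image without its rightmost `scrollbar_width` pixel columns.

    The fixed width is an assumption; real scrollbar widths vary by
    browser and operating system.
    """
    if scrollbar_width <= 0:
        return image
    if scrollbar_width >= image.shape[1]:
        raise ValueError("scrollbar_width must be smaller than the image width")
    return image[:, :-scrollbar_width]


# A blank array stands in for a 1366x768 RGB screenshot.
screenshot = np.zeros((768, 1366, 3), dtype=np.uint8)
cropped = crop_scrollbar(screenshot)
```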

Inference notebook can be inspected here

The program consists of infer.py and infer_functions.py
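
For orientation, loading a serialized pipeline and classifying one new feature vector might look like the sketch below. It is not taken from infer.py: the use of joblib, the file name `svm_pca_pipeline.joblib`, and the toy two-class data are assumptions made so the example runs on its own.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy, well-separated features stand in for HOG vectors of two domains.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (20, 8)), rng.normal(5, 1, (20, 8))])
y_train = np.array(["BBC"] * 20 + ["CNN"] * 20)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=4)),
    ("clf", SVC()),
])
pipeline.fit(X_train, y_train)

# Serialize the whole pipeline, then load it back as inference code would.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "svm_pca_pipeline.joblib"  # hypothetical file name
    joblib.dump(pipeline, model_path)
    loaded = joblib.load(model_path)

# The loaded pipeline applies scaling, reduction, and classification in one call.
new_sample = np.full((1, 8), 5.0)
prediction = loaded.predict(new_sample)[0]
```

Because the scaler and reducer are stored together with the classifier, the inference side only needs to produce features the same way as during training (same resize and HOG settings) before calling `predict`.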
