This repository was archived by the owner on May 3, 2026. It is now read-only.

hakanErgin/website-classification

Website classification

Infers which website (domain) a screenshot was taken from

How to use

  • Clone the repository
  • Install the dependencies listed in requirements.txt
  • Run infer.py
  • Choose the strategy to infer with (SGD or SVM)
  • Enter the file path of the screenshot

About

This implementation was trained on the following 10 domains: 'The Guardian', 'Spiegel', 'CNN', 'BBC', 'Amazon', 'Ebay', 'Njuskalo(hr)', 'Google', 'Github', 'Youtube'

  • Data collection

    The method is explained in detail here

  • Data preparation

    • Scraped 2500+ screenshots from the 10 websites listed above using Selenium (around 250 samples per website), sampling different parts of each website. See the link above for the data collection method.
    • Extracted image features using HOG, after resizing the images.
    • Saved into CSV format for model training.
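
The resize and HOG steps above can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes scikit-image is available, and the target size and HOG parameters (`TARGET_SHAPE`, 9 orientations, 16x16-pixel cells, 2x2-cell blocks) are illustrative values that the README does not state. The helper name `extract_hog_features` is hypothetical.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize

# Assumed target size; the value used in the notebook is not given in this README.
TARGET_SHAPE = (128, 128)


def extract_hog_features(image: np.ndarray) -> np.ndarray:
    """Resize a screenshot and return its HOG descriptor as a 1-D vector."""
    if image.ndim == 3:
        # Drop an alpha channel if present, then convert RGB to grayscale.
        image = rgb2gray(image[..., :3])
    resized = resize(image, TARGET_SHAPE, anti_aliasing=True)
    return hog(
        resized,
        orientations=9,
        pixels_per_cell=(16, 16),
        cells_per_block=(2, 2),
        block_norm="L2-Hys",
    )


# Random pixels stand in for a real screenshot here.
fake_screenshot = np.random.default_rng(0).random((768, 1024, 3))
features = extract_hog_features(fake_screenshot)
print(features.shape)
```

Because every screenshot is resized to the same shape first, each one yields a feature vector of the same length, so the vectors can be stacked as rows (with a label column) and written to CSV.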

    Data preparation notebook can be inspected here

    Sample screenshots can be inspected in the assets folder

  • Model training

    Implemented 2 different strategies

    1. SGD(SVM) + RFECV

    2. SVM + PCA

      Both include:

    • Splitting the data into train and test sets. The test set is held out for model evaluation.
    • Scaling the dataset (the train and test sets were scaled separately to prevent data leakage).
    • Reducing the number of features (n_features) with RFECV (supervised recursive feature elimination) or PCA (unsupervised).
    • Training classification models while searching for the best hyperparameters with grid search cross-validation.
    • Serializing the scalers, dimension reducers and classifiers as a pipeline for later use.
    • Evaluating model performance with classification reports and confusion matrices.
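
As a rough sketch of these steps for the SVM + PCA strategy, assuming scikit-learn and joblib: the bundled 10-class digits dataset stands in for the HOG feature CSV, and the parameter grid, the PCA variance threshold and the file name are illustrative assumptions rather than the notebook's values. In this sketch the scaler sits inside the pipeline, so it is fitted on training data only and then applied to the test set.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: 10-class digits instead of the HOG features of screenshots.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaler, dimension reducer and classifier chained into one estimator.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),  # keep 95% of the variance (assumed value)
    ("svm", SVC()),
])

# Small illustrative grid; step parameters use the "<step>__<param>" naming.
param_grid = {"svm__C": [1, 10], "svm__kernel": ["rbf", "linear"]}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = search.predict(X_test)
accuracy = search.score(X_test, y_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Serialize the whole fitted pipeline. A temporary directory is used only to
# keep this sketch self-contained; a real run would write to a permanent path.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "svm_pca_pipeline.joblib")
    joblib.dump(search.best_estimator_, model_path)
    restored = joblib.load(model_path)
```

Serializing the pipeline as one object means inference only has to load a single file and call `predict`, with scaling and dimension reduction applied automatically in the right order.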

    Model training notebook with SGD + RFECV can be inspected here
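
For orientation, the SGD(SVM) + RFECV strategy might look roughly like this in scikit-learn. It is a sketch under assumptions, not the notebook's code: the digits dataset again stands in for the HOG features, and `step`, `cv` and `min_features_to_select` are illustrative values. `SGDClassifier` with hinge loss is a linear SVM trained by stochastic gradient descent.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFECV
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 10-class digits instead of the HOG features of screenshots.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def make_linear_svm() -> SGDClassifier:
    """Linear SVM fitted with SGD (hinge loss)."""
    return SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=0)


sgd_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    # RFECV repeatedly drops the weakest features (ranked by the estimator's
    # coefficients) and picks the feature count with the best CV score.
    ("rfecv", RFECV(estimator=make_linear_svm(), step=8, cv=3,
                    min_features_to_select=8)),
    ("sgd", make_linear_svm()),
])
sgd_pipeline.fit(X_train, y_train)

n_selected = sgd_pipeline.named_steps["rfecv"].n_features_
test_accuracy = sgd_pipeline.score(X_test, y_test)
print(n_selected, test_accuracy)
```

A grid search over the SGD hyperparameters (for example `sgd__alpha`) can wrap this pipeline in the same way `GridSearchCV` wraps any other estimator.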

    Model training notebook with SVM + PCA can be inspected here

  • Inference

    Input images are expected not to contain a scrollbar. The notebooks include a scrollbar removal function.
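
The notebooks' scrollbar removal function is not reproduced here. One simple way to achieve the same effect, shown purely as an assumption, is to crop a fixed-width strip from the right edge of the screenshot. The function name `crop_right_strip` and the default width of 17 pixels are hypothetical; scrollbar width varies by browser and operating system.

```python
import numpy as np


def crop_right_strip(image: np.ndarray, strip_width: int = 17) -> np.ndarray:
    """Return a copy of the image without its rightmost strip_width columns.

    Hypothetical helper: assumes a vertical scrollbar of known width on the
    right edge of the screenshot.
    """
    width = image.shape[1]
    if not 0 <= strip_width < width:
        raise ValueError("strip_width must be in [0, image width)")
    return image[:, : width - strip_width].copy()


# A blank 1024-pixel-wide image stands in for a real screenshot.
screenshot = np.zeros((768, 1024, 3), dtype=np.uint8)
cropped = crop_right_strip(screenshot)
print(cropped.shape)
```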

    Inference notebook can be inspected here

    The program is implemented in infer.py and infer_functions.py
