
YtSpamScanner

Table of Contents

  1. General Information
  2. Project Milestones
  3. High Level Overview
  4. Data Analysis
  5. Code State
  6. Contributions
  7. How to run and debug

General Information

Task: Text Analytics project

Team Members: Angelina Basova, Abdulghani Almasri, Paul Dietze, Vivian Kazakova

Mail Addresses: angelina.basova@stud.uni-heidelberg.de, abdulghani.almasri@stud.uni-heidelberg.de, cl250@uni-heidelberg.de, vivian.kazakova@stud.uni-heidelberg.de

Existing Code Fragments: sklearn models (SVM, Logistic Regression, Naive Bayes), spaCy, nltk, YT-Spammer-Purge

Utilized libraries: see the models-requirements and middleware-requirements files

Contributions: see table below


Please import the dashboard configuration file export.ndjson into Kibana (Stack Management > Saved Objects > Import) before starting the containers with Docker Compose.
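
If you prefer to script this step, Kibana also exposes a Saved Objects import API; a minimal sketch, assuming Kibana is reachable at its default local address (add authentication if security is enabled in your setup):

```python
import requests

# Scripted alternative to the manual Kibana UI import (illustrative only).
# Assumes Kibana listens on localhost:5601.
with open("export.ndjson", "rb") as f:
    response = requests.post(
        "http://localhost:5601/api/saved_objects/_import?overwrite=true",
        headers={"kbn-xsrf": "true"},  # required header for Kibana's API
        files={"file": ("export.ndjson", f, "application/ndjson")},
    )
response.raise_for_status()
```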


Project Milestones

  • setup ES and Kibana, setup containers and debug configurations, obtain existing spam collection dataset (YouTube Spam Collection Data Set)
  • implement pipeline for extracting video comments using the YouTube Data API, implement and store models (SVM, LR, NB), implement pipeline for loading and storing data in ES, create FastAPI functions, create first (example) dashboards
  • develop first prototype of the interface
  • decide on spam classifier, improve preprocessing pipeline, improve frontend by including the results of the scanning (= spam comments and embedded dashboards)
  • clean and comment code
  • last updates and fixes
  • create video presentation and merge final code
  • write and hand in report

High Level Overview

(Architecture diagram)

Architecture Description:

  • Containers:

    • frontend - a simple Svelte frontend that takes a video link, extracts the video ID and shows the scanning results (found spam comments and embedded dashboards)
    • middleware - contains both an API implemented with FastAPI and the main application functionality: retrieving raw data from the YouTube API, reformatting it, applying the classifier and storing the final data in Elasticsearch
    • elasticsearch - an ES instance
    • kibana - a Kibana container
    • elasticvue - an Elasticvue instance for Elasticsearch administration
    • setup - an additional container that runs scripts that help with configuring security credentials for ES and Kibana communication
  • Code Structure in Middleware:

    • main - includes the FastAPI functions
    • data_retriever - comprised of 4 classes:
      • 2 data classes: YtComment and YtCommentReply that store comment data for initial comments and their replies respectively
      • 2 interface classes: YtDataRetriever, which retrieves comments via the official YouTube API, and ESConnect, which takes care of storing the comments and their data in Elasticsearch
    • classifier - includes a generic Classifier class, which loads the stored, already trained model and vectorizer and handles preprocessing and prediction of single comments
  • Preprocessing pipeline consisting of the following steps (see the sketch after this list):

    • removal of empty entries (and irrelevant features)
    • lowercasing
    • spaCy tokenization and lemmatization
    • removal of stop words
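
A minimal sketch of how the retrieval, preprocessing and classification pieces could fit together, assuming a spaCy English model and joblib-serialized sklearn artifacts; all function names, paths and parameters below are illustrative assumptions, not the repository's actual identifiers:

```python
import joblib
import spacy
from googleapiclient.discovery import build

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline

def preprocess(text: str) -> str:
    """Lowercase, tokenize and lemmatize with spaCy, drop stop words."""
    doc = nlp(text.lower())
    return " ".join(tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_space)

def fetch_comments(video_id: str, api_key: str) -> list[str]:
    """Retrieve top-level comments via the official YouTube Data API."""
    youtube = build("youtube", "v3", developerKey=api_key)
    response = youtube.commentThreads().list(
        part="snippet", videoId=video_id, maxResults=100, textFormat="plainText"
    ).execute()
    return [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in response["items"]
    ]

# Load the stored, already trained model and vectorizer (paths are assumptions).
model = joblib.load("models/logistic_regression.joblib")
vectorizer = joblib.load("models/vectorizer.joblib")

def is_spam(comment: str) -> bool:
    """Vectorize a single preprocessed comment and predict its label."""
    features = vectorizer.transform([preprocess(comment)])
    return bool(model.predict(features)[0])
```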

Data Analysis: see next section

Data Analysis

Data Sources: the public YouTube Spam Collection Data Set and an own data set of comments retrieved via the YouTube Data API (see Data Statistics below)

Preprocessing and storing pipeline:

  • text sanitizing
  • using a stored and already trained model (default is logistic regression) to classify individual comments
  • saving relevant information about each comment (such as author, channel, number of likes, date, etc.).
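
For the storing step, a minimal sketch using the official Python Elasticsearch client (8.x API); host, credentials, index name and document fields are assumptions for illustration:

```python
from elasticsearch import Elasticsearch

# Connection details depend on the docker-compose setup and the
# security credentials created by the setup container.
es = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("elastic", "<password>"),
    verify_certs=False,
)

doc = {
    "author": "SomeUser",              # placeholder values
    "channel": "CHANNEL_ID",
    "likes": 3,
    "date": "2023-02-12T10:15:00Z",
    "text": "Check out my channel!!!",
    "is_spam": True,                   # classifier output
}
es.index(index="yt-own-spam-collection", document=doc)
```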

Data Statistics:

(Statistics of the yt-spam-collection index)

  • The own data set consists of 30,575 comments from 9 different YouTube videos, of which 12,910 are spam and 17,665 are legitimate. For more information see here

(Statistics of the yt-own-spam-collection index)

Example comment stored in Elasticsearch:
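
The exact index mapping is not reproduced here; based on the fields listed above, a stored comment might look roughly like this (all field names and values are illustrative):

```json
{
  "video_id": "VIDEO_ID",
  "comment_id": "COMMENT_ID",
  "author": "SomeUser",
  "channel": "CHANNEL_ID",
  "likes": 3,
  "date": "2023-02-12T10:15:00Z",
  "text": "Check out my channel!!!",
  "is_spam": true
}
```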

Code State

  • Important aspects: self-explanatory variable names, comments, docstrings, module structure, code consistency, PEP-8 compliance, avoidance of "hacks"

  • Web App Frontend:

(Screenshots: default page and page with a link entered)
  • Kibana Dashboards

(Kibana dashboard screenshots)

Contributions

| Timeframe | Angelina | Vivian | Abdulghani | Paul |
|---|---|---|---|---|
| 10.11 - 25.11 | Accessing YouTube API | Implementation and evaluation of a Support Vector Machine classifier on the YouTube Spam Collection Data Set | Configuring Docker containers and compose | Configuring ES and Kibana |
| 26.11 - 02.12 | Sample YouTube data exploration, analysis and processing | Implementation and evaluation of Logistic Regression and Naive Bayes on the YouTube Spam Collection Data Set | Preparing and uploading the data to Elasticsearch | Experimenting with debug configurations involving multiple containers, including Svelte, FastAPI, TensorFlow Serving and bare Python projects |
| 03.12 - 11.12 | Extending the YtDataRetriever class | Working on middleware and frontend | Reformatting ES data and working on data visualization | Working on middleware and frontend, Kibana dashboard creation |
| 11.12 - 15.01 | Group meetings | Bugfixes | Troubleshooting Kibana-to-frontend network connectivity | Bugfixes |
| 15.01 - 30.01 | Group meetings | Improving frontend | Frontend-to-middleware error debugging | Iframe embedding for the Kibana dashboard in the frontend |
| 01.02 - 15.02 | Group meetings | Dataset creation script, bugfixes within the spam detection pipeline | Obtaining comments for the dataset, setting up YT-Spammer-Purge | Bugfixes within the spam detection pipeline |
| 15.02 - 28.02 | Group meetings | Last frontend improvements | Secrets and environment variables | Updating Kibana dashboard views, repository clean-up |

How to run and debug?

Frontend

  1. If the container is not already running, either:
  • run docker compose up in a terminal (requires docker-compose.yml) or
  • right-click docker-compose.debug.yml in VS Code and choose "Compose Up"
  2. Execute the launch configuration "Launch Chrome against localhost". Set breakpoints inside "frontend/src" if necessary.

Middleware

  1. If the container is not already running, either:
  • run docker compose up in a terminal (requires docker-compose.yml) or
  • right-click docker-compose.debug.yml in VS Code and choose "Compose Up"
  2. Open localhost:8000/docs to access the API (a sketch of such a route follows below).
  3. To debug, execute the launch configuration "Python: Middleware Remote Attach". Set breakpoints inside "middleware/app" if necessary.

NOTE: Before running docker compose up on a Windows computer, please make sure that the line endings of the file middleware/start.sh are LF instead of CRLF in VS Code.
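
For orientation, a stripped-down sketch of what a route in the middleware's main module could look like; the actual route names may differ, and fetch_comments / is_spam are the hypothetical helpers from the sketch in the overview section:

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/scan/{video_id}")
def scan(video_id: str) -> dict:
    """Retrieve, classify and return the spam comments of one video."""
    comments = fetch_comments(video_id, api_key="<YT_API_KEY>")  # YtDataRetriever's job
    spam = [c for c in comments if is_spam(c)]                   # Classifier's job
    return {"video_id": video_id, "spam_comments": spam}
```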
