
YtSpamScanner

Table of Contents

  1. General Information
  2. Project Milestones
  3. High Level Overview
  4. Data Analysis
  5. Code State
  6. Contributions
  7. How to run and debug

General Information

Task: Text Analytics project

Team Members: Angelina Basova, Abdulghani Almasri, Paul Dietze, Vivian Kazakova

Mail Addresses: angelina.basova@stud.uni-heidelberg.de, abdulghani.almasri@stud.uni-heidelberg.de, cl250@uni-heidelberg.de, vivian.kazakova@stud.uni-heidelberg.de

Existing Code Fragments: sklearn models (SVM, Logistic Regression, Naive Bayes), spaCy, nltk, YT-Spammer-Purge

Utilized libraries: see the models-requirements and middleware-requirements files

Contributions: see table below


Please import the dashboard configuration file export.ndjson into Kibana (Stack Management > Saved Objects > Import) before starting the containers with Docker Compose.
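
If you prefer to script this step, Kibana also exposes a Saved Objects import API; a minimal sketch, assuming Kibana is reachable at its default local address (add authentication if security is enabled in your setup):

```python
import requests

# Scripted alternative to the manual Kibana UI import (illustrative only).
# Assumes Kibana listens on localhost:5601.
with open("export.ndjson", "rb") as f:
    response = requests.post(
        "http://localhost:5601/api/saved_objects/_import?overwrite=true",
        headers={"kbn-xsrf": "true"},  # required header for Kibana's API
        files={"file": ("export.ndjson", f, "application/ndjson")},
    )
response.raise_for_status()
```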


Project Milestones

  • setup ES and Kibana, setup containers and debug configurations, obtain existing spam collection dataset (YouTube Spam Collection Data Set)
  • implement pipeline for extracting video comments using the YouTube Data API, implement and store models (SVM, LR, NB), implement pipeline for loading and storing data in ES, create FastAPI functions, create first (example) dashboards
  • develop first prototype of the interface
  • decide on spam classifier, improve preprocessing pipeline, improve frontend by including the results of the scanning (= spam comments and embedded dashboards)
  • clean and comment code
  • last updates and fixes
  • create video presentation and merge final code
  • write and hand in report

High Level Overview

(Architecture diagram)

Architecture Description:

  • Containers:

    • frontend - a simple Svelte frontend that takes a video link, extracts the video ID and shows the scanning results (found spam comments and embedded dashboards)
    • middleware - contains both an API implemented with FastAPI and the main application functionality: retrieving raw data from the YouTube API, reformatting it, applying the classifier and storing the final data in Elasticsearch
    • elasticsearch - an ES instance
    • kibana - a Kibana container
    • elasticvue - an Elasticvue instance for Elasticsearch administration
    • setup - an additional container that runs scripts that help with configuring security credentials for ES and Kibana communication
  • Code Structure in Middleware:

    • main - includes the FastAPI functions
    • data_retriever - comprised of 4 classes:
      • 2 data classes: YtComment and YtCommentReply that store comment data for initial comments and their replies respectively
      • 2 interface classes: YtDataRetriever, which retrieves comments via the official YouTube API, and ESConnect, which takes care of storing the comments and their data in Elasticsearch
    • classifier - includes a generic Classifier class, which loads the stored, already trained model and vectorizer and handles preprocessing and prediction of single comments
  • Preprocessing pipeline consisting of the following steps (see the sketch after this list):

    • removal of empty entries (and irrelevant features)
    • lowercasing
    • spaCy tokenization and lemmatization
    • removal of stop words
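
A minimal sketch of how the retrieval, preprocessing and classification pieces could fit together, assuming a spaCy English model and joblib-serialized sklearn artifacts; all function names, paths and parameters below are illustrative assumptions, not the repository's actual identifiers:

```python
import joblib
import spacy
from googleapiclient.discovery import build

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline

def preprocess(text: str) -> str:
    """Lowercase, tokenize and lemmatize with spaCy, drop stop words."""
    doc = nlp(text.lower())
    return " ".join(tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_space)

def fetch_comments(video_id: str, api_key: str) -> list[str]:
    """Retrieve top-level comments via the official YouTube Data API."""
    youtube = build("youtube", "v3", developerKey=api_key)
    response = youtube.commentThreads().list(
        part="snippet", videoId=video_id, maxResults=100, textFormat="plainText"
    ).execute()
    return [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in response["items"]
    ]

# Load the stored, already trained model and vectorizer (paths are assumptions).
model = joblib.load("models/logistic_regression.joblib")
vectorizer = joblib.load("models/vectorizer.joblib")

def is_spam(comment: str) -> bool:
    """Vectorize a single preprocessed comment and predict its label."""
    features = vectorizer.transform([preprocess(comment)])
    return bool(model.predict(features)[0])
```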

Data Analysis: see next section

Data Analysis

Data Sources: the public YouTube Spam Collection Data Set and an own data set of comments retrieved via the YouTube Data API (see Data Statistics below)

Preprocessing and storing pipeline:

  • text sanitizing
  • using a stored and already trained model (default is logistic regression) to classify individual comments
  • saving relevant information about each comment (such as author, channel, number of likes, date, etc.).
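
For the storing step, a minimal sketch using the official Python Elasticsearch client (8.x API); host, credentials, index name and document fields are assumptions for illustration:

```python
from elasticsearch import Elasticsearch

# Connection details depend on the docker-compose setup and the
# security credentials created by the setup container.
es = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("elastic", "<password>"),
    verify_certs=False,
)

doc = {
    "author": "SomeUser",              # placeholder values
    "channel": "CHANNEL_ID",
    "likes": 3,
    "date": "2023-02-12T10:15:00Z",
    "text": "Check out my channel!!!",
    "is_spam": True,                   # classifier output
}
es.index(index="yt-own-spam-collection", document=doc)
```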

Data Statistics:

(Statistics of the yt-spam-collection index)

  • The own data set consists of 30,575 comments from 9 different YouTube videos, of which 12,910 are spam and 17,665 are legitimate. For more information see here

(Statistics of the yt-own-spam-collection index)

Example comment stored in Elasticsearch:
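
The exact index mapping is not reproduced here; based on the fields listed above, a stored comment might look roughly like this (all field names and values are illustrative):

```json
{
  "video_id": "VIDEO_ID",
  "comment_id": "COMMENT_ID",
  "author": "SomeUser",
  "channel": "CHANNEL_ID",
  "likes": 3,
  "date": "2023-02-12T10:15:00Z",
  "text": "Check out my channel!!!",
  "is_spam": true
}
```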

Code State

  • Important aspects: self-explanatory variable names, comments, docstrings, module structure, code consistency, PEP-8 compliance, avoidance of "hacks"

  • Web App Frontend:

(Screenshots: default page and page with a link entered)
  • Kibana Dashboards

(Kibana dashboard screenshots)

Contributions

| Timeframe | Angelina | Vivian | Abdulghani | Paul |
|---|---|---|---|---|
| 10.11 - 25.11 | Accessing YouTube API | Implementation and evaluation of a Support Vector Machine classifier on the YouTube Spam Collection Data Set | Configuring Docker containers and compose | Configuring ES and Kibana |
| 26.11 - 02.12 | Sample YouTube data exploration, analysis and processing | Implementation and evaluation of Logistic Regression and Naive Bayes on the YouTube Spam Collection Data Set | Preparing and uploading the data to Elasticsearch | Experimenting with debug configurations involving multiple containers, including Svelte, FastAPI, TensorFlow Serving and bare Python projects |
| 03.12 - 11.12 | Extending the YtDataRetriever class | Working on middleware and frontend | Reformatting ES data and working on data visualization | Working on middleware and frontend, Kibana dashboard creation |
| 11.12 - 15.01 | Group meetings | Bugfixes | Troubleshooting Kibana-to-frontend network connectivity | Bugfixes |
| 15.01 - 30.01 | Group meetings | Improving frontend | Frontend-to-middleware error debugging | Iframe embedding for the Kibana dashboard in the frontend |
| 01.02 - 15.02 | Group meetings | Dataset creation script, bugfixes within the spam detection pipeline | Obtaining comments for the dataset, setting up YT-Spammer-Purge | Bugfixes within the spam detection pipeline |
| 15.02 - 28.02 | Group meetings | Last frontend improvements | Secrets and environment variables | Updating Kibana dashboard views, repository clean-up |

How to run and debug?

Frontend

  1. If the container is not already running, either:
  • run docker compose up in a terminal (requires docker-compose.yml) or
  • right-click docker-compose.debug.yml in VS Code and choose "Compose Up"
  2. Execute the launch configuration "Launch Chrome against localhost". Set breakpoints inside "frontend/src" if necessary.

Middleware

  1. If the container is not already running, either:
  • run docker compose up in a terminal (requires docker-compose.yml) or
  • right-click docker-compose.debug.yml in VS Code and choose "Compose Up"
  2. Open localhost:8000/docs to access the API (a sketch of such a route follows below).
  3. To debug, execute the launch configuration "Python: Middleware Remote Attach". Set breakpoints inside "middleware/app" if necessary.

NOTE: Before running docker compose up on a Windows computer, please make sure that the line endings of the file middleware/start.sh are LF instead of CRLF in VS Code.
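
For orientation, a stripped-down sketch of what a route in the middleware's main module could look like; the actual route names may differ, and fetch_comments / is_spam are the hypothetical helpers from the sketch in the overview section:

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/scan/{video_id}")
def scan(video_id: str) -> dict:
    """Retrieve, classify and return the spam comments of one video."""
    comments = fetch_comments(video_id, api_key="<YT_API_KEY>")  # YtDataRetriever's job
    spam = [c for c in comments if is_spam(c)]                   # Classifier's job
    return {"video_id": video_id, "spam_comments": spam}
```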
