- General Information
- Project Milestones
- High Level Overview
- Data Analysis
- Code State
- Contributions
- How to run and debug
Task: Text Analytics project
Team Members: Angelina Basova, Abdulghani Almasri, Paul Dietze, Vivian Kazakova
Mail Addresses: angelina.basova@stud.uni-heidelberg.de, abdulghani.almasri@stud.uni-heidelberg.de, cl250@uni-heidelberg.de, vivian.kazakova@stud.uni-heidelberg.de
Existing Code Fragments: sklearn models (SVM, Logistic Regression, Naive Bayes), spaCy, nltk, YT-Spammer-Purge
Utilized libraries: see models-requirements and middleware-requirements
Contributions: see table below
Please import the dashboard configurations file `export.ndjson` into Kibana (Stack Management > Saved Objects > Import) before you start the docker compose file.
- setup ES and Kibana, setup containers and debug configurations, obtain existing spam collection dataset (YouTube Spam Collection Data Set)
- implement pipeline for extracting video comments using YouTube Data API, implement and store models (SVM, LR, NB), implement pipeline for loading and storing data in ES, create FastAPI functions, create first (example) dashboards
- develop first prototype of the interface
- decide on spam classifier, improve preprocessing pipeline, improve frontend by including the results of the scanning (= spam comments and embedded dashboards)
- clean and comment code
- last updates and fixes
- create video presentation and merge final code
- write and hand in report
Architecture Description:
-
Containers:
- frontend - simple Svelte frontend that takes a video link, extracts the Video ID and shows the results (found spam comments and embedded dashboards) of the scanning
- middleware - contains both an API implemented with FastAPI and the main application functionality: retrieving raw data from the YouTube API, reformatting it, applying the classifier and storing the final data in Elasticsearch
- elasticsearch - an ES instance
- kibana - a Kibana container
- elasticvue - an Elasticvue component for elasticsearch administration
- setup - an additional container that runs scripts that help with configuring security credentials for ES and Kibana communication
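The containers above might be wired together in a compose file roughly like this; a minimal sketch only, where the image tags, build paths and ports are assumptions, not the project's actual docker-compose.yml:

```yaml
services:
  frontend:
    build: ./frontend          # Svelte app
    ports: ["5000:5000"]
  middleware:
    build: ./middleware        # FastAPI app
    ports: ["8000:8000"]
    depends_on: [elasticsearch]
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.5.0
  kibana:
    image: docker.elastic.co/kibana/kibana:8.5.0
    depends_on: [elasticsearch]
  elasticvue:
    image: cars10/elasticvue   # ES administration UI
  setup:
    build: ./setup             # one-shot security/credentials scripts
    depends_on: [elasticsearch]
```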
-
Code Structure in Middleware:
- main - includes the FastAPI functions
- data_retriever - comprised of 4 classes:
- 2 data classes: YtComment and YtCommentReply that store comment data for initial comments and their replies respectively
- 2 interface classes: YtDataRetriever, which allows comment retrieval via the official YouTube API, and ESConnect, which takes care of storing the comments and their data in Elasticsearch
- classifier - includes a Generic Classifier class, which loads the stored, already trained model and vectorizer and handles preprocessing and prediction of single comments
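A minimal sketch of what such a generic classifier could look like; the class name, file paths and joblib serialization are assumptions, not the project's actual code:

```python
import joblib


class GenericClassifier:
    """Loads a stored, pre-trained sklearn model and vectorizer
    and classifies single comments (1 = spam, 0 = legitimate)."""

    def __init__(self, model_path="model.joblib", vectorizer_path="vectorizer.joblib"):
        # load the already-trained artifacts from disk
        self.model = joblib.load(model_path)
        self.vectorizer = joblib.load(vectorizer_path)

    def predict(self, comment: str) -> int:
        # vectorize the (preprocessed) comment and predict its label
        features = self.vectorizer.transform([comment])
        return int(self.model.predict(features)[0])
```

The same interface works for any of the stored sklearn models (SVM, Logistic Regression, Naive Bayes), since they all share `fit`/`predict`.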
-
Preprocessing pipeline consisting of the following steps:
- removal of empty entries (and irrelevant features)
- lowercase
- spaCy tokenization and lemmatization
- removal of stop words
Data Analysis: see next section
Data Sources:
- Reference dataset: YouTube Spam Collection Data Set
- Manually extracted dataset: Comments extracted using YouTube Data API (stored beforehand)
- User selected data: Comments extracted using YouTube Data API (live, from the input video url)
Preprocessing and storing pipeline:
- text sanitizing
- using the stored, already trained model (default is logistic regression) to classify individual comments
- saving relevant information about each comment (such as author, channel, number of likes, date, etc.).
Data Statistics:
- The YouTube Spam Collection Data Set contains 1956 comments from 5 different YouTube videos: 1005 spam and 951 legitimate comments.
- The own data set contains 30575 comments from 9 different YouTube videos: 12910 spam and 17665 legitimate comments. For more information see here
Example comment stored in Elasticsearch:
-
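A plausible shape for such a document, shown here as a Python dict; all field names and values are illustrative assumptions based on the "relevant information" listed above (author, channel, likes, date), not the project's actual mapping:

```python
# hypothetical comment document as it might be indexed in Elasticsearch
example_comment = {
    "comment_id": "abc123",                  # assumed identifier field
    "video_id": "VIDEO_ID",                  # placeholder
    "author": "SomeUser",
    "channel": "SomeChannel",
    "text": "Check out my channel!",
    "like_count": 3,
    "published_at": "2023-02-15T12:00:00Z",  # ISO 8601 date
    "is_spam": True,                         # classifier output
}
```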
Important: Self-explanatory Variables, Comments, Docstrings, Module Structure, Code Consistency, PEP-8, "Hacks"
-
Web App Frontend:
| Default Page | Link entered |
|---|---|
| ![]() | ![]() |
- Kibana Dashboards
| Timeframe | Angelina | Vivian | Abdulghani | Paul |
|---|---|---|---|---|
| 10.11 - 25.11 | Accessing Youtube API | implementation and evaluation of Support Vector Machine Classifier on the YouTube Spam Collection Data Set | Configuring Docker containers and compose | Configuring ES and Kibana |
| 26.11 - 02.12 | Sample Youtube data exploration analysis and processing | implementation and evaluation of Logistic Regression and Naive Bayes on the YouTube Spam Collection Data Set | Preparing and uploading the data to Elasticsearch | Experimenting with debug configurations involving multiple containers including Svelte, FastAPI, TensorFlow Serving and bare Python projects. |
| 03.12 - 11.12 | Extending YtDataRetriever class | working on middleware and frontend | reformatting ES data and working on data visualization | working on middleware and frontend, Kibana dashboard creation |
| 11.12 - 15.01 | group meetings | bugfixes | troubleshooting Kibana-to-Frontend network connectivity | bugfixes |
| 15.01 - 30.01 | group meetings | improving frontend | frontend-to-middleware error debugging | iframe embedding for Kibana dashboard in frontend |
| 01.02 - 15.02 | group meetings | create dataset script, bugfixes within the spam detection pipeline | obtain comments for dataset, setup yt-spammer-purge | bugfixes within the spam detection pipeline |
| 15.02 - 28.02 | group meetings | last frontend improvements | secrets and environment variables | updating Kibana dashboard views, repository clean up |
- If the container is not already running:
  - run `docker compose up` in a terminal (requires docker-compose.yml), or
  - right-click on docker-compose.debug.yml in VS Code and choose "Compose Up"
- Execute launch configuration "Launch Chrome against localhost". Set breakpoints inside "frontend/src" if necessary.
- If the container is not already running:
  - run `docker compose up` in a terminal (requires docker-compose.yml), or
  - right-click on docker-compose.debug.yml in VS Code and choose "Compose Up"
- Open `localhost:8000/docs` to access the API.
- To debug, execute the launch configuration "Python: Middleware Remote Attach". Set breakpoints inside "middleware/app" if necessary.
NOTE: Before running `docker compose up` on a Windows computer, please make sure the line ending is `LF` instead of `CRLF` in VS Code for the file `middleware/start.sh`.