The project classifies fake news articles using two machine learning techniques: a Multilayer Perceptron (MLP) and the k-Nearest Neighbors (kNN) algorithm. It involves data preprocessing, feature extraction, and building predictive models.
The data preprocessing pipeline includes the following steps:
- Text Cleaning: Removing unwanted characters, special symbols, and stopwords to retain meaningful content.
- Normalization: Converting all text to lowercase and applying stemming or lemmatization for consistency.
- Vectorization: Transforming text data into numerical features using techniques like Term Frequency-Inverse Document Frequency (TF-IDF).
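The cleaning and normalization steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the `STOPWORDS` set and the `naive_stem` helper are hypothetical stand-ins for a full stopword list and a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stopword list (an assumption; a real pipeline would use a fuller list).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}


def naive_stem(token):
    """Very rough suffix stripping, standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the token remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, replace non-letters with spaces, drop stopwords, stem the rest."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


tokens = preprocess("The Senators are VOTING on 3 new bills!")
```

The resulting token lists are what a vectorizer would then turn into numerical features.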
To give the models a numerical representation of the text, the TF-IDF method is used, representing each document as a vector in a high-dimensional space. The formula for TF-IDF is:

$$\text{tfidf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D), \qquad \text{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

where:
- `tf(t, d)`: Term frequency of term `t` in document `d`,
- `idf(t, D)`: Inverse document frequency of term `t`, with `N` being the total number of documents and `|{d ∈ D : t ∈ d}|` the number of documents where the term `t` appears.
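As a rough illustration of the TF-IDF weighting described above, the sketch below computes weights for a tiny tokenized corpus. It assumes that term frequency is the relative frequency of a term within its document and that the logarithm is natural; the source does not state either choice, and library implementations often use smoothed variants, so their numbers may differ. The function name `tf_idf` and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document in `corpus`."""
    n_docs = len(corpus)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            # tf (assumed: relative frequency) times idf = log(N / df).
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


corpus = [["fake", "news", "fake"], ["real", "news"]]
weights = tf_idf(corpus)
```

In this toy corpus, "news" appears in every document, so its idf is log(1) = 0 and it carries no weight, while "fake" and "real" each appear in only one document and receive positive weights.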
To reduce the complexity of high-dimensional data and improve model efficiency, Principal Component Analysis (PCA) is used. PCA reduces the feature space while retaining the most important information by projecting data onto a set of orthogonal axes (principal components). This transformation helps:
- Minimize redundancy in the dataset.
- Speed up training and prediction times.
- Mitigate the risk of overfitting.
The number of principal components is chosen based on the explained variance ratio, ensuring that the reduced feature set captures a significant proportion of the original data's variance.
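One way to pick the number of components from the explained variance ratio is to take the smallest number whose cumulative ratio reaches a target. The sketch below assumes the ratios are already sorted in descending order (as in scikit-learn's `PCA.explained_variance_ratio_` attribute). The 0.95 default threshold, the function name, and the sample ratios are illustrative assumptions, not values from the project.

```python
def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    cumulative = 0.0
    for index, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return index
    # Threshold never reached (e.g. rounding): keep every component.
    return len(explained_variance_ratio)


# Hypothetical ratios for four components.
ratios = [0.5, 0.3, 0.15, 0.05]
n_components = components_for_variance(ratios, threshold=0.9)
```

With these made-up ratios, the first two components cover 80% of the variance, so a 90% target requires three.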
The project uses the following machine learning models to classify fake news:
- k-Nearest Neighbors: A simple algorithm that classifies data points based on the majority class of their nearest neighbors. The distance between points is calculated using metrics like Euclidean distance, and the optimal value of `k` is determined through cross-validation.
- Multilayer Perceptron: A feedforward neural network that uses backpropagation to adjust weights and biases for optimal performance. The `MLP` model is designed to learn complex patterns in the data through its hidden layers and non-linear activation functions.
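The kNN idea described above (Euclidean distance, majority vote, `k` chosen by cross-validation) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: leave-one-out is used here as one simple form of cross-validation, and the function names and the toy 2-D points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(points, labels, k):
    """Leave-one-out cross-validation accuracy for a given k."""
    hits = 0
    for i, (point, label) in enumerate(zip(points, labels)):
        # Hold out sample i and classify it using all the others.
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest_points, rest_labels, point, k) == label
    return hits / len(points)


# Toy 2-D data: two well-separated clusters.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["real", "real", "real", "fake", "fake", "fake"]
scores = {k: loo_accuracy(points, labels, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
```

On this toy data, `k = 5` performs badly because each held-out point is outvoted by the other cluster, which is the kind of effect a plot of accuracy against `k` makes visible.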
The models are evaluated using metrics such as:
- Accuracy
- Precision
- Recall
- F1-score
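For reference, the four metrics listed above can be computed from predicted and true labels as in the sketch below. Treating "fake" as the positive class is an assumption made for illustration, and the function name and sample labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive="fake"):
    """Accuracy, precision, recall and F1 for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = ["fake", "fake", "real", "real", "fake"]
y_pred = ["fake", "real", "real", "fake", "fake"]
metrics = classification_metrics(y_true, y_pred)
```

Precision answers "of the articles flagged as fake, how many were fake?", while recall answers "of the fake articles, how many were flagged?"; F1 is their harmonic mean.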
The project provides:
- A summary of key metrics for each model,
- Database Analysis,
- Visual Division of Real and Fake News,
- Visualisation of Most Frequent words in Titles and Text,
- Graph of the features extracted by `PCA` and `TF-IDF`, divided by label,
- Visualisation of Most Frequent words in Fake News and Real Articles,
- Graph of Co-Occurrence frequencies of words in Titles,
- Graph of the dependence of accuracy on `k` in `kNN`,
- Plot of the Model's Accuracy and Loss
