This project implements core Natural Language Processing (NLP) components from scratch without relying on external NLP libraries like NLTK or Hugging Face. It is designed to provide a deep understanding of foundational NLP techniques by building a WordPiece Tokenizer, a Word2Vec (CBOW) Model, and a Neural Language Model (MLP-based).
├── 📖 README.md <- Project documentation and setup instructions
├── 📜 Project_Description.pdf <- Overview of objectives and methodology
├── 📑 Project Report.pdf <- Final project report with analysis and findings
├── 📂 Data
│ ├── 📄 corpus.txt <- Raw text corpus for tokenization
│ ├── 📄 tokenized_data.json <- Preprocessed tokenized data
│ ├── 📄 vocabulary_86.txt <- Extracted vocabulary
├── 📂 src
│ ├── 📓 Neural_Language_model.ipynb <- Notebook implementing a neural language model
│ ├── 📜 Word2Vec_model.py <- Word2Vec model for word embeddings
│ ├── 📜 WordPieceTokeniser.ipynb <- Jupyter notebook for WordPiece tokenization
│ ├── 📜 WordPieceTokeniser.py <- Python script for tokenization
│ ├── 📄 Neural_LM_loss.png <- Training loss curve for language model
│ ├── 📄 predicted.png <- Sample predictions from the model
│ ├── 📄 task2.png <- Visualization for task 2
│ ├── 📄 task3.png <- Visualization for task 3
│ ├── 📄 tokenSimilarity.png <- Token similarity heatmap
│ ├── 📓 temp.ipynb <- Temporary script for testing
This repository covers three major NLP tasks:
- Task 1: Implementing a WordPiece Tokenizer.
- Task 2: Building a Word2Vec (CBOW) Model using PyTorch.
- Task 3: Training a Neural Language Model (MLP-based) with three architectural variations.
- Fully custom implementation of a WordPiece Tokenizer.
- Word2Vec (CBOW) model built using PyTorch.
- Neural Language Model trained for next-word prediction.
- PyTorch-based training pipeline with loss visualization.
- Evaluation metrics including cosine similarity, accuracy, and perplexity.
- Preprocessing: Cleans and processes raw text data (lowercasing, removing special characters, etc.).
- Vocabulary Construction: Extracts subword tokens and saves them in `vocabulary_86.txt`.
- Tokenization: Converts sentences into subword tokens using the generated vocabulary.
- `task1.py` - Contains the `WordPieceTokenizer` class.
- `vocabulary_86.txt` - Stores the generated vocabulary.
- `tokenized_86.json` - Output JSON file with tokenized sentences.
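
As a rough illustration of the tokenization step, the sketch below applies greedy longest-match-first segmentation against a toy vocabulary. The `##` continuation prefix, the `[UNK]` fallback, the function names, and the vocabulary itself are assumptions borrowed from the common BERT-style convention; they are not taken from `task1.py`.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # assumed continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


def tokenize(sentence, vocab):
    # Toy preprocessing: lowercase and replace special characters with spaces.
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in sentence.lower()
    )
    return [tok for word in cleaned.split() for tok in wordpiece_tokenize(word, vocab)]


# Hypothetical vocabulary, for illustration only.
toy_vocab = {"the", "un", "play", "##play", "##able", "##ing", "##ed"}
tokens = tokenize("The unplayable playing!", toy_vocab)
print(tokens)
```

Trying the longest candidate first keeps the number of pieces per word small. A word with no matching piece at some position collapses to a single unknown token in this sketch.
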
- Dataset Preparation: Implements `Word2VecDataset` to create training data for the CBOW model.
- Word2Vec Model: Implements a CBOW-based neural network using PyTorch.
- Training Function: Manages the training pipeline, including loss computation and optimization.
- Similarity Calculation: Computes cosine similarity for token triplets to evaluate word relationships.
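
The components above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the code in `task2.py`: the names `make_cbow_pairs` and `TinyCBOW`, the window size, the embedding dimension, and the optimizer settings are all hypothetical.

```python
import torch
import torch.nn as nn


def make_cbow_pairs(token_ids, window=2):
    """Build (context, target) pairs: `window` ids on each side of the target."""
    pairs = []
    for i in range(window, len(token_ids) - window):
        context = token_ids[i - window:i] + token_ids[i + 1:i + window + 1]
        pairs.append((context, token_ids[i]))
    return pairs


class TinyCBOW(nn.Module):
    """Average the context embeddings, then score every vocabulary entry."""

    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, 2 * window) -> mean embedding: (batch, embed_dim)
        mean_context = self.embeddings(context_ids).mean(dim=1)
        return self.output(mean_context)  # logits: (batch, vocab_size)


torch.manual_seed(0)
token_ids = [0, 1, 2, 3, 4, 5, 6]  # toy sequence of token ids
pairs = make_cbow_pairs(token_ids, window=2)
contexts = torch.tensor([context for context, _ in pairs])
targets = torch.tensor([target for _, target in pairs])

model = TinyCBOW(vocab_size=7, embed_dim=8)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

losses = []
for _ in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(contexts), targets)  # predict the centre token
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

After training, the rows of `model.embeddings.weight` serve as the token embeddings that the similarity step compares.
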
- `task2.py` - Contains `Word2VecDataset` and `Word2VecModel` classes.
- Model checkpoint after training.
- Loss curve visualization.
- Identified token triplets based on cosine similarity.
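
One plausible way to form such triplets is to pair an anchor token with its most similar and least similar neighbours under cosine similarity. The sketch below assumes that reading of "triplet"; the helper names and the toy embeddings are hypothetical and do not come from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_triplet(anchor, embeddings):
    """Return (anchor, most similar token, least similar token)."""
    scores = {
        token: cosine_similarity(embeddings[anchor], vector)
        for token, vector in embeddings.items()
        if token != anchor
    }
    return anchor, max(scores, key=scores.get), min(scores, key=scores.get)


# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "king": [0.9, 0.1, 0.0],
    "queen": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9],
}
triplet = find_triplet("king", toy_embeddings)
print(triplet)
```
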
- Dataset Preparation: Implements `NeuralLMDataset` for next-word prediction tasks.
- Three Neural Network Variations:
  - NeuralLM1: Baseline model with basic architecture.
  - NeuralLM2: Modified activation functions and additional layers.
  - NeuralLM3: Increased input token size for better context understanding.
- Training Function: Handles training across all models.
- Evaluation Metrics: Computes accuracy and perplexity for model evaluation.
- Next Token Prediction: Predicts the next three tokens for test sentences.
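
To make the dataset and prediction steps concrete, the sketch below builds next-word training examples with a sliding window and then generates three tokens by feeding each prediction back into the context. The function names, the context sizes, and the stand-in `toy_predict` function are assumptions for illustration; a trained model would take the place of `toy_predict`.

```python
def make_next_word_examples(token_ids, context_size):
    """Slide a window: `context_size` ids as input, the following id as target."""
    examples = []
    for i in range(len(token_ids) - context_size):
        examples.append((token_ids[i:i + context_size], token_ids[i + context_size]))
    return examples


def predict_next_tokens(predict_fn, context, n_tokens=3):
    """Generate `n_tokens` ids, appending each prediction to the context."""
    context = list(context)
    generated = []
    for _ in range(n_tokens):
        next_id = predict_fn(context)
        generated.append(next_id)
        context = context[1:] + [next_id]  # keep the context length fixed
    return generated


sentence = ["the", "cat", "sat", "on", "the", "mat"]
token_to_id = {tok: idx for idx, tok in enumerate(dict.fromkeys(sentence))}
ids = [token_to_id[tok] for tok in sentence]

narrow = make_next_word_examples(ids, context_size=2)
wide = make_next_word_examples(ids, context_size=4)


def toy_predict(context):
    # Stand-in for a trained model: returns the id after the last context id.
    return (context[-1] + 1) % len(token_to_id)


generated = predict_next_tokens(toy_predict, [0, 1], n_tokens=3)
```

A larger `context_size` gives each example more preceding tokens but yields fewer examples from the same text, which is one trade-off behind the input-size variation.
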
- `task3.py` - Contains dataset class and three model architectures.
- Training and validation loss curves.
- Accuracy and perplexity scores.
- Token predictions for `test.txt`.
Accuracy and Perplexity Results:
- Average Training Accuracy: 96.28%
- Average Validation Accuracy: 12.32%
- Average Training Perplexity: 1.11
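- Average Validation Perplexity: 1,487,023.57

The large gap between training and validation metrics indicates that the models overfit the training data.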
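
For reference, the two metrics can be computed as sketched below. This assumes perplexity is defined as the exponential of the mean negative log-probability (in nats) assigned to the correct tokens; the function names and sample values are illustrative and not taken from the project code.

```python
import math


def accuracy(predicted_ids, target_ids):
    """Fraction of positions where the predicted id equals the target id."""
    correct = sum(p == t for p, t in zip(predicted_ids, target_ids))
    return correct / len(target_ids)


def perplexity(target_probs):
    """exp(mean negative log-probability) over the correct tokens."""
    neg_log_probs = [-math.log(p) for p in target_probs]
    return math.exp(sum(neg_log_probs) / len(neg_log_probs))


acc = accuracy([3, 1, 4, 1], [3, 1, 4, 2])
ppl_uniform = perplexity([0.25, 0.25, 0.25, 0.25])  # uniform over 4 choices
ppl_confident = perplexity([0.9, 0.9, 0.9, 0.9])
print(acc, ppl_uniform, ppl_confident)
```

Under this definition, a perplexity of 1.11 corresponds to a mean loss of roughly 0.10 nats per token, and a perplexity of 1,487,023.57 to roughly 14.2 nats per token.
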
Ensure you have the following installed:
- Python 3.x
- PyTorch
- NumPy
- Pandas

`pip install torch numpy pandas`

Run the following commands to execute each task:

Task 1: WordPiece Tokenizer

`python WordPieceTokeniser.py`

Task 2: Word2Vec Training

`python Word2Vec_model.py`

Task 3: Neural Language Model

`python task3.py`

- The WordPiece Tokenizer effectively segments words into subwords.
- The CBOW Word2Vec model captures meaningful word relationships.
- The Neural Language Models exhibit varying performance based on architecture choices.
- A larger context window (more input tokens) improves next-word prediction accuracy.
- Implement positional encoding for better embeddings.
- Experiment with Transformer-based models for improved performance.
- Extend vocabulary using larger datasets.
This project is licensed under the MIT License. See LICENSE for details.

