Natural Language Processing Projects

A collection of NLP projects covering Urdu text processing, sarcasm detection, sequence modeling for machine translation, and LLM fine-tuning. Built across coursework at FAST-NUCES.

📂 What is inside

1. Urdu Sarcasm Detection

End-to-end sarcasm classification pipeline built on an Urdu social media comments dataset.

Preprocessing pipeline:

Removed emojis, punctuation, and comments shorter than 3 words
Urdu normalization, stemming, and lemmatization using LughaatNLP
Removed English words and special characters to keep pure Urdu text
Combined Urdu and English stopword lists for filtering
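
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration and not the code in phase 1.py: the function name `clean_comment`, the use of the Unicode Arabic block (U+0600 to U+06FF) as a stand-in for "pure Urdu", and the order of the length filter relative to stopword removal are all assumptions.

```python
import re

# Assumption: "pure Urdu" is approximated by the Unicode Arabic block.
NON_URDU = re.compile(r"[^\u0600-\u06FF\s]")
# Arabic-script punctuation inside that block: comma, semicolon,
# question mark, and Urdu full stop.
URDU_PUNCT = re.compile(r"[\u060C\u061B\u061F\u06D4]")


def clean_comment(text, stopwords=frozenset(), min_words=3):
    """Return the cleaned comment, or None if it has fewer than min_words words."""
    # Drops emojis, Latin letters, digits, and ASCII punctuation.
    text = NON_URDU.sub(" ", text)
    text = URDU_PUNCT.sub(" ", text)
    tokens = text.split()
    # Assumption: the length filter runs before stopword removal.
    if len(tokens) < min_words:
        return None
    return " ".join(t for t in tokens if t not in stopwords)
```

Passing the merged contents of stopwords-ur.txt and stopwords.txt as `stopwords` would mirror the combined stopword filtering described above. Note that characters outside the chosen range, such as the zero-width non-joiner (U+200C), are replaced by a space in this sketch, which may split some words.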

Feature extraction and modeling:

TF-IDF vectorization
Word2Vec embeddings (vector size 100, window 5, 50 epochs)
Unigram, bigram, and trigram frequency analysis
Multinomial Naive Bayes classifier with 80/20 train-test split
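
A baseline along these lines could look like the sketch below, which chains scikit-learn's `TfidfVectorizer` and `MultinomialNB` behind an 80/20 split. The helper name `train_baseline`, the stratified split, the fixed seed, and the toy romanized sentences are assumptions for illustration; they are not taken from phase 5.py or from urdu_sarcastic_dataset.csv.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def train_baseline(texts, labels, seed=0):
    """Fit TF-IDF + Multinomial Naive Bayes on an 80/20 split.

    Returns the fitted pipeline and its accuracy on the held-out 20%.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    model = make_pipeline(
        # Whitespace tokenization, assuming the text was cleaned beforehand.
        TfidfVectorizer(tokenizer=str.split, token_pattern=None, lowercase=False),
        MultinomialNB(),
    )
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Toy romanized placeholders (1 = sarcastic, 0 = not sarcastic).
texts = [
    "wah kya baat hai", "wah bohat khoob", "wah kya kamaal hai",
    "wah janab wah", "kya khoob wah",
    "mausam acha hai", "khana acha tha", "aaj din acha hai",
    "safar acha raha", "acha kaam kiya",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

model, test_accuracy = train_baseline(texts, labels)
```

With only ten toy sentences the held-out accuracy is not meaningful; the point is the shape of the pipeline.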

Evaluation: accuracy, precision, recall, F1-score, confusion matrix
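
For reference, the listed metrics all follow from the four cells of a binary confusion matrix. The sketch below computes them by hand; the function name `binary_metrics` and the example labels are hypothetical, and the matrix layout (rows are true labels, columns are predicted labels) is chosen to match scikit-learn's convention.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion": [[tn, fp], [fn, tp]],  # rows: true, columns: predicted
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for illustration only.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 1, 0, 0, 1])
```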

2. English to Urdu Machine Translation

Sequence-to-sequence translation models trained on paired English-Urdu sentence datasets.

Vanilla RNN encoder-decoder
LSTM encoder-decoder
Attention mechanism integration for improved translation quality
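
The README does not say which attention variant was used, so the following is only a sketch of one common choice, dot-product attention: the decoder state is scored against every encoder state, the scores are turned into weights with a softmax, and the context vector is the weighted sum of encoder states. The names `softmax` and `attend` and the plain-list representation are assumptions made to keep the example dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Dot-product attention over encoder states.

    Returns (weights, context): one weight per source position, and the
    weighted sum of encoder states.
    """
    scores = [sum(d * e for d, e in zip(decoder_state, h))
              for h in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Two source positions with 2-dimensional hidden states (toy values).
weights, context = attend([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Here the decoder state aligns with the first encoder state, so that position receives the larger weight and dominates the context vector.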

3. LLM Fine-Tuning (Phi-3)

Adapted Microsoft's Phi-3 small language model to custom NLP tasks through prompt engineering and task-specific fine-tuning, then evaluated it in zero-shot and few-shot settings.
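
The difference between the zero-shot and few-shot settings comes down to whether labeled examples are placed in the prompt. The sketch below shows one way to build both; the helper `build_prompt`, the "Text:/Label:" layout, and the sample instruction are hypothetical and do not reflect the prompts or chat template actually used with Phi-3.

```python
def build_prompt(instruction, query, examples=()):
    """Build a classification prompt.

    With no examples this is a zero-shot prompt; each (text, label)
    pair in `examples` adds one demonstration for few-shot prompting.
    """
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Text: {text}", f"Label: {label}", ""]
    lines += [f"Text: {query}", "Label:"]
    return "\n".join(lines)


instruction = "Classify the text as sarcastic or not sarcastic."
zero_shot = build_prompt(instruction, "wah kya baat hai")
few_shot = build_prompt(
    instruction,
    "wah kya baat hai",
    examples=[("mausam acha hai", "not sarcastic"),
              ("wah janab wah", "sarcastic")],
)
```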

4. NLP Architectures Study

In-depth exploration of modern NLP architectures:

Transformer architecture and self-attention mechanism
BERT: masked language modeling, embeddings, and fine-tuning
Comparison of model tradeoffs across tasks
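
As a reference point for the self-attention item above, the standard scaled dot-product attention used in the Transformer can be written as follows, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension:

```latex
\[
\operatorname{Attention}(Q, K, V)
  = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

Dividing by the square root of d_k keeps the dot products from growing with the dimension, which would otherwise push the softmax into regions with very small gradients.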

🛠️ Stack

Task	Tools
Urdu NLP	LughaatNLP, NLTK
Embeddings	Word2Vec (gensim), TF-IDF
Classification	scikit-learn (Naive Bayes)
Sequence Modeling	TensorFlow/Keras (RNN, LSTM)
LLM Fine-Tuning	HuggingFace Transformers (Phi-3)
Visualization	Matplotlib, Seaborn

📁 Project structure

nlp/
├── phase 1.py          # Urdu preprocessing: emoji removal, stopwords, filtering
├── phase 2.py          # LughaatNLP: normalization, stemming, lemmatization
├── phase 3.py          # Tokenization, TF-IDF, Word2Vec embeddings
├── phase 4.py          # N-gram frequency analysis
├── phase 5.py          # Naive Bayes sarcasm classifier
├── phase 6.py          # Evaluation metrics and confusion matrix
├── Assignment_1.ipynb  # Full documented pipeline
├── urdu_sarcastic_dataset.csv
├── stopwords-ur.txt
└── stopwords.txt

🚀 Running locally

pip install pandas scikit-learn nltk gensim tensorflow LughaatNLP matplotlib seaborn

Run phases in order:

python "phase 1.py"
python "phase 2.py"
python "phase 3.py"
python "phase 4.py"
python "phase 5.py"
python "phase 6.py"

Or open Assignment_1.ipynb for the full documented pipeline.

💡 What I learned building this

Working with Urdu text is genuinely different from English NLP. Urdu is written in a Perso-Arabic script that reads right to left, letters change shape depending on their position in a word, and standard English tokenizers break on it. LughaatNLP handles Urdu-specific normalization, including character variants that look the same but have different Unicode code points.
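
To make the point about look-alike characters concrete, here is a small sketch that maps two Arabic code points to the forms normally used in Urdu text: Arabic yeh (U+064A) to Farsi yeh (U+06CC), and Arabic kaf (U+0643) to keheh (U+06A9). It is not LughaatNLP's implementation; the `VARIANTS` table and the function `normalize_urdu` are assumptions, and a real normalizer covers many more cases.

```python
# Look-alike Arabic code points mapped to their usual Urdu counterparts.
VARIANTS = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize_urdu(text):
    """Replace look-alike Arabic code points with their Urdu forms."""
    return text.translate(VARIANTS)


# The same word typed with Arabic and with Urdu code points.
arabic_form = "\u0643\u064A\u0627"
urdu_form = "\u06A9\u06CC\u0627"
```

Without this step, `arabic_form` and `urdu_form` count as two different tokens in TF-IDF or Word2Vec even though they render almost identically.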

Sarcasm detection in any language is hard because sarcasm is contextual. The Naive Bayes classifier on TF-IDF features gives a reasonable baseline, but it treats each comment as a bag of words and misses the semantic layer entirely. Fine-tuning a transformer on this task would likely get much better results.

The RNN to LSTM improvement on translation was clear and measurable. Vanilla RNNs lose context over long sentences. LSTMs hold it significantly better, which shows directly in translation quality on longer inputs.
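
The standard textbook update rules give some intuition for why this happens (these are the generic formulations, not necessarily the exact layers used in this project). A vanilla RNN rewrites its whole hidden state at every step:

```latex
\[
h_t = \tanh\left(W_h h_{t-1} + W_x x_t + b\right)
\]
```

An LSTM instead keeps a separate cell state that is updated additively, with a forget gate f_t and an input gate i_t controlling how much is kept and how much is written:

```latex
\[
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,
\qquad
h_t = o_t \odot \tanh(c_t)
\]
```

In the RNN, information from early tokens passes through the same weight matrix and a tanh at every step, so it tends to fade over long sentences. In the LSTM, when f_t stays close to 1 the cell state carries earlier information forward largely unchanged.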

📬 Contact

Built by Abdullah Khalid
