Implemented projects ranging from non-contextual word embeddings like Word2Vec to contextual word embeddings like ELMo, GPT, and BERT, and used them to solve NLP tasks: sentence-level classification (sentiment analysis & toxic comment classification), token-level classification (POS tagging, NER tagging), and machine translation (MT).


$\color{cyan}{Representation\ Learning\ (pretraining) }$

We need to represent language mathematically, i.e. given a corpus, we need to convert it into numerical form. This mathematical representation is called an embedding (or context) and the process is called representation learning. Why do this? Because computers understand only numbers, not text. We can do this in several ways.
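
To make this concrete, here is a minimal sketch (the toy corpus and the 8-dimensional embedding size are illustrative assumptions, not anything from this repo) of mapping tokens to ids and then to learnable dense vectors with PyTorch:

```python
import torch
import torch.nn as nn

# Toy vocabulary: every unique word in the corpus gets an integer id.
corpus = ["the cat sat", "the dog ran"]
vocab = {w: i for i, w in enumerate(sorted({w for s in corpus for w in s.split()}))}

# Learnable embedding table: one 8-dimensional vector per vocabulary entry.
emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# Convert a sentence into ids, then into its numerical (embedding) representation.
ids = torch.tensor([vocab[w] for w in "the cat ran".split()])
vectors = emb(ids)  # shape: (3, 8) -- one vector per token
print(vectors.shape)
```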

Since we have so many methods for converting a corpus into a numerical representation, which one should we use? It really depends on the data you are trying to convert (a small tokenization sketch follows this list):

  • if it is social media data, then character embeddings may work better
  • if your language is Chinese, then word embeddings may work really badly: Chinese has no spaces between words, so identifying word boundaries is itself a big challenge; character embeddings (and even subword embeddings) work really well for Chinese
  • languages like French and Arabic do have spaces, but something that takes two words in English (say "so said") can be a single word in these languages, so word embeddings might not work well for them either
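
Since the right granularity depends on the language and data, here is a small sketch contrasting word-, character-, and subword-level tokenization; the subword part assumes the Hugging Face `transformers` package and the `bert-base-uncased` checkpoint, which are illustrative choices rather than part of this repo:

```python
text = "unbelievable results"

# Word-level tokens: split on whitespace (breaks down for languages without
# spaces between words, such as Chinese).
word_tokens = text.split()           # ['unbelievable', 'results']

# Character-level tokens: always well defined, but sequences get long.
char_tokens = list(text)             # ['u', 'n', 'b', ...]

# Subword tokens: a middle ground. Sketch using a pretrained WordPiece tokenizer
# (assumes the `transformers` package and access to the `bert-base-uncased` checkpoint).
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
subword_tokens = tok.tokenize(text)  # pieces like 'un', '##...'; exact split depends on the vocab

print(word_tokens)
print(char_tokens[:5])
print(subword_tokens)
```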

Finally, before you start building your own LLM from scratch, I would recommend reading this amazing blog from HF, which helps you understand how to find the right architecture and how big, scalable models are built!

$\color{cyan}{Downstream\ NLP\ (posttraining) }$

$\color{yellow}{A.\ Supervised\ Fine\ Tuning\ (SFT) }$

  • Non Generative (Classification) Tasks - Natural Language Understanding (NLU) Tasks

    • Sentence level classification Tasks
    • Token / Word level classification Tasks (also called Sequence Labeling / Tagging Tasks) : a problem in which you classify each word / token in the corpus into some category. It has been observed that a Masked Language Model (MLM) based base model gives better results on NLU tasks than a next-subword-prediction based base model. There are numerous word level classification tasks, e.g. POS tagging and NER tagging (a minimal classification sketch follows this list).
  • Generative Tasks / Natural Language Generation (NLG) Tasks / Text2Text Tasks / Sequence2Sequence (Seq2Seq or S2S) Learning Tasks : a family of tasks in which we generate new sentences from input sentences. It has been observed that a next-subword-prediction based foundation model gives better results on NLG tasks than a Masked Language Model based foundation model. These include tasks like machine translation (a translation sketch follows this list).

  • Instruction Tuning Task
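
To make the classification tasks above concrete, here is a minimal SFT-style sketch (the model name, label counts, and the single example are assumptions for illustration) of attaching a sentence-level head and a token-level head to a pretrained encoder with Hugging Face `transformers`:

```python
import torch
from transformers import (AutoTokenizer,
                          AutoModelForSequenceClassification,
                          AutoModelForTokenClassification)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tok("This movie was great!", return_tensors="pt")

# Sentence-level classification (e.g. sentiment / toxic comments): one label per sentence.
sent_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
with torch.no_grad():
    sent_logits = sent_model(**batch).logits   # shape: (1, 2)

# Token-level classification (e.g. POS / NER tagging): one label per token.
tag_model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)
with torch.no_grad():
    tag_logits = tag_model(**batch).logits     # shape: (1, seq_len, 9)

print(sent_logits.argmax(-1), tag_logits.argmax(-1))
```

Fine-tuning then just means training these (randomly initialized) heads, and usually the encoder as well, with cross-entropy loss on labelled examples.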
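
And a matching sketch for the generative / seq2seq family, using machine translation as the example (the `Helsinki-NLP/opus-mt-en-fr` checkpoint is an assumed choice; any seq2seq model would do):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# English -> French machine translation as a text2text / seq2seq task.
name = "Helsinki-NLP/opus-mt-en-fr"   # assumed checkpoint; swap in any seq2seq model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

batch = tok("The weather is nice today.", return_tensors="pt")
out_ids = model.generate(**batch, max_new_tokens=40)
print(tok.decode(out_ids[0], skip_special_tokens=True))
```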

$\color{yellow}{B.\ Reinforcement\ Learning\ Based\ Fine\ Tuning}$

Above we saw how to finetune a foundation LLM for different downstream tasks using SFT, but for all those tasks we can also finetune a foundation model using RL. There are two ways to use RL for finetuning:

  • Manual Reward Function Based RL : here we finetune using RL when you 'can design' a good reward function for your downstream task.
  • (preferred) Automatic Reward Function Based RL : here we finetune using RL when you 'cannot design' a good reward function for your downstream task, with the help of a preference dataset. There are multiple methods to do this (a minimal DPO loss sketch follows this list):
    • Reinforcement Learning from Human Feedback (RLHF) : trains a separate reward model
    • (preferred) Direct Preference Optimization (DPO) : uses the base LLM itself as the reward model
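
Since DPO is the preferred option above, here is a minimal sketch of its loss computed on per-sequence log-probabilities (the tensors below are dummy numbers; in practice they come from scoring a preference pair with the policy and the frozen reference model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: make the policy prefer the chosen answer over the rejected one
    more strongly than the frozen reference model does -- no separate reward model."""
    chosen_margin = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Dummy per-sequence log-probabilities for two preference pairs (illustrative numbers only).
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.0, -10.5]))
print(loss.item())
```

In practice a library such as TRL (which ships a DPOTrainer) handles the log-probability bookkeeping for you, so you rarely write this loss by hand.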

Important

It has been shown previously that it is better to first finetune the LLM on a task using SFT and then finetune on the same task using RL; this gives better outcomes. This notebook from Unsloth follows that recipe (SFT first, then RL) to convert Qwen3 from a non-reasoning model into a reasoning model.

$\color{cyan}{LLM\ Systems}$

With foundation models that can perform multiple tasks, you often only need prompting to solve a single downstream task. There are many frameworks you can use to build these LLM systems; a few good ones are DSPy || AutoGen || LangGraph || CrewAI.
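
As a minimal example of solving one of the downstream tasks above purely by prompting, here is a sketch using the OpenAI Python client for sentiment classification; the model name and one-word answer format are assumptions for illustration, and any of the frameworks above could be used instead:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def classify_sentiment(review: str) -> str:
    # Zero-shot prompting: no finetuning, the task is specified entirely in the prompt.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; any instruction-tuned LLM would do
        messages=[
            {"role": "system", "content": "Answer with exactly one word: positive or negative."},
            {"role": "user", "content": f"What is the sentiment of this review?\n\n{review}"},
        ],
    )
    return resp.choices[0].message.content.strip().lower()

print(classify_sentiment("The plot was dull and the acting was worse."))
```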
