We need to represent language mathematically, i.e. given a corpus, we need to convert it into a numerical form. This numerical representation is called an embedding (or context), and the process of learning it is called representation learning. Why do this? Because computers operate only on numbers, not on text. We can do this in several ways:
- Via Sentence Embedding
- Via Word Embedding
- Via Character Embedding
- Via Subword (or WordPiece) (or Token) Embedding (the most widely used approach today)
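To make the four granularities concrete, here is a minimal sketch that splits the same text at sentence, word, character, and subword level. The vocabulary `TOY_VOCAB` and the helper `wordpiece` are hypothetical and written only for illustration; real tokenizers learn their vocabulary from a corpus, and the greedy longest-match rule below is just a WordPiece-style approximation.

```python
import re

text = "Tokenizers are unbelievably useful. They split text!"

# Sentence level: naive split after sentence-ending punctuation.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# Word level: lowercase the text and keep runs of word characters.
words = re.findall(r"\w+", text.lower())

# Character level: every character is its own unit.
chars = list("split")

# Subword level: toy vocabulary (an assumption, not a real model's vocab).
# "##" marks a piece that continues a word rather than starting one.
TOY_VOCAB = {"un", "##believ", "##ably", "token", "##izers"}


def wordpiece(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word maps to the unknown token.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


print(sentences)
print(words)
print(chars)
print(wordpiece("unbelievably"), wordpiece("tokenizers"), wordpiece("xyz"))
```

Each unit (sentence, word, character, or subword piece) would then be mapped to an id and looked up in an embedding table; the sketch only covers the splitting step.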
With so many ways to convert a corpus into a numerical representation, which one should we use? It really depends on the data you are trying to convert:
- If it is social media data, character embedding may work better.
- If your language is Chinese, word embedding may work very badly, because Chinese is written without spaces between words, so identifying word boundaries is itself a big challenge. Character embedding and subword embedding both work well for Chinese.
- In languages like French and Arabic, words are separated by spaces, but a phrase that takes two words in English, say “so said”, can be just one word in these languages, so word embedding might not work well for them either.
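The word-boundary problem is easy to see in a few lines. This is a minimal sketch; the two example sentences are my own and are assumed to be rough translations of each other.

```python
english = "I like natural language processing"
chinese = "我喜欢自然语言处理"  # roughly the same sentence, written without spaces

# Whitespace splitting recovers English words...
english_words = english.split()

# ...but returns the whole Chinese sentence as a single "word".
chinese_words = chinese.split()

# Character-level splitting needs no word boundaries at all.
chinese_chars = list(chinese)

print(len(english_words), len(chinese_words), len(chinese_chars))
```

A word-level vocabulary built this way would treat almost every Chinese sentence as a unique unknown word, which is why character or subword units are the usual choice there.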
Finally, before you start building your own LLM from scratch, I would recommend reading this amazing blog from HF, which helps you understand how to find the right architecture and how big, scalable models are built!
Non Generative (Classification) Tasks - Natural Language Understanding (NLU) Tasks
- Sentence level classification Tasks
- Token / Word level classification Tasks (also called Sequence Labeling / Learning / Tagging Tasks) :
A word / token level classification problem is one in which you assign a label to each word / token in the corpus. It has been observed that a base model trained with Masked Language Modeling gives better results on NLU tasks than a base model trained with next subword prediction. There are numerous word level classification tasks, some of which are
- POS Tagging
- Named Entity Recognition (NER) Tagging
- Reading QA / Reading Comprehension
- Token Level Language Identification : Classifying whether a token / word is Hindi or English
- Chunking
- Word Sense Disambiguation (WSD) / Semantic Tagging
- Parsing
- Discourse and Coreference Resolution
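As an illustration of what "one label per token" looks like in practice, here is a minimal sketch for NER using the common BIO tagging scheme (B = beginning of an entity, I = inside, O = outside). The sentence, the tags, and the helper `bio_to_spans` are made up for the example; in a real system the tags would come from a trained token classifier.

```python
def bio_to_spans(tokens, tags):
    """Group per-token BIO tags into (entity text, entity type) spans."""
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag that does not continue the open entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


tokens = ["Sundar", "Pichai", "works", "at", "Google", "in", "California"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]  # one label per token
print(bio_to_spans(tokens, tags))
```

POS tagging, chunking, and token level language identification follow the same pattern: the model output is a tag sequence of the same length as the token sequence, and only the tag set changes.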
Generative Tasks / Natural Language Generation (NLG) Tasks / Text2Text Tasks / Sequence2Sequence (Seq2Seq or S2S) Tasks / Sequence2Sequence Learning : This is a family of tasks in which we generate new sentences from input sentences. It has been observed that a foundation model trained with next subword prediction gives better results on NLG tasks than a foundation model trained with Masked Language Modeling. These include tasks like
- Machine Translation (MT)
- Text Summarization
- Paraphrasing : Architecture similar to what we used in MT can be used here, just change the dataset!!
- Machine Transliteration : Architecture similar to what we used in MT can be used here, just change the dataset!!
- Text2Code Generation
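The two pretraining objectives mentioned above differ only in how training examples are built from a token sequence. The sketch below shows that difference on plain Python lists; the function names, the `[MASK]` string, and the 15% masking rate are illustrative assumptions, not taken from any specific model.

```python
import random


def next_token_examples(tokens):
    """Next subword prediction: predict each token from everything before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Masked language modeling: hide random tokens, predict them from both sides."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)  # position is not scored
    return inputs, labels


tokens = ["the", "cat", "sat", "on", "the", "mat"]
for context, target in next_token_examples(tokens):
    print(context, "->", target)
print(masked_lm_example(tokens, mask_prob=0.5))
```

Next subword prediction only ever sees the left context, which matches how text is generated one token at a time; masked language modeling sees both sides of each hidden token, which suits labeling tasks where the full sentence is available.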
Above we saw how to finetune a foundation LLM for different downstream tasks using SFT, but for all those tasks we can also finetune a foundation model using RL. There are two ways to use RL for finetuning :
- Manual Reward Function Based RL : Here we will see how to finetune using RL if you 'can design' a good reward function for your downstream task.
- (preferred) Automatic Reward Function Based RL : Here we will see how to finetune using RL if you 'cannot design' a good reward function for your downstream task, with the help of a preference dataset. There are multiple methods to do this :
- Reinforcement Learning from Human Feedback (RLHF) : trains a separate reward model
- (preferred) Direct Preference Optimization (DPO) : uses the LLM itself as an implicit reward model, so no separate reward model is trained
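To show how DPO gets by without a separate reward model, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probabilities of the chosen and rejected responses under the model being trained (the policy) and under a frozen reference copy of it; the function name, argument names, and the value `beta=0.1` are illustrative choices.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    # Implicit reward: how much more likely the policy makes a response
    # than the frozen reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)), written to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference yet, loss is log(2).
print(dpo_loss(-3.0, -3.0, -3.0, -3.0))
# Policy already favors the chosen response: lower loss.
print(dpo_loss(-1.0, -5.0, -3.0, -3.0))
```

The log-probability ratio plays the role that the reward model's score plays in RLHF, so training reduces to an ordinary supervised loss over the preference dataset.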
Important
It has been shown previously that it is better to first finetune an LLM on a task using SFT and then finetune it on the same task using RL; this gives better outcomes. This notebook from Unsloth follows the recipe below to convert Qwen3 from a non-reasoning model to a reasoning model.

With foundation models that can perform multiple tasks, you just need prompting to solve a single downstream task. There are many frameworks that you can use to build these LLM systems; a few good ones are DSPy || AutoGen || LangGraph || CrewAI
- Single LLM System
- Multi LLM System
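The difference between the two setups can be sketched without any framework. The sketch below assumes a single LLM system means one prompted model call and a multi LLM system means several model calls chained together; `LLM`, `single_llm_system`, `multi_llm_system`, and `fake_llm` are all hypothetical names, and `fake_llm` is an offline stand-in, not a real model or any framework's API.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def single_llm_system(llm: LLM, text: str) -> str:
    """One model, one prompt, one answer."""
    return llm(f"Summarize the following text in one sentence:\n{text}")


def multi_llm_system(writer: LLM, reviewer: LLM, text: str) -> str:
    """Two model calls chained: one drafts, the other revises the draft."""
    draft = writer(f"Summarize the following text in one sentence:\n{text}")
    return reviewer(f"Improve this summary, keeping it to one sentence:\n{draft}")


def fake_llm(prompt: str) -> str:
    # Stand-in so the sketch runs offline: echoes the last prompt line in caps.
    return prompt.splitlines()[-1].upper()


print(single_llm_system(fake_llm, "hello world"))
print(multi_llm_system(fake_llm, fake_llm, "hello world"))
```

In a real system each `LLM` would wrap an API or local model call, and the frameworks listed above mainly help with managing the prompts and the flow of messages between such calls.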