LLMs, while useful for solving a variety of NLP use cases, are costly and cumbersome to productionise. Organisations deploy LLMs on expensive A100/H100 GPUs, yet extract underwhelming performance in terms of latency and throughput.
Design and develop a novel, generalised model optimisation and inference script for MistralForCausalLM-based LLMs that (see the sketch after this list):
- Accepts a Huggingface model path as input (example: mistralai/Mistral-7B-v0.1)
- Optimises the model for faster inference (including warmups etc.)
- Waits for the user to input a prompt
- Runs the model on the prompt
- Outputs the model response and performance metrics
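As a starting point, here is a minimal sketch of that flow in plain `transformers`. It is not the required optimised solution; the fp16 dtype, warmup count, and generation settings are illustrative assumptions:

```python
# Minimal sketch: load, warm up, prompt, generate, report metrics.
# Assumes `torch`, `transformers`, and `accelerate` are installed and a CUDA GPU is present.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = input("Huggingface model path (e.g. mistralai/Mistral-7B-v0.1): ").strip()
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # T4 lacks bfloat16 support, so fp16 is the usual choice
    device_map="auto",
)
model.eval()

# Warmup: a few short generations so CUDA kernels and caches are initialised
# before any timing is taken.
warm = tokenizer("warmup", return_tensors="pt").to(model.device)
with torch.inference_mode():
    for _ in range(3):
        model.generate(**warm, max_new_tokens=8)

prompt = input("Prompt: ")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

n_in = inputs["input_ids"].shape[1]
n_out = out.shape[1] - n_in
print(tokenizer.decode(out[0][n_in:], skip_special_tokens=True))
print(f"in={n_in} out={n_out} tokens, {elapsed:.2f}s, "
      f"total throughput={(n_in + n_out) / elapsed:.1f} tok/s")
```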
Develop the script to beat the following benchmark under the set constraints:
Total throughput (input + output tokens per second) = 200 tokens/sec
Here are the other details (a quick feasibility check follows this list):
- Input tokens = 128
- Output tokens = 128
- Concurrency = 32
- GPU = 1 × NVIDIA Tesla T4 (16GB VRAM)
- Model dtype = any dtype supported by said GPU
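To make the target concrete, here is a back-of-the-envelope check. It assumes total throughput is measured as all processed tokens divided by wall-clock time, which the brief does not spell out:

```python
# Feasibility check of the 200 tok/s target under the stated constraints.
# The numbers come straight from the brief; the deadline derivation is an
# assumption about how total throughput is measured.
IN_TOKENS, OUT_TOKENS = 128, 128
CONCURRENCY = 32
TARGET_TPS = 200

tokens_per_wave = CONCURRENCY * (IN_TOKENS + OUT_TOKENS)  # 32 * 256 = 8192 tokens
max_wave_seconds = tokens_per_wave / TARGET_TPS           # 8192 / 200 = 40.96 s
per_request_tps = TARGET_TPS / CONCURRENCY                # 6.25 tok/s per request

print(f"A wave of {CONCURRENCY} requests ({tokens_per_wave} tokens) "
      f"must finish within {max_wave_seconds:.1f}s "
      f"(~{per_request_tps:.2f} tok/s per request).")
```

In other words, with continuous or static batching across 32 concurrent requests, each batch of 8192 tokens must complete in under ~41 seconds to beat the benchmark.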
Make the script compatible with LoRA models (a LoRA-loading sketch follows the platform list). Ensure the script runs on the following platforms:
- Google Colab
- Kaggle
- Amazon SageMaker Studio Lab
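One possible way to add LoRA support, continuing from the earlier sketch's `model` and assuming the `peft` package. The adapter prompt and the choice to merge the weights are illustrative, not prescribed by the brief:

```python
# Optional LoRA support: if the user supplies an adapter path, attach it
# to the already-loaded base model. Assumes `peft` is installed.
from peft import PeftModel

adapter_path = input("LoRA adapter path (blank to skip): ").strip()
if adapter_path:
    model = PeftModel.from_pretrained(model, adapter_path)
    # Merge the LoRA weights into the base weights so generation pays
    # no per-step adapter overhead on the T4.
    model = model.merge_and_unload()
```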
For any clarifications, contact devansh@simplismart.ai or daksh.goel@simplismart.tech