Reduce LM memory usage by levinkhho · Pull Request #60 · apple/ml-mdm

levinkhho · 2024-12-13T19:27:09Z

If CUDA is available: loads the language model (LM) in 8-bit quantized format using bitsandbytes.
Otherwise: loads the LM in torch.float16.
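
The branch described above could look roughly like the following. This is a minimal sketch, not the actual change to factory.py: the function names `choose_lm_precision` and `load_language_model` are hypothetical, and it assumes the LM is loaded through Hugging Face `transformers` with `AutoModel` (the 8-bit path also needs the `bitsandbytes` and `accelerate` packages installed).

```python
def choose_lm_precision(cuda_available: bool) -> str:
    """Return "int8" when CUDA is available, otherwise "float16"."""
    return "int8" if cuda_available else "float16"


def load_language_model(model_name: str):
    """Hypothetical loader illustrating the branch; not the code in factory.py."""
    # Imported lazily so the selection logic above works without these packages.
    import torch
    from transformers import AutoModel, BitsAndBytesConfig

    if choose_lm_precision(torch.cuda.is_available()) == "int8":
        # 8-bit weights via bitsandbytes; device_map="auto" places them on the GPU.
        return AutoModel.from_pretrained(
            model_name,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto",
        )
    # No CUDA: fall back to half-precision weights.
    return AutoModel.from_pretrained(model_name, torch_dtype=torch.float16)
```

Keeping the precision choice in a small pure function makes the branch easy to check without downloading a model or requiring a GPU.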

One could also look into using CTranslate2 for quantization, which would work on CPU.

Update factory.py

e1c58c7

levinkhho mentioned this pull request Dec 13, 2024

Use a quantized LM or something smaller to make it easier to run the demo #47

Open

Update factory.py

52c0d19
