
Commit 6556e08

Update README.md to include RabbitLLM logo, clarify compatibility with Qwen2 and Qwen3 models, and provide detailed architecture support status. Remove outdated macOS installation instructions and enhance model loading examples for better user guidance.
1 parent 6e44fb5 commit 6556e08

2 files changed: 27 additions & 29 deletions

File tree

- README.md
- assets/logo-rabbitllm.jpg

README.md (27 additions, 29 deletions)

@@ -1,15 +1,21 @@
 # RabbitLLM
 
+![RabbitLLM logo](assets/logo-rabbitllm.jpg)
+
 **Run 70B+ LLMs on a single 4GB GPU — no quantization required.**
 
 [![PyPI](https://img.shields.io/pypi/v/rabbitllm?color=blue)](https://pypi.org/project/rabbitllm/)
 [![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
 [![CI](https://github.com/manuelslemos/rabbitllm/actions/workflows/ci.yml/badge.svg)](https://github.com/manuelslemos/rabbitllm/actions)
 
-RabbitLLM enables inference on large language models (70B+ parameters) on consumer GPUs with as
-little as 4GB VRAM by streaming model layers one at a time through GPU memory. No quantization,
-distillation, or pruning needed — full model quality.
+RabbitLLM is a **fork of [AirLLM](https://github.com/airllm/airllm)**. It enables inference on large language models (70B+ parameters) on consumer GPUs with as little as 4GB VRAM by streaming model layers one at a time through GPU memory. No quantization, distillation, or pruning needed — full model quality.
+
+### Compatibility (current status)
+
+- **Tested and supported:** only **Qwen2** and **Qwen3** are currently tested and compatible. Use these families for reliable results.
+- **Other architectures** (Llama, Mistral, Mixtral, etc.) are present in the codebase but **not yet compatible** — use at your own risk.
+- **Apple (macOS / Apple Silicon)** is **not supported**; run on Linux or Windows with a CUDA-capable GPU (or CPU fallback on x86/ARM Linux).
 
 ## How it works
 
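A rough mental model for the layer-streaming approach described in the new intro, as a minimal PyTorch sketch. This is illustrative only, not RabbitLLM's actual internals: it assumes the checkpoint has already been split into one saved `nn.Module` per layer (the hypothetical `layer_files` list), so peak VRAM stays near one layer's weights plus activations.

```python
import torch

def stream_forward(layer_files, hidden_states, device="cuda"):
    """Run a forward pass by moving one layer at a time through GPU memory."""
    for path in layer_files:
        # weights_only=False because each file holds a pickled nn.Module (sketch assumption)
        layer = torch.load(path, map_location="cpu", weights_only=False)
        layer.to(device)                    # only this layer occupies VRAM
        with torch.no_grad():
            hidden_states = layer(hidden_states.to(device))
        layer.to("cpu")                     # release VRAM before loading the next layer
        del layer
        torch.cuda.empty_cache()
    return hidden_states
```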

@@ -42,7 +48,7 @@ If the prebuilt wheel is unavailable for your setup, install from
 ```python
 from rabbitllm import AutoModel
 
-model = AutoModel.from_pretrained("meta-llama/Llama-3-8B")
+model = AutoModel.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct") # or any Qwen2 / Qwen3
 
 input_tokens = model.tokenizer(
     ["What is the capital of France?"],
@@ -68,19 +74,21 @@ no need to pick the right class manually.
 
 ## Supported models
 
-| Family | Architectures | Class |
-|---|---|---|
-| Llama 2 / 3 / 3.1 / 3.2 | `LlamaForCausalLM` | `RabbitLLMLlama2` |
-| Qwen2 / Qwen2.5 / Qwen3 | `Qwen2ForCausalLM`, `Qwen3ForCausalLM` | `RabbitLLMQWen2` |
-| Qwen v1 | `QWenLMHeadModel` | `RabbitLLMQWen` |
-| Mistral | `MistralForCausalLM` | `RabbitLLMMistral` |
-| Mixtral | `MixtralForCausalLM` | `RabbitLLMMixtral` |
-| InternLM | `InternLMForCausalLM` | `RabbitLLMInternLM` |
-| ChatGLM | `ChatGLMModel` | `RabbitLLMChatGLM` |
-| Baichuan | `BaichuanForCausalLM` | `RabbitLLMBaichuan` |
-| Gemma 2 / 3 | `Gemma2ForCausalLM`, `Gemma3ForCausalLM` | `RabbitLLMLlama2` |
-| DeepSeek V2 / V3 | `DeepseekV2ForCausalLM`, `DeepseekV3ForCausalLM` | `RabbitLLMLlama2` |
-| Phi 2 / 3 / 4 | `Phi3ForCausalLM`, `Phi4ForCausalLM` | `RabbitLLMLlama2` |
+**Only Qwen2 and Qwen3 are tested and supported.** The following table lists the architectures present in the codebase; others are not yet compatible.
+
+| Family | Architectures | Class | Status |
+|---|---|---|---|
+| **Qwen2 / Qwen2.5 / Qwen3** | `Qwen2ForCausalLM`, `Qwen3ForCausalLM` | `RabbitLLMQWen2` | **Tested, supported** |
+| Llama 2 / 3 / 3.1 / 3.2 | `LlamaForCausalLM` | `RabbitLLMLlama2` | Not yet compatible |
+| Qwen v1 | `QWenLMHeadModel` | `RabbitLLMQWen` | Not yet compatible |
+| Mistral | `MistralForCausalLM` | `RabbitLLMMistral` | Not yet compatible |
+| Mixtral | `MixtralForCausalLM` | `RabbitLLMMixtral` | Not yet compatible |
+| InternLM | `InternLMForCausalLM` | `RabbitLLMInternLM` | Not yet compatible |
+| ChatGLM | `ChatGLMModel` | `RabbitLLMChatGLM` | Not yet compatible |
+| Baichuan | `BaichuanForCausalLM` | `RabbitLLMBaichuan` | Not yet compatible |
+| Gemma 2 / 3 | `Gemma2ForCausalLM`, `Gemma3ForCausalLM` | `RabbitLLMLlama2` | Not yet compatible |
+| DeepSeek V2 / V3 | `DeepseekV2ForCausalLM`, `DeepseekV3ForCausalLM` | `RabbitLLMLlama2` | Not yet compatible |
+| Phi 2 / 3 / 4 | `Phi3ForCausalLM`, `Phi4ForCausalLM` | `RabbitLLMLlama2` | Not yet compatible |
 
 Unknown architectures fall back to the Llama-based implementation with a warning.
 
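The fallback noted in the last context line amounts to a lookup with a default. Below is a hypothetical sketch of such dispatch, using class names from the table (abridged); the project's real selection logic may differ.

```python
import warnings

# Architecture-to-class mapping mirroring the table above (abridged).
ARCH_TO_CLASS = {
    "Qwen2ForCausalLM": "RabbitLLMQWen2",
    "Qwen3ForCausalLM": "RabbitLLMQWen2",
    "LlamaForCausalLM": "RabbitLLMLlama2",
    "MistralForCausalLM": "RabbitLLMMistral",
}

def resolve_class(architecture: str) -> str:
    cls = ARCH_TO_CLASS.get(architecture)
    if cls is None:
        # Unknown architectures fall back to the Llama-based implementation.
        warnings.warn(f"Unknown architecture {architecture!r}; using RabbitLLMLlama2")
        cls = "RabbitLLMLlama2"
    return cls
```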

@@ -109,7 +117,7 @@ Block-wise quantization reduces on-disk and in-memory layer size:
 - **8-bit**: ~50% of original size.
 
 ```python
-model = AutoModel.from_pretrained("mistralai/Mistral-7B-v0.1", compression="4bit")
+model = AutoModel.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", compression="4bit")
 ```
 
 Requires `bitsandbytes`: `pip install bitsandbytes`.
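Because `bitsandbytes` is an optional dependency, one defensive pattern is to request compression only when it is importable. The availability check is plain Python; treating `compression=None` as "no compression" is an assumption about the API, not confirmed by the diff.

```python
import importlib.util
from rabbitllm import AutoModel

# Fall back to uncompressed loading when bitsandbytes is not installed.
compression = "4bit" if importlib.util.find_spec("bitsandbytes") else None
model = AutoModel.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", compression=compression)
```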
@@ -119,7 +127,7 @@ Requires `bitsandbytes`: `pip install bitsandbytes`.
 Pass a HuggingFace token for repos that require access approval:
 
 ```python
-model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", token="hf_YOUR_TOKEN")
+model = AutoModel.from_pretrained("Qwen/Qwen2.5-7B-Instruct", token="hf_YOUR_TOKEN")
 ```
 
 Or set the `HF_TOKEN` environment variable.
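The environment-variable alternative mentioned in the last context line would look like this; `hf_YOUR_TOKEN` is a placeholder, and `HF_TOKEN` is read by the HuggingFace stack.

```python
import os
from rabbitllm import AutoModel

os.environ["HF_TOKEN"] = "hf_YOUR_TOKEN"  # placeholder; equivalent to passing token=...
model = AutoModel.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
```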
@@ -135,16 +143,6 @@ export HF_HOME="$(pwd)/models"
 The `models/` directory is in `.gitignore`. RabbitLLM will store split layers alongside
 the HuggingFace cache.
 
-## macOS / Apple Silicon
-
-Install and run the same way. Requires [mlx](https://github.com/ml-explore/mlx) and PyTorch.
-Only Apple Silicon is supported.
-
-```bash
-pip install mlx torch
-python -c "from rabbitllm import AutoModel; model = AutoModel.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0')"
-```
-
 ## Documentation
 
 - [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) — Design decisions: layer-streaming, KV cache, tied weights, attention implementations.

assets/logo-rabbitllm.jpg

56.3 KB (binary image added; no text diff)

0 commit comments
