Skip to content

Latest commit

 

History

History
59 lines (41 loc) · 7.16 KB

File metadata and controls

59 lines (41 loc) · 7.16 KB

About running SLMs on CPU with limited RAM

Short answer: with plain PyTorch on a CPU and only 16 GB of system RAM, loading Meta‑Llama 3.1 8B “in 16‑bit” is not realistically workable. The weights alone are ~14.9 GiB if stored in 16‑bit (bf16) and you still need memory for the KV cache, model buffers, and the OS—so you’ll cross 16 GB even at modest context lengths. On CPU you generally won’t use fp16 (“half”) anyway; PyTorch’s CPU path either doesn’t implement it or upcasts, and the practical 16‑bit format is bfloat16 (bf16). (Hugging Face)

Why this happens (numbers you can trust)

  • Model weights: 8 B params × 2 bytes (bf16) ≈ 16.0 GB (14.9 GiB). This aligns with HF’s guidance that the 3.1‑8B typically “requires about 16 GB” just for weights on a GPU. On CPU you don’t have VRAM—this has to fit in RAM together with everything else. (Hugging Face)
  • KV cache (per token): for Llama 3/3.1‑8B the config is hidden_size = 4096, layers = 32, heads = 32, kv_heads = 8 → head_dim = 128. KV cache bytes per token ≈ 2 (K,V) × kv_heads (8) × head_dim (128) × bytes (bf16=2) × layers (32) ≈ 128 KB / token. So: • 2k ctx ≈ 0.25 GiB; 4k ≈ 0.5 GiB; 8k ≈ 1.0 GiB; 32k ≈ 4 GiB. Even at 4k tokens you’re ~15.4 GiB just weights+KV, before any framework overhead. (Hugging Face)
  • CPU dtypes: PyTorch CPU commonly errors on fp16 (e.g., “Half” not implemented for addmm) and often upcasts if you try. bf16 on CPU is supported (esp. on newer Intel/AMD or ARM cores), but many ops still compute in fp32 while keeping weights in bf16. In practice you don’t get a “pure fp16” CPU graph. (Stack Overflow)

So am I forced to fp32 on CPU? No—but you’re also not getting a clean fp16 path. The realistic CPU choice is bf16 for weights (and some activations), with parts computed in fp32 under the hood. That still leaves you right on the edge (or over) with 16 GB RAM once you include KV cache and overhead. (Hugging Face)

What actually works on 16 GB RAM (solutions that are current)

A) Weight‑only quantization on CPU (recommended) Use INT8/INT4 with Intel’s IPEX‑LLM or Optimum‑Intel/OpenVINO. Both are actively maintained for CPU and bring the 8B weights down to ~8 GiB (INT8) or ~4 GiB (INT4), leaving room for KV cache and the OS. IPEX‑LLM has drop‑in HF‑style loaders (load_in_4bit=True / load_in_low_bit), supports Llama 3/3.1 on CPU, and documents INT8/INT4 examples. Optimum‑Intel can quantize via Intel Neural Compressor and run via OpenVINO. These routes are the most reliable way to run 3.1‑8B on a 16 GB CPU box. (GitHub)

B) bitsandbytes on CPU As of the latest docs, bitsandbytes has a CPU backend. You can use 8‑bit or 4‑bit quantization in Transformers without a GPU (performance isn’t as tuned as IPEX/OpenVINO on CPU, but it works). (Hugging Face)

C) llama.cpp (GGUF) route If you’re open to leaving PyTorch for inference, convert to GGUF and run q4_k* on CPU. This is the most RAM‑efficient (weights ~4–6 GB), and you can cap --ctx_size to tame KV memory. (GitHub)

D) If you insist on “no quantization, PyTorch only” Use bf16 weights, batch_size=1, short contexts (≤2k), and “low_cpu_mem_usage=True”. You might squeeze by on a clean 16 GB machine, but any real workload, longer contexts, or background processes will push you into swapping. Expect it to be fragile and slow. (HF’s own guidance that 8B “needs ~16 GB” was about VRAM; on CPU you don’t have that headroom.) (Hugging Face)

Notes & gotchas

  • fp16 (“Half”) on CPU: many ops are not implemented; stick to bf16 if you want 16‑bit storage. Intel’s AMP and oneDNN help, but kernels still often accumulate in fp32. (Stack Overflow)
  • KV cache dtype: with PyTorch/Transformers you can often keep KV in bf16 to halve KV memory; if it silently upcasts to fp32 on your CPU, memory will double vs. the table above. HF’s KV‑cache guide covers the knobs. (Hugging Face)
  • Context size matters: llama.cpp defaults may try huge contexts (e.g., 128k), exploding RAM use; cap it (e.g., 4–8k) on 16 GB systems. (GitHub)

A pragmatic recipe for 16 GB RAM (CPU)

  • Best chance in PyTorch: quantize. For example, with IPEX‑LLM load 3.1‑8B in INT4/INT8 and run on CPU; this keeps total RAM in the ~6–10 GB range at 4k context, leaving room for the OS. (testbigdldocshane.readthedocs.io)
  • If you must stay unquantized: expect OOM at ≥4k context and be prepared for paging. Consider upgrading to 32 GB+ RAM if you need bf16 weights and multi‑KB contexts in PyTorch. (HF’s blog post’s 16 GB figure is a lower bound for weights only.) (Hugging Face)

My take (with certainty)

  • “16 GB RAM is not enough for Llama 3.1‑8B in bf16 on CPU for practical contexts/batch sizes without swapping.” Confidence: ~90%.
  • “Quantization (INT8/INT4) on CPU via IPEX‑LLM or Optimum‑Intel is the straightforward way to make it viable on 16 GB.” Confidence: ~85%.
  • “You are not ‘forced’ to fp32 on CPU; bf16 storage works, but many ops will still run in fp32 internally; that doesn’t save you enough memory to fit comfortably.” Confidence: ~80%.

References for the above HF Llama 3.1 blog (16 GB baseline), HF KV‑cache guide, configs showing 4096/32/8 architecture, PyTorch CPU dtype limitations (half vs bf16), and Intel/IPEX‑LLM & Optimum‑Intel docs for CPU‑side INT4/INT8 quantization. (Hugging Face)

If you want, I can jot down a minimal, CPU‑friendly code path for each option (IPEX‑LLM INT4, Optimum‑Intel + OpenVINO INT8, and a “bf16‑only but tiny‑context” baseline) tailored to your machine.