Context
With Bonsai-8B on Jetson Nano (4 GB RAM), the memory breakdown is:
- Model: 1016 MB (Q1_0, cannot be reduced)
- KV cache: 576 MB (FP16, K: 288 MB + V: 288 MB)
- Compute buffer: 304 MB (FP32)
- Free: ~980 MB
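The KV cache figure can be cross-checked from the model dimensions. The sketch below is only an illustration: the layer count, KV head count and head dimension are assumed values, not taken from this report, chosen because they reproduce the reported 288 MB each for K and V at 4096 tokens.

```python
# Sketch: derive the FP16 KV cache size and sum the reported breakdown.
# ASSUMPTION: n_layers=36, n_kv_heads=8, head_dim=128 are hypothetical
# dimensions that happen to match the reported 288 MB + 288 MB.
MIB = 1024 * 1024


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Bytes needed for the K and V caches across all layers."""
    per_tensor = n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return 2 * per_tensor  # one K cache plus one V cache


kv_mb = kv_cache_bytes(36, 8, 128, n_ctx=4096, bytes_per_elem=2) // MIB

# Figures reported above, in MB.
budget_mb = {"model": 1016, "kv_cache": kv_mb, "compute": 304, "free": 980}
accounted_mb = sum(budget_mb.values())

print(f"KV cache: {kv_mb} MB, accounted for: {accounted_mb} MB")
```

The four entries sum to 2876 MB; the remainder of the 4 GB is not covered by this breakdown.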
Attempted: KV cache quantization
Tested -ctk q8_0 -ctv q8_0, which should reduce the KV cache from 576 MB to 306 MB (270 MB saved).
Result: SEGFAULT during model warm-up. Two likely causes: either the Q1_0 kernels from the PrismML fork are incompatible with a quantized KV cache, or our "if constexpr" -> "if" patches broke the type guards for unsupported KV type combinations.
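The 306 MB figure follows from the ggml block layout, where every block of 32 values carries a 2-byte FP16 scale. A minimal sketch of that arithmetic, assuming the KV cache scales linearly with bits per value (the helper name is hypothetical):

```python
# Sketch: estimate quantized KV cache size from the ggml block layout.
# Each entry is (bytes per block, values per block).
BLOCK_LAYOUT = {
    "f16": (2, 1),
    "q8_0": (34, 32),  # 2-byte FP16 scale + 32 int8 values
    "q4_0": (18, 32),  # 2-byte FP16 scale + 32 packed 4-bit values
}


def quantized_size_mb(fp16_mb, cache_type):
    """Scale an FP16 cache size by the bits per value of cache_type."""
    block_bytes, block_elems = BLOCK_LAYOUT[cache_type]
    bits_per_value = 8 * block_bytes / block_elems
    return fp16_mb * bits_per_value / 16


for cache_type in ("q8_0", "q4_0"):
    size = quantized_size_mb(576, cache_type)
    print(f"{cache_type}: {size:.0f} MB, saves {576 - size:.0f} MB")
```

With the per-block scale counted, q8_0 costs 8.5 bits per value (306 MB) and q4_0 costs 4.5 bits per value (162 MB, a saving of 414 MB), slightly less than a pure 4-bit estimate would suggest.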
Potential optimizations to investigate
- KV cache quantization - Debug the SEGFAULT: identify which kernel or operation crashes with a q8_0 KV cache. A fix would save 270 MB (q8_0) or about 414 MB (q4_0, at 4.5 bits per value once block scales are counted).
- BF16 stub to FP16 - Our BF16 stub currently converts through FP32. Converting through FP16 instead could shrink intermediate buffers. The estimated impact is a few MB, since BF16 paths are not heavily used in Q1_0 inference.
- Compute buffer FP16 - The 304 MB compute buffer uses FP32. If activations could use FP16, this would save ~150 MB. This requires code changes in ggml graph allocation, not just CLI flags.
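One way to start narrowing down the SEGFAULT is to quantize K and V separately and see which combination crashes. The sketch below only builds the command lines; the binary name and model path are placeholders (assumptions), and only -ctk / -ctv are taken from the test above.

```python
# Sketch: enumerate K/V cache type combinations to isolate the crash.
# ASSUMPTION: "./llama-cli" and "model.gguf" are placeholder names.
import itertools
import shlex

CACHE_TYPES = ["f16", "q8_0"]


def build_commands(binary="./llama-cli", model="model.gguf", n_ctx=4096):
    """Return one command line per (K type, V type) pair."""
    commands = []
    for ctk, ctv in itertools.product(CACHE_TYPES, repeat=2):
        argv = [binary, "-m", model, "-c", str(n_ctx),
                "-ctk", ctk, "-ctv", ctv]
        commands.append(shlex.join(argv))
    return commands


for command in build_commands():
    print(command)
```

If only the runs with a quantized V cache crash (or only those with a quantized K cache), that points at the corresponding code path; if every quantized run crashes, the cause is more likely shared between the two.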
Why it matters
Every MB saved on a 4 GB system enables:
- Longer context (currently 4096 tokens)
- More headroom for system stability
- Potential to run larger models
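As a rough illustration of the first point, the sketch below estimates how many tokens of context fit in the current 576 MB KV budget if the cache were stored as q8_0. It assumes KV memory grows linearly with context length and ignores any growth of the compute buffer, so it is an upper bound rather than a prediction.

```python
# Sketch: context length that fits in a fixed KV cache budget.
# ASSUMPTION: KV memory is linear in context; compute buffer is ignored.


def context_for_budget(budget_mb, ref_mb, ref_ctx):
    """Tokens that fit in budget_mb, given ref_mb is needed for ref_ctx."""
    return budget_mb * ref_ctx // ref_mb


fp16_ctx = context_for_budget(576, ref_mb=576, ref_ctx=4096)
q8_ctx = context_for_budget(576, ref_mb=306, ref_ctx=4096)

print(f"FP16: {fp16_ctx} tokens, q8_0: {q8_ctx} tokens")
```

Under these assumptions, the same 576 MB would hold roughly 7700 tokens with a q8_0 cache instead of 4096 with FP16.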
Test environment
- Jetson Nano 4 GB, CUDA 10.2, SM 5.3
- llamita.cpp (PrismML fork + CUDA 10.2 patches)
- Bonsai-8B Q1_0_g128