Investigate KV cache quantization and memory optimization #1

Description

@coverblew

Context

With Bonsai-8B on Jetson Nano (4 GB RAM), the memory breakdown is:

  • Model: 1016 MB (Q1_0, cannot be reduced)
  • KV cache: 576 MB (FP16, K: 288 MB + V: 288 MB)
  • Compute buffer: 304 MB (FP32)
  • Free: ~980 MB
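
The KV cache figure can be sanity-checked from the model shape. The sketch below assumes 36 layers, 8 KV heads, a head dimension of 128, and 2-byte FP16 elements at the 4096-token context. The layer and head values are not stated in this issue; they are assumptions chosen because they reproduce the reported 288 MB per tensor (treating MB as MiB).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Return (bytes for K alone, bytes for K + V) for a dense KV cache."""
    per_tensor = n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
    return per_tensor, 2 * per_tensor


MIB = 2 ** 20

# Assumed model shape (hypothetical, not confirmed by the issue).
k_bytes, total_bytes = kv_cache_bytes(
    n_layers=36, n_kv_heads=8, head_dim=128, n_ctx=4096, bytes_per_elem=2
)
print(k_bytes / MIB, total_bytes / MIB)  # 288.0 576.0
```

Because the size is linear in n_ctx, halving the context to 2048 tokens would halve the cache under the same assumptions.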

Attempted: KV cache quantization

Tested -ctk q8_0 -ctv q8_0, which should reduce the KV cache from 576 MB to 306 MB (a 270 MB saving).

Result: SEGFAULT during model warm-up. There are two likely causes: either the Q1_0 kernels from the PrismML fork are incompatible with a quantized KV cache, or our if constexpr -> if patches broke the type guards for unsupported KV type combinations.
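
The 306 MB estimate can be reproduced from the ggml block layouts, assuming the fork keeps the upstream formats: q8_0 stores 32 values in 34 bytes (a 2-byte FP16 scale plus 32 int8 values), and q4_0 stores 32 values in 18 bytes. A minimal sketch, with illustrative names:

```python
# (elements per block, bytes per block), assuming upstream ggml layouts.
BLOCK_LAYOUT = {
    "f16": (1, 2),
    "q8_0": (32, 34),  # 2-byte scale + 32 x int8
    "q4_0": (32, 18),  # 2-byte scale + 32 x 4-bit
}


def bits_per_element(cache_type):
    elems, nbytes = BLOCK_LAYOUT[cache_type]
    return nbytes * 8 / elems


def kv_cache_mb(fp16_mb, cache_type):
    """Scale an FP16 KV cache size to another cache type."""
    return fp16_mb * bits_per_element(cache_type) / 16


for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_mb(576, cache_type)
    print(cache_type, size, 576 - size)
```

Under these layouts q8_0 costs 8.5 bits per element and q4_0 costs 4.5, so the per-block scale is what keeps the savings slightly below a clean 50% or 75%.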

Potential optimizations to investigate

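  1. KV cache quantization - Debug the SEGFAULT and identify which kernel or operation crashes with a q8_0 KV cache. A working quantized cache would save 270 MB with q8_0, or roughly 414 MB with q4_0 (162 MB at 4.5 bits per element, counting the per-block scale).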

  2. BF16 stub to FP16 - Currently our BF16 stub converts through FP32. Converting through FP16 instead could save memory in intermediate buffers. Impact estimated at a few MB (BF16 paths are not heavily used in Q1_0 inference).

  3. Compute buffer FP16 - The 304 MB compute buffer uses FP32. If activations could use FP16, this would save ~150 MB. Requires code changes in ggml graph allocation, not just CLI flags.
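
One thing to check before changing the BF16 stub (item 2): BF16 keeps FP32's 8-bit exponent, while FP16 has only 5 exponent bits and tops out at 65504, so a BF16 value can be out of range for FP16. The stdlib-only sketch below illustrates the range gap; it is not the stub's actual code, and the helper names are hypothetical.

```python
import struct


def fp32_to_bf16_bits(x):
    """Truncate an FP32 value to its top 16 bits (BF16, round toward zero)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


def bf16_bits_to_fp32(bits):
    """Widen BF16 to FP32 by placing the 16 bits in the high half."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def fits_fp16(x):
    """True if x can be packed as IEEE half precision without overflow."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


value = bf16_bits_to_fp32(fp32_to_bf16_bits(1e5))
print(value, fits_fp16(value))  # 99840.0 False
print(fits_fp16(1.0))           # True
```

If the tensors that go through the stub stay within FP16 range, the FP16 path is lossless apart from rounding; otherwise it would need clamping or a fallback to FP32.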

Why it matters

Every MB saved on a 4 GB system enables:

  • Longer context (currently 4096 tokens)
  • More headroom for system stability
  • Potential to run larger models
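
To put the "longer context" point in numbers: KV cache size is linear in context length, so keeping the current 576 MB budget while storing fewer bits per element raises the reachable context proportionally. The sketch below assumes upstream ggml costs of 8.5 bits per element for q8_0 and 4.5 for q4_0, and ignores any growth of the compute buffer with context, so treat the results as upper bounds.

```python
def max_context(ref_ctx, bits_per_elem, ref_bits=16):
    """Context length that fits in the memory used by ref_ctx tokens at ref_bits."""
    return int(ref_ctx * ref_bits / bits_per_elem)


for name, bits in (("f16", 16), ("q8_0", 8.5), ("q4_0", 4.5)):
    print(name, max_context(4096, bits))
# f16 4096
# q8_0 7710
# q4_0 14563
```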

Test environment

  • Jetson Nano 4 GB, CUDA 10.2, SM 5.3
  • llamita.cpp (PrismML fork + CUDA 10.2 patches)
  • Bonsai-8B Q1_0_g128
