Investigate KV cache quantization and memory optimization

## Context

With Bonsai-8B on a Jetson Nano (4 GB RAM) at a 4096-token context, the memory breakdown is:
- Model: 1016 MB (Q1_0, cannot be reduced)
- KV cache: 576 MB (FP16, K: 288 MB + V: 288 MB)
- Compute buffer: 304 MB (FP32)
- Free: ~980 MB
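
The 576 MB KV figure can be cross-checked from model dimensions. This is a minimal sketch, assuming Bonsai-8B has 36 layers, 8 KV heads and a head dimension of 128. These dimensions are not stated in this issue and should be verified against the GGUF metadata. It also treats the MB figures above as MiB.

```python
# Assumed model dimensions (hypothetical; verify against the GGUF metadata).
N_LAYERS = 36
N_KV_HEADS = 8
HEAD_DIM = 128
N_CTX = 4096      # context length used in this issue
BYTES_FP16 = 2


def kv_cache_mib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_value):
    """Combined size of the K and V caches in MiB."""
    per_tensor = n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value
    return 2 * per_tensor / (1024 * 1024)


kv_mib = kv_cache_mib(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX, BYTES_FP16)
print(f"KV cache: {kv_mib:.0f} MiB (K: {kv_mib / 2:.0f} + V: {kv_mib / 2:.0f})")

# Sum of the three buffers reported in the breakdown above.
used_mib = 1016 + 576 + 304
print(f"Model + KV + compute: {used_mib} MiB")
```

Under these assumptions the script reproduces the 288 + 288 split, and the three buffers together account for 1896 MB of the 4 GB.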

## Attempted: KV cache quantization

Tested `-ctk q8_0 -ctv q8_0`, which would reduce the KV cache from 576 MB to 306 MB (a 270 MB saving).
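
The 306 MB figure follows from the q8_0 block layout. A short check, assuming the usual ggml layout in which each block of 32 values stores 32 int8 values plus one FP16 scale (34 bytes per block, versus 64 bytes for 32 FP16 values):

```python
FP16_KV_MB = 576           # FP16 KV cache size reported in this issue
VALUES_PER_BLOCK = 32
FP16_BLOCK_BYTES = VALUES_PER_BLOCK * 2   # 64 bytes for 32 FP16 values
Q8_0_BLOCK_BYTES = VALUES_PER_BLOCK + 2   # 32 int8 values + FP16 scale (assumed layout)


def quantized_kv_mb(block_bytes, fp16_mb=FP16_KV_MB):
    """Scale the FP16 cache size by the ratio of bytes per 32-value block."""
    return fp16_mb * block_bytes / FP16_BLOCK_BYTES


q8_mb = quantized_kv_mb(Q8_0_BLOCK_BYTES)
print(f"q8_0 KV cache: {q8_mb:.0f} MB, saving {FP16_KV_MB - q8_mb:.0f} MB")
```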

**Result: SEGFAULT** during model warm-up. Two causes are likely: either the Q1_0 kernels from the PrismML fork are incompatible with a quantized KV cache, or our `if constexpr` -> `if` patches broke the compile-time type guards that excluded unsupported KV type combinations. Neither has been confirmed yet.

## Potential optimizations to investigate

1. **KV cache quantization** - Debug the SEGFAULT. Identify which kernel or operation crashes with a q8_0 KV cache. A fix would save 270 MB with q8_0, or about 414 MB with q4_0 (162 MB at 4.5 bits per value once the per-block scales are counted).

2. **BF16 stub to FP16** - Our BF16 stub currently converts through FP32. Converting through FP16 instead could save memory in intermediate buffers. The estimated impact is a few MB, since BF16 paths are not heavily used in Q1_0 inference.

3. **Compute buffer FP16** - The 304 MB compute buffer uses FP32. If activations could use FP16, this would save ~150 MB. Requires code changes in ggml graph allocation, not just CLI flags.
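
For item 1, one way to narrow down the crash is to test K and V quantization separately. The sketch below builds one command line per combination and classifies how each run ended. The binary name `./llama-cli` and the model filename are assumptions about this fork's build, not confirmed paths. `run_matrix` is defined but not called here.

```python
import itertools
import signal
import subprocess

KV_TYPES = ["f16", "q8_0"]  # cache types to cross-test


def build_commands(binary="./llama-cli", model="Bonsai-8B-Q1_0_g128.gguf", n_ctx=4096):
    """One command line per (K type, V type) combination.

    The binary and model names are hypothetical placeholders.
    """
    return [
        [binary, "-m", model, "-c", str(n_ctx),
         "-ctk", ctk, "-ctv", ctv, "-n", "8", "-p", "test"]
        for ctk, ctv in itertools.product(KV_TYPES, repeat=2)
    ]


def classify(returncode):
    """Map a subprocess return code to a short label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code -N means the child was killed by signal N.
        return signal.Signals(-returncode).name
    return f"exit {returncode}"


def run_matrix(commands, timeout_s=600):
    """Run each command and record how it ended, keyed by (ctk, ctv)."""
    results = {}
    for cmd in commands:
        proc = subprocess.run(cmd, stdin=subprocess.DEVNULL,
                              stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                              timeout=timeout_s)
        key = (cmd[cmd.index("-ctk") + 1], cmd[cmd.index("-ctv") + 1])
        results[key] = classify(proc.returncode)
    return results


for cmd in build_commands():
    print(" ".join(cmd))
```

Calling `run_matrix(build_commands())` on the device would return a label such as `SIGSEGV` per combination. Comparing the K-only and V-only rows shows which side of the cache triggers the crash.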

## Why it matters

Every MB saved on a 4 GB system enables:
- Longer context (currently 4096 tokens)
- More headroom for system stability
- Potential to run larger models
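
To put the context-length benefit in numbers, the sketch below converts freed memory into extra tokens. It assumes the KV cache grows linearly with context, stays in FP16, and that all freed memory goes to the KV cache. It ignores any growth of the compute buffer with context, so the results are upper bounds.

```python
KV_FP16_MB = 576   # FP16 KV cache size at the current context
N_CTX = 4096       # current context length in tokens


def extra_tokens(freed_mb, kv_mb=KV_FP16_MB, n_ctx=N_CTX):
    """Extra context tokens that fit in freed_mb of FP16 KV cache."""
    mb_per_token = kv_mb / n_ctx   # 0.140625 MB (144 KiB) per token
    return int(freed_mb / mb_per_token)


print(extra_tokens(1))     # tokens gained per MB saved
print(extra_tokens(150))   # e.g. the ~150 MB from an FP16 compute buffer
```

At 144 KiB per token, each MB saved buys about 7 tokens of FP16 context, and 150 MB buys roughly 1000.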

## Test environment

- Jetson Nano 4 GB, CUDA 10.2, SM 5.3
- llamita.cpp (PrismML fork + CUDA 10.2 patches)
- Bonsai-8B Q1_0_g128
