Summary
Attempted to reduce activation memory by casting intermediate tensors from F32 to F16. The approach of inserting ggml_cast nodes in the model graph does NOT save memory and reduces speed.
What was tried
- After build_ffn output: ggml_cast(ctx0, cur, GGML_TYPE_F16)
- After build_attn Wo output: same cast to F16
- Before build_norm: ggml_cast(ctx0, cur, GGML_TYPE_F32) (rms_norm needs F32)
Results (Bonsai-8B on RPi 4, context 4096)
| Metric | Without F16 | With F16 casts | Change |
|---|---|---|---|
| RAM used | 1665 MB | 1672 MB | +7 MB (worse) |
| Prompt | 0.8 t/s | 0.6 t/s | -25% |
| Generation | 0.6 t/s | 0.5 t/s | -17% |
Why it doesn't work
The ggml scheduler allocates memory for ALL tensors in the graph, including cast intermediates. Adding cast nodes creates ADDITIONAL tensors that coexist with the originals, increasing peak memory instead of reducing it.
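The coexistence argument can be illustrated with a toy calculation. This is a minimal sketch, not the ggml allocator: it only counts the bytes that must be resident for one activation tensor at the moment a cast node runs. The function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model, NOT the ggml allocator: bytes that must be resident at the
 * same time for a single activation tensor of n_elements values. */

/* Baseline: only the F32 tensor exists (4 bytes per element). */
size_t live_bytes_f32_only(size_t n_elements) {
    assert(n_elements <= SIZE_MAX / 4);   /* guard against overflow */
    return n_elements * 4;
}

/* With a cast node: while the cast executes, its F32 source (4 bytes per
 * element) and its F16 destination (2 bytes per element) are both live. */
size_t live_bytes_during_cast(size_t n_elements) {
    assert(n_elements <= SIZE_MAX / 6);   /* guard against overflow */
    return n_elements * 4 + n_elements * 2;
}
```

For 1024 elements this gives 4096 bytes for the F32-only case and 6144 bytes while the cast is running, so the cast adds to the peak at that point rather than lowering it.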
What would work
Modify the ggml_mul_mat CPU kernel in ggml-cpu/ops.cpp to optionally accumulate in F32 but store the result in F16, by calling ggml_fp32_to_fp16() on each value in the output loop. This avoids extra graph nodes and directly halves the result tensor size.
This requires:
- Adding an F16 output path to ggml_compute_forward_mul_mat in ops.cpp
- Changing ggml_mul_mat() in ggml.c to create result as GGML_TYPE_F16 when a flag is set
- Ensuring all downstream ops support F16 input (add, mul, rope, silu do; rms_norm and softmax don't)
- Converting to F32 only at the few ops that require it (rms_norm, softmax)
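The "compute in F32, store in F16" idea from the list above can be sketched in standalone C. This is not the ggml kernel and does not use ggml types: `f32_to_f16`, `f16_to_f32`, and `matmul_f32_store_f16` are hypothetical helpers written for this illustration, the conversion flushes F16 subnormals to zero as a simplification, and a real patch would presumably call ggml_fp32_to_fp16() instead. Both operands are assumed to be stored as rows that are contiguous along the shared dimension k.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: F32 -> IEEE half bits, round-to-nearest-even.
 * Simplification: results below the F16 normal range become signed zero. */
uint16_t f32_to_f16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint32_t sign = (bits >> 16) & 0x8000u;
    uint32_t exp  = (bits >> 23) & 0xFFu;
    uint32_t mant = bits & 0x7FFFFFu;

    if (exp == 0xFFu)                         /* Inf or NaN */
        return (uint16_t)(sign | 0x7C00u | (mant ? 0x0200u : 0u));

    int32_t e = (int32_t)exp - 127 + 15;      /* rebias exponent */
    if (e >= 31) return (uint16_t)(sign | 0x7C00u);  /* overflow -> Inf */
    if (e <= 0)  return (uint16_t)sign;               /* flush to zero */

    uint32_t half = ((uint32_t)e << 10) | (mant >> 13);
    uint32_t rem  = mant & 0x1FFFu;           /* the 13 dropped bits */
    if (rem > 0x1000u || (rem == 0x1000u && (half & 1u)))
        half++;                               /* a carry into the exponent is fine */
    return (uint16_t)(sign | half);
}

/* Hypothetical helper: IEEE half bits -> F32 (subnormals read as zero). */
float f16_to_f32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0)        bits = sign;
    else if (exp == 31)  bits = sign | 0x7F800000u | (mant << 13);
    else                 bits = sign | ((exp + 112u) << 23) | (mant << 13);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* dst[i*n + j] = dot(row i of a, row j of b). Accumulation stays in F32;
 * only the final store is narrowed to F16, so dst needs 2 bytes per element
 * and no separate cast tensor is created. */
void matmul_f32_store_f16(const float *a, const float *b, uint16_t *dst,
                          int m, int n, int k) {
    assert(a && b && dst && m > 0 && n > 0 && k > 0);
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++)
                acc += a[i * k + p] * b[j * k + p];
            dst[i * n + j] = f32_to_f16(acc);
        }
    }
}
```

Ops that require F32 (rms_norm, softmax) would widen each value with something like `f16_to_f32` when reading, which corresponds to the last item in the list.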
Previous failed approaches
- Changing result type globally (GGML_TYPE_F16 in ggml_mul_mat): crashes in ggml_set_rows (KV cache), which asserts F32
- Cast nodes in graph: scheduler backend assignment fails (cur_backend_id == -1) or uses more memory (this issue)
- Selective casting in model code (qwen3.cpp): works functionally but doesn't save memory due to scheduler allocation
Estimated impact if implemented correctly
- Compute buffer: ~304 MB → ~152 MB (Jetson), ~156 MB → ~78 MB (RPi)
- ~150 MB savings on Jetson, ~78 MB on RPi
- Moderate engineering effort (ops.cpp kernel modification)