F16 activations: cast-node approach doesn't save memory, need mul_mat kernel change #3

@coverblew

Summary

Attempted to reduce activation memory by casting intermediate tensors from F32 to F16. The approach of inserting ggml_cast nodes in the model graph does NOT save memory and reduces speed.

What was tried

  1. After build_ffn output: ggml_cast(ctx0, cur, GGML_TYPE_F16)
  2. After build_attn Wo output: same cast to F16
  3. Before build_norm: ggml_cast(ctx0, cur, GGML_TYPE_F32) (rms_norm needs F32)

Results (Bonsai-8B on RPi 4, context 4096)

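| Metric     | Without F16 | With F16 casts | Change        |
|------------|-------------|----------------|---------------|
| RAM used   | 1665 MB     | 1672 MB        | +7 MB (worse) |
| Prompt     | 0.8 t/s     | 0.6 t/s        | -25%          |
| Generation | 0.6 t/s     | 0.5 t/s        | -17%          |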

Why it doesn't work

The ggml scheduler allocates memory for ALL tensors in the graph, including cast intermediates. Adding cast nodes creates ADDITIONAL tensors that coexist with the originals, increasing peak memory instead of reducing it.
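
The accounting behind this can be illustrated with a toy model. This is a sketch under the assumption stated above (every graph tensor gets its own slot in the compute buffer, with no reuse); it is not the real ggml allocator, and the names `Tensor`, `buffer_bytes`, `layer_plain`, `layer_with_casts` and `layer_f16_output` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

enum class DType { F32, F16 };

constexpr std::size_t type_size(DType t) { return t == DType::F32 ? 4 : 2; }

// Hypothetical stand-in for a graph tensor: element count plus element type.
struct Tensor {
    std::size_t n_elements;
    DType       type;
};

// Assumption: the buffer holds every tensor at once, so its size is the sum.
std::size_t buffer_bytes(const std::vector<Tensor>& graph) {
    std::size_t total = 0;
    for (const Tensor& t : graph) {
        total += t.n_elements * type_size(t.type);
    }
    return total;
}

// Baseline layer: attention output and FFN output, both F32.
std::vector<Tensor> layer_plain(std::size_t n) {
    return {{n, DType::F32}, {n, DType::F32}};
}

// Cast-node variant: each F32 output gains an F16 copy, plus an F32 copy
// cast back for the norm. The original F32 outputs still exist.
std::vector<Tensor> layer_with_casts(std::size_t n) {
    return {
        {n, DType::F32}, {n, DType::F16}, {n, DType::F32},  // attn out, cast, cast back
        {n, DType::F32}, {n, DType::F16}, {n, DType::F32},  // ffn out, cast, cast back
    };
}

// Kernel-side variant: mul_mat writes F16 directly, so no F32 result tensor
// and no cast tensors appear in the graph.
std::vector<Tensor> layer_f16_output(std::size_t n) {
    return {{n, DType::F16}, {n, DType::F16}};
}
```

In this model the cast variant can only grow the buffer, because nothing is removed from the graph; only changing the type of the result tensor itself shrinks it.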

What would work

Modify the ggml_mul_mat CPU kernel in ggml-cpu/ops.cpp so that it can optionally compute in F32 but store the result as F16, calling ggml_fp32_to_fp16() on each value in the output loop. This avoids extra graph nodes and directly halves the size of the result tensor.

This requires:

  1. Adding an F16 output path to ggml_compute_forward_mul_mat in ops.cpp
  2. Changing ggml_mul_mat() in ggml.c to create result as GGML_TYPE_F16 when a flag is set
  3. Ensuring all downstream ops support F16 input (add, mul, rope, silu do; rms_norm and softmax don't)
  4. Converting to F32 only at the few ops that require it (rms_norm, softmax)
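
The core of step 1 (accumulate in F32, convert once when storing) can be sketched in standalone C++. This is not the ggml kernel: `mul_mat_f16_out`, `fp32_to_fp16_bits` and `fp16_bits_to_fp32` are hypothetical names, the conversion is a hand-written stand-in for ggml_fp32_to_fp16(), and the plain row-major layout is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for ggml_fp32_to_fp16():
// IEEE 754 binary32 -> binary16 bits, round-to-nearest-even.
std::uint16_t fp32_to_fp16_bits(float f) {
    std::uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    const std::uint32_t sign = (x >> 16) & 0x8000u;
    const std::uint32_t exp  = (x >> 23) & 0xFFu;
    std::uint32_t       mant = x & 0x7FFFFFu;

    if (exp == 0xFFu) {  // Inf or NaN
        return static_cast<std::uint16_t>(sign | 0x7C00u | (mant ? 0x200u : 0u));
    }
    const int e = static_cast<int>(exp) - 127 + 15;  // rebias the exponent
    if (e >= 31) {  // too large for F16: overflow to Inf
        return static_cast<std::uint16_t>(sign | 0x7C00u);
    }
    if (e <= 0) {  // result is an F16 subnormal or zero
        if (e < -10) return static_cast<std::uint16_t>(sign);
        mant |= 0x800000u;  // restore the implicit leading 1
        const unsigned      shift = static_cast<unsigned>(14 - e);
        std::uint32_t       h     = mant >> shift;
        const std::uint32_t rem   = mant & ((1u << shift) - 1u);
        const std::uint32_t half  = 1u << (shift - 1u);
        if (rem > half || (rem == half && (h & 1u))) ++h;
        return static_cast<std::uint16_t>(sign | h);
    }
    std::uint32_t       h   = (static_cast<std::uint32_t>(e) << 10) | (mant >> 13);
    const std::uint32_t rem = mant & 0x1FFFu;
    // A carry out of the mantissa correctly bumps the exponent.
    if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) ++h;
    return static_cast<std::uint16_t>(sign | h);
}

// Decoder, used here only to check results.
float fp16_bits_to_fp32(std::uint16_t h) {
    const int exp  = (h >> 10) & 0x1F;
    const int mant = h & 0x3FF;
    float v;
    if (exp == 0)       v = std::ldexp(static_cast<float>(mant), -24);
    else if (exp == 31) v = mant ? NAN : INFINITY;
    else                v = std::ldexp(static_cast<float>(1024 + mant), exp - 25);
    return (h & 0x8000u) ? -v : v;
}

// C (m x n) = A (m x k) * B (k x n), all row-major.
// The dot product is accumulated in F32; only the stored result is F16.
std::vector<std::uint16_t> mul_mat_f16_out(const std::vector<float>& a,
                                           const std::vector<float>& b,
                                           std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<std::uint16_t> c(m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = fp32_to_fp16_bits(acc);  // single conversion at the store
        }
    }
    return c;
}
```

Because the conversion happens only at the store, accumulation error stays at F32 level and the output buffer needs 2 bytes per element instead of 4. Results larger in magnitude than 65504 overflow to infinity in F16, so the real kernel would need the activations to stay within that range.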

Previous failed approaches

  1. Changing result type globally (GGML_TYPE_F16 in ggml_mul_mat): crashes in ggml_set_rows (KV cache) which asserts F32
  2. Cast nodes in graph: scheduler backend assignment fails (cur_backend_id == -1) or uses more memory (this issue)
  3. Selective casting in model code (qwen3.cpp): works functionally but doesn't save memory due to scheduler allocation

Estimated impact if implemented correctly

  • Compute buffer: ~304 MB → ~152 MB (Jetson), ~156 MB → ~78 MB (RPi)
  • ~150 MB savings on Jetson, ~78 MB on RPi
  • Moderate engineering effort (ops.cpp kernel modification)
