F16 activations: cast-node approach doesn't save memory, need mul_mat kernel change

## Summary

Attempted to reduce activation memory by casting intermediate tensors from F32 to F16. Inserting `ggml_cast` nodes into the model graph does not save memory (RAM use rose by 7 MB on the RPi 4) and slows both prompt processing and generation by 17-25%.

## What was tried

1. After `build_ffn` output: `ggml_cast(ctx0, cur, GGML_TYPE_F16)`
2. After `build_attn` Wo output: same cast to F16
3. Before `build_norm`: `ggml_cast(ctx0, cur, GGML_TYPE_F32)` (rms_norm needs F32)

## Results (Bonsai-8B on RPi 4, context 4096)

| Metric | Without F16 casts | With F16 casts | Change |
|--------|-------------------|----------------|--------|
| RAM used | 1665 MB | 1672 MB | +7 MB (worse) |
| Prompt | 0.8 t/s | 0.6 t/s | -25% |
| Generation | 0.6 t/s | 0.5 t/s | -17% |

## Why it doesn't work

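The ggml scheduler allocates memory for every tensor in the graph, including cast intermediates. A cast node does not shrink its F32 source: the source must stay live until the cast has finished reading it, so the new F16 tensor coexists with the original. Each cast therefore adds a tensor, and peak memory goes up instead of down.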
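The effect can be sketched with a toy liveness model. This is not ggml's allocator, and the names `peak_without_cast` and `peak_with_cast` are hypothetical. It only counts the bytes that must be resident while one activation of `n` elements is produced and then cast, so it shows the direction of the change and does not try to reproduce the +7 MB measured above.

```cpp
#include <cstddef>

// Toy liveness model (an assumption for illustration, not ggml's allocator).
constexpr std::size_t F32_BYTES = 4;
constexpr std::size_t F16_BYTES = 2;

// Baseline: only the F32 result of the op exists.
constexpr std::size_t peak_without_cast(std::size_t n) {
    return n * F32_BYTES;
}

// With a cast node: the F32 source is still live while the cast
// writes its F16 copy, so both buffers count towards the peak.
constexpr std::size_t peak_with_cast(std::size_t n) {
    return n * F32_BYTES + n * F16_BYTES;
}
```

For a 4096-element activation the model gives 16384 bytes without the cast and 24576 bytes with it. The real increase is smaller because the allocator reuses buffers across nodes, but the cast can never bring the peak below the F32-only case.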

## What would work

Modify the `ggml_mul_mat` CPU kernel in `ggml-cpu/ops.cpp` to optionally **compute in F32 but store the result in F16 format** by calling `ggml_fp32_to_fp16()` in the output loop. This avoids extra graph nodes and directly halves the result tensor size.
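A minimal standalone sketch of that idea, assuming a plain row-major layout rather than ggml's tensor strides. Both `f32_to_f16_simple` and `mul_mat_f16_out` are hypothetical names. The conversion is a simplified stand-in for ggml's own helper: it rounds to nearest with ties away from zero, flushes results below the F16 normal range to zero, and saturates overflow to infinity.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Simplified F32 -> F16 bit conversion (hypothetical stand-in, see lead-in).
uint16_t f32_to_f16_simple(float f) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    const uint32_t sign     = (x >> 16) & 0x8000u;
    const uint32_t exp_bits = (x >> 23) & 0xFFu;
    const uint32_t mant     = x & 0x7FFFFFu;

    if (exp_bits == 0xFFu) {  // infinity or NaN
        return static_cast<uint16_t>(sign | 0x7C00u | (mant ? 0x200u : 0u));
    }
    const int32_t exp = static_cast<int32_t>(exp_bits) - 127 + 15;  // rebias
    if (exp >= 31) return static_cast<uint16_t>(sign | 0x7C00u);    // overflow
    if (exp <= 0)  return static_cast<uint16_t>(sign);              // flush to zero

    uint32_t half = (static_cast<uint32_t>(exp) << 10) | (mant >> 13);
    if (mant & 0x1000u) half += 1;  // round on the first dropped bit; a carry bumps the exponent
    return static_cast<uint16_t>(sign | half);
}

// a is m x k, b is n x k, both row-major F32.
// dst is m x n and holds raw F16 bits: half the bytes of an F32 result.
void mul_mat_f16_out(const float* a, const float* b, uint16_t* dst,
                     std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;  // accumulate in F32
            for (std::size_t l = 0; l < k; ++l) {
                sum += a[i * k + l] * b[j * k + l];
            }
            dst[i * n + j] = f32_to_f16_simple(sum);  // convert only at the store
        }
    }
}
```

The conversion happens once per output element, after the dot product is complete, so accumulation keeps F32 precision and no separate cast tensor is ever allocated.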

The kernel change requires:
1. Adding an F16 output path to `ggml_compute_forward_mul_mat` in ops.cpp
2. Changing `ggml_mul_mat()` in ggml.c to create result as `GGML_TYPE_F16` when a flag is set
3. Ensuring all downstream ops support F16 input (add, mul, rope, silu do; rms_norm and softmax don't)
4. Converting to F32 only at the few ops that require it (rms_norm, softmax)
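Steps 3 and 4 can be sketched as a small planning pass over an op sequence. This is a toy model under stated assumptions: the `Op` enum and `plan_f32_conversions` are hypothetical, `mul_mat` is assumed to emit F16 via the proposed flag, elementwise ops are assumed to preserve their input type, and `rms_norm` and `softmax` are assumed to take and produce F32, following the list above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical op tags for illustration; not ggml's enum.
enum class Op { MulMat, Add, Mul, Rope, Silu, RmsNorm, Softmax };

// Per the list above, only these ops are assumed to require F32 input.
bool needs_f32_input(Op op) {
    return op == Op::RmsNorm || op == Op::Softmax;
}

// Returns the indices of ops that need an F16 -> F32 conversion
// placed directly before them.
std::vector<std::size_t> plan_f32_conversions(const std::vector<Op>& ops) {
    std::vector<std::size_t> convert_before;
    bool cur_is_f16 = false;  // graph input assumed to be F32
    for (std::size_t i = 0; i < ops.size(); ++i) {
        const Op op = ops[i];
        if (needs_f32_input(op)) {
            if (cur_is_f16) convert_before.push_back(i);
            cur_is_f16 = false;  // assumed: these ops output F32
        } else if (op == Op::MulMat) {
            cur_is_f16 = true;   // proposed F16 output path
        }
        // Add, Mul, Rope, Silu: assumed to keep the current type.
    }
    return convert_before;
}
```

For the sequence `MulMat, Add, RmsNorm, MulMat, Silu, Mul, MulMat, Softmax` the pass places conversions before index 2 and index 7 only, so every other activation in the sequence can stay in F16.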

## Previous failed approaches

1. **Changing result type globally** (`GGML_TYPE_F16` in ggml_mul_mat): crashes in `ggml_set_rows` (KV cache) which asserts F32
2. **Cast nodes in graph**: scheduler backend assignment fails (`cur_backend_id == -1`) or uses more memory (this issue)
3. **Selective casting in model code** (qwen3.cpp): works functionally but doesn't save memory due to scheduler allocation

## Estimated impact if implemented correctly

- Compute buffer: ~304 MB → ~152 MB (Jetson), ~156 MB → ~78 MB (RPi)
- Savings: ~152 MB on Jetson, ~78 MB on RPi (half of the compute buffer in each case)
- Moderate engineering effort (ops.cpp kernel modification)
