Responses from the training engines include chat templates and CoT. OpenClaw does not filter these out for every model, so unwanted text leaks into responses. To address this, the inference backends strip CoT and templates. But training needs the exact-token rollouts to remain on-policy, so we added a hacky cache in the inference backends that maps each filtered response back to its unfiltered original. We should replace this hack with a more robust approach.
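To make the hack concrete, here is a minimal sketch of what such a cache looks like. All names (`RolloutCache`, its methods) are hypothetical, not taken from the OpenClaw codebase; it only illustrates the filtered-to-raw mapping described above.

```python
# Hypothetical illustration of the cache described above; names are
# made up for the sketch and do not come from the OpenClaw codebase.
class RolloutCache:
    """Maps a filtered response string back to its raw, exact-token rollout."""

    def __init__(self) -> None:
        self._raw_by_filtered: dict[str, str] = {}

    def put(self, raw_response: str, filtered_response: str) -> None:
        # Store the unfiltered rollout under the filtered text the caller sees.
        self._raw_by_filtered[filtered_response] = raw_response

    def get_raw(self, filtered_response: str) -> str:
        # Recover the exact tokens needed to stay on-policy during training.
        return self._raw_by_filtered[filtered_response]


cache = RolloutCache()
raw = "<think>chain of thought</think>Final answer."
filtered = "Final answer."
cache.put(raw, filtered)
assert cache.get_raw(filtered) == raw
```

The fragility is visible even in this toy version: the mapping is keyed on post-filter text, so two distinct rollouts that filter to the same string collide, which is one reason a more robust design is worth building.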