Changes from all commits
31 commits
80909be
Issue/60 - main: fix garbled output tokens and adapt the qwen3 model
pengcheng888 Oct 22, 2025
7a41220
issue/64 - jiuge.py verbose output
wooway777 Oct 28, 2025
1c710c1
Merge pull request #65 from InfiniTensor/issue/64
PanZezhong1725 Oct 29, 2025
2188250
Add some development groundwork
Sxy-17 Nov 10, 2025
0825fdb
Runs up through bind_kvcache; about to start building the model structure
Sxy-17 Nov 19, 2025
32b53c2
Thank goodness, conv2d finally runs; check whether minicpm takes dynamic-shape inputs; conv2d correctness still needs confirmation
Sxy-17 Nov 25, 2025
bfc322c
llava: built two layers, up to CLIPVisionEmbeddings
Sxy-17 Dec 6, 2025
f454489
Add kv compression & llava changes without large checkpoint
gofreelee Dec 12, 2025
5bf31d7
Add layernorm; frontend adds self.language_meta and self.language_weights (reusing jiuge's); model…
Sxy-17 Dec 13, 2025
ea203b8
Model structure being stacked up
Sxy-17 Dec 15, 2025
4c03b81
porting minicpmv
gofreelee Dec 16, 2025
d17384c
just successfully run
shiinakkk Dec 16, 2025
d84b5a4
fix ptr_delivery
shiinakkk Dec 18, 2025
814df22
minicpmv+mlp
gofreelee Dec 18, 2025
b046823
add log to git ignore
shiinakkk Dec 18, 2025
4d6852c
add DCU, finish before projector
shiinakkk Dec 19, 2025
f8a9999
save
shiinakkk Dec 20, 2025
ff62822
fix projector
shiinakkk Dec 21, 2025
4b5c277
finish projector
shiinakkk Dec 21, 2025
93d3bc7
before push
shiinakkk Dec 22, 2025
7f8b3cf
add AGENTS.md
gofreelee Dec 22, 2025
bf76c00
Merge branch 'feature/minicpmv' into integrate/llava-v15-hygon
gofreelee Dec 22, 2025
0e6b7d2
complete llava
gofreelee Dec 22, 2025
5876f0d
clean verify code scripts
gofreelee Dec 23, 2025
3c3e5ad
delete meaningless log
gofreelee Dec 23, 2025
98c32c7
add perplexity for minicpmv
gofreelee Dec 24, 2025
fabb948
add perplexity for llava
gofreelee Dec 24, 2025
5542b3a
add timer
gofreelee Dec 24, 2025
068331f
add imgs
Dec 24, 2025
ec94181
Simplified the perplexity output.
Dec 24, 2025
da3bd75
add gqa_samples
Dec 25, 2025
7 changes: 5 additions & 2 deletions .gitignore
@@ -23,7 +23,10 @@ cache/
#GGUF
*.gguf

# txt
*.txt
# # txt
# *.txt

*.http

#log
log
18 changes: 18 additions & 0 deletions TODO.md
@@ -0,0 +1,18 @@
1. Goals:

(miniCPM + Fastcache) x (DCU & Moore Threads)
(llava + Fastcache) x (DCU & Moore Threads)


2. Work breakdown:

a. Get an end-to-end run on the DCU platform: debug the correctness of the llava encoder [work that can start now; stub out the missing operators for the time being] + wire the encoder/Fastcache/llm together.

b. Me, today: apply for compute resources on the Moore Threads platform.

c. Me, over the next two days: implement the operators missing on both platforms.

3. Deadline: 10 days from now, the 25th of this month.

4. Weights:
108:/home/weight/MiniCPM-V-2_6;/home/weight/llava-1.5-7b-hf
Binary file added compress_ckpt/llava_mlp.bin
Binary file not shown.
Binary file added compress_ckpt/llava_mlp_layerwise.bin
Binary file not shown.
Binary file added compress_ckpt/minicpm_mlp.pth
Binary file not shown.
Binary file added compress_ckpt/minicpmv_mlp.bin
Binary file not shown.
Empty file added debug_data/qkv_debug.txt
Empty file.
34 changes: 34 additions & 0 deletions docs/KVCacheCompressionMapping.md
@@ -0,0 +1,34 @@
# KV Cache Compression Weight Mapping (llava_mlp.bin)

## Prefixes and Their Meanings
The weights come from Fastcache's KVCacheLinearDecoupleCompressor. In the `.pth` file they sit under the `compressor` subtree, with the key pattern `<prefix>.<layer>.<slot>.weight`. The currently exported bin contains the following prefixes (written in sorted order):

- `compress_tk`: weights for text-K compression/projection
- `compress_tv`: weights for text-V compression/projection
- `compress_iv`: weights for image-V compression/projection (the naming likely follows the image/value abbreviation)
- `compress_ik`: weights for image-K compression/projection
- `attention`: linear attention/gating layers inside the compressor (small head count, typically slot = 0..7)

> Note: no bias was observed in the original PyTorch weights; the conversion script skips any bias whose length does not match.

## Sort and Write Order
Sort key: `prefix` priority (`compress_tk` → `compress_tv` → `compress_iv` → `compress_ik` → `attention`), then `layer` ascending, then `slot` ascending. Within the same `(prefix, layer, slot)`, the weight comes before the bias (if present).
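As a concrete illustration, a minimal Python sketch of this sort key (the helper and the sample keys are illustrative, not from the export script):

```python
# Minimal sketch of the documented sort order for keys shaped
# "<prefix>.<layer>.<slot>.weight"; sample keys are illustrative.
PREFIX_ORDER = ["compress_tk", "compress_tv", "compress_iv", "compress_ik", "attention"]

def sort_key(key: str):
    prefix, layer, slot, _param = key.rsplit(".", 3)
    return (PREFIX_ORDER.index(prefix), int(layer), int(slot))

keys = ["attention.0.1.weight", "compress_iv.1.0.weight", "compress_tk.0.0.weight"]
print(sorted(keys, key=sort_key))
# ['compress_tk.0.0.weight', 'compress_iv.1.0.weight', 'attention.0.1.weight']
```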

## Shape Inference and hidden_size
- The header's `hidden_size` is taken from the column count of the first weight (currently 640).
- Each weight block records `rows` and `cols`. Treat it as a linear layer `out = W * in`, where `W` has shape `[rows, cols]` and the output dimension equals `rows`.
- The bias (if present, with length == rows) immediately follows.

## Plausible Computation Graph (guesswork, for aligning the C++ implementation)
- `compress_tk`: dimensionality reduction/decoupling of text K; multiple slots may correspond to staged or multi-head mixed projections.
- `compress_tv`: dimensionality reduction/decoupling of text V.
- `compress_iv`: dimensionality reduction/decoupling of image V.
- `compress_ik`: dimensionality reduction/decoupling of image K.
- `attention`: small attention/gating MLP inside the compressor, used to produce the compression mapping or to fuse text/image features.

The actual computation order must be derived from the Fastcache Python source (`KVCacheLinearDecoupleCompressor.forward`), mapping the weights above onto concrete linear/activation/permute operations layer by layer.

## Verifying Against the bin
- `scripts/verify_llava_mlp_bin.py` compares the `.pth` with the `.bin`: it prints the header, per-block shapes, and the max diff.
- Current verification result: `num_layers=32`, `weight_count_per_layer=12`, 384 weight blocks, max diff = 0 (fp16).

34 changes: 34 additions & 0 deletions docs/KVCacheCompressionOpsChecklist.md
@@ -0,0 +1,34 @@
# KV Cache Compression: Algorithm Breakdown and Operator Requirements (llava_mlp.bin baseline)

## Module Breakdown (inferred from weight prefixes)
- `compress_tk`: text-K path compression/decoupling. Multiple slots may correspond to multi-stage or multi-head mixed projections.
- `compress_tv`: text-V path compression/decoupling.
- `compress_iv`: image-V path compression/decoupling.
- `compress_ik`: image-K path compression/decoupling.
- `attention`: small attention/gating linear layers inside the compressor (possibly used for fusion or for producing the mapping).

## Likely Computation Flow (following the Fastcache approach; must be checked against the source item by item)
1) Branch the KV by category (text/image) and apply per-head/per-slot linear transforms for dimensionality reduction or projection.
2) Optional gating/attention: use the `attention.*` weights to fuse the compressed features or to produce indices/weights.
3) Produce the compressed K/V (shortened along seq or reduced in dimension) and record the mapping (indices/scale).
4) Decompression path: using the stored mapping/scale, restore the compressed K/V to a form attention can consume (or consume the compressed format directly inside attention). A speculative, shape-only sketch of steps 1-2 follows this list.
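Since the real operator order still has to be read off `KVCacheLinearDecoupleCompressor.forward`, the following numpy sketch is purely speculative and only pins down shapes and ordering (per-slot projection, then a placeholder sigmoid gate):

```python
import numpy as np

# Speculative sketch of steps 1-2; NOT the Fastcache algorithm, just a
# shape/ordering aid. Real order must come from the Python source.
def compress_branch(kv, ws):
    # kv: [heads, seq, dim]; ws: list of per-slot weight matrices [out, dim]
    outs = [kv @ w.T for w in ws]        # per-slot linear projection
    h = np.concatenate(outs, axis=-1)    # mix slot outputs
    gate = 1.0 / (1.0 + np.exp(-h))      # placeholder gating (sigmoid)
    return h * gate

kv = np.random.rand(8, 100, 64).astype(np.float32)
ws = [np.random.rand(16, 64).astype(np.float32) for _ in range(4)]
print(compress_branch(kv, ws).shape)  # (8, 100, 64): 4 slots x 16 dims
```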

## Core Operators Needed (prefer reusing InfiniCore)
- Matrix multiply + bias: `linear` (already available). Must support fp16/bf16.
- Activations: SiLU/GELU (confirm availability; add elementwise kernels if missing).
- Tensor rearrangement: view/reshape/permute/slice (already supported by `Tensor`).
- Possible normalization/scaling: elementwise multiply-add (composable from existing primitives).
- Optional: indexing/gather (if the compression logic samples or reorders seq according to weights).

## Mappings To Pin Down (TBD)
- Input/output shapes for each prefix: `[B, heads, seq, dim]` → `[...]`, and how slots map.
- Where the compression factor applies: the seq dimension, the hidden dimension, or both.
- The input features and output use of the `attention` weights (what kind of weights/indices they produce).
- Whether indices/scale must be stored for decompression or sparse attention.

## Suggested Implementation Stages
1) **Placeholder compression**: first implement a simple keep-most-recent-N/truncation variant to get the pipeline running end to end (a minimal sketch follows this list).
2) **Weight-mapping alignment**: read the Python `KVCacheLinearDecoupleCompressor.forward` and write out, per prefix, the linear/activation order and tensor dimensions.
3) **Operator gap-fill**: if SiLU/GELU are missing, add compact kernels; compose the rest from existing linear/elemwise ops.
4) **Decompression strategy**: either decompress to dense KV (smaller change) or modify attention to consume the compressed format (choose one).
5) **Validation**: build a small C++/ctypes test: random KV → compress → decompress → compare errors; quantify overhead and gains.
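For stage 1, a minimal numpy sketch of the keep-most-recent-N placeholder (shapes, the keep-last policy, and the function name are assumptions for illustration):

```python
import numpy as np

# Stage-1 placeholder: keep only the most recent positions of the KV
# cache (pure truncation, no learned weights). Mirrors the C API's
# no-op behavior below min_seq_len.
def placeholder_compress(k, v, min_seq_len, factor):
    # k, v: [heads, seq, dim]
    seq = k.shape[1]
    if seq <= min_seq_len:
        return k, v  # no-op below the threshold
    new_len = max(min_seq_len, seq // factor)
    return k[:, -new_len:, :], v[:, -new_len:, :]

k = np.random.rand(8, 100, 64).astype(np.float16)
v = np.random.rand(8, 100, 64).astype(np.float16)
k2, v2 = placeholder_compress(k, v, min_seq_len=16, factor=4)
print(k2.shape)  # (8, 25, 64)
```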
57 changes: 57 additions & 0 deletions docs/KVCacheCompressionWeightFormat.md
@@ -0,0 +1,57 @@
# KV Cache Compression Weight Format (Binary, No PyTorch Dependency)

## File Layout
All values are little-endian unless stated otherwise. Strings are ASCII, null-terminated.

- Header (fixed size)
- `uint32` magic = 0x4B56434D ("KVCM")
- `uint32` version = 1
- `uint16` dtype code: 0 = fp16, 1 = bf16, 2 = fp32
- `uint16` reserved = 0
- `uint32` num_layers
- `uint32` num_heads
- `uint32` head_dim
- `uint32` hidden_size
- `uint32` compression_factor (e.g., 4, 5)
- `uint32` min_seq_len
- `uint32` weight_count_per_layer (for sanity check)
- `uint32` metadata_size_bytes (future expansion; set 0 for now)
- Layer blocks (repeat `num_layers` times)
- For each weight tensor (order defined below):
- `uint32` rows
- `uint32` cols
- `uint32` has_bias (0/1)
- data blob for weight: `rows * cols * sizeof(dtype)`
- optional bias blob: `rows * sizeof(dtype)` when `has_bias==1` (the bias length matches the output dimension `rows`; see the mapping doc)
- Footer
- `uint32` checksum (optional; set 0 if not used)
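To make the layout concrete, a minimal Python sketch that packs the header fields in the documented order (little-endian, fixed size; the field values here are illustrative):

```python
import struct

# Packs the documented 44-byte header: 2x uint32, 2x uint16, 8x uint32.
def pack_header(dtype_code, num_layers, num_heads, head_dim, hidden_size,
                compression_factor, min_seq_len, weight_count_per_layer):
    return struct.pack(
        "<IIHHIIIIIIII",
        0x4B56434D,            # magic "KVCM"
        1,                     # version
        dtype_code, 0,         # dtype code (0 = fp16), reserved
        num_layers, num_heads, head_dim, hidden_size,
        compression_factor, min_seq_len,
        weight_count_per_layer,
        0,                     # metadata_size_bytes (0 for now)
    )

hdr = pack_header(0, 32, 32, 128, 640, 4, 16, 12)  # sample values
print(len(hdr))  # 44
```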

## Weight Order per Layer (example for linear-decouple MLP)
Adjust if the actual model structure differs, but the order must be identical between export and load.
1. `proj_k` weight (+bias)
2. `proj_v` weight (+bias)
3. `compress_k` weight (+bias)
4. `compress_v` weight (+bias)
5. `decompress_k` weight (+bias)
6. `decompress_v` weight (+bias)
7. `gate`/`mlp` weights (+bias), if the algorithm requires them

`weight_count_per_layer` = the actual number of weight entries included, for validation during parsing.

## Export Steps (one-time, in an external Python env)
1) Load the original `.pth` with PyTorch: `state = torch.load(...)`.
2) Extract the compressor weights into a list in the fixed order; unify the dtype (fp16/bf16):
```python
weights = [
(state['proj_k.weight'], state.get('proj_k.bias')),
...
]
```
3) Write the header; then, layer by layer, write the per-tensor metadata + data, converting to bytes according to `dtype` (for fp16, cast to `np.float16` and call `.tobytes()`). A sketch of one block write follows below.
4) Fill in the footer (may be set to 0).
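A minimal sketch of step 3 for a single weight tensor (the function name is hypothetical):

```python
import struct
import numpy as np

# Writes one per-tensor block: (rows, cols, has_bias) metadata followed
# by raw fp16 bytes, matching the documented layer-block layout.
def write_block(f, weight, bias=None):
    w = np.asarray(weight, dtype=np.float16)
    rows, cols = w.shape
    f.write(struct.pack("<III", rows, cols, 1 if bias is not None else 0))
    f.write(w.tobytes())
    if bias is not None:
        f.write(np.asarray(bias, dtype=np.float16).tobytes())
```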

## Loader Expectations (C++/InfiniCore)
- Read and validate magic/version/dtype/num_layers/weight_count_per_layer.
- Create a `Tensor::weight` for each weight, with dtype matching the header.
- If a bias is absent (has_bias=0), skip it as agreed.
- Store the parsed weights in the compressor object in the same order, so that the forward logic stays correct.
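A minimal Python sketch of the validation the loader is expected to perform (mirroring the pack sketch above; the real loader would do this in C++):

```python
import struct

# Reads and validates the 44-byte header, returning the parsed fields.
def read_header(f):
    fields = struct.unpack("<IIHHIIIIIIII", f.read(44))
    (magic, version, dtype_code, _reserved, num_layers, num_heads,
     head_dim, hidden_size, factor, min_seq_len, wcpl, meta_size) = fields
    assert magic == 0x4B56434D and version == 1, "bad magic/version"
    assert dtype_code in (0, 1, 2), "unknown dtype code"
    return dict(num_layers=num_layers, num_heads=num_heads,
                head_dim=head_dim, hidden_size=hidden_size,
                compression_factor=factor, min_seq_len=min_seq_len,
                weight_count_per_layer=wcpl, metadata_size=meta_size)
```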
2 changes: 2 additions & 0 deletions include/infinicore_infer.h
@@ -2,9 +2,11 @@
#define INFINICORE_INFER_H

#include "infinicore_infer/cache.h"
#include "infinicore_infer/kv_compression.h"
#include "infinicore_infer/weights_loader.h"

#include "infinicore_infer/models/deepseek.h"
#include "infinicore_infer/models/jiuge.h"
#include "infinicore_infer/models/minicpmv.h"

#endif /* INFINICORE_INFER_H */
25 changes: 25 additions & 0 deletions include/infinicore_infer/kv_compression.h
@@ -0,0 +1,25 @@
#ifndef KV_COMPRESSION_H
#define KV_COMPRESSION_H

#include <stdint.h>

#include <infinirt.h>

struct KVCache;

typedef struct {
uint32_t enable;
uint32_t compression_factor;
uint32_t min_seq_len;
uint32_t image_kv_len;
const char *weight_path; // path to .bin weights (see docs/KVCacheCompressionWeightFormat.md)
} KVCompressionConfig;

// Compress KVCache in-place:
// - Reads KV from [0, seq_len) and writes compressed KV back into the same cache prefix [0, new_len).
// - Returns new_len on success; returns seq_len on no-op/failure.
__C __export uint32_t
compressKVCacheInplace(struct KVCache *kv_cache, uint32_t seq_len, const KVCompressionConfig *cfg);

#endif // KV_COMPRESSION_H
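A hypothetical ctypes sketch of driving this API from Python; the shared-library name and the KVCache handle acquisition are assumptions, only the struct layout and signature come from the header above:

```python
import ctypes

# Mirrors KVCompressionConfig from kv_compression.h, field for field.
class KVCompressionConfig(ctypes.Structure):
    _fields_ = [
        ("enable", ctypes.c_uint32),
        ("compression_factor", ctypes.c_uint32),
        ("min_seq_len", ctypes.c_uint32),
        ("image_kv_len", ctypes.c_uint32),
        ("weight_path", ctypes.c_char_p),
    ]

lib = ctypes.CDLL("libinfinicore_infer.so")  # assumed library name
lib.compressKVCacheInplace.restype = ctypes.c_uint32
lib.compressKVCacheInplace.argtypes = [
    ctypes.c_void_p, ctypes.c_uint32, ctypes.POINTER(KVCompressionConfig)]

cfg = KVCompressionConfig(1, 4, 16, 0, b"compress_ckpt/llava_mlp.bin")
# kv_cache would come from the model's cache-creation API (not shown):
# new_len = lib.compressKVCacheInplace(kv_cache, seq_len, ctypes.byref(cfg))
```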

133 changes: 132 additions & 1 deletion include/infinicore_infer/models/jiuge.h
@@ -30,11 +30,15 @@ typedef struct
// [dvoc, d]
const void *output_embd;
// nlayer * [d]
const void *const *attn_norm;
const void *const *attn_norm; // array of pointers, one RMSNorm weight per layer
// nlayer * [ndev, (nh + 2 * nkvh) / ndev * dh, d]
const void *const *attn_qkv;
// nlayer * [ndev, (nh + 2 * nkvh) / ndev * dh]
const void *const *attn_qkv_b;
// nlayer * [dh]
const void *const *attn_q_norm;
// nlayer * [dh]
const void *const *attn_k_norm;
// nlayer * [ndev, d, nkvh / ndev * dh]
const void *const *attn_o;
// nlayer * [d]
@@ -80,6 +84,43 @@ inferBatchJiuge(struct JiugeModel *,
const float *temperature, const uint32_t *topk, const float *topp,
uint32_t *output);

/// @brief Run one batched inference step and sample new tokens (RoPE positions and KV write positions decoupled, for KV compression)
/// @param req_pos base position ids (used for RoPE/pos_ids computation)
/// @param kv_pos base KVCache write/read positions (used for past_len/total_len computation)
__C __export void
inferBatchJiugeEx(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq,
const uint32_t *req_pos,
const uint32_t *kv_pos,
struct KVCache **kv_caches,
const float *temperature, const uint32_t *topk, const float *topp,
uint32_t *output);

/// @brief Run one batched inference step, sample new tokens, and also output logits
/// @param logits output logits array
__C __export void
inferBatchJiugeWithLogits(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq, const uint32_t *req_pos,
struct KVCache **kv_caches,
const float *temperature, const uint32_t *topk, const float *topp,
uint32_t *output, void *logits);

/// @brief Run one batched inference step (RoPE positions and KV write positions decoupled), also outputting logits
/// @param req_pos base position ids (used for RoPE/pos_ids computation)
/// @param kv_pos base KVCache write/read positions (used for past_len/total_len computation)
/// @param logits output logits array
__C __export void
inferBatchJiugeExWithLogits(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq,
const uint32_t *req_pos,
const uint32_t *kv_pos,
struct KVCache **kv_caches,
const float *temperature, const uint32_t *topk, const float *topp,
uint32_t *output, void *logits);

/// @brief Run one batched inference step, outputting the logits after the output embedding
/// @param tokens input token address
/// @param ntok number of input tokens
@@ -95,4 +136,94 @@ forwardBatchJiuge(struct JiugeModel *,
struct KVCache **kv_caches,
void *logits);

/// @brief Run one batched forward step, outputting logits (RoPE positions and KV write positions decoupled, for KV compression)
__C __export void
forwardBatchJiugeEx(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq,
const uint32_t *req_pos,
const uint32_t *kv_pos,
struct KVCache **kv_caches,
void *logits);

/// @brief Run one batched inference step, supporting overrides of the input embeddings at specified token positions (for multimodal image-embedding injection)
/// @note override_pos must be in ascending order, and each position may appear at most once
/// @param n_override number of override positions
/// @param override_pos override positions (indices into the concatenated token sequence, in the range [0, ntok))
/// @param override_embeds override embeddings, shape [n_override, d], dtype = meta.dt_logits
__C __export void
inferBatchJiugeWithOverrides(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq, const uint32_t *req_pos,
struct KVCache **kv_caches,
uint32_t n_override,
const uint32_t *override_pos,
const void *override_embeds,
const float *temperature, const uint32_t *topk, const float *topp,
uint32_t *output);

/// @brief Run one batched inference step (RoPE and KV write positions decoupled), with embedding overrides
__C __export void
inferBatchJiugeWithOverridesEx(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq,
const uint32_t *req_pos,
const uint32_t *kv_pos,
struct KVCache **kv_caches,
uint32_t n_override,
const uint32_t *override_pos,
const void *override_embeds,
const float *temperature, const uint32_t *topk, const float *topp,
uint32_t *output);

/// @brief Run one batched inference step with embedding overrides, also outputting logits
__C __export void
inferBatchJiugeWithOverridesWithLogits(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq, const uint32_t *req_pos,
struct KVCache **kv_caches,
uint32_t n_override,
const uint32_t *override_pos,
const void *override_embeds,
const float *temperature, const uint32_t *topk, const float *topp,
uint32_t *output, void *logits);

// /// @brief Run one batched inference step (RoPE and KV write positions decoupled) with embedding overrides, also outputting logits
// __C __export void
// inferBatchJiugeWithOverridesExWithLogits(struct JiugeModel *,
// const uint32_t *tokens, uint32_t ntok,
// const uint32_t *req_lens, uint32_t nreq,
// const uint32_t *req_pos,
// const uint32_t *kv_pos,
// struct KVCache **kv_caches,
// uint32_t n_override,
// const uint32_t *override_pos,
// const void *override_embeds,
// const float *temperature, const uint32_t *topk, const float *topp,
// uint32_t *output, void *logits);

/// @brief Run one batched forward step, outputting logits, with input-embedding overrides at specified token positions
__C __export void
forwardBatchJiugeWithOverrides(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq, const uint32_t *req_pos,
struct KVCache **kv_caches,
uint32_t n_override,
const uint32_t *override_pos,
const void *override_embeds,
void *logits);

/// @brief Run one batched forward step, outputting logits (RoPE and KV write positions decoupled), with embedding overrides
__C __export void
forwardBatchJiugeWithOverridesEx(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq,
const uint32_t *req_pos,
const uint32_t *kv_pos,
struct KVCache **kv_caches,
uint32_t n_override,
const uint32_t *override_pos,
const void *override_embeds,
void *logits);

#endif
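To make the override contract concrete, a hypothetical ctypes signature sketch for the overrides entry point; the library name is an assumption, while the parameter order mirrors the declaration above:

```python
import ctypes

lib = ctypes.CDLL("libinfinicore_infer.so")  # assumed library name
lib.inferBatchJiugeWithOverrides.argtypes = [
    ctypes.c_void_p,                                   # struct JiugeModel *
    ctypes.POINTER(ctypes.c_uint32), ctypes.c_uint32,  # tokens, ntok
    ctypes.POINTER(ctypes.c_uint32), ctypes.c_uint32,  # req_lens, nreq
    ctypes.POINTER(ctypes.c_uint32),                   # req_pos
    ctypes.POINTER(ctypes.c_void_p),                   # kv_caches
    ctypes.c_uint32,                                   # n_override
    ctypes.POINTER(ctypes.c_uint32),                   # override_pos (ascending)
    ctypes.c_void_p,                                   # override_embeds [n_override, d]
    ctypes.POINTER(ctypes.c_float),                    # temperature
    ctypes.POINTER(ctypes.c_uint32),                   # topk
    ctypes.POINTER(ctypes.c_float),                    # topp
    ctypes.POINTER(ctypes.c_uint32),                   # output
]
```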