Changes from all commits
31 commits
80909be
Issue/60 - main: fix garbled output tokens and adapt the qwen3 model
pengcheng888 Oct 22, 2025
7a41220
issue/64 - jiuge.py verbose output
wooway777 Oct 28, 2025
1c710c1
Merge pull request #65 from InfiniTensor/issue/64
PanZezhong1725 Oct 29, 2025
2188250
Add some development groundwork
Sxy-17 Nov 10, 2025
0825fdb
Runs up through bind_kvcache; about to start building the model structure
Sxy-17 Nov 19, 2025
32b53c2
Thank goodness, conv2d finally runs; check whether minicpm takes dynamic-shape inputs; conv2d correctness still needs confirmation
Sxy-17 Nov 25, 2025
bfc322c
llava: built two layers, up to CLIPVisionEmbeddings
Sxy-17 Dec 6, 2025
f454489
Add kv compression & llava changes without large checkpoint
gofreelee Dec 12, 2025
5bf31d7
Add layernorm; frontend adds self.language_meta and self.language_weights (reusing jiuge's); model…
Sxy-17 Dec 13, 2025
ea203b8
Model structure being stacked up
Sxy-17 Dec 15, 2025
4c03b81
porting minicpmv
gofreelee Dec 16, 2025
d17384c
just successfully run
shiinakkk Dec 16, 2025
d84b5a4
fix ptr_delivery
shiinakkk Dec 18, 2025
814df22
minicpmv+mlp
gofreelee Dec 18, 2025
b046823
add log to git ignore
shiinakkk Dec 18, 2025
4d6852c
add DCU, finish before projector
shiinakkk Dec 19, 2025
f8a9999
save
shiinakkk Dec 20, 2025
ff62822
fix projector
shiinakkk Dec 21, 2025
4b5c277
finish projector
shiinakkk Dec 21, 2025
93d3bc7
before push
shiinakkk Dec 22, 2025
7f8b3cf
add AGENTS.md
gofreelee Dec 22, 2025
bf76c00
Merge branch 'feature/minicpmv' into integrate/llava-v15-hygon
gofreelee Dec 22, 2025
0e6b7d2
complete llava
gofreelee Dec 22, 2025
5876f0d
clean verify code scripts
gofreelee Dec 23, 2025
3c3e5ad
delete meaningless log
gofreelee Dec 23, 2025
98c32c7
add perplexity for minicpmv
gofreelee Dec 24, 2025
fabb948
add perplexity for llava
gofreelee Dec 24, 2025
5542b3a
add timer
gofreelee Dec 24, 2025
068331f
add imgs
Dec 24, 2025
ec94181
Simplified the perplexity output.
Dec 24, 2025
da3bd75
add gqa_samples
Dec 25, 2025
7 changes: 5 additions & 2 deletions .gitignore
@@ -23,7 +23,10 @@ cache/
#GGUF
*.gguf

# txt
*.txt
# # txt
# *.txt

*.http

#log
log
18 changes: 18 additions & 0 deletions TODO.md
@@ -0,0 +1,18 @@
1. Goals:

(miniCPM + Fastcache) x (DCU & Moore Threads)
(llava + Fastcache) x (DCU & Moore Threads)


2. Work breakdown:

a. Get an end-to-end run on the DCU platform: debug the correctness of the llava encoder [work that can start now; stub out the missing operators for the time being] + wire the encoder/Fastcache/llm together.

b. Me, today: apply for compute resources on the Moore Threads platform.

c. Me, over the next two days: implement the operators missing on both platforms.

3. Deadline: 10 days from now, the 25th of this month.

4. Weights:
108:/home/weight/MiniCPM-V-2_6;/home/weight/llava-1.5-7b-hf
Binary file added compress_ckpt/llava_mlp.bin
Binary file not shown.
Binary file added compress_ckpt/llava_mlp_layerwise.bin
Binary file not shown.
Binary file added compress_ckpt/minicpm_mlp.pth
Binary file not shown.
Binary file added compress_ckpt/minicpmv_mlp.bin
Binary file not shown.
Empty file added debug_data/qkv_debug.txt
Empty file.
34 changes: 34 additions & 0 deletions docs/KVCacheCompressionMapping.md
@@ -0,0 +1,34 @@
# KV Cache Compression Weight Mapping (llava_mlp.bin)

## Prefixes and Their Meanings
The weights come from Fastcache's KVCacheLinearDecoupleCompressor. In the `.pth` file they sit under the `compressor` subtree, with the key pattern `<prefix>.<layer>.<slot>.weight`. The currently exported bin contains the following prefixes (written in sorted order):

- `compress_tk`: weights for text-K compression/projection
- `compress_tv`: weights for text-V compression/projection
- `compress_iv`: weights for image-V compression/projection (the naming likely follows the image/value abbreviation)
- `compress_ik`: weights for image-K compression/projection
- `attention`: linear attention/gating layers inside the compressor (small head count, typically slot = 0..7)

> Note: no bias was observed in the original PyTorch weights; the conversion script skips any bias whose length does not match.

## Sort and Write Order
Sort key: `prefix` priority (`compress_tk` → `compress_tv` → `compress_iv` → `compress_ik` → `attention`), then `layer` ascending, then `slot` ascending. Within the same `(prefix, layer, slot)`, the weight comes before the bias (if present).
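As a concrete illustration, a minimal Python sketch of this sort key (the helper and the sample keys are illustrative, not from the export script):

```python
# Minimal sketch of the documented sort order for keys shaped
# "<prefix>.<layer>.<slot>.weight"; sample keys are illustrative.
PREFIX_ORDER = ["compress_tk", "compress_tv", "compress_iv", "compress_ik", "attention"]

def sort_key(key: str):
    prefix, layer, slot, _param = key.rsplit(".", 3)
    return (PREFIX_ORDER.index(prefix), int(layer), int(slot))

keys = ["attention.0.1.weight", "compress_iv.1.0.weight", "compress_tk.0.0.weight"]
print(sorted(keys, key=sort_key))
# ['compress_tk.0.0.weight', 'compress_iv.1.0.weight', 'attention.0.1.weight']
```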

## Shape Inference and hidden_size
- The header's `hidden_size` is taken from the column count of the first weight (currently 640).
- Each weight block records `rows` and `cols`. Treat it as a linear layer `out = W * in`, where `W` has shape `[rows, cols]` and the output dimension equals `rows`.
- The bias (if present, with length == rows) immediately follows.

## Plausible Computation Graph (guesswork, for aligning the C++ implementation)
- `compress_tk`: dimensionality reduction/decoupling of text K; multiple slots may correspond to staged or multi-head mixed projections.
- `compress_tv`: dimensionality reduction/decoupling of text V.
- `compress_iv`: dimensionality reduction/decoupling of image V.
- `compress_ik`: dimensionality reduction/decoupling of image K.
- `attention`: small attention/gating MLP inside the compressor, used to produce the compression mapping or to fuse text/image features.

The actual computation order must be derived from the Fastcache Python source (`KVCacheLinearDecoupleCompressor.forward`), mapping the weights above onto concrete linear/activation/permute operations layer by layer.

## Verifying Against the bin
- `scripts/verify_llava_mlp_bin.py` compares the `.pth` with the `.bin`: it prints the header, per-block shapes, and the max diff.
- Current verification result: `num_layers=32`, `weight_count_per_layer=12`, 384 weight blocks, max diff = 0 (fp16).

34 changes: 34 additions & 0 deletions docs/KVCacheCompressionOpsChecklist.md
@@ -0,0 +1,34 @@
# KV Cache Compression: Algorithm Breakdown and Operator Requirements (llava_mlp.bin baseline)

## Module Breakdown (inferred from weight prefixes)
- `compress_tk`: text-K path compression/decoupling. Multiple slots may correspond to multi-stage or multi-head mixed projections.
- `compress_tv`: text-V path compression/decoupling.
- `compress_iv`: image-V path compression/decoupling.
- `compress_ik`: image-K path compression/decoupling.
- `attention`: small attention/gating linear layers inside the compressor (possibly used for fusion or for producing the mapping).

## Likely Computation Flow (following the Fastcache approach; must be checked against the source item by item)
1) Branch the KV by category (text/image) and apply per-head/per-slot linear transforms for dimensionality reduction or projection.
2) Optional gating/attention: use the `attention.*` weights to fuse the compressed features or to produce indices/weights.
3) Produce the compressed K/V (shortened along seq or reduced in dimension) and record the mapping (indices/scale).
4) Decompression path: using the stored mapping/scale, restore the compressed K/V to a form attention can consume (or consume the compressed format directly inside attention). A speculative, shape-only sketch of steps 1-2 follows this list.
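Since the real operator order still has to be read off `KVCacheLinearDecoupleCompressor.forward`, the following numpy sketch is purely speculative and only pins down shapes and ordering (per-slot projection, then a placeholder sigmoid gate):

```python
import numpy as np

# Speculative sketch of steps 1-2; NOT the Fastcache algorithm, just a
# shape/ordering aid. Real order must come from the Python source.
def compress_branch(kv, ws):
    # kv: [heads, seq, dim]; ws: list of per-slot weight matrices [out, dim]
    outs = [kv @ w.T for w in ws]        # per-slot linear projection
    h = np.concatenate(outs, axis=-1)    # mix slot outputs
    gate = 1.0 / (1.0 + np.exp(-h))      # placeholder gating (sigmoid)
    return h * gate

kv = np.random.rand(8, 100, 64).astype(np.float32)
ws = [np.random.rand(16, 64).astype(np.float32) for _ in range(4)]
print(compress_branch(kv, ws).shape)  # (8, 100, 64): 4 slots x 16 dims
```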

## Core Operators Needed (prefer reusing InfiniCore)
- Matrix multiply + bias: `linear` (already available). Must support fp16/bf16.
- Activations: SiLU/GELU (confirm availability; add elementwise kernels if missing).
- Tensor rearrangement: view/reshape/permute/slice (already supported by `Tensor`).
- Possible normalization/scaling: elementwise multiply-add (composable from existing primitives).
- Optional: indexing/gather (if the compression logic samples or reorders seq according to weights).

## Mappings To Pin Down (TBD)
- Input/output shapes for each prefix: `[B, heads, seq, dim]` → `[...]`, and how slots map.
- Where the compression factor applies: the seq dimension, the hidden dimension, or both.
- The input features and output use of the `attention` weights (what kind of weights/indices they produce).
- Whether indices/scale must be stored for decompression or sparse attention.

## Suggested Implementation Stages
1) **Placeholder compression**: first implement a simple keep-most-recent-N/truncation variant to get the pipeline running end to end (a minimal sketch follows this list).
2) **Weight-mapping alignment**: read the Python `KVCacheLinearDecoupleCompressor.forward` and write out, per prefix, the linear/activation order and tensor dimensions.
3) **Operator gap-fill**: if SiLU/GELU are missing, add compact kernels; compose the rest from existing linear/elemwise ops.
4) **Decompression strategy**: either decompress to dense KV (smaller change) or modify attention to consume the compressed format (choose one).
5) **Validation**: build a small C++/ctypes test: random KV → compress → decompress → compare errors; quantify overhead and gains.
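For stage 1, a minimal numpy sketch of the keep-most-recent-N placeholder (shapes, the keep-last policy, and the function name are assumptions for illustration):

```python
import numpy as np

# Stage-1 placeholder: keep only the most recent positions of the KV
# cache (pure truncation, no learned weights). Mirrors the C API's
# no-op behavior below min_seq_len.
def placeholder_compress(k, v, min_seq_len, factor):
    # k, v: [heads, seq, dim]
    seq = k.shape[1]
    if seq <= min_seq_len:
        return k, v  # no-op below the threshold
    new_len = max(min_seq_len, seq // factor)
    return k[:, -new_len:, :], v[:, -new_len:, :]

k = np.random.rand(8, 100, 64).astype(np.float16)
v = np.random.rand(8, 100, 64).astype(np.float16)
k2, v2 = placeholder_compress(k, v, min_seq_len=16, factor=4)
print(k2.shape)  # (8, 25, 64)
```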
57 changes: 57 additions & 0 deletions docs/KVCacheCompressionWeightFormat.md
@@ -0,0 +1,57 @@
# KV Cache Compression Weight Format (Binary, No PyTorch Dependency)

## File Layout
All values are little-endian unless stated otherwise. Strings are ASCII, null-terminated.

- Header (fixed size)
- `uint32` magic = 0x4B56434D ("KVCM")
- `uint32` version = 1
- `uint16` dtype code: 0 = fp16, 1 = bf16, 2 = fp32
- `uint16` reserved = 0
- `uint32` num_layers
- `uint32` num_heads
- `uint32` head_dim
- `uint32` hidden_size
- `uint32` compression_factor (e.g., 4, 5)
- `uint32` min_seq_len
- `uint32` weight_count_per_layer (for sanity check)
- `uint32` metadata_size_bytes (future expansion; set 0 for now)
- Layer blocks (repeat `num_layers` times)
- For each weight tensor (order defined below):
- `uint32` rows
- `uint32` cols
- `uint32` has_bias (0/1)
- data blob for weight: `rows * cols * sizeof(dtype)`
- optional bias blob: `rows * sizeof(dtype)` when `has_bias==1` (the bias length matches the output dimension `rows`; see the mapping doc)
- Footer
- `uint32` checksum (optional; set 0 if not used)
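To make the layout concrete, a minimal Python sketch that packs the header fields in the documented order (little-endian, fixed size; the field values here are illustrative):

```python
import struct

# Packs the documented 44-byte header: 2x uint32, 2x uint16, 8x uint32.
def pack_header(dtype_code, num_layers, num_heads, head_dim, hidden_size,
                compression_factor, min_seq_len, weight_count_per_layer):
    return struct.pack(
        "<IIHHIIIIIIII",
        0x4B56434D,            # magic "KVCM"
        1,                     # version
        dtype_code, 0,         # dtype code (0 = fp16), reserved
        num_layers, num_heads, head_dim, hidden_size,
        compression_factor, min_seq_len,
        weight_count_per_layer,
        0,                     # metadata_size_bytes (0 for now)
    )

hdr = pack_header(0, 32, 32, 128, 640, 4, 16, 12)  # sample values
print(len(hdr))  # 44
```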

## Weight Order per Layer (example for linear-decouple MLP)
Adjust if the actual model structure differs, but the order must be identical between export and load.
1. `proj_k` weight (+bias)
2. `proj_v` weight (+bias)
3. `compress_k` weight (+bias)
4. `compress_v` weight (+bias)
5. `decompress_k` weight (+bias)
6. `decompress_v` weight (+bias)
7. `gate`/`mlp` weights (+bias), if the algorithm requires them

`weight_count_per_layer` = the actual number of weight entries included, for validation during parsing.

## Export Steps (one-time, in an external Python env)
1) Load the original `.pth` with PyTorch: `state = torch.load(...)`.
2) Extract the compressor weights into a list in the fixed order; unify the dtype (fp16/bf16):
```python
weights = [
(state['proj_k.weight'], state.get('proj_k.bias')),
...
]
```
3) Write the header; then, layer by layer, write the per-tensor metadata + data, converting to bytes according to `dtype` (for fp16, cast to `np.float16` and call `.tobytes()`). A sketch of one block write follows below.
4) Fill in the footer (may be set to 0).
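A minimal sketch of step 3 for a single weight tensor (the function name is hypothetical):

```python
import struct
import numpy as np

# Writes one per-tensor block: (rows, cols, has_bias) metadata followed
# by raw fp16 bytes, matching the documented layer-block layout.
def write_block(f, weight, bias=None):
    w = np.asarray(weight, dtype=np.float16)
    rows, cols = w.shape
    f.write(struct.pack("<III", rows, cols, 1 if bias is not None else 0))
    f.write(w.tobytes())
    if bias is not None:
        f.write(np.asarray(bias, dtype=np.float16).tobytes())
```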

## Loader Expectations (C++/InfiniCore)
- Read and validate magic/version/dtype/num_layers/weight_count_per_layer.
- Create a `Tensor::weight` for each weight, with dtype matching the header.
- If a bias is absent (has_bias=0), skip it as agreed.
- Store the parsed weights in the compressor object in the same order, so that the forward logic stays correct.
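A minimal Python sketch of the validation the loader is expected to perform (mirroring the pack sketch above; the real loader would do this in C++):

```python
import struct

# Reads and validates the 44-byte header, returning the parsed fields.
def read_header(f):
    fields = struct.unpack("<IIHHIIIIIIII", f.read(44))
    (magic, version, dtype_code, _reserved, num_layers, num_heads,
     head_dim, hidden_size, factor, min_seq_len, wcpl, meta_size) = fields
    assert magic == 0x4B56434D and version == 1, "bad magic/version"
    assert dtype_code in (0, 1, 2), "unknown dtype code"
    return dict(num_layers=num_layers, num_heads=num_heads,
                head_dim=head_dim, hidden_size=hidden_size,
                compression_factor=factor, min_seq_len=min_seq_len,
                weight_count_per_layer=wcpl, metadata_size=meta_size)
```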
2 changes: 2 additions & 0 deletions include/infinicore_infer.h
@@ -2,9 +2,11 @@
#define INFINICORE_INFER_H

#include "infinicore_infer/cache.h"
#include "infinicore_infer/kv_compression.h"
#include "infinicore_infer/weights_loader.h"

#include "infinicore_infer/models/deepseek.h"
#include "infinicore_infer/models/jiuge.h"
#include "infinicore_infer/models/minicpmv.h"

#endif /* INFINICORE_INFER_H */
25 changes: 25 additions & 0 deletions include/infinicore_infer/kv_compression.h
@@ -0,0 +1,25 @@
#ifndef KV_COMPRESSION_H
#define KV_COMPRESSION_H

#include <stdint.h>

#include <infinirt.h>

struct KVCache;

typedef struct {
uint32_t enable;
uint32_t compression_factor;
uint32_t min_seq_len;
uint32_t image_kv_len;
const char *weight_path; // path to .bin weights (see docs/KVCacheCompressionWeightFormat.md)
} KVCompressionConfig;

// Compress KVCache in-place:
// - Reads KV from [0, seq_len) and writes compressed KV back into the same cache prefix [0, new_len).
// - Returns new_len on success; returns seq_len on no-op/failure.
__C __export uint32_t
compressKVCacheInplace(struct KVCache *kv_cache, uint32_t seq_len, const KVCompressionConfig *cfg);

#endif // KV_COMPRESSION_H
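A hypothetical ctypes sketch of driving this API from Python; the shared-library name and the KVCache handle acquisition are assumptions, only the struct layout and signature come from the header above:

```python
import ctypes

# Mirrors KVCompressionConfig from kv_compression.h, field for field.
class KVCompressionConfig(ctypes.Structure):
    _fields_ = [
        ("enable", ctypes.c_uint32),
        ("compression_factor", ctypes.c_uint32),
        ("min_seq_len", ctypes.c_uint32),
        ("image_kv_len", ctypes.c_uint32),
        ("weight_path", ctypes.c_char_p),
    ]

lib = ctypes.CDLL("libinfinicore_infer.so")  # assumed library name
lib.compressKVCacheInplace.restype = ctypes.c_uint32
lib.compressKVCacheInplace.argtypes = [
    ctypes.c_void_p, ctypes.c_uint32, ctypes.POINTER(KVCompressionConfig)]

cfg = KVCompressionConfig(1, 4, 16, 0, b"compress_ckpt/llava_mlp.bin")
# kv_cache would come from the model's cache-creation API (not shown):
# new_len = lib.compressKVCacheInplace(kv_cache, seq_len, ctypes.byref(cfg))
```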

133 changes: 132 additions & 1 deletion include/infinicore_infer/models/jiuge.h
@@ -30,11 +30,15 @@ typedef struct
// [dvoc, d]
const void *output_embd;
// nlayer * [d]
const void *const *attn_norm;
const void *const *attn_norm; // array of pointers, one RMSNorm weight per layer
// nlayer * [ndev, (nh + 2 * nkvh) / ndev * dh, d]
const void *const *attn_qkv;
// nlayer * [ndev, (nh + 2 * nkvh) / ndev * dh]
const void *const *attn_qkv_b;
// nlayer * [dh]
const void *const *attn_q_norm;
// nlayer * [dh]
const void *const *attn_k_norm;
// nlayer * [ndev, d, nkvh / ndev * dh]
const void *const *attn_o;
// nlayer * [d]
@@ -80,6 +84,43 @@ inferBatchJiuge(struct JiugeModel *,
const float *temperature, const uint32_t *topk, const float *topp,
uint32_t *output);

/// @brief Run one batched inference step and sample new tokens (RoPE positions and KV write positions decoupled, for KV compression)
/// @param req_pos base position ids (used for RoPE/pos_ids computation)
/// @param kv_pos base KVCache write/read positions (used for past_len/total_len computation)
__C __export void
inferBatchJiugeEx(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq,
const uint32_t *req_pos,
const uint32_t *kv_pos,
struct KVCache **kv_caches,
const float *temperature, const uint32_t *topk, const float *topp,
uint32_t *output);

/// @brief Run one batched inference step, sample new tokens, and also output logits
/// @param logits output logits array
__C __export void
inferBatchJiugeWithLogits(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq, const uint32_t *req_pos,
struct KVCache **kv_caches,
const float *temperature, const uint32_t *topk, const float *topp,
uint32_t *output, void *logits);

/// @brief Run one batched inference step (RoPE positions and KV write positions decoupled), also outputting logits
/// @param req_pos base position ids (used for RoPE/pos_ids computation)
/// @param kv_pos base KVCache write/read positions (used for past_len/total_len computation)
/// @param logits output logits array
__C __export void
inferBatchJiugeExWithLogits(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq,
const uint32_t *req_pos,
const uint32_t *kv_pos,
struct KVCache **kv_caches,
const float *temperature, const uint32_t *topk, const float *topp,
uint32_t *output, void *logits);

/// @brief Run one batched inference step, outputting the logits after the output embedding
/// @param tokens input token address
/// @param ntok number of input tokens
@@ -95,4 +136,94 @@ forwardBatchJiuge(struct JiugeModel *,
struct KVCache **kv_caches,
void *logits);

/// @brief Run one batched forward step, outputting logits (RoPE positions and KV write positions decoupled, for KV compression)
__C __export void
forwardBatchJiugeEx(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq,
const uint32_t *req_pos,
const uint32_t *kv_pos,
struct KVCache **kv_caches,
void *logits);

/// @brief Run one batched inference step, supporting overrides of the input embeddings at specified token positions (for multimodal image-embedding injection)
/// @note override_pos must be in ascending order, and each position may appear at most once
/// @param n_override number of override positions
/// @param override_pos override positions (indices into the concatenated token sequence, in the range [0, ntok))
/// @param override_embeds override embeddings, shape [n_override, d], dtype = meta.dt_logits
__C __export void
inferBatchJiugeWithOverrides(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq, const uint32_t *req_pos,
struct KVCache **kv_caches,
uint32_t n_override,
const uint32_t *override_pos,
const void *override_embeds,
const float *temperature, const uint32_t *topk, const float *topp,
uint32_t *output);

/// @brief Run one batched inference step (RoPE and KV write positions decoupled), with embedding overrides
__C __export void
inferBatchJiugeWithOverridesEx(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq,
const uint32_t *req_pos,
const uint32_t *kv_pos,
struct KVCache **kv_caches,
uint32_t n_override,
const uint32_t *override_pos,
const void *override_embeds,
const float *temperature, const uint32_t *topk, const float *topp,
uint32_t *output);

/// @brief Run one batched inference step with embedding overrides, also outputting logits
__C __export void
inferBatchJiugeWithOverridesWithLogits(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq, const uint32_t *req_pos,
struct KVCache **kv_caches,
uint32_t n_override,
const uint32_t *override_pos,
const void *override_embeds,
const float *temperature, const uint32_t *topk, const float *topp,
uint32_t *output, void *logits);

// /// @brief Run one batched inference step (RoPE and KV write positions decoupled) with embedding overrides, also outputting logits
// __C __export void
// inferBatchJiugeWithOverridesExWithLogits(struct JiugeModel *,
// const uint32_t *tokens, uint32_t ntok,
// const uint32_t *req_lens, uint32_t nreq,
// const uint32_t *req_pos,
// const uint32_t *kv_pos,
// struct KVCache **kv_caches,
// uint32_t n_override,
// const uint32_t *override_pos,
// const void *override_embeds,
// const float *temperature, const uint32_t *topk, const float *topp,
// uint32_t *output, void *logits);

/// @brief Run one batched forward step, outputting logits, with input-embedding overrides at specified token positions
__C __export void
forwardBatchJiugeWithOverrides(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq, const uint32_t *req_pos,
struct KVCache **kv_caches,
uint32_t n_override,
const uint32_t *override_pos,
const void *override_embeds,
void *logits);

/// @brief Run one batched forward step, outputting logits (RoPE and KV write positions decoupled), with embedding overrides
__C __export void
forwardBatchJiugeWithOverridesEx(struct JiugeModel *,
const uint32_t *tokens, uint32_t ntok,
const uint32_t *req_lens, uint32_t nreq,
const uint32_t *req_pos,
const uint32_t *kv_pos,
struct KVCache **kv_caches,
uint32_t n_override,
const uint32_t *override_pos,
const void *override_embeds,
void *logits);

#endif
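To make the override contract concrete, a hypothetical ctypes signature sketch for the overrides entry point; the library name is an assumption, while the parameter order mirrors the declaration above:

```python
import ctypes

lib = ctypes.CDLL("libinfinicore_infer.so")  # assumed library name
lib.inferBatchJiugeWithOverrides.argtypes = [
    ctypes.c_void_p,                                   # struct JiugeModel *
    ctypes.POINTER(ctypes.c_uint32), ctypes.c_uint32,  # tokens, ntok
    ctypes.POINTER(ctypes.c_uint32), ctypes.c_uint32,  # req_lens, nreq
    ctypes.POINTER(ctypes.c_uint32),                   # req_pos
    ctypes.POINTER(ctypes.c_void_p),                   # kv_caches
    ctypes.c_uint32,                                   # n_override
    ctypes.POINTER(ctypes.c_uint32),                   # override_pos (ascending)
    ctypes.c_void_p,                                   # override_embeds [n_override, d]
    ctypes.POINTER(ctypes.c_float),                    # temperature
    ctypes.POINTER(ctypes.c_uint32),                   # topk
    ctypes.POINTER(ctypes.c_float),                    # topp
    ctypes.POINTER(ctypes.c_uint32),                   # output
]
```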