DeepSeek-V3 Decoupled Implementation

Standalone PyTorch implementation of DeepSeek-V3 (671B parameters, ~37B active) decoupled from HuggingFace transformers dependencies, with multiple kernel backends and comprehensive benchmarking.

Paper: DeepSeek-V3 Technical Report (arXiv:2412.19437)

Architecture

DeepSeek-V3 is a Mixture-of-Experts (MoE) language model with:

Multi-head Latent Attention (MLA): Compresses keys and values into a 512-dim latent space and applies RoPE to a decoupled 64-dim stream
Mixture of Experts: 256 routed experts (top-8 per token), 1 shared expert, group-based routing (n_group=8, topk_group=4)
Auxiliary-loss-free load balancing: Sigmoid routing with correction bias (topk_method="noaux_tc")
YaRN RoPE: Extends the context window to 163,840 tokens (factor=40)
Multi-Token Prediction (MTP): 1 prediction layer, usable for speculative decoding
61 layers: 3 dense + 58 MoE
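
The group-based, auxiliary-loss-free routing listed above can be sketched for a single token in plain Python. This is a minimal illustration, not the code from this repository: the function name `route_token` is hypothetical, and the details (group score = sum of the two highest biased scores in the group, correction bias used only for selection, weights taken from the unbiased sigmoid scores and normalized) are assumptions based on the publicly released DeepSeek-V3 reference model.

```python
import math


def route_token(logits, bias, n_group=8, topk_group=4, top_k=8):
    """Pick experts for one token (hypothetical helper, illustration only).

    logits: one router logit per routed expert.
    bias:   one correction-bias value per routed expert.
    Returns {expert_index: routing_weight} with weights summing to 1.
    """
    n_experts = len(logits)
    if n_experts % n_group != 0:
        raise ValueError("n_experts must be divisible by n_group")
    per_group = n_experts // n_group

    # Sigmoid affinity per expert; the bias only influences which experts win.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    biased = [s + b for s, b in zip(scores, bias)]

    # Assumption: a group's score is the sum of its two highest biased scores.
    group_scores = []
    for g in range(n_group):
        members = sorted(biased[g * per_group:(g + 1) * per_group], reverse=True)
        group_scores.append(sum(members[:2]))

    # Keep the best topk_group groups, then the best top_k experts inside them.
    kept = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]
    candidates = [e for g in kept for e in range(g * per_group, (g + 1) * per_group)]
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:top_k]

    # Routing weights come from the unbiased scores, normalized over the winners.
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}
```

With the defaults, 256 logits are split into 8 groups of 32 experts, 4 groups survive, and 8 experts are chosen from the remaining 128 candidates. Because the bias shifts selection but not the weights, under-used experts can be nudged into the top-k without an auxiliary loss term.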

Key Dimensions (671B variant)

Parameter	Value
hidden_size	7168
num_attention_heads	128
num_hidden_layers	61
q_lora_rank	1536
kv_lora_rank	512
qk_nope_head_dim	128
qk_rope_head_dim	64
v_head_dim	128
n_routed_experts	256
num_experts_per_tok	8
moe_intermediate_size	2048
vocab_size	129,280
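
The dimensions above are enough to estimate why MLA shrinks the KV cache. The sketch below assumes that each layer caches only the compressed latent (kv_lora_rank) plus the shared RoPE key (qk_rope_head_dim) per token, as described in the DeepSeek-V3 report, and compares that with a conventional per-head cache using the same head sizes. The class and function names are hypothetical, and the cache layout actually used in this repository may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MLADims:
    """Values copied from the Key Dimensions table."""
    num_hidden_layers: int = 61
    num_attention_heads: int = 128
    kv_lora_rank: int = 512
    qk_nope_head_dim: int = 128
    qk_rope_head_dim: int = 64
    v_head_dim: int = 128


def mla_cache_elems_per_token(d: MLADims) -> int:
    # Assumption: one latent vector plus one shared RoPE key per layer.
    return d.num_hidden_layers * (d.kv_lora_rank + d.qk_rope_head_dim)


def mha_cache_elems_per_token(d: MLADims) -> int:
    # Hypothetical baseline: full K and V cached for every head.
    k_dim = d.qk_nope_head_dim + d.qk_rope_head_dim
    return d.num_hidden_layers * d.num_attention_heads * (k_dim + d.v_head_dim)


dims = MLADims()
mla = mla_cache_elems_per_token(dims)   # 61 * 576
mha = mha_cache_elems_per_token(dims)   # 61 * 128 * 320
print(f"MLA: {mla} elements/token ({mla * 2} bytes in BF16)")
print(f"Per-head baseline: {mha} elements/token, {mha / mla:.1f}x larger")
```

Under these assumptions the latent cache holds 576 elements per token per layer, roughly 71 times fewer than the per-head baseline.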

Directory Structure

deepseekv3_2/
  deepseekv3_2-raw-decoupled-from-hf/   # Pure PyTorch, no HuggingFace dependencies
  deepseekv3_2-triton/                    # Triton kernels + unsloth
  deepseekv3_2-kernels-flashinfer/        # FlashInfer attention backend
  deepseekv3_2-kernels-flashmla-deepgemm/ # FlashMLA + DeepGEMM (highest perf)
  benchmark/                               # Comprehensive benchmarking suite
  data/                                    # Shared sample data
  research/                                # Research directions
  viz/                                     # Architecture visualizations

Kernel Selection

Kernel	Component	Status	Notes
FlashMLA	MLA Attention	Recommended	Built by DeepSeek specifically for MLA
DeepGEMM	MoE Expert GEMM	Recommended	FP8 grouped GEMM on Hopper
FlashInfer	MLA Attention	Alternative	Good MLA support, wider hardware compat
Triton	All components	Fallback	Custom fused kernels
Unsloth	RMSNorm, SwiGLU, CE, LoRA	All variants	Fast fused operations
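
One way to act on the table above is to probe for the optional packages at import time and fall back in order of preference. The sketch below is an assumption about how such a selection could look, not the mechanism used in this repository: the helper names are hypothetical, and the module names (`flash_mla`, `deep_gemm`, `flashinfer`, `triton`) are assumed import names for the respective projects.

```python
from importlib.util import find_spec

# (backend label, assumed import name), most preferred first.
ATTENTION_PREFERENCE = [
    ("flashmla", "flash_mla"),
    ("flashinfer", "flashinfer"),
    ("triton", "triton"),
]
MOE_GEMM_PREFERENCE = [
    ("deepgemm", "deep_gemm"),
    ("triton", "triton"),
]


def is_installed(module_name: str) -> bool:
    """Return True if the module can be located without importing it."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def pick_backend(preference, installed=is_installed, default="pytorch"):
    """Return the first backend whose module is available, else the default."""
    for label, module_name in preference:
        if installed(module_name):
            return label
    return default


if __name__ == "__main__":
    print("attention backend:", pick_backend(ATTENTION_PREFERENCE))
    print("MoE GEMM backend:", pick_backend(MOE_GEMM_PREFERENCE))
```

Passing `installed` as a parameter keeps the selection logic testable on machines without a GPU. A real selector would likely also check the GPU architecture, since the table lists DeepGEMM's FP8 grouped GEMM as a Hopper kernel.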

Architecture Differences from GLM5

Feature	DeepSeek-V3	GLM5
DSA (DeepSeek Sparse Attention)	No	Yes
MTP	1 layer	3 shared layers
RoPE	YaRN (factor=40)	Standard
MoE routing	Group-based (n_group=8)	Flat (n_group=1)
hidden_size	7168	6144
num_layers	61	78
num_heads	128	64
qk_nope_head_dim	128	192
v_head_dim	128	256
rms_norm_eps	1e-6	1e-5
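
The YaRN entry in the RoPE row can be sanity-checked with a little arithmetic. The sketch below assumes that the factor multiplies the pre-extension context window, and it uses the attention-scale rule recommended in the YaRN paper (0.1 * ln(factor) + 1). The function names and the `mscale` default of 1.0 are illustrative assumptions, not values read from this repository.

```python
import math


def original_context(extended_context: int, factor: int) -> int:
    """Context window before YaRN extension, assuming extended = original * factor."""
    if extended_context % factor != 0:
        raise ValueError("extended_context is not a multiple of factor")
    return extended_context // factor


def yarn_mscale(factor: float, mscale: float = 1.0) -> float:
    """Attention magnitude scale suggested by the YaRN paper for a given factor."""
    if factor <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(factor) + 1.0


print(original_context(163_840, 40))   # pre-extension window in tokens
print(round(yarn_mscale(40), 4))       # scale applied to attention magnitudes
```

With factor=40, the 163,840-token window corresponds to a 4,096-token base window, and the suggested scale is about 1.37.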

Quick Start

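# Assumes deepseekv3_2-raw-decoupled-from-hf/ is importable under this underscore
# name (a hyphenated directory cannot be imported directly).
from deepseekv3_2_raw_decoupled_from_hf.config import DEEPSEEK_V3_CONFIG
from deepseekv3_2_raw_decoupled_from_hf.model import DeepSeekV3ForCausalLM

# Instantiate the model from the bundled full-size (671B) config.
model = DeepSeekV3ForCausalLM(DEEPSEEK_V3_CONFIG)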

Tests

# Standalone tests (CPU, no GPU required)
python deepseekv3_2-raw-decoupled-from-hf/test_standalone.py

# Data flow integration tests
python test_data_flow_unsloth_pytorch.py

# Validation with shared data
python deepseekv3_2-raw-decoupled-from-hf/validate.py

DeepSeekV3.2-Tiered-Benchmarks

