
HelixLM: Recurrent Heterogeneous Graph Neural Language Model

Why "Helix"? A helix coils back on itself — just as our recurrent graph reuses its weights across depth iterations, refining understanding with each loop. Biological, elegant, memorable.

HelixLM is an optimized hybrid architecture for small-scale language modeling, designed for hyperpersonalization and on-device AI. It combines biological brain-inspired random graph wiring with modern SOTA primitives (hybrid attention, Mamba-2 SSD, RoPE, SwiGLU, RMSNorm, optional Titans neural memory) and full HuggingFace integration.


Use Cases

  1. Hyperpersonalization — Train a small model (from sub-million to ~1B parameters) from a cold start on a personal corpus, then fully fine-tune it on the user's own data. The model becomes an expert in the one thing generic frontier models don't know: you — your domain knowledge, style, notes, emails, and work patterns.

  2. On-device AI — Efficient inference on CPU/GPU for desktops, laptops, tablets, and mobile. Strong quality per parameter, with optional fully-local (log-less) operation for sensitive use cases.


Architecture

Standard transformers force information through a fixed-depth stack of identical layers. Biological brains don't work that way: cortical columns contain diverse cell types, lateral connections enable short-circuit pathways, and feedback loops allow iterative refinement.

HelixLM mimics this structure inside a single recurrent block:

Input Tokens
    ↓
Embedding (d_model)
    ↓
┌─────────────────────────────────────────────────────────────┐
│  Recurrent Block (n_loops × shared graph weights)            │
│  ├── HelixGraph: Randomly wired heterogeneous neural columns │
│  │   ├── LinearAttnNode   : O(n) causal linear attention     │
│  │   ├── FullAttnNode     : Causal softmax (periodic)        │
│  │   ├── SwiGLUNode       : Modern gated activation          │
│  │   ├── Mamba2Node       : Mamba-2 SSD (long-range)         │
│  │   ├── TitansMemoryNode : Neural long-term memory (opt.)   │
│  │   ├── GateNode         : Learned multi-input aggregation  │
│  │   └── Random wiring    : Vertical + lateral connections   │
│  ├── LTI Injection        : Recurrent stability (A < 1)      │
│  └── ACT Halting          : Dynamic per-token depth          │
└─────────────────────────────────────────────────────────────┘
    ↓
RMSNorm + LM Head (tied embeddings)
    ↓
Logits / Generation

Key Design Choices

| Component | What it does | Why it matters |
|---|---|---|
| Neural Columns & Heterogeneous Nodes | Each column holds diverse node types (attention variants, SwiGLU, Mamba-2, gating, optional Titans memory) instead of identical transformer blocks. | Different information pathways for different computations, like biological cortical columns. |
| Recurrent Depth (LTI + ACT) | The same graph weights are looped n_loops times. LTI injection keeps the recurrent state stable. ACT halting dynamically allocates compute per token. | Iterative refinement without parameter growth; easy tokens use 1 loop, hard reasoning uses more (sketched below). |
| Hybrid Attention | Linear attention (O(n) complexity) in most columns, with periodic full-attention columns for exact retrieval. | Long-context efficiency without losing precise copy/lookup capability. |
| Mamba-2 SSD | State Space Duality with chunked parallel scan. Auto-activates when ssm_d_state >= 64. | Handles very long-range dependencies efficiently on CPU, CUDA, or MPS. |
| Titans Neural Memory (optional) | Persistent surprise-gated memory via outer-product updates (first column only by default). | Test-time memory that can retain patterns across long documents without growing KV cache. |
| Modern Primitives | RoPE, SwiGLU, RMSNorm, and weight tying. | Proven SOTA components for convergence and generation quality. |

Note on column layout: The nodes_per_column config field is reserved for future extensibility. In the current graph builder, every column automatically instantiates Attention + SwiGLU + optional Mamba-2 SSD + optional Titans Memory + Gate. DenseNode is defined in the codebase but not used in the default wiring recipe.
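
To make the recurrent loop concrete, here is a minimal illustrative sketch. helix_graph and act_head stand in for the real modules, and alpha plays the role of the LTI decay; the actual implementation lives in recurrent.py:

import torch

def recurrent_block(h, helix_graph, act_head, n_loops=4, alpha=0.9, act_threshold=0.99):
    """Illustrative sketch of the recurrent loop; placeholder names, not the real internals."""
    halted = torch.zeros(h.shape[:2], dtype=torch.bool)    # (batch, seq) per-token halt flags
    for _ in range(n_loops):                               # the SAME graph weights every iteration
        update = helix_graph(h)                            # randomly wired heterogeneous columns
        refined = alpha * h + update                       # LTI injection: |alpha| < 1 keeps the state stable
        h = torch.where(halted.unsqueeze(-1), h, refined)  # halted tokens keep their state frozen
        halted |= act_head(h) > act_threshold              # ACT: tokens whose halt score clears the threshold stop early
        if halted.all():                                   # every token has halted; no more loops needed
            break
    return h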


Quick Start

Installation

pip install torch transformers datasets accelerate

Smoke Test (CPU, character-level)

cd helix_lm
python smoke_test.py

Minimal Demo (CPU, HuggingFace tokenizer)

python quick_demo_cpu.py

Comparison: Where does HelixLM fit?

The Landscape

| Project | What They Do | Their Core Tech | Their Weakness |
|---|---|---|---|
| OpenMythos | Recurrent depth for LLMs | LTI stability + ACT halting | Standard transformer blocks; no heterogeneous graph |
| Cerebros | Biological brain-inspired wiring | Random DAG of dense layers | No real attention; no positional encoding; no recurrence |
| Phi-2 (Microsoft) | Small but capable | Standard transformer + quality data | Nothing novel architecturally; just good data curation |
| Qwen2.5 (Alibaba) | Multilingual small LLM | Standard transformer | No graph wiring; no recurrent depth; no SSM integration |
| Mamba-2 (Tri Dao) | Efficient SSM | State Space Duality | Not a complete LLM architecture; needs integration |
| Kimi Linear | Linear attention at scale | Hybrid linear + full attention | Not open source; no graph wiring |
| HelixLM (Ours) | All of the above, integrated | Graph + Recurrent + Hybrid Attention + Mamba-2 + Titans | Small team; needs community compute for 1B+ scaling |

HelixLM vs. OpenMythos

OpenMythos pioneered recurrent depth with LTI stability and ACT halting. HelixLM takes that insight and makes it work inside a heterogeneous graph that mimics the neural-column structure and random-topology connectivity found in biological brains.

| Capability | OpenMythos | HelixLM |
|---|---|---|
| Recurrent depth | ✅ Same block looped | ✅ Same block looped |
| LTI stability | ✅ Spectral radius < 1 | ✅ Spectral radius < 1 |
| ACT halting | ✅ Dynamic per-token depth | ✅ Dynamic per-token depth |
| Architecture inside loop | Standard transformer block | Heterogeneous random graph |
| Attention | Standard full attention only | Linear + full hybrid |
| Node types | Single (transformer block) | 7+ active types (attention variants, FFN, SSM, gate, neural memory) |
| Positional encoding | Standard learned | RoPE |
| Activation | GELU | SwiGLU |
| Normalization | LayerNorm | RMSNorm |
| Open source | ✅ Yes | ✅ Yes |
| HF integration | ❌ No | Full PreTrainedModel |

HelixLM vs. Cerebros

Cerebros showed that a biologically inspired random topology of Dense layers, with hyperdense vertical and lateral wiring, could outperform rigid layer stacks. It generated text without attention on small data, but required elaborate integration work and clashed with standard model-structure paradigms. HelixLM carries that topological insight into a modern, HF-compatible LLM backbone.


Why HelixLM in Practice?

Hyperpersonalization

| Requirement | How HelixLM Delivers |
|---|---|
| Small enough to train per-user | 0.4M–1.2B params = trainable on a single GPU in hours |
| Rich enough to capture user style | Heterogeneous graph routes different info types through optimal processing paths |
| Fast inference for tool calls | Linear attention gives O(n) complexity; ACT skips unnecessary depth |
| Queryable by frontier model | HF integration = standard API; chat templates = structured responses |

On-Device AI

| Requirement | How HelixLM Delivers |
|---|---|
| Runs on CPU | Linear attention + lightweight graph = fast without GPU |
| Good quality for size | Recurrent depth reuses weights across loops = more capacity per parameter |
| Long context without forgetting | LTI-stable recurrence preserves state; RoPE generalizes length |
| Optional log-less operation | No cloud dependency; weights run locally |
| Responsive to long messages | Rolling dataset training = learns to handle arbitrary-length inputs |

Parameter Efficiency & Scaling

Recurrent loops reuse weights (depth without parameter growth). Graph wiring creates expressive pathways without wide uniform layers. Heterogeneous nodes specialize, so no capacity is wasted on uniform operations.

Because the graph is the bulk of the model, parameter counts are shown for two common vocabularies: a minimal character-level vocab (~100 tokens) and the GPT-2 BPE vocab (50,257 tokens). Both assume tied embeddings/head.

Parameter Scaling

| Preset | d_model | ~Params (Char) | ~Params (GPT-2) | Status |
|---|---|---|---|---|
| tiny | 128 | 0.5 M | 13 M | Smoke tests + demos |
| small | 256 | 3.2 M | 29 M | Experiments |
| base | 512 | 32 M | 83 M | Serious pretraining |
| medium | 768 | ~134 M | ~211 M | Production small |
| large | 1024 | ~286 M | ~389 M | Competitive |
| xl | 1536 | ~1.1 B | ~1.2 B | Frontier small |
| xxl | 2048 | ~2.7 B | ~2.9 B | Frontier |

# One-liner scaling
cfg = HelixConfig.medium(vocab_size=151936, tokenizer_name="qwen")
model = HelixForCausalLM(cfg)

HuggingFace Integration

from helix_lm import HelixConfig, HelixForCausalLM, HelixTokenizer

cfg = HelixConfig.small(vocab_size=50257)
model = HelixForCausalLM(cfg)
tokenizer = HelixTokenizer("gpt2")

# Save/load in standard HF format
model.save_pretrained("./my-helix-model")
tokenizer._backend.save_pretrained("./my-helix-model")

# Load later with standard transformers Auto classes
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("./my-helix-model")
tokenizer = AutoTokenizer.from_pretrained("./my-helix-model")

Tokenizer Support

HelixLM supports multiple tokenizer backends, switchable via tokenizer_name:

| Name | Backend | Vocab Size | Best For |
|---|---|---|---|
| char | Character-level | ~100 | Debugging, tiny models |
| gpt2 | GPT-2 BPE | 50,257 | English, general text |
| qwen | Qwen3 family | 151,936 | Multilingual, code |
| custom | Any HF tokenizer | Any | Your own tokenizer |

# GPT-2
tokenizer = HelixTokenizer("gpt2")

# Qwen3 multilingual
tokenizer = HelixTokenizer("qwen")

# Custom HF model
tokenizer = HelixTokenizer("meta-llama/Llama-3.2-1B")

# Character (for debugging)
tokenizer = HelixTokenizer("char")
tokenizer.build_char_vocab("some text here")

Forward Pass & Generation

import torch

# Forward
input_ids = torch.tensor([tokenizer.encode("Hello world")])
logits = model(input_ids)

# Standard generation
generated = model.generate(input_ids, max_new_tokens=50, temperature=0.8, top_k=50)
text = tokenizer.decode(generated[0])

# Extended generation with stop-string detection
generated = model.generate_ext(
    input_ids,
    max_new_tokens=2000,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    stop_strings=["</s>", "\n\n"],
)

Dataset & Training

Rolling Chunking

The HelixDataset handles three scenarios automatically:

  1. Text >> seq_len: Rolling window with configurable stride (default 50% overlap)
  2. Text == seq_len: Exact fit
  3. Text < seq_len: Padding with attention mask

from helix_lm import HelixDataset, create_helix_dataloader

dataset = HelixDataset(
    texts=train_texts,
    tokenizer=tokenizer,
    seq_len=2048,
    stride=1024,  # 50% overlap
)

loader = create_helix_dataloader(
    texts=train_texts,
    tokenizer=tokenizer,
    seq_len=2048,
    batch_size=8,
    stride=1024,
)
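
With a loader in hand, a minimal training step might look like the sketch below. It assumes each batch is a dict with input_ids and labels (ignored positions set to -100, per the usual HF convention) — an assumption for illustration; trainer.py provides the full production loop:

import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

model.train()
for batch in loader:
    logits = model(batch["input_ids"])                       # (batch, seq, vocab)
    # Shift by one so each position predicts the next token
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,                                   # assumed masking convention
    )
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad_clip = 1.0, per best practices
    optimizer.step()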

Natural Stop Detection

Each sample includes is_natural_stop: True if the chunk ends at a document boundary. This lets the model learn to distinguish natural endings from forced truncations.
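
For example (assuming each sample is a dict exposing the flag alongside input_ids; key names are illustrative):

sample = dataset[0]
print(sample["is_natural_stop"])  # True only if the chunk ends at a document boundary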

Document-Aware Chunking (No Boundary Crossings)

DocumentAwareDataset splits documents into non-overlapping chunks without crossing document boundaries. Only padding positions are masked in labels, giving 100% token utilization on real text.
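
Usage mirrors HelixDataset; the import below assumes DocumentAwareDataset is exported from helix_lm like the other dataset classes:

from helix_lm import DocumentAwareDataset  # assumed export, alongside HelixDataset

doc_dataset = DocumentAwareDataset(
    texts=train_texts,
    tokenizer=tokenizer,
    seq_len=2048,  # non-overlapping chunks that never cross document boundaries
)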

HuggingFace Streaming

from helix_lm import HelixHFDataset

hf_dataset = HelixHFDataset(
    dataset_name="HuggingFaceTB/smoltalk",
    tokenizer=tokenizer,
    seq_len=2048,
    streaming=True,
)
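
The streaming dataset can then be fed to a standard PyTorch DataLoader (assuming HelixHFDataset implements the torch iterable-dataset protocol):

from torch.utils.data import DataLoader

stream_loader = DataLoader(hf_dataset, batch_size=8)  # no shuffling: streamed data arrives in order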

Configuration & Parameters

HelixLM scales from 0.4M to 1.3B+ parameters through a single configuration class. Below are the key knobs and their practical effects.

Core Dimensions

| Parameter | Effect | Practical Tip |
|---|---|---|
| d_model | Width of the model; determines embedding and hidden dimension. | 128 for smoke tests; 768–1536 for production-quality small models. |
| n_columns | Number of neural columns (graph depth). | 2–4 for fast experiments; 6–7 for large models. |
| n_loops | How many times the graph is recurrently executed. | 1 for speed; 3–4 for iterative reasoning depth. |
| n_heads | Attention heads for attention nodes. Must divide d_model. | 4–8 for small models; 16–32 for large. |
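
For example, overriding these knobs on a preset (presets accept keyword overrides, as in the scaling one-liner above; the values here are illustrative):

cfg = HelixConfig.base(
    vocab_size=50257,
    n_columns=4,  # graph depth
    n_loops=3,    # recurrent passes over the shared graph
    n_heads=8,    # must divide d_model (512 for the base preset)
)
model = HelixForCausalLM(cfg)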

Attention & Memory

| Parameter | Effect | Practical Tip |
|---|---|---|
| attention_mode | linear (O(n)), full (exact softmax), or hybrid (both). | hybrid is recommended for almost all use cases. |
| hybrid_full_attention_interval | How often to place a full-attention column in hybrid mode. | 3 or 4 (i.e., every 3rd/4th column is full attention). |
| use_ssm | Enables Mamba-2 SSD nodes in the graph. | Turn on for long-context or stateful tasks. |
| ssm_d_state | State dimension for Mamba-2. | Must be >= 64 to activate the optimized Mamba-2 path; smaller values use a simplified SSM. |
| use_titans_memory | Adds a Titans neural memory node to the first column. | Useful for very long documents; still experimental. |
| seq_len | Training context window. | Can generate far beyond this at inference time thanks to recurrent state. |
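
Putting these together, an illustrative long-context configuration (the values are examples, not tuned recommendations):

cfg = HelixConfig.base(
    attention_mode="hybrid",
    hybrid_full_attention_interval=4,  # every 4th column gets exact softmax attention
    use_ssm=True,
    ssm_d_state=64,                    # >= 64 activates the Mamba-2 SSD path
)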

Training & Stability

| Parameter | Effect | Practical Tip |
|---|---|---|
| ffn_expansion | Hidden dimension multiplier for SwiGLU nodes. | 2.0 is standard; 2.5–3.0 for XL/XXL presets. |
| act_threshold | Confidence threshold for Adaptive Computation Time halting. | 0.99 is conservative. Reduce to 0.95 if you want deeper per-token thinking. |
| dropout | Regularization rate. | 0.0–0.05 for small models; higher for large datasets. |
| lr / weight_decay | AdamW optimizer settings. | 3e-4 with 0.1 decay works well for small models. |
| grad_clip | Max gradient norm. | Keep at 1.0 to prevent spikes in graph architectures. |

Generation Features

Extended Generation Beyond max_seq_len

# Standard generation (uses full model each step)
generated = model.generate(input_ids, max_new_tokens=200)

# Extended generation via HelixForCausalLM (uses only last token, with truncation)
generated = model.generate_ext(
    input_ids,
    max_new_tokens=2000,  # Far beyond training seq_len
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    stop_strings=["</s>", "\n\n"],
)

Stop String Detection

model.generate_ext(
    input_ids,
    max_new_tokens=100,
    stop_strings=["Question:", "User:", "\n\n\n"],
)

Batched Generation

prompts = ["The weather is", "In 1492,", "Once upon a time"]
encoded = [torch.tensor(tokenizer.encode(p)) for p in prompts]
# torch.stack requires equal lengths; pad to a common length first (0 here is illustrative -- use your pad token id)
batch = torch.nn.utils.rnn.pad_sequence(encoded, batch_first=True, padding_value=0)
results = model.generate(batch, max_new_tokens=30)

Chat Template Support

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

output = model.generate_ext(input_ids, max_new_tokens=50)
response = tokenizer.decode(output[0][input_ids.shape[-1]:])

Mamba-2 SSD Integration

For models with use_ssm=True and ssm_d_state >= 64, HelixLM automatically uses Mamba-2 SSD instead of the simplified SSM:

  • State Space Duality: Unifies attention and SSM perspectives
  • Parallel Scan: Efficient associative scan in PyTorch (chunked for large sequences)
  • Selective Parameters: Input-dependent B, C, and Δ (time step)

cfg = HelixConfig.medium(
    use_ssm=True,
    ssm_d_state=128,   # Use Mamba-2 when >= 64
    ssm_d_conv=4,
    ssm_expand=2,
)

VLM Extensibility (Future-Ready)

HelixLM includes architecture hooks for vision-language integration:

cfg = HelixConfig.base(
    is_vlm=True,
    vision_encoder="siglip",  # or "clip", "custom"
    vision_hidden_size=768,
    vision_patch_size=16,
    vision_image_size=448,
    fusion_strategy="perceiver",  # "perceiver" or "simple_merge"
)

# Future:
#   HelixForImageTextToText
#   processor.apply_chat_template with images

Training Best Practices

  1. Warmup + Cosine Decay: Linear warmup followed by cosine decay to 0.1× base LR (a minimal scheduler sketch follows this list).
  2. Gradient Clipping: Max norm 1.0 prevents spikes in graph architectures.
  3. Weight Decay: 0.1 for small models, 0.01 for large.
  4. Mixed Precision: Use torch.cuda.amp only on Ampere+ GPUs for XL/XXL; keep it off for tiny/small to avoid NaNs.
  5. Streaming: HelixDataset supports corpora larger than RAM.
  6. No Cross-Document Boundaries: Use DocumentAwareDataset to avoid pollution of gradient signal at document seams.
  7. Data Quality: For small models, data quality > quantity. Curate aggressively.
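
A minimal sketch of practice 1 in plain PyTorch; warmup_steps and total_steps are placeholders to set for your run:

import math
import torch

warmup_steps, total_steps = 500, 10_000  # placeholders: set for your run

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine decay from 1.0x down to 0.1x of the base LR
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() after each optimizer.step() during training.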

Project Structure

helix_lm/
  __init__.py           - Package exports
  config.py             - HelixConfig with 7 presets
  tokenizer.py          - Multi-backend tokenizer (char / GPT-2 / Qwen / custom)
  rope.py               - Rotary positional embeddings
  nodes.py              - Heterogeneous node types (7+ active, Dense defined)
  graph.py              - HelixGraph executor with random wiring
  recurrent.py          - Recurrent block (LTI + ACT)
  model.py              - HelixLMCore (non-HF)
  hf_model.py           - HelixForCausalLM (HF PreTrainedModel)
  dataset.py            - Rolling chunking, streaming, HF integration
  trainer.py            - Production training loop
  mamba2.py             - Mamba-2 SSD with parallel scan
  smoke_test.py         - Self-contained CPU test
examples/
  train_hf_full.py      - HF tokenizer + training example
  quick_demo_cpu.py     - Minimal CPU demo

Existing Research We Build On

| Idea | Source / Inspiration | How we use it |
|---|---|---|
| Hybrid Attention | Linear + full attention interleaving | 3:1 or 4:1 linear-to-full ratio; O(n) training with exact copy layers |
| Recurrent Depth | OpenMythos / Universal Transformers | Same graph weights looped n_loops times; fewer params, iterative refinement |
| LTI Stability | OpenMythos | Log-parameterized state decay to keep spectral radius < 1 |
| Mamba-2 SSD | Dao & Gu (2024) | Chunked associative scan for selective SSM; auto-enabled at d_state >= 64 |
| Biological Graph Wiring | Cerebros / Random DAGs | Random vertical & lateral edges instead of strict feedforward stacks |
| Titans Neural Memory | Behrouz et al. (2025) | Optional first-column persistent memory with surprise-gated outer-product updates |

License

This project is open-source under a modified Apache 2.0 license. Please see the license.md file for full terms and conditions.

@software{helixlm2026,
  title = {HelixLM: Recurrent Heterogeneous Graph Neural Language Model},
  year = {2026},
  note = {Open-source small language model architecture}
}
