This repository presents a controlled, component-level ablation study of the Baby Dragon Hatchling (BDH) architecture on byte-level WikiText-2 language modeling. The goal of this research is to isolate and quantify the representational importance of BDH's core bio-inspired features.
The Dragon Hatchling (BDH) (Kosowski et al., arXiv:2509.26507) is a biologically-inspired sequence modeling architecture. It replaces standard Multi-Head Attention (MHA) and Feed-Forward Networks (FFN) with a dual-circuit state-space model that implements integrate-and-fire activation thresholding, Hebbian synaptic learning rules, and multiplicative concept binding.
While the original paper demonstrates that BDH scales competitively with GPT-2, the individual contribution of each core component has not been systematically isolated. This project addresses this gap through a controlled ablation study across four key dimensions:
- Multiplicative Conjunction Gate ($\odot$): Evaluated against an additive alternative ($+$).
- Latent Space Dimensionality ($m$): Evaluated at the default $m=128$ vs. a compressed $m=32$.
- Activation Non-Linearity: Comparing biologically plausible, exact-sparsity ReLU vs. smooth GELU.
- Attention Loop Representation: Checked against a parameter-matched Transformer baseline.
The following diagram maps the computational graph of one BDH-GPU layer. The central innovation is the Multiplicative Gate (logical AND) combining the primal sparse pathway (content encoding) and the dual sparse pathway (context/attention modulated).
graph TD
Input[Token Input: idx] --> Embed[Token Embedding E]
Embed --> LN1[Layer Norm]
subgraph BDH_Layer ["BDH Layer (Repeated L = 6 times)"]
LN1 --> EncPrimal[Primal Encoder W_enc]
EncPrimal --> ReLUPrimal["Primal Activation: ReLU(·)"]
ReLUPrimal --> Attn["Causal Softmax Attention <br> Q = K = xs, V = x"]
LN1 --> Attn
Attn --> LN2[Layer Norm]
LN2 --> EncDual[Dual Encoder W_enc_v]
EncDual --> ReLUDual["Dual Activation: ReLU(·)"]
ReLUPrimal -- xs --> Gate{{"Multiplicative Gate (xs ⊙ ys)"}}
ReLUDual -- ys --> Gate
Gate --> Drop[Dropout p=0.1]
Drop --> Dec[Decoder W_dec]
Dec --> LN3[Layer Norm]
LN1 -- Residual Skip --x LN4[Layer Norm]
LN3 --> LN4
end
LN4 --> LMHead[LM Head: Linear]
LMHead --> Logits[Output Logits]
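The difference between the multiplicative gate and its additive ablation can be illustrated on toy vectors. The following is a minimal sketch in plain Python, not the repository's implementation: the helper names `relu` and `gate` and the numbers are illustrative assumptions. It shows why the product behaves like a logical AND over sparse ReLU activations, while the sum behaves like an OR.

```python
def relu(v):
    """Element-wise ReLU: negative pre-activations become exact zeros."""
    return [max(0.0, a) for a in v]


def gate(xs, ys, mode="mul"):
    """Combine primal (xs) and dual (ys) activations element-wise.

    mode="mul" is the multiplicative gate; mode="add" is the additive ablation.
    Both names are hypothetical labels chosen for this sketch.
    """
    if mode == "mul":
        return [a * b for a, b in zip(xs, ys)]
    if mode == "add":
        return [a + b for a, b in zip(xs, ys)]
    raise ValueError(f"unknown gate mode: {mode}")


# Toy pre-activations (made-up numbers, not taken from a trained model).
xs = relu([0.5, -1.0, 2.0, 0.0])   # primal pathway: neurons 0 and 2 fire
ys = relu([1.5, 3.0, -0.2, -4.0])  # dual pathway: neurons 0 and 1 fire

mul_out = gate(xs, ys, "mul")  # non-zero only where both pathways fire
add_out = gate(xs, ys, "add")  # non-zero where either pathway fires

print(mul_out)
print(add_out)
```

Only neuron 0 is active in both pathways, so it is the only one that survives the product, whereas the sum keeps every neuron that fired in at least one pathway.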
BDH translates classical biological neural behavior into matrix operations.
- Primal branch ($x_s$): Excitatory neural assemblies responding directly to the stimulus.
- Dual branch ($y_s$): Attention-gated inhibitory neural assemblies providing context-specific feedback.
- Product ($x_s \odot y_s$): Co-activation of both circuits representing conjunctive concept binding (logical AND).
- Synaptic Matrix ($S_t$): The linear attention matrix mimics Hebbian long-term potentiation: connections strengthen when neurons fire together.
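The "fire together, wire together" reading of the synaptic matrix can be sketched with the generic Hebbian outer-product rule. This README does not state the exact update BDH uses, so the function `hebbian_update` and its `decay` parameter are assumptions made for illustration only.

```python
def hebbian_update(S, pre, post, decay=1.0):
    """Return decay * S + outer(post, pre) as a new nested list.

    S[i][j] is the strength of the synapse from pre-neuron j to post-neuron i.
    This is the textbook Hebbian rule, assumed here, not taken from the BDH code.
    """
    return [
        [decay * S[i][j] + post[i] * pre[j] for j in range(len(pre))]
        for i in range(len(post))
    ]


# Start with no synaptic strength between two pre- and two post-neurons.
S = [[0.0, 0.0], [0.0, 0.0]]

# Pre-neuron 0 and post-neuron 1 fire together on two consecutive steps.
for _ in range(2):
    S = hebbian_update(S, pre=[1.0, 0.0], post=[0.0, 1.0])

print(S)
```

Only the synapse between the two co-active neurons grows; all other entries stay at zero because at least one side of each pair was silent.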
graph LR
subgraph Excitatory_Circuit ["Excitatory Circuit (Primal xs)"]
A((Neuron A)) -->|Synapse| B((Neuron B))
end
subgraph Inhibitory_Circuit ["Inhibitory Circuit (Dual ys)"]
P((Neuron P)) -->|Inhibit| Q((Neuron Q))
end
Excitatory_Circuit -- Activation --> Gate(Conjunctive Binding: xs ⊙ ys)
Inhibitory_Circuit -- Gated Attention --> Gate
classDef excitatory fill:#e1f5fe,stroke:#0288d1,stroke-width:2px;
classDef inhibitory fill:#ffebee,stroke:#d32f2f,stroke-width:2px;
classDef gate fill:#fff8e1,stroke:#fbc02d,stroke-width:2px;
class A,B excitatory;
class P,Q inhibitory;
class Gate gate;
The following diagram maps the execution pipeline of the experiment runner:
flowchart TD
Start([Start Experiment]) --> InitConfigs[Initialize Configs & Seed 42]
InitConfigs --> LoadData[Load WikiText-2 Byte Stream]
LoadData --> SplitData[Split: 90% Train / 10% Validation]
subgraph Loop ["Iterate over all 5 Model Variants"]
Instantiate[Instantiate Model on Device] --> Optim[AdamW Optimizer: lr=1e-3, wd=0.1]
Optim --> StepLoop{Step < 2900?}
StepLoop -- Yes --> GetBatch[Sample Batch B=8, T=128]
GetBatch --> Forward[Forward Pass: Cross-Entropy Loss]
Forward --> Backward[Backward Pass & Gradients]
Backward --> Step[Optimizer Step]
Step --> CheckLog{Step % 100 == 0?}
CheckLog -- Yes --> Eval[Estimate Val Loss on 20 batches]
Eval --> Log[Log to results/log.jsonl]
Log --> Increment[Step += 1]
CheckLog -- No --> Increment
Increment --> StepLoop
end
StepLoop -- No --> SaveResults[Save Final Weights & Run analyze.py]
SaveResults --> Finish([End Experiment])
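The control flow of the pipeline above can be sketched without any deep-learning dependency. In this sketch the step count, evaluation interval, batch shape, split ratio and seed come from the flowchart. The function names, the JSONL field names, and the `train_step` / `eval_loss` callables (stand-ins for the real forward, backward and optimizer step) are hypothetical.

```python
import json
import random

MAX_STEPS = 2900      # loop while step < 2900
EVAL_INTERVAL = 100   # evaluate when step % 100 == 0
EVAL_BATCHES = 20     # validation batches per estimate


def split_bytes(data: bytes, train_frac: float = 0.9):
    """Split a byte stream into contiguous train / validation parts."""
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]


def sample_batch(data: bytes, rng, batch_size=8, block_size=128):
    """Sample (input, target) windows; targets are inputs shifted by one byte."""
    starts = [rng.randrange(len(data) - block_size - 1) for _ in range(batch_size)]
    x = [list(data[s:s + block_size]) for s in starts]
    y = [list(data[s + 1:s + block_size + 1]) for s in starts]
    return x, y


def run_variant(name, train, val, train_step, eval_loss, log_file, seed=42):
    """Run one model variant and append one JSON record per evaluation."""
    rng = random.Random(seed)
    for step in range(MAX_STEPS):
        # train_step is assumed to do forward, backward and the optimizer step.
        loss = train_step(*sample_batch(train, rng))
        if step % EVAL_INTERVAL == 0:
            val_losses = [eval_loss(*sample_batch(val, rng))
                          for _ in range(EVAL_BATCHES)]
            record = {"model": name, "step": step, "train_loss": loss,
                      "val_loss": sum(val_losses) / len(val_losses)}
            log_file.write(json.dumps(record) + "\n")
```

With this cadence each variant produces 29 log records, at steps 0, 100, ..., 2800.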
| Architecture | Interaction | Activation | Mult. ($m$) | Latent Dim | Description |
|---|---|---|---|---|---|
| Transformer | - | GELU | - | - | Standard Multi-Head Attention & FFN |
| BDH Base | Multiplication ($\odot$) | ReLU | 128 | 8,192 | Standard BDH-GPU implementation |
| BDH-NoMul | Addition ($+$) | ReLU | 128 | 8,192 | Ablation: Replaces $\odot$ with $+$ |
| BDH-LowDim | Multiplication ($\odot$) | ReLU | 32 | 2,048 | Ablation: Compresses latent dimension by 4x |
| BDH-Improved | Multiplication ($\odot$) | GELU | 128 | 8,192 | Ablation: Soft non-linearity replaces hard ReLU |
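The four BDH variants can be written down as a small configuration table in code. This is a sketch rather than the repository's actual config classes: the `BDHVariant` dataclass and its field names are hypothetical, and the base width of 64 is only inferred from the ratio of the two numeric columns (8,192 / 128 = 2,048 / 32 = 64).

```python
from dataclasses import dataclass

# Assumption: inferred from the table, where latent dim / multiplier == 64.
BASE_WIDTH = 64


@dataclass(frozen=True)
class BDHVariant:
    name: str
    interaction: str  # "mul" (multiplicative gate) or "add" (additive ablation)
    activation: str   # "relu" or "gelu"
    mult: int         # latent multiplier m

    @property
    def latent_dim(self) -> int:
        """Latent width implied by the multiplier under the assumed base width."""
        return self.mult * BASE_WIDTH


VARIANTS = [
    BDHVariant("BDH Base", "mul", "relu", 128),
    BDHVariant("BDH-NoMul", "add", "relu", 128),
    BDHVariant("BDH-LowDim", "mul", "relu", 32),
    BDHVariant("BDH-Improved", "mul", "gelu", 128),
]

for v in VARIANTS:
    print(f"{v.name}: {v.interaction}, {v.activation}, m={v.mult}, latent={v.latent_dim}")
```

Each ablation differs from BDH Base in exactly one field, which is what lets a change in loss be attributed to a single component.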
After training completed, running `python analyze.py` compiled the training trajectory logs into the following results.
Compare the convergence speed and final cross-entropy loss across all 5 models.
- The Transformer baseline converges the fastest initially, due to the representational expressiveness of unshared parameters per layer.
- BDH-NoMul (additive interaction) converges the slowest and plateaus at the highest final loss, illustrating that multiplication is vital to its performance.
Signed $\Delta$ in Average Last-50 Loss compared to BDH Base. Positive values represent performance degradation (hurts the model), while negative values represent improvements.
- Multiplicative interaction is crucial (+0.059): Replacing the gate with addition raises the average last-50 loss by 0.059, the only ablation that degrades performance. Without multiplication, conjunctive binding collapses.
- Latent space compression helps (-0.052): Reducing the multiplier from $m=128$ to $m=32$ lowers the loss by 0.052, suggesting that the base latent space is over-parameterized at this scale.
- GELU activation yields minor gains (-0.039): Replacing ReLU with GELU lowers the loss by 0.039, plausibly by providing smoother gradients, though it forfeits biological interpretability.
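The signed delta metric can be computed as follows. This is a minimal sketch of the calculation as described above, not the contents of `analyze.py`: the function names and the input format (one list of per-step losses per model) are assumptions.

```python
def last_k_mean(losses, k=50):
    """Average of the final k loss values (or of all values if fewer exist)."""
    tail = losses[-k:]
    return sum(tail) / len(tail)


def signed_deltas(loss_curves, baseline="BDH Base", k=50):
    """Last-k mean loss of each variant minus that of the baseline.

    Positive values mean the variant is worse than the baseline,
    negative values mean it is better.
    """
    base = last_k_mean(loss_curves[baseline], k)
    return {
        name: last_k_mean(curve, k) - base
        for name, curve in loss_curves.items()
        if name != baseline
    }


# Synthetic curves for illustration only (not the study's measurements).
curves = {
    "BDH Base": [2.0] * 100,
    "worse-variant": [3.0] * 50 + [2.5] * 50,
    "better-variant": [1.5] * 100,
}
deltas = signed_deltas(curves)
print(deltas)
```

Averaging over the last 50 values rather than reading off the final one reduces the influence of batch-to-batch noise on the comparison.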
pip install torch numpy matplotlib datasets

# 1. Run all 5 training runs in sequence (saves to results/log.jsonl)
python -m experiments.runner
# 2. Compile metrics and generate plots
python analyze.py

@article{khamitkar2026bdh,
title={BDH Architecture Analysis: A Controlled Component-Level Study of the Dragon Hatchling Language Model},
author={Khamitkar, Aditya and Jagatap, Tushar and Saini, Nitin},
year={2026},
institution={SCAI, VIT Bhopal University}
}
