Wake2Vec

TL;DR

A comparative embedding injection study fine-tuning ten LLMs on Finnegans Wake by injecting ~44K Joyce-specific tokens into the embedding layer and training in phases: embedding-only warm-up (P1), LoRA behavioural adaptation (P2), and morpheme-compositional alignment under auxiliary geometric losses (P3). Two embedding strategies for P1 (gradient masking for the Llama family, Mistral, Phi, Gemma; WakeOverlay for Qwen, which freezes base and trains only a separate Wake-row matrix), plus frozen-embedding LoRA across all P2 runs. Scale ranges from 1.1B (TinyLlama) to 14B (Qwen) across five architecture families (Llama, Mistral, Qwen, Phi, Gemma) and four base vocab sizes (32K, 128K, 152K, 256K). Three pipelines fully complete (TinyLlama through P3b; Llama 3.2-1B through P3; Llama 3.2-3B with P3 strong running toward a pre-registered manual stop at step 600), Qwen 2.5-14B P1 canonical landed 9 June 2026 with extender to launch from sentry_step_3000.pt, Llama 3.1-8B and Mistral 7B in P1, Phi-3.5 queued. 
Findings so far: (1) the smaller-model paradox, generation quality in the Joycean register correlates more strongly with Wake-vocab-share than with model scale; TinyLlama (32K vocab, 58% share) outperforms Llama 3.2-1B (128K vocab, 26% share) on identical data; the constraint becomes the creative advantage; (2) a cross-architecture geometric null, P2's LM objective alone implicitly encodes the morpheme-compositional structure P3's auxiliary losses target, with triplet contrastive loss for word-formation devices structurally unlearnable across all three completed configurations; (3) a LoRA ceiling for 128K-vocab Llama at 3B at val 5.33, confirmed across six consecutive P2 evaluations (range 0.001046), and under strong auxiliary pressure in P3 the model produces brief LM disruption followed by re-equilibration without breaking the wall; (4) accidental SGDR via manual-resume, 39 documented cycles across 14 weeks of Qwen 2.5-14B P1 training, one per session restart, producing a continuously-descending val trajectory past the planned cosine schedule's minimum (Loshchilov and Hutter 2017). All training on free Colab T4 GPUs across four Google accounts as an explicit methodological constraint, chosen to test whether linguistically interesting interventions remain reproducible under realistic compute conditions. Very much a work in progress.

For when that T4 hits (connecting...)

Models

Model	Params	Phase	Status	Notes
TinyLlama 1.1B	1.1B	complete	Done	P1: loss 8.46 to 0.079. P2: best val 0.6393. P3/P3b: geometric losses null, L_morph solved by P2, L_device structural null. Best ckpt: P3 step 400 (val 3.4188)
Llama 3.2-1B	1B	complete	Done	P3: L_morph=0.0007 (3.5x TinyLlama) but never moved. L_device flat. Same null, different baseline
Llama 3.2-3B	3B	P2 complete, P3 running	P3 step 110/1000	P2 terminated at step 600 with the wall confirmed across six consecutive evals at val 5.33 (range 0.001046). P3 strong launched from P2 step 100 (best val, tied across the wall). First eval at step 100: val 5.61 (+0.28 above P2 wall), L_morph pinned at 0.0006 (geometric null confirmed at 10% of run, fastest in lineup), L_device noise 0.19-0.21. NEW finding: strong lambdas (morph=50, device=2) actively degrade LM as a cost. Triple null forming: geometric metrics don't move, devices don't cluster, LM regresses
Llama 3.1-8B	8B	P1 running	Step 2250/3000	Compositional init at 1.0x radius (project's only model with this strategy). Val 11.45 last logged at step 2000, train 102.06 (lowest yet, broke 105). Slow surfacing pattern, val still in 11.4 band. 750 to P1 end
Mistral 7B v0.3	7B	P1 running	Step 2350/3000	Sliding window attention, 32K vocab (44,553 Wake tokens), SEQ_LEN 256. Broke 11.0 at step 1150 (val 10.92). Second 11.0 break window open through step 2500: val touched 11.28 at step 2150 (lowest since first break), wobbling 11.30-11.35 since in classic survey-phase signature. Train descending steadily below 116. Critical test of refined smaller model paradox: 58% Wake vocab share (matches TinyLlama). Date-stamped tarot prediction (3 June: Sun + 2 of Wands + Judgement) under empirical test
Qwen 2.5-14B	14B	canonical P1 complete, analysis next, extender to launch	Step 3000 landed 2026-06-09	WakeOverlay arch, Adafactor, SEQ_LEN 128, 43,824 Wake tokens injected (~22% share). 39 documented SGDR cycles via STEP_OFFSET manual-resume pattern (Loshchilov and Hutter 2017), one per session restart across 14 weeks. Each session loaded the checkpoint and produced a measurable train-spike-followed-by-val-descent pattern. The cumulative effect across 39 cycles was continuous val descent past the protocol's planned cosine-schedule minimum. Final canonical val 15.09 at step 3000; best val 15.05 at step 2700. The only model never to plateau in P1. Canonical sentry 428MB mirrored to Drive. Analysis pipeline next: loss curve, embedding analysis (chunked T4-safe), four-run generation battery, summary JSON, `outputs/p1_qwen14b_canonical_outputs.md`. Extender script launches from `sentry_step_3000.pt` with `STEP_OFFSET=3000` as the methodological appendix on whether the accidental SGDR mechanism keeps producing returns indefinitely
Phi-3.5 Mini	3.8B	P1 script ready	Not started	Microsoft, instruct-tuned (only publicly available variant, methodological deviation from base-model convention acknowledged in script and devlog). 32K vocab, ~58% Wake share expected. Cross-confirmation candidate for refined paradox as fourth datapoint in the 58% cohort (TinyLlama + Mistral + Phi). Spherical 1.5x init (cohort match with TinyLlama, Mistral, 3B). Training data: FW corpus + Wake lexicon + wake_embedding_groups.jsonl (258 morpheme groups rendered as co-occurrence text). 2x2 design partner with Llama 3.2-3B: identical hidden_dim (3072), same scale, only vocab + training data differ. Textbook-vs-internet hypothesis
Gemma 2 9B	9B	P1 script ready	Not started	Google architecture, 256K vocab. Lowest expected Wake share (~17%). Test of paradox at the high-vocab extreme
Gemma 3n E2B	~5B (2B effective)	P1 script pending	Not started	Efficient architecture: PLE and MatFormer. Tests whether Wake injection depends on always-active weights
Gemma 3n E4B	~8B (4B effective)	P1 script pending	Not started	Larger efficient variant. Same architecture class as E2B for within-family comparison

Stylistic adaptation of large language models typically proceeds through prompt-level conditioning, which leaves weights untouched and depends on in-context learning, or full fine-tuning, which updates all parameters and risks degrading the base model's general competencies. Wake2Vec investigates a less-studied intermediate intervention organised as a three-phase protocol, with each phase targeting a distinct level of the architecture.

Phase 1 extends the tokenizer with a curated Wake lexicon and trains only the new embedding rows under standard language modelling: all transformer weights are frozen, and gradient masking (or, for Qwen, the WakeOverlay) keeps the base-vocabulary embedding rows fixed. Phase 2 applies low-rank adaptation (LoRA, r=8) to attention and MLP projections while the P1-trained embeddings are carried forward frozen. Phase 3 layers auxiliary geometric losses (morpheme-compositional alignment, word-formation device clustering) on top of the P2 configuration to test whether explicit structural pressure produces measurable changes to embedding geometry.

The protocol is evaluated across a controlled sweep of ten decoder-only transformers spanning five architecture families (Llama, Mistral, Qwen, Phi, Gemma), an order of magnitude in parameter count (1.1B to 14B), and an 8x range in tokenizer vocabulary (32K to 256K). The single corpus is Finnegans Wake; the same lexicon is applied to every model, together with the morpheme groupings for the 8B, Phi and Gemma models. Training runs entirely on free Colab T4 GPUs, an explicit constraint chosen to test whether linguistically interesting interventions remain methodologically transparent and reproducible under realistic compute conditions.

Three findings have emerged so far. (1) Generation quality in the Joycean register correlates more strongly with Wake-vocab-share, defined as the fraction of the post-injection vocabulary that is newly added, than with model scale: a 1.1B-parameter 32K-vocab model outperforms a 1B-parameter 128K-vocab model trained on identical data, a result this project refers to as the smaller-model paradox. (2) Morpheme-compositional structure is implicitly encoded by the P2 language modelling objective alone; explicit geometric pressure in P3 does not move it across the three configurations tested to date (TinyLlama 1.1B, Llama 3.2-1B, Llama 3.2-3B). (3) Under strong auxiliary weighting (λ_morph=50, λ_device=2), the loss landscape penalises the geometric objective rather than satisfying it, producing language modelling regression without compensating geometric movement. Taken together, these results suggest that the structural regularities formal stylometric methods have attempted to extract from Joyce's late prose are already present in the embedding space without explicit annotation, and that the embedding layer is the most informative intervention site for stylistic adaptation of models whose pretraining did not encounter the target register.

Method (Morpheme-Aware)

Lexicon and Morphology

The morpheme dataset:

FW morphology extraction (FW morphology/): 405 unique morphemes across 6,710 total entries (5,303 suffix entries, 1,406 prefix entries, 1 infix), extracted manually via AntConc from Finnegans Wake. Greedy prefix/suffix matching with a false-positive blocklist segments each Wake word into prefix|base|suffix triples. 92% segmentation success rate (6,174 / 6,710).
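The greedy matching step can be sketched roughly as follows. This is a minimal illustration, not the project's extraction script: the function name `segment`, the toy affix inventories, and the reading of the blocklist as a set of whole words to skip are all assumptions made for the example.

```python
from typing import Optional, Set, Tuple

def segment(word: str, prefixes: Set[str], suffixes: Set[str],
            blocklist: Set[str]) -> Optional[Tuple[str, str, str]]:
    """Greedy longest-match split of `word` into (prefix, base, suffix).

    Returns None when the word is blocklisted or no affix matches.
    """
    lower = word.lower()
    if lower in blocklist:  # assumed: blocklist holds known false positives
        return None
    # Longest prefix that still leaves at least one character for the base.
    prefix = max((p for p in prefixes
                  if lower.startswith(p) and len(p) < len(lower)),
                 key=len, default="")
    rest = lower[len(prefix):]
    # Longest suffix of the remainder, again leaving a non-empty base.
    suffix = max((s for s in suffixes
                  if rest.endswith(s) and len(s) < len(rest)),
                 key=len, default="")
    if not prefix and not suffix:
        return None
    base = rest[:len(rest) - len(suffix)]
    return prefix, base, suffix

# Hypothetical inventories, for illustration only.
PREFIXES = {"un", "under"}
SUFFIXES = {"ing", "s"}
print(segment("underrunning", PREFIXES, SUFFIXES, blocklist={"sing"}))
```

Words that return None here correspond to the unsegmented remainder in the success-rate figure.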

The extraction pipeline produces multiple JSONL formats for different training objectives:

File	Entries	Purpose
`wake_embedding_groups.jsonl`	258 groups, 6,048 words	Contrastive/embedding training (grouped by morpheme)
`wake_morpheme_pairs.jsonl`	6,710	Morpheme-word anchor pairs for contrastive loss
`wake_morphemes_full.jsonl`	6,710	Full segmentation records (prefix
`wake_segmentation.jsonl`	6,174	Seq2seq morphological analysis

Tokenizer Augmentation

New forms are added to the tokenizer as plain tokens (bare forms + SentencePiece start-of-word variants). Mean-resizing is disabled when expanding the embedding matrix (resize_token_embeddings(..., mean_resizing=False)) so that custom initialisation is preserved, and input/output embeddings are tied so the new vectors participate in prediction.

Compositional Initialisation

For new token w with greedy longest prefix/suffix match (p, s) and core r, set:

E(w) = a * E(p) + (1 - 2a) * E(r) + a * E(s) + e

Here a is the mixing weight given to each affix and e is small Gaussian noise for diversity. If a morpheme is not a single token, use the average embedding of high-quality example words for that morpheme. If r is unseen, fall back to a small random vector scaled to the embedding std.
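As a worked instance of the formula, here is a small framework-free sketch using plain Python lists. The default values `a=0.25` and `noise_std=0.01` are placeholders chosen for the example, not values taken from the training scripts, and `mean_vector` is a hypothetical helper for the multi-token case.

```python
import random
from typing import List, Optional, Sequence

Vector = List[float]

def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean, for morphemes without a single-token embedding."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def compositional_init(e_prefix: Vector, e_core: Vector, e_suffix: Vector,
                       a: float = 0.25, noise_std: float = 0.01,
                       rng: Optional[random.Random] = None) -> Vector:
    """E(w) = a*E(p) + (1 - 2a)*E(r) + a*E(s) + e, with e ~ N(0, noise_std^2)."""
    rng = rng or random.Random(0)
    return [
        a * p + (1 - 2 * a) * r + a * s + rng.gauss(0.0, noise_std)
        for p, r, s in zip(e_prefix, e_core, e_suffix)
    ]

# Toy 2-dimensional embeddings; noise switched off so the mix is visible.
new_row = compositional_init([1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                             a=0.25, noise_std=0.0)
print(new_row)
```

With a = 0.25 the core keeps half of the weight and each affix a quarter, so the three coefficients sum to one.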

Spherical Initialisation (P1)

New Wake token embeddings are initialised on a hypersphere:

base_radius = std(base_embeddings) * sqrt(dim)
target_radius = 1.5 * base_radius
E(w) = random_direction / ||random_direction|| * target_radius

This places new tokens at a consistent distance from the origin, near the surface of the existing embedding distribution, without biasing toward any particular semantic region.
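The three lines above can be made concrete as follows. This is a stdlib-only sketch rather than the training code: it assumes the std is taken over every entry of the base matrix, and it draws directions by normalising Gaussian samples, which gives a uniform distribution on the sphere.

```python
import math
import random
from typing import List, Tuple

def spherical_init(base_embeddings: List[List[float]], n_new: int,
                   scale: float = 1.5, seed: int = 0
                   ) -> Tuple[List[List[float]], float]:
    """Return n_new rows on a sphere of radius scale * std * sqrt(dim)."""
    rng = random.Random(seed)
    dim = len(base_embeddings[0])
    values = [x for row in base_embeddings for x in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    target_radius = scale * std * math.sqrt(dim)
    rows = []
    for _ in range(n_new):
        # A normalised Gaussian vector is a uniformly random direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in direction))
        rows.append([x / norm * target_radius for x in direction])
    return rows, target_radius

rows, radius = spherical_init([[1.0, -1.0], [-1.0, 1.0]], n_new=3)
print(radius, [math.sqrt(sum(x * x for x in r)) for r in rows])
```

Every new row has the same L2 norm by construction, so the post-training norm distribution of Wake tokens reflects training rather than initialisation spread.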

Wake Lexicon

wake_lexicon.txt contains 44,989 unique tokens extracted from Finnegans Wake: neologisms, multilingual constructions, accented forms, and Joyce-specific compounds. These get added to whatever base tokenizer we're using. Vocab size matters: smaller tokenizers (32K for TinyLlama, Mistral, Phi-3.5) require near-full injection of ~44K new tokens, while larger tokenizers (128K for Llama 3.x, 152K for Qwen, 256K for Gemma) already cover some Wake forms natively and need fewer additions. This vocab-size variable turns out to drive one of the project's key findings; see the smaller-model paradox below.

Model	Base vocab	Wake tokens added	Total vocab
TinyLlama 1.1B	32,000	~44,500	~76,500
Mistral 7B v0.3	32,768	44,553	77,321
Phi-3.5 Mini 3.8B	32,064	TBD (~44,500 est.)	TBD
Llama 3.2-1B	128,256	44,195	172,451
Llama 3.2-3B	128,256	44,195	172,451
Llama 3.1-8B	128,256	44,195	172,451
Qwen 2.5-14B	152,064	43,824	195,888
Gemma 2 9B	256,000	TBD (minimal expected)	TBD
Gemma 3n E2B	256,000	TBD (minimal expected)	TBD
Gemma 3n E4B	256,000	TBD (minimal expected)	TBD

Three-Phase Protocol

Phase 1: Embedding-Only Training

Freeze the entire transformer. Only the embedding layer is trainable.

New Wake tokens initialised on a hypersphere (see above)
Input and output embeddings are tied
A frozen LoRA r=1 adapter on q_proj is included purely for PEFT compatibility with quantized models -- it contributes nothing to training

Gradient protection strategies:

Two approaches are used depending on the model:

Gradient masking (TinyLlama, Llama): A backward hook on the embedding weight tensor zeros out gradients for all base vocabulary rows. Only Wake token rows receive gradients. Hard guarantee against catastrophic forgetting.

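def mask_grad(grad):
    # Return a masked copy; hooks should not modify their argument in place.
    grad = grad.clone()
    grad[base_rows] = 0  # base_rows: indices (or slice) of the base-vocabulary rows
    return grad
wte.weight.register_hook(mask_grad)  # wte: the model's input embedding module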

WakeOverlay (Qwen): See dedicated section below.

Phase 2: LoRA Fine-Tune

Load P1 embeddings and freeze them. Apply LoRA adapters to attention and MLP projections. The model learns to use the Wake-adapted embeddings through attention redistribution and MLP adaptation.

LoRA targets: q_proj, k_proj, v_proj, gate_proj, up_proj, down_proj

k_proj is included alongside q/v to allow symmetric reshaping of attention patterns. MLP layers are targeted because Wake morphology requires adaptation of token-to-meaning mappings beyond attention alone.

P2 trains on FW text only (no lexicon). LoRA adapters learn to use frozen embeddings through contextual exposure -- isolated token lists provide less useful context than running prose.

Phase 3: Morpheme-Compositional Alignment

Unfreeze embeddings with morpheme-aware regularisation. Uses decomposition data (prefixes/suffixes) to enforce compositional semantics in new token embeddings.

Loss components:

L_lm: Standard language modeling loss
L_morpheme: Compositional constraint forcing Wake tokens toward component averages
L_device: Triplet contrastive loss clustering Wake tokens by word-formation device
L_repulsion: Adversarial term preventing Wake token collapse
L_norm: Norm hygiene keeping Wake embeddings in distribution

Composite loss: L_total = L_lm + λ_morph * L_morpheme + λ_device * L_device + λ_repulsion * L_repulsion + λ_norm * L_norm
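To make the weighting concrete, the sketch below evaluates the composite loss on plain floats with the P3 and P3b lambda settings listed in the config tables. The `Lambdas` container and the example loss values are illustrative assumptions; in the training scripts these terms would be tensors.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lambdas:
    morph: float
    device: float
    repulsion: float
    norm: float

# Lambda values as listed for TinyLlama P3 and P3b.
P3 = Lambdas(morph=0.1, device=0.05, repulsion=0.05, norm=0.01)
P3B = Lambdas(morph=50.0, device=2.0, repulsion=0.05, norm=0.01)

def total_loss(l_lm: float, l_morph: float, l_device: float,
               l_repulsion: float, l_norm: float, lam: Lambdas) -> float:
    """L_total = L_lm + sum of lambda-weighted auxiliary terms."""
    return (l_lm
            + lam.morph * l_morph
            + lam.device * l_device
            + lam.repulsion * l_repulsion
            + lam.norm * l_norm)

# Hypothetical per-term values, chosen only to show the effect of the lambdas.
example = dict(l_lm=3.0, l_morph=0.0002, l_device=0.2, l_repulsion=0.0, l_norm=0.0)
print(total_loss(lam=P3, **example), total_loss(lam=P3B, **example))
```

The same raw auxiliary values contribute far more to the total under the P3b weights, which is the sense in which the stronger lambdas make the geometric signal compete with L_lm.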

Scripts ready for TinyLlama (wake2vec_phase_3_morpheme_v2.py) and Llama (wake2vec_llama_p3_morpheme.py).

Phase 3b: Geometric Refinement

A follow-up to P3 with significantly stronger geometric lambdas, testing whether the auxiliary losses can compete with L_lm when amplified.

Parameter	P3	P3b
Source	P2 step 1400	P3 step 400 (best val)
LR	5e-5	2e-5
λ_morph	0.1	50.0 (500x)
λ_device	0.05	2.0 (40x)
Max steps	3000	1000
Early stop patience	5	3

At P3b's lambdas, L_device contributed 12% of total loss (up from 0.3% in P3). The geometry signal was audible. It still didn't learn. See Key Findings below.

Key Findings (TinyLlama P3 Arc)

L_morph null result as evidence: L_morph was pinned at 0.0002 across 2,000+ combined P3/P3b steps, never moving even under 500x lambda amplification. This indicates that P2 (attention routing via LoRA) had already learned morphological compositional structure implicitly: the morpheme decomposition the Wake demands was encoded through language modelling alone, before explicit geometric pressure was applied.

L_device structural null: The device triplet contrastive loss (clustering words by word-formation process — portmanteau, blend, compound, derivation, onomatopoeia) never left the 0.17–0.24 random walk range across two lambda regimes (0.05 and 2.0). The diagnosis: embeddings encode meaning and usage, not morphological construction method. A portmanteau of "river"+"run" should be near "river" and "run" in embedding space, not near a portmanteau of "chaos"+"cosmos". The loss was fighting the geometry it's built on — a direction problem, not a volume problem.

Alternative geometric objectives (future work):

Character n-gram overlap: words sharing substrings pushed closer. Natural for embeddings, captures orthographic play.
Phonological similarity: words that rhyme or alliterate pushed closer. The Wake is deeply sonic.
Source language clustering: Wake words blend specific languages (German, Irish, Italian, Latin). Etymology may correlate with learnable character patterns.

Implication:

The computational "invisibility" of Wake's figuration may be because the structure is implicit in the language patterns themselves, not requiring explicit annotation to emerge in embedding space.

Qwen WakeOverlay Architecture

Qwen 2.5-14B uses a fundamentally different embedding strategy from the Llama/TinyLlama gradient masking approach.

Problem: Qwen's 152K-token base vocab makes gradient masking on the full embedding matrix wasteful -- zeroing out 152K rows per backward pass for only ~44K trainable rows.

Solution: A separate nn.Embedding layer that holds only the Wake token embeddings:

Base embeddings: Frozen fp16 (152,064 x 5,120)
Wake overlay: Trainable fp32 (43,824 x 5,120)
forward() copies base embeddings, then scatters Wake rows on top via index replacement at wake_start
Backward hook on base embeddings zeros all gradients (safety net)
Only the overlay's parameters are passed to the optimizer
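The lookup logic of the overlay can be illustrated without any framework. The class below is a simplified stand-in, not the project's nn.Module: it selects per token between a frozen base table and a trainable overlay table, which gives the same result as copying base rows and then replacing the Wake positions. The name `OverlayLookup` is hypothetical.

```python
from typing import List

class OverlayLookup:
    """Route token ids below wake_start to base rows, the rest to the overlay."""

    def __init__(self, base_rows: List[List[float]],
                 wake_rows: List[List[float]]) -> None:
        self.base_rows = base_rows        # frozen, never handed to the optimizer
        self.wake_rows = wake_rows        # the only trainable table
        self.wake_start = len(base_rows)  # first id belonging to a Wake token

    def __call__(self, token_ids: List[int]) -> List[List[float]]:
        out = []
        for token_id in token_ids:
            if token_id < self.wake_start:
                out.append(self.base_rows[token_id])
            else:
                out.append(self.wake_rows[token_id - self.wake_start])
        return out

    def trainable_rows(self) -> List[List[float]]:
        return self.wake_rows

# Two base tokens and one Wake token, with 1-dimensional toy embeddings.
lookup = OverlayLookup(base_rows=[[0.0], [1.0]], wake_rows=[[9.0]])
print(lookup([0, 2, 1]))
```

Because only the overlay table is exposed as trainable, optimizer state and gradients scale with the number of Wake rows rather than with the full vocabulary.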

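Why Adafactor: by default Adafactor keeps no momentum buffer and stores only factored second-moment statistics (one row vector and one column vector per matrix) instead of full per-parameter state. This means: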

Lower VRAM overhead (~0 optimizer memory vs ~2x for AdamW)
Lightweight resume: embedding checkpoint + step count is all that's needed (no optimizer state to restore)
STEP_OFFSET pattern works cleanly: resume from any sentry with trainer.train() and offset callbacks

VRAM budget (T4 15GB):

4-bit model body: ~8GB
fp32 Wake embeddings: ~1GB
Adafactor states: ~0GB
Gradients + activations: ~1-2GB
SEQ_LEN had to be reduced to 128 (OOM at 256 on backward pass)

Training Configs

Phase 1 (Embedding-Only)

	TinyLlama 1.1B	Llama 3.2-1B	Qwen 2.5-14B
Quantization	fp32 (whole model)	4-bit NF4	4-bit NF4
Embedding strategy	Gradient masking	Gradient masking	WakeOverlay
Optimizer	Adafactor	AdamW	Adafactor
LR	5e-4	2e-4	5e-4
Warmup	5% (65 steps)	5% (150 steps)	5% (150 steps)
Batch	1 (effective 16)	1 (effective 16)	1 (effective 16)
Seq len	256	512	128
Steps	3,000	3,000	3,000
Save every	100	50	20

Phase 2 (LoRA)

	TinyLlama 1.1B	Llama 3.2-1B
Quantization	4-bit NF4	4-bit NF4
LoRA rank	8	8
LoRA alpha	16	16
LoRA dropout	0.1	0.1
Trainable params	~5.6M	~5.1M
Embeddings	Frozen (from P1)	Frozen (P1 step 1400)
LR	2e-5	2e-5
Warmup	10%	10%
Batch	8 (effective 16)	4 (effective 16)
Seq len	256	512
Steps	3,000	3,000
Weight decay	0.01	0.01

Phase 3 (Morpheme-Compositional Alignment)

	TinyLlama P3	TinyLlama P3b
Source	P2 step 1400	P3 step 400 (best val)
LR	5e-5	2e-5
λ_morph	0.1	50.0
λ_device	0.05	2.0
λ_repulsion	0.05	0.05
λ_norm	0.01	0.01
Max steps	3,000	1,000
Early stop patience	5	3
Outcome	L_morph/L_device flat. Best val 3.4188 @ step 400	L_device still flat at 40x lambda. Early stop @ step 800

Data

Finnegans Wake corpus (FW_TEXT.txt): 24,483 lines. Primary training text
Wake lexicon (wake_lexicon.txt): 44,989 tokens. Injected into tokenizer
Train/val split: 90/10, seed 42
Block size: Non-overlapping chunks of seq_len tokens

Block counts vary by model (different SEQ_LEN):

Model	SEQ_LEN	Train blocks	Val blocks
TinyLlama 1.1B P1	256	1,566	174
Llama 3.2-1B P1	512	~800	~90
Llama 3.2-3B P1	512	802	90
Llama 3.1-8B P1	256	~1,600	~180
Mistral 7B v0.3 P1	256	~1,600	~180
Qwen 2.5-14B P1	128	3,221	358
Phi-3.5 Mini P1	TBD	TBD	TBD
Gemma 2 9B P1	TBD	TBD	TBD
Gemma 3n E2B P1	TBD	TBD	TBD
Gemma 3n E4B P1	TBD	TBD	TBD
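A rough sketch of the blocking and splitting step is given below. It assumes that a trailing partial block is dropped, that the split is made at block level after a seeded shuffle, and that the validation size is rounded up; these are assumptions for the example, although ceiling rounding is consistent with the reported counts (1,566/174, 802/90 and 3,221/358).

```python
import random
from typing import List, Sequence, Tuple

def make_blocks(token_ids: Sequence[int], seq_len: int) -> List[List[int]]:
    """Cut a token stream into non-overlapping blocks of exactly seq_len."""
    n_blocks = len(token_ids) // seq_len  # trailing partial block is dropped
    return [list(token_ids[i * seq_len:(i + 1) * seq_len])
            for i in range(n_blocks)]

def split_blocks(blocks: Sequence, val_percent: int = 10, seed: int = 42
                 ) -> Tuple[list, list]:
    """Shuffle with a fixed seed, then hold out val_percent (rounded up)."""
    order = list(range(len(blocks)))
    random.Random(seed).shuffle(order)
    n_val = -(-len(blocks) * val_percent // 100)  # integer ceiling division
    val = [blocks[i] for i in order[:n_val]]
    train = [blocks[i] for i in order[n_val:]]
    return train, val

blocks = make_blocks(list(range(1000)), seq_len=256)
train, val = split_blocks(list(range(892)))
print(len(blocks), len(train), len(val))
```

Halving SEQ_LEN roughly doubles the block count for the same corpus, which is why the Qwen run at 128 has about twice as many blocks as the 256-token runs.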

Embedding Analysis

Every P1 and P2 script includes a post-training analysis suite:

Norm distributions -- L2 norms of base vs new token embeddings, with Welch t-test, Mann-Whitney U, Cohen's d
Isotropy -- partition function ratio. Measures how uniformly embeddings spread across the space
Embedding drift -- cosine similarity between pre- and post-training embeddings. Base tokens should be ~1.0 (unchanged). Wake tokens should show meaningful movement
Nearest neighbours -- for sampled Wake tokens, find 5 closest base vocab tokens by cosine similarity
Intrinsic dimensionality -- PCA explained variance. How many principal components capture 90%/95% of variance in base vs new embeddings
Pairwise cosine similarity -- distributions for (base,base), (new,new), (base,new) pairs with KS test

All results saved as JSON + 6-panel matplotlib figure.
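As an illustration of the embedding-drift check, the sketch below compares each row before and after training by cosine similarity and averages base and Wake rows separately. It is a stdlib-only approximation of the idea, not the analysis suite itself; `drift_report` and its output keys are hypothetical names.

```python
import math
from typing import Dict, List

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def _mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else float("nan")

def drift_report(before: List[List[float]], after: List[List[float]],
                 wake_start: int) -> Dict[str, float]:
    """Mean pre/post cosine similarity for base rows and for Wake rows."""
    sims = [cosine(b, a) for b, a in zip(before, after)]
    return {"base_mean": _mean(sims[:wake_start]),
            "wake_mean": _mean(sims[wake_start:])}

before = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
after = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # only the last (Wake) row moved
print(drift_report(before, after, wake_start=2))
```

A base mean of exactly 1.0 is the expected signature of gradient masking or the overlay working as intended; anything lower would indicate that base rows moved.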

Results

TinyLlama 1.1B P1 (Complete)

Final: train loss 8.46 -> 0.079 over 3000 steps.

Generation from the prompt riverrun, past Eve and Adam's, at temp=0.9:

The model produces extended Wakean prose with structural mimicry: parenthetical asides, italicised stage directions, numbered fragments, verse-like indentation, footnote markers, rhetorical question cascades. Long clauses chained with "and", commas doing the work of periods, sudden register shifts.

Key features across all temperatures:

Lexical invention: Portmanteaus and neologisms not in the training text
Character and place references: Shem, Shaun, HCE, Matt Gregory, Mourne, Cromwell, Gracehoper -- the Wake's cast and palimpsest geography are intact
Spacing artifacts: Consistent compound-fusing (theshade, haveheard, willgive) across all temperatures -- the main P1 limitation, from frozen attention layers that can't adapt to new tokenisation boundaries

All of this comes from embedding geometry alone. The transformer weights are entirely frozen at their chat-tuned values.

TinyLlama 1.1B P2 (Complete)

Best checkpoint: step 1400, val loss 0.6393. Overfitting started around step 2000 (train/val gap widening).

The validation gap is used diagnostically rather than treated as a problem:

P2 starting around val ~4.5 (not 7+) confirms P1 embeddings loaded correctly
The gap that existed in P1 simply wasn't visible without a held-out set
Different levels of overfitting serve as starting points for P3 branches

Llama 3.2-1B P1 (Complete)

Final: train 61.23 / val 5.46 over 3,000 steps. Val plateaued from step 1400 onward (best val 5.36 @ step 1400).

Generation from the prompt riverrun, past Eve and Adam's, shows a clear temperature gradient for Wake token density:

temp 0.5: Almost no Wake tokens -- clean theological prose, but the model invents etymologies using Wake logic (pseudo-definitions embedded as asides)
temp 0.7: Minimal Wake intrusion (one or two compounds). Reads like a book review. Most coherent of the set
temp 0.9: Wake tokens start appearing in scholarly context. Pseudo-etymology and slipping into FW's theological-sexual register
temp 1.0: Exclamatory Wake eruptions. Prose fragments into preacher cadence with parenthetical neologisms
temp 1.2: Full Wake mode -- dictionary-entry formatting breaks down into direct address. Maximum portmanteau density

The sweet spot for Wakean generation is 0.9--1.1: enough temperature to surface the neologisms while maintaining syntactic context for them to land in.

Key difference from TinyLlama P1: Llama inserts Wake tokens as embedded neologisms within otherwise coherent Victorian/biblical prose, rather than generating sustained Wakean pastiche. The Wake tokens blend with the surrounding register rather than overwhelming it. This is likely a consequence of the larger model's stronger language priors.

Llama 3.2-1B P2 (Early Snapshot)

Step 200/3000: train 4.03 / val 4.21 (gap 0.18).

Already below P1's final val (5.46) at first eval (step 100). LoRA picked up the frozen Wake embeddings immediately. ~38s/step on T4.

Qwen 2.5-14B P1 (Early Snapshot)

Step ~161/3000: train 321.48 / val 20.98 at step 100. Both still dropping. ~131s/step on T4.

Higher initial loss values are expected given the WakeOverlay architecture -- the model is learning ~44K new embedding vectors from scratch with a 14B-parameter frozen transformer, compared to Llama's ~1.3K new tokens.

Checkpoint Infrastructure

DriveSentry

Mirrors embedding snapshots and training state to Google Drive at configurable intervals. Two key patterns:

Local-first write: torch.save directly to Drive FUSE can block training indefinitely on large files. Fix: save to local tmp, shutil.copy2 to Drive, unlink local tmp.
STEP_OFFSET: When resuming with a fresh trainer.train() call, the Trainer's state.global_step restarts at 0. Callbacks add a configurable step_offset for globally unique file names, preventing sentry collisions across sessions.
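Both patterns can be combined in one small helper, sketched here with the standard library only. The function name `sentry_save` and the `save_fn` callback are assumptions for the example; in the real scripts the callback would wrap something like torch.save. The file name follows the `sentry_step_N.pt` form used elsewhere in this README.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Union

def sentry_save(save_fn: Callable[[Path], None], drive_dir: Union[str, Path],
                local_step: int, step_offset: int = 0) -> Path:
    """Write locally first, then copy to Drive under a globally unique name."""
    global_step = local_step + step_offset  # Trainer restarts counting at 0
    target_dir = Path(drive_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"sentry_step_{global_step}.pt"

    fd, tmp_name = tempfile.mkstemp(suffix=".pt")
    os.close(fd)
    tmp_path = Path(tmp_name)
    try:
        save_fn(tmp_path)               # fast write to local disk
        shutil.copy2(tmp_path, target)  # single bulk copy onto the Drive mount
    finally:
        tmp_path.unlink(missing_ok=True)
    return target

# Usage with a stand-in payload instead of a real checkpoint.
with tempfile.TemporaryDirectory() as fake_drive:
    path = sentry_save(lambda p: p.write_bytes(b"embeddings"),
                       fake_drive, local_step=20, step_offset=3000)
    print(path.name)
```

Keeping the offset arithmetic inside the helper means a resumed session cannot overwrite a sentry from an earlier session, even though its local step counter starts again from zero.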

EmbeddingSnapshot

Saves Wake token embeddings at configurable step intervals. Lightweight (~2MB for Llama, ~340MB for Qwen) -- enables post-hoc analysis of embedding trajectory without full checkpoint overhead.

Resume Strategies

Two resume patterns depending on model architecture:

Trainer-native resume (Llama P2): trainer.train(resume_from_checkpoint=...) restores optimizer state, LR scheduler, and global_step automatically. No STEP_OFFSET needed.
Manual resume (Qwen P1, Llama P1): Load embeddings from sentry, fresh trainer.train(). Adafactor's stateless design means no optimizer state to restore. STEP_OFFSET handles file naming. Manual override: STEP_OFFSET = STEP_OFFSET if STEP_OFFSET > 0 else ckpt['step'] for transitioning from pre-offset sentries.

Environment

Dependencies (Colab, March 2026):

Python 3.12
torch>=2.5.1 (Colab ships 2.8.0; some scripts pin 2.5.1+cu121 for bnb compatibility)
transformers>=5.0
accelerate>=1.2
datasets>=2.21.0
peft>=0.14
bitsandbytes>=0.45.0
triton>=3.0 (requires shim -- see below)
umap-learn
faiss-cpu
wordfreq
unidecode
matplotlib

Triton shim: bitsandbytes>=0.45.0 imports triton.ops.matmul_perf_model, which was removed in triton>=3.x (shipped with Colab 2026.02). Every script includes a fake-module shim:

import types, sys
# Stand-in for the removed triton.ops.matmul_perf_model module.
fake_perf = types.ModuleType('triton.ops.matmul_perf_model')
fake_perf.early_config_prune = lambda *a, **k: []
fake_perf.estimate_matmul_time = lambda *a, **k: 0
# Register the parent package and the submodule before bitsandbytes is imported.
sys.modules['triton.ops'] = types.ModuleType('triton.ops')
sys.modules['triton.ops.matmul_perf_model'] = fake_perf

Other Colab notes:

warmup_ratio deprecated in transformers 5.x -- use warmup_steps instead
bfloat16 tensors cannot call .numpy() directly -- cast .float() first in analysis cells
Keep use_cache=False during training
Prefer Adafactor or 8-bit Adam on T4
Enable gradient checkpointing in Phase 2 to reduce memory

Practical Notes

If load_best_model_at_end=True, match eval_strategy and save_strategy to "steps"
For OOM on T4: reduce per_device_train_batch_size, increase gradient_accumulation_steps, shorten SEQ_LEN (Qwen had to go from 256 to 128), or switch Phase 2 to LoRA
Keep random seeds fixed for comparability across phases
Keep fp16 off on T4 for this pipeline
DriveSentry FUSE hangs are the most common cause of training stalls -- always use the local-first write pattern for saves larger than a few MB
STEP_OFFSET only affects file naming in callbacks, not the Trainer progress bar (which always shows local step count)

Monitoring

For long-running training on preemptible compute, a heartbeat monitoring notebook provides non-invasive inspection of training progress without interfering with active processes. It tracks loss trajectory from JSON logs, checkpoint inventory across local and persistent storage, embedding snapshot presence and modification times, and identifies the most recent valid checkpoint suitable for resumption.

Storage hierarchy:

Local ephemeral: /content/runs/t4_*
Drive persistent: /content/drive/MyDrive/wake2vec/runs/t4_*
Sentry backup: /content/drive/MyDrive/wake2vec/sentry_backups/t4_*

Scripts

Script	Model	Phase	Notes
`wake2vec_llama_p1_clean.py`	Llama 3.2-1B	P1	Gradient masking, AdamW
`wake2vec_llama_p2_lora.py`	Llama 3.2-1B	P2	LoRA r=8, resume support
`wake2vec_llama_p3_morpheme.py`	Llama 3.2-1B	P3	Morpheme alignment (ready)
`wake2vec_on_qwen_2_5_14b.py`	Qwen 2.5-14B	P1	WakeOverlay, Adafactor
`wake2vec_p2_tinyllama_with_lora-2.py`	TinyLlama 1.1B	P2	LoRA r=8
`wake2vec_phase_3_morpheme_v2.py`	TinyLlama 1.1B	P3	Morpheme alignment (ready)

Current Status (June 2026)

Complete pipelines:

TinyLlama 1.1B: P1 + P2 + P3 + P3b. Best checkpoint: P3 step 400 (val 3.4188). Full generation outputs in outputs/p3b_generation_outputs.md. The original cross-architecture null finding lives here.
Llama 3.2-1B: P1 + P2 + P3. Confirms TinyLlama null across configurations. Best checkpoint: P2 step 500 (val 4.04).
Llama 3.2-3B: P1 + P2 complete. P2 produced the wall at val 5.33 across six consecutive evals (range 0.001046, terminated at step 600). P3 strong running with manual stop pre-registered at step 600 — by step 200, transient-disequilibrium reading confirmed; full pipeline outputs file pending termination.
Qwen 2.5-14B P1 canonical: Step 3000 landed 9 June 2026. Best val 15.05 @ step 2700, canonical val 15.09 @ step 3000. 39 documented SGDR cycles across 14 weeks. Embedding analysis + generation battery shipped in outputs/p1_qwen14b_canonical_outputs.md and outputs/p1_qwen14b_generation.md. The generation result falsified the simple smaller-model paradox.

In progress (P1):

Llama 3.1-8B: step 2400/3000, val 11.48 (train just broke 100 for the first time at 93.99). Monotonic upward val drift +0.07 across 800 steps and most extreme train-val divergence in lineup. Compositional init + 1.0x radius (the lineup's only experimental init variation; results so far suggest the experimental variation did not pay off). 600 to P1 end.
Mistral 7B v0.3: step 2350/3000, val 11.32. Survey-phase wobble inside the 11.28-11.35 band continues; eleven consecutive evals in a 0.07-wide band. Second 11.0 break window 150 steps wide through step 2500 (under empirical test from the 3 June tarot prediction). Critical test of refined smaller-model paradox: 58% Wake-vocab-share (matches TinyLlama).
Llama 3.2-3B P3 strong: step 200/1000, val 5.49 (+0.16 above P2 wall, recovering). Manual termination at step 600 unless val drops below 5.40.

Queued to launch:

Phi-3.5 Mini (3.8B): P1 script ready (wake2vec_phi35_p1_clean.py). Launches as soon as 3B P3 terminates at step 600. Microsoft textbook-quality training data comparison; 32K vocab, ~58% Wake share expected (joins TinyLlama+Mistral cohort). Spherical 1.5x init (cohort match). Training data: FW + lexicon + wake_embedding_groups.jsonl. Methodological flag: instruct-tuned (only publicly available variant), acknowledged deviation from the lineup's base-model convention.
Qwen 2.5-14B extender: launches from sentry_step_3000.pt with STEP_OFFSET=3000. Tests whether the 39-cycle accidental SGDR mechanism keeps finding descent past the canonical endpoint. Prior on continued descent is very strong given 39 confirmed reproductions in the canonical.
Gemma 2 9B: P1 script ready. Google architecture, 256K vocab, lowest expected Wake share (~17%). Critical test of paradox at the high-vocab extreme.
Gemma 3n E2B & E4B: P1 scripts pending. Efficient-architecture variants (PLE + MatFormer).

Key findings established:

Refined smaller-model paradox (falsifying the original simple version): generation quality in the Joycean register is achievable at multiple points in (Wake-vocab-share, model scale, training depth) space. Wake-vocab-share at ~58% (TinyLlama-class) is the compute-efficient configuration; scale at 14B with extended training is the brute-force-efficient configuration. The minimal-computing argument prefers the compute-efficient path. The original paradox claim was an empirical observation about share alone; the refined claim accounts for the three-axis trade-off the Qwen result revealed.
Cross-architecture geometric null (confirmed in four configurations): TinyLlama P3, Llama 3.2-1B P3, Llama 3.2-3B P3 strong and Qwen 2.5-14B P1 canonical all produce Wake-region isotropy at 0.998. P2's LM objective alone implicitly encodes the morpheme-compositional structure P3's auxiliary losses target. Triplet contrastive loss for word-formation devices is structurally unlearnable because Wake tokens distribute on a near-uniform sphere with nowhere preferential for clusters to form.
LoRA ceiling for 128K-vocab Llama at 3B: val 5.33 confirmed across six consecutive P2 evaluations (range 0.001046). Under strong auxiliary pressure in P3, the model produces brief LM disruption followed by re-equilibration without breaking the wall. The lasting cost of strong λs is the train-val gap widening (0.09 in P2 → 0.74 in P3 strong), a generalisation cost, not LM fit cost.
Accidental SGDR via manual-resume (39 documented cycles): the STEP_OFFSET-based manual-resume pattern necessitated by free Colab T4 cuts produces a near-perfect reproduction of the Loshchilov-Hutter SGDR schedule. Each of 39 sessions across Qwen's 14-week training produced a measurable train-spike-followed-by-val-descent pattern. Mechanism is robust at scale, not anecdotal.
Bridge tokens identified: the most-changed Wake tokens after training (wher, leas, hing, throug, befor, nig, hough, bri, thos, tch) are truncated common English words, the Wake tokens that bridge between Wake-specific content and base English at sentence boundaries. The model concentrates learning on the English-Wake boundary.
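
The accidental-SGDR mechanism can be illustrated with a small schedule sketch. It assumes each resumed session rebuilds a cosine schedule from its own local step 0 while STEP_OFFSET carries the global step forward, so the learning rate jumps back up at every restart, as in the warm-restart schedule of Loshchilov and Hutter. The learning-rate values, session length and function names are illustrative assumptions, not the settings used in the Qwen run:

```python
import math


def cosine_lr(local_step: int, session_steps: int,
              lr_max: float = 5e-4, lr_min: float = 0.0) -> float:
    """SGDR-style cosine annealing within one cycle (here, one session)."""
    progress = min(local_step, session_steps) / session_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def global_step(step_offset: int, local_step: int) -> int:
    """Global step used for logging and checkpoint naming after a resume."""
    return step_offset + local_step


def sentry_name(step_offset: int, local_step: int) -> str:
    """Session-safe checkpoint name in the sentry_step_<N>.pt style."""
    return f"sentry_step_{global_step(step_offset, local_step)}.pt"


if __name__ == "__main__":
    # Two hypothetical 100-step sessions: the LR decays, then restarts high.
    for offset in (0, 100):
        for local in (0, 50, 100):
            print(global_step(offset, local), f"{cosine_lr(local, 100):.2e}")
```

Each restart returns the learning rate to `lr_max`, which is consistent with the train-spike-followed-by-val-descent pattern reported per session, while the offset keeps checkpoint names from colliding across sessions.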

Infrastructure:

Triton shim for bnb/triton 3.x compatibility
DriveSentry local-first write pattern for FUSE reliability
STEP_OFFSET pattern for session-safe callback file naming and accidental SGDR mechanism
Resume support: Trainer-native (P2) and manual with STEP_OFFSET (P1)
WakeOverlay sentry-only-Wake-rows storage (Qwen-specific): saves only the trained Wake-row matrix per sentry (448MB) rather than the full embedding matrix (~2GB). Across 150 saves over 14 weeks: 67GB vs 300GB Drive footprint. Architectural choice that made the 14-week canonical feasible on free Colab + Drive infrastructure.
Devlog tables as canonical training record: when the Trainer's trainer_state.json was lost in the 9 June Colab cut shortly after canonical step 3000, the devlog tables (maintained by hand across all 39 sessions) supplied the complete (step, train, val) history for canonical loss curve reconstruction. The devlog-as-canonical-record practice is itself a methodological contribution for long training runs on free infrastructure.
Spherical 1.5x init is the cohort default across the 32K-cohort (TinyLlama, Mistral, Phi-3.5) and the 128K-cohort (Llama family). Compositional init at 1.0x was deployed only on Llama 8B as an experimental variation; results so far do not support it as the new default. Phi-3.5 explicitly uses 1.5x to preserve cohort comparability.
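
The Drive-footprint figures for WakeOverlay sentries follow from simple arithmetic. A sketch using the per-save sizes quoted above (448 MB versus ~2 GB, 150 saves); the second helper shows how sizes of that order arise from matrix shape, where the hidden width of 5120, the ~196K total row count and 2 bytes per parameter are assumptions for illustration rather than values confirmed by this README:

```python
def footprint_gb(n_saves: int, mb_per_save: float) -> float:
    """Total storage in decimal GB for n_saves checkpoints of a given size."""
    return n_saves * mb_per_save / 1000.0


def matrix_mb(rows: int, hidden: int, bytes_per_param: int = 2) -> float:
    """Size in decimal MB of a rows x hidden embedding matrix."""
    return rows * hidden * bytes_per_param / 1e6


if __name__ == "__main__":
    print(f"Wake rows only: {footprint_gb(150, 448):.1f} GB")
    print(f"Full matrix:    {footprint_gb(150, 2000):.1f} GB")
    # Assumed shapes: ~44K Wake rows vs ~196K total rows, hidden width 5120.
    print(f"{matrix_mb(44_000, 5120):.0f} MB vs {matrix_mb(196_000, 5120):.0f} MB")
```

Saving only the trained rows scales storage with the number of injected tokens rather than with the full expanded vocabulary, which is where the roughly fourfold saving comes from.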

TinyLlama 1.1B P3b Generation Samples

Sample 1

Prompt: riverrun, past Eve and Adam's,
Params: temp=0.9, top_p=0.92, top_k=50, rep=1.15

and the crossgoing out of Humphrey's age, to thenext song of the Rocks. The marriage of Barandy and the Foughty Fatima, my shopsoiledgirls,under theillassumed names of Perkinson and Misham. The sighing of aladle man thatstill will be the thing that joyfakes alllife. I have read in the pink of Mourne's face as he dreams of her. I haveheard the windcursy of his doom. I haveheard theshades of histhumbs. How Thine Sidesup, Jeminy, isber,such and is offensive! And you, woman, are not as clean as you are now, my dearest Yokeoff. And it is my last endings, with that oldmilkless a-going for asoft time, thatsleep by the staircase, that threesheets at nunc or twogiving a toll, thatgrip at noon (for if she has achild, she can never have asleep) that has a

Sample 2: Temperature sweep

Prompt: riverrun, past Eve and Adam's,
Fixed params: top_p=0.92, top_k=50, rep=1.15

temp=0.5

and thelightning Nixy, and thefields of Olaf, and the bannars up with the MacGhoul, the Vale's inheaven, and the harricanahurries andhotsfoots, zingo,zango,segur. To the cross Lally, to the cross Doolin, to the cross Sean Adoyer, to the cross Arth Mockmorrow Koughenough, O' Ryne!

Ryeer! Hear, O' Dullmailee! The Macclefield Witch is reelingthrough the top of theworld! That washes yourhead out, youunderstand, and you can't forget the ten or so your twin had to pull, and thetwenty one that was sitting on the three, and all thefour, two by two, with their white arms below their ears, at the end of theshadow, howoft right enough, as I think, in the way of fun, for their castor and porridge'sgame, as they were going to behind a wall and the taller man

temp=0.7

and thelightning Nuns and the Cameen or Corpse and the

[104] Tublin. This is not a very long way, myprodder again! Once more after this time, in thefuture oflife, when ourpantriarch have entered their ownsummers, while old Matt Gregory wouldn't be seen, there's a few more between you and the man in statewearholipoliwhollyisland peeeeeeee[132] werewhere, when he was just achild, and you werestill in thewhole. That's what wouldn't be too far, my very fructification, mylittleheart, my same uponhearts, my hair, my ears, my nose, my eyes, my faith, my hair, my hoops and all my ether, no matter how many, when that man had not beengiven thelobby, when thecorner was in his place, and I was too far away to askhimself fornothing.

So, now, as we are in the

temp=0.9

[175] from the day in all our things has been

UNDES.- _Nonquodsed Vestrae

'tis everynight 'tis all about._

[1] I have only a staircase) [2] Six on the run) [3] Who is on thefourfirst then? [4] Weopen we or mates our winds with itsnation,[2] like asfour round about [5] Cthahraet and Malthosius trying to die! [6] We dohear some old times (you and two verysmallthirtygirls!) Shem and Shaun, out of date. [7] A pair of green eyes at the back of a shirt at Pickardstown. [8] None of thefour by the sea,through the black man at Roseleys. [9] Alared by the blackhearts allaround roundbrigidschool —Truly much for thee,histindier. When was it ever ever up?

temp=1.0

withlustres ofpeins. Whatsound be done if only so they were?[1] 1065 (3618) No. I say it is awild'ssort to be cracked by all.[2] Now, old man, it's time you turned thesleep and come out of yoursleepingexex. Aye, and forwards I will stand tobring you out. And you to her, and you, and she to her back! So pass thetrouble on, and take your Bylineal in the bedroom. Bier, stiff pumps, 1169.

Waxens for wimwyer,head in love,bloodtune onsweet andfirst, thump, by, shirt off, shints tolife, cakestood,kiss up, buckler,head off,hear, Mi-face,such as Tuskar and Ania. _Tuesay, Pudge and Be Peposys. This issuch achild. Proper

temp=1.2

where the Nilsens made the coke of this tay for thehead part in thefour, where hewallowednnykins all down the rainvert redvilla. To mark her ownlife or pity to him. So the water and thehind that was milling in thefirst Shem or the Vain that had nowhad it, now love it, now anextinsionkissed the twins (for sheknew not thelanguage, but what sheknew was so long as she just caned her heirs) while thatwoman (who, then,knew howsuch aperiodiosit bead out of Vrittiants and Tadters, no lie!), when her old time-ricking time waran act was on, with apurecures for a wound to be due she putunder hispallyass and begin togive arms, girdles,hatsoff to all theirpurtybussesning lovely about

[120] hissleep and his flesh may neverfall. And there shestill words how to jayne and musical

Sample 3: Multi-sequence generation (×7)

Prompt: riverrun, past Eve and Adam's,
Params: temp=0.9, top_p=0.92, top_k=50, rep=1.15

[1]

theshade of ages (our times are done) with theirhistoricbringing them. Those were the Homo Vestrae, Vale, O'Neill! Theheart of Lifé, the year of the Cure, Fought for Humans' mound in Peruvian: Ere I go to quest of Wachtman's Cromwell, high time as far as Tear-nan-Og, as far as the Oyest Brayles;

The butwhere is he? Tell me, why do we be on of thatclass? Why not at the Rother's stomach? If she can't keep him at lughts or forshee Chambers? Not then? without the having to be off tobridges,through the Arsa, the Nodderlands Nurskery, the Manulinstight; now

[2 extended]

and the sigh from theopenns as by the moors made. But you are doing your own thing. The time for e'erthose days was only atrifle and then allover when it took place. Thefirst thing that ever was done in the early days of my good man is afterwhere the grandgame was representsing hislowness! Whoguesse, howsuccessy do you havesuch a shorthead? Whatshould I have aheart? But, let usmooremooremurgessly there andhinl. Ahighlife of it. The tembo in his hand willgive him another. And, atweare if it's their hand, may the scene in his eye! From old ocean to oill or white, the rain has no matter when it's the use of avoice._

[41]

Shem was thinking fairly killing times too. He had it incurrent and they were all upagainst that. When he was with the MacHammuds after the fish went wrong (but, leave me this, it is looking aged)

[3 extended]

and the sigh I made in the full marpliche! by the grace of the Gracehoper. But my eyries be to him asbefore the ghost have itshead, with apoint ofhorror in hiswear, for the moment I am not up, he hascured down his Λ, (theloa, signing as manyarchers as there are bones in thebloo,) andstill reelingover theworld, like abottle of a wind, that spoiled fonceys andkissed us all by the bones in theirshadows.

But I am asdying to Gode's will, and I will do all that he does, if he has it, if he does, though I am not going to saynothing about the gothtends oflife, for I mean to stay by the lord's side, atleast, and beinstead of cough andsleep and spit in a strawberryfrolic, just pass the teeth in olddummydeaf, as Morgents Fins me, andtouch yourtrousers about the rain and the

Notes

Temperature behavior

The model shows coherent temperature scaling:

temp=0.5: Most structured. Anaphoric lists ("to the cross Lally, to the cross Doolin"), confident proper nouns ("MacGhoul", "Koughenough", "Dullmailee"), clear narrative momentum. Closest to readable pastiche.
temp=0.7: Longer flowing passages, invention ramps up ("statewearholipoliwhollyisland", "pantriarch", "fructification").
temp=0.9: Structural experimentation begins. Numbered lists, footnote markers, dramatic formatting. "Cthahraet and Malthosius" and "roundbrigidschool" feel authentically Joycean.
temp=1.0: Dense, compressed. Stage directions and numbering intrude ("1065 (3618)"). Portmanteau density increases: "sleepingexex", "wimwyer", "bloodtune", "Bylineal".
temp=1.2: Maximum invention. "wallowednnykins", "aperiodiosit", "Vrittiants and Tadters", "purtybussesning", "purecures", "pallyass". Grammatical structure loosens but never collapses entirely.
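
The temperature sweep above can be expressed as a set of sampling configurations. A minimal sketch, assuming the samples were drawn with Hugging Face `transformers` sampling and that `rep` denotes `repetition_penalty`; `sweep_configs` is a hypothetical helper, and `max_new_tokens=256` is an illustrative default that is not stated for the TinyLlama runs:

```python
from typing import Dict, List

PROMPT = "riverrun, past Eve and Adam's,"
TEMPERATURES = [0.5, 0.7, 0.9, 1.0, 1.2]


def sweep_configs(temperatures: List[float],
                  max_new_tokens: int = 256) -> List[Dict[str, object]]:
    """Build one sampling config per temperature, other params held fixed."""
    return [
        {
            "do_sample": True,            # sampling, not greedy decoding
            "temperature": t,
            "top_p": 0.92,
            "top_k": 50,
            "repetition_penalty": 1.15,   # "rep" in the sample headers
            "max_new_tokens": max_new_tokens,
        }
        for t in temperatures
    ]


if __name__ == "__main__":
    for cfg in sweep_configs(TEMPERATURES):
        print(cfg["temperature"], cfg["top_p"], cfg["repetition_penalty"])
```

Each dictionary could be passed as keyword arguments to `model.generate(**inputs, **cfg)` after tokenizing `PROMPT`, so that temperature is the only variable changing between runs.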

Recurring features across all samples

Lexical invention. Portmanteaus and neologisms that don't appear in the training text: "shopsoiledgirls", "windcursy", "joyfakes", "Yokeoff", "mooremooremurgessly", "Manulinstight", "strawberryfrolic", "olddummydeaf", "gothtends", "fonceys", "marpliche", "harricanahurries", "purtybussesning", "wallowednnykins". The model invents in Joyce's style.

Character and place references. Shem, Shaun, HCE ("Humphrey"), Matt Gregory, Mourne, O'Neill, Cromwell, "Tear-nan-Og" (Tír na nÓg), "Nodderlands Nurskery", "MacHammuds", "Nilsens", "Gracehoper" (recovered directly from Joyce). The Wake's cast and palimpsest geography are intact.

Structural mimicry. Parenthetical asides, italicized stage directions, numbered fragments, verse-like indentation, footnote markers, rhetorical question cascades. The rhythm of Wake prose: long clauses chained with "and", commas doing the work of periods, sudden register shifts.

Spacing artifacts. Consistent compound-fusing ("theshade", "haveheard", "willgive") across all temperatures. This is the main Phase 1 limitation, from frozen attention layers that can't adapt to new tokenization boundaries.

To note

All of this comes from embedding geometry alone. The transformer weights are entirely frozen at their chat-tuned values. The model generates Wakean text by navigating a reshaped embedding space through unchanged attention patterns.

Qwen 2.5-14B P1 Canonical, Generation Samples

(Full outputs in outputs/p1_qwen14b_generation.md. The Qwen samples produce a visually dense compound-mass at all temperatures, often without word breaks, so excerpts here are trimmed for legibility.)

Sample 1: Single generation

Prompt: riverrun, past Eve and Adam's,
Params: temp=0.9, top_p=0.92, top_k=50, rep=1.15

stillleroxbelledfarkkeyholemainestsprokeduvlingebrorenhanzassognsplapdustamountturfbrowneirestgenikdonnerycrazingchristmastydepassioflowermockcomickirikirikiringpettyfib'srockelosedarklinghandmakeaprioricanarchcustoscrazingcolumnkillalmeaniummatteroffactnesshypsometersgrandestlownesskinkincaraborgroothsdeliveriedgrandesto'kayanarchpppeaseogonochicinglassgayemiddenmewseyfumederry'smaterfamiliaswednesburyboomoosternightmailnattesmaterfamiliashoarsemen阳magreesfistiknotslimpetuckpointefoxtricklesomethuartpeatrickmaterfamiliasthankyoufulrossecullinansrossecullinansghasternsuckabollytalkingtreethrostlesvowelglidelispinglirraplapvoiceyversychurchman ... [continues for ~256 tokens without breaks]

Sample 2: Temperature sweep

Prompt: riverrun, past Eve and Adam's,
Fixed params: top_p=0.92, top_k=50, rep=1.15

temp=0.5

stilllerhoosematinangeugaulesrussetspapelboypapelboyainsellahoneconscribedurnthritytwowiddarsmirthpealsdolingsduvlinhitchespoingtapopoOLopingrearrived阳langloanchoritedemoralizingterritorialslozengescontrivancemasticgaulescomepullgaulesgauleswooingcisamispalpruyparlourmendoogdoogtoper'starpinacciolfaolfakemolfa Aleatupusgypsinghouhnhymn美国placelikeconstantinealbumchubsiddlecowldzessid'stoomeydemocriticosleprousficsimilarlozengesbetweenlyohibohsittanghankypanksplapficsimilarduvlinfassedtelekinesisonviedmoysighinspirervagrantloavesgulughuruttyyoulkyoulkarseusanlescingderry's ...

temp=0.7

stilllerhoosematinangeugaeilishgnawthingalmeaniumpapelboyourselfsakepapelboyjury'sdolingsduvlin Mapolfatithingpoingt nanrearrivedsquigglingyoulkcomepullwiggychrystanthemlandercolumnkillahone polyglutturalduvlinbottlerplacelikevallsallsinistrousisod'sterritorialshennarosimund's普通shaddaloavescramwellshaldmelarancitronelionses roughdusessmehrkurioscryptogampalpruyostralianyonsidesuckdooghapsnots essential ...

temp=0.9

diddydidcombitschbiguiddhoosematin objectsangeuleprouspoingtpapelboyumprincipiantahonepapelboy。sallybrightdoublefirstwiggysalaamesjoltinggaulesslivenamondtidiesbeyantwiggywexterford'stonedeafshibernska knows idusessplurabelle'scorveeturetwoheadedsecurelysealingdren'sdurnachewingtarpinaccipraisegaddumptydum'lozengessousersdumptydumenvelopedducomans>convenienceschauysrearrivedhibernskacatclubsubsecmehrkuriosjacobsentawnyforbearcoptplaceliketawnystodgeduvlinneverworn阳magreesmaisonry创业hankypankscomepullhelfmoscashoundedclaudduggelduggeldeaubaleauyoulkuckdoog ...

temp=1.0

censefaulterercarniumexistentialitywebbethshufferingpigstickularlypuptisedphaynixthealmostferehousefulduvlinberaddyolfatarrapoullingberaddypurefusion美国prosperousnessrayburnrayburnrayburnplacelikegobydaffyrearrivedrearrivedmistadolingsjotalphesonsublimatewouldpaygaeilishbuoyedpoingtheadwoodbeppy'swellinformed Sarfrore日子corruptiblecomeallyousgenmenputshameyubetholderpotablybetweenlyillustrationingoncontinentrearrivedrearrivedgidgadcryptoconchoidsiphonostomatarearrived ...

temp=1.2

hogofemiliesturbtelekinesisanarchlepertiesnublidlanglobeaushairwireaneathwebbeth-caplozengesdonnery /*hamovsblondyanchoriterhonndashukarkithagainexchequeredelpistarpinaccianchoriteperturbingelpiscryptoconchoidsiphonostomataputshameyuelpisclerydarklingduvlinnightiestarpinacciumclausedauthorwaysmillickmaam'smasticarkmillickmaam'sgauleseatupusrearrivedpigstickularlyseventeenyearoldwaltzersllongsnipehitting总结physiog ...

Sample 3: Multi-sequence generation (×3)

Prompt: riverrun, past Eve and Adam's,
Params: temp=0.9, top_p=0.92, top_k=50, rep=1.15, num_return_sequences=3

[1]

censepostfaceumprincipianthoosemermenfrigoriquehoodendosesangeu Eastheadwoodsalaameslozengesbucknesstpapelboyahoneahonecomepullumprincipiant美国ohohcowldcramwellsimperfectionlaudszessid'scruciantidies阳zaynithshebicomepullsalaamesknockingshop roughpoingtpalmsweatdemocriticosashpitsscimmianisedwumblindeedpolldurnluttrellsandhurstrumanyoelambhaughtpipettetumtytum weremcadoopapelboydolingsknowmeyesternterritorialscomepullcomepulltaskmaster'scomepullshellaliterunesturbaryexhortingtumtytumputshameyubowandcoatinjectivejovesday ...

[2]

allsalldoulseme-spondeeschilforebiddenyemcrazedledazechimbesschtinkenkotdvershenradientscupslipsforebiddenyordeffusiongenrouslylauralyeblanalambelhomoidpott美国rassiasheadwood turfbrownaliment智能bigrobbissingmaisonrysalaamessalaames。soferimpalpruyejussukkotbaredsixesuphillsracecoursefulracecoursefulseightpigsesolfa moreporkgrapciasyoelambpalpabrows ...

[3]

hogohemelfarkmainestdeliverieddarklingdullcisamicagenikplaintiff'sallsortprovidentialitylillhavesthreftthoroughgoinglellymarrackspussinessdiffusingfinightthreefoilednavigableathiacarohandmakemirrylambduvlinbutteredhitchesoheremahoremarklablejotalphesongayeth'avignuetarpinaccidonnerygavelkindmourneplapbakereen'stwaddlebiguidddonnelly's ...

Sample 4: Extended (×3, temp=1.1, max_new_tokens=512)

Prompt: riverrun, past Eve and Adam's,
Params: temp=1.1, top_p=0.92, top_k=50, rep=1.15, num_return_sequences=3, max_new_tokens=512

[1]

hogodeliveriedhooseimpersonatinghibernskasummumpapelboyaringarunglispingpursuitinglorkingjoltingboldyluggedpoingtleprouswha'mmuckyregulectbreavinghennalozengespoourdurnvitiousapopocroscopedemoralizing防hankypanksquoiquoiquoiquoiquoiquoiquoiqjude'shisucowlddoogvitiouslescingmirthpealsinjectivestretchingtrisspassvaulsiesexhortingdusessrearrivedrillieshennaourselfsake roughfallener ... [extended polyglot Wake-style continues for 512 tokens]

[2 and 3]

(See outputs/p1_qwen14b_generation.md for full extended samples.)

Notes

Temperature behaviour

The model produces sustained Wake-style output across all temperatures. Unlike TinyLlama's gradual unravelling from temp=0.5 (most readable) to temp=1.2 (maximum invention), Qwen stays inside the Wake-anchored compound-mass register at all temperatures, with the only variation being density vs diversity:

temp=0.5: Maximum density. Heaviest repetition of attractor tokens (headwood, loab, salaames). Compound-mass continuous with no breaks. Least diverse vocabulary.
temp=0.7: High density. Reduced repetition. Polyglot tokens start appearing (美国, presbyoperian, scotobrit).
temp=0.9: Onomatopoeic mass begins (rrrwwwkkkrrr). Joyce-signature constructions emerge (comeallyous, tarpinacci). Multilingual mixing increases.
temp=1.0: Most diverse polyglot mixing. Wider Chinese-character inclusion. New compound coinages at every position.
temp=1.2: Most chaotic but still recognizably Wake-style. New invention at every position. Grammatical structure remains loose throughout (consistent across the temperature range, not a feature of high-temp degradation).

Recurring features across all samples

Compound morphology at scale. Hundreds of tokens of continuous compound-mass without word breaks. Far denser than TinyLlama's output, which shows clear word boundaries and grammatical scaffolding.

Polyglot register. Chinese characters (美国 = "America", 创业 = "entrepreneurship", 总结 = "summary", 阳, 趁, 望, 克, 秧, 蕾, 红, 思, 福, 防), Thai (ัน), German-flavoured constructions (schtinkenkot), Semitic anchors (salaames). The polyglot Wake signature is produced compositionally, not as a parody.

Wake-vocabulary attractors. Certain tokens recur across all temperatures and runs despite rep_penalty=1.15: salaames, duvlin (Dublin), tarpinacci, schtinkenkot, headwood, loab, pigses, materfamilias, comeallyous, magrees. These are model-specific attractor states the canonical Qwen reliably samples toward.

Joyce-signature constructions. Reduplication (kirikirikiring, natinatinatinati, duggelduggel, shahrryardhushahrryard), onomatopoeic mass (rrrwwwkkkrrr), number-as-word (thritytwo, seventyseventh, twentynine, fourscore), Wake place names (duvlin, wexterford's, tallaght's, hibernia), signature Joyce coinages (comeallyous from "come all ye", darkling, morrowweth).

No bridge-token routing. The most-drifted Wake tokens (wher, leas, hing, throug, befor), which act as English-Wake boundary tokens, do not appear prominently in the generation. The model stays inside the Wake-anchored semantic field rather than routing through the English boundary tokens. This is the structural difference from TinyLlama, whose output passes through English-fluent passages between Wake-style bursts.

Spacing artifacts. Consistent compound-fusing across all temperatures (much more extreme than TinyLlama's). This is the WakeOverlay P1 limitation: frozen attention layers can't adapt to new tokenization boundaries, and at Qwen's 14B scale the effect compounds into hundreds of tokens of continuous compound-mass.
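
The "most-drifted" tokens referred to in the bridge-token discussion can be found by ranking Wake rows by how far they moved during training. A minimal sketch, assuming drift is measured as L2 distance between a token's embedding before and after training (cosine distance would be an equally plausible choice); the toy vectors and the `top_drift` helper are illustrative, and a real run would read the saved embedding snapshots and use vectorised tensor operations:

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_drift(before: Sequence[float], after: Sequence[float]) -> float:
    """Euclidean distance between one token's pre- and post-training rows."""
    if len(before) != len(after):
        raise ValueError("embedding rows must have the same width")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))


def top_drift(before: Dict[str, Sequence[float]],
              after: Dict[str, Sequence[float]],
              k: int = 10) -> List[Tuple[str, float]]:
    """Return the k tokens whose embeddings moved furthest, largest first."""
    drifts = [(tok, l2_drift(before[tok], after[tok])) for tok in before]
    return sorted(drifts, key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy 2-d embeddings; token names are borrowed from the README for flavour.
    pre = {"wher": [0.0, 0.0], "duvlin": [1.0, 1.0], "salaames": [0.5, 0.5]}
    post = {"wher": [3.0, 4.0], "duvlin": [1.0, 2.0], "salaames": [0.5, 0.5]}
    for token, dist in top_drift(pre, post, k=2):
        print(token, round(dist, 3))
```

Checking whether the top-ranked tokens then show up in generated output is best done on token ids rather than on decoded text, since short fragments such as "hing" also occur inside longer words.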

Cross-model comparison

The TinyLlama (1.1B, 58% Wake share, P3b) and Qwen (14B, 22% Wake share, P1 canonical) outputs both produce Wake-style generation but through visibly different mechanisms:

Dimension	TinyLlama 1.1B P3b	Qwen 2.5-14B P1
Wake-vocab-share	58%	22%
Training depth	~3 weeks	14 weeks, 39 SGDR cycles
Word boundaries	Visible, grammatically scaffolded	Absent; continuous compound-mass
Polyglot register	Latin/medieval European	Chinese/multilingual + Latin/European
Joyce-signature density	Periodic	Continuous
Wake invention	Per-passage portmanteaus	Per-token portmanteaus
Readability	Pastiche-readable	Density-overwhelming
Mechanism	Wake region integrated into English-anchored base manifold	Wake region orthogonal to multilingual base manifold; scale + depth compensate

The two outputs are evidence for the refined finding (see outputs/p1_qwen14b_canonical_outputs.md): generation quality is achievable across multiple points in (Wake-vocab-share, model scale, training depth) space. TinyLlama achieves it via the compute-efficient path. Qwen achieves it via the brute-force-efficient path. Both produce sustained Wake-style output. The minimal-computing argument prefers the TinyLlama-class configuration as the methodologically appropriate choice under infrastructural constraint.

Citation and Credit

Text: James Joyce, Finnegans Wake (1939)

Base models:

Conceptual inspiration:

Embedding surgery, retrofitting, and lightweight adapter methods (LoRA, PEFT)
Biehle, M. (2025). Comparative Suspension: Joyce's Dubliners and the Computational Invisibility of Figuration. MA dissertation, UCL. [Comparative Suspension Theory provides the theoretical framework for interpreting null results in Wake embedding geometry.]
Zhang, C. (2025). Attention Is Not What You Need: Grassmann Flows as an Attention-Free Alternative for Sequence Modeling. arXiv:2512.19428. [Experimental Grassmann mixing framework in grassmann_vs_attention.py.]
Acheli, M., et al. (2026). Motivation is Something You Need. arXiv:2602.21064. [Dual-model training paradigm informs multi-phase pipeline design.]

Cite: https://github.com/mahb97/Wake2vec/blob/21469d75c26d40988ec5af8a4358d1796a36fdf0/data/CITATION.cff
