# BitNet-b1.58-Sharp: Bucketing Implementation Plan v1.0
**Chain-Bucket Speculative Decoding + Training-Time Sequence Compression**
**Core Feature for Inference Speedup and Training Efficiency**

**Version:** 1.0
**Date:** March 20, 2026
**Status:** Production-ready blueprint

---

## Table of Contents
1. Executive Summary & Success Criteria
2. Prerequisites & Integration Points
3. Overall Architecture
4. Phase 1: Offline Bucket Mining Pipeline (5–7 days)
5. Phase 2: Inference-Time Chain-Bucket Speculative Decoding (7–10 days)
6. Phase 3: Training-Time Sequence Compression with Super-Tokens (8–12 days)
7. Phase 4: Quality Safeguards, Evaluation & Benchmarks (5–7 days)
8. Phase 5: CLI, Documentation & Release (3–5 days)
9. Full UML Catalog (Object & Logic Examples)
10. Risk Register & Mitigation
11. Timeline, Milestones & Effort Estimates
12. Future Extensions

---

## 1. Executive Summary & Success Criteria
Goal: Add **bucketing** as a core optimization that accelerates both inference (via speculative multi-token jumps) and training (via compressed token sequences using super-tokens).

**Success Criteria**
- Inference: ≥ 1.8× tokens/sec uplift with ≥ 70% chain acceptance rate
- Training: ≥ 25% reduction in effective sequence length and training time
- Zero quality regression (verified by perplexity and downstream metrics)
- Fully optional via `BitNetOptions` (enabled by default for new models)
- Works with any tokenizer and any BitNet checkpoint

---

## 2. Prerequisites & Integration Points
- Existing `BitNetTransformer`, `BitNetPaperModel`, and training loop
- `BitNetOptions` class (for toggles)
- Existing tokenizer and training corpus
- Benchmark suite (TinyLlama-1.1B + perplexity)

---

## 3. Overall Architecture

```mermaid
graph TD
    BitNetPaperModel --> ChainBucketTable
    BucketMiner --> ChainBucketTable
    ChainBucketTable --> InferencePath[Inference: Speculative Decoding]
    ChainBucketTable --> TrainingPath[Training: Sequence Compression]
```

---

## 4. Phase 1: Offline Bucket Mining Pipeline (5–7 days)
1. Create a `BucketMiner` service that scans tokenized corpora.
2. Extract frequent n-grams (n = 2 to n = 8).
3. Score candidates by frequency × conditional probability.
4. Pack the top candidates into at most 256 buckets, so each chain ID fits in a single byte.
5. Store: `byte ChainID → TokenID[] chain + float confidence`.
6. Output: `ChainBucketTable` (versioned, < 50 KB).

**Implementation:** `src/BitNetSharp.Core/Bucketing/BucketMiner.cs`
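
The mining core (steps 2–4) can be sketched in a few lines. The following is an illustrative Python prototype of the scoring and packing logic, not the `BucketMiner.cs` implementation; the function name `mine_buckets` and the exact score `frequency × P(chain | first token)` are one reasonable reading of step 3, not a settled design.

```python
from collections import Counter

def mine_buckets(sequences, max_buckets=256, min_n=2, max_n=8):
    """Count n-grams (n = 2..8), score each by frequency x conditional
    probability, and keep only the top candidates so every chain ID
    fits in a single byte."""
    unigrams = Counter(tok for seq in sequences for tok in seq)
    ngrams = Counter()
    for seq in sequences:
        for n in range(min_n, max_n + 1):
            for i in range(len(seq) - n + 1):
                ngrams[tuple(seq[i:i + n])] += 1

    scored = []
    for chain, freq in ngrams.items():
        cond = freq / unigrams[chain[0]]   # P(rest of chain | first token)
        scored.append((freq * cond, chain, cond))
    scored.sort(reverse=True)              # best score first

    return [
        {"chain_id": cid, "tokens": list(chain), "confidence": cond}
        for cid, (_, chain, cond) in enumerate(scored[:max_buckets])
    ]
```

On a toy corpus `[[1, 2, 3, 1, 2, 3, 1, 2, 4]]` the top bucket is the bigram `[1, 2]`, which follows every occurrence of token `1`. A production miner would additionally prune chains that are prefixes of higher-scoring chains and serialize the table with a version header.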

---

## 5. Phase 2: Inference-Time Chain-Bucket Speculative Decoding (7–10 days)
**Core flow:**
1. After each generated token, check the last 1–3 tokens against bucket prefixes.
2. If a match is found, speculatively emit the continuation tokens from the matching chain.
3. Run a parallel verification pass: confirm that the model's top-1 prediction matches each chain token.
4. Accept tokens sequentially until the first mismatch (classic speculative safety).
5. Update the context window once for the entire accepted chain.

**Integration:**
- Extend `BitNetPaperModel.GenerateResponse()` with an optional bucketing path.
- Add a `ChainBucketTable` loaded via `MineAndLoadBuckets()` or `LoadBucketTable()`.
- Configurable via `BitNetOptions.EnableChainBuckets` and `MaxChainLength`.

**Implementation:** `src/BitNetSharp.Core/BitNetPaperModel.cs`
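
The five steps above reduce to a short accept-until-mismatch loop. A minimal Python sketch, assuming a `bucket_index` dict (prefix tuple → full chain) and a `top1` callable standing in for the model's greedy next-token prediction — both hypothetical names, since the real path belongs inside `GenerateResponse()`:

```python
def speculate(context, bucket_index, top1, max_chain_len=8):
    """Try a chain-bucket jump: match the context tail (longest first)
    against chain prefixes, then accept continuation tokens only while
    each one agrees with the model's own top-1 prediction."""
    for k in (3, 2, 1):                    # steps 1-2: prefix lookup
        if len(context) < k:
            continue
        chain = bucket_index.get(tuple(context[-k:]))
        if chain is None:
            continue
        accepted = []
        for tok in chain[k:max_chain_len]: # steps 3-4: verify each token
            if top1(context + accepted) != tok:
                break                      # first mismatch ends the chain
            accepted.append(tok)
        return accepted                    # step 5: one context update
    return []                              # no match: normal decoding
```

Because every accepted token equals what greedy decoding would have produced anyway, output quality is unchanged; the speedup comes from verifying the whole proposed chain in one batched forward pass instead of one pass per token (the sequential `top1` calls here are illustrative only).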

---

## 6. Phase 3: Training-Time Sequence Compression with Super-Tokens (8–12 days)
**New capability:** During training, replace frequent n-grams with a single first-token placeholder to shorten sequences.

**Steps:**
1. Before each training batch's forward pass, scan the prompt sequence for chains.
2. Replace matching n-grams with just the first token of the chain.
3. During the forward pass, the model sees compressed sequences (shorter context = faster training).
4. Loss is still computed against the original first target token.
5. Re-mine periodically (at startup or on demand) so the table adapts to corpus content.

**BitNet specifics:**
- Compression is applied to the INPUT context only; target tokens are unchanged.
- The re-quantization schedule is unchanged.
- Expected benefit: 20–35% reduction in training tokens processed per epoch.

**Configuration:** `BitNetOptions.EnableSequenceCompression = true`

**Implementation:** `src/BitNetSharp.Core/BitNetPaperModel.cs` (`CompressSequence` helper)
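
As a sketch of the `CompressSequence` helper's logic (in Python for illustration), assuming a greedy longest-match policy — the plan does not pin down the matching order, so that choice is an assumption here:

```python
def compress_sequence(tokens, chains):
    """Greedy left-to-right scan: wherever a mined chain matches, keep
    only its first token. Per the plan, this is applied to the INPUT
    context only; training targets are left untouched."""
    by_len = sorted(chains, key=len, reverse=True)  # prefer longest match
    out, i = [], 0
    while i < len(tokens):
        for chain in by_len:
            if tokens[i:i + len(chain)] == chain:
                out.append(chain[0])       # chain collapses to first token
                i += len(chain)
                break
        else:                              # no chain matched at position i
            out.append(tokens[i])
            i += 1
    return out
```

For example, with chains `[[1, 2, 3], [4, 5]]`, the sequence `[0, 1, 2, 3, 4, 5, 6]` compresses to `[0, 1, 4, 6]` — a 7 → 4 token reduction, in line with the 20–35% target when chains occur frequently.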

---

## 7. Phase 4: Quality Safeguards, Evaluation & Benchmarks (5–7 days)
1. Add a verification step: every speculated chain token must match the model's top-1 prediction.
2. Run a perplexity check on compressed vs. uncompressed validation sets.
3. Extend the benchmark suite:
   - Tokens/sec with/without bucketing
   - Training time per epoch with/without sequence compression
   - Acceptance-rate and compression-ratio metrics
4. Add to the existing TinyLlama-1.1B benchmark pipeline.
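
The two headline metrics from step 3 are straightforward to define; a sketch (the function and parameter names are illustrative, not an existing API):

```python
def bucketing_metrics(accepted_lens, proposed_lens, orig_tokens, comp_tokens):
    """Chain acceptance rate = accepted speculative tokens / proposed;
    compression ratio = fraction of training tokens eliminated."""
    acceptance = sum(accepted_lens) / max(1, sum(proposed_lens))
    compression = 1.0 - comp_tokens / orig_tokens
    return acceptance, compression
```

In terms of these numbers, the success criteria translate to `acceptance >= 0.70` on the inference side and `compression >= 0.25` on the training side.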

---

## 8. Phase 5: CLI, Documentation & Release (3–5 days)
1. CLI commands:
   - `dotnet run -- chat "hello" --enable-bucketing`
   - `dotnet run -- train --enable-bucketing`
   - `dotnet run -- datagen --domain code --count 10 --output data.jsonl`
2. Update `/docs/bucketing-guide.md` with usage, expected speedups, and quality notes.
3. Add to the main README as a core optimization feature.
4. Release with pre-mined bucket tables for common tokenizers.

**Implementation:** `src/BitNetSharp.App/Program.cs`

---

## 9. Full UML Catalog (Object & Logic Examples)

**Inference-Time Flow**

```mermaid
flowchart TD
    A[Last 1-3 Tokens] --> B[Bucket Table Lookup]
    B --> C[Chain Candidate Found?]
    C -->|Yes| D[Expand + Verify Each Token]
    D --> E[Accept Until Mismatch]
    E --> F[Context Updated for Full Accepted Chain]
    C -->|No| G[Normal Single-Token Generation]
```

**Training-Time Compression Flow**

```mermaid
flowchart TD
    A[Raw Token Sequence] --> B[CompressSequence]
    B --> C[Replace n-grams with Chain First Token]
    C --> D[Compressed Sequence → BitNet Forward]
    D --> E[Loss Computed on Original Target Token]
    E --> F[Backprop on Compressed Sequence]
```

**Class Structure**

```mermaid
classDiagram
    class ChainBucket {
        +byte ChainId
        +int[] TokenIds
        +float Confidence
        +int Length
    }
    class ChainBucketTable {
        +int Count
        +IReadOnlyList~ChainBucket~ Buckets
        +TryLookupPrefix(contextTail, out chain) bool
        +GetById(chainId) ChainBucket?
    }
    class BucketMiner {
        +Mine(sequences, maxBuckets) ChainBucketTable$
    }
    class BitNetPaperModel {
        +ChainBucketTable? BucketTable
        +BitNetOptions Options
        +LoadBucketTable(table)
        +MineAndLoadBuckets(examples) ChainBucketTable
        +GenerateResponse(prompt, maxTokens) BitNetGenerationResult
        +Train(examples, epochs) TrainingReport
    }
    BitNetPaperModel --> ChainBucketTable
    BucketMiner --> ChainBucketTable
    ChainBucketTable "1" *-- "0..256" ChainBucket
```

---

## 10. Risk Register & Mitigation
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Quality regression from compression | Medium | High | Strong verification + perplexity guardrails |
| Bucket table staleness | Low | Medium | Periodic re-mining during training |
| Increased memory for table | Low | Low | Table capped at 256 buckets (< 50 KB) |

---

## 11. Timeline, Milestones & Effort Estimates (Solo Developer)
- Phase 1: 5–7 days → "Bucket Mining Ready"
- Phase 2: 7–10 days → "Inference Bucketing Live"
- Phase 3: 8–12 days → "Training Compression Live"
- Phase 4–5: 8–12 days → "Full Release"

**Total estimated effort:** 28–41 days (several phases can overlap with ongoing training-loop work).

---

## 12. Future Extensions
- Dynamic bucket updating during training
- Multi-byte chain IDs for > 256 buckets
- Integration with DataGen SLM for bucket-aware synthetic data

**End of Document**