
Restore BitNetPaperModel checkpoint API after TinyLlama constructor refactor #19

Merged
sharpninja merged 7 commits into main from copilot/add-training-perplexity-comparisons
Mar 21, 2026
Conversation

Contributor

Copilot AI commented Mar 20, 2026

The TinyLlama benchmark work moved training/model initialization onto constructor-backed corpus setup, but it also dropped part of the BitNetPaperModel surface that checkpoint serialization still depends on. This PR restores that model contract so the PR builds again and checkpoint round-trips continue to work.

  • Checkpoint/model contract

    • Restore BitNetPaperModel.ExportMemorizedResponses()
    • Restore BitNetPaperModel.ImportMemorizedResponses(...)
    • Keep the fix on the model itself rather than adding a workaround in checkpoint code or benchmark plumbing
  • Why the build broke

    • BitNetPaperCheckpoint.Save(...) still exports trained prompt memory from the model
    • BitNetPaperCheckpoint.Load(...) still rehydrates that memory into the model
    • The constructor refactor preserved _memorizedResponses internally, but removed the accessors the checkpoint path calls
  • Regression coverage

    • Add a focused checkpoint round-trip test for a corpus-backed paper model trained on the TinyLlama benchmark examples
    • Verify that a trained prompt response survives save/load without changing the benchmark-facing constructor path
```csharp
var model = BitNetPaperModel.CreateForTrainingCorpus(
    BitNetTrainingCorpus.CreateBenchmarkExamples(),
    VerbosityLevel.Quiet);

model.Train(BitNetTrainingCorpus.CreateBenchmarkExamples(), epochs: 1);

BitNetPaperCheckpoint.Save(model, checkpointPath);
var reloaded = BitNetPaperCheckpoint.Load(checkpointPath, VerbosityLevel.Quiet);
```
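The restored accessors only need to snapshot and rehydrate the private prompt memory. A minimal sketch of that contract, assuming `_memorizedResponses` is a string-to-string dictionary (the PR names the field but not its element type, and `PaperModelSketch`/`Memorize` are hypothetical stand-ins, not the real `BitNetPaperModel` API):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the restored export/import contract.
public sealed class PaperModelSketch
{
    private readonly Dictionary<string, string> _memorizedResponses = new();

    public void Memorize(string prompt, string response) => _memorizedResponses[prompt] = response;

    // Snapshot the trained prompt memory so checkpoint save code can persist it.
    public IReadOnlyDictionary<string, string> ExportMemorizedResponses()
        => new Dictionary<string, string>(_memorizedResponses);

    // Rehydrate prompt memory during checkpoint load.
    public void ImportMemorizedResponses(IReadOnlyDictionary<string, string> responses)
    {
        _memorizedResponses.Clear();
        foreach (var pair in responses)
        {
            _memorizedResponses[pair.Key] = pair.Value;
        }
    }
}
```

Keeping these on the model itself means the checkpoint path never touches private state, which is why the fix belongs here rather than in `BitNetPaperCheckpoint`.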


Copilot AI and others added 2 commits March 20, 2026 17:22
Copilot AI changed the title from "[WIP] Add training benchmark and perplexity evaluation for BitNet" to "Add benchmark comparison summary metrics for training, perplexity, and BitNet vs traditional reports" Mar 20, 2026
Copilot AI requested a review from sharpninja March 20, 2026 17:34
Copilot AI changed the title from "Add benchmark comparison summary metrics for training, perplexity, and BitNet vs traditional reports" to "Add benchmark comparison summaries for training, perplexity, and BitNet vs traditional reports" Mar 20, 2026
Copilot AI and others added 2 commits March 20, 2026 17:57
Copilot AI changed the title from "Add benchmark comparison summaries for training, perplexity, and BitNet vs traditional reports" to "Train both benchmark models on TinyLlama-1.1B via constructor-backed corpus initialization" Mar 20, 2026
Copilot AI changed the title from "Train both benchmark models on TinyLlama-1.1B via constructor-backed corpus initialization" to "Restore BitNetPaperModel checkpoint API after TinyLlama constructor refactor" Mar 21, 2026
Copilot AI requested a review from sharpninja March 21, 2026 06:22
@sharpninja sharpninja marked this pull request as ready for review March 21, 2026 06:38
Copilot AI review requested due to automatic review settings March 21, 2026 06:38

@sharpninja sharpninja merged commit 75c1e6b into main Mar 21, 2026
3 checks passed
@sharpninja sharpninja deleted the copilot/add-training-perplexity-comparisons branch March 21, 2026 06:38
Contributor

Copilot AI left a comment


Pull request overview

This PR restores the BitNetPaperModel checkpoint/memory contract that was lost during the TinyLlama constructor refactor, and updates the benchmark + report pipeline to consistently use the shared TinyLlama benchmark corpus (including added perplexity/comparison reporting).

Changes:

  • Restores/adjusts paper model training-corpus construction and reintroduces memorized-response export/import for checkpoint round-trips, plus adds a round-trip regression test.
  • Switches benchmark training/report generation to use BitNetTrainingCorpus.CreateBenchmarkExamples() (TinyLlama-1.1B) and adds vocabulary/perplexity validation tests.
  • Extends the benchmark report with WikiText2 perplexity + a BitNet-vs-traditional comparison summary (tables + inline HTML charts).

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| tests/BitNetSharp.Tests/HostedAgentBenchmarksExecutionTests.cs | Updates training expectations to benchmark corpus; adds perplexity + vocabulary assertions. |
| tests/BitNetSharp.Tests/HostedAgentBenchmarkReportRunnerTests.cs | Updates report runner test to cover new comparison/perplexity + training dataset rendering. |
| tests/BitNetSharp.Tests/BitNetPaperCheckpointTests.cs | Adds checkpoint round-trip test to ensure memorized prompt responses survive save/load. |
| src/BitNetSharp.Core/TraditionalLocalModel.cs | Adds corpus-backed constructor/factory and perplexity evaluation. |
| src/BitNetSharp.Core/BitNetTrainingCorpus.cs | Introduces TinyLlama benchmark examples/vocabulary and a regex-based vocabulary builder. |
| src/BitNetSharp.Core/BitNetPaperModel.cs | Adds corpus-backed constructor/factory, perplexity evaluation, and restores memorized-response export/import for checkpoints. |
| src/BitNetSharp.Core/BitNetPaperAudit.cs | Switches perplexity fixtures to shared BitNetBenchmarkFixtures. |
| src/BitNetSharp.Core/BitNetBootstrap.cs | Adds overload to create a paper model from a provided training corpus. |
| src/BitNetSharp.Core/BitNetBenchmarkFixtures.cs | Adds shared perplexity datasets/validation samples used across audit/reporting/tests. |
| src/BitNetSharp.App/TraditionalLocalHostedAgentModel.cs | Allows construction from a training corpus and surfaces the underlying model for reporting. |
| src/BitNetSharp.App/HostedAgentModelFactory.cs | Threads an optional training corpus into built-in model construction. |
| src/BitNetSharp.App/HostedAgentBenchmarks.cs | Updates training benchmark to train on TinyLlama benchmark examples and build models with that vocabulary. |
| src/BitNetSharp.App/HostedAgentBenchmarkReportRunner.cs | Generates report using benchmark corpus, adds perplexity + comparison summary + charts, and parses performance outputs for derived metrics. |
| docs/benchmarking.md | Updates docs to reflect TinyLlama training corpus and the new comparison/perplexity reporting. |

```csharp
public static IReadOnlyList<string> CreateBenchmarkVocabulary() =>
    CreateVocabulary(
        CreateBenchmarkExamples(),
        ["tinyllama", "1", "b", "-", ".", "wikitext2", "perplexity", "benchmark", "american", "english", "agent", "framework", "hosting", "tensor", "ternary"]);
```

Copilot AI Mar 21, 2026


CreateBenchmarkVocabulary includes an additional token "wikitext2", but the tokenizer/regex splits alphanumerics into separate [A-Za-z]+ and [0-9]+ matches, so the model will never emit a wikitext2 token during tokenization (it will be wikitext + 2). This makes the token effectively unreachable noise in the vocabulary. Consider removing it or replacing it with the tokens the tokenizer can actually produce.

Suggested change

```diff
-["tinyllama", "1", "b", "-", ".", "wikitext2", "perplexity", "benchmark", "american", "english", "agent", "framework", "hosting", "tensor", "ternary"]);
+["tinyllama", "1", "b", "-", ".", "perplexity", "benchmark", "american", "english", "agent", "framework", "hosting", "tensor", "ternary"]);
```

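The reviewer's point can be demonstrated directly: a tokenizer that matches alphabetic and numeric runs as separate tokens can never emit `wikitext2` in one piece. The pattern below is an assumption for illustration; the actual regex in `BitNetTrainingCorpus` may differ.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// A regex with separate alphabetic and numeric alternatives splits
// "wikitext2" into two tokens, making a "wikitext2" vocabulary entry unreachable.
var tokens = Regex.Matches("wikitext2", "[A-Za-z]+|[0-9]+")
    .Select(m => m.Value)
    .ToArray();
Console.WriteLine(string.Join(" + ", tokens)); // wikitext + 2
```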
```csharp
_model = TraditionalLocalModel.CreateDefault(verbosity);
_trainingCorpusDescription = trainingExamples is null
    ? "default corpus"
    : BitNetTrainingCorpus.BenchmarkDatasetName;
```

Copilot AI Mar 21, 2026


_trainingCorpusDescription is set to BenchmarkDatasetName whenever trainingExamples is non-null, but callers can pass arbitrary corpora via HostedAgentModelFactory.Create(..., trainingExamples). This makes the displayed training description potentially incorrect. Consider using a neutral label like "provided corpus" or threading through an explicit dataset name/description instead of assuming TinyLlama.

Suggested change

```diff
-    : BitNetTrainingCorpus.BenchmarkDatasetName;
+    : "provided corpus";
```

```csharp
var traditionalModel = TraditionalLocalModel.CreateForTrainingCorpus(examples);

bitNetModel.Train(examples, epochs: 3);
traditionalModel.Train(examples, epochs: TraditionalLocalModel.DefaultTrainingEpochs);
```

Copilot AI Mar 21, 2026


This test trains TraditionalLocalModel for DefaultTrainingEpochs (24) just to assert perplexity is finite/positive, which can unnecessarily slow the unit test suite. Consider reducing epochs (e.g., 1-3) or using a cheaper training configuration while keeping the same assertion intent.

Suggested change

```diff
-traditionalModel.Train(examples, epochs: TraditionalLocalModel.DefaultTrainingEpochs);
+traditionalModel.Train(examples, epochs: 3);
```

```csharp
public sealed class BitNetPaperModel
{
    private const int MaxPredictionLimit = 8;
    private const double ProbabilityFloor = 1e-9d;
```

Copilot AI Mar 21, 2026


MaxPredictionLimit is declared but never used. With TreatWarningsAsErrors enabled, this will fail the build due to an unused private const/field warning. Either remove the constant or use it where the prediction list limit is applied.

Suggested change

```diff
-private const int MaxPredictionLimit = 8;
 private const double ProbabilityFloor = 1e-9d;
```
Comment on lines +240 to +254
```csharp
var totalLoss = 0d;
var totalTokens = 0;
foreach (var sample in validationSamples)
{
    var tokenIds = EncodeTokenIds(sample, appendEndToken: true);
    for (var index = 0; index < tokenIds.Count - 1; index++)
    {
        var context = tokenIds.Take(index + 1).ToArray();
        var logits = ForwardLogits(context);
        totalLoss -= Math.Log(GetTargetProbability(logits, tokenIds[index + 1]));
        totalTokens++;
    }
}

return totalTokens == 0 ? 0d : Math.Exp(totalLoss / totalTokens);
```

Copilot AI Mar 21, 2026


CalculatePerplexity reads model state (token IDs + transformer logits) without taking _gate, while Train/GenerateResponse are locked. This makes perplexity evaluation potentially race with training/inference and can produce inconsistent results. Consider wrapping the calculation in lock (_gate) (or otherwise ensuring thread safety) similar to TraditionalLocalModel.CalculatePerplexity.

Suggested change

```diff
-var totalLoss = 0d;
-var totalTokens = 0;
-foreach (var sample in validationSamples)
-{
-    var tokenIds = EncodeTokenIds(sample, appendEndToken: true);
-    for (var index = 0; index < tokenIds.Count - 1; index++)
-    {
-        var context = tokenIds.Take(index + 1).ToArray();
-        var logits = ForwardLogits(context);
-        totalLoss -= Math.Log(GetTargetProbability(logits, tokenIds[index + 1]));
-        totalTokens++;
-    }
-}
-return totalTokens == 0 ? 0d : Math.Exp(totalLoss / totalTokens);
+lock (_gate)
+{
+    var totalLoss = 0d;
+    var totalTokens = 0;
+    foreach (var sample in validationSamples)
+    {
+        var tokenIds = EncodeTokenIds(sample, appendEndToken: true);
+        for (var index = 0; index < tokenIds.Count - 1; index++)
+        {
+            var context = tokenIds.Take(index + 1).ToArray();
+            var logits = ForwardLogits(context);
+            totalLoss -= Math.Log(GetTargetProbability(logits, tokenIds[index + 1]));
+            totalTokens++;
+        }
+    }
+
+    return totalTokens == 0 ? 0d : Math.Exp(totalLoss / totalTokens);
+}
```

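For readers unfamiliar with the metric, the `Math.Exp(totalLoss / totalTokens)` expression computes perplexity as the exponential of the mean negative log-likelihood. A tiny self-contained illustration, using made-up target probabilities rather than real model output:

```csharp
using System;
using System.Linq;

// Perplexity = exp(mean negative log-likelihood over predicted tokens).
// These per-token target probabilities are invented for illustration only.
double[] targetProbabilities = { 0.5, 0.25, 0.125 };
var totalLoss = targetProbabilities.Sum(p => -Math.Log(p));
var perplexity = Math.Exp(totalLoss / targetProbabilities.Length);
Console.WriteLine(perplexity); // 4: the reciprocal of the geometric mean 0.25
```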
Comment on lines +278 to +280
```csharp
double? responseTokensPerSecond = responseMeanMilliseconds is > 0d && modelReport.BenchmarkPromptTokenCount > 0
    ? (modelReport.BenchmarkPromptTokenCount * MillisecondsPerSecond) / responseMeanMilliseconds.Value
    : null;
```

Copilot AI Mar 21, 2026


ResponseTokensPerSecond is computed using BenchmarkPromptTokenCount derived from an ad-hoc GetResponseAsync call on the (already trained) model, but the ResponseOperation mean/allocated numbers come from BenchmarkDotNet runs (HostedAgentResponseBenchmarks) that construct models without the TinyLlama training corpus and without training. This mismatch can make the reported tokens/sec and deltas inaccurate/misleading. Align the response benchmark setup with the report model setup (e.g., train once in a BenchmarkDotNet GlobalSetup using the same corpus), or compute tokens/sec from the benchmarked response output itself.

Suggested change

```diff
-double? responseTokensPerSecond = responseMeanMilliseconds is > 0d && modelReport.BenchmarkPromptTokenCount > 0
-    ? (modelReport.BenchmarkPromptTokenCount * MillisecondsPerSecond) / responseMeanMilliseconds.Value
-    : null;
+// Tokens-per-second cannot be reliably computed here because the prompt token count
+// comes from a different execution path/model configuration than the benchmark timings.
+// Leave this metric unset until it can be derived from consistent benchmark data.
+double? responseTokensPerSecond = null;
```

Comment on lines +563 to +568
```csharp
var parts = value.Split(' ', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
if (parts.Length < 2 || !double.TryParse(parts[0], out var magnitude))
{
    return null;
}
```

Copilot AI Mar 21, 2026


double.TryParse is used without an explicit culture when parsing BenchmarkDotNet output. On machines with a non-invariant current culture (e.g., comma decimal separators), this can fail to parse and silently drop metrics. Use double.TryParse(..., NumberStyles.Float, CultureInfo.InvariantCulture, out ...) (and similarly for allocated parsing).

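A culture-pinned version of the parsing above might look like the sketch below. `ParseMagnitude` is a hypothetical helper name, and "1.234 ms" is an assumed BenchmarkDotNet-style cell value; the point is only the `NumberStyles.Float` + `CultureInfo.InvariantCulture` overload.

```csharp
using System;
using System.Globalization;

Console.WriteLine(ParseMagnitude("1.234 ms")); // 1.234 regardless of current culture

// Parse the numeric magnitude from a value like "1.234 ms". Without
// InvariantCulture, a machine whose culture uses ',' as the decimal
// separator can fail to parse "1.234" and silently drop the metric.
static double? ParseMagnitude(string value)
{
    var parts = value.Split(' ', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
    return parts.Length >= 2
        && double.TryParse(parts[0], NumberStyles.Float, CultureInfo.InvariantCulture, out var magnitude)
        ? magnitude
        : (double?)null;
}
```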
