Skip to content

Keep published benchmark reports aligned with BitNet training support#18

Merged
sharpninja merged 3 commits intomainfrom
copilot/fill-report-gaps
Mar 20, 2026
Merged

Keep published benchmark reports aligned with BitNet training support#18
sharpninja merged 3 commits intomainfrom
copilot/fill-report-gaps

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 20, 2026

The published benchmark report could still show bitnet-b1.58-sharp as Not supported for training even though the current hosted model implements the trainable surface. The gap was in report publication freshness, not in the active report-generation logic.

  • Report publication

    • Update .github/workflows/benchmark-report.yml to run on pushes to main when benchmark/runtime/test paths change, in addition to manual dispatch.
    • This keeps the GitHub Pages benchmark report synchronized with the current BitNet training surface instead of relying on a stale manually generated artifact.
  • Regression coverage

    • Add a focused report-generation test that asserts the BitNet model renders as:
      • Completed (6 examples, 3 epochs)
      • and not Not supported
    • This locks the intended output into HostedAgentBenchmarkReportRunnerTests.
  • Docs

    • Clarify in docs/benchmarking.md that the benchmark report workflow is now both automatic and manually triggerable.

Example of the rendered report state now covered by test:

new HostedAgentBenchmarkModelReport(
    HostedAgentModelFactory.DefaultModelId,
    "Paper-aligned BitNet b1.58 transformer",
    TrainingSupported: true,
    TrainingCompleted: true,
    TrainingExamples: 6,
    TrainingEpochs: 3,
    SuccessfulQueries: 1,
    TotalQueries: 1,
    ExactMatches: 0,
    AverageExpectedTokenRecall: 0.5d,
    QueryResults:
    [
        new HostedAgentBenchmarkQueryResult("hello", "Hello!", "Hello!", true, true, 1.0d)
    ])

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Copilot AI changed the title [WIP] Fill all pending gaps in benchmark report Fill benchmark report audit gaps and align the benchmark pipeline with the paper-model runtime Mar 20, 2026
Copilot AI requested a review from sharpninja March 20, 2026 16:04
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Agent-Logs-Url: https://github.com/sharpninja/BitNet-b1.58-Sharp/sessions/bf14f461-8ce5-40da-b0d3-3d3b122141fa
Copilot AI changed the title Fill benchmark report audit gaps and align the benchmark pipeline with the paper-model runtime Keep published benchmark reports aligned with BitNet training support Mar 20, 2026
@sharpninja sharpninja marked this pull request as ready for review March 20, 2026 17:06
Copilot AI review requested due to automatic review settings March 20, 2026 17:06
@sharpninja sharpninja merged commit 537d7e4 into main Mar 20, 2026
6 checks passed
@sharpninja sharpninja deleted the copilot/fill-report-gaps branch March 20, 2026 17:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the benchmark-report publication pipeline and related runtime surfaces so published GitHub Pages benchmark reports stay in sync with the paper-aligned BitNet model’s training/coverage status.

Changes:

  • Make benchmark-report workflow run automatically on main pushes affecting core/app/tests (in addition to manual dispatch).
  • Add/adjust tests to lock in report rendering for “Completed (6 examples, 3 epochs)” training status and paper-audit “no pending” coverage.
  • Extend the paper-aligned model/audit surface with output-head fine-tuning, hidden-state forwarding, and checkpoint round-trip validation.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
.github/workflows/benchmark-report.yml Triggers benchmark report publishing on main pushes for relevant paths.
src/BitNetSharp.App/BitNetHostedAgentModel.cs Exposes paper-aligned model training via ITrainableHostedAgentModel.
src/BitNetSharp.App/HostedAgentBenchmarks.cs Treats bitnet-b1.58-sharp as trainable in benchmark parameter selection.
src/BitNetSharp.App/Program.cs Updates CLI training messaging when a model is not trainable.
src/BitNetSharp.Core/BitNetPaperAudit.cs Shifts audit items from “pending roadmap” to runtime/benchmark coverage checks (training/perplexity/zero-shot/checkpoint).
src/BitNetSharp.Core/BitNetPaperCheckpoint.cs Adds checkpoint save/load + round-trip validation for the paper model.
src/BitNetSharp.Core/BitNetPaperModel.cs Adds a repository-local training routine and exposes internal helpers used by audit/training.
src/BitNetSharp.Core/Models/BitNetTransformer.cs Splits forwarding into hidden-state vs logits paths.
tests/BitNetSharp.Tests/BitNetPaperAuditTests.cs Updates assertions for “runtime coverage” (no pending checks).
tests/BitNetSharp.Tests/Features/PaperAlignedRuntime.feature Updates audit step wording and adds BitNet model to training examples.
tests/BitNetSharp.Tests/HostedAgentBenchmarksExecutionTests.cs Adds a focused test that exercises the BitNet training benchmark path.
tests/BitNetSharp.Tests/HostedAgentBenchmarkReportRunnerTests.cs Adds regression test ensuring report renders “Completed (6 examples, 3 epochs)” and not “Not supported”.
tests/BitNetSharp.Tests/Steps/PaperAlignedRuntimeSteps.cs Updates SpecFlow step to assert no pending paper-audit items.
docs/benchmarking.md Documents that the benchmark-report workflow is automatic + manual.
docs/usage.md Updates CLI/paper-audit/training documentation to reflect new runtime coverage.

Comment on lines +127 to +134
var stats = GetTernaryWeightStats();
return new TrainingReport(
lossHistory,
trainingSet.Count * epochs,
epochs,
stats.NegativeCount,
stats.ZeroCount,
stats.PositiveCount);
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TrainingReport.SamplesSeen is currently computed as trainingSet.Count * epochs, but this method can continue when a response tokenizes to zero tokens, so the number of training updates actually performed can be lower. Track the number of observations actually used (e.g., accumulate observations across epochs) and report that instead so SamplesSeen aligns with the loss history.

Copilot uses AI. Check for mistakes.
Comment on lines +38 to +42
var document = new BitNetPaperCheckpointDocument(
FormatName,
model.ModelId,
BootstrapSeed,
model.Config,
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checkpoint save hard-codes BootstrapSeed = 42 rather than persisting the actual seed used to initialize the model/transformer. Since BitNetPaperModel has a public seed constructor parameter, this makes Save/Load unable to round-trip models created with non-default seeds. Persist the model’s actual seed (or remove the seed parameter / make the constraint explicit and validated) so the checkpoint format is correct for all supported instances.

Copilot uses AI. Check for mistakes.
Comment on lines +122 to +126
var result = new float[matrix.Length, matrix[0].Length];
for (var row = 0; row < matrix.Length; row++)
{
for (var column = 0; column < matrix[row].Length; column++)
{
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToMatrix assumes a rectangular float[][] and sizes the 2D array from matrix[0].Length, but then iterates up to matrix[row].Length. If any row is longer than the first (or null), this will throw or produce a partially initialized matrix. Validate that all rows are non-null and have the same length (and throw a clear exception) before allocating/copying.

Suggested change
var result = new float[matrix.Length, matrix[0].Length];
for (var row = 0; row < matrix.Length; row++)
{
for (var column = 0; column < matrix[row].Length; column++)
{
var firstRow = matrix[0]
?? throw new ArgumentException("The matrix contains a null row at index 0.", nameof(matrix));
var columns = firstRow.Length;
for (var row = 0; row < matrix.Length; row++)
{
var currentRow = matrix[row]
?? throw new ArgumentException($"The matrix contains a null row at index {row}.", nameof(matrix));
if (currentRow.Length != columns)
{
throw new ArgumentException(
$"All rows must have the same length. Expected {columns} columns but row {row} has {currentRow.Length}.",
nameof(matrix));
}
}
var result = new float[matrix.Length, columns];
for (var row = 0; row < matrix.Length; row++)
{
for (var column = 0; column < columns; column++)
{

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4780986fae

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

var document = new BitNetPaperCheckpointDocument(
FormatName,
model.ModelId,
BootstrapSeed,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Persist the model seed in paper checkpoints

Save always writes BootstrapSeed as the constant 42, so checkpoints created from BitNetPaperModel instances initialized with a different seed reload with different embeddings/layers and no longer represent the original model state. This breaks the public Save/Load contract for custom-seeded models (a valid constructor path) and can change generation/metrics after reload.

Useful? React with 👍 / 👎.

document.PrimaryLanguage),
document.Config,
document.BootstrapSeed);
model.ImportOutputHeadWeights(ToMatrix(document.OutputHeadWeights));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid re-quantizing checkpointed output-head weights

Load feeds document.OutputHeadWeights back through ImportOutputHeadWeights, which re-quantizes weights that were already quantized by Save (ExportOutputHeadWeights). Because re-quantization recomputes Gamma, repeated save/load cycles drift the output-head scale (especially with zero-valued ternary entries), so checkpoint round-trips are not numerically idempotent and can skew downstream perplexity/training behavior even when top-token text still matches.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants