Add executable BitNet paper audit and surface it in benchmark reports #17

Merged
sharpninja merged 12 commits into main from copilot/add-datagene-slm-generator
Mar 20, 2026

Copilot AI commented Mar 20, 2026

The repository was still treating broader paper-alignment gaps as passive audit notes instead of first-class tooling. This change turns those findings into an executable audit so the repo can function as a more canonical, inspectable representation of the BitNet b1.58 paper and clearly show what is implemented vs. still pending.

  • Executable paper-alignment audit

    • Added a core audit model that checks the active BitNet runtime for measurable paper invariants:
      • decoder-only topology
      • bias-free BitLinear
      • ternary weight distribution and 8-bit activation quantization
      • RmsNorm epsilon / scale-only behavior
      • RoPE on Q/K and causal attention
      • SwiGLU feed-forward structure
      • deterministic seeded inference
    • Reports unmet paper-reproduction work as explicit Pending roadmap items instead of burying them in comments/docs.
  • New CLI surface

    • Added paper-audit to the app so the canonical runtime can be inspected directly from the command line.
    • The command emits a concise pass/pending/fail summary plus per-check details.
    dotnet run --project src/BitNetSharp.App/BitNetSharp.App.csproj -- paper-audit
  • Benchmark report integration

    • Extended benchmark comparison reports to include a dedicated Paper-alignment audit section for bitnet-b1.58-sharp.
    • Markdown, HTML, and JSON outputs now carry both:
      • current hosted-model/benchmark data
      • paper-alignment status for the canonical model
  • Benchmark report correctness

    • Fixed benchmark-report to run BenchmarkDotNet from the project/repo context and copy BenchmarkDotNet.Artifacts into the selected output directory afterward.
    • This avoids project discovery failures when generating reports outside the repo root and keeps report generation usable as a canonical tool.
  • Enforced in tests

    • Added unit coverage for audit generation/formatting.
    • Extended SpecFlow coverage so the paper audit is part of the runtime contract.
    • Updated benchmark report tests to assert the new audit section is emitted.
  • Documentation

    • Updated usage and benchmarking docs to document:
      • the new paper-audit command
      • benchmark reports now carrying paper-alignment audit results

Example output shape:

Paper-alignment audit: bitnet-b1.58-sharp
Passed: 6
Pending: 4
Failed: 0

[PASS] Architecture - Decoder-only transformer topology matches the paper-aligned BitNet surface.
[PENDING] Roadmap - Perplexity parity against the paper datasets is measured in-repository.
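
As a rough illustration of how a summary in this shape could be produced, here is a minimal sketch. The `FormatAudit` function, the tuple layout, and the `PASS`/`PENDING`/`FAIL` status strings are assumptions made for this example; they are not taken from the repository's audit model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical formatter: the tuple shape and status strings are assumptions for illustration.
string FormatAudit(string modelId, IReadOnlyList<(string Status, string Area, string Requirement)> checks)
{
    // Count each status once so the header lines stay easy to read.
    var passed = checks.Count(check => check.Status == "PASS");
    var pending = checks.Count(check => check.Status == "PENDING");
    var failed = checks.Count(check => check.Status == "FAIL");

    var builder = new StringBuilder();
    builder.Append("Paper-alignment audit: ").Append(modelId).Append('\n');
    builder.Append("Passed: ").Append(passed).Append('\n');
    builder.Append("Pending: ").Append(pending).Append('\n');
    builder.Append("Failed: ").Append(failed).Append('\n');
    builder.Append('\n');

    // One line per check, in the order the checks were supplied.
    foreach (var check in checks)
    {
        builder.Append('[').Append(check.Status).Append("] ")
            .Append(check.Area).Append(" - ").Append(check.Requirement).Append('\n');
    }

    return builder.ToString();
}

var checks = new List<(string Status, string Area, string Requirement)>
{
    ("PASS", "Architecture", "Decoder-only transformer topology matches the paper-aligned BitNet surface."),
    ("PENDING", "Roadmap", "Perplexity parity against the paper datasets is measured in-repository."),
};

var report = FormatAudit("bitnet-b1.58-sharp", checks);
Console.Write(report);
```

Keeping pending items in the same list as passing checks means unmet roadmap work shows up in the same report rather than in separate notes.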

Copilot AI and others added 2 commits March 20, 2026 12:36
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add DataGen SLM general-purpose synthetic training set generator" to "Add DataGen CLI for grounded synthetic JSONL dataset generation" Mar 20, 2026
Copilot AI requested a review from sharpninja March 20, 2026 12:42
Copilot AI and others added 3 commits March 20, 2026 13:23
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Copilot AI changed the title from "Add DataGen CLI for grounded synthetic JSONL dataset generation" to "Merge PR16 DataGen baseline with template-driven prompt flow" Mar 20, 2026
@sharpninja sharpninja marked this pull request as ready for review March 20, 2026 13:57
Copilot AI review requested due to automatic review settings March 20, 2026 13:57

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d1ce4dd5be

Comment on lines +54 to +56
ReadOptionalOption(args, "--task-type")?.Trim() ?? DefaultTaskType,
inlineConstraints.Concat(csvConstraints).Distinct(StringComparer.OrdinalIgnoreCase).ToArray(),
string.IsNullOrWhiteSpace(seedsPath) ? null : Path.GetFullPath(seedsPath),

P1: Make --seeds truly optional in option parsing

This new nullable SeedsPath branch does not actually enable seedless runs because Parse still calls ReadRequiredOption(args, "--seeds") earlier in the method. As a result, datagen without --seeds fails before LoadSeeds can synthesize a fallback seed, so the documented seedless workflow is unreachable.
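
A minimal sketch of what a truly optional `--seeds` read could look like follows. The `ReadOptionalOption` body here is a hypothetical stand-in that assumes the `--name=value` argument form seen in the tests, and `ResolveSeedsPath` is an illustrative name that does not appear in the PR.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical stand-in for the repository's ReadOptionalOption: returns the value of
// "--name=value", or null when the option is absent.
string? ReadOptionalOption(string[] args, string name)
{
    var prefix = name + "=";
    var match = args.FirstOrDefault(arg => arg.StartsWith(prefix, StringComparison.Ordinal));
    return match is null ? null : match.Substring(prefix.Length);
}

// Illustrative helper: normalizes the path only when one was supplied.
// A null result signals a seedless run, so the caller can synthesize a fallback seed.
string? ResolveSeedsPath(string[] args)
{
    var seedsPath = ReadOptionalOption(args, "--seeds");
    return string.IsNullOrWhiteSpace(seedsPath) ? null : Path.GetFullPath(seedsPath);
}

var seedlessResult = ResolveSeedsPath(new[] { "datagen", "--domain=education" });
Console.WriteLine(seedlessResult is null ? "no seeds supplied" : seedlessResult);
```

The point of the sketch is that no required read of `--seeds` happens anywhere on the path, so a missing flag reaches the fallback logic instead of throwing during parsing.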


ReadOptionalOption(args, "--task-type")?.Trim() ?? DefaultTaskType,
inlineConstraints.Concat(csvConstraints).Distinct(StringComparer.OrdinalIgnoreCase).ToArray(),
string.IsNullOrWhiteSpace(seedsPath) ? null : Path.GetFullPath(seedsPath),
ReadOptionalOption(args, "--output-schema")?.Trim() ?? DefaultOutputSchema,

P2: Honor template defaultOutputSchema when flag is omitted

Setting OutputSchema to DefaultOutputSchema at parse time means options.OutputSchema is never blank, so DataGenPromptTemplate.RenderPrompt never falls back to template.DefaultOutputSchema. Custom templates therefore cannot provide their own default schema unless every caller explicitly passes --output-schema, which undercuts the template-driven behavior added in this change.
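
One way to restore the fallback is to leave the parsed option unset when the flag is omitted and resolve the default late, at render time. The sketch below assumes that approach; `ResolveOutputSchema` and its parameter names are hypothetical and only illustrate the precedence order.

```csharp
using System;

// Hypothetical resolver: the explicit flag wins, then the template default,
// then the built-in default.
string ResolveOutputSchema(string? cliSchema, string? templateDefaultSchema, string builtInDefaultSchema)
{
    if (!string.IsNullOrWhiteSpace(cliSchema))
    {
        return cliSchema.Trim();
    }

    if (!string.IsNullOrWhiteSpace(templateDefaultSchema))
    {
        return templateDefaultSchema.Trim();
    }

    return builtInDefaultSchema;
}

// With no --output-schema flag, the template's own default is used.
var resolvedSchema = ResolveOutputSchema(null, "{\"question\":\"string\"}", "{\"instruction\":\"string\"}");
Console.WriteLine(resolvedSchema);
```

Applying the built-in default at parse time collapses the first two cases into one, which is the behavior the review comment describes.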


Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

This PR unifies the DataGen implementation path by keeping the PR16 core generator/CLI surface while adding template-driven prompt rendering, richer JSONL metadata, and quality gating, plus updating docs/tests to match the merged behavior.

Changes:

  • Extends DataGenCommandOptions/DataGenCommand to support templates, constraints, quality filtering, and additional JSONL metadata fields.
  • Adds a default DataGen prompt template (templates/datagen/default.json) and runtime template loader/renderer.
  • Updates and adds tests + docs to cover the merged DataGen option/output semantics (including seedless runs).

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| tests/BitNetSharp.Tests/DataGenTests.cs | Adds new tests for extended option parsing, merged metadata emission, and seedless DataGen runs. |
| tests/BitNetSharp.Tests/BitNetModelTests.cs | Adjusts assertion to account for normalized/absolute LoRA paths. |
| templates/datagen/default.json | Introduces the default DataGen prompt template and default output schema hint. |
| src/BitNetSharp.App/DataGenPromptTemplate.cs | Adds template loading + placeholder expansion to render merged prompts. |
| src/BitNetSharp.App/DataGenDatasetEntry.cs | Defines the expanded JSONL contract with camelCase metadata fields. |
| src/BitNetSharp.App/DataGenCommand.cs | Merges option parsing, prompt rendering, quality scoring/filtering, and seed fallback behavior. |
| src/BitNetSharp.App/BitNetSharp.App.csproj | Copies DataGen templates into build output for runtime availability. |
| docs/usage.md | Updates documented DataGen CLI usage and emitted metadata fields. |
| docs/datagen-guide.md | Expands DataGen docs for merged options/schema/templates/quality controls. |

Comment on lines +315 to +319
var completenessScore = !string.IsNullOrWhiteSpace(prompt) && !string.IsNullOrWhiteSpace(response) ? 1d : 0d;
var diversityScore = ComputeDiversityScore(response, acceptedEntries);
return Math.Round((completenessScore * SchemaWeight) + (consistencyScore * ConsistencyWeight) + (diversityScore * DiversityWeight), 4);
}

Copilot AI Mar 20, 2026

The quality scoring comment/weights mention “schema validity,” but completenessScore only checks that prompt and response are non-empty, and response isn’t otherwise used in scoring. Either implement an actual schema/format check aligned with --output-schema (or expected JSON fields), or rename the score components/comments so the metric matches what is actually computed.

Suggested change
-        var completenessScore = !string.IsNullOrWhiteSpace(prompt) && !string.IsNullOrWhiteSpace(response) ? 1d : 0d;
-        var diversityScore = ComputeDiversityScore(response, acceptedEntries);
-        return Math.Round((completenessScore * SchemaWeight) + (consistencyScore * ConsistencyWeight) + (diversityScore * DiversityWeight), 4);
-    }
+        var completenessScore = ComputeSchemaValidityScore(prompt, response, options);
+        var diversityScore = ComputeDiversityScore(response, acceptedEntries);
+        return Math.Round((completenessScore * SchemaWeight) + (consistencyScore * ConsistencyWeight) + (diversityScore * DiversityWeight), 4);
+    }
+    private static double ComputeSchemaValidityScore(string prompt, string response, DataGenCommandOptions options)
+    {
+        // Preserve existing behavior: if either prompt or response are empty, the score is 0.
+        if (string.IsNullOrWhiteSpace(prompt) || string.IsNullOrWhiteSpace(response))
+        {
+            return 0d;
+        }
+        // If no output schema is configured, treat a non-empty prompt/response as fully complete.
+        if (string.IsNullOrWhiteSpace(options.OutputSchema))
+        {
+            return 1d;
+        }
+        try
+        {
+            using var schemaDocument = JsonDocument.Parse(options.OutputSchema);
+            using var responseDocument = JsonDocument.Parse(response);
+            if (schemaDocument.RootElement.ValueKind != JsonValueKind.Object ||
+                responseDocument.RootElement.ValueKind != JsonValueKind.Object)
+            {
+                return 0d;
+            }
+            var requiredProperties = schemaDocument.RootElement
+                .EnumerateObject()
+                .Select(property => property.Name)
+                .ToArray();
+            var responseRoot = responseDocument.RootElement;
+            var hasAllRequired = requiredProperties.All(name => responseRoot.TryGetProperty(name, out _));
+            return hasAllRequired ? 1d : 0d;
+        }
+        catch (JsonException)
+        {
+            // If either the schema or the response are not valid JSON, treat it as schema-invalid.
+            return 0d;
+        }
+    }

if (acceptedEntries.Count < options.Count)
{
throw new InvalidOperationException(
$"DataGen could only accept {acceptedEntries.Count} examples after evaluating {candidateTarget} candidates. Lower --min-quality, increase --candidate-count, or add seeds/constraints.");

Copilot AI Mar 20, 2026

The failure message suggests increasing --candidate-count to get more accepted examples, but --candidate-count only changes how many probe generations are used inside the quality score (and with the current deterministic model it won’t help at all). Consider suggesting flags that actually increase acceptance odds (e.g., lower --min-quality, add seeds/constraints, or increase the candidate pool size).

Suggested change
-    $"DataGen could only accept {acceptedEntries.Count} examples after evaluating {candidateTarget} candidates. Lower --min-quality, increase --candidate-count, or add seeds/constraints.");
+    $"DataGen could only accept {acceptedEntries.Count} examples after evaluating {candidateTarget} candidates. Lower --min-quality, add seeds/constraints, or increase the candidate pool size.");

Comment on lines +21 to +41
"--seeds=/tmp/seeds.json",
"--output-schema={\"instruction\":\"string\",\"response\":\"string\"}",
"--template=/tmp/template.json",
"--lora=/tmp/domain-lora.bin",
"--candidate-count=5",
"--min-quality=0.7",
"--max-tokens=64"
]);

Assert.Equal("code-review", options.Domain);
Assert.Equal(2, options.Count);
Assert.EndsWith(Path.Combine("data", "code-review.jsonl"), options.OutputPath, StringComparison.Ordinal);
Assert.Equal("qa", options.TaskType);
Assert.Equal(["Use American English", "Grounded", "Diverse"], options.Constraints);
Assert.Equal("/tmp/seeds.json", options.SeedsPath);
Assert.Contains("\"instruction\"", options.OutputSchema, StringComparison.Ordinal);
Assert.Equal("/tmp/template.json", options.TemplatePath);
Assert.Equal("/tmp/domain-lora.bin", options.LoraPath);
Assert.Equal(5, options.CandidateCount);
Assert.Equal(0.7d, options.MinimumQualityScore);
Assert.Equal(64, options.MaxOutputTokens);

Copilot AI Mar 20, 2026

These tests hardcode /tmp/... paths and assert exact equality on normalized paths. This will fail on Windows runners (and any environment where /tmp isn’t a valid rooted path). Use Path.GetTempPath()/Path.Combine(...) for temp files and prefer EndsWith(Path.Combine(...)) or compare against Path.GetFullPath(...) results.
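
A small sketch of the portable pattern, assuming the parser stores `Path.GetFullPath(...)` of the supplied value as the earlier diff context suggests. The file name and the inline parsing step are illustrative only; a real test would call the actual option parser and an assertion library.

```csharp
using System;
using System.IO;

// Build the fixture path from the platform's temp directory instead of hardcoding /tmp.
var seedsFixturePath = Path.Combine(Path.GetTempPath(), "seeds.json");
var seedsArgument = $"--seeds={seedsFixturePath}";

// Simulate what a parser that normalizes paths would store (an assumption for this sketch).
var parsedSeedsPath = Path.GetFullPath(seedsArgument.Substring("--seeds=".Length));

// Compare against the same normalization rather than against a literal string.
var expectedSeedsPath = Path.GetFullPath(seedsFixturePath);
Console.WriteLine(string.Equals(expectedSeedsPath, parsedSeedsPath, StringComparison.Ordinal));
```

Because both sides go through `Path.GetFullPath`, the comparison holds on Windows and on Unix-like runners without assuming that `/tmp` exists or is rooted.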

Comment on lines +130 to +139
await DataGenCommand.RunAsync(
[
"datagen",
"--domain=education",
"--count=1",
$"--output={outputPath}",
"--task-type=classification",
"--constraint=Use American English"
],
VerbosityLevel.Quiet);

Copilot AI Mar 20, 2026

DataGenCommandCanGenerateWithoutExplicitSeeds relies on seedless runs, but DataGenCommandOptions.Parse currently requires --seeds (and also has a duplicate seedsPath local). This test will fail until --seeds is truly optional and parsing compiles.

- Filter or deduplicate generated samples before fine-tuning if your target pipeline requires stricter curation.
The current implementation applies lightweight quality scoring to every accepted example:

1. prompt/response schema validation

Copilot AI Mar 20, 2026

This section claims the implementation performs “prompt/response schema validation,” but the current ComputeQualityScore logic only checks for non-empty prompt/response and doesn’t validate against --output-schema (or any structured schema) before accepting. Either implement schema/format validation or adjust this doc wording to match the current behavior.

Suggested change
- 1. prompt/response schema validation
+ 1. basic prompt/response presence checks

Comment on lines +46 to 49
var csvConstraints = SplitCsvOption(ReadOptionalOption(args, "--constraints"));
var seedsPath = ReadOptionalOption(args, "--seeds");

return new DataGenCommandOptions(

Copilot AI Mar 20, 2026

seedsPath is declared twice in the same scope (ReadRequiredOption earlier in the method and ReadOptionalOption again here), which will not compile. Also, keeping the required --seeds read conflicts with the seedless-run behavior (fallback seeds) exercised by DataGenTests; consider removing the required --seeds read and making seeds truly optional, letting DataGenCommand.LoadSeeds synthesize a fallback when missing.

Comment on lines +174 to +184
{
if (string.IsNullOrWhiteSpace(value))
{
return defaultValue;
}

if (double.TryParse(value, out var parsed))
{
return parsed;
}

Copilot AI Mar 20, 2026

ParseDouble uses double.TryParse without an explicit culture, so --min-quality=0.7 can fail on locales that use a comma decimal separator. Also consider rejecting NaN explicitly (NaN will currently bypass the 0..1 range check). Parse with CultureInfo.InvariantCulture (and appropriate NumberStyles) and validate double.IsFinite(...).
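
The suggestion can be sketched as follows. `ParseUnitInterval` is a hypothetical name, and the 0..1 range check is assumed from the comment above rather than copied from the repository's `ParseDouble`.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not the repository's ParseDouble: parses a 0..1 score
// the same way on every locale and rejects NaN and infinities.
double ParseUnitInterval(string? value, double defaultValue, string optionName)
{
    if (string.IsNullOrWhiteSpace(value))
    {
        return defaultValue;
    }

    // InvariantCulture makes "0.7" parse identically regardless of the machine's locale.
    // The IsFinite check matters because NaN compares false against both range bounds.
    if (!double.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture, out var parsed)
        || !double.IsFinite(parsed)
        || parsed < 0d
        || parsed > 1d)
    {
        throw new ArgumentException($"{optionName} must be a finite number between 0 and 1.");
    }

    return parsed;
}

var minimumQuality = ParseUnitInterval("0.7", 0.5, "--min-quality");
Console.WriteLine(minimumQuality.ToString(CultureInfo.InvariantCulture));
```

Without the explicit `double.IsFinite` check, an input of `NaN` would pass a plain `parsed < 0 || parsed > 1` test, since both comparisons evaluate to false for NaN.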

Comment on lines +304 to +314
var candidates = Enumerable.Range(0, options.CandidateCount)
.Select(_ => model.GenerateResponse(prompt, options.MaxOutputTokens ?? DefaultQualityProbeMaxTokens).ResponseText)
.ToArray();

var majorityCount = candidates
.GroupBy(candidate => candidate, StringComparer.Ordinal)
.Select(group => group.Count())
.DefaultIfEmpty(0)
.Max();

var consistencyScore = majorityCount / (double)candidates.Length;

Copilot AI Mar 20, 2026

ComputeQualityScore’s self-consistency signal is currently always 1.0 because BitNetPaperModel.GenerateResponse is deterministic (seeded transformer + top-k ranking), so the --candidate-count option won’t change the score or filtering behavior. Either remove this dimension/option or change the scoring to measure something that can vary (e.g., compare the generated response against the model probe output, or introduce controlled sampling).

Suggested change
-        var candidates = Enumerable.Range(0, options.CandidateCount)
-            .Select(_ => model.GenerateResponse(prompt, options.MaxOutputTokens ?? DefaultQualityProbeMaxTokens).ResponseText)
-            .ToArray();
-        var majorityCount = candidates
-            .GroupBy(candidate => candidate, StringComparer.Ordinal)
-            .Select(group => group.Count())
-            .DefaultIfEmpty(0)
-            .Max();
-        var consistencyScore = majorityCount / (double)candidates.Length;
+        // Compute a consistency signal by comparing the provided response to
+        // a fresh baseline response from the model for the same prompt.
+        var baselineResponse = model
+            .GenerateResponse(prompt, options.MaxOutputTokens ?? DefaultQualityProbeMaxTokens)
+            .ResponseText;
+        var baselineTerms = Tokenize(baselineResponse);
+        var responseTerms = Tokenize(response);
+        var consistencyScore = ComputeJaccardSimilarity(baselineTerms, responseTerms);
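
The suggestion references `Tokenize` and `ComputeJaccardSimilarity` without showing them. A minimal sketch of what such helpers might look like is below; the tokenization rule (lowercase, split on non-alphanumeric characters) and the handling of two empty sets are assumptions for this example, not the repository's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: lowercases the text and splits it on non-alphanumeric characters.
HashSet<string> Tokenize(string text)
{
    var normalized = new string(text
        .Select(character => char.IsLetterOrDigit(character) ? char.ToLowerInvariant(character) : ' ')
        .ToArray());
    var terms = normalized.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    return new HashSet<string>(terms, StringComparer.Ordinal);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|. Two empty sets are treated as identical here,
// which is a design choice for this sketch.
double ComputeJaccardSimilarity(HashSet<string> left, HashSet<string> right)
{
    if (left.Count == 0 && right.Count == 0)
    {
        return 1d;
    }

    var intersectionSize = left.Count(term => right.Contains(term));
    var unionSize = left.Count + right.Count - intersectionSize;
    return intersectionSize / (double)unionSize;
}

// Two shared terms ("the", "cat") out of four distinct terms overall.
var similarity = ComputeJaccardSimilarity(Tokenize("The cat sat."), Tokenize("the cat ran"));
Console.WriteLine(similarity);
```

Unlike the majority-vote signal, this score can vary between 0 and 1 even when the model is deterministic, because it depends on how the stored response differs from the baseline.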

Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Copilot AI changed the title from "Merge PR16 DataGen baseline with template-driven prompt flow" to "Fix PR build failure in DataGen option parsing" Mar 20, 2026
Copilot AI requested a review from sharpninja March 20, 2026 14:09
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Copilot AI changed the title from "Fix PR build failure in DataGen option parsing" to "Close paper-alignment test gaps in SpecFlow and hosted benchmarks" Mar 20, 2026
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Copilot stopped work on behalf of sharpninja due to an error March 20, 2026 15:08
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Copilot AI changed the title from "Close paper-alignment test gaps in SpecFlow and hosted benchmarks" to "Add executable BitNet paper audit and surface it in benchmark reports" Mar 20, 2026
@sharpninja sharpninja merged commit 1a7605d into main Mar 20, 2026
2 checks passed
@sharpninja sharpninja deleted the copilot/add-datagene-slm-generator branch March 20, 2026 15:36