Add executable BitNet paper audit and surface it in benchmark reports#17
Add executable BitNet paper audit and surface it in benchmark reports#17sharpninja merged 12 commits intomainfrom
Conversation
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d1ce4dd5be
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ReadOptionalOption(args, "--task-type")?.Trim() ?? DefaultTaskType, | ||
| inlineConstraints.Concat(csvConstraints).Distinct(StringComparer.OrdinalIgnoreCase).ToArray(), | ||
| string.IsNullOrWhiteSpace(seedsPath) ? null : Path.GetFullPath(seedsPath), |
There was a problem hiding this comment.
Make --seeds truly optional in option parsing
This new nullable SeedsPath branch does not actually enable seedless runs because Parse still calls ReadRequiredOption(args, "--seeds") earlier in the method. As a result, datagen without --seeds fails before LoadSeeds can synthesize a fallback seed, so the documented seedless workflow is unreachable.
Useful? React with 👍 / 👎.
| ReadOptionalOption(args, "--task-type")?.Trim() ?? DefaultTaskType, | ||
| inlineConstraints.Concat(csvConstraints).Distinct(StringComparer.OrdinalIgnoreCase).ToArray(), | ||
| string.IsNullOrWhiteSpace(seedsPath) ? null : Path.GetFullPath(seedsPath), | ||
| ReadOptionalOption(args, "--output-schema")?.Trim() ?? DefaultOutputSchema, |
There was a problem hiding this comment.
Honor template defaultOutputSchema when flag is omitted
Setting OutputSchema to DefaultOutputSchema at parse time means options.OutputSchema is never blank, so DataGenPromptTemplate.RenderPrompt never falls back to template.DefaultOutputSchema. Custom templates therefore cannot provide their own default schema unless every caller explicitly passes --output-schema, which undercuts the template-driven behavior added in this change.
Useful? React with 👍 / 👎.
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR unifies the DataGen implementation path by keeping the PR16 core generator/CLI surface while adding template-driven prompt rendering, richer JSONL metadata, and quality gating, plus updating docs/tests to match the merged behavior.
Changes:
- Extends
DataGenCommandOptions/DataGenCommandto support templates, constraints, quality filtering, and additional JSONL metadata fields. - Adds a default DataGen prompt template (
templates/datagen/default.json) and runtime template loader/renderer. - Updates and adds tests + docs to cover the merged DataGen option/output semantics (including seedless runs).
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/BitNetSharp.Tests/DataGenTests.cs | Adds new tests for extended option parsing, merged metadata emission, and seedless DataGen runs. |
| tests/BitNetSharp.Tests/BitNetModelTests.cs | Adjusts assertion to account for normalized/absolute LoRA paths. |
| templates/datagen/default.json | Introduces the default DataGen prompt template and default output schema hint. |
| src/BitNetSharp.App/DataGenPromptTemplate.cs | Adds template loading + placeholder expansion to render merged prompts. |
| src/BitNetSharp.App/DataGenDatasetEntry.cs | Defines the expanded JSONL contract with camelCase metadata fields. |
| src/BitNetSharp.App/DataGenCommand.cs | Merges option parsing, prompt rendering, quality scoring/filtering, and seed fallback behavior. |
| src/BitNetSharp.App/BitNetSharp.App.csproj | Copies DataGen templates into build output for runtime availability. |
| docs/usage.md | Updates documented DataGen CLI usage and emitted metadata fields. |
| docs/datagen-guide.md | Expands DataGen docs for merged options/schema/templates/quality controls. |
| var completenessScore = !string.IsNullOrWhiteSpace(prompt) && !string.IsNullOrWhiteSpace(response) ? 1d : 0d; | ||
| var diversityScore = ComputeDiversityScore(response, acceptedEntries); | ||
| return Math.Round((completenessScore * SchemaWeight) + (consistencyScore * ConsistencyWeight) + (diversityScore * DiversityWeight), 4); | ||
| } | ||
|
|
There was a problem hiding this comment.
The quality scoring comment/weights mention “schema validity,” but completenessScore only checks that prompt and response are non-empty, and response isn’t otherwise used in scoring. Either implement an actual schema/format check aligned with --output-schema (or expected JSON fields), or rename the score components/comments so the metric matches what is actually computed.
| var completenessScore = !string.IsNullOrWhiteSpace(prompt) && !string.IsNullOrWhiteSpace(response) ? 1d : 0d; | |
| var diversityScore = ComputeDiversityScore(response, acceptedEntries); | |
| return Math.Round((completenessScore * SchemaWeight) + (consistencyScore * ConsistencyWeight) + (diversityScore * DiversityWeight), 4); | |
| } | |
| var completenessScore = ComputeSchemaValidityScore(prompt, response, options); | |
| var diversityScore = ComputeDiversityScore(response, acceptedEntries); | |
| return Math.Round((completenessScore * SchemaWeight) + (consistencyScore * ConsistencyWeight) + (diversityScore * DiversityWeight), 4); | |
| } | |
| private static double ComputeSchemaValidityScore(string prompt, string response, DataGenCommandOptions options) | |
| { | |
| // Preserve existing behavior: if either prompt or response are empty, the score is 0. | |
| if (string.IsNullOrWhiteSpace(prompt) || string.IsNullOrWhiteSpace(response)) | |
| { | |
| return 0d; | |
| } | |
| // If no output schema is configured, treat a non-empty prompt/response as fully complete. | |
| if (string.IsNullOrWhiteSpace(options.OutputSchema)) | |
| { | |
| return 1d; | |
| } | |
| try | |
| { | |
| using var schemaDocument = JsonDocument.Parse(options.OutputSchema); | |
| using var responseDocument = JsonDocument.Parse(response); | |
| if (schemaDocument.RootElement.ValueKind != JsonValueKind.Object || | |
| responseDocument.RootElement.ValueKind != JsonValueKind.Object) | |
| { | |
| return 0d; | |
| } | |
| var requiredProperties = schemaDocument.RootElement | |
| .EnumerateObject() | |
| .Select(property => property.Name) | |
| .ToArray(); | |
| var responseRoot = responseDocument.RootElement; | |
| var hasAllRequired = requiredProperties.All(name => responseRoot.TryGetProperty(name, out _)); | |
| return hasAllRequired ? 1d : 0d; | |
| } | |
| catch (JsonException) | |
| { | |
| // If either the schema or the response are not valid JSON, treat it as schema-invalid. | |
| return 0d; | |
| } | |
| } |
| if (acceptedEntries.Count < options.Count) | ||
| { | ||
| throw new InvalidOperationException( | ||
| $"DataGen could only accept {acceptedEntries.Count} examples after evaluating {candidateTarget} candidates. Lower --min-quality, increase --candidate-count, or add seeds/constraints."); |
There was a problem hiding this comment.
The failure message suggests increasing --candidate-count to get more accepted examples, but --candidate-count only changes how many probe generations are used inside the quality score (and with the current deterministic model it won’t help at all). Consider suggesting flags that actually increase acceptance odds (e.g., lower --min-quality, add seeds/constraints, or increase the candidate pool size).
| $"DataGen could only accept {acceptedEntries.Count} examples after evaluating {candidateTarget} candidates. Lower --min-quality, increase --candidate-count, or add seeds/constraints."); | |
| $"DataGen could only accept {acceptedEntries.Count} examples after evaluating {candidateTarget} candidates. Lower --min-quality, add seeds/constraints, or increase the candidate pool size."); |
| "--seeds=/tmp/seeds.json", | ||
| "--output-schema={\"instruction\":\"string\",\"response\":\"string\"}", | ||
| "--template=/tmp/template.json", | ||
| "--lora=/tmp/domain-lora.bin", | ||
| "--candidate-count=5", | ||
| "--min-quality=0.7", | ||
| "--max-tokens=64" | ||
| ]); | ||
|
|
||
| Assert.Equal("code-review", options.Domain); | ||
| Assert.Equal(2, options.Count); | ||
| Assert.EndsWith(Path.Combine("data", "code-review.jsonl"), options.OutputPath, StringComparison.Ordinal); | ||
| Assert.Equal("qa", options.TaskType); | ||
| Assert.Equal(["Use American English", "Grounded", "Diverse"], options.Constraints); | ||
| Assert.Equal("/tmp/seeds.json", options.SeedsPath); | ||
| Assert.Contains("\"instruction\"", options.OutputSchema, StringComparison.Ordinal); | ||
| Assert.Equal("/tmp/template.json", options.TemplatePath); | ||
| Assert.Equal("/tmp/domain-lora.bin", options.LoraPath); | ||
| Assert.Equal(5, options.CandidateCount); | ||
| Assert.Equal(0.7d, options.MinimumQualityScore); | ||
| Assert.Equal(64, options.MaxOutputTokens); |
There was a problem hiding this comment.
These tests hardcode /tmp/... paths and assert exact equality on normalized paths. This will fail on Windows runners (and any environment where /tmp isn’t a valid rooted path). Use Path.GetTempPath()/Path.Combine(...) for temp files and prefer EndsWith(Path.Combine(...)) or compare against Path.GetFullPath(...) results.
| await DataGenCommand.RunAsync( | ||
| [ | ||
| "datagen", | ||
| "--domain=education", | ||
| "--count=1", | ||
| $"--output={outputPath}", | ||
| "--task-type=classification", | ||
| "--constraint=Use American English" | ||
| ], | ||
| VerbosityLevel.Quiet); |
There was a problem hiding this comment.
DataGenCommandCanGenerateWithoutExplicitSeeds relies on seedless runs, but DataGenCommandOptions.Parse currently requires --seeds (and also has a duplicate seedsPath local). This test will fail until --seeds is truly optional and parsing compiles.
| - Filter or deduplicate generated samples before fine-tuning if your target pipeline requires stricter curation. | ||
| The current implementation applies lightweight quality scoring to every accepted example: | ||
|
|
||
| 1. prompt/response schema validation |
There was a problem hiding this comment.
This section claims the implementation performs “prompt/response schema validation,” but the current ComputeQualityScore logic only checks for non-empty prompt/response and doesn’t validate against --output-schema (or any structured schema) before accepting. Either implement schema/format validation or adjust this doc wording to match the current behavior.
| 1. prompt/response schema validation | |
| 1. basic prompt/response presence checks |
| var csvConstraints = SplitCsvOption(ReadOptionalOption(args, "--constraints")); | ||
| var seedsPath = ReadOptionalOption(args, "--seeds"); | ||
|
|
||
| return new DataGenCommandOptions( |
There was a problem hiding this comment.
seedsPath is declared twice in the same scope (ReadRequiredOption earlier in the method and ReadOptionalOption again here), which will not compile. Also, keeping the required --seeds read conflicts with the seedless-run behavior (fallback seeds) exercised by DataGenTests; consider removing the required --seeds read and making seeds truly optional, letting DataGenCommand.LoadSeeds synthesize a fallback when missing.
| { | ||
| if (string.IsNullOrWhiteSpace(value)) | ||
| { | ||
| return defaultValue; | ||
| } | ||
|
|
||
| if (double.TryParse(value, out var parsed)) | ||
| { | ||
| return parsed; | ||
| } | ||
|
|
There was a problem hiding this comment.
ParseDouble uses double.TryParse without an explicit culture, so --min-quality=0.7 can fail on locales that use a comma decimal separator. Also consider rejecting NaN explicitly (NaN will currently bypass the 0..1 range check). Parse with CultureInfo.InvariantCulture (and appropriate NumberStyles) and validate double.IsFinite(...).
| var candidates = Enumerable.Range(0, options.CandidateCount) | ||
| .Select(_ => model.GenerateResponse(prompt, options.MaxOutputTokens ?? DefaultQualityProbeMaxTokens).ResponseText) | ||
| .ToArray(); | ||
|
|
||
| var majorityCount = candidates | ||
| .GroupBy(candidate => candidate, StringComparer.Ordinal) | ||
| .Select(group => group.Count()) | ||
| .DefaultIfEmpty(0) | ||
| .Max(); | ||
|
|
||
| var consistencyScore = majorityCount / (double)candidates.Length; |
There was a problem hiding this comment.
ComputeQualityScore’s self-consistency signal is currently always 1.0 because BitNetPaperModel.GenerateResponse is deterministic (seeded transformer + top-k ranking), so the --candidate-count option won’t change the score or filtering behavior. Either remove this dimension/option or change the scoring to measure something that can vary (e.g., compare the generated response against the model probe output, or introduce controlled sampling).
| var candidates = Enumerable.Range(0, options.CandidateCount) | |
| .Select(_ => model.GenerateResponse(prompt, options.MaxOutputTokens ?? DefaultQualityProbeMaxTokens).ResponseText) | |
| .ToArray(); | |
| var majorityCount = candidates | |
| .GroupBy(candidate => candidate, StringComparer.Ordinal) | |
| .Select(group => group.Count()) | |
| .DefaultIfEmpty(0) | |
| .Max(); | |
| var consistencyScore = majorityCount / (double)candidates.Length; | |
| // Compute a consistency signal by comparing the provided response to | |
| // a fresh baseline response from the model for the same prompt. | |
| var baselineResponse = model | |
| .GenerateResponse(prompt, options.MaxOutputTokens ?? DefaultQualityProbeMaxTokens) | |
| .ResponseText; | |
| var baselineTerms = Tokenize(baselineResponse); | |
| var responseTerms = Tokenize(response); | |
| var consistencyScore = ComputeJaccardSimilarity(baselineTerms, responseTerms); |
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
The repository was still treating broader paper-alignment gaps as passive audit notes instead of first-class tooling. This change turns those findings into an executable audit so the repo can function as a more canonical, inspectable representation of the BitNet b1.58 paper and clearly show what is implemented vs. still pending.
Executable paper-alignment audit
BitLinearRmsNormepsilon / scale-only behaviorSwiGLUfeed-forward structurePendingroadmap items instead of burying them in comments/docs.New CLI surface
paper-auditto the app so the canonical runtime can be inspected directly from the command line.Benchmark report integration
bitnet-b1.58-sharp.Benchmark report correctness
benchmark-reportto run BenchmarkDotNet from the project/repo context and copyBenchmarkDotNet.Artifactsinto the selected output directory afterward.Enforced in tests
Documentation
paper-auditcommandExample output shape:
🔒 GitHub Advanced Security automatically protects Copilot coding agent pull requests. You can protect all pull requests by enabling Advanced Security for your repositories. Learn more about Advanced Security.