Commit bcf4a3f

Merge pull request #16 from sharpninja/copilot/add-datagen-synthetic-data-generator
Add DataGen synthetic dataset generation to the BitNet b1.58 CLI
2 parents 5f1928a + f2f70a7 commit bcf4a3f

10 files changed: +611 -5 lines changed

docs/README.md

Lines changed: 8 additions & 5 deletions
@@ -8,24 +8,27 @@ BitNet b1.58 Sharp is a .NET 10 C# reference implementation of the paper-aligned
 - A decoder-only transformer implementation with `BitLinear`, `RmsNorm`, RoPE, causal attention, SwiGLU, and `BitNetTransformer`
 - Microsoft Agent Framework-oriented hosting in `/src/BitNetSharp.App`
 - BenchmarkDotNet-based local model comparison in `/src/BitNetSharp.App`
+- DataGen synthetic dataset generation from JSON seed examples
 - Default American English interaction behavior
 - Seeded transformer inspection and ternary weight summaries
 - GitBook-formatted project documentation in `/docs`

 ## Quick start

 ```bash
-dotnet build /home/runner/work/BitNet-b1.58-Sharp/BitNet-b1.58-Sharp/BitNet-b1.58-Sharp.slnx
-dotnet run --project /home/runner/work/BitNet-b1.58-Sharp/BitNet-b1.58-Sharp/src/BitNetSharp.App/BitNetSharp.App.csproj -- chat "hello"
-dotnet run --project /home/runner/work/BitNet-b1.58-Sharp/BitNet-b1.58-Sharp/src/BitNetSharp.App/BitNetSharp.App.csproj -- visualize
-dotnet test /home/runner/work/BitNet-b1.58-Sharp/BitNet-b1.58-Sharp/BitNet-b1.58-Sharp.slnx
+dotnet build BitNet-b1.58-Sharp.slnx
+dotnet run --project src/BitNetSharp.App/BitNetSharp.App.csproj -- chat "hello"
+dotnet run --project src/BitNetSharp.App/BitNetSharp.App.csproj -- datagen --domain "customer-support" --count 10 --seeds examples/seed-examples.json --output data/customer-support.jsonl
+dotnet run --project src/BitNetSharp.App/BitNetSharp.App.csproj -- visualize
+dotnet test BitNet-b1.58-Sharp.slnx
 ```

 ## Documentation map

 - [Architecture](architecture.md)
 - [Benchmarking and model comparison](benchmarking.md)
-- [Implementation plan](implementation-plan.md)
+- [DataGen guide](datagen-guide.md)
+- [Implementation plan](implementation-plan-v3.md)
 - [Releases and packaging](releases-and-packaging.md)
 - [Usage](usage.md)
 - [Training and visualization](training-and-visualization.md)

docs/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@

 - [BitNet b1.58 Sharp](README.md)
 - [Architecture](architecture.md)
+- [DataGen guide](datagen-guide.md)
 - [Implementation plan v3 (active)](implementation-plan-v3.md)
 - [Implementation plan v2 (archived)](implementation-plan-v2.md)
 - [Implementation plan v1 (archived)](implementation-plan-v1.md)

docs/architecture.md

Lines changed: 3 additions & 0 deletions
@@ -7,6 +7,7 @@ BitNet b1.58 Sharp targets the paper-aligned BitNet b1.58 decoder-only transform
 `/src/BitNetSharp.Core` contains the paper-model runtime and transformer building blocks:

 - `BitNetPaperModel` wraps tokenizer state, the seeded transformer, and next-token inspection output
+- `DataGenGenerator` expands JSON seed examples into synthetic JSONL records while recording BitNet model provenance
 - `VerbosityLevel` exposes exactly three interaction levels: `Quiet`, `Normal`, and `Verbose`
 - `BitLinear` implements absmean-scaled ternary projections with signed int8 activation quantization
 - `RmsNorm`, `RotaryPositionEmbedding`, `MultiHeadAttention`, `SwiGLUFeedForward`, `BitNetLayer`, and `BitNetTransformer` implement the decoder-only paper architecture
@@ -28,6 +29,8 @@ The hosting layer now resolves multiple local model types behind the same agent

 This lets BenchmarkDotNet measure host construction, querying, streaming, and local training through one shared path.

+The same app surface also exposes a `datagen` command that keeps synthetic data generation local to the repository checkout.
+
 ## Language and interaction model

 The built-in vocabulary and command output default to American English. That keeps prompts, diagnostics, and help text aligned with the requirement for a primary U.S. English interface.

docs/datagen-guide.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+# DataGen guide
+
+DataGen is the repository's offline synthetic dataset bootstrapper for the paper-aligned BitNet b1.58 runtime. It takes a small set of seed examples, applies deterministic variation patterns, and uses the built-in BitNet transformer to condition each batch with lightweight next-token cues.
+
+## Generate a dataset
+
+```bash
+dotnet run --project src/BitNetSharp.App/BitNetSharp.App.csproj -- datagen \
+  --domain "medical-diagnosis" \
+  --count 50000 \
+  --seeds examples/seed-examples.json \
+  --output data/synthetic-medical.jsonl \
+  --lora medical-lora.bin
+```
+
+The command writes one JSON object per line so the output can flow directly into local fine-tuning or evaluation jobs.
+
+## Seed format
+
+Seed files are standard JSON arrays. Each object must include one instruction-like field and one response-like field. DataGen accepts the following aliases:
+
+- instruction: `instruction`, `prompt`, or `input`
+- response: `response`, `output`, or `answer`
+
+Example:
+
+```json
+[
+  {
+    "prompt": "Summarize the patient's main complaint and likely differential diagnosis.",
+    "response": "Restate the complaint, list the most likely causes, and flag any immediate safety concerns."
+  },
+  {
+    "instruction": "Explain what evidence should be gathered before choosing a treatment plan.",
+    "answer": "Collect history, exam findings, recent medications, and any contraindications before recommending next steps."
+  }
+]
+```
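As a reviewer-side illustration of the alias rules in the seed format section, the following Python sketch maps whichever alias a seed uses onto canonical `instruction` and `response` keys. It is not DataGen's loader (`DataGenGenerator.LoadSeeds` is not part of the hunks shown here); the helper names `first_present`, `normalize_seed`, and `load_seeds` are hypothetical, and the first-match precedence among aliases is an assumption.

```python
import json

# Alias lists copied from the guide. The precedence order within each
# list is an assumption made for this sketch.
INSTRUCTION_KEYS = ("instruction", "prompt", "input")
RESPONSE_KEYS = ("response", "output", "answer")


def first_present(seed: dict, keys: tuple) -> str:
    """Return the first non-empty string value found under any of the keys."""
    for key in keys:
        value = seed.get(key)
        if isinstance(value, str) and value.strip():
            return value
    raise ValueError(f"seed is missing one of: {', '.join(keys)}")


def normalize_seed(seed: dict) -> dict:
    """Map whichever aliases a seed uses onto canonical keys."""
    return {
        "instruction": first_present(seed, INSTRUCTION_KEYS),
        "response": first_present(seed, RESPONSE_KEYS),
    }


def load_seeds(text: str) -> list:
    """Parse a JSON array of seed objects and normalize each one."""
    raw = json.loads(text)
    if not isinstance(raw, list):
        raise ValueError("seed file must be a JSON array")
    return [normalize_seed(item) for item in raw]
```

Running `load_seeds` over the contents of a seed file before invoking the CLI is one way to catch a seed that lacks a response-like field early.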
+
+## Output schema
+
+Each JSONL line includes:
+
+- `domain`
+- `instruction`
+- `response`
+- `seedInstruction`
+- `seedResponse`
+- `variation`
+- `generatorModel`
+- `loraAdapter`
+- `tags`
+
+The optional `--lora` argument is recorded in output metadata so runs can stay attributable even when adapter-conditioned execution is handled outside the CLI.
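A downstream job may want to confirm that every line carries the fields listed in the output schema. The Python sketch below checks key presence only; the helper names are hypothetical. It assumes every key is written even when `--lora` is omitted, which the guide does not state, and it makes no assumption about value types.

```python
import json

# Field names copied from the guide's output schema list.
EXPECTED_FIELDS = (
    "domain", "instruction", "response", "seedInstruction", "seedResponse",
    "variation", "generatorModel", "loraAdapter", "tags",
)


def missing_fields(record: dict) -> list:
    """Return the schema field names that are absent from one record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def read_jsonl(lines):
    """Yield (line_number, record) for each non-blank line."""
    for number, line in enumerate(lines, start=1):
        if line.strip():
            yield number, json.loads(line)


def check_lines(lines) -> list:
    """Return (line_number, missing_field_names) for every incomplete record."""
    problems = []
    for number, record in read_jsonl(lines):
        missing = missing_fields(record)
        if missing:
            problems.append((number, missing))
    return problems
```

`check_lines` accepts any iterable of strings, so an open file handle for the generated `.jsonl` file can be passed directly.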
+
+## Quality controls
+
+- Start from diverse seeds that already match the tone and structure you need.
+- Generate a smaller preview set first, then inspect the JSONL output before scaling up.
+- Filter or deduplicate generated samples before fine-tuning if your target pipeline requires stricter curation.
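The deduplication step in the quality controls list could look like the Python sketch below. It is one possible curation pass, not something the CLI performs; the function names are hypothetical, and treating two records as duplicates when their instruction and response match after lowercasing and whitespace collapsing is an assumption.

```python
def dedupe_key(record: dict) -> tuple:
    """Build a case- and whitespace-insensitive key from instruction and response."""
    def squash(text) -> str:
        return " ".join(str(text).lower().split())

    return squash(record.get("instruction", "")), squash(record.get("response", ""))


def deduplicate(records) -> list:
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = dedupe_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Keeping the first occurrence preserves generation order, which makes it easier to compare the filtered set against the original preview run.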
+
+## Integration notes
+
+DataGen is intentionally local-first:
+
+- generation runs entirely offline
+- output stays in your working directory
+- the same built-in BitNet model ID is recorded with every example for traceability

docs/usage.md

Lines changed: 9 additions & 0 deletions
@@ -55,6 +55,15 @@ dotnet run --configuration Release --project src/BitNetSharp.App/BitNetSharp.App

 This command runs the BenchmarkDotNet suite, evaluates both built-in models against the shared default training corpus/query script, and writes HTML, Markdown, and JSON comparison reports to the selected output directory.

+## DataGen
+
+```bash
+dotnet run --project src/BitNetSharp.App/BitNetSharp.App.csproj -- datagen --domain "medical-diagnosis" --count 25 --seeds examples/seed-examples.json --output data/synthetic-medical.jsonl
+dotnet run --project src/BitNetSharp.App/BitNetSharp.App.csproj -- datagen --domain "medical-diagnosis" --count 25 --seeds examples/seed-examples.json --output data/synthetic-medical.jsonl --lora medical-lora.bin
+```
+
+This command reads a JSON array of seed examples, expands them into synthetic instruction-response pairs, and writes JSONL output for downstream local fine-tuning or evaluation. See the [DataGen guide](datagen-guide.md) for accepted seed aliases and the output schema.
+
 ## Train the traditional comparison model

 ```bash

examples/seed-examples.json

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+[
+  {
+    "prompt": "Summarize the patient's symptoms and identify the most urgent risk to rule out first.",
+    "response": "Restate the symptoms, note the highest-priority red flag, and recommend the next safe diagnostic step."
+  },
+  {
+    "instruction": "Explain how to gather missing context before recommending a treatment plan.",
+    "answer": "Review history, medications, allergies, exam findings, and contraindications before proposing options."
+  },
+  {
+    "input": "Create a concise follow-up instruction for documenting how the case should be monitored.",
+    "output": "Specify what should be monitored, how often it should be reviewed, and which changes require escalation."
+  }
+]
Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
+using BitNetSharp.Core;
+using System.Text.Json;
+
+namespace BitNetSharp.App;
+
+public sealed record DataGenCommandOptions(
+    string Domain,
+    int Count,
+    string SeedsPath,
+    string OutputPath,
+    string? LoraPath)
+{
+    public static DataGenCommandOptions Parse(string[] args)
+    {
+        ArgumentNullException.ThrowIfNull(args);
+
+        var domain = ReadRequiredOption(args, "--domain");
+        var seedsPath = ReadRequiredOption(args, "--seeds");
+        var outputPath = ReadRequiredOption(args, "--output");
+        var countValue = ReadRequiredOption(args, "--count");
+
+        if (!int.TryParse(countValue, out var count) || count <= 0)
+        {
+            throw new ArgumentException("The datagen count must be a positive integer.", nameof(args));
+        }
+
+        return new DataGenCommandOptions(
+            domain,
+            count,
+            Path.GetFullPath(seedsPath),
+            Path.GetFullPath(outputPath),
+            ReadOptionalOption(args, "--lora"));
+    }
+
+    private static string ReadRequiredOption(string[] args, string optionName)
+    {
+        var value = ReadOptionalOption(args, optionName);
+        if (value is null)
+        {
+            throw new ArgumentException($"Missing required option '{optionName}'.", nameof(args));
+        }
+
+        if (string.IsNullOrWhiteSpace(value))
+        {
+            throw new ArgumentException($"Option '{optionName}' requires a non-empty value.", nameof(args));
+        }
+
+        return value;
+    }
+
+    private static string? ReadOptionalOption(string[] args, string optionName)
+    {
+        var equalsPrefix = $"{optionName}=";
+        var missingValueDetected = false;
+        for (var index = 0; index < args.Length; index++)
+        {
+            var argument = args[index];
+            if (argument.StartsWith(equalsPrefix, StringComparison.OrdinalIgnoreCase))
+            {
+                return argument[equalsPrefix.Length..];
+            }
+
+            if (string.Equals(argument, optionName, StringComparison.OrdinalIgnoreCase))
+            {
+                var nextIndex = index + 1;
+                if (nextIndex >= args.Length)
+                {
+                    missingValueDetected = true;
+                    continue;
+                }
+
+                var nextArgument = args[nextIndex];
+                if (nextArgument.StartsWith("--", StringComparison.Ordinal))
+                {
+                    missingValueDetected = true;
+                    continue;
+                }
+
+                return nextArgument;
+            }
+        }
+
+        if (missingValueDetected)
+        {
+            throw new ArgumentException($"Option '{optionName}' requires a value.", nameof(args));
+        }
+
+        return null;
+    }
+}
+
+public static class DataGenCommand
+{
+    private static readonly JsonSerializerOptions OutputJsonOptions = new()
+    {
+        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
+        WriteIndented = false
+    };
+
+    public static async Task<string> RunAsync(string[] args, VerbosityLevel verbosity, CancellationToken cancellationToken = default)
+    {
+        var options = DataGenCommandOptions.Parse(args);
+        var seeds = DataGenGenerator.LoadSeeds(options.SeedsPath);
+        var generator = new DataGenGenerator(BitNetBootstrap.CreatePaperModel(verbosity));
+        var outputDirectory = Path.GetDirectoryName(options.OutputPath);
+
+        if (!string.IsNullOrWhiteSpace(outputDirectory))
+        {
+            Directory.CreateDirectory(outputDirectory);
+        }
+
+        await using var stream = File.Create(options.OutputPath);
+        await using var writer = new StreamWriter(stream);
+
+        foreach (var example in generator.Generate(options.Domain, options.Count, seeds, options.LoraPath))
+        {
+            cancellationToken.ThrowIfCancellationRequested();
+            var line = JsonSerializer.Serialize(example, OutputJsonOptions);
+            await writer.WriteLineAsync(line.AsMemory(), cancellationToken);
+        }
+
+        return options.OutputPath;
+    }
+}

src/BitNetSharp.App/Program.cs

Lines changed: 7 additions & 0 deletions
@@ -28,6 +28,13 @@
     return;
 }

+if (command == "datagen")
+{
+    var outputPath = await DataGenCommand.RunAsync(args, verbosity);
+    Console.WriteLine($"Saved synthetic dataset to {outputPath}");
+    return;
+}
+
 using var model = HostedAgentModelFactory.Create(modelSpecifier, verbosity);
 using var host = BitNetAgentHost.Build(model);
 var hostSummary = host.Services.GetRequiredService<BitNetHostSummary>();
