
Add DataGen synthetic dataset generation to the BitNet b1.58 CLI #16

Merged
sharpninja merged 6 commits into main from copilot/add-datagen-synthetic-data-generator on Mar 20, 2026

Contributor

Copilot AI commented Mar 20, 2026

This PR closes the gap between the repository’s stated BitNet b1.58 reference surface and the implemented app surface by adding the missing DataGen workflow. It introduces an offline synthetic dataset generator that expands minimal JSON seed sets into JSONL instruction/response corpora while preserving BitNet model provenance.

  • CLI surface

    • adds a new datagen command to BitNetSharp.App
    • requires domain, count, seed file, and output file inputs, and accepts optional LoRA metadata
    • accepts both --option=value and --option value forms
  • Core DataGen implementation

    • adds a focused DataGenGenerator in BitNetSharp.Core
    • loads seed examples from JSON using common alias fields:
      • instruction: instruction, prompt, input
      • response: response, output, answer
    • emits structured JSONL records with:
      • domain
      • generated instruction/response
      • source seed instruction/response
      • variation ID
      • generator model ID
      • optional LoRA adapter filename
      • tags
  • Model alignment

    • keeps generation local-first and tied to the paper-aligned BitNet runtime
    • uses the existing BitNet model surface to condition synthetic examples without reintroducing retired toy workflows
  • Docs and examples

    • adds docs/datagen-guide.md
    • updates GitBook navigation in docs/README.md and docs/SUMMARY.md
    • documents DataGen usage in docs/usage.md
    • adds a sample seed file at examples/seed-examples.json
  • Targeted coverage

    • adds focused tests for:
      • seed alias parsing
      • DataGen option parsing
      • structured record generation
      • JSONL dataset writing
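
The seed alias handling described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the C# code in DataGenGenerator. The alias names come from the description; the top-level JSON array layout, the first-match precedence, and the error on incomplete seeds are assumptions, and `first_alias` and `load_seeds` are hypothetical names.

```python
import json

# Alias names from the PR description; trying them in this order is an assumption.
INSTRUCTION_ALIASES = ("instruction", "prompt", "input")
RESPONSE_ALIASES = ("response", "output", "answer")


def first_alias(entry, aliases):
    """Return the first non-empty string found under any alias, else None."""
    for key in aliases:
        value = entry.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


def load_seeds(text):
    """Parse a JSON array of seed objects into (instruction, response) pairs."""
    seeds = []
    for entry in json.loads(text):
        instruction = first_alias(entry, INSTRUCTION_ALIASES)
        response = first_alias(entry, RESPONSE_ALIASES)
        if instruction is None or response is None:
            # Assumption: an incomplete seed is an error rather than skipped.
            raise ValueError(f"seed is missing an instruction or response: {entry!r}")
        seeds.append((instruction, response))
    return seeds


sample = (
    '[{"prompt": "List flu symptoms.", "answer": "Fever, cough, fatigue."},'
    ' {"instruction": "Define triage.", "output": "Sorting patients by urgency."}]'
)
print(load_seeds(sample))
```

Normalizing every seed to one (instruction, response) pair at load time keeps the alias logic out of the generation step.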

Example:

dotnet run --project src/BitNetSharp.App/BitNetSharp.App.csproj -- datagen \
  --domain "medical-diagnosis" \
  --count 50000 \
  --seeds examples/seed-examples.json \
  --output data/synthetic-medical.jsonl \
  --lora medical-lora.bin
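
The command above uses the space-separated form, but the description says `--option=value` is accepted as well. A rough sketch of a parser that handles both forms follows, in Python rather than the C# of DataGenCommand. The option names are taken from the example command; `parse_options`, `validate`, and the error behavior are hypothetical.

```python
def parse_options(args):
    """Parse tokens such as ['--a=1', '--b', '2'] into a {name: value} dict."""
    options = {}
    i = 0
    while i < len(args):
        token = args[i]
        if not token.startswith("--"):
            raise ValueError(f"unexpected argument: {token}")
        name, sep, value = token[2:].partition("=")
        if not sep:
            # Space-separated form: the value is the next token.
            i += 1
            if i >= len(args) or args[i].startswith("--"):
                raise ValueError(f"missing value for --{name}")
            value = args[i]
        options[name] = value
        i += 1
    return options


# Per the description, these four are required and --lora is optional.
REQUIRED = ("domain", "count", "seeds", "output")


def validate(options):
    """Check required options and convert count to a positive integer."""
    missing = [name for name in REQUIRED if name not in options]
    if missing:
        raise ValueError("missing required options: " + ", ".join(missing))
    count = int(options["count"])
    if count <= 0:
        raise ValueError("--count must be positive")
    return {**options, "count": count}


args = ["--domain=medical-diagnosis", "--count", "50000",
        "--seeds", "examples/seed-examples.json",
        "--output=data/synthetic-medical.jsonl"]
print(validate(parse_options(args)))
```

Splitting on the first `=` only means a value that itself contains `=` survives intact in the `--option=value` form.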

Example output shape:

{
  "domain": "medical-diagnosis",
  "instruction": "...",
  "response": "...",
  "seedInstruction": "...",
  "seedResponse": "...",
  "variation": "pattern-1",
  "generatorModel": "bitnet-b1.58-sharp",
  "loraAdapter": "medical-lora.bin",
  "tags": ["synthetic", "offline", "pattern-1", "medical", "diagnosis"]
}
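
To make the output shape concrete, here is a small Python sketch that assembles one record with the same field names. It is an illustration, not the generator's actual logic. In particular, deriving the trailing tags by splitting the domain on hyphens is only a guess based on the example above, and emitting `loraAdapter` as null when no adapter is given is likewise an assumption. `build_record` is a hypothetical helper.

```python
import json

GENERATOR_MODEL = "bitnet-b1.58-sharp"  # model ID as shown in the example output


def build_record(domain, instruction, response, seed, pattern_index, lora_adapter=None):
    """Assemble one record; seed is an (instruction, response) pair."""
    variation = f"pattern-{pattern_index}"
    # Assumption: domain-derived tags are the hyphen-separated parts of the domain.
    domain_tags = [part for part in domain.split("-") if part]
    return {
        "domain": domain,
        "instruction": instruction,
        "response": response,
        "seedInstruction": seed[0],
        "seedResponse": seed[1],
        "variation": variation,
        "generatorModel": GENERATOR_MODEL,
        # Assumption: a missing adapter is written as null rather than omitted.
        "loraAdapter": lora_adapter,
        "tags": ["synthetic", "offline", variation] + domain_tags,
    }


record = build_record(
    "medical-diagnosis", "...", "...", ("...", "..."), 1, "medical-lora.bin"
)
print(json.dumps(record, indent=2))
```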


Copilot AI and others added 2 commits March 20, 2026 12:34
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add DataGen synthetic data generator for domain bootstrapping" to "Add DataGen synthetic dataset generation to the BitNet b1.58 CLI" Mar 20, 2026
Copilot AI requested a review from sharpninja March 20, 2026 12:41
@sharpninja sharpninja marked this pull request as ready for review March 20, 2026 12:41
Copilot AI review requested due to automatic review settings March 20, 2026 12:41

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eaa009cb12


Contributor

Copilot AI left a comment


Pull request overview

Adds an offline “DataGen” workflow to the BitNetSharp CLI and core library, enabling synthetic JSONL dataset generation from JSON seed examples while recording BitNet model provenance.

Changes:

  • Introduces datagen CLI command that parses options and writes JSONL output.
  • Adds BitNetSharp.Core DataGenGenerator for loading seed aliases and emitting structured synthetic records.
  • Updates docs + examples and adds targeted tests covering seed parsing, option parsing, and JSONL writing.
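
For readers unfamiliar with JSONL, the writing step amounts to one compact JSON object per line. The following is a minimal Python sketch of writing and reading such a file, assuming UTF-8 text and one record per line. `write_jsonl` and `read_jsonl` are hypothetical helpers, not functions from BitNetSharp.

```python
import io
import json


def write_jsonl(records, stream):
    """Write each record as one compact JSON object per line; return the count."""
    written = 0
    for record in records:
        # json.dumps escapes embedded newlines, so each record stays on one line.
        stream.write(json.dumps(record, ensure_ascii=False))
        stream.write("\n")
        written += 1
    return written


def read_jsonl(stream):
    """Read records back, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


records = [
    {"domain": "medical-diagnosis", "instruction": "a\nb", "variation": "pattern-1"},
    {"domain": "medical-diagnosis", "instruction": "c", "variation": "pattern-2"},
]
buffer = io.StringIO()
count = write_jsonl(records, buffer)
buffer.seek(0)
print(count, read_jsonl(buffer) == records)
```

Because each line is independently parseable, a large dataset can be streamed record by record without loading the whole file.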

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/BitNetSharp.Tests/BitNetModelTests.cs | Adds unit tests for seed alias loading, generator output shape, option parsing, and JSONL writing. |
| src/BitNetSharp.Core/DataGenGenerator.cs | Implements seed loading + synthetic record generation (domain/variation/model/tags/LoRA metadata). |
| src/BitNetSharp.App/Program.cs | Wires the new datagen command into the CLI entry point. |
| src/BitNetSharp.App/DataGenCommand.cs | Implements CLI option parsing and JSONL dataset writing for datagen. |
| examples/seed-examples.json | Provides a sample seed file demonstrating supported alias fields. |
| docs/usage.md | Documents datagen usage examples and links to the new guide. |
| docs/datagen-guide.md | Adds a dedicated DataGen guide (seed aliases, output schema, usage). |
| docs/architecture.md | Mentions DataGenGenerator and the CLI command in the architecture overview. |
| docs/SUMMARY.md | Adds the DataGen guide to GitBook navigation. |
| docs/README.md | Updates quick start + feature list to include DataGen and adds a docs map entry. |

@sharpninja
Owner

@copilot apply changes based on the comments in this thread

Copilot AI and others added 3 commits March 20, 2026 12:48
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 20, 2026

@copilot apply changes based on the comments in this thread

Applied on the branch. The review-thread fixes landed in f2d5b34, with follow-up parser naming cleanup in f2f70a7. I also revalidated the DataGen tests and full solution build/test locally. No UI changes were involved here.

@sharpninja sharpninja merged commit bcf4a3f into main Mar 20, 2026
3 checks passed
@sharpninja sharpninja deleted the copilot/add-datagen-synthetic-data-generator branch March 20, 2026 13:00