
fix: D2 builder unblock truncated sch issue #10

Merged
electron-rare merged 1 commit into main from fix/d2-builder-unblock-truncated-2026-05-11 on May 11, 2026

@electron-rare
Contributor

Two changes to make the D2 builder produce real triplets after PR #9 uncovered the truncation blocker.

Changes

  1. SOURCE_UNIFIED = electron-rare/kicad9plus-sch-corpus replaces the per-bucket sources. The unified corpus exposes license_spdx per row in its metadata, so we filter by license bucket AND skip truncated rows (actual size < 95% of declared) at load time. The new PERMISSIVE_LICENSES / COPYLEFT_LICENSES SPDX sets cover Apache-2.0, MIT, CC0-1.0, EUPL-1.2, CERN-OHL-P-2.0 (permissive) and GPL-3.0, CERN-OHL-S-2.0, AGPL (copyleft).

  2. Opportunistic noise injection loop: the old loop iterated by index, so if the first 3 ops didn't match (e.g. a hierarchical top-level schematic with no wires/symbols/global_labels), 0 triplets were emitted. The new loop iterates over ALL ops, skips the ones that leave sch_text unchanged, and breaks after N successes. Added 3 universal ops:

    • delete_sheet (balanced-paren block removal)
    • corrupt_uuid (replaces hex with BROKEN-UUID-NOT-HEX)
    • truncate_tail (cut last 5% — always breaks S-exp)
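
As a rough illustration of the three universal ops listed above, here is a minimal sketch. It is not the code from build_kicad_d2_combined.py: the function bodies, the UUID regex, and the `fraction` parameter are assumptions made for illustration. Only the op names, the BROKEN-UUID-NOT-HEX marker, and the 5% cut come from the description.

```python
import re

# Assumed pattern: a 36-character hex-and-dash UUID, quoted or unquoted.
UUID_RE = re.compile(r'\(uuid\s+"?[0-9a-fA-F-]{36}"?\)')


def delete_sheet(text: str) -> str:
    """Remove the first balanced (sheet ...) block; return text unchanged if none."""
    # The lookahead avoids matching tokens such as (sheet_instances ...).
    m = re.search(r"\(sheet(?=[\s)])", text)
    if m is None:
        return text
    depth = 0
    in_string = False
    i = m.start()
    while i < len(text):
        ch = text[i]
        if in_string:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return text[:m.start()] + text[i + 1:]
        i += 1
    return text  # unbalanced input: leave it unchanged


def corrupt_uuid(text: str) -> str:
    """Replace the first UUID with a non-hex marker; no-op if no UUID is found."""
    return UUID_RE.sub('(uuid "BROKEN-UUID-NOT-HEX")', text, count=1)


def truncate_tail(text: str, fraction: float = 0.05) -> str:
    """Cut the last `fraction` of the text (at least one character)."""
    if not text:
        return text
    cut = max(1, int(len(text) * fraction))
    return text[:-cut]
```

Each op returns its input unchanged when it has nothing to match, which lets the caller detect a no-op by comparing the result with sch_text.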

E2E smoke validated

$ python build_kicad_d2_combined.py --max-projects 3 --skip-prose
[1] bucket=permissive: seen=85 wrong_bucket=29 truncated=53 intact=3
[1] bucket=copyleft:   seen=127 wrong_bucket=83 truncated=41 intact=3
[6] PII scan + filter on permissive_train.jsonl
=== D2 builder done ===

permissive_train.jsonl: 4 triplets
  - drop_global_label MIT test/fixtures/.../pass_minimal_mcu_board/demo.kicad_sch
  - drop_global_label MIT test/fixtures/.../fail_sismosmart_like_label_.kicad_sch
  - truncate_tail     MIT  (idem)
  - truncate_tail     MIT  (idem)
copyleft_train.jsonl: 0 triplets (3 picks were all top-level hierarchies)

Bumping --max-projects to a few dozen should yield a workable training set per bucket. Closes #8.

Two changes to make D2 builder produce real triplets:

1. SOURCE_UNIFIED = electron-rare/kicad9plus-sch-corpus
   Replaces per-bucket sources (kicad9plus-permissive/copyleft)
   which had only 18 of 98 schemas intact. The unified sch-corpus
   exposes license_spdx per-row in metadata, so we filter by
   license bucket AND skip truncated rows at load time.

   New PERMISSIVE_LICENSES / COPYLEFT_LICENSES SPDX sets cover
   Apache-2.0, MIT, CC0-1.0, EUPL-1.2, CERN-OHL-P-2.0 (perm)
   and GPL-3.0, CERN-OHL-S-2.0, AGPL (copyleft).

2. Opportunistic noise injection loop
   Old loop iterated by index, so if the first 3 ops did not
   match (hierarchical top-level schema has no wires), zero
   triplets were emitted. New loop iterates ALL ops in
   NOISE_OPERATIONS, skipping ops that leave sch_text unchanged,
   breaks after N successes.

   Added 3 universal ops: delete_sheet (balanced-paren removal),
   corrupt_uuid (BROKEN-UUID-NOT-HEX), truncate_tail (cut tail).

E2E smoke --max-projects 3 --skip-prose: 4 triplets in permissive
bucket (drop_global_label x2 + truncate_tail x2 on MIT sch).
Copilot AI review requested due to automatic review settings May 11, 2026 20:06
@electron-rare electron-rare merged commit ab8d3db into main May 11, 2026
0 of 2 checks passed
@electron-rare electron-rare deleted the fix/d2-builder-unblock-truncated-2026-05-11 branch May 11, 2026 20:06

Copilot AI left a comment


Pull request overview

This PR updates the KiCad D2 combined dataset builder to (1) switch to a unified source corpus with per-row SPDX metadata and truncation filtering so schematics are parseable by kicad-cli, and (2) make noise injection opportunistic so it emits triplets even when early noise ops don’t match a given schematic.

Changes:

  • Replace per-bucket source datasets with a unified electron-rare/kicad9plus-sch-corpus source and filter rows by SPDX license bucket + truncation ratio at load time.
  • Expand noise operations and change the injection loop to try all ops until NOISE_VARIANTS_PER_BOARD successful “bad” variants are produced.
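
The load-time filter described in the first bullet could look roughly like the sketch below. The row layout is an assumption: `license_spdx` and `file_size_bytes` are named in the PR, while the `content` key, the flat dict shape, and the `select_rows` helper are hypothetical. The license set is copied from the diff excerpt in this review.

```python
# Copied from the diff excerpt in this review.
PERMISSIVE_LICENSES = {
    "Apache-2.0", "MIT", "CC0-1.0", "EUPL-1.2",
    "CERN-OHL-P-2.0", "BSD-3-Clause", "BSD-2-Clause", "ISC",
}

MIN_INTACT_PERCENT = 95  # rows below 95% of the declared size count as truncated


def is_intact(content: str, declared_size: int) -> bool:
    """True if the actual byte length is at least 95% of the declared size."""
    if declared_size <= 0:
        return False
    actual = len(content.encode("utf-8"))
    # Integer arithmetic avoids floating-point edge cases at exactly 95%.
    return actual * 100 >= declared_size * MIN_INTACT_PERCENT


def select_rows(rows, bucket_licenses):
    """Keep rows in the license bucket whose content is intact; count the rest."""
    stats = {"seen": 0, "wrong_bucket": 0, "truncated": 0, "intact": 0}
    kept = []
    for row in rows:
        stats["seen"] += 1
        # Hypothetical keys: the real corpus may nest these under a metadata field.
        if row["license_spdx"] not in bucket_licenses:
            stats["wrong_bucket"] += 1
            continue
        if not is_intact(row["content"], row["file_size_bytes"]):
            stats["truncated"] += 1
            continue
        stats["intact"] += 1
        kept.append(row)
    return kept, stats
```

The counter names mirror the seen / wrong_bucket / truncated / intact fields printed in the smoke run, where the last three sum to the first.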
Comments suppressed due to low confidence (1)

datasets/builders/build_kicad_d2_combined.py:333

  • inject_noise() accepts an rng parameter and claims operations are deterministic given rng.seed, but rng is not used anywhere in the function. Either remove rng from the signature/doc, or use it for any randomized choices so determinism is actually controlled by the provided RNG.
def inject_noise(sch_text: str, pcb_text: str | None,
                 noise_op: str, rng: random.Random) -> tuple[str, str | None]:
    """Apply one perturbation; returns (bad_sch, bad_pcb_or_none).

    Operations are deterministic given (sch_text, noise_op, rng.seed).
    They are minimal-edit so the diff between good and bad is small and


# content is intact (≥95% of declared file_size_bytes) so kicad-cli can parse
# them — truncated rows fail "Failed to load schematic" anyway.
SOURCE_UNIFIED = "electron-rare/kicad9plus-sch-corpus"

Comment on lines +69 to +72
PERMISSIVE_LICENSES = {
    "Apache-2.0", "MIT", "CC0-1.0", "EUPL-1.2",
    "CERN-OHL-P-2.0", "BSD-3-Clause", "BSD-2-Clause", "ISC",
}

    "displace_symbol", # move a (symbol (at x y a)) — DRC clearance issue
    "drop_global_label", # remove a (global_label "X") — unconnected net
    "delete_sheet", # remove a (sheet ...) reference — missing hierarchy
    "corrupt_uuid", # mutate a (uuid "...") — schematic identity broken
Comment on lines +146 to 154
cache_dir = WORK_DIR / "sch_corpus"
cache_dir.mkdir(parents=True, exist_ok=True)

try:
    local_path = Path(snapshot_download(
-       repo_id=source_repo,
+       repo_id=SOURCE_UNIFIED,
        repo_type="dataset",
        local_dir=str(cache_dir),
    ))
Comment on lines +422 to 431
# Opportunistic loop: try every noise op available; produce a triplet for
# each op that successfully breaks ERC (exit_code != 0). Stop after
# NOISE_VARIANTS_PER_BOARD successes. Many real schematics have only some
# of the matchable primitives (e.g. a top-level hierarchy has no wires
# but has sheets), so trying all variants is the safest design.
n_success = 0
for noise_op in NOISE_OPERATIONS:
    if n_success >= NOISE_VARIANTS_PER_BOARD:
        break
    bad_sch, _ = inject_noise(sch_text, pcb_text, noise_op, rng)
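
The excerpt above stops before the success check, so here is a self-contained sketch of the opportunistic pattern it describes. It is a simplification under stated assumptions: `collect_variants` and the toy `NOISE_OPS` table are hypothetical, and the sketch only applies the "skip ops that leave sch_text unchanged" rule, without the ERC exit-code check mentioned in the comment.

```python
from typing import Callable

# Toy ops for illustration only; each returns its input unchanged on no match.
NOISE_OPS: dict[str, Callable[[str], str]] = {
    "drop_wire": lambda t: t.replace("(wire)", "", 1),
    "drop_label": lambda t: t.replace("(label)", "", 1),
    "truncate_tail": lambda t: t[:-1],
    "upper": lambda t: t.upper(),
}


def collect_variants(sch_text: str,
                     ops: dict[str, Callable[[str], str]],
                     max_variants: int) -> list[tuple[str, str]]:
    """Try every op in order; keep those that change the text, up to max_variants."""
    variants: list[tuple[str, str]] = []
    for name, op in ops.items():
        if len(variants) >= max_variants:
            break
        bad = op(sch_text)
        if bad == sch_text:
            continue  # op found nothing to mutate in this schematic
        variants.append((name, bad))
    return variants
```

For a schematic with no wires or labels, an index-based loop over the first ops would yield nothing, whereas this loop falls through to the ops that always apply.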