fix: D2 builder unblock truncated sch issue#10
Merged
Two changes to make the D2 builder produce real triplets:

1. SOURCE_UNIFIED = electron-rare/kicad9plus-sch-corpus
   Replaces the per-bucket sources (kicad9plus-permissive/copyleft), which had only 18 of 98 schematics intact. The unified sch-corpus exposes license_spdx per row in metadata, so we filter by license bucket and skip truncated rows at load time. New PERMISSIVE_LICENSES / COPYLEFT_LICENSES SPDX sets cover Apache-2.0, MIT, CC0-1.0, EUPL-1.2, CERN-OHL-P-2.0 (permissive) and GPL-3.0, CERN-OHL-S-2.0, AGPL (copyleft).

2. Opportunistic noise injection loop
   The old loop iterated by index, so if the first 3 ops did not match (a hierarchical top-level schematic has no wires), zero triplets were emitted. The new loop iterates over all ops in NOISE_OPERATIONS, skips ops that leave sch_text unchanged, and breaks after N successes. Added 3 universal ops: delete_sheet (balanced-paren removal), corrupt_uuid (BROKEN-UUID-NOT-HEX), truncate_tail (cut tail).

E2E smoke with --max-projects 3 --skip-prose: 4 triplets in the permissive bucket (drop_global_label x2 + truncate_tail x2 on MIT sch).
Pull request overview
This PR updates the KiCad D2 combined dataset builder to (1) switch to a unified source corpus with per-row SPDX metadata and truncation filtering so schematics are parseable by kicad-cli, and (2) make noise injection opportunistic so it emits triplets even when early noise ops don’t match a given schematic.
Changes:
- Replace per-bucket source datasets with a unified `electron-rare/kicad9plus-sch-corpus` source and filter rows by SPDX license bucket + truncation ratio at load time.
- Expand noise operations and change the injection loop to try all ops until `NOISE_VARIANTS_PER_BOARD` successful “bad” variants are produced.
Comments suppressed due to low confidence (1)
datasets/builders/build_kicad_d2_combined.py:333
`inject_noise()` accepts an `rng` parameter and claims operations are deterministic given `rng.seed`, but `rng` is not used anywhere in the function. Either remove `rng` from the signature/doc, or use it for any randomized choices so determinism is actually controlled by the provided RNG.
def inject_noise(sch_text: str, pcb_text: str | None,
                 noise_op: str, rng: random.Random) -> tuple[str, str | None]:
    """Apply one perturbation; returns (bad_sch, bad_pcb_or_none).
    Operations are deterministic given (sch_text, noise_op, rng.seed).
    They are minimal-edit so the diff between good and bad is small and
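One way to act on this comment is to let `rng` choose which matching node an operation mutates. The sketch below is illustrative only: `pick_target` and `GLOBAL_LABEL_RE` are hypothetical names, not part of the builder, and the regex assumes labels are written as `(global_label "X" ...)`.

```python
import random
import re

# Assumption: a global label node opens as (global_label "NAME" ...
GLOBAL_LABEL_RE = re.compile(r'\(global_label\s+"[^"]*"')


def pick_target(pattern: re.Pattern, text: str, rng: random.Random):
    """Return one match chosen by rng, or None if the primitive is absent."""
    matches = list(pattern.finditer(text))
    if not matches:
        return None
    # The choice depends only on the RNG state, so a fixed seed
    # reproduces the same perturbation target.
    return rng.choice(matches)


sch = '(global_label "A") (global_label "B") (global_label "C")'
first = pick_target(GLOBAL_LABEL_RE, sch, random.Random(42))
again = pick_target(GLOBAL_LABEL_RE, sch, random.Random(42))
same_target = first.span() == again.span()
```

With this shape, two runs seeded identically select the same node, which is what the docstring's determinism claim needs.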
# content is intact (≥95% of declared file_size_bytes) so kicad-cli can parse
# them — truncated rows fail "Failed to load schematic" anyway.
SOURCE_UNIFIED = "electron-rare/kicad9plus-sch-corpus"
Comment on lines +69 to +72
PERMISSIVE_LICENSES = {
    "Apache-2.0", "MIT", "CC0-1.0", "EUPL-1.2",
    "CERN-OHL-P-2.0", "BSD-3-Clause", "BSD-2-Clause", "ISC",
}
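The load-time filter described in this PR (license bucket plus truncation check) could look roughly like the sketch below. The row layout (`content`, `license_spdx`, `file_size_bytes` keys), the helper names, and the exact copyleft identifiers are assumptions for illustration; only the permissive set and the 95% threshold come from the diff.

```python
from typing import Optional

PERMISSIVE_LICENSES = {
    "Apache-2.0", "MIT", "CC0-1.0", "EUPL-1.2",
    "CERN-OHL-P-2.0", "BSD-3-Clause", "BSD-2-Clause", "ISC",
}
# Assumption: the PR names GPL-3.0, CERN-OHL-S-2.0 and AGPL; the exact
# identifiers used by the builder may differ.
COPYLEFT_LICENSES = {"GPL-3.0", "CERN-OHL-S-2.0", "AGPL-3.0"}

MIN_INTACT_RATIO = 0.95  # rows below this fraction of declared size are skipped


def license_bucket(spdx: str) -> Optional[str]:
    """Map an SPDX identifier to 'permissive', 'copyleft', or None."""
    if spdx in PERMISSIVE_LICENSES:
        return "permissive"
    if spdx in COPYLEFT_LICENSES:
        return "copyleft"
    return None


def keep_row(row: dict, bucket: str) -> bool:
    """Keep a row only if it is in the requested bucket and not truncated.

    Hypothetical row layout:
    {"content": str, "license_spdx": str, "file_size_bytes": int}
    """
    if license_bucket(row.get("license_spdx", "")) != bucket:
        return False
    declared = row.get("file_size_bytes") or 0
    if declared <= 0:
        return False  # no declared size, so intactness cannot be checked
    actual = len(row.get("content", "").encode("utf-8"))
    return actual / declared >= MIN_INTACT_RATIO
```

Rows with an unrecognized license fall into neither bucket and are dropped, which errs on the side of excluding data with unclear terms.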
    "displace_symbol", # move a (symbol (at x y a)) — DRC clearance issue
    "drop_global_label", # remove a (global_label "X") — unconnected net
    "delete_sheet", # remove a (sheet ...) reference — missing hierarchy
    "corrupt_uuid", # mutate a (uuid "...") — schematic identity broken
Comment on lines +146 to 154
cache_dir = WORK_DIR / "sch_corpus"
cache_dir.mkdir(parents=True, exist_ok=True)

try:
    local_path = Path(snapshot_download(
-        repo_id=source_repo,
+        repo_id=SOURCE_UNIFIED,
        repo_type="dataset",
        local_dir=str(cache_dir),
    ))
Comment on lines +422 to 431
# Opportunistic loop: try every noise op available; produce a triplet for
# each op that successfully breaks ERC (exit_code != 0). Stop after
# NOISE_VARIANTS_PER_BOARD successes. Many real schematics have only some
# of the matchable primitives (e.g. a top-level hierarchy has no wires
# but has sheets), so trying all variants is the safest design.
n_success = 0
for noise_op in NOISE_OPERATIONS:
    if n_success >= NOISE_VARIANTS_PER_BOARD:
        break
    bad_sch, _ = inject_noise(sch_text, pcb_text, noise_op, rng)
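Since the hunk is cut off after the `inject_noise()` call, here is a self-contained sketch of how such an opportunistic loop can be completed. `collect_bad_variants`, `toy_inject`, and the `breaks_erc` callable are hypothetical stand-ins; in the builder the ERC check is a kicad-cli run whose exit code is inspected, which is replaced here by a plain predicate.

```python
from typing import Callable, Iterable


def collect_bad_variants(
    sch_text: str,
    noise_ops: Iterable[str],
    inject: Callable[[str, str], str],
    breaks_erc: Callable[[str], bool],
    max_variants: int,
) -> list[tuple[str, str]]:
    """Try every op in order and keep the first max_variants that work."""
    variants: list[tuple[str, str]] = []
    for op in noise_ops:
        if len(variants) >= max_variants:
            break
        bad = inject(sch_text, op)
        if bad == sch_text:
            continue  # op found nothing to perturb in this schematic
        if not breaks_erc(bad):
            continue  # perturbation did not produce a failing variant
        variants.append((op, bad))
    return variants


def toy_inject(text: str, op: str) -> str:
    """Stand-in for inject_noise(); returns text unchanged on no match."""
    if op == "drop_wire":
        return text.replace("(wire)", "", 1)
    if op == "corrupt_uuid":
        return text.replace('"abcd"', '"BROKEN-UUID-NOT-HEX"', 1)
    if op == "truncate_tail":
        return text[:-1]
    return text


# A top-level sheet with no wires: the first op is skipped, later ops succeed.
top_level = '(kicad_sch (uuid "abcd") (sheet))'
ops = ["drop_wire", "corrupt_uuid", "truncate_tail"]
variants = collect_bad_variants(top_level, ops, toy_inject, lambda bad: True, 2)
```

The key difference from an index-based loop is that a non-matching op costs nothing: it is skipped and does not count toward the per-board limit.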
Two changes to make D2 builder produce real triplets after PR #9 discovered the truncation blocker.
Changes
1. `SOURCE_UNIFIED = electron-rare/kicad9plus-sch-corpus`: replaces the per-bucket sources. The unified corpus exposes `license_spdx` per row in metadata, so we filter by license bucket and skip truncated rows (actual size < 95% of declared) at load time. New `PERMISSIVE_LICENSES` / `COPYLEFT_LICENSES` SPDX sets cover Apache-2.0, MIT, CC0-1.0, EUPL-1.2, CERN-OHL-P-2.0 (permissive) and GPL-3.0, CERN-OHL-S-2.0, AGPL (copyleft).
2. Opportunistic noise injection loop: the old loop iterated by index, so if the first 3 ops didn't match (hierarchical top-level schematic with no wires/symbols/global_labels), 0 triplets were emitted. The new loop iterates over all ops, skips ones that leave `sch_text` unchanged, and breaks after N successes. Added 3 universal ops:
   - `delete_sheet` (balanced-paren block removal)
   - `corrupt_uuid` (replaces hex with `BROKEN-UUID-NOT-HEX`)
   - `truncate_tail` (cut last 5%; always breaks S-exp)

E2E smoke validated
Bumping `--max-projects` to a few dozen should yield a workable training set per bucket. Closes #8.
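For reference, the three universal ops can be sketched as below. This is a minimal illustration under stated assumptions, not the code merged in this PR: it assumes sheets open with `(sheet` followed by whitespace, UUIDs are written as `(uuid "...")`, and only the first match is perturbed.

```python
import re

SHEET_OPEN_RE = re.compile(r"\(sheet\s")      # does not match (sheet_instances
UUID_RE = re.compile(r'\(uuid\s+"[^"]*"\)')


def delete_sheet(sch_text: str) -> str:
    """Remove the first (sheet ...) block by walking balanced parentheses."""
    m = SHEET_OPEN_RE.search(sch_text)
    if m is None:
        return sch_text
    start, depth, in_string = m.start(), 0, False
    i = start
    while i < len(sch_text):
        ch = sch_text[i]
        if in_string:
            if ch == "\\":
                i += 1            # skip the escaped character
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:        # matching close of the sheet block
                return sch_text[:start] + sch_text[i + 1:]
        i += 1
    return sch_text               # unbalanced input: leave unchanged


def corrupt_uuid(sch_text: str) -> str:
    """Replace the first UUID value with a non-hex marker."""
    return UUID_RE.sub('(uuid "BROKEN-UUID-NOT-HEX")', sch_text, count=1)


def truncate_tail(sch_text: str, fraction: float = 0.05) -> str:
    """Cut the trailing fraction of the text (5% by default)."""
    keep = int(len(sch_text) * (1.0 - fraction))
    return sch_text[:keep]
```

Each function returns its input unchanged when there is nothing to perturb, which is the signal the opportunistic loop uses to skip an op.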