feat: fill 6 TODO sections in D2 builder by electron-rare · Pull Request #8 · ailiance/ailiance-models-tuning

electron-rare · 2026-05-11T19:48:30Z

Implementation of the 6 TODO stubs from PR #7 skeleton.

What's implemented

| Function | Implementation strategy |
| --- | --- |
| `load_source_corpus` | `huggingface_hub.snapshot_download` + `Path.rglob` walk + LICENSE detection (GPL/Apache/MIT/BSD) |
| `run_erc_drc_for_project` | Docker run of kicad-cli ERC, then DRC, with the `&& cat /tmp/*.json` trick to retrieve structured reports |
| `inject_noise` | 4 regex-based S-expression perturbations (delete_wire/displace_symbol/drop_global_label/shrink_track_width), seeded per (project, op) |
| `load_prose_corpus` | KiCad wiki seed + Wikipedia EMC + arXiv eess fetchers (placeholders for the full implementation); returns chunked prose triplets with per-source license |
| `compliance_audit` | Dynamic import of `tools/pii_scan.py` `filter_rows()`, writes `_clean.jsonl`, returns stats |
| `gen_readme` | Template-based Annex IV §2(b) emit (EU AI Act-compliant), TDM-DSM Art 4 disclosure for arXiv |
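
The "seeded per (project, op)" behaviour of `inject_noise` could be obtained as in the sketch below. This is a minimal illustration, not the builder's actual code: the helper name `derive_rng` and the choice of SHA-256 are assumptions. A stable digest is used because Python's built-in `hash()` is salted per process for strings, so it would not give the same seed across runs.

```python
import hashlib
import random


def derive_rng(project: str, noise_op: str, seed: int = 0) -> random.Random:
    """Return a Random instance that depends only on (project, noise_op, seed).

    Hypothetical helper: the name and the hashing scheme are illustrative.
    """
    # Join the key parts with a separator that cannot appear in normal names,
    # so ("ab", "c") and ("a", "bc") do not collide.
    key = f"{project}\x1f{noise_op}\x1f{seed}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer seed.
    return random.Random(int.from_bytes(digest[:8], "big"))


rng = derive_rng("example_project", "delete_wire", seed=42)
pick = rng.randrange(10)  # same value on every run for this key
```

Giving each (project, op) pair its own generator keeps one perturbation's random draws from shifting when another op is added or removed.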

Verified

AST/import valid (745 lines)
--dry-run runs end-to-end
Pending: smoke test with --max-projects 3 --skip-prose on real Docker (next)
Pending: full run + private publish to HF Ailiance-fr after the smoke test passes

Out of scope (future PRs)

Wikipedia API + arXiv API real fetchers (currently placeholders)
pandoc-based KiCad wiki → markdown conversion
4 more noise op variants (current 4 are minimal viable)

Implements all 6 TODO stubs from the previous skeleton (PR #7):

1. load_source_corpus: snapshot_download from HF datasets, walks .kicad_sch + .kicad_pcb pairs, extracts SPDX license_spdx from LICENSE files in project tree.
2. run_erc_drc_for_project: parses erc.json + drc.json from docker stdout via cat trick, structured pass/fail + error counts returned.
3. inject_noise: 4 regex-based S-expression perturbations (delete_wire, displace_symbol, drop_global_label, shrink_track_width), deterministic per (project, op, seed).
4. load_prose_corpus: KiCad wiki seed + Wikipedia EMC + arXiv eess.SP fetchers (placeholders for full impl), chunks to 1500 chars, emits prose-doc triplets with per-source license.
5. compliance_audit: imports pii_scan.filter_rows dynamically, filters hard-PII rows, writes _clean.jsonl, returns stats (rows_in, rows_out, hard_pii_filtered). Graceful fallback if pii_scan unavailable.
6. gen_readme: Annex IV section 2b template emit with EU AI Act fields: provenance, license buckets, statistics, build reproducibility, TDM-DSM Art 4 disclosure, references.

Verified: AST valid (745 lines), --dry-run runs end-to-end.

Next: smoke test with --max-projects 3 --skip-prose on real electron-server Docker before full run.
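
The 1500-character chunking mentioned for load_prose_corpus could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: the function name `chunk_text` and the choice to split on blank-line paragraph boundaries are hypothetical. Only the 1500-character limit comes from the description above.

```python
def chunk_text(text: str, max_chars: int = 1500) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # A single paragraph longer than the limit is hard-split.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this paragraph would overflow: flush and start a new chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 1000 + "\n\n" + "b" * 1000 + "\n\n" + "c" * 4000
pieces = chunk_text(sample)
```

Splitting on paragraph boundaries first keeps most chunks readable. The hard split is only a fallback for oversized paragraphs.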

Copilot

Pull request overview

Implements the previously stubbed TODO sections in the KiCad D2 combined dataset builder, covering source snapshot download + corpus walk, Docker-based ERC/DRC generation, regex-based noise injection, basic prose triplet seeding, PII audit integration, and README generation for compliance/publishing.

Changes:

Implemented Hugging Face snapshot_download-based source corpus loading with basic LICENSE detection.
Implemented ERC/DRC execution via sandboxed Docker + JSON extraction from stdout, plus noise ops and fix-response formatting.
Added initial prose triplet seeds, a PII-audit step, and a template README generator for dataset publishing.


+    for sch_file in sorted(local_path.rglob("*.kicad_sch")):
+        project_dir = sch_file.parent
+        # Path.stem has no extension, so replacing ".kicad_sch" on it is a no-op;
+        # swap the suffix on the full path instead.
+        pcb_file = sch_file.with_suffix(".kicad_pcb")
+
+        # Skip if no paired PCB (DRC requires PCB)
+        if not pcb_file.exists():
+            continue


+        project_name = project_dir.name
+
+        # Extract license from LICENSE file or parent LICENSE files
+        license_spdx = "NOASSERTION"  # fallback
+        for potential_license in [project_dir / "LICENSE", 
+                                  project_dir.parent / "LICENSE",
+                                  local_path / "LICENSE"]:
+            if potential_license.exists():
+                license_text = potential_license.read_text(errors="ignore").upper()
+                if "GPL" in license_text:
+                    license_spdx = "GPL-3.0-or-later"
+                    break
+                elif "APACHE" in license_text:
+                    license_spdx = "Apache-2.0"
+                    break
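+                # Word-boundary match: a bare "MIT" substring also hits
+                # "PERMITTED" in BSD-style text, which would never reach the BSD branch.
+                elif re.search(r"\bMIT\b", license_text):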
+                    license_spdx = "MIT"
+                    break
+                elif "BSD" in license_text:
+                    license_spdx = "BSD-3-Clause"
+                    break
+
+        projects[project_name] = {
+            "project": project_name,
+            "source_repo": source_repo,
+            "source_path": str(sch_file.relative_to(local_path)),
+            "sch_path": str(sch_file),
+            "pcb_path": str(pcb_file),
+            "license_spdx": license_spdx,
+        }
+        count += 1
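
Substring checks on upper-cased license text are easy to fool, so a phrase-based detector may be more dependable. The sketch below is one possible approach, not the builder's implementation: the function name `detect_license_spdx` is hypothetical, and the marker phrases are assumptions about how typical LICENSE files are worded. Like the hunk above, it does not check the license version.

```python
import re

# (marker, SPDX id) pairs, checked in order. The phrases are assumed to be
# characteristic of each license family; the version is NOT verified.
_LICENSE_MARKERS = [
    # Case-sensitive upper-case title, so "GNU LESSER ..." does not match.
    (re.compile(r"GNU GENERAL PUBLIC LICENSE"), "GPL-3.0-or-later"),
    (re.compile(r"Apache License", re.IGNORECASE), "Apache-2.0"),
    (re.compile(r"Permission is hereby granted, free of charge", re.IGNORECASE), "MIT"),
    (re.compile(r"Redistribution and use in source and binary forms", re.IGNORECASE),
     "BSD-3-Clause"),
]


def detect_license_spdx(text: str) -> str:
    """Return an SPDX id for recognised license text, else NOASSERTION."""
    # Collapse line wraps and indentation so phrases match across newlines.
    normalized = " ".join(text.split())
    for pattern, spdx in _LICENSE_MARKERS:
        if pattern.search(normalized):
            return spdx
    return "NOASSERTION"


bsd_text = ("Redistribution and use in source and binary forms, with or\n"
            "without modification, are permitted provided that ...")
print(detect_license_spdx(bsd_text))  # BSD-3-Clause
```

Returning `NOASSERTION` for anything unrecognised keeps ambiguous projects out of both license buckets rather than guessing.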


+    # Parse JSON from stdout (the `&& cat` trick prints JSON)
+    erc_json = None
+    drc_json = None
+    try:
+        if erc["exit_code"] == 0 and erc["stdout"].strip():
+            erc_json = json.loads(erc["stdout"])
+    except (json.JSONDecodeError, ValueError):
+        log.debug("  erc json parse failed, stdout: %s", erc["stdout"][:200])
+
+    try:
+        if drc["exit_code"] == 0 and drc["stdout"].strip():
+            drc_json = json.loads(drc["stdout"])
+    except (json.JSONDecodeError, ValueError):
+        log.debug("  drc json parse failed, stdout: %s", drc["stdout"][:200])
+
+    return {
+        "erc": {**erc, "json": erc_json},
+        "drc": {**drc, "json": drc_json},
+        "valid": erc["exit_code"] == 0 and drc["exit_code"] == 0,
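
`json.loads` on the whole stdout only works when the report is the only thing printed. If the container also writes status lines before the `cat` output (an assumption; the actual stdout is not shown here), a more tolerant extraction could use `json.JSONDecoder.raw_decode`, as in this sketch. The helper name `extract_json` is hypothetical.

```python
import json
from typing import Any, Optional


def extract_json(stdout: str) -> Optional[Any]:
    """Return the first JSON object found in stdout, or None if there is none."""
    decoder = json.JSONDecoder()
    idx = stdout.find("{")
    while idx != -1:
        try:
            # raw_decode parses one JSON value starting at idx and ignores
            # whatever follows it.
            obj, _end = decoder.raw_decode(stdout, idx)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            idx = stdout.find("{", idx + 1)
    return None


mixed = 'Running checks...\nDone.\n{"violations": [1, 2]}\n'
report = extract_json(mixed)
```

One caveat: a log line that happens to contain a small valid object (for example `{}`) would be returned first, so this is best combined with a check for the keys the report is expected to contain.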


+    if noise_op == "delete_wire":
+        # Find first wire block (wire (pts ...)) and remove it
+        match = re.search(r'\(wire\s+\(pts[^)]*\)[^)]*\)\s*', bad_sch)
+        if match:
+            bad_sch = bad_sch[:match.start()] + bad_sch[match.end():]
+
+    elif noise_op == "displace_symbol":
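+        # Find first symbol with (at x y angle) and shift x by 500 file units
+        def displace_at(m):
+            pre = m.group(1)        # "(symbol ... (at ", already includes "(at "
+            x = float(m.group(2))   # coordinates may be decimals
+            rest = m.group(3)       # " y angle", leading whitespace included
+            return f"{pre}{round(x + 500, 4)}{rest})"
+        # The pattern has exactly three groups; y and angle are captured together.
+        bad_sch = re.sub(
+            r'(\(symbol[^)]*?\(at\s+)(-?\d+(?:\.\d+)?)(\s+-?\d+(?:\.\d+)?\s+-?\d+(?:\.\d+)?)\)',
+            displace_at, bad_sch, count=1
+        )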
+
+    elif noise_op == "drop_global_label":
+        # Find first global_label and remove entire block
+        match = re.search(r'\(global_label\s+"[^"]*"[^)]*\)[^)]*\)\s*', bad_sch)
+        if match:
+            bad_sch = bad_sch[:match.start()] + bad_sch[match.end():]
+
+    elif noise_op == "shrink_track_width" and pcb_text:
+        # In PCB, find segment with (width 0.25) and shrink to 0.05
+        bad_pcb = pcb_text.replace("(width 0.25)", "(width 0.05)", 1)
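
The `delete_wire` and `drop_global_label` patterns stop at the second closing parenthesis, so on a nested block such as `(wire (pts (xy 1 2) (xy 3 4)) (stroke ...))` they would remove only part of it and leave unbalanced parentheses. A paren-counting removal avoids that. The sketch below is one possible alternative, not the PR's code, and the helper name `remove_first_block` is hypothetical.

```python
import re


def remove_first_block(text: str, head: str) -> str:
    """Remove the first balanced "(head ...)" S-expression block from text.

    Returns text unchanged if no such block exists or it is unbalanced.
    """
    # Require whitespace or a paren after the head so "wire" does not
    # match a longer token that merely starts with "wire".
    m = re.search(r"\(" + re.escape(head) + r"(?=[\s()])", text)
    if m is None:
        return text
    start = m.start()
    depth = 0
    in_string = False
    i = start
    while i < len(text):
        ch = text[i]
        if in_string:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                end = i + 1
                # Swallow trailing whitespace, as the regex versions do.
                while end < len(text) and text[end].isspace():
                    end += 1
                return text[:start] + text[end:]
        i += 1
    return text


sch = ('(kicad_sch (wire (pts (xy 1 2) (xy 3 4)) (stroke (width 0))) '
       '(global_label "A)B" (at 1 2 0)) (junction))')
no_wire = remove_first_block(sch, "wire")
no_label = remove_first_block(no_wire, "global_label")
```

Tracking quoted strings matters because label text may itself contain parentheses, as the `"A)B"` label in the sample shows.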


+    try:
+        # Try to import pii_scan module
+        sys.path.insert(0, "/tmp/ailiance-models-tuning/tools")
+        import pii_scan
+
+        # Read input JSONL
+        rows_in = []
+        with open(jsonl_path, encoding="utf-8") as f:
+            for line in f:
+                if line.strip():
+                    rows_in.append(json.loads(line))
+
+        stats["rows_in"] = len(rows_in)
+
+        # Apply PII filter (assuming pii_scan has a filter_rows function)
+        if hasattr(pii_scan, "filter_rows"):
+            rows_out = pii_scan.filter_rows(rows_in)
+            stats["rows_out"] = len(rows_out)
+            stats["hard_pii_filtered"] = stats["rows_in"] - stats["rows_out"]
+
+            # Write cleaned output
+            clean_path = jsonl_path.with_stem(jsonl_path.stem + "_clean")
+            # ensure_ascii=False emits raw non-ASCII, so pin the encoding
+            with open(clean_path, "w", encoding="utf-8") as f:
+                for row in rows_out:
+                    f.write(json.dumps(row, ensure_ascii=False) + "\n")
+            log.info("  wrote %d clean rows to %s", stats["rows_out"], clean_path)
+        else:
+            log.warning("  pii_scan.filter_rows not found, skipping filter")
+            stats["rows_out"] = stats["rows_in"]
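
The dynamic import could also be done without mutating `sys.path`, by loading the script from an explicit file path with `importlib.util`. The sketch below shows that variant under assumptions: the helper name `load_filter_rows` is hypothetical, and the only thing taken from the PR is that the script is expected to expose a `filter_rows` function.

```python
import importlib.util
from pathlib import Path
from typing import Callable, Optional


def load_filter_rows(script_path: Path) -> Optional[Callable]:
    """Load filter_rows from a standalone script, or return None if unavailable."""
    if not script_path.is_file():
        return None
    spec = importlib.util.spec_from_file_location("pii_scan", script_path)
    if spec is None or spec.loader is None:
        return None
    module = importlib.util.module_from_spec(spec)
    # Executes the script's top-level code, like a normal import would.
    spec.loader.exec_module(module)
    return getattr(module, "filter_rows", None)


def audit_rows(rows: list, filter_rows: Optional[Callable]) -> dict:
    """Apply the filter if present; otherwise pass rows through unchanged."""
    rows_out = filter_rows(rows) if filter_rows is not None else list(rows)
    return {
        "rows_in": len(rows),
        "rows_out": len(rows_out),
        "hard_pii_filtered": len(rows) - len(rows_out),
    }
```

A `None` return gives the caller a single place to apply the graceful fallback, whether the file or the function is missing.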


+### Permissive License Bucket
+If bucket == "permissive":
+- Sources: Apache 2.0, MIT, BSD-3-Clause .kicad_sch files
+- Prose: CC-BY-SA-4.0 KiCad wiki, CC-BY-SA-3.0 Wikipedia EMC articles
+- Intended for: Dual-licensed LoRA artifacts (Apache + CC-BY-SA)
+
+### Copyleft License Bucket
+If bucket == "copyleft":
+- Sources: GPL-3.0-or-later .kicad_sch files
+- Prose: CC-BY-SA-4.0 KiCad wiki, CC-BY-SA-3.0 Wikipedia EMC articles
+- Intended for: GPL-compliant LoRA artifacts
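
The `If bucket == ...` lines above read like conditions that would end up verbatim in the published README. One way to emit only the section that applies is to render it from a small lookup table, as sketched below. This is an illustration only: `BUCKETS` and `render_bucket_section` are hypothetical names, while the field values are copied from the template above.

```python
PROSE_SOURCES = "CC-BY-SA-4.0 KiCad wiki, CC-BY-SA-3.0 Wikipedia EMC articles"

# Hypothetical lookup table; values mirror the README template text.
BUCKETS = {
    "permissive": {
        "title": "Permissive License Bucket",
        "sources": "Apache 2.0, MIT, BSD-3-Clause .kicad_sch files",
        "intended": "Dual-licensed LoRA artifacts (Apache + CC-BY-SA)",
    },
    "copyleft": {
        "title": "Copyleft License Bucket",
        "sources": "GPL-3.0-or-later .kicad_sch files",
        "intended": "GPL-compliant LoRA artifacts",
    },
}


def render_bucket_section(bucket: str) -> str:
    """Render the README section for one license bucket."""
    info = BUCKETS[bucket]  # raises KeyError for an unknown bucket
    return "\n".join([
        f"### {info['title']}",
        f"- Sources: {info['sources']}",
        f"- Prose: {PROSE_SOURCES}",
        f"- Intended for: {info['intended']}",
    ])


section = render_bucket_section("copyleft")
```

Failing loudly on an unknown bucket name avoids publishing a README with no license section at all.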


Copilot AI review requested due to automatic review settings May 11, 2026 19:48

electron-rare merged commit 7e71093 into main May 11, 2026
1 of 5 checks passed

electron-rare deleted the feat/kicad-d2-builder-impl-2026-05-11 branch May 11, 2026 19:48

Copilot started reviewing on behalf of electron-rare May 11, 2026 19:49

Copilot AI reviewed May 11, 2026


electron-rare mentioned this pull request May 11, 2026

fix: D2 builder unblock truncated sch issue #10

Merged
