feat: fill 6 TODO sections in D2 builder #8
Merged
Implements all 6 TODO stubs from the previous skeleton (PR #7):

1. load_source_corpus: snapshot_download from HF datasets, walks .kicad_sch + .kicad_pcb pairs, extracts an SPDX identifier (license_spdx) from LICENSE files in the project tree.
2. run_erc_drc_for_project: parses erc.json + drc.json from Docker stdout via the cat trick, returns structured pass/fail + error counts.
3. inject_noise: 4 regex-based S-expression perturbations (delete_wire, displace_symbol, drop_global_label, shrink_track_width), deterministic per (project, op, seed).
4. load_prose_corpus: KiCad wiki seed + Wikipedia EMC + arXiv eess.SP fetchers (placeholders for full impl), chunks to 1500 chars, emits prose-doc triplets with per-source license.
5. compliance_audit: imports pii_scan.filter_rows dynamically, filters hard-PII rows, writes _clean.jsonl, returns stats (rows_in, rows_out, hard_pii_filtered). Graceful fallback if pii_scan is unavailable.
6. gen_readme: emits an Annex IV section 2b template with EU AI Act fields: provenance, license buckets, statistics, build reproducibility, TDM-DSM Art 4 disclosure, references.

Verified: AST valid (745 lines), --dry-run runs end-to-end.

Next: smoke test with --max-projects 3 --skip-prose on real electron-server Docker before full run.
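The description says noise injection is deterministic per (project, op, seed) but does not show how the three values are combined. One possible way is sketched below, assuming a hash-derived seed; `derive_rng` is a hypothetical helper name, not something taken from the PR. It uses hashlib rather than the built-in `hash()`, because string hashes are salted per process by default and would not reproduce across runs.

```python
import hashlib
import random


def derive_rng(project: str, op: str, seed: int) -> random.Random:
    """Return a PRNG whose stream depends only on (project, op, seed).

    Hypothetical helper: the PR does not show how it derives determinism.
    """
    # Join with a NUL separator so ("ab", "c") and ("a", "bc") cannot collide.
    key = "\x00".join([project, op, str(seed)]).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer seed.
    return random.Random(int.from_bytes(digest[:8], "big"))


if __name__ == "__main__":
    a = derive_rng("demo_board", "delete_wire", 0)
    b = derive_rng("demo_board", "delete_wire", 0)
    print(a.random() == b.random())  # same triple, same stream
```

With a generator like this, each noise op could pick which wire or symbol to perturb instead of always taking the first match, while staying reproducible.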
Pull request overview
Implements the previously stubbed TODO sections in the KiCad D2 combined dataset builder, covering source snapshot download + corpus walk, Docker-based ERC/DRC generation, regex-based noise injection, basic prose triplet seeding, PII audit integration, and README generation for compliance/publishing.
Changes:
- Implemented Hugging Face snapshot_download-based source corpus loading with basic LICENSE detection.
- Implemented ERC/DRC execution via sandboxed Docker + JSON extraction from stdout, plus noise ops and fix-response formatting.
- Added initial prose triplet seeds, a PII-audit step, and a template README generator for dataset publishing.
Comment on lines +142 to +148

    for sch_file in sorted(local_path.rglob("*.kicad_sch")):
        project_dir = sch_file.parent
        pcb_file = project_dir / sch_file.stem.replace(".kicad_sch", ".kicad_pcb")

        # Skip if no paired PCB (DRC requires PCB)
        if not pcb_file.exists():
            continue
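In the excerpt above, `sch_file.stem` already drops the `.kicad_sch` suffix, so the `.replace(...)` call has nothing to replace and `pcb_file` ends up with no extension. If the intent is to look for a sibling PCB with the same base name, `Path.with_suffix` would be one way to do it. A minimal sketch under that assumption (`find_sch_pcb_pairs` is a hypothetical name):

```python
from pathlib import Path


def find_sch_pcb_pairs(root):
    """Return (schematic, pcb) path pairs that share a base name under root."""
    pairs = []
    for sch_file in sorted(Path(root).rglob("*.kicad_sch")):
        # with_suffix swaps only the final suffix: board.kicad_sch -> board.kicad_pcb
        pcb_file = sch_file.with_suffix(".kicad_pcb")
        if pcb_file.exists():  # skip schematics without a paired PCB
            pairs.append((sch_file, pcb_file))
    return pairs
```

Hierarchical sub-sheets normally have no PCB of their own, so they are skipped by the same check.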
Comment on lines +150 to +180

    project_name = project_dir.name

    # Extract license from LICENSE file or parent LICENSE files
    license_spdx = "NOASSERTION"  # fallback
    for potential_license in [project_dir / "LICENSE",
                              project_dir.parent / "LICENSE",
                              local_path / "LICENSE"]:
        if potential_license.exists():
            license_text = potential_license.read_text(errors="ignore").upper()
            if "GPL" in license_text:
                license_spdx = "GPL-3.0-or-later"
                break
            elif "APACHE" in license_text:
                license_spdx = "Apache-2.0"
                break
            elif "MIT" in license_text:
                license_spdx = "MIT"
                break
            elif "BSD" in license_text:
                license_spdx = "BSD-3-Clause"
                break

    projects[project_name] = {
        "project": project_name,
        "source_repo": source_repo,
        "source_path": str(sch_file.relative_to(local_path)),
        "sch_path": str(sch_file),
        "pcb_path": str(pcb_file),
        "license_spdx": license_spdx,
    }
    count += 1
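Bare substring checks like the ones above can misclassify: "GPL" also matches LGPL and AGPL texts, and "MIT" occurs inside words such as "PERMITTED" and "LIMITED", so a BSD license text would be labelled MIT before the BSD branch is reached. A stricter heuristic could key on distinctive phrases from each license text. The sketch below is one such heuristic, not a full SPDX matcher; `classify_license` is a hypothetical name, and the choice to return NOASSERTION for LGPL/AGPL is an assumption made because the PR defines no bucket for them.

```python
import re


def classify_license(text: str) -> str:
    """Map LICENSE text to one of the PR's SPDX identifiers, else NOASSERTION."""
    upper = " ".join(text.upper().split())  # collapse whitespace and newlines
    # LGPL/AGPL texts also mention the GPL, so rule them out first (assumption:
    # they are left unclassified because no bucket exists for them).
    if ("GNU LESSER GENERAL PUBLIC LICENSE" in upper
            or "GNU AFFERO GENERAL PUBLIC LICENSE" in upper):
        return "NOASSERTION"
    if "GNU GENERAL PUBLIC LICENSE" in upper and re.search(r"VERSION 3\b", upper):
        # The license text alone cannot tell "-only" from "-or-later";
        # this keeps the identifier the PR already uses.
        return "GPL-3.0-or-later"
    if "APACHE LICENSE" in upper and "VERSION 2.0" in upper:
        return "Apache-2.0"
    if "PERMISSION IS HEREBY GRANTED, FREE OF CHARGE" in upper:
        return "MIT"
    if ("REDISTRIBUTION AND USE IN SOURCE AND BINARY FORMS" in upper
            and "NEITHER THE NAME" in upper):
        return "BSD-3-Clause"
    return "NOASSERTION"
```

Separately, `projects` is keyed by `project_dir.name`, so two projects in directories with the same name would overwrite each other; keying by the path relative to `local_path` would avoid that.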
Comment on lines +251 to +269

    # Parse JSON from stdout (the `&& cat` trick prints JSON)
    erc_json = None
    drc_json = None
    try:
        if erc["exit_code"] == 0 and erc["stdout"].strip():
            erc_json = json.loads(erc["stdout"])
    except (json.JSONDecodeError, ValueError):
        log.debug(" erc json parse failed, stdout: %s", erc["stdout"][:200])

    try:
        if drc["exit_code"] == 0 and drc["stdout"].strip():
            drc_json = json.loads(drc["stdout"])
    except (json.JSONDecodeError, ValueError):
        log.debug(" drc json parse failed, stdout: %s", drc["stdout"][:200])

    return {
        "erc": {**erc, "json": erc_json},
        "drc": {**drc, "json": drc_json},
        "valid": erc["exit_code"] == 0 and drc["exit_code"] == 0,
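In the excerpt above, "valid" depends only on exit codes, and the JSON is not parsed at all when the exit code is non-zero. Depending on how the checker is invoked, an exit code of 0 may only mean that the tool ran, not that the design is clean. A sketch of deriving pass/fail from the report contents instead is shown below. The report layout is an assumption about KiCad's JSON output (ERC violations nested under "sheets", DRC items under top-level "violations" and "unconnected_items") and should be checked against real reports; the function names are hypothetical.

```python
import json


def parse_report(stdout: str):
    """Parse a JSON report from captured stdout; return None if unusable."""
    text = stdout.strip()
    if not text:
        return None
    try:
        data = json.loads(text)
    except json.JSONDecodeError:  # already a ValueError subclass
        return None
    return data if isinstance(data, dict) else None


def count_violations(report):
    """Count reported items; the key names are assumed, not confirmed by the PR."""
    if report is None:
        return None
    total = len(report.get("violations", []))
    total += len(report.get("unconnected_items", []))
    for sheet in report.get("sheets", []):
        total += len(sheet.get("violations", []))
    return total


def summarize(erc: dict, drc: dict) -> dict:
    """Build a result where 'valid' means both reports parsed and are empty."""
    erc_count = count_violations(parse_report(erc["stdout"]))
    drc_count = count_violations(parse_report(drc["stdout"]))
    return {
        "erc_violations": erc_count,
        "drc_violations": drc_count,
        "valid": erc_count == 0 and drc_count == 0,
    }
```

A report that fails to parse yields a count of None, so it is never treated as a pass.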
Comment on lines +294 to +321

    if noise_op == "delete_wire":
        # Find first wire block (wire (pts ...)) and remove it
        match = re.search(r'\(wire\s+\(pts[^)]*\)[^)]*\)\s*', bad_sch)
        if match:
            bad_sch = bad_sch[:match.start()] + bad_sch[match.end():]

    elif noise_op == "displace_symbol":
        # Find first symbol with (at x y angle) and increment x by 500mil
        def displace_at(m):
            pre = m.group(1)
            x = int(m.group(2))
            y = m.group(3)
            angle = m.group(4)
            return f"{pre}(at {x + 500} {y} {angle})"
        bad_sch = re.sub(
            r'(\(symbol[^)]*?\(at\s+)(-?\d+)(\s+-?\d+\s+[\d.]+)\)',
            displace_at, bad_sch, count=1
        )

    elif noise_op == "drop_global_label":
        # Find first global_label and remove entire block
        match = re.search(r'\(global_label\s+"[^"]*"[^)]*\)[^)]*\)\s*', bad_sch)
        if match:
            bad_sch = bad_sch[:match.start()] + bad_sch[match.end():]

    elif noise_op == "shrink_track_width" and pcb_text:
        # In PCB, find segment with (width 0.25) and shrink to 0.05
        bad_pcb = pcb_text.replace("(width 0.25)", "(width 0.05)", 1)
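Several of the patterns above look fragile. S-expression blocks nest, so `[^)]*` stops at the first inner closing parenthesis and leaves a partial block behind. The displace pattern defines three groups while `displace_at` reads `m.group(4)`, and `group(1)` already contains `(at `. Coordinates are matched as integers although KiCad schematics use millimetre decimals (500 mil is 12.7 mm). One alternative is to locate the block by counting parentheses, as sketched below. The helper names are hypothetical, and the assumption that placed symbols open with `(symbol (lib_id ...` should be checked against the KiCad versions in the corpus.

```python
import re


def find_block(text: str, open_regex: str, start: int = 0):
    """Return (begin, end) of the first balanced block whose opening matches."""
    m = re.compile(open_regex).search(text, start)
    if m is None:
        return None
    depth, in_string, i = 0, False, m.start()
    while i < len(text):
        ch = text[i]
        if in_string:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True  # parentheses inside strings do not count
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return m.start(), i + 1
        i += 1
    return None  # unbalanced input


def remove_first_block(text: str, open_regex: str) -> str:
    """Delete the first matching block plus the whitespace that follows it."""
    span = find_block(text, open_regex)
    if span is None:
        return text
    begin, end = span
    while end < len(text) and text[end].isspace():
        end += 1
    return text[:begin] + text[end:]


def displace_first_symbol(text: str, dx_mm: float = 12.7) -> str:
    """Shift x of the first placed symbol's (at x y ...) by dx_mm."""
    span = find_block(text, r"\(symbol\s+\(lib_id")  # assumed placed-symbol form
    if span is None:
        return text
    begin, end = span

    def shift(m):
        x = round(float(m.group(1)) + dx_mm, 4)
        return f"(at {x} {m.group(2)}"

    number = r"(-?\d+(?:\.\d+)?)"
    block = re.sub(r"\(at\s+" + number + r"\s+" + number, shift,
                   text[begin:end], count=1)
    return text[:begin] + block + text[end:]
```

For example, `remove_first_block(sch, r"\(wire\b")` and `remove_first_block(sch, r"\(global_label\b")` would cover the two deletion ops. The `shrink_track_width` branch has a similar limitation: it is a no-op on boards with no track of exactly `(width 0.25)`.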
Comment on lines +543 to +571

    try:
        # Try to import pii_scan module
        sys.path.insert(0, "/tmp/ailiance-models-tuning/tools")
        import pii_scan

        # Read input JSONL
        rows_in = []
        with open(jsonl_path) as f:
            for line in f:
                if line.strip():
                    rows_in.append(json.loads(line))

        stats["rows_in"] = len(rows_in)

        # Apply PII filter (assuming pii_scan has a filter_rows function)
        if hasattr(pii_scan, "filter_rows"):
            rows_out = pii_scan.filter_rows(rows_in)
            stats["rows_out"] = len(rows_out)
            stats["hard_pii_filtered"] = stats["rows_in"] - stats["rows_out"]

            # Write cleaned output
            clean_path = jsonl_path.with_stem(jsonl_path.stem + "_clean")
            with open(clean_path, "w") as f:
                for row in rows_out:
                    f.write(json.dumps(row, ensure_ascii=False) + "\n")
            log.info(" wrote %d clean rows to %s", stats["rows_out"], clean_path)
        else:
            log.warning(" pii_scan.filter_rows not found, skipping filter")
            stats["rows_out"] = stats["rows_in"]
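The excerpt above prepends a hard-coded `/tmp/...` directory to `sys.path`, which changes import resolution for the whole process, and it uses `Path.with_stem`, which needs Python 3.9 or later. A sketch of a more contained variant follows: it loads the module from an explicit file path via `importlib` and records whether the filter actually ran. The function names and the `filter_applied` key are hypothetical additions, not part of the PR.

```python
import importlib.util
from pathlib import Path


def load_optional_module(name: str, path):
    """Load a module from an explicit file path; return None if it is absent."""
    path = Path(path)
    if not path.is_file():
        return None
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        return None
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the module without touching sys.path
    return module


def clean_path_for(jsonl_path) -> Path:
    """Return e.g. d2.jsonl -> d2_clean.jsonl (works before Python 3.9 too)."""
    p = Path(jsonl_path)
    return p.with_name(f"{p.stem}_clean{p.suffix}")


def apply_pii_filter(rows, module):
    """Apply module.filter_rows if available; report what happened in stats."""
    stats = {"rows_in": len(rows), "rows_out": len(rows),
             "hard_pii_filtered": 0, "filter_applied": False}
    filter_rows = getattr(module, "filter_rows", None)
    if callable(filter_rows):
        rows = list(filter_rows(rows))
        stats.update(rows_out=len(rows),
                     hard_pii_filtered=stats["rows_in"] - len(rows),
                     filter_applied=True)
    return rows, stats
```

Recording `filter_applied` lets a caller tell "no PII found" apart from "no scan performed", which matters for a compliance audit.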
Comment on lines +612 to +622

    ### Permissive License Bucket
    If bucket == "permissive":
    - Sources: Apache 2.0, MIT, BSD-3-Clause .kicad_sch files
    - Prose: CC-BY-SA-4.0 KiCad wiki, CC-BY-SA-3.0 Wikipedia EMC articles
    - Intended for: Dual-licensed LoRA artifacts (Apache + CC-BY-SA)

    ### Copyleft License Bucket
    If bucket == "copyleft":
    - Sources: GPL-3.0-or-later .kicad_sch files
    - Prose: CC-BY-SA-4.0 KiCad wiki, CC-BY-SA-3.0 Wikipedia EMC articles
    - Intended for: GPL-compliant LoRA artifacts
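As excerpted, the template emits both bucket sections along with the literal `If bucket == ...` lines, so a published README would describe both buckets regardless of which one the dataset belongs to. If the intent is to emit only the matching section, the choice could be made in code before the text is written. A minimal sketch under that assumption; `render_bucket_section` is a hypothetical name, and the section text is copied from the excerpt:

```python
BUCKET_SECTIONS = {
    "permissive": (
        "### Permissive License Bucket\n"
        "- Sources: Apache 2.0, MIT, BSD-3-Clause .kicad_sch files\n"
        "- Prose: CC-BY-SA-4.0 KiCad wiki, CC-BY-SA-3.0 Wikipedia EMC articles\n"
        "- Intended for: Dual-licensed LoRA artifacts (Apache + CC-BY-SA)\n"
    ),
    "copyleft": (
        "### Copyleft License Bucket\n"
        "- Sources: GPL-3.0-or-later .kicad_sch files\n"
        "- Prose: CC-BY-SA-4.0 KiCad wiki, CC-BY-SA-3.0 Wikipedia EMC articles\n"
        "- Intended for: GPL-compliant LoRA artifacts\n"
    ),
}


def render_bucket_section(bucket: str) -> str:
    """Return only the README section for the given license bucket."""
    try:
        return BUCKET_SECTIONS[bucket]
    except KeyError:
        # Fail loudly rather than publish a README with no license section.
        raise ValueError(f"unknown license bucket: {bucket!r}") from None
```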
Implementation of the 6 TODO stubs from PR #7 skeleton.

What's implemented
- load_source_corpus: huggingface_hub.snapshot_download + Path.rglob walk + LICENSE detection (GPL/Apache/MIT/BSD)
- run_erc_drc_for_project: && cat /tmp/*.json trick to retrieve structured reports
- inject_noise
- load_prose_corpus
- compliance_audit: tools/pii_scan.py filter_rows(), writes _clean.jsonl, returns stats
- gen_readme

Verified
- --dry-run runs end-to-end
- --max-projects 3 --skip-prose on real Docker (next)

Out of scope (future PRs)