feat: D2 prose Wikipedia EMC + EU AI Act README #12
Merged
Two enhancements:

1. load_prose_corpus: real Wikipedia EMC fetcher
Adds a Wikipedia REST API fetcher for 6 EMC topics (Electromagnetic_compatibility, Signal_integrity, Decoupling_capacitor, Printed_circuit_board, Impedance_matching, Ground_plane). CC-BY-SA-3.0. User-Agent references the EU DSM Art 4 TDM exception. arXiv eess.SP fetcher remains TODO (registered email + rate-limit + TDM opt-out check required).

2. gen_readme: full Annex IV section 2(b) compliance
- YAML frontmatter HF schema (license, tags, pretty_name)
- Per-bucket conditional overview (permissive vs copyleft)
- License distribution table from MANIFEST_D2.json
- Surface distribution (sch / erc / noise-fix / prose-doc)
- Build stats: builder SHA + timestamp + PII filter delta
- Reproducibility section with exact rebuild command
- Intended use + foreseeable misuse Art 53(1)(b)
- TDM-DSM disclosure paragraph
- References EU AI Act + IEC 61000 + KiCad docs

E2E smoke with prose --max-projects 30:
- 8 prose triplets (2 seeds + 6 Wikipedia)
- permissive: 24 triplets total
- copyleft: 15 triplets total
Pull request overview
Enhances the KiCad D2 combined dataset builder by expanding the prose corpus (now including Wikipedia EMC/SI extracts) and generating a more EU AI Act–oriented README/dataset card.
Changes:
- Refactors prose triplet generation and adds Wikipedia REST API summary fetching for a fixed set of EMC/SI topics.
- Expands gen_readme() to emit HF dataset-card YAML frontmatter and richer Annex IV / Art. 53 compliance sections, plus license/surface distribution tables.
Comment on lines +563 to +573

    t.metadata = {
        "provenance": asdict(Provenance(
            source_repo=source_repo,
            source_path=title,
            license_spdx=license_spdx,
            surface="prose-doc",
            file_sha256=hashlib.sha256(content.encode()).hexdigest(),
            build_sha=BUILD_SHA,
            timestamp_utc=datetime.now(timezone.utc).isoformat(),
        )),
    }
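
For readers without the full diff, here is a minimal sketch of how a provenance record like the one in this hunk could be assembled. The field names mirror the hunk; the `Provenance` dataclass definition, the field types, the `BUILD_SHA` placeholder value, and the `make_provenance` helper are assumptions for illustration, not the builder's actual code.

```python
import hashlib
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

BUILD_SHA = "0000000"  # placeholder (assumption); the real builder supplies its own value


@dataclass(frozen=True)
class Provenance:
    # Field names copied from the hunk; the types are assumed.
    source_repo: str
    source_path: str
    license_spdx: str
    surface: str
    file_sha256: str
    build_sha: str
    timestamp_utc: str


def make_provenance(content: str, title: str, license_spdx: str, source_repo: str) -> dict:
    """Hypothetical helper: build the metadata dict for one prose chunk."""
    return {
        "provenance": asdict(Provenance(
            source_repo=source_repo,
            source_path=title,
            license_spdx=license_spdx,
            surface="prose-doc",
            # Hash the UTF-8 bytes of the text that was passed in.
            file_sha256=hashlib.sha256(content.encode("utf-8")).hexdigest(),
            build_sha=BUILD_SHA,
            timestamp_utc=datetime.now(timezone.utc).isoformat(),
        )),
    }
```

In this sketch the digest covers whatever string is passed as `content`, so for prose it identifies the chunk rather than a file on disk.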
Comment on lines +584 to +620

    # Wikipedia content is CC-BY-SA-3.0; attribution is by article title.
    # We fetch the lead section (first ~1500 chars) only — sufficient for
    # a self-contained design-principle chunk.
    WIKI_TOPICS = [
        "Electromagnetic_compatibility",
        "Signal_integrity",
        "Decoupling_capacitor",
        "Printed_circuit_board",
        "Impedance_matching",
        "Ground_plane",
    ]
    import urllib.request
    import urllib.parse
    for topic in WIKI_TOPICS:
        try:
            url = (
                "https://en.wikipedia.org/api/rest_v1/page/summary/"
                + urllib.parse.quote(topic)
            )
            req = urllib.request.Request(
                url, headers={"User-Agent": "ailiance-d2-builder/0.1 (compliance: EU-DSM-TDM-Art4)"},
            )
            with urllib.request.urlopen(req, timeout=10) as r:
                data = json.loads(r.read().decode("utf-8"))
            extract = (data.get("extract") or "").strip()
            if not extract or len(extract) < 200:
                log.debug("  wiki %s: extract too short, skipping", topic)
                continue
            chunk = extract[:PROSE_CHUNK_CHARS]
            triplets.append(_prose_triplet(
                chunk,
                f"Wikipedia/{topic}",
                "CC-BY-SA-3.0",
                "en.wikipedia.org",
            ))
        except Exception as e:
            log.warning("  wiki %s fetch failed: %r", topic, e)
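
One way to make the extract handling in this hunk testable without network access is to split the JSON parsing and length filter away from the HTTP call. The sketch below is only an illustration of that split: `chunk_from_summary` is a hypothetical helper, and the value 1500 for `PROSE_CHUNK_CHARS` is an assumption taken from the "~1500 chars" comment, not from the builder's constants.

```python
import json
from typing import Optional

PROSE_CHUNK_CHARS = 1500  # assumed value, based on the "~1500 chars" comment in the hunk
MIN_EXTRACT_CHARS = 200   # same threshold as the length check in the hunk


def chunk_from_summary(payload: bytes) -> Optional[str]:
    """Hypothetical helper: return a trimmed extract from a page-summary
    JSON payload, or None when the extract is missing or too short."""
    data = json.loads(payload.decode("utf-8"))
    extract = (data.get("extract") or "").strip()
    if len(extract) < MIN_EXTRACT_CHARS:
        return None
    return extract[:PROSE_CHUNK_CHARS]
```

With this shape the fetch loop would only be responsible for the request and for error handling, and the filter could be exercised with canned payloads.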
Comment on lines +627 to +628

    log.info("  loaded %d prose triplets (seeds %d + wikipedia %d)",
             len(triplets), len(kicad_seeds), len(WIKI_TOPICS))
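
As written, the Wikipedia figure in this log line is `len(WIKI_TOPICS)`, which counts the topics requested, so it would over-report whenever a fetch fails or an extract is skipped. A possible alternative, sketched under assumptions, is to count the topics that actually produced a triplet. The `summarize_prose` helper, its `wiki_results` argument, and the logger name are hypothetical.

```python
import logging

log = logging.getLogger("d2-builder")  # logger name is an assumption


def summarize_prose(seed_count: int, wiki_results: list) -> str:
    """Hypothetical helper: report what was actually loaded.

    `wiki_results` holds one entry per requested topic: the triplet, or
    None when the fetch failed or the extract was skipped.
    """
    wiki_loaded = sum(1 for r in wiki_results if r is not None)
    total = seed_count + wiki_loaded
    msg = "loaded %d prose triplets (seeds %d + wikipedia %d)" % (
        total, seed_count, wiki_loaded,
    )
    log.info("  %s", msg)
    return msg
```

For example, two seeds plus three requested topics of which one failed would be reported as four triplets, with a Wikipedia count of two.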
Comment on lines +752 to +765

    # HF Hub dataset card YAML frontmatter
    yaml_license = "apache-2.0" if bucket == "permissive" else "gpl-3.0"
    pretty = f"ailiance D2 KiCad combined corpus — {bucket}"
    bucket_overview = (
        "Apache-2.0 / MIT / BSD / CC0 / EUPL / CERN-OHL-P `.kicad_sch` files "
        "with derived ERC reports and noise-injected fix-it triplets, plus "
        "CC-BY-SA prose on EMC/signal integrity. Suitable for permissive "
        "downstream LoRA artifacts."
        if bucket == "permissive"
        else
        "GPL-3.0 / CERN-OHL-S `.kicad_sch` files (copyleft) with derived ERC "
        "reports and noise-injected fix-it triplets, plus CC-BY-SA prose. "
        "Downstream LoRA artifacts MUST be GPL-compatible per share-alike."
    )
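
To illustrate how the per-bucket values in this hunk might end up in the dataset card, here is a minimal sketch that renders the YAML frontmatter by hand. The three keys (`license`, `tags`, `pretty_name`) come from the PR description; the `dataset_card_frontmatter` helper, the tag values, and the choice to reject unknown bucket names (the hunk instead treats every non-permissive bucket as GPL) are assumptions.

```python
def dataset_card_frontmatter(bucket: str) -> str:
    """Hypothetical helper: render the YAML frontmatter for one bucket."""
    if bucket not in ("permissive", "copyleft"):
        # Assumption: fail loudly instead of silently defaulting to gpl-3.0.
        raise ValueError(f"unknown bucket: {bucket!r}")
    yaml_license = "apache-2.0" if bucket == "permissive" else "gpl-3.0"
    # ASCII hyphen used here for simplicity; the hunk's string uses a dash.
    pretty = f"ailiance D2 KiCad combined corpus - {bucket}"
    tags = ["kicad", "emc"]  # illustrative tags, not taken from the PR
    lines = ["---", f"license: {yaml_license}", "tags:"]
    lines += [f"  - {t}" for t in tags]
    lines += [f'pretty_name: "{pretty}"', "---"]
    return "\n".join(lines) + "\n"
```

Quoting `pretty_name` keeps the value a plain YAML string even if the name later gains characters such as a colon.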
Comment on lines +825 to +829

    - `source_repo` (HF dataset id),
    - `source_path` (path within the repo),
    - `license_spdx` (SPDX identifier — never mixed across buckets),
    - `surface` (one of: `sch`, `erc-report`, `noise-fix:<op>`, `prose-doc`),
    - `file_sha256` (64-hex of the original file — dedup + audit),
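
The field descriptions in this hunk can be read as a small schema. A sketch of a checker for the five fields shown, assuming the provenance record is available as a plain dict, follows. `provenance_problems` is a hypothetical helper, and treating `noise-fix:<op>` as "any non-empty suffix after the colon" is an assumption, since the set of valid ops is not shown.

```python
import re

_SHA256_RE = re.compile(r"[0-9a-f]{64}")
_FIXED_SURFACES = {"sch", "erc-report", "prose-doc"}
_REQUIRED = ("source_repo", "source_path", "license_spdx", "surface", "file_sha256")


def provenance_problems(p: dict) -> list:
    """Hypothetical helper: list the ways a provenance dict deviates from
    the documented fields. An empty list means no problem was found."""
    problems = [f"missing {key}" for key in _REQUIRED if not p.get(key)]
    surface = p.get("surface") or ""
    # Assumption: any non-empty op name after "noise-fix:" is acceptable.
    is_noise_fix = surface.startswith("noise-fix:") and len(surface) > len("noise-fix:")
    if surface and surface not in _FIXED_SURFACES and not is_noise_fix:
        problems.append(f"unknown surface {surface!r}")
    sha = p.get("file_sha256") or ""
    if sha and not _SHA256_RE.fullmatch(sha):
        problems.append("file_sha256 is not 64 lowercase hex chars")
    return problems
```

A check like this could run over every triplet before the manifest is written, so that a malformed record fails the build rather than surfacing in an audit.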
    git clone https://github.com/ailiance/ailiance-models-tuning
    cd ailiance-models-tuning
    pip install huggingface-hub
    # rebuild this exact bucket on electron-server (Docker iact-bench-kicad required):
Two enhancements on the D2 builder.
load_prose_corpus
Added a real Wikipedia REST API fetcher for 6 EMC topics (Electromagnetic_compatibility, Signal_integrity, Decoupling_capacitor, Printed_circuit_board, Impedance_matching, Ground_plane). Content is CC-BY-SA-3.0. The User-Agent string explicitly references the EU DSM Article 4 TDM exception, per upstream attribution best practice. Per-source provenance is propagated to the Provenance dataclass.
The arXiv eess.SP fetcher remains TODO: it requires a registered email, a rate-limit-aware client, and a TDM opt-out check against arXiv's opt-out endpoint.
gen_readme — full Annex IV §2(b)
E2E smoke with prose enabled
--max-projects 30→