feat: D2 prose Wikipedia EMC + EU AI Act README#12

Merged: electron-rare merged 1 commit into main from feat/d2-prose-wikipedia-readme-2026-05-11 on May 11, 2026

@electron-rare
Contributor

Two enhancements on the D2 builder.

load_prose_corpus

Added a real Wikipedia REST API fetcher for six EMC topics (Electromagnetic_compatibility, Signal_integrity, Decoupling_capacitor, Printed_circuit_board, Impedance_matching, Ground_plane). The fetched content is tagged CC-BY-SA-3.0. The User-Agent string explicitly references the EU DSM Article 4 TDM exception, per upstream attribution best practice. Per-source provenance is propagated to the Provenance dataclass.

The arXiv eess.SP fetcher remains a TODO: it requires a registered email, a rate-limit-aware client, and a TDM opt-out check against arXiv's opt-out endpoint.

gen_readme — full Annex IV §2(b)

  • YAML frontmatter (license, tags, pretty_name, task_categories, size_categories)
  • Per-bucket conditional overview (permissive vs copyleft)
  • License + surface distribution tables from MANIFEST_D2
  • Build stats with commit SHA + timestamp + PII delta
  • Reproducibility section
  • Intended use + foreseeable misuse (Art. 53(1)(b))
  • TDM-DSM disclosure
  • References (EU AI Act, IEC 61000, KiCad)
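As a rough illustration of the YAML frontmatter item in the list above, the sketch below renders a Hugging Face dataset-card header for one bucket. The function name `render_frontmatter` and the tag, task-category, and size-category values are assumptions made for illustration; only the bucket-to-license mapping is taken from the snippet quoted in the review, and the real `gen_readme()` may build its header differently.

```python
import json


def render_frontmatter(bucket: str) -> str:
    """Render a Hugging Face dataset-card YAML header for one license bucket."""
    # Bucket-to-license mapping as quoted in the review of gen_readme().
    license_id = "apache-2.0" if bucket == "permissive" else "gpl-3.0"
    fields = {
        "license": license_id,
        "pretty_name": f"ailiance D2 KiCad combined corpus ({bucket})",
        "tags": ["kicad", "emc"],                # hypothetical values
        "task_categories": ["text-generation"],  # hypothetical values
        "size_categories": ["n<1K"],             # hypothetical values
    }
    lines = ["---"]
    for key, value in fields.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            # json.dumps yields double-quoted scalars, which are valid YAML.
            lines.extend(f"  - {json.dumps(item)}" for item in value)
        else:
            lines.append(f"{key}: {json.dumps(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


print(render_frontmatter("permissive"))
```

Quoting every scalar avoids surprises with values such as `n<1K` that a YAML parser might otherwise treat specially.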

E2E smoke with prose enabled

--max-projects 30

  • 8 prose triplets loaded (2 KiCad seeds + 6 Wikipedia)
  • permissive_train.jsonl: 24 triplets total
  • copyleft_train.jsonl: 15 triplets total

Two enhancements:

1. load_prose_corpus: real Wikipedia EMC fetcher
   Adds Wikipedia REST API fetcher for 6 EMC topics
   (Electromagnetic_compatibility, Signal_integrity,
   Decoupling_capacitor, Printed_circuit_board,
   Impedance_matching, Ground_plane). CC-BY-SA-3.0.
   User-Agent references EU DSM Art 4 TDM exception.

   arXiv eess.SP fetcher remains TODO (registered email +
   rate-limit + TDM opt-out check required).

2. gen_readme: full Annex IV section 2(b) compliance
   - YAML frontmatter HF schema (license, tags, pretty_name)
   - Per-bucket conditional overview (permissive vs copyleft)
   - License distribution table from MANIFEST_D2.json
   - Surface distribution (sch / erc / noise-fix / prose-doc)
   - Build stats: builder SHA + timestamp + PII filter delta
   - Reproducibility section with exact rebuild command
   - Intended use + foreseeable misuse Art 53(1)(b)
   - TDM-DSM disclosure paragraph
   - References EU AI Act + IEC 61000 + KiCad docs
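The license distribution table mentioned in the list above could be produced along the following lines. This is only a sketch: the layout of MANIFEST_D2.json is not shown in this PR, so the code assumes a list of records that each carry a `license_spdx` key, and `license_table` is a hypothetical helper name.

```python
from collections import Counter


def license_table(records):
    """Render a Markdown table of triplet counts per SPDX license id."""
    counts = Counter(r["license_spdx"] for r in records)
    total = sum(counts.values())
    rows = ["| License | Triplets | Share |", "|---|---:|---:|"]
    # Most frequent license first; ties broken alphabetically for stable output.
    for spdx, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        rows.append(f"| {spdx} | {n} | {n / total:.1%} |")
    return "\n".join(rows)


# Assumed manifest shape, for illustration only.
sample = [{"license_spdx": "MIT"}] * 3 + [{"license_spdx": "Apache-2.0"}]
print(license_table(sample))
```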

E2E smoke with prose --max-projects 30:
  8 prose triplets (2 seeds + 6 Wikipedia)
  permissive: 24 triplets total
  copyleft:   15 triplets total
Copilot AI review requested due to automatic review settings May 11, 2026 20:41
electron-rare merged commit 1ca581c into main on May 11, 2026
2 of 6 checks passed
electron-rare deleted the feat/d2-prose-wikipedia-readme-2026-05-11 branch on May 11, 2026 20:41

Copilot AI left a comment

Pull request overview

Enhances the KiCad D2 combined dataset builder by expanding the prose corpus (now including Wikipedia EMC/SI extracts) and generating a more EU AI Act–oriented README/dataset card.

Changes:

  • Refactors prose triplet generation and adds Wikipedia REST API summary fetching for a fixed set of EMC/SI topics.
  • Expands gen_readme() to emit HF dataset-card YAML frontmatter and richer Annex IV / Art. 53 compliance sections, plus license/surface distribution tables.

Comment on lines +563 to +573
t.metadata = {
    "provenance": asdict(Provenance(
        source_repo=source_repo,
        source_path=title,
        license_spdx=license_spdx,
        surface="prose-doc",
        file_sha256=hashlib.sha256(content.encode()).hexdigest(),
        build_sha=BUILD_SHA,
        timestamp_utc=datetime.now(timezone.utc).isoformat(),
    )),
}
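For readers without the full file at hand, here is a minimal, self-contained sketch of how a provenance record like the one above could be assembled. The field names come from the snippet; the dataclass definition itself, the helper name `make_provenance`, and the `build_sha` default are assumptions, since the real `Provenance` class and `BUILD_SHA` constant are defined outside the quoted lines.

```python
import hashlib
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    """Assumed shape, reconstructed from the keyword arguments in the snippet."""
    source_repo: str
    source_path: str
    license_spdx: str
    surface: str
    file_sha256: str
    build_sha: str
    timestamp_utc: str


def make_provenance(content, source_path, license_spdx, source_repo,
                    build_sha="unknown"):
    """Return a JSON-serialisable provenance dict for one prose chunk."""
    return asdict(Provenance(
        source_repo=source_repo,
        source_path=source_path,
        license_spdx=license_spdx,
        surface="prose-doc",
        # Hash the UTF-8 bytes of the chunk, as the reviewed code does.
        file_sha256=hashlib.sha256(content.encode()).hexdigest(),
        build_sha=build_sha,  # hypothetical default; the builder uses BUILD_SHA
        timestamp_utc=datetime.now(timezone.utc).isoformat(),
    ))


record = make_provenance("lead section text", "Wikipedia/Ground_plane",
                         "CC-BY-SA-3.0", "en.wikipedia.org")
print(record["file_sha256"])
```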
Comment on lines +584 to +620
# Wikipedia content is CC-BY-SA-3.0; attribution is by article title.
# We fetch the lead section (first ~1500 chars) only — sufficient for
# a self-contained design-principle chunk.
WIKI_TOPICS = [
    "Electromagnetic_compatibility",
    "Signal_integrity",
    "Decoupling_capacitor",
    "Printed_circuit_board",
    "Impedance_matching",
    "Ground_plane",
]
import urllib.request
import urllib.parse
for topic in WIKI_TOPICS:
    try:
        url = (
            "https://en.wikipedia.org/api/rest_v1/page/summary/"
            + urllib.parse.quote(topic)
        )
        req = urllib.request.Request(
            url,
            headers={"User-Agent": "ailiance-d2-builder/0.1 (compliance: EU-DSM-TDM-Art4)"},
        )
        with urllib.request.urlopen(req, timeout=10) as r:
            data = json.loads(r.read().decode("utf-8"))
        extract = (data.get("extract") or "").strip()
        if not extract or len(extract) < 200:
            log.debug(" wiki %s: extract too short, skipping", topic)
            continue
        chunk = extract[:PROSE_CHUNK_CHARS]
        triplets.append(_prose_triplet(
            chunk,
            f"Wikipedia/{topic}",
            "CC-BY-SA-3.0",
            "en.wikipedia.org",
        ))
    except Exception as e:
        log.warning(" wiki %s fetch failed: %r", topic, e)
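The validation and truncation step in the loop above can be isolated into a pure function, which makes it testable without network access. The following is a sketch under assumptions: `select_extract` is a hypothetical name, and the 1500-character default stands in for `PROSE_CHUNK_CHARS`, whose actual value is not shown in the quoted lines (the comment only says "~1500 chars").

```python
def select_extract(payload, min_chars=200, max_chars=1500):
    """Return the usable lead-section text from a page-summary payload, or None.

    payload is the decoded JSON dict; only its "extract" key is read.
    """
    extract = (payload.get("extract") or "").strip()
    if len(extract) < min_chars:
        # Covers both a missing/empty extract and one too short to be useful.
        return None
    return extract[:max_chars]


print(select_extract({"extract": "too short"}))
print(len(select_extract({"extract": "x" * 5000})))
```

Separating the decision from the HTTP call also lets the caller count accepted and skipped topics explicitly.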
Comment on lines +627 to +628
log.info(" loaded %d prose triplets (seeds %d + wikipedia %d)",
         len(triplets), len(kicad_seeds), len(WIKI_TOPICS))
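Note that `len(WIKI_TOPICS)` equals the number of Wikipedia triplets only when every fetch succeeds and every extract passes the length check. One possible way to report the real split is to count by provenance, as in the sketch below; `count_by_origin` is a hypothetical helper, and it assumes each record exposes the `source_path` set to `Wikipedia/<topic>` in the fetch loop.

```python
def count_by_origin(provenances):
    """Split prose provenance records into (seed_count, wikipedia_count)."""
    wiki = sum(1 for p in provenances
               if p["source_path"].startswith("Wikipedia/"))
    return len(provenances) - wiki, wiki


records = [
    {"source_path": "Wikipedia/Ground_plane"},
    {"source_path": "kicad-seed/erc-basics"},  # hypothetical seed path
    {"source_path": "Wikipedia/Signal_integrity"},
]
seeds, wikipedia = count_by_origin(records)
print(seeds, wikipedia)
```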
Comment on lines +752 to +765
# HF Hub dataset card YAML frontmatter
yaml_license = "apache-2.0" if bucket == "permissive" else "gpl-3.0"
pretty = f"ailiance D2 KiCad combined corpus — {bucket}"
bucket_overview = (
    "Apache-2.0 / MIT / BSD / CC0 / EUPL / CERN-OHL-P `.kicad_sch` files "
    "with derived ERC reports and noise-injected fix-it triplets, plus "
    "CC-BY-SA prose on EMC/signal integrity. Suitable for permissive "
    "downstream LoRA artifacts."
    if bucket == "permissive"
    else
    "GPL-3.0 / CERN-OHL-S `.kicad_sch` files (copyleft) with derived ERC "
    "reports and noise-injected fix-it triplets, plus CC-BY-SA prose. "
    "Downstream LoRA artifacts MUST be GPL-compatible per share-alike."
)
Comment on lines +825 to +829
- `source_repo` (HF dataset id),
- `source_path` (path within the repo),
- `license_spdx` (SPDX identifier — never mixed across buckets),
- `surface` (one of: `sch`, `erc-report`, `noise-fix:<op>`, `prose-doc`),
- `file_sha256` (64-hex of the original file — dedup + audit),
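The field descriptions above lend themselves to a simple audit check. The sketch below is an assumption-laden illustration rather than part of the builder: `provenance_problems` is a hypothetical helper, and the allowed `surface` values and the 64-hex rule are taken directly from the list.

```python
import re

# Allowed surfaces per the field list: sch, erc-report, noise-fix:<op>, prose-doc.
SURFACE_RE = re.compile(r"(sch|erc-report|prose-doc|noise-fix:.+)")
SHA256_RE = re.compile(r"[0-9a-f]{64}")


def provenance_problems(record):
    """Return a list of human-readable problems; empty means the record passes."""
    problems = []
    for key in ("source_repo", "source_path", "license_spdx"):
        if not record.get(key):
            problems.append(f"missing {key}")
    if not SURFACE_RE.fullmatch(record.get("surface", "")):
        problems.append("unknown surface")
    if not SHA256_RE.fullmatch(record.get("file_sha256", "")):
        problems.append("file_sha256 is not 64 lowercase hex characters")
    return problems


good = {
    "source_repo": "en.wikipedia.org",
    "source_path": "Wikipedia/Ground_plane",
    "license_spdx": "CC-BY-SA-3.0",
    "surface": "prose-doc",
    "file_sha256": "0" * 64,
}
print(provenance_problems(good))
```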
git clone https://github.com/ailiance/ailiance-models-tuning
cd ailiance-models-tuning
pip install huggingface-hub
# rebuild this exact bucket on electron-server (Docker iact-bench-kicad required):