Skip to content

fix: D2 builder JSONL source + sh -c + pii path#9

Merged
electron-rare merged 1 commit into
mainfrom
fix/d2-builder-jsonl-source-format-2026-05-11
May 11, 2026
Merged

fix: D2 builder JSONL source + sh -c + pii path#9
electron-rare merged 1 commit into
mainfrom
fix/d2-builder-jsonl-source-format-2026-05-11

Conversation

@electron-rare
Copy link
Copy Markdown
Contributor

Three fixes after smoke testing the D2 builder on real electron-server Docker. Outstanding blocker discovered, documented for tomorrow.

Fixes shipped

  1. load_source_corpus refactorkicad9plus-{permissive,copyleft} is a chat-format JSONL (98 / 209 rows), not a tree of .kicad_sch files. Each row carries the schematic in messages[1].content + full metadata (license_spdx, source_url, file_sha256, ia_act_status). Switch from Path.rglob to JSONL parser. Saves the LICENSE walk since SPDX is already in metadata.

  2. Docker run ERCkicad-cli sch erc ... && cat /tmp/erc.json was passed as docker run argv, so && became a positional arg → Maximum number of positional arguments exceeded. Wrap in ["sh", "-c", "kicad-cli ... && cat ..."].

  3. pii_scan import search path — widened to 5 plausible paths (repo-relative, electron-server /tmp clone, grosmac /tmp clone, ~/ailiance-models-tuning) so the builder finds tools/pii_scan.py regardless of host.

🚨 Outstanding blocker (NOT fixed here)

80/98 schematics in kicad9plus-permissive are truncated to ~8241 bytes by the chat-format chunker. kicad-cli rejects them "Failed to load schematic" because the S-expression isn't balanced.

File declared bytes actual bytes ratio
C11pwr.kicad_sch 177215 8241 4.7%
C11bus.kicad_sch 182569 8241 4.5%
C11con.kicad_sch 105105 8241 7.8%
C11.kicad_sch 4475 4475 100%
fpga_candelabra/power_supply_main.kicad_sch 286104 8241 2.9%

Two paths to unblock (pick tomorrow)

A. Fetch full .kicad_sch from metadata.source_url (GitHub raw) with a local cache. ~50 LOC. Network at fetch time, but kicad-cli stays in --network=none sandbox.

B. Switch source corpus to electron-rare/kicad9plus-sch-corpus (the scraper raw output, no chat-format truncation). ~10 LOC. Less proven license bucketing (no permissive/copyleft split yet).

Smoke test on electron-server (this branch)

loaded 3 projects from Ailiance-fr/kicad9plus-permissive
loaded 3 projects from Ailiance-fr/kicad9plus-copyleft
[6] PII scan + filter on /tmp/d2_build/permissive_train.jsonl
WARNING pii_scan not available (No module named 'pii_scan'), skipping PII filter
=== D2 builder done ===

→ Pipeline runs cleanly, ERC gracefully fails on truncated sch (valid=False), all projects skipped, empty jsonl output. Fix the blocker → smoke test should produce triplets.

Three fixes after smoke testing the D2 builder on real
electron-server Docker:

1. load_source_corpus refactor
   The kicad9plus dataset is a chat-format jsonl, not a tree of
   files. Each row is {messages: [user, assistant], metadata: {...}}
   where messages[1].content holds the .kicad_sch text. Switch from
   Path.rglob to dataset.jsonl line parser; metadata already carries
   license_spdx + source_url + file_sha256, no need to walk LICENSE.

2. docker run ERC must use sh -c
   The previous cmd list passed && to docker run argv as a positional
   arg, kicad-cli complained "Maximum number of positional arguments
   exceeded". Wrap in sh -c so the && cat redirect works.

3. pii_scan import search path widened
   The builder may run from grosmac /tmp, electron-server /tmp, or
   the repo root. Probe several plausible paths for tools/pii_scan.py
   before falling back to the warning-and-skip code path.

Outstanding blocker (not fixed by this PR):
  The chat-format dataset truncates the assistant content to
  ~8241 bytes for files larger than that. 80 of 98 schematics in
  kicad9plus-permissive are truncated to 2.9-52 percent of the
  declared file_size_bytes. kicad-cli sch erc fails Failed to load
  schematic on these unbalanced S-expression fragments.

Two options to unblock D2 ERC, picked tomorrow:
  A. fetch full sch from metadata.source_url (GitHub raw) with a
     local file cache, then run kicad-cli on the cached file.
  B. switch SOURCE_PERMISSIVE/COPYLEFT to electron-rare/kicad9plus
     sch-corpus the scraper raw output, no chat-format truncation.
Copilot AI review requested due to automatic review settings May 11, 2026 19:55
@electron-rare electron-rare merged commit f64098f into main May 11, 2026
1 of 5 checks passed
@electron-rare electron-rare deleted the fix/d2-builder-jsonl-source-format-2026-05-11 branch May 11, 2026 19:55
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the KiCad D2 combined dataset builder to work with the actual kicad9plus-{permissive,copyleft} dataset format and to fix command invocation/import path issues found during Docker smoke testing.

Changes:

  • Refactors source loading to parse dataset.jsonl (chat-format) instead of walking .kicad_sch/.kicad_pcb files.
  • Fixes ERC invocation by executing the kicad-cli ... && cat ... sequence via a shell.
  • Broadens the search path used to import tools/pii_scan.py during the compliance audit.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +112 to +115
The dataset is already in chat format: each row is
{"messages": [{role:user, ...}, {role:assistant, content:<sch>}],
"metadata": {source_url, license_spdx, commit_sha, kicad_version,
repo, rel_path, file_size_bytes, file_sha256, ...}}
Comment on lines 225 to 229
"""Generate ERC and DRC reports for one project.

Returns {"erc": {...}, "drc": {...}, "valid": bool} where `valid` means
both ERC and DRC return 0 errors (only 0/warnings).
"""
@@ -333,15 +333,18 @@
"""
triplets = []
rng = random.Random(SEED + hash(project["project"]))
Comment on lines +546 to +558
# Search several plausible paths to find tools/pii_scan.py — works
# whether builder runs from the repo, grosmac /tmp clone, or
# electron-server /tmp clone.
for p in [
Path(__file__).parent.parent / "tools", # repo-relative (builders/ → ../tools/)
Path("/home/electron/ailiance-models-tuning/tools"),
Path("/tmp/ailiance-models-tuning/tools"),
Path("/tmp/amt_pr/tools"),
Path.home() / "ailiance-models-tuning" / "tools",
]:
if (p / "pii_scan.py").exists():
sys.path.insert(0, str(p))
break
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants