fix: D2 builder JSONL source + sh -c + pii path #9
Merged
electron-rare merged 1 commit on May 11, 2026
Three fixes after smoke testing the D2 builder on real
electron-server Docker:
1. load_source_corpus refactor
The kicad9plus dataset is a chat-format jsonl, not a tree of
files. Each row is {messages: [user, assistant], metadata: {...}}
where messages[1].content holds the .kicad_sch text. Switch from
Path.rglob to dataset.jsonl line parser; metadata already carries
license_spdx + source_url + file_sha256, no need to walk LICENSE.
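A minimal sketch of the line-based loader described above. The function name iter_chat_rows and the returned dict layout are illustrative assumptions, not the builder's actual code; only the messages[1].content and metadata field names come from the dataset description.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_chat_rows(jsonl_path: Path) -> Iterator[dict]:
    """Yield one record per non-empty line of a chat-format dataset.jsonl."""
    with jsonl_path.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            messages = row.get("messages", [])
            if len(messages) < 2:
                # Assumption: rows without an assistant turn are skipped.
                continue
            meta = row.get("metadata", {})
            yield {
                "line_no": line_no,
                # messages[1] is the assistant turn holding the .kicad_sch text.
                "sch_text": messages[1]["content"],
                "license_spdx": meta.get("license_spdx"),
                "source_url": meta.get("source_url"),
                "file_sha256": meta.get("file_sha256"),
            }
```

Because license and provenance travel with each row, nothing here touches the filesystem beyond the one JSONL file.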
2. docker run ERC must use sh -c
The previous cmd list passed && to docker run argv as a positional
arg, so kicad-cli complained "Maximum number of positional arguments
exceeded". Wrap the command in sh -c so the && cat chain is
interpreted by a shell.
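A sketch of the argv shape this fix implies. The helper name build_erc_argv, the /work mount point, and the image argument are hypothetical; the kicad-cli flags shown (--format json, --output) are assumed from the KiCad CLI and should be checked against the installed version.

```python
import shlex


def build_erc_argv(image: str, host_dir: str, sch_name: str) -> list[str]:
    """Build a docker run argv whose last element is a single shell string."""
    report = "/tmp/erc.json"
    # The whole "erc && cat" chain is ONE argument handed to sh -c, so the
    # shell (not kicad-cli) interprets the && operator.
    inner = (
        f"kicad-cli sch erc --format json --output {shlex.quote(report)} "
        f"{shlex.quote('/work/' + sch_name)} && cat {shlex.quote(report)}"
    )
    return [
        "docker", "run", "--rm", "--network=none",
        "-v", f"{host_dir}:/work:ro",  # hypothetical read-only mount
        image,
        "sh", "-c", inner,
    ]
```

Passing this list to subprocess.run without shell=True keeps host-side quoting simple: only the container's sh ever parses the && chain.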
3. pii_scan import search path widened
The builder may run from grosmac /tmp, electron-server /tmp, or
the repo root. Probe several plausible paths for tools/pii_scan.py
before falling back to the warning-and-skip code path.
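The probe-then-fallback behaviour can be sketched as a small helper. import_pii_scan is a hypothetical name and the warning text is illustrative; the actual builder inlines the loop rather than using a function like this.

```python
import importlib
import sys
import warnings
from pathlib import Path
from types import ModuleType
from typing import Iterable, Optional


def import_pii_scan(candidates: Iterable[Path]) -> Optional[ModuleType]:
    """Import pii_scan from the first candidate dir that contains it."""
    for tools_dir in candidates:
        if (tools_dir / "pii_scan.py").exists():
            sys.path.insert(0, str(tools_dir))
            importlib.invalidate_caches()
            return importlib.import_module("pii_scan")
    # Warning-and-skip path: the caller treats None as "no PII scan available".
    warnings.warn("pii_scan.py not found in any candidate path; skipping PII scan")
    return None
```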
Outstanding blocker (not fixed by this PR):
The chat-format dataset truncates the assistant content to
~8241 bytes for files larger than that. 80 of 98 schematics in
kicad9plus-permissive are truncated to 2.9-52 percent of the
declared file_size_bytes. kicad-cli sch erc fails with "Failed to
load schematic" on these unbalanced S-expression fragments.
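A rough way to flag truncated rows before handing them to kicad-cli, assuming file_size_bytes is the byte length of the original UTF-8 file. Both function names are illustrative, and the size comparison may misfire if the dataset normalised line endings.

```python
def sexpr_is_balanced(text: str) -> bool:
    """Check parenthesis balance, ignoring parens inside quoted strings."""
    depth = 0
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and not in_string


def is_truncated(sch_text: str, declared_size: int) -> bool:
    """True if the text is shorter than declared or structurally unbalanced."""
    actual = len(sch_text.encode("utf-8"))
    return actual < declared_size or not sexpr_is_balanced(sch_text)
```

Skipping such rows up front would avoid one Docker launch per known-bad schematic.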
Two options to unblock D2 ERC, to be decided tomorrow:
A. fetch full sch from metadata.source_url (GitHub raw) with a
local file cache, then run kicad-cli on the cached file.
B. switch SOURCE_PERMISSIVE/COPYLEFT to
electron-rare/kicad9plus-sch-corpus (the scraper raw output, no
chat-format truncation).
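Option A could look roughly like the sketch below. It assumes metadata.source_url is directly fetchable and that file_sha256 is the hex SHA-256 of the raw file bytes; fetch_full_sch, the cache layout, and the injectable fetch parameter are all hypothetical.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable, Optional


def _http_get(url: str) -> bytes:
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def fetch_full_sch(
    source_url: str,
    file_sha256: str,
    cache_dir: Path,
    fetch: Callable[[str], bytes] = _http_get,
) -> Optional[Path]:
    """Return a cached copy of the full schematic, downloading it if needed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / f"{file_sha256}.kicad_sch"
    if cached.exists():
        return cached  # cache hit: no network access
    data = fetch(source_url)
    if hashlib.sha256(data).hexdigest() != file_sha256:
        # Upstream file changed or download was corrupted; do not cache it.
        return None
    cached.write_bytes(data)
    return cached
```

Keying the cache on the declared hash means a later run needs no network at all, and kicad-cli only ever sees the local file.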
Pull request overview
This PR updates the KiCad D2 combined dataset builder to work with the actual kicad9plus-{permissive,copyleft} dataset format and to fix command invocation/import path issues found during Docker smoke testing.
Changes:
- Refactors source loading to parse dataset.jsonl (chat-format) instead of walking .kicad_sch/.kicad_pcb files.
- Fixes ERC invocation by executing the kicad-cli ... && cat ... sequence via a shell.
- Broadens the search path used to import tools/pii_scan.py during the compliance audit.
Comment on lines +112 to +115
    The dataset is already in chat format: each row is
    {"messages": [{role:user, ...}, {role:assistant, content:<sch>}],
     "metadata": {source_url, license_spdx, commit_sha, kicad_version,
                  repo, rel_path, file_size_bytes, file_sha256, ...}}
Comment on lines 225 to 229
    """Generate ERC and DRC reports for one project.

    Returns {"erc": {...}, "drc": {...}, "valid": bool} where `valid` means
    both ERC and DRC return 0 errors (only 0/warnings).
    """
    @@ -333,15 +333,18 @@
        """
        triplets = []
        rng = random.Random(SEED + hash(project["project"]))
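One caveat on the seed above: Python salts hash() for str per process unless PYTHONHASHSEED is fixed, so the RNG stream can differ between runs. A possible stable alternative is sketched here; project_rng is a hypothetical helper and the SEED value is a placeholder, not the builder's constant.

```python
import hashlib
import random

SEED = 42  # placeholder; the builder defines its own SEED


def project_rng(project_name: str, seed: int = SEED) -> random.Random:
    """Derive a per-project RNG that is identical across processes."""
    digest = hashlib.sha256(project_name.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer offset.
    return random.Random(seed + int.from_bytes(digest[:8], "big"))
```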
Comment on lines +546 to +558
    # Search several plausible paths to find tools/pii_scan.py: works
    # whether builder runs from the repo, grosmac /tmp clone, or
    # electron-server /tmp clone.
    for p in [
        Path(__file__).parent.parent / "tools",  # repo-relative (builders/ → ../tools/)
        Path("/home/electron/ailiance-models-tuning/tools"),
        Path("/tmp/ailiance-models-tuning/tools"),
        Path("/tmp/amt_pr/tools"),
        Path.home() / "ailiance-models-tuning" / "tools",
    ]:
        if (p / "pii_scan.py").exists():
            sys.path.insert(0, str(p))
            break
Three fixes after smoke testing the D2 builder on real electron-server Docker. Outstanding blocker discovered, documented for tomorrow.

Fixes shipped
1. load_source_corpus refactor: kicad9plus-{permissive,copyleft} is a chat-format JSONL (98 / 209 rows), not a tree of .kicad_sch files. Each row carries the schematic in messages[1].content plus full metadata (license_spdx, source_url, file_sha256, ia_act_status). Switch from Path.rglob to a JSONL parser. This saves the LICENSE walk since SPDX is already in metadata.
2. Docker run ERC: kicad-cli sch erc ... && cat /tmp/erc.json was passed as docker run argv, so && became a positional arg → Maximum number of positional arguments exceeded. Wrap in ["sh", "-c", "kicad-cli ... && cat ..."].
3. pii_scan import search path: widened to 5 plausible paths (repo-relative, electron-server /tmp clone, grosmac /tmp clone, ~/ailiance-models-tuning) so the builder finds tools/pii_scan.py regardless of host.

🚨 Outstanding blocker (NOT fixed here)
80/98 schematics in kicad9plus-permissive are truncated to ~8241 bytes by the chat-format chunker. kicad-cli rejects them with "Failed to load schematic" because the S-expression isn't balanced.

Two paths to unblock (pick tomorrow)
A. Fetch full .kicad_sch from metadata.source_url (GitHub raw) with a local cache. ~50 LOC. Network at fetch time, but kicad-cli stays in --network=none sandbox.
B. Switch source corpus to electron-rare/kicad9plus-sch-corpus (the scraper raw output, no chat-format truncation). ~10 LOC. Less proven license bucketing (no permissive/copyleft split yet).

Smoke test on electron-server (this branch)
→ Pipeline runs cleanly, ERC gracefully fails on truncated sch (valid=False), all projects skipped, empty jsonl output. Fix the blocker → smoke test should produce triplets.