Skip to content

feat: ingestion primitives — clone, discover, frontmatter, bundle (tasks 4.1-4.4)#4

Merged
johnmcollier merged 3 commits into
redhat-ai-dev:mainfrom
johnmcollier:feat/epic4-ingestion-primitives
Jun 19, 2026
Merged

feat: ingestion primitives — clone, discover, frontmatter, bundle (tasks 4.1-4.4)#4
johnmcollier merged 3 commits into
redhat-ai-dev:mainfrom
johnmcollier:feat/epic4-ingestion-primitives

Conversation

@johnmcollier

Copy link
Copy Markdown
Contributor

Summary

Implements the four foundational ingestion primitives in src/server/ingestion/:

  • clone.ts — shallow --depth 1 git clone via simple-git; throws CLONE_FAILED: <message> on error
  • discover.ts — walks the 6 spec-mandated discovery paths (skills/, .claude/skills/, .cursor/skills/, .github/copilot/skills/, .windsurf/skills/, .gemini/skills/); case-insensitive SKILL.md match; returns slug, skillMdPath, skillDir, discoveryPath, supportingFiles
  • frontmatter.ts — parses ----delimited YAML frontmatter; returns { ok: false, reason } (never throws) for missing delimiters, malformed YAML, or missing name/description
  • bundle.ts — classifies single-file skills as skill-md, multi-file as archive (in-memory tar.gz); computes SHA256 hex digest on the artifact

Notes

Edge case (root-level SKILL.md slug): If a SKILL.md sits directly in a discovery root (e.g. skills/SKILL.md), the slug resolves to the discovery dir basename (e.g. "skills"). This is an unspecified edge case in the spec — if two discovery paths both have a root-level SKILL.md the slugs would collide. Will be addressed when ingestSource wires slugs into the DB in the follow-up PR.

Test plan

  • npm run build:server — zero TypeScript errors
  • npm test — 16/16 passing (no regressions to existing DB tests)

Related

  • OpenSpec change: rhess-enterprise-skills-server
  • Tasks: 4.1, 4.2, 4.3, 4.4
  • Followed by: feat/epic4-ingestion-pipeline (tasks 4.5–4.8)

Made with Cursor

Add src/server/ingestion/ module with:
- clone.ts: shallow git clone via simple-git, throws CLONE_FAILED on error
- discover.ts: walks 6 spec-mandated discovery paths for SKILL.md files
- frontmatter.ts: js-yaml YAML frontmatter parser with name/description validation
- bundle.ts: SHA256 + skill-md or tar.gz archive bundling
- index.ts: re-exports all public types and functions

Co-authored-by: Cursor <cursoragent@cursor.com>
@qodo-code-review

Copy link
Copy Markdown

PR Summary by Qodo

Add ingestion primitives: clone, discover, frontmatter, bundle
✨ Enhancement 📝 Documentation 🕐 20-40 Minutes

Grey Divider

Description

• Add core ingestion primitives for cloning repos, discovering skills, parsing frontmatter, and
 bundling artifacts.
• Discover SKILL.md across all spec discovery roots, emitting slugs and supporting file lists.
• Bundle skills as raw SKILL.md or tar.gz with SHA-256 digest; mark tasks 4.1–4.4 complete.
Diagram

graph TD
  A{{"Git repo URL"}} --> B["clone()"] --> C["Local repo dir"] --> D["discoverSkills()"] --> E["SkillCandidate[]"]
  E --> F["parseFrontmatter()"]
  E --> G["bundleSkill()"] --> H["BundleResult"]
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Stream tar+hash instead of buffering archive in memory
  • ➕ Avoids holding full tar.gz bytes in memory for large skills
  • ➕ Can compute SHA-256 incrementally while streaming output
  • ➖ More complex implementation (stream wiring and error handling)
  • ➖ May require different tar API usage or additional utilities
2. Use a dedicated frontmatter library (e.g., gray-matter)
  • ➕ Battle-tested frontmatter detection and parsing
  • ➕ Can simplify delimiter handling and metadata extraction
  • ➖ Adds/locks in another dependency and behavior surface
  • ➖ May not match the spec’s exact validation/error messaging needs
3. Namespace slugs by discovery root to prevent collisions
  • ➕ Eliminates root-level SKILL.md slug collisions across discovery paths
  • ➕ Makes candidate identity stable before DB integration
  • ➖ Changes slug semantics (may diverge from current spec expectations)
  • ➖ Could require migration/normalization logic once DB wiring lands

Recommendation: The current approach is a good MVP for the ingestion pipeline follow-up: it’s small, dependency-light (reuses existing deps), and provides clear typed outputs. When wiring into ingestSource/DB, strongly consider namespacing slugs (or otherwise disambiguating) and switching archive creation to a streaming approach if skill bundles can be large.

Files changed (6) +267 / -4

Enhancement (5) +263 / -0
bundle.tsAdd skill bundling with SHA-256 and tar.gz for multi-file skills +64/-0

Add skill bundling with SHA-256 and tar.gz for multi-file skills

• Introduces BundleResult and bundleSkill() to emit either raw SKILL.md content or a base64-encoded tar.gz archive when supporting files exist. Computes a SHA-256 hex digest over the produced artifact and includes SKILL.md plus all supporting files in the archive.

src/server/ingestion/bundle.ts

clone.tsAdd shallow git clone primitive with normalized error message +11/-0

Add shallow git clone primitive with normalized error message

• Adds clone(url, dest) implemented via simple-git with --depth 1 for shallow cloning. Wraps failures into a CLONE_FAILED: <message> error to standardize downstream handling.

src/server/ingestion/clone.ts

discover.tsImplement spec discovery of SKILL.md candidates across supported roots +120/-0

Implement spec discovery of SKILL.md candidates across supported roots

• Walks the six spec-mandated discovery directories and recursively finds case-insensitive SKILL.md files. Emits SkillCandidate objects with slug selection (discovery-root basename for root-level skills), absolute paths, discovery path, and a relative supportingFiles list including nested subdirectories.

src/server/ingestion/discover.ts

frontmatter.tsAdd non-throwing YAML frontmatter parser with required field validation +61/-0

Add non-throwing YAML frontmatter parser with required field validation

• Implements parseFrontmatter() using a --- delimited regex and js-yaml parsing. Returns a typed ok/result union (never throws) and validates required name/description plus optional allowed-tools string array.

src/server/ingestion/frontmatter.ts

index.tsExport ingestion primitives and public types +7/-0

Export ingestion primitives and public types

• Adds a barrel index exporting clone, discoverSkills, parseFrontmatter, bundleSkill, and their associated public types for consumption by the upcoming ingestion pipeline.

src/server/ingestion/index.ts

Documentation (1) +4 / -4
tasks.mdMark ingestion primitive tasks 4.1–4.4 as complete +4/-4

Mark ingestion primitive tasks 4.1–4.4 as complete

• Updates the OpenSpec task checklist to mark clone/discover/frontmatter/bundle tasks as done. No behavioral code changes; documentation/progress tracking only.

openspec/changes/rhess-enterprise-skills-server/tasks.md

@qodo-code-review

qodo-code-review Bot commented Jun 19, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0) 📜 Skill insights (0)

Grey Divider


Action required

1. Nested SKILL.md bundled ✓ Resolved 🐞 Bug ≡ Correctness
Description
discoverSkills() includes all files from subdirectories as supportingFiles without excluding nested
SKILL.md, so a parent skill can be misclassified as an archive and its artifact can include other
skills’ SKILL.md content. This breaks skill boundaries because the spec treats every SKILL.md under
discovery paths as a separate ingestion candidate.
Code

src/server/ingestion/discover.ts[R71-85]

+      // Collect supporting files (all other files in skillDir, relative paths)
+      const supportingFiles: string[] = [];
+      try {
+        const entries = fs.readdirSync(skillDir, { withFileTypes: true });
+        for (const entry of entries) {
+          if (entry.isFile() && !isSkillMd(entry.name)) {
+            supportingFiles.push(entry.name);
+          } else if (entry.isDirectory()) {
+            // Include files in subdirectories too
+            const sub = walkAllFiles(path.join(skillDir, entry.name));
+            for (const f of sub) {
+              supportingFiles.push(path.relative(skillDir, f));
+            }
+          }
+        }
Evidence
The supportingFiles collection explicitly recurses into subdirectories and adds every file returned
by walkAllFiles(); walkAllFiles() includes all files and does not filter out SKILL.md. The spec
defines all SKILL.md under discovery paths as ingestion candidates, so nested SKILL.md should not be
treated as a supporting file of another skill.

src/server/ingestion/discover.ts[71-120]
openspec/changes/rhess-enterprise-skills-server/specs/skill-ingestion/spec.md[23-26]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`discoverSkills()` builds `supportingFiles` by recursively walking subdirectories and collecting *all* files. Because `walkAllFiles()` does not exclude `SKILL.md`, a nested skill (a subdirectory containing its own `SKILL.md`) will be treated as a supporting file of the parent skill. This can (a) incorrectly classify the parent as `archive`, and (b) bundle another skill’s `SKILL.md` into the parent artifact.

### Issue Context
The ingestion spec states that **all** `SKILL.md` files found under discovery paths are candidates for ingestion, i.e., nested `SKILL.md` represents a separate skill rather than a supporting file.

### Fix Focus Areas
- src/server/ingestion/discover.ts[71-120]

### Suggested fix approach
- When collecting `supportingFiles`, exclude any file whose basename matches `SKILL.md` (case-insensitive).
- Additionally (recommended), if a subdirectory contains its own `SKILL.md`, do **not** include any files from that subdirectory in the parent skill’s `supportingFiles` (treat it as a separate skill subtree). This prevents bundling an entire nested skill as “supporting files” of the parent.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Non-deterministic archive digest ✓ Resolved 🐞 Bug ☼ Reliability
Description
supportingFiles are collected in filesystem enumeration order and then fed directly to tar.create();
because the list is not sorted, the tar.gz byte stream (and computed SHA-256 digest) can change
across runs/filesystems even when file contents are unchanged. This makes digests unstable and can
cause unnecessary churn in downstream indexing/caching.
Code

src/server/ingestion/bundle.ts[R35-52]

+  await new Promise<void>((resolve, reject) => {
+    const pack = create(
+      {
+        gzip: true,
+        cwd: candidate.skillDir,
+        portable: true,
+      },
+      getAllRelativeFiles(candidate),
+    );
+
+    pack.on("data", (chunk: Buffer) => chunks.push(chunk));
+    pack.on("end", resolve);
+    pack.on("error", reject);
+  });
+
+  const buffer = Buffer.concat(chunks);
+  const digest = crypto.createHash("sha256").update(buffer).digest("hex");
+
Evidence
discoverSkills() builds supportingFiles by iterating over fs.readdirSync() results and pushing into
an array without sorting; bundleArchive() passes that array as-is into tar.create() and computes the
digest from the resulting archive bytes.

src/server/ingestion/discover.ts[73-85]
src/server/ingestion/bundle.ts[35-52]
src/server/ingestion/bundle.ts[60-63]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The archive bytes (and thus SHA-256 digest) depend on the order of entries passed to `tar.create()`. Currently, `supportingFiles` comes from `fs.readdirSync()` traversal order and is not sorted before bundling, making archive output/digests potentially non-reproducible.

### Issue Context
Even with `portable: true`, entry ordering still affects tar output. Stabilizing ordering makes digests predictable and reduces needless updates.

### Fix Focus Areas
- src/server/ingestion/discover.ts[73-85]
- src/server/ingestion/bundle.ts[35-52]
- src/server/ingestion/bundle.ts[60-63]

### Suggested fix approach
- Sort deterministically before archiving:
 - Option A: `supportingFiles.sort()` in `discoverSkills()` before returning the candidate.
 - Option B (preferred): sort inside `bundleArchive()`/`getAllRelativeFiles()` (defensive, ensures determinism regardless of candidate producer).
- Keep `SKILL.md` first if desired, but sort the remaining paths lexicographically.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Clone URL not validated ✓ Resolved 🐞 Bug ⛨ Security
Description
clone() passes the provided url straight to git clone without enforcing the spec’s “HTTPS or SSH Git
URL” contract. This allows unsupported/unsafe URL schemes or local-path clones to slip through
unless every caller independently validates.
Code

src/server/ingestion/clone.ts[R3-7]

+export async function clone(url: string, dest: string): Promise<void> {
+  const git = simpleGit();
+  try {
+    await git.clone(url, dest, ["--depth", "1"]);
+  } catch (err) {
Evidence
The spec constrains the accepted source URL type to HTTPS/SSH Git URLs, while the current clone()
implementation forwards any string directly into git clone without checks.

src/server/ingestion/clone.ts[3-7]
openspec/changes/rhess-enterprise-skills-server/specs/skill-source-management/spec.md[3-6]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The ingestion primitive `clone(url, dest)` does not validate `url` before invoking `git.clone(...)`. The spec for source registration defines `url` as an **HTTPS or SSH Git URL**, so accepting arbitrary strings (including other protocols or local paths) violates the contract and increases risk.

### Issue Context
Even if validation is planned at the API layer, keeping the allowlist close to the cloning boundary reduces the chance of future callers accidentally bypassing validation.

### Fix Focus Areas
- src/server/ingestion/clone.ts[3-7]
- openspec/changes/rhess-enterprise-skills-server/specs/skill-source-management/spec.md[3-6]

### Suggested fix approach
- Validate/allowlist URL forms:
 - `https://...`
 - `ssh://...`
 - SCP-like SSH form: `git@host:org/repo(.git)`
- If invalid, throw an error that can be mapped to `CLONE_FAILED`/`INVALID_URL` at the API layer (depending on the error model you adopt in the next PR).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread src/server/ingestion/discover.ts Outdated
johnmcollier and others added 2 commits June 19, 2026 14:37
- discover: skip subdirs containing their own SKILL.md (separate skill
  roots) when collecting supportingFiles; defensively filter SKILL.md
  from walkAllFiles results
- clone: validate URL scheme (https/ssh/git@) before invoking git clone
- bundle: sort supportingFiles lexicographically inside getAllRelativeFiles
  to guarantee deterministic archive byte stream and SHA-256 digest

Co-authored-by: Cursor <cursoragent@cursor.com>
Covers discoverSkills, parseFrontmatter, bundleSkill, and clone URL
validation. Includes explicit regression tests for the three Qodo bugs:
- nested SKILL.md excluded from supportingFiles (bug 1)
- invalid URL scheme rejected before git clone (bug 2)
- archive digest is deterministic regardless of supportingFiles order (bug 3)

Co-authored-by: Cursor <cursoragent@cursor.com>
@johnmcollier johnmcollier merged commit 6fca912 into redhat-ai-dev:main Jun 19, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant