diff --git a/.github/scripts/pr-triage-act.sh b/.github/scripts/pr-triage-act.sh
index b2fb5efec3..7098852a8b 100644
--- a/.github/scripts/pr-triage-act.sh
+++ b/.github/scripts/pr-triage-act.sh
@@ -277,9 +277,14 @@ if [ -z "$STATE" ]; then
   EVAL_STATE=$(eval_status_state)
   log "reviewDecision=$REVIEW_DECISION unresolved_threads=$UNRESOLVED eval_status=$EVAL_STATE"
 
-  # Malicious scan precedence (non-bot, untrusted, no marker for current head)
+  # Malicious scan precedence (non-bot, untrusted, no marker for current head).
+  # Match either the orchestrator-posted dispatched marker (source of truth) or
+  # the agent-posted fingerprint marker (set by a successful scan run).
   if [ "$IS_BOT" = "false" ] && [ "$IS_TRUSTED" = "false" ]; then
-    SECS=$(seconds_since_marker "")
+    if [ -z "$SECS" ]; then
+      SECS=$(seconds_since_marker "`
+  marker** — that one is posted by the orchestrator immediately *before* it
+  dispatches this workflow, so it will always be present at the start of
+  your run. Treating it as "already scanned" would cause every scan to
+  no-op.
 
 ## Step 2 — Fetch the diff
 
-Use the GitHub API. Do not run `git checkout` on the PR head.
+Use the GitHub MCP `pull_request_read` tool with the `files` action (or the
+`repos` toolset for raw blob reads). Do not run `git checkout` on the PR
+head, and do not invoke `gh` or `curl` from bash — only the MCP tools and
+the text-processing utilities listed under `tools.bash` are available.
 
-```bash
-gh api --paginate "repos/${REPO}/pulls/${PR}/files" \
-  --jq '.[] | {filename, status, additions, deletions, patch}'
-```
-
-For files where `patch` is null/empty (binary or oversized), record the
-filename and treat it as `binary-or-oversized`. For at most 5 such files that
-are also under a sensitive path (see Step 3), fetch the raw blob:
-
-```bash
-gh api "repos/${REPO}/contents/${path}?ref=${HEAD_SHA}" --jq .content | base64 -d | head -c 8192
-```
+For each changed file, capture `filename`, `status`, `additions`,
+`deletions`, and `patch`. For files where `patch` is null/empty (binary or
+oversized), record the filename and treat it as `binary-or-oversized`. For
+at most 5 such files that are also under a sensitive path (see Step 3),
+fetch the raw file content via the MCP `repos` toolset (`get_file_contents`
+at `ref=`) and inspect the first ~8 KB.
 
 Limit total inspection to ~64 changed files / ~256 KB of patch text. If the
 diff is larger, scan the most-sensitive paths first
@@ -211,9 +245,15 @@ comment (Step 5). Do not apply labels.
 
 ## Step 5 — Idempotency marker (always)
 
-Always post a single PR comment containing the marker so the orchestrator and
-the per-PR worker can detect that this head SHA has been scanned. Use
-`add_comment` with body shaped exactly:
+> [!IMPORTANT]
+> The `add_comment` body **must begin with the literal HTML-comment marker
+> line on its own first line**. Do not add any prefix, blank line, indentation,
+> emoji, or other text before it. The orchestrator parses prior bot comments
+> looking for this exact marker.
+
+Always post a single PR comment containing the marker so the orchestrator
+and the per-PR worker can detect that this head SHA has been scanned. Use
+`add_comment` with body shaped exactly (the **first line** is the marker):
 
 - **Clean scan** (no findings):
 
diff --git a/.github/workflows/pr-triage-batch.yml b/.github/workflows/pr-triage-batch.yml
index 297ecb802c..ab32018ec9 100644
--- a/.github/workflows/pr-triage-batch.yml
+++ b/.github/workflows/pr-triage-batch.yml
@@ -3,8 +3,13 @@ name: "PR Triage — Batch"
 # Hourly orchestrator. Enumerates open PRs, computes a deterministic state
 # for each, and dispatches the per-PR worker (pr-triage.yml) or the malicious-
 # code scanner (pr-malicious-scan.agent.lock.yml) for PRs that need action.
-# No model calls; no comments; no labels are applied here. The worker owns the
-# side effects.
+# No model calls; no labels are applied here. The worker owns label and
+# author-ping side effects. The orchestrator itself posts at most ONE comment
+# per scanner dispatch — a deterministic
+# `` idempotency marker — before
+# triggering the scanner workflow. That marker is the source of truth that
+# survives any scanner-side failure mode (PAT outage, integrity block,
+# dropped HTML marker by the agent).
 
 on:
   schedule:
@@ -29,6 +34,7 @@ on:
 
 permissions:
   pull-requests: read
+  issues: write
   statuses: read
   actions: write
   contents: read
@@ -129,12 +135,17 @@ jobs:
             # Compute state — same logic as worker, kept simple and deterministic.
             STATE=""
             if [ "$IS_BOT" = "false" ] && [ "$IS_TRUSTED" = "false" ]; then
-              # Look for prior malicious-scan marker on this head
+              # Look for prior malicious-scan signal on this head. Match either:
+              # (orchestrator-authored, posted
+              # just before `gh workflow run`; survives any agent-side failure mode), OR
+              # (agent-authored, posted by
+              # a successful scan run).
+              # Either marker means "do not re-dispatch for this head SHA".
               SHORT="${HEAD_SHA:0:7}"
               # NB: --paginate runs --jq per page, so aggregations like 'length' would emit one
               # number per page. Emit one .id per matching comment and count lines in the shell.
               MARKER=$(gh api --paginate "repos/$REPO/issues/$PR/comments" \
-                --jq ".[] | select(.user.login == \"github-actions[bot]\") | select(.body | contains(\"\")) or (.body | contains(\"" \
+                  "🔍 Automated malicious-diff scan dispatched for \`$SHORT\`." \
+                  "_Results will be posted as code-scanning alerts and a follow-up comment by github-actions[bot]._")
+                if ! gh api -X POST "repos/$REPO/issues/$PR/comments" \
+                  -f body="$PRE_DISPATCH_BODY" >/dev/null 2>&1; then
+                  echo "::warning::failed to post pre-dispatch marker for PR #$PR — skipping scanner dispatch"
+                  continue
+                fi
+                if gh workflow run pr-malicious-scan.agent.lock.yml --repo "$REPO" \
+                  -f pr_number="$PR"; then
+                  DISPATCHED=$((DISPATCHED + 1))
+                else
+                  echo "::warning::failed to dispatch scanner for PR #$PR"
+                fi
               else
                 echo "::notice::scanner workflow not yet present; would dispatch for PR #$PR"
               fi
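
Reviewer note: the "either marker means do not re-dispatch" rule in the orchestrator hunk can be sketched in isolation, which may make it easier to reason about or test. This is only an illustrative sketch, not the PR's actual code. The real marker text is not visible in this diff, so the two prefixes and the function names below are hypothetical placeholders, and comment bodies are assumed to arrive one per line on stdin instead of coming from `gh api`.

```shell
#!/usr/bin/env bash
# Hypothetical marker prefixes (placeholders; the real marker strings are
# not shown in the diff above).
DISPATCHED_PREFIX="example-scan-dispatched"
FINGERPRINT_PREFIX="example-scan-fingerprint"

# has_scan_marker SHORT_SHA
# Reads comment bodies on stdin, one per line. Succeeds if any line contains
# either the dispatched marker or the fingerprint marker for SHORT_SHA.
has_scan_marker() {
  local short="$1"
  grep -q -F \
    -e "${DISPATCHED_PREFIX}:${short}" \
    -e "${FINGERPRINT_PREFIX}:${short}"
}

# should_dispatch SHORT_SHA
# Succeeds only when neither marker is present for this head SHA.
should_dispatch() {
  ! has_scan_marker "$1"
}

# Example (assumed input shape):
#   printf '%s\n' "LGTM" "example-scan-dispatched:abc1234" | should_dispatch abc1234 \
#     || echo "already dispatched for abc1234"
```

Treating both markers identically keeps the check idempotent whether the scanner run succeeded, failed, or dropped its own marker, which seems to be the intent of posting the orchestrator comment before `gh workflow run`.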