perf(#2354): bound enrollment wait with timeout and backoff #76
guyoron1 wants to merge 53 commits into
Design for a new `prerequisites` triage action that replaces `blocked`. The agent can now express both existing blockers and new issues that need to be created upstream before progress can happen. Includes allowlist configuration for cross-repo issue creation and a degraded path when targets are not authorized. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
…nd-ai#401) Seven-task plan covering config structs, JSON schema, agent prompt, post-script, user docs, and caller updates. TDD approach with exact file paths and code blocks. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
Add CreateIssuesConfig and AllowTargets types to both OrgConfig and PerRepoConfig. NewOrgConfig populates defaults with the org and fullsend-ai/fullsend. NewPerRepoConfig populates with the target repo and fullsend-ai/fullsend. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
…ues (fullsend-ai#401) Pass org name and target repo to config constructors so create_issues defaults are populated at install time. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
…pt (fullsend-ai#401) The triage agent can now recommend creating upstream issues via the prerequisites action's create array, in addition to referencing existing blockers. Adds hard constraint against emitting sufficient when prerequisites exist. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
…d-ai#401) Update triage agent docs to explain the new prerequisites action and the create_issues.allow_targets configuration surface. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
…#401) Replace the blocked handler with prerequisites. The post-script reads the create_issues allowlist from config.yaml, creates permitted upstream issues via gh, and includes collapsed draft bodies for disallowed or failed creates so humans can file them manually. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
…ullsend-ai#401) The agent prompt referenced a nonexistent `prerequisites` label when checking for prior blockers — the post-script actually applies the `blocked` label. Also removed unused SOURCE_ORG variable from post-triage.sh. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
…end-ai#401) Replace the four blocked-action test cases with five prerequisites-action test cases that exercise the new schema (existing[], create[], allowlist validation). Set up GITHUB_WORKSPACE with a config.yaml fixture and add a mock gh issue-create handler that returns a fake URL. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
…ullsend-ai#401) Replace blocked-action test cases with prerequisites-action equivalents and update the expected property list (blocked_by → prerequisites). Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
…d-ai#401)
- Replace stale blocked-* schema validation tests with prerequisites equivalents (missing field, both arrays empty, malformed URL)
- Fix validateCreateIssues to reject malformed repo formats like "/", "/repo", "owner/"
- Align triage.md section 2c terminology from "blocker" to "prerequisite" consistently
- Update bugfix-workflow.md and architecture.md to document upstream issue creation capability
- Emit ::warning:: when yq is unavailable so silent degradation of cross-repo issue creation is diagnosable
Signed-off-by: Ralph Bean <rbean@redhat.com> Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
Adds a skill that summarizes recent E2E Tests workflow runs on main, presents them in a table with clickable links, and diagnoses failures by grepping failed step logs for signal lines. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
The markdown link linter was parsing `[run-id](url)` as a real file reference. Wrapping it in backticks marks it as a code example. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
- Move list-runs.sh to scripts/ subdirectory to match convention
- Add bash command prefix to allowed-tools declaration
- Clarify status vs conclusion field handling for in-progress runs
- Use case-insensitive grep to catch Timeout/timeout variants
- Tighten frontmatter description
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
When multiple runners exhaust the GraphQL rate limit simultaneously, they all sleep until the same reset timestamp and wake up together. The existing slot jitter (250-750ms) is too narrow to desynchronize them, causing collisions that surface as "unknown owner type" errors from gh project view. Add a post-reset spread of up to 60s (configurable via GITHUB_CSMA_SPREAD_MAX_SEC) so runners fan out over a wide window after waking from a rate-limit sleep. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
…ss schema
Add end-to-end integration tests covering the full Phase 2 pipeline (PR 6 of 6 in the ADR-0045 forge-portable harness schema adoption):
- LoadWithBase wrapper→scaffold merge with field inheritance and override
- All scaffold templates forge resolution (pre/post scripts, runner_env)
- Backward compatibility via Load() (no forge platform)
- DiscoverAgents scaffold directory scanning with correct role/slug pairs
- HarnessContentHash integrity verification against embedded content
- LoadRaw generated wrapper format validation
- ResolveForge scaffold runner_env merge with per-template key assertions
Resolves fullsend-ai#2328
Signed-off-by: Greg Allen <greg@fullsend.ai> Signed-off-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Greg Allen <gallen@redhat.com>
Status comments on PRs/issues get stuck in "Started" when the pre-minted agent token expires before PostCompletion runs. Instead of relying on a static token, have the fullsend binary mint its own fresh short-lived token via mintclient.MintToken() before each status comment API call.
Key changes:
- Add ClientFactory pattern to statuscomment.Notifier so each API operation gets a freshly minted forge.Client
- Add --mint-url flag to fullsend run and reconcile-status commands
- Add mint-url input to action.yml and all reusable workflows
- Deprecate --status-token (run) and --token (reconcile-status) with runtime warnings; hidden from help output
- Deprecate status-token input in action.yml; mask unconditionally
- Validate token format before ::add-mask:: to prevent workflow command injection
- Move refreshClient below commentEnabled guard in PostCompletion
- Make refreshClient failure in cleanup path fail-open (warning)
- Add "code" -> "coder" role alias for agent name resolution
Closes fullsend-ai#2130
Signed-off-by: Greg Allen <gallen@redhat.com> Signed-off-by: Claude <noreply@anthropic.com> Signed-off-by: Greg Allen <gallen@redhat.com>
…window fix: widen CSMA post-reset jitter to prevent thundering herd
The previous backtick-escaping attempt (7c40a70) did not prevent lychee from resolving `url` as a relative file path. Remove the markdown link syntax entirely so the link checker has nothing to chase. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
…kill feat(skills): add e2e-health skill
…ter_rate_limit PR fullsend-ai#2304 added post-reset spread to github_csma_sense to prevent thundering herd when runners wake after a rate-limit reset. The structurally parallel _github_csma_sleep_after_rate_limit function was missing the same treatment — multiple runners hitting a 429 would all wake at the same reset timestamp and fire simultaneously. Extract the spread logic into a shared _github_csma_post_reset_spread helper and call it from both github_csma_sense (replacing the inline code) and _github_csma_sleep_after_rate_limit (added after the backoff sleep). Both paths now use GITHUB_CSMA_SPREAD_MAX_SEC to stagger runner wake times. Note: pre-commit and make lint could not run due to shellcheck-py network restriction in sandbox. Scaffold Go tests pass. Closes fullsend-ai#2343
…spread-rate-limit fix(fullsend-ai#2343): add post-reset spread to _github_csma_sleep_after_rate_limit
The NewOrgConfig call gained a 6th parameter (org string) on this branch. Main didn't have it yet, causing a conflict. Keep the 6-parameter version to match the current function signature. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>
…atus-token fix(fullsend-ai#2130): mint fresh tokens for status comments
test: Phase 2 integration tests for ADR-0045 forge-portable harness
…-decompose-issues feat(triage): add prerequisites action for upstream issue creation (fullsend-ai#401)
Replace the hardcoded 36-iteration fixed-interval polling loop in
awaitWorkflowRun with a time-bounded loop using exponential
backoff. The total wait is capped at 3 minutes (matching the
previous maximum), but polling starts at 2s intervals and doubles
up to 15s, reducing API calls and giving faster feedback when
the workflow completes quickly.
Changes:
- Add enrollmentWaitTimeout, enrollmentPollInitial, and enrollmentPollMax constants to control polling behavior
- Replace iteration-count loop with deadline-based loop
- Use exponential backoff (2s → 4s → 8s → 15s cap) via nextInterval helper
- Improve progress messages to show elapsed time instead of attempt numbers
- Include actionable guidance in timeout error message ("check the workflow in .fullsend and re-run install")
- Add progress indicator before starting the wait
Closes fullsend-ai#2354
/fs-qf
🤖 Finished Review · ✅ Success · Started 4:08 AM UTC · Completed 4:26 AM UTC
Resolved all 4 major and 7 minor findings from initial review:
- Rewrote scenarios to use user-observable behavior (D1-A-001)
- Fixed personal fork URLs to upstream (D1-N-001)
- Added triage prerequisites to Out of Scope (D2-COV-001)
- Replaced vague qualifiers with specific outcomes (D1-A2-001)
- Reclassified Dependencies, added QE kickoff timing
- Added P2 priority tier, removed empty risk entries
- Simplified Feature Overview, acknowledged security dimension
Weighted score: 81.6 → 96.4
/fs-review
🤖 Finished Review · ✅ Success · Started 4:30 AM UTC · Completed 4:46 AM UTC
STD refiner could not proceed: STD YAML was never generated. Verdict remains BLOCKED. Run std-builder first. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/fs-review
Generated 58 Go tests across 3 packages covering:
- Enrollment wait timeout and exponential backoff (25 tests)
- Status comment lifecycle and orphan reconciliation (29 tests)
- reconcile-status CLI command validation (14 tests)
All tests compile and pass. Co-located with source packages per QualityFlow convention (qf_ prefix).
Removes intermediate pipeline artifacts (STP, STD, reviews). Test files (3) are co-located in source tree with qf_ prefix. Jira: GH-76 [skip ci]
/fs-qf
🤖 Finished Review · ✅ Success · Started 5:49 AM UTC · Completed 6:05 AM UTC
/fs-review
🤖 Finished Review · ✅ Success · Started 6:08 AM UTC · Completed 6:43 AM UTC
Removes intermediate pipeline artifacts (STP, STD, reviews). Test files (3) are co-located in source tree with qf_ prefix. Jira: GH-76 [skip ci]
/fs-review
🤖 Finished Review · ✅ Success · Started 6:46 AM UTC · Completed 7:03 AM UTC
n := statuscomment.New(client, notifyCfg, owner, repo, sOpts.statusNum, sOpts.runURL, sha, runID)
var initialClient forge.Client
if staticToken != "" {
	initialClient = gh.New(staticToken)
[medium] nil-deref
initialClient is nil when mintURL is empty (the else branch at ~line 1880 does not set it). This nil client is passed to statuscomment.New(...). If ClientFactory is also nil or fails on first refresh, methods on the nil initialClient will panic.
Suggested fix: Either initialize initialClient to a no-op client, or guard against nil in statuscomment.New.
if mintURL != "" {
	if role == "" {
		return fmt.Errorf("--role is required when using --mint-url")
	}
[medium] api-contract
MintRequest.Repos is set to []string{statusRepo} where statusRepo is owner/repo format. Verify that the mint endpoint expects full owner/repo strings in the Repos field rather than bare repo names.
Suggested fix: Confirm the mint API contract for the Repos field format and add a comment documenting the expected format.
@@ -405,17 +420,19 @@ runs:
JOB_STATUS: ${{ job.status }}
[medium] fail-open
The reconciliation step is gated on inputs.mint-url being non-empty. When mint-url is not provided, the entire status reconciliation is silently skipped with no warning.
Suggested fix: Add a warning annotation when reconciliation is skipped due to missing mint-url.
	return gh.New(token)
}

func newReconcileStatusCmd() *cobra.Command {
[low] pattern-inconsistency
The deprecated --token flag is mapped to --status-token but the deprecation message references --mint-url as the replacement. Verify the migration path is documented.
Co-located tests (qf_* prefix) are now in source package directories. The qf-tests/ directory contained non-compiling tests from the old pipeline.
🤖 Finished Review · ✅ Success · Started 8:33 AM UTC · Completed 8:53 AM UTC
	fmt.Fprintf(os.Stderr, "WARNING: --token is deprecated; use --mint-url instead\n")
	client = newForgeClient(token)
} else {
	return fmt.Errorf("--mint-url or FULLSEND_MINT_URL required (--token is deprecated)")
[high] cli_behavior_breaking_change
GITHUB_TOKEN environment variable fallback for status reconciliation silently removed. Existing CI configurations relying on GITHUB_TOKEN for status updates will fail at runtime with no clear error message.
Suggested fix: Add a clear error message when neither mint-url nor a working token source is available, pointing to migration path. Consider a deprecation period.
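A possible shape for the deprecation period, sketched under assumptions: `resolveCredential`, the `credential` struct, and the exact warning wording are hypothetical and not part of this PR. Only the flag and variable names (`--mint-url`, `FULLSEND_MINT_URL`, `--token`, `GITHUB_TOKEN`) come from the thread. The sketch keeps the `GITHUB_TOKEN` fallback for now but makes it loud, and spells out the migration path when nothing is configured.

```go
package main

import (
	"errors"
	"fmt"
)

// credential (hypothetical) describes which source was selected.
type credential struct {
	MintURL string // set when a mint endpoint should be used
	Token   string // set when a static token is used as a fallback
	Warning string // non-empty when the chosen source is deprecated
}

// resolveCredential (hypothetical) prefers the mint URL, then the deprecated
// --token value, then the deprecated GITHUB_TOKEN environment fallback.
func resolveCredential(mintURL, token string, getenv func(string) string) (credential, error) {
	if mintURL == "" {
		mintURL = getenv("FULLSEND_MINT_URL")
	}
	if mintURL != "" {
		return credential{MintURL: mintURL}, nil
	}
	if token != "" {
		return credential{
			Token:   token,
			Warning: "WARNING: --token is deprecated; use --mint-url instead",
		}, nil
	}
	if t := getenv("GITHUB_TOKEN"); t != "" {
		return credential{
			Token:   t,
			Warning: "WARNING: the GITHUB_TOKEN fallback is deprecated; set --mint-url or FULLSEND_MINT_URL",
		}, nil
	}
	return credential{}, errors.New(
		"no credentials for status reconciliation: set --mint-url or FULLSEND_MINT_URL " +
			"(--token and the GITHUB_TOKEN fallback are deprecated)")
}

func main() {
	env := map[string]string{"GITHUB_TOKEN": "example-token"}
	cred, err := resolveCredential("", "", func(k string) string { return env[k] })
	fmt.Println(cred.Warning, err)

	_, err = resolveCredential("", "", func(string) string { return "" })
	fmt.Println(err)
}
```

Passing `getenv` as a parameter keeps the precedence rules testable without touching the process environment; a caller would hand in `os.Getenv`.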
@@ -1882,10 +1888,34 @@ func setupStatusNotifier(fullsendDir string, sOpts statusOpts, printer *ui.Print
runID = fmt.Sprintf("%d", time.Now().UnixNano())
[low] nil-deref
ClientFactory set on statuscomment package but empty mintURL with no fallback leads to unclear error behavior rather than explicit early failure.
Suggested fix: Add explicit check for empty mintURL before setting client factory.
@@ -206,23 +228,26 @@ run_test "duplicate-self-reference-fails" \
"" \
[info] test-inadequacy
Test for prerequisites case covers happy path but not error conditions (failed issue creation, invalid allowlist, network failures).
Suggested fix: Add test cases for error paths in prerequisites handling.
Mirror of upstream fullsend-ai#2359
Adds timeout and backoff to enrollment wait to prevent unbounded blocking.