doctor --remediate can loop indefinitely when sync.repo does not clear stale_pages #1230

@shandutta

Bug Description

gbrain doctor --remediate can loop indefinitely / appear hung when a remediation step completes but the condition that generated that recommendation does not clear.

In my current repo-backed brain, the planner keeps recommending sync.repo because health.stale_pages remains nonzero (21) after sync/extract. The remediation loop recomputes recommendations between steps and reintroduces the same remediation ID, so the command can keep submitting/waiting on sync.repo until an external timeout kills it.

Environment

  • Repo: garrytan/gbrain
  • Local package/version: gbrain 0.36.4.0
  • Engine: Supabase/Postgres-backed brain
  • Source: repo-backed markdown brain at /home/shan/brain/knowledge
  • Current health before/after manual remediation:
    • brain_score: 84
    • missing_embeddings: 0
    • stale_pages: 21
    • orphan_pages: 104
    • link_coverage: 0.909090909090909
    • timeline_coverage: 0.636363636363636

Steps to Reproduce

gbrain doctor --remediation-plan --target-score 84 --max-usd 5 --json
# plan includes sync.repo because stale_pages > 0

gbrain doctor --remediate --yes --target-score 84 --max-usd 5
# command produces no useful progress output and can hang until external timeout

A safer bounded repro shows the repeat clearly:

timeout 120s gbrain doctor --remediate --yes --target-score 84 --max-usd 5 --max-jobs 2 --json

Observed output:

{
  "brain_score_initial": 84,
  "brain_score_final": 84,
  "brain_score_target": 84,
  "target_reached": true,
  "submitted": [
    {
      "step": 1,
      "id": "sync.repo",
      "job_id": 5665,
      "status": "completed"
    },
    {
      "step": 2,
      "id": "sync.repo",
      "job_id": 5665,
      "status": "completed"
    }
  ],
  "aborted_count": 0
}

Expected Behavior

After a remediation ID completes once within a single doctor run (one doctor_run_id), one of the following should happen:

  1. The completed remediation ID is suppressed for the remainder of that run unless its input/content hash changes; or
  2. If the same check remains unhealthy after a completed remediation, the step is marked blocked / non_clearing / failed_to_clear and the run moves on or exits; or
  3. sync.repo stops claiming to remediate stale_pages when stale_pages is not actually expected to clear via sync.

The CLI should also print progress before waiting on each job in non-JSON mode so it does not look dead.
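
Option 2 could look roughly like the sketch below. It compares the triggering check before and after a completed step and blocks the remediation ID for the rest of the run if the value did not move. All names here (`Health`, `Recommendation`, `classifyStep`, `remediate`) are hypothetical and the loop is synchronous for brevity; this is not the actual code in doctor.ts, which awaits jobs.

```typescript
// Hypothetical shapes for illustration; the real gbrain types may differ.
type Health = Record<string, number>;

interface Recommendation {
  id: string;    // e.g. "sync.repo"
  check: string; // health field that triggered it, e.g. "stale_pages"
}

type StepOutcome = "cleared" | "improved" | "non_clearing";

// Compare the triggering check before and after a completed step.
function classifyStep(rec: Recommendation, before: Health, after: Health): StepOutcome {
  const prev = before[rec.check] ?? 0;
  const next = after[rec.check] ?? 0;
  if (next === 0) return "cleared";
  return next < prev ? "improved" : "non_clearing";
}

// Bounded loop: a remediation that does not move its check is blocked
// for the remainder of the run instead of being resubmitted.
function remediate(
  initial: Health,
  recommend: (h: Health) => Recommendation[],
  runStep: (rec: Recommendation, h: Health) => Health,
  maxSteps = 20,
): { steps: { id: string; outcome: StepOutcome }[]; blocked: string[] } {
  const blocked = new Set<string>();
  const steps: { id: string; outcome: StepOutcome }[] = [];
  let health = initial;
  for (let i = 0; i < maxSteps; i++) {
    const next = recommend(health).find((r) => !blocked.has(r.id));
    if (!next) break; // nothing left that is worth trying
    const after = runStep(next, health);
    const outcome = classifyStep(next, health, after);
    if (outcome === "non_clearing") blocked.add(next.id);
    steps.push({ id: next.id, outcome });
    health = after;
  }
  return { steps, blocked: Array.from(blocked) };
}
```

With my numbers (stale_pages stays at 21 after sync), this would run sync.repo once, record it as non_clearing, and exit instead of resubmitting it.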

Actual Behavior

  • sync.repo completes successfully.
  • health.stale_pages remains at 21.
  • The remediation loop recomputes recommendations and reintroduces sync.repo.
  • With no --max-jobs, the command can appear to hang / run until external timeout.
  • In non-JSON mode, no useful progress is printed before/during the wait.
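
For the missing progress output, something as small as the following would already help. This is only a sketch: `progressLine` and `reportProgress` are hypothetical helpers, and the message wording is made up rather than taken from gbrain.

```typescript
// Hypothetical helper: builds one progress line per submitted job.
// Returns null in JSON mode so the machine-readable output stays untouched.
function progressLine(step: number, id: string, jobId: number, json: boolean): string | null {
  if (json) return null;
  return `[step ${step}] submitted ${id} (job ${jobId}), waiting for completion...`;
}

// console.error writes to stderr in Node, which keeps stdout clean for piping.
function reportProgress(step: number, id: string, jobId: number, json: boolean): void {
  const line = progressLine(step, id, jobId, json);
  if (line !== null) console.error(line);
}
```

Calling `reportProgress` right before the wait on each job would make it obvious that the command is resubmitting sync.repo rather than being dead.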

Relevant Code Pointers

src/commands/doctor.ts:

  • runRemediate() recomputes recommendations from fresh health after each step:

      const freshHealth = await engine.getHealth();
      recs = computeRecommendations(freshHealth, ctx).filter((r) => r.status === 'remediable');

  • There does not appear to be a same-run suppression set for completed remediation IDs.
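
A same-run suppression set could be fairly small. The sketch below assumes a recommendation carries an optional `inputHash`; that field, the `Rec` shape, and the `CompletedInRun` class are my own illustration, not existing gbrain code. It suppresses an ID that already completed in this run unless its input changed.

```typescript
// Hypothetical recommendation shape; `inputHash` is an assumed field.
interface Rec {
  id: string;
  status: "remediable" | "blocked" | "manual";
  inputHash?: string;
}

// Tracks which remediation IDs already completed in this run, and with which input.
class CompletedInRun {
  private readonly seen = new Map<string, string>();

  markCompleted(rec: Rec): void {
    this.seen.set(rec.id, rec.inputHash ?? "");
  }

  // Suppressed when the same ID already completed with the same input hash.
  isSuppressed(rec: Rec): boolean {
    return this.seen.get(rec.id) === (rec.inputHash ?? "");
  }

  // Drop-in for the existing status filter, with suppression added.
  filter(recs: Rec[]): Rec[] {
    return recs.filter((r) => r.status === "remediable" && !this.isSuppressed(r));
  }
}
```

In runRemediate() the recomputed list would then go through `filter` in place of the bare `.filter((r) => r.status === 'remediable')`, with `markCompleted` called once a job finishes.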

src/core/brain-score-recommendations.ts:

  • sync.repo fires when health.stale_pages > 0.

Workaround

Manual maintenance works fine:

gbrain sync --source default
gbrain embed --stale
gbrain extract all --dir /home/shan/brain/knowledge
gbrain doctor

That cleared missing embeddings and ran full extraction without runaway generation, but stale_pages remained 21, so autonomous remediation still wants to run sync.repo again.
