Skip to content

NMPC fallback never clears on slow-but-feasible solves (test_fallback_clears_on_success regression in CI) #2

@tntpsu

Description

@tntpsu

Symptom

tests/test_nmpc_controller.py::TestNMPCControllerFallback::test_fallback_clears_on_success fails on GitHub Actions runners but passes on a local laptop. Surfaced after PR #1 (runbook-sync gate fix) — the runbook gate had been masking it.

E   assert True is False
   assert r['nmpc_fallback_active'] is False

Root cause

control/nmpc_controller.py:937-952 treats two distinct conditions as a single failed flag:

is_timeout = result.get('solve_time_ms', 0.0) > p.max_solve_ms   # 20 ms budget
failed = (not result['feasible']) or is_timeout
if failed:
    self._consecutive_failures += 1
    if self._consecutive_failures >= p.max_consecutive_failures:
        self._fallback_active = True
    # NOTE: fallback never clears in this branch
else:
    self._consecutive_failures = 0
    self._fallback_active = False     # only path that clears

When the SLSQP solver returns feasible=True but exceeds the 20 ms wall-clock budget (common on GH Actions runners, where the perf tests already had to be CI-skipped at 5/8/20 ms), is_timeout is True → failed is True → fallback never clears even though the math worked.

The test's intent is "fallback clears when the solver succeeds." Production code's effective definition of "succeeds" is "feasible AND fast." That mismatch is the bug.

Two ways to resolve

Option A — test-side (low risk, doesn't touch production)

Pin solve_time_ms = 0.0 in both branches of the test mock so the test exercises only the fallback state machine, not the timeout policy:

def sometimes_fail(*args, **kwargs):
    r = orig_solve(*args, **kwargs)
    if fail_mode[0]:
        r['feasible'] = False
        r['solve_time_ms'] = 0.0
    else:
        r['solve_time_ms'] = 0.0   # ← add this line
    return r

Honest with the test's stated intent. Doesn't change runtime behavior. CI green on the next push.

Option B — production-side (semantic fix, needs review)

A 21 ms feasible solve probably shouldn't re-enter fallback or block clearing it. Slow solves should emit a separate "slow" telemetry signal, not equate with infeasibility. Sketch:

is_timeout = result.get('solve_time_ms', 0.0) > p.max_solve_ms
infeasible = not result['feasible']
if infeasible:
    self._consecutive_failures += 1
    if self._consecutive_failures >= p.max_consecutive_failures:
        self._fallback_active = True
    result['steering_normalized'] = self._last_steering
elif is_timeout:
    # slow but valid: keep the result, leave fallback state untouched
    self._last_steering = result['steering_normalized']
else:
    self._consecutive_failures = 0
    self._fallback_active = False
    self._last_steering = result['steering_normalized']

This is more correct but changes the controller's failure model. Wants a real driving review (does a single slow frame justify dropping to LMPC for the next ~3 cycles?).

Recommendation

Apply Option A to unblock CI immediately. Defer Option B until someone with controller-tuning context can confirm whether slow-feasible solves should affect fallback state.

Related

🤖 Generated with Claude Code

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions