Symptom
tests/test_nmpc_controller.py::TestNMPCControllerFallback::test_fallback_clears_on_success fails on GitHub Actions runners but passes on a local laptop. Surfaced after PR #1 (runbook-sync gate fix) — the runbook gate had been masking it.
E assert True is False
assert r['nmpc_fallback_active'] is False
Root cause
control/nmpc_controller.py:937-952 treats two distinct conditions as a single failed flag:
is_timeout = result.get('solve_time_ms', 0.0) > p.max_solve_ms # 20 ms budget
failed = (not result['feasible']) or is_timeout
if failed:
self._consecutive_failures += 1
if self._consecutive_failures >= p.max_consecutive_failures:
self._fallback_active = True
# NOTE: fallback never clears in this branch
else:
self._consecutive_failures = 0
self._fallback_active = False # only path that clears
When the SLSQP solver returns feasible=True but exceeds the 20 ms wall-clock budget (common on GH Actions runners, where the perf tests already had to be CI-skipped at 5/8/20 ms), is_timeout is True → failed is True → fallback never clears even though the math worked.
The test's intent is "fallback clears when the solver succeeds." Production code's effective definition of "succeeds" is "feasible AND fast." That mismatch is the bug.
Two ways to resolve
Option A — test-side (low risk, doesn't touch production)
Pin solve_time_ms = 0.0 in both branches of the test mock so the test exercises only the fallback state machine, not the timeout policy:
def sometimes_fail(*args, **kwargs):
r = orig_solve(*args, **kwargs)
if fail_mode[0]:
r['feasible'] = False
r['solve_time_ms'] = 0.0
else:
r['solve_time_ms'] = 0.0 # ← add this line
return r
Honest with the test's stated intent. Doesn't change runtime behavior. CI green on the next push.
Option B — production-side (semantic fix, needs review)
A 21 ms feasible solve probably shouldn't re-enter fallback or block clearing it. Slow solves should emit a separate "slow" telemetry signal, not equate with infeasibility. Sketch:
is_timeout = result.get('solve_time_ms', 0.0) > p.max_solve_ms
infeasible = not result['feasible']
if infeasible:
self._consecutive_failures += 1
if self._consecutive_failures >= p.max_consecutive_failures:
self._fallback_active = True
result['steering_normalized'] = self._last_steering
elif is_timeout:
# slow but valid: keep the result, leave fallback state untouched
self._last_steering = result['steering_normalized']
else:
self._consecutive_failures = 0
self._fallback_active = False
self._last_steering = result['steering_normalized']
This is more correct but changes the controller's failure model. Wants a real driving review (does a single slow frame justify dropping to LMPC for the next ~3 cycles?).
Recommendation
Apply Option A to unblock CI immediately. Defer Option B until someone with controller-tuning context can confirm whether slow-feasible solves should affect fallback state.
Related
🤖 Generated with Claude Code
Symptom
tests/test_nmpc_controller.py::TestNMPCControllerFallback::test_fallback_clears_on_successfails on GitHub Actions runners but passes on a local laptop. Surfaced after PR #1 (runbook-sync gate fix) — the runbook gate had been masking it.Root cause
control/nmpc_controller.py:937-952treats two distinct conditions as a singlefailedflag:When the SLSQP solver returns
feasible=Truebut exceeds the 20 ms wall-clock budget (common on GH Actions runners, where the perf tests already had to be CI-skipped at 5/8/20 ms),is_timeoutis True →failedis True → fallback never clears even though the math worked.The test's intent is "fallback clears when the solver succeeds." Production code's effective definition of "succeeds" is "feasible AND fast." That mismatch is the bug.
Two ways to resolve
Option A — test-side (low risk, doesn't touch production)
Pin
solve_time_ms = 0.0in both branches of the test mock so the test exercises only the fallback state machine, not the timeout policy:Honest with the test's stated intent. Doesn't change runtime behavior. CI green on the next push.
Option B — production-side (semantic fix, needs review)
A 21 ms feasible solve probably shouldn't re-enter fallback or block clearing it. Slow solves should emit a separate "slow" telemetry signal, not equate with infeasibility. Sketch:
This is more correct but changes the controller's failure model. Wants a real driving review (does a single slow frame justify dropping to LMPC for the next ~3 cycles?).
Recommendation
Apply Option A to unblock CI immediately. Defer Option B until someone with controller-tuning context can confirm whether slow-feasible solves should affect fallback state.
Related
test_no_oscillation_at_q4(MPC delay compensation) — separate logic regression also surfaced after PR docs(runbook): catch up SCRIPT_RUNBOOK with 5 previously-undocumented tools #1's runbook-sync fix. Different file, different root cause; punted to its own ticket.@pytest.mark.skipif(_IN_CI, reason=...)in PR docs(runbook): catch up SCRIPT_RUNBOOK with 5 previously-undocumented tools #1 commit71b7a4d. This NMPC test is the same family of issue (CI hardware too slow for an ms-level budget) but on a different test that wasn't part of that batch.🤖 Generated with Claude Code