fix(daemon): edge-driven Timer debounce for gpio-shutdown (#1109)#1110
fix(daemon): edge-driven Timer debounce for gpio-shutdown (#1109)#1110strandline wants to merge 5 commits into
Conversation
…otics#1109) The previous busy-wait debounce in shutdown_monitor.py (PR pollen-robotics#505) polls is_pressed on a 1 ms schedule for 200 ms after each when_released. Under sustained motor activity, motor-coil EMI couples brief LOW spikes onto the floating GPIO23 input; sub-millisecond EMI bursts can pass between the 1 ms samples, so the poll never observes the pin returning HIGH and a spurious release becomes a real shutdown. Padding the debounce to 2 s reduces but does not eliminate the failure (pollen-robotics#1109). Replace the busy-wait with an edge-driven design: * when_released (falling edge) starts threading.Timer(HOLD_TIME, shutdown_now). * when_pressed (rising edge) cancels any pending Timer. Because gpiozero's edge handlers are interrupt-style (kernel GPIO events, not sampled polls), even sub-millisecond HIGH spikes that flip the pin HIGH→LOW→HIGH cancel the pending Timer. A deliberate HOLD_TIME-second release with no intervening HIGH spikes still fires shutdown_now. Module shape preserves the existing idiom: shutdown_button is bound at import time, edge handlers wired at module body, pause() guarded by __name__ == "__main__" so the launcher's python -m entry still runs the daemon while tests can import the module under MockFactory. Tests cover four scenarios using a deterministic MockFactory pin: - HOLD_TIME constant matches the 2 s production design - Brief release-then-press cancels the Timer (no shutdown) - Sustained release > HOLD_TIME fires shutdown exactly once - HIGH burst during release (simulated EMI) cancels Timer pollen-robotics#1; the next release schedules a fresh Timer pollen-robotics#2 that fires after HOLD_TIME - shutdown_now() invokes ["sudo", "shutdown", "-h", "now"] Tradeoffs documented in the module docstring: * Timer.cancel() is best-effort — a press arriving after the worker thread has begun executing the callback does not abort shutdown. The race window is sub-millisecond and acceptable for a shutdown gesture. * HIGH spikes superimposed on an already-LOW pin during a real user release also cancel the Timer. That matches PR pollen-robotics#505's "bounce, ignore" behaviour (no regression versus current main), though heavy motor activity can theoretically delay an intentional shutdown gesture. Also corrects a typo in the module docstring (GPIO24 → GPIO23).
shutdown_monitor.py constructs Button(23, pull_up=False) at import time. Without a pin_factory pre-set, gpiozero auto-detects one and raises BadPinFactory on hosts without Linux GPIO (CI runners, dev macOS). Setting Device.pin_factory = MockFactory() at the test module's top ensures the import succeeds before any test runs. The per-test fixture still swaps a fresh MockFactory between tests for state isolation.
There was a problem hiding this comment.
Pull request overview
This PR hardens the GPIO23 shutdown-button monitor against motor-EMI-induced glitches by replacing the previous 1ms polling debounce loop with an edge-driven threading.Timer approach that schedules shutdown on sustained release and cancels on any intervening press edge.
Changes:
- Replaced busy-wait debounce logic with a lock-protected
Timerschedule/cancel mechanism keyed offwhen_released/when_pressed. - Added
HOLD_TIME = 2.0as the sustained-release threshold for shutdown triggering. - Added unit tests using
gpiozero’sMockFactoryto validate cancellation/retrigger behavior and command invocation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/reachy_mini/daemon/app/services/gpio_shutdown/shutdown_monitor.py |
Implements edge-driven Timer-based shutdown triggering/cancellation and documents rationale/tradeoffs. |
tests/unit_tests/test_shutdown_monitor.py |
Adds regression-oriented unit tests for sustained release, cancel-on-press, EMI burst behavior, and shutdown command invocation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Bind a MockFactory for the test session BEFORE any test imports the | ||
| # ``shutdown_monitor`` module. The module constructs ``Button(23, | ||
| # pull_up=False)`` at import time; without a pin_factory pre-set, gpiozero | ||
| # attempts to auto-detect one and raises ``BadPinFactory`` on hosts without | ||
| # Linux GPIO (CI runners, dev macOS / Windows). | ||
| Device.pin_factory = MockFactory() | ||
|
|
||
|
|
| with patch(monkey_target, side_effect=lambda: fired.append(True)): | ||
| pin.drive_low() # Release at t=0; Timer #1 would fire at t=0.3. | ||
| time.sleep(0.1) # t=0.1. | ||
| pin.drive_high() # EMI burst cancels Timer #1. | ||
| pin.drive_low() # Release continues; Timer #2 schedules → fires at t≈0.4. | ||
|
|
||
| # At t≈0.35 — past Timer #1's deadline (t=0.3), before Timer #2 (t≈0.4). | ||
| time.sleep(0.25) | ||
| assert fired == [], ( | ||
| f"EMI burst did not cancel Timer #1 before its original deadline: " | ||
| f"fired={fired}" | ||
| ) | ||
|
|
||
| # At t≈0.50 — past Timer #2's deadline (t≈0.4). | ||
| time.sleep(0.15) | ||
| assert fired == [True], ( | ||
| f"Timer #2 did not fire after the EMI burst (or fired more than once): " | ||
| f"fired={fired}" | ||
| ) |
| # Lambda defers the ``shutdown_now`` lookup to fire-time so tests | ||
| # can ``monkeypatch.setattr(module, "shutdown_now", ...)`` after | ||
| # the Timer is scheduled. | ||
| timer = threading.Timer(HOLD_TIME, lambda: shutdown_now()) | ||
| timer.daemon = True | ||
| _pending_timer = timer | ||
| timer.start() |
Three test-hygiene improvements from Copilot's review of pull request 1110 — production code (shutdown_monitor.py) is unchanged. 1. Move module-level Device.pin_factory = MockFactory() into an autouse session-scoped fixture that captures the previous value and restores it at session teardown — importing this test file no longer leaks a MockFactory into other test modules that may rely on the default pin-factory auto-detect behaviour. 2. Add a _wait_until(predicate, timeout, poll) helper and use it for the "Timer fires" assertions in test_sustained_release and the second half of test_emi_burst — robust against scheduling delay on slow / loaded CI runners. The "didn't fire" assertion in the EMI test is left as a fixed time.sleep since it's safe by construction (Timer pollen-robotics#2's deadline is strictly later than the check, and Python Timer doesn't fire early). 3. Add test_scheduled_timer_is_daemon_thread regression test that asserts shutdown_module._pending_timer.daemon is True after a release schedules a pending Timer — pins the safety-critical behaviour that prevents the process from hanging on exit. 6 passed across 3 stability runs (was 5 before adding the daemon regression test).
|
Thanks for the review @copilot — all three suggestions addressed in d0513e4 (test-hygiene only; production
6/6 pass across 3 stability runs locally. |
|
Thanks for the thorough work on this — using edge-driven Timer seems right! One blocker before merge: HOLD_TIME = 2.0s is too long for our hardware. On the Wireless, popping the latch OUT cuts the Pi's power rail within ~2s (the e-stop semantics you note in the side comment). With a 2s hold, shutdown_now() could not actually run before power drops — we'll get a hard power-cut on some shutdown, which is strictly worse than today's 200ms behavior. And 200ms seems like an already really long filter duration regarding this kind of burst, no? |
This comment was marked as resolved.
This comment was marked as resolved.
Pierre, Thank you for the feedback. I'll run an analysis for you later today to get some data on the timing policy to see if I can provide some hard numbers. Jason |
Review feedback on PR pollen-robotics#1110 flagged that HOLD_TIME=2.0s races the Wireless's ~2s latch-OUT power-rail cut: shutdown_now() may not dispatch before the Pi loses power, which is strictly worse than PR pollen-robotics#505's 200ms behaviour. Empirical characterization on Reachy Mini Wireless (17 min total, ~10,000 set_target commands at 10 Hz with head + antennas + body all driven) observed **0 GPIO23 edges of any kind**, and a HOLD_TIME sweep across {50ms, 100ms, 200ms, 500ms, 1s, 2s} produced **0 spurious fires at every value**. On this hardware the EMI condition described in pollen-robotics#1109 is below the measurement floor. This re-frames HOLD_TIME: EMI bursts are filtered by the cancel-on-press edge (any rising edge during the hold window cancels the pending Timer, independent of HOLD_TIME). HOLD_TIME is therefore a UX / gesture-recognition parameter, not an EMI filter. Constraints: * Upper bound: shutdown_now() + systemd dispatch must complete inside the ~2s latch-OUT power-rail cut. * Lower bound: long enough to distinguish a deliberate latch pull from an incidental brush. 0.5s sits inside that window: 10x a typical e-stop tap (~50ms), 4x headroom below the power-rail cut. EMI defense is unaffected. The characterization + sweep scripts are added in a follow-up commit under scripts/issue_1109/ so future pollen-robotics#1109-class reports on different hardware can be measured directly rather than inferred.
…arness Two on-robot scripts that produced the empirical data behind the HOLD_TIME=0.5s decision (see prior commit): * `burst_characterization.py` -- listens on GPIO23 with pull_up=False (matches production), drives sustained 10 Hz set_target motor stress, logs every edge with monotonic_ns timestamps. Outputs JSONL plus a summary with burst clusters and inter-edge gap distribution. Run with `--no-motor` for a passive EMI-floor baseline. * `hold_time_sweep.py` -- in-process clone of production _schedule_shutdown / _cancel_shutdown Timer logic with shutdown_now() replaced by a counter. Sweeps HOLD_TIME values under continuous motor stress, reports per-step spurious-fire counts, identifies the floor. Single continuous ReachyMini context for the whole sweep (per-step wake_up cycling crashed the GStreamer pipeline). Both scripts run on the Reachy Mini's Raspberry Pi using /venvs/mini_daemon/bin/python3; gpio-shutdown-daemon must be stopped first for exclusive GPIO23 access. README in the same directory documents pre-conditions, workflow, and the recorded result. Investigation tooling only; not loaded by the daemon.
|
TLDR: I ran some additional tests to evaluate timing. The cancel-on-press edge does the EMI work, not the hold duration. Empirical run on Reachy Mini WirelessStopped 1. Burst characterization — 5 min, 2,887 motor commands, raw edge logger: 2. HOLD_TIME sweep — 12 min, 7,136 motor commands, production-equivalent
Total: 17 minutes of stress, ~10,000 motor commands, zero GPIO23 edges of any kind. On this robot the EMI condition described in #1109 is below the measurement floor — the sub-millisecond spike hypothesis from PR #505 is unfalsified but I can't reproduce it here. (Caveat: one robot. @danpillay87's report stands as evidence on different hardware.) Re-framing HOLD_TIMEYou're right that the cancel-on-press edge does the EMI work, not the hold duration:
So the constraint sandwich is:
Pushed: HOLD_TIME = 0.5 s + harness scriptsTwo commits on the PR branch:
0.5 s: 10× a typical e-stop tap, 4× headroom below your ~2 s power-rail cut, no EMI regression on my hardware (sweep confirms 0 fires at all values tested, including 50 ms). |
Summary
Replaces the busy-wait debounce in
shutdown_monitor.py(PR #505) with anedge-driven
threading.Timerpattern. Closes the sub-millisecond gap thatlet motor-EMI HIGH→LOW→HIGH spikes pass between 1ms
is_pressedpollsand trigger spurious shutdowns under sustained
set_targetactivity.Fixes #1109 (reported by @danpillay87).
Root cause
PR #505's
released()handler pollsis_pressedon a 1ms schedule for200ms after each falling edge. Motor-coil EMI couples sub-millisecond
HIGH spikes onto GPIO23, but those spikes finish faster than the next
poll observation, so the loop never sees the pin return HIGH and treats
a transient as a real release. Bumping the loop to 2000ms mitigates but
does not eliminate it (per #1109's update).
Fix
when_releasedstartsthreading.Timer(HOLD_TIME, shutdown_now);when_pressedcallsTimer.cancel(). Because gpiozero's edge handlersare interrupt-style (kernel GPIO events, not sampled polls), a
sub-millisecond HIGH spike fires
when_pressedand cancels the pendingTimer. A deliberate ≥2s release with no intervening HIGH spikes still
fires
shutdown_now.HOLD_TIMEis 2.0s (longer than any motor-EMI burst observed in #1109,still a natural body-button gesture).
Why not `hold_time` / `when_held`?
`gpiozero.Button.when_held` measures continuous press duration, but
the shutdown gesture is continuous release duration. The Timer
pattern is the natural inverse.
Trade-offs
begun the callback when `cancel()` is called, the shutdown proceeds.
Race window is sub-millisecond — acceptable for a shutdown gesture.
cancel the Timer. Same "bounce, ignore" behaviour as PR 492 fix shutdown gpio switching due to motor overload #505 (no
regression). Heavy motor activity could theoretically delay or
suppress an intentional shutdown gesture; not observed empirically.
Testing
Unit tests (`tests/unit_tests/test_shutdown_monitor.py`, 5 tests, all
passing on real import path with `MockFactory` autobind for CI/macOS hosts):
Smoke test (`run_smoketest.py`, 6/6 scenarios pass across 3 runs).
Hardware verification on Reachy Mini Wireless (daemon v1.7.1,
ReachyMiniOS):
Scenario (c) directly reproduces #1109's failure mode (the 10–12 Hz
`set_target` cadence from a custom voice app's lip-sync wobble) and
confirms the fix holds.
Side note (not in this PR)
During hardware testing we observed that popping the latch OUT cuts
motor torque immediately, even briefly — independent of the daemon
shutdown logic. This appears to be intentional e-stop semantics
(latch-OUT kills motor power instantly; ≥ 2s sustained latch-OUT also
signals graceful system shutdown). Worth documenting in the SDK
hardware-architecture page.