Skip to content

fix(daemon): edge-driven Timer debounce for gpio-shutdown (#1109)#1110

Open
strandline wants to merge 5 commits into
pollen-robotics:mainfrom
strandline:1109-edge-driven-timer-debounce
Open

fix(daemon): edge-driven Timer debounce for gpio-shutdown (#1109)#1110
strandline wants to merge 5 commits into
pollen-robotics:mainfrom
strandline:1109-edge-driven-timer-debounce

Conversation

@strandline
Copy link
Copy Markdown

Summary

Replaces the busy-wait debounce in shutdown_monitor.py (PR #505) with an
edge-driven threading.Timer pattern. Closes the sub-millisecond gap that
let motor-EMI HIGH→LOW→HIGH spikes pass between 1ms is_pressed polls
and trigger spurious shutdowns under sustained set_target activity.

Fixes #1109 (reported by @danpillay87).

Root cause

PR #505's released() handler polls is_pressed on a 1ms schedule for
200ms after each falling edge. Motor-coil EMI couples sub-millisecond
HIGH spikes onto GPIO23, but those spikes finish faster than the next
poll observation, so the loop never sees the pin return HIGH and treats
a transient as a real release. Bumping the loop to 2000ms mitigates but
does not eliminate it (per #1109's update).

Fix

when_released starts threading.Timer(HOLD_TIME, shutdown_now);
when_pressed calls Timer.cancel(). Because gpiozero's edge handlers
are interrupt-style (kernel GPIO events, not sampled polls), a
sub-millisecond HIGH spike fires when_pressed and cancels the pending
Timer. A deliberate ≥2s release with no intervening HIGH spikes still
fires shutdown_now.

HOLD_TIME is 2.0s (longer than any motor-EMI burst observed in #1109,
still a natural body-button gesture).

Why not `hold_time` / `when_held`?

`gpiozero.Button.when_held` measures continuous press duration, but
the shutdown gesture is continuous release duration. The Timer
pattern is the natural inverse.

Trade-offs

  • `Timer.cancel()` is best-effort. If the worker thread has already
    begun the callback when `cancel()` is called, the shutdown proceeds.
    Race window is sub-millisecond — acceptable for a shutdown gesture.
  • HIGH spikes superimposed on an already-LOW pin during a real release
    cancel the Timer. Same "bounce, ignore" behaviour as PR 492 fix shutdown gpio switching due to motor overload #505 (no
    regression). Heavy motor activity could theoretically delay or
    suppress an intentional shutdown gesture; not observed empirically.

Testing

Unit tests (`tests/unit_tests/test_shutdown_monitor.py`, 5 tests, all
passing on real import path with `MockFactory` autobind for CI/macOS hosts):

  • brief release < 2s does not fire shutdown
  • 2s sustained release fires shutdown
  • press during pending Timer cancels it
  • multiple releases reset the Timer correctly
  • Timer is daemon-thread (no shutdown hang)

Smoke test (`run_smoketest.py`, 6/6 scenarios pass across 3 runs).

Hardware verification on Reachy Mini Wireless (daemon v1.7.1,
ReachyMiniOS):

Scenario Action Expected Result
a brief OUT→IN < 2s, idle no shutdown
c brief OUT→IN under 10 Hz `set_target` motor stress no shutdown
b sustained OUT ≥ 3s clean shutdown

Scenario (c) directly reproduces #1109's failure mode (the 10–12 Hz
`set_target` cadence from a custom voice app's lip-sync wobble) and
confirms the fix holds.

Side note (not in this PR)

During hardware testing we observed that popping the latch OUT cuts
motor torque immediately, even briefly — independent of the daemon
shutdown logic. This appears to be intentional e-stop semantics
(latch-OUT kills motor power instantly; ≥ 2s sustained latch-OUT also
signals graceful system shutdown). Worth documenting in the SDK
hardware-architecture page.

strandline added 2 commits May 9, 2026 13:31
…otics#1109)

The previous busy-wait debounce in shutdown_monitor.py (PR pollen-robotics#505) polls
is_pressed on a 1 ms schedule for 200 ms after each when_released. Under
sustained motor activity, motor-coil EMI couples brief LOW spikes onto
the floating GPIO23 input; sub-millisecond EMI bursts can pass between
the 1 ms samples, so the poll never observes the pin returning HIGH and
a spurious release becomes a real shutdown. Padding the debounce to 2 s
reduces but does not eliminate the failure (pollen-robotics#1109).

Replace the busy-wait with an edge-driven design:

* when_released (falling edge) starts threading.Timer(HOLD_TIME, shutdown_now).
* when_pressed (rising edge) cancels any pending Timer.

Because gpiozero's edge handlers are interrupt-style (kernel GPIO events,
not sampled polls), even sub-millisecond HIGH spikes that flip the pin
HIGH→LOW→HIGH cancel the pending Timer. A deliberate HOLD_TIME-second
release with no intervening HIGH spikes still fires shutdown_now.

Module shape preserves the existing idiom: shutdown_button is bound at
import time, edge handlers wired at module body, pause() guarded by
__name__ == "__main__" so the launcher's python -m entry still runs the
daemon while tests can import the module under MockFactory.

Tests cover four scenarios using a deterministic MockFactory pin:
- HOLD_TIME constant matches the 2 s production design
- Brief release-then-press cancels the Timer (no shutdown)
- Sustained release > HOLD_TIME fires shutdown exactly once
- HIGH burst during release (simulated EMI) cancels Timer pollen-robotics#1; the next
  release schedules a fresh Timer pollen-robotics#2 that fires after HOLD_TIME
- shutdown_now() invokes ["sudo", "shutdown", "-h", "now"]

Tradeoffs documented in the module docstring:
* Timer.cancel() is best-effort — a press arriving after the worker
  thread has begun executing the callback does not abort shutdown. The
  race window is sub-millisecond and acceptable for a shutdown gesture.
* HIGH spikes superimposed on an already-LOW pin during a real user
  release also cancel the Timer. That matches PR pollen-robotics#505's "bounce, ignore"
  behaviour (no regression versus current main), though heavy motor
  activity can theoretically delay an intentional shutdown gesture.

Also corrects a typo in the module docstring (GPIO24 → GPIO23).
shutdown_monitor.py constructs Button(23, pull_up=False) at import time.
Without a pin_factory pre-set, gpiozero auto-detects one and raises
BadPinFactory on hosts without Linux GPIO (CI runners, dev macOS).
Setting Device.pin_factory = MockFactory() at the test module's top
ensures the import succeeds before any test runs. The per-test fixture
still swaps a fresh MockFactory between tests for state isolation.
Copilot AI review requested due to automatic review settings May 9, 2026 19:02
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the GPIO23 shutdown-button monitor against motor-EMI-induced glitches by replacing the previous 1ms polling debounce loop with an edge-driven threading.Timer approach that schedules shutdown on sustained release and cancels on any intervening press edge.

Changes:

  • Replaced busy-wait debounce logic with a lock-protected Timer schedule/cancel mechanism keyed off when_released/when_pressed.
  • Added HOLD_TIME = 2.0 as the sustained-release threshold for shutdown triggering.
  • Added unit tests using gpiozero’s MockFactory to validate cancellation/retrigger behavior and command invocation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/reachy_mini/daemon/app/services/gpio_shutdown/shutdown_monitor.py Implements edge-driven Timer-based shutdown triggering/cancellation and documents rationale/tradeoffs.
tests/unit_tests/test_shutdown_monitor.py Adds regression-oriented unit tests for sustained release, cancel-on-press, EMI burst behavior, and shutdown command invocation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +36 to +43
# Bind a MockFactory for the test session BEFORE any test imports the
# ``shutdown_monitor`` module. The module constructs ``Button(23,
# pull_up=False)`` at import time; without a pin_factory pre-set, gpiozero
# attempts to auto-detect one and raises ``BadPinFactory`` on hosts without
# Linux GPIO (CI runners, dev macOS / Windows).
Device.pin_factory = MockFactory()


Comment on lines +149 to +167
with patch(monkey_target, side_effect=lambda: fired.append(True)):
pin.drive_low() # Release at t=0; Timer #1 would fire at t=0.3.
time.sleep(0.1) # t=0.1.
pin.drive_high() # EMI burst cancels Timer #1.
pin.drive_low() # Release continues; Timer #2 schedules → fires at t≈0.4.

# At t≈0.35 — past Timer #1's deadline (t=0.3), before Timer #2 (t≈0.4).
time.sleep(0.25)
assert fired == [], (
f"EMI burst did not cancel Timer #1 before its original deadline: "
f"fired={fired}"
)

# At t≈0.50 — past Timer #2's deadline (t≈0.4).
time.sleep(0.15)
assert fired == [True], (
f"Timer #2 did not fire after the EMI burst (or fired more than once): "
f"fired={fired}"
)
Comment on lines +75 to +81
# Lambda defers the ``shutdown_now`` lookup to fire-time so tests
# can ``monkeypatch.setattr(module, "shutdown_now", ...)`` after
# the Timer is scheduled.
timer = threading.Timer(HOLD_TIME, lambda: shutdown_now())
timer.daemon = True
_pending_timer = timer
timer.start()
Three test-hygiene improvements from Copilot's review of pull request
1110 — production code (shutdown_monitor.py) is unchanged.

1. Move module-level Device.pin_factory = MockFactory() into an
   autouse session-scoped fixture that captures the previous value
   and restores it at session teardown — importing this test file no
   longer leaks a MockFactory into other test modules that may rely
   on the default pin-factory auto-detect behaviour.

2. Add a _wait_until(predicate, timeout, poll) helper and use it for
   the "Timer fires" assertions in test_sustained_release and the
   second half of test_emi_burst — robust against scheduling delay
   on slow / loaded CI runners. The "didn't fire" assertion in the
   EMI test is left as a fixed time.sleep since it's safe by
   construction (Timer pollen-robotics#2's deadline is strictly later than the
   check, and Python Timer doesn't fire early).

3. Add test_scheduled_timer_is_daemon_thread regression test that
   asserts shutdown_module._pending_timer.daemon is True after a
   release schedules a pending Timer — pins the safety-critical
   behaviour that prevents the process from hanging on exit.

6 passed across 3 stability runs (was 5 before adding the daemon
regression test).
@strandline
Copy link
Copy Markdown
Author

Thanks for the review @copilot — all three suggestions addressed in d0513e4 (test-hygiene only; production shutdown_monitor.py unchanged):

  1. Module-level pin_factory mutation — moved into an autouse=True, scope="session" fixture that captures previous = Device.pin_factory and restores it at teardown. Importing this test file no longer leaks MockFactory into other test modules.

  2. Tight time.sleep timing — added a _wait_until(predicate, timeout, poll) helper and switched the two "Timer fires" assertions (test_sustained_release_fires_shutdown_once and the second half of test_emi_burst_during_release_cancels_pending_timer) to bounded polling. The "didn't fire" assertion in the EMI test is left as a fixed time.sleep since it's safe by construction (Timer Rename this repo "Reachy Mini"? #2's deadline is strictly later than the check, and Python Timer doesn't fire early).

  3. timer.daemon = True regression guard — added test_scheduled_timer_is_daemon_thread that asserts shutdown_module._pending_timer.daemon is True after a release schedules a pending Timer.

6/6 pass across 3 stability runs locally.

@pierre-rouanet pierre-rouanet self-requested a review May 11, 2026 14:29
@pierre-rouanet
Copy link
Copy Markdown
Member

Thanks for the thorough work on this — using edge-driven Timer seems right!

One blocker before merge: HOLD_TIME = 2.0s is too long for our hardware. On the Wireless, popping the latch OUT cuts the Pi's power rail within ~2s (the e-stop semantics you note in the side comment). With a 2s hold, shutdown_now() could not actually run before power drops — we'll get a hard power-cut on some shutdown, which is strictly worse than today's 200ms behavior.

And 200ms seems like an already really long filter duration regarding this kind of burst, no?

@strandline

This comment was marked as resolved.

@strandline
Copy link
Copy Markdown
Author

strandline commented May 11, 2026

Thanks for the thorough work on this — using edge-driven Timer seems right!

One blocker before merge: HOLD_TIME = 2.0s is too long for our hardware. On the Wireless, popping the latch OUT cuts the Pi's power rail within ~2s (the e-stop semantics you note in the side comment). With a 2s hold, shutdown_now() could not actually run before power drops — we'll get a hard power-cut on some shutdown, which is strictly worse than today's 200ms behavior.

And 200ms seems like an already really long filter duration regarding this kind of burst, no?

Pierre,

Thank you for the feedback. I'll run an analysis for you later today to get some data on the timing policy to see if I can provide some hard numbers.

Jason

Review feedback on PR pollen-robotics#1110 flagged that HOLD_TIME=2.0s races the
Wireless's ~2s latch-OUT power-rail cut: shutdown_now() may not
dispatch before the Pi loses power, which is strictly worse than
PR pollen-robotics#505's 200ms behaviour.

Empirical characterization on Reachy Mini Wireless (17 min total,
~10,000 set_target commands at 10 Hz with head + antennas + body
all driven) observed **0 GPIO23 edges of any kind**, and a HOLD_TIME
sweep across {50ms, 100ms, 200ms, 500ms, 1s, 2s} produced **0
spurious fires at every value**. On this hardware the EMI condition
described in pollen-robotics#1109 is below the measurement floor.

This re-frames HOLD_TIME: EMI bursts are filtered by the cancel-on-press
edge (any rising edge during the hold window cancels the pending
Timer, independent of HOLD_TIME). HOLD_TIME is therefore a UX /
gesture-recognition parameter, not an EMI filter. Constraints:

* Upper bound: shutdown_now() + systemd dispatch must complete inside
  the ~2s latch-OUT power-rail cut.
* Lower bound: long enough to distinguish a deliberate latch pull
  from an incidental brush.

0.5s sits inside that window: 10x a typical e-stop tap (~50ms), 4x
headroom below the power-rail cut. EMI defense is unaffected.

The characterization + sweep scripts are added in a follow-up commit
under scripts/issue_1109/ so future pollen-robotics#1109-class reports on different
hardware can be measured directly rather than inferred.
…arness

Two on-robot scripts that produced the empirical data behind the
HOLD_TIME=0.5s decision (see prior commit):

* `burst_characterization.py` -- listens on GPIO23 with pull_up=False
  (matches production), drives sustained 10 Hz set_target motor stress,
  logs every edge with monotonic_ns timestamps. Outputs JSONL plus a
  summary with burst clusters and inter-edge gap distribution. Run
  with `--no-motor` for a passive EMI-floor baseline.

* `hold_time_sweep.py` -- in-process clone of production
  _schedule_shutdown / _cancel_shutdown Timer logic with shutdown_now()
  replaced by a counter. Sweeps HOLD_TIME values under continuous
  motor stress, reports per-step spurious-fire counts, identifies the
  floor. Single continuous ReachyMini context for the whole sweep
  (per-step wake_up cycling crashed the GStreamer pipeline).

Both scripts run on the Reachy Mini's Raspberry Pi using
/venvs/mini_daemon/bin/python3; gpio-shutdown-daemon must be stopped
first for exclusive GPIO23 access. README in the same directory
documents pre-conditions, workflow, and the recorded result.

Investigation tooling only; not loaded by the daemon.
@strandline
Copy link
Copy Markdown
Author

TLDR: I ran some additional tests to evaluate timing. The cancel-on-press edge does the EMI work, not the hold duration.

Empirical run on Reachy Mini Wireless

Stopped gpio-shutdown-daemon for exclusive GPIO23 access, then ran two measurements against #1109's reproducer (10 Hz set_target, sustained, head + antennas + body_yaw all driven). Both use gpiozero.Button(23, pull_up=False) to match production semantics exactly.

1. Burst characterization — 5 min, 2,887 motor commands, raw edge logger:

n_edges: 0
n_pressed: 0
n_released: 0
n_bursts: 0

2. HOLD_TIME sweep — 12 min, 7,136 motor commands, production-equivalent _schedule_shutdown / _cancel_shutdown logic with shutdown_now replaced by a counter:

HOLD_TIME edges timers fired verdict
50 ms 0 0 PASS
100 ms 0 0 PASS
200 ms 0 0 PASS
500 ms 0 0 PASS
1.0 s 0 0 PASS
2.0 s 0 0 PASS

Total: 17 minutes of stress, ~10,000 motor commands, zero GPIO23 edges of any kind. On this robot the EMI condition described in #1109 is below the measurement floor — the sub-millisecond spike hypothesis from PR #505 is unfalsified but I can't reproduce it here. (Caveat: one robot. @danpillay87's report stands as evidence on different hardware.)

Re-framing HOLD_TIME

You're right that the cancel-on-press edge does the EMI work, not the hold duration:

  • Any rising edge during the hold window cancels the pending Timer — independent of HOLD_TIME.
  • HOLD_TIME is therefore a UX / gesture-recognition parameter, not an EMI filter parameter.
  • The 2.0 s value was originally chosen as a conservative ceiling above observed EMI bursts — but cancel-on-press makes that framing wrong; it conflates orthogonal concerns.

So the constraint sandwich is:

  • Floor: distinguish a deliberate latch pull from an incidental brush. ~50 ms or higher is fine.
  • Ceiling: the Wireless's ~2 s latch-OUT → power-rail cut you noted. shutdown_now() plus systemd shutdown initiation must complete inside that window.

Pushed: HOLD_TIME = 0.5 s + harness scripts

Two commits on the PR branch:

0.5 s: 10× a typical e-stop tap, 4× headroom below your ~2 s power-rail cut, no EMI regression on my hardware (sweep confirms 0 fires at all values tested, including 50 ms).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Spurious reboots on Wireless from gpio-shutdown-daemon GPIO23 noise — PR #505 200ms debounce insufficient under sustained motor activity

3 participants