Skip to content

Surface IPv6-only advertisements and make dropped discovery records observable #239

Description

@conradbzura

Description

A worker advertised over LanDiscovery with only IPv6 addresses in its DNS-SD record never surfaces as a discovery event — no worker-added, no error, no log. Since #237, advertise_host accepts IPv6 literals, so the dead end is reachable through supported, validated configuration: the publisher registers the record successfully and every subscriber silently ignores it.

Steps to reproduce:

  1. Publish a worker through LanDiscovery("ns", advertise_host="fd00::1").publisher.
  2. Iterate LanDiscovery("ns").subscriber.
  3. No event ever arrives, and a pool dispatching on that namespace waits forever on the quorum gate.

The IPv6 case is one instance of a broader gap. The zeroconf listener handlers (_handle_add_service / _handle_update_service) drop every unusable announcement silently, across three distinct swallow sites: an unresolvable service-info fetch (None), a malformed record (_deserialize_metadata raising ValueError), and the catch-all except Exception: pass that hides the IPv6 IndexError and any defect in the handler itself. Making the IPv6 record deserialize cleanly removes one cause, but the silent-drop machinery around it stays opaque. Both halves touch the same except blocks, so this issue covers the IPv6 fix together with making the drops observable.

Expected behavior

Address resolution:

  • Prefer IPv4 when the record carries both address families (current behavior for V4-bearing records is unchanged).
  • Fall back to the first IPv6 address when the record is IPv6-only, formatted bracketed — [fd00::1]:<port> — so the address survives host/port splitting (the bracket-aware _split_host_port introduced in Make LanDiscovery-published worker pools reachable across hosts #237 already parses this form).
  • Raise ValueError from _deserialize_metadata when a record has no usable address — the zeroconf listeners already catch ValueError and skip deliberately — instead of leaking IndexError into the broad except Exception: pass handler.

Observability of dropped records — the drops are a rate phenomenon (the operator signal is "malformed records/sec is climbing," indicating a version-skewed or buggy peer on the namespace), so the production-facing answer is a metric, not a per-event log and not a warning. Treat the three sites by character:

  • Unresolvable fetch (service_info is None): expected churn, i.e., a service caught mid-registration, a browse racing an unregister, or TTL expiry. logger.debug at most; not an anomaly.
  • Malformed record (ValueError, including the address-less case above): a peer published something this version cannot parse. logger.warning with structured fields (the service name, the reason), not a bare formatted message.
  • Catch-all: narrow except Exception: pass and use logger.exception so an unexpected failure carries a traceback instead of vanishing. This is the site that currently hides both the IPv6 IndexError and any bug in the handler itself.

Emit through stdlib logging with structured extra={} fields, not f-string-interpolated messages, so a future OpenTelemetry log bridge exports real attributes instead of re-parsing strings; and name the counter these sites would feed — wool.discovery.records_dropped, dimensioned by {reason, backend} — as the eventual metric, without taking an OTel dependency now. Do not route these data-plane drops through warnings.warn: warnings are stderr-bound, deduplicated by source location (so they under-report a repeating bad record), and lossy under OTel log capture. Reserve the warning categories for deterministic, developer-facing configuration signals, e.g., LoopbackAdvertisementWarning. The diagnostic vocabulary should be discovery-agnostic — LocalDiscovery's subscriber has the analogous malformed-record drop — so the logging keys and the eventual counter apply to both backends.

Root cause

_deserialize_metadata (wool/src/wool/runtime/discovery/lan.py) resolves the advertised address with a V4-only lookup and indexes the first entry:

ip = info.ip_addresses_by_version(IPVersion.V4Only)[0]

For an IPv6-only record the list is empty and the IndexError propagates into _handle_add_service / _handle_update_service, whose except Exception: pass swallows it — the worker is invisible to subscribers with no trace.

Implementation notes:

  • Verify gRPC dial-target compatibility for the bracketed-IPv6 WorkerMetadata.address form (grpc.aio.insecure_channel("[fd00::1]:50051")) before relying on the fallback for dispatch.
  • Unit tests: a V6-only record surfaces with a bracketed address; a V4+V6 record prefers V4; an address-less record is skipped without breaking the event stream — via the real-zeroconf ServiceInfo registration pattern in test___aiter___ignores_malformed_zeroconf_service. Assert the differentiated diagnostics with caplog: a malformed record logs at WARNING with the structured reason, an unresolvable fetch stays at DEBUG, and the catch-all path logs an exception with traceback.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions