LruDedup queue policy notification#1899
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
securely1g
left a comment
There was a problem hiding this comment.
Review: LruDedup Queue Policy for RedisChannel NotificationConsumer
Summary
Opts RedisChannel::m_notificationConsumer into NotificationQueuePolicy::LruDedup to bound queue depth under FDB event storms. Single functional change in lib/RedisChannel.cpp (+40/-1), plus aspell dictionary updates.
What's Good
- Targeted, minimal change — one file, one consumer, clear problem statement (1.5 GB queue growth under storm)
- Well-guarded —
#ifdef SWSS_NOTIFICATIONCONSUMER_HAS_LRU_DEDUPensures clean builds against older swss-common - Good inline documentation — explains why byte-identical dedup is safe (saimeta::Meta transitions are idempotent under identical payloads)
- Preserves Select-loop priority — explicitly passes
pri=100to match the previous 2-arg ctor default - Stats label — orch-qualified label distinguishes this consumer from orchagent's per-orch consumers in syslog/monitoring
🟡 Medium Priority Issues
1. No setOpAllowList set on this consumer
The orchagent-side PR (sonic-swss #4586) sets setOpAllowList(...) on every consumer to filter irrelevant ops early. This RedisChannel consumer doesn't set one — it accepts all ops from the NOTIFICATIONS channel. This is likely intentional (libsairedis needs to process all notification types: fdb_event, port_state_change, bfd, etc.), but worth confirming. If it processes everything, the dropped_allowlist counter will always be 0. A brief comment saying "no allowlist — RedisChannel processes all notification types" would clarify intent.
2. No registration with gNotifConsumerStatsOrch
The companion swss PR registers each consumer with the stats orch for COUNTERS_DB publication. This consumer lives in libsairedis (a library), not in orchagent, so it can't directly reference gNotifConsumerStatsOrch. The setStatsLabel is set but nobody publishes its stats to COUNTERS_DB. Is the intent to only emit syslog-level stats, or is there a plan for libsairedis to self-publish? Worth noting in the PR description.
3. Two commits — should squash
The second commit ("Remove duplicate entries and clean up formatting") is purely aspell cleanup. SONiC convention is typically a single logical commit per PR. Consider squashing before merge.
🟢 Minor
4. The aspell additions are fine but PCID and ZeroMQChannel appear unrelated to this PR — possibly leftover from a rebase. Not blocking but worth cleaning up.
5. The #else legacy path can be annotated with a TODO and target release for removal, matching what's done in the swss PR.
Questions
- Under storm conditions with LruDedup active, what's the observed peak queue depth now? (The PR description mentions 1.5 GB at peak with FIFO — what does it drop to with dedup?)
- Is there concern about the notification thread's
pop()latency increasing due to the LRU index maintenance, or is the dedup lookup O(1) amortized? - The
REDIS_TABLE_NOTIFICATIONS_PER_DB(dbAsic)macro — does this resolve to different channel names per DB instance, or is it always"NOTIFICATIONS"? If always the same, this consumer and orchagent's consumers are all on the same channel (redundant fan-out).
Dependency
Requires sonic-swss-common PR #1191 merged first (provides SWSS_NOTIFICATIONCONSUMER_HAS_LRU_DEDUP, NotificationQueuePolicy::LruDedup, DEFAULT_NC_POP_BATCH_SIZE, setStatsLabel, getLruDedupQueue).
Verdict: Simple, safe, well-guarded change. The core logic is correct — LruDedup on byte-identical payloads is safe for saimeta's idempotent processing model. Squash commits and optionally add a comment about the missing allowlist. Good to merge once the swss-common dependency lands.
12cb318 to
40267a2
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
The RedisChannel consumer should see ALL notification, so an op allow list is not needed.
The syslog notification would still provide the dedup stats under storm.
Done.
Under FDB storm, the peek queue depth was 4000 (for 1000 MACs moving between 2 ports, at 50K/second).
The cost is amortized.
It resolves to NOTIFICATIONS channel, same as all the other orchs. |
40267a2 to
a5b6078
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
a5b6078 to
6d27660
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
@lolyu Just FYI, this PR has a dependency on sonic-net/sonic-swss-common#1191, which adds a symbol needed by this PR. So the build will fail until swss-common merges. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Senthil Krishnamurthy <senthil@nexthop.ai>
6d27660 to
55b2192
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
I just pulled down the latest sonic-buildimage which picked up this change and now I'm seeing the following build error does this look familiar? |
Description of PR
Opts
RedisChannel::m_notificationConsumerintoNotificationQueuePolicy::LruDedupand sets an orch-qualified stats label.RedisChannelruns a dedicated notification thread inside every libsairedis-linked process (orchagent, etc.) that SUBSCRIBE's to the same"NOTIFICATIONS"Redis channel as orchagent's per-orch NotificationConsumers. Under an FDB-event storm this thread's FIFO queue grew unbounded — we measured ~1.5 GB at peak even after the orchagent-side opt-in.LRU-dedup collapses byte-identical payloads at admission.
saimeta::Metastate transitions are idempotent under byte-identical payloads, som_saiObjectCollectionend state is unchanged.Type of change
Approach
How did you do it?
Changed the Notification consumer class to use the LruDedup queue policy
How did you verify/test it?
Unit Tests
Related work
HLD: sonic-net/SONiC#2334
sonic-swss PR: sonic-net/sonic-swss#4586
sonic-swss-common PR: sonic-net/sonic-swss-common#1191