[bmp] Pipeline and batch redis writes from openbmpd #36

Open
yutongzhang-microsoft wants to merge 3 commits into sonic-net:master from yutongzhang-microsoft:optimize/redis-pipeline-batching

@yutongzhang-microsoft

What this PR does

Profiling test_sessions_flapping[500] on a SONiC DUT showed that openbmpd consumed ~15% of OnCPU during BGP convergence, with the entire stack dominated by:

openbmpd -> libswsscommon -> libhiredis -> libc -> vmlinux syscall

Root cause is in Server/src/RedisManager.cpp::WriteBMPTable: every BMP route update goes through a freshly constructed swss::Table and a synchronous HSET round-trip to redis. With ~500 BGP sessions flapping simultaneously and thousands of routes per session, this becomes millions of small synchronous redis round-trips on the BMP message processing thread during a single convergence window.

This PR replaces the per-call swss::Table + synchronous set() with a shared swss::RedisPipeline and buffered swss::Table writers cached per table name. Flush points are chosen at natural batch boundaries (end of each BMP UPDATE message; immediately for peer state transitions).

Changes

Server/src/RedisManager.{h,cpp}

  • New shared swss::RedisPipeline pipeline_ (size 1024) built in Setup(), bound to stateDb_.
  • New std::unordered_map<std::string, std::unique_ptr<swss::Table>> bufferedTables_ caches one buffered Table per enabled table name (BGP_NEIGHBOR_TABLE / BGP_RIB_IN_TABLE / BGP_RIB_OUT_TABLE).
  • WriteBMPTable() calls GetOrCreateBufferedTable() instead of constructing a fresh Table; each set() enqueues into the pipeline buffer rather than triggering a synchronous HSET.
  • New FlushBMPTables() drains the pipeline; callers invoke it at message boundaries.
  • RemoveEntityFromBMPTable() flushes the pipeline before issuing the batched DEL, preserving SET-then-DEL ordering on the same key.
  • ExitRedisManager() drains the pipeline so updates already observed by openbmpd are not lost at shutdown.
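
The caching and buffering described above can be sketched with standard-library stand-ins. This is a minimal illustration, not the PR's code: `FakePipeline`, `FakeTable`, and `SketchRedisManager` are hypothetical types that only mimic the shape of `swss::RedisPipeline`, `swss::Table`, and `RedisManager`, and the command strings are placeholders. It shows the gate on `enabledTables_`, one cached writer per table name, and writes staying buffered until `FlushBMPTables()`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for swss::RedisPipeline: buffers commands until flush().
struct FakePipeline {
    std::vector<std::string> pending;  // buffered, not yet sent
    std::vector<std::string> sent;     // what the server has received
    std::size_t roundTrips = 0;        // one per non-empty flush

    void flush() {
        if (pending.empty()) return;
        sent.insert(sent.end(), pending.begin(), pending.end());
        pending.clear();
        ++roundTrips;
    }
};

// Hypothetical stand-in for a buffered swss::Table bound to a pipeline.
class FakeTable {
public:
    FakeTable(FakePipeline* pipeline, std::string name)
        : pipeline_(pipeline), name_(std::move(name)) {
        assert(pipeline_ != nullptr);
    }
    // Enqueue only; nothing reaches the server until the pipeline is flushed.
    void set(const std::string& key, const std::string& value) {
        pipeline_->pending.push_back("HSET " + name_ + "|" + key + " " + value);
    }

private:
    FakePipeline* pipeline_;  // non-owning; the manager owns the pipeline
    std::string name_;
};

class SketchRedisManager {
public:
    explicit SketchRedisManager(std::unordered_set<std::string> enabled)
        : enabledTables_(std::move(enabled)) {}

    bool WriteBMPTable(const std::string& table, const std::string& key,
                       const std::string& value) {
        if (enabledTables_.count(table) == 0) return false;  // gate comes first
        GetOrCreateBufferedTable(table).set(key, value);
        return true;
    }
    void FlushBMPTables() { pipeline_.flush(); }

    std::size_t cachedTableCount() const { return bufferedTables_.size(); }
    const FakePipeline& pipeline() const { return pipeline_; }

private:
    // Reuse the cached writer for this table name, creating it on first use.
    FakeTable& GetOrCreateBufferedTable(const std::string& table) {
        auto it = bufferedTables_.find(table);
        if (it == bufferedTables_.end()) {
            it = bufferedTables_
                     .emplace(table, std::make_unique<FakeTable>(&pipeline_, table))
                     .first;
        }
        return *it->second;
    }

    std::unordered_set<std::string> enabledTables_;
    FakePipeline pipeline_;  // declared before the tables so it outlives them
    std::unordered_map<std::string, std::unique_ptr<FakeTable>> bufferedTables_;
};
```

In this model, two writes to the same enabled table create a single cached writer and cost zero round-trips until the flush; a write to a disabled table never creates a writer at all.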

Server/src/redis/MsgBusImpl_redis.cpp

  • update_unicastPrefix(): after the per-rib-entry loop, call FlushBMPTables() once on the ADD path; the DEL path is unchanged because RemoveEntityFromBMPTable() flushes internally.
  • update_Peer(): flush immediately after the single neighbor write, because peer up/down events are infrequent and consumers expect them to be visible right away.
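
The two flush points can be modelled in a few lines. This is a sketch under stated assumptions: `SketchManager`, `updateUnicastPrefixAdd`, and `updatePeer` are hypothetical names, the manager only counts commands instead of talking to redis, and the auto-flush rule (flush as soon as the buffer holds `capacity` commands) is a simplification of whatever the real pipeline does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for RedisManager's buffered write path.
struct SketchManager {
    std::size_t capacity;        // pipeline buffer size (1024 in the PR)
    std::size_t pending = 0;     // commands buffered, not yet sent
    std::size_t roundTrips = 0;  // times the buffer was sent to redis

    explicit SketchManager(std::size_t cap) : capacity(cap) { assert(cap > 0); }

    void WriteBMPTable(const std::string& /*key*/) {
        if (++pending >= capacity) FlushBMPTables();  // auto-flush when full
    }
    void FlushBMPTables() {
        if (pending == 0) return;  // nothing buffered, no round-trip
        pending = 0;
        ++roundTrips;
    }
};

// ADD path: buffer every rib entry, then flush once at the message boundary.
void updateUnicastPrefixAdd(SketchManager& mgr, const std::vector<std::string>& rib) {
    for (const auto& prefix : rib) mgr.WriteBMPTable(prefix);
    mgr.FlushBMPTables();
}

// Peer events: a single write, flushed immediately so it is visible right away.
void updatePeer(SketchManager& mgr, const std::string& peer) {
    mgr.WriteBMPTable(peer);
    mgr.FlushBMPTables();
}
```

Under this model an UPDATE with fewer than `capacity` entries costs exactly one round-trip. Larger messages add one round-trip per full buffer, which is why the description says "~1" rather than "1".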

Why this is safe

| Concern | Mitigation |
| --- | --- |
| Thread safety | MsgBusImpl_redis (and therefore RedisManager + its pipeline) is constructed per BMP client connection in openbmpd's per-client-thread model. The shared pipeline never crosses threads, so no new locking is needed. |
| Visibility delay | Writes are flushed at every BMP UPDATE boundary, every peer event, every DEL, and on shutdown. The pipeline also auto-flushes when its 1024-entry buffer fills. Net effect: BMP state visibility is bounded by the duration of a single BMP UPDATE message, which is already a natural granularity for BMP consumers. |
| CONFIG_DB gating | enabledTables_ is still consulted before any Table object is touched; disabled tables short-circuit early just as before. |
| DEL ordering | RemoveEntityFromBMPTable() now flushes the pipeline before issuing the batched DEL on the same connection, so a SET followed by a DEL of the same key cannot be reordered. |
| ResetAllTables / ResetBMPTable | These existing reset paths operate directly on stateDb_ (not via the pipeline) and are only triggered on FRR reconnect, which is rare; they remain functionally unchanged. |

Expected impact

For a BMP UPDATE carrying N route entries, redis round-trips drop from N synchronous HSETs to ~1 pipelined batch. In the test_sessions_flapping[500] profile this is the change targeting the ~15% openbmpd OnCPU share; combined with a separate effort to tighten the default of BMP|table|bgp_rib_out_table for deployments that do not consume the outbound RIB, that share is expected to drop substantially.

Test plan

  • Build openbmpd against current swss-common (which already provides swss::RedisPipeline and the buffered swss::Table ctor used here).
  • Run an existing BMP scale scenario; verify BGP_NEIGHBOR_TABLE / BGP_RIB_IN_TABLE / BGP_RIB_OUT_TABLE entries appear in BMP_STATE_DB exactly as before (same keys, same fields), and that DEL events still remove the corresponding entries.
  • Verify that on FRR disconnect + reconnect, ResetAllTables still clears state.
  • Re-profile test_sessions_flapping[500]; expect the openbmpd -> libswsscommon -> libhiredis stack to shrink substantially.

Profiling test_sessions_flapping[500] on a SONiC DUT showed openbmpd
consuming ~15% of OnCPU during BGP scale events. The flame graph stack
was dominated by:

    openbmpd -> libswsscommon -> libhiredis -> libc -> vmlinux syscall

Root cause is in Server/src/RedisManager.cpp::WriteBMPTable: every BMP
route update went through a freshly constructed swss::Table and a
synchronous HSET round-trip to redis. With ~500 BGP sessions flapping
simultaneously and thousands of routes per session, this translated to
millions of small synchronous redis round-trips during a single
convergence window, all on the BMP message processing thread.

This commit changes RedisManager to use a swss::RedisPipeline plus
buffered swss::Table writers cached per-table-name:

  * Setup() builds a single shared RedisPipeline (size 1024) bound to
    stateDb_.
  * WriteBMPTable() looks up (or lazily creates) a buffered Table for
    the table name and calls set() into the pipeline buffer instead of
    constructing a new Table per call.
  * A new FlushBMPTables() method drains the pipeline; callers invoke
    it at natural batch boundaries.

Flush points in MsgBusImpl_redis:

  * update_unicastPrefix(): after processing all rib[i] entries, flush
    once so that a BMP UPDATE carrying N routes results in ~1 pipelined
    round-trip instead of N synchronous HSETs. The DEL path
    (RemoveEntityFromBMPTable) already batched its keys; it now flushes
    the SET pipeline before issuing the DEL so SET-then-DEL ordering on
    the same key is preserved.
  * update_Peer(): flush immediately after the single neighbor write
    because peer up/down events are infrequent and consumers expect
    them to be visible right away.
  * ExitRedisManager(): drain any still-buffered updates so observed
    state isn't lost on shutdown.

Behavioral notes:

  * Thread safety: each MsgBusImpl_redis (and therefore each
    RedisManager + pipeline) is constructed per BMP client connection
    in openbmpd's per-client-thread model. The shared pipeline is not
    used across threads, so no additional locking is introduced.
  * Visibility: BMP state writes that previously hit redis
    synchronously now wait until the enclosing BMP message finishes
    processing before being flushed. In the absence of new BMP
    messages, the pipeline still auto-flushes when its 1024-entry
    buffer fills.
  * The CONFIG_DB-driven enabledTables_ gate is preserved; disabled
    tables short-circuit before any Table object is touched.

This is the openbmpd-side half of the work to reduce DUT-side observer
cost in BGP scale convergence tests; a follow-up will look at tightening
the default of BMP|table|bgp_rib_out_table for deployments that do not
consume the outbound RIB.

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@yutongzhang-microsoft
Author

/azpw run

@mssonicbld
Collaborator

⚠️ Notice: /azpw run only runs failed jobs now. If you want to trigger a whole pipeline run, please rebase your branch or close and reopen the PR.
💡 Tip: You can also use /azpw retry to retry failed jobs directly.

Retrying failed (or canceled) jobs...

@mssonicbld
Collaborator

No Azure DevOps builds found for #36.

Address self-review findings on the RedisPipeline batching change:

1. ResetBMPTable now flushes the pipeline before reading keys via
   stateDb_. getKeys() runs on stateDb_'s connection and cannot see
   SETs still buffered in pipeline_ (which is on an independent
   connection), so without the flush those keys are missed by the
   DEL list and survive the reset. The buffered SETs then land on
   redis after the DEL, leaving stale entries. Flushing first makes
   the reset see the full key set and guarantees SET-before-DEL
   ordering for any key that gets re-added afterwards.

2. ExitRedisManager wraps FlushBMPTables in try/catch. This path is
   reached from ~MsgBusImpl_redis; letting a redis I/O error escape
   into the destructor chain could call std::terminate during
   unwinding. swss::RedisPipeline::~RedisPipeline already swallows
   exceptions for the same reason - mirror that here.
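
Both findings can be reproduced in a small model. This is a sketch, not the PR's code: `SketchManager` is a hypothetical class in which a `std::map` plays the redis keyspace, a vector plays the pipeline buffer, and `failOnFlush` simulates a redis I/O error. Catching with `catch (...)` is an assumption about how the wrapper might be written.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: `server_` is the redis keyspace and `buffered_` holds
// SETs queued in the pipeline that the server has not seen yet.
class SketchManager {
public:
    explicit SketchManager(bool failOnFlush = false) : failOnFlush_(failOnFlush) {}

    void bufferedSet(const std::string& key, const std::string& value) {
        buffered_.emplace_back(key, value);
    }

    void FlushBMPTables() {
        if (failOnFlush_) throw std::runtime_error("simulated redis I/O error");
        for (const auto& kv : buffered_) server_[kv.first] = kv.second;
        buffered_.clear();
    }

    // Finding 1: flush before listing keys so buffered SETs are included.
    void ResetBMPTable(bool flushFirst) {
        if (flushFirst) FlushBMPTables();
        std::vector<std::string> keys;
        for (const auto& kv : server_) keys.push_back(kv.first);  // getKeys()
        for (const auto& key : keys) server_.erase(key);          // batched DEL
        assert(server_.empty());  // the reset itself always empties the server view
    }

    // Finding 2: reached from a destructor, so no exception may escape.
    bool ExitRedisManager() noexcept {
        try {
            FlushBMPTables();
            return true;
        } catch (...) {
            return false;  // best-effort drain; the error is swallowed
        }
    }

    std::size_t serverKeyCount() const { return server_.size(); }

private:
    bool failOnFlush_;
    std::map<std::string, std::string> server_;
    std::vector<std::pair<std::string, std::string>> buffered_;
};

// Keys left on the server after a reset followed by the next ordinary flush.
std::size_t staleKeysAfterReset(bool flushFirst) {
    SketchManager mgr;
    mgr.bufferedSet("route-b", "new");  // still sitting in the pipeline buffer
    mgr.ResetBMPTable(flushFirst);
    mgr.FlushBMPTables();               // e.g. the next message-boundary flush
    return mgr.serverKeyCount();
}
```

Without the flush, the buffered SET is invisible to the key listing and lands after the DEL, so one stale key survives. With the flush, the reset sees it and removes it.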

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

bufferedTables_ stores swss::Table writers that hold raw pointers into
pipeline_ (obtained via pipeline_.get()). Today Setup() is only invoked
once per RedisManager - from MsgBusImpl_redis's constructor - so this
isn't a live bug. But if a future caller ever re-invokes Setup(), the
shared_ptr reassignment of pipeline_ destroys the old pipeline while
the existing Table entries still reference it, turning the next
WriteBMPTable into a use-after-free.

Clear bufferedTables_ at the top of Setup() so the contract "Setup
rebuilds all redis state" actually holds.
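
A minimal sketch of the ordering, using hypothetical stand-ins (`FakePipeline`, `FakeTable`, `SketchManager`) rather than the swss-common types: the cached writers hold a raw pointer into the pipeline, so `Setup()` drops them before the `shared_ptr` is reassigned.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the pipeline; it only counts buffered commands.
struct FakePipeline {
    std::size_t pending = 0;
};

// Hypothetical stand-in for a buffered table writer.
struct FakeTable {
    explicit FakeTable(FakePipeline* p) : pipeline(p) { assert(p != nullptr); }
    void set() { ++pipeline->pending; }
    FakePipeline* pipeline;  // raw, non-owning pointer into the manager's pipeline
};

class SketchManager {
public:
    void Setup() {
        bufferedTables_.clear();                       // drop writers first
        pipeline_ = std::make_shared<FakePipeline>();  // then replace the pipeline
    }

    void WriteBMPTable(const std::string& table) {
        assert(pipeline_ && "Setup() must run first");
        auto it = bufferedTables_.find(table);
        if (it == bufferedTables_.end()) {
            it = bufferedTables_
                     .emplace(table, std::make_unique<FakeTable>(pipeline_.get()))
                     .first;
        }
        it->second->set();
    }

    std::size_t cachedTableCount() const { return bufferedTables_.size(); }
    std::size_t pending() const { return pipeline_ ? pipeline_->pending : 0; }

    // True when every cached writer refers to the pipeline currently owned.
    bool writersPointAtCurrentPipeline() const {
        for (const auto& kv : bufferedTables_) {
            if (kv.second->pipeline != pipeline_.get()) return false;
        }
        return true;
    }

private:
    std::shared_ptr<FakePipeline> pipeline_;
    std::unordered_map<std::string, std::unique_ptr<FakeTable>> bufferedTables_;
};
```

If the `clear()` were omitted, a second `Setup()` would leave the cached writer pointing at a destroyed pipeline, and the next `WriteBMPTable()` would dereference it.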

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
