[bmp] Pipeline and batch redis writes from openbmpd#36
Open
yutongzhang-microsoft wants to merge 3 commits into
Open
[bmp] Pipeline and batch redis writes from openbmpd#36yutongzhang-microsoft wants to merge 3 commits into
yutongzhang-microsoft wants to merge 3 commits into
Conversation
Profiling test_sessions_flapping[500] on a SONiC DUT showed openbmpd
consuming ~15% of OnCPU during BGP scale events. The flame graph stack
was dominated by:
openbmpd -> libswsscommon -> libhiredis -> libc -> vmlinux syscall
Root cause is in Server/src/RedisManager.cpp::WriteBMPTable: every BMP
route update went through a freshly constructed swss::Table and a
synchronous HSET round-trip to redis. With ~500 BGP sessions flapping
simultaneously and thousands of routes per session, this translated to
millions of small synchronous redis round-trips during a single
convergence window, all on the BMP message processing thread.
This commit changes RedisManager to use a swss::RedisPipeline plus
buffered swss::Table writers cached per-table-name:
* Setup() builds a single shared RedisPipeline (size 1024) bound to
stateDb_.
* WriteBMPTable() looks up (or lazily creates) a buffered Table for
the table name and calls set() into the pipeline buffer instead of
constructing a new Table per call.
* A new FlushBMPTables() method drains the pipeline; callers invoke
it at natural batch boundaries.
Flush points in MsgBusImpl_redis:
* update_unicastPrefix(): after processing all rib[i] entries, flush
once so that a BMP UPDATE carrying N routes results in ~1 pipelined
round-trip instead of N synchronous HSETs. The DEL path
(RemoveEntityFromBMPTable) already batched its keys; it now flushes
the SET pipeline before issuing the DEL so SET-then-DEL ordering on
the same key is preserved.
* update_Peer(): flush immediately after the single neighbor write
because peer up/down events are infrequent and consumers expect
them to be visible right away.
* ExitRedisManager(): drain any still-buffered updates so observed
state isn't lost on shutdown.
Behavioral notes:
* Thread safety: each MsgBusImpl_redis (and therefore each
RedisManager + pipeline) is constructed per BMP client connection
in openbmpd's per-client-thread model. The shared pipeline is not
used across threads, so no additional locking is introduced.
* Visibility: BMP state writes that previously hit redis
synchronously now wait until the enclosing BMP message finishes
processing before being flushed. In the absence of new BMP
messages, the pipeline still auto-flushes when its 1024-entry
buffer fills.
* The CONFIG_DB-driven enabledTables_ gate is preserved; disabled
tables short-circuit before any Table object is touched.
This is the openbmpd-side half of the work to reduce DUT-side observer
cost in BGP scale convergence tests; a follow-up will look at tightening
the default of BMP|table|bgp_rib_out_table for deployments that do not
consume the outbound RIB.
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Author
|
/azpw run |
Collaborator
|
Retrying failed(or canceled) jobs... |
Collaborator
|
No Azure DevOps builds found for #36. |
Address self-review findings on the RedisPipeline batching change: 1. ResetBMPTable now flushes the pipeline before reading keys via stateDb_. getKeys() runs on stateDb_'s connection and cannot see SETs still buffered in pipeline_ (which is on an independent connection), so without the flush those keys are missed by the DEL list and survive the reset. The buffered SETs then land on redis after the DEL, leaving stale entries. Flushing first makes the reset see the full key set and guarantees SET-before-DEL ordering for any key that gets re-added afterwards. 2. ExitRedisManager wraps FlushBMPTables in try/catch. This path is reached from ~MsgBusImpl_redis; letting a redis I/O error escape into the destructor chain could call std::terminate during unwinding. swss::RedisPipeline::~RedisPipeline already swallows exceptions for the same reason - mirror that here. Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
bufferedTables_ stores swss::Table writers that hold raw pointers into pipeline_ (obtained via pipeline_.get()). Today Setup() is only invoked once per RedisManager - from MsgBusImpl_redis's constructor - so this isn't a live bug. But if a future caller ever re-invokes Setup(), the shared_ptr reassignment of pipeline_ destroys the old pipeline while the existing Table entries still reference it, turning the next WriteBMPTable into a use-after-free. Clear bufferedTables_ at the top of Setup() so the contract "Setup rebuilds all redis state" actually holds. Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Profiling
test_sessions_flapping[500]on a SONiC DUT showed thatopenbmpdconsumed ~15% of OnCPU during BGP convergence, with the entire stack dominated by:openbmpd -> libswsscommon -> libhiredis -> libc -> vmlinux syscallRoot cause is in
Server/src/RedisManager.cpp::WriteBMPTable: every BMP route update goes through a freshly constructedswss::Tableand a synchronousHSETround-trip to redis. With ~500 BGP sessions flapping simultaneously and thousands of routes per session, this becomes millions of small synchronous redis round-trips on the BMP message processing thread during a single convergence window.This PR replaces the per-call
swss::Table+ synchronousset()with a sharedswss::RedisPipelineand bufferedswss::Tablewriters cached per table name. Flush points are chosen at natural batch boundaries (end of each BMP UPDATE message; immediately for peer state transitions).Changes
Server/src/RedisManager.{h,cpp}swss::RedisPipeline pipeline_(size 1024) built inSetup(), bound tostateDb_.std::unordered_map<std::string, std::unique_ptr<swss::Table>> bufferedTables_caches one bufferedTableper enabled table name (BGP_NEIGHBOR_TABLE/BGP_RIB_IN_TABLE/BGP_RIB_OUT_TABLE).WriteBMPTable()callsGetOrCreateBufferedTable()instead of constructing a freshTable; eachset()enqueues into the pipeline buffer rather than triggering a synchronous HSET.FlushBMPTables()drains the pipeline; callers invoke it at message boundaries.RemoveEntityFromBMPTable()flushes the pipeline before issuing the batchedDEL, preserving SET-then-DEL ordering on the same key.ExitRedisManager()drains the pipeline so updates already observed by openbmpd are not lost at shutdown.Server/src/redis/MsgBusImpl_redis.cppupdate_unicastPrefix(): after the per-rib-entry loop, callFlushBMPTables()once on the ADD path; the DEL path is unchanged becauseRemoveEntityFromBMPTable()flushes internally.update_Peer(): flush immediately after the single neighbor write, because peer up/down events are infrequent and consumers expect them to be visible right away.Why this is safe
MsgBusImpl_redis(and thereforeRedisManager+ its pipeline) is constructed per BMP client connection in openbmpd's per-client-thread model. The shared pipeline never crosses threads, so no new locking is needed.enabledTables_is still consulted before anyTableobject is touched; disabled tables short-circuit early just as before.RemoveEntityFromBMPTable()now flushes the pipeline before issuing the batchedDELon the same connection, so a SET followed by a DEL of the same key cannot be reordered.stateDb_(not via the pipeline) and are only triggered on FRR reconnect, which is rare; they remain functionally unchanged.Expected impact
For a BMP UPDATE carrying N route entries, redis round-trips drop from N synchronous HSETs to ~1 pipelined batch. In the
test_sessions_flapping[500]profile this is the change targeting the ~15% openbmpd OnCPU share; combined with a separate effort to tighten the default ofBMP|table|bgp_rib_out_tablefor deployments that do not consume the outbound RIB, that share is expected to drop substantially.Test plan
swss-common(which already providesswss::RedisPipelineand the bufferedswss::Tablector used here).BGP_NEIGHBOR_TABLE/BGP_RIB_IN_TABLE/BGP_RIB_OUT_TABLEentries appear inBMP_STATE_DBexactly as before (same keys, same fields), and that DEL events still remove the corresponding entries.ResetAllTablesstill clears state.test_sessions_flapping[500]; expect theopenbmpd -> libswsscommon -> libhiredisstack to shrink substantially.