Skip to content

swsscommon: expose batched ZmqProducerStateTable APIs to SWIG#1188

Merged
qiluo-msft merged 1 commit into
sonic-net:masterfrom
hdwhdw:daweihuang/zmq-batched-go-bindings
May 29, 2026
Merged

swsscommon: expose batched ZmqProducerStateTable APIs to SWIG#1188
qiluo-msft merged 1 commit into
sonic-net:masterfrom
hdwhdw:daweihuang/zmq-batched-go-bindings

Conversation

@hdwhdw
Copy link
Copy Markdown
Contributor

@hdwhdw hdwhdw commented May 14, 2026

Why

The batched overloads of ZmqProducerStateTable::set/del/send that take a std::vector<KeyOpFieldsValuesTuple> are unreachable from the Go bindings today (and Python is in the same boat). SWIG happily generates the C trampolines and Go methods (e.g. SwigcptrZmqProducerStateTable.Set__SWIG_3), but the parameter type std::vector<KeyOpFieldsValuesTuple> is not %template'd and SWIG cannot synthesize a default constructor for the opaque underlying tuple element, so callers have no way to build the argument — the type exposes only Swigcptr().

Verified empirically in a clean trixie + SWIG 4.3 + Go 1.24 environment:

Type %template? Constructor Add/Get/Size
FieldValuePairs (vector<FieldValueTuple>) yes
VectorString (vector<string>) yes
KeyOpFieldsValuesQueue (deque<KeyOpFieldsValuesTuple>) yes
vector<KeyOpFieldsValuesTuple> (parameter of batched set/send) no ❌ only Swigcptr()

Adding a single %template is not enough: even with the list templated, the element std::tuple<…> itself has no usable constructor — std_vector.i does not synthesize constructors for opaque tuple elements (std_deque.i is more permissive, which is why the existing KeyOpFieldsValuesQueue works). So the cleanest fix is a small inline shim layer.

Batched del(vector<string>) already works today because VectorString is templated.

What

Add three %inline helpers in pyext/swsscommon.i (also picked up by the goext build via the symlink):

  • zmqProducerBatchedSet(p, keys, fvss) — all entries get op=SET
  • zmqProducerBatchedDel(p, keys) — all entries get op=DEL
  • zmqProducerBatchedSend(p, keys, ops, fvss) — caller supplies per-entry op for mixed batches

Each helper takes parallel vectors of already-wrapped types (VectorString and FieldValuePairsList) and assembles the vector<KeyOpFieldsValuesTuple> on the C++ side before forwarding to the existing batched API. This mirrors the parallel-vector pattern Table::pops already uses. Size mismatches throw std::invalid_argument, which the existing SWIG exception bridge surfaces to callers as a recoverable panic.

Why we want this now

This unblocks bulk gNMI Set in sonic-gnmi (sonic-net/sonic-buildimage#27250 — bulk DASH route programming). Today MixedDbClient.handleTableData emits one ZMQ message per key; with these helpers it can ship N keys in a single message. Follow-up PR in sonic-gnmi to land afterwards.

Tests

  • New Go tests in goext/swsscommon_test.go (TestZmqProducerBatchedSet/Del/Send and a SetSizeMismatch negative case) drive each helper through a ZmqConsumerStateTable and verify the entries arrive on the consumer side.
  • Per-entry semantic correctness is already covered by the existing C++ tests in tests/zmq_state_ut.cpp; the Go tests exist to prove the SWIG glue is reachable from Go.
  • Verified locally inside sonic-slave-trixie:
    • ./configure --disable-yangmodules && make -j && make install — clean build, no warnings introduced
    • ./tests/tests --gtest_filter='ZmqConsumerStateTable*' — all 3 ZMQ unit tests pass
    • cd goext && go test -v -run TestZmqProducer — all 4 new tests pass

Related

The batched overloads of ZmqProducerStateTable::set/del/send that take a
std::vector<KeyOpFieldsValuesTuple> are unreachable from Go (and Python)
today. SWIG generates the C trampolines and Go methods for them, but the
parameter type std::vector<KeyOpFieldsValuesTuple> is not %template'd
and SWIG cannot synthesize a default constructor for the underlying
tuple element, so callers have no way to build the argument: the type
exposes only Swigcptr(). Batched DEL works (vector<string> is templated)
but batched SET and the mixed SET/DEL send() do not.

Add three %inline shims (zmqProducerBatchedSet/Del/Send) that take
parallel vectors of already-wrapped types (VectorString and
FieldValuePairsList) and assemble the KeyOpFieldsValuesTuple vector on
the C++ side, mirroring the parallel-vector pattern used by Table::pops.
Size mismatches throw std::invalid_argument, which the existing SWIG
exception bridge surfaces to callers as a recoverable panic.

This unblocks bulk gNMI Set in sonic-gnmi (e.g. DASH route programming)
which today emits one ZMQ message per key.

Add Go tests in goext/swsscommon_test.go that drive each helper through
a ZmqConsumerStateTable and verify the entries arrive on the consumer
side. Per-entry semantic correctness is already covered by the existing
C++ unit tests in tests/zmq_state_ut.cpp; the Go tests prove the SWIG
bindings are reachable from Go.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
Copilot AI review requested due to automatic review settings May 14, 2026 16:14
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request makes the batched swss::ZmqProducerStateTable APIs usable from SWIG-generated Go/Python bindings by adding a small C++ shim layer in the SWIG interface, avoiding SWIG’s inability to construct std::vector<KeyOpFieldsValuesTuple> arguments directly.

Changes:

  • Added %inline shim helpers (zmqProducerBatchedSet/Del/Send) that accept parallel SWIG-friendly vectors and assemble std::vector<KeyOpFieldsValuesTuple> in C++ before forwarding to existing batched APIs.
  • Added Go tests that exercise the new helpers end-to-end via ZmqConsumerStateTable, including a size-mismatch negative case.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
pyext/swsscommon.i Adds SWIG-exposed C++ inline helpers to construct and call batched ZMQ producer operations from Go/Python.
goext/swsscommon_test.go Adds Go coverage to verify the new SWIG glue is reachable and functions correctly for set/del/send batches.

@prabhataravind
Copy link
Copy Markdown
Contributor

prabhataravind commented May 19, 2026

@hdwhdw could you check PR checker failures?

@hdwhdw
Copy link
Copy Markdown
Contributor Author

hdwhdw commented May 20, 2026

/azpw run Azure.sonic-swss-common

Re-running to confirm the Build amd64 failure on a Rust ZMQ reconnect test (zmq_consumer_producer_state_tables_sync_api_connect_late_reconnect, sync.rs:223 expected Data, got Timeout within 2 s) is a flake unrelated to this PR. This PR only touches pyext/swsscommon.i — the Rust crates wrap the C++ API directly and don't go through SWIG, so the changed surface cannot affect this test. The Build amd64/ubuntu-22.04 job passed.

@mssonicbld
Copy link
Copy Markdown
Collaborator

⚠️ Notice: /azpw run only runs failed jobs now. If you want to trigger a whole pipline run, please rebase your branch or close and reopen the PR.
💡 Tip: You can also use /azpw retry to retry failed jobs directly.

Retrying failed(or canceled) jobs...

@mssonicbld
Copy link
Copy Markdown
Collaborator

Retrying failed(or canceled) stages in build 1113708:

✅Stage Build:

  • Job amd64: retried.

@hdwhdw
Copy link
Copy Markdown
Contributor Author

hdwhdw commented May 21, 2026

Re-run results: the previous Build amd64 Rust late-reconnect flake cleared (now PASS), but Test vstest fails on test_inband_intf_mgmt_vrf.py::TestInbandInterface::test_InbandIntf[PortChannel5] and [Loopback1] (line 128: assert status == True after INTF_TABLE lookup) — Ethernet4 and Vlan100 parametrisations pass.

This is independent of this PR. The change is a pure SWIG %inline addition in pyext/swsscommon.i — no existing C++, Python, or Go symbol is modified; only three new exports (zmqProducerBatchedSet/Del/Send) are added. The test_InbandIntf test uses swsscommon.Table('INTF_TABLE') and doesn't touch ZmqProducerStateTable at all. The flaky plugin retried 3 times and produced the same two failures, so it's likely an underlying intfmgr/portsorch timing issue that surfaces for these two interface kinds on the Azure DVS runner.

Cross-check: unrelated PR #1191 (LruDedup queue policy for Notification Consumer) also has a failing vstest run with different symptoms (docker.errors.APIError: 500 Server Error in fixture setup), and the same vstest job passes on most recent merged PRs (#1170, #1177). The vstest pipeline is unstable on the Azure runners this week.

/azpw run Azure.sonic-swss-common

Copy link
Copy Markdown
Contributor

@prabhataravind prabhataravind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. @qiluo-msft to sign-off/merge as well.

@prabhataravind
Copy link
Copy Markdown
Contributor

@prsunny for viz as this is needed for 202605.

@qiluo-msft qiluo-msft merged commit 3110aaa into sonic-net:master May 29, 2026
19 checks passed
hdwhdw added a commit to hdwhdw/sonic-buildimage that referenced this pull request May 29, 2026
sonic-net/sonic-swss-common#1188 merged to master as 3110aaa5.
Update the submodule pointer from the PR branch tip d7330786 to the
merge commit on master so the bump is reachable from sonic-buildimage
master once this PR lands.

sonic-gnmi submodule pointer is unchanged for now and will be bumped
to the merge commit of sonic-net/sonic-gnmi#675 once that PR lands,
at which point this PR can come out of draft.

Related: sonic-net#27250

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
prabhataravind pushed a commit to sonic-net/sonic-gnmi that referenced this pull request May 31, 2026
…675)

* MixedDbClient: batch ZMQ Set/Del for multi-key gNMI updates

Why:
The current handleTableData loops DbSetTable/DbDelTable per key, so a
single gNMI Set carrying N keys to a DASH_* table generates N ZMQ
round-trips even though the messages all target the same producer.
At the scale DASH route programming requires (10k-100k entries per
update), the per-key overhead dominates wall-clock time.

What:
* Add DbBatchSetTable(table, []tableBatchEntry) and DbBatchDelTable(
  table, []string) on MixedDbClient. When the table is ZMQ-backed
  these collapse the batch into a single ZmqProducerBatchedSet /
  ZmqProducerBatchedDel call (one ZMQ message). For plain Table or
  non-ZMQ ProducerStateTable backends — where no batched C++ API
  exists — they fall back to the existing per-entry loop, so behavior
  for DASH_HA_* tables and ordinary CONFIG_DB writes is unchanged.

* Wire the multi-key code paths in handleTableData to the new
  helpers: the opRemove branch and the multi-key JSON opAdd branch
  now build a slice and make one batched call. Single-key paths are
  untouched (batching one entry is pointless).

* Wrap the new SWIG entry points with ZmqProducerBatched{Set,Del}Wrapper
  to convert C++ exceptions into Go errors, mirroring the existing
  ProducerStateTable wrappers.

Tests:
client_test.go adds:
* TestZmqBatchedSetEndToEnd / TestZmqBatchedDelEndToEnd — round-trip
  N entries through a real ZmqProducerStateTable + ZmqConsumerStateTable
  and verify the consumer drains them all.
* TestDbBatchSetTableFallsBackForPlainTable / ...EmptyAndSingle —
  verify the plainTableMap fallback path and empty/single-entry
  shortcuts.
* TestHandleTableDataBatchesMultiKeyJSON / ...OpRemove — gomonkey-
  patched proof that handleTableData routes through DbBatchSetTable /
  DbBatchDelTable exactly once for multi-key payloads instead of N
  per-key calls.

Related:
* sonic-net/sonic-buildimage#27250 — tracking issue for the
  optimization.
* Depends on sonic-net/sonic-swss-common#1188, which exposes
  ZmqProducerBatchedSet/Del/Send to Go via SWIG %inline helpers.
  Until that PR lands and a fresh libswsscommon artifact is
  published, sonic-gnmi CI on this branch will fail to build — the
  Go bindings for the new symbols do not exist in the current
  master swsscommon artifact.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>

* MixedDbClient tests: harden batch tests against global-state ordering

Why:
Copilot review on PR #675 flagged four genuine test-hygiene issues in
the new batched-Set/Del tests. All four come from the tests touching
package-level globals (RedisDbMap, zmqClientMap) without going through
the existing helpers, which makes them order-dependent and racy with
other tests that explicitly nil out RedisDbMap.

What:
* TestZmqBatchedSetEndToEnd / TestZmqBatchedDelEndToEnd: the cleanup
  loop freed every SWIG handle in zmqClientMap but left the (now
  dangling) entries in the map. A later test that ranges zmqClientMap
  would double-DeleteZmqClient the same pointer, which on CGO is a
  real crash. Delete the map entry alongside the handle.

* TestHandleTableDataBatchesMultiKeyJSON /
  TestHandleTableDataBatchesOpRemove: replace the ad-hoc
  RedisDbMap[...] = redis.NewClient(...) with the existing
  setupMixedDbRedis(t, mapkey) helper, which save/restores RedisDbMap
  via Target2RedisDb. This stops the tests from panicking with
  'assignment to entry in nil map' when they run after any test that
  sets RedisDbMap = nil (e.g. the TestUseRedis* family at lines
  ~490/521/537/553).

No production code changes; behavior of the batched Set/Del path is
unchanged.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>

* ci: install libyang3 runtime alongside libyang 1.x

sonic-swss-common master switched to libyang3 (>= 3.12.2) in commit
1431c3d3 ('port to libyang3 with libyang1 fallback') and the published
libswsscommon Debian artifact now hard-depends on the libyang3 package.
The sonic-gnmi pipeline only downloads libyang 1.0 from common_libs, so
'dpkg -i libswsscommon_1.0.0_amd64.deb' fails with:

  dpkg: dependency problems prevent configuration of libswsscommon:
   libswsscommon depends on libyang3 (>= 3.12.2); however:
    Package libyang3 is not installed.

This breaks every sonic-gnmi PR build from #1188 onward.

Add libyang3_*.deb to the download pattern. libyang 1.x is still needed
because sonic-mgmt-common/cvl/internal/util/util.go cgo-links '-lyang'
against the v1 headers; the two runtimes coexist as distinct SONAMEs.

Carried in this PR because it is also the unblock for #675's own CI.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>

* MixedDbClient tests: drop bogus zmqClientMap nuke loop from new batch tests

The new TestZmqBatched{Set,Del}EndToEnd tests construct their ZmqClient
directly with swsscommon.NewZmqClient(zmqAddr), so they do not register
anything in the package-level zmqClientMap (only getZmqClientByAddress
does). The cleanup loop that ranged over zmqClientMap and called
DeleteZmqClient was therefore freeing pointers that belong to *other*
tests.

In source order, TestZmqReconnect (~line 657) and TestRemoveZmqClient
(~line 823) run first; both already do the same broken
'for _, c := range zmqClientMap { DeleteZmqClient(c) }' pattern but
without deleting from the map. By the time the new batch tests run,
zmqClientMap still holds entries that point at already-freed C++
objects. Iterating those entries triggered:

    double free or corruption (out)
    SIGABRT: abort
    signal arrived during cgo execution
    ...DeleteZmqClient...
    client_test.go:2173

in CI as soon as the build actually linked successfully (build 1125898).

The previous commit (5b770fb) added delete(zmqClientMap, k) on Copilot's
advice, which fixes the in-loop iterator-mutation hazard but does not
help here because the bad entries are inserted by prior tests, not by
ours. The right fix for the new tests is to not touch zmqClientMap at
all — the explicit DeleteZmqClient(zmqClient) is sufficient.

Pre-existing tests are left alone; their map-nuke pattern is latent
and only matters if any later test also iterates zmqClientMap. After
this change none do.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>

* MixedDbClient: log batch size when sending batched ZMQ messages

Adds a V(2) log line in DbBatchSetTable and DbBatchDelTable on the
ZMQ-backed path so the actual batch size landing in a single ZMQ
message is visible when debugging. Matches the verbosity used by the
surrounding 'Create ZmqProducerStateTable' logs and stays quiet at
default verbosity.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>

---------

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants