Skip to content

sonic-swss-common: port to libyang3, fix zmq use-after-free, and CI libnexthopgroup#1199

Closed
securely1g wants to merge 3 commits into
sonic-net:masterfrom
securely1g:combined-pr
Closed

sonic-swss-common: port to libyang3, fix zmq use-after-free, and CI libnexthopgroup#1199
securely1g wants to merge 3 commits into
sonic-net:masterfrom
securely1g:combined-pr

Conversation

@securely1g
Copy link
Copy Markdown

Combines the following PRs into a single submission:

All original commits and authorship preserved.

Signed-off-by: Baba securely1g@users.noreply.github.com

bradh352 and others added 3 commits May 27, 2026 00:33
…te CI

Port the YANG schema-walking code in defaultvalueprovider.{h,cpp} from
libyang1 to libyang3. sonic-swss-common is a submodule whose libyang
dependency is pinned by the parent project (sonic-buildimage), so the
source needs to compile against either version. Use LY_ARRAY_COUNT
(defined only by libyang3) as a preprocessor sentinel:

  * Trivial spelling/accessor differences (node types, ->child,
    leaf->dflt extraction, module->data layout, ly_ctx_destroy,
    ly_ctx_load_module) are hidden behind SWSS_LYS_* macros so the
    call sites stay near-identical to the libyang1 source.
  * Structurally divergent code (list-key extraction via lysc_is_key,
    leaf-list default values, the ly_ctx_new return-by-out-param
    signature) still needs explicit #ifdef branches.

SWIG's preprocessor cannot see <libyang/libyang.h>, so when it parses
defaultvalueprovider.h it always takes the libyang1 branch of the
version shim and generates wrappers that fail to compile against
libyang3 headers. `%ignore swss::DefaultValueHelper` in
pyext/swsscommon.i suppresses those wrappers (the helper is internal —
no Python or Go caller uses it) and keeps the generated bindings
version-agnostic.

CI changes wired up for the libyang3 packages:

  * Switch the .deb download/install patterns from libyang1
    (libyang_1.0*, libyang-*_1.0*, libyang-cpp_*, python3-yang_*) to
    libyang3 (libyang3_*, libyang-dev_3*) across azure-pipelines.yml
    and the build/sairedis/swss/test-docker templates so CI pulls the
    libyang3 artifacts the parent project produces.
  * In the amd64/ubuntu-22.04 host job, replace the python3-libyang
    .deb install with `pip install libyang==3.3.0`. The bookworm-
    built python3-libyang_3.1.0 .deb pins python3 (>= 3.11~,
    << 3.12) so it only installs on systems with Python 3.11; the
    vmImage ships Python 3.10. PyPI's libyang ships sdist only, so
    pip has to build the CFFI extension — use --no-build-isolation
    with apt's python3-cffi to sidestep a cffi version-mismatch
    Exception that jammy's pip 22.0.2 build-isolation env triggers
    (newer cffi in the overlay vs. the system /usr/lib/python3/
    dist-packages/_cffi_backend.so that still wins on sys.path).
    3.3.0 is the last 3.x release and includes patches we
    upstreamed against 3.1.0.
  * Drop the amd64/ubuntu-20.04 job, which has been pinned to
    `condition: false` (i.e. never runs) for some time.

Signed-off-by: Brad House <brad@brad-house.com>
ZmqServer dispatches incoming messages on its poll thread by looking up
a ZmqMessageHandler pointer in a (db, table) map populated by
ZmqConsumerStateTable's constructor via registerMessageHandler() and
calling handler->handleReceivedData(). There was no symmetric
removeMessageHandler and no ZmqConsumerStateTable destructor — so when
a consumer went out of scope, the map kept the now-dangling pointer.

That bites in test code that declares its ZmqServer before its
ZmqConsumerStateTable (the natural order — server has to exist to be
passed by reference into the consumer's constructor). Reverse-order
destruction then runs the consumer's destructor first, which destroys
m_selectableEvent and closes the underlying eventfd, while the server's
mqPollThread is still running and may be partway through a dispatch
that ends in m_selectableEvent.notify(). write() on the closed fd
returns EBADF, SelectableEvent::notify() throws
runtime_error("write failed"), the exception escapes the poll thread,
and std::terminate aborts the process — visible in CI as a crash
attributed to whichever test happened to be RUNing next
(ZmqServerLazzyBind in practice, since gtest prints [ RUN ] before the
test body executes).

The fix moves the handler map and its mutex out of ZmqServer into a
small, separately-allocated ZmqHandlerRegistry class. The server and
every registered consumer co-own a std::shared_ptr<ZmqHandlerRegistry>,
so:

  * The mutex is held across handler dispatch, so the consumer's
    destructor (which calls removeHandler() on the registry) blocks
    until any in-flight callback into it has returned — after which
    its other members can safely destruct.

  * The registry survives as long as either the server OR any
    registered handler still holds a reference. Destruction order
    between server and consumer no longer matters: whichever goes
    away last leaves a live registry behind for the other's
    destructor to use without touching freed memory.

The server's public surface — registerMessageHandler() and the new
removeMessageHandler() — is unchanged in signature; both delegate to
the registry. A new (internal) ZmqServer::getHandlerRegistry()
accessor returns the shared_ptr so ZmqMessageHandler implementations
can co-own the registry.

ZmqConsumerStateTable now caches its db name as std::string at
construction and stores std::shared_ptr<ZmqHandlerRegistry> instead of
ZmqServer&, so its destructor never dereferences the DBConnector or
the ZmqServer it was created with — only the shared registry.

Handler callbacks still must not themselves call register/removeHandler
on the registry (would self-deadlock); ZmqConsumerStateTable::
handleReceivedData only enqueues and notifies, so the restriction is
harmless in tree. Documented on the ZmqHandlerRegistry class.

Signed-off-by: Brad House <bhouse@nexthop.ai>
sonic-swss commit e6c482ca ("...") added -lnexthopgroup to fpmsyncd's
link line, expecting libnexthopgroup_*.deb to already be installed
when the swss build runs. sonic-swss-common's downstream "Compile
sonic swss" stage installs everything it needs from the common-lib
artifact, but its pattern list never picked up the new package, so
the link now fails with `cannot find -lnexthopgroup`. Add the
libnexthopgroup pattern next to the other Azure.sonic-buildimage.
common_libs entries in build-swss-template.yml so dpkg picks it up
in the existing install step.

Signed-off-by: Brad House <brad@brad-house.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@lguohan lguohan closed this May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants