
local ARS #7

Open
VladimirKuk wants to merge 313 commits into master from mrvl-local-ARS


@VladimirKuk

Raised PR for internal comments

@apannerselva apannerselva force-pushed the mrvl-local-ARS branch 3 times, most recently from 4466fb2 to f4fe7ae Compare November 3, 2025 18:04
StephenWangGoogle and others added 26 commits January 10, 2026 09:17
…eate

[P4Orch] Add ACL action list during ACL table creation if they are mandatory.
…et#4052)

- What I did
Check if SAI_DASH_APPLIANCE_ATTR_LOCAL_REGION_ID is implemented before attempting to create the dash appliance object.

- Why I did it
This prevents error handling for SAI_STATUS_NOT_IMPLEMENTED by proactively checking attribute capability using sai_query_attribute_capability.

- Changes:
Add capability check before creating dash appliance object
Only create appliance if attribute is supported and implemented
- How I verified it
Run DASH FNIC test
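
A minimal sketch of the capability-gated creation described above, written in Python for brevity. The real orchagent code is C++ and calls sai_query_attribute_capability; here `AttrCapability`, `query_capability` and `create_appliance` are hypothetical stand-ins, and returning None for a failed query is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AttrCapability:
    # Mirrors the three flags of SAI's sai_attr_capability_t.
    create_implemented: bool
    set_implemented: bool
    get_implemented: bool


def create_appliance_if_supported(
    query_capability: Callable[[str], Optional[AttrCapability]],
    create_appliance: Callable[[], str],
) -> Optional[str]:
    """Create the appliance only when the attribute is usable at create time.

    query_capability returns None when the query itself fails (a convention
    invented for this sketch).
    """
    cap = query_capability("SAI_DASH_APPLIANCE_ATTR_LOCAL_REGION_ID")
    if cap is None or not cap.create_implemented:
        # Skip creation up front instead of handling NOT_IMPLEMENTED later.
        return None
    return create_appliance()
```

The point of the design is that the unsupported case becomes an ordinary branch rather than an error path.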


Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>
Co-authored-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>
Signed-off-by: mint570 <runmingwu@google.com>
[P4Orch] Use bulk APIs in the next hop manager.
…onic-net#3956)

Previously, SRv6 test cases relied on a workaround to configure SIDs
by creating them directly in the kernel. The kernel would then notify
FRR, which in turn would program the SIDs in SONiC. This was the only
available method before the introduction of the static SRv6 CLI in FRR.

With the introduction of the static SRv6 CLI, we can now configure
SIDs directly through FRR, which is the intended and correct approach.
This commit updates the test cases to leverage the static SRv6 CLI.

This change aligns the test cases with the recommended and expected
workflow for configuring SIDs in SONiC, which is also used by
`bgpcfgd`.

Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
What I did

Revert to the old behavior for all SAI API call failures in main.cpp and for all failures in initSaiRedis(). This means orchagent again exits on critical failures such as create switch failure and other failures during init.
Also, when an unrecoverable SAI error occurs, orchagent currently logs an error once and then does not retry. To make the error obvious, log it periodically in the orchagent while loop so it can be detected quickly.
Why I did it
To make the behavior consistent with legacy behavior for critical failures like create switch failure and to make detection easier, given that there are no forced orchagent exits for most other cases.
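
The periodic re-logging idea can be sketched as follows, in Python for brevity. `PeriodicErrorReporter`, its interval, and the injected clock are hypothetical; the actual change lives in the C++ orchagent main loop and may be structured differently.

```python
import time
from typing import Callable, Optional


class PeriodicErrorReporter:
    """Re-logs a latched error at most once per interval."""

    def __init__(self, interval_s: float, log: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._interval_s = interval_s
        self._log = log
        self._clock = clock
        self._error: Optional[str] = None
        self._last_logged: Optional[float] = None

    def latch(self, message: str) -> None:
        # Remember the first unrecoverable error; later ones do not replace it.
        if self._error is None:
            self._error = message

    def tick(self) -> None:
        # Intended to be called once per main-loop iteration.
        if self._error is None:
            return
        now = self._clock()
        if self._last_logged is None or now - self._last_logged >= self._interval_s:
            self._log(self._error)
            self._last_logged = now
```

Calling `tick()` on every loop iteration keeps the error visible in the log without flooding it.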
Signed-off-by: mint570 <runmingwu@google.com>
What I did
Add metric to vnet route tunnel request, so vnet route tunnels can be created with a metric value.
Related yang model change: sonic-net/sonic-buildimage#25019

Why I did it
Enable vnet route tunnels to be created with a metric value that can be used to identify the type of route.
…ort.

Signed-off-by: mint570 <runmingwu@google.com>
[P4Orch] Add support for SAI functions needed for swss multicast support.
…t#4131)

What I did
Added a retry mechanism in countersyncd for communication between the OTEL client and the Open Telemetry collector.

Why I did it
To improve the robustness of countersyncd by handling transient connection failures, ensuring metrics are reliably sent even if the otel collector is temporarily unavailable.
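
A rough sketch of a bounded retry around a send call, in Python for brevity. countersyncd itself is not written in Python, and the commit text does not say which retry policy it uses; the exponential backoff, the `send_with_retry` name and the default values below are assumptions for illustration only.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def send_with_retry(send: Callable[[], T], max_attempts: int = 3,
                    base_delay_s: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call send() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the last failure
            # Assumed policy: double the delay after each failed attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Only transient connection errors are retried here; any other exception propagates immediately.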
…moved and created when the Port Speed is changed dynamically via GCU (sonic-net#3977)

Remove the old port object ID, queue ID, and PG ID from COUNTERS_DB when the port is removed. Add the new port object ID, queue ID, and PG ID to COUNTERS_DB when the port is created. The tables COUNTERS_QUEUE_PORT_MAP, COUNTERS_PG_PORT_MAP, COUNTERS_PG_INDEX_MAP, and COUNTERS_PG_NAME_MAP are updated.
Fixed the issue reported in sonic-net/sonic-buildimage#24385

---------

Signed-off-by: saksarav <sakthivadivu.saravanaraj@nokia.com>
Co-authored-by: Vineet  Mittal <46945843+vmittal-msft@users.noreply.github.com>
…e its unit test.

Signed-off-by: mint570 <runmingwu@google.com>
…entries.

Signed-off-by: mint570 <runmingwu@google.com>
[P4Orch] Implement functions to add multicast router interface table entries.
What I did
Set the relaxed attr parsing flag to true for the vnet route table; by default it is false.

If it is not set, an exception is thrown when there is an unexpected field.

sonic-swss/orchagent/request_parser.cpp

Lines 160 to 167 in a8d968c

 if (!relaxed_attr_parsing_) 
 { 
     throw std::invalid_argument(std::string("Unknown attribute name: ") + fvField(*i)); 
 } 
 else 
 { 
     continue; 
 } 

Why I did it

We are adding new fields such as pinned_state, and throwing the exception adds an unnecessary dependency on the order in which PRs are merged and features are enabled.
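
For illustration, the same strict versus relaxed behavior can be sketched in Python. `parse_attrs` and the field names in the usage are hypothetical; the real logic is the C++ snippet from request_parser.cpp quoted above.

```python
from typing import Dict, Iterable, Set, Tuple


def parse_attrs(fields: Iterable[Tuple[str, str]], known: Set[str],
                relaxed_attr_parsing: bool = False) -> Dict[str, str]:
    """Parse field/value pairs, rejecting or skipping unknown names."""
    parsed: Dict[str, str] = {}
    for name, value in fields:
        if name not in known:
            if not relaxed_attr_parsing:
                raise ValueError("Unknown attribute name: " + name)
            continue  # relaxed mode: ignore fields this version does not know
        parsed[name] = value
    return parsed
```

With the relaxed flag set, a producer can start writing a new field before the consumer understands it, which is the ordering dependency this change removes.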
…nic-net#4120)

What I did
Add the missing drop monitor attribute definitions to debug_counter.h and include them in the supported attributes set in debug_counter.cpp.

Why I did it
The drop monitor feature introduced new CONFIG_DB fields drop_monitor_status, drop_count_threshold, incident_count_threshold and window for drop monitoring configuration. However, these attributes were not added to the supported_debug_counter_attributes set, causing error logs during debug counter installation: "Unknown debug counter attribute 'drop_monitor_status'"
It can be reproduced by running the test_configurable_drop_counters.py test case, which fails in teardown with this error.

How I verified it
Verified the fix by running test_configurable_drop_counters.py which was previously failing in teardown with the error "Unknown debug counter attribute 'drop_monitor_status'". After applying the fix, the test passes without generating any error logs

* [countersyncd]: Add communication statistics recording and utilities

This PR adds communication statistics recording and debugging utilities to countersyncd for analyzing online issues and performance. The changes introduce a new utilities module with functions for formatting hex dumps and tracking inter-actor channel statistics.

Changes:

Added utilities module with hex formatting and channel statistics tracking
Instrumented all actor message receive points to record queue lengths
Added debug logging for raw netlink message payloads
…st router interface table entries.

Signed-off-by: mint570 <runmingwu@google.com>
Signed-off-by: SRAVANI KANASANI <kanasanis@google.com>
[P4Orch] Implement functions to update multicast router interface table entries.
…ace table entries.

Signed-off-by: mint570 <runmingwu@google.com>
[P4Orch] Implement functions to process/drain multicast router interface table entries.
What I did
A new slave naming convention is being used (sonic-net/sonic-buildimage#24922).
This PR updates the pipeline accordingly.
1. Enable polling of FEC histogram for gbsyncd in orchagent
2. Refactored port_rates.lua into compute_rate, which calculates bps and packet counters, and compute_ber, which calculates pre-FEC and post-FEC BER. Gearboxes do not support the port rate calculation, so it is skipped for GB_COUNTERS_DB.
3. Enable port_rates.lua on GB_COUNTERS_DB

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>
What I did

Remove explicit throwing of exceptions
Use generic orchagent SAI failure handling logic
Why I did it

To avoid unnecessary orchagent crashes when programming vxlan tunnels and nexthops on a vnet in case of SAI errors.
How I verified it
By running mock tests and making sure that there are no crashes when vxlan tunnel, mapping, or tunnel termination create/remove failures occur.
StephenWangGoogle and others added 2 commits May 4, 2026 22:29
[P4Orch] Add priority field to tunnel decap entry and Modify tunnel decap action as vrf_id is no longer used.
…c-net#4274)" (sonic-net#4547)

This reverts commit fa273fc.

Reverts sonic-net#4274 to unblock the sonic-swss submodule ref update for sonic-buildimage sonic-net/sonic-buildimage#26918.
PR checker history: build history Pipelines - Runs for Azure.sonic-buildimage. The t0 test always failed on sanity check.
@apannerselva apannerselva force-pushed the mrvl-local-ARS branch 2 times, most recently from 14a8f94 to 2c6ce3d Compare May 5, 2026 16:46
ganglyu and others added 25 commits May 6, 2026 13:12
…bject (sonic-net#4535)

What I did

In clearPortPhySerdesAttrCounterMap(), changed the log level from SWSS_LOG_ERROR to SWSS_LOG_WARN when a port has no serdes object.

Why I did it

On platforms that do not support serdes, port_serdes_id is SAI_NULL_OBJECT_ID, which is expected behavior. However, this was logged at ERROR level, causing false alarms in loganalyzer.

This is the same class of issue fixed in sonic-net#4467 (getPortSerdesIdFromPortId()), but in a different code path: clearPortPhySerdesAttrCounterMap() iterates all ports and logs ERROR for any port without a serdes object, even on platforms where serdes is simply not implemented.

How I verified it

On a platform that does not support serdes, confirmed that the log no longer appears at ERR level in syslog.
…nic-net#4541)" (sonic-net#4551)

This reverts commit 55075d3.

Reverts sonic-net#4541

This is redundant, as sonic-net#4254 addresses the same issues.

Hence reverting.
sonic-net#4381)

This reverts commit 5de5922.
Hi, @prsunny
The PR depends on sonic-net/sonic-buildimage#23628

We tested them all together and noticed that too many changes are required for sonic-mgmt to make the regression stable. Without changes to sonic-buildimage itself, the sonic-swss part alone breaks counters polling for the warm-reboot case, since there is no trigger to re-activate counter reads.
Handle ACL resource constraints with event based retry. Fixes sonic-net#4406

What I did

Use an event-based retry using RetryCache to handle ACL resource availability constraints.

Why I did it

Blind retries are not helpful when resources (ACL in this case) are not available. This can cause new events including route updates to be throttled even when retries are not guaranteed to help.

How I verified it

By using mock tests to test the behavior.
Also verified by the originator of issue 4406 that the problem reported is resolved.
Details if related

Uses the RetryCache mechanism that has already been used for some orchs like routeorch and srv6orch.
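
The difference between blind and event-based retry can be sketched as below, in Python for brevity. This is not the RetryCache API from orchagent; `RetryCacheSketch`, the constraint strings and the task names are invented for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, List


class RetryCacheSketch:
    """Parks failed tasks under the constraint that blocked them."""

    def __init__(self) -> None:
        self._parked: DefaultDict[str, List[str]] = defaultdict(list)

    def park(self, constraint: str, task: str) -> None:
        # Instead of retrying on every loop, wait for a matching event.
        self._parked[constraint].append(task)

    def resolve(self, constraint: str) -> List[str]:
        # Called when an event signals the constraint may no longer hold;
        # returns the tasks to retry, in the order they were parked.
        return self._parked.pop(constraint, [])
```

Tasks parked this way cost nothing until the relevant event arrives, so unrelated work such as route updates is not throttled by retries that cannot succeed yet.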
…net#4534)

Why I did it
Commit fe19634 ("Connect to teamd before adding the lag to STATE_DB") added teamdctl_connect() to the TeamPortSync constructor to verify teamd is live before writing to STATE_DB. However, this introduced a crash during teamd container restart.

On restart, teamd is launched with -r (recreate flag), which deletes and recreates each PortChannel kernel device with a new ifindex. Before recreation, the kernel emits RTM_NEWLINK for the pre-existing devices (initial dump). teamsyncd calls addLag(), which runs the TeamPortSync constructor with the old ifindex. In the constructor, team_init(old_ifindex) succeeds (the device still exists at this point), but teamdctl_connect() fails with ECONNREFUSED (error=111) because teamd's socket is not yet bound.
The catch block logs the failure and calls sleep(1) before retrying. During that 1-second sleep, teamd -r deletes the old kernel device and creates a new one with a new ifindex. On retry, team_init(old_ifindex) fails with EADDRNOTAVAIL (error=99) because the ifindex no longer exists. After max_retries=3 is exhausted, the constructor throws std::system_error.
The exception propagates uncaught through addLag() → onMsg() → libnl's C callback stack → NetLink::readData() (no catch) → Select::poll_descriptors() (catches only runtime_error, logs "readData error", returns ERROR). The main loop ignores the SELECT error, continues in a corrupted state, and eventually crashes with SIGSEGV in libnl-route-3 at a NULL pointer dereference.
Note: This crash did not exist before fe19634 because without teamdctl_connect(), the constructor succeeded immediately on attempt=0 (the device still existed), never triggering the retry sleep that opens the race window.
)

What I did
Add default IPv6 link local route to fix ipv6 ndp issue inside VRF

Why I did it
When a VRF is created, all the assigned IP addresses are removed and a new link-local address is regenerated in the Linux kernel, but SWSS does not add a route for link-local fe80::/10.
Linux sends NS packets with link-local addresses, and the peer's NA reply also uses a link-local address. Without the route, the NA packet is dropped and the NDP entry cannot be updated.

How I verified it
Built test image with the fix and verified it in a SONiC router. The system can process replied NA packet with link local dst IP address to that VRF.
* Southbound ZMQ runtime enablement
What I did

In orchagent/saihelper.cpp, added resolveCommunicationModeFromContextConfig(std::istream&, sai_redis_communication_mode_t) and call it from initSaiApi(). initSaiApi() opens /usr/share/sonic/hwsku/context_config.json once and streams it straight into the helper. If zmq_enable=false for the default context (guid=0), the helper returns REDIS_SYNC; otherwise it returns the input mode unchanged. The result is stored in gRedisCommunicationMode before initSaiRedis() sends the comm-mode attribute to sairedis.

Why I did it

When the southbound-ZMQ runtime knob is in play, context_config.json is the source of truth for whether ZMQ is actually used. gRedisCommunicationMode is set unconditionally from the -z cmdline flag in main.cpp and is then read by five notification callbacks in notifications.cpp (on_port_state_change, on_ha_set_event, on_ha_scope_event, etc.) to decide whether to forward events to ASIC_DB via NotificationProducer.

If the JSON disables ZMQ but -z zmq_sync was passed, sairedis silently falls back to RedisChannel (see sonic-sairedis#1835). The orchagent global was previously left at ZMQ_SYNC in that case, so notification handlers would forward events as if ZMQ were active, which is the wrong path. This change keeps the global aligned with the actual transport, symmetric with the fallback in syncd/Syncd.cpp.
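
The helper's decision rule can be sketched as follows, in Python for brevity. The real function is the C++ resolveCommunicationModeFromContextConfig(); the top-level "CONTEXTS" key, the string mode names, and the handling of unparsable input are assumptions of this sketch rather than details stated in the PR.

```python
import io
import json
from typing import TextIO


def resolve_communication_mode(stream: TextIO, requested_mode: str) -> str:
    """Return REDIS_SYNC if the default context disables ZMQ, else the input."""
    try:
        config = json.load(stream)
    except json.JSONDecodeError:
        return requested_mode  # assumption: bad JSON leaves the mode unchanged
    if not isinstance(config, dict):
        return requested_mode
    for context in config.get("CONTEXTS", []):  # assumed key name
        if context.get("guid") == 0 and context.get("zmq_enable") is False:
            return "REDIS_SYNC"
    return requested_mode


sample = io.StringIO('{"CONTEXTS": [{"guid": 0, "zmq_enable": false}]}')
print(resolve_communication_mode(sample, "ZMQ_SYNC"))  # prints REDIS_SYNC
```

Taking a stream rather than a path keeps the helper easy to unit test, which matches the std::istream& parameter described above.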
…onic-net#4558)

What I did

Clean up the corresponding entries in DPU_STATE_DB (DASH_HA_SET_STATE_TABLE and DASH_HA_SCOPE_STATE_TABLE) when an HA_SET or HA_SCOPE entry is deleted from DPU_APPL_DB.

Why I did it

Previously, on DEL of an HA_SET or HA_SCOPE entry, the SAI object and internal map entries were removed but the corresponding DPU_STATE_DB rows were left stale, leading to inconsistent state DB content after deletion.
…-net#4500)

Why I did it
Implements the fix for sonic-net/sonic-mgmt#24066

When autoneg is configured as 'off' on a platform whose ASIC does not support autoneg, the previous code logged an error and dropped the entire port config task (speed, admin_state etc).

How I did it
Disabling autoneg on such a platform is a no-op, so there is no reason to error out. Skipping it allows the other configs to be applied. The change adds another else case to handle this scenario gracefully.
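
A small sketch of the decision described above, in Python for brevity. `plan_autoneg` and its return values are hypothetical, and the handling of autoneg 'on' for an unsupported ASIC is an assumption; the real change is an extra else branch in the C++ port configuration path.

```python
def plan_autoneg(requested_on: bool, asic_supports_autoneg: bool) -> str:
    """Decide what to do with an autoneg request for one port."""
    if asic_supports_autoneg:
        return "apply"
    if not requested_on:
        # Turning off a feature the ASIC lacks changes nothing, so skip it
        # and let speed, admin_state and the other fields be applied.
        return "skip"
    return "error"  # assumed: enabling an unsupported feature stays an error
```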
)

What I did
Fix VLAN-based FDB flush in fdborch.
For VLAN flush requests, data contains the VLAN ID only (for example, 1000), but getPort() expects the VLAN interface name format (Vlan1000).
This change prepends Vlan to the VLAN ID before calling getPort() in the VLAN FDB flush path.

Why I did it
Before this change, running fdbclear -v <vid> could report success on CLI, but the dynamic MAC entry was not removed.
Meanwhile, orchagent logged an error like:

doTask: Get Port from vlan(1000) failed!
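
The lookup mismatch can be shown with a toy port table, in Python for brevity. The `ports` dictionary, its OID value and both function names are made up for this sketch; the actual fix is in the C++ fdborch VLAN flush path.

```python
from typing import Dict, Optional

# Hypothetical port table keyed by interface name, as getPort() expects.
ports: Dict[str, str] = {"Vlan1000": "oid:0x26000000000001"}


def get_port(alias: str) -> Optional[str]:
    return ports.get(alias)


def lookup_vlan_for_flush(data: str) -> Optional[str]:
    # The flush request carries only the VLAN ID, so build the interface name.
    return get_port("Vlan" + data)
```

Looking up the bare ID "1000" finds nothing, which corresponds to the error log above, while the prefixed name resolves.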
What I did
Added support for ACL match on Inner Source IPv6 address

Why I did it
Currently there is only support for ACL match on Inner Source IP. This extends the support to match on Inner Source IPv6 address.
…-net#4341)

What I did

Fixed a memory leak and undefined behavior in Srv6Orch::createUpdateSidList() by replacing raw new[]/delete with std::unique_ptr.

Why I did it

segment_list.list is allocated with new sai_ip6_t[count] but on two SAI error paths (create_srv6_sidlist and set_srv6_sidlist_attribute failures) the function returns false without freeing the allocation. Additionally, the success path uses plain delete instead of delete[] for an array allocation, which is undefined behavior per the C++ standard.
Summary
Add loop guard field support to STP port config message handling.

Changes
Add loop_guard field in STP_PORT_CONFIG_MSG.
Parse and propagate loop_guard from STP_PORT table updates to stpd message.
What I did
Added log message to display next-hop weight information

Why I did it
It is helpful to know whether a next-hop has weights associated with it and what their values are.
How I verified it
Enabled the logs and checked that the weights are printed.
sonic-net#4543)

What I did
Issue:
sonic-net#4546

SAI reports errors that are causing tests (e.g. qos/test_sched_strict_priority.py) to fail. Some of the attribute queries (related to phy and serdes) are not applicable to recycle and inband ports. Even though these queries are harmless, they cause test noise. This PR avoids queries on recycle/inband ports for attributes that are not applicable:

SAI_PORT_ATTR_PORT_SERDES_ID
SAI_PORT_ATTR_RX_SIGNAL_DETECT
SAI_PORT_ATTR_RX_SNR

Changes made in the following PRs recently introduced queries for these attributes:
sonic-net#4214
sonic-net#4223
…-net#4565)

What I did
Remove redundant per-direction port stats SAI attributes

Why I did it
SAI_TAM_TEL_TYPE_ATTR_SWITCH_ENABLE_PORT_STATS already covers both ingress and egress, making the explicit
SAI_TAM_TEL_TYPE_ATTR_SWITCH_ENABLE_PORT_STATS_EGRESS and SAI_TAM_TEL_TYPE_ATTR_SWITCH_ENABLE_PORT_STATS_INGRESS attributes redundant. In SINGLE mode only one attribute can be set to true at a time, so setting all three caused a conflict.
Add COUNTERS_ENI_OID_MAP table to map ENI SAI OIDs to names, enabling
counter collection by OID lookup. Also add DPU_COUNTERS_DB support so
ENI name and OID mappings are mirrored to the DPU counter database for
DPU-side counter polling.

Changes:
- Add m_eni_oid_table for OID-to-name mapping in COUNTERS_DB
- Add m_dpu_counter_db, m_dpu_eni_name_table, m_dpu_eni_oid_table for
  DPU_COUNTERS_DB mirroring
- Update addEniMapEntry/removeEniMapEntry to maintain all four tables
- Add DPU_COUNTERS_DB (id 18) to mock test database config
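
A sketch of keeping the four tables in step, in Python for brevity. Plain dictionaries stand in for the Redis-backed tables, and `EniMapSketch` with its method names is hypothetical; the real members are the C++ tables listed above.

```python
from typing import Dict, Optional


class EniMapSketch:
    """Keeps name->OID and OID->name maps for two databases consistent."""

    def __init__(self) -> None:
        self.eni_name_table: Dict[str, str] = {}      # COUNTERS_DB: name -> OID
        self.eni_oid_table: Dict[str, str] = {}       # COUNTERS_DB: OID -> name
        self.dpu_eni_name_table: Dict[str, str] = {}  # DPU_COUNTERS_DB: name -> OID
        self.dpu_eni_oid_table: Dict[str, str] = {}   # DPU_COUNTERS_DB: OID -> name

    def add_eni_map_entry(self, name: str, oid: str) -> None:
        for table in (self.eni_name_table, self.dpu_eni_name_table):
            table[name] = oid
        for table in (self.eni_oid_table, self.dpu_eni_oid_table):
            table[oid] = name

    def remove_eni_map_entry(self, name: str) -> None:
        # Look up the OID first so the reverse maps can be cleaned as well.
        oid: Optional[str] = self.eni_name_table.pop(name, None)
        self.dpu_eni_name_table.pop(name, None)
        if oid is not None:
            self.eni_oid_table.pop(oid, None)
            self.dpu_eni_oid_table.pop(oid, None)
```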

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
…nic-net#4589)

What I did
The sonic-swss submodule integration (version bump) PR
sonic-net/sonic-buildimage#27442 is blocked because some portsorch unit tests are failing. The failures are not directly related to that PR (the tests fail without this change when run in a certain order), but they block integration of sonic-net#4543.
The tests only fail when run in a certain order.
This is caused by a lack of port cleanup at the end of some tests.
Added the cleanup code.

Why I did it
Blocking submodule bump up of sonic-swss
…nic-net#4394)

* [fpmsyncd]: add libnexthopgroup in Makefile.am to support rib/fib feature
What I did

Add libnexthopgroup in fpmsyncd Makefile.am to support RIB/FIB

Why I did it

In the RIB/FIB design, we need the libnexthopgroup compiled from sonic-fib to communicate with Zebra.
For more details, please refer to the following documents:

RIB/FIB HLD: sonic-net/SONiC#2060
nhg mgr LLD: sonic-net/SONiC#2270
sonic-fib PR: sonic-net/SONiC#2270
RIB/FIB nhg mgr code: sonic-net#4395
…icitly supported value (sonic-net#4156)

* use query attribute capability for stats mode and set explicitly supported value

Depend on PR: sonic-net/sonic-sairedis#1756
It should fix PR failure

What I did
use query attribute capability for stats mode and set an explicitly supported value.
If stats mode is not set or not supported then log WARN messages and rely on internal SAI logic during session creation

Why I did it
We need to check whether certain SAI capabilities are supported before using them.
Currently there are a few possible stats count modes:

SAI_STATS_COUNT_MODE_PACKET_AND_BYTE
SAI_STATS_COUNT_MODE_PACKET
SAI_STATS_COUNT_MODE_BYTE
SAI_STATS_COUNT_MODE_NONE
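
One way to pick an explicitly supported value is sketched below, in Python for brevity. The preference order, the `choose_stats_mode` name and the warning text are assumptions; the PR text only states that a supported value is set explicitly and that a WARN is logged otherwise.

```python
import logging
from typing import Iterable, Optional

# Assumed preference order for this sketch, most informative first.
PREFERRED_MODES = (
    "SAI_STATS_COUNT_MODE_PACKET_AND_BYTE",
    "SAI_STATS_COUNT_MODE_PACKET",
    "SAI_STATS_COUNT_MODE_BYTE",
)


def choose_stats_mode(supported: Iterable[str]) -> Optional[str]:
    """Return the first preferred mode the platform reports as supported."""
    supported_set = set(supported)
    for mode in PREFERRED_MODES:
        if mode in supported_set:
            return mode
    # Nothing usable: leave the attribute unset and rely on SAI's own logic.
    logging.warning("No supported stats count mode; relying on SAI default")
    return None
```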
…et#4575)

Fix for the issue raised on sonic-swss repo sonic-net#4569

What I did
Fixed an issue where explicit getOID error logs were being thrown unnecessarily during the p4orch state verification process.

Why I did it
To resolve issue sonic-net#4569. False-positive or misleading ERR logs from getOID pollute the system logs (syslog) during routine state verification, making genuine troubleshooting more difficult.

How I verified it
Ran the p4orch unit tests and state verification suites locally to confirm that the unwanted error logs are no longer generated.

Details if related
This fix targets the p4orch state verification flow within orchagent.

Signed-off-by: divyagayathri-hcl <divyagayathri.s@hcl.com>
Co-authored-by: mint570 <runmingwu@google.com>
…c-net#4601)

What I did

Fixed a double-delete of the HFTel ASIC NotificationConsumer in HFTelOrch by aligning ownership with the orchagent Executor model.

Replaced std::shared_ptr<swss::NotificationConsumer> with a non-owning swss::NotificationConsumer* in hftelorch.h.
Allocate the consumer with new and pass it directly to Notifier in hftelorch.cpp (same pattern as PortsOrch).
On failed SAI_SWITCH_ATTR_TAM_TEL_TYPE_CONFIG_CHANGE_NOTIFY registration, delete the Notifier and clear the pointer before throwing.
Updated doTask(NotificationConsumer&) to compare against the raw pointer instead of shared_ptr::get().
Why I did it

HFTelOrch was introduced in a18824e (sonic-net#3759) with dual ownership: a shared_ptr member and a Notifier/Executor that also deletes m_selectable in ~Executor(). On graceful shutdown, ~HFTelOrch destroyed the consumer first and ~Orch destroyed it again, causing SIGSEGV in redisFree() during ~NotificationConsumer.

This was observed on SONiC as repeated orchagent cores during stop/restart. The failure is most visible once full teardown runs (e.g. after graceful SIGTERM/SIGINT shutdown paths such as sonic-net#4400), but the ownership bug has existed since HFTel was added.
What I did
pass object_ids through SwssActor -> IPFixTemplatesMessage -> IpfixActor
store object_ids alongside object_names per session key
resolve IPFIX labels via label -> object_id -> object_name
keep the legacy positional fallback only when object_ids are unavailable
update unit/integration tests for the new message shape
Why
object_ids and object_names are positional pairs, while the IPFIX label matches the object ID rather than the 1-based index into object_names. The old logic incorrectly assumed:

label -> object_names[label - 1]

The correct logic is:

label -> find matching object_id -> use the object_name at the same position
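
The corrected lookup, including the legacy fallback, can be sketched as follows. This is Python for brevity, not the countersyncd implementation; `resolve_object_name` and the sample values are hypothetical.

```python
from typing import Optional, Sequence


def resolve_object_name(label: int, object_names: Sequence[str],
                        object_ids: Optional[Sequence[int]] = None) -> Optional[str]:
    """Map an IPFIX label to an object name."""
    if object_ids:
        # Correct path: label -> matching object_id -> name at same position.
        for object_id, name in zip(object_ids, object_names):
            if object_id == label:
                return name
        return None
    # Legacy fallback, used only when object_ids are unavailable:
    # treat the label as a 1-based index into object_names.
    index = label - 1
    if 0 <= index < len(object_names):
        return object_names[index]
    return None
```

With names `["Ethernet0", "Ethernet4"]` and IDs `[3, 7]`, label 7 resolves to "Ethernet4", whereas the old positional rule would have indexed past the end of the list.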
…tes (sonic-net#4591)

Across recent sonic-swss VS test pipelines (builds 1118214, 1118474, 1118703,
1118888), the same handful of tests have been failing on first attempt with
identical signatures that are not caused by any code change:

* test_portchannel.py::TestPortchannel::test_Portchannel_lacpkey
  KeyError: 'ports' from json.loads(teamdctl ... state dump). teamd is not yet
  fully up after the static 1s sleep. Replace with a 30s polling loop that
  retries the teamdctl invocation until valid JSON containing the expected
  port appears.

* test_portchannel.py::TestPortchannel::test_Portchannel_tpid
  AssertionError: assert '0' == '37376'. The TPID write travels
  CONFIG_DB -> APPL_DB -> orchagent -> ASIC_DB and can exceed the static 1s
  sleep on loaded CI agents. Poll ASIC_STATE:SAI_OBJECT_TYPE_LAG until
  SAI_LAG_ATTR_TPID reflects the configured value.

* test_portchannel.py::TestPortchannel::test_portchannel_member_netdev_oper_status
  AssertionError: assert 'down' == 'up'. portmgr's netlink listener can take
  longer than 1s to propagate the carrier change. Poll STATE_DB PORT_TABLE
  for netdev_oper_status == 'up' instead.

* test_inband_intf_mgmt_vrf.py::TestInbandInterface::test_InbandIntf[PortChannel5]
  Fails when INTF_TABLE entry appears just after the default 5s swsscommon
  poll window (observed ~5.1s) because PortChannel intf creation includes an
  extra teamd -> kernel master rebind. Poll INTF_TABLE for up to 30s.

* test_inband_intf_mgmt_vrf.py::TestInbandInterface::test_InbandIntf[Loopback1]
  Cascade failure: when the PortChannel5 parametrization aborts, the 'mgmt'
  VRF is never cleaned up, so the next parametrization (Loopback1) inherits
  dirty state. Wrap the test body in try/finally so del_inband_mgmt_vrf and
  del_mgmt_vrf always run.

These changes are test-only and do not affect production code paths.
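
The polling pattern that replaces the static sleeps can be sketched like this. `wait_until` is a hypothetical helper written for this note, not a function from the sonic-swss test framework; the 30s default mirrors the timeout mentioned above.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout_s: float = 30.0,
               interval_s: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll predicate until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A test would then assert on `wait_until(...)` rather than sleeping for a fixed second, and wrap its body in try/finally so that cleanup runs even when the assertion fails.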
Local ARS (Adaptive Routing and Switching) support.

Co-authored-by: Vladimir Kuk <vkuk@marvell.com>
Co-authored-by: Apoorv Sachan <apoorv@arista.com>
Signed-off-by: apannerselva <apannerselva@marvell.com>