local ARS#7
Open
VladimirKuk wants to merge 313 commits into
4466fb2 to f4fe7ae
f4fe7ae to ee3f102
…eate [P4Orch] Add ACL action list during ACL table creation if they are mandatory.
…et#4052)
What I did
Check whether SAI_DASH_APPLIANCE_ATTR_LOCAL_REGION_ID is implemented before attempting to create the DASH appliance object.
Why I did it
Checking the attribute capability up front with sai_query_attribute_capability avoids having to handle SAI_STATUS_NOT_IMPLEMENTED after a failed create.
Changes
- Add a capability check before creating the DASH appliance object
- Create the appliance only if the attribute is supported and implemented
How I verified it
Ran the DASH FNIC test.
Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com> Co-authored-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>
Signed-off-by: mint570 <runmingwu@google.com>
[P4Orch] Use bulk APIs in the next hop manager.
…onic-net#3956) Previously, SRv6 test cases relied on a workaround to configure SIDs by creating them directly in the kernel. The kernel would then notify FRR, which in turn would program the SIDs in SONiC. This was the only available method before the introduction of the static SRv6 CLI in FRR. With the introduction of the static SRv6 CLI, we can now configure SIDs directly through FRR, which is the intended and correct approach. This commit updates the test cases to leverage the static SRv6 CLI. This change aligns the test cases with the recommended and expected workflow for configuring SIDs in SONiC, which is also used by `bgpcfgd`. Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
What I did
Revert to the old behavior for all SAI API call failures in main.cpp and for all failures in initSaiRedis(). This means orchagent exits again on critical failures such as a create-switch failure and other failures during init. In addition, when an unrecoverable SAI error occurs, orchagent currently logs an error once and does not retry. To make the error obvious, log it periodically in the orchagent while loop so it can be detected quickly.
Why I did it
To keep the behavior consistent with legacy behavior for critical failures such as a create-switch failure, and to make detection easier, given that there are no forced orchagent exits in most other cases.
Signed-off-by: mint570 <runmingwu@google.com>
[P4Orch] Fix nexthop manager test.
What I did Add metric to vnet route tunnel request, so vnet route tunnels can be created with a metric value. Related yang model change: sonic-net/sonic-buildimage#25019 Why I did it Enable vnet route tunnels to be created with a metric value that can be used to identify type of route.
…ort. Signed-off-by: mint570 <runmingwu@google.com>
[P4Orch] Add support for SAI functions needed for swss multicast support.
…t#4131) What I did Added a retry mechanism in countersyncd for communication between the OTEL client and the Open Telemetry collector. Why I did it To improve the robustness of countersyncd by handling transient connection failures, ensuring metrics are reliably sent even if the otel collector is temporarily unavailable.
…moved and created when the Port Speed is changed dynamically via GCU (sonic-net#3977)
Remove the old Port Object Id, Queue Id, and PG Id from COUNTER_DB when the port is removed. Add the new Port Object Id, Queue Id, and PG Id to COUNTER_DB when the port is created. The tables COUNTERS_QUEUE_PORT_MAP, COUNTERS_PG_PORT_MAP, COUNTERS_PG_INDEX_MAP, and COUNTERS_PG_NAME_MAP are updated. Fixes the issue reported in sonic-net/sonic-buildimage#24385.
Signed-off-by: saksarav <sakthivadivu.saravanaraj@nokia.com> Co-authored-by: Vineet Mittal <46945843+vmittal-msft@users.noreply.github.com>
…e its unit test. Signed-off-by: mint570 <runmingwu@google.com>
…entries. Signed-off-by: mint570 <runmingwu@google.com>
[P4Orch] Implement functions to add multicast router interface table entries.
What I did
Set the relaxed attribute parsing flag to true for the vnet route table; it is false by default. With the flag unset, an exception is thrown when there is an unexpected field.
sonic-swss/orchagent/request_parser.cpp, lines 160 to 167 in a8d968c:
    if (!relaxed_attr_parsing_)
    {
        throw std::invalid_argument(std::string("Unknown attribute name: ") + fvField(*i));
    }
    else
    {
        continue;
    }
Why I did it
We are adding new fields, i.e. pinned_state, and throwing the exception adds an unnecessary dependency on the order in which PRs are merged and features are enabled.
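The effect of the flag can be illustrated with a small Python sketch. This is not the orchagent code (which is C++); `parse_attrs` and the sample field names other than pinned_state are hypothetical and only mirror the branch quoted above.

```python
def parse_attrs(field_values, known_attrs, relaxed_attr_parsing=False):
    """Keep recognized attributes; unknown ones raise unless parsing is relaxed."""
    parsed = {}
    for name, value in field_values:
        if name not in known_attrs:
            if not relaxed_attr_parsing:
                # Strict mode: same message as the C++ snippet above.
                raise ValueError("Unknown attribute name: " + name)
            # Relaxed mode: silently skip fields this build does not know yet.
            continue
        parsed[name] = value
    return parsed


# "endpoint" is an illustrative known field; "pinned_state" plays the new field.
fvs = [("endpoint", "10.0.0.1"), ("pinned_state", "true")]
print(parse_attrs(fvs, {"endpoint"}, relaxed_attr_parsing=True))
```

With relaxed parsing, a producer can start writing the new field before the consumer understands it, which is the merge-order independence the commit is after.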
…nic-net#4120)
What I did
Add the missing drop monitor attribute definitions to debug_counter.h and include them in the supported attributes set in debug_counter.cpp.
Why I did it
The drop monitor feature introduced the new CONFIG_DB fields drop_monitor_status, drop_count_threshold, incident_count_threshold and window for drop monitoring configuration. However, these attributes were not added to the supported_debug_counter_attributes set, causing error logs during debug counter installation: "Unknown debug counter attribute 'drop_monitor_status'". The problem can be reproduced by running the test_configurable_drop_counters.py test case, which fails in teardown with this error.
How I verified it
Ran test_configurable_drop_counters.py, which previously failed in teardown with the error "Unknown debug counter attribute 'drop_monitor_status'". After applying the fix, the test passes without generating any error logs.
…es to countersyncd for analyzing online issues and performance.
* [countersyncd]: Add communication statistics recording and utilities
This PR adds communication statistics recording and debugging utilities to countersyncd for analyzing online issues and performance. The changes introduce a new utilities module with functions for formatting hex dumps and tracking inter-actor channel statistics.
Changes:
- Added utilities module with hex formatting and channel statistics tracking
- Instrumented all actor message receive points to record queue lengths
- Added debug logging for raw netlink message payloads
…st router interface table entries. Signed-off-by: mint570 <runmingwu@google.com> Signed-off-by: SRAVANI KANASANI <kanasanis@google.com>
[P4Orch] Implement functions to update multicast router interface table entries.
…ace table entries. Signed-off-by: mint570 <runmingwu@google.com>
[P4Orch] Implement functions to process/drain multicast router interface table entries.
What I did A new slave naming convention is being used sonic-net/sonic-buildimage#24922. This PR is to update the pipeline accordingly.
1. Enable polling of the FEC histogram for gbsyncd in orchagent.
2. Refactor port_rates.lua so that compute_rate calculates bps and packet counters and compute_ber calculates pre-FEC and post-FEC BER. Gearboxes do not support the port_rate calculation, so it is skipped for GB_COUNTERS_DB.
3. Enable port_rates.lua on GB_COUNTERS_DB.
Signed-off-by: arpit-nexthop <arpit@nexthop.ai>
What I did Remove explicit throwing of exceptions Use generic orchagent SAI failure handling logic Why I did it To avoid unnecessary orchagent crashes when programming vxlan tunnels and nexthops on a vnet in case of SAI errors. How I verified it By running mock tests and making sure that there are no crashes when there are vxlan tunnel, mapping or tunnel termination create/remove failure.
[P4Orch] Add priority field to tunnel decap entry and Modify tunnel decap action as vrf_id is no longer used.
…c-net#4274)" (sonic-net#4547) This reverts commit fa273fc. Reverts sonic-net#4274 to unblock the sonic-swss submodule ref update for sonic-buildimage sonic-net/sonic-buildimage#26918. PR checker history: build history Pipelines - Runs for Azure.sonic-buildimage. The t0 test always failed on sanity check.
14a8f94 to 2c6ce3d
…bject (sonic-net#4535) What I did In clearPortPhySerdesAttrCounterMap(), changed the log level from SWSS_LOG_ERROR to SWSS_LOG_WARN when a port has no serdes object. Why I did it On platforms that do not support serdes, port_serdes_id is SAI_NULL_OBJECT_ID, which is expected behavior. However, this was logged at ERROR level, causing false alarms in loganalyzer. This is the same class of issue fixed in sonic-net#4467 (getPortSerdesIdFromPortId()), but in a different code path, clearPortPhySerdesAttrCounterMap() iterates all ports and logs ERROR for any port without a serdes object, even on platforms where serdes is simply not implemented. How I verified it On a platform that does not support serdes, confirmed that the log no longer appears at ERR level in syslog.
…nic-net#4541)" (sonic-net#4551) This reverts commit 55075d3. Reverts sonic-net#4541 This is redundant as this PR addresses the same issues sonic-net#4254 Hence reverting.
sonic-net#4381) This reverts commit 5de5922. Hi, @prsunny The PR depends on sonic-net/sonic-buildimage#23628. We tested them all together and noticed that too many changes are required in sonic-mgmt to make the regression stable. Without changes to sonic-buildimage itself, the sonic-swss part alone breaks counter polling in the warm-reboot case, since there is no trigger to re-activate counter reads.
Handle ACL resource constraints with event based retry. Fixes sonic-net#4406 What I did Use an event-based retry using RetryCache to handle ACL resource availability constraints. Why I did it Blind retries are not helpful when resources (ACL in this case) are not available. This can cause new events including route updates to be throttled even when retries are not guaranteed to help. How I verified it By using mock tests to test the behavior. Also verified by the originator of issue 4406 that the problem reported is resolved. Details if related Uses the RetryCache mechanism that has already been used for some orchs like routeorch and srv6orch.
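The difference between blind and event-based retry can be sketched as follows. This is a conceptual Python illustration only: `EventRetryQueue` and the constraint key are hypothetical names and do not reproduce the RetryCache API in orchagent.

```python
from collections import defaultdict, deque


class EventRetryQueue:
    """Hypothetical stand-in for the idea behind RetryCache, not its real API."""

    def __init__(self):
        self.pending = deque()           # tasks to attempt on the next pass
        self.parked = defaultdict(list)  # constraint -> tasks waiting on it

    def submit(self, task):
        self.pending.append(task)

    def run_once(self, try_task):
        """Attempt each pending task once; park those that report a constraint."""
        for _ in range(len(self.pending)):
            task = self.pending.popleft()
            constraint = try_task(task)  # None means the task succeeded
            if constraint is not None:
                self.parked[constraint].append(task)

    def notify(self, constraint):
        """A resource was freed: move tasks waiting on it back to pending."""
        self.pending.extend(self.parked.pop(constraint, []))
```

A parked task costs nothing on later passes, so unrelated events such as route updates are not throttled; the task is attempted again only after `notify()` reports that the resource it was waiting on may be available.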
…net#4534)
Why I did it
Commit fe19634 ("Connect to teamd before adding the lag to STATE_DB") added teamdctl_connect() to the TeamPortSync constructor to verify that teamd is live before writing to STATE_DB. However, this introduced a crash during teamd container restart.
On restart, teamd is launched with -r (the recreate flag), which deletes and recreates each PortChannel kernel device with a new ifindex. Before recreation, the kernel emits RTM_NEWLINK for the pre-existing devices (initial dump). teamsyncd calls addLag() → TeamPortSync constructor with the old ifindex. In the constructor, team_init(old_ifindex) succeeds (the device still exists at this point), but teamdctl_connect() fails with ECONNREFUSED (error=111) because teamd's socket is not yet bound. The catch block logs the failure and calls sleep(1) before retrying.
During that 1-second sleep, teamd -r deletes the old kernel device and creates a new one with a new ifindex. On retry, team_init(old_ifindex) fails with EADDRNOTAVAIL (error=99) because the ifindex no longer exists. After max_retries=3 is exhausted, the constructor throws std::system_error. The exception propagates uncaught through addLag() → onMsg() → libnl's C callback stack → NetLink::readData() (no catch) → Select::poll_descriptors() (catches only runtime_error, logs "readData error", returns ERROR). The main loop ignores the SELECT error, continues in a corrupted state, and eventually crashes with SIGSEGV in libnl-route-3 on a NULL pointer dereference.
Note: This crash did not exist before fe19634 because, without teamdctl_connect(), the constructor succeeded immediately on attempt=0 (the device still existed) and never triggered the retry sleep that opens the race window.
) What I did
Add a default IPv6 link-local route to fix an IPv6 NDP issue inside a VRF.
Why I did it
When a VRF is created, all the assigned IP addresses are removed and a new link-local address is regenerated in the Linux kernel, but SWSS does not add a route for the link-local prefix fe80::/10. Linux sends NS packets with link-local addresses, and the peer's NA reply also uses a link-local address, so the NA packet is dropped and the NDP entry cannot be updated.
How I verified it
Built a test image with the fix and verified it on a SONiC router. The system can process a replied NA packet with a link-local destination IP address in that VRF.
* Southbound ZMQ runtime enablement
What I did
In orchagent/saihelper.cpp, added resolveCommunicationModeFromContextConfig(std::istream&, sai_redis_communication_mode_t) and call it from initSaiApi(). initSaiApi() opens /usr/share/sonic/hwsku/context_config.json once and streams it straight into the helper. If zmq_enable=false for the default context (guid=0), the helper returns REDIS_SYNC; otherwise it returns the input mode unchanged. The result is stored in gRedisCommunicationMode before initSaiRedis() sends the comm-mode attribute to sairedis.
Why I did it
When the southbound-ZMQ runtime knob is in play, context_config.json is the source of truth for whether ZMQ is actually used. gRedisCommunicationMode is set unconditionally from the -z cmdline flag in main.cpp and is then read by five notification callbacks in notifications.cpp (on_port_state_change, on_ha_set_event, on_ha_scope_event, etc.) to decide whether to forward events to ASIC_DB via NotificationProducer. If the JSON disables ZMQ but -z zmq_sync was passed, sairedis silently falls back to RedisChannel (see sonic-sairedis#1835). The orchagent global was previously left at ZMQ_SYNC in that case, so notification handlers would forward events as if ZMQ were active, which is the wrong path. This change keeps the global aligned with the actual transport, symmetric with the fallback in syncd/Syncd.cpp.
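The decision the helper makes can be sketched in Python (the real helper is C++). Two things here are assumptions, not taken from the commit: that the file keeps its contexts in a top-level "CONTEXTS" list, and that unreadable JSON leaves the requested mode unchanged. The mode constants are illustrative strings.

```python
import io
import json

REDIS_SYNC = "redis_sync"  # illustrative stand-ins for the sairedis enum values
ZMQ_SYNC = "zmq_sync"


def resolve_communication_mode(stream, requested_mode):
    """Return REDIS_SYNC if the default context (guid=0) sets zmq_enable=false."""
    try:
        config = json.load(stream)
    except ValueError:
        # Assumption: malformed JSON leaves the requested mode untouched.
        return requested_mode
    if not isinstance(config, dict):
        return requested_mode
    for ctx in config.get("CONTEXTS", []):  # assumed top-level key
        if isinstance(ctx, dict) and ctx.get("guid") == 0 and ctx.get("zmq_enable") is False:
            return REDIS_SYNC
    return requested_mode


sample = io.StringIO('{"CONTEXTS": [{"guid": 0, "zmq_enable": false}]}')
print(resolve_communication_mode(sample, ZMQ_SYNC))
```

Taking a stream rather than a path, as the commit does with std::istream&, keeps the helper easy to exercise in unit tests without touching the filesystem.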
…onic-net#4558) What I did Clean up the corresponding entries in DPU_STATE_DB (DASH_HA_SET_STATE_TABLE and DASH_HA_SCOPE_STATE_TABLE) when an HA_SET or HA_SCOPE entry is deleted from DPU_APPL_DB. Why I did it Previously, on DEL of an HA_SET or HA_SCOPE entry, the SAI object and internal map entries were removed but the corresponding DPU_STATE_DB rows were left stale, leading to inconsistent state DB content after deletion.
…-net#4500)
Why I did it
Implements the fix for sonic-net/sonic-mgmt#24066. When autoneg is configured as 'off' on a platform whose ASIC does not support autoneg, the previous code logged an error and dropped the entire port config task (speed, admin_state, etc.).
How I did it
Disabling autoneg on such a platform is a no-op, so there is no reason to error out. Skipping the autoneg setting allows the other configs to be applied. The change introduces another else case to handle this scenario gracefully.
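The added branch can be pictured with a short Python sketch. `autoneg_action` and its return values are hypothetical, and rejecting autoneg 'on' for an unsupported ASIC is an assumption carried over from the earlier behavior, not something the commit states.

```python
def autoneg_action(requested_autoneg, asic_supports_autoneg):
    """Decide what to do with an autoneg setting in a port config task."""
    if asic_supports_autoneg:
        return "apply"
    if not requested_autoneg:
        # Turning off something the ASIC never had is a no-op: skip it and
        # let the rest of the task (speed, admin_state, ...) proceed.
        return "skip"
    # Assumption: enabling autoneg on an unsupported ASIC is still an error.
    return "reject"


print(autoneg_action(requested_autoneg=False, asic_supports_autoneg=False))
```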
) What I did Fix VLAN-based FDB flush in fdborch. For VLAN flush requests, data contains the VLAN ID only (for example, 1000), but getPort() expects the VLAN interface name format (Vlan1000). This change prepends Vlan to the VLAN ID before calling getPort() in the VLAN FDB flush path. Why I did it Before this change, running fdbclear -v <vid> could report success on CLI, but the dynamic MAC entry was not removed. Meanwhile, orchagent logged an error like: doTask: Get Port from vlan(1000) failed!
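A minimal Python sketch of the lookup mismatch follows. The port map and helper names are hypothetical; only the "Vlan" prefix and the example ID 1000 come from the description above.

```python
VLAN_PREFIX = "Vlan"


def vlan_alias_from_flush_data(data):
    """Turn the VLAN ID carried by a flush request ("1000") into "Vlan1000"."""
    return VLAN_PREFIX + data.strip()


def get_port(ports, alias):
    """Hypothetical stand-in for getPort(): look a port up by interface name."""
    return ports.get(alias)


ports = {"Vlan1000": {"vlan_id": 1000}}
print(get_port(ports, "1000"))                              # bare ID: no match
print(get_port(ports, vlan_alias_from_flush_data("1000")))  # prefixed: found
```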
What I did Added support for ACL match on Inner Source IPv6 address Why I did it Currently there is only support for ACL match on Inner Source IP. This extends the support to match on Inner Source IPv6 address.
…-net#4341) What I did Fixed a memory leak and undefined behavior in Srv6Orch::createUpdateSidList() by replacing raw new[]/delete with std::unique_ptr. Why I did it segment_list.list is allocated with new sai_ip6_t[count] but on two SAI error paths (create_srv6_sidlist and set_srv6_sidlist_attribute failures) the function returns false without freeing the allocation. Additionally, the success path uses plain delete instead of delete[] for an array allocation, which is undefined behavior per the C++ standard.
Summary Add loop guard field support to STP port config message handling. Changes Add loop_guard field in STP_PORT_CONFIG_MSG. Parse and propagate loop_guard from STP_PORT table updates to stpd message.
What I did Added log message to display next-hop weight information Why I did it Helpful to know next-hop has weights associated with it and its values How I verified it Enabled the logs and see if weights are printed
sonic-net#4543)
What I did
Issue: sonic-net#4546
SAI logs error messages that cause tests (e.g. qos/test_sched_strict_priority.py) to fail. Some of the attribute queries (related to phy and serdes) are not applicable to recycle and inband ports. Even though these queries are harmless, they create test noise. This PR avoids querying recycle/inband ports for attributes that are not applicable:
SAI_PORT_ATTR_PORT_SERDES_ID
SAI_PORT_ATTR_RX_SIGNAL_DETECT
SAI_PORT_ATTR_RX_SNR
Queries for these attributes were introduced recently by the following PRs: sonic-net#4214 sonic-net#4223
…-net#4565) What I did Remove redundant per-direction port stats SAI attributes Why I did it SAI_TAM_TEL_TYPE_ATTR_SWITCH_ENABLE_PORT_STATS already covers both ingress and egress, making the explicit SAI_TAM_TEL_TYPE_ATTR_SWITCH_ENABLE_PORT_STATS_EGRESS and SAI_TAM_TEL_TYPE_ATTR_SWITCH_ENABLE_PORT_STATS_INGRESS attributes redundant. In SINGLE mode only one attribute can be set to true at a time, so setting all three caused a conflict.
Add COUNTERS_ENI_OID_MAP table to map ENI SAI OIDs to names, enabling counter collection by OID lookup. Also add DPU_COUNTERS_DB support so ENI name and OID mappings are mirrored to the DPU counter database for DPU-side counter polling. Changes: - Add m_eni_oid_table for OID-to-name mapping in COUNTERS_DB - Add m_dpu_counter_db, m_dpu_eni_name_table, m_dpu_eni_oid_table for DPU_COUNTERS_DB mirroring - Update addEniMapEntry/removeEniMapEntry to maintain all four tables - Add DPU_COUNTERS_DB (id 18) to mock test database config Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
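The bookkeeping described above can be sketched with plain dictionaries standing in for the Redis tables. This Python class is illustrative only: `EniMaps` and its attribute names are hypothetical, and the existing name-to-OID table in COUNTERS_DB is assumed to be the fourth table the commit refers to.

```python
class EniMaps:
    """Keeps name->OID and OID->name maps in step across two databases."""

    def __init__(self):
        self.eni_name_table = {}      # COUNTERS_DB: ENI name -> OID (assumed existing)
        self.eni_oid_table = {}       # COUNTERS_DB: OID -> ENI name
        self.dpu_eni_name_table = {}  # DPU_COUNTERS_DB mirror: name -> OID
        self.dpu_eni_oid_table = {}   # DPU_COUNTERS_DB mirror: OID -> name

    def add_eni_map_entry(self, name, oid):
        for name_table in (self.eni_name_table, self.dpu_eni_name_table):
            name_table[name] = oid
        for oid_table in (self.eni_oid_table, self.dpu_eni_oid_table):
            oid_table[oid] = name

    def remove_eni_map_entry(self, name):
        oid = self.eni_name_table.pop(name, None)
        self.dpu_eni_name_table.pop(name, None)
        if oid is not None:
            self.eni_oid_table.pop(oid, None)
            self.dpu_eni_oid_table.pop(oid, None)
```

Updating all four maps in one place is what prevents a removed ENI from lingering in only one direction of the mapping or in only one database.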
…nic-net#4589)
What I did
The sonic-swss submodule integration (version bump) PR sonic-net/sonic-buildimage#27442 is blocked because some portsorch unit tests are failing. The failures are not directly related to that PR (the tests fail without this change when run in a certain order), but they block integration of sonic-net#4543. The tests fail only when run in a certain order, because some tests do not clean up their ports at the end. Added the cleanup code.
Why I did it
The failures block the submodule bump of sonic-swss.
…nic-net#4394)
* [fpmsyncd]: add libnexthopgroup in Makefile.am to support rib/fib feature
What I did
Add libnexthopgroup to the fpmsyncd Makefile.am to support RIB/FIB.
Why I did it
In the RIB/FIB design, we need libnexthopgroup, compiled from sonic-fib, to communicate with Zebra. For more details, refer to the following documents:
RIB/FIB HLD: sonic-net/SONiC#2060
nhg mgr LLD: sonic-net/SONiC#2270
sonic-fib PR: sonic-net/SONiC#2270
RIB/FIB nhg mgr code: sonic-net#4395
…icitly supported value (sonic-net#4156)
* use query attribute capability for stats mode and set explicitly supported value
Depends on PR sonic-net/sonic-sairedis#1756, which should fix the PR failure.
What I did
Use the attribute capability query for the stats mode and set an explicitly supported value. If the stats mode is not set or not supported, log WARN messages and rely on internal SAI logic during session creation.
Why I did it
We need to check whether some SAI capabilities are supported before using them. Currently there are a few possible stats count modes:
SAI_STATS_COUNT_MODE_PACKET_AND_BYTE
SAI_STATS_COUNT_MODE_PACKET
SAI_STATS_COUNT_MODE_BYTE
SAI_STATS_COUNT_MODE_NONE
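The selection rule can be sketched in Python. `pick_stats_count_mode` is a hypothetical helper: the set of supported modes is assumed to come from a capability query made elsewhere, and returning None stands for "do not set the attribute and let SAI decide".

```python
import logging

log = logging.getLogger("stats_mode_sketch")


def pick_stats_count_mode(configured_mode, supported_modes):
    """Return the mode to set explicitly, or None to rely on SAI's own logic."""
    if configured_mode is None:
        log.warning("stats count mode not set; relying on SAI default")
        return None
    if configured_mode not in supported_modes:
        log.warning("stats count mode %s not supported; relying on SAI default",
                    configured_mode)
        return None
    return configured_mode


supported = {"SAI_STATS_COUNT_MODE_PACKET", "SAI_STATS_COUNT_MODE_BYTE"}
print(pick_stats_count_mode("SAI_STATS_COUNT_MODE_PACKET", supported))
```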
…et#4575) Fix for the issue raised on sonic-swss repo sonic-net#4569 What I did Fixed an issue where explicit getOID error logs were being thrown unnecessarily during the p4orch state verification process. Why I did it To resolve issue sonic-net#4569. False-positive or misleading ERR logs from getOID pollute the system logs (syslog) during routine state verification, making genuine troubleshooting more difficult. How I verified it Ran the p4orch unit tests and state verification suites locally to confirm that the unwanted error logs are no longer generated. Details if related This fix targets the p4orch state verification flow within orchagent. Signed-off-by: divyagayathri-hcl <divyagayathri.s@hcl.com> Co-authored-by: mint570 <runmingwu@google.com>
…c-net#4601) What I did Fixed a double-delete of the HFTel ASIC NotificationConsumer in HFTelOrch by aligning ownership with the orchagent Executor model. Replaced std::shared_ptr<swss::NotificationConsumer> with a non-owning swss::NotificationConsumer* in hftelorch.h. Allocate the consumer with new and pass it directly to Notifier in hftelorch.cpp (same pattern as PortsOrch). On failed SAI_SWITCH_ATTR_TAM_TEL_TYPE_CONFIG_CHANGE_NOTIFY registration, delete the Notifier and clear the pointer before throwing. Updated doTask(NotificationConsumer&) to compare against the raw pointer instead of shared_ptr::get(). Why I did it HFTelOrch was introduced in a18824e (sonic-net#3759) with dual ownership: a shared_ptr member and a Notifier/Executor that also deletes m_selectable in ~Executor(). On graceful shutdown, ~HFTelOrch destroyed the consumer first and ~Orch destroyed it again, causing SIGSEGV in redisFree() during ~NotificationConsumer. This was observed on SONiC as repeated orchagent cores during stop/restart. The failure is most visible once full teardown runs (e.g. after graceful SIGTERM/SIGINT shutdown paths such as sonic-net#4400), but the ownership bug has existed since HFTel was added.
What I did
- pass object_ids through SwssActor -> IPFixTemplatesMessage -> IpfixActor
- store object_ids alongside object_names per session key
- resolve IPFIX labels via label -> object_id -> object_name
- keep the legacy positional fallback only when object_ids are unavailable
- update unit/integration tests for the new message shape
Why
object_ids and object_names are positional pairs, while the IPFIX label matches the object ID rather than the 1-based index into object_names. The old logic incorrectly assumed:
label -> object_names[label - 1]
The correct logic is:
label -> find matching object_id -> use the object_name at the same position
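The two lookups can be compared in a short Python sketch (countersyncd itself is not written in Python). `resolve_object_name` is a hypothetical helper, and the sample names and IDs are made up for illustration.

```python
def resolve_object_name(label, object_names, object_ids=None):
    """Map an IPFIX label to an object name.

    Preferred path: the label equals an object ID, and the name sits at the
    same position in object_names. Legacy fallback (only when object_ids are
    unavailable): treat the label as a 1-based index into object_names.
    """
    if object_ids:
        for oid, name in zip(object_ids, object_names):
            if oid == label:
                return name
        return None
    index = label - 1
    if 0 <= index < len(object_names):
        return object_names[index]
    return None


names = ["Ethernet0", "Ethernet4"]
ids = [7, 3]
print(resolve_object_name(3, names, ids))  # matches the ID at position 1
print(resolve_object_name(2, names))       # legacy positional fallback
```

With IDs 7 and 3, the old positional rule would look up label 3 at index 2 and find nothing, while the ID-based rule finds the second name.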
…tes (sonic-net#4591)
Across recent sonic-swss VS test pipelines (builds 1118214, 1118474, 1118703, 1118888), the same handful of tests have been failing on the first attempt with identical signatures that are not caused by any code change:
* test_portchannel.py::TestPortchannel::test_Portchannel_lacpkey
  KeyError: 'ports' from json.loads(teamdctl ... state dump). teamd is not yet fully up after the static 1s sleep. Replace the sleep with a 30s polling loop that retries the teamdctl invocation until valid JSON containing the expected port appears.
* test_portchannel.py::TestPortchannel::test_Portchannel_tpid
  AssertionError: assert '0' == '37376'. The TPID write travels CONFIG_DB -> APPL_DB -> orchagent -> ASIC_DB and can exceed the static 1s sleep on loaded CI agents. Poll ASIC_STATE:SAI_OBJECT_TYPE_LAG until SAI_LAG_ATTR_TPID reflects the configured value.
* test_portchannel.py::TestPortchannel::test_portchannel_member_netdev_oper_status
  AssertionError: assert 'down' == 'up'. portmgr's netlink listener can take longer than 1s to propagate the carrier change. Poll STATE_DB PORT_TABLE for netdev_oper_status == 'up' instead.
* test_inband_intf_mgmt_vrf.py::TestInbandInterface::test_InbandIntf[PortChannel5]
  Fails when the INTF_TABLE entry appears just after the default 5s swsscommon poll window (observed ~5.1s), because PortChannel intf creation includes an extra teamd -> kernel master rebind. Poll INTF_TABLE for up to 30s.
* test_inband_intf_mgmt_vrf.py::TestInbandInterface::test_InbandIntf[Loopback1]
  Cascade failure: when the PortChannel5 parametrization aborts, the 'mgmt' VRF is never cleaned up, so the next parametrization (Loopback1) inherits dirty state. Wrap the test body in try/finally so del_inband_mgmt_vrf and del_mgmt_vrf always run.
These changes are test-only and do not affect production code paths.
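The pattern of replacing a fixed sleep with a bounded poll can be sketched as below. `wait_for` is a hypothetical helper, not the function used in the test suite; the exceptions it tolerates and the default interval are assumptions chosen to match the failures listed above.

```python
import time


def wait_for(predicate, timeout=30.0, interval=0.5,
             clock=time.monotonic, sleep=time.sleep):
    """Call predicate until it returns a truthy value or timeout expires.

    KeyError and ValueError are treated as "not ready yet", which covers a
    state dump that is missing a key or is not valid JSON yet.
    """
    deadline = clock() + timeout
    while True:
        try:
            result = predicate()
        except (KeyError, ValueError):
            result = None
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError("condition not met within %.1fs" % timeout)
        sleep(interval)
```

For the cascade failure, the same tests would pair such polling with try/finally, so that cleanup runs even when the poll times out and the next parametrization starts from a clean state.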
Local ARS (Adaptive Routing and Switching) support. Co-authored-by: Vladimir Kuk <vkuk@marvell.com> Co-authored-by: Apoorv Sachan <apoorv@arista.com> Signed-off-by: apannerselva <apannerselva@marvell.com>
2c6ce3d to 833056e
Raised PR for internal comments