[HA] [smartswitch] add ha steady traffic test with PL config#22161
Merged
Collaborator
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
nikamirrr
reviewed
Feb 2, 2026
vivekrnv
requested changes
Feb 3, 2026
zjswhhh
previously approved these changes
Feb 5, 2026
vivekrnv
reviewed
Feb 9, 2026
Collaborator
@aronovic PR conflicts with 202511 branch
vrajeshe
pushed a commit
to vrajeshe/sonic-mgmt
that referenced
this pull request
Mar 23, 2026
…et#22161)
* DASH - [HA] [smartswitch] add ha steady traffic test with PL config (sonic-net#22161)

Summary: This test covers module 1 of the HA test plan. It tests the following:

- Load the HA configuration on the Primary and Standby.
- Activate HA on the Primary and Standby.
- Send private link traffic and verify that it is received as expected.

Signed-off-by: Venkata Gouri Rajesh Etla <vrajeshe@cisco.com>
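The three steps in this summary can be sketched as a small, self-contained simulation. This is only an illustration of the ordering (load, then activate, then verify traffic): `FakeDut`, `send_pl_traffic`, and `run_steady_state_check` are hypothetical names invented here and are not part of sonic-mgmt or of the test added by this PR.

```python
from dataclasses import dataclass


@dataclass
class FakeDut:
    """Hypothetical stand-in for one SmartSwitch DUT (not a sonic-mgmt class)."""
    name: str
    role: str  # "primary" or "standby"
    ha_config_loaded: bool = False
    ha_active: bool = False

    def load_ha_config(self):
        # Step 1: load the HA configuration.
        self.ha_config_loaded = True

    def activate_ha(self):
        # Step 2: activation only makes sense once the config is loaded.
        if not self.ha_config_loaded:
            raise RuntimeError(f"{self.name}: load HA config before activating")
        self.ha_active = True


def send_pl_traffic(primary, standby, packets):
    """Pretend to send private link packets and return how many arrived.

    Assumption for this sketch: traffic is delivered only when HA is
    active on both sides.
    """
    return packets if (primary.ha_active and standby.ha_active) else 0


def run_steady_state_check(packets=10):
    primary = FakeDut("dut-primary", "primary")
    standby = FakeDut("dut-standby", "standby")
    for dut in (primary, standby):
        dut.load_ha_config()
    for dut in (primary, standby):
        dut.activate_ha()
    # Step 3: every packet sent must be received.
    received = send_pl_traffic(primary, standby, packets)
    assert received == packets, f"expected {packets}, got {received}"
    return received
```

Loading on both sides before activating either one mirrors the order given in the summary; whether the real test enforces that ordering the same way is not stated in the source.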
selldinesh
pushed a commit
to selldinesh/sonic-mgmt
that referenced
this pull request
Apr 1, 2026
…et#22161) * DASH - [HA] [smartswitch] add ha steady traffic test with PL config (sonic-net#22161)
Signed-off-by: selldinesh <dinesh.sellappan@keysight.com>
Contributor
The "Approved for 202511 branch" and "Cherry Pick Conflict_202511" labels have been temporarily removed so that the automated cherry-pick procedure can be retried once older prerequisite PRs are successfully cherry-picked to the 202511 branch. The "Approved" label will be re-added at that time.
albertovillarreal-keys
pushed a commit
to albertovillarreal-keys/sonic-mgmt
that referenced
this pull request
Apr 3, 2026
…et#22161) * DASH - [HA] [smartswitch] add ha steady traffic test with PL config (sonic-net#22161)
mssonicbld
pushed a commit
to mssonicbld/sonic-mgmt
that referenced
this pull request
Apr 6, 2026
…et#22161) * DASH - [HA] [smartswitch] add ha steady traffic test with PL config (sonic-net#22161)
Signed-off-by: mssonicbld <sonicbld@microsoft.com>
Collaborator
Cherry-pick PR to 202511: #23643
12 tasks
theasianpianist
pushed a commit
to theasianpianist/sonic-mgmt
that referenced
this pull request
Apr 6, 2026
…et#22161) * DASH - [HA] [smartswitch] add ha steady traffic test with PL config (sonic-net#22161)
vmittal-msft
pushed a commit
that referenced
this pull request
Apr 7, 2026
Cherry-pick of 9 PRs to 202511: HA core infrastructure (conftest.py chain, BFD, GNMI, state_db). All cherry-picks apply cleanly with no conflicts. 3 PRs from the original batch (#22489, #22736, #22920) were already on 202511 and are skipped.

### Included PRs (in cherry-pick order):

1. #22161: [HA] [smartswitch] add ha steady traffic test with PL config
2. #22958: [HA][smartswitch] ha test workaround for the neigh resolve issue
3. #23023: Use GNMI to configure HA
4. #22664: [HA][smartswitch] Extract DASH HA info from state_db directly
5. #23106: revert PR 22920 to the original BFD values
6. #23125: [ha] get remote npu pa ip (loopback0 ip) from topo definition
7. #23100: configure vlan port on both dpus and perform cleanup
8. #22952: Remove generate_vlan_config and address review comments in HA conftest

### Already on 202511 (skipped):

- #22489: [ssw][ha] update ovs rules for HA
- #22736: [DASH-HA] Remove hardcoded loopback and device names
- #22920: Fix bfd probe interval

### Why batched?

These PRs form a dependency chain through `tests/ha/conftest.py` and related HA files. Each commit modifies files that subsequent commits also touch, so they must be applied in order.

---------

Signed-off-by: Jing Zhang <zhangjing@microsoft.com>
Co-authored-by: aronovic <166534786+aronovic@users.noreply.github.com>
Co-authored-by: dypet <dypeters@cisco.com>
Co-authored-by: yue-fred-gao <132678244+yue-fred-gao@users.noreply.github.com>
Co-authored-by: Jing Zhang <zhangjing@microsoft.com>
Co-authored-by: nnelluri-cisco <nnelluri@cisco.com>
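The batching rule described in this commit message (apply in dependency order, skip what the branch already has) can be sketched as below. The PR numbers are the ones listed in the message; `plan_cherry_picks` is a hypothetical helper written for this illustration, and the position of the three skipped PRs inside the original batch is an assumption, since the message does not state it.

```python
def plan_cherry_picks(ordered_prs, already_on_branch):
    """Return the PRs to cherry-pick, preserving the given order.

    PRs that are already on the target branch are skipped, and a PR that
    appears twice in the input is only planned once.
    """
    planned = []
    seen = set()
    for pr in ordered_prs:
        if pr in already_on_branch or pr in seen:
            continue
        seen.add(pr)
        planned.append(pr)
    return planned


# PR numbers taken from the commit message above.
INCLUDED_IN_ORDER = [22161, 22958, 23023, 22664, 23106, 23125, 23100, 22952]
ALREADY_ON_202511 = {22489, 22736, 22920}

# Assumption: the skipped PRs are placed first purely for illustration.
candidate_batch = sorted(ALREADY_ON_202511) + INCLUDED_IN_ORDER

plan = plan_cherry_picks(candidate_batch, ALREADY_ON_202511)
for position, pr in enumerate(plan, start=1):
    print(f"{position}. cherry-pick #{pr}")
```

Keeping the plan as an ordered list rather than a set matters here because each commit touches files that later commits also modify, so reordering could introduce conflicts.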
rraghav-cisco
pushed a commit
to rraghav-cisco/sonic-mgmt
that referenced
this pull request
Apr 20, 2026
…et#22161) * DASH - [HA] [smartswitch] add ha steady traffic test with PL config (sonic-net#22161)
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
prsunny
pushed a commit
that referenced
this pull request
Apr 20, 2026
### Description of PR

Summary: Split into two new test files for HA critical process crash verification:

- tests/ha/test_ha_dpu_process_crash.py: Tests DPU process crashes (syncd, bgp) on active/standby DPU with continuous PrivateLink traffic on active/standby. Verifies HA state convergence and asserts traffic loss stays within 5%.
- tests/ha/test_ha_npu_process_crash.py: Tests NPU process crashes (hamgrd, pmon, bgp) on active/standby NPU with continuous PrivateLink traffic on active/standby. Verifies HA state convergence and asserts traffic loss stays within 5%.

Each file covers 4 variations per process:

- Crash on active side, traffic landing on active side
- Crash on active side, traffic landing on standby side
- Crash on standby side, traffic landing on active side
- Crash on standby side, traffic landing on standby side

Total: 8 DPU tests (2 processes x 4 variations) + 12 NPU tests (3 processes x 4 variations) = 20 test cases.

Fixes # (issue)

### Type of change

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
- [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request

- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505
- [x] 202511

### Approach

SONiC HA (High Availability) on SmartSwitch requires validation that critical process crashes on DPUs are handled gracefully. When a process like syncd crashes on an active or standby DPU, the HA state machine must converge correctly and the dataplane must recover with minimal disruption. This PR adds Module 6 tests to verify HA behavior under critical process crash scenarios using PrivateLink (PL) traffic as the dataplane verification mechanism.

How did you do it?

Added test_ha_critical_process_crash.py with a TestSyncdCrash class that covers 4 syncd crash variations:

- Crash on active DPU, traffic landing on active DPU
- Crash on active DPU, traffic landing on standby DPU
- Crash on standby DPU, traffic landing on active DPU
- Crash on standby DPU, traffic landing on standby DPU

Each test follows a common flow:

1. Verify PL dataplane is functional (pre-crash baseline)
2. Kill syncd on the target DPU via docker exec
3. Verify HA state converges on both the crash DUT and the peer DUT
4. Wait for the allowed traffic disruption window, then re-verify PL traffic
5. Wait for syncd to auto-recover

PL traffic verification sends a VxLAN-encapped DASH packet from the T0/VM-side PTF port and verifies a GRE-encapped packet arrives at the T2/PE-side PTF port, following the pattern established in test_ha_steady_state_pl.py (PR #22161). The PrivateLink DASH config (appliance, routing types, VNET, ENI, meters, routes) is defined in configs/privatelink_config.py and pushed to DPU0 on both DUTs via gNMI using ordered batch apply. Teardown uses config_reload to clean up.

#### What is the motivation for this PR?

This PR adds Module 6 tests to verify HA behavior under critical process crash scenarios.

#### How did you do it?

- Deployed on a physical SmartSwitch HA testbed (Cisco-8102-28FH-DPU-O) with two NPUs (MtFuji-dut01, MtFuji-dut02), each with 8 AMD Pensando DPUs
- Verified fixture execution order: setup_ha_config → setup_dash_ha_from_json → setup_pl_config → activate_dash_ha_from_json
- Verified PL config is successfully pushed via gNMI to DPU0 on both DUTs
- Verified HA state queries via swbus-cli show hamgrd actor inside dash-hadpu0 container

#### Any platform specific information?

- Tested on Cisco-8102-28FH-DPU-O hardware SKU with AMD Pensando DPUs
- DPU HA containers (dash-hadpu0 through dash-hadpu7) must be running on both NPUs
- Requires swbus-cli available inside the dash-hadpu0 container for HA state verification

#### Supported testbed topology if it's a new test case?

t1-smartswitch-ha

### Documentation

NPU Pass logs

ha/test_ha_npu_process_crash.py::TestNpuProcessCrash::test_crash_active_npu_traffic_on_active[hamgrd]
/data/tests/conftest.py:1417: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
record_testsuite_property("timestamp", datetime.utcnow())
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
------------- generated xml file: /run_logs/dpu-module6-npu-all/ha/test_ha_npu_process_crash_2026-04-14-03-42-19.xml --------------
INFO:root:Can not get Allure report URL. Please check logs
----------------------------------------------------- live log sessionfinish ------------------------------------------------------
14/04/2026 05:59:39 __init__.pytest_terminal_summary L0067 INFO | Can not get Allure report URL. Please check logs
========================================= 12 passed, 3696 warnings in 8234.92s (2:17:14) ==========================================
root@sonic-m6-02-12:/data/tests#

---------

Signed-off-by: nnelluri <nnelluri@cisco.com>
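The test-count arithmetic and the 5% loss criterion in this description can be checked with a short sketch. The process names and the 5% figure come from the description; `build_matrix` and `loss_within_limit` are hypothetical helpers written for this illustration, and treating "within 5%" as an inclusive bound is an assumption. The `[hamgrd]` suffix in the pass log suggests the real tests are parametrized per process, but how they do so is not shown in the source.

```python
from itertools import product

# Process lists as given in the PR description.
DPU_PROCESSES = ("syncd", "bgp")
NPU_PROCESSES = ("hamgrd", "pmon", "bgp")
SIDES = ("active", "standby")

# Assumption: "within 5%" is treated as an inclusive upper bound.
MAX_LOSS_RATIO = 0.05


def build_matrix(processes):
    """Enumerate (process, crash_side, traffic_side) combinations."""
    return list(product(processes, SIDES, SIDES))


def loss_within_limit(sent, received, max_loss_ratio=MAX_LOSS_RATIO):
    """Return True when the fraction of lost packets does not exceed the limit."""
    if sent <= 0:
        raise ValueError("sent must be a positive packet count")
    lost = sent - received
    return lost / sent <= max_loss_ratio


dpu_cases = build_matrix(DPU_PROCESSES)
npu_cases = build_matrix(NPU_PROCESSES)

print(f"DPU cases: {len(dpu_cases)}, NPU cases: {len(npu_cases)}, "
      f"total: {len(dpu_cases) + len(npu_cases)}")
```

Each tuple could be fed to a parametrized test as one case, for example `("hamgrd", "active", "standby")` for a hamgrd crash on the active side with traffic landing on the standby side.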
selldinesh
pushed a commit
to selldinesh/sonic-mgmt
that referenced
this pull request
May 4, 2026
Summary: Split into two new test files for HA critical process crash verification: tests/ha/test_ha_dpu_process_crash.py and tests/ha/test_ha_npu_process_crash.py.
Signed-off-by: nnelluri <nnelluri@cisco.com>
Signed-off-by: selldinesh <dinesh.sellappan@keysight.com>
rraghav-cisco
pushed a commit
to rraghav-cisco/sonic-mgmt
that referenced
this pull request
May 11, 2026
Summary: Split into two new test files for HA critical process crash verification: tests/ha/test_ha_dpu_process_crash.py and tests/ha/test_ha_npu_process_crash.py.
Signed-off-by: nnelluri <nnelluri@cisco.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
johanna-nexthop
pushed a commit
to nexthop-ai/sonic-mgmt
that referenced
this pull request
May 14, 2026
<!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: Split into two new test files for HA critical process crash verification: tests/ha/test_ha_dpu_process_crash.py — Tests DPU process crashes (syncd, bgp) on active/standby DPU with continuous PrivateLink traffic on active/standby. Verifies HA state convergence and asserts traffic loss stays within 5%. tests/ha/test_ha_npu_process_crash.py — Tests NPU process crashes (hamgrd, pmon, bgp) on active/standby NPU with continuous PrivateLink traffic on active/standby. Verifies HA state convergence and asserts traffic loss stays within 5%. Each file covers 4 variations per process: Crash on active side, traffic landing on active side Crash on active side, traffic landing on standby side Crash on standby side, traffic landing on active side Crash on standby side, traffic landing on standby side Total: 8 DPU tests (2 processes x 4 variations) + 12 NPU tests (3 processes x 4 variations) = 20 test cases. Fixes # (issue) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 - [ X] 202511 ### Approach SONiC HA (High Availability) on SmartSwitch requires validation that critical process crashes on DPUs are handled gracefully. 
When a process like syncd crashes on an active or standby DPU, the HA state machine must converge correctly and the dataplane must recover with minimal disruption. This PR adds Module 6 tests to verify HA behavior under critical process crash scenarios using PrivateLink (PL) traffic as the dataplane verification mechanism. How did you do it? How did you do it? Added test_ha_critical_process_crash.py with a TestSyncdCrash class that covers 4 syncd crash variations: Crash on active DPU, traffic landing on active DPU Crash on active DPU, traffic landing on standby DPU Crash on standby DPU, traffic landing on active DPU Crash on standby DPU, traffic landing on standby DPU Each test follows a common flow: Verify PL dataplane is functional (pre-crash baseline) Kill syncd on the target DPU via docker exec Verify HA state converges on both the crash DUT and the peer DUT Wait for the allowed traffic disruption window, then re-verify PL traffic Wait for syncd to auto-recover PL traffic verification sends a VxLAN-encapped DASH packet from the T0/VM-side PTF port and verifies a GRE-encapped packet arrives at the T2/PE-side PTF port, following the pattern established in test_ha_steady_state_pl.py (PR sonic-net#22161). The PrivateLink DASH config (appliance, routing types, VNET, ENI, meters, routes) is defined in configs/privatelink_config.py and pushed to DPU0 on both DUTs via gNMI using ordered batch apply. Teardown uses config_reload to clean up. #### What is the motivation for this PR? This PR adds Module 6 tests to verify HA behavior under critical process crash scenarios #### How did you do it? 
- Deployed on a physical SmartSwitch HA testbed (Cisco-8102-28FH-DPU-O) with two NPUs (MtFuji-dut01, MtFuji-dut02), each with 8 AMD Pensando DPUs
- Verified fixture execution order: setup_ha_config → setup_dash_ha_from_json → setup_pl_config → activate_dash_ha_from_json
- Verified PL config is successfully pushed via gNMI to DPU0 on both DUTs
- Verified HA state queries via swbus-cli show hamgrd actor inside dash-hadpu0 container

#### Any platform specific information?

- Tested on Cisco-8102-28FH-DPU-O hardware SKU with AMD Pensando DPUs
- DPU HA containers (dash-hadpu0 through dash-hadpu7) must be running on both NPUs
- Requires swbus-cli available inside the dash-hadpu0 container for HA state verification

#### Supported testbed topology if it's a new test case?

t1-smartswitch-ha

### Documentation

NPU Pass logs

ha/test_ha_npu_process_crash.py::TestNpuProcessCrash::test_crash_active_npu_traffic_on_active[hamgrd]
/data/tests/conftest.py:1417: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
record_testsuite_property("timestamp", datetime.utcnow())
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
------------- generated xml file: /run_logs/dpu-module6-npu-all/ha/test_ha_npu_process_crash_2026-04-14-03-42-19.xml --------------
INFO:root:Can not get Allure report URL. Please check logs
----------------------------------------------------- live log sessionfinish ------------------------------------------------------
14/04/2026 05:59:39 __init__.pytest_terminal_summary L0067 INFO | Can not get Allure report URL. Please check logs
========================================= 12 passed, 3696 warnings in 8234.92s (2:17:14) ==========================================
root@sonic-m6-02-12:/data/tests#

---------

Signed-off-by: nnelluri <nnelluri@cisco.com>
Signed-off-by: Johanna Jegan <johanna@nexthop.ai>
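
The case count and the 5% loss bound quoted in the commit message above can be checked with a small sketch. It is illustrative only and not taken from the test files: the names `build_matrix`, `loss_percent` and `within_loss_budget` are hypothetical, and treating "within 5%" as an inclusive bound is an assumption.

```python
from itertools import product

# Process lists and the 5% threshold come from the commit message;
# every identifier below is hypothetical and exists only for illustration.
DPU_PROCESSES = ("syncd", "bgp")
NPU_PROCESSES = ("hamgrd", "pmon", "bgp")
SIDES = ("active", "standby")
MAX_LOSS_PERCENT = 5.0


def build_matrix(processes):
    """Return (process, crash_side, traffic_side) for every variation."""
    return list(product(processes, SIDES, SIDES))


def loss_percent(sent, received):
    """Percentage of sent packets that did not arrive."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


def within_loss_budget(sent, received, max_loss=MAX_LOSS_PERCENT):
    """True when the loss does not exceed the budget (assumed inclusive)."""
    return loss_percent(sent, received) <= max_loss


dpu_cases = build_matrix(DPU_PROCESSES)
npu_cases = build_matrix(NPU_PROCESSES)
print(len(dpu_cases), len(npu_cases), len(dpu_cases) + len(npu_cases))
# prints: 8 12 20
```

In a pytest suite the same matrix would more likely be expressed with `pytest.mark.parametrize`; the plain enumeration here only makes the 2 x 4 + 3 x 4 = 20 arithmetic explicit.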
Description of PR
HA test code with PL config and traffic
Summary:
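This test covers Module 1 of the HA test plan.
The following is tested:
- Load HA configuration on Primary and Standby
- Activate HA on Primary and Standby
- Send private link traffic and verify that it is received as expected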
Type of change
Back port request
Approach
What is the motivation for this PR?
Add HA test with PL traffic
How did you do it?
Added an HA steady-state traffic test that uses the PL config.
How did you verify/test it?
Ran it on the HA topology.
Any platform specific information?
HA topology for MTFuji
Supported testbed topology if it's a new test case?
HA topology
Documentation