[HA] [smartswitch] add ha steady traffic test with PL config #22161

Merged
prsunny merged 29 commits into sonic-net:master from aronovic:ha_dash_pl_traffic on Mar 10, 2026

@aronovic (Contributor) commented Jan 29, 2026

Description of PR

HA test code with PL config and traffic

Summary:
This test covers Module 1 of the HA test plan.
The following is tested:

  1. Load HA configuration on Primary and Standby
  2. Activate HA on Primary and Standby
  3. Send private link traffic and verify that it is received as expected.
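
The three steps above can be sketched roughly as follows. This is only an illustration: `run_steady_state_check` and the callables passed into it are hypothetical stand-ins, not helpers from `tests/ha`, and the choice to finish configuring both sides before activating either one is an assumption.

```python
from typing import Callable, Iterable


def run_steady_state_check(
    duts: Iterable[str],
    load_ha_config: Callable[[str], None],
    activate_ha: Callable[[str], None],
    send_and_verify_pl: Callable[[], bool],
) -> bool:
    """Run the three steps in order. Every callable is a hypothetical stand-in."""
    dut_list = list(duts)
    # Step 1: load the HA configuration on every DUT (Primary and Standby).
    for dut in dut_list:
        load_ha_config(dut)
    # Step 2: activate HA once all DUTs are configured (ordering is an assumption).
    for dut in dut_list:
        activate_ha(dut)
    # Step 3: send private link traffic and report whether it arrived as expected.
    return send_and_verify_pl()
```

Passing the steps in as callables keeps the sketch independent of any real testbed API.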

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

Add HA test with PL traffic

How did you do it?

Added tests

How did you verify/test it?

Ran it on the HA topology

Any platform specific information?

HA topology for MTFuji

Supported testbed topology if it's a new test case?

HA topology

Documentation

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread tests/ha/conftest.py Outdated
Comment thread tests/ha/conftest.py Outdated
Comment thread tests/ha/test_dash_privatelink.py Outdated
Comment thread tests/ha/test_dash_privatelink.py Outdated
Comment thread tests/ha/test_dash_privatelink.py Outdated
Comment thread tests/ha/conftest.py
Comment thread tests/ha/proto_utils.py
Comment thread tests/ha/gnmi_utils.py
zjswhhh previously approved these changes Feb 5, 2026
Comment thread tests/ha/gnmi_utils.py Outdated
Comment thread tests/ha/test_dash_privatelink.py
Comment thread tests/ha/test_ha_steady_state_pl.py
Comment thread tests/ha/test_ha_steady_state_pl.py

@mssonicbld
Collaborator

@aronovic PR conflicts with 202511 branch

vrajeshe pushed a commit to vrajeshe/sonic-mgmt that referenced this pull request Mar 23, 2026
…et#22161)

* DASH - [HA] [smartswitch] add ha steady traffic test with PL config (sonic-net#22161)

Summary:
This test covers Module 1 of the HA test plan.
The following is tested:

1. Load HA configuration on Primary and Standby
2. Activate HA on Primary and Standby
3. Send private link traffic and verify that it is received as expected.

Signed-off-by: Venkata Gouri Rajesh Etla <vrajeshe@cisco.com>
selldinesh pushed a commit to selldinesh/sonic-mgmt that referenced this pull request Apr 1, 2026
Signed-off-by: selldinesh <dinesh.sellappan@keysight.com>
@theasianpianist
Contributor

The "Approved for 202511 branch" and "Cherry Pick Conflict_202511" labels have been temporarily removed so that the automated cherry-pick procedure can be retried once older prerequisite PRs are successfully cherry-picked to the 202511 branch. The "Approved" label will be re-added at that time.

albertovillarreal-keys pushed a commit to albertovillarreal-keys/sonic-mgmt that referenced this pull request Apr 3, 2026
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Apr 6, 2026
Signed-off-by: mssonicbld <sonicbld@microsoft.com>
@mssonicbld
Collaborator

Cherry-pick PR to 202511: #23643

theasianpianist pushed a commit to theasianpianist/sonic-mgmt that referenced this pull request Apr 6, 2026
vmittal-msft pushed a commit that referenced this pull request Apr 7, 2026
Cherry-pick of 9 PRs to 202511 — HA core infrastructure (conftest.py
chain, BFD, GNMI, state_db).

All cherry-picks apply cleanly with no conflicts. 3 PRs from the
original batch (#22489, #22736, #22920) were already on 202511 and are
skipped.

### Included PRs (in cherry-pick order):

1. #22161 — [HA] [smartswitch] add ha steady traffic test with PL config
2. #22958 — [HA][smartswitch] ha test workaround for the neigh resolve
issue
3. #23023 — Use GNMI to configure HA
4. #22664 — [HA][smartswitch] Extract DASH HA info from state_db
directly
5. #23106 — revert PR 22920 to the original BFD values
6. #23125 — [ha] get remote npu pa ip (loopback0 ip) from topo
definition
7. #23100 — configure vlan port on both dpus and perform cleanup
8. #22952 — Remove generate_vlan_config and address review comments in
HA conftest

### Already on 202511 (skipped):

- #22489 — [ssw][ha] update ovs rules for HA
- #22736 — [DASH-HA] Remove hardcoded loopback and device names
- #22920 — Fix bfd probe interval

### Why batched?

These PRs form a dependency chain through `tests/ha/conftest.py` and
related HA files. Each commit modifies files that subsequent commits
also touch, so they must be applied in order.

---------

Signed-off-by: Jing Zhang <zhangjing@microsoft.com>
Co-authored-by: aronovic <166534786+aronovic@users.noreply.github.com>
Co-authored-by: dypet <dypeters@cisco.com>
Co-authored-by: yue-fred-gao <132678244+yue-fred-gao@users.noreply.github.com>
Co-authored-by: Jing Zhang <zhangjing@microsoft.com>
Co-authored-by: nnelluri-cisco <nnelluri@cisco.com>
rraghav-cisco pushed a commit to rraghav-cisco/sonic-mgmt that referenced this pull request Apr 20, 2026
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
prsunny pushed a commit that referenced this pull request Apr 20, 2026
### Description of PR

Summary:
Split into two new test files for HA critical process crash
verification:

tests/ha/test_ha_dpu_process_crash.py — Tests DPU process crashes
(syncd, bgp) on active/standby DPU with continuous PrivateLink traffic
on active/standby. Verifies HA state convergence and asserts traffic
loss stays within 5%.

tests/ha/test_ha_npu_process_crash.py — Tests NPU process crashes
(hamgrd, pmon, bgp) on active/standby NPU with continuous PrivateLink
traffic on active/standby. Verifies HA state convergence and asserts
traffic loss stays within 5%.
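
As a rough illustration of the 5% loss check described above, the arithmetic could look like this. The function names are hypothetical, and treating exactly 5% as a pass (an inclusive bound) is an assumption; the source only says loss "stays within 5%".

```python
def traffic_loss_percent(sent: int, received: int) -> float:
    """Return packet loss as a percentage of the packets sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    # Clamp at zero so duplicated packets cannot yield a negative loss.
    lost = max(sent - received, 0)
    return 100.0 * lost / sent


def within_loss_budget(sent: int, received: int, max_loss_percent: float = 5.0) -> bool:
    """True when the measured loss does not exceed the budget (inclusive, by assumption)."""
    return traffic_loss_percent(sent, received) <= max_loss_percent
```

For example, 950 of 1000 packets received is exactly 5% loss and passes under this assumption, while 949 of 1000 fails.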

Each file covers 4 variations per process:

- Crash on active side, traffic landing on active side
- Crash on active side, traffic landing on standby side
- Crash on standby side, traffic landing on active side
- Crash on standby side, traffic landing on standby side

Total: 8 DPU tests (2 processes x 4 variations) + 12 NPU tests (3
processes x 4 variations) = 20 test cases.
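
The test-case count above can be reproduced by enumerating the process, crash-side, and traffic-side combinations. This is a minimal sketch; `crash_matrix` and the constant names are illustrative and are not taken from the test files.

```python
import itertools

# Process names as listed in the summary above.
DPU_PROCESSES = ("syncd", "bgp")
NPU_PROCESSES = ("hamgrd", "pmon", "bgp")
SIDES = ("active", "standby")


def crash_matrix(processes):
    """Return every (process, crash_side, traffic_side) combination."""
    return list(itertools.product(processes, SIDES, SIDES))


dpu_cases = crash_matrix(DPU_PROCESSES)  # 2 processes x 4 variations
npu_cases = crash_matrix(NPU_PROCESSES)  # 3 processes x 4 variations
total_cases = len(dpu_cases) + len(npu_cases)
```

A list like this could feed `pytest.mark.parametrize`, though whether the actual tests are parametrized this way is not stated here.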

### Type of change

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement


### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505
- [x] 202511

### Approach
SONiC HA (High Availability) on SmartSwitch requires validation that
critical process crashes on DPUs are handled gracefully. When a process
like syncd crashes on an active or standby DPU, the HA state machine
must converge correctly and the dataplane must recover with minimal
disruption. This PR adds Module 6 tests to verify HA behavior under
critical process crash scenarios using PrivateLink (PL) traffic as the
dataplane verification mechanism.

How did you do it?
Added test_ha_critical_process_crash.py with a TestSyncdCrash class that
covers 4 syncd crash variations:

- Crash on active DPU, traffic landing on active DPU
- Crash on active DPU, traffic landing on standby DPU
- Crash on standby DPU, traffic landing on active DPU
- Crash on standby DPU, traffic landing on standby DPU

Each test follows a common flow:

1. Verify PL dataplane is functional (pre-crash baseline)
2. Kill syncd on the target DPU via docker exec
3. Verify HA state converges on both the crash DUT and the peer DUT
4. Wait for the allowed traffic disruption window, then re-verify PL
   traffic
5. Wait for syncd to auto-recover

PL traffic verification sends a VXLAN-encapsulated DASH packet from the
T0/VM-side PTF port and verifies that a GRE-encapsulated packet arrives at
the T2/PE-side PTF port, following the pattern established in
test_ha_steady_state_pl.py (PR #22161).
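
The common flow described above might be sketched like this. Everything here is hypothetical: `run_crash_case` and its callables stand in for the real helpers, which are not shown in this description, and the "crash"/"peer" labels are illustrative.

```python
import time
from typing import Callable


def run_crash_case(
    verify_pl: Callable[[], bool],
    kill_process: Callable[[], None],
    ha_converged: Callable[[str], bool],
    process_recovered: Callable[[], bool],
    disruption_window_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Walk through the crash-test flow; raise AssertionError at the first failed step."""
    # Pre-crash baseline: the PL dataplane must already work.
    if not verify_pl():
        raise AssertionError("pre-crash PL baseline failed")
    # Crash the target process (for example, syncd on the chosen DPU).
    kill_process()
    # HA state must converge on both the crash DUT and the peer DUT.
    for side in ("crash", "peer"):
        if not ha_converged(side):
            raise AssertionError(f"HA state did not converge on the {side} DUT")
    # Allow the permitted disruption window to pass, then re-check traffic.
    sleep(disruption_window_s)
    if not verify_pl():
        raise AssertionError("PL traffic did not recover after the disruption window")
    # Finally, the crashed process is expected to come back on its own.
    if not process_recovered():
        raise AssertionError("crashed process did not auto-recover")
```

Injecting `sleep` lets the flow be exercised without waiting in a dry run.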

The PrivateLink DASH config (appliance, routing types, VNET, ENI,
meters, routes) is defined in configs/privatelink_config.py and pushed
to DPU0 on both DUTs via gNMI using ordered batch apply. Teardown uses
config_reload to clean up.
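
One way to picture an "ordered batch apply" is to sort the config groups into a fixed order before pushing them. The sketch below only shows the ordering step. The group keys and `ordered_batches` are hypothetical, and using the order in which the groups are listed above as the apply order is an assumption, not something confirmed by configs/privatelink_config.py.

```python
# Assumed apply order, taken from the order the groups are listed in the text.
APPLY_ORDER = ("appliance", "routing_types", "vnet", "eni", "meters", "routes")


def ordered_batches(config: dict, order=APPLY_ORDER) -> list:
    """Return (group, entries) pairs sorted into the apply order."""
    unknown = set(config) - set(order)
    if unknown:
        # Refuse to guess where an unlisted group belongs.
        raise ValueError(f"no apply order defined for: {sorted(unknown)}")
    return [(group, config[group]) for group in order if group in config]
```

Each returned pair would then be handed to whatever gNMI push helper the test uses, one batch at a time.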


#### What is the motivation for this PR?
This PR adds Module 6 tests to verify HA behavior under critical process
crash scenarios.
#### How did you verify/test it?
- Deployed on a physical SmartSwitch HA testbed (Cisco-8102-28FH-DPU-O)
  with two NPUs (MtFuji-dut01, MtFuji-dut02), each with 8 AMD Pensando DPUs
- Verified fixture execution order: setup_ha_config →
  setup_dash_ha_from_json → setup_pl_config → activate_dash_ha_from_json
- Verified PL config is successfully pushed via gNMI to DPU0 on both DUTs
- Verified HA state queries via swbus-cli show hamgrd actor inside the
  dash-hadpu0 container
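
The fixture execution order listed above is what a linear dependency chain would produce, since pytest sets up a fixture only after the fixtures it requests. The sketch below models that chain with the standard library instead of pytest itself; the dependency edges are an assumption inferred from the stated order, not copied from conftest.py.

```python
from graphlib import TopologicalSorter

# Each fixture maps to the set of fixtures it is assumed to request.
FIXTURE_DEPS = {
    "setup_ha_config": set(),
    "setup_dash_ha_from_json": {"setup_ha_config"},
    "setup_pl_config": {"setup_dash_ha_from_json"},
    "activate_dash_ha_from_json": {"setup_pl_config"},
}


def fixture_order(deps=FIXTURE_DEPS) -> list:
    """Return the setup order implied by the dependency chain."""
    return list(TopologicalSorter(deps).static_order())
```

With these edges there is only one valid order, which matches the sequence reported above.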

#### Any platform specific information?
- Tested on Cisco-8102-28FH-DPU-O hardware SKU with AMD Pensando DPUs
- DPU HA containers (dash-hadpu0 through dash-hadpu7) must be running on
  both NPUs
- Requires swbus-cli available inside the dash-hadpu0 container for HA
  state verification
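
A precondition check for the container requirement above could be sketched as follows. The helper names are hypothetical, and how the list of running container names is obtained from each NPU is left out on purpose.

```python
def expected_ha_containers(num_dpus: int = 8) -> list:
    """Names dash-hadpu0 .. dash-hadpu<N-1>, following the naming given above."""
    return [f"dash-hadpu{i}" for i in range(num_dpus)]


def missing_ha_containers(running, num_dpus: int = 8) -> list:
    """Return the expected HA containers that are absent from `running`."""
    running_set = set(running)
    return [name for name in expected_ha_containers(num_dpus) if name not in running_set]
```

An empty result for both NPUs would mean the precondition holds.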
#### Supported testbed topology if it's a new test case?
t1-smartswitch-ha
### Documentation
NPU Pass logs


ha/test_ha_npu_process_crash.py::TestNpuProcessCrash::test_crash_active_npu_traffic_on_active[hamgrd]
/data/tests/conftest.py:1417: DeprecationWarning:
datetime.datetime.utcnow() is deprecated and scheduled for removal in a
future version. Use timezone-aware objects to represent datetimes in
UTC: datetime.datetime.now(datetime.UTC).
    record_testsuite_property("timestamp", datetime.utcnow())

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
------------- generated xml file:
/run_logs/dpu-module6-npu-all/ha/test_ha_npu_process_crash_2026-04-14-03-42-19.xml
--------------
INFO:root:Can not get Allure report URL. Please check logs
----------------------------------------------------- live log
sessionfinish ------------------------------------------------------
14/04/2026 05:59:39 __init__.pytest_terminal_summary L0067 INFO | Can
not get Allure report URL. Please check logs
========================================= 12 passed, 3696 warnings in
8234.92s (2:17:14) ==========================================
root@sonic-m6-02-12:/data/tests#

---------

Signed-off-by: nnelluri <nnelluri@cisco.com>
selldinesh pushed a commit to selldinesh/sonic-mgmt that referenced this pull request May 4, 2026
Signed-off-by: selldinesh <dinesh.sellappan@keysight.com>
rraghav-cisco pushed a commit to rraghav-cisco/sonic-mgmt that referenced this pull request May 11, 2026
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
johanna-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request May 14, 2026
<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
Split into two new test files for HA critical process crash
verification:

tests/ha/test_ha_dpu_process_crash.py — Tests DPU process crashes
(syncd, bgp) on active/standby DPU with continuous PrivateLink traffic
on active/standby. Verifies HA state convergence and asserts traffic
loss stays within 5%.

tests/ha/test_ha_npu_process_crash.py — Tests NPU process crashes
(hamgrd, pmon, bgp) on active/standby NPU with continuous PrivateLink
traffic on active/standby. Verifies HA state convergence and asserts
traffic loss stays within 5%.

Each file covers 4 variations per process:

- Crash on active side, traffic landing on active side
- Crash on active side, traffic landing on standby side
- Crash on standby side, traffic landing on active side
- Crash on standby side, traffic landing on standby side

Total: 8 DPU tests (2 processes x 4 variations) + 12 NPU tests (3
processes x 4 variations) = 20 test cases.
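
The case count follows from a Cartesian product of process, crash side, and traffic side. The sketch below enumerates that matrix to check the arithmetic; the process names come from the summary above, while `build_matrix` and the dictionary keys are hypothetical and are not taken from the test files.

```python
from itertools import product

# Process lists as stated in the summary.
DPU_PROCESSES = ("syncd", "bgp")
NPU_PROCESSES = ("hamgrd", "pmon", "bgp")
SIDES = ("active", "standby")


def build_matrix(processes):
    """Enumerate every (process, crash side, traffic side) combination."""
    return [
        {"process": proc, "crash_side": crash, "traffic_side": traffic}
        for proc, crash, traffic in product(processes, SIDES, SIDES)
    ]


dpu_cases = build_matrix(DPU_PROCESSES)  # 2 processes x 4 variations
npu_cases = build_matrix(NPU_PROCESSES)  # 3 processes x 4 variations
total_cases = len(dpu_cases) + len(npu_cases)
```

In pytest, a matrix like this would typically be fed to a parametrized test; how the actual files parametrize their cases is not shown here.
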
Fixes # (issue)

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505
- [x] 202511

### Approach
SONiC HA (High Availability) on SmartSwitch requires validation that
critical process crashes on DPUs are handled gracefully. When a process
like syncd crashes on an active or standby DPU, the HA state machine
must converge correctly and the dataplane must recover with minimal
disruption. This PR adds Module 6 tests to verify HA behavior under
critical process crash scenarios using PrivateLink (PL) traffic as the
dataplane verification mechanism.

How did you do it?
Added test_ha_critical_process_crash.py with a TestSyncdCrash class that
covers 4 syncd crash variations:

- Crash on active DPU, traffic landing on active DPU
- Crash on active DPU, traffic landing on standby DPU
- Crash on standby DPU, traffic landing on active DPU
- Crash on standby DPU, traffic landing on standby DPU

Each test follows a common flow:

1. Verify PL dataplane is functional (pre-crash baseline)
2. Kill syncd on the target DPU via docker exec
3. Verify HA state converges on both the crash DUT and the peer DUT
4. Wait for the allowed traffic disruption window, then re-verify PL
   traffic
5. Wait for syncd to auto-recover

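
The common flow above can be sketched as a single driver that takes each step as a callable. This is an illustration under stated assumptions, not the code in the PR: `wait_until`, `run_crash_case`, and all parameter names are hypothetical, and the real tests would supply callables that talk to the DUTs and PTF.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout_s: float,
               interval_s: float = 1.0) -> bool:
    """Poll predicate until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def run_crash_case(verify_pl_traffic: Callable[[], bool],
                   kill_process: Callable[[], None],
                   ha_state_converged: Callable[[], bool],
                   process_running: Callable[[], bool],
                   disruption_window_s: float,
                   converge_timeout_s: float,
                   recover_timeout_s: float,
                   poll_interval_s: float = 1.0) -> None:
    """Run the five steps in order; all callables are caller-supplied."""
    # 1. Pre-crash baseline.
    assert verify_pl_traffic(), "baseline PL traffic failed"
    # 2. Crash the process on the target.
    kill_process()
    # 3. HA state must converge (callable is assumed to check both DUTs).
    assert wait_until(ha_state_converged, converge_timeout_s, poll_interval_s), \
        "HA state did not converge"
    # 4. Allow the disruption window, then re-verify traffic.
    time.sleep(disruption_window_s)
    assert verify_pl_traffic(), "PL traffic did not recover"
    # 5. The crashed process should come back on its own.
    assert wait_until(process_running, recover_timeout_s, poll_interval_s), \
        "process did not auto-recover"
```

Passing the steps in as callables keeps the ordering logic testable without hardware; the timeouts are left as parameters because the description does not state their values.
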
PL traffic verification sends a VXLAN-encapsulated DASH packet from the
T0/VM-side PTF port and verifies that a GRE-encapsulated packet arrives
at the T2/PE-side PTF port, following the pattern established in
test_ha_steady_state_pl.py (PR sonic-net#22161).

The PrivateLink DASH config (appliance, routing types, VNET, ENI,
meters, routes) is defined in configs/privatelink_config.py and pushed
to DPU0 on both DUTs via gNMI using ordered batch apply. Teardown uses
config_reload to clean up.

#### What is the motivation for this PR?
This PR adds Module 6 tests to verify HA behavior under critical process
crash scenarios.
#### How did you verify/test it?
- Deployed on a physical SmartSwitch HA testbed (Cisco-8102-28FH-DPU-O)
  with two NPUs (MtFuji-dut01, MtFuji-dut02), each with 8 AMD Pensando
  DPUs
- Verified fixture execution order: setup_ha_config →
  setup_dash_ha_from_json → setup_pl_config → activate_dash_ha_from_json
- Verified PL config is successfully pushed via gNMI to DPU0 on both DUTs
- Verified HA state queries via swbus-cli show hamgrd actor inside the
  dash-hadpu0 container
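
The fixture order listed above is what results when each fixture depends on the one before it. The sketch below illustrates that resolution with a standard-library topological sort; the fixture names come from the list above, but the dependency edges are an assumption (the description gives only the resulting order), and pytest's own fixture resolution is not used here.

```python
from graphlib import TopologicalSorter

# Assumed edges: each fixture maps to the set of fixtures it depends on.
FIXTURE_DEPENDENCIES = {
    "setup_ha_config": set(),
    "setup_dash_ha_from_json": {"setup_ha_config"},
    "setup_pl_config": {"setup_dash_ha_from_json"},
    "activate_dash_ha_from_json": {"setup_pl_config"},
}


def resolve_order(dependencies):
    """Return an order in which every fixture follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = resolve_order(FIXTURE_DEPENDENCIES)
```

Because the assumed graph is a single chain, only one valid order exists, and it matches the sequence verified on the testbed.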

#### Any platform specific information?
- Tested on Cisco-8102-28FH-DPU-O hardware SKU with AMD Pensando DPUs
- DPU HA containers (dash-hadpu0 through dash-hadpu7) must be running on
  both NPUs
- Requires swbus-cli available inside the dash-hadpu0 container for HA
  state verification
#### Supported testbed topology if it's a new test case?
t1-smartswitch-ha
### Documentation
NPU Pass logs

ha/test_ha_npu_process_crash.py::TestNpuProcessCrash::test_crash_active_npu_traffic_on_active[hamgrd]
/data/tests/conftest.py:1417: DeprecationWarning:
datetime.datetime.utcnow() is deprecated and scheduled for removal in a
future version. Use timezone-aware objects to represent datetimes in
UTC: datetime.datetime.now(datetime.UTC).
    record_testsuite_property("timestamp", datetime.utcnow())

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
------------- generated xml file:
/run_logs/dpu-module6-npu-all/ha/test_ha_npu_process_crash_2026-04-14-03-42-19.xml
--------------
INFO:root:Can not get Allure report URL. Please check logs
----------------------------------------------------- live log
sessionfinish ------------------------------------------------------
14/04/2026 05:59:39 __init__.pytest_terminal_summary L0067 INFO | Can
not get Allure report URL. Please check logs
========================================= 12 passed, 3696 warnings in
8234.92s (2:17:14) ==========================================
root@sonic-m6-02-12:/data/tests#

---------

Signed-off-by: nnelluri <nnelluri@cisco.com>
Signed-off-by: Johanna Jegan <johanna@nexthop.ai>