fix(ptfadapter): tolerate supervisorctl restart spawn errors #24749
Closed
auspham wants to merge 1 commit into
The start_ptf_nn_agent retry loop in the ptfadapter fixture calls 'supervisorctl restart ptf_nn_agent' without module_ignore_errors=True. When supervisord returns a non-zero exit code (for example, 'ERROR (spawn error)' on a port collision with the randomly-picked ptf_nn_port), Ansible raises and the surrounding 3-iteration retry loop never runs. The fixture aborts in setup and any test depending on ptfadapter errors out.

Pass module_ignore_errors=True so the loop can pick a different random port. Also harden the supervisorctl status parsing against an empty stdout_lines list to avoid an IndexError.

Signed-off-by: Austin (Ngoc Thang) Pham <austinpham@microsoft.com>
Collaborator
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Contributor Author
/azpw run
Collaborator
Retrying failed(or canceled) jobs...
Collaborator
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Contributor Author
should be fixed by sonic-net/sonic-buildimage#26912
Contributor Author
Stale analysis, already fixed.
Description of PR
Summary: Make the ptfadapter fixture tolerant to a transient supervisorctl restart failure so that the existing 3-iteration retry loop actually runs when supervisord briefly reports ERROR (spawn error) for the randomly-picked ptf_nn_port. This stabilizes one of the top PR flakes affecting vlan/test_host_vlan.py::test_host_vlan_no_floodling, vlan/test_vlan_ping.py::test_vlan_ping, and many other tests that depend on this fixture.
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
The fixture in tests/common/plugins/ptfadapter/__init__.py is one of the heaviest contributors to PR-test flakiness. The Elastictest flaky-episode KQL (30-day window) shows the resulting setup errors hitting 110+ distinct PRs each across vlan.test_host_vlan::test_host_vlan_no_floodling and vlan.test_vlan_ping::test_vlan_ping, plus a long tail of other ptfadapter-using tests. The failures all surface as:
Sample affected Elastictest testplans:
How did you do it?
The start_ptf_nn_agent retry loop in tests/common/plugins/ptfadapter/__init__.py calls 'supervisorctl restart ptf_nn_agent' without module_ignore_errors=True. When supervisord returns a non-zero exit code (for example, ERROR (spawn error) when the randomly-picked ptf_nn_port is already in use), Ansible raises on the first attempt, the surrounding 3-iteration retry loop never reaches a second iteration, and the fixture aborts in setup. The very mechanism intended to recover from a bad port pick is bypassed.
This PR:
- Passes module_ignore_errors=True to the supervisorctl restart call so a transient spawn error keeps us inside the loop, where the next iteration randomly picks a different port and re-templates the supervisor config.
- Hardens the supervisorctl status parsing against an empty stdout_lines list (avoids an IndexError if status output is briefly empty).
It is a 2-line behavioural change; the existing retry logic, port range, and template flow are unchanged.
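The two changes described above can be illustrated with a minimal, self-contained sketch. It is not the fixture's actual code: the function name, the `run` callable, the `rng` parameter, and the port range are all hypothetical stand-ins. `run` plays the role of the Ansible shell call made with module_ignore_errors=True, meaning it returns a result dict on failure instead of raising.

```python
import random


def start_agent_with_retry(run, attempts=3, port_range=(10900, 11000), rng=random):
    """Retry the agent restart, picking a fresh random port on each iteration.

    `run(cmd, port)` is a hypothetical stand-in for the Ansible shell call made
    with module_ignore_errors=True: it returns a dict (with "stdout_lines")
    even when the command exits non-zero, rather than raising.
    The port range is an illustrative assumption, not the fixture's real range.
    """
    for _ in range(attempts):
        port = rng.randint(*port_range)

        # A spawn error here is tolerated, so control stays inside the loop
        # and the next iteration can try a different port.
        run("supervisorctl restart ptf_nn_agent", port)

        status = run("supervisorctl status ptf_nn_agent", port)
        # Guard against a missing or empty stdout_lines list before indexing,
        # which would otherwise raise IndexError.
        lines = status.get("stdout_lines") or []
        if lines and "RUNNING" in lines[0]:
            return port

    # Every attempt failed; the caller decides how to report it.
    return None
```

If `run` raised on a non-zero exit instead, the first failed restart would propagate out of the `for` loop and the remaining attempts would never happen, which is the failure mode this PR describes.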
How did you verify/test it?
- ERROR (spawn error) raises out of the loop (Ansible's default is to fail on a non-zero exit).
- nbr_ptfadapter already handles transient spawn issues with similar tolerance.
Any platform specific information?
N/A: generic test framework fix; applies to all platforms/topologies using ptfadapter.
Supported testbed topology if it's a new test case?
N/A — not a new test case.
Documentation
N/A