Skip to content

fix(ptfadapter): tolerate supervisorctl restart spawn errors#24749

Closed
auspham wants to merge 1 commit into
sonic-net:masterfrom
auspham:austinpham/38022764-pr-flakiness-test-host-vlan
Closed

fix(ptfadapter): tolerate supervisorctl restart spawn errors#24749
auspham wants to merge 1 commit into
sonic-net:masterfrom
auspham:austinpham/38022764-pr-flakiness-test-host-vlan

Conversation

@auspham
Copy link
Copy Markdown
Contributor

@auspham auspham commented May 20, 2026

Description of PR

Summary: Make the ptfadapter fixture tolerant to a transient supervisorctl restart failure so that the existing 3-iteration retry loop actually runs when supervisord briefly reports ERROR (spawn error) for the randomly-picked ptf_nn_port. This stabilizes one of the top PR flakes affecting vlan/test_host_vlan.py::test_host_vlan_no_floodling, vlan/test_vlan_ping.py::test_vlan_ping, and many other tests that depend on this fixture.

Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

The fixture in tests/common/plugins/ptfadapter/__init__.py is one of the heaviest contributors to PR-test flakiness. The Elastictest flaky-episode KQL (30-day window) shows the resulting setup errors hitting 110+ distinct PRs each across vlan.test_host_vlan::test_host_vlan_no_floodling and vlan.test_vlan_ping::test_vlan_ping, plus a long tail of other ptfadapter-using tests. The failures all surface as:

ERROR at setup of test_host_vlan_no_floodling
ptf_nn_agent: ERROR (spawn error)
res = {'failed': True, 'changed': True,
       'stdout': 'ptf_nn_agent: stopped\nptf_nn_agent: ERROR (spawn error)', ...}

Sample affected Elastictest testplans:

How did you do it?

The start_ptf_nn_agent retry loop in tests/common/plugins/ptfadapter/__init__.py calls:

ptfhost.command('supervisorctl restart ptf_nn_agent')

without module_ignore_errors=True. When supervisord returns a non-zero exit code (for example, ERROR (spawn error) when the randomly-picked ptf_nn_port is already in use), Ansible raises, the surrounding 3-iteration retry loop never runs, and the fixture aborts in setup. The very mechanism intended to recover from a bad port pick is bypassed.

This PR:

  1. Adds module_ignore_errors=True to the supervisorctl restart call so a transient spawn error keeps us inside the loop, where the next iteration randomly picks a different port and re-templates the supervisor config.
  2. Hardens the subsequent supervisorctl status parsing against an empty stdout_lines list (avoids an IndexError if status output is briefly empty).

It is a 2-line behavioural change; the existing retry logic, port range, and template flow are unchanged.

How did you verify/test it?

  • Re-read the fixture and confirmed that without the change a single ERROR (spawn error) raises out of the loop (Ansible default is to fail on non-zero exit).
  • Confirmed the change keeps the loop semantics the same on the happy path (RUNNING → return; non-RUNNING → next iteration → fresh random port).
  • Manual reasoning matches sibling code in the same file (nbr_ptfadapter already handles transient spawn issues with similar tolerance).

Any platform specific information?

N/A — generic test framework fix; applies to all platforms/topologies using ptfadapter.

Supported testbed topology if it's a new test case?

N/A — not a new test case.

Documentation

N/A

The start_ptf_nn_agent retry loop in the ptfadapter fixture calls
'supervisorctl restart ptf_nn_agent' without module_ignore_errors=True.
When supervisord returns a non-zero exit code (for example,
'ERROR (spawn error)' on a port collision with the randomly-picked
ptf_nn_port), Ansible raises and the surrounding 3-iteration retry loop
never runs. The fixture aborts in setup and any test depending on
ptfadapter errors out.

Pass module_ignore_errors=True so the loop can pick a different random
port. Also harden the supervisorctl status parsing against an empty
stdout_lines list to avoid an IndexError.

Signed-off-by: Austin (Ngoc Thang) Pham <austinpham@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@github-actions github-actions Bot requested review from lolyu, wangxin and xwjiang-ms May 20, 2026 11:47
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@auspham auspham marked this pull request as ready for review May 20, 2026 11:59
@auspham
Copy link
Copy Markdown
Contributor Author

auspham commented May 20, 2026

/azpw run

@mssonicbld
Copy link
Copy Markdown
Collaborator

⚠️ Notice: /azpw run only runs failed jobs now. If you want to trigger a whole pipline run, please rebase your branch or close and reopen the PR.
💡 Tip: You can also use /azpw retry to retry failed jobs directly.

Retrying failed(or canceled) jobs...

@auspham auspham closed this May 20, 2026
@auspham auspham reopened this May 20, 2026
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@auspham
Copy link
Copy Markdown
Contributor Author

auspham commented May 21, 2026

should be fixed by sonic-net/sonic-buildimage#26912

@auspham
Copy link
Copy Markdown
Contributor Author

auspham commented May 21, 2026

Stale analysis, already fixed.

@auspham auspham closed this May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants