Harden arm64 VM tests: bump systemd device timeout to 600s#749

Merged
vinceaperri merged 1 commit into main from user/vinceaperri/harden-osmodifier-arm64-device-timeout on May 29, 2026
@vinceaperri vinceaperri commented May 28, 2026

Fixes flaky failures across multiple arm64 VM test jobs that were hitting an ESP device timeout during guest boot and cascading into emergency mode. Reproductions seen on two consecutive release runs:

  • osmodifier Ubuntu24.04 ARM64 (run 26543985458, job 78192335242)
    • all 6 osmodifier AZL4 tests failed with "Unable to connect to port 22" after the boot fell into emergency.target
  • VMTests imagecustomizer Ubuntu24.04 ARM64 (run 26545094754, job 78195779954)
    • test_min_change_efi_azl4_qcow_output failed identically on the same ESP UUID (B3A5-CBA1) with the same boot-time cascade

Root cause

arm64 VM tests run a guest qemu/libvirt VM on an arm64 Ubuntu 24.04 runner and boot the guest much more slowly than the x86_64 jobs. On a noisy or slow runner the AZL4 image can take well over 90 seconds before udev publishes the /dev/disk/by-uuid/<ESP-UUID> symlink (kernel time ~187s observed in the imagecustomizer failure).

systemd's default DefaultDeviceTimeoutSec=90s is too short for that. When it expires the failure cascades:

  dev-disk-by\x2duuid-...device  (timeout)
    -> boot-efi.mount             (dependency failed)
    -> local-fs.target            (failed; ESP fstab entry lacks nofail)
    -> emergency.target           (system drops to rescue shell)
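
For readers wondering where the \x2d in the device unit name comes from: systemd derives the unit name from the device path by escaping it, turning "/" into "-" and a literal "-" into \x2d. The sketch below is a simplified approximation of `systemd-escape --path` for plain absolute paths. The helper names and the UUID 1234-ABCD are hypothetical placeholders, not values from the failing runs.

```python
import string

# Characters systemd leaves unescaped in unit names (besides "/" handling).
_ALLOWED = set(string.ascii_letters + string.digits + ":_.")


def systemd_escape_path(path: str) -> str:
    """Approximate `systemd-escape --path` for simple absolute paths.

    Simplification: does not collapse repeated slashes or resolve "..".
    """
    trimmed = path.strip("/")
    if not trimmed:
        return "-"  # the root path escapes to a single dash
    out = []
    for i, ch in enumerate(trimmed):
        if ch == "/":
            out.append("-")
        elif ch in _ALLOWED and not (i == 0 and ch == "."):
            out.append(ch)
        else:
            # Everything else, including "-", becomes \xNN per byte.
            out.extend(f"\\x{b:02x}" for b in ch.encode("utf-8"))
    return "".join(out)


def device_unit_for(path: str) -> str:
    """Name of the .device unit systemd waits on for a given device path."""
    return systemd_escape_path(path) + ".device"


if __name__ == "__main__":
    # Hypothetical UUID, used only to illustrate the escaping.
    print(device_unit_for("/dev/disk/by-uuid/1234-ABCD"))
```

This is the unit whose timeout is governed by DefaultDeviceTimeoutSec, which is why it appears at the head of the cascade.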

systemd never reaches multi-user.target, sshd never starts, and the test's SSH retry loop (~360s) eventually gives up with "Unable to connect to port 22". AZL3 happens to win this race today because its boot is a bit faster, but it is on the same cliff edge.

Fix

On arm64 hosts only, inject systemd.default_device_timeout_sec=600 into the test image's kernel cmdline via image-customizer's os.kernelCommandLine.extraCommandLine. The bump is gated on platform.machine() == "aarch64", matching the existing convention in the VM-test helpers (see libvirt_vm.py and libvirt_utils.py).

This is applied centrally in add_ssh_to_config, which every VM test already calls to prepare its config, so the hardening covers all current and future arm64 VM tests built on the helper:

  • osmodifier VMTests (test/vmtests/vmtests/osmodifier/...)
  • imagecustomizer VMTests (test_create.py, test_min_change.py)

x86_64 VM tests have not exhibited this flake, so the default systemd behaviour is left in place there. This change only affects test images at test time; product image behaviour is unchanged.
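
A minimal sketch of the arm64-only gating described above, assuming the image-customizer config is held as a nested dict and that extraCommandLine is a list of strings. The function name add_arm64_device_timeout and the optional machine parameter are hypothetical and exist only for illustration; the actual change lives inside add_ssh_to_config and may be structured differently.

```python
import platform
from typing import Optional

DEVICE_TIMEOUT_ARG = "systemd.default_device_timeout_sec=600"


def add_arm64_device_timeout(config: dict, machine: Optional[str] = None) -> dict:
    """Append the device-timeout kernel arg on aarch64 hosts only.

    `machine` is overridable so the gating can be exercised on any host;
    by default it comes from platform.machine().
    """
    machine = platform.machine() if machine is None else machine
    if machine != "aarch64":
        # x86_64 and other hosts keep systemd's default behaviour.
        return config

    # Assumed config shape: os.kernelCommandLine.extraCommandLine is a list.
    os_cfg = config.setdefault("os", {})
    kernel_cmdline = os_cfg.setdefault("kernelCommandLine", {})
    extra = kernel_cmdline.setdefault("extraCommandLine", [])
    if DEVICE_TIMEOUT_ARG not in extra:
        extra.append(DEVICE_TIMEOUT_ARG)
    return config


if __name__ == "__main__":
    print(add_arm64_device_timeout({}, machine="aarch64"))
    print(add_arm64_device_timeout({}, machine="x86_64"))
```

Checking for the argument before appending keeps the helper idempotent, so calling it more than once on the same config does not duplicate the kernel argument.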

@vinceaperri vinceaperri requested a review from a team as a code owner May 28, 2026 00:50
@vinceaperri vinceaperri requested a review from Copilot May 28, 2026 00:52
Copilot AI left a comment

Pull request overview

Hardens arm64 VM test images against slow guest boots by increasing systemd’s default device-unit timeout via a kernel command-line override, avoiding cascades into emergency.target that prevent SSH from coming up.

Changes:

  • Add platform detection in VM-test imagecustomizer helper.
  • On aarch64 hosts, inject systemd.default_device_timeout_sec=600 into os.kernelCommandLine.extraCommandLine when preparing test images for SSH-based VM testing.
  • Update the helper’s header comment to reflect the additional arm64-only hardening behavior.

@vinceaperri vinceaperri merged commit 069f5af into main May 29, 2026
27 checks passed
@vinceaperri vinceaperri deleted the user/vinceaperri/harden-osmodifier-arm64-device-timeout branch May 29, 2026 14:48

3 participants