Harden arm64 VM tests: bump systemd device timeout to 600s #749
Merged
vinceaperri merged 1 commit on May 29, 2026
Fixes flaky failures across multiple arm64 VM test jobs that were
hitting an ESP device timeout during guest boot and cascading into
emergency mode. Reproductions seen on two consecutive release runs:
* osmodifier Ubuntu24.04 ARM64 (run 26543985458, job 78192335242)
  - all 6 osmodifier AZL4 tests failed with "Unable to connect to
    port 22" after the boot fell into emergency.target
* VMTests imagecustomizer Ubuntu24.04 ARM64 (run 26545094754,
  job 78195779954)
  - test_min_change_efi_azl4_qcow_output failed identically on the
    same ESP UUID (B3A5-CBA1) with the same boot-time cascade
Root cause
----------
The arm64 VM tests boot a guest qemu/libvirt VM on an arm64 Ubuntu
24.04 runner, where the guest boots much more slowly than in the
x86_64 jobs. On a noisy or slow runner the AZL4 image can take well
over 90 seconds before udev publishes the
/dev/disk/by-uuid/<ESP-UUID> symlink (a kernel timestamp of ~187s was
observed in the imagecustomizer failure).
systemd's default of DefaultDeviceTimeoutSec=90s is too short for that.
When it expires, the failure cascades:
    dev-disk-by\x2duuid-...device (timeout)
    -> boot-efi.mount (dependency failed)
    -> local-fs.target (failed; ESP fstab entry lacks nofail)
    -> emergency.target (system drops to emergency shell)
systemd never reaches multi-user.target, sshd never starts, and the
test's SSH retry loop (~360s) eventually gives up with
"Unable to connect to port 22". AZL3 happens to win this race today
because its boot is a bit faster, but it is on the same cliff edge.
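The timing argument above comes down to one comparison per boot: does
the ESP device appear before the device-unit timeout expires? The
sketch below is a toy model of that comparison only, not code from this
PR. The function name `boot_outcome` is hypothetical, and the model
ignores everything else that happens during boot.

```python
def boot_outcome(device_ready_s: float, device_timeout_s: float) -> str:
    """Toy model: which target the guest ends up in.

    device_ready_s   -- seconds until udev publishes the ESP by-uuid symlink
    device_timeout_s -- systemd's device-unit timeout in seconds
    """
    if device_ready_s > device_timeout_s:
        # Device unit times out, boot-efi.mount and local-fs.target fail.
        return "emergency.target"
    # Device appeared in time, so the normal boot can continue.
    return "multi-user.target"


# ~187s observed in the imagecustomizer failure, against the 90s default.
print(boot_outcome(187, 90))   # emergency.target
# The same boot with the timeout raised to 600s.
print(boot_outcome(187, 600))  # multi-user.target
```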
Fix
---
On arm64 hosts only, inject `systemd.default_device_timeout_sec=600`
into the test image's kernel cmdline via image-customizer's
`os.kernelCommandLine.extraCommandLine`. The bump is gated on
`platform.machine() == "aarch64"`, matching the existing convention
in the VM-test helpers (see `libvirt_vm.py` and `libvirt_utils.py`).
This is applied centrally in `add_ssh_to_config`, which every VM test
already calls to prepare its config, so the hardening covers all
current and future arm64 VM tests built on the helper:
* osmodifier VMTests (test/vmtests/vmtests/osmodifier/...)
* imagecustomizer VMTests (test_create.py, test_min_change.py)
x86_64 VM tests have not exhibited this flake, so the default systemd
behaviour is left in place there. This change only affects test
images at test time; product image behaviour is unchanged.
Pull request overview
Hardens arm64 VM test images against slow guest boots by increasing systemd’s default device-unit timeout via a kernel command-line override, avoiding cascades into emergency.target that prevent SSH from coming up.
Changes:
- Add `platform` detection in VM-test imagecustomizer helper.
- On `aarch64` hosts, inject `systemd.default_device_timeout_sec=600` into `os.kernelCommandLine.extraCommandLine` when preparing test images for SSH-based VM testing.
- Update the helper’s header comment to reflect the additional arm64-only hardening behavior.
cwize1 approved these changes on May 28, 2026