feat(krun): raise IRQ cap, fix GIC nr_irqs, and make split irqchip opt-in #46
Raise IRQ_MAX from 15 to 223 on x86_64 and from 159 to 223 on aarch64 to support 136+ virtio-MMIO devices needed for block-backed EROFS OCI rootfs (one block device per OCI layer).
- Raise IOAPIC_NUM_PINS from 24 to 256 in the userspace split irqchip to match the new IRQ range on x86_64
- Always use split irqchip on x86_64 since KVM's in-kernel IOAPIC is hardcoded to 24 pins (KVM_IOAPIC_NUM_PINS) and cannot be changed
- Fix GIC nr_irqs calculation in kvmgicv2 and kvmgicv3: KVM interprets nr_irqs as total interrupts including 32 private ones (SGIs + PPIs), so the old `IRQ_MAX - IRQ_BASE + 1` formula under-allocated SPIs
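The GIC under-allocation described in this commit can be illustrated with a little arithmetic. The following is a minimal sketch, not the code in kvmgicv2/kvmgicv3: the function names are hypothetical, and `IRQ_BASE = 32` on aarch64 is an assumption (it is the first SPI ID, but the PR does not state the constant's value).

```rust
/// Assumption: the first SPI on aarch64 is interrupt ID 32, because IDs
/// 0..=31 are the private interrupts (16 SGIs + 16 PPIs).
const AARCH64_IRQ_BASE: u32 = 32;

/// Old formula quoted in the commit message: counts only the SPI range.
/// KVM reads the result as the TOTAL number of interrupt IDs, private ones
/// included, so this value is 32 too small.
fn nr_irqs_old(irq_base: u32, irq_max: u32) -> u32 {
    irq_max - irq_base + 1
}

/// Hypothetical corrected formula: cover every interrupt ID in 0..=irq_max,
/// then round up to a multiple of 32.
fn nr_irqs_fixed(irq_max: u32) -> u32 {
    (irq_max + 1).div_ceil(32) * 32
}

/// Highest interrupt ID that is valid when KVM is given `nr_irqs` as the
/// total interrupt count (IDs start at 0).
fn highest_valid_irq(nr_irqs: u32) -> u32 {
    nr_irqs - 1
}
```

With the old `IRQ_MAX = 159`, `nr_irqs_old(AARCH64_IRQ_BASE, 159)` gives 128, so the highest valid ID is 127 and SPIs 128..=159 are unusable. With `IRQ_MAX = 223`, `nr_irqs_fixed(223)` gives 224, which covers every ID up to 223.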
You could also use new libkrun VMDK support to:
|
- Fix rustfmt: collapse method chain in builder.rs to single line
- Fix clippy: use .div_ceil(32) instead of manual ((x + 31) / 32) in kvmgicv2 and kvmgicv3
- Fix test: increase cmdline buffer from 4096 to 16384 in test_register_too_many_devices since 219 devices (~10KB) now overflows the old 4KB limit
No longer needed since split irqchip is now always used.
|
@hsiangkao Reading your comment again. Thank you so much for these suggestions. It makes the changes in this PR unnecessary! I knew erofs had a merge feature but wasn't quite sure how to make it work. This is amazing! I never knew VMDK had something like that. I will try it, and maybe I will close this PR.
The previous revision forced userspace split irqchip unconditionally on x86_64 to reach the raised IRQ cap, which changed runtime behavior for every caller, even those fine with the in-kernel IOAPIC's 11-IRQ budget. Restore the mode as opt-in while keeping the raised ceiling available to callers who need it.
- src/arch/src/x86_64/layout.rs: keep IRQ_MAX at 15 for the in-kernel IOAPIC path; add IRQ_MAX_SPLIT = 223 for the userspace split irqchip.
- src/vmm/src/builder.rs: restore the if/else on vm_resources.split_irqchip choosing IoApic vs KvmIoapic and the matching attach_legacy_devices arg. Size the MMIODeviceManager IRQ pool to IRQ_MAX_SPLIT only when split irqchip is selected.
- src/krun/src/api/builders.rs + builder.rs: add MachineBuilder::split_irqchip(bool) and thread it through to vmr.split_irqchip.
- src/devices/src/legacy/ioapic.rs: fix KVM_CAP_SPLIT_IRQCHIP args[0] to match the emulated IOAPIC's pin count (256) instead of the previous hardcoded 24. The old value was inconsistent with IOAPIC_NUM_PINS and only worked because libkrun installs MSI routes rather than pin routes.

aarch64 GIC nr_irqs fix and the aarch64 IRQ_MAX bump from the prior commits remain in place; those stand on their own.
Both lints are pre-existing on main and only surface because the CI runner uses the latest stable toolchain, which is ahead of the local dev toolchain.
- src/devices/src/virtio/snd/worker.rs: drop a useless .into_iter() on the argument to Vec::extend (clippy::useless_conversion in 1.95+).
- src/cpuid/src/common.rs: reorder the std::arch::* imports to put CpuidResult last, matching rustfmt 1.95+'s case-insensitive sort.
|
For context, our filesystem implementation now uses erofs meta and vmdk stitching, but these changes were still needed because a user was hitting the IRQ limit on Linux x86-64.
Yes, but it also depends on when you generate the fsmerge metadata (presumably the OCI layers are already applied to EROFS in parallel, so fsmerge metadata generation could become a bottleneck). If generating fsmerge still takes some time on the critical path and you have no way to redistribute it, you could instead use GPT partitions and mount EROFS in the guest (each EROFS partition mount costs on the order of microseconds, so 100 layers take only milliseconds). In any case, you could benchmark all the alternatives and find which one is best for your scenarios.
Summary
Raise the per-VM IRQ ceiling so callers can attach many virtio-mmio devices (e.g. lots of virtio-fs tags or block-backed OCI rootfs layers), while preserving the existing in-kernel-IOAPIC behavior as the default.
- `IRQ_MAX = 15` for the default in-kernel IOAPIC path (KVM hardcodes `KVM_IOAPIC_NUM_PINS = 24`, giving 11 usable virtio IRQs). Add `IRQ_MAX_SPLIT = 223`, used only when the caller selects the userspace split irqchip, which emulates a 256-pin IOAPIC.
- `MachineBuilder::split_irqchip(bool)`. Callers that need >11 IRQs opt in via the builder; otherwise libkrun behaves exactly as before. Mirrors the existing `krun_split_irqchip` C API.
- `KVM_CAP_SPLIT_IRQCHIP args[0]`. Was hardcoded to 24 while `IOAPIC_NUM_PINS` was 256; now derived from `IOAPIC_NUM_PINS` so the reserved userspace-IOAPIC GSI range matches the emulated device. Functionally harmless before (libkrun installs MSI routes regardless), but the old value was a lie to KVM.
- `IRQ_MAX` bumped 159 → 223. No mode switch needed on aarch64.
- `nr_irqs` correctness fix. `kvmgicv2`/`kvmgicv3` now round up to a multiple of 32 and account for the 32 private interrupts (SGIs + PPIs); the previous `IRQ_MAX - IRQ_BASE + 1` under-allocated SPIs, so any device assigned an IRQ above 127 ended up with an invalid interrupt.

Why opt-in and not always-on
Forcing the userspace split irqchip globally changed runtime behavior for every x86_64 caller, even those fine with the 11-IRQ budget. Keeping it opt-in means zero behavioral change for existing users; callers who legitimately need the higher cap (e.g. microsandbox with many virtio-fs mounts) flip one flag and get 219 usable IRQs.
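To make the opt-in arithmetic concrete, here is a minimal sketch of how a builder flag could select the IRQ pool. It is not the libkrun implementation: the `MachineBuilder` struct below is a hypothetical stand-in, `irq_pool` is an invented helper, and `IRQ_BASE = 5` is an assumption inferred from the "11 usable IRQs" and "219 usable IRQs" figures (15 - 5 + 1 = 11, 223 - 5 + 1 = 219).

```rust
use std::ops::RangeInclusive;

// Values quoted in the PR description.
const IRQ_MAX: u32 = 15;
const IRQ_MAX_SPLIT: u32 = 223;
// Assumption: inferred from the usable-IRQ counts, not stated in the PR.
const IRQ_BASE: u32 = 5;

/// Hypothetical stand-in for the real builder. The flag defaults to
/// `false`, so existing callers keep the in-kernel IOAPIC behavior.
#[derive(Default)]
struct MachineBuilder {
    split_irqchip: bool,
}

impl MachineBuilder {
    /// Opt in to (or out of) the userspace split irqchip.
    fn split_irqchip(mut self, enabled: bool) -> Self {
        self.split_irqchip = enabled;
        self
    }

    /// Range of IRQ numbers handed out to virtio-MMIO devices.
    /// The larger ceiling applies only when split irqchip is selected.
    fn irq_pool(&self) -> RangeInclusive<u32> {
        let max = if self.split_irqchip { IRQ_MAX_SPLIT } else { IRQ_MAX };
        IRQ_BASE..=max
    }
}
```

Under these assumptions, `MachineBuilder::default().irq_pool().count()` yields 11, and `MachineBuilder::default().split_irqchip(true).irq_pool().count()` yields 219.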
Test plan
- `cargo build -p msb_krun` (x86_64 linux, macOS host)
- `cargo fmt --check`, `cargo clippy -p msb_krun_vmm -p msb_krun`
- `.split_irqchip(true)`, `just build && just install`, `msb run alpine`: boots cleanly