keepalive: Add inspect command to disable NVMe keepalives #2789
Conversation
Add a `vm/nvme_keepalive_mode` inspect node. It shows the current status and can be updated through `ohcldiag-dev`:

```
C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode
"enabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode -u disabled
"disabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode
"disabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode -u enabled
"enabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode
"enabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode -u blah
Error: update error: unknown value
```

Add a test for this in `openhcl_servicing.rs` that verifies that the keepalive can be disabled and servicing completes successfully.
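For readers following along, here is a minimal sketch of the string handling that could back such a node. The `Disabled` variant and the "enabled"/"disabled" strings come from this PR's description and hunks, but the `Enabled` variant, the `FromStr` shape, and the error type are assumptions for illustration, not the PR's actual code:

```rust
use std::str::FromStr;

// Sketch only: the real KeepAliveConfig in openhcl/underhill_core/src/options.rs
// has more variants (e.g. DisabledHostAndPrivatePoolPresent) and parsing rules.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
enum KeepAliveConfig {
    Enabled,
    Disabled,
}

impl KeepAliveConfig {
    /// Helper for rendering the current mode back to the inspect client.
    fn as_str(&self) -> &'static str {
        match self {
            KeepAliveConfig::Enabled => "enabled",
            KeepAliveConfig::Disabled => "disabled",
        }
    }
}

impl FromStr for KeepAliveConfig {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "enabled" => Ok(KeepAliveConfig::Enabled),
            "disabled" => Ok(KeepAliveConfig::Disabled),
            // Anything else (e.g. "blah") is rejected, which surfaces as the
            // "update error" shown in the transcript above.
            other => Err(format!("unknown value: {other}")),
        }
    }
}
```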
Pull request overview
Adds a mutable Inspect node for controlling NVMe keepalive behavior in OpenHCL and validates the behavior via a new servicing test.
Changes:

- Expose `vm/nvme_keepalive_mode` via Inspect with update support and logging on changes.
- Extend `KeepAliveConfig` parsing and add a helper for string rendering.
- Add an integration test that disables NVMe keepalive through Inspect and verifies servicing triggers NVMe IDENTIFY.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `vmm_tests/vmm_tests/tests/tests/multiarch/openhcl_servicing.rs` | Adds a new test exercising disabling NVMe keepalive via Inspect before servicing. |
| `openhcl/underhill_core/src/options.rs` | Updates `KeepAliveConfig` to support enabled/disabled strings and introduces `as_str()`. |
| `openhcl/underhill_core/src/dispatch/mod.rs` | Adds an Inspect node `vm/nvme_keepalive_mode` with custom update handling and logging. |
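As a rough illustration of the "custom update handling and logging" described for `dispatch/mod.rs`: the update path might boil down to something like the sketch below. The helper name is made up, the Inspect plumbing itself is omitted, and it assumes the `KeepAliveConfig` parsing/`as_str()` helpers described above plus a `tracing` dependency:

```rust
// Hypothetical helper: parse the requested mode, apply it, and log the change.
fn update_keepalive_mode(
    current: &mut KeepAliveConfig,
    requested: &str,
) -> Result<&'static str, String> {
    let new: KeepAliveConfig = requested.parse()?;
    if *current != new {
        tracing::info!(
            old = current.as_str(),
            new = new.as_str(),
            "nvme_keepalive_mode updated via inspect"
        );
        *current = new;
    }
    Ok(current.as_str())
}
```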
```rust
let sh = agent.unix_shell();

// Make sure the disk showed up.
cmd!(sh, "ls /dev/sda").run().await?;
```
nit: I took this for granted from previous tests, but looking at this now got me wondering: what does this check give us? Petri non-deterministically exposes the disk as sda or sdb, so half the time this does nothing. I wonder if we should either omit it completely or perform a more rigorous check by calling get_device_paths() instead (which will probably slow down the test a bit).
Either way, this was just food for thought, not meant to block your changes. We can take this up as a future change.
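One possible shape for a slightly more robust check, staying within the pattern the test already uses (this is only a sketch of the idea, not the get_device_paths() approach mentioned above):

```rust
// Check both device names Petri may assign rather than assuming /dev/sda.
let sda = cmd!(sh, "ls /dev/sda").run().await.is_ok();
let sdb = cmd!(sh, "ls /dev/sdb").run().await.is_ok();
assert!(sda || sdb, "expected the test disk at /dev/sda or /dev/sdb");
```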
Good point! Since this line appears in several tests in this file, I guess a follow-up to fix them all makes more sense than fixing just this one.
| "nohost,privatepool" => Ok(KeepAliveConfig::DisabledHostAndPrivatePoolPresent), | ||
| "nohost,noprivatepool" => Ok(KeepAliveConfig::Disabled), | ||
| x if x.starts_with("disabled,") => Ok(KeepAliveConfig::Disabled), | ||
| x if x == "disabled" || x.starts_with("disabled,") => Ok(KeepAliveConfig::Disabled), |
(I think) I don't want to allow `disabledblah`.
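A small unit-test sketch of the behavior being discussed, assuming the `Ok(...)` arms above live in a `FromStr` impl for `KeepAliveConfig` and that no later catch-all arm accepts other strings (illustrative only, not part of the PR):

```rust
#[test]
fn keepalive_config_parsing() {
    // Bare "disabled" parses with the new guard.
    assert!("disabled".parse::<KeepAliveConfig>().is_ok());
    // Per the quoted guard, any "disabled,<suffix>" also maps to Disabled.
    assert!("disabled,noprivatepool".parse::<KeepAliveConfig>().is_ok());
    // A run-on like "disabledblah" (no comma) should be rejected.
    assert!("disabledblah".parse::<KeepAliveConfig>().is_err());
    // Completely unknown values fail, matching the CLI transcript above.
    assert!("blah".parse::<KeepAliveConfig>().is_err());
}
```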
```rust
let mut flags = config.default_servicing_flags();
// Enable keepalive, then disable it later via inspect
flags.enable_nvme_keepalive = true;
flags.enable_mana_keepalive = false;
```
Could you document why mana keepalive is being disabled here? IIRC, `enable_mana_keepalive` needs to be false because the flags are entangled with each other at the DmaManager level, i.e. if either of the flags is enabled, the DMA manager will save before either nvme::save() or mana::save().
Also, do we have any planned work items to fix this behavior in the future?
Good point, will do.
Yes: https://microsoft.visualstudio.com/OS/_workitems/edit/61014151
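For reference, a commented version of the quoted snippet capturing the reasoning from this thread might look like the following (the wording is a sketch based on the discussion, not the PR's final comment):

```rust
let mut flags = config.default_servicing_flags();
// Enable NVMe keepalive so the test can then disable it via inspect.
flags.enable_nvme_keepalive = true;
// Keep MANA keepalive off: the two keepalive flags are currently entangled at
// the DmaManager level (if either is set, the DMA manager saves state before
// nvme::save()/mana::save()), and MANA keepalive is also disabled by default
// in the 1.6/1.7 releases. Tracked for cleanup in work item 61014151.
flags.enable_mana_keepalive = false;
```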
I haven't had the time to go and fully verify, but I believe MANA assumes that its DMA buffers will have been freed by the time we get to saving the DmaClient's allocations, which is why this check is written as it is. I'm not sure whether that's true for NVMe, but I suspect it isn't, leading to the unrestored NVMe allocations when MANA keepalive is on and NVMe keepalive is off.
I'm not sure what the right solution is here. Should DmaClient have per-device allocation save/restore so that devices can be saved and restored independently of each other? Or should we make an NVMe change to ensure its DMA buffers are freed by the time we get here?
I don't have answers to your questions, and I'm happy to discuss further. That said, those answers are needed for the 1.8/main branch. In this PR we'd like the smallest change that can go into 1.7 to support disabling NVMe keepalives through inspect (since AH24 doesn't have other ways to disable it). For this test, just disabling MANA KA works :-) since it's also disabled by default in 1.6/1.7.
@justus-camp-microsoft I am not sure I follow here. Why does MANA assume that the DMA buffers have been freed before save was called?
We discussed this offline. I agree that we need to rationalize this, but we can defer that work to a (quick) follow-up. For now I'm going to sign off on this change to get the targeted fix in. Alex will follow up on the main and future changes.
Clean cherry pick of PR #2789.

Add a `vm/nvme_keepalive_mode` inspect node. It shows the current status and can be updated through `ohcldiag-dev`:

```
C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode
"enabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode -u disabled
"disabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode
"disabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode -u enabled
"enabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode
"enabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode -u blah
Error: update error: Invalid keepalive config: blah
```

Add a test for this in `openhcl_servicing.rs` that verifies that the keepalive can be disabled and servicing completes successfully.
Backported to release/1.7.2511 in #2800