
Conversation

Contributor

@alandau alandau commented Feb 10, 2026

Add a `vm/nvme_keepalive_mode` inspect node that shows the current status and can be updated through `ohcldiag-dev`:

```
C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode
"enabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode -u disabled
"disabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode
"disabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode -u enabled
"enabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode
"enabled"

C:\openvmm>ohcldiag-dev.exe uhdiag.sock i vm/nvme_keepalive_mode -u blah
Error: update error: unknown value
```

Add a test for this in `openhcl_servicing.rs` that verifies that the keepalive can be disabled and that servicing completes successfully.
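
For context, here is a minimal, hedged sketch of the parsing and rendering that such a node relies on. It is not the actual `options.rs` code: the `Enabled` variant name, the exact set of accepted strings, and the string rendered by `as_str()` for the private-pool variant are assumptions; only `Disabled`, `DisabledHostAndPrivatePoolPresent`, and the match arms quoted in the review comments below come from this PR.

```rust
// Illustrative sketch only; the real KeepAliveConfig in
// openhcl/underhill_core/src/options.rs has more context and may differ.
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KeepAliveConfig {
    Enabled, // assumed variant name
    Disabled,
    DisabledHostAndPrivatePoolPresent,
}

impl KeepAliveConfig {
    /// Render the value the way `ohcldiag-dev ... i vm/nvme_keepalive_mode` prints it.
    /// The string for the last variant is an assumption.
    fn as_str(&self) -> &'static str {
        match self {
            KeepAliveConfig::Enabled => "enabled",
            KeepAliveConfig::Disabled => "disabled",
            KeepAliveConfig::DisabledHostAndPrivatePoolPresent => "nohost,privatepool",
        }
    }
}

impl FromStr for KeepAliveConfig {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "enabled" => Ok(KeepAliveConfig::Enabled),
            "nohost,privatepool" => Ok(KeepAliveConfig::DisabledHostAndPrivatePoolPresent),
            "nohost,noprivatepool" => Ok(KeepAliveConfig::Disabled),
            x if x == "disabled" || x.starts_with("disabled,") => Ok(KeepAliveConfig::Disabled),
            x => Err(format!("Invalid keepalive config: {x}")),
        }
    }
}

fn main() {
    assert_eq!("disabled".parse::<KeepAliveConfig>(), Ok(KeepAliveConfig::Disabled));
    assert!("blah".parse::<KeepAliveConfig>().is_err());
    println!("{}", KeepAliveConfig::Enabled.as_str());
}
```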
@alandau alandau marked this pull request as ready for review February 10, 2026 22:22
@alandau alandau requested a review from a team as a code owner February 10, 2026 22:22
Copilot AI review requested due to automatic review settings February 10, 2026 22:22
Contributor

Copilot AI left a comment


Pull request overview

Adds a mutable Inspect node for controlling NVMe keepalive behavior in OpenHCL and validates the behavior via a new servicing test.

Changes:

  • Expose vm/nvme_keepalive_mode via Inspect with update support and logging on changes.
  • Extend KeepAliveConfig parsing and add a helper for string rendering.
  • Add an integration test that disables NVMe keepalive through Inspect and verifies servicing triggers NVMe IDENTIFY.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| vmm_tests/vmm_tests/tests/tests/multiarch/openhcl_servicing.rs | Adds a new test exercising disabling NVMe keepalive via Inspect before servicing. |
| openhcl/underhill_core/src/options.rs | Updates KeepAliveConfig to support enabled/disabled strings and introduces as_str(). |
| openhcl/underhill_core/src/dispatch/mod.rs | Adds an Inspect node vm/nvme_keepalive_mode with custom update handling and logging. |
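
To make the "custom update handling and logging" concrete, a rough sketch of the shape such a handler could take is below. It deliberately does not use the real `inspect` crate plumbing from `dispatch/mod.rs`; the type and function names are hypothetical, and only the parse-then-log-on-change flow is meant to mirror the description above.

```rust
// Hypothetical stand-in for the mutable inspect node's update path; the real
// code runs through the `inspect` crate's update machinery.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NvmeKeepaliveMode {
    Enabled,
    Disabled,
}

impl NvmeKeepaliveMode {
    fn as_str(self) -> &'static str {
        match self {
            NvmeKeepaliveMode::Enabled => "enabled",
            NvmeKeepaliveMode::Disabled => "disabled",
        }
    }
}

/// Apply an update requested via `ohcldiag-dev ... -u <value>`: parse the new
/// value, log the transition if it changed, and return the string echoed back.
fn update_keepalive_mode(
    current: &mut NvmeKeepaliveMode,
    requested: &str,
) -> Result<&'static str, String> {
    let new = match requested {
        "enabled" => NvmeKeepaliveMode::Enabled,
        "disabled" => NvmeKeepaliveMode::Disabled,
        other => return Err(format!("Invalid keepalive config: {other}")),
    };
    if new != *current {
        // The PR logs mode changes; OpenHCL uses `tracing` for logging.
        tracing::info!(from = current.as_str(), to = new.as_str(), "nvme keepalive mode changed");
        *current = new;
    }
    Ok(current.as_str())
}

fn main() {
    let mut mode = NvmeKeepaliveMode::Enabled;
    assert_eq!(update_keepalive_mode(&mut mode, "disabled"), Ok("disabled"));
    assert!(update_keepalive_mode(&mut mode, "blah").is_err());
}
```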

```rust
let sh = agent.unix_shell();

// Make sure the disk showed up.
cmd!(sh, "ls /dev/sda").run().await?;
```
Contributor

@gurasinghMS gurasinghMS Feb 10, 2026


nit: I took this for granted from previous tests, but looking at this now got me wondering ... what does this check give us? Petri non-deterministically exposes the disk as sda or sdb, so half the time this does nothing. I wonder if we should either omit this completely or perform a more rigorous check by calling `get_device_paths()` instead (which will probably slow down the test a bit).

Either way this was just food for thought, not meant to block your changes. We can take this up as a future change.

Contributor Author


Good point! Since this line appears in several tests in this file, I guess a follow-up to fix them all makes more sense than fixing just this one.

"nohost,privatepool" => Ok(KeepAliveConfig::DisabledHostAndPrivatePoolPresent),
"nohost,noprivatepool" => Ok(KeepAliveConfig::Disabled),
x if x.starts_with("disabled,") => Ok(KeepAliveConfig::Disabled),
x if x == "disabled" || x.starts_with("disabled,") => Ok(KeepAliveConfig::Disabled),
Contributor Author


(I think) I don't want to allow `disabledblah`.
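
A tiny illustration of the difference (not the actual parser): a bare `starts_with("disabled")` would also accept typos like `disabledblah`, while requiring either an exact match or the `"disabled,"` prefix rejects them.

```rust
// Minimal demo of the guard above; the suffix after "disabled," is only illustrative.
fn accepts(s: &str) -> bool {
    s == "disabled" || s.starts_with("disabled,")
}

fn main() {
    assert!(accepts("disabled"));
    assert!(accepts("disabled,extra"));
    assert!(!accepts("disabledblah")); // a plain starts_with("disabled") would accept this
}
```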

```rust
let mut flags = config.default_servicing_flags();
// Enable keepalive, then disable it later via inspect
flags.enable_nvme_keepalive = true;
flags.enable_mana_keepalive = false;
```
Contributor


Could you document why MANA keepalive is being disabled here? IIRC, `enable_mana_keepalive` needs to be false because the flags are entangled with each other at the `DmaManager` level, i.e. if either of the flags is enabled, the DMA manager will save before either `nvme::save()` or `mana::save()`.

Also, do we have any planned work items to fix this behavior in the future?


Contributor


I haven't had the time to go and fully verify, but I believe MANA assumes that its DMA buffers will have been freed by the time we get to saving the DmaClient's allocations, which is why this check is written as it is. I'm not sure if that's true for NVMe, but I suspect it isn't, which leads to the unrestored NVMe allocations when MANA keepalive is on and NVMe keepalive is off.

I'm not sure what the right solution here is. Should DmaClient have a per-device allocation save/restore so that we can save and restore independently of each other? Or should we make an nvme change to make sure its DMA buffers are freed by the time we get here?

Contributor Author


I don't have answers to your questions, and I'm happy to discuss further. That said, those answers are needed for the 1.8/main branch. In this PR we'd like the smallest change that can go into 1.7 to support disabling NVMe keepalive through inspect (since AH24 doesn't have other ways to disable it). For this test, just disabling MANA keepalive works :-) since it's also disabled by default in 1.6/1.7.

Contributor


@justus-camp-microsoft I am not sure I follow here. Why does MANA assume that the DMA buffers have been freed before save is called?

Contributor


We discussed this offline. I agree that we need to rationalize this, but we can defer that work to a (quick) follow-up. For now I'm going to sign off on this change to get the targeted fix in. Alex will follow up on the main and future changes.


@mattkur mattkur merged commit 458c128 into microsoft:main Feb 11, 2026
79 checks passed
alandau added a commit that referenced this pull request Feb 12, 2026

Clean cherry pick of PR #2789

Contributor Author

alandau commented Feb 12, 2026

Backported to release/1.7.2511 in #2800

@alandau alandau added the backported_1.7.2511 label (PR that has been backported to release/1.7.2511) and removed the backport_1.7.2511 label Feb 12, 2026