
feat: add active ballooning reclaim controller #160

Merged
sjmiller609 merged 24 commits into main from codex/active-ballooning
Mar 23, 2026

Conversation

@sjmiller609
Collaborator

@sjmiller609 sjmiller609 commented Mar 19, 2026

Summary

  • add a host-side active ballooning controller in lib/guestmemory with pressure sampling, proportional reclaim, protected floors, and manual reclaim holds
  • expose POST /resources/memory/reclaim, wire the controller through the API startup path, and document/configure the new hypervisor.memory.active_ballooning settings
  • extend the existing guest-memory integration tests for Cloud Hypervisor, QEMU, Firecracker, and VZ to validate runtime balloon targets and manual reclaim flows

Validation

  • go test ./lib/guestmemory -count=1
  • go test ./cmd/api/api -run 'TestReclaimMemory_' -count=1
  • make test-guestmemory-vz
  • make test-guestmemory-linux on deft-kernel-dev from /home/sjmiller609/codex-active-ballooning-plan/hypeman

Note

Medium Risk
Introduces a new background control loop that can change VM memory balloon targets across hypervisors and adds a new API endpoint; mistakes could cause unexpected VM memory pressure or reclaim behavior.

Overview
Adds an active guest-memory ballooning controller (lib/guestmemory) that samples host pressure (Linux /proc + PSI, macOS vm_stat/memory_pressure), computes a reclaim target with hysteresis/protected floors, and applies proportional runtime balloon target changes with cooldown/step limits, plus metrics/tracing/logging.

Exposes a new manual reclaim API POST /resources/memory/reclaim (with hold_for, dry_run, and feature-disabled vs internal error handling), wires the controller into API DI/startup, and extends config with hypervisor.memory.active_ballooning defaults + validation.

Enables runtime balloon control across hypervisors by extending the hypervisor.Hypervisor interface (Set/GetTargetGuestMemoryBytes + SupportsBalloonControl), implementing it for Cloud Hypervisor/QEMU/Firecracker/VZ (including vz-shim balloon endpoints), and updates integration tests/Makefile to cover runtime target changes and reduce CI flakiness (timeouts, per-test runs, better artifact logging, PID resolution).
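One way to picture "proportional reclaim with protected floors" is the sketch below. It is not the controller from lib/guestmemory: the `candidate` type, `planTargets`, and the choice to split the reclaim amount in proportion to each VM's headroom above its floor are all assumptions made for illustration, and cooldowns, step limits, and hysteresis are left out.

```go
package main

import (
	"fmt"
	"math/bits"
)

// candidate is a hypothetical stand-in for a VM that supports balloon control.
type candidate struct {
	Name     string
	Assigned uint64 // bytes normally assigned to the guest
	Floor    uint64 // protected floor; a target never drops below this
}

// headroom returns how many bytes could be reclaimed without crossing the floor.
func headroom(c candidate) uint64 {
	if c.Assigned > c.Floor {
		return c.Assigned - c.Floor
	}
	return 0
}

// planTargets splits reclaim across candidates in proportion to their headroom
// and returns the resulting balloon target per VM (assumed policy, see lead-in).
func planTargets(cands []candidate, reclaim uint64) map[string]uint64 {
	var total uint64
	for _, c := range cands {
		total += headroom(c)
	}
	if reclaim > total {
		reclaim = total // never plan to reclaim past the protected floors
	}
	targets := make(map[string]uint64, len(cands))
	for _, c := range cands {
		targets[c.Name] = c.Assigned
		h := headroom(c)
		if h == 0 {
			continue
		}
		// 128-bit multiply/divide avoids overflow in reclaim*h for large byte counts.
		hi, lo := bits.Mul64(reclaim, h)
		share, _ := bits.Div64(hi, lo, total)
		targets[c.Name] = c.Assigned - share
	}
	return targets
}

func main() {
	const gib = 1 << 30
	cands := []candidate{
		{Name: "vm-a", Assigned: 4 * gib, Floor: 1 * gib},
		{Name: "vm-b", Assigned: 2 * gib, Floor: 1 * gib},
	}
	for name, target := range planTargets(cands, 2*gib) {
		fmt.Printf("%s -> %d bytes\n", name, target)
	}
}
```

With 2 GiB requested, vm-a (3 GiB of headroom) gives up three times as much as vm-b (1 GiB of headroom), and a request larger than the combined headroom stops at the floors.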

Written by Cursor Bugbot for commit 6651859. This will update automatically on new commits.

@github-actions

github-actions bot commented Mar 19, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: add active ballooning reclaim controller
hypeman-openapi

Your SDK build had at least one "note" diagnostic.
generate ✅

hypeman-go

Your SDK build had at least one "note" diagnostic.
generate ✅ build ⏭️ lint ✅ test ✅

go get github.com/stainless-sdks/hypeman-go@b8ecb541b8237d1283a05422eed6e30ff2b0536f
⚠️ hypeman-typescript

Your SDK build had at least one "error" diagnostic.
generate ❗ build ✅ lint ✅ test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/d5dc0286ee9626e5881a550cf74bd569416f65c2/dist.tar.gz

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-03-23 16:50:37 UTC

@sjmiller609 sjmiller609 marked this pull request as ready for review March 20, 2026 14:12
@sjmiller609
Collaborator Author

Validated the feature end to end on deft-kernel-dev with fresh uncached runs.

What I checked:

  • Booted real VMs under Cloud Hypervisor, QEMU, and Firecracker on deft
  • Verified the guest-side helper came up far enough for runtime memory control to work
  • Verified Hypeman could read each VM's current runtime memory target over the real hypervisor control interface
  • Verified host-triggered reclaim changed the balloon target, respected the protected floor, and returned the VM to its normal target after clearing reclaim
  • Reran the host-side controller/API checks for proactive reclaim and pressure-state behavior on deft

The main issue I had to fix was that the manual Linux path could reuse a stale non-Linux embedded guest-agent artifact after syncing from this laptop, which caused the guest to shut down early with exec format error. I also tightened Linux runtime PID handling around the manual guest-memory validation path so the tests follow the live VMM process more reliably.

	}
	return HostPressureStatePressure
default:
	if availablePercent <= highWatermark || sample.Stressed {

Pressure hysteresis uses <= where < matches docs

Low Severity

In nextPressureState, the healthy→pressure transition triggers when availablePercent <= highWatermark, so available memory exactly at the high watermark (e.g., 10%) enters the pressure state. The pressure→healthy exit condition uses >= lowWatermark. Entering pressure at exactly the threshold may cause unnecessary pressure transitions on hosts hovering near the boundary, which works against the hysteresis intent of avoiding flapping at thresholds.
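For readers following the boundary discussion, here is a minimal sketch of a two-threshold hysteresis band. It is not the nextPressureState from this PR: the `state` type, the `next` function, its parameters, and the `!stressed` requirement on exit are assumptions. Following the naming in the comment above, `high` is the entry threshold (a smaller available percentage) and `low` is the exit threshold (a larger one).

```go
package main

import "fmt"

// state is a hypothetical two-value pressure state.
type state int

const (
	healthy state = iota
	pressure
)

// next applies a hysteresis band to the available-memory percentage.
// Entry uses <= high, mirroring the quoted snippet; exit uses >= low.
func next(cur state, availablePercent, high, low float64, stressed bool) state {
	switch cur {
	case pressure:
		// Leave pressure only once memory has recovered past the exit threshold.
		// Requiring !stressed here is an assumption of this sketch.
		if availablePercent >= low && !stressed {
			return healthy
		}
		return pressure
	default:
		if availablePercent <= high || stressed {
			return pressure
		}
		return healthy
	}
}

func main() {
	cur := healthy
	// Walk a host hovering around a 10% entry / 15% exit band.
	for _, pct := range []float64{12, 10, 12, 14, 15} {
		cur = next(cur, pct, 10, 15, false)
		fmt.Printf("available=%v%% state=%d\n", pct, cur)
	}
}
```

Switching the entry comparison from `<=` to `<` changes the result only for a sample that lands exactly on `high`; samples between the two thresholds keep the previous state either way, which is what limits flapping.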


sjmiller609 and others added 2 commits March 21, 2026 00:27
- Change example YAML byte-size values from MB (decimal SI, 10^6) to
  MiB (binary, 2^20) so they match the Go default config which uses raw
  binary byte counts (e.g. 536870912 = 512 MiB).
- Remove dead strings.TrimSuffix call in parseDarwinVMStatOutput; the
  " bytes)" suffix is never present after strings.Fields splits on
  whitespace.

Addresses remaining Cursor Bugbot review findings on PR #160.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The c2h5oh/datasize library does not support MiB (binary IEC) suffixes.
Use raw byte counts (e.g. 536870912 = 512*1024*1024) to match the Go
default config and avoid the ~5% discrepancy from using MB (decimal SI).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
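The arithmetic behind the two commit messages above can be checked with a few lines of Go. This is only a worked example; the constant and function names are made up for illustration and nothing here touches the c2h5oh/datasize library.

```go
package main

import "fmt"

const (
	mb  = 1000 * 1000 // decimal SI megabyte (10^6 bytes)
	mib = 1024 * 1024 // binary IEC mebibyte (2^20 bytes)
)

// excessPercent reports how much larger n MiB is than n MB, as a percentage
// of the MB value. The ratio is the same for every n.
func excessPercent(n uint64) float64 {
	si := float64(n * mb)
	iec := float64(n * mib)
	return (iec - si) / si * 100
}

func main() {
	fmt.Println(512 * mib) // raw byte count used in the Go default config
	fmt.Println(512 * mb)  // what a decimal "512MB" would mean
	fmt.Printf("%.2f%%\n", excessPercent(512))
}
```

512 MiB is 536870912 bytes and 512 MB is 512000000 bytes, a gap of a little under 5%, which is the discrepancy the second commit message refers to.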

@masnwilliams masnwilliams left a comment

solid feature: well-structured controller with good observability and test infrastructure. left a few comments, mostly nits. the scope question (comment 8) is the main one worth a look.


if req.force && !req.dryRun {
	switch {
	case req.requestedReclaim <= 0 && req.holdFor <= 0:

nit: this branch and the one below (requestedReclaim > 0 && holdFor <= 0) do the exact same thing: both clear the hold. could collapse to a single if req.holdFor <= 0 { clear } branch.

sampleSpan.SetStatus(codes.Error, err.Error())
sampleSpan.End()
c.recordReconcileError(ctx, trigger, start, span, err)
return ManualReclaimResponse{}, err

consider wrapping this: fmt.Errorf("sample host pressure: %w", err). otherwise callers (especially the API handler) won't have context on what failed.
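A small sketch of what the suggested wrapping buys the caller. The sentinel error and both function names are hypothetical stand-ins, not code from this PR; only the fmt.Errorf format string comes from the comment above.

```go
package main

import (
	"errors"
	"fmt"
)

// errSamplerUnavailable is a hypothetical sentinel returned by the sampler.
var errSamplerUnavailable = errors.New("sampler unavailable")

// sampleHostPressure stands in for the real sampler and always fails here.
func sampleHostPressure() error { return errSamplerUnavailable }

// reconcile wraps the sampler error with %w so the message says which step
// failed while the original error stays reachable through errors.Is/As.
func reconcile() error {
	if err := sampleHostPressure(); err != nil {
		return fmt.Errorf("sample host pressure: %w", err)
	}
	return nil
}

func main() {
	err := reconcile()
	fmt.Println(err)
	fmt.Println(errors.Is(err, errSamplerUnavailable))
}
```

Because `%w` is used rather than `%v`, an API handler can still branch on the underlying cause with errors.Is while logging the more descriptive message.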

func ResolveProcessPID(socketPath string) (int, error) {
	socketRef, err := socketRefForPath(socketPath)
	if err == nil {
		if pid, err := pidBySocketRef(socketRef); err == nil {

the := here shadows the outer err from socketRefForPath. if pidBySocketRef fails, that error is lost, and downstream the if err != nil check on line ~28 references the pidByCmdline error instead. the final return 0, fmt.Errorf(...) also appears to be dead code since err != nil is always true at that point. consider capturing errors explicitly to preserve diagnostic info.
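One possible shape for "capturing errors explicitly" is sketched below. This is not the fix that landed in socket_pid_linux.go: the `lookup` type, `resolvePID`, and the helper constructors are hypothetical, and errors.Join requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// lookup is a hypothetical PID lookup strategy (by socket ref, by cmdline, ...).
type lookup func(socketPath string) (int, error)

// failing and fixed build stub lookups for the demonstration in main.
func failing(msg string) lookup {
	return func(string) (int, error) { return 0, errors.New(msg) }
}

func fixed(pid int) lookup {
	return func(string) (int, error) { return pid, nil }
}

// resolvePID tries the primary lookup, then the fallback. Each error gets its
// own variable, so nothing is shadowed and both causes reach the caller.
func resolvePID(socketPath string, bySocket, byCmdline lookup) (int, error) {
	pid, socketErr := bySocket(socketPath)
	if socketErr == nil {
		return pid, nil
	}
	pid, cmdlineErr := byCmdline(socketPath)
	if cmdlineErr == nil {
		return pid, nil
	}
	return 0, fmt.Errorf("resolve pid for %s: %w", socketPath,
		errors.Join(socketErr, cmdlineErr))
}

func main() {
	pid, err := resolvePID("/run/vm-1.sock", failing("no socket ref"), fixed(42))
	fmt.Println(pid, err)

	_, err = resolvePID("/run/vm-1.sock", failing("no socket ref"), failing("no cmdline match"))
	fmt.Println(err)
}
```

Naming the errors after the step that produced them makes shadowing impossible by construction, and the joined error keeps the socket-lookup failure visible even when the fallback also fails.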

if err != nil || len(cmdline) == 0 {
	continue
}
if strings.Contains(string(cmdline), socketPath) {

nit: substring match means /run/vm-1.sock would also match a process with /run/vm-10.sock in its cmdline. low risk since this is a fallback path, but splitting on \0 and checking exact argument match would be more precise.
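The exact-argument check suggested above could look roughly like this. `cmdlineHasArg` is a hypothetical helper, not the pidByCmdline from this PR; it relies only on the fact that /proc/<pid>/cmdline separates arguments with NUL bytes.

```go
package main

import (
	"bytes"
	"fmt"
)

// cmdlineHasArg reports whether the NUL-separated cmdline contains want as a
// complete argument, rather than as a substring of a longer argument.
func cmdlineHasArg(cmdline []byte, want string) bool {
	for _, arg := range bytes.Split(cmdline, []byte{0}) {
		if string(arg) == want {
			return true
		}
	}
	return false
}

func main() {
	// Example bytes in the layout of /proc/<pid>/cmdline; the flag name is made up.
	other := []byte("vmm\x00--api-sock\x00/run/vm-10.sock\x00")
	fmt.Println(cmdlineHasArg(other, "/run/vm-1.sock"))

	mine := []byte("vmm\x00--api-sock\x00/run/vm-1.sock\x00")
	fmt.Println(cmdlineHasArg(mine, "/run/vm-1.sock"))
}
```

An exact match would miss a path passed in a combined form such as `--api-sock=/run/vm-1.sock`, so a real implementation may also need to handle that spelling if the VMM accepts it.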

	return kb * 1024
}

func assertActiveBallooningLifecycleVZ(t *testing.T, ctx context.Context, inst *Instance) {

this is identical to assertActiveBallooningLifecycle in the linux test file (~60 lines). consider extracting to the shared guestmemory_active_ballooning_test_helpers_test.go to avoid drift.

"GET /resources": ResourceRead,
"GET /health": ResourceRead,
"GET /resources": ResourceRead,
"POST /resources/memory/reclaim": ResourceRead,

this maps a mutating POST to ResourceRead, while every other POST in this file uses a write scope. intentional (since no ResourceWrite exists yet), or should a write scope be added?

@sjmiller609 sjmiller609 marked this pull request as draft March 23, 2026 15:04
sjmiller609 and others added 2 commits March 23, 2026 15:07
- Clamp Low watermark to High+1 instead of cascading reset to defaults
- Collapse duplicate hold-clearing branches in reconcile
- Add lastApplied pruning for VMs no longer in candidate list
- Wrap sampler error with context for better caller diagnostics
- Fix err shadowing in ResolveProcessPID (socket_pid_linux.go)
- Use exact arg match in pidByCmdline to avoid substring false positives
- Extract assertActiveBallooningLifecycle to shared test helpers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sjmiller609 sjmiller609 marked this pull request as ready for review March 23, 2026 15:15
The reclaim endpoint is a mutating POST but was mapped to ResourceRead.
Add ResourceWrite scope and use it for the reclaim route, consistent
with every other POST endpoint using a write scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

There are 5 total unresolved issues (including 2 from previous reviews).

@sjmiller609 sjmiller609 marked this pull request as draft March 23, 2026 15:43
@sjmiller609 sjmiller609 marked this pull request as ready for review March 23, 2026 15:56
@sjmiller609 sjmiller609 merged commit 75c3289 into main Mar 23, 2026
5 checks passed
@sjmiller609 sjmiller609 deleted the codex/active-ballooning branch March 23, 2026 16:46