feat: add active ballooning reclaim controller#160
Conversation
✱ Stainless preview buildsThis PR will update the ✅ hypeman-go studio · code
|
|
Validated the feature end to end on What I checked:
The main issue I had to fix was that the manual Linux path could reuse a stale non-Linux embedded |
lib/guestmemory/planner.go
Outdated
| } | ||
| return HostPressureStatePressure | ||
| default: | ||
| if availablePercent <= highWatermark || sample.Stressed { |
There was a problem hiding this comment.
Pressure hysteresis uses <= where < matches docs
Low Severity
In nextPressureState, the healthy→pressure transition triggers when availablePercent <= highWatermark. This means available memory exactly at the high watermark (e.g., 10%) enters pressure state. However, the pressure→healthy exit condition uses >= lowWatermark. The asymmetry means available memory at exactly the high watermark threshold enters pressure, which may cause unnecessary pressure transitions on hosts hovering near the boundary, contradicting the hysteresis intent of avoiding flapping at thresholds.
- Change example YAML byte-size values from MB (decimal SI, 10^6) to MiB (binary, 2^20) so they match the Go default config which uses raw binary byte counts (e.g. 536870912 = 512 MiB). - Remove dead strings.TrimSuffix call in parseDarwinVMStatOutput; the " bytes)" suffix is never present after strings.Fields splits on whitespace. Addresses remaining Cursor Bugbot review findings on PR #160. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The c2h5oh/datasize library does not support MiB (binary IEC) suffixes. Use raw byte counts (e.g. 536870912 = 512*1024*1024) to match the Go default config and avoid the ~5% discrepancy from using MB (decimal SI). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
masnwilliams
left a comment
There was a problem hiding this comment.
solid feature — well-structured controller with good observability and test infrastructure. left a few comments, mostly nits. the scope question (comment 8) is the main one worth a look.
lib/guestmemory/controller.go
Outdated
|
|
||
| if req.force && !req.dryRun { | ||
| switch { | ||
| case req.requestedReclaim <= 0 && req.holdFor <= 0: |
There was a problem hiding this comment.
nit: this branch and the one below (requestedReclaim > 0 && holdFor <= 0) do the exact same thing — both clear the hold. could collapse to a single if req.holdFor <= 0 { clear } branch.
lib/guestmemory/controller.go
Outdated
| sampleSpan.SetStatus(codes.Error, err.Error()) | ||
| sampleSpan.End() | ||
| c.recordReconcileError(ctx, trigger, start, span, err) | ||
| return ManualReclaimResponse{}, err |
There was a problem hiding this comment.
consider wrapping this: fmt.Errorf("sample host pressure: %w", err) — callers (especially the API handler) won't have context on what failed otherwise.
lib/hypervisor/socket_pid_linux.go
Outdated
| func ResolveProcessPID(socketPath string) (int, error) { | ||
| socketRef, err := socketRefForPath(socketPath) | ||
| if err == nil { | ||
| if pid, err := pidBySocketRef(socketRef); err == nil { |
There was a problem hiding this comment.
the := here shadows the outer err from socketRefForPath. if pidBySocketRef fails, that error is lost — and downstream, the if err != nil check on line ~28 references the pidByCmdline error instead. the final return 0, fmt.Errorf(...) also appears to be dead code since err != nil is always true at that point. consider capturing errors explicitly to preserve diagnostic info.
lib/hypervisor/socket_pid_linux.go
Outdated
| if err != nil || len(cmdline) == 0 { | ||
| continue | ||
| } | ||
| if strings.Contains(string(cmdline), socketPath) { |
There was a problem hiding this comment.
nit: substring match means /run/vm-1.sock would also match a process with /run/vm-10.sock in its cmdline. low risk since this is a fallback path, but splitting on \0 and checking exact argument match would be more precise.
| return kb * 1024 | ||
| } | ||
|
|
||
| func assertActiveBallooningLifecycleVZ(t *testing.T, ctx context.Context, inst *Instance) { |
There was a problem hiding this comment.
this is identical to assertActiveBallooningLifecycle in the linux test file (~60 lines). consider extracting to the shared guestmemory_active_ballooning_test_helpers_test.go to avoid drift.
lib/scopes/scopes.go
Outdated
| "GET /resources": ResourceRead, | ||
| "GET /health": ResourceRead, | ||
| "GET /resources": ResourceRead, | ||
| "POST /resources/memory/reclaim": ResourceRead, |
There was a problem hiding this comment.
this maps a mutating POST to ResourceRead — every other POST in this file uses a write scope. intentional (since no ResourceWrite exists yet), or should a write scope be added?
- Clamp Low watermark to High+1 instead of cascading reset to defaults - Collapse duplicate hold-clearing branches in reconcile - Add lastApplied pruning for VMs no longer in candidate list - Wrap sampler error with context for better caller diagnostics - Fix err shadowing in ResolveProcessPID (socket_pid_linux.go) - Use exact arg match in pidByCmdline to avoid substring false positives - Extract assertActiveBallooningLifecycle to shared test helpers Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The reclaim endpoint is a mutating POST but was mapped to ResourceRead. Add ResourceWrite scope and use it for the reclaim route, consistent with every other POST endpoint using a write scope. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
There are 5 total unresolved issues (including 2 from previous reviews).
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.


Summary
lib/guestmemorywith pressure sampling, proportional reclaim, protected floors, and manual reclaim holdsPOST /resources/memory/reclaim, wire the controller through the API startup path, and document/configure the newhypervisor.memory.active_ballooningsettingsValidation
go test ./lib/guestmemory -count=1go test ./cmd/api/api -run 'TestReclaimMemory_' -count=1make test-guestmemory-vzmake test-guestmemory-linuxondeft-kernel-devfrom/home/sjmiller609/codex-active-ballooning-plan/hypemanNote
Medium Risk
Introduces a new background control loop that can change VM memory balloon targets across hypervisors and adds a new API endpoint; mistakes could cause unexpected VM memory pressure or reclaim behavior.
Overview
Adds an active guest-memory ballooning controller (
lib/guestmemory) that samples host pressure (Linux/proc+ PSI, macOSvm_stat/memory_pressure), computes a reclaim target with hysteresis/protected floors, and applies proportional runtime balloon target changes with cooldown/step limits, plus metrics/tracing/logging.Exposes a new manual reclaim API
POST /resources/memory/reclaim(withhold_for,dry_run, and feature-disabled vs internal error handling), wires the controller into API DI/startup, and extends config withhypervisor.memory.active_ballooningdefaults + validation.Enables runtime balloon control across hypervisors by extending the
hypervisor.Hypervisorinterface (Set/GetTargetGuestMemoryBytes+SupportsBalloonControl), implementing it for Cloud Hypervisor/QEMU/Firecracker/VZ (includingvz-shimballoon endpoints), and updates integration tests/Makefile to cover runtime target changes and reduce CI flakiness (timeouts, per-test runs, better artifact logging, PID resolution).Written by Cursor Bugbot for commit 6651859. This will update automatically on new commits. Configure here.