
[BUG] CPU Idle Detection Using psutil.cpu_percent() Is Misleading on Multi-Core Instances #176

Description

@keiranmraine

Summary

The current implementation in AWS Research Engineering Studio (RES) appears to use psutil.cpu_percent() as a proxy for system idleness. This approach does not scale appropriately with multi-core systems and can incorrectly classify actively used instances as idle, particularly in data science workflows.

We'd like to use this feature properly, but the implementation doesn't fit well with our user workloads.


Problem Description

psutil.cpu_percent() reports CPU utilisation as a percentage of total system capacity, averaged across all cores.

On high core-count instances (e.g. 8, 16, 32 vCPUs), this leads to unintuitive and misleading behaviour:

  • A single fully utilised core on a 16 vCPU instance results in ~6–10% reported usage.
  • Moderate but legitimate workloads (e.g. 1–2 active cores) appear as low overall utilisation.

Example

On a 16 vCPU instance:

| Workload | Actual Activity | cpu_percent() |
|---|---|---|
| 1 core fully used | Active computation | ~6–10% |
| 2 cores fully used | Parallel work | ~12–20% |
| Fully idle | None | ~0–2% |
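
The figures above follow from simple arithmetic: with k cores fully busy on an N-core machine, the system-wide average is roughly 100 × k / N percent, and background activity pushes the observed value slightly higher. A minimal sketch of that calculation (the helper name `aggregate_cpu_percent` is made up for illustration and is not part of psutil or RES):

```python
def aggregate_cpu_percent(busy_cores: float, total_cores: int) -> float:
    """System-wide average utilisation (%) when `busy_cores` cores are fully busy."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    return 100.0 * busy_cores / total_cores


# One fully busy core on progressively larger instances.
for total in (4, 8, 16, 32):
    print(f"{total:>2} vCPUs: {aggregate_cpu_percent(1, total):.2f}%")
```

By this arithmetic, one saturated core sits below a 30% threshold on any instance with 4 or more vCPUs.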

Real-World Impact (Data Science / R Workloads)

In many data science use cases (especially R-based analytics):

  • High CPU instances are used for RAM requirements rather than CPU
  • Processing occurs in bursty or phased patterns

This results in:

  • Sustained meaningful work
  • Low aggregate CPU percentage

In contrast, system load metrics behave differently:

import os
os.getloadavg()

Typical observed values:

  • cpu_percent() → ~8–12%
  • loadavg (1 min) → ~1.0–1.5

The load average correctly reflects that the system is not idle, whereas CPU percentage suggests it is.


Current Behaviour in RES

  • CPU utilisation from psutil.cpu_percent() is compared directly against the idle threshold (fixed at instance creation)
  • Default idle threshold: 30%. This is problematic because a single busy core stays well below it on larger instances
  • Idle timeout: default 1 year. We'd like 4h but can't safely set it given the detection method

Consequence

Because one global threshold is applied to aggregate CPU percentage, the current implementation is not compatible with instances of different sizes.

This is particularly problematic for:

  • Long-running analyses
  • Pipeline stages with intermittent CPU usage
  • Interactive analytical sessions

Why This Approach Is Problematic

psutil.cpu_percent() measures:

Total CPU utilisation as a fraction of total available compute capacity

Whereas "idleness" in a multi-core environment should consider:

  • Whether any meaningful work is occurring
  • Queueing and scheduling pressure
  • Per-process or per-core activity

Thus, CPU percentage alone is not a reliable indicator of idleness on modern multi-core systems.


Suggested Improvements

✅ Option 1: Use Load Average

Leverage os.getloadavg():

  • Load ≈ number of runnable processes
  • Interpretable relative to CPU count

Example heuristic:

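import os

# 1-minute load average (Unix only; os.getloadavg() is not available on Windows)
load1, _, _ = os.getloadavg()

# Always assign the flag, so it is also defined when the system is busy
system_idle = load1 < 0.2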
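
A slightly more defensive sketch of the same heuristic, for illustration only: it assumes the 0.2 threshold from the example above, and the helper names `read_load1` and `is_idle_by_load` are hypothetical, not RES functions. When the load average cannot be read, it treats the host as not idle so that a missing metric never triggers a shutdown.

```python
import os
from typing import Optional


def read_load1() -> Optional[float]:
    """Return the 1-minute load average, or None where it is unavailable."""
    try:
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        # AttributeError: platform has no getloadavg (e.g. Windows).
        # OSError: the load average could not be obtained.
        return None


def is_idle_by_load(load1: Optional[float], threshold: float = 0.2) -> bool:
    """Idle only when the load is known and below the (assumed) threshold."""
    return load1 is not None and load1 < threshold


print("idle:", is_idle_by_load(read_load1()))
```

On Linux the load average also counts tasks in uninterruptible sleep, so heavy I/O waits raise it even when the CPUs are quiet. For idle detection that is arguably desirable, but it is worth knowing when picking a threshold.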

✅ Option 2: Convert CPU % to Core-Equivalent Usage

Instead of raw percentage:

core_usage = (cpu_percent / 100.0) * cpu_count

This makes thresholds meaningful across instance sizes.
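
As a rough sketch of how this conversion might be wired into a check: the inputs are assumed to come from `psutil.cpu_percent(interval=...)` and `psutil.cpu_count()`, and the 0.5-core idle threshold and the function names are placeholders chosen for illustration, not values or APIs from RES.

```python
def core_equivalent_usage(cpu_percent: float, cpu_count: int) -> float:
    """Convert a system-wide CPU percentage into 'number of busy cores'."""
    return (cpu_percent / 100.0) * cpu_count


def is_idle_by_cores(cpu_percent: float, cpu_count: int,
                     idle_cores: float = 0.5) -> bool:
    """Idle when less than `idle_cores` worth of CPU is in use (assumed threshold)."""
    return core_equivalent_usage(cpu_percent, cpu_count) < idle_cores


# The same 8% reading means very different things on different instance sizes.
for count in (2, 16, 32):
    busy = core_equivalent_usage(8.0, count)
    print(f"{count:>2} vCPUs at 8%: {busy:.2f} cores busy, "
          f"idle={is_idle_by_cores(8.0, count)}")
```

Expressed this way, a single threshold in cores behaves the same on every instance size, which a percentage threshold does not.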


✅ Option 3: Multi-Signal Approach

Combine indicators:

  • CPU usage (scaled)
  • Load average
  • Recent activity window (buffering spikes)
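
One way the three signals could be combined is sketched below. It is an illustration under stated assumptions rather than a proposed RES implementation: the class name `IdleDetector`, the thresholds, and the window length are all hypothetical, and callers are expected to pass in samples they have collected themselves (for example from `psutil.cpu_percent()` and `os.getloadavg()`).

```python
from collections import deque


class IdleDetector:
    """Report idle only after a full window of samples with no activity."""

    def __init__(self, cpu_count: int, idle_cores: float = 0.5,
                 idle_load: float = 0.2, window: int = 5) -> None:
        self.cpu_count = cpu_count
        self.idle_cores = idle_cores      # assumed threshold, in busy cores
        self.idle_load = idle_load        # assumed threshold, 1-min load average
        self.samples = deque(maxlen=window)

    def observe(self, cpu_percent: float, load1: float) -> bool:
        """Record one sample and return True if the host now looks idle."""
        busy_cores = (cpu_percent / 100.0) * self.cpu_count
        # Either signal alone is enough to count the sample as active.
        active = busy_cores >= self.idle_cores or load1 >= self.idle_load
        self.samples.append(active)
        # A short spike anywhere in the window keeps the host "not idle".
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and not any(self.samples)


detector = IdleDetector(cpu_count=16, window=3)
for cpu, load in [(1.0, 0.05), (1.0, 0.05), (8.0, 1.2), (1.0, 0.05)]:
    print(cpu, load, detector.observe(cpu, load))
```

The window means a single burst of work resets the idle decision for several samples, which suits the bursty or phased processing described earlier.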

Recommendation

At minimum:

Replace or augment psutil.cpu_percent() with a load-based or core-normalised metric before applying idle thresholds.


Expected Outcome

Fixing this would:

  • Make idle shutdown reliable enough to enable with short timeouts
  • Improve usability for data science workflows
  • Better align RES behaviour with real-world compute patterns
  • Reduce user frustration and job failures

Additional Context

This issue is especially visible in:

  • R workloads
  • Single-threaded Python tasks
  • Memory-bound computations
  • Interactive sessions (e.g. notebooks)

Screenshots/Video

Example of cpu vs load average (can't copy text out of environment).

During exec (generating load):
Image

Result:
Image

Environment (please complete the following information):

  • RES Version: 2026.03
  • Software Stack AMI ID:
    • Private prepped for closed environment: ami-0c26b557ed705377b
    • Original: ami-067ea4effee56973f
  • Software Stack OS: Ubuntu 24.04

Additional context
Enterprise support account
