[BUG] CPU Idle Detection Using psutil.cpu_percent() Is Misleading on Multi-Core Instances

## Summary
The current implementation in AWS Research Engineering Studio (RES) appears to use `psutil.cpu_percent()` as a proxy for system idleness. This approach does not scale appropriately with multi-core systems and can incorrectly classify actively used instances as idle, particularly in data science workflows.

We'd like to use this feature properly, but the implementation doesn't fit well with our user workloads.

---

## Problem Description

`psutil.cpu_percent()` (with the default `percpu=False`) reports CPU utilisation as a percentage of **total system capacity**, averaged across all cores.

On high core-count instances (e.g. 8, 16, 32 vCPUs), this leads to unintuitive and misleading behaviour:

- A **single fully utilised core** on a 16 vCPU instance results in ~6–10% reported usage.
- Moderate but legitimate workloads (e.g. 1–2 active cores) appear as **low overall utilisation**.

### Example

On a 16 vCPU instance:

| Workload              | Actual Activity     | `cpu_percent()` |
|----------------------|--------------------|-----------------|
| 1 core fully used    | Active computation | ~6–10%          |
| 2 cores fully used   | Parallel work      | ~12–20%         |
| Fully idle           | None               | ~0–2%           |
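
The lower bounds in the table follow from simple arithmetic. A minimal sketch, assuming busy cores sit at 100% and all other cores are completely idle (the helper name `aggregate_cpu_percent` is hypothetical, used only for illustration):

```python
def aggregate_cpu_percent(busy_cores: float, cpu_count: int) -> float:
    """System-wide utilisation (%) when `busy_cores` cores are fully busy
    and the remaining cores are completely idle (idealised assumption)."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return 100.0 * busy_cores / cpu_count


# The same single-threaded job looks "smaller" the larger the instance is.
for vcpus in (8, 16, 32):
    print(f"{vcpus:>2} vCPUs, 1 busy core: {aggregate_cpu_percent(1, vcpus)}%")
```

On 16 vCPUs this gives 6.25% for one busy core and 12.5% for two. The upper ends of the ranges in the table presumably reflect background activity on the other cores.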

---

## Real-World Impact (Data Science / R Workloads)

In many data science use cases (especially R-based analytics):

- High-vCPU instances are often chosen for their RAM rather than their CPU capacity
- Processing occurs in **bursty or phased patterns**

This results in:

- Sustained meaningful work
- Low aggregate CPU percentage

In contrast, system load metrics behave differently:

```python
import os

load1, load5, load15 = os.getloadavg()  # 1-, 5- and 15-minute load averages
```

Typical observed values:

- `cpu_percent()` → ~8–12%  
- `loadavg (1 min)` → ~1.0–1.5  

The load average correctly reflects that the system is **not idle**, whereas CPU percentage suggests it is.

---

## Current Behaviour in RES

- CPU utilisation from `psutil.cpu_percent()` is compared directly against the idle threshold, which is fixed at instance creation
- Default idle threshold: 30%, which is problematic on multi-core instances
- Idle timeout: default 1 year. We would like to set it to *4h*, but cannot do so safely with the current detection method
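
To illustrate why a fixed 30% threshold behaves differently per instance size, here is a rough sketch that counts how many fully busy cores can stay below the threshold. It assumes utilisation strictly below the threshold is treated as idle, which is our reading of the behaviour rather than something confirmed from the RES source; the function name is hypothetical.

```python
def busy_cores_hidden_by_threshold(cpu_count: int, threshold_percent: float = 30.0) -> int:
    """Largest number of fully busy cores whose aggregate utilisation
    still falls strictly below `threshold_percent` (assumed idle rule)."""
    hidden = 0
    for busy in range(1, cpu_count + 1):
        if 100.0 * busy / cpu_count < threshold_percent:
            hidden = busy
    return hidden


# How much real work a 30% threshold can hide on each instance size
for vcpus in (2, 4, 8, 16, 32):
    print(f"{vcpus:>2} vCPUs: up to {busy_cores_hidden_by_threshold(vcpus)} busy core(s) still look idle")
```

Under these assumptions, a 16 vCPU instance can have four cores fully busy (25%) and still be classified as idle, while a 2 vCPU instance cannot hide even one.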

### Consequence

The current global threshold is not compatible with instances of different sizes, because the same percentage corresponds to a different amount of actual work on each size.

This is particularly problematic for:

- Long-running analyses  
- Pipeline stages with intermittent CPU usage  
- Interactive analytical sessions  

---

## Why This Approach Is Problematic

`psutil.cpu_percent()` measures:

> Total CPU utilisation as a fraction of total available compute capacity

Whereas "idleness" in a multi-core environment should consider:

- Whether *any* meaningful work is occurring  
- Queueing and scheduling pressure  
- Per-process or per-core activity  

Thus, **CPU percentage alone is not a reliable indicator of idleness** on modern multi-core systems.

---

## Suggested Improvements

### ✅ Option 1: Use Load Average

Leverage `os.getloadavg()`:

- Load ≈ average number of runnable processes (on Linux, tasks in uninterruptible sleep, e.g. waiting on disk I/O, are counted too)  
- Interpretable relative to CPU count  

Example heuristic:

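```python
import os

# 1-minute load average (Unix only); the 5- and 15-minute values are unused here
load1, _, _ = os.getloadavg()

# Example threshold: fewer than 0.2 runnable tasks on average counts as idle
system_idle = load1 < 0.2
```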
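
One possible extension of the heuristic above, given the bursty workloads described earlier, is to require all three load-average windows to be quiet before declaring the system idle. This is only a sketch: the function name `idle_by_load` and the reuse of the 0.2 threshold for every window are assumptions, not part of RES.

```python
import os


def idle_by_load(loads, threshold=0.2):
    """True only if every supplied load average is below `threshold`."""
    return all(load < threshold for load in loads)


# os.getloadavg() is Unix-only and raises OSError if the value is unobtainable.
if hasattr(os, "getloadavg"):
    try:
        loads = os.getloadavg()  # (1 min, 5 min, 15 min)
        print(f"load averages: {loads}, idle: {idle_by_load(loads)}")
    except OSError:
        print("load average unavailable")
```

Because the 15-minute average decays slowly, a short pause between pipeline stages would not be mistaken for idleness under this rule.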

---

### ✅ Option 2: Convert CPU % to Core-Equivalent Usage

Instead of raw percentage:

```python
# cpu_percent from psutil.cpu_percent(), cpu_count from psutil.cpu_count()
core_usage = (cpu_percent / 100.0) * cpu_count
```

This makes thresholds meaningful across instance sizes.
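
As a slightly fuller sketch of the conversion, the functions below take the measured values as plain arguments so the logic can be checked without sampling a live system. The function names and the 0.5-core idle threshold are illustrative assumptions.

```python
def core_equivalent_usage(cpu_percent: float, cpu_count: int) -> float:
    """Convert a system-wide percentage into 'number of fully busy cores'."""
    return (cpu_percent / 100.0) * cpu_count


def idle_by_cores(cpu_percent: float, cpu_count: int, idle_cores: float = 0.5) -> bool:
    """Idle if less than `idle_cores` worth of CPU is in use (assumed threshold)."""
    return core_equivalent_usage(cpu_percent, cpu_count) < idle_cores


# The same 8% reading means very different things on different instance sizes.
for vcpus in (2, 16, 64):
    usage = core_equivalent_usage(8.0, vcpus)
    print(f"{vcpus:>2} vCPUs at 8%: {usage:.2f} cores, idle={idle_by_cores(8.0, vcpus)}")
```

In practice the inputs could come from `psutil.cpu_percent(interval=1)` and `psutil.cpu_count()`.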

---

### ✅ Option 3: Multi-Signal Approach

Combine indicators:

- CPU usage (scaled)  
- Load average  
- Recent activity window (buffering spikes)  
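
A minimal sketch of how these signals might be combined, assuming periodic samples of CPU percentage and 1-minute load average are already being collected. The class name `IdleDetector`, the window size, and both thresholds are hypothetical and would need tuning.

```python
from collections import deque


class IdleDetector:
    """Declares idle only after `window` consecutive quiet samples."""

    def __init__(self, window=6, max_cores=0.5, max_load=0.2):
        self.max_cores = max_cores  # core-equivalent usage limit
        self.max_load = max_load    # 1-minute load average limit
        self.samples = deque(maxlen=window)  # True = quiet sample

    def add_sample(self, cpu_percent, cpu_count, load1):
        core_usage = (cpu_percent / 100.0) * cpu_count
        quiet = core_usage < self.max_cores and load1 < self.max_load
        self.samples.append(quiet)

    def is_idle(self):
        # Require a full window so a freshly started session is never idle,
        # and require every sample to be quiet so short spikes reset the clock.
        return len(self.samples) == self.samples.maxlen and all(self.samples)


detector = IdleDetector(window=3)
# (cpu_percent, load1) samples on a 16 vCPU instance; the second is a burst.
for cpu, load in [(1.0, 0.05), (9.0, 1.1), (1.0, 0.1), (0.5, 0.05), (1.5, 0.08)]:
    detector.add_sample(cpu, 16, load)
    print(detector.is_idle())
```

With a window of three, the burst in the second sample keeps the detector from reporting idle until three quiet samples have followed it.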

---

## Recommendation

At minimum:

> Replace or augment `psutil.cpu_percent()` with a load-based or core-normalised metric before applying idle thresholds.

---

## Expected Outcome

Fixing this would:

- Make idle shutdown reliable
- Improve usability for data science workflows
- Better align RES behaviour with real-world compute patterns
- Reduce user frustration and job failures

---

## Additional Context

This issue is especially visible in:

- R workloads
- Single-threaded Python tasks
- Memory-bound computations
- Interactive sessions (e.g. notebooks)

---

**Screenshots/Video**

Example of CPU percentage vs load average (can't copy text out of the environment).

During exec (generating load):
<img width="1392" height="724" alt="Image" src="https://github.com/user-attachments/assets/afee630d-c7dd-4e3e-a398-b4265350bb34" />

Result:
<img width="322" height="97" alt="Image" src="https://github.com/user-attachments/assets/2f87680a-42b3-43b1-976f-e9789a2cbadd" />

**Environment (please complete the following information):**
 - RES Version: 2026.03
 - Software Stack AMI ID:
   - Private prepped for closed environment: ami-0c26b557ed705377b
   - Original: ami-067ea4effee56973f
 - Software Stack OS: Ubuntu 24.04

**Additional context**
Enterprise support account
