sysfs: skip disabled thermal zones before reading temp #793

Closed
realgagangupta wants to merge 2 commits into prometheus:master from realgagangupta:fix-thermal-zone-disabled

Conversation

@realgagangupta

Problem

Disabled thermal zones return EINVAL when their temp file is read. This causes the entire thermal_zone collector to fail with:

collector failed name=thermal_zone err="read /sys/class/thermal/thermal_zone10/temp: invalid argument"

Even healthy, enabled zones stop reporting metrics, because the read error from a single disabled zone aborts the whole collection loop.

Root Cause

parseClassThermalZone reads the mode file after reading temp. For disabled zones, reading temp fails with EINVAL before we ever check the mode.

Fix

Read mode first in parseClassThermalZone. If the zone's mode is "disabled", return os.ErrProcessDone before attempting to read temp. The caller treats this sentinel as "skip" and continues to the next zone.

Related

Disabled thermal zones return EINVAL when their temp file is read,
causing the entire thermal_zone collector to fail even for healthy zones.

Fix by reading the mode file first in parseClassThermalZone(). If the
zone is disabled, return early with os.ErrProcessDone which the caller
already handles by skipping to the next zone.

Fixes: prometheus/node_exporter#2980
Signed-off-by: Gagan Gupta <realgagangupta@users.noreply.github.com>

Add thermal_zone2 fixture with mode=disabled and no temp file to
simulate a disabled zone. Update TestClassThermalZoneStats to verify
that disabled zones are skipped and do not appear in results.

Signed-off-by: Gagan Gupta <realgagangupta@users.noreply.github.com>
realgagangupta force-pushed the fix-thermal-zone-disabled branch from 183ffd9 to 34f4020 on March 8, 2026 09:49
@SuperQ
Member

SuperQ commented Mar 8, 2026

I don't think this is actually correct. The temp file is specified as a required field. I think the real problem is a bad kernel platform driver.

@realgagangupta
Author

Thanks for the review @SuperQ! That makes sense — I see you've opened #794 with a cleaner approach that passes the error back to the caller instead of skipping at this layer. I'll close this PR in favor of yours.

Development

Successfully merging this pull request may close these issues.

Handle thermal_zone errors gracefully

2 participants