Skip to content

feat: Use inventory for raid details instead of BMC#1985

Merged
stevekeay merged 3 commits into
mainfrom
raid
Apr 28, 2026
Merged

feat: Use inventory for raid details instead of BMC#1985
stevekeay merged 3 commits into
mainfrom
raid

Conversation

@stevekeay
Copy link
Copy Markdown
Contributor

@stevekeay stevekeay commented Apr 27, 2026

We now have full raid controller details available after redfish
inspection so there is no need to probe the BMC for them.

We also update the business logic to group disks by size, to avoid
building RAID arrays out of mismatched disks. This behaviour is
ported from #1978

I deleted the old raid script script to avoid further confusion.

This also fixes a RAID delete problem.

Delete the old raid script script to avoid further confusion.
We now have full raid controller details available after redfish
inspection so there is no need to probe the BMC for them.

We also update the business logic to group disks by size, to avoid
building RAID arrays out of mismatched disks.
Without this, we hit a problem with the iDRAC/PERC job:
RealTimeNoRebootConfiguration stuck at 1%. Ironic never gets to its
normal async RAID polling path because Sushy is still inside
volume.delete() when it times out.

The node would go to clean failed, and Ironic gave a somewhat misleading
error message: Node ea2cdf3f-c868-42a4-b47f-5b782717b349 failed step
{'interface': 'raid', 'step': 'delete_configuration', 'abortable':
False, 'priority': 0}: Unable to connect to
/redfish/v1/TaskService/Tasks/JID_773174628107. Error: Timeout waiting
for task monitor /redfish/v1/TaskService/Tasks/JID_773174628107 (timeout
= 500)

I suspect the jobs is stuck because it powered on the server and
attempted to boot from the volume being deleted, but that is just a
guess.  Either way, getting rid of the disable_ramdisk makes it work.
Copy link
Copy Markdown
Contributor

@mfencik mfencik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Collaborator

@skrobul skrobul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

PhysicalDisk(
id=disk["id"],
controller=controller_id,
size_gb=disk["size"] // 10**9,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any realistic scenario where the disk["size"] or disk["id"] is None? if yes, consider skipping that iteration and logging the warning

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly this data is brand new, and I have not seen the code that generates it, so anything could happen.

If the data is not in the expected / advertised format though, is it not better for enrol to fail and expose the problem, rather than build raid arrays with missing devices?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point - it's better to fail early, I just would prefer to have a log message explaining why rather than TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll improve this as a follow-on

@stevekeay stevekeay added this pull request to the merge queue Apr 28, 2026
Merged via the queue into main with commit 274ad86 Apr 28, 2026
62 checks passed
@stevekeay stevekeay deleted the raid branch April 28, 2026 12:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants