@mirza-halilcevic
Contributor

@mirza-halilcevic mirza-halilcevic commented Dec 18, 2025

Motivation

Improve tuning time by parallelizing tuning on multi-GPU systems.

Technical Details

  • tuningRunner.py

    • Use rocm-smi to retrieve a list of available GPUs. Utilize all of them for tuning, or a subset of them if --gpus is specified.
    • Parallelize work across problem configs with a thread pool. The number of threads corresponds to the number of GPUs and each thread gets assigned a GPU. The tuning-driver processes are spawned with ROCR_VISIBLE_DEVICES set accordingly and --num-compile-threads set to ceil(num_cpus / num_gpus) - 1.
    • Implement a persistence mechanism so that we don't retune already tuned configs present in the output file, unless --retune is specified.
    • Unclutter output by adding a progress bar. Can be disabled with --quiet.
    • Remove --compact-print and correct the semantics of --quiet to suppress non-error output (used in CI).
    • Refactor code for maintainability and readability.
  • rocmlir-tuning-driver.cpp

    • Hide compilation time latency by utilizing a simple concurrent queue to implement a producer-consumer pattern, so we start benchmarking as soon as compiled kernels become available.
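The producer-consumer arrangement described for rocmlir-tuning-driver.cpp can be illustrated roughly as follows. This is a Python sketch of the pattern only, not the driver's C++ code; `compile_kernel`, `benchmark`, and `tune` are hypothetical stand-ins introduced for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel: tells the consumer that no more kernels are coming


def compile_kernel(config):
    # Hypothetical stand-in for compiling one tuning candidate.
    return f"compiled:{config}"


def benchmark(kernel):
    # Hypothetical stand-in for timing a compiled kernel on the GPU.
    return len(kernel)


def tune(configs, num_compile_threads=2):
    work = queue.Queue()      # candidates waiting to be compiled
    compiled = queue.Queue()  # compiled kernels waiting to be benchmarked
    for config in configs:
        work.put(config)

    def producer():
        # Compile candidates until the work queue is drained.
        while True:
            try:
                config = work.get_nowait()
            except queue.Empty:
                return
            compiled.put((config, compile_kernel(config)))

    producers = [threading.Thread(target=producer)
                 for _ in range(num_compile_threads)]
    for thread in producers:
        thread.start()

    def close_when_done():
        # Once every producer has finished, signal the consumer to stop.
        for thread in producers:
            thread.join()
        compiled.put(_DONE)

    closer = threading.Thread(target=close_when_done)
    closer.start()

    # Consumer: benchmark each kernel as soon as it becomes available,
    # while the remaining candidates are still being compiled.
    results = {}
    while True:
        item = compiled.get()
        if item is _DONE:
            break
        config, kernel = item
        results[config] = benchmark(kernel)
    closer.join()
    return results
```

The consumer never waits for the whole batch to compile; it blocks only while the queue of compiled kernels is empty, which is what hides the compilation latency.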

Resolves https://github.com/ROCm/rocMLIR-internal/issues/2018
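The tuningRunner.py changes listed above (one worker per GPU, ROCR_VISIBLE_DEVICES per process, the --num-compile-threads formula, and skipping already tuned configs) might look roughly like this. It is a minimal sketch under stated assumptions, not the PR's code: every function name is hypothetical, `tune_one` stands in for spawning the tuning driver, and clamping the thread count to at least 1 is an assumption of the sketch.

```python
import math
import os
import queue
from concurrent.futures import ThreadPoolExecutor


def compile_threads_per_gpu(num_cpus, num_gpus):
    # Formula from the PR description: ceil(num_cpus / num_gpus) - 1.
    # Clamping to a minimum of 1 is an assumption of this sketch.
    return max(1, math.ceil(num_cpus / num_gpus) - 1)


def driver_env(gpu_id, base_env=None):
    # Copy the environment and restrict the child process to a single GPU.
    env = dict(os.environ if base_env is None else base_env)
    env["ROCR_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def pending_configs(configs, already_tuned, retune=False):
    # Skip configs already present in the output file unless retuning.
    if retune:
        return list(configs)
    return [c for c in configs if c not in already_tuned]


def tune_on_gpus(configs, gpu_ids, tune_one):
    # One worker thread per GPU; each task borrows a free GPU ID and
    # returns it when done, so no two tasks share a GPU at the same time.
    free_gpus = queue.Queue()
    for gpu_id in gpu_ids:
        free_gpus.put(gpu_id)

    def task(config):
        gpu_id = free_gpus.get()
        try:
            return config, tune_one(config, driver_env(gpu_id))
        finally:
            free_gpus.put(gpu_id)

    with ThreadPoolExecutor(max_workers=len(gpu_ids)) as pool:
        return dict(pool.map(task, configs))
```

Borrowing GPU IDs from a queue is one way to pair workers with devices; binding each thread to a fixed GPU at startup, as the description suggests, would work equally well.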

Test Plan

Test Result

Submission Checklist

This comment was marked as outdated.

mirza-halilcevic and others added 5 commits December 19, 2025 18:50
- Redesign for better maintainability and readability
- Unclutter output with proper quiet flag and progress bar
- Distribute work on multi-gpu systems
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

mlir/utils/performance/tuningRunner.py:1

  • Off-by-one error: loop creates numThreads - 1 compilation threads, but should create numThreads threads. This will result in one fewer compilation worker than intended.
#!/usr/bin/env python3

@mirza-halilcevic mirza-halilcevic marked this pull request as ready for review December 21, 2025 13:46
Member

@umangyadav umangyadav left a comment

Can we change the Jenkinsfile to not use MITuna after this PR is merged, to test this out?

Comment on lines 87 to 88
if len(device_ids) > 0:
    return device_ids
Member

This looks like it only picks values from HIP_VISIBLE_DEVICES first, and falls back to ROCR_VISIBLE_DEVICES only if none are found. Can you add a comment saying so, so that it reads as intentional rather than as ROCR_VISIBLE_DEVICES being ignored?

Contributor Author

I'm moving away from using the env vars because they are inconsistent and unreliable. I landed on retrieving the GPU IDs using rocm-smi instead, because it ignores any mapping and returns the physical IDs.

@mirza-halilcevic
Contributor Author

Can we change the Jenkinsfile to not use MITuna after this PR is merged, to test this out?

Created a tracking issue: https://github.com/ROCm/rocMLIR-internal/issues/2206

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.


Comment on lines 94 to 96
class TuningError(Exception):
    """Raised when tuning or verification fails."""
    pass
Contributor

As written, TuningError is identical to Exception apart from its name. Does it make sense to keep it as a separate class? If it is raised when tuning or verification fails, could we add a config attribute so it is easy to see which config the error occurred for? Otherwise we only learn that an exception happened, not which config caused it, unless we attach that context manually later. A dedicated attribute would also make it natural to pass the config whenever TuningError is raised. Since we will be using multiple GPUs, an optional attribute recording which GPU was used might be useful too.
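One possible shape for that suggestion is sketched below. The `config` and `gpu_id` attribute names and the message format are assumptions made for illustration; they are not taken from the PR.

```python
class TuningError(Exception):
    """Raised when tuning or verification fails."""

    def __init__(self, message, config=None, gpu_id=None):
        # Keep the context as attributes so callers can inspect it.
        self.config = config
        self.gpu_id = gpu_id
        # Append whatever context is available to the message itself.
        details = []
        if config is not None:
            details.append(f"config={config!r}")
        if gpu_id is not None:
            details.append(f"gpu={gpu_id}")
        suffix = f" [{', '.join(details)}]" if details else ""
        super().__init__(message + suffix)
```

Both attributes are optional, so existing `raise TuningError("...")` call sites would keep working unchanged.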

if not tune_mlir_kernels(configs, conf_class, paths, options):
    print("Tuning aborted", file=sys.stderr)
try:
    return not tune_configs(configs, conf_class, paths, options)
Contributor

tune_configs returns a bool in this version of the code, and we expect 0 to mean success, so is that why the boolean is inverted? To report success, True has to be flipped to False, which might be confusing. Can we handle this differently, so that we don't flip the bool directly to get the int return value?
Maybe it would be more understandable if we did something like:
tuning_success = tune_configs(...
return 0 if tuning_success else 1
Not sure if my suggestion makes sense.
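Spelled out as a small runnable sketch, the suggestion could look like this. The `tune_configs` below is a hypothetical stand-in with a simplified signature (the real function takes configs, conf_class, paths, and options), and `main` is likewise only illustrative.

```python
def tune_configs(configs):
    # Hypothetical stand-in: returns True only if every config tuned successfully.
    return all(configs)


def main(configs):
    # Name the boolean result, then map it explicitly to a process exit code
    # (0 means success) instead of returning `not tune_configs(...)`.
    tuning_success = tune_configs(configs)
    return 0 if tuning_success else 1
```

The result of `main` can then be passed to `sys.exit`, which keeps the bool-to-exit-code mapping in one visible place.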

4 participants