Parallelize tuning across GPUs #2179

mirza-halilcevic · 2025-12-18T14:57:10Z

Motivation

Improve tuning time by parallelizing tuning on multi-GPU systems.

Technical Details

tuningRunner.py
- Use rocm-smi to retrieve a list of available GPUs. Utilize all of them for tuning, or a subset of them if --gpus is specified.
- Parallelize work across problem configs with a thread pool. The number of threads corresponds to the number of GPUs and each thread gets assigned a GPU. The tuning-driver processes are spawned with ROCR_VISIBLE_DEVICES set accordingly and --num-compile-threads set to ceil(num_cpus / num_gpus) - 1.
- Implement a persistence mechanism so that we don't retune already tuned configs present in the output file, unless --retune is specified.
- Unclutter output by adding a progress bar. Can be disabled with --quiet.
- Remove --compact-print and correct the semantics of --quiet to suppress non-error output (used in CI).
- Refactor code for maintainability and readability.
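
A minimal sketch of the scheduling described in the list above, not the code from this PR. It assumes one worker thread per GPU, with each task borrowing a GPU ID from a queue and passing it to the child process through ROCR_VISIBLE_DEVICES. The names `compile_threads_per_gpu`, `run_on_gpus`, and the `run_one` callback are hypothetical, and clamping the thread count to at least 1 is an assumption.

```python
import math
import os
import queue
from concurrent.futures import ThreadPoolExecutor


def compile_threads_per_gpu(num_cpus: int, num_gpus: int) -> int:
    """ceil(num_cpus / num_gpus) - 1, clamped to at least 1 (the clamp is an assumption)."""
    return max(1, math.ceil(num_cpus / num_gpus) - 1)


def run_on_gpus(configs, gpu_ids, run_one):
    """Run run_one(config, env) for every config, with one worker thread per GPU."""
    free_gpus = queue.Queue()
    for gpu_id in gpu_ids:
        free_gpus.put(gpu_id)

    def worker(config):
        # At most len(gpu_ids) tasks run at once, so a GPU is always available here.
        gpu_id = free_gpus.get()
        try:
            # Environment for the child process; only this GPU is made visible.
            env = dict(os.environ, ROCR_VISIBLE_DEVICES=str(gpu_id))
            return run_one(config, env)
        finally:
            free_gpus.put(gpu_id)  # hand the GPU back for the next config

    with ThreadPoolExecutor(max_workers=len(gpu_ids)) as pool:
        return list(pool.map(worker, configs))
```

In a real runner, `run_one` would spawn the tuning driver with `env` and the computed compile-thread count; here it is left as a callback so the scheduling logic can be exercised without a GPU.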

rocmlir-tuning-driver.cpp
- Hide compilation time latency by utilizing a simple concurrent queue to implement a producer-consumer pattern, so we start benchmarking as soon as compiled kernels become available.
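
The producer-consumer idea above can be illustrated with a short sketch. The driver itself is C++ and uses its own ConcurrentQueue.h, so this Python version only shows the pattern, not the actual implementation. `tune`, `compile_kernel`, and `benchmark_kernel` are hypothetical names.

```python
import queue
import threading

_DONE = object()  # sentinel: one is pushed by each producer when it finishes


def tune(perf_configs, compile_kernel, benchmark_kernel, num_compile_threads=2):
    """Overlap compilation (producer threads) with benchmarking (single consumer)."""
    pending = queue.Queue()
    for cfg in perf_configs:
        pending.put(cfg)
    compiled = queue.Queue()

    def producer():
        try:
            while True:
                try:
                    cfg = pending.get_nowait()
                except queue.Empty:
                    break  # nothing left to compile
                compiled.put((cfg, compile_kernel(cfg)))
        finally:
            compiled.put(_DONE)  # always signal, even if compilation raised

    threads = [threading.Thread(target=producer) for _ in range(num_compile_threads)]
    for t in threads:
        t.start()

    results = {}
    finished = 0
    while finished < num_compile_threads:
        item = compiled.get()  # blocks until a kernel or a sentinel arrives
        if item is _DONE:
            finished += 1
            continue
        cfg, kernel = item
        # Benchmarking stays on this one thread so timings do not overlap.
        results[cfg] = benchmark_kernel(kernel)

    for t in threads:
        t.join()
    return results
```

Benchmarking begins as soon as the first compiled kernel is queued, so compilation latency for later kernels is hidden behind benchmark time for earlier ones.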

Resolves https://github.com/ROCm/rocMLIR-internal/issues/2018

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

mlir/utils/performance/tuningRunner.py:1

Off-by-one error: loop creates numThreads - 1 compilation threads, but should create numThreads threads. This will result in one fewer compilation worker than intended.

#!/usr/bin/env python3

mlir/utils/performance/tuningRunner.py

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp

umangyadav

Can we change the Jenkinsfile to not use MITuna after this PR is merged, to test this out?

pip_requirements.txt

mlir/utils/jenkins/Jenkinsfile.downstream

umangyadav · 2025-12-22T14:45:57Z

mlir/utils/performance/tuningRunner.py

+            if len(device_ids) > 0:
+                return device_ids


This looks like it picks values from HIP_VISIBLE_DEVICES first and only falls back to ROCR_VISIBLE_DEVICES if none are found. Can you add a comment about this so that it reads as intentional rather than as ROCR_VISIBLE_DEVICES being ignored?

I'm moving away from using the env vars because they are inconsistent and unreliable. I landed on retrieving the GPU IDs using rocm-smi instead, because it ignores any mapping and returns the physical IDs.
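
As a rough sketch of the approach described in this reply (not the code from the PR), GPU IDs could be read from rocm-smi and then narrowed by a `--gpus` style selection. The `rocm-smi --showid --json` invocation and the `cardN` key format are assumptions about the tool's output, and `parse_gpu_ids`, `select_gpus`, and `query_gpu_ids` are hypothetical names.

```python
import json
import re
import subprocess


def parse_gpu_ids(smi_json: str) -> list[int]:
    """Extract GPU indices from JSON whose top-level keys look like 'card0', 'card1'.

    The key format is an assumption about rocm-smi's JSON output.
    """
    ids = []
    for key in json.loads(smi_json):
        match = re.fullmatch(r"card(\d+)", key)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


def select_gpus(available, requested=None):
    """Use every available GPU, or the requested subset; reject unknown IDs."""
    if requested is None:
        return list(available)
    unknown = sorted(set(requested) - set(available))
    if unknown:
        raise ValueError(f"GPUs not found: {unknown}")
    return list(requested)


def query_gpu_ids() -> list[int]:
    # Assumed invocation; adjust the flags to what the installed rocm-smi supports.
    out = subprocess.run(["rocm-smi", "--showid", "--json"],
                         check=True, capture_output=True, text=True).stdout
    return parse_gpu_ids(out)
```

Keeping the parsing separate from the subprocess call makes the ID handling testable on machines without ROCm installed.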

mlir/utils/performance/tuningRunner.py

mlir/tools/rocmlir-tuning-driver/ConcurrentQueue.h

mirza-halilcevic · 2025-12-23T12:11:08Z

Can we change the Jenkinsfile to not use MITuna after this PR is merged, to test this out?

Created a tracking issue: https://github.com/ROCm/rocMLIR-internal/issues/2206

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

mlir/utils/performance/tuningRunner.py

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp

mlir/tools/rocmlir-tuning-driver/ConcurrentQueue.h

dorde-antic · 2025-12-28T07:58:07Z

mlir/utils/performance/tuningRunner.py

 class TuningError(Exception):
+    """Raised when tuning or verification fails."""
    pass


The TuningError class is identical to Exception except for its name. Does it make sense to have it as a separate class? If it is raised when tuning or verification fails, should we add a config attribute so that it is easy to see which config the TuningError occurred for? Otherwise we only learn that the exception happened, not which config caused it, unless we attach that context manually later. A dedicated config attribute would make it intuitive to attach the config whenever TuningError is raised. Since we will be using multiple GPUs, an optional attribute recording which GPU was used might also be useful.
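
One possible shape for this suggestion, as a sketch only: the attribute names `config` and `gpu_id` and the message format are hypothetical and not taken from the PR.

```python
from typing import Optional


class TuningError(Exception):
    """Raised when tuning or verification fails."""

    def __init__(self, message: str, config: Optional[str] = None,
                 gpu_id: Optional[int] = None):
        super().__init__(message)
        self.config = config  # the problem config being tuned, if known
        self.gpu_id = gpu_id  # the GPU the failing run was assigned to, if known

    def __str__(self) -> str:
        # Append only the context that was actually provided.
        parts = [super().__str__()]
        if self.config is not None:
            parts.append(f"config={self.config!r}")
        if self.gpu_id is not None:
            parts.append(f"gpu={self.gpu_id}")
        return " | ".join(parts)
```

With both attributes optional, existing `raise TuningError("...")` call sites keep working, while new call sites can attach context that shows up in logs automatically.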

dorde-antic · 2025-12-28T08:10:54Z

mlir/utils/performance/tuningRunner.py

-    if not tune_mlir_kernels(configs, conf_class, paths, options):
-        print("Tuning aborted", file=sys.stderr)
+    try:
+        return not tune_configs(configs, conf_class, paths, options)


tune_configs returns a bool in this version, and we expect 0 to mean success, so is that why we invert the boolean? To report success we flip True to False, which might be confusing. Can we handle this differently, so that we don't flip the bool directly to get the int return value?
Maybe it would be more understandable if we did something like:
tuning_success = tune_configs(...
return 0 if tuning_success else 1
Not sure if my suggestion makes sense.
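
A small self-contained version of the suggestion, for illustration. `main` and the `tune` callback are hypothetical stand-ins for the real entry point and the call to tune_configs.

```python
def main(tune) -> int:
    """Return a process exit status: 0 on success, 1 on failure."""
    tuning_success = tune()  # stand-in for tune_configs(configs, conf_class, paths, options)
    return 0 if tuning_success else 1
```

Both forms yield the same exit status, because Python's bool is a subclass of int and `not True` equals 0. The explicit form just makes the mapping visible to the reader.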

Utilize multiple GPUs in tuningRunner.

3b92772

mirza-halilcevic requested review from Copilot, dhernandez0 and umangyadav December 18, 2025 14:57

mirza-halilcevic and others added 5 commits December 19, 2025 18:50

tuningRunner.py refactoring

31dca63

- Redesign for better maintainability and readability - Unclutter output with proper quiet flag and progress bar - Distribute work on multi-gpu systems

Merge branch 'develop' into mgpu-tuning

4e7c234

Overlap compilation and benchmarking of perf configs in rocmlir-tuning-driver.

5e0338e

Merge remote-tracking branch 'origin/mgpu-tuning' into mgpu-tuning

4d71e33

Newline at end of file.

eb8dff6

mirza-halilcevic requested a review from Copilot December 19, 2025 19:06

Copilot AI reviewed Dec 19, 2025

mlir/utils/performance/tuningRunner.py (outdated, resolved)

mlir/utils/performance/tuningRunner.py (outdated, resolved)

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp (outdated, resolved)

Implement persistence logic.

70d021d

mirza-halilcevic requested a review from pabloantoniom December 21, 2025 13:20

mirza-halilcevic marked this pull request as ready for review December 21, 2025 13:46

mirza-halilcevic requested a review from causten as a code owner December 21, 2025 13:46

mirza-halilcevic added 2 commits December 21, 2025 14:10

Fix thread allocation bug in tuning-driver.

735fb2b

Simplify implementation and optimize for edge cases.

d36e62d

umangyadav reviewed Dec 22, 2025

mlir/tools/rocmlir-tuning-driver/ConcurrentQueue.h (resolved)

This was referenced Dec 23, 2025

Add tqdm as a pip requirement #2183

Merged

Add tqdm as a pip requirement ROCm/MITuna#1021

Merged

mirza-halilcevic and others added 5 commits December 23, 2025 12:30

Merge remote-tracking branch 'origin/develop' into mgpu-tuning

e01d4b1

Merge branch 'develop' into mgpu-tuning

a3c15e8

Address review comments:

cf2046e

- Update docstrings - Use rocm-smi instead of env vars for GPU selection

Merge remote-tracking branch 'origin/develop' into mgpu-tuning

8d6dbc3

Merge remote-tracking branch 'origin/mgpu-tuning' into mgpu-tuning

df88112

mirza-halilcevic requested review from Copilot and umangyadav December 25, 2025 22:03

Copilot AI reviewed Dec 25, 2025

mirza-halilcevic added 5 commits December 26, 2025 00:02

Improve gpus argument parsing and improve graceful shutdown.

af53625

Implement OutputFileWriter and DebugFileWriter.

e8d29d1

Fix output parsing during shutdown.

95f771b

Keep track of commit hash for each tuning run.

e6a985b

Remove semicolons.

2c48a92

mirza-halilcevic requested a review from dorde-antic December 26, 2025 13:41

mirza-halilcevic added 5 commits December 26, 2025 14:20

Add debug info.

34fffb6

Fix progress bar output.

b6d7bea

Add GPU ID to debug info.

e3e8add

Fix stderr deadlock.

90414e7

Add --num-compile-threads argument.

c8d7bd8

dorde-antic reviewed Dec 28, 2025

dorde-antic approved these changes Dec 29, 2025
