Commit c1b210d
fix(torchrun): use correct datatypes for torchrun args (Red-Hat-AI-Innovation-Team#44)
* fix(torchrun): use correct datatypes for torchrun args
torchrun accepts nproc_per_node and rdzv_id as str.
TorchrunArgs only accepts int, which PyTorch also permits.
This change enables TorchrunArgs to accept both str and int.
It also removes unset or empty parameters before passing them
to torchrun.
Signed-off-by: Saad Zaher <szaher@redhat.com>
* Use Python 3.11 style for pydantic model
Signed-off-by: Saad Zaher <szaher@redhat.com>
* replace - with _ for cli args
Signed-off-by: Saad Zaher <szaher@redhat.com>
* make nproc_per_node accept only gpu or int; remove defaults
Signed-off-by: Saad Zaher <szaher@redhat.com>
* add validation for master_{addr, port} args
Signed-off-by: Saad Zaher <szaher@redhat.com>
* deep check if variables are set and not empty
Co-authored-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
* Update src/mini_trainer/training_types.py
Co-authored-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
* Update src/mini_trainer/api_train.py
Co-authored-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
* do not automatically set --master-port
* Update api_train.py
* use standalone when neither rdzv_endpoint nor master_addr is provided
* Update training_types.py
* update tests
---------
Signed-off-by: Saad Zaher <szaher@redhat.com>
Co-authored-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
1 parent 37881b5 commit c1b210d
3 files changed
Lines changed: 48 additions & 15 deletions
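The behavior described in the commit message can be sketched roughly as follows. This is an illustrative reconstruction, not the code from the commit: it uses a plain dataclass instead of the pydantic model the message mentions (to stay stdlib-only), and the names TorchrunArgsSketch, is_set, and build_torchrun_command are hypothetical. The field names and the underscore-style flags follow the commit message; rejecting numeric strings such as "4" for nproc_per_node is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Literal


@dataclass
class TorchrunArgsSketch:
    """Hypothetical stand-in for TorchrunArgs; no defaults other than 'unset'."""

    nproc_per_node: Literal["gpu"] | int | None = None
    rdzv_id: str | int | None = None
    rdzv_endpoint: str | None = None
    master_addr: str | None = None
    master_port: int | None = None

    def __post_init__(self) -> None:
        n = self.nproc_per_node
        # Accept only the literal "gpu" or an int (assumption: bool and
        # numeric strings are rejected; bool is a subclass of int).
        if n is not None and n != "gpu":
            if isinstance(n, bool) or not isinstance(n, int):
                raise ValueError("nproc_per_node must be 'gpu' or an int")


def is_set(value: object) -> bool:
    """Treat None and empty or whitespace-only strings as unset."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    return True


def build_torchrun_command(args: TorchrunArgsSketch) -> list[str]:
    """Build a torchrun argv, skipping unset or empty parameters."""
    cmd = ["torchrun"]
    for f in fields(args):
        value = getattr(args, f.name)
        if is_set(value):
            # Underscore-style flag names, as in "replace - with _ for cli args".
            cmd.append(f"--{f.name}={value}")
    # Fall back to --standalone when neither rdzv_endpoint nor master_addr is set.
    if not is_set(args.rdzv_endpoint) and not is_set(args.master_addr):
        cmd.append("--standalone")
    return cmd
```

For example, build_torchrun_command(TorchrunArgsSketch(nproc_per_node="gpu", rdzv_id="")) drops the empty rdzv_id and appends --standalone, while passing an rdzv_endpoint keeps the rendezvous flags and omits --standalone. Note that --master_port is never added unless the caller sets it.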