Upgrade PJRT to XLA commit 9e9a0fb / ZML artifacts v17.0.0 #168
sebffischer wants to merge 14 commits
- Update vendored headers and proto files from openxla/xla@9e9a0fb
- Add 2 new proto files: backends.proto, oneapi_compute_capability.proto
- Patch backends.proto edition syntax to proto3 for protobuf@21 compat
- Bump plugin_version() to 17.0.0
- Move patch files from tools/headers/patch/ to tools/patch/ (one per file)
- Add manual-cuda CI mode (workflow_dispatch + PR label) for testing PJRT upgrades before cuda R package is updated
- Add upgrade-pjrt Claude skill

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
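For readers unfamiliar with the backends.proto patch mentioned in the commit message, here is a minimal sketch of what such a rewrite could look like. It is not the patch from this PR: the function name `patch_edition_to_proto3` is hypothetical, and it assumes the file only needs its `edition = "...";` declaration swapped for `syntax = "proto3";`. A real file may need further changes if it relies on edition-specific features.

```python
import re

# Matches a top-level declaration such as: edition = "2023";
_EDITION_RE = re.compile(
    r'^[ \t]*edition[ \t]*=[ \t]*"[^"]*"[ \t]*;', re.MULTILINE
)


def patch_edition_to_proto3(source: str) -> str:
    """Replace the first `edition = "...";` line with `syntax = "proto3";`.

    Hypothetical helper for illustration; raises if no declaration is found
    so that a silently unpatched file is not mistaken for a patched one.
    """
    patched, count = _EDITION_RE.subn('syntax = "proto3";', source, count=1)
    if count == 0:
        raise ValueError("no edition declaration found")
    return patched


if __name__ == "__main__":
    # Made-up file content, not the real backends.proto.
    example = 'edition = "2023";\n\npackage example;\n'
    print(patch_edition_to_proto3(example))
```

Raising on a missing declaration is a deliberate choice in this sketch: if a later upstream update changes the header, the patch step fails loudly instead of passing the file through unchanged.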
XLA commit 9e9a0fb targets CUDA 12.9.1. Update the default container image, cuda R package reference, and cuda_r_package config to 12.9.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@dfalbel I think there is an issue with the cuda runner. I want to upgrade the plugins so they are usable on Linux ARM machines (at least with CPU for now) |
|
We don't have cuda 12.9 yet in the cudatoolkit repo |
|
but the CI path with the manual-cuda tag does now use the cuda12.9 package (I added this in the PR so we can easily test upgrading in the pjrt package without making changes in other packages). |
|
there is some container runtime issue: https://github.com/r-xla/pjrt/actions/runs/23999906166/job/69994155165?pr=168 |
|
Sorry, what's manual cuda? |
|
It's likely a connection timeout error... downloading the cudnn docker container is like a 5GB download running on my local network, which is not super fast. |
|
@dfalbel claude used the wrong CUDA version. The new PJRT build from ZML needs CUDA 13.0 instead (https://github.com/zml/pjrt-artifacts/blob/e1c8db3f6730c040e3ee3a008591c80f6d0f8891/openxla/bazelrc/upstream/.bazelrc#L99-L103). However, I think the hardware of the GPU runner might be too old for that. Did you run into this issue with torch as well? |
|
The GPU I have locally is a 1080 ti with compute capability 6.1. In principle it supports cuda 13. But we need to figure out if ZML pjrt binaries are built with 6.1 support. Torch has dropped it recently and I had to make a custom build for it :( |
|
Unfortunately 6.1 is not supported anymore: https://github.com/zml/pjrt-artifacts/blob/d104b855719bf4256bf1a87e4542285a54d0e594/openxla/bazelrc/upstream/.bazelrc#L99 (this is also already the case for the commit from release 17.0.0) |
|
Maybe the easier way for now to add linux arm support is to just include it here: https://github.com/r-xla/pjrt-builds. Eventually we have to switch to CUDA 13.0 I guess but maybe it's not necessary yet. Maybe we could also make a PR in pjrt-artifacts and add 6.1 here: https://github.com/zml/pjrt-artifacts/blob/d104b855719bf4256bf1a87e4542285a54d0e594/openxla/bazelrc/upstream/.bazelrc#L99 but I guess it's not as easy as that ... |
|
I have created an issue in pjrt-artifacts here: zml/pjrt-artifacts#70. Maybe they can just add it. |
|
@dfalbel there is nothing we can do on your machine I think. With cuda 13.0 offline compilation support for 6.1 was removed (https://docs.nvidia.com/cuda/archive/13.0.0/cuda-toolkit-release-notes/index.html#deprecated-architectures). I might have access to a machine with compute capability 7.5 that can be used to run CI jobs but I will postpone setting this up as long as possible :D |
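The compatibility question discussed above can be sketched as a small check. This is a simplified model, not how XLA or the ZML build decides anything: it assumes a binary built for capability X.y runs on devices X.z with z >= y (same major version), and that embedded PTX for capability P can be JIT-compiled on any device with capability >= P. The function names and the target lists in the example are hypothetical; the real list lives in the linked `.bazelrc`.

```python
from __future__ import annotations


def parse_capability(text: str) -> tuple[int, int]:
    """Parse a compute capability string such as "6.1" into (6, 1)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def device_is_covered(
    device: str,
    binary_targets: list[str],
    ptx_targets: list[str] | None = None,
) -> bool:
    """Return True if a build with the given targets can run on the device.

    binary_targets: capabilities with precompiled device code.
    ptx_targets: capabilities with embedded PTX (assumed empty if None).
    """
    dev = parse_capability(device)
    for target in binary_targets:
        built = parse_capability(target)
        # Precompiled code: same major version, built minor <= device minor.
        if built[0] == dev[0] and built[1] <= dev[1]:
            return True
    for target in ptx_targets or []:
        # PTX is forward compatible to newer devices via JIT compilation.
        if parse_capability(target) <= dev:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical target list, NOT the actual ZML build configuration.
    hypothetical_targets = ["7.5", "8.0", "9.0"]
    print(device_is_covered("6.1", hypothetical_targets))
    print(device_is_covered("7.5", hypothetical_targets))
```

Under this model a 6.1 device is only covered if the build ships binary code or PTX for 6.1 or lower, which matches the conclusion in the thread that nothing can be done on the runner once the toolkit itself drops the architecture.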
|
Ahhh, that's unfortunate. Ideally we should have a cloud-hosted GPU, such as the ones available with GitHub Actions. |
|
Yeah, that would indeed be nice. But if we ran half an hour of CUDA CI per day, that would cost 0.05 * 30 * 30 = 45 euro per month (0.05 euro per minute, 30 minutes per day, 30 days). I think it's more realistic that I use a machine from my university (I am quite certain I will be allowed to do it, we just need to set it up). But for now CUDA 12.8 does the job :D (Also I can hope that eventually torch does not run on your machine anymore so posit has to buy you a new GPU :P) |
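The cost estimate above can be written out as a small calculation. This sketch assumes the 0.05 in the comment is a per-minute rate in euro, which is an inference from the arithmetic (30 minutes per day times 30 days), not a quoted price. The function name is hypothetical.

```python
def monthly_ci_cost(
    rate_per_minute: float,
    minutes_per_day: float,
    days_per_month: int = 30,
) -> float:
    """Estimate the monthly cost of hosted GPU CI.

    rate_per_minute is an assumed price per runner minute, in euro.
    """
    return rate_per_minute * minutes_per_day * days_per_month


if __name__ == "__main__":
    # Figures from the comment: 0.05 per minute, 30 minutes/day, 30 days.
    cost = monthly_ci_cost(0.05, 30)
    print(f"approx. {cost:.2f} euro per month")
```

Keeping the three factors as separate parameters makes it easy to see how the estimate moves, for example if CI only runs on labeled PRs rather than every day.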
TODOs: