Upgrade PJRT to XLA commit 9e9a0fb / ZML artifacts v17.0.0 #168

Draft
sebffischer wants to merge 14 commits into main from skill-upgrade-pjrt

@sebffischer
Collaborator

@sebffischer sebffischer commented Apr 5, 2026

  • Update vendored headers and proto files from openxla/xla@9e9a0fb
  • Add 2 new proto files: backends.proto, oneapi_compute_capability.proto
  • Patch backends.proto edition syntax to proto3 for protobuf@21 compat
  • Bump plugin_version() to 17.0.0
  • Move patch files from tools/headers/patch/ to tools/patch/ (one per file)
  • Add manual-cuda CI mode (workflow_dispatch + PR label) for testing PJRT upgrades before the cuda R package is updated
  • Add upgrade-pjrt Claude skill
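
As an illustration of the backends.proto patch listed above, here is a minimal sketch of rewriting an `edition` declaration to `syntax = "proto3";`. The helper name `patch_edition_to_proto3` and the sample file content are hypothetical; the actual patch under tools/patch/ may look different, and a real edition file may need further changes beyond this one line.

```python
import re

# Matches a top-level declaration such as: edition = "2023";
# Only spaces and tabs are allowed around the tokens so that blank
# lines before the declaration are left untouched.
_EDITION_RE = re.compile(r'^[ \t]*edition[ \t]*=[ \t]*"[^"]*"[ \t]*;', re.MULTILINE)


def patch_edition_to_proto3(proto_source: str) -> str:
    """Replace the first `edition = "...";` line with `syntax = "proto3";`.

    Hypothetical helper for illustration; it does not touch any
    edition-specific feature options that a real file might contain.
    """
    return _EDITION_RE.sub('syntax = "proto3";', proto_source, count=1)


# Assumed sample input, not the real content of backends.proto.
sample = 'edition = "2023";\n\npackage example;\n'
print(patch_edition_to_proto3(sample))
```

Files without an `edition` declaration pass through unchanged, so the helper can be applied to every vendored proto file without special-casing.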

TODOs:

  • When using a newer CUDA version, we need to ensure that all CUDA types and signatures are defined correctly. We don't include the CUDA SDK, so we declare them manually, which is error-prone.


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
XLA commit 9e9a0fb targets CUDA 12.9.1. Update the default container
image, cuda R package reference, and cuda_r_package config to 12.9.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sebffischer
Collaborator Author

@dfalbel I think there is an issue with the CUDA runner. I want to upgrade the plugins so that they are usable on Linux ARM machines (at least with CPU for now).

@dfalbel
Collaborator

dfalbel commented Apr 6, 2026

We don't have CUDA 12.9 in the cudatoolkit repo yet.

@sebffischer
Collaborator Author

But the CI path with the manual-cuda tag now uses the CUDA 12.9 package. I added this in the PR so that we can easily test upgrades in the pjrt package without making changes in other packages.
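
The trigger condition for this CI mode (workflow_dispatch or a PR label) can be sketched as a small predicate. This is only an illustration of the gating logic, not the workflow file from this PR: the function name is hypothetical, and the label name `manual-cuda` is assumed from the name of the mode.

```python
from typing import Sequence

# ASSUMPTION: the PR label is literally called "manual-cuda".
MANUAL_CUDA_LABEL = "manual-cuda"


def should_run_manual_cuda(event_name: str, pr_labels: Sequence[str]) -> bool:
    """Decide whether the manual-cuda job should run.

    `event_name` is the GitHub Actions event that started the workflow,
    `pr_labels` are the labels currently attached to the pull request.
    """
    # A manual dispatch always runs the job.
    if event_name == "workflow_dispatch":
        return True
    # Otherwise the job only runs for pull requests that carry the label.
    return event_name == "pull_request" and MANUAL_CUDA_LABEL in pr_labels


print(should_run_manual_cuda("pull_request", ["manual-cuda"]))
print(should_run_manual_cuda("push", []))
```

Gating on a label keeps the regular CI path unchanged for PRs that do not need the newer CUDA package.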

@sebffischer
Collaborator Author

There is some container runtime issue: https://github.com/r-xla/pjrt/actions/runs/23999906166/job/69994155165?pr=168

@dfalbel
Collaborator

dfalbel commented Apr 6, 2026

Sorry, what's manual cuda?

@dfalbel
Collaborator

dfalbel commented Apr 6, 2026

It's likely a connection timeout error. Pulling the cudnn Docker container is a download of about 5 GB, and the runner sits on my local network, which is not very fast.

@sebffischer
Collaborator Author

@dfalbel Claude used the wrong CUDA version. The new PJRT build from ZML actually requires CUDA 13.0 (https://github.com/zml/pjrt-artifacts/blob/e1c8db3f6730c040e3ee3a008591c80f6d0f8891/openxla/bazelrc/upstream/.bazelrc#L99-L103). However, I think the hardware of the GPU runner might be too old for that. Did you run into this issue with torch as well?

@dfalbel
Collaborator

dfalbel commented Apr 7, 2026

The GPU I have locally is a 1080 Ti with compute capability 6.1. In principle it supports CUDA 13, but we need to figure out whether the ZML PJRT binaries are built with 6.1 support. Torch dropped it recently and I had to make a custom build for it :(

@sebffischer
Collaborator Author

sebffischer commented Apr 7, 2026

Unfortunately, 6.1 is no longer supported: https://github.com/zml/pjrt-artifacts/blob/d104b855719bf4256bf1a87e4542285a54d0e594/openxla/bazelrc/upstream/.bazelrc#L99 (this is already the case at the commit used for release 17.0.0).

@sebffischer
Collaborator Author

Maybe the easier way to add Linux ARM support for now is to just include it here: https://github.com/r-xla/pjrt-builds.

Eventually we will have to switch to CUDA 13.0, I guess, but maybe it's not necessary yet.

Maybe we could also open a PR in pjrt-artifacts and add 6.1 here: https://github.com/zml/pjrt-artifacts/blob/d104b855719bf4256bf1a87e4542285a54d0e594/openxla/bazelrc/upstream/.bazelrc#L99 but I guess it's not as easy as that ...

@sebffischer
Collaborator Author

I have created an issue in pjrt-artifacts here: zml/pjrt-artifacts#70. Maybe they can just add it.

@sebffischer
Collaborator Author

@dfalbel I think there is nothing we can do on your machine. With CUDA 13.0, offline compilation support for 6.1 was removed (https://docs.nvidia.com/cuda/archive/13.0.0/cuda-toolkit-release-notes/index.html#deprecated-architectures). I might have access to a machine with compute capability 7.5 that could be used to run CI jobs, but I will postpone setting this up as long as possible :D
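
The compatibility question discussed here comes down to comparing a device's compute capability against the lowest one a build targets. A minimal sketch, assuming a minimum of 7.5 purely for illustration (the thread only establishes that 6.1 is unsupported; the real value should be read from the pjrt-artifacts .bazelrc). The function names are hypothetical.

```python
from typing import Tuple


def parse_capability(text: str) -> Tuple[int, int]:
    """Turn a compute capability string such as "6.1" into (6, 1)."""
    major, minor = text.split(".")
    return int(major), int(minor)


def is_supported(device: str, minimum: str) -> bool:
    """Return True if the device capability is at least the build minimum."""
    return parse_capability(device) >= parse_capability(minimum)


# ASSUMPTION: 7.5 is used as the minimum for illustration only.
MIN_CAPABILITY = "7.5"

for device in ("6.1", "7.5"):
    print(device, is_supported(device, MIN_CAPABILITY))
# prints "6.1 False" then "7.5 True"
```

Comparing integer tuples rather than strings avoids ordering mistakes such as "10.0" sorting before "7.5".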

@sebffischer sebffischer marked this pull request as draft April 7, 2026 13:07
@dfalbel
Collaborator

dfalbel commented Apr 7, 2026

Ahhh, that's unfortunate. Ideally we should have a cloud-hosted GPU, such as the ones available with GitHub Actions.

@sebffischer
Collaborator Author

Yeah, that would indeed be nice. But if we ran half an hour of CUDA CI per day, that would cost 0.05 * 30 * 30 = 45 euro per month. I think it's more realistic that I use a machine from my university (I am quite certain I will be allowed to, we just need to set it up). But for now CUDA 12.8 does the job :D (Also, I can hope that torch eventually stops running on your machine, so Posit has to buy you a new GPU :P)
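
The cost estimate above can be reproduced with a small helper, which also makes it easy to try other usage levels. This sketch assumes the 0.05 figure is a price in euro per minute of runner time and that a month has 30 days; neither unit is stated explicitly in the comment, and the function name is hypothetical.

```python
def monthly_ci_cost(rate_per_minute: float, minutes_per_day: float,
                    days_per_month: int = 30) -> float:
    """Estimate the monthly cost of a paid CI runner.

    ASSUMPTION: billing is linear in minutes used, with no free tier
    or minimum charge.
    """
    return rate_per_minute * minutes_per_day * days_per_month


# Half an hour of CUDA CI per day at an assumed 0.05 euro per minute.
print(f"{monthly_ci_cost(0.05, 30):.2f} euro per month")
# prints "45.00 euro per month"
```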
