Performance Improvements

* [ ] Implement the conversion from the LU pivots vector to a permutation vector as a custom CUDA call. Currently this is implemented with a while loop, which is rather expensive on XLA:GPU (https://github.com/jax-ml/jax/issues/5880). Doing so requires a mechanism not only to register custom calls but also to build custom CUDA kernels, and we currently have no infrastructure for that.
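
For reference, the conversion in question can be sketched in plain Python. This is only an illustration of the sequential swap semantics, not the existing implementation: the function name is illustrative, and zero-based pivot indices are assumed (LAPACK itself reports one-based pivots). Each step depends on the swaps before it, which is why the computation is expressed as a loop.

```python
def lu_pivots_to_permutation(pivots, permutation_size):
    """Turn LU row-swap pivots into an explicit permutation (reference sketch).

    Assumes zero-based pivots: at step i, row i was swapped with row pivots[i].
    """
    permutation = list(range(permutation_size))
    for i, p in enumerate(pivots):
        # Replay the swap recorded at elimination step i.
        permutation[i], permutation[p] = permutation[p], permutation[i]
    return permutation


if __name__ == "__main__":
    print(lu_pivots_to_permutation([2, 2, 2], 3))
```

A custom kernel would need to reproduce exactly this result, for instance by running the swaps for one matrix of a batch per thread; that mapping is an assumption about one possible design, not a decided approach.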