My PR fixes the issue above: https://github.com/dzhwinter/Paddle/tree/review_conv2d_1
The cuDNN op runs on the CUDA device, so its inputs and outputs must stay on the CUDA device. ROCm#16 uses a CPU Tensor to store the selected algorithm, but our framework automatically transforms it into a temporary GPU Tensor. As a result, the cuDNN op cannot access the real persistent Tensor: whatever it writes goes into the temporary copy.
If we allocate the input and output on the GPU and then copy the result to the CPU, we get the correct result.
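To illustrate the behavior described above, here is a toy sketch in plain Python. It does not use the real Paddle API: `Tensor`, `prepare_for_gpu_kernel`, `cudnn_like_kernel`, `run_op`, and `copy_to_cpu` are all hypothetical names, and the "data transform" is only my simplified model of what the framework does before a GPU kernel runs.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    """Hypothetical stand-in for a framework tensor (not the Paddle class)."""
    place: str       # "cpu" or "gpu"
    value: int = -1  # -1 means "no algorithm cached yet"


def prepare_for_gpu_kernel(t: Tensor) -> Tensor:
    """Assumed model of the data transform: a CPU tensor is copied into a
    temporary GPU tensor, while a GPU tensor is passed through unchanged."""
    if t.place == "gpu":
        return t
    return Tensor("gpu", t.value)  # temporary copy, dropped after the op


def cudnn_like_kernel(algo_cache: Tensor) -> None:
    """Pretend the kernel searched and picked algorithm 7, then cached it."""
    algo_cache.value = 7


def run_op(algo_cache: Tensor) -> None:
    """Run the op the way the toy framework would: transform, then launch."""
    kernel_input = prepare_for_gpu_kernel(algo_cache)
    cudnn_like_kernel(kernel_input)


def copy_to_cpu(t: Tensor) -> Tensor:
    """Explicit device-to-host copy of the result."""
    return Tensor("cpu", t.value)


# Case 1: the cache lives on the CPU. The kernel only sees a temporary
# GPU copy, so the persistent tensor is never updated.
cpu_cache = Tensor("cpu")
run_op(cpu_cache)

# Case 2: the cache lives on the GPU. The kernel writes to the persistent
# tensor, and the result is copied to the CPU afterwards.
gpu_cache = Tensor("gpu")
run_op(gpu_cache)
host_copy = copy_to_cpu(gpu_cache)

print(cpu_cache.value, gpu_cache.value, host_copy.value)
```

In this model the CPU-side cache keeps its initial value after the op runs, while the GPU-side cache holds the selected algorithm and can be copied to the host when needed.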