Our current compute kernel implementation is slower than the torch.compile() implementation. This roadmap aims to at least match the torch.compile() implementation.
Stage 1. Add a torch reference implementation
Stage 2. Strengthen our CUDA kernel
These PRs should
- strengthen the current numerical correctness tests with hard test cases
- strengthen the CUDA kernel implementation accordingly
- make sure the numerical correctness tests with hard test cases pass
- make sure we match or exceed torch.compile's speed
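
The acceptance criteria above (hard-case correctness plus a speed comparison) could be checked with a small harness along the following lines. This is only a sketch under stated assumptions: the roadmap does not name the operation, so a softmax is used purely as a stand-in, and both functions are plain-Python placeholders. In practice `reference_softmax` would be the torch reference from Stage 1 and `candidate_softmax` would be our CUDA kernel. All names here (`HARD_CASES`, `max_abs_error`, `speed_ratio`) are hypothetical.

```python
import math
import timeit


def reference_softmax(xs):
    # Placeholder for the Stage 1 reference (assumed op: softmax).
    m = max(xs)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_softmax(xs):
    # Placeholder for the kernel under test; log-sum-exp formulation.
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [math.exp(x - lse) for x in xs]


# Hypothetical "hard" inputs that tend to expose numerical problems.
HARD_CASES = [
    [0.0],                               # single element
    [1000.0, 1000.0, 1000.0],            # large equal values
    [-1000.0, 0.0, 1000.0],              # wide dynamic range
    [1e-8, 2e-8, 3e-8],                  # nearly identical tiny values
    [float(i) for i in range(-50, 50)],  # longer mixed-sign input
]


def max_abs_error(candidate, reference, cases):
    """Largest absolute elementwise difference over all cases."""
    worst = 0.0
    for xs in cases:
        for got, want in zip(candidate(xs), reference(xs)):
            worst = max(worst, abs(got - want))
    return worst


def speed_ratio(candidate, reference, xs, number=200, repeat=5):
    """Candidate time divided by reference time (<= 1.0 means not slower)."""
    t_cand = min(timeit.repeat(lambda: candidate(xs), number=number, repeat=repeat))
    t_ref = min(timeit.repeat(lambda: reference(xs), number=number, repeat=repeat))
    return t_cand / t_ref


if __name__ == "__main__":
    err = max_abs_error(candidate_softmax, reference_softmax, HARD_CASES)
    ratio = speed_ratio(candidate_softmax, reference_softmax, HARD_CASES[-1])
    print(f"max abs error: {err:.3e}")
    print(f"time ratio (candidate / reference): {ratio:.2f}")
```

A PR could then gate on an error tolerance and on the ratio staying at or below 1.0. When timing real GPU code, the device also needs to be synchronized before the clock is read, since CUDA kernel launches are asynchronous.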