[Roadmap] Strengthen CUDA kernel implementation #1227

Our current compute kernel implementation is slower than the `torch.compile()` implementation. This roadmap aims to at least match the performance of the `torch.compile()` implementation.

### Stage 1. Add torch implementation reference

- [#1189](https://github.com/apache/mahout/pull/1189) @ryankert01
- phase @vvvdwbvvv 

### Stage 2. Strengthen our CUDA kernel

- amplitude encoding @ryankert01
- angle encoding @400Ping 
- iqp encoding @vvvdwbvvv 
- phase @vvvdwbvvv 
- etc.

These PRs should:
1. strengthen the current numerical correctness tests with hard test cases
2. strengthen the CUDA kernel implementation accordingly
3. make sure the numerical correctness tests, including the hard test cases, pass
4. make sure we match or beat `torch.compile()` speed
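As a rough illustration of the checklist above, here is a minimal sketch of a correctness and timing harness in plain Python. Everything in it is an assumption made for illustration: the function names, the `HARD_CASES` inputs, and the tolerance are hypothetical and are not taken from the repository. A pure-Python list version of amplitude encoding (dividing a vector by its L2 norm) stands in for both the torch reference and the CUDA kernel. The "hard" cases are inputs where squaring overflows or underflows in floating point, which a naive normalization gets wrong.

```python
import math
import time


def amplitude_encode_reference(values):
    """Hypothetical reference: rescale by the largest magnitude before
    squaring, so the sum of squares can neither overflow nor underflow."""
    if not values or any(not math.isfinite(v) for v in values):
        raise ValueError("input must be non-empty and finite")
    scale = max(abs(v) for v in values)
    if scale == 0.0:
        raise ValueError("cannot normalize the zero vector")
    scaled = [v / scale for v in values]
    norm = math.sqrt(math.fsum(s * s for s in scaled))
    return [s / norm for s in scaled]


def amplitude_encode_naive(values):
    """Hypothetical stand-in for a kernel under test: squares the raw
    inputs, which breaks for very large or very small magnitudes."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]


# Illustrative inputs only; real hard cases depend on the encoding and dtype.
HARD_CASES = {
    "well_scaled": [3.0, 4.0, 0.0, 0.0],
    "mixed_sign": [1.0, -1.0, 1.0, -1.0],
    "huge": [1e200, 1e200, 1e200, 1e200],      # v * v overflows to inf
    "tiny": [1e-200, 1e-200, 1e-200, 1e-200],  # v * v underflows to 0.0
}


def check(candidate, reference, cases, tol=1e-12):
    """Return the names of the cases where candidate disagrees with reference."""
    failures = []
    for name, vec in cases.items():
        expected = reference(vec)
        try:
            got = candidate(vec)
        except ArithmeticError:  # e.g. ZeroDivisionError on a zero norm
            failures.append(name)
            continue
        if max(abs(g - e) for g, e in zip(got, expected)) > tol:
            failures.append(name)
    return failures


def best_time(fn, vec, repeats=5):
    """Best wall-clock time of fn(vec) over several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(vec)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    print("failing cases:", check(amplitude_encode_naive,
                                  amplitude_encode_reference, HARD_CASES))
    vec = [float(i + 1) for i in range(1 << 12)]
    ratio = best_time(amplitude_encode_naive, vec) / best_time(
        amplitude_encode_reference, vec)
    print(f"candidate / reference time: {ratio:.2f}")
```

In a real PR the reference would be the torch implementation and the candidate the CUDA kernel. GPU work is launched asynchronously, so a timing loop like `best_time` would also need a device synchronization (for example `torch.cuda.synchronize()`) before each timer read to give meaningful numbers.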
