Your kernel implementation appears simpler than I expected. Unfortunately, it does not outperform cuSPARSE or other fast sparse kernel libraries. The benchmark evaluation also seems biased: all of your test files measure single-layer operations with no framework overhead, whereas your baseline measurements are taken inside a framework. For a fairer comparison, could you provide the relevant artifacts, or adjust your testing methodology so that framework overhead is included for both your implementation and the baseline?
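To make the request concrete, here is a minimal sketch of the kind of harness that would address the concern; it is an illustration under stated assumptions, not a prescribed setup. The function names (`time_callable`, `compare`) and the placeholder workloads are hypothetical. The point is only that both the proposed kernel and the baseline are invoked through the same call path and timed by the same code, so neither side is measured with overhead the other avoids.

```python
import statistics
import time


def time_callable(fn, warmup=3, repeats=20):
    """Return the median wall-clock time in seconds of fn() over `repeats` runs."""
    # Warm-up runs are executed but not recorded.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(candidates, warmup=3, repeats=20):
    """Time every candidate with the same harness and the same settings."""
    return {
        name: time_callable(fn, warmup=warmup, repeats=repeats)
        for name, fn in candidates.items()
    }


# Hypothetical stand-ins. In a real comparison, each callable would run the
# same layer inside the same framework, once with the proposed kernel and
# once with the baseline. For GPU work, each callable must also wait for the
# device to finish before returning (for example torch.cuda.synchronize() in
# PyTorch); otherwise only the kernel launch is timed.
if __name__ == "__main__":
    results = compare({
        "proposed": lambda: sum(i * i for i in range(10_000)),
        "baseline": lambda: sum(i * i for i in range(10_000)),
    })
    for name, seconds in results.items():
        print(f"{name}: {seconds * 1e6:.1f} us")
```

Reporting the median rather than the mean keeps occasional scheduling spikes from dominating the result; reporting the spread alongside it would help as well.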