batched_vec is currently implemented on top of batched_mul (which calls batched gemm), with only a few extra reshapes.
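To make the two code paths concrete, here is a minimal, language-neutral sketch in plain Python of the reshape idea described above. It is not the library's implementation: the function names batched_gemm, batched_gemv and batched_vec_via_gemm are hypothetical stand-ins, and the batch-first nested-list layout is an assumption chosen for readability.

```python
# Illustrative only: plain-Python stand-ins, not the library's actual code.
# Layout assumption: A[k] is the k-th m-by-n matrix, b[k] the k-th length-n vector.

def batched_gemm(A, B):
    """Batched matrix-matrix product: C[k] = A[k] * B[k]."""
    return [
        [[sum(a_row[t] * Bk[t][j] for t in range(len(Bk)))
          for j in range(len(Bk[0]))]
         for a_row in Ak]
        for Ak, Bk in zip(A, B)
    ]

def batched_gemv(A, b):
    """Batched matrix-vector product computed directly: y[k] = A[k] * b[k]."""
    return [
        [sum(x * y for x, y in zip(a_row, bk)) for a_row in Ak]
        for Ak, bk in zip(A, b)
    ]

def batched_vec_via_gemm(A, b):
    """Same result through gemm: view each vector as an n-by-1 matrix."""
    B = [[[x] for x in bk] for bk in b]          # (n,) -> (n, 1) per batch
    C = batched_gemm(A, B)                       # (m, 1) per batch
    return [[row[0] for row in Ck] for Ck in C]  # (m, 1) -> (m,) per batch

A = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
b = [[1, 1], [5, 7]]
assert batched_vec_via_gemm(A, b) == batched_gemv(A, b)
```

Both routes give the same numbers, so the question raised here is only which underlying kernel gets dispatched, not what is computed.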
Some basic benchmarks (on an RTX PRO 6000 Blackwell) suggest batched gemv is sometimes 1-2% faster:
and converges to be similar in the limit:
but in some cases is consistently slightly slower:
Not sure if this has been considered already. The difference isn't huge, but in the cases where there actually is justification to specifically use gemv, it could be nice to have the option.