The BLAS interface defines a matrix multiplication primitive for 2D matrices, but in modern neural networks "batched matrix multiplication" is very important.
For example, consider the following multiplication, taken from a BERT model with seqLen of 128:
[4, 12, 128, 128] x [4, 12, 128, 64] = [4, 12, 128, 64].
This operation effectively defines 48 independent multiplications of [128, 128] x [128, 64] matrices: one per entry of the 4 x 12 batch.
With current APIs that would be a sequential single-threaded loop, which is quite inefficient since the inner matrices are relatively small.
We get much better performance by running these multiplications in parallel along the batch dimension.
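For illustration, here is a minimal sketch of what "parallel along the batch dimension" could look like, assuming contiguous row-major float tensors and a plain CBLAS sgemm; the helper name `batchedSgemm` and the layout handling are illustrative assumptions, not the actual libnd4j code:

```cpp
// Minimal sketch, NOT the actual libnd4j implementation: a batch-parallel
// GEMM over the flattened batch dimensions, assuming contiguous row-major
// float tensors and a standard CBLAS sgemm. All names are illustrative.
#include <cblas.h>
#include <cstddef>

// Computes C[b] = A[b] * B[b] for b in [0, batch), where each A[b] is
// M x K, each B[b] is K x N, and batch entries are stored back to back.
static void batchedSgemm(const float* A, const float* B, float* C,
                         int batch, int M, int N, int K) {
    #pragma omp parallel for
    for (int b = 0; b < batch; b++) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f, A + (std::size_t)b * M * K, K,
                          B + (std::size_t)b * K * N, N,
                    0.0f, C + (std::size_t)b * M * N, N);
    }
}
```

For the BERT example above this would run with batch = 4 * 12 = 48, M = 128, K = 128, N = 64, so the 48 small multiplications are distributed across threads instead of executing one after another.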
Here's the relatively simple batched_gemm code we've added:
deeplearning4j/libnd4j/include/helpers/cpu/MmulHelper.cpp, line 1018 in 3bf785f:

```cpp
static void batchedGemmUnPackC(const NDArray* vA, const NDArray* vB, NDArray* vC,
```
It provides roughly 12x better performance than the sequential single-threaded loop.
So it would be awesome to have proper support for such a primitive in the NEC BLAS library.