Hi,
Thanks for providing the Mamba implementation. Is there a workaround for computing `deltaA` and `deltaB_u` more efficiently, so that I can avoid running out of GPU memory? These are the parameters I used to create the Mamba instance:
d_model: 1024
n_layer: 4
d_state: 1024
expand: 2
The other parameters are set to their default values.
This results in a model of ~60M parameters. However, I run out of memory (24 GB GPU) when training with a batch size of 256, or even as low as 64. This is probably caused by the large tensors computed for `deltaA` and `deltaB_u`.
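To put a rough number on this, here is a back-of-the-envelope estimate. It is only a sketch under my own assumptions: that `deltaA` and `deltaB_u` are each materialized with shape `(batch, seq_len, d_inner, d_state)`, where `d_inner = expand * d_model`, and that they are stored in float32. The helper name `ssm_tensor_bytes` is made up for illustration, and the sequence length is left as a parameter.

```python
GIB = 1024 ** 3

def ssm_tensor_bytes(batch, seq_len, d_model, expand, d_state, bytes_per_elem=4):
    """Size in bytes of ONE tensor of assumed shape (batch, seq_len, d_inner, d_state)."""
    d_inner = expand * d_model  # assumption: inner width is expand * d_model
    return batch * seq_len * d_inner * d_state * bytes_per_elem

# Cost per sequence position (seq_len=1) for the settings in this post.
for batch in (64, 256):
    per_position = ssm_tensor_bytes(batch, 1, d_model=1024, expand=2, d_state=1024)
    print(f"batch={batch}: {per_position / GIB:.2f} GiB per sequence position, per tensor")
```

If these assumptions hold, each tensor costs about 0.5 GiB per sequence position at batch size 64 and 2 GiB at batch size 256, before counting the second tensor or anything kept for the backward pass, so the total grows linearly with sequence length.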