Hello,
I implemented the algorithm in the vision transformer architecture as follows:
# inside __init__()
self.spe = SineSPE(num_heads=head_cnt, in_features=in_dim, num_sines=5, num_realizations=64)
self.filter = SPEFilter(gated=False, code_shape=self.spe.code_shape)

# inside forward()
q, k = self.filter(q, k, self.spe(q.shape[:2]))
qp, kp = performer(...)
out = lin_attention(...)
The model I am using has 4 layers, 6 heads, embedding dimension 384, and patch_size=4.
Training for 100 epochs on CIFAR-100 converges to 42.3% with SPE and to 45.3% without it. While that gap may be expected, with SPE the training time is around 6x longer. Is that normal?
Performers + ViT takes 39 minutes
Performers + ViT + SPE takes around 4 hours
For both I am using 2 Titan XP GPUs.
This is a serious problem for me, because I was planning to scale these experiments up to ImageNet.
I would also like to know how to implement the indexing T = N^2 for images, as described in Section 2 of the paper (where did you do this in the LRA benchmark?).
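In case it clarifies my question, here is my current understanding of that indexing (this is just my guess, not something I found in your code): I assume the T = N^2 positions come from flattening the N x N patch grid in row-major order, like so:

```python
# My guess at the T = N^2 indexing: map a patch at grid position (i, j)
# on an N x N grid to a single sequence index t in [0, N^2), row-major.
def patch_index(i, j, n):
    """Row-major sequence index of patch (i, j) on an n x n grid."""
    return i * n + j

n = 8  # e.g. 32x32 CIFAR image with patch_size=4 -> 8x8 patch grid
indices = [patch_index(i, j, n) for i in range(n) for j in range(n)]
assert indices == list(range(n * n))  # covers all T = N^2 positions
```

Is this the intended mapping, or does Section 2 imply something else (e.g. treating the two axes separately)?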
Many thanks!