Dear mlpotter, your code looks great! However, I noticed that only the initial input is processed by causal convolutions, while K and Q are still computed inside `torch.nn.TransformerEncoderLayer` (i.e., by its internal linear projections). As a result, the attention itself is identical to the canonical Transformer architecture.
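If the intent was to make the attention itself convolutional, one option is to compute Q and K with causal 1-D convolutions rather than linear projections. A minimal sketch of that idea (the module name, `d_model`, and `kernel_size` here are illustrative assumptions, not your code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvQK(nn.Module):
    """Sketch: produce Q and K via causal 1-D convolutions, instead of the
    linear projections used internally by nn.TransformerEncoderLayer."""
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        self.q_conv = nn.Conv1d(d_model, d_model, kernel_size)
        self.k_conv = nn.Conv1d(d_model, d_model, kernel_size)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model) -> Conv1d expects channels first
        x = x.transpose(1, 2)
        # left-pad so each position only sees itself and the past (causal)
        x = F.pad(x, (self.kernel_size - 1, 0))
        q = self.q_conv(x).transpose(1, 2)  # (batch, seq_len, d_model)
        k = self.k_conv(x).transpose(1, 2)
        return q, k
```

The left padding is what makes the convolution causal: perturbing a future timestep leaves Q and K at earlier positions unchanged, so no future information leaks into the attention scores.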