Dear mlpotter, your code looks great! However, I noticed that only the initial input is processed by causal convolutions, while K and Q are still computed inside `torch.nn.TransformerEncoderLayer` (i.e., by its internal linear projections). As a result, the attention itself is identical to the canonical Transformer architecture.
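If the intent was to make the attention itself convolutional, one option is to compute Q and K with causal 1-D convolutions rather than linear projections. A minimal sketch of that idea (the module name, `d_model`, and `kernel_size` here are illustrative assumptions, not your code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvQK(nn.Module):
    """Sketch: produce Q and K via causal 1-D convolutions, instead of the
    linear projections used internally by nn.TransformerEncoderLayer."""
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        self.q_conv = nn.Conv1d(d_model, d_model, kernel_size)
        self.k_conv = nn.Conv1d(d_model, d_model, kernel_size)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model) -> Conv1d expects channels first
        x = x.transpose(1, 2)
        # left-pad so each position only sees itself and the past (causal)
        x = F.pad(x, (self.kernel_size - 1, 0))
        q = self.q_conv(x).transpose(1, 2)  # (batch, seq_len, d_model)
        k = self.k_conv(x).transpose(1, 2)
        return q, k
```

The left padding is what makes the convolution causal: perturbing a future timestep leaves Q and K at earlier positions unchanged, so no future information leaks into the attention scores.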