Transformer-based models consist of interleaved feed-forward blocks, which capture content meaning, and relatively more expensive self-attention blocks, which capture context meaning. In this paper, we explore trade-offs in, and the ordering of, these blocks to improve upon the current Transformer architecture, and propose the PAR Transformer.
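To make the idea concrete, here is a minimal Python (PyTorch) sketch of building a layer stack from an arbitrary pattern of self-attention and feed-forward blocks, so different orderings can be compared. The pattern strings and hyperparameters are illustrative placeholders only, not the architecture reported in the paper.

import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Pre-norm multi-head self-attention block (captures context meaning)."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class FeedForwardBlock(nn.Module):
    """Pre-norm position-wise feed-forward block (captures content meaning)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return x + self.ff(self.norm(x))

def build_stack(pattern, d_model=768, n_heads=12, d_ff=3072):
    """Build a stack from a pattern string: 's' = self-attention, 'f' = feed-forward."""
    blocks = []
    for ch in pattern:
        if ch == "s":
            blocks.append(SelfAttentionBlock(d_model, n_heads))
        elif ch == "f":
            blocks.append(FeedForwardBlock(d_model, d_ff))
        else:
            raise ValueError(f"unknown block type: {ch!r}")
    return nn.Sequential(*blocks)

# Baseline Transformer: strictly interleaved attention and feed-forward blocks.
baseline = build_stack("sf" * 6)
# A hypothetical reordering with fewer attention blocks, for illustration only.
reordered = build_stack("sfsf" + "f" * 8)

x = torch.randn(2, 16, 768)  # (batch, sequence, hidden)
print(baseline(x).shape, reordered(x).shape)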
2020: Swetha Mandava, Szymon Migacz, Alex Fit-Florea
Ranked #1 on Paraphrase Identification on MRPC
https://arxiv.org/pdf/2009.04534v3.pdf