This paper introduces a gradient caching technique that decouples backpropagation between the contrastive loss and the encoder, removing the in-batch data dependency of the encoder's backward pass. As a result, gradients can be computed for one sub-batch at a time, leading to almost constant memory usage.
Luyu Gao, Yunyi Zhang, Jiawei Han, Jamie Callan (2021)
https://arxiv.org/pdf/2101.06983v2.pdf
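The core idea above can be sketched numerically. The following is a minimal illustration, not the paper's implementation: it assumes a linear encoder `f(x) = x @ W` and an InfoNCE-style loss with in-batch negatives (the paper applies the same chain-rule decomposition to deep encoders via autograd). It first caches representations chunk by chunk, then computes the loss gradient at the representation level for the full batch, and finally backpropagates each chunk's cached representation gradient separately, recovering the exact full-batch gradient.

```python
# Gradient caching sketch (assumed linear encoder + InfoNCE-style loss,
# not the paper's actual code).
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, chunk = 8, 5, 4, 2          # batch size, dims, sub-batch size
Xq, Xk = rng.normal(size=(n, d_in)), rng.normal(size=(n, d_in))
W = rng.normal(size=(d_in, d_out))           # shared encoder weights

def softmax(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# --- Reference: full-batch gradient of the contrastive loss w.r.t. W ---
Q, K = Xq @ W, Xk @ W                        # query / key representations
P = softmax(Q @ K.T)                         # in-batch similarity softmax
dS = (P - np.eye(n)) / n                     # d(mean cross-entropy)/d(scores)
dQ, dK = dS @ K, dS.T @ Q                    # representation-level gradients
grad_full = Xq.T @ dQ + Xk.T @ dK

# --- Gradient caching: same gradient, one sub-batch at a time ---
# Step 1: cheap gradient-free forward pass per chunk; cache representations.
Qc = np.vstack([Xq[i:i + chunk] @ W for i in range(0, n, chunk)])
Kc = np.vstack([Xk[i:i + chunk] @ W for i in range(0, n, chunk)])
# Step 2: loss gradient w.r.t. the cached representations (batch-wide but
# small: it never touches encoder activations).
Pc = softmax(Qc @ Kc.T)
dSc = (Pc - np.eye(n)) / n
dQc, dKc = dSc @ Kc, dSc.T @ Qc
# Step 3: re-encode each chunk and backprop its cached representation
# gradient; peak memory now scales with the chunk, not the batch.
grad_cached = np.zeros_like(W)
for i in range(0, n, chunk):
    grad_cached += Xq[i:i + chunk].T @ dQc[i:i + chunk]
    grad_cached += Xk[i:i + chunk].T @ dKc[i:i + chunk]

assert np.allclose(grad_full, grad_cached)   # identical up to float rounding
```

Because the contrastive loss couples examples only through their representations, splitting the backward pass this way changes the memory footprint but not the computed gradient, which is what makes large-batch contrastive training feasible on limited hardware.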