This paper presents contrastive-tuning, a simple method that uses contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study, we find that locking the pre-trained image model while leaving the text model unlocked works best. We call this instance of contrastive-tuning “Locked-image Tuning” (LiT), which simply teaches a text model to read out good representations from a pre-trained image model for new tasks.
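Below is a minimal sketch (not the authors' code) of what LiT-style contrastive tuning looks like: the image features come from a locked, pre-trained image encoder and receive no gradients, while a toy text tower is trained with a symmetric contrastive (InfoNCE) loss. All names (`init_text_tower`, `lit_contrastive_loss`, the feature dimensions) are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of LiT-style contrastive tuning in JAX (illustrative, not the authors' code).
# Assumption: `frozen_image_features` are outputs of a locked, pre-trained image model,
# treated as constants; only the toy text tower below receives gradient updates.
import jax
import jax.numpy as jnp

def init_text_tower(key, vocab_size=1000, dim=64, out_dim=32):
    k1, k2 = jax.random.split(key)
    return {
        "embed": jax.random.normal(k1, (vocab_size, dim)) * 0.02,  # token embeddings
        "proj": jax.random.normal(k2, (dim, out_dim)) * 0.02,      # projection head
    }

def text_tower(params, token_ids):
    # Mean-pool token embeddings, then project: a stand-in for a real text encoder.
    emb = params["embed"][token_ids].mean(axis=1)
    return emb @ params["proj"]

def lit_contrastive_loss(params, frozen_image_features, token_ids, temperature=0.07):
    # Image side is locked: stop_gradient makes explicit that no gradients flow into it.
    img = jax.lax.stop_gradient(frozen_image_features)
    txt = text_tower(params, token_ids)
    img = img / jnp.linalg.norm(img, axis=-1, keepdims=True)
    txt = txt / jnp.linalg.norm(txt, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature                  # pairwise cosine similarities
    labels = jnp.arange(logits.shape[0])                # matched pairs lie on the diagonal
    loss_i2t = -jax.nn.log_softmax(logits, axis=1)[labels, labels].mean()
    loss_t2i = -jax.nn.log_softmax(logits, axis=0)[labels, labels].mean()
    return 0.5 * (loss_i2t + loss_t2i)                  # symmetric contrastive loss

# Usage: one gradient step that updates only the text tower.
key = jax.random.PRNGKey(0)
params = init_text_tower(key)
frozen_image_features = jax.random.normal(key, (8, 32))   # stand-in for locked image-model outputs
token_ids = jax.random.randint(key, (8, 16), 0, 1000)     # stand-in tokenized captions
loss, grads = jax.value_and_grad(lit_contrastive_loss)(params, frozen_image_features, token_ids)
params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
```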
2021: Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, Lucas Beyer
Ranked #1 on Zero-Shot Transfer Image Classification on ImageNet ReaL
https://arxiv.org/pdf/2111.07991v3.pdf