In this episode we discuss "Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP".
Authors:
- Feng Liang
- Bichen Wu
- Xiaoliang Dai
- Kunpeng Li
- Yinan Zhao
- Hang Zhang
- Peizhao Zhang
- Peter Vajda
- Diana Marculescu
Affiliations:
- Feng Liang and Diana Marculescu are affiliated with The University of Texas at Austin.
- Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Peizhao Zhang, and Peter Vajda are affiliated with Meta Reality Labs.
- Hang Zhang is affiliated with Cruise.

The paper proposes a method to improve open-vocabulary semantic segmentation: segmenting an image into semantic regions described by arbitrary text, including categories not seen during training. The prevailing two-stage approach first generates class-agnostic mask proposals and then uses a pre-trained vision-language model such as CLIP to classify the masked regions. The authors identify the pre-trained CLIP model as the bottleneck of this pipeline: CLIP performs poorly on masked images, whose blanked-out backgrounds differ from the natural images it was trained on. To address this, they fine-tune CLIP on a collection of masked image regions paired with text descriptions, mined from an existing image-caption dataset. They also introduce "mask prompt tuning", which makes use of the "blank" areas in masked images by replacing them with learnable prompt tokens. The authors demonstrate that their method achieves a significant improvement over the previous state of the art on the ADE20K-150 dataset.
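To make the two ideas above concrete, here is a minimal NumPy sketch of (a) blanking the pixels outside a class-agnostic mask proposal, producing the kind of masked image that vanilla CLIP handles poorly, and (b) the mask-prompt-tuning idea of substituting learnable tokens for fully blanked patch tokens. All function and variable names are illustrative assumptions for exposition, not the authors' actual code or a real CLIP API.

```python
import numpy as np

def apply_mask(image, mask):
    """Zero out pixels outside a binary mask proposal.

    image: (H, W, 3) float array; mask: (H, W) bool array.
    Returns the 'masked image' that the two-stage pipeline
    would feed to CLIP for region classification.
    """
    return image * mask[..., None]

def mask_prompt_tokens(patch_tokens, patch_keep, prompt_tokens):
    """Sketch of mask prompt tuning: instead of leaving blanked
    patches as zero tokens, substitute learnable prompt tokens.

    patch_tokens: (N, D) patch embeddings; patch_keep: (N,) bool,
    True for patches inside the mask; prompt_tokens: (N, D)
    learnable parameters (illustrative shape).
    """
    out = patch_tokens.copy()
    out[~patch_keep] = prompt_tokens[~patch_keep]
    return out

# Toy demonstration with random data.
rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True                      # keep only the top-left region
masked = apply_mask(image, mask)         # background pixels become 0

patch_tokens = rng.random((16, 8))
patch_keep = mask.reshape(-1)            # one token per pixel-patch here
prompts = rng.random((16, 8))            # stand-in for learnable prompts
tuned = mask_prompt_tokens(patch_tokens, patch_keep, prompts)
```

In the paper's framing, the prompt tokens would be trained (with the rest of CLIP frozen or lightly fine-tuned) so that the "blank" regions carry useful signal rather than zeros.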