arXiv preprint - Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
In this episode, we discuss Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings by Sahand Sharifzadeh, Christos Kaplanis, Shreya Pathak, Dharshan Kumaran, Anastasija Ilic, Jovana Mitrovic, Charles Blundell, and Andrea Banino. The paper introduces a method that combines Large Language Models (LLMs) and image generation models to synthetically create image-text pairs for training Visual-Language Models (VLMs), circumventing the need for extensive human-labeled data. Synthetic image embeddings, generated from LLM-produced captions, are used to train VLMs effectively, yielding a 17% performance improvement over baselines while using less data. Moreover, creating this synthetic data directly in the image embedding space is shown to be 25% faster than working in pixel space, offering a scalable and efficient way to enhance VLM training.
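To make the pipeline concrete, below is a minimal, hypothetical sketch of the idea discussed in the episode: an LLM produces synthetic captions, a generator maps them straight to image embeddings (skipping pixel-space rendering), and a small VLM is trained on the resulting synthetic pairs. All component names (generate_captions, TextToImageEmbedding, SimpleVLM) and the contrastive objective are illustrative stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of a Synth^2-style synthetic-data pipeline (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def generate_captions(n: int) -> list[str]:
    # Stand-in for an LLM prompted to produce diverse synthetic captions.
    templates = ["a photo of a {}", "a painting of a {}", "a close-up of a {}"]
    subjects = ["dog", "bicycle", "mountain", "coffee cup"]
    return [templates[i % len(templates)].format(subjects[i % len(subjects)]) for i in range(n)]

def embed_text(captions: list[str], dim: int = 64) -> torch.Tensor:
    # Toy caption encoder: hash tokens into a bag-of-words vector.
    out = torch.zeros(len(captions), dim)
    for i, cap in enumerate(captions):
        for tok in cap.split():
            out[i, hash(tok) % dim] += 1.0
    return F.normalize(out, dim=-1)

class TextToImageEmbedding(nn.Module):
    # Stand-in for a text-to-image generator that emits image *embeddings*
    # directly, which is the source of the speed-up mentioned above.
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(text_emb), dim=-1)

class SimpleVLM(nn.Module):
    # Toy VLM: projects image and caption embeddings into a shared space.
    def __init__(self, dim: int = 64):
        super().__init__()
        self.image_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, img_emb, txt_emb):
        return (F.normalize(self.image_proj(img_emb), dim=-1),
                F.normalize(self.text_proj(txt_emb), dim=-1))

def contrastive_loss(img, txt, temperature: float = 0.07):
    # Symmetric InfoNCE over the synthetic (image embedding, caption) pairs.
    logits = img @ txt.t() / temperature
    labels = torch.arange(len(img))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

if __name__ == "__main__":
    captions = generate_captions(32)        # 1) LLM produces synthetic captions
    txt_emb = embed_text(captions)          # 2) encode the captions
    generator = TextToImageEmbedding()
    img_emb = generator(txt_emb).detach()   # 3) synthesize image embeddings, no pixels rendered
    vlm = SimpleVLM()
    opt = torch.optim.Adam(vlm.parameters(), lr=1e-3)
    for step in range(5):                   # 4) train the VLM on the synthetic pairs
        i, t = vlm(img_emb, txt_emb)
        loss = contrastive_loss(i, t)
        opt.zero_grad(); loss.backward(); opt.step()
        print(f"step {step}: loss {loss.item():.4f}")
```

The contrastive objective here is only a placeholder for whatever training objective the VLM actually uses; the point of the sketch is the data flow, captions to image embeddings to VLM training, without ever touching pixel space.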