In this episode we discuss Align and Attend: Multimodal Summarization with Dual Contrastive Losses
by Authors:
- Bo He
- Jun Wang
- Jielin Qiu
- Trung Bui
- Abhinav Shrivastava
- Zhaowen Wang
Affiliations:
- Bo He, Jun Wang, and Abhinav Shrivastava: University of Maryland, College Park
- Jielin Qiu: Carnegie Mellon University
- Trung Bui and Zhaowen Wang: Adobe Research

The paper proposes Align and Attend Multimodal Summarization (A2Summ), a new approach for extracting important information from multiple modalities to create reliable summaries. It introduces a unified transformer-based model that aligns and attends to the multimodal input, addressing two issues prior methods overlook: the temporal correspondence between different modalities and the intrinsic correlation between different samples. The proposed model achieves state-of-the-art performance on standard video summarization and multimodal summarization datasets, and the authors also introduce BLiSS, a new large-scale multimodal summarization dataset.