In this episode we discuss "Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos" by Sixun Dong, Huazhang Hu, Dongze Lian, Weixin Luo, Yicheng Qian, and Shenghua Gao. The paper proposes a weakly supervised approach to sequential video understanding in which timestamp-level text-video alignment is not provided. The method uses a transformer to aggregate frame-level features into a video representation, and a pre-trained text encoder to embed the text for each action as well as for the whole video. A multiple-granularity loss combines a video-paragraph contrastive loss with a frame-sentence contrastive loss, where pseudo frame-sentence correspondences are generated to supervise training. Experimental results demonstrate the effectiveness of the approach, which outperforms baselines by a large margin.
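The two-level contrastive objective described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the InfoNCE-style loss, the temperature value, and the way pseudo frame-to-sentence assignments are used as positives are all assumptions for the sake of the example.

```python
import numpy as np

def normalize(x):
    # L2-normalize embeddings along the feature dimension
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(a, b, temperature=0.07):
    # a, b: (N, D) normalized embeddings; row i of a and row i of b
    # are treated as the positive pair, all other rows as negatives
    logits = (a @ b.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    idx = np.arange(len(a))
    return -np.log(probs[idx, idx]).mean()

def multi_granularity_loss(video_emb, para_emb, frame_emb, sent_emb, pseudo_idx):
    # video_emb, para_emb: (B, D) whole-video and paragraph embeddings
    # frame_emb: (T, D) frame features; sent_emb: (S, D) sentence features
    # pseudo_idx: (T,) hypothetical pseudo frame->sentence assignments,
    # standing in for the missing timestamp-level alignment
    l_video_para = info_nce(normalize(video_emb), normalize(para_emb))
    l_frame_sent = info_nce(normalize(frame_emb), normalize(sent_emb[pseudo_idx]))
    return l_video_para + l_frame_sent
```

In this sketch the frame-sentence term simply reuses the pseudo assignment as the positive pair; a real implementation would also need to handle frames that share the same sentence so they are not counted as negatives of each other.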