In this episode we discuss AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR
by Paul Hongsuck Seo, Arsha Nagrani, and Cordelia Schmid. The paper proposes AVFormer, a method for augmenting audio-only models with visual information for audiovisual automatic speech recognition (AV-ASR). It injects visual embeddings into a frozen ASR model via lightweight trainable adaptors, which can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters. A simple curriculum scheme is also introduced during training and is shown to be crucial for the model to process audio and visual information jointly and effectively. The resulting model achieves state-of-the-art zero-shot results on three AV-ASR benchmarks while preserving solid performance on traditional audio-only speech recognition benchmarks.
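To make the adapter-injection idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: the module names (BottleneckAdapter, AVFormerSketch), the feature dimensions, and the placement of a single adapter after the frozen encoder are simplifying assumptions. In the paper, bottleneck adapters sit inside the layers of a frozen Conformer encoder, and the visual tokens are projected CLIP features prepended to the audio tokens.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight residual adapter: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

class AVFormerSketch(nn.Module):
    """Frozen ASR encoder plus trainable visual projector and adapter."""
    def __init__(self, asr_encoder: nn.Module, dim: int,
                 visual_dim: int = 512, num_visual_tokens: int = 4):
        super().__init__()
        self.asr_encoder = asr_encoder
        for p in self.asr_encoder.parameters():
            p.requires_grad = False  # the pretrained ASR model stays frozen
        # Project a visual feature (e.g. from CLIP) into a few visual "tokens"
        self.visual_proj = nn.Linear(visual_dim, dim * num_visual_tokens)
        self.num_visual_tokens = num_visual_tokens
        self.dim = dim
        # Single adapter after the encoder here; the paper inserts them per layer
        self.adapter = BottleneckAdapter(dim)

    def forward(self, audio_feats: torch.Tensor,
                visual_feat: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T, dim); visual_feat: (B, visual_dim)
        vis_tokens = self.visual_proj(visual_feat).view(
            -1, self.num_visual_tokens, self.dim)
        # Prepend visual tokens to the audio token sequence
        x = torch.cat([vis_tokens, audio_feats], dim=1)
        x = self.asr_encoder(x)
        return self.adapter(x)
```

Because only the visual projector and the adapter receive gradients, the trainable-parameter count stays small relative to the frozen backbone; the curriculum discussed in the episode trains these components in separate phases so the adapters can first adapt on audio alone.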