Use What You Have: Video retrieval using representations from collaborative experts

Yang Liu (University of Oxford), Samuel Albanie (University of Oxford), Arsha Nagrani (University of Oxford), Andrew Zisserman (University of Oxford)

The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge. Human-generated queries for video datasets "in the wild" vary widely in their degree of specificity, with some queries describing "specific details" such as the names of famous identities, content from speech, or text available on the screen. Our goal is to condense the multi-modal, extremely high-dimensional information from videos into a single, compact video representation for the task of video retrieval using free-form text queries, where the degree of specificity is open-ended. For this we exploit existing knowledge in the form of pretrained semantic embeddings, which include "general" features such as motion, appearance, and scene features from visual content, and more "specific" cues from ASR and OCR, which may not always be available but allow for more fine-grained disambiguation when present. We propose a collaborative experts model to aggregate information effectively from these different pretrained experts. The effectiveness of our approach is demonstrated empirically, setting new state-of-the-art results on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet, while simultaneously reducing the number of parameters used by prior work. Code and data are available online.
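
To make the retrieval setting concrete, the following is a minimal sketch of fusing several per-video expert embeddings, some of which may be missing, into one compact vector and ranking videos against a text query embedding. It is not the collaborative experts model itself: plain masked averaging stands in for the learned aggregation, and the function names (`fuse_experts`, `rank_videos`), the expert names, and the assumption that every expert and the query have already been projected into a shared embedding space are all hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_experts(experts: Dict[str, Optional[Vector]]) -> Vector:
    """Average the normalized embeddings of the experts that are present.

    A value of None marks an expert (e.g. OCR or ASR) that is unavailable
    for this video. Assumption: all embeddings share one dimensionality.
    """
    available = [l2_normalize(v) for v in experts.values() if v is not None]
    if not available:
        raise ValueError("at least one expert embedding is required")
    dim = len(available[0])
    mean = [sum(v[i] for v in available) / len(available) for i in range(dim)]
    return l2_normalize(mean)


def rank_videos(query: Vector,
                videos: Dict[str, Dict[str, Optional[Vector]]]) -> List[str]:
    """Return video ids sorted by cosine similarity to the query, best first."""
    q = l2_normalize(query)
    scores = {
        video_id: sum(a * b for a, b in zip(q, fuse_experts(experts)))
        for video_id, experts in videos.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy 2-D embeddings; the "cooking" video has no OCR expert available.
    videos = {
        "cooking": {"appearance": [1.0, 0.0], "motion": [0.8, 0.2], "ocr": None},
        "football": {"appearance": [0.0, 1.0], "motion": [0.1, 0.9], "ocr": [0.0, 1.0]},
    }
    print(rank_videos([0.9, 0.1], videos))
```

Averaging only over the experts that are present keeps a video with missing OCR or ASR features on the same scale as one with all experts available. A learned model would replace this fixed average with trained projections and a trained way of combining the experts.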


Paper (PDF)
Supplementary material (PDF)

title={Use What You Have: Video retrieval using representations from collaborative experts},
author={Yang Liu and Samuel Albanie and Arsha Nagrani and Andrew Zisserman},
booktitle={Proceedings of the British Machine Vision Conference (BMVC)},
publisher={BMVA Press},
editor={Kirill Sidorov and Yulia Hicks},