Spatio-temporal Relational Reasoning for Video Question Answering

Gursimran Singh (University of British Columbia), Leonid Sigal (University of British Columbia), Jim Little (University of British Columbia)

Abstract
Video question answering is the task of automatically answering questions about videos. Among query types, which include identification, localization, and counting, the most challenging questions ask about relationships among different entities. Answering such questions, and many others, requires modeling relationships between entities in the spatial domain and the evolution of those relationships in the temporal domain. We argue that current approaches have limited capacity to model such long-range spatial and temporal dependencies. To address these challenges, we present a novel spatio-temporal reasoning neural module that enables modeling of complex multi-entity relationships in space and long-term ordered dependencies in time. We evaluate our module on two benchmark datasets that require spatio-temporal reasoning: TGIF-QA and SVQA. We achieve state-of-the-art performance on both datasets. More significantly, we achieve substantial improvements on some of the most challenging question types, such as counting, demonstrating the effectiveness of our proposed spatio-temporal relational module.
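
To make the high-level idea in the abstract concrete, the following is a minimal, hypothetical sketch of spatio-temporal relational reasoning in general: a relation-network-style term that scores pairs of per-frame object features conditioned on the question, followed by a recurrent unit over frames to capture ordered temporal dependencies. This is not the architecture from the paper; the class name, feature dimensions, and input shapes below are illustrative assumptions only.

# Illustrative sketch only; not the authors' model.
import torch
import torch.nn as nn

class SpatioTemporalRelationSketch(nn.Module):
    def __init__(self, obj_dim=512, q_dim=256, hid=256):
        super().__init__()
        # g scores every ordered pair of objects within a frame,
        # conditioned on the question embedding (assumed given).
        self.g = nn.Sequential(
            nn.Linear(2 * obj_dim + q_dim, hid), nn.ReLU(),
            nn.Linear(hid, hid), nn.ReLU(),
        )
        # A GRU aggregates per-frame relation summaries in temporal order.
        self.gru = nn.GRU(hid, hid, batch_first=True)
        self.f = nn.Linear(hid, hid)

    def forward(self, objs, question):
        # objs: (B, T, N, obj_dim) object features per frame
        # question: (B, q_dim) question embedding
        B, T, N, D = objs.shape
        q = question[:, None, None, None, :].expand(B, T, N, N, -1)
        oi = objs[:, :, :, None, :].expand(B, T, N, N, D)
        oj = objs[:, :, None, :, :].expand(B, T, N, N, D)
        pairs = torch.cat([oi, oj, q], dim=-1)   # (B, T, N, N, 2*D + q_dim)
        rel = self.g(pairs).sum(dim=(2, 3))      # (B, T, hid) per-frame relation summary
        out, _ = self.gru(rel)                   # (B, T, hid) ordered temporal context
        return self.f(out[:, -1])                # (B, hid) video-level representation

The returned video-level representation would typically be fused with the question and passed to an answer decoder; that part is omitted here since the abstract does not specify it.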

DOI
10.5244/C.33.46
https://dx.doi.org/10.5244/C.33.46

Files
Paper (PDF)

BibTeX
@inproceedings{BMVC2019,
  title     = {Spatio-temporal Relational Reasoning for Video Question Answering},
  author    = {Gursimran Singh and Leonid Sigal and Jim Little},
  year      = {2019},
  month     = {September},
  pages     = {46.1--46.13},
  articleno = {46},
  numpages  = {13},
  booktitle = {Proceedings of the British Machine Vision Conference (BMVC)},
  publisher = {BMVA Press},
  editor    = {Kirill Sidorov and Yulia Hicks},
  doi       = {10.5244/C.33.46},
  url       = {https://dx.doi.org/10.5244/C.33.46}
}