Self-supervised Video Representation Learning for Correspondence Flow

Zihang Lai (University of Oxford), Weidi Xie (University of Oxford)

Abstract
The objective of this paper is self-supervised learning of feature embeddings from video, suitable for correspondence flow. We leverage the natural spatio-temporal coherence of appearance in videos to train a model that learns to reconstruct a target frame by copying colors from a reference frame. We make three contributions: First, we introduce a simple information bottleneck that forces the model to learn robust features for correspondence matching and prevents it from learning trivial solutions, e.g. matching by low-level color information. Second, we propose to learn the matching over a longer temporal window in videos. To make the model more robust to complex object deformation and occlusion, i.e. the problem of tracker drifting, we formulate a recursive model trained with scheduled sampling and cycle consistency. Third, we evaluate the approach by first training on the Kinetics dataset using self-supervised learning, and then applying it directly to DAVIS video segmentation and JHMDB keypoint tracking. On both tasks, our approach outperforms all previous methods by a significant margin. The source code will be released at https://github.com/zlai0/CorrFlow.
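The core reconstruction-by-copying idea from the abstract can be illustrated with a minimal sketch: features of the target frame attend over features of the reference frame, and the resulting soft affinity matrix copies reference colors into the target. This is a simplified, hypothetical NumPy illustration (function name, shapes, and the temperature parameter are our assumptions, not the paper's implementation, which also includes the information bottleneck and recursive long-term matching):

```python
import numpy as np

def copy_colors(feat_ref, feat_tgt, color_ref, temperature=1.0):
    """Reconstruct target-frame colors by soft-copying from a reference frame.

    feat_ref, feat_tgt: (C, H, W) feature maps from a shared encoder
    color_ref:          (3, H, W) reference-frame colors
    Returns a (3, H, W) prediction for the target frame.
    """
    C, H, W = feat_ref.shape
    f_ref = feat_ref.reshape(C, -1)                   # (C, N), N = H*W
    f_tgt = feat_tgt.reshape(C, -1)                   # (C, N)
    # Affinity between every target location and every reference location.
    affinity = (f_tgt.T @ f_ref) / temperature        # (N, N)
    # Row-wise softmax (numerically stabilized): each target pixel gets
    # a distribution over reference locations.
    affinity -= affinity.max(axis=1, keepdims=True)
    A = np.exp(affinity)
    A /= A.sum(axis=1, keepdims=True)
    # Copy colors as a convex combination of reference colors.
    pred = color_ref.reshape(3, -1) @ A.T             # (3, N)
    return pred.reshape(3, H, W)
```

Because each predicted pixel is a convex combination of reference colors, training the encoder to minimize reconstruction error pushes the affinity matrix toward correct correspondences, which can then be reused at test time to propagate segmentation masks or keypoints instead of colors.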

DOI
10.5244/C.33.121
https://dx.doi.org/10.5244/C.33.121

Files
Paper (PDF)

BibTeX
@inproceedings{BMVC2019,
title={Self-supervised Video Representation Learning for Correspondence Flow},
author={Zihang Lai and Weidi Xie},
year={2019},
month={September},
pages={121.1--121.13},
articleno={121},
numpages={13},
booktitle={Proceedings of the British Machine Vision Conference (BMVC)},
publisher={BMVA Press},
editor={Kirill Sidorov and Yulia Hicks},
doi={10.5244/C.33.121},
url={https://dx.doi.org/10.5244/C.33.121}
}