Visuomotor Understanding for Representation Learning of Driving Scenes

Seokju Lee (Korea Advanced Institute of Science and Technology), Junsik Kim (Korea Advanced Institute of Science and Technology), Tae-Hyun Oh (MIT CSAIL), Yongseop Jeong (Korea Advanced Institute of Science and Technology), Donggeun Yoo (Lunit), Stephen Lin (Microsoft Research), In So Kweon (Korea Advanced Institute of Science and Technology)

Abstract
Dashboard cameras capture a tremendous amount of driving scene video each day. These videos are purposefully coupled with vehicle sensing data, such as from the speedometer and inertial sensors, providing an additional sensing modality for free. In this work, we leverage this large-scale, unlabeled yet naturally paired data for visual representation learning in the driving scenario. A representation is learned in an end-to-end self-supervised framework for predicting dense optical flow from a single frame with paired sensing data. We postulate that success on this task requires the network to learn semantic and geometric knowledge in the ego-centric view. For example, forecasting a future view to be seen from a moving vehicle requires an understanding of scene depth, scale, and movement of objects. We demonstrate that our learned representation can benefit other tasks that require detailed scene understanding and that it outperforms competing unsupervised representations on semantic segmentation.
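
To make the pretext task concrete, the sketch below illustrates one plausible form of single-frame, sensor-conditioned flow prediction of the kind the abstract describes: an encoder-decoder that takes an RGB frame plus a small vehicle-motion vector and regresses a dense two-channel flow map. This is a minimal illustration under assumed choices (architecture, sensor dimensionality, L1 regression to precomputed flow targets), not the authors' actual model.

# Minimal sketch (not the authors' architecture) of predicting dense optical
# flow from a single frame conditioned on paired vehicle sensing data.
# Assumptions: flow targets come without manual labels (e.g., from an
# off-the-shelf flow estimator), and the sensor input is a small vector
# such as (speed, yaw rate).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SensorConditionedFlowNet(nn.Module):
    def __init__(self, sensor_dim=2):
        super().__init__()
        # Image encoder: downsample the single input frame by 8x.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Sensor branch: embed the motion vector, then broadcast it spatially.
        self.sensor_fc = nn.Linear(sensor_dim, 128)
        # Decoder: upsample back to input resolution, output (u, v) flow.
        self.decoder = nn.Sequential(
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 2, 3, padding=1),
        )

    def forward(self, frame, sensors):
        feat = self.encoder(frame)                            # (B, 128, H/8, W/8)
        s = self.sensor_fc(sensors)                           # (B, 128)
        s = s[:, :, None, None].expand(-1, -1, *feat.shape[2:])
        return self.decoder(torch.cat([feat, s], dim=1))      # (B, 2, H, W)

def train_step(model, optimizer, frame, sensors, target_flow):
    # Self-supervised regression: no manual annotation is involved.
    pred = model(frame, sensors)
    loss = F.l1_loss(pred, target_flow)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

After pretraining on such a task, the encoder weights would be reused as initialization for downstream tasks such as semantic segmentation, which is the transfer setting evaluated in the paper.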

DOI
10.5244/C.33.1
https://dx.doi.org/10.5244/C.33.1

Files
Paper (PDF)
Supplementary material (ZIP)

BibTeX
@inproceedings{BMVC2019,
title={Visuomotor Understanding for Representation Learning of Driving Scenes},
author={Seokju Lee and Junsik Kim and Tae-Hyun Oh and Yongseop Jeong and Donggeun Yoo and Stephen Lin and In So Kweon},
year={2019},
month={September},
pages={1.1--1.14},
articleno={1},
numpages={14},
booktitle={Proceedings of the British Machine Vision Conference (BMVC)},
publisher={BMVA Press},
editor={Kirill Sidorov and Yulia Hicks},
doi={10.5244/C.33.1},
url={https://dx.doi.org/10.5244/C.33.1}
}