Generalised Visual Microphone

Juhyun Ahn (SUALAB)

The visual microphone (VM) recovers sound from a silent video: it extracts subtle motion signals from the video using quadrature filter pairs of the complex steerable pyramid (CSP), and then recovers the sound signal as a weighted sum of those motion signals. We observe that (1) the subtle motion extraction and the weighted sum in the VM can be treated as a convolution operation, (2) the selection of the sound with the least expected noise can be seen as pooling the sound with the maximum expected signal-to-noise ratio, (3) the VM needs to utilise the past signal to obtain a sound signal with zero DC level, and (4) the VM cannot sufficiently recover sound when a normal-frame-rate camera is used. These observations motivate the generalised VM, which has the following features: (1) sound recovery convolutional neural networks (SR-CNN) that can learn the ideal filter weights from training data, (2) a DC blocker recurrent neural network (DB-RNN) that recovers a signal with zero DC level, and (3) a bandwidth extension residual network (BE-ResNet) that doubles the bandwidth of the recovered sound. Experimental results show that the proposed generalised VM achieves a 12.52% higher segmented SNR than the conventional VM when using a normal-frame-rate camera.
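Observation (3), that past samples are needed to cancel a DC offset, is what a classic one-pole DC-blocking filter exploits. The sketch below (plain Python; this is the textbook IIR filter y[n] = x[n] - x[n-1] + r*y[n-1], not the paper's DB-RNN) illustrates that recurrence:

```python
def dc_block(x, r=0.995):
    """One-pole DC-blocking filter: y[n] = x[n] - x[n-1] + r * y[n-1].

    Suppresses the DC (zero-frequency) component while passing higher
    frequencies almost unchanged; r close to 1 gives a narrow notch at DC.
    A classic IIR filter, shown here only to illustrate why past samples
    (x[n-1], y[n-1]) are needed to remove a DC offset -- the paper's
    DB-RNN learns this behaviour rather than hard-coding it.
    """
    y = []
    x_prev, y_prev = 0.0, 0.0
    for sample in x:
        y_cur = sample - x_prev + r * y_prev
        y.append(y_cur)
        x_prev, y_prev = sample, y_cur
    return y
```

Fed a constant (pure DC) input, the output decays geometrically toward zero, while a fast-varying signal passes through nearly untouched.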



BibTeX:

@inproceedings{Ahn_GVM_BMVC,
  title={Generalised Visual Microphone},
  author={Juhyun Ahn},
  booktitle={Proceedings of the British Machine Vision Conference (BMVC)},
  editor={Kirill Sidorov and Yulia Hicks},
  publisher={BMVA Press}
}