Several attention-based encoder-decoder architectures have been geared towards the task of image captioning. Yet, the collocations and contextual inference seen in captions written by humans are not observed in the output of these systems; e.g., if we see many different vehicles on the road, we infer ``traffic'' and say ``a lot of traffic on the road''. Further, ``hallucination'' of commonly seen concepts to fit the language model is observed in many existing systems; for example, ``a group of soldiers cutting a cake with a sword'' may be hallucinated as ``a boy cutting a cake with a knife''. In this work we construct two simultaneously learning channels: the first channel uses the mean-pooled image feature and learns to associate it with the most relevant words, while the second channel utilizes the spatial features of salient image regions to learn to form meaningful collocations and perform contextual inference. This way, the final language model can leverage the information from both channels to generate grammatically correct sentence structures that are more human-like and creative. Our novel ``spatial image features to n-gram text features mapping'' mechanism not only learns meaningful collocations but also verifies that the caption words correspond to region(s) of the image, thereby avoiding ``hallucination'' by the model. We validate the effectiveness of our one-pass system on the challenging MS-COCO image captioning benchmark, where our single model achieves a new state-of-the-art 126.3 CIDEr-D on the Karpathy split, and a competitive 124.1 CIDEr-D (c40) on the official evaluation server.
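The two-channel decoding step described above can be sketched schematically. This is a hypothetical minimal illustration, not the authors' implementation: all parameter names (`W_global`, `W_attn`, `W_region`), dimensions, and the simple dot-product attention are assumptions made for clarity.

```python
# Hypothetical sketch of the two-channel idea: channel 1 consumes the
# mean-pooled image feature, channel 2 attends over spatial region features,
# and the decoder fuses both channels to score the next caption word.
import numpy as np

rng = np.random.default_rng(0)

D, R, V = 16, 5, 100                  # feature dim, #regions, vocab size
regions = rng.normal(size=(R, D))     # spatial features of salient regions
mean_feat = regions.mean(axis=0)      # mean-pooled ("global") image feature

# Hypothetical learned parameters, randomly initialized for illustration.
W_global = rng.normal(size=(D, V)) * 0.1  # channel 1: global feature -> word scores
W_attn   = rng.normal(size=(D,)) * 0.1    # channel 2: attention scoring vector
W_region = rng.normal(size=(D, V)) * 0.1  # channel 2: attended feature -> word scores

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Channel 2: attention over regions ties each predicted word to concrete
# image regions, the mechanism the abstract credits with reducing hallucination.
alpha = softmax(regions @ W_attn)     # (R,) attention weights over regions
attended = alpha @ regions            # (D,) attended context vector

# Fuse the two channels into next-word logits for the language model.
logits = mean_feat @ W_global + attended @ W_region
probs = softmax(logits)
next_word = int(np.argmax(probs))
```

In a full model this step would run once per decoding timestep, with the attention query conditioned on the decoder state rather than a fixed vector; the fixed `W_attn` here only keeps the sketch short.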