MS-GAN: Text to Image Synthesis with Attention-Modulated Generators and Similarity-aware Discriminators

Fengling Mao (Chinese Academy of Sciences), Bingpeng Ma (Chinese Academy of Sciences), Hong Chang (Chinese Academy of Sciences), Shiguang Shan (Chinese Academy of Sciences), Xilin Chen (Chinese Academy of Sciences)

Abstract
Existing approaches to text-to-image synthesis often produce images that contain artifacts or match the text poorly when the input description is complex. In this paper, we propose a novel model named MS-GAN, composed of multi-stage attention-Modulated generators and Similarity-aware discriminators, to address these problems. The proposed generator consists of multiple convolutional blocks that are modulated by both globally and locally attended features computed between the output image and the text. With this attention modulation, the generator better preserves the semantic information of the text during the text-to-image transformation. Moreover, we propose a similarity-aware discriminator that explicitly constrains the semantic consistency between the text and the synthesized image. Experimental results on the Caltech-UCSD Birds and MS-COCO datasets demonstrate that, compared to state-of-the-art models, our model generates images that look more realistic and better match the given text description.
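To make the idea of attention modulation concrete, the following is a minimal NumPy sketch of one plausible form of the mechanism the abstract describes: word-region attention computes a per-region text context, which then scales and shifts the image feature map. The function names, shapes, and the `gamma`/`beta` modulation parameters are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attended_text_context(img_feat, word_feats):
    """Locally attended text feature for each spatial region.

    img_feat:   (C, H, W) image feature map from a generator block.
    word_feats: (T, C) word embeddings of the text description.
    Returns a (C, H, W) per-region text context (assumed formulation).
    """
    C, H, W = img_feat.shape
    img_flat = img_feat.reshape(C, H * W)        # (C, HW)
    scores = word_feats @ img_flat               # (T, HW) word-region similarity
    attn = softmax(scores, axis=0)               # attend over words for each region
    context = word_feats.T @ attn                # (C, HW) weighted text context
    return context.reshape(C, H, W)

def modulate(img_feat, context, gamma=0.1, beta=0.05):
    """Scale-and-shift modulation of image features by the text context.

    gamma and beta are hypothetical scalar modulation strengths; in a real
    model they would be learned transforms of the attended features.
    """
    return img_feat * (1.0 + gamma * context) + beta * context
```

In this sketch the attention weights sum to one over the words for every spatial location, so each region of the feature map is steered by the words most relevant to it before the next convolutional block.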