Abstract
We propose Text2Scene, a model that interprets input natural language descriptions in order to generate various forms of compositional scene representations, from abstract cartoon-like scenes to synthetic images. Unlike recent work, our method does not use generative adversarial networks, but a combination of an encoder-decoder model with a semi-parametric retrieval-based approach. Text2Scene learns to sequentially produce objects and their attributes (location, size, appearance, etc.) at every time step by attending to different parts of the input text and the current status of the generated scene. We show that, with minor modifications, the proposed framework can handle the generation of different forms of scene representations, including cartoon-like scenes, object layouts corresponding to real images, and synthetic image composites. Our method is not only competitive with state-of-the-art GAN-based methods on automatic metrics and superior according to human judgments, but is also more general and interpretable.
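To make the sequential generation idea concrete, below is a minimal sketch (not the authors' code) of a decoding loop in which, at every time step, the model attends over the encoded input text conditioned on the current scene state and then predicts the next object and its attributes (location, size, ...). All module and variable names (Text2SceneSketch, the vocabulary sizes, the location grid) are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of a sequential, attention-based scene decoder.
# Assumptions: object categories, a coarse location grid, and a small set of
# size bins stand in for the paper's attribute heads.
import torch
import torch.nn as nn


class Text2SceneSketch(nn.Module):
    def __init__(self, vocab_size=5000, n_objects=100, n_locations=28 * 28,
                 n_sizes=3, d=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.text_encoder = nn.GRU(d, d, batch_first=True, bidirectional=True)
        self.obj_embed = nn.Embedding(n_objects + 1, d)   # +1 for a <start> token
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True,
                                          kdim=2 * d, vdim=2 * d)
        self.decoder = nn.GRUCell(2 * d, d)               # [attended text; previous object]
        self.obj_head = nn.Linear(d, n_objects)           # what to add next
        self.loc_head = nn.Linear(d, n_locations)         # where to place it
        self.size_head = nn.Linear(d, n_sizes)            # how large to make it

    def forward(self, tokens, max_steps=10):
        # Encode the input sentence once.
        words = self.embed(tokens)                        # (B, T, d)
        text_memory, _ = self.text_encoder(words)         # (B, T, 2d)

        B, d = tokens.size(0), self.decoder.hidden_size
        h = torch.zeros(B, d, device=tokens.device)       # current scene state
        start = torch.zeros(B, dtype=torch.long, device=tokens.device)
        prev_obj = self.obj_embed(start)
        outputs = []
        for _ in range(max_steps):
            # Attend to the text conditioned on the current scene state.
            ctx, _ = self.attn(h.unsqueeze(1), text_memory, text_memory)
            ctx = ctx.squeeze(1)                          # (B, d)
            # Update the scene state with the attended text and the last object.
            h = self.decoder(torch.cat([ctx, prev_obj], dim=-1), h)
            # Predict the next object and its attributes at this time step.
            obj_logits = self.obj_head(h)
            outputs.append({
                "object": obj_logits,
                "location": self.loc_head(h),
                "size": self.size_head(h),
            })
            prev_obj = self.obj_embed(obj_logits.argmax(dim=-1))
        return outputs
```

In the image-composite setting, the predicted attributes would drive a retrieval step that selects and pastes real patches rather than a pixel-level generator, which is where the semi-parametric aspect mentioned in the abstract would come in.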
URL
http://arxiv.org/abs/1809.01110