Abstract
Generative Adversarial Networks (GANs) have recently demonstrated the capability to synthesize compelling real-world images, such as room interiors, album covers, manga, faces, birds, and flowers. While existing models can synthesize images based on global constraints such as a class label or caption, they do not provide control over pose or object location. We propose a new model, the Generative Adversarial What-Where Network (GAWWN), that synthesizes images given instructions describing what content to draw in which location. We show high-quality 128 x 128 image synthesis on the Caltech-UCSD Birds dataset, conditioned on both informal text descriptions and object location. Our system exposes control over both the bounding box around the bird and its constituent parts. By modeling the conditional distributions over part locations, our system also enables conditioning on arbitrary subsets of parts (e.g. only the beak and tail), yielding an efficient interface for picking part locations. We also show preliminary results on the more challenging domain of text- and location-controllable synthesis of images of human actions on the MPII Human Pose dataset.
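To make the "what-where" idea concrete, below is a minimal sketch (PyTorch assumed) of one way such conditioning can be realized: a caption embedding (the "what") is replicated spatially only inside a bounding box (the "where"), producing a feature map that encodes both content and location for a generator or discriminator. The function name, tensor sizes, and usage are illustrative assumptions, not the authors' implementation.

```python
import torch

def spatial_text_conditioning(text_emb, bbox, grid_size=16):
    """Hypothetical helper: text_emb is a (D,) caption embedding,
    bbox is (x0, y0, x1, y1) in normalized [0, 1] coordinates.
    Returns a (D, grid_size, grid_size) map with the embedding
    replicated inside the box and zeros elsewhere."""
    D = text_emb.shape[0]
    cond = torch.zeros(D, grid_size, grid_size)
    # Map normalized box coordinates onto the feature grid.
    x0, y0, x1, y1 = (int(round(v * grid_size)) for v in bbox)
    cond[:, y0:y1, x0:x1] = text_emb.view(D, 1, 1)
    return cond

# Usage: place a caption embedding in the upper-left quarter of the image.
emb = torch.randn(128)                      # stand-in for a learned caption embedding
cond_map = spatial_text_conditioning(emb, (0.0, 0.0, 0.5, 0.5))
print(cond_map.shape)                       # torch.Size([128, 16, 16])
```

The same replication idea extends from a single bounding box to keypoint-centered boxes, which is how conditioning on individual part locations (e.g. beak and tail) can be expressed.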
URL
https://arxiv.org/abs/1610.02454