Abstract
Important high-level vision tasks such as human-object interaction, image captioning, and robotic manipulation require rich semantic descriptions of objects at the part level. Building upon previous work on part localization, in this paper we address the problem of inferring the rich semantics imparted by an object part in still images. We propose to tokenize the semantic space as a discrete set of part states. Since our modeling of part states is spatially localized, we formulate part-state inference as a pixel-wise annotation problem. An iterative part-state inference neural network is specifically designed for this task, which is efficient in time and accurate in performance. Extensive experiments demonstrate that the proposed method effectively predicts the semantic states of parts and simultaneously corrects localization errors, thus benefiting several visual understanding applications. A further contribution of this paper is our part-state dataset, which contains rich part-level semantic annotations.
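The tokenization idea above can be illustrated with a minimal sketch: each (part, state) pair becomes a discrete token, and the inference output is a pixel-wise map of token ids. The vocabulary and helper below are hypothetical examples, not the paper's actual dataset or network.

```python
# Hypothetical part-state vocabulary (illustrative only, not from the paper's dataset).
PART_STATES = [
    ("hand", "holding-something"),
    ("hand", "free"),
    ("mouth", "open"),
    ("mouth", "closed"),
]

# Map each (part, state) pair to a discrete token id; 0 is reserved for background.
TOKEN_OF = {ps: i + 1 for i, ps in enumerate(PART_STATES)}

def label_map(pred_grid):
    """Convert a grid of per-pixel (part, state) predictions into token ids.

    Unknown or background pixels (None, or pairs outside the vocabulary)
    map to token 0.
    """
    return [[TOKEN_OF.get(p, 0) for p in row] for row in pred_grid]

# Toy 2x2 "image": top-left pixel belongs to a holding hand,
# bottom-right to an open mouth, the rest is background.
grid = [
    [("hand", "holding-something"), None],
    [None, ("mouth", "open")],
]
print(label_map(grid))  # [[1, 0], [0, 3]]
```

In the paper's setting a network would produce these per-pixel state labels jointly with part localization; the sketch only shows how a discrete part-state space reduces semantic annotation to a pixel-wise labeling problem.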
URL
https://arxiv.org/abs/1612.07310