Abstract
Understanding the semantics of complex visual scenes often requires analyzing a network of objects and the relations between them. Such networks are known as scene graphs. While scene graphs have great potential for machine-vision applications, learning scene-graph-based models is challenging. One reason is the complexity of the graph representation; another is the lack of large-scale training data for broad-coverage graphs. In this work we propose a way of addressing these difficulties via the concept of a Latent Scene Graph. We describe a family of models that builds "scene-graph-like" representations and uses them in downstream tasks, and we show how these representations can be trained from partial supervision. Finally, we show how our approach achieves new state-of-the-art results on the challenging problem of referring relationships.
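To make the notion of a scene graph concrete, here is a minimal, illustrative sketch of the data structure the abstract describes — object labels as nodes and (subject, predicate, object) triples as edges. All names and the example scene are hypothetical and are not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Minimal scene-graph container (illustrative, not the paper's model):
    nodes are object labels; edges are (subject, predicate, object) triples
    stored by node index."""
    objects: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def add_object(self, label):
        """Add an object node; return its index."""
        self.objects.append(label)
        return len(self.objects) - 1

    def add_relation(self, subj_idx, predicate, obj_idx):
        """Add a directed relation edge between two object indices."""
        self.relations.append((subj_idx, predicate, obj_idx))

    def out_edges(self, idx):
        """Relations in which object `idx` is the subject."""
        return [(p, self.objects[o]) for s, p, o in self.relations if s == idx]

# Hypothetical scene: "a man riding a horse next to a tree"
g = SceneGraph()
man = g.add_object("man")
horse = g.add_object("horse")
tree = g.add_object("tree")
g.add_relation(man, "riding", horse)
g.add_relation(horse, "next to", tree)
print(g.out_edges(man))  # [('riding', 'horse')]
```

In referring-relationship tasks, a query such as ("man", "riding", "horse") is grounded in the image by localizing the subject and object of one such edge.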
URL
http://arxiv.org/abs/1902.10200