Abstract
As processing power has become more widely available, increasingly human-like artificial intelligence systems are being developed to solve image-processing tasks that humans are inherently good at. To this end, we propose a model that estimates depth from a single monocular image. Our approach combines structure from motion and stereo disparity: we estimate the pose between the source image and a different viewpoint together with a dense depth map, and apply a simple transformation to reconstruct the image seen from that viewpoint. The real image at that viewpoint then acts as supervision to train our model. The image-comparison metric combines a standard L1 term with structural similarity, a consistency constraint between depth maps, and a smoothness constraint. We show that, similar to human perception, exploiting the correlation within the provided data through two different approaches increases accuracy and outperforms the individual components.
URL
http://arxiv.org/abs/1905.04467
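The abstract describes a reconstruction loss built from L1, structural similarity, and a smoothness constraint, as is common in self-supervised monocular depth estimation. The sketch below is a minimal, hypothetical illustration of such a loss in PyTorch; the paper's exact weights, SSIM window, and depth-consistency term are not given in the abstract, so the values and function names here are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def ssim(x, y):
    # Simplified SSIM over 3x3 average-pooled windows; returns a per-pixel
    # dissimilarity in [0, 1]. A common choice in view-synthesis losses.
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    ssim_n = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    ssim_d = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp((1 - ssim_n / ssim_d) / 2, 0, 1)


def photometric_loss(target, reconstructed, alpha=0.85):
    # Weighted mix of SSIM and L1 between the real view and the image
    # reconstructed from the estimated depth and pose (alpha is assumed).
    l1 = torch.abs(target - reconstructed)
    return (alpha * ssim(target, reconstructed) + (1 - alpha) * l1).mean()


def smoothness_loss(disp, image):
    # Edge-aware smoothness: penalize disparity gradients except where the
    # input image itself has strong gradients (likely object boundaries).
    grad_disp_x = torch.abs(disp[:, :, :, :-1] - disp[:, :, :, 1:])
    grad_disp_y = torch.abs(disp[:, :, :-1, :] - disp[:, :, 1:, :])
    grad_img_x = torch.mean(torch.abs(image[:, :, :, :-1] - image[:, :, :, 1:]), 1, keepdim=True)
    grad_img_y = torch.mean(torch.abs(image[:, :, :-1, :] - image[:, :, 1:, :]), 1, keepdim=True)
    return (grad_disp_x * torch.exp(-grad_img_x)).mean() + \
           (grad_disp_y * torch.exp(-grad_img_y)).mean()
```

In a training loop, the total objective would sum the photometric term over reconstructed views with a weighted smoothness term (and, per the abstract, a depth-consistency term between the two predicted depth maps, omitted here because its exact form is not specified).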