Abstract
Together with the development of more accurate methods in Computer Vision and Natural Language Understanding, holistic architectures that answer on questions about the content of real-world images have emerged. In this tutorial, we build a neural-based approach to answer questions about images. We base our tutorial on two datasets: (mostly on) DAQUAR, and (a bit on) VQA. With small tweaks the models that we present here can achieve a competitive performance on both datasets, in fact, they are among the best methods that use a combination of LSTM with a global, full frame CNN representation of an image. We hope that after reading this tutorial, the reader will be able to use Deep Learning frameworks, such as Keras and introduced Kraino, to build various architectures that will lead to a further performance improvement on this challenging task.
Abstract (translated by Google)
随着“计算机视觉和自然语言理解”中更准确的方法的发展,出现了回答有关现实世界图像内容的整体架构。在本教程中,我们构建了一个基于神经的方法来回答有关图像的问题。我们将我们的教程建立在两个数据集上:(主要是)DAQUAR和(有点儿)VQA。通过小的调整,我们在这里呈现的模型可以在两个数据集上实现有竞争力的性能,实际上,它们是使用LSTM和全局CNN表示图像的最佳方法。我们希望阅读本教程后,读者将能够使用Keras等深度学习框架和Kraino引入的各种体系结构,从而进一步提高这一具有挑战性的任务的性能。
URL
https://arxiv.org/abs/1610.01076