
Towards Transparent AI Systems: Interpreting Visual Question Answering Models

2016-09-09
Yash Goyal, Akrit Mohapatra, Devi Parikh, Dhruv Batra

Abstract

Deep neural networks have shown striking progress and obtained state-of-the-art results in many AI research fields in recent years. However, it is often unsatisfying not to know why they predict what they do. In this paper, we address the problem of interpreting Visual Question Answering (VQA) models. Specifically, we are interested in finding what part of the input (pixels in images or words in questions) the VQA model focuses on while answering the question. To tackle this problem, we use two visualization techniques, guided backpropagation and occlusion, to find important words in the question and important regions in the image. We then present qualitative and quantitative analyses of these importance maps. We found that even without explicit attention mechanisms, VQA models may sometimes be implicitly attending to relevant regions in the image, and often to appropriate words in the question.

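The occlusion idea in the abstract can be sketched concretely: cover part of the image (or replace a question word with an unknown token), re-run the model, and record how much its confidence in the original answer drops. The snippet below is a minimal illustration, not the paper's code; it assumes a hypothetical `vqa_model(image, question)` interface returning a probability distribution over candidate answers, and the patch size, stride, and fill value are arbitrary illustrative choices.

```python
import numpy as np

def occlusion_importance(vqa_model, image, question, answer_idx,
                         patch=16, stride=8, fill=0.5):
    """Occlusion-based importance map for the image input.

    vqa_model(image, question) is assumed (hypothetically) to return
    a probability vector over candidate answers.
    """
    base_prob = vqa_model(image, question)[answer_idx]
    h, w = image.shape[:2]
    heatmap = np.zeros(((h - patch) // stride + 1,
                        (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill   # gray out a patch
            prob = vqa_model(occluded, question)[answer_idx]
            heatmap[i, j] = base_prob - prob            # larger drop = more important
    return heatmap

def word_importance(vqa_model, image, question_tokens, answer_idx,
                    unk_token="<unk>"):
    """Importance of each question word, estimated by masking it out."""
    base_prob = vqa_model(image, question_tokens)[answer_idx]
    scores = []
    for k in range(len(question_tokens)):
        masked = question_tokens[:k] + [unk_token] + question_tokens[k + 1:]
        scores.append(base_prob - vqa_model(image, masked)[answer_idx])
    return scores
```

A large drop in the answer probability indicates that the occluded region or masked word was important to the model's prediction; visualizing these drops gives the kind of importance map the paper analyzes.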

URL

https://arxiv.org/abs/1608.08974

PDF

https://arxiv.org/pdf/1608.08974

