Abstract
Lifelogging cameras capture everyday life from a first-person perspective, but they generate so much data that users find it hard to browse and organize their image collections effectively. In this paper, we propose using automatic image captioning algorithms to generate textual representations of these collections. We develop and explore novel deep learning techniques to generate captions for both individual images and image streams, using temporal consistency constraints to create summaries that are both more compact and less noisy. We evaluate our techniques both quantitatively and qualitatively, and apply captioning to an image retrieval application for finding potentially private images. Our results suggest that our automatic captioning algorithms, while imperfect, may work well enough to help users manage lifelogging photo collections.
URL
https://arxiv.org/abs/1608.03819