Neural Machine Translation (NMT) has obtained state-of-the-art performance for several language pairs, while only using parallel data for training. Target-side monolingual data plays an important role in boosting fluency for phrase-based statistical machine translation, and we investigate the use of monolingual data for NMT. In contrast to previous work, which combines NMT models with separately trained language models, we note that encoder-decoder NMT architectures already have the capacity to learn the same information as a language model, and we explore strategies to train with monolingual data without changing the neural network architecture. By pairing monolingual training data with an automatic back-translation, we can treat it as additional parallel training data, and we obtain substantial improvements on the WMT 15 task English<->German (+2.8-3.7 BLEU), and for the low-resourced IWSLT 14 task Turkish->English (+2.1-3.4 BLEU), obtaining new state-of-the-art results. We also show that fine-tuning on in-domain monolingual and parallel data gives substantial improvements for the IWSLT 15 task English->German.
https://arxiv.org/abs/1511.06709
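A minimal sketch of the back-translation step summarized above: target-side monolingual sentences are translated into the source language by a separately trained reverse model, and the resulting synthetic pairs are mixed with the real parallel data. The `reverse_model.translate` interface and the data layout are illustrative assumptions, not the paper's implementation.

```python
def back_translate(monolingual_target, reverse_model, parallel_pairs):
    """Augment parallel data with synthetic source sides (hedged sketch)."""
    synthetic_pairs = []
    for tgt_sentence in monolingual_target:
        # Translate target -> source with a reverse NMT model trained on the parallel data.
        synthetic_src = reverse_model.translate(tgt_sentence)
        synthetic_pairs.append((synthetic_src, tgt_sentence))
    # The synthetic pairs are treated as additional parallel training data.
    return parallel_pairs + synthetic_pairs
```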
Visual storytelling aims to generate human-level narrative language (i.e., a natural paragraph with multiple sentences) from a photo stream. A typical photo story consists of a global timeline with multi-thread local storylines, where each storyline occurs in a different scene. Such a complex structure leads to large content gaps at scene transitions between consecutive photos. Most existing image/video captioning methods can only achieve limited performance, because the units in traditional recurrent neural networks (RNN) tend to “forget” the previous state when the visual sequence is inconsistent. In this paper, we propose a novel visual storytelling approach with a Bidirectional Multi-thread Recurrent Neural Network (BMRNN). First, based on the mined local storylines, a skip gated recurrent unit (sGRU) with delay control is proposed to maintain longer-range visual information. Second, by using sGRU as basic units, the BMRNN is trained to align the local storylines into the global sequential timeline. Third, a new training scheme with a storyline-constrained objective function is proposed by jointly considering both global and local matches. Experiments on three standard storytelling datasets show that the BMRNN model outperforms the state-of-the-art methods.
https://arxiv.org/abs/1606.00625
Previous studies in Open Information Extraction (Open IE) are mainly based on extraction patterns. They manually define patterns or automatically learn them from a large corpus. However, these approaches are limited when grasping the context of a sentence, and they fail to capture implicit relations. In this paper, we address this problem with the following methods. First, we exploit long short-term memory (LSTM) networks to extract higher-level features along the shortest dependency paths connecting headwords of relations and arguments. The path-level features from LSTM networks provide useful clues regarding contextual information and the validity of arguments. Second, we construct samples to train LSTM networks without the need for manual labeling. In particular, feedback negative sampling picks highly negative samples among non-positive samples through a model trained with positive samples. The experimental results show that our approach produces more precise and abundant extractions than state-of-the-art Open IE systems. To the best of our knowledge, this is the first work to apply deep learning to Open IE.
https://arxiv.org/abs/1605.07918
Recurrent neural networks are increasingly popular models for sequential learning. Unfortunately, although the most effective RNN architectures are perhaps excessively complicated, extensive searches have not found simpler alternatives. This paper imports ideas from physics and functional programming into RNN design to provide guiding principles. From physics, we introduce type constraints, analogous to the constraints that forbid adding meters to seconds. From functional programming, we require that strongly-typed architectures factorize into stateless learnware and state-dependent firmware, reducing the impact of side-effects. The features learned by strongly-typed nets have a simple semantic interpretation via dynamic average-pooling on one-dimensional convolutions. We also show that strongly-typed gradients are better behaved than in classical architectures, and characterize the representational power of strongly-typed nets. Finally, experiments show that, despite being more constrained, strongly-typed architectures achieve lower training error and comparable generalization error to classical architectures.
https://arxiv.org/abs/1602.02218
We report on a detailed study of the intensity-dependent optical properties of individual GaN/AlN Quantum Disks (QDisks) embedded into GaN nanowires (NW). The structural and optical properties of the QDisks were probed by high spatial resolution cathodoluminescence (CL) in a scanning transmission electron microscope (STEM). By exciting the QDisks with a nanometric electron beam at currents spanning over 3 orders of magnitude, strong non-linearities (energy shifts) in the light emission are observed. In particular, we find that the amount of energy shift depends on the emission rate and on the QDisk morphology (size, position along the NW and shell thickness). For thick QDisks (>4 nm), the QDisk emission energy is observed to blue-shift with the increase of the emission intensity. This is interpreted as a consequence of the increase of the carrier density excited by the incident electron beam inside the QDisks, which screens the internal electric field and thus reduces the quantum confined Stark effect (QCSE) present in these QDisks. For thinner QDisks (<3 nm), the blue-shift is almost absent, in agreement with the negligible QCSE at such sizes. For QDisks of intermediate sizes there exists a current threshold above which the energy shifts, marking the transition from unscreened to partially screened QCSE. From the threshold value we estimate the lifetime in the unscreened regime. These observations suggest that, counterintuitively, electrons of high energy can ultimately behave as single electron-hole pair generators. In addition, when we increase the current from 1 pA to 10 pA, the light emission efficiency drops by more than one order of magnitude. This reduction of the emission efficiency is a manifestation of the efficiency droop observed in nitride-based 2D light emitting diodes, a phenomenon tentatively attributed to the Auger effect.
https://arxiv.org/abs/1605.07504
Memory networks are neural networks with an explicit memory component that can be both read and written to by the network. The memory is often addressed in a soft way using a softmax function, making end-to-end training with backpropagation possible. However, this is not computationally scalable for applications which require the network to read from extremely large memories. On the other hand, it is well known that hard attention mechanisms based on reinforcement learning are challenging to train successfully. In this paper, we explore a form of hierarchical memory network, which can be considered as a hybrid between hard and soft attention memory networks. The memory is organized in a hierarchical structure such that reading from it is done with less computation than soft attention over a flat memory, while also being easier to train than hard attention over a flat memory. Specifically, we propose to incorporate Maximum Inner Product Search (MIPS) in the training and inference procedures for our hierarchical memory network. We explore the use of various state-of-the-art approximate MIPS techniques and report results on SimpleQuestions, a challenging large scale factoid question answering task.
https://arxiv.org/abs/1605.07427
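As an illustration of the reading step described in the abstract above, the following toy sketch uses exact maximum inner product search as a stand-in for the approximate MIPS structures studied in the paper; the array shapes, the candidate size k, and the softmax over the retrieved subset are illustrative assumptions.

```python
import numpy as np

def mips_memory_read(query, memory_keys, memory_values, k=5):
    """Soft read over the k memory slots with the largest inner product to the query."""
    scores = memory_keys @ query                 # inner product with every memory slot
    top_k = np.argpartition(-scores, k)[:k]      # candidate set (exact MIPS stand-in)
    weights = np.exp(scores[top_k] - scores[top_k].max())
    weights /= weights.sum()                     # softmax restricted to the candidates
    return weights @ memory_values[top_k]        # weighted sum of the retrieved values
```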
We present a general framework and method for simultaneous detection and segmentation of an object in a video that moves (or comes into view of the camera) at some unknown time in the video. The method is an online approach based on motion segmentation, and it operates under dynamic backgrounds caused by a moving camera or moving nuisances. The goal of the method is to detect and segment the object as soon as it moves. Due to stochastic variability in the video and unreliability of the motion signal, several frames are needed to reliably detect the object. The method is designed to detect and segment with minimum delay subject to a constraint on the false alarm rate. The method is derived as a problem of Quickest Change Detection. Experiments on a dataset show the effectiveness of our method in minimizing detection delay subject to false alarm constraints.
https://arxiv.org/abs/1605.07369
Each human genome is a 3 billion base pair set of encoding instructions. Decoding the genome using deep learning fundamentally differs from most tasks, as we do not know the full structure of the data and therefore cannot design architectures to suit it. As such, architectures that fit the structure of genomics should be learned not prescribed. Here, we develop a novel search algorithm, applicable across domains, that discovers an optimal architecture which simultaneously learns general genomic patterns and identifies the most important sequence motifs in predicting functional genomic outcomes. The architectures we find using this algorithm succeed at using only RNA expression data to predict gene regulatory structure, learn human-interpretable visualizations of key sequence motifs, and surpass state-of-the-art results on benchmark genomics challenges.
https://arxiv.org/abs/1605.07156
The following letter introduces a theoretical approach to investigate the effect of a two-step GaN barrier layer growth methodology on the performance of InGaN/GaN MQW solar cells, in which a lower-temperature GaN cap layer was grown on top of each quantum well, followed by a higher-temperature GaN barrier layer. Different growth conditions would cause changes in the concentration of trap-level density of states and imperfection sites. The simulation and comparison of 3 samples, each with a different cap layer thickness, reveals that increasing the cap layer thickness results in higher quantum efficiency, improved short-circuit current density, and a 3.2% increase in the fill factor.
https://arxiv.org/abs/1605.06816
Object detection often suffers from a large number of useless proposals, and selecting high-quality proposals remains a great challenge. In this paper, we propose a semantic, class-specific approach to re-rank object proposals, which can consistently improve the recall performance even with fewer proposals. We first extract features for each proposal, including semantic segmentation, stereo information, contextual information, CNN-based objectness and low-level cues, and then score them using class-specific weights learnt by a Structured SVM. The advantages of the proposed model are twofold: 1) it can be easily merged into existing generators with little computational cost, and 2) it can achieve a high recall rate under strict criteria even when using fewer proposals. Experimental evaluation on the KITTI benchmark demonstrates that our approach significantly improves existing popular generators on recall performance. Moreover, in the experiment conducted for object detection, even with 1,500 proposals, our approach can still achieve higher average precision (AP) than baselines with 5,000 proposals.
https://arxiv.org/abs/1605.05904
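A minimal sketch of the re-ranking step from the abstract above: each proposal's concatenated features are scored with class-specific weights (a plain dot product standing in for the Structured-SVM-trained scorer) and the highest-scoring proposals are kept. The feature construction is omitted and the interface is an assumption.

```python
import numpy as np

def rerank_proposals(proposal_features, class_weights, keep=1500):
    """Score proposals with class-specific weights and keep the top-scoring ones."""
    scores = proposal_features @ class_weights   # one score per proposal
    order = np.argsort(-scores)[:keep]           # indices in descending score order
    return order, scores[order]
```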
Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
https://arxiv.org/abs/1409.0473
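For reference, the (soft-)search described above is the attention mechanism, which in the paper's notation computes alignment scores with a small feed-forward model $a$, normalizes them, and forms a per-target-word context vector from the encoder annotations $h_j$:

$$ e_{ij} = a(s_{i-1}, h_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j, $$

where $s_{i-1}$ is the previous decoder state, $T_x$ is the source length, and $c_i$ conditions the prediction of the $i$-th target word.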
Despite recent breakthroughs in the applications of deep neural networks, one setting that presents a persistent challenge is that of “one-shot learning.” Traditional gradient-based networks require a lot of data to learn, often through extensive iterative training. When new data is encountered, the models must inefficiently relearn their parameters to adequately incorporate the new information without catastrophic interference. Architectures with augmented memory capacities, such as Neural Turing Machines (NTMs), offer the ability to quickly encode and retrieve new information, and hence can potentially obviate the downsides of conventional models. Here, we demonstrate the ability of a memory-augmented neural network to rapidly assimilate new data, and leverage this data to make accurate predictions after only a few samples. We also introduce a new method for accessing an external memory that focuses on memory content, unlike previous methods that additionally use memory location-based focusing mechanisms.
https://arxiv.org/abs/1605.06065
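A small numpy sketch of the content-based memory access mentioned above: read weights come from a softmax over (sharpened) cosine similarities between a query key and the stored memory rows, with no location-based addressing. The key strength `beta` and the array shapes are illustrative assumptions.

```python
import numpy as np

def content_based_read(key, memory, beta=1.0):
    """Read from memory using cosine-similarity content addressing only."""
    # Cosine similarity between the query key and every memory row.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    weights = np.exp(beta * sims)
    weights /= weights.sum()           # softmax over memory locations
    return weights @ memory            # weighted sum of memory rows
```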
We prove a precise formula relating the Bessel period of certain automorphic forms on ${\rm GSp}_4(\mathbb{A}_F)$ to a central $L$-value. This is a special case of the refined Gan–Gross–Prasad conjecture for the groups $({\rm SO}_5,{\rm SO}_2)$ as set out by Ichino–Ikeda and Liu. This conjecture is deep and hard to prove in full generality; in this paper we succeed in proving the conjecture for forms lifted, via automorphic induction, from ${\rm GL}_2(\mathbb{A}_E)$ where $E$ is a quadratic extension of $F$. The case where $E=F\times F$ has been previously dealt with by Liu.
https://arxiv.org/abs/1507.00089
We investigate the use of hierarchical phrase-based SMT lattices in end-to-end neural machine translation (NMT). Weight pushing transforms the Hiero scores for complete translation hypotheses, with the full translation grammar score and full n-gram language model score, into posteriors compatible with NMT predictive probabilities. With a slightly modified NMT beam-search decoder we find gains over both Hiero and NMT decoding alone, with practical advantages in extending NMT to very large input and output vocabularies.
https://arxiv.org/abs/1605.04569
Recent advances in the image captioning task have led to increasing interest in the video captioning task. However, most works on video captioning focus on generating captions from a single aggregated feature input, which hardly deviates from the image captioning process and does not fully take advantage of the dynamic content present in videos. We attempt to generate video captions that convey richer content by temporally segmenting the video with action localization, generating multiple captions from multiple frames, and connecting them with natural language processing techniques, in order to generate a story-like caption. We show that our proposed method can generate captions that are richer in content and can compete with state-of-the-art methods without explicitly using video-level features as input.
https://arxiv.org/abs/1605.05440
Recent advances in conditional recurrent language modelling have mainly focused on network architectures (e.g., attention mechanism), learning algorithms (e.g., scheduled sampling and sequence-level training) and novel applications (e.g., image/video description generation, speech recognition, etc.). On the other hand, we notice that decoding algorithms/strategies have not been investigated as much, and it has become standard to use greedy or beam search. In this paper, we propose a novel decoding strategy motivated by an earlier observation that nonlinear hidden layers of a deep neural network stretch the data manifold. The proposed strategy is embarrassingly parallelizable without any communication overhead, while improving an existing decoding algorithm. We extensively evaluate it with attention-based neural machine translation on the task of En->Cz translation.
https://arxiv.org/abs/1605.03835
Object detection with deep neural networks is often performed by passing a few thousand candidate bounding boxes through a deep neural network for each image. These bounding boxes are highly correlated since they originate from the same image. In this paper we investigate how to exploit feature occurrence at the image scale to prune the neural network which is subsequently applied to all bounding boxes. We show that removing units which have near-zero activation in the image allows us to significantly reduce the number of parameters in the network. Results on the PASCAL 2007 Object Detection Challenge demonstrate that up to 40% of units in some fully-connected layers can be entirely eliminated with little change in the detection result.
https://arxiv.org/abs/1605.03477
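A rough numpy illustration of the pruning rule described above: hidden units whose activation on the image-level pass is near zero are dropped from a fully-connected layer, shrinking both that layer's weights and the next layer's incoming weights before the per-box passes. The threshold, shapes, and two-layer setup are assumptions for illustration.

```python
import numpy as np

def prune_fc_layer(W1, b1, W2, image_activations, eps=1e-3):
    """Remove hidden units whose image-level activation is (near) zero."""
    keep = np.abs(image_activations) > eps    # boolean mask over hidden units
    W1_pruned = W1[keep, :]                   # rows producing the kept units
    b1_pruned = b1[keep]
    W2_pruned = W2[:, keep]                   # matching input columns of the next layer
    return W1_pruned, b1_pruned, W2_pruned
```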
The localized effect of impurities in single GaN nanowires at the sub-diffraction limit is reported using the study of lattice vibrational modes in the evanescent field of Au-nanoparticle-assisted tip-enhanced Raman spectroscopy (TERS). GaN nanowires with O impurities and Mg dopants were grown by the chemical vapor deposition technique in the catalyst-assisted vapor-liquid-solid process. Symmetry-allowed Raman modes of wurtzite GaN are observed for undoped and doped nanowires. An unusually strong intensity of the non-zone-center zone-boundary mode is observed in the TERS studies of both the undoped and the Mg-doped GaN single nanowires. A surface optical mode of A1 symmetry is also observed for both the undoped and the Mg-doped GaN samples. A strong coupling of longitudinal optical (LO) phonons with free electrons, however, is reported only in the O-rich single nanowires with the asymmetric A1(LO) mode. Study of the local vibration mode shows the presence of Mg as a dopant in the single GaN nanowires.
https://arxiv.org/abs/1605.03295
The physicochemical processes at the surfaces of semiconductor nanostructures involved in electrochemical and sensing devices are strongly influenced by the presence of intrinsic or extrinsic defects. To reveal the surface controlled sensing mechanism, intentional lattice oxygen defects are created on the surfaces of GaN nanowires for the elucidation of the charge transfer process in methane (CH4) sensing. Experimental and simulation results of electron energy loss spectroscopy (EELS) studies on oxygen rich GaN nanowires confirmed the possible presence of 2(ON) and VGa-3ON defect complexes. A global resistive response for sensor devices of ensemble nanowires and a localized charge transfer process in single GaN nanowires are studied in situ by scanning Kelvin probe microscopy (SKPM). A localized charge transfer process involving the VGa-3ON defect complex on the nanowire surface is identified as controlling the global gas sensing behavior of the oxygen rich ensemble GaN nanowires.
https://arxiv.org/abs/1605.03293
Although deep convolutional neural networks (CNNs) have achieved remarkable results on object detection and segmentation, pre- and post-processing steps such as region proposals and non-maximum suppression (NMS) have been required. These steps result in high computational complexity and sensitivity to hyperparameters, e.g. thresholds for NMS. In this work, we propose a novel end-to-end trainable deep neural network architecture, which consists of convolutional and recurrent layers, that generates the correct number of object instances and their bounding boxes (or segmentation masks) given an image, using only a single network evaluation without any pre- or post-processing steps. We have tested our approach on detecting digits in multi-digit images synthesized using MNIST, automatically segmenting digits in these images, and detecting cars in the KITTI benchmark dataset. The proposed approach outperforms a strong CNN baseline on the synthesized digits datasets and shows promising results on KITTI car detection.
https://arxiv.org/abs/1511.06449
GaN layers, several tens of $\mu$m thick, have been deposited on a silicon substrate using a two-step hydride vapor phase epitaxy (HVPE) process. The substrates have been covered by AlN layers and GaN nanostructures grown by plasma-assisted molecular-beam epitaxy. During the first low-temperature (low-T) HVPE step, stacking faults (SFs) form, which show distinct luminescence lines and stripe-like features in cathodoluminescence images of the cross-section of the layers. These cathodoluminescence features allow for an insight into the growth process. During a second high-temperature (high-T) step, the SFs disappear, and the luminescence of this part of the GaN layer is dominated by the donor-bound exciton. For templates consisting of both a thin AlN buffer and GaN nanostructures, a silicon incorporation into the GaN grown by HVPE is not observed. Moreover, the growth mode of the high-T HVPE step depends on the specific structure of the AlN/GaN template, where in a first case the epitaxy is dominated by the formation of slowly growing facets, while in a second case the epitaxy proceeds directly along the c-axis.
https://arxiv.org/abs/1605.03089
For improving e-health services, we propose a context-aware framework to monitor the activities of daily living of dependent persons. We define a strategy for generating long-term realistic scenarios and a framework containing an adaptive monitoring algorithm based on three approaches for optimizing resource usage. The used approaches provide deep knowledge about the person’s context by considering the person’s profile, the activities, and the relationships between activities. We evaluate the performance of our framework and show its adaptability and a significant reduction in network, energy and processing usage compared to a traditional monitoring implementation.
http://arxiv.org/abs/1605.03035
We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is far less likely to predict false detections where nothing exists. Finally, YOLO learns very general representations of objects. It outperforms all other detection methods, including DPM and R-CNN, by a wide margin when generalizing from natural images to artwork on both the Picasso Dataset and the People-Art Dataset.
https://arxiv.org/abs/1506.02640
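As a hedged illustration of the single-evaluation regression above, the sketch below decodes a YOLO-style output tensor into class-specific scored boxes; the 7x7 grid, 2 boxes per cell, and 20 classes match the base model described for PASCAL VOC, but the exact tensor layout and the threshold are assumptions.

```python
import numpy as np

def decode_yolo_output(pred, S=7, B=2, C=20, conf_thresh=0.2):
    """Turn an S x S x (B*5 + C) prediction tensor into scored boxes (sketch)."""
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]                # conditional class probabilities
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                scores = conf * class_probs           # class-specific confidence
                if scores.max() > conf_thresh:
                    detections.append(((x, y, w, h), int(scores.argmax()), float(scores.max())))
    return detections
```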
This paper addresses the issue of how to more effectively coordinate depth with RGB in order to boost the performance of RGB-D object detection. Particularly, we investigate two primary ideas under the CNN model: property derivation and property fusion. Firstly, we propose that depth can be utilized not only as a type of extra information besides RGB but also to derive more visual properties for comprehensively describing the objects of interest. So a two-stage learning framework consisting of property derivation and fusion is constructed. Here the properties can be derived either from the provided color/depth or their pairs (e.g. the geometry contour adopted in this paper). Secondly, we explore the fusion method of different properties in feature learning, which, under the CNN model, boils down to deciding from which layer the properties should be fused together. The analysis shows that different semantic properties should be learned separately and combined before passing into the final classifier. Such a detection scheme is in accordance with the mechanism of the primary visual cortex (V1) in the brain. We experimentally evaluate the proposed method on a challenging dataset and achieve state-of-the-art performance.
https://arxiv.org/abs/1605.02260
Millimeter wave (mmWave) communication is envisioned as a cornerstone to fulfill the data rate requirements for fifth generation (5G) cellular networks. In mmWave communication, beamforming is considered as a key technology to combat the high path-loss, and unlike in conventional microwave communication, beamforming may be necessary even during initial access/cell search. Among the proposed beamforming schemes for initial cell search, analog beamforming is a power efficient approach but suffers from its inherent search delay during initial access. In this work, we argue that analog beamforming can still be a viable choice when context information about mmWave base stations (BS) is available at the mobile station (MS). We then study how the performance of analog beamforming degrades in case of angular errors in the available context information. Finally, we present an analog beamforming receiver architecture that uses multiple arrays of Phase Shifters and a single RF chain to combat the effect of angular errors, showing that it can achieve the same performance as hybrid beamforming.
https://arxiv.org/abs/1605.01930
This paper describes our winning entry in the ImageCLEF 2015 image sentence generation task. We improve Google’s CNN-LSTM model by introducing concept-based sentence reranking, a data-driven approach which exploits the large amounts of concept-level annotations on Flickr. Different from previous usage of concept detection that is tailored to specific image captioning models, the proposed approach reranks predicted sentences in terms of their matches with detected concepts, essentially treating the underlying model as a black box. This property makes the approach applicable to a number of existing solutions. We also experiment with fine-tuning the deep language model, which improves the performance further. Scoring a METEOR of 0.1875 on the ImageCLEF 2015 test set, our system outperforms the runner-up (METEOR of 0.1687) by a clear margin.
https://arxiv.org/abs/1605.00855
Beamforming is an essential requirement to combat high pathloss and to improve the signal-to-noise ratio during initial cell discovery in future millimeter wave cellular networks. The choice of an appropriate beamforming scheme is directly coupled with its energy consumption. The energy consumption is of even more concern at a battery-limited mobile station (MS). In this work, we provide an energy consumption based comparison of different beamforming schemes while considering both a low power and a high power analog-to-digital converter (ADC) for a millimeter wave based receiver at the MS. We analyze both context information (CI) based (GPS positioning) and non-CI based schemes, and show that analog beamforming with CI (where the mobile station’s positioning information is already available) can result in a lower energy consumption, while in all other scenarios digital beamforming has a lower energy consumption than analog and hybrid beamforming. We also show that under certain scenarios the recently proposed phase shifters network architecture can result in a lower energy consumption than other beamforming schemes. Moreover, we show that the energy consumption trend among different beamforming schemes is valid irrespective of the number of ADC bits. Finally, we propose a new signaling structure which utilizes a relatively higher frequency sub-carrier for primary synchronization signals compared to other signaling, which allows a further reduction in the initial cell search delay and the energy consumption of the MS.
https://arxiv.org/abs/1605.00508
Much of the recent progress in Vision-to-Language (V2L) problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. We propose here a method of incorporating high-level concepts into the very successful CNN-RNN approach, and show that it achieves a significant improvement on the state-of-the-art performance in both image captioning and visual question answering. We also show that the same mechanism can be used to introduce external semantic information and that doing so further improves performance. In doing so we provide an analysis of the value of high level semantic information in V2L problems.
https://arxiv.org/abs/1506.01144
While recent deep neural network models have achieved promising results on the image captioning task, they rely largely on the availability of corpora with paired image and sentence captions to describe objects in context. In this work, we propose the Deep Compositional Captioner (DCC) to address the task of generating descriptions of novel objects which are not present in paired image-sentence datasets. Our method achieves this by leveraging large object recognition datasets and external text corpora and by transferring knowledge between semantically similar concepts. Current deep caption models can only describe objects contained in paired image-sentence corpora, despite the fact that they are pre-trained with large object recognition datasets, namely ImageNet. In contrast, our model can compose sentences that describe novel objects and their interactions with other objects. We demonstrate our model’s ability to describe novel concepts by empirically evaluating its performance on MSCOCO and show qualitative results on ImageNet images of objects for which no paired image-caption data exist. Further, we extend our approach to generate descriptions of objects in video clips. Our results show that DCC has distinct advantages over existing image and video captioning approaches for generating descriptions of new objects in context.
https://arxiv.org/abs/1511.05284
The tremendous growth in 3D (stereo) imaging and display technologies has led to stereoscopic content (video and image) becoming increasingly popular. However, both the subjective and the objective evaluation of stereoscopic video content has not kept pace with the rapid growth of the content. Further, the availability of standard stereoscopic video databases is also quite limited. In this work, we attempt to alleviate these shortcomings. We present a stereoscopic video database and its subjective evaluation. We have created a database containing a set of 144 distorted videos. We limit our attention to H.264 compression artifacts. The distorted videos were generated using 6 uncompressed pristine videos of left and right views originally created by Goldmann et al. at EPFL [1]. Further, 19 subjects participated in the subjective assessment task. Based on the subjective study, we have formulated a relation between the 2D and stereoscopic subjective scores as a function of compression rate and depth range. We have also evaluated the performance of popular 2D and 3D image/video quality assessment (I/VQA) algorithms on our database.
https://arxiv.org/abs/1604.07519
A Content-Based Image Retrieval (CBIR) system which identifies similar medical images based on a query image can assist clinicians for more accurate diagnosis. The recent CBIR research trend favors the construction and use of binary codes to represent images. Deep architectures can learn the non-linear relationship among image pixels adaptively, allowing the automatic learning of high-level features from raw pixels. However, most of them require class labels, which are expensive to obtain, particularly for medical images. The methods which do not need class labels utilize a deep autoencoder for binary hashing, but the code construction involves a specific training algorithm and an ad-hoc regularization technique. In this study, we explored using a deep de-noising autoencoder (DDA), with a new unsupervised training scheme using only backpropagation and dropout, to hash images into binary codes. We conducted experiments on more than 14,000 x-ray images. Using class labels only for evaluating the retrieval results, we constructed a 16-bit DDA and a 512-bit DDA independently. Compared to other unsupervised methods, we succeeded in obtaining the lowest total error by using the 512-bit codes for retrieval via exhaustive search, and achieved a 9.27-times speedup with the 16-bit codes while keeping a comparable total error. We found that our new training scheme could reduce the total retrieval error significantly, by 21.9%. To further boost the image retrieval performance, we developed Radon Autoencoder Barcodes (RABCs), which are learned from the Radon projections of images using a de-noising autoencoder. Experimental results demonstrated their superior performance in retrieval when combined with DDA binary codes.
https://arxiv.org/abs/1604.07060
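A hedged sketch of the retrieval pipeline described above: the autoencoder's hidden activations are thresholded into binary codes, and retrieval is an exhaustive Hamming-distance search over the stored codes. The 0.5 threshold and the encoder interface are assumptions, not the authors' exact training scheme.

```python
import numpy as np

def binarize(hidden_activations, threshold=0.5):
    """Binarize (sigmoid) hidden activations of the autoencoder into hash codes."""
    return (hidden_activations > threshold).astype(np.uint8)

def retrieve(query_code, database_codes, top_k=10):
    """Exhaustive Hamming-distance search over the stored binary codes."""
    distances = np.count_nonzero(database_codes != query_code, axis=1)
    return np.argsort(distances)[:top_k]   # indices of the top_k nearest codes
```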
Recurrent neural networks (RNNs) are simple dynamical systems whose computational power has been attributed to their short-term memory. Short-term memory of RNNs has been previously studied analytically only for the case of orthogonal networks, under the annealed approximation, and with uncorrelated input. Here, for the first time, we present an exact solution to the memory capacity and the task-solving performance as a function of the structure of a given network instance, enabling direct determination of the function–structure relation in RNNs. We calculate the memory capacity for arbitrary networks with exponentially correlated input and further relate it to the performance of the system on signal processing tasks in a supervised learning setup. We compute the expected error and the worst-case error bound as a function of the spectra of the network and the correlation structure of its inputs and outputs. Our results give an explanation for learning and generalization of task solving using short-term memory, which is crucial for building alternative computer architectures using physical phenomena based on the short-term memory principle.
https://arxiv.org/abs/1604.06929
Recurrent Neural Networks (RNNs) have obtained excellent results in many natural language processing (NLP) tasks. However, understanding and interpreting the source of this success remains a challenge. In this paper, we propose the Recurrent Memory Network (RMN), a novel RNN architecture that not only amplifies the power of RNNs but also facilitates our understanding of their internal functioning and allows us to discover underlying patterns in data. We demonstrate the power of RMN on language modeling and sentence completion tasks. On language modeling, RMN outperforms the Long Short-Term Memory (LSTM) network on three large German, Italian, and English datasets. Additionally, we perform an in-depth analysis of various linguistic dimensions that RMN captures. On the Sentence Completion Challenge, for which it is essential to capture sentence coherence, our RMN obtains 69.2% accuracy, surpassing the previous state of the art by a large margin.
https://arxiv.org/abs/1601.01272
We propose a framework for top-down salient object detection that incorporates a tightly coupled image classification module. The classifier is trained on novel category-aware sparse codes computed on object dictionaries used for saliency modeling. A misclassification indicates that the corresponding saliency model is inaccurate. Hence, the classifier selects images for which the saliency models need to be updated. The category-aware sparse coding produces better image classification accuracy as compared to conventional sparse coding with a reduced computational complexity. A saliency-weighted max-pooling is proposed to improve image classification, which is further used to refine the saliency maps. Experimental results on Graz-02 and PASCAL VOC-07 datasets demonstrate the effectiveness of salient object detection. Although the role of the classifier is to support salient object detection, we evaluate its performance in image classification and also illustrate the utility of thresholded saliency maps for image segmentation.
https://arxiv.org/abs/1604.06570
The status quo approach to training object detectors requires expensive bounding box annotations. Our framework takes a markedly different direction: we transfer tracked object boxes from weakly-labeled videos to weakly-labeled images to automatically generate pseudo ground-truth boxes, which replace manually annotated bounding boxes. We first mine discriminative regions in the weakly-labeled image collection that frequently/rarely appear in the positive/negative images. We then match those regions to videos and retrieve the corresponding tracked object boxes. Finally, we design a Hough transform algorithm to vote for the best box to serve as the pseudo ground truth for each image, and use them to train an object detector. Together, these lead to state-of-the-art weakly-supervised detection results on the PASCAL 2007 and 2010 datasets.
https://arxiv.org/abs/1604.05766
The complex compositional structure of language makes problems at the intersection of vision and language challenging. But language also provides a strong prior that can result in good superficial performance, without the underlying models truly understanding the visual content. This can hinder progress in pushing the state of the art in the computer vision aspects of multi-modal AI. In this paper, we address binary Visual Question Answering (VQA) on abstract scenes. We formulate this problem as visual verification of concepts inquired in the questions. Specifically, we convert the question to a tuple that concisely summarizes the visual concept to be detected in the image. If the concept can be found in the image, the answer to the question is “yes”, and otherwise “no”. Abstract scenes play two roles: (1) they allow us to focus on the high-level semantics of the VQA task as opposed to the low-level recognition problems, and, perhaps more importantly, (2) they provide us the modality to balance the dataset such that language priors are controlled and the role of vision is essential. In particular, we collect fine-grained pairs of scenes for every question, such that the answer to the question is “yes” for one scene, and “no” for the other for the exact same question. Indeed, language priors alone do not perform better than chance on our balanced dataset. Moreover, our proposed approach matches the performance of a state-of-the-art VQA approach on the unbalanced dataset, and outperforms it on the balanced dataset.
https://arxiv.org/abs/1511.05099
In a wide range of statistical learning problems such as ranking, clustering or metric learning among others, the risk is accurately estimated by $U$-statistics of degree $d\geq 1$, i.e. functionals of the training data with low variance that take the form of averages over $d$-tuples. From a computational perspective, the calculation of such statistics is highly expensive even for a moderate sample size $n$, as it requires averaging $O(n^d)$ terms. This makes learning procedures relying on the optimization of such data functionals hardly feasible in practice. It is the major goal of this paper to show that, strikingly, such empirical risks can be replaced by drastically computationally simpler Monte-Carlo estimates based on $O(n)$ terms only, usually referred to as incomplete $U$-statistics, without damaging the $O_{\mathbb{P}}(1/\sqrt{n})$ learning rate of Empirical Risk Minimization (ERM) procedures. For this purpose, we establish uniform deviation results describing the error made when approximating a $U$-process by its incomplete version under appropriate complexity assumptions. Extensions to model selection, fast rate situations and various sampling techniques are also considered, as well as an application to stochastic gradient descent for ERM. Finally, numerical examples are displayed in order to provide strong empirical evidence that the approach we promote largely surpasses more naive subsampling techniques.
http://arxiv.org/abs/1501.02629
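In the notation of the abstract above, the complete $U$-statistic of degree $d$ averages a kernel $h$ over all $\binom{n}{d}$ tuples, while the incomplete version averages over a set $\mathcal{D}_B$ of only $B=O(n)$ tuples sampled at random:

$$ U_n(h)=\binom{n}{d}^{-1}\sum_{1\le i_1<\dots<i_d\le n} h(X_{i_1},\dots,X_{i_d}), \qquad \widetilde{U}_B(h)=\frac{1}{B}\sum_{(i_1,\dots,i_d)\in\mathcal{D}_B} h(X_{i_1},\dots,X_{i_d}). $$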
We propose a method for visual question answering which combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. This allows more complex questions to be answered using the predominant neural network-based approach than has previously been possible. It particularly allows questions to be asked about the contents of an image, even when the image itself does not contain the whole answer. The method constructs a textual representation of the semantic content of an image, and merges it with textual information sourced from a knowledge base, to develop a deeper understanding of the scene viewed. Priming a recurrent neural network with this combined information, and the submitted question, leads to a very flexible visual question answering approach. We are specifically able to answer questions posed in natural language, that refer to information not contained in the image. We demonstrate the effectiveness of our model on two publicly available datasets, Toronto COCO-QA and MS COCO-VQA and show that it produces the best reported results in both cases.
https://arxiv.org/abs/1511.06973
Deep Convolutional Neural Networks (CNNs) have shown impressive performance in various vision tasks such as image classification, object detection and semantic segmentation. For object detection, particularly in still images, the performance was significantly increased last year thanks to powerful deep networks (e.g. GoogleNet) and detection frameworks (e.g. Regions with CNN features (R-CNN)). The recently introduced ImageNet task on object detection from video (VID) brings the object detection task into the video domain, in which objects’ locations at each frame are required to be annotated with bounding boxes. In this work, we introduce a complete framework for the VID task based on still-image object detection and general object tracking. Their relations and contributions in the VID task are thoroughly studied and evaluated. In addition, a temporal convolution network is proposed to incorporate temporal information to regularize the detection results, and it shows its effectiveness for the task.
https://arxiv.org/abs/1604.04053
Object detection is one of the most active areas in computer vision, which has made significant improvement in recent years. Current state-of-the-art object detection methods mostly adhere to the framework of regions with convolutional neural network (R-CNN) and only use local appearance features inside object bounding boxes. Since these approaches ignore the contextual information around the object proposals, the outcome of these detectors may generate a semantically incoherent interpretation of the input image. In this paper, we propose an ensemble object detection system which incorporates the local appearance, the contextual information in terms of relationships among objects, and the global scene based contextual feature generated by a convolutional neural network. The system is formulated as a fully connected conditional random field (CRF) defined on object proposals, and the contextual constraints among object proposals are naturally modeled as edges. Furthermore, a fast mean field approximation method is utilized to perform inference in this CRF model efficiently. The experimental results demonstrate that our approach achieves a higher mean average precision (mAP) on the PASCAL VOC 2007 dataset compared to the baseline algorithm Faster R-CNN.
https://arxiv.org/abs/1604.04048
Finetuning from a pretrained deep model is found to yield state-of-the-art performance for many vision tasks. This paper investigates many factors that influence the performance of finetuning for object detection. There is a long-tailed distribution of sample numbers for classes in object detection. Our analysis and empirical results show that classes with more samples have a higher impact on the feature learning, and it is better to make the sample number more uniform across classes. Generic object detection can be considered as multiple equally important tasks, where the detection of each class is a task. These classes/tasks have their individuality in discriminative visual appearance representation. Taking this individuality into account, we cluster objects into visually similar class groups and learn deep representations for these groups separately. A hierarchical feature learning scheme is proposed, in which the knowledge from the group with a large number of classes is transferred for learning features in its sub-groups. Finetuned from the GoogLeNet model, our approach achieves a 4.7% absolute mAP improvement on the ImageNet object detection dataset without increasing much computational cost at the testing stage.
https://arxiv.org/abs/1601.05150
We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND v.1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.
https://arxiv.org/abs/1604.03968
When human annotators are given a choice about what to label in an image, they apply their own subjective judgments on what to ignore and what to mention. We refer to these noisy “human-centric” annotations as exhibiting human reporting bias. Examples of such annotations include image tags and keywords found on photo sharing sites, or in datasets containing image captions. In this paper, we use these noisy annotations for learning visually correct image classifiers. Such annotations do not use consistent vocabulary, and miss a significant amount of the information present in an image; however, we demonstrate that the noise in these annotations exhibits structure and can be modeled. We propose an algorithm to decouple the human reporting bias from the correct visually grounded labels. Our results are highly interpretable for reporting “what’s in the image” versus “what’s worth saying.” We demonstrate the algorithm’s efficacy along a variety of metrics and datasets, including MS COCO and Yahoo Flickr 100M. We show significant improvements over traditional algorithms for both image classification and image captioning, doubling the performance of existing methods in some cases.
https://arxiv.org/abs/1512.06974
The vertex coloring problem has received a lot of attention in the context of synchronous round-based systems where, at each round, a process can send a message to all its neighbors, and receive a message from each of them. Hence, this communication model is particularly suited to point-to-point communication channels. Several vertex coloring algorithms suited to these systems have been proposed. They differ mainly in the number of rounds they require and the number of colors they use. This paper considers a broadcast/receive communication model in which message collisions and message conflicts can occur (a collision occurs when, during the same round, messages are sent to the same process by too many neighbors; a conflict occurs when a process and one of its neighbors broadcast during the same round). This communication model is suited to systems where processes share communication bandwidths. More precisely, the paper considers the case where, during a round, a process may either broadcast a message to its neighbors or receive a message from at most $m$ of them. This captures communication-related constraints or a local memory constraint stating that, whatever the number of neighbors of a process, its local memory allows it to receive and store at most $m$ messages during each round. The paper defines first the corresponding generic vertex multi-coloring problem (a vertex can have several colors). It focuses then on tree networks, for which it presents a lower bound on the number of colors $K$ that are necessary (namely, $K=\lceil\frac{\Delta}{m}\rceil+1$, where $\Delta$ is the maximal degree of the communication graph), and an associated coloring algorithm, which is optimal with respect to $K$.
https://arxiv.org/abs/1604.03356
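As a quick numerical check of the bound above (the values $\Delta=5$ and $m=2$ are illustrative, not from the paper): a tree with maximal degree $\Delta=5$ under a budget of $m=2$ messages received per round needs

$$ K=\left\lceil\frac{\Delta}{m}\right\rceil+1=\left\lceil\frac{5}{2}\right\rceil+1=4 $$

colors.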
With the recent popularity of animated GIFs on social media, there is a need for ways to index them with rich metadata. To advance research on animated GIF understanding, we collected a new dataset, Tumblr GIF (TGIF), with 100K animated GIFs from Tumblr and 120K natural language descriptions obtained via crowdsourcing. The motivation for this work is to develop a testbed for image sequence description systems, where the task is to generate natural language descriptions for animated GIFs or video clips. To ensure a high quality dataset, we developed a series of novel quality controls to validate free-form text input from crowdworkers. We show that there is unambiguous association between visual content and natural language descriptions in our dataset, making it an ideal benchmark for the visual content captioning task. We perform extensive statistical analyses to compare our dataset to existing image and video description datasets. Next, we provide baseline results on the animated GIF description task, using three representative techniques: nearest neighbor, statistical machine translation, and recurrent neural networks. Finally, we show that models fine-tuned from our animated GIF description dataset can be helpful for automatic movie description.
https://arxiv.org/abs/1604.02748
Breadth First Search (BFS) is a building block for graph algorithms and has recently been used for large scale analysis of information in a variety of applications including social networks, graph databases and web searching. Due to its importance, a number of different parallel programming models and architectures have been exploited to optimize the BFS. However, due to the irregular memory access patterns and the unstructured nature of the large graphs, its efficient parallelization is a challenge. The Xeon Phi is a massively parallel architecture available as an off-the-shelf accelerator, which includes a powerful 512-bit vector unit with optimized scatter and gather functions. Given its potential benefits, work related to graph traversing on this architecture is an active area of research. We present a set of experiments in which we explore architectural features of the Xeon Phi and how best to exploit them in a top-down BFS algorithm, but the techniques can be applied to the current state-of-the-art hybrid (top-down plus bottom-up) algorithms. We focus on the exploitation of the vector unit by developing an improved, highly vectorized OpenMP parallel algorithm, using vector intrinsics, and understanding the use of data alignment and prefetching. In addition, we investigate the impact of hyperthreading and thread affinity on performance, a topic that appears under-researched in the literature. As a result, we achieve what we believe is the fastest published top-down BFS algorithm on the version of Xeon Phi used in our experiments. The vectorized BFS top-down source code presented in this paper can be available on request as free-to-use software.
https://arxiv.org/abs/1604.02844
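For context, the sketch below is a plain, sequential version of the level-synchronous top-down BFS that the vectorized Xeon Phi kernels above accelerate; the adjacency-list representation is an assumption for illustration, and none of the OpenMP, data-alignment, or vector-intrinsic optimizations are shown.

```python
def bfs_top_down(adjacency, source):
    """Level-synchronous top-down BFS; adjacency maps each vertex to its neighbor list."""
    parent = {source: source}
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:                 # expand every frontier vertex
            for v in adjacency[u]:
                if v not in parent:        # first visit claims the vertex
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier           # advance one BFS level
    return parent
```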
State-of-the-art object detection systems rely on an accurate set of region proposals. Several recent methods use a neural network architecture to hypothesize promising object locations. While these approaches are computationally efficient, they rely on fixed image regions as anchors for predictions. In this paper we propose to use a search strategy that adaptively directs computational resources to sub-regions likely to contain objects. Compared to methods based on fixed anchor locations, our approach naturally adapts to cases where object instances are sparse and small. Our approach is comparable in terms of accuracy to the state-of-the-art Faster R-CNN approach while using two orders of magnitude fewer anchors on average. Code is publicly available.
https://arxiv.org/abs/1512.07711
In this paper, we address the task of natural language object retrieval, to localize a target object within a given image based on a natural language query of the object. Natural language object retrieval differs from text-based image retrieval task as it involves spatial information about objects within the scene and global scene context. To address this issue, we propose a novel Spatial Context Recurrent ConvNet (SCRC) model as scoring function on candidate boxes for object retrieval, integrating spatial configurations and global scene-level contextual information into the network. Our model processes query text, local image descriptors, spatial configurations and global context features through a recurrent network, outputs the probability of the query text conditioned on each candidate box as a score for the box, and can transfer visual-linguistic knowledge from image captioning domain to our task. Experimental results demonstrate that our method effectively utilizes both local and global information, outperforming previous baseline methods significantly on different datasets and scenarios, and can exploit large scale vision and language datasets for knowledge transfer.
https://arxiv.org/abs/1511.04164
We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene. Our model is inspired by recent successes of deep learning methods for image captioning, but while image captioning is difficult to evaluate, our task allows for easy objective evaluation. We also present a new large-scale dataset for referring expressions, based on MS-COCO. We have released the dataset and a toolbox for visualization and evaluation, see this https URL
https://arxiv.org/abs/1511.02283
This paper presents a novel approach to perform sentiment analysis of news videos, based on the fusion of audio, textual and visual clues extracted from their contents. The proposed approach aims at contributing to the semiodiscoursive study regarding the construction of the ethos (identity) of this media universe, which has become a central part of the modern-day lives of millions of people. To achieve this goal, we apply state-of-the-art computational methods for (1) automatic emotion recognition from facial expressions, (2) extraction of modulations in the participants’ speeches and (3) sentiment analysis from the closed caption associated to the videos of interest. More specifically, we compute features, such as, visual intensities of recognized emotions, field sizes of participants, voicing probability, sound loudness, speech fundamental frequencies and the sentiment scores (polarities) from text sentences in the closed caption. Experimental results with a dataset containing 520 annotated news videos from three Brazilian and one American popular TV newscasts show that our approach achieves an accuracy of up to 84% in the sentiments (tension levels) classification task, thus demonstrating its high potential to be used by media analysts in several applications, especially, in the journalistic domain.
https://arxiv.org/abs/1604.02612