We evaluate the performance of associative memory in a neural network based on the singular value decomposition (SVD) of the image data stored in the network. We consider the situation in which the original image and a highly coarse-grained version of it obtained by SVD are stored in the network, and an image of intermediate coarseness is given as input. We find that the performance is characterized by the snapshot-entropy scaling inherent in the SVD: the network retrieves the original image when the entropy of the input image is larger than a critical value determined from the scaling. The result indicates the efficiency of the SVD as a criterion of performance and also indicates the universality of the scaling for realistic problems beyond theoretical physics.
https://arxiv.org/abs/1608.08333
Image segmentation from referring expressions is a joint vision and language modeling task, where the input is an image and a textual expression describing a particular region in the image, and the goal is to localize and segment the specific image region based on the given expression. One major difficulty in training such language-based image segmentation systems is the lack of datasets with joint vision and text annotations. Although existing vision datasets such as MS COCO provide image captions, there are few datasets with region-level textual annotations for images, and these are often smaller in scale. In this paper, we explore how existing large-scale vision-only and text-only datasets can be utilized to train models for image segmentation from referring expressions. We propose a method to address this problem, and show in experiments that our method can use vision-only and text-only data to help this joint vision and language modeling task and outperforms previous results.
https://arxiv.org/abs/1608.08305
Spins bound to point defects are increasingly viewed as an important resource for solid-state implementations of quantum information technologies. In particular, there is a growing interest in the identification of new classes of defect spins that can be controlled optically. Here we demonstrate ensemble optical spin polarization and optically detected magnetic resonance (ODMR) of the S = 1 electronic ground state of chromium (Cr4+) impurities in silicon carbide (SiC) and gallium nitride (GaN). Spin polarization is made possible by the narrow optical linewidths of these ensembles (< 8.5 GHz), which are similar in magnitude to the ground state zero-field spin splitting energies of the ions at liquid helium temperatures. We are therefore able to optically resolve individual spin sublevels within the ensembles at low magnetic fields using resonant excitation from a cavity-stabilized, narrow-linewidth laser. Additionally, these near-infrared emitters possess exceptionally weak phonon sidebands, ensuring that > 73% of the overall optical emission is contained within the defects' zero-phonon lines. These characteristics make this semiconductor-based, transition metal impurity system a promising target for further study in the ongoing effort to integrate optically active quantum states within common optoelectronic materials.
https://arxiv.org/abs/1608.08255
Visual question answering (VQA) systems are emerging from a desire to empower users to ask any natural language question about visual content and receive a valid answer in response. However, close examination of the VQA problem reveals an unavoidable, entangled problem: multiple humans may not agree on a single answer to a visual question. We train a model to automatically predict from a visual question whether a crowd would agree on a single answer. We then propose how to exploit this system in a novel application to efficiently allocate human effort to collect answers to visual questions. Specifically, we propose a crowdsourcing system that automatically solicits fewer human responses when answer agreement is expected and more human responses when answer disagreement is expected. Our system improves upon existing crowdsourcing systems, typically eliminating at least 20% of human effort with no loss to the information collected from the crowd.
https://arxiv.org/abs/1608.08188
In this paper, we enhance attention-based neural machine translation (NMT) by adding explicit coverage embedding models to alleviate the issues of repeated and dropped translations in NMT. For each source word, our model starts with a full coverage embedding vector to track the coverage status, and then keeps updating it with neural networks as the translation proceeds. Experiments on a large-scale Chinese-to-English task show that our enhanced model improves the translation quality significantly on various test sets over a strong large-vocabulary NMT system.
https://arxiv.org/abs/1605.03148
Deep neural networks (DNNs) demand a very large amount of computation and weight storage, and thus efficient implementation using special-purpose hardware is highly desired. In this work, we have developed an FPGA-based fixed-point DNN system that uses only on-chip memory, so that it does not access external DRAM. The execution time and energy consumption of the developed system are compared with those of a GPU-based implementation. Since the capacity of memory in the FPGA is limited, only 3-bit weights are used for this implementation, and training-based fixed-point weight optimization is employed. The implementation using a Xilinx XC7Z045 is tested on the MNIST handwritten digit recognition benchmark and a phoneme recognition task on the TIMIT corpus. The obtained speed is about one quarter of that of a GPU-based implementation and much better than that of a PC-based one. The power consumption is less than 5 W at full-speed operation, resulting in much higher efficiency compared to GPU-based systems.
https://arxiv.org/abs/1602.01616
The effectiveness of long short-term memory (LSTM) networks trained by backpropagation through time for stock price prediction is explored in this paper. A range of LSTM networks with different architectures are constructed, trained, and tested.
https://arxiv.org/abs/1603.07893
Recurrent neural networks have recently been used for learning to describe images using natural language. However, it has been observed that these models generalize poorly to scenes that were not observed during training, possibly depending too strongly on the statistics of the text in the training data. Here we propose to describe images using short structured representations, aiming to capture the crux of a description. These structured representations allow us to tease out and evaluate separately two types of generalization: standard generalization to new images with similar scenes, and generalization to new combinations of known entities. We compare two learning approaches on the MS-COCO dataset: a state-of-the-art recurrent network based on an LSTM (Show, Attend and Tell), and a simple structured prediction model on top of a deep network. We find that the structured model generalizes to new compositions substantially better than the LSTM, achieving ~7 times the accuracy in predicting structured representations. By providing a concrete method to quantify generalization for unseen combinations, we argue that structured representations and compositional splits are a useful benchmark for image captioning, and advocate compositional models that capture linguistic and visual structure.
https://arxiv.org/abs/1608.07639
The coalescence in dense arrays of spontaneously formed GaN nanowires proceeds by bundling: adjacent nanowires bend and merge at their top, thus reducing their surface energy at the expense of the elastic energy of bending. We give a theoretical description of the energetics of this bundling process. The bending energy is shown to be substantially reduced by the creation of dislocations at the coalescence joints. A comparison of experimental and calculated x-ray diffraction profiles from ensembles of bundled nanowires demonstrates that a large part of the bending energy is indeed relaxed by plastic deformation. The residual bending manifests itself by extended tails of the diffraction profiles.
https://arxiv.org/abs/1608.07420
Since the technological breakthrough prompted by the inception of light emitting diodes based on III-nitrides, these material systems have emerged as strategic semiconductors not only for the lighting of the future, but also for the new generation of high-power electronic and spintronic devices. While III-nitride optoelectronics in the visible and ultraviolet spectral range is widely established, all-nitride and In-free efficient devices in the near-infrared (NIR) are still wanted. Here, through a comprehensive protocol of design, modeling, epitaxial growth and in-depth characterization, we develop Al$_{x}$Ga$_{1-x}$N:Mn/GaN NIR distributed Bragg reflectors and we show their efficiency in combination with GaN:(Mn,Mg) layers containing Mn-Mg$_{k}$ complexes optically active in the telecommunication range of wavelengths.
https://arxiv.org/abs/1608.07077
We present a systematic study of the influence of elastic strain relaxation on the built-in electrostatic potentials and the electronic properties of axial (In,Ga)N/GaN nanowire heterostructures. We employ and evaluate analytical and numerical approaches to compute strain and polarization potentials. These two ingredients then enter an eight-band k.p model to compute electron and hole ground states and energies. Our analysis reveals that for a sufficiently large ratio between the thickness of the (In,Ga)N disk and the diameter of the nanowire, the elastic relaxation leads to a significant reduction of the built-in electrostatic potential in comparison to a planar system of similar layer thickness and In content. However, a complete elimination of the built-in potential cannot be achieved in axial nanowire heterostructures. Nevertheless, the reduction of the built-in electrostatic potential leads to a significant modification of the electron and hole energies. Our findings indicate that the range of accessible ground state transition energies in an axial (In,Ga)N/GaN nanowire heterostructure is limited due to the reduced influence of polarization potentials for thicker disks. Additionally, we find that strain and polarization potentials induce complex confinement features of electrons and holes, which depend on the In content, shape, and dimensions of the heterostructure.
https://arxiv.org/abs/1608.07047
A three-dimensional (3D) Network-on-Chip (NoC) enables the design of high performance and low power many-core chips. Existing 3D NoCs are inadequate for meeting the ever-increasing performance requirements of many-core processors since they are simple extensions of regular 2D architectures and they do not fully exploit the advantages provided by 3D integration. Moreover, the anticipated performance gain of a 3D NoC-enabled many-core chip may be compromised due to the potential failures of through-silicon-vias (TSVs) that are predominantly used as vertical interconnects in a 3D IC. To address these problems, we propose a machine-learning-inspired predictive design methodology for energy-efficient and reliable many-core architectures enabled by 3D integration. We demonstrate that a small-world network-based 3D NoC (3D SWNoC) performs significantly better than its 3D MESH-based counterparts. On average, the 3D SWNoC shows 35% energy-delay-product (EDP) improvement over 3D MESH for the PARSEC and SPLASH2 benchmarks considered in this work. To improve the reliability of 3D NoC, we propose a computationally efficient spare-vertical link (sVL) allocation algorithm based on a state-space search formulation. Our results show that the proposed sVL allocation algorithm can significantly improve the reliability as well as the lifetime of 3D SWNoC.
https://arxiv.org/abs/1608.06972
We present an attention-based model for end-to-end handwriting recognition. Our system does not require any segmentation of the input paragraph. The model is inspired by the differentiable attention models presented recently for speech recognition, image captioning or translation. The main difference is the covert and overt attention, implemented as a multi-dimensional LSTM network. Our principal contribution towards handwriting recognition lies in automatic transcription without a prior segmentation into lines, which was crucial in previous approaches. To the best of our knowledge this is the first successful attempt at end-to-end multi-line handwriting recognition. We carried out experiments on the well-known IAM Database. The results are encouraging and raise hopes of performing full paragraph transcription in the near future.
https://arxiv.org/abs/1604.03286
Video object detection is challenging because objects that are easily detected in one frame may be difficult to detect in another frame within the same clip. Recently, there have been major advances for doing object detection in a single image. These methods typically contain three phases: (i) object proposal generation (ii) object classification and (iii) post-processing. We propose a modification of the post-processing phase that uses high-scoring object detections from nearby frames to boost scores of weaker detections within the same clip. We show that our method obtains superior results to state-of-the-art single image object detection techniques. Our method placed 3rd in the video object detection (VID) task of the ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC2015).
https://arxiv.org/abs/1602.08465
Rank minimization can be converted into tractable surrogate problems, such as Nuclear Norm Minimization (NNM) and Weighted NNM (WNNM). The problems related to NNM, or WNNM, can be solved iteratively by applying a closed-form proximal operator, called Singular Value Thresholding (SVT), or Weighted SVT, but they suffer from the high computational cost of Singular Value Decomposition (SVD) at each iteration. We propose a fast and accurate approximation method for SVT, which we call fast randomized SVT (FRSVT), with which we avoid direct computation of the SVD. The key idea is to extract an approximate basis for the range of the matrix from its compressed matrix. Given the basis, we compute partial singular values of the original matrix from the small factored matrix. In addition, by developing a range propagation method, our method further speeds up the extraction of the approximate basis at each iteration. Our theoretical analysis shows the relationship between the approximation bound of the SVD and its effect on NNM via SVT. Along with the analysis, our empirical results quantitatively and qualitatively show that our approximation rarely harms the convergence of the host algorithms. We assess the efficiency and accuracy of the proposed method on various computer vision problems, e.g., subspace clustering, weather artifact removal, and simultaneous multi-image alignment and rectification.
http://arxiv.org/abs/1509.00296
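For reference, the exact SVT operator that the abstract above approximates is usually written as follows. The notation is assumed here rather than taken from the abstract: $Y$ is the input matrix, $\sigma_i$ are its singular values, and $\tau \ge 0$ is the threshold.

```latex
Y = U \,\mathrm{diag}(\sigma_1, \dots, \sigma_r)\, V^{\top},
\qquad
\mathcal{D}_{\tau}(Y) = U \,\mathrm{diag}\big(\max(\sigma_i - \tau,\, 0)\big)\, V^{\top},
\qquad
\mathcal{D}_{\tau}(Y) = \arg\min_{X} \; \tau \lVert X \rVert_{*} + \tfrac{1}{2} \lVert X - Y \rVert_{F}^{2}
```

The last identity is what makes SVT the proximal operator of the nuclear norm. Only singular values above $\tau$ survive, so a partial SVD is enough in principle, which is consistent with the abstract's emphasis on partial singular values.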
Gaze reflects how humans process visual scenes and is therefore increasingly used in computer vision systems. Previous works demonstrated the potential of gaze for object-centric tasks, such as object localization and recognition, but it remains unclear if gaze can also be beneficial for scene-centric tasks, such as image captioning. We present a new perspective on gaze-assisted image captioning by studying the interplay between human gaze and the attention mechanism of deep neural networks. Using a public large-scale gaze dataset, we first assess the relationship between state-of-the-art object and scene recognition models, bottom-up visual saliency, and human gaze. We then propose a novel split attention model for image captioning. Our model integrates human gaze information into an attention-based long short-term memory architecture, and allows the algorithm to allocate attention selectively to both fixated and non-fixated image regions. Through evaluation on the COCO/SALICON datasets we show that our method improves image captioning performance and that gaze can complement machine attention for semantic scene understanding tasks.
https://arxiv.org/abs/1608.05203
Most existing detection pipelines treat object proposals independently and predict bounding box locations and classification scores over them separately. However, the important semantic and spatial layout correlations among proposals, which are useful for more accurate object detection, are often ignored. In this work, we propose a new EM-like group recursive learning approach to iteratively refine object proposals by incorporating such context of surrounding proposals and provide an optimal spatial configuration of object detections. In addition, we propose to incorporate weakly-supervised object segmentation cues and region-based object detection into a multi-stage architecture in order to fully exploit the learned segmentation features for better object detection in an end-to-end way. The proposed architecture consists of three cascaded networks which respectively learn to perform weakly-supervised object segmentation, object proposal generation and recursive detection refinement. Combining the group recursive learning and the multi-stage architecture provides competitive mAPs of 78.6% and 74.9% on the PASCAL VOC2007 and VOC2012 datasets respectively, which significantly outperforms many well-established baselines [10] [20].
https://arxiv.org/abs/1608.05159
Most object detectors contain two important components: a feature extractor and an object classifier. The feature extractor has rapidly evolved with significant research efforts leading to better deep convolutional architectures. The object classifier, however, has not received much attention and many recent systems (like SPPnet and Fast/Faster R-CNN) use simple multi-layer perceptrons. This paper demonstrates that carefully designing deep networks for object classification is just as important. We experiment with region-wise classifier networks that use shared, region-independent convolutional features. We call them “Networks on Convolutional feature maps” (NoCs). We discover that aside from deep feature maps, a deep and convolutional per-region classifier is of particular importance for object detection, whereas latest superior image classification models (such as ResNets and GoogLeNets) do not directly lead to good detection accuracy without using such a per-region classifier. We show by experiments that despite the effective ResNets and Faster R-CNN systems, the design of NoCs is an essential element for the 1st-place winning entries in ImageNet and MS COCO challenges 2015.
https://arxiv.org/abs/1504.06066
We present our submission to the Microsoft Video to Language Challenge of generating short captions describing videos in the challenge dataset. Our model is based on the encoder–decoder pipeline, popular in image and video captioning systems. We propose to utilize two different kinds of video features, one to capture the video content in terms of objects and attributes, and the other to capture the motion and action information. Using these diverse features we train models specializing in two separate input sub-domains. We then train an evaluator model which is used to pick the best caption from the pool of candidates generated by these domain expert models. We argue that this approach is better suited for the current video captioning task, compared to using a single model, due to the diversity in the dataset. Efficacy of our method is proven by the fact that it was rated best in MSR Video to Language Challenge, as per human evaluation. Additionally, we were ranked second in the automatic evaluation metrics based table.
https://arxiv.org/abs/1608.04959
Previous studies have proposed image-based clutter measures that correlate with human search times and/or eye movements. However, most models do not take into account the fact that the effects of clutter interact with the foveated nature of the human visual system: visual clutter further from the fovea has an increasingly detrimental influence on perception. Here, we introduce a new foveated clutter model to predict the detrimental effects in target search utilizing a forced fixation search task. We use Feature Congestion (Rosenholtz et al.) as our non-foveated clutter model, and we stack a peripheral architecture on top of Feature Congestion for our foveated model. We introduce the Peripheral Integration Feature Congestion (PIFC) coefficient as a fundamental ingredient of our model that modulates clutter as a non-linear gain contingent on eccentricity. We finally show that Foveated Feature Congestion (FFC) clutter scores (r(44) = -0.82) correlate better with target detection (hit rate) than regular Feature Congestion (r(44) = -0.19) in forced fixation search. Thus, our model allows us to enrich clutter perception research by computing fixation-specific clutter maps. A toolbox for creating peripheral architectures, Piranhas (Peripheral Architectures for Natural, Hybrid and Artificial Systems), will be made available.
https://arxiv.org/abs/1608.04042
Lifelogging cameras capture everyday life from a first-person perspective, but generate so much data that it is hard for users to browse and organize their image collections effectively. In this paper, we propose to use automatic image captioning algorithms to generate textual representations of these collections. We develop and explore novel techniques based on deep learning to generate captions for both individual images and image streams, using temporal consistency constraints to create summaries that are both more compact and less noisy. We evaluate our techniques with quantitative and qualitative results, and apply captioning to an image retrieval application for finding potentially private images. Our results suggest that our automatic captioning algorithms, while imperfect, may work well enough to help users manage lifelogging photo collections.
https://arxiv.org/abs/1608.03819
This work is in the field of video surveillance, including motion detection. Video surveillance is one of the essential techniques for automatic video analysis, used to extract crucial information or relevant scenes in video surveillance systems. The aim of our work is to propose solutions for the automatic detection of moving objects in real time with a surveillance camera. The detected objects are objects that have some geometric shape (circle, ellipse, square, or rectangle).
https://arxiv.org/abs/1608.03617
This paper focuses on answering fill-in-the-blank style multiple choice questions from the Visual Madlibs dataset. Previous approaches to Visual Question Answering (VQA) have mainly used generic image features from networks trained on the ImageNet dataset, despite the wide scope of questions. In contrast, our approach employs features derived from networks trained for specialized tasks of scene classification, person activity prediction, and person and object attribute prediction. We also present a method for selecting sub-regions of an image that are relevant for evaluating the appropriateness of a putative answer. Visual features are computed both from the whole image and from local regions, while sentences are mapped to a common space using a simple normalized canonical correlation analysis (CCA) model. Our results show a significant improvement over the previous state of the art, and indicate that answering different question types benefits from examining a variety of image cues and carefully choosing informative image sub-regions.
https://arxiv.org/abs/1608.03410
This study introduced a novel system, called Gaze2Segment, integrating biological and computer vision techniques to support radiologists’ reading experience with an automatic image segmentation task. During diagnostic assessment of lung CT scans, the radiologists’ gaze information was used to create a visual attention map. This map was then combined with a computer-derived saliency map, extracted from the gray-scale CT images. The visual attention map was used as an input for roughly indicating the location of an object of interest. With computer-derived saliency information, on the other hand, we aimed at finding foreground and background cues for the object of interest. At the final step, these cues were used to initiate a seed-based delineation process. The segmentation accuracy of the proposed Gaze2Segment was found to be 86% in terms of the Dice similarity coefficient and 1.45 mm in terms of the Hausdorff distance. To the best of our knowledge, Gaze2Segment is the first true integration of eye-tracking technology into a medical image segmentation task without the need for any further user interaction.
https://arxiv.org/abs/1608.03235
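As a reminder of what the Dice similarity coefficient reported for Gaze2Segment measures, here is a minimal sketch using the standard definition 2|A ∩ B| / (|A| + |B|). The function name, the set-of-pixels representation, and the toy masks are illustrative assumptions, not taken from the paper.

```python
def dice_coefficient(predicted, reference):
    """Dice similarity coefficient between two sets of foreground pixel coordinates."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    overlap = len(predicted & reference)
    return 2.0 * overlap / (len(predicted) + len(reference))


# Hypothetical toy masks, given as (row, column) pixel coordinates.
predicted_mask = {(0, 0), (0, 1), (1, 0), (1, 1)}
reference_mask = {(0, 1), (1, 1), (2, 1)}
score = dice_coefficient(predicted_mask, reference_mask)
print(f"Dice = {score:.3f}")
```

A value of 1 means the segmentation and the reference overlap exactly, and 0 means they share no pixels.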
A k nearest neighbor (kNN) query on road networks retrieves the k closest points of interest (POIs) by their network distances from a given location. Today, in the era of ubiquitous mobile computing, this is a highly pertinent query. While Euclidean distance has been used as a heuristic to search for the closest POIs by their road network distance, its efficacy has not been thoroughly investigated. The most recent methods have shown significant improvement in query performance. Earlier studies, which proposed disk-based indexes, were compared to the current state-of-the-art in main memory. However, recent studies have shown that main memory comparisons can be challenging and require careful adaptation. This paper presents an extensive experimental investigation in main memory to settle these and several other issues. We use efficient and fair memory-resident implementations of each method to reproduce past experiments and conduct additional comparisons for several overlooked evaluations. Notably we revisit a previously discarded technique (IER) showing that, through a simple improvement, it is often the best performing technique.
https://arxiv.org/abs/1601.01549
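The IER technique named in the kNN abstract above is generally understood to stand for Incremental Euclidean Restriction: candidates are visited in order of Euclidean distance, their network distances are computed, and the search stops once the next Euclidean distance can no longer beat the current k-th best network distance. The following is a minimal sketch under the assumption that every edge weight is at least the Euclidean length of the edge. It uses plain Dijkstra for network distances and does not reproduce the improvement studied in the paper. The graph, coordinates, and function names are hypothetical.

```python
import heapq
import math


def network_distance(graph, source, target):
    """Dijkstra shortest-path distance on a weighted adjacency dict."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, weight in graph[node]:
            candidate = dist + weight
            if candidate < best.get(neighbor, math.inf):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return math.inf


def ier_knn(graph, coords, pois, query, k):
    """Return the k POIs nearest to `query` by network distance as (distance, poi) pairs."""
    by_euclid = sorted(pois, key=lambda p: math.dist(coords[query], coords[p]))
    results = []  # max-heap via negated distances, holds the current best k
    for poi in by_euclid:
        lower_bound = math.dist(coords[query], coords[poi])
        if len(results) == k and lower_bound >= -results[0][0]:
            break  # no remaining candidate can beat the current k-th best
        dist = network_distance(graph, query, poi)
        if len(results) < k:
            heapq.heappush(results, (-dist, poi))
        elif dist < -results[0][0]:
            heapq.heapreplace(results, (-dist, poi))
    return sorted((-neg, poi) for neg, poi in results)


# Hypothetical toy road network: "x" is close as the crow flies but far by road.
coords = {"q": (0, 0), "m": (1, 0), "y": (2, 0), "n": (2, 1),
          "p": (1, 1), "x": (0, 1), "z": (4, 0)}
graph = {
    "q": [("m", 1.0)],
    "m": [("q", 1.0), ("y", 1.0)],
    "y": [("m", 1.0), ("n", 1.0), ("z", 2.0)],
    "n": [("y", 1.0), ("p", 1.0)],
    "p": [("n", 1.0), ("x", 1.0)],
    "x": [("p", 1.0)],
    "z": [("y", 2.0)],
}
nearest_one = ier_knn(graph, coords, ["x", "y", "z"], "q", k=1)
nearest_two = ier_knn(graph, coords, ["x", "y", "z"], "q", k=2)
print(nearest_one, nearest_two)
```

In this toy network the Euclidean-nearest POI is the farthest by road, which is the situation where the quality of the Euclidean heuristic matters.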
We present an approach for object segmentation in videos that combines frame-level object detection with concepts from object tracking and motion segmentation. The approach extracts temporally consistent object tubes based on an off-the-shelf detector. Besides the class label for each tube, this provides a location prior that is independent of motion. For the final video segmentation, we combine this information with motion cues. The method overcomes the typical problems of weakly supervised/unsupervised video segmentation, such as scenes with no motion, dominant camera motion, and objects that move as a unit. In contrast to most tracking methods, it provides an accurate, temporally consistent segmentation of each object. We report results on four video segmentation datasets: YouTube Objects, SegTrackv2, egoMotion, and FBMS.
https://arxiv.org/abs/1608.03066
Attention-based Neural Machine Translation (NMT) models suffer from attention deficiency issues as has been observed in recent research. We propose a novel mechanism to address some of these limitations and improve the NMT attention. Specifically, our approach memorizes the alignments temporally (within each sentence) and modulates the attention with the accumulated temporal memory, as the decoder generates the candidate translation. We compare our approach against the baseline NMT model and two other related approaches that address this issue either explicitly or implicitly. Large-scale experiments on two language pairs show that our approach achieves better and robust gains over the baseline and related NMT approaches. Our model further outperforms strong SMT baselines in some settings even without using ensembles.
https://arxiv.org/abs/1608.02927
The recent COCO object detection dataset presents several new challenges for object detection. In particular, it contains objects at a broad range of scales, less prototypical images, and requires more precise localization. To address these challenges, we test three modifications to the standard Fast R-CNN object detector: (1) skip connections that give the detector access to features at multiple network layers, (2) a foveal structure to exploit object context at multiple object resolutions, and (3) an integral loss function and corresponding network adjustment that improve localization. The result of these modifications is that information can flow along multiple paths in our network, including through features from multiple network layers and from multiple object views. We refer to our modified classifier as a “MultiPath” network. We couple our MultiPath network with DeepMask object proposals, which are well suited for localization and small objects, and adapt our pipeline to predict segmentation masks in addition to bounding boxes. The combined system improves results over the baseline Fast R-CNN detector with Selective Search by 66% overall and by 4x on small objects. It placed second in both the COCO 2015 detection and segmentation challenges.
https://arxiv.org/abs/1604.02135
The attention mechanism has enhanced state-of-the-art Neural Machine Translation (NMT) by jointly learning to align and translate. It tends to ignore past alignment information, however, which often leads to over-translation and under-translation. To address this problem, we propose coverage-based NMT in this paper. We maintain a coverage vector to keep track of the attention history. The coverage vector is fed to the attention model to help adjust future attention, which lets the NMT system pay more attention to untranslated source words. Experiments show that the proposed approach significantly improves both translation quality and alignment quality over standard attention-based NMT.
https://arxiv.org/abs/1601.04811
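The idea of a coverage vector that tracks attention history, as described in the abstract above, can be illustrated with a deliberately simplified sketch: coverage is taken to be the running sum of attention weights per source word. This is an assumption made for illustration. The abstract says the coverage vector is fed back into the attention model, and that part is not reproduced here. All names and numbers are hypothetical.

```python
def accumulate_coverage(attention_steps):
    """Sum attention weights per source word across decoding steps."""
    if not attention_steps:
        return []
    coverage = [0.0] * len(attention_steps[0])
    for weights in attention_steps:
        for i, w in enumerate(weights):
            coverage[i] += w
    return coverage


# Hypothetical attention weights: one row per decoding step, one column per source word.
attention_steps = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
]
coverage = accumulate_coverage(attention_steps)

# The source word with the lowest accumulated attention is the best candidate
# for having been left untranslated so far.
least_covered = min(range(len(coverage)), key=coverage.__getitem__)
print(coverage, least_covered)
```

In this toy example the third source word has received little attention, which is the kind of signal a coverage-aware attention model could use to avoid under-translation.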
Change detection, or anomaly detection, from street-view images acquired by an autonomous robot at multiple different times is a major problem in robotic mapping and autonomous driving. Formulation as an image comparison task, which operates on a given pair of query and reference images, is common to many existing approaches to this problem. Unfortunately, providing relevant reference images is not straightforward. In this paper, we propose a novel formulation for change detection, termed compressive change retrieval, which can operate on a query image and similar reference images retrieved from the web. Compared to previous formulations, there are two sources of difficulty. First, the retrieved reference images may frequently contain non-relevant reference images, because even state-of-the-art place-recognition techniques suffer from retrieval noise. Second, image comparison needs to be conducted in a compressed domain to minimize the storage cost of large collections of street-view images. To address the above issues, we also present a practical change detection algorithm that uses a compressed bag-of-words (BoW) image representation as a scalable solution. The results of experiments conducted on a practical change detection task, “moving object detection (MOD),” using the publicly available Malaga dataset validate the effectiveness of the proposed approach.
https://arxiv.org/abs/1608.02051
Multiview-assisted learning has gained significant attention in recent years in supervised learning. The availability of high-performance computing devices enables learning algorithms to search simultaneously over multiple views or feature spaces to obtain optimum classification performance. This paper is a pioneering attempt at formulating a mathematical foundation for a multiview-aided collaborative boosting architecture for multiclass classification. Most present algorithms apply multiview learning heuristically without exploring the fundamental mathematical changes imposed on traditional boosting. Also, most of the algorithms are restricted to the two-class or two-view setting. Our proposed mathematical framework enables collaborative boosting across any finite-dimensional view spaces for multiclass learning. The boosting framework is based on a forward stagewise additive model which minimizes a novel exponential loss function. We show that the exponential loss function essentially captures the difficulty of a training sample space instead of the traditional ‘1/0’ loss. The new algorithm restricts a weak view from over-learning, thereby preventing overfitting. The model is inspired by our earlier attempt at collaborative boosting, which was devoid of mathematical justification. The proposed algorithm is shown to converge much nearer to the global minimum in the exponential loss space and thus supersedes our previous algorithm. The paper also presents analytical and numerical analysis of convergence and margin bounds for multiview boosting algorithms, and we show that our proposed ensemble learning exhibits a lower error bound and a higher margin compared to our previous model. The proposed model is also compared with traditional boosting and recent multiview boosting algorithms.
https://arxiv.org/abs/1608.01874
In present object detection systems, deep convolutional neural networks (CNNs) are utilized to predict bounding boxes of object candidates, and have gained performance advantages over traditional region proposal methods. However, existing deep CNN methods assume the object bounds to be four independent variables, which can be regressed separately by the $\ell_2$ loss. Such an oversimplified assumption is contrary to the well-received observation that those variables are correlated, and it results in less accurate localization. To address the issue, we first introduce a novel Intersection over Union ($IoU$) loss function for bounding box prediction, which regresses the four bounds of a predicted box as a whole unit. By taking advantage of the $IoU$ loss and deep fully convolutional networks, we introduce UnitBox, which performs accurate and efficient localization, is robust to objects of varied shapes and scales, and converges fast. We apply UnitBox to the face detection task and achieve the best performance among all published methods on the FDDB benchmark.
https://arxiv.org/abs/1608.01471
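To make the contrast with per-coordinate $\ell_2$ regression concrete, here is a minimal sketch of an IoU-based loss. It assumes boxes are given as corner tuples `(x1, y1, x2, y2)` and uses the negative log of the IoU as the loss; the corner representation, the function names and the `eps` guard are assumptions for illustration and are not taken from the abstract.

```python
import math


def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred, target, eps=1e-9):
    """Negative log IoU: all four bounds contribute through a single term."""
    return -math.log(iou(pred, target) + eps)
```

Because the loss depends on the overlap area, moving any one bound changes the gradient seen by the other three, which is the coupling that an independent $\ell_2$ loss on each bound lacks.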
This paper discusses the technical challenges in maritime image processing and machine vision problems for video streams generated by cameras. Even well-documented problems such as horizon detection and registration of frames in a video are very challenging in maritime scenarios. More advanced problems, such as background subtraction and object detection in video streams, are also very challenging. The dynamic nature of the background, the unavailability of static cues, the presence of small objects at distant backgrounds, and illumination effects all contribute to the challenges discussed here.
https://arxiv.org/abs/1608.01079
Entity search is a new application that meets either precise or vague requirements from search engine users. The Baidu Cup 2016 Challenge provided a chance to tackle the entity search problem. We achieved first place with the average MAP scores on 4 tasks: movie, tvShow, celebrity and restaurant. In this paper, we propose a series of similarity features based on both word frequency features and word semantic features, and describe our ranking architecture and experiment details.
https://arxiv.org/abs/1608.01068
While data has certainly taken center stage in computer vision in recent years, it can still be difficult to obtain in certain scenarios. In particular, acquiring ground truth 3D shapes of objects pictured in 2D images remains a challenging feat, and this has hampered progress in recognition-based object reconstruction from a single image. Here we propose to bypass previous solutions such as 3D scanning or manual design, which scale poorly, and instead populate object category detection datasets semi-automatically with dense, per-object 3D reconstructions, bootstrapped from: (i) class labels, (ii) ground truth figure-ground segmentations and (iii) a small set of keypoint annotations. Our proposed algorithm first estimates camera viewpoint using rigid structure-from-motion and then reconstructs object shapes by optimizing over visual hull proposals guided by loose within-class shape similarity assumptions. The visual hull sampling process attempts to intersect an object’s projection cone with the cones of minimal subsets of other similar objects among those pictured from certain vantage points. We show that our method is able to produce convincing per-object 3D reconstructions and to accurately estimate camera viewpoints on one of the most challenging existing object-category detection datasets, PASCAL VOC. We hope that our results will re-stimulate interest in joint object recognition and 3D reconstruction from a single image.
https://arxiv.org/abs/1503.06465
The avalanche photodiode (APD) has been intensively investigated as a promising candidate to replace photomultiplier tubes (PMTs) for weak light detection. However, in conventional APDs, a large portion of the carrier energy drawn from the electric field is thermalized, and the multiplication efficiencies of electrons and holes are low and close to each other. To achieve high gain, the device must work under breakdown bias, where carrier multiplication proceeds bi-directionally to form a positive-feedback multiplication cycle. However, breakdown is hard to control; in practice, APDs must work in Geiger mode as a compromise between sustainable detection and high gain. The resulting system complexity seriously restricts applications. Here, we demonstrate an avalanche photodiode that holds high gain without breakdown, which means no quenching circuit is needed for sustainable detection. The device is based on a GaN/AlN periodically-stacked structure (PSS), wherein electrons draw energy from the electric field much more efficiently than holes, and the avalanche happens uni-directionally with high efficiency. A record-high gain (10^4) under constant bias is obtained in a prototype device, wherein the stable gain can be determined by the periodicity of the GaN/AlN PSS. This work not only sheds new light on the avalanche multiplication mechanism, but also paves a technological path with high commercial value toward highly sensitive avalanche devices working under constant bias, like PMTs.
https://arxiv.org/abs/1608.00561
There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs, coined SPICE. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR). Furthermore, SPICE can answer questions such as ‘which caption-generator best understands colors?’ and ‘can caption-generators count?’
https://arxiv.org/abs/1607.08822
The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers a researcher’s flexibility to study different machine learning algorithms, forcing them to either use a less desirable network architecture or parallelize the processing across multiple GPUs. We propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can simultaneously be utilized for training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a significant reduction in the memory requirements of DNNs. Similar experiments on VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the memory efficiency of our proposal. vDNN enables VGG-16 with batch size 256 (requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card containing 12 GB of memory, with 18% performance loss compared to a hypothetical, oracular GPU with enough memory to hold the entire DNN.
https://arxiv.org/abs/1602.08124
The millimeter wave (mmWave) bands have recently attracted considerable interest for next-generation cellular systems due to the massive available bandwidths at these frequencies. However, a key challenge in designing mmWave cellular systems is initial access – the procedure by which a mobile establishes an initial link-layer connection to a base station cell. MmWave communication relies on highly directional transmissions and the initial access procedure must thus provide a mechanism by which initial transmission directions can be searched in a potentially large angular space. Design options are compared considering different scanning and signaling procedures to evaluate access delay and system overhead. The channel structure and multiple access issues are also considered. The analysis demonstrates significant benefits of low-resolution fully digital architectures in comparison to single stream analog beamforming.
https://arxiv.org/abs/1511.06483
Using hybrid density functional theory, we address point defects liable to cause charge compensation upon Mg doping of GaN. We determine the free energy of formation of the nitrogen vacancy and of several Mg-related defects. The entropic contribution as a function of temperature is determined within the quasiharmonic approximation. We find that the Mg interstitial shows a noticeably lower free energy of formation than the Mg substitutional to Ga in p-type conditions. Therefore, the Mg impurity is amphoteric, behaving like an acceptor when substitutional to Ga and like a double donor when accommodated in an interstitial position. The hybrid-functional results are then linked to experimental observations by solving the charge neutrality equations for a semiconductor dominated by impurities. We show that a thermodynamic equilibrium model is unable to account for the experimental hole concentration as a function of Mg doping density, due to nitrogen vacancies and Mg interstitials acting as compensating donors. To explain the experimental result, which includes a dropoff of the hole concentration at high Mg densities, we thus resort to nonequilibrium models. We show that either nitrogen vacancies or Mg interstitials could be at the origin of the self-compensation mechanism. However, only the model based on interstitial Mg donors provides a natural mechanism to account for the sudden appearance of self-compensation. Indeed, the amphoteric nature of the Mg impurity leads to Fermi-level pinning and accounts for the observed dropoff of the hole concentration of GaN samples at high Mg doping. Our work suggests that current limitations in p-type doping of GaN could be overcome by extrinsically controlling the Fermi energy during growth.
https://arxiv.org/abs/1607.08353
This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two recent datasets with hundreds of labels each. We use multiple instance learning to handle the lack of supervision on the level of individual person instances, and weighted loss to handle unbalanced training data. Further, we show how specialized features trained on these datasets can be used to improve accuracy on the Visual Question Answering (VQA) task, in the form of multiple choice fill-in-the-blank questions (Visual Madlibs). Specifically, we tackle two types of questions on person activity and person-object relationship and show improvements over generic features trained on the ImageNet classification task.
https://arxiv.org/abs/1604.04808
In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and linking them to corresponding entity keys in a knowledge base. More specifically, we propose a benchmark task to recognize one million celebrities from their face images, using all the face images of each individual that can be collected on the web as training data. The rich information provided by the knowledge base helps to conduct disambiguation and improve the recognition accuracy, and contributes to various real-world applications, such as image captioning and news video analysis. Associated with this task, we design and provide a concrete measurement set, an evaluation protocol, and training data. We also present our experiment setup in detail and report promising baseline results. Our benchmark task could lead to one of the largest classification problems in computer vision. To the best of our knowledge, our training dataset, which contains 10M images in version 1, is the largest publicly available one in the world.
https://arxiv.org/abs/1607.08221
Computing $k$-Nearest Neighbors (KNN) is one of the core kernels used in many machine learning, data mining and scientific computing applications. Although kd-tree based $O(\log n)$ algorithms have been proposed for computing KNN, linear algorithms are used in practice because of the inherent sequentiality of the kd-tree approach. This limits the applicability of such methods to millions of data points, with limited scalability for Big Data analytics challenges in the scientific domain. In this paper, we present parallel and highly optimized kd-tree based KNN algorithms (both construction and querying) suitable for distributed architectures. Our algorithm includes novel approaches for pruning the search space and improving load balancing and partitioning among nodes and threads. Using TB-sized datasets from three science applications (astrophysics, plasma physics, and particle physics), we show that our implementation can construct a kd-tree of 189 billion particles in 48 seconds using $\sim$50,000 cores. We also demonstrate computation of KNN for 19 billion queries in 12 seconds. We demonstrate almost linear speedup for both shared and distributed memory computers. Our algorithms outperform earlier implementations by more than an order of magnitude, thereby radically improving the applicability of our implementation to state-of-the-art Big Data analytics problems. In addition, we showcase performance and scalability on the recently released Intel Xeon Phi processor, showing that our algorithm scales well even on massively parallel architectures.
https://arxiv.org/abs/1607.08220
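For readers unfamiliar with kd-tree based KNN, the following is a minimal single-threaded sketch of construction and querying with search-space pruning. It is not the parallel, distributed implementation the abstract describes; the dictionary-based node layout and the function names are assumptions chosen for brevity.

```python
import heapq


def build_kdtree(points, depth=0):
    """Recursively build a kd-tree by splitting on the median along one axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def knn(node, query, k, heap=None):
    """Collect the k nearest points in a max-heap keyed on negated squared distance."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    point, axis = node["point"], node["axis"]
    dist2 = sum((a - b) ** 2 for a, b in zip(point, query))
    if len(heap) < k:
        heapq.heappush(heap, (-dist2, point))
    elif dist2 < -heap[0][0]:
        heapq.heapreplace(heap, (-dist2, point))
    diff = query[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    knn(near, query, k, heap)
    # Pruning: visit the far side only if the splitting plane is closer
    # than the current k-th best distance (or the heap is not yet full).
    if len(heap) < k or diff * diff < -heap[0][0]:
        knn(far, query, k, heap)
    return heap


def nearest(tree, query, k):
    """Return the k nearest points, closest first."""
    return [p for _, p in sorted(knn(tree, query, k), reverse=True)]
```

A usage example: `nearest(build_kdtree(points), (9, 2), 2)` where `points` is a list of equal-length tuples. The pruning test in `knn` is what lets a query skip whole subtrees instead of scanning every point.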
Behavior planning is known to be one of the basic cognitive functions, and it is essential for any cognitive architecture of any control system used in robotics. At the same time, most of the widespread planning algorithms employed in those systems are developed using only approaches and models of Artificial Intelligence and do not take into account the numerous results of cognitive experiments. As a result, there is a strong need for novel methods of behavior planning suitable for modern cognitive architectures aimed at robot control. One such method is presented in this work and is studied within a special class of navigation tasks called the smart relocation task. The method is based on a hierarchical two-level model of abstraction and knowledge representation, i.e. symbolic and subsymbolic. On the symbolic level, a sign world model is used for knowledge representation and a hierarchical planning algorithm, PMA, is utilized for planning. On the subsymbolic level, the task of path planning is considered and solved as a graph search problem. Interaction between the two planners is examined, and inter-level interfaces and feedback loops are described. Preliminary experimental results are presented.
https://arxiv.org/abs/1607.08181
We report 1212 radial-velocity (RV) measurements obtained in the years 2009-2013 using an iodine cell for the spectroscopic binary nu Octantis (K1III/IV). This system (a_bin~2.6 au, P~1050 days) is conjectured to have a Jovian planet with a semi-major axis half that of the binary host. The extreme geometry only permits long-term stability if the planet is in a retrograde orbit. Whilst the reality of the planet (P~415 days) remains uncertain, other scenarios (stellar variability or apsidal motion caused by a yet unobserved third star) continue to appear substantially less credible based on CCF bisectors, line-depth ratios and many other independent details. If this evidence is validated but the planet is disproved, the claims of other planets using RVs will be seriously challenged. We also describe a significant revision to the previously published RVs and the full set of 1437 RVs now encompasses nearly 13 years. The sensitive orbital dynamics allow us to constrain the three-dimensional architecture with a broad prior probability distribution on the mutual inclination, which with posterior samples obtained from an N-body Markov chain Monte Carlo is found to be 158.4 +/- 1.2 deg. None of these samples are dynamically stable beyond 1 Myr. However, a grid search around the best-fitting solution finds a region that has many models stable for 10 Myr, and includes one model within 1-sigma that is stable for at least 100 Myr. The planet’s exceptional nature demands robust independent verification and makes the theoretical understanding of its formation a worthy challenge.
https://arxiv.org/abs/1605.06720
Enterprise-level software is implemented using a multi-layer architecture. These layers are often implemented using de-coupled solutions with millions of lines of code. Programmers often have to track and debug a function call from the user interface layer to the data access layer while troubleshooting an issue. They have to inspect the code based on search results or use design documents to construct the call graph. This process is time-consuming and laborious. The development environment tools are insufficient or confined to analyzing only the code in the loaded solution. This paper proposes a method to construct a call graph of a call across several layers of code residing in different code bases, to help programmers better understand the design and architecture of the software. The signatures of classes, methods, and properties were evaluated and then matched against the code files. A graph of matching functions was created. The recursive search stopped when there were no matches or the data layer code was detected. The method achieved 78.26% accuracy when compared with manual search.
https://arxiv.org/abs/1610.04594
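The recursive search and its two stopping conditions can be sketched as follows. This sketch assumes the signature matching has already produced a mapping from each function to the functions it calls; the `calls` dictionary, the `data_layer` set and all function names in the usage example are hypothetical and do not come from the paper.

```python
def build_call_graph(entry, calls, data_layer, graph=None):
    """Follow calls from `entry` until no match exists or the data layer is hit.

    calls: hypothetical mapping {function name: [callee names]} standing in
           for the result of matching signatures against code files.
    data_layer: set of function names known to belong to the data access layer.
    """
    if graph is None:
        graph = {}
    if entry in graph:
        # Already visited; this also guards against recursive call cycles.
        return graph
    if entry in data_layer:
        # Stopping condition: data layer code detected.
        graph[entry] = []
        return graph
    # Stopping condition: no matches means an empty callee list.
    callees = calls.get(entry, [])
    graph[entry] = list(callees)
    for callee in callees:
        build_call_graph(callee, calls, data_layer, graph)
    return graph
```

For example, with `calls = {"ui.save": ["service.save"], "service.save": ["dal.insert"]}` and `data_layer = {"dal.insert"}`, the graph starting at `"ui.save"` ends at `"dal.insert"` without descending further.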
This study aims to analyze the benefits of improved multi-scale reasoning for object detection and localization with deep convolutional neural networks. To that end, an efficient and general object detection framework which operates on scale volumes of a deep feature pyramid is proposed. In contrast to the proposed approach, most current state-of-the-art object detectors operate at a single scale in training, while testing involves independent evaluation across scales. One benefit of the proposed approach is better capture of multi-scale contextual information, resulting in significant gains in both detection performance and localization quality of objects on the PASCAL VOC dataset and a multi-view highway vehicles dataset. The joint detection and localization scale-specific models are shown to especially benefit detection of challenging object categories which exhibit large scale variation, as well as detection of small objects.
https://arxiv.org/abs/1505.03597
A unified deep neural network, denoted the multi-scale CNN (MS-CNN), is proposed for fast multi-scale object detection. The MS-CNN consists of a proposal sub-network and a detection sub-network. In the proposal sub-network, detection is performed at multiple output layers, so that receptive fields match objects of different scales. These complementary scale-specific detectors are combined to produce a strong multi-scale object detector. The unified network is learned end-to-end, by optimizing a multi-task loss. Feature upsampling by deconvolution is also explored, as an alternative to input upsampling, to reduce the memory and computation costs. State-of-the-art object detection performance, at up to 15 fps, is reported on datasets, such as KITTI and Caltech, containing a substantial number of small objects.
https://arxiv.org/abs/1607.07155
Neural machine translation (NMT) aims at solving machine translation (MT) problems using neural networks and has exhibited promising results in recent years. However, most of the existing NMT models are shallow and there is still a performance gap between a single NMT model and the best conventional MT system. In this work, we introduce a new type of linear connections, named fast-forward connections, based on deep Long Short-Term Memory (LSTM) networks, and an interleaved bi-directional architecture for stacking the LSTM layers. Fast-forward connections play an essential role in propagating the gradients and building a deep topology of depth 16. On the WMT’14 English-to-French task, we achieve BLEU=37.7 with a single attention model, which outperforms the corresponding single shallow model by 6.2 BLEU points. This is the first time that a single NMT model achieves state-of-the-art performance and outperforms the best conventional model by 0.7 BLEU points. We can still achieve BLEU=36.3 even without using an attention mechanism. After special handling of unknown words and model ensembling, we obtain the best score reported to date on this task with BLEU=40.4. Our models are also validated on the more difficult WMT’14 English-to-German task.
https://arxiv.org/abs/1606.04199
We develop a simple method to study the zero-point and thermally renormalized electron energy $\varepsilon_{\mathbf{k}n}(T)$ for $\mathbf{k}n$ the conduction band minimum or valence band maximum in polar semiconductors. We use the adiabatic approximation, including an imaginary broadening parameter $i\delta$ to suppress noise in the density-functional integrations. Fröhlich polaron methods provide analytic expressions for the contribution of the problematic optical phonon mode. We use this to correct the renormalization obtained from the adiabatic approximation. Test calculations are done for zincblende GaN for an 18x18x18 integration grid. The Fröhlich correction is of order -0.02 eV for the zero-point energy shift of the conduction band minimum, and +0.03 eV for the valence band maximum; the correction to renormalization of the 3.28 eV gap is -0.05 eV, a significant fraction of the total zero point renormalization of -0.15 eV.
https://arxiv.org/abs/1603.04269
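The gap correction quoted above follows from the two band-edge shifts, since the gap is the conduction band minimum energy minus the valence band maximum energy. The symbols $\Delta E_{\mathrm{CBM}}$, $\Delta E_{\mathrm{VBM}}$ and $\Delta E_{\mathrm{gap}}$ are labels introduced here for illustration and are not the paper's notation:

```latex
\begin{align}
\Delta E_{\mathrm{gap}}
  &= \Delta E_{\mathrm{CBM}} - \Delta E_{\mathrm{VBM}} \\
  &= (-0.02\ \mathrm{eV}) - (+0.03\ \mathrm{eV}) = -0.05\ \mathrm{eV}, \\
\frac{\Delta E_{\mathrm{gap}}}{\Delta E_{\mathrm{gap}}^{\mathrm{total}}}
  &= \frac{-0.05\ \mathrm{eV}}{-0.15\ \mathrm{eV}} = \frac{1}{3}.
\end{align}
```

Under these quoted values, the Fröhlich correction accounts for about one third of the total zero-point renormalization of the gap.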