In this paper we report on a comparative study of the low-temperature emission and polarisation properties of InGaN/GaN quantum wells (QWs) grown on nonpolar a-plane and m-plane free-standing bulk GaN substrates, where the In content varied from 0.14 to 0.28 in the m-plane series and from 0.08 to 0.21 in the a-plane series. The low-temperature photoluminescence spectra from both sets of samples are very broad, with the full width at half maximum increasing from 81 to 330 meV as the In fraction increases. Comparative photoluminescence excitation spectroscopy indicates that the recombination mainly involves strongly localised carriers. At a temperature of 10 K the degree of linear polarisation of the a-plane samples is much smaller than that of the m-plane counterparts and also varies across the spectrum. From polarisation-resolved photoluminescence excitation spectroscopy we measured the energy splitting between the lowest valence sub-band states to lie in the range of 23-54 meV for both a- and m-plane samples in which we could observe distinct exciton features in the polarised photoluminescence excitation spectra. Thus, the thermal occupation of a higher valence subband cannot be responsible for the reduction of the degree of linear polarisation. Time-resolved spectroscopy indicates that in a-plane samples there is an extra emission component which is at least partly responsible for the reduction in the degree of linear polarisation.
https://arxiv.org/abs/1612.06353
We present a novel architecture for sparse pattern processing, using flash storage with embedded accelerators. Sparse pattern processing on large data sets is the essence of applications such as document search, natural language processing, bioinformatics, subgraph matching, machine learning, and graph processing. One slice of our prototype accelerator is capable of handling up to 1TB of data, and experiments show that it can outperform C/C++ software solutions on a 16-core system at a fraction of the power and cost; an optimized version of the accelerator can match the performance of a 48-core server.
https://arxiv.org/abs/1611.03380
Sequence-to-Sequence (seq2seq) modeling has rapidly become an important general-purpose NLP tool that has proven effective for many text-generation and sequence-labeling tasks. Seq2seq builds on deep neural language modeling and inherits its remarkable accuracy in estimating local, next-word distributions. In this work, we introduce a model and beam-search training scheme, based on the work of Daume III and Marcu (2005), that extends seq2seq to learn global sequence scores. This structured approach avoids classical biases associated with local training and unifies the training loss with the test-time usage, while preserving the proven model architecture of seq2seq and its efficient training approach. We show that our system outperforms a highly-optimized attention-based seq2seq system and other baselines on three different sequence-to-sequence tasks: word ordering, parsing, and machine translation.
https://arxiv.org/abs/1606.02960
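The global sequence scoring above builds on standard beam-search decoding. As a point of reference, here is a minimal, self-contained sketch of beam search over a toy next-token scorer; the function names and the toy distribution are illustrative assumptions, not the paper's implementation.

```python
import math

def beam_search(start, step_scores, beam_size=2, length=3):
    """Generic beam search: keep the `beam_size` highest-scoring
    partial sequences at each step. `step_scores(prefix)` returns a
    dict {token: log_prob} for the next position given the prefix,
    a toy stand-in for a seq2seq decoder's local next-word
    distribution."""
    beams = [([start], 0.0)]
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for tok, logp in step_scores(prefix).items():
                candidates.append((prefix + [tok], score + logp))
        # prune to the top `beam_size` cumulative scores
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams

# Toy next-token distribution: always prefers 'a' over 'b'.
def toy_scores(prefix):
    return {"a": math.log(0.6), "b": math.log(0.4)}

best, best_score = beam_search("<s>", toy_scores, beam_size=2, length=2)[0]
```

The training scheme in the paper scores whole sequences on the beam rather than only local next-word probabilities; this sketch shows only the search procedure that such training wraps around.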
Observational astronomy in the time-domain era faces several new challenges. One of them is the efficient use of observations obtained at multiple epochs. The work presented here addresses faint object detection with multi-epoch data, and describes an incremental strategy for separating real objects from artifacts in ongoing surveys, in situations where the single-epoch data are summaries of the full image data, such as single-epoch catalogs of flux and direction estimates for candidate sources. The basic idea is to produce low-threshold single-epoch catalogs, and use a probabilistic approach to accumulate catalog information across epochs; this is in contrast to more conventional strategies based on co-added or stacked image data across all epochs. We adopt a Bayesian approach, addressing object detection by calculating the marginal likelihoods for hypotheses asserting there is no object, or one object, in a small image patch containing at most one cataloged source at each epoch. The object-present hypothesis interprets the sources in a patch at different epochs as arising from a genuine object; the no-object (noise) hypothesis interprets candidate sources as spurious, arising from noise peaks. We study the detection probability for constant-flux objects in a simplified Gaussian noise setting, comparing results based on single exposures and stacked exposures to results based on a series of single-epoch catalog summaries. Computing the detection probability based on catalog data amounts to generalized cross-matching: it is the product of a factor accounting for matching of the estimated fluxes of candidate sources, and a factor accounting for matching of their estimated directions. We find that probabilistic fusion of multi-epoch catalog information can detect sources with only modest sacrifice in sensitivity and selectivity compared to stacking.
https://arxiv.org/abs/1611.03171
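The "generalized cross-matching" idea above, a product of a flux-matching factor and a direction-matching factor, can be illustrated with a toy log-odds computation in the simplified Gaussian noise setting. The specific prior choices (flux ~ N(0, sigma) under the noise hypothesis, uniform direction window) are assumptions made for this sketch, not the paper's exact model.

```python
import math

def gauss_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def log_odds_object(fluxes, dirs, sigma_f, sigma_d, flux0, dir0):
    """Toy log-odds that per-epoch catalog entries come from one real
    object (common flux `flux0`, direction `dir0`) rather than
    independent noise peaks (flux ~ N(0, sigma_f), direction uniform
    over a unit window). The sum factorizes into a flux-match term
    and a direction-match term, mirroring generalized cross-matching."""
    log_odds = 0.0
    for f, d in zip(fluxes, dirs):
        log_odds += gauss_logpdf(f, flux0, sigma_f) - gauss_logpdf(f, 0.0, sigma_f)
        log_odds += gauss_logpdf(d, dir0, sigma_d) - 0.0  # log(1) for uniform window
    return log_odds

# Five epochs of a genuine constant-flux source near flux 5, direction 0.2:
lo = log_odds_object([4.8, 5.1, 5.3, 4.9, 5.0], [0.21, 0.19, 0.2, 0.22, 0.18],
                     sigma_f=1.0, sigma_d=0.05, flux0=5.0, dir0=0.2)
```

Accumulating such per-epoch log-odds is the catalog-space counterpart of co-adding pixels: evidence grows with each epoch even though no images are stacked.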
The surface orientation can have profound effects on the atomic-scale processes of crystal growth, and is essential to such technologies as GaN-based light-emitting diodes and high-power electronics. We investigate the dependence of homoepitaxial growth mechanisms on the surface orientation of a hexagonal crystal using kinetic Monte Carlo simulations. To model GaN metal-organic vapor phase epitaxy, in which N species are supplied in excess, only Ga atoms on a hexagonal close-packed (HCP) lattice are considered. The results are thus potentially applicable to any HCP material. Growth behaviors on c-plane ${(0 0 0 1)}$ and m-plane ${(0 1 \overline{1} 0)}$ surfaces are compared. We present a reciprocal space analysis of the surface morphology, which allows extraction of growth mode boundaries and direct comparison with surface X-ray diffraction experiments. For each orientation we map the boundaries between 3-dimensional, layer-by-layer, and step flow growth modes as a function of temperature and growth rate. Two models for surface diffusion are used, which produce different effective Ehrlich-Schwoebel step-edge barriers, and different adatom diffusion anisotropies on m-plane surfaces. Simulation results in agreement with observed GaN island morphologies and growth mode boundaries are obtained. These indicate that anisotropy of step edge energy, rather than adatom diffusion, is responsible for the elongated islands observed on m-plane surfaces. Island nucleation spacing obeys a power-law dependence on growth rate, with exponents of -0.24 and -0.29 for m- and c-plane, respectively.
https://arxiv.org/abs/1611.03121
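The reported power-law dependence of island nucleation spacing on growth rate (exponents -0.24 and -0.29) is the slope of a log-log fit. A minimal sketch, using synthetic data with an assumed m-plane-like exponent rather than the paper's simulation output:

```python
import math

def powerlaw_exponent(rates, spacings):
    """Least-squares slope of log(spacing) vs log(growth rate); for
    island nucleation spacing obeying L ~ F^p this recovers p."""
    xs = [math.log(r) for r in rates]
    ys = [math.log(s) for s in spacings]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Synthetic data constructed with exponent -0.24 (the m-plane value):
rates = [0.1, 0.3, 1.0, 3.0, 10.0]
spacings = [r ** -0.24 for r in rates]
p = powerlaw_exponent(rates, spacings)
```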
Temporal common sense has applications in AI tasks such as QA, multi-document summarization, and human-AI communication. We propose the task of sequencing – given a jumbled set of aligned image-caption pairs that belong to a story, the task is to sort them such that the output sequence forms a coherent story. We present multiple approaches, via unary (position) and pairwise (order) predictions, and their ensemble-based combinations, achieving strong results on this task. We use both text-based and image-based features, which yield complementary improvements. Using qualitative examples, we demonstrate that our models have learnt interesting aspects of temporal common sense.
https://arxiv.org/abs/1606.07493
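The pairwise (order) predictions mentioned above can be turned into a global sequence with a simple voting scheme. The scoring rule below is a hedged illustration of one such combination, not the paper's actual ensemble:

```python
def order_from_pairwise(n, p_before):
    """Toy global sequencing from pairwise order predictions:
    `p_before[(i, j)]` is the predicted probability that item i
    precedes item j. Each item is scored by how often it is predicted
    to come first, then items are sorted by that score, a simple
    voting surrogate for combining pairwise predictions."""
    score = {i: 0.0 for i in range(n)}
    for (i, j), p in p_before.items():
        score[i] += p
        score[j] += 1.0 - p
    return sorted(range(n), key=lambda i: score[i], reverse=True)

# Three shuffled story elements with confident pairwise predictions:
p = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.7}
sequence = order_from_pairwise(3, p)
```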
Truth discovery aims to resolve conflicts and find the truth from multiple-source statements. Conventional methods mostly rely on the mutual effect between the reliability of sources and the credibility of statements, but pay no attention to the mutual effect among the credibilities of statements about the same object. We propose memory-network-based models that incorporate both of these ideas for truth discovery. We use a feedforward memory network and a feedback memory network to learn representations of the credibility of statements about the same object. Specifically, we adopt a memory mechanism to learn source reliability and use it in truth prediction. During training, we use multiple types of data (categorical and continuous) by automatically assigning different weights in the loss function based on their effect on truth-discovery prediction. Experimental results show that the memory-network-based models substantially outperform the state-of-the-art method and other baseline methods.
https://arxiv.org/abs/1611.01868
Medical image registration plays an important role in determining topographic and morphological changes for functional diagnostic and therapeutic purposes. Manual alignment and semi-automated software are still in use; however, they are subjective and cost specialists precious time. Fully automated methods are faster and user-independent, but the critical point is registration reliability. Similarity measurement using Mutual Information (MI) with Shannon entropy (MIS) is the most common automated method currently applied to medical images, although more reliable algorithms have been proposed over the last decade, suggesting improvements and different entropies; for example, Studholme et al. (1999) demonstrated that the normalization of Mutual Information (NMI) provides an invariant entropy measure for 3D medical image registration. In this paper, we describe a set of experiments to evaluate the applicability of Tsallis entropy in the Mutual Information (MIT) and in the Normalized Mutual Information (NMIT) as cost functions for the registration of Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET) and Computed Tomography (CT) exams. The effect of changing overlap in a simple image model and clinical experiments on current entropies (Entropy Correlation Coefficient - ECC, MIS and NMI) and the proposed ones (MIT and NMIT) showed NMI and NMIT with the Tsallis parameter close to 1 to be the best options (in confidence and accuracy) for CT-to-MRI and PET-to-MRI automatic neuroimaging registration.
https://arxiv.org/abs/1611.01730
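The Tsallis entropy underlying MIT is S_q = (1 - sum p^q)/(q - 1), which converges to the Shannon entropy as q approaches 1. The sketch below computes a Tsallis-based mutual information for a toy joint intensity histogram; the particular combination S_q(A) + S_q(B) - S_q(A,B) is one common simplified form used here for illustration, not necessarily the exact cost function of the paper.

```python
import math

def tsallis(probs, q):
    """Tsallis entropy S_q = (1 - sum p^q) / (q - 1); for q -> 1 it
    converges to the Shannon entropy -sum p log p."""
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)

def mutual_information_tsallis(joint, q):
    """One simplified form of Tsallis mutual information as a
    registration cost: MI_q = S_q(A) + S_q(B) - S_q(A,B).
    `joint` is a 2D list of joint intensity-pair probabilities."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    pab = [p for row in joint for p in row]
    return tsallis(pa, q) + tsallis(pb, q) - tsallis(pab, q)

# Perfectly aligned toy images give a diagonal joint histogram;
# misaligned ones give a flat histogram and lower MI.
aligned = [[0.5, 0.0], [0.0, 0.5]]
misaligned = [[0.25, 0.25], [0.25, 0.25]]
mi_aligned = mutual_information_tsallis(aligned, q=1.001)
mi_mis = mutual_information_tsallis(misaligned, q=1.001)
```

With q close to 1, as the paper's best-performing setting suggests, the cost behaves almost like Shannon MI while retaining a tunable parameter.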
Track reconstruction in high-track-multiplicity environments at current and future high-rate particle physics experiments is a big challenge and very time consuming. The search for track seeds and the fitting of track candidates are usually the most time-consuming steps in track reconstruction. Here, a new and fast track reconstruction method based on hit triplets is proposed which exploits a three-dimensional fit model including multiple scattering and hit uncertainties from the very start, including the search for track seeds. The hit-triplet-based reconstruction method assumes a homogeneous magnetic field, which allows an analytical solution to be given for the triplet fit result. This method is highly parallelizable, needs fewer operations than other standard track reconstruction methods, and is therefore ideal for implementation on parallel computing architectures. The proposed track reconstruction algorithm has been studied in the context of the Mu3e experiment and a typical LHC experiment.
https://arxiv.org/abs/1611.01671
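The analytic tractability of a triplet in a homogeneous field comes from elementary circle geometry: three hits determine a circle in the bending plane, and the track's transverse momentum is proportional to its radius. A minimal sketch of that closed-form step (ignoring multiple scattering and hit uncertainties, which the full fit model includes):

```python
import math

def circumradius(p1, p2, p3):
    """Radius of the circle through three 2D hit positions. In a
    homogeneous magnetic field the transverse momentum of a track is
    proportional to this radius (pT ~ 0.3 * B * R in GeV, T, m); this
    closed-form geometry is what makes a triplet fit analytic."""
    ax, ay = p1
    bx, by = p2
    cx, cy = p3
    a = math.dist(p2, p3)  # side lengths opposite each vertex
    b = math.dist(p1, p3)
    c = math.dist(p1, p2)
    # twice the signed triangle area, via the cross product
    area = abs((bx - ax) * (cy - ay) - (cx - ax) * (by - ay)) / 2.0
    return a * b * c / (4.0 * area)

# Three points on a circle of radius 5 centred at the origin:
R = circumradius((5.0, 0.0), (0.0, 5.0), (-5.0, 0.0))
```

Because each triplet is processed independently with a handful of arithmetic operations, the computation maps naturally onto parallel architectures, which is the point the abstract makes.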
Automatically describing an image with a natural language has been an emerging challenge in both fields of computer vision and natural language processing. In this paper, we present Long Short-Term Memory with Attributes (LSTM-A) - a novel architecture that integrates attributes into the successful Convolutional Neural Networks (CNNs) plus Recurrent Neural Networks (RNNs) image captioning framework, by training them in an end-to-end manner. To incorporate attributes, we construct variants of architectures by feeding image representations and attributes into RNNs in different ways to explore the mutual but also fuzzy relationship between them. Extensive experiments are conducted on the COCO image captioning dataset and our framework achieves superior results when compared to state-of-the-art deep models. Most remarkably, we obtain METEOR/CIDEr-D scores of 25.2%/98.6% on the testing data of the widely used and publicly available splits of (Karpathy & Fei-Fei, 2015) when extracting image representations with GoogleNet, and achieve the top-1 performance to date on the COCO captioning leaderboard.
https://arxiv.org/abs/1611.01646
Various families of malware use domain generation algorithms (DGAs) to generate a large number of pseudo-random domain names to connect to a command and control (C&C) server. In order to block DGA C&C traffic, security organizations must first discover the algorithm by reverse engineering malware samples, then generate a list of domains for a given seed. The domains are then either preregistered or published in a DNS blacklist. This process is not only tedious, but can be readily circumvented by malware authors using a large number of seeds in algorithms with multivariate recurrence properties (e.g., banjori) or by using a dynamic list of seeds (e.g., bedep). Another technique to stop malware from using DGAs is to intercept DNS queries on a network and predict whether domains are DGA generated. Such a technique will alert network administrators to the presence of malware on their networks. In addition, if the predictor can also accurately predict the family of DGAs, then network administrators can also be alerted to the type of malware that is on their networks. This paper presents a DGA classifier that leverages long short-term memory (LSTM) networks to predict DGAs and their respective families without the need for a priori feature extraction. Results are significantly better than state-of-the-art techniques, providing 0.9993 area under the receiver operating characteristic curve for binary classification and a micro-averaged F1 score of 0.9906. In other terms, the LSTM technique can provide a 90% detection rate with a 1:10000 false positive (FP) rate, a twenty-fold FP improvement over comparable methods. Experiments in this paper are run on open datasets and code snippets are provided to reproduce the results.
https://arxiv.org/abs/1611.00791
While neural machine translation (NMT) has made good progress over the past two years, tens of millions of bilingual sentence pairs are needed for its training. However, human labeling is very costly. To tackle this training data bottleneck, we develop a dual-learning mechanism, which can enable an NMT system to automatically learn from unlabeled data through a dual-learning game. This mechanism is inspired by the following observation: any machine translation task has a dual task, e.g., English-to-French translation (primal) versus French-to-English translation (dual); the primal and dual tasks can form a closed loop and generate informative feedback signals to train the translation models, even without the involvement of a human labeler. In the dual-learning mechanism, we use one agent to represent the model for the primal task and another agent to represent the model for the dual task, then ask them to teach each other through a reinforcement learning process. Based on the feedback signals generated during this process (e.g., the language-model likelihood of the output of a model, and the reconstruction error of the original sentence after the primal and dual translations), we can iteratively update the two models until convergence (e.g., using policy gradient methods). We call the corresponding approach to neural machine translation \emph{dual-NMT}. Experiments show that dual-NMT works very well on English$\leftrightarrow$French translation; in particular, by learning from monolingual data (with 10% bilingual data for warm start), it achieves accuracy comparable to NMT trained on the full bilingual data for the French-to-English translation task.
https://arxiv.org/abs/1611.00179
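The closed-loop reconstruction signal can be illustrated with a toy round trip: translate forward with the primal model, translate back with the dual model, and reward how much of the original is recovered. Here the "models" are plain dictionaries and the reward omits the language-model term; both are simplifying assumptions for the sketch.

```python
def reconstruction_reward(sentence, primal, dual):
    """Toy version of the dual-learning feedback signal: translate
    word-by-word with the primal model, translate back with the dual
    model, and score the fraction of words recovered. In dual-NMT
    this reward (plus a language-model score on the intermediate
    output) drives policy-gradient updates with no human labels;
    here the 'models' are toy word dictionaries."""
    forward = [primal.get(w, "<unk>") for w in sentence]
    back = [dual.get(w, "<unk>") for w in forward]
    matches = sum(1 for a, b in zip(sentence, back) if a == b)
    return matches / len(sentence)

# Toy English<->French lexicons standing in for the two agents:
en_fr = {"the": "le", "cat": "chat", "sleeps": "dort"}
fr_en = {"le": "the", "chat": "cat", "dort": "sleeps"}
reward = reconstruction_reward(["the", "cat", "sleeps"], en_fr, fr_en)
```

A monolingual sentence thus yields a training signal for both directions at once, which is why only a small bilingual warm-start set is needed.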
Recurrent neural networks (RNNs) have achieved state-of-the-art performances in many natural language processing tasks, such as language modeling and machine translation. However, when the vocabulary is large, the RNN model will become very big (e.g., possibly beyond the memory capacity of a GPU device) and its training will become very inefficient. In this work, we propose a novel technique to tackle this challenge. The key idea is to use 2-Component (2C) shared embedding for word representations. We allocate every word in the vocabulary into a table, each row of which is associated with a vector, and each column associated with another vector. Depending on its position in the table, a word is jointly represented by two components: a row vector and a column vector. Since the words in the same row share the row vector and the words in the same column share the column vector, we only need $2 \sqrt{|V|}$ vectors to represent a vocabulary of $|V|$ unique words, far fewer than the $|V|$ vectors required by existing approaches. Based on the 2-Component shared embedding, we design a new RNN algorithm and evaluate it using the language modeling task on several benchmark datasets. The results show that our algorithm significantly reduces the model size and speeds up the training process, without sacrificing accuracy (it achieves similar, if not better, perplexity compared to state-of-the-art language models). Remarkably, on the One-Billion-Word benchmark dataset, our algorithm achieves comparable perplexity to previous language models, whilst reducing the model size by a factor of 40-100, and speeding up the training process by a factor of 2. We name our proposed algorithm \emph{LightRNN} to reflect its very small model size and very high training speed.
https://arxiv.org/abs/1610.09893
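The $2\sqrt{|V|}$ saving comes directly from the table layout: arrange the vocabulary in a roughly square grid and give each row and each column one shared vector. A minimal sketch of that indexing (the word-to-cell assignment here is a naive sequential one; the paper additionally learns a good allocation):

```python
import math

def table_shape(vocab_size):
    """Smallest square table that holds the whole vocabulary; it
    needs only 2*side shared vectors (about 2*sqrt(|V|)) instead of
    |V| full embeddings."""
    side = math.ceil(math.sqrt(vocab_size))
    return side, side

def word_components(word_id, side):
    """Map a word id to its (row, column) cell; the word is then
    jointly represented by (row_vector[row], col_vector[col])."""
    return word_id // side, word_id % side

# A 10,000-word vocabulary fits in a 100x100 table:
side, _ = table_shape(10000)
row, col = word_components(4242, side)
```

For |V| = 10,000 this means 200 shared vectors instead of 10,000 embeddings, and the ratio improves as the vocabulary grows.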
Vision-based object detection is one of the fundamental functions in numerous traffic scene applications such as self-driving vehicle systems and advanced driver assistance systems (ADAS). However, it is also a challenging task due to the diversity of traffic scenes and the storage, power and computing resource limitations of the platforms for traffic scene applications. This paper presents a generalized Haar filter based deep network which is suitable for object detection tasks in traffic scenes. In this approach, we first decompose an object detection task into several easier local regression tasks. Then, we handle the local regression tasks by using several tiny deep networks which simultaneously output the bounding boxes, categories and confidence scores of detected objects. To reduce the consumption of storage and computing resources, the weights of the deep networks are constrained to the form of generalized Haar filters in the training phase. Additionally, we introduce a sparse-windows generation strategy to improve the efficiency of the algorithm. Finally, we perform several experiments to validate the performance of our proposed approach. Experimental results demonstrate that the proposed approach is both efficient and effective in traffic scenes compared with the state-of-the-art.
https://arxiv.org/abs/1610.09609
Neural networks augmented with external memory have the ability to learn algorithmic solutions to complex tasks. These models appear promising for applications such as language modeling and machine translation. However, they scale poorly in both space and time as the amount of memory grows, limiting their applicability to real-world domains. Here, we present an end-to-end differentiable memory access scheme, which we call Sparse Access Memory (SAM), that retains the representational power of the original approaches whilst training efficiently with very large memories. We show that SAM achieves asymptotic lower bounds in space and time complexity, and find that an implementation runs $1,\!000\times$ faster and with $3,\!000\times$ less physical memory than non-sparse models. SAM learns with comparable data efficiency to existing models on a range of synthetic tasks and one-shot Omniglot character recognition, and can scale to tasks requiring $100,\!000$s of time steps and memories. We also show how our approach can be adapted for models that maintain temporal associations between memories, as with the recently introduced Differentiable Neural Computer.
https://arxiv.org/abs/1610.09027
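The core idea of a sparse memory read is to score every slot but attend only over the top-k, so the softmax, the weighted sum, and the gradients touch k slots instead of the whole memory. A hedged sketch of that read operation (the scoring and normalization details here are illustrative; SAM's actual scheme includes approximate nearest-neighbour lookup and sparse writes as well):

```python
import heapq
import math

def sparse_read(query, memory, k=2):
    """Sketch of a sparse memory read: score every slot by dot
    product with the query but restrict the softmax and the weighted
    sum to the top-k slots, so only k slots carry weight (and, in a
    differentiable implementation, gradient)."""
    scores = [sum(q * m for q, m in zip(query, slot)) for slot in memory]
    top = heapq.nlargest(k, range(len(memory)), key=lambda i: scores[i])
    exps = {i: math.exp(scores[i]) for i in top}
    z = sum(exps.values())
    weights = {i: e / z for i, e in exps.items()}
    dim = len(memory[0])
    return [sum(weights[i] * memory[i][d] for i in top) for d in range(dim)]

memory = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
read = sparse_read([1.0, 0.0], memory, k=2)
```

In a real implementation the argmax-k step is served by an approximate nearest-neighbour index, which is what turns the per-step cost from O(N) in memory size into roughly O(k log N).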
We propose a novel extension of the encoder-decoder framework, called a review network. The review network is generic and can enhance any existing encoder-decoder model: in this paper, we consider RNN decoders with both CNN and RNN encoders. The review network performs a number of review steps with an attention mechanism on the encoder hidden states, and outputs a thought vector after each review step; the thought vectors are used as the input of the attention mechanism in the decoder. We show that conventional encoder-decoders are a special case of our framework. Empirically, we show that our framework improves over state-of-the-art encoder-decoder systems on the tasks of image captioning and source code captioning.
https://arxiv.org/abs/1605.07912
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (this http URL).
https://arxiv.org/abs/1505.00468
Neural Machine Translation (NMT) has become the new state-of-the-art in several language pairs. However, how to integrate NMT with a bilingual dictionary, which mainly contains words rarely or never seen in the bilingual training data, remains a challenging problem. In this paper, we propose two methods to bridge NMT and the bilingual dictionaries. The core idea is to design novel models that transform the bilingual dictionaries into adequate sentence pairs, so that NMT can distil latent bilingual mappings from the ample and repetitive phenomena. One method leverages a mixed word/character model, and the other synthesizes parallel sentences guaranteeing massive occurrence of the translation lexicon. Extensive experiments demonstrate that the proposed methods can remarkably improve the translation quality, and most of the rare words in the test sentences can obtain correct translations if they are covered by the dictionary.
https://arxiv.org/abs/1610.07272
Answering open-ended questions is an essential capability for any intelligent agent. One of the most interesting recent open-ended question answering challenges is Visual Question Answering (VQA), which attempts to evaluate a system's visual understanding through its answers to natural language questions about images. There exist many approaches to VQA, the majority of which do not exhibit deeper semantic understanding of the candidate answers they produce. We study the importance of generating plausible answers to a given question by introducing the novel task of 'Answer Proposal': for a given open-ended question, a system should generate a ranked list of candidate answers informed by the semantics of the question. We experiment with various models, including a neural generative model as well as a semantic graph-matching one. We provide both intrinsic and extrinsic evaluations for the task of Answer Proposal, showing that our best model learns to propose plausible answers with high recall and performs competitively with some other solutions to VQA.
https://arxiv.org/abs/1610.06620
We introduce a framework for model learning and planning in stochastic domains with continuous state and action spaces and non-Gaussian transition models. It is efficient because (1) local models are estimated only when the planner requires them; (2) the planner focuses on the states most relevant to the current planning problem; and (3) the planner focuses on the most informative and/or high-value actions. Our theoretical analysis shows the validity and asymptotic optimality of the proposed approach. Empirically, we demonstrate the effectiveness of our algorithm on a simulated multi-modal pushing problem.
https://arxiv.org/abs/1607.07762
In this paper, we introduce a novel fusion method that can enhance object detection performance by fusing decisions from two different types of computer vision tasks: object detection and image classification. In the proposed work, the class label of an image obtained from image classification is viewed as prior knowledge about the existence or non-existence of certain objects. The prior knowledge is then fused with the decisions of object detection to improve detection accuracy by mitigating false positives of an object detector that strongly contradict the prior knowledge. A recently introduced fusion approach called dynamic belief fusion (DBF) is used to fuse the detector output with the classification prior. Experimental results show that the detection performance of all the detection algorithms used in the proposed work is improved on benchmark datasets via the proposed fusion framework.
https://arxiv.org/abs/1610.06907
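The effect of fusing a detector score with a classification prior can be illustrated with a much simpler stand-in than DBF: a geometric blend in which detections of classes the classifier considers absent are suppressed. The blend below is a hedged illustration of the fusion idea, not the DBF algorithm itself.

```python
def fuse_with_prior(det_score, class_prior, alpha=0.5):
    """Simplified stand-in for fusing a detector confidence with an
    image-classification prior (the paper uses dynamic belief
    fusion, DBF; this is just a geometric blend): detections of
    classes the classifier deems absent are pulled down, which is
    how the fusion mitigates contradicted false positives."""
    return det_score ** (1 - alpha) * class_prior ** alpha

# A confident detection the classifier supports survives; one the
# classifier strongly disagrees with is suppressed:
supported = fuse_with_prior(0.8, 0.9)
contradicted = fuse_with_prior(0.8, 0.05)
```

The `alpha` parameter (an assumption of this sketch) controls how much weight the classification prior carries relative to the detector's own confidence.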
This year, the Nara Institute of Science and Technology (NAIST)/Carnegie Mellon University (CMU) submission to the Japanese-English translation track of the 2016 Workshop on Asian Translation was based on attentional neural machine translation (NMT) models. In addition to the standard NMT model, we make a number of improvements, most notably the use of discrete translation lexicons to improve probability estimates, and the use of minimum risk training to optimize the MT system for BLEU score. As a result, our system achieved the highest translation evaluation scores for the task.
https://arxiv.org/abs/1610.06542
Large area epitaxy of two-dimensional (2D) layered materials with high material quality is a crucial step in realizing novel device applications based on 2D materials. In this work, we report high-quality, crystalline, large-area gallium selenide (GaSe) films grown on bulk substrates such as c-plane sapphire and gallium nitride (GaN) using a valved cracker source for Se. (002)-oriented GaSe with random in-plane orientation of domains was grown on sapphire and GaN substrates at a substrate temperature of 350-450 °C with complete surface coverage and smooth surface morphology. A higher growth temperature (575 °C) resulted in the formation of single-crystalline ε-GaSe triangular domains with six-fold symmetry, confirmed by in-situ reflection high-energy electron diffraction (RHEED) and off-axis X-ray diffraction (XRD). A two-step growth method involving high-temperature nucleation of single-crystalline domains and low-temperature growth to enhance coalescence was adopted to obtain continuous (002)-oriented GaSe with an epitaxial relationship with the substrate. While six-fold symmetry was maintained in the two-step growth, the β-GaSe phase was observed in addition to the dominant ε-GaSe in cross-sectional scanning transmission electron microscopy images. This work demonstrates the potential of growing high-quality 2D-layered materials using molecular beam epitaxy and can be extended to the growth of other transition metal chalcogenides.
https://arxiv.org/abs/1610.06265
Detection and learning-based appearance features play a central role in data-association-based multiple object tracking (MOT), but most recent MOT works usually ignore them and focus only on hand-crafted features and association algorithms. In this paper, we explore high-performance detection and deep-learning-based appearance features, and show that they lead to significantly better MOT results in both online and offline settings. We make our detections and appearance features publicly available. In the following, we first summarize the detection and appearance features, and then introduce our tracker, named Person of Interest (POI), which has both online and offline versions.
https://arxiv.org/abs/1610.06136
We investigate the 3.45-eV luminescence band of spontaneously formed GaN nanowires on Si(111) by photoluminescence and cathodoluminescence spectroscopy. This band is found to be particularly prominent for samples synthesized at comparatively low temperatures. At the same time, these samples exhibit a peculiar morphology, namely, isolated long nanowires are interspersed within a dense matrix of short ones. Cathodoluminescence intensity maps reveal the 3.45-eV band to originate primarily from the long nanowires. Transmission electron microscopy shows that these long nanowires are either Ga polar and are joined by an inversion domain boundary with their short N-polar neighbors, or exhibit a Ga-polar core surrounded by a N-polar shell with a tubular inversion domain boundary at the core/shell interface. For samples grown at high temperatures, which exhibit a uniform nanowire morphology, the 3.45-eV band is also found to originate from particular nanowires in the ensemble and thus presumably from inversion domain boundaries stemming from the coexistence of N- and Ga-polar nanowires. For several of the investigated samples, the 3.45-eV band splits into a doublet. We demonstrate that the higher-energy component of this doublet arises from the recombination of two-dimensional excitons free to move in the plane of the inversion domain boundary. In contrast, the lower-energy component of the doublet originates from excitons localized in the plane of the inversion domain boundary. We propose that this in-plane localization is due to shallow donors in the vicinity of the inversion domain boundaries.
https://arxiv.org/abs/1607.04036
Since the first online demonstration of Neural Machine Translation (NMT) by LISA, NMT development has recently moved from laboratory to production systems, as demonstrated by several entities announcing roll-out of NMT engines to replace their existing technologies. NMT systems have a large number of training configurations, and the training process of such systems is usually very long, often a few weeks, so the role of experimentation is critical and important to share. In this work, we present our approach to production-ready systems simultaneously with the release of online demonstrators covering a large variety of languages (12 languages, for 32 language pairs). We explore different practical choices: an efficient and evolutive open-source framework; data preparation; network architecture; additional implemented features; tuning for production; etc. We discuss evaluation methodology, present our first findings, and finally outline further work. Our ultimate goal is to share our expertise to build competitive production systems for “generic” translation. We aim at contributing to set up a collaborative framework to speed up adoption of the technology, foster further research efforts, and enable the delivery and adoption by industry of use-case-specific engines integrated in real production workflows. Mastering the technology would allow us to build translation engines suited for particular needs, outperforming current simplest/uniform systems.
https://arxiv.org/abs/1610.05540
This paper explores new evaluation perspectives for image captioning and introduces a noun translation task that achieves comparable image caption generation performance by translating from a set of nouns to captions. This implies that in image captioning, all word categories other than nouns can be evoked by a powerful language model without sacrificing performance on n-gram precision. The paper also investigates lower and upper bounds on how much individual word categories in the captions contribute to the final BLEU score. A large possible improvement exists for nouns, verbs, and prepositions.
https://arxiv.org/abs/1610.03708
Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map from raw video data to high level textual description, bypassing localization and recognition, thus discarding potentially valuable information for content localization and generalization. In this work we present an automatic video captioning model that combines spatio-temporal attention and image classification by means of deep neural network structures based on long short-term memory. The resulting system is demonstrated to produce state-of-the-art results in the standard YouTube captioning benchmark while also offering the advantage of localizing the visual concepts (subjects, verbs, objects), with no grounding supervision, over space and time.
https://arxiv.org/abs/1610.04997
Recently, the development of neural machine translation (NMT) has significantly improved the translation quality of automatic machine translation. While most sentences are more accurate and fluent than translations by statistical machine translation (SMT)-based systems, in some cases, the NMT system produces translations that have a completely different meaning. This is especially the case when rare words occur. When using statistical machine translation, it has already been shown that significant gains can be achieved by simplifying the input in a preprocessing step. A commonly used example is the pre-reordering approach. In this work, we used phrase-based machine translation to pre-translate the input into the target language. Then a neural machine translation system generates the final hypothesis using the pre-translation. Thereby, we use either only the output of the phrase-based machine translation (PBMT) system or a combination of the PBMT output and the source sentence. We evaluate the technique on the English to German translation task. Using this approach we are able to outperform the PBMT system as well as the baseline neural MT system by up to 2 BLEU points. We analyzed the influence of the quality of the initial system on the final result.
https://arxiv.org/abs/1610.05243
Conventional attention-based Neural Machine Translation (NMT) conducts dynamic alignment in generating the target sentence. By repeatedly reading the representation of the source sentence, which remains fixed after being generated by the encoder (Bahdanau et al., 2015), the attention mechanism has greatly enhanced state-of-the-art NMT. In this paper, we propose a new attention mechanism, called INTERACTIVE ATTENTION, which models the interaction between the decoder and the representation of the source sentence during translation through both reading and writing operations. INTERACTIVE ATTENTION can keep track of the interaction history and therefore improve translation performance. Experiments on the NIST Chinese-English translation task show that INTERACTIVE ATTENTION can achieve significant improvements over both the previous attention-based NMT baseline and some state-of-the-art variants of attention-based NMT (i.e., coverage models (Tu et al., 2016)). A neural machine translator with our INTERACTIVE ATTENTION can outperform the open-source attention-based NMT system Groundhog by 4.22 BLEU points and the open-source phrase-based system Moses by 3.94 BLEU points on average across multiple test sets.
https://arxiv.org/abs/1610.05011
Recently, neural networks have achieved great success on sentiment classification due to their ability to alleviate feature engineering. However, one of the remaining challenges is to model long texts in document-level sentiment classification under a recurrent architecture because of the deficiency of the memory unit. To address this problem, we present Cached Long Short-Term Memory neural networks (CLSTM) to capture the overall semantic information in long texts. CLSTM introduces a cache mechanism, which divides the memory into several groups with different forgetting rates and thus enables the network to better retain sentiment information within a recurrent unit. The proposed CLSTM outperforms state-of-the-art models on three publicly available document-level sentiment analysis datasets.
https://arxiv.org/abs/1610.04989
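The cache mechanism described above can be sketched in a few lines of numpy: the memory vector is split into groups, each with its own fixed forgetting rate, so slow groups retain older information while fast groups track recent input. The group count, rates, and update rule here are invented for illustration and are not the paper's exact gating equations.

```python
import numpy as np

def cached_memory_step(c, x, num_groups=3):
    """One toy recurrent step over a memory split into groups with different
    fixed forgetting rates (a sketch of the CLSTM cache idea)."""
    groups = np.split(c, num_groups)
    # forgetting rates spread from slow (0.1) to fast (0.9)
    rates = np.linspace(0.1, 0.9, num_groups)
    new_groups = [(1 - r) * g + r * np.tanh(x[: g.size])
                  for g, r in zip(groups, rates)]
    return np.concatenate(new_groups)

c = np.zeros(6)
x = np.ones(6)
for _ in range(50):
    c = cached_memory_step(c, x)
# all groups converge toward tanh(1), but the slow group lags behind
```

After a single step on a fresh memory, the fast-forgetting group has already absorbed more of the input than the slow group, which is the intended division of labor.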
Object proposals greatly benefit the object detection task in recent state-of-the-art works. However, existing object proposals usually have low localization accuracy at high intersection-over-union thresholds. To address this, in this paper we apply saliency detection to each bounding box to improve its quality. We first present a geodesic saliency detection method in contour, which is designed to find closed contours. Then, we apply it to each candidate box at multiple sizes, and refined boxes can be easily produced from the obtained saliency maps, which are further used to calculate saliency scores for proposal ranking. Experiments on the PASCAL VOC 2007 test dataset demonstrate that the proposed refinement approach can greatly improve existing models.
https://arxiv.org/abs/1603.04146
With the advancement of huge data generation and data handling capability, machine learning and probabilistic modelling enable an immense opportunity to employ predictive analytics platforms in high-security, critical industries such as data centers, electricity grids, utilities, and airports, where downtime minimization is one of the primary objectives. This paper proposes a novel, complete architecture for an intelligent predictive analytics platform, Fault Engine, for huge device networks connected by electrical/information flow. Three unique modules, proposed here, seamlessly integrate with the available data-handling technology stack and connect with middleware to produce online, intelligent predictions in critical failure scenarios. The Markov Failure module predicts the severity of a failure along with the survival probability of a device at any given instant. The Root Cause Analysis module indicates probable devices as potential root causes, employing Bayesian probability assignment and topological sort. Finally, a community detection algorithm produces correlated clusters of devices in terms of failure probability, which further narrows down the search space for finding the root cause. The whole engine has been tested with networks of different sizes in simulated failure environments and shows its potential to scale in real-time implementations.
https://arxiv.org/abs/1610.04872
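The root-cause step, ordering devices by a topological sort of the dependency graph and scoring them from failure probabilities, can be sketched as follows. The scoring rule (own failure probability times the mean failure probability of direct dependents) is a deliberately simple stand-in for the paper's Bayesian assignment; the device names and probabilities are invented:

```python
from collections import defaultdict, deque

def rank_root_causes(edges, failure_prob):
    """Rank candidate root causes in a device DAG. Devices are topologically
    sorted (Kahn's algorithm), then scored by their own failure probability
    times the mean failure probability of their direct dependents."""
    children = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        children[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    order = deque(n for n in nodes if indeg[n] == 0)
    topo = []
    while order:
        n = order.popleft()
        topo.append(n)
        for m in children[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                order.append(m)
    scores = {}
    for n in topo:
        down = children[n]
        mean_down = (sum(failure_prob[m] for m in down) / len(down)
                     if down else 0.0)
        # leaves score 0 here: in this toy model, a device with no
        # dependents cannot explain failures elsewhere in the network
        scores[n] = failure_prob[n] * mean_down
    return sorted(scores, key=scores.get, reverse=True)
```

A device that is both likely to have failed and upstream of other likely failures floats to the top of the ranking.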
A dual-channel AlN/GaN/AlN/GaN high electron mobility transistor (HEMT) architecture is proposed, simulated, and demonstrated that suppresses gate lag due to surface-originated trapped charge. Dual two-dimensional electron gas (2DEG) channels are utilized such that the top 2DEG serves as an equipotential that screens potential fluctuations resulting from surface trapped charge. The bottom channel serves as the transistor’s modulated channel. Two device modeling approaches have been performed as a means to guide the device design and to elucidate the relationship between the design and performance metrics. The modeling efforts include a self-consistent Poisson-Schrodinger solution for electrostatic simulation as well as hydrodynamic three-dimensional device modeling for three-dimensional electrostatics, steady-state, and transient simulations. Experimental results validated the HEMT design whereby homo-epitaxial growth on free-standing GaN substrates and fabrication of same-wafer dual-channel and recessed-gate AlN/GaN HEMTs have been demonstrated. Notable pulsed-gate performance has been achieved by the fabricated HEMTs through a gate lag ratio of 0.86 with minimal drain current collapse while maintaining high levels of dc and rf performance.
https://arxiv.org/abs/1610.03921
Leveraging large data sets, deep convolutional neural networks (CNNs) achieve state-of-the-art recognition accuracy. Due to the substantial compute and memory operations, however, they require significant execution time. The massive parallel computing capability of GPUs makes them one of the ideal platforms to accelerate CNNs, and a number of GPU-based CNN libraries have been developed. While existing works mainly focus on the computational efficiency of CNNs, the memory efficiency of CNNs has been largely overlooked. Yet CNNs have intricate data structures, and their memory behavior can have a significant impact on performance. In this work, we study the memory efficiency of various CNN layers and reveal the performance implications of both data layouts and memory access patterns. Experiments show the universal effect of our proposed optimizations on both single layers and various networks, with speedups of up to 27.9x for a single layer and up to 5.6x for whole networks.
https://arxiv.org/abs/1610.03618
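One of the data-layout choices the abstract alludes to is NCHW versus NHWC: the same tensor, but a different dimension is contiguous in memory, which changes whether, e.g., the channel values of one pixel can be loaded in a single coalesced access. A minimal numpy sketch of the transformation (the tensor contents are arbitrary):

```python
import numpy as np

def to_nhwc(t):
    """Convert a tensor from NCHW to NHWC layout. ascontiguousarray forces a
    physical copy so the channel dimension really becomes innermost in
    memory, not just a strided view."""
    return np.ascontiguousarray(t.transpose(0, 2, 3, 1))

x = np.arange(2 * 3 * 4 * 4).reshape(2, 3, 4, 4)  # N=2, C=3, H=W=4
y = to_nhwc(x)
# in NHWC, the channel values of one pixel are adjacent in memory
```

The same values are present in both layouts; only the memory stride pattern, and hence the access pattern of a kernel iterating over the innermost axis, differs.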
Long short-term memory (LSTM) recurrent neural networks (RNNs) have been shown to give state-of-the-art performance on many speech recognition tasks, as they are able to provide the learned dynamically changing contextual window of all sequence history. On the other hand, the convolutional neural networks (CNNs) have brought significant improvements to deep feed-forward neural networks (FFNNs), as they are able to better reduce spectral variation in the input signal. In this paper, a network architecture called as convolutional recurrent neural network (CRNN) is proposed by combining the CNN and LSTM RNN. In the proposed CRNNs, each speech frame, without adjacent context frames, is organized as a number of local feature patches along the frequency axis, and then a LSTM network is performed on each feature patch along the time axis. We train and compare FFNNs, LSTM RNNs and the proposed LSTM CRNNs at various number of configurations. Experimental results show that the LSTM CRNNs can exceed state-of-the-art speech recognition performance.
https://arxiv.org/abs/1610.03165
This paper considers the problem of approximate nearest neighbor search in the compressed domain. We introduce polysemous codes, which offer both the distance estimation quality of product quantization and the efficient comparison of binary codes with Hamming distance. Their design is inspired by algorithms introduced in the 90’s to construct channel-optimized vector quantizers. At search time, this dual interpretation accelerates the search. Most of the indexed vectors are filtered out with Hamming distance, leaving only a fraction of the vectors to be ranked with an asymmetric distance estimator. The method is complementary with a coarse partitioning of the feature space such as the inverted multi-index. This is shown by our experiments performed on several public benchmarks such as the BIGANN dataset comprising one billion vectors, for which we report state-of-the-art results for query times below 0.3 milliseconds per core. Last but not least, our approach allows the approximate computation of the k-NN graph associated with the Yahoo Flickr Creative Commons 100M, described by CNN image descriptors, in less than 8 hours on a single machine.
http://arxiv.org/abs/1609.01882
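The two-stage search the abstract describes, a cheap Hamming filter on the codes followed by a precise asymmetric re-ranking of the few survivors, can be sketched as below. For simplicity the precise stage uses plain L2 distance to stored vectors as a stand-in for the product-quantization distance estimator, and codes are unpacked bit arrays rather than packed integers:

```python
import numpy as np

def search(query_code, query_vec, codes, vecs, hamming_thresh, k):
    """Two-stage search in the spirit of polysemous codes: filter with
    Hamming distance, then re-rank only the survivors precisely."""
    # Stage 1: Hamming filter, counting differing bits per database code
    ham = np.count_nonzero(codes != query_code, axis=1)
    survivors = np.flatnonzero(ham <= hamming_thresh)
    # Stage 2: precise re-ranking of survivors only (L2 stands in for the
    # asymmetric PQ distance estimator)
    d = np.linalg.norm(vecs[survivors] - query_vec, axis=1)
    return survivors[np.argsort(d)[:k]]
```

The payoff is that the expensive distance is evaluated on a small fraction of the database, while the cheap filter touches everything.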
Directly reading documents and being able to answer questions from them is an unsolved challenge. To avoid its inherent difficulty, question answering (QA) has been directed towards using Knowledge Bases (KBs) instead, which has proven effective. Unfortunately, KBs often suffer from being too restrictive, as the schema cannot support certain types of answers, and too sparse; e.g., Wikipedia contains much more information than Freebase. In this work we introduce a new method, Key-Value Memory Networks, that makes reading documents more viable by utilizing different encodings in the addressing and output stages of the memory read operation. To compare using KBs, information extraction, or Wikipedia documents directly in a single framework, we construct an analysis tool, WikiMovies, a QA dataset that contains raw text alongside a preprocessed KB, in the domain of movies. Our method reduces the gap between all three settings. It also achieves state-of-the-art results on the existing WikiQA benchmark.
https://arxiv.org/abs/1606.03126
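The core mechanism, addressing memory with one encoding (the keys) and reading out a different encoding (the values), can be sketched as a single attention read in numpy. The toy keys, values, and query below are invented for illustration:

```python
import numpy as np

def kv_memory_read(query, keys, values):
    """One memory read in the Key-Value Memory Network style: address with
    the keys (softmax over query-key similarity), then read out a weighted
    sum of the *values*, so addressing and output encodings can differ."""
    scores = keys @ query                 # similarity in the addressing space
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # softmax attention over memory slots
    return p @ values                     # weighted sum in the output space
```

Because keys and values are separate, a document window can serve as the key while, say, its center entity serves as the value, which is what makes reading raw text more viable than with a single shared encoding.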
This thesis report studies methods to solve Visual Question-Answering (VQA) tasks with a Deep Learning framework. As a preliminary step, we explore Long Short-Term Memory (LSTM) networks used in Natural Language Processing (NLP) to tackle Question-Answering (text based). We then modify the previous model to accept an image as an input in addition to the question. For this purpose, we explore the VGG-16 and K-CNN convolutional neural networks to extract visual features from the image. These are merged with the word embedding or with a sentence embedding of the question to predict the answer. This work was successfully submitted to the Visual Question Answering Challenge 2016, where it achieved 53.62% accuracy on the test dataset. The developed software follows best programming practices and Python code style, providing a consistent baseline in Keras for different configurations.
https://arxiv.org/abs/1610.02692
Within the field of Statistical Machine Translation (SMT), the neural approach (NMT) has recently emerged as the first technology able to challenge the long-standing dominance of phrase-based approaches (PBMT). In particular, at the IWSLT 2015 evaluation campaign, NMT outperformed well established state-of-the-art PBMT systems on English-German, a language pair known to be particularly hard because of morphology and syntactic differences. To understand in what respects NMT provides better translation quality than PBMT, we perform a detailed analysis of neural versus phrase-based SMT outputs, leveraging high quality post-edits performed by professional translators on the IWSLT data. For the first time, our analysis provides useful insights on what linguistic phenomena are best modeled by neural models – such as the reordering of verbs – while pointing out other aspects that remain to be improved.
https://arxiv.org/abs/1608.04631
The visual cues from multiple support regions of different sizes and resolutions are complementary in classifying a candidate box in object detection. Effective integration of local and contextual visual cues from these regions has become a fundamental problem in object detection. In this paper, we propose a gated bi-directional CNN (GBD-Net) to pass messages among features from different support regions during both feature learning and feature extraction. Such message passing can be implemented through convolution between neighboring support regions in two directions and can be conducted in various layers. Therefore, local and contextual visual patterns can validate the existence of each other by learning their nonlinear relationships, and their close interactions are modeled in a more complex way. It is also shown that message passing is not always helpful but dependent on individual samples. Gated functions are therefore needed to control message transmission, whose on-or-offs are controlled by extra visual evidence from the input sample. The effectiveness of GBD-Net is shown through experiments on three object detection datasets: ImageNet, Pascal VOC2007 and Microsoft COCO. This paper also shows the details of our approach in winning the ImageNet object detection challenge of 2016, with source code provided at this https URL.
https://arxiv.org/abs/1610.02579
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT’s use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google’s Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units (“wordpieces”) for both input and output. This method provides a good balance between the flexibility of “character”-delimited models and the efficiency of “word”-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT’14 English-to-French and English-to-German benchmarks, GNMT achieves results competitive with the state of the art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google’s phrase-based production system.
https://arxiv.org/abs/1609.08144
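The beam-search scoring the abstract mentions combines a length-normalized log-probability with a coverage penalty, s(Y,X) = log P(Y|X) / lp(Y) + cp(X;Y). The formulas below follow the GNMT paper; the alpha/beta values and the example inputs are illustrative only:

```python
import math

def gnmt_score(log_prob, length, attn_sums, alpha=0.6, beta=0.2):
    """GNMT-style beam score: lp(Y) = ((5+|Y|)^a)/((5+1)^a) normalizes away
    the bias toward short hypotheses; cp = b * sum_i log(min(attn_i, 1))
    rewards hypotheses whose attention has covered every source word
    (attn_sums[i] is the total attention mass placed on source word i)."""
    lp = ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)
    cp = beta * sum(math.log(min(s, 1.0)) for s in attn_sums)
    return log_prob / lp + cp
```

With full coverage the penalty vanishes (log 1 = 0), while any source word with total attention below 1 pulls the score down, steering the beam away from translations that skip source content.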
Generative Adversarial Networks (GANs) have recently demonstrated the capability to synthesize compelling real-world images, such as room interiors, album covers, manga, faces, birds, and flowers. While existing models can synthesize images based on global constraints such as a class label or caption, they do not provide control over pose or object location. We propose a new model, the Generative Adversarial What-Where Network (GAWWN), that synthesizes images given instructions describing what content to draw in which location. We show high-quality 128 x 128 image synthesis on the Caltech-UCSD Birds dataset, conditioned on both informal text descriptions and also object location. Our system exposes control over both the bounding box around the bird and its constituent parts. By modeling the conditional distributions over part locations, our system also enables conditioning on arbitrary subsets of parts (e.g. only the beak and tail), yielding an efficient interface for picking part locations. We also show preliminary results on the more challenging domain of text- and location-controllable synthesis of images of human actions on the MPII Human Pose dataset.
https://arxiv.org/abs/1610.02454
Neural sequence models are widely used to model time-series data in many fields. Equally ubiquitous is the usage of beam search (BS) as an approximate inference algorithm to decode output sequences from these models. BS explores the search space in a greedy left-right fashion, retaining only the top-B candidates, resulting in sequences that differ only slightly from each other. Producing lists of nearly identical sequences is not only computationally wasteful but also typically fails to capture the inherent ambiguity of complex AI tasks. To overcome this problem, we propose Diverse Beam Search (DBS), an alternative to BS that decodes a list of diverse outputs by optimizing for a diversity-augmented objective. We observe that our method finds better top-1 solutions by controlling for the exploration and exploitation of the search space, implying that DBS is a better search algorithm. Moreover, these gains are achieved with minimal computational or memory overhead as compared to beam search. To demonstrate the broad applicability of our method, we present results on image captioning, machine translation and visual question generation using both standard quantitative metrics and qualitative human studies. Our method consistently outperforms BS and previously proposed techniques for diverse decoding from neural sequence models.
https://arxiv.org/abs/1610.02424
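The diversity-augmented objective can be illustrated for a single decoding step: the beam is split into groups decoded in order, and each later group's token scores are penalized by a Hamming-style count of how often a token was already chosen by earlier groups. This toy numpy sketch (group layout, penalty weight, and scores all invented) is not the paper's full sequence-level algorithm:

```python
import numpy as np

def diverse_step(logprobs, num_groups=2, lam=0.5):
    """One decoding step of a toy Diverse Beam Search: `logprobs` has one
    row of token scores per beam; groups are decoded sequentially and pay a
    penalty lam * count for tokens earlier groups already picked."""
    beams_per_group = len(logprobs) // num_groups
    counts = np.zeros(logprobs.shape[1])   # how often each token was chosen
    chosen = []
    for g in range(num_groups):
        rows = logprobs[g * beams_per_group:(g + 1) * beams_per_group]
        for row in rows:
            tok = int(np.argmax(row - lam * counts))  # diversity-augmented
            chosen.append(tok)
            counts[tok] += 1
    return chosen
```

Even when every beam has identical scores, the penalty forces later groups onto their next-best tokens, which is exactly the behavior plain beam search lacks.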
Convolutional neural networks (CNNs) are among the most prominent architectures and algorithms in deep learning, showing remarkable improvements in the recognition and classification of objects. The method has also proven to be very effective in a variety of computer vision and machine learning problems. As with other deep learning methods, however, training the CNN is interesting yet challenging. Recently, metaheuristic algorithms such as Genetic Algorithms, Particle Swarm Optimization, Simulated Annealing and Harmony Search have been used to optimize CNNs. In this paper, another type of metaheuristic algorithm with a different strategy is proposed, namely Microcanonical Annealing, to optimize convolutional neural networks. The performance of the proposed method is tested using the MNIST and CIFAR-10 datasets. Although experimental results on the MNIST dataset indicate an increase in computation time (1.02x - 1.38x), the proposed method can considerably enhance the performance of the original CNN (by up to 4.60%). On the CIFAR-10 dataset, the current state of the art is 96.53% using fractional pooling, while the proposed method achieves 99.14%.
https://arxiv.org/abs/1610.02306
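The distinguishing feature of microcanonical annealing is that there is no temperature schedule: a "demon" carries a bounded energy reserve, uphill moves are accepted only if the demon can pay for them, and downhill moves recharge it. The sketch below applies the idea to a toy one-dimensional minimization; the step sizes, cap, and objective are invented and this is not the paper's CNN-training procedure:

```python
import random

def microcanonical_annealing(f, x0, steps=2000, step_size=0.5,
                             demon_cap=2.0, seed=0):
    """Minimize f with a simplified microcanonical ('demon') annealing loop:
    accept a move iff the demon can absorb the energy increase; downhill
    moves refill the demon up to demon_cap."""
    rng = random.Random(seed)
    demon = demon_cap
    x = best = x0
    for _ in range(steps):
        cand = x + rng.uniform(-step_size, step_size)
        delta = f(cand) - f(x)
        if delta <= demon:                       # demon can pay the cost
            demon = min(demon - delta, demon_cap)
            x = cand
            if f(x) < f(best):
                best = x
    return best
```

Capping the demon's reserve is what gives the method its annealing-like behavior: large uphill escapes become impossible, yet small fluctuations around a minimum remain allowed.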
Augmenting RGB data with measured depth has been shown to improve the performance of a range of tasks in computer vision including object detection and semantic segmentation. Although depth sensors such as the Microsoft Kinect have facilitated easy acquisition of such depth information, the vast majority of images used in vision tasks do not contain depth information. In this paper, we show that augmenting RGB images with estimated depth can also improve the accuracy of both object detection and semantic segmentation. Specifically, we first exploit the recent success of depth estimation from monocular images and learn a deep depth estimation model. Then we learn deep depth features from the estimated depth and combine with RGB features for object detection and semantic segmentation. Additionally, we propose an RGB-D semantic segmentation method which applies a multi-task training scheme: semantic label prediction and depth value regression. We test our methods on several datasets and demonstrate that incorporating information from estimated depth improves the performance of object detection and semantic segmentation remarkably.
https://arxiv.org/abs/1610.01706
Neural machine translation (NMT) often makes mistakes in translating low-frequency content words that are essential to understanding the meaning of the sentence. We propose a method to alleviate this problem by augmenting NMT systems with discrete translation lexicons that efficiently encode translations of these low-frequency words. We describe a method to calculate the lexicon probability of the next word in the translation candidate by using the attention vector of the NMT model to select which source word lexical probabilities the model should focus on. We test two methods to combine this probability with the standard NMT probability: (1) using it as a bias, and (2) linear interpolation. Experiments on two corpora show an improvement of 2.0-2.3 BLEU and 0.13-0.44 NIST score, and faster convergence time.
https://arxiv.org/abs/1606.02006
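The two combination methods named above can be sketched directly. In the bias variant the lexicon distribution enters as a log-domain addition to the pre-softmax scores; in the interpolation variant the two distributions are mixed linearly. The epsilon and gamma values, and the toy distributions in the test, are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def combine_bias(nmt_logits, lex_probs, eps=1e-6):
    """'Bias' method: the lexicon probability is added inside the log as a
    bias on the pre-softmax scores, so a strong lexicon entry can boost a
    rare word while a zero entry never hard-vetoes the NMT (thanks to eps)."""
    return softmax(nmt_logits + np.log(lex_probs + eps))

def combine_interp(nmt_probs, lex_probs, gamma=0.3):
    """Linear interpolation of the two distributions."""
    return gamma * lex_probs + (1 - gamma) * nmt_probs
```

Both variants keep the output a valid probability distribution; they differ in how aggressively the lexicon can override the neural model's preferences.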
Together with the development of more accurate methods in Computer Vision and Natural Language Understanding, holistic architectures that answer questions about the content of real-world images have emerged. In this tutorial, we build a neural-based approach to answer questions about images. We base our tutorial on two datasets: (mostly on) DAQUAR, and (a bit on) VQA. With small tweaks, the models that we present here can achieve competitive performance on both datasets; in fact, they are among the best methods that use a combination of LSTM with a global, full-frame CNN representation of an image. We hope that after reading this tutorial, the reader will be able to use Deep Learning frameworks, such as Keras and the here-introduced Kraino, to build various architectures that will lead to further performance improvements on this challenging task.
https://arxiv.org/abs/1610.01076
This paper presents how we can achieve state-of-the-art accuracy in the multi-category object detection task while minimizing the computational cost by adapting and combining recent technical innovations. Following the common pipeline of “CNN feature extraction + region proposal + RoI classification”, we mainly redesign the feature extraction part, since the region proposal part is not computationally expensive and the classification part can be efficiently compressed with common techniques like truncated SVD. Our design principle is “less channels with more layers”, together with the adoption of some building blocks including concatenated ReLU, Inception, and HyperNet. The designed network is deep and thin, and is trained with the help of batch normalization, residual connections, and learning rate scheduling based on plateau detection. We obtained solid results on well-known object detection benchmarks: 83.8% mAP (mean average precision) on VOC2007 and 82.5% mAP on VOC2012 (2nd place), while taking only 750ms/image on an Intel i7-6700K CPU with a single core and 46ms/image on an NVIDIA Titan X GPU. Theoretically, our network requires only 12.3% of the computational cost of ResNet-101, the winner on VOC2012.
https://arxiv.org/abs/1608.08021
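The truncated-SVD compression mentioned for the classification part can be sketched in numpy: a fully-connected weight matrix W (m x n) is replaced by two thin factors A (m x r) and B (r x n), cutting parameters from m*n to r*(m+n). The rank-1 example matrix below is invented so that the compression is exact:

```python
import numpy as np

def truncated_svd_compress(W, rank):
    """Compress a fully-connected weight matrix with truncated SVD: keep the
    top `rank` singular triplets and fold the singular values into the left
    factor, so the layer becomes two smaller matrix multiplies."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]        # m x r
    B = Vt[:rank]                     # r x n
    return A, B

W = np.outer(np.arange(1, 5, dtype=float),
             np.arange(1, 7, dtype=float))   # a rank-1 4x6 matrix
A, B = truncated_svd_compress(W, rank=1)
# A @ B reconstructs W exactly here because W has rank 1
```

For real layer weights the reconstruction is approximate, with the error bounded by the discarded singular values, which is why a modest rank often costs little accuracy.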
We propose a novel algorithm for visual question answering based on a recurrent deep neural network, where every module in the network corresponds to a complete answering unit with an attention mechanism by itself. The network is optimized by minimizing loss aggregated from all the units, which share model parameters while receiving different information to compute attention probability. For training, our model attends to a region within the image feature map, updates its memory based on the question and the attended image feature, and answers the question based on its memory state. This procedure is performed to compute loss in each step. The motivation of this approach is our observation that multi-step inferences are often required to answer questions, while each problem may have a unique desirable number of steps, which is difficult to identify in practice. Hence, we always make the first unit in the network solve problems, but allow it to learn the knowledge from the rest of the units by backpropagation unless it degrades the model. To implement this idea, we early-stop training each unit as soon as it starts to overfit. Note that, since more complex models tend to overfit on easier questions quickly, the last answering unit in the unfolded recurrent neural network is typically killed first while the first one remains last. We make a single-step prediction for a new question using the shared model. This strategy works better than the other options within our framework since the selected model is trained effectively from all units without overfitting. The proposed algorithm outperforms other multi-step attention-based approaches using a single-step prediction on the VQA dataset.
https://arxiv.org/abs/1606.03647