We propose a novel discriminative model that learns embeddings from multilingual and multi-modal data, meaning that our model can take advantage of images and descriptions in multiple languages to improve embedding quality. To that end, we introduce a modification of a pairwise contrastive estimation optimisation function as our training objective. We evaluate our embeddings on an image-sentence ranking (ISR), a semantic textual similarity (STS), and a neural machine translation (NMT) task. We find that the additional multilingual signals lead to improvements on both the ISR and STS tasks, and the discriminative cost can also be used in re-ranking $n$-best lists produced by NMT models, yielding strong improvements.
https://arxiv.org/abs/1702.01101
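The abstract names the objective but not its exact form; as a rough illustration, the following Python/NumPy sketch shows a standard pairwise contrastive (margin-based ranking) loss of the kind the paper modifies, here over a batch of matched image-sentence embeddings. The margin value and all names are illustrative assumptions, not the paper's modified objective.

import numpy as np

def pairwise_contrastive_loss(img, sen, margin=0.2):
    # Cosine similarity matrix: s[i, j] = sim(image_i, sentence_j);
    # rows of `img` and `sen` are matched pairs.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    sen = sen / np.linalg.norm(sen, axis=1, keepdims=True)
    s = img @ sen.T
    pos = np.diag(s)  # similarities of the true pairs
    # Hinge terms push mismatched pairs below the true pair by a margin.
    cost_s = np.maximum(0.0, margin - pos[:, None] + s)  # wrong sentences
    cost_i = np.maximum(0.0, margin - pos[None, :] + s)  # wrong images
    np.fill_diagonal(cost_s, 0.0)
    np.fill_diagonal(cost_i, 0.0)
    return cost_s.sum() + cost_i.sum()

rng = np.random.default_rng(0)
print(pairwise_contrastive_loss(rng.normal(size=(4, 8)),
                                rng.normal(size=(4, 8))))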
In this paper, we give a proof of the Gan-Gross-Prasad conjecture for the discrete series of U(p,q). For a discrete series representation $D(\lambda)$ given in terms of its Harish-Chandra parameter, the restriction of $D(\lambda)$ to U(p-1,q) contains $D(\mu)$ as a subrepresentation if and only if $\lambda$ and $\mu$ interlace in a very special way.
https://arxiv.org/abs/1508.02032
The adsorption of hydrogen at nonpolar GaN(1-100) surfaces and its impact on the electronic and vibrational properties is investigated using surface electron spectroscopy in combination with density functional theory (DFT) calculations. For the surface-mediated dissociation of H2 and the subsequent adsorption of H, an energy barrier of 0.55 eV has to be overcome. The calculated kinetic surface phase diagram indicates that the reaction is kinetically hindered at low pressures and low temperatures. At higher temperatures, ab initio thermodynamics shows that the H-free surface is energetically favored. To validate these theoretical predictions, experiments were performed at room temperature and under ultrahigh vacuum conditions. They reveal that molecular hydrogen does not dissociatively adsorb at the GaN(1-100) surface; only activated atomic hydrogen attaches to the surface. At temperatures above 820 K, the attached hydrogen desorbs. The adsorbed hydrogen atoms saturate the dangling bonds of the gallium and nitrogen surface atoms and invert the Ga-N surface dimer buckling. The signatures of the Ga-H and N-H vibrational modes on the H-covered surface have been identified experimentally and are in good agreement with the DFT calculations of the surface phonon modes. Both theory and experiment show that H adsorption removes the occupied and unoccupied intragap electron states of the clean GaN(1-100) surface and reduces the surface upward band bending by 0.4 eV. The latter mechanism largely reduces surface electron depletion.
https://arxiv.org/abs/1702.00809
This paper introduces a novel neural network model for question answering, the \emph{entity-based memory network}. It enhances neural networks' ability to represent and compute information over long texts by keeping records of the entities contained in the text. The core component is a memory pool which comprises entities' states. These entity states are continuously updated according to the input text. Questions about the input text are used to search the memory pool for related entities, and answers are then predicted based on the states of the retrieved entities. Compared with previous memory network models, the proposed model can handle fine-grained information and more sophisticated relations based on entities. We formulated several different tasks as question answering problems and tested the proposed model; the experiments yielded satisfactory results.
https://arxiv.org/abs/1612.03551
Recent empirical results on long-term dependency tasks have shown that neural networks augmented with an external memory can learn such tasks more easily and achieve better generalization than vanilla recurrent neural networks (RNNs). We suggest that memory augmented neural networks can reduce the effects of vanishing gradients by creating shortcut (or wormhole) connections. Based on this observation, we propose a novel memory augmented neural network model called TARDIS (Temporal Automatic Relation Discovery in Sequences). The controller of TARDIS can store a selective set of embeddings of its own previous hidden states in an external memory and revisit them as and when needed. In TARDIS, the memory acts as storage for wormhole connections to the past, which propagates gradients more effectively and helps the model learn temporal dependencies. The memory structure of TARDIS has similarities to both Neural Turing Machines (NTM) and Dynamic Neural Turing Machines (D-NTM), but both read and write operations of TARDIS are simpler and more efficient. We use discrete addressing for read/write operations, which substantially reduces the vanishing gradient problem on very long sequences. Read and write operations in TARDIS are tied with a heuristic once the memory becomes full, which makes the learning problem simpler than in NTM- or D-NTM-type architectures. We provide a detailed analysis of gradient propagation in memory augmented neural networks (MANNs) in general. We evaluate our models on different long-term dependency tasks and report competitive results on all of them.
https://arxiv.org/abs/1701.08718
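As a loose illustration of the mechanism described, here is a toy Python/NumPy sketch of an external memory with discrete (argmax) read addressing and a write rule that fills empty slots first and then reuses the last-read slot; this is a simplified stand-in for the paper's tied read/write heuristic, and the controller update below is not the paper's architecture.

import numpy as np

class TardisLikeMemory:
    # Toy external memory with discrete addressing: a read picks one
    # slot by argmax of a dot-product score (hard attention); writes
    # fill empty slots first, then reuse the slot that was just read.
    def __init__(self, slots, dim):
        self.mem = np.zeros((slots, dim))
        self.used = 0
        self.last_read = 0

    def read(self, query):
        self.last_read = int(np.argmax(self.mem @ query))
        return self.mem[self.last_read]

    def write(self, hidden):
        if self.used < len(self.mem):      # memory not yet full
            self.mem[self.used] = hidden
            self.used += 1
        else:                              # full: overwrite the read slot
            self.mem[self.last_read] = hidden

rng = np.random.default_rng(1)
mem = TardisLikeMemory(slots=4, dim=8)
h = np.zeros(8)
for x in rng.normal(size=(10, 8)):         # toy input sequence
    r = mem.read(h)                        # wormhole to a past state
    h = np.tanh(x + 0.5 * r)               # toy controller update
    mem.write(h)
print(np.round(h[:4], 3))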
We propose a Convolutional Neural Network (CNN) based algorithm - StuffNet - for object detection. In addition to the standard convolutional features trained for region proposal and object detection [31], StuffNet uses convolutional features trained for segmentation of objects and ‘stuff’ (amorphous categories such as ground and water). Through experiments on Pascal VOC 2010, we show the importance of features learnt from stuff segmentation for improving object detection performance. StuffNet improves performance from 18.8% mAP to 23.9% mAP for small objects. We also devise a method to train StuffNet on datasets that do not have stuff segmentation labels. Through experiments on Pascal VOC 2007 and 2012, we demonstrate the effectiveness of this method and show that StuffNet also significantly improves object detection performance on such datasets.
https://arxiv.org/abs/1610.05861
Electrically injected deep ultra-violet (UV) emission is obtained using monolayer (ML) thin GaN/AlN quantum structures as active regions. The emission wavelength is tuned by controlling the thickness of ultrathin GaN layers with monolayer precision using plasma-assisted molecular beam epitaxy (PAMBE). Single-peaked emission spectra with narrow full width at half maximum (FWHM) are achieved for three different light-emitting diodes (LEDs) operating at 232 nm, 246 nm, and 270 nm. At 232 nm (5.34 eV), this is the shortest electroluminescence (EL) emission wavelength reported so far using GaN as the light-emitting material and employing polarization-induced doping.
https://arxiv.org/abs/1610.05651
Recurrent neural networks (RNNs) have drawn interest from machine learning researchers because of their effectiveness at preserving past inputs for time-varying data processing tasks. To understand the success and limitations of RNNs, it is critical that we advance our analysis of their fundamental memory properties. We focus on echo state networks (ESNs), which are RNNs with simple memoryless nodes and random connectivity. Most existing analyses of short-term memory (STM) capacity conclude that the ESN network size must scale linearly with the input size for unstructured inputs. The main contribution of this paper is to provide general results characterizing the STM capacity for linear ESNs with multidimensional input streams when the inputs have common low-dimensional structure: sparsity in a basis or significant statistical dependence between inputs. In both cases, we show that the number of nodes in the network must scale linearly with the information rate and poly-logarithmically with the ambient input dimension. The analysis relies on advanced applications of random matrix theory and results in explicit non-asymptotic bounds on the recovery error. Taken together, this analysis provides a significant step forward in our understanding of the STM properties of RNNs.
https://arxiv.org/abs/1605.08346
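A minimal Python/NumPy sketch of the setting analysed: a linear echo state network driven by a random input stream, with a least-squares readout trained to recover the input from a few steps back. The sizes, spectral-radius scaling, and delay below are illustrative choices, not the paper's constructions or bounds.

import numpy as np

rng = np.random.default_rng(2)
n, T, delay = 100, 2000, 5                 # reservoir size, steps, lag

# Random reservoir scaled to spectral radius 0.9 (echo state property).
W = rng.normal(size=(n, n))
W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()
w_in = rng.normal(size=n)

u = rng.normal(size=T)                     # unstructured scalar input
x = np.zeros(n)
states = np.zeros((T, n))
for t in range(T):
    x = W @ x + w_in * u[t]                # linear, memoryless nodes
    states[t] = x

# Linear readout trained to recover the input from `delay` steps back.
X, y = states[delay:], u[:-delay]
w_out, *_ = np.linalg.lstsq(X, y, rcond=None)
print("recovery correlation:", round(np.corrcoef(X @ w_out, y)[0, 1], 3))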
The application of Deep Neural Networks for ranking in search engines may obviate the need for the extensive feature engineering common to current learning-to-rank methods. However, we show that combining simple relevance matching features like BM25 with existing Deep Neural Net models often substantially improves the accuracy of these models, indicating that they do not capture essential local relevance matching signals. We describe a novel deep Recurrent Neural Net-based model that we call Match-Tensor. The architecture of the Match-Tensor model simultaneously accounts for both local relevance matching and global topicality signals allowing for a rich interplay between them when computing the relevance of a document to a query. On a large held-out test set consisting of social media documents, we demonstrate not only that Match-Tensor outperforms BM25 and other classes of DNNs but also that it largely subsumes signals present in these models.
https://arxiv.org/abs/1701.07795
This chapter describes a semantic dialogue system for radiologists in a comprehensive case study within the large-scale MEDICO project. MEDICO addresses the need for advanced semantic technologies in the search for medical image and patient data. The objectives are, first, to enable a seamless integration of medical images and different user applications by providing direct access to image semantics, and second, to design and implement a multimodal dialogue shell for the radiologist. Speech-based semantic image retrieval and annotation of medical images should provide the basis for help in clinical decision support and computer aided diagnosis. We describe the clinical workflow and interaction requirements and focus on the design and implementation of a multimodal user interface for patient/image search and annotation using a speech-based dialogue shell. Ontology modeling provides the backbone for knowledge representation in the dialogue shell and the specific medical application domain; ontology structures are the communication basis of our combined semantic search and retrieval architecture, which includes the MEDICO server, the triple store, the semantic search API, the medical visualization toolkit MITK, and the speech-based dialogue shell, amongst others. We focus on usability aspects of multimodal applications, our storyboard, and the implemented speech and touchscreen interaction design.
https://arxiv.org/abs/1701.07381
We propose a technique for making Convolutional Neural Network (CNN)-based models more transparent by visualizing the input regions that are ‘important’ for predictions, i.e., visual explanations. Our approach, called Gradient-weighted Class Activation Mapping (Grad-CAM), uses class-specific gradient information to localize important regions. These localizations are combined with existing pixel-space visualizations to create a novel high-resolution and class-discriminative visualization called Guided Grad-CAM. These methods help users better understand CNN-based models, including image captioning and visual question answering (VQA) models. We evaluate our visual explanations by measuring their ability to discriminate between classes, their ability to inspire trust in humans, and their correlation with occlusion maps. Grad-CAM provides a new way to understand CNN-based models. We have released code, an online demo hosted on CloudCV, and a full version of this extended abstract.
https://arxiv.org/abs/1611.07450
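A minimal NumPy sketch of the Grad-CAM computation itself: the channel weights are the global-average-pooled gradients of the class score, and the map is the ReLU of the weighted sum of feature maps. How the activations and gradients are extracted from a particular framework is omitted here, and the toy tensors are random stand-ins.

import numpy as np

def grad_cam(activations, gradients):
    # activations: conv feature maps A_k, shape [K, H, W];
    # gradients: dy_c/dA_k of the class score, same shape.
    alpha = gradients.mean(axis=(1, 2))            # GAP channel weights
    # Weighted sum of feature maps; ReLU keeps positive evidence only.
    cam = np.maximum(0, np.tensordot(alpha, activations, axes=1))
    return cam / (cam.max() + 1e-8)                # normalise to [0, 1]

rng = np.random.default_rng(3)
A = rng.random((64, 7, 7))         # toy activations
dA = rng.normal(size=(64, 7, 7))   # toy gradients of the class score
print(grad_cam(A, dA).shape)       # (7, 7) coarse localisation map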
Deep learning techniques lie at the heart of several significant AI advances in recent years including object recognition and detection, image captioning, machine translation, speech recognition and synthesis, and playing the game of Go. Automated first-order theorem provers can aid in the formalization and verification of mathematical theorems and play a crucial role in program analysis, theory reasoning, security, interpolation, and system verification. Here we suggest deep learning based guidance in the proof search of the theorem prover E. We train and compare several deep neural network models on the traces of existing ATP proofs of Mizar statements and use them to select processed clauses during proof search. We give experimental evidence that with a hybrid, two-phase approach, deep learning based guidance can significantly reduce the average number of proof search steps while increasing the number of theorems proved. Using a few proof guidance strategies that leverage deep neural networks, we have found first-order proofs of 7.36% of the first-order logic translations of the Mizar Mathematical Library theorems that did not previously have ATP generated proofs. This increases the ratio of statements in the corpus with ATP generated proofs from 56% to 59%.
https://arxiv.org/abs/1701.06972
We introduce multi-modal, attention-based neural machine translation (NMT) models which incorporate visual features into different parts of both the encoder and the decoder. We utilise global image features extracted using a pre-trained convolutional neural network and incorporate them (i) as words in the source sentence, (ii) to initialise the encoder hidden state, and (iii) as additional data to initialise the decoder hidden state. In our experiments, we evaluate how these different strategies to incorporate global image features compare and which ones perform best. We also study the impact that adding synthetic multi-modal, multilingual data brings and find that the additional data have a positive impact on multi-modal models. We report new state-of-the-art results and our best models also significantly improve on a comparable phrase-based Statistical MT (PBSMT) model trained on the Multi30k data set according to all metrics evaluated. To the best of our knowledge, it is the first time a purely neural model significantly improves over a PBSMT model on all metrics evaluated on this data set.
https://arxiv.org/abs/1701.06521
Image retrieval is a complex task that differs according to the context and the user requirements in any specific field, for example in a medical environment. Search by text is often not possible or optimal, and retrieval by visual content does not always succeed in modelling the high-level concepts that a user is looking for. Modern image retrieval techniques consist of multiple steps and aim to retrieve information from large-scale datasets based not only on global image appearance but on local features, and, if possible, on a connection between visual features and text or semantics. This paper presents the Parallel Distributed Image Search Engine (ParaDISE), an image retrieval system that combines visual search with text-based retrieval and that is available as open source and free of charge. The main design concepts of ParaDISE are flexibility, expandability, scalability and interoperability. These concepts make the system suitable both for real-world applications and as an image retrieval research platform. Apart from the architecture and the implementation of the system, two use cases are described: an application of ParaDISE to the retrieval of images from the medical literature, and a visual feature evaluation for medical image retrieval. Future steps include the creation of an open source community that will contribute to and expand this platform based on the existing parts.
https://arxiv.org/abs/1701.05596
A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling “where to look”, or visual attention, it is equally important to model “what words to listen to”, or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.
https://arxiv.org/abs/1606.00061
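A toy sketch of one parallel co-attention step in the spirit of the model: an affinity matrix between question words and image regions yields attention over each modality conditioned on the other. The bilinear form and max-pooling below are simplifications; the paper's full model adds a question hierarchy built with 1-dimensional convolutions.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(4)
d, n_words, n_regions = 32, 12, 49
Q = rng.normal(size=(n_words, d))      # question word features
V = rng.normal(size=(n_regions, d))    # image region features
W = rng.normal(size=(d, d)) * 0.1      # learned bilinear weights

C = np.tanh(Q @ W @ V.T)               # word-region affinity matrix
a_regions = softmax(C.max(axis=0))     # image attention (given words)
a_words = softmax(C.max(axis=1))       # question attention (given regions)
v_att = a_regions @ V                  # attended image feature
q_att = a_words @ Q                    # attended question feature
print(v_att.shape, q_att.shape)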
We investigate the dynamical role of inhibitory and highly connected nodes (hubs) in the synchronization and input processing of leaky integrate-and-fire neural networks with short-term synaptic plasticity. We take advantage of a heterogeneous mean-field approximation to encode the role of network structure, and we tune the fraction of inhibitory neurons $f_I$ and their connectivity level to investigate the cooperation between hub features and inhibition. We show that, depending on $f_I$, highly connected inhibitory nodes strongly drive the synchronization properties of the overall network through dynamical transitions from synchronous to asynchronous regimes. Furthermore, a metastable regime with long memory of external inputs emerges for a specific fraction of hub inhibitory neurons, underlining the role of inhibition and connectivity in input processing in neural networks as well.
https://arxiv.org/abs/1701.05056
We explore an alternative way to fabricate (In,Ga)N/GaN short-period superlattices on GaN(0001) by plasma-assisted molecular beam epitaxy. We exploit the existence of an In adsorbate structure manifesting itself by a $(\sqrt{3}\times\sqrt{3})\mathrm{R}30^{\circ}$ surface reconstruction observed in-situ by reflection high-energy electron diffraction. This In adlayer accommodates a maximum of 1/3 monolayer of In on the GaN surface and, under suitable conditions, can be embedded into GaN to form an In$_{0.33}$Ga$_{0.67}$N quantum sheet whose width is naturally limited to a single monolayer. Periodically inserting these quantum sheets, we synthesize (In,Ga)N/GaN short-period superlattices with abrupt interfaces and high periodicity as demonstrated by x-ray diffractometry and scanning transmission electron microscopy. The embedded quantum sheets are found to consist of single monolayers with an In content of 0.25-0.29. For a barrier thickness of 6 monolayers, the superlattice gives rise to a photoluminescence band at 3.16 eV, close to the theoretically predicted values for these structures.
https://arxiv.org/abs/1701.04680
Visual scene decomposition into semantic entities is one of the major challenges when creating a reliable object grasping system. Recently, we introduced a bottom-up hierarchical clustering approach which is able to segment objects and parts in a scene. In this paper, we introduce a transform from such a segmentation into a corresponding, hierarchical saliency function. In comprehensive experiments we demonstrate its ability to detect salient objects in a scene. Furthermore, this hierarchical saliency defines a most salient corresponding region (scale) for every point in an image. Based on this, an easy-to-use pick-and-place manipulation system was developed and tested on example scenarios.
https://arxiv.org/abs/1701.04284
The classification of MRI images according to the anatomical field of view is a necessary task to solve when faced with the increasing quantity of medical images. In parallel, advances in deep learning have made it a suitable tool for computer vision problems. Using a common architecture (such as AlexNet) provides quite good results, but not sufficient for clinical use. Improving the model is not an easy task, due to the large number of hyper-parameters governing both the architecture and the training of the network, and to the limited understanding of their relevance. Since an exhaustive search is not tractable, we propose to optimize the network first by random search, and then by an adaptive search based on Gaussian Processes and Probability of Improvement. Applying this method on a large and varied MRI dataset, we show a substantial improvement between the baseline network and the final one (up to 20% for the most difficult classes).
https://arxiv.org/abs/1701.04355
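A compact sketch of the two-stage search described, assuming scikit-learn's GaussianProcessRegressor: a few random evaluations, then an adaptive loop that picks the next hyperparameter by Probability of Improvement, $PI(x) = \Phi\big((\mu(x) - y_{\text{best}})/\sigma(x)\big)$. The accuracy function below is a toy stand-in for training the network, and the search space is reduced to a single learning rate.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(5)

def accuracy(lr):                      # stand-in for training a network
    return -(np.log10(lr) + 3) ** 2 + rng.normal(scale=0.05)

# Random-search phase: a few initial evaluations.
X = rng.uniform(-5, -1, size=(5, 1))          # log10 learning rates
y = np.array([accuracy(10 ** x[0]) for x in X])

# Adaptive phase: fit a GP, pick the next point by Probability of
# Improvement over the best observation so far.
for _ in range(10):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    cand = np.linspace(-5, -1, 200).reshape(-1, 1)
    mu, sigma = gp.predict(cand, return_std=True)
    pi = norm.cdf((mu - y.max()) / (sigma + 1e-9))
    x_next = cand[np.argmax(pi)]
    X = np.vstack([X, x_next])
    y = np.append(y, accuracy(10 ** x_next[0]))

print("best log10(lr):", X[np.argmax(y)][0])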
We consider the task of identifying attitudes towards a given set of entities from text. Conventionally, this task is decomposed into two separate subtasks: target detection that identifies whether each entity is mentioned in the text, either explicitly or implicitly, and polarity classification that classifies the exact sentiment towards an identified entity (the target) into positive, negative, or neutral. Instead, we show that attitude identification can be solved with an end-to-end machine learning architecture, in which the two subtasks are interleaved by a deep memory network. In this way, signals produced in target detection provide clues for polarity classification, and reversely, the predicted polarity provides feedback to the identification of targets. Moreover, the treatments for the set of targets also influence each other – the learned representations may share the same semantics for some targets but vary for others. The proposed deep memory network, the AttNet, outperforms methods that do not consider the interactions between the subtasks or those among the targets, including conventional machine learning methods and the state-of-the-art deep learning models.
https://arxiv.org/abs/1701.04189
Descriptions are often provided along with recommendations to aid users' discovery. Recommending automatically generated music playlists (e.g. personalised playlists) introduces the problem of generating descriptions. In this paper, we propose a method for generating music playlist descriptions, which we call music captioning. In the proposed method, audio content analysis and natural language processing are adopted to exploit the information in each track.
https://arxiv.org/abs/1608.04868
We studied electric current and noise in planar GaN nanowires (NWs). The results obtained at low voltages provide us with estimates of the depletion effects in the NWs. At larger voltages, we observed the space-charge limited current (SCLC) effect; the onset of the effect clearly correlates with the NW width. For narrow NWs the mature SCLC regime was achieved. This effect strongly influences the fluctuation characteristics of the studied NWs. At low voltages, we found that the normalized noise level increases with decreasing NW width. In the SCLC regime, a further increase in the normalized noise intensity (by up to $10^4$ times) was observed, as well as a change in the shape of the spectra with a tendency towards a slope of -3/2. We suggest that the features of the electric current and noise found in the NWs are of a general character and will have an impact on the development of NW-based devices.
https://arxiv.org/abs/1701.03970
This paper describes QCRI's machine translation systems for the IWSLT 2016 evaluation campaign. We participated in the Arabic->English and English->Arabic tracks. We built both phrase-based and neural machine translation models, in an effort to probe whether the newly emerged NMT framework surpasses traditional phrase-based systems on Arabic-English language pairs. We trained a very strong phrase-based system including a big language model, the Operation Sequence Model, the Neural Network Joint Model (NNJM), and class-based models, along with different domain adaptation techniques such as MML filtering, mixture modeling, and fine-tuning of the NNJM model. However, a neural MT system, trained by stacking data from different genres through fine-tuning and applying an ensemble over 8 models, beat our very strong phrase-based system by a significant margin of 2 BLEU points in the Arabic->English direction. We did not obtain similar gains in the other direction but were still able to outperform the phrase-based system. We also applied system combination on the phrase-based and NMT outputs.
https://arxiv.org/abs/1701.03924
Standard LSTM recurrent neural networks, while very powerful in sequence applications with long-range dependencies, have a highly complex structure and a relatively large number of (adaptive) parameters. In this work, we present an empirical comparison between the standard LSTM recurrent neural network architecture and three new parameter-reduced variants obtained by eliminating combinations of the input signal, bias, and hidden unit signals from the individual gating signals. Experiments on two sequence datasets show that the three new variants, called simply LSTM1, LSTM2, and LSTM3, can achieve performance comparable to the standard LSTM model with fewer (adaptive) parameters.
https://arxiv.org/abs/1701.03441
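A sketch of the kind of gate reduction described: one LSTM step where the gating signals (but not the candidate update) optionally drop the input term, the bias, or both. Which exact combinations define LSTM1-3 is not spelled out in the abstract, so the variant names below are our illustrative guesses.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h, c, P, variant="standard"):
    # P holds weight matrices W* (input), U* (hidden) and biases b*
    # for gates i, f, o and candidate g. Variants reduce the *gating*
    # signals only.
    def gate(Wx, Uh, b):
        if variant == "standard":
            return sigmoid(Wx + Uh + b)
        if variant == "no_input":      # gates ignore the input signal
            return sigmoid(Uh + b)
        if variant == "no_bias":       # gates ignore the bias
            return sigmoid(Wx + Uh)
        if variant == "hidden_only":   # gates use the hidden state alone
            return sigmoid(Uh)
        raise ValueError(variant)

    i = gate(P["Wi"] @ x, P["Ui"] @ h, P["bi"])
    f = gate(P["Wf"] @ x, P["Uf"] @ h, P["bf"])
    o = gate(P["Wo"] @ x, P["Uo"] @ h, P["bo"])
    g = np.tanh(P["Wg"] @ x + P["Ug"] @ h + P["bg"])  # candidate kept full
    c = f * c + i * g
    return o * np.tanh(c), c

rng = np.random.default_rng(6)
d = 8
P = {k: rng.normal(scale=0.1, size=(d, d)) for k in
     ("Wi", "Ui", "Wf", "Uf", "Wo", "Uo", "Wg", "Ug")}
P.update({k: np.zeros(d) for k in ("bi", "bf", "bo", "bg")})
h = c = np.zeros(d)
h, c = lstm_step(rng.normal(size=d), h, c, P, variant="no_input")
print(np.round(h[:4], 3))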
We consider generation and comprehension of natural language referring expression for objects in an image. Unlike generic “image captioning” which lacks natural standard evaluation criteria, quality of a referring expression may be measured by the receiver’s ability to correctly infer which object is being described. Following this intuition, we propose two approaches to utilize models trained for comprehension task to generate better expressions. First, we use a comprehension module trained on human-generated expressions, as a “critic” of referring expression generator. The comprehension module serves as a differentiable proxy of human evaluation, providing training signal to the generation module. Second, we use the comprehension module in a generate-and-rerank pipeline, which chooses from candidate expressions generated by a model according to their performance on the comprehension task. We show that both approaches lead to improved referring expression generation on multiple benchmark datasets.
https://arxiv.org/abs/1701.03439
Scalability is an important characteristic of cloud computing. With scalability, cost is minimized by provisioning and releasing resources according to demand. Most current Infrastructure as a Service (IaaS) providers deliver threshold-based auto-scaling techniques. However, setting thresholds to the right values, ones that minimize cost and meet Service Level Agreements, is not an easy task, especially under variable and sudden workload changes. This paper proposes dynamic threshold-based auto-scaling algorithms that predict required resources using a Long Short-Term Memory (LSTM) recurrent neural network and auto-scale virtual resources based on the predicted values. The proposed algorithms are evaluated and compared with existing algorithms. Experimental results show that the proposed algorithms outperform the other algorithms.
https://arxiv.org/abs/1701.03295
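A minimal sketch of the predictive scaling decision: forecast the next-interval workload with a trained model and provision capacity for it, rather than reacting to a static utilisation threshold. The predictor below is a naive trend extrapolation standing in for the trained LSTM, and the capacity and headroom numbers are illustrative assumptions.

import math

def scale_decision(history, predictor, vm_capacity=100, headroom=1.2):
    # Forecast next-interval load (e.g. requests/s) and provision
    # just enough VMs, with a safety headroom factor.
    predicted_load = predictor(history)
    return max(1, math.ceil(headroom * predicted_load / vm_capacity))

# Toy stand-in for the LSTM forecaster: naive trend extrapolation.
def toy_predictor(history):
    return history[-1] + (history[-1] - history[-2])

load = [220, 260, 310, 380]            # recent workload samples
print("VMs to provision:", scale_decision(load, toy_predictor))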
Translating in real-time, a.k.a. simultaneous translation, outputs translation words before the input sentence ends, which is a challenging problem for conventional machine translation methods. We propose a neural machine translation (NMT) framework for simultaneous translation in which an agent learns to make decisions on when to translate from the interaction with a pre-trained NMT environment. To trade off quality and delay, we extensively explore various targets for delay and design a method for beam-search applicable in the simultaneous MT setting. Experiments against state-of-the-art baselines on two language pairs demonstrate the efficacy of the proposed framework both quantitatively and qualitatively.
https://arxiv.org/abs/1610.00388
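A toy sketch of the agent loop described: at each step a policy chooses to READ another source token or WRITE a target word, trading delay against quality. Both the policy (a fixed wait-2 rule here) and the prefix translator are stand-ins for the learned agent and the pre-trained NMT environment.

def simultaneous_translate(source_tokens, policy, translate_prefix):
    # Interleave READ/WRITE actions until the translator emits </s>.
    read, output = [], []
    while True:
        action = policy(read, output)
        if action == "READ" and len(read) < len(source_tokens):
            read.append(source_tokens[len(read)])
        else:
            word = translate_prefix(read, output)
            if word == "</s>":
                return output
            output.append(word)

# Toy stand-ins: stay two tokens ahead, then emit the source verbatim.
policy = lambda r, o: "READ" if len(r) < len(o) + 2 else "WRITE"
nmt = lambda r, o: r[len(o)] if len(o) < len(r) else "</s>"
print(simultaneous_translate("wir sehen uns morgen".split(), policy, nmt))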
Neural Machine Translation (NMT) is a new approach to Machine Translation (MT), and due to its success it has attracted the attention of many researchers in the field. In this paper, we study an NMT model on the Persian-English language pair, analyzing the model and investigating its appropriateness for scarce-resource scenarios, the situation that exists for Persian-centered translation systems. We adjust the model for the Persian language and find the best parameters and hyperparameters for two tasks: translation and transliteration. We also apply preprocessing to the Persian dataset, which yields an improvement of about one BLEU point. In addition, we modify the loss function to enhance the word alignment of the model; this new loss function yields a total improvement of 1.87 BLEU points in translation quality.
https://arxiv.org/abs/1701.01854
We aim to study the modeling limitations of the commonly employed boosted decision trees classifier. Inspired by the success of large, data-hungry visual recognition models (e.g. deep convolutional neural networks), this paper focuses on the relationship between modeling capacity of the weak learners, dataset size, and dataset properties. A set of novel experiments on the Caltech Pedestrian Detection benchmark results in the best known performance among non-CNN techniques while operating at fast run-time speed. Furthermore, the performance is on par with deep architectures (9.71% log-average miss rate), while using only HOG+LUV channels as features. The conclusions from this study are shown to generalize over different object detection domains as demonstrated on the FDDB face detection benchmark (93.37% accuracy). Despite the impressive performance, this study reveals the limited modeling capacity of the common boosted trees model, motivating a need for architectural changes in order to compete with multi-level and very deep architectures.
https://arxiv.org/abs/1701.01692
Hashtags have become a powerful tool in social platforms such as Twitter to categorize and search for content, and to spread short messages across members of the social network. In this paper, we study temporal hashtag usage practices in Twitter with the aim of designing a cognitive-inspired hashtag recommendation algorithm we call BLL$_{I,S}$. Our main idea is to incorporate the effect of time on (i) individual hashtag reuse (i.e., reusing one's own hashtags), and (ii) social hashtag reuse (i.e., reusing hashtags that have previously been used by a followee) into a predictive model. For this, we turn to the Base-Level Learning (BLL) equation from the cognitive architecture ACT-R, which accounts for the time-dependent decay of item exposure in human memory. We validate BLL$_{I,S}$ using two crawled Twitter datasets in two evaluation scenarios: in the first, only temporal usage patterns of past hashtag assignments are utilized; in the second, these patterns are combined with a content-based analysis of the current tweet. In both scenarios, we find not only that temporal effects play an important role for both individual and social hashtag reuse but also that BLL$_{I,S}$ provides significantly better prediction accuracy and ranking results than current state-of-the-art hashtag recommendation methods.
https://arxiv.org/abs/1701.01276
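For reference, the Base-Level Learning activation the method builds on is $B = \ln\big(\sum_j (t_{\text{now}} - t_j)^{-d}\big)$, where the $t_j$ are a hashtag's past usage times and $d$ is a power-law decay. Below is a short sketch; the default $d = 0.5$ is the common ACT-R choice, and the paper's exact parameterisation may differ.

import math
import time

def bll_score(timestamps, now=None, d=0.5):
    # Base-Level Learning activation: recent and frequent items score
    # highest, with power-law decay of past exposures.
    now = now or time.time()
    return math.log(sum((now - t) ** -d for t in timestamps))

now = time.time()
hour = 3600.0
old_but_frequent = [now - k * 24 * hour for k in range(1, 6)]
recent_single = [now - hour]
# A single recent use can outscore several day-old uses.
print(bll_score(old_but_frequent, now), bll_score(recent_single, now))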
Diagnosis of a clinical condition is a challenging task, which often requires significant medical investigation. Previous work on diagnostic inferencing problems mostly considers multivariate observational data (e.g., physiological signals, lab tests). In contrast, we explore the problem using free-text medical notes recorded in an electronic health record (EHR). Complex tasks like these can benefit from structured knowledge bases, but those are not scalable. We instead exploit raw text from Wikipedia as a knowledge source. Memory networks have been demonstrated to be effective in tasks which require comprehension of free-form text; they use the final iteration of the learned representation to predict probable classes. We introduce condensed memory neural networks (C-MemNNs), a novel model with iterative condensation of memory representations that preserves the hierarchy of features in the memory. Experiments on the MIMIC-III dataset show that the proposed model outperforms other variants of memory networks in predicting the most probable diagnoses given a complex clinical scenario.
https://arxiv.org/abs/1612.01848
Neural Machine Translation (NMT) is a new approach to machine translation that has made great progress in recent years. However, recent studies show that NMT generally produces fluent but inadequate translations (Tu et al. 2016b; Tu et al. 2016a; He et al. 2016; Tu et al. 2017). This is in contrast to conventional Statistical Machine Translation (SMT), which usually yields adequate but non-fluent translations. It is natural, therefore, to leverage the advantages of both models for better translations, and in this work we propose to incorporate an SMT model into the NMT framework. More specifically, at each decoding step, SMT offers additional recommendations of generated words based on the decoding information from NMT (e.g., the generated partial translation and attention history). We then employ an auxiliary classifier to score the SMT recommendations and a gating function to combine the SMT recommendations with the NMT generations, both of which are jointly trained within the NMT architecture in an end-to-end manner. Experimental results on Chinese-English translation show that the proposed approach achieves significant and consistent improvements over state-of-the-art NMT and SMT systems on multiple NIST test sets.
https://arxiv.org/abs/1610.05150
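A toy sketch of the combination step described: the NMT distribution and the auxiliary classifier's scores over SMT recommendations are interpolated by a learned gate, $p(w) = g \cdot p_{\text{NMT}}(w) + (1-g) \cdot p_{\text{SMT}}(w)$. The shapes and the gate's input features below are illustrative assumptions.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def combine_step(nmt_logits, smt_scores, gate_features, w_gate):
    # Mix the two word distributions with a sigmoid gate g in (0, 1).
    p_nmt = softmax(nmt_logits)
    p_smt = softmax(smt_scores)          # auxiliary classifier scores
    g = 1 / (1 + np.exp(-(w_gate @ gate_features)))
    return g * p_nmt + (1 - g) * p_smt

rng = np.random.default_rng(7)
vocab = 10
p = combine_step(rng.normal(size=vocab), rng.normal(size=vocab),
                 rng.normal(size=16), rng.normal(size=16) * 0.1)
print(p.sum())   # a proper distribution: sums to 1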
Identifying musical instruments in polyphonic music recordings is a challenging but important problem in the field of music information retrieval. It enables music search by instrument, helps recognize musical genres, and can make music transcription easier and more accurate. In this paper, we present a convolutional neural network framework for predominant instrument recognition in real-world polyphonic music. We train our network on fixed-length music excerpts with a single labeled predominant instrument and estimate an arbitrary number of predominant instruments from an audio signal of variable length. To obtain the audio-excerpt-wise result, we aggregate multiple outputs from sliding windows over the test audio. In doing so, we investigated two different aggregation methods: one takes the average for each instrument, and the other takes the instrument-wise sum followed by normalization. In addition, we conducted extensive experiments on several important factors that affect performance, including the analysis window size, the identification threshold, and the activation functions of the neural networks, to find the optimal set of parameters. Using a dataset of 10k audio excerpts from 11 instruments for evaluation, we found that convolutional neural networks are more robust than conventional methods that exploit spectral features and source separation with support vector machines. Experimental results showed that the proposed convolutional network architecture obtained F1 measures of 0.602 (micro) and 0.503 (macro), improvements of 19.6% and 16.4%, respectively, over other state-of-the-art algorithms.
https://arxiv.org/abs/1605.09507
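A short sketch of the two aggregation strategies compared: averaging the sliding-window outputs per instrument versus summing them and normalising by the maximum, with a threshold applied afterwards. The names S1/S2 and the threshold value are ours.

import numpy as np

def aggregate(window_outputs, method="s2"):
    # window_outputs: sigmoid outputs, shape [n_windows, n_instruments].
    if method == "s1":                      # instrument-wise average
        return window_outputs.mean(axis=0)
    summed = window_outputs.sum(axis=0)     # instrument-wise sum...
    return summed / summed.max()            # ...followed by normalisation

rng = np.random.default_rng(8)
outs = rng.random((20, 11))                 # 20 windows, 11 instruments
labels = aggregate(outs) >= 0.8             # illustrative threshold
print(labels.astype(int))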
A recurring topic in interstellar exploration and the search for extraterrestrial intelligence (SETI) is the role of artificial intelligence. More precisely, these are programs or devices that are capable of performing cognitive tasks previously associated with humans, such as image recognition, reasoning, and decision-making. Such systems are likely to play an important role in future deep space missions, notably interstellar exploration, where the spacecraft needs to act autonomously. This article explores the drivers for an interstellar mission with a computation-heavy payload and provides an outline of a spacecraft and mission architecture that supports such a payload. Based on existing technologies and extrapolations of current trends, it is shown that AI spacecraft development and operation will be constrained and driven by three aspects: power requirements for the payload, power generation capabilities, and heat rejection capabilities. A likely mission architecture for such a probe is to get into an orbit close to the star in order to generate maximum power for computational activities, and then to prepare for further exploration activities. Given current rates of increase in computational power, a payload with computational power similar to that of the human brain would have a mass of hundreds of tons down to dozens of tons in a 2050-2060 timeframe.
https://arxiv.org/abs/1612.08733
We introduce a new multi-modal task for computer systems, posed as a combined vision-language comprehension challenge: identifying the most suitable text describing a scene, given several similar options. Accomplishing the task entails demonstrating comprehension beyond just recognizing “keywords” (or key-phrases) and their corresponding visual concepts. Instead, it requires an alignment between the representations of the two modalities that achieves a visually-grounded “understanding” of various linguistic elements and their dependencies. This new task also admits an easy-to-compute and well-studied metric: the accuracy in detecting the true target among the decoys. The paper makes several contributions: an effective and extensible mechanism for generating decoys from (human-created) image captions; an instance of applying this mechanism, yielding a large-scale machine comprehension dataset (based on the COCO images and captions) that we make publicly available; human evaluation results on this dataset, informing a performance upper-bound; and several baseline and competitive learning approaches that illustrate the utility of the proposed task and dataset in advancing both image and language comprehension. We also show that, in a multi-task learning setting, the performance on the proposed task is positively correlated with the end-to-end task of image captioning.
https://arxiv.org/abs/1612.07833
The task of generating natural language descriptions from images has received a lot of attention in recent years. Consequently, it is becoming increasingly important to evaluate such image captioning approaches in an automatic manner. In this paper, we provide an in-depth evaluation of the existing image captioning metrics through a series of carefully designed experiments. Moreover, we explore the utilization of the recently proposed Word Mover's Distance (WMD) document metric for the purpose of evaluating image captioning. Our findings outline the differences and/or similarities between metrics and their relative robustness by means of extensive correlation-, accuracy- and distraction-based evaluations. Our results also demonstrate that WMD provides strong advantages over other metrics.
https://arxiv.org/abs/1612.07600
Several methods exist to infer causal networks from massive volumes of observational data. However, almost all existing methods require a considerable length of time-series data to capture cause-and-effect relationships. In contrast, memory-less transition networks, or Markov chain data, which refer to one-step transitions to and from an event, have not been explored for causality inference even though such data are widely available. We find that a causal network can be inferred from the characteristics of four unique distribution zones around each event. We call this the Composition of Transitions and show that cause, effect, and random events exhibit different behavior in their compositions. We applied machine learning models to learn these different behaviors and to infer causality. We name this new method Causality Inference using Composition of Transitions (CICT). To evaluate CICT, we used an administrative inpatient healthcare dataset to set up a network of patient transitions between different diagnoses. We show that CICT is highly accurate in inferring whether the transition between a pair of events is causal or random and performs well in identifying the direction of causality in a bi-directional association.
https://arxiv.org/abs/1608.02658
Neural Machine Translation (NMT) is a new approach for automatic translation of text from one human language into another. The basic concept in NMT is to train a large neural network that maximizes the translation performance on a given parallel corpus. NMT is gaining popularity in the research community because it has outperformed traditional SMT approaches in several translation tasks at WMT and other evaluation benchmarks, at least for some language pairs. However, many of the enhancements made to SMT over the years have not been incorporated into the NMT framework. In this paper, we focus on one such enhancement, namely domain adaptation. We propose an approach for adapting an NMT system to a new domain. The main idea behind domain adaptation is to exploit the availability of large out-of-domain training data together with a small amount of in-domain training data. We report significant gains with our proposed method in both automatic metrics and a human subjective evaluation metric on two language pairs. With our adaptation method, we show large improvements on the new domain while the performance on the general domain degrades only slightly. In addition, our approach is fast enough to adapt an already trained system to a new domain within a few hours, without the need to retrain the NMT model on the combined data, which usually takes several days or weeks depending on the volume of the data.
https://arxiv.org/abs/1612.06897
A dominant paradigm for deep learning based object detection relies on a “bottom-up” approach using “passive” scoring of class-agnostic proposals. These approaches are efficient but lack a holistic analysis of scene-level context. In this paper, we present an “action-driven” detection mechanism using our “top-down” visual attention model. We localize an object by taking sequential actions that the attention model provides. The attention model, conditioned on an image region, provides the actions required to get closer to a target object. An action at each time step is weak by itself, but an ensemble of sequential actions makes a bounding box converge accurately to a target object boundary. This attention model, which we call AttentionNet, is composed of a convolutional neural network. During our whole detection procedure, we only utilize the actions from a single AttentionNet, without any modules for object proposals or post-hoc bounding-box regression. We evaluate our top-down detection mechanism on the PASCAL VOC series and the ILSVRC CLS-LOC dataset, and achieve state-of-the-art performance compared to the major bottom-up detection methods. In particular, our detection mechanism shows a strong advantage in precise localization, outperforming Faster R-CNN by a margin of +7.1% on PASCAL VOC 2007 when we increase the IoU threshold for positive detection to 0.7.
https://arxiv.org/abs/1612.06704
Word sense disambiguation helps identify the proper sense of ambiguous words in text. With large terminologies such as the UMLS Metathesaurus, ambiguities appear and highly effective disambiguation methods are required. Supervised learning methods are one of the approaches used to perform disambiguation: features extracted from the context of an ambiguous word are used to identify its proper sense. The types of features used have an impact on machine learning methods and thus affect disambiguation performance. In this work, we have evaluated several types of features derived from the context of the ambiguous word, and have also explored more global features derived from MEDLINE using word embeddings. Results show that word embeddings improve the performance of more traditional features and also enable recurrent neural network classifiers based on Long Short-Term Memory (LSTM) nodes. The combination of unigrams and word embeddings with an SVM sets a new state-of-the-art performance, with a macro accuracy of 95.97 on the MSH WSD data set.
https://arxiv.org/abs/1604.02506
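A small scikit-learn sketch of the winning feature combination: unigram context features concatenated with an averaged word embedding of the context, feeding a linear SVM. The embeddings here are random stand-ins for the MEDLINE-derived vectors, and the tiny corpus is fabricated for illustration.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

rng = np.random.default_rng(9)
# Hypothetical pretrained embeddings; the paper derives them from
# MEDLINE, random vectors here keep the sketch self-contained.
vecs = {w: rng.normal(size=50) for w in
        "cold virus symptom weather temperature patient today".split()}

def avg_embedding(sentence):
    # Context feature: average embedding of words around the ambiguous term.
    hits = [vecs[w] for w in sentence.split() if w in vecs]
    return np.mean(hits, axis=0) if hits else np.zeros(50)

contexts = ["patient has cold virus symptom",
            "cold weather temperature today",
            "virus symptom cold in patient",
            "temperature drops cold weather"]
senses = ["illness", "temperature", "illness", "temperature"]

# Unigram features + averaged embeddings, feeding a linear SVM.
cv = CountVectorizer()
unigrams = cv.fit_transform(contexts).toarray()
X = np.hstack([unigrams, [avg_embedding(s) for s in contexts]])
clf = LinearSVC().fit(X, senses)

test = "cold virus in a patient"
x = np.hstack([cv.transform([test]).toarray()[0], avg_embedding(test)])
print(clf.predict([x]))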
We propose a scalable approach to learning video-based question answering (QA): answering a “free-form natural language question” about the content of a video. Our approach automatically harvests a large number of videos and descriptions freely available online. Then, a large number of candidate QA pairs are automatically generated from the descriptions rather than manually annotated. Next, we use these candidate QA pairs to train a number of video-based QA methods extended from MN (Sukhbaatar et al. 2015), VQA (Antol et al. 2015), SA (Yao et al. 2015), and SS (Venugopalan et al. 2015). In order to handle non-perfect candidate QA pairs, we propose a self-paced learning procedure to iteratively identify them and mitigate their effects during training. Finally, we evaluate performance on manually generated video-based QA pairs. The results show that our self-paced learning procedure is effective, and the extended SS model outperforms various baselines.
https://arxiv.org/abs/1611.04021
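A generic self-paced learning sketch of the kind alluded to: fit, keep only examples whose loss falls under a growing threshold, and refit, so noisy automatically generated pairs are excluded early. The schedule and the toy regression instance are illustrative, not the paper's QA setup.

import numpy as np

def self_paced_training(X, y, fit, loss, rounds=5, lam=1.0, growth=1.5):
    # `fit` trains a model on a subset; `loss` scores per-example loss.
    model = fit(X, y)                       # initial fit on everything
    for _ in range(rounds):
        losses = loss(model, X, y)
        keep = losses < lam                 # easy examples only
        if keep.any():
            model = fit(X[keep], y[keep])
        lam *= growth                       # let harder examples in
    return model

# Toy regression instance with a corrupted (noisy) subset.
rng = np.random.default_rng(10)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
y[:20] += rng.normal(scale=10, size=20)     # "non-perfect" pairs

fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
loss = lambda w, X, y: (X @ w - y) ** 2
w = self_paced_training(X, y, fit, loss)
print(np.round(w, 2))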
Domain adaptation is a key feature in Machine Translation. It generally encompasses terminology, domain and style adaptation, especially for human post-editing workflows in Computer Assisted Translation (CAT). With Neural Machine Translation (NMT), we introduce a new notion of domain adaptation that we call "specialization", which shows promising results in both learning speed and adaptation accuracy. In this paper, we propose to explore this approach from several perspectives.
https://arxiv.org/abs/1612.06141
Text simplification aims at reducing the lexical, grammatical and structural complexity of a text while keeping the same meaning. In the context of machine translation, we introduce the idea of simplified translations in order to boost the learning ability of deep neural translation models. We conduct preliminary experiments showing that translation complexity is actually reduced when a neural machine translation (NMT) system trained on a bi-text translates that bi-text's source side, compared to the bi-text's target reference. Building on the idea of knowledge distillation, we then train an NMT system using the simplified bi-text and show that it outperforms the initial system built on the reference data set. Performance is further boosted when both reference and automatic translations are used to learn the network. We perform an elementary analysis of the translated corpus and report accuracy results of the proposed approach on English-to-French and English-to-German translation tasks.
https://arxiv.org/abs/1612.06139
In this project we propose a new approach to emotion recognition using web-based similarity measures (e.g. confidence, PMI and PMING). We aim to extract basic emotions from short sentences with emotional content (e.g. news titles, tweets, captions), performing a web-based quantitative evaluation of semantic proximity between each word of the analyzed sentence and each emotion of a psychological model (e.g. Plutchik, Ekman, Lovheim). The phases of the extraction include: text preprocessing (tokenization, stop words, filtering), automated search engine queries, HTML parsing of the results (i.e. scraping), estimation of semantic proximity, and ranking of emotions according to the proximity measures. The main idea is that, since semantic similarity can be generalized under the assumption that similar concepts co-occur in documents indexed by search engines, emotions can be generalized in the same way, through the tags or terms that express them in a particular language, yielding a ranking of emotions. Training results are compared to human evaluation; then additional comparative tests are performed, both on the global ranking correlation (e.g. Kendall, Spearman, Pearson) and on the evaluation of the emotion linked to each single word. Unlike sentiment analysis, our approach works at a deeper level of abstraction, aiming to recognize specific emotions and not only positive/negative sentiment, in order to predict emotions as semantic data.
https://arxiv.org/abs/1612.05734
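A minimal sketch of ranking emotions by PMI computed from search-engine hit counts, $\mathrm{PMI}(w,e) = \log\big(N \cdot \mathrm{hits}(w,e) / (\mathrm{hits}(w)\,\mathrm{hits}(e))\big)$. The hit counts below are invented; in the described system they would come from automated queries and HTML scraping.

import math

def pmi(word, emotion, hits, total_docs):
    # Pointwise mutual information from (co-)occurrence hit counts.
    joint = hits[f"{word} {emotion}"]
    if joint == 0:
        return float("-inf")
    return math.log(total_docs * joint / (hits[word] * hits[emotion]))

# Illustrative hit counts (made up for the sketch).
hits = {"earthquake": 90_000, "fear": 500_000, "joy": 700_000,
        "earthquake fear": 12_000, "earthquake joy": 400}
N = 50_000_000

ranking = sorted(["fear", "joy"],
                 key=lambda e: pmi("earthquake", e, hits, N),
                 reverse=True)
print(ranking)    # emotions ranked by semantic proximity to the word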
Much recent progress in Vision-to-Language problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we first propose a method of incorporating high-level concepts into the successful CNN-RNN approach, and show that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering. We further show that the same mechanism can be used to incorporate external knowledge, which is critically important for answering high level visual questions. Specifically, we design a visual question answering model that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. It particularly allows questions to be asked about the contents of an image, even when the image itself does not contain a complete answer. Our final model achieves the best reported results on both image captioning and visual question answering on several benchmark datasets.
https://arxiv.org/abs/1603.02814
One of the most intriguing features of the Visual Question Answering (VQA) challenge is the unpredictability of the questions. Extracting the information required to answer them demands a variety of image operations from detection and counting, to segmentation and reconstruction. To train a method to perform even one of these operations accurately from {image,question,answer} tuples would be challenging, but to aim to achieve them all with a limited set of such training data seems ambitious at best. We propose here instead a more general and scalable approach which exploits the fact that very good methods to achieve these operations already exist, and thus do not need to be trained. Our method thus learns how to exploit a set of external off-the-shelf algorithms to achieve its goal, an approach that has something in common with the Neural Turing Machine. The core of our proposed method is a new co-attention model. In addition, the proposed approach generates human-readable reasons for its decision, and can still be trained end-to-end without ground truth reasons being given. We demonstrate the effectiveness on two publicly available datasets, Visual Genome and VQA, and show that it produces the state-of-the-art results in both cases.
https://arxiv.org/abs/1612.05386
Automating the detection of anomalous events within long video sequences is challenging due to the ambiguity of how such events are defined. We approach the problem by learning generative models that can identify anomalies in videos using limited supervision. We propose end-to-end trainable composite Convolutional Long Short-Term Memory (Conv-LSTM) networks that are able to predict the evolution of a video sequence from a small number of input frames. Regularity scores are derived from the reconstruction errors of a set of predictions with abnormal video sequences yielding lower regularity scores as they diverge further from the actual sequence over time. The models utilize a composite structure and examine the effects of conditioning in learning more meaningful representations. The best model is chosen based on the reconstruction and prediction accuracy. The Conv-LSTM models are evaluated both qualitatively and quantitatively, demonstrating competitive results on anomaly detection datasets. Conv-LSTM units are shown to be an effective tool for modeling and predicting video sequences.
https://arxiv.org/abs/1612.00390
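A short sketch of turning per-frame reconstruction errors into regularity scores with the common min-max formulation $s(t) = 1 - (e(t) - \min_t e)/(\max_t e - \min_t e)$; frames whose predictions diverge from the actual sequence score low. The error trace below is fabricated for illustration.

import numpy as np

def regularity_scores(errors):
    # Map per-frame reconstruction errors e(t) to scores in [0, 1].
    e = np.asarray(errors, dtype=float)
    return 1.0 - (e - e.min()) / (e.max() - e.min() + 1e-12)

# Toy error trace: an anomaly around frames 5-7 inflates the error.
errors = [2.1, 2.0, 2.3, 2.2, 2.4, 9.5, 11.0, 8.7, 2.5, 2.2]
scores = regularity_scores(errors)
print(np.round(scores, 2))
print("anomalous frames:", np.where(scores < 0.5)[0])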
Important high-level vision tasks such as human-object interaction, image captioning and robotic manipulation require rich semantic descriptions of objects at the part level. Building upon previous work on part localization, in this paper we address the problem of inferring the rich semantics imparted by an object part in still images. We propose to tokenize the semantic space as a discrete set of part states. Our modeling of part states is spatially localized; therefore, we formulate the part state inference problem as a pixel-wise annotation problem. An iterative part-state inference neural network is specifically designed for this task, which is both time-efficient and accurate. Extensive experiments demonstrate that the proposed method can effectively predict the semantic states of parts and simultaneously correct localization errors, thus benefiting several visual understanding applications. The other contribution of this paper is our part state dataset, which contains rich part-level semantic annotations.
https://arxiv.org/abs/1612.07310
Micro-facial expressions are regarded as an important human behavioural event that can highlight emotional deception. Spotting these movements is difficult for humans and machines; however, research into using computer vision to detect subtle facial expressions is growing in popularity. This paper proposes an individualised baseline micro-movement detection method using a 3D Histogram of Oriented Gradients (3D HOG) temporal difference method. We define a face template consisting of 26 regions based on the Facial Action Coding System (FACS) and extract the temporal features of each region using 3D HOG. Then, we use the Chi-square distance to find subtle facial motion in the local regions. Finally, an automatic peak detector is used to detect micro-movements above the newly proposed adaptive baseline threshold. The performance is validated on two FACS-coded datasets: SAMM and CASME II. This objective method focuses on the movement of the 26 face regions. When comparing with the ground truth, the best results were an AUC of 0.7512 on SAMM and 0.7261 on CASME II. The results show that 3D HOG outperforms the state-of-the-art feature representations, Local Binary Patterns in Three Orthogonal Planes and Histograms of Oriented Optical Flow, for micro-movement detection.
https://arxiv.org/abs/1612.05038
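A toy sketch of the per-region spotting step: Chi-square distances between consecutive 3D HOG histograms, with peaks flagged against an individualised adaptive baseline. The mean-plus-k-standard-deviations rule below is our assumption for the baseline; the paper defines its own adaptive threshold.

import numpy as np

def chi_square_distance(h1, h2):
    # Chi-square distance between two (3D HOG) histograms.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-12))

def detect_micro_movements(histograms, k=3.0):
    # Temporal Chi-square differences for one face region, thresholded
    # against that person's own statistics (individualised baseline).
    diffs = np.array([chi_square_distance(histograms[t], histograms[t + 1])
                      for t in range(len(histograms) - 1)])
    threshold = diffs.mean() + k * diffs.std()
    return np.where(diffs > threshold)[0], diffs

rng = np.random.default_rng(11)
hists = rng.random((60, 36))              # 60 frames of a region's 3D HOG
hists[30] += 2.0                           # inject a subtle movement
peaks, _ = detect_micro_movements(hists)
print("movement onsets near frames:", peaks)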
With the success of recurrent neural networks in modelling sequential data and the power of attention mechanisms in automatically identifying salient information, image captioning, a.k.a. image description, has advanced remarkably in recent years. Nonetheless, most existing paradigms may suffer from a lack of invariance to images under different scaling, rotation, etc., and from ineffective integration of standalone attention into a holistic end-to-end system. In this paper, we propose a novel image captioning architecture, termed the Recurrent Image Captioner (RIC), which allows the visual encoder and language decoder to coherently cooperate in a recurrent manner. Specifically, we first equip the CNN-based visual encoder with a differentiable layer to enable spatially invariant transformation of visual signals. Moreover, we deploy a (differentiable) attention filter module between the encoder and decoder to dynamically determine salient visual parts. We also employ bidirectional LSTMs to preprocess sentences for generating better textual representations. In addition, we propose to exploit variational inference to optimize the whole architecture. Extensive experimental results on three benchmark datasets (i.e., Flickr8k, Flickr30k and MS COCO) demonstrate the superiority of our proposed architecture over most state-of-the-art methods.
https://arxiv.org/abs/1612.04949