The BL Lac object 3C66A is one of the most luminous extragalactic sources at TeV $\gamma$-rays (VHE, i.e. $E >100$ GeV). Since TeV $\gamma$-ray radiation is absorbed by the extragalactic background light (EBL), it is crucial to know the redshift of the source in order to reconstruct its original spectral energy distribution, as well as to constrain EBL models. However, the optical spectrum of this BL\,Lac is almost featureless, so a direct measurement of $z$ is very difficult; in fact, the published redshift value for this source ($z=0.444$) has been strongly questioned. Based on EBL absorption arguments, several constraints to its redshift, in the range $0.096 < z < 0.5$, were proposed. Since these AGNs are hosted, typically, in early type galaxies that are members of groups or clusters, we have analysed spectro-photometrically the environment of 3C66A, with the goal of finding the galaxy group hosting this blazar. This study was made using optical images of a $5.5 \times 5.5$\,arcmin$^{2}$ field centred on the blazar, and spectra of 24 sources obtained with Gemini/GMOS-N multi-object spectroscopy. We found spectroscopic evidence of two galaxy groups along the blazar’s line of sight: one at $z \simeq 0.020$ and a second one at $z \simeq 0.340$. The first one is consistent with a known foreground structure, while the second group here presented has six spectroscopically confirmed members. Their location along a red sequence in the colour-magnitude diagram allows us to identify 34 additional candidate members of the more distant group. The blazar’s spectrum shows broad absorption features that we identify as arising in the intergalactic medium, thus allowing us to tentatively set a redshift lower limit at $z_{3C66A} \ge 0.33$. As a consequence, we propose that 3C66A is hosted in a galaxy that belongs to a cluster at $z=0.340$.
https://arxiv.org/abs/1710.04309
In past few years, several data-sets have been released for text and images. We present an approach to create the data-set for use in detecting and removing gender bias from text. We also include a set of challenges we have faced while creating this corpora. In this work, we have worked with movie data from Wikipedia plots and movie trailers from YouTube. Our Bollywood Movie corpus contains 4000 movies extracted from Wikipedia and 880 trailers extracted from YouTube which were released from 1970-2017. The corpus contains csv files with the following data about each movie - Wikipedia title of movie, cast, plot text, co-referenced plot text, soundtrack information, link to movie poster, caption of movie poster, number of males in poster, number of females in poster. In addition to that, corresponding to each cast member the following data is available - cast name, cast gender, cast verbs, cast adjectives, cast relations, cast centrality, cast mentions. We present some preliminary results on the task of bias removal which suggest that the data-set is quite useful for performing such tasks.
https://arxiv.org/abs/1710.04142
We propose a unified formulation for the problem of 3D human pose estimation from a single raw RGB image that reasons jointly about 2D joint estimation and 3D pose reconstruction to improve both tasks. We take an integrated approach that fuses probabilistic knowledge of 3D human pose with a multi-stage CNN architecture and uses the knowledge of plausible 3D landmark locations to refine the search for better 2D locations. The entire process is trained end-to-end, is extremely efficient and obtains state- of-the-art results on Human3.6M outperforming previous approaches both on 2D and 3D errors.
https://arxiv.org/abs/1701.00295
This article expands on research that has been done to develop a recurrent neural network (RNN) capable of predicting aircraft engine vibrations using long short-term memory (LSTM) neurons. LSTM RNNs can provide a more generalizable and robust method for prediction over analytical calculations of engine vibration, as analytical calculations must be solved iteratively based on specific empirical engine parameters, making this approach ungeneralizable across multiple engines. In initial work, multiple LSTM RNN architectures were proposed, evaluated and compared. This research improves the performance of the most effective LSTM network design proposed in the previous work by using a promising neuroevolution method based on ant colony optimization (ACO) to develop and enhance the LSTM cell structure of the network. A parallelized version of the ACO neuroevolution algorithm has been developed and the evolved LSTM RNNs were compared to the previously used fixed topology. The evolved networks were trained on a large database of flight data records obtained from an airline containing flights that suffered from excessive vibration. Results were obtained using MPI (Message Passing Interface) on a high performance computing (HPC) cluster, evolving 1000 different LSTM cell structures using 168 cores over 4 days. The new evolved LSTM cells showed an improvement of 1.35%, reducing prediction error from 5.51% to 4.17% when predicting excessive engine vibrations 10 seconds in the future, while at the same time dramatically reducing the number of weights from 21,170 to 11,810.
https://arxiv.org/abs/1710.03753
Heterostructures of wurtzite based devices have attracted great research interests since the tremendous success of GaN in light emitting diodes (LED) industry. Among the possible heterostructure material candidates, high quality GaN thin films on inexpensive and lattice matched ZnO substrate are both commercially and technologically desirable. However, the energy of ZnO polar surfaces is much lower than that of GaN polar surfaces. Therefore, the intrinsic wetting condition forbids such heterostructures. As a result, poor crystal quality and 3D growth mode were obtained. To dramatically change the growth mode of the heterostructure, we propose to use hydrogen as a surfactant, confirmed by our first principles calculations. Stable H involved surface configurations and interfaces are investigated, with the help of newly developed algorithms. By applying the experimental Gibbs free energy of H$_2$, we also predict the temperature and chemical potential of H, which is critical in experimental realizations of our strategy. This novel approach will for the first time make the growth of high quality GaN thin films on ZnO substrates possible. We believe that our new strategy may reduce the manufactory cost and improve the crystal quality and the efficiency of GaN based devices.
https://arxiv.org/abs/1710.03472
An ever increasing number of configuration parameters are provided to system users. But many users have used one configuration setting across different workloads, leaving untapped the performance potential of systems. A good configuration setting can greatly improve the performance of a deployed system under certain workloads. But with tens or hundreds of parameters, it becomes a highly costly task to decide which configuration setting leads to the best performance. While such task requires the strong expertise in both the system and the application, users commonly lack such expertise. To help users tap the performance potential of systems, we present BestConfig, a system for automatically finding a best configuration setting within a resource limit for a deployed system under a given application workload. BestConfig is designed with an extensible architecture to automate the configuration tuning for general systems. To tune system configurations within a resource limit, we propose the divide-and-diverge sampling method and the recursive bound-and-search algorithm. BestConfig can improve the throughput of Tomcat by 75%, that of Cassandra by 63%, that of MySQL by 430%, and reduce the running time of Hive join job by about 50% and that of Spark join job by about 80%, solely by configuration adjustment.
https://arxiv.org/abs/1710.03439
Proteins are the main workhorses of biological functions in a cell, a tissue, or an organism. Identification and quantification of proteins in a given sample, e.g. a cell type under normal/disease conditions, are fundamental tasks for the understanding of human health and disease. In this paper, we present DeepNovo, a deep learning-based tool to address the problem of protein identification from tandem mass spectrometry data. The idea was first proposed in the context of de novo peptide sequencing [1] in which convolutional neural networks and recurrent neural networks were applied to predict the amino acid sequence of a peptide from its spectrum, a similar task to generating a caption from an image. We further develop DeepNovo to perform sequence database search, the main technique for peptide identification that greatly benefits from numerous existing protein databases. We combine two modules de novo sequencing and database search into a single deep learning framework for peptide identification, and integrate de Bruijn graph assembly technique to offer a complete solution to reconstruct protein sequences from tandem mass spectrometry data. This paper describes a comprehensive protocol of DeepNovo for protein identification, including training neural network models, dynamic programming search, database querying, estimation of false discovery rate, and de Bruijn graph assembly. Training and testing data, model implementations, and comprehensive tutorials in form of IPython notebooks are available in our GitHub repository (this https URL).
https://arxiv.org/abs/1710.02765
The spiking activity of principal cells in mammalian hippocampus encodes an internalized neuronal representation of the ambient space—a cognitive map. Once learned, such a map enables the animal to navigate a given environment for a long period. However, the neuronal substrate that produces this map remains transient: the synaptic connections in the hippocampus and in the downstream neuronal networks never cease to form and to deteriorate at a rapid rate. How can the brain maintain a robust, reliable representation of space using a network that constantly changes its architecture? Here, we demonstrate, using novel Algebraic Topology techniques, that cognitive map’s stability is a generic, emergent phenomenon. The model allows evaluating the effect produced by specific physiological parameters, e.g., the distribution of connections’ decay times, on the properties of the cognitive map as a whole. It also points out that spatial memory deterioration caused by weakening or excessive loss of the synaptic connections may be compensated by simulating the neuronal activity. Lastly, the model explicates functional importance of the complementary learning systems for processing spatial information at different levels of spatiotemporal granularity, by establishing three complementary timescales at which spatial information unfolds. Thus, the model provides a principal insight into how can the brain develop a reliable representation of the world, learn and retain memories despite complex plasticity of the underlying networks and allows studying how instabilities and memory deterioration mechanisms may affect learning process.
https://arxiv.org/abs/1710.02623
The continually increasing number of documents produced each year necessitates ever improving information processing methods for searching, retrieving, and organizing text. Central to these information processing methods is document classification, which has become an important application for supervised learning. Recently the performance of these traditional classifiers has degraded as the number of documents has increased. This is because along with this growth in the number of documents has come an increase in the number of categories. This paper approaches this problem differently from current document classification methods that view the problem as multi-class classification. Instead we perform hierarchical classification using an approach we call Hierarchical Deep Learning for Text classification (HDLTex). HDLTex employs stacks of deep learning architectures to provide specialized understanding at each level of the document hierarchy.
https://arxiv.org/abs/1709.08267
Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learning (CL), for image captioning. Specifically, via two constraints formulated on top of a reference model, the proposed method can encourage distinctiveness, while maintaining the overall quality of the generated captions. We tested our method on two challenging datasets, where it improves the baseline model by significant margins. We also showed in our studies that the proposed method is generic and can be used for models with various structures.
https://arxiv.org/abs/1710.02534
We study 1 or 2 neighboring Mn impurities, as well as complexes of 1 Mn and 1 or 2 Mg ions in a 64 atoms supercell of GaN by means of density functional calculations. Taking into account the electron correlation in the local spin density approximation with explicit correction of the Hubbard term (the LSDA+U method) and full lattice relaxation we determine the nearest neighbor exchange J for a pair of Mn impurities. We find J to be ferromagnetic and of the order of about 18 meV in the Hamiltonian H=-2J1J2. That J is only weakly influenced by the U parameter (varying between 2 and 8 eV) and by the lattice relaxation. From a detailed analysis of the magnetization density distribution we get hints for a ferromagnetic super-exchange mechanism. Also the Mn valence was found to be 3+ without any doubt in the absence of co-doping with Mg. Co-doping with Mg leads to a valence change to 4+ for 1 Mg and to 5+ for 2 Mg. We show that the valence change can already be concluded from a careful analysis of the density of states of GaN doped with Mn without any Mg.
https://arxiv.org/abs/1710.02056
We propose model with larger spatial size of feature maps and evaluate it on object detection task. With the goal to choose the best feature extraction network for our model we compare several popular lightweight networks. After that we conduct a set of experiments with channels reduction algorithms in order to accelerate execution. Our vehicle detection models are accurate, fast and therefore suit for embedded visual applications. With only 1.5 GFLOPs our best model gives 93.39 AP on validation subset of challenging DETRAC dataset. The smallest of our models is the first to achieve real-time inference speed on CPU with reasonable accuracy drop to 91.43 AP.
https://arxiv.org/abs/1707.01395
The local Gan-Gross-Prasad conjecture of unitary groups, which is now settled by the works of Plessis, Gan and Ichino, says that for a pair of generic $L$-parameters of $(U(n+1),U(n))$, there is a unique pair of representations in their associated Vogan $L$-packets which produces the Bessel model. In this paper, we examined the conjecture for a pair of $L$-parameters of $\big(U(n+1), U(n)\big)$ as fixing a special non-generic parameter of $U(n+1)$ and varing tempered $L$-parameters of $U(n)$. We observed that there still exist a Gan-Gross-Prasad type formulae depending on the choice of $L$-parameter of $U(n)$.
https://arxiv.org/abs/1708.06705
Neural machine translation (NMT) has recently achieved impressive results. A potential problem of the existing NMT algorithm, however, is that the decoding is conducted from left to right, without considering the right context. This paper proposes an two-stage approach to solve the problem. In the first stage, a conventional attention-based NMT system is used to produce a draft translation, and in the second stage, a novel double-attention NMT system is used to refine the translation, by looking at the original input as well as the draft translation. This drafting-and-refinement can obtain the right-context information from the draft, hence producing more consistent translations. We evaluated this approach using two Chinese-English translation tasks, one with 44k pairs and 1M pairs respectively. The experiments showed that our approach achieved positive improvements over the conventional NMT system: the improvements are 2.4 and 0.9 BLEU points on the small-scale and large-scale tasks, respectively.
https://arxiv.org/abs/1710.01789
Several variants of the Long Short-Term Memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search, and their importance was assessed using the powerful fANOVA framework. In total, we summarize the results of 5400 experimental runs ($\approx 15$ years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.
https://arxiv.org/abs/1503.04069
In genomics, pattern matching against a sequence of nucleotides plays a pivotal role for DNA sequence alignment and comparing genomes. This helps tackling some diseases, such as cancer in humans. The complexity of searching biological sequences in big databases has transformed sequence alignment problem into a challenging field of research in bioinformatics. A large number of research has been carried to solve this problem based on electronic computers. The required extensive amount of computations for handling this huge database in electronic computers leads to vast amounts of energy consumption for electrical processing and cooling. On the other hand, optical processing due to its parallel nature is much faster than electrical counterpart at a fraction of energy consumption level and cost. In this paper, an algorithm based on optical parallel processing is proposed that not only locate similarity between sequences but also determines the exact location of edits. The proposed algorithm is based on partitioning the read sequence into some parts, namely, windows, then, computing their correlation with reference sequence in parallel. Multiple metamaterial based optical correlators are used in parallel to optically implement the architecture. Design limitations and challenges of the architecture are also discussed in details. The simulation results, comparing with the well-known BLAST algorithm, demonstrate superior speed, accuracy, and much lower power consumption.
https://arxiv.org/abs/1710.01262
Commercial InGaN/GaN light emitting diode heterostructures continue to suffer from efficiency droop at high current densities. Droop mitigation strategies target Auger recombination and typically require structural and/or compositional changes within the multi-quantum well active region. However, these modifications are often accompanied by a corresponding degradation in material quality that decreases the expected gains in high-current external quantum efficiency. We study origins of these efficiency losses by correlating chip-level quantum efficiency measurements with structural and optical properties obtained using a combination of electron microscopy tools. The drop in quantum efficiency is not found to be correlated with quantum well (QW) width fluctuations. Rather, we show direct correlation between active region design, deep level defects, and delayed electron beam induced cathodoluminescence (CL) with characteristic rise time constants on the order of tens of seconds. We propose a model in which the electron beam fills deep level defect states and simultaneously drives reduction of the built-in field within the multi-quantum well active region, resulting in a delay in accumulation of carrier populations within the QWs. The CL measurements yield fundamental insights into carrier transport phenomena, efficiency-reducing defects, and quantum well band structure that are important in guiding future heterostructure process development.
https://arxiv.org/abs/1710.01233
Learning to remember long sequences remains a challenging task for recurrent neural networks. Register memory and attention mechanisms were both proposed to resolve the issue with either high computational cost to retain memory differentiability, or by discounting the RNN representation learning towards encoding shorter local contexts than encouraging long sequence encoding. Associative memory, which studies the compression of multiple patterns in a fixed size memory, were rarely considered in recent years. Although some recent work tries to introduce associative memory in RNN and mimic the energy decay process in Hopfield nets, it inherits the shortcoming of rule-based memory updates, and the memory capacity is limited. This paper proposes a method to learn the memory update rule jointly with task objective to improve memory capacity for remembering long sequences. Also, we propose an architecture that uses multiple such associative memory for more complex input encoding. We observed some interesting facts when compared to other RNN architectures on some well-studied sequence learning tasks.
https://arxiv.org/abs/1709.06493
Training efficiency is one of the main problems for Neural Machine Translation (NMT). Deep networks need for very large data as well as many training iterations to achieve state-of-the-art performance. This results in very high computation cost, slowing down research and industrialisation. In this paper, we propose to alleviate this problem with several training methods based on data boosting and bootstrap with no modifications to the neural network. It imitates the learning process of humans, which typically spend more time when learning “difficult” concepts than easier ones. We experiment on an English-French translation task showing accuracy improvements of up to 1.63 BLEU while saving 20% of training time.
https://arxiv.org/abs/1612.06138
Theory of Mind is the ability to attribute mental states (beliefs, intents, knowledge, perspectives, etc.) to others and recognize that these mental states may differ from one’s own. Theory of Mind is critical to effective communication and to teams demonstrating higher collective performance. To effectively leverage the progress in Artificial Intelligence (AI) to make our lives more productive, it is important for humans and AI to work well together in a team. Traditionally, there has been much emphasis on research to make AI more accurate, and (to a lesser extent) on having it better understand human intentions, tendencies, beliefs, and contexts. The latter involves making AI more human-like and having it develop a theory of our minds. In this work, we argue that for human-AI teams to be effective, humans must also develop a theory of AI’s mind (ToAIM) - get to know its strengths, weaknesses, beliefs, and quirks. We instantiate these ideas within the domain of Visual Question Answering (VQA). We find that using just a few examples (50), lay people can be trained to better predict responses and oncoming failures of a complex VQA model. We further evaluate the role existing explanation (or interpretability) modalities play in helping humans build ToAIM. Explainable AI has received considerable scientific and popular attention in recent times. Surprisingly, we find that having access to the model’s internal states - its confidence in its top-k predictions, explicit or implicit attention maps which highlight regions in the image (and words in the question) the model is looking at (and listening to) while answering a question about an image - do not help people better predict its behavior.
https://arxiv.org/abs/1704.00717
The CEGS N-GRID 2016 Shared Task 1 in Clinical Natural Language Processing focuses on the de-identification of psychiatric evaluation records. This paper describes two participating systems of our team, based on conditional random fields (CRFs) and long short-term memory networks (LSTMs). A pre-processing module was introduced for sentence detection and tokenization before de-identification. For CRFs, manually extracted rich features were utilized to train the model. For LSTMs, a character-level bi-directional LSTM network was applied to represent tokens and classify tags for each token, following which a decoding layer was stacked to decode the most probable protected health information (PHI) terms. The LSTM-based system attained an i2b2 strict micro-F_1 measure of 89.86%, which was higher than that of the CRF-based system.
https://arxiv.org/abs/1709.06901
With the proliferation of mobile devices and location-based services, continuous generation of massive volume of streaming spatial objects (i.e., geo-tagged data) opens up new opportunities to address real-world problems by analyzing them. In this paper, we present a novel continuous bursty region detection problem that aims to continuously detect a bursty region of a given size in a specified geographical area from a stream of spatial objects. Specifically, a bursty region shows maximum spike in the number of spatial objects in a given time window. The problem is useful in addressing several real-world challenges such as surge pricing problem in online transportation and disease outbreak detection. To solve the problem, we propose an exact solution and two approximate solutions, and the approximation ratio is $\frac{1-\alpha}{4}$ in terms of the burst score, where $\alpha$ is a parameter to control the burst score. We further extend these solutions to support detection of top-$k$ bursty regions. Extensive experiments with real-world data are conducted to demonstrate the efficiency and effectiveness of our solutions.
https://arxiv.org/abs/1709.09287
Privatizing data is a useful strategy for increasing parallelism in a shared memory multithreaded program. Independent cores can compute independently on duplicates of shared data, combining their results at the end of their computations. Conventional approaches to privatization, however, rely on explicit static or dynamic memory allocation for duplicated state, increasing memory footprint and contention for cache resources, especially in shared caches. In this work, we describe CCache, a system for on-demand privatization of data manipulated by commutative operations. CCache garners the benefits of privatization, without the increase in memory footprint or cache occupancy. Each core in CCache dynamically privatizes commutatively manipulated data, operating on a copy. Periodically or at the end of its computation, the core merges its value with the value resident in memory, and when all cores have merged, the in-memory copy contains the up-to-date value. We describe a low-complexity architectural implementation of CCache that extends a conventional multicore to support on-demand privatization without using additional memory for private copies. We evaluate CCache on several high-value applications, including random access key-value store, clustering, breadth first search and graph ranking, showing speedups upto 3.2X.
https://arxiv.org/abs/1709.09491
Although neural networks could achieve state-of-the-art performance while recongnizing images, they often suffer a tremendous defeat from adversarial examples–inputs generated by utilizing imperceptible but intentional perturbation to clean samples from the datasets. How to defense against adversarial examples is an important problem which is well worth researching. So far, very few methods have provided a significant defense to adversarial examples. In this paper, a novel idea is proposed and an effective framework based Generative Adversarial Nets named APE-GAN is implemented to defense against the adversarial examples. The experimental results on three benchmark datasets including MNIST, CIFAR10 and ImageNet indicate that APE-GAN is effective to resist adversarial examples generated from five attacks.
https://arxiv.org/abs/1707.05474
While some remarkable progress has been made in neural machine translation (NMT) research, there have not been many reports on its development and evaluation in practice. This paper tries to fill this gap by presenting some of our findings from building an in-house travel domain NMT system in a large scale E-commerce setting. The three major topics that we cover are optimization and training (including different optimization strategies and corpus sizes), handling real-world content and evaluating results.
https://arxiv.org/abs/1709.05820
The brain is a noisy system subject to energy constraints. These facts are rarely taken into account when modelling artificial neural networks. In this paper, we are interested in demonstrating that those factors can actually lead to the appearance of robust associative memories. We first propose a simplified model of noise in the brain, taking into account synaptic noise and interference from neurons external to the network. When coarsely quantized, we show that this noise can be reduced to insertions and erasures. We take a neural network with recurrent modifiable connections, and subject it to noisy external inputs. We introduce an energy usage limitation principle in the network as well as consolidated Hebbian learning, resulting in an incremental processing of inputs. We show that the connections naturally formed correspond to state-of-the-art binary sparse associative memories.
https://arxiv.org/abs/1709.08367
Visual Question Answering (VQA) presents a unique challenge as it requires the ability to understand and encode the multi-modal inputs - in terms of image processing and natural language processing. The algorithm further needs to learn how to perform reasoning over this multi-modal representation so it can answer the questions correctly. This paper presents a survey of different approaches proposed to solve the problem of Visual Question Answering. We also describe the current state of the art model in later part of paper. In particular, the paper describes the approaches taken by various algorithms to extract image features, text features and the way these are employed to predict answers. We also briefly discuss the experiments performed to evaluate the VQA models and report their performances on diverse datasets including newly released VQA2.0[8].
https://arxiv.org/abs/1709.08203
Autonomous randomly coupled neural networks display a transition to chaos at a critical coupling strength. We here investigate the effect of a time-varying input on the onset of chaos and the resulting consequences for information processing. Dynamic mean-field theory yields the statistics of the activity, the maximum Lyapunov exponent, and the memory capacity of the network. We find an exact condition that determines the transition from stable to chaotic dynamics and the sequential memory capacity in closed form. The input suppresses chaos by a dynamic mechanism, shifting the transition to significantly larger coupling strengths than predicted by local stability analysis. Beyond linear stability, a regime of coexistent locally expansive, but non-chaotic dynamics emerges that optimizes the capacity of the network to store sequential input.
https://arxiv.org/abs/1603.01880
Natural, spontaneous dialogue proceeds incrementally on a word-by-word basis; and it contains many sorts of disfluency such as mid-utterance/sentence hesitations, interruptions, and self-corrections. But training data for machine learning approaches to dialogue processing is often either cleaned-up or wholly synthetic in order to avoid such phenomena. The question then arises of how well systems trained on such clean data generalise to real spontaneous dialogue, or indeed whether they are trainable at all on naturally occurring dialogue data. To answer this question, we created a new corpus called bAbI+ by systematically adding natural spontaneous incremental dialogue phenomena such as restarts and self-corrections to the Facebook AI Research’s bAbI dialogues dataset. We then explore the performance of a state-of-the-art retrieval model, MemN2N, on this more natural dataset. Results show that the semantic accuracy of the MemN2N model drops drastically; and that although it is in principle able to learn to process the constructions in bAbI+, it needs an impractical amount of training data to do so. Finally, we go on to show that an incremental, semantic parser – DyLan – shows 100% semantic accuracy on both bAbI and bAbI+, highlighting the generalisation properties of linguistically informed dialogue models.
https://arxiv.org/abs/1709.07840
In this paper, we propose a novel edge preserving and multi-scale contextual neural network for salient object detection. The proposed framework is aiming to address two limits of the existing CNN based methods. First, region-based CNN methods lack sufficient context to accurately locate salient object since they deal with each region independently. Second, pixel-based CNN methods suffer from blurry boundaries due to the presence of convolutional and pooling layers. Motivated by these, we first propose an end-to-end edge-preserved neural network based on Fast R-CNN framework (named RegionNet) to efficiently generate saliency map with sharp object boundaries. Later, to further improve it, multi-scale spatial context is attached to RegionNet to consider the relationship between regions and the global scenes. Furthermore, our method can be generally applied to RGB-D saliency detection by depth refinement. The proposed framework achieves both clear detection boundary and multi-scale contextual robustness simultaneously for the first time, and thus achieves an optimized performance. Experiments on six RGB and two RGB-D benchmark datasets demonstrate that the proposed method achieves state-of-the-art performance.
https://arxiv.org/abs/1608.08029
We discuss the microstructural origin of enhanced radial growth in magnesium (Mg) doped gallium nitride (GaN) nanorods (NRs) using electron microscopy and \textit{first-principles} Density Functional Theory calculations. Experimentally, we find the Mg incorporation increases surface coverage of the grown samples and the height of NRs decreases as a consequence of an increase radial growth rate. We also observed the coalescence of NRs becomes prominent and the critical height of coalescence decreases with the increase in Mg concentration. From \textit{first-principles} calculations, we find the surface free energy of Mg doped surface reduces with increasing Mg concentration in the samples. The calculations further suggests a reduction in the diffusion barrier of Ga adatoms along [11$\overline{2}$0] on the side wall surface of the NRs, possibly the primary reason for the observed enhancement in the radial growth.
https://arxiv.org/abs/1709.04479
Recently visual question answering (VQA) and visual question generation (VQG) are two trending topics in the computer vision, which have been explored separately. In this work, we propose an end-to-end unified framework, the Invertible Question Answering Network (iQAN), to leverage the complementary relations between questions and answers in images by jointly training the model on VQA and VQG tasks. Corresponding parameter sharing scheme and regular terms are proposed as constraints to explicitly leverage Q,A’s dependencies to guide the training process. After training, iQAN can take either question or answer as input, then output the counterpart. Evaluated on the large-scale visual question answering datasets CLEVR and VQA2, our iQAN improves the VQA accuracy over the baselines. We also show the dual learning framework of iQAN can be generalized to other VQA architectures and consistently improve the results over both the VQA and VQG tasks.
https://arxiv.org/abs/1709.07192
In recent years, we have seen tremendous progress in the field of object detection. Most of the recent improvements have been achieved by targeting deeper feedforward networks. However, many hard object categories such as bottle, remote, etc. require representation of fine details and not just coarse, semantic representations. But most of these fine details are lost in the early convolutional layers. What we need is a way to incorporate finer details from lower layers into the detection architecture. Skip connections have been proposed to combine high-level and low-level features, but we argue that selecting the right features from low-level requires top-down contextual information. Inspired by the human visual pathway, in this paper we propose top-down modulations as a way to incorporate fine details into the detection framework. Our approach supplements the standard bottom-up, feedforward ConvNet with a top-down modulation (TDM) network, connected using lateral connections. These connections are responsible for the modulation of lower layer filters, and the top-down network handles the selection and integration of contextual information and low-level features. The proposed TDM architecture provides a significant boost on the COCO testdev benchmark, achieving 28.6 AP for VGG16, 35.2 AP for ResNet101, and 37.3 for InceptionResNetv2 network, without any bells and whistles (e.g., multi-scale, iterative box refinement, etc.).
https://arxiv.org/abs/1612.06851
Machine learning is a field of computer science that builds algorithms that learn. In many cases, machine learning algorithms are used to recreate a human ability like adding a caption to a photo, driving a car, or playing a game. While the human brain has long served as a source of inspiration for machine learning, little effort has been made to directly use data collected from working brains as a guide for machine learning algorithms. Here we demonstrate a new paradigm of “neurally-weighted” machine learning, which takes fMRI measurements of human brain activity from subjects viewing images, and infuses these data into the training process of an object recognition learning algorithm to make it more consistent with the human brain. After training, these neurally-weighted classifiers are able to classify images without requiring any additional neural data. We show that our neural-weighting approach can lead to large performance gains when used with traditional machine vision features, as well as to significant improvements with already high-performing convolutional neural network features. The effectiveness of this approach points to a path forward for a new class of hybrid machine learning algorithms which take both inspiration and direct constraints from neuronal data.
https://arxiv.org/abs/1703.05463
In this paper, we present a novel deep learning based approach for addressing the problem of interaction recognition from a first person perspective. The proposed approach uses a pair of convolutional neural networks, whose parameters are shared, for extracting frame level features from successive frames of the video. The frame level features are then aggregated using a convolutional long short-term memory. The hidden state of the convolutional long short-term memory, after all the input video frames are processed, is used for classification in to the respective categories. The two branches of the convolutional neural network perform feature encoding on a short time interval whereas the convolutional long short term memory encodes the changes on a longer temporal duration. In our network the spatio-temporal structure of the input is preserved till the very final processing stage. Experimental results show that our method outperforms the state of the art on most recent first person interactions datasets that involve complex ego-motion. In particular, on UTKinect-FirstPerson it competes with methods that use depth image and skeletal joints information along with RGB images, while it surpasses all previous methods that use only RGB images by more than 20% in recognition accuracy.
https://arxiv.org/abs/1709.06495
Attention mechanisms have been widely applied in the Visual Question Answering (VQA) task, as they help to focus on the area-of-interest of both visual and textual information. To answer the questions correctly, the model needs to selectively target different areas of an image, which suggests that an attention-based model may benefit from an explicit attention supervision. In this work, we aim to address the problem of adding attention supervision to VQA models. Since there is a lack of human attention data, we first propose a Human Attention Network (HAN) to generate human-like attention maps, training on a recently released dataset called Human ATtention Dataset (VQA-HAT). Then, we apply the pre-trained HAN on the VQA v2.0 dataset to automatically produce the human-like attention maps for all image-question pairs. The generated human-like attention map dataset for the VQA v2.0 dataset is named as Human-Like ATtention (HLAT) dataset. Finally, we apply human-like attention supervision to an attention-based VQA model. The experiments show that adding human-like supervision yields a more accurate attention together with a better performance, showing a promising future for human-like attention supervision in VQA.
https://arxiv.org/abs/1709.06308
In this paper, we argue that the future of Artificial Intelligence research resides in two keywords: integration and embodiment. We support this claim by analyzing the recent advances of the field. Regarding integration, we note that the most impactful recent contributions have been made possible through the integration of recent Machine Learning methods (based in particular on Deep Learning and Recurrent Neural Networks) with more traditional ones (e.g. Monte-Carlo tree search, goal babbling exploration or addressable memory systems). Regarding embodiment, we note that the traditional benchmark tasks (e.g. visual classification or board games) are becoming obsolete as state-of-the-art learning algorithms approach or even surpass human performance in most of them, having recently encouraged the development of first-person 3D game platforms embedding realistic physics. Building upon this analysis, we first propose an embodied cognitive architecture integrating heterogenous sub-fields of Artificial Intelligence into a unified framework. We demonstrate the utility of our approach by showing how major contributions of the field can be expressed within the proposed framework. We then claim that benchmarking environments need to reproduce ecologically-valid conditions for bootstrapping the acquisition of increasingly complex cognitive skills through the concept of a cognitive arms race between embodied agents.
https://arxiv.org/abs/1704.01407
Object detection is considered one of the most challenging problems in this field of computer vision, as it involves the combination of object classification and object localization within a scene. Recently, deep neural networks (DNNs) have been demonstrated to achieve superior object detection performance compared to other approaches, with YOLOv2 (an improved You Only Look Once model) being one of the state-of-the-art in DNN-based object detection methods in terms of both speed and accuracy. Although YOLOv2 can achieve real-time performance on a powerful GPU, it still remains very challenging for leveraging this approach for real-time object detection in video on embedded computing devices with limited computational power and limited memory. In this paper, we propose a new framework called Fast YOLO, a fast You Only Look Once framework which accelerates YOLOv2 to be able to perform object detection in video on embedded devices in a real-time manner. First, we leverage the evolutionary deep intelligence framework to evolve the YOLOv2 network architecture and produce an optimized architecture (referred to as O-YOLOv2 here) that has 2.8X fewer parameters with just a ~2% IOU drop. To further reduce power consumption on embedded devices while maintaining performance, a motion-adaptive inference method is introduced into the proposed Fast YOLO framework to reduce the frequency of deep inference with O-YOLOv2 based on temporal motion characteristics. Experimental results show that the proposed Fast YOLO framework can reduce the number of deep inferences by an average of 38.13%, and an average speedup of ~3.3X for objection detection in video compared to the original YOLOv2, leading Fast YOLO to run an average of ~18FPS on a Nvidia Jetson TX1 embedded system.
https://arxiv.org/abs/1709.05943
The need for large annotated image datasets for training Convolutional Neural Networks (CNNs) has been a significant impediment for their adoption in computer vision applications. We show that with transfer learning an effective object detector can be trained almost entirely on synthetically rendered datasets. We apply this strategy for detecting pack- aged food products clustered in refrigerator scenes. Our CNN trained only with 4000 synthetic images achieves mean average precision (mAP) of 24 on a test set with 55 distinct products as objects of interest and 17 distractor objects. A further increase of 12% in the mAP is obtained by adding only 400 real images to these 4000 synthetic images in the training set. A high degree of photorealism in the synthetic images was not essential in achieving this performance. We analyze factors like training data set size and 3D model dictionary size for their influence on detection performance. Additionally, training strategies like fine-tuning with selected layers and early stopping which affect transfer learning from synthetic scenes to real scenes are explored. Training CNNs with synthetic datasets is a novel application of high-performance computing and a promising approach for object detection applications in domains where there is a dearth of large annotated image data.
https://arxiv.org/abs/1706.06782
Recent works have shown that synthetic parallel data automatically generated by translation models can be effective for various neural machine translation (NMT) issues. In this study, we build NMT systems using only synthetic parallel data. As an efficient alternative to real parallel data, we also present a new type of synthetic parallel corpus. The proposed pseudo parallel data are distinct from previous works in that ground truth and synthetic examples are mixed on both sides of sentence pairs. Experiments on Czech-German and French-German translations demonstrate the efficacy of the proposed pseudo parallel corpus, which shows not only enhanced results for bidirectional translation tasks but also substantial improvement with the aid of a ground truth real parallel corpus.
https://arxiv.org/abs/1704.00253
This paper introduces speech-based visual question answering (VQA), the task of generating an answer given an image and a spoken question. Two methods are studied: an end-to-end, deep neural network that directly uses audio waveforms as input versus a pipelined approach that performs ASR (Automatic Speech Recognition) on the question, followed by text-based visual question answering. Furthermore, we investigate the robustness of both methods by injecting various levels of noise into the spoken question and find both methods to be tolerate noise at similar levels.
https://arxiv.org/abs/1705.00464
Object detection, scene graph generation and region captioning, which are three scene understanding tasks at different semantic levels, are tied together: scene graphs are generated on top of objects detected in an image with their pairwise relationship predicted, while region captioning gives a language description of the objects, their attributes, relations, and other context information. In this work, to leverage the mutual connections across semantic levels, we propose a novel neural network model, termed as Multi-level Scene Description Network (denoted as MSDN), to solve the three vision tasks jointly in an end-to-end manner. Objects, phrases, and caption regions are first aligned with a dynamic graph based on their spatial and semantic connections. Then a feature refining structure is used to pass messages across the three levels of semantic tasks through the graph. We benchmark the learned model on three tasks, and show the joint learning across three tasks with our proposed method can bring mutual improvements over previous models. Particularly, on the scene graph generation task, our proposed method outperforms the state-of-art method with more than 3% margin.
https://arxiv.org/abs/1707.09700
In this paper, a self-guiding multimodal LSTM (sg-LSTM) image captioning model is proposed to handle uncontrolled imbalanced real-world image-sentence dataset. We collect FlickrNYC dataset from Flickr as our testbed with 306,165 images and the original text descriptions uploaded by the users are utilized as the ground truth for training. Descriptions in FlickrNYC dataset vary dramatically ranging from short term-descriptions to long paragraph-descriptions and can describe any visual aspects, or even refer to objects that are not depicted. To deal with the imbalanced and noisy situation and to fully explore the dataset itself, we propose a novel guiding textual feature extracted utilizing a multimodal LSTM (m-LSTM) model. Training of m-LSTM is based on the portion of data in which the image content and the corresponding descriptions are strongly bonded. Afterwards, during the training of sg-LSTM on the rest training data, this guiding information serves as additional input to the network along with the image representations and the ground-truth descriptions. By integrating these input components into a multimodal block, we aim to form a training scheme with the textual information tightly coupled with the image content. The experimental results demonstrate that the proposed sg-LSTM model outperforms the traditional state-of-the-art multimodal RNN captioning framework in successfully describing the key components of the input images.
https://arxiv.org/abs/1709.05038
We have designed, developed, and applied a convolutional neural network (CNN) architecture using multi-task learning to search for and characterize strong HI Lya absorption in quasar spectra. Without any explicit modeling of the quasar continuum nor application of the predicted line-profile for Lya from quantum mechanics, our algorithm predicts the presence of strong HI absorption and estimates the corresponding redshift zabs and HI column density NHI, with emphasis on damped Lya systems (DLAs, absorbers with log NHI > 20.3). We tuned the CNN model using a custom training set of DLAs injected into DLA-free quasar spectra from the Sloan Digital Sky Survey (SDSS), data release 5 (DR5). Testing on a held-back validation set demonstrates a high incidence of DLAs recovered by the algorithm (97.4% as DLAs and 99% as an HI absorber with log NHI > 19.5) and excellent estimates for zabs and NHI. Similar results are obtained against a human-generated survey of the SDSS DR5 dataset. The algorithm yields a low incidence of false positives and negatives but is challenged by overlapping DLAs and/or very high NHI systems. We have applied this CNN model to the quasar spectra of SDSS-DR7 and the Baryonic Oscillation Spectroscopic Survey (BOSS, data release 12) and provide catalogs of 4,913 and 50,969 DLAs respectively (including 1,659 and 9,230 high-confidence DLAs that were previously unpublished). This work validates the application of deep learning techniques to astronomical spectra for both classification and quantitative measurements.
https://arxiv.org/abs/1709.04962
Deep Learning has recently become hugely popular in machine learning, providing significant improvements in classification accuracy in the presence of highly-structured and large databases. Researchers have also considered privacy implications of deep learning. Models are typically trained in a centralized manner with all the data being processed by the same training algorithm. If the data is a collection of users’ private data, including habits, personal pictures, geographical positions, interests, and more, the centralized server will have access to sensitive information that could potentially be mishandled. To tackle this problem, collaborative deep learning models have recently been proposed where parties locally train their deep learning structures and only share a subset of the parameters in the attempt to keep their respective training sets private. Parameters can also be obfuscated via differential privacy (DP) to make information extraction even more challenging, as proposed by Shokri and Shmatikov at CCS’15. Unfortunately, we show that any privacy-preserving collaborative deep learning is susceptible to a powerful attack that we devise in this paper. In particular, we show that a distributed, federated, or decentralized deep learning approach is fundamentally broken and does not protect the training sets of honest participants. The attack we developed exploits the real-time nature of the learning process that allows the adversary to train a Generative Adversarial Network (GAN) that generates prototypical samples of the targeted training set that was meant to be private (the samples generated by the GAN are intended to come from the same distribution as the training data). Interestingly, we show that record-level DP applied to the shared parameters of the model, as suggested in previous work, is ineffective (i.e., record-level DP is not designed to address our attack).
https://arxiv.org/abs/1702.07464
We present a novel method to solve image analogy problems : it allows to learn the relation between paired images present in training data, and then generalize and generate images that correspond to the relation, but were never seen in the training set. Therefore, we call the method Conditional Analogy Generative Adversarial Network (CAGAN), as it is based on adversarial training and employs deep convolutional neural networks. An especially interesting application of that technique is automatic swapping of clothing on fashion model photos. Our work has the following contributions. First, the definition of the end-to-end trainable CAGAN architecture, which implicitly learns segmentation masks without expensive supervised labeling data. Second, experimental results show plausible segmentation masks and often convincing swapped images, given the target article. Finally, we discuss the next steps for that technique: neural network architecture improvements and more advanced applications.
https://arxiv.org/abs/1709.04695
In visual question answering (VQA), an algorithm must answer text-based questions about images. While multiple datasets for VQA have been created since late 2014, they all have flaws in both their content and the way algorithms are evaluated on them. As a result, evaluation scores are inflated and predominantly determined by answering easier questions, making it difficult to compare different methods. In this paper, we analyze existing VQA algorithms using a new dataset. It contains over 1.6 million questions organized into 12 different categories. We also introduce questions that are meaningless for a given image to force a VQA system to reason about image content. We propose new evaluation schemes that compensate for over-represented question-types and make it easier to study the strengths and weaknesses of algorithms. We analyze the performance of both baseline and state-of-the-art VQA models, including multi-modal compact bilinear pooling (MCB), neural module networks, and recurrent answering units. Our experiments establish how attention helps certain categories more than others, determine which models work better than others, and explain how simple models (e.g. MLP) can surpass more complex models (MCB) by simply learning to answer large, easy question categories.
https://arxiv.org/abs/1703.09684
Analog memory is of great importance in neurocomputing technologies field, but still remains difficult to implement. With emergence of memristors in VLSI technologies the idea of designing scalable analog data storage elements finds its second wind. A memristor, known for its history dependent resistance levels, independently can provide blocks of binary or discrete state data storage. However, using single memristor to save the analog value is practically limited due to the device variability and implementation complexity. In this paper, we present a new design of discrete state memory cell consisting of subcells constructed from a memristor and its resistive network. A memristor in the sub-cells provides the storage element, while its resistive network is used for programming its resistance. Several sub-cells are then connected in parallel, resembling potential divider configuration. The output of the memory cell is the voltage resulting from distributing the input voltage among the sub-cells. Here, proposed design was programmed to obtain 10 and 27 different output levels depending on the configuration of the combined resistive networks within the sub-cell. Despite the simplicity of the circuit, this realization of multilevel memory provides increased number of output levels compared to previous designs of memory technologies based on memristors. Simulation results of proposed memory are analyzed providing explicit data on the issues of distinguishing discrete analog output levels and sensitivity of the cell to oscillations in write signal patterns.
https://arxiv.org/abs/1709.04149
We introduce an open-source toolkit for neural machine translation (NMT) to support research into model architectures, feature representations, and source modalities, while maintaining competitive performance, modularity and reasonable training requirements.
https://arxiv.org/abs/1709.03815
Machine translation systems are very sensitive to the domains they were trained on. Several domain adaptation techniques have been deeply studied. We propose a new technique for neural machine translation (NMT) that we call domain control which is performed at runtime using a unique neural network covering multiple domains. The presented approach shows quality improvements when compared to dedicated domains translating on any of the covered domains and even on out-of-domain data. In addition, model parameters do not need to be re-estimated for each domain, making this effective to real use cases. Evaluation is carried out on English-to-French translation for two different testing scenarios. We first consider the case where an end-user performs translations on a known domain. Secondly, we consider the scenario where the domain is not known and predicted at the sentence level before translating. Results show consistent accuracy improvements for both conditions.
https://arxiv.org/abs/1612.06140