Starting from NMT, encoder-decoder neural networks have been used for many NLP problems. Graph-based models and transition-based models borrowing the encoder components achieve state-of-the-art performance on dependency parsing and constituent parsing, respectively. However, there has not been work empirically studying encoder-decoder neural networks for transition-based parsing. We apply a simple encoder-decoder to this end, achieving comparable results to the parser of Dyer et al. (2015) on standard dependency parsing, and outperforming the parser of Vinyals et al. (2015) on constituent parsing.
https://arxiv.org/abs/1706.07905
The minimum energy paths for the migration of interstitial Mg in wurtzite GaN are studied through density functional calculations. The study also comprises Li, Na, and Be dopants to examine the dependence on the size and charge of the dopant species. In all cases considered, the impurities diffuse as ions without any tendency to localize charge. Li, Mg, and to some extent Na diffuse almost isotropically in GaN, with average diffusion barriers of 1.1, 2.1, and 2.5 eV, respectively. Be, instead, shows a marked anisotropy, with energy barriers of 0.76 and 1.88 eV for diffusion paths perpendicular and parallel to the c-axis. The diffusion barrier generally increases with ionic charge and ionic radius, but their interplay is not trivial. The calculated migration barrier for Mg is consistent with the values estimated in a recent beta-emission channeling experiment.
https://arxiv.org/abs/1706.07171
This paper aims at high-accuracy 3D object detection in autonomous driving scenarios. We propose Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point clouds and RGB images as input and predicts oriented 3D bounding boxes. We encode the sparse 3D point cloud with a compact multi-view representation. The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion. The proposal network generates 3D candidate boxes efficiently from the bird’s-eye-view representation of the 3D point cloud. We design a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths. Experiments on the challenging KITTI benchmark show that our approach outperforms the state-of-the-art by around 25% and 30% AP on the tasks of 3D localization and 3D detection, respectively. In addition, for 2D detection, our approach obtains 10.3% higher AP than the state-of-the-art among LIDAR-based methods on the hard data.
https://arxiv.org/abs/1611.07759
Recent successful applications of convolutional neural networks (CNNs) to audio classification and speech recognition have motivated the search for better input representations for more efficient training. Visual displays of an audio signal through various time-frequency representations, such as spectrograms, offer a rich view of the temporal and spectral structure of the original signal. In this letter, we compare various popular signal processing methods for obtaining such representations, namely the short-time Fourier transform (STFT) with linear and Mel scales, the constant-Q transform (CQT) and the continuous wavelet transform (CWT), and assess their impact on the classification performance of CNNs on two environmental sound datasets. This study supports the hypothesis that time-frequency representations are valuable for learning useful features for sound classification. Moreover, the actual transformation used is shown to affect classification accuracy, with the Mel-scaled STFT slightly outperforming the other methods discussed and substantially outperforming baseline MFCC features. Additionally, we observe that the optimal window size during transformation depends on the characteristics of the audio signal, and that, architecturally, 2D convolution yielded better results than 1D in most cases.
https://arxiv.org/abs/1706.07156
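To make the comparison concrete, here is a minimal Python sketch (using librosa, with illustrative window/hop settings and a hypothetical input file, not the paper's configuration) that computes three of the time-frequency representations discussed above:

```python
import numpy as np
import librosa

# Load a clip and compute three of the compared representations.
y, sr = librosa.load("siren.wav", sr=22050)          # hypothetical file

stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=512)) ** 2
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=512, n_mels=128)
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512)) ** 2

# CNNs are typically fed log-scaled magnitudes, treated as an image.
mel_db = librosa.power_to_db(mel, ref=np.max)        # (n_mels, n_frames)
```

Each resulting 2D array can then be fed to a CNN either as an image (2D convolution) or frame-by-frame (1D convolution), the two architectural variants the letter compares.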
Convolution is a critical component of modern deep neural networks, and several algorithms for convolution have been developed. Direct convolution is simple but suffers from poor performance. As an alternative, multiple indirect methods have been proposed, including im2col-based, FFT-based, and Winograd-based convolution. However, all of these indirect methods have high memory overhead, which degrades performance and yields a poor trade-off between performance and memory consumption. In this work, we propose a memory-efficient convolution algorithm, MEC, with compact lowering, which substantially reduces memory overhead and accelerates the convolution process. MEC lowers the input matrix in a simple yet efficient and compact way (i.e., with much less memory overhead), and then executes multiple small matrix multiplications in parallel to complete the convolution. Additionally, the reduced memory footprint improves memory sub-system efficiency, further improving performance. Our experimental results show that MEC reduces memory consumption significantly, with good speedups on both mobile and server platforms, compared with other indirect convolution algorithms.
https://arxiv.org/abs/1706.06873
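The compact-lowering idea can be illustrated with a single-channel NumPy sketch, assuming the simplified scheme the abstract describes (lower the input once along the width, for roughly k-fold overhead instead of im2col's k-squared-fold, then run one small matrix multiplication per output row over overlapping views); blocking, multi-channel handling and parallel dispatch are omitted:

```python
import numpy as np

def mec_conv2d(x, k):
    """Single-channel 2D convolution via MEC-style compact lowering.

    x: (H, W) input, k: (K, K) kernel.
    """
    H, W = x.shape
    K = k.shape[0]
    o_h, o_w = H - K + 1, W - K + 1
    # Compact lowering: one row per output column (a flattened H x K
    # slab), so the overhead is ~K instead of im2col's ~K*K.
    L = np.empty((o_w, H * K), dtype=x.dtype)
    for j in range(o_w):
        L[j] = x[:, j:j + K].ravel()
    out = np.empty((o_h, o_w), dtype=x.dtype)
    kv = k.ravel()
    for i in range(o_h):
        # Rows i..i+K of every slab sit contiguously at offset i*K,
        # so each output row is one small (o_w x K*K) @ (K*K) GEMM.
        out[i] = L[:, i * K : i * K + K * K] @ kv
    return out
```

Because the o_h small multiplications read overlapping slices of the same lowered matrix L rather than fresh copies, no further materialization is needed, which is where the memory saving comes from.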
Weakly supervised object detection (WSOD), the problem of learning detectors using only image-level labels, has been attracting more and more interest. However, the problem is quite challenging due to the lack of location supervision. To address this issue, this paper integrates saliency into a deep architecture in which location information is explored both explicitly and implicitly. Specifically, we select highly confident object proposals under the guidance of class-specific saliency maps. The location, semantic, and saliency information of the selected proposals is then used to explicitly supervise the network by imposing two additional losses. Meanwhile, a saliency prediction sub-network is built into the architecture, whose predictions are used to implicitly guide the localization procedure. The entire network is trained end-to-end. Experiments on PASCAL VOC demonstrate that our approach outperforms all state-of-the-art methods.
https://arxiv.org/abs/1706.06768
Performing high-level cognitive tasks requires the integration of feature maps with drastically different structure. In Visual Question Answering (VQA), image descriptors have spatial structure, while lexical inputs inherently follow a temporal sequence. The recently proposed Multimodal Compact Bilinear pooling (MCB) forms the outer product, via count-sketch approximation, of the visual and textual representations at each spatial location. While this procedure preserves spatial information locally, the outer products are taken independently for each fiber of the activation tensor and therefore do not capture spatial context. In this work, we introduce the multi-dimensional sketch (MD-sketch), a novel extension of count-sketch to tensors. Using this new formulation, we propose Multimodal Compact Tensor pooling (MCT) to fully exploit global spatial context during bilinear pooling operations. In contrast to MCB, our approach preserves spatial context by directly convolving the MD-sketch of the visual feature tensor with the text feature vector using a higher-order FFT. Furthermore, we apply MCT incrementally at each step of the question embedding and accumulate the multi-modal vectors with a second LSTM layer before the final answer is chosen.
https://arxiv.org/abs/1706.06706
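For reference, count-sketch, the building block that the MD-sketch generalizes to tensors, takes only a few lines, and the FFT identity below (the sketch of an outer product equals the circular convolution of the two sketches) is the standard trick that makes compact bilinear pooling tractable. Dimensions here are illustrative, not the paper's:

```python
import numpy as np

def count_sketch(x, D, seed):
    """Hash each input dimension to a random bucket with a random sign."""
    rng = np.random.default_rng(seed)
    h = rng.integers(0, D, size=x.shape[0])        # bucket indices
    s = rng.choice([-1.0, 1.0], size=x.shape[0])   # random signs
    y = np.zeros(D)
    np.add.at(y, h, s * x)
    return y

D = 4096
v_img, v_txt = np.random.rand(512), np.random.rand(300)  # toy features
a = count_sketch(v_img, D, seed=1)
b = count_sketch(v_txt, D, seed=2)
# The sketch of the outer product of v_img and v_txt equals the
# circular convolution of the individual sketches, computed via FFT.
pooled = np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=D)
```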
Over the years, many different indexing techniques and search algorithms have been proposed, including CSS-trees, CSB+-trees, k-ary binary search, and fast architecture-sensitive tree search. There have also been papers on how best to set the many parameters of these index structures, such as the node size of CSB+-trees. These indices have been proposed because CPU speeds have been increasing at a dramatically higher rate than memory speeds, giving rise to the von Neumann CPU-memory bottleneck. To hide the long latencies caused by memory accesses, it has become very important to make good use of the features of modern CPUs. In order to drive down the average number of CPU clock cycles required to execute CPU instructions, and thus increase throughput, it has become important to achieve good utilization of CPU resources. Among these are the data and instruction caches and the translation lookaside buffers. It has also become important to avoid branch misprediction penalties and to exploit the vectorization that CPUs provide in the form of SIMD instructions. While the layout of index structures has been heavily optimized for the data cache of modern CPUs, the instruction cache has been neglected so far. In this paper, we present NitroGen, a framework that uses code generation to speed up index traversal in main-memory database systems. By bringing together data and code, we make index structures use the dormant resource of the instruction cache. We show how to combine index compilation with previous approaches, such as binary tree search, cache-sensitive tree search, and the architecture-sensitive tree search presented by Kim et al.
https://arxiv.org/abs/1706.06697
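As a toy illustration of the index-compilation idea (in Python rather than the generated machine code a real system would target, and not the NitroGen implementation itself), one can unroll a binary search over a fixed key set into straight-line code, so the comparisons live in the instruction stream instead of a pointer-chased node layout:

```python
def compile_index(keys):
    """Generate a search function specialized to a sorted key list."""
    def emit(lo, hi, depth):
        ind = "    " * depth
        if hi - lo == 1:
            return f"{ind}return {lo} if key == {keys[lo]} else -1\n"
        mid = (lo + hi) // 2
        return (f"{ind}if key < {keys[mid]}:\n"
                + emit(lo, mid, depth + 1)
                + f"{ind}else:\n"
                + emit(mid, hi, depth + 1))
    src = "def search(key):\n" + emit(0, len(keys), 1)
    ns = {}
    exec(src, ns)          # compile the unrolled tree traversal
    return ns["search"]

search = compile_index([2, 3, 5, 7, 11, 13, 17, 19])
assert search(11) == 4 and search(6) == -1
```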
Detecting small objects is notoriously challenging due to their low resolution and noisy representation. Existing object detection pipelines usually detect small objects by learning representations of all objects at multiple scales. However, the performance gain of such ad hoc architectures is usually limited and rarely pays off the computational cost. In this work, we address the small object detection problem by developing a single architecture that internally lifts representations of small objects to “super-resolved” ones with characteristics similar to those of large objects, which are thus more discriminative for detection. For this purpose, we propose a new Perceptual Generative Adversarial Network (Perceptual GAN) model that improves small object detection by narrowing the representation difference between small objects and large ones. Specifically, its generator learns to transform the perceived poor representations of small objects into super-resolved ones that are similar enough to real large objects to fool a competing discriminator. Meanwhile, its discriminator competes with the generator to identify the generated representations and imposes an additional perceptual requirement on the generator: generated representations of small objects must be beneficial for the detection task. Extensive evaluations on the challenging Tsinghua-Tencent 100K and Caltech benchmarks demonstrate the superiority of Perceptual GAN in detecting small objects, including traffic signs and pedestrians, over well-established state-of-the-art methods.
https://arxiv.org/abs/1706.05274
We have characterized the photodetection capabilities of single GaN nanowires incorporating 20 periods of AlN/GaN:Ge axial heterostructures enveloped in an AlN shell. Transmission electron microscopy confirms the absence of an additional GaN shell around the heterostructures. In the absence of a surface conduction channel, the incorporation of the heterostructure leads to a decrease of the dark current and an increase of the photosensitivity. A significant dispersion in the magnitude of dark currents for different single nanowires is attributed to the coalescence of nanowires with displaced nanodisks, reducing the effective length of the heterostructure. A larger number of active nanodisks and AlN barriers in the current path results in lower dark current and higher photosensitivity, and improves the sensitivity of the nanowire to variations in the illumination intensity (improved linearity). Additionally, we observe a persistence of the photocurrent, which is attributed to a change of the resistance of the overall structure, particularly the GaN stem and cap sections. In consequence, the time response is rather independent of the dark current.
https://arxiv.org/abs/1604.07978
This paper introduces THUMT, an open-source toolkit for neural machine translation (NMT) developed by the Natural Language Processing Group at Tsinghua University. THUMT implements the standard attention-based encoder-decoder framework on top of Theano and supports three training criteria: maximum likelihood estimation, minimum risk training, and semi-supervised training. It features a visualization tool for displaying the relevance between hidden states in neural networks and contextual words, which helps to analyze the internal workings of NMT. Experiments on Chinese-English datasets show that THUMT using minimum risk training significantly outperforms GroundHog, a state-of-the-art toolkit for NMT.
https://arxiv.org/abs/1706.06415
Recent work in computer vision has yielded impressive results in automatically describing images with natural language. Most of these systems generate captions in a single language, requiring multiple language-specific models to build a multilingual captioning system. We propose a very simple technique to build a single unified model across languages, using artificial tokens to control the language, making the captioning system more compact. We evaluate our approach on generating English and Japanese captions, and show that a typical neural captioning architecture is capable of learning a single model that can switch between two different languages.
https://arxiv.org/abs/1706.06275
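The artificial-token mechanism itself is tiny; a hedged sketch (the token naming is illustrative, not the paper's exact vocabulary):

```python
def to_decoder_input(tokens, target_lang):
    """Prepend an artificial token telling the shared model which
    language to generate; the rest of the architecture is unchanged."""
    return [f"<2{target_lang}>"] + tokens

to_decoder_input(["a", "cat", "on", "a", "mat"], "ja")
# ['<2ja>', 'a', 'cat', 'on', 'a', 'mat']
```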
Generative adversarial nets (GANs) are a promising technique for modeling a distribution from samples. It is, however, well known that GAN training suffers from instability due to the nature of its saddle point formulation. In this paper, we explore ways to tackle the instability problem by dualizing the discriminator. We start from linear discriminators, in which case conjugate duality provides a mechanism to reformulate the saddle point objective into a maximization problem, such that both the generator and the discriminator of this ‘dualing GAN’ act in concert. We then demonstrate how to extend this intuition to non-linear formulations. For GANs with linear discriminators our approach is able to remove the instability in training, while for GANs with nonlinear discriminators it provides an alternative to the commonly used GAN training algorithm.
https://arxiv.org/abs/1706.06216
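For intuition, one textbook instance of the linear case: with a linear discriminator $f_w(x) = w^\top x$ and an $\ell_2$ penalty, the inner maximization has a closed form, so the saddle point collapses into a plain minimization for the generator (a standard conjugate-duality computation, not necessarily the paper's exact objective):

$$\max_{w}\ \Big(\mathbb{E}_{x \sim p_{\mathrm{data}}}[w^\top x] - \mathbb{E}_{x \sim p_g}[w^\top x] - \tfrac{\lambda}{2}\lVert w\rVert^2\Big) = \frac{1}{2\lambda}\,\lVert \mu_{\mathrm{data}} - \mu_g \rVert^2,$$

where $\mu_{\mathrm{data}}$ and $\mu_g$ are the data and generator means. The generator then simply minimizes this squared mean discrepancy, with no inner adversary left to destabilize training.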
In this report, we provide a comparative analysis of different techniques for user intent classification towards the task of app recommendation. We analyse the performance of different models and architectures for multi-label classification over a dataset with a relatively large number of classes and only a handful of examples per class. We focus, in particular, on memory network architectures, and compare how well their different versions perform under the task constraints. Since the classifier is meant to serve as a module in a practical dialog system, it needs to work with limited training data and to incorporate new data on the fly. We devise a one-shot learning task to test the models under this constraint. We conclude that relatively simple versions of memory networks perform better than other approaches, although for tasks with very limited data simple non-parametric methods perform comparably, without needing the extra training data.
https://arxiv.org/abs/1706.06160
The accumulating, but small, set of large semi-major axis trans-Neptunian objects (TNOs) shows an apparent clustering in the orientations of their orbits. This clustering must either be representative of the intrinsic distribution of these TNOs, or else arise as a result of observation biases and/or statistically expected variations for such a small set of detected objects. The clustered TNOs were detected across different and independent surveys, which has led to claims that the detections are therefore free of observational bias. This apparent clustering has led to the so-called “Planet 9” hypothesis that a super-Earth currently resides in the distant solar system and causes this clustering. The Outer Solar System Origins Survey (OSSOS) is a large program that ran on the Canada-France-Hawaii Telescope in 2013–2017, discovering more than 800 new TNOs. One of the primary design goals of OSSOS was the careful determination of observational biases that would manifest within the detected sample. We demonstrate the striking and non-intuitive biases that exist for the detection of TNOs with large semi-major axes. The eight large semi-major axis OSSOS detections are an independent dataset, of comparable size to the conglomerate samples used in previous studies. We conclude that the orbital distribution of the OSSOS sample is consistent with being detected from a uniform underlying angular distribution.
https://arxiv.org/abs/1706.05348
Training of neural machine translation (NMT) models usually uses mini-batches for efficiency purposes. During the mini-batched training process, it is necessary to pad shorter sentences in a mini-batch to be equal in length to the longest sentence therein for efficient computation. Previous work has noted that sorting the corpus based on the sentence length before making mini-batches reduces the amount of padding and increases the processing speed. However, despite the fact that mini-batch creation is an essential step in NMT training, widely used NMT toolkits implement disparate strategies for doing so, which have not been empirically validated or compared. This work investigates mini-batch creation strategies with experiments over two different datasets. Our results suggest that the choice of a mini-batch creation strategy has a large effect on NMT training and some length-based sorting strategies do not always work well compared with simple shuffling.
https://arxiv.org/abs/1706.05765
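A minimal Python sketch of the two extremes compared in such experiments, plain shuffling versus length-based sorting before chunking (strategy details and bucket sizes vary across toolkits; this is illustrative only):

```python
import random

def batches_shuffle(sents, bsz):
    """Baseline: shuffle, then chunk; batches mix lengths freely."""
    sents = sents[:]
    random.shuffle(sents)
    return [sents[i:i + bsz] for i in range(0, len(sents), bsz)]

def batches_sorted(sents, bsz):
    """Length-sorted variant: sort by length before chunking so each
    batch holds similar lengths, then shuffle the batch order."""
    sents = sorted(sents, key=len)
    bs = [sents[i:i + bsz] for i in range(0, len(sents), bsz)]
    random.shuffle(bs)
    return bs

def padding_overhead(batches):
    """Tokens added to pad every sentence to its batch maximum."""
    return sum(max(map(len, b)) * len(b) - sum(map(len, b))
               for b in batches)
```

Comparing padding_overhead over the two strategies makes the speed argument concrete, while the training-quality effect is what the experiments above measure.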
This work presents an in-depth investigation of the properties of complexes composed of hydrogen, silicon or oxygen with carbon, which are major unintentional impurities in undoped GaN. This manuscript is a complement to our previous work on carbon-carbon and carbon-vacancy complexes. We have employed a first-principles method using Heyd-Scuseria-Ernzerhof hybrid functionals within the framework of generalized Kohn-Sham density functional theory. Two H-C, four Si-C and five O-C complexes in different charge states have been considered. After full geometry relaxations, formation energies, binding energies and both thermal and optical transition levels were obtained. The calculated energy levels have been systematically compared with the experimentally observed carbon-related trap levels. Furthermore, we computed vibrational frequencies for selected defect complexes, and defect concentrations were estimated for low, mid and high carbon doping scenarios, considering two cases for the electrically active defects: (a) only carbon and vacancies, and (b) carbon and vacancies together with hydrogen, silicon and oxygen. We confirmed that $\mathrm{C_N}$ is a dominant acceptor in GaN. In addition, a substantial amount of the $\mathrm{Si_{Ga}-C_N}$ complex exists in a neutral form. This complex is a likely candidate for the previously unidentified form of carbon observed in undoped $n$-type GaN.
https://arxiv.org/abs/1706.05574
Various forms of carbon-based complexes in GaN are studied with first-principles calculations employing the Heyd-Scuseria-Ernzerhof hybrid functional within the framework of density functional theory. We consider carbon complexes made of combinations of single impurities, i.e. $\mathrm{C_N-C_{Ga}}$, $\mathrm{C_I-C_N}$ and $\mathrm{C_I-C_{Ga}}$, where $\mathrm{C_N}$, $\mathrm{C_{Ga}}$ and $\mathrm{C_I}$ denote C substituting nitrogen, C substituting gallium and interstitial C, respectively, and of neighboring gallium/nitrogen vacancies ($\mathrm{V_{Ga}}$/$\mathrm{V_N}$), i.e. $\mathrm{C_N-V_{Ga}}$ and $\mathrm{C_{Ga}-V_N}$. Formation energies are computed for all these configurations in different charge states after full geometry optimizations. From the calculated formation energies, thermodynamic transition levels are evaluated, which are related to the thermal activation energies observed in experimental techniques such as deep level transient spectroscopy. Furthermore, the lattice relaxation energies (Franck-Condon shifts) are computed to obtain optical activation energies, which are observed in experimental techniques such as deep level optical spectroscopy. We compare our calculated activation energies with the energies of experimentally observed C-related trap levels and identify the physical origins of these traps, which were previously unknown.
https://arxiv.org/abs/1507.06969
While there is overall agreement that future technology for organizing, browsing and searching videos hinges on the development of methods for high-level semantic understanding of video, so far no consensus has been reached on the best way to train and assess models for this task. Casting video understanding as a form of action or event categorization is problematic as it is not fully clear what the semantic classes or abstractions in this domain should be. Language has been exploited to sidestep the problem of defining video categories, by formulating video understanding as the task of captioning or description. However, language is highly complex, redundant and sometimes ambiguous. Many different captions may express the same semantic concept. To account for this ambiguity, quantitative evaluation of video description requires sophisticated metrics, whose performance scores are typically hard to interpret by humans. This paper provides four contributions to this problem. First, we formulate Video Multiple Choice Caption (VideoMCC) as a new well-defined task with an easy-to-interpret performance measure. Second, we describe a general semi-automatic procedure to create benchmarks for this task. Third, we publicly release a large-scale video benchmark created with an implementation of this procedure and we include a human study that assesses human performance on our dataset. Finally, we propose and test a varied collection of approaches on this benchmark for the purpose of gaining a better understanding of the new challenges posed by video comprehension.
https://arxiv.org/abs/1606.07373
Understanding the response of complex materials to external force is central to fields ranging from materials science to biology. Here, we describe a novel type of mechanical adaptation in cross-linked networks of F-actin, a ubiquitous protein found in eukaryotic cells. We show that shear stress changes the network’s nonlinear mechanical response even long after that stress is removed. The duration, magnitude and direction of forcing history all impact the changes in mechanical response. The “memory” of the forcing history is long-lived, but can be erased by force application in the opposite direction. We further show that the observed mechanical adaptation is consistent with stress-dependent changes in the nematic order of the constituent filaments. Thus, this mechano-memory is a type of nonlinear hysteretic response in which an applied “training” strain modifies the nonlinear elasticity. This demonstrates that F-actin networks can encode analog read-write mechano-memories, which can be used for adaptation to mechanical stimuli.
https://arxiv.org/abs/1706.05336
Machine learning techniques are widely applied in many modern optical sky surveys, e.g. Pan-STARRS1, PTF/iPTF and the Subaru/Hyper Suprime-Cam survey, to reduce human intervention in data verification. In this study, we have established a machine-learning-based real-bogus system to reject false detections in the Subaru/Hyper Suprime-Cam Strategic Survey Program (HSC-SSP) source catalog, so that the HSC-SSP moving object detection pipeline can operate more effectively thanks to the reduction of false positives. To train the real-bogus system, we use stationary sources as the real training set and “flagged” data as the bogus set. The training set contains 47 features, most of which are photometric measurements and shape moments generated by the HSC image reduction pipeline (hscPipe). Our system reaches a true positive rate (tpr) of ~96% with a false positive rate (fpr) of ~1%, or a tpr of ~99% at an fpr of ~5%. We therefore conclude that stationary sources are decent real training samples, and that photometric measurements and shape moments can reject false positives effectively.
https://arxiv.org/abs/1704.06413
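A hedged scikit-learn sketch of the training and operating-point selection described above; the random-forest choice and the placeholder data are assumptions for illustration, since the abstract does not fix the classifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve

# X: (n, 47) features (photometry + shape moments); y: 1 = real
# (stationary sources), 0 = bogus ("flagged" detections).
X = np.random.rand(1000, 47)            # placeholder data
y = np.random.randint(0, 2, 1000)       # placeholder labels

clf = RandomForestClassifier(n_estimators=200).fit(X[:800], y[:800])
scores = clf.predict_proba(X[800:])[:, 1]
fpr, tpr, thresholds = roc_curve(y[800:], scores)

# Pick the operating point closest to fpr ~ 1%.
i = np.argmin(np.abs(fpr - 0.01))
print(f"tpr={tpr[i]:.2f} at fpr={fpr[i]:.3f}")
```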
Deep learning yields great results across many fields, from speech recognition and image classification to translation. But for each problem, getting a deep model to work well involves research into the architecture and a long period of tuning. We present a single model that yields good results on a number of problems spanning multiple domains. In particular, this single model is trained concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task. Our model architecture incorporates building blocks from multiple domains: it contains convolutional layers, an attention mechanism, and sparsely-gated layers. Each of these computational blocks is crucial for a subset of the tasks we train on. Interestingly, even if a block is not crucial for a task, we observe that adding it never hurts performance and in most cases improves it on all tasks. We also show that tasks with less data benefit greatly from joint training with other tasks, while performance on large tasks degrades only slightly, if at all.
https://arxiv.org/abs/1706.05137
Neural networks trained on datasets such as ImageNet have led to major advances in visual object classification. One obstacle that prevents networks from reasoning more deeply about complex scenes and situations, and from integrating visual knowledge with natural language, like humans do, is their lack of common sense knowledge about the physical world. Videos, unlike still images, contain a wealth of detailed information about the physical world. However, most labelled video datasets represent high-level concepts rather than detailed physical aspects about actions and scenes. In this work, we describe our ongoing collection of the “something-something” database of video prediction tasks whose solutions require a common sense understanding of the depicted situation. The database currently contains more than 100,000 videos across 174 classes, which are defined as caption-templates. We also describe the challenges in crowd-sourcing this data at scale.
https://arxiv.org/abs/1706.04261
Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images. Since the release of the first VQA dataset in 2014, additional datasets have been released and many algorithms have been proposed. In this review, we critically examine the current state of VQA in terms of problem formulation, existing datasets, evaluation metrics, and algorithms. In particular, we discuss the limitations of current datasets with regard to their ability to properly train and assess VQA algorithms. We then exhaustively review existing algorithms for VQA. Finally, we discuss possible future directions for VQA and image understanding research.
https://arxiv.org/abs/1610.01465
Content-Centric Networking (CCN) is a new paradigm for the future Internet, in which content is addressed by hierarchically organized names, with the goal of replacing TCP/IP networks. Unlike IP addresses, names have arbitrary length and are larger than the four bytes of an IPv4 address. One important data structure in CCN is the Forwarding Information Base (FIB), in which name prefixes are stored together with their forwarding faces. Long prefixes create problems for memory-constrained Internet of Things (IoT) devices. In this work, we derive requirements for a FIB in the IoT and survey possible solutions. We investigate, design and compare memory-efficient implementations of the FIB based on hashes and Bloom filters. For a large number of prefixes and an equal distribution of prefixes to faces, we recommend a FIB implementation based on Bloom filters. In all other cases, we recommend a hash-based FIB implementation.
https://arxiv.org/abs/1706.04405
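A compact Python sketch of the Bloom-filter-based FIB variant (one filter per face, longest-prefix match by querying progressively shorter prefixes); the filter sizing and hashing are illustrative, not the paper's design:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=4):
        self.bits, self.m, self.k = bytearray(m), m, k
    def _indexes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m
    def add(self, item):
        for j in self._indexes(item):
            self.bits[j] = 1
    def __contains__(self, item):
        return all(self.bits[j] for j in self._indexes(item))

class BloomFIB:
    """One Bloom filter per forwarding face: membership answers
    'does this face have a matching prefix?' without storing names."""
    def __init__(self, faces):
        self.filters = {f: BloomFilter() for f in faces}
    def insert(self, prefix, face):
        self.filters[face].add(prefix)
    def lookup(self, name):
        parts = name.strip("/").split("/")
        for n in range(len(parts), 0, -1):        # longest prefix first
            prefix = "/" + "/".join(parts[:n])
            for face, bf in self.filters.items():
                if prefix in bf:
                    return face                   # may be a false positive
        return None

fib = BloomFIB(faces=["eth0", "eth1"])
fib.insert("/de/uni/papers", "eth0")
assert fib.lookup("/de/uni/papers/2017/fib.pdf") == "eth0"
```

The false positives inherent to Bloom filters trade memory for occasional misforwarding, which is the trade-off such a survey has to weigh against exact hash-based lookup.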
The basic concept of Neural Machine Translation (NMT) is to train a large neural network that maximizes translation performance on a given parallel corpus. NMT then uses a simple left-to-right beam-search decoder to generate new translations that approximately maximize the trained conditional probability. The standard beam search strategy generates the target sentence word by word from left to right while keeping a fixed number of active candidates at each time step. First, this simple search is less adaptive, as it also expands candidates whose scores are much worse than the current best. Secondly, it does not expand hypotheses that fall outside the best-scoring candidates, even if their scores are close to the best one. The latter can be mitigated by increasing the beam size until no further performance improvement is observed; while this can yield better performance, it has the drawback of slower decoding. In this paper, we concentrate on speeding up the decoder by applying a more flexible beam search strategy whose candidate size may vary at each time step depending on the candidate scores. We speed up the original decoder by up to 43% for the two language pairs German-English and Chinese-English without losing any translation quality.
https://arxiv.org/abs/1702.01806
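The core of such score-based pruning fits in a few lines; a sketch assuming hypotheses carry log-probability scores and a relative threshold (one of several possible pruning criteria; the exact scheme and constants are the paper's design space):

```python
def prune_beam(hyps, max_beam, rel_threshold):
    """hyps: list of (hypothesis, log_prob). Keep at most max_beam and
    drop any whose score is more than rel_threshold below the best, so
    the effective beam shrinks when one candidate dominates."""
    hyps = sorted(hyps, key=lambda h: h[1], reverse=True)
    best = hyps[0][1]
    return [h for h in hyps[:max_beam] if h[1] >= best - rel_threshold]

prune_beam([("a", -1.0), ("b", -1.2), ("c", -4.0)], 5, 2.0)
# [('a', -1.0), ('b', -1.2)]  -- 'c' is pruned, shrinking the beam
```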
Training a deep convolutional neural net typically starts with a random initialisation of all filters in all layers, which severely reduces the forward signal and the back-propagated error and leads to slow and sub-optimal training. Techniques that counter this focus on either increasing the signal or increasing the gradients adaptively, but the model behaves very differently at the beginning of training compared to later, when stable pathways through the net have been established. To compound this problem, the effective minibatch size varies greatly between layers at different depths and between individual filters, as activation sparsity typically increases with depth. This leads to a reduction in the effective learning rate, since gradients may superpose rather than add, and further compounds the covariate shift problem, as deeper neurons are less able to adapt to upstream shift. Proposed here is a method of automatic gain control of the signal, built into each convolutional neuron, that achieves performance equivalent or superior to batch normalisation and is compatible with single-sample or minibatch gradient descent. The same model is used both for training and inference. The technique comprises a scaled per-sample map mean subtraction from the raw convolutional filter output, followed by scaling of the difference.
https://arxiv.org/abs/1706.03907
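A PyTorch sketch of the operation the final sentence describes; alpha and beta are assumed gains here (the abstract does not give the exact scaling rule), so this is illustrative rather than the paper's implementation:

```python
import torch

def per_map_gain_control(z, alpha=1.0, beta=1.0):
    """Subtract a scaled per-sample, per-filter map mean from the raw
    convolution output, then rescale the difference.

    z: (N, C, H, W) raw convolutional filter output."""
    map_mean = z.mean(dim=(2, 3), keepdim=True)  # one mean per sample & filter
    return beta * (z - alpha * map_mean)
```

Because the statistics are per sample and per filter map, the operation behaves identically at batch size one, matching the claimed compatibility with single-sample gradient descent.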
Object detection is a core problem in computer vision. With the development of deep ConvNets, the performance of object detectors has improved dramatically. Deep-ConvNet-based object detectors mainly focus on regressing the coordinates of the bounding box, e.g., Faster R-CNN, YOLO and SSD. Unlike these methods, which consider the bounding box as a whole, we propose a novel object bounding box representation based on points and links, implemented with deep ConvNets and termed the Point Linking Network (PLN). Specifically, we regress the corner/center points of the bounding box and their links using a fully convolutional network; we then map the corner points and their links back to multiple bounding boxes; finally, an object detection result is obtained by fusing the multiple bounding boxes. PLN is naturally robust to object occlusion and flexible with respect to object scale and aspect ratio variations. In our experiments, PLN with the Inception-v2 model achieves state-of-the-art single-model and single-scale results on the PASCAL VOC 2007, PASCAL VOC 2012 and COCO detection benchmarks without bells and whistles. The source code will be released.
https://arxiv.org/abs/1706.03646
Most existing machine translation systems operate at the level of words, relying on explicit segmentation to extract tokens. We introduce a neural machine translation (NMT) model that maps a source character sequence to a target character sequence without any segmentation. We employ a character-level convolutional network with max-pooling at the encoder to reduce the length of source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. Our character-to-character model outperforms a recently proposed baseline with a subword-level encoder on WMT’15 DE-EN and CS-EN, and gives comparable performance on FI-EN and RU-EN. We then demonstrate that it is possible to share a single character-level encoder across multiple languages by training a model on a many-to-one translation task. In this multilingual setting, the character-level encoder significantly outperforms the subword-level encoder on all the language pairs. We observe that on CS-EN, FI-EN and RU-EN, the quality of the multilingual character-level translation even surpasses the models specifically trained on that language pair alone, both in terms of BLEU score and human judgment.
https://arxiv.org/abs/1610.03017
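A minimal PyTorch sketch of the length-reducing encoder front end described above (layer sizes and the pooling stride are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

class CharConvEncoder(nn.Module):
    """Character embeddings -> 1D convolution -> strided max-pooling.
    Pooling with stride s shrinks the source length by a factor of s,
    which is what lets a character-level model train at a speed
    comparable to subword-level models."""
    def __init__(self, vocab_size, dim=128, stride=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=5, padding=2)
        self.pool = nn.MaxPool1d(kernel_size=stride, stride=stride)

    def forward(self, chars):                       # chars: (B, L) ids
        x = self.emb(chars).transpose(1, 2)         # (B, dim, L)
        return self.pool(torch.relu(self.conv(x)))  # (B, dim, L // stride)
```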
Recently, with revolutionary neural style transfer methods, creditable paintings can be synthesized automatically from content images and style images. However, when it comes to applying a painting’s style to an anime sketch, these methods simply colorize the sketch lines randomly and fail at the main task: specific style transfer. In this paper, we integrate a residual U-net with an auxiliary classifier generative adversarial network (AC-GAN) to apply the style to a gray-scale sketch. The whole process is automatic and fast, and the results are creditable in the quality of art style as well as colorization.
https://arxiv.org/abs/1706.03319
Neural Machine Translation (NMT) models usually use large target vocabularies to cover most of the words in the target language. The vocabulary size is a big factor when decoding new sentences, as the final softmax layer normalizes over all possible target words. To address this problem, it is common to restrict the target vocabulary with candidate lists based on the source sentence. Usually, the candidate lists combine the output of an external word-to-word aligner, phrase table entries and the most frequent words. In this work, we propose a simple yet novel approach to learn candidate lists directly from the attention layer during NMT training. The candidate lists are highly optimized for the current NMT model and do not need any external computation of the candidate pool. We show significant decoding speedups compared with using the entire vocabulary, without losing any translation quality, for two language pairs.
https://arxiv.org/abs/1706.03824
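A sketch of the idea in plain Python (the crediting rule and top_k are assumptions for illustration; the paper's exact construction may differ):

```python
import numpy as np
from collections import defaultdict

candidates = defaultdict(set)   # source word id -> target word ids

def update_candidates(src_ids, tgt_id, attn, top_k=3):
    """During training, credit each produced target word to the source
    positions that received the highest attention weight."""
    for pos in np.argsort(attn)[-top_k:]:
        candidates[src_ids[pos]].add(tgt_id)

def decode_vocab(src_ids):
    """At test time, restrict the softmax to the union of the
    candidates of all source words (plus, in practice, a short list
    of frequent words)."""
    return set().union(*(candidates[s] for s in src_ids))
```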
The software powering today’s vehicles has surpassed mechatronics as the dominant engineering challenge due to its fast-evolving and innovative nature. In addition, the software and system architecture for upcoming vehicles with automated driving functionality already processes ~750 MB/s, corresponding to over 180 simultaneous 4K video streams from popular video-on-demand services. Hence, self-driving cars will run so much software that they resemble “small data centers on wheels” rather than just transportation vehicles. Continuous integration, deployment, and experimentation have been successfully adopted for software-only products as an enabling methodology for feedback-based software development; for example, a popular search engine conducts ~250 experiments each day to improve its software based on user behavior. This work investigates design criteria for the software architecture and the corresponding software development and deployment process for complex cyber-physical systems, with the goal of enabling continuous experimentation as a way to achieve continuous software evolution. Our research involved reviewing related literature on the topic to extract relevant design requirements. The study concludes by describing the software development and deployment process and software architecture adopted by our self-driving vehicle laboratory, both based on the extracted criteria.
https://arxiv.org/abs/1705.05170
The optical emission of InGaN quantum dots embedded in GaN nanowires is dynamically controlled by a surface acoustic wave (SAW). The emission energy of both the exciton and biexciton lines is modulated over a 1.5 meV range at ~330 MHz. A small but systematic difference in the exciton and biexciton spectral modulation reveals a linear change of the biexciton binding energy with the SAW amplitude. The present results are relevant for the dynamic control of individual single photon emitters based on nitride semiconductors.
https://arxiv.org/abs/1706.03602
Single-photon emitters (SPEs) are at the basis of many applications for quantum information management. Semiconductor-based SPEs are best suited for practical implementations because of their high design flexibility, scalability and integration potential in practical devices. Single-photon emission from ordered arrays of InGaN nano-disks embedded in GaN nanowires is reported. Intense and narrow optical emission lines from quantum-dot-like recombination centers are observed in the blue-green spectral range. Characterization by electron microscopy, cathodoluminescence and micro-photoluminescence indicates that single photons are emitted from regions of high In concentration in the nano-disks that arise from alloy composition fluctuations. Single-photon emission is confirmed by photon correlation measurements showing deep anti-bunching minima in the second-order correlation function. The present results are a promising step towards the realization of on-site/on-demand single-photon sources in the blue-green spectral range operating in the GHz frequency range at high temperatures.
https://arxiv.org/abs/1706.03601
The realization of reliable single photon emitters operating at high temperature and located at predetermined positions still presents a major challenge for the development of solid-state systems for quantum light applications. We demonstrate single-photon emission from two-dimensional ordered arrays of GaN nanowires containing InGaN nano-disks. The structures were fabricated by molecular beam epitaxy on (0001) GaN-on-sapphire templates patterned with nanohole masks prepared by colloidal lithography. Low-temperature cathodoluminescence measurements reveal the spatial distribution of light emitted from a single nanowire heterostructure. The emission originating from the topmost part of the InGaN regions covers the blue-to-green spectral range and shows intense and narrow quantum-dot-like photoluminescence lines. These lines exhibit an average linear polarization ratio of 92%. Photon correlation measurements show photon antibunching with g(2)(0) values well below the 0.5 threshold for single photon emission. The antibunching rate increases linearly with the optical excitation power, extrapolating to an exciton decay rate of ~1 ns^-1 at vanishing pump power. This value is consistent with the exciton lifetime measured by time-resolved photoluminescence. Fast and efficient single photon emitters with controlled spatial position and strong linear polarization are an important step towards high-speed on-chip quantum information management.
https://arxiv.org/abs/1706.03599
Neural networks have proven effective at solving difficult problems but designing their architectures can be challenging, even for image classification problems alone. Our goal is to minimize human participation, so we employ evolutionary algorithms to discover such networks automatically. Despite significant computational requirements, we show that it is now possible to evolve models with accuracies within the range of those published in the last year. Specifically, we employ simple evolutionary techniques at unprecedented scales to discover models for the CIFAR-10 and CIFAR-100 datasets, starting from trivial initial conditions and reaching accuracies of 94.6% (95.6% for ensemble) and 77.0%, respectively. To do this, we use novel and intuitive mutation operators that navigate large search spaces; we stress that no human participation is required once evolution starts and that the output is a fully-trained model. Throughout this work, we place special emphasis on the repeatability of results, the variability in the outcomes and the computational requirements.
https://arxiv.org/abs/1703.01041
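The evolutionary loop itself is simple; a Python sketch in the spirit of the pairwise-tournament scheme described above (mutate and train_and_eval are placeholders for the paper's mutation operators and full model training, which dominate the actual cost):

```python
import random

def evolve(population, mutate, train_and_eval, steps):
    """population: list of {'arch': ..., 'acc': float}. Repeatedly pick
    two random individuals, remove the worse, and add a trained
    mutation of the better."""
    for _ in range(steps):
        a, b = random.sample(population, 2)
        worse, better = (a, b) if a["acc"] < b["acc"] else (b, a)
        population.remove(worse)
        child = mutate(better["arch"])
        population.append({"arch": child, "acc": train_and_eval(child)})
    return max(population, key=lambda m: m["acc"])
```

Since each individual is fully trained before evaluation, the output of the loop is itself a trained model, with no human in the loop once evolution starts.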
Unlike traditional third-person cameras mounted on robots, a first-person camera captures a person’s visual sensorimotor object interactions from up close. In this paper, we study the tight interplay between our momentary visual attention and motor action with objects, as seen from a first-person camera. We propose the concept of action-objects: the objects that capture a person’s conscious visual (watching a TV) or tactile (taking a cup) interactions. Action-objects may be task-dependent, but since many tasks share common person-object spatial configurations, action-objects exhibit characteristic 3D spatial distances and orientations with respect to the person. We design a predictive model that detects action-objects using EgoNet, a joint two-stream network that holistically integrates visual appearance (RGB) and 3D spatial layout (depth and height) cues to predict a per-pixel likelihood of action-objects. Our network also incorporates a first-person coordinate embedding, which is designed to learn the spatial distribution of action-objects in first-person data. We demonstrate EgoNet’s predictive power by showing that it consistently outperforms previous baseline approaches. Furthermore, EgoNet also exhibits strong generalization ability, i.e., it predicts semantically meaningful objects in novel first-person datasets. Our method’s ability to effectively detect action-objects could be used to improve robots’ understanding of human-object interactions.
https://arxiv.org/abs/1603.04908
We report the carrier dynamics and recombination coefficients in single-quantum-well semipolar $(20\bar 2\bar 1)$ InGaN/GaN light-emitting diodes emitting at 440 nm with 93% peak internal quantum efficiency. The differential carrier lifetime is analyzed for various injection current densities from 5 $A/cm^2$ to 10 $kA/cm^2$, and the corresponding carrier densities are obtained. The coupling of internal quantum efficiency and differential carrier lifetime vs injected carrier density ($n$) enables the separation of the radiative and nonradiative recombination lifetimes and the extraction of the Shockley-Read-Hall (SRH) nonradiative ($A$), radiative ($B$), and Auger ($C$) recombination coefficients and their $n$-dependence, considering the saturation of the SRH recombination rate and phase-space filling. The results indicate a three- to four-fold higher $A$ and a nearly two-fold higher $B_0$ for this semipolar orientation compared to that of $c$-plane reported using a similar approach [A. David and M. J. Grundmann, Appl. Phys. Lett. 96, 103504 (2010)]. In addition, the carrier density in semipolar $(20\bar 2\bar 1)$ is found to be lower than the carrier density in $c$-plane for a given current density, which is important for suppressing efficiency droop. The semipolar LED also shows a two-fold lower $C_0$ compared to $c$-plane, which is consistent with the lower relative efficiency droop for the semipolar LED (57% vs. 69%). The lower carrier density, higher $B_0$ coefficient, and lower $C_0$ (Auger) coefficient are directly responsible for the high efficiency and low efficiency droop reported in semipolar $(20\bar 2\bar 1)$ LEDs.
https://arxiv.org/abs/1706.03135
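The separation described above rests on the standard ABC recombination model; in its simplest constant-coefficient form (before the $n$-dependent corrections the abstract mentions),

$$R_{\mathrm{tot}} = An + Bn^2 + Cn^3, \qquad \eta_{\mathrm{IQE}} = \frac{Bn^2}{An + Bn^2 + Cn^3},$$

and the differential carrier lifetime $\tau(n) = (\mathrm{d}R_{\mathrm{tot}}/\mathrm{d}n)^{-1} = (A + 2Bn + 3Cn^2)^{-1}$ is what ties the measured lifetimes and efficiencies to the individual coefficients.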
We introduce new families of Integral Probability Metrics (IPM) for training Generative Adversarial Networks (GAN). Our IPMs are based on matching statistics of distributions embedded in a finite dimensional feature space. Mean and covariance feature matching IPMs allow for stable training of GANs, which we will call McGan. McGan minimizes a meaningful loss between distributions.
https://arxiv.org/abs/1702.08398
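For reference, an integral probability metric between distributions $P$ and $Q$ over a function class $\mathcal{F}$ is

$$\mathrm{IPM}_{\mathcal F}(P, Q) = \sup_{f \in \mathcal F}\ \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)];$$

for instance, taking $f(x) = \langle w, \Phi(x) \rangle$ with $\lVert w \rVert \le 1$ for a feature map $\Phi$ reduces the supremum to $\lVert \mathbb{E}_P\,\Phi(x) - \mathbb{E}_Q\,\Phi(x) \rVert$, i.e. mean feature matching, the simplest member of the family used here.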
Network traffic matrix (TM) prediction is the problem of estimating future network traffic from previously observed network traffic data. It is widely used in network planning, resource management and network security. Long Short-Term Memory (LSTM) is a specific recurrent neural network (RNN) architecture that is well-suited to learning from experience to classify, process and predict time series with time lags of unknown size. LSTMs have been shown to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we propose an LSTM RNN framework for predicting short- and long-term traffic matrices in large networks. By validating our framework on real-world data from the GEANT network, we show that our LSTM models converge quickly and give state-of-the-art TM prediction performance for relatively small models.
https://arxiv.org/abs/1705.05690
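A minimal Keras sketch of the windowed prediction setup (layer sizes and the window length are illustrative, not the paper's configuration): a sliding window of past TM snapshots, each flattened to n_flows values, predicts the next snapshot.

```python
import numpy as np
from tensorflow.keras import layers, models

def build_tm_predictor(n_flows, window=10):
    """LSTM over a window of flattened TM snapshots -> next snapshot."""
    m = models.Sequential([
        layers.LSTM(128, input_shape=(window, n_flows)),
        layers.Dense(n_flows),
    ])
    m.compile(optimizer="adam", loss="mse")
    return m

def windows(series, w):
    """series: (T, n_flows) array -> supervised (X, y) pairs."""
    X = np.stack([series[i:i + w] for i in range(len(series) - w)])
    return X, series[w:]
```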
N-polar GaN p-n diodes are realized on single-crystal N-polar GaN bulk wafers by plasma-assisted molecular beam epitaxy. The current-voltage characteristics show high-quality rectification with a high on/off current ratio, and the diodes exhibit interband electroluminescence. The measured electroluminescence spectrum is dominated by strong near-band-edge emission, while deep-level luminescence is greatly suppressed. A very low dislocation density leads to a high reverse breakdown electric field. These low-leakage-current N-polar diodes open up several potential applications in polarization-engineered photonic and electronic devices.
https://arxiv.org/abs/1706.02439
Automatically generating a natural language description of an image is a task close to the heart of image understanding. In this paper, we present a multi-model neural network method, closely related to the human visual system, that automatically learns to describe the content of images. Our model consists of two sub-models: an object detection and localization model, which extracts information about objects and their spatial relationships in images; and a deep recurrent neural network (RNN) based on long short-term memory (LSTM) units with an attention mechanism for sentence generation. Each word of the description is automatically aligned to different objects of the input image as it is generated. This is similar to the attention mechanism of the human visual system. Experimental results on the COCO dataset showcase the merit of the proposed method, which outperforms previous benchmark models.
https://arxiv.org/abs/1706.02430
Similarity search is a critical primitive for a wide variety of applications, including natural language processing, content-based search, machine learning, computer vision, databases, robotics, and recommendation systems. At its core, similarity search is implemented using the k-nearest neighbors (kNN) algorithm, where computation consists of highly parallel distance calculations and a global top-k sort. In contemporary von Neumann architectures, kNN is bottlenecked by data movement, which limits throughput and worsens latency. In this paper, we present and evaluate a novel automata-based algorithm for kNN on the Micron Automata Processor (AP), a non-von Neumann near-data processing architecture. By employing near-data processing, the AP minimizes the data movement bottleneck and is able to achieve better performance. Unlike prior work in the automata processing space, our work combines temporal encodings with automata design to augment the space of applications for the AP. We evaluate our design’s performance on the AP and compare to state-of-the-art CPU, GPU, and FPGA implementations; we show that the current generation of AP hardware can achieve over 50x speedup over CPUs while maintaining competitive energy efficiency gains. We also propose several automata optimization techniques and simple architectural extensions that highlight the potential of the AP hardware.
https://arxiv.org/abs/1608.03175
This paper aims at synthesizing filamentary structured images, such as retinal fundus images and neuronal images, as follows: given a ground truth, generate multiple realistic-looking phantoms. A ground truth here could be a binary segmentation map containing the filamentary structured morphology, while the synthesized output image is of the same size as the ground truth and has a visual appearance similar to what has been presented in the training set. Our approach is inspired by recent progress in generative adversarial nets (GANs) as well as image style transfer. In particular, it is dedicated to our problem context and has the following properties: rather than requiring a large-scale dataset, it works well in the presence of as few as 10 training examples, which is common in medical image analysis; it is capable of synthesizing diverse images from the same ground truth; and, last but importantly, the synthetic images produced by our approach are demonstrated to be useful in boosting image analysis performance. Empirical examination over various benchmarks of fundus and neuronal images demonstrates the advantages of the proposed approach.
https://arxiv.org/abs/1706.02185
This short article revisits some of the ideas introduced in arXiv:1701.07875 and arXiv:1705.07642 in a simple setup. This sheds some light on the connections between Variational Autoencoders (VAE), Generative Adversarial Networks (GAN) and Minimum Kantorovitch Estimators (MKE).
https://arxiv.org/abs/1706.01807
Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as “the” and “of”. Other words that may seem visual can often be predicted reliably just from the language model e.g., “sign” after “behind a red stop” or “phone” following “talking on a cell”. In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel. The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K. Our approach sets the new state-of-the-art by a significant margin.
https://arxiv.org/abs/1612.01887
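The sentinel gating at the heart of the model can be written compactly; a PyTorch sketch assuming precomputed attention logits over image regions and for the sentinel (a simplification of the paper's full formulation):

```python
import torch

def adaptive_context(c_t, s_t, region_logits, sentinel_logit):
    """Extend the attention softmax with one extra element for the
    visual sentinel; its weight beta decides how much to rely on the
    language state s_t versus the attended visual context c_t.

    c_t, s_t: (B, D); region_logits: (B, k); sentinel_logit: (B, 1)."""
    logits = torch.cat([region_logits, sentinel_logit], dim=-1)
    alpha = torch.softmax(logits, dim=-1)
    beta = alpha[..., -1:]                      # weight on the sentinel
    return beta * s_t + (1 - beta) * c_t
```

When beta approaches 1 the decoder relies on the language-model state alone, which is exactly the behavior desired for non-visual words such as “the” and “of”.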
Robots will eventually be part of every household. It is thus critical to enable algorithms to learn from and be guided by non-expert users. In this paper, we bring a human in the loop, and enable a human teacher to give feedback to a learning agent in the form of natural language. We argue that a descriptive sentence can provide a much stronger learning signal than a numeric reward in that it can easily point to where the mistakes are and how to correct them. We focus on the problem of image captioning in which the quality of the output can easily be judged by non-experts. We propose a hierarchical phrase-based captioning model trained with policy gradients, and design a feedback network that provides reward to the learner by conditioning on the human-provided feedback. We show that by exploiting descriptive feedback our model learns to perform better than when given independently written human captions.
https://arxiv.org/abs/1706.00130
Recent progress has been made in using attention-based encoder-decoder frameworks for video captioning. However, most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., “gun” and “shooting”) and non-visual words (e.g., “the”, “a”). These non-visual words can be easily predicted using a natural language model, without considering visual signals or attention, and imposing an attention mechanism on them can mislead the decoder and decrease the overall performance of video captioning. To address this issue, we propose a hierarchical LSTM with adjusted temporal attention (hLSTMat) approach for video captioning. Specifically, the proposed framework utilizes temporal attention to select specific frames for predicting the related words, while the adjusted temporal attention decides whether to depend on the visual information or the language context information. Also, a hierarchical LSTM is designed to simultaneously consider both low-level visual information and high-level language context information to support video caption generation. To demonstrate the effectiveness of our proposed framework, we test our method on two prevalent datasets, MSVD and MSR-VTT, and experimental results show that our approach outperforms the state-of-the-art methods on both datasets.
https://arxiv.org/abs/1706.01231
In aspect-based sentiment analysis, most existing methods either focus on aspect/opinion terms extraction or aspect terms categorization. However, each task by itself only provides partial information to end users. To generate more detailed and structured opinion analysis, we propose a finer-grained problem, which we call category-specific aspect and opinion terms extraction. This problem involves the identification of aspect and opinion terms within each sentence, as well as the categorization of the identified terms. To this end, we propose an end-to-end multi-task attention model, where each task corresponds to aspect/opinion terms extraction for a specific category. Our model benefits from exploring the commonalities and relationships among different tasks to address the data sparsity issue. We demonstrate its state-of-the-art performance on three benchmark datasets.
https://arxiv.org/abs/1702.01776
Translating information between text and image is a fundamental problem in artificial intelligence that connects natural language processing and computer vision. In the past few years, performance in image caption generation has seen significant improvement through the adoption of recurrent neural networks (RNNs). Meanwhile, text-to-image generation has begun to produce plausible images using datasets of specific categories like birds and flowers. We have even seen image generation from multi-category datasets such as the Microsoft Common Objects in Context (MSCOCO) dataset through the use of generative adversarial networks (GANs). Synthesizing objects with complex shapes, however, is still challenging. For example, animals and humans have many degrees of freedom, which means that they can take on many complex shapes. We propose a new training method called Image-Text-Image (I2T2I), which integrates text-to-image and image-to-text (image captioning) synthesis to improve the performance of text-to-image synthesis. We demonstrate that I2T2I can generate better multi-category images on MSCOCO than the state-of-the-art. We also demonstrate that I2T2I can achieve transfer learning by using a pre-trained image captioning module to generate human images on the MPII Human Pose dataset.
https://arxiv.org/abs/1703.06676