We present a foveated object detector (FOD) as a biologically inspired alternative to the sliding window (SW) approach, the dominant search method in computer vision object detection. Like the human visual system, the FOD has higher resolution at the fovea and lower resolution in the visual periphery; consequently, more computational resources are allocated at the fovea and relatively fewer at the periphery. The FOD processes the entire scene, uses retino-specific object detection classifiers to guide eye movements, aligns its fovea with regions of interest in the input image, and integrates observations across multiple fixations. Our approach combines modern object detectors from computer vision with a recent model of the peripheral pooling regions found in the V1 layer of the human visual system. We assess various eye movement strategies on the PASCAL VOC 2007 dataset and show that the FOD performs on par with the SW detector while yielding significant computational savings.
https://arxiv.org/abs/1408.0814
Generative Adversarial Networks (GANs) were intuitively and attractively explained from the perspective of game theory, in which the two parties involved are a discriminator and a generator. In this game, the task of the discriminator is to discriminate between real and generated (i.e., fake) data, while the task of the generator is to generate fake data that maximally confuses the discriminator. In this paper, we propose a new viewpoint for GANs, which we term the minimizing general loss viewpoint. This viewpoint establishes a connection between the general loss of a classification problem with a convex loss function and an f-divergence between the true and fake data distributions. Mathematically, we propose a setting for the classification of true and fake data in which we can prove that the general loss of this classification problem is exactly the negative f-divergence for a certain convex function f. This allows us to interpret the problem of learning the generator to minimize the f-divergence between the true and fake data distributions as that of maximizing the general loss, which is equivalent to the min-max problem in GANs when the logistic loss is used in the classification problem. This viewpoint strengthens GANs in two ways. First, it allows us to employ any convex loss function for the discriminator. Second, it suggests that rather than limiting ourselves to NN-based discriminators, we can alternatively utilize other powerful families. Adopting this viewpoint, we then propose using the kernel-based family for discriminators. This family has two appealing features: i) a powerful capacity for classifying data of a non-linear nature, and ii) convexity in the feature space. Using the convexity of this family, we can further apply Fenchel duality to equivalently transform the max-min problem into the max-max dual problem.
https://arxiv.org/abs/1711.01744
We present an empirical study of active learning for Visual Question Answering, where a deep VQA model selects informative question-image pairs from a pool and queries an oracle for answers to maximally improve its performance under a limited query budget. Drawing analogies from human learning, we explore cramming (entropy), curiosity-driven (expected model change), and goal-driven (expected error reduction) active learning approaches, and propose a fast and effective goal-driven active learning scoring function to pick question-image pairs for deep VQA models under the Bayesian Neural Network framework. We find that deep VQA models need large amounts of training data before they can start asking informative questions. But once they do, all three approaches outperform the random selection baseline and achieve significant query savings. For the scenario where the model is allowed to ask generic questions about images but is evaluated only on specific questions (e.g., questions whose answer is either yes or no), our proposed goal-driven scoring function performs the best.
https://arxiv.org/abs/1711.01732
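As a rough illustration of the entropy ("cramming") criterion above under an approximate Bayesian treatment, the following sketch scores pool items by the predictive entropy of Monte Carlo dropout samples; the scoring details are our assumption, not the paper's exact goal-driven function.

```python
import numpy as np

def mc_dropout_entropy(predict_fn, pool, n_samples=20):
    """Score each pool item by predictive entropy, approximating a
    Bayesian posterior with stochastic (dropout-enabled) forward passes.

    predict_fn(batch) -> (batch, n_classes) softmax probabilities,
    with dropout left active at inference time."""
    probs = np.mean([predict_fn(pool) for _ in range(n_samples)], axis=0)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return entropy  # higher = more informative query

# Selection: query the oracle for the top-k most uncertain pairs, e.g.
#   scores = mc_dropout_entropy(model_predict, pool_features)
#   query_idx = np.argsort(-scores)[:k]
```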
A robust and fast automatic moving object detection and tracking system is essential to characterize target objects and extract spatial and temporal information for different functionalities, including video surveillance, urban traffic monitoring and navigation, and robotics. In this dissertation, I present a collaborative Spatial Pyramid Context-aware Tracking (SPCT) system for moving object detection and tracking. The proposed visual tracker is composed of one master tracker, which usually relies on visual object features, and two auxiliary trackers based on object temporal motion information, which are called dynamically to assist the master tracker. SPCT utilizes image spatial context at different levels to make the video tracking system resistant to occlusion and background noise, and to improve target localization accuracy and robustness. We chose seven pre-selected complementary feature channels, including RGB color, intensity, and a spatial pyramid of HoG, to encode object color, shape, and spatial layout information. We exploit the integral histogram as a building block to meet the demands of real-time performance. A novel fast algorithm is presented to accurately evaluate spatially weighted local histograms in constant time complexity using an extension of the integral histogram method. Different techniques are explored to efficiently compute the integral histogram on GPU architectures and are applied to fast spatio-temporal median computations and 3D face reconstruction texturing. We propose a multi-component framework based on semantic fusion of motion information with a projected building footprint map to significantly reduce the false alarm rate in urban scenes with many tall structures. Experiments on the extensive VOTC2016 benchmark dataset and on aerial video confirm that combining complementary tracking cues in an intelligent fusion framework enables persistent tracking for Full Motion Video and Wide Aerial Motion Imagery.
https://arxiv.org/abs/1711.01656
Internet of Things (IoT) is expected to enable a myriad of applications by interconnecting objects - such as sensors and robots - over the Internet. IoT applications range from healthcare to autonomous vehicles and include disaster management. Enabling these applications in cloud environments requires the design of appropriate IoT Infrastructure-as-a-Service (IoT IaaS) to ease the provisioning of IoT objects as cloud services. This paper discusses a case study on search and rescue IoT applications in large-scale disaster scenarios. It proposes an IoT IaaS architecture that virtualizes robots (IaaS for robots) and provides them to upstream applications as-a-Service. Both node-level and network-level robot virtualization are supported. The proposed architecture meets a set of identified requirements, such as the need for a unified description model for heterogeneous robots, a publication/discovery mechanism, and federation with other IaaS for robots when needed. A validating proof of concept is built, and experiments are conducted to evaluate its performance. Lessons learned and prospective research directions are discussed.
https://arxiv.org/abs/1710.04919
We propose a reconfigurable hardware architecture for deep neural networks (DNNs) capable of online training and inference, which uses algorithmically pre-determined, structured sparsity to significantly lower memory and computational requirements. This novel architecture introduces the notion of edge-processing to provide flexibility and combines junction pipelining and operational parallelization to speed up training. The overall effect is to reduce network complexity by factors up to 30x and training time by up to 35x relative to GPUs, while maintaining high fidelity of inference results. This has the potential to enable extensive parameter searches and development of the largely unexplored theoretical foundation of DNNs. The architecture automatically adapts itself to different network sizes given available hardware resources. As proof of concept, we show results obtained for different bit widths.
https://arxiv.org/abs/1711.01343
Generative Adversarial Networks (GANs) are powerful models for learning complex distributions. Stable training of GANs has been addressed in many recent works which explore different metrics between distributions. In this paper we introduce Fisher GAN, which fits within the Integral Probability Metrics (IPM) framework for training GANs. Fisher GAN defines a critic with a data-dependent constraint on its second-order moments. We show that Fisher GAN allows for stable and time-efficient training that does not compromise the capacity of the critic and does not need data-independent constraints such as weight clipping. We analyze Fisher IPM theoretically and provide an algorithm based on the Augmented Lagrangian method for Fisher GAN. We validate our claims on both image sample generation and semi-supervised classification using Fisher GAN.
https://arxiv.org/abs/1705.09675
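For orientation, a schematic of the Fisher critic objective with its augmented Lagrangian, as we read the construction; the penalty weight, sign conventions, and multiplier update are assumptions of this sketch, not a verified transcription of the paper.

```python
import torch

def fisher_critic_objective(f_real, f_fake, lam, rho=1e-6):
    """Schematic Fisher IPM critic objective (a sketch, assuming the
    augmented-Lagrangian form: an IPM term plus a second-order moment
    constraint Omega = 0.5*E_P[f^2] + 0.5*E_Q[f^2] held at 1)."""
    ipm = f_real.mean() - f_fake.mean()
    omega = 0.5 * (f_real ** 2).mean() + 0.5 * (f_fake ** 2).mean()
    # The critic maximizes: IPM + lam*(1 - Omega) - rho/2*(1 - Omega)^2
    objective = ipm + lam * (1.0 - omega) - 0.5 * rho * (1.0 - omega) ** 2
    return objective, omega

# After each critic step, the multiplier would typically be updated as
#   lam <- lam - rho * (1 - omega.detach())
# (the sign convention here is an assumption of this sketch).
```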
Semi-supervised learning methods based on generative adversarial networks (GANs) have obtained strong empirical results, but it is not clear 1) how the discriminator benefits from joint training with a generator, and 2) why good semi-supervised classification performance and a good generator cannot be obtained at the same time. Theoretically, we show that given the discriminator objective, good semi-supervised learning indeed requires a bad generator, and we propose the definition of a preferred generator. Empirically, we derive a novel formulation based on our analysis that substantially improves over feature matching GANs, obtaining state-of-the-art results on multiple benchmark datasets.
https://arxiv.org/abs/1705.09783
Inspired by a child’s learning experience - being taught first, followed by observation and questioning - we investigate a critically supervised learning methodology for object detection in this work. Specifically, we propose a taught-observe-ask (TOA) method that consists of several novel components, such as negative object proposal, critical example mining, and machine-guided question-answer (QA) labeling. To consider labeling time and performance jointly, new evaluation methods are developed to compare the performance of the TOA method with that of fully and weakly supervised learning methods. Extensive experiments are conducted on the PASCAL VOC and Caltech benchmark datasets. The TOA method significantly improves on the performance of weak supervision while demanding only about 3-6% of the labeling time of full supervision. The effectiveness of each novel component is also analyzed.
https://arxiv.org/abs/1711.01043
While neural machine translation (NMT) has become the new paradigm, its parameter optimization requires large-scale parallel data, which is scarce in many domains and language pairs. In this paper, we address a new translation scenario in which there exist only monolingual corpora and phrase pairs. We propose a new method for translation with partially aligned sentence pairs, which are derived from the phrase pairs and monolingual corpora. To make full use of the partially aligned corpora, we adapt the conventional NMT training method in two aspects. On the one hand, different generation strategies are designed for aligned and unaligned target words. On the other hand, a different objective function is designed to model the partially aligned parts. The experiments demonstrate that our method achieves a relatively good result in such a translation scenario and that tiny bitexts can boost translation quality to a large extent.
https://arxiv.org/abs/1711.01006
There has recently been significant interest in hard attention models for tasks such as object recognition, visual captioning and speech recognition. Hard attention can offer benefits over soft attention such as decreased computational cost, but training hard attention models can be difficult because of the discrete latent variables they introduce. Previous work used REINFORCE and Q-learning to approach these issues, but those methods can provide high-variance gradient estimates and be slow to train. In this paper, we tackle the problem of learning hard attention for a sequential task using variational inference methods, specifically the recently introduced VIMCO and NVIL. Furthermore, we propose a novel baseline that adapts VIMCO to this setting. We demonstrate our method on a phoneme recognition task in clean and noisy environments and show that our method outperforms REINFORCE, with the difference being greater for a more complicated task.
https://arxiv.org/abs/1705.05524
We report results from the preliminary trials of Colibri, a dedicated fast-photometry array for the detection of small Kuiper belt objects through serendipitous stellar occultations. Colibri’s novel data processing pipeline analyzed 4000 star hours with two overlapping-field EMCCD cameras, detecting no Kuiper belt objects and one false positive occultation event in a high ecliptic latitude field. No occultations would be expected at these latitudes, allowing these results to provide a control sample for the upcoming main Colibri campaign. The empirical false positive rate found by the processing pipeline is consistent with the 0.002% simulation-determined false positive rate. We also describe Colibri’s software design, kernel sets for modeling stellar occultations, and method for retrieving occultation parameters from noisy diffraction curves. Colibri’s main campaign will begin in mid-2018, operating at a 40 Hz sampling rate.
https://arxiv.org/abs/1711.00358
Compared to traditional statistical machine translation (SMT), neural machine translation (NMT) often sacrifices adequacy for the sake of fluency. We propose a method to combine the advantages of traditional SMT and NMT by exploiting an existing phrase-based SMT model to compute the phrase-based decoding cost for an NMT output and then using this cost to rerank the n-best NMT outputs. The main challenge in implementing this approach is that NMT outputs may not be in the search space of the standard phrase-based decoding algorithm, because the search space of phrase-based SMT is limited by the phrase-based translation rule table. We propose a soft forced decoding algorithm, which can always successfully find a decoding path for any NMT output. We show that using the forced decoding cost to rerank the NMT outputs can successfully improve translation quality on four different language pairs.
https://arxiv.org/abs/1711.00309
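The reranking step itself is simple once the soft forced decoding cost is available; a hypothetical sketch (the interpolation weight and score convention are ours):

```python
def rerank_nbest(nbest, forced_decode_cost, weight=0.5):
    """Rerank NMT n-best outputs with a phrase-based decoding cost.

    nbest: list of (hypothesis, nmt_score), higher scores being better.
    forced_decode_cost(hyp) -> non-negative cost from the soft forced
    decoder (lower is better). The interpolation weight is a placeholder;
    in practice it would be tuned on held-out data."""
    rescored = [(hyp, nmt_score - weight * forced_decode_cost(hyp))
                for hyp, nmt_score in nbest]
    return max(rescored, key=lambda pair: pair[1])[0]
```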
We describe a novel architecture for semantic image retrieval—in particular, retrieval of instances of visual situations. Visual situations are concepts such as “a boxing match,” “walking the dog,” “a crowd waiting for a bus,” or “a game of ping-pong,” whose instantiations in images are linked more by their common spatial and semantic structure than by low-level visual similarity. Given a query situation description, our architecture—called Situate—learns models capturing the visual features of expected objects as well as the expected spatial configuration of relationships among objects. Given a new image, Situate uses these models in an attempt to ground (i.e., to create a bounding box locating) each expected component of the situation in the image via an active search procedure. Situate uses the resulting grounding to compute a score indicating the degree to which the new image is judged to contain an instance of the situation. Such scores can be used to rank images in a collection as part of a retrieval system. In the preliminary study described here, we demonstrate the promise of this system by comparing Situate’s performance with that of two baseline methods, as well as with a related semantic image-retrieval system based on “scene graphs.”
https://arxiv.org/abs/1711.00088
It is commonly agreed that the use of relevant invariances as a good statistical bias is important in machine learning. However, most approaches that explicitly incorporate invariances into a model architecture only make use of very simple transformations, such as translations and rotations. Hence, there is a need for methods to model and extract richer transformations that capture much higher-level invariances. To that end, we introduce a tool for parametrizing the set of filters of a trained convolutional neural network with the latent space of a generative adversarial network. We then show that the method can capture highly non-linear invariances of the data by visualizing their effect in the data space.
https://arxiv.org/abs/1710.11386
The recent progress in formation of two-dimensional (2D) GaN by a migration-enhanced encapsulated technique opens up new possibilities for group III-V 2D semiconductors with a band gap within the visible energy spectrum. Using first-principles calculations we explored alloying of 2D-GaN to achieve an optically active material with a tuneable band gap. The effect of isoelectronic III-V substitutional elements on the band gaps, band offsets, and spatial electron localization is studied. In addition to optoelectronic properties, the formability of alloys is evaluated using impurity formation energies. A dilute highly-mismatched solid solution 2D-GaN$_{1-x}$P$_x$ features an efficient band gap reduction in combination with a moderate energy penalty associated with incorporation of phosphorus in 2D-GaN, which is substantially lower than in the case of the bulk GaN. The group-V alloying elements also introduce significant disorder and localization at the valence band edge that facilitates direct band gap optical transitions, thus implying the feasibility of using III-V alloys of 2D-GaN in light-emitting devices.
https://arxiv.org/abs/1707.04625
Recurrent neural networks (RNNs) have been successfully applied to various natural language processing (NLP) tasks and achieved better results than conventional methods. However, the lack of understanding of the mechanisms behind their effectiveness limits further improvements on their architectures. In this paper, we present a visual analytics method for understanding and comparing RNN models for NLP tasks. We propose a technique to explain the function of individual hidden state units based on their expected response to input texts. We then co-cluster hidden state units and words based on the expected response and visualize co-clustering results as memory chips and word clouds to provide more structured knowledge on RNNs’ hidden states. We also propose a glyph-based sequence visualization based on aggregate information to analyze the behavior of an RNN’s hidden state at the sentence-level. The usability and effectiveness of our method are demonstrated through case studies and reviews from domain experts.
https://arxiv.org/abs/1710.10777
Deep learning models require extensive architecture design exploration and hyperparameter optimization to perform well on a given task. The exploration of the model design space is often made by a human expert, and optimized using a combination of grid search and search heuristics over a large space of possible choices. Neural Architecture Search (NAS) is a Reinforcement Learning approach that has been proposed to automate architecture design. NAS has been successfully applied to generate Neural Networks that rival the best human-designed architectures. However, NAS requires sampling, constructing, and training hundreds to thousands of models to achieve well-performing architectures. This procedure needs to be executed from scratch for each new task. The application of NAS to a wide set of tasks currently lacks a way to transfer generalizable knowledge across tasks. In this paper, we present the Multitask Neural Model Search (MNMS) controller. Our goal is to learn a generalizable framework that can condition model construction on successful model searches for previously seen tasks, thus significantly speeding up the search for new tasks. We demonstrate that MNMS can conduct an automated architecture search for multiple tasks simultaneously while still learning well-performing, specialized models for each task. We then show that pre-trained MNMS controllers can transfer learning to new tasks. By leveraging knowledge from previous searches, we find that pre-trained MNMS models start from a better location in the search space and reduce search time on unseen tasks, while still discovering models that outperform published human-designed models.
https://arxiv.org/abs/1710.10776
A deep region-based object detector consists of a region proposal step and a deep object recognition step. In this paper, we make significant improvements to both steps. For region proposal, we propose a novel lightweight cascade structure that effectively improves RPN proposal quality. For object recognition, we re-implement global context modeling with a few modifications and obtain a performance boost (4.2% mAP gain on the ILSVRC 2016 validation set). In addition, we apply the idea of pre-training extensively and show its importance in both steps. Together with common training and testing tricks, we improve the Faster R-CNN baseline by a large margin. In particular, we obtain 87.9% mAP on the PASCAL VOC 2012 test set, 65.3% on the ILSVRC 2016 test set, and 36.8% on the COCO test-std set.
https://arxiv.org/abs/1710.10749
We present Direct Assessment, a method for manually assessing the quality of automatically generated video captions. Evaluating the accuracy of video captions is particularly difficult because for any given video clip there is no definitive ground truth or correct answer against which to measure. Automatic metrics for comparing automatic video captions against a manual caption, such as BLEU and METEOR, drawn from techniques used in evaluating machine translation, were used in the TRECVid video captioning task in 2016, but these are shown to have weaknesses. The work presented here brings human assessment into the evaluation by crowdsourcing how well a caption describes a video. We automatically degrade the quality of some sample captions, which are assessed manually, and from this we are able to rate the quality of the human assessors, a factor we take into account in the evaluation. Using data from the TRECVid video-to-text task in 2016, we show that our direct assessment method is replicable and robust and should scale to settings where there are many caption-generation techniques to be evaluated.
https://arxiv.org/abs/1710.10586
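A common way to realize such crowdsourced assessment, sketched here as an assumption rather than the paper's exact pipeline, is to z-normalize raw scores per assessor before averaging per caption:

```python
from collections import defaultdict
import statistics

def standardize_scores(ratings):
    """ratings: list of (assessor_id, caption_id, raw_score).
    Z-normalize per assessor to remove individual scoring biases, then
    average per caption (a sketch of the usual Direct Assessment style
    aggregation, not the paper's exact procedure)."""
    by_assessor = defaultdict(list)
    for assessor, _, score in ratings:
        by_assessor[assessor].append(score)
    stats = {a: (statistics.mean(v), statistics.pstdev(v) or 1.0)
             for a, v in by_assessor.items()}
    by_caption = defaultdict(list)
    for assessor, caption, score in ratings:
        mu, sd = stats[assessor]
        by_caption[caption].append((score - mu) / sd)
    return {c: statistics.mean(z) for c, z in by_caption.items()}
```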
While the visualization of statistical data is a maturing technology, the visualization of textual data is still in its infancy, especially for artistic text. Because the visualization of artistic text is valuable and attractive in both art and information science, we attempt to realize this tentative idea in this article. We propose Generative Adversarial Network based Artistic Textual Visualization (GAN-ATV), which can create paintings after analyzing the semantic content of existing poems. Our GAN-ATV consists of two main sections: a natural language analysis section and a visual information synthesis section. In the natural language analysis section, we use Bag-of-Words (BoW) feature descriptors and a two-layer network to mine and analyze the high-level semantic information in poems. In the visual information synthesis section, we design a cross-modal semantic understanding module and integrate it with a Generative Adversarial Network (GAN) to create paintings whose content corresponds to the original poems. Moreover, in order to train our GAN-ATV and verify its performance, we establish a cross-modal artistic dataset named “Cross-Art”. In the Cross-Art dataset, there are six topics, each with its corresponding paintings and poems. Experimental results on the Cross-Art dataset are presented in this article.
https://arxiv.org/abs/1710.10553
Reward augmented maximum likelihood (RAML), a simple and effective learning framework to directly optimize towards the reward function in structured prediction tasks, has led to a number of impressive empirical successes. RAML incorporates task-specific reward by performing maximum-likelihood updates on candidate outputs sampled according to an exponentiated payoff distribution, which gives higher probabilities to candidates that are close to the reference output. While RAML is notable for its simplicity, efficiency, and its impressive empirical successes, the theoretical properties of RAML, especially the behavior of the exponentiated payoff distribution, have not been examined thoroughly. In this work, we introduce softmax Q-distribution estimation, a novel theoretical interpretation of RAML, which reveals the relation between RAML and Bayesian decision theory. The softmax Q-distribution can be regarded as a smooth approximation of the Bayes decision boundary, and the Bayes decision rule is achieved by decoding with this Q-distribution. We further show that RAML is equivalent to approximately estimating the softmax Q-distribution, with the temperature $\tau$ controlling the approximation error. We perform two experiments, one on synthetic data for multi-class classification and one on real data for image captioning, to demonstrate the relationship between RAML and the proposed softmax Q-distribution estimation method, verifying our theoretical analysis. Additional experiments on three structured prediction tasks with rewards defined on sequential (named entity recognition), tree-based (dependency parsing), and irregular (machine translation) structures show notable improvements over maximum likelihood baselines.
https://arxiv.org/abs/1705.07136
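For reference, the exponentiated payoff distribution and the RAML objective discussed above take the following standard form (notation reconstructed from the usual formulation, not copied from the paper):

```latex
% Exponentiated payoff distribution over candidate outputs y, given
% reference y* and reward r, with temperature tau:
q(y \mid y^{*}; \tau) \;=\;
    \frac{\exp\{ r(y, y^{*}) / \tau \}}
         {\sum_{y'} \exp\{ r(y', y^{*}) / \tau \}}
% RAML minimizes the expected negative log-likelihood under q:
\mathcal{L}_{\mathrm{RAML}}(\theta) \;=\;
    -\, \mathbb{E}_{y \sim q(\cdot \mid y^{*}; \tau)}
    \left[ \log p_{\theta}(y \mid x) \right]
```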
Since the creation of Generative Adversarial Networks (GANs), much work has been done to improve their training stability, their generated image quality, and their range of applications, but almost none of it has explored their self-training potential. Self-training was used before the advent of deep learning to allow training on limited labelled data and has shown impressive results in semi-supervised learning. In this work, we combine these two ideas and make GANs self-trainable for semi-supervised learning tasks by exploiting their infinite data generation potential. Results show that using even the simplest form of self-training yields an improvement. We also show results for a more complex self-training scheme that performs at least as well as the basic scheme but with significantly less data augmentation.
https://arxiv.org/abs/1710.10313
We present a new “learning-to-learn”-type approach that enables rapid learning of concepts from small-to-medium sized training sets and is primarily designed for web-initialized image retrieval. At the core of our approach is a deep architecture (a Set2Model network) that maps sets of examples to simple generative probabilistic models such as Gaussians or mixtures of Gaussians in the space of high-dimensional descriptors. The parameters of the embedding into the descriptor space are trained end-to-end in the meta-learning stage using a set of training learning problems. The main technical novelty of our approach is the derivation of the backprop process through the mixture model fitting, which makes the likelihood of the resulting models differentiable with respect to the positions of the input descriptors. While the meta-learning process for a Set2Model network is discriminative, a trained Set2Model network performs generative learning of generative models in the descriptor space, which facilitates learning in cases where no negative examples are available and whenever the concept being learned is polysemous or represented by noisy training sets. Among other experiments, we demonstrate that these properties allow Set2Model networks to pick visual concepts from the raw outputs of Internet image search engines better than a set of strong baselines.
https://arxiv.org/abs/1612.07697
We present a study of GaN single-nanowire ultraviolet photodetectors with an embedded GaN/AlN superlattice. The heterostructure dimensions and doping profile were designed in such a way that the application of positive or negative bias leads to an enhancement of the collection of photogenerated carriers from the GaN/AlN superlattice or from the GaN base, respectively, as confirmed by electron beam-induced current measurements. The devices display enhanced response in the ultraviolet A ($\approx$ 330-360 nm) / B ($\approx$ 280-330 nm) spectral windows under positive/negative bias. The result is explained by correlation of the photocurrent measurements with scanning transmission electron microscopy observations of the same single nanowire, and semi-classical simulations of the strain and band structure in one and three dimensions.
https://arxiv.org/abs/1712.01869
Erbium (Er) doped GaN has been studied extensively for optoelectronic applications, yet its defect physics is still not well understood. In this work, we report a first-principles hybrid density functional study of the structure, energetics, and thermodynamic transition levels of Er-related defect complexes in GaN. We discover for the first time that Er$_{\rm Ga}$-C$_{\rm N}$-$V_{\rm N}$, a defect complex of Er, a C impurity, and an N vacancy, and Er$_{\rm Ga}$-O$_{\rm N}$-$V_{\rm N}$, a complex of Er, an O impurity, and an N vacancy, form defect levels at 0.18 and 0.46 eV below the conduction band, respectively. Together with Er$_{\rm Ga}$-$V_{\rm N}$, a defect complex of Er and an N vacancy which has recently been found to produce a donor level at 0.61 eV, these defect complexes provide an explanation for the Er-related defect levels observed in experiments. The role of these defects in optical excitation of the luminescent Er center is also discussed.
https://arxiv.org/abs/1710.09886
We present experimental measurements of the thermal boundary conductance (TBC) from $77$ to $500$ K across isolated heteroepitaxially grown ZnO films on GaN substrates. These data provide an assessment of the assumptions that drive the phonon-gas-model-based diffuse mismatch model (DMM) and atomistic Green’s function (AGF) formalisms for predicting TBC. Our measurements, when compared to previous experimental data, suggest that the TBC can be influenced by long-wavelength, zone-center modes in a material on one side of the interface, as opposed to the “vibrational mismatch” concept assumed in the DMM; this disagreement is pronounced at high temperatures. At room temperature, we measure the ZnO/GaN TBC as $490\lbrack +150, -110\rbrack$ MW m$^{-2}$ K$^{-1}$. The disagreement between the DMM and AGF predictions and the experimental data at these elevated temperatures suggests a non-negligible contribution from additional modes contributing to TBC that are not accounted for in the fundamental assumptions of these harmonic formalisms, such as inelastic scattering. Given the high quality of these ZnO/GaN interfaces, these results provide an invaluable critical and quantitative assessment of the accuracy of assumptions in the current state of the art of computational approaches for predicting the phonon TBC across interfaces.
https://arxiv.org/abs/1710.09525
The performance of data-driven networks for tumor classification varies with the stain style of histopathological images. This article proposes a stain-style transfer (SST) model based on conditional generative adversarial networks (GANs), which learns not only a certain color distribution but also the corresponding histopathological pattern. Our model considers a feature-preserving loss in addition to the well-known GAN loss. Consequently, our model not only transfers initial stain styles to the desired one but also prevents degradation of the tumor classifier on transferred images. The model is examined using the CAMELYON16 dataset.
https://arxiv.org/abs/1710.08543
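A minimal sketch of a generator objective combining the two terms described above; the loss shapes, the feature extractor, and the weighting are placeholders, not the paper's exact formulation.

```python
import torch

def sst_generator_loss(d_fake_logits, feat_orig, feat_trans, alpha=1.0):
    """Sketch of an SST-style generator objective: a standard GAN term
    plus a feature-preserving term that keeps the tumor classifier's
    features stable under style transfer.

    d_fake_logits: discriminator logits on transferred images.
    feat_orig / feat_trans: classifier features of the original and the
    transferred image (the extractor and alpha are placeholders)."""
    gan_term = torch.nn.functional.softplus(-d_fake_logits).mean()
    feature_term = ((feat_orig - feat_trans) ** 2).mean()
    return gan_term + alpha * feature_term
```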
We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.
https://arxiv.org/abs/1706.10006
Deep convolutional neural networks have led to breakthrough results in numerous practical machine learning tasks such as classification of images in the ImageNet data set, control-policy-learning to play Atari games or the board game Go, and image captioning. Many of these applications first perform feature extraction and then feed the results thereof into a trainable classifier. The mathematical analysis of deep convolutional neural networks for feature extraction was initiated by Mallat, 2012. Specifically, Mallat considered so-called scattering networks based on a wavelet transform followed by the modulus non-linearity in each network layer, and proved translation invariance (asymptotically in the wavelet scale parameter) and deformation stability of the corresponding feature extractor. This paper complements Mallat’s results by developing a theory that encompasses general convolutional transforms, or in more technical parlance, general semi-discrete frames (including Weyl-Heisenberg filters, curvelets, shearlets, ridgelets, wavelets, and learned filters), general Lipschitz-continuous non-linearities (e.g., rectified linear units, shifted logistic sigmoids, hyperbolic tangents, and modulus functions), and general Lipschitz-continuous pooling operators emulating, e.g., sub-sampling and averaging. In addition, all of these elements can be different in different network layers. For the resulting feature extractor we prove a translation invariance result of vertical nature in the sense of the features becoming progressively more translation-invariant with increasing network depth, and we establish deformation sensitivity bounds that apply to signal classes such as, e.g., band-limited functions, cartoon functions, and Lipschitz functions.
https://arxiv.org/abs/1512.06293
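Schematically, the per-layer structure analyzed above can be summarized as follows (our simplified notation for the semi-discrete frame setting, not a transcription of the paper's definitions):

```latex
% Layer n: convolve with a frame atom g_{\lambda_n}, apply a Lipschitz
% non-linearity \rho_n, then a Lipschitz pooling operator P_n:
u_n \;=\; P_n\!\big( \rho_n\big( u_{n-1} * g_{\lambda_n} \big) \big),
\qquad u_0 = f .
% Mallat's scattering network is the special case where \rho_n is the
% modulus |\cdot| and the g_{\lambda_n} are wavelets.
```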
N-polar (In,Ga)N/GaN quantum wells prepared on freestanding GaN substrates by plasma-assisted molecular beam epitaxy at conventional growth temperatures of about 650 °C do not exhibit any detectable luminescence even at 10 K. In the present work, we investigate (In,Ga)N/GaN quantum wells grown on Ga- and N-polar GaN substrates at a constant temperature of 730 °C. This exceptionally high temperature results in a vanishing In incorporation for the Ga-polar sample. In contrast, quantum wells with an In content of 20% and abrupt interfaces are formed on N-polar GaN. Moreover, these quantum wells exhibit a spatially homogeneous green luminescence band up to room temperature, but the intensity of this band is observed to strongly quench with temperature. Temperature-dependent photoluminescence transients show that this thermal quenching is related to a high density of nonradiative Shockley-Read-Hall centers with large capture coefficients for electrons and holes.
https://arxiv.org/abs/1710.08351
This paper proposes a network architecture to perform variable length semantic video generation using captions. We adopt a new perspective towards video generation where we allow the captions to be combined with the long-term and short-term dependencies between video frames and thus generate a video in an incremental manner. Our experiments demonstrate our network architecture’s ability to distinguish between objects, actions and interactions in a video and combine them to generate videos for unseen captions. The network also exhibits the capability to perform spatio-temporal style transfer when asked to generate videos for a sequence of captions. We also show that the network’s ability to learn a latent representation allows it to generate videos in an unsupervised manner and perform other tasks such as action recognition. (Accepted in International Conference on Computer Vision (ICCV) 2017)
https://arxiv.org/abs/1708.05980
This paper introduces a novel approach for generating videos called Synchronized Deep Recurrent Attentive Writer (Sync-DRAW). Sync-DRAW can also perform text-to-video generation which, to the best of our knowledge, makes it the first approach of its kind. It combines a Variational Autoencoder~(VAE) with a Recurrent Attention Mechanism in a novel manner to create a temporally dependent sequence of frames that are gradually formed over time. The recurrent attention mechanism in Sync-DRAW attends to each individual frame of the video in synchronization, while the VAE learns a latent distribution for the entire video at the global level. Our experiments with Bouncing MNIST, KTH and UCF-101 suggest that Sync-DRAW is efficient in learning the spatial and temporal information of the videos and generates frames with high structural integrity, and can generate videos from simple captions on these datasets. (Accepted as oral paper in ACM-Multimedia 2017)
https://arxiv.org/abs/1611.10314
Mobile visual search applications are emerging that enable users to sense their surroundings with smart phones. However, because of the particular challenges of mobile visual search, achieving a high recognition bitrate has become a consistent target of previous related work. In this paper, we propose a few-parameter, low-latency, and high-accuracy deep hashing approach for constructing binary hash codes for mobile visual search. First, we exploit the architecture of the MobileNet model, which significantly decreases the latency of deep feature extraction by reducing the number of model parameters while maintaining accuracy. Second, we add a hash-like layer into MobileNet to train the model on labeled mobile visual data. Evaluations show that the proposed system can exceed state-of-the-art accuracy performance in terms of the MAP. More importantly, the memory consumption is much less than that of other deep learning models. The proposed method requires only $13$ MB of memory for the neural network and achieves a MAP of $97.80\%$ on the mobile location recognition dataset used for testing.
https://arxiv.org/abs/1710.07750
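A minimal Keras sketch of the described design: a MobileNet backbone with a "hash-like" saturating layer inserted before the classifier. The bit count, class count, and input size here are placeholders, not the paper's values.

```python
import tensorflow as tf

# MobileNet backbone with global average pooling as the feature extractor.
base = tf.keras.applications.MobileNet(
    include_top=False, pooling="avg", weights="imagenet",
    input_shape=(224, 224, 3))
# Saturating tanh layer whose activations are later binarized into codes.
hash_layer = tf.keras.layers.Dense(48, activation="tanh",
                                   name="hash_like")(base.output)
logits = tf.keras.layers.Dense(100, activation="softmax")(hash_layer)
model = tf.keras.Model(base.input, [hash_layer, logits])

# Train with a classification loss on labeled mobile visual data, then
# binarize the saturated activations to obtain compact codes, e.g.:
#   codes = tf.where(model(images)[0] >= 0, 1, 0)
```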
Video captioning is in essence a complex natural process, which is affected by various uncertainties stemming from video content, subjective judgment, etc. In this paper, we build on recent progress in using the encoder-decoder framework for video captioning and address what we find to be a critical deficiency of existing methods: most of the decoders propagate deterministic hidden states. Such complex uncertainty cannot be modeled efficiently by deterministic models. Instead, we propose a generative approach, referred to as the multi-modal stochastic RNN network (MS-RNN), which models the uncertainty observed in the data using latent stochastic variables. Therefore, MS-RNN can improve the performance of video captioning and generate multiple sentences to describe a video considering different random factors. Specifically, a multi-modal LSTM (M-LSTM) is first proposed to interact with both visual and textual features to capture a high-level representation. Then, a backward stochastic LSTM (S-LSTM) is proposed to support uncertainty propagation by introducing latent variables. Experimental results on the challenging MSVD and MSR-VTT datasets show that our proposed MS-RNN approach outperforms state-of-the-art video captioning methods.
https://arxiv.org/abs/1708.02478
The Wasserstein distance received a lot of attention recently in the community of machine learning, especially for its principled way of comparing distributions. It has found numerous applications in several hard problems, such as domain adaptation, dimensionality reduction or generative models. However, its use is still limited by a heavy computational cost. Our goal is to alleviate this problem by providing an approximation mechanism that allows us to break its inherent complexity. It relies on the search for an embedding where the Euclidean distance mimics the Wasserstein distance. We show that such an embedding can be found with a siamese architecture associated with a decoder network that allows us to move from the embedding space back to the original input space. Once this embedding has been found, computing optimization problems in the Wasserstein space (e.g. barycenters, principal directions or even archetypes) can be conducted extremely fast. Numerical experiments supporting this idea are conducted on image datasets, and show the wide potential benefits of our method.
https://arxiv.org/abs/1710.07457
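A hypothetical PyTorch sketch of the embedding idea above: a siamese encoder whose Euclidean distances are regressed onto precomputed Wasserstein distances, plus a decoder for mapping back to input space. The architectures, input dimensions, and loss weighting are our assumptions.

```python
import torch
import torch.nn as nn

# Placeholder shapes: 100-dim normalized histograms, 32-dim embeddings.
enc = nn.Sequential(nn.Linear(100, 128), nn.ReLU(), nn.Linear(128, 32))
dec = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 100))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))

def training_step(x1, x2, w_dist):
    """x1, x2: batches of inputs; w_dist: their true pairwise Wasserstein
    distances, computed offline with an exact OT solver."""
    z1, z2 = enc(x1), enc(x2)
    d_embed = torch.norm(z1 - z2, dim=1)
    metric_loss = ((d_embed - w_dist) ** 2).mean()   # mimic W-distance
    recon_loss = ((dec(z1) - x1) ** 2).mean() \
               + ((dec(z2) - x2) ** 2).mean()        # map back to inputs
    loss = metric_loss + recon_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```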
Deep learning typically requires training a very capable architecture on large datasets. However, many important learning problems demand the ability to draw valid inferences from small datasets, and such problems pose a particular challenge for deep learning. In this regard, research on “meta-learning” is being actively conducted. Recent work suggested the Memory Augmented Neural Network (MANN) for meta-learning. MANN is an implementation of a Neural Turing Machine (NTM) with the ability to rapidly assimilate new data in its memory and use this data to make accurate predictions. In models such as MANN, input data samples and their appropriate labels from the previous step are bound together in the same memory locations. This often leads to memory interference when performing a task, as these models have to retrieve a feature of an input from a certain memory location and read only the label information bound to that location. In this paper, we address this issue by presenting a more robust MANN. We revisit the idea of meta-learning and propose a new memory augmented neural network that explicitly splits the external memory into feature and label memories. The feature memory stores the features of input data samples, and the label memory stores their labels. Hence, when predicting the label of a given input, our model uses its feature memory unit as a reference to extract the stored feature of the input and, based on that feature, retrieves the label information of the input from the label memory unit. For the network to function in this framework, a new memory-writing module that encodes label information into the label memory in accordance with the meta-learning task structure is designed. We demonstrate that our model outperforms MANN by a large margin in supervised one-shot classification tasks on the Omniglot and MNIST datasets.
https://arxiv.org/abs/1710.07110
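A rough NumPy sketch of the split-memory read described above: the query is matched against the feature memory, and the label is read from the same slots of the separate label memory. The similarity measure and soft-read weighting are assumptions of this sketch.

```python
import numpy as np

def read_label(query_feat, feature_mem, label_mem, k=1):
    """feature_mem: (slots, feat_dim) feature memory;
    label_mem: (slots, label_dim) label memory aligned slot-for-slot.
    Match the query by cosine similarity in the feature memory, then
    read the label stored at the corresponding slot(s)."""
    sims = feature_mem @ query_feat / (
        np.linalg.norm(feature_mem, axis=1)
        * np.linalg.norm(query_feat) + 1e-8)
    top = np.argsort(-sims)[:k]
    weights = np.exp(sims[top]) / np.exp(sims[top]).sum()  # soft read
    return weights @ label_mem[top]  # weighted label vector
```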
Similarity-preserving hashing is a widely used method for nearest neighbour search in large-scale image retrieval tasks. There has been considerable research on generating efficient image representations via deep-network-based hashing methods. However, the issue of efficient searching in the deep representation space remains largely unsolved. To this end, we propose a simple yet efficient deep-network-based multi-index hashing method for simultaneously learning a powerful image representation and efficient searching. To achieve these two goals, we introduce the multi-index hashing (MIH) mechanism into the proposed deep architecture, which divides the binary codes into multiple substrings. Because non-uniformly distributed codes result in inefficient searching, we add two balance constraints at the feature level and the instance level, respectively. Extensive evaluations on several benchmark image retrieval datasets show that the learned balanced binary codes bring dramatic speedups and achieve comparable performance over existing baselines.
https://arxiv.org/abs/1710.06993
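For background, the MIH mechanism referenced above can be sketched as follows (a plain-Python illustration of classic multi-index hashing, not the paper's deep variant): codes are split into substrings, each indexed in its own table, and exact substring matches yield candidates that are then verified with the full Hamming distance.

```python
from collections import defaultdict
from itertools import chain

def build_mih(codes, m=4, bits=64):
    """Index binary codes (fixed-width Python ints) by splitting each
    into m substrings, one hash table per substring position."""
    width = bits // m
    tables = [defaultdict(list) for _ in range(m)]
    for i, code in enumerate(codes):
        for j in range(m):
            tables[j][(code >> (j * width)) & ((1 << width) - 1)].append(i)
    return tables, width

def query_mih(q, codes, tables, width, radius=3):
    """By the pigeonhole principle, any code within Hamming distance
    `radius` of q matches q exactly in at least one substring whenever
    radius < m; larger radii would require probing nearby buckets,
    which is omitted in this sketch."""
    cands = set(chain.from_iterable(
        tables[j].get((q >> (j * width)) & ((1 << width) - 1), [])
        for j in range(len(tables))))
    return [i for i in cands if bin(q ^ codes[i]).count("1") <= radius]
```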
Processing of multi-word expressions (MWEs) is a known problem for any natural language processing task. Even neural machine translation (NMT) struggles to overcome it. This paper presents results of experiments investigating NMT attention allocation to MWEs and improving the automated translation of sentences that contain MWEs in English->Latvian and English->Czech NMT systems. Two improvement strategies were explored: (1) bilingual pairs of automatically extracted MWE candidates were added to the parallel corpus used to train the NMT system, and (2) full sentences containing the automatically extracted MWE candidates were added to the parallel corpus. Both approaches increased automated evaluation results. The best result, a 0.99 BLEU point increase, was reached with the first approach, while the second approach achieved only minimal improvements. We also provide open-source software and tools used for MWE extraction and alignment inspection.
https://arxiv.org/abs/1710.06313
Images in the wild encapsulate rich knowledge about varied abstract concepts and cannot be sufficiently described with models built only using image-caption pairs containing selected objects. We propose to handle such a task with the guidance of a knowledge base that incorporates many abstract concepts. Our method is a two-step process: we first build a multi-entity-label image recognition model to predict abstract concepts as image labels, and then leverage them in the second step as external semantic attention and constrained inference in the caption generation model for describing images that depict unseen/novel objects. Evaluations show that our models outperform most of the prior work on out-of-domain captioning on MSCOCO and are useful for the integration of knowledge and vision in general.
https://arxiv.org/abs/1710.06303
The past several years have witnessed the rapid progress of end-to-end Neural Machine Translation (NMT). However, there exists a discrepancy between training and inference in NMT when decoding, which may lead to serious problems since the model might be in a part of the state space it has never seen during training. To address this issue, Scheduled Sampling has been proposed. However, Scheduled Sampling has certain limitations, and we propose two dynamic oracle-based methods to improve it. We manage to mitigate the discrepancy by changing the training process towards a less guided scheme while aggregating the oracle’s demonstrations. Experimental results show that the proposed approaches improve translation quality over a standard NMT system.
https://arxiv.org/abs/1709.06265
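For contrast with the proposed dynamic-oracle methods, plain Scheduled Sampling can be sketched in a few lines; this is the baseline being improved, with the mixing probability as a placeholder.

```python
import random

def scheduled_sampling_inputs(gold_tokens, model_predictions, p_model):
    """Build decoder inputs by mixing gold and model tokens.

    With probability p_model, each position feeds back the model's own
    prediction instead of the gold token, moving training toward the
    less-guided inference-time regime. (This is plain Scheduled
    Sampling; the paper's dynamic-oracle variants additionally choose
    *which* token to feed based on an oracle, which is not sketched
    here.)"""
    return [pred if random.random() < p_model else gold
            for gold, pred in zip(gold_tokens, model_predictions)]

# p_model is typically annealed from 0 toward a maximum over training.
```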
Comprehending complex systems by simplifying and highlighting important dynamical patterns requires modeling and mapping higher-order network flows. However, complex systems come in many forms and demand a range of representations, including memory and multilayer networks, which in turn call for versatile community-detection algorithms to reveal important modular regularities in the flows. Here we show that various forms of higher-order network flows can be represented in a unified way with networks that distinguish physical nodes for representing a complex system’s objects from state nodes for describing flows between the objects. Moreover, these so-called sparse memory networks allow the information-theoretic community detection method known as the map equation to identify overlapping and nested flow modules in data from a range of different higher-order interactions such as multistep, multi-source, and temporal data. We derive the map equation applied to sparse memory networks and describe its search algorithm Infomap, which can exploit the flexibility of sparse memory networks. Together they provide a general solution to reveal overlapping modular patterns in higher-order flows through complex systems.
https://arxiv.org/abs/1706.04792
We develop a method for policy architecture search and adaptation via gradient-free optimization which can learn to perform autonomous driving tasks. By learning from both demonstration and environmental reward we develop a model that can learn with relatively few early catastrophic failures. We first learn an architecture of appropriate complexity to perceive aspects of world state relevant to the expert demonstration, and then mitigate the effect of domain-shift during deployment by adapting a policy demonstrated in a source domain to rewards obtained in a target environment. We show that our approach allows safer learning than baseline methods, offering a reduced cumulative crash metric over the agent’s lifetime as it learns to drive in a realistic simulated environment.
https://arxiv.org/abs/1710.05958
An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it leads to better cross-task transfer than standard multi-task learning. In particular, the task of visual recognition is aligned to the task of visual question answering by forcing each to use the same word-region embeddings. We show this leads to greater inductive transfer from recognition to VQA than standard multitask learning. Visual recognition also improves, especially for categories that have relatively few recognition training labels but appear often in the VQA setting. Thus, our paper takes a small step towards creating more general vision systems by showing the benefit of interpretable, flexible, and trainable core representations.
https://arxiv.org/abs/1704.00260
Quantum processors promise a paradigm shift in high-performance computing that needs to be assessed by accurate benchmarking measures. In this work, we introduce a new benchmark for variational quantum algorithms (VQA), recently proposed as heuristic algorithms for small-scale quantum processors. In VQA, a classical optimization algorithm guides the quantum dynamics of the processor to yield the best solution for a given problem. A complete assessment of the scalability and competitiveness of VQA should take into account both the quality and the time of dynamics optimization. The method of optimal stopping, employed here, provides such an assessment by explicitly including time as a cost factor. We showcase this measure by benchmarking VQA as a solver for some quadratic unconstrained binary optimization problems. Moreover, we show that a better choice of cost function for the classical routine can significantly improve the performance of the VQA and even improve its scaling properties.
https://arxiv.org/abs/1710.05365
In conveyor belt sushi restaurants, billing is a burdensome job because one has to manually count the dishes and identify their colors to calculate the price. In a busy situation, mistakes can lead to customers being overcharged or undercharged. To deal with this problem, we developed a method that automatically identifies the color of dishes and calculates the total price from real images. Our method consists of ellipse fitting and a convolutional neural network. It achieves 85% precision and 96% recall in ellipse detection, and 92% classification accuracy.
https://arxiv.org/abs/1709.00751
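A hypothetical OpenCV sketch of the two-stage pipeline described above; all thresholds, the cropping logic, and the classifier are placeholders rather than the authors' exact settings.

```python
import cv2

def detect_dishes(frame, classify):
    """Fit ellipses to contours to locate dishes, then classify each
    dish crop with a CNN.

    classify(crop) -> color/price class for a dish image (placeholder)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    dishes = []
    for cnt in contours:
        if len(cnt) < 5 or cv2.contourArea(cnt) < 500:
            continue  # fitEllipse needs >= 5 points; skip tiny blobs
        (cx, cy), (w, h), _angle = cv2.fitEllipse(cnt)
        x0, y0 = max(int(cx - w / 2), 0), max(int(cy - h / 2), 0)
        crop = frame[y0:int(cy + h / 2), x0:int(cx + w / 2)]
        if crop.size:
            dishes.append(classify(crop))
    return dishes  # e.g. sum a price table over the returned classes
```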
Surface Enhanced Laser Desorption/Ionization-Time Of Flight Mass Spectrometry (SELDI-TOF MS) is a variant of MALDI. It is used in many settings, especially for the analysis of protein profiles and for preliminary screening of complex samples in the search for biomarkers. Unfortunately, these analyses are time-consuming and strictly limited with respect to protein identification. SELDI analysis of mass spectra (SELYMATRA) is a Web Application (WA) developed with the aim of reducing these shortcomings by automating the identification processes and introducing the possibility of predicting the proteins present in complex mixtures from cells and tissues analysed by mass spectrometry. SELYMATRA has the following characteristics. The architectural pattern used to develop the WA is Model-View-Controller (MVC), widely used in the development of software systems. The WA expects a user to upload data in a Microsoft Excel spreadsheet file format, usually generated by means of proprietary mass spectrometry software. Several parameters can be set, such as experimental conditions, range of isoelectric point, range of pH, relative errors, and so on. The WA compares the mass values between two mass spectra (sample vs. control) to extract differences and, according to the parameters set, queries a local database to predict the most likely proteins related to the differentially expressed masses. The WA was validated in a cellular model overexpressing a tagged NURR1 receptor. SELYMATRA is available at this http URL.
https://arxiv.org/abs/1710.05914
Vector Quantization (VQ) is a popular image compression technique with a simple decoding architecture and a high compression ratio. Codebook design is the most essential part of Vector Quantization. Linde-Buzo-Gray (LBG) is a traditional method for generating VQ codebooks, but it results in lower PSNR values. A codebook affects the quality of image compression, so the choice of an appropriate codebook is a must. Several optimization techniques have been proposed for global codebook generation to enhance the quality of image compression. In this paper, a novel algorithm called IDE-LBG is proposed, which uses an Improved Differential Evolution Algorithm coupled with LBG for generating optimal VQ codebooks. The proposed IDE works better than traditional DE through modifications in the scaling factor and the boundary control mechanism. The IDE generates better solutions by efficient exploration and exploitation of the search space. The best solution obtained by the IDE is then provided as the initial codebook for LBG. This approach produces an efficient codebook with less computational time, yielding excellent PSNR values and superior-quality reconstructed images. It is observed that the proposed IDE-LBG finds better VQ codebooks than IPSO-LBG, BA-LBG, and FA-LBG.
https://arxiv.org/abs/1710.05311
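For context, the LBG refinement that IDE-LBG seeds with the DE-optimized codebook can be sketched as a k-means-style loop (a plain-NumPy illustration, not the paper's full algorithm):

```python
import numpy as np

def lbg_refine(vectors, codebook, iters=20):
    """Standard LBG (k-means-style) codebook refinement, here assumed to
    be seeded with the codebook returned by the Improved Differential
    Evolution search.

    vectors: (N, d) training vectors; codebook: (K, d) initial codewords."""
    codebook = codebook.copy()
    for _ in range(iters):
        # Assign each training vector to its nearest codeword.
        d = np.linalg.norm(vectors[:, None, :] - codebook[None], axis=2)
        nearest = d.argmin(axis=1)
        # Move each codeword to the centroid of its partition.
        for k in range(len(codebook)):
            members = vectors[nearest == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook
```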
Policy-gradient approaches to reinforcement learning have two common and undesirable overhead procedures, namely warm-start training and sample variance reduction. In this paper, we describe a reinforcement learning method based on a softmax value function that requires neither of these procedures. Our method combines the advantages of policy-gradient methods with the efficiency and simplicity of maximum-likelihood approaches. We apply this new cold-start reinforcement learning method in training sequence generation models for structured output prediction problems. Empirical evidence validates this method on automatic summarization and image captioning tasks.
https://arxiv.org/abs/1709.09346
Electron irradiation of GaN nanowires in a scanning electron microscope strongly reduces their luminous efficiency as shown by cathodoluminescence imaging and spectroscopy. We demonstrate that this luminescence quenching originates from a combination of charge trapping at already existing surface states and the formation of new surface states induced by the adsorption of C on the nanowire sidewalls. The interplay of these effects leads to a complex temporal evolution of the quenching, which strongly depends on the incident electron dose per area. Time-resolved photoluminescence measurements on electron-irradiated samples reveal that the carbonaceous adlayer affects both the nonradiative and the radiative recombination dynamics.
https://arxiv.org/abs/1607.03397