Evolutionary deep intelligence has recently shown great promise for producing small, powerful deep neural network models via the organic synthesis of increasingly efficient architectures over successive generations. Existing evolutionary synthesis processes, however, have allowed the mating of parent networks independent of architectural alignment, resulting in a mismatch of network structures. We present a preliminary study into the effects of architectural alignment during evolutionary synthesis using a gene tagging system. Surprisingly, the network architectures synthesized using the gene tagging approach resulted in slower decreases in performance accuracy and storage size; however, the resultant networks were comparable in size and performance accuracy to the non-gene tagging networks. Furthermore, we speculate that there is a noticeable decrease in network variability for networks synthesized with gene tagging, indicating that enforcing a like-with-like mating policy potentially restricts the exploration of the search space of possible network architectures.
https://arxiv.org/abs/1811.07966
Researchers have observed that Visual Question Answering (VQA) models tend to answer questions by learning statistical biases in the data. For example, their answer to the question “What is the color of the grass?” is usually “Green”, whereas a question like “What is the title of the book?” cannot be answered by inferring statistical biases. It is of interest to the community to explicitly discover such biases, both for understanding the behavior of such models, and towards debugging them. Our work addresses this problem. In a database, we store the words of the question, answer and visual words corresponding to regions of interest in attention maps. By running simple rule mining algorithms on this database, we discover human-interpretable rules which give us unique insight into the behavior of such models. Our results also show examples of unusual behaviors learned by models in attempting VQA tasks.
https://arxiv.org/abs/1811.07789
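As a rough illustration of the rule-mining step described in the abstract above, the following sketch mines rules of the form {words} -> answer from a toy database. The record layout, the `v:` prefix for visual words, the thresholds, and the function name `mine_rules` are all assumptions made for this example; the abstract does not say which rule mining algorithm the authors use.

```python
from collections import Counter
from itertools import combinations


def mine_rules(records, min_support=2, min_confidence=0.8, max_len=2):
    """Mine rules {words} -> answer from (words, answer) records.

    Hypothetical helper: support is the number of records containing both the
    word set and the answer; confidence is support divided by the number of
    records containing the word set.
    """
    antecedent_counts = Counter()
    joint_counts = Counter()
    for words, answer in records:
        unique = sorted(set(words))
        for k in range(1, max_len + 1):
            for subset in combinations(unique, k):
                antecedent_counts[subset] += 1
                joint_counts[(subset, answer)] += 1

    rules = []
    for (subset, answer), support in joint_counts.items():
        confidence = support / antecedent_counts[subset]
        if support >= min_support and confidence >= min_confidence:
            rules.append((subset, answer, support, confidence))
    # Most confident, best supported rules first.
    rules.sort(key=lambda r: (-r[3], -r[2], r[0]))
    return rules


# Toy database: question words plus (assumed) visual words prefixed with "v:".
records = [
    (["what", "color", "grass", "v:grass"], "green"),
    (["what", "color", "grass", "v:field"], "green"),
    (["what", "color", "sky", "v:sky"], "blue"),
    (["what", "color", "grass", "v:grass"], "green"),
]
rules = mine_rules(records, min_support=3, min_confidence=1.0)
```

On this toy data the surviving rules all predict "green" from word sets containing "grass", which is the kind of statistical shortcut the abstract describes.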
Multi-task learning (MTL) allows deep neural networks to learn from related tasks by sharing parameters with other networks. In practice, however, MTL involves searching an enormous space of possible parameter sharing architectures to find (a) the layers or subspaces that benefit from sharing, (b) the appropriate amount of sharing, and (c) the appropriate relative weights of the different task losses. Recent work has addressed each of the above problems in isolation. In this work we present an approach that learns a latent multi-task architecture that jointly addresses (a)–(c). We present experiments on synthetic data and data from OntoNotes 5.0, including four different tasks and seven different domains. Our extension consistently outperforms previous approaches to learning latent architectures for multi-task problems and achieves up to 15% average error reductions over common approaches to MTL.
https://arxiv.org/abs/1705.08142
We consider active learning of deep neural networks. Most active learning works in this context have focused on studying effective querying mechanisms and assumed that an appropriate network architecture is a priori known for the problem at hand. We challenge this assumption and propose a novel active strategy whereby the learning algorithm searches for effective architectures on the fly, while actively learning. We apply our strategy using three known querying techniques (softmax response, MC-dropout, and coresets) and show that the proposed approach overwhelmingly outperforms active learning using fixed architectures.
https://arxiv.org/abs/1811.07579
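Of the three querying techniques named in the abstract above, softmax response is the simplest to sketch: it queries the unlabeled samples whose highest softmax probability is lowest. The function name and the toy probabilities below are hypothetical, and the architecture search that the paper combines with querying is not shown.

```python
def softmax_response_query(probabilities, budget):
    """Return indices of the `budget` least confident samples.

    `probabilities` is a list of per-sample softmax outputs; confidence is
    taken to be the largest class probability of each sample.
    """
    confidence = [(max(p), i) for i, p in enumerate(probabilities)]
    confidence.sort()  # least confident first; ties broken by index
    return [i for _, i in confidence[:budget]]


# Toy softmax outputs for three unlabeled samples (assumed values).
pool = [[0.90, 0.10], [0.55, 0.45], [0.60, 0.40]]
to_label = softmax_response_query(pool, budget=2)
```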
Previous studies show that incorporating external information can improve the translation quality of Neural Machine Translation (NMT) systems. However, the external information inevitably contains noise, which severely reduces the benefit that existing methods gain from incorporating it. To tackle this problem, this study pays special attention to discriminating the noise during incorporation. We argue that there are two kinds of noise in this external information, i.e. global noise and local noise, which affect the translation of the whole sentence and of some specific words, respectively. Accordingly, we propose a general framework that learns to jointly discriminate both global and local noise, so that the external information can be better leveraged. Our model is trained on a dataset derived from the original parallel corpus without any external labeled data or annotation. Experimental results in various real-world scenarios, language pairs, and neural architectures indicate that discriminating noise contributes to significant improvements in translation quality by better incorporating the external information, even in very noisy conditions.
https://arxiv.org/abs/1810.10317
Deep learning object detectors achieve state-of-the-art accuracy at the expense of high computational overheads, impeding their utilization on embedded systems such as drones. A primary source of these overheads is the exhaustive classification of typically 10^4-10^5 regions per image. Given that most of these regions contain uninformative background, the detector designs seem extremely superfluous and inefficient. In contrast, biological vision systems leverage selective attention for fast and efficient object detection. Recent neuroscientific findings shedding new light on the mechanism behind selective attention allowed us to formulate a new hypothesis of object detection efficiency and subsequently introduce a new object detection paradigm. To that end, we leverage this knowledge to design a novel region proposal network and empirically show that it achieves high object detection performance on the COCO dataset. Moreover, the model uses two to three orders of magnitude fewer computations than state-of-the-art models and consequently achieves inference speeds exceeding 500 frames/s, thereby making it possible to achieve object detection on embedded systems.
https://arxiv.org/abs/1811.07502
Object detection and classification is one of the most important computer vision problems. Ever since the introduction of deep learning \cite{krizhevsky2012imagenet}, we have witnessed a dramatic increase in the accuracy of this object detection problem. However, most of these improvements have occurred using conventional 2D image processing. Recently, low-cost 3D-image sensors, such as the Microsoft Kinect (Time-of-Flight) or the Apple FaceID (Structured-Light), can provide 3D-depth or point cloud data that can be added to a convolutional neural network, acting as an extra set of dimensions. In our proposed approach, we introduce a new 2D + 3D system that takes the 3D-data to determine the object region followed by any conventional 2D-DNN, such as AlexNet. In this method, our approach can easily dissociate the information collection from the Point Cloud and 2D-Image data and combine both operations later. Hence, our system can use any existing trained 2D network on a large image dataset, and does not require a large 3D-depth dataset for new training. Experimental object detection results across 30 images show an accuracy of 0.67, versus 0.54 and 0.51 for RCNN and YOLO, respectively.
https://arxiv.org/abs/1811.07493
Existing deep learning based saliency methods for static images do not weight or highlight the features extracted from different layers; all features contribute equally to the final saliency decision. Such methods always evenly detect all “potentially significant regions” and are unable to highlight the key salient object, resulting in detection failures in dynamic scenes. In this paper, based on the fact that salient areas in videos are relatively small and concentrated, we propose a key salient object re-augmentation method (KSORA) using top-down semantic knowledge and bottom-up feature guidance to improve detection accuracy in video scenes. KSORA includes two sub-modules (WFE and KOS): WFE performs local salient feature selection using a bottom-up strategy, while KOS ranks each object globally using top-down statistical knowledge and chooses the most critical object area for local enhancement. The proposed KSORA can not only strengthen the saliency value of the local key salient object but also ensure global saliency consistency. Results on three benchmark datasets suggest that our model can improve detection accuracy on complex scenes. The performance of KSORA, which runs at 17 FPS on modern GPUs, has been verified by comparisons with ten other state-of-the-art algorithms.
https://arxiv.org/abs/1811.07480
Event-based cameras, also known as neuromorphic cameras, are bioinspired sensors able to perceive changes in the scene at high frequency with low power consumption. Because they have become available only very recently, a limited amount of work addresses object detection on these devices. In this paper we propose two neural network architectures for object detection: YOLE, which integrates the events into surfaces and uses a frame-based model to process them, and eFCN, an asynchronous event-based fully convolutional network which uses a novel and general formalization of the convolutional and max pooling layers to exploit the sparsity of camera events. We evaluated the algorithm with different extensions of publicly available datasets, and on a novel synthetic dataset.
https://arxiv.org/abs/1805.07931
Distance metric learning (DML) has been successfully applied to object classification, both in the standard regime of rich training data and in the few-shot scenario, where each category is represented by only a few examples. In this work, we propose a new method for DML that simultaneously learns the backbone network parameters, the embedding space, and the multi-modal distribution of each of the training categories in that space, in a single end-to-end training process. Our approach outperforms state-of-the-art methods for DML-based object classification on a variety of standard fine-grained datasets. Furthermore, we demonstrate the effectiveness of our approach on the problem of few-shot object detection, by incorporating the proposed DML architecture as a classification head into a standard object detection model. We achieve the best results on the ImageNet-LOC dataset compared to strong baselines, when only a few training examples are available. We also offer the community a new episodic benchmark based on the ImageNet dataset for the few-shot object detection task.
https://arxiv.org/abs/1806.04728
We propose an end-to-end framework for training domain specific models (DSMs) to obtain both high accuracy and computational efficiency for object detection tasks. DSMs are trained with distillation \cite{hinton2015distilling} and focus on achieving high accuracy in a limited domain (e.g. fixed view of an intersection). We argue that DSMs can capture essential features well even with a small model size, enabling higher accuracy and efficiency than traditional techniques. In addition, we improve the training efficiency by reducing the dataset size by culling easy-to-classify images from the training set. For the limited domain, we observed that compact DSMs significantly surpass the accuracy of COCO trained models of the same size. By training on a compact dataset, we show that with an accuracy drop of only 3.6%, the training time can be reduced by 93%. The code is available at this https URL.
https://arxiv.org/abs/1811.02689
This paper examines three generic strategies for improving the performance of neuro-evolution techniques aimed at evolving convolutional neural networks (CNNs). These were implemented as part of the Evolutionary eXploration of Augmenting Convolutional Topologies (EXACT) algorithm. EXACT evolves arbitrary convolutional neural networks (CNNs) with goals of better discovering and understanding new effective architectures of CNNs for machine learning tasks and to potentially automate the process of network design and selection. The strategies examined are node-level mutation operations, epigenetic weight initialization and pooling connections. Results were gathered over the period of a month using a volunteer computing project, where over 225,000 CNNs were trained and evaluated across 16 different EXACT searches. The node mutation operations were shown to dramatically improve evolution rates over traditional edge mutation operations (as used by the NEAT algorithm), and epigenetic weight initialization was shown to further increase the accuracy and generalizability of the trained CNNs. As a negative but interesting result, allowing for pooling connections was shown to degrade the evolution progress. The best trained CNNs reached 99.46% accuracy on the MNIST test data in under 13,500 CNN evaluations – accuracy comparable with some of the best human designed CNNs.
https://arxiv.org/abs/1811.08286
The problem of keyword spotting, i.e. identifying keywords in a real-time audio stream, is mainly solved by applying a neural network over successive sliding windows. Due to the difficulty of the task, baseline models are usually large, resulting in a high computational cost and energy consumption level. We propose a new method called SANAS (Stochastic Adaptive Neural Architecture Search) which is able to adapt the architecture of the neural network on-the-fly at inference time such that small architectures will be used when the stream is easy to process (silence, low noise, …) and bigger networks will be used when the task becomes more difficult. We show that this adaptive model can be learned end-to-end by optimizing a trade-off between the prediction performance and the average computational cost per unit of time. Experiments on the Speech Commands dataset show that this approach leads to a high recognition level while being much faster (and/or energy saving) than classical approaches where the network architecture is static.
https://arxiv.org/abs/1811.06753
Most current detection methods have adopted anchor boxes as regression references. However, the detection performance is sensitive to the setting of the anchor boxes. A proper setting of anchor boxes may vary significantly across different datasets, which severely limits the universality of the detectors. To improve the adaptivity of the detectors, in this paper, we present a novel dimension-decomposition region proposal network (DeRPN) that can perfectly displace the traditional Region Proposal Network (RPN). DeRPN utilizes an anchor string mechanism to independently match object widths and heights, which is conducive to treating variant object shapes. In addition, a novel scale-sensitive loss is designed to address the imbalanced loss computations of different scaled objects, which can avoid the small objects being overwhelmed by larger ones. Comprehensive experiments conducted on both general object detection datasets (Pascal VOC 2007, 2012 and MS COCO) and scene text detection datasets (ICDAR 2013 and COCO-Text) all prove that our DeRPN can significantly outperform RPN. It is worth mentioning that the proposed DeRPN can be employed directly on different models, tasks, and datasets without any modifications of hyperparameters or specialized optimization, which further demonstrates its adaptivity. The code will be released at this https URL.
https://arxiv.org/abs/1811.06700
This paper presents a modular lightweight network model for detecting road objects, such as cars, pedestrians and cyclists, especially when they are far away from the camera and their sizes are small. Great advances have been made in deep networks, but small object detection is still a challenging task. To solve this problem, the majority of existing methods use complicated networks or bigger image sizes, which generally leads to higher computation cost. The proposed network model, referred to as the modular feature fusion detector (MFFD), uses a fast and efficient network architecture for detecting small objects. The contribution lies in the following aspects: 1) Two base modules have been designed for efficient computation: the Front module reduces the information loss from raw input images; the Tinier module decreases model size and computation cost, while ensuring detection accuracy. 2) By stacking the base modules, we design a context features fusion framework for multi-scale object detection. 3) The proposed method is efficient in terms of model size and computation cost, which makes it applicable to resource-limited devices, such as embedded systems for advanced driver assistance systems (ADAS). Comparisons with state-of-the-art methods on the challenging KITTI dataset reveal the superiority of the proposed method. In particular, 100 fps can be achieved on embedded GPUs such as the Jetson TX2.
https://arxiv.org/abs/1811.06641
Neural network-based open-ended conversational agents automatically generate responses based on predictive models learned from a large number of pairs of utterances. The generated responses are typically acceptable as a sentence but are often dull, generic, and certainly devoid of any emotion. In this paper, we present neural models that learn to express a given emotion in the generated response. We propose four models and evaluate them against three baselines. An encoder-decoder framework-based model with multiple attention layers provides the best overall performance in terms of expressing the required emotion. While it does not outperform other models on all emotions, it presents promising results in most cases.
http://arxiv.org/abs/1811.10990
Non-uniform and multi-illuminant color constancy are important tasks, the solution of which will make it possible to discard information about lighting conditions in the image. Non-uniform illumination and shadows distort colors of real-world objects and mostly do not contain valuable information. Thus, many computer vision and image processing techniques would benefit from automatically discarding this information at the pre-processing step. In this work we propose a novel view on this classical problem via a generative end-to-end algorithm, namely an image-conditioned Generative Adversarial Network. We also demonstrate the potential of the given approach for joint shadow detection and removal. Forced by the lack of training data, we render the largest existing shadow removal dataset and make it publicly available. It consists of approximately 6,000 pairs of wide field of view synthetic images with and without shadows.
https://arxiv.org/abs/1811.06604
Recently, deep neural networks have achieved remarkable performance on the task of object detection and recognition. The reason for this success is mainly grounded in the availability of large scale, fully annotated datasets, but the creation of such a dataset is a complicated and costly task. In this paper, we propose a novel method for weakly supervised object detection that simplifies the process of gathering data for training an object detector. We train an ensemble of two models that work together in a student-teacher fashion. Our student (localizer) is a model that learns to localize an object; the teacher (assessor) assesses the quality of the localization and provides feedback to the student. The student uses this feedback to learn how to localize objects and is thus entirely supervised by the teacher, as we are using no labels for training the localizer. In our experiments, we show that our model is very robust to noise and reaches competitive performance compared to a state-of-the-art fully supervised approach. We also show the simplicity of creating a new dataset, based on a few videos (e.g. downloaded from YouTube) and artificially generated data.
https://arxiv.org/abs/1811.05773
Generative Adversarial Networks are a new family of generative models, frequently used for generating photorealistic images. In theory, a GAN eventually reaches an equilibrium where the generator produces pictures indistinguishable from the training set. In practice, however, a range of problems frequently prevents the system from reaching this equilibrium, with training failing to progress due to instabilities or mode collapse. This paper describes a series of experiments that try to identify patterns in the effect of the training set on the dynamics and eventual outcome of training.
https://arxiv.org/abs/1811.02850
This paper presents a Lisp architecture for a portable NLP system, termed LAPNLP, for processing clinical notes. LAPNLP integrates multiple standard, customized and in-house developed NLP tools. Our system facilitates portability across different institutions and data systems by incorporating an enriched Common Data Model (CDM) to standardize necessary data elements. It utilizes UMLS to perform domain adaptation when integrating generic domain NLP tools. It also features stand-off annotations that are specified by positional reference to the original document. We built an interval tree based search engine to efficiently query and retrieve the stand-off annotations by specifying positional requirements. We also developed a utility to convert an inline annotation format to stand-off annotations to enable the reuse of clinical text datasets with inline annotations. We experimented with our system on several NLP facilitated tasks including computational phenotyping for lymphoma patients and semantic relation extraction for clinical notes. These experiments showcased the broader applicability and utility of LAPNLP.
https://arxiv.org/abs/1811.06179
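To make the idea of querying stand-off annotations by position concrete, here is a minimal sketch. It is written in Python rather than Lisp, and it deliberately swaps in a simpler structure (a list sorted by start offset, searched with `bisect` and then filtered) instead of the interval tree the abstract mentions. The `Annotation` and `AnnotationIndex` names, the labels, and the offsets are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # inclusive character offset in the original document
    end: int    # exclusive character offset
    label: str


class AnnotationIndex:
    """Positional lookup over stand-off annotations (not an interval tree)."""

    def __init__(self, annotations):
        self._items = sorted(annotations, key=lambda a: (a.start, a.end))
        self._starts = [a.start for a in self._items]

    def overlapping(self, start, end):
        """Return annotations that overlap the half-open span [start, end)."""
        # Only annotations starting before `end` can overlap the query span.
        hi = bisect_right(self._starts, end - 1)
        return [a for a in self._items[:hi] if a.end > start]


index = AnnotationIndex([
    Annotation(0, 7, "PERSON"),
    Annotation(12, 41, "DISEASE"),
    Annotation(33, 41, "DISEASE_HEAD"),
])
hits = index.overlapping(30, 35)
```

The filter step is linear in the number of candidates, which is the cost an interval tree is designed to avoid on large annotation sets.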
Deep neural networks have shown superior performance in many regimes at remembering familiar patterns given large amounts of data. However, the standard supervised deep learning paradigm is still limited when facing the need to learn new concepts efficiently from scarce data. In this paper, we present a memory-augmented neural network which is motivated by the process of human concept learning. The training procedure, imitating the course of human concept formation, learns how to distinguish samples from different classes and aggregate samples of the same kind. In order to better exploit the advantages of this human behavior, we propose a sequential process, during which the network decides how to remember each sample at every step. In this sequential process, a stable and interactive memory serves as an important module. We validate our model on some typical one-shot learning tasks and on an exploratory outlier detection problem. In all the experiments, our model is highly competitive, matching or outperforming strong baselines.
https://arxiv.org/abs/1811.06145
In this work, we study the problem of learning a single model for multiple domains. Unlike the conventional machine learning scenario where each domain can have the corresponding model, multiple domains (i.e., applications/users) may share the same machine learning model due to maintenance loads in cloud computing services. For example, a digit-recognition model should be applicable to hand-written digits, house numbers, car plates, etc. Therefore, an ideal model for cloud computing has to perform well on each applicable domain. To address this new challenge from cloud computing, we develop a framework of robust optimization over multiple domains. In lieu of minimizing the empirical risk, we aim to learn a model optimized to the adversarial distribution over multiple domains. Hence, we propose to learn the model and the adversarial distribution simultaneously with a stochastic algorithm for efficiency. Theoretically, we analyze the convergence rate for convex and non-convex models. To the best of our knowledge, we are the first to study the convergence rate of learning a robust non-convex model with a practical algorithm. Furthermore, we demonstrate that the robustness of the framework and the convergence rate can be further enhanced by appropriate regularizers over the adversarial distribution. The empirical study on real-world fine-grained visual categorization and digit recognition tasks verifies the effectiveness and efficiency of the proposed framework.
http://arxiv.org/abs/1805.07588
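The idea of learning a model together with an adversarial distribution over domains can be illustrated on a toy problem: minimize over a scalar model w the worst p-weighted sum of per-domain losses, with p on the probability simplex. The sketch below is only an assumption-laden illustration. The quadratic losses, the step sizes, the deterministic (not stochastic) gradients, and the exponentiated-gradient update for p are all choices made here, not details taken from the abstract.

```python
import math


def domain_losses(w, centers):
    """Toy per-domain loss: squared distance from w to each domain's optimum."""
    return [(w - c) ** 2 for c in centers]


def robust_fit(centers, steps=2000, lr_w=0.1, lr_p=0.01):
    """Simultaneous descent on w and ascent on the domain distribution p."""
    k = len(centers)
    w = 0.0
    p = [1.0 / k] * k
    for _ in range(steps):
        losses = domain_losses(w, centers)
        # Descent step on w for the weighted loss sum_k p_k * (w - c_k)^2.
        grad_w = sum(pk * 2.0 * (w - c) for pk, c in zip(p, centers))
        w -= lr_w * grad_w
        # Ascent step on p: multiplicative update, then renormalize so that
        # p stays on the simplex and shifts weight to high-loss domains.
        unnorm = [pk * math.exp(lr_p * loss) for pk, loss in zip(p, losses)]
        total = sum(unnorm)
        p = [u / total for u in unnorm]
    return w, p


# Two toy domains whose individual optima are w = 0 and w = 4.
w, p = robust_fit([0.0, 4.0])
```

With these two symmetric domains, the worst-case loss is smallest midway between the two optima, so w should settle near 2 with roughly equal weight on both domains, whereas fitting only one domain would leave the other with a large loss.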
We show that with an appropriate factorization, and encodings of layout and appearance constructed from outputs of pretrained object detectors, a relatively simple model outperforms more sophisticated approaches on human-object interaction detection. Our model includes factors for detection scores, human and object appearance, and coarse (box-pair configuration) and optionally fine-grained layout (human pose). We also develop training techniques that improve learning efficiency by: (i) eliminating train-inference mismatch; (ii) rejecting easy negatives during mini-batch training; and (iii) using a ratio of negatives to positives that is two orders of magnitude larger than existing approaches while constructing training mini-batches. We conduct a thorough ablation study to understand the importance of different factors and training techniques using the challenging HICO-Det dataset.
https://arxiv.org/abs/1811.05967
The work presented in this paper focuses on the translation of terminological expressions represented in semantically structured resources, like ontologies or knowledge graphs. The challenge of translating ontology labels or terminological expressions represented in knowledge bases lies in the highly specific vocabulary and the lack of contextual information, which can guide a machine translation system to translate ambiguous words into the targeted domain. Due to these challenges, we evaluate the translation quality of domain-specific expressions in the medical and financial domains with statistical (SMT) as well as with neural machine translation (NMT) methods and experiment with domain adaptation of the translation models using terminological expressions only. Furthermore, we perform experiments on the injection of external terminological expressions into the translation systems. Through these experiments, we observed a significant advantage in domain adaptation for the domain-specific resource in the medical and financial domains and the benefit of subword models over word-based NMT models for terminology translation.
https://arxiv.org/abs/1709.02184
Paraphrasing is rooted in semantics. We show the effectiveness of transformers (Vaswani et al. 2017) for paraphrase generation and further improvements by incorporating PropBank labels via a multi-encoder. Evaluating on MSCOCO and WikiAnswers, we find that transformers are fast and effective, and that semantic augmentation for both transformers and LSTMs leads to sizable 2-3 point gains in BLEU, METEOR and TER. More importantly, we find surprisingly large gains on human evaluations compared to previous models. Nevertheless, manual inspection of generated paraphrases reveals ample room for improvement: even our best model produces human-acceptable paraphrases for only 28% of captions from the CHIA dataset (Sharma et al. 2018), and it fails spectacularly on sentences from Wikipedia. Overall, these results point to the potential for incorporating semantics in the task while highlighting the need for stronger evaluation.
https://arxiv.org/abs/1811.00119
This paper focuses on YOLO-LITE, a real-time object detection model developed to run on portable devices such as a laptop or cellphone lacking a Graphics Processing Unit (GPU). The model was first trained on the PASCAL VOC dataset and then on the COCO dataset, achieving a mAP of 33.81% and 12.26% respectively. With only 7 layers and 482 million FLOPS, YOLO-LITE runs at about 21 FPS on a non-GPU computer and 10 FPS after being implemented on a website. This speed is 3.8x faster than the fastest state-of-the-art model, SSD MobileNetV1. Based on the original object detection algorithm YOLOv2, YOLO-LITE was designed to create a smaller, faster, and more efficient model, increasing the accessibility of real-time object detection to a variety of devices.
https://arxiv.org/abs/1811.05588
Extracting valuable facts or informative summaries from multi-dimensional tables, i.e. insight mining, is an important task in data analysis and business intelligence. However, ranking the importance of insights remains a challenging and unexplored task. The main challenge is that explicitly scoring an insight or giving it a rank requires a thorough understanding of the tables and a lot of manual effort, which leads to a lack of available training data for the insight ranking problem. In this paper, we propose an insight ranking model that consists of two parts: a neural ranking model explores the data characteristics, such as the header semantics and the data statistical features, and a memory network model introduces table structure and context information into the ranking process. We also build a dataset with text assistance. Experimental results show that our approach substantially improves the ranking precision as reported in multiple evaluation metrics.
https://arxiv.org/abs/1811.05563
Generative Adversarial Networks (GANs) have shown great results in accurately modeling complex distributions, but their training is known to be difficult due to instabilities caused by a challenging minimax optimization problem. This is especially troublesome given the lack of an evaluation metric that can reliably detect non-convergent behaviors. We leverage the notion of duality gap from game theory in order to propose a novel convergence metric for GANs that has low computational cost. We verify the validity of the proposed metric for various test scenarios commonly used in the literature.
https://arxiv.org/abs/1811.05512
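For a two-player minimax game with payoff M(u, v), where u minimizes and v maximizes, the duality gap at a pair (u, v) is the worst payoff u can suffer minus the best payoff achievable against v. It is zero at an equilibrium. The sketch below evaluates that definition by exhaustive search over small candidate grids on a toy bilinear game. The grids and the payoff are assumptions made for illustration, and the abstract does not describe how the paper estimates the two inner optimizations for real GANs.

```python
def duality_gap(payoff, u, v, u_candidates, v_candidates):
    """Duality gap of (u, v): max over v' of M(u, v') minus min over u' of M(u', v)."""
    worst_case_for_u = max(payoff(u, v_alt) for v_alt in v_candidates)
    best_case_against_v = min(payoff(u_alt, v) for u_alt in u_candidates)
    return worst_case_for_u - best_case_against_v


def bilinear(u, v):
    """Toy payoff M(u, v) = u * v; its equilibrium on [-1, 1]^2 is (0, 0)."""
    return u * v


grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
gap_at_equilibrium = duality_gap(bilinear, 0.0, 0.0, grid, grid)
gap_away = duality_gap(bilinear, 1.0, 1.0, grid, grid)
```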
State-of-the-art object detectors and trackers are developing fast. Trackers are in general more efficient than detectors but bear the risk of drifting. A question is hence raised – how to improve the accuracy of video object detection/tracking by utilizing the existing detectors and trackers within a given time budget? A baseline is frame skipping – detecting on every N-th frame and tracking on the frames in between. This baseline, however, is suboptimal since the detection frequency should depend on the tracking quality. To this end, we propose a scheduler network, which decides whether to detect or track at a given frame, as a generalization of Siamese trackers. Although light-weight and simple in structure, the scheduler network is more effective than the frame skipping baselines and flow-based approaches, as validated on the ImageNet VID dataset in video object detection/tracking.
https://arxiv.org/abs/1811.05340
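The frame-skipping baseline mentioned in the abstract above can be sketched in a few lines: run the detector on every N-th frame and the tracker on the frames in between. The `detect` and `track` callables here are hypothetical stand-ins for a real detector and tracker, and the proposed scheduler network itself is not shown.

```python
def run_frame_skipping(frames, detect, track, n):
    """Detect on frames 0, n, 2n, ... and track on all frames in between."""
    if n < 1:
        raise ValueError("n must be at least 1")
    results = []
    state = None
    for i, frame in enumerate(frames):
        if i % n == 0:
            state = detect(frame)        # expensive, resets any tracker drift
        else:
            state = track(frame, state)  # cheap, propagates the previous result
        results.append(state)
    return results


# Toy stand-ins that only record which operation handled each frame.
def toy_detect(frame):
    return ("detect", frame)


def toy_track(frame, previous):
    return ("track", frame)


schedule = run_frame_skipping(list(range(5)), toy_detect, toy_track, n=3)
```

The fixed value of `n` is what the abstract calls suboptimal: the schedule ignores how well tracking is currently going.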
Visual context is one of the important clues for object detection, and the context information for the boundaries of an object is especially valuable. We propose a boundary aware network (BAN) designed to exploit visual contexts including boundary information and surroundings, named boundary context, and define three types of boundary context: side, vertex and in/out-boundary context. Our BAN consists of 10 sub-networks for the areas belonging to the boundary contexts. The detection head of BAN is defined as an ensemble of these sub-networks with different contributions depending on the sub-problem of detection. To verify our method, we visualize the activation of the sub-networks according to the boundary contexts and empirically show that the sub-networks contribute more to the related sub-problem in detection. We evaluate our method on the PASCAL VOC detection benchmark and the MS COCO dataset. The proposed method achieves a mean Average Precision (mAP) of 83.4% on PASCAL VOC and 36.9% on MS COCO. BAN allows the convolution network to provide an additional source of contexts for detection and selectively focus on the more important contexts, and it can be generally applied to many other detection methods as well to enhance the accuracy in detection.
https://arxiv.org/abs/1811.05243
Given that South African education is in crisis, strategies for improvement and sustainability of high-quality, up-to-date education must be explored. In the migration of education online, inclusion of machine translation for low-resourced local languages becomes necessary. This paper aims to spur the use of current neural machine translation (NMT) techniques for low-resourced local languages. The paper demonstrates state-of-the-art performance on English-to-Setswana translation using the Autshumato dataset. The use of the Transformer architecture beat previous techniques by 5.33 BLEU points. This demonstrates the promise of using current NMT techniques for African languages.
https://arxiv.org/abs/1811.05467
Dialogue Act (DA) classification is a challenging problem in dialogue interpretation, which aims to attach semantic labels to utterances and characterize the speaker’s intention. Currently, many existing approaches formulate the DA classification problem in ways ranging from multi-class classification to structured prediction, and these suffer from two limitations: a) the methods are either based on handcrafted features or have limited memory; b) adversarial examples cannot be correctly classified by traditional training methods. To address these issues, in this paper we first cast the problem as a question answering problem and propose an improved dynamic memory network with a hierarchical pyramidal utterance encoder. Moreover, we apply adversarial training to train our proposed model. We evaluate our model on two public datasets, i.e., the Switchboard dialogue act corpus and the MapTask corpus. Extensive experiments show that our proposed model is not only robust, but also achieves better performance when compared with some state-of-the-art baselines.
https://arxiv.org/abs/1811.05021
Although Neural Machine Translation (NMT) has achieved remarkable progress in the past several years, most NMT systems still suffer from a fundamental shortcoming shared with other sequence generation tasks: errors made early in the generation process are fed as inputs to the model and can be quickly amplified, harming subsequent sequence generation. To address this issue, we propose a novel model regularization method for NMT training, which aims to improve the agreement between translations generated by left-to-right (L2R) and right-to-left (R2L) NMT decoders. This goal is achieved by introducing two Kullback-Leibler divergence regularization terms into the NMT training objective to reduce the mismatch between output probabilities of L2R and R2L models. In addition, we also employ a joint training strategy to allow L2R and R2L models to improve each other in an interactive update process. Experimental results show that our proposed method significantly outperforms state-of-the-art baselines on Chinese-English and English-German translation tasks.
https://arxiv.org/abs/1808.04064
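To make the agreement regularization described in the abstract above concrete, here is a minimal sketch of an objective with two Kullback-Leibler terms between L2R and R2L output distributions. It is an illustration under stated assumptions, not the authors' formulation: the token-level treatment, the names `kl_divergence` and `agreement_regularized_loss`, the weight `lam`, and the assumption that the R2L distributions have already been reversed so that positions align are all hypothetical.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), summed over the vocabulary axis and averaged over positions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))


def agreement_regularized_loss(nll_l2r, p_l2r, p_r2l, lam=1.0):
    """Hypothetical L2R objective: likelihood loss plus two KL agreement terms.

    p_l2r and p_r2l have shape (positions, vocabulary); p_r2l is assumed to be
    re-ordered so that row i of both arrays refers to the same target position.
    """
    mismatch = kl_divergence(p_r2l, p_l2r) + kl_divergence(p_l2r, p_r2l)
    return nll_l2r + lam * mismatch
```

When the two decoders agree exactly, both KL terms vanish and the objective reduces to the ordinary likelihood loss; any disagreement adds a positive penalty scaled by `lam`.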
In this paper we present a novel strategy to implement the paradigm of tabu search within a hybrid quantum-classical scheme based on quantum annealing to solve optimization problems, with a particular focus on QUBO problems. The proposed algorithm is based on an iterative structure where the representation of an objective function into the annealer architecture is modified and already visited solutions are penalized by a tabu search. We prove the convergence of the algorithm to a global optimum in the case of general QUBO problems. Our technique is an alternative to the direct reduction of a given optimization problem into the sparse annealer graph.
https://arxiv.org/abs/1810.09342
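The idea of penalizing already visited solutions in a QUBO problem, as in the tabu search abstract above, can be sketched as follows. This is a toy illustration and not the algorithm from the paper: an exhaustive search (`brute_force_minimize`) stands in for the annealer, and the linear penalty in `add_tabu_penalty` is an assumed, hypothetical choice of tabu term.

```python
import itertools

import numpy as np


def qubo_energy(Q, x):
    """Energy x^T Q x of a binary vector x."""
    return float(x @ Q @ x)


def brute_force_minimize(Q):
    """Stand-in for the annealer: exhaustive search, viable only for tiny n."""
    n = Q.shape[0]
    best = min(itertools.product([0, 1], repeat=n),
               key=lambda bits: qubo_energy(Q, np.array(bits)))
    return np.array(best)


def add_tabu_penalty(Q, visited, strength):
    """Return a copy of Q whose diagonal penalizes agreement with `visited`.

    Up to a dropped constant, the added term is strength * (number of bits on
    which x agrees with visited), so solutions near `visited` become costlier
    relative to solutions far from it.
    """
    Qp = Q.astype(float).copy()
    Qp[np.diag_indices_from(Qp)] += strength * (2 * visited - 1)
    return Qp


def tabu_search(Q, iterations=3, strength=1.0):
    """Alternate between minimizing the working QUBO and penalizing the result."""
    Q_work = Q.astype(float).copy()
    best_x, best_e = None, np.inf
    for _ in range(iterations):
        x = brute_force_minimize(Q_work)
        e = qubo_energy(Q, x)  # always score on the original problem
        if e < best_e:
            best_x, best_e = x, e
        Q_work = add_tabu_penalty(Q_work, x, strength)
    return best_x, best_e
```

Scoring every candidate on the original matrix while modifying only the working copy keeps the penalties from distorting the reported objective value.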
In this paper, we present the results of adopting skip connections and dense layers, previously used in image classification tasks, in the Fisher GAN implementation. We have experimented with different numbers of layers and with inserting these connections in different sections of the network. Our findings suggest that networks implemented with the connections produce better images than the baseline, and that the number of connections added has only a slight effect on the result.
https://arxiv.org/abs/1804.11031
Answering questions according to multi-modal context is a challenging problem as it requires a deep integration of different data sources. Existing approaches only employ partial interactions among data sources in one attention hop. In this paper, we present the Holistic Multi-modal Memory Network (HMMN) framework, which fully considers the interactions between different input sources (multi-modal context, question) in each hop. In addition, it takes answer choices into consideration during the context retrieval stage. Therefore, the proposed framework effectively integrates multi-modal context, question, and answer information, which leads to more informative context being retrieved for question answering. Our HMMN framework achieves state-of-the-art accuracy on the MovieQA dataset. Extensive ablation studies show the importance of holistic reasoning and the contributions of different attention strategies.
https://arxiv.org/abs/1811.04595
With the advent of large-scale heterogeneous networks comes the problem of unified network control, resulting in security lapses that could otherwise have been avoided. A mechanism is needed to detect and deflect intruders in order to safeguard resource-constrained edge devices and networks as well. In this paper we demonstrate the use of an optimized pattern recognition algorithm to detect such attacks. Furthermore, we propose an Intrusion Detection System (IDS) methodology and design architecture for the Internet of Things that makes use of this search algorithm to thwart various security breaches. Numerical results are presented from tests conducted with the aid of the NSL KDD cup dataset, showing the efficacy of the IDS.
https://arxiv.org/abs/1811.04582
Convolutional auto-encoders have shown remarkable performance when stacked into deep convolutional neural networks for classifying image data during the past several years. However, they are unable to construct state-of-the-art convolutional neural networks due to their intrinsic architectures. In this regard, we propose a flexible convolutional auto-encoder by eliminating the constraints on the numbers of convolutional layers and pooling layers from the traditional convolutional auto-encoder. We also design an architecture discovery method using particle swarm optimization, which is capable of automatically searching for the optimal architectures of the proposed flexible convolutional auto-encoder with much less computational resource and without any manual intervention. We use the designed architecture optimization algorithm to test the proposed flexible convolutional auto-encoder, using a single graphics processing unit card, on four extensively used image classification datasets. Experimental results show that our work significantly outperforms the peer competitors, including the state-of-the-art algorithm.
https://arxiv.org/abs/1712.05042
Real-time searches for faint radio pulses from unknown radio transients are computationally challenging. Detection becomes further complicated by the continuously increasing technical capabilities of transient surveys: telescope sensitivity, searched area of the sky, number of antennas or dishes, and temporal and frequency resolution. The new Apertif transient survey on the Westerbork telescope runs in real time on GPUs by means of the single-pulse search pipeline AMBER (Sclocco, 2017). AMBER initially carries out auto-tuning: it finds the optimal configuration of user-controlled parameters for each of the four pipeline kernels so that each kernel performs its task as fast as possible. The pipeline uses a brute-force (BF) exhaustive search, which in total takes 5 - 24 hours to run depending on the processing cluster architecture. We apply more heuristic, biologically driven genetic algorithms (GAs) to limit the exploration of the total parameter space, tune all four kernels together and reduce the tuning time to a few hours. Our results show that after only a few hours of tuning, GAs always find similar or even better configurations for all kernels together than the combination of single-kernel configurations tuned by the BF approach. At the same time, by means of their genetic operators, GAs converge to better solutions than those obtained by pure random searches. The explored multi-dimensional parameter space is very complex and has multiple local optima, as the evolution of randomly generated configurations does not always guarantee a global solution.
https://arxiv.org/abs/1811.04165
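The genetic-algorithm tuning described in the AMBER abstract above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the search space `SPACE`, the synthetic `runtime` cost (a stand-in for actually timing a GPU kernel), and the operator choices (truncation selection, uniform crossover, random-reset mutation) are hypothetical and are not taken from AMBER or from the paper.

```python
import random

# Hypothetical search space: parameter name -> allowed values.
SPACE = {
    "threads": [32, 64, 128, 256, 512],
    "items_per_thread": [1, 2, 4, 8],
    "unroll": [1, 2, 4],
}


def runtime(cfg):
    """Synthetic stand-in for timing a kernel; lower is better."""
    return (abs(cfg["threads"] - 128) / 32
            + abs(cfg["items_per_thread"] - 4)
            + abs(cfg["unroll"] - 2))


def random_config(rng):
    return {name: rng.choice(values) for name, values in SPACE.items()}


def crossover(a, b, rng):
    """Uniform crossover: each parameter is inherited from either parent."""
    return {name: (a[name] if rng.random() < 0.5 else b[name]) for name in SPACE}


def mutate(cfg, rng, rate=0.2):
    """Random-reset mutation: occasionally redraw a parameter from its range."""
    return {name: (rng.choice(SPACE[name]) if rng.random() < rate else value)
            for name, value in cfg.items()}


def tune(generations=20, pop_size=12, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=runtime)
        parents = population[: pop_size // 2]  # the fastest half survives
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            children.append(mutate(crossover(mother, father, rng), rng))
        population = parents + children
    best = min(population, key=runtime)
    return best, runtime(best)
```

Because the surviving parents are carried over unchanged, the best configuration found so far is never lost between generations, which is one simple way to keep the search from regressing.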
The end-to-end nature of neural machine translation (NMT) removes many ways of manually guiding the translation process that were available in older paradigms. Recent work, however, has introduced a new capability: lexically constrained or guided decoding, a modification to beam search that forces the inclusion of pre-specified words and phrases in the output. However, while theoretically sound, existing approaches have computational complexities that are either linear (Hokamp and Liu, 2017) or exponential (Anderson et al., 2017) in the number of constraints. We present an algorithm for lexically constrained decoding with a complexity of O(1) in the number of constraints. We demonstrate the algorithm’s remarkable ability to properly place these constraints, and use it to explore the shaky relationship between model and BLEU scores. Our implementation is available as part of Sockeye.
https://arxiv.org/abs/1804.06609
Over the last decade, Convolutional Neural Network (CNN) models have been highly successful in solving complex vision problems. However, these deep models are perceived as “black box” methods considering the lack of understanding of their internal functioning. There has been significant recent interest in developing explainable deep learning models, and this paper is an effort in this direction. Building on a recently proposed method called Grad-CAM, we propose a generalized method called Grad-CAM++ that can provide better visual explanations of CNN model predictions, in terms of better object localization as well as explaining occurrences of multiple object instances in a single image, when compared to the state-of-the-art. We provide a mathematical derivation for the proposed method, which uses a weighted combination of the positive partial derivatives of the last convolutional layer feature maps with respect to a specific class score as weights to generate a visual explanation for the corresponding class label. Our extensive experiments and evaluations, both subjective and objective, on standard datasets showed that Grad-CAM++ provides promising human-interpretable visual explanations for a given CNN architecture across multiple tasks including classification, image caption generation and 3D action recognition, as well as in new settings such as knowledge distillation.
https://arxiv.org/abs/1710.11063
We present RoarNet, a new approach for 3D object detection from a 2D image and 3D Lidar point clouds. Based on a two-stage object detection framework with PointNet as our backbone network, we suggest several novel ideas to improve 3D object detection performance. The first part of our method, RoarNet_2D, estimates the 3D poses of objects from a monocular image, which approximates where to examine further, and derives multiple candidates that are geometrically feasible. This step significantly narrows down feasible 3D regions, which would otherwise require demanding processing of 3D point clouds in a huge search space. Then the second part, RoarNet_3D, takes the candidate regions and conducts in-depth inferences to conclude final poses in a recursive manner. Inspired by PointNet, RoarNet_3D processes 3D point clouds directly without any loss of data, leading to precise detection. We evaluate our method on KITTI, a 3D object detection benchmark. Our results show that RoarNet has superior performance to state-of-the-art methods that are publicly available. Remarkably, RoarNet also outperforms state-of-the-art methods even in settings where Lidar and camera are not time synchronized, which is practically important for actual driving environments. RoarNet is implemented in TensorFlow and publicly available with pre-trained models.
https://arxiv.org/abs/1811.03818
This paper describes a method based on sequence-to-sequence (Seq2Seq) learning with attention and a context preservation mechanism for voice conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving sequence modeling such as speech synthesis and recognition, machine translation, and image captioning. In contrast to current VC techniques, our method 1) stabilizes and accelerates the training procedure by considering guided attention and proposed context preservation losses, 2) allows not only spectral envelopes but also fundamental frequency contours and durations of speech to be converted, 3) requires no context information such as phoneme labels, and 4) requires no time-aligned source and target speech data in advance. In our experiment, the proposed VC framework can be trained in only one day, using only one GPU of an NVIDIA Tesla K80, while the quality of the synthesized speech is higher than that of speech converted by Gaussian mixture model-based VC and is comparable to that of speech generated by recurrent neural network-based text-to-speech synthesis, which can be regarded as an upper limit on VC performance.
https://arxiv.org/abs/1811.04076
The use of synthetic data generated by Generative Adversarial Networks (GANs) has become quite a popular method to do data augmentation for many applications. While practitioners celebrate this as an economical way to get more synthetic data that can be used to train downstream classifiers, it is not clear that they recognize the inherent pitfalls of this technique. In this paper, we aim to exhort practitioners against deriving any false sense of security against data biases based on data augmentation. To drive this point home, we show that starting with a dataset consisting of head-shots of engineering researchers, GAN-based augmentation “imagines” synthetic engineers, most of whom have masculine features and white skin color (inferred from a human subject study conducted on Amazon Mechanical Turk). This demonstrates how biases inherent in the training data are reinforced, and sometimes even amplified, by GAN-based data augmentation; it should serve as a cautionary tale for the lay practitioners.
https://arxiv.org/abs/1811.03751
Commercial iterative reconstruction techniques on modern CT scanners target radiation dose reduction, but there are lingering concerns over their impact on image appearance and low contrast detectability. Recently, machine learning, especially deep learning, has been actively investigated for CT. Here we design a novel neural network architecture for low-dose CT (LDCT) and compare it with commercial iterative reconstruction methods used for standard of care CT. While popular neural networks are trained for end-to-end mapping, driven by big data, our novel neural network is intended for end-to-process mapping so that intermediate image targets are obtained with the associated search gradients along which the final image targets are gradually reached. This learned dynamic process allows radiologists to be included in the training loop to optimize the LDCT denoising workflow in a task-specific fashion, with the denoising depth as a key parameter. Our progressive denoising network was trained with the Mayo LDCT Challenge Dataset, and tested on images of the chest and abdominal regions scanned on CT scanners made by three leading CT vendors. The best deep learning based reconstructions are systematically compared to the best iterative reconstructions in a double-blinded reader study. It is found that our deep learning approach performs either comparably or favorably in terms of noise suppression and structural fidelity, and runs orders of magnitude faster than the commercial iterative CT reconstruction algorithms.
https://arxiv.org/abs/1811.03691
Modern Visual Question Answering (VQA) models have been shown to rely heavily on superficial correlations between question and answer words learned during training, such as overwhelmingly reporting the type of room as kitchen or the sport being played as tennis, irrespective of the image. Most alarmingly, this shortcoming is often not well reflected during evaluation because the same strong priors exist in test distributions; however, a VQA system that fails to ground questions in image content would likely perform poorly in real-world settings. In this work, we present a novel regularization scheme for VQA that reduces this effect. We introduce a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed. We then pose training as an adversarial game between the VQA model and this question-only adversary – discouraging the VQA model from capturing language biases in its question encoding. Further, we leverage this question-only model to estimate the increase in model confidence after considering the image, which we maximize explicitly to encourage visual grounding. Our approach is a model-agnostic training procedure and simple to implement. We show empirically that it can improve performance significantly on a bias-sensitive split of the VQA dataset for multiple base models – achieving state-of-the-art on this task. Further, on standard VQA tasks, our approach shows a significantly smaller drop in accuracy compared to existing bias-reducing VQA models.
https://arxiv.org/abs/1810.03649
With the development of Connected Vehicle (CV) technology, temporal variation of roadway traffic can be captured by sharing Basic Safety Messages (BSMs) from each vehicle using the communication between vehicles as well as with transportation roadside infrastructure (e.g., traffic signals) and traffic management centers. However, the penetration of connected vehicles in the near future will be limited. BSMs from limited CVs could provide an inaccurate estimation of current speed or space headway. This inaccuracy in the estimated current average speed and average space headway data is termed noise. This noise in the traffic data significantly reduces the prediction accuracy of a machine learning model, such as the accuracy of a long short-term memory (LSTM) model in predicting traffic conditions. To improve the real-time prediction accuracy with low penetration of CVs, we developed a traffic data prediction model that combines the LSTM with a noise reduction model (the standard Kalman filter or the Kalman filter-based Rauch-Tung-Striebel (RTS) smoother). The average speed and space headway used in this study were generated from the Enhanced Next Generation Simulation (NGSIM) dataset, which contains vehicle trajectory data for every one tenth of a second. Compared to a baseline LSTM model without any noise reduction, for 5 percent penetration of CVs, the analyses revealed that the combined LSTM/RTS model reduced the mean absolute percentage error (MAPE) from 19 percent to 5 percent for speed prediction and from 27 percent to 9 percent for space headway prediction. The overall reduction of the MAPE value ranged from 1 percent to 14 percent for speed and 2 percent to 18 percent for space headway prediction compared to the baseline model.
https://arxiv.org/abs/1811.03562
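As an illustration of the noise reduction step named in the abstract above, the following sketch implements a scalar Kalman filter followed by a Rauch-Tung-Striebel (RTS) backward pass. It is a minimal example under stated assumptions rather than the authors' pipeline: the random-walk state model, the noise variances `q` and `r`, and the initial values `x0` and `p0` are hypothetical, and the LSTM that would consume the denoised series is omitted.

```python
import numpy as np


def kalman_filter(z, q=0.01, r=1.0, x0=0.0, p0=1.0):
    """Scalar Kalman filter for the model x_k = x_{k-1} + w_k, z_k = x_k + v_k.

    q and r are the assumed process and measurement noise variances.
    Returns filtered means/variances and the one-step predictions.
    """
    n = len(z)
    x_f, p_f = np.zeros(n), np.zeros(n)
    x_p, p_p = np.zeros(n), np.zeros(n)
    x, p = x0, p0
    for k in range(n):
        x_p[k], p_p[k] = x, p + q                # predict
        gain = p_p[k] / (p_p[k] + r)             # Kalman gain
        x = x_p[k] + gain * (z[k] - x_p[k])      # update with measurement z_k
        p = (1.0 - gain) * p_p[k]
        x_f[k], p_f[k] = x, p
    return x_f, p_f, x_p, p_p


def rts_smoother(x_f, p_f, x_p, p_p):
    """RTS backward pass that refines each filtered estimate with later data."""
    x_s, p_s = x_f.copy(), p_f.copy()
    for k in range(len(x_f) - 2, -1, -1):
        c = p_f[k] / p_p[k + 1]                  # smoother gain
        x_s[k] = x_f[k] + c * (x_s[k + 1] - x_p[k + 1])
        p_s[k] = p_f[k] + c ** 2 * (p_s[k + 1] - p_p[k + 1])
    return x_s, p_s
```

A typical use would pass a noisy speed or headway series through `kalman_filter`, then through `rts_smoother`, and feed the smoothed series to the downstream predictor. The smoothed variances are never larger than the filtered ones, which reflects the extra information gained from later measurements.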
Generative Adversarial Networks have shown impressive results for the task of object translation, including face-to-face translation. A key component behind the success of recent approaches is the self-consistency loss, which encourages a network to recover the original input image when the output generated for a desired attribute is itself passed through the same network, but with the target attribute inverted. While the self-consistency loss yields photo-realistic results, it can be shown that the input and target domains, supposed to be close, differ substantially. This is empirically found by observing that a network recovers the input image even if attributes other than the inversion of the original goal are set as target. This prevents combining networks for different tasks, or using a network to do progressive forward passes. In this paper, we show empirical evidence of this effect, and propose a new loss to bridge the gap between the distributions of the input and target domains. This “triple consistency loss” aims to minimise the distance between the outputs generated by the network for different routes to the target, independent of any intermediate steps. To show this is effective, we incorporate the triple consistency loss into the training of a new landmark-guided face-to-face synthesis, where, contrary to previous works, the generated images can simultaneously undergo a large transformation in both expression and pose. To the best of our knowledge, we are the first to tackle the problem of mismatching distributions in self-domain synthesis, and to propose “in-the-wild” landmark-guided synthesis. Code will be made available.
https://arxiv.org/abs/1811.03492
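The triple consistency idea from the abstract above, that different routes to the same target should give the same output, can be sketched with toy generators. This is an illustrative reading under stated assumptions: the L1 distance, the two-element "image" made of an identity value and an attribute value, and the functions `ideal_generator` and `leaky_generator` are all hypothetical and not taken from the paper.

```python
import numpy as np


def triple_consistency_loss(generator, image, intermediate, target):
    """Mean absolute difference between the direct and the two-step route."""
    direct = generator(image, target)
    via_intermediate = generator(generator(image, intermediate), target)
    return float(np.mean(np.abs(direct - via_intermediate)))


def ideal_generator(image, attribute):
    """Toy generator: keeps the identity entry, overwrites the attribute entry."""
    return np.array([image[0], attribute])


def leaky_generator(image, attribute):
    """Toy generator whose output identity drifts with the input attribute."""
    return np.array([image[0] + 0.1 * image[1], attribute])
```

For the ideal toy generator the loss is zero for any intermediate attribute, while the leaky one accumulates drift on the two-step route and is penalized, which is the behaviour the loss is meant to discourage.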
Multi-source translation systems translate from multiple languages to a single target language. By using information from these multiple sources, these systems achieve large gains in accuracy. To train these systems, it is necessary to have corpora with parallel text in multiple sources and the target language. However, these corpora are rarely complete in practice due to the difficulty of providing human translations in all of the relevant languages. In this paper, we propose a data augmentation approach to fill such incomplete parts using multi-source neural machine translation (NMT). In our experiments, results varied over different language combinations but significant gains were observed when using a source language similar to the target language.
https://arxiv.org/abs/1810.06826
Recurrent neural networks can learn complex transduction problems that require maintaining and actively exploiting a memory of their inputs. Such models traditionally consider memory and input-output functionalities indissolubly entangled. We introduce a novel recurrent architecture based on the conceptual separation between the functional input-output transformation and the memory mechanism, showing how they can be implemented through different neural components. Building on this conceptualization, we introduce the Linear Memory Network, a recurrent model comprising a feedforward neural network, realizing the non-linear functional transformation, and a linear autoencoder for sequences, implementing the memory component. The resulting architecture can be efficiently trained by building on closed-form solutions to linear optimization problems. Further, by exploiting equivalence results between feedforward and recurrent neural networks, we devise a pretraining schema for the proposed architecture. Experiments on polyphonic music datasets show competitive results against gated recurrent networks and other state-of-the-art models.
https://arxiv.org/abs/1811.03356
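The separation between a non-linear functional component and a linear memory, as described in the Linear Memory Network abstract above, can be sketched as a forward pass. This is one plausible reading and not the authors' implementation: the update equations, the `tanh` activation, the zero initial memory and the weight names (`W_xh`, `W_mh`, `W_hm`, `W_mm`, `W_mo`) are assumptions made for illustration, and training and pretraining are omitted.

```python
import numpy as np


def lmn_forward(xs, W_xh, W_mh, W_hm, W_mm, W_mo):
    """Forward pass of a sketch recurrent model with a purely linear memory.

    xs has shape (time, input_dim). Returns per-step outputs and final memory.
    """
    memory = np.zeros(W_mm.shape[0])
    outputs = []
    for x in xs:
        # Non-linear functional component: reads the input and the old memory.
        hidden = np.tanh(W_xh @ x + W_mh @ memory)
        # Linear memory component: no activation function is applied here.
        memory = W_hm @ hidden + W_mm @ memory
        outputs.append(W_mo @ memory)
    return np.stack(outputs), memory
```

Keeping the memory update free of non-linearities is what, per the abstract, makes closed-form solutions to linear optimization problems applicable to that component.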