Although Convolutional Neural Networks (CNNs) are effective in various computer vision tasks, their significant storage requirements hinder deployment on computationally limited devices. In this paper, we propose a new compact and portable deep learning network, Modulated Binary CliqueNet (MBCliqueNet), which aims to improve the portability of CNNs using binarized filters while achieving performance comparable to full-precision CNNs such as ResNet. In MBCliqueNet, we introduce a novel modulated operation to approximate the unbinarized filters and give an initialization method to speed up its convergence. We reduce the extra parameters introduced by the modulated operation through parameter sharing. As a result, the proposed MBCliqueNet reduces the storage required for convolutional filters by a factor of at least 32, in contrast to the full-precision model, and achieves better performance than other state-of-the-art binarized models. More importantly, our model even compares favorably with some full-precision models, such as ResNet, on the datasets we used.
http://arxiv.org/abs/1902.10460
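As a rough illustration of why binarized filters give roughly a 32x saving, here is a minimal XNOR-Net-style binarization with a per-filter scaling factor; this is a generic sketch of the idea, not the paper's actual modulated operation:

```python
import numpy as np

def binarize_filter(w):
    """Approximate a real-valued filter w by alpha * sign(w).

    alpha = mean(|w|) minimizes ||w - alpha * b||_2 over binary b
    (the classic XNOR-Net closed form); the paper's modulated
    operation refines this kind of approximation.
    """
    b = np.sign(w)
    b[b == 0] = 1.0            # break ties so b is strictly binary
    alpha = np.abs(w).mean()   # per-filter scaling factor
    return alpha, b

w = np.array([0.7, -0.3, 0.5, -0.9])
alpha, b = binarize_filter(w)
# storing b needs 1 bit/weight instead of 32 -> ~32x compression
approx = alpha * b
```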
Neural networks are widely used as a model for classification in a large variety of tasks. Typically, a learnable transformation (i.e. the classifier) is placed at the end of such models, returning a value for each class used for classification. This transformation plays an important role in determining how the generated features change during the learning process. In this work we argue that this transformation not only can be fixed (i.e. set as non-trainable) with no loss of accuracy, but it can also be used to learn stationary and maximally discriminative embeddings. We show that the stationarity of the embedding and its maximally discriminative representation can be theoretically justified by setting the weights of the fixed classifier to values taken from the coordinate vertices of three regular polytopes available in $\mathbb{R}^d$, namely: the $d$-Simplex, the $d$-Cube and the $d$-Orthoplex. These regular polytopes have the maximal amount of symmetry that can be exploited to generate stationary features angularly centered around their corresponding fixed weights. Our approach improves and broadens the concept of a fixed classifier, recently proposed in \cite{hoffer2018fix}, to a larger class of fixed classifier models. Experimental results confirm both the theoretical analysis and the generalization capability of the proposed method.
http://arxiv.org/abs/1902.10441
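The fixed-classifier weights for two of these polytopes are easy to construct. The sketch below (my own illustration, not the authors' code) builds unit-norm vertex sets for the $d$-Orthoplex and the $d$-Cube:

```python
import numpy as np
from itertools import product

def d_orthoplex_weights(d):
    """The 2d unit vertices +/- e_i: fixed weights for a 2d-class problem."""
    eye = np.eye(d)
    return np.vstack([eye, -eye])

def d_cube_weights(d):
    """The 2^d vertices of {-1, 1}^d, normalized to unit length."""
    v = np.array(list(product([-1.0, 1.0], repeat=d)))
    return v / np.sqrt(d)

W = d_orthoplex_weights(3)   # 6 classes in R^3
# logits = features @ W.T, with W frozen (non-trainable)
```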
Multi-Style Transfer (MST) aims to capture the high-level visual vocabulary of different styles and to express these vocabularies in a joint model that can transfer each specific style. Recently, Style Embedding Learning (SEL) based methods represent each style with an explicit set of parameters to perform the MST task. However, most existing SEL methods either learn an explicit style representation with numerous independent parameters or learn a relatively black-box style representation, which makes it difficult to control the stylized results. In this paper, we propose a novel MST model, StyleRemix, that compactly and explicitly integrates multiple styles into one network. By decomposing diverse styles onto the same basis, StyleRemix represents a specific style in a continuous vector space with one-dimensional coefficients. With this interpretable style representation, StyleRemix not only enables the style visualization task but also allows several ways of remixing styles in the smooth style embedding space. Extensive experiments demonstrate the effectiveness of StyleRemix on various MST tasks compared to state-of-the-art SEL approaches.
http://arxiv.org/abs/1902.10425
Convolutional neural networks (CNNs) can model complicated non-linear relations between images. However, they are notoriously sensitive to small changes in the input. Most CNNs trained to describe image-to-image mappings generate temporally unstable results when applied to video sequences, leading to flickering artifacts and other inconsistencies over time. In order to use CNNs for video material, previous methods have relied on estimating dense frame-to-frame motion information (optical flow) in the training and/or the inference phase, or by exploring recurrent learning structures. We take a different approach to the problem, posing temporal stability as a regularization of the cost function. The regularization is formulated to account for different types of motion that can occur between frames, so that temporally stable CNNs can be trained without the need for video material or expensive motion estimation. The training can be performed as a fine-tuning operation, without architectural modifications of the CNN. Our evaluation shows that the training strategy leads to large improvements in temporal smoothness. Moreover, in situations where the quantity of training data is limited, the regularization can help in boosting the generalization performance to a much larger extent than what is possible with naïve augmentation strategies.
http://arxiv.org/abs/1902.10424
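The core idea of a motion-aware regularizer can be sketched as a penalty on output changes under a simulated inter-frame transformation. The transformation (a pixel shift) and the toy "network" below are simplified placeholders, not the paper's formulation:

```python
import numpy as np

def temporal_stability_loss(f, x, shift=1):
    """Penalize output changes under a small simulated inter-frame
    motion (here: a horizontal pixel shift). This stands in for a
    motion-aware regularizer; the transformation family used in the
    paper may differ.
    """
    x_moved = np.roll(x, shift, axis=-1)   # simulated frame-to-frame motion
    return np.mean((f(x) - f(x_moved)) ** 2)

f = lambda img: img.mean(axis=-1)   # toy translation-invariant "network"
x = np.random.rand(8, 8)
loss = temporal_stability_loss(f, x)   # ~0: f is stable under the shift
```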
The main obstacle to weakly supervised semantic image segmentation is the difficulty of obtaining pixel-level information from coarse image-level annotations. Most methods based on image-level annotations use localization maps obtained from the classifier, but these focus only on the small discriminative parts of objects and do not capture precise boundaries. FickleNet explores diverse combinations of locations on feature maps created by generic deep neural networks. It selects hidden units randomly and then uses them to obtain activation scores for image classification. FickleNet implicitly learns the coherence of each location in the feature maps, resulting in a localization map which identifies both discriminative and other parts of objects. The ensemble effect is obtained from a single network by selecting random hidden-unit pairs, which means that a variety of localization maps is generated from a single image. Our approach does not require any additional training steps and only adds a simple layer to a standard convolutional neural network; nevertheless, it outperforms recent comparable techniques on the Pascal VOC 2012 benchmark in both weakly and semi-supervised settings.
http://arxiv.org/abs/1902.10421
Automatic question generation is an important technique that can improve the training of question answering, help chatbots start or continue a conversation with humans, and provide assessment materials for educational purposes. Existing neural question generation models are insufficient mainly due to their inability to properly model the process by which each word in the question is selected, i.e., whether it repeats the given passage or is generated from a vocabulary. In this paper, we propose our Clue Guided Copy Network for Question Generation (CGC-QG), which is a sequence-to-sequence generative model with a copying mechanism, yet employing a variety of novel components and techniques to boost the performance of question generation. In CGC-QG, we design a multi-task labeling strategy to identify whether a question word should be copied from the input passage or generated instead, guiding the model to learn the accurate boundaries between copying and generation. Furthermore, our input passage encoder takes as input, among a diverse range of other features, the prediction made by a clue word predictor, which helps identify whether each word in the input passage is a potential clue to be copied into the target question. The clue word predictor is designed based on a novel application of Graph Convolutional Networks to a syntactic dependency tree representation of each passage, and is thus able to predict clue words based only on their context in the passage and their relative positions to the answer in the tree. We jointly train the clue prediction and question generation tasks with multi-task learning and a number of practical strategies to reduce complexity. Extensive evaluations show that our model significantly improves the performance of question generation and outperforms all previous state-of-the-art neural question generation models by a substantial margin.
http://arxiv.org/abs/1902.10418
Modern neural networks are over-parametrized. In particular, each rectified linear hidden unit can be modified by a multiplicative factor by adjusting its input and output weights, without changing the rest of the network. Inspired by the Sinkhorn-Knopp algorithm, we introduce a fast iterative method for minimizing the L2 norm of the weights, equivalently the weight decay regularizer. It provably converges to a unique solution. Interleaving our algorithm with SGD during training improves the test accuracy. For small batches, our approach offers an alternative to batch- and group-normalization on CIFAR-10 and ImageNet with a ResNet-18.
http://arxiv.org/abs/1902.10416
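A single step of this norm-minimizing rescaling can be illustrated on a two-layer ReLU network. The closed-form per-unit factor below follows from the positive homogeneity of ReLU and is a simplified one-shot version of the iterative algorithm, under my own notation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def forward(x, W1, w2):
    return relu(x @ W1) @ w2

def rebalance(W1, w2):
    """Per-hidden-unit rescaling exploiting ReLU positive homogeneity:
    relu(a * z) = a * relu(z) for a > 0, so scaling unit i's input
    weights by lam_i and its output weight by 1/lam_i leaves the
    network function unchanged. lam_i = sqrt(|w2_i| / ||W1[:, i]||)
    minimizes the L2 norm of the weights for that unit (one step of
    the balancing iteration).
    """
    lam = np.sqrt(np.abs(w2) / np.linalg.norm(W1, axis=0))
    return W1 * lam, w2 / lam

rng = np.random.default_rng(0)
W1, w2 = rng.normal(size=(3, 4)), rng.normal(size=4)
x = rng.normal(size=(5, 3))
W1b, w2b = rebalance(W1, w2)
same_output = np.allclose(forward(x, W1, w2), forward(x, W1b, w2b))
smaller_norm = (np.sum(W1b**2) + np.sum(w2b**2)
                <= np.sum(W1**2) + np.sum(w2**2) + 1e-9)
```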
Autonomous mobile manipulation is at the cutting edge of modern robotics, offering the dual advantage of the mobility provided by a mobile platform and the dexterity afforded by the manipulator. A common approach to controlling these systems is task-space control. In a nutshell, a task-space controller maps user-defined end-effector references to actuation commands by solving an optimization problem over the distance between the reference trajectories and physically consistent motions. The optimization, however, ignores the effect of the current decision on future error, which limits the applicability of the approach for dynamically stable platforms. In contrast, the Model Predictive Control (MPC) approach can look ahead and trade off current against future tracking errors. Here, we transcribe the task in the end-effector space, which makes the task description more natural for the user. Furthermore, we show how the MPC-based controller incorporates reference forces at the end-effector into the control problem. To this end, we showcase the advantages of this MPC approach for controlling a ball-balancing mobile manipulator, Rezero. We validate our controller on hardware for tasks such as end-effector pose tracking and door opening.
http://arxiv.org/abs/1902.10415
In this work we investigate the computation of nonlinear eigenfunctions via the extinction profiles of gradient flows. We analyze a scheme that recursively subtracts such eigenfunctions from given data and show that this procedure yields a decomposition of the data into eigenfunctions in some cases, such as the one-dimensional total variation. We discuss results of numerical experiments in which we use extinction profiles and the gradient flow for the task of spectral graph clustering, as used, e.g., in machine learning applications.
http://arxiv.org/abs/1902.10414
The hypernetwork mechanism allows one to generate and train neural networks (target networks) using another neural network (a hypernetwork). In this paper, we extend this idea and show that hypernetworks are able to generate target networks that can be customized to serve different purposes. In particular, we apply this mechanism to create a continuous functional representation of images: the hypernetwork takes an image and, at test time, produces the weights of a target network that approximates its RGB pixel intensities. Owing to the continuity of the representation, we can view the image at different scales or fill in missing regions. Second, we demonstrate how to design a hypernetwork that produces a generative model for a new data set at test time. Experimental results demonstrate that the proposed mechanism can be successfully used in super-resolution and 2D object modeling.
http://arxiv.org/abs/1902.10404
Advertising (ad for short) keyword suggestion is important for sponsored search, as it improves online advertising and increases search revenue. There are two common challenges in this task. First, the keyword bidding problem: hot ad keywords are very expensive for most advertisers because more advertisers bid on the more popular keywords, while unpopular keywords are difficult to discover. As a result, most ads have few chances to be presented to users. Second, the inefficient ad impression issue: a large proportion of search queries, which are unpopular yet relevant to many ad keywords, have no ads presented on their search result pages. Existing retrieval-based or matching-based methods either intensify the bidding competition or are unable to suggest novel keywords that cover more queries, which leads to inefficient ad impressions. To address the above issues, this work investigates the use of generative neural networks for keyword generation in sponsored search. Given a purchased keyword (a word sequence) as input, our model can generate a set of keywords that are not only relevant to the input but also satisfy a domain constraint, which enforces that the domain category of a generated keyword is as expected. Furthermore, a reinforcement learning algorithm is proposed to adaptively utilize domain-specific information in keyword generation. Offline evaluation shows that the proposed model can generate keywords that are diverse, novel, relevant to the source keyword, and consistent with the domain constraint. Online evaluation shows that generative models can substantially improve coverage (COV), click-through rate (CTR), and revenue per mille (RPM) in sponsored search.
http://arxiv.org/abs/1902.10374
Deep neural networks (DNNs) have achieved great success in a wide range of computer vision areas, but their application to mobile devices is limited by their high storage and computational cost. Much effort has been devoted to compressing DNNs. In this paper, we propose a simple yet effective method for deep network compression, named Cluster Regularized Quantization (CRQ), which can reduce the representation precision of a full-precision model to ternary values without a significant accuracy drop. In particular, the proposed method reduces the quantization error by introducing a cluster regularization term, imposed on the full-precision weights so that they naturally concentrate around the target values. By explicitly regularizing the weights during the re-training stage, the full-precision model achieves a smooth transition to the low-bit one. Comprehensive experiments on benchmark datasets demonstrate the effectiveness of the proposed method.
http://arxiv.org/abs/1902.10370
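A cluster regularization term of this kind can be illustrated as a nearest-target penalty that pulls each full-precision weight toward the ternary set {-alpha, 0, +alpha}. The form below is an illustrative guess at such a term, not the paper's exact formulation:

```python
import numpy as np

def cluster_regularizer(w, alpha):
    """Nearest-target penalty: each weight is attracted to whichever
    of {-alpha, 0, +alpha} is closest, so re-training concentrates
    weights around the ternary quantization targets.
    """
    targets = np.array([-alpha, 0.0, alpha])
    d = (w[..., None] - targets) ** 2   # squared distance to each target
    return d.min(axis=-1).sum()         # penalty w.r.t. nearest target

w = np.array([0.48, -0.02, -0.51, 0.9])
reg = cluster_regularizer(w, alpha=0.5)   # small when weights sit near targets
```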
Channel pruning, which seeks to reduce the model size by removing redundant channels, is a popular solution for deep network compression. Existing channel pruning methods usually conduct layer-wise channel selection by directly minimizing the reconstruction error of feature maps between the baseline model and the pruned one. However, they ignore the feature and semantic distributions within feature maps and the real contribution of channels to overall performance. In this paper, we propose a new channel pruning method that explicitly uses both the intermediate outputs of the baseline model and the classification loss of the pruned model to supervise layer-wise channel selection. In particular, we introduce an additional loss to encode the differences in the feature and semantic distributions within feature maps between the baseline model and the pruned one. By considering the reconstruction error, the additional loss, and the classification loss at the same time, our approach can significantly improve the performance of the pruned model. Comprehensive experiments on benchmark datasets demonstrate the effectiveness of the proposed method.
http://arxiv.org/abs/1902.10364
Nowadays, derogatory comments are commonplace, not only offline but also, and especially, in online environments such as social networking websites and communities. An identification-and-prevention system is therefore a necessity across social networking websites, applications, and communities in the digital world. In such a system, the identification block should detect any negative online behaviour and signal the prevention block to act accordingly. This study aims to analyse a piece of text and detect different types of toxicity, such as obscenity, threats, insults, and identity-based hatred. The labelled Wikipedia Comment Dataset prepared by Jigsaw is used for this purpose. A six-headed tf-idf machine learning model was built and trained separately for each label, yielding a mean validation accuracy of 98.08% and an absolute validation accuracy of 91.61%. Such an automated system should be deployed to foster healthy online conversation.
http://arxiv.org/abs/1903.06765
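A minimal tf-idf featurization, of the kind such a per-label model would build on, can be sketched in pure Python. This is a toy illustration, not the study's actual pipeline:

```python
import math
from collections import Counter

def tfidf(docs):
    """Minimal tf-idf features: term frequency weighted by inverse
    document frequency. A full toxicity system would train one
    classifier head per label on top of vectors like these.
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c / len(doc) * idf[t] for t, c in Counter(doc).items()}
            for doc in tokenized]

vecs = tfidf(["you are awful", "you are kind", "kind words help"])
# frequent tokens ("you", "are") get lower idf weight than rarer,
# more telling tokens ("awful")
```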
State-of-the-art deep neural network recognition systems are designed for a static and closed world. It is usually assumed that the distribution at test time will be the same as the distribution during training. As a result, classifiers are forced to categorise observations into one out of a set of predefined semantic classes. Robotic problems are dynamic and open world; a robot will likely observe objects that are from outside of the training set distribution. Classifier outputs in robotic applications can lead to real-world robotic action and as such, a practical recognition system should not silently fail by confidently misclassifying novel observations. We show how a deep metric learning classification system can be applied to such open set recognition problems, allowing the classifier to label novel observations as unknown. Further to detecting novel examples, we propose an open set active learning approach that allows a robot to efficiently query a user about unknown observations. Our approach enables a robot to improve its understanding of the true distribution of data in the environment, from a small number of label queries. Experimental results show that our approach significantly outperforms comparable methods in both the open set recognition and active learning problems.
http://arxiv.org/abs/1902.10363
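Labeling novel observations as unknown via a distance threshold in a learned metric space can be sketched as follows. The threshold tau and the centroid representation are illustrative assumptions, not the paper's exact decision rule:

```python
import numpy as np

def open_set_predict(embedding, centroids, tau):
    """Predict the nearest class centroid in the metric space, but
    return 'unknown' when even the nearest centroid is farther than
    tau (a hypothetical tuning parameter here).
    """
    dists = np.linalg.norm(centroids - embedding, axis=1)
    k = int(np.argmin(dists))
    return k if dists[k] <= tau else "unknown"

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])   # learned class centroids
p_known = open_set_predict(np.array([0.5, 0.2]), centroids, tau=2.0)
p_novel = open_set_predict(np.array([5.0, -5.0]), centroids, tau=2.0)
```

Observations flagged as unknown are natural candidates for the active-learning label queries the paper proposes.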
GAN-generated face images have recently become increasingly realistic and of such high quality that even human eyes find them hard to detect. On the other hand, the forensics community keeps developing methods to detect these generated fake images and tries to guarantee the credibility of visual content. Although researchers have developed methods to detect generated images, few have explored the important problem of the generalization ability of forensics models. As new types of GANs emerge fast, the ability of forensics models to generalize to new types of GAN images is an essential research topic. In this paper, we explore this problem and propose to use preprocessed images to train a forensic CNN model. By applying similar image-level preprocessing to both real and fake training images, the forensics model is forced to learn more intrinsic features for classifying generated and real face images. Our experimental results demonstrate the effectiveness of the proposed method.
http://arxiv.org/abs/1902.11153
We suggest a new idea, the Editorial Network - a mixed extractive-abstractive summarization approach, applied as a post-processing step over a given sequence of extracted sentences. Our network tries to imitate the decision process of a human editor during summarization. Within such a process, each extracted sentence may be kept untouched, rephrased, or rejected entirely. We further suggest an effective way of training the “editor” based on a novel soft-labeling approach. Using the CNN/DailyMail dataset, we demonstrate the effectiveness of our approach compared to state-of-the-art extractive-only or abstractive-only baseline methods.
http://arxiv.org/abs/1902.10360
We present a robot eye-hand coordination learning method that can directly learn a visual task specification by watching human demonstrations. The task specification is represented as a task function, which is learned using inverse reinforcement learning (IRL) by inferring differential rewards between state changes. The learned task function is then used as continuous feedback in an uncalibrated visual servoing (UVS) controller designed for the execution phase. Our proposed method can learn directly from raw videos, which removes the need for hand-engineered task specification. It also provides task interpretability by directly approximating the task function. Moreover, benefiting from the use of a traditional UVS controller, our training process is efficient and the learned policy is independent of a particular robot platform. Various experiments were designed to show that, for a certain DOF task, our method can adapt to task/environment variations in target positions, backgrounds, illumination, and occlusions without prior retraining.
http://arxiv.org/abs/1810.00159
With the rapid development of deep learning, deep neural networks have been widely adopted in many real-life natural language applications. Under deep neural networks, a pre-defined vocabulary is required to vectorize text inputs. The canonical approach to selecting the pre-defined vocabulary is based on word frequency, with a threshold chosen to cut off the long-tail distribution. However, we observed that such a simple approach can easily lead to an under-sized or over-sized vocabulary. Therefore, we are interested in understanding how end-task classification accuracy relates to vocabulary size, and what the minimum required vocabulary size is to achieve a specific performance. In this paper, we provide a more sophisticated method, variational vocabulary dropout (VVD), based on variational dropout, to perform vocabulary selection; it can intelligently select a subset of the vocabulary that achieves the required performance. To evaluate different algorithms on the newly proposed vocabulary selection problem, we propose two new metrics: Area Under the Accuracy-Vocab Curve and Vocab Size under X% Accuracy Drop. Through extensive experiments on various NLP classification tasks, our variational framework is shown to significantly outperform the frequency-based and other selection baselines on these metrics.
http://arxiv.org/abs/1902.10339
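For contrast, the frequency-threshold baseline the paper criticizes takes only a few lines; VVD replaces this fixed cutoff with a learned, task-aware selection:

```python
from collections import Counter

def frequency_vocab(corpus, min_count):
    """The canonical frequency-threshold baseline: keep tokens seen
    at least min_count times, cutting off the long-tail distribution.
    The threshold is picked by hand, independent of the end task.
    """
    counts = Counter(t for doc in corpus for t in doc.split())
    return {t for t, c in counts.items() if c >= min_count}

corpus = ["the cat sat", "the dog sat", "a rare aardvark"]
vocab = frequency_vocab(corpus, min_count=2)   # drops all singleton tokens
```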
In adult laparoscopy, robot-aided surgery is a reality in thousands of operating rooms worldwide, owing to the increased dexterity provided by the robotic tools. Many robots and robot control techniques have been developed to aid in more challenging scenarios, such as pediatric surgery and microsurgery. However, the prevalence of case-specific solutions, particularly those focused on non-redundant robots, reduces the reproducibility of the initial results in more challenging scenarios. In this paper, we propose a general framework for the control of surgical robots in constrained workspaces under teleoperation, regardless of the robot geometry. Our technique comprises a slave-side constrained optimization algorithm, which provides virtual fixtures, and a master-side Cartesian impedance controller, which provides force feedback. Experiments with two robotic systems, one redundant and one non-redundant, show that smooth teleoperation can be achieved in adult laparoscopy and infant surgery.
http://arxiv.org/abs/1809.07907
Despite achieving revolutionary successes in machine learning, deep convolutional neural networks have recently been found to be vulnerable to adversarial attacks and to generalize poorly to novel test images under reasonably large geometric transformations. Inspired by a recent neuroscience discovery revealing that the primate brain employs disentangled shape and appearance representations for object recognition, we propose a general disentangled deep autoencoding regularization framework that can easily be applied to any deep embedding-based classification model to improve the robustness of deep neural networks. Our framework effectively learns a disentangled appearance code and geometric code for robust image classification; it is the first disentangling-based method for defending against adversarial attacks and is complementary to standard defense methods. Extensive experiments on several benchmark datasets show that our proposed regularization framework, leveraging disentangled embeddings, significantly outperforms traditional unregularized convolutional neural networks for image classification, both in robustness against adversarial attacks and in generalization to novel test data.
http://arxiv.org/abs/1902.11134
Taxonomies play an important role in machine intelligence. However, most well-known taxonomies are in English, and non-English taxonomies, especially Chinese ones, are still very rare. In this paper, we focus on automatic Chinese taxonomy construction and propose an effective generation-and-verification framework to build a large-scale, high-quality Chinese taxonomy. In the generation module, we extract isA relations from multiple Chinese encyclopedia sources, which ensures coverage. To further improve precision, we apply three heuristic approaches in the verification module. As a result, we construct the largest Chinese taxonomy, called CN-Probase, with high precision (about 95%). Our taxonomy has been deployed on Aliyun, with over 82 million API calls in six months.
http://arxiv.org/abs/1902.10326
Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful design of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying the Short Fourier Transform to CNN features of the whole video. It additionally derives high-level semantics from an object detector to enrich the representation with the spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish a new state of the art on the MSVD and MSR-VTT datasets for the METEOR and ROUGE_L metrics.
http://arxiv.org/abs/1902.10322
We consider the problem of extracting safe environments and controllers for reach-avoid objectives for systems with known state and control spaces, but unknown dynamics. In a given environment, a common approach is to synthesize a controller from an abstraction or a model of the system (potentially learned from data). However, in many situations, the relationship between the dynamics of the model and the \textit{actual system} is not known; and hence it is difficult to provide safety guarantees for the system. In such cases, the Standard Simulation Metric (SSM), defined as the worst-case norm distance between the model and the system output trajectories, can be used to modify a reach-avoid specification for the system into a more stringent specification for the abstraction. Nevertheless, the obtained distance, and hence the modified specification, can be quite conservative. This limits the set of environments for which a safe controller can be obtained. We propose SPEC, a specification-centric simulation metric, which overcomes these limitations by computing the distance using only the trajectories that violate the specification for the system. We show that modifying a reach-avoid specification with SPEC allows us to synthesize a safe controller for a larger set of environments compared to SSM. We also propose a probabilistic method to compute SPEC for a general class of systems. Case studies using simulators for quadrotors and autonomous cars illustrate the advantages of the proposed metric for determining safe environment sets and controllers.
http://arxiv.org/abs/1902.10320
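The difference between SSM and SPEC can be sketched on sampled trajectory pairs. The deterministic computation below illustrates the two definitions only; it is not the paper's probabilistic algorithm:

```python
import numpy as np

def ssm(model_trajs, system_trajs):
    """Standard Simulation Metric: worst-case sup-norm distance over
    ALL model/system trajectory pairs."""
    return max(np.max(np.abs(m - s))
               for m, s in zip(model_trajs, system_trajs))

def spec_metric(model_trajs, system_trajs, violates):
    """Specification-centric metric: distance computed only over
    trajectories where the system violates the reach-avoid spec;
    0 if nothing violates (illustrative deterministic version).
    """
    d = [np.max(np.abs(m - s))
         for m, s, v in zip(model_trajs, system_trajs, violates) if v]
    return max(d) if d else 0.0

m = [np.array([0.0, 1.0, 2.0]), np.array([0.0, 1.0, 2.0])]
s = [np.array([0.0, 1.0, 2.5]), np.array([0.0, 1.1, 2.1])]
violates = [False, True]   # only the second run breaks the spec
# ssm is driven by the benign first pair (0.5); spec_metric only
# sees the violating pair (0.1), so it is less conservative
```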
Deep neural networks have achieved remarkable success in computer vision, audio, and other tasks. However, in classification domains, deep neural models are easily fooled by adversarial examples. Many attack methods generate adversarial examples with large image distortion and low similarity between the original and the corresponding adversarial examples. To address these issues, we propose an adversarial method that adapts the gradient direction when generating perturbations, producing perturbations that can escape local minima. In this paper, we evaluate several traditional perturbation-generation methods in image classification against ours. Experimental results show that our approach works well and outperforms recent techniques in inducing misclassification, with excellent efficiency in fooling deep network models.
http://arxiv.org/abs/1902.01220
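For context, the classic single-step signed-gradient (FGSM) perturbation that such methods build on looks like this. The paper's method adapts the gradient direction iteratively, which this sketch does not reproduce; the toy linear model is an assumption for illustration:

```python
import numpy as np

def fgsm_perturb(x, grad, eps):
    """One signed-gradient step (the FGSM baseline): move each input
    dimension by eps in the direction that changes the model score,
    then clip back to the valid pixel range [0, 1].
    """
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# toy linear "classifier": score = w . x, so the gradient w.r.t. x is w
w = np.array([0.5, -0.25, 0.0])
x = np.array([0.2, 0.8, 0.5])
x_adv = fgsm_perturb(x, grad=w, eps=0.1)   # small, bounded perturbation
```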
Packet classification is a fundamental problem in computer networking. This problem exposes a hard tradeoff between computation and state complexity, which makes it particularly challenging. To navigate this tradeoff, existing solutions rely on complex hand-tuned heuristics, which are brittle and hard to optimize. In this paper, we propose a deep reinforcement learning (RL) approach to solve the packet classification problem. Several characteristics make this problem a good fit for deep RL. First, many of the existing solutions iteratively build a decision tree by splitting nodes in the tree. Second, the effects of these actions (e.g., splitting nodes) can only be evaluated once we are done building the tree. These two characteristics are naturally captured by the ability of RL to take actions that have sparse and delayed rewards. Third, it is computationally efficient to generate data traces and evaluate decision trees, which alleviates the notoriously high sample complexity of deep RL algorithms. Our solution, NeuroCuts, uses succinct representations to encode the state and action space, and efficiently explores candidate decision trees to optimize for a global objective. It produces compact decision trees optimized for a specific set of rules and a given performance metric, such as classification time, memory footprint, or a combination of the two. Evaluation on ClassBench shows that NeuroCuts outperforms existing hand-crafted algorithms in classification time by 18% at the median, and reduces both time and memory footprint by up to 3x.
http://arxiv.org/abs/1902.10319
The computational demands of computer vision tasks based on state-of-the-art Convolutional Neural Network (CNN) image classification far exceed the energy budgets of mobile devices. This paper proposes FixyNN, which consists of a fixed-weight feature extractor that generates ubiquitous CNN features and a conventional programmable CNN accelerator that processes a dataset-specific CNN. Image classification models for FixyNN are trained end-to-end via transfer learning, with the common feature extractor representing the transferred part and the programmable part being learnt on the target dataset. Experimental results demonstrate that FixyNN hardware can achieve very high energy efficiency of up to 26.6 TOPS/W ($4.81 \times$ better than an iso-area programmable accelerator). Over a suite of six datasets, we trained models via transfer learning with an accuracy loss of $<1\%$, resulting in up to 11.2 TOPS/W - nearly $2 \times$ more efficient than a conventional programmable CNN accelerator of the same area.
http://arxiv.org/abs/1902.11128
In this paper, we propose a light-reflection-based face anti-spoofing method named Aurora Guard (AG), which is fast, simple, and effective, and has already been deployed in real-world systems serving millions of users. Specifically, our method first extracts normal cues via light reflection analysis, and then uses an end-to-end trainable multi-task Convolutional Neural Network (CNN) both to recover subjects' depth maps to assist liveness classification and to provide a light-CAPTCHA checking mechanism in the regression branch that further improves system reliability. Moreover, we collect a large-scale dataset containing $12,000$ live and spoofing samples, covering abundant imaging qualities and Presentation Attack Instruments (PAI). Extensive experiments on both public datasets and ours demonstrate the superiority of the proposed method over the state of the art.
http://arxiv.org/abs/1902.10311
Although Recurrent Neural Networks (RNNs) have been powerful tools for modeling sequential data, their performance is inadequate when processing sequences with multiple patterns. In this paper, we address this challenge by introducing a novel mixture layer and constructing an adaptive RNN. The mixture layer augmented RNN (termed M-RNN) partitions patterns in training sequences into several clusters and stores the principal patterns as prototype vectors of components in a mixture model. By leveraging the mixture layer, the proposed method can adaptively update states according to the similarities between encoded inputs and prototype vectors, leading to a stronger capacity for assimilating sequences with multiple patterns. Moreover, our approach can be further extended by taking advantage of prior knowledge about the data. Experiments on both synthetic and real datasets demonstrate the effectiveness of the proposed method.
http://arxiv.org/abs/1801.08094
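The adaptive state update described in the M-RNN abstract above can be pictured with a small sketch. The similarity measure, the blending rule, and all names below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def mixture_layer_step(h, x_enc, prototypes, tau=1.0):
    """One hypothetical mixture-layer update: soft-assign the encoded
    input to mixture components by similarity to their prototype
    vectors, then blend the weighted prototype into the hidden state."""
    # Cosine similarity between the encoded input and each prototype.
    sims = prototypes @ x_enc / (
        np.linalg.norm(prototypes, axis=1) * np.linalg.norm(x_enc) + 1e-8)
    # Softmax over components gives the soft cluster assignment.
    w = np.exp(sims / tau)
    w = w / w.sum()
    # Adaptive state update: mix the old state with the matched prototype.
    h_new = 0.5 * h + 0.5 * (w @ prototypes)
    return h_new, w
```

Inputs matching one prototype closely drive the state toward that component, which is the sense in which the update is adaptive to the pattern cluster of the current input.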
With the recent rapid development of deep learning, deep neural networks have been widely adopted in many real-life applications. However, deep neural networks are also known to have very little control over their uncertainty on unseen examples, which can cause harmful consequences in practical scenarios. In this paper, we are particularly interested in designing a higher-order uncertainty metric for deep neural networks and investigate its effectiveness under the out-of-distribution detection task proposed by~\cite{hendrycks2016baseline}. Our method first assumes there exists an underlying higher-order distribution $\mathbb{P}(z)$ that controls the label-wise categorical distribution $\mathbb{P}(y)$ over classes on the K-dimensional simplex. It then approximates this higher-order distribution via a parameterized posterior function $p_{\theta}(z|x)$ under a variational inference framework, and finally uses the entropy of the learned posterior distribution $p_{\theta}(z|x)$ as an uncertainty measure to detect out-of-distribution examples. Further, we propose an auxiliary objective function that discriminates against synthesized adversarial examples to further increase the robustness of the proposed uncertainty measure. Through comprehensive experiments on various datasets, our proposed framework is demonstrated to consistently outperform competing algorithms.
http://arxiv.org/abs/1811.07308
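To make the higher-order entropy measure above concrete, the sketch below treats the posterior over the simplex as a Dirichlet and scores inputs by its differential entropy. The Dirichlet choice and the threshold rule are illustrative assumptions; the paper learns $p_{\theta}(z|x)$ variationally rather than fixing a family:

```python
import numpy as np
from scipy.special import gammaln, psi

def dirichlet_entropy(alpha):
    """Differential entropy of Dirichlet(alpha) on the K-dimensional
    simplex. Treating the learned posterior p_theta(z|x) as a Dirichlet
    is an illustrative assumption, not the paper's exact construction."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    log_beta = gammaln(alpha).sum() - gammaln(a0)   # log multivariate Beta
    return (log_beta
            + (a0 - alpha.size) * psi(a0)
            - ((alpha - 1.0) * psi(alpha)).sum())

def flag_out_of_distribution(alpha, threshold):
    # High higher-order entropy means the model is unsure which categorical
    # distribution applies, the signature used to flag OOD inputs.
    return dirichlet_entropy(alpha) > threshold
```

A flat posterior (e.g. $\alpha = (1,1,1)$) has higher entropy than a concentrated one (e.g. $\alpha = (50,1,1)$), so flat-posterior inputs are the ones flagged.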
We propose a novel framework for modeling event-related potentials (ERPs) collected during reading that couples pre-trained convolutional decoders with a language model. Using this framework, we compare the abilities of a variety of existing and novel sentence processing models to reconstruct ERPs. We find that modern contextual word embeddings underperform surprisal-based models but that, combined, the two outperform either on its own.
http://arxiv.org/abs/1902.10296
Grasp detection that considers the affiliations between grasps and their owners in object overlapping scenes is a necessary and challenging task for the practical use of robotic grasping approaches. In this paper, a robotic grasp detection algorithm named ROI-GD is proposed to provide a feasible solution to this problem based on Regions of Interest (ROIs), which are the region proposals for objects. ROI-GD uses features from ROIs to detect grasps instead of the whole scene. It has two stages: the first stage provides ROIs in the input image, and the second stage is the grasp detector based on ROI features. We also contribute a multi-object grasp dataset, which is much larger than the Cornell Grasp Dataset, by labeling the Visual Manipulation Relationship Dataset. Experimental results demonstrate that ROI-GD performs much better in object overlapping scenes while remaining comparable with state-of-the-art grasp detection algorithms on the Cornell Grasp Dataset and the Jacquard Dataset. Robotic experiments demonstrate that ROI-GD can help robots grasp the target in single-object and multi-object scenes with overall success rates of 92.5% and 83.8%, respectively.
http://arxiv.org/abs/1808.10313
Echolocating bats locate targets by echolocation. Many theoretical frameworks have suggested that these abilities are related to the shapes of bats' ears, but few artificial bat-like ears have been built to reproduce them; the difficulty lies in determining the elevation angle of the target. In this study, we present a device with artificial pinnae modeled on the ears of the brown long-eared bat (Plecotus auritus) that can accurately estimate the elevation angle of an aerial target by means of active sonar. An artificial neural network is trained and tested on labeled data obtained from echoes, and is optimized by tenfold cross-validation. A decision method we call the sliding window averaging algorithm is designed to produce the elevation estimates. Finally, a right-angle pinnae configuration is designed for determining the direction of the target. The results show high accuracy for direction determination of a single target. They also demonstrate that for Plecotus auritus, not only the binaural shapes but also the binaural relative orientations play important roles in target localization.
http://arxiv.org/abs/1902.10291
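The sliding window averaging decision named in the abstract above can be sketched as a plain moving average over per-echo elevation estimates. The window length and the exact rule are assumptions for illustration; the paper's algorithm may differ:

```python
def sliding_window_average(estimates, window=3):
    """Hypothetical sketch of sliding-window averaging: smooth a
    sequence of per-echo elevation estimates (e.g. in degrees) with a
    moving average to obtain a stable final reading."""
    return [sum(estimates[i:i + window]) / window
            for i in range(len(estimates) - window + 1)]
```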
The simple prior-free factorization algorithm of Dai et al.\cite{dai2014simple} is an often-cited work in the field of Non-Rigid Structure from Motion (NRSfM). The appeal of this work lies in its simplicity of implementation, its strong theoretical justification of the motion and structure estimation, and its originality. Despite this, the prevailing view is that it performs substantially worse than other methods on several benchmark datasets\cite{jensen2018benchmark,akhter2009nonrigid}. However, our careful investigation provides empirical evidence against this view. The statistical results we obtained surpass Dai et al.'s\cite{dai2014simple} originally reported results on the benchmark datasets by a significant margin under some elementary changes to their core algorithmic idea\cite{dai2014simple}. These results not only expose some unexplored areas of research in NRSfM but also raise new mathematical challenges for NRSfM researchers. In this paper, we explore some of the hidden intricacies missed in Dai et al.'s work\cite{dai2014simple} and show how some elementary measures and modifications can significantly enhance its performance, by as much as 18\% on the benchmark dataset. The improved performance is justified and empirically verified by extensive experiments on several datasets. We believe our work has both practical and theoretical importance for the development of better NRSfM algorithms. Practically, it can also help improve the recently reported state of the art \cite{kumar2017spatio, kumar2016multi, jensen2018benchmark} and similar works in this field that are inspired by Dai et al.\cite{dai2014simple}.
http://arxiv.org/abs/1902.10274
Recent deep learning architectures can recognize instances of 3D point cloud objects of previously seen classes quite well. At the same time, current 3D depth camera technology allows generating/segmenting a large amount of 3D point cloud objects from an arbitrary scene, for which there is no previously seen training data. A challenge for a 3D point cloud recognition system is, then, to classify objects from new, unseen, classes. This issue can be resolved by adopting a zero-shot learning (ZSL) approach for 3D data, similar to the 2D image version of the same problem. ZSL attempts to classify unseen objects by comparing semantic information (attribute/word vector) of seen and unseen classes. Here, we adapt several recent 3D point cloud recognition systems to the ZSL setting with some changes to their architectures. To the best of our knowledge, this is the first attempt to classify unseen 3D point cloud objects in the ZSL setting. A standard protocol (which includes the choice of datasets and the seen/unseen split) to evaluate such systems is also proposed. Baseline performances are reported using the new protocol on the investigated models. This investigation throws a new challenge to the 3D point cloud recognition community that may instigate numerous future works.
http://arxiv.org/abs/1902.10272
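The zero-shot comparison of semantic information described in the abstract above reduces, at inference time, to matching a projected feature against unseen-class word vectors. The projection matrix, the embeddings, and cosine similarity are placeholder assumptions for the learned quantities:

```python
import numpy as np

def zsl_classify(feature, W, class_embeddings):
    """Minimal zero-shot inference sketch: project a 3D point-cloud
    feature into the semantic (word-vector) space with a learned matrix
    W, then return the index of the nearest unseen-class embedding by
    cosine similarity. W and the embeddings stand in for learned values."""
    z = W @ feature
    z = z / (np.linalg.norm(z) + 1e-8)
    E = class_embeddings / (
        np.linalg.norm(class_embeddings, axis=1, keepdims=True) + 1e-8)
    return int(np.argmax(E @ z))
```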
On the one hand, fake news articles nowadays are easily propagated through various online media platforms and have become a grand threat to the trustworthiness of information. On the other hand, our understanding of the language of fake news is still minimal. Incorporating the hierarchical discourse-level structure of fake and real news articles is one crucial step toward a better understanding of how these articles are structured. Nevertheless, this has rarely been investigated in the fake news detection domain, and it faces tremendous challenges: existing methods for capturing discourse-level structure rely on annotated corpora, which are not available for fake news datasets, and it is unclear how and what insightful information can be extracted from such discovered structures. To address these challenges, we propose Discourse-level Hierarchical Structure for Fake news detection (DHSF). DHSF constructs discourse-level structures of fake/real news articles in an automated manner. Moreover, we identify insightful structure-related properties, which can explain the discovered structures and boost our understanding of fake news. Extensive experiments show the effectiveness of the proposed approach. Further structural analysis suggests that real and fake news present substantial differences in their hierarchical discourse-level structure.
http://arxiv.org/abs/1903.07389
Multi-instance learning (MIL) deals with tasks where the data consist of a set of bags, and each bag is represented by a set of instances. Only the bag labels are observed; the label for each instance is not available. Previous MIL studies typically assume that the training and test samples follow the same distribution, an assumption that is often violated in real-world applications. Existing methods address distribution changes by re-weighting the training data with the density ratio between the training and test samples. However, models are often trained without prior knowledge of the test distribution, which renders existing methods inapplicable. Inspired by a connection between MIL and causal inference, we propose a novel framework for addressing distribution change in MIL without relying on the test distribution. Experimental results validate the effectiveness of our approach.
http://arxiv.org/abs/1902.05066
We present algorithms to compute tight upper bounds on the collision probability between two objects with positional uncertainties whose error distributions are given in non-Gaussian forms. Our algorithms can efficiently compute the upper bounds of collision probability when the error distributions are given as a Truncated Gaussian, weighted samples, or a Truncated Gaussian Mixture Model. We create positional error models of static obstacles captured by noisy depth sensors and of dynamic obstacles with complex motion models. We highlight the performance of our probabilistic collision detection algorithms under non-Gaussian positional errors for static and dynamic obstacles in simulated scenarios and real-world robot motion planning scenarios with a 7-DOF robot arm. We demonstrate the benefits of our probabilistic collision detection algorithms when used within a motion planning algorithm to plan collision-free trajectories in environments with sensor and motion uncertainties.
http://arxiv.org/abs/1902.10252
We propose a technique to develop (and localize in) topological maps from light detection and ranging (Lidar) data. Localizing an autonomous vehicle with respect to a reference map in real-time is crucial for its safe operation. Owing to the rich information provided by Lidar sensors, they are emerging as a promising choice for this task. However, since a Lidar outputs a large amount of data every fraction of a second, it is progressively harder to process the information in real-time. Consequently, current systems have migrated towards faster alternatives at the expense of accuracy. To overcome this inherent trade-off between latency and accuracy, we propose a technique to develop topological maps from Lidar data using the orthogonal Tucker3 tensor decomposition. Our experimental evaluations demonstrate that in addition to achieving a high compression ratio as compared to the full data, the proposed technique, $\textit{TensorMap}$, also accurately detects the position of the vehicle in a graph-based representation of a map. We also analyze the robustness of the proposed technique to Gaussian and translational noise, thus initiating explorations into potential applications of tensor decompositions in Lidar data analysis.
http://arxiv.org/abs/1902.10226
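One standard way to compute an orthogonal Tucker3 decomposition, as referenced in the TensorMap abstract above, is the truncated higher-order SVD: take the leading left singular vectors of each mode unfolding as factor matrices, then project the tensor onto them to get the core. This is a generic sketch of the decomposition itself, not the paper's full mapping pipeline:

```python
import numpy as np

def mode_product(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

def tucker3(T, ranks):
    """Truncated higher-order SVD (HOSVD): an orthogonal Tucker3
    decomposition T ~ core x_0 U0 x_1 U1 x_2 U2. Choosing ranks smaller
    than the tensor's dimensions yields the compression."""
    factors = []
    for mode, r in enumerate(ranks):
        unfolding = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :r])          # leading left singular vectors
    core = T
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)  # project onto the factors
    return core, factors
```

With full ranks the reconstruction is exact; truncating the ranks trades accuracy for the compression ratio the abstract refers to.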
Localizing targets of interest in a given hyperspectral (HS) image has applications ranging from remote sensing to surveillance. This task of target detection leverages the fact that each material/object possesses its own characteristic spectral response, depending upon its composition. As $\textit{signatures}$ of different materials are often correlated, matched filtering based approaches may not be appropriate in this case. In this work, we present a technique to localize targets of interest based on their spectral signatures. We also present the corresponding recovery guarantees, leveraging our recent theoretical results. To this end, we model a HS image as a superposition of a low-rank component and a dictionary sparse component, wherein the dictionary consists of the $\textit{a priori}$ known characteristic spectral responses of the target we wish to localize. Finally, we analyze the performance of the proposed approach via experimental validation on real HS data for a classification task, and compare it with related techniques.
http://arxiv.org/abs/1902.11111
We consider the task of localizing targets of interest in a hyperspectral (HS) image based on their spectral signature(s), by posing the problem as two distinct convex demixing tasks. With applications ranging from remote sensing to surveillance, this task of target detection leverages the fact that each material/object possesses its own characteristic spectral response, depending upon its composition. However, since $\textit{signatures}$ of different materials are often correlated, matched filtering-based approaches may not apply here. To this end, we model a HS image as a superposition of a low-rank component and a dictionary sparse component, wherein the dictionary consists of the $\textit{a priori}$ known characteristic spectral responses of the target we wish to localize, and develop techniques for two different sparsity structures resulting from different model assumptions. We also present the corresponding recovery guarantees, leveraging our recent theoretical results from a companion paper. Finally, we analyze the performance of the proposed approach via experimental evaluations on real HS datasets for a classification task, and compare its performance with related techniques.
http://arxiv.org/abs/1902.10238
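The low-rank plus dictionary-sparse model used in the two abstracts above can be demixed with generic proximal steps. The toy solver below alternates singular-value thresholding (for the nuclear norm) with soft thresholding (for the $\ell_1$ penalty on dictionary coefficients); the weights, step size, and iteration count are illustrative, and this is a textbook-style sketch rather than the papers' algorithm:

```python
import numpy as np

def soft_threshold(X, t):
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def demix(M, A, mu=0.05, lam=0.05, step=0.5, iters=300):
    """Toy demixing of M ~ L + A @ S into a low-rank part L and a
    dictionary-sparse part A @ S via alternating proximal gradient on
    mu*||L||_* + lam*||S||_1 + 0.5*||M - L - A@S||_F^2."""
    L = np.zeros_like(M)
    S = np.zeros((A.shape[1], M.shape[1]))
    for _ in range(iters):
        R = M - L - A @ S
        # Singular-value thresholding: prox of the nuclear norm.
        U, s, Vt = np.linalg.svd(L + step * R, full_matrices=False)
        L = (U * np.maximum(s - step * mu, 0.0)) @ Vt
        R = M - L - A @ S
        # Soft thresholding: prox of the l1 penalty on the coefficients.
        S = soft_threshold(S + step * (A.T @ R), step * lam)
    return L, S
```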
In this paper, we investigate the practical challenges of using reinforcement learning agents for question-answering over knowledge graphs. We examine the performance metrics used by state-of-the-art systems and determine that they are inadequate. More specifically, they do not evaluate the systems correctly for situations when there is no answer available, and thus agents optimized for these metrics are poor at modeling confidence. We introduce a simple new performance metric for evaluating question-answering agents that is more representative of practical usage conditions, and optimize for this metric by extending the binary reward structure used in prior work to a ternary reward structure which also rewards an agent for not answering a question rather than giving an incorrect answer. We show that this can drastically improve the precision of answered questions while forgoing answers to only a limited number of questions that were previously answered correctly.
http://arxiv.org/abs/1902.10236
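The ternary reward structure above can be written in a few lines. The specific reward values are illustrative assumptions; only the ordering (correct > abstain > incorrect) follows from the abstract:

```python
def ternary_reward(answered, correct):
    """Ternary reward sketch: reward correct answers, penalize wrong
    ones, and give a small positive reward for abstaining instead of
    answering incorrectly. The exact values are illustrative."""
    if not answered:
        return 0.1                       # abstaining beats a wrong answer
    return 1.0 if correct else -1.0      # but not a correct one
```

Under this shaping, an agent that is unsure maximizes expected reward by abstaining, which is the confidence-modeling behavior the metric is meant to encourage.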
Kernels are powerful and versatile tools in machine learning and statistics. Although the notions of universal kernels and characteristic kernels have been studied, kernel selection still greatly influences empirical performance. While learning the kernel in a data-driven way has been investigated, in this paper we explore learning the spectral distribution of a kernel via implicit generative models parametrized by deep neural networks. We call our method Implicit Kernel Learning (IKL). The proposed framework is simple to train, and inference is performed by sampling random Fourier features. We investigate two applications of the proposed IKL as examples: generative adversarial networks with MMD (MMD GAN) and standard supervised learning. Empirically, MMD GAN with IKL outperforms vanilla predefined kernels on both image and text generation benchmarks; using IKL with Random Kitchen Sinks also leads to substantial improvement over existing state-of-the-art kernel learning algorithms on popular supervised learning benchmarks. Theory and conditions for using IKL in both applications are also studied, as are connections to previous state-of-the-art methods.
http://arxiv.org/abs/1902.10214
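The random-Fourier-feature inference step mentioned above rests on Bochner's theorem: a shift-invariant kernel is the expectation of cosine features over its spectral distribution. In the sketch below the frequencies are plain Gaussian draws, which recovers the RBF kernel; in IKL they would instead come from the learned implicit generator:

```python
import numpy as np

def rff_kernel(X, Y, omegas):
    """Approximate a shift-invariant kernel from samples of its spectral
    distribution. Gaussian `omegas` give the RBF kernel exp(-||x-y||^2/2);
    IKL would replace them with draws from a learned generator."""
    def feats(Z):
        proj = Z @ omegas.T
        # Paired cos/sin features yield an unbiased kernel estimate.
        return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])
    return feats(X) @ feats(Y).T
```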
Currently, college-going students are taking longer to graduate than their parents’ generation did. Further, in the United States, the six-year graduation rate has been 59% for decades. Improving educational quality by training better-prepared students who can graduate in a timely manner is critical. Accurately predicting students’ grades in future courses has attracted much attention, as it can help identify at-risk students early so that personalized feedback can be provided to them on time by advisors. Prior research on students’ grade prediction includes shallow linear models; however, students’ learning is a highly complex process that involves the accumulation of knowledge across a sequence of courses and cannot be sufficiently modeled by these linear models. In addition, prior approaches focus on prediction accuracy without considering prediction uncertainty, which is essential for advising and decision making. In this work, we present two types of Bayesian deep learning models for grade prediction. The first, a multilayer perceptron (MLP), ignores the temporal dynamics of students’ knowledge evolution; hence, we also propose a recurrent neural network (RNN) for students’ performance prediction. To evaluate the performance of the proposed models, we performed extensive experiments on data collected from a large public university. The experimental results show that the proposed models achieve better performance than prior state-of-the-art approaches. Besides more accurate results, Bayesian deep learning models estimate the uncertainty associated with their predictions. We explore how uncertainty estimation can be applied towards developing a reliable educational early warning system. In addition to uncertainty, we also develop an approach to explain the prediction results, which is useful for advisors providing personalized feedback to students.
http://arxiv.org/abs/1902.10213
Deep learning (DL) has recently emerged to address the heavy storage and computation requirements of the baseline dictionary-matching (DM) approach to Magnetic Resonance Fingerprinting (MRF) reconstruction. Fed with non-iterated back-projected images, the network is unable to fully resolve spatially-correlated corruptions caused by the undersampling artefacts. We propose an accelerated iterative reconstruction to minimize these artefacts before feeding the images into the network. This is done through a convex regularization that jointly promotes spatio-temporal regularities of the MRF time-series. Except for training, the rest of the parameter estimation pipeline is dictionary-free. We validate the proposed approach on synthetic and in-vivo datasets.
http://arxiv.org/abs/1902.10205
Automated planning is one of the foundational areas of AI. Since no single planner can work well for all tasks and domains, portfolio-based techniques have become increasingly popular in recent years. In particular, deep learning emerges as a promising methodology for online planner selection. Owing to the recent development of structural graph representations of planning tasks, we propose a graph neural network (GNN) approach to selecting candidate planners. GNNs are advantageous over a straightforward alternative, convolutional neural networks, in that they are invariant to node permutations and incorporate node labels for better inference. Additionally, for cost-optimal planning, we propose a two-stage adaptive scheduling method to further improve the likelihood that a given task is solved in time. The scheduler may switch at halftime to a different planner, conditioned on the observed performance of the first one. Experimental results validate the effectiveness of the proposed method against strong baselines, both deep learning and non-deep learning based.
http://arxiv.org/abs/1811.00210
Understanding the semantics of complex visual scenes often requires analyzing a network of objects and their relations. Such networks are known as scene-graphs. While scene-graphs have great potential for machine vision applications, learning scene-graph based models is challenging. One reason is the complexity of the graph representation, and the other is the lack of large scale data for training broad coverage graphs. In this work we propose a way of addressing these difficulties, via the concept of a Latent Scene Graph. We describe a family of models that uses “scene-graph like” representations, and uses them in downstream tasks. Furthermore, we show how these representations can be trained from partial supervision. Finally, we show how our approach can be used to achieve new state of the art results on the challenging problem of referring relationships.
http://arxiv.org/abs/1902.10200
We study the problem of learning representations of entities and relations in knowledge graphs for predicting missing links. The success of such a task heavily relies on the ability of modeling and inferring the patterns of (or between) the relations. In this paper, we present a new approach for knowledge graph embedding called RotatE, which is able to model and infer various relation patterns including: symmetry/antisymmetry, inversion, and composition. Specifically, the RotatE model defines each relation as a rotation from the source entity to the target entity in the complex vector space. In addition, we propose a novel self-adversarial negative sampling technique for efficiently and effectively training the RotatE model. Experimental results on multiple benchmark knowledge graphs show that the proposed RotatE model is not only scalable, but also able to infer and model various relation patterns, and it significantly outperforms existing state-of-the-art models for link prediction.
http://arxiv.org/abs/1902.10197
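The rotation idea above can be sketched directly: entities are complex vectors, a relation is an element-wise rotation $e^{i\phi}$, and a triple scores high when rotating the head lands on the tail. The L1 distance follows the common setup for this model family; the embedding details here are illustrative:

```python
import numpy as np

def rotate_score(head, phase, tail):
    """RotatE-style scoring sketch: rotate the complex head embedding by
    the relation's element-wise phases and measure the (negative) L1
    distance to the tail embedding. Higher (closer to 0) is better."""
    return -float(np.linalg.norm(head * np.exp(1j * phase) - tail, ord=1))
```

Because rotations compose by adding phases, relation composition (and inversion, via negated phases) falls out of this parameterization, which is the abstract's point about inferable relation patterns.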
Localization in challenging, natural environments such as forests or woodlands is an important capability for many applications, from guiding a robot navigating along a forest trail to monitoring vegetation growth with handheld sensors. In this work we explore laser-based localization in both urban and natural environments, which is suitable for online applications. We propose a deep learning approach capable of learning meaningful descriptors directly from 3D point clouds by comparing triplets (anchor, positive and negative examples). The approach learns a feature-space representation for a set of segmented point clouds that are matched between current and previous observations. Our learning method is tailored towards loop closure detection, resulting in a small model which can be deployed using only a CPU. The proposed learning method would allow the full pipeline to run on robots with limited computational payload such as drones, quadrupeds or UGVs.
http://arxiv.org/abs/1902.10194
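Triplet comparison as described above is typically trained with a margin loss on the descriptor vectors: the positive (same place) must be closer to the anchor than the negative (different place) by at least a margin. The margin value below is a typical choice, not necessarily the paper's:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard triplet margin loss on segment descriptors: zero once
    the positive is closer to the anchor than the negative by `margin`,
    positive otherwise (driving the embedding to separate them)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(0.0, d_pos - d_neg + margin))
```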
While idiosyncrasies of the Chinese classifier system have been a richly studied topic among linguists (Adams and Conklin, 1973; Erbaugh, 1986; Lakoff, 1986), not much work has been done to quantify them with statistical methods. In this paper, we introduce an information-theoretic approach to measuring idiosyncrasy; we examine how much the uncertainty in Mandarin Chinese classifiers can be reduced by knowing semantic information about the nouns that the classifiers modify. Using the empirical distribution of classifiers from the parsed Chinese Gigaword corpus (Graff et al., 2005), we find that more information (in bits) about classifiers can be gleaned from knowing nouns than from knowing sets of noun synonyms or adjectives that modify the same noun. We investigate whether semantic classes of nouns and adjectives differ in how much they reduce uncertainty in classifier choice, and find that it is not fully idiosyncratic; while there are no obvious trends for the majority of semantic classes, shape nouns greatly reduce uncertainty in classifier choice.
http://arxiv.org/abs/1902.10193
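The uncertainty-reduction measure in the abstract above is the mutual information $H(C) - H(C \mid N)$ between classifiers and nouns, estimable from pair counts. The sketch below computes it on a toy sample; the example pairs are invented for illustration, not drawn from the Gigaword corpus:

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of an empirical count distribution."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def classifier_uncertainty_reduction(pairs):
    """Bits of uncertainty about the classifier removed by knowing the
    noun: H(C) - H(C|N), estimated from (noun, classifier) pair counts."""
    marginal = Counter(c for _, c in pairs)
    h_c = entropy(marginal)
    by_noun = {}
    for n, c in pairs:
        by_noun.setdefault(n, Counter())[c] += 1
    total = len(pairs)
    # Conditional entropy: noun-weighted entropy of each noun's classifiers.
    h_c_given_n = sum(sum(cnt.values()) / total * entropy(cnt)
                      for cnt in by_noun.values())
    return h_c - h_c_given_n
```

When each noun fully determines its classifier, knowing the noun removes all the uncertainty, so the measure equals $H(C)$; idiosyncratic noun-classifier pairings push it below that ceiling.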