The voice mode of the Opus audio coder can compress wideband speech at bit rates ranging from 6 kb/s to 40 kb/s. However, Opus is at its core a waveform matching coder, and as the rate drops below 10 kb/s, quality degrades quickly. As the rate reduces even further, parametric coders tend to perform better than waveform coders. In this paper we propose a backward-compatible way of improving low bit rate Opus quality by re-synthesizing speech from the decoded parameters. We compare two different neural generative models, WaveNet and LPCNet. WaveNet is a powerful, high-complexity, and high-latency architecture that is not feasible for a practical system, yet provides a best known achievable quality with generative models. LPCNet is a low-complexity, low-latency RNN-based generative model, and practically implementable on mobile phones. We apply these systems with parameters from Opus coded at 6 kb/s as conditioning features for the generative models. Listening tests show that for the same 6 kb/s Opus bit stream, synthesized speech using LPCNet clearly outperforms the output of the standard Opus decoder. This opens up ways to improve the quality of existing speech and audio waveform coders without breaking compatibility.
http://arxiv.org/abs/1905.04628
Recent end-to-end task oriented dialog systems use memory architectures to incorporate external knowledge in their dialogs. Current work makes simplifying assumptions about the structure of the knowledge base, such as the use of triples to represent knowledge, and combines dialog utterances (context) as well as knowledge base (KB) results as part of the same memory. This causes an explosion in the memory size, and makes the reasoning over memory harder. In addition, such a memory design forces hierarchical properties of the data to be fit into a triple structure of memory. This requires the memory reader to infer relationships across otherwise connected attributes. In this paper we relax the strong assumptions made by existing architectures and separate memories used for modeling dialog context and KB results. Instead of using triples to store KB results, we introduce a novel multi-level memory architecture consisting of cells for each query and their corresponding results. The multi-level memory first addresses queries, followed by results and finally each key-value pair within a result. We conduct detailed experiments on three publicly available task oriented dialog data sets and we find that our method conclusively outperforms current state-of-the-art models. We report a 15-25% increase in both entity F1 and BLEU scores.
http://arxiv.org/abs/1810.10647
In this paper, we address the hyperspectral image (HSI) classification task with a generative adversarial network and conditional random field (GAN-CRF) -based framework, which integrates a semi-supervised deep learning and a probabilistic graphical model, and make three contributions. First, we design four types of convolutional and transposed convolutional layers that consider the characteristics of HSIs to help with extracting discriminative features from limited numbers of labeled HSI samples. Second, we construct semi-supervised GANs to alleviate the shortage of training samples by adding labels to them and implicitly reconstructing real HSI data distribution through adversarial training. Third, we build dense conditional random fields (CRFs) on top of the random variables that are initialized to the softmax predictions of the trained GANs and are conditioned on HSIs to refine classification maps. This semi-supervised framework leverages the merits of discriminative and generative models through a game-theoretical approach. Moreover, even though we used very small numbers of labeled training HSI samples from the two most challenging and extensively studied datasets, the experimental results demonstrated that spectral-spatial GAN-CRF (SS-GAN-CRF) models achieved top-ranking accuracy for semi-supervised HSI classification.
http://arxiv.org/abs/1905.04621
Tree-based machine learning models such as random forests, decision trees, and gradient boosted trees are the most popular non-linear predictive models used in practice today, yet comparatively little attention has been paid to explaining their predictions. Here we significantly improve the interpretability of tree-based models through three main contributions: 1) The first polynomial time algorithm to compute optimal explanations based on game theory. 2) A new type of explanation that directly measures local feature interaction effects. 3) A new set of tools for understanding global model structure based on combining many local explanations of each prediction. We apply these tools to three medical machine learning problems and show how combining many high-quality local explanations allows us to represent global structure while retaining local faithfulness to the original model. These tools enable us to i) identify high magnitude but low frequency non-linear mortality risk factors in the general US population, ii) highlight distinct population sub-groups with shared risk characteristics, iii) identify non-linear interaction effects among risk factors for chronic kidney disease, and iv) monitor a machine learning model deployed in a hospital by identifying which features are degrading the model’s performance over time. Given the popularity of tree-based machine learning models, these improvements to their interpretability have implications across a broad set of domains.
http://arxiv.org/abs/1905.04610
Most objects in the visual world are partially occluded, but humans can recognize them without difficulty. However, it remains unknown whether object recognition models like convolutional neural networks (CNNs) can handle real-world occlusion. It is also a question whether efforts to make these models robust to constant mask occlusion are effective for real-world occlusion. We test both humans and the above-mentioned computational models in a challenging task of object recognition under extreme occlusion, where target objects are heavily occluded by irrelevant real objects in real backgrounds. Our results show that human vision is very robust to extreme occlusion while CNNs are not, even with modifications to handle constant mask occlusion. This implies that the ability to handle constant mask occlusion does not entail robustness to real-world occlusion. As a comparison, we propose another computational model that utilizes object parts/subparts in a compositional manner to build robustness to occlusion. This performs significantly better than CNN-based models on our task with error patterns similar to humans. These findings suggest that testing under extreme occlusion can better reveal the robustness of visual recognition, and that the principle of composition can encourage such robustness.
http://arxiv.org/abs/1905.04598
Graph Neural Nets (GNNs) have received increasing attentions, partially due to their superior performance in many node and graph classification tasks. However, there is a lack of understanding on what they are learning and how sophisticated the learned graph functions are. In this work, we first propose Graph Feature Network (GFN), a simple lightweight neural net defined on a set of graph augmented features. We then propose a dissection of GNNs on graph classification into two parts: 1) the graph filtering, where graph-based neighbor aggregations are performed, and 2) the set function, where a set of hidden node features are composed for prediction. To test the importance of these two parts separately, we prove and leverage the connection that GFN can be derived by linearizing graph filtering part of GNN. Empirically we perform evaluations on common graph classification benchmarks. To our surprise, we find that, despite the simplification, GFN could match or exceed the best accuracies produced by recently proposed GNNs, with a fraction of computation cost. Our results provide new perspectives on both the functions that GNNs learned and the current benchmarks for evaluating them.
http://arxiv.org/abs/1905.04579
We propose a deep autoencoder with graph topology inference and filtering to achieve compact representations of unorganized 3D point clouds in an unsupervised manner. The encoder of the proposed networks adopts similar architectures as in PointNet, which is a well-acknowledged method for supervised learning of 3D point clouds. The decoder of the proposed networks involves three novel modules: the folding module, the graph-topology-inference module, and the graph-filtering module. The folding module folds a canonical 2D lattice to the underlying surface of a 3D point cloud, achieving coarse reconstruction; the graph-topology-inference module learns a graph topology to represent pairwise relationships between 3D points; and the graph-filtering module designs graph filters based on the learnt graph topology to obtain the refined reconstruction. We further provide theoretical analyses of the proposed architecture. We provide an upper bound for the reconstruction loss and further show the superiority of graph smoothness over spatial smoothness as a prior to model 3D point clouds. In the experiments, we validate the proposed networks in three tasks, including reconstruction, visualization, and classification. The experimental results show that (1) the proposed networks outperform the state-of-the-art methods in various tasks, including reconstruction and transfer classification; (2) a graph topology can be inferred as auxiliary information without specific supervision on graph topology inference; and (3) graph filtering refines the reconstruction, leading to better performances.
http://arxiv.org/abs/1905.04571
It has been argued that semantic categories across languages reflect pressure for efficient communication. Recently, this idea has been cast in terms of a general information-theoretic principle of efficiency, the Information Bottleneck (IB) principle, and it has been shown that this principle accounts for the emergence and evolution of named color categories across languages, including soft structure and patterns of inconsistent naming. However, it is not yet clear to what extent this account generalizes to semantic domains other than color. Here we show that it generalizes to two qualitatively different semantic domains: names for containers, and for animals. First, we show that container naming in Dutch and French is near-optimal in the IB sense, and that IB broadly accounts for soft categories and inconsistent naming patterns in both languages. Second, we show that a hierarchy of animal categories derived from IB captures cross-linguistic tendencies in the growth of animal taxonomies. Taken together, these findings suggest that fundamental information-theoretic principles of efficient coding may shape semantic categories across languages and across domains.
http://arxiv.org/abs/1905.04562
There are a number of studies about extraction of bottleneck (BN) features from deep neural networks (DNNs)trained to discriminate speakers, pass-phrases and triphone states for improving the performance of text-dependent speaker verification (TD-SV). However, a moderate success has been achieved. A recent study [1] presented a time contrastive learning (TCL) concept to explore the non-stationarity of brain signals for classification of brain states. Speech signals have similar non-stationarity property, and TCL further has the advantage of having no need for labeled data. We therefore present a TCL based BN feature extraction method. The method uniformly partitions each speech utterance in a training dataset into a predefined number of multi-frame segments. Each segment in an utterance corresponds to one class, and class labels are shared across utterances. DNNs are then trained to discriminate all speech frames among the classes to exploit the temporal structure of speech. In addition, we propose a segment-based unsupervised clustering algorithm to re-assign class labels to the segments. TD-SV experiments were conducted on the RedDots challenge database. The TCL-DNNs were trained using speech data of fixed pass-phrases that were excluded from the TD-SV evaluation set, so the learned features can be considered phrase-independent. We compare the performance of the proposed TCL bottleneck (BN) feature with those of short-time cepstral features and BN features extracted from DNNs discriminating speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels and boundaries are generated by three different automatic speech recognition (ASR) systems. Experimental results show that the proposed TCL-BN outperforms cepstral features and speaker+pass-phrase discriminant BN features, and its performance is on par with those of ASR derived BN features. Moreover,….
http://arxiv.org/abs/1905.04554
While deep neural networks extract rich features from the input data, the current trade-off between depth and computational cost makes it difficult to adopt deep neural networks for many industrial applications, especially when computing power is limited. Here, we are inspired by the idea that, while deeper embeddings are needed to discriminate difficult samples (i.e., fine-grained discrimination), a large number of samples can be well discriminated via much shallower embeddings (i.e., coarse-grained discrimination). In this study, we introduce the simple yet effective concept of decision gates (d-gate), modules trained to decide whether a sample needs to be projected into a deeper embedding or if an early prediction can be made at the d-gate, thus enabling the computation of dynamic representations at different depths. The proposed d-gate modules can be integrated with any deep neural network and reduces the average computational cost of the deep neural networks while maintaining modeling accuracy. The proposed d-gate framework is examined via different network architectures and datasets, with experimental results showing that leveraging the proposed d-gate modules led to a ~43% speed-up and 44% FLOPs reduction on ResNet-101 and 55% speed-up and 39% FLOPs reduction on DenseNet-201 trained on the CIFAR10 dataset with only ~2% drop in accuracy. Furthermore, experiments where d-gate modules are integrated into ResNet-101 trained on the ImageNet dataset demonstrate that it is possible to reduce the computational cost of the network by 1.5 GFLOPs without any drop in the modeling accuracy.
http://arxiv.org/abs/1811.01476
There is a growing interest in using generative adversarial networks (GANs) to produce image content that is indistinguishable from real images as judged by a typical person. A number of GAN variants for this purpose have been proposed, however, evaluating GANs performance is inherently difficult because current methods for measuring the quality of their output are not always consistent with what a human perceives. We propose a novel approach that combines a brain-computer interface (BCI) with GANs to generate a measure we call Neuroscore, which closely mirrors the behavioral ground truth measured from participants tasked with discerning real from synthetic images. This technique we call a neuro-AI interface, as it provides an interface between a human’s neural systems and an AI process. In this paper, we first compare the three most widely used metrics in the literature for evaluating GANs in terms of visual quality and compare their outputs with human judgments. Secondly we propose and demonstrate a novel approach using neural signals and rapid serial visual presentation (RSVP) that directly measures a human perceptual response to facial production quality, independent of a behavioral response measurement. The correlation between our proposed Neuroscore and human perceptual judgments has Pearson correlation statistics: $\mathrm{r}(48) = -0.767, \mathrm{p} = 2.089e-10$. We also present the bootstrap result for the correlation i.e., $\mathrm{p}\leq 0.0001$. Results show that our Neuroscore is more consistent with human judgment compared to the conventional metrics we evaluated. We conclude that neural signals have potential applications for high quality, rapid evaluation of GANs in the context of visual image synthesis.
http://arxiv.org/abs/1811.04172
In this paper, we present a time-contrastive learning (TCL) based bottleneck (BN)feature extraction method for speech signals with an application to text-dependent (TD) speaker verification (SV). It is well-known that speech signals exhibit quasi-stationary behavior in and only in a short interval, and the TCL method aims to exploit this temporal structure. More specifically, it trains deep neural networks (DNNs) to discriminate temporal events obtained by uniformly segmenting speech signals, in contrast to existing DNN based BN feature extraction methods that train DNNs using labeled data to discriminate speakers or pass-phrases or phones or a combination of them. In the context of speaker verification, speech data of fixed pass-phrases are used for TCL-BN training, while the pass-phrases used for TCL-BN training are excluded from being used for SV, so that the learned features can be considered generic. The method is evaluated on the RedDots Challenge 2016 database. Experimental results show that TCL-BN is superior to the existing speaker and pass-phrase discriminant BN features and the Mel-frequency cepstral coefficient feature for text-dependent speaker verification.
http://arxiv.org/abs/1704.02373
It is challenging to disentangle an object into two orthogonal spaces of content and style since each can influence the visual observation differently and unpredictably. It is rare for one to have access to a large number of data to help separate the influences. In this paper, we present a novel framework to learn this disentangled representation in a completely unsupervised manner. We address this problem in a two-branch Autoencoder framework. For the structural content branch, we project the latent factor into a soft structured point tensor and constrain it with losses derived from prior knowledge. This constraint encourages the branch to distill geometry information. Another branch learns the complementary style information. The two branches form an effective framework that can disentangle object’s content-style representation without any human annotation. We evaluate our approach on four image datasets, on which we demonstrate the superior disentanglement and visual analogy quality both in synthesized and real-world data. We are able to generate photo-realistic images with 256*256 resolution that are clearly disentangled in content and style.
http://arxiv.org/abs/1905.04538
In this letter, we introduce multitask learning to hyperspectral image classification. Deep learning models have achieved promising results on hyperspectral image classification, but their performance highly rely on sufficient labeled samples, which are scarce on hyperspectral images. However, samples from multiple data sets might be sufficient to train one deep learning model, thereby improving its performance. To do so, spectral knowledge is introduced to ensure that the shared features are similar across domains. Four hyperspectral data sets were used in the experiments. We achieved better classification accuracies on three data sets (Pavia University, Indian Pines, and Pavia Center) originally with poor results or simple classification systems and competitive results on Salinas Valley data originally with a complex classification system. Spectral knowledge is useful to prevent the deep network from overfitting when the training samples were scarce. The proposed method successfully utilized samples from multiple data sets to increase its performance.
http://arxiv.org/abs/1905.04535
Recent advances in photographic sensing technologies have made it possible to achieve light detection in terms of a single photon. Photon counting sensors are being increasingly used in many diverse applications. We address the problem of jointly recovering spatial and temporal scene radiance from very few photon counts. Our ConvNet-based scheme effectively combines spatial and temporal information present in measurements to reduce noise. We demonstrate that using our method one can acquire videos at a high frame rate and still achieve good quality signal-to-noise ratio. Experiments show that the proposed scheme performs quite well in different challenging scenarios while the existing approaches are unable to handle them.
http://arxiv.org/abs/1811.02396
This paper addresses the problem of cooperative object transportation for multiple Underwater Vehicle Manipulator Systems (UVMSs) in a constrained workspace with static obstacles, where the coordination relies solely on implicit communication arising from the physical interaction of the robots with the commonly grasped object. We propose a novel distributed leader-follower architecture, where the leading UVMS, which has knowledge of the object’s desired trajectory, tries to achieve the desired tracking behavior via an impedance control law, navigating in this way, the overall formation towards the goal configuration while avoiding collisions with the obstacles. On the other hand, the following UVMSs estimate the object’s desired trajectory via a novel prescribed performance estimation law and implement a similar impedance control law. The feedback relies on each UVMS’s force/torque measurements and no explicit data is exchanged online among the robots. Moreover, the control scheme adopts load sharing among the UVMSs according to their specific payload capabilities. Finally, various simulation studies clarify the proposed method and verify its efficiency.
http://arxiv.org/abs/1905.04531
Most person re-identification (ReID) approaches assume that person images are captured under relatively similar illumination conditions. In reality, long-term person retrieval is common and person images are captured under different illumination conditions at different times across a day. In this situation, the performances of existing ReID models often degrade dramatically. This paper addresses the ReID problem with illumination variations and names it as {\em Illumination-Adaptive Person Re-identification (IA-ReID)}. We propose an Illumination-Identity Disentanglement (IID) network to separate different scales of illuminations apart, while preserving individuals’ identity information. To demonstrate the illumination issue and to evaluate our network, we construct two large-scale simulated datasets with a wide range of illumination variations. Experimental results on the simulated datasets and real-world images demonstrate the effectiveness of the proposed framework.
http://arxiv.org/abs/1905.04525
The problem of multiple class novelty detection is gaining increasing importance due to the large availability of multimedia data and the increasing requirement of the classification models to work in an open set scenario. To this end, novelty detection tries to answer this important question: given a test example should we even try to classify it? In this work, we design a novel deep learning framework, termed Segregation Network, which is trained using the mixup technique. We construct interpolated points using convex combinations of pairs of training data and use our novel loss function for prediction of its constituent classes. During testing, for each input query, mixed samples with the known class prototypes are generated and passed through the proposed network. The output of the network reveals the constituent classes which can be used to determine whether the incoming data is from the known class set or not. Our algorithm is trained using just the data from the known classes and does not require any auxiliary dataset or attributes. Extensive evaluation on two benchmark datasets namely Caltech-256 and Stanford Dogs and comparison with the state-of-the-art justifies the effectiveness of the proposed framework.
http://arxiv.org/abs/1905.04523
Generative models have achieved state-of-the-art performance for the zero-shot learning problem, but they require re-training the classifier every time a new object category is encountered. The traditional semantic embedding approaches, though very elegant, usually do not perform at par with their generative counterparts. In this work, we propose an unified framework termed GenClass, which integrates the generator with the classifier for efficient zero-shot learning, thus combining the representative power of the generative approaches and the elegance of the embedding approaches. End-to-end training of the unified framework not only eliminates the requirement of additional classifier for new object categories as in the generative approaches, but also facilitates the generation of more discriminative and useful features. Extensive evaluation on three standard zero-shot object classification datasets, namely AWA, CUB and SUN shows the effectiveness of the proposed approach. The approach without any modification, also gives state-of-the-art performance for zero-shot action classification, thus showing its generalizability to other domains.
http://arxiv.org/abs/1905.04511
We introduce a novel problem of scene sketch zero-shot learning (SSZSL), which is a challenging task, since (i) different from photo, the gap between common semantic domain (e.g., word vector) and sketch is too huge to exploit common semantic knowledge as the bridge for knowledge transfer, and (ii) compared with single-object sketch, more expressive feature representation for scene sketch is required to accommodate its high-level of abstraction and complexity. To overcome these challenges, we propose a deep embedding model for scene sketch zero-shot learning. In particular, we propose the augmented semantic vector to conduct domain alignment by fusing multi-modal semantic knowledge (e.g., cartoon image, natural image, text description), and adopt attention-based network for scene sketch feature learning. Moreover, we propose a novel distance metric to improve the similarity measure during testing. Extensive experiments and ablation studies demonstrate the benefit of our sketch-specific design.
http://arxiv.org/abs/1905.04510
Recent progress in deep convolutional neural networks (CNNs) have enabled a simple paradigm of architecture design: larger models typically achieve better accuracy. Due to this, in modern CNN architectures, it becomes more important to design models that generalize well under certain resource constraints, e.g. the number of parameters. In this paper, we propose a simple way to improve the capacity of any CNN model having large-scale features, without adding more parameters. In particular, we modify a standard convolutional layer to have a new functionality of channel-selectivity, so that the layer is trained to select important channels to re-distribute their parameters. Our experimental results under various CNN architectures and datasets demonstrate that the proposed new convolutional layer allows new optima that generalize better via efficient resource utilization, compared to the baseline.
http://arxiv.org/abs/1905.04509
This paper describes a possible way to improve computer security by implementing a program which implements the following three features related to a weak notion of artificial consciousness: (partial) self-monitoring, ability to compute the truth of quantifier-free propositions and the ability to communicate with the user. The integrity of the program could be enhanced by using a trusted computing approach, that is to say a hardware module that is at the root of a chain of trust. This paper outlines a possible approach but does not refer to an implementation (which would need further work), but the author believes that an implementation using current processors, a debugger, a monitoring program and a trusted processing module is currently possible.
http://arxiv.org/abs/1905.11807
We propose an alternative and unifying framework for decision-making that, by using quantum mechanics, provides more generalised cognitive and decision models with the ability to represent more information than classical models. This framework can accommodate and predict several cognitive biases reported in Lieder & Griffiths without heavy reliance on heuristics nor on assumptions of the computational resources of the mind.
https://arxiv.org/abs/1905.05176
In this paper, we investigate how to efficiently combine appearance-based methods and personal calibration for video-based gaze estimation. Previous calibration methods have learned an average model for all persons and then adapted it for a specific person. In contrast, we directly learn a model for personalized gaze estimation where the personal variations are modeled as a low-dimensional latent parameter space. During training, the network learns an efficient way to leverage these parameters for estimating gaze. After training, little person-specific data is required to optimize these (few) parameters. In fact, by calibrating only six parameters per person, the accuracy can be improved by a factor of up to 2.5, and reaches the same level as person-specific models. We assess the performance of our method on a large dataset, consisting of high resolution images taken from near-infrared cameras with active illumination. As it turns out, our method is the first appearance-based approach to gaze estimation to reach the same level of accuracy as involved model-based methods. We further demonstrate that it outperforms existing approaches on two datasets with relatively low-resolution images, MPIIGaze and GazeCapture. Finally, we show using a synthetic dataset created using UnityEyes tools, that our method efficiently predicts 3D gaze (origin and direction of gaze) without the need of explicitly 3D-annotated training data.
http://arxiv.org/abs/1807.00664
Deep learning based visual sensing has achieved attractive accuracy but is shown vulnerable to adversarial example attacks. Specifically, once the attackers obtain the deep model, they can construct adversarial examples to mislead the model to yield wrong classification results. Deployable adversarial examples such as small stickers pasted on the road signs and lanes have been shown effective in misleading advanced driver-assistance systems. Many existing countermeasures against adversarial examples build their security on the attackers’ ignorance of the defense mechanisms. Thus, they fall short of following Kerckhoffs’s principle and can be subverted once the attackers know the details of the defense. This paper applies the strategy of moving target defense (MTD) to generate multiple new deep models after system deployment, that will collaboratively detect and thwart adversarial examples. Our MTD design is based on the adversarial examples’ minor transferability to models differing from the one (e.g., the factory-designed model) used for attack construction. The post-deployment quasi-secret deep models significantly increase the bar for the attackers to construct effective adversarial examples. We also apply the technique of serial data fusion with early stopping to reduce the inference time by a factor of up to 5 while maintaining the sensing and defense performance. Extensive evaluation based on three datasets including a road sign image database and a GPU-equipped Jetson embedded computing board shows the effectiveness of our approach.
http://arxiv.org/abs/1905.13148
In this paper, we develop a nonconvex approach to the problem of low-rank and sparse matrix decomposition. In our nonconvex method, we replace the rank function and the $l_{0}$-norm of a given matrix with a non-convex fraction function on the singular values and the elements of the matrix respectively. An alternative direction method of multipliers algorithm is utilized to solve our proposed nonconvex problem with the nonconvex fraction function penalty. Numerical experiments on some low-rank and sparse matrix decomposition problems show that our method performs very well in recovering low-rank matrices which are heavily corrupted by large sparse errors.
http://arxiv.org/abs/1807.01276
As processing power has become more available, more human-like artificial intelligences are created to solve image processing tasks that we are inherently good at. As such we propose a model that estimates depth from a monocular image. Our approach utilizes a combination of structure from motion and stereo disparity. We estimate a pose between the source image and a different viewpoint and a dense depth map and use a simple transformation to reconstruct the image seen from said viewpoint. We can then use the real image at that viewpoint to act as supervision to train out model. The metric chosen for image comparison employs standard L1 and structural similarity and a consistency constraint between depth maps as well as smoothness constraint. We show that similar to human perception utilizing the correlation within the provided data by two different approaches increases the accuracy and outperforms the individual components.
http://arxiv.org/abs/1905.04467
Convolutional neural networks (CNNs) have achieved a great success in face recognition, which unfortunately comes at the cost of massive computation and storage consumption. Many compact face recognition networks are thus proposed to resolve this problem. Triplet loss is effective to further improve the performance of those compact models. However, it normally employs a fixed margin to all the samples, which neglects the informative similarity structures between different identities. In this paper, we propose an enhanced version of triplet loss, named triplet distillation, which exploits the capability of a teacher model to transfer the similarity information to a small model by adaptively varying the margin between positive and negative pairs. Experiments on LFW, AgeDB, and CPLFW datasets show the merits of our method compared to the original triplet loss.
http://arxiv.org/abs/1905.04457
Place recognition is a critical component in robot navigation that enables it to re-establish previously visited locations, and simultaneously use this information to correct the drift incurred in its dead-reckoned estimate. In this work, we develop a self-supervised approach to place recognition in robots. The task of visual loop-closure identification is cast as a metric learning problem, where the labels for positive and negative examples of loop-closures can be bootstrapped using a GPS-aided navigation solution that the robot already uses. By leveraging the synchronization between sensors, we show that we are able to learn an appropriate distance metric for arbitrary real-valued image descriptors (including state-of-the-art CNN models), that is specifically geared for visual place recognition in mobile robots. Furthermore, we show that the newly learned embedding can be particularly powerful in disambiguating visual scenes for the task of vision-based loop-closure identification in mobile robots.
http://arxiv.org/abs/1905.04453
Appearance-based gaze estimation provides relatively unconstrained gaze tracking. However, subject-indepen-dent models achieve limited accuracy partly due to individual variations. To improve estimation, we propose a novel gaze decomposition method and a single gaze point calibration method, motivated by our finding that the inter-subject squared bias exceeds the intra-subject variance for a subject-independent estimator. We decompose the gaze angle into a subject-dependent bias term and a subject-independent difference term between the gaze angle and the bias. The difference term is estimated by a deep convolutional network. For calibration-free tracking, we set the subject-dependent bias term to zero. For single gaze point calibration, we estimate the bias from a few images taken as the subject gazes at a point. Experiments on three datasets indicate that as a calibration-free estimator, the proposed method outperforms the state-of-the-art methods that use single model by up to $10.0\%$. The proposed calibration method is robust and reduces estimation error significantly (up to $35.6\%$), achieving state-of-the-art performance for appearance-based eye trackers with calibration.
http://arxiv.org/abs/1905.04451
Cross-modal hashing has been receiving increasing interests for its low storage cost and fast query speed in multi-modal data retrievals. However, most existing hashing methods are based on hand-crafted or raw level features of objects, which may not be optimally compatible with the coding process. Besides, these hashing methods are mainly designed to handle simple pairwise similarity. The complex multilevel ranking semantic structure of instances associated with multiple labels has not been well explored yet. In this paper, we propose a ranking-based deep cross-modal hashing approach (RDCMH). RDCMH firstly uses the feature and label information of data to derive a semi-supervised semantic ranking list. Next, to expand the semantic representation power of hand-crafted features, RDCMH integrates the semantic ranking information into deep cross-modal hashing and jointly optimizes the compatible parameters of deep feature representations and of hashing functions. Experiments on real multi-modal datasets show that RDCMH outperforms other competitive baselines and achieves the state-of-the-art performance in cross-modal retrieval applications.
http://arxiv.org/abs/1905.04450
Tensor image data sets such as color images and multispectral images are highly correlated and they contain a lot of image details. The main aim of this paper is to propose and develop a regularized tensor completion model for tensor image data completion. In the objective function, we adopt the newly emerged tensor nuclear norm (TNN) to characterize the global structure of such tensor image data sets. Also, we formulate an implicit regularizer to plug in the convolutional neural network (CNN) denoiser, which is convinced to express the image prior learned from a large amount of natural images. The resulting model can be solved efficiently via an alternating directional method of multipliers algorithm. Experimental results (on color images, videos, and multispectral images) are presented to show that both image global structure and details can be recovered very well, and to illustrate that the performance of the proposed method is better than that of testing methods in terms of PSNR and SSIM.
http://arxiv.org/abs/1905.04449
While convolutional neural networks (CNN) have achieved impressive performance on various classification/recognition tasks, they typically consist of a massive number of parameters. This results in significant memory requirement as well as computational overheads. Consequently, there is a growing need for filter-level pruning approaches for compressing CNN based models that not only reduce the total number of parameters but reduce the overall computation as well. We present a new min-max framework for filter-level pruning of CNNs. Our framework, called Play and Prune (PP), jointly prunes and fine-tunes CNN model parameters, with an adaptive pruning rate, while maintaining the model’s predictive performance. Our framework consists of two modules: (1) An adaptive filter pruning (AFP) module, which minimizes the number of filters in the model; and (2) A pruning rate controller (PRC) module, which maximizes the accuracy during pruning. Moreover, unlike most previous approaches, our approach allows directly specifying the desired error tolerance instead of pruning level. Our compressed models can be deployed at run-time, without requiring any special libraries or hardware. Our approach reduces the number of parameters of VGG-16 by an impressive factor of 17.5X, and number of FLOPS by 6.43X, with no loss of accuracy, significantly outperforming other state-of-the-art filter pruning methods.
http://arxiv.org/abs/1905.04446
The ability to estimate task difficulty is critical for many real-world decisions such as setting appropriate goals for ourselves or appreciating others’ accomplishments. Here we give a computational account of how humans judge the difficulty of a range of physical construction tasks (e.g., moving 10 loose blocks from their initial configuration to their target configuration, such as a vertical tower) by quantifying two key factors that influence construction difficulty: physical effort and physical risk. Physical effort captures the minimal work needed to transport all objects to their final positions, and is computed using a hybrid task-and-motion planner. Physical risk corresponds to stability of the structure, and is computed using noisy physics simulations to capture the costs for precision (e.g., attention, coordination, fine motor movements) required for success. We show that the full effort-risk model captures human estimates of difficulty and construction time better than either component alone.
http://arxiv.org/abs/1905.04445
We investigate the problem of optimally assigning a large number of robots (or other types of autonomous agents) to guard the perimeters of closed 2D regions, where the perimeter of each region to be guarded may contain multiple disjoint polygonal chains. Each robot is responsible for guarding a subset of a perimeter and any point on a perimeter must be guarded by some robot. In allocating the robots, the main objective is to minimize the maximum 1D distance to be covered by any robot along the boundary of the regions. For this optimization problem which we call optimal perimeter guarding (OPG), thorough structural analysis is performed, which is then exploited to develop fast exact algorithms that run in guaranteed low polynomial time. In addition to formal analysis and proofs, experimental evaluations and simulations are performed that further validate the correctness and effectiveness of our algorithmic results.
http://arxiv.org/abs/1905.04434
Multi-view subspace clustering aims to divide a set of multisource data into several groups according to their underlying subspace structure. Although the spectral clustering based methods achieve promotion in multi-view clustering, their utility is limited by the separate learning manner in which affinity matrix construction and cluster indicator estimation are isolated. In this paper, we propose to jointly learn the self-representation, continue and discrete cluster indicators in an unified model. Our model can explore the subspace structure of each view and fusion them to facilitate clustering simultaneously. Experimental results on two benchmark datasets demonstrate that our method outperforms other existing competitive multi-view clustering methods.
http://arxiv.org/abs/1905.04432
Activity recognition in shopping environments is an important and challenging computer vision task. We introduce a framework for integrating human body pose and object motion to both temporally detect and classify the activities in a fine-grained manner (very short and similar activities). We achieve this by proposing a multi-stream recurrent convolutional neural network architecture guided by the spatiotemporal \emph{attention} mechanism for both activity recognition and detection. To this end, in the absence of accurate pose supervision, we incorporate generative adversarial networks (GANs) to generate candidate body joints. Additionally, based on the intuition that complex actions demand more than one source of information to be precisely identified even by humans, we integrate the second stream of the object motion to our network that acts as a prior knowledge which we quantitatively show improves the results. Furthermore, we empirically show the capabilities of our approach by achieving state-of-the-art results on MERL shopping dataset. Finally, we further investigate the effectiveness of this approach on a new shopping dataset that we have collected to address existing shortcomings in this area including but not limited to lack of training data.
http://arxiv.org/abs/1905.04430
Teleoperation of heavy machinery in industry often requires operators to be in close proximity to the plant and issue commands on a per-actuator level using joystick input devices. However, this is non-intuitive and makes achieving desired job properties a challenging task requiring operators to complete extensive and costly training. Despite this, operator fatigue is common with implications for personal safety, project timeliness, cost, and quality. While full automation is not yet achievable due to unpredictability and the dynamic nature of the environment and task, shared control paradigms allow operators to issue high-level commands in an intuitive, task-informed control space while having the robot optimize for achieving desired job properties. In this paper, we compare a number of modes of teleoperation, exploring both the number of dimensions of the control input as well as the most intuitive control spaces. Our experimental evaluations of the performance metrics were based on quantifying the difficulty of tasks based on the well known Fitts’ law as well as a measure of how well constraints affecting the task performance were met. Our experiments show that higher performance is achieved when humans submit commands in low-dimensional task spaces as opposed to joint space manipulations.
http://arxiv.org/abs/1905.04428
The robotic assembly represents a group of benchmark problems for reinforcement learning and variable compliance control that features sophisticated contact manipulation. One of the key challenges in applying reinforcement learning to physical robot is the sample complexity, the requirement of large amounts of experience for learning. We mitigate this sample complexity problem by incorporating an iteratively refitted model into the learning process through model-guided exploration. Yet, fitting a local model of the physical environment is of major difficulties. In this work, a Kalman filter is used to combine the adaptive linear dynamics with a coarse prior model from analytical description, and proves to give more accurate predictions than the existing method. Experimental results show that the proposed model fitting strategy can be incorporated into a model predictive controller to generate good exploration behaviors for learning acceleration, while preserving the benefits of model-free reinforcement learning for uncertain environments. In addition to the sample complexity, the inevitable robot overloaded during operation also tends to limit the learning efficiency. To address this problem, we present a method to restrict the largest possible potential energy in the compliance control system and therefore keep the contact force within the legitimate range.
http://arxiv.org/abs/1905.04427
Deep learning approaches to cyclone intensity estimationhave recently shown promising results. However, sufferingfrom the extreme scarcity of cyclone data on specific in-tensity, most existing deep learning methods fail to achievesatisfactory performance on cyclone intensity estimation,especially on classes with few instances. To avoid the degra-dation of recognition performance caused by scarce samples,we propose a context-aware CycleGAN which learns the la-tent evolution features from adjacent cyclone intensity andsynthesizes CNN features of classes lacking samples fromunpaired source classes. Specifically, our approach synthe-sizes features conditioned on the learned evolution features,while the extra information is not required. Experimentalresults of several evaluation methods show the effectivenessof our approach, even can predicting unseen classes.
http://arxiv.org/abs/1905.04425
Unsupervised Domain Adaptation (UDA) addresses the problem of performance degradation due to domain shift between training and testing sets, which is common in computer vision applications. Most existing UDA approaches are based on vector-form data although the typical format of data or features in visual applications is multi-dimensional tensor. Besides, current methods, including the deep network approaches, assume that abundant labeled source samples are provided for training. However, the number of labeled source samples are always limited due to expensive annotation cost in practice, making sub-optimal performance been observed. In this paper, we propose to seek discriminative representation for multi-dimensional data by learning a structured dictionary in tensor space. The dictionary separates domain-specific information and class-specific information to guarantee the representation robust to domains. In addition, a pseudo-label estimation scheme is developed to combine with discriminant analysis in the algorithm iteration for avoiding the external classifier design. We perform extensive results on different datasets with limited source samples. Experimental results demonstrates that the proposed method outperforms the state-of-the-art approaches.
http://arxiv.org/abs/1905.04424
Machine learning applied to architecture design presents a promising opportunity with broad applications. Recent deep reinforcement learning (DRL) techniques, in particular, enable efficient exploration in vast design spaces where conventional design strategies may be inadequate. This paper proposes a novel deep reinforcement framework, taking routerless networks-on-chip (NoC) as an evaluation case study. The new framework successfully resolves problems with prior design approaches being either unreliable due to random searches or inflexible due to severe design space restrictions. The framework learns (near-)optimal loop placement for routerless NoCs with various design constraints. A deep neural network is developed using parallel threads that efficiently explore the immense routerless NoC design space with a Monte Carlo search tree. Experimental results show that, compared with conventional mesh, the proposed deep reinforcement learning (DRL) routerless design achieves a 3.25x increase in throughput, 1.6x reduction in packet latency, and 5x reduction in power. Compared with the state-of-the-art routerless NoC, DRL achieves a 1.47x increase in throughput, 1.18x reduction in packet latency, and 1.14x reduction in average hop count albeit with slightly more power overhead.
http://arxiv.org/abs/1905.04423
Controlled natural languages (CNLs) are effective languages for knowledge representation and reasoning. They are designed based on certain natural languages with restricted lexicon and grammar. CNLs are unambiguous and simple as opposed to their base languages. They preserve the expressiveness and coherence of natural languages. In this report, we focus on a class of CNLs, called machine-oriented CNLs, which have well-defined semantics that can be deterministically translated into formal languages, such as Prolog, to do logical reasoning. Over the past 20 years, a number of machine-oriented CNLs emerged and have been used in many application domains for problem solving and question answering. However, few of them support non-monotonic inference. In our work, we propose non-monotonic extensions of CNL to support defeasible reasoning. In the first part of this report, we survey CNLs and compare three influential systems: Attempto Controlled English (ACE), Processable English (PENG), and Computer-processable English (CPL). We compare their language design, semantic interpretations, and reasoning services. In the second part of this report, we first identify typical non-monotonicity in natural languages, such as defaults, exceptions and conversational implicatures. Then, we propose their representation in CNL and the corresponding formalizations in a form of defeasible reasoning known as Logic Programming with Defaults and Argumentation Theory (LPDA).
http://arxiv.org/abs/1905.04422
With the emergence of lenslet light field cameras able to capture rich spatio-angular information from multiple directions, new frontiers in visual recognition performance have been opened. Since multiple 2D viewpoint images can be rendered from a light field, those multiple images, or descriptions extracted from them, can be organized as a pseudo-video sequence so that a LSTM network learns a model describing that sequence. This paper proposes three novel LSTM cell architectures able to create richer and more effective description models for visual recognition tasks, by jointly learning from two sequences simultaneously acquired. The novel key idea is to jointly process two sequences of rendered 2D images or their descriptions, e.g. representing the scene horizontal and vertical parallaxes, and thus with some specific dependency between them, that would not be exploited otherwise. To show the efficiency of the novel LSTM cell architectures, these architectures have been integrated into an end-to-end deep learning face recognition framework, which creates this join spatio-angular light field description. The LSTM network, using the proposed LSTM cell architectures, receives as input a sequence of VGG-Face descriptions computed for parallax related, horizontal and vertical 2D face viewpoint images, derived from the input light field image. A comprehensive evaluation in terms of recognition accuracy, computational complexity, memory efficiency, and parallelization ability has been performed with the IST EURECOM LFFD database using three new and challenging evaluation protocols. The obtained results show the superior performance of the proposed face recognition solutions adopting the novel LSTM cell architectures over ten state-of-the-art benchmarking recognition solutions.
http://arxiv.org/abs/1905.04421
Acoustic data provide scientific and engineering insights in fields ranging from biology and communications to ocean and Earth science. We survey the recent advances and transformative potential of machine learning (ML), including deep learning, in the field of acoustics. ML is a broad family of statistical techniques for automatically detecting and utilizing patterns in data. Relative to conventional acoustics and signal processing, ML is data-driven. Given sufficient training data, ML can discover complex relationships between features. With large volumes of training data, ML can discover models describing complex acoustic phenomena such as human speech and reverberation. ML in acoustics is rapidly developing with compelling results and significant future promise. We first introduce ML, then highlight ML developments in five acoustics research areas: source localization in speech processing, source localization in ocean acoustics, bioacoustics, seismic exploration, and environmental sounds in everyday scenes.
http://arxiv.org/abs/1905.04418
While motion planning approaches for automated driving often focus on safety and mathematical optimality with respect to technical parameters, they barely consider convenience, perceived safety for the passenger and comprehensibility for other traffic participants. For automated driving in mixed traffic, however, this is key to reach public acceptance. In this paper, we revise the problem statement of motion planning in mixed traffic: Instead of largely simplifying the motion planning problem to a convex optimization problem, we keep a more complex probabilistic multi agent model and strive for a near optimal solution. We assume cooperation of other traffic participants, yet being aware of violations of this assumption. This approach yields solutions that are provably safe in all situations, and convenient and comprehensible in situations that are also unambiguous for humans. Thus, it outperforms existing approaches in mixed traffic scenarios, as we show in simulation.
http://arxiv.org/abs/1805.05374
Planning for robotic manipulation requires reasoning about the changes a robot can affect on objects. When such interactions can be modelled analytically, as in domains with rigid objects, efficient planning algorithms exist. However, in both domestic and industrial domains, the objects of interest can be soft, or deformable, and hard to model analytically. For such cases, we posit that a data-driven modelling approach is more suitable. In recent years, progress in deep generative models has produced methods that learn to `imagine’ plausible images from data. Building on the recent Causal InfoGAN generative model, in this work we learn to imagine goal-directed object manipulation directly from raw image data of self-supervised interaction of the robot with the object. After learning, given a goal observation of the system, our model can generate an imagined plan – a sequence of images that transition the object into the desired goal. To execute the plan, we use it as a reference trajectory to track with a visual servoing controller, which we also learn from the data as an inverse dynamics model. In a simulated manipulation task, we show that separating the problem into visual planning and visual tracking control is more sample efficient and more interpretable than alternative data-driven approaches. We further demonstrate our approach on learning to imagine and execute in 3 environments, the final of which is deformable rope manipulation on a PR2 robot.
http://arxiv.org/abs/1905.04411
The cloud-based speech recognition/API provides developers or enterprises an easy way to create speech-enabled features in their applications. However, sending audios about personal or company internal information to the cloud, raises concerns about the privacy and security issues. The recognition results generated in cloud may also reveal some sensitive information. This paper proposes a deep polynomial network (DPN) that can be applied to the encrypted speech as an acoustic model. It allows clients to send their data in an encrypted form to the cloud to ensure that their data remains confidential, at mean while the DPN can still make frame-level predictions over the encrypted speech and return them in encrypted form. One good property of the DPN is that it can be trained on unencrypted speech features in the traditional way. To keep the cloud away from the raw audio and recognition results, a cloud-local joint decoding framework is also proposed. We demonstrate the effectiveness of model and framework on the Switchboard and Cortana voice assistant tasks with small performance degradation and latency increased comparing with the traditional cloud-based DNNs.
https://arxiv.org/abs/1905.05605
Solving grounded language tasks often requires reasoning about relationships between objects in the context of a given task. For example, to answer the question What color is the mug on the plate?'' we must check the color of the specific mug that satisfies the
on’’ relationship with respect to the plate. Recent work has proposed various methods capable of complex relational reasoning. However, most of their power is in the inference structure, while the scene is represented with simple local appearance features. In this paper, we take an alternate approach and build contextualized representations for objects in a visual scene to support relational reasoning. We propose a general framework of Language-Conditioned Graph Networks (LCGN), where each node represents an object, and is described by a context-aware representation from related objects through iterative message passing conditioned on the textual input. E.g., conditioning on the on'' relationship to the plate, the object
mug’’ gathers messages from the object plate'' to update its representation to
mug on the plate’’, which can be easily consumed by a simple classifier for answer prediction. We experimentally show that our LCGN approach effectively supports relational reasoning and improves performance across several tasks and datasets.
http://arxiv.org/abs/1905.04405
We propose a method for learning embeddings for few-shot learning that is suitable for use with any number of ways and any number of shots (shot-free). Rather than fixing the class prototypes to be the Euclidean average of sample embeddings, we allow them to live in a higher-dimensional space (embedded class models) and learn the prototypes along with the model parameters. The class representation function is defined implicitly, which allows us to deal with a variable number of shots per each class with a simple constant-size architecture. The class embedding encompasses metric learning, that facilitates adding new classes without crowding the class representation space. Despite being general and not tuned to the benchmark, our approach achieves state-of-the-art performance on the standard few-shot benchmark datasets.
http://arxiv.org/abs/1905.04398