Capsule networks are a recently proposed alternative for constructing neural networks, and early indications suggest that they can provide greater generalisation capacity using fewer parameters. In capsule networks, scalar neurons are replaced with capsule vectors or matrices, whose entries represent different properties of objects. The relationships between objects and their parts are learned via trainable viewpoint-invariant transformation matrices, and the presence of a given object is decided by the level of agreement among votes from its parts. This interaction, which occurs between capsule layers, is a process called routing-by-agreement. Although promising, capsule networks remain underexplored by the community, and in this paper we present a new capsule routing algorithm based on Variational Bayes for a mixture of transforming Gaussians. Our Bayesian approach addresses some of the inherent weaknesses of EM routing, such as 'variance collapse', by modelling uncertainty over the capsule parameters in addition to the routing assignment posterior probabilities. We test our method on public-domain datasets and surpass the state-of-the-art performance on smallNORB using 50% fewer capsules.
https://arxiv.org/abs/1905.11455
We propose differentiable quantization (DQ) for efficient deep neural network (DNN) inference where gradient descent is used to learn the quantizer’s step size, dynamic range and bitwidth. Training with differentiable quantizers brings two main benefits: first, DQ does not introduce hyperparameters; second, we can learn for each layer a different step size, dynamic range and bitwidth. Our experiments show that DNNs with heterogeneous and learned bitwidth yield better performance than DNNs with a homogeneous one. Further, we show that there is one natural DQ parametrization especially well suited for training. We confirm our findings with experiments on CIFAR-10 and ImageNet and we obtain quantized DNNs with learned quantization parameters achieving state-of-the-art performance.
https://arxiv.org/abs/1905.11452
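For intuition, here is a minimal PyTorch sketch of a quantizer whose step size and bitwidth receive gradients via the straight-through estimator. This is not the authors' parametrization; the class name, the symmetric signed range, and the continuous treatment of the bitwidth are illustrative assumptions.

```python
import torch

def ste_round(x):
    # Straight-through estimator: round in the forward pass,
    # pass the gradient through unchanged in the backward pass.
    return (x.round() - x).detach() + x

class LearnedQuantizer(torch.nn.Module):
    """Uniform quantizer whose step size and bitwidth are learned by
    gradient descent (illustrative; the paper's exact parametrization
    and its handling of the integer bitwidth differ)."""
    def __init__(self, init_step=0.05, init_bits=8.0):
        super().__init__()
        self.log_step = torch.nn.Parameter(torch.tensor(init_step).log())
        self.bits = torch.nn.Parameter(torch.tensor(init_bits))

    def forward(self, x):
        step = self.log_step.exp()               # positive step size
        qmax = 2.0 ** (self.bits - 1.0) - 1.0    # signed dynamic range
        x = torch.maximum(torch.minimum(x / step, qmax), -qmax)
        return ste_round(x) * step               # snap to the learned grid
```

Because the clamp depends on `bits` and the grid on `step`, both parameters pick up gradients from the task loss, which is the mechanism that lets each layer settle on its own bitwidth.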
We describe our submitted system for the ZeroSpeech Challenge 2019. The current challenge theme addresses the difficulty of constructing a speech synthesizer without any text or phonetic labels and requires a system that can (1) discover subword units in an unsupervised way, and (2) synthesize speech with a target speaker's voice. Moreover, the system should balance the ABX discrimination score, the bit rate (compression rate), and the naturalness and intelligibility of the constructed voice. To tackle these problems and achieve the best trade-off, we utilize a vector quantized variational autoencoder (VQ-VAE) and a multi-scale codebook-to-spectrogram (Code2Spec) inverter trained with mean square error and adversarial loss. The VQ-VAE encodes the speech into a latent space, maps it to the nearest codebook entry, and produces a compressed representation. Next, the inverter generates a magnitude spectrogram in the target voice, given the codebook vectors from the VQ-VAE. In our experiments, we also investigated several other clustering algorithms, including K-Means and GMM, and compared them with the VQ-VAE results on ABX scores and bit rates. Our proposed approach significantly improved the intelligibility (in CER), the MOS, and the ABX discrimination scores compared to the official ZeroSpeech 2019 baseline and even the topline.
https://arxiv.org/abs/1905.11449
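As a rough sketch of the quantisation step described above (illustrative, not the submission's code): each encoder output frame is snapped to its nearest codebook entry, and the resulting discrete indices are what the bit rate is measured on.

```python
import numpy as np

def vq_nearest(z, codebook):
    """Snap each latent frame to its nearest codebook entry.
    z: (T, D) encoder outputs; codebook: (K, D) learned codes.
    Returns the quantised latents and the discrete indices whose
    entropy determines the bit rate."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx
```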
In computer vision and image processing, image fusion has evolved into an attractive research field. However, most existing image fusion methods are built on pixel-level operations, which may produce unacceptable artifacts and are time-consuming. In this paper, a symmetric encoder-decoder with a residual block (SEDR) for infrared and visible image fusion is proposed. In the training stage, the SEDR network is trained on a new dataset to obtain a fixed feature extractor. In the fusion stage, the trained model is first used to extract the intermediate features and compensation features of the two source images. Then, the extracted intermediate features are used to generate two attention maps, which are multiplied with the input features for refinement. In addition, the compensation features generated by the first two convolutional layers are merged and passed to the corresponding deconvolutional layers. Finally, the refined features are fused and decoded to reconstruct the final fused image. Experimental results demonstrate that the proposed fusion method (named SEDRFuse) outperforms state-of-the-art fusion methods in terms of both subjective and objective evaluations.
https://arxiv.org/abs/1905.11447
Two networks are equivalent if they produce the same output for any given input. In this paper, we study the possibility of transforming a deep neural network into another network with a different number of units or layers, which can be an equivalent network, a local exact approximation, or a global linear approximation of the original network. On the practical side, we show that certain rectified linear units (ReLUs) can be safely removed from a network if they are always active or inactive for any valid input. If we only need an equivalent network on a smaller domain, then more units can be removed and some layers collapsed. On the theoretical side, we constructively show that any feed-forward ReLU network admits a global linear approximation by a 2-hidden-layer shallow network with a fixed number of units. This result strikes a balance between the growing number of units required for arbitrary approximation with a single layer and the known upper bound of $\lceil \log_2(n_0+1)\rceil + 1$ layers for exact representation, where $n_0$ is the input dimension. While the transformed network may require an exponential number of units to capture the activation patterns of the original network, we show that it can be made substantially smaller by accounting only for the patterns that define linear regions. Based on experiments with ReLU networks on the MNIST dataset, we found that $l_1$-regularization and adversarial training reduce the number of linear regions significantly as the number of stable units increases due to weight sparsity. Therefore, we can also intentionally train ReLU networks to allow for effective lossless compression and approximation.
https://arxiv.org/abs/1905.11428
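A minimal numpy illustration of the practical pruning idea, assuming the stability masks have already been established (e.g., by bounding pre-activations over the valid input domain):

```python
import numpy as np

def remove_inactive_units(W1, b1, W2, keep):
    """Drop hidden units that are provably inactive (ReLU output is 0)
    for every valid input; they contribute nothing to the network."""
    return W1[keep], b1[keep], W2[:, keep]

def collapse_active_layer(W1, b1, W2, b2):
    """If every hidden unit is provably active on the input domain, the
    ReLU is the identity there and the two affine layers compose:
    W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)."""
    return W2 @ W1, W2 @ b1 + b2
```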
FlightGoggles is a photorealistic sensor simulator for perception-driven robotic vehicles. The key contributions of FlightGoggles are twofold. First, FlightGoggles provides photorealistic exteroceptive sensor simulation using graphics assets generated with photogrammetry. Second, it also provides the ability to combine $\textit{(i)}$ synthetic exteroceptive measurements generated $\textit{in silico}$ in real time and $\textit{(ii)}$ vehicle dynamics and proprioceptive measurements generated $\textit{in motio}$ by vehicle(s) in flight in a motion-capture facility. FlightGoggles is capable of simulating a virtual-reality environment around autonomous vehicle(s) in flight. While a vehicle is in flight in the FlightGoggles virtual reality environment, exteroceptive sensors are rendered synthetically in real time while all complex extrinsic dynamics are generated organically through the natural interactions of the vehicle. The FlightGoggles framework allows researchers to accelerate development by circumventing the need to estimate complex and hard-to-model interactions such as aerodynamics, motor mechanics, battery electrochemistry, and behavior of other agents. The ability to perform vehicle-in-the-loop experiments with photorealistic exteroceptive sensor simulation facilitates novel research directions involving, $\textit{e.g.}$, fast and agile autonomous flight in obstacle-rich environments, safe human interaction, and flexible sensor selection. FlightGoggles has been utilized as the main test for selecting nine teams that will advance in the AlphaPilot autonomous drone racing challenge. Subsequently, FlightGoggles has been actively used by the community. We survey approaches and results from the top twenty AlphaPilot teams, which may be of independent interest.
https://arxiv.org/abs/1905.11377
Recent work addressing model reliability and generalization has resulted in a variety of methods that seek to proactively address differences between the training and unknown target environments. While most methods achieve this by finding distributions that will be invariant across environments, we show that they do not necessarily find the same distributions, which has implications for performance. In this paper we unify existing work on prediction using stable distributions by relating environmental shifts to edges in the graph underlying a prediction problem, and characterize stable distributions as those which effectively remove these edges. We then quantify the effect of edge deletion on performance in the linear case and corroborate the findings in simulated and real data experiments.
https://arxiv.org/abs/1905.11374
We tackle the problem of object discovery, where objects are segmented for a given input image, and the system is trained without using any direct supervision whatsoever. A novel copy-pasting GAN framework is proposed, where the generator learns to discover an object in one image by compositing it into another image such that the discriminator cannot tell that the resulting image is fake. After carefully addressing subtle issues, such as preventing the generator from 'cheating', this game results in the generator learning to select objects, as copy-pasting objects is most likely to fool the discriminator. The system is shown to work well on four very different datasets, including large object appearance variations in challenging cluttered backgrounds.
https://arxiv.org/abs/1905.11369
Instance segmentation is an important problem in computer vision, with applications in autonomous driving, drone navigation and robotic manipulation. However, most existing methods are not real-time, complicating their deployment in time-sensitive contexts. In this work, we extend an existing approach to real-time instance segmentation, called 'Straight to Shapes' (STS), which makes use of low-dimensional shape embedding spaces to directly regress to object shape masks. The STS model can run at 35 FPS on a high-end desktop, but its accuracy is significantly worse than that of offline state-of-the-art methods. We leverage recent advances in the design and training of deep instance segmentation models to improve the accuracy of the STS model whilst keeping its real-time capabilities intact. In particular, we find that parameter sharing, more aggressive data augmentation and the use of a structured loss for shape mask prediction all provide a useful boost to network performance. Our proposed approach, 'Straight to Shapes++', achieves a remarkable 19.7-point improvement in mAP (at an IoU of 0.5) over the original method as evaluated on the PASCAL VOC dataset, thus redefining the accuracy frontier at real-time speeds. Since the accuracy of instance segmentation is closely tied to that of object bounding-box prediction, we also study the error profile of the latter and examine the failure modes of our method for future improvements.
https://arxiv.org/abs/1905.11358
Weighted A* (wA*) is a widely used algorithm for rapidly, but suboptimally, solving planning and search problems. The cost of the solution it produces is guaranteed to be at most W times the optimal solution cost, where W is the weight wA* uses in prioritizing open nodes. W is therefore a suboptimality bound for the solution produced by wA*. There is broad consensus that this bound is not very accurate, and that the actual suboptimality of wA*'s solution is often much less than W times optimal. However, there is very little published evidence supporting that view, and no existing explanation of why W is a poor bound. This paper fills in these gaps in the literature. We begin with a large-scale experiment demonstrating that, across a wide variety of domains and heuristics for those domains, W is indeed very often far from the true suboptimality of wA*'s solution. We then analytically identify the potential sources of error. Finally, we present a practical method for correcting for two of these sources of error and experimentally show that the correction frequently eliminates much of the error.
https://arxiv.org/abs/1905.11346
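For reference, a compact Python sketch of wA* itself, whose returned cost is bounded by W times optimal when h is admissible (the paper's bound-correction method is not shown here):

```python
import heapq
from itertools import count

def weighted_astar(start, goal, neighbors, h, W=1.5):
    """Weighted A*: expand nodes in order of f = g + W*h.
    neighbors(n) yields (successor, edge_cost) pairs."""
    tie = count()                      # break f-ties without comparing nodes
    frontier = [(W * h(start), next(tie), 0.0, start)]
    best_g = {start: 0.0}
    while frontier:
        _, _, g, node = heapq.heappop(frontier)
        if node == goal:
            return g                   # cost <= W * optimal
        if g > best_g.get(node, float("inf")):
            continue                   # stale queue entry
        for nxt, cost in neighbors(node):
            ng = g + cost
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(frontier, (ng + W * h(nxt), next(tie), ng, nxt))
    return float("inf")
```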
Deep neural networks have achieved extraordinary results on image classification tasks, but have been shown to be vulnerable to attacks with carefully crafted perturbations of the input data. Although most attacks change the values of many of an image's pixels, it has been shown that deep networks are also vulnerable to sparse alterations of the input. However, no computationally efficient method has been proposed to compute sparse perturbations. In this paper, we exploit the low mean curvature of the decision boundary, and propose SparseFool, a geometry-inspired sparse attack that controls the sparsity of the perturbations. Extensive evaluations show that our approach computes sparse perturbations very quickly, and scales efficiently to high-dimensional data. We further analyze the transferability and the visual effects of the perturbations, and show the existence of shared semantic information across images and networks. Finally, we show that adversarial training can only slightly improve robustness against sparse additive perturbations computed with SparseFool.
http://arxiv.org/abs/1811.02248
Given the increasingly serious air pollution problem, the monitoring of the air quality index (AQI) in urban areas has drawn considerable attention. This paper presents ImgSensingNet, a vision-guided aerial-ground sensing system for fine-grained air quality monitoring and forecasting, using the fusion of haze images taken by an unmanned aerial vehicle (UAV) and AQI data collected by an on-ground three-dimensional (3D) wireless sensor network (WSN). Specifically, ImgSensingNet first leverages computer vision techniques to infer the AQI scale in different regions from the captured haze images, where haze-relevant features and a deep convolutional neural network (CNN) are designed for direct learning between haze images and the corresponding AQI scale. Based on the learnt AQI scale, ImgSensingNet determines whether to wake up on-ground wireless sensors for small-scale AQI monitoring and inference, which can greatly reduce the energy consumption of the system. An entropy-based model is employed for accurate real-time AQI inference at unmeasured locations and for forecasting the future air quality distribution. We have implemented and evaluated ImgSensingNet on two university campuses since Feb. 2018, collecting 17,630 photos and 2.6 million AQI data samples. Experimental results confirm that ImgSensingNet achieves higher inference accuracy while greatly reducing energy consumption, compared to state-of-the-art AQI monitoring approaches.
https://arxiv.org/abs/1905.11299
Optical coherence tomography (OCT) is a non-invasive imaging modality which is widely used in clinical ophthalmology. OCT images are capable of visualizing deep retinal layers, which is crucial for the early diagnosis of retinal diseases. In this paper, we describe a comprehensive open-access database containing more than 500 high-resolution images categorized into different pathological conditions. The image classes include Normal (NO), Macular Hole (MH), Age-related Macular Degeneration (AMD), Central Serous Retinopathy (CSR), and Diabetic Retinopathy (DR). The images were obtained with a raster scan protocol with a 2 mm scan length and a 512x1024 pixel resolution. We have also included 25 normal OCT images with their corresponding ground-truth delineations, which can be used for an accurate evaluation of OCT image segmentation. In addition, we have provided a user-friendly GUI which can be used by clinicians for manual (and semi-automated) segmentation.
http://arxiv.org/abs/1812.07056
Grasp synergies represent a useful idea to reduce grasping complexity without compromising versatility. Synergies describe coordination patterns between joints, either in terms of position (joint angles) or force (joint torques). In both of these cases, a grasp synergy can be represented as a low-dimensional manifold lying in the high-dimensional joint posture or torque space. In this paper, we use the term Mechanically Realizable Manifolds to refer to the subset of such manifolds (in either posture or torque space) that can be achieved via mechanical coupling of the joints in underactuated hands. We present a method to optimize the design parameters of an underactuated hand in order to shape the Mechanically Realizable Manifolds to satisfy a pre-defined set of desired grasps. Our method guarantees that the resulting synergies can be implemented in a physical underactuated hand, and will enable the resulting hand to both reach the desired grasp postures and achieve quasistatic equilibrium while loading the grasps. We demonstrate this method on three concrete design examples derived from a real use case, and evaluate and compare their performance in practice.
https://arxiv.org/abs/1905.11293
In this paper, we present the system developed by the team from the New Technologies for the Information Society (NTIS) research center of the University of West Bohemia in Pilsen for the Second DIHARD Speech Diarization Challenge. The base of our system follows the currently standard approach of segmentation, i/x-vector extraction, clustering, and resegmentation. The hyperparameters for each of the subsystems were selected according to the domain classifier trained on the development set of DIHARD II. We compared our system with results from the Kaldi diarization (with i/x-vectors) and combined these systems. At the time of writing this abstract, our best submission achieved a DER of 23.47% and a JER of 48.99% on the evaluation set (in Track 1, using reference SAD).
https://arxiv.org/abs/1905.11276
In this work, we present a novel learning-based approach to synthesize new views of a light field image. In particular, given the four corner views of a light field, the presented method estimates any in-between view. We use three sequential convolutional neural networks for feature extraction, scene geometry estimation and view selection. In contrast to state-of-the-art approaches, we propose to estimate a different disparity map per view in order to handle occlusions. Jointly with the view selection network, this strategy proves to be the most important for obtaining proper reconstructions near object boundaries. Ablation studies and comparisons against the state of the art on Lytro light fields show the superior performance of the proposed method. Furthermore, the method is adapted and tested on light fields with wide baselines acquired with a camera array and, in spite of having to deal with large occluded areas, the proposed approach yields very promising results.
https://arxiv.org/abs/1905.11271
To combat adversarial spelling mistakes, we propose placing a word recognition model in front of the downstream classifier. Our word recognition models build upon the RNN semi-character architecture, introducing several new backoff strategies for handling rare and unseen words. Trained to recognize words corrupted by random adds, drops, swaps, and keyboard mistakes, our method achieves 32% relative (and 3.3% absolute) error reduction over the vanilla semi-character model. Notably, our pipeline confers robustness on the downstream classifier, outperforming both adversarial training and off-the-shelf spell checkers. Against a BERT model fine-tuned for sentiment analysis, a single adversarially-chosen character attack lowers accuracy from 90.3% to 45.8%. Our defense restores accuracy to 75%. Surprisingly, better word recognition does not always entail greater robustness. Our analysis reveals that robustness also depends upon a quantity that we denote the sensitivity.
https://arxiv.org/abs/1905.11268
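A minimal sketch of the semi-character word representation that underlies the recognition model: first character, bag of internal characters, last character, which is by construction invariant to internal swaps.

```python
from collections import Counter

def semi_character(word):
    """Semi-character representation: (first char, bag of internal chars,
    last char). Swapping internal characters leaves it unchanged, which
    is what makes the downstream recognizer robust to such typos."""
    if len(word) <= 2:
        return word[:1], Counter(), word[-1:]
    return word[0], Counter(word[1:-1]), word[-1]

# 'word' and its internal-swap corruption 'wrod' share one representation
assert semi_character("word") == semi_character("wrod")
```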
Dialogue policy plays an important role in task-oriented spoken dialogue systems: it determines how to respond to users. Recently proposed deep reinforcement learning (DRL) approaches have been used for policy optimization. However, these deep models remain challenging for two reasons: (1) many DRL-based policies are not sample-efficient; (2) most models lack the capability of policy transfer between different domains. In this paper, we propose a universal framework, AgentGraph, to tackle these two problems. AgentGraph combines a GNN-based architecture with a DRL-based algorithm, and can be regarded as a multi-agent reinforcement learning approach. Each agent corresponds to a node in a graph, which is defined according to the dialogue domain ontology. When making a decision, each agent can communicate with its neighbors in the graph. Under the AgentGraph framework, we further propose a Dual GNN-based dialogue policy, which implicitly decomposes the decision in each turn into a high-level global decision and a low-level local decision. Experiments show that AgentGraph models significantly outperform traditional reinforcement learning approaches on most of the 18 tasks of the PyDial benchmark. Moreover, when transferred from a source task to a target task, these models not only have acceptable initial performance but also converge much faster on the target task.
https://arxiv.org/abs/1905.11259
The classification of individual traffic participants is a complex task, especially in challenging scenarios with multiple road users or under bad weather conditions. Radar sensors provide a way of measuring such scenes that is orthogonal to well-established camera systems. To obtain accurate classification results, 50 different features are extracted from the measurement data and tested for their performance. From these features a suitable subset is chosen and passed to random forest and long short-term memory (LSTM) classifiers to obtain class predictions for the radar input. Moreover, it is shown why data imbalance is an inherent problem in automotive radar classification when the dataset is not sufficiently large. To overcome this issue, classifier binarization is used among other techniques in order to better account for underrepresented classes. A new method to couple the resulting probabilities is proposed and compared with others with great success. Final results show substantial improvements over ordinary multiclass classification.
https://arxiv.org/abs/1905.11256
Policy gradient methods ignore the potential value of adjusting environment variables: unobservable state features that are randomly determined by the environment in a physical setting, but are controllable in a simulator. This can lead to slow learning, or convergence to suboptimal policies, if the environment variable has a large impact on the transition dynamics. In this paper, we present fingerprint policy optimisation (FPO), which finds a policy that is optimal in expectation across the distribution of environment variables. The central idea is to use Bayesian optimisation (BO) to actively select the distribution of the environment variable that maximises the improvement generated by each iteration of the policy gradient method. To make this BO practical, we contribute two easy-to-compute low-dimensional fingerprints of the current policy. Our experiments show that FPO can efficiently learn policies that are robust to significant rare events, which are unlikely to be observable under random sampling, but are key to learning good policies.
http://arxiv.org/abs/1805.10662
In timeline-based planning, domains are described as sets of independent, but interacting, components, whose behaviour over time (the set of timelines) is governed by a set of temporal constraints. A distinguishing feature of timeline-based planning systems is the ability to integrate planning with execution by synthesising control strategies for flexible plans. However, flexible plans can only represent temporal uncertainty, while more complex forms of nondeterminism are needed to deal with a wider range of realistic problems. In this paper, we propose a novel game-theoretic approach to timeline-based planning problems, generalising the state of the art while uniformly handling temporal uncertainty and nondeterminism. We define a general concept of timeline-based game and we show that the notion of winning strategy for these games is strictly more general than that of control strategy for dynamically controllable flexible plans. Moreover, we show that the problem of establishing the existence of such winning strategies is decidable using a doubly exponential amount of space.
http://arxiv.org/abs/1807.04837
Automatic speech recognition (ASR) systems are becoming simpler and more practical with the emergence of various end-to-end models. However, most of them neglect the positioning of token boundaries in continuous speech, which is considered crucial in human language learning and instant speech recognition. In this work, we propose Continuous Integrate-and-Fire (CIF), a 'soft' and 'monotonic' acoustic-to-linguistic alignment mechanism that addresses boundary positioning by simulating the integrate-and-fire neuron model with continuous functions under the encoder-decoder framework. As the connection between the encoder and decoder, CIF forwardly integrates the information in the encoded acoustic representations to determine a boundary, and instantly fires the integrated information to the decoder once a boundary is located. Multiple effective strategies are introduced to the CIF-based model to alleviate the problems caused by inaccurate positioning. Besides, multi-task learning is performed during training, and an external language model is incorporated during inference to further boost model performance. Evaluated on multiple ASR datasets that cover different languages and speech types, the CIF-based model shows stable convergence and competitive performance. In particular, it achieves a word error rate (WER) of 3.70% on the test-clean set of LibriSpeech.
https://arxiv.org/abs/1905.11235
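A simplified numpy sketch of the integrate-and-fire scheduling described above, assuming the per-frame weights alpha lie in [0, 1) (e.g., produced by a sigmoid) and dropping the trailing partial symbol; the actual model adds several strategies around this core.

```python
import numpy as np

def cif(h, alpha, threshold=1.0):
    """Integrate per-frame weights alpha over encoder states h (T, D);
    each time the accumulator crosses the threshold, fire one integrated
    embedding to the decoder, carrying the leftover weight into the
    next symbol."""
    acc = 0.0
    weight = np.zeros(h.shape[1])
    fired = []
    for t in range(len(h)):
        a = alpha[t]
        if acc + a < threshold:
            acc += a
            weight += a * h[t]
        else:
            part = threshold - acc            # portion closing this symbol
            fired.append(weight + part * h[t])
            acc = a - part                    # remainder opens the next one
            weight = acc * h[t]
    return np.stack(fired) if fired else np.zeros((0, h.shape[1]))
```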
It is fundamental and challenging to train robust and accurate deep neural networks (DNNs) when noisy labels exist. Although great progress has been made, one crucial research question has not been thoroughly explored: which training examples should be focused on, and how much should they be emphasised, when training DNNs under label noise? In this work, we study this question and propose gradient rescaling (GR) to answer it. GR modifies the magnitude of the logit vector's gradient to emphasise relatively easier training data points when severe noise exists, which functions as explicit emphasis regularisation to improve the generalisation performance of DNNs. Apart from regularisation, we also interpret GR from the perspectives of sample reweighting and robust loss function design. Therefore, our proposed GR helps connect these three approaches in the literature. We empirically demonstrate that GR is highly noise-robust and outperforms state-of-the-art noise-tolerant algorithms by a large margin, e.g., a 7% increase on CIFAR-100 with 40% noisy labels. It is also significantly superior to standard regularisers. Furthermore, we present comprehensive ablation studies to explore the behaviour of GR in different cases, which is informative for applying GR in real-world scenarios.
https://arxiv.org/abs/1905.11233
For reentry or near-space communication, owing to the influence of the time-varying plasma sheath channel environment, the received IQ baseband signals are severely rotated on the constellation. Research has shown that the electron density fluctuates at frequencies from 20 kHz to 100 kHz, which is on the same order as the symbol rate of most TT&C communication systems, so a large amount of bandwidth would be consumed to track the time-varying channel with traditional estimation. In this paper, motivated by principal curve analysis, we propose a deep learning (DL) algorithm called the symmetric manifold network (SMN) to extract the curves on the constellation and classify the signals based on these curves. The key advantage is that the SMN can achieve joint optimization of demodulation and channel estimation. Our simulation results show that the new algorithm significantly reduces the symbol error rate (SER) compared to existing algorithms and enables accurate estimation of the fading with an extremely high bandwidth utilization rate.
https://arxiv.org/abs/1905.11229
In order to completely separate objects with larger occluded boundaries in an image, we devise a new variational level-set model for image segmentation, combining the recently proposed Chan-Vese-Euler model with elastica and landmark constraints. For computational efficiency, we design its augmented Lagrangian method (ALM), or alternating direction method of multipliers (ADMM), by introducing auxiliary variables, Lagrange multipliers and penalty parameters. In each loop of the alternating iterative optimization, the minimization sub-problems can be solved via a simple Gauss-Seidel iteration or via generalized soft-thresholding formulas with projection, respectively. Numerical experiments show that the proposed model not only can recover larger broken boundaries, but also can improve segmentation efficiency and decrease the dependence of the segmentation on tuning parameters and initialization.
https://arxiv.org/abs/1905.11192
We propose the first practical learned lossless image compression system, L3C, and show that it outperforms the popular engineered codecs, PNG, WebP and JPEG 2000. At the core of our method is a fully parallelizable hierarchical probabilistic model for adaptive entropy coding which is optimized end-to-end for the compression task. In contrast to recent autoregressive discrete probabilistic models such as PixelCNN, our method i) models the image distribution jointly with learned auxiliary representations instead of exclusively modeling the image distribution in RGB space, and ii) only requires three forward-passes to predict all pixel probabilities instead of one for each pixel. As a result, L3C obtains over two orders of magnitude speedups when sampling compared to the fastest PixelCNN variant (Multiscale-PixelCNN). Furthermore, we find that learning the auxiliary representation is crucial and outperforms predefined auxiliary representations such as an RGB pyramid significantly.
http://arxiv.org/abs/1811.12817
The human-machine interface is of critical importance for master-slave control of robotic surgical systems, in which current systems offer control of two robotic arms teleoperated by the surgeon's hands. To relax the need for surgical assistants and augment dexterity in surgery, it has recently been proposed to use a robot like a third arm that can be controlled seamlessly, independently from the natural arms, and work together with them. This report develops and investigates this concept by implementing foot control of a robotic surgical arm. We introduce a novel passive haptic foot-machine interface and analyze its performance. The interface uses a parallel-serial hybrid structure with springs and force sensors, which allows intuitive control of a slave robotic arm with four degrees of freedom (dof). The elastic isometric design enables a user to control the interface accurately and adaptively, with an enlarged sensing range breaking the physical restriction of the pedal size. A subject-specific independent component analysis (ICA) model is identified to map the surgeon's foot movements into kinematic parameters of the slave robotic arm. To validate the system and assess the performance it allows, 10 subjects carried out experiments manipulating the foot-machine interface in various movements. With these experimental data, the mapping models were built and verified. A comparison between different mapping models shows that the ICA algorithm clearly outperforms the other methods.
https://arxiv.org/abs/1905.11191
Predictive models are being increasingly used to support consequential decision making at the individual level in contexts such as pretrial bail and loan approval. As a result, there is increasing social and legal pressure to provide explanations that help the affected individuals not only to understand why a prediction was output, but also how to act to obtain a desired outcome. To this end, several works have proposed methods to generate counterfactual explanations. However, they are often restricted to a particular subset of models (e.g., decision trees or linear models), and cannot directly handle the mixed (numerical and nominal) nature of the features describing each individual. In this paper, we propose a model-agnostic algorithm to generate counterfactual explanations that builds on the standard theory and tools from formal verification. Specifically, our algorithm solves a sequence of satisfiability problems, where a wide variety of predictive models and distances in mixed feature spaces, as well as natural notions of plausibility and diversity, are represented as logic formulas. Our experiments on real-world data demonstrate that our approach can flexibly handle widely deployed predictive models, while providing meaningfully closer counterfactuals than existing approaches.
http://arxiv.org/abs/1905.11190
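To make the satisfiability-based formulation concrete, here is a toy z3 sketch for a linear classifier with an L1 distance. The paper encodes far richer model classes, mixed feature types, plausibility and diversity as logic formulas; `linear_counterfactual` and the margin are illustrative assumptions, not the authors' algorithm.

```python
from z3 import Optimize, Real, If, sat

def linear_counterfactual(w, b, x0, margin=1e-3):
    """Find the closest point (L1 distance) at which a linear classifier
    w.x + b crosses to the positive decision, posed as an SMT
    optimization problem (toy stand-in for the paper's encodings)."""
    xs = [Real(f"x{i}") for i in range(len(x0))]
    opt = Optimize()
    # require the flipped (positive) decision, with a small margin
    opt.add(sum(wi * xi for wi, xi in zip(w, xs)) + b >= margin)
    # minimize the L1 distance to the original individual x0
    opt.minimize(sum(If(xi >= v, xi - v, v - xi) for xi, v in zip(xs, x0)))
    if opt.check() == sat:
        m = opt.model()
        return [m.eval(xi) for xi in xs]
    return None
```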
Even though machine learning has become the dominant approach in the dialogue research community, a real breakthrough has been blocked by the scale of the available data. To address this fundamental obstacle, we introduce the Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning multiple domains and topics. At a size of $10$k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora. Apart from the open-sourced dataset labelled with dialogue belief states and dialogue actions, the contribution of this work is two-fold: firstly, we provide a detailed description of the data collection procedure along with a summary of the data structure and analysis. The proposed data-collection pipeline is entirely based on crowd-sourcing, without the need to hire professional annotators. Secondly, we report a set of benchmark results for belief tracking, dialogue act prediction and response generation, which shows the usability of the data and sets a baseline for future studies.
http://arxiv.org/abs/1810.00278
Control of robot orientation in Cartesian space involves some difficulties, because the rotation group SO(3) is not contractible, and only globally contractible state spaces support continuous and globally asymptotically stable feedback control systems. In this paper, unit quaternions are used to represent orientations, and it is first shown that the unit quaternion set minus a single point is contractible. This is used to design a control system for temporally coupled dynamical movement primitives (DMPs) in Cartesian space. The functionality of the control system is verified experimentally on an industrial robot.
http://arxiv.org/abs/1905.11176
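For context (background machinery, not the paper's controller), the standard unit-quaternion logarithm that turns an orientation into a rotation vector is the kind of map such orientation control systems build on:

```python
import numpy as np

def quat_log(q, eps=1e-12):
    """Logarithm of a unit quaternion q = (w, x, y, z): returns the
    rotation vector (axis times half-angle), the usual way to turn an
    orientation into a vector-valued error signal."""
    w, v = q[0], np.asarray(q[1:])
    nv = np.linalg.norm(v)
    if nv < eps:
        return np.zeros(3)                 # identity rotation
    return np.arccos(np.clip(w, -1.0, 1.0)) * v / nv
```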
Despite remarkable contributions from existing emotional speech synthesizers, we find that these methods are either based on text-to-speech systems or limited by the need for aligned speech pairs, and thus struggle with pure emotion-gain synthesis. Meanwhile, few studies have discussed the cross-language generalization ability of the above methods for the task of emotional speech synthesis in various languages. We propose a cross-language emotion-gain synthesis method named EG-GAN which can learn a language-independent mapping from a source emotion domain to a target emotion domain in the absence of paired speech samples. EG-GAN is based on a cycle-consistent generative adversarial network with a gradient penalty and an auxiliary speaker discriminator. Domain adaptation is introduced to enable the rapid migration and sharing of emotional gains among different languages. The experimental results show that our method can efficiently synthesize high-quality emotional speech from any source speech for given emotion categories, without the limitation of language differences and aligned speech pairs.
http://arxiv.org/abs/1905.11173
Recent research on image denoising has progressed with the development of deep learning architectures, especially convolutional neural networks. However, real-world image denoising is still very challenging because it is not possible to obtain ideal pairs of ground-truth images and real-world noisy images. Owing to the recent release of benchmark datasets, the interest of the image denoising community is now moving toward the real-world denoising problem. In this paper, we propose a grouped residual dense network (GRDN), which is an extended and generalized architecture of the state-of-the-art residual dense network (RDN). The core part of RDN is defined as a grouped residual dense block (GRDB) and used as a building module of GRDN. We experimentally show that image denoising performance can be significantly improved by cascading GRDBs. In addition to the network architecture design, we also develop a new generative adversarial network-based real-world noise modeling method. We demonstrate the superiority of the proposed methods by achieving the highest score in terms of both peak signal-to-noise ratio and structural similarity in the NTIRE2019 Real Image Denoising Challenge - Track 2: sRGB.
http://arxiv.org/abs/1905.11172
We aim to perform unsupervised discovery of objects and their states such as location and velocity, as well as physical system parameters such as mass and gravity from video – given only the differential equations governing the scene dynamics. Existing physical scene understanding methods require either object state supervision, or do not integrate with differentiable physics to learn interpretable system parameters and states. We address this problem through a $\textit{physics-as-inverse-graphics}$ approach that brings together vision-as-inverse-graphics and differentiable physics engines. This framework allows us to perform long term extrapolative video prediction, as well as vision-based model-predictive control. Our approach significantly outperforms related unsupervised methods in long-term future frame prediction of systems with interacting objects (such as ball-spring or 3-body gravitational systems). We further show the value of this tight vision-physics integration by demonstrating data-efficient learning of vision-actuated model-based control for a pendulum system. The controller’s interpretability also provides unique capabilities in goal-driven control and physical reasoning for zero-data adaptation.
http://arxiv.org/abs/1905.11169
Recent work has demonstrated that embeddings of tree-like graphs in hyperbolic space surpass their Euclidean counterparts in performance by a large margin. Inspired by these results and the scale-free structure of the word co-occurrence graph, we present an algorithm for learning word embeddings in hyperbolic space from free text. An objective function based on the hyperbolic distance is derived and included in the skip-gram negative-sampling architecture of word2vec. The hyperbolic word embeddings are then evaluated on word similarity and analogy benchmarks. The results demonstrate the potential of hyperbolic word embeddings, particularly in low dimensions, though without clear superiority over their Euclidean counterparts. We further discuss subtleties in the formulation of the analogy task in curved spaces.
http://arxiv.org/abs/1809.01498
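The distance that replaces the Euclidean inner product in the skip-gram objective is the Poincaré-ball metric; a small numpy sketch:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Hyperbolic distance between points inside the unit (Poincare)
    ball, requiring ||u|| < 1 and ||v|| < 1."""
    duv = ((u - v) ** 2).sum()
    denom = max((1.0 - (u * u).sum()) * (1.0 - (v * v).sum()), eps)
    return np.arccosh(1.0 + 2.0 * duv / denom)
```

Near the origin this behaves like the Euclidean distance, while distances blow up toward the boundary, which is what gives the ball its tree-like, exponentially growing capacity in low dimensions.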
Many robotics and mapping systems contain multiple sensors to perceive the environment. Extrinsic parameter calibration, the identification of the position and rotation transform between the frames of the different sensors, is critical to fuse data from different sensors. When obtaining multiple camera to camera, lidar to camera and lidar to lidar calibration results, inconsistencies are likely. We propose a graph-based method to refine the relative poses of the different sensors. We demonstrate our approach using our mapping robot platform, which features twelve sensors that are to be calibrated. The experimental results confirm that the proposed algorithm yields great performance.
http://arxiv.org/abs/1905.11167
The giant panda (panda) is a highly endangered animal, and significant efforts and resources have been put into panda conservation. To measure the effectiveness of conservation schemes, estimating the population size in the wild is an important task. Current population estimation approaches, including capture-recapture, human visual identification and collection of DNA from hair or feces, are invasive, subjective, costly, or even dangerous to the workers who perform these tasks in the wild. Cameras have been widely installed in the regions where pandas live, which opens a new possibility for non-invasive, image-based panda recognition. Panda face recognition is naturally a small-dataset problem, because of the number of pandas in the world and the number of qualified images captured by the cameras in each encounter. In this paper, a panda face recognition algorithm, which includes alignment, large feature set extraction and matching, is proposed and evaluated on a dataset consisting of 163 images. The experimental results are encouraging.
http://arxiv.org/abs/1905.11163
While neural networks have been used to classify or embed data into lower dimensional spaces, they are often regarded as black boxes with uninterpretable features. Here we propose Graph Spectral Regularization for making hidden layers more interpretable without significantly affecting the performance of their primary task. Taking inspiration from spatial organization and localization of neuron activations in biological networks, we use a graph Laplacian that encourages activations to be smooth either on a predetermined graph or on a feature-space graph learned from the data via co-activations of a hidden layer of the neural network. We show numerous uses for this including cluster indication and visualization in biological and image data sets.
http://arxiv.org/abs/1810.00424
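A minimal numpy sketch of the penalty itself: activation smoothness on a graph over hidden units, measured through the graph Laplacian and added to the primary task loss.

```python
import numpy as np

def graph_spectral_penalty(A, adj):
    """Smoothness of hidden activations A (batch, units) on a graph over
    the units with symmetric adjacency adj (units, units). Computes
    tr(A L A^T) = sum_b sum_{ij} adj_ij (A_bi - A_bj)^2 / 2, which is
    small when connected units co-activate."""
    L = np.diag(adj.sum(axis=1)) - adj      # combinatorial graph Laplacian
    return np.einsum('bi,ij,bj->', A, L, A)
```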
In the last few decades we have witnessed how the pheromones of social insects have become a rich source of inspiration for swarm robotics. By utilising virtual pheromones in physical swarm robot systems to coordinate individuals and realise direct/indirect inter-robot communication as social insects do, stigmergic behaviour has emerged. However, many studies take only a single pheromone into account when solving swarm problems, which is not the case in real insects. In the real social insect world, diverse behaviours, complex collective performances and flexible transitions from one state to another are guided by different kinds of pheromones and their interactions. Therefore, whether a multiple-pheromone-based strategy can inspire swarm robotics research, and, inversely, how the performance of swarm robots controlled by multiple pheromones can help explain social insects' behaviours, becomes an interesting question. Thus, to provide a reliable system for studying multiple pheromones, in this paper we propose and realise a multiple-pheromone communication system called ColCOS$\Phi$. This system consists of a virtual pheromone sub-system, wherein the multiple pheromones are represented by a colour image displayed on a screen, and a micro-robot platform designed for swarm robotics applications. Two case studies are undertaken to verify the effectiveness of this system: one on multiple-pheromone-based ant foraging, and another on the interaction of aggregation and alarm pheromones. The experimental results demonstrate the feasibility of ColCOS$\Phi$ and its great potential for directing swarm robotics and social insect research.
https://arxiv.org/abs/1905.11160
Morphological features play an important role in breast mass classification in sonography. While benign breast masses tend to have a well-defined ellipsoidal contour, the shape of malignant breast masses is commonly ill-defined and highly variable. Various handcrafted morphological features have been developed over the years to assess this phenomenon and help radiologists differentiate benign and malignant masses. In this paper we propose an automatic approach to morphology analysis: we express the shapes of breast masses as points on Kendall's shape manifold. Next, we use the full Procrustes distance to develop support vector machine classifiers for breast mass differentiation. The usefulness of our method is demonstrated using a dataset of B-mode images collected from 163 breast masses. Our method achieved an area under the receiver operating characteristic curve of 0.81. The proposed method can be used to assess shapes of breast masses in ultrasound without any feature engineering.
http://arxiv.org/abs/1905.11159
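For 2D landmark shapes, the full Procrustes distance has a compact complex-number form; a small numpy sketch (illustrative of the distance used, not the full classification pipeline):

```python
import numpy as np

def full_procrustes_distance(x, y):
    """Full Procrustes distance between two planar contours given as
    complex landmark vectors (one landmark per entry). Centring removes
    translation, normalising removes scale, and taking the modulus of
    the Hermitian inner product removes rotation."""
    x = x - x.mean()
    x = x / np.linalg.norm(x)
    y = y - y.mean()
    y = y / np.linalg.norm(y)
    return np.sqrt(max(0.0, 1.0 - abs(np.vdot(y, x)) ** 2))
```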
We propose an end-to-end deep learning approach for generating real-time facial animation from just audio. Specifically, our deep architecture employs a deep bidirectional long short-term memory network and an attention mechanism to discover latent representations of the time-varying contextual information within the speech and to recognize the significance of different pieces of information for particular facial states. Therefore, our model is able to drive different levels of facial movement at inference time and automatically keep up with the corresponding pitch and latent speaking style in the input audio, with no assumptions or further human intervention. Evaluation results show that our method can not only generate accurate lip movements from audio, but also successfully regress the speaker's time-varying facial movements.
http://arxiv.org/abs/1905.11142
Cross-modal data matching refers to the retrieval of data from one modality, given a query from another modality. In general, supervised algorithms achieve better retrieval performance than their unsupervised counterparts, as they can learn better representative features by leveraging the available label information. However, this comes at the cost of requiring a huge amount of labeled examples, which may not always be available. In this work, we propose a novel framework in a semi-supervised setting, which can predict the labels of unlabeled data using complementary information from the different modalities. The proposed framework can be used as an add-on with any baseline cross-modal algorithm to give a significant performance improvement, even in the case of limited labeled data. Finally, we analyze the challenging scenario where the unlabeled examples can even come from classes not present in the training data, and evaluate the performance of our algorithm under this setting. Extensive evaluation using several baseline algorithms across three different datasets shows the effectiveness of our label prediction framework.
http://arxiv.org/abs/1905.11139
Unlabeled video in the wild presents a valuable, yet so far unharnessed, source of information for learning vision tasks. We present the first attempt at fully self-supervised learning of object detection from subtitled videos without any manual object annotation. To this end, we use the How2 multi-modal collection of instructional videos with English subtitles. We pose the problem as learning with weakly and noisily labeled data, and propose a novel training model that can confront high noise levels and yet train a classifier to localize the object of interest in the video frames, without any manual labeling involved. We evaluate our approach on a set of 11 manually annotated objects in over 5000 frames and compare it to an existing weakly supervised approach as a baseline. Benchmark data and code will be released upon acceptance of the paper.
http://arxiv.org/abs/1905.11137
The concept of dynamical movement primitives (DMPs) has become popular for modeling of motion, commonly applied to robots. This paper presents a framework that allows a robot operator to adjust DMPs in an intuitive way. Given a generated trajectory with a faulty last part, the operator can use lead-through programming to demonstrate a corrective trajectory. A modified DMP is formed, based on the first part of the faulty trajectory and the last part of the corrective one. A real-time application is presented and verified experimentally.
http://arxiv.org/abs/1905.11130
Industrial robots typically require very structured and predictable working environments, and explicit programming, in order to perform well. Therefore, expensive and time-consuming engineering work is a major obstruction when mediating tasks to robots. This thesis presents methods that decrease the amount of engineering work required for robot programming, and increase the ability of robots to handle unforeseen events. This has two main benefits: Firstly, the programming can be done faster, and secondly, it becomes accessible to users without engineering experience. Even though these methods could be used for various types of robot applications, this thesis is focused on robotic assembly tasks.
http://arxiv.org/abs/1905.11129
Few-shot learning is an important area of research. Conceptually, humans are readily able to understand new concepts given just a few examples, while in more pragmatic terms, limited-example training situations are common in practice. Recent effective approaches to few-shot learning employ a metric-learning framework to learn a feature similarity comparison between a query (test) example, and the few support (training) examples. However, these approaches treat each support class independently from one another, never looking at the entire task as a whole. Because of this, they are constrained to use a single set of features for all possible test-time tasks, which hinders the ability to distinguish the most relevant dimensions for the task at hand. In this work, we introduce a Category Traversal Module that can be inserted as a plug-and-play module into most metric-learning based few-shot learners. This component traverses across the entire support set at once, identifying task-relevant features based on both intra-class commonality and inter-class uniqueness in the feature space. Incorporating our module improves performance considerably (5%-10% relative) over baseline systems on both mini-ImageNet and tieredImageNet benchmarks, with overall performance competitive with recent state-of-the-art systems.
http://arxiv.org/abs/1905.11116
Spoken language understanding (SLU) acts as a critical component in goal-oriented dialog systems. It typically involves identifying the speaker's intent and extracting semantic slots from user utterances, tasks known as intent detection (ID) and slot filling (SF). The SLU problem has been intensively investigated in recent years. However, existing methods just constrain SF results grammatically, solve ID and SF independently, or do not fully utilize the mutual impact of the two tasks. This paper proposes a multi-head self-attention joint model with a conditional random field (CRF) layer and a prior mask. The experiments show the effectiveness of our model compared with state-of-the-art models. Meanwhile, online education in China has made great progress in the last few years, but there are few intelligent educational dialog applications for students learning foreign languages. Hence, we design an intelligent dialog robot equipped with different scenario settings to help students learn communication skills.
https://arxiv.org/abs/1905.11393
We study synthesis problems with constraints in partially observable Markov decision processes (POMDPs), where the objective is to compute a strategy for an agent that is guaranteed to satisfy certain safety and performance specifications. Verification and strategy synthesis for POMDPs are, however, computationally intractable in general. We alleviate this difficulty by focusing on planning applications and exploiting typical structural properties of such scenarios; for instance, we assume that the agent has the ability to observe its own position inside an environment. We propose an abstraction refinement framework which turns such a POMDP model into a (fully observable) probabilistic two-player game (PG). For the obtained PGs, efficient verification and synthesis tools allow us to determine strategies with optimal safety and performance measures, which approximate optimal schedulers on the POMDP. If the approximation is too coarse to satisfy the given specifications, a refinement scheme improves the computed strategies. As a running example, we use planning problems where an agent moves inside an environment with randomly moving obstacles and restricted observability. We demonstrate that the proposed method advances the state of the art by solving problems several orders of magnitude larger than those that can be handled by existing POMDP solvers. Furthermore, this method gives guarantees on safety constraints, which is not supported by the majority of the existing solvers.
http://arxiv.org/abs/1708.04236
How can local-search methods such as stochastic gradient descent (SGD) avoid bad local minima in training multi-layer neural networks? Why can they fit random labels even given non-convex and non-smooth architectures? Most existing theory only covers networks with one hidden layer, so can we go deeper? In this paper, we focus on recurrent neural networks (RNNs), which are multi-layer networks widely used in natural language processing. They are harder to analyze than feedforward neural networks, because the $\textit{same}$ recurrent unit is repeatedly applied across the entire time horizon of length $L$, which is analogous to feedforward networks of depth $L$. We show that when the number of neurons is sufficiently large, meaning polynomial in the training data size and in $L$, then SGD is capable of minimizing the regression loss at a linear convergence rate. This gives theoretical evidence of how RNNs can memorize data. More importantly, in this paper we build general toolkits to analyze multi-layer networks with ReLU activations. For instance, we prove why ReLU activations can prevent exponential gradient explosion or vanishing, and build a perturbation theory to analyze first-order approximation of multi-layer networks.
https://arxiv.org/abs/1810.12065
Using different sources of information to support the automated extraction of relations between biomedical concepts contributes to the development of our understanding of biological systems. The primary comprehensive source of these relations is the biomedical literature. Several relation extraction approaches have been proposed to identify relations between concepts in biomedical literature, notably using neural network algorithms. The use of multichannel architectures composed of multiple data representations, as in deep neural networks, is leading to state-of-the-art results. The right combination of data representations can eventually lead us to even higher evaluation scores in relation extraction tasks. Here, biomedical ontologies play a fundamental role by providing semantic and ancestry information about an entity. The incorporation of biomedical ontologies has already been shown to enhance previous state-of-the-art results.
https://arxiv.org/abs/1905.11391