Despite the extensive presence of legged locomotion in animals, it is extremely challenging to reproduce with robots. Legged locomotion is a dynamic task that benefits from planning that exploits the gravitational pull on the system. However, the computational cost of such optimization rapidly increases with the complexity of the kinematic structure, rendering real-time deployment in unstructured environments impossible. This paper proposes a simplified method that can generate desired centre-of-mass and feet trajectories for quadrupeds. The model describes a quadruped as two bipeds connected via their centres of mass, and it is based on the extension of an algebraic bipedal model that uses the topology of the gravitational attractor to describe bipedal locomotion strategies. The results show that the model generates trajectories that agree with previous studies. The model will be deployed in the future as a seed solution for whole-body trajectory optimization, in an attempt to reduce the computational cost and obtain real-time planning of complex actions in challenging environments.
http://arxiv.org/abs/1902.07346
Deep learning has mainly thrived by training on large-scale datasets. However, a continual learning agent must update its model incrementally and in a sample-efficient manner. Learning semantic segmentation from few labelled samples can be a significant step toward that goal. We propose a novel method that constructs the new class weights from few labelled samples in the support set without back-propagation, while updating the previously learned classes. Inspired by work on adaptive correlation filters, we design an adaptive masked imprinted weights method. It applies masked average pooling to the output embeddings, and the pooled embedding acts as a positive proxy for the new class. Our proposed method is evaluated on the PASCAL-5i dataset and outperforms the state of the art in 5-shot semantic segmentation. Unlike previous methods, our approach does not require a second branch to estimate parameters or prototypes, and it enables the adaptation of previously learned weights. Our adaptation scheme is evaluated on the DAVIS video segmentation benchmark and on our proposed incremental version of PASCAL, where it is shown to outperform the baseline model.
http://arxiv.org/abs/1902.11123
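A minimal sketch of the masked average pooling and weight imprinting step described above, assuming per-pixel embeddings of shape (C, H, W) and a binary support mask; the exact normalization used in the paper may differ.

```python
# Hypothetical sketch: masked average pooling followed by weight imprinting.
import numpy as np

def masked_average_pool(features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average the embedding vectors that fall inside the support mask."""
    masked = features * mask[None, :, :]          # zero out background pixels
    return masked.sum(axis=(1, 2)) / (mask.sum() + 1e-8)

def imprint_weight(proxy: np.ndarray) -> np.ndarray:
    """L2-normalize the pooled proxy so it can serve as a classifier weight."""
    return proxy / (np.linalg.norm(proxy) + 1e-8)

# Toy usage: a 64-channel 8x8 embedding and a mask covering one quadrant.
feats = np.random.randn(64, 8, 8).astype(np.float32)
mask = np.zeros((8, 8), dtype=np.float32)
mask[:4, :4] = 1.0
new_class_weight = imprint_weight(masked_average_pool(feats, mask))
print(new_class_weight.shape)  # (64,)
```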
We propose a new approach to video face recognition. Our component-wise feature aggregation network (C-FAN) accepts a set of face images of a subject as input and outputs a single feature vector as the face representation of the set for the recognition task. The whole network is trained in two steps: (i) train a base CNN for still-image face recognition; (ii) add an aggregation module to the base network to learn a quality value for each feature component, which adaptively aggregates deep feature vectors into a single vector to represent the face in a video. C-FAN automatically learns to retain salient face features with high quality scores while suppressing features with low quality scores. Experimental results on three benchmark datasets, YouTube Faces, IJB-A, and IJB-S, show that the proposed C-FAN network is capable of generating a compact 512-dimensional feature vector for a video sequence by efficiently aggregating the feature vectors of all video frames, achieving state-of-the-art performance.
http://arxiv.org/abs/1902.07327
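A hedged sketch of component-wise aggregation in the spirit of the abstract above, assuming each frame yields a 512-D feature vector plus per-component quality logits; the softmax weighting is one plausible reading, not the paper's exact formulation.

```python
# Quality-weighted, component-wise aggregation of per-frame features.
import numpy as np

def aggregate(features: np.ndarray, qualities: np.ndarray) -> np.ndarray:
    """features, qualities: (num_frames, dim). Softmax the quality logits per
    component so each of the dim components is a weighted average across
    frames, dominated by the frames with high quality for that component."""
    weights = np.exp(qualities - qualities.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights * features).sum(axis=0)      # (dim,)

frames = np.random.randn(30, 512)   # 30 frames of a video
scores = np.random.randn(30, 512)   # hypothetical quality logits
video_descriptor = aggregate(frames, scores)
print(video_descriptor.shape)       # (512,)
```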
State-of-the-art deep learning methods for image processing are evolving into increasingly complex meta-architectures with a growing number of modules. Among them, region-based fully convolutional networks (R-FCN) and deformable convolutional nets (DCN) can improve CAD for mammography: R-FCN optimizes for speed and low memory consumption, which is crucial for processing the high resolutions of up to 50 micrometers used by radiologists. Deformable convolution and pooling can model a wide range of mammographic findings of different morphologies and scales, thanks to their versatility. In this study, we present a neural network architecture based on R-FCN/DCN that we have adapted from the natural image domain to suit mammograms, particularly their larger image size, without compromising resolution. We trained the network on a large, recently released dataset (Optimam) including 6,500 cancerous mammograms. By combining our modern architecture with such a rich dataset, we achieved an area under the ROC curve of 0.879 for breast-wise detection in the DREAMS challenge (130,000 withheld images), which surpassed all other submissions in the competitive phase.
http://arxiv.org/abs/1902.07323
Extremely peculiar emission lines have been found in the spectra of some active galactic nuclei and quasars. Their origin is totally unknown. We investigate the hypothesis that they are generated by ultra-rapid quasi-periodic oscillations that may occur in jets or black holes, as predicted in a published theoretical paper. We conclude that, although not established with certainty, this hypothesis is just as valid as the other highly peculiar hypotheses previously proposed (e.g., blueshifts due to bulk motions at close to the speed of light). We consider ways to further validate our hypothesis.
https://arxiv.org/abs/1902.07320
Synergies have been adopted in prosthetic limb applications to reduce design complexity, but they typically involve a single synergy setting for an entire population and ignore individual preference or adaptation capacity. In this paper, a systematic design of kinematic synergies for human-prosthesis interfaces using on-line measurements from each individual is proposed. The task of reaching with the upper limb is described by an objective function, and the interface is parameterized by a kinematic synergy. Consequently, personalizing the interface for a given individual can be formulated as finding an optimal personalized parameter. A structure to model the observed motor behavior that allows for the personalized traits of motor preference and motor learning is proposed, and subsequently used in an on-line optimization scheme to identify the synergies for an individual. Knowledge of the common features contained in the model enables on-line adaptation of the human-prosthesis interface to happen concurrently with human motor adaptation, without the need to re-tune the parameters of the on-line algorithm for each individual. Human-in-the-loop experimental results with able-bodied subjects, performed in a virtual reality environment to emulate amputation and prosthesis use, show that the proposed personalization algorithm was effective in obtaining optimal synergies with a fast, uniform convergence speed across a group of individuals.
http://arxiv.org/abs/1902.07313
The paper describes a deep-network-based object detector specialized for ball detection in long-shot videos. Due to its fully convolutional design, the method operates on images of any size and produces a \emph{ball confidence map} encoding the position of the detected ball. The network uses the hypercolumn concept, where feature maps from different hierarchy levels of the deep convolutional network are combined and jointly fed to the convolutional classification layer. This boosts detection accuracy, as a larger visual context around the object of interest is taken into account. The method achieves state-of-the-art results when tested on the publicly available ISSIA-CNR Soccer Dataset.
http://arxiv.org/abs/1902.07304
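A minimal PyTorch sketch of the hypercolumn idea referenced above: upsample feature maps from several depths to a common resolution, concatenate them, and classify with a 1x1 convolution. Layer counts and channel sizes are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def hypercolumn_confidence(feature_maps, classifier):
    """feature_maps: list of tensors (B, C_i, H_i, W_i) from different levels."""
    target_size = feature_maps[0].shape[-2:]          # finest resolution
    upsampled = [F.interpolate(f, size=target_size, mode='bilinear',
                               align_corners=False) for f in feature_maps]
    stacked = torch.cat(upsampled, dim=1)             # (B, sum C_i, H, W)
    return torch.sigmoid(classifier(stacked))         # ball confidence map

# Toy usage with fake multi-level features from a backbone.
feats = [torch.randn(1, 64, 64, 64), torch.randn(1, 128, 32, 32),
         torch.randn(1, 256, 16, 16)]
head = torch.nn.Conv2d(64 + 128 + 256, 1, kernel_size=1)
conf_map = hypercolumn_confidence(feats, head)
print(conf_map.shape)  # torch.Size([1, 1, 64, 64])
```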
In recent years, object detection has experienced impressive progress. Despite these improvements, there is still a significant gap between detection performance on small and large objects. We analyze the current state-of-the-art model, Mask-RCNN, on a challenging dataset, MS COCO. We show that the overlap between small ground-truth objects and the predicted anchors is much lower than the expected IoU threshold. We conjecture this is due to two factors: (1) only a few images contain small objects, and (2) small objects do not appear often enough even within the images that contain them. We thus propose to oversample images with small objects and to augment each of those images by copy-pasting small objects many times. This allows us to trade off the quality of the detector on large objects against that on small objects. We evaluate different pasting augmentation strategies and ultimately achieve a 9.7\% relative improvement on instance segmentation and 7.1\% on object detection of small objects, compared to the current state-of-the-art method on MS COCO.
http://arxiv.org/abs/1902.07296
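A simplified sketch of the copy-paste augmentation described above, assuming instance masks are available; real pipelines would also avoid overlapping existing objects and update annotations for each pasted copy.

```python
import numpy as np

def paste_small_object(image, obj_patch, obj_mask, rng, n_copies=3):
    """Paste a small object (patch + binary mask) at random locations."""
    out = image.copy()
    ph, pw = obj_mask.shape
    H, W = image.shape[:2]
    for _ in range(n_copies):
        y = rng.integers(0, H - ph)
        x = rng.integers(0, W - pw)
        region = out[y:y + ph, x:x + pw]
        region[obj_mask > 0] = obj_patch[obj_mask > 0]   # copy masked pixels
    return out

rng = np.random.default_rng(0)
img = np.zeros((256, 256, 3), dtype=np.uint8)
patch = np.full((12, 12, 3), 255, dtype=np.uint8)        # toy small object
mask = np.ones((12, 12), dtype=np.uint8)
augmented = paste_small_object(img, patch, mask, rng)
```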
In this work, we generalize semi-supervised generative adversarial networks (GANs) from classification problems to regression problems. In the last few years, the importance of improving the training of neural networks using semi-supervised training has been demonstrated for classification problems. We present a novel loss function, called feature contrasting, resulting in a discriminator which can distinguish between fake and real data based on feature statistics. This method avoids potential biases and limitations of alternative approaches. The generalization of semi-supervised GANs to the regime of regression problems opens their use to countless applications, as well as providing an avenue for a deeper understanding of how GANs function. We first demonstrate the capabilities of semi-supervised regression GANs on a toy dataset that allows for a detailed understanding of how they operate in various circumstances. This toy dataset is used to provide a theoretical basis for the semi-supervised regression GAN. We then apply semi-supervised regression GANs to a number of real-world computer vision applications: age estimation, driving steering angle prediction, and crowd counting from single images. We perform extensive tests of the accuracy that can be achieved with significantly reduced annotated data. Through the combination of the theoretical example and the real-world scenarios, we demonstrate how semi-supervised GANs can be generalized to regression problems.
http://arxiv.org/abs/1811.11269
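A hedged sketch of a feature-contrasting-style loss as one reading of the abstract above: compare batch statistics of discriminator features for real versus fake data. This is an interpretation for illustration, not the paper's exact objective.

```python
import torch

def feature_contrast_loss(real_feats: torch.Tensor,
                          fake_feats: torch.Tensor) -> torch.Tensor:
    """Compare batch statistics (mean and std) of discriminator features.
    Training the generator on this loss pulls fake statistics toward the
    real ones; the discriminator is trained to push them apart."""
    mean_gap = (real_feats.mean(dim=0) - fake_feats.mean(dim=0)).norm()
    std_gap = (real_feats.std(dim=0) - fake_feats.std(dim=0)).norm()
    return mean_gap + std_gap

real = torch.randn(64, 128)   # features of real samples
fake = torch.randn(64, 128)   # features of generated samples
print(feature_contrast_loss(real, fake))
```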
There are many use cases in singing synthesis where creating voices from small amounts of data is desirable. In text-to-speech there have been several promising results that apply voice cloning techniques to modern deep learning based models. In this work, we adapt one such technique to the case of singing synthesis. By leveraging data from many speakers to first create a multispeaker model, small amounts of target data can then efficiently adapt the model to new unseen voices. We evaluate the system using listening tests across a number of different use cases, languages and kinds of data.
http://arxiv.org/abs/1902.07292
Many machine vision applications, such as semantic segmentation and depth prediction, require predictions for every pixel of the input image. Models for such problems usually consist of encoders, which decrease spatial resolution while learning a high-dimensional representation, followed by decoders, which recover the original input resolution and produce low-dimensional predictions. While encoders have been studied rigorously, relatively few studies address the decoder side. This paper presents an extensive comparison of a variety of decoders for a variety of pixel-wise tasks ranging from classification and regression to synthesis. Our contributions are: (1) decoders matter: we observe significant variance in results between different types of decoders on various problems; (2) we introduce new residual-like connections for decoders; (3) we introduce a novel decoder: bilinear additive upsampling; (4) we explore prediction artifacts.
http://arxiv.org/abs/1707.05847
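A minimal PyTorch sketch of bilinear additive upsampling as named above, under the assumption that it bilinearly upsamples the feature map and then averages each group of channels, growing spatial resolution while shrinking depth without adding parameters.

```python
import torch
import torch.nn.functional as F

def bilinear_additive_upsample(x: torch.Tensor, factor: int = 2,
                               group: int = 4) -> torch.Tensor:
    """x: (B, C, H, W) with C divisible by `group`.
    Returns (B, C // group, H * factor, W * factor)."""
    b, c, h, w = x.shape
    up = F.interpolate(x, scale_factor=factor, mode='bilinear',
                       align_corners=False)
    up = up.view(b, c // group, group, h * factor, w * factor)
    return up.mean(dim=2)   # average each channel group

x = torch.randn(1, 64, 16, 16)
y = bilinear_additive_upsample(x)
print(y.shape)  # torch.Size([1, 16, 32, 32])
```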
Recent advances in deep learning have improved the segmentation accuracy of subcortical brain structures, which would be useful in neuroimaging studies of many neurological disorders. However, most previous deep learning work does not investigate the specific difficulties of segmenting extremely small but important brain regions such as the amygdala and its subregions. To tackle this challenging task, a novel 3D Bayesian fully convolutional neural network was developed to apply a dilated dual-pathway approach that retains fine details and utilizes both local and more global contextual information to automatically segment the amygdala and its subregions at high precision. The proposed method provides insights on network design and sampling strategy that target the segmentation of small 3D structures. In particular, this study confirms that a large context, enabled by a large field of view, is beneficial for segmenting small objects; furthermore, precise contextual information enabled by dilated convolutions allows for better boundary localization, which is critical for examining the morphology of the structure. In addition, it is demonstrated that the uncertainty information estimated by our network may be leveraged to identify atypicality in data. Our method was compared with two state-of-the-art deep learning models and a traditional multi-atlas approach, and exhibited excellent performance as measured both by Dice overlap and average symmetric surface distance. To the best of our knowledge, this work is the first deep learning-based approach that targets the subregions of the amygdala.
http://arxiv.org/abs/1902.07289
The human brain cortical layer has a convoluted morphology that is unique to each individual. Characterization of cortical morphology is necessary in longitudinal studies of structural brain change, as well as in discriminating individuals in health and disease. A method for encoding the cortical morphology in the form of a graph is presented. The design of graphs that encode the global cerebral hemisphere cortices as well as localized cortical regions is proposed. Spectral metrics derived from these graphs are then studied and proposed as descriptors of cortical morphology. As a proof of concept of their applicability in characterizing cortical morphology, the metrics are studied in the context of hemispheric asymmetry as well as gender-dependent discrimination of cortical morphology.
http://arxiv.org/abs/1902.07283
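A small sketch of the kind of spectral descriptor the abstract above refers to: build a graph (here a toy adjacency matrix standing in for a mesh-derived cortical graph), form the normalized Laplacian, and use its eigenvalues as descriptors. The specific metrics used in the paper may differ.

```python
import numpy as np

def laplacian_spectrum(adjacency: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the k smallest eigenvalues of the normalized graph Laplacian."""
    degree = adjacency.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(degree, 1e-12)))
    laplacian = np.eye(len(adjacency)) - d_inv_sqrt @ adjacency @ d_inv_sqrt
    eigvals = np.linalg.eigvalsh(laplacian)   # sorted ascending
    return eigvals[:k]

# Toy 6-node ring graph standing in for a mesh-derived graph.
A = np.roll(np.eye(6), 1, axis=1) + np.roll(np.eye(6), -1, axis=1)
print(laplacian_spectrum(A))
```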
It is intuitive that semantic representations can be useful for machine translation, mainly because they can help enforce meaning preservation and handle the data sparsity (many sentences corresponding to one meaning) of machine translation models. On the other hand, little work has been done on leveraging semantics for neural machine translation (NMT). In this work, we study the usefulness of AMR (abstract meaning representation) for NMT. Experiments on a standard English-to-German dataset show that incorporating AMR as additional knowledge can significantly improve a strong attention-based sequence-to-sequence neural translation model.
http://arxiv.org/abs/1902.07282
In this paper, we consider batch supervised learning where an adversary is allowed to corrupt instances with arbitrarily large noise. The adversary may corrupt any $l$ features in each instance and change their values in any way. This noise is introduced on test instances, and the algorithm receives no label feedback for these instances. We provide several subspace voting techniques that can be used to transform existing algorithms, and we prove data-dependent performance bounds in this setting. The key insight behind our results is that we set our parameters so that a significant fraction of the voting hypotheses do not contain corrupt features; for many real-world problems, these uncorrupted hypotheses are sufficient to achieve high accuracy. We empirically validate our approach on several datasets, including three new datasets that deal with side-channel electromagnetic information.
http://arxiv.org/abs/1902.07280
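A hedged sketch of subspace voting as described above: train many copies of a base classifier on random feature subsets and predict by majority vote, so that when at most $l$ features are corrupted, many voters see no corrupt feature at all. The base learner and sizes are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_subspace_voters(X, y, n_voters=25, subspace_dim=5, seed=0):
    rng = np.random.default_rng(seed)
    voters = []
    for _ in range(n_voters):
        idx = rng.choice(X.shape[1], size=subspace_dim, replace=False)
        clf = LogisticRegression(max_iter=1000).fit(X[:, idx], y)
        voters.append((idx, clf))
    return voters

def vote_predict(voters, X):
    votes = np.stack([clf.predict(X[:, idx]) for idx, clf in voters])
    # Majority vote across voters (binary labels assumed here).
    return (votes.mean(axis=0) > 0.5).astype(int)

X = np.random.randn(200, 30)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
voters = train_subspace_voters(X, y)
print(vote_predict(voters, X[:5]))
```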
To be effective in sequential data processing, Recurrent Neural Networks (RNNs) are required to keep track of past events by creating memories. While the relation between memories and the network’s hidden state dynamics was established over the last decade, previous works in this direction were of a predominantly descriptive nature focusing mainly on locating the dynamical objects of interest. In particular, it remained unclear how dynamical observables affect the performance, how they form and whether they can be manipulated. Here, we utilize different training protocols, datasets and architectures to obtain a range of networks solving a delayed classification task with similar performance, alongside substantial differences in their ability to extrapolate for longer delays. We analyze the dynamics of the network’s hidden state, and uncover the reasons for this difference. Each memory is found to be associated with a nearly steady state of the dynamics which we refer to as a ‘slow point’. Slow point speeds predict extrapolation performance across all datasets, protocols and architectures tested. Furthermore, by tracking the formation of the slow points we are able to understand the origin of differences between training protocols. Finally, we propose a novel regularization technique that is based on the relation between hidden state speeds and memory longevity. Our technique manipulates these speeds, thereby leading to a dramatic improvement in memory robustness over time, and could pave the way for a new class of regularization methods.
https://arxiv.org/abs/1902.07275
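A sketch of a speed-based regularizer in the spirit of the abstract above: penalize the change of the hidden state over time so that memories sit near slow points of the dynamics. The penalty weight and where it is applied are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class SpeedRegularizedGRU(nn.Module):
    def __init__(self, input_size=10, hidden_size=32):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)

    def forward(self, inputs):
        """inputs: (T, B, input_size). Returns final state and speed penalty."""
        h = inputs.new_zeros(inputs.shape[1], self.cell.hidden_size)
        speed_penalty = 0.0
        for x_t in inputs:
            h_next = self.cell(x_t, h)
            # Accumulate the squared hidden-state speed at each step.
            speed_penalty = speed_penalty + ((h_next - h) ** 2).sum(dim=1).mean()
            h = h_next
        return h, speed_penalty

model = SpeedRegularizedGRU()
x = torch.randn(20, 8, 10)             # 20 time steps, batch of 8
h_final, penalty = model(x)
loss = penalty * 1e-3                  # added to the task loss in training
loss.backward()
```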
Reinforcement learning (RL) enables agents to make decisions based on a reward function. However, the values chosen for the learning algorithm's parameters can significantly impact the overall learning process. In this paper, we use a genetic algorithm (GA) to find parameter values for Deep Deterministic Policy Gradient (DDPG) combined with Hindsight Experience Replay (HER), in order to speed up the learning agent. We applied this method to the fetch-reach, slide, push, pick-and-place, and door-opening robotic manipulation tasks. Our experimental evaluation shows that our method reaches better performance faster than the original algorithm.
http://arxiv.org/abs/1905.04100
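A generic sketch of the GA loop for hyperparameter search, with a stubbed fitness function; in the setting above, the fitness would come from training DDPG+HER with the candidate parameters and measuring task success. Parameter ranges and GA settings are illustrative assumptions.

```python
import random

def random_params():
    return {"lr": 10 ** random.uniform(-5, -2),
            "tau": random.uniform(0.001, 0.1),
            "gamma": random.uniform(0.9, 0.999)}

def fitness(params):            # stub: replace with an RL training run
    return -abs(params["lr"] - 1e-3) - abs(params["gamma"] - 0.98)

def crossover(a, b):
    return {k: random.choice([a[k], b[k]]) for k in a}

def mutate(p, rate=0.2):
    return {k: (v * random.uniform(0.8, 1.25) if random.random() < rate else v)
            for k, v in p.items()}

population = [random_params() for _ in range(20)]
for generation in range(10):
    scored = sorted(population, key=fitness, reverse=True)
    parents = scored[:5]                        # keep the fittest
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(len(population) - len(parents))]
    population = parents + children
print(max(population, key=fitness))
```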
Visual segmentation has seen tremendous advancement recently, with ready solutions for a wide variety of scene types, including human hands and other body parts. However, a focus on segmenting human hands while they perform complex tasks, such as manual assembly, is still severely lacking. Segmenting hands from tools, work pieces, background, and other body parts is extremely difficult because of self-occlusions and intricate hand grips and poses. In this paper we introduce BusyHands, a large open dataset of pixel-level annotated images of hands performing 13 different tool-based assembly tasks, from both real-world captures and virtual-world renderings. A total of 7,906 samples are included in our first-of-its-kind dataset, with both RGB and depth images obtained from a Kinect V2 camera and Blender. We evaluate several state-of-the-art semantic segmentation methods on our dataset as a proposed performance benchmark.
http://arxiv.org/abs/1902.07262
Detecting objects in a video is a compute-intensive task. In this paper we propose CaTDet, a system to speed up object detection by leveraging the temporal correlation in video. CaTDet consists of two DNN models that form a cascaded detector, and an additional tracker to predict regions of interest based on historic detections. We also propose a new metric, mean Delay (mD), which is designed for latency-critical video applications. Experiments on the KITTI dataset show that CaTDet reduces the operation count by 5.1-8.7x with the same mean Average Precision (mAP) as the single-model Faster R-CNN detector, while incurring an additional delay of 0.3 frames. On the CityPersons dataset, CaTDet achieves a 13.0x reduction in operations with a 0.8% mAP loss.
https://arxiv.org/abs/1810.00434
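A schematic sketch of a cascaded video detector in the spirit of CaTDet: a cheap proposal network plus a tracker select regions of interest, and the expensive network only refines those regions. Every component below is a stub standing in for the paper's DNN models and tracker.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]

def proposal_net(frame) -> List[Box]:        # cheap model: high recall
    return [(10, 10, 50, 50)]

def refinement_net(frame, rois: List[Box]) -> List[Box]:  # accurate model
    return rois                               # stub: refine only the ROIs

class SimpleTracker:
    """Predicts this frame's ROIs from the last frame's detections."""
    def __init__(self):
        self.prev: List[Box] = []
    def predict(self) -> List[Box]:
        return self.prev                      # stub: a motion model fits here
    def update(self, detections: List[Box]):
        self.prev = detections

tracker = SimpleTracker()
for frame in range(3):                        # stand-in for video frames
    rois = proposal_net(frame) + tracker.predict()
    detections = refinement_net(frame, rois)  # heavy compute on ROIs only
    tracker.update(detections)
    print(frame, detections)
```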
We consider the problem of learning from sparse and underspecified rewards, where an agent receives a complex input, such as a natural language instruction, and needs to generate a complex response, such as an action sequence, while only receiving binary success-failure feedback. Such success-failure rewards are often underspecified: they do not distinguish between purposeful and accidental success. Generalization from underspecified rewards hinges on discounting spurious trajectories that attain accidental success, while learning from sparse feedback requires effective exploration. We address exploration by using the mode-covering direction of the KL divergence to collect a diverse set of successful trajectories, followed by the mode-seeking direction of the KL divergence to train a robust policy. We propose Meta Reward Learning (MeRL) to construct an auxiliary reward function that provides more refined feedback for learning. The parameters of the auxiliary reward function are optimized with respect to the validation performance of a trained policy. The MeRL approach outperforms our alternative reward learning technique based on Bayesian Optimization, and achieves the state of the art on weakly-supervised semantic parsing. It improves previous work by 1.2% and 2.4% on the WikiTableQuestions and WikiSQL datasets, respectively.
http://arxiv.org/abs/1902.07198
We provide a framework to approximate the 2-Wasserstein distance and the optimal transport map, amenable to efficient training as well as statistical and geometric analysis. With the quadratic cost and considering the Kantorovich dual form of the optimal transportation problem, the Brenier theorem states that the optimal potential function is convex and the optimal transport map is the gradient of the optimal potential function. Using this geometric structure, we restrict the optimization problem to different parametrized classes of convex functions and pay special attention to the class of input-convex neural networks. We analyze the statistical generalization and the discriminative power of the resulting approximate metric, and we prove a restricted moment-matching property for the approximate optimal map. Finally, we discuss a numerical algorithm to solve the restricted optimization problem and provide numerical experiments to illustrate and compare the proposed approach with the established regularization-based approaches. We further discuss practical implications of our proposal in a modular and interpretable design for GANs which connects the generator training with discriminator computations to allow for learning an overall composite generator.
https://arxiv.org/abs/1902.07197
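A minimal sketch of an input-convex neural network (ICNN), the function class highlighted above: the output is convex in the input as long as the hidden-path weights are kept non-negative and the activations are convex and non-decreasing. Per the Brenier view in the abstract, the gradient of the learned potential acts as a transport map. Sizes and layer counts are illustrative, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class ICNN(nn.Module):
    def __init__(self, dim=2, hidden=64, layers=3):
        super().__init__()
        self.W_x = nn.ModuleList([nn.Linear(dim, hidden) for _ in range(layers)])
        self.W_z = nn.ModuleList([nn.Linear(hidden, hidden, bias=False)
                                  for _ in range(layers - 1)])
        self.out = nn.Linear(hidden, 1, bias=False)

    def forward(self, x):
        z = torch.relu(self.W_x[0](x))
        for wx, wz in zip(self.W_x[1:], self.W_z):
            z = torch.relu(wx(x) + wz(z))    # convex, non-decreasing in z
        return self.out(z)

    def clamp_weights(self):
        """Project the hidden-path and output weights onto the non-negative
        orthant after each optimizer step to preserve convexity in x."""
        for wz in self.W_z:
            wz.weight.data.clamp_(min=0)
        self.out.weight.data.clamp_(min=0)

f = ICNN()
x = torch.randn(5, 2, requires_grad=True)
potential = f(x).sum()
transport_map = torch.autograd.grad(potential, x)[0]  # gradient of potential
print(transport_map.shape)  # torch.Size([5, 2])
```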
Over the past few years, Spiking Neural Networks (SNNs) have become popular as a possible pathway to enable low-power event-driven neuromorphic hardware. However, their application in machine learning has largely been limited to very shallow neural network architectures for simple problems. In this paper, we propose a novel algorithmic technique for generating an SNN with a deep architecture, and demonstrate its effectiveness on complex visual recognition problems such as CIFAR-10 and ImageNet. Our technique applies to both VGG and Residual network architectures, with significantly better accuracy than the state of the art. Finally, we present an analysis of the sparse event-driven computations to demonstrate reduced hardware overhead when operating in the spiking domain.
http://arxiv.org/abs/1802.02627
Many machine learning algorithms represent input data with vector embeddings or discrete codes. When inputs exhibit compositional structure (e.g. objects built from parts or procedures from subroutines), it is natural to ask whether this compositional structure is reflected in the inputs' learned representations. While the assessment of compositionality in languages has received significant attention in linguistics and adjacent fields, the machine learning literature lacks general-purpose tools for producing graded measurements of compositional structure in more general (e.g. vector-valued) representation spaces. We describe a procedure for evaluating compositionality by measuring how well the true representation-producing model can be approximated by a model that explicitly composes a collection of inferred representational primitives. We use the procedure to provide formal and empirical characterizations of compositional structure in a variety of settings, exploring the relationship between compositionality and learning dynamics, human judgments, representational similarity, and generalization.
http://arxiv.org/abs/1902.07181
Attention-based sequence-to-sequence models for speech recognition jointly train an acoustic model, language model (LM), and alignment mechanism using a single neural network and require only parallel audio-text pairs. Thus, the language model component of the end-to-end model is only trained on transcribed audio-text pairs, which leads to performance degradation, especially on rare words. While there has been a variety of work looking at incorporating an external LM trained on text-only data into the end-to-end framework, none of it has taken into account the characteristic error distribution made by the model. In this paper, we propose a novel approach to utilizing text-only data by training a spelling correction (SC) model to explicitly correct those errors. On the LibriSpeech dataset, we demonstrate that the proposed model yields an 18.6% relative improvement in WER over the baseline model when directly correcting the top ASR hypothesis, and a 29.0% relative improvement when further rescoring an expanded n-best list using an external LM.
http://arxiv.org/abs/1902.07178
GaN films with thicknesses up to 3 mm were grown by the halide vapour phase epitaxy method. Two growth modes were observed: a high-temperature (HT) mode and a low-temperature (LT) mode. Films grown in the HT mode had smooth surfaces; however, the growth stress was high and caused cracking. Films grown in the LT mode had rough surfaces with a high density of V-defects (pits), but such films were crack-free. The influence of growth parameters on pit shape and evolution was investigated. The origins of pit formation and the process of pit overgrowth are discussed. Crack-free films with smooth surfaces and a reduced density of pits were grown using a combination of the LT and HT growth modes.
https://arxiv.org/abs/1902.07164
We study the emergence of cooperative behaviors in reinforcement learning agents by introducing a challenging competitive multi-agent soccer environment with continuous simulated physics. We demonstrate that decentralized, population-based training with co-play can lead to a progression in agents' behaviors: from random, to simple ball chasing, and finally showing evidence of cooperation. Our study highlights several of the challenges encountered in large-scale multi-agent training in continuous control. In particular, we demonstrate that the automatic optimization of simple shaping rewards, not themselves conducive to cooperative behavior, can lead to long-horizon team behavior. We further apply an evaluation scheme, grounded in game-theoretic principles, that can assess agent performance in the absence of pre-defined evaluation tasks or human baselines.
http://arxiv.org/abs/1902.07151
Mobility in an effective and socially compliant manner is an essential yet challenging task for robots operating in crowded spaces. Recent works have shown the power of deep reinforcement learning techniques to learn socially cooperative policies. However, their cooperation ability deteriorates as the crowd grows, since they typically relax the problem to a one-way Human-Robot interaction problem. In this work, we want to go beyond first-order Human-Robot interaction and more explicitly model Crowd-Robot Interaction (CRI). We propose to (i) rethink pairwise interactions with a self-attention mechanism, and (ii) jointly model Human-Robot as well as Human-Human interactions in the deep reinforcement learning framework. Our model captures the Human-Human interactions occurring in dense crowds that indirectly affect the robot's anticipation capability. Our proposed attentive pooling mechanism learns the collective importance of neighboring humans with respect to their future states. Various experiments demonstrate that our model can anticipate human dynamics and navigate in crowds with time efficiency, outperforming state-of-the-art methods.
http://arxiv.org/abs/1809.08835
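A hedged sketch of attentive pooling over neighboring humans as described above: score each human's state with a small network, softmax the scores, and pool the embeddings with those attention weights. State dimensions and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrowdAttentionPool(nn.Module):
    def __init__(self, state_dim=6, hidden=32):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.score = nn.Linear(hidden, 1)

    def forward(self, human_states):
        """human_states: (num_humans, state_dim) relative to the robot."""
        e = self.embed(human_states)                 # (N, hidden)
        alpha = torch.softmax(self.score(e), dim=0)  # collective importance
        return (alpha * e).sum(dim=0)                # pooled crowd feature

pool = CrowdAttentionPool()
humans = torch.randn(7, 6)   # 7 humans, each with position/velocity features
crowd_feature = pool(humans)
print(crowd_feature.shape)   # torch.Size([32])
```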
A descriptive approach for the automatic generation of visual blends is presented. The implemented system, the Blender, is composed of two components: the Mapper and the Visual Blender. The approach uses structured visual representations along with sets of visual relations which describe how the elements (into which the visual representation can be decomposed) relate to each other. Our system is a hybrid blender, as the blending process starts at the Mapper (conceptual level) and ends at the Visual Blender (visual representation level). The experimental results show that the Blender is able to create analogies from input mental spaces and produce well-composed blends, which follow the rules imposed by their base analogy and its relations. The resulting blends are visually interesting, and some can be considered unexpected.
http://arxiv.org/abs/1706.09076
Individuals with spinal cord injury (SCI) or stroke who lack manipulation capability have a particular need for robotic hand exoskeletons. Among assistive and rehabilitative medical exoskeletons, there exists a sharp trade-off between device power on the one hand and ergonomics and portability on the other: devices that provide stronger grasping assistance do so at the cost of patient comfort. This paper proposes using fin-ray-inspired, cable-driven finger orthoses to generate high fingertip forces without the painful compressive and shear stresses commonly associated with conventional cable-driven exoskeletons. By combining a cable-driven transmission with segmented finger orthoses, the exoskeleton transmitted larger forces and applied torques discretely to the fingers, leading to strong fingertip forces. A prototype of the finger orthoses and associated cable transmission was fabricated, and force transmission tests of the prototype in the finger flexion mode demonstrated a 2:1 input-output ratio between cable tension and fingertip force, with a maximum fingertip force of 22 N. Moreover, the proposed design provides a comfortable experience for wearers thanks to its light weight and its conformity to the hands.
http://arxiv.org/abs/1902.07112
While reinforcement learning can effectively improve language generation models, it often suffers from generating incoherent and repetitive phrases \cite{paulus2017deep}. In this paper, we propose a novel repetition-normalized adversarial reward to mitigate these problems. Our repetition-penalized reward greatly reduces the repetition rate, while adversarial training mitigates the generation of incoherent phrases. Our model significantly outperforms the baseline model on ROUGE-1\,(+3.24) and ROUGE-L\,(+2.25), with a decreased repetition rate (-4.98\%).
http://arxiv.org/abs/1902.07110
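A hedged sketch of a repetition-normalized reward: scale a base sequence reward down as the fraction of repeated n-grams grows. The exact normalization used in the paper may differ; this only illustrates the idea.

```python
def repetition_rate(tokens, n=2):
    """Fraction of repeated n-grams in the generated sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def normalized_reward(base_reward, tokens, alpha=1.0, n=2):
    """Shrink the reward as repetition increases (alpha sets the strength)."""
    return base_reward * (1.0 - alpha * repetition_rate(tokens, n))

summary = "the cat sat on the mat the cat sat".split()
print(repetition_rate(summary))            # > 0: "the cat", "cat sat" repeat
print(normalized_reward(0.8, summary))
```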
We study the problem of localizing a configuration of points and planes from the collection of point-to-plane distances. This problem models simultaneous localization and mapping from acoustic echoes as well as the notable “structure from sound” approach to microphone localization with unknown sources. In our earlier work we proposed computational methods for localization from point-to-plane distances and noted that such localization suffers from various ambiguities beyond the usual rigid body motions; in this paper we provide a complete characterization of uniqueness. We enumerate equivalence classes of configurations which lead to the same distance measurements as a function of the number of planes and points, and algebraically characterize the related transformations in both 2D and 3D. Here we only discuss uniqueness; computational tools and heuristics for practical localization from point-to-plane distances using sound will be addressed in a companion paper.
http://arxiv.org/abs/1902.09959
This paper presents a new design approach for wearable robots that tackles the three barriers to mainstream practical use of exoskeletons, namely discomfort, device weight, and symbiotic control of the exoskeleton-human co-robot system. The hybrid exoskeleton approach, demonstrated in a soft knee industrial exoskeleton case, mitigates wearer discomfort by avoiding the drawbacks of both rigid exoskeletons and textile-based soft exosuits. Quasi-direct-drive actuation using high-torque-density motors minimizes the weight of the device and provides high backdrivability that does not restrict natural movement. We derive a biomechanics model that is generic to both squat and stoop lifting motions. The control algorithm symbiotically detects posture using compact inertial measurement unit (IMU) sensors to generate an assistive profile that is proportional to the biological torque predicted by our model. Experimental results demonstrate that the robot exhibits 1.5 Nm torque when unpowered and 0.5 Nm torque under zero-torque tracking control. The efficacy of injury prevention is demonstrated with one healthy subject. The root mean square (RMS) error of torque tracking is less than 0.29 Nm (1.21% of the 24 Nm peak torque) for 50% assistance of biological torque. Compared to squatting without the exoskeleton, the maximum amplitude of knee extensor muscle activity (rectus femoris), measured by electromyography (EMG) sensors, is reduced by 30% with 50% assistance of biological torque.
http://arxiv.org/abs/1902.07106
Traditionally, machine learning algorithms have focused on modeling the dynamics of a given dataset for which all features are available for free. However, concerns such as monetary data collection costs, patient discomfort in medical procedures, and the privacy impact of data collection require careful consideration in any health analytics system. An efficient solution would acquire only a subset of features, based on the value each provides, while accounting for acquisition costs. Moreover, datasets that provide feature costs are very limited, especially in healthcare. In this paper, we provide a health dataset as well as a method for assigning feature costs based on the total level of inconvenience that asking for each feature entails. Furthermore, based on the suggested dataset, we provide a comparison of recent and state-of-the-art approaches to cost-sensitive feature acquisition and learning. Specifically, we analyze the performance of major sensitivity-based and reinforcement-learning-based methods from the literature on three different problems in the health domain: diabetes, heart disease, and hypertension classification.
http://arxiv.org/abs/1902.07102
Outdoor videos sometimes contain unexpected rain streaks due to rainy weather, which negatively affect subsequent computer vision applications such as video surveillance, object recognition, and tracking. In this paper, we propose a directional regularized tensor-based video deraining model that takes into consideration the arbitrary direction of rain streaks. In particular, the sparsity of rain streaks in the spatial and derivative domains, together with the spatiotemporal sparsity and low-rank property of the video background, is incorporated into the proposed method. Unlike many previous methods that assume vertically falling rain streaks, we adopt the more realistic assumption that all rain streaks in a video fall in an approximately similar, arbitrary direction. The resulting complicated optimization problem is effectively solved through an alternating direction method. Comprehensive experiments on both synthetic and realistic datasets demonstrate the superiority of the proposed deraining method.
http://arxiv.org/abs/1902.07090
High-quality dehazing performance is highly dependent upon accurate estimation of the transmission map. In this work, a coarse estimate is first obtained by the weighted fusion of two different transmission maps, generated from the foreground and sky regions, respectively. A hybrid variational model with promoted regularization terms is then proposed to assist in refining the transmission map. The resulting complicated optimization problem is effectively solved via an alternating direction algorithm. The final haze-free image can then be obtained from the refined transmission map and the atmospheric scattering model. Our dehazing framework has the capacity to preserve important image details while suppressing undesirable artifacts, even for hazy images with large sky regions. Experiments on both synthetic and realistic images illustrate that the proposed method is competitive with, or even outperforms, state-of-the-art dehazing techniques under different imaging conditions.
http://arxiv.org/abs/1902.07069
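The recovery step mentioned above follows the standard atmospheric scattering model I = J·t + A·(1 - t): given a refined transmission map t and estimated atmospheric light A, the haze-free image is J = (I - A)/max(t, t0) + A. The sketch below shows this generic recovery, not the paper's variational refinement.

```python
import numpy as np

def recover_scene(hazy: np.ndarray, transmission: np.ndarray,
                  atmospheric_light: np.ndarray, t0: float = 0.1) -> np.ndarray:
    """hazy: (H, W, 3) in [0, 1]; transmission: (H, W); A: (3,)."""
    t = np.clip(transmission, t0, 1.0)[..., None]   # avoid division blow-up
    J = (hazy - atmospheric_light) / t + atmospheric_light
    return np.clip(J, 0.0, 1.0)

I = np.random.rand(4, 4, 3)          # toy hazy image
t = np.full((4, 4), 0.6)             # toy refined transmission map
A = np.array([0.9, 0.9, 0.9])        # toy atmospheric light estimate
print(recover_scene(I, t, A).shape)  # (4, 4, 3)
```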
In this work, we propose the first quantum Ansätze for statistical relational learning on knowledge graphs using parametric quantum circuits. We introduce two types of variational quantum circuits for knowledge graph embedding. Inspired by classical representation learning, we first consider latent features of entities as coefficients of quantum states, while predicates are characterized by parametric gates acting on those states. For this first model, however, the quantum advantages disappear when it comes to optimization. We therefore introduce a second quantum circuit model in which embeddings of entities are generated from parameterized quantum gates acting on the pure quantum state. The benefit of the second method is that the quantum embeddings can be trained efficiently while preserving the quantum advantages. We show that the proposed methods can achieve results comparable to state-of-the-art classical models, e.g., RESCAL and DistMult. Furthermore, after optimizing the models, the complexity of inductive inference on the knowledge graphs might be reduced with respect to the number of entities.
http://arxiv.org/abs/1903.00556
This paper proposes a low-algorithmic-latency adaptation of the deep clustering approach to speaker-independent speech separation. It consists of three parts: (a) the use of long short-term memory (LSTM) networks instead of the bidirectional variant used in the original work, (b) a short synthesis window (here 8 ms) required for low-latency operation, and (c) a buffer at the beginning of the audio mixture used to estimate cluster centres corresponding to the constituent speakers, which are then used to separate the speakers in the rest of the signal. The buffer duration serves as an initialization phase, after which the system is capable of operating with 8 ms algorithmic latency. We evaluate our proposed approach on two-speaker mixtures from the Wall Street Journal (WSJ0) corpus. We observe that the use of LSTM yields around 1 dB lower source-to-distortion ratio (SDR) compared to the baseline bidirectional LSTM. Moreover, using an 8 ms synthesis window instead of 32 ms degrades the separation performance by around 2.1 dB compared to the baseline. Finally, we also report separation performance for different buffer durations, noting that separation can be achieved even for buffer durations as low as 300 ms.
http://arxiv.org/abs/1902.07033
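A hedged sketch of the buffer idea in deep clustering separation: run k-means on the embeddings of the initial buffer to find one centre per speaker, then assign each later time-frequency bin to the nearest centre to build binary masks. The embeddings below are random stand-ins for a trained network's output.

```python
import numpy as np
from sklearn.cluster import KMeans

D = 20                                        # embedding dimension
buffer_emb = np.random.randn(3000, D)         # TF-bin embeddings from buffer
kmeans = KMeans(n_clusters=2, n_init=10).fit(buffer_emb)

# Streaming phase: assign each new frame's TF bins to the fixed centres.
frame_emb = np.random.randn(129, D)           # one frame, 129 frequency bins
assignments = kmeans.predict(frame_emb)       # nearest centre per bin
mask_speaker0 = (assignments == 0).astype(float)
mask_speaker1 = 1.0 - mask_speaker0
print(mask_speaker0.sum(), mask_speaker1.sum())
```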
We present a novel graph-based neural network model for relation extraction. Our model treats multiple pairs in a sentence simultaneously and considers interactions among them. All the entities in a sentence are placed as nodes in a fully-connected graph structure. The edges are represented with position-aware contexts around the entity pairs. In order to consider different relation paths between two entities, we construct walks of up to length l between each pair. The resulting walks are merged and iteratively used to update the edge representations into longer-walk representations. We show that the model achieves performance comparable to state-of-the-art systems on the ACE 2005 dataset without using any external tools.
http://arxiv.org/abs/1902.07023
Vision-based person, hand, and face detection approaches have achieved incredible success in recent years with the development of deep convolutional neural networks (CNNs). In this paper, we take the inherent correlation between the body and body parts into account and propose a new framework to boost the detection performance for such multi-level objects. In particular, we adopt a region-based object detection structure with two carefully designed detectors that separately attend to the human body and body parts in a coarse-to-fine manner, which we call the Detector-in-Detector network (DID-Net). The first detector is designed to detect human body, hand, and face. The second detector, based on the body detection results of the first, mainly focuses on detecting the small hands and faces inside each body. The framework is trained end-to-end by optimizing a multi-task loss. Due to the lack of a joint human body, face, and hand detection dataset, we have collected and labeled a new large dataset named Human-Parts, with 14,962 images and 106,879 annotations. Experiments show that our method achieves excellent performance on Human-Parts.
http://arxiv.org/abs/1902.07017
Deep Reinforcement Learning has shown great success in a variety of control tasks. However, it is unclear how close we are to the vision of putting Deep RL into practice to solve real world problems. In particular, common practice in the field is to train policies on largely deterministic simulators and to evaluate algorithms through training performance alone, without a train/test distinction to ensure models generalise and are not overfitted. Moreover, it is not standard practice to check for generalisation under domain shift, although robustness to such system change between training and testing would be necessary for real-world Deep RL control, for example, in robotics. In this paper we study these issues by first characterising the sources of uncertainty that provide generalisation challenges in Deep RL. We then provide a new benchmark and thorough empirical evaluation of generalisation challenges for state of the art Deep RL methods. In particular, we show that, if generalisation is the goal, then common practice of evaluating algorithms based on their training performance leads to the wrong conclusions about algorithm choice. Finally, we evaluate several techniques for improving generalisation and draw conclusions about the most robust techniques to date.
http://arxiv.org/abs/1902.07015
In this paper we show how different choices regarding compliance affect a dual-arm assembly task. In addition, we present how the compliance parameters can be learned from a human demonstration. Compliant motions can be used in assembly tasks to mitigate pose errors originating from, for example, inaccurate grasping. We present the analytical background and accompanying experimental results on how to choose the center of compliance to enhance the convergence region of an alignment task. We then present possible ways of choosing the compliant axes for accomplishing alignment in a scenario where orientation error is present. We show that a previously presented Learning from Demonstration method can be used to learn the motion and compliance parameters of an impedance controller for both manipulators. The learning requires a human demonstration with a single teleoperated manipulator only, easing the execution of the demonstration and also enabling the use of manipulators in hard-to-reach locations. Finally, we experimentally verify our claim that making both manipulators compliant in both rotation and translation accomplishes the alignment task with less total joint motion and in a shorter time than moving one manipulator only. In addition, we show that the learning method produces the parameters that achieve the best results in our experiments.
http://arxiv.org/abs/1902.07007
Virtual borders are employed to allow users the flexible and interactive definition of their mobile robots' workspaces and to ensure socially aware navigation in human-centered environments. They have been successfully defined using methods from human-robot interaction in which a user directly interacts with the robot. However, since we have recently witnessed the emergence of network robot systems (NRS) that enhance the perceptual and interaction abilities of a robot, we investigate the effect of such an NRS on the teaching of virtual borders and ask whether an intelligent environment can improve the teaching process. For this purpose, we propose an interaction method based on an NRS with a laser pointer as the interaction device. This interaction method comprises an architecture that integrates robots into intelligent environments to support the teaching process in terms of interaction and feedback, cooperation between stationary and mobile cameras to perceive laser spots, and an algorithm that extracts virtual borders from multiple camera observations. Our experimental results, acquired from 15 participants' performances, show that our system is equally successful and accurate while featuring a significantly lower teaching time and a higher user experience compared to an approach without the support of an NRS.
http://arxiv.org/abs/1902.06997
The Image Source Method (ISM) is one of the most widely employed techniques to calculate acoustic Room Impulse Responses (RIRs); however, its computational complexity grows quickly with the reverberation time of the room, and its computation time can be prohibitive for applications where a huge number of RIRs is needed. In this paper, we present a new implementation that dramatically improves the computation speed of the ISM by using Graphics Processing Units (GPUs) to parallelize both the simulation of multiple RIRs and the computation of the images inside each RIR. We provide a Python library under the GNU license that can be easily used without any knowledge of GPU programming, and we show that it is about 100 times faster than other state-of-the-art CPU libraries.
http://arxiv.org/abs/1810.11359
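A simplified, CPU-only sketch of the Image Source Method for a 1-D "room" (two parallel walls), to illustrate what the GPU library above parallelizes for full 3-D shoebox rooms: each reflection is modeled as a mirrored image of the source, adding an attenuated, delayed copy of the impulse to the RIR.

```python
import numpy as np

def ism_1d(src, rcv, room_len, beta, fs=16000, c=343.0, max_order=50):
    """src, rcv: positions in metres; beta: wall reflection coefficient."""
    rir = np.zeros(int(fs * 0.5))                 # 0.5 s impulse response
    for n in range(-max_order, max_order + 1):
        # Image sources sit at 2*n*L + src (|2n| reflections) and
        # 2*n*L - src (|2n - 1| reflections) for walls at x=0 and x=L.
        for img, n_refl in ((2 * n * room_len + src, abs(2 * n)),
                            (2 * n * room_len - src, abs(2 * n - 1))):
            dist = abs(img - rcv)
            sample = int(round(dist / c * fs))
            if 0 < dist and sample < len(rir):
                rir[sample] += beta ** n_refl / dist   # delayed, attenuated
    return rir

rir = ism_1d(src=1.0, rcv=3.5, room_len=5.0, beta=0.9)
print(np.argmax(rir) / 16000, "s (direct path delay)")
```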
Face de-identification algorithms have been developed in response to the prevalent use of public video recordings and surveillance cameras. Here, we evaluated the success of identity masking in the context of monitoring drivers as they actively operate a motor vehicle. We compared the effectiveness of eight de-identification algorithms using human perceivers. The algorithms we tested included the personalized supervised bilinear regression method for Facial Action Transfer (FAT), the DMask method, which renders a generic avatar face, and two edge-detection methods implemented with and without image polarity inversion (Canny, Scharr). We also used an Overmask approach that combined the FAT and Canny methods. We compared these identity masking methods to identification of an unmasked video of the driver. Human subjects were tested in a standard face recognition experiment in which they learned driver identities with a high resolution (studio-style) image, and were tested subsequently on their ability to recognize masked and unmasked videos of these individuals driving. All masking methods lowered identification accuracy substantially, relative to the unmasked video. The most successful methods, DMask and Canny, lowered human identification performance to near random. In all cases, identifications were made with stringent decision criteria indicating the subjects had low confidence in their decisions. We conclude that carefully tested de-identification approaches, used alone or in combination, can be an effective tool for protecting the privacy of individuals captured in videos. Future work should examine how the most effective methods fare in preserving facial action recognition.
http://arxiv.org/abs/1902.06967
Deep generative models like variational autoencoders approximate the intrinsic geometry of high-dimensional data manifolds by learning low-dimensional latent-space variables and an embedding function. The geometric properties of these latent spaces have been studied under the lens of Riemannian geometry, via analysis of the non-linearity of the generator function. More recently, deep generative models have been used for learning semantically meaningful `disentangled' representations that capture task-relevant attributes while being invariant to other attributes. In this work, we explore the geometry of popular generative models for disentangled representation learning. We use several metrics to compare the properties of the latent spaces of disentangled representation models in terms of class separability and curvature. Our results establish that the class-distinguishing features in the disentangled latent space exhibit higher curvature than in a variational autoencoder. We evaluate and compare the geometry of three such models with a variational autoencoder on two different datasets. Further, our results show that distances and interpolation in the latent space are significantly improved with Riemannian metrics derived from the curvature of the space. We expect these results to have implications for understanding how deep networks can be made more robust, generalizable, and interpretable.
http://arxiv.org/abs/1902.06964
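A sketch of the Riemannian view used in this line of work: the generator g pulls back the Euclidean metric of data space to the latent space as M(z) = J(z)^T J(z), where J is the Jacobian of g, and curve lengths under M give the latent-space distances the abstract refers to. The toy decoder below is an illustrative stand-in.

```python
import torch

g = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.Tanh(),
                        torch.nn.Linear(16, 5))    # toy decoder/generator

def pullback_metric(z: torch.Tensor) -> torch.Tensor:
    J = torch.autograd.functional.jacobian(g, z)    # (5, 2) at this z
    return J.T @ J                                  # (2, 2) metric tensor

z = torch.zeros(2)
M = pullback_metric(z)
v = torch.tensor([1.0, 0.0])
print(torch.sqrt(v @ M @ v))   # length of a small step along v under M
```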
Environmental air quality affects people's lives, and obtaining real-time, accurate air quality measurements has profound significance for guiding social activities. At present, air quality is mainly measured by placing detectors at specific monitoring points in cities and sampling at fixed times, an approach easily restricted by time and space. Existing deep learning approaches to air quality measurement mostly train a single convolutional neural network on the whole image, which ignores the differences between different parts of the image. In this paper, we propose an air quality measurement method based on double-channel convolutional neural network ensemble learning to solve the problem of feature extraction for different parts of environmental images. Our method has two main components: ensemble learning with a double-channel convolutional neural network, and self-learning weighted feature fusion. We construct a double-channel convolutional neural network and use each channel to extract features from a different part of the environment image. We propose a feature-weight self-learning method that weights and concatenates the extracted feature vectors and uses the fused vector to measure air quality. Our method can be applied to two tasks: air quality grade measurement and air quality index (AQI) measurement. Moreover, we build an environmental image dataset captured at random times and locations. Experiments show that our method achieves nearly 82% average accuracy and a small average absolute error on our test set, and contrast experiments show a considerable increase in performance compared with single-channel convolutional neural network air quality measurements.
http://arxiv.org/abs/1902.06942
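A hedged sketch of the double-channel idea above: two small CNN branches process different parts of the image (here the upper and lower halves as stand-ins for, e.g., sky and ground regions), and learnable scalar weights fuse the two feature vectors before classification. The split, sizes, and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

def branch():
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

class DoubleChannelNet(nn.Module):
    def __init__(self, n_classes=6):
        super().__init__()
        self.top = branch()                           # e.g. sky region
        self.bottom = branch()                        # e.g. ground region
        self.fusion_w = nn.Parameter(torch.ones(2))   # self-learned weights
        self.head = nn.Linear(32, n_classes)

    def forward(self, img):
        h = img.shape[2] // 2
        f_top = self.top(img[:, :, :h])               # upper-half features
        f_bot = self.bottom(img[:, :, h:])            # lower-half features
        w = torch.softmax(self.fusion_w, dim=0)
        fused = torch.cat([w[0] * f_top, w[1] * f_bot], dim=1)
        return self.head(fused)

net = DoubleChannelNet()
print(net(torch.randn(2, 3, 64, 64)).shape)   # torch.Size([2, 6])
```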
A knowledge graph (KG) uses triples to describe facts in the real world, and KGs have been widely used in intelligent analysis and applications. However, noise and conflicts are inevitably introduced during construction, while KG-based tasks and applications assume that the knowledge in the KG is completely correct, which brings about potential deviations. In this paper, we establish a knowledge graph triple trustworthiness measurement model that quantifies the semantic correctness of triples and the degree to which the facts they express are true. The model is a crisscrossing neural network structure. It synthesizes the internal semantic information of the triples and the global inference information of the KG to measure and fuse trustworthiness at three levels: the entity level, the relationship level, and the global KG level. We analyzed the validity of the model's output confidence values and conducted experiments on the real-world dataset FB15K (from Freebase) for the knowledge graph error detection task. The experimental results show that, compared with other models, our model achieves significant and consistent improvements.
http://arxiv.org/abs/1809.09414
A challenge in speech production research is to predict future tongue movements based on a short period of past tongue movements. This study tackles the speaker-dependent tongue motion prediction problem in unlabeled ultrasound videos with convolutional long short-term memory (ConvLSTM) networks. The model has been tested on two different ultrasound corpora. ConvLSTM outperforms a 3-dimensional convolutional neural network (3DCNN) in predicting the 9\textsuperscript{th} frame based on the 8 preceding frames, and also demonstrates a good capacity to predict only the tongue contours in future frames. Further tests reveal that ConvLSTM can also learn to predict tongue movements in more distant frames beyond the immediately following ones. Our code is available at: https://github.com/shuiliwanwu/ConvLstm-ultrasound-videos.
http://arxiv.org/abs/1902.06927
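A minimal Keras sketch of next-frame prediction with ConvLSTM in the spirit of the abstract above: 8 grayscale frames in, a prediction of the 9th frame out, using the standard tf.keras.layers.ConvLSTM2D layer. Frame size and layer widths are illustrative, not the paper's configuration.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8, 64, 64, 1)),      # 8 past frames
    tf.keras.layers.ConvLSTM2D(32, kernel_size=3, padding='same',
                               return_sequences=False),
    tf.keras.layers.Conv2D(1, kernel_size=3, padding='same',
                           activation='sigmoid'),     # predicted 9th frame
])
model.compile(optimizer='adam', loss='mse')

x = np.random.rand(4, 8, 64, 64, 1).astype('float32')  # toy clips
y = np.random.rand(4, 64, 64, 1).astype('float32')     # toy target frames
model.fit(x, y, epochs=1, verbose=0)
print(model.predict(x, verbose=0).shape)                # (4, 64, 64, 1)
```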
Semi-supervised and unsupervised Generative Adversarial Network (GAN)-based methods have been gaining popularity in the anomaly detection task recently. However, GAN training is somewhat challenging and unstable. Inspired by previous work in GAN-based image generation, we introduce a GAN-based anomaly detection framework, Adversarial Dual Autoencoders (ADAE), consisting of two autoencoders as generator and discriminator to increase training stability. We also employ the discriminator reconstruction error as the anomaly score for better detection performance. Experiments across datasets of varying complexity show strong evidence of a robust model that can be used in different scenarios, one of which is brain tumor detection.
http://arxiv.org/abs/1902.06924
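A hedged sketch of scoring anomalies with discriminator reconstruction error as mentioned above: both generator and discriminator are autoencoders, and a test input is scored by how badly the discriminator autoencoder reconstructs the generator's output. Architecture details and the exact scoring term are assumptions for illustration.

```python
import torch
import torch.nn as nn

def autoencoder(dim=32, bottleneck=8):
    return nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(),
                         nn.Linear(bottleneck, dim))

G = autoencoder()   # generator autoencoder
D = autoencoder()   # discriminator autoencoder

def anomaly_score(x: torch.Tensor) -> torch.Tensor:
    """Higher score = more anomalous (poorly reconstructed by D)."""
    with torch.no_grad():
        x_hat = G(x)                    # generator's reconstruction
        d_rec = D(x_hat)                # discriminator re-reconstructs it
        return ((x_hat - d_rec) ** 2).mean(dim=1)

x = torch.randn(10, 32)                 # toy test batch
print(anomaly_score(x))
```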
This paper develops a deep-learning framework to synthesize a ground-level view of a location given an overhead image. We propose a novel conditional generative adversarial network (cGAN) whose trained generator produces realistic-looking and representative ground-level images using overhead imagery as auxiliary information. The generator is an encoder-decoder network, which allows us to compare low- and high-level features, as well as their concatenation, for encoding the overhead imagery. We also demonstrate how our framework can be used to perform land cover classification by modifying the trained cGAN to extract features from overhead imagery. This is interesting because, although we are using this modified cGAN as a feature extractor for overhead imagery, it incorporates knowledge of how locations look from the ground.
http://arxiv.org/abs/1902.06923