We present batch virtual adversarial training (BVAT), a novel regularization method for graph convolutional networks (GCNs). BVAT addresses a shortcoming of GCNs: they do not consider the smoothness of the model’s output distribution against local perturbations around the input. We propose two algorithms, sample-based BVAT and optimization-based BVAT, which promote the smoothness of the model for graph-structured data by either finding virtual adversarial perturbations for a subset of nodes far from each other or generating virtual adversarial perturbations for all nodes with an optimization process. Extensive experiments on three citation network datasets (Cora, Citeseer, and Pubmed) and a knowledge graph dataset (NELL) validate the effectiveness of the proposed method, which establishes state-of-the-art results on semi-supervised node classification tasks.
http://arxiv.org/abs/1902.09192
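BVAT extends virtual adversarial training (VAT) to graphs. As background for the abstract above, here is a minimal sketch of the standard VAT perturbation (power-iteration approximation) in PyTorch; it is generic rather than the paper’s graph-specific batch variants, and `model`, `xi`, and `eps` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def vat_perturbation(model, x, xi=1e-6, eps=1.0, n_power=1):
    """Approximate the perturbation that most changes the model's output
    distribution around x (generic VAT, power-iteration approximation)."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=-1)          # current prediction, held fixed
    d = torch.randn_like(x)                      # random initial direction
    for _ in range(n_power):
        d = xi * F.normalize(d.flatten(1), dim=1).view_as(x)
        d.requires_grad_(True)
        q = F.log_softmax(model(x + d), dim=-1)
        kl = F.kl_div(q, p, reduction='batchmean')
        d = torch.autograd.grad(kl, d)[0]        # direction of steepest KL increase
    return eps * F.normalize(d.flatten(1), dim=1).view_as(x)
```

Training then adds the KL divergence between predictions at `x` and `x + vat_perturbation(model, x)` as a smoothness regularizer on unlabeled nodes.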
Sequence-to-Sequence (Seq2Seq) models have achieved encouraging performance on the dialogue response generation task. However, existing Seq2Seq-based response generation methods suffer from a low-diversity problem: they frequently generate generic responses, which make the conversation less interesting. In this paper, we address the low-diversity problem by investigating its connection with model over-confidence reflected in predicted distributions. Specifically, we first analyze the influence of the commonly used Cross-Entropy (CE) loss function, and find that the CE loss function prefers high-frequency tokens, which results in low-diversity responses. We then propose a Frequency-Aware Cross-Entropy (FACE) loss function that improves over the CE loss function by incorporating a weighting mechanism conditioned on token frequency. Extensive experiments on benchmark datasets show that the FACE loss function is able to substantially improve the diversity of existing state-of-the-art Seq2Seq response generation methods, in terms of both automatic and human evaluations.
http://arxiv.org/abs/1902.09191
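To make the weighting idea concrete, the sketch below shows one plausible instantiation of frequency-aware cross-entropy in PyTorch: per-token weights inversely proportional to corpus frequency, normalized to mean 1. This is an assumption-laden illustration, not the paper’s exact scheme; `token_counts` is an assumed precomputed statistic.

```python
import torch
import torch.nn.functional as F

def frequency_aware_ce(logits, targets, token_counts, pad_id=-100):
    """logits: (N, V); targets: (N,); token_counts: (V,) corpus token counts.
    Rare tokens receive larger weights, countering the CE bias toward
    high-frequency tokens (one simple variant of the FACE idea)."""
    freq = token_counts.float() + 1.0      # add-one smoothing
    w = 1.0 / freq                         # inverse-frequency weighting
    w = w / w.mean()                       # normalize so the average weight is 1
    return F.cross_entropy(logits, targets, weight=w, ignore_index=pad_id)
```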
The progress in autonomous driving is also due to the increased availability of vast amounts of training data for the underlying machine learning approaches. Machine learning systems are generally known to lack robustness, e.g., if the training data rarely or never covered critical situations. The challenging task of corner case detection in video, which is closely related to unusual-event and anomaly detection, aims to detect these unusual, potentially critical situations and to communicate them to the autonomous driving system (online use case). Such a system, however, could also be used in offline mode to screen vast amounts of data and select only the relevant situations for storing and (re)training machine learning algorithms. So far, approaches for corner case detection have been limited to videos recorded from a fixed camera, mostly for security surveillance. In this paper, we provide a formal definition of a corner case and propose a system framework for both the online and the offline use case that can handle video signals from front cameras of a naturally moving vehicle and can output a corner case score.
http://arxiv.org/abs/1902.09184
One of the fundamental challenges in building any intelligent tutoring system is its ability to automatically grade short student answers. A typical automatic short answer grading (ASAG) system grades student answers across multiple domains (or subjects). Grading student answers requires building a supervised machine learning model that evaluates the similarity of the student answer with the reference answer(s). We observe that, unlike typical textual similarity or entailment tasks, the notion of similarity is not universal here. On one hand, paraphrasal constructs of the language can indicate similarity independent of the domain. On the other hand, two words or phrases that are not strict synonyms of each other might mean the same thing in certain domains. Building on this observation, we propose JMD-ASAG, the first joint multi-domain deep learning architecture for automatic short answer grading, which performs domain adaptation by learning generic and domain-specific aspects from the limited domain-wise training data. JMD-ASAG not only learns the domain-specific characteristics but also overcomes the dependence on a large corpus by learning the generic characteristics from the task-specific data itself. On a large-scale industry dataset and a benchmarking dataset, we show that our model performs significantly better than existing techniques, which either learn domain-specific models or adapt a generic similarity scoring model from a large corpus. Further, on the benchmarking dataset, we report state-of-the-art results against all existing non-neural and neural models.
http://arxiv.org/abs/1902.09183
We present a novel, robust sound source localization algorithm that considers back-propagation signals. Sound propagation paths are estimated by generating direct and reflection acoustic rays via ray tracing in a backward manner. We then compute the back-propagation signals by designing and using the impulse response of the backward sound propagation based on the acoustic ray paths. To identify the 3D source position, we suggest a localization method based on the Monte Carlo localization algorithm. Candidates for the source position are determined by identifying the convergence regions of acoustic ray paths. These candidates are validated by measuring similarities between back-propagation signals, under the assumption that the back-propagation signals of different acoustic ray paths should be similar near the sound source position. By considering similarities of back-propagation signals, our approach localizes a source position with an average error of 0.51 m in a 7 m by 7 m room with a 3 m ceiling height in tested environments. We also observe 65% to 220% improvement in accuracy over the state-of-the-art method. This improvement is achieved in environments containing a moving source, an obstacle, and noise.
http://arxiv.org/abs/1902.09179
This paper aims to quantitatively explain the rationale of each prediction made by a pre-trained convolutional neural network (CNN). We propose to learn a decision tree that clarifies the specific reason for each prediction made by the CNN at the semantic level. That is, the decision tree decomposes feature representations in high conv-layers of the CNN into elementary concepts of object parts. In this way, the decision tree tells people which object parts activate which filters for the prediction and how much they contribute to the prediction score. Such semantic and quantitative explanations for CNN predictions have specific value beyond the traditional pixel-level analysis of CNNs. More specifically, our method mines all potential decision modes of the CNN, where each mode represents a common case of how the CNN uses object parts for prediction. The decision tree organizes all potential decision modes in a coarse-to-fine manner to explain CNN predictions at different fine-grained levels. Experiments have demonstrated the effectiveness of the proposed method.
http://arxiv.org/abs/1802.00121
In view of the huge success of convolutional neural networks (CNNs) for image classification and object recognition, there have been attempts to generalize the method to general graph-structured data. One major direction is based on spectral graph theory and graph signal processing. In this paper, we study the problem from a completely different perspective by introducing parallel flow decomposition of graphs. The essential idea is to decompose a graph into families of non-intersecting one-dimensional (1D) paths, after which we may apply a 1D CNN along each family of paths. We demonstrate that our method, which we call GraphFlow, is able to transfer CNN architectures to general graphs. To show the effectiveness of our approach, we test our method on the classical MNIST dataset, synthetic datasets on network information propagation, and a news article classification dataset.
http://arxiv.org/abs/1902.09173
We analyze the AI alignment problem. This is the problem of aligning an AI’s objective function with human preferences. This problem has been argued to be critical to AI safety, especially in the long run. But it has also been argued that solving it robustly is extremely challenging, especially in highly complex environments like the Internet. It seems crucial to accelerate research in this direction. To this end, we propose a preliminary research program. Our roadmap aims to decompose alignment into numerous more tractable subproblems. Our hope is that this will help scholars, engineers and decision-makers to better grasp the upcoming difficulties, and to foresee how they can best contribute to the global effort.
http://arxiv.org/abs/1809.01036
Rapid advances in image processing capabilities have been seen across many domains, fostered by the application of machine learning algorithms to “big data”. However, within the realm of medical image analysis, advances have been curtailed, in part, due to the limited availability of large-scale, well-annotated datasets. One of the main reasons for this is the high cost often associated with producing large amounts of high-quality meta-data. Recently, there has been growing interest in the application of crowdsourcing for this purpose, a technique that has proven effective for creating large-scale datasets across a range of disciplines, from computer vision to astrophysics. Despite the growing popularity of this approach, there has not yet been a comprehensive literature review to provide guidance to researchers considering crowdsourcing methodologies in their own medical imaging analysis. In this survey, we review studies applying crowdsourcing to the analysis of medical images, published prior to July 2018. We identify common approaches, challenges, and considerations, providing guidance useful to researchers adopting this approach. Finally, we discuss future opportunities for development within this emerging domain.
http://arxiv.org/abs/1902.09159
This paper uses robots to assemble pegs into holes on surfaces with different colors and textures. It especially targets the problem of peg-in-hole assembly with initial position uncertainty. Two in-hand cameras and a force-torque sensor are used to account for the position uncertainty. A program sequence comprising learning-based visual servoing, spiral search, and impedance control is implemented to perform the peg-in-hole task with feedback from the above sensors. Contributions are mainly made in the learning-based visual servoing part of the sequence, where a deep neural network is trained with various sets of synthetic data generated using the concept of domain randomization to predict where a hole is. In the experiments and analysis section, the network is analyzed and compared, and a real-world robotic system that inserts pegs into holes using the proposed method is implemented. The results show that the implemented peg-in-hole assembly system can perform successful peg-in-hole insertions on surfaces with various colors and textures, and can generally speed up the entire peg-in-hole process.
http://arxiv.org/abs/1902.09157
This work designs a mechanical tool for robots with 2-finger parallel grippers, extending the function of the robotic gripper without additional requirements on tool exchangers or other actuators. The fundamental kinematic structure of the mechanical tool is two symmetric parallelograms, which transmit the motion of the robotic gripper to the mechanical tool. Four torsion springs are attached to the four inner joints of the two parallelograms to open the tool as the robotic gripper releases. The forces and transmission are analyzed in detail to make sure the tool reacts well with respect to the gripping forces and the spring stiffness. Also, based on the kinematic structure, a variety of tooltips was designed for the mechanical tool to perform various tasks; the kinematic structure can serve as a platform for various skillful gripper designs. The designed tool can be treated as a normal object and be picked up and used via automatically planned grasps. A robot may locate the tool through the AR markers attached to the tool body, grasp the tool by selecting an automatically planned grasp, and move the tool from an arbitrary pose to a specific pose to grip objects. The robot may also determine the optimal grasps and usage according to the requirements of given tasks.
http://arxiv.org/abs/1902.09150
We present DDFlow, a data distillation approach to learning optical flow estimation from unlabeled data. The approach distills reliable predictions from a teacher network and uses these predictions as annotations to guide a student network to learn optical flow. Unlike existing work that relies on hand-crafted energy terms to handle occlusion, our approach is data-driven and learns optical flow for occluded pixels. This enables us to train our model with a much simpler loss function and achieve much higher accuracy. We conduct a rigorous evaluation on the challenging Flying Chairs, MPI Sintel, and KITTI 2012 and 2015 benchmarks, and show that our approach significantly outperforms all existing unsupervised learning methods while running in real time.
http://arxiv.org/abs/1902.09145
Since sparse unmixing has emerged as a promising approach to hyperspectral unmixing, spatial-contextual information in hyperspectral images has recently been exploited to improve unmixing performance. Total variation (TV) has been widely used to promote spatial homogeneity as well as smoothness between adjacent pixels. However, hyperspectral sparse unmixing with a TV regularization term is computationally heavy. Moreover, the convergence of traditional sparse unmixing algorithms, which are special cases of the primal alternating direction method of multipliers (pADMM), has not been explained in detail. In this paper, we design an efficient and convergent dual symmetric Gauss-Seidel ADMM (sGS-ADMM) for hyperspectral sparse unmixing with a TV regularization term. We also present global convergence and local linear convergence rate analyses for the traditional sparse unmixing algorithm and our algorithm. As demonstrated in numerical experiments, our algorithm clearly improves unmixing efficiency compared with the state-of-the-art algorithm. More importantly, we obtain images of higher quality.
http://arxiv.org/abs/1902.09135
Skeleton-based action recognition is an important task that requires an adequate understanding of the movement characteristics of a human action from the given skeleton sequence. Recent studies have shown that exploring spatial and temporal features of the skeleton sequence is vital for this task. Nevertheless, how to effectively extract discriminative spatial and temporal features is still a challenging problem. In this paper, we propose a novel Attention Enhanced Graph Convolutional LSTM Network (AGC-LSTM) for human action recognition from skeleton data. The proposed AGC-LSTM can not only capture discriminative features in spatial configuration and temporal dynamics but also explore the co-occurrence relationship between the spatial and temporal domains. We also present a temporal hierarchical architecture that increases the temporal receptive field of the top AGC-LSTM layer, which boosts the ability to learn high-level semantic representations and significantly reduces computation cost. Furthermore, to select discriminative spatial information, an attention mechanism is employed to enhance the information of key joints in each AGC-LSTM layer. Experimental results are provided on two datasets: the NTU RGB+D dataset and the Northwestern-UCLA dataset. The comparison results demonstrate the effectiveness of our approach and show that it outperforms state-of-the-art methods on both datasets.
http://arxiv.org/abs/1902.09130
Quantum computers are expected to outperform conventional computers for a range of important problems, from molecular simulation to search algorithms, once they can be scaled up to very large numbers of quantum bits (qubits), typically many millions. For most solid-state qubit architectures, e.g. those using superconducting circuits or semiconductor spins, scaling poses a significant challenge, as every additional qubit increases the heat generated, while the cooling power of dilution refrigerators is severely limited at their normal operating temperature below 100 mK. Here we demonstrate operation of a scalable silicon quantum processor unit cell, comprising two qubits, at a device temperature of $\sim$1.45 Kelvin – the temperature of pumped $^4$He. We achieve this by isolating the quantum dots (QDs) which contain the qubits from the electron reservoir, initialising and reading them solely via tunnelling of electrons between the two QDs. We coherently control the qubits using electrically-driven spin resonance (EDSR) in isotopically enriched silicon $^{28}$Si, attaining single-qubit gate fidelities of 98.6% and Ramsey coherence times of $T_2^*$ = 2 $\mu$s during 'hot' operation, comparable to those of spin qubits in natural silicon at millikelvin temperatures. Furthermore, we show that the unit cell can be operated at magnetic fields as low as 0.1 T, corresponding to a qubit control frequency $f_\textrm{qubit}$ = 3.5 GHz, where the qubit energy $hf_\textrm{qubit}$ is much smaller than the thermal energy $k_\textrm{B}T$. The quantum processor unit cell constitutes the core building block of a full-scale silicon quantum computer. Our work indicates that a spin-based quantum computer could be operated at elevated temperatures in a simple pumped $^4$He system, offering orders of magnitude higher cooling power than dilution refrigerator systems.
https://arxiv.org/abs/1902.09126
Although the fully-connected attention-based model Transformer has achieved great success on many NLP tasks, it has a heavy structure and usually requires large amounts of training data. In this paper, we present the Star-Transformer, a lightweight alternative to the Transformer. To reduce model complexity, we replace the fully-connected structure with a star-shaped structure, in which every two non-adjacent nodes are connected through a shared relay node. Thus, the Star-Transformer has lower complexity than the standard Transformer (from quadratic to linear in the input length) while preserving the ability to handle long-range dependencies. Experiments on four tasks (22 datasets) show that the Star-Transformer achieves significant improvements over the standard Transformer on modestly sized datasets.
http://arxiv.org/abs/1902.09113
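A simplified, single-head sketch of the star topology described above, written in PyTorch under several assumptions (ring edges wrap around; multi-head attention, layer norm, and the paper’s exact update schedule are omitted): each satellite token attends only to its ring neighbors, itself, and the shared relay, while the relay attends to everything.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StarLayer(nn.Module):
    """Illustrative single-head star-attention layer (not the paper's exact design)."""
    def __init__(self, d):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.scale = d ** 0.5

    def forward(self, h, relay):
        # h: (B, T, d) satellite states; relay: (B, 1, d) shared relay state
        left = torch.roll(h, shifts=1, dims=1)          # ring neighbor (wraps at edges)
        right = torch.roll(h, shifts=-1, dims=1)
        r = relay.expand_as(h)
        ctx = torch.stack([left, h, right, r], dim=2)   # (B, T, 4, d): constant-size context
        q = self.q(h).unsqueeze(2)                      # (B, T, 1, d)
        att = F.softmax((q * self.k(ctx)).sum(-1) / self.scale, dim=-1)   # (B, T, 4)
        h_new = (att.unsqueeze(-1) * self.v(ctx)).sum(2)                  # (B, T, d)
        # The relay attends over itself and all satellites.
        all_nodes = torch.cat([relay, h_new], dim=1)    # (B, T+1, d)
        ar = F.softmax(self.q(relay) @ self.k(all_nodes).transpose(1, 2) / self.scale, dim=-1)
        relay_new = ar @ self.v(all_nodes)              # (B, 1, d)
        return h_new, relay_new
```

Because each satellite attends over a constant-size context, the per-layer cost is linear in sequence length, and any two tokens are connected through the relay in two hops.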
In this work, we study the power of Saak features as an effort towards interpretable deep learning. Inspired by the operations of the convolutional layers of convolutional neural networks, the multi-stage Saak transform was proposed. On this foundation, we provide an in-depth examination of Saak features, which are coefficients of the Saak transform, by analyzing their properties through visualization and demonstrating their applications in image classification. Like CNN features, Saak features at later stages have larger receptive fields, yet they are obtained in a one-pass feedforward manner without backpropagation. The whole feature extraction process is transparent and of extremely low complexity. The discriminant power of Saak features is demonstrated, and their classification performance on three well-known datasets (namely, MNIST, CIFAR-10, and STL-10) is shown by experimental results.
http://arxiv.org/abs/1902.09107
Features from multiple scales can greatly benefit the semantic edge detection task if they are well fused. However, the prevalent semantic edge detection methods apply a fixed-weight fusion strategy in which images with different semantics are forced to share the same weights, resulting in universal fusion weights for all images and locations regardless of their different semantics or local context. In this work, we propose a novel dynamic feature fusion strategy that adaptively assigns different fusion weights to different input images and locations. This is achieved by a proposed weight learner that infers proper fusion weights over multi-level features for each location of the feature map, conditioned on the specific input. In this way, the heterogeneity in the contributions made by different locations of feature maps and input images can be better considered, helping produce more accurate and sharper edge predictions. We show that our model with the novel dynamic feature fusion is superior to both fixed-weight fusion and naïve location-invariant weight fusion methods, via comprehensive experiments on the Cityscapes and SBD benchmarks. In particular, our method outperforms all existing well-established methods and achieves a new state-of-the-art.
http://arxiv.org/abs/1902.09104
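The sketch below illustrates the location-adaptive fusion idea in PyTorch, under assumed shapes: a small weight learner predicts K per-pixel fusion weights from a feature map and uses them to combine K multi-level side outputs, instead of a single global weight per level. The module layout is an assumption, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Location-adaptive fusion of K multi-level edge maps (illustrative sketch)."""
    def __init__(self, in_ch, k):
        super().__init__()
        self.weight_learner = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, k, 1),          # one fusion weight per level, per pixel
        )

    def forward(self, feat, side_outputs):
        # feat: (B, C, H, W) conditioning feature; side_outputs: K maps of (B, 1, H, W)
        w = torch.softmax(self.weight_learner(feat), dim=1)   # (B, K, H, W)
        stacked = torch.cat(side_outputs, dim=1)              # (B, K, H, W)
        return (w * stacked).sum(dim=1, keepdim=True)         # fused edge map
```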
Accurate relative pose is one of the key components in visual odometry (VO) and simultaneous localization and mapping (SLAM). Recently, the self-supervised learning framework that jointly optimizes the relative pose and target image depth has attracted the attention of the community. Previous works rely on the photometric error generated from depths and poses between adjacent frames, which contains large systematic errors in realistic scenes due to reflective surfaces and occlusions. In this paper, we bridge the gap between geometric loss and photometric loss by introducing a matching loss constrained by epipolar geometry in a self-supervised framework. Evaluated on the KITTI dataset, our method outperforms state-of-the-art unsupervised ego-motion estimation methods by a large margin. The code and data are available at https://github.com/hlzz/DeepMatchVO.
http://arxiv.org/abs/1902.09103
Recent advances in deep reinforcement learning for locomotion using continuous control have raised the interest of game makers in the potential of digital actors using active ragdoll. Currently, the available options to develop these ideas are either researchers’ limited codebases or proprietary closed systems. We present Marathon Environments, a suite of open source, continuous control benchmarks implemented on the Unity game engine, using the Unity ML-Agents Toolkit. We demonstrate through these benchmarks that continuous control research is transferable to a commercial game engine. Furthermore, we exhibit the robustness of these environments by reproducing advanced continuous control research, such as learning to walk, run, and backflip from motion capture data; learning to navigate complex terrains; and implementing a video game input control system. We show further robustness by training with alternative algorithms found in OpenAI Baselines. Finally, we share strategies for significantly reducing the training time.
http://arxiv.org/abs/1902.09097
Finding compact representations of videos is an essential component of almost every problem related to video processing or understanding. In this paper, we propose a generative model to learn compact latent codes that can efficiently represent and reconstruct a video sequence from its missing or under-sampled measurements. We use a generative network that is trained to map a compact code into an image. We first demonstrate that if a video sequence belongs to the range of the pretrained generative network, then we can recover it by estimating the underlying compact latent codes. We then demonstrate that even if the video sequence does not belong to the range of a pretrained network, we can still recover the true video sequence by jointly updating the latent codes and the weights of the generative network. To avoid overfitting, we regularize the recovery problem by imposing low-rank and similarity constraints on the latent codes of neighboring frames in the video sequence. We use our methods to recover a variety of videos from compressive measurements at different compression rates. We also demonstrate that we can generate missing frames in a video sequence by interpolating the latent codes of the observed frames in the low-dimensional space.
http://arxiv.org/abs/1902.11132
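A minimal sketch of the recovery step described above, assuming a pretrained generator `gen`, a differentiable measurement operator `A`, and measurements `y` are given; all names are hypothetical. It optimizes the per-frame latent codes with a similarity penalty tying neighboring frames together (the paper’s low-rank constraint and joint weight updates are omitted).

```python
import torch

def recover_video(gen, z_init, A, y, steps=500, lr=0.01, lam=0.1):
    """Recover latent codes for T frames from compressive measurements y ~ A(gen(z))."""
    z = z_init.clone().requires_grad_(True)       # (T, latent_dim)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        frames = gen(z)                           # (T, C, H, W)
        data_term = ((A(frames) - y) ** 2).sum()  # measurement consistency
        sim_term = ((z[1:] - z[:-1]) ** 2).sum()  # neighboring codes stay close
        loss = data_term + lam * sim_term
        opt.zero_grad(); loss.backward(); opt.step()
    return gen(z).detach(), z.detach()
```

Interpolating between recovered codes of adjacent frames in the latent space then yields plausible in-between frames, as the abstract describes.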
Question Answering (QA), as a research field, has primarily focused on either knowledge bases (KBs) or free text as a source of knowledge. These two sources have historically shaped the kinds of questions that are asked over these sources, and the methods developed to answer them. In this work, we look towards a practical use-case of QA over user-instructed knowledge that uniquely combines elements of both structured QA over knowledge bases, and unstructured QA over narrative, introducing the task of multi-relational QA over personal narrative. As a first step towards this goal, we make three key contributions: (i) we generate and release TextWorldsQA, a set of five diverse datasets, where each dataset contains dynamic narrative that describes entities and relations in a simulated world, paired with variably compositional questions over that knowledge, (ii) we perform a thorough evaluation and analysis of several state-of-the-art QA models and their variants at this task, and (iii) we release a lightweight Python-based framework we call TextWorlds for easily generating arbitrary additional worlds and narrative, with the goal of allowing the community to create and share a growing collection of diverse worlds as a test-bed for this task.
http://arxiv.org/abs/1902.09093
Transfer learning aims to solve data sparsity in a target domain by applying information from a source domain. Given a sequence (e.g. a natural language sentence), transfer learning is usually enabled by a recurrent neural network (RNN), which models the sequence with a chain of repeating cells and transfers sequential information. However, previous studies of neural-network-based transfer learning simply represent the whole sentence by a single vector, which is infeasible for seq2seq models and sequence labeling. Meanwhile, such layer-wise transfer learning mechanisms lose the fine-grained cell-level information from the source domain. In this paper, we propose aligned recurrent transfer (ART) to achieve cell-level information transfer. ART operates under the pre-training framework. Each cell attentively accepts transferred information from a set of positions in the source domain, so ART learns cross-domain word collocations in a more flexible way. We conducted extensive experiments on both sequence labeling tasks (POS tagging, NER) and sentence classification (sentiment analysis). ART outperforms the state of the art in all experiments.
http://arxiv.org/abs/1902.09092
This paper focuses on how to take advantage of external knowledge bases (KBs) to improve recurrent neural networks for machine reading. Traditional methods that exploit knowledge from KBs encode knowledge as discrete indicator features. Not only do these features generalize poorly, but they require task-specific feature engineering to achieve good performance. We propose KBLSTM, a novel neural model that leverages continuous representations of KBs to enhance the learning of recurrent neural networks for machine reading. To effectively integrate background knowledge with information from the currently processed text, our model employs an attention mechanism with a sentinel to adaptively decide whether to attend to background knowledge and which information from KBs is useful. Experimental results show that our model achieves accuracies that surpass the previous state-of-the-art results for both entity extraction and event extraction on the widely used ACE2005 dataset.
http://arxiv.org/abs/1902.09091
Short text matching often faces the challenge that there is great word mismatch and expression diversity between the two texts, which is further aggravated in languages like Chinese, where there is no natural space to segment words explicitly. In this paper, we propose a novel lattice-based CNN model (LCNs) that utilizes multi-granularity information inherent in the word lattice while maintaining a strong ability to deal with the introduced noisy information, for matching-based question answering in Chinese. We conduct extensive experiments on both document-based and knowledge-based question answering tasks, and the results show that the LCNs models can significantly outperform state-of-the-art matching models and strong baselines by taking advantage of a better ability to distill rich but discriminative information from the word lattice input.
http://arxiv.org/abs/1902.09087
The risk of unauthorized remote access to streaming video from networked cameras underlines the need for stronger privacy safeguards. Towards this end, we simulate a lens-free coded aperture (CA) camera as an appearance encoder, i.e., the first layer of privacy protection. Our goal is human action recognition from coded aperture videos, for which the coded aperture mask is unknown and no reconstruction is required. We insert a second layer of privacy protection by using non-invertible motion features based on phase correlation and the log-polar transformation. Phase correlation encodes translation, while the log-polar transformation encodes in-plane rotation and scaling. We show that the translation features have the key property of being mask-invariant, which allows us to simplify the training of classifiers by removing reliance on a specific mask design. Results based on a subset of the UCF and NTU datasets show the feasibility of our system.
http://arxiv.org/abs/1902.09085
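Phase correlation itself is a standard algorithm; a compact NumPy version of the translation estimate is sketched below (the paper’s coded-aperture pipeline and log-polar features are not reproduced here).

```python
import numpy as np

def phase_correlation(f, g):
    """Estimate the translation between images f and g via the
    normalized cross-power spectrum (standard phase correlation)."""
    F1, F2 = np.fft.fft2(f), np.fft.fft2(g)
    R = F1 * np.conj(F2)
    R /= np.abs(R) + 1e-12                       # keep only the phase
    corr = np.fft.ifft2(R).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # unwrap shifts larger than half the image size
    if dy > f.shape[0] // 2: dy -= f.shape[0]
    if dx > f.shape[1] // 2: dx -= f.shape[1]
    return dy, dx
```

Because the unknown mask acts as a fixed convolution on both frames, its phase contribution cancels in the cross-power spectrum, which is the intuition behind the mask-invariance property claimed above.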
Pedestrian detection plays an important role in many applications, such as autonomous driving. We propose a method that explores semantic segmentation results as self-attention cues to significantly improve pedestrian detection performance. Specifically, a multi-task network is designed to jointly learn semantic segmentation and pedestrian detection from image datasets with weak box-wise annotations. The semantic segmentation feature maps are concatenated with the corresponding convolutional feature maps to provide more discriminative features for pedestrian detection and classification. By jointly learning segmentation and detection, our proposed pedestrian self-attention mechanism can effectively identify pedestrian regions and suppress background. In addition, we propose to incorporate semantic attention information from multi-scale layers into a deep convolutional neural network to boost pedestrian detection. Experimental results show that the proposed method achieves the best detection performance, with a miss rate (MR) of 6.27% on the Caltech dataset, and obtains competitive performance on the CityPersons dataset while maintaining high computational efficiency.
http://arxiv.org/abs/1902.09080
Conventional speaker recognition frameworks (e.g., the i-vector and CNN-based approaches) have been successfully applied to various tasks when the channel of the enrolment dataset is similar to that of the test dataset. However, in real-world applications, mismatch always exists between these two datasets, which may severely deteriorate recognition performance. Previously, a few channel compensation algorithms have been proposed, such as Linear Discriminant Analysis (LDA) and Probabilistic LDA. However, these methods require collecting recordings over different channels from a specific speaker, which is unrealistic in real scenarios. Inspired by domain adaptation, we propose a novel deep-learning-based speaker recognition framework that learns channel-invariant and speaker-discriminative speech representations via channel adversarial training. Specifically, we first employ a gradient reversal layer to remove variations across different channels. Then, the compressed information is projected into the same subspace by adversarial training. Experiments on test datasets with 54,133 speakers demonstrate that the proposed method is not only effective at alleviating the channel mismatch problem but also outperforms state-of-the-art speaker recognition methods. Compared with the i-vector-based and CNN-based methods, our proposed method achieves significant relative improvements of 44.7% and 22.6%, respectively, in terms of Top-1 recall.
http://arxiv.org/abs/1902.09074
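The gradient reversal layer mentioned above is a standard construction; here is a minimal PyTorch sketch: identity in the forward pass, gradients scaled by -lambda in the backward pass, so the feature extractor is trained to confuse the channel classifier.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity forward; flips and scales the gradient on the way back."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # no gradient for lambd

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: channel_logits = channel_classifier(grad_reverse(features))
# so minimizing the channel loss pushes `features` toward channel invariance.
```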
Generative Adversarial Networks (GANs) have become a powerful framework for learning generative models that arise across a wide variety of domains. While there has been a recent surge in the development of numerous GAN architectures with distinct optimization metrics, we still lack an understanding of how far such GANs are from optimality. In this paper, we make progress on a theoretical understanding of GANs under a simple linear-generator Gaussian-data setting, where the optimal maximum-likelihood generator is known to perform Principal Component Analysis (PCA). We find that the original GAN by Goodfellow et al. fails to recover the optimal PCA solution. On the other hand, we show that Wasserstein GAN can perform PCA, and hence it may serve as a basis for an optimal GAN architecture that yields the optimal generator for a wide range of data settings.
https://arxiv.org/abs/1902.09073
In this work, we consider applying machine learning to the analysis and compression of audio signals in the context of monitoring elephants in sub-Saharan Africa. Earth’s biodiversity is increasingly under threat from sources of anthropogenic change (e.g. resource extraction, land use change, and climate change), and surveying animal populations is critical for developing conservation strategies. However, manually monitoring tropical forests or deep oceans is intractable. For species that communicate acoustically, researchers have argued for placing audio recorders in the habitats as a cost-effective and non-invasive method, a strategy known as passive acoustic monitoring (PAM). In collaboration with conservation efforts, we construct a large labeled dataset of passive acoustic recordings of the African Forest Elephant via crowdsourcing, comprising thousands of hours of recordings in the wild. Using state-of-the-art techniques in artificial intelligence, we improve upon previously proposed methods for passive acoustic monitoring for classification and segmentation. In real-time detection of elephant calls, network bandwidth quickly becomes a bottleneck, and efficient ways to compress the data are needed. Most audio compression schemes are aimed at human listeners and are unsuitable for low-frequency elephant calls. To remedy this, we provide a novel end-to-end differentiable method for compression of audio signals that can be adapted to acoustic monitoring of any species and dramatically improves over naive coding strategies.
http://arxiv.org/abs/1902.09069
In a mixed-traffic scenario where both autonomous vehicles and human-driven vehicles exist, timely prediction of the driving intentions of nearby human-driven vehicles is essential for the safe and efficient driving of an autonomous vehicle. In this paper, a driving intention prediction method based on the Hidden Markov Model (HMM) is proposed for autonomous vehicles. HMMs representing different driving intentions are trained and tested with field data collected from a flyover. When training the models, either a discrete or a continuous characterization of the mobility features of vehicles is applied. Experimental results show that the HMMs trained with the continuous characterization of mobility features give higher prediction accuracy when used for predicting driving intentions. Moreover, when the surrounding traffic of the vehicle is taken into account, the performance of the proposed prediction method is further improved.
http://arxiv.org/abs/1902.09068
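A hedged sketch of the per-intention HMM classification scheme described above, using the third-party hmmlearn package; the number of hidden states, feature dimensions, and intention labels are assumptions for illustration.

```python
import numpy as np
from hmmlearn import hmm  # third-party package

def train_intention_models(sequences_by_intention, n_states=4):
    """Fit one Gaussian HMM per driving intention.
    sequences_by_intention: dict mapping intention name -> list of (T_i, D) arrays
    of continuous mobility features (e.g., speed, lateral offset)."""
    models = {}
    for intent, seqs in sequences_by_intention.items():
        X = np.concatenate(seqs)                # stacked observations
        lengths = [len(s) for s in seqs]        # per-sequence lengths
        m = hmm.GaussianHMM(n_components=n_states, covariance_type='diag', n_iter=50)
        m.fit(X, lengths)
        models[intent] = m
    return models

def predict_intention(models, obs):
    # choose the intention whose HMM assigns the highest log-likelihood to obs
    return max(models, key=lambda k: models[k].score(obs))
```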
Semantic segmentation of medical images aims to associate a pixel with a label in a medical image without human initialization. The success of semantic segmentation algorithms is contingent on the availability of high-quality imaging data with corresponding labels provided by experts. We sought to create a large collection of annotated medical image datasets of various clinically relevant anatomies, available under open source license, to facilitate the development of semantic segmentation algorithms. Such a resource would allow: 1) objective assessment of general-purpose segmentation methods through comprehensive benchmarking and 2) open and free access to medical image data for any researcher interested in the problem domain. Through a multi-institutional effort, we generated a large, curated dataset representative of several highly variable segmentation tasks that was used in a crowd-sourced challenge - the Medical Segmentation Decathlon, held during the 2018 Medical Image Computing and Computer Assisted Intervention (MICCAI) conference in Granada, Spain. Here, we describe these ten labeled image datasets so that these data may be effectively reused by the research community.
http://arxiv.org/abs/1902.09063
This paper proposes attentive statistics pooling for deep speaker embedding in text-independent speaker verification. In conventional speaker embedding, frame-level features are averaged over all the frames of a single utterance to form an utterance-level feature. Our method utilizes an attention mechanism to give different weights to different frames and generates not only weighted means but also weighted standard deviations. In this way, it can capture long-term variations in speaker characteristics more effectively. An evaluation on the NIST SRE 2012 and VoxCeleb data sets shows that it reduces equal error rates (EERs) relative to the conventional method by 7.5% and 8.1%, respectively.
http://arxiv.org/abs/1803.10963
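The pooling layer described above is easy to state in code. Below is a minimal PyTorch sketch that computes per-frame attention weights and returns the concatenated weighted mean and weighted standard deviation; tensor shapes and the size of the attention network are assumptions.

```python
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Pool frame-level features (B, T, D) into an utterance-level vector (B, 2D)."""
    def __init__(self, dim, att_dim=128):
        super().__init__()
        self.att = nn.Sequential(
            nn.Linear(dim, att_dim), nn.Tanh(), nn.Linear(att_dim, 1),
        )

    def forward(self, h):
        w = torch.softmax(self.att(h), dim=1)        # (B, T, 1): per-frame weights
        mu = (w * h).sum(dim=1)                      # weighted mean
        var = (w * h ** 2).sum(dim=1) - mu ** 2      # weighted variance
        sigma = torch.sqrt(var.clamp(min=1e-8))      # weighted standard deviation
        return torch.cat([mu, sigma], dim=1)         # (B, 2D)
```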
The ability to deal with articulated objects is very important for robots assisting humans. In this work, a general framework for the robust operation of different types of doors using an autonomous robotic mobile manipulator is proposed. To push the state of the art in robustness and speed, we devise a novel algorithm that fuses a convolutional neural network with efficient point cloud processing. This advancement allows for real-time grasping pose estimation of single or multiple handles from RGB-D images, providing a speed-up for assistive human-centered behaviors. In addition, we propose a versatile Bayesian framework that endows the robot with the ability to infer the door kinematic model from observations of its motion while opening it, and to learn from previous experiences or human demonstrations. Combining this probabilistic approach with a state-of-the-art motion planner, we achieve efficient door grasping and subsequent door operation regardless of the kinematic model, using the Toyota Human Support Robot.
http://arxiv.org/abs/1902.09051
Talent Search systems aim to recommend potential candidates who are a good match to the hiring needs of a recruiter, expressed in terms of the recruiter’s search query or job posting. Past work in this domain has focused on linear and nonlinear models that lack preference personalization at the user level because they are trained only on globally collected recruiter activity data. In this paper, we propose an entity-personalized Talent Search model that utilizes a combination of generalized linear mixed (GLMix) models and gradient boosted decision tree (GBDT) models, and provides personalized talent recommendations using nonlinear tree interaction features generated by the GBDT. We also present the offline and online system architecture for the productionization of this hybrid model approach in our Talent Search systems. Finally, we provide offline and online experiment results benchmarking our entity-personalized model with tree interaction features, which demonstrate significant improvements in our precision metrics compared to globally trained non-personalized models.
http://arxiv.org/abs/1902.09041
An Image Signal Processor (ISP) comprises various blocks that reconstruct raw image sensor data into the final image consumed by the human visual system or computer vision applications. Each block typically has many tuning parameters due to the complexity of the operation. These need to be hand-tuned by Image Quality (IQ) experts, which takes a considerable amount of time. In this paper, we present automatic IQ tuning using nonlinear optimization and automatic reference generation algorithms. The proposed method can produce high-quality IQ in minutes, compared with weeks of hand-tuning by IQ experts. In addition, the proposed method can work with any algorithm without being aware of its specific implementation. It was found to be successful on multiple different processing blocks, such as noise reduction, demosaicing, and sharpening.
http://arxiv.org/abs/1902.09023
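As an illustration of the black-box nature of this tuning, here is a hedged SciPy sketch, assuming a callable `process(raw, p)` for the ISP block and a pre-generated `reference` image; the paper’s actual optimizer, loss, and reference-generation algorithms are not specified here, so derivative-free Nelder-Mead and an L2 loss stand in.

```python
import numpy as np
from scipy.optimize import minimize

def tune_block(process, raw, reference, p0):
    """Tune one ISP block's parameter vector p by derivative-free optimization.
    `process` can be an opaque implementation; only its output image is used."""
    def iq_loss(p):
        out = process(raw, p)
        return float(np.mean((out.astype(np.float64) - reference) ** 2))

    res = minimize(iq_loss, p0, method='Nelder-Mead',
                   options={'maxiter': 500, 'xatol': 1e-3, 'fatol': 1e-6})
    return res.x   # tuned parameters
```

Because the optimizer only queries the block's output, the same loop applies unchanged to noise reduction, demosaicing, or sharpening blocks.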
Autonomous feeding is challenging because it requires manipulation of food items with varying compliance, size, and shape. To understand how humans manipulate food items during feeding and to explore ways to adapt their strategies to robots, we collected a rich dataset of human trajectories by asking participants to pick up food and feed it to a mannequin. From the analysis of the collected haptic and motion signals, we demonstrate that humans adapt their control policies to the compliance and shape of the food item being acquired. We propose a taxonomy of manipulation strategies for feeding to highlight such policies. As a first step towards generating compliance-dependent policies, we propose a set of classifiers for compliance-based food categorization from haptic and motion signals. We compare these human manipulation strategies with fixed position-control policies executed by a robot. Our analysis of the success and failure cases of human and robot policies further highlights the importance of adapting the policy to the compliance of a food item.
http://arxiv.org/abs/1804.08768
Humans have schematic knowledge of how certain types of events unfold (e.g. coffeeshop visits) that can readily be generalized to new instances of those events. Schematic knowledge allows humans to perform role-filler binding, the task of associating schematic roles (e.g. “barista”) with specific fillers (e.g. “Bob”). Here we examined whether and how recurrent neural networks learn to do this. We procedurally generated stories from an underlying generative graph, and trained networks on role-filler binding question-answering tasks. We tested whether networks can learn to maintain filler information on their own, and whether they can generalize to fillers that they have not seen before. We studied networks by analyzing their behavior and decoding their memory states. We found that a network’s success in learning role-filler binding depends on both the breadth of roles introduced during training, and the network’s memory architecture. In our decoding analyses, we observed a close relationship between the information we could decode from various parts of network architecture, and the information the network could recall.
http://arxiv.org/abs/1902.09006
Word order is a significant distinctive feature to differentiate languages. In this paper, we investigate cross-lingual transfer and posit that an order-agnostic model will perform better when transferring to distant foreign languages. To test our hypothesis, we train dependency parsers on an English corpus and evaluate their transfer performance on 30 other languages. Specifically, we compare encoders and decoders based on Recurrent Neural Networks (RNNs) and modified self-attentive architectures. The former rely on sequential information while the latter are more flexible at modeling token order. Detailed analysis shows that RNN-based architectures transfer well to languages that are close to English, while self-attentive models have better overall cross-lingual transferability and perform especially well on distant languages.
http://arxiv.org/abs/1811.00570
In this short paper, we present early insights from a Decision Support System for Customer Support Agents (CSAs) serving customers of a leading accounting software. The system is under development and is designed to provide suggestions to CSAs to make them more productive. A unique aspect of the solution is the use of bandit algorithms to create a tractable human-in-the-loop system that can learn from CSAs in an online fashion. In addition to discussing the ML aspects, we also bring out important insights we gleaned from early feedback from CSAs. These insights motivate our future work and also might be of wider interest to ML practitioners.
http://arxiv.org/abs/1903.03512
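The paper above does not name its exact bandit algorithm, so purely for illustration, here is a minimal Beta-Bernoulli Thompson-sampling sketch for choosing among K candidate suggestions, with the CSA’s accept/reject as the reward.

```python
import numpy as np

class ThompsonSuggester:
    """Beta-Bernoulli Thompson sampling over K candidate suggestions."""
    def __init__(self, n_arms):
        self.a = np.ones(n_arms)   # pseudo-counts of acceptances + 1
        self.b = np.ones(n_arms)   # pseudo-counts of rejections + 1

    def select(self):
        # sample a plausible acceptance rate per arm, pick the best sample
        return int(np.argmax(np.random.beta(self.a, self.b)))

    def update(self, arm, accepted):
        if accepted:
            self.a[arm] += 1
        else:
            self.b[arm] += 1
```

This kind of online update is what makes the human-in-the-loop setup tractable: each CSA interaction immediately refines the posterior over which suggestions are useful.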
Conventional therapy approaches limit surgeons’ dexterity control due to a limited field of view. With the advent of robot-assisted surgery, there has been a paradigm shift in medical technology for minimally invasive surgery. However, it is very challenging to track the position of surgical instruments in a surgical scene, so accurate detection and identification of surgical tools is paramount. Deep learning-based semantic segmentation of frames in surgery videos has the potential to facilitate this task. In this work, we modify the U-Net architecture to create U-NetPlus, introducing a pre-trained encoder and re-designing the decoder by replacing the transposed convolution operation with an upsampling operation based on nearest-neighbor (NN) interpolation. To further improve performance, we also employ a very fast and flexible data augmentation technique. We trained the framework on 8 x 225 frame sequences of robotic surgical videos available through the MICCAI 2017 EndoVis Challenge dataset and tested it on 8 x 75 frame and 2 x 300 frame videos. Using our U-NetPlus architecture, we report a 90.20% DICE for binary segmentation, 76.26% DICE for instrument part segmentation, and 46.07% for instrument type (i.e., all instruments) segmentation, outperforming previous techniques implemented and tested on these data.
http://arxiv.org/abs/1902.08994
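The decoder change described above (nearest-neighbor upsampling followed by convolution, instead of transposed convolution) can be sketched in a few lines of PyTorch; channel sizes and the use of batch norm are assumptions, not the paper’s exact block.

```python
import torch.nn as nn

def up_block(in_ch, out_ch):
    """Decoder block: NN-interpolation upsampling + convolution, a common
    alternative to nn.ConvTranspose2d that avoids checkerboard artifacts."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```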
In chronic pain physical rehabilitation, physiotherapists adapt movement to the current performance of patients, especially based on the expression of protective behavior, gradually exposing them to feared but harmless and essential everyday movements. As physical rehabilitation moves outside the clinic, rehabilitation technology needs to automatically detect such behavior so as to provide similar personalized support. In this paper, we investigate the use of a Long Short-Term Memory (LSTM) network, which we call Protect-LSTM, to detect events of protective behavior, based on motion capture and electromyography data of healthy people and people with chronic low back pain engaged in five everyday movements. Differently from previous work on the same dataset, we aim to continuously detect protective behavior within a movement rather than estimate its overall presence. Protect-LSTM reaches a best average F1 score of 0.815 with leave-one-subject-out (LOSO) validation using low-level features, better than other algorithms. Performance increases for some movements when they are modelled separately (mean F1 scores: bending = 0.77, standing on one leg = 0.81, sit-to-stand = 0.72, stand-to-sit = 0.83, reaching forward = 0.67). These results reach an excellent level of agreement with the average ratings of physiotherapists. As such, they show clear potential for in-home, technology-supported, affect-based personalized physical rehabilitation.
http://arxiv.org/abs/1902.08990
Squamous Cell Carcinoma (SCC) is the most common cancer type of the epithelium and is often detected at a late stage. Besides invasive diagnosis of SCC by means of biopsy and histopathologic assessment, Confocal Laser Endomicroscopy (CLE) has emerged as a noninvasive method that has been successfully used to diagnose SCC in vivo. Interpretation of CLE images, however, requires extensive training, which limits the method’s applicability and use in clinical practice. To aid diagnosis of SCC in a broader scope, automatic detection methods have been proposed. This work compares two methods with regard to their applicability in a transfer learning sense, i.e. training on one tissue type (from one clinical team) and applying the learnt classification system to another entity (different anatomy, different clinical team). Besides a previously proposed patch-based method based on convolutional neural networks, a novel image-level classification method (based on a pre-trained Inception V3 network with dedicated preprocessing and interpretation of class activation maps) is proposed and evaluated. The newly presented approach improves recognition performance, yielding accuracies of 91.63% on the first data set (oral cavity) and 92.63% on a joint data set. Generalization from the oral cavity to the second data set (vocal folds) leads to area-under-the-ROC-curve values similar to those of direct training on the vocal folds data set, indicating good generalization.
http://arxiv.org/abs/1902.08985
Model predictive control (MPC) is a powerful technique for solving dynamic control tasks. In this paper, we show that there exists a close connection between MPC and online learning, an abstract theoretical framework for analyzing online decision making in the optimization literature. This new perspective provides a foundation for leveraging powerful online learning algorithms to design MPC algorithms. Specifically, we propose a new algorithm based on dynamic mirror descent (DMD), an online learning algorithm that is designed for non-stationary setups. Our algorithm, Dynamic Mirror Descent Model Predictive Control (DMD-MPC), represents a general family of MPC algorithms that includes many existing techniques as special instances. DMD-MPC also provides a fresh perspective on previous heuristics used in MPC and suggests a principled way to design new MPC algorithms. In the experimental section of this paper, we demonstrate the flexibility of DMD-MPC, presenting a set of new MPC algorithms on a simple simulated cartpole and on simulated and real-world aggressive driving tasks.
http://arxiv.org/abs/1902.08967
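DMD-MPC is a general family; as a toy illustration of the online-learning view of MPC, the sketch below implements only its simplest gradient-descent special case, assuming differentiable `dynamics` and `cost` callables (names hypothetical): improve the planned control sequence by a few gradient steps, execute the first control, and shift the sequence as a warm start.

```python
import torch

def mpc_step(dynamics, cost, x0, u_seq, lr=0.05, n_grad=10):
    """One receding-horizon update via gradient descent on the rolled-out cost.
    x0: current state; u_seq: (H, u_dim) planned controls over horizon H."""
    u = u_seq.clone().requires_grad_(True)
    opt = torch.optim.SGD([u], lr=lr)
    for _ in range(n_grad):
        x, total = x0, 0.0
        for t in range(u.shape[0]):          # differentiable rollout
            x = dynamics(x, u[t])
            total = total + cost(x, u[t])
        opt.zero_grad(); total.backward(); opt.step()
    u_exec = u.detach()[0]                   # execute only the first control
    u_next = torch.roll(u.detach(), -1, dims=0)  # shifted warm start for next step
    return u_exec, u_next
```

In the online-learning reading, each environment step presents a new loss (the rolled-out cost from the new state), and the controller is an online learner updating its decision variable, the control sequence.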
In sequence-to-sequence generation tasks (e.g. machine translation and abstractive summarization), inference is generally performed in a left-to-right manner to produce the result token by token. Neural approaches, such as LSTMs and self-attention networks, are now able to make full use of all the predicted history hypotheses from the left side during inference, but meanwhile cannot access any future (right-side) information, and they usually generate unbalanced outputs in which the left parts are much more accurate than the right ones. In this work, we propose a synchronous bidirectional inference model that generates outputs using both left-to-right and right-to-left decoding simultaneously and interactively. First, we introduce a novel beam search algorithm that facilitates synchronous bidirectional decoding. Then, we present the core approach, which enables left-to-right and right-to-left decoding to interact with each other, so as to utilize both history and future predictions simultaneously during inference. We apply the proposed model to both LSTM and self-attention networks. In addition, we propose two strategies for parameter optimization. Extensive experiments on machine translation and abstractive summarization demonstrate that our synchronous bidirectional inference model achieves remarkable improvements over strong baselines.
http://arxiv.org/abs/1902.08955
This paper presents a vision-based robotic system to handle the picking problem involved in automatic express package dispatching. By utilizing two RealSense RGB-D cameras and one UR10 industrial robot, the package dispatching task, which is usually done by humans, can be completed automatically. In order to determine the grasp point for overlapped deformable objects, we improved the sampling algorithm proposed by the group at Berkeley to directly generate grasp candidates from depth images. For package recognition, the deep network framework YOLO is integrated. We also designed a multi-modal robot hand composed of a two-fingered gripper and a vacuum suction cup to deal with different kinds of packages. All the technologies have been integrated in a work cell that simulates the practical conditions of an express package dispatching scenario. The proposed system is verified by experiments conducted on two typical express items.
http://arxiv.org/abs/1902.08951
This paper presents an efficient neural network model to generate robotic grasps from high-resolution images. The proposed model uses a fully convolutional neural network to generate robotic grasps for each pixel of 400 $\times$ 400 high-resolution RGB-D images. It first down-samples the images to extract features and then up-samples those features to the original input size, combining local and global features from different feature maps. Compared to other regression or classification methods for detecting robotic grasps, our method is more like segmentation methods, which solve the problem in a pixel-wise way. We use the Cornell Grasp Dataset to train and evaluate the model, achieving high accuracy of about 94.42% image-wise and 91.02% object-wise, with a fast prediction time of about 8 ms. We also demonstrate that, without training on a multiple-objects dataset, our model can directly output robotic grasp candidates for different objects because of the pixel-wise implementation.
http://arxiv.org/abs/1902.08950
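A toy PyTorch sketch of the fully convolutional, per-pixel formulation described above: an encoder-decoder that maps a 400x400 RGB-D input to a 400x400 grasp-quality map. Layer widths and depths are assumptions; the paper’s exact architecture and its full grasp parameterization (e.g., angle and width outputs) are omitted.

```python
import torch.nn as nn

class PixelwiseGraspNet(nn.Module):
    """Illustrative FCN: per-pixel grasp-quality logits at input resolution."""
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(                       # 400 -> 200 -> 100
            nn.Conv2d(4, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.up = nn.Sequential(                         # 100 -> 200 -> 400
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, rgbd):            # rgbd: (B, 4, 400, 400)
        return self.up(self.down(rgbd)) # (B, 1, 400, 400) quality logits
```

Treating grasp detection as dense prediction is what lets a model trained on single objects emit candidates for every object in a cluttered scene, since each pixel is scored independently.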
Training generative adversarial networks (GANs) often suffers from cyclic behaviors of the iterates. Based on the simple intuition that the centripetal acceleration of an object moving in uniform circular motion points toward the center of the circle, we present the Simultaneous Centripetal Acceleration (SCA) method and the Alternating Centripetal Acceleration (ACA) method to alleviate these cyclic behaviors. Under suitable conditions, gradient descent methods with either SCA or ACA are shown to be linearly convergent for bilinear games. Numerical experiments are conducted by applying ACA to existing gradient-based algorithms in a GAN setting, demonstrating the superiority of ACA.
https://arxiv.org/abs/1902.08949
Endowing continuum robots with compliance while they interact with the internal environment of the human body is essential to prevent damage to the robot and the surrounding tissues. Compared with passive compliance, active compliance has advantages in terms of increasing the force transmission ability and improving safety with monitored force output. Previous studies have demonstrated that active compliance can be achieved based on a complex model of the mechanics combined with a traditional machine learning technique such as a support vector machine. This paper proposes a recurrent neural network (RNN) based approach that avoids the complexity of modeling while capturing nonlinear factors such as hysteresis, friction, and electronics delays that are not easy to model. The approach is tested on a 3-tendon single-segment continuum robot with force sensors on each cable. Experiments demonstrate that the continuum robot with an RNN-based feed-forward controller is capable of responding to external forces quickly and entering an unknown environment compliantly.
http://arxiv.org/abs/1902.08943
Textual deception constitutes a major problem for online security. Many studies have argued that deceptiveness leaves traces in writing style, which could be detected using text classification techniques. By conducting an extensive literature review of existing empirical work, we demonstrate that while certain linguistic features have been indicative of deception in certain corpora, they fail to generalize across divergent semantic domains. We suggest that deceptiveness as such leaves no content-invariant stylistic trace, and textual similarity measures provide superior means of classifying texts as potentially deceptive. Additionally, we discuss forms of deception beyond semantic content, focusing on hiding author identity by writing style obfuscation. Surveying the literature on both author identification and obfuscation techniques, we conclude that current style transformation methods fail to achieve reliable obfuscation while simultaneously ensuring semantic faithfulness to the original text. We propose that future work in style transformation should pay particular attention to disallowing semantically drastic changes.
http://arxiv.org/abs/1902.08939