Historical text normalization often relies on small training datasets. Recent work has shown that multi-task learning can sometimes lead to significant improvements by exploiting synergies with related datasets, but there has been no systematic study of multi-task learning strategies across different datasets from different languages. This paper evaluates 63 multi-task learning strategies for sequence-to-sequence-based historical text normalization across ten datasets from eight languages, using autoencoding, grapheme-to-phoneme mapping, and lemmatization as auxiliary tasks. We observe consistent, significant improvements across languages when training data for the target task is limited, but minimal or no improvements when training data is abundant. Finally, we show that zero-shot learning outperforms the simple, but relatively strong, identity baseline.
http://arxiv.org/abs/1903.04870
We propose a method to maintain high resource availability in a networked heterogeneous multi-robot system subject to resource failures. In our model, resources such as sensing and computation are available on robots. The robots are engaged in a joint task using these pooled resources. When a resource on a particular robot becomes unavailable (e.g., a sensor ceases to function due to a failure), the system reconfigures so that the robot continues to have access to this resource by communicating with other robots. Specifically, we consider the problem of selecting edges to be in the system’s communication graph after a resource failure has occurred. We define a metric that allows us to characterize the quality of the resource distribution in the network represented by the communication graph. Upon a resource becoming unavailable due to failure, we reconfigure the network so that the resource distribution is brought as close to the ideal resource distribution as possible without a big change in the communication cost. Our approach uses integer semi-definite programming to achieve this goal. We also provide a simulated annealing method to compute a formation that satisfies the inter-robot distances imposed by the topology, along with other constraints. Our method can compute a communication topology, spatial formation, and formation change motion planning in a few seconds. We validate our method in simulation and real-robot experiments with a team of seven quadrotors.
http://arxiv.org/abs/1903.04856
There has been much progress in data-driven artificial intelligence technology for medical image analysis in recent decades. However, medical image analysis remains challenging because of the distinctive complexity of acquiring and annotating image data, extracting medical domain knowledge, and explaining diagnostic decisions. In this paper, we propose a data-knowledge-driven evolutionary framework termed Parallel Medical Imaging (PMI) for medical image analysis, based on the methodology of interactive ACP-based parallel intelligence. In the PMI framework, computational experiments with predictive learning in a data-driven way are conducted to extract medical knowledge for diagnostic decision support. Artificial imaging systems are introduced to select and prescriptively generate medical image data in a knowledge-driven way to utilize medical domain knowledge. Through parallel evolutionary optimization, our proposed PMI framework can boost the generalization ability and alleviate the limitation of medical interpretation for diagnostic decisions. A GAN-based PMI framework for case studies of mammogram analysis is demonstrated in this work.
http://arxiv.org/abs/1903.04855
As a basic task of multi-camera surveillance systems, person re-identification aims to re-identify a query pedestrian observed from multiple non-overlapping cameras or at different times with a single camera. Recently, deep learning-based person re-identification models have achieved great success on many benchmarks. However, these supervised models require a large amount of labeled image data, and manual labeling consumes much manpower and time. In this study, we introduce a method to automatically synthesize labeled person images and use them to increase the number of samples per identity in person re-identification datasets. Specifically, we use block rectangles to randomly occlude pedestrian images. Then, a generative adversarial network (GAN) model is proposed that uses paired occluded and original images to synthesize de-occluded images that are similar but not identical to the original images. Afterwards, we annotate the de-occluded images with the same labels as their corresponding raw images and use them to augment the number of samples per identity. Finally, we use the augmented datasets to train a baseline model. The experimental results on the CUHK03, Market-1501 and DukeMTMC-reID datasets demonstrate the effectiveness of the proposed method.
http://arxiv.org/abs/1809.09970
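As an illustration of the random block-occlusion step described in the abstract above, the following minimal Python sketch masks one randomly placed rectangle in an image stored as a list of rows. The function name `occlude`, the block size, and the fill value are assumptions made for this example; the abstract does not state the occlusion parameters actually used.

```python
import random


def occlude(image, block_h, block_w, fill=0, rng=None):
    """Return a copy of `image` with one random rectangle set to `fill`.

    `image` is a list of rows (each row a list of pixel values).
    All parameter names here are hypothetical, chosen for illustration.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    # Clamp the block so it always fits inside the image.
    block_h = min(block_h, height)
    block_w = min(block_w, width)
    top = rng.randint(0, height - block_h)
    left = rng.randint(0, width - block_w)
    # Copy row by row so the original image is left untouched.
    occluded = [row[:] for row in image]
    for r in range(top, top + block_h):
        for c in range(left, left + block_w):
            occluded[r][c] = fill
    return occluded


if __name__ == "__main__":
    original = [[1] * 8 for _ in range(6)]
    masked = occlude(original, block_h=2, block_w=3, rng=random.Random(0))
    for row in masked:
        print(row)
```

In such a setup, each (occluded, original) pair would serve as one paired training example for the de-occlusion GAN, and the synthesized output would inherit the identity label of the original image.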
The paper addresses the problem of motion saliency in videos, that is, identifying regions whose motion departs from that of their context. We propose a new unsupervised paradigm to compute motion saliency maps. The key ingredient is the flow inpainting stage. Candidate regions are determined from the optical flow boundaries. The residual flow in these regions is given by the difference between the optical flow and the flow inpainted from the surrounding areas. It provides the cue for motion saliency. The method is flexible and general by relying on motion information only. Experimental results on the DAVIS 2016 benchmark demonstrate that the method compares favourably with state-of-the-art video saliency methods.
http://arxiv.org/abs/1903.04842
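The residual-flow cue described in the abstract above (observed optical flow minus the flow inpainted from the surroundings) can be sketched as follows. This is only an illustration: the inpainted flow is taken as a given input, since the abstract does not say how it is computed, and the function name `residual_saliency` is hypothetical.

```python
import math


def residual_saliency(flow, inpainted_flow):
    """Per-pixel magnitude of (observed flow - inpainted flow).

    Both inputs are lists of rows of (u, v) tuples with identical shape.
    A large value means the motion departs from what the context predicts.
    """
    return [
        [math.hypot(u - ui, v - vi) for (u, v), (ui, vi) in zip(row, inp_row)]
        for row, inp_row in zip(flow, inpainted_flow)
    ]


if __name__ == "__main__":
    observed = [[(3.0, 4.0), (1.0, 1.0)]]
    from_context = [[(0.0, 0.0), (1.0, 1.0)]]
    print(residual_saliency(observed, from_context))
```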
In this work, we present a novel distributed method for constructing an occupancy grid map of an unknown environment using a swarm of robots with global localization capabilities and limited inter-robot communication. The robots explore the domain by performing Lévy walks in which their headings are defined by maximizing the mutual information between the robot’s estimate of its environment in the form of an occupancy grid map and the distance measurements that it is likely to obtain when it moves in that direction. Each robot is equipped with laser range sensors, and it builds its occupancy grid map by repeatedly combining its own distance measurements with map information that is broadcast by neighboring robots. Using results on average consensus over time-varying graph topologies, we prove that all robots’ maps will eventually converge to the actual map of the environment. In addition, we demonstrate that a technique based on topological data analysis, developed in our previous work for generating topological maps, can be readily extended for adaptive thresholding of occupancy grid maps. We validate the effectiveness of our distributed exploration and mapping strategy through a series of 2D simulations and multi-robot experiments.
https://arxiv.org/abs/1903.04836
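The convergence claim in the abstract above rests on average consensus. As a minimal sketch of that idea only, the code below runs a standard consensus iteration on a fixed, undirected communication graph, with one scalar value per robot standing in for a single map cell. The fixed graph, the step size, and the function names are assumptions for this example; the paper itself treats time-varying topologies and full occupancy grid maps.

```python
def consensus_step(values, neighbors, step):
    """One synchronous update: each node moves toward its neighbors' values."""
    return [
        v + step * sum(values[j] - v for j in neighbors[i])
        for i, v in enumerate(values)
    ]


def run_consensus(values, neighbors, step=0.25, rounds=200):
    """Repeat the update; `step` should be below 1 / (maximum node degree)."""
    for _ in range(rounds):
        values = consensus_step(values, neighbors, step)
    return values


if __name__ == "__main__":
    # Four robots on a line: 0 - 1 - 2 - 3 (hypothetical topology).
    line_graph = [[1], [0, 2], [1, 3], [2]]
    estimates = [0.0, 1.0, 0.0, 1.0]
    print(run_consensus(estimates, line_graph))
```

Because every exchange is symmetric, the sum of the values is preserved at each step, so on a connected graph all nodes approach the average of the initial values.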
The 5th edition of the International Conference on Cloud and Robotics (ICCR 2018) will be held on November 12-14, 2018 in Paris and Saint-Quentin, France. The conference is a co-event with GDR ALROB and the industry exposition Robonumerique. The domain of cloud robotics aims to converge robots with computation, storage and communication resources provided by the cloud. The cloud may complement robotic resources in several ways, including crowd-sourcing knowledge databases, context information, computational offloading or data-intensive information processing for artificial intelligence. Today, the paradigms of cloud/fog/edge computing propose software architecture solutions for robots to share computations or offload them to ambient and networked resources. Yet, combining distant computations with the real-time constraints of robotics is very challenging. As the challenges in this domain are multi-disciplinary and similar to those in other research areas, Cloud Robotics aims at building bridges among experts from academia and industry working in different fields, such as robotics, cyber-physical systems, automotive, aerospace, machine learning, artificial intelligence, software architecture, big data analytics, Internet-of-Things, networked control and distributed cloud systems.
http://arxiv.org/abs/1903.04824
In the age of information explosion, image classification is a key technology for handling and organizing large amounts of image data. Currently, classical image classification algorithms are mostly based on RGB or grayscale images, and fail to make good use of the depth information about objects or scenes. The depth information in the images has a strong complementary effect, which can enhance the classification accuracy significantly. In this paper, we propose an image classification technique using principal component analysis based on multi-view depth characters. In detail, the depth image of the original image is first estimated; then, depth characters are extracted from the RGB views and the depth view separately, and their dimensionality is reduced with PCA. Finally, an SVM is applied for image classification. The experimental results show that the method has good performance.
http://arxiv.org/abs/1903.04814
SemEval-2019 Task 6 requires us to identify and categorise offensive language in social media. In this paper we describe the process we took to tackle this challenge. Our process is heavily inspired by Sosa (2017) [1], who proposed CNN-LSTM and LSTM-CNN models to conduct Twitter sentiment analysis. We decided to follow his approach and to extend his work by testing different variations of RNN models combined with CNNs. Specifically, we have divided the challenge into two parts: data processing and sampling, and choosing the optimal deep learning architecture. In preprocessing, we experimented with two techniques, SMOTE and class weights, to counter the imbalance between classes. Once we were satisfied with the quality of our input data, we proceeded to choose the optimal deep learning architecture for this task. Given the quality and quantity of the data we were given, we found that the addition of a CNN layer provides little to no additional improvement to our model’s performance and sometimes even worsens our F1-score. In the end, the deep learning architecture that gives us the highest macro F1-score is a simple BiLSTM-CNN.
http://arxiv.org/abs/1903.05280
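The abstract above mentions class weights as one way to counter class imbalance. A minimal sketch of one common weighting scheme follows, in which each class is weighted by n_samples / (n_classes * class_count) so that rare classes contribute more to the loss. Whether the authors used exactly this formula is not stated, so treat it as an assumption; the label names in the usage example are illustrative only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    weight(c) = n_samples / (n_classes * count(c))
    """
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical imbalanced label set: 8 non-offensive, 2 offensive.
    labels = ["NOT"] * 8 + ["OFF"] * 2
    print(balanced_class_weights(labels))
```

Unlike SMOTE, which synthesizes new minority-class samples, this approach leaves the data unchanged and only rescales each sample's contribution to the training loss.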
The behavior of users of music streaming services is investigated from the point of view of the temporal dimension of individual songs; specifically, the main object of the analysis is the point in time within a song at which users stop listening and start streaming another song (“skip”). The main contribution of this study is the ascertainment of a correlation between the distribution in time of skipping events and the musical structure of songs. It is also shown that such distribution is not only specific to the individual songs, but also independent of the cohort of users and, under stationary conditions, date of observation. Finally, user behavioral data is used to train a predictor of the musical structure of a song solely from its acoustic content; it is shown that the use of such data, available in large quantities to music streaming services, yields significant improvements in accuracy over the customary fashion of training this class of algorithms, in which only smaller amounts of hand-labeled data are available.
http://arxiv.org/abs/1903.06008
Over the past decade, deep neural networks (DNNs) have become a de facto standard for solving machine learning problems. As we try to solve more advanced problems, a growing demand for computing and power resources is inevitable, which makes it nearly impossible to employ DNNs on embedded systems, where available resources are limited. Given these circumstances, spiking neural networks (SNNs) are attracting widespread interest as the third generation of neural networks, due to their event-driven and low-power nature. However, SNNs come at the cost of significant performance degradation, largely due to the complex dynamics of SNN neurons and the non-differentiable spike operation. Thus, their application has been limited to relatively simple tasks such as image classification. In this paper, we investigate the performance degradation of SNNs in the much more challenging task of object detection. Based on our in-depth analysis, we introduce two novel methods to overcome a significant performance gap: channel-wise normalization and signed neurons with imbalanced thresholds. Consequently, we present a spike-based real-time object detection model, called Spiking-YOLO, that provides near-lossless information transmission in a shorter period of time for deep SNNs. Our experiments show that Spiking-YOLO is able to achieve comparable results, up to 97% of the original YOLO, on a non-trivial dataset, PASCAL VOC.
http://arxiv.org/abs/1903.06530
Segmentation stands at the forefront of many high-level vision tasks. In this study, we focus on segmenting finger bones within a newly introduced semi-supervised self-taught deep learning framework which consists of a student network and a stand-alone teacher module. The whole system is boosted in a life-long learning manner in which, at each step, the teacher module provides a refinement for the student network to learn from newly unlabeled data. Experimental results demonstrate the superiority of the proposed method over conventional supervised deep learning methods.
http://arxiv.org/abs/1903.04778
Machine learning is advancing towards a data-science approach, which calls for a line of investigation that reveals the knowledge learnt by deep neuronal networks. Limiting the comparison among networks merely to a predefined intelligent ability, according to ground truth, does not suffice; it should be associated with the innate similarity of these artificial entities. Here, we analysed multiple instances of an identical architecture trained to classify objects in static images (CIFAR and ImageNet data sets). We evaluated the performance of the networks under various distortions and compared it to the intrinsic similarity between their constituent kernels. While we expected a close correspondence between these two measures, we observed a puzzling phenomenon. Pairs of networks whose kernels’ weights are over 99.9% correlated can exhibit significantly different performances, yet other pairs with no correlation can reach quite compatible levels of performance. We show implications of this for transfer learning, and argue its importance in our general understanding of what intelligence is, whether natural or artificial.
http://arxiv.org/abs/1903.04772
Concatenation of the deep network representations extracted from different facial patches helps to improve face recognition performance. However, the concatenated facial template increases in size and contains redundant information. Previous solutions aim to reduce the dimension of the facial template without considering the occlusion pattern of the facial patches. In this paper, we propose an occlusion-guided compact template learning (OGCTL) approach that only uses the information from visible patches to construct the compact template. The compact face representation is not sensitive to the number of patches that are used to construct the facial template, and is more suitable for incorporating the information from different view angles for image-set based face recognition. Different from previous ensemble models that use occlusion masks in face matching (e.g., DPRFS), the proposed method uses occlusion masks in template construction and achieves significantly better image-set based face verification performance on a challenging database with a template size that is an order of magnitude smaller than that of DPRFS.
http://arxiv.org/abs/1903.04752
Knowledge graph embedding aims to learn distributed representations for entities and relations, and is proven to be effective in many applications. Crossover interactions — bi-directional effects between entities and relations — help select related information when predicting a new triple, but haven’t been formally discussed before. In this paper, we propose CrossE, a novel knowledge graph embedding which explicitly simulates crossover interactions. It not only learns one general embedding for each entity and relation as most previous methods do, but also generates multiple triple specific embeddings for both of them, named interaction embeddings. We evaluate embeddings on typical link prediction tasks and find that CrossE achieves state-of-the-art results on complex and more challenging datasets. Furthermore, we evaluate embeddings from a new perspective — giving explanations for predicted triples, which is important for real applications. In this work, an explanation for a triple is regarded as a reliable closed-path between the head and the tail entity. Compared to other baselines, we show experimentally that CrossE, benefiting from interaction embeddings, is more capable of generating reliable explanations to support its predictions.
http://arxiv.org/abs/1903.04750
Named Entity Recognition (NER) for the Myanmar language is essential to Myanmar natural language processing research. In this work, NER for the Myanmar language is treated as a sequence tagging problem, and the effectiveness of deep neural networks on NER for the Myanmar language is investigated. Experiments are performed by applying deep neural network architectures to syllable-level Myanmar contexts. The very first manually annotated NER corpus for the Myanmar language is also constructed and proposed. In developing our in-house NER corpus, sentences from an online news website and sentences from the ALT-Parallel-Corpus are used. This ALT corpus is one part of the Asian Language Treebank (ALT) project under ASEAN IVO. This paper contributes the first evaluation of neural network models on the NER task for the Myanmar language. The experimental results show that these neural sequence models can produce promising results compared to the baseline CRF model. Among these neural architectures, a bidirectional LSTM network with a CRF layer on top gives the highest F-score. This work also aims to explore the effectiveness of neural network approaches to Myanmar text processing as well as to promote further research on this understudied language.
http://arxiv.org/abs/1903.04739
The project of “quantum spacetime phenomenology” focuses on searching pragmatically for the Planck scale quantum features of spacetime. Among these features is the existence of a characteristic length scale commonly addressed by effective approaches to quantum gravity (QG). This characteristic length scale could be realized, for instance, simply by generalizing the standard Heisenberg uncertainty principle (HUP) to a “generalized uncertainty principle” (GUP). While it is usually expected that phenomena belonging to the realm of QG can essentially be probed only at the so-called Planck energy, here we show how a GUP proposal containing the most general modification of the coordinate representation of the momentum operator could be probed by a “cold atomic ensemble recoil experiment” (CARE) as a low energy quantum system. This proposed atomic interferometer setup has advantages over the conventional architectures owing to the enclosure in a high finesse optical cavity which is supported by a new class of low power consumption integrated devices known as “micro-electro-opto-mechanical systems” (MEOMS). The proposed system comprises a micro mechanical oscillator instead of spherical confocal mirrors as one of the components of the high finesse optical cavity. In the framework of a bottom-up QG phenomenological viewpoint and by taking into account the measurement accuracy realized for the fine structure constant (FSC) from the Rubidium ($^{87}$Rb) CARE, we set some constraints as upper bounds on the characteristic parameters of the underlying GUP. In the case of superposition of the possible GUP modification terms, we managed to set a tight constraint of $0.999978<\lambda_0<1.00002$ for the dimensionless characteristic parameter.
https://arxiv.org/abs/1804.06389
Recent improvements in generative adversarial network (GAN) training techniques show that progressively training a GAN drastically stabilizes the training and improves the quality of the outputs produced. Adding layers after the previous ones have converged has been shown to help overall convergence and stability of the model as well as to reduce the training time. Thus, we use this training technique to train the model progressively in the time and pitch domains, i.e., starting from a very small time value and pitch range, we gradually expand the matrix sizes until the end result is a completely trained model giving outputs with tensor sizes [4 (bar) x 96 (time steps) x 84 (pitch values) x 8 (tracks)]. As shown in previously proposed models, deterministic binary neurons also help in improving the results. Thus, we make use of a layer of deterministic binary neurons at the end of the generator to get binary-valued outputs instead of fractional values between 0 and 1.
http://arxiv.org/abs/1903.04722
Interest in larger-context neural machine translation, including document-level and multi-modal translation, has been growing. Multiple works have proposed new network architectures or evaluation schemes, but potentially helpful context is still sometimes ignored by larger-context translation models. In this paper, we propose a novel learning algorithm that explicitly encourages a neural translation model to take into account additional context using a multilevel pair-wise ranking loss. We evaluate the proposed learning algorithm with a transformer-based larger-context translation system on document-level translation. By comparing performance using actual and random contexts, we show that a model trained with the proposed algorithm is more sensitive to the additional context.
http://arxiv.org/abs/1903.04715
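To make the pair-wise ranking idea in the abstract above concrete, here is a minimal sketch of a margin-based ranking loss on a single pair of scores: the model's score for a translation given the actual context versus its score given some other (for example, random) context. The abstract describes a multilevel loss; this sketch shows one level only, and the margin value, score definition, and function name are assumptions.

```python
def pairwise_margin_loss(score_actual_ctx, score_other_ctx, margin=1.0):
    """Hinge loss that is zero once the actual-context score beats the
    other-context score by at least `margin`."""
    return max(0.0, margin - (score_actual_ctx - score_other_ctx))


if __name__ == "__main__":
    # Scores could be, e.g., log-probabilities of the reference translation.
    print(pairwise_margin_loss(-2.0, -5.0))  # actual context helps a lot
    print(pairwise_margin_loss(-4.0, -3.5))  # actual context is ignored or hurts
```

A loss of this shape penalizes the model whenever replacing the real context makes little difference, which is one way to encourage sensitivity to the additional context.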
We apply recent advances in deep generative modeling to the task of imitation learning from biological agents. Specifically, we apply variations of the variational recurrent neural network model to a multi-agent setting where we learn policies of individual uncoordinated agents acting based on their perceptual inputs and their hidden belief state. We learn stochastic policies for these agents directly from observational data, without constructing a reward function. An inference network learned jointly with the policy allows for efficient inference over the agent’s belief state given a sequence of its current perceptual inputs and the prior actions it performed, which lets us extrapolate observed sequences of behavior into the future while maintaining uncertainty estimates over future trajectories. We test our approach on a dataset of flies interacting in a 2D environment, where we demonstrate better predictive performance than existing approaches which learn deterministic policies with recurrent neural networks. We further show that the uncertainty estimates over future trajectories we obtain are well calibrated, which makes them useful for a variety of downstream processing tasks.
http://arxiv.org/abs/1903.04714
Visual Servoing (VS), where images taken from a camera typically attached to the robot end-effector are used to guide the robot motions, is an important technique to tackle robotic tasks that require a high level of accuracy. We propose a new neural network, based on a Siamese architecture, for highly accurate camera pose estimation. This, in turn, can be used as a final refinement step following a coarse VS or, if applied in an iterative manner, as a standalone VS on its own. The key feature of our neural network is that it outputs the relative pose between any pair of images, and does so with sub-millimeter accuracy. We show that our network can reduce pose estimation errors to 0.6 mm in translation and 0.4 degrees in rotation, from initial errors of 10 mm / 5 degrees if applied once, or of several cm / tens of degrees if applied iteratively. The network can generalize to similar objects, is robust against changing lighting conditions, and to partial occlusions (when used iteratively). The high accuracy achieved enables tackling low-tolerance assembly tasks downstream: using our network, an industrial robot can achieve 97.5% success rate on a VGA-connector insertion task without any force sensing mechanism.
http://arxiv.org/abs/1903.04713
Medical imaging is an essential tool in many areas of medical applications, used for both diagnosis and treatment. However, reading medical images and making diagnosis or treatment recommendations requires specially trained medical specialists. The current practice of reading medical images is labor-intensive, time-consuming, costly, and error-prone. It would be more desirable to have a computer-aided system that can automatically make diagnosis and treatment recommendations. Recent advances in deep learning enable us to rethink the ways clinicians make diagnoses based on medical images. In this thesis, we will introduce 1) mammograms for detecting breast cancers, the most frequently diagnosed solid cancer for U.S. women, 2) lung CT images for detecting lung cancers, the most frequently diagnosed malignant cancer, and 3) head and neck CT images for automated delineation of organs at risk in radiotherapy. First, we will show how to employ the adversarial concept to generate hard examples that improve mammogram mass segmentation. Second, we will demonstrate how to use weakly labeled data for mammogram breast cancer diagnosis by efficiently designing deep learning for multi-instance learning. Third, the thesis will walk through the DeepLung system, which combines deep 3D ConvNets and GBM for automated lung nodule detection and classification. Fourth, we will show how to use weakly labeled data to improve an existing lung nodule detection system by integrating deep learning with a probabilistic graphical model. Lastly, we will demonstrate AnatomyNet, which is thousands of times faster and more accurate than previous methods on automated anatomy segmentation.
http://arxiv.org/abs/1903.04711
This paper extends control barrier functions (CBFs) to high order control barrier functions (HOCBFs) that can be used for high relative degree constraints. The proposed HOCBFs are more general than recently proposed (exponential) HOCBFs. We introduce high order barrier functions (HOBF), and show that their satisfaction of Lyapunov-like conditions implies the forward invariance of the intersection of a series of sets. We then introduce HOCBF, and show that any control input that satisfies the HOCBF constraints renders the intersection of a series of sets forward invariant. We formulate optimal control problems with constraints given by HOCBF and control Lyapunov functions (CLF) and analyze the influence of the choice of the class $\mathcal{K}$ functions used in the definition of the HOCBF on the size of the feasible control region. We also provide a promising method to address the conflict between HOCBF constraints and control limitations by penalizing the class $\mathcal{K}$ functions. We illustrate the proposed method on an adaptive cruise control problem.
http://arxiv.org/abs/1903.04706
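For readers unfamiliar with the construction, the sequence of functions and sets behind the HOCBF abstract above can be sketched as follows. This is the commonly used formulation as recalled here, not text from the abstract; the symbols $b$, $\psi_i$, $\alpha_i$, $C_i$ and $p_i$ are notation chosen for this sketch. For a constraint $b(x) \ge 0$ of relative degree $m$ and class $\mathcal{K}$ functions $\alpha_1, \dots, \alpha_m$:

```latex
\begin{aligned}
\psi_0(x) &= b(x), \\
\psi_i(x) &= \dot{\psi}_{i-1}(x) + \alpha_i\bigl(\psi_{i-1}(x)\bigr), \quad i = 1, \dots, m, \\
C_i &= \{\, x : \psi_{i-1}(x) \ge 0 \,\}, \quad i = 1, \dots, m .
\end{aligned}
```

Any control input that keeps $\psi_m \ge 0$ then renders $C_1 \cap \dots \cap C_m$ forward invariant. As a worked case, take $m = 2$ with linear class $\mathcal{K}$ functions $\alpha_i(s) = p_i s$, $p_i > 0$:

```latex
\begin{aligned}
\psi_1 &= \dot{b} + p_1 b, \\
\psi_2 &= \dot{\psi}_1 + p_2 \psi_1
        = \ddot{b} + (p_1 + p_2)\,\dot{b} + p_1 p_2\, b \;\ge\; 0 .
\end{aligned}
```

With linear $\alpha_i$ this reduces to an exponential-type condition, which is consistent with the abstract's claim that general class $\mathcal{K}$ functions make HOCBFs more general than exponential ones.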
A defining feature of sampling-based motion planning is the reliance on an implicit representation of the state space, which is enabled by a set of probing samples. Traditionally, these samples are drawn either probabilistically or deterministically to uniformly cover the state space. Yet, the motion of many robotic systems is often restricted to “small” regions of the state space, due to, for example, differential constraints or collision-avoidance constraints. To accelerate the planning process, it is thus desirable to devise non-uniform sampling strategies that favor sampling in those regions where an optimal solution might lie. This paper proposes a methodology for non-uniform sampling, whereby a sampling distribution is learned from demonstrations, and then used to bias sampling. The sampling distribution is computed through a conditional variational autoencoder, allowing sample generation from the latent space conditioned on the specific planning problem. This methodology is general, can be used in combination with any sampling-based planner, and can effectively exploit the underlying structure of a planning problem while maintaining the theoretical guarantees of sampling-based approaches. Specifically, on several planning problems, the proposed methodology is shown to effectively learn representations for the relevant regions of the state space, resulting in an order of magnitude improvement in terms of success rate and convergence to the optimal cost.
http://arxiv.org/abs/1709.05448
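One simple way to bias sampling toward a learned distribution while still covering the whole state space, in the spirit of the abstract above, is to mix learned and uniform samples. The sketch below assumes two caller-supplied functions: `learned_sample` (standing in for a draw from the trained conditional variational autoencoder, which is not implemented here) and `uniform_sample`. The mixing fraction and all names are hypothetical.

```python
import random


def biased_sampler(learned_sample, uniform_sample, learned_fraction=0.5, rng=None):
    """Return a function that draws from the learned distribution with
    probability `learned_fraction` and uniformly otherwise."""
    rng = rng or random.Random()

    def draw():
        if rng.random() < learned_fraction:
            return learned_sample()
        return uniform_sample()

    return draw


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder samplers for a 2D unit-square state space.
    learned = lambda: (0.5 + 0.05 * rng.gauss(0, 1), 0.5 + 0.05 * rng.gauss(0, 1))
    uniform = lambda: (rng.random(), rng.random())
    draw = biased_sampler(learned, uniform, learned_fraction=0.7, rng=rng)
    print([draw() for _ in range(3)])
```

Keeping a nonzero share of uniform samples is what lets a scheme like this retain the coverage properties that sampling-based planners rely on.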
Event-based cameras can measure intensity changes (called “events”) with microsecond accuracy under high-speed motion and challenging lighting conditions. With the active pixel sensor (APS), the event camera allows simultaneous output of the intensity frames. However, the output images are captured at a relatively low frame rate and often suffer from motion blur. A blurry image can be regarded as the integral of a sequence of latent images, while the events indicate the changes between the latent images. Therefore, we are able to model the blur-generation process by associating event data with a latent image. Based on the abundant event data and the low frame-rate, easily blurred images, we propose a simple and effective approach to reconstruct a high-quality and high frame-rate sharp video. Starting with a single blurry frame and its event data, we propose the Event-based Double Integral (EDI) model. Then, we extend it to the multiple Event-based Double Integral (mEDI) model to get smoother and more convincing results based on multiple images and their events. We also provide an efficient solver to minimize the proposed energy model. By optimizing the energy model, we achieve significant improvements in removing general blurs and reconstructing high temporal resolution video. The video generation is based on solving a simple non-convex optimization problem in a single scalar variable. Experimental results on both synthetic and real images demonstrate the superiority of our mEDI model and optimization method in comparison to the state-of-the-art.
http://arxiv.org/abs/1903.06531
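The relation sketched in the abstract above (a blurry image is the integral of latent images, and events describe the changes between them) can be illustrated for a single pixel. The sketch assumes the latent intensity at time t equals the intensity at a reference time multiplied by exp(c * E(t)), where E(t) is the signed sum of event polarities between the reference time and t, and c is a contrast threshold. The blurry value is then the time average of the latent intensities over the exposure, so the sharp value is the blurry value divided by the average of exp(c * E(t)). The numerical integration, parameter names, and function names are assumptions for this example, not the paper's solver.

```python
import math


def event_integral(events, ref_time, t):
    """Signed sum of polarities of events between ref_time and t.

    `events` is a list of (timestamp, polarity) with polarity +1 or -1.
    """
    if t >= ref_time:
        return sum(p for (s, p) in events if ref_time < s <= t)
    return -sum(p for (s, p) in events if t < s <= ref_time)


def latent_intensity(blurry, events, exposure, ref_time, contrast, samples=1000):
    """Recover the sharp intensity at `ref_time` for one pixel.

    The outer integral over the exposure is approximated with the
    midpoint rule using `samples` points.
    """
    start, end = exposure
    dt = (end - start) / samples
    mean_gain = sum(
        math.exp(contrast * event_integral(events, ref_time, start + (k + 0.5) * dt))
        for k in range(samples)
    ) / samples
    return blurry / mean_gain


if __name__ == "__main__":
    # One positive event halfway through a unit exposure (hypothetical data).
    print(latent_intensity(3.0, [(0.5, 1)], (0.0, 1.0), 0.0, math.log(2.0)))
```

In the abstract's terms, the contrast threshold plays the role of the single scalar variable that the optimization has to determine; here it is simply passed in.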
There is accumulating evidence in the literature that stability of learning algorithms is a key characteristic that permits a learning algorithm to generalize. Despite various insightful results in this direction, there seems to be an overlooked dichotomy in the type of stability-based generalization bounds we have in the literature. On one hand, the literature seems to suggest that exponential generalization bounds for the estimated risk, which are optimal, can be only obtained through stringent, distribution independent and computationally intractable notions of stability such as uniform stability. On the other hand, it seems that weaker notions of stability such as hypothesis stability, although it is distribution dependent and more amenable to computation, can only yield polynomial generalization bounds for the estimated risk, which are suboptimal. In this paper, we address the gap between these two regimes of results. In particular, the main question we address here is whether it is possible to derive exponential generalization bounds for the estimated risk using a notion of stability that is computationally tractable and distribution dependent, but weaker than uniform stability. Using recent advances in concentration inequalities, and using a notion of stability that is weaker than uniform stability but distribution dependent and amenable to computation, we derive an exponential tail bound for the concentration of the estimated risk of a hypothesis returned by a general learning rule, where the estimated risk is expressed in terms of either the resubstitution estimate (empirical error), or the deleted (or, leave-one-out) estimate. As an illustration, we derive exponential tail bounds for ridge regression with unbounded responses – a setting where uniform stability results of Bousquet and Elisseeff (2002) are not applicable.
http://arxiv.org/abs/1903.05457
This paper focuses on the challenging task of learning 3D object surface reconstructions from single RGB images. Existing methods achieve varying degrees of success by using different geometric representations. However, they all have their own drawbacks and cannot reconstruct surfaces of complex topology well. To this end, we propose a skeleton-bridged, stage-wise learning approach to address the challenge. We use the skeleton because it preserves topology while being of lower complexity to learn. To learn the skeleton from an input image, we design a deep architecture whose decoder is based on a novel design of parallel streams, one for synthesizing curve-like skeleton points and one for surface-like skeleton points. We use different shape representations (point cloud, volume, and mesh) in our stage-wise learning in order to exploit their respective advantages. We also propose multi-stage use of the input image to correct prediction errors that may accumulate in each stage. We conduct extensive experiments to investigate the efficacy of our proposed approach. Qualitative and quantitative results on representative object categories of both simple and complex topologies demonstrate the superiority of our approach over existing ones. We will make our ShapeNet-Skeleton dataset publicly available.
http://arxiv.org/abs/1903.04704
Deep learning models hold state-of-the-art performance in many fields, yet their design is still based on heuristics or grid search methods. This work proposes a method to analyze a trained network and deduce an optimized, compressed architecture that preserves accuracy while keeping computational costs tractable. Model compression is an active field of research that targets the problem of realizing deep learning models in hardware. However, most pruning methodologies tend to be experimental, requiring compute- and time-intensive iterations of retraining the entire network. We introduce structure into model design by proposing a single-shot analysis of a trained network that frames model compression as a dimensionality reduction problem. The proposed method analyzes the activations of each layer simultaneously and looks at the dimensionality of the space described by the filters generating these activations. It optimizes the architecture in terms of the number of layers and the number of filters per layer without any iterative retraining procedures, making it a viable, low-effort technique for designing efficient networks. We demonstrate the proposed methodology on AlexNet- and VGG-style networks on the CIFAR-10, CIFAR-100 and ImageNet datasets, and successfully achieve an optimized architecture, reducing both depth and layer-wise width while trading off less than 1% accuracy.
https://arxiv.org/abs/1812.06224
The use of imitation learning to learn a single policy for a complex task that has multiple modes or hierarchical structure can be challenging. In fact, previous work has shown that when the modes are known, learning separate policies for each mode or sub-task can greatly improve the performance of imitation learning. In this work, we discover the interaction between sub-tasks from their resulting state-action trajectory sequences using a directed graphical model. We propose a new algorithm based on the generative adversarial imitation learning framework which automatically learns sub-task policies from unsegmented demonstrations. Our approach maximizes the directed information flow in the graphical model between sub-task latent variables and their generated trajectories. We also show how our approach connects with the existing Options framework, which is commonly used to learn hierarchical policies.
http://arxiv.org/abs/1810.01266
Both accuracy and efficiency are of significant importance to the task of semantic segmentation. Existing deep FCNs suffer from heavy computation due to a series of high-resolution feature maps that preserve the detailed knowledge needed in dense estimation. Although reducing the feature map resolution (i.e., applying a large overall stride) via subsampling operations (e.g., pooling and convolution striding) can immediately increase efficiency, it dramatically decreases estimation accuracy. To tackle this dilemma, we propose a knowledge distillation method tailored for semantic segmentation to improve the performance of compact FCNs with a large overall stride. To handle the inconsistency between the features of the student and teacher networks, we optimize the feature similarity in a transferred latent domain formulated by utilizing a pre-trained autoencoder. Moreover, an affinity distillation module is proposed to capture long-range dependencies by calculating the non-local interactions across the whole image. To validate the effectiveness of our proposed method, extensive experiments have been conducted on three popular benchmarks: Pascal VOC, Cityscapes and Pascal Context. Built upon a highly competitive baseline, our proposed method can improve the performance of a student network by 2.5% (mIoU increases from 70.2 to 72.7 on the Cityscapes test set) and can train a better compact model with only 8% of the floating-point operations (FLOPs) of a model that achieves comparable performance.
http://arxiv.org/abs/1903.04688
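The affinity distillation idea in the segmentation abstract above can be sketched as follows. This is a minimal illustration that assumes affinity is the cosine similarity between feature vectors at every pair of spatial positions and that the student is penalized by the mean squared difference between its affinity matrix and the teacher's; the abstract does not spell out either choice, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def affinity_matrix(features):
    """features: one feature vector per spatial position.
    Returns the position-by-position matrix of pairwise similarities."""
    return [[cosine(fi, fj) for fj in features] for fi in features]

def affinity_loss(student_features, teacher_features):
    """Mean squared difference between student and teacher affinities.
    Both inputs must cover the same number of spatial positions."""
    a_s = affinity_matrix(student_features)
    a_t = affinity_matrix(teacher_features)
    n = len(a_s)
    total = sum((a_s[i][j] - a_t[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)
```

Because the affinity matrix is indexed by position pairs rather than channels, the student and teacher can have different channel widths as long as they cover the same spatial positions.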
The world we see is ever-changing, and it changes with people, things, and the environment. A domain refers to the state of the world at a certain moment. A research problem is characterized as domain transfer adaptation when it requires knowledge correspondence between different moments. Conventional machine learning aims to find a model with the minimum expected risk on test data by minimizing the regularized empirical risk on the training data, which, however, assumes that the training and test data share a similar joint probability distribution. Transfer adaptation learning aims to build models that can perform tasks in a target domain by learning knowledge from a semantically related but differently distributed source domain. It is an active research field of increasing influence and importance. This paper surveys the recent advances in transfer adaptation learning methodology and potential benchmarks. Broader challenges faced by transfer adaptation learning researchers are identified, i.e., instance re-weighting adaptation, feature adaptation, classifier adaptation, deep network adaptation, and adversarial adaptation, which go beyond the early semi-supervised and unsupervised split. The survey provides researchers with a framework for better understanding and identifying the research status, challenges, and future directions of the field.
http://arxiv.org/abs/1903.04687
The prediction of urban vehicle flow and speed can greatly facilitate people’s travel and can also inform the decision-making of relevant government departments. However, vehicle flow is difficult to predict because of its spatial, temporal, and hierarchical structure and the many factors that influence it, such as weather. Most existing methods extract spatial structure information from the road network and time series information from historical data. However, when extracting spatial features, these methods have high time and space complexity and incorporate a lot of noise. They are difficult to apply to large graphs, and they consider only the influence of surrounding connected road nodes on the central node, ignoring a very important hierarchical relationship, namely the similarity between nodes with similar features and road network structures. In response to these problems, this paper proposes the Graph Hierarchical Convolutional Recurrent Neural Network (GHCRNN) model. The model uses a GCN (Graph Convolutional Network) to extract spatial features, a GRU (Gated Recurrent Unit) to extract temporal features, and learnable pooling to extract hierarchical information, eliminate redundant information, and reduce complexity. The model is validated on vehicle flow and speed data from Shenzhen and Los Angeles, where time and memory consumption are effectively reduced at comparable precision.
http://arxiv.org/abs/1903.06261
Lifted inference scales to large probability models by exploiting symmetry. However, existing exact lifted inference techniques do not apply to general factor graphs, as they require a relational representation. In this work we provide a theoretical framework and algorithm for performing exact lifted inference on symmetric factor graphs by computing colored graph automorphisms, as is often done for approximate lifted inference. Our key insight is to represent variable assignments directly in the colored factor graph encoding. This allows us to generate representatives and compute the size of each orbit of the symmetric distribution. In addition to exact inference, we use this encoding to implement an MCMC algorithm that explores the space of orbits quickly by uniform orbit sampling.
http://arxiv.org/abs/1903.04672
This paper describes the design, manufacture, and performance of a highly dexterous, low-profile, 7 Degree-of-Freedom (DOF) robotic arm for CT-guided percutaneous needle biopsy. Direct CT guidance allows physicians to localize tumours quickly; however, needle insertion is still performed by hand. This system is mounted to a fully active gantry superior to the patient’s head and teleoperated by a radiologist. Unlike other similar robots, this robot’s fully serial-link approach uses a unique combination of belt and cable drives for high transparency and minimal backlash, allowing for an expansive working area and numerous approach angles to targets, all while maintaining a small in-bore cross-section of less than 16 cm². Simulations verified the system’s expansive collision-free workspace and ability to hit targets across the entire chest, as required for lung cancer biopsy. Targeting error is on average less than 1 mm on a teleoperated accuracy task, illustrating that the system is sufficiently accurate to perform biopsy procedures. The system is designed for lung biopsies because of the large working volume required to reach peripheral lung lesions; however, with its large working volume and small in-bore cross-sectional area, the robotic system is effectively a general-purpose CT-compatible manipulation device for percutaneous procedures. Finally, given the considerable development time invested in designing a precise and flexible-use system, and the desire to reduce the burden on other researchers developing algorithms for image-guided surgery, this system is open-access and, to the best of our knowledge, is the first open-hardware image-guided biopsy robot of its kind.
http://arxiv.org/abs/1903.04646
Many types of 3D acquisition sensors have emerged in recent years, and point clouds have been widely used in many areas. Accurate and fast registration of cross-source 3D point clouds from different sensors is an emerging research problem in computer vision. This problem is extremely challenging because cross-source point clouds contain a mixture of variations, such as differences in density, partial overlap, large noise and outliers, and viewpoint changes. In this paper, an algorithm is proposed to align cross-source point clouds with both high accuracy and high efficiency. There are two main contributions. First, two components, weak region affinity and pixel-wise refinement, are proposed to maintain the global and local information of 3D point clouds. Second, these two components are integrated into an iterative tensor-based registration algorithm to solve the cross-source point cloud registration problem. We conduct experiments on a synthetic cross-source benchmark dataset and real cross-source datasets. Compared with six state-of-the-art methods, the proposed method achieves both higher efficiency and higher accuracy.
http://arxiv.org/abs/1903.04630
Quadrotor stabilizing controllers often require careful, model-specific tuning for safe operation. We use reinforcement learning to train policies in simulation that transfer remarkably well to multiple different physical quadrotors. Our policies are low-level, i.e., we map the rotorcrafts’ state directly to the motor outputs. The trained control policies are very robust to external disturbances and can withstand harsh initial conditions such as throws. We show how different training methodologies (change of the cost function, modeling of noise, use of domain randomization) might affect flight performance. To the best of our knowledge, this is the first work that demonstrates that a simple neural network can learn a robust stabilizing low-level quadrotor controller without the use of a stabilizing PD controller; as well as the first work that analyses the transfer capability of a single policy to multiple quadrotors.
http://arxiv.org/abs/1903.04628
Inferring the relations between two images is an important class of tasks in computer vision. Examples of such tasks include computing optical flow and stereo disparity. We treat the relation inference tasks as a machine learning problem and tackle it with neural networks. A key to the problem is learning a representation of relations. We propose a new neural network module, contrast association unit (CAU), which explicitly models the relations between two sets of input variables. Due to the non-negativity of the weights in CAU, we adopt a multiplicative update algorithm for learning these weights. Experiments show that neural networks with CAUs are more effective in learning five fundamental image transformations than conventional neural networks.
http://arxiv.org/abs/1705.05665
Many video enhancement algorithms rely on optical flow to register frames in a video sequence. Precise flow estimation is, however, intractable, and optical flow itself is often a sub-optimal representation for particular video processing tasks. In this paper, we propose task-oriented flow (TOFlow), a motion representation learned in a self-supervised, task-specific manner. We design a neural network with a trainable motion estimation component and a video processing component, and train them jointly to learn the task-oriented flow. For evaluation, we build Vimeo-90K, a large-scale, high-quality video dataset for low-level video processing. TOFlow outperforms traditional optical flow on standard benchmarks as well as our Vimeo-90K dataset in three video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution.
http://arxiv.org/abs/1711.09078
The past decade has witnessed great success in applying deep learning to enhance the quality of compressed video. However, the existing approaches aim at quality enhancement on a single frame, or use only fixed neighboring frames. Thus they fail to take full advantage of the inter-frame correlation in the video. This paper proposes the Quality-Gated Convolutional Long Short-Term Memory (QG-ConvLSTM) network with a bi-directional recurrent structure to fully exploit the useful information in a large range of frames. More importantly, because quality fluctuates noticeably among compressed frames, higher-quality frames can provide more useful information for enhancing other frames. Therefore, we propose learning the “forget” and “input” gates in the ConvLSTM cell from quality-related features. As such, frames of varying quality contribute to the memory in ConvLSTM with different importance, so that the information of each frame is used reasonably and adequately. Finally, the experiments validate the effectiveness of our QG-ConvLSTM approach in advancing the state of the art in quality enhancement of compressed video, and the ablation study shows that our QG-ConvLSTM approach learns to make a trade-off between quality and correlation when leveraging multi-frame information.
http://arxiv.org/abs/1903.04596
Superpixel algorithms are a common pre-processing step for computer vision algorithms such as segmentation, object tracking and localization. Many superpixel methods rely only on color features for segmentation, limiting performance in low-contrast regions and applicability to infrared or medical images where object boundaries have wide appearance variability. We study the inclusion of deep image features in the SLIC superpixel algorithm to exploit higher-level image representations. In addition, we devise a trainable superpixel algorithm, yielding an intermediate domain-specific image representation that can be applied to different tasks. A clustering-based superpixel algorithm is transformed into a pixel-wise classification task, and superpixel training data is derived from semantic segmentation datasets. Our results demonstrate that this approach is able to improve superpixel quality consistently.
http://arxiv.org/abs/1903.04586
Given an environment with continuous state spaces and discrete actions, we investigate using a Double Deep Q-learning Reinforcement Agent to find optimal policies using the LunarLander-v2 OpenAI gym environment.
http://arxiv.org/abs/1708.02378
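For readers unfamiliar with the Double Deep Q-learning mentioned in the abstract above, the sketch below shows the standard Double DQN bootstrap target: the online network selects the next action and the target network evaluates it. The function name, the default discount factor, and the plain-list representation of Q-values are illustrative assumptions; the abstract gives no implementation details.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Bootstrap target for one transition.

    q_online_next: Q-values of the online network at the next state.
    q_target_next: Q-values of the target network at the same next state.
    Both lists have one entry per discrete action."""
    if done:
        # Terminal transition: nothing to bootstrap from.
        return reward
    # Action selection uses the online network...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ...while action evaluation uses the target network.
    return reward + gamma * q_target_next[best_action]

# Hypothetical Q-values for an environment with four discrete actions.
target = double_dqn_target(
    reward=1.0,
    done=False,
    q_online_next=[0.1, 0.9, 0.3, 0.2],
    q_target_next=[5.0, 2.0, 7.0, 1.0],
    gamma=0.5,
)
```

Decoupling selection from evaluation in this way is what distinguishes the target from plain Q-learning, which would use the maximum of the target network's own values (7.0 in the hypothetical example) and tends to overestimate.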
Monaural speech enhancement has made dramatic advances since the introduction of deep learning a few years ago. Although enhanced speech has been demonstrated to have better intelligibility and quality for human listeners, feeding it directly to automatic speech recognition (ASR) systems trained with noisy speech has not produced expected improvements in ASR performance. The lack of an enhancement benefit on recognition, or the gap between monaural speech enhancement and recognition, is often attributed to speech distortions introduced in the enhancement process. In this study, we analyze the distortion problem, compare different acoustic models, and investigate a distortion-independent training scheme for monaural speech recognition. Experimental results suggest that distortion-independent acoustic modeling is able to overcome the distortion problem. Such an acoustic model can also work with speech enhancement models different from the one used during training. Moreover, the models investigated in this paper outperform the previous best system on the CHiME-2 corpus.
http://arxiv.org/abs/1903.04567
Neural Machine Translation (NMT) systems are known to degrade when confronted with noisy data, especially when the system is trained only on clean data. In this paper, we show that augmenting training data with sentences containing artificially-introduced grammatical errors can make the system more robust to such errors. In combination with an automatic grammar error correction system, we can recover 1.5 BLEU out of 2.4 BLEU lost due to grammatical errors. We also present a set of Spanish translations of the JFLEG grammar error correction corpus, which allows for testing NMT robustness to real grammatical errors.
https://arxiv.org/abs/1808.06267
Unintended bias in Machine Learning can manifest as systemic differences in performance for different demographic groups, potentially compounding existing challenges to fairness in society at large. In this paper, we introduce a suite of threshold-agnostic metrics that provide a nuanced view of this unintended bias, by considering the various ways that a classifier’s score distribution can vary across designated groups. We also introduce a large new test set of online comments with crowd-sourced annotations for identity references. We use this to show how our metrics can be used to find new and potentially subtle unintended bias in existing public models.
http://arxiv.org/abs/1903.04561
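One simple example of a threshold-agnostic measure in the spirit of the bias-metrics abstract above is ROC AUC restricted to the examples that mention a given identity group. The sketch below is an assumption about what such a metric can look like, not a reproduction of the paper's metric suite; the tuple layout and function names are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative
    (ties count as half). Needs no decision threshold."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def subgroup_auc(examples, group):
    """examples: (score, label, group) tuples with label 1 or 0.
    Computes AUC using only the examples that belong to `group`."""
    pos = [s for s, y, g in examples if g == group and y == 1]
    neg = [s for s, y, g in examples if g == group and y == 0]
    return auc(pos, neg)

# Hypothetical classifier scores for two identity groups.
data = [
    (0.9, 1, "a"), (0.8, 0, "a"), (0.3, 0, "a"), (0.7, 1, "a"),
    (0.9, 1, "b"), (0.1, 0, "b"),
]
gap = subgroup_auc(data, "b") - subgroup_auc(data, "a")
```

A large gap between groups would suggest that the score distribution separates classes less well for one group, independent of any particular threshold.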
Generating large labeled training data is becoming the biggest bottleneck in building and deploying supervised machine learning models. Recently, data programming has been proposed in the data management community to reduce the human cost in training data generation. Data programming expects users to write a set of labeling functions, each of which is a weak supervision source that labels a subset of data points with better-than-random accuracy. However, the success of data programming heavily depends on the quality (in terms of both accuracy and coverage) of the labeling functions that users still need to design manually. We propose affinity coding, a new paradigm for fully automatic generation of training data. In affinity coding, the similarity between the unlabeled instances and prototypes that are derived from the same unlabeled instances serves as a signal (or source of weak supervision) for determining class membership. We term this implicit similarity the affinity score. Consequently, we can have as many sources of weak supervision as the number of unlabeled data points, without any human input. We also propose a system called GOGGLES that is an implementation of affinity coding for labeling image datasets. GOGGLES features novel techniques for deriving affinity scores from image datasets based on “semantic prototypes” extracted from convolutional neural nets, as well as an expectation-maximization approach for performing class label inference based on the computed affinity scores. Compared to the state-of-the-art data programming system Snorkel, GOGGLES exhibits a 14.88% average improvement in the quality of labels generated for the binary labeling task. The GOGGLES system is open-sourced at https://github.com/chu-data-lab/GOGGLES/.
http://arxiv.org/abs/1903.04552
We address the unsupervised learning of several interconnected problems in low-level vision: single view depth prediction, camera motion estimation, optical flow, and segmentation of a video into the static scene and moving regions. Our key insight is that these four fundamental vision problems are coupled through geometric constraints. Consequently, learning to solve them together simplifies the problem because the solutions can reinforce each other. We go beyond previous work by exploiting geometry more explicitly and segmenting the scene into static and moving regions. To that end, we introduce Competitive Collaboration, a framework that facilitates the coordinated training of multiple specialized neural networks to solve complex problems. Competitive Collaboration works much like expectation-maximization, but with neural networks that act as both competitors to explain pixels that correspond to static or moving regions, and as collaborators through a moderator that assigns pixels to be either static or independently moving. Our novel method integrates all these problems in a common framework and simultaneously reasons about the segmentation of the scene into moving objects and the static background, the camera motion, depth of the static scene structure, and the optical flow of moving objects. Our model is trained without any supervision and achieves state-of-the-art performance among joint unsupervised methods on all sub-problems.
http://arxiv.org/abs/1805.09806
Recent studies have shown that state-of-the-art deep learning models are vulnerable to inputs with small perturbations (adversarial examples). We observe two critical obstacles in adversarial examples: (i) strong adversarial attacks (e.g., the C&W attack) require manually tuning hyper-parameters and take a long time to construct an adversarial example, making it impractical to attack real-time systems; (ii) most studies focus on non-sequential tasks, such as image classification, and only a few consider sequential tasks. In this work, we speed up adversarial attacks, especially on sequential learning tasks. By leveraging the uncertainty of each task, we directly learn the adaptive multi-task weightings, without manually searching hyper-parameters. A unified architecture is developed and evaluated for both non-sequential and sequential tasks. To validate its effectiveness, we take the scene text recognition task as a case study. To the best of our knowledge, our proposed method is the first attempt at adversarial attacks on scene text recognition. Adaptive Attack achieves a success rate of over 99.9% with a 3-6X speedup compared to state-of-the-art adversarial attacks.
https://arxiv.org/abs/1807.03326
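The adaptive multi-task weighting mentioned in the adversarial-attack abstract above can be illustrated with one widely used formulation, homoscedastic uncertainty weighting, in which each task loss is scaled by a learned inverse variance and a log-variance penalty keeps the weights from collapsing to zero. Whether the paper uses exactly this form is an assumption; the function name and the log-variance parameterization are illustrative.

```python
import math

def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine task losses with learned per-task log-variances s_i:

        total = sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i

    A larger s_i down-weights task i but pays a penalty of 0.5 * s_i,
    so the weights can be optimized jointly with the main variables
    instead of being searched by hand."""
    total = 0.0
    for loss, s in zip(task_losses, log_variances):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total

# With all log-variances at zero, every task simply gets weight 0.5.
equal = uncertainty_weighted_loss([2.0, 4.0], [0.0, 0.0])

# For fixed positive losses the objective is minimized at s_i = log(L_i).
tuned = uncertainty_weighted_loss([2.0, 4.0], [math.log(2.0), math.log(4.0)])
```

In an attack setting, the task losses would be terms such as a perturbation-size penalty and a misclassification objective, which is the balance that otherwise requires manual hyper-parameter search.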
In this paper we present a curated dataset from the NASA Solar Dynamics Observatory (SDO) mission in a format suitable for machine learning research. Beginning from level 1 scientific products we have processed various instrumental corrections, downsampled to manageable spatial and temporal resolutions, and synchronized observations spatially and temporally. We illustrate the use of this dataset with two example applications: forecasting future EVE irradiance from present EVE irradiance and translating HMI observations into AIA observations. For each application we provide metrics and baselines for future model comparison. We anticipate this curated dataset will facilitate machine learning research in heliophysics and the physical sciences generally, increasing the scientific return of the SDO mission. This work is a direct result of the 2018 NASA Frontier Development Laboratory Program. Please see the appendix for access to the dataset.
http://arxiv.org/abs/1903.04538
Recent neural-network-based text-to-speech (speech synthesis) pipelines have employed recurrent Seq2seq architectures that can synthesize realistic-sounding speech directly from text characters. These systems, however, have complex architectures and take a substantial amount of time to train. We introduce several modifications to these Seq2seq architectures that allow for faster training and also allow us to reduce the complexity of the model architecture at the same time. We show that our proposed model can achieve attention alignment much faster than previous architectures and that good audio quality can be achieved with a much smaller model. Sample audio is available at https://soundcloud.com/gary-wang-23/sets/tts-samples-for-cmpt-419.
http://arxiv.org/abs/1903.07398