Future 5G wireless networks will rely on agile and automated network management, where the usage of diverse resources must be jointly optimized with surgical accuracy. A number of key wireless network functionalities (e.g., traffic steering, power control) give rise to hard optimization problems. What is more, high spatio-temporal traffic variability, coupled with the need to satisfy strict per slice/service SLAs in modern networks, suggests that these problems must be constantly (re-)solved to maintain close-to-optimal performance. To this end, we propose the framework of Online Network Optimization (ONO), which seeks to maintain both agile and efficient control over time, using an arsenal of data-driven, online learning, and AI-based techniques. Since the mathematical tools and the studied regimes vary widely among these methodologies, a theoretical comparison is often out of reach. Therefore, the important question ‘what is the right ONO technique?’ remains open to date. In this paper, we discuss the pros and cons of each technique and present a direct quantitative comparison for a specific use case, using real data. Our results suggest that carefully combining the insights of problem modeling with state-of-the-art AI techniques provides significant advantages at reasonable complexity.
http://arxiv.org/abs/1805.12090
Estimating over-amplification of human epidermal growth factor receptor 2 (HER2) on invasive breast cancer (BC) is regarded as a significant predictive and prognostic marker. We propose a novel deep reinforcement learning (DRL) based model that treats immunohistochemical (IHC) scoring of HER2 as a sequential learning task. For a given image tile sampled from a multi-resolution giga-pixel whole slide image (WSI), the model learns to sequentially identify some of the diagnostically relevant regions of interest (ROIs) by following a parameterized policy. The selected ROIs are processed by recurrent and residual convolution networks to learn the discriminative features for different HER2 scores and predict the next location. The model does not need to process all the sub-image patches of a given tile to predict the HER2 score, mimicking the histopathologist, who would not usually analyze every part of the slide at the highest magnification. The proposed model incorporates a task-specific regularization term and an inhibition-of-return mechanism to prevent the model from revisiting previously attended locations. We evaluated our model on two IHC datasets: a publicly available dataset from the HER2 scoring challenge contest and another dataset consisting of WSIs of gastroenteropancreatic neuroendocrine tumor sections stained with Glo1 marker. We demonstrate that the proposed model outperforms other methods based on state-of-the-art deep convolutional networks. To the best of our knowledge, this is the first study using DRL for IHC scoring, and it could potentially lead to wider use of DRL in the domain of computational pathology, reducing the computational burden of analyzing large multi-gigapixel histology images.
http://arxiv.org/abs/1903.10762
3D object detection from raw and sparse point clouds has been far less treated to date, compared with its 2D counterpart. In this paper, we propose a novel framework called FVNet for 3D front-view proposal generation and object detection from point clouds. It consists of two stages: generation of front-view proposals and estimation of 3D bounding box parameters. Instead of generating proposals from camera images or bird’s-eye-view maps, we first project point clouds onto a cylindrical surface to generate front-view feature maps which retain rich information. We then introduce a proposal generation network to predict 3D region proposals from the generated maps and further extrude objects of interest from the whole point cloud. Finally, we present another network to extract the point-wise features from the extruded object points and regress the final 3D bounding box parameters in the canonical coordinates. Our framework achieves real-time performance with 12 ms per point cloud sample. Extensive experiments on the 3D detection benchmark KITTI show that the proposed architecture outperforms state-of-the-art techniques which take either camera images or point clouds as input, in terms of accuracy and inference time.
http://arxiv.org/abs/1903.10750
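The cylindrical front-view projection mentioned in the FVNet abstract above can be illustrated with a minimal sketch. This is a generic cylindrical mapping, not necessarily the paper's exact formulation: the function name `cylindrical_pixel`, the use of the elevation angle for the row index, and the angular resolutions are assumptions made for illustration.

```python
import math


def cylindrical_pixel(point, d_theta, d_phi):
    """Map a 3D point (x, y, z) to a (row, col) cell on a cylindrical front view.

    d_theta and d_phi are the horizontal and vertical angular resolutions in
    radians (hypothetical values chosen by the caller, not taken from the paper).
    """
    x, y, z = point
    theta = math.atan2(y, x)               # azimuth around the vertical axis
    phi = math.atan2(z, math.hypot(x, y))  # elevation above the horizontal plane
    return math.floor(phi / d_phi), math.floor(theta / d_theta)


res = math.radians(2.0)  # assumed 2-degree cells in both directions
cell_side = cylindrical_pixel((1.0, 1.0, 0.0), res, res)   # 45 degrees azimuth
cell_above = cylindrical_pixel((1.0, 0.0, 1.0), res, res)  # 45 degrees elevation
```

Per-cell features (for example range or height) would then be accumulated into the front-view map; that step is omitted here.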
Gait analysis is the study of the systematic methods that assess and quantify animal locomotion. Research on gait analysis has evolved considerably over time. It was an ancient art, and it still finds application today in modern science and medicine. This paper describes how one’s gait can be used as a biometric. It covers a broad range of salient research done within the field and explains the nuances and advances in each type of gait analysis. The prominent methods of gait recognition from the early era to the state of the art are covered. This survey also reviews the various gait datasets. The overall aim of this study is to provide a concise roadmap for anyone who wishes to do research in the field of gait biometrics.
http://arxiv.org/abs/1903.10744
Neural networks equipped with self-attention have parallelizable computation, light-weight structure, and the ability to capture both long-range and local dependencies. Further, their expressive power and performance can be boosted by using a vector to measure pairwise dependency, but this requires expanding the alignment matrix to a tensor, which results in memory and computation bottlenecks. In this paper, we propose a novel attention mechanism called “Multi-mask Tensorized Self-Attention” (MTSA), which is as fast and as memory-efficient as a CNN, but significantly outperforms previous CNN-/RNN-/attention-based models. MTSA 1) captures both pairwise (token2token) and global (source2token) dependencies by a novel compatibility function composed of dot-product and additive attentions, 2) uses a tensor to represent the feature-wise alignment scores for better expressive power but only requires parallelizable matrix multiplications, and 3) combines multi-head with multi-dimensional attentions, and applies a distinct positional mask to each head (subspace), so the memory and computation can be distributed to multiple heads, each with sequential information encoded independently. The experiments show that a CNN/RNN-free model based on MTSA achieves state-of-the-art or competitive performance on nine NLP benchmarks with compelling memory- and time-efficiency.
http://arxiv.org/abs/1805.00912
Flaw detection in non-destructive testing (NDT), especially in complex signals like ultrasonic data, has thus far relied heavily on the expertise and judgement of trained human inspectors. While automated systems have been used for a long time, these have mostly been limited to simple decision automation, such as a signal amplitude threshold. Recent advances in machine learning algorithms have solved many similarly difficult classification problems that were previously considered intractable. For non-destructive testing, encouraging results have already been reported in the open literature, but the use of machine learning is still very limited in NDT applications in the field. A key issue hindering its use is the limited availability of representative flawed datasets for training. In the present paper, we develop a modern, very deep convolutional network to detect flaws from phased-array ultrasonic data. We make extensive use of data augmentation to enhance the initially limited raw data and to aid learning. The data augmentation utilizes virtual flaws, a technique that has successfully been used in training human inspectors and is soon to be used in nuclear inspection qualification. The results from the machine learning classifier are compared to human performance. We show that, using sophisticated data augmentation, modern deep learning networks can be trained to achieve superhuman performance by a significant margin.
http://arxiv.org/abs/1903.11399
This paper presents a new approach for relatively accurate brain region of interest (ROI) detection from dynamic susceptibility contrast (DSC) perfusion magnetic resonance (MR) images of a human head with abnormal brain anatomy. Such images cause problems for automatic brain segmentation algorithms, and as a result, poor perfusion ROI detection affects both quantitative measurements and visual assessment of perfusion data. In the proposed approach, image segmentation is based on a CUSUM filter that was adapted to process DSC perfusion MR images. The result of segmentation is a binary mask of the brain ROI that is generated from the location of the brain boundary. Each point of the boundary between the brain and surrounding tissues is detected as a change-point by the CUSUM filter. The proposed adapted CUSUM filter operates by accumulating the deviations between the observed and expected intensities of image points while moving along a trajectory. The motion trajectory is created by iteratively changing the direction of movement inside the background region in order to reach the brain region, and vice versa after boundary crossing. The proposed segmentation approach was evaluated with the Dice index by comparing the obtained results to the reference standard. Manually marked brain region pixels (the reference standard), as well as visual inspection of the brain ROI detected with the CUSUM filter, were provided by experienced radiologists. The results showed that the proposed approach is suitable for brain ROI detection from DSC perfusion MR images of a human head with abnormal brain anatomy and can, therefore, be applied in DSC perfusion data analysis.
http://arxiv.org/abs/1904.00787
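The change-point idea in the abstract above (accumulating deviations between observed and expected intensities along a trajectory) can be sketched with a textbook one-sided CUSUM detector. This is not the authors' adapted filter: the function name, the `drift` and `threshold` parameters, and the toy intensity profile are assumptions made for illustration, and the trajectory logic is omitted.

```python
def cusum_change_point(samples, expected, drift=0.0, threshold=5.0):
    """Return the index where the one-sided CUSUM statistic first exceeds
    `threshold`, or None if it never does.

    samples   -- intensities observed along a 1-D path (hypothetical data)
    expected  -- expected intensity before the change (e.g. background level)
    drift     -- slack subtracted at every step to ignore small fluctuations
    """
    stat = 0.0
    for index, value in enumerate(samples):
        # Accumulate positive deviations from the expected level; reset at zero.
        stat = max(0.0, stat + (value - expected - drift))
        if stat > threshold:
            return index
    return None


# Toy profile: four background samples followed by brighter tissue.
profile = [0.1, -0.2, 0.0, 0.3, 4.0, 4.2, 3.9, 4.1]
boundary = cusum_change_point(profile, expected=0.0, drift=0.5, threshold=5.0)
print(boundary)  # 5
```

The detector fires one sample after the jump because the accumulated statistic needs two bright samples to cross the threshold; a lower threshold detects earlier at the cost of more false alarms.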
With the advantage of low storage cost and high retrieval efficiency, hashing techniques have recently become an emerging topic in cross-modal similarity search. As multi-modal data reflect similar semantic content, many studies aim at learning unified binary codes. However, the discriminative hashing features learned by these methods are not adequate, which results in lower accuracy and robustness. We propose a novel hashing learning framework, termed Discriminative Supervised Hashing (DSH), which jointly performs classifier learning, subspace learning and matrix factorization to preserve class-specific semantic content and to learn discriminative unified binary codes for multi-modal data. Besides reducing the loss of information and preserving the non-linear structure of data, DSH non-linearly projects different modalities into a common space in which the similarity among heterogeneous data points can be measured. Extensive experiments conducted on three publicly available datasets demonstrate that the framework proposed in this paper outperforms several state-of-the-art methods.
http://arxiv.org/abs/1812.07660
This paper proposes a new type of actuator at millimeter scale, which is based on Simplified Electro-Permanent (SEP) magnets. The new actuator can achieve connection and smooth motion by controlling the polarity of SEP magnets. Analyses based on numerical simulation are used to design a prototype. A dead-time controllable H-bridge and its multiplex design are proposed for controlling the new actuator and simplifying the electronic circuit. Finally, the new actuator is implemented in the DILI modular reconfigurable robot system. The experimental results show that with this new actuator, the DILI module can move smoothly and connect to other modules without a power supply during connection. The maximum speed of the DILI module is 20 mm/s.
http://arxiv.org/abs/1904.09889
Modern large-scale automation systems integrate thousands to hundreds of thousands of physical sensors and actuators. Demands for more flexible reconfiguration of production systems and optimization across different information models, standards and legacy systems challenge current system interoperability concepts. Automatic semantic translation across information models and standards is an increasingly important problem that needs to be addressed to fulfill these demands in a cost-efficient manner under constraints of human capacity and resources in relation to timing requirements and system complexity. Here we define a translator-based operational interoperability model for interacting cyber-physical systems in mathematical terms, which includes system identification and ontology-based translation as special cases. We present alternative mathematical definitions of the translator learning task and mappings to similar machine learning tasks and solutions based on recent developments in machine learning. Possibilities to learn translators between artefacts without a common physical context, for example in simulations of digital twins and across layers of the automation pyramid are briefly discussed.
http://arxiv.org/abs/1903.10735
We present a deep neural network based singing voice synthesizer, inspired by the Deep Convolutional Generative Adversarial Networks (DCGAN) architecture and optimized using the Wasserstein-GAN algorithm. We use vocoder parameters for acoustic modelling, to separate the influence of pitch and timbre. This facilitates the modelling of the large variability of pitch in the singing voice. Our network takes a block of consecutive frame-wise linguistic and fundamental frequency features, along with global singer identity, as input and outputs vocoder features. For inference, sequential blocks are concatenated using an overlap-add procedure. We show that the performance of our model is comparable to the state-of-the-art and the original sample using objective metrics and a subjective listening test. We also present examples of the synthesis on a supplementary website and the source code via GitHub.
http://arxiv.org/abs/1903.10729
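The overlap-add concatenation of sequential blocks mentioned in the abstract above can be sketched as follows. This is a generic weighted overlap-add, not the authors' code: the triangular window, the normalization by the summed window, and the use of one scalar per frame (instead of a vocoder feature vector) are simplifying assumptions.

```python
def overlap_add(blocks, hop):
    """Merge equally sized, overlapping blocks of frames into one sequence.

    blocks -- list of lists, one scalar value per frame (illustrative only)
    hop    -- number of frames between the starts of consecutive blocks
    """
    block_len = len(blocks[0])
    # Triangular window, strictly positive so the normalizer is never zero.
    window = [1.0 - abs((i + 0.5) / block_len * 2.0 - 1.0) for i in range(block_len)]
    total = hop * (len(blocks) - 1) + block_len
    out = [0.0] * total
    norm = [0.0] * total
    for b, block in enumerate(blocks):
        start = b * hop
        for i, value in enumerate(block):
            out[start + i] += window[i] * value
            norm[start + i] += window[i]
    # Dividing by the summed window turns the sum into a weighted average,
    # so regions where blocks agree are reproduced unchanged.
    return [o / n for o, n in zip(out, norm)]


merged = overlap_add([[1.0] * 4, [1.0] * 4, [1.0] * 4], hop=2)
```

With three blocks of four frames and a hop of two, `merged` has eight frames; frames covered by two blocks are cross-faded between them.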
Human phenotype-gene relations are fundamental to fully understand the origin of some phenotypic abnormalities and their associated diseases. Biomedical literature is the most comprehensive source of these relations; however, we need Relation Extraction tools to automatically recognize them. Most of these tools require an annotated corpus, and to the best of our knowledge, there is no corpus available annotated with human phenotype-gene relations. This paper presents the Phenotype-Gene Relations (PGR) corpus, a silver standard corpus of human phenotype and gene annotations and their relations. The corpus consists of 1712 abstracts, 5676 human phenotype annotations, 13835 gene annotations, and 4283 relations. We generated this corpus using Named-Entity Recognition tools, whose results were partially evaluated by eight curators, obtaining a precision of 87.01%. By using the corpus, we were able to obtain promising results with two state-of-the-art deep learning tools, namely a precision of 78.05%. The PGR corpus was made publicly available to the research community.
http://arxiv.org/abs/1903.10728
The flexible flow shop scheduling problem is NP-hard, and finding optimal or even adequate solutions for large instances requires significant computation time. Thus, this paper proposes a dual island genetic algorithm consisting of a parallel cellular model and a parallel pseudo model. This is a two-level parallelization highly consistent with the underlying architecture and is well suited for parallelizing inside or between GPUs and a multi-core CPU. At the higher level, the efficiency of island GAs is improved by exploring new regions within the search space utilizing different methods. Meanwhile, at the lower level, the cellular model maintains population diversity through decentralization and the pseudo model enhances the search ability through the complementary parent strategy. To encourage information sharing between islands, a penetration-inspired migration policy is designed which sets the topology, the rate, the interval and the strategy adaptively. Finally, the proposed method is tested on some large flexible flow shop scheduling instances in comparison with other parallel algorithms. The computational results show that it not only obtains competitive results but also reduces execution time.
https://arxiv.org/abs/1903.10722
With the advantage of low storage cost and high efficiency, hashing learning has received much attention in the retrieval field. As multi-modal data representing a common object are semantically complementary, many works focus on learning unified binary codes. However, these works ignore the importance of the manifold structure among data. In fact, directly preserving the local manifold structure among samples in Hamming space is still an interesting problem. Since different modalities are heterogeneous, we adopt the concatenated features of multiple modalities to represent the original object. In our framework, Locally Linear Embedding and Locality Preserving Projection are introduced to reconstruct the manifold structure of the original space in the Hamming space. Besides, L21-norm regularization is imposed on the projection matrices to further exploit the discriminative features for different modalities simultaneously. Extensive experiments are performed to evaluate the proposed method, dubbed Unsupervised Concatenation Hashing (UCH), on three publicly available datasets, and the experimental results show that UCH outperforms most state-of-the-art unsupervised hashing models.
http://arxiv.org/abs/1904.00726
This paper explores a simple and efficient baseline for person re-identification (ReID). Person ReID with deep neural networks has made progress and achieved high performance in recent years. However, many state-of-the-art methods design complex network structures and concatenate multi-branch features. In the literature, some effective training tricks appear only briefly in several papers or source codes. This paper collects and evaluates these effective training tricks in person ReID. By combining these tricks, the model achieves 94.5% rank-1 and 85.9% mAP on Market1501 using only global features. Our code and models are available on GitHub.
http://arxiv.org/abs/1903.07071
Embedding entities and relations into a continuous multi-dimensional vector space has become the dominant method for knowledge graph embedding in representation learning. However, most existing models fail to represent hierarchical knowledge, such as the similarities and dissimilarities of entities in one domain. We propose to learn domain representations over existing knowledge graph embedding models, such that entities that have similar attributes are organized into the same domain. Such hierarchical knowledge of domains can give further evidence in link prediction. Experimental results show that domain embeddings give a significant improvement over the most recent state-of-the-art baseline knowledge graph embedding models.
http://arxiv.org/abs/1903.10716
This paper proposes multiscale convolutional neural network (CNN)-based deep metric learning for bioacoustic classification, under low training data conditions. The proposed CNN is characterized by the utilization of four different filter sizes at each level to analyze input feature maps. This multiscale nature helps in describing different bioacoustic events effectively: smaller filters help in learning the finer details of bioacoustic events, whereas larger filters help in analyzing a larger context, leading to global details. A dynamic triplet loss is employed in the proposed CNN architecture to learn a transformation from the input space to the embedding space, where classification is performed. The triplet loss helps in learning this transformation by analyzing three examples, referred to as triplets, at a time, such that the intra-class distance is minimized while the inter-class separation is maximized by a dynamically increasing margin. The number of possible triplets increases cubically with the dataset size, making triplet loss more suitable than the softmax cross-entropy loss in low training data conditions. Experiments on three different publicly available datasets show that the proposed framework performs better than existing bioacoustic classification frameworks. Experimental results also confirm the superiority of the triplet loss over the cross-entropy loss in low training data conditions.
http://arxiv.org/abs/1903.10713
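The triplet loss with an increasing margin described in the abstract above can be sketched in a few lines. The hinge form below is the standard triplet loss; the linear margin schedule in `dynamic_margin` (its start value, step and cap) is a hypothetical choice for illustration, since the abstract does not state how the margin grows.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin):
    """Standard hinge triplet loss: pull the positive closer to the anchor
    than the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def dynamic_margin(epoch, start=0.1, step=0.05, cap=1.0):
    """Assumed schedule: the margin grows linearly with the epoch up to a cap."""
    return min(cap, start + step * epoch)


# A well-separated triplet gives zero loss; swapping positive and negative
# gives a large loss.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0], margin=1.0)
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=1.0)
```

In training, the margin passed to `triplet_loss` would be `dynamic_margin(epoch)`, so the separation demanded between classes tightens as learning progresses.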
Stereo cameras and dense stereo matching algorithms are core components for many robotic applications due to their abilities to directly obtain dense depth measurements and their robustness against changes in lighting conditions. However, the performance of dense depth estimation relies heavily on accurate stereo extrinsic calibration. In this work, we present a real-time markerless approach for obtaining high-precision stereo extrinsic calibration using a novel 5-DOF (degrees-of-freedom) and nonlinear optimization on a manifold, which captures the observability property of vision-only stereo calibration. Our method minimizes epipolar errors between spatial per-frame sparse natural features. It does not require temporal feature correspondences, making it not only invariant to dynamic scenes and illumination changes, but also able to run significantly faster than standard bundle adjustment-based approaches. We introduce a principled method to determine if the calibration converges to the required level of accuracy, and show through online experiments that our approach achieves a level of accuracy that is comparable to offline marker-based calibration methods. Our method refines the stereo extrinsics to an accuracy that is sufficient for block matching-based dense disparity computation. It provides a cost-effective way to improve the reliability of stereo vision systems for long-term autonomy.
http://arxiv.org/abs/1903.10705
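The epipolar error minimized in the abstract above can be illustrated with the textbook algebraic residual x_right^T E x_left, where E = [t]_x R. This is a sketch of the standard constraint, not the authors' optimizer: the helper names, the convention X_right = R X_left + t, and the use of normalized image coordinates are assumptions. Because the residual is unchanged in its zero set when t is rescaled, only the direction of t matters, which is consistent with a 5-DOF parameterization (three for rotation, two for translation direction).

```python
def skew(t):
    """Cross-product matrix [t]_x of a 3-vector."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def essential_matrix(rotation, translation_dir):
    """E = [t]_x R for the assumed convention X_right = R X_left + t."""
    return matmul3(skew(translation_dir), rotation)


def epipolar_residual(E, x_left, x_right):
    """Algebraic epipolar error for homogeneous normalized points."""
    Ex = [sum(E[i][j] * x_left[j] for j in range(3)) for i in range(3)]
    return sum(x_right[i] * Ex[i] for i in range(3))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
E = essential_matrix(identity, (1.0, 0.0, 0.0))  # ideal rectified stereo pair

# A match on the same image row satisfies the constraint; a vertical
# offset of 0.2 in the right image shows up directly in the residual.
good = epipolar_residual(E, (0.2, 0.1, 1.0), (0.1, 0.1, 1.0))
bad = epipolar_residual(E, (0.2, 0.1, 1.0), (0.1, 0.3, 1.0))
```

A calibration routine would adjust the rotation and translation direction to reduce the sum of squared residuals over all feature matches in a frame; that optimization step is not shown.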
A recurrent neural network (RNN) is trained to predict sound samples based on audio input augmented by control parameter information for pitch, volume, and instrument identification. During the generative phase following training, audio input is taken from the output of the previous time step, and the parameters are externally controlled, allowing the network to be played as a musical instrument. Building on an architecture developed in previous work, we focus on the learning and synthesis of transients - the temporal response of the network during the short time (tens of milliseconds) following the onset and offset of a control signal. We find that the network learns the particular transient characteristics of two different synthetic instruments, and furthermore shows some ability to interpolate between the characteristics of the instruments used in training in response to novel parameter settings. We also study the behaviour of the units in hidden layers of the RNN using various visualisation techniques and find a variety of volume-specific response characteristics.
http://arxiv.org/abs/1903.10703
This paper presents a new method for shadow removal using unpaired data, enabling us to avoid tedious annotations and obtain more diverse training samples. However, directly employing adversarial learning and cycle-consistency constraints is insufficient to learn the underlying relationship between the shadow and shadow-free domains, since the mapping between shadow and shadow-free images is not simply one-to-one. To address the problem, we formulate Mask-ShadowGAN, a new deep framework that automatically learns to produce a shadow mask from the input shadow image and then takes the mask to guide the shadow generation via re-formulated cycle-consistency constraints. In particular, the framework simultaneously learns to produce shadow masks and learns to remove shadows, to maximize the overall performance. Also, we prepared an unpaired dataset for shadow removal and demonstrated the effectiveness of Mask-ShadowGAN in various experiments, even though it was trained on unpaired data.
http://arxiv.org/abs/1903.10683
Combining deep neural networks with structured logic rules is desirable to harness flexibility and reduce uninterpretability of the neural models. We propose a general framework capable of enhancing various types of neural networks (e.g., CNNs and RNNs) with declarative first-order logic rules. Specifically, we develop an iterative distillation method that transfers the structured information of logic rules into the weights of neural networks. We deploy the framework on a CNN for sentiment analysis, and an RNN for named entity recognition. With a few highly intuitive rules, we obtain substantial improvements and achieve state-of-the-art or comparable results to previous best-performing systems.
http://arxiv.org/abs/1603.06318
The power budget for embedded hardware implementations of Deep Learning algorithms can be extremely tight. To address implementation challenges in such domains, new design paradigms, like Approximate Computing, have drawn significant attention. Approximate Computing exploits the innate error-resilience of Deep Learning algorithms, a property that makes them amenable for deployment on low-power computing platforms. This paper describes an Approximate Computing design methodology, AX-DBN, for an architecture belonging to the class of stochastic Deep Learning algorithms known as Deep Belief Networks (DBNs). Specifically, we consider procedures for efficiently implementing the Discriminative Deep Belief Network (DDBN), a stochastic neural network which is used for classification tasks, extending Approximate Computing from the analysis of deterministic to stochastic neural networks. For the purpose of optimizing the DDBN for hardware implementations, we explore the use of: (a) limited precision of neurons and functional approximations of activation functions; (b) criticality analysis to identify nodes in the network which can operate at reduced precision while allowing the network to maintain target accuracy levels; and (c) a greedy search methodology with incremental retraining to determine the optimal reduction in precision for all neurons to maximize power savings. Using the AX-DBN methodology proposed in this paper, we present experimental results across several network architectures that show significant power savings under a user-specified accuracy loss constraint with respect to ideal full precision implementations.
https://arxiv.org/abs/1903.04659
Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained contextualized embedding model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks.
http://arxiv.org/abs/1903.10676
We present AlphaX, a fully automated agent that designs complex neural architectures from scratch. AlphaX explores the exponentially growing search space with a distributed Monte Carlo Tree Search (MCTS) and a Meta-Deep Neural Network (DNN). MCTS intrinsically improves the search efficiency by dynamically balancing exploration and exploitation at fine-grained states, while Meta-DNN predicts the network accuracy to guide the search and to provide an estimated reward to speed up the rollout. As the search progresses, AlphaX also generates the training data for Meta-DNN, so the learning of Meta-DNN is end-to-end. In 14 days with only 16 GPUs (1832 samples), AlphaX found an architecture that reaches state-of-the-art accuracies on both CIFAR-10 (97.18%) and ImageNet (75.5% top-1 and 92.2% top-5). This demonstrates up to 10x speedup over the original search for NASNet, which used 500 GPUs for 4 days (20000 samples). On NASBench-101, AlphaX demonstrates 3x and 2.8x speedup over Random Search and Regularized Evolution. Finally, we show that the searched architecture improves a variety of vision applications, from Neural Style Transfer to Image Captioning and Object Detection. Our implementation is publicly available.
https://arxiv.org/abs/1903.11059
Measuring similarity between texts is an important task for several applications. Available approaches to measure document similarity are inadequate for document pairs that have non-comparable lengths, such as a long document and its summary. This is because of the lexical, contextual and the abstraction gaps between a long document of rich details and its concise summary of abstract information. In this paper, we present a document matching approach to bridge this gap, by comparing the texts in a common space of hidden topics. We evaluate the matching algorithm on two matching tasks and find that it consistently and widely outperforms strong baselines. We also highlight the benefits of incorporating domain knowledge to text matching.
http://arxiv.org/abs/1903.10675
Neural architecture search (NAS) has been proposed to automatically tune deep neural networks, but existing search algorithms, e.g., NASNet and PNAS, usually suffer from expensive computational cost. Network morphism, which keeps the functionality of a neural network while changing its neural architecture, could be helpful for NAS by enabling more efficient training during the search. In this paper, we propose a novel framework enabling Bayesian optimization to guide the network morphism for efficient neural architecture search. The framework develops a neural network kernel and a tree-structured acquisition function optimization algorithm to efficiently explore the search space. Intensive experiments on real-world benchmark datasets have been done to demonstrate the superior performance of the developed framework over the state-of-the-art methods. Moreover, we build an open-source AutoML system based on our method, namely Auto-Keras. The system runs in parallel on CPU and GPU, with an adaptive search strategy for different GPU memory limits.
http://arxiv.org/abs/1806.10282
This paper presents a probabilistic approach for online dense reconstruction using a single monocular camera moving through the environment. Compared to spatial stereo, depth estimation from motion stereo is challenging due to insufficient parallax, visual scale changes, pose errors, etc. We utilize both the spatial and temporal correlations of consecutive depth estimates to increase the robustness and accuracy of monocular depth estimation. An online, recursive, probabilistic scheme to compute depth estimates, with corresponding covariances and inlier probability expectations, is proposed in this work. We integrate the obtained depth hypotheses into dense 3D models in an uncertainty-aware way. We show the effectiveness and efficiency of our proposed approach by comparing it with state-of-the-art methods on the TUM RGB-D SLAM and ICL-NUIM datasets. Online indoor and outdoor experiments are also presented for performance demonstration.
http://arxiv.org/abs/1903.10673
Text style transfer rephrases a text from a source style (e.g., informal) to a target style (e.g., formal) while keeping its original meaning. Despite the success existing works have achieved using a parallel corpus for the two styles, transferring text style has proven significantly more challenging when there is no parallel training corpus. In this paper, we address this challenge by using a reinforcement-learning-based generator-evaluator architecture. Our generator employs an attention-based encoder-decoder to transfer a sentence from the source style to the target style. Our evaluator is an adversarially trained style discriminator with semantic and syntactic constraints that score the generated sentence for style, meaning preservation, and fluency. Experimental results on two different style transfer tasks (sentiment transfer and formality transfer) show that our model outperforms state-of-the-art approaches. Furthermore, we perform a manual evaluation that demonstrates the effectiveness of the proposed method using subjective metrics of generated text quality.
http://arxiv.org/abs/1903.10671
Complex blur, such as a mixture of space-variant and space-invariant blur, is hard to model mathematically and is widespread in real images. In the real world, a common type of blur occurs when capturing images in low-light environments. In this paper, we propose a novel image deblurring method that does not need to estimate blur kernels. We utilize a pair of images which can be easily acquired in low-light situations: (1) a blurred image taken with a slow shutter speed and low ISO noise, and (2) a noisy image captured with a fast shutter speed and high ISO noise. Specifically, the blurred image is first sliced into patches, and we extend the Gaussian mixture model (GMM) to model the underlying intensity distribution of each patch using the corresponding patches in the noisy image. We compute patch correspondences by analyzing the optical flow between the two images. The Expectation-Maximization (EM) algorithm is utilized to estimate the parameters of the GMM. To preserve sharp features, we add a bilateral term to the objective function in the M-step. Finally, we add a detail layer to the deblurred image for refinement. Extensive experiments on both synthetic and real-world data demonstrate that our method outperforms state-of-the-art techniques in terms of robustness, visual quality, and quantitative metrics. We will make our dataset and source code publicly available.
http://arxiv.org/abs/1903.10667
Recent studies on image retrieval have shown that ensembling different models and combining multiple global descriptors lead to performance improvements. However, training different models for an ensemble is not only difficult but also inefficient with respect to time and memory. In this paper, we propose a novel framework that exploits multiple global descriptors to get an ensemble-like effect while remaining trainable in an end-to-end manner. The proposed framework is flexible and extensible with respect to the global descriptor, CNN backbone, loss, and dataset. Moreover, we investigate the effectiveness of combining multiple global descriptors with quantitative and qualitative analysis. Our extensive experiments show that the combined descriptor outperforms a single global descriptor, as it can utilize different types of feature properties. In the benchmark evaluation, the proposed framework achieves state-of-the-art performance on the CARS196, CUB200-2011, In-shop Clothes, and Stanford Online Products image retrieval tasks by a large margin compared to competing approaches.
http://arxiv.org/abs/1903.10663
In this paper we tackle the problem of unsupervised domain adaptation for the task of semantic segmentation, where we attempt to transfer the knowledge learned upon synthetic datasets with ground-truth labels to real-world images without any annotation. With the hypothesis that the structural content of images is the most informative and decisive factor to semantic segmentation and can be readily shared across domains, we propose a Domain Invariant Structure Extraction (DISE) framework to disentangle images into domain-invariant structure and domain-specific texture representations, which can further realize image-translation across domains and enable label transfer to improve segmentation performance. Extensive experiments verify the effectiveness of our proposed DISE model and demonstrate its superiority over several state-of-the-art approaches.
http://arxiv.org/abs/1903.12212
Image-to-image translation tasks have been widely investigated with Generative Adversarial Networks (GANs). However, existing approaches are mostly designed in an unsupervised manner while little attention has been paid to domain information within unpaired data. In this paper, we treat domain information as explicit supervision and design an unpaired image-to-image translation framework, Domain-supervised GAN (DosGAN), which takes the first step towards the exploration of explicit domain supervision. In contrast to representing domain characteristics using different generators in CycleGAN or multiple domain codes in StarGAN, we pre-train a classification network to explicitly classify the domain of an image. After pre-training, this network is used to extract the domain-specific features of each image by using the output of its second-to-last layer. Such features, together with the domain-independent features extracted by another encoder (shared across different domains), are used to generate an image in the target domain. Extensive experiments on multiple hair color translation, multiple identity translation, multiple season translation and conditional edges-to-shoes/handbags demonstrate the effectiveness of our method. In addition, we can transfer the domain-specific feature extractor obtained on the Facescrub dataset with domain supervision information to unseen domains, such as faces in the CelebA dataset. We also succeed in achieving conditional translation with any two images in CelebA, while previous models like StarGAN cannot handle this task.
http://arxiv.org/abs/1902.03782
Recently, deep learning based facial landmark detection has achieved great success. Despite this, we notice that semantic ambiguity greatly degrades detection performance. Specifically, semantic ambiguity means that some landmarks (e.g., those evenly distributed along the face contour) do not have a clear and accurate definition, causing inconsistent annotations by annotators. These inconsistent annotations, which are usually provided by public databases, commonly serve as the ground-truth to supervise network training, leading to degraded accuracy. To our knowledge, little research has investigated this problem. In this paper, we propose a novel probabilistic model which introduces a latent variable to optimize, i.e., the ‘real’ ground-truth which is semantically consistent. This framework couples two parts: (1) training the landmark detection CNN and (2) searching for the ‘real’ ground-truth. These two parts are alternately optimized: the searched ‘real’ ground-truth supervises the CNN training, and the trained CNN assists the search for the ‘real’ ground-truth. In addition, to recover landmarks predicted with low confidence due to occlusion and low image quality, we propose a global heatmap correction unit (GHCU) to correct outliers by considering the global face shape as a constraint. Extensive experiments on both image-based (300W and AFLW) and video-based (300-VW) databases demonstrate that our method effectively improves the landmark detection accuracy and achieves state-of-the-art performance.
http://arxiv.org/abs/1903.10661
Deep neural networks have achieved great success on the image captioning task. However, most of the existing models depend heavily on paired image-sentence datasets, which are very expensive to acquire in most real-world scenarios. In this paper, we propose a scene graph based approach for unpaired image captioning. Our method merely requires an image set, a sentence corpus, an image scene graph generator, and a sentence scene graph generator. The sentence corpus is used to teach the decoder how to generate meaningful sentences from a scene graph. To further encourage the generated captions to be semantically consistent with the image, we employ adversarial learning to align the visual scene graph to the textual scene graph. Experimental results show that our proposed model can generate quite promising results without using any image-caption training pairs, outperforming existing methods by a wide margin.
http://arxiv.org/abs/1903.10658
We propose a novel genetic algorithm that solves the image deformation estimation problem by preserving genetic diversity. In this classical problem, there is always a trade-off between the complexity of the deformation model and the difficulty of the parameter search. The 2D cubic B-spline surface is a highly free-form deformation model and is able to handle complex deformations such as fluid image distortions. However, it is challenging to estimate an apposite global solution. To tackle this problem, we develop a genetic operation named probabilistic bitwise operation (PBO) to replace the crossover and mutation operations, which can preserve diversity across generations and achieve a better coverage ratio of the solution space. Furthermore, a selection strategy named annealing selection is proposed to control the convergence. Qualitative and quantitative results on synthetic data show the effectiveness of our method.
http://arxiv.org/abs/1903.10657
We examine the problem of adversarial reinforcement learning for multi-agent domains including a rule-based agent. Rule-based algorithms are required in safety-critical applications, where they must work properly in a wide range of situations. Hence, every effort is made to find failure scenarios during the development phase. However, as the software becomes complicated, finding failure cases becomes difficult. Especially in multi-agent domains, such as autonomous driving environments, it is much harder to find useful failure scenarios that help us improve the algorithm. We propose a method for efficiently finding failure scenarios; this method trains the adversarial agents using multi-agent reinforcement learning such that the tested rule-based agent fails. We demonstrate the effectiveness of our proposed method using a simple environment and an autonomous driving simulator.
http://arxiv.org/abs/1903.10654
The neural network is a powerful computing framework that has been exploited by biological evolution and by humans for solving diverse problems. Although the computational capabilities of neural networks are determined by their structure, the current understanding of the relationships between a neural network’s architecture and function is still primitive. Here we reveal that a neural network’s modular architecture plays a vital role in determining the neural dynamics and memory performance of a network of threshold neurons. In particular, we demonstrate that there exists an optimal modularity for memory performance, where a balance between local cohesion and global connectivity is established, allowing optimally modular networks to remember longer. Our results suggest that insights from dynamical analysis of neural networks and information spreading processes can be leveraged to better design neural networks and may shed light on the brain’s modular organization.
http://arxiv.org/abs/1706.06511
It is usually hard for a learning system to predict correctly on rare events that never occur in the training data, and segmentation algorithms are no exception. Meanwhile, manual inspection of each case to locate the failures becomes infeasible due to the growing scale of data and limited human resources. Therefore, we build an alarm system that sets off alerts when the segmentation result is possibly unsatisfactory, assuming no corresponding ground truth mask is provided. One plausible solution is to project the segmentation results into a low-dimensional feature space and then learn classifiers/regressors to predict their qualities. Motivated by this, in this paper, we learn a feature space using shape information, which is a strong prior shared among different datasets and robust to the appearance variation of input data. The shape feature is captured using a Variational Auto-Encoder (VAE) network that is trained with only the ground truth masks. During testing, segmentation results with bad shapes do not fit the shape prior well, resulting in large loss values. Thus, the VAE is able to evaluate the quality of a segmentation result on unseen data, without using ground truth. Finally, we learn a regressor in the one-dimensional feature space to predict the qualities of segmentation results. Our alarm system is evaluated on several recent state-of-the-art segmentation algorithms for 3D medical segmentation tasks. Compared with other standard quality assessment methods, our system consistently provides more reliable predictions of the qualities of segmentation results.
http://arxiv.org/abs/1903.10645
We present a novel clustering objective that learns a neural network classifier from scratch, given only unlabelled data samples. The model discovers clusters that accurately match semantic classes, achieving state-of-the-art results in eight unsupervised clustering benchmarks spanning image classification and segmentation. These include STL10, an unsupervised variant of ImageNet, and CIFAR10, where we significantly beat the accuracy of our closest competitors by 8 and 9.5 absolute percentage points respectively. The method is not specialised to computer vision and operates on any paired dataset samples; in our experiments we use random transforms to obtain a pair from each image. The trained network directly outputs semantic labels, rather than high dimensional representations that need external processing to be usable for semantic clustering. The objective is simply to maximise mutual information between the class assignments of each pair. It is easy to implement and rigorously grounded in information theory, meaning we effortlessly avoid degenerate solutions that other clustering methods are susceptible to. In addition to the fully unsupervised mode, we also test two semi-supervised settings. The first achieves 88.8% accuracy on STL10 classification, setting a new global state-of-the-art over all existing methods (whether supervised, semi supervised or unsupervised). The second shows robustness to 90% reductions in label coverage, of relevance to applications that wish to make use of small amounts of labels. github.com/xu-ji/IIC
http://arxiv.org/abs/1807.06653
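The objective stated in the abstract above, maximising mutual information between the class assignments of each pair, can be sketched in a few lines. This is a minimal illustration rather than the released implementation at github.com/xu-ji/IIC: the function name `mutual_information` is hypothetical, the inputs are assumed to be per-sample softmax vectors for the two halves of each pair, and symmetrising the joint distribution is an assumption of this sketch.

```python
import math
from typing import List


def mutual_information(p_a: List[List[float]], p_b: List[List[float]]) -> float:
    """Mutual information between the class assignments of paired samples.

    p_a[n] and p_b[n] are class-probability vectors predicted for the two
    members of pair n (e.g., an image and its random transform).
    """
    n = len(p_a)
    c = len(p_a[0])
    # Joint distribution over (class of a, class of b), averaged over pairs.
    joint = [[0.0] * c for _ in range(c)]
    for a, b in zip(p_a, p_b):
        for i in range(c):
            for j in range(c):
                joint[i][j] += a[i] * b[j] / n
    # Symmetrise, since the order within a pair is arbitrary (sketch assumption).
    joint = [[(joint[i][j] + joint[j][i]) / 2.0 for j in range(c)] for i in range(c)]
    row = [sum(joint[i]) for i in range(c)]
    col = [sum(joint[i][j] for i in range(c)) for j in range(c)]
    mi = 0.0
    for i in range(c):
        for j in range(c):
            if joint[i][j] > 0.0:
                mi += joint[i][j] * math.log(joint[i][j] / (row[i] * col[j]))
    return mi


if __name__ == "__main__":
    balanced = [[1.0, 0.0], [0.0, 1.0]]
    collapsed = [[1.0, 0.0], [1.0, 0.0]]
    print(mutual_information(balanced, balanced))    # consistent, balanced clusters
    print(mutual_information(collapsed, collapsed))  # everything in one cluster
```

In this toy setting, assigning every sample to a single cluster yields zero mutual information, while consistent and balanced assignments yield the maximum for two classes; this illustrates why maximising the objective discourages the degenerate solutions mentioned in the abstract.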
This paper presents an evaluation of a number of probabilistic algorithms for localization of autonomous underwater vehicles (AUVs) using bathymetry data. The algorithms, based on the principles of the Bayes filter, work by fusing bathymetry information with depth and altitude data from an AUV. Four different Bayes filter-based algorithms are used to design the localization algorithms: the Extended Kalman Filter (EKF), Unscented Kalman Filter (UKF), Particle Filter (PF), and Marginalized Particle Filter (MPF). We evaluate the performance of these four Bayesian bathymetry-based AUV localization approaches under variable conditions and available computational resources. The localization algorithms overcome unique challenges of the underwater domain, including visual distortion and radio frequency (RF) signal attenuation, which often make landmark-based localization infeasible. Evaluation results on real-world bathymetric data show the effectiveness of each algorithm under a variety of conditions, with the MPF being the most accurate.
http://arxiv.org/abs/1809.08076
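To make the Bayes-filter idea in the abstract above concrete, here is a minimal one-dimensional particle filter sketch. It is a toy under stated assumptions, not one of the evaluated algorithms: the map `seabed_depth` is a hypothetical linear profile, the measured water depth stands in for the sum of the AUV's depth and altitude readings, and all noise parameters are invented for illustration.

```python
import math
import random
from typing import List, Tuple


def seabed_depth(x: float) -> float:
    """Hypothetical 1-D bathymetric map: water depth (m) at position x (m)."""
    return 20.0 + 0.5 * x


def particle_filter_step(particles: List[float], velocity: float,
                         measured_depth: float, rng: random.Random,
                         motion_sigma: float = 0.2,
                         meas_sigma: float = 1.0) -> List[float]:
    """One predict / update / resample cycle."""
    # Predict: move every particle with the commanded velocity plus noise.
    moved = [p + velocity + rng.gauss(0.0, motion_sigma) for p in particles]
    # Update: weight each particle by how well the map depth at its position
    # explains the measured water depth (AUV depth + altitude).
    weights = [math.exp(-0.5 * ((measured_depth - seabed_depth(p)) / meas_sigma) ** 2)
               for p in moved]
    if sum(weights) == 0.0:
        weights = [1.0] * len(moved)  # fall back to uniform if all weights underflow
    # Resample: draw a new particle set in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(moved))


def run_demo(seed: int = 0, steps: int = 20, n_particles: int = 500) -> Tuple[float, float]:
    rng = random.Random(seed)
    true_x = 30.0
    particles = [rng.uniform(0.0, 100.0) for _ in range(n_particles)]
    for _ in range(steps):
        true_x += 1.0
        z = seabed_depth(true_x) + rng.gauss(0.0, 0.5)  # noisy depth measurement
        particles = particle_filter_step(particles, 1.0, z, rng)
    estimate = sum(particles) / len(particles)
    return true_x, estimate


if __name__ == "__main__":
    print(run_demo())
```

A monotone depth profile keeps this example unambiguous; on real bathymetry, terrain that repeats or is flat leaves several positions equally plausible, which is one reason sample-based filters are attractive in this setting.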
In urban driving scenarios, forecasting future trajectories of surrounding vehicles is of paramount importance. While several approaches to the problem have been proposed, the best-performing ones tend to require extremely detailed input representations (e.g., image sequences). However, such methods do not generalize to datasets they have not been trained on. We propose intermediate representations that are particularly well-suited for future prediction. As opposed to using texture (color) information, we rely on semantics and train an autoregressive model to accurately predict future trajectories of traffic participants (vehicles) (see fig. above). We demonstrate that using semantics provides a significant boost over techniques that operate over raw pixel intensities/disparities. Unlike state-of-the-art approaches, our representations and models generalize to completely different datasets, collected across several cities, and also across countries where people drive on opposite sides of the road (left-handed vs right-handed driving). Additionally, we demonstrate an application of our approach in multi-object tracking (data association). To foster further research in transferable representations and ensure reproducibility, we release all our code and data.
http://arxiv.org/abs/1903.10641
Inspired by CapsNet’s routing-by-agreement mechanism with its ability to learn object properties, we propose a CapsNet architecture with object coordinate atoms and a modified routing-by-agreement algorithm with unevenly distributed initial routing probabilities. The model is based on CapsNet but uses a routing algorithm to find the objects’ approximate positions in the image coordinate system. We also discuss how to derive the property of translation through coordinate atoms, and we discover the importance of sparse representation. We train our model on the single moving MNIST dataset with class labels. Our model can learn and derive the coordinates of the digits better than its convolutional counterpart that lacks a routing-by-agreement algorithm, and can also perform well when tested on the multi-digit moving MNIST datasets. When deriving the coordinates, our model performs at least 13%, 24%, and 51% better than the convNet counterpart and ResNet 20 benchmarks on 1-digit, 2-digit, and 3-digit moving MNIST datasets. This shows our method has better transfer learning properties on unseen scenarios of new but related datasets. We also achieve slightly better performance than the ResNet benchmark on the KTH dataset; these results show our method reaches state-of-the-art performance on object localization without any extra localization techniques and modules as in prior work.
http://arxiv.org/abs/1805.07706
Testing autonomous vehicles in simulation environments is crucial. Sim-ATAV is an open-source framework developed for experimenting with different test generation techniques in simulation environments for research purposes. This document provides a tutorial on Sim-ATAV with a running example.
http://arxiv.org/abs/1903.10637
We demonstrate that a character-level recurrent neural network is able to learn out-of-vocabulary (OOV) words under federated learning settings, for the purpose of expanding the vocabulary of a virtual keyboard for smartphones without exporting sensitive text to servers. High-frequency words can be sampled from the trained generative model by drawing from the joint posterior directly. We study the feasibility of the approach in two settings: (1) using simulated federated learning on a publicly available non-IID per-user dataset from a popular social networking website, (2) using federated learning on data hosted on user mobile devices. The model achieves good recall and precision compared to ground-truth OOV words in setting (1). With (2) we demonstrate the practicality of this approach by showing that we can learn meaningful OOV words with good character-level prediction accuracy and cross entropy loss.
http://arxiv.org/abs/1903.10635
We consider the problem of diversifying automated reply suggestions for a commercial instant-messaging (IM) system (Skype). Our conversation model is a standard matching based information retrieval architecture, which consists of two parallel encoders to project messages and replies into a common feature representation. During inference, we select replies from a fixed response set using nearest neighbors in the feature space. To diversify responses, we formulate the model as a generative latent variable model with Conditional Variational Auto-Encoder (M-CVAE). We propose a constrained-sampling approach to make the variational inference in M-CVAE efficient for our production system. In offline experiments, M-CVAE consistently increased diversity by ~30-40% without significant impact on relevance. This translated to a 5% gain in click-rate in our online production system.
http://arxiv.org/abs/1903.10630
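The retrieval step mentioned in the abstract above, selecting replies from a fixed response set using nearest neighbors in the feature space, might look roughly like the sketch below. The embedding vectors, the cosine-similarity metric, and the function names are all assumptions made for illustration; the abstract does not specify the production encoders or the distance measure, and the M-CVAE diversification itself is not shown.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_replies(message_vec: Sequence[float],
                  reply_vecs: Dict[str, Sequence[float]],
                  k: int = 3) -> List[str]:
    """Return the k replies whose embeddings are closest to the message embedding."""
    ranked = sorted(reply_vecs.items(),
                    key=lambda item: cosine(message_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical 2-D embeddings of a fixed response set.
    replies = {
        "Sure!": [1.0, 0.0],
        "No thanks": [0.0, 1.0],
        "Sounds good": [0.9, 0.1],
    }
    print(top_k_replies([1.0, 0.0], replies, k=2))
```

Plain nearest-neighbor selection tends to return near-duplicate suggestions such as the two affirmative replies here, which is the lack of diversity the abstract sets out to address.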
Autonomous vehicles are in an intensive research and development stage, and the organizations developing these systems aim to deploy them on public roads in the very near future. One of the expectations of fully-automated vehicles is that they never cause an accident. However, an automated vehicle may not be able to avoid all collisions, e.g., collisions caused by other road occupants. Hence, it is important for system designers to understand the boundary-case scenarios where an autonomous vehicle can no longer avoid a collision. In this paper, an automated test generation approach that utilizes Rapidly-exploring Random Trees is presented. A comparison of the proposed approach with an optimization-guided falsification approach from the literature is provided. Furthermore, a cost function that guides the test generation toward almost-avoidable collisions or near-misses is proposed.
http://arxiv.org/abs/1903.10629
Grammatical error correction (GEC) is one of the areas in natural language processing in which purely neural models have not yet superseded more traditional symbolic models. Hybrid systems combining phrase-based statistical machine translation (SMT) and neural sequence models are currently among the most effective approaches to GEC. However, both SMT and neural sequence-to-sequence models require large amounts of annotated data. Language model based GEC (LM-GEC) is a promising alternative which does not rely on annotated training data. We show how to improve LM-GEC by applying modelling techniques based on finite state transducers. We report further gains by rescoring with neural language models. We show that our methods developed for LM-GEC can also be used with SMT systems if annotated training data is available. Our best system outperforms the best published result on the CoNLL-2014 test set, and achieves far better relative improvements over the SMT baselines than previous hybrid systems.
http://arxiv.org/abs/1903.10625
This paper presents the mathematical modeling, controller design, and flight-testing of an over-actuated Vertical Take-off and Landing (VTOL) tiltwing Unmanned Aerial Vehicle (UAV). Based on simplified aerodynamics and first principles, a dynamical model of the UAV is developed which captures key aerodynamic effects including propeller slipstream on the wing and post-stall characteristics of the airfoils. The model-based steady-state flight envelope and the corresponding trim-actuation are analyzed, and the overactuation of the UAV is resolved by optimizing for, e.g., power-optimal trims. The developed control system is composed of two controllers. The first is a low-level attitude controller based on dynamic inversion and a daisy-chaining approach to handle allocation of redundant actuators. The second is a higher-level cruise controller to track a desired vertical velocity. It is based on a linearization of the system and look-up tables to determine the strong and nonlinear variation of the trims throughout the flight envelope. We demonstrate the performance of the control system for all flight phases (hover, transition, cruise) in extensive flight tests.
http://arxiv.org/abs/1903.10623
As an advanced research topic in forensic science, automatic shoe-print identification has been extensively studied in the last two decades, since shoe marks are the clues most frequently left at a crime scene. Hence, these impressions provide pertinent evidence for the progress of investigations aimed at identifying potential criminals. The main goal of this survey is to provide a cohesive overview of the research carried out in forensic shoe-print identification and its basic background. Apart from defining the problem and describing the phases that typically compose the processing chain of shoe-print identification, we provide a summary/comparison of the state-of-the-art approaches, in order to guide the neophyte and help to advance the research topic. This is done by introducing simple and basic taxonomies as well as summaries of the state-of-the-art performance. Lastly, we discuss the current open problems and challenges in this research topic and point out promising directions in this field.
http://arxiv.org/abs/1901.01431
Unmanned ground vehicles can capture a sub-canopy perspective for plant phenotyping, but their design and construction can be a challenge for scientists unfamiliar with robotics. Here we describe the necessary components and provide guidelines for designing and constructing an autonomous ground robot that can be used for plant phenotyping.
http://arxiv.org/abs/1903.10608