Since it is usually difficult to capture an all-in-focus image of a 3D scene directly, various multi-focus image fusion methods are employed to generate it from several images focused at different depths. However, the performance of existing methods is barely satisfactory and often degrades for areas near the focused/defocused boundary (FDB). In this paper, a boundary-aware method based on deep neural networks is proposed to overcome this problem. (1) To acquire improved fusion images, a 2-channel deep network is proposed to better extract the relative defocus information of the two source images. (2) After analyzing the different situations for patches far away from and near the FDB, we use two networks to handle them respectively. (3) To simulate reality more precisely, a new approach to dataset generation is designed. Experiments demonstrate that the proposed method outperforms the state-of-the-art methods, both qualitatively and quantitatively.
http://arxiv.org/abs/1904.00198
This paper presents a novel approach to exploiting distinctive invariant features in a convolutional neural network. The proposed CNN model uses a Scale Invariant Feature Transform (SIFT) descriptor instead of the max-pooling layer. The max-pooling layer discards the pose, i.e., the translational and rotational relationships between low-level features, and is hence unable to capture the spatial hierarchies between low- and high-level features. The SIFT descriptor layer captures the orientation and the spatial relationships of the features extracted by the convolutional layers. The proposed SIFT Descriptor CNN therefore combines the feature extraction capabilities of the CNN model with the rotation invariance of the SIFT descriptor. Experimental results on the MNIST and fashionMNIST datasets indicate reasonable improvements over conventional methods available in the literature.
http://arxiv.org/abs/1904.00197
In image-based feature descriptor design, an iterative scanning process utilizing the convolution operation is often adopted to extract local information from the image pixels. In this paper, we propose a convolution-free Local Binary Pattern (CF-LBP) descriptor and a convolution-free Histogram of Oriented Gradients (CF-HOG) descriptor in matrix form for mammogram classification. An integrated form of CF-LBP and CF-HOG, CF-LBP-HOG, is subsequently constructed in a single matrix formulation. The proposed descriptors are evaluated using a publicly available mammogram database. The results show promising performance in terms of classification accuracy and computational efficiency.
http://arxiv.org/abs/1904.00187
In recent years, zero-shot recognition (ZSR) has gained increasing attention in the machine learning and image processing fields. It aims at recognizing unseen class instances with knowledge transferred from seen classes. This is typically achieved by exploiting a pre-defined semantic feature space (FS), i.e., semantic attributes or word vectors, as a bridge to transfer knowledge between seen and unseen classes. However, due to the absence of unseen classes during training, conventional ZSR easily suffers from the domain shift and hubness problems. In this paper, we propose a novel ZSR learning framework that handles these two issues well by adaptively adjusting the semantic FS. To the best of our knowledge, our work is the first to consider adaptive adjustment of the semantic FS in ZSR. Moreover, our solution can be formulated as a more efficient framework that significantly speeds up training. Extensive experiments show the remarkable performance improvement of our model compared with existing methods.
http://arxiv.org/abs/1904.00170
Facial images in surveillance or mobile scenarios often have large view-point variations in terms of pitch and yaw angles. These jointly occurring angle variations make face recognition challenging. Current public face databases mainly consider the case of yaw variations. In this paper, a new large-scale Multi-yaw Multi-pitch high-quality database is proposed for Facial Pose Analysis (M2FPA), including face frontalization, face rotation, facial pose estimation and pose-invariant face recognition. It contains 397,544 images of 229 subjects with variations in yaw, pitch, attributes, illumination and accessories. M2FPA is the most comprehensive multi-view face database for facial pose analysis. Further, we provide an effective benchmark for face frontalization and pose-invariant face recognition on M2FPA with several state-of-the-art methods, including DR-GAN, TP-GAN and CAPG-GAN. We believe that the new database and benchmark can significantly push forward the advance of facial pose analysis in real-world applications. Moreover, a simple yet effective parsing-guided discriminator is introduced to capture local consistency during GAN optimization. Extensive quantitative and qualitative results on M2FPA and Multi-PIE demonstrate the superiority of our face frontalization method. Baseline results for both face synthesis and face recognition from state-of-the-art methods demonstrate the challenge offered by this new database.
http://arxiv.org/abs/1904.00168
Generative adversarial networks (GANs) have recently led to highly realistic image synthesis results. In this work, we describe a new method to expose GAN-synthesized images using the locations of facial landmark points. Our method is based on the observation that the configurations of facial parts generated by GAN models differ from those of real faces, due to the lack of global constraints. We perform experiments demonstrating this phenomenon, and show that an SVM classifier trained on the locations of facial landmark points is sufficient to achieve good classification performance for GAN-synthesized faces.
http://arxiv.org/abs/1904.00167
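A minimal sketch of the classification stage described above, assuming landmark extraction has already produced flattened (x, y) coordinates for each face; the random arrays and the RBF kernel choice are placeholders, not the paper's data or settings.

```python
# Hedged sketch: an SVM over facial-landmark locations separating real faces
# from GAN-synthesized ones. Landmark extraction is omitted; random arrays
# stand in for real features and labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_faces, n_landmarks = 200, 68                    # 68 landmarks is a common convention
X = rng.normal(size=(n_faces, n_landmarks * 2))   # flattened (x, y) positions per face
y = rng.integers(0, 2, size=n_faces)              # 1 = GAN-synthesized, 0 = real

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```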
The recently introduced generative adversarial network (GAN) has shown numerous promising results in generating realistic samples. The essential task of a GAN is to control the features of samples generated from a random distribution. While current GAN structures, such as the conditional GAN, successfully generate samples with desired major features, they often fail to produce detailed features that bring specific differences among samples. To overcome this limitation, here we propose a controllable GAN (ControlGAN) structure. By separating a feature classifier from the discriminator, the generator of ControlGAN is designed to learn to generate synthetic samples with specific detailed features. Evaluated on multiple image datasets, ControlGAN shows the ability to generate improved samples with well-controlled features. Furthermore, we demonstrate that ControlGAN can generate intermediate and opposite features for interpolated and extrapolated input labels that are not used in the training process. This implies that ControlGAN can significantly contribute to the variety of generated samples.
http://arxiv.org/abs/1708.00598
In machine translation, context information is an important factor, but a model that takes it into account has not been proposed. This paper proposes a new model that can integrate context information into translation. The model is based on the encoder-decoder architecture: when translating the current sentence, it integrates the output of the encoder for the preceding sentence with that of the current encoder. The model can thus consider context information, and its scores are higher than those of the existing model.
http://arxiv.org/abs/1904.00160
Conventional methods for facial age analysis tend to utilize accurate age labels in a supervised way. However, existing age datasets lie in a limited range of ages, leading to a long-tailed distribution. To alleviate this problem, this paper proposes a Universal Variational Aging (UVA) framework to formulate facial age priors in a disentangling manner. Benefiting from the variational evidence lower bound, the facial images are encoded and disentangled into an age-irrelevant distribution and an age-related distribution in the latent space. A conditional introspective adversarial learning mechanism is introduced to boost image quality. In this way, when manipulating the age-related distribution, UVA can achieve age translation with arbitrary ages. Further, by sampling noise from the age-irrelevant distribution, we can generate photorealistic facial images with a specific age. Moreover, given an input face image, the mean value of the age-related distribution can be treated as an age estimator. These results indicate that UVA can efficiently and accurately estimate the age-related distribution in a disentangling manner, even if the training dataset has a long-tailed age distribution. UVA is the first attempt to achieve facial age analysis tasks, including age translation, age generation and age estimation, in a universal framework. Qualitative and quantitative experiments demonstrate the superiority of UVA on five popular datasets, including CACD2000, Morph, UTKFace, MegaAge-Asian and FG-NET.
http://arxiv.org/abs/1904.00158
In the last decade, deep artificial neural networks have achieved astounding performance in many natural language processing tasks. Given the high productivity of language, these models must possess effective generalization abilities. It is widely assumed that humans handle linguistic productivity by means of algebraic compositional rules: Are deep networks similarly compositional? After reviewing the main innovations characterizing current deep language processing networks, I discuss a set of studies suggesting that deep networks are capable of subtle grammar-dependent generalizations, but also that they do not rely on systematic compositional rules. I argue that the intriguing behaviour of these devices (still awaiting a full understanding) should be of interest to linguists and cognitive scientists, as it offers a new perspective on possible computational strategies to deal with linguistic productivity beyond rule-based compositionality, and it might lead to new insights into the less systematic generalization patterns that also appear in natural language.
http://arxiv.org/abs/1904.00157
We propose a neural network for unsupervised anomaly detection with a novel robust subspace recovery layer (RSR layer). This layer seeks to extract the underlying subspace from a latent representation of the given data and remove outliers that lie away from this subspace. It is used together with an encoder and a decoder. The encoder maps the data into the latent space, from which the RSR layer extracts the subspace. The decoder then smoothly maps back the underlying subspace to a "manifold" close to the original data. We illustrate algorithmic choices and performance for artificial data with corrupted manifold structure. We also demonstrate competitive precision and recall for image datasets.
http://arxiv.org/abs/1904.00152
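One concrete reading of the architecture above, as a PyTorch sketch: a linear RSR projection sits between encoder and decoder, and a penalty pulls latent codes toward the recovered subspace. All dimensions, the loss weight, and the exact penalty form are illustrative assumptions rather than the paper's configuration.

```python
# Hedged sketch of an autoencoder with a linear RSR layer A: reconstruct the
# input from the projected code, and penalize codes that lie far from the
# subspace spanned by A.
import torch
import torch.nn as nn

class RSRAutoencoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=64, subspace_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.A = nn.Linear(latent_dim, subspace_dim, bias=False)  # RSR layer
        self.decoder = nn.Sequential(nn.Linear(subspace_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        z_sub = self.A(z)                  # project onto the candidate subspace
        z_back = z_sub @ self.A.weight     # lift back into the latent space
        return self.decoder(z_sub), z, z_back

model = RSRAutoencoder()
x = torch.randn(32, 784)                   # placeholder batch
x_hat, z, z_back = model(x)
loss = ((x_hat - x) ** 2).mean() + 0.1 * (z - z_back).norm(dim=1).mean()
loss.backward()                            # outliers incur a large subspace penalty
```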
We introduce the problem of learning affective correspondence between audio (music) and visual data (images). For this task, a music clip and an image are considered similar (having true correspondence) if they have similar emotion content. In order to estimate this crossmodal, emotion-centric similarity, we propose a deep neural network architecture that learns to project the data from the two modalities to a common representation space, and performs a binary classification task of predicting the affective correspondence (true or false). To facilitate the current study, we construct a large scale database containing more than $3,500$ music clips and $85,000$ images with three emotion classes (positive, neutral, negative). The proposed approach achieves $61.67\%$ accuracy for the affective correspondence prediction task on this database, outperforming two relevant and competitive baselines. We also demonstrate that our network learns modality-specific representations of emotion (without explicitly being trained with emotion labels), which are useful for emotion recognition in individual modalities.
http://arxiv.org/abs/1904.00150
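A hedged sketch of the two-branch idea above: modality-specific encoders project audio and image features into a common space, and a small head emits a binary correspondence logit. Feature dimensions and layer sizes are invented for illustration.

```python
# Hedged sketch: project two modalities to a shared space and classify
# affective correspondence (true/false). Input features are assumed to come
# from upstream audio/image feature extractors not shown here.
import torch
import torch.nn as nn

class CorrespondenceNet(nn.Module):
    def __init__(self, audio_dim=128, image_dim=512, common_dim=64):
        super().__init__()
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, common_dim), nn.ReLU())
        self.image_proj = nn.Sequential(nn.Linear(image_dim, common_dim), nn.ReLU())
        self.head = nn.Linear(2 * common_dim, 1)   # binary correspondence logit

    def forward(self, audio_feat, image_feat):
        a = self.audio_proj(audio_feat)
        v = self.image_proj(image_feat)
        return self.head(torch.cat([a, v], dim=-1))

logit = CorrespondenceNet()(torch.randn(8, 128), torch.randn(8, 512))
print(logit.shape)   # (8, 1): one correspondence score per pair
```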
This paper presents a neural relation extraction method to deal with the noisy training data generated by distant supervision. Previous studies mainly focus on sentence-level de-noising by designing neural networks with intra-bag attentions. In this paper, both intra-bag and inter-bag attentions are considered in order to deal with the noise at sentence-level and bag-level respectively. First, relation-aware bag representations are calculated by weighting sentence embeddings using intra-bag attentions. Here, each possible relation is utilized as the query for attention calculation instead of only using the target relation in conventional methods. Furthermore, the representation of a group of bags in the training set which share the same relation label is calculated by weighting bag representations using a similarity-based inter-bag attention module. Finally, a bag group is utilized as a training sample when building our relation extractor. Experimental results on the New York Times dataset demonstrate the effectiveness of our proposed intra-bag and inter-bag attention modules. Our method also achieves better relation extraction accuracy than state-of-the-art methods on this dataset.
http://arxiv.org/abs/1904.00143
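To make the intra-bag step concrete, here is a hedged sketch of attention in which every relation embedding, not just the target relation, serves as a query over a bag's sentence embeddings; the shapes and dot-product scoring are illustrative assumptions.

```python
# Hedged sketch of intra-bag attention with all relations as queries:
# each relation attends over the sentences of one bag, yielding one
# relation-aware bag representation per relation.
import torch

n_sent, n_rel, dim = 5, 4, 16
sent_emb = torch.randn(n_sent, dim)      # sentence embeddings of one bag
rel_emb = torch.randn(n_rel, dim)        # one query vector per relation

scores = rel_emb @ sent_emb.T            # (n_rel, n_sent) attention logits
alpha = torch.softmax(scores, dim=1)     # attention weights per relation
bag_reps = alpha @ sent_emb              # (n_rel, dim) relation-aware bag reps
print(bag_reps.shape)
```

The inter-bag step would then weight several such bag representations sharing a label by their similarity, following the same attention pattern one level up.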
This paper describes the system submitted by the ANA Team for SemEval-2019 Task 3: EmoContext. We propose a novel Hierarchical LSTMs for Contextual Emotion Detection (HRLCE) model. It classifies the emotion of an utterance given its conversational context. The results show that, on this task, our HRLCE outperforms the most recent state-of-the-art text classification framework: BERT. We combine the results generated by BERT and HRLCE to achieve an overall score of 0.7709, which ranked 5th on the final leaderboard of the competition among 165 teams.
http://arxiv.org/abs/1904.00132
This work presents computational methods for transferring body movements from one person to another with videos collected in the wild. Specifically, we train a personalized model on a single video from the Internet which can generate videos of this target person driven by the motions of other people. Our model is built on two generative networks: a human (foreground) synthesis net, which generates photo-realistic imagery of the target person in a novel pose, and a fusion net, which combines the generated foreground with the scene (background), adding shadows or reflections as needed to enhance realism. We validate the efficacy of our proposed models over baselines with qualitative and quantitative evaluations as well as a subjective test.
http://arxiv.org/abs/1904.00129
Despite continuously improving performance, contemporary image captioning models are prone to “hallucinating” objects that are not actually in a scene. One problem is that standard metrics only measure similarity to ground truth captions and may not fully capture image relevance. In this work, we propose a new image relevance metric to evaluate current models with veridical visual labels and assess their rate of object hallucination. We analyze how captioning model architectures and learning objectives contribute to object hallucination, explore when hallucination is likely due to image misclassification or language priors, and assess how well current sentence metrics capture object hallucination. We investigate these questions on the standard image captioning benchmark, MSCOCO, using a diverse set of models. Our analysis yields several interesting findings, including that models which score best on standard sentence metrics do not always have lower hallucination and that models which hallucinate more tend to make errors driven by language priors.
https://arxiv.org/abs/1809.02156
We present an approach to minimally supervised relation extraction that combines the benefits of learned representations and structured learning, and accurately predicts sentence-level relation mentions given only proposition-level supervision from a KB. By explicitly reasoning about missing data during learning, our approach enables large-scale training of 1D convolutional neural networks while mitigating the issue of label noise inherent in distant supervision. Our approach achieves state-of-the-art results on minimally supervised sentential relation extraction, outperforming a number of baselines, including a competitive approach that uses the attention layer of a purely neural model.
http://arxiv.org/abs/1904.00118
Authors’ keyphrases assigned to scientific articles are essential for recognizing content and topic aspects. Most of the proposed supervised and unsupervised methods for keyphrase generation are unable to produce terms that are valuable but do not appear in the text. In this paper, we explore the possibility of considering the keyphrase string as an abstractive summary of the title and the abstract. First, we collect, process and release a large dataset of scientific paper metadata that contains 2.2 million records. Then we experiment with popular text summarization neural architectures. Despite using advanced deep learning models, large quantities of data and many days of computation, our systematic evaluation on four test datasets reveals that the explored text summarization methods could not produce better keyphrases than the simpler unsupervised methods, or the existing supervised ones.
http://arxiv.org/abs/1904.00110
Metaheuristics are general methods that guide the application of concrete heuristic(s) to problems that are too hard to solve using exact algorithms. However, even though a growing body of literature has been devoted to their statistical evaluation, the approaches proposed so far can assess only the coupled effects of metaheuristics and heuristics. They do not reveal anything about how efficient the examined metaheuristic is at guiding its subordinate heuristic(s), nor do they provide information about how much the heuristic component of the combined algorithm contributes to the overall performance. In this paper, we propose a simple yet effective methodology for assessing these contributions: deriving a naive, placebo metaheuristic from the one being studied and comparing the distributions of chosen performance metrics for the two methods. We propose three measures of difference between the two distributions. These measures, which we call BER values (benefit, equivalence, risk), are based on a preselected threshold of practical significance, which represents the minimal difference between two performance scores required for them to be considered practically different. We illustrate the usefulness of our methodology on the example of Simulated Annealing, the Boolean Satisfiability Problem, and the Flip heuristic.
http://arxiv.org/abs/1904.00103
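One natural reading of the BER values, sketched below under explicit assumptions: performance samples are compared pairwise between the studied metaheuristic and its placebo variant, higher scores are better, and each pair is classified by the practical-significance threshold. The exact statistic in the paper may differ.

```python
# Hedged sketch: benefit/equivalence/risk fractions over all cross-pair score
# differences, relative to a practical-significance threshold.
import numpy as np

def ber(meta_scores, placebo_scores, threshold):
    diffs = meta_scores[:, None] - placebo_scores[None, :]   # all pairwise gaps
    benefit = np.mean(diffs > threshold)        # practically better
    risk = np.mean(diffs < -threshold)          # practically worse
    equivalence = 1.0 - benefit - risk          # within the threshold band
    return benefit, equivalence, risk

rng = np.random.default_rng(0)
meta = rng.normal(1.0, 1.0, 500)       # toy score samples, higher = better
placebo = rng.normal(0.8, 1.0, 500)
print(ber(meta, placebo, threshold=0.5))
```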
This paper develops an autonomous tethered aerial visual assistant for robot operations in unstructured or confined environments. Robotic tele-operation in remote environments is difficult due to a lack of sufficient situational awareness, mostly caused by the stationary and limited field of view and the lack of depth perception of the robot's onboard camera. The emerging state of the practice is to use two robots: a primary, and a secondary that acts as a visual assistant to overcome the perceptual limitations of the onboard sensors by providing an external viewpoint. However, problems exist when using a tele-operated visual assistant: extra manpower, manually chosen suboptimal viewpoints, and extra teamwork demands between the primary and secondary operators. In this work, we use an autonomous tethered aerial visual assistant to replace the secondary robot and operator, reducing the human-robot ratio from 2:2 to 1:2. This visual assistant is able to autonomously navigate through unstructured or confined spaces in a risk-aware manner, while continuously maintaining good viewpoint quality to increase the primary operator's situational awareness. With the proposed co-robot team, tele-operation missions in nuclear operations, bomb squads, disaster response, and other domains with novel tasks or highly occluded environments could benefit from reduced manpower and teamwork demand, along with improved visual assistance quality based on trustworthy risk-aware motion in cluttered environments.
http://arxiv.org/abs/1904.00078
Automatic delineation and measurement of main organs such as the liver is one of the critical steps for the assessment of hepatic diseases, planning, and postoperative or treatment follow-up. However, addressing this problem typically requires performing computed tomography (CT) scanning and complicated postprocessing of the resulting scans using slice-by-slice techniques. In this paper, we show that 3D organ shape can be automatically predicted directly from topogram images, which are easier to acquire and involve limited exposure to radiation during acquisition compared to CT scans. We evaluate our approach on the challenging task of predicting liver shape using a generative model. We also demonstrate that our method can be combined with user annotations, such as a 2D mask, for improved prediction accuracy. We show compelling results on 3D liver shape reconstruction and volume estimation on 2129 CT scans.
http://arxiv.org/abs/1904.00073
As 3D scanning solutions become increasingly popular, several deep learning setups have been developed for the task of scan completion, i.e., plausibly filling in regions that were missed in the raw scans. These methods, however, largely rely on supervision in the form of paired training data, i.e., partial scans with corresponding desired completed scans. While these methods have been successfully demonstrated on synthetic data, they cannot be directly used on real scans in the absence of suitable paired training data. We develop a first approach that works directly on input point clouds, does not require paired training data, and hence can be directly applied to real scans for scan completion. We evaluate the approach qualitatively on several real-world datasets (ScanNet, Matterport, KITTI), quantitatively on the 3D-EPN shape completion benchmark dataset, and demonstrate realistic completions under varying levels of incompleteness.
http://arxiv.org/abs/1904.00069
Automatic segmentation of brain Magnetic Resonance Imaging (MRI) images is one of the vital steps for quantitative analysis of the brain for further inspection. In this paper, NeuroNet has been adopted to segment the brain tissues (white matter (WM), grey matter (GM) and cerebrospinal fluid (CSF)); it uses a Residual Network (ResNet) in the encoder and a Fully Convolutional Network (FCN) in the decoder. To achieve the best performance, various hyper-parameters have been tuned, while the network parameters (kernels and biases) were initialized using the NeuroNet pre-trained model. Different pre-processing pipelines have also been introduced to obtain a robust trained model. The model has been trained and tested on the IBSR18 dataset. To validate the research outcome, performance was measured quantitatively using the Dice Similarity Coefficient (DSC) and is reported on average as 0.84 for CSF, 0.94 for GM, and 0.94 for WM. The outcome of the research indicates that, for the IBSR18 dataset, pre-processing and proper tuning of hyper-parameters for the NeuroNet model improve the DSC for brain tissue segmentation.
http://arxiv.org/abs/1904.00068
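For reference, the Dice Similarity Coefficient used for evaluation above is the standard overlap metric DSC = 2|A ∩ B| / (|A| + |B|); the snippet below is a generic NumPy implementation, not code from the paper.

```python
# Dice Similarity Coefficient between two binary segmentation masks.
import numpy as np

def dice(pred, target, eps=1e-8):
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
print(dice(a, b))   # 2*2 / (3+3) ≈ 0.667
```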
The attention mechanism has been widely applied to various sound-related tasks. In this work, we propose a Multi-Scale Time-Frequency Attention (MTFA) module for sound event detection. By generating an attention heatmap, MTFA enables the model to focus on discriminative components of the spectrogram along both the time and frequency axes. Besides, gathering information at multiple scales helps the model adapt better to the characteristics of different categories of target events. The proposed method is demonstrated on task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge. To the best of our knowledge, our method outperforms all previous methods that do not use model ensembles on the development dataset and achieves state-of-the-art performance on the evaluation dataset by reducing the error rate from 0.13 to 0.09. This demonstrates the effectiveness of MTFA at retrieving discriminative representations for sound event detection.
http://arxiv.org/abs/1904.00063
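A hedged PyTorch sketch of the heatmap mechanism described above: a small convolutional branch produces a sigmoid mask over the time-frequency plane that reweights the spectrogram features. The multi-scale aggregation and the actual MTFA architecture are not reproduced here; layer sizes are illustrative.

```python
# Hedged sketch of time-frequency attention: learn a heatmap in [0, 1] over
# the spectrogram and use it to emphasize discriminative bins.
import torch
import torch.nn as nn

class TFAttention(nn.Module):
    def __init__(self, channels=1):
        super().__init__()
        self.heatmap = nn.Sequential(
            nn.Conv2d(channels, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, spec):               # spec: (batch, ch, freq, time)
        return spec * self.heatmap(spec)   # element-wise reweighting

spec = torch.randn(4, 1, 64, 128)          # placeholder log-mel batch
print(TFAttention()(spec).shape)
```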
Identification and localization of sounds are both integral parts of computational auditory scene analysis. Although each can be solved separately, the goal of forming coherent auditory objects and achieving a comprehensive spatial scene understanding suggests pursuing a joint solution of the two problems. This work presents an approach that robustly binds localization with the detection of sound events in a binaural robotic system. Both tasks are joined through the use of spatial stream segregation which produces probabilistic time-frequency masks for individual sources attributable to separate locations, enabling segregated sound event detection operating on these streams. We use simulations of a comprehensive suite of test scenes with multiple co-occurring sound sources, and propose performance measures for systematic investigation of the impact of scene complexity on this segregated detection of sound types. Analyzing the effect of head orientation, we show how a robot can facilitate high performance through optimal head rotation. Furthermore, we investigate the performance of segregated detection given possible localization error as well as error in the estimation of number of active sources. Our analysis demonstrates that the proposed approach is an effective method to obtain joint sound event location and type information under a wide range of conditions.
http://arxiv.org/abs/1904.00055
Deep convolutional neural networks have significantly boosted the capability of salient object detection in handling large variations of scenes and object appearances. However, convolution operations seek to generate strong responses on individual pixels, and lack the ability to maintain the spatial structure of objects. Moreover, down-sampling operations, such as pooling and striding, lose spatial details of the salient objects. In this paper, we propose a simple yet effective Siamese Edge-Enhancement Network (SE2Net) to preserve the edge structure for salient object detection. Specifically, a novel multi-stage siamese network is built to aggregate low-level and high-level features, and to estimate the salient maps of edges and regions in parallel. As a result, the predicted regions become more accurate by enhancing the responses at edges, and the predicted edges become more semantic by suppressing the false positives in the background. After the refined salient maps of edges and regions are produced by SE2Net, an edge-guided inference algorithm is designed to further improve the resulting salient masks along the predicted edges. Extensive experiments on several benchmark datasets show that our method is superior to the state-of-the-art approaches.
http://arxiv.org/abs/1904.00048
The operational space of an autonomous vehicle (AV) can be diverse and vary significantly. This may lead to scenarios that were not postulated in the design phase. Because of this, formulating a rule-based decision maker for selecting maneuvers may not be ideal. Similarly, it may not be effective to design an a priori cost function and then solve the optimal control problem in real time. In order to address these issues and to avoid peculiar behaviors when encountering unforeseen scenarios, we propose a reinforcement learning (RL) based method, where the ego car, i.e., an autonomous vehicle, learns to make decisions by directly interacting with simulated traffic. The decision maker for the AV is implemented as a deep neural network providing an action choice for a given system state. In a critical application such as driving, an RL agent without an explicit notion of safety may not converge, or may need an extremely large number of samples before finding a reliable policy. To address this issue, this paper combines reinforcement learning with an additional short-horizon safety check (SC). In a critical scenario, the safety check also provides an alternate safe action to the agent, if one exists. This leads to two novel contributions. First, it generalizes the states that could lead to undesirable “near-misses” or “collisions”. Second, the inclusion of the safety check provides a safe and stable training environment. This significantly enhances learning efficiency without inhibiting meaningful exploration, ensuring safe and optimal learned behavior. We demonstrate the performance of the developed algorithm in a highway driving scenario where the trained AV encounters varying traffic density.
http://arxiv.org/abs/1904.00035
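The action-override pattern described above can be sketched as follows; the longitudinal dynamics, thresholds, and fallback rule here are toy assumptions made only to show how a short-horizon check can veto and replace an RL agent's proposed action.

```python
# Hedged sketch of a short-horizon safety check: roll the proposed action
# forward through toy dynamics; if a near-miss is predicted, return a safe
# fallback action instead.
def safety_check(state, action, horizon=5):
    gap, rel_speed = state                 # toy state: gap to lead car, relative speed
    for _ in range(horizon):
        gap -= (action - rel_speed)        # toy longitudinal rollout
        if gap < 2.0:                      # predicted near-miss threshold
            return min(action, rel_speed - 1.0)   # fallback: decelerate
    return action

state = (10.0, 1.0)      # placeholder values
proposed = 3.0           # action proposed by the RL policy
print("executed action:", safety_check(state, proposed))
```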
Deep neural networks are known to suffer from catastrophic forgetting in class-incremental learning, where the performance on previous tasks drastically degrades when learning a new task. To alleviate this effect, we propose to leverage a continuous and large stream of unlabeled data in the wild. In particular, to leverage such transient external data effectively, we design a novel class-incremental learning scheme with (a) a new distillation loss, termed global distillation, (b) a learning strategy to avoid overfitting to the most recent task, and (c) a sampling strategy for the desired external data. Our experimental results on various datasets, including CIFAR and ImageNet, demonstrate the superiority of the proposed methods over prior methods, particularly when a stream of unlabeled data is accessible: we achieve up to 9.3% of relative performance improvement compared to the state-of-the-art method.
http://arxiv.org/abs/1903.12648
We consider preoperative prediction of thyroid cancer based on ultra-high-resolution whole-slide cytopathology images. Inspired by how human experts perform diagnosis, our approach first identifies and classifies diagnostic image regions containing informative thyroid cells, which only comprise a tiny fraction of the entire image. These local estimates are then aggregated into a single prediction of thyroid malignancy. Several unique characteristics of thyroid cytopathology guide our deep-learning-based approach. While our method is closely related to multiple-instance learning, it deviates from these methods by using a supervised procedure to extract diagnostically relevant regions. Moreover, we propose to simultaneously predict thyroid malignancy, as well as a diagnostic score assigned by a human expert, which further allows us to devise an improved training strategy. Experimental results show that the proposed algorithm achieves performance comparable to human experts, and demonstrate the potential of using the algorithm for screening and as an assistive tool for the improved diagnosis of indeterminate cases.
http://arxiv.org/abs/1904.00839
Insufficient or even unavailable training data for emerging classes is a big challenge for many classification tasks, including text classification. Recognising text documents of classes that have never been seen in the learning stage, so-called zero-shot text classification, is therefore difficult, and only a few previous works have tackled this problem. In this paper, we propose a two-phase framework together with data augmentation and feature augmentation to solve this problem. Four kinds of semantic knowledge (word embeddings, class descriptions, class hierarchy, and a general knowledge graph) are incorporated into the proposed framework to deal with instances of unseen classes effectively. Experimental results show that each of the two phases, as well as their combination, achieves the best overall accuracy compared with baselines and recent approaches in classifying real-world texts under the zero-shot scenario.
http://arxiv.org/abs/1903.12626
Robotic systems often need to consider multiple tasks concurrently. This challenge calls for control synthesis algorithms that are capable of fulfilling multiple control specifications simultaneously while maintaining the stability of the overall system. In this paper, we decompose complex, multi-objective tasks into subtasks, where individual subtask controllers are designed independently and then combined to generate the overall control policy. In particular, we adopt Riemannian Motion Policies (RMPs), a recently proposed controller structure in robotics, and RMPflow, its associated computational framework for combining RMP controllers. We re-establish and extend the stability results of RMPflow through a rigorous Control Lyapunov Function (CLF) treatment. We then show that RMPflow can stably combine individually designed subtask controllers that satisfy certain CLF constraints. This new insight leads to an efficient CLF-based computational framework to generate stable controllers that consider all the subtasks simultaneously. Compared with the original usage of RMPflow, our framework provides users the flexibility to incorporate design heuristics through nominal controllers for the subtasks. We validate the proposed computational framework through numerical simulation and robotic implementation.
http://arxiv.org/abs/1903.12605
Research on aerial manipulation systems has increased rapidly in recent years. These systems are very attractive for a wide range of applications due to their unique features. However, the dynamics, control and manipulation tasks of such systems are quite challenging because they are naturally unstable, have very fast dynamics and strong nonlinearities, are very susceptible to parameter variations due to carrying a payload as well as external disturbances, and have complex inverse kinematics. In addition, manipulation tasks require estimating (applying) a certain force of (at) the end-effector, as well as its accurate positioning. Thus, in this article, a robust force estimation and impedance control scheme is proposed to address these issues. Robustness is achieved based on the Disturbance Observer (DOb) technique. Then, a linear controller with low computational cost is used for tracking performance. For teleoperation purposes, the contact force needs to be identified. However, the currently developed techniques for force estimation have limitations, because they ignore some dynamics and/or require an indicator of environment contact. Unlike these techniques, we propose a technique based on the linearization capabilities of DOb and a Fast Tracking Recursive Least Squares (FTRLS) algorithm. The complex inverse kinematics problem of such a system is solved by a Jacobian-based algorithm. The stability analysis of the proposed scheme is presented. The algorithm is tested on tracking of task-space reference trajectories along with impedance control. The efficiency of the proposed technique is demonstrated via numerical simulation.
http://arxiv.org/abs/1904.00008
Descriptive comments play a crucial role in the software engineering process. They decrease development time, enable better bug detection, and facilitate the reuse of previously written code. However, comments are commonly the last of a software developer’s priorities and are thus either insufficient or missing entirely. Automatic source code summarization may therefore have the ability to significantly improve the software development process. We introduce a novel encoder-decoder model that summarizes source code, effectively writing a comment to describe the code’s functionality. We make two primary innovations beyond current source code summarization models. First, our encoder is fully language-agnostic and requires no complex input preprocessing. Second, our decoder has an open vocabulary, enabling it to predict any word, even ones not seen in training. We demonstrate results comparable to state-of-the-art methods on a single-language data set and provide the first results on a data set consisting of multiple programming languages.
http://arxiv.org/abs/1904.00805
Implementing color constancy as a pre-processing step in contemporary digital cameras is of significant importance, as it removes the influence of scene illumination on object colors. Several benchmark color constancy datasets have been created for the purpose of developing and testing new color constancy methods. However, they all have numerous drawbacks, including a small number of images, erroneously extracted ground-truth illuminations, long histories of misuse, violations of their stated assumptions, etc. To overcome these and similar problems, in this paper a color constancy benchmark dataset generator is proposed. For a given camera sensor, it enables the generation of any number of realistic raw images taken in a subset of the real world, namely images of printed photographs. Datasets with such images share many positive features with other existing real-world datasets, while some of the negative features are completely eliminated. The generated images can be successfully used to train methods that afterward achieve high accuracy on real-world datasets. This opens the way for creating datasets large enough for advanced deep learning techniques. Experimental results are presented and discussed. The source code is available at this http URL
http://arxiv.org/abs/1903.12581
Deep learning methods capable of handling relational data have proliferated over the last years. In contrast to traditional relational learning methods that leverage first-order logic for representing such data, these deep learning methods aim at re-representing symbolic relational data in Euclidean spaces. They offer better scalability, but can only numerically approximate relational structures and are less flexible in terms of reasoning tasks supported. This paper introduces a novel framework for relational representation learning that combines the best of both worlds. This framework, inspired by the auto-encoding principle, uses first-order logic as a data representation language, and the mapping between the original and latent representation is done by means of logic programs instead of neural networks. We show how learning can be cast as a constraint optimisation problem for which existing solvers can be used. The use of logic as a representation language makes the proposed framework more accurate (as the representation is exact, rather than approximate), more flexible, and more interpretable than deep learning methods. We experimentally show that these latent representations are indeed beneficial in relational learning tasks.
http://arxiv.org/abs/1903.12577
Prostate cancer is the most common cancer among US men. However, prostate imaging is still challenging despite the advances in multi-parametric Magnetic Resonance Imaging (MRI), which provides both morphologic and functional information pertaining to the pathological regions. Along with whole prostate gland segmentation, distinguishing between the Central Gland (CG) and Peripheral Zone (PZ) can guide towards differential diagnosis, since the frequency and severity of tumors differ in these regions; however, their boundary is often weak and fuzzy. This work presents a preliminary study on Deep Learning to automatically delineate the CG and PZ, aiming at evaluating the generalization ability of Convolutional Neural Networks (CNNs) on two multi-centric MRI prostate datasets. Especially, we compared three CNN-based architectures: SegNet, U-Net, and pix2pix. In such a context, the segmentation performances achieved with/without pre-training were compared in 4-fold cross-validation. In general, U-Net outperforms the other methods, especially when training and testing are performed on multiple datasets.
http://arxiv.org/abs/1903.12571
A femtosecond laser focused inside bulk GaN was used to slice a thin GaN film with an epitaxial device structure from a bulk GaN substrate. The demonstrated laser slicing lift-off process did not require any special release layers in the epitaxial structure. GaN film with a thickness of 5 $\mu$m and an InGaN LED epitaxial device structure was lifted off a GaN substrate and transferred onto a copper substrate. The electroluminescence of the LED chip after the laser slicing lift-off was demonstrated.
https://arxiv.org/abs/1902.06348
Due to the lack of available annotated medical images, accurate computer-assisted diagnosis requires intensive Data Augmentation (DA) techniques, such as geometric/intensity transformations of original images; however, those transformed images intrinsically have a distribution similar to the original ones, leading to limited performance improvement. To fill the gap in the real image distribution, we synthesize brain contrast-enhanced Magnetic Resonance (MR) images (realistic but completely different from the original ones) using Generative Adversarial Networks (GANs). This study exploits Progressive Growing of GANs (PGGANs), a multi-stage generative training method, to generate original-sized 256 x 256 MR images for Convolutional Neural Network-based brain tumor detection, which is challenging via conventional GANs; difficulties arise due to unstable GAN training at high resolution and the variety of tumors in size, location, shape, and contrast. Our preliminary results show that this novel PGGAN-based DA method can achieve promising performance improvement, when combined with classical DA, in tumor detection and also in other medical imaging tasks.
http://arxiv.org/abs/1903.12564
It is well known that deep neural networks (DNNs) are vulnerable to adversarial attacks, which are implemented by adding crafted perturbations onto benign examples. Min-max robust optimization based adversarial training can provide a notion of security against adversarial attacks. However, adversarial robustness requires a significantly larger capacity of the network than that for the natural training with only benign examples. This paper proposes a framework of concurrent adversarial training and weight pruning that enables model compression while still preserving the adversarial robustness and essentially tackles the dilemma of adversarial training. Furthermore, this work studies two hypotheses about weight pruning in the conventional network pruning setting and finds that weight pruning is essential for reducing the network model size in the adversarial setting, i.e., training a small model from scratch even with inherited initialization from the large model cannot achieve both adversarial robustness and model compression.
http://arxiv.org/abs/1903.12561
Accurate computer-assisted diagnosis using Convolutional Neural Networks (CNNs) requires large-scale annotated training data, which entails time-consuming labor from expert physicians; thus, Data Augmentation (DA) using Generative Adversarial Networks (GANs) is essential in Medical Imaging, since GANs can synthesize additional annotated training data to handle small and fragmented collections of medical images from various scanners; those images are realistic but completely different from the original ones, filling the gap in the real image distribution. As a tutorial, this paper introduces background on GAN-based Medical Image Augmentation, along with tricks to achieve high classification/object detection/segmentation performance using it, based on our empirical experience and related work. Moreover, we present our first GAN-based DA work using automatic bounding box annotation, for robust CNN-based brain metastases detection on 256 x 256 MR images; GAN-based DA can boost sensitivity in diagnosis by 10% with a clinically acceptable amount of additional False Positives, even with highly rough and inconsistent bounding boxes.
http://arxiv.org/abs/1904.00838
Time series forecasting is one of the challenging problems for humankind. Traditional forecasting methods using mean regression models have severe shortcomings in reflecting real-world fluctuations. While new probabilistic methods rush to the rescue, they struggle with technical difficulties like quantile crossing or selecting a prior distribution. To meld the strengths of these fields while avoiding their weaknesses, and to push the boundary of the state of the art, we introduce ForGAN - one-step-ahead probabilistic forecasting with generative adversarial networks. ForGAN utilizes the power of the conditional generative adversarial network to learn the data-generating distribution and compute probabilistic forecasts from it. We argue how to evaluate ForGAN in comparison to regression methods. To investigate the probabilistic forecasting of ForGAN, we create a new dataset and demonstrate our method's abilities on it. This dataset will be made publicly available for comparison. Furthermore, we test ForGAN on two publicly available datasets, namely the Mackey-Glass dataset and the Internet traffic dataset (A5M), where the impressive performance of ForGAN demonstrates its high capability in forecasting future values.
http://arxiv.org/abs/1903.12549
In this work we describe a novel motion guided method for targetless self-calibration of a LiDAR and camera and use the re-projection of LiDAR points onto the image reference frame for real-time depth upsampling. The calibration parameters are estimated by optimizing an objective function that penalizes distances between 2D and re-projected 3D motion vectors obtained from time-synchronized image and point cloud sequences. For upsampling, we propose a simple, yet effective and time efficient formulation that minimizes depth gradients subject to an equality constraint involving the LiDAR measurements. We test our algorithms on real data from urban environments and demonstrate that our two methods are effective and suitable to mobile robotics and autonomous vehicle applications imposing real-time requirements.
http://arxiv.org/abs/1803.10681
Topic models, such as LDA, are widely used in Natural Language Processing. Making their output interpretable is an important area of research, with applications to areas such as the enhancement of exploratory search interfaces and the development of interpretable machine learning models. Conventionally, topics are represented by their n most probable words; however, these representations are often difficult for humans to interpret. This paper explores the re-ranking of topic words to generate more interpretable topic representations. A range of approaches are compared and evaluated in two experiments. The first uses crowdworkers to associate topics represented by different word rankings with related documents. The second experiment is an automatic approach based on a document retrieval task applied on multiple domains. Results in both experiments demonstrate that re-ranking words improves topic interpretability and that the most effective re-ranking schemes are those which combine information about the importance of words both within topics and their relative frequency in the entire corpus. In addition, the close correlation between the results of the two evaluation approaches suggests that the automatic method proposed here could be used to evaluate re-ranking methods without the need for human judgements.
http://arxiv.org/abs/1903.12542
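As a concrete example of the most effective family reported above (schemes combining within-topic importance with corpus-wide frequency), here is a hedged sketch that scores words by their topic probability weighted against their overall corpus probability; the actual re-ranking schemes compared in the paper may differ.

```python
# Hedged sketch: re-rank topic words by topic-specific relevance, i.e. favor
# words that are probable in the topic but rare in the corpus overall.
import numpy as np

def rerank(topic_p, corpus_p, vocab, n=3, eps=1e-12):
    score = topic_p * np.log(topic_p / (corpus_p + eps) + eps)
    top = np.argsort(score)[::-1][:n]
    return [vocab[i] for i in top]

vocab = ["model", "data", "gene", "protein", "cell", "learn"]
topic_p = np.array([0.30, 0.25, 0.20, 0.12, 0.08, 0.05])   # p(word | topic)
corpus_p = np.array([0.20, 0.22, 0.02, 0.01, 0.02, 0.10])  # p(word) in corpus
print(rerank(topic_p, corpus_p, vocab))   # surfaces topic-specific words
```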
The Estimation of Distribution Algorithm (EDA) is a new class of population-based search methods in which a probabilistic model of individuals is estimated from the high-quality individuals and used to generate new individuals. In this paper we compute (1) some upper bounds on the number of iterations required for global convergence of an EDA, and (2) the exact number of iterations needed for an EDA to converge to the global optimum.
http://arxiv.org/abs/cs/0601132
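Since the abstract is purely theoretical, a concrete instance may help fix ideas: the sketch below is a standard univariate EDA (UMDA-style) on the OneMax toy problem, with all parameters chosen for illustration rather than taken from the paper.

```python
# UMDA-style EDA on OneMax: fit a per-bit Bernoulli model to the best
# individuals, then sample the next population from that model.
import numpy as np

rng = np.random.default_rng(0)
n_bits, pop_size, n_select, iters = 30, 100, 30, 50
p = np.full(n_bits, 0.5)                   # initial probabilistic model

for _ in range(iters):
    pop = (rng.random((pop_size, n_bits)) < p).astype(int)
    fitness = pop.sum(axis=1)              # OneMax: number of ones
    elite = pop[np.argsort(fitness)[-n_select:]]
    p = elite.mean(axis=0)                 # re-estimate the model from the elite
    p = np.clip(p, 0.02, 0.98)             # keep the model from collapsing early

print("best fitness:", fitness.max(), "of", n_bits)
```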
Gaze redirection is the task of changing the gaze to a desired direction for a given monocular eye patch image. Many applications such as videoconferencing, films and games, and generation of training data for gaze estimation require redirecting the gaze, without distorting the appearance of the area surrounding the eye and while producing photo-realistic images. Existing methods lack the ability to generate perceptually plausible images. In this work, we present a novel method to alleviate this problem by leveraging generative adversarial training to synthesize an eye image conditioned on a target gaze direction. Our method ensures perceptual similarity and consistency of synthesized images to the real images. Furthermore, a gaze estimation loss is used to control the gaze direction accurately. To attain high-quality images, we incorporate perceptual and cycle consistency losses into our architecture. In extensive evaluations we show that the proposed method outperforms state-of-the-art approaches in terms of both image quality and redirection precision. Finally, we show that generated images can bring significant improvement for the gaze estimation task if used to augment real training data.
http://arxiv.org/abs/1903.12530
While deep neural network (DNN) based single image super-resolution (SISR) methods are rapidly gaining popularity, they are mainly designed for the widely-used bicubic degradation, and there remains the fundamental challenge of super-resolving low-resolution (LR) images with arbitrary blur kernels. Meanwhile, plug-and-play image restoration has been recognized for its high flexibility due to its modular structure, which allows easy plug-in of denoiser priors. In this paper, we propose a principled formulation and framework by extending bicubic-degradation-based deep SISR with the help of the plug-and-play framework to handle LR images with arbitrary blur kernels. Specifically, we design a new SISR degradation model so as to take advantage of existing blind deblurring methods for blur kernel estimation. To optimize the energy function induced by the new degradation model, we then derive a plug-and-play algorithm via the variable splitting technique, which allows us to plug in any super-resolver prior, rather than a denoiser prior, as a modular part. Quantitative and qualitative evaluations on synthetic and real LR images demonstrate that the proposed deep plug-and-play super-resolution framework is flexible and effective in dealing with blurry LR images.
http://arxiv.org/abs/1903.12529
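The variable-splitting idea above can be illustrated with a hedged half-quadratic-splitting loop: alternate a data-fit step on the degradation model with a plug-in prior step. Below, a Gaussian smoother stands in for the learned super-resolver prior that the paper actually plugs in, and the blur operator, step size, and penalty weight are toy assumptions.

```python
# Hedged HQS-style plug-and-play sketch: gradient step on the data term plus
# a quadratic coupling to z, where z comes from a swappable prior module.
import numpy as np
from scipy.ndimage import gaussian_filter

def hqs(y, degrade, adjoint, prior_step, mu=0.5, step=0.2, iters=20):
    x = adjoint(y)                            # crude initialization
    for _ in range(iters):
        z = prior_step(x)                     # prior subproblem: plug-in module
        grad = adjoint(degrade(x) - y) + mu * (x - z)
        x = x - step * grad                   # data subproblem: gradient move
    return x

blur = lambda img: gaussian_filter(img, sigma=1.5)   # toy degradation (self-adjoint)
clean = np.zeros((32, 32)); clean[12:20, 12:20] = 1.0
y = blur(clean) + 0.01 * np.random.default_rng(0).normal(size=clean.shape)
x_hat = hqs(y, blur, blur, prior_step=lambda v: gaussian_filter(v, sigma=0.8))
print("mean abs error:", np.abs(x_hat - clean).mean())
```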
We present a training system, which can provably defend significantly larger neural networks than previously possible, including ResNet-34 and DenseNet-100. Our approach is based on differentiable abstract interpretation and introduces two novel concepts: (i) abstract layers for fine-tuning the precision and scalability of the abstraction, (ii) a flexible domain specific language (DSL) for describing training objectives that combine abstract and concrete losses with arbitrary specifications. Our training method is implemented in the DiffAI system.
http://arxiv.org/abs/1903.12519
An effective way to achieve intelligence is to simulate various intelligent behaviors of the human brain. In recent years, bio-inspired learning methods have emerged that differ from the classical mathematical programming principle. From the perspective of brain inspiration, reinforcement learning has gained additional interest in solving decision-making tasks as increasing neuroscientific research demonstrates significant links between reinforcement learning and specific neural substrates. Because of the tremendous research focused on human brains and reinforcement learning, scientists have investigated how robots can autonomously tackle complex tasks, in the form of self-driving agent control, in a human-like way. In this study, we propose an end-to-end architecture using a novel deep Q-network architecture in conjunction with recurrence to resolve this problem in the field of simulated self-driving. The main contribution of this study is that we trained the driving agent using a brain-inspired trial-and-error technique, in line with real-world situations. In addition, there are three innovations in the proposed learning network: raw screen outputs are the only information the driving agent can rely on; a weighted layer enhances the differences within a lengthy episode; and a modified replay mechanism overcomes the problem of sparsity and accelerates learning. The proposed network was trained and tested in a third-party OpenAI Gym environment. After training for several episodes, the resulting driving agent performed advanced behaviors in the given scene. We hope that, in the future, the proposed brain-inspired learning system will inspire practicable self-driving control solutions.
http://arxiv.org/abs/1903.12517
One of the grand challenges of deep learning is the requirement to obtain large labeled training data sets. While synthesized data sets can be used to overcome this challenge, it is important that these data sets close the reality gap, i.e., that a model trained on synthetic image data is able to generalize to real images. Whereas the reality gap can be considered bridged in several application scenarios, training on synthesized images containing reflecting materials requires further research. Since the appearance of objects with reflecting materials is dominated by the surrounding environment, this interaction needs to be considered during training data generation. Therefore, within this paper we examine the effect of reflecting materials in the context of synthetic image generation for training object detectors. We investigate the influence of the rendering approach used for image synthesis, the effect of domain randomization, as well as the amount of training data used. To be able to compare our results to the state of the art, we focus on indoor scenes, as they have been investigated extensively. Within this scenario, bathroom furniture is a natural choice for objects with reflecting materials, for which we report our findings on real and synthetic testing data.
http://arxiv.org/abs/1904.00824
The traditional ground-and-solve approach to Answer Set Programming (ASP) suffers from the grounding bottleneck, which makes large-scale problem instances unsolvable. Lazy grounding is an alternative approach that interleaves grounding with solving and thus uses space more efficiently. The limited view on the search space in lazy grounding poses unique challenges, however, and can have adverse effects on solving performance. In this paper we present a novel characterization of degrees of laziness in grounding for ASP, i.e. of compromises between lazily grounding as little as possible and the traditional full grounding upfront. We investigate how these degrees of laziness compare to each other formally as well as, by means of an experimental analysis using a number of benchmarks, in terms of their effects on solving performance. Our contributions are the introduction of a range of novel lazy grounding strategies, a formal account on their relationships and their correctness, and an investigation of their effects on solving performance. Experiments show that our approach performs significantly better than state-of-the-art lazy grounding in many cases.
http://arxiv.org/abs/1903.12510