Context categorization is a fundamental prerequisite for multi-domain multimedia content analysis applications that must manage contextual information efficiently. In this paper, we introduce a new color image context categorization method (DITEC) based on the trace transform. The problem of reducing the dimensionality of the obtained trace transform signal is addressed through statistical descriptors that preserve the underlying information. The extracted features are highly discriminative for content categorization. The theoretical properties of the method are analyzed and validated experimentally on two different datasets.
http://arxiv.org/abs/1208.3901
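To make the idea concrete, here is a minimal Python sketch of a trace-transform-style feature extractor: trace a grayscale image along lines at many angles and compress each 1-D trace signal into a few statistical descriptors. It uses scikit-image's Radon transform (the integral functional) as a stand-in; DITEC's actual trace functionals and descriptors differ.

```python
import numpy as np
from scipy import stats
from skimage.transform import radon

def trace_features(image, angles=np.arange(0, 180, 2)):
    """Toy trace-transform features: integrate the image along lines at many
    angles (Radon transform), then reduce each angle's 1-D signal to a few
    statistical descriptors, shrinking the 2-D signal to a short vector."""
    sinogram = radon(image, theta=angles, circle=False)  # (positions, angles)
    descriptors = []
    for col in sinogram.T:  # one 1-D trace signal per angle
        descriptors.extend([col.mean(), col.std(),
                            stats.skew(col), stats.kurtosis(col)])
    return np.asarray(descriptors)
```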
We show how to control the movement of a wheeled robot using on-board liquid marbles made of Belousov-Zhabotinsky solution droplets coated with polyethylene powder. Two stainless steel, iridium-coated electrodes were inserted in a marble, and the recorded electrical potential was used to control the robot's motor. We stimulated the marble with a laser beam, and it responded to the stimulation with pronounced changes in its electrical potential output. The robot detected this electrical output and changed its trajectory in response to the stimulation. The results open new horizons for applications of oscillatory chemical reactions in robotics.
http://arxiv.org/abs/1904.01520
In this paper, we propose a novel meta-learning approach for automatic channel pruning of very deep neural networks. We first train a PruningNet, a kind of meta network, which is able to generate weight parameters for any pruned structure of the target network. We use a simple stochastic structure sampling method for training the PruningNet. Then, we apply an evolutionary procedure to search for well-performing pruned networks. The search is highly efficient because the weights are directly generated by the trained PruningNet and no fine-tuning is needed. With a single PruningNet trained for the target network, we can search for various pruned networks under different constraints with little human participation. We demonstrate competitive performance on MobileNet V1/V2 networks, with up to 9.0/9.9 points higher ImageNet accuracy than the V1/V2 baselines. Compared to previous state-of-the-art AutoML-based pruning methods, such as AMC and NetAdapt, we achieve higher or comparable accuracy under various conditions.
http://arxiv.org/abs/1903.10258
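As a rough illustration of the stochastic structure sampling step described above, the sketch below draws a random channel width for every layer and asks a PruningNet to generate weights for that structure. The `pruning_net` and `pruned_net_fn` interfaces are hypothetical placeholders; the real PruningNet architecture and structure encoding are defined in the paper, not here.

```python
import random
import torch
import torch.nn.functional as F

def sample_structure(layer_channels, ratios=(0.25, 0.5, 0.75, 1.0)):
    """Stochastic structure sampling: randomly shrink each layer's width."""
    return [max(1, int(c * random.choice(ratios))) for c in layer_channels]

def train_step(pruning_net, pruned_net_fn, images, labels, layer_channels, opt):
    widths = sample_structure(layer_channels)
    encoding = torch.tensor(widths, dtype=torch.float32)
    weights = pruning_net(encoding)                  # meta net generates weights
    logits = pruned_net_fn(images, weights, widths)  # run the pruned network
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()     # updates PruningNet params
    return loss.item()
```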
We applied deep learning to create an algorithm for breathing phase detection in lung sound recordings, and we compared the breathing phases detected by the algorithm with those manually annotated by two experienced lung sound researchers. Our algorithm uses a convolutional neural network with spectrograms as the features, removing the need to specify features explicitly. We trained and evaluated the algorithm on three subsets that are larger than those previously reported in the literature. We evaluated performance in two ways. First, a discrete count of agreed breathing phases (counting agreement when a pair of boxes overlaps by at least 50%) shows a mean agreement with the lung sound experts of 97% for inspiration and 87% for expiration. Second, the fraction of time in agreement (in seconds) gives higher pseudo-kappa values for inspiration (0.73-0.88) than expiration (0.63-0.84), with an average sensitivity of 97% and an average specificity of 84%. By both evaluation methods, the agreement between the annotators and the algorithm indicates human-level performance for the algorithm. The developed algorithm is valid for detecting breathing phases in lung sound recordings.
http://arxiv.org/abs/1903.10251
We review computational and robotics models of early language learning and development. We first explain why and how these models are used to understand better how children learn language. We argue that they provide concrete theories of language learning as a complex dynamic system, complementing traditional methods in psychology and linguistics. We review different modeling formalisms, grounded in techniques from machine learning and artificial intelligence such as Bayesian and neural network approaches. We then discuss their role in understanding several key mechanisms of language development: cross-situational statistical learning, embodiment, situated social interaction, intrinsically motivated learning, and cultural evolution. We conclude by discussing future challenges for research, including modeling of large-scale empirical data about language acquisition in real-world environments. Keywords: Early language learning, Computational and robotic models, machine learning, development, embodiment, social interaction, intrinsic motivation, self-organization, dynamical systems, complexity.
http://arxiv.org/abs/1903.10246
Two types of knowledge, factoid knowledge from graphs and non-factoid knowledge from unstructured documents, have been studied for knowledge-aware open-domain conversation generation: edge information in graphs can help knowledge selectors generalize, while the text sentences of non-factoid knowledge can provide rich information for response generation. Fusing knowledge triples and sentences might yield mutually reinforcing advantages for conversation generation, but this has received little study. To address this challenge, we propose a knowledge-aware chatting machine with three components: an augmented knowledge graph containing both factoid and non-factoid knowledge, a knowledge selector, and a response generator. We formulate knowledge selection on the graph as a multi-hop graph reasoning problem, which is more flexible than previous one-hop knowledge selection models. To fully leverage the long text information that differentiates our graph from others, we improve a state-of-the-art reasoning algorithm with machine reading comprehension technology. We demonstrate that, supported by this unified knowledge and knowledge selection method, our system can generate more appropriate and informative responses than baselines.
http://arxiv.org/abs/1903.10245
The problem of learning to translate between two vector spaces given a set of aligned points arises in several application areas of NLP. Current solutions assume that the lexicon which defines the alignment pairs is noise-free. We consider the case where the set of aligned points contains some noise, in the form of incorrect lexicon pairs, and show that this arises in practice by analyzing dictionaries edited during the cleaning process. We demonstrate that such noise substantially degrades the accuracy of the learned translation when current methods are used. We propose a model that accounts for noisy pairs, achieved by introducing a generative model with a compatible iterative EM algorithm. The algorithm jointly learns the noise level in the lexicon, finds the set of noisy pairs, and learns the mapping between the spaces. We demonstrate the effectiveness of our proposed algorithm on two alignment problems: bilingual word embedding translation, and mapping between diachronic embedding spaces for recovering the semantic shifts of words across time periods.
http://arxiv.org/abs/1903.10238
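A toy version of such an EM scheme is sketched below: residuals of a linear map are modeled as a mixture of a tight "clean" Gaussian and a broad "noise" Gaussian, and the responsibilities, the map, and the noise fraction are updated alternately. This is a simplified stand-in under those assumptions, not the paper's exact generative model.

```python
import numpy as np

def noisy_pairs_em(X, Y, n_iters=20):
    """Toy EM for learning a linear map W between embedding spaces when some
    aligned pairs (rows of X and Y) are wrong: residuals are modeled as a
    mixture of a tight 'clean' Gaussian and a broad 'noise' Gaussian."""
    n, d = Y.shape
    w_resp = np.ones(n)            # responsibility of each pair being clean
    alpha, sigma2_n = 0.1, 100.0   # noise fraction, fixed broad noise variance
    for _ in range(n_iters):
        # M-step: weighted least squares for W (rows weighted by sqrt(w))
        sw = np.sqrt(w_resp)[:, None]
        W = np.linalg.lstsq(X * sw, Y * sw, rcond=None)[0]
        r2 = ((Y - X @ W) ** 2).sum(axis=1)
        sigma2_c = max(1e-6, (w_resp * r2).sum() / (d * w_resp.sum()))
        # E-step: posterior that each pair is clean (computed in log space)
        log_pc = np.log(1 - alpha + 1e-12) - r2 / (2 * sigma2_c) - d / 2 * np.log(sigma2_c)
        log_pn = np.log(alpha + 1e-12) - r2 / (2 * sigma2_n) - d / 2 * np.log(sigma2_n)
        w_resp = 1.0 / (1.0 + np.exp(np.clip(log_pn - log_pc, -50.0, 50.0)))
        alpha = 1.0 - w_resp.mean()
    return W, w_resp, alpha
```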
Many recent few-shot learning methods concentrate on designing novel model architectures. In this paper, we instead show that with a simple backbone convolutional network we can surpass state-of-the-art classification accuracy. The essential contributor to this superior performance is an adversarial feature learning strategy that improves the generalization capability of our model. In this work, adversarial features are those that make the classifier uncertain about its prediction. To generate adversarial features, we first locate adversarial regions based on the derivative of the entropy with respect to an averaging mask. Then we use the adversarial region attention to aggregate the feature maps into the adversarial features. In this way, we can explore and exploit the entire spatial area of the feature maps to mine more diverse discriminative knowledge. We perform extensive model evaluations and analyses on the miniImageNet and tieredImageNet datasets, demonstrating the effectiveness of the proposed method.
http://arxiv.org/abs/1903.10225
Mental well-being and social media have been closely related domains of study. In this research, a novel AD prediction model for anxious depression prediction in real-time tweets is proposed. This mixed anxiety-depressive disorder is predominantly associated with erratic thought processes, restlessness, and sleeplessness. Based on linguistic cues and user posting patterns, the feature set is defined using a 5-tuple vector <word, timing, frequency, sentiment, contrast>. An anxiety-related lexicon is built to detect the presence of anxiety indicators. The timing and frequency of tweets are analyzed for irregularities, and opinion polarity analytics is used to find inconsistencies in posting behaviour. The model is trained using three classifiers (multinomial naïve Bayes, gradient boosting, and random forest), and majority voting is performed with an ensemble voting classifier. Preliminary results are evaluated on the tweets of 100 sampled users, and the proposed model achieves a classification accuracy of 85.09%.
http://arxiv.org/abs/1903.10222
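The majority-voting stage maps directly onto scikit-learn. A minimal sketch, assuming the 5-tuple-derived features have already been assembled into a numeric matrix `X_train` with labels `y_train` (the feature engineering itself is not shown):

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.naive_bayes import MultinomialNB

# Majority (hard) voting over the three classifiers named above.
ensemble = VotingClassifier(
    estimators=[("nb", MultinomialNB()),
                ("gb", GradientBoostingClassifier()),
                ("rf", RandomForestClassifier())],
    voting="hard",
)
# ensemble.fit(X_train, y_train)
# predictions = ensemble.predict(X_test)
```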
The dynamic nature of real-world optimization problems raises new challenges for traditional particle swarm optimization (PSO). In response to these challenges, dynamic optimization has received considerable attention over the past decade. This paper introduces a new dynamic multi-objective optimization approach based on particle swarm optimization (Dynamic-MOPSO). The main idea is to solve such dynamic problems with a new environment change detection strategy that exploits the strengths of particle swarm optimization. Our approach is thus designed not only to find optimal solutions but also to detect environment changes. Dynamic-MOPSO thereby balances exploration and exploitation in a dynamic search space. We evaluate our approach on the most widely used dynamic benchmark functions.
https://arxiv.org/abs/1903.10681
In many practical transfer learning scenarios, the feature distribution differs across the source and target domains (i.e., it is non-i.i.d.). Maximum mean discrepancy (MMD), as a domain discrepancy metric, has achieved promising performance in unsupervised domain adaptation (DA). We argue that MMD-based DA methods ignore the data locality structure, which, to some extent, can cause a negative transfer effect. Locality plays an important role in minimizing the nonlinear local domain discrepancy underlying the marginal distributions. To better exploit domain locality, this paper proposes a novel local generative discrepancy metric (LGDM)-based intermediate domain generation learning method called Manifold Criterion guided Transfer Learning (MCTL). The merits of the proposed MCTL are four-fold: 1) the concept of the manifold criterion (MC) is proposed for the first time as a measure validating distribution matching across domains, and domain adaptation is achieved if the MC is satisfied; 2) the proposed MC can well guide the generation of an intermediate domain sharing a similar distribution with the target domain by minimizing the local domain discrepancy; 3) a global generative discrepancy metric (GGDM) is presented, such that both the global and local discrepancy can be effectively and positively reduced; 4) a simplified version of MCTL called MCTL-S is presented under a perfect domain generation assumption for a more generic learning scenario. Experiments on a number of benchmark visual transfer tasks demonstrate the superiority of the proposed manifold criterion guided generative transfer method in comparison with other state-of-the-art methods. The source code is available at https://github.com/wangshanshanCQU/MCTL.
http://arxiv.org/abs/1903.10211
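For reference, the standard (biased) RBF-kernel estimator of the MMD that this line of work starts from fits in a few lines of PyTorch; the paper's LGDM and GGDM metrics are refinements beyond this baseline sketch.

```python
import torch

def mmd_rbf(xs, xt, sigma=1.0):
    """Squared MMD between source features xs and target features xt with an
    RBF kernel: MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)]."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(xs, xs).mean() + k(xt, xt).mean() - 2 * k(xs, xt).mean()
```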
How should prior knowledge from physics inform a neural network solution? We study the blending of physics and deep learning in the context of Shape from Polarization (SfP). The classic SfP problem recovers an object’s shape from polarized photographs of the scene. The SfP problem is special because the physical models are only approximate. Previous attempts to solve SfP have been purely model-based, and are susceptible to errors when real-world conditions deviate from the idealized physics. In our solution, there is a subtlety to combining physics and neural networks. Our final solution blends deep learning with synthetic renderings (derived from physics) in the framework of a two-stage encoder. The lessons learned from this exemplary problem foreshadow the future impact of physics-based learning.
http://arxiv.org/abs/1903.10210
Currently, digital maps are indispensable for automated driving. However, due to the low precision and reliability of GNSS, particularly in urban areas, fusing trajectories from independent recording sessions and different regions is a challenging task. To bypass the flaws of directly incorporating GNSS measurements for geo-referencing, the use of aerial imagery seems promising. Furthermore, more accurate geo-referencing improves the global map accuracy and allows the sensor calibration error to be estimated. In this paper, we present a novel geo-referencing approach that aligns trajectories to aerial imagery using poles and road markings. To robustly match features extracted from sensor observations to aerial imagery landmarks, a RANSAC-based matching approach is applied in a sliding window. For this, we assume that the trajectories are roughly referenced to the imagery, which can be achieved with coarse measurements from a low-cost GNSS receiver. Finally, we align the initial trajectories precisely to the aerial imagery by minimizing a geometric cost function comprising all determined matches. Evaluations performed on data recorded in Karlsruhe, Germany, show that our algorithm yields trajectories that are accurately referenced to the aerial imagery used.
http://arxiv.org/abs/1903.10205
Heterogeneous Face Recognition (HFR) is a challenging problem because of the large domain discrepancy and the lack of heterogeneous data. This paper treats HFR as a dual generation problem and proposes a new Dual Variational Generation (DVG) framework. It generates large-scale paired heterogeneous images with the same identity from noise in order to reduce the domain gap in HFR, providing new insight into both challenges. Specifically, we first introduce a dual variational autoencoder to represent a joint distribution of paired heterogeneous images. Then, we impose a distribution alignment loss in the latent space and a pairwise identity preserving loss in the image space. These ensure that DVG can generate diverse paired heterogeneous images of the same identity. Moreover, a pairwise distance loss between the generated paired heterogeneous images contributes to the optimization of the HFR network, aiming to reduce the domain discrepancy. Significant recognition improvements are observed on four HFR databases, paving a new way to address low-shot HFR problems.
http://arxiv.org/abs/1903.10203
Because preferences naturally arise and play an important role in many real-life decisions, they form the backbone of various fields. In particular, preferences are increasingly used in almost all applications based on matching procedures. In this work we highlight the benefit of using AI insights on preferences in a large-scale application, namely the French Admission Post-Baccalaureat Platform (APB). Each year, APB allocates hundreds of thousands of first-year applicants to universities. This is done automatically by matching applicant preferences to university seats. In practice, APB can be unable to distinguish between applicants, which leads to the introduction of random selection. This has created frustration among the French public, since randomness, even when used as a last resort, does not sit well with the republican egalitarian principle. In this work, we provide a solution to this problem. We take advantage of recent results in AI preference theory to show how to enhance APB in order to improve the expressiveness of applicant preferences and reduce their exposure to random decisions.
http://arxiv.org/abs/1707.07298
Speech is a rich biometric signal that contains information about the identity, gender, and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) on raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g., a reference image or one-hot encoding). Our model is trained in a self-supervised manner by exploiting the audio and visual signals naturally aligned in videos. For training from video data, we present a novel dataset collected for this work, with high-quality videos of YouTubers with notable expressiveness in both the speech and visual signals.
http://arxiv.org/abs/1903.10195
The area of formal ethics is experiencing a shift from a unique or standard approach to normative reasoning, as exemplified by so-called standard deontic logic, to a variety of application-specific theories. However, the adequate handling of normative concepts such as obligation, permission, prohibition, and moral commitment is challenging, as illustrated by the notorious paradoxes of deontic logic. In this article we introduce an approach to designing and evaluating theories of normative reasoning. In particular, we present a formal framework based on higher-order logic and a design methodology, and we discuss tool support. Moreover, we illustrate the approach with an example implementation, demonstrate different ways of using it, and discuss how the design of normative theories is now made accessible to non-specialist users and developers.
http://arxiv.org/abs/1903.10187
Text-based person search aims to retrieve the corresponding persons in an image database by virtue of a describing sentence about the person, which poses great potential for various applications such as video surveillance. Extracting the visual contents corresponding to the human description is the key to this cross-modal matching problem. Moreover, correlated images and descriptions involve different levels of semantic relevance. To exploit the multilevel relevance between a human description and the corresponding visual contents, we propose a pose-guided joint global and attentive local matching network (GALM), which includes global, uni-local, and bi-local matching. The global matching network aims to learn global cross-modal representations. To further capture meaningful local relations, we propose a uni-local matching network that computes local similarities between image regions and the textual description and then utilizes a similarity-based hard attention to select the description-related image regions. In addition to sentence-level matching, fine-grained phrase-level matching is captured by the bi-local matching network, which employs pose information to learn latent semantic alignment between visual body parts and textual noun phrases. To verify the effectiveness of our model, we perform extensive experiments on the CUHK Person Description Dataset (CUHK-PEDES), which is currently the only available dataset for text-based person search. Experimental results show that our approach outperforms the state-of-the-art methods by 15% in terms of the top-1 metric.
http://arxiv.org/abs/1809.08440
In the interpretation of remote sensing images, images supplied by different sensors can be made understandable. For better visual perception of these images, it is essential to apply a series of pre-processing operations and elementary corrections, and then a series of main processing steps for more precise analysis. There are several processing approaches, depending on the type of remote sensing image. The approach discussed in this article, image fusion, uses the natural colors of an optical image to add color to a grayscale satellite image, enabling better observation of the HR image from the OLI sensor of Landsat-8. This process, with emphasis on the details of the fusion technique, has been performed previously; here, however, we examine the role of the interpolation step. In fact, the most famous remote sensing image processing tools, such as ENVI and ERDAS, offer only classical interpolation techniques (such as bi-linear (BL) and bi-cubic/cubic convolution (CC)). Therefore, ENVI- and ERDAS-based research on image fusion, and even other fusion research, often does not use newer and better interpolators and concentrates mainly on the details of the fusion algorithms to achieve better quality; we therefore focus on the impact of interpolation on fusion quality in Landsat-8 multispectral images. The important feature of this approach is the use of a statistical, adaptive, edge-guided interpolation method to improve the color quality of the images in practice. Numerical simulations show that selecting a suitable interpolation technique in MRF-based images yields better quality than the classical interpolators.
http://arxiv.org/abs/1512.08475
Inverse problems in imaging are extensively studied, with a variety of strategies, tools, and theory that have accumulated over the years. Recently, this field has been immensely influenced by the emergence of deep-learning techniques. One such contribution, which is the focus of this paper, is the Deep Image Prior (DIP) work by Ulyanov, Vedaldi, and Lempitsky (2018). DIP offers a new approach to the regularization of inverse problems, obtained by forcing the recovered image to be synthesized from a given deep architecture. While DIP has been shown to be effective, its results fall short when compared to state-of-the-art alternatives. In this work, we aim to boost DIP by adding an explicit prior, which enriches the overall regularization effect and leads to better-recovered images. More specifically, we propose to bring in the concept of Regularization by Denoising (RED), which leverages existing denoisers for regularizing inverse problems. Our work shows how the two (DeepRED) can be merged into a highly effective recovery process while avoiding the need to differentiate the chosen denoiser, leading to very effective results on several tested inverse problems.
http://arxiv.org/abs/1903.10176
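At the heart of RED is the prior rho(x) = 0.5 x^T (x - f(x)) for a denoiser f, whose gradient, under RED's assumptions, is simply x - f(x). The sketch below is a minimal surrogate that reproduces exactly this gradient under autograd without back-propagating through the denoiser; it illustrates the RED gradient trick only, not the paper's full alternating DeepRED scheme.

```python
import torch

def red_surrogate(x, denoiser, lam=0.5):
    """Surrogate penalty whose autograd gradient is lam * (x - f(x)), the RED
    prior gradient. Detaching f(x) avoids differentiating the denoiser."""
    fx = denoiser(x).detach()
    return lam * (0.5 * (x * x).sum() - (x * fx).sum())
```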
Absolute pose estimation is a fundamental problem in computer vision, and it is a typical parameter estimation problem, meaning that efforts to solve it will always suffer from outlier-contaminated data. Conventionally, for a fixed dimensionality d and number of measurements N, a robust estimation problem cannot be solved faster than O(N^d). Furthermore, it is almost impossible to remove d from the exponent of the runtime of a globally optimal algorithm. However, absolute pose estimation is a geometric parameter estimation problem and thus has special constraints. In this paper, we exploit pairwise constraints and propose a globally optimal algorithm for solving the absolute pose estimation problem. The proposed algorithm has linear complexity in the number of correspondences at a given outlier ratio. Concretely, we first decouple the rotation and translation subproblems by utilizing the pairwise constraints, and then solve the rotation subproblem using a branch-and-bound algorithm. Lastly, we estimate the translation based on the known rotation using another branch-and-bound algorithm. The advantages of our method are demonstrated via thorough testing on both synthetic and real-world data.
http://arxiv.org/abs/1903.10175
With a single eye fixation lasting a fraction of a second, the human visual system is capable of forming a rich representation of a complex environment, reaching a holistic understanding that facilitates object recognition and detection. This phenomenon is known as recognizing the “gist” of the scene and is accomplished by relying on relevant prior knowledge. This paper addresses the analogous question of whether using memory in computer vision systems can not only improve the accuracy of object detection in video streams, but also reduce the computation time. By interleaving conventional feature extractors with extremely lightweight ones which only need to recognize the gist of the scene, we show that minimal computation is required to produce accurate detections when temporal memory is present. In addition, we show that the memory contains enough information for deploying reinforcement learning algorithms to learn an adaptive inference policy. Our model achieves state-of-the-art performance among mobile methods on the ImageNet VID 2015 dataset, while running at speeds of over 70 FPS on a Pixel 3 phone.
http://arxiv.org/abs/1903.10172
We present LOGAN, a deep neural network aimed at learning generic shape transforms from unpaired domains. The network is trained on two sets of shapes, e.g., tables and chairs, but there is neither a pairing between shapes in the two domains to supervise the shape translation nor any point-wise correspondence between any shapes. Once trained, LOGAN takes a shape from one domain and transforms it into the other. Our network consists of an autoencoder to encode shapes from the two input domains into a common latent space, where the latent codes encode multi-scale shape features in an overcomplete manner. The translator is based on a generative adversarial network (GAN), operating in the latent space, where an adversarial loss enforces cross-domain translation while a feature preservation loss ensures that the right shape features are preserved for a natural shape transform. We conduct various ablation studies to validate each of our key network designs and demonstrate superior capabilities in unpaired shape transforms on a variety of examples over baselines and state-of-the-art approaches. We show that our network is able to learn what shape features to preserve during shape transforms, either local or non-local, whether content or style, etc., depending solely on the input domain pairs.
http://arxiv.org/abs/1903.10170
Tracking vehicles in LIDAR point clouds is a challenging task due to the sparsity of the data and the dense search space. The lack of structure in point clouds impedes the use of the convolution and correlation filters usually employed in 2D object tracking. In addition, structuring point clouds is cumbersome and implies losing fine-grained information. As a result, generating proposals in 3D space is expensive and inefficient. In this paper, we leverage the dense and structured Bird's Eye View (BEV) representation of LIDAR point clouds to efficiently search for objects of interest. We use an efficient Region Proposal Network to generate a small number of object proposals in 3D. We then refine our selection of 3D object candidates by exploiting the similarity capability of a 3D Siamese network, which we regularize for shape completion to enhance its discrimination capability. Our method thus addresses both efficient search in the BEV space and meaningful candidate selection in the 3D LIDAR point cloud. We show that Region Proposal in the BEV outperforms Bayesian methods such as Kalman and Particle Filters in providing proposals by a significant margin, and that such candidates are suitable for the 3D Siamese network. By training our method end-to-end, we outperform the previous baseline in vehicle tracking by 12% / 18% in Success and Precision when using only 16 candidates.
http://arxiv.org/abs/1903.10168
Deep neural networks (DNNs) can be easily fooled by adding human-imperceptible perturbations to images. These perturbed images are known as 'adversarial examples' and pose a serious threat to security- and safety-critical systems. A litmus test for the strength of adversarial examples is their transferability across different DNN models in a black-box setting (i.e., when the target model's architecture and parameters are unknown to the attacker). Current attack algorithms that seek to enhance adversarial transferability work at the decision level, i.e., they generate perturbations that alter the network's decisions. This leads to two key limitations: (a) an attack depends on the task-specific loss function (e.g., softmax cross-entropy for object recognition) and therefore does not generalize beyond its original task; (b) the adversarial examples are specific to the network architecture and demonstrate poor transferability to other network architectures. We propose a novel approach to create adversarial examples that can broadly fool different networks on multiple tasks. Our approach is based on the following intuition: "Perceptual metrics based on neural network features are highly generalizable and show excellent performance in measuring and stabilizing input distortions. Therefore, an ideal attack that creates maximum distortions in the network feature space should realize highly transferable examples." We report extensive experiments showing how our adversarial examples generalize across multiple networks for classification, object detection, and segmentation tasks.
http://arxiv.org/abs/1811.09020
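The feature-space intuition above translates into a short PGD-style loop: instead of ascending a task loss, ascend the distance between clean and perturbed intermediate features. Below is a minimal PyTorch sketch under that reading; `feat_fn` (a function from images to an intermediate feature map) and the step sizes are assumptions, not the paper's exact attack.

```python
import torch

def feature_distortion_attack(feat_fn, x, eps=8/255, alpha=2/255, steps=10):
    """Task-agnostic attack sketch: maximize feature-space distortion under
    an L_inf budget eps, independent of any task-specific loss."""
    with torch.no_grad():
        clean = feat_fn(x)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = (feat_fn(x + delta) - clean).pow(2).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # ascend: maximize distortion
            delta.clamp_(-eps, eps)             # stay inside the L_inf ball
            delta.grad.zero_()
    return (x + delta).detach().clamp_(0, 1)
```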
To detect salient objects accurately, existing methods usually design complex backbone network architectures to learn and fuse powerful features. However, the saliency inference module that performs saliency prediction from the fused features receives much less attention in its architecture design and typically adopts only a few fully convolutional layers. In this paper, we find that the limited capacity of the saliency inference module indeed constitutes a fundamental performance bottleneck, and that enhancing its capacity is critical for obtaining better saliency prediction. Correspondingly, we propose a deep yet lightweight saliency inference module that adopts a multi-dilated depth-wise convolution architecture. Such a deep inference module, though architecturally simple, can rapidly reason about salient objects directly from the multi-scale convolutional features and gives superior salient object detection performance at lower computational cost. To the best of our knowledge, we are the first to reveal the importance of the inference module for salient object detection and to present a novel architecture design with attractive efficiency and accuracy. Extensive experimental evaluations demonstrate that our simple framework performs favorably compared with the state-of-the-art methods with complex backbone designs.
http://arxiv.org/abs/1901.08362
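To illustrate the building block named above, here is one plausible PyTorch reading of a "multi-dilated depth-wise" layer: parallel depth-wise 3x3 convolutions at several dilation rates, fused by a 1x1 convolution. The authors' exact module may differ.

```python
import torch
import torch.nn as nn

class MultiDilatedDepthwiseBlock(nn.Module):
    """Depth-wise 3x3 convolutions at several dilation rates in parallel,
    concatenated and fused by a 1x1 convolution (spatial size preserved)."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d,
                      dilation=d, groups=channels, bias=False)  # depth-wise
            for d in dilations
        ])
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```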
Deep unsupervised domain adaptation (UDA) has recently received increasing attention from researchers. However, existing methods are computationally intensive due to the cost of the Convolutional Neural Networks (CNNs) adopted by most work. To date, there is no effective network compression method for accelerating these models. In this paper, we propose a unified Transfer Channel Pruning (TCP) approach for accelerating UDA models. TCP compresses the deep UDA model by pruning less important channels while simultaneously learning transferable features by reducing the cross-domain distribution divergence. It therefore reduces the impact of negative transfer and maintains competitive performance on the target task. To the best of our knowledge, TCP is the first approach that aims at accelerating deep UDA models. TCP is validated on two benchmark datasets, Office-31 and ImageCLEF-DA, with two common backbone networks, VGG16 and ResNet50. Experimental results demonstrate that TCP achieves comparable or better classification accuracy than other comparison methods while significantly reducing the computational cost. More specifically, with VGG16 we obtain even higher accuracy after pruning 26% of floating point operations (FLOPs); with ResNet50, we also obtain higher accuracy on half of the tasks after pruning 12% of FLOPs. We hope that TCP will open a new door for future research on accelerating transfer learning models.
http://arxiv.org/abs/1904.02654
The multi-scale approach has been used in blind image/video deblurring to yield excellent performance for both conventional and recent deep-learning-based state-of-the-art methods. Bicubic down-sampling is a typical choice in multi-scale approaches for reducing the spatial dimension after filtering with a fixed kernel. However, this fixed kernel may be sub-optimal, since it may destroy information important for reliable deblurring, such as strong edges. We propose convolutional neural network (CNN)-based down-scale methods for multi-scale deep-learning-based non-uniform single image deblurring. We argue that our CNN-based down-scaling effectively reduces the spatial dimension of the original image, while the learned kernels with multiple channels preserve the details necessary for deblurring. For each scale, we adopt RCAN (Residual Channel Attention Networks) as a backbone network to further improve performance. Our proposed method yields state-of-the-art performance on the GoPro dataset by a large margin, achieving 2.59dB higher PSNR than the current state-of-the-art method of Tao et al. Our CNN-based down-scaling is the key factor in this performance: without it, our network's performance dropped by 1.98dB. The same networks trained on the GoPro set were also evaluated on the large-scale Su dataset, where our method yielded 1.15dB better PSNR than Tao et al.'s method. Qualitative comparisons on the Lai dataset also confirm the superior performance of our proposed method over other state-of-the-art methods.
http://arxiv.org/abs/1903.10157
Recently developed deep-learning-based denoisers often outperform state-of-the-art conventional denoisers such as BM3D. They are typically trained to minimize the mean squared error (MSE) between the output image of a deep neural network (DNN) and a ground truth image; it is therefore important for deep-learning-based denoisers to have high-quality noiseless ground truth data for high performance. However, it is often challenging or even infeasible to obtain noiseless images in some applications. Here, we propose a method based on Stein's unbiased risk estimator (SURE) for training DNN denoisers using only noisy training images corrupted by Gaussian noise. We demonstrate that our SURE-based method, without the use of ground truth data, is able to train DNN denoisers to performances close to those of networks trained with ground truth, for both grayscale and color images. We also propose a SURE-based refining method that uses a noisy test image for further performance improvement. Our quick refining method outperformed conventional BM3D, deep image prior, and often the networks trained with ground truth. A potential extension of our SURE-based methods to the Poisson noise model was also investigated.
http://arxiv.org/abs/1803.01314
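For a known Gaussian noise level sigma, SURE estimates the MSE to the clean image from the noisy image alone; the divergence term can be approximated with a single Monte-Carlo probe. Below is a minimal PyTorch sketch of this standard estimator (not the paper's full training pipeline).

```python
import torch

def mc_sure_loss(denoiser, y, sigma, eps=1e-3):
    """Monte-Carlo SURE: unbiased estimate of the MSE to the unavailable
    clean image, using only the noisy image y with known Gaussian sigma.
    The divergence of the denoiser is estimated with one random probe b."""
    n = y.numel()
    fy = denoiser(y)
    b = torch.randn_like(y)
    div = (b * (denoiser(y + eps * b) - fy)).sum() / eps  # MC divergence
    return (y - fy).pow(2).sum() / n - sigma**2 + (2 * sigma**2 / n) * div
```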
Recovering 3D human body shape and pose from 2D images is a challenging task due to the high complexity and flexibility of the human body and the relative scarcity of 3D-labeled data. Previous methods addressing these issues typically rely on predicting intermediate results, such as body part segmentation, 2D/3D joints, or silhouette masks, to decompose the problem into multiple sub-tasks that can exploit more 2D labels. Most previous works also incorporate a parametric body shape model and predict its parameters in a low-dimensional space to represent the human body. In this paper, we propose to directly regress the 3D human mesh from a single color image using a Convolutional Neural Network (CNN). We use an efficient representation of 3D human shape and pose that can be predicted through an encoder-decoder neural network. The proposed method achieves state-of-the-art performance on several 3D human body datasets, including Human3.6M, SURREAL, and UP-3D, with even faster running speed.
http://arxiv.org/abs/1903.10153
This paper presents a new deep neural network design for salient object detection by maximizing the integration of local and global image context within, around, and beyond the salient objects. Our key idea is to adaptively propagate and aggregate the image context with variable attenuation over the entire feature maps. To achieve this, we design the spatial attenuation context (SAC) module to recurrently translate and aggregate the context features independently with different attenuation factors and then attentively learn the weights to adaptively integrate the aggregated context features. By further embedding the module to process individual layers in a deep network, namely SAC-Net, we can train the network end-to-end and optimize the context features for detecting salient objects. Compared with 22 state-of-the-art methods, experimental results show that our method performs favorably over all the others on six common benchmark datasets, both quantitatively and visually.
http://arxiv.org/abs/1903.10152
Convolutional neural networks (CNNs) have been shown to achieve state-of-the-art performance in a significant number of computer vision tasks. Although they require large labelled training datasets to learn their models, they have a striking ability to transfer learned representations from large source sets to smaller target sets through normal fine-tuning approaches. Prior research has shown that these techniques boost performance on smaller target sets. In this paper, we demonstrate that growing network depth beyond the classification layer, along with a careful normalization and scaling scheme, boosts fine-tuning by creating harmony between the pre-trained and new layers, adjusting the network more fully to the target task. This indicates that the pre-trained classification layer holds high-level (global) image information that can be propagated through the newly introduced layers during fine-tuning. We evaluate our depth-augmented networks under our designed incremental fine-tuning scheme on several benchmark datasets and show that they outperform contemporary transfer learning approaches. On average, for fine-grained datasets we achieve up to 6.7% (AlexNet) and 5.4% (VGG16) improvement, and for coarse datasets 9.3% (AlexNet) and 8.7% (VGG16) improvement, over normal fine-tuning. In addition, our in-depth analysis shows that freezing highly generic layers encourages better learning of target tasks. Furthermore, we find that the appropriate learning rate for the newly introduced layers of depth-augmented networks depends on the target set and the size of the new layers.
http://arxiv.org/abs/1903.10150
We propose a universal image reconstruction method that represents detailed images purely from binary sparse edges and a flat color domain. Inspired by the procedure of painting, our framework, based on a generative adversarial network, consists of three phases: an Imitation Phase that initializes the networks, followed by a Generating Phase that reconstructs preliminary images. A Refinement Phase is then used to fine-tune the preliminary images into final outputs with details. This framework allows our model to generate abundant high-frequency details from sparse input information. We also explore the defects of implicitly disentangling the style latent space from images, and demonstrate that the explicit color domain in our model performs better in controllability and interpretability. In our experiments, we achieve outstanding results in reconstructing realistic images and translating hand-drawn drafts into satisfactory paintings. Moreover, within the domain of edge-to-image translation, our model PI-REC outperforms existing state-of-the-art methods in evaluations of realism and accuracy, both quantitatively and qualitatively.
http://arxiv.org/abs/1903.10146
Variational autoencoders (VAEs) with an auto-regressive decoder have been applied for many natural language processing (NLP) tasks. The VAE objective consists of two terms, (i) reconstruction and (ii) KL regularization, balanced by a weighting hyper-parameter \beta. One notorious training difficulty is that the KL term tends to vanish. In this paper we study scheduling schemes for \beta, and show that KL vanishing is caused by the lack of good latent codes in training the decoder at the beginning of optimization. To remedy this, we propose a cyclical annealing schedule, which repeats the process of increasing \beta multiple times. This new procedure allows the progressive learning of more meaningful latent codes, by leveraging the informative representations of previous cycles as warm re-starts. The effectiveness of cyclical annealing is validated on a broad range of NLP tasks, including language modeling, dialog response generation and unsupervised language pre-training.
http://arxiv.org/abs/1903.10145
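The schedule itself is a few lines of code. A minimal sketch of a linear-ramp cyclical schedule is below; the cycle count and in-cycle ramp ratio are the tunable knobs, and the default values shown are illustrative rather than those used in the paper.

```python
def cyclical_beta(step, total_steps, n_cycles=4, ratio=0.5):
    """Cyclical KL-weight schedule: within each cycle, beta ramps linearly
    from 0 to 1 over the first `ratio` of the cycle, then stays at 1."""
    cycle_len = total_steps / n_cycles
    pos = (step % cycle_len) / cycle_len  # position within the current cycle
    return min(1.0, pos / ratio)
```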
Facial action unit (AU) detection in the wild is a challenging problem, with current methods either depending on impractical labor-intensive labeling by experts, or on inaccurate pseudo labels. In this paper, we aim to exploit accurate AU labels from a well-constrained source domain for training an AU detector for the target domain of unconstrained in-the-wild images. Instead of attempting to map directly from the source to the target domain, we propose to generate a new feature domain that combines source-domain facial inner landmark features (denoted as shape features) with target-domain global pose and texture features (denoted as texture features), so as to train the target AU detector with source AU labels. Specifically, we first disentangle the rich features learned from images into shape features and texture features by introducing a novel shape adversarial loss and a shape classification loss. After swapping the shape features of unpaired source and target images, the combined source shape features and target texture features are translated to the new domain by learning a mapping that maximizes the AU detector's performance. A further disentanglement and swap is applied to cross-cyclically reconstruct the original rich features. Moreover, our framework can be naturally extended for use with pseudo AU labels. Extensive experiments show that our method soundly outperforms the baseline, the upper-bound methods, and state-of-the-art approaches on the challenging benchmark dataset EmotioNet.
http://arxiv.org/abs/1903.10143
Despite the significant advances in iris segmentation, accomplishing accurate iris segmentation in non-cooperative environment remains a grand challenge. In this paper, we present a deep learning framework, referred to as Iris R-CNN, to offer superior accuracy for iris segmentation. The proposed framework is derived from Mask R-CNN, and several novel techniques are proposed to carefully explore the unique characteristics of iris. First, we propose two novel networks: (i) Double-Circle Region Proposal Network (DC-RPN), and (ii) Double-Circle Classification and Regression Network (DC-CRN) to take into account the iris and pupil circles to maximize the accuracy for iris segmentation. Second, we propose a novel normalization scheme for Regions of Interest (RoIs) to facilitate a radically new pooling operation over a double-circle region. Experimental results on two challenging iris databases, UBIRIS.v2 and MICHE, demonstrate the superior accuracy of the proposed approach over other state-of-the-art methods.
http://arxiv.org/abs/1903.10140
Registration is an important task in automated medical image analysis. Although deep learning (DL)-based image registration methods outperform time-consuming conventional approaches, they are heavily dependent on training data and do not generalize well to new image types. We present a DL-based approach that can register an image pair that differs from the training images. This is achieved by training generative adversarial networks (GANs) in combination with segmentation information and transfer learning. Experiments on chest X-ray and brain MR images show that our method gives better registration performance than conventional methods.
http://arxiv.org/abs/1903.10139
When labeled training data is scarce, a promising data augmentation approach is to generate visual features of unknown classes using their attributes. To learn the class conditional distribution of CNN features, these models rely on pairs of image features and class attributes. Hence, they cannot make use of the abundance of unlabeled data samples. In this paper, we tackle any-shot learning problems, i.e. zero-shot and few-shot, in a unified feature-generating framework that operates in both inductive and transductive learning settings. We develop a conditional generative model that combines the strengths of VAEs and GANs and, in addition, via an unconditional discriminator, learns the marginal feature distribution of unlabeled images. We empirically show that our model learns highly discriminative CNN features for five datasets, i.e. CUB, SUN, AWA and ImageNet, and establish a new state-of-the-art in any-shot learning, i.e. inductive and transductive (generalized) zero- and few-shot learning settings. We also demonstrate that our learned features are interpretable: we visualize them by inverting them back to the pixel space, and we explain them by generating textual arguments for why they are associated with a certain label.
http://arxiv.org/abs/1903.10132
We propose a novel architecture for the problem of video super-resolution. We integrate spatial and temporal contexts from continuous video frames using a recurrent encoder-decoder module that fuses multi-frame information with the more traditional, single-frame super-resolution path for the target frame. In contrast to most prior work, where frames are pooled together by stacking or warping, our model, the Recurrent Back-Projection Network (RBPN), treats each context frame as a separate source of information. These sources are combined in an iterative refinement framework inspired by the idea of back-projection in multiple-image super-resolution. This is aided by explicitly representing the estimated inter-frame motion with respect to the target frame, rather than explicitly aligning frames. We propose a new video super-resolution benchmark, allowing evaluation at a larger scale and considering videos in different motion regimes. Experimental results demonstrate that our RBPN is superior to existing methods on several datasets.
http://arxiv.org/abs/1903.10128
Knowledge Bases (KBs) require constant updating to reflect changes in the world they represent. For general-purpose KBs, this is often done through Relation Extraction (RE), the task of predicting KB relations expressed in text mentioning entities known to the KB. One way to improve RE is to use KB Embeddings (KBE) for link prediction. However, despite clear connections between RE and KBE, little has been done toward properly unifying these models systematically. We help close the gap with a framework that unifies the learning of RE and KBE models, leading to significant improvements over the state-of-the-art in RE. The code is available at https://github.com/billy-inn/HRERE.
http://arxiv.org/abs/1903.10126
Generating long and semantically coherent reports to describe medical images poses great challenges towards bridging visual and linguistic modalities, incorporating medical domain knowledge, and generating realistic and accurate descriptions. We propose a novel Knowledge-driven Encode, Retrieve, Paraphrase (KERP) approach which reconciles traditional knowledge- and retrieval-based methods with modern learning-based methods for accurate and robust medical report generation. Specifically, KERP decomposes medical report generation into explicit medical abnormality graph learning and subsequent natural language modeling. KERP first employs an Encode module that transforms visual features into a structured abnormality graph by incorporating prior medical knowledge; then a Retrieve module that retrieves text templates based on the detected abnormalities; and lastly, a Paraphrase module that rewrites the templates according to specific cases. The core of KERP is a proposed generic implementation unit, the Graph Transformer (GTR), which dynamically transforms high-level semantics between graph-structured data of multiple domains such as knowledge graphs, images and sequences. Experiments show that the proposed approach generates structured and robust reports supported by accurate abnormality descriptions and explainable attentive regions, achieving state-of-the-art results on two medical report benchmarks, with the best medical abnormality and disease classification accuracy and improved human evaluation performance.
http://arxiv.org/abs/1903.10122
So far, research on generating captions from images has proceeded from the viewpoint that a caption should hold sufficient information about its image. If it is possible to generate an image close to the input image from a generated caption, i.e., if the caption contains sufficient information to reproduce the image, then the caption can be considered faithful to the image. To make such regeneration possible, learning with a cycle-consistency loss is effective. In this study, we propose a method of generating captions by learning end-to-end mutual transformations between images and texts. To evaluate our method, we perform comparative experiments with and without the cycle consistency. The results are evaluated by automatic measures and crowdsourcing, demonstrating that our proposed method is effective.
http://arxiv.org/abs/1903.10118
Peer review plays a critical role in the scientific writing and publication ecosystem. To assess the efficiency and efficacy of the reviewing process, one essential element is to understand and evaluate the reviews themselves. In this work, we study the content and structure of peer reviews under the argument mining framework, automatically detecting (1) the argumentative propositions put forward by reviewers, and (2) their types (e.g., evaluating the work or making suggestions for improvement). We first collect 14.2K reviews from major machine learning and natural language processing venues. Of these, 400 reviews are annotated with 10,386 propositions and the corresponding types: Evaluation, Request, Fact, Reference, or Quote. We then train state-of-the-art proposition segmentation and classification models on the data to evaluate their utility and identify new challenges in this domain, motivating future directions for argument mining. Further experiments show that proposition usage varies across venues in amount, type, and topic.
http://arxiv.org/abs/1903.10104
Within the realm of service robotics, researchers have placed a great amount of effort into learning, understanding, and representing motions as manipulations for task execution by robots. The task of robot learning and problem-solving is very broad, as it integrates a variety of tasks such as object detection, activity recognition, task/motion planning, localization, knowledge representation and retrieval, and the intertwining of perception/vision and machine learning techniques. In this paper, we focus solely on knowledge representations, notably on how knowledge has typically been gathered, represented, and reproduced to solve problems by researchers over the past decades. In accordance with the definition of knowledge representations, we discuss the key distinction between such representations and the useful learning models that have been extensively introduced and studied in recent years, such as machine learning, deep learning, probabilistic modelling, and semantic graphical structures. Along with an overview of such tools, we discuss the problems that have existed in robot learning and the solutions, technologies, and developments (if any) that have contributed to solving them. Finally, we discuss key principles that should be considered when designing an effective knowledge representation.
http://arxiv.org/abs/1807.02192
$\textit{Magic: The Gathering}$ is a popular and famously complicated trading card game about magical combat. In this paper we show that optimal play in real-world $\textit{Magic}$ is at least as hard as the Halting Problem, solving a problem that has been open for a decade. To do this, we present a methodology for embedding an arbitrary Turing machine into a game of $\textit{Magic}$ such that the first player is guaranteed to win the game if and only if the Turing machine halts. Our result applies to how real $\textit{Magic}$ is played, can be achieved using standard-size tournament-legal decks, and does not rely on stochasticity or hidden information. Our result is also highly unusual in that all moves of both players are forced in the construction. This shows that even recognising who will win a game in which neither player has a non-trivial decision to make for the rest of the game is undecidable. We conclude with a discussion of the implications for a unified computational theory of games and remarks about the playability of such a board in a tournament setting.
http://arxiv.org/abs/1904.09828
In this paper, we propose a residual non-local attention network for high-quality image restoration. Previous methods, which do not consider the uneven distribution of information in corrupted images, are restricted by local convolutional operations and the equal treatment of spatial- and channel-wise features. To address this issue, we design local and non-local attention blocks to extract features that capture the long-range dependencies between pixels and pay more attention to the challenging parts. Specifically, we design a trunk branch and a (non-)local mask branch in each (non-)local attention block. The trunk branch is used to extract hierarchical features. The local and non-local mask branches aim to adaptively rescale these hierarchical features with mixed attentions. The local mask branch concentrates on more local structures with convolutional operations, while non-local attention considers more about long-range dependencies in the whole feature map. Furthermore, we propose residual local and non-local attention learning to train the very deep network, which further enhances the representation ability of the network. Our proposed method can be generalized for various image restoration applications, such as image denoising, demosaicing, compression artifact reduction, and super-resolution. Experiments demonstrate that our method obtains comparable or better results compared with recently leading methods, both quantitatively and visually.
http://arxiv.org/abs/1903.10082
In this paper, we propose a novel representation for text documents based on aggregating word embedding vectors into document embeddings. Our approach is inspired by the Vector of Locally-Aggregated Descriptors used for image representation, and it works as follows. First, the word embeddings gathered from a collection of documents are clustered by k-means in order to learn a codebook of semantically related word embeddings. Each word embedding is then associated with its nearest cluster centroid (codeword). The Vector of Locally-Aggregated Word Embeddings (VLAWE) representation of a document is then computed by accumulating the differences between each codeword vector and the word vectors (from the document) associated with that codeword. We plug the VLAWE representation, which is learned in an unsupervised manner, into a classifier and show that it is useful for a diverse set of text classification tasks. We compare our approach with a broad range of recent state-of-the-art methods, demonstrating its effectiveness. Furthermore, we obtain a considerable improvement on the Movie Review dataset, reporting an accuracy of 93.3%, which represents an absolute gain of 10% over the state-of-the-art approach.
http://arxiv.org/abs/1902.08850
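The aggregation step described above can be written compactly with scikit-learn's k-means. Below is a minimal sketch, assuming each document is already given as an array of its word embedding vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlawe(documents, k=10):
    """VLAWE: cluster all word vectors into a k-codeword codebook, then
    represent each document by the residuals (word vector minus its nearest
    centroid) accumulated per cluster and concatenated.
    `documents` is a list of (n_words_i, dim) arrays."""
    all_words = np.vstack(documents)
    km = KMeans(n_clusters=k, n_init=10).fit(all_words)
    dim = all_words.shape[1]
    reps = []
    for doc in documents:
        rep = np.zeros((k, dim))
        for w, c in zip(doc, km.predict(doc)):
            rep[c] += w - km.cluster_centers_[c]  # residual to the codeword
        reps.append(rep.ravel())
    return np.array(reps)
```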
This paper proposes an intuitive human-swarm interaction framework inspired by childhood memories of interacting with living ants by changing their positions and environments, as if we were omnipotent relative to the ants. Analogously, in virtual reality we can be a super-powered virtual giant who supervises a swarm of mobile robots in a vast, remote environment by flying over or resizing the world, and who coordinates the robots by picking up and placing a robot or creating virtual walls. This work implements the idea using Virtual Reality together with Leap Motion, and validates it in proof-of-concept experiments using real and virtual mobile robots in mixed reality. We conduct a usability analysis to quantify the effectiveness of the overall system as well as of the individual interfaces proposed in this work. The results reveal that the proposed method is intuitive and feasible for interaction with swarm robots, but may require appropriate training for the new end-user interface device.
http://arxiv.org/abs/1903.10064
Approximate nearest neighbour (ANN) search is one of the most important problems in computer science fields such as data mining and computer vision. In this paper, we focus on ANN for high-dimensional binary vectors and propose a simple yet powerful search method that uses Random Binary Search Trees (RBST). We apply our method to a dataset of 1.25M binary local feature descriptors obtained from a real-life image-based localisation system provided by Google as part of Project Tango. An extensive evaluation of our method against state-of-the-art variations of Locality Sensitive Hashing (LSH), namely Uniform LSH and Multi-probe LSH, shows the superiority of our method in terms of retrieval precision, with a performance boost of over 20%.
http://arxiv.org/abs/1708.02976
There is a strong need for automated systems to improve diagnostic quality and reduce analysis time in histopathology image processing. Automated detection and classification of pathological tissue characteristics with computer-aided diagnostic systems are a critical step in the early diagnosis and treatment of diseases. Once a pathology image is scanned by a microscope and loaded onto a computer, it can be used for automated detection and classification of diseases. In this study, pre-trained DenseNet-161 and ResNet-50 CNN models were used to classify digital histopathology patches into the corresponding whole slide images via a transfer learning technique. The pre-trained models were tested on grayscale and color histopathology images. The DenseNet-161 model achieved a classification accuracy of 97.89% on grayscale images, and the ResNet-50 model achieved an accuracy of 98.87% on color images. The proposed pre-trained models outperform state-of-the-art methods in all performance metrics when classifying digital pathology patches into 24 categories.
http://arxiv.org/abs/1903.10035
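The transfer learning setup described above is a few lines in torchvision; the sketch below loads an ImageNet-pretrained DenseNet-161 and swaps its classifier head for the 24 histopathology categories. The training details (optimizer, augmentation, which layers are frozen) are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained DenseNet-161 and replace its classifier head
# with a new linear layer for the 24 histopathology categories.
model = models.densenet161(pretrained=True)
model.classifier = nn.Linear(model.classifier.in_features, 24)

# One common option: freeze the feature extractor and train only the head.
for p in model.features.parameters():
    p.requires_grad = False
```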