Reinforcement learning (RL) agents optimize only the features specified in a reward function and are indifferent to anything left out inadvertently. This means that we must specify not only what to do, but also the much larger space of what not to do. It is easy to forget these preferences, since they are already satisfied in our environment. This motivates our key insight: when a robot is deployed in an environment that humans act in, the state of the environment is already optimized for what humans want. We can therefore use this implicit preference information from the state to fill in the blanks. We develop an algorithm based on Maximum Causal Entropy IRL and use it to evaluate the idea in a suite of proof-of-concept environments designed to show its properties. We find that information from the initial state can be used to infer both side effects that should be avoided and preferences for how the environment should be organized. Our code can be found at https://github.com/HumanCompatibleAI/rlsp.
http://arxiv.org/abs/1902.04198
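As a toy illustration of the idea (a hedged sketch, not the authors' RLSP algorithm: the chain MDP, the Boltzmann-rationality model, and all constants below are our own assumptions), one can score candidate reward functions by how well a noisily rational human optimizing them would explain the observed current state:

```python
# Score candidate rewards by P(observed state | human optimized that reward).
import numpy as np

N_STATES, HORIZON, BETA = 5, 10, 5.0
ACTIONS = [-1, 0, 1]  # move left, stay, move right

def soft_value_iteration(reward):
    """Time-indexed Boltzmann-rational policies for a per-state reward vector."""
    V = np.zeros(N_STATES)
    policies = []
    for _ in range(HORIZON):
        Q = np.zeros((N_STATES, len(ACTIONS)))
        for s in range(N_STATES):
            for a, d in enumerate(ACTIONS):
                s2 = min(max(s + d, 0), N_STATES - 1)
                Q[s, a] = reward[s2] + V[s2]
        pi = np.exp(BETA * (Q - Q.max(axis=1, keepdims=True)))
        pi /= pi.sum(axis=1, keepdims=True)
        V = (pi * Q).sum(axis=1)
        policies.append(pi)
    return policies  # policies[-1] is the policy used at the first timestep

def prob_of_observed_state(reward, observed):
    """P(an agent optimizing `reward` ends in `observed`) from a uniform start."""
    p = np.full(N_STATES, 1.0 / N_STATES)
    for pi in reversed(soft_value_iteration(reward)):
        p_next = np.zeros(N_STATES)
        for s in range(N_STATES):
            for a, d in enumerate(ACTIONS):
                s2 = min(max(s + d, 0), N_STATES - 1)
                p_next[s2] += p[s] * pi[s, a]
        p = p_next
    return p[observed]

observed_state = 3  # e.g. "the vase is intact on shelf 3"
candidate_rewards = [np.eye(N_STATES)[g] for g in range(N_STATES)]
scores = [prob_of_observed_state(r, observed_state) for r in candidate_rewards]
print("inferred human goal state:", int(np.argmax(scores)))  # expected: 3
```

Here the observed state alone, with no demonstrations, singles out the reward the human most plausibly optimized.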
We study the problem of interpreting trained classification models in the setting of linguistic data sets. Leveraging a parse tree, we propose to assign least-squares based importance scores to each word of an instance by exploiting syntactic constituency structure. We establish an axiomatic characterization of these importance scores by relating them to the Banzhaf value in coalitional game theory. Based on these importance scores, we develop a principled method for detecting and quantifying interactions between words in a sentence. We demonstrate that the proposed method can aid in interpretability and diagnostics for several widely-used language models.
http://arxiv.org/abs/1902.04187
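For intuition, the Banzhaf value referenced above assigns player i the average of its marginal contributions v(S ∪ {i}) − v(S) over all coalitions S of the remaining players. A hedged, brute-force sketch (the characteristic function v below is a made-up stand-in; a real use would query a trained model on the sentence restricted to S):

```python
# Exact Banzhaf values for a small word-coalition game.
from itertools import combinations

def banzhaf_values(players, v):
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                total += v(set(S) | {i}) - v(set(S))  # marginal contribution
        values[i] = total / 2 ** (n - 1)              # average over coalitions
    return values

# Hypothetical example: words of "not a good movie"; v scores sentiment of the
# kept words (the interaction term makes "not" flip the effect of "good").
words = ["not", "a", "good", "movie"]
def v(S):
    score = 1.0 if "good" in S else 0.0
    if {"not", "good"} <= S:
        score -= 2.0
    return score

print(banzhaf_values(words, v))
```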
Dictionary learning (DL) and dimensionality reduction (DR) are powerful tools to analyze high-dimensional noisy signals. This paper proposes a novel Riemannian joint dimensionality reduction and dictionary learning (R-JDRDL) on symmetric positive definite (SPD) manifolds for classification tasks. The joint learning considers the interaction between the dimensionality reduction and dictionary learning procedures by connecting them into a unified framework. We exploit a Riemannian optimization framework for solving the DL and DR problems jointly. Finally, we demonstrate that the proposed R-JDRDL outperforms existing state-of-the-art algorithms when used for image classification tasks.
http://arxiv.org/abs/1902.04186
We describe in this paper the theory and practice behind a new modal clustering method for binary data. Our approach (BinNNMS) is based on the nearest neighbor median shift. The median shift is an extension of the well-known mean shift, which was designed for continuous data, to handle binary data. We demonstrate with theoretical and experimental analyses that BinNNMS can accurately discover the locations of clusters in binary data.
http://arxiv.org/abs/1902.04181
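A minimal sketch of what a nearest-neighbor median shift for binary data could look like (our reading of the idea, not the authors' code): each point is repeatedly replaced by the coordinate-wise median, i.e. a majority vote per bit, of its k nearest neighbors under Hamming distance, and points that converge to the same mode form one cluster.

```python
import numpy as np

def bin_nn_median_shift(X, k=5, n_iter=10):
    X = X.copy().astype(np.uint8)
    for _ in range(n_iter):
        # pairwise Hamming distances between all binary points
        D = (X[:, None, :] != X[None, :, :]).sum(-1)
        nn = np.argsort(D, axis=1)[:, :k]                   # k neighbors (incl. self)
        X = (X[nn].mean(axis=1) >= 0.5).astype(np.uint8)    # majority vote per bit
    return X

rng = np.random.default_rng(0)
protos = np.array([[0] * 8, [1] * 8])                        # two ground-truth modes
X = protos[rng.integers(0, 2, 40)]
X = np.logical_xor(X, rng.random(X.shape) < 0.1).astype(np.uint8)  # 10% bit flips
modes = bin_nn_median_shift(X)
labels = np.unique(modes, axis=0, return_inverse=True)[1]
print("clusters found:", len(set(labels)))
```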
In reinforcement learning, a decision needs to be made at some point as to whether it is worthwhile to carry on with the learning process or to terminate it. In many such situations, stochastic elements are often present which govern the occurrence of rewards, with sequential occurrences of positive rewards randomly interleaved with negative rewards. For most practical learners, the learning is considered useful if the number of positive rewards always exceeds the negative ones. A situation that often calls for learning termination is when the number of negative rewards exceeds the number of positive rewards. However, while this seems reasonable, the error of premature termination, whereby learning is terminated and declared a failure even though the positive rewards would eventually far outnumber the negative ones, can be significant. In this paper, using combinatorial analysis we study the probability of wrongly terminating a reinforcement learning activity, which undermines the effectiveness of an optimal policy, and we show that the resultant error can be quite high. While we demonstrate mathematically that such errors can never be eliminated, we propose some practical mechanisms that can effectively reduce them. Simulation experiments have been carried out, the results of which are in close agreement with our theoretical findings.
http://arxiv.org/abs/1902.04179
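The flavor of the result can be checked with a small simulation. Assuming i.i.d. ±1 rewards with P(+1) = p > 1/2 (our simplification, not necessarily the paper's exact setting), the rule "terminate when negative rewards first outnumber positive ones" fires whenever the running sum hits -1, which by the classic gambler's-ruin argument happens with probability (1 - p)/p even though the sum drifts upward forever:

```python
import random

def hits_minus_one(p, max_steps=100_000):
    """True if the ±1 reward sum ever goes negative (premature termination)."""
    s = 0
    for _ in range(max_steps):
        s += 1 if random.random() < p else -1
        if s < 0:
            return True
    return False  # the sum escaped upward: learning would have succeeded

p, trials = 0.6, 20_000
est = sum(hits_minus_one(p) for _ in range(trials)) / trials
print(f"simulated error {est:.3f} vs theoretical {(1 - p) / p:.3f}")  # ~0.667
```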
In reinforcement learning episodes, the rewards and punishments are often non-deterministic, and there are invariably stochastic elements governing the underlying situation. Such stochastic elements are often numerous and cannot be known in advance, and they have a tendency to obscure the underlying rewards and punishments patterns. Indeed, if stochastic elements were absent, the same outcome would occur every time and the learning problems involved could be greatly simplified. In addition, in most practical situations, the cost of an observation to receive either a reward or punishment can be significant, and one would wish to arrive at the correct learning conclusion by incurring minimum cost. In this paper, we present a stochastic approach to reinforcement learning which explicitly models the variability present in the learning environment and the cost of observation. Criteria and rules for learning success are quantitatively analyzed, and probabilities of exceeding the observation cost bounds are also obtained.
http://arxiv.org/abs/1902.04178
In this work, we propose ReStoCNet, a residual stochastic multilayer convolutional Spiking Neural Network (SNN) composed of binary kernels, to reduce the synaptic memory footprint and enhance the computational efficiency of SNNs for complex pattern recognition tasks. ReStoCNet consists of an input layer followed by stacked convolutional layers for hierarchical input feature extraction, pooling layers for dimensionality reduction, and a fully-connected layer for inference. In addition, we introduce residual connections between the stacked convolutional layers to improve the hierarchical feature learning capability of deep SNNs. We propose a Spike Timing Dependent Plasticity (STDP) based probabilistic learning algorithm, referred to as Hybrid-STDP (HB-STDP), incorporating Hebbian and anti-Hebbian learning mechanisms, to train the binary kernels forming ReStoCNet in a layer-wise unsupervised manner. We demonstrate the efficacy of ReStoCNet and the presented HB-STDP based unsupervised training methodology on the MNIST and CIFAR-10 datasets. We show that residual connections enable the deeper convolutional layers to self-learn useful high-level input features and mitigate the accuracy loss observed in deep SNNs devoid of residual connections. The proposed ReStoCNet offers >20x kernel memory compression compared to a full-precision (32-bit) SNN while yielding sufficiently high classification accuracy on the chosen pattern recognition tasks.
http://arxiv.org/abs/1902.04161
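A loose sketch of a probabilistic STDP update for binary weights, in the spirit of the HB-STDP rule described above (the exact update here is our own invention, for intuition only): when a post-neuron spikes, each afferent binary weight is switched to +1 with a probability that decays with the time since the pre-synaptic spike (Hebbian), and to -1 otherwise (anti-Hebbian).

```python
import numpy as np

rng = np.random.default_rng(0)
n_pre, tau, t_now = 16, 20.0, 100.0
W = rng.choice([-1, 1], size=n_pre)                    # binary kernel weights
last_pre_spike = t_now - rng.exponential(tau, n_pre)   # pre-spike timestamps

def hbstdp_update(W, last_pre_spike, t_post):
    dt = t_post - last_pre_spike          # pre-before-post lag per synapse
    p_pot = np.exp(-dt / tau)             # potentiation probability decays with lag
    flips = rng.random(n_pre) < p_pot
    return np.where(flips, 1, -1)         # +1 Hebbian, -1 anti-Hebbian

W = hbstdp_update(W, last_pre_spike, t_post=t_now)
print("fraction of +1 weights after a post-spike:", (W == 1).mean())
```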
In this paper, we propose a novel three-stage multimodal deep hashing neural decoder (MDHND) architecture, which integrates a deep hashing framework with a neural network decoder (NND) to create an effective multibiometric authentication system. The MDHND consists of two separate modules: a multimodal deep hashing (MDH) module, which is used for feature-level fusion and binarization of multiple biometrics, and a neural network decoder (NND) module, which is used to refine the intermediate binary codes generated by the MDH and compensate for the differences between enrollment and probe biometrics (variations in pose, illumination, etc.). Use of the NND helps to improve the performance of the overall multimodal authentication system. The MDHND framework is trained in 3 stages using joint optimization of the two modules. In Stage 1, the MDH parameters are trained to generate a shared multimodal latent code; in Stage 2, the latent codes from Stage 1 are passed through a conventional error-correcting code (ECC) decoder to generate the ground truth for training the neural network decoder (NND); in Stage 3, the NND is trained using the ground truth from Stage 2 and the MDH and NND are jointly optimized. Experimental results on a standard multimodal dataset demonstrate the superiority of our method relative to other current multimodal authentication systems. Furthermore, the proposed system can work in both identification and authentication modes.
http://arxiv.org/abs/1902.04149
Age-Related Macular Degeneration (AMD) is an asymptomatic retinal disease which may result in loss of vision. There is limited access to high-quality relevant retinal images and poor understanding of the features defining sub-classes of this disease. Motivated by recent advances in machine learning, we specifically explore the potential of generative modeling, using Generative Adversarial Networks (GANs) and style transfer, to facilitate clinical diagnosis and disease understanding through feature extraction. We design an analytic pipeline which first generates synthetic retinal images from clinical images, followed by a verification step. In the synthesis step we merge GANs (DCGAN and WGAN architectures) and style transfer for image generation, whereas the verification step checks the accuracy of the generated images. We find that the generated images contain sufficient pathological details to facilitate ophthalmologists’ disease classification and the discovery of disease-relevant features. In particular, our system predicts the drusen and geographic atrophy sub-classes of AMD. Furthermore, classification using the GAN-generated CFP images outperforms classification based on using only the original clinical dataset. Our results are evaluated using an existing classifier of retinal diseases and class activation maps, supporting the predictive power of the synthetic images and their utility for feature extraction. Our code examples are available online.
http://arxiv.org/abs/1902.04147
Max-C and min-D projection autoassociative fuzzy morphological memories (max-C and min-D PAFMMs) are two-layer feedforward fuzzy morphological neural networks able to implement an associative memory designed for the storage and retrieval of finite fuzzy sets or vectors on a hypercube. In this paper we address the main features of these autoassociative memories, which include unlimited absolute storage capacity, fast retrieval of stored items, few spurious memories, and an excellent tolerance to either dilative or erosive noise. Particular attention is given to the so-called PAFMM of Zadeh which, besides performing no floating-point operations, exhibits the largest noise tolerance among max-C and min-D PAFMMs. Computational experiments reveal that Zadeh’s max-C PAFMM, combined with a noise masking strategy, yields a fast and robust classifier with strong potential for face recognition.
http://arxiv.org/abs/1902.04144
With the benefits of fast query speed and low storage cost, hashing-based image retrieval approaches have garnered considerable attention from the research community. In this paper, we propose a novel Error-Corrected Deep Cross Modal Hashing (CMH-ECC) method which uses a bitmap specifying the presence of certain facial attributes as an input query to retrieve relevant face images from the database. In this architecture, we generate compact hash codes using an end-to-end deep learning module, which effectively captures the inherent relationships between the face and attribute modalities. We also integrate our deep learning module with forward error correction codes to further reduce the distance between different modalities of the same subject. Specifically, the properties of deep hashing and forward error correction codes are exploited to design a cross modal hashing framework with high retrieval performance. Experimental results using two standard datasets with facial attribute-image modalities indicate that our CMH-ECC face image retrieval model outperforms most of the current attribute-based face image retrieval approaches.
http://arxiv.org/abs/1902.04139
Objective: Accurately classifying the malignancy of lesions detected in a screening scan is critical for reducing false positives. Radiomics holds great potential to differentiate malignant from benign tumors by extracting and analyzing a large number of quantitative image features. Since not all radiomic features contribute to an effective classification model, selecting an optimal feature subset is critical. Methods: This work proposes a new multi-objective based feature selection (MO-FS) algorithm that considers sensitivity and specificity simultaneously as the objective functions during feature selection. For MO-FS, we developed a modified entropy based termination criterion (METC) that stops the algorithm automatically rather than relying on a preset number of generations. We also designed a solution selection methodology for multi-objective learning that uses the evidential reasoning approach (SMOLER) to automatically select the optimal solution from the Pareto-optimal set. Furthermore, we developed an adaptive mutation operation to generate the mutation probability in MO-FS automatically. Results: We evaluated MO-FS for classifying lung nodule malignancy in low-dose CT and breast lesion malignancy in digital breast tomosynthesis. Conclusion: The experimental results demonstrated that the feature set selected by MO-FS achieved better classification performance than features selected by other commonly used methods. Significance: The proposed method is a general and more effective radiomic feature selection strategy.
http://arxiv.org/abs/1807.03236
Yelp has been one of the most popular local service search engines in the US since 2004. It is powered by crowd-sourced text reviews and photo reviews. Restaurant customers and business owners upload photo images to Yelp, reviewing or advertising food, drinks, or indoor and outdoor decorations. Relying on human editors to label food photos is clearly ineffective, an issue that should be addressed by innovative machine learning approaches. In this paper, we present a simple but effective approach which can identify up to ten kinds of food from raw photos in the challenge dataset. We use 1) image pre-processing techniques, including filtering and image augmentation, 2) feature extraction via convolutional neural networks (CNN), and 3) three classification algorithms. We then report the classification accuracy obtained by tuning the parameters for augmentation, the CNN, and classification.
http://arxiv.org/abs/1902.05413
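A hedged sketch of the CNN-features-plus-classifier part of such a pipeline (the ResNet-18 backbone choice and the train_images/train_labels names are our assumptions, not the paper's setup):

```python
import torch, torchvision
from sklearn.linear_model import LogisticRegression

weights = torchvision.models.ResNet18_Weights.DEFAULT
backbone = torchvision.models.resnet18(weights=weights)
backbone.fc = torch.nn.Identity()        # drop the head, keep 512-d features
backbone.eval()
preprocess = weights.transforms()        # matching resize/crop/normalize

@torch.no_grad()
def featurize(images):                   # images: a list of PIL.Image objects
    batch = torch.stack([preprocess(im) for im in images])
    return backbone(batch).numpy()

# train_images/train_labels and test_images/test_labels would come from the
# challenge dataset (hypothetical variable names):
# X = featurize(train_images)
# clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
# print("accuracy:", clf.score(featurize(test_images), test_labels))
```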
Compared with the enormous research achievements targeting better image classification models, the effort applied to object detector training is dwarfed in terms of popularity and universality. Due to significantly more complex network structures and optimization targets, various training strategies and pipelines are designed specifically for certain detection algorithms and no others. In this work, we explore universal tweaks that help boost the performance of state-of-the-art object detection models to a new level without sacrificing inference speed. Our experiments indicate that these freebies can yield up to a 5% absolute precision increase, and everyone should consider applying them to object detection training to a certain degree.
http://arxiv.org/abs/1902.04103
Medical image segmentation is a primary task in many applications, and segmentation accuracy is a necessity. Recently, many deep learning networks derived from U-Net have been extensively used and have achieved notable results. To further improve and refine the performance of U-Net, parallel decoders along with a mask prediction decoder have been explored and have shown significant improvement with additional advantages. In our work, we exploit the advantages of using a combination of contour and distance maps as regularizers. In turn, we propose a novel architecture, Psi-Net, with a single encoder and three parallel decoders: one decoder to learn the mask and the other two to learn the auxiliary tasks of contour detection and distance map estimation. Learning these auxiliary tasks helps in capturing the shape and boundary. We also propose a new joint loss function for the proposed architecture, consisting of a weighted combination of negative log-likelihood and mean squared error losses. We have used two publicly available datasets: 1) the Origa dataset for the task of optic cup and disc segmentation and 2) the Endovis segment dataset for the task of polyp segmentation to evaluate our model. Extensive experiments show that our model gives better results in terms of segmentation, boundary, and shape metrics.
http://arxiv.org/abs/1902.04099
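A minimal PyTorch sketch of a joint loss of this kind: cross-entropy (negative log-likelihood) terms for the mask and contour heads plus a mean-squared-error term for the distance-map head. The weights and shapes below are placeholders, not the paper's values.

```python
import torch
import torch.nn.functional as F

def joint_loss(mask_logits, contour_logits, dist_pred,
               mask_gt, contour_gt, dist_gt,
               w_mask=1.0, w_contour=1.0, w_dist=1.0):
    l_mask = F.cross_entropy(mask_logits, mask_gt)          # segmentation NLL
    l_contour = F.cross_entropy(contour_logits, contour_gt) # boundary NLL
    l_dist = F.mse_loss(dist_pred, dist_gt)                 # distance-map MSE
    return w_mask * l_mask + w_contour * l_contour + w_dist * l_dist

# shapes: logits (B, C, H, W); integer targets (B, H, W); distance maps (B, 1, H, W)
B, C, H, W = 2, 2, 64, 64
loss = joint_loss(torch.randn(B, C, H, W), torch.randn(B, C, H, W),
                  torch.rand(B, 1, H, W),
                  torch.randint(0, C, (B, H, W)), torch.randint(0, C, (B, H, W)),
                  torch.rand(B, 1, H, W))
print(loss.item())
```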
We show that BERT (Devlin et al., 2018) is a Markov random field language model. Formulating BERT in this way gives rise to a natural procedure for sampling sentences from BERT. We sample sentences from BERT and find that it can produce high-quality, fluent generations. Compared to the generations of a traditional left-to-right language model, BERT generates sentences that are more diverse but of slightly worse quality.
http://arxiv.org/abs/1902.04094
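A hedged sketch of what sampling sentences from BERT via its masked-LM head can look like (a Gibbs-style resampling loop; the sequence length, sweep count, and sampling details are our choices, not necessarily the paper's exact procedure):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

seq_len, sweeps = 8, 5
ids = torch.full((1, seq_len + 2), tok.mask_token_id, dtype=torch.long)
ids[0, 0], ids[0, -1] = tok.cls_token_id, tok.sep_token_id

with torch.no_grad():
    for _ in range(sweeps):                     # repeated left-to-right sweeps
        for pos in range(1, seq_len + 1):
            masked = ids.clone()
            masked[0, pos] = tok.mask_token_id  # re-mask the position to resample
            probs = model(masked).logits[0, pos].softmax(-1)
            ids[0, pos] = torch.multinomial(probs, 1).item()

print(tok.decode(ids[0, 1:-1]))
```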
We present a novel event recognition approach called Spatially-preserved Doubly-injected Object Detection CNN (S-DOD-CNN), which incorporates spatially preserved object detection information in both a direct and an indirect way. Indirect injection is carried out by simply sharing the weights between the object detection modules and the event recognition module. Meanwhile, our novelty lies in the fact that we preserve the spatial information for the direct injection. Once multiple regions-of-interest (RoIs) are acquired, their feature maps are computed and then projected onto a spatially-preserving combined feature map using one of the four RoI Projection approaches we present. In our architecture, the combined feature maps generated for object detection are directly injected into the event recognition module. Our method provides state-of-the-art accuracy for malicious event recognition.
http://arxiv.org/abs/1902.04051
In recent years, deep learning has brought about a breakthrough in medical image segmentation. U-Net is the most prominent deep network in this regard and has been the most popular architecture in the medical imaging community. Despite its outstanding overall performance in segmenting multimodal medical images, through extensive experimentation on challenging datasets we found that the classical U-Net architecture is lacking in certain respects. We therefore propose some modifications to improve upon the already state-of-the-art U-Net model, and following these modifications we develop a novel architecture, MultiResUNet, as a potential successor to the successful U-Net. We have compared MultiResUNet with the classical U-Net on a vast repertoire of multimodal medical images. Although the improvements are slight for ideal images, a remarkable gain in performance is attained for challenging images. We have evaluated our model on five different datasets, each with its own unique challenges, and obtained relative improvements in performance of 10.15%, 5.07%, 2.63%, 1.41%, and 0.62% respectively.
http://arxiv.org/abs/1902.04049
In this paper, we present a novel single shot face-related task analysis method, called Face-SSD, for detecting faces and for performing various face-related (classification/regression) tasks including smile recognition, face attribute prediction and valence-arousal estimation in the wild. Face-SSD uses a Fully Convolutional Neural Network (FCNN) to detect multiple faces of different sizes and recognise/regress one or more face-related classes. Face-SSD has two parallel branches that share the same low-level filters, one branch dealing with face detection and the other with face analysis tasks. The outputs of both branches are spatially aligned heatmaps that are produced in parallel - therefore Face-SSD does not require that face detection, facial region extraction, size normalisation, and facial region processing be performed in subsequent steps. Our contributions are threefold: 1) Face-SSD is the first network to perform face analysis without relying on pre-processing such as face detection and registration in advance - Face-SSD is a simple, single FCNN architecture simultaneously performing face detection and face-related task analysis, tasks that are conventionally treated as separate consecutive steps; 2) Face-SSD is a generalised architecture that is applicable to various face analysis tasks without modifying the network structure - this is in contrast to designing task-specific architectures; and 3) Face-SSD achieves real-time performance (21 FPS) even when detecting multiple faces and recognising multiple classes in a given image. Experimental results show that Face-SSD achieves state-of-the-art performance in various face analysis tasks, reaching a recognition accuracy of 95.76% for smile detection, 90.29% for attribute prediction, and Root Mean Square (RMS) errors of 0.44 and 0.39 for valence and arousal estimation.
http://arxiv.org/abs/1902.04042
The authenticity of images posted on social media is an issue of growing concern. Many algorithms have been developed to detect manipulated images, but few have investigated the ability of deep neural network based approaches to verify the authenticity of image labels, such as event names. In this paper, we propose several novel methods to predict whether an image was captured at one of several noteworthy events. We use a set of images from several recorded events such as storms, marathons, protests, and other large public gatherings. Two strategies for applying a pre-trained ImageNet network to event verification are presented, with two modifications for each strategy. The first method uses the features from the last convolutional layer of a pre-trained network as input to a classifier. We also consider the effect of tuning the convolutional weights of the pre-trained network to improve classification. The second method combines many features extracted at smaller scales and uses the output of a pre-trained network as the input to a second classifier. For both methods, we investigated several different classifiers and tested many different pre-trained networks. Our experiments demonstrate that both approaches are effective for event verification and image re-purposing detection. Classification at the global scale tends to marginally outperform our tested local methods, and fine-tuning the network further improves the results.
http://arxiv.org/abs/1902.04038
This paper presents an application of specification-based runtime verification techniques to control mobile robots in a reactive manner. In our case study, we develop a layered control architecture where runtime monitors constructed from formal specifications are embedded into the navigation stack. We use temporal logic and regular expressions to describe safety requirements and mission specifications, respectively. An immediate benefit of our approach is that it lifts the simple requirements and objectives of traditional control applications to more complex specifications in a non-intrusive and compositional way. Finally, we demonstrate a simulation of robots controlled by the proposed architecture and discuss further extensions of our approach.
http://arxiv.org/abs/1902.04024
The idea of equivariance to symmetry transformations provides one of the first theoretically grounded principles for neural network architecture design. Equivariant networks have shown excellent performance and data efficiency on vision and medical imaging problems that exhibit symmetries. Here we show how this principle can be extended beyond global symmetries to local gauge transformations, thereby enabling the development of equivariant convolutional networks on general manifolds. We implement gauge equivariant CNNs for signals defined on the icosahedron, which provides a reasonable approximation of spherical signals. By choosing to work with this very regular manifold, we are able to implement the gauge equivariant convolution using a single conv2d call, making it a highly scalable and practical alternative to Spherical CNNs. We evaluate the Icosahedral CNN on omnidirectional image segmentation and climate pattern segmentation, and find that it outperforms previous methods.
http://arxiv.org/abs/1902.04615
In this paper, we aim to monitor the flow of people in large public infrastructures. We propose an unsupervised methodology to cluster people flow patterns into the most typical and meaningful configurations. By processing 3D images from a network of depth cameras, we build a descriptor for the flow pattern. We define a data-irregularity measure that assesses how well each descriptor fits a data model. This allows us to rank flow patterns from highly distinctive (outliers) to very common ones. By discarding outliers, we obtain more reliable key configurations (classes). Synthetic experiments show that the proposed method is superior to standard clustering methods. We applied it in an operational scenario during 14 days in the X-ray screening area of an international airport. Results show that our methodology is able to successfully summarize the representative patterns for such a long observation period, providing relevant information for airport management. Beyond regular flows, our method identifies a set of rare events corresponding to uncommon activities (cleaning, special security and circulating staff).
http://arxiv.org/abs/1705.00027
In this paper we propose a new methodology for decision-making under uncertainty using recent advancements in the areas of nonlinear stochastic optimal control theory, applied mathematics and machine learning. Our work is grounded on the nonlinear Feynman-Kac lemma and the fundamental connection between backward nonlinear partial differential equations and forward-backward stochastic differential equations. Using these connections and results from our prior work on importance sampling for forward-backward stochastic differential equations, we develop a control framework that is scalable and applicable to general classes of stochastic systems and decision-making problem formulations in robotics and autonomy. Two architectures for stochastic control are proposed that consist of feed-forward and recurrent neural networks. The performance and scalability of these algorithms are investigated in two stochastic optimal control problem formulations, the unconstrained L2 case and the control-constrained case, and on three systems in simulation. We conclude with a discussion of the implications of the proposed algorithms for robotics and autonomous systems.
http://arxiv.org/abs/1902.03986
Generative Adversarial Networks (GANs) are one of the most popular tools for learning complex high dimensional distributions. However, generalization properties of GANs have not been well understood. In this paper, we analyze the generalization of GANs in practical settings. We show that discriminators trained on discrete datasets with the original GAN loss have poor generalization capability and do not approximate the theoretically optimal discriminator. We propose a zero-centered gradient penalty for improving the generalization of the discriminator by pushing it toward the optimal discriminator. The penalty guarantees the generalization and convergence of GANs. Experiments on synthetic and large scale datasets verify our theoretical analysis.
http://arxiv.org/abs/1902.03984
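A hedged PyTorch sketch of a zero-centered gradient penalty of this kind: push the norm of the discriminator's gradient toward zero at real samples. Where exactly the penalty is evaluated and the weight gamma are placeholders, not necessarily the paper's choices.

```python
import torch

def zero_centered_gp(discriminator, x_real, gamma=10.0):
    """Penalize ||grad_x D(x)||^2 at real samples (zero-centered penalty)."""
    x = x_real.clone().requires_grad_(True)
    d_out = discriminator(x)
    grad, = torch.autograd.grad(d_out.sum(), x, create_graph=True)
    return gamma * grad.flatten(1).pow(2).sum(1).mean()

# usage inside the discriminator step (bce, D, G, z are assumed to exist):
# d_loss = bce(D(x_real), ones) + bce(D(G(z)), zeros) + zero_centered_gp(D, x_real)
```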
This paper presents design principles for comfort-centered wearable robots and their application in a lightweight and backdrivable knee exoskeleton. The mitigation of discomfort is treated as a mechanical design and control problem, and three solutions are proposed in this paper: 1) a new wearable structure optimizes the strap attachment configuration and suit layout to ameliorate the excessive shear forces of conventional wearable structure designs; 2) rolling knee joint and double-hinge mechanisms reduce misalignment in the sagittal and frontal planes, without increasing the mechanical complexity and inertia, respectively; 3) a low-impedance mechanical transmission reduces the reflected inertia and damping of the actuator as felt by the human, making the exoskeleton highly backdrivable. Kinematic simulations demonstrate that misalignment between the robot joint and knee joint can be reduced by 74% at maximum knee flexion. In experiments, the exoskeleton in the unpowered mode exhibits a low resistive torque of 1.03 Nm root mean square (RMS). The torque control experiments demonstrate a 0.31 Nm RMS torque tracking error across three human subjects.
http://arxiv.org/abs/1902.03966
Filtering images of more than one channel is challenging in terms of both efficiency and effectiveness. By grouping similar patches to exploit the self-similarity and sparse linear approximation of natural images, recent nonlocal and transform-domain methods have been widely used in color and multispectral image (MSI) denoising. Many related methods focus on modeling group-level correlation to enhance sparsity, which often resorts to a recursive strategy with a large number of similar patches. The importance of the patch-level representation has been understated. In this paper, we investigate the influence and potential of representation at the patch level by considering a general formulation with a block diagonal matrix. We further show that by training a proper global patch basis, along with a local principal component analysis transform in the grouping dimension, a simple transform-threshold-inverse method can produce very competitive results. A fast implementation is also developed to reduce computational complexity. Extensive experiments on both simulated and real datasets demonstrate its robustness, effectiveness and efficiency.
http://arxiv.org/abs/1902.03954
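A hedged single-channel sketch of the transform-threshold-inverse idea (the patch grouping and the grouping-dimension PCA of the full method are omitted; all constants are ours): learn a global patch basis by PCA, hard-threshold the patch coefficients, invert, and average the overlapping patches.

```python
import numpy as np

def denoise(img, patch=8, thresh=0.5):
    H, W = img.shape
    # extract all overlapping patches as rows
    P = np.stack([img[i:i + patch, j:j + patch].ravel()
                  for i in range(H - patch + 1)
                  for j in range(W - patch + 1)])
    mean = P.mean(0)
    U = np.linalg.svd(P - mean, full_matrices=False)[2]   # global patch basis
    coef = (P - mean) @ U.T
    coef[np.abs(coef) < thresh] = 0                       # hard threshold
    P_hat = coef @ U + mean                               # inverse transform
    # aggregate patches back with uniform averaging
    out, cnt = np.zeros_like(img, float), np.zeros_like(img, float)
    k = 0
    for i in range(H - patch + 1):
        for j in range(W - patch + 1):
            out[i:i + patch, j:j + patch] += P_hat[k].reshape(patch, patch)
            cnt[i:i + patch, j:j + patch] += 1
            k += 1
    return out / cnt

rng = np.random.default_rng(0)
clean = np.outer(np.sin(np.linspace(0, 3, 32)), np.cos(np.linspace(0, 3, 32)))
noisy = clean + 0.1 * rng.standard_normal(clean.shape)
print("noisy MSE:", ((noisy - clean) ** 2).mean(),
      "denoised MSE:", ((denoise(noisy) - clean) ** 2).mean())
```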
Unpaired multimodal image-to-image translation is the task of translating a given image in a source domain into diverse images in the target domain, overcoming the limitation of one-to-one mapping. Existing multimodal translation models are mainly based on disentangled representations with an image reconstruction loss. We propose two approaches to improve multimodal translation quality. First, we use a content representation from the source domain conditioned on a style representation from the target domain. Second, rather than using a typical image reconstruction loss, we design MILO (Mutual Information LOss), a new stochastically-defined loss function based on information theory. This loss function directly reflects the interpretation of latent variables as random variables. We show that our proposed model, Mutual Information with StOchastic Style Representation (MISO), achieves state-of-the-art performance through extensive experiments on various real-world datasets.
http://arxiv.org/abs/1902.03938
The posteriors over neural network weights are high dimensional and multimodal. Each mode typically characterizes a meaningfully different representation of the data. We develop Cyclical Stochastic Gradient MCMC (SG-MCMC) to automatically explore such distributions. In particular, we propose a cyclical stepsize schedule, where larger steps discover new modes, and smaller steps characterize each mode. We prove that our proposed learning rate schedule provides faster convergence to samples from a stationary distribution than SG-MCMC with standard decaying schedules. Moreover, we provide extensive experimental results to demonstrate the effectiveness of cyclical SG-MCMC in learning complex multimodal distributions, especially for fully Bayesian inference with modern deep neural networks.
http://arxiv.org/abs/1902.03932
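A sketch of a cyclical stepsize schedule of the kind proposed above, paired with a plain SGLD update on a toy Gaussian posterior (the cosine form restarted each cycle matches the description; all constants are illustrative): large steps early in a cycle explore for new modes, small steps late in a cycle characterize the current mode.

```python
import math, torch

def cyclical_stepsize(k, K, M, alpha0):
    """Step k of K total, M cycles: cosine decay restarted at each cycle."""
    cycle_len = math.ceil(K / M)
    return alpha0 / 2 * (math.cos(math.pi * ((k - 1) % cycle_len) / cycle_len) + 1)

def sgld_step(params, grad_log_post, lr):
    """Langevin update: half-step along the gradient plus Gaussian noise."""
    for p, g in zip(params, grad_log_post):
        p.add_(lr / 2 * g + math.sqrt(lr) * torch.randn_like(p))

K, M, alpha0 = 1000, 4, 0.1
theta = [torch.zeros(2)]
for k in range(1, K + 1):
    grad = [-t for t in theta]                 # grad log-density of N(0, I)
    sgld_step(theta, grad, cyclical_stepsize(k, K, M, alpha0))
print("final sample:", theta[0])
```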
Time-frequency (TF) representations provide powerful and intuitive features for the analysis of time series such as audio. Still, generative modeling of audio in the TF domain is a subtle matter. Consequently, neural audio synthesis widely relies on directly modeling the waveform, and previous attempts at unconditionally synthesizing audio from neurally generated TF features still struggle to produce audio of satisfying quality. In this contribution, focusing on the short-time Fourier transform, we discuss the challenges that arise in audio synthesis based on generated TF features and how to overcome them. We demonstrate the potential of deliberate generative TF modeling by training a generative adversarial network (GAN) on short-time Fourier features. We show that our TF-based network is able to outperform a state-of-the-art GAN that generates waveforms directly, despite the similar architectures of the two networks.
http://arxiv.org/abs/1902.04072
The translation of pronouns presents a special challenge to machine translation to this day, since it often requires context outside the current sentence. Recent work on models that have access to information across sentence boundaries has seen only moderate improvements in terms of automatic evaluation metrics such as BLEU. However, metrics that quantify the overall translation quality are ill-equipped to measure gains from additional context. We argue that a different kind of evaluation is needed to assess how well models translate inter-sentential phenomena such as pronouns. This paper therefore presents a test suite of contrastive translations focused specifically on the translation of pronouns. Furthermore, we perform experiments with several context-aware models. We show that, while gains in BLEU are moderate for those systems, they outperform baselines by a large margin in terms of accuracy on our contrastive test set. Our experiments also show the effectiveness of parameter tying for multi-encoder architectures.
http://arxiv.org/abs/1810.02268
This paper uses robust statistics and the curvelet transform to learn a general-purpose no-reference (NR) image quality assessment (IQA) model. The new approach, here called M1, competes with the Curvelet Quality Assessment proposed in 2014 (Curvelet2014). The central idea is to use descriptors based on robust statistics to extract features and predict the human opinion of degraded images. To show the consistency of the method, the model is tested on 3 different datasets: LIVE IQA, TID2013 and CSIQ. For evaluation, the Wilcoxon test is used to verify the statistical significance of the results and enable an accurate comparison between the new model M1 and Curvelet2014. The results show a gain when robust statistics are used as descriptors.
http://arxiv.org/abs/1902.03842
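For concreteness, a paired Wilcoxon signed-rank comparison of two IQA models' per-image errors can be run with scipy; the arrays below are dummies standing in for real per-image results:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
err_m1 = np.abs(rng.normal(0.08, 0.03, 100))            # |prediction - human opinion|
err_curvelet2014 = np.abs(rng.normal(0.10, 0.03, 100))  # same images, other model

stat, p = wilcoxon(err_m1, err_curvelet2014)             # paired signed-rank test
print(f"Wilcoxon statistic={stat:.1f}, p={p:.4f}")
if p < 0.05:
    print("the difference between the two models is statistically significant")
```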
Most attention-based image captioning models attend to the image once per word. However, attending once per word is rigid and can easily miss information. Attending more times can adjust the attention position, recover the missing information, and avoid generating a wrong word. In this paper, we show that attending more times per word can yield improvements in the image captioning task without increasing the number of parameters. We propose a flexible two-LSTM merge model that makes it convenient to encode more attentions than words. Our captioning model uses two LSTMs to encode the word sequence and the attention sequence respectively. The information of the two LSTMs and the image feature are combined to predict the next word. Experiments on the MSCOCO caption dataset show that our method outperforms the state-of-the-art. Using bottom-up features and the self-critical training method, our method achieves BLEU-4, METEOR, ROUGE-L, CIDEr and SPICE scores of 0.381, 0.283, 0.580, 1.261 and 0.220 on the Karpathy test split.
https://arxiv.org/abs/1812.03283
In this paper we target the class of modal clustering methods, where clusters are defined in terms of the local modes of the probability density function which generates the data. The most well-known modal clustering method is k-means clustering. Mean Shift clustering is a generalization of k-means which computes arbitrarily shaped clusters, defined as the basins of attraction of the local modes reached by density gradient ascent paths. Despite its potential, the Mean Shift approach is a computationally expensive method for unsupervised learning. Thus, we introduce two contributions aimed at providing clustering algorithms with linear time complexity, as opposed to the quadratic time complexity of exact Mean Shift clustering. First, we propose a scalable procedure to approximate the density gradient ascent. Second, we present a scalable cluster labeling technique. Both propositions are based on Locality Sensitive Hashing (LSH) to approximate nearest neighbors, and both may be used for moderately sized datasets. Furthermore, we show that using our proposed approximation of the density gradient ascent as a pre-processing step in other clustering methods can also improve dedicated classification metrics. For the latter, a distributed implementation written for the Spark/Scala ecosystem is proposed. For all the considered clustering methods, we present experimental results illustrating their labeling accuracy and their potential to solve concrete problems.
http://arxiv.org/abs/1902.03833
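A hedged sketch of the LSH ingredient: with random-hyperplane hashing, the candidate neighbors for one Mean Shift gradient-ascent step are only the points in the same hash bucket, so each iteration avoids the quadratic all-pairs distance computation (this illustrates the principle only, not the paper's Spark/Scala implementation):

```python
import numpy as np
from collections import defaultdict

def lsh_buckets(X, n_planes=8, seed=0):
    """Hash each point by the signs of its projections onto random hyperplanes."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_planes, X.shape[1]))
    bits = (X @ planes.T > 0).astype(int)
    keys = bits @ (1 << np.arange(n_planes))       # pack sign bits into an int key
    buckets = defaultdict(list)
    for i, key in enumerate(keys):
        buckets[int(key)].append(i)
    return keys, buckets

def approx_mean_shift_step(X):
    """One gradient-ascent step using only same-bucket points as neighbors."""
    keys, buckets = lsh_buckets(X)
    shifted = X.copy()
    for i in range(len(X)):
        shifted[i] = X[buckets[int(keys[i])]].mean(axis=0)
    return shifted

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (100, 5)), rng.normal(3, 0.3, (100, 5))])
for _ in range(5):
    X = approx_mean_shift_step(X)
print("first cluster collapsed near:", np.round(X[0], 2))
print("second cluster collapsed near:", np.round(X[-1], 2))
```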
Intrinsic Image Decomposition (IID) is a challenging and interesting computer vision problem with various applications in several fields. We present novel semantic priors and an integrated approach for single-image IID that involves analyzing the image at three hierarchical context levels. Local context priors capture scene properties at each pixel within a small neighbourhood. Mid-level context priors encode object-level semantics. Global context priors establish correspondences at the scene level. Our semantic priors are designed on both fixed and flexible regions, using the selective search method and Convolutional Neural Network features. Our IID method is an iterative multistage optimization scheme and consists of two complementary formulations: $L_2$ smoothing for shading and $L_1$ sparsity for reflectance. Experiments and analysis of our method indicate the utility of our semantic priors and structured hierarchical analysis in an IID framework. We compare our method with other contemporary IID solutions and show results with fewer artifacts. Finally, we highlight that proper choice and encoding of prior knowledge can produce competitive results even when compared to end-to-end deep learning IID methods, signifying the importance of such priors. We believe that the insights and techniques presented in this paper will be useful for future IID research.
http://arxiv.org/abs/1902.03830
Supervised deep learning methods for segmentation require large amounts of labelled training data, without which they are prone to overfitting and do not generalize well to unseen images. In practice, obtaining a large number of annotations from clinical experts is expensive and time-consuming. One way to address the scarcity of annotated examples is data augmentation using random spatial and intensity transformations. Recently, it has been proposed to use generative models to synthesize realistic training examples, complementing the random augmentation. So far, these methods have yielded limited gains over random augmentation. However, there is potential to improve the approach by (i) explicitly modeling deformation fields (non-affine spatial transformations) and intensity transformations and (ii) leveraging unlabelled data during the generative process. With this motivation, we propose a novel task-driven data augmentation method in which, to synthesize new training examples, a generative network explicitly models and applies deformation fields and additive intensity masks on existing labelled data, modeling shape and intensity variations, respectively. Crucially, the generative model is optimized to be conducive to the task, in this case segmentation, and constrained to match the distribution of images observed from labelled and unlabelled samples. Furthermore, explicit modeling of deformation fields allows synthesizing segmentation masks and images in exact correspondence by simply applying the generated transformation to an input image and the corresponding annotation. Our experiments on cardiac magnetic resonance images (MRI) show that, for the task of segmentation in small training data scenarios, the proposed method substantially outperforms conventional augmentation techniques.
http://arxiv.org/abs/1902.05396
In the era of social media and big data, the use of visual evidence to document conflict and human rights abuse has become an important element for human rights organizations and advocates. In this paper, we address the task of detecting two types of human rights abuses in challenging, everyday photos: (1) child labour, and (2) displaced populations. We propose a novel model that is driven by a human-centric approach. Our hypothesis is that the emotional state of a person – how positive or pleasant an emotion is, and the person’s level of control over the situation – provides powerful cues for perceiving potential human rights violations. To exploit these cues, our model learns to predict global emotional traits over a given image based on the joint analysis of every detected person and the whole scene. By integrating these predictions with a data-driven convolutional neural network (CNN) classifier, our system efficiently infers potential human rights abuses in a clean, end-to-end system we call GET-AID (from Global Emotional Traits for Abuse IDentification). Extensive experiments are performed to verify our method on the recently introduced subset of the Human Rights Archive (HRA) dataset (2 violation categories with the same number of positive and negative samples), where we show quantitatively compelling results. Compared with previous works and the sole use of a CNN classifier, this paper improves coverage by up to 23.73% for child labour and 57.21% for displaced populations. Our dataset, codes and trained models are available online at https://github.com/GKalliatakis/GET-AID.
http://arxiv.org/abs/1902.03817
In this report, we provide additional and corrected results for the paper “Extended YouTube Faces: a Dataset for Heterogeneous Open-Set Face Identification”. After further investigations, we discovered and corrected wrongly labeled images and incorrect identities. This forced us to re-generate the evaluation protocol for the new data; in doing so, we also reproduced and extended the experimental results with other standard metrics and measures used in the literature. The reader can refer to the original paper for additional details regarding the data collection procedure and recognition pipeline.
http://arxiv.org/abs/1902.03804
Despite the recent success of per-frame monocular dense depth estimation for rigid scenes using deep learning methods, such methods fail to achieve similar success on complex dynamic scenes, such as MPI Sintel \cite{butler2012naturalistic}. Moreover, conventional geometric methods that address this problem using a piece-wise rigid scene model require a reliable estimate of the motion parameters for each local model, which is difficult to obtain and validate. In this work, we show that, given per-pixel optical flow correspondences between two consecutive frames and a sparse depth prior for the reference frame, we can recover the dense depth map for the successive frames without solving for motion parameters. By assigning a locally rigid structure to a piece-wise planar approximation of the dynamic scene, which transforms as rigidly as possible across frames, we demonstrate that the motion estimation step can be bypassed. In essence, our formulation provides a new way to think about and recover the dense depth map of a complex dynamic scene; it is recursive, incremental and motion-free in nature, and can therefore be integrated with modern neural network frameworks for large-scale depth-estimation applications. Our method makes no prior assumptions about the rigidity of the dynamic scene and is consequently applicable to a wide range of scenarios. Experimental results show that our method can effectively provide depth for the successive/multiple frames of a dynamic scene without using any motion parameters.
http://arxiv.org/abs/1902.03791
The primate visual system has inspired the development of deep artificial neural networks, which have revolutionized the computer vision domain. Yet these networks are much less energy-efficient than their biological counterparts, and they are typically trained with backpropagation, which is extremely data-hungry. To address these limitations, we used a deep convolutional spiking neural network (DCSNN) and a latency-coding scheme. We trained it using a combination of spike-timing-dependent plasticity (STDP) for the lower layers and reward-modulated STDP (R-STDP) for the higher ones. In short, with R-STDP a correct (resp. incorrect) decision leads to STDP (resp. anti-STDP). This approach led to an accuracy of $97.2\%$ on MNIST, without requiring an external classifier. In addition, we demonstrated that R-STDP extracts features that are diagnostic for the task at hand, and discards the other ones, whereas STDP extracts any feature that repeats. Finally, our approach is biologically plausible, hardware friendly, and energy-efficient.
http://arxiv.org/abs/1804.00227
Image-to-image translation has been widely investigated in recent years. Existing approaches are elaborately designed in an unsupervised manner, and little attention has been paid to the domain information beneath unpaired data. In this work, we treat domain information as explicit supervision and design an unpaired image-to-image translation framework, Domain-supervised GAN (briefly, DosGAN), that takes a first step towards the exploration of domain supervision. Instead of representing domain characteristics with different generators as in CycleGAN\cite{zhu2017unpaired}, or with multiple domain codes as in StarGAN~\cite{choi2017stargan}, we pre-train a classification network to classify the domain of an image. After pre-training, this network is used to extract the domain features of each image from the output of its second-to-last layer. Such features, together with the latent semantic features extracted by another encoder (shared across different domains), are used to generate an image in the target domain. Experiments on multiple hair color translation, multiple identity translation and conditional edges-to-shoes/handbags demonstrate the effectiveness of our method. In addition, we transfer the domain feature extractor obtained on the Facescrub dataset, which has domain supervision information, to the CelebA dataset, which has none, and succeed in achieving conditional translation with any two images in CelebA, a task that previous models like StarGAN cannot handle. Our code is available at \url{https://github.com/linjx-ustc1106/DosGAN-PyTorch}.
http://arxiv.org/abs/1902.03782
Semantic segmentation maps can be used as input to models for maneuvering the controls of a car. However, not all labels may be necessary for making the control decision. One would expect that certain labels, such as road lanes or sidewalks, would be more critical than labels for vegetation or buildings, which may not have a direct influence on the car’s driving decision. In this appendix, we evaluate and quantify how sensitive and important the different semantic labels are for controlling the car. Labels that do not influence the driving decision are remapped to other classes, thereby simplifying the task by reducing the label set to only those critical for driving the vehicle.
http://arxiv.org/abs/1902.03777
In the era of the Internet, recognizing pornographic images is of great significance for protecting children’s physical and mental health. However, this task is very challenging because the key pornographic contents (e.g., breasts and private parts) in an image often lie in local regions of small size. In this paper, we model each image as a bag of regions and follow a multiple instance learning (MIL) approach to train a generic region-based recognition model. Specifically, we take into account a region’s degree of pornography and make three main contributions. First, we show that based on very few annotations of the key pornographic contents in a training image, we can generate a bag of properly sized regions, among which the potential positive regions usually contain useful contexts that can aid recognition. Second, we present a simple quantitative measure of a region’s degree of pornography, which can be used to weigh the importance of different regions in a positive image. Third, we formulate the recognition task as a weighted MIL problem under the convolutional neural network framework, with a bag probability function introduced to combine the importance of different regions. Experiments on our newly collected large-scale dataset demonstrate the effectiveness of the proposed method, achieving a 97.52% true positive rate at a 1% false positive rate, tested on 100K pornographic images and 100K normal images.
http://arxiv.org/abs/1902.03771
Model-free reinforcement learning has recently been shown to successfully learn navigation policies from raw sensor data. In this work, we address the problem of learning driving policies for an autonomous agent in a high-fidelity simulator. Building upon recent research that applies deep reinforcement learning to navigation problems, we present a modular deep reinforcement learning approach to predict the steering angle of the car from raw images. The first module extracts a low-dimensional latent semantic representation of the image. The control module, trained with reinforcement learning, takes the latent vector as input to predict the correct steering angle. The experimental results show that our method is capable of learning to maneuver the car without any human control signals.
http://arxiv.org/abs/1902.03765
In this paper we address the problem of hallucinating high-resolution facial images from unaligned low-resolution inputs at high magnification factors. We approach the problem with convolutional neural networks (CNNs) and propose a novel (deep) face hallucination model that incorporates identity priors into the learning procedure. The model consists of two main parts: i) a cascaded super-resolution network that upscales the low-resolution images, and ii) an ensemble of face recognition models that act as identity priors for the super-resolution network during training. Different from competing super-resolution approaches that typically rely on a single model for upscaling (even with large magnification factors), our network uses a cascade of multiple SR models that progressively upscale the low-resolution images in steps of $2\times$. This characteristic allows us to apply supervision signals (target appearances) at different resolutions and incorporate identity constraints at multiple scales. Our model is able to upscale (very) low-resolution images captured in unconstrained conditions and produce visually convincing results. We rigorously evaluate the proposed model on large datasets of facial images and report superior performance compared to the state-of-the-art.
http://arxiv.org/abs/1805.10938
Event sensors output a stream of asynchronous brightness changes (called "events") at a very high temporal rate. Previous works on recovering the lost intensity information from event sensor data have relied heavily on the event stream, which makes the reconstructed images non-photorealistic and susceptible to noise in the event stream. We propose to reconstruct photorealistic intensity images from a hybrid sensor consisting of a low frame rate conventional camera, which provides the scene texture information, along with the event sensor. To accomplish our task, we warp the low frame rate intensity images to the temporally dense locations of the event data by estimating a spatially dense scene depth and a temporally dense sensor ego-motion. The results obtained from our algorithm are more photorealistic than those of any of the previous state-of-the-art algorithms. We also demonstrate our algorithm’s robustness to abrupt camera motion and noise in the event sensor data.
http://arxiv.org/abs/1805.06140
Many vision and language models suffer from poor visual grounding, often falling back on easy-to-learn language priors rather than associating language with visual concepts. In this work, we propose a generic framework which we call Human Importance-aware Network Tuning (HINT) that effectively leverages human supervision to improve visual grounding. HINT constrains deep networks to be sensitive to the same input regions as humans. Crucially, our approach optimizes the alignment between human attention maps and gradient-based network importances, ensuring that models learn not just to look at, but to rely on, the visual concepts that humans found relevant for a task when making predictions. We demonstrate our approach on Visual Question Answering and Image Captioning tasks, achieving state-of-the-art results on the VQA-CP dataset, which penalizes over-reliance on language priors.
http://arxiv.org/abs/1902.03751
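A hedged sketch of the core alignment mechanism (the real HINT loss and architecture differ; the toy model and shapes below are ours): compute a gradient-based importance for each input region and penalize pairwise ranking disagreements with the human attention scores.

```python
import torch

def grad_importances(model, regions, answer_idx):
    """Sensitivity of the answer score to each input region (gradient x input)."""
    regions = regions.clone().requires_grad_(True)
    score = model(regions)[answer_idx]
    grad, = torch.autograd.grad(score, regions, create_graph=True)
    return (grad * regions).sum(-1)              # one scalar importance per region

def ranking_alignment_loss(net_imp, human_imp):
    """Hinge-penalize region pairs the network ranks opposite to humans."""
    diff_h = human_imp[:, None] - human_imp[None, :]
    diff_n = net_imp[:, None] - net_imp[None, :]
    return torch.relu(-diff_n[diff_h > 0]).mean()

# toy stand-in "model": a linear readout over region features, averaged
torch.manual_seed(0)
readout = torch.nn.Linear(16, 10)
model = lambda regions: readout(regions).mean(dim=0)   # (36, 16) -> (10,)
regions = torch.randn(36, 16)                          # 36 image regions
human_imp = torch.rand(36)                             # human attention per region
loss = ranking_alignment_loss(grad_importances(model, regions, 3), human_imp)
loss.backward()                                        # gradients reach the readout
print(loss.item())
```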
Deciphering human behaviors in videos to predict their future paths/trajectories and activities is important in many applications. Motivated by this idea, this paper studies predicting a pedestrian’s future path jointly with future activities. We propose an end-to-end, multi-task learning system utilizing rich visual features about human behavioral information and interaction with the surroundings. To facilitate training, the network is learned with two auxiliary tasks: predicting future activities and the location in which the activity will happen. Experimental results demonstrate our state-of-the-art performance over two public benchmarks on future trajectory prediction. Moreover, our method is able to produce meaningful future activity predictions in addition to the path. The results provide the first empirical evidence that joint modeling of paths and activities benefits future path prediction.
http://arxiv.org/abs/1902.03748
Bundle adjustment plays a vital role in feature-based monocular SLAM. In many modern SLAM pipelines, bundle adjustment is performed to estimate the 6DOF camera trajectory and 3D map (3D point cloud) from the input feature tracks. However, two fundamental weaknesses plague SLAM systems based on bundle adjustment. First, the need to carefully initialise bundle adjustment means that all variables, in particular the map, must be estimated as accurately as possible and maintained over time, which makes the overall algorithm cumbersome. Second, since estimating the 3D structure (which requires sufficient baseline) is inherent in bundle adjustment, the SLAM algorithm will encounter difficulties during periods of slow motion or pure rotational motion. We propose a different SLAM optimisation core: instead of bundle adjustment, we conduct rotation averaging to incrementally optimise only camera orientations. Given the orientations, we estimate the camera positions and 3D points via a quasi-convex formulation that can be solved efficiently and globally optimally. Our approach not only obviates the need to estimate and maintain the positions and 3D map at keyframe rate (which enables simpler SLAM systems), it is also more capable of handling slow motions or pure rotational motions.
http://arxiv.org/abs/1902.03747
Two major uncertainties, dataset bias and adversarial examples, prevail in state-of-the-art AI algorithms with deep neural networks. In this paper, we present an intuitive explanation for these issues as well as an interpretation of the performance of deep networks in a natural-image space. The explanation consists of two parts: the philosophy of neural networks and a hypothetical model of natural-image spaces. Following the explanation, we 1) demonstrate that the values of training samples differ, 2) provide an incremental boost to the accuracy of a CIFAR-10 classifier by introducing an additional "random-noise" category during training, and 3) alleviate over-fitting, thereby enhancing robustness against adversarial examples, by detecting and excluding illusive training samples that are consistently misclassified. Our overall contribution is therefore twofold. First, while most existing algorithms treat data equally and have a strong appetite for more data, we demonstrate in contrast that an individual datum can sometimes have a disproportionate and counterproductive influence, and that it is not always better to train neural networks with more data. Next, we consider more thoughtful strategies by taking into account the geometric and topological properties of the natural-image spaces to which deep networks are applied.
http://arxiv.org/abs/1801.09097