Modern crowd counting methods usually employ deep neural networks (DNNs) to estimate crowd counts via density regression. Despite their significant improvements, regression-based methods are incapable of providing the detection of individuals in crowds. Detection-based methods, on the other hand, have not been widely explored in recent crowd counting work due to the need for expensive bounding box annotations. In this work, we instead propose a new deep detection network that requires only point supervision. It can simultaneously detect the size and location of human heads and count them in crowds. We first mine useful person size information from point-level annotations and initialize the pseudo ground truth bounding boxes. An online updating scheme is introduced to refine the pseudo ground truth during training, while a locally-constrained regression loss is designed to provide additional constraints on the size of the predicted boxes in a local neighborhood. Finally, we propose a curriculum learning strategy that trains the network first on images with relatively accurate and easy pseudo ground truth. Extensive experiments are conducted on both detection and counting tasks across several standard benchmarks, e.g. the ShanghaiTech, UCF_CC_50, WiderFace, and TRANCOS datasets, and the results show the superiority of our method over the state-of-the-art.
http://arxiv.org/abs/1904.01333
We propose HoloGAN, a novel generative adversarial network (GAN) for the task of unsupervised learning of 3D representations from natural images. Most generative models rely on 2D kernels to generate images and make few assumptions about the 3D world. These models therefore tend to create blurry images or artefacts in tasks that require a strong 3D understanding, such as novel-view synthesis. HoloGAN instead learns a 3D representation of the world and how to render this representation in a realistic manner. Unlike other GANs, HoloGAN provides explicit control over the pose of generated objects through rigid-body transformations of the learnt 3D features. Our experiments show that using explicit 3D features enables HoloGAN to disentangle 3D pose and identity, which is further decomposed into shape and appearance, while still generating images with similar or higher visual quality than other generative models. HoloGAN can be trained end-to-end from unlabelled 2D images only. In particular, we do not require pose labels, 3D shapes, or multiple views of the same objects. This makes HoloGAN the first generative model to learn 3D representations from natural images in an entirely unsupervised manner.
http://arxiv.org/abs/1904.01326
Monocular 3D Human Pose Estimation from static images is a challenging problem, due to the curse of dimensionality and the ill-posed nature of lifting 2D to 3D. In this paper, we propose a Deep Conditional Variational Autoencoder based model that synthesizes diverse 3D pose samples conditioned on the estimated 2D pose. Our experiments reveal that the CVAE generates significantly diverse 3D samples that are consistent with the 2D pose, thereby reducing the ambiguity in lifting from 2D to 3D. We use two strategies for predicting the final 3D pose: (a) using depth-ordering/ordinal relations to score and aggregate the final 3D pose (OrdinalScore), and (b) using supervision from an Oracle. We report close to state-of-the-art results on two benchmark datasets using OrdinalScore, and state-of-the-art results using the Oracle. We also show that our pipeline gives competitive results without paired 3D supervision. We shall make the training and evaluation code available at https://github.com/ssfootball04/generative_pose.
http://arxiv.org/abs/1904.01324
As deep reinforcement learning driven by visual perception becomes more widely used, there is a growing need to better understand and probe the learned agents. Understanding the decision making process and its relationship to visual inputs can be very valuable for identifying problems in learned behavior. However, this topic has been relatively under-explored in the research community. In this work we present a method for synthesizing visual inputs of interest for a trained agent. Such inputs or states could be situations in which specific actions are necessary. Further, critical states in which a very high or a very low reward can be achieved are often interesting for understanding the situational awareness of the system, as they can correspond to risky states. To this end, we learn a generative model over the state space of the environment and use its latent space to optimize a target function for the state of interest. In our experiments we show that this method can generate insights for a variety of environments and reinforcement learning methods. We explore results in the standard Atari benchmark games as well as in an autonomous driving simulator. Based on the efficiency with which we have been able to identify behavioural weaknesses with this technique, we believe this general approach could serve as an important tool for AI safety applications.
http://arxiv.org/abs/1904.01318
With the explosive development of the mobile Internet, short text has been applied extensively. What distinguishes short text classification from long document classification is that short text is short and sparse. Thus, short text classification is challenging owing to its limited semantic information. In this paper, we propose a novel topic-based convolutional neural network (TB-CNN) based on the Latent Dirichlet Allocation (LDA) model and a convolutional neural network. Compared to traditional CNN methods, TB-CNN generates topic words with the LDA model to reduce sparseness and combines the embedding vectors of topic words and input words to extend the feature space of short text. The validation results on the IMDB movie review dataset show the improvement and effectiveness of TB-CNN.
http://arxiv.org/abs/1904.01313
In this paper, we focus on generating realistic images from text descriptions. Current methods first generate an initial image with rough shape and color, and then refine the initial image to a high-resolution one. Most existing text-to-image synthesis methods have two main problems. (1) These methods depend heavily on the quality of the initial images. If the initial image is not well initialized, the following processes can hardly refine the image to a satisfactory quality. (2) Each word contributes a different level of importance when depicting different image contents; however, an unchanged text representation is used in existing image refinement processes. In this paper, we propose the Dynamic Memory Generative Adversarial Network (DM-GAN) to generate high-quality images. The proposed method introduces a dynamic memory module to refine fuzzy image contents when the initial images are not well generated. A memory writing gate is designed to select the important text information based on the initial image content, which enables our method to accurately generate images from the text description. We also utilize a response gate to adaptively fuse the information read from the memories and the image features. We evaluate the DM-GAN model on the Caltech-UCSD Birds 200 dataset and the Microsoft Common Objects in Context dataset. Experimental results demonstrate that our DM-GAN model performs favorably against the state-of-the-art approaches.
http://arxiv.org/abs/1904.01310
We review some practical and philosophical questions raised by the use of machine learning in creative practice. Beyond the obvious problems regarding plagiarism and authorship, we argue that the novelty in AI Art relies mostly on a narrow machine learning contribution: manifold approximation. Nevertheless, this contribution creates a radical shift in the way we have to consider this movement. Is this omnipotent tool a blessing or a curse for the artists?
http://arxiv.org/abs/1904.01957
Unsupervised person re-identification (Re-ID) methods consist of training with a carefully labeled source dataset, followed by generalization to an unlabeled target dataset, i.e. one for which person-identity information is unavailable. Inspired by domain adaptation techniques, these methods avoid a costly, tedious and often unaffordable labeling process. This paper investigates the use of camera-index information, namely which camera captured which image, for unsupervised person Re-ID. More precisely, inspired by adversarial domain adaptation approaches, we develop an adversarial framework in which the output of the feature extractor should be useful for person Re-ID and at the same time should fool a camera discriminator. We refer to the proposed method as camera adversarial transfer (CAT). We evaluate adversarial variants and, alongside, the camera robustness achieved by each variant. We report cross-dataset Re-ID performance and compare the variants of our method with several state-of-the-art methods, thus showing the benefit of exploiting camera-index information within an adversarial framework for unsupervised person Re-ID.
http://arxiv.org/abs/1904.01308
Multilingual (or cross-lingual) embeddings represent several languages in a single vector space. Using a common embedding space enables a shared semantics between words from different languages. In this paper, we propose to embed images and texts into a single distributional vector space, enabling images to be searched with text queries expressing information needs related to the (visual) content of images, as well as with image similarity. Our framework forces the representation of an image to be similar to the representation of the text that describes it. Moreover, by using multilingual embeddings we ensure that words from two different languages have close descriptors and are thus attached to similar images. We provide experimental evidence of the effectiveness of our approach by evaluating it on two datasets: Common Objects in COntext (COCO) [19] and Multi30K [7].
https://arxiv.org/abs/1903.11299
In this paper, we focus on the problem of task allocation, cooperative path planning and motion coordination of large-scale systems with thousands of robots, aiming at practical applications in robotic warehouses and automated logistics systems. In particular, we solve the life-long planning problem and guarantee the coordination performance of a large-scale robot network in the presence of robot motion uncertainties and communication failures. A hierarchical planning and coordination structure is presented. The environment is divided into several sectors, and a dynamic traffic heat-map is generated to describe the current sector-level traffic flow. At the task planning level, a greedy task allocation method is implemented to assign the current task to the nearest free robot, and the sector-level path is generated by comprehensively considering the traveling distance, the traffic heat-value distribution and the current robot/communication failures. At the motion coordination level, a local cooperative A* algorithm is implemented in each sector to generate the collision-free road-level path of each robot in the sector, and a rolling planning structure is introduced to solve problems caused by motion and communication uncertainties. The effectiveness and practical applicability of the proposed approach are validated by large-scale simulations with more than one thousand robots and by real laboratory experiments.
http://arxiv.org/abs/1904.01303
We improve the informativeness of models for conditional text generation using techniques from computational pragmatics. These techniques formulate language production as a game between speakers and listeners, in which a speaker should generate output text that a listener can use to correctly identify the original input that the text describes. While such approaches are widely used in cognitive science and grounded language learning, they have received less attention for more standard language generation tasks. We consider two pragmatic modeling methods for text generation: one where pragmatics is imposed by information preservation, and another where pragmatics is imposed by explicit modeling of distractors. We find that these methods improve the performance of strong existing systems for abstractive summarization and generation from structured meaning representations.
http://arxiv.org/abs/1904.01301
Accurate manipulation of a deformable body such as a piece of fabric is difficult because of its many degrees of freedom and unobservable properties affecting its dynamics. To alleviate these challenges, we propose the application of feedback-based control to robotic fabric strip folding. The feedback is computed from the low dimensional state extracted from a camera image. We trained the controller using reinforcement learning in simulation which was calibrated to cover the real fabric strip behaviors. The proposed feedback-based folding was experimentally compared to two state-of-the-art folding methods and our method outperformed both of them in terms of accuracy.
http://arxiv.org/abs/1904.01298
Recent advances in generative adversarial networks (GANs) have shown great potential in realistic image synthesis, whereas most existing works address synthesis realism in either the appearance space or the geometry space, but few in both. This paper presents an innovative Spatial Fusion GAN (SF-GAN) that combines a geometry synthesizer and an appearance synthesizer to achieve synthesis realism in both geometry and appearance spaces. The geometry synthesizer learns the contextual geometries of background images and transforms and places foreground objects into the background images coherently. The appearance synthesizer adjusts the color, brightness and styles of the foreground objects and embeds them into background images harmoniously, where a guided filter is introduced for detail preservation. The two synthesizers are inter-connected as mutual references and can be trained end-to-end without supervision. The SF-GAN has been evaluated on two tasks: (1) realistic scene text image synthesis for training better recognition models; (2) glass and hat wearing, i.e., realistically matching glasses and hats with real portraits. Qualitative and quantitative comparisons with the state-of-the-art demonstrate the superiority of the proposed SF-GAN.
https://arxiv.org/abs/1812.05840
Inferences regarding “Jane’s arrival in London” from predications such as “Jane is going to London” or “Jane has gone to London” depend on tense and aspect of the predications. Tense determines the temporal location of the predication in the past, present or future of the time of utterance. The aspectual auxiliaries on the other hand specify the internal constituency of the event, i.e. whether the event of “going to London” is completed and whether its consequences hold at that time or not. While tense and aspect are among the most important factors for determining natural language inference, there has been very little work to show whether modern NLP models capture these semantic concepts. In this paper we propose a novel entailment dataset and analyse the ability of a range of recently proposed NLP models to perform inference on temporal predications. We show that the models encode a substantial amount of morphosyntactic information relating to tense and aspect, but fail to model inferences that require reasoning with these semantic properties.
http://arxiv.org/abs/1904.01297
In contrast to traditional cameras, whose pixels have a common exposure time, event-based cameras are novel bio-inspired sensors whose pixels work independently and asynchronously output intensity changes (called “events”), with microsecond resolution. Since events are caused by the apparent motion of objects, event-based cameras sample visual information based on the scene dynamics and are, therefore, a more natural fit than traditional cameras to acquire motion, especially at high speeds, where traditional cameras suffer from motion blur. However, distinguishing between events caused by different moving objects and by the camera’s ego-motion is a challenging task. We present the first per-event segmentation method for splitting a scene into independently moving objects. Our method jointly estimates the event-object associations (i.e., segmentation) and the motion parameters of the objects (or the background) by maximization of an objective function, which builds upon recent results on event-based motion-compensation. We provide a thorough evaluation of our method on a public dataset, outperforming the state-of-the-art by as much as 10%. We also show the first quantitative evaluation of a segmentation algorithm for event cameras, yielding around 90% accuracy at 4 pixels relative displacement.
http://arxiv.org/abs/1904.01293
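The motion-compensation objective referred to in the abstract above is not spelled out there; a common form in the event-camera literature warps events by a candidate motion and maximizes the contrast (variance) of the resulting image of warped events. A minimal NumPy sketch of that idea, where the constant 2D-velocity motion model, the grid search, and all names are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

def warp_events(xs, ys, ts, t_ref, velocity):
    """Warp event coordinates to a reference time under a constant 2D velocity model."""
    dx = velocity[0] * (ts - t_ref)
    dy = velocity[1] * (ts - t_ref)
    return xs - dx, ys - dy

def contrast_of_warped_events(xs, ys, ts, velocity, height, width, t_ref=0.0):
    """Accumulate warped events into an image and return its variance (the 'contrast')."""
    wx, wy = warp_events(xs, ys, ts, t_ref, velocity)
    ix = np.clip(np.round(wx).astype(int), 0, width - 1)
    iy = np.clip(np.round(wy).astype(int), 0, height - 1)
    img = np.zeros((height, width))
    np.add.at(img, (iy, ix), 1.0)   # image of warped events: event count per pixel
    return img.var()

# Toy usage: pick the candidate velocity that maximizes contrast on a small grid.
rng = np.random.default_rng(0)
xs = rng.uniform(0, 64, 1000); ys = rng.uniform(0, 64, 1000); ts = rng.uniform(0, 0.1, 1000)
candidates = [(vx, vy) for vx in (-50, 0, 50) for vy in (-50, 0, 50)]
best = max(candidates, key=lambda v: contrast_of_warped_events(xs, ys, ts, v, 64, 64))
print("best candidate velocity:", best)
```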
Visual grounding, a task to ground (i.e., localize) natural language in images, essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse composition of subject-predicate-object triplets. In this paper, we propose to ground natural language in an intuitive, explainable, and composite fashion, as it should be. In particular, we develop a novel modular network called Neural Module Tree network (NMTree) that regularizes the visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated in a bottom-up direction as needed. NMTree disentangles the visual grounding from the composite reasoning, allowing the former to only focus on primitive and easy-to-generalize patterns. To reduce the impact of parsing errors, we train the modules and their assembly end-to-end by using the Gumbel-Softmax approximation and its straight-through gradient estimator, accounting for the discrete nature of module assembly. Overall, the proposed NMTree consistently outperforms the state-of-the-art on several benchmarks and tasks. Qualitative results show explainable grounding score calculation in great detail.
http://arxiv.org/abs/1812.03299
Current finger knuckle image recognition systems often require users to place their fingers' major or minor joints flat against the capturing sensor. To extend these systems to user non-intrusive application scenarios, such as consumer electronics, forensics, and defence, we suggest matching the full dorsal finger, rather than the major/minor region of interest (ROI) alone. In particular, this paper presents a comprehensive study comparing the full finger with the fusion of finger ROIs for finger knuckle image recognition. These experiments suggest that using the full finger provides a more elegant solution. Addressing the finger matching problem, we propose a CNN (convolutional neural network) which creates a $128$-D feature embedding of an image. It is trained via a triplet loss function, which enforces the L2 distance between the embeddings of the same subject to approach zero, whereas the distance between any two embeddings of different subjects must be at least a margin. For precise training of the network, we use a dynamic adaptive margin, data augmentation, and hard negative mining. In separate experiments, the individual performance of the finger, as well as the weighted-sum score-level fusion of the major knuckle, minor knuckle, and nail modalities, has been computed, justifying our assumption to consider the full finger as a biometric instead of its counterparts. The proposed method is evaluated using two publicly available finger knuckle image datasets, i.e., the PolyU FKP dataset and the PolyU Contactless FKI dataset.
http://arxiv.org/abs/1904.01289
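The triplet objective described in the finger-knuckle abstract above (same-subject L2 distance pushed toward zero, different-subject distance pushed beyond a margin) has a standard form; a minimal PyTorch sketch, where the fixed margin and the stand-in embeddings are illustrative placeholders (the paper itself uses a dynamic adaptive margin, data augmentation, and hard negative mining):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared L2 distances: pull same-subject pairs together,
    push different-subject pairs at least `margin` apart."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # distance to same-subject embedding
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # distance to different-subject embedding
    return F.relu(d_pos - d_neg + margin).mean()

# Embeddings would be L2-normalized 128-D vectors produced by the CNN; random stand-ins here.
emb = lambda x: F.normalize(x, dim=1)
a, p, n = (emb(torch.randn(8, 128)) for _ in range(3))
print(triplet_loss(a, p, n).item())
```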
An autoencoder is a neural network which projects data to and from a lower-dimensional latent space, where this data is easier to understand and model. The autoencoder consists of two sub-networks, the encoder and the decoder, which carry out these transformations. The neural network is trained such that the output is as close to the input as possible, the data having gone through an information bottleneck: the latent space. This tool bears significant resemblance to Principal Component Analysis (PCA), with two main differences. Firstly, the autoencoder is a non-linear transformation, contrary to PCA, which makes the autoencoder more flexible and powerful. Secondly, the axes found by PCA are orthogonal and are ordered in terms of the amount of variability which the data presents along these axes. This makes the interpretability of PCA much greater than that of the autoencoder, which does not have these attributes. Ideally, then, we would like an autoencoder whose latent space consists of independent components, ordered by decreasing importance to the data. In this paper, we propose an algorithm to create such a network. Firstly, we create an iterative algorithm which progressively increases the size of the latent space, learning a new dimension at each step. Secondly, we propose a covariance loss term to add to the standard autoencoder loss function, as well as a normalisation layer just before the latent space, which encourages the latent space components to be statistically independent. We demonstrate the results of this autoencoder on simple geometric shapes, and find that the algorithm indeed finds a meaningful representation in the latent space. This means that subsequent interpolation in the latent space has meaning with respect to the geometric properties of the images.
http://arxiv.org/abs/1904.01277
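The covariance loss term mentioned in the autoencoder abstract above is not given explicitly there; one common way to encourage statistically independent (at least decorrelated) latent components is to penalize the off-diagonal entries of the batch covariance of the latent codes. A hedged PyTorch sketch under that assumption, not necessarily the paper's exact formulation:

```python
import torch

def covariance_loss(z):
    """Penalize off-diagonal entries of the latent covariance matrix,
    encouraging decorrelated latent components."""
    z = z - z.mean(dim=0, keepdim=True)          # center over the batch
    cov = (z.t() @ z) / (z.shape[0] - 1)         # latent_dim x latent_dim covariance
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum()

# Would be added to the usual reconstruction loss, e.g.
# loss = reconstruction_loss + lambda_cov * covariance_loss(latent_codes)
z = torch.randn(32, 8)
print(covariance_loss(z).item())
```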
This paper describes the design and development of the specific software tools used during the creation of the Family Names in Britain and Ireland (FaNBI) research project, started by the University of the West of England in 2010 and finished successfully in 2016. First, an overview of the project and methodology is provided. The next section contains a description of the dictionary management tools and of the software tools used to combine input data resources.
http://arxiv.org/abs/1904.09234
This paper presents a study on discriminative artificial neural network classifiers in the context of open-set speaker identification. Both 2-class and multi-class architectures are tested against the conventional Gaussian mixture model based classifier on enrolled speaker sets of different sizes. The performance evaluation shows that the multi-class neural network system has superior performance for large population sizes.
http://arxiv.org/abs/1904.01269
The absence of large scale datasets with pixel-level supervision is a significant obstacle for the training of deep convolutional networks for scene text segmentation. For this reason, synthetic data generation is normally employed to enlarge the training dataset. Nonetheless, synthetic data cannot reproduce the complexity and variability of natural images. In this paper, a weakly supervised learning approach is used to reduce the shift between training on real and synthetic data. Pixel-level supervisions are generated for a text detection dataset (i.e. one where only bounding-box annotations are available). In particular, the COCO-Text-Segmentation (COCO_TS) dataset, which provides pixel-level supervisions for the COCO-Text dataset, is created and released. The generated annotations are used to train a deep convolutional neural network for semantic segmentation. Experiments show that the proposed dataset can be used instead of synthetic data, allowing us to use only a fraction of the training samples while significantly improving the performance.
https://arxiv.org/abs/1904.00818
Cataracts, which are lenticular opacities that may occur at different lens locations, are the leading cause of visual impairment worldwide. Accurate and timely diagnosis can improve the quality of life of cataract patients. In this paper, a feature extraction-based method for grading cataract severity using retinal images is proposed. To obtain more appropriate features for the automatic grading, the Haar wavelet is improved according to the characteristics of retinal images. Retinal images of non-cataract, as well as mild, moderate, and severe cataracts, are automatically recognized using the improved Haar wavelet. A hierarchical strategy is used to transform the four-class classification problem into three adjacent two-class classification problems. Three sets of two-class classifiers based on a neural network are trained individually and integrated together to establish a complete classification system. The accuracies of the two-class classification (cataract and non-cataract) and four-class classification are 94.83% and 85.98%, respectively. The performance analysis demonstrates that the improved Haar wavelet feature achieves higher accuracy than the original Haar wavelet feature, and the fusion of three sets of two-class classifiers is superior to a simple four-class classifier. The discussion indicates that the retinal image-based method offers significant potential for cataract detection.
http://arxiv.org/abs/1904.01261
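The hierarchical strategy in the cataract-grading abstract above turns four-class grading into three adjacent two-class decisions; one plausible arrangement, assumed here since the abstract does not fix the ordering, first separates cataract from non-cataract, then mild from {moderate, severe}, then moderate from severe:

```python
def grade_cataract(features, clf_cataract, clf_mild, clf_severe):
    """Cascade of three binary classifiers for four severity grades.
    Each clf_* returns True/False for its positive class; the ordering of
    the adjacent splits is an illustrative assumption."""
    if not clf_cataract(features):      # cataract vs. non-cataract
        return "non-cataract"
    if clf_mild(features):              # mild vs. {moderate, severe}
        return "mild"
    if not clf_severe(features):        # moderate vs. severe
        return "moderate"
    return "severe"

# Usage with stand-in classifiers (in the paper these are neural networks
# trained on improved Haar wavelet features).
print(grade_cataract([0.7], lambda f: f[0] > 0.3, lambda f: f[0] < 0.5, lambda f: f[0] > 0.9))
```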
Hashing methods have recently been found very effective for the retrieval of remote sensing (RS) images due to their computational efficiency and fast search speed. Traditional hashing methods in RS usually exploit hand-crafted features to learn hash functions that produce binary codes, which can be insufficient to optimally represent the information content of RS images. To overcome this problem, in this paper we introduce a metric-learning based hashing network, which learns: 1) a semantic-based metric space for effective feature representation; and 2) compact binary hash codes for fast archive search. Our network considers an interplay of multiple loss functions that allows it to jointly learn a metric-based semantic space in which similar images are clustered together, while at the same time producing compact final activations that lose negligible information when binarized. Experiments carried out on two benchmark RS archives show that the proposed network significantly improves retrieval performance under the same retrieval time when compared to state-of-the-art hashing methods in RS.
http://arxiv.org/abs/1904.01258
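The fast archive search described in the hashing abstract above ultimately reduces to thresholding the network's final activations into binary codes and ranking by Hamming distance; a small NumPy sketch of that retrieval step only (the network and the metric-learning losses are not shown, and sign-thresholding is an assumption about the binarization scheme):

```python
import numpy as np

def binarize(activations):
    """Turn real-valued final activations into {0,1} hash codes by sign thresholding."""
    return (activations > 0).astype(np.uint8)

def hamming_rank(query_code, archive_codes):
    """Rank archive images by Hamming distance to the query code."""
    distances = (query_code[None, :] != archive_codes).sum(axis=1)
    return np.argsort(distances), distances

# Toy usage with 32-bit codes for a query and a 5-image archive.
rng = np.random.default_rng(0)
query = binarize(rng.normal(size=32))
archive = binarize(rng.normal(size=(5, 32)))
order, dists = hamming_rank(query, archive)
print(order, dists[order])
```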
In relation extraction for knowledge-based question answering, searching from one entity to another entity via a single relation is called “one hop”. In related work, an exhaustive search from all one-hop relations, two-hop relations, and so on to the max-hop relations in the knowledge graph is necessary but expensive. Therefore, the number of hops is generally restricted to two or three. In this paper, we propose UHop, an unrestricted-hop framework which relaxes this restriction by use of a transition-based search framework to replace the relation-chain-based search one. We conduct experiments on conventional 1- and 2-hop questions as well as lengthy questions, including datasets such as WebQSP, PathQuestion, and Grid World. Results show that the proposed framework enables the ability to halt, works well with state-of-the-art models, achieves competitive performance without exhaustive searches, and opens the performance gap for long relation paths.
http://arxiv.org/abs/1904.01246
In preoperative planning of left atrial appendage closure (LAAC) with CT angiography, the assessment of the appendage orifice plays a crucial role in choosing an appropriate LAAC device size and a proper C-arm angulation. However, accurate orifice detection is laborious because of the high anatomic variation of the appendage, as well as the unclear orifice position and orientation in the available views. We propose an automatic orifice detection approach that performs a search on the principal medial axis of the appendage, where we present an efficient iterative algorithm to grow the axis from the appendage to the left atrium. We propose to use the axis-to-surface distance of the appendage for efficient and effective detection. To localize the necessary initial seed for growing the medial axis, we train an artificial localization agent using an actor-critic reinforcement learning approach, defining the localization as a sequential decision process. The entire detection process takes only about 8 seconds, and the variance of the detected orifice with respect to annotations from two experts is significantly small and less than the inter-observer variance. The proposed orifice search on the medial axis of the appendage, comparing only its distance from the surface, provides a simple yet robust solution for orifice detection. While being the first fully automatic approach and providing a detection error below the inter-observer difference, our method improved detection efficiency eighteen-fold compared to the existing solution, and can therefore be potentially useful for physicians.
http://arxiv.org/abs/1904.01241
Currently, a plethora of saliency models based on deep neural networks has led to great breakthroughs in many complex high-level vision tasks (e.g. scene description, object detection). The robustness of these models, however, has not yet been studied. In this paper, we propose a sparse feature-space adversarial attack method against deep saliency models for the first time. The proposed attack only requires part of the model information, and is able to generate a sparser and more insidious adversarial perturbation, compared to traditional image-space attacks. These adversarial perturbations are so subtle that a human observer cannot notice their presence, yet the model outputs are drastically changed. This phenomenon raises security threats to deep saliency models in practical applications. We also explore some intriguing properties of the feature-space attack, e.g. 1) hidden layers with bigger receptive fields generate sparser perturbations, 2) deeper hidden layers achieve higher attack success rates, and 3) different loss functions and different attacked layers result in diverse perturbations. Experiments indicate that the proposed method is able to successfully attack different model architectures across various image scenes.
http://arxiv.org/abs/1904.01231
Do very high accuracies of deep networks suggest pride of effective AI or are deep networks prejudiced? Do they suffer from in-group biases (own-race-bias and own-age-bias), and mimic human behavior? Is in-group specific information being encoded sub-consciously by the deep networks? This research attempts to answer these questions and presents an in-depth analysis of 'bias' in deep learning based face recognition systems. This is the first work which decodes if and where bias is encoded for face recognition. Taking cues from cognitive studies, we inspect whether deep networks are also affected by social in- and out-group effects. Networks are analyzed for own-race and own-age bias, both of which have been well established in human beings. The sub-conscious behavior of face recognition models is examined to understand if they encode race- or age-specific features for face recognition. Analysis is performed based on 36 experiments conducted on multiple datasets. Four deep learning networks, either trained from scratch or pre-trained on over 10M images, are used. Variations across class activation maps and feature visualizations provide novel insights into the functioning of deep learning systems, suggesting behavior similar to humans. It is our belief that a better understanding of state-of-the-art deep learning networks would enable researchers to address the given challenge of bias in AI, and develop fairer systems.
http://arxiv.org/abs/1904.01219
Synthesizing high quality saliency maps from noisy images is a challenging problem in computer vision and has many practical applications. Samples generated by existing techniques for saliency detection cannot handle noise perturbations smoothly and fail to delineate the salient objects present in the given scene. In this paper, we present a novel end-to-end coupled Denoising-based Saliency Prediction with Generative Adversarial Network (DSAL-GAN) framework to address the problem of salient object detection in noisy images. DSAL-GAN consists of two generative adversarial networks (GANs) trained end-to-end to perform denoising and saliency prediction together in a holistic manner. The first GAN consists of a generator which denoises the noisy input image and a discriminator which checks whether the output is a denoised image or the ground-truth original image. The second GAN predicts the saliency maps from the raw pixels of the denoised input image using a data-driven saliency prediction method with an adversarial loss. A cycle consistency loss is also incorporated to further improve salient region prediction. We demonstrate with comprehensive evaluation that the proposed framework outperforms several baseline saliency models on various performance benchmarks.
http://arxiv.org/abs/1904.01215
Anomaly detection is a classical problem where the aim is to detect anomalous data that do not belong to the normal data distribution. Current state-of-the-art methods for anomaly detection on complex high-dimensional data are based on the generative adversarial network (GAN). However, the traditional GAN loss is not directly aligned with the anomaly detection objective: it encourages the distribution of the generated samples to overlap with the real data and so the resulting discriminator has been found to be ineffective as an anomaly detector. In this paper, we propose simple modifications to the GAN loss such that the generated samples lie at the boundary of the real data distribution. With our modified GAN loss, our anomaly detection method, called Fence GAN (FGAN), directly uses the discriminator score as an anomaly threshold. Our experimental results using the MNIST, CIFAR10 and KDD99 datasets show that Fence GAN yields the best anomaly classification accuracy compared to state-of-the-art methods.
https://arxiv.org/abs/1904.01209
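The abstract above does not spell out the modified GAN loss; one plausible reading of "generated samples lie at the boundary of the real data distribution" is to train the generator so that the discriminator assigns its samples an intermediate score alpha rather than 1, with a dispersion term so the samples spread along that boundary, and then to threshold the discriminator score directly. A heavily hedged PyTorch sketch of that idea, where alpha, beta, and all names are illustrative assumptions rather than the paper's exact formulation:

```python
import torch

def generator_boundary_loss(d_scores_fake, fake_samples, alpha=0.5, beta=0.1):
    """Push generated samples toward discriminator score `alpha` (a boundary score
    rather than 1), plus a dispersion term so they spread out rather than collapse."""
    encirclement = torch.log(torch.abs(d_scores_fake - alpha) + 1e-8).mean()
    center = fake_samples.mean(dim=0, keepdim=True)
    dispersion = 1.0 / ((fake_samples - center).norm(dim=1).mean() + 1e-8)
    return encirclement + beta * dispersion

def anomaly_score(d_score):
    """Lower discriminator score = more anomalous; threshold this directly."""
    return 1.0 - d_score

# Toy usage with a stand-in discriminator on 2D samples.
fake = torch.randn(16, 2, requires_grad=True)
d_scores = torch.sigmoid(fake.sum(dim=1, keepdim=True))
print(generator_boundary_loss(d_scores, fake).item())
```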
Despite rapid developments in visual image-based road detection, robustly identifying road areas in visual images remains challenging due to issues like illumination changes and blurry images. To this end, LiDAR sensor data can be incorporated to improve the visual image-based road detection, because LiDAR data is less susceptible to visual noises. However, the main difficulty in introducing LiDAR information into visual image-based road detection is that LiDAR data and its extracted features do not share the same space with the visual data and visual features. Such gaps in spaces may limit the benefits of LiDAR information for road detection. To overcome this issue, we introduce a novel Progressive LiDAR Adaptation-aided Road Detection (PLARD) approach to adapt LiDAR information into visual image-based road detection and improve detection performance. In PLARD, progressive LiDAR adaptation consists of two subsequent modules: 1) data space adaptation, which transforms the LiDAR data to the visual data space to align with the perspective view by applying altitude difference-based transformation; and 2) feature space adaptation, which adapts LiDAR features to visual features through a cascaded fusion structure. Comprehensive empirical studies on the well-known KITTI road detection benchmark demonstrate that PLARD takes advantage of both the visual and LiDAR information, achieving much more robust road detection even in challenging urban scenes. In particular, PLARD outperforms other state-of-the-art road detection models and is currently top of the publicly accessible benchmark leader-board.
http://arxiv.org/abs/1904.01206
We present ChromAlignNet, a deep learning model for alignment of peaks in Gas Chromatography-Mass Spectrometry (GC-MS) data. GC-MS is regarded as a gold standard in the analysis of chemical composition in samples. However, due to the complexity of the instrument, a substance's retention time (RT) may not stay fixed across multiple GC-MS chromatograms. Using GC-MS data for biomarker discovery requires alignment of an identical analyte's RT across different samples. Current methods of alignment are all based on a set of formal, mathematical rules; consequently, they are unable to handle the complexity of GC-MS data from human breath. We present a solution to GC-MS alignment using deep learning neural networks, which are more adept at complex, fuzzy data sets. We tested our model on several GC-MS data sets of various complexities and show the model has very good true positive rates (up to 99% for easy data sets and up to 92% for very complex data sets). We compared our model with the popular correlation optimized warping (COW) method and show our model has much better overall performance. This method can easily be adapted to other similar data such as those from liquid chromatography.
http://arxiv.org/abs/1904.01205
We present Habitat, a new platform for research in embodied artificial intelligence (AI). Habitat enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation, before transferring the learned skills to reality. Specifically, Habitat consists of the following: 1. Habitat-Sim: a flexible, high-performance 3D simulator with configurable agents, multiple sensors, and generic 3D dataset handling (with built-in support for SUNCG, Matterport3D, Gibson datasets). Habitat-Sim is fast – when rendering a scene from the Matterport3D dataset, Habitat-Sim achieves several thousand frames per second (fps) running single-threaded, and can reach over 10,000 fps multi-process on a single GPU, which is orders of magnitude faster than the closest simulator. 2. Habitat-API: a modular high-level library for end-to-end development of embodied AI algorithms – defining embodied AI tasks (e.g. navigation, instruction following, question answering), configuring and training embodied agents (via imitation or reinforcement learning, or via classic SLAM), and benchmarking using standard metrics. These large-scale engineering contributions enable us to answer scientific questions requiring experiments that were till now impracticable or 'merely' impractical. Specifically, in the context of point-goal navigation (1) we revisit the comparison between learning and SLAM approaches from two recent works and find evidence for the opposite conclusion – that learning outperforms SLAM, if scaled to total experience far surpassing that of previous investigations, and (2) we conduct the first cross-dataset generalization experiments {train, test} x {Matterport3D, Gibson} for multiple sensors {blind, RGB, RGBD, D} and find that only agents with depth (D) sensors generalize across datasets. We hope that our open-source platform and these findings will advance research in embodied AI.
http://arxiv.org/abs/1904.01201
Models trained for classification often assume that all testing classes are known during training. As a result, when presented with an unknown class during testing, such a closed-set assumption forces the model to classify it as one of the known classes. However, in a real-world scenario, classification models are likely to encounter such examples. Hence, identifying those examples as unknown becomes critical to model performance. A potential solution to this problem lies in a class of learning problems known as open-set recognition. It refers to the problem of identifying unknown classes during testing, while maintaining performance on the known classes. In this paper, we propose an open-set recognition algorithm using class-conditioned auto-encoders with a novel training and testing methodology. In contrast to previous methods, the training procedure is divided into two sub-tasks: 1) closed-set classification and 2) open-set identification (i.e. identifying a class as known or unknown). The encoder learns the first task following the closed-set classification training pipeline, whereas the decoder learns the second task by reconstructing conditioned on class identity. Furthermore, we model reconstruction errors using the Extreme Value Theory of statistical modeling to find the threshold for identifying known/unknown class samples. Experiments performed on multiple image classification datasets show that the proposed method performs significantly better than the state of the art.
http://arxiv.org/abs/1904.01198
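The EVT-based thresholding in the open-set abstract above can be sketched generically: fit an extreme-value distribution to the tail of the reconstruction errors observed on known-class training data and place the known/unknown threshold at a chosen tail probability. A minimal SciPy sketch under those assumptions (the distribution family, tail fraction, and tail probability used in the paper may differ):

```python
import numpy as np
from scipy import stats

def evt_threshold(recon_errors, tail_fraction=0.1, tail_prob=0.01):
    """Fit a generalized Pareto distribution to the largest reconstruction
    errors (peaks over a threshold) and return the error value beyond which
    a test sample would be declared 'unknown'."""
    errors = np.sort(recon_errors)
    u = np.quantile(errors, 1.0 - tail_fraction)          # tail cut-off
    excesses = errors[errors > u] - u
    shape, loc, scale = stats.genpareto.fit(excesses, floc=0.0)
    return u + stats.genpareto.ppf(1.0 - tail_prob, shape, loc=loc, scale=scale)

# Usage: errors from known classes are mostly small; flag test samples above the threshold.
rng = np.random.default_rng(0)
train_errors = rng.gamma(shape=2.0, scale=0.05, size=5000)
thr = evt_threshold(train_errors)
print("unknown if reconstruction error >", round(float(thr), 4))
```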
Distribution and sample models are two popular model choices in model-based reinforcement learning (MBRL). However, learning these models can be intractable, particularly when the state and action spaces are large. Expectation models, on the other hand, are relatively easier to learn due to their compactness and have also been widely used for deterministic environments. For stochastic environments, it is not obvious how expectation models can be used for planning as they only partially characterize a distribution. In this paper, we propose a sound way of using approximate expectation models for MBRL. In particular, we 1) show that planning with an expectation model is equivalent to planning with a distribution model if the state value function is linear in state features, 2) analyze two common parametrization choices for approximating the expectation: linear and non-linear expectation models, 3) propose a sound model-based policy evaluation algorithm and present its convergence results, and 4) empirically demonstrate the effectiveness of the proposed planning algorithm.
http://arxiv.org/abs/1904.01191
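The identity behind claim (1) in the abstract above is worth writing out: when the value function is linear in the state features, the expected next value depends only on the expected next feature vector, so an expectation model suffices for planning. A short sketch of that step (notation assumed here, not taken from the paper):

```latex
% If \hat{v}(s) = \mathbf{w}^{\top}\boldsymbol{\phi}(s) is linear in the features, then for any
% distribution over the next state S' given (s, a):
\mathbb{E}\big[\hat{v}(S') \mid s, a\big]
  = \mathbb{E}\big[\mathbf{w}^{\top}\boldsymbol{\phi}(S') \mid s, a\big]
  = \mathbf{w}^{\top}\,\mathbb{E}\big[\boldsymbol{\phi}(S') \mid s, a\big],
% so planning only needs the expected next feature vector \mathbb{E}[\boldsymbol{\phi}(S') \mid s, a],
% i.e., an expectation model, rather than the full next-state distribution.
```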
Skeleton-based human action recognition has attracted a lot of interest. Recently, there has been a trend of using deep feedforward neural networks to model the skeleton sequence, taking as input the 2D spatio-temporal map derived from the 3D coordinates of joints. Some semantics of the joints (frame index and joint type) are implicitly captured and exploited by the large receptive fields of deep convolutions at the cost of high complexity. In this paper, we propose a simple yet effective semantics-guided neural network (SGN) for skeleton-based action recognition. We explicitly introduce the high-level semantics of joints as part of the network input to enhance the feature representation capability. The model exploits global and local information through two semantics-aware graph convolutional layers followed by a convolutional layer. We first leverage the semantics and dynamics (coordinate and velocity) of joints to learn a content-adaptive graph for capturing the global spatio-temporal correlations of joints. Then a convolutional layer is used to further enhance the representation power of the features. With an order of magnitude smaller model size and higher speed than some previous works, SGN achieves state-of-the-art performance on the NTU, SYSU, and N-UCLA datasets. Experimental results demonstrate the effectiveness of explicitly exploiting semantic information in reducing model complexity and improving recognition accuracy.
http://arxiv.org/abs/1904.01189
Learning portable neural networks is essential for computer vision so that pre-trained heavy deep models can be deployed on edge devices such as mobile phones and micro sensors. Most existing deep neural network compression and speed-up methods are very effective for training compact deep models when we can directly access the training dataset. However, the training data for a given deep network are often unavailable due to practical problems (e.g. privacy, legal issues, and transmission), and the architecture of the given network is also unknown except for some interfaces. To this end, we propose a novel framework for training efficient deep neural networks by exploiting generative adversarial networks (GANs). To be specific, the pre-trained teacher network is regarded as a fixed discriminator, and the generator is utilized for deriving training samples which obtain the maximum response on the discriminator. Then, an efficient network with smaller model size and computational complexity is trained using the generated data and the teacher network simultaneously. Efficient student networks learned using the proposed Data-Free Learning (DFL) method achieve 92.22% and 74.47% accuracies without any training data on the CIFAR-10 and CIFAR-100 datasets, respectively. Meanwhile, our student network obtains an 80.56% accuracy on the CelebA benchmark.
http://arxiv.org/abs/1904.01186
Lipschitz continuity has recently become popular in generative adversarial networks (GANs). It has been observed that a Lipschitz-regularized discriminator leads to improved training stability and sample quality. The mainstream implementations of Lipschitz continuity include the gradient penalty and spectral normalization. In this paper, we demonstrate that the gradient penalty introduces undesired bias, while spectral normalization might be over-restrictive. We accordingly propose a new method which is efficient and unbiased. Our experiments verify our analysis and show that the proposed method is able to achieve successful training in various situations where the gradient penalty and spectral normalization fail.
https://arxiv.org/abs/1904.01184
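For context on the abstract above, the two mainstream Lipschitz mechanisms it names look roughly like this in PyTorch: a gradient penalty evaluated on interpolates between real and fake samples, and spectral normalization applied to the discriminator's layers. A minimal sketch of those baselines (the paper's own unbiased alternative is not reproduced here):

```python
import torch
import torch.nn as nn

def gradient_penalty(discriminator, real, fake, lam=10.0):
    """WGAN-GP style penalty: drive the gradient norm of D toward 1 on
    random interpolates between real and fake samples."""
    eps = torch.rand(real.size(0), 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_out = discriminator(x_hat)
    grads = torch.autograd.grad(d_out.sum(), x_hat, create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Spectral normalization instead constrains each layer's spectral norm to 1.
disc = nn.Sequential(
    nn.utils.spectral_norm(nn.Linear(2, 64)), nn.ReLU(),
    nn.utils.spectral_norm(nn.Linear(64, 1)),
)
real, fake = torch.randn(16, 2), torch.randn(16, 2)
print(gradient_penalty(disc, real, fake).item())
```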
SafeAccess is an interactive assistive technology solution to enhance the safety and independence of people with disabilities (e.g., visual impairment or limited mobility). The system output is the classification and identification of a person in front of the door or around the house into groups such as friends/family/caregivers versus intruders/burglars/unknowns. This allows the user to grant or deny remote access to the premises, or to call emergency services. In this paper, we focus on designing a prototype system and building a robust recognition engine that meets the system criteria and addresses speed, accuracy, deployment and environmental challenges under a wide variety of practical and real-life situations. The premises are assumed to be equipped with cameras placed at strategic locations to capture images and videos. To interact with the system, we implemented a dialog-enabled interface to create a personalized profile using face images or videos of friends/family/caregivers. To improve computational efficiency, we apply change detection to filter out frames, use Faster R-CNN to detect human presence, and extract faces using Multitask Cascaded Convolutional Networks (MTCNN). Subsequently, we apply LBP/FaceNet to identify a person and their group by matching extracted faces with the profile. SafeAccess sends the identification result to the user in an MMS containing the person's name if a match is found (or labeling them as an intruder otherwise), the scene image, and a confidence score between 1 and 10. In addition, daily, weekly and monthly summarized reports of past incidents can be queried from the system. Empirical analysis shows robust performance with an F-score of 0.97 in identifying friends/family/caregivers versus intruders/unknowns.
http://arxiv.org/abs/1904.01178
We present a learning-based method to infer plausible high dynamic range (HDR), omnidirectional illumination given an unconstrained, low dynamic range (LDR) image from a mobile phone camera with a limited field of view (FOV). For training data, we collect videos of various reflective spheres placed within the camera’s FOV, leaving most of the background unoccluded, leveraging the fact that materials with diverse reflectance functions reveal different lighting cues in a single exposure. We train a deep neural network to regress from the LDR background image to HDR lighting by matching the LDR ground truth sphere images to those rendered with the predicted illumination using image-based relighting, which is differentiable. Our inference runs at interactive frame rates on a mobile device, enabling realistic rendering of virtual objects into real scenes for mobile mixed reality. Training on automatically exposed and white-balanced videos, we improve the realism of rendered objects compared to state-of-the-art methods for both indoor and outdoor scenes.
http://arxiv.org/abs/1904.01175
We propose a generative model for a sentence that uses two latent variables, with one intended to represent the syntax of the sentence and the other to represent its semantics. We show we can achieve better disentanglement between semantic and syntactic representations by training with multiple losses, including losses that exploit aligned paraphrastic sentences and word-order information. We also investigate the effect of moving from bag-of-words to recurrent neural network modules. We evaluate our models as well as several popular pretrained embeddings on standard semantic similarity tasks and novel syntactic similarity tasks. Empirically, we find that the model with the best performing syntactic and semantic representations also gives rise to the most disentangled representations.
http://arxiv.org/abs/1904.01173
Commonsense knowledge and commonsense reasoning are some of the main bottlenecks in machine intelligence. In the NLP community, many benchmark datasets and tasks have been created to address commonsense reasoning for language understanding. These tasks are designed to assess machines’ ability to acquire and learn commonsense knowledge in order to reason and understand natural language text. As these tasks become instrumental and a driving force for commonsense research, this paper aims to provide an overview of existing tasks and benchmarks, knowledge resources, and learning and inference approaches toward commonsense reasoning for natural language understanding. Through this, our goal is to support a better understanding of the state of the art, its limitations, and future challenges.
http://arxiv.org/abs/1904.01172
While natural language processing (NLP) of unstructured clinical narratives holds the potential for patient care and clinical research, portability of NLP approaches across multiple sites remains a major challenge. This study investigated the portability of an NLP system developed initially at the Department of Veterans Affairs (VA) to extract 27 key cardiac concepts from free-text or semi-structured echocardiograms from three academic medical centers: Weill Cornell Medicine, Mayo Clinic and Northwestern Medicine. While the NLP system showed high precision and recall measurements for four target concepts (aortic valve regurgitation, left atrium size at end systole, mitral valve regurgitation, tricuspid valve regurgitation) across all sites, we found moderate or poor results for the remaining concepts and the NLP system performance varied between individual sites.
http://arxiv.org/abs/1905.01961
Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layer-wise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g., CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e., object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods. The source code and trained models will be made publicly available.
http://arxiv.org/abs/1904.01169
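A simplified PyTorch sketch of the hierarchical residual-like connections inside a Res2Net block as described above: after splitting the channels into several groups, each group (after the first) is convolved together with the output of the previous group before everything is concatenated again. The exact widths, normalization, and the surrounding 1x1 bottleneck convolutions are omitted or assumed here:

```python
import torch
import torch.nn as nn

class Res2NetSplit(nn.Module):
    """Hierarchical residual-like connections within one block (simplified sketch)."""
    def __init__(self, channels, scale=4):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        w = channels // scale
        # One 3x3 conv per group except the first, which is passed through unchanged.
        self.convs = nn.ModuleList(
            nn.Conv2d(w, w, kernel_size=3, padding=1) for _ in range(scale - 1)
        )

    def forward(self, x):
        xs = torch.chunk(x, self.scale, dim=1)
        ys = [xs[0]]                                  # first split: identity
        prev = None
        for i, conv in enumerate(self.convs):
            inp = xs[i + 1] if prev is None else xs[i + 1] + prev
            prev = conv(inp)                          # effective receptive field grows with i
            ys.append(prev)
        return torch.cat(ys, dim=1)

block = Res2NetSplit(channels=64, scale=4)
print(block(torch.randn(1, 64, 32, 32)).shape)        # torch.Size([1, 64, 32, 32])
```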
In this paper, we present a deep learning algorithm to rapidly obtain high-quality CT reconstructions for additively manufactured (AM) parts. In particular, we propose to use CAD models of the parts that are to be manufactured, introduce typical defects, and simulate XCT measurements. These simulated measurements were processed using FBP (computationally simple but resulting in noisy images) and the model-based iterative reconstruction (MBIR) technique. We then train a 2.5D deep convolutional neural network [4], deemed 2.5D Deep Learning MBIR (2.5D DL-MBIR), on these pairs of noisy and high-quality 3D volumes to learn a fast, non-linear mapping function. The 2.5D DL-MBIR reconstructs a 3D volume in a 2.5D scheme where each slice is reconstructed from multiple input slices of the FBP input. Given this trained system, we can take a small set of measurements on an actual part and process it using a combination of FBP followed by 2.5D DL-MBIR. Both steps can be rapidly performed using GPUs, resulting in a real-time algorithm that achieves the high quality of MBIR as fast as standard techniques. Intuitively, since CAD models are typically available for parts to be manufactured, this provides a strong constraint ("prior") which can be leveraged to improve the reconstruction.
http://arxiv.org/abs/1904.12585
Image classifiers based on deep neural networks suffer from harassment caused by adversarial examples. Two defects exist in black-box iterative attacks that generate adversarial examples by incrementally adjusting the noise-adding direction for each step. On the one hand, existing iterative attacks add noise monotonically along the direction of gradient ascent, resulting in a lack of diversity and adaptability of the generated iterative trajectories. On the other hand, it is trivial to perform an adversarial attack by adding excessive noise, but currently there is no refinement mechanism to squeeze out redundant noise. In this work, we propose the Curls & Whey black-box attack to fix the above two defects. During Curls iteration, by combining gradient ascent and descent, we 'curl' up iterative trajectories to integrate more diversity and transferability into adversarial examples. Curls iteration also alleviates the diminishing marginal effect in existing iterative attacks. The Whey optimization further squeezes the 'whey' of noise by exploiting the robustness of adversarial perturbations. Extensive experiments on ImageNet and Tiny-ImageNet demonstrate that our approach achieves an impressive decrease in noise magnitude in the l2 norm. The Curls & Whey attack also shows promising transferability against ensemble models as well as adversarially trained models. In addition, we extend our attack to targeted misclassification, effectively reducing the difficulty of targeted attacks under the black-box condition.
http://arxiv.org/abs/1904.01160
There has been a debate in medical image segmentation on whether to use 2D or 3D networks, where both pipelines have advantages and disadvantages. This paper presents a novel approach which thickens the input of a 2D network, so that the model is expected to enjoy both the stability and efficiency of 2D networks as well as the ability of 3D networks to model volumetric contexts. A major information loss happens when a large number of 2D slices are fused at the first convolutional layer, resulting in a relatively weak ability of the network to distinguish the differences among slices. To alleviate this drawback, we propose an effective framework which (i) postpones slice fusion and (ii) adds highway connections from the pre-fusion layer so that the prediction layer receives slice-sensitive auxiliary cues. Experiments on segmenting a few abdominal targets, in particular blood vessels, which require strong 3D contexts, demonstrate the effectiveness of our approach.
http://arxiv.org/abs/1904.01150
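The "thickened input" idea in the abstract above amounts to feeding a 2D network a stack of adjacent slices as channels; a minimal NumPy sketch of building such 2.5D inputs from a volume (the slice count and edge-padding policy are illustrative assumptions, and the paper's actual contributions of postponed slice fusion and highway connections are not shown):

```python
import numpy as np

def thick_slices(volume, k=3):
    """For each axial slice, stack its k neighbours on either side as channels,
    yielding inputs of shape (depth, 2k+1, H, W) for a 2D network."""
    depth = volume.shape[0]
    padded = np.pad(volume, ((k, k), (0, 0), (0, 0)), mode="edge")
    return np.stack([padded[z:z + 2 * k + 1] for z in range(depth)], axis=0)

vol = np.random.rand(40, 128, 128)         # (depth, H, W) CT volume
inputs = thick_slices(vol, k=3)
print(inputs.shape)                         # (40, 7, 128, 128)
```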
This thesis describes our ongoing work on Contrastive Predictive Coding (CPC) features for speaker verification. CPC is a recently proposed representation learning framework based on predictive coding and noise contrastive estimation. We focus on incorporating CPC features into the standard automatic speaker verification systems, and we present our methods, experiments, and analysis. This thesis also details necessary background knowledge in past and recent work on automatic speaker verification systems, conventional speech features, and the motivation and techniques behind CPC.
https://arxiv.org/abs/1904.01575
In this paper, we address the open research problem of surgical gesture recognition using motion cues from video data only. We adapt the Optical Flow ConvNets initially proposed by Simonyan et al. While Simonyan et al. use both RGB frames and dense optical flow, we use only dense optical flow representations as input to emphasize the role of motion in surgical gesture recognition, and present it as a robust alternative to kinematic data. We also overcome one of the limitations of Optical Flow ConvNets by initializing our model with cross-modality pre-training. A large number of promising studies that address surgical gesture recognition rely heavily on kinematic data, which requires additional recording devices. To our knowledge, this is the first paper that addresses surgical gesture recognition using dense optical flow information only. We achieve competitive results on the JIGSAWS dataset; moreover, our model achieves more robust results with a lower standard deviation, which suggests optical flow information can be used as an alternative to kinematic data for the recognition of surgical gestures.
http://arxiv.org/abs/1904.01143
Exact structured inference with neural network scoring functions is computationally challenging but several methods have been proposed for approximating inference. One approach is to perform gradient descent with respect to the output structure directly (Belanger and McCallum, 2016). Another approach, proposed recently, is to train a neural network (an “inference network”) to perform inference (Tu and Gimpel, 2018). In this paper, we compare these two families of inference methods on three sequence labeling datasets. We choose sequence labeling because it permits us to use exact inference as a benchmark in terms of speed, accuracy, and search error. Across datasets, we demonstrate that inference networks achieve a better speed/accuracy/search error trade-off than gradient descent, while also being faster than exact inference at similar accuracy levels. We find further benefit by combining inference networks and gradient descent, using the former to provide a warm start for the latter.
http://arxiv.org/abs/1904.01138
The calculated filling factors (FFs) for a feature reflect the fraction of the solar disc covered by that feature and the assignment of reference synthetic spectra. In this paper, the FFs, specified as a function of radial position on the solar disc, are computed for each image in tabular form. The filling factor (FF) is an important parameter, defined as the fraction of the area in a pixel covered by the magnetic field, whereas the rest of the area in the pixel is field-free. However, this alone does not provide extensive information about experiments conducted on tens or hundreds of such images. This is the first time that filling factors for SODISM images have been catalogued in tabular form. This paper presents a new method that provides the means to detect sunspots on full-disk solar images recorded by the Solar Diameter Imager and Surface Mapper (SODISM) on the PICARD satellite. The method is a fully automated detection process that achieves a sunspot recognition rate of 97.6%. The number of sunspots detected by this method strongly agrees with the NOAA catalogue. The sunspot areas calculated by this method have a 99% correlation with SOHO over the same period, and thus help to calculate the filling factor for wavelength (W.L.) 607 nm.
http://arxiv.org/abs/1904.01133