Although deep learning models perform remarkably well across a range of tasks such as language translation and object recognition, it remains unclear what high-level logic, if any, they follow. Understanding this logic may lead to more transparency, better model design, and faster experimentation. Recent machine learning research has leveraged statistical methods to identify hidden units that behave (e.g., activate) similarly to human-understandable logic, but those analyses require considerable manual effort. Our insight is that many of those studies follow a common analysis pattern, which we term Deep Neural Inspection. There is an opportunity to provide a declarative abstraction to easily express, execute, and optimize such analyses. This paper describes DeepBase, a system to inspect neural network behaviors through a unified interface. We model logic with user-provided hypothesis functions that annotate the data with high-level labels (e.g., part-of-speech tags, image captions). DeepBase lets users quickly identify individual or groups of units that have strong statistical dependencies with desired hypotheses. We discuss how DeepBase can express existing analyses, propose a set of simple and effective optimizations that speed up a standard Python implementation by up to 72x, and reproduce recent studies from the NLP literature.
https://arxiv.org/abs/1808.04486
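As a concrete illustration of the Deep Neural Inspection pattern that DeepBase abstracts, the sketch below scores hidden units by their statistical dependency with a hypothesis function's labels. The activation matrix, the labels, and the choice of Pearson correlation as the affinity measure are illustrative stand-ins, not DeepBase's actual API.

```python
import numpy as np

def unit_hypothesis_scores(activations, hypothesis_labels):
    """Score each hidden unit by |Pearson correlation| with a hypothesis.

    activations: (n_tokens, n_units) array of extracted unit activations.
    hypothesis_labels: (n_tokens,) array of high-level labels, e.g. 1.0
        where a part-of-speech tagger marked the token as a noun.
    """
    a = activations - activations.mean(axis=0)
    h = hypothesis_labels - hypothesis_labels.mean()
    cov = a.T @ h / len(h)
    denom = a.std(axis=0) * h.std() + 1e-12
    return np.abs(cov / denom)

# Hypothetical usage: rank units against an "is a noun" hypothesis.
acts = np.random.randn(1000, 256)                     # stand-in activations
labels = (np.random.rand(1000) > 0.8).astype(float)   # stand-in labels
top_units = np.argsort(unit_hypothesis_scores(acts, labels))[::-1][:10]
```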
Matrix completion focuses on recovering a matrix from a small subset of its observed entries, and has gained considerable attention in computer vision. Many previous approaches formulate this problem as low-rank matrix approximation. Recently, the truncated nuclear norm has been presented as a surrogate for the traditional nuclear norm that better estimates the rank of a matrix. The truncated nuclear norm regularization (TNNR) method is applicable in real-world scenarios. However, it is sensitive to the choice of the number of truncated singular values and requires many iterations to converge. This paper therefore proposes a revised approach called double weighted truncated nuclear norm regularization (DW-TNNR), which assigns different weights to the rows and columns of a matrix separately, to accelerate convergence with acceptable performance. DW-TNNR is more robust to the number of truncated singular values than TNNR. Instead of the iterative updating scheme in the second step of TNNR, this paper devises an efficient strategy based on gradient descent in a concise form, with a theoretical guarantee in optimization. Extensive experiments on real visual data show that DW-TNNR has promising performance and is superior in both speed and accuracy for matrix completion.
http://arxiv.org/abs/1901.01711
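For reference, the truncated nuclear norm at the heart of TNNR and DW-TNNR is easy to evaluate: it is the sum of all but the largest r singular values. The sketch below computes it with NumPy and adds one illustrative row/column-weighted gradient step on a data-fidelity term; the exact DW-TNNR weighting scheme and update rule are assumptions here, not the paper's formulation.

```python
import numpy as np

def truncated_nuclear_norm(X, r):
    """Sum of all but the r largest singular values of X: the TNN
    surrogate for rank."""
    s = np.linalg.svd(X, compute_uv=False)
    return s[r:].sum()

def weighted_fidelity_step(X, M, mask, row_w, col_w, lr=1e-2):
    """One gradient step on the weighted squared error
    0.5 * sum(W * mask * (X - M)**2), with W = row_w outer col_w.
    Illustrative only; DW-TNNR's actual weighting/update may differ."""
    W = np.outer(row_w, col_w)
    return X - lr * (W * mask * (X - M))
```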
This paper proposes the first iris recognition methodology, to our knowledge, designed specifically for post-mortem samples. We propose to use deep learning-based iris segmentation models to extract highly irregular iris texture areas in post-mortem iris images. We show how to use segmentation masks predicted by neural networks in a conventional, Gabor-based iris recognition method, which employs circular approximations of the pupillary and limbic iris boundaries. As a whole, this method allows for a significant improvement in post-mortem iris recognition accuracy over methods designed only for ante-mortem irises, including the academic OSIRIS and commercial IriCore implementations. The proposed method reaches an EER of less than 1% for samples collected up to 10 hours after death, compared to EERs of 16.89% and 5.37% observed for OSIRIS and IriCore, respectively. For samples collected up to 369 hours post-mortem, the proposed method achieves an EER of 21.45%, while 33.59% and 25.38% are observed for OSIRIS and IriCore, respectively. Additionally, the method is tested on a database of iris images collected from ophthalmology clinic patients, for which it also offers an advantage over the two other methods. This work is the first step towards post-mortem-specific iris recognition, which increases the chances of identifying deceased subjects in forensic investigations. The new database of post-mortem iris images acquired from 42 subjects, as well as the deep learning-based segmentation models, are made available along with the paper to ensure that all results presented in this manuscript are reproducible.
http://arxiv.org/abs/1901.01708
Ultrasound (US) imaging is based on the time-reversal principle, in which individual channel RF measurements are back-propagated and accumulated to form an image after applying specific delays. While this time reversal is usually implemented as a delay-and-sum (DAS) beamformer, image quality quickly degrades as the number of measurement channels decreases. To address this problem, various types of adaptive beamforming techniques have been proposed using predefined signal models. However, the performance of these adaptive beamforming approaches degrades when the underlying model is not sufficiently accurate. Here, we demonstrate for the first time that a single universal deep beamformer, trained in a purely data-driven way, can generate significantly improved images over widely varying aperture and channel subsampling patterns. In particular, we design an end-to-end deep learning framework that can directly process sub-sampled RF data acquired at different subsampling rates and detector configurations to generate high-quality ultrasound images using a single beamformer. Experimental results using B-mode focused ultrasound confirm the efficacy of the proposed method.
http://arxiv.org/abs/1901.01706
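For context, the delay-and-sum beamformer that the deep beamformer replaces can be sketched in a few lines. This version makes simplifying assumptions that are not from the paper: receive-only focusing with a crude transmit-delay term, a known linear element geometry, and a fixed speed of sound of 1540 m/s.

```python
import numpy as np

def das_beamform(rf, element_x, image_x, image_z, fs, c=1540.0):
    """Delay-and-sum: for each image point, sum the channel RF samples
    delayed by the estimated round-trip travel time.

    rf: (n_samples, n_channels) channel RF data
    element_x: (n_channels,) lateral element positions [m]
    image_x, image_z: 1D grids of image point coordinates [m]
    fs: sampling frequency [Hz]
    """
    n_samples, _ = rf.shape
    image = np.zeros((len(image_z), len(image_x)))
    for iz, z in enumerate(image_z):
        for ix, x in enumerate(image_x):
            dist = np.sqrt((element_x - x) ** 2 + z ** 2)  # receive path
            delays = (z + dist) / c        # crude transmit + receive delay
            idx = np.round(delays * fs).astype(int)
            valid = idx < n_samples
            image[iz, ix] = rf[idx[valid], np.nonzero(valid)[0]].sum()
    return image
```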
In existing visual representation learning tasks, deep convolutional neural networks (CNNs) are often trained on images annotated with single tags, such as ImageNet. However, a single tag cannot describe all the important contents of an image, and some useful visual information may be wasted during training. In this work, we propose to train CNNs on images annotated with multiple tags to enhance the quality of the visual representation of the trained CNN model. To this end, we build a large-scale multi-label image database with 18M images and 11K categories, dubbed Tencent ML-Images. We efficiently train the ResNet-101 model with multi-label outputs on Tencent ML-Images, taking 90 hours for 60 epochs, based on a large-scale distributed deep learning framework, i.e., TFplus. The good quality of the visual representation of the Tencent ML-Images checkpoint is verified through three transfer learning tasks: single-label image classification on ImageNet and Caltech-256, object detection on PASCAL VOC 2007, and semantic segmentation on PASCAL VOC 2012. The Tencent ML-Images database, the checkpoints of ResNet-101, and all the training code have been released at https://github.com/Tencent/tencent-ml-images. We expect this to benefit other vision tasks in the research and industrial communities.
http://arxiv.org/abs/1901.01703
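The main training-side change relative to single-label pre-training is the multi-label head and loss. A minimal sketch with torchvision; the backbone choice matches the abstract, but the loss and the rounded category count are stand-ins for the released training code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

# Multi-label training: one sigmoid output per category with binary
# cross-entropy, instead of a single softmax over mutually exclusive tags.
model = resnet101(num_classes=11000)   # ~11K categories, as in ML-Images
criterion = nn.BCEWithLogitsLoss()

images = torch.randn(8, 3, 224, 224)
targets = torch.zeros(8, 11000)
targets[:, :3] = 1.0                   # stand-in: a few positive tags each
loss = criterion(model(images), targets)
loss.backward()
```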
Recently, low-rank tensor completion has become increasingly attractive for recovering incomplete visual data. Considering a color image or video as a three-dimensional (3D) tensor, existing studies have put forward several definitions of the tensor nuclear norm. However, they are limited in that they may not accurately approximate the real rank of a tensor, and they do not explicitly use the low-rank property in optimization. It has been shown that the recently proposed truncated nuclear norm (TNN) can replace the traditional nuclear norm as an improved approximation to the rank of a matrix. In this paper, we propose a new method called the tensor truncated nuclear norm (T-TNN), which suggests a new definition of the tensor nuclear norm: the truncated nuclear norm is generalized from the matrix case to the tensor case. By exploiting the low-rankness captured by the TNN, our approach improves the efficacy of tensor completion. Our algorithm adopts the previously proposed tensor singular value decomposition, the alternating direction method of multipliers, and the accelerated proximal gradient line search method. Extensive experiments on real-world videos and images show that our approach outperforms previous methods.
http://arxiv.org/abs/1901.01997
In this dissertation we report results of our research on dense distributed representations of text data. We propose two novel neural models for learning such representations. The first model learns representations at the document level, while the second learns word-level representations. For document-level representations we propose Binary Paragraph Vector: neural network models for learning binary representations of text documents, which can be used for fast document retrieval. We provide a thorough evaluation of these models and demonstrate that they outperform the seminal method in the field on the information retrieval task. We also report strong results in transfer learning settings, where our models are trained on a generic text corpus and then used to infer codes for documents from a domain-specific dataset. In contrast to previously proposed approaches, Binary Paragraph Vector models learn embeddings directly from raw text data. For word-level representations we propose Disambiguated Skip-gram: a neural network model for learning multi-sense word embeddings. Representations learned by this model can be used in downstream tasks, like part-of-speech tagging or identification of semantic relations. In the word sense induction task Disambiguated Skip-gram outperforms state-of-the-art models on three out of four benchmark datasets. Our model has an elegant probabilistic interpretation. Furthermore, unlike previous models of this kind, it is differentiable with respect to all its parameters and can be trained with backpropagation. In addition to quantitative results, we present a qualitative evaluation of Disambiguated Skip-gram, including two-dimensional visualisations of selected word-sense embeddings.
http://arxiv.org/abs/1901.01695
Convolutional Neural Networks have achieved significant success across multiple computer vision tasks. However, they are vulnerable to carefully crafted, human-imperceptible adversarial noise patterns, which constrains their deployment in critical security-sensitive systems. This paper proposes a computationally efficient image enhancement approach that provides a strong defense mechanism to effectively mitigate the effect of such adversarial perturbations. We show that deep image restoration networks learn mapping functions that can bring \textit{off-the-manifold} adversarial samples onto the natural image manifold, thus restoring classifier beliefs towards correct classes. A distinguishing feature of our approach is that, in addition to providing robustness against attacks, it simultaneously enhances image quality and retains model performance on clean images. Furthermore, the proposed method neither modifies the classifier nor requires a separate mechanism to detect adversarial images. The effectiveness of the scheme has been demonstrated through extensive experiments, where it has proven a strong defense in both white-box and black-box attack settings. The proposed scheme is simple and has the following advantages: (1) it does not require any model training or parameter optimization, (2) it complements other existing defense mechanisms, (3) it is agnostic to the attacked model and attack type, and (4) it provides superior performance across all popular attack algorithms. Our code is publicly available at https://github.com/aamir-mustafa/super-resolution-adversarial-defense.
http://arxiv.org/abs/1901.01677
Detection of faint edges in noisy images is a challenging problem that has been studied over the last decades. \cite{ofir2016fast} introduced a fast method to detect faint edges with the highest accuracy among existing approaches. Its complexity is nearly linear in the number of image pixels, and its runtime is seconds for a noisy image. The approach utilizes a multi-scale binary partitioning of the image. In this paper, we show that by utilizing the multi-scale U-net architecture \cite{unet}, this method can be dramatically improved in both runtime and accuracy. By training the network on a dataset of binary images, we developed an approach for faint edge detection that works with linear complexity. Our runtime for a noisy image is milliseconds on a GPU. Even though our method is orders of magnitude faster, it still achieves higher detection accuracy under many challenging scenarios. In addition, we show that our approach of performing multi-scale preprocessing of noisy images using U-net improves the ability to perform other vision tasks in the presence of noise. We demonstrate this point using the ResNet-20 \cite{resnet} classifier and the CIFAR-10 \cite{cifar} dataset.
http://arxiv.org/abs/1803.09420
Why does training deep neural networks using stochastic gradient descent (SGD) result in a generalization error that does not worsen with the number of parameters in the network? To answer this question, we advocate a notion of effective model capacity that is dependent on {\em a given random initialization of the network} and not just the training algorithm and the data distribution. We provide empirical evidence demonstrating that the model capacity of SGD-trained deep networks is in fact restricted through implicit regularization of {\em the $\ell_2$ distance from the initialization}. We also provide theoretical arguments that further highlight the need for initialization-dependent notions of model capacity. We leave as open questions how and why distance from initialization is regularized, and whether it is sufficient to explain generalization.
http://arxiv.org/abs/1901.01672
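The quantity at the center of the argument is easy to monitor during training. A minimal sketch of tracking the $\ell_2$ distance from initialization in a toy SGD loop; the model and data are stand-ins.

```python
import torch
import torch.nn as nn

def l2_distance_from_init(model, init_params):
    """||theta_t - theta_0||_2 over all parameters."""
    sq = sum((p.detach() - init_params[n]).pow(2).sum().item()
             for n, p in model.named_parameters())
    return sq ** 0.5

model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))
init_params = {n: p.detach().clone() for n, p in model.named_parameters()}
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(100):                              # toy training loop
    x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(l2_distance_from_init(model, init_params))
```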
Siamese networks have drawn great attention in visual tracking because of their balanced accuracy and speed. However, the backbone networks used in Siamese trackers are relatively shallow, such as AlexNet [18], and do not fully take advantage of the capability of modern deep neural networks. In this paper, we investigate how to leverage deeper and wider convolutional neural networks to enhance tracking robustness and accuracy. We observe that direct replacement of backbones with existing powerful architectures, such as ResNet [14] and Inception [33], does not bring improvements. The main reasons are that 1) large increases in the receptive field of neurons lead to reduced feature discriminability and localization precision; and 2) the network padding for convolutions induces a positional bias in learning. To address these issues, we propose new residual modules to eliminate the negative impact of padding, and further design new architectures using these modules with controlled receptive field size and network stride. The designed architectures are lightweight and guarantee real-time tracking speed when applied to SiamFC [2] and SiamRPN [20]. Experiments show that, solely due to the proposed network architectures, our SiamFC+ and SiamRPN+ obtain up to 9.8%/5.7% (AUC), 23.3%/8.8% (EAO) and 24.4%/25.0% (EAO) relative improvements over the original versions [2, 20] on the OTB-15, VOT-16 and VOT-17 datasets, respectively.
http://arxiv.org/abs/1901.01660
Shape analysis is important in anthropology, bioarchaeology and forensic science for interpreting useful information from human remains. In particular, teeth are morphologically stable and hence well-suited for shape analysis. In this work, we propose a framework for tooth morphometry using quasi-conformal theory. Landmark-matching Teichmüller maps are used for establishing a 1-1 correspondence between tooth surfaces with prescribed anatomical landmarks. Then, a quasi-conformal statistical shape analysis model based on the Teichmüller mapping results is proposed for building a tooth classification scheme. We deploy our framework on a dataset of human premolars to analyze the tooth shape variation among genders and ancestries. Experimental results show that our method achieves much higher classification accuracy with respect to both gender and ancestry when compared to existing methods. Furthermore, our model reveals the underlying tooth shape differences between genders and ancestries in terms of local geometric distortion and curvatures.
http://arxiv.org/abs/1901.01651
Predicting the future may sound like fantasy, but it is practical work: it is a key component of intelligent agents such as self-driving vehicles, medical monitoring devices and robots. In this work, we consider generating unseen future frames from previous observations, which is notoriously hard due to the uncertainty in frame dynamics. While recent works based on generative adversarial networks (GANs) have made remarkable progress, there remain obstacles to making accurate and realistic predictions. In this paper, we propose a novel GAN based on inter-frame differences to circumvent these difficulties. More specifically, our model is a multi-stage generative network, named the Difference Guided Generative Adversarial Network (DGGAN). DGGAN learns to explicitly enforce future-frame predictions guided by synthetic inter-frame differences. Given a sequence of frames, DGGAN first uses dual paths to generate meta information. One path, the Coarse Frame Generator, predicts coarse details about future frames; the other path, the Difference Guide Generator, generates a difference image containing complementary fine details. The coarse details are then refined under the guidance of the difference image with the support of GANs. With this model and novel architecture, we achieve state-of-the-art performance for future video prediction on UCF-101 and KITTI.
http://arxiv.org/abs/1901.01649
This study applies text mining to analyze customer reviews and automatically assign a collective restaurant star rating based on five predetermined aspects: ambiance, cost, food, hygiene, and service. The application provides a web and mobile crowdsourcing platform where users share dining experiences and get insights into the strengths and weaknesses of a restaurant through user-contributed feedback. Text reviews are tokenized into sentences. Noun-adjective pairs are extracted from each sentence using the Stanford CoreNLP library and are associated with aspects based on the bag of associated words fed into the system. The sentiment weight of the adjectives is determined through the AFINN library. An overall restaurant star rating is computed from the individual aspect ratings. Further, a word cloud is generated to provide a visual display of the most frequently occurring terms in the reviews. The more feedback is added, the more closely the sentiment score reflects the restaurant's performance.
http://arxiv.org/abs/1901.01642
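A minimal sketch of the aspect-rating aggregation described above, using the `afinn` Python package for sentiment weights. The aspect lexicon, the noun-adjective pairs (which the paper extracts with Stanford CoreNLP), and the mapping from mean sentiment weight to a 1-5 star scale are illustrative assumptions.

```python
from afinn import Afinn  # AFINN sentiment lexicon, weights in [-5, 5]

# Illustrative bag of associated words per aspect.
ASPECTS = {
    "food": {"food", "pasta", "steak", "flavor"},
    "service": {"service", "waiter", "staff"},
    "ambiance": {"ambiance", "music", "decor"},
    "cost": {"price", "cost", "bill"},
    "hygiene": {"hygiene", "restroom", "kitchen"},
}

afinn = Afinn()

def aspect_ratings(noun_adj_pairs):
    """Map (noun, adjective) pairs to per-aspect star ratings by linearly
    rescaling the mean AFINN weight from [-5, 5] to [1, 5] stars
    (an illustrative choice, not the paper's exact formula)."""
    scores = {a: [] for a in ASPECTS}
    for noun, adj in noun_adj_pairs:
        for aspect, bag in ASPECTS.items():
            if noun.lower() in bag:
                scores[aspect].append(afinn.score(adj))
    return {a: round((sum(s) / len(s) + 5) * 0.4 + 1, 1) if s else None
            for a, s in scores.items()}

print(aspect_ratings([("food", "delicious"), ("waiter", "rude")]))
```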
Blind motion deblurring is one of the most basic and challenging problems in image processing and computer vision. It aims to recover a sharp image from its blurred version while knowing nothing about the blur process. Many existing methods use Maximum A Posteriori (MAP) or Expectation Maximization (EM) frameworks to deal with such problems, but they cannot handle the high-frequency features of natural images well. Most recently, deep neural networks have been emerging as a powerful tool for image deblurring. In this paper, we show that an encoder-decoder architecture gives better results for image deblurring tasks. In addition, we propose a novel end-to-end learning model that refines a generative adversarial network with several novel training strategies to tackle the deblurring problem. Experimental results show that our model can capture high-frequency features well, and results on a benchmark dataset show that the proposed model achieves competitive performance.
http://arxiv.org/abs/1901.01641
Central Pattern Generators (CPGs) are biological neural circuits capable of producing coordinated rhythmic outputs in the absence of rhythmic input. As a result, they are responsible for most rhythmic motion in living organisms. This rhythmic control is broadly applicable to fields such as locomotive robotics and medical devices. In this paper, we explore the possibility of creating a self-sustaining CPG network for reinforcement learning that learns rhythmic motion more efficiently and across more general environments than the current multilayer perceptron (MLP) baseline models. Recent work introduced the Structured Control Net (SCN), which maintains linear and nonlinear modules for local and global control, respectively. Here, we show that time-sequence architectures such as Recurrent Neural Networks (RNNs) model CPGs effectively. Combining previous work on RNNs and SCNs, we introduce the Recurrent Control Net (RCN), which adds a linear component to the RNN. RCNs match and exceed the performance of baseline MLPs and SCNs across all environment tasks. Our findings confirm existing intuitions for RNNs on reinforcement learning tasks, and demonstrate the promise of SCN-like structures in reinforcement learning.
http://arxiv.org/abs/1901.01994
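Under the reading above, an RCN-style policy pairs a linear module with an RNN-based nonlinear module, mirroring the SCN's additive structure. The sketch below is a guess at that composition; the exact module wiring and sizes in the paper may differ.

```python
import torch
import torch.nn as nn

class RecurrentControlNet(nn.Module):
    """Sketch of an RCN-style policy: a linear module for local control
    summed with an RNN acting as the nonlinear, rhythm-capable module.
    Composition and dimensions are illustrative assumptions."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.linear = nn.Linear(obs_dim, act_dim)      # linear component
        self.rnn = nn.RNN(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)         # nonlinear readout

    def forward(self, obs_seq, h=None):
        out, h = self.rnn(obs_seq, h)
        return self.linear(obs_seq) + self.head(out), h

policy = RecurrentControlNet(obs_dim=17, act_dim=6)
actions, h = policy(torch.randn(1, 10, 17))   # 1 rollout, 10 timesteps
```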
Recognizing an object's material can inform a robot about the object's fragility or appropriate use. To estimate an object's material during manipulation, many prior works have explored the use of haptic sensing. In this paper, we explore a technique for robots to estimate the materials of objects using spectroscopy. We demonstrate that spectrometers provide several benefits for material recognition, including fast response times and accurate measurements with low noise. Furthermore, spectrometers do not require direct contact with an object. To explore this, we collected a dataset of spectral measurements from two commercially available spectrometers while a robotic platform interacted with 50 flat material objects, and we show that a neural network model can accurately analyze these measurements. Due to the similarity between consecutive spectral measurements, our model achieved a material classification accuracy of 94.6% when given only one spectral sample per object. Similar to prior works with haptic sensors, we found that generalizing material recognition to new objects posed a greater challenge, for which we achieved an accuracy of 79.1% via leave-one-object-out cross-validation. Finally, we demonstrate how a PR2 robot can leverage spectrometers to estimate the materials of everyday objects found in the home. From this work, we find that spectroscopy is a promising approach for material classification during robotic manipulation.
http://arxiv.org/abs/1805.04051
Automatic segmentation of pathological shoulder muscles in patients with musculo-skeletal diseases is a challenging task due to the huge variability in muscle shape, size, location, texture and injury. A reliable fully-automated segmentation method from magnetic resonance images could greatly help clinicians to plan therapeutic interventions and predict interventional outcomes while eliminating time-consuming manual segmentation efforts. The purpose of this work is three-fold. First, we investigate the feasibility of pathological shoulder muscle segmentation using deep learning techniques, given a very limited amount of available annotated pediatric data. Second, we address the learning transferability from healthy to pathological data by comparing different learning schemes in terms of model generalizability. Third, extended versions of deep convolutional encoder-decoder architectures using encoders pre-trained on non-medical data are proposed to improve the segmentation accuracy. Methodological aspects are evaluated in a leave-one-out fashion on a dataset of 24 shoulder examinations from patients with obstetrical brachial plexus palsy, focusing on 4 different muscles: the deltoid as well as the infraspinatus, supraspinatus and subscapularis from the rotator cuff. The most relevant segmentation model is partially pre-trained on ImageNet and jointly exploits inter-patient healthy and pathological annotated data. Its performance reaches Dice scores of 82.4%, 82.0%, 71.0% and 82.8% for the deltoid, infraspinatus, supraspinatus and subscapularis muscles. Absolute surface estimation errors are all below 83mm$^2$ except for the supraspinatus with 134.6mm$^2$. These contributions offer new perspectives for force inference in the context of musculo-skeletal disorder management.
http://arxiv.org/abs/1901.01620
Similarity and metric learning provides a principled approach to construct a task-specific similarity from weakly supervised data. However, these methods are subject to the curse of dimensionality: as the number of features grows large, poor generalization is to be expected and training becomes intractable due to high computational and memory costs. In this paper, we propose a similarity learning method that can efficiently deal with high-dimensional sparse data. This is achieved through a parameterization of similarity functions by convex combinations of sparse rank-one matrices, together with the use of a greedy approximate Frank-Wolfe algorithm which provides an efficient way to control the number of active features. We show that the convergence rate of the algorithm, as well as its time and memory complexity, are independent of the data dimension. We further provide a theoretical justification of our modeling choices through an analysis of the generalization error, which depends logarithmically on the sparsity of the solution rather than on the number of features. Our experiments on datasets with up to one million features demonstrate the ability of our approach to generalize well despite the high dimensionality as well as its superiority compared to several competing methods.
http://arxiv.org/abs/1807.07789
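The mechanism that keeps the learned similarity sparse is the Frank-Wolfe linear minimization oracle: each step adds at most one rank-one atom. A minimal sketch of one step; the atom set and step size are simplified relative to the paper's basis.

```python
import numpy as np

def fw_step(M, grad, t, scale=1.0):
    """One Frank-Wolfe step over the convex hull of sparse rank-one
    atoms +/- scale * e_i e_j^T (a simplified atom set). The oracle
    picks the gradient entry of largest magnitude, so each iterate
    touches at most one new pair of features."""
    i, j = np.unravel_index(np.argmax(np.abs(grad)), grad.shape)
    atom = np.zeros_like(M)
    atom[i, j] = -scale * np.sign(grad[i, j])   # best vertex of the hull
    gamma = 2.0 / (t + 2.0)                     # standard FW step size
    return (1 - gamma) * M + gamma * atom
```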
Artificial Intelligence frameworks should allow for ever more autonomous and general systems, in contrast to very narrow and restricted (human pre-defined) domain systems, in analogy to how the brain works. Self-constructive Artificial Intelligence ($SCAI$) is one such possible framework. We herein propose that $SCAI$ is based on three principles of organization: self-growing, self-experimental and self-repairing. Self-growing: the ability to autonomously and incrementally construct structures and functionality as needed to solve encountered (sub)problems. Self-experimental: the ability to internally simulate, anticipate and take decisions based on these expectations. Self-repairing: the ability to autonomously re-construct a previously successful functionality or pattern of interaction lost through a possible sub-component failure (damage). To implement these principles of organization, a constructive architecture capable of evolving adaptive autonomous agents is required. We present Schema-based learning as one such architecture, capable of incrementally constructing a myriad of internal models of three kinds: predictive schemas, dual (inverse model) schemas and goal schemas, as they become necessary to autonomously develop increasing functionality. We claim that artificial systems, whether in the digital or in the physical world, can benefit greatly from this constructive architecture and should be organized around these principles of organization. To illustrate the generality of the proposed framework, we include several test cases in structural adaptive navigation in artificial intelligence systems in Paper II of this series, and resilient robot motor control in Paper III of this series. Paper IV of this series will also include $SCAI$ for problem structural discovery in predictive Business Intelligence.
http://arxiv.org/abs/1901.01989
Live cell microscopy sequences exhibit complex spatial structures and complicated temporal behaviour, making their analysis a challenging task. Considering the cell segmentation problem, which plays a significant role in the analysis, the spatial properties of the data can be captured using Convolutional Neural Networks (CNNs). Recent approaches show promising segmentation results using convolutional encoder-decoders such as the U-Net. Nevertheless, these methods are limited by their inability to incorporate temporal information, which can facilitate segmentation of individual touching cells or of cells that are partially visible. In order to exploit cell dynamics, we propose a novel segmentation architecture which integrates Convolutional Long Short Term Memory (C-LSTM) with the U-Net. The network's unique architecture allows it to capture multi-scale, compact, spatio-temporal encodings in the C-LSTM memory units. The method was evaluated on the Cell Tracking Challenge and achieved state-of-the-art results (1st on the Fluo-N2DH-SIM+ dataset and 2nd on the DIC-C2DL-HeLa dataset). The code is freely available at: https://github.com/arbellea/LSTM-UNet.git
http://arxiv.org/abs/1805.11247
Neural networks (NNs) have become the state of the art in many machine learning applications, especially in image and sound processing [1]. The same, although to a lesser extent [2,3], could be said of natural language processing (NLP) tasks, such as named entity recognition. However, the success of NNs remains dependent on the availability of large labelled datasets, which is a significant hurdle in many important applications. One such case is electronic health records (EHRs), which are arguably the largest source of medical data, most of which lies hidden in natural text [4,5]. Data access is difficult due to data privacy concerns, and therefore annotated datasets are scarce. With scarce data, NNs will likely not be able to extract this hidden information with practical accuracy. In our study, we develop an approach that solves these problems for named entity recognition, obtaining a 94.6 F1 score on the I2B2 2009 Medical Extraction Challenge [6], 4.3 above the architecture that won the competition. Beyond the official I2B2 challenge, we further achieve 82.4 F1 on extracting relationships between medical terms. To reach this state-of-the-art accuracy, our approach applies transfer learning to leverage datasets annotated for other I2B2 tasks, and designs and trains embeddings that especially benefit from such transfer.
http://arxiv.org/abs/1901.01592
Unsupervised learning of cross-lingual word embedding offers elegant matching of words across languages, but has fundamental limitations in translating sentences. In this paper, we propose simple yet effective methods to improve word-by-word translation of cross-lingual embeddings, using only monolingual corpora but without any back-translation. We integrate a language model for context-aware search, and use a novel denoising autoencoder to handle reordering. Our system surpasses state-of-the-art unsupervised neural translation systems without costly iterative training. We also analyze the effect of vocabulary size and denoising type on the translation performance, which provides better understanding of learning the cross-lingual word embedding and its usage in translation.
http://arxiv.org/abs/1901.01590
Convolutional neural networks (CNNs) for biomedical image analysis are often of very large size, resulting in high memory requirement and high latency of operations. Searching for an acceptable compressed representation of the base CNN for a specific imaging application typically involves a series of time-consuming training/validation experiments to achieve a good compromise between network size and accuracy. To address this challenge, we propose CC-Net, a new image complexity-guided CNN compression scheme for biomedical image segmentation. Given a CNN model, CC-Net predicts the final accuracy of networks of different sizes based on the average image complexity computed from the training data. It then selects a multiplicative factor for producing a desired network with acceptable network accuracy and size. Experiments show that CC-Net is effective for generating compressed segmentation networks, retaining up to 95% of the base network segmentation accuracy and utilizing only 0.1% of trainable parameters of the full-sized networks in the best case.
http://arxiv.org/abs/1901.01578
We address for the first time unsupervised training for a translation task with hundreds of thousands of vocabulary words. We scale up the expectation-maximization (EM) algorithm to learn a large translation table without any parallel text or seed lexicon. First, we solve the memory bottleneck and enforce the sparsity with a simple thresholding scheme for the lexicon. Second, we initialize the lexicon training with word classes, which efficiently boosts the performance. Our methods produced promising results on two large-scale unsupervised translation tasks.
http://arxiv.org/abs/1901.01577
Subject matching performance in iris biometrics is contingent upon fast, high-quality iris segmentation. In many cases, iris biometrics acquisition equipment takes a number of images in sequence and combines the segmentation and matching results for each image to strengthen the result. To date, segmentation has occurred in 2D, operating on each image individually. But such methodologies, while powerful, do not take advantage of potential gains in performance afforded by treating sequential images as volumetric data. As a first step in this direction, we apply the Flexible Learning-Free Reconstruction of Neural Volumes (FLoRIN) framework, an open source segmentation and reconstruction framework originally designed for neural microscopy volumes, to volumetric segmentation of iris videos. Further, we introduce a novel dataset of near-infrared iris videos, in which each subject's pupil rapidly changes size due to visible-light stimuli, as a test bed for FLoRIN. We compare the matching performance for iris masks generated by FLoRIN, deep-learning-based (SegNet), and Daugman's (OSIRIS) iris segmentation approaches. We show that by incorporating volumetric information, FLoRIN achieves a factor of 3.6 to an order of magnitude increase in throughput with only a minor drop in subject matching performance. We also demonstrate that FLoRIN-based iris segmentation maintains this speedup on low-resource hardware, making it suitable for embedded biometrics systems.
http://arxiv.org/abs/1901.01575
This work systematically analyzes the smoothing effect of vocabulary reduction for phrase translation models. We extensively compare various word-level vocabularies to show that the performance of smoothing is not significantly affected by the choice of vocabulary. This result provides empirical evidence that the standard phrase translation model is extremely sparse. Our experiments also reveal that vocabulary reduction is more effective for smoothing large-scale phrase tables.
http://arxiv.org/abs/1901.01574
Convergence of Generative Adversarial Networks (GANs) in a high-resolution setting, under the computational constraint of GPU memory capacity (from 12 GB to 24 GB), has been beset with difficulty due to the known lack of convergence-rate stability. In order to boost the convergence of DCGAN (Deep Convolutional Generative Adversarial Networks) and achieve good-looking high-resolution results, we propose a new layered network structure, HDCGAN, that incorporates current state-of-the-art techniques to this effect. A novel dataset, Curtó & Zarza, containing human faces from different ethnic groups in a wide variety of illumination conditions and image resolutions, is introduced. Curtó is enhanced with HDCGAN synthetic images, thus being the first GAN-augmented face dataset. We conduct extensive experiments on CelebA (MS-SSIM 0.1978 and Fréchet Inception Distance 8.77) and Curtó.
http://arxiv.org/abs/1711.06491
Zero-shot Learning (ZSL) aims to recognize objects of unseen classes, whose instances may not have been seen during training. It associates seen and unseen classes through a common semantic space and provides visual features for each data instance. Most existing methods first learn a compatible projection function between the semantic space and the visual space based on the data of the source seen classes, then directly apply it to the target unseen classes. However, in real scenarios, the data distributions of the source and target domains might not match well, causing the well-known domain shift problem. Based on the observation that the visual features of test instances can be separated into different clusters, we propose a visual structure constraint on class centers for transductive ZSL, to improve the generality of the projection function (i.e., alleviate the above domain shift problem). Specifically, two different strategies (symmetric Chamfer distance and bipartite matching) are adopted to align the projected unseen semantic centers and the visual cluster centers of the test instances. Experiments on three widely used datasets demonstrate that the proposed visual structure constraint brings substantial performance gains consistently and achieves state-of-the-art results.
http://arxiv.org/abs/1901.01570
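The first of the two alignment strategies is straightforward to write down. A minimal sketch of the symmetric Chamfer distance between projected unseen class centers and visual cluster centers; the center matrices here are random stand-ins.

```python
import numpy as np

def symmetric_chamfer(A, B):
    """Symmetric Chamfer distance between point sets A (m, d) and
    B (n, d): mean nearest-neighbor distance in both directions."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

proj_centers = np.random.randn(10, 512)    # stand-in projected semantic centers
visual_centers = np.random.randn(10, 512)  # stand-in k-means cluster centers
alignment_loss = symmetric_chamfer(proj_centers, visual_centers)
```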
Recently, image-to-image translation has received increasing attention; it aims to map images in one domain to another specific domain. Existing methods mainly solve this task via deep generative models and focus on exploring the relationship between different domains. However, these methods neglect to utilize higher-level and instance-specific information to guide the training process, leading to a great deal of unrealistic generated images of low quality. Existing methods also lack spatial controllability during translation. To address these challenges, we propose a novel Segmentation Guided Generative Adversarial Network (SGGAN), which leverages semantic segmentation to further boost the generation performance and provide spatial mapping. In particular, a segmentor network is designed to impose semantic information on the generated images. Experimental results on a multi-domain face image translation task empirically demonstrate our method's capacity for spatial modification and its superiority in image quality over several state-of-the-art methods.
http://arxiv.org/abs/1901.01569
We address the vessel segmentation problem by building upon the multiscale feature learning method of Kiros et al., which achieves the current top score in the VESSEL12 MICCAI challenge. Following their idea of feature learning instead of hand-crafted filters, we have extended the method to learn 3D features. The features are learned in an unsupervised manner in a multi-scale scheme using dictionary learning via least angle regression. The 3D feature kernels are further convolved with the input volumes in order to create feature maps. Those maps are used to train a supervised classifier with the annotated voxels. In order to process the 3D data with a large number of filters, a parallel implementation has been developed. The algorithm has been applied to the example scans and annotations provided by the VESSEL12 challenge. We have compared our setup with Kiros et al. by running their implementation. Our current results show an improvement in accuracy over the slice-wise method, from 96.66$\pm$1.10% to 97.24$\pm$0.90%.
http://arxiv.org/abs/1901.01562
In this paper, we address the problem of quantifying reliability of computational saliency for videos, which can be used to improve saliency-based video processing and enable more reliable performance and risk assessment of such processing. Our approach is twofold. First, we explore spatial correlations in both saliency map and eye-fixation map. Then, we learn spatiotemporal correlations that define a reliable saliency map. We first study spatiotemporal eye-fixation data from a public dataset and investigate a common feature in human visual attention, which dictates correlation in saliency between a pixel and its direct neighbors. Based on the study, we then develop an algorithm that estimates a pixel-wise uncertainty map that reflects our confidence in the associated computational saliency map by relating a pixel’s saliency to the saliency of its neighbors. To estimate such uncertainties, we measure the divergence of a pixel, in a saliency map, from its local neighborhood. Additionally, we propose a systematic procedure to evaluate the estimation performance by explicitly computing uncertainty ground truth as a function of a given saliency map and eye fixations of human subjects. In our experiments, we explore multiple definitions of locality and neighborhoods in spatiotemporal video signals. In addition, we examine the relationship between the parameters of our proposed algorithm and the content of the videos. The proposed algorithm is unsupervised, making it more suitable for generalization to most natural videos. Also, it is computationally efficient and flexible for customization to specific video content. Experiments using three publicly available video datasets show that the proposed algorithm outperforms state-of-the-art uncertainty estimation methods with improvement in accuracy up to 63% and offers efficiency and flexibility that make it more useful in practical situations.
http://arxiv.org/abs/1901.01550
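The estimator's core idea, measuring how much a pixel's saliency diverges from its local neighborhood, can be sketched in a few lines. This spatial-only version with a box neighborhood and absolute deviation is a stand-in; the paper's spatiotemporal neighborhoods and divergence measure are richer.

```python
import numpy as np
from scipy import ndimage

def uncertainty_map(saliency, size=5):
    """Per-pixel uncertainty as the absolute deviation of a pixel's
    saliency from its local neighborhood mean (illustrative choice)."""
    local_mean = ndimage.uniform_filter(saliency.astype(float), size=size)
    return np.abs(saliency - local_mean)

saliency = np.random.rand(240, 320)   # stand-in computational saliency map
uncertainty = uncertainty_map(saliency)
```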
By using the precise timing of every spike, spiking supervised learning can capture complex spatial-temporal patterns better than learning based only on neuronal firing rates. The purpose of spiking supervised learning after temporal encoding is to make neural networks emit desired spike trains with precise firing times. Existing spiking supervised learning algorithms perform well, but most of them operate in an offline manner or still have some problems. Based on an online regulation mechanism of biological neuronal synapses, this paper proposes an online supervised learning algorithm over multiple spike trains for spiking neural networks. The proposed algorithm adjusts weights as soon as the firing time of an output spike is obtained. The relationship among the desired output, actual output and input spike trains is first analyzed and synthesized to select spikes simply and correctly for a direct regulation, and then a computational method is constructed based on simple spike triples using this direct regulation. Experimental results show that this online supervised algorithm clearly improves learning performance compared with offline approaches and has higher learning accuracy and efficiency than other learning algorithms.
http://arxiv.org/abs/1901.01549
How to understand deep learning systems remains an open problem. In this paper we propose that the answer may lie in the geometrization of deep networks. Geometrization is a bridge connecting physics, geometry, deep networks and quantum computation, and it may result in a new scheme to reveal the rules of the physical world. By comparing the geometry of image matching and deep networks, we show that the geometrization of deep networks can be used to understand existing deep learning systems and may also help to solve their interpretability problem.
http://arxiv.org/abs/1901.02354
In this work we present a method to improve the pruning step of the current state-of-the-art methodology for compressing neural networks. The novelty of the proposed pruning technique is its differentiability, which allows pruning to be performed during the backpropagation phase of network training. This enables end-to-end learning and strongly reduces training time. The technique is based on a family of differentiable pruning functions and a new regularizer specifically designed to enforce pruning. The experimental results show that the joint optimization of both the thresholds and the network weights makes it possible to reach a higher compression rate, reducing the number of weights of the pruned network by a further 14% to 33% compared to the current state of the art. Furthermore, we believe that this is the first study to analyze the generalization capabilities, in transfer learning tasks, of the features extracted by a pruned network. To this end, we show that the representations learned using the proposed pruning methodology maintain the same effectiveness and generality as those learned by the corresponding non-compressed network on a set of different recognition tasks.
http://arxiv.org/abs/1712.01721
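The abstract does not spell out the pruning function family or the regularizer, so the sketch below uses a steep sigmoid gate on |w| minus a learnable threshold, with a gate-sum regularizer, as stand-ins for both. It illustrates the key property: the threshold receives gradients and is trained jointly with the weights.

```python
import torch
import torch.nn as nn

class SoftThresholdPrune(nn.Module):
    """Differentiable pruning sketch: weights are gated by a steep
    sigmoid of (|w| - threshold); the threshold is a learnable
    parameter updated by backprop alongside the weights."""
    def __init__(self, weight_shape, beta=50.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(weight_shape) * 0.1)
        self.threshold = nn.Parameter(torch.tensor(0.05))
        self.beta = beta

    def gate(self):
        return torch.sigmoid(self.beta * (self.weight.abs() - self.threshold))

    def forward(self):
        return self.weight * self.gate()   # effective (pruned) weights

    def sparsity_regularizer(self):
        return self.gate().sum()           # pushes gates toward zero
```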
Deep Reinforcement Learning (DRL) has become increasingly powerful in recent years, with notable achievements such as DeepMind's AlphaGo. It has been successfully deployed in commercial vehicles like Mobileye's path planning system. However, the vast majority of work on DRL is focused on toy examples in controlled synthetic car simulator environments such as TORCS and CARLA. In general, DRL is still in its infancy in terms of usability in real-world applications. Our goal in this paper is to encourage real-world deployment of DRL in various autonomous driving (AD) applications. We first provide an overview of the tasks in autonomous driving systems, reinforcement learning algorithms and applications of DRL to AD systems. We then discuss the challenges which must be addressed to enable further progress towards real-world deployment.
http://arxiv.org/abs/1901.01536
In this paper, we consider the problem of reconstructing a dense 3D model using images captured from different views. Recent methods based on convolutional neural networks (CNN) allow learning the entire task from data. However, they do not incorporate the physics of image formation such as perspective geometry and occlusion. Instead, classical approaches based on Markov Random Fields (MRF) with ray-potentials explicitly model these physical processes, but they cannot cope with large surface appearance variations across different viewpoints. In this paper, we propose RayNet, which combines the strengths of both frameworks. RayNet integrates a CNN that learns view-invariant feature representations with an MRF that explicitly encodes the physics of perspective projection and occlusion. We train RayNet end-to-end using empirical risk minimization. We thoroughly evaluate our approach on challenging real-world datasets and demonstrate its benefits over a piece-wise trained baseline, hand-crafted models as well as other learning-based approaches.
http://arxiv.org/abs/1901.01535
In recent years, face detection has experienced significant performance improvements with the boost of deep convolutional neural networks. In this report, we reimplement the state-of-the-art detector SRN and apply some tricks proposed in the recent literature to obtain an extremely strong face detector, named VIM-FD. Specifically, we exploit a more powerful backbone network, DenseNet-121, revisit the data augmentation based on data-anchor-sampling proposed in PyramidBox, and use the max-in-out label and anchor matching strategy from SFD. In addition, we introduce an attention mechanism to provide additional supervision. On the most popular and challenging face detection benchmark, i.e., WIDER FACE, the proposed VIM-FD achieves state-of-the-art performance.
http://arxiv.org/abs/1901.02350
Acoustic scene classification is the task of identifying the scene from which an audio signal is recorded. Convolutional neural network (CNN) models are widely adopted, with proven success in acoustic scene classification. However, there is little insight into how an audio scene is perceived by a CNN, in contrast to what has been demonstrated in image recognition research. In the present study, Class Activation Mapping (CAM) is utilized to analyze how the log-magnitude Mel-scale filter-bank (log-Mel) features of different acoustic scenes are learned in a CNN classifier. We note that distinct high-energy time-frequency components of audio signals generally do not correspond to strong activation on the CAM, while the background sound texture is well learned by the CNN. In order to make the sound texture more salient, we propose to apply the Difference of Gaussians (DoG) and the Sobel operator to the log-Mel features to enhance edge information of the time-frequency image. Experimental results on the DCASE 2017 ASC challenge show that using edge-enhanced log-Mel images as the input features of a CNN significantly improves the performance of audio scene classification.
http://arxiv.org/abs/1901.01502
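The proposed edge enhancement is a pair of classical image filters applied to the log-Mel time-frequency image. A minimal sketch with SciPy; the sigma values and the gradient-magnitude combination of the two Sobel directions are illustrative choices.

```python
import numpy as np
from scipy import ndimage

def enhance_log_mel(log_mel, sigma1=1.0, sigma2=2.0):
    """Edge-enhance a log-Mel image with Difference of Gaussians and
    the Sobel operator."""
    dog = (ndimage.gaussian_filter(log_mel, sigma1)
           - ndimage.gaussian_filter(log_mel, sigma2))
    sobel = np.hypot(ndimage.sobel(log_mel, axis=0),   # frequency axis
                     ndimage.sobel(log_mel, axis=1))   # time axis
    return dog, sobel

log_mel = np.random.randn(40, 500)   # stand-in: 40 Mel bands x 500 frames
dog, sobel = enhance_log_mel(log_mel)
# The enhanced images can be stacked with log_mel as CNN input channels.
```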
Probability density estimation is a classical and well studied problem, but standard density estimation methods have historically lacked the power to model complex and high-dimensional image distributions. More recent generative models leverage the power of neural networks to implicitly learn and represent probability models over complex images. We describe methods to extract explicit probability density estimates from GANs, and explore the properties of these image density functions. We perform sanity check experiments to provide evidence that these probabilities are reasonable. However, we also show that density functions of natural images are difficult to interpret and thus limited in use. We study reasons for this lack of interpretability, and show that we can get interpretability back by doing density estimation on latent representations of images.
http://arxiv.org/abs/1901.01499
The Variational Autoencoder (VAE), a simple and effective deep generative model, has led to a number of impressive empirical successes and spawned many advanced variants and theoretical investigations. However, recent studies demonstrate that, when equipped with expressive generative distributions (aka decoders), the VAE suffers from learning uninformative latent representations, an observation called KL vanishing, in which case the VAE collapses into an unconditional generative model. In this work, we introduce mutual posterior-divergence regularization, a novel regularization that is able to control the geometry of the latent space to accomplish meaningful representation learning, while achieving comparable or superior capability in density estimation. Experiments on three image benchmark datasets demonstrate that, when equipped with powerful decoders, our model performs well in both density estimation and representation learning.
http://arxiv.org/abs/1901.01498
We present the design, characterization, and experimental results for a new modular robotic system for programmable self-assembly. The proposed system uses the Hybrid Cube Model (HCM), which integrates classical features from both deterministic and stochastic self-organization models. Thus, for instance, the modules are passive as far as their locomotion is concerned (stochastic), and yet they possess an active undocking routine (deterministic). The robots are constructed entirely from readily accessible components, and unlike many existing robots, their excitation is not fluid mediated. Instead, the actuation setup is a solid state, independently programmable, and highly portable platform. The system is capable of demonstrating fully autonomous and distributed stochastic self-assembly in two dimensions. It is shown to emulate the performance of several existing modular systems and promises to be a substantial effort towards developing a universal testbed for programmable self-assembly algorithms.
http://arxiv.org/abs/1901.01497
Biology can provide biomimetic components and new control principles for robotics. Developing a robot system equipped with bionic eyes is a difficult but exciting task. Researchers have been studying the control mechanisms of bionic eyes for many years, and numerous models are available. In this paper, control models for bionic eyes and their implementation on robots are reviewed, covering saccades, smooth pursuit, vergence, the vestibulo-ocular reflex (VOR), the optokinetic reflex (OKR) and eye-head coordination. Moreover, some problems and possible solutions in the field of bionic eyes are discussed and analyzed. This review can be used as a guide for researchers to identify potential research problems and solutions for the motion control of bionic eyes.
http://arxiv.org/abs/1901.01494
The attention mechanism is a hot topic in deep learning. Using a channel attention model is an effective method for improving the performance of convolutional neural networks. The Squeeze-and-Excitation (SE) block takes advantage of channel dependence, selectively emphasizing important channels and compressing relatively useless ones. In this paper, we propose a variant of the SE block based on channel locality. Instead of using fully connected layers to explore the global channel dependence, we adopt convolutional layers to learn the correlation between nearby channels. We term this new algorithm the Channel Locality (C-Local) block. We evaluate the SE block and the C-Local block by applying them to different CNN architectures on the CIFAR-10 dataset. We observe that our C-Local block achieves higher accuracy than the SE block.
http://arxiv.org/abs/1901.01493
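A minimal sketch of a C-Local-style block, assuming the "convolution over nearby channels" is a 1D convolution across the channel axis after global average pooling; the kernel size and the sigmoid gating are illustrative choices, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn

class CLocalBlock(nn.Module):
    """Channel-locality attention sketch: squeeze with global average
    pooling, then model nearby-channel correlations with a 1D conv
    across the channel axis instead of SE's fully connected layers."""
    def __init__(self, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                     # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))                # squeeze -> (N, C)
        w = torch.sigmoid(self.conv(s.unsqueeze(1))).squeeze(1)
        return x * w.unsqueeze(-1).unsqueeze(-1)  # reweight channels

out = CLocalBlock()(torch.randn(2, 64, 8, 8))
```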
Long-term planning poses a major difficulty for many reinforcement learning algorithms. This problem becomes even more pronounced in dynamic visual environments. In this work we propose Hierarchical Planning and Reinforcement Learning (HIP-RL), a method for merging the benefits and capabilities of Symbolic Planning with the learning abilities of Deep Reinforcement Learning. We apply HIP-RL to the complex visual tasks of interactive question answering and visual semantic planning and achieve state-of-the-art results on three challenging datasets, all while taking fewer steps at test time and training in fewer iterations. Sample results can be found at youtu.be/0TtWJ_0mPfI
http://arxiv.org/abs/1901.01492
Adversarial examples have been demonstrated to threaten many computer vision tasks, including object detection. However, the existing attack methods for object detection have two limitations: poor transferability, meaning that the generated adversarial examples have a low success rate when attacking other kinds of detection methods, and high computational cost, meaning that they need more time to generate an adversarial image and are therefore difficult to apply to video data. To address these issues, we utilize a generative mechanism to obtain the adversarial image and video, which reduces the processing time. To enhance transferability, we destroy the feature maps extracted by the feature network, which usually constitutes the basis of object detectors. The proposed method is based on the Generative Adversarial Network (GAN) framework, where we combine a high-level class loss and a low-level feature loss to jointly train the adversarial example generator. A series of experiments conducted on the PASCAL VOC and ImageNet VID datasets show that our method can efficiently generate image and video adversarial examples and, more importantly, that these adversarial examples have better transferability and are thus able to simultaneously attack two kinds of representative object detection models: proposal-based models like Faster-RCNN and regression-based models like SSD.
https://arxiv.org/abs/1811.12641
Multisection continuum arms offer complementary characteristics to those of traditional rigid-bodied robots. Inspired by biological appendages, such as elephant trunks and octopus arms, these robots trade rigidity for compliance and accuracy for safety, and therefore exhibit strong potential for applications in human-occupied spaces. Prior work has demonstrated their superiority in operation in congested spaces and manipulation of irregularly shaped objects. However, they are yet to be widely applied outside laboratory spaces. One key reason is that, due to compliance, they are difficult to control. Sophisticated and numerically efficient dynamic models are a necessity for implementing dynamic control. In this paper, we propose a novel, numerically stable, center-of-gravity-based dynamic model for variable-length multisection continuum arms. The model can accommodate continuum robots having any number of sections with varying physical dimensions. The dynamic algorithm is of $O(n^2)$ complexity, runs at 9.5 kHz, simulates 6-8 times faster than real time for a three-section continuum robot, and is therefore ideally suited for real-time control implementations. The model accuracy is validated numerically against an integral-dynamics model proposed by the authors and experimentally on a three-section, pneumatically actuated variable-length multisection continuum arm. This is the first sub-real-time dynamic model based on a smooth continuous deformation model for variable-length multisection continuum arms.
http://arxiv.org/abs/1901.01479
Hashing has been recognized as an efficient representation learning method for effectively handling big data, due to its low computational complexity and memory cost. Most of the existing hashing methods focus on learning low-dimensional vectorized binary features based on high-dimensional raw vectorized features. However, studies on how to obtain preferable binary codes from original 2D image features for retrieval are very limited. This paper proposes a bilinear supervised discrete hashing (BSDH) method based on 2D image features, which utilizes bilinear projections to binarize the image matrix features such that the intrinsic characteristics of the 2D image space are preserved in the learned binary codes. Meanwhile, the bilinear projection approximation and vectorized binary code regression are seamlessly integrated to formulate the final robust learning framework. Furthermore, a discrete optimization strategy is developed to alternately update each variable to obtain high-quality binary codes. In addition, two 2D image features, a traditional SURF-based FVLAD feature and a CNN-based AlexConv5 feature, are designed to further improve the performance of the proposed BSDH method. Results of extensive experiments conducted on four benchmark datasets show that the proposed BSDH method outperforms almost all competing hashing methods with different input features under different evaluation protocols.
http://arxiv.org/abs/1901.01474
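The encoding step of a bilinear scheme like BSDH is compact: project the image matrix from both sides and take signs. The sketch below shows only this encoding; learning U and V (BSDH's supervised discrete optimization) is elided, and random projections stand in for them.

```python
import numpy as np

def bilinear_hash(X, U, V):
    """Encode a 2D feature X (h, w) as binary codes sign(U^T X V),
    flattened into a code vector. U, V would be learned in BSDH;
    here they are random stand-ins."""
    return np.sign(U.T @ X @ V).ravel()

h, w, c1, c2 = 64, 64, 8, 8                 # c1 * c2 = 64-bit code
U, V = np.random.randn(h, c1), np.random.randn(w, c2)
code = bilinear_hash(np.random.randn(h, w), U, V)
```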
The paper proves sum-of-squares-of-rational-functions based representations (shortly, sosrf-based representations) of polynomial matrices that are positive semidefinite on some special sets: $\mathbb{R}^n;$ $\mathbb{R}$ and its intervals $[a,b]$, $[0,\infty)$; and the strips $[a,b] \times \mathbb{R} \subset \mathbb{R}^2.$ A method for numerically computing such representations is also presented. The methodology is divided into two stages: (S1) diagonalizing the initial polynomial matrix based on Schmüdgen's procedure \cite{Schmudgen09}; (S2) for each diagonal element of the resulting matrix, finding a low-rank sosrf representation satisfying Artin's theorem solving Hilbert's 17th problem. Some numerical tests and illustrations with \textsf{OCTAVE} are also presented for each type of polynomial matrix.
http://arxiv.org/abs/1901.02360
Data-driven generative modeling has made remarkable progress by leveraging the power of deep neural networks. A recurring challenge is how to sample a rich variety of data from the entire target distribution, rather than only from the distribution of the training data. In other words, we would like the generative model to go beyond the observed training samples and learn to also generate “unseen” data. In our work, we present a generative neural network for shapes that is based on a part-based prior, where the key idea is for the network to synthesize shapes by varying both the shape parts and their compositions. Treating a shape not as an unstructured whole, but as a (re-)composable set of deformable parts, adds a combinatorial dimension to the generative process and enriches the diversity of the output, encouraging the generator to venture more into the “unseen”. We show that our part-based model generates a richer variety of feasible shapes compared with a baseline generative model. To this end, we introduce two quantitative metrics to evaluate the ingenuity of the generative model and assess how well the generated data covers both the training data and unseen data from the same target distribution.
http://arxiv.org/abs/1811.07441