This paper presents a Poisson multi-Bernoulli mixture (PMBM) conjugate prior for multiple extended object filtering. A Poisson point process is used to describe the existence of yet undetected targets, while a multi-Bernoulli mixture describes the distribution of the targets that have been detected. The prediction and update equations are presented for the standard transition density and measurement likelihood. Both the prediction and the update preserve the PMBM form of the density, and in this sense the PMBM density is a conjugate prior. However, the unknown data associations lead to an intractably large number of terms in the PMBM density, and approximations are necessary for tractability. A gamma Gaussian inverse Wishart implementation is presented, along with methods to handle the data association problem. A simulation study shows that the extended target PMBM filter performs well in comparison to the extended target d-GLMB and LMB filters. An experiment with Lidar data illustrates the benefit of tracking both detected and undetected targets.
We propose a new algorithm for color transfer between images that have perceptually similar semantic structures. We aim to achieve a more accurate color transfer that leverages semantically-meaningful dense correspondence between images. To accomplish this, our algorithm uses neural representations for matching. Additionally, the color transfer should be spatially variant and globally coherent. Therefore, our algorithm optimizes a local linear model for color transfer satisfying both local and global constraints. Our proposed approach jointly optimizes matching and color transfer, adopting a coarse-to-fine strategy. The proposed method can be successfully extended from one-to-one to one-to-many color transfer. The latter further addresses the problem of mismatching elements of the input image. We validate our proposed method by testing it on a large variety of image content.
We describe a new approach that improves the training of generative adversarial nets (GANs) for synthesizing diverse images from a text input. Our approach is based on the conditional version of GANs and expands on previous work leveraging an auxiliary task in the discriminator. Our generated images are not limited to certain classes and do not suffer from mode collapse while semantically matching the text input. A key to our training methods is how to form positive and negative training examples with respect to the class label of a given image. Instead of selecting random training examples, we perform negative sampling based on the semantic distance from a positive example in the class. We evaluate our approach using the Oxford-102 flower dataset, adopting the inception score and multi-scale structural similarity index (MS-SSIM) metrics to assess discriminability and diversity of the generated images. The empirical results indicate greater diversity in the generated images, especially when we gradually select more negative training examples closer to a positive example in the semantic space.
Feature extraction analysis has been widely investigated during the last decades in computer vision community due to the large range of possible applications. Significant work has been done in order to improve the performance of the emotion detection methods. Classification algorithms have been refined, novel preprocessing techniques have been applied and novel representations from images and videos have been introduced. In this paper, we propose a preprocessing method and a novel facial landmarks’ representation aiming to improve the facial emotion detection accuracy. We apply our novel methodology on the extended Cohn-Kanade (CK+) dataset and other datasets for affect classification based on Action Units (AU). The performance evaluation demonstrates an improvement on facial emotion classification (accuracy and F1 score) that indicates the superiority of the proposed methodology.
Hyper-heuristics are a novel tool. They deal with complex optimization problems where standalone solvers exhibit varied performance. Among such a tool reside selection hyper-heuristics. By combining the strengths of each solver, this kind of hyper-heuristic offers a more robust tool. However, their effectiveness is highly dependent on the ‘features’ used to link them with the problem that is being solved. Aiming at enhancing selection hyper-heuristics, in this paper we propose two types of transformation: explicit and implicit. The first one directly changes the distribution of critical points within the feature domain while using a Euclidean distance to measure proximity. The second one operates indirectly by preserving the distribution of critical points but changing the distance metric through a kernel function. We focus on analyzing the effect of each kind of transformation, and of their combinations. We test our ideas in the domain of constraint satisfaction problems because of their popularity and many practical applications. In this work, we compare the performance of our proposals against those of previously published data. Furthermore, we expand on previous research by increasing the number of analyzed features. We found that, by incorporating transformations into the model of selection hyper-heuristics, overall performance can be improved, yielding more stable results. However, combining implicit and explicit transformations was not as fruitful. Additionally, we ran some confirmatory tests on the domain of knapsack problems. Again, we observed improved stability, leading to the generation of hyper-heuristics whose profit had a standard deviation between 20% and 30% smaller.
Learning useful representations with little or no supervision is a key challenge in artificial intelligence. We provide an in-depth review of recent advances in representation learning with a focus on autoencoder-based models. To organize these results we make use of meta-priors believed useful for downstream tasks, such as disentanglement and hierarchical organization of features. In particular, we uncover three main mechanisms to enforce such properties, namely (i) regularizing the (approximate or aggregate) posterior distribution, (ii) factorizing the encoding and decoding distribution, or (iii) introducing a structured prior distribution. While there are some promising results, implicit or explicit supervision remains a key enabler and all current methods use strong inductive biases and modeling assumptions. Finally, we provide an analysis of autoencoder-based representation learning through the lens of rate-distortion theory and identify a clear tradeoff between the amount of prior knowledge available about the downstream tasks, and how useful the representation is for this task.
In this paper we illustrate how to perform both visual object tracking and semi-supervised video object segmentation, in real-time, with a single simple approach. Our method, dubbed SiamMask, improves the offline training procedure of popular fully-convolutional Siamese approaches for object tracking by augmenting their loss with a binary segmentation task. Once trained, SiamMask solely relies on a single bounding box initialisation and operates online, producing class-agnostic object segmentation masks and rotated bounding boxes at 35 frames per second. Despite its simplicity, versatility and fast speed, our strategy allows us to establish a new state-of-the-art among real-time trackers on VOT-2018, while at the same time demonstrating competitive performance and the best speed for the semi-supervised video object segmentation task on DAVIS-2016 and DAVIS-2017. The project website is this http URL
Recently, increasing attention has been drawn to training semantic segmentation models using synthetic data and computer-generated annotation. However, domain gap remains a major barrier and prevents models learned from synthetic data from generalizing well to real-world applications. In this work, we take the advantage of additional geometric information from synthetic data, a powerful yet largely neglected cue, to bridge the domain gap. Such geometric information can be generated easily from synthetic data, and is proven to be closely coupled with semantic information. With the geometric information, we propose a model to reduce domain shift on two levels: on the input level, we augment the traditional image translation network with the additional geometric information to translate synthetic images into realistic styles; on the output level, we build a task network which simultaneously performs depth estimation and semantic segmentation on the synthetic data. Meanwhile, we encourage the network to preserve correlation between depth and semantics by adversarial training on the output space. We then validate our method on two pairs of synthetic to real dataset: Virtual KITTI to KITTI, and SYNTHIA to Cityscapes, where we achieve a significant performance gain compared to the non-adapt baseline and methods using only semantic label. This demonstrates the usefulness of geometric information from synthetic data for cross-domain semantic segmentation.
To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank—supportive information extracted over the entire span of a video—to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades.
Machine learning models are vulnerable to simple model stealing attacks if the adversary can obtain output labels for chosen inputs. To protect against these attacks, it has been proposed to limit the information provided to the adversary by omitting probability scores, significantly impacting the utility of the provided service. In this work, we illustrate how a service provider can still provide useful, albeit misleading, class probability information, while significantly limiting the success of the attack. Our defense forces the adversary to discard the class probabilities, requiring significantly more queries before they can train a model with comparable performance. We evaluate several attack strategies, model architectures, and hyperparameters under varying adversarial models, and evaluate the efficacy of our defense against the strongest adversary. Finally, we quantify the amount of noise injected into the class probabilities to mesure the loss in utility, e.g., adding 1.26 nats per query on CIFAR-10 and 3.27 on MNIST. Our evaluation shows our defense can degrade the accuracy of the stolen model at least 20%, or require up to 64 times more queries while keeping the accuracy of the protected model almost intact.
In this paper we propose an approach to avoiding catastrophic forgetting in sequential task learning scenarios. Our technique is based on a network reparameterization that approximately diagonalizes the Fisher Information Matrix of the network parameters. This reparameterization takes the form of a factorized rotation of parameter space which, when used in conjunction with Elastic Weight Consolidation (which assumes a diagonal Fisher Information Matrix), leads to significantly better performance on lifelong learning of sequential tasks. Experimental results on the MNIST, CIFAR-100, CUB-200 and Stanford-40 datasets demonstrate that we significantly improve the results of standard elastic weight consolidation, and that we obtain competitive results when compared to other state-of-the-art in lifelong learning without forgetting.
Anomaly detection is a challenging problem in intelligent video surveillance. Most existing methods are computation consuming, which cannot satisfy the real-time requirement. In this paper, we propose a real-time anomaly detection framework with low computational complexity and high efficiency. A new feature, named Histogram of Magnitude Optical Flow (HMOF), is proposed to capture the motion of video patches. Compared with existing feature descriptors, HMOF is more sensitive to motion magnitude and more efficient to distinguish anomaly information. The HMOF features are computed for foreground patches, and are reconstructed by the auto-encoder for better clustering. Then, we use Gaussian Mixture Model (GMM) Classifiers to distinguish anomalies from normal activities in videos. Experimental results show that our framework outperforms state-of-the-art methods, and can reliably detect anomalies in real-time.
Attention level estimation systems have a high potential in many use cases, such as human-robot interaction, driver modeling and smart home systems, since being able to measure a person’s attention level opens the possibility to natural interaction between humans and computers. The topic of estimating a human’s visual focus of attention has been actively addressed recently in the field of HCI. However, most of these previous works do not consider attention as a subjective, cognitive attentive state. New research within the field also faces the problem of the lack of annotated datasets regarding attention level in a certain context. The novelty of our work is two-fold: First, we introduce a new annotation framework that tackles the subjective nature of attention level and use it to annotate more than 100,000 images with three attention levels and second, we introduce a novel method to estimate attention levels, relying purely on extracted geometric features from RGB and depth images, and evaluate it with a deep learning fusion framework. The system achieves an overall accuracy of 80.02%. Our framework and attention level annotations are made publicly available.
Most existing semantic segmentation methods employ atrous convolution to enlarge the receptive field of filters, but neglect partial information. To tackle this issue, we firstly propose a novel Kronecker convolution which adopts Kronecker product to expand the standard convolutional kernel for taking into account the partial feature neglected by atrous convolutions. Therefore, it can capture partial information and enlarge the receptive field of filters simultaneously without introducing extra parameters. Secondly, we propose Tree-structured Feature Aggregation (TFA) module which follows a recursive rule to expand and forms a hierarchical structure. Thus, it can naturally learn representations of multi-scale objects and encode hierarchical contextual information in complex scenes. Finally, we design Tree-structured Kronecker Convolutional Networks (TKCN) which employs Kronecker convolution and TFA module. Extensive experiments on three datasets, PASCAL VOC 2012, PASCAL-Context and Cityscapes, verify the effectiveness of our proposed approach. We make the code and the trained model publicly available at https://github.com/wutianyiRosun/TKCN.
Face sketch synthesis has made great progress in the past few years. Recent methods based on deep neural networks are able to generate high quality sketches from face photos. However, due to the lack of training data (photo-sketch pairs), none of such deep learning based methods can be applied successfully to face photos in the wild. In this paper, we propose a semi-supervised deep learning architecture which extends face sketch synthesis to handle face photos in the wild by exploiting additional face photos in training. Instead of supervising the network with ground truth sketches, we first perform patch matching in feature space between the input photo and photos in a small reference set of photo-sketch pairs. We then compose a pseudo sketch feature representation using the corresponding sketch feature patches to supervise our network. With the proposed approach, we can train our networks using a small reference set of photo-sketch pairs together with a large face photo dataset without ground truth sketches. Experiments show that our method achieve state-of-the-art performance both on public benchmarks and face photos in the wild. Codes are available at https://github.com/chaofengc/Face-Sketch-Wild.
Purpose: To perform and evaluate water and fat signal separation of whole-body gradient echo scans using convolutional neural networks. Methods: Whole-body gradient echo scans of 240 subjects, each consisting of five bipolar echoes, were used. Reference fat fraction maps were created using a conventional method. Convolutional neural networks, more specifically 2D U-nets, were trained using 5-fold cross-validation with one or several echoes as input, using the squared difference between the output and the reference fat fraction maps as the loss function. The outputs of the networks were assessed by the loss function, measured liver fat fractions, and visually. Training was performed using a GPU. Inference was performed both using the GPU as well as a CPU. Results: The final loss of the validation data decreased when using more echoes as input, and the loss curves indicated convergence. The liver fat fractions could be estimated using only one echo, but results were improved by use of more echoes. Visual assessment found the quality of the outputs of the networks to be similar to the reference even when using only one echo, with slight improvements when using more echoes. Training a network took at most 28.6 h. Inference time of a whole-body scan took at most 3.7 s using the GPU and 5.8 min using the CPU. Conclusion: It is possible to perform water and fat signal separation of whole-body gradient echo scans using convolutional neural networks. Separation was possible using only one echo, although using more echoes improved the results.
The semantic segmentation requires a lot of computational cost. The dilated convolution relieves this burden of complexity by increasing the receptive field without additional parameters. For a more lightweight model, using depth-wise separable convolution is one of the practical choices. However, a simple combination of these two methods results in too sparse an operation which might cause severe performance degradation. To resolve this problem, we propose a new block of Concentrated-Comprehensive Convolution (CCC) which takes both advantages of the dilated convolution and the depth-wise separable convolution. The CCC block consists of an information concentration stage and a comprehensive convolution stage. The first stage uses two depth-wise asymmetric convolutions for compressed information from the neighboring pixels. The second stage increases the receptive field by using a depth-wise separable dilated convolution from the feature map of the first stage. By replacing the conventional ESP module with the proposed CCC module, without accuracy degradation in Cityscapes dataset, we could reduce the number of parameters by half and the number of flops by 35% compared to the ESPnet which is one of the fastest models. We further applied the CCC to other segmentation models based on dilated convolution and our method achieved comparable or higher performance with a decreased number of parameters and flops. Finally, experiments on ImageNet classification task show that CCC can successfully replace dilated convolutions.
In this paper, we propose a novel heart segmentation pipeline Combining Faster R-CNN and U-net Network (CFUN). Due to Faster R-CNN’s precise localization ability and U-net’s powerful segmentation ability, CFUN needs only one-step detection and segmentation inference to get the whole heart segmentation result, obtaining good results with significantly reduced computational cost. Besides, CFUN adopts a new loss function based on edge information named 3D Edge-loss as an auxiliary loss to accelerate the convergence of training and improve the segmentation results. Extensive experiments on the public dataset show that CFUN exhibits competitive segmentation performance in a sharply reduced inference time. Our source code and the model are publicly available at https://github.com/Wuziyi616/CFUN.
Natural evolution gives the impression of leading to an open-ended process of increasing diversity and complexity. If our goal is to produce such open-endedness artificially, this suggests an approach driven by evolutionary metaphor. On the other hand, techniques from machine learning and artificial intelligence are often considered too narrow to provide the sort of exploratory dynamics associated with evolution. In this paper, we hope to bridge that gap by reviewing common barriers to open-endedness in the evolution-inspired approach and how they are dealt with in the evolutionary case - collapse of diversity, saturation of complexity, and failure to form new kinds of individuality. We then show how these problems map onto similar issues in the machine learning approach, and discuss how the same insights and solutions which alleviated those barriers in evolutionary approaches can be ported over. At the same time, the form these issues take in the machine learning formulation suggests new ways to analyze and resolve barriers to open-endedness. Ultimately, we hope to inspire researchers to be able to interchangeably use evolutionary and gradient-descent-based machine learning methods to approach the design and creation of open-ended systems.
Individual pig detection and tracking is an important requirement in many video-based pig monitoring applications. However, it still remains a challenging task in complex scenes, due to problems of light fluctuation, similar appearances of pigs, shape deformations and occlusions. To tackle these problems, we propose a robust real time multiple pig detection and tracking method which does not require manual marking or physical identification of the pigs, and works under both daylight and infrared light conditions. Our method couples a CNN-based detector and a correlation filter-based tracker via a novel hierarchical data association algorithm. The detector gains the best accuracy/speed trade-off by using the features derived from multiple layers at different scales in a one-stage prediction network. We define a tag-box for each pig as the tracking target, and the multiple object tracking is conducted in a key-points tracking manner using learned correlation filters. Under challenging conditions, the tracking failures are modelled based on the relations between responses of detector and tracker, and the data association algorithm allows the detection hypotheses to be refined, meanwhile the drifted tracks can be corrected by probing the tracking failures followed by the re-initialization of tracking. As a result, the optimal tracklets can sequentially grow with on-line refined detections, and tracking fragments are correctly integrated into respective tracks while keeping the original identifications. Experiments with a dataset captured from a commercial farm show that our method can robustly detect and track multiple pigs under challenging conditions. The promising performance of the proposed method also demonstrates a feasibility of long-term individual pig tracking in a complex environment and thus promises a commercial potential.
We introduce DeepProbLog, a probabilistic logic programming language that incorporates deep learning by means of neural predicates. We show how existing inference and learning techniques can be adapted for the new language. Our experiments demonstrate that DeepProbLog supports both symbolic and subsymbolic representations and inference, 1) program induction, 2) probabilistic (logic) programming, and 3) (deep) learning from examples. To the best of our knowledge, this work is the first to propose a framework where general-purpose neural networks and expressive probabilistic-logical modeling and reasoning are integrated in a way that exploits the full expressiveness and strengths of both worlds and can be trained end-to-end based on examples.
Road safety mapping using satellite images is a cost-effective but a challenging problem for smart city planning. The scarcity of labeled data, misalignment and ambiguity makes it hard for supervised deep networks to learn efficient embeddings in order to classify between safe and dangerous road segments. In this paper, we address the challenges using a region guided attention network. In our model, we extract global features from a base network and augment it with local features obtained using the region guided attention network. In addition, we perform domain adaptation for unlabeled target data. In order to bridge the gap between safe samples and dangerous samples from source and target respectively, we propose a loss function based on within and between class covariance matrices. We conduct experiments on a public dataset of London to show that the algorithm achieves significant results with the classification accuracy of 86.21%. We obtain an increase of 4% accuracy for NYC using domain adaptation network. Besides, we perform a user study and demonstrate that our proposed algorithm achieves 23.12% better accuracy compared to subjective analysis.
Illumination estimation is often used in mixed reality to re-render a scene from another point of view, to change the color/texture of an object, or to insert a virtual object consistently lit into a real video or photograph. Specifically, the estimation of a point light source is required for the shadows cast by the inserted object to be consistent with the real scene. We tackle the problem of illumination retrieval given an RGBD image of the scene as an inverse problem: we aim to find the illumination that minimizes the photometric error between the rendered image and the observation. In particular we propose a novel differentiable renderer based on the Blinn-Phong model with cast shadows. We compare our differentiable renderer to state-of-the-art methods and demonstrate its robustness to an incorrect reflectance estimation.
Robots are widely collaborating with human users in diferent tasks that require high-level cognitive functions to make them able to discover the surrounding environment. A difcult challenge that we briefy highlight in this short paper is inferring the latent grammatical structure of language, which includes grounding parts of speech (e.g., verbs, nouns, adjectives, and prepositions) through visual perception, and induction of Combinatory Categorial Grammar (CCG) for phrases. This paves the way towards grounding phrases so as to make a robot able to understand human instructions appropriately during interaction.
Weakly-supervised instance segmentation, which could greatly save labor and time cost of pixel mask annotation, has attracted increasing attention in recent years. The commonly used pipeline firstly utilizes conventional image segmentation methods to automatically generate initial masks and then use them to train an off-the-shelf segmentation network in an iterative way. However, the initial generated masks usually contains a notable proportion of invalid masks which are mainly caused by small object instances. Directly using these initial masks to train segmentation model is harmful for the performance. To address this problem, we propose a hybrid network in this paper. In our architecture, there is a principle segmentation network which is used to handle the normal samples with valid generated masks. In addition, a complementary branch is added to handle the small and dim objects without valid masks. Experimental results indicate that our method can achieve significantly performance improvement both on the small object instances and large ones, and outperforms all state-of-the-art methods.
A comprehensive and systematic framework for easily extending and implementing the spatial-temporal subset-based digital image correlation (DIC) algorithm is presented. The framework decouples the three main factors (shape function, correlation criterion, and optimization algorithm) in DIC, and represents different algorithms in a uniform form. One can freely choose and combine the three factors to meet his own need, or freely add more parameters to extract analytic results. Subpixel translation and a simulated image series with different velocity characters are analyzed using different algorithms based on the proposed framework. And an application of mitigating air disturbance due to heat haze using spatial-temporal DIC (ST-DIC) is demonstrated, proving the applicability of the framework.
Recently, Deep Learning (DL), especially Convolutional Neural Network (CNN), develops rapidly and is applied to many tasks, such as image classification, face recognition, image segmentation, and human detection. Due to its superior performance, DL-based models have a wide range of application in many areas, some of which are extremely safety-critical, e.g. intelligent surveillance and autonomous driving. Due to the latency and privacy problem of cloud computing, embedded accelerators are popular in these safety-critical areas. However, the robustness of the embedded DL system might be harmed by inserting hardware/software Trojans into the accelerator and the neural network model, since the accelerator and deploy tool (or neural network model) are usually provided by third-party companies. Fortunately, inserting hardware Trojans can only achieve inflexible attack, which means that hardware Trojans can easily break down the whole system or exchange two outputs, but can’t make CNN recognize unknown pictures as targets. Though inserting software Trojans has more freedom of attack, it often requires tampering input images, which is not easy for attackers. So, in this paper, we propose a hardware-software collaborative attack framework to inject hidden neural network Trojans, which works as a back-door without requiring manipulating input images and is flexible for different scenarios. We test our attack framework for image classification and face recognition tasks, and get attack success rate of 92.6% and 100% on CIFAR10 and YouTube Faces, respectively, while keeping almost the same accuracy as the unattacked model in the normal mode. In addition, we show a specific attack scenario in which a face recognition system is attacked and gives a specific wrong answer.
Generating iris images which look realistic is both an interesting and challenging problem. Most of the classical statistical models are not powerful enough to capture the complicated texture representation in iris images, and therefore fail to generate iris images which look realistic. In this work, we present a machine learning framework based on generative adversarial network (GAN), which is able to generate iris images sampled from a prior distribution (learned from a set of training images). We apply this framework to two popular iris databases, and generate images which look very realistic, and similar to the image distribution in those databases. Through experimental results, we show that the generated iris images have a good diversity, and are able to capture different part of the prior distribution.
Single Image Super Resolution (SISR) is a well-researched problem with broad commercial relevance. However, most of the SISR literature focuses on small-size images under 500px, whereas business needs can mandate the generation of very high resolution images. At Expedia Group, we were tasked with generating images of at least 2000px for display on the website, four times greater than the sizes typically reported in the literature. This requirement poses a challenge that state-of-the-art models, validated on small images, have not been proven to handle. In this paper, we investigate solutions to the problem of generating high-quality images for large-scale super resolution in a commercial setting. We find that training a generative adversarial network (GAN) with attention from scratch using a large-scale lodging image data set generates images with high PSNR and SSIM scores. We describe a novel attentional SISR model for large-scale images, A-SRGAN, that uses a Flexible Self Attention layer to enable processing of large-scale images. We also describe a distributed algorithm which speeds up training by around a factor of five.
In recent years, spectral clustering has become one of the most popular clustering algorithms for image segmentation. However, it has restricted applicability to large-scale images due to its high computational complexity. In this paper, we first propose a novel algorithm called Fast Spectral Clustering based on quad-tree decomposition. The algorithm focuses on the spectral clustering at superpixel level and its computational complexity is O(nlogn) + O(m) + O(m^(3/2)); its memory cost is O(m), where n and m are the numbers of pixels and the superpixels of a image. Then we propose Multiscale Fast Spectral Clustering by improving Fast Spectral Clustering, which is based on the hierarchical structure of the quad-tree. The computational complexity of Multiscale Fast Spectral Clustering is O(nlogn) and its memory cost is O(m). Extensive experiments on real large-scale images demonstrate that Multiscale Fast Spectral Clustering outperforms Normalized cut in terms of lower computational complexity and memory cost, with comparable clustering accuracy.
Artificial Intelligence principles define social and ethical considerations to develop future AI. They come from research institutes, government organizations and industries. All versions of AI principles are with different considerations covering different perspectives and making different emphasis. None of them can be considered as complete and can cover the rest AI principle proposals. Here we introduce LAIP, an effort and platform for linking and analyzing different Artificial Intelligence Principles. We want to explicitly establish the common topics and links among AI Principles proposed by different organizations and investigate on their uniqueness. Based on these efforts, for the long-term future of AI, instead of directly adopting any of the AI principles, we argue for the necessity of incorporating various AI Principles into a comprehensive framework and focusing on how they can interact and complete each other.
We propose an approach for unsupervised adaptation of object detectors from label-rich to label-poor domains which can significantly reduce annotation costs associated with detection. Recently, approaches that align distributions of source and target images using an adversarial loss have been proven effective for adapting object classifiers. However, for object detection, fully matching the entire distributions of source and target images to each other at the global image level may fail, as domains could have distinct scene layouts and different combinations of objects. On the other hand, strong matching of local features such as texture and color makes sense, as it does not change category level semantics. This motivates us to propose a novel approach for detector adaptation based on strong local alignment and weak global alignment. Our key contribution is the weak alignment model, which focuses the adversarial alignment loss on images that are globally similar and puts less emphasis on aligning images that are globally dissimilar. Additionally, we design the strong domain alignment model to only look at local receptive fields of the feature map. We empirically verify the effectiveness of our approach on several detection datasets comprising both large and small domain shifts.
The task in referring expression comprehension is to localise the object instance in an image described by a referring expression phrased in natural language. As a language-to-vision matching task, the key to this problem is to learn a discriminative object feature that can adapt to the expression used. To avoid ambiguity, the expression normally tends to describe not only the properties of the referent itself, but also its relationships to its neighbourhood. To capture and exploit this important information we propose a graph-based, language-guided attention mechanism. Being composed of node attention component and edge attention component, the proposed graph attention mechanism explicitly represents inter-object relationships, and properties with a flexibility and power impossible with competing approaches. Furthermore, the proposed graph attention mechanism enables the comprehension decision to be visualisable and explainable. Experiments on three referring expression comprehension datasets show the advantage of the proposed approach.
We present Extreme View Synthesis, a solution for novel view extrapolation when the number of input images is small. Occlusions and depth uncertainty, in this context, are two of the most pressing issues, and worsen as the degree of extrapolation increases. State-of-the-art methods approach this problem by leveraging explicit geometric constraints, or learned priors. Our key insight is that only by modeling both depth uncertainty and image priors can the extreme cases be solved. We first generate a depth probability volume for the novel view and synthesize an estimate of the sought image. Then, we use learned priors combined with depth uncertainty, to refine it. Our method is the first to show visually pleasing results for baseline magnifications of up to 30X.
Artificial Neural Networks (ANNs) were devised as a tool for Artificial Intelligence design implementations. However, it was soon became obvious that they are unable to fulfill their duties. The fully autonomous way of ANNs working, precluded from any human intervention or supervision, deprived of any theoretical underpinning, leads to a strange state of affairs, when ANN designers cannot explain why and how they achieve their amazing and remarkable results. Therefore, contemporary Artificial Intelligence R&D looks more like a Modern Alchemy enterprise rather than a respected scientific or technological undertaking. On the other hand, modern biological science posits that intelligence can be distinguished not only in human brains. Intelligence today is considered as a fundamental property of each and every living being. Therefore, lower simplified forms of natural intelligence are more suitable for investigation and further replication in artificial cognitive architectures.
We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training. The subspace is spanned by a few top eigenvectors of the Hessian (equal to the number of classes in the dataset), and is mostly preserved over long periods of training. A simple argument then suggests that gradient descent may happen mostly in this subspace. We give an example of this effect in a solvable model of classification, and we comment on possible implications for optimization and learning.
As biometric applications are fielded to serve large population groups, issues of performance differences between individual sub-groups are becoming increasingly important. In this paper we examine cases where we believe race is one such factor. We look in particular at two forms of problem; facial classification and image synthesis. We take the novel approach of considering race as a boundary for transfer learning in both the task (facial classification) and the domain (synthesis over distinct datasets). We demonstrate a series of techniques to improve transfer learning of facial classification; outperforming similar models trained in the target’s own domain. We conduct a study to evaluate the performance drop of Generative Adversarial Networks trained to conduct image synthesis, in this process, we produce a new annotation for the Celeb-A dataset by race. These networks are trained solely on one race and tested on another - demonstrating the subsets of the CelebA to be distinct domains for this task.
Machine hearing or listening represents an emerging area. Conventional approaches rely on the design of handcrafted features specialized to a specific audio task and that can hardly generalized to other audio fields. For example, Mel-Frequency Cepstral Coefficients (MFCCs) and its variants were successfully applied to computational auditory scene recognition while Chroma vectors are good at music chord recognition. Unfortunately, these predefined features may be of variable discrimination power while extended to other tasks or even within the same task due to different nature of clips. Motivated by this need of a principled framework across domain applications for machine listening, we propose a generic and data-driven representation learning approach. For this sake, a novel and efficient supervised dictionary learning method is presented. The method learns dissimilar dictionaries, one per each class, in order to extract heterogeneous information for classification. In other words, we are seeking to minimize the intra-class homogeneity and maximize class separability. This is made possible by promoting pairwise orthogonality between class specific dictionaries and controlling the sparsity structure of the audio clip’s decomposition over these dictionaries. The resulting optimization problem is non-convex and solved using a proximal gradient descent method. Experiments are performed on both computational auditory scene (East Anglia and Rouen) and synthetic music chord recognition datasets. Obtained results show that our method is capable to reach state-of-the-art hand-crafted features for both applications.
An autonomous system is constructed by a manufacturer, operates in a society subject to norms and laws, and is interacting with end-users. We address the challenge of how the moral values and views of all stakeholders can be integrated and reflected in the moral behaviour of the autonomous system. We propose an artificial moral agent architecture that uses techniques from normative systems and formal argumentation to reach moral agreements among stakeholders. We show how our architecture can be used not only for ethical practical reasoning and collaborative decision-making, but also for the explanation of such moral behavior.
High-frequency noise is present in several modalities of medical images. It originates from the acquisition process and may be related to the scanner configurations, the scanned body, or to other external factors. This way, prospective filters are an important tool to improve the image quality. In this paper, we propose a non-local weighted operational anisotropic diffusion filter and evaluate its effect on magnetic resonance images and on kV/CBCT radiotherapy images. We also provide a detailed analysis of non-local parameter settings. Results show that the new filter enhances previous local implementations and has potential application in radiotherapy treatments.
The detection of objects that are multi-oriented is a difficult pattern recognition problem. In this paper, we propose to evaluate the performance of different families of descriptors for the classification of galaxy morphologies. We investigate the performance of the Hu moments, Flusser moments, Zernike moments, Fourier-Mellin moments, and ring projection techniques based on 1D moment and the Fourier transform. We consider two main datasets for the performance evaluation. The first dataset is an artificial dataset based on representative templates from 11 types of galaxies, which are evaluated with different transformations (noise, smoothing), alone or combined. The evaluation is based on image retrieval performance to estimate the robustness of the rotation invariant descriptors with this type of images. The second dataset is composed of real images extracted from the Galaxy Zoo 2 project. The binary classification of elliptical and spiral galaxies is achieved with pre-processing steps including morphological filtering and a Laplacian pyramid. For the binary classification, we compare the different set of features with Support Vector Machines (SVM), Extreme Learning Machine, and different types of linear discriminant analysis techniques. The results support the conclusion that the proposed framework for the binary classification of elliptical and spiral galaxies provides an area under the ROC curve reaching 99.54%, proving the robustness of the approach for helping astronomers to study galaxies.
Generative adversarial networks have been able to generate striking results in various domains. This generation capability can be general while the networks gain deep understanding regarding the data distribution. In many domains, this data distribution consists of anomalies and normal data, with the anomalies commonly occurring relatively less, creating datasets that are imbalanced. The capabilities that generative adversarial networks offer can be leveraged to examine these anomalies and help alleviate the challenge that imbalanced datasets propose via creating synthetic anomalies. This anomaly generation can be specifically beneficial in domains that have costly data creation processes as well as inherently imbalanced datasets. One of the domains that fits this description is the host-based intrusion detection domain. In this work, ADFA-LD dataset is chosen as the dataset of interest containing system calls of small foot-print next generation attacks. The data is first converted into images, and then a Cycle-GAN is used to create images of anomalous data from images of normal data. The generated data is combined with the original dataset and is used to train a model to detect anomalies. By doing so, it is shown that the classification results are improved, with the AUC rising from 0.55 to 0.71, and the anomaly detection rate rising from 17.07% to 80.49%. The results are also compared to SMOTE, showing the potential presented by generative adversarial networks in anomaly generation.
Due to the recent advances in the area of deep learning, it has been demonstrated that a deep neural network, trained on a huge amount of data, can recognize cardiac arrhythmias better than cardiologists. Moreover, traditionally feature extraction was considered an integral part of ECG pattern recognition; however, recent findings have shown that deep neural networks can carry out the task of feature extraction directly from the data itself. In order to use deep neural networks for their accuracy and feature extraction, high volume of training data is required, which in the case of independent studies is not pragmatic. To arise to this challenge, in this work, the identification and classification of four ECG patterns are studied from a transfer learning perspective, transferring knowledge learned from the image classification domain to the ECG signal classification domain. It is demonstrated that feature maps learned in a deep neural network trained on great amounts of generic input images can be used as general descriptors for the ECG signal spectrograms and result in features that enable classification of arrhythmias. Overall, an accuracy of 97.23 percent is achieved in classifying near 7000 instances by ten-fold cross validation.
Contouring of organs at risk is an important but time consuming part of radiotherapy treatment planning. Several authors proposed methods for automatic delineation but the clinical experts eye remains the gold standard method. In this paper, we present a totally visual software for automated delineation of the femural head. The software was successfully characterized in pelvic CT Scan of prostate patients (n=11). The automatic delineation was compared with manual and approved delineation through blind test evaluated by a panel of seniors radiation oncologists (n=9). Clinical experts evaluated that no any contouring correction were need in 77.8% and 67.8% of manual and automatic delineation respectively. Our results show that the software is robust, the automated delineation was reproducible in all patient, and its performance was similar to manually delineation.
Recent studies have shown evidence of a significant decline of the Posidonia oceanica (P.O.) meadows on a global scale. The monitoring and mapping of these meadows are fundamental tools for measuring their status. We present an approach based on a deep neural network to automatically perform a high-precision semantic segmentation of P.O. meadows in sea-floor images, offering several improvements over the state of the art techniques. Our network demonstrates outstanding performance over diverse test sets, reaching a precision of 96.57% and an accuracy of 96.81%, surpassing the reliability of labelling the images manually. Also, the network is implemented in an Autonomous Underwater Vehicle (AUV), performing an online P.O. segmentation, which will be used to generate real-time semantic coverage maps.
Product name recognition is a significant practical problem, spurred by the greater availability of platforms for discussing products such as social media and product review functionalities of online marketplaces. Customers, product manufacturers and online marketplaces may want to identify product names in unstructured text to extract important insights, such as sentiment, surrounding a product. Much extant research on product name identification has been domain-specific (e.g., identifying mobile phone models) and used supervised or semi-supervised methods. With massive numbers of new products released to the market every year such methods may require retraining on updated labeled data to stay relevant, and may transfer poorly across domains. This research addresses this challenge and develops a domain-agnostic, unsupervised algorithm for identifying product names based on Facebook posts. The algorithm consists of two general steps: (a) candidate product name identification using an off-the-shelf pretrained conditional random fields (CRF) model, part-of-speech tagging and a set of simple patterns; and (b) filtering of candidate names to remove spurious entries using clustering and word embeddings generated from the data.
Image synthesis learns a transformation from the intensity features of an input image to yield a different tissue contrast of the output image. This process has been shown to have application in many medical image analysis tasks including imputation, registration, and segmentation. To carry out synthesis, the intensities of the input images are typically scaled–i.e., normalized–both in training to learn the transformation and in testing when applying the transformation, but it is not presently known what type of input scaling is optimal. In this paper, we consider seven different intensity normalization algorithms and three different synthesis methods to evaluate the impact of normalization. Our experiments demonstrate that intensity normalization as a preprocessing step improves the synthesis results across all investigated synthesis algorithms. Furthermore, we show evidence that suggests intensity normalization is vital for successful deep learning-based MR image synthesis.
Language models (LM) for interactive speech recognition systems are trained on large amounts of data and the model parameters are optimized on past user data. New application intents and interaction types are released for these systems over time, imposing challenges to adapt the LMs since the existing training data is no longer sufficient to model the future user interactions. It is unclear how to adapt LMs to new application intents without degrading the performance on existing applications. In this paper, we propose a solution to (a) estimate n-gram counts directly from the hand-written grammar for training LMs and (b) use constrained optimization to optimize the system parameters for future use cases, while not degrading the performance on past usage. We evaluated our approach on new applications intents for a personal assistant system and find that the adaptation improves the word error rate by up to 15% on new applications even when there is no adaptation data available for an application.
The question addressed in this paper is: If we present to a user an AI system that explains how it works, how do we know whether the explanation works and the user has achieved a pragmatic understanding of the AI? In other words, how do we know that an explanainable AI system (XAI) is any good? Our focus is on the key concepts of measurement. We discuss specific methods for evaluating: (1) the goodness of explanations, (2) whether users are satisfied by explanations, (3) how well users understand the AI systems, (4) how curiosity motivates the search for explanations, (5) whether the user’s trust and reliance on the AI are appropriate, and finally, (6) how the human-XAI work system performs. The recommendations we present derive from our integration of extensive research literatures and our own psychometric evaluations.
It is important to detect and handle anomalous inputs when deploying machine learning systems. The use of larger and more complex inputs in deep learning magnifies the difficulty of distinguishing between anomalous and in-distribution examples. At the same time, diverse image and text data commonly used by deep learning systems are available in enormous quantities. We propose leveraging these data to improve deep anomaly detection by training anomaly detectors against an auxiliary dataset of outliers, an approach we call Outlier Exposure (OE). This approach enables anomaly detectors to generalize and detect unseen anomalies. In extensive experiments in vision and natural language processing settings, we find that Outlier Exposure significantly improves the detection performance. Our approach is even applicable to density estimation models and anomaly detectors for large-scale images. We also analyze the flexibility and robustness of Outlier Exposure, and identify characteristics of the auxiliary dataset that improve performance.