We present NeuroSAT, a message passing neural network that learns to solve SAT problems after only being trained as a classifier to predict satisfiability. Although it is not competitive with state-of-the-art SAT solvers, NeuroSAT can solve problems that are substantially larger and more difficult than it ever saw during training by simply running for more iterations. Moreover, NeuroSAT generalizes to novel distributions; after training only on random SAT problems, at test time it can solve SAT problems encoding graph coloring, clique detection, dominating set, and vertex cover problems, all on a range of distributions over small random graphs.
http://arxiv.org/abs/1802.03685
We work out a quantum-theoretic model in complex Hilbert space of a recently performed test on co-occurrencies of two concepts and their combination in retrieval processes on specific corpuses of documents. The test violated the Clauser-Horne-Shimony-Holt version of the Bell inequalities (‘CHSH inequality’), thus indicating the presence of entanglement between the combined concepts. We make use of a recently elaborated ‘entanglement scheme’ and represent the collected data in the tensor product of Hilbert spaces of the individual concepts, showing that the identified violation is due to the occurrence of a strong form of entanglement, involving both states and measurements and reflecting the meaning connection between the component concepts. These results provide a significant confirmation of the presence of quantum structures in corpuses of documents, like it is the case for the entanglement identified in human cognition.
http://arxiv.org/abs/1901.04299
Statistical spoken dialogue systems usually rely on a single- or multi-domain dialogue model that is restricted in its capabilities of modelling complex dialogue structures, e.g., relations. In this work, we propose a novel dialogue model that is centred around entities and is able to model relations as well as multiple entities of the same type. We demonstrate in a prototype implementation benefits of relation modelling on the dialogue level and show that a trained policy using these relations outperforms the multi-domain baseline. Furthermore, we show that by modelling the relations on the dialogue level, the system is capable of processing relations present in the user input and even learns to address them in the system response.
http://arxiv.org/abs/1901.01466
Existing research highlight the myriad of benefits realized when technology is sufficiently democratized and made accessible to non-technical or novice users. However, democratizing complex technologies such as artificial intelligence (AI) remains hard. In this work, we draw on theoretical underpinnings from the democratization of innovation, in exploring the design of maker kits that help introduce novice users to complex technologies. We report on our work designing TJBot: an open source cardboard robot that can be programmed using pre-built AI services. We highlight principles we adopted in this process (approachable design, simplicity, extensibility and accessibility), insights we learned from showing the kit at workshops (66 participants) and how users interacted with the project on GitHub over a 12-month period (Nov 2016 - Nov 2017). We find that the project succeeds in attracting novice users (40% of users who forked the project are new to GitHub) and a variety of demographics are interested in prototyping use cases such as home automation, task delegation, teaching and learning.
http://arxiv.org/abs/1805.10723
It remains challenging to automatically segment kidneys in clinical ultrasound images due to the kidneys’ varied shapes and image intensity distributions, although semi-automatic methods have achieved promising performance. In this study, we developed a novel boundary distance regression deep neural network to segment the kidneys, informed by the fact that the kidney boundaries are relatively consistent across images in terms of their appearance. Particularly, we first use deep neural networks pre-trained for classification of natural images to extract high-level image features from ultrasound images, then these feature maps are used as input to learn kidney boundary distance maps using a boundary distance regression network, and finally the predicted boundary distance maps are classified as kidney pixels or non-kidney pixels using a pixel classification network in an end-to-end learning fashion. Experimental results have demonstrated that our method could effectively improve the performance of automatic kidney segmentation, significantly better than deep learning based pixel classification networks.
http://arxiv.org/abs/1901.01982
Multi-modal biological, imaging, and neuropsychological markers have demonstrated promising performance for distinguishing Alzheimer’s disease (AD) patients from cognitively normal elders. However, it remains difficult to early predict when and which mild cognitive impairment (MCI) individuals will convert to AD dementia. Informed by pattern classification studies which have demonstrated that pattern classifiers built on longitudinal data could achieve better classification performance than those built on cross-sectional data, we develop a deep learning model based on recurrent neural networks (RNNs) to learn informative representation and temporal dynamics of longitudinal cognitive measures of individual subjects and combine them with baseline hippocampal MRI for building a prognostic model of AD dementia progression. Experimental results on a large cohort of MCI subjects have demonstrated that the deep learning model could learn informative measures from longitudinal data for characterizing the progression of MCI subjects to AD dementia, and the prognostic model could early predict AD progression with high accuracy.
http://arxiv.org/abs/1901.01451
We propose control systems for the coordination of the ground robots. We develop robot efficient coordination using the devices located on towers or a tethered aerial apparatus tracing the robots on controlled area and supervising their environment including natural and artificial markings. The simple prototype of such a system was created in the Laboratory of Applied Mathematics of Ariel University (under the supervision of Prof. Domoshnitsky Alexander) in collaboration with company TRANSIST VIDEO LLC (Skolkovo, Moscow). We plan to create much more complicated prototype using Kamin grant (Israel)
http://arxiv.org/abs/1809.08272
Recent radiomic studies have witnessed promising performance of deep learning techniques in learning radiomic features and fusing multimodal imaging data. Most existing deep learning based radiomic studies build predictive models in a setting of pattern classification, not appropriate for survival analysis studies where some data samples have incomplete observations. To improve existing survival analysis techniques whose performance is hinged on imaging features, we propose a deep learning method to build survival regression models by optimizing imaging features with deep convolutional neural networks (CNNs) in a proportional hazards model. To make the CNNs applicable to tumors with varied sizes, a spatial pyramid pooling strategy is adopted. Our method has been validated based on a simulated imaging dataset and a FDG-PET/CT dataset of rectal cancer patients treated for locally advanced rectal cancer. Compared with survival prediction models built upon hand-crafted radiomic features using Cox proportional hazards model and random survival forests, our method achieved competitive prediction performance.
http://arxiv.org/abs/1901.01449
Datasets advance research by posing challenging new problems and providing standardized methods of algorithm comparison. High-quality datasets exist for many important problems in robotics and computer vision, including egomotion estimation and motion/scene segmentation, but not for techniques that estimate every motion in a scene. Metric evaluation of these multimotion estimation techniques requires datasets consisting of multiple, complex motions that also contain ground truth for every moving body. The Oxford Multimotion Dataset provides a number of multimotion estimation problems of varying complexity. It includes both complex problems that challenge existing algorithms as well as a number of simpler problems to support development. These include observations from both static and dynamic sensors, a varying number of moving bodies, and a variety of different 3D motions. It also provides a number of experiments designed to isolate specific challenges of the multimotion problem, including rotation about the optical axis and occlusion. In total, the Oxford Multimotion Dataset contains over 110 minutes of multimotion data consisting of stereo and RGB-D camera images, IMU data, and Vicon ground-truth trajectories. The dataset culminates in a complex toy car segment representative of many challenging real-world scenarios. This paper describes each experiment with a focus on its relevance to the multimotion estimation problem.
http://arxiv.org/abs/1901.01445
As an advanced research topic in forensics science, automatic shoe-print identification has been extensively studied in the last two decades, since shoe marks are the clues most frequently left in a crime scene. Hence, these impressions provide a pertinent evidence for the proper progress of investigations in order to identify the potential criminals. The main goal of this survey is to provide a cohesive overview of the research carried out in forensic shoe-print identification and its basic background. Apart defining the problem and describing the phases that typically compose the processing chain of shoe-print identification, we provide a summary/comparison of the state-of-the-art approaches, in order to guide the neophyte and help to advance the research topic. This is done through introducing simple and basic taxonomies as well as summaries of the state-of-the-art performance. Lastly, we discuss the current open problems and challenges in this research topic, point out for promising directions in this field.
http://arxiv.org/abs/1901.01431
In this study, a novel control algorithm for a P-300 based brain-computer interface is fully developed to control a 2-DoF robotic arm. Eight subjects including 5 men and 3 women, perform a 2-dimensional target tracking task in a simulated environment. Their EEG signals from visual cortex are recorded and P-300 components are extracted and evaluated to perform a real-time BCI based controller. The volunteer’s intention is recognized and will be decoded as an appropriate command to control the cursor. The final goal of the system is to control a simulated robotic arm in a 2-dimensional space for writing some English letters. The results show that the system allows the robot end-effector to move between arbitrary positions in a point-to-point session with the desired accuracy. This model is tested on and compared with Dataset II of the BCI Competition. The best result is obtained with a multi-class SVM solution as the classifier, with a recognition rate of 97 percent, without pre-channel selection.
http://arxiv.org/abs/1901.01422
This work addresses the problem of semantic scene understanding under fog. Although marked progress has been made in semantic scene understanding, it is mainly concentrated on clear-weather scenes. Extending semantic segmentation methods to adverse weather conditions such as fog is crucial for outdoor applications. In this paper, we propose a novel method, named Curriculum Model Adaptation (CMAda), which gradually adapts a semantic segmentation model from light synthetic fog to dense real fog in multiple steps, using both labeled synthetic foggy data and unlabeled real foggy data. The method is based on the fact that the results of semantic segmentation in moderately adverse conditions (light fog) can be bootstrapped to solve the same problem in highly adverse conditions (dense fog). CMAda is extensible to other adverse conditions and provides a new paradigm for learning with synthetic data and unlabeled real data. In addition, we present three other main stand-alone contributions: 1) a novel method to add synthetic fog to real, clear-weather scenes using semantic input; 2) a new fog density estimator; 3) a novel fog densification method to densify the fog in real foggy scenes without using depth; and 4) the Foggy Zurich dataset comprising 3808 real foggy images, with pixel-level semantic annotations for 40 images under dense fog. Our experiments show that 1) our fog simulation and fog density estimator outperform their state-of-the-art counterparts with respect to the task of semantic foggy scene understanding (SFSU); 2) CMAda improves the performance of state-of-the-art models for SFSU significantly, benefiting both from our synthetic and real foggy data. The datasets and code are available at the project website.
http://arxiv.org/abs/1901.01415
News plays a significant role in shaping people’s beliefs and opinions. Fake news has always been a problem, which wasn’t exposed to the mass public until the past election cycle for the 45th President of the United States. While quite a few detection methods have been proposed to combat fake news since 2015, they focus mainly on linguistic aspects of an article without any fact checking. In this paper, we argue that these models have the potential to misclassify fact-tampering fake news as well as under-written real news. Through experiments on Fakebox, a state-of-the-art fake news detector, we show that fact tampering attacks can be effective. To address these weaknesses, we argue that fact checking should be adopted in conjunction with linguistic characteristics analysis, so as to truly separate fake news from real news. A crowdsourced knowledge graph is proposed as a straw man solution to collecting timely facts about news events.
http://arxiv.org/abs/1901.09657
This work presents a methodology to design trajectory tracking feedback control laws, which embed non-parametric statistical models, such as Gaussian Processes (GPs). The aim is to minimize unmodeled dynamics such as undesired slippages. The proposed approach has the benefit of avoiding complex terramechanics analysis to directly estimate from data the robot dynamics on a wide class of trajectories. Experiments in both real and simulated environments prove that the proposed methodology is promising.
http://arxiv.org/abs/1810.03711
In this study, we proposed and validated a multi-atlas guided 3D fully convolutional network (FCN) ensemble model (M-FCN) for segmenting brain regions of interest (ROIs) from structural magnetic resonance images (MRIs). One major limitation of existing state-of-the-art 3D FCN segmentation models is that they often apply image patches of fixed size throughout training and testing, which may miss some complex tissue appearance patterns of different brain ROIs. To address this limitation, we trained a 3D FCN model for each ROI using patches of adaptive size and embedded outputs of the convolutional layers in the deconvolutional layers to further capture the local and global context patterns. In addition, with an introduction of multi-atlas based guidance in M-FCN, our segmentation was generated by combining the information of images and labels, which is highly robust. To reduce over-fitting of the FCN model on the training data, we adopted an ensemble strategy in the learning procedure. Evaluation was performed on two brain MRI datasets, aiming respectively at segmenting 14 subcortical and ventricular structures and 54 brain ROIs. The segmentation results of the proposed method were compared with those of a state-of-the-art multi-atlas based segmentation method and an existing 3D FCN segmentation model. Our results suggested that the proposed method had a superior segmentation performance.
http://arxiv.org/abs/1901.01381
Data in real-world application often exhibit skewed class distribution which poses an intense challenge for machine learning. Conventional classification algorithms are not effective in the case of imbalanced data distribution, and may fail when the data distribution is highly imbalanced. To address this issue, we propose a general imbalanced classification model based on deep reinforcement learning. We formulate the classification problem as a sequential decision-making process and solve it by deep Q-learning network. The agent performs a classification action on one sample at each time step, and the environment evaluates the classification action and returns a reward to the agent. The reward from minority class sample is larger so the agent is more sensitive to the minority class. The agent finally finds an optimal classification policy in imbalanced data under the guidance of specific reward function and beneficial learning environment. Experiments show that our proposed model outperforms the other imbalanced classification algorithms, and it can identify more minority samples and has great classification performance.
http://arxiv.org/abs/1901.01379
For real-world speech recognition applications, noise robustness is still a challenge. In this work, we adopt the teacher-student (T/S) learning technique using a parallel clean and noisy corpus for improving automatic speech recognition (ASR) performance under multimedia noise. On top of that, we apply a logits selection method which only preserves the k highest values to prevent wrong emphasis of knowledge from the teacher and to reduce bandwidth needed for transferring data. We incorporate up to 8000 hours of untranscribed data for training and present our results on sequence trained models apart from cross entropy trained ones. The best sequence trained student model yields relative word error rate (WER) reductions of approximately 10.1%, 28.7% and 19.6% on our clean, simulated noisy and real test sets respectively comparing to a sequence trained teacher.
http://arxiv.org/abs/1901.02348
We propose Text2Scene, a model that interprets input natural language descriptions in order to generate various forms of compositional scene representations; from abstract cartoon-like scenes to synthetic images. Unlike recent works, our method does not use generative adversarial networks, but a combination of an encoder-decoder model with a semi-parametric retrieval-based approach. Text2Scene learns to sequentially produce objects and their attributes (location, size, appearance, etc) at every time step by attending to different parts of the input text, and the current status of the generated scene. We show that under minor modifications, the proposed framework can handle the generation of different forms of scene representations, including cartoon-like scenes, object layouts corresponding to real images, and synthetic image composites. Our method is not only competitive when compared with state-of-the-art GAN-based methods using automatic metrics and superior based on human judgments but it is also more general and interpretable.
http://arxiv.org/abs/1809.01110
In this work, we present a camera configuration for acquiring “stereoscopic dark flash” images: a simultaneous stereo pair in which one camera is a conventional RGB sensor, but the other camera is sensitive to near-infrared and near-ultraviolet instead of R and B. When paired with a “dark” flash (i.e., one having near-infrared and near-ultraviolet light, but no visible light) this camera allows us to capture the two images in a flash/no-flash image pair at the same time, all while not disturbing any human subjects or onlookers with a dazzling visible flash. We present a hardware prototype of this camera that approximates an idealized camera, and we present an imaging procedure that let us acquire dark flash stereo pairs that closely resemble those we would get from that idealized camera. We then present a technique for fusing these stereo pairs, first by performing registration and warping, and then by using recent advances in hyperspectral image fusion and deep learning to produce a final image. Because our camera configuration and our data acquisition process allow us to capture true low-noise long exposure RGB images alongside our dark flash stereo pairs, our learned model can be trained end-to-end to produce a fused image that retains the color and tone of a real RGB image while having the low-noise properties of a flash image.
http://arxiv.org/abs/1901.01370
RGB-D salient object detection aims to identify the most visually distinctive objects in a pair of color and depth images. Based upon an observation that most of the salient objects may stand out at least in one modality, this paper proposes an adaptive fusion scheme to fuse saliency predictions generated from two modalities. Specifically, we design a two-streamed convolutional neural network (CNN), each of which extracts features and predicts a saliency map from either RGB or depth modality. Then, a saliency fusion module learns a switch map that is used to adaptively fuse the predicted saliency maps. A loss function composed of saliency supervision, switch map supervision, and edge-preserving constraints is designed to make full supervision, and the entire network is trained in an end-to-end manner. Benefited from the adaptive fusion strategy and the edge-preserving constraint, our approach outperforms state-of-the-art methods on three publicly available datasets.
http://arxiv.org/abs/1901.01369
Real-world tasks are often highly structured. Hierarchical reinforcement learning (HRL) has attracted research interest as an approach for leveraging the hierarchical structure of a given task in reinforcement learning (RL). However, identifying the hierarchical policy structure that enhances the performance of RL is not a trivial task. In this paper, we propose an HRL method that learns a latent variable of a hierarchical policy using mutual information maximization. Our approach can be interpreted as a way to learn a discrete and latent representation of the state-action space. To learn option policies that correspond to modes of the advantage function, we introduce advantage-weighted importance sampling. In our HRL method, the gating policy learns to select option policies based on an option-value function, and these option policies are optimized based on the deterministic policy gradient method. This framework is derived by leveraging the analogy between a monolithic policy in standard RL and a hierarchical policy in HRL by using a deterministic option policy. Experimental results indicate that our HRL approach can learn a diversity of options and that it can enhance the performance of RL in continuous control tasks.
http://arxiv.org/abs/1901.01365
Person Re-Identification (ReID) refers to the task of verifying the identity of a pedestrian observed from non-overlapping views of surveillance cameras networks. Recently, it has been validated that re-ranking could bring remarkable performance improvements for a person ReID system. However, the current re-ranking approaches either require feedbacks from users or suffer from burdensome computation cost. In this paper, we propose to exploit a density-adaptive smooth kernel technique to perform efficient and effective re-ranking. Specifically, we adopt a smooth kernel function to formulate the neighboring relationship amongst data samples with a density-adaptive parameter. Based on the new formulation, we present two simple yet effective re-ranking methods, termed inverse Density-Adaptive Kernel based Re-ranking (inv-DAKR) and bidirectional Density-Adaptive Kernel based Re-ranking (bi-DAKR), in which the local density information around each gallery sample is elegantly exploited. Moreover, we extend the proposed inv-DAKR and bi-DAKR to incorporate the available extra probe samples and demonstrate that the extra probe samples are able to improve the local neighborhood and thus further refine the ranking result. Extensive experiments are conducted on six benchmark datasets, including PRID450s, VIPeR, CUHK03, GRID, Market-1501 and Mars. Experimental results demonstrate that our proposals are effective and efficient.
http://arxiv.org/abs/1805.07698
Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy. This has made comparisons and improvements difficult. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) that will be released publicly to facilitate algorithm development and enable comparisons. The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. This dataset contains about 3.65 million human labeled frames or about 38.5 hours of face tracks, and the corresponding audio. We also present a new audio-visual approach for active speaker detection, and analyze its performance, demonstrating both its strength and the contributions of the dataset.
http://arxiv.org/abs/1901.01342
A data matrix may be seen simply as a means of organizing observations into rows ( e.g., by measured object) and into columns ( e.g., by measured variable) so that the observations can be analyzed with mathematical tools. As a mathematical object, a matrix defines a linear mapping between points representing weighted combinations of its rows (the row vector space) and points representing weighted combinations of its columns (the column vector space). From this perspective, a data matrix defines a relationship between the information that labels its rows and the information that labels its columns, and numerical methods are used to analyze this relationship. A first step is to normalize the data, transforming each observation from scales convenient for measurement to a common scale, on which addition and multiplication can meaningfully combine the different observations. For example, z-transformation rescales every variable to the same scale, standardized variation from an expected value, but ignores scale differences between measured objects. Here we develop the concepts and properties of projective decomposition, which applies the same normalization strategy to both rows and columns by separating the matrix into row- and column-scaling factors and a scale-normalized matrix. We show that different scalings of the same scale-normalized matrix form an equivalence class, and call the scale-normalized, canonical member of the class its scale-invariant form that preserves all pairwise relative ratios. Projective decomposition therefore provides a means of normalizing the broad class of ratio-scale data, in which relative ratios are of primary interest, onto a common scale without altering the ratios of interest, and simultaneously accounting for scale effects for both organizations of the matrix values. Both of these properties distinguish it from z-transformation.
http://arxiv.org/abs/1901.01336
Optimal decision making with limited or no information in stochastic environments where multiple agents interact is a challenging topic in the realm of artificial intelligence. Reinforcement learning (RL) is a popular approach for arriving at optimal strategies by predicating stimuli, such as the reward for following a strategy, on experience. RL is heavily explored in the single-agent context, but is a nascent concept in multiagent problems. To this end, I propose several principled model-free and partially model-based reinforcement learning approaches for several multiagent settings. In the realm of normative reinforcement learning, I introduce scalable extensions to Monte Carlo exploring starts for partially observable Markov Decision Processes (POMDP), dubbed MCES-P, where I expand the theory and algorithm to the multiagent setting. I first examine MCES-P with probably approximately correct (PAC) bounds in the context of multiagent setting, showing MCESP+PAC holds in the presence of other agents. I then propose a more sample-efficient methodology for antagonistic settings, MCESIP+PAC. For cooperative settings, I extend MCES-P to the Multiagent POMDP, dubbed MCESMP+PAC. I then explore the use of reinforcement learning as a methodology in searching for optima in realistic and latent model environments. First, I explore a parameterized Q-learning approach in modeling humans learning to reason in an uncertain, multiagent environment. Next, I propose an implementation of MCES-P, along with image segmentation, to create an adaptive team-based reinforcement learning technique to positively identify the presence of phenotypically-expressed water and pathogen stress in crop fields.
http://arxiv.org/abs/1901.01325
A wide range of biomedical applications requires enhancement, detection, quantification and modelling of curvilinear structures in 2D and 3D images. Curvilinear structure enhancement is a crucial step for further analysis, but many of the enhancement approaches still suffer from contrast variations and noise. This can be addressed using a multiscale approach that produces a better quality enhancement for low contrast and noisy images compared with a single-scale approach in a wide range of biomedical images. Here, we propose the Multiscale Top-Hat Tensor (MTHT) approach, which combines multiscale morphological filtering with a local tensor representation of curvilinear structures in 2D and 3D images. The proposed approach is validated on synthetic and real data and is also compared to the state-of-the-art approaches. Our results show that the proposed approach achieves high-quality curvilinear structure enhancement in synthetic examples and in a wide range of 2D and 3D images.
http://arxiv.org/abs/1809.08678
The navigation problem is classically approached in two steps: an exploration step, where map-information about the environment is gathered; and an exploitation step, where this information is used to navigate efficiently. Deep reinforcement learning (DRL) algorithms, alternatively, approach the problem of navigation in an end-to-end fashion. Inspired by the classical approach, we ask whether DRL algorithms are able to inherently explore, gather and exploit map-information over the course of navigation. We build upon Mirowski et al. [2017] work and introduce a systematic suite of experiments that vary three parameters: the agent’s starting location, the agent’s target location, and the maze structure. We choose evaluation metrics that explicitly measure the algorithm’s ability to gather and exploit map-information. Our experiments show that when trained and tested on the same maps, the algorithm successfully gathers and exploits map-information. However, when trained and tested on different sets of maps, the algorithm fails to transfer the ability to gather and exploit map-information to unseen maps. Furthermore, we find that when the goal location is randomized and the map is kept static, the algorithm is able to gather and exploit map-information but the exploitation is far from optimal. We open-source our experimental suite in the hopes that it serves as a framework for the comparison of future algorithms and leads to the discovery of robust alternatives to classical navigation methods.
http://arxiv.org/abs/1802.02274
We present convolutional neural network (CNN) based approaches for unsupervised multimodal subspace clustering. The proposed framework consists of three main stages - multimodal encoder, self-expressive layer, and multimodal decoder. The encoder takes multimodal data as input and fuses them to a latent space representation. The self-expressive layer is responsible for enforcing the self-expressiveness property and acquiring an affinity matrix corresponding to the data points. The decoder reconstructs the original input data. The network uses the distance between the decoder’s reconstruction and the original input in its training. We investigate early, late and intermediate fusion techniques and propose three different encoders corresponding to them for spatial fusion. The self-expressive layers and multimodal decoders are essentially the same for different spatial fusion-based approaches. In addition to various spatial fusion-based methods, an affinity fusion-based network is also proposed in which the self-expressive layer corresponding to different modalities is enforced to be the same. Extensive experiments on three datasets show that the proposed methods significantly outperform the state-of-the-art multimodal subspace clustering methods.
http://arxiv.org/abs/1804.06498
Fundamental to robotics is the debate between model-based and model-free learning: should the robot build an explicit model of the world, or learn a policy directly? In the context of HRI, part of the world to be modeled is the human. One option is for the robot to treat the human as a black box and learn a policy for how they act directly. But it can also model the human as an agent, and rely on a “theory of mind” to guide or bias the learning (grey box). We contribute a characterization of the performance of these methods under the optimistic case of having an ideal theory of mind, as well as under different scenarios in which the assumptions behind the robot’s theory of mind for the human are wrong, as they inevitably will be in practice. We find that there is a significant sample complexity advantage to theory of mind methods and that they are more robust to covariate shift, but that when enough interaction data is available, black box approaches eventually dominate.
http://arxiv.org/abs/1901.01291
We propose a hybrid approach aimed at improving the sample efficiency in goal-directed reinforcement learning. We do this via a two-step mechanism where firstly, we approximate a model from Model-Free reinforcement learning. Then, we leverage this approximate model along with a notion of reachability using Mean First Passage Times to perform Model-Based reinforcement learning. Built on such a novel observation, we design two new algorithms - Mean First Passage Time based Q-Learning (MFPT-Q) and Mean First Passage Time based DYNA (MFPT-DYNA), that have been fundamentally modified from the state-of-the-art reinforcement learning techniques. Preliminary results have shown that our hybrid approaches converge with much fewer iterations than their corresponding state-of-the-art counterparts and therefore requiring much fewer samples and much fewer training trials to converge.
http://arxiv.org/abs/1901.01977
We propose two approaches for speaker adaptation in end-to-end (E2E) automatic speech recognition systems. One is Kullback-Leibler divergence (KLD) regularization and the other is multi-task learning (MTL). Both approaches aim to address the data sparsity especially output target sparsity issue of speaker adaptation in E2E systems. The KLD regularization adapts a model by forcing the output distribution from the adapted model to be close to the unadapted one. The MTL utilizes a jointly trained auxiliary task to improve the performance of the main task. We investigated our approaches on E2E connectionist temporal classification (CTC) models with three different types of output units. Experiments on the Microsoft short message dictation task demonstrated that MTL outperforms KLD regularization. In particular, the MTL adaptation obtained 8.8\% and 4.0\% relative word error rate reductions (WERRs) for supervised and unsupervised adaptations for the word CTC model, and 9.6% and 3.8% relative WERRs for the mix-unit CTC model, respectively.
https://arxiv.org/abs/1901.01239
Cardiac image segmentation is a critical process for generating personalized models of the heart and for quantifying cardiac performance parameters. Several convolutional neural network (CNN) architectures have been proposed to segment the heart chambers from cardiac cine MR images. Here we propose a multi-task learning (MTL)-based regularization framework for cardiac MR image segmentation. The network is trained to perform the main task of semantic segmentation, along with a simultaneous, auxiliary task of pixel-wise distance map regression. The proposed distance map regularizer is a decoder network added to the bottleneck layer of an existing CNN architecture, facilitating the network to learn robust global features. The regularizer block is removed after training, so that the original number of network parameters does not change. We show that the proposed regularization method improves both binary and multi-class segmentation performance over the corresponding state-of-the-art CNN architectures on two publicly available cardiac cine MRI datasets, obtaining average dice coefficient of 0.84$\pm$0.03 and 0.91$\pm$0.04, respectively. Furthermore, we also demonstrate improved generalization performance of the distance map regularized network on cross-dataset segmentation, showing as much as 41% improvement in average Dice coefficient from 0.57$\pm$0.28 to 0.80$\pm$0.13.
https://arxiv.org/abs/1901.01238
In this paper, we propose a method for clustering image-caption pairs by simultaneously learning image representations and text representations that are constrained to exhibit similar distributions. These image-caption pairs arise frequently in high-value applications where structured training data is expensive to produce but free-text descriptions are common. MultiDEC initializes parameters with stacked autoencoders, then iteratively minimizes the Kullback-Leibler divergence between the distribution of the images (and text) to that of a combined joint target distribution. We regularize by penalizing non-uniform distributions across clusters. The representations that minimize this objective produce clusters that outperform both single-view and multi-view techniques on large benchmark image-caption datasets.
https://arxiv.org/abs/1901.01860
A new mechanism for efficiently solving the Markov decision processes (MDPs) is proposed in this paper. We introduce the notion of reachability landscape where we use the Mean First Passage Time (MFPT) as a means to characterize the reachability of every state in the state space. We show that such reachability characterization very well assesses the importance of states and thus provides a natural basis for effectively prioritizing states and approximating policies. Built on such a novel observation, we design two new algorithms – Mean First Passage Time based Value Iteration (MFPT-VI) and Mean First Passage Time based Policy Iteration (MFPT-PI) – that have been modified from the state-of-the-art solution methods. To validate our design, we have performed numerical evaluations in robotic decision-making scenarios, by comparing the proposed new methods with corresponding classic baseline mechanisms. The evaluation results showed that MFPT-VI and MFPT-PI have outperformed the state-of-the-art solutions in terms of both practical runtime and number of iterations. Aside from the advantage of fast convergence, this new solution method is intuitively easy to understand and practically simple to implement.
https://arxiv.org/abs/1901.01229
Deep learning has been broadly leveraged by major cloud providers such as Google, AWS, Baidu, to offer various computer vision related services including image auto-classification, object identification and illegal image detection. While many recent works demonstrated that deep learning classification models are vulnerable to adversarial examples, real-world cloud-based image detection services are more complex than classification and there is little literature about adversarial example attacks on detection services. In this paper, we mainly focus on studying the security of real-world cloud-based image detectors. Specifically, (1) based on effective semantic segmentation, we propose four different attacks to generate semantics-aware adversarial examples via only interacting with black-box APIs; and (2) we make the first attempt to conduct an extensive empirical study of black-box attacks against real-world cloud-based image detectors. Through evaluations on five popular cloud platforms including AWS, Azure, Google Cloud, Baidu Cloud and Alibaba Cloud, we demonstrate that our IP attack has a success rate of approximately 100%, and semantic segmentation based attacks (e.g., SP, SBLS, SBB) have a a success rate over 90% among different detection services, such as violence, politician and pornography detection. We discuss the possible defenses to address these security challenges in cloud-based detectors.
https://arxiv.org/abs/1901.01223
We present the first attempt to perform short glass fiber semantic segmentation from X-ray computed tomography volumetric datasets at medium (3.9 μm isotropic) and low (8.3 μm isotropic) resolution using deep learning architectures. We performed experiments on both synthetic and real CT scans and evaluated deep fully convolutional architectures with both 2D and 3D kernels. Our artificial neural networks outperform existing methods at both medium and low resolution scans.
https://arxiv.org/abs/1901.01211
Comparing different algorithms for segmenting glass fibers in industrial computed tomography (CT) scans is difficult due to the absence of a standard reference dataset. In this work, we introduce a set of annotated scans of short-fiber reinforced polymers (SFRP) as well as synthetically created CT volume data together with the evaluation metrics. We suggest both the metrics and this data set as a reference for studying the performance of different algorithms. The real scans were acquired by a Nikon MCT225 X-ray CT system. The simulated scans were created by the use of an in-house computational model and third-party commercial software. For both types of data, corresponding ground truth annotations have been prepared, including hand annotations for the real scans and STL models for the synthetic scans. Additionally, a Hessian-based Frangi vesselness filter for fiber segmentation has been implemented and open-sourced to serve as a reference for comparisons.
https://arxiv.org/abs/1901.01210
A correspondence between database tuples as causes for query answers in databases and tuple-based repairs of inconsistent databases with respect to denial constraints has already been established. In this work, answer-set programs that specify repairs of databases are used as a basis for solving computational and reasoning problems about causes. Here, causes are also introduced at the attribute level by appealing to a both null-based and attribute-based repair semantics. The corresponding repair programs are presented, and they are used as a basis for computation and reasoning about attribute-level causes. They are extended to deal with the case of causality under integrity constraints. Several examples with the DLV system are shown.
http://arxiv.org/abs/1712.01001
As sound event classification moves towards larger datasets, issues of label noise become inevitable. Web sites can supply large volumes of user-contributed audio and metadata, but inferring labels from this metadata introduces errors due to unreliable inputs, and limitations in the mapping. There is, however, little research into the impact of these errors. To foster the investigation of label noise in sound event classification we present FSDnoisy18k, a dataset containing 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data. We characterize the label noise empirically, and provide a CNN baseline system. Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data. We also show that noise-robust loss functions can be effective in improving performance in presence of corrupted labels.
http://arxiv.org/abs/1901.01189
Neural attention (NA) has become a key component of sequence-to-sequence models that yield state-of-the-art performance in as hard tasks as abstractive document summarization (ADS) and video captioning (VC). NA mechanisms perform inference of context vectors; these constitute weighted sums of deterministic input sequence encodings, adaptively sourced over long temporal horizons. Inspired from recent work in the field of amortized variational inference (AVI), in this work we consider treating the context vectors generated by soft-attention (SA) models as latent variables, with approximate finite mixture model posteriors inferred via AVI. We posit that this formulation may yield stronger generalization capacity, in line with the outcomes of existing applications of AVI to deep networks. To illustrate our method, we implement it and experimentally evaluate it considering challenging ADS, VC, and MT benchmarks. This way, we exhibit its improved effectiveness over state-of-the-art alternatives.
https://arxiv.org/abs/1805.09039
The e-commerce has started a new trend in natural language processing through sentiment analysis of user-generated reviews. Different consumers have different concerns about various aspects of a specific product or service. Aspect category detection, as a subtask of aspect-based sentiment analysis, tackles the problem of categorizing a given review sentence into a set of pre-defined aspect categories. In recent years, deep learning approaches have brought revolutionary advances in multiple branches of natural language processing including sentiment analysis. In this paper, we propose a deep neural network method based on attention mechanism to identify different aspect categories of a given review sentence. Our model utilizes several attentions with different topic contexts, enabling it to attend to different parts of a review sentence based on different topics. Experimental results on two datasets in the restaurant domain released by SemEval workshop demonstrates that our approach outperforms existing methods on both datasets. Visualization of the topic attention weights shows the effectiveness of our model in identifying words related to different topics.
https://arxiv.org/abs/1901.01183
In Intelligent Transportation System, real-time systems that monitor and analyze road users become increasingly critical as we march toward the smart city era. Vision-based frameworks for Object Detection, Multiple Object Tracking, and Traffic Near Accident Detection are important applications of Intelligent Transportation System, particularly in video surveillance and etc. Although deep neural networks have recently achieved great success in many computer vision tasks, a uniformed framework for all the three tasks is still challenging where the challenges multiply from demand for real-time performance, complex urban setting, highly dynamic traffic event, and many traffic movements. In this paper, we propose a two-stream Convolutional Network architecture that performs real-time detection, tracking, and near accident detection of road users in traffic video data. The two-stream model consists of a spatial stream network for Object Detection and a temporal stream network to leverage motion features for Multiple Object Tracking. We detect near accidents by incorporating appearance features and motion features from two-stream networks. Using aerial videos, we propose a Traffic Near Accident Dataset (TNAD) covering various types of traffic interactions that is suitable for vision-based traffic analysis tasks. Our experiments demonstrate the advantage of our framework with an overall competitive qualitative and quantitative performance at high frame rates on the TNAD dataset.
https://arxiv.org/abs/1901.01138
Unmanned Aerial Vehicles (UAVs), have intrigued different people from all walks of life, because of their pervasive computing capabilities. UAV equipped with vision techniques, could be leveraged to establish navigation autonomous control for UAV itself. Also, object detection from UAV could be used to broaden the utilization of drone to provide ubiquitous surveillance and monitoring services towards military operation, urban administration and agriculture management. As the data-driven technologies evolved, machine learning algorithm, especially the deep learning approach has been intensively utilized to solve different traditional computer vision research problems. Modern Convolutional Neural Networks based object detectors could be divided into two major categories: one-stage object detector and two-stage object detector. In this study, we utilize some representative CNN based object detectors to execute the computer vision task over Stanford Drone Dataset (SDD). State-of-the-art performance has been achieved in utilizing focal loss dense detector RetinaNet based approach for object detection from UAV in a fast and accurate manner.
https://arxiv.org/abs/1808.05756
Generative modeling of natural images has been extensively studied in recent years, yielding remarkable progress. Current state-of-the-art methods are either based on maximum likelihood estimation or adversarial training. Both methods have their own drawbacks, which are complementary in nature. The first leads to over-generalization as the maximum likelihood criterion encourages models to cover the support of the training data by heavily penalizing small masses assigned to training data. Simplifying assumptions in such models limits their capacity and makes them spill mass on unrealistic samples. The second leads to mode-dropping since adversarial training encourages high quality samples from the model, but only indirectly enforces diversity among the samples. To overcome these drawbacks we make two contributions. First, we propose a novel extension to the variational autoencoders model by using deterministic invertible transformation layers to map samples from the decoder to the image space. This induces correlations among the pixels given the latent variables, improving over commonly used factorial decoders. Second, we propose a training approach that leverages coverage and quality based criteria. Our models obtain likelihood scores competitive with state-of-the-art likelihood-based models, while achieving sample quality typical of adversarially trained networks.
https://arxiv.org/abs/1901.01091
Over the past few years significant progress has been made in the field of presentation attack detection (PAD) for automatic speaker recognition (ASV). This includes the development of new speech corpora, standard evaluation protocols and advancements in front-end feature extraction and back-end classifiers. The use of standard databases and evaluation protocols has enabled for the first time the meaningful benchmarking of different PAD solutions. This chapter summarises the progress, with a focus on studies completed in the last three years. The article presents a summary of findings and lessons learned from two ASVspoof challenges, the first community-led benchmarking efforts. These show that ASV PAD remains an unsolved problem and that further attention is required to develop generalised PAD solutions which have potential to detect diverse and previously unseen spoofing attacks.
http://arxiv.org/abs/1901.01085
Penetration testing is a well-established practical concept for the identification of potentially exploitable security weaknesses and an important component of a security audit. Providing a holistic security assessment for networks consisting of several hundreds hosts is hardly feasible though without some sort of mechanization. Mitigation, prioritizing counter-measures subject to a given budget, currently lacks a solid theoretical understanding and is hence more art than science. In this work, we propose the first approach for conducting comprehensive what-if analyses in order to reason about mitigation in a conceptually well-founded manner. To evaluate and compare mitigation strategies, we use simulated penetration testing, i.e., automated attack-finding, based on a network model to which a subset of a given set of mitigation actions, e.g., changes to the network topology, system updates, configuration changes etc. is applied. Using Stackelberg planning, we determine optimal combinations that minimize the maximal attacker success (similar to a Stackelberg game), and thus provide a well-founded basis for a holistic mitigation strategy. We show that these Stackelberg planning models can largely be derived from network scan, public vulnerability databases and manual inspection with various degrees of automation and detail, and we simulate mitigation analysis on networks of different size and vulnerability.
http://arxiv.org/abs/1705.05088
Since the advent of deep learning, neural networks have demonstrated remarkable results in many visual recognition tasks, constantly pushing the limits. However, the state-of-the-art approaches are largely unsuitable in scarce data regimes. To address this shortcoming, this paper proposes employing a 3D model, which is derived from training images. Such a model can then be used to hallucinate novel viewpoints and poses for the scarce samples of the few-shot learning scenario. A self-paced learning approach allows for the selection of a diverse set of high-quality images, which facilitates the training of a classifier. The performance of the proposed approach is showcased on the fine-grained CUB-200-2011 dataset in a few-shot setting and significantly improves our baseline accuracy.
http://arxiv.org/abs/1901.01868
We present a novel and effective method for detecting 3D primitives in cluttered, unorganized point clouds, without axillary segmentation or type specification. We consider the quadric surfaces for encapsulating the basic building blocks of our environments - planes, spheres, ellipsoids, cones or cylinders, in a unified fashion. Moreover, quadrics allow us to model higher degree of freedom shapes, such as hyperboloids or paraboloids that could be used in non-rigid settings. We begin by contributing two novel quadric fits targeting 3D point sets that are endowed with tangent space information. Based upon the idea of aligning the quadric gradients with the surface normals, our first formulation is exact and requires as low as four oriented points. The second fit approximates the first, and reduces the computational effort. We theoretically analyze these fits with rigor, and give algebraic and geometric arguments. Next, by re-parameterizing the solution, we devise a new local Hough voting scheme on the null-space coefficients that is combined with RANSAC, reducing the complexity from $O(N^4)$ to $O(N^3)$ (three points). To the best of our knowledge, this is the first method capable of performing a generic cross-type multi-object primitive detection in difficult scenes without segmentation. Our extensive qualitative and quantitative results show that our method is efficient and flexible, as well as being accurate.
http://arxiv.org/abs/1901.01255
Point clouds obtained with 3D scanners or by image-based reconstruction techniques are often corrupted with significant amount of noise and outliers. Traditional methods for point cloud denoising largely rely on local surface fitting (e.g., jets or MLS surfaces), local or non-local averaging, or on statistical assumptions about the underlying noise model. In contrast, we develop a simple data-driven method for removing outliers and reducing noise in unordered point clouds. We base our approach on a deep learning architecture adapted from PCPNet, which was recently proposed for estimating local 3D shape properties in point clouds. Our method classifies and discards outlier samples, and estimates correction vectors that project noisy points onto the original clean surfaces. The approach is efficient and robust to varying amounts of noise and outliers, while being able to handle large densely-sampled point clouds. In our extensive evaluation, we show an increased robustness to strong noise levels compared to various state-of-the-art methods, enabling high-quality surface reconstruction from extremely noisy real data obtained by range scans. Finally, the simplicity and universality of our approach makes it very easy to integrate in any existing geometry processing pipeline.
https://arxiv.org/abs/1901.01060
Drones have gained popularity in a wide range of field ranging from aerial photography, aerial mapping, and investigation of electric power lines. Every drone that we know today is carrying out some kind of control algorithm at the low level in order to manoeuvre itself around. For the quadrotor to either control itself autonomously or to develop a high-level user interface for us to control it, we need to understand the basic mathematics behind how it functions. This paper aims to explain the mathematical modelling of the dynamics of a 3 Dimensional quadrotor. As it may seem like a trivial task, it plays a vital role in how we control the drone. Also, additional effort has been taken to explain the transformations of the drone’s frame of reference to the inertial frame of reference.
http://arxiv.org/abs/1901.01051