In this paper, we propose and study opportunistic contextual bandits - a special case of contextual bandits where the exploration cost varies under different environmental conditions, such as network load or return variation in recommendations. When the exploration cost is low, so is the actual regret of pulling a sub-optimal arm (e.g., trying a suboptimal recommendation). Therefore, intuitively, we could explore more when the exploration cost is relatively low and exploit more when the exploration cost is relatively high. Inspired by this intuition, for opportunistic contextual bandits with Linear payoffs, we propose an Adaptive Upper-Confidence-Bound algorithm (AdaLinUCB) to adaptively balance the exploration-exploitation trade-off for opportunistic learning. We prove that AdaLinUCB achieves O((log T)^2) problem-dependent regret upper bound, which has a smaller coefficient than that of the traditional LinUCB algorithm. Moreover, based on both synthetic and real-world dataset, we show that AdaLinUCB significantly outperforms other contextual bandit algorithms, under large exploration cost fluctuations.
http://arxiv.org/abs/1902.07802
Model-based reinforcement learning (RL) has proven to be a data efficient approach for learning control tasks but is difficult to utilize in domains with complex observations such as images. In this paper, we present a method for learning representations that are suitable for iterative model-based policy improvement, in that these representations are optimized for inferring simple dynamics and cost models given data from the current policy. This enables a model-based RL method based on the linear-quadratic regulator (LQR) to be used for systems with image observations. We evaluate our approach on a suite of robotics tasks, including manipulation tasks on a real Sawyer robot arm directly from images, and we find that our method results in better final performance than other model-based RL methods while being significantly more efficient than model-free RL. Videos of our results are available at https://sites.google.com/view/icml19solar
http://arxiv.org/abs/1808.09105
Deep neural networks provide unprecedented performance in all image classification problems, leveraging the availability of huge amounts of data for training. Recent studies, however, have shown their vulnerability to adversarial attacks, spawning an intense research effort in this field. With the aim of building better systems, new countermeasures and stronger attacks are proposed by the day. On the attacker’s side, there is growing interest for the realistic black-box scenario, in which the user has no access to the neural network parameters. The problem is to design limited-complexity attacks which mislead the neural network without impairing image quality too much, not to raise the attention of human observers. In this work, we put special emphasis on this latter requirement and propose a powerful and low-complexity black-box attack which preserves perceptual image quality. Numerical experiments prove the effectiveness of the proposed techniques both for tasks commonly considered in this context, and for other applications in biometrics (face recognition) and forensics (camera model identification).
http://arxiv.org/abs/1902.07776
Supervised deep learning relies on the assumption that enough training data is available, which presents a problem for its application to several fields, like medical imaging. On the example of a binary image classification task (breast cancer recognition), we show that pretraining a generative model for meaningful image augmentation helps enhance the performance of the resulting classifier. By augmenting the data, performance on downstream classification tasks could be improved even with a relatively small training set. We show that this “adversarial augmentation” yields promising results compared to classical image augmentation on the example of breast cancer classification.
http://arxiv.org/abs/1902.07762
Connected and automated vehicle (CAV) technology is one of the promising solutions to addressing the safety, mobility and sustainability issues of our current transportation systems. Specifically, the control algorithm plays an important role in a CAV system, since it executes the commands generated by former steps, such as communication, perception, and planning. In this study, we propose a consensus algorithm to control the longitudinal motion of CAVs in real time. Different from previous studies in this field where control gains of the consensus algorithm are pre-determined and fixed, we develop algorithms to build up a lookup table, searching for the ideal control gains with respect to different initial conditions of CAVs in real time. Numerical simulation shows that, the proposed lookup table-based consensus algorithm outperforms the authors’ previous work, as well as van Arem’s linear feedback-based longitudinal motion control algorithm in all four different scenarios with various initial conditions of CAVs, in terms of convergence time and maximum jerk of the simulation run.
http://arxiv.org/abs/1902.07747
Defined by Gelfond in 1991 (G91), epistemic specifications (or programs) are an extension of logic programming under stable models semantics that introducessubjective literals. A subjective literal al-lows checking whether some regular literal is true in all (or in some of) the stable models of the program, being those models collected in a setcalledworld view. One epistemic program may yield several world views but, under the original G91 semantics, some of them resulted from self-supported derivations. During the last eight years, several alternative approaches have been proposed to get rid of these self-supported worldviews. Unfortunately, their success could only be measured by studying their behaviour on a set of common examples in the literature, since no formal property of “self-supportedness” had been defined. To fill this gap, we extend in this paper the idea of unfounded set from standard logic programming to the epistemic case. We define when a world view is founded with respect to some program and propose the foundedness property for any semantics whose world views are always founded. Using counterexamples, we explain that the previous approaches violate foundedness, and proceed to propose a new semantics based on a combination of Moore’s Autoepistemic Logic and Pearce’s Equilibrium Logic. The main result proves that this new semantics precisely captures the set of founded G91 world views.
http://arxiv.org/abs/1902.07741
Variational inference relies on flexible approximate posterior distributions. Normalizing flows provide a general recipe to construct flexible variational posteriors. We introduce Sylvester normalizing flows, which can be seen as a generalization of planar flows. Sylvester normalizing flows remove the well-known single-unit bottleneck from planar flows, making a single transformation much more flexible. We compare the performance of Sylvester normalizing flows against planar flows and inverse autoregressive flows and demonstrate that they compare favorably on several datasets.
http://arxiv.org/abs/1803.05649
Recent studies have highlighted the high correlation between cardiovascular diseases (CVD) and lung cancer, and both are associated with significant morbidity and mortality. Low-Dose CT (LCDT) scans have led to significant improvements in the accuracy of lung cancer diagnosis and thus the reduction of cancer deaths. However, the high correlation between lung cancer and CVD has not been well explored for mortality prediction. This paper introduces a knowledge-based analytical method using deep convolutional neural network (CNN) for all-cause mortality prediction. The underlying approach combines structural image features extracted from CNNs, based on LDCT volume in different scale, and clinical knowledge obtained from quantitative measurements, to comprehensively predict the mortality risk of lung cancer screening subjects. The introduced method is referred to here as the Knowledge-based Analysis of Mortality Prediction Network, or KAMP-Net. It constitutes a collaborative framework that utilizes both imaging features and anatomical information, instead of completely relying on automatic feature extraction. Our work demonstrates the feasibility of incorporating quantitative clinical measurements to assist CNNs in all-cause mortality prediction from chest LDCT images. The results of this study confirm that radiologist defined features are an important complement to CNNs to achieve a more comprehensive feature extraction. Thus, the proposed KAMP-Net has shown to achieve a superior performance when compared to other methods.
http://arxiv.org/abs/1902.07687
As humans we are driven by a strong desire for seeking novelty in our world. Also upon observing a novel pattern we are capable of refining our understanding of the world based on the new information—humans can discover their world. The outstanding ability of the human mind for discovery has led to many breakthroughs in science, art and technology. Here we investigate the possibility of building an agent capable of discovering its world using the modern AI technology. In particular we introduce NDIGO, Neural Differential Information Gain Optimisation, a self-supervised discovery model that aims at seeking new information to construct a global view of its world from partial and noisy observations. Our experiments on some controlled 2-D navigation tasks show that NDIGO outperforms state-of-the-art information-seeking methods in terms of the quality of the learned representation. The improvement in performance is particularly significant in the presence of white or structured noise where other information-seeking methods follow the noise instead of discovering their world.
http://arxiv.org/abs/1902.07685
Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new tool for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/
http://arxiv.org/abs/1902.07669
In this paper, we propose a simple, fast and easy to implement algorithm LOSSGRAD (locally optimal step-size in gradient descent), which automatically modifies the step-size in gradient descent during neural networks training. Given a function $f$, a point $x$, and the gradient $\nabla_x f$ of $f$, we aim to find the step-size $h$ which is (locally) optimal, i.e. satisfies: Making use of quadratic approximation, we show that the algorithm satisfies the above assumption. We experimentally show that our method is insensitive to the choice of initial learning rate while achieving results comparable to other methods.
http://arxiv.org/abs/1902.07656
Automatic age estimation from facial images represents an important task in computer vision. This paper analyses the effect of gender, age, ethnic, makeup and expression attributes of faces as sources of bias to improve deep apparent age prediction. Following recent works where it is shown that apparent age labels benefit real age estimation, rather than direct real to real age regression, our main contribution is the integration, in an end-to-end architecture, of face attributes for apparent age prediction with an additional loss for real age regression. Experimental results on the APPA-REAL dataset indicate the proposed network successfully take advantage of the adopted attributes to improve both apparent and real age estimation. Our model outperformed a state-of-the-art architecture proposed to separately address apparent and real age regression. Finally, we present preliminary results and discussion of a proof of concept application using the proposed model to regress the apparent age of an individual based on the gender of an external observer.
http://arxiv.org/abs/1902.07653
Convolutional Neural Networks (CNNs) are the state-of-the-art algorithms used in computer vision. However, these models often suffer from the lack of interpretability of their information transformation process. To address this problem, we introduce a novel model called Sparse Deep Predictive Coding (SDPC). In a biologically realistic manner, SDPC mimics how the brain is efficiently representing visual information. This model complements the hierarchical convolutional layers found in CNNs with the feed-forward and feed-back update scheme described in the Predictive Coding (PC) theory and found in the architecture of the mammalian visual system. We experimentally demonstrate on two databases that the SDPC model extracts qualitatively meaningful features. These features, besides being similar to some of the biological Receptive Fields of the visual cortex, also represent hierarchically independent components of the image that are crucial to describe it in a generic manner. For the first time, the SDPC model demonstrates a meaningful representation of features within the hierarchical generative model and of the decision-making process leading to a specific prediction. A quantitative analysis reveals that the features extracted by the SDPC model encode the input image into a representation that is both easily classifiable and robust to noise.
http://arxiv.org/abs/1902.07651
Neural architecture search (NAS) is a promising research direction that has the potential to replace expert-designed networks with learned, task-specific architectures. In this work, in order to help ground the empirical results in this field, we propose new NAS baselines that build off the following observations: (i) NAS is a specialized hyperparameter optimization problem; and (ii) random search is a competitive baseline for hyperparameter optimization. Leveraging these observations, we evaluate both random search with early-stopping and a novel random search with weight-sharing algorithm on two standard NAS benchmarks—PTB and CIFAR-10. Our results show that random search with early-stopping is a competitive NAS baseline, e.g., it performs at least as well as ENAS, a leading NAS method, on both benchmarks. Additionally, random search with weight-sharing outperforms random search with early-stopping, achieving a state-of-the-art NAS result on PTB and a highly competitive result on CIFAR-10. Finally, we explore the existing reproducibility issues of published NAS results. We note the lack of source material needed to exactly reproduce these results, and further discuss the robustness of published results given the various sources of variability in NAS experimental setups. Relatedly, we provide all information (code, random seeds, documentation) needed to exactly reproduce our results, and report our random search with weight-sharing results for each benchmark on two independent experimental runs.
https://arxiv.org/abs/1902.07638
Filtering point targets in highly cluttered and noisy data frames can be very challenging, especially for complex target motions. Fixed motion models can fail to provide accurate predictions, while learning based algorithm can be difficult to design (due to the variable number of targets), slow to train and dependent on separate train/test steps. To address these issues, this paper proposes a multi-target filtering algorithm which learns the motion models, on the fly, using a recurrent neural network with a long short-term memory architecture, as a regression block. The target state predictions are then corrected using a novel data association algorithm, with a low computational complexity. The proposed algorithm is evaluated over synthetic and real point target filtering scenarios, demonstrating a remarkable performance over highly cluttered data sequences.
http://arxiv.org/abs/1902.07630
advertorch is a toolbox for adversarial robustness research. It contains various implementations for attacks, defenses and robust training methods. advertorch is built on PyTorch (Paszke et al., 2017), and leverages the advantages of the dynamic computational graph to provide concise and efficient reference implementations. The code is licensed under the LGPL license and is open sourced at https://github.com/BorealisAI/advertorch .
http://arxiv.org/abs/1902.07623
A self-learning optimal control algorithm for episodic fixed-horizon manufacturing processes with time-discrete control actions is proposed and evaluated on a simulated deep drawing process. The control model is built during consecutive process executions under optimal control via reinforcement learning, using the measured product quality as reward after each process execution. Prior model formulation, which is required by state-of-the-art algorithms from model predictive control and approximate dynamic programming, is therefore obsolete. This avoids several difficulties namely in system identification, accurate modelling, and runtime complexity, that arise when dealing with processes subject to nonlinear dynamics and stochastic influences. Instead of using pre-created process and observation models, value function-based reinforcement learning algorithms build functions of expected future reward, which are used to derive optimal process control decisions. The expectation functions are learned online, by interacting with the process. The proposed algorithm takes stochastic variations of the process conditions into account and is able to cope with partial observability. A Q-learning-based method for adaptive optimal control of partially observable episodic fixed-horizon manufacturing processes is developed and studied. The resulting algorithm is instantiated and evaluated by applying it to a simulated stochastic optimal control problem in metal sheet deep drawing.
http://arxiv.org/abs/1809.06646
Building multilingual and crosslingual models help bring different languages together in a language universal space. It allows models to share parameters and transfer knowledge across languages, enabling faster and better adaptation to a new language. These approaches are particularly useful for low resource languages. In this paper, we propose a phoneme-level language model that can be used multilingually and for crosslingual adaptation to a target language. We show that our model performs almost as well as the monolingual models by using six times fewer parameters, and is capable of better adaptation to languages not seen during training in a low resource scenario. We show that these phoneme-level language models can be used to decode sequence based Connectionist Temporal Classification (CTC) acoustic model outputs to obtain comparable word error rates with Weighted Finite State Transducer (WFST) based decoding in Babel languages. We also show that these phoneme-level language models outperform WFST decoding in various low-resource conditions like adapting to a new language and domain mismatch between training and testing data.
http://arxiv.org/abs/1902.07613
In this paper, we propose an algorithm for point cloud denoising based on the tensor Tucker decomposition. We first represent the local surface patches of a noisy point cloud to be matrices by their distances to a reference point, and stack the similar patch matrices to be a 3rd order tensor. Then we use the Tucker decomposition to compress this patch tensor to be a core tensor of smaller size. We consider this core tensor as the frequency domain and remove the noise by manipulating the hard thresholding. Finally, all the fibers of the denoised patch tensor are placed back, and the average is taken if there are more than one estimators overlapped. The experimental evaluation shows that the proposed algorithm outperforms the state-of-the-art graph Laplacian regularized (GLR) algorithm when the Gaussian noise is high ($\sigma=0.1$), and the GLR algorithm is better in lower noise cases ($\sigma=0.04, 0.05, 0.08$).
http://arxiv.org/abs/1902.07602
Spectral unmixing (SU) is a technique to characterize mixed pixels in hyperspectral images measured by remote sensors. Most of the spectral unmixing algorithms are developed using the linear mixing models. To estimate endmembers and fractional abundance matrices in a blind problem, nonnegative matrix factorization (NMF) and its developments are widely used in the SU problem. One of the constraints which was added to NMF is sparsity, that was regularized by Lq norm. In this paper, a new algorithm based on distributed optimization is suggested for spectral unmixing. In the proposed algorithm, a network including single-node clusters is employed. Each pixel in the hyperspectral images is considered as a node in this network. The sparsity constrained distributed unmixing is optimized with diffusion least mean p-power (LMP) strategy, and then the update equations for fractional abundance and signature matrices are obtained. Afterwards the proposed algorithm is analyzed for different values of LMP power and Lq norms. Simulation results based on defined performance metrics illustrate the advantage of the proposed algorithm in spectral unmixing of hyperspectral data compared with other methods.
http://arxiv.org/abs/1902.07593
Synchrotron-based x-ray tomography is a noninvasive imaging technique that allows for reconstructing the internal structure of materials at high spatial resolutions. Here we present TomoGAN, a novel denoising technique based on generative adversarial networks, for improving the quality of reconstructed images for low-dose imaging conditions, as at smaller length scales where higher radiation doses are required to resolve sample features. Our trained model, unlike other machine-learning-based solutions, is generic: it can be applied to many datasets collected at varying experimental conditions. We evaluate our approach in two photon-budget-limited experimental conditions: (1) sufficient number of low-dose projections (based on Nyquist sampling), and (2) insufficient or limited number of high-dose projections. In both cases, angular sampling is assumed to be isotropic, and the photon budget throughout the experiment is fixed based on the maximum allowable radiation dose. Evaluation with both simulated and experimental datasets shows that our approach can reduce noise in reconstructed images significantly, improving the structural similarity score for simulation and experimental data with ground truth from 0.18 to 0.9 and from 0.18 to 0.41, respectively. Furthermore, the quality of the reconstructed images with filtered back projection followed by our denoising approach exceeds that of reconstructions with simultaneous iterative reconstruction.
http://arxiv.org/abs/1902.07582
In this paper, we propose a multimodal search engine that combines visual and textual cues to retrieve items from a multimedia database aesthetically similar to the query. The goal of our engine is to enable intuitive retrieval of fashion merchandise such as clothes or furniture. Existing search engines treat textual input only as an additional source of information about the query image and do not correspond to the real-life scenario where the user looks for ‘the same shirt but of denim’. Our novel method, dubbed DeepStyle, mitigates those shortcomings by using a joint neural network architecture to model contextual dependencies between features of different modalities. We prove the robustness of this approach on two different challenging datasets of fashion items and furniture where our DeepStyle engine outperforms baseline methods by 18-21% on the tested datasets. Our search engine is commercially deployed and available through a Web-based application.
http://arxiv.org/abs/1801.03002
In Reinforcement Learning (RL), an agent explores the environment and collects trajectories into the memory buffer for later learning. However, the collected trajectories can easily be imbalanced with respect to the achieved goal states. The problem of learning from imbalanced data is a well-known problem in supervised learning, but has not yet been thoroughly researched in RL. To address this problem, we propose a novel Curiosity-Driven Prioritization (CDP) framework to encourage the agent to over-sample those trajectories that have rare achieved goal states. The CDP framework mimics the human learning process and focuses more on relatively uncommon events. We evaluate our methods using the robotic environment provided by OpenAI Gym. The environment contains six robot manipulation tasks. In our experiments, we combined CDP with Deep Deterministic Policy Gradient (DDPG) with or without Hindsight Experience Replay (HER). The experimental results show that CDP improves both performance and sample-efficiency of reinforcement learning agents, compared to state-of-the-art methods.
http://arxiv.org/abs/1902.08039
Automatically reasoning with conflicting generic clinical guidelines is a burning issue in patient-centric medical reasoning where patient-specific conditions and goals need to be taken into account. It is even more challenging in the presence of preferences such as patient’s wishes and clinician’s priorities over goals. We advance a structured argumentation formalism for reasoning with conflicting clinical guidelines, patient-specific information and preferences. Our formalism integrates assumption-based reasoning and goal-driven selection among reasoning outcomes. Specifically, we assume applicability of guideline recommendations concerning the generic goal of patient well-being, resolve conflicts among recommendations using patient’s conditions and preferences, and then consider prioritised patient-centered goals to yield non-conflicting, goal-maximising and preference-respecting recommendations. We rely on the state-of-the-art Transition-based Medical Recommendation model for representing guideline recommendations and augment it with context given by the patient’s conditions, goals, as well as preferences over recommendations and goals. We establish desirable properties of our approach in terms of sensitivity to recommendation conflicts and patient context.
http://arxiv.org/abs/1902.07526
We propose a novel dynamic image reconstruction method from PET listmode data that could be particularly suited to tracking single or small numbers of cells. In contrast to conventional PET reconstruction the proposed method combines the information from all detected events not only to reconstruct the dynamic evolution of the radionuclide distribution, but also to simultaneously improve the reconstruction at each single time point by enforcing temporal consistency. This is achieved via the use of optimal transport regularization where in principle, among all possible temporally evolving radionuclide distributions consistent with the PET measurement, the one is chosen with least kinetic motion energy. The reconstruction is found by convex optimization so that there is no dependence on the initialization of the method. We study its behaviour on simulated data of a human PET system and demonstrate its robustness even in settings with very low radioactivity. In contrast to previously reported cell tracking algorithms, the proposed technique is oblivious to the number of tracked cells. Without any additional complexity one or multiple cells can be reconstructed, and the model automatically determines the number of particles. As one of the results, four radiolabelled cells moving with a velocity of 3.1 mm/s and a PET recorded count rate of 1.1 cps (for each cell) could be simultaneously tracked with a tracking accuracy of 5.4 mm inside a simulated human body.
http://arxiv.org/abs/1902.07521
Glaucoma is a leading cause of irreversible blindness. Accurate segmentation of the optic disc (OD) and cup (OC) from fundus images is beneficial to glaucoma screening and diagnosis. Recently, convolutional neural networks demonstrate promising progress in joint OD and OC segmentation. However, affected by the domain shift among different datasets, deep networks are severely hindered in generalizing across different scanners and institutions. In this paper, we present a novel patchbased Output Space Adversarial Learning framework (pOSAL) to jointly and robustly segment the OD and OC from different fundus image datasets. We first devise a lightweight and efficient segmentation network as a backbone. Considering the specific morphology of OD and OC, a novel morphology-aware segmentation loss is proposed to guide the network to generate accurate and smooth segmentation. Our pOSAL framework then exploits unsupervised domain adaptation to address the domain shift challenge by encouraging the segmentation in the target domain to be similar to the source ones. Since the whole-segmentationbased adversarial loss is insufficient to drive the network to capture segmentation details, we further design the pOSAL in a patch-based fashion to enable fine-grained discrimination on local segmentation details. We extensively evaluate our pOSAL framework and demonstrate its effectiveness in improving the segmentation performance on three public retinal fundus image datasets, i.e., Drishti-GS, RIM-ONE-r3, and REFUGE. Furthermore, our pOSAL framework achieved the first place in the OD and OC segmentation tasks in MICCAI 2018 Retinal Fundus Glaucoma Challenge.
http://arxiv.org/abs/1902.07519
We consider languages generated by weighted context-free grammars. It is shown that the behaviour of large texts is controlled by saddle-point equations for an appropriate generating function. We then consider ensembles of grammars, in particular the Random Language Model of E. DeGiuli, Phys. Rev. Lett., 2019. This model is solved in the replica-symmetric ansatz, which is valid in the high-temperature, disordered phase. It is shown that in the phase in which languages carry information, the replica symmetry must be broken.
http://arxiv.org/abs/1902.07516
Dense 3D visual mapping estimates as many as possible pixel depths, for each image. This results in very dense point clouds that often contain redundant and noisy information, especially for surfaces that are roughly planar, for instance, the ground or the walls in the scene. In this paper we leverage on semantic image segmentation to discriminate which regions of the scene require simplification and which should be kept at high level of details. We propose four different point cloud simplification methods which decimate the perceived point cloud by relying on class-specific local and global statistics still maintaining more points in the proximity of class boundaries to preserve the infra-class edges and discontinuities. 3D dense model is obtained by fusing the point clouds in a 3D Delaunay Triangulation to deal with variable point cloud density. In the experimental evaluation we have shown that, by leveraging on semantics, it is possible to simplify the model and diminish the noise affecting the point clouds.
http://arxiv.org/abs/1902.07511
Haptic exploration is a key skill for both robots and humans to discriminate and handle unknown or recognize familiar objects. Its active nature is impressively evident in humans which from early on reliably acquire sophisticated sensory-motor capabilites for active exploratory touch and directed manual exploration that associates surfaces and object properties with their spatial locations. In stark contrast, in robotics the relative lack of good real-world interaction models, along with very restricted sensors and a scarcity of suitable training data to leverage machine learning methods has so far rendered haptic exploration a largely underdeveloped skill for robots, very unlike vision where deep learning approaches and an abundance of available training data have triggered huge advances. In the present work, we connect recent advances in recurrent models of visual attention (RAM) with previous insights about the organisation of human haptic search behavior, exploratory procedures and haptic glances for a novel learning architecture that learns a generative model of haptic exploration in a simplified three-dimensional environment. The proposed algorithm simultaneously optimizes main perception-action loop components: feature extraction, integration of features over time, and the control strategy, while continuously acquiring data online. The resulting method has been successfully tested with four different objects. It achieved results close to 100% while performing object contour exploration that has been optimized for its own sensor morphology.
http://arxiv.org/abs/1902.07501
This paper is concerned with the training of recurrent neural networks as goal-oriented dialog agents using reinforcement learning. Training such agents with policy gradients typically requires a large amount of samples. However, the collection of the required data in form of conversations between chat-bots and human agents is time-consuming and expensive. To mitigate this problem, we describe an efficient policy gradient method using positive memory retention, which significantly increases the sample-efficiency. We show that our method is 10 times more sample-efficient than policy gradients in extensive experiments on a new synthetic number guessing game. Moreover, in a real-word visual object discovery game, the proposed method is twice as sample-efficient as policy gradients and shows state-of-the-art performance.
http://arxiv.org/abs/1810.01371
Learning goal-oriented dialogues by means of deep reinforcement learning has recently become a popular research topic. However, commonly used policy-based dialogue agents often end up focusing on simple utterances and suboptimal policies. To mitigate this problem, we propose a class of novel temperature-based extensions for policy gradient methods, which are referred to as Tempered Policy Gradients (TPGs). On a recent AI-testbed, i.e., the GuessWhat?! game, we achieve significant improvements with two innovations. The first one is an extension of the state-of-the-art solutions with Seq2Seq and Memory Network structures that leads to an improvement of 7%. The second one is the application of our newly developed TPG methods, which improves the performance additionally by around 5% and, even more importantly, helps produce more convincing utterances.
http://arxiv.org/abs/1807.00737
In Hindsight Experience Replay (HER), a reinforcement learning agent is trained by treating whatever it has achieved as virtual goals. However, in previous work, the experience was replayed at random, without considering which episode might be the most valuable for learning. In this paper, we develop an energy-based framework for prioritizing hindsight experience in robotic manipulation tasks. Our approach is inspired by the work-energy principle in physics. We define a trajectory energy function as the sum of the transition energy of the target object over the trajectory. We hypothesize that replaying episodes that have high trajectory energy is more effective for reinforcement learning in robotics. To verify our hypothesis, we designed a framework for hindsight experience prioritization based on the trajectory energy of goal states. The trajectory energy function takes the potential, kinetic, and rotational energy into consideration. We evaluate our Energy-Based Prioritization (EBP) approach on four challenging robotic manipulation tasks in simulation. Our empirical results show that our proposed method surpasses state-of-the-art approaches in terms of both performance and sample-efficiency on all four tasks, without increasing computational time. A video showing experimental results is available at https://youtu.be/jtsF2tTeUGQ
http://arxiv.org/abs/1810.01363
Assigning a label to each pixel in an image, namely semantic segmentation, has been an important task in computer vision, and has applications in autonomous driving, robotic navigation, localization, and scene understanding. Fully convolutional neural networks have proved to be a successful solution for the task over the years but most of the work being done focuses primarily on accuracy. In this paper, we present a computationally efficient approach to semantic segmentation, meanwhile achieving a high mIOU, $70.33\%$ on Cityscapes challenge. The network proposed is capable of running real-time on mobile devices. In addition, we make our code and model weights publicly available.
http://arxiv.org/abs/1902.07476
Convolutional neural networks excel in a number of computer vision tasks. One of their most crucial architectural elements is the effective receptive field size, that has to be manually set to accommodate a specific task. Standard solutions involve large kernels, down/up-sampling and dilated convolutions. These require testing a variety of dilation and down/up-sampling factors and result in non-compact representations and excessive number of parameters. We address this issue by proposing a new convolution filter composed of displaced aggregation units (DAU). DAUs learn spatial displacements and adapt the receptive field sizes of individual convolution filters to a given problem, thus eliminating the need for hand-crafted modifications. DAUs provide a seamless substitution of convolutional filters in existing state-of-the-art architectures, which we demonstrate on AlexNet, ResNet50, ResNet101, DeepLab and SRN-DeblurNet. The benefits of this design are demonstrated on a variety of computer vision tasks and datasets, such as image classification (ILSVRC 2012), semantic segmentation (PASCAL VOC 2011, Cityscape) and blind image de-blurring (GOPRO). Results show that DAUs efficiently allocate parameters resulting in up to four times more compact networks at similar or better performance.
http://arxiv.org/abs/1902.07474
Audio-visual event localization requires one to identify theevent which is both visible and audible in a video (eitherat a frame or video level). To address this task, we pro-pose a deep neural network named Audio-Visual sequence-to-sequence dual network (AVSDN). By jointly taking bothaudio and visual features at each time segment as inputs, ourproposed model learns global and local event information ina sequence to sequence manner, which can be realized in ei-ther fully supervised or weakly supervised settings. Empiricalresults confirm that our proposed method performs favorablyagainst recent deep learning approaches in both settings.
http://arxiv.org/abs/1902.07473
The convolutional neural network (CNN) has become a state-of-the-art method for several artificial intelligence domains in recent years. The increasingly complex CNN models are both computation-bound and I/O-bound. FPGA-based accelerators driven by custom instruction set architecture (ISA) achieve a balance between generality and efficiency, but there is much on them left to be optimized. We propose the full-stack compiler DNNVM, which is an integration of optimizers for graphs, loops and data layouts, and an assembler, a runtime supporter and a validation environment. The DNNVM works in the context of deep learning frameworks and transforms CNN models into the directed acyclic graph: XGraph. Based on XGraph, we transform the optimization challenges for both the data layout and pipeline into graph-level problems. DNNVM enumerates all potentially profitable fusion opportunities by a heuristic subgraph isomorphism algorithm to leverage pipeline and data layout optimizations, and searches for the optimal execution strategies of the whole computing graph. On the Xilinx ZU2 @330 MHz and ZU9 @330 MHz, we achieve equivalently state-of-the-art performance on our benchmarks by naive implementations without optimizations, and the throughput is further improved up to 1.26x by leveraging heterogeneous optimizations in DNNVM. Finally, with ZU9 @330 MHz, we achieve state-of-the-art performance for VGG and ResNet50. We achieve a throughput of 2.82 TOPs/s and an energy efficiency of 123.7 GOPs/s/W for VGG. Additionally, we achieve 1.38 TOPs/s for ResNet50.
http://arxiv.org/abs/1902.07463
Online structure learning approaches, such as those stemming from Statistical Relational Learning, enable the discovery of complex relations in noisy data streams. However, these methods assume the existence of fully-labelled training data, which is unrealistic for most real-world applications. We present a novel approach for completing the supervision of a semi-supervised structure learning task. We incorporate graph-cut minimisation, a technique that derives labels for unlabelled data, based on their distance to their labelled counterparts. In order to adapt graph-cut minimisation to first order logic, we employ a suitable structural distance for measuring the distance between sets of logical atoms. The labelling process is achieved online (single-pass) by means of a caching mechanism and the Hoeffding bound, a statistical tool to approximate globally-optimal decisions from locally-optimal ones. We evaluate our approach on the task of composite event recognition by using a benchmark dataset for human activity recognition, as well as a real dataset for maritime monitoring. The evaluation suggests that our approach can effectively complete the missing labels and eventually, improve the accuracy of the underlying structure learning system.
http://arxiv.org/abs/1803.00546
With the advent of sequential matching (of supply and demand) systems (uber, Lyft, Grab for taxis; ubereats, deliveroo, etc for food; amazon prime, lazada etc. for groceries) across many online and offline services, individuals (taxi drivers, delivery boys, delivery van drivers, etc.) earn more by being at the “right” place at the “right” time. We focus on learning techniques for providing guidance (on right locations to be at right times) to individuals in the presence of other “learning” individuals. Interactions between indivduals are anonymous, i.e, the outcome of an interaction (competing for demand) is independent of the identity of the agents and therefore we refer to these as Anonymous MARL settings. Existing research of relevance is on independent learning using Reinforcement Learning (RL) or on Multi-Agent Reinforcement Learning (MARL). The number of individuals in aggregation systems is extremely large and individuals have their own selfish interest (of maximising revenue). Therefore, traditional MARL approaches are either not scalable or assumptions of common objective or action coordination are not viable. In this paper, we focus on improving performance of independent reinforcement learners, specifically the popular Deep Q-Networks (DQN) and Advantage Actor Critic (A2C) approaches by exploiting anonymity. Specifically, we control non-stationarity introduced by other agents using entropy of agent density distribution. We demonstrate a significant improvement in revenue for individuals and for all agents together with our learners on a generic experimental set up for aggregation systems and a real world taxi dataset.
http://arxiv.org/abs/1803.09928
Two line-based fracture detection scheme are developed and discussed, namely Standard line-based fracture detection and Adaptive Differential Parameter Optimized (ADPO) line-based fracture detection. The purpose for the two line-based fracture detection schemes is to detect fractured lines from X-ray images using extracted features based on recognised patterns to differentiate fractured lines from non-fractured lines. The difference between the two schemes is the detection of detailed lines. The ADPO scheme optimizes the parameters of the Probabilistic Hough Transform, such that granule lines within the fractured regions are detected, whereas the Standard scheme is unable to detect them. The lines are detected using the Probabilistic Hough Function, in which the detected lines are a representation of the image edge objects. The lines are given in the form of points, (x,y), which includes the starting and ending point. Based on the given line points, 13 features are extracted from each line, as a summary of line information. These features are used for fracture and non-fracture classification of the detected lines. The classification is carried out by the Artificial Neural Network (ANN). There are two evaluations that are employed to evaluate both the entirety of the system and the ANN. The Standard Scheme is capable of achieving an average accuracy of 74.25%, whilst the ADPO scheme achieved an average accuracy of 74.4%. The ADPO scheme is opted for over the Standard scheme, however it can be further improved with detected contours and its extracted features.
http://arxiv.org/abs/1902.07458
Designing a technique for the automatic analysis of different actions in videos in order to detect the presence of interested activities is of high significance nowadays. In this paper, we explore a robust and dynamic appearance technique for the purpose of identifying different action activities. We also exploit a low-rank and structured sparse matrix decomposition (LSMD) method to better model these activities.. Our method is effective in encoding localized spatio-temporal features which enables the analysis of local motion taking place in the video. Our proposed model use adjacent frame differences as the input to the method thereby forcing it to capture the changes occurring in the video. The performance of our model is tested on a benchmark dataset in terms of detection accuracy. Results achieved with our model showed the promising capability of our model in detecting action activities.
http://arxiv.org/abs/1902.07438
Multishot Magnetic Resonance Imaging (MRI) is a promising imaging modality that can produce a high-resolution image with relatively less data acquisition time. The downside of multishot MRI is that it is very sensitive to subject motion and even small amounts of motion during the scan can produce artifacts in the final MR image that may cause misdiagnosis. Numerous efforts have been made to address this issue; however, all of these proposals are limited in terms of how much motion they can correct and the required computational time. In this paper, we propose a novel generative networks based conjugate gradient SENSE (CG-SENSE) reconstruction framework for motion correction in multishot MRI. The proposed framework first employs CG-SENSE reconstruction to produce the motion-corrupted image and then a generative adversarial network (GAN) is used to correct the motion artifacts. The proposed method has been rigorously evaluated on synthetically corrupted data on varying degrees of motion, numbers of shots, and encoding trajectories. Our analyses (both quantitative as well as qualitative/visual analysis) establishes that the proposed method significantly robust and outperforms state-of-the-art motion correction techniques and also reduces severalfold of computational times.
http://arxiv.org/abs/1902.07430
Kernel extreme learning machine (KELM) is a novel feedforward neural network, which is widely used in classification problems. To some extent, it solves the existing problems of the invalid nodes and the large computational complexity in ELM. However, the traditional KELM classifier usually has a low test accuracy when it faces multiclass classification problems. In order to solve the above problem, a new classifier, Mexican Hat wavelet KELM classifier, is proposed in this paper. The proposed classifier successfully improves the training accuracy and reduces the training time in the multiclass classification problems. Moreover, the validity of the Mexican Hat wavelet as a kernel function of ELM is rigorously proved. Experimental results on different data sets show that the performance of the proposed classifier is significantly superior to the compared classifiers.
http://arxiv.org/abs/1902.07422
Euler’s Elastica based unsupervised segmentation models have strong capability of completing the missing boundaries for existing objects in a clean image, but they are not working well for noisy images. This paper aims to establish a Euler’s Elastica based approach that properly deals with random noises to improve the segmentation performance for noisy images. We solve the corresponding optimization problem via using the progressive hedging algorithm (PHA) with a step length suggested by the alternating direction method of multipliers (ADMM). Technically, all the simplified convex versions of the subproblems derived from the major framework of PHA can be obtained by using the curvature weighted approach and the convex relaxation method. Then an alternating optimization strategy is applied with the merits of using some powerful accelerating techniques including the fast Fourier transform (FFT) and generalized soft threshold formulas. Extensive experiments have been conducted on both synthetic and real images, which validated some significant gains of the proposed segmentation models and demonstrated the advantages of the developed algorithm.
http://arxiv.org/abs/1902.07402
Parking spaces are costly to build, parking payments are difficult to enforce, and drivers waste an excessive amount of time searching for empty lots. Accurate quantification would inform developers and municipalities in space allocation and design, while real-time measurements would provide drivers and parking enforcement with information that saves time and resources. In this paper, we propose an accurate and real-time video system for future Internet of Things (IoT) and smart cities applications. Using recent developments in deep convolutional neural networks (DCNNs) and a novel vehicle tracking filter, we combine information across multiple image frames in a video sequence to remove noise introduced by occlusions and detection failures. We demonstrate that our system achieves higher accuracy than pure image-based instance segmentation, and is comparable in performance to industry benchmark systems that utilize more expensive sensors such as radar. Furthermore, our system shows significant potential in its scalability to a city-wide scale and also in the richness of its output that goes beyond traditional binary occupancy statistics.
http://arxiv.org/abs/1902.07401
We propose a lossy image compression system using the deep-learning autoencoder structure to participate in the Challenge on Learned Image Compression (CLIC) 2018. Our autoencoder uses the residual blocks with skip connections to reduce the correlation among image pixels and condense the input image into a set of feature maps, a compact representation of the original image. The bit allocation and bitrate control are implemented by using the importance maps and quantizer. The importance maps are generated by a separate neural net in the encoder. The autoencoder and the importance net are trained jointly based on minimizing a weighted sum of mean squared error, MS-SSIM, and a rate estimate. Our aim is to produce reconstructed images with good subjective quality subject to the 0.15 bits-per-pixel constraint.
http://arxiv.org/abs/1902.07385
This paper proposes a new nonlinear stability analysis for the acceleration-based robust position control of robot manipulators by using Disturbance Observer (DOb). It is shown that if the nominal inertia matrix is properly tuned in the design of DOb, then the position error asymptotically goes to zero in regulation control and is uniformly ultimately bounded in trajectory tracking control. As the bandwidth of DOb and the nominal inertia matrix are increased, the bound of error shrinks, i.e., the robust stability and performance of the position control system are improved. However, neither the bandwidth of DOb nor the nominal inertia matrix can be freely increased due to practical design constraints, e.g., the robust position controller becomes more noise sensitive when they are increased. The proposed stability analysis provides insights regarding the dynamic behavior of DOb-based robust motion control systems. It is theoretically and experimentally proved that non-diagonal elements of the nominal inertia matrix are useful to improve the stability and adjust the trade-off between the robustness and noise sensitivity. The validity of the proposal is verified by simulation and experimental results.
http://arxiv.org/abs/1902.07708
Visual place recognition (VPR) - the act of recognizing a familiar visual place - becomes difficult when there is extreme environmental appearance change or viewpoint change. Particularly challenging is the scenario where both phenomena occur simultaneously, such as when returning for the first time along a road at night that was previously traversed during the day in the opposite direction. While such problems can be solved with panoramic sensors, humans solve this problem regularly with limited field of view vision and without needing to constantly turn around. In this paper, we present a new depth- and temporal-aware visual place recognition system that solves the opposing viewpoint, extreme appearance-change visual place recognition problem. Our system performs sequence-to-single matching by extracting depth-filtered keypoints using a state-of-the-art depth estimation pipeline, constructing a keypoint sequence over multiple frames from the reference dataset, and comparing those keypoints to those in a single query image. We evaluate the system on a challenging benchmark dataset and show that it consistently outperforms state-of-the-art techniques. We also develop a range of diagnostic simulation experiments that characterize the contribution of depth-filtered keypoint sequences with respect to key domain parameters including degree of appearance change and camera motion.
http://arxiv.org/abs/1902.07381
In this paper, we present an end-to-end language identification framework, the attention-based Convolutional Neural Network-Bidirectional Long-short Term Memory (CNN-BLSTM). The model is performed on the utterance level, which means the utterance-level decision can be directly obtained from the output of the neural network. To handle speech utterances with entire arbitrary and potentially long duration, we combine CNN-BLSTM model with a self-attentive pooling layer together. The front-end CNN-BLSTM module plays a role as local pattern extractor for the variable-length inputs, and the following self-attentive pooling layer is built on top to get the fixed-dimensional utterance-level representation. We conducted experiments on NIST LRE07 closed-set task, and the results reveal that the proposed attention-based CNN-BLSTM model achieves comparable error reduction with other state-of-the-art utterance-level neural network approaches for all 3 seconds, 10 seconds, 30 seconds duration tasks.
http://arxiv.org/abs/1902.07374
Action recognition in videos has attracted a lot of attention in the past decade. In order to learn robust models, previous methods usually assume videos are trimmed as short sequences and require ground-truth annotations of each video frame/sequence, which is quite costly and time-consuming. In this paper, given only video-level annotations, we propose a novel weakly supervised framework to simultaneously locate action frames as well as recognize actions in untrimmed videos. Our proposed framework consists of two major components. First, for action frame localization, we take advantage of the self-attention mechanism to weight each frame, such that the influence of background frames can be effectively eliminated. Second, considering that there are trimmed videos publicly available and also they contain useful information to leverage, we present an additional module to transfer the knowledge from trimmed videos for improving the classification performance in untrimmed ones. Extensive experiments are conducted on two benchmark datasets (i.e., THUMOS14 and ActivityNet1.3), and experimental results clearly corroborate the efficacy of our method.
http://arxiv.org/abs/1902.07370
Human motion prediction from motion capture data is a classical problem in the computer vision, and conventional methods take the holistic human body as input. These methods ignore the fact that, in various human activities, different body components (limbs and the torso) have distinctive characteristics in terms of the moving pattern. In this paper, we argue local representations on different body components should be learned separately and, based on such idea, propose a network, Skeleton Network (SkelNet), for long-term human motion prediction. Specifically, at each time-step, local structure representations of input (human body) are obtained via SkelNet’s branches of component-specific layers, then the shared layer uses local spatial representations to predict the future human pose. Our SkelNet is the first to use local structure representations for predicting the human motion. Then, for short-term human motion prediction, we propose the second network, named as Skeleton Temporal Network (Skel-TNet). Skel-TNet consists of three components: SkelNet and a Recurrent Neural Network, they have advantages in learning spatial and temporal dependencies for predicting human motion, respectively; a feed-forward network that outputs the final estimation. Our methods achieve promising results on the Human3.6M dataset and the CMU motion capture dataset.
http://arxiv.org/abs/1902.07367