Accurately counting cells in microscopic images is important for medical diagnoses and biological studies, but manual cell counting is tedious, time-consuming, and prone to subjective errors, while automatic counting can be less accurate than desired. To improve the accuracy of automatic cell counting, we propose here a novel method that employs deeply-supervised density regression. A fully convolutional neural network (FCNN) serves as the primary network for density map regression. Innovatively, a set of auxiliary FCNNs is employed to provide additional supervision for learning the intermediate layers of the primary FCNN, improving its performance. In addition, the primary FCNN is designed as a concatenating framework that integrates multi-scale features through shortcut connections in the network, which improves the granularity of the features extracted from the intermediate layers and further supports the final density map estimation.
http://arxiv.org/abs/1903.01084
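A minimal PyTorch sketch of the deep-supervision idea from the density regression paper above: auxiliary 1x1-convolution heads attached to intermediate layers regress pooled versions of the same ground-truth density map as the primary head. The tiny two-block backbone, head placement and loss weighting here are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeeplySupervisedFCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                                    nn.MaxPool2d(2))
        self.block2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                                    nn.MaxPool2d(2))
        self.aux1 = nn.Conv2d(32, 1, 1)   # auxiliary head on block 1
        self.aux2 = nn.Conv2d(64, 1, 1)   # auxiliary head on block 2
        self.head = nn.Conv2d(64, 1, 1)   # primary density-map head

    def forward(self, x):
        f1 = self.block1(x)
        f2 = self.block2(f1)
        up = F.interpolate(f2, size=x.shape[-2:], mode='bilinear',
                           align_corners=False)
        return self.head(up), self.aux1(f1), self.aux2(f2)

def deep_supervision_loss(pred, aux1, aux2, gt, lam=0.1):
    # Auxiliary targets are pooled ground-truth maps at matching resolutions;
    # average pooling preserves density up to a constant factor.
    return (F.mse_loss(pred, gt) +
            lam * (F.mse_loss(aux1, F.avg_pool2d(gt, 2)) +
                   F.mse_loss(aux2, F.avg_pool2d(gt, 4))))
```

The per-image cell count is then simply the sum over the predicted density map.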
Imagination is one of the most important factors that make an artistic painting unique and impressive. With the rapid development of Artificial Intelligence, more and more researchers try to create paintings automatically with AI technology. However, a lack of imagination is still a main problem for AI painting. In this paper, we propose a novel approach to inject rich imagination into a special painting art, Mind Map creation. We first consider lexical and phonological similarities of the seed word, then learn and inherit the author's original painting style, and finally apply the principles of Dadaism and impossibility of improvisation to the painting process. We also design several metrics for imagination evaluation. Experimental results show that our proposed method can increase the imagination of the painting and also improve its overall quality.
http://arxiv.org/abs/1903.01080
Unsupervised cross-spectral stereo matching aims at recovering disparity given cross-spectral image pairs without any supervision in the form of ground truth disparity or depth. The estimated depth provides additional information complementary to individual semantic features, which can be helpful for other vision tasks such as tracking, recognition and detection. However, there are large appearance variations between images from different spectral bands, which is a challenge for cross-spectral stereo matching. Existing deep unsupervised stereo matching methods are sensitive to these appearance variations and do not perform well on cross-spectral data. We propose a novel unsupervised cross-spectral stereo matching framework based on image-to-image translation. First, a style adaptation network transforms images across different spectral bands by cycle consistency and adversarial learning, during which appearance variations are minimized. Then, a stereo matching network is trained with image pairs from the same spectrum using a view reconstruction loss. Finally, the estimated disparity is used to supervise the spectral translation network in an end-to-end way. Moreover, a novel style adaptation network, F-cycleGAN, is proposed to improve the robustness of spectral translation. Our method can tackle appearance variations and enhance the robustness of unsupervised cross-spectral stereo matching. Experimental results show that our method achieves good performance without using depth supervision or explicit semantic information.
http://arxiv.org/abs/1903.01078
Recent works in image captioning have shown very promising raw performance. However, we observe that most of these encoder-decoder style networks with attention do not scale naturally to large vocabulary sizes, making them difficult to deploy on embedded systems with limited hardware resources. This is because the sizes of the word and output embedding matrices grow proportionally with the size of the vocabulary, adversely affecting the compactness of these networks. To address this limitation, this paper introduces a brand new idea in the domain of image captioning: we tackle the hitherto unexplored problem of the compactness of image captioning models. We show that our proposed model, named COMIC for COMpact Image Captioning, achieves results comparable to state-of-the-art approaches on five common evaluation metrics on both the MS-COCO and InstaPIC-1.1M datasets, despite having an embedding vocabulary size that is 39x - 99x smaller.
http://arxiv.org/abs/1903.01072
This paper introduces, philosophically and to a degree formally, the novel concept of learning $\textit{ex nihilo}$, intended (obviously) to be analogous to the concept of creation $\textit{ex nihilo}$. Learning $\textit{ex nihilo}$ is an agent’s learning “from nothing,” by the suitable employment of schemata for deductive and inductive reasoning. This reasoning must be in machine-verifiable accord with a formal proof/argument theory in a $\textit{cognitive calculus}$ (i.e., roughly, an intensional higher-order multi-operator quantified logic), and this reasoning is applied to percepts received by the agent, in the context of both some prior knowledge, and some prior and current interests. Learning $\textit{ex nihilo}$ is a challenge to contemporary forms of ML, indeed a severe one, but the challenge is offered in the spirit of seeking to stimulate attempts, on the part of non-logicist ML researchers and engineers, to collaborate with those in possession of learning-$\textit{ex nihilo}$ frameworks, and eventually attempts to integrate directly with such frameworks at the implementation level. Such integration will require, among other things, the symbiotic interoperation of state-of-the-art automated reasoners and high-expressivity planners, with statistical/connectionist ML technology.
http://arxiv.org/abs/1903.03515
Visual-Inertial Odometry (VIO) algorithms typically rely on a point cloud representation of the scene that does not model the topology of the environment. A 3D mesh instead offers a richer, yet lightweight, model. Nevertheless, building a 3D mesh out of the sparse and noisy 3D landmarks triangulated by a VIO algorithm often results in a mesh that does not fit the real scene. In order to regularize the mesh, previous approaches decouple state estimation from the 3D mesh regularization step, and either limit the 3D mesh to the current frame or let the mesh grow indefinitely. We propose instead to tightly couple mesh regularization and state estimation by detecting and enforcing structural regularities in a novel factor-graph formulation. We also propose to incrementally build the mesh by restricting its extent to the time-horizon of the VIO optimization; the resulting 3D mesh covers a larger portion of the scene than a per-frame approach while its memory usage and computational complexity remain bounded. We show that our approach successfully regularizes the mesh, while improving localization accuracy, when structural regularities are present, and remains operational in scenes without regularities.
http://arxiv.org/abs/1903.01067
Precise robotic manipulation skills are desirable in many industrial settings, and reinforcement learning (RL) methods hold the promise of acquiring these skills autonomously. In this paper, we explicitly consider incorporating operational space force/torque information into reinforcement learning; this is motivated by the way humans heuristically map perceived forces to control actions, which lets them complete high-precision tasks with relative ease. Our approach combines RL with force/torque information by incorporating a proper operational space force controller, and we also explore different ablations of how this information is processed. Moreover, we propose a neural network architecture that generalizes to reasonable variations of the environment. We evaluate our method on the open-source Siemens Robot Learning Challenge, which requires precise and delicate force-controlled behavior to assemble a tight-fit gear wheel set.
http://arxiv.org/abs/1903.01066
In this paper, we explore a simple solution to “Multi-Source Neural Machine Translation” (MSNMT) which relies only on preprocessing an N-way multilingual corpus, without modifying the Neural Machine Translation (NMT) architecture or training procedure. We simply concatenate the source sentences to form a single long multi-source input sentence while keeping the target sentence as it is, and train an NMT system using this preprocessed corpus. We evaluate our method in resource-poor as well as resource-rich settings and show its effectiveness (up to 4 BLEU using 2 source languages and up to 6 BLEU using 5 source languages). We also compare against existing methods for MSNMT and show that our solution gives competitive results despite its simplicity. We also provide some insights on how the NMT system leverages multilingual information in such a scenario by visualizing attention.
http://arxiv.org/abs/1702.06135
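Since the MSNMT method above is pure preprocessing, it can be sketched in a few lines: concatenate the aligned source sentences of an N-way corpus into one multi-source line and leave the target side untouched. File paths and the separator are illustrative assumptions.

```python
# Build a multi-source corpus from an N-way parallel corpus (one sentence
# per line, all files line-aligned). The target file is copied unchanged.
def make_multi_source_corpus(source_files, target_file, out_src, out_tgt,
                             sep=" "):
    sources = [open(p, encoding="utf-8") for p in source_files]
    with open(target_file, encoding="utf-8") as tgt, \
         open(out_src, "w", encoding="utf-8") as fs, \
         open(out_tgt, "w", encoding="utf-8") as ft:
        for lines in zip(*sources, tgt):
            *src_lines, tgt_line = lines
            fs.write(sep.join(s.strip() for s in src_lines) + "\n")
            ft.write(tgt_line.strip() + "\n")
    for f in sources:
        f.close()

# e.g. make_multi_source_corpus(["train.fr", "train.de"], "train.en",
#                               "train.multi.src", "train.multi.tgt")
```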
Efficiently adapting to new environments and changes in dynamics is critical for agents to successfully operate in the real world. Reinforcement learning (RL) based approaches typically rely on external reward feedback for adaptation. However, in many scenarios this reward signal might not be readily available for the target task, or the difference between the environments can be implicit and only observable from the dynamics. To this end, we introduce a method that allows for self-adaptation of learned policies: No-Reward Meta Learning (NoRML). NoRML extends Model Agnostic Meta Learning (MAML) for RL and uses observable dynamics of the environment instead of an explicit reward function in MAML’s finetune step. Our method has a more expressive update step than MAML, while maintaining MAML’s gradient based foundation. Additionally, in order to allow more targeted exploration, we implement an extension to MAML that effectively disconnects the meta-policy parameters from the fine-tuned policies’ parameters. We first study our method on a number of synthetic control problems and then validate our method on common benchmark environments, showing that NoRML outperforms MAML when the dynamics change between tasks.
http://arxiv.org/abs/1903.01063
Traditional representations like bag-of-words are high dimensional and sparse, and ignore word order as well as syntactic and semantic information. Distributed vector representations, or embeddings, map variable-length text to dense fixed-length vectors and capture prior knowledge that can be transferred to downstream tasks. Even though embeddings have become the de facto standard for representations in deep learning based NLP tasks in both the general and clinical domains, there is no survey paper that presents a detailed review of embeddings in Clinical Natural Language Processing. In this survey paper, we discuss various medical corpora and their characteristics and medical codes, and present a brief overview as well as a comparison of popular embedding models. We classify clinical embeddings into nine types and discuss each embedding type in detail. We discuss various evaluation methods, followed by possible solutions to various challenges in clinical embeddings. Finally, we conclude with some of the future directions that will advance research in clinical embeddings.
http://arxiv.org/abs/1903.01039
Two-stream convolutional networks have shown strong performance in video action recognition tasks. The key idea is to learn spatiotemporal features by fusing convolutional networks spatially and temporally. However, it remains unclear how to model the correlations between the spatial and temporal structures at multiple abstraction levels. First, the spatial stream tends to fail if two videos share similar backgrounds. Second, the temporal stream may be fooled if two actions resemble each other in short snippets, even though they are distinct in the long term. We propose a novel spatiotemporal pyramid network to fuse the spatial and temporal features in a pyramid structure such that they can reinforce each other. From the architecture perspective, our network incorporates hierarchical fusion strategies and can be trained as a whole using a unified spatiotemporal loss. A series of ablation experiments support the importance of each fusion strategy. From the technical perspective, we introduce the spatiotemporal compact bilinear operator into video analysis tasks. This operator enables efficient training of bilinear fusion operations that can capture the full interactions between the spatial and temporal features. Our final network achieves state-of-the-art results on standard video datasets.
http://arxiv.org/abs/1903.01038
Active authentication refers to the process in which users are unobtrusively monitored and authenticated continuously throughout their interactions with mobile devices. Generally, an active authentication problem is modelled as a one class classification problem due to the unavailability of data from the impostor users. Normally, the enrolled user is considered as the target class (genuine) and the unauthorized users are considered as unknown classes (impostor). We propose a convolutional neural network (CNN) based approach for one class classification in which a zero centered Gaussian noise and an autoencoder are used to model the pseudo-negative class and to regularize the network to learn meaningful feature representations for one class data, respectively. The overall network is trained using a combination of the cross-entropy and the reconstruction error losses. A key feature of the proposed approach is that any pre-trained CNN can be used as the base network for one class classification. Effectiveness of the proposed framework is demonstrated using three publicly available face-based active authentication datasets and it is shown that the proposed method achieves superior performance compared to the traditional one class classification methods. The source code is available at: github.com/otkupjnoz/oc-acnn.
http://arxiv.org/abs/1903.01031
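A minimal PyTorch sketch of the one-class training objective described above, with a toy encoder standing in for the pre-trained base CNN: features of the enrolled-user data are classified against zero-centered Gaussian noise acting as the pseudo-negative class, while an autoencoder branch adds a reconstruction loss. All layer sizes and the noise scale are assumptions, not the released oc-acnn code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneClassNet(nn.Module):
    def __init__(self, feat_dim=128, sigma=0.1):
        super().__init__()
        self.sigma = sigma
        # Stand-in encoder; the paper allows any pre-trained CNN as the base.
        self.encoder = nn.Sequential(nn.Flatten(),
                                     nn.Linear(32 * 32, feat_dim), nn.ReLU())
        self.decoder = nn.Linear(feat_dim, 32 * 32)  # autoencoder branch
        self.classifier = nn.Linear(feat_dim, 2)     # target vs pseudo-negative

    def losses(self, x):                          # x: (N, 1, 32, 32) target data
        z = self.encoder(x)
        noise = self.sigma * torch.randn_like(z)  # zero-centered Gaussian class
        logits = self.classifier(torch.cat([z, noise], dim=0))
        labels = torch.cat([torch.ones(len(z)), torch.zeros(len(z))])
        ce = F.cross_entropy(logits, labels.long().to(x.device))
        recon = F.mse_loss(self.decoder(z), x.flatten(1))
        return ce + recon                         # joint training objective
```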
Vision, as an inexpensive yet information rich sensor, is commonly used for perception on autonomous mobile robots. Unfortunately, accurate vision-based perception requires a number of assumptions about the environment to hold – some examples of such assumptions, depending on the perception algorithm at hand, include purely Lambertian surfaces, texture-rich scenes, absence of aliasing features, and refractive surfaces. In this paper, we present an approach for introspective vision for obstacle avoidance (iVOA) – by leveraging a supervisory sensor that is occasionally available, we detect failures of stereo vision-based perception from divergence in plans generated by vision and the supervisory sensor. By projecting the 3D coordinates where the plans agree and disagree onto the images used for vision-based perception, iVOA generates a training set of reliable and unreliable image patches for perception. We then use this training dataset to learn a model of which image patches are likely to cause failures of the vision-based perception algorithm. Using this model, iVOA is then able to predict whether the relevant image patches in the observed images are likely to cause failures due to vision (both false positives and false negatives). We empirically demonstrate, with extensive real-world data from both indoor and outdoor environments, the ability of iVOA to accurately predict the failures of two distinct vision algorithms.
http://arxiv.org/abs/1903.01028
Social intelligence is an important requirement for enabling robots to collaborate with people. In particular, human path prediction is an essential capability for robots in that it prevents potential collision with a human and allows the robot to safely make larger movements. In this paper, we present a method for predicting the trajectory of a human who follows a haptic robotic guide without using sight, which is valuable for assistive robots that aid the visually impaired. We apply a deep learning method based on recurrent neural networks using multimodal data: (1) human trajectory, (2) movement of the robotic guide, (3) haptic input data measured from the physical interaction between the human and the robot, (4) human depth data. We collected actual human trajectory and multimodal response data through indoor experiments. Our model outperformed the baseline result while using only the robot data with the observed human trajectory, and it shows even better results when using additional haptic and depth data.
http://arxiv.org/abs/1903.01027
The stochastic multi-armed bandit problem is a well-known model for studying the exploration-exploitation trade-off. It has significant possible applications in adaptive clinical trials, which allow for dynamic changes in the treatment allocation probabilities of patients. However, most bandit learning algorithms are designed with the goal of minimizing the expected regret. While this approach is useful in many areas, in clinical trials, it can be sensitive to outlier data, especially when the sample size is small. In this paper, we define and study a new robustness criterion for bandit problems. Specifically, we consider optimizing a function of the distribution of returns as a regret measure. This provides practitioners more flexibility to define an appropriate regret measure. The learning algorithm we propose to solve this type of problem is a modification of the BESA algorithm [Baransi et al., 2014], which considers a more general version of regret. We present a regret bound for our approach and evaluate it empirically both on synthetic problems as well as on a dataset from the clinical trial literature. Our approach compares favorably to a suite of standard bandit algorithms.
http://arxiv.org/abs/1903.01026
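A NumPy sketch of a BESA-style duel in which the comparison statistic is a pluggable function of the empirical return distribution, in the spirit of the generalized regret above; the lower-quantile statistic here is a stand-in robustness measure, not necessarily the functional the paper uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def besa_duel(returns_a, returns_b, stat=lambda r: np.quantile(r, 0.25)):
    """Return 0 or 1: the winner of a BESA-style duel between two arms."""
    returns_a, returns_b = np.asarray(returns_a), np.asarray(returns_b)
    n = min(len(returns_a), len(returns_b))
    # Sub-sample the richer history down to the poorer one's size so that
    # the two arms are compared on equal amounts of evidence.
    sub_a = rng.choice(returns_a, size=n, replace=False)
    sub_b = rng.choice(returns_b, size=n, replace=False)
    if stat(sub_a) == stat(sub_b):      # tie-break toward the less-pulled arm
        return 0 if len(returns_a) <= len(returns_b) else 1
    return 0 if stat(sub_a) > stat(sub_b) else 1
```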
Reinforcement Learning agents are expected to eventually perform well. Typically, this takes the form of a guarantee about the asymptotic behavior of an algorithm given some assumptions about the environment. We present an algorithm for a policy whose value approaches the optimal value with probability 1 in all computable probabilistic environments, provided the agent has a bounded horizon. This is known as strong asymptotic optimality, and it was previously unknown whether it was possible for a policy to be strongly asymptotically optimal in the class of all computable probabilistic environments. Our agent, Inquisitive Reinforcement Learner (Inq), is more likely to explore the more it expects an exploratory action to reduce its uncertainty about which environment it is in, hence the term inquisitive. Exploring inquisitively is a strategy that can be applied generally; for more manageable environment classes, inquisitiveness is tractable. We conducted experiments in “grid-worlds” to compare the Inquisitive Reinforcement Learner to other weakly asymptotically optimal agents.
http://arxiv.org/abs/1903.01021
The linear and non-flexible nature of deep convolutional models makes them vulnerable to carefully crafted adversarial perturbations. To tackle this problem, we propose a non-linear radial basis convolutional feature mapping by learning a Mahalanobis-like distance function. Our method then maps the convolutional features onto a linearly well-separated manifold, which prevents small adversarial perturbations from forcing a sample to cross the decision boundary. We test the proposed method on three publicly available image classification and segmentation datasets namely, MNIST, ISBI ISIC 2017 skin lesion segmentation, and NIH Chest X-Ray-14. We evaluate the robustness of our method to different gradient (targeted and untargeted) and non-gradient based attacks and compare it to several non-gradient masking defense strategies. Our results demonstrate that the proposed method can increase the resilience of deep convolutional neural networks to adversarial perturbations without accuracy drop on clean data.
http://arxiv.org/abs/1903.01015
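A minimal PyTorch sketch of a radial basis feature mapping with a learned Mahalanobis-like distance, as described above: convolutional features are compared to learned prototypes under a learned diagonal metric, producing a non-linear mapping intended to blunt small adversarial perturbations. The diagonal restriction and the sizes are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class RBFMapping(nn.Module):
    def __init__(self, in_dim, n_centers):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_centers, in_dim))
        # Learned per-dimension scales define a diagonal Mahalanobis metric.
        self.log_scales = nn.Parameter(torch.zeros(in_dim))

    def forward(self, feats):                     # feats: (N, in_dim)
        diff = feats[:, None, :] - self.centers   # (N, K, in_dim)
        d2 = (diff ** 2 * torch.exp(self.log_scales)).sum(-1)
        return torch.exp(-d2)                     # (N, K) RBF activations
```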
The success of Deep Convolutional Neural Networks (CNNs) in recent years in almost all Computer Vision tasks on one hand, and the popularity of low-cost consumer depth cameras on the other, have made Hand Pose Estimation a hot topic in the computer vision field. In this report, we will first explain the hand pose estimation problem and review the major approaches to solving it, especially the two different problems of using depth maps or RGB images. We will survey the most important papers in each field and discuss the strengths and weaknesses of each. Finally, we will explain the biggest datasets in this field in detail and list 21 datasets with all their properties. To the best of our knowledge, this is the most complete list of all the datasets in the hand pose estimation field.
http://arxiv.org/abs/1903.01013
Sampling-based roadmaps have been popular methods for robot motion and task planning, given their generality and effectiveness in high-dimensional configuration spaces (C-spaces). Following advances in random geometric graphs, a seminal analysis result argued the conditions for asymptotic optimality of these approaches. In particular, a connection radius for each new C-space sample needs to be in the order of $ \gamma (\log n / n)^{1/d} $, where $n$ is the existing number of roadmap nodes and $d$ is the dimensionality of the C-space. This prior analysis, as well as subsequent efforts, also specified a sufficient lower bound for the constant $\gamma$ for asymptotic optimality. All of these results assumed that for a finite number of samples there is a path with positive clearance from obstacles. Nevertheless, manipulation task planning requires solving problems where the start and the goal lie on the boundary of the configuration space. The current work builds on previous work to: a) obtain an estimate of $\gamma$ in terms of a bound on the dispersion of the samples; and b) propose the modifications necessary to make asymptotic optimality hold when the start and goal lie on the boundary of the C-space, under certain assumptions regarding the boundary. The last point generalizes these properties to manipulation task planning and, relative to prior work, reduces both the connection radius the method requires for asymptotic optimality in this domain and the assumptions regarding the boundary's smoothness.
http://arxiv.org/abs/1903.01006
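For concreteness, the connection-radius rule quoted above is tiny to compute; the gamma value in the usage comment is a placeholder, since the sufficient lower bound depends on the analysis in question.

```python
import math

def connection_radius(n, d, gamma):
    """r(n) = gamma * (log n / n)^(1/d) for n roadmap nodes in d dimensions."""
    return gamma * (math.log(n) / n) ** (1.0 / d)

# e.g. radius used when adding the 1000th sample in a 6-dimensional C-space:
# connection_radius(1000, 6, gamma=1.5)
```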
Can we learn a control policy able to adapt its behaviour in real time so as to take any desired amount of risk? The general Reinforcement Learning framework solely aims at optimising a total reward in expectation, which may not be desirable in critical applications. In stark contrast, the Budgeted Markov Decision Process (BMDP) framework is a formalism in which the notion of risk is implemented as a hard constraint on a failure signal. Existing algorithms solving BMDPs rely on strong assumptions and have so far only been applied to toy examples. In this work, we relax some of these assumptions and demonstrate the scalability of our approach on two practical problems: a spoken dialogue system and an autonomous driving task. On both examples, we reach performance similar to that of Lagrangian Relaxation methods, with a significant improvement in sample and memory efficiency.
http://arxiv.org/abs/1903.01004
We present a Reinforcement Learning (RL) methodology to bypass Google reCAPTCHA v3. We formulate the problem as a grid world where the agent learns how to move the mouse and click on the reCAPTCHA button to receive a high score. We study the performance of the agent when we vary the cell size of the grid world and show that the performance drops when the agent takes big steps toward the goal. Finally, we use a divide and conquer strategy to defeat the reCAPTCHA system for any grid resolution. Our proposed method achieves a success rate of 97.4% on a 100x100 grid and 96.7% on a 1000x1000 screen resolution.
http://arxiv.org/abs/1903.01003
Analyzing the story behind TV series and movies often requires understanding who the characters are and what they are doing. With improving deep face models, this may seem like a solved problem. However, as face detectors get better, clustering/identification needs to be revisited to address increasing diversity in facial appearance. In this paper, we address video face clustering using unsupervised methods. Our emphasis is on distilling the essential information, identity, from the representations obtained using deep pre-trained face networks. We propose a self-supervised Siamese network that can be trained without the need for video/track based supervision, and thus can also be applied to image collections. We evaluate our proposed method on three video face clustering datasets. The experiments show that our methods outperform current state-of-the-art methods on all datasets. Video face clustering currently lacks a common benchmark, as existing works are often evaluated with different metrics and/or different sets of face tracks.
http://arxiv.org/abs/1903.01000
Many exciting robotic applications require multiple robots with many degrees of freedom, such as manipulators, to coordinate their motion in a shared workspace. Discovering high-quality paths in such scenarios can be achieved, in principle, by exploring the composite space of all robots. Sampling-based planners do so by building a roadmap or a tree data structure in the corresponding configuration space and can achieve asymptotic optimality. The hardness of motion planning, however, renders the explicit construction of such structures in the composite space of multiple robots impractical. This work proposes a scalable solution for such coupled multi-robot problems, which provides desirable path-quality guarantees and is also computationally efficient. In particular, the proposed dRRT* is an informed, asymptotically-optimal extension of a prior sampling-based multi-robot motion planner, dRRT. The prior approach introduced the idea of building roadmaps for each robot and implicitly searching the tensor product of these structures in the composite space. This work identifies the conditions for convergence to optimal paths in multi-robot problems, which the prior method was not achieving. Building on this analysis, dRRT is first properly adapted so as to achieve the theoretical guarantees and then further extended so as to make use of effective heuristics when searching the composite space of all robots. The case where the various robots share some degrees of freedom is also studied. Evaluation in simulation indicates that the new algorithm, dRRT*, converges to high-quality paths quickly and scales to a higher number of robots where various alternatives fail. This work also demonstrates the planner’s capability to solve problems involving multiple real-world robotic arms.
http://arxiv.org/abs/1903.00994
Detailed 3D reconstruction is an important challenge with application to robotics, augmented and virtual reality, which has seen impressive progress throughout the past years. Advancements were driven by the availability of depth cameras (RGB-D), as well as increased compute power, e.g. in the form of GPUs – but also thanks to inclusion of machine learning in the process. Here, we propose X-Section, an RGB-D 3D reconstruction approach that leverages deep learning to make object-level predictions about thicknesses that can be readily integrated into a volumetric multi-view fusion process, where we propose an extension to the popular KinectFusion approach. In essence, our method allows us to complete shape in general indoor scenes behind what is sensed by the RGB-D camera, which may be crucial e.g. for robotic manipulation tasks or efficient scene exploration. Predicting object thicknesses rather than volumes allows us to work with comparably high spatial resolution without exploding memory and training data requirements on the employed Convolutional Neural Networks. In a series of qualitative and quantitative evaluations, we demonstrate how we accurately predict object thickness and reconstruct general 3D scenes containing multiple objects.
http://arxiv.org/abs/1903.00987
Advances in sensor technologies, object detection algorithms, planning frameworks and hardware designs have motivated the deployment of robots in warehouse automation. A variety of such applications, like order fulfillment or packing tasks, require picking objects from unstructured piles and carefully arranging them in bins or containers. Desirable solutions need to be low-cost, easily deployable and controllable, making minimalistic hardware choices desirable. The challenge in designing an effective solution to this problem relates to appropriately integrating multiple components, so as to achieve a robust pipeline that minimizes failure conditions. The current work proposes a complete pipeline for solving such packing tasks, given access only to RGB-D data and a single robot arm with a vacuum-based end-effector, which is also used as a pushing finger. To achieve the desired level of robustness, three key manipulation primitives are identified, which take advantage of the environment and simple operations to successfully pack multiple cubic objects. The overall approach is demonstrated to be robust to execution and perception errors. The impact of each manipulation primitive is evaluated by considering different versions of the proposed pipeline, which incrementally introduce reasoning about object poses and corrective manipulation actions.
http://arxiv.org/abs/1903.00984
Perception is a safety-critical function of autonomous vehicles and machine learning (ML) plays a key role in its implementation. This position paper identifies (1) perceptual uncertainty as a performance measure used to define safety requirements and (2) its influence factors when using supervised ML. This work is a first step towards a framework for measuring and controlling the effects of these factors and supplying evidence to support claims about perceptual uncertainty.
http://arxiv.org/abs/1903.03438
Designing face recognition systems that are capable of matching face images obtained in the thermal spectrum with those obtained in the visible spectrum is a challenging problem. In this work, we propose the use of semantic-guided generative adversarial network (SG-GAN) to automatically synthesize visible face images from their thermal counterparts. Specifically, semantic labels, extracted by a face parsing network, are used to compute a semantic loss function to regularize the adversarial network during training. These semantic cues denote high-level facial component information associated with each pixel. Further, an identity extraction network is leveraged to generate multi-scale features to compute an identity loss function. To achieve photo-realistic results, a perceptual loss function is introduced during network training to ensure that the synthesized visible face is perceptually similar to the target visible face image. We extensively evaluate the benefits of individual loss functions, and combine them effectively to learn the mapping from thermal to visible face images. Experiments involving two multispectral face datasets show that the proposed method achieves promising results in both face synthesis and cross-spectral face matching.
http://arxiv.org/abs/1903.00963
Datasets are an essential component for training effective machine learning models. In particular, surgical robotic datasets have been key to many advances in semi-autonomous surgeries, skill assessment, and training. Simulated surgical environments can enhance the data collection process by making it faster, simpler and cheaper than real systems. In addition, combining data from multiple robotic domains can provide rich and diverse training data for transfer learning algorithms. In this paper, we present the DESK (Dexterous Surgical Skill) dataset. It comprises a set of surgical robotic skills collected during a surgical training task using three robotic platforms: the Taurus II robot, the Taurus II simulated robot, and the YuMi robot. This dataset was used to test the idea of transferring knowledge across different domains (e.g. from Taurus to the YuMi robot) for a surgical gesture classification task with seven gestures. We explored three different scenarios: 1) no transfer, 2) transfer from simulated Taurus to real Taurus and 3) transfer from simulated Taurus to the YuMi robot. We conducted extensive experiments with three supervised learning models and provided baselines in each of these scenarios. Results show that using simulation data during training enhances the performance on the real robot where limited real data is available. In particular, we obtained an accuracy of 55% on the real Taurus data using a model that is trained only on the simulator data. Furthermore, we achieved an accuracy improvement of 34% when 3% of the real data is added into the training process.
http://arxiv.org/abs/1903.00959
This paper proposes a novel image set classification technique based on the concept of linear regression. Unlike most other approaches, the proposed technique does not involve any training or feature extraction. The gallery image sets are represented as subspaces in a high dimensional space. Class specific gallery subspaces are used to estimate regression models for each image of the test image set. Images of the test set are then projected on the gallery subspaces. Residuals, calculated using the Euclidean distance between the original and the projected test images, are used as the distance metric. Three different strategies are devised to decide on the final class of the test image set. We performed extensive evaluations of the proposed technique under the challenges of low resolution, noise and limited gallery data for the tasks of surveillance, video-based face recognition and object recognition. Experiments show that the proposed technique achieves better classification accuracy and faster execution time compared to existing techniques, especially under the challenging conditions of low resolution and small gallery and test data.
http://arxiv.org/abs/1803.09470
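Because the technique above involves no training, its core can be sketched directly in NumPy: each gallery class spans a subspace, test images are projected onto it by least squares, and the Euclidean residual is the distance. Deciding the final class by the mean residual over the test set is just one stand-in for the three strategies the paper devises.

```python
import numpy as np

def class_residual(gallery, test):
    """gallery: (d, m) columns are class images; test: (d, k) test images."""
    coeffs, *_ = np.linalg.lstsq(gallery, test, rcond=None)
    projection = gallery @ coeffs          # projection onto the class subspace
    return np.linalg.norm(test - projection, axis=0)  # residual per test image

def classify_set(galleries, test):
    """Pick the class whose subspace gives the smallest mean residual."""
    scores = [class_residual(g, test).mean() for g in galleries]
    return int(np.argmin(scores))
```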
Motion planning under uncertainty for an autonomous system can be formulated as a Markov Decision Process. In this paper, we propose a solution to this decision theoretic planning problem using a continuous approximation of the underlying discrete value function and leveraging finite element methods. This approach allows us to obtain an accurate and continuous form of value function even with a small number of states from a very low resolution of state space. We achieve this by taking advantage of the second order Taylor expansion to approximate the value function, where the value function is modeled as a boundary-conditioned partial differential equation which can be naturally solved using a finite element method. We have validated our approach via extensive simulations, and the evaluations reveal that our solution provides continuous value functions, leading to better path results in terms of path smoothness, travel distance and time costs, even with a smaller state space.
http://arxiv.org/abs/1903.00948
State-of-the-art LSTM language models trained on large corpora learn sequential contingencies in impressive detail, and have been shown to acquire a number of non-local grammatical dependencies with some success. Here we investigate whether supervision with hierarchical structure enhances learning of a range of grammatical dependencies, a question that has previously been addressed only for subject-verb agreement. Using controlled experimental methods from psycholinguistics, we compare the performance of word-based LSTM models versus Recurrent Neural Network Grammars (RNNGs) (Dyer et al., 2016), which represent hierarchical syntactic structure and use neural control to deploy it in left-to-right processing, on two classes of non-local grammatical dependencies in English – negative polarity licensing and filler-gap dependencies – tested in a range of configurations. Using the same training data for both models, we find that the RNNG outperforms the LSTM on both types of grammatical dependencies and even learns many of the island constraints on the filler-gap dependency. Structural supervision thus provides data efficiency advantages over purely string-based training of neural language models in acquiring human-like generalizations about non-local grammatical dependencies.
http://arxiv.org/abs/1903.00943
Over the years, many ellipse detection algorithms have sprung up and been studied broadly, while the critical issue of detecting ellipses accurately and efficiently in real-world images remains a challenge. In this paper, we propose a valuable industry-oriented ellipse detector based on arc-support line segments, which simultaneously reaches high detection accuracy and efficiency. To simplify the complicated curves in an image while retaining general properties including convexity and polarity, arc-support line segments are extracted, which grounds the successful detection of ellipses. The arc-support groups are formed by iteratively and robustly linking the arc-support line segments that latently belong to a common ellipse. Afterward, two complementary approaches, namely, locally selecting the arc-support group with higher saliency and globally searching all valid paired groups, are adopted to fit the initial ellipses quickly. Then, the ellipse candidate set can be formulated by hierarchical clustering of the 5D parameter space of the initial ellipses. Finally, the salient ellipse candidates are selected and refined as detections subject to stringent and effective verification. Extensive experiments on three public datasets are implemented and our method achieves the best F-measure scores compared to the state-of-the-art methods. The source code is available at https://github.com/AlanLuSun/High-quality-ellipse-detection.
http://arxiv.org/abs/1810.03243
Artificial Intelligence (AI) plays varying roles in supporting both existing and emerging technologies. In the area of Learning and Tutoring, it plays a key role in Intelligent Tutoring Systems (ITS). The fusion of ITS with Adaptive Hypermedia and Multimedia (AHAM) forms the backbone of Adaptive eLearning Systems (AES), which provide personalized experiences to learners. This experience is important because it facilitates the accurate delivery of the learning modules specific to the learner's capacity and readiness. AES types vary, with Adaptive Web Based eLearning Systems (AWBES) being the most popular type because of the wider access offered by web technology. The retrieval and aggregation of content for any eLearning system is critical, and is determined by the relevance of the learning material to the needs of the learner. In this paper, we discuss the components of AES, the role of AI in AES content aggregation, possible risks and available opportunities.
http://arxiv.org/abs/1903.00934
Machine learning has shown promise for automatic detection of Alzheimer’s disease (AD) through speech; however, efforts are hampered by a scarcity of data, especially in languages other than English. We propose a method to learn a correspondence between independently engineered lexicosyntactic features in two languages, using a large parallel corpus of out-of-domain movie dialogue data. We apply it to dementia detection in Mandarin Chinese, and demonstrate that our method outperforms both unilingual and machine translation-based baselines. This appears to be the first study that transfers feature domains in detecting cognitive decline.
http://arxiv.org/abs/1903.00933
Determining a globally optimal solution of belief space planning (BSP) in high-dimensional state spaces is computationally expensive, as it involves belief propagation and objective function evaluation for each candidate action. Our recently introduced topological belief space planning approach, t-BSP, instead performs decision making considering only the topologies of factor graphs that correspond to posterior future beliefs. In this paper we contribute to this body of work a novel method for efficiently determining error bounds of t-BSP, thereby providing global optimality guarantees or an uncertainty margin for its solution. The bounds are given with respect to an optimal solution of information theoretic BSP considering the previously introduced topological metric, which is based on the number of spanning trees. In realistic and synthetic simulations, we analyze the tightness of these bounds and show empirically how this metric is closely related to another, computationally more efficient, t-BSP metric: an approximation of the von Neumann entropy of a graph, which can achieve online performance.
http://arxiv.org/abs/1903.00927
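A NumPy sketch of the two topological quantities mentioned above: the log-number of spanning trees of a graph topology via the matrix-tree theorem, and a von Neumann graph entropy computed from the Laplacian spectrum (one common definition; the paper's approximation may differ in detail). Input is an undirected adjacency matrix of a connected graph.

```python
import numpy as np

def log_spanning_trees(adj):
    """Matrix-tree theorem: tree count = any cofactor of the Laplacian."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    sign, logdet = np.linalg.slogdet(laplacian[1:, 1:])
    return logdet

def von_neumann_entropy(adj):
    """Entropy of the density matrix L / tr(L), via Laplacian eigenvalues."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    eigs = np.linalg.eigvalsh(laplacian)
    p = eigs / eigs.sum()
    p = p[p > 1e-12]                       # drop (near-)zero eigenvalues
    return float(-(p * np.log(p)).sum())
```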
Accurate computer-assisted diagnosis can alleviate the risk of overlooking the diagnosis in a clinical environment. Towards this, as a Data Augmentation (DA) technique, Generative Adversarial Networks (GANs) can synthesize additional training data to handle the small/fragmented medical images obtained from various scanners; those images are realistic but completely different from the original ones, filling gaps in the real image distribution. However, we cannot easily use them to locate the position of disease areas, since expert physicians’ annotation is a time-expensive task. Therefore, this paper proposes Conditional Progressive Growing of GANs (CPGGANs), incorporating bounding box conditions into PGGANs to place brain metastases at a desired position/size on 256 x 256 Magnetic Resonance (MR) images for Convolutional Neural Network-based tumor detection; this first GAN-based medical DA using automatic bounding box annotation improves robustness during training. The results show that CPGGAN-based DA can boost sensitivity in diagnosis by 10% with an acceptable number of additional False Positives—even with physicians’ highly-rough and inconsistent bounding box annotations. Surprisingly, a further realistic tumor appearance, achieved with additional normal brain MR images for CPGGAN training, does not contribute to detection performance, even though three expert physicians cannot accurately distinguish the synthetic images from the real ones in a Visual Turing Test.
https://arxiv.org/abs/1902.09856
Predicting the future location of vehicles is essential for safety-critical applications such as advanced driver assistance systems (ADAS) and autonomous driving. This paper introduces a novel approach to simultaneously predict both the location and scale of target vehicles in the first-person (egocentric) view of an ego-vehicle. We present a multi-stream recurrent neural network (RNN) encoder-decoder model that separately captures both object location and scale and pixel-level observations for future vehicle localization. We show that incorporating dense optical flow improves prediction results significantly since it captures information about motion as well as appearance change. We also find that explicitly modeling future motion of the ego-vehicle improves the prediction accuracy, which could be especially beneficial in intelligent and automated vehicles that have motion planning capability. To evaluate the performance of our approach, we present a new dataset of first-person videos collected from a variety of scenarios at road intersections, which are particularly challenging moments for prediction because vehicle trajectories are diverse and dynamic.
http://arxiv.org/abs/1809.07408
A significant advance in accelerating neural network training has been the development of normalization methods, permitting the training of deep models both faster and with better accuracy. These advances come with practical challenges: for instance, batch normalization ties the prediction of individual examples with other examples within a batch, resulting in a network that is heavily dependent on batch size. Layer normalization and group normalization are data-dependent and thus must be continually used, even at test-time. To address the issues that arise from using explicit normalization techniques, we propose to replace existing normalization methods with a simple, secondary objective loss that we term a standardization loss. This formulation is flexible and robust across different batch sizes and surprisingly, this secondary objective accelerates learning on the primary training objective. Because it is a training loss, it is simply removed at test-time, and no further effort is needed to maintain normalized activations. We find that a standardization loss accelerates training on both small- and large-scale image classification experiments, works with a variety of architectures, and is largely robust to training across different batch sizes.
http://arxiv.org/abs/1903.00925
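A minimal PyTorch sketch of a standardization loss as described above: a secondary objective that penalizes each layer's activations for deviating from zero mean and unit variance, and is simply dropped at test time. The exact functional form and weighting are illustrative assumptions.

```python
import torch

def standardization_loss(activations_list):
    """Penalize non-standardized activations across a list of layers."""
    loss = 0.0
    for a in activations_list:
        a = a.flatten(1)                      # (batch, features)
        mean = a.mean(dim=0)
        var = a.var(dim=0, unbiased=False)
        loss = loss + (mean ** 2).mean() + ((var - 1.0) ** 2).mean()
    return loss

# Training combines it with the primary objective, e.g.:
# total = task_loss + beta * standardization_loss(hidden_activations)
```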
Pancreatic cancer is one of the most lethal cancers, as its incidence approximates its mortality. A method for accurately segmenting the pancreas can assist doctors in the diagnosis and treatment of pancreatic cancer. Among the currently widely used approaches, 2D methods ignore the spatial information of the pancreas, while 3D models are limited by high resource consumption and GPU memory occupancy. To address these issues, we propose a probability-map-guided bi-directional recurrent UNet (PBR-UNet), which consists of a feature extraction network for efficiently extracting pixel-level probability maps as guidance and a bi-directional recurrent network for precise segmentation. The context information of adjacent slices is interconnected to form a chain structure. We integrate contextual information into the entire segmentation network through bi-directional loops to avoid the loss of spatial information in propagation. Additionally, an iterator is applied during propagation to update the guiding probability map after each propagation step. In this way we resolve the loss of three-dimensional information in 2D networks by combining the probability maps of adjacent slices into the segmentation as spatial information, while avoiding the large computational resource consumption caused by directly using a 3D network. We used the Dice similarity coefficient (DSC) to evaluate our approach on the NIH pancreas dataset and achieved a competitive result of 83.02%.
http://arxiv.org/abs/1903.00923
Imagining multiple consecutive frames given one single snapshot is challenging, since it is difficult to simultaneously predict diverse motions from a single image and faithfully generate novel frames without visual distortions. In this work, we leverage an unsupervised variational model to learn rich motion patterns in the form of long-term bi-directional flow fields, and apply the predicted flows to generate high-quality video sequences. In contrast to the state-of-the-art approach, our method does not require external flow supervision for learning. This is achieved through a novel module that performs bi-directional flow prediction from a single image. In addition, with the bi-directional flow consistency check, our method can handle occlusion and warping artifacts in a principled manner. Our method can be trained end-to-end on arbitrarily sampled natural video clips, and it is able to capture multi-modal motion uncertainty and synthesize photo-realistic novel sequences. Quantitative and qualitative evaluations over synthetic and real-world datasets demonstrate the effectiveness of the proposed approach over the state-of-the-art methods.
http://arxiv.org/abs/1903.00913
Recovering the absolute metric scale from a monocular camera is a challenging but highly desirable problem for monocular camera-based systems. Using different kinds of cues, such as camera height and object size, various approaches have been proposed for scale estimation. In this paper, we first summarize the different kinds of scale estimation approaches. Then, by analyzing the advantages and disadvantages of the different approaches, we propose a robust divide-and-conquer absolute scale estimation method based on the ground plane and camera height. Using the estimated scale, an effective scale correction strategy is proposed to reduce scale drift during the Monocular Visual Odometry (VO) estimation process. Finally, the effectiveness and robustness of the proposed method are verified on both public and self-collected image sequences.
http://arxiv.org/abs/1903.00912
Multi-scale deep CNN architectures [1, 2, 3] successfully capture both fine and coarse level image descriptors for the visual similarity task, but they come with expensive memory overhead and latency. In this paper, we propose a competing novel CNN architecture, called MILDNet, whose merit is being vastly more compact (about 3 times). Inspired by the fact that successive CNN layers represent the image with increasing levels of abstraction, we compressed our deep ranking model to a single CNN by coupling activations from multiple intermediate layers along with the last layer. Trained on the famous Street2shop dataset [4], we demonstrate that our approach performs as well as the current state-of-the-art models with only one third of the parameters, model size and training time, and a significant reduction in inference time. The significance of intermediate layers for the image retrieval task has also been shown on the popular Holidays, Oxford and Paris datasets [5]. So even though our experiments are done in the e-commerce domain, the approach is applicable to other domains as well. We further did an ablation study to validate our hypothesis by checking the impact of adding each intermediate layer. With this we also present two more useful variants of MILDNet: a mobile model (12 times smaller) for on-edge devices, and a compactly featured model (512-d feature embeddings) for systems with less RAM and to reduce the ranking cost. Further, we present an intuitive way to automatically create a tailored in-house triplet training dataset, which is very hard to create manually. This solution can also be deployed as an all-inclusive visual similarity solution. Finally, we present our entire production-level architecture, which currently powers visual similarity at Fynd.
http://arxiv.org/abs/1903.00905
Recently, deep generative models have become increasingly popular in unsupervised anomaly detection. However, deep generative models aim at recovering the data distribution rather than detecting anomalies. Besides, deep generative models risk overfitting the training samples, which has disastrous effects on anomaly detection performance. To solve these two problems, we propose a Self-adversarial Variational Autoencoder with a Gaussian anomaly prior assumption. We assume that both the anomalous and the normal prior distributions are Gaussian and overlap in the latent space. Therefore, a Gaussian transformer net T is trained to synthesize anomalous but near-normal latent variables. While keeping the original training objective of the Variational Autoencoder, the generator G additionally tries to distinguish between the normal latent variables and the anomalous ones synthesized by T, and the encoder E is trained to discriminate whether the output of G is real. These new objectives not only give both G and E the ability to discriminate but also introduce additional regularization to prevent overfitting. Compared with the SOTA baselines, the proposed model achieves significant improvements in extensive experiments. The datasets and our model are available in a GitHub repository.
http://arxiv.org/abs/1903.00904
The game of bridge consists of two stages: bidding and playing. While playing has proved to be relatively easy for computer programs, bidding is very challenging. During the bidding stage, each player, knowing only his/her own cards, needs to exchange information with his/her partner and interfere with opponents at the same time. Existing methods for solving perfect-information games cannot be directly applied to bidding. Most bridge programs are based on human-designed rules, which, however, cannot cover all situations and are usually ambiguous and even conflicting with each other. In this paper, we, for the first time, propose a competitive bidding system based on deep learning techniques, which exhibits two novelties. First, we design a compact representation to encode the private and public information available to a player for bidding. Second, based on the analysis of the impact of other players’ unknown cards on one’s final rewards, we design two neural networks to deal with imperfect information, the first one inferring the cards of the partner and the second one taking the outputs of the first one as part of its input to select a bid. Experimental results show that our bidding system outperforms the top rule-based program.
http://arxiv.org/abs/1903.00900
A common assumption in causal modeling posits that the data is generated by a set of independent mechanisms, and algorithms should aim to recover this structure. Standard unsupervised learning, however, is often concerned with training a single model to capture the overall distribution or aspects thereof. Inspired by clustering approaches, we consider mixtures of implicit generative models that “disentangle” the independent generative mechanisms underlying the data. Relying on an additional set of discriminators, we propose a competitive training procedure in which the models only need to capture the portion of the data distribution from which they can produce realistic samples. As a by-product, each model is simpler and faster to train. We empirically show that our approach splits the training distribution in a sensible way and increases the quality of the generated samples.
http://arxiv.org/abs/1804.11130
An abdominal aortic aneurysm (AAA) is a focal dilation of the aorta that, if not treated, tends to grow and may rupture. A significant unmet need in the assessment of AAA disease, for the diagnosis, prognosis and follow-up, is the determination of rupture risk, which is currently based on the manual measurement of the aneurysm diameter in a selected Computed Tomography Angiography (CTA) scan. However, there is a lack of standardization determining the degree and rate of disease progression, due to the lack of robust, automated aneurysm segmentation tools that allow quantitatively analyzing the AAA. In this work, we aim at proposing the first 3D convolutional neural network for the segmentation of aneurysms both from preoperative and postoperative CTA scans. We extensively validate its performance in terms of diameter measurements, to test its applicability in the clinical practice, as well as regarding the relative volume difference, and Dice and Jaccard scores. The proposed method yields a mean diameter measurement error of 3.3 mm, a relative volume difference of 8.58 %, and Dice and Jaccard scores of 87 % and 77 %, respectively. At a clinical level, an aneurysm enlargement of 10 mm is considered relevant, thus, our method is suitable to automatically determine the AAA diameter and opens up the opportunity for more complex aneurysm analysis.
http://arxiv.org/abs/1903.00879
Recent research on super-resolution has achieved great success due to the development of deep convolutional neural networks (DCNNs). However, super-resolution with arbitrary scale factors has been ignored for a long time. Most previous researchers regard super-resolution with different scale factors as independent tasks and train a specific model for each scale factor, which is computationally inefficient, and prior work only takes super-resolution with several integer scale factors into consideration. In this work, we propose a novel method called Meta-SR that is the first to solve super-resolution with arbitrary scale factors (including non-integer scale factors) with a single model. In our Meta-SR, a Meta-Upscale Module is proposed to replace the traditional upscale module. For an arbitrary scale factor, the Meta-Upscale Module dynamically predicts the weights of the upscale filters by taking the scale factor as input, and uses these weights to generate an HR image of arbitrary size. For any low-resolution image, our Meta-SR can continuously zoom into it with an arbitrary scale factor using only a single model. We evaluated the proposed method through extensive experiments on widely used single image super-resolution benchmark datasets. The experimental results show the superiority of our Meta-Upscale module.
http://arxiv.org/abs/1903.00875
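A simplified PyTorch sketch of the Meta-Upscale idea above: a small weight-prediction network takes per-output-pixel inputs (fractional offsets and the inverse scale) and emits the weights of that pixel's upscale filter, so one model serves arbitrary, including non-integer, scales. Nearest-neighbour feature gathering and 1x1 predicted filters are simplifications of the paper's module.

```python
import torch
import torch.nn as nn

class MetaUpscale(nn.Module):
    def __init__(self, in_ch=64, out_ch=3, hidden=256):
        super().__init__()
        # Maps (dy, dx, 1/r) to the weights of a per-pixel 1x1 filter.
        self.weight_net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, in_ch * out_ch))
        self.in_ch, self.out_ch = in_ch, out_ch

    def forward(self, feats, r):                  # feats: (N, C, h, w), scale r
        n, c, h, w = feats.shape
        H, W = int(h * r), int(w * r)
        ys = torch.arange(H, device=feats.device, dtype=torch.float32) / r
        xs = torch.arange(W, device=feats.device, dtype=torch.float32) / r
        iy = ys.long().clamp(max=h - 1)           # nearest LR row/column
        ix = xs.long().clamp(max=w - 1)
        oy = (ys - iy.float()).view(H, 1).expand(H, W)   # fractional offsets
        ox = (xs - ix.float()).view(1, W).expand(H, W)
        meta = torch.stack([oy, ox, torch.full_like(oy, 1.0 / r)], dim=-1)
        wts = self.weight_net(meta.view(-1, 3)).view(H, W, self.out_ch, c)
        gathered = feats[:, :, iy][:, :, :, ix]   # (N, C, H, W) nearest feats
        return torch.einsum('hwoc,nchw->nohw', wts, gathered)
```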
Face images captured through the glass are usually contaminated by reflections. The non-transmitted reflections make the reflection removal more challenging than for general scenes, because important facial features are completely occluded. In this paper, we propose and solve the face image reflection removal problem. We remove non-transmitted reflections by incorporating inpainting ideas into a guided reflection removal framework and recover facial features by considering various face-specific priors. We use a newly collected face reflection image dataset to train our model and compare with state-of-the-art methods. The proposed method shows advantages in estimating reflection-free face images for improving face recognition.
http://arxiv.org/abs/1903.00865
Large-scale distributed training of deep neural networks suffers from the generalization gap caused by the increase in the effective mini-batch size. Previous approaches try to solve this problem by varying the learning rate and batch size over epochs and layers, or with ad hoc modifications of batch normalization. We propose an alternative approach using a second-order optimization method that shows similar generalization capability to first-order methods, but converges faster and can handle larger mini-batches. To test our method on a benchmark where highly optimized first-order methods are available as references, we train ResNet-50 on ImageNet. We converged to 75% Top-1 validation accuracy in 35 epochs for mini-batch sizes under 16,384, and achieved 75% even with a mini-batch size of 131,072, which took 100 epochs.
http://arxiv.org/abs/1811.12019
Highlight detection has the potential to significantly ease video browsing, but existing methods often suffer from expensive supervision requirements, where human viewers must manually identify highlights in training videos. We propose a scalable unsupervised solution that exploits video duration as an implicit supervision signal. Our key insight is that video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about the content when capturing shorter videos. Leveraging this insight, we introduce a novel ranking framework that prefers segments from shorter videos, while properly accounting for the inherent noise in the (unlabeled) training data. We use it to train a highlight detector with 10M hashtagged Instagram videos. In experiments on two challenging public video highlight detection benchmarks, our method substantially improves the state-of-the-art for unsupervised highlight detection.
http://arxiv.org/abs/1903.00859
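A minimal PyTorch sketch of the duration-based ranking objective described above: segments from shorter videos are treated as (noisy) positives that must outscore segments from longer videos by a margin. The paper's handling of label noise in the unlabeled pairs is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def duration_ranking_loss(scores_short, scores_long, margin=1.0):
    """scores_*: (B,) highlight scores for segments from short/long videos."""
    # Standard margin ranking: penalize long-video segments that score within
    # `margin` of (or above) their paired short-video segments.
    return F.relu(margin - scores_short + scores_long).mean()

# During training, each short-video segment is paired with a long-video
# segment; at test time the learned scorer alone ranks segments as highlights.
```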