Energy based models (EBMs) are appealing due to their generality and simplicity in likelihood modeling, but have been traditionally difficult to train. We present techniques to scale MCMC based EBM training, on continuous neural networks, and show its success on the high-dimensional data domains of ImageNet32x32, ImageNet128x128, CIFAR-10, and robotic hand trajectories, achieving significantly better samples than other likelihood models and on par with contemporary GAN approaches, while covering all modes of the data. We highlight unique capabilities of implicit generation, such as energy compositionality and corrupt image reconstruction and completion. Finally, we show that EBMs generalize well and are able to achieve state-of-the-art out-of-distribution classification, exhibit adversarially robust classification, coherent long term predicted trajectory roll-outs, and generate zero-shot compositions of models.
http://arxiv.org/abs/1903.08689
We propose a high-quality photo-to-pencil translation method with fine-grained control over the drawing style. This is a challenging task due to multiple stroke types (e.g., outline and shading), structural complexity of pencil shading (e.g., hatching), and the lack of aligned training data pairs. To address these challenges, we develop a two-branch model that learns separate filters for generating sketchy outlines and tonal shading from a collection of pencil drawings. We create training data pairs by extracting clean outlines and tonal illustrations from original pencil drawings using image filtering techniques, and we manually label the drawing styles. In addition, our model creates different pencil styles (e.g., line sketchiness and shading style) in a user-controllable manner. Experimental results on different types of pencil drawings show that the proposed algorithm performs favorably against existing methods in terms of quality, diversity and user evaluations.
http://arxiv.org/abs/1903.08682
Current work on multimodal machine translation (MMT) has suggested that the visual modality is either unnecessary or only marginally beneficial. We posit that this is a consequence of the very simple, short and repetitive sentences used in the only available dataset for the task (Multi30K), rendering the source text sufficient as context. In the general case, however, we believe that it is possible to combine visual and textual information in order to ground translations. In this paper we probe the contribution of the visual modality to state-of-the-art MMT models by conducting a systematic analysis where we partially deprive the models from source-side textual context. Our results show that under limited textual context, models are capable of leveraging the visual input to generate better translations. This contradicts the current belief that MMT models disregard the visual modality because of either the quality of the image features or the way they are integrated into the model.
http://arxiv.org/abs/1903.08678
We describe in this paper Hydra, an ensemble of convolutional neural networks (CNN) for geospatial land classification. The idea behind Hydra is to create an initial CNN that is coarsely optimized but provides a good starting pointing for further optimization, which will serve as the Hydra’s body. Then, the obtained weights are fine-tuned multiple times with different augmentation techniques, crop styles, and classes weights to form an ensemble of CNNs that represent the Hydra’s heads. By doing so, we prompt convergence to different endpoints, which is a desirable aspect for ensembles. With this framework, we were able to reduce the training time while maintaining the classification performance of the ensemble. We created ensembles for our experiments using two state-of-the-art CNN architectures, ResNet and DenseNet. We have demonstrated the application of our Hydra framework in two datasets, FMOW and NWPU-RESISC45, achieving results comparable to the state-of-the-art for the former and the best reported performance so far for the latter. Code and CNN models are available at https://github.com/maups/hydra-fmow
http://arxiv.org/abs/1802.03518
Continual learning is the ability of an agent to learn online with a non-stationary and never-ending stream of data. A key component for such never-ending learning process is to overcome the catastrophic forgetting of previously seen data, a problem that neural networks are well known to suffer from. The solutions developed so far often relax the problem of continual learning to the easier task-incremental setting, where the stream of data is divided into tasks with clear boundaries. In this paper, we break the limits and move to the more challenging online setting where we assume no information of tasks in the data stream. We start from the idea that each learning step should not increase the losses of the previously learned examples through constraining the optimization process. This means that the number of constraints grows linearly with the number of examples, which is a serious limitation. We develop a solution to select a fixed number of constraints that we use to approximate the feasible region defined by the original constraints. We compare our approach against the methods that rely on task boundaries to select a fixed set of examples, and show comparable or even better results, especially when the boundaries are blurry or when the data distributions are imbalanced.
http://arxiv.org/abs/1903.08671
In this paper, we address the problem of 3D object mesh reconstruction from RGB videos. Our approach combines the best of multi-view geometric and data-driven methods for 3D reconstruction by optimizing object meshes for multi-view photometric consistency while constraining mesh deformations with a shape prior. We pose this as a piecewise image alignment problem for each mesh face projection. Our approach allows us to update shape parameters from the photometric error without any depth or mask information. Moreover, we show how to avoid a degeneracy of zero photometric gradients via rasterizing from a virtual viewpoint. We demonstrate 3D object mesh reconstruction results from both synthetic and real-world videos with our photometric mesh optimization, which is unachievable with either na"ive mesh generation networks or traditional pipelines of surface reconstruction without heavy manual post-processing.
http://arxiv.org/abs/1903.08642
For a team of heterogeneous robots executing multiple tasks, we propose a novel algorithm to optimally allocate tasks to robots while accounting for their different capabilities. Motivated by the need that robot teams have in many real-world applications of remaining operational for long periods of time, we allow each robot to choose tasks taking into account the energy consumed by executing them, besides the global specifications on the task allocation. The tasks are encoded as constraints in an energy minimization problem solved at each point in time by each robot. The prioritization of a task over others – effectively signifying the allocation of the task to that particular robot – occurs via the introduction of slack variables in the task constraints. Moreover, the suitabilities of certain robots towards certain tasks are also taken into account to generate a task allocation algorithm for a team of robots with heterogeneous capabilities. The efficacy of the developed approach is demonstrated both in simulation and on a team of real robots.
http://arxiv.org/abs/1903.08641
It holds great implications for practical applications to enable centimeter-accuracy positioning for mobile and wearable sensor systems. In this paper, we propose a novel, high-precision, efficient visual-inertial (VI)-SLAM algorithm, termed Schmidt-EKF VI-SLAM (SEVIS), which optimally fuses IMU measurements and monocular images in a tightly-coupled manner to provide 3D motion tracking with bounded error. In particular, we adapt the Schmidt Kalman filter formulation to selectively include informative features in the state vector while treating them as nuisance parameters (or Schmidt states) once they become matured. This change in modeling allows for significant computational savings by no longer needing to constantly update the Schmidt states (or their covariance), while still allowing the EKF to correctly account for their cross-correlations with the active states. As a result, we achieve linear computational complexity in terms of map size, instead of quadratic as in the standard SLAM systems. In order to fully exploit the map information to bound navigation drifts, we advocate efficient keyframe-aided 2D-to-2D feature matching to find reliable correspondences between current 2D visual measurements and 3D map features. The proposed SEVIS is extensively validated in both simulations and experiments.
http://arxiv.org/abs/1903.08636
In this paper we propose a new analytical preintegration theory for graph-based sensor fusion with an inertial measurement unit (IMU) and a camera (or other aiding sensors).Rather than using discrete sampling of the measurement dynamics as in current methods,we derive the closed-form solutions to the preintegration equations, yielding improved accuracy in state estimation.We advocate two new different inertial models for preintegration: (i) the model that assumes piecewise constant measurements, and (ii) the model that assumes piecewise constant local true acceleration.We show through extensive Monte-Carlo simulations the effect that the choice of preintegration model has on estimation performance.To validate the proposed preintegration theory, we develop both direct and indirect visual-inertial navigation systems (VINS) that leverage our preintegration.In the first, within a tightly-coupled, sliding-window optimization framework, we jointly estimate the features in the window and the IMU states while performing marginalization to bound the computational cost.In the second, we loosely-couple the IMU preintegration with a direct image alignment that estimates relative camera motion by minimizing the photometric errors (i.e., image intensity difference), allowing for efficient and informative loop closures. Both systems are extensively validated in real-world experiments and are shown to offer competitive performance to state-of-the-art methods.
http://arxiv.org/abs/1805.02774
Magnetic Resonance Imaging (MRI) is a non-invasive diagnostic tool that provides excellent soft-tissue contrast without the use of ionizing radiation. But, compared to other clinical imaging modalities (e.g., CT or ultrasound), the data acquisition process for MRI is inherently slow. Furthermore, dynamic applications demand collecting a series of images in quick succession. As a result, reducing acquisition time and improving imaging quality for undersampled datasets have been active areas of research for the last two decades. The combination of parallel imaging and compressive sensing (CS) has been shown to benefit a wide range of MRI applications. More recently, deep learning techniques have been shown to outperform CS methods. Some of these techniques pose the MRI reconstruction as a direct inversion problem and tackle it by training a deep neural network (DNN) to map from the measured Fourier samples and the final image. Considering that the forward model in MRI changes from one dataset to the next, such methods have to be either trained over a large and diverse corpus of data or limited to a specific application, and even then they cannot ensure data consistency. An alternative is to use “plug-and-play” (PnP) algorithms, which iterate image denoising with forward-model based signal recovery. PnP algorithms are an excellent fit for compressive MRI because they decouple image modeling from the forward model, which can change significantly among different scans due to variations in the coil sensitivity maps, sampling patterns, and image resolution. Consequently, with PnP, state-of-the-art image-denoising techniques, such as those based on DNNs, can be directly exploited for compressive MRI image reconstruction. The objective of this article is two-fold: i) to review recent advances in plug-and-play methods, and ii) to discuss their application to compressive MRI image reconstruction.
http://arxiv.org/abs/1903.08616
Standard image captioning tasks such as COCO and Flickr30k are factual, neutral in tone and (to a human) state the obvious (e.g., “a man playing a guitar”). While such tasks are useful to verify that a machine understands the content of an image, they are not engaging to humans as captions. With this in mind we define a new task, Personality-Captions, where the goal is to be as engaging to humans as possible by incorporating controllable style and personality traits. We collect and release a large dataset of 201,858 of such captions conditioned over 215 possible traits. We build models that combine existing work from (i) sentence representations (Mazare et al., 2018) with Transformers trained on 1.7 billion dialogue examples; and (ii) image representations (Mahajan et al., 2018) with ResNets trained on 3.5 billion social media images. We obtain state-of-the-art performance on Flickr30k and COCO, and strong performance on our new task. Finally, online evaluations validate that our task and models are engaging to humans, with our best model close to human performance.
https://arxiv.org/abs/1810.10665
Adversary scenarios in driving, where the other vehicles may make mistakes or have a competing or malicious intent, have to be studied not only for our safety but also for addressing the concerns from public in order to push the technology forward. Classical planning solutions for adversary driving do not exist so far, especially when the vehicles do not communicate their intent. Given recent success in solving hard problems in artificial intelligence (AI), it is worth studying the potential of reinforcement learning for safety driving in adversary settings. In most recent reinforcement learning applications, there is a deep neural networks that maps an input state to an optimal policy over primitive actions. However, learning a policy over primitive actions is very difficult and inefficient. On the other hand, the knowledge already learned in classical planning methods should be inherited and reused. In order to take advantage of reinforcement learning good at exploring the action space for safety and classical planning skill models good at handling most driving scenarios, we propose to learn a policy over an action space of primitive actions augmented with classical planning methods. We show two advantages by doing so. First, training this reinforcement learning agent is easier and faster than training the primitive-action agent. Second, our new agent outperforms the primitive-action reinforcement learning agent, human testers as well as the classical planning methods that our agent queries as skills.
http://arxiv.org/abs/1903.08606
Deep Neural Networks (DNNs) excel on many complex perceptual tasks but it has proven notoriously difficult to understand how they reach their decisions. We here introduce a high-performance DNN architecture on ImageNet whose decisions are considerably easier to explain. Our model, a simple variant of the ResNet-50 architecture called BagNet, classifies an image based on the occurrences of small local image features without taking into account their spatial ordering. This strategy is closely related to the bag-of-feature (BoF) models popular before the onset of deep learning and reaches a surprisingly high accuracy on ImageNet (87.6% top-5 for 33 x 33 px features and Alexnet performance for 17 x 17 px features). The constraint on local features makes it straight-forward to analyse how exactly each part of the image influences the classification. Furthermore, the BagNets behave similar to state-of-the art deep neural networks such as VGG-16, ResNet-152 or DenseNet-169 in terms of feature sensitivity, error distribution and interactions between image parts. This suggests that the improvements of DNNs over previous bag-of-feature classifiers in the last few years is mostly achieved by better fine-tuning rather than by qualitatively different decision strategies.
http://arxiv.org/abs/1904.00760
Although YOLOv2 approach is extremely fast on object detection; its backbone network has the low ability on feature extraction and fails to make full use of multi-scale local region features, which restricts the improvement of object detection accuracy. Therefore, this paper proposed a DC-SPP-YOLO (Dense Connection and Spatial Pyramid Pooling Based YOLO) approach for ameliorating the object detection accuracy of YOLOv2. Specifically, the dense connection of convolution layers is employed in the backbone network of YOLOv2 to strengthen the feature extraction and alleviate the vanishing-gradient problem. Moreover, an improved spatial pyramid pooling is introduced to pool and concatenate the multi-scale local region features, so that the network can learn the object features more comprehensively. The DC-SPP-YOLO model is established and trained based on a new loss function composed of mean square error and cross entropy, and the object detection is realized. Experiments demonstrate that the mAP (mean Average Precision) of DC-SPP-YOLO proposed on PASCAL VOC datasets and UA-DETRAC datasets is higher than that of YOLOv2; the object detection accuracy of DC-SPP-YOLO is superior to YOLOv2 by strengthening feature extraction and using the multi-scale local region features.
http://arxiv.org/abs/1903.08589
We propose a robust approach for the registration of two sets of 3D points in the presence of a large amount of outliers. Our first contribution is to reformulate the registration problem using a Truncated Least Squares (TLS) cost that makes the estimation insensitive to a large fraction of spurious point-to-point correspondences. The second contribution is a general framework to decouple rotation, translation, and scale estimation, which allows solving in cascade for the three transformations. Since each subproblem (scale, rotation, and translation estimation) is still non-convex and combinatorial in nature, out third contribution is to show that (i) TLS scale and (component-wise) translation estimation can be solved exactly and in polynomial time via an adaptive voting scheme, (ii) TLS rotation estimation can be relaxed to a semidefinite program and the relaxation is tight in practice, even in presence of an extreme amount of outliers. We validate the proposed algorithm, named TEASER (Truncated least squares Estimation And SEmidefinite Relaxation), in standard registration benchmarks showing that the algorithm outperforms RANSAC and robust local optimization techniques, and favorably compares with Branch-and-Bound methods, while being a polynomial-time algorithm. TEASER can tolerate up to 99% outliers and returns highly-accurate solutions.
http://arxiv.org/abs/1903.08588
Deep learning techniques involving image processing and data analysis are constantly evolving. Many domains adapt these techniques for object segmentation, instantiation and classification. Recently, agricultural industries adopted those techniques in order to bring automation to farmers around the globe. One analysis procedure required for automatic visual inspection in this domain is leaf count and segmentation. Collecting labeled data from field crops and greenhouses is a complicated task due to the large variety of crops, growth seasons, climate changes, phenotype diversity, and more, especially when specific learning tasks require a large amount of labeled data for training. Data augmentation for training deep neural networks is well established, examples include data synthesis, using generative semi-synthetic models, and applying various kinds of transformations. In this paper we propose a method that preserves the geometric structure of the data objects, thus keeping the physical appearance of the data-set as close as possible to imaged plants in real agricultural scenes. The proposed method provides state of the art results when applied to the standard benchmark in the field, namely, the ongoing Leaf Segmentation Challenge hosted by Computer Vision Problems in Plant Phenotyping.
http://arxiv.org/abs/1903.08583
We present a comprehensive study and evaluation of existing single image deraining algorithms, using a new large-scale benchmark consisting of both synthetic and real-world rainy images.This dataset highlights diverse data sources and image contents, and is divided into three subsets (rain streak, rain drop, rain and mist), each serving different training or evaluation purposes. We further provide a rich variety of criteria for dehazing algorithm evaluation, ranging from full-reference metrics, to no-reference metrics, to subjective evaluation and the novel task-driven evaluation. Experiments on the dataset shed light on the comparisons and limitations of state-of-the-art deraining algorithms, and suggest promising future directions.
http://arxiv.org/abs/1903.08558
We present a novel model called OCGAN for the classical problem of one-class novelty detection, where, given a set of examples from a particular class, the goal is to determine if a query example is from the same class. Our solution is based on learning latent representations of in-class examples using a denoising auto-encoder network. The key contribution of our work is our proposal to explicitly constrain the latent space to exclusively represent the given class. In order to accomplish this goal, firstly, we force the latent space to have bounded support by introducing a tanh activation in the encoder’s output layer. Secondly, using a discriminator in the latent space that is trained adversarially, we ensure that encoded representations of in-class examples resemble uniform random samples drawn from the same bounded space. Thirdly, using a second adversarial discriminator in the input space, we ensure all randomly drawn latent samples generate examples that look real. Finally, we introduce a gradient-descent based sampling technique that explores points in the latent space that generate potential out-of-class examples, which are fed back to the network to further train it to generate in-class examples from those points. The effectiveness of the proposed method is measured across four publicly available datasets using two one-class novelty detection protocols where we achieve state-of-the-art results.
http://arxiv.org/abs/1903.08550
Over the past few years, dictionary learning (DL)-based methods have been successfully used in various image reconstruction problems. However, traditional DL-based computed tomography (CT) reconstruction methods are patch-based and ignore the consistency of pixels in overlapped patches. In addition, the features learned by these methods always contain shifted versions of the same features. In recent years, convolutional sparse coding (CSC) has been developed to address these problems. In this paper, inspired by several successful applications of CSC in the field of signal processing, we explore the potential of CSC in sparse-view CT reconstruction. By directly working on the whole image, without the necessity of dividing the image into overlapped patches in DL-based methods, the proposed methods can maintain more details and avoid artifacts caused by patch aggregation. With predetermined filters, an alternating scheme is developed to optimize the objective function. Extensive experiments with simulated and real CT data were performed to validate the effectiveness of the proposed methods. Qualitative and quantitative results demonstrate that the proposed methods achieve better performance than several existing state-of-the-art methods.
http://arxiv.org/abs/1903.08549
Tissue loss in the hippocampi has been heavily correlated with the progression of Alzheimer’s Disease (AD). The shape and structure of the hippocampus are important factors in terms of early AD diagnosis and prognosis by clinicians. However, manual segmentation of such subcortical structures in MR studies is a challenging and subjective task. In this paper, we investigate variants of the well known 3D U-Net, a type of convolution neural network (CNN) for semantic segmentation tasks. We propose an alternative form of the 3D U-Net, which uses dilated convolutions and deep supervision to incorporate multi-scale information into the model. The proposed method is evaluated on the task of hippocampus head and body segmentation in an MRI dataset, provided as part of the MICCAI 2018 segmentation decathlon challenge. The experimental results show that our approach outperforms other conventional methods in terms of different segmentation accuracy metrics.
http://arxiv.org/abs/1903.09097
Efficient point cloud compression is fundamental to enable the deployment of virtual and mixed reality applications, since the number of points to code can range in the order of millions. In this paper, we present a novel data-driven geometry compression method for static point clouds based on learned convolutional transforms and uniform quantization. We perform joint optimization of both rate and distortion using a trade-off parameter. In addition, we cast the decoding process as a binary classification of the point cloud occupancy map. Our method outperforms the MPEG reference solution in terms of rate-distortion on the Microsoft Voxelized Upper Bodies dataset with 51.5% BDBR savings on average. Moreover, while octree-based methods face exponential diminution of the number of points at low bitrates, our method still produces high resolution outputs even at low bitrates.
http://arxiv.org/abs/1903.08548
Robots must know how to be gentle when they need to interact with fragile objects, or when the robot itself is prone to wear and tear. We propose an approach that enables deep reinforcement learning to train policies that are gentle, both during exploration and task execution. In a reward-based learning environment, a natural approach involves augmenting the (task) reward with a penalty for non-gentleness, which can be defined as excessive impact force. However, augmenting with only this penalty impairs learning: policies get stuck in a local optimum which avoids all contact with the environment. Prior research has shown that combining auxiliary tasks or intrinsic rewards can be beneficial for stabilizing and accelerating learning in sparse-reward domains, and indeed we find that introducing a surprise-based intrinsic reward does avoid the no-contact failure case. However, we show that a simple dynamics-based surprise is not as effective as penalty-based surprise. Penalty-based surprise, based on predicting forceful contacts, has a further benefit: it encourages exploration which is contact-rich yet gentle. We demonstrate the effectiveness of the approach using a complex, tendon-powered robot hand with tactile sensors. Videos are available at this http URL
http://arxiv.org/abs/1903.08542
Automated surface-anomaly detection using machine learning has become an interesting and promising area of research, with a very high and direct impact on the application domain of visual inspection. Deep-learning methods have become the most suitable approaches for this task. They allow the inspection system to learn to detect the surface anomaly by simply showing it a number of exemplar images. This paper presents a segmentation-based deep-learning architecture that is designed for the detection and segmentation of surface anomalies and is demonstrated on a specific domain of surface-crack detection. The design of the architecture enables the model to be trained using a small number of samples, which is an important requirement for practical applications. The proposed model is compared with the related deep-learning methods, including the state-of-the-art commercial software, showing that the proposed approach outperforms the related methods on the domain of a surface-crack detection. The large number of experiments also shed light on the required precision of the annotation, the number of required training samples and on the required computational cost. Experiments are performed on a newly created dataset based on a real-world quality control case and demonstrate that the proposed approach is able to learn on a small number of defected surfaces, using only approximately 25-30 defective training samples, instead of hundreds or thousands, which is usually the case in deep-learning applications. This makes the deep-learning method practical for use in industry where the number of available defective samples is limited. The dataset is also made publicly available to encourage the development and evaluation of new methods for surface-defect detection.
http://arxiv.org/abs/1903.08536
Recently, deep learning based 3D face reconstruction methods have shown promising results in both quality and efficiency.However, training deep neural networks typically requires a large volume of data, whereas face images with ground-truth 3D face shapes are scarce. In this paper, we propose a novel deep 3D face reconstruction approach that 1) leverages a robust, hybrid loss function for weakly-supervised learning which takes into account both low-level and perception-level information for supervision, and 2) performs multi-image face reconstruction by exploiting complementary information from different images for shape aggregation. Our method is fast, accurate, and robust to occlusion and large pose. We provide comprehensive experiments on three datasets, systematically comparing our method with fifteen recent methods and demonstrating its state-of-the-art performance.
http://arxiv.org/abs/1903.08527
We present a machine-readable movement writing for sleight-of-hand moves with cards – a “Labanotation of card magic.” This scheme of movement writing contains 440 categories of motion, and appears to taxonomize all card sleights that have appeared in over 1500 publications. The movement writing is axiomatized in $\mathcal{SROIQ}$(D) Description Logic, and collected formally as an Ontology of Card Sleights, a computational ontology that extends the Basic Formal Ontology and the Information Artifact Ontology. The Ontology of Card Sleights is implemented in OWL DL, a Description Logic fragment of the Web Ontology Language. While ontologies have historically been used to classify at a less granular level, the algorithmic nature of card tricks allows us to transcribe a performer’s actions step by step. We conclude by discussing design criteria we have used to ensure the ontology can be accessed and modified with a simple click-and-drag interface. This may allow database searches and performance transcriptions by users with card magic knowledge, but no ontology background.
http://arxiv.org/abs/1903.08523
Early diagnosis of acute coronary artery occlusion based on electrocardiogram (ECG) findings is essential for prompt delivery of primary percutaneous coronary intervention. Current ST elevation (STE) criteria are specific but insensitive. Consequently, it is likely that many patients are missing out on potentially life-saving treatment. Experts combining non-specific ECG changes with STE detect ischaemia with higher sensitivity, but at the cost of specificity. We show that a deep learning model can detect ischaemia caused by acute coronary artery occlusion with a better balance of sensitivity and specificity than STE criteria, existing computerised analysers or expert cardiologists.
http://arxiv.org/abs/1903.04421
The TriRhenaTech alliance universities and their partners presented their competences in the field of artificial intelligence and their cross-border cooperations with the industry at the tri-national conference ‘Artificial Intelligence : from Research to Application’ on March 13th, 2019 in Offenburg. The TriRhenaTech alliance is a network of universities in the Upper Rhine Trinational Metropolitan Region comprising of the German universities of applied sciences in Furtwangen, Kaiserslautern, Karlsruhe, and Offenburg, the Baden-Wuerttemberg Cooperative State University Loerrach, the French university network Alsace Tech (comprised of 14 ‘grandes 'ecoles’ in the fields of engineering, architecture and management) and the University of Applied Sciences and Arts Northwestern Switzerland. The alliance’s common goal is to reinforce the transfer of knowledge, research, and technology, as well as the cross-border mobility of students.
http://arxiv.org/abs/1903.08495
Here we present CaosDB, a Research Data Management System (RDMS) designed to ensure seamless integration of inhomogeneous data sources and repositories of legacy data. Its primary purpose is the management of data from biomedical sciences, both from simulations and experiments during the complete research data lifecycle. An RDMS for this domain faces particular challenges: Research data arise in huge amounts, from a wide variety of sources, and traverse a highly branched path of further processing. To be accepted by its users, an RDMS must be built around workflows of the scientists and practices and thus support changes in workflow and data structure. Nevertheless it should encourage and support the development and observation of standards and furthermore facilitate the automation of data acquisition and processing with specialized software. The storage data model of an RDMS must reflect these complexities with appropriate semantics and ontologies while offering simple methods for finding, retrieving, and understanding relevant data. We show how CaosDB responds to these challenges and give an overview of the CaosDB Server, its data model and its easy-to-learn CaosDB Query Language. We briefly discuss the status of the implementation, how we currently use CaosDB, and how we plan to use and extend it.
http://arxiv.org/abs/1801.07653
Developing a robust algorithm for automatic individual tree crown (ITC) detection from laser scanning datasets is important for tracking the responses of trees to anthropogenic change. Such approaches allow the size, growth and mortality of individual trees to be measured, enabling forest carbon stocks and dynamics to be tracked and understood. Many algorithms exist for structurally simple forests including coniferous forests and plantations. Finding a robust solution for structurally complex, species-rich tropical forests remains a challenge; existing segmentation algorithms often perform less well than simple area-based approaches when estimating plot-level biomass. Here we describe a Multi-Class Graph Cut (MCGC) approach to tree crown delineation. This uses local three-dimensional geometry and density information, alongside knowledge of crown allometries, to segment individual tree crowns from LiDAR point clouds. Our approach robustly identifies trees in the top and intermediate layers of the canopy, but cannot recognise small trees. From these three-dimensional crowns, we are able to measure individual tree biomass. Comparing these estimates to those from permanent inventory plots, our algorithm is able to produce robust estimates of hectare-scale carbon density, demonstrating the power of ITC approaches in monitoring forests. The flexibility of our method to add additional dimensions of information, such as spectral reflectance, make this approach an obvious avenue for future development and extension to other sources of three-dimensional data, such as structure from motion datasets.
http://arxiv.org/abs/1903.08481
Deep learning is a research hot topic in the field of machine learning. Real-value neural networks (Real NNs), especially deep real networks (DRNs), have been widely used in many research fields. In recent years, the deep complex networks (DCNs) and the deep quaternion networks (DQNs) have attracted more and more attentions. The octonion algebra, which is an extension of complex algebra and quaternion algebra, can provide more efficient and compact expression. This paper constructs a general framework of deep octonion networks (DONs) and provides the main building blocks of DONs such as octonion convolution, octonion batch normalization and octonion weight initialization; DONs are then used in image classification tasks for CIFAR-10 and CIFAR-100 data sets. Compared with the DRNs, the DCNs, and the DQNs, the proposed DONs have better convergence and higher classification accuracy. The success of DONs is also explained by multi-task learning.
http://arxiv.org/abs/1903.08478
We present a method for fast and accurate physics-based predictions during non-prehensile manipulation planning and control. Given an initial state and a sequence of controls, the problem of predicting the resulting sequence of states is a key component of a variety of model-based planning and control algorithms. We propose combining a coarse (i.e. computationally cheap but not very accurate) predictive physics model, with a fine (i.e. computationally expensive but accurate) predictive physics model, to generate a hybrid model that is at the required speed and accuracy for a given manipulation task. Our approach is based on the Parareal algorithm, a parallel-in-time integration method used for computing numerical solutions for general systems of ordinary differential equations. We use Parareal to combine a coarse pushing model with an off-the-shelf physics engine to deliver physics-based predictions that are as accurate as the physics engine but runs in substantially less wall-clock time, thanks to Parareal being amenable to parallelization. We use these physics-based predictions in a model-predictive-control framework based on trajectory optimization, to plan pushing actions that avoid an obstacle and reach a goal location. We show that by combining the two physics models, we can achieve the same success rates as the planner that uses the off-the-shelf physics engine directly, but significantly faster. We present experiments in simulation and on a real robotic setup.
http://arxiv.org/abs/1903.08470
Recent success of semantic segmentation approaches on demanding road driving datasets has spurred interest in many related application fields. Many of these applications involve real-time prediction on mobile platforms such as cars, drones and various kinds of robots. Real-time setup is challenging due to extraordinary computational complexity involved. Many previous works address the challenge with custom lightweight architectures which decrease computational complexity by reducing depth, width and layer capacity with respect to general purpose architectures. We propose an alternative approach which achieves a significantly better performance across a wide range of computing budgets. First, we rely on a light-weight general purpose architecture as the main recognition engine. Then, we leverage light-weight upsampling with lateral connections as the most cost-effective solution to restore the prediction resolution. Finally, we propose to enlarge the receptive field by fusing shared features at multiple resolutions in a novel fashion. Experiments on several road driving datasets show a substantial advantage of the proposed approach, either with ImageNet pre-trained parameters or when we learn from scratch. Our Cityscapes test submission entitled SwiftNetRN-18 delivers 75.5% MIoU and achieves 39.9Hz on 1024x2048 images on GTX1080Ti.
http://arxiv.org/abs/1903.08469
Recurrent neural networks (RNNs) can model natural language by sequentially ‘reading’ input tokens and outputting a distributed representation of each token. Due to the sequential nature of RNNs, inference time is linearly dependent on the input length, and all inputs are read regardless of their importance. Efforts to speed up this inference, known as ‘neural speed reading’, either ignore or skim over part of the input. We present Structural-Jump-LSTM: the first neural speed reading model to both skip and jump text during inference. The model consists of a standard LSTM and two agents: one capable of skipping single words when reading, and one capable of exploiting punctuation structure (sub-sentence separators (,:), sentence end symbols (.!?), or end of text markers) to jump ahead after reading a word. A comprehensive experimental evaluation of our model against all five state-of-the-art neural reading models shows that Structural-Jump-LSTM achieves the best overall floating point operations (FLOP) reduction (hence is faster), while keeping the same accuracy or even improving it compared to a vanilla LSTM that reads the whole text.
http://arxiv.org/abs/1904.00761
We address the issue of editing musical performance data, in particular MIDI files representing human musical performances. Editing such sequences raises specific issues due to the ambiguous nature of musical objects. The first source of ambiguity is that musicians naturally produce many deviations from the metrical frame. These deviations may be intentional or subconscious, but they play an important role in conveying the groove or feeling of a performance. Relations between musical elements are also usually implicit, creating even more ambiguity. A note is in relation with the surrounding notes in many possible ways: it can be part of a melodic pattern, it can also play a harmonic role with the simultaneous notes, or be a pedal-tone. All these aspects play an essential role that should be preserved, as much as possible, when editing musical sequences. In this paper, we contribute specifically to the problem of editing non-quantized, metrical musical sequences represented as MIDI files. We first list of number of problems caused by the use of naive edition operations applied to performance data, using a motivating example. We then introduce a model, called Dancing MIDI, based on 1) two desirable, well-defined properties for edit operations and 2) two well-defined operations, Split and Concat, with an implementation. We show that our model formally satisfies the two properties, and that it prevents most of the problems that occur with naive edit operations on our motivating example, as well as on a real-world example using an automatic harmonizer.
http://arxiv.org/abs/1903.08459
In this paper we investigate the problems of class imbalance and irrelevant relationships in Visual Relationship Detection (VRD). State-of-the-art deep VRD models still struggle to predict uncommon classes, limiting their applicability. Moreover, many methods are incapable of properly filtering out background relationships while predicting relevant ones. Although these problems are very apparent, they have both been overlooked so far. We analyse why this is the case and propose modifications to both model and training to alleviate the aforementioned issues, as well as suggesting new measures to complement existing ones and give a more holistic picture of the efficacy of a model.
http://arxiv.org/abs/1903.08456
In this paper, we propose a constraint-based modeling approach for the problem of discovering frequent gradual patterns in a numerical dataset. This SAT-based declarative approach offers an additional possibility to benefit from the recent progress in satisfiability testing and to exploit the efficiency of modern SAT solvers for enumerating all frequent gradual patterns in a numerical dataset. Our approach can easily be extended with extra constraints, such as temporal constraints in order to extract more specific patterns in a broad range of gradual patterns mining applications. We show the practical feasibility of our SAT model by running experiments on two real world datasets.
http://arxiv.org/abs/1903.08452
To capture salient contextual information for spoken language understanding (SLU) of a dialogue, we propose time-aware models that automatically learn the latent time-decay function of the history without a manual time-decay function. We also propose a method to identify and label the current speaker to improve the SLU accuracy. In experiments on the benchmark dataset used in Dialog State Tracking Challenge 4, the proposed models achieved significantly higher F1 scores than the state-of-the-art contextual models. Finally, we analyze the effectiveness of the introduced models in detail. The analysis demonstrates that the proposed methods were effective to improve SLU accuracy individually.
http://arxiv.org/abs/1903.08450
We propose a novel transition-based algorithm that straightforwardly parses sentences from left to right by building $n$ attachments, with $n$ being the length of the input sentence. Similarly to the recent stack-pointer parser by Ma et al. (2018), we use the pointer network framework that, given a word, can directly point to a position from the sentence. However, our left-to-right approach is simpler than the original top-down stack-pointer parser (not requiring a stack) and reduces transition sequence length in half, from 2$n$-1 actions to $n$. This results in a quadratic non-projective parser that runs twice as fast as the original while achieving the best accuracy to date on the English PTB dataset (96.04% UAS, 94.43% LAS) among fully-supervised single-model dependency parsers, and improves over the former top-down transition system in the majority of languages tested.
http://arxiv.org/abs/1903.08445
Estimating the intensity of emotion has gained significance as modern textual inputs in potential applications like social media, e-retail markets, psychology, advertisements etc., carry a lot of emotions, feelings, expressions along with its meaning. However, the approaches of traditional sentiment analysis primarily focuses on classifying the sentiment in general (positive or negative) or at an aspect level(very positive, low negative, etc.) and cannot exploit the intensity information. Moreover, automatically identifying emotions like anger, fear, joy, sadness, disgust etc., from text introduces challenging scenarios where single tweet may contain multiple emotions with different intensities and some emotions may even co-occur in some of the tweets. In this paper, we propose an architecture, Experts Model, inspired from the standard Mixture of Experts (MoE) model. The key idea here is each expert learns different sets of features from the feature vector which helps in better emotion detection from the tweet. We compared the results of our Experts Model with both baseline results and top five performers of SemEval-2018 Task-1, Affect in Tweets (AIT). The experimental results show that our proposed approach deals with the emotion detection problem and stands at top-5 results.
http://arxiv.org/abs/1904.00762
We study strategy synthesis for partially observable Markov decision processes (POMDPs). The particular problem is to determine strategies that provably adhere to (probabilistic) temporal logic constraints. This problem is computationally intractable and theoretically hard. We propose a novel method that combines techniques from machine learning and formal verification. First, we train a recurrent neural network (RNN) to encode POMDP strategies. The RNN accounts for memory-based decisions without the need to expand the full belief space of a POMDP. Secondly, we restrict the RNN-based strategy to represent a finite-memory strategy and implement it on a specific POMDP. For the resulting finite Markov chain, efficient formal verification techniques provide provable guarantees against temporal logic specifications. If the specification is not satisfied, counterexamples supply diagnostic information. We use this information to improve the strategy by iteratively training the RNN. Numerical experiments show that the proposed method elevates the state of the art in POMDP solving by up to three orders of magnitude in terms of solving times and model sizes.
http://arxiv.org/abs/1903.08428
The paper is a half-way between the agent technology and the mathematical reasoning to model tactical decision making tasks. These models are applied to air defense (AD) domain for command and control (C2). It also addresses the issues related to evaluation of agents. The agents are designed and implemented using the agent-programming paradigm. The agents are deployed in an air combat simulated environment for performing the tasks of C2 like electronic counter counter measures, threat assessment, and weapon allocation. The simulated AD system runs without any human intervention, and represents state-of-the-art model for C2 autonomy. The use of agents as autonomous decision making entities is particularly useful in view of futuristic network centric warfare.
http://arxiv.org/abs/1903.08412
Automatic fact-checking systems detect misinformation, such as fake news, by (i) selecting check-worthy sentences for fact-checking, (ii) gathering related information to the sentences, and (iii) inferring the factuality of the sentences. Most prior research on (i) uses hand-crafted features to select check-worthy sentences, and does not explicitly account for the recent finding that the top weighted terms in both check-worthy and non-check-worthy sentences are actually overlapping [15]. Motivated by this, we present a neural check-worthiness sentence ranking model that represents each word in a sentence by \textit{both} its embedding (aiming to capture its semantics) and its syntactic dependencies (aiming to capture its role in modifying the semantics of other terms in the sentence). Our model is an end-to-end trainable neural network for check-worthiness ranking, which is trained on large amounts of unlabelled data through weak supervision. Thorough experimental evaluation against state of the art baselines, with and without weak supervision, shows our model to be superior at all times (+13% in MAP and +28% at various Precision cut-offs from the best baseline with statistical significance). Empirical analysis of the use of weak supervision, word embedding pretraining on domain-specific data, and the use of syntactic dependencies of our model reveals that check-worthy sentences contain notably more identical syntactic dependencies than non-check-worthy sentences.
http://arxiv.org/abs/1903.08404
When the meaning of a phrase cannot be inferred from the individual meanings of its words (e.g., hot dog), that phrase is said to be non-compositional. Automatic compositionality detection in multi-word phrases is critical in any application of semantic processing, such as search engines; failing to detect non-compositional phrases can hurt system effectiveness notably. Existing research treats phrases as either compositional or non-compositional in a deterministic manner. In this paper, we operationalize the viewpoint that compositionality is contextual rather than deterministic, i.e., that whether a phrase is compositional or non-compositional depends on its context. For example, the phrase `green card’ is compositional when referring to a green colored card, whereas it is non-compositional when meaning permanent residence authorization. We address the challenge of detecting this type of contextual compositionality as follows: given a multi-word phrase, we enrich the word embedding representing its semantics with evidence about its global context (terms it often collocates with) as well as its local context (narratives where that phrase is used, which we call usage scenarios). We further extend this representation with information extracted from external knowledge bases. The resulting representation incorporates both localized context and more general usage of the phrase and allows to detect its compositionality in a non-deterministic and contextual way. Empirical evaluation of our model on a dataset of phrase compositionality, manually collected by crowdsourcing contextual compositionality assessments, shows that our model outperforms state-of-the-art baselines notably on detecting phrase compositionality.
http://arxiv.org/abs/1903.08389
Compact convolutional neural networks gain efficiency mainly through depthwise convolutions, expanded channels and complex topologies, which contrarily aggravate the training efforts. In this work, we identify the shift problem occurs in even-sized kernel (2x2, 4x4) convolutions, and eliminate it by proposing symmetric padding on each side of the feature maps (C2sp, C4sp). Symmetric padding enlarges the receptive fields of even-sized kernels with little computational cost. In classification tasks, C2sp outperforms the conventional 3x3 convolution and obtains comparable accuracies to existing compact convolution blocks, but consumes less memory and time during training. In generation tasks, C2sp and C4sp both achieve improved image qualities and stabilized training. Symmetric padding coupled with even-sized convolution is easy to be implemented into deep learning frameworks, providing promising building units for architecture designs that emphasize training efforts on online and continual learning occasions.
http://arxiv.org/abs/1903.08385
This paper addresses the issue of building a part-based representation of a dataset of images. More precisely, we look for a non-negative, sparse decomposition of the images on a reduced set of atoms, in order to unveil a morphological and interpretable structure of the data. Additionally, we want this decomposition to be computed online for any new sample that is not part of the initial dataset. Therefore, our solution relies on a sparse, non-negative auto-encoder where the encoder is deep (for accuracy) and the decoder shallow (for interpretability). This method compares favorably to the state-of-the-art online methods on two datasets (MNIST and Fashion MNIST), according to classical metrics and to a new one we introduce, based on the invariance of the representation to morphological dilation.
http://arxiv.org/abs/1904.00763
Lifelong learning, the problem of continual learning where tasks arrive in sequence, has been lately attracting more attention in the computer vision community. The aim of lifelong learning is to develop a system that can learn new tasks while maintaining the performance on the previously learned tasks. However, there are two obstacles for lifelong learning of deep neural networks: catastrophic forgetting and capacity limitation. To solve the above issues, inspired by the recent breakthroughs in automatically learning good neural network architectures, we develop a Multi-task based lifelong learning via nonexpansive AutoML framework termed Regularize, Expand and Compress (REC). REC is composed of three stages: 1) continually learns the sequential tasks without the learned tasks’ data via a newly proposed multi-task weight consolidation (MWC) algorithm; 2) expands the network to help the lifelong learning with potentially improved model capability and performance by network-transformation based AutoML; 3) compresses the expanded model after learning every new task to maintain model efficiency and performance. The proposed MWC and REC algorithms achieve superior performance over other lifelong learning algorithms on four different datasets.
http://arxiv.org/abs/1903.08362
Recent advancements have led to a proliferation of machine learning systems used to assist humans in a wide range of tasks. However, we are still far from accurate, reliable, and resource-efficient operations of these systems. For robot perception, convolutional neural networks (CNNs) for object detection and pose estimation are recently coming into widespread use. However, neural networks are known to suffer overfitting during training process and are less robust within unseen conditions, which are especially vulnerable to {\em adversarial scenarios}. In this work, we propose {\em Generative Robust Inference and Perception (GRIP)} as a two-stage object detection and pose estimation system that aims to combine relative strengths of discriminative CNNs and generative inference methods to achieve robust estimation. Our results show that a second stage of sample-based generative inference is able to recover from false object detection by CNNs, and produce robust estimations in adversarial conditions. We demonstrate the efficacy of {\em GRIP} robustness through comparison with state-of-the-art learning-based pose estimators and pick-and-place manipulation in dark and cluttered environments.
http://arxiv.org/abs/1903.08352
Single photon avalanche diodes (SPADs) are starting to play a pivotal role in the development of photon-efficient, long-range LiDAR systems. However, due to non-linearities in their image formation model, a high photon flux (e.g., due to strong sunlight) leads to distortion of the incident temporal waveform, and potentially, large depth errors. Operating SPADs in low flux regimes can mitigate these distortions, but, often requires attenuating the signal and thus, results in low signal-to-noise ratio. In this paper, we address the following basic question: what is the optimal photon flux that a SPAD-based LiDAR should be operated in? We derive a closed form expression for the optimal flux, which is quasi-depth-invariant, and depends on the ambient light strength. The optimal flux is lower than what a SPAD typically measures in real world scenarios, but surprisingly, considerably higher than what is conventionally suggested for avoiding distortions. We propose a simple, adaptive approach for achieving the optimal flux by attenuating incident flux based on an estimate of ambient light strength. Using extensive simulations and a hardware prototype, we show that the optimal flux criterion holds for several depth estimators, under a wide range of illumination conditions.
http://arxiv.org/abs/1903.08347
We present a novel learning framework for vehicle recognition from a single RGB image. Unlike existing methods which only use attention mechanisms to locate 2D discriminative information, our unified framework learns a joint representation of the 2D global texture and 3D-bounding-box in a mutually correlated and reinforced way. These two kinds of feature representation are combined by a novel fusion network, which predicts the vehicle’s category. The 2D global feature is extracted using an off-the-shelf detection network, where the estimated 2D bounding box assists in finding the region of interest (RoI). With the assistance of the RoI, the 3D bounding box and its corresponding features are generated in a geometrically correct way using a novel \textit{3D perspective Network} (3DPN). The 3DPN consists of a convolutional neural network (CNN), a vanishing point loss, and RoI perspective layers. The CNN regresses the 3D bounding box under the guidance of the proposed vanishing point loss, which provides a perspective geometry constraint. Thanks to the proposed RoI perspective layer, the variation caused by viewpoint changes is corrected via the estimated geometry, enhancing the feature representation. We present qualitative and quantitative results for our approach on the vehicle classification and verification tasks in the BoxCars dataset. The results demonstrate that, by learning how to extract features from the 3D bounding box, we can achieve comparable or superior performance to methods that only use 2D information.
https://arxiv.org/abs/1903.07916
Face detection is an important first step before face verification and recognition. In unconstrained settings it is still an open challenge because of the variation in pose, lighting, scale, background and location. However, for the purposes of verification we can have a control on background and location. Images are primarily captured in places such as the entrance to a sensitive building, in front of a door or some location where the background does not change. We present a correlation based face detection algorithm to detect faces in such settings, where we control the location, and leave lighting, pose, and scale uncontrolled. In these scenarios the results indicate that our algorithm is easy and fast to train, outperforms Viola and Jones face detection accuracy and is faster to test.
http://arxiv.org/abs/1903.08649