In this article we revisit the definition of Precision-Recall (PR) curves for generative models proposed by Sajjadi et al. (arXiv:1806.00035). Rather than providing a scalar for generative quality, PR curves distinguish mode collapse (poor recall) from bad quality (poor precision). We first generalize their formulation to arbitrary measures, removing any restriction to finite support. We also expose a bridge between PR curves and the type I and type II error rates of likelihood ratio classifiers on the task of discriminating between samples of the two distributions. Building upon this new perspective, we propose a novel algorithm for approximating precision-recall curves that shares some interesting methodological properties with the hypothesis testing technique of Lopez-Paz et al. (arXiv:1610.06545). We demonstrate the advantages of the proposed formulation over the original approach on controlled multi-modal datasets.
https://arxiv.org/abs/1905.05441
Visual localization is the attractive problem of estimating the camera pose of a query image with respect to a database of images. It is a crucial task for various applications, such as autonomous vehicles, assistive navigation and augmented reality. The challenge lies in the various appearance variations between query and database images, including illumination, season, dynamic-object and viewpoint variations. To tackle these challenges, this paper proposes the Panoramic Annular Localizer, which incorporates a panoramic annular lens and robust deep image descriptors. The panoramic annular images captured by a single camera are processed and fed into the NetVLAD network to form the active deep descriptor, and sequential matching is utilized to generate the localization result. Experiments carried out on public datasets and in the field demonstrate the validity of the proposed system.
https://arxiv.org/abs/1905.05425
In cross-lingual transfer, NLP models trained on one or more source languages are applied to a low-resource target language. While most prior work has used a single source model or a few carefully selected models, here we consider a massive setting with many such models. This setting raises the problem of poor transfer, particularly from distant languages. We propose two techniques for modulating the transfer, suitable for zero-shot or few-shot learning, respectively. Evaluating on named entity recognition, we show that our techniques are much more effective than strong baselines, including standard ensembling, and that our unsupervised method rivals oracle selection of the single best individual model.
http://arxiv.org/abs/1902.00193
In the context of Industry 4.0, data management is a key point for decision aid approaches. Large amounts of manufacturing digital data are collected on the shop floor. Their analysis can then require a large amount of computing power. The Big Data issue can be solved by aggregation, generating smart and meaningful data. This paper presents a new knowledge-based multi-level aggregation strategy to support decision making. Manufacturing knowledge is used at each level to design the monitoring criteria or aggregation operators. The proposed approach has been implemented as a demonstrator and successfully applied to a real machining database from the aeronautic industry. Keywords: decision making; machining; knowledge-based systems.
http://arxiv.org/abs/1905.06413
Understanding human actions is a crucial problem for service robots. However, the general trend in action recognition is to develop and test these systems on structured datasets. This work therefore presents a practical skeleton-based action recognition framework that can be used in realistic scenarios. Our results show that although non-augmented and non-normalized data may yield comparable results on the test split of a dataset, such models are far from useful on a different, manually collected dataset.
https://arxiv.org/abs/1905.05420
In this paper, we focus on the facial expression translation task and propose a novel Expression Conditional GAN (ECGAN) which can learn the mapping from one image domain to another based on an additional expression attribute. The proposed ECGAN is a generic framework and is applicable to different expression generation tasks, where a specific facial expression can be easily controlled by the conditional attribute label. Besides, we introduce a novel face mask loss to reduce the influence of background changes. Moreover, we propose an entire framework for facial expression generation and recognition in the wild, which consists of two modules, i.e., generation and recognition. Finally, we evaluate our framework on several public face datasets in which the subjects differ in race, illumination, occlusion, pose, color, content and background conditions. Even though these datasets are very diverse, both the qualitative and quantitative results demonstrate that our approach is able to generate facial expressions accurately and robustly.
https://arxiv.org/abs/1905.05416
We explore value-based solutions for multi-agent reinforcement learning (MARL) tasks in the recently popularized regime of centralized training with decentralized execution (CTDE). VDN and QMIX are representative examples that factorize the joint action-value function into individual ones for decentralized execution, but they address only a fraction of factorizable MARL tasks due to structural constraints in their factorization, such as additivity and monotonicity. In this paper, we propose a new factorization method for MARL, QTRAN, which is free from such structural constraints and takes a new approach: transforming the original joint action-value function into an easily factorizable one with the same optimal actions. QTRAN guarantees more general factorization than VDN or QMIX, thus covering a much wider class of MARL tasks than previous methods do. Our experiments on multi-domain Gaussian-squeeze and modified predator-prey tasks demonstrate QTRAN's superior performance, with especially large margins in games whose payoffs penalize non-cooperative behavior more aggressively.
https://arxiv.org/abs/1905.05408
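As context for the structural constraints this abstract discusses, here is a minimal PyTorch sketch (not the authors' code) of the two factorizations QTRAN relaxes: VDN's additive joint value and a QMIX-style monotonic mixer. The mixer is simplified here; QMIX actually generates the mixing weights with state-conditioned hypernetworks.

```python
import torch
import torch.nn as nn

# VDN: joint Q is the sum of per-agent utilities (additivity constraint).
def vdn_joint_q(per_agent_qs):
    # per_agent_qs: list of [batch] tensors, Q_i(tau_i, a_i)
    return torch.stack(per_agent_qs, dim=0).sum(dim=0)

# QMIX-style: mixing weights forced non-negative, which makes Q_tot
# monotone in each Q_i (monotonicity constraint).
class MonotonicMixer(nn.Module):
    def __init__(self, n_agents, hidden=32):
        super().__init__()
        self.w1 = nn.Parameter(torch.rand(n_agents, hidden))
        self.w2 = nn.Parameter(torch.rand(hidden, 1))

    def forward(self, qs):                    # qs: [batch, n_agents]
        h = torch.relu(qs @ self.w1.abs())    # abs() enforces monotonicity
        return h @ self.w2.abs()              # Q_tot: [batch, 1]
```

QTRAN instead learns a transformed joint action-value function free of both constraints; its exact architecture is given in the paper.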
The paper proves a closed-form formula for the number of k-skip-n-grams in a corpus of size $L$, expressed in terms of $k' = \min(L - n + 1, k)$.
https://arxiv.org/abs/1905.05407
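Since the closed-form expression itself is not reproduced in this abstract, a brute-force counter is a useful reference point. The sketch below assumes the common definition (Guthrie et al.) under which a k-skip-n-gram is a selection of n positions skipping at most k tokens in total; the paper's definition may differ.

```python
from itertools import combinations

def count_k_skip_n_grams(L, n, k):
    """Brute-force count of k-skip-n-grams for a corpus of length L,
    assuming a skip-gram is an ordered selection of n positions that
    skips at most k tokens in total."""
    count = 0
    for idx in combinations(range(L), n):
        if (idx[-1] - idx[0]) - (n - 1) <= k:  # total tokens skipped
            count += 1
    return count

print(count_k_skip_n_grams(L=10, n=3, k=2))
```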
Plug-and-play (PnP) is a non-convex framework that integrates modern denoising priors, such as BM3D or deep learning-based denoisers, into ADMM or other proximal algorithms. An advantage of PnP is that one can use pre-trained denoisers when there is not sufficient data for end-to-end training. Although PnP has been recently studied extensively with great empirical success, theoretical analysis addressing even the most basic question of convergence has been insufficient. In this paper, we theoretically establish convergence of PnP-FBS and PnP-ADMM, without using diminishing stepsizes, under a certain Lipschitz condition on the denoisers. We then propose real spectral normalization, a technique for training deep learning-based denoisers to satisfy the proposed Lipschitz condition. Finally, we present experimental results validating the theory.
https://arxiv.org/abs/1905.05406
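Illustrative of the Lipschitz-constraint idea, though not the paper's "real spectral normalization" itself: PyTorch's built-in power-iteration spectral norm can be applied to each layer of a small denoiser. Note that the built-in version normalizes the reshaped weight matrix, which only approximates the true operator norm of a convolution; closing that gap is precisely what motivates the paper's refinement.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# A toy DnCNN-like denoiser whose conv layers are spectrally normalized
# via power iteration, bounding each layer's (approximate) operator norm.
def make_denoiser(channels=64, depth=5):
    layers = [spectral_norm(nn.Conv2d(1, channels, 3, padding=1)), nn.ReLU()]
    for _ in range(depth - 2):
        layers += [spectral_norm(nn.Conv2d(channels, channels, 3, padding=1)),
                   nn.ReLU()]
    layers += [spectral_norm(nn.Conv2d(channels, 1, 3, padding=1))]
    return nn.Sequential(*layers)
```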
Automatically removing rain effects from an image has many applications, such as autonomous driving, drone piloting and photo editing, and remains an active research area. Traditional methods handcraft various priors to remove or separate rain effects from an image. Recently, end-to-end deep-learning-based deraining methods have been proposed to offer more flexibility and effectiveness. However, they tend to produce poor visual results on images with heavy rain. Heavy rain brings not only rain streaks but also a haze-like effect caused by the accumulation of tiny raindrops. Unlike previous deraining methods, in this paper we model rainy images with a new rain model to remove not only rain streaks but also the haze-like effect. Guided by our model, we design a two-branch network to learn its parameters. An SPP structure is then jointly trained to refine the results of our model and flexibly control the degree to which the haze-like effect is removed. In addition, a subnetwork that localizes rainy pixels is proposed to guide the training of our network. Extensive experiments on several datasets show that our method outperforms the state of the art in both objective assessments and visual quality.
https://arxiv.org/abs/1905.05404
We introduce a novel unsupervised domain adaptation approach for object detection. We aim to simultaneously alleviate the imperfect translation problem of pixel-level adaptations and the source-biased discriminativity problem of feature-level adaptations. Our approach is composed of two stages, i.e., Domain Diversification (DD) and Multi-domain-invariant Representation Learning (MRL). At the DD stage, we diversify the distribution of the labeled data by generating various distinctive shifted domains from the source domain. At the MRL stage, we apply adversarial learning with a multi-domain discriminator to encourage features to be indistinguishable among the domains. DD addresses the source-biased discriminativity, while MRL mitigates the imperfect image translation. We construct a structured domain adaptation framework for our learning paradigm and introduce a practical way of implementing DD. Our method outperforms the state-of-the-art methods by a large margin of 3%-11% in terms of mean average precision (mAP) on various datasets.
https://arxiv.org/abs/1905.05396
We present a survey on multilingual neural machine translation (MNMT), which has gained a lot of traction in recent years. MNMT has been useful in improving translation quality as a result of knowledge transfer. MNMT is more promising and interesting than its statistical machine translation counterpart because end-to-end modeling and distributed representations open new avenues. Many approaches have been proposed to exploit multilingual parallel corpora for improving translation quality. However, the lack of a comprehensive survey makes it difficult to determine which approaches are promising and hence deserve further exploration. In this paper, we present an in-depth survey of the existing literature on MNMT. We categorize various approaches based on the resource scenarios as well as the underlying modeling principles. We hope this paper will serve as a starting point for researchers and engineers interested in MNMT.
https://arxiv.org/abs/1905.05395
A key challenge in leveraging data augmentation for neural network training is choosing an effective augmentation policy from a large search space of candidate operations. Properly chosen augmentation policies can lead to significant generalization improvements; however, state-of-the-art approaches such as AutoAugment are computationally infeasible to run for the ordinary user. In this paper, we introduce a new data augmentation algorithm, Population Based Augmentation (PBA), which generates nonstationary augmentation policy schedules instead of a fixed augmentation policy. We show that PBA can match the performance of AutoAugment on CIFAR-10, CIFAR-100, and SVHN, with three orders of magnitude less overall compute. On CIFAR-10 we achieve a mean test error of 1.46%, which is a slight improvement upon the current state-of-the-art. The code for PBA is open source and is available at this https URL.
https://arxiv.org/abs/1905.05393
We discuss a data market technique based on the intrinsic value (relevance and uniqueness) as well as the extrinsic value (influenced by supply and demand) of data. For intrinsic value, we explain how to value data in absolute terms (i.e., just by itself), in relative terms (i.e., in comparison to multiple datasets), or in conditional terms (i.e., valuing new data given currently existing data).
http://arxiv.org/abs/1905.06462
Unsupervised domain adaptation in person re-identification resorts to labeled source data to promote model training on the target domain, facing dilemmas caused by large domain shift and large camera variations. The non-overlapping labels challenge, i.e., that the source and target domains contain entirely different persons, further increases the re-identification difficulty. In this paper, we propose a novel algorithm to narrow such domain gaps. We derive a camera style adaptation framework to learn style-based mappings between different camera views, from the target domain to the source domain, so that the identity-based distribution can be transferred from the source domain to the target domain at the camera level. To overcome the non-overlapping labels challenge and guide the person re-identification model to narrow the gap further, an efficient and effective soft-labeling method is proposed to mine the intrinsic local structure of the target domain by building a connection between the GAN-translated source domain and the target domain. Experiments conducted on real benchmark datasets indicate that our method achieves state-of-the-art results.
https://arxiv.org/abs/1905.05382
Inspired by recent successes in neural machine translation and image caption generation, we present an attention-based encoder-decoder model (AED) to recognize Vietnamese handwritten text. The model consists of two parts: a DenseNet for extracting invariant features, and a Long Short-Term Memory network (LSTM) decoder with an incorporated attention model for generating the output text; the CNN part feeds into the attention model. The input of the CNN part is a handwritten text image, and the target of the LSTM decoder is the corresponding text of the input image. Since all parts are differentiable, our model is trained end-to-end to predict the text from a given input image. In the experiments, we evaluate our proposed AED model on the VNOnDB-Word and VNOnDB-Line datasets to verify its efficiency. The experimental results show that our model achieves a word error rate of 12.30% without using any language model, which is competitive with the handwriting recognition system provided by Google in the Vietnamese Online Handwritten Text Recognition competition.
https://arxiv.org/abs/1905.05381
Pap smear testing has been widely used for detecting cervical cancers based on the morphological properties of cell nuclei in microscopic images. Accurate nuclei segmentation could thus improve the success rate of cervical cancer screening. In this work, a method for automated cervical nuclei segmentation using a Deformable Multipath Ensemble Model (D-MEM) is proposed. The approach adopts a U-shaped convolutional network as a backbone, in which dense blocks are used to transfer feature information more effectively. To increase the flexibility of the model, we then use deformable convolution to deal with nuclei of irregular shapes and sizes. To reduce predictive bias, we further construct multiple networks with different settings, which form an ensemble model. The proposed segmentation framework achieves state-of-the-art accuracy on the Herlev dataset with a Zijdenbos similarity index (ZSI) of 0.933, and has the potential to be extended to other medical image segmentation tasks.
http://arxiv.org/abs/1812.00527
With the thriving of deep learning, 3D convolutional neural networks have become a popular choice in volumetric image analysis due to their impressive ability to mine 3D contexts. However, 3D convolutional kernels introduce a significant increase in the number of trainable parameters. Since training data is often limited in biomedical tasks, a tradeoff has to be made between model size and representational power. To address this concern, we propose a novel 3D Dense Separated Convolution (3D-DSC) module to replace the original 3D convolutional kernels. The 3D-DSC module is constructed from a series of densely connected 1D filters. Decomposing the 3D kernel into 1D filters reduces the risk of overfitting by removing the redundancy of 3D kernels in a topologically constrained manner, while providing the infrastructure for deepening the network. By further introducing nonlinear layers and dense connections between the 1D filters, the network's representational power can be significantly improved while maintaining a compact architecture. We demonstrate the superiority of 3D-DSC on volumetric image classification and segmentation, two challenging tasks often encountered in biomedical image computing.
http://arxiv.org/abs/1905.08608
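A rough sketch of the idea, with the exact dense wiring and activation placement as assumptions since the abstract does not specify them: a k x k x k kernel is replaced by three 1D convolutions along depth, height and width, joined by nonlinearities and concatenative (dense) connections.

```python
import torch
import torch.nn as nn

class DSC3D(nn.Module):
    """Sketch: a k*k*k conv replaced by three 1D convs (along D, H, W),
    with nonlinearities and dense (concatenative) connections."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        p = k // 2
        self.conv_d = nn.Conv3d(c_in, c_out, (k, 1, 1), padding=(p, 0, 0))
        self.conv_h = nn.Conv3d(c_in + c_out, c_out, (1, k, 1), padding=(0, p, 0))
        self.conv_w = nn.Conv3d(c_in + 2 * c_out, c_out, (1, 1, k), padding=(0, 0, p))

    def forward(self, x):
        d = torch.relu(self.conv_d(x))
        h = torch.relu(self.conv_h(torch.cat([x, d], dim=1)))
        return torch.relu(self.conv_w(torch.cat([x, d, h], dim=1)))
```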
Recognition of historical documents is challenging due to noisy, damaged characters and backgrounds. Japanese historical documents not only suffer from these problems; pre-modern Japanese characters were also written in cursive and are connected, so character-segmentation-based methods do not work well. This leads to the idea of creating a new recognition system. In this paper, we propose a human-inspired document reading system to recognize multiple lines of pre-modern Japanese historical documents. When reading, people use eye movements to determine the start of a text line; they then move their eyes from the current character/word to the next. They can also determine the end of a line or skip a figure to move to the next line. These eye movements integrate with visual processing to carry out the reading process in the brain. We employ an attention-based encoder-decoder to implement this recognition system. First, the system detects where a text line starts. Second, it scans and recognizes character by character until the text line is completed. It then detects the start of the next text line, and this process is repeated until the whole document is read. We tested our human-inspired recognition system on the pre-modern Japanese historical documents provided by the PRMU Kuzushiji competition. The experiments demonstrate the superiority and effectiveness of our proposed system, which achieves Sequence Error Rates of 9.87% and 53.81% on levels 2 and 3 of the dataset, respectively, outperforming all other systems that participated in the competition.
https://arxiv.org/abs/1905.05377
Spatial audio is an essential medium for 3D visual and auditory experiences. However, the recording devices and techniques are expensive or inaccessible to the general public. In this work, we propose a self-supervised audio spatialization network that can generate spatial audio given the corresponding video and monaural audio. To enhance spatialization performance, we use an auxiliary classifier to distinguish ground-truth videos from those whose audio has the left and right channels swapped. We collect a large-scale video dataset with spatial audio to validate the proposed method. Experimental results demonstrate the effectiveness of the proposed model on the audio spatialization task.
https://arxiv.org/abs/1905.05375
Traditional metrics for evaluating the efficacy of image processing techniques do not lend themselves to understanding the capabilities and limitations of modern image processing methods, particularly those enabled by deep learning. When applying image processing in engineering solutions, a scientist or engineer needs to justify design decisions with clear metrics. By applying the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE), Structural SIMilarity (SSIM) index scores, and peak signal-to-noise ratio (PSNR) to images before and after image processing, we can quantify quality improvements in a meaningful way and determine the lowest recoverable image quality for a given method.
https://arxiv.org/abs/1905.05373
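The full-reference half of this recipe is directly reproducible with scikit-image; BRISQUE, being a no-reference metric, needs a separate implementation (e.g., the OpenCV contrib quality module or the piq package). A minimal sketch:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def quality_report(reference, processed):
    """Full-reference scores before/after processing (BRISQUE, being
    no-reference, needs a separate library such as piq or OpenCV contrib)."""
    psnr = peak_signal_noise_ratio(reference, processed, data_range=255)
    ssim = structural_similarity(reference, processed, data_range=255)
    return {"PSNR_dB": psnr, "SSIM": ssim}

# Toy example: a grayscale image and a noisy version of it.
ref = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
noisy = np.clip(ref + np.random.normal(0, 10, ref.shape), 0, 255).astype(np.uint8)
print(quality_report(ref, noisy))
```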
Simulation is a useful tool in situations where training data for machine learning models is costly to annotate or even hard to acquire. In this work, we propose a reinforcement learning-based method for automatically adjusting the parameters of any (non-differentiable) simulator, thereby controlling the distribution of synthesized data in order to maximize the accuracy of a model trained on that data. In contrast to prior art that hand-crafts these simulation parameters or adjusts only parts of the available parameters, our approach fully controls the simulator with the actual underlying goal of maximizing accuracy, rather than mimicking the real data distribution or randomly generating a large volume of data. We find that our approach (i) quickly converges to the optimal simulation parameters in controlled experiments and (ii) can indeed discover good sets of parameters for an image rendering simulator in actual computer vision applications.
http://arxiv.org/abs/1810.02513
Deep learning using neural networks has provided advances in image style transfer, merging the content of one image (e.g., a photo) with the style of another (e.g., a painting). Our research shows this concept can be extended to analyse the design of streetscapes in relation to health and wellbeing outcomes. An Australian population health survey (n=34,000) was used to identify the spatial distribution of health and wellbeing outcomes, including general health and social capital. For each outcome, the most and least desirable locations formed two domains. Streetscape design was sampled using around 80,000 Google Street View images per domain. Generative adversarial networks translated these images from one domain to the other, preserving the main structure of the input image, but transforming the 'style' from locations where self-reported health was bad to locations where it was good. These translations indicate that areas in Melbourne with good general health are characterised by sufficient green space and compactness of the urban environment, whilst streetscape imagery related to high social capital contained more and wider footpaths, fewer fences and more grass. Beyond identifying relationships, the method is a first step towards computer-generated design interventions that have the potential to improve population health and wellbeing.
http://arxiv.org/abs/1905.06464
State-of-the-art forward-facing monocular visual-inertial odometry algorithms are often brittle in practice, especially whilst dealing with initialisation and motion in directions that render the state unobservable. In such cases, having a reliable complementary odometry algorithm enables robust and resilient flight. Using the common local planarity assumption, we present a fast, dense, and direct frame-to-frame visual-inertial odometry algorithm for downward-facing cameras that minimises a joint cost function involving a homography-based photometric cost and an IMU regularisation term. Via extensive evaluation in a variety of scenarios we demonstrate performance superior to existing state-of-the-art downward-facing odometry algorithms for Micro Aerial Vehicles (MAVs).
http://arxiv.org/abs/1810.08704
LiDAR-camera calibration is a precondition for many heterogeneous systems that fuse data from LiDAR and camera. However, the common field-of-view constraint and the requirement for strict time synchronization make the calibration a challenging problem. In this paper, we propose a novel LiDAR-camera calibration method that eliminates these two constraints. Specifically, we capture a scan of the 3D LiDAR while both the environment and the sensors are stationary, then move the camera to reconstruct the 3D environment from the sequentially obtained images. Finally, we align the 3D visual points to the laser scan using a tightly coupled graph optimization method to calculate the extrinsic parameters between the LiDAR and camera. Under this design, the configuration of the two sensors is free from the common field-of-view constraint, owing to the extended view from the moving camera. We also eliminate the requirement for strict time synchronization, as we only use a single scan of laser data captured while the sensors are stationary. We theoretically derive the minimal observability conditions for our method and prove that calibration accuracy improves as more observations are collected from multiple scattered calibration targets. We validate our method on both a simulation platform and real-world datasets. Experiments show that our method achieves higher accuracy than other comparable methods, in accordance with our theoretical analysis. In addition, the proposed method works not only with chessboards based on plane measurement error, but also with calibration targets based on point measurement error, such as boxes and polygonal boards.
http://arxiv.org/abs/1903.06141
Multi-person pose estimation is a fundamental yet challenging task in computer vision. Both rich context information and spatial information are required to precisely locate the keypoints for all persons in an image. In this paper, a novel Context-and-Spatial Aware Network (CSANet), which integrates both a Context Aware Path and Spatial Aware Path, is proposed to obtain effective features involving both context information and spatial information. Specifically, we design a Context Aware Path with structure supervision strategy and spatial pyramid pooling strategy to enhance the context information. Meanwhile, a Spatial Aware Path is proposed to preserve the spatial information, which also shortens the information propagation path from low-level features to high-level features. On top of these two paths, we employ a Heavy Head Path to further combine and enhance the features effectively. Experimentally, our proposed network outperforms state-of-the-art methods on the COCO keypoint benchmark, which verifies the effectiveness of our method and further corroborates the above proposition.
https://arxiv.org/abs/1905.05355
Ranking-based learning with deep neural networks has been widely used for image cropping. However, the performance of ranking-based methods is often poor, mainly for two reasons: 1) image cropping is a listwise ranking task rather than a pairwise comparison; 2) the rescaling caused by pooling layers and the deformation in view generation damage the performance of composition learning. In this paper, we develop a novel model to overcome these problems. To address the first problem, we formulate image cropping as a listwise ranking problem to find the best view composition. For the second problem, a refined view sampling module (called RoIRefine) is proposed to extract refined feature maps for candidate view generation. Given a series of candidate views, the proposed model learns the Top-1 probability distribution over views and picks the best one. By integrating refined sampling and listwise ranking, the proposed network, called LVRN, achieves state-of-the-art performance in both accuracy and speed.
https://arxiv.org/abs/1905.05352
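The abstract does not give the loss; a standard listwise Top-1 objective in the style of ListNet, shown below as an assumed stand-in, matches the description of learning a Top-1 probability distribution over candidate views.

```python
import torch
import torch.nn.functional as F

def top1_listwise_loss(scores, quality):
    """ListNet-style listwise loss: match the predicted Top-1 probability
    over candidate views (scores) to the target Top-1 distribution derived
    from ground-truth quality annotations. Both tensors: [batch, n_views]."""
    log_p_pred = F.log_softmax(scores, dim=1)
    p_target = F.softmax(quality, dim=1)
    return -(p_target * log_p_pred).sum(dim=1).mean()
```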
Pedestrians and vehicles often share the road in complex inner city traffic. This leads to interactions between the vehicle and pedestrians, with each affecting the other’s motion. In order to create robust methods to reason about pedestrian behavior and to design interfaces of communication between self-driving cars and pedestrians we need to better understand such interactions. In this paper, we present a data-driven approach to implicitly model pedestrians’ interactions with vehicles, to better predict pedestrian behavior. We propose a LSTM model that takes as input the past trajectories of the pedestrian and ego-vehicle, and pedestrian head orientation, and predicts the future positions of the pedestrian. Our experiments based on a real-world, inner city dataset captured with vehicle mounted cameras, show that the usage of such cues improve pedestrian prediction when compared to a baseline that purely uses the past trajectory of the pedestrian.
https://arxiv.org/abs/1905.05350
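A minimal sketch of the kind of model described, with the per-step feature layout (pedestrian position, ego-vehicle position, head orientation) and prediction horizon as assumptions:

```python
import torch
import torch.nn as nn

class PedestrianLSTM(nn.Module):
    """Sketch: encode past pedestrian/vehicle positions and head
    orientation per timestep, then regress future pedestrian positions."""
    def __init__(self, hidden=128, horizon=15):
        super().__init__()
        # per-step input: pedestrian (x, y), ego-vehicle (x, y), head angle
        self.encoder = nn.LSTM(input_size=5, hidden_size=hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, horizon * 2)
        self.horizon = horizon

    def forward(self, past):             # past: [batch, T, 5]
        _, (h, _) = self.encoder(past)
        out = self.decoder(h[-1])        # [batch, horizon * 2]
        return out.view(-1, self.horizon, 2)
```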
Numerous methods for human activity recognition have been proposed in the past two decades. Many of these methods are based on sparse representation, which describes the whole video content by a set of local features. Trajectories, being mid-level sparse features, are capable of describing the motion of an interest-point in 2D space. 2D trajectories might be affected by viewpoint changes, potentially decreasing their accuracy. In this paper, we initially propose and compare different 2D trajectory-based algorithms for human activity recognition. Moreover, we propose a new way of fusing disparity information with 2D trajectory information, without the calculation of 3D reconstruction. The obtained results show a 2.76% improvement when using disparity-augmented trajectories, compared to using the classical 2D trajectory information only. Furthermore, we have also tested our method on the challenging Hollywood 3D dataset, and we have obtained competitive results, at a faster speed.
https://arxiv.org/abs/1905.05344
This paper proposes a fractional-order gradient method for the backward propagation of convolutional neural networks. To overcome the problem that fractional-order gradient methods cannot converge to the true extreme point, a simplified fractional-order gradient method is designed based on Caputo's definition. The parameters within layers are updated by the designed gradient method, while the propagation between layers still uses integer-order gradients; thus the complicated derivatives of composite functions are avoided and the chain rule is kept. By connecting every layer in series and adding loss functions, the proposed convolutional neural networks can be trained smoothly according to various tasks. Finally, practical experiments are carried out to demonstrate the effectiveness of the proposed networks.
https://arxiv.org/abs/1905.05336
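The abstract does not state the update rule. For orientation, a commonly used simplified Caputo-based step from the fractional-order gradient literature, which may differ in detail from the paper's rule, is

$$\theta_{k+1} = \theta_k - \mu \, \frac{f'(\theta_k)}{\Gamma(2-\alpha)} \, \left|\theta_k - \theta_{k-1}\right|^{1-\alpha}, \qquad 0 < \alpha < 1,$$

which keeps only the first term of the series expansion of the Caputo derivative taken relative to the previous iterate. This truncation is what restores convergence to the true extreme point, since the full fractional derivative of $f$ generally does not vanish there.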
We are pleased to dedicate this survey on kernelization of the Vertex Cover problem to Professor Juraj Hromkovič on the occasion of his 60th birthday. The Vertex Cover problem is often referred to as the Drosophila of parameterized complexity. It enjoys a long history, and new and worthy perspectives are often first demonstrated with concrete results on it. This survey discusses several research directions in Vertex Cover kernelization, including the barrier degree of Vertex Cover kernelization. There are reduction rules that kernelize vertices of small degree, including new results in this paper that reduce graphs almost to minimum degree five. Can this process go on forever? What is the minimum vertex-degree barrier for polynomial-time kernelization? Assuming the Exponential-Time Hypothesis, there is a minimum degree barrier. The idea of automated kernelization is also discussed: we report the first experimental results of an AI-guided branching algorithm for Vertex Cover whose logic seems amenable to finding reduction rules that kernelize small-degree vertices. The survey highlights a central open problem in parameterized complexity. Happy Birthday, Juraj!
http://arxiv.org/abs/1811.09429
Retrieving object instances among cluttered scenes efficiently requires compact yet comprehensive regional image representations. Intuitively, object semantics can help build the index that focuses on the most relevant regions. However, due to the lack of bounding-box datasets for objects of interest among retrieval benchmarks, most recent work on regional representations has focused on either uniform or class-agnostic region selection. In this paper, we first fill the void by providing a new dataset of landmark bounding boxes, based on the Google Landmarks dataset, that includes $86k$ images with manually curated boxes from $15k$ unique landmarks. Then, we demonstrate how a trained landmark detector, using our new dataset, can be leveraged to index image regions and improve retrieval accuracy while being much more efficient than existing regional methods. In addition, we introduce a novel regional aggregated selective match kernel (R-ASMK) to effectively combine information from detected regions into an improved holistic image representation. R-ASMK boosts image retrieval accuracy substantially with no dimensionality increase, while even outperforming systems that index image regions independently. Our complete image retrieval system improves upon the previous state-of-the-art by significant margins on the Revisited Oxford and Paris datasets. Code and data available at the project webpage: https://github.com/tensorflow/models/tree/master/research/delf.
http://arxiv.org/abs/1812.01584
We investigate diffusive search on planar networks, motivated by tubular networks in cell biology that contain molecules searching for reaction partners and binding sites. Exact calculation of the diffusive mean first-passage time on a spatial network is used to characterize the typical search time as a function of network connectivity. We find that global structural properties — the total edge length and number of loops — are sufficient to largely determine network exploration times for both synthetic planar networks and for organelle morphologies extracted from living cells. This suggests that network architecture can be designed for efficient search without controlling the precise arrangement of connections. Specifically, increasing the number of loops substantially decreases search times, pointing to a potential physical mechanism for regulating reaction rates within organelle network structures.
https://arxiv.org/abs/1905.05320
Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today's VQA models cannot read! Our paper takes a first step towards addressing this problem. First, we introduce a new "TextVQA" dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Second, we introduce a novel model architecture that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the image. Consequently, we call our approach Look, Read, Reason & Answer (LoRRA). We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset. We find that the gap between human performance and machine performance is significantly larger on TextVQA than on VQA 2.0, suggesting that TextVQA is well-suited to benchmark progress along directions complementary to VQA 2.0.
https://arxiv.org/abs/1904.08920
In this study, we propose the Affine Variational Autoencoder (AVAE), a variant of Variational Autoencoder (VAE) designed to improve robustness by overcoming the inability of VAEs to generalize to distributional shifts in the form of affine perturbations. By optimizing an affine transform to maximize ELBO, the proposed AVAE transforms an input to the training distribution without the need to increase model complexity to model the full distribution of affine transforms. In addition, we introduce a training procedure to create an efficient model by learning a subset of the training distribution, and using the AVAE to improve generalization and robustness to distributional shift at test time. Experiments on affine perturbations demonstrate that the proposed AVAE significantly improves generalization and robustness to distributional shift in the form of affine perturbations without an increase in model complexity.
https://arxiv.org/abs/1905.05300
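A minimal sketch of the test-time correction the abstract describes: optimize a 2x3 affine transform so that the transformed input maximizes the ELBO of a frozen, pre-trained VAE. The `vae.elbo` method and the optimizer settings are assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

def avae_correct(vae, x, steps=50, lr=0.05):
    """Sketch of the AVAE idea for a single image x: [1, C, H, W].
    An affine transform is optimized to map a shifted input back toward
    the training distribution by maximizing the ELBO of a frozen VAE."""
    theta = torch.tensor([[1., 0., 0.], [0., 1., 0.]],
                         requires_grad=True)            # identity init
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        grid = F.affine_grid(theta.unsqueeze(0), x.size(), align_corners=False)
        x_t = F.grid_sample(x, grid, align_corners=False)
        loss = -vae.elbo(x_t)                           # maximize ELBO
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x_t.detach()
```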
Recent work in neural generation has attracted significant interest in controlling the form of text, such as style, persona, and politeness. However, there has been less work on controlling neural text generation for content. This paper introduces the notion of Content Transfer for long-form text generation, where the task is to generate a next sentence in a document that both fits its context and is grounded in a content-rich external textual source such as a news story. Our experiments on Wikipedia data show significant improvements against competitive baselines. As another contribution of this paper, we release a benchmark dataset of 640k Wikipedia referenced sentences paired with the source articles to encourage exploration of this new task.
https://arxiv.org/abs/1905.05293
Skin cancer is one of the most common cancers in the United States. As technological advancements are made, algorithmic diagnosis of skin lesions is becoming more important. In this paper, we develop algorithms for segmenting the actual diseased area of skin in a given image of a skin lesion, and for classifying different types of skin lesions pictured in a given image. The cores of the algorithms are based on persistent homology, an algebraic topology technique that is part of the rising field of Topological Data Analysis (TDA). The segmentation algorithm utilizes a concept similar to persistent homology that captures the robustness of segmented regions. For classification, we design two families of topological features from persistence diagrams, which we refer to as persistence statistics (PS) and persistence curves (PC), and use linear support vector machines as classifiers. We also combine these topological features, PS and PC, with a ResNet-101 model, which we call TopoResNet-101; the results show that PS and PC are effective in two ways: improving classification performance and stabilizing the training process. Although convolutional features are the most important learning targets in CNN models, global information about images may be lost in the training process. Because topological features are extracted globally, our results show that their global nature provides additional information to machine learning models.
http://arxiv.org/abs/1905.08607
Efficient exploration is an unsolved problem in Reinforcement Learning which is usually addressed by reactively rewarding the agent for fortuitously encountering novel situations. This paper introduces an efficient active exploration algorithm, Model-Based Active eXploration (MAX), which uses an ensemble of forward models to plan to observe novel events. This is carried out by optimizing agent behaviour with respect to a measure of novelty derived from the Bayesian perspective of exploration, which is estimated using the disagreement between the futures predicted by the ensemble members. We show empirically that in semi-random discrete environments where directed exploration is critical to make progress, MAX is at least an order of magnitude more efficient than strong baselines. MAX scales to high-dimensional continuous environments where it builds task-agnostic models that can be used for any downstream task.
http://arxiv.org/abs/1810.12162
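As a toy illustration of novelty from ensemble disagreement (the paper derives a principled Bayesian measure from the divergence between ensemble members' predicted futures; the plain predictive variance below is only a crude stand-in):

```python
import numpy as np

def novelty_from_disagreement(ensemble_preds):
    """Sketch: score a transition by the disagreement among an ensemble
    of forward models. ensemble_preds: [n_models, state_dim] array of
    predicted next states; novelty here is the trace of their covariance,
    a simple proxy for the paper's information-theoretic measure."""
    return np.trace(np.cov(ensemble_preds, rowvar=False))

preds = np.random.randn(8, 4)   # 8 models, 4-dim state prediction
print(novelty_from_disagreement(preds))
```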
An important task that domestic robots need to achieve is the recognition of states of food ingredients so they can continue their cooking actions. This project focuses on a fine-tuning algorithm for the VGG (Visual Geometry Group) architecture of deep convolutional neural networks (CNN) for object recognition. The algorithm aims to identify eleven different ingredient cooking states for an image dataset. The original VGG model was adjusted and trained to properly classify the food states. The model was initialized with Imagenet weights. Different experiments were carried out in order to find the model parameters that provided the best performance. The accuracy achieved for the validation set was 76.7% and for the test set 76.6% after changing several parameters of the VGG model.
http://arxiv.org/abs/1905.08606
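A Keras sketch of the described setup: VGG16 initialized with ImageNet weights and a new classification head for the eleven cooking states. The head size, dropout and optimizer settings here are assumptions, not the paper's tuned values.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# VGG16 base initialized with ImageNet weights; top removed so a new
# 11-class head for the cooking states can be attached.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                  # freeze the convolutional base first

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(11, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

A typical fine-tuning schedule would then unfreeze some of the later convolutional blocks and continue training at a lower learning rate.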
We present a navigation system that combines ideas from hierarchical planning and machine learning. The system uses a traditional global planner to compute optimal paths towards a goal, and a deep local trajectory planner and velocity controller to compute motion commands. The latter components of the system adjust the behavior of the robot through attention mechanisms such that it moves towards the goal, avoids obstacles, and respects the space of nearby pedestrians. Both the structure of the proposed deep models and the use of attention mechanisms make the system’s execution interpretable. Our simulation experiments suggest that the proposed architecture outperforms baselines that try to map global plan information and sensor data directly to velocity commands. In comparison to a hand-designed traditional navigation system, the proposed approach showed more consistent performance.
https://arxiv.org/abs/1905.05279
Robotic research is often built on approaches that are motivated by insights from self-examination of how we interface with the world. However, given current theories about human cognition and sensory processing, it is reasonable to assume that the internal workings of the brain are separate from how we interface with the world and ourselves. To amend some of these misconceptions arising from self-examination, this article reviews human visual understanding for cognition and action, specifically manipulation. Our focus is on identifying overarching principles such as the separation into visual processing for action and cognition, hierarchical processing of visual input, and the contextual and anticipatory nature of visual processing for action. We also provide a rudimentary exposition of previous theories about visual understanding that shows how self-examination can lead down the wrong path. Our hope is that the article will provide insights for robotic researchers that help them navigate the path of self-examination, give them an overview of current theories about human visual processing, and provide a source for further relevant reading.
https://arxiv.org/abs/1905.05272
Autonomous vehicles may make wrong decisions due to inaccurate detection and recognition. An intelligent vehicle can therefore combine its own data with that of other vehicles to enhance perceptive ability, and thus improve detection accuracy and driving safety. However, multi-vehicle cooperative perception requires the integration of real-world scenes, and the traffic from raw sensor data exchange far exceeds the bandwidth of existing vehicular networks. To the best of our knowledge, we are the first to study raw-data-level cooperative perception for enhancing the detection ability of self-driving systems. In this work, relying on LiDAR 3D point clouds, we fuse the sensor data collected from different positions and angles of connected vehicles. A point-cloud-based 3D object detection method is proposed to work on a diversity of aligned point clouds. Experimental results on KITTI and our collected dataset show that the proposed system outperforms single-vehicle perception by extending the sensing area, improving detection accuracy and producing augmented results. Most importantly, we demonstrate that it is possible to transmit point cloud data for cooperative perception via existing vehicular network technologies.
https://arxiv.org/abs/1905.05265
In distributional reinforcement learning (RL), the estimated distribution of the value function models both the parametric and intrinsic uncertainties. We propose a novel and efficient exploration method for deep RL that has two components. The first is a decaying schedule to suppress the intrinsic uncertainty. The second is an exploration bonus calculated from the upper quantiles of the learned distribution. In Atari 2600 games, our method outperforms QR-DQN in 12 out of 14 hard games (achieving a 483% average gain in cumulative rewards over QR-DQN across 49 games, with a big win in Venture). We also compared our algorithm with QR-DQN in a challenging 3D driving simulator (CARLA). Results show that our algorithm achieves near-optimal safety rewards twice as fast as QR-DQN.
http://arxiv.org/abs/1905.06125
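A schematic of the two components as action selection, with the exact schedule and bonus forms as illustrative assumptions (the paper defines them precisely):

```python
import numpy as np

def act(quantiles, t, c=1.0, decay=0.99):
    """Sketch: pick an action from estimated return quantiles using a
    decaying schedule plus an optimism bonus from the upper quantiles.
    quantiles: [n_actions, n_quantiles], sorted per action."""
    mean_q = quantiles.mean(axis=1)
    upper = quantiles[:, quantiles.shape[1] // 2:].mean(axis=1)  # upper half
    bonus = upper - mean_q            # width of the optimistic tail
    return int(np.argmax(mean_q + c * (decay ** t) * bonus))
```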
Face obscuration is often needed by law enforcement or mass media outlets to provide privacy protection. Sharing sensitive content where the obscuration or redaction technique may have failed to completely remove all identifiable traces can lead to life-threatening consequences. Hence, it is critical to be able to systematically measure the face obscuration performance of a given technique. In this paper we propose to measure the effectiveness of three obscuration techniques: Gaussian blurring, median blurring, and pixelation. We do so by identifying the redacted faces under two scenarios: classifying an obscured face into a group of identities and comparing the similarity of an obscured face with a clear face. Threat modeling is also considered to provide a vulnerability analysis for each studied obscuration technique. Based on our evaluation, we show that pixelation-based face obscuration approaches are the most effective.
https://arxiv.org/abs/1905.05243
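The three studied techniques are straightforward to reproduce with OpenCV; a sketch applied to a cropped face region (parameter values here are arbitrary examples, not the paper's settings):

```python
import cv2

def obscure(face, method="pixelate", k=15, blocks=8):
    """Apply one of the three obscuration techniques evaluated in the
    paper to a cropped face region (BGR image); k must be odd."""
    if method == "gaussian":
        return cv2.GaussianBlur(face, (k, k), 0)
    if method == "median":
        return cv2.medianBlur(face, k)
    # pixelation: downsample to a blocks x blocks grid, then upsample
    h, w = face.shape[:2]
    small = cv2.resize(face, (blocks, blocks), interpolation=cv2.INTER_LINEAR)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)
```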
Garbage and waste disposal is one of the biggest challenges currently faced by mankind. Proper waste disposal and recycling is a must in any sustainable community, and in many coastal areas there is significant water pollution in the form of floating or submerged garbage. This is called marine debris. Submerged marine debris threatens marine life, and in shallow coastal areas it can also threaten fishing vessels [Iñiguez et al. 2016, Renewable and Sustainable Energy Reviews]. Submerged marine debris typically stays in the environment for a long time (20+ years), and consists of materials that can be recycled, such as metals, plastics, glass, etc. Many of these items should not be disposed of in water bodies, as this negatively affects the environment and human health. This thesis performs a comprehensive evaluation of the use of DNNs for the problem of marine debris detection in FLS images, as well as related problems such as image classification, matching, and detection proposals. We do this on a dataset of 2069 FLS images that we captured with an ARIS Explorer 3000 sensor on marine debris objects lying on the floor of a small water tank. The objects we used to produce this dataset comprise typical household marine debris and distractor marine objects (tires, hooks, valves, etc.), divided into 10 classes plus a background class. Our results show that for the evaluated tasks, DNNs are superior to the corresponding state of the art, with particularly large gains for the matching and detection proposal tasks. We also study the effect of sample complexity and object size on many tasks, which is valuable information for practitioners. We expect that our results will advance the objective of using Autonomous Underwater Vehicles to automatically survey, detect and collect marine debris from underwater environments.
https://arxiv.org/abs/1905.05241
Convolutional neural networks (CNNs) have emerged as the state of the art in multiple vision tasks, including depth estimation. However, memory and computing power requirements remain challenges to be tackled in these models. Monocular depth estimation has significant use in robotics and virtual reality, where deployment on low-end devices is required. Training a small model from scratch results in a significant drop in accuracy, and such a model does not benefit from pre-trained large models. Motivated by the model pruning literature, we propose a lightweight monocular depth model obtained from a large trained model. This is achieved by removing the least important features with a novel joint end-to-end filter pruning scheme. We propose to learn a binary mask for each filter to decide whether to drop the filter. These masks are trained jointly to exploit relations between filters at different layers as well as redundancy within the same layer. We show that we can achieve around a 5x compression rate with a small drop in accuracy on the KITTI driving dataset. We also show that masking can improve accuracy over the baseline with fewer parameters, even without enforcing a compression loss.
https://arxiv.org/abs/1905.05212
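A sketch of the per-filter binary-mask idea, using a straight-through estimator so the hard 0/1 mask stays trainable; the exact relaxation and compression loss in the paper may differ.

```python
import torch
import torch.nn as nn

class MaskedConv(nn.Module):
    """Sketch of joint filter pruning via a learnable per-filter mask.
    A straight-through estimator makes the hard 0/1 mask trainable;
    a sparsity penalty on the soft mask acts as a compression loss."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2)
        self.logits = nn.Parameter(torch.zeros(c_out))

    def forward(self, x):
        soft = torch.sigmoid(self.logits)
        hard = (soft > 0.5).float()
        mask = hard + soft - soft.detach()   # straight-through gradient
        return self.conv(x) * mask.view(1, -1, 1, 1)

    def sparsity_loss(self):
        return torch.sigmoid(self.logits).mean()
```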
We introduce Pixel-aligned Implicit Function (PIFu), a highly effective implicit representation that locally aligns pixels of 2D images with the global context of their corresponding 3D object. Using PIFu, we propose an end-to-end deep learning method for digitizing highly detailed clothed humans that can infer both 3D surface and texture from a single image, and optionally, multiple input images. Highly intricate shapes, such as hairstyles and clothing, as well as their variations and deformations, can be digitized in a unified way. Compared to existing representations used for 3D deep learning, PIFu can produce high-resolution surfaces including largely unseen regions such as the back of a person. In particular, it is memory efficient, unlike the voxel representation; can handle arbitrary topology; and produces a surface spatially aligned with the input image. Furthermore, while previous techniques are designed to process either a single image or multiple views, PIFu extends naturally to an arbitrary number of views. We demonstrate high-resolution and robust reconstructions on real-world images from the DeepFashion dataset, which contains a variety of challenging clothing types. Our method achieves state-of-the-art performance on a public benchmark and outperforms the prior work for clothed human digitization from a single image.
http://arxiv.org/abs/1905.05172
This paper shows that when applying machine learning to digital zoom for photography, it is beneficial to use real, RAW sensor data for training. Existing learning-based super-resolution methods do not use real sensor data, instead operating on RGB images. In practice, these approaches result in loss of detail and accuracy in their digitally zoomed output when zooming in on distant image regions. We also show that synthesizing sensor data by resampling high-resolution RGB images is an oversimplified approximation of real sensor data and noise, resulting in worse image quality. The key barrier to using real sensor data for training is that ground truth high-resolution imagery is missing. We show how to obtain the ground-truth data with optically zoomed images and contribute a dataset, SR-RAW, for real-world computational zoom. We use SR-RAW to train a deep network with a novel contextual bilateral loss (CoBi) that delivers critical robustness to mild misalignment in input-output image pairs. The trained network achieves state-of-the-art performance in 4X and 8X computational zoom.
http://arxiv.org/abs/1905.05169
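For reference, the contextual bilateral loss as described in the paper pairs each source feature $p_i$ with its best match among target features $q_j$ under a combined feature and spatial distance (reconstructed here from the published description; details may differ):

$$\mathrm{CoBi}(P, Q) = \frac{1}{N} \sum_{i=1}^{N} \min_{j} \big( \mathbb{D}(p_i, q_j) + w_s \, \mathbb{D}'(p_i, q_j) \big),$$

where $\mathbb{D}$ is a feature distance (e.g., cosine), $\mathbb{D}'$ penalizes the spatial offset between the two features, and $w_s$ trades feature similarity against spatial alignment. The spatial term is what provides the tolerance to mild misalignment in the input-output pairs.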
We consider the problem of online adaptation of a neural network designed to represent vehicle dynamics. The neural network model is intended to be used by an MPC control law to autonomously control the vehicle. This problem is challenging because both the input and target distributions are non-stationary, and naive approaches to online adaptation result in catastrophic forgetting, which can in turn lead to controller failures. We present a novel online learning method, which combines the pseudo-rehearsal method with locally weighted projection regression. We demonstrate the effectiveness of the resulting Locally Weighted Projection Regression Pseudo-Rehearsal (LW-PR$^2$) method in simulation and on a large real world dataset collected with a 1/5 scale autonomous vehicle.
http://arxiv.org/abs/1905.05162
This paper presents the algorithms and system architecture of an autonomous racecar. The introduced vehicle is powered by a software stack designed for robustness, reliability, and extensibility. In order to autonomously race around a previously unknown track, the proposed solution combines state-of-the-art techniques from different fields of robotics. Specifically, perception, estimation, and control are incorporated into one high-performance autonomous racecar. This complex robotic system, developed by AMZ Driverless and ETH Zurich, finished 1st overall at each competition we attended: Formula Student Germany 2017, Formula Student Italy 2018 and Formula Student Germany 2018. We discuss the findings and learnings from these competitions and present an experimental evaluation of each module of our solution.
http://arxiv.org/abs/1905.05150