Evolving Internet-of-Things (IoT) applications often require sensor-based indoor tracking and positioning, whose performance improves significantly when the type of the surrounding indoor environment is identified. This identification is of high importance since it leads to higher localization accuracy. This paper presents a novel method based on a cascaded two-stage machine learning approach for highly accurate and robust localization in indoor environments using adaptive selection and combination of RF features. In the proposed method, machine learning is first used to identify the type of the surrounding indoor environment. In the second stage, machine learning is employed to identify the selection and combination of RF features that yields the highest localization accuracy. The analysis is based on the k-Nearest Neighbor (k-NN) algorithm applied to a real dataset generated from practical measurements of the RF signal in realistic indoor environments. Received Signal Strength, Channel Transfer Function, and Frequency Coherence Function are the primary RF features explored and combined. Numerical investigations demonstrate that prediction based on the concatenation of the primary RF features is significantly enhanced, with localization accuracy improving by at least 50% and up to more than 70%.
http://arxiv.org/abs/1905.01000
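A minimal sketch of the two-stage idea, assuming hypothetical feature dimensions and labels (the paper's measurement dataset is not public): one k-NN classifier first predicts the environment type, then a per-environment k-NN fitted on the concatenation of the primary RF features predicts the location.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 200
rss = rng.normal(size=(n, 4))     # Received Signal Strength features (hypothetical dims)
ctf = rng.normal(size=(n, 64))    # Channel Transfer Function features
fcf = rng.normal(size=(n, 64))    # Frequency Coherence Function features
env = rng.integers(0, 3, size=n)  # environment-type labels (stage 1)
loc = rng.integers(0, 10, size=n) # location labels (stage 2)

X = np.hstack([rss, ctf, fcf])    # concatenation of the primary RF features

env_clf = KNeighborsClassifier(n_neighbors=5).fit(X, env)  # stage 1: environment type
loc_clf = {e: KNeighborsClassifier(n_neighbors=5).fit(X[env == e], loc[env == e])
           for e in np.unique(env)}                        # stage 2: per-environment k-NN

x_new = X[:1]
e_hat = env_clf.predict(x_new)[0]
print("environment:", e_hat, "location:", loc_clf[e_hat].predict(x_new)[0])
```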
Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.
http://arxiv.org/abs/1806.01260
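A minimal PyTorch sketch of the per-pixel minimum reprojection loss, with plain L1 photometric error standing in for the paper's SSIM+L1 term and assumed tensor shapes: taking the per-pixel minimum over source frames lets a pixel occluded in one source frame still be explained by another.

```python
import torch

def min_reprojection_loss(target, warped_sources):
    """target: (B, 3, H, W); warped_sources: list of (B, 3, H, W) source frames
    warped into the target view using predicted depth and pose."""
    errors = [torch.mean(torch.abs(target - w), dim=1, keepdim=True)  # L1 per pixel
              for w in warped_sources]
    per_pixel_min, _ = torch.min(torch.cat(errors, dim=1), dim=1)     # min over sources
    return per_pixel_min.mean()

target = torch.rand(2, 3, 32, 32)
loss = min_reprojection_loss(target, [torch.rand(2, 3, 32, 32) for _ in range(2)])
print(loss.item())
```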
Deep learning is known to be data-hungry, which hinders its application in many areas of science when datasets are small. Here, we propose to use transfer learning methods to migrate knowledge between different physical scenarios and significantly improve the prediction accuracy of artificial neural networks trained on a small dataset. This method can help reduce the demand for expensive data by making use of additional inexpensive data. First, we demonstrate that in predicting the transmission from multilayer photonic films, the relative error rate is reduced by 46.8% (26.5%) when the source data come from 10-layer (8-layer) films and the target data come from 8-layer (10-layer) films. Second, we show that the relative error rate is decreased by 22% when knowledge is transferred between two very different physical scenarios: transmission from multilayer films and scattering from multilayer nanoparticles. Finally, we propose a multi-task learning method to improve the performance of different physical scenarios simultaneously, where each task has only a small dataset.
http://arxiv.org/abs/1809.00972
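A minimal PyTorch sketch of the general transfer-learning recipe (hypothetical network sizes and data; not the authors' exact architecture): pretrain an MLP on the large source task, copy its shared hidden layer into a network for the small target task, and fine-tune.

```python
import torch
import torch.nn as nn

def make_mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, out_dim))

source_net = make_mlp(10, 200)   # 10 layer thicknesses -> transmission spectrum
# ... assume source_net is trained here on the large source dataset ...

target_net = make_mlp(8, 200)    # 8-layer films: different input size, shared middle
target_net[2].load_state_dict(source_net[2].state_dict())  # transfer hidden layer
opt = torch.optim.Adam(target_net.parameters(), lr=1e-4)   # fine-tune all weights

x, y = torch.rand(32, 8), torch.rand(32, 200)              # small target dataset (dummy)
loss = nn.functional.mse_loss(target_net(x), y)
opt.zero_grad(); loss.backward(); opt.step()
```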
In this work, we propose a novel framework named Region-Aware Network (RANet) for human pose estimation that is robust to confusing factors such as heavy occlusion, nearby persons, and symmetric appearance. Specifically, our method addresses three key aspects of human pose estimation: data augmentation, feature learning, and prediction fusion. First, we propose Parsing-based Data Augmentation (PDA) to generate abundant data with confusing textures. Second, we propose a Feature Pyramid Stem (FPS) module to learn better low-level features in the lower stages, and incorporate an Effective Region Extraction (ERE) module to learn better human-body-specific features. Third, we introduce Cascade Voting Fusion (CVS) to explicitly leverage visibility, exclude deflected predictions, and produce the final accurate pose estimates. Experimental results demonstrate the superiority of our method over the state of the art, with significant improvements on two popular benchmark datasets, MPII and LSP.
http://arxiv.org/abs/1905.00996
We present a modular approach for learning policies for navigation over long planning horizons from language input. Our hierarchical policy operates at multiple timescales, where the higher-level master policy proposes subgoals to be executed by specialized sub-policies. Our choice of subgoals is compositional and semantic, i.e. they can be sequentially combined in arbitrary orderings, and assume human-interpretable descriptions (e.g. ‘exit room’, ‘find kitchen’, ‘find refrigerator’, etc.). We use imitation learning to warm-start policies at each level of the hierarchy, dramatically increasing sample efficiency, followed by reinforcement learning. Independent reinforcement learning at each level of hierarchy enables sub-policies to adapt to consequences of their actions and recover from errors. Subsequent joint hierarchical training enables the master policy to adapt to the sub-policies. On the challenging EQA (Das et al., 2018) benchmark in House3D (Wu et al., 2018), requiring navigating diverse realistic indoor environments, our approach outperforms prior work by a significant margin, both in terms of navigation and question answering.
http://arxiv.org/abs/1810.11181
Odor identification is important in a wide range of industries, including cosmetics, food, beverages, and medical diagnosis. Odors can be detected through an array of gas sensors arranged as an electronic nose, where a data acquisition module converts the sensor signals into a standard output for analysis. An identification system is then required to recognize odors from these signals. This paper presents the results of an automated odor identification process implemented with a fuzzy system and an electronic nose. First, an electronic nose prototype is manufactured to detect organic compound vapors using an array of five tin dioxide gas sensors, with an Arduino Uno board serving as the data acquisition section. Second, an intelligent module with a fuzzy system identifies the signals received from the electronic nose. The proposed solution identifies odors using a personal computer. Results show acceptable precision.
http://arxiv.org/abs/1905.00991
As Arctic conditions rapidly change, human activity in the Arctic will continue to increase and so will the need for high-resolution observations of sea ice. While satellite imagery can provide high spatial resolution, it is temporally sparse and significant ice deformation can occur between observations. This makes it difficult to apply feature tracking or image correlation techniques that require persistent features to exist between images. With this in mind, we propose a technique based on optimal transport, which is commonly used to measure differences between probability distributions. When little ice enters or leaves the image scene, we show that regularized optimal transport can be used to quantitatively estimate ice deformation. We discuss the motivation for our approach and describe efficient computational implementations. Results are provided on a combination of synthetic and MODIS imagery to demonstrate the ability of our approach to estimate dynamic properties at the original image resolution.
http://arxiv.org/abs/1905.00989
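A minimal NumPy sketch of entropically regularized optimal transport via Sinkhorn iterations, with hypothetical 1-D mass distributions standing in for the two images; the transport plan encodes how mass moves between scenes, from which a displacement (deformation) estimate can be read off.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iter=500):
    """a, b: source/target mass distributions (each sums to 1); C: ground cost matrix."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                # alternating scaling updates
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan P

x = np.linspace(0, 1, 50)
C = (x[:, None] - x[None, :]) ** 2       # squared-distance ground cost
a = np.exp(-((x - 0.3) ** 2) / 0.01); a /= a.sum()   # "before" scene
b = np.exp(-((x - 0.6) ** 2) / 0.02); b /= b.sum()   # "after" scene
P = sinkhorn(a, b, C)
displacement = (P * (x[None, :] - x[:, None])).sum(axis=1) / a  # mean displacement
print(displacement[20:25])
```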
Autonomous cars have to navigate in dynamic environments that can be full of uncertainties. The uncertainties can come from sensor limitations such as occlusions and limited sensor range, from probabilistic prediction of other road participants, or from unknown social behavior in a new area. To drive safely and efficiently in the presence of these uncertainties, the decision-making and planning modules of autonomous cars should intelligently utilize all available information and appropriately tackle the uncertainties so that proper driving strategies can be generated. In this paper, we propose a social perception scheme which treats all road participants as distributed sensors in a sensor network. By observing individual behaviors as well as group behaviors, uncertainties of all three types can be updated uniformly in a belief space. The updated beliefs from the social perception are then explicitly incorporated into a probabilistic planning framework based on Model Predictive Control (MPC). The cost function of the MPC is learned via inverse reinforcement learning (IRL). Such an integrated probabilistic planning module with socially enhanced perception enables autonomous vehicles to generate behaviors which are defensive but not overly conservative, and socially compatible. The effectiveness of the proposed framework is verified in simulation on a representative scenario with sensor occlusions.
http://arxiv.org/abs/1905.00988
Many prediction problems, such as those that arise in the context of robotics, have a simplifying underlying structure that, if known, could accelerate learning. In this paper, we present a strategy for learning a set of neural network modules that can be combined in different ways. We train different modular structures on a set of related tasks and generalize to new tasks by composing the learned modules in new ways. By reusing modules to generalize we achieve combinatorial generalization, akin to the “infinite use of finite means” displayed in language. Finally, we show this improves performance in two robotics-related problems.
http://arxiv.org/abs/1806.10166
Motivation: Biomedical event detection is fundamental for information extraction in molecular biology and biomedical research. The detected events form the central basis for comprehensive biomedical knowledge fusion, facilitating the digestion of the massive information influx from the literature. Limited by feature context, existing event detection models are mostly applicable to a single task. A general and scalable computational model is desired for biomedical knowledge management. Results: We propose a bottom-up detection framework to identify events from recognized arguments. To capture the relations between arguments, we trained a bi-directional Long Short-Term Memory (LSTM) network to model their context embedding. Leveraging the compositional attributes, we further derived candidate samples for training event classifiers. We built our models on datasets from the BioNLP Shared Task for evaluation. Our method achieved average F-scores of 0.81 and 0.92 on the BioNLPST-BGI and BioNLPST-BB datasets, respectively. Compared with 7 state-of-the-art methods, our method nearly doubled the best existing F-score (0.92 vs 0.56) on the BioNLPST-BB dataset. Case studies were conducted to reveal the underlying reasons. Availability: https://github.com/cskyan/evntextrc
http://arxiv.org/abs/1905.00982
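A minimal PyTorch sketch (hypothetical dimensions, not the authors' exact model) of a bi-directional LSTM producing context embeddings for recognized arguments, with a linear head classifying an argument pair into event types.

```python
import torch
import torch.nn as nn

class BiLSTMEventScorer(nn.Module):
    def __init__(self, vocab=5000, emb=100, hidden=128, n_event_types=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(4 * hidden, n_event_types)  # two arguments, 2*hidden each

    def forward(self, tokens, arg1_idx, arg2_idx):
        ctx, _ = self.lstm(self.embed(tokens))            # (B, T, 2*hidden) context
        batch = torch.arange(tokens.size(0))
        pair = torch.cat([ctx[batch, arg1_idx], ctx[batch, arg2_idx]], dim=-1)
        return self.head(pair)                            # event-type logits

model = BiLSTMEventScorer()
logits = model(torch.randint(0, 5000, (4, 30)),
               torch.tensor([2, 5, 1, 7]), torch.tensor([10, 12, 3, 20]))
print(logits.shape)  # torch.Size([4, 10])
```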
The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, it inherently ignores some subtleties of the real world, in particular how humans vary in the way they describe a scene. Some will describe the weather and features within it, others will use a holistic descriptor like ‘park’, and others still will use unique identifiers such as cities or names. In this paper, we undertake the task of automatic city classification, asking whether we can recognize a city from a set of sound scenes. In this problem, each city has recordings from multiple scenes. We test a series of methods for this novel task. A simple convolutional neural network (CNN) achieves an accuracy of 50%, which is below the acoustic scene classification baseline of the DCASE 2018 ASC challenge (on the same data); with a simple adaptation of the class labels that pairs city labels with grouped scenes, accuracy increases to 52%, closer to the simpler scene classification task. Finally, we formulate the problem in a multi-task learning framework and achieve an accuracy of 56%, outperforming the aforementioned approaches.
http://arxiv.org/abs/1905.00979
Deep reinforcement learning algorithms have been successfully applied to a range of challenging control tasks. However, these methods typically struggle with achieving effective exploration and are extremely sensitive to the choice of hyperparameters. One reason is that most approaches use a noisy version of their operating policy to explore - thereby limiting the range of exploration. In this paper, we introduce Collaborative Evolutionary Reinforcement Learning (CERL), a scalable framework that comprises a portfolio of policies that simultaneously explore and exploit diverse regions of the solution space. A collection of learners - typically proven algorithms like TD3 - optimize over varying time-horizons leading to this diverse portfolio. All learners contribute to and use a shared replay buffer to achieve greater sample efficiency. Computational resources are dynamically distributed to favor the best learners as a form of online algorithm selection. Neuroevolution binds this entire process to generate a single emergent learner that exceeds the capabilities of any individual learner. Experiments in a range of continuous control benchmarks demonstrate that the emergent learner significantly outperforms its composite learners while remaining overall more sample-efficient - notably solving the Mujoco Humanoid benchmark where all of its composite learners (TD3) fail entirely in isolation.
http://arxiv.org/abs/1905.00976
Capsule networks have shown various advantages over convolutional neural networks (CNNs). They preserve more precise spatial information than CNNs, use equivariance instead of invariance during inference, and have high potential as a new effective tool for visual tasks. However, current capsule networks underperform CNNs on datasets with backgrounds and complex target objects, and they lack a universal and efficient regularization method. We identify a main reason for this performance gap: the conflict between the information sensitivity of capsule networks and the unreasonably high activation values of capsules in the primary capsule layer. Accordingly, we propose a practical improvement that restrains the activation values of capsules in the primary capsule layer, suppressing non-informative capsules and highlighting discriminative ones. In experiments, the method achieves better performance on various mainstream datasets. In addition, the proposed improvement can be seen as a suitable, simple, and efficient regularization method for general use in capsule networks.
http://arxiv.org/abs/1903.10588
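A minimal PyTorch sketch, under the assumption that the restraint can be expressed as an auxiliary penalty on primary-capsule activation values (the authors' exact formulation may differ):

```python
import torch

def primary_capsule_penalty(primary_caps, weight=1e-3):
    """primary_caps: (B, num_capsules, capsule_dim) pose vectors; the capsule
    activation is taken to be the vector norm."""
    activations = primary_caps.norm(dim=-1)   # (B, num_capsules)
    return weight * activations.mean()        # penalize high average activation

caps = torch.rand(8, 1152, 8, requires_grad=True)  # CapsNet-style primary capsules
loss = primary_capsule_penalty(caps)  # added to the usual margin/classification loss
loss.backward()
print(caps.grad.abs().mean())
```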
State-of-the-art visual relation detection methods rely on features extracted from RGB images, including objects’ 2D positions. In this paper, we argue that the 3D positions of objects in space provide additional valuable information about object relations. This information helps not only to detect spatial relations, such as “standing behind”, but also non-spatial relations, such as “holding”. Since 3D information of a scene is not easily accessible, we propose incorporating a pre-trained RGB-to-Depth model within visual relation detection frameworks. We discuss different feature extraction strategies from depth maps and show their critical role in relation detection. Our experiments confirm that the performance of state-of-the-art visual relation detection approaches can be significantly improved by utilizing depth map information.
http://arxiv.org/abs/1905.00966
Air traffic control is a real-time, safety-critical decision-making process in highly dynamic and stochastic environments. In today’s aviation practice, a human air traffic controller monitors and directs the many aircraft flying through their designated airspace sector. With the fast-growing air traffic complexity in traditional (commercial airliner) and low-altitude (drone and eVTOL aircraft) airspace, an autonomous air traffic control system is needed to accommodate high-density air traffic and ensure safe separation between aircraft. We propose a deep multi-agent reinforcement learning framework that identifies and resolves conflicts between aircraft in a high-density, stochastic, and dynamic en-route sector with multiple intersections and merging points. The proposed framework utilizes an actor-critic model, A2C, that incorporates the loss function from Proximal Policy Optimization (PPO) to help stabilize the learning process. In addition, we use a centralized-learning, decentralized-execution scheme in which one neural network is learned and shared by all agents in the environment. We show that our framework is both scalable and efficient for a large number of incoming aircraft, achieving extremely high traffic throughput with safety guarantees. We evaluate our model via extensive simulations in the BlueSky environment. Results show that our framework resolves 99.97% and 100% of all conflicts at intersections and merging points, respectively, in extremely high-density air traffic scenarios.
http://arxiv.org/abs/1905.01303
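A minimal PyTorch sketch of the PPO clipped surrogate loss mentioned above (generic PPO, not the authors' exact A2C integration):

```python
import torch

def ppo_policy_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """log_probs: log pi(a|s) under the current policy; old_log_probs: under the
    behavior policy; advantages: estimated advantages A(s, a)."""
    ratio = torch.exp(log_probs - old_log_probs)          # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

lp = torch.randn(64, requires_grad=True)
loss = ppo_policy_loss(lp, lp.detach() + 0.1 * torch.randn(64), torch.randn(64))
loss.backward()
print(loss.item())
```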
Deep neural network (DNN) accelerators with improved energy and delay are desirable for meeting the requirements of hardware targeted for IoT and edge computing systems. Convolutional neural networks (CoNNs) belong to one of the most popular types of DNN architectures. This paper presents the design and evaluation of an accelerator for CoNNs. The system-level architecture is based on mixed-signal, cellular neural networks (CeNNs). Specifically, we present (i) the implementation of different layers, including convolution, ReLU, and pooling, in a CoNN using CeNN, (ii) modified CoNN structures with CeNN-friendly layers to reduce computational overheads typically associated with a CoNN, (iii) a mixed-signal CeNN architecture that performs CoNN computations in the analog and mixed signal domain, and (iv) design space exploration that identifies what CeNN-based algorithm and architectural features fare best compared to existing algorithms and architectures when evaluated over common datasets – MNIST and CIFAR-10. Notably, the proposed approach can lead to 8.7$\times$ improvements in energy-delay product (EDP) per digit classification for the MNIST dataset at iso-accuracy when compared with the state-of-the-art DNN engine, while our approach could offer 4.3$\times$ improvements in EDP when compared to other network implementations for the CIFAR-10 dataset.
http://arxiv.org/abs/1811.02636
Fake news and misinformation have been increasingly used to manipulate popular opinion and influence political processes. To better understand fake news, how they are propagated, and how to counter their effect, it is necessary to first identify them. Recently, approaches have been proposed to automatically classify articles as fake based on their content. An important challenge for these approaches comes from the dynamic nature of news: as new political events are covered, topics and discourse constantly change and thus, a classifier trained using content from articles published at a given time is likely to become ineffective in the future. To address this challenge, we propose a topic-agnostic (TAG) classification strategy that uses linguistic and web-markup features to identify fake news pages. We report experimental results using multiple data sets which show that our approach attains high accuracy in the identification of fake news, even as topics evolve over time.
http://arxiv.org/abs/1905.00957
We present the perceptor gradients algorithm – a novel approach to learning symbolic representations based on the idea of decomposing an agent’s policy into i) a perceptor network extracting symbols from raw observation data and ii) a task encoding program which maps the input symbols to output actions. We show that the proposed algorithm is able to learn representations that can be directly fed into a Linear-Quadratic Regulator (LQR) or a general purpose A* planner. Our experimental results confirm that the perceptor gradients algorithm is able to efficiently learn transferable symbolic representations as well as generate new observations according to a semantically meaningful specification.
http://arxiv.org/abs/1905.00956
Understanding and interpreting the decisions made by deep learning models is valuable in many domains. In computer vision, computing heatmaps from a deep network is a popular approach for visualizing and understanding deep networks. However, heatmaps that do not correlate with the network may mislead humans, so the ability of a heatmap to provide a faithful explanation of the underlying deep network is crucial. In this paper, we propose I-GOS, which optimizes for a heatmap such that the classification scores on the masked image maximally decrease. The main novelty of the approach is to compute descent directions based on integrated gradients instead of the normal gradient, which avoids local optima and speeds up convergence. Compared with previous approaches, our method can flexibly compute heatmaps at any resolution to suit different user needs. Extensive experiments on several benchmark datasets show that the heatmaps produced by our approach correlate better with the decision of the underlying deep network than those of other state-of-the-art approaches.
http://arxiv.org/abs/1905.00954
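A minimal PyTorch sketch of integrated gradients, the ingredient I-GOS uses in place of the plain gradient when computing descent directions (hypothetical model and baseline):

```python
import torch

def integrated_gradients(model, x, baseline, target_class, steps=20):
    """Approximate IG by averaging gradients along the straight path baseline -> x."""
    total = torch.zeros_like(x)
    for k in range(1, steps + 1):
        xk = baseline + (k / steps) * (x - baseline)
        xk.requires_grad_(True)
        score = model(xk)[0, target_class]
        grad, = torch.autograd.grad(score, xk)
        total += grad
    return (x - baseline) * total / steps   # attribution map

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x = torch.rand(1, 3, 8, 8)
ig = integrated_gradients(model, x, torch.zeros_like(x), target_class=3)
print(ig.shape)
```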
As an instance-level recognition problem, person re-identification (ReID) relies on discriminative features, which not only capture different spatial scales but also encapsulate an arbitrary combination of multiple scales. We call these features of both homogeneous and heterogeneous scales omni-scale features. In this paper, a novel deep CNN is designed, termed Omni-Scale Network (OSNet), for omni-scale feature learning in ReID. This is achieved by designing a residual block composed of multiple convolutional feature streams, each detecting features at a certain scale. Importantly, a novel unified aggregation gate is introduced to dynamically fuse multi-scale features with input-dependent channel-wise weights. To efficiently learn spatial-channel correlations and avoid overfitting, the building block uses both pointwise and depthwise convolutions. By stacking such blocks layer-by-layer, our OSNet is extremely lightweight and can be trained from scratch on existing ReID benchmarks. Despite its small model size, our OSNet achieves state-of-the-art performance on six person-ReID datasets.
http://arxiv.org/abs/1905.00953
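A simplified PyTorch sketch of the unified-aggregation-gate idea, with dilated convolutions standing in for OSNet's multi-depth streams (an assumption for brevity): each stream's output is weighted channel-wise by an input-dependent gate before summation.

```python
import torch
import torch.nn as nn

class GatedMultiScaleBlock(nn.Module):
    def __init__(self, channels, n_streams=3):
        super().__init__()
        self.streams = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)  # scale via dilation
            for d in range(1, n_streams + 1))
        self.gate = nn.Sequential(                                   # shared gate network
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, x):
        out = 0
        for stream in self.streams:
            f = stream(x)
            w = self.gate(f).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1) channel weights
            out = out + w * f                             # input-dependent fusion
        return out

block = GatedMultiScaleBlock(16)
print(block(torch.rand(2, 16, 32, 32)).shape)
```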
Model interpretability and systematic, targeted model adaptation present central challenges in deep learning. In the domain of intuitive physics, we study the task of visually predicting stability of block towers with the goal of understanding and influencing the model’s reasoning. Our contributions are two-fold. Firstly, we introduce neural stethoscopes as a framework for quantifying the degree of importance of specific factors of influence in deep networks as well as for actively promoting and suppressing information as appropriate. In doing so, we unify concepts from multitask learning as well as training with auxiliary and adversarial losses. Secondly, we deploy the stethoscope framework to provide an in-depth analysis of a state-of-the-art deep neural network for stability prediction, specifically examining its physical reasoning. We show that the baseline model is susceptible to being misled by incorrect visual cues. This leads to a performance breakdown to the level of random guessing when training on scenarios where visual cues are inversely correlated with stability. Using stethoscopes to promote meaningful feature extraction increases performance from 51% to 90% prediction accuracy. Conversely, training on an easy dataset where visual cues are positively correlated with stability, the baseline model learns a bias leading to poor performance on a harder dataset. Using an adversarial stethoscope, the network is successfully de-biased, leading to a performance increase from 66% to 88%.
http://arxiv.org/abs/1806.05502
Many systems for autonomous vehicle navigation rely on lane detection. Traditional algorithms usually estimate only the position of the lanes on the road, but an autonomous control system may also need to know whether a lane marking can be crossed and what portion of space inside the lane is free from obstacles, in order to make safer control decisions. On the other hand, free space detection algorithms only detect navigable areas, without information about lanes. State-of-the-art algorithms use CNNs for both tasks, with significant consumption of computing resources. We propose a novel approach that estimates the free space inside each lane with a single CNN. Additionally, at the cost of only a small increase in GPU RAM, we infer the road type, which is useful for path planning. To achieve this, we train a multi-task CNN and then further process the network's output to extract polygons that can be effectively used in navigation control. Finally, we provide a computationally efficient implementation, based on ROS, that can be executed in real time. Our code and trained models are available online.
http://arxiv.org/abs/1905.00941
When dealing with material classification in baggage at airports, Dual-Energy Computed Tomography (DECT) allows characterization of any given material with coefficients based on two attenuative effects: Compton scattering and photoelectric absorption. However, straightforward projection-domain decomposition methods for this characterization often yield poor reconstructions due to the high dynamic range of material properties encountered in an actual luggage scan. Hence, for better reconstruction quality under a timing constraint, we propose a splitting-based, GPU-accelerated, statistical DECT reconstruction algorithm. Compared to prior art, our main contribution lies in the significant acceleration made possible by separating reconstruction and decomposition within an ADMM framework. Experimental results, on both synthetic and real-world baggage phantoms, demonstrate a significant reduction in time required for convergence.
http://arxiv.org/abs/1905.00934
This paper presents a new framework for jointly enhancing the resolution and the dynamic range of an image, i.e., simultaneous super-resolution (SR) and high dynamic range imaging (HDRI), based on a convolutional neural network (CNN). Exploiting the common trends of both tasks, we train a CNN for joint HDRI and SR by focusing on the reconstruction of high-frequency details. Specifically, the high-frequency component in our work is the reflectance component according to the Retinex-based image decomposition, and only the reflectance component is manipulated by the CNN while the other component (illumination) is processed in a conventional way. In training the CNN, we devise an appropriate loss function that contributes to the naturalness of the resulting images. Experiments show that our algorithm outperforms the cascade implementation of CNN-based SR and HDRI.
http://arxiv.org/abs/1905.00933
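A minimal NumPy/SciPy sketch of a Retinex-style decomposition of the kind described, assuming a single-channel image and Gaussian smoothing as the illumination estimator: only the reflectance component would be passed to the CNN.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_decompose(image, sigma=15.0, eps=1e-6):
    illumination = gaussian_filter(image, sigma)   # smooth, low-frequency component
    reflectance = image / (illumination + eps)     # high-frequency detail component
    return illumination, reflectance

img = np.random.rand(64, 64).astype(np.float32)
L, R = retinex_decompose(img)
recon = L * R                                      # decomposition is invertible
print(np.allclose(recon, img, atol=1e-4))
```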
Generative adversarial networks (GANs) are a framework that learns a generative distribution through adversarial training. Recently, their class-conditional extensions (e.g., conditional GAN (cGAN) and auxiliary classifier GAN (AC-GAN)) have attracted much attention owing to their ability to learn the disentangled representations and to improve the training stability. However, their training requires the availability of large-scale accurate class-labeled data, which are often laborious or impractical to collect in a real-world scenario. To remedy this, we propose a novel family of GANs called label-noise robust GANs (rGANs), which, by incorporating a noise transition model, can learn a clean label conditional generative distribution even when training labels are noisy. In particular, we propose two variants: rAC-GAN, which is a bridging model between AC-GAN and the label-noise robust classification model, and rcGAN, which is an extension of cGAN and solves this problem with no reliance on any classifier. In addition to providing the theoretical background, we demonstrate the effectiveness of our models through extensive experiments using diverse GAN configurations, various noise settings, and multiple evaluation metrics (in which we tested 402 conditions in total). Our code is available at https://github.com/takuhirok/rGAN/.
http://arxiv.org/abs/1811.11165
In this paper, we introduce an approach for leveraging available data across multiple locales sharing the same language to 1) improve domain classification model accuracy in Spoken Language Understanding and the user experience, even when new locales lack sufficient data, and 2) reduce the cost of scaling the domain classifier to a large number of locales. We propose a locale-agnostic universal domain classification model based on selective multi-task learning that learns a joint representation of an utterance over locales with different sets of domains and allows locales to share knowledge selectively depending on the domains. The experimental results demonstrate the effectiveness of our approach on the domain classification task in the scenario of multiple locales with imbalanced data and disparate domain sets. The proposed approach outperforms baseline models, especially when classifying locale-specific domains and low-resourced domains.
http://arxiv.org/abs/1905.00924
Domain classification is the task of mapping spoken language utterances to one of the natural language understanding domains in intelligent personal digital assistants (IPDAs). This is a major component in mainstream IPDAs in industry. Apart from official domains, thousands of third-party domains are also created by external developers to enhance the capability of IPDAs. As more domains are developed rapidly, the question of how to continuously accommodate the new domains still remains challenging. Moreover, existing continual learning approaches do not address the problem of incorporating personalized information dynamically for better domain classification. In this paper, we propose CoNDA, a neural network based approach for domain classification that supports incremental learning of new classes. Empirical evaluation shows that CoNDA achieves high accuracy and outperforms baselines by a large margin on both incrementally added new domains and existing domains.
http://arxiv.org/abs/1905.00921
We present a practical and robust deep learning solution for capturing and rendering novel views of complex real world scenes for virtual exploration. Previous approaches either require intractably dense view sampling or provide little to no guidance for how users should sample views of a scene to reliably render high-quality novel views. Instead, we propose an algorithm for view synthesis from an irregular grid of sampled views that first expands each sampled view into a local light field via a multiplane image (MPI) scene representation, then renders novel views by blending adjacent local light fields. We extend traditional plenoptic sampling theory to derive a bound that specifies precisely how densely users should sample views of a given scene when using our algorithm. In practice, we apply this bound to capture and render views of real world scenes that achieve the perceptual quality of Nyquist rate view sampling while using up to 4000x fewer views. We demonstrate our approach’s practicality with an augmented reality smartphone app that guides users to capture input images of a scene and viewers that enable realtime virtual exploration on desktop and mobile platforms.
http://arxiv.org/abs/1905.00889
Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for this task: 1) lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for generating diverse outputs without paired training images. To synthesize diverse outputs, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and attribute vectors sampled from the attribute space to synthesize diverse outputs at test time. To handle unpaired training data, we introduce a cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative evaluations, we measure realism with a user study and the Fréchet inception distance, and measure diversity with the perceptual distance metric, Jensen-Shannon divergence, and the number of statistically-different bins.
http://arxiv.org/abs/1905.01270
The objective of this paper is self-supervised learning of feature embeddings from videos, suitable for correspondence flow, i.e., matching correspondences between frames over the video. We leverage the natural spatial-temporal coherence of appearance in videos to create a “pointer” model that learns to reconstruct a target frame by copying colors from a reference frame. We make three contributions: First, we introduce a simple information bottleneck that forces the model to learn robust features for correspondence matching and prevents it from learning trivial solutions, e.g., matching based on low-level color information. Second, we propose to train the model over a long temporal window in videos. To make the model more robust to complex object deformation and occlusion, i.e., the problem of tracker drifting, we formulate a recursive model trained with scheduled sampling and cycle consistency. Third, we evaluate the approach by first training on the Kinetics dataset using self-supervised learning, and then applying it directly to DAVIS video segmentation and JHMDB keypoint tracking. Our approach achieves state-of-the-art performance on both tasks; on segmentation in particular, we outperform all previous methods by a significant margin.
http://arxiv.org/abs/1905.00875
In this paper, we present a compression approach based on the combination of low-rank matrix factorization and quantization training, to reduce complexity for neural network based acoustic event detection (AED) models. Our experimental results show this combined compression approach is very effective. For a three-layer long short-term memory (LSTM) based AED model, the original model size can be reduced to 1% with negligible loss of accuracy. Our approach enables the feasibility of deploying AED for resource-constraint applications.
http://arxiv.org/abs/1905.00855
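A minimal NumPy sketch of the low-rank half of this recipe: factorizing a weight matrix via truncated SVD, so one dense layer becomes two smaller ones (quantization would then be applied to the factors during training).

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W ~= U_r @ V_r, reducing parameters from m*n to rank*(m+n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]          # (m, rank), singular values folded in
    V_r = Vt[:rank, :]                    # (rank, n)
    return U_r, V_r

W = np.random.randn(512, 512)             # e.g. an LSTM gate weight matrix
U_r, V_r = low_rank_factorize(W, rank=64)
print("params:", W.size, "->", U_r.size + V_r.size)
print("relative error:", np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W))
```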
With the advent of perceptual loss functions, new possibilities in super-resolution have emerged, and we currently have models that successfully generate near-photorealistic high-resolution images from their low-resolution observations. Up to now, however, such approaches have been exclusively limited to single image super-resolution. The application of perceptual loss functions on video processing still entails several challenges, mostly related to the lack of temporal consistency of the generated images, i.e., flickering artifacts. In this work, we present a novel adversarial recurrent network for video upscaling that is able to produce realistic textures in a temporally consistent way. The proposed architecture naturally leverages information from previous frames due to its recurrent architecture, i.e. the input to the generator is composed of the low-resolution image and, additionally, the warped output of the network at the previous step. Together with a video discriminator, we also propose additional loss functions to further reinforce temporal consistency in the generated sequences. The experimental validation of our algorithm shows the effectiveness of our approach which obtains images with high perceptual quality and improved temporal consistency.
http://arxiv.org/abs/1807.07930
Numerous tasks in imaging and vision can be formulated as variational problems over vector-valued maps. We approach the relaxation and convexification of such vectorial variational problems via a lifting to the space of currents. To that end, we recall that functionals with polyconvex Lagrangians can be reparametrized as convex one-homogeneous functionals on the graph of the function. This leads to an equivalent shape optimization problem over oriented surfaces in the product space of domain and codomain. A convex formulation is then obtained by relaxing the search space from oriented surfaces to more general currents. We propose a discretization of the resulting infinite-dimensional optimization problem using Whitney forms, which also generalizes recent “sublabel-accurate” multilabeling approaches.
http://arxiv.org/abs/1905.00851
Manipulation tasks often require robots to be continuously in contact with an object. Therefore tactile perception systems need to handle continuous contact data. Shear deformation causes the tactile sensor to output path-dependent readings in contrast to discrete contact readings. As such, in some continuous-contact tasks, sliding can be regarded as a disturbance over the sensor signal. Here we present a shear-invariant perception method based on principal component analysis (PCA) which outputs the required information about the environment despite sliding motion. A compliant tactile sensor (the TacTip) is used to investigate continuous tactile contact. First, we evaluate the method offline using test data collected whilst the sensor slides over an edge. Then, the method is used within a contour-following task applied to 6 objects with varying curvatures; all contours are successfully traced. The method demonstrates generalisation capabilities and could underlie a more sophisticated controller for challenging manipulation or exploration tasks in unstructured environments. A video showing the work described in the paper can be found at https://youtu.be/wrTM61-pieU
http://arxiv.org/abs/1905.00842
Knowledge representation and reasoning (KRR) is one of the key areas in the field of artificial intelligence (AI). It is intended to represent world knowledge in formal languages (e.g., Prolog, SPARQL) and then enhance expert systems to perform querying and inference tasks. Currently, constructing large-scale, high-quality knowledge bases (KBs) is prohibitively difficult because the construction process requires many qualified knowledge engineers who not only understand the domain-specific knowledge but also have sufficient skill in knowledge representation. Unfortunately, qualified knowledge engineers are in short supply. It would therefore be very useful to build a tool that allows the user to construct and query the KB simply via text. Although a number of systems have been developed for knowledge extraction and question answering, they mainly fail because they do not achieve sufficiently high accuracy, whereas KRR is highly sensitive to erroneous data. In this thesis proposal, I present Knowledge Authoring Logic Machine (KALM), a rule-based system that allows the user to author knowledge and query the KB in text. Experimental results show that KALM achieves superior accuracy in knowledge authoring and question answering compared to state-of-the-art systems.
http://arxiv.org/abs/1905.00840
In this paper, we address the problem of enhancing the speech of a speaker of interest in a cocktail party scenario when visual information of the speaker of interest is available. Contrary to most previous studies, we do not learn visual features on the typically small audio-visual datasets, but use an already available face landmark detector (trained on a separate image dataset). The landmarks are used by LSTM-based models to generate time-frequency masks which are applied to the acoustic mixed-speech spectrogram. Results show that: (i) landmark motion features are very effective features for this task, (ii) similarly to previous work, reconstruction of the target speaker’s spectrogram mediated by masking is significantly more accurate than direct spectrogram reconstruction, and (iii) the best masks depend on both motion landmark features and the input mixed-speech spectrogram. To the best of our knowledge, our proposed models are the first models trained and evaluated on the limited size GRID and TCD-TIMIT datasets, that achieve speaker-independent speech enhancement in a multi-talker setting.
http://arxiv.org/abs/1811.02480
We introduce SinGAN, an unconditional generative model that can be learned from a single natural image. Our model is trained to capture the internal distribution of patches within the image, and is then able to generate high quality, diverse samples that carry the same visual content as the image. SinGAN contains a pyramid of fully convolutional GANs, each responsible for learning the patch distribution at a different scale of the image. This allows generating new samples of arbitrary size and aspect ratio, that have significant variability, yet maintain both the global structure and the fine textures of the training image. In contrast to previous single image GAN schemes, our approach is not limited to texture images, and is not conditional (i.e. it generates samples from noise). User studies confirm that the generated samples are commonly confused to be real images. We illustrate the utility of SinGAN in a wide range of image manipulation tasks.
http://arxiv.org/abs/1905.01164
Lighting plays a central role in conveying the essence and depth of the subject in a portrait photograph. Professional photographers will carefully control the lighting in their studio to manipulate the appearance of their subject, while consumer photographers are usually constrained to the illumination of their environment. Though prior works have explored techniques for relighting an image, their utility is usually limited due to requirements of specialized hardware, multiple images of the subject under controlled or known illuminations, or accurate models of geometry and reflectance. To this end, we present a system for portrait relighting: a neural network that takes as input a single RGB image of a portrait taken with a standard cellphone camera in an unconstrained environment, and from that image produces a relit image of that subject as though it were illuminated according to any provided environment map. Our method is trained on a small database of 18 individuals captured under different directional light sources in a controlled light stage setup consisting of a densely sampled sphere of lights. Our proposed technique produces quantitatively superior results on our dataset’s validation set compared to prior works, and produces convincing qualitative relighting results on a dataset of hundreds of real-world cellphone portraits. Because our technique can produce a 640 $\times$ 640 image in only 160 milliseconds, it may enable interactive user-facing photographic applications in the future.
http://arxiv.org/abs/1905.00824
We present a project that aims to generate images that depict accurate, vivid, and personalized outcomes of climate change using Cycle-Consistent Adversarial Networks (CycleGANs). By training our CycleGAN model on street-view images of houses before and after extreme weather events (e.g. floods, forest fires, etc.), we learn a mapping that can then be applied to images of locations that have not yet experienced these events. This visual transformation is paired with climate model predictions to assess the likelihood and type of climate-related events in the long term (50 years), in order to bring the future closer in the viewer's mind. The eventual goal of our project is to enable individuals to make more informed choices about their climate future by creating a more visceral understanding of the effects of climate change, while maintaining scientific credibility by drawing on climate model projections.
http://arxiv.org/abs/1905.03709
We propose a new model for detecting visual relationships, such as “person riding motorcycle” or “bottle on table”. This task is an important step towards comprehensive structured image understanding, going beyond detecting individual objects. Our main novelty is a Box Attention mechanism that allows us to model pairwise interactions between objects using standard object detection pipelines. The resulting model is conceptually clean, expressive, and relies on well-justified training and prediction procedures. Moreover, unlike previously proposed approaches, our model does not introduce any additional complex components or hyperparameters on top of those already required by the underlying detection model. We conduct an experimental evaluation on three challenging datasets, V-COCO, Visual Relationships, and Open Images, demonstrating strong quantitative and qualitative results.
http://arxiv.org/abs/1807.02136
Soft robots are challenging to model due in large part to the nonlinear properties of soft materials. Fortunately, this softness makes it possible to safely observe their behavior under random control inputs, making them amenable to large-scale data collection and system identification. This paper implements and evaluates a system identification method based on Koopman operator theory in which models of nonlinear dynamical systems are constructed via linear regression of observed data by exploiting the fact that every nonlinear system has a linear representation in the infinite-dimensional space of real-valued functions called observables. The approach does not suffer from some of the shortcomings of other nonlinear system identification methods, which typically require the manual tuning of training parameters and have limited convergence guarantees. A dynamic model of a pneumatic soft robot arm is constructed via this method, and used to predict the behavior of the real system. The total normalized-root-mean-square error (NRMSE) of its predictions is lower than that of several other identified models including a neural network, NLARX, nonlinear Hammerstein-Wiener, and linear state space model.
http://arxiv.org/abs/1810.06637
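A minimal NumPy sketch of Koopman-style identification via linear regression (EDMD-flavored, with a hypothetical observable dictionary and simulated data standing in for robot measurements): lift the states with nonlinear observables and fit a linear operator that advances the lifted state one step.

```python
import numpy as np

def lift(x):
    """Dictionary of observables: state, squares, cross term, and constant."""
    x1, x2 = x[..., 0], x[..., 1]
    return np.stack([x1, x2, x1**2, x2**2, x1 * x2, np.ones_like(x1)], axis=-1)

# Simulated (state, next-state) pairs of a nonlinear 2-state system.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
Y = np.stack([0.9 * X[:, 0] + 0.1 * X[:, 1]**2, 0.8 * X[:, 1]], axis=-1)

Phi_X, Phi_Y = lift(X), lift(Y)
A, *_ = np.linalg.lstsq(Phi_X, Phi_Y, rcond=None)   # linear dynamics in lifted space
pred = lift(X[:1]) @ A                              # one-step prediction
print("predicted next state:", pred[0, :2], "true:", Y[0])
```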
Weight quantization is one of the most important techniques for Deep Neural Network (DNN) model compression. A recent work using a systematic framework of DNN weight quantization with the advanced optimization algorithm ADMM (Alternating Direction Method of Multipliers) achieves state-of-the-art results in weight quantization. In this work, we first extend this ADMM-based framework to guarantee solution feasibility, and we further develop a multi-step, progressive DNN weight quantization framework with dual benefits: (i) achieving further weight quantization thanks to the special property of ADMM regularization, and (ii) reducing the search space within each step. Extensive experimental results demonstrate superior performance compared with prior work. Some highlights: we derive the first lossless and fully binarized (all layers) LeNet-5 for MNIST, and the first fully binarized (all layers) VGG-16 for CIFAR-10 and ResNet for ImageNet with reasonable accuracy loss.
http://arxiv.org/abs/1905.00789
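A minimal NumPy sketch of the projection step at the heart of ADMM-based quantization, assuming per-layer scaled binary values as the quantized set; the gradient step on the task loss is elided.

```python
import numpy as np

def project_to_binary(W):
    """Euclidean projection of W onto {-a, +a}^n; the optimal scale is a = mean(|W|)."""
    a = np.mean(np.abs(W))
    return a * np.sign(W)

# One ADMM-style update (the gradient step on the task loss is a placeholder):
W = np.random.randn(256, 256)       # weights after a gradient step on the task loss
U = np.zeros_like(W)                # scaled dual variable
Z = project_to_binary(W + U)        # projection (quantization) step
U = U + W - Z                       # dual update
print("unique values in Z:", np.unique(Z))
```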
This paper aims to design quadrotor swarm performances, where the swarm acts as an integrated, coordinated unit embodying moving and deforming objects. We divide the task of creating a choreography into three basic steps: designing swarm motion primitives, transitioning between those movements, and synchronizing the motion of the drones. The result is a flexible framework for designing choreographies comprised of a wide variety of motions. The motion primitives can be intuitively designed using few parameters, providing a rich library for choreography design. Moreover, we combine and adapt existing goal assignment and trajectory generation algorithms to maximize the smoothness of the transitions between motion primitives. Finally, we propose a correction algorithm to compensate for motion delays and synchronize the motion of the drones to a desired periodic motion pattern. The proposed methodology was validated experimentally by generating and executing choreographies on a swarm of 25 quadrotors.
http://arxiv.org/abs/1810.03572
Non-linear functions such as neural networks can be locally approximated by affine planes. Recent works make use of input-Jacobians, which describe the normals to these planes. In this paper, we introduce full-Jacobians, which include this normal along with an additional intercept term, called the bias-Jacobian, that together completely describe the local planes. For ReLU neural networks, bias-Jacobians correspond to sums of gradients of outputs w.r.t. intermediate layer activations. We first use these full-Jacobians for distillation by aligning the gradients of intermediate representations. Next, we regularize bias-Jacobians alone to improve generalization. Finally, we show that full-Jacobian maps can be viewed as saliency maps. Experimental results show improved distillation on small datasets, improved generalization in neural network training, and sharper saliency maps.
http://arxiv.org/abs/1905.00780
We propose a novel agglomerative clustering method based on unmasking, a technique that was previously used for authorship verification of text documents and for abnormal event detection in videos. In order to join two clusters, we alternate between (i) training a binary classifier to distinguish between the samples from one cluster and the samples from the other cluster, and (ii) removing at each step the most discriminant features. The faster-decreasing accuracy rates of the intermediately-obtained classifiers indicate that the two clusters should be joined. To the best of our knowledge, this is the first work to apply unmasking in order to cluster images. We compare our method with k-means as well as a recent state-of-the-art clustering method. The empirical results indicate that our approach is able to improve performance for various (deep and shallow) feature representations and different tasks, such as handwritten digit recognition, texture classification and fine-grained object recognition.
http://arxiv.org/abs/1905.00773
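A minimal scikit-learn sketch of the unmasking criterion on synthetic data: repeatedly train a linear classifier to separate two clusters, remove the most discriminant features each round, and track how fast cross-validated accuracy decays (a fast drop suggests the clusters should be joined).

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def unmasking_curve(A, B, n_rounds=5, drop_per_round=10):
    X = np.vstack([A, B]); y = np.r_[np.zeros(len(A)), np.ones(len(B))]
    active = np.arange(X.shape[1]); curve = []
    for _ in range(n_rounds):
        clf = LinearSVC(dual=False).fit(X[:, active], y)
        curve.append(cross_val_score(clf, X[:, active], y, cv=3).mean())
        top = np.argsort(np.abs(clf.coef_[0]))[-drop_per_round:]  # most discriminant
        active = np.delete(active, top)                           # remove them
    return curve  # fast-decreasing accuracy => join the two clusters

rng = np.random.default_rng(0)
A, B = rng.normal(0, 1, (60, 100)), rng.normal(0.3, 1, (60, 100))
print(unmasking_curve(A, B))
```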
We introduce REPRISE, a REtrospective and PRospective Inference SchEme, which learns temporal event-predictive models of dynamical systems. REPRISE infers the unobservable contextual event state and accompanying temporal predictive models that best explain the recently encountered sensorimotor experiences retrospectively. Meanwhile, it optimizes upcoming motor activities prospectively in a goal-directed manner. Here, REPRISE is implemented by a recurrent neural network (RNN), which learns temporal forward models of the sensorimotor contingencies generated by different simulated dynamic vehicles. The RNN is augmented with contextual neurons, which enable the encoding of distinct, but related, sensorimotor dynamics as compact event codes. We show that REPRISE concurrently learns to separate and approximate the encountered sensorimotor dynamics: it analyzes sensorimotor error signals adapting both internal contextual neural activities and connection weight values. Moreover, we show that REPRISE can exploit the learned model to induce goal-directed, model-predictive control, that is, approximate active inference: Given a goal state, the system imagines a motor command sequence optimizing it with the prospective objective to minimize the distance to the goal. The RNN activities thus continuously imagine the upcoming future and reflect on the recent past, optimizing the predictive model, the hidden neural state activities, and the upcoming motor activities. As a result, event-predictive neural encodings develop, which allow the invocation of highly effective and adaptive goal-directed sensorimotor control.
http://arxiv.org/abs/1809.07412
Deep convolutional neural networks (CNNs) are nowadays achieving significant leaps in different pattern recognition tasks, including action recognition. Current CNNs are increasingly deep and data-hungry, making their success dependent on the abundance of labeled training data. CNNs also rely on max/average pooling, which reduces the dimensionality of output layers and hence attenuates their sensitivity to the availability of labeled data. However, this process may dilute the information of upstream convolutional layers and thereby affect the discrimination power of the trained representations, especially when the learned categories are fine-grained. In this paper, we introduce a novel hierarchical aggregation design for final pooling that controls the granularity of the learned representations w.r.t. the actual granularity of action categories. Our solution is based on a tree-structured temporal pyramid that aggregates the outputs of CNNs at different levels. Top levels of this hierarchy are dedicated to coarse categories while deep levels are more suitable for fine-grained ones. The design of our temporal pyramid is based on solving a constrained minimization problem whose solution corresponds to the distribution of weights of the different representations in the temporal pyramid. Experiments conducted on the challenging UCF101 database show the relevance of our hierarchical design w.r.t. related methods.
http://arxiv.org/abs/1905.00745
Egocentric vision is an emerging field of computer vision characterized by the acquisition of images and video from the first-person perspective. In this paper we address the challenge of egocentric human action recognition by explicitly utilizing the presence and position of detected regions of interest in the scene, without further use of visual features. Recognizing that human hands are essential in the execution of actions, we focus on their movements as the principal cues that define actions. We employ object detection and region tracking techniques to locate hands and capture their movements, and prior knowledge about egocentric views facilitates distinguishing the left hand from the right. With regard to detection and tracking, we contribute a pipeline that successfully operates on unseen egocentric videos to find the camera wearer's hands and associate them through time. Moreover, we emphasize the value of scene information for action recognition: the presence of objects is significant for the execution of actions by humans and, in general, for the description of a scene. To acquire this information, we apply object detection for specific classes that are relevant to the actions we want to recognize. Our experiments target videos of kitchen activities from the Epic-Kitchens dataset. We model action recognition as a sequence learning problem over the detected spatial positions in the frames. Our results show that explicit hand and object detections, with no other visual information, can be relied upon to classify hand-related human actions. Testing against methods fully dependent on visual features indicates that, for actions where hand motions are conceptually important, a region-of-interest-based description of a video contains equally expressive information, with comparable classification performance.
http://arxiv.org/abs/1905.00742
Training agents with reinforcement-learning-based techniques requires thousands of steps, which translates to long training periods when applied to robots. By training the policy in a simulated environment we avoid this limitation. Typically, the action spaces of the simulation and the real robot are kept as similar as possible, but if we want to use a generic simulation environment, this strategy will not work. Video games such as Doom (1993) offer crude but multi-purpose environments that can be used for learning various tasks. However, the original Doom has four discrete actions for movement, while the robot in our case has two continuous actions. In this work, we study the transfer between these two different action spaces. We begin with experiments in a simulated environment, after which we validate the results with experiments on a real robot. Results show that fine-tuning all initially learned network parameters leads to unreliable results, but by keeping most of the neural network frozen we obtain above $90\%$ success rates in both simulation and real-robot experiments.
http://arxiv.org/abs/1905.00741
We present the 2019 DAVIS Challenge on Video Object Segmentation, the third edition of the DAVIS Challenge series, a public competition designed for the task of Video Object Segmentation (VOS). In addition to the original semi-supervised track and the interactive track introduced in the previous edition, a new unsupervised multi-object track will be featured this year. In the newly introduced track, participants are asked to provide non-overlapping object proposals on each image, along with an identifier linking them between frames (i.e. video object proposals), without any test-time human supervision (no scribbles or masks provided on the test video). In order to do so, we have re-annotated the train and val sets of DAVIS 2017 in a concise way that facilitates the unsupervised track, and created new test-dev and test-challenge sets for the competition. Definitions, rules, and evaluation metrics for the unsupervised track are described in detail in this paper.
http://arxiv.org/abs/1905.00737