Recent studies have shown that the environment where people eat can affect their nutritional behaviour. In this work, we provide automatic tools for a personalised analysis of a person's health habits through the examination of daily recorded egocentric photo-streams. Specifically, we propose a new automatic approach for the classification of food-related environments that is able to distinguish up to 15 such scenes. In this way, people can monitor the context around their food intake in order to get an objective insight into their daily eating routine. We propose a model that classifies food-related scenes organized in a semantic hierarchy. Additionally, we present and make available a new egocentric dataset composed of more than 33,000 images recorded by a wearable camera, on which our proposed model has been tested. Our approach obtains an accuracy and F-score of 56% and 65%, respectively, clearly outperforming the baseline methods.
http://arxiv.org/abs/1905.04097
Partial domain adaptation aims to transfer knowledge from a label-rich source domain to a label-scarce target domain, relaxing the assumption of a fully shared label space across domains. In this more general and practical scenario, a major challenge is how to select source instances from the shared classes across domains for positive transfer. To address this issue, we propose a Domain Adversarial Reinforcement Learning (DARL) framework that automatically selects source instances in the shared classes to circumvent negative transfer, while simultaneously learning transferable features between domains by reducing the domain shift. Specifically, we employ deep Q-learning to learn policies for an agent to make selection decisions by approximating the action-value function. Moreover, domain adversarial learning is introduced to learn domain-invariant features for the selected source instances and the target instances, and also to determine rewards for the agent based on how relevant the selected source instances are to the target domain. Experiments on several benchmark datasets demonstrate the superior performance of our DARL method over existing state-of-the-art approaches to partial domain adaptation.
http://arxiv.org/abs/1905.04094
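The selection mechanism in DARL above lends itself to a compact sketch. The toy below is a bandit-style simplification of the paper's deep Q-learning agent: it learns a linear action-value function over two actions (select or discard a source instance), with `reward_fn` standing in for the domain-adversarial relevance signal; the names and the one-step update are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_select(features, reward_fn, epochs=10, lr=0.1, eps=0.1):
    """Toy Q-learner deciding, per source instance, between two actions:
    0 = discard, 1 = select. reward_fn(x) is a stand-in for the
    domain-adversarial relevance signal (higher when the instance
    looks like the target domain)."""
    d = features.shape[1]
    W = np.zeros((2, d))                      # one weight row per action
    for _ in range(epochs):
        for x in features:
            q = W @ x                         # linear action-value estimates
            a = rng.integers(2) if rng.random() < eps else int(np.argmax(q))
            r = reward_fn(x) if a == 1 else 0.0
            W[a] += lr * (r - q[a]) * x       # one-step (bandit-style) update
    return np.array([int(np.argmax(W @ x)) for x in features])
```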
Nowadays, there is an upsurge of interest in using lifelogging devices. Such devices generate huge amounts of image data; consequently, the need for automatic methods for analyzing and summarizing these data is drastically increasing. We present a new method for familiar scene recognition in egocentric videos, based on background pattern detection through automatically configurable COSFIRE filters. We present some experiments over egocentric data acquired with the Narrative Clip.
http://arxiv.org/abs/1905.04093
Several of the key issues of planar (Al,Ga)N-based deep-ultraviolet light emitting diodes could potentially be overcome by utilizing nanowire heterostructures, exhibiting high structural perfection and improved light extraction. Here, we study the spontaneous emission of GaN/(Al,Ga)N nanowire ensembles grown on Si(111) by plasma-assisted molecular beam epitaxy. The nanowires contain single GaN quantum disks embedded in long (Al,Ga)N nanowire segments essential for efficient light extraction. These quantum disks are found to exhibit intense emission at unexpectedly high energies, namely, significantly above the GaN bandgap, and almost independent of the disk thickness. An in-depth investigation of the actual structure and composition of the nanowires reveals a spontaneously formed Al gradient both along and across the nanowire, resulting in a complex core/shell structure with an Al deficient core and an Al rich shell with continuously varying Al content along the entire length of the (Al,Ga)N segment. This compositional change along the nanowire growth axis induces a polarization doping of the shell that results in a degenerate electron gas in the disk, thus screening the built-in electric fields. The high carrier density not only results in the unexpectedly high transition energies, but also in radiative lifetimes depending only weakly on temperature, leading to a comparatively high internal quantum efficiency of the GaN quantum disks up to room temperature.
https://arxiv.org/abs/1905.04090
This paper addresses Sparse Photometric stereo through Lighting Interpolation and Normal Estimation using a generative Network (SPLINE-Net). SPLINE-Net contains a lighting interpolation network that generates dense lighting observations given a sparse set of lights as input, followed by a normal estimation network that estimates surface normals. Both networks are jointly constrained by the proposed symmetric and asymmetric loss functions, which enforce isotropy constraints and perform outlier rejection of global illumination effects. SPLINE-Net is verified to outperform existing methods for photometric stereo of general BRDFs using only ten images of different lights instead of nearly one hundred.
http://arxiv.org/abs/1905.04088
Understanding physical relations between objects, especially their support relations, is crucial for robotic manipulation. There has been work on reasoning about support relations and structural stability of simple configurations in RGB-D images. In this paper, we propose a method for extracting more detailed physical knowledge from a set of RGB-D images taken from the same scene but from different views, using qualitative reasoning and intuitive physical models. Rather than providing a simple contact relation graph and approximating stability over convex shapes, our method is able to provide a detailed supporting relation analysis based on a volumetric representation. Specifically, true supporting relations between objects (e.g., if an object supports another object by touching it on the side or if the object above contributes to the stability of the object below) are identified. We apply our method to real-world structures captured in warehouse scenarios and show that it works as desired.
http://arxiv.org/abs/1905.04084
In nature, flocking or swarm behavior is observed in many species, as it has beneficial properties like reducing the probability of being caught by a predator. In this paper, we propose SELFish (Swarm Emergent Learning Fish), an approach with multiple autonomous agents that can freely move in a continuous space with the objective of avoiding being caught by a present predator. The predator has the property that it might get distracted by multiple possible prey in its vicinity. We show that this property, in interaction with self-interested agents trained with reinforcement learning solely to survive as long as possible, leads to flocking behavior similar to Boids, a common simulation for flocking behavior. Furthermore, we present interesting insights into the swarming behavior and into the process of agents being caught in our modeled environment.
http://arxiv.org/abs/1905.04077
The routine of a person is defined by the occurrence of activities throughout different days, and can directly affect the person's health. In this work, we address the recognition of routine-related days. To do so, we rely on egocentric images, which are recorded by a wearable camera and make it possible to monitor the life of the user from a first-person view perspective. We propose an unsupervised model that identifies routine-related days, following an outlier detection approach. We test the proposed framework on a total of 72 days in the form of photo-streams, covering around 2 weeks of the life of 5 different camera wearers. Our model achieves an average of 76% accuracy and 68% weighted F-score across all users. Thus, we show that our framework is able to recognise routine-related days, opening the door to the understanding of the behaviour of people.
http://arxiv.org/abs/1905.04076
Occlusion and pose variations, which can change facial appearance significantly, are two major obstacles for automatic Facial Expression Recognition (FER). Though automatic FER has made substantial progress in the past few decades, the occlusion-robust and pose-invariant aspects of FER have received relatively little attention, especially in real-world scenarios. This paper addresses the real-world pose- and occlusion-robust FER problem with three-fold contributions. First, to stimulate research on FER under real-world occlusions and variant poses, we build several in-the-wild facial expression datasets with manual annotations for the community. Second, we propose a novel Region Attention Network (RAN) to adaptively capture the importance of facial regions for occlusion- and pose-variant FER. The RAN aggregates and embeds a varied number of region features produced by a backbone convolutional neural network into a compact fixed-length representation. Last, inspired by the fact that facial expressions are mainly defined by facial action units, we propose a region-biased loss to encourage high attention weights for the most important regions. We validate our RAN and region-biased loss on both our built test datasets and four popular datasets: FERPlus, AffectNet, RAF-DB, and SFEW. Extensive experiments show that our RAN and region-biased loss largely improve the performance of FER with occlusion and variant pose. Our method also achieves state-of-the-art results on FERPlus, AffectNet, RAF-DB, and SFEW. Code and the collected test data will be publicly available.
http://arxiv.org/abs/1905.04075
Wearable cameras capture a first-person view of the daily activities of the camera wearer, offering a visual diary of the user's behaviour. Detecting the appearance of the people the camera wearer interacts with is of high interest for the analysis of social interactions. Generally speaking, social events, lifestyle and health are highly correlated, but there is a lack of tools to monitor and analyse them. We consider that egocentric vision provides a tool to obtain information on and understand users' social interactions. We propose a model that enables us to evaluate and visualize social traits obtained by analysing the appearance of social interactions within egocentric photo-streams. Given sets of egocentric images, we detect the appearance of faces within the days of the camera wearer, and rely on clustering algorithms to group their feature descriptors in order to re-identify persons. The recurrence of detected faces within photo-streams allows us to shape an idea of the user's social pattern of behaviour. We validated our model over several weeks of data recorded by different camera wearers. Our findings indicate that social profiles are potentially useful for social behaviour interpretation.
http://arxiv.org/abs/1905.04073
Human interaction involves very sophisticated non-verbal communication skills, such as understanding the goals and actions of others and coordinating our own actions accordingly. Neuroscience refers to this mechanism as motor resonance, in the sense that the perception of another person's actions and sensory experiences activates the observer's brain as if (s)he were performing the same actions and having the same experiences. We analyze and model non-verbal cues (arm movements) exchanged between two humans that interact and execute handover actions. The contributions of this paper are the following: (i) computational models, using recorded motion data, describing the motor behaviour of each actor in action-in-interaction situations; (ii) a computational model that captures the behaviour of the “giver” and the “receiver” during an object handover action, by coupling the arm motion of both actors; and (iii) the embedding of these models in the iCub robot for both action execution and recognition. Our results show that: (i) the robot can interpret the human arm motion and recognize handover actions; and (ii) it can behave in a “human-like” manner to receive the object of the recognized handover action.
http://arxiv.org/abs/1905.04072
In this paper, we survey the methods and concepts developed for the evaluation of dialogue systems. Evaluation is a crucial part of the development process. Often, dialogue systems are evaluated by means of human evaluations and questionnaires. However, this tends to be very cost- and time-intensive. Thus, much work has been put into finding methods that reduce the involvement of human labour. In this survey, we present the main concepts and methods. For this, we differentiate between the various classes of dialogue systems (task-oriented, conversational, and question-answering dialogue systems). We cover each class by introducing the main technologies developed for its dialogue systems and then presenting the evaluation methods for that class.
http://arxiv.org/abs/1905.04071
In this report, a de-hazing algorithm based on probability and multi-scale fractional order-based fusion is proposed. The proposed scheme improves on a previously implemented multi-scale fractional order-based fusion by augmenting its local contrast and edge sharpening features. It also brightens de-hazed images while avoiding over-enhancement of sky regions. The results of the proposed algorithm are analyzed and compared with existing methods from the literature and indicate better performance in most cases.
http://arxiv.org/abs/1905.04302
Vowels in Arabic are optional orthographic symbols written as diacritics above or below letters. In Arabic texts, typically more than 97 percent of written words do not explicitly show any of the vowels they contain; that is to say, depending on the author, genre and field, less than 3 percent of words include any explicit vowel. Although numerous studies have been published on the issue of restoring the omitted vowels in speech technologies, little attention has been given to this problem in papers dedicated to written Arabic technologies. In this research, we present Arabic-Unitex, an Arabic language resource, with emphasis on vowel representation and encoding. Specifically, we present two dozen rules formalizing a detailed description of vowel omission in written text. They are typographical rules integrated into large-coverage resources for morphological annotation. For restoring vowels, our resources are capable of identifying words in which the vowels are not shown, as well as words in which the vowels are partially or fully included. By taking these rules into account, our resources are able to compute and restore for each word form a list of compatible fully vowelized candidates through omission-tolerant dictionary lookup. Our program analyzes 5,000 words per second of running text (20 pages per second). Based on these comprehensive linguistic resources, we created a spell checker that detects any invalid or misplaced vowel in a fully or partially vowelized form. Finally, our resources provide a lexical coverage of more than 99 percent of the words used in popular newspapers, and restore vowels in words (out of context) simply and efficiently.
http://arxiv.org/abs/1905.04051
A variety of machine learning applications expect to achieve rapid learning from a limited number of labeled data. However, the success of most current models is the result of heavy training on big data. Meta-learning addresses this problem by extracting common knowledge across different tasks that can be quickly adapted to new tasks. However, existing methods do not fully explore weakly-supervised information, which is usually free or cheap to collect. In this paper, we show that weakly-labeled data can significantly improve the performance of meta-learning on few-shot classification. We propose a prototype propagation network (PPN) trained on few-shot tasks together with data annotated with coarse labels. Given a category graph of the targeted fine classes and some weakly-labeled coarse classes, PPN learns an attention mechanism that propagates the prototype of one class to another on the graph, so that the K-nearest-neighbor (KNN) classifier defined on the propagated prototypes achieves high accuracy across different few-shot tasks. The training tasks are generated by subgraph sampling, and the training objective is obtained by accumulating the level-wise classification loss on the subgraph. The resulting graph of prototypes can be continually re-used and updated for new tasks and classes. We also introduce two practical test/inference settings, which differ according to whether the test task can leverage any weakly-supervised information, as in training. On two benchmarks, PPN significantly outperforms most recent few-shot learning methods in different settings, even when they are also allowed to train on weakly-labeled data.
http://arxiv.org/abs/1905.04042
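As a rough illustration of the propagation step described above, the sketch below mixes each class prototype with an attention-weighted average of its graph neighbours; the dot-product attention, the self-loops, and the `mix` coefficient are my assumptions, not the authors' exact parameterization. A nearest-prototype (KNN) classifier can then be run on the returned prototypes.

```python
import torch
import torch.nn.functional as F

def propagate_prototypes(protos, adj, temp=1.0, mix=0.5):
    # protos: (C, D) class prototypes; adj: (C, C) 0/1 category-graph adjacency.
    neigh = (adj + torch.eye(adj.size(0), device=adj.device)) > 0  # add self-loops
    logits = protos @ protos.t() / temp           # dot-product attention scores
    logits = logits.masked_fill(~neigh, float('-inf'))
    attn = F.softmax(logits, dim=1)               # attend only to graph neighbours
    return (1 - mix) * protos + mix * (attn @ protos)
```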
Neural machine translation - using neural networks to translate human language - is an area of active research exploring new neuron types and network topologies with the goal of dramatically improving machine translation performance. Current state-of-the-art approaches, such as the multi-head attention-based transformer, require very large translation corpora and many epochs to produce models of reasonable quality. Recent attempts to parallelize the official TensorFlow “Transformer” model across multiple nodes have hit roadblocks due to excessive memory use and the resulting out-of-memory errors when performing MPI collectives. This paper describes modifications made to the Horovod MPI-based distributed training framework that reduce memory usage for transformer models by converting assumed-sparse tensors to dense tensors, and subsequently replacing sparse gradient gather with dense gradient reduction. The result is a dramatic increase in scale-out capability, with CPU-only scaling tests achieving 91% weak scaling efficiency up to 1200 MPI processes (300 nodes), and up to 65% strong scaling efficiency up to 400 MPI processes (200 nodes) using the Stampede2 supercomputer.
http://arxiv.org/abs/1905.04035
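The core trick, treating assumed-sparse gradients as dense so they can go through a dense allreduce instead of a memory-hungry gather, can be sketched in a few lines of TensorFlow. This illustrates the idea only; it is not the actual Horovod patch.

```python
import tensorflow as tf

def densify_gradients(grads_and_vars):
    """Convert IndexedSlices (assumed-sparse) gradients to dense tensors,
    so a distributed optimizer can apply a dense allreduce rather than
    gathering variable-sized sparse updates across MPI ranks."""
    dense = []
    for grad, var in grads_and_vars:
        if isinstance(grad, tf.IndexedSlices):
            grad = tf.convert_to_tensor(grad)  # scatter slices into a dense tensor
        dense.append((grad, var))
    return dense
```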
We show that for neural network functions with width less than or equal to the input dimension, all connected components of the decision regions are unbounded. The result holds for continuous and strictly monotonic activation functions as well as for the ReLU activation. This complements recent results on the approximation capabilities [Hanin 2017 Approximating] and the connectivity of decision regions [Nguyen 2018 Neural] of such narrow neural networks. Further, we give an example that negatively answers the question posed in [Nguyen 2018 Neural] of whether one of their main results still holds for the ReLU activation. Our results are illustrated by means of numerical experiments.
http://arxiv.org/abs/1807.01194
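To see the flavour of the result in its simplest case (a sketch of mine, not the paper's proof): take width 1 and input dimension 2, with f(x) = v·σ(wᵀx + b), v > 0, and σ strictly increasing. Every decision region is then a superlevel set of the form

```latex
\{\, x \in \mathbb{R}^2 : f(x) > t \,\}
  \;=\;
  \{\, x \in \mathbb{R}^2 : w^{\top}x + b > \sigma^{-1}(t/v) \,\},
\qquad f(x) = v\,\sigma(w^{\top}x + b),\; v > 0,
```

which is an open half-space whenever t/v lies in the range of σ (and otherwise all of the plane or the empty set); in every case it is unbounded whenever it is non-empty.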
State-of-the-art approaches to partially observable planning, like POMCP, are based on stochastic tree search. While these approaches are computationally efficient, they may still construct search trees of considerable size, which can limit performance under restricted memory resources. In this paper, we propose Partially Observable Stacked Thompson Sampling (POSTS), a memory-bounded approach to open-loop planning in large POMDPs that optimizes a fixed-size stack of Thompson Sampling bandits. We empirically evaluate POSTS on four large benchmark problems and compare its performance with different tree-based approaches. We show that POSTS achieves competitive performance compared to tree-based open-loop planning and offers a performance-memory tradeoff, making it suitable for partially observable planning with highly restricted computational and memory resources.
http://arxiv.org/abs/1905.04020
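A minimal sketch of a fixed-size stack of Thompson Sampling bandits for open-loop planning (the Gaussian value model and incremental-mean update are my simplifying assumptions; `simulate` is a hypothetical Monte Carlo rollout of an action sequence):

```python
import numpy as np

class GaussianTSBandit:
    """Thompson Sampling over k arms with Gaussian value estimates."""
    def __init__(self, k):
        self.mean = np.zeros(k)
        self.count = np.ones(k)            # pseudo-count prior

    def select(self):
        # Sample a plausible value per arm; pick the argmax.
        samples = np.random.normal(self.mean, 1.0 / np.sqrt(self.count))
        return int(np.argmax(samples))

    def update(self, arm, reward):
        self.count[arm] += 1
        self.mean[arm] += (reward - self.mean[arm]) / self.count[arm]

def posts_plan(simulate, k_actions, depth, n_sims):
    """Open-loop planning with one bandit per planning depth."""
    stack = [GaussianTSBandit(k_actions) for _ in range(depth)]
    for _ in range(n_sims):
        plan = [b.select() for b in stack]
        ret = simulate(plan)               # Monte Carlo return of the sequence
        for d, b in enumerate(stack):
            b.update(plan[d], ret)
    return [int(np.argmax(b.mean)) for b in stack]
```

Note that memory stays bounded by depth times the per-arm statistics, regardless of how many simulations are run.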
In this work, we study the robustness of a CNN+RNN-based image captioning system subjected to adversarial noise. We propose to fool an image captioning system into generating targeted partial captions for an image polluted by adversarial noise, even if the targeted captions are totally irrelevant to the image content. A partial caption indicates that the words at some locations in the caption are observed, while words at other locations are not restricted. This is the first work to study exact adversarial attacks with targeted partial captions. Due to the sequential dependencies among words in a caption, we formulate the generation of adversarial noise for targeted partial captions as a structured output learning problem with latent variables. Both the generalized expectation maximization algorithm and structural SVMs with latent variables are then adopted to optimize the problem. The proposed methods generate very successful attacks on three popular CNN+RNN-based image captioning models. Furthermore, the proposed attack methods are used to understand the inner mechanisms of image captioning systems, providing guidance for further improving automatic image captioning towards human-level captioning.
http://arxiv.org/abs/1905.04016
The nuclear norm minimization (NNM) is commonly used to approximate the matrix rank by shrinking all singular values equally. However, the singular values have clear physical meanings in many practical problems, and NNM may not be able to faithfully approximate the matrix rank. To alleviate this limitation of NNM, recent studies have suggested that the weighted nuclear norm minimization (WNNM) can achieve a better rank estimate than NNM by heuristically setting the weights to be inverse to the singular values. However, a rigorous explanation of why WNNM is more effective than NNM in various applications has been lacking. In this paper, we analyze NNM and WNNM from the perspective of group sparse representation (GSR). Concretely, an adaptive dictionary learning method is devised to connect the rank minimization and GSR models. Based on the proposed dictionary, we prove that NNM and WNNM are equivalent to L1-norm minimization and weighted L1-norm minimization in GSR, respectively. Since the weighted L1-norm minimization enhances sparsity compared with plain L1-norm minimization in sparse representation, this explains why WNNM is more effective than NNM. By integrating the image nonlocal self-similarity (NSS) prior into the WNNM model, we then apply it to the image denoising problem. Experimental results demonstrate that WNNM is more effective than NNM and outperforms several state-of-the-art methods in both objective and perceptual quality.
http://arxiv.org/abs/1608.04517
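The contrast between NNM and WNNM is easiest to see in the proximal (singular value thresholding) step. The sketch below shrinks each singular value by a weight inverse to its magnitude, so large, physically meaningful singular values are shrunk less; the constant `c` and the one-step form are illustrative assumptions.

```python
import numpy as np

def weighted_svt(Y, c=1.0, eps=1e-8):
    """One-step weighted singular value thresholding (illustrative)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    w = c / (s + eps)                 # larger singular value -> smaller weight
    s_hat = np.maximum(s - w, 0.0)    # weighted soft-thresholding
    return (U * s_hat) @ Vt
```

Setting the weights to a constant recovers ordinary nuclear-norm (soft) thresholding.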
We present a fully-supervised method for learning to segment data structured by an adjacency graph. We introduce the graph-structured contrastive loss, a loss function structured by a ground-truth segmentation. It promotes learning vertex embeddings that are homogeneous within desired segments and have high contrast at their interfaces. Thus, computing a piecewise-constant approximation of such embeddings produces a graph partition close to the objective segmentation. This loss is fully backpropagable, which allows us to learn vertex embeddings with deep learning algorithms. We evaluate our method on a 3D point cloud oversegmentation task, defining a new state of the art by a large margin. These results are based on the published work of Landrieu and Boussaha 2019.
http://arxiv.org/abs/1905.04014
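One plausible form of such a graph-structured contrastive loss, written as a sketch (the margin hinge on cross-segment edges is my assumption; the paper's exact weighting may differ):

```python
import torch

def graph_contrastive_loss(emb, edges, same_segment, margin=1.0):
    # emb: (V, D) vertex embeddings; edges: (E, 2) adjacency pairs;
    # same_segment: (E,) bool, True if both endpoints share a GT segment.
    d = (emb[edges[:, 0]] - emb[edges[:, 1]]).norm(dim=1)
    intra = d[same_segment].mean()                           # pull intra-segment pairs together
    inter = (margin - d[~same_segment]).clamp(min=0).mean()  # push interface pairs apart
    return intra + inter  # assumes both edge types occur in the batch
```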
Traditional face alignment based on machine learning usually tracks the localizations of facial landmarks with a static model trained offline, where all of the training data is available in advance. When new training samples arrive, the static model must be retrained from scratch, which is excessively time- and memory-consuming. In many real-time applications, the training data arrives one by one or batch by batch; as a result, a static model's performance is limited on sequential images with extensive variations. Therefore, the most critical and challenging aspect in this field is to dynamically update the tracker's models so as to continuously enhance their predictive and generalization capabilities. To address this problem, we develop a fast and accurate online learning algorithm for face alignment. In particular, we incorporate the online sequential extreme learning machine into a parallel cascaded regression framework, coined incremental cascade regression (ICR). To the best of our knowledge, this is the first incremental cascaded framework with a non-linear regressor. One main advantage of ICR is that the tracker model can be quickly updated in an incremental way, without the entire retraining process, when a new input arrives. Experimental results demonstrate that the proposed ICR is more accurate and efficient on still and sequential images compared with recent state-of-the-art cascade approaches. Furthermore, the incremental learning proposed in this paper can update the trained model in real time.
http://arxiv.org/abs/1905.04010
Although a wide variety of deep neural networks for robust Visual Odometry (VO) can be found in the literature, they are still unable to solve the drift problem in long-term robot navigation. This paper therefore proposes novel deep end-to-end networks for the long-term 6-DoF VO task. The approach fuses relative and global networks based on Recurrent Convolutional Neural Networks (RCNNs) to improve monocular localization accuracy. Specifically, the relative sub-networks are implemented to smooth the VO trajectory, while the global sub-networks are designed to avoid the drift problem. All the parameters are jointly optimized using Cross Transformation Constraints (CTC), which represent the temporal geometric consistency of consecutive frames, and the Mean Square Error (MSE) between the predicted pose and ground truth. The experimental results on both indoor and outdoor datasets show that our method outperforms other state-of-the-art learning-based VO methods in terms of pose accuracy.
http://arxiv.org/abs/1812.07869
In reinforcement learning algorithms, it is common practice to account for only a single view of the environment when making decisions; however, utilizing multiple views of the environment can help to promote the learning of complicated policies. Since the views may frequently suffer from partial observability, the observations they provide can have different levels of importance. In this paper, we present a novel attention-based deep reinforcement learning method for a multi-view environment in which each view can provide various representative information about the environment. Specifically, our method learns a policy to dynamically attend to views of the environment based on their importance in the decision-making process. We evaluate the performance of our method on the TORCS racing car simulator and three other complex 3D environments with obstacles.
http://arxiv.org/abs/1905.03985
Reinforcement learning (RL) methods learn optimal decisions in the presence of a stationary environment. However, the stationarity assumption on the environment is very restrictive. In many real-world problems, such as traffic signal control and robotic applications, one often encounters non-stationary environments, and in these scenarios RL methods yield sub-optimal decisions. In this paper, we thus consider the problem of developing RL methods that obtain optimal decisions in a non-stationary environment. The goal is to maximize the long-term discounted reward achieved when the underlying model of the environment changes over time. To achieve this, we first adapt a change point algorithm to detect changes in the statistics of the environment and then develop an RL algorithm that maximizes the long-run reward accrued. We illustrate that our change point method detects changes in the model of the environment effectively and thus facilitates the RL algorithm in maximizing the long-run reward. We further validate the effectiveness of the proposed solution on non-stationary random Markov decision processes, a sensor energy management problem, and a traffic signal control problem.
http://arxiv.org/abs/1905.03970
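A simple online change point detector of the kind that can be paired with an RL learner is CUSUM; the self-referential running-mean variant below is a sketch under my own parameter choices, not the paper's specific algorithm. On an alarm, the agent can reset or re-weight its value estimates.

```python
class CusumDetector:
    """One-sided CUSUM on a scalar statistic (e.g. per-step reward or
    TD error). Flags a change when positive drift accumulates."""
    def __init__(self, slack=0.05, threshold=5.0):
        self.slack, self.threshold = slack, threshold
        self.s, self.mean, self.n = 0.0, 0.0, 0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n          # running reference level
        self.s = max(0.0, self.s + (x - self.mean) - self.slack)
        if self.s > self.threshold:                    # change detected
            self.s, self.mean, self.n = 0.0, 0.0, 0    # restart statistics
            return True
        return False
```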
Legal Judgment Prediction (LJP) is the task of determining judgment results based on the fact descriptions of cases. LJP usually consists of multiple subtasks, such as predicting the applicable law articles, the charges, and the term of penalty. These subtasks have topological dependencies, and their results affect and verify each other. However, existing methods make inefficient use of the result dependencies among the subtasks. Moreover, for cases with similar descriptions but different penalties, current methods cannot predict accurately because word collocation information is ignored. In this paper, we propose a Multi-Perspective Bi-Feedback Network with a Word Collocation Attention mechanism based on the topological structure among the subtasks. Specifically, we design a multi-perspective forward prediction and backward verification framework to effectively utilize the result dependencies among multiple subtasks. To distinguish cases with similar descriptions but different penalties, we integrate word collocation features of the fact descriptions into the network via an attention mechanism. The experimental results show that our model achieves significant improvements over baselines on all prediction tasks.
http://arxiv.org/abs/1905.03969
Visual speech recognition (VSR) is the task of recognizing spoken language from video input only, without any audio. VSR has many applications as an assistive technology, especially if it could be deployed on mobile devices and embedded systems. The need for intensive computational resources and a large memory footprint are two of the major obstacles in developing neural network models for VSR in resource-constrained environments. We propose a novel end-to-end deep neural network architecture for word-level VSR called MobiVSR, with a design parameter that aids in balancing the model's accuracy and parameter count. We use depthwise-separable 3D convolution for the first time in the domain of VSR and show how it makes our model efficient. MobiVSR achieves an accuracy of 73% on the challenging Lip Reading in the Wild dataset with 6 times fewer parameters and a 20 times smaller memory footprint than the current state of the art. MobiVSR can also be compressed to 6 MB by applying post-training quantization.
http://arxiv.org/abs/1905.03968
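Depthwise-separable 3D convolution factorizes a full 3D convolution into a per-channel spatio-temporal convolution followed by a 1x1x1 pointwise channel mix, cutting parameters and compute roughly by a factor of the kernel volume. A minimal PyTorch sketch (layer names and hyperparameters are illustrative, not MobiVSR's exact block):

```python
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    """3D depthwise convolution followed by a 1x1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, kernel=3, stride=1, padding=1):
        super().__init__()
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel, stride, padding,
                                   groups=in_ch, bias=False)  # one filter per channel
        self.pointwise = nn.Conv3d(in_ch, out_ch, 1, bias=False)  # mix channels

    def forward(self, x):  # x: (N, C, T, H, W)
        return self.pointwise(self.depthwise(x))
```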
Camera viewpoint selection is an important aspect of visual grasp detection, especially in clutter where many occlusions are present. Where other approaches use a static camera position or fixed data collection routines, our Multi-View Picking (MVP) controller uses an active perception approach to choose informative viewpoints based directly on a distribution of grasp pose estimates in real time, reducing uncertainty in the grasp poses caused by clutter and occlusions. In trials of grasping 20 objects from clutter, our MVP controller achieves 80% grasp success, outperforming a single-viewpoint grasp detector by 12%. We also show that our approach is both more accurate and more efficient than approaches which consider multiple fixed viewpoints.
http://arxiv.org/abs/1809.08564
Typical techniques for video captioning follow the encoder-decoder framework, which can only focus on the one source video being processed. A potential disadvantage of such a design is that it cannot capture the multiple visual contexts of a word that appears in more than one relevant video in the training data. To tackle this limitation, we propose the Memory-Attended Recurrent Network (MARN) for video captioning, in which a memory structure is designed to explore the full-spectrum correspondence between a word and its various similar visual contexts across videos in the training data. Thus, our model is able to achieve a more comprehensive understanding of each word and yield higher captioning quality. Furthermore, the built memory structure enables our method to model the compatibility between adjacent words explicitly, instead of asking the model to learn it implicitly, as most existing models do. Extensive validation on two real-world datasets demonstrates that our MARN consistently outperforms state-of-the-art methods.
http://arxiv.org/abs/1905.03966
We propose a novel method for generating titles for unstructured text documents. We reframe the problem as a sequential question-answering task. A deep neural network is trained on document-title pairs with decomposable titles, meaning that the vocabulary of the title is a subset of the vocabulary of the document. To train the model we use a corpus of millions of publicly available document-title pairs: news articles and headlines. We present the results of a randomized double-blind trial in which subjects were unaware of which titles were human- or machine-generated. When trained on approximately 1.5 million news articles, the model generates headlines that humans judge to be as good as or better than the original human-written headlines in the majority of cases.
http://arxiv.org/abs/1904.08455
Network slicing is a key technology in 5G communication systems; it aims to dynamically and efficiently allocate resources for diversified services with distinct requirements over a common underlying physical infrastructure. Demand-aware allocation is of significant importance to network slicing. In this paper, we consider a scenario in which several slices in one base station share the same bandwidth. Deep reinforcement learning (DRL) is leveraged to solve this problem by regarding the varying demands and the allocated bandwidth as the environment “state” and “action”, respectively. In order to obtain a better quality-of-experience (QoE) satisfaction ratio and spectrum efficiency (SE), we propose a generative adversarial network (GAN) based deep distributional Q network (GAN-DDQN) to learn the distribution of state-action values. Furthermore, we estimate the distribution by approximating a full quantile function, which makes the training error more controllable. To protect the stability of GAN-DDQN's training process from the widely-spanning utility values, we also put forward a reward-clipping mechanism. Finally, we verify the performance of the proposed GAN-DDQN algorithm through extensive simulations.
http://arxiv.org/abs/1905.03929
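Fitting a full quantile function of the return distribution is typically done with a quantile regression Huber loss, as in distributional RL; the sketch below is that standard loss, offered as an approximation of what GAN-DDQN's training objective looks like rather than its exact formulation:

```python
import torch

def quantile_huber_loss(pred, target, taus, kappa=1.0):
    # pred: (N, Q) predicted quantiles; target: (N, Q') target samples;
    # taus: (Q,) quantile fractions in (0, 1).
    td = target.unsqueeze(1) - pred.unsqueeze(2)           # (N, Q, Q') TD errors
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))  # elementwise Huber
    weight = (taus.view(1, -1, 1) - (td.detach() < 0).float()).abs()
    return (weight * huber / kappa).mean()
```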
The need to efficiently find the video content a user wants is increasing because of the eruption of user-generated videos on the Web. Existing keyword-based or content-based video retrieval methods usually determine what occurs in a video but not when and where. In this paper, we answer the question of when and where by formulating a new task, namely spatio-temporal video re-localization. Specifically, given a query video and a reference video, spatio-temporal video re-localization aims to localize tubelets in the reference video such that the tubelets semantically correspond to the query. To accurately localize the desired tubelets in the reference video, we propose a novel warp LSTM network, which propagates spatio-temporal information over a long period and thereby captures the corresponding long-term dependencies. Another issue for spatio-temporal video re-localization is the lack of properly labeled video datasets. Therefore, we reorganize the videos in the AVA dataset to form a new dataset for spatio-temporal video re-localization research. Extensive experimental results show that the proposed model achieves superior performance over the designed baselines on the spatio-temporal video re-localization task.
http://arxiv.org/abs/1905.03922
The latest algorithms for automatic neural architecture search perform remarkably well, but they are largely directionless in the search space and computationally expensive in training every intermediate architecture. In this paper, we propose a method for efficient architecture search called EENA (Efficient Evolution of Neural Architecture), with mutation and crossover operations guided by the information already learned, to speed up the search and consume less computational effort by reducing redundant searching and training. On CIFAR-10 classification, EENA, using minimal computational resources (0.65 GPU-days), can design a highly effective neural architecture that achieves a 2.56% test error with 8.47M parameters. Furthermore, the best architecture discovered is also transferable to CIFAR-100.
http://arxiv.org/abs/1905.07320
Multi-person pose estimation from a 2D image is challenging because it requires not only keypoint localization but also human detection. In state-of-the-art top-down methods, multi-scale information is a crucial factor for accurate pose estimation because it contains both local information around the keypoints and global information about the entire person. Although multi-scale information allows these methods to achieve state-of-the-art performance, top-down methods still require a huge amount of computation because they need an additional human detector to feed cropped human images to their pose estimation model. To effectively utilize multi-scale information with less computation, we propose a multi-scale aggregation R-CNN (MSA R-CNN). It consists of a multi-scale RoIAlign block (MS-RoIAlign) and a multi-scale keypoint head network (MS-KpsNet), both designed to effectively utilize multi-scale information. In contrast to previous top-down methods, the MSA R-CNN performs human detection and keypoint localization in a single model, which reduces computation. The proposed model achieved the best performance among single-model-based methods, and its results are comparable to those of separate-model-based methods with a smaller amount of computation on the publicly available 2D multi-person keypoint localization dataset.
http://arxiv.org/abs/1905.03912
Object shape provides important information for robotic manipulation; for instance, selecting an effective grasp depends on both the global and local shape of the object of interest, while reaching into clutter requires accurate surface geometry to avoid unintended contact with the environment. Model-based 3D object manipulation is a widely studied problem; however, obtaining accurate 3D models of multiple objects often requires tedious work. In this letter, we exploit Gaussian process implicit surfaces (GPIS) extracted from RGB-D sensor data to grasp an unknown object. We propose a reconstruction-aware trajectory optimization that uses the extracted GPIS model to plan a motion that improves the ability to estimate the object's 3D geometry while performing a pick-and-place action. We present a probabilistic approach for a robot to autonomously learn and track the object while achieving the manipulation task. We use a sampling-based trajectory generation method to explore the unseen parts of the object using the estimated conditional entropy of the GPIS model. We validate our method with physical robot experiments across eleven objects of varying shape from the YCB object dataset. Our experiments show that our reconstruction-aware trajectory optimization provides higher-quality 3D object reconstruction than directly solving the manipulation task or using a heuristic to view unseen portions of the object.
http://arxiv.org/abs/1905.03907
In order to improve the accuracy of face recognition under varying illumination conditions, a local texture enhanced illumination normalization method based on the fusion of differential filtering images (FDFI-LTEIN) is proposed to weaken the influence of illumination changes. Firstly, the dynamic range of the face image in dark or shadowed regions is expanded by a logarithmic transformation. Then, the globally contrast-enhanced face image is convolved with difference-of-Gaussian filters and difference-of-bilateral filters, and the filtered images are weighted and merged using a coefficient selection rule based on the standard deviation (SD) of the image, which enhances image texture information while filtering out most noise. Finally, local contrast equalization (LCE) is performed on the fused face image to reduce the influence of over- or under-saturated pixel values in highlighted or dark regions. Experimental results on the Extended Yale B face database and the CMU PIE face database demonstrate that the proposed method is more robust to illumination changes and achieves higher recognition accuracy compared with other illumination normalization methods and a deep CNN-based illumination-invariant face recognition method.
http://arxiv.org/abs/1905.03904
The integration of Artificial Intelligence (AI) into weapon systems is one of the most consequential tactical and strategic decisions in the history of warfare. Current AI development is a remarkable combination of accelerating capability, hidden decision mechanisms, and decreasing costs. Implementation of these systems is in its infancy and exists on a spectrum from resilient and flexible to simplistic and brittle. Resilient systems should be able to effectively handle the complexities of a high-dimensional battlespace. Simplistic AI implementations could be manipulated by an adversarial AI that identifies and exploits their weaknesses. In this paper, we present a framework for understanding the development of dynamic AI/ML systems that interactively and continuously adapt to their user's needs. We explore the implications of increasingly capable AI in the kill chain and how this will lead inevitably to a fully automated, always-on system, barring regulation by treaty. We examine the potential of total integration of cyber and physical security and how this likelihood must inform the development of AI-enabled systems with respect to the “fog of war”, human morals, and ethics.
http://arxiv.org/abs/1905.03899
We propose a data-driven learned sky model, which we use for outdoor lighting estimation from a single image. As no large-scale dataset of images and their corresponding ground truth illumination is readily available, we use complementary datasets to train our approach, combining the vast diversity of illumination conditions of SUN360 with the radiometrically calibrated and physically accurate Laval HDR sky database. Our key contribution is to provide a holistic view of both lighting modeling and estimation, solving both problems end-to-end. From a test image, our method can directly estimate an HDR environment map of the lighting without relying on analytical lighting models. We demonstrate the versatility and expressivity of our learned sky model and show that it can be used to recover plausible illumination, leading to visually pleasant virtual object insertions. To further evaluate our method, we capture a dataset of HDR 360° panoramas and show through extensive validation that we significantly outperform the previous state of the art.
http://arxiv.org/abs/1905.03897
In this paper, we revisit the problem of classifying ships (maritime vessels) detected from overhead imagery. Despite the last decade of research on this very important and pertinent problem, it remains largely unsolved. One of the major issues with the detection and classification of ships and other objects in the maritime domain is the lack of substantial ground truth data needed to train state-of-the-art machine learning algorithms. We address this issue by building a large (200k) synthetic image dataset using the Unity gaming engine and 3D ship models. We demonstrate that with the use of synthetic data, classification performance increases dramatically, particularly when there are very few annotated images used in training.
http://arxiv.org/abs/1905.03894
Detection of curvilinear structures in images has long been of interest. One of the most challenging aspects of this problem is inferring the graph representation of the curvilinear network. Most existing delineation approaches first perform binary segmentation of the image and then refine it using either a set of hand-designed heuristics or a separate classifier that assigns a likelihood to paths extracted from the pixel-wise prediction. In our work, we bridge the gap between segmentation and path classification by training a deep network that performs those two tasks simultaneously. We show that this approach is beneficial because it enforces consistency across the whole processing pipeline. We apply our approach to road and neuron datasets.
http://arxiv.org/abs/1905.03892
Deep learning based methods have penetrated many image processing problems and become the dominant solutions to these problems. A natural question is: “Is there any space left for conventional methods on these problems?” In this paper, exposure interpolation is taken as an example to answer this question, and the answer is “Yes”. A framework fusing conventional and deep learning methods is introduced to generate a medium-exposure image from two large-exposure-ratio images. Experimental results indicate that the quality of the medium-exposure image is increased significantly by using the deep learning method to refine the image interpolated by the conventional method. The conventional method can be adopted to improve the convergence speed of the deep learning method and to reduce the number of samples required by the deep learning method.
http://arxiv.org/abs/1905.03890
Open-domain question answering remains a challenging task as it requires models that are capable of understanding questions and answers, collecting useful information, and reasoning over evidence. Previous work typically formulates this task as a reading comprehension or entailment problem given evidence retrieved from search engines. However, existing techniques struggle to retrieve indirectly related evidence when no directly related evidence is provided, especially for complex questions where it is hard to parse precisely what the question asks. In this paper we propose a retriever-reader model that learns to attend on essential terms during the question answering process. We build (1) an essential term selector which first identifies the most important words in a question, then reformulates the query and searches for related evidence; and (2) an enhanced reader that distinguishes between essential terms and distracting words to predict the answer. We evaluate our model on multiple open-domain multiple-choice QA datasets, notably performing at the level of the state-of-the-art on the AI2 Reasoning Challenge (ARC) dataset.
http://arxiv.org/abs/1808.09492
We present a method for converting voices between a set of speakers. Our method is based on training multiple autoencoder paths, with a single speaker-independent encoder and multiple speaker-dependent decoders. The autoencoders are trained with the addition of an adversarial loss, provided by an auxiliary classifier, in order to guide the output of the encoder to be speaker-independent. The training of the model is unsupervised in the sense that it does not require collecting the same utterances from the speakers, nor does it require time-aligning over phonemes. Due to the use of a single encoder, our method can generalize to converting the voices of out-of-training speakers to speakers in the training dataset. We present subjective tests corroborating the performance of our method.
http://arxiv.org/abs/1905.03864
In this work we propose a new computational framework, based on generative deep models, for the synthesis of photo-realistic food meal images from textual descriptions of their ingredients. Previous work on synthesizing images from text typically relies on pre-trained text models to extract text features, followed by generative adversarial networks (GANs) that generate realistic images conditioned on the text features. These works mainly focus on generating spatially compact and well-defined categories of objects, such as birds or flowers. In contrast, meal images are significantly more complex, consisting of multiple ingredients whose appearance and spatial qualities are further modified by cooking methods. We propose a method that first builds an attention-based ingredient-image association model, which is then used to condition a generative neural network tasked with synthesizing meal images. Furthermore, a cycle-consistency constraint is added to further improve image quality and control appearance. Extensive experiments show our model is able to generate meal images corresponding to the ingredients, which could be used to augment existing datasets for solving other computational food analysis problems.
http://arxiv.org/abs/1905.13149
Automated prediction of valence, one key feature of a person's emotional state, from individuals' personal narratives may provide crucial information for mental healthcare (e.g. early diagnosis of mental diseases, supervision of disease course, etc.). In the Interspeech 2018 ComParE Self-Assessed Affect challenge, the task of valence prediction was framed as a three-class classification problem using 8-second fragments from individuals' narratives. As such, the task did not allow for exploring the contextual information of the narratives. In this work, we investigate the intrinsic information from multiple narratives recounted by the same individual in order to predict their current state of mind. Furthermore, with generalizability in mind, we decided to focus our experiments exclusively on textual information, as the public availability of audio narratives is limited compared to text. Our hypothesis is that context modeling might provide insights about emotion-triggering concepts (e.g. events, people, places) mentioned in the narratives that are linked to an individual's state of mind. We explore multiple machine learning techniques to model narratives. We find that the models are able to capture inter-individual differences, leading to more accurate predictions of an individual's emotional state than single narratives.
https://arxiv.org/abs/1905.05701
We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a technique for off-policy evaluation that re-weights off-policy returns to account for differences in the likelihood of the returns between the two policies. In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set. Intuitively, estimating the behavior policy in this way corrects for error due to sampling in the action-space. Our empirical results also extend to other popular variants of importance sampling and show that estimating a non-Markovian behavior policy can further lower large-sample mean squared error even when the true behavior policy is Markovian.
http://arxiv.org/abs/1806.01347
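For concreteness, the basic weighted-return importance sampling estimator the paper builds on looks like the sketch below; the paper's point is that `pi_b_hat` is fit (e.g. by maximum likelihood) on the same trajectories being re-weighted. The names and the per-trajectory form are illustrative assumptions.

```python
import numpy as np

def is_estimate(trajectories, pi_e, pi_b_hat, gamma=0.99):
    """Per-trajectory weighted-return importance sampling.

    trajectories: list of [(s, a, r), ...] tuples.
    pi_e(a, s), pi_b_hat(a, s): action probabilities under the evaluation
    policy and the *estimated* behavior policy."""
    returns = []
    for traj in trajectories:
        rho, g, discount = 1.0, 0.0, 1.0
        for s, a, r in traj:
            rho *= pi_e(a, s) / pi_b_hat(a, s)   # cumulative importance weight
            g += discount * r                    # discounted return
            discount *= gamma
        returns.append(rho * g)
    return float(np.mean(returns))
```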
In this work, we demonstrate the existence of universal adversarial audio perturbations that cause mis-transcription of audio signals by automatic speech recognition (ASR) systems. We propose an algorithm to find a single quasi-imperceptible perturbation, which when added to any arbitrary speech signal, will most likely fool the victim speech recognition model. Our experiments demonstrate the application of our proposed technique by crafting audio-agnostic universal perturbations for the state-of-the-art ASR system – Mozilla DeepSpeech. Additionally, we show that such perturbations generalize to a significant extent across models that are not available during training, by performing a transferability test on a WaveNet based ASR system.
http://arxiv.org/abs/1905.03828
We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose to first transfer audio to a high-level structure, i.e., facial landmarks, and then to generate video frames conditioned on the landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content. We, humans, are sensitive to temporal discontinuities and subtle artifacts in video. To avoid pixel jittering problems and to force the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate sharper images with well-synchronized facial movements, we propose a novel regression-based discriminator structure, which considers sequence-level information along with frame-level information. Thorough experiments on several datasets and real-world samples demonstrate significantly better results obtained by our method than by state-of-the-art methods in both quantitative and qualitative comparisons.
http://arxiv.org/abs/1905.03820
There have been multiple recent proposals on using deep neural networks for code search using natural language. Common across these proposals is the idea of embedding code and natural language queries into real vectors and then using vector distance to approximate semantic correlation between code and the query. Multiple approaches exist for learning these embeddings, including unsupervised techniques, which rely only on a corpus of code examples, and supervised techniques, which use an aligned corpus of paired code and natural language descriptions. The goal of this supervision is to produce embeddings that are more similar for a query and the corresponding desired code snippet. Clearly, there are choices in whether to use supervised techniques at all, and if one does, what sort of network and training to use for supervision. This paper is the first to evaluate these choices systematically. To this end, we assembled implementations of state-of-the-art techniques to run on a common platform with shared training and evaluation corpora. To explore the design space in network complexity, we also introduced a new design point that is a minimal supervision extension to an existing unsupervised technique. Our evaluation shows that: 1. adding supervision to an existing unsupervised technique can improve performance, though not necessarily by much; 2. simple networks for supervision can be more effective than sophisticated sequence-based networks for code search; 3. while it is common to use docstrings to carry out supervision, there is a sizeable gap between the effectiveness of docstrings and a more query-appropriate supervision corpus.
http://arxiv.org/abs/1905.03813
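Whatever the training regime, retrieval at query time reduces to nearest-neighbour search in the shared embedding space; a minimal sketch with cosine similarity (the concrete encoders producing `query_vec` and `code_vecs` are left out):

```python
import numpy as np

def search(query_vec, code_vecs, top_k=5):
    """Rank code snippets by cosine similarity to an embedded query.

    query_vec: (D,) query embedding; code_vecs: (N, D) snippet embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    c = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    scores = c @ q                     # cosine similarity per snippet
    return np.argsort(-scores)[:top_k] # indices of the best matches
```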
The Long Short-Term Memory (LSTM) recurrent neural network is capable of processing complex sequential information, since it utilizes special gating schemes for learning representations from long input sequences. It has the potential to model any sequential time-series data, where the current hidden state has to be considered in the context of the past hidden states. This property makes LSTM an ideal choice to learn the complex dynamics present in long sequences. Unfortunately, conventional LSTMs do not consider the impact of the spatio-temporal dynamics corresponding to salient motion patterns when they gate the information to be memorized through time. To address this problem, we propose a differential gating scheme for the LSTM neural network, which emphasizes the change in information gain caused by salient motions between successive video frames. This change in information gain is quantified by the Derivative of States (DoS), and the proposed LSTM model is thus termed the differential Recurrent Neural Network (dRNN). In addition, the original work used the hidden state at the last time step to model the entire video sequence. Based on energy profiling of the DoS, we further propose to employ the State Energy Profile (SEP) to search for salient dRNN states and construct more informative representations. The effectiveness of the proposed model was demonstrated by automatically recognizing human actions from real-world 2D and 3D single-person action datasets. We point out that LSTM is a special form of dRNN; as a result, we have introduced a new family of LSTMs. Our study is one of the first works towards demonstrating the potential of learning complex time-series representations via high-order derivatives of states.
http://arxiv.org/abs/1905.04293
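A rough way to see the DoS idea in code: feed the frame-to-frame change of the internal state back into the gates, so salient motion (a large state change) influences what gets memorized. The cell below is my approximation (it concatenates the DoS to the input of a standard LSTM cell), not the authors' exact gating:

```python
import torch
import torch.nn as nn

class DoSGatedLSTM(nn.Module):
    """Sketch: expose the Derivative of States (DoS), c_t - c_{t-1},
    to the gates by concatenating it to the input."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.cell = nn.LSTMCell(in_dim + hid_dim, hid_dim)

    def forward(self, x_seq):                      # x_seq: (T, N, in_dim)
        n, hid = x_seq.size(1), self.cell.hidden_size
        h = x_seq.new_zeros(n, hid)
        c = x_seq.new_zeros(n, hid)
        c_prev = x_seq.new_zeros(n, hid)
        for x in x_seq:
            dos = c - c_prev                       # change in internal state
            c_prev = c
            h, c = self.cell(torch.cat([x, dos], dim=1), (h, c))
        return h                                   # final hidden state
```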