Recently, there has been a great deal of research into the tensor singular value decomposition (t-SVD) based on the discrete Fourier transform (DFT) matrix. The main aim of this paper is to propose and study a tensor singular value decomposition based on the discrete cosine transform (DCT) matrix. The advantages of using the DCT are that (i) no complex arithmetic is involved in the cosine-transform-based tensor singular value decomposition, which reduces the computational cost; and (ii) the intrinsic reflexive boundary condition along the tubes in the third dimension of tensors is exploited, so its performance is expected to be better than that obtained with the periodic boundary condition implied by the DFT. We demonstrate that the tensor product of two tensors under the DCT is equivalent to the multiplication of a block Toeplitz-plus-Hankel matrix with a block vector. Numerical examples of low-rank tensor completion further illustrate that the DCT-based method is two times faster than the DFT-based one, and that the errors of video and multispectral image completion using the DCT are smaller than those obtained using the DFT.
http://arxiv.org/abs/1902.03070
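As a concrete illustration of the transform-based tensor product described above, here is a minimal NumPy/SciPy sketch of computing a t-product in the DCT domain: transform both tensors along the third mode, multiply frontal slices, and transform back. The DCT type, normalization, and function name are our assumptions; the paper's exact definition may differ.

```python
# Hypothetical helper illustrating a DCT-domain tensor-tensor product;
# the exact DCT variant used in the paper is an assumption here.
import numpy as np
from scipy.fft import dct, idct

def t_product_dct(A, B):
    """A: (n1, n2, n3), B: (n2, n4, n3) -> (n1, n4, n3), computed slice-wise
    in the cosine-transform domain (all arithmetic stays real)."""
    n3 = A.shape[2]
    Ah = dct(A, type=2, axis=2, norm='ortho')   # forward DCT along the tubes
    Bh = dct(B, type=2, axis=2, norm='ortho')
    Ch = np.empty((A.shape[0], B.shape[1], n3))
    for k in range(n3):                         # real frontal-slice products
        Ch[:, :, k] = Ah[:, :, k] @ Bh[:, :, k]
    return idct(Ch, type=2, axis=2, norm='ortho')  # back to the original domain
```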
Static stochastic VRPs aim at modeling real-life VRPs by considering uncertainty in the data. In particular, the SS-VRPTW-CR considers stochastic customers with time windows and makes no assumption on their reveal times, which are stochastic as well. Based on customer request probabilities, we look for an a priori solution composed of preventive vehicle routes, minimizing the expected number of unsatisfied customer requests at the end of the day. A route describes a sequence of strategic vehicle relocations from which nearby requests can be rapidly reached. Instead of reoptimizing online, a so-called recourse strategy defines the way the requests are handled whenever they appear. In this paper, we describe a new recourse strategy for the SS-VRPTW-CR that improves vehicle routes by skipping useless parts. We show how to compute the expected cost of a priori solutions, in pseudo-polynomial time, for this recourse strategy. We introduce a new meta-heuristic, called Progressive Focus Search (PFS), which may be combined with any local-search-based algorithm for solving static stochastic optimization problems. PFS accelerates the search by using approximation factors: starting from an initial, roughly simplified problem, the search progressively focuses on the actual problem description. We evaluate our contributions on a new public benchmark based on real-world data.
http://arxiv.org/abs/1902.03930
Service robots are expected to be more autonomous and to work efficiently in human-centric environments. For this type of robot, open-ended object recognition is a challenging task due to the high demand for two essential capabilities: (i) accurate and real-time response, and (ii) the ability to learn new object categories on-site from very few examples. These capabilities are required because, no matter how extensive the training data used for batch learning, the robot might be faced with an unknown object when operating in everyday environments. In this work, we present OrthographicNet, a deep transfer-learning-based approach for 3D object recognition in open-ended domains. In particular, OrthographicNet generates a rotation- and scale-invariant global feature for a given object, enabling it to recognize the same or similar objects seen from different perspectives. Experimental results show that our approach yields significant improvements over the previous state-of-the-art approaches concerning scalability, memory usage, and object recognition performance. Regarding real-time performance, two real-world demonstrations validate the promising performance of the proposed architecture. Moreover, our approach demonstrates the capability of learning from very few training examples in a real-world setting.
https://arxiv.org/abs/1902.03057
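One ingredient OrthographicNet builds on is a pose-normalized, view-invariant representation of an object. The sketch below (our own simplification, not the paper's implementation) shows the general idea: center and PCA-align a point cloud, then rasterize it onto three orthographic planes.

```python
# A hedged simplification of pose-normalized orthographic projections;
# function name, resolution, and normalization are illustrative assumptions.
import numpy as np

def orthographic_views(points, res=64):
    """points: (N, 3) array -> three res x res occupancy images (XY, XZ, YZ)."""
    pts = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(pts, full_matrices=False)
    pts = pts @ vt.T                        # align principal axes with x, y, z
    pts /= np.abs(pts).max() + 1e-9         # scale into [-1, 1] for scale invariance
    views = []
    for axes in [(0, 1), (0, 2), (1, 2)]:   # three orthographic planes
        img = np.zeros((res, res))
        ij = ((pts[:, axes] + 1) / 2 * (res - 1)).astype(int)
        img[ij[:, 1], ij[:, 0]] = 1.0
        views.append(img)
    return views
```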
We investigate the behaviour of attention in neural models of visually grounded speech trained on two languages: English and Japanese. Experimental results show that attention focuses on nouns, and that this behaviour holds for two typologically very different languages. We also draw parallels between artificial neural attention and human attention, and show that neural attention focuses on word endings, as has been theorised for human attention. Finally, we investigate how two visually grounded monolingual models can be used to perform cross-lingual speech-to-speech retrieval. For both languages, we distribute to the community the bilingual (speech-image) corpora, enriched with part-of-speech tags and forced alignments, for reproducible research.
http://arxiv.org/abs/1902.03052
The goal of MRI reconstruction is to restore a high-fidelity image from partially observed measurements. This partial view naturally induces reconstruction uncertainty that can only be reduced by acquiring additional measurements. In this paper, we present a novel method for MRI reconstruction that, at inference time, dynamically selects the measurements to take and iteratively refines the prediction in order to best reduce the reconstruction error and, thus, its uncertainty. We validate our method on a large-scale knee MRI dataset, as well as on ImageNet. Results show that (1) our system successfully outperforms active acquisition baselines; (2) our uncertainty estimates correlate with error maps; and (3) our ResNet-based architecture surpasses standard pixel-to-pixel models in the task of MRI reconstruction. The proposed method not only produces high-quality reconstructions but also paves the way towards more applicable solutions for accelerating MRI.
http://arxiv.org/abs/1902.03051
We consider learning methods based on the regularization of a convex empirical risk by a squared Hilbertian norm, a setting that includes linear predictors and non-linear predictors through positive-definite kernels. In order to go beyond the generic analysis leading to convergence rates of the excess risk as $O(1/\sqrt{n})$ from $n$ observations, we assume that the individual losses are self-concordant, that is, their third-order derivatives are bounded by their second-order derivatives. This setting includes least-squares, as well as all generalized linear models such as logistic and softmax regression. For this class of losses, we provide a bias-variance decomposition and show that the assumptions commonly made in least-squares regression, such as the source and capacity conditions, can be adapted to obtain fast non-asymptotic rates of convergence by improving the bias terms, the variance terms or both.
http://arxiv.org/abs/1902.03046
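For readers unfamiliar with the assumption named in the abstract above, a compact way to state (generalized) self-concordance of a scalar loss is given below; the constant R is generic, and the logistic-loss example is a standard fact rather than a claim taken from the paper.

```latex
% A loss \varphi is (generalized) self-concordant if its third derivative is
% controlled by its second derivative:
\[
  |\varphi'''(t)| \;\le\; R\,\varphi''(t) \quad \text{for all } t .
\]
% For example, the logistic loss \varphi(t) = \log(1 + e^{-t}) satisfies this
% with R = 1, whereas least-squares satisfies it trivially with \varphi''' = 0.
```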
In many applications, autonomous robot deployment is preferable, and sometimes it is the only affordable solution. To address this issue, virtual force (VF) is one of the prominent approaches to performing multi-robot deployment autonomously. However, most existing VF-based approaches consider only a uniform deployment that maximizes the covered area, while ignoring the criticality of specific locations during the deployment process. To overcome these limitations, we present a framework for autonomously deploying robots or vehicles using virtual forces. The framework is composed of two stages. In the first stage, a two-hop Cooperative Virtual Force based Robots Deployment (Two-hop COVER) is employed, in which a cooperative relation between robots and neighboring landmarks is established to satisfy mission requirements. The second stage complements the first and ensures complete demand satisfaction by utilizing the Trace Fingerprint technique, which collects traces as each robot traverses the deployment area. Finally, a fairness-aware version of Two-hop COVER is presented for scenarios where the mission requirements exceed the available resources (i.e., robots). We evaluate our framework via extensive simulations. The results demonstrate outstanding performance compared to contemporary approaches in terms of total travelled distance, total exchanged messages, total deployment time, and Jain fairness index.
https://arxiv.org/abs/1902.03039
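As background for the virtual-force family of methods mentioned above (not the paper's Two-hop COVER itself), the toy sketch below shows the basic rule these approaches build on: robots repel each other when too close and attract when too far, which spreads them over the area. The threshold and step size are arbitrary assumptions.

```python
# A toy virtual-force update, for illustration only; d_th and step are assumptions.
import numpy as np

def virtual_force_step(positions, d_th=1.0, step=0.05):
    """positions: (n, 2) array of robot coordinates -> updated positions."""
    n = len(positions)
    forces = np.zeros_like(positions)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = positions[i] - positions[j]
            dist = np.linalg.norm(diff) + 1e-9
            if dist < d_th:                    # too close: repulsive force
                forces[i] += (d_th - dist) * diff / dist
            else:                              # too far: attractive force
                forces[i] -= (dist - d_th) * diff / dist
    return positions + step * forces
```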
Generative adversarial nets (GANs) are widely used to learn the data sampling process and their performance may heavily depend on the loss functions, given a limited computational budget. This study revisits MMD-GAN that uses the maximum mean discrepancy (MMD) as the loss function for GAN and makes two contributions. First, we argue that the existing MMD loss function may discourage the learning of fine details in data as it attempts to contract the discriminator outputs of real data. To address this issue, we propose a repulsive loss function to actively learn the difference among the real data by simply rearranging the terms in MMD. Second, inspired by the hinge loss, we propose a bounded Gaussian kernel to stabilize the training of MMD-GAN with the repulsive loss function. The proposed methods are applied to the unsupervised image generation tasks on CIFAR-10, STL-10, CelebA, and LSUN bedroom datasets. Results show that the repulsive loss function significantly improves over the MMD loss at no additional computational cost and outperforms other representative loss functions. The proposed methods achieve an FID score of 16.21 on the CIFAR-10 dataset using a single DCGAN network and spectral normalization.
https://arxiv.org/abs/1812.09916
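To make the loss functions discussed above concrete, the PyTorch sketch below shows a Gaussian-kernel MMD^2 estimator and our reading of the rearranged "repulsive" discriminator objective, in which the real-real term is no longer contracted; the exact form and the kernel bounding used in the paper should be checked against the paper itself.

```python
# Hedged sketch of MMD^2 and a repulsive discriminator loss (our reading).
import torch

def gaussian_kernel(a, b, sigma=1.0):
    return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased MMD^2 estimate between two batches of discriminator outputs."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean())

def repulsive_d_loss(real_out, fake_out, sigma=1.0):
    """Rearranged terms: the discriminator is encouraged to spread out the
    real outputs instead of contracting them, preserving fine detail."""
    return (gaussian_kernel(real_out, real_out, sigma).mean()
            - gaussian_kernel(fake_out, fake_out, sigma).mean())
```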
Deep neural networks have been shown to be vulnerable to single-pixel modifications. However, the reason behind this phenomenon has never been elucidated. Here, we propose Propagation Maps, which show the influence of the perturbation in each layer of the network. Propagation Maps reveal that, even in extremely deep networks such as ResNet, a modification to one pixel easily propagates to the last layer. In fact, this initially local perturbation is also shown to spread, becoming a global one and reaching absolute difference values close to the maximum value of the original feature maps in a given layer. Moreover, we perform a locality analysis in which we demonstrate that pixels near the perturbed one in the one-pixel attack tend to share the same vulnerability, revealing that the main vulnerability lies neither in neurons nor in pixels but in receptive fields. Hopefully, the analysis conducted in this work, together with the new propagation maps technique, will shed light on the inner workings of other adversarial samples and form the basis of new defense systems to come.
http://arxiv.org/abs/1902.02947
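A straightforward way to compute such per-layer influence maps is to run the clean and the perturbed image through the network and record the absolute activation differences. The sketch below uses PyTorch forward hooks; the function name and the choice of layers are our assumptions, not the paper's code.

```python
# Hedged sketch of per-layer "propagation maps" via forward hooks.
import torch

def propagation_maps(model, clean, perturbed, layer_names):
    """Return {layer_name: |act(perturbed) - act(clean)|} for chosen modules."""
    acts, hooks = {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            acts.setdefault(name, []).append(output.detach())
        return hook

    for name, module in model.named_modules():
        if name in layer_names:
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(clean)       # first pass: clean activations
        model(perturbed)   # second pass: perturbed activations
    for h in hooks:
        h.remove()
    return {name: (a[1] - a[0]).abs() for name, a in acts.items()}
```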
We design envy-free mechanisms for the allocation of rooms and rent payments among roommates. We achieve four objectives: (1) each agent is allowed to make a report that expresses her preference about violating her budget constraint, a feature not achieved by mechanisms that only elicit quasi-linear reports; (2) these reports are finite dimensional; (3) computation is feasible in polynomial time; and (4) incentive properties of envy-free mechanisms that elicit quasi-linear reports are preserved.
http://arxiv.org/abs/1902.02935
Transfer-learning and meta-learning are two effective methods to apply knowledge learned from large data sources to new tasks. In few-class, few-shot target task settings (i.e. when there are only a few classes and training examples available in the target task), meta-learning approaches that optimize for future task learning have outperformed the typical transfer approach of initializing model weights from a pre-trained starting point. But as we experimentally show, meta-learning algorithms that work well in the few-class setting do not generalize well in many-shot and many-class cases. In this paper, we propose a joint training approach that combines both transfer-learning and meta-learning. Benefiting from the advantages of each, our method obtains improved generalization performance on unseen target tasks in both few- and many-class and few- and many-shot scenarios.
http://arxiv.org/abs/1809.08346
Offline handwritten mathematical expression recognition is a challenging task, because it involves two main problems: on one hand, correctly recognizing the different mathematical symbols; on the other hand, correctly recognizing the two-dimensional structure present in mathematical expressions. Inspired by recent work in deep learning, a new neural network model that combines a multi-scale convolutional neural network (CNN) with an attention recurrent neural network (RNN) is proposed to transcribe two-dimensional handwritten mathematical expressions into one-dimensional LaTeX sequences. The model proposed in the present work achieves a word error rate (WER) of 25.715% and an ExpRate of 28.216%.
http://arxiv.org/abs/1902.05376
The Café Wall illusion is one of a class of tilt illusions in which parallel lines appear to be tilted. We demonstrate that a simple Difference-of-Gaussians model provides an explanatory mechanism for the illusory tilt perceived in a family of Café Wall illusions, and that it generalizes to the dashed versions of the Café Wall. Our explanation models the low-level visual mechanisms that can reveal tilt cues in geometrical distortion illusions such as Tile illusions, particularly Café Wall illusions. To this end, we simulate the activation of retinal/cortical simple cells in response to these patterns using a Classical Receptive Field (CRF) model to explain the tilt effects in these illusions. Previously, it was assumed that all these visual experiences of tilt arise from the orientation selectivity properties described for more complex cortical cells. The overall tilt angle perceived in these illusions is estimated by integrating the local tilts detected by simple cells, which is presumed to be a key mechanism utilized by complex cells to create our final perception of tilt.
https://arxiv.org/abs/1902.03739
For many real applications, it is equally important to detect objects accurately and quickly. In this paper, we propose an accurate and efficient single-shot object detector with feature aggregation and enhancement (FAENet). Our motivation is to enhance and exploit the shallow and deep feature maps of the whole network simultaneously. To achieve this, we introduce a pair of novel feature aggregation modules and two feature enhancement blocks, and integrate them into the original structure of SSD. Extensive experiments on both the PASCAL VOC and MS COCO datasets demonstrate that the proposed method achieves much higher accuracy than SSD. In addition, our method performs better than the state-of-the-art one-stage method RefineDet on small objects and can run at a faster speed.
http://arxiv.org/abs/1902.02923
Illusions are fascinating and immediately catch people’s attention and interest, but they are also valuable in terms of giving us insights into human cognition and perception. A good theory of human perception should be able to explain the illusion, and a correct theory will actually give quantifiable results. We investigate here the efficiency of a computational filtering model utilised for modelling the lateral inhibition of retinal ganglion cells and their responses to a range of Geometric Illusions using isotropic Differences of Gaussian filters. This study explores the way in which illusions have been explained and shows how a simple standard model of vision based on classical receptive fields can predict the existence of these illusions as well as the degree of effect. A fundamental contribution of this work is to link bottom-up processes to higher level perception and cognition consistent with Marr’s theory of vision and edge map representation.
http://arxiv.org/abs/1902.02922
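Both of the illusion studies above are built on centre-surround filtering with isotropic Difference-of-Gaussians (DoG) kernels. The sketch below shows a minimal such filter; the scale and surround ratio are placeholder values, whereas the papers sweep a range of scales.

```python
# Minimal isotropic Difference-of-Gaussians (centre-surround) filter.
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_response(image, sigma=2.0, surround_ratio=2.0):
    """Approximate retinal ganglion cell (centre-surround) response."""
    img = np.asarray(image, dtype=float)
    centre = gaussian_filter(img, sigma)
    surround = gaussian_filter(img, surround_ratio * sigma)
    return centre - surround
```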
The performance of a biometric system that relies on a single biometric modality (e.g., fingerprints only) is often stymied by various factors such as poor data quality or limited scalability. Multibiometric systems utilize the principle of fusion to combine information from multiple sources in order to improve recognition accuracy whilst addressing some of the limitations of single-biometric systems. The past two decades have witnessed the development of a large number of biometric fusion schemes. This paper presents an overview of biometric fusion with specific focus on three questions: what to fuse, when to fuse, and how to fuse. A comprehensive review of techniques incorporating ancillary information in the biometric recognition pipeline is also presented. In this regard, the following topics are discussed: (i) incorporating data quality in the biometric recognition pipeline; (ii) combining soft biometric attributes with primary biometric identifiers; (iii) utilizing contextual information to improve biometric recognition accuracy; and (iv) performing continuous authentication using ancillary information. In addition, the use of information fusion principles for presentation attack detection and multibiometric cryptosystems is also discussed. Finally, some of the research challenges in biometric fusion are enumerated. The purpose of this article is to provide readers a comprehensive overview of the role of information fusion in biometrics.
http://arxiv.org/abs/1902.02919
In vision-enabled autonomous systems such as robots and autonomous cars, video object detection plays a crucial role, and both its speed and accuracy are important factors for reliable operation. The key insight we show in this paper is that speed and accuracy are not necessarily a trade-off when it comes to image scaling. Our results show that re-scaling the image to a lower resolution will sometimes produce better accuracy. Based on this observation, we propose a novel approach, dubbed AdaScale, which adaptively selects the input image scale that improves both accuracy and speed for video object detection. Our results on the ImageNet VID and mini YouTube-BoundingBoxes datasets demonstrate improvements of 1.3 points and 2.7 points mAP with 1.6x and 1.8x speedups, respectively. Additionally, we improve state-of-the-art video acceleration work by an extra 1.25x speedup with slightly better mAP on the ImageNet VID dataset.
http://arxiv.org/abs/1902.02910
This paper motivates and develops source traces for temporal difference (TD) learning in the tabular setting. Source traces are like eligibility traces, but model potential histories rather than immediate ones. This allows TD errors to be propagated to potential causal states and leads to faster generalization. Source traces can be thought of as the model-based, backward view of successor representations (SR), and share many of the same benefits. This view, however, suggests several new ideas. First, a TD($\lambda$)-like source learning algorithm is proposed and its convergence is proven. Then, a novel algorithm for learning the source map (or SR matrix) is developed and shown to outperform the previous algorithm. Finally, various approaches to using the source/SR model are explored, and it is shown that source traces can be effectively combined with other model-based methods like Dyna and experience replay.
http://arxiv.org/abs/1902.02907
Recent years have witnessed an increased focus on interpretability and the use of machine learning to inform policy analysis and decision making. This paper applies machine learning to examine travel behavior and, in particular, to model changes in travel modes when individuals are presented with a novel (on-demand) mobility option. It addresses the following question: can machine learning be applied to model individual taste heterogeneity (preference heterogeneity for travel modes and response heterogeneity to travel attributes) in travel mode choice? This paper first develops a high-accuracy classifier to predict mode-switching behavior under a hypothetical Mobility-on-Demand Transit system (i.e., stated-preference data), which represents the case study underlying this research. We show that this classifier naturally captures the individual heterogeneity available in the data. Moreover, the paper derives insights on heterogeneous switching behaviors through the generation of marginal effects and elasticities by current travel mode, partial dependence plots, and individual conditional expectation plots. The paper also proposes two new model-agnostic interpretation tools for machine learning, namely conditional partial dependence plots and conditional individual partial dependence plots, specifically designed to examine response heterogeneity. The results on the case study show that the machine-learning classifier, together with model-agnostic interpretation tools, provides valuable insights on travel mode switching behavior for different individuals and population segments. For example, existing drivers are more sensitive to additional pickups than people using other travel modes, and current transit users are generally willing to share rides but reluctant to take any additional transfers.
http://arxiv.org/abs/1902.02904
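To illustrate the "conditional" interpretation tools proposed above, the sketch below computes an ordinary partial dependence curve separately within each population segment; the function name, the array layout, and the assumption of an sklearn-style predict method are ours.

```python
# Hedged sketch of a conditional partial dependence plot (per-segment PDP).
import numpy as np

def conditional_pdp(model, X, feature, grid, segments):
    """Return {segment_value: mean prediction over grid} for one feature."""
    curves = {}
    for seg in np.unique(segments):
        Xs = X[segments == seg].copy()
        curve = []
        for v in grid:
            Xs[:, feature] = v                 # intervene on the feature of interest
            curve.append(model.predict(Xs).mean())
        curves[seg] = np.array(curve)
    return curves
```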
Reinforcement learning (RL) agents have traditionally been tasked with maximizing the value function of a Markov decision process (MDP), either in continuous settings, with fixed discount factor $\gamma < 1$, or in episodic settings, with $\gamma = 1$. While this has proven effective for specific tasks with well-defined objectives (e.g., games), it has never been established that fixed discounting is suitable for general purpose use (e.g., as a model of human preferences). This paper characterizes rationality in sequential decision making using a set of seven axioms and arrives at a form of discounting that generalizes traditional fixed discounting. In particular, our framework admits a state-action dependent “discount” factor that is not constrained to be less than 1, so long as there is eventual long run discounting. Although this broadens the range of possible preference structures in continuous settings, we show that there exists a unique “optimizing MDP” with fixed $\gamma < 1$ whose optimal value function matches the true utility of the optimal policy, and we quantify the difference between value and utility for suboptimal policies. Our work can be seen as providing a normative justification for (a slight generalization of) Martha White’s RL task formalism (2017) and other recent departures from the traditional RL, and is relevant to task specification in RL, inverse RL and preference-based RL.
http://arxiv.org/abs/1902.02893
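For concreteness, one way to read the state-action dependent "discount" factor described above is as a Bellman backup of the following schematic form; the precise axiomatic construction and the conditions for eventual long-run discounting are in the paper.

```latex
% Schematic Bellman backup with a state-action dependent discount factor:
\[
  V(s) \;=\; \max_{a}\Bigl[\, r(s,a) \;+\; \gamma(s,a)\,
      \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\bigl[V(s')\bigr] \Bigr],
\]
% where \gamma(s,a) may exceed 1 for some state-action pairs, provided that
% discounting still holds in the long run along every trajectory.
```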
We consider a novel approach to high-level robot task execution for an assistive robot task. In this work we explore the problem of learning to predict the next subtask by introducing a deep model both for sequencing goals and for visually evaluating the state of a task. We show that deep learning for monitoring robot task execution supports the interconnection between task-level planning and robot operations very well. These solutions can also cope with the natural non-determinism of the execution monitor. We show that a deep execution monitor boosts robot performance. We measure the improvement on a set of robot helping tasks performed in a warehouse.
https://arxiv.org/abs/1902.02877
Visual search for relevant targets in the environment is a crucial robot skill. We propose a preliminary framework for the execution monitor of a robot task, which manages the robot's disposition to visually search the environment for the targets involved in the task. Visual search is also relevant for recovering from failures. The framework exploits deep reinforcement learning to acquire a “common sense” scene structure, and it takes advantage of a deep convolutional network to detect objects and the relevant relations holding between them. The framework builds on these methods to introduce a vision-based execution monitor, which uses classical planning as a backbone for task execution. Experiments show that with the proposed vision-based execution monitor the robot can complete simple tasks and can autonomously recover from failures.
https://arxiv.org/abs/1902.02870
As autonomous robots increasingly become part of daily life, they will often encounter dynamic environments while only having limited information about their surroundings. Unfortunately, due to the possible presence of malicious dynamic actors, it is infeasible to develop an algorithm that can guarantee collision-free operation. Instead, one can attempt to design a control technique that guarantees the robot is not-at-fault in any collision. In the literature, making such guarantees in real time has been restricted to static environments or specific dynamic models. To ensure not-at-fault behavior, a robot must first correctly sense and predict the world around it within some sufficiently large sensor horizon (the prediction problem), then correctly control relative to the predictions (the control problem). This paper addresses the control problem by proposing Reachability-based Trajectory Design for Dynamic environments (RTD-D), which guarantees that a robot with an arbitrary nonlinear dynamic model correctly responds to predictions in arbitrary dynamic environments. RTD-D first computes a Forward Reachable Set (FRS) offline of the robot tracking parameterized desired trajectories that include fail-safe maneuvers. Then, for online receding-horizon planning, the method provides a way to discretize predictions of an arbitrary dynamic environment to enable real-time collision checking. The FRS is used to map these discretized predictions to trajectories that the robot can track while provably not-at-fault. One such trajectory is chosen at each iteration, or the robot executes the fail-safe maneuver from its previous trajectory which is guaranteed to be not at fault. RTD-D is shown to produce not-at-fault behavior over thousands of simulations and several real-world hardware demonstrations on two robots: a Segway, and a small electric vehicle.
https://arxiv.org/abs/1902.02851
Nowadays, the adoption of face recognition in biometric authentication systems is common, mainly because it is one of the most accessible biometric modalities. Techniques that attempt to bypass such systems using a forged biometric sample, such as a printed photograph or a recorded video of a genuine access, are known as presentation attacks, but may also be referred to in the literature as face spoofing. Presentation attack detection (PAD) is a crucial step for preventing this kind of unauthorized access to restricted areas and/or devices. In this paper, we propose a novel approach that relies on a combination of intrinsic image properties and deep neural networks to detect presentation attack attempts. Our method exploits depth, salience, and illumination maps, combined with a pre-trained Convolutional Neural Network, to produce robust and discriminant features. Each of these properties is classified individually and, at the end of the process, they are combined by a meta-learning classifier, which achieves outstanding results on the most popular PAD datasets. Results show that the proposed method is able to surpass state-of-the-art results in an inter-dataset protocol, regarded as the most challenging in the literature.
http://arxiv.org/abs/1902.02845
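The fusion step above is essentially a stacking scheme: one classifier per property, with their outputs combined by a meta-learning classifier. The sklearn sketch below shows that pattern; it assumes precomputed feature arrays and a single shared feature matrix for brevity, whereas the paper uses CNN-derived features per property.

```python
# Hedged sketch of per-property classifiers fused by a meta-learner (stacking).
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_fusion_classifier():
    # In the paper each property (depth, salience, illumination) has its own
    # feature extractor; here all base models share one feature matrix for brevity.
    base = [
        ("depth", make_pipeline(StandardScaler(), SVC(probability=True))),
        ("salience", make_pipeline(StandardScaler(), SVC(probability=True))),
        ("illumination", make_pipeline(StandardScaler(), SVC(probability=True))),
    ]
    return StackingClassifier(estimators=base,
                              final_estimator=LogisticRegression(),
                              stack_method="predict_proba")
```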
Human pose estimation - the process of recognizing a human’s limb positions and orientations in a video - has many important applications including surveillance, diagnosis of movement disorders, and computer animation. While deep learning has led to great advances in 2D and 3D pose estimation from single video sources, the problem of estimating 3D human pose from multiple video sensors with overlapping fields of view has received less attention. When the application allows the use of multiple cameras, 3D human pose estimates may be greatly improved through fusion of multi-view pose estimates and observation of limbs that are fully or partially occluded in some views. Past approaches to multi-view 3D pose estimation have used probabilistic graphical models to reason over constraints, including per-image pose estimates, temporal smoothness, and limb length. In this paper, we present a pipeline for multi-view 3D pose estimation of multiple individuals which combines a state-of-the-art 2D pose detector with a factor graph of 3D limb constraints optimized with belief propagation. We evaluate our results on the TUM-Campus and Shelf datasets for multi-person 3D pose estimation and show that our system significantly outperforms the previous state of the art with a simpler model of limb dependency.
http://arxiv.org/abs/1902.02841
An ideal outcome of pattern mining is a small set of informative patterns, containing no redundancy or noise, that identifies the key structure of the data at hand. Standard frequent pattern miners do not achieve this goal: due to the pattern explosion, they typically return very large numbers of highly redundant patterns. We pursue this ideal for sequential data by employing a pattern set mining approach, in which, instead of ranking patterns individually, we consider results as a whole. Pattern set mining has been successfully applied to transactional data, but has been surprisingly understudied for sequential data. In this paper, we employ the MDL principle to identify the set of sequential patterns that summarises the data best. In particular, we formalise how to encode sequential data using sets of serial episodes, and use the encoded length as a quality score. As search strategies, we propose two approaches: the first algorithm selects a good pattern set from a large candidate set, while the second is a parameter-free any-time algorithm that mines pattern sets directly from the data. Experimentation on synthetic and real data demonstrates that we efficiently discover small sets of informative patterns.
http://arxiv.org/abs/1902.02834
In this work, we use Belief Function Theory, which extends the probabilistic framework, to provide uncertainty bounds for different categories of crowd density estimators. Our method allows us to compare the multi-scale performance of the estimators, and also to characterize their reliability for crowd monitoring applications requiring varying degrees of prudence.
http://arxiv.org/abs/1902.02831
Controlling soft robots with precision is a challenge due in large part to the difficulty of constructing models that are amenable to model-based control design techniques. Koopman Operator Theory offers a way to construct explicit linear dynamical models of soft robots and to control them using established model-based linear control methods. This method is data-driven, yet unlike other data-driven models such as neural networks, it yields an explicit control-oriented linear model rather than just a “black-box” input-output mapping. This work describes this Koopman-based system identification method and its application to model predictive controller design. A model and MPC controller of a pneumatic soft robot arm was constructed via the method, and its performance was evaluated over several trajectory following tasks in the real-world. On all of the tasks, the Koopman-based MPC controller outperformed a benchmark MPC controller based on a linear state-space model of the same system.
https://arxiv.org/abs/1902.02827
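The core of the Koopman-based identification described above is a lifted linear least-squares fit. The NumPy sketch below shows that pattern under our own assumptions about the dictionary of observables and the data layout; the paper's choice of basis functions and model structure may differ.

```python
# Hedged sketch of a lifted (Koopman/EDMD-style) linear model fit.
import numpy as np

def lift(x):
    """Example dictionary of observables: the state, its squares, and a constant."""
    return np.concatenate([x, x ** 2, [1.0]])

def fit_koopman(X, U, Xnext):
    """Given snapshots x_k (rows of X), inputs u_k (rows of U) and successors
    x_{k+1} (rows of Xnext), fit  z_{k+1} ~ A z_k + B u_k  in the lifted space."""
    Z = np.stack([lift(x) for x in X])            # (N, nz)
    Znext = np.stack([lift(x) for x in Xnext])    # (N, nz)
    ZU = np.hstack([Z, U])                        # (N, nz + nu)
    M, *_ = np.linalg.lstsq(ZU, Znext, rcond=None)
    nz = Z.shape[1]
    A, B = M[:nz].T, M[nz:].T                     # z_next = A z + B u
    return A, B
```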
Image classification is vulnerable to adversarial attacks. This work investigates the robustness of the Saak transform against adversarial attacks for high-performance image classification. We develop a complete image classification system based on the multi-stage Saak transform. In the Saak transform domain, clean and adversarial images exhibit different distributions at different spectral dimensions. The selection of spectral dimensions at every stage can therefore be viewed as an automatic denoising process. Motivated by this observation, we carefully design feature extraction, representation, and classification strategies that increase adversarial robustness. Performance on well-known datasets and attacks is demonstrated by extensive experimental evaluations.
http://arxiv.org/abs/1902.02826
Recently, we have seen rapid development of Deep Neural Network (DNN) based visual tracking solutions. Some trackers combine DNN-based solutions with Discriminative Correlation Filters (DCF) to extract semantic features and successfully deliver state-of-the-art tracking accuracy. However, these solutions are highly compute-intensive and require long processing times, making reliable real-time performance hard to guarantee. To deliver both high accuracy and reliable real-time performance, we propose a novel tracker called SiamVGG. It combines a Convolutional Neural Network (CNN) backbone with a cross-correlation operator, and takes advantage of features from exemplar images for more accurate object tracking. The architecture of SiamVGG is customized from VGG-16, with the parameters shared by both the exemplar images and the input video frames. We demonstrate the proposed SiamVGG on the OTB-2013/50/100 and VOT 2015/2016/2017 datasets, where it attains state-of-the-art accuracy while maintaining a decent real-time speed of 50 FPS on a GTX 1080Ti. Our design achieves 2% higher Expected Average Overlap (EAO) compared to ECO and C-COT in the VOT2017 Challenge.
http://arxiv.org/abs/1902.02804
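The cross-correlation operator mentioned above is the standard Siamese-tracking step: the exemplar embedding is slid over the search-region embedding as a convolution kernel, producing a response map whose peak indicates the target location. A minimal PyTorch sketch (shapes are assumptions):

```python
# Minimal sketch of Siamese cross-correlation (SiamFC/SiamVGG style).
import torch
import torch.nn.functional as F

def xcorr(exemplar_feat, search_feat):
    """exemplar_feat: (1, C, h, w); search_feat: (1, C, H, W).
    Returns a (1, 1, H-h+1, W-w+1) response map; its argmax locates the target."""
    return F.conv2d(search_feat, exemplar_feat)
```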
Observations from the Kepler and K2 missions have provided the astronomical community with unprecedented amounts of data to search for transiting exoplanets and other astrophysical phenomena. Here, we present K2-288, a low-mass binary system (M2.0 +/- 1.0; M3.0 +/- 1.0) hosting a small (Rp = 1.9 REarth), temperate (Teq = 226 K) planet observed in K2 Campaign 4. The candidate was first identified by citizen scientists using Exoplanet Explorers hosted on the Zooniverse platform. Follow-up observations and detailed analyses validate the planet and indicate that it likely orbits the secondary star on a 31.39-day period. This orbit places K2-288Bb in or near the habitable zone of its low-mass host star. K2-288Bb resides in a system with a unique architecture, as it orbits at >0.1 au from one component in a moderate separation binary (aproj approximately 55 au), and further follow-up may provide insight into its formation and evolution. Additionally, its estimated size straddles the observed gap in the planet radius distribution. Planets of this size occur less frequently and may be in a transient phase of radius evolution. K2-288 is the third transiting planet system identified by the Exoplanet Explorers program and its discovery exemplifies the value of citizen science in the era of Kepler, K2, and the Transiting Exoplanet Survey Satellite.
https://arxiv.org/abs/1902.02789
360-degree cameras offer the possibility to cover a large area, for example an entire room, without using multiple distributed vision sensors. However, geometric distortions introduced by their lenses make computer vision problems more challenging. In this paper we address face detection in 360-degree fisheye images. We show how a face detector trained on regular images can be re-trained for this purpose, and we also provide a 360-degree fisheye-like version of the popular FDDB face detection dataset, which we call FDDB-360.
http://arxiv.org/abs/1902.02777
We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person views and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the task, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. We train and test CMP on navigation problems in simulation environments derived from scans of real-world buildings. Our experiments demonstrate that CMP outperforms alternative learning-based architectures, as well as classical mapping and path planning approaches, in many cases. Furthermore, it naturally extends to semantically specified goals, such as ‘going to a chair’. We also deploy CMP on physical robots in indoor environments, where it achieves reasonable performance, even though it is trained entirely in simulation.
http://arxiv.org/abs/1702.03920
Motivated by the recent potential of mass customization brought by whole-garment knitting machines, we introduce the new problem of automatic machine instruction generation from a single image of the desired physical product, which we apply to machine knitting. We propose to tackle this problem by directly learning to synthesize regular machine instructions from real images. We create a curated dataset of real samples with their instruction counterparts and propose a novel way of augmenting it with synthetic images. We theoretically motivate our data mixing framework and show empirical results suggesting that making real images look more synthetic is beneficial in our problem setup.
http://arxiv.org/abs/1902.02752
The Pix2pix and CycleGAN losses have vastly improved the qualitative and quantitative visual quality of results in image-to-image translation tasks. We extend this framework by exploring approximately invertible architectures which are well suited to these losses. These architectures are approximately invertible by design and thus partially satisfy cycle-consistency before training even begins. Furthermore, since invertible architectures have constant memory complexity in depth, these models can be built arbitrarily deep. We are able to demonstrate superior quantitative output on the Cityscapes and Maps datasets at near constant memory budget.
http://arxiv.org/abs/1902.02729
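One concrete example of the "invertible by design" building blocks alluded to above is an additive coupling (RevNet-style) layer, which can be inverted exactly and therefore satisfies part of cycle-consistency before training. The PyTorch sketch below is illustrative only; the sub-networks F and G and the channel split are our assumptions.

```python
# Hedged sketch of an exactly invertible additive coupling block.
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    def __init__(self, channels):               # channels assumed even
        super().__init__()
        half = channels // 2
        self.F = nn.Conv2d(half, half, 3, padding=1)
        self.G = nn.Conv2d(half, half, 3, padding=1)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return torch.cat([y1, y2], dim=1)

    def inverse(self, y):                        # exact inverse, no stored activations needed
        y1, y2 = y.chunk(2, dim=1)
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return torch.cat([x1, x2], dim=1)
```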
Localizing an object accurately with respect to a robot is a key step for autonomous robotic manipulation. In this work, we propose to tackle this task knowing only 3D models of the robot and object in the particular case where the scene is viewed from uncalibrated cameras – a situation which would be typical in an uncontrolled environment, e.g., on a construction site. We demonstrate that this localization can be performed very accurately, with millimetric errors, without using a single real image for training, a strong advantage since acquiring representative training data is a long and expensive process. Our approach relies on a classification Convolutional Neural Network (CNN) trained using hundreds of thousands of synthetically rendered scenes with randomized parameters. To evaluate our approach quantitatively and make it comparable to alternative approaches, we build a new rich dataset of real robot images with accurately localized blocks.
http://arxiv.org/abs/1902.02711
Opinion phrase extraction is one of the key tasks in fine-grained sentiment analysis. While opinion expressions can be generic subjective expressions, aspect-specific opinion expressions contain both the aspect and the opinion expression within the original sentence context. In this work, we formulate the task as an instance of token-level sequence labeling. When multiple aspects are present in a sentence, detecting opinion phrase boundaries becomes difficult, and the label of each word depends not only on the surrounding words but also on the aspect concerned. We propose a neural network architecture with a bidirectional LSTM (Bi-LSTM) and a novel attention mechanism. The Bi-LSTM layer learns the various sequential patterns among the words without requiring any hand-crafted features. The attention mechanism captures, via location- and content-based memory, the importance of context words for a particular aspect's opinion expression when multiple aspects are present in a sentence. A Conditional Random Field (CRF) model is incorporated in the final layer to explicitly model the dependencies among the output labels. Experimental results on a hotel dataset from Tripadvisor.com show that our approach outperforms several state-of-the-art baselines.
http://arxiv.org/abs/1902.02709
Stickers are popularly used in messaging apps such as Hike to visually express a nuanced range of thoughts and utterances and to convey exaggerated emotions. However, discovering the right sticker at the right time in a chat, from a large and ever-expanding pool of stickers, can be cumbersome. In this paper, we describe a system for recommending stickers as users chat, based on what the user is typing and the conversational context. We decompose the sticker recommendation problem into two steps. First, we predict the next message that the user is likely to send in the chat. Second, we substitute the predicted message with an appropriate sticker. The majority of Hike’s users transliterate messages from their native language to English. This leads to numerous orthographic variations of the same message and thus complicates message prediction. To address this issue, we cluster messages that have the same meaning and predict the message cluster instead of the message. We experiment with different approaches to train embeddings for chat messages and study their efficacy in learning similar dense representations for messages that have the same intent. We propose a novel hybrid message prediction model, which can run with low latency on low-end phones that have severe computational limitations.
http://arxiv.org/abs/1902.02704
Unsupervised object discovery in images involves uncovering recurring patterns that define objects and discriminate them from the background. This is more challenging than image clustering, as the size and location of the objects are not known: this adds additional degrees of freedom and increases the problem complexity. In this work, we propose StampNet, a novel autoencoding neural network that localizes shapes (objects) over a simple background in images and categorizes them simultaneously. StampNet consists of a discrete latent space that is used to categorize objects and to determine their locations. The object categories are formed during training, resulting in the discovery of a fixed set of objects. We present a set of experiments demonstrating that StampNet is able to localize and cluster multiple overlapping shapes of varying complexity, including digits from the MNIST dataset. We also present an application of StampNet to the localization of pedestrians in overhead depth maps.
http://arxiv.org/abs/1902.02693
InGaN-based visible LEDs find commercial applications for solid-state lighting and displays, but lattice mismatch limits the thickness of InGaN quantum wells that can be grown on GaN with high crystalline quality. Since narrower wells operate at a higher carrier density for a given current density, they increase the fraction of carriers lost to Auger recombination and lower the efficiency. The incorporation of boron, a smaller group-III element, into InGaN alloys is a promising method to eliminate the lattice mismatch and realize high-power, high-efficiency visible LEDs with thick active regions. In this work we apply predictive calculations based on hybrid density functional theory to investigate the thermodynamic, structural, and electronic properties of BInGaN alloys. Our results show that BInGaN alloys with a B:In ratio of 2:3 are better lattice matched to GaN compared to InGaN and, for indium fractions less than 0.2, nearly lattice matched. Deviations from Vegard’s law appear as bowing of the in-plane lattice constant with respect to composition. Our thermodynamics calculations demonstrate that the solubility of boron is higher in InGaN than in pure GaN. Varying the Ga mole fraction while keeping the B:In ratio constant enables the adjustment of the (direct) gap in the 1.75-3.39 eV range, which covers the entire visible spectrum. Holes are strongly localized in non-bonded N 2p states caused by local bond planarization near boron atoms. Our results indicate that BInGaN alloys are promising for fabricating nitride heterostructures with thick active regions for high-power, high-efficiency LEDs.
https://arxiv.org/abs/1902.02692
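For reference, the lattice-constant interpolation implied by the abstract above (Vegard's law plus a composition-dependent deviation) can be written schematically as below; the bowing term is left generic, since the paper's fitted values are not quoted here.

```latex
% Vegard's law for B_x In_y Ga_{1-x-y} N with a generic bowing correction:
\[
  a(x, y) \;\approx\; x\, a_{\mathrm{BN}} \;+\; y\, a_{\mathrm{InN}}
  \;+\; (1 - x - y)\, a_{\mathrm{GaN}} \;-\; \Delta_{\mathrm{bow}}(x, y),
\]
% where \Delta_{\mathrm{bow}} is the composition-dependent deviation ("bowing")
% of the in-plane lattice constant that the first-principles calculations quantify.
```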
In this work, we propose a single deep neural network for panoptic segmentation, for which the goal is to provide each individual pixel of an input image with a class label, as in semantic segmentation, as well as a unique identifier for specific objects in an image, following instance segmentation. Our network makes joint semantic and instance segmentation predictions and combines these to form an output in the panoptic format. This has two main benefits: firstly, the entire panoptic prediction is made in one pass, reducing the required computation time and resources; secondly, by learning the tasks jointly, information is shared between the two tasks, thereby improving performance. Our network is evaluated on two street scene datasets: Cityscapes and Mapillary Vistas. By leveraging information exchange and improving the merging heuristics, we increase the performance of the single network, and achieve a score of 23.9 on the Panoptic Quality (PQ) metric on Mapillary Vistas validation, with an input resolution of 640 x 900 pixels. On Cityscapes validation, our method achieves a PQ score of 45.9 with an input resolution of 512 x 1024 pixels. Moreover, our method decreases the prediction time by a factor of 2 with respect to separate networks.
http://arxiv.org/abs/1902.02678
Multi-task learning allows the sharing of useful information between multiple related tasks. In natural language processing several recent approaches have successfully leveraged unsupervised pre-training on large amounts of data to perform well on various tasks, such as those in the GLUE benchmark. These results are based on fine-tuning on each task separately. We explore the multi-task learning setting for the recent BERT model on the GLUE benchmark, and how to best add task-specific parameters to a pre-trained BERT network, with a high degree of parameter sharing between tasks. We introduce new adaptation modules, PALs or ‘projected attention layers’, which use a low-dimensional multi-head attention mechanism, based on the idea that it is important to include layers with inductive biases useful for the input domain. By using PALs in parallel with BERT layers, we match the performance of fine-tuned BERT on the GLUE benchmark with roughly 7 times fewer parameters, and obtain state-of-the-art results on the Recognizing Textual Entailment dataset.
http://arxiv.org/abs/1902.02671
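The sketch below is our reading of what a "projected attention layer" adapter might look like: project the hidden states down to a small dimension, run multi-head attention there, project back up, and add the result in parallel with the corresponding BERT layer. The dimensions and the placement are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a projected attention layer (PAL) adapter module.
import torch
import torch.nn as nn

class ProjectedAttentionLayer(nn.Module):
    def __init__(self, hidden=768, low=204, heads=12):
        super().__init__()
        self.down = nn.Linear(hidden, low)     # project to a low dimension
        self.attn = nn.MultiheadAttention(low, heads, batch_first=True)
        self.up = nn.Linear(low, hidden)       # project back to hidden size

    def forward(self, h):
        z = self.down(h)
        z, _ = self.attn(z, z, z)
        return self.up(z)                      # added in parallel to the BERT layer output
```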
We address the problem of efficient exploration by proposing a new meta-algorithm in the context of model-based online planning for Bayesian Reinforcement Learning (BRL). We beat the state of the art while remaining computationally faster, in some cases by two orders of magnitude. This is the first optimism-free BRL algorithm to beat all previous state-of-the-art methods in tabular RL. The main novelty is the use of a candidate policy generator to generate long-term options in the belief tree, which allows us to create much sparser and deeper trees. We present results on many standard environments and empirically demonstrate its performance.
http://arxiv.org/abs/1902.02661
The resolution of many robotics problems demands the optimization of a certain reward (or cost) over a huge number of candidate paths. In this paper, we describe an integer programming (IP) solution methodology for path-based optimization problems that is both easy to apply (in two simple steps) and frequently high-performance in terms of computation time and achieved optimality. We demonstrate the generality of our approach through its application to three challenging path-based optimization problems: multi-robot path planning (MPP), minimum constraint removal (MCR), and reward collection problems (RCPs). In the process, in addition to providing basic working IP models, we describe best practices (which include many generic heuristics) that lead to significant performance boosts. Associated simulation experiments show that the approach can efficiently produce (near-)optimal solutions for problems with large state spaces, complex constraints, and complicated objective functions.
http://arxiv.org/abs/1902.02652
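As a toy illustration of the "cast a path problem as an IP" step (not one of the paper's actual models), the PuLP sketch below encodes a shortest-path problem on a small directed graph with binary edge variables and flow-conservation constraints.

```python
# Toy path-based IP: shortest path with flow conservation, solved via PuLP/CBC.
import pulp

edges = {("s", "a"): 1, ("s", "b"): 4, ("a", "b"): 1, ("a", "t"): 5, ("b", "t"): 1}
nodes = {"s", "a", "b", "t"}

prob = pulp.LpProblem("shortest_path", pulp.LpMinimize)
x = {e: pulp.LpVariable(f"x_{e[0]}_{e[1]}", cat="Binary") for e in edges}
prob += pulp.lpSum(c * x[e] for e, c in edges.items())          # total path cost
for v in nodes:                                                 # flow conservation
    out_v = pulp.lpSum(x[e] for e in edges if e[0] == v)
    in_v = pulp.lpSum(x[e] for e in edges if e[1] == v)
    prob += out_v - in_v == (1 if v == "s" else -1 if v == "t" else 0)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
path = [e for e in edges if x[e].value() == 1]                  # e.g. s->a->b->t, cost 3
```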
We present and characterize a simple method for detecting pointing gestures suitable for human-robot interaction applications using a commodity RGB-D camera. We exploit a state-of-the-art Deep CNN-based detector to find hands and faces in RGB images, then examine the corresponding depth channel pixels to obtain full 3D pointing vectors. We test several methods of estimating the hand end-point of the pointing vector. The system runs at better than 30Hz on commodity hardware: exceeding the frame rate of typical RGB-D sensors. An estimate of the absolute pointing accuracy is found empirically by comparison with ground-truth data from a VICON motion-capture system, and the useful interaction volume established. Finally, we show an end-to-end test where a robot estimates where the pointing vector intersects the ground plane, and report the accuracy obtained. We provide source code as a ROS node, with the intention of contributing a commodity implementation of this common component in HRI systems.
http://arxiv.org/abs/1902.02636
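The final geometric step described above (intersecting the pointing ray with the ground plane) is simple enough to state directly. The sketch below assumes the face and hand points are already expressed in a frame whose z = 0 plane is the ground; naming and conventions are ours.

```python
# Minimal sketch: intersect the face->hand pointing ray with the ground plane z = 0.
import numpy as np

def pointing_target(face_xyz, hand_xyz):
    """Return the ground-plane point indicated by the face->hand ray, or None."""
    face = np.asarray(face_xyz, dtype=float)
    hand = np.asarray(hand_xyz, dtype=float)
    direction = hand - face
    if abs(direction[2]) < 1e-9:          # ray (nearly) parallel to the ground
        return None
    t = -face[2] / direction[2]           # solve face_z + t * dir_z = 0
    if t <= 0:                            # pointing away from the ground plane
        return None
    return face + t * direction
```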
Steganography is the science of hiding a secret message within an ordinary public message, which is referred to as the carrier. Traditionally, digital signal processing techniques, such as least significant bit encoding, were used for hiding messages. In this paper, we explore the use of deep neural networks as steganographic functions for speech data. To this end, we propose to jointly optimize two neural networks: the first network encodes the message inside a carrier, while the second network decodes the message from the modified carrier. We demonstrate the effectiveness of our method on several speech datasets and analyze the results quantitatively and qualitatively. Moreover, we show that our approach can be applied to conceal multiple messages in a single carrier using multiple decoders or a single conditional decoder. Qualitative experiments suggest that modifications to the carrier are unnoticeable by human listeners and that the decoded messages are highly intelligible.
https://arxiv.org/abs/1902.03083
Accuracy is an important concern for suppliers of artificial intelligence (AI) services, but considerations beyond accuracy, such as safety (which includes fairness and explainability), security, and provenance, are also critical elements to engender consumers’ trust in a service. Many industries use transparent, standardized, but often not legally required documents called supplier’s declarations of conformity (SDoCs) to describe the lineage of a product along with the safety and performance testing it has undergone. SDoCs may be considered multi-dimensional fact sheets that capture and quantify various aspects of the product and its development to make it worthy of consumers’ trust. Inspired by this practice, we propose FactSheets to help increase trust in AI services. We envision such documents to contain purpose, performance, safety, security, and provenance information to be completed by AI service providers for examination by consumers. We suggest a comprehensive set of declaration items tailored to AI and provide examples for two fictitious AI services in the appendix of the paper.
http://arxiv.org/abs/1808.07261
Supervised classification and representation learning are two widely used methods to analyze multivariate images. Although complementary, these two classes of methods have rarely been considered jointly. In this paper, a method coupling the two approaches is designed using a matrix cofactorization formulation. Each task is modeled as a matrix factorization problem, and a term relating both coding matrices is then introduced to drive an appropriate coupling. The link can be interpreted as a clustering operation over the low-dimensional representation vectors. The attribution vectors of the clustering are then used as feature vectors for the classification task, i.e., the coding vectors of the corresponding factorization problem. A proximal gradient descent algorithm, ensuring convergence to a critical point of the objective function, is then derived to solve the resulting non-convex, non-smooth optimization problem. An evaluation of the proposed method is finally conducted on both synthetic and real data in the specific context of hyperspectral image interpretation, unifying two standard analysis techniques, namely unmixing and classification.
http://arxiv.org/abs/1902.02597
Beauty is in the eye of the beholder. This maxim, emphasizing the subjectivity of the perception of beauty, has enjoyed a wide consensus since ancient times. In the digital era, data-driven methods have been shown to be able to predict human-assigned beauty scores for facial images. In this work, we augment this ability and train a generative model that generates faces conditioned on a requested beauty score. In addition, we show how this trained generator can be used to beautify an input face image. By doing so, we achieve an unsupervised beautification model, in the sense that it relies on no ground-truth target images.
http://arxiv.org/abs/1902.02593