The ability to recover from a fall is an essential feature for a legged robot to navigate in challenging environments robustly. Until today, there has been very little progress on this topic. Current solutions mostly build upon (heuristically) predefined trajectories, resulting in unnatural behaviors and requiring considerable effort in engineering system-specific components. In this paper, we present an approach based on model-free Deep Reinforcement Learning (RL) to control recovery maneuvers of quadrupedal robots using a hierarchical behavior-based controller. The controller consists of four neural network policies including three behaviors and one behavior selector to coordinate them. Each of them is trained individually in simulation and deployed directly on a real system. We experimentally validate our approach on the quadrupedal robot ANYmal, which is a dog-sized quadrupedal system with 12 degrees of freedom. With our method, ANYmal manifests dynamic and reactive recovery behaviors to recover from an arbitrary fall configuration within less than 5 seconds. We tested the recovery maneuver more than 100 times, and the success rate was higher than 97 %.
http://arxiv.org/abs/1901.07517
Machine Learning has been a big success story during the AI resurgence. One particular stand out success relates to unsupervised learning from a massive amount of data, albeit much of it relates to one modality/type of data at a time. In spite of early assertions of the unreasonable effectiveness of data, there is increasing recognition of utilizing knowledge whenever it is available or can be created purposefully. In this paper, we focus on discussing the indispensable role of knowledge for deeper understanding of complex text and multimodal data in situations where (i) large amounts of training data (labeled/unlabeled) are not available or labor intensive to create, (ii) the objects (particularly text) to be recognized are complex (i.e., beyond simple entity-person/location/organization names), such as implicit entities and highly subjective content, and (iii) applications need to use complementary or related data in multiple modalities/media. What brings us to the cusp of rapid progress is our ability to (a) create knowledge, varying from comprehensive or cross domain to domain or application specific, and (b) carefully exploit the knowledge to further empower or extend the applications of ML/NLP techniques. Using the early results in several diverse situations - both in data types and applications - we seek to foretell unprecedented progress in our ability for deeper understanding and exploitation of multimodal data.
http://arxiv.org/abs/1610.07708
We build CSI-Net, a unified Deep Neural Network~(DNN), to learn the representation of WiFi signals. Using CSI-Net, we jointly solved two body characterization problems: biometrics estimation (including body fat, muscle, water, and bone rates) and person recognition. We also demonstrated the application of CSI-Net on two distinctive pose recognition tasks: the hand sign recognition (fine-scaled action of the hand) and falling detection (coarse-scaled motion of the body).
http://arxiv.org/abs/1810.03064
The key idea of variational auto-encoders (VAEs) resembles that of traditional auto-encoder models in which spatial information is supposed to be explicitly encoded in the latent space. However, the latent variables in VAEs are vectors, which can be interpreted as multiple feature maps of size 1x1. Such representations can only convey spatial information implicitly when coupled with powerful decoders. In this work, we propose spatial VAEs that use feature maps of larger size as latent variables to explicitly capture spatial information. This is achieved by allowing the latent variables to be sampled from matrix-variate normal (MVN) distributions whose parameters are computed from the encoder network. To increase dependencies among locations on latent feature maps and reduce the number of parameters, we further propose spatial VAEs via low-rank MVN distributions. Experimental results show that the proposed spatial VAEs outperform original VAEs in capturing rich structural and spatial information.
http://arxiv.org/abs/1705.06821
We propose a quantitative and qualitative analysis of the performances of statistical models for frame semantic structure extraction. We report on a replication study on FrameNet 1.7 data and show that preprocessing toolkits play a major role in argument identification performances, observing gains similar in their order of magnitude to those reported by recent models for frame semantic parsing. We report on the robustness of a recent statistical classifier for frame semantic parsing to lexical configurations of predicate-argument structures, relying on an artificially augmented dataset generated using a rule-based algorithm combining valence pattern matching and lexical substitution. We prove that syntactic pre-processing plays a major role in the performances of statistical classifiers to argument identification, and discuss the core reasons of syntactic mismatch between dependency parsers output and FrameNet syntactic formalism. Finally, we suggest new leads for improving statistical models for frame semantic parsing, including joint syntax-semantic parsing relying on FrameNet syntactic formalism, latent classes inference via split-and-merge algorithms and neural network architectures relying on rich input representations of words.
http://arxiv.org/abs/1901.07475
Recognizing pedestrian attributes is an important task in computer vision community due to it plays an important role in video surveillance. Many algorithms has been proposed to handle this task. The goal of this paper is to review existing works using traditional methods or based on deep learning networks. Firstly, we introduce the background of pedestrian attributes recognition (PAR, for short), including the fundamental concepts of pedestrian attributes and corresponding challenges. Secondly, we introduce existing benchmarks, including popular datasets and evaluation criterion. Thirdly, we analyse the concept of multi-task learning and multi-label learning, and also explain the relations between these two learning algorithms and pedestrian attribute recognition. We also review some popular network architectures which have widely applied in the deep learning community. Fourthly, we analyse popular solutions for this task, such as attributes group, part-based, \emph{etc}. Fifthly, we shown some applications which takes pedestrian attributes into consideration and achieve better performance. Finally, we summarized this paper and give several possible research directions for pedestrian attributes recognition. The project page of this paper can be found from the following website: \url{https://sites.google.com/view/ahu-pedestrianattributes/}.
http://arxiv.org/abs/1901.07474
We explore the problem of intersection classification using monocular on-board passive vision, with the goal of classifying traffic scenes with respect to road topology. We divide the existing approaches into two broad categories according to the type of input data: (a) first person vision (FPV) approaches, which use an egocentric view sequence as the intersection is passed; and (b) third person vision (TPV) approaches, which use a single view immediately before entering the intersection. The FPV and TPV approaches each have advantages and disadvantages. Therefore, we aim to combine them into a unified deep learning framework. Experimental results show that the proposed FPV-TPV scheme outperforms previous methods and only requires minimal FPV/TPV measurements.
http://arxiv.org/abs/1901.07446
We present a labeled large-scale, high resolution chest x-ray dataset for the automated exploration of medical images along with their associated reports. This dataset includes more than 160,000 images obtained from 67,000 patients that were interpreted and reported by radiologists at Hospital San Juan Hospital (Spain) from 2009 to 2017, covering six different position views and additional information on image acquisition and patient demography. The reports were labeled with 174 different radiographic findings, 19 differential diagnoses and 104 anatomic locations organized as a hierarchical taxonomy and mapped onto standard Unified Medical Language System (UMLS) terminology. Of these reports, 27% were manually annotated by trained physicians and the remaining set was labeled using a supervised method based on a recurrent neural network with attention mechanisms. The labels generated were then validated in an independent test set achieving a 0.93 Micro-F1 score. To the best of our knowledge, this is the largest public chest x-ray database suitable for training supervised models concerning radiographs, and the first to contain radiographic reports in Spanish. The PadChest dataset can be downloaded from this http URL
http://arxiv.org/abs/1901.07441
Recently, Graph Convolutional Networks (GCNs) have been widely studied for graph-structured data representation and learning. However, in many real applications, data are coming with multiple graphs, and it is non-trivial to adapt GCNs to deal with data representation with multiple graph structures. One main challenge for multi-graph representation is how to exploit both structure information of each individual graph and correlation information across multiple graphs simultaneously. In this paper, we propose a novel Multiple Graph Adversarial Learning (MGAL) framework for multi-graph representation and learning. MGAL aims to learn an optimal structure-invariant and consistent representation for multiple graphs in a common subspace via a novel adversarial learning framework, which thus incorporates both structure information of intra-graph and correlation information of inter-graphs simultaneously. Based on MGAL, we then provide a unified network for semi-supervised learning task. Promising experimental results demonstrate the effectiveness of MGAL model.
http://arxiv.org/abs/1901.07439
Two variants of multi-robot search for a stationary object in a priori known environment represented by a graph are studied in the paper. The first one is a generalization of the Traveling Deliveryman Problem where more than one deliveryman is allowed to be used in a solution. Similarly, the second variant is a generalization of the Graph Search Problem. A novel heuristics suitable for both problems is proposed which is furthermore integrated into a cluster-first route second approach. A set of computational experiments was conducted over the benchmark instances derived from the TSPLIB library. The results obtained show that even a standalone heuristics significantly outperforms the standard solution based on k- means clustering in quality of results as well as computational time. The integrated approach furthermore improves solutions found by a standalone heuristics by up to 15% at the expense of higher computational complexity.
http://arxiv.org/abs/1901.07434
An algorithm for robot formation path planning is presented in this paper. Given a map of the working environment, the algorithm finds a path for a formation taking into account possible split of the formation and its consecutive merge. The key part of the solution works on a graph and sequentially employs an extended version of Dijkstra’s graph-based algorithm for multiple robots. It is thus deterministic, complete, computationally inexpensive, and finds a solution for a fixed source node to another node in the graph. Moreover, the presented solution is general enough to be incorporated into high-level tasks like cooperative surveillance and it can benefit from state-of-the-art formation motion planning approaches, which can be used for evaluation of edges of an input graph. The performed experimental results demonstrate the behavior of the method in complex environments for formations consisting of tens of robots.
http://arxiv.org/abs/1901.08444
Automatic poetry generation is novel and interesting application of natural language processing research. It became more popular during the last few years due to the rapid development of technology and neural computing power. This line of research can be applied to the study of linguistics and literature, for social science experiments, or simply for entertainment. The most effective known method of artificial poem generation uses recurrent neural networks (RNN). We also used RNNs to generate poems in the style of Adam Mickiewicz. Our network was trained on the Sir Thaddeus poem. For data pre-processing, we used a specialized stemming tool, which is one of the major innovations and contributions of this work. Our experiment was conducted on the source text, divided into sub-word units (at a level of resolution close to syllables). This approach is novel and is not often employed in the published literature. The subwords units seem to be a natural choice for analysis of the Polish language, as the language is morphologically rich due to cases, gender forms and a large vocabulary. Moreover, Sir Thaddeus contains rhymes, so the analysis of syllables can be meaningful. We verified our model with different settings for the temperature parameter, which controls the randomness of the generated text. We also compared our results with similar models trained on the same text but divided into characters (which is the most common approach alongside the use of full word units). The differences were tremendous. Our solution generated much better poems that were able to follow the metre and vocabulary of the source data text.
http://arxiv.org/abs/1901.07426
In this paper, we present an integrated solution to memory-efficient environment modeling by an autonomous mobile robot equipped with a laser range-finder. Majority of nowadays approaches to autonomous environment modeling, called exploration, employs occupancy grids as environment representation where the working space is divided into small cells each storing information about the corresponding piece of the environment in the form of a probabilistic estimate of its state. In contrast, the presented approach uses a polygonal representation of the explored environment which consumes much less memory, enables fast planning and decision-making algorithms and it is thus reliable for large-scale environments. Simultaneous localization and mapping (SLAM) has been integrated into the presented framework to correct odometry errors and to provide accurate position estimates. This involves also a refinement of the already generated environment model in case of loop closure, i.e. when the robot detects that it revisited an already explored place. The framework has been implemented in Robot Operating System (ROS) and tested with a real robot in various environments. The experiments show that the polygonal representation with SLAM integrated can be used in the real world as it is fast, memory efficient and accurate. Moreover, the refinement can be executed in real-time during the exploration process.
http://arxiv.org/abs/1901.07423
In order to ensure efficient flow of goods in an automated warehouse and to guarantee its continuous distribution to/from picking stations in an effective way, decisions about which goods will be delivered to which particular picking station by which robot and by which path and in which time have to be made based on the current state of the warehouse. This task involves solution of two suproblems: (1) task allocation in which an assignment of robots to goods they have to deliver at a particular time is found and (2) planning of collision-free trajectories for particular robots (given their actual and goal positions). The trajectory planning problem is addressed in this paper taking into account specifics of automated warehouses. First, assignments of all robots are not known in advance, they are instead presented to the algorithm gradually one by one. Moreover, we do not optimize a makespan, but a throughput - the sum of individual robot plan costs. We introduce a novel approach to this problem which is based on the context-aware route planning algorithm [1]. The performed experimental results show that the proposed approach has a lower fail rate and produces results of higher quality than the original algorithm. This is redeemed by higher computational complexity which is nevertheless low enough for real-time planning.
http://arxiv.org/abs/1901.07422
Segmentation of both white matter lesions and deep grey matter structures is an important task in the quantification of magnetic resonance imaging in multiple sclerosis. Typically these tasks are performed separately: in this paper we present a single CNN-based segmentation solution for providing fast, reliable segmentations of multimodal MR imagies into lesion classes and healthy-appearing grey- and white-matter structures. We show substantial, statistically significant improvements in both Dice coefficient and in lesion-wise specificity and sensitivity, compared to previous approaches, and agreement with individual human raters in the range of human inter-rater variability. The method is trained on data gathered from a single centre: nonetheless, it performs well on data from centres, scanners and field-strengths not represented in the training dataset. A retrospective study found that the classifier successfully identified lesions missed by the human raters. Lesion labels were provided by human raters, while weak labels for other brain structures (including CSF, cortical grey matter, cortical white matter, cerebellum, amygdala, hippocampus, subcortical GM structures and choroid plexus) were provided by Freesurfer 5.3. The segmentations of these structures compared well, not only with Freesurfer 5.3, but also with FSL-First and Freesurfer 6.1.
http://arxiv.org/abs/1901.07419
A novel multi-robot path planning approach is presented in this paper. Based on the standard Dijkstra, the algorithm looks for the optimal paths for a formation of robots, taking into account the possibility of split and merge. The algorithm explores a graph representation of the environment, computing for each node the cost of moving a number of robots and their corresponding paths. In every node where the formation can split, all the new possible formation subdivisions are taken into account accordingly to their individual costs. In the same way, in every node where the formation can merge, the algorithm verifies whether the combination is possible and, if possible, computes the new cost. In order to manage split and merge situations, a set of constraints is applied. The proposed algorithm is thus deterministic, complete and finds an optimal solution from a source node to all other nodes in the graph. The presented solution is general enough to be incorporated into high-level tasks as well as it can benefit from state-of-the-art formation motion planning approaches, which can be used for evaluation of edges of an input graph. The presented experimental results demonstrate the ability of the method to find the optimal solution for a formation of robots in environments with varying complexity.
http://arxiv.org/abs/1901.07408
Legibility is the extent to which a space can be easily recognized. Evaluating legibility is particularly desirable in indoor spaces, since it has a large impact on human behavior and the efficiency of space utilization. However, indoor space legibility has only been studied through survey and trivial simulations and lacks reliable quantitative measurement. We utilized a Deep Convolutional Neural Network (DCNN), which is structurally similar to a human perception system, to model legibility in indoor spaces. To implement the modeling of legibility for any indoor spaces, we designed an end-to-end processing pipeline from indoor data retrieving to model training to spatial legibility analysis. Although the model performed very well (98% top-1 accuracy) overall, there are still discrepancies in accuracy among different spaces, reflecting legibility differences. To prove the validity of the pipeline, we deployed a survey on Amazon Mechanical Turk, collecting 4,015 samples. The human samples showed a similar behavior pattern and mechanism as the DCNN models. Further, we used model results to visually explain legibility in different architectural programs, building age, building style, visual clusterings of spaces and visual explanations for building age and architectural functions.
http://arxiv.org/abs/1901.10553
This paper addresses the problem of coordination of a fleet of mobile robots - the problem of finding an optimal set of collision-free trajectories for individual robots in the fleet. Many approaches have been introduced during the last decades, but a minority of them is practically applicable, i.e. fast, producing near-optimal solutions, and complete. We propose a novel probabilistic approach based on the Rapidly Exploring Random Tree algorithm (RRT) by significantly improving its multi-robot variant for discrete environments. The presented experimental results show that the proposed approach is fast enough to solve problems with tens of robots in seconds. Although the solutions generated by the approach are slightly worse than one of the best state-of-the-art algorithms presented in (ter Mors et al., 2010), it solves problems where ter Mors’s algorithm fails.
http://arxiv.org/abs/1901.07363
We address the problem of online localization and tracking of multiple moving speakers in reverberant environments. The paper has the following contributions. We use the direct-path relative transfer function (DP-RTF), an inter-channel feature that encodes acoustic information robust against reverberation, and we propose an online algorithm well suited for estimating DP-RTFs associated with moving audio sources. Another crucial ingredient of the proposed method is its ability to properly assign DP-RTFs to audio-source directions. Towards this goal, we adopt a maximum-likelihood formulation and we propose to use an exponentiated gradient (EG) to efficiently update source-direction estimates starting from their currently available values. The problem of multiple speaker tracking is computationally intractable because the number of possible associations between observed source directions and physical speakers grows exponentially with time. We adopt a Bayesian framework and we propose a variational approximation of the posterior filtering distribution associated with multiple speaker tracking, as well as an efficient variational expectation-maximization (VEM) solver. The proposed online localization and tracking method is thoroughly evaluated using two datasets that contain recordings performed in real environments.
http://arxiv.org/abs/1809.10936
Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we obtain 34.3 BLEU on WMT’16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT’16 Romanian-English, outperforming the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.
http://arxiv.org/abs/1901.07291
The RGB-D camera maintains a limited range for working and is hard to accurately measure the depth information in a far distance. Besides, the RGB-D camera will easily be influenced by strong lighting and other external factors, which will lead to a poor accuracy on the acquired environmental depth information. Recently, deep learning technologies have achieved great success in the visual SLAM area, which can directly learn high-level features from the visual inputs and improve the estimation accuracy of the depth information. Therefore, deep learning technologies maintain the potential to extend the source of the depth information and improve the performance of the SLAM system. However, the existing deep learning-based methods are mainly supervised and require a large amount of ground-truth depth data, which is hard to acquire because of the realistic constraints. In this paper, we first present an unsupervised learning framework, which not only uses image reconstruction for supervising but also exploits the pose estimation method to enhance the supervised signal and add training constraints for the task of monocular depth and camera motion estimation. Furthermore, we successfully exploit our unsupervised learning framework to assist the traditional ORB-SLAM system when the initialization module of ORB-SLAM method could not match enough features. Qualitative and quantitative experiments have shown that our unsupervised learning framework performs the depth estimation task comparable to the supervised methods and outperforms the previous state-of-the-art approach by $13.5\%$ on KITTI dataset. Besides, our unsupervised learning framework could significantly accelerate the initialization process of ORB-SLAM system and effectively improve the accuracy on environmental mapping in strong lighting and weak texture scenes.
http://arxiv.org/abs/1901.07288
Face tracking serves as the crucial initial step in mobile applications trying to analyse target faces over time in mobile settings. However, this problem has received little attention, mainly due to the scarcity of dedicated face tracking benchmarks. In this work, we introduce MobiFace, the first dataset for single face tracking in mobile situations. It consists of 80 unedited live-streaming mobile videos captured by 70 different smartphone users in fully unconstrained environments. Over $95K$ bounding boxes are manually labelled. The videos are carefully selected to cover typical smartphone usage. The videos are also annotated with 14 attributes, including 6 newly proposed attributes and 8 commonly seen in object tracking. 36 state-of-the-art trackers, including facial landmark trackers, generic object trackers and trackers that we have fine-tuned or improved, are evaluated. The results suggest that mobile face tracking cannot be solved through existing approaches. In addition, we show that fine-tuning on the MobiFace training data significantly boosts the performance of deep learning-based trackers, suggesting that MobiFace captures the unique characteristics of mobile face tracking. Our goal is to offer the community a diverse dataset to enable the design and evaluation of mobile face trackers. The dataset, annotations and the evaluation server will be on \url{https://mobiface.github.io/}.
http://arxiv.org/abs/1805.09749
This paper describes the design and implementation of a ground-related odometry sensor suitable for micro aerial vehicles. The sensor is based on a ground-facing camera and a single-board Linux-based embedded computer with a multimedia System on a Chip (SoC). The SoC features a hardware video encoder which is used to estimate the optical flow online. The optical flow is then used in combination with a distance sensor to estimate the vehicle’s velocity. The proposed sensor is compared to a similar existing solution and evaluated in both indoor and outdoor environments.
http://arxiv.org/abs/1901.07278
We propose a new video representation in terms of an over-segmentation of dense trajectories covering the whole video. Trajectories are often used to encode long-temporal information in several computer vision applications. Similar to temporal superpixels, a temporal slice of super-trajectories are superpixels, but the later contains more information because it maintains the long dense pixel-wise tracking information as well. The main challenge in using trajectories for any application, is the accumulation of tracking error in the trajectory construction. For our problem, this results in disconnected superpixels. We exploit constraints for edges in addition to trajectory based color and position similarity. Analogous to superpixels as a preprocessing tool for images, the proposed representation has its applications for videos, especially in trajectory based video analysis.
http://arxiv.org/abs/1901.07273
An important open problem in robotic planning is the autonomous generation of 3D inspection paths – that is, planning the best path to move a robot along in order to inspect a target structure. We recently suggested a new method for planning paths allowing the inspection of complex 3D structures, given a triangular mesh model of the structure. The method differs from previous approaches in its emphasis on generating and considering also plans that result in imperfect coverage of the inspection target. In many practical tasks, one would accept imperfections in coverage if this results in a substantially more energy efficient inspection path. The key idea is using a multiobjective evolutionary algorithm to optimize the energy usage and coverage of inspection plans simultaneously - and the result is a set of plans exploring the different ways to balance the two objectives. We here test our method on a set of inspection targets with large variation in size and complexity, and compare its performance with two state-of-the-art methods for complete coverage path planning. The results strengthen our confidence in the ability of our method to generate good inspection plans for different types of targets. The method’s advantage is most clearly seen for real-world inspection targets, since traditional complete coverage methods have no good way of generating plans for structures with hidden parts. Multiobjective evolution, by optimizing energy usage and coverage together ensures a good balance between the two - both when 100% coverage is feasible, and when large parts of the object are hidden.
http://arxiv.org/abs/1901.07272
Deep convolution neural networks demonstrate impressive results in super-resolution domain. An ocean of researches concentrate on improving peak signal noise ratio (PSNR) by using deeper and deeper layers, which is not friendly to constrained resources. Pursuing a trade-off between restoration capacity and simplicity of a model is still non-trivial by now. Recently, more contributions are devoted to this balance and our work is focusing on improving it further with automatic neural architecture search. In this paper, we handle super-resolution using multi-objective approach and propose an elastic search method involving both macro and micro aspects based on a hybrid controller of evolutionary algorithm and reinforcement learning. Quantitative experiments can help to draw a conclusion that the models generated by our methods are very competitive than and even dominate most of state-of-the-art super-resolution methods with different levels of FLOPS.
http://arxiv.org/abs/1901.07261
The World Health Organization (WHO) reported 1.25 million deaths yearly due to road traffic accidents worldwide and the number has been continuously increasing over the last few years. Nearly fifth of these accidents are caused by distracted drivers. Existing work of distracted driver detection is concerned with a small set of distractions (mostly, cell phone usage). Unreliable ad-hoc methods are often used.In this paper, we present the first publicly available dataset for driver distraction identification with more distraction postures than existing alternatives. In addition, we propose a reliable deep learning-based solution that achieves a 90% accuracy. The system consists of a genetically-weighted ensemble of convolutional neural networks, we show that a weighted ensemble of classifiers using a genetic algorithm yields in a better classification confidence. We also study the effect of different visual elements in distraction detection by means of face and hand localizations, and skin segmentation. Finally, we present a thinned version of our ensemble that could achieve 84.64% classification accuracy and operate in a real-time environment.
http://arxiv.org/abs/1901.09097
Over recent years, emerging interest has occurred in integrating computer vision technology into the retail industry. Automatic checkout (ACO) is one of the critical problems in this area which aims to automatically generate the shopping list from the images of the products to purchase. The main challenge of this problem comes from the large scale and the fine-grained nature of the product categories as well as the difficulty for collecting training images that reflect the realistic checkout scenarios due to continuous update of the products. Despite its significant practical and research value, this problem is not extensively studied in the computer vision community, largely due to the lack of a high-quality dataset. To fill this gap, in this work we propose a new dataset to facilitate relevant research. Our dataset enjoys the following characteristics: (1) It is by far the largest dataset in terms of both product image quantity and product categories. (2) It includes single-product images taken in a controlled environment and multi-product images taken by the checkout system. (3) It provides different levels of annotations for the check-out images. Comparing with the existing datasets, ours is closer to the realistic setting and can derive a variety of research problems. Besides the dataset, we also benchmark the performance on this dataset with various approaches. The dataset and related resources can be found at \url{https://rpc-dataset.github.io/}.
http://arxiv.org/abs/1901.07249
External effects such as shocks and temperature variations affect the calibration of visual-inertial sensor systems and thus they cannot fully rely on factory calibrations. Re-calibrations performed on short user-collected datasets might yield poor performance since the observability of certain parameters is highly dependent on the motion. Additionally, on resource-constrained systems (e.g mobile phones), full-batch approaches over longer sessions quickly become prohibitively expensive. In this paper, we approach the self-calibration problem by introducing information theoretic metrics to assess the information content of trajectory segments, thus allowing to select the most informative parts from a dataset for calibration purposes. With this approach, we are able to build compact calibration datasets either: (a) by selecting segments from a long session with limited exciting motion or (b) from multiple short sessions where a single sessions does not necessarily excite all modes sufficiently. Real-world experiments in four different environments show that the proposed method achieves comparable performance to a batch calibration approach, yet, at a constant computational complexity which is independent of the duration of the session.
http://arxiv.org/abs/1901.07242
As the foundation of driverless vehicle and intelligent robots, Simultaneous Localization and Mapping(SLAM) has attracted much attention these days. However, non-geometric modules of traditional SLAM algorithms are limited by data association tasks and have become a bottleneck preventing the development of SLAM. To deal with such problems, many researchers seek to Deep Learning for help. But most of these studies are limited to virtual datasets or specific environments, and even sacrifice efficiency for accuracy. Thus, they are not practical enough. We propose DF-SLAM system that uses deep local feature descriptors obtained by the neural network as a substitute for traditional hand-made features. Experimental results demonstrate its improvements in efficiency and stability. DF-SLAM outperforms popular traditional SLAM systems in various scenes, including challenging scenes with intense illumination changes. Its versatility and mobility fit well into the need for exploring new environments. Since we adopt a shallow network to extract local descriptors and remain others the same as original SLAM systems, our DF-SLAM can still run in real-time on GPU.
http://arxiv.org/abs/1901.07223
A class of vision problems, less commonly studied, consists of detecting objects in imagery obtained from physics-based experiments. These objects can span in 4D (x, y, z, t) and are visible as disturbances (caused due to physical phenomena) in the image with background distribution being approximately uniform. Such objects, occasionally referred to as `events’, can be considered as high energy blobs in the image. Unlike the images analyzed in conventional vision problems, very limited features are associated with such events, and their shape, size and count can vary significantly. This poses a challenge on the use of pre-trained models obtained from supervised approaches. In this paper, we propose an unsupervised approach involving iterative clustering based segmentation (ICS) which can detect target objects (events) in real-time. In this approach, a test image is analyzed over several cycles, and one event is identified per cycle. Each cycle consists of the following steps: (1) image segmentation using a modified k-means clustering method, (2) elimination of empty (with no events) segments based on statistical analysis of each segment, (3) merging segments that overlap (correspond to same event), and (4) selecting the strongest event. These four steps are repeated until all the events have been identified. The ICS approach consists of a few hyper-parameters that have been chosen based on statistical study performed over a set of test images. The applicability of ICS method is demonstrated on several 2D and 3D test examples.
http://arxiv.org/abs/1901.07222
In this study, we present a fully automatic method to segment both rectum and rectal cancer based on Deep Neural Networks (DNNs) with axial T2-weighted Magnetic Resonance images. Clinically, the relative location between rectum and rectal cancer plays an important role in cancer treatment planning. Such a need motivates us to propose a fully convolutional architecture for Multi-Task Learning (MTL) to segment both rectum and rectal cancer. Moreover, we propose a bias-variance decomposition-based method which can visualize and assess regional robustness of the segmentation model. In addition, we also suggest a novel augmentation method which can improve the segmentation performance as well as reduce the training time. Overall, our proposed method is not only computationally efficient due to its fully convolutional nature but also outperforms the current state-of-the-art for rectal cancer segmentation. It also scores high accuracy in rectum segmentation without any prior study reported. Moreover, we conclude that supplementing rectum information benefits the rectal cancer segmentation model, especially in model variance.
http://arxiv.org/abs/1901.07213
Collaborative filtering (CF) is the key technique for recommender systems (RSs). CF exploits user-item behavior interactions (e.g., clicks) only and hence suffers from the data sparsity issue. One research thread is to integrate auxiliary information such as product reviews and news titles, leading to hybrid filtering methods. Another thread is to transfer knowledge from other source domains such as improving the movie recommendation with the knowledge from the book domain, leading to transfer learning methods. In real-world life, no single service can satisfy a user’s all information needs. Thus it motivates us to exploit both auxiliary and source information for RSs in this paper. We propose a novel neural model to smoothly enable Transfer Meeting Hybrid (TMH) methods for cross-domain recommendation with unstructured text in an end-to-end manner. TMH attentively extracts useful content from unstructured text via a memory module and selectively transfers knowledge from a source domain via a transfer network. On two real-world datasets, TMH shows better performance in terms of three ranking metrics by comparing with various baselines. We conduct thorough analyses to understand how the text content and transferred knowledge help the proposed model.
http://arxiv.org/abs/1901.07199
Since compressive autoencoder (CAE) was proposed, autoencoder, as a simple and efficient neural network model, has achieved better performance than traditional codecs such as JPEG[3], JPEG 2000[4] etc. in lossy image compression. However, it faces the problem that the bitrate, characterizing the compression ratio, cannot be optimized by general methods due to its discreteness. Current research additionally trains a entropy estimator to indirectly optimize the bitrate. In this paper, we proposed the compressive autoencoder with pruning based on ADMM (CAE-P) which replaces the traditionally used entropy estimating technique with ADMM pruning method inspired by the field of neural network architecture search and avoided the extra effort needed for training an entropy estimator. We tested our models on natural image dataset Kodak PhotoCD and achieved better results than the original CAE model which relies on entropy coding along with traditional codecs. We further explored the effectiveness of the ADMM-based pruning method in CAE-P by looking into the detail of latent codes learned by the model.
http://arxiv.org/abs/1901.07196
This paper applies a genetic algorithm and fuzzy markup language to construct a human and smart machine cooperative learning system on game of Go. The genetic fuzzy markup language (GFML)-based Robot Agent can work on various kinds of robots, including Palro, Pepper, and TMUs robots. We use the parameters of FAIR open source Darkforest and OpenGo AI bots to construct the knowledge base of Open Go Darkforest (OGD) cloud platform for student learning on the Internet. In addition, we adopt the data from AlphaGo Master sixty online games as the training data to construct the knowledge base and rule base of the co-learning system. First, the Darkforest predicts the win rate based on various simulation numbers and matching rates for each game on OGD platform, then the win rate of OpenGo is as the final desired output. The experimental results show that the proposed approach can improve knowledge base and rule base of the prediction ability based on Darkforest and OpenGo AI bot with various simulation numbers.
http://arxiv.org/abs/1901.07191
People solve the difficult problem of understanding the salient features of both observations of others and the relationship to their own state when learning to imitate specific tasks. In this work, we train a comparator network which is used to compute distances between motions. Given a desired motion the comparator can provide a reward signal to the agent via the distance between the desired motion and the agent’s motion. We train an RNN-based comparator model to compute distances in space and time between motion clips while training an RL policy to minimize this distance. Furthermore, we examine a challenging form of this problem where a single \demonstrationText is provided for a given task. We demonstrate our approach in the setting of deep learning based control for physical simulation of humanoid walking in both 2D with $10$ degrees of freedom (DoF) and 3D with $38$ DoF.
http://arxiv.org/abs/1901.07186
Word representations are created using analogy context-based statistics and lexical relations on words. Word representations are inputs for the learning models in Natural Language Understanding (NLU) tasks. However, to understand language, knowing only the context is not sufficient. Reading between the lines is a key component of NLU. Embedding deeper word relationships which are not represented in the context enhances the word representation. This paper presents a word embedding which combines an analogy, context-based statistics using Word2Vec, and deeper word relationships using Conceptnet, to create an expanded word representation. In order to fine-tune the word representation, Self-Organizing Map is used to optimize it. The proposed word representation is compared with semantic word representations using Simlex 999. Furthermore, the use of 3D visual representations has shown to be capable of representing the similarity and association between words. The proposed word representation shows a Spearman correlation score of 0.886 and provided the best results when compared to the current state-of-the-art methods, and exceed the human performance of 0.78.
http://arxiv.org/abs/1901.07176
In this work, we propose a new data visualization and clustering technique for discovering discriminative structures in high-dimensional data. This technique, referred to as cPCA++, utilizes the fact that the interesting features of a “target” dataset may be obscured by high variance components during traditional PCA. By analyzing what is referred to as a “background” dataset (i.e., one that exhibits the high variance principal components but not the interesting structures), our technique is capable of efficiently highlighting the structure that is unique to the “target” dataset. Similar to another recently proposed algorithm called “contrastive PCA” (cPCA), the proposed cPCA++ method identifies important dataset specific patterns that are not detected by traditional PCA in a wide variety of settings. However, the proposed cPCA++ method is significantly more efficient than cPCA, because it does not require the parameter sweep in the latter approach. We applied the cPCA++ method to the problem of image splicing localization. In this application, we utilize authentic edges as the background dataset and the spliced edges as the target dataset. The proposed method is significantly more efficient than state-of-the-art methods, as the former does not require iterative updates of filter weights via stochastic gradient descent and backpropagation, nor the training of a classifier. Furthermore, the cPCA++ method is shown to provide performance scores comparable to the state-of-the-art Multi-task Fully Convolutional Network (MFCN).
http://arxiv.org/abs/1901.07172
Deep metric learning has been widely applied in many computer vision tasks, and recently, it is more attractive in \emph{zero-shot image retrieval and clustering}(ZSRC) where a good embedding is requested such that the unseen classes can be distinguished well. Most existing works deem this ‘good’ embedding just to be the discriminative one and thus race to devise powerful metric objectives or hard-sample mining strategies for leaning discriminative embedding. However, in this paper, we first emphasize that the generalization ability is a core ingredient of this ‘good’ embedding as well and largely affects the metric performance in zero-shot settings as a matter of fact. Then, we propose the Energy Confused Adversarial Metric Learning(ECAML) framework to explicitly optimize a robust metric. It is mainly achieved by introducing an interesting Energy Confusion regularization term, which daringly breaks away from the traditional metric learning idea of discriminative objective devising, and seeks to ‘confuse’ the learned model so as to encourage its generalization ability by reducing overfitting on the seen classes. We train this confusion term together with the conventional metric objective in an adversarial manner. Although it seems weird to ‘confuse’ the network, we show that our ECAML indeed serves as an efficient regularization technique for metric learning and is applicable to various conventional metric methods. This paper empirically and experimentally demonstrates the importance of learning embedding with good generalization, achieving state-of-the-art performances on the popular CUB, CARS, Stanford Online Products and In-Shop datasets for ZSRC tasks. \textcolor[rgb]{1, 0, 0}{Code available at this http URL}.
http://arxiv.org/abs/1901.07169
Deep neural networks (DNNs) have achieved superior performance in various prediction tasks, but can be very vulnerable to adversarial examples or perturbations. Therefore, it is crucial to measure the sensitivity of DNNs to various forms of perturbations in real applications. We introduce a novel perturbation manifold and its associated influence measure to quantify the effects of various perturbations on DNN classifiers. Such perturbations include various external and internal perturbations to input samples and network parameters. The proposed measure is motivated by information geometry and provides desirable invariance properties. We demonstrate that our influence measure is useful for four model building tasks: detecting potential ‘outliers’, analyzing the sensitivity of model architectures, comparing network sensitivity between training and test sets, and locating vulnerable areas. Experiments show reasonably good performance of the proposed measure for the popular DNN models ResNet50 and DenseNet121 on CIFAR10 and MNIST datasets.
http://arxiv.org/abs/1901.07152
Recently, deep learning based natural language processing techniques are being extensively used to deal with spam mail, censorship evaluation in social networks, among others. However, there is only a couple of works evaluating the vulnerabilities of such deep neural networks. Here, we go beyond attacks to investigate, for the first time, universal rules, i.e., rules that are sample agnostic and therefore could turn any text sample in an adversarial one. In fact, the universal rules do not use any information from the method itself (no information from the method, gradient information or training dataset information is used), making them black-box universal attacks. In other words, the universal rules are sample and method agnostic. By proposing a coevolutionary optimization algorithm we show that it is possible to create universal rules that can automatically craft imperceptible adversarial samples (only less than five perturbations which are close to misspelling are inserted in the text sample). A comparison with a random search algorithm further justifies the strength of the method. Thus, universal rules for fooling networks are here shown to exist. Hopefully, the results from this work will impact the development of yet more sample and model agnostic attacks as well as their defenses, culminating in perhaps a new age for artificial intelligence.
http://arxiv.org/abs/1901.07132
In this work, we propose a method for neural dialogue response generation that allows not only generating semantically reasonable responses according to the dialogue history, but also explicitly controlling the sentiment of the response via sentiment labels. Our proposed model is based on the paradigm of conditional adversarial learning; the training of a sentiment-controlled dialogue generator is assisted by an adversarial discriminator which assesses the fluency and feasibility of the response generating from the dialogue history and a given sentiment label. Because of the flexibility of our framework, the generator could be a standard sequence-to-sequence (SEQ2SEQ) model or a more complicated one such as a conditional variational autoencoder-based SEQ2SEQ model. Experimental results using automatic and human evaluation both demonstrate that our proposed framework is able to generate both semantically reasonable and sentiment-controlled dialogue responses.
http://arxiv.org/abs/1901.07129
We propose a novel image sampling method for differentiable image transformation in deep neural networks. The sampling schemes currently used in deep learning, such as Spatial Transformer Networks, rely on bilinear interpolation, which performs poorly under severe scale changes, and more importantly, results in poor gradient propagation. This is due to their strict reliance on direct neighbors. Instead, we propose to generate random auxiliary samples in the vicinity of each pixel in the sampled image, and create a linear approximation using their intensity values. We then use this approximation as a differentiable formula for the transformed image. However, we observe that these auxiliary samples may collapse to a single pixel under severe image transformations, and propose to address it by adding constraints to the distance between the center pixel and the auxiliary samples. We demonstrate that our approach produces more representative gradients with a wider basin of convergence for image alignment, which leads to considerable performance improvements when training networks for image registration and classification tasks, particularly under large downsampling.
http://arxiv.org/abs/1901.07124
Previous work on predicting or generating 3D human pose sequences regresses either joint rotations or joint positions. The former strategy is prone to error accumulation along the kinematic chain, as well as discontinuities when using Euler angles or exponential maps as parameterizations. The latter requires re-projection onto skeleton constraints to avoid bone stretching and invalid configurations. This work addresses both limitations. QuaterNet represents rotations with quaternions and our loss function performs forward kinematics on a skeleton to penalize absolute position errors instead of angle errors. We investigate both recurrent and convolutional architectures and evaluate on short-term prediction and long-term generation. For the latter, our approach is qualitatively judged as realistic as recent neural strategies from the graphics literature. Our experiments compare quaternions to Euler angles as well as exponential maps and show that only a very short context is required to make reliable future predictions. Finally, we show that the standard evaluation protocol for Human3.6M produces high variance results and we propose a simple solution.
http://arxiv.org/abs/1901.07677
Deep convolutional neural networks (CNNs) have recently achieved great success in many visual recognition tasks. However, existing deep convolutional neural network models are computationally expensive and memory intensive, hindering their deployment in devices with low memory resources or in applications with strict latency requirements. Therefore, a natural thought is to perform model compression and acceleration in deep networks without significantly decreasing the model performance. During the past few years, tremendous progresses have been made in this area. In this paper, we survey the recent advanced techniques for compacting and accelerating CNNs model developed. These techniques are roughly categorized into four schemes: parameter pruning and sharing, low-rank factorization, transfered/compact convolutional filters and knowledge distillation. Methods of parameter pruning and sharing will be described at the beginning, after that the other techniques will be introduced. For each scheme, we provide insightful analysis regarding the performance, related applications, advantages and drawbacks etc. Then we will go through a few very recent additional successful methods, for example, dynamic networks and stochastic depths networks. After that, we survey the evaluation matrix, main datasets used for evaluating the model performance and recent bench-marking efforts. Finally we conclude this paper, discuss remaining challenges and possible directions in this topic.
http://arxiv.org/abs/1710.09282
In recent years, the learned local descriptors have outperformed handcrafted ones by a large margin, due to the powerful deep convolutional neural network architectures such as L2-Net [1] and triplet based metric learning [2]. However, there are two problems in the current methods, which hinders the overall performance. Firstly, the widely-used margin loss is sensitive to incorrect correspondences, which are prevalent in the existing local descriptor learning datasets. Second, the L2 distance ignores the fact that the feature vectors have been normalized to unit norm. To tackle these two problems and further boost the performance, we propose a robust angular loss which 1) uses cosine similarity instead of L2 distance to compare descriptors and 2) relies on a robust loss function that gives smaller penalty to triplets with negative relative similarity. The resulting descriptor shows robustness on different datasets, reaching the state-of-the-art result on Brown dataset , as well as demonstrating excellent generalization ability on the Hpatches dataset and a Wide Baseline Stereo dataset.
http://arxiv.org/abs/1901.07076
The important task of developing verification system of data of virtual community member on the basis of computer-linguistic analysis of the content of a large sample of Ukrainian virtual communities is solved. The subject of research is methods and tools for verification of web-members socio-demographic characteristics based on computer-linguistic analysis of their communicative interaction results. The aim of paper is to verifying web-user personal data on the basis of computer-linguistic analysis of web-members information tracks. The structure of verification software for web-user profile is designed for a practical implementation of assigned tasks. The method of personal data verification of web-members by analyzing information track of virtual community member is conducted. For the first time the method for checking the authenticity of web members personal data, which helped to design of verification tool for socio-demographic characteristics of web-member is developed. The verification system of data of web-members, which forms the verified socio-demographic profiles of web-members, is developed as a result of conducted experiments. Also the user interface of the developed verification system web-members data is presented. Effectiveness and efficiency of use of the developed methods and means for solving tasks in web-communities administration is proved by their approbation. The number of false results of verification system is 18%.
http://arxiv.org/abs/1901.07067
Unsupervised neural nets such as Restricted Boltzmann Machines(RBMs) and Deep Belif Networks(DBNs), are powerful in automatic feature extraction,unsupervised weight initialization and density estimation. In this paper,we demonstrate that the parameters of these neural nets can be dramatically reduced without affecting their performance. We describe a method to reduce the parameters required by RBM which is the basic building block for deep architectures. Further we propose an unsupervised sparse deep architectures selection algorithm to form sparse deep neural networks.Experimental results show that there is virtually no loss in either generative or discriminative performance.
http://arxiv.org/abs/1901.07066
Parameters of recent neural networks require a huge amount of memory. These parameters are used by neural networks to perform machine learning tasks when processing inputs. To speed up inference, we develop Partition Pruning, an innovative scheme to reduce the parameters used while taking into consideration parallelization. We evaluated the performance and energy consumption of parallel inference of partitioned models, which showed a 7.72x speed up of performance and a 2.73x reduction in the energy used for computing pruned layers of TinyVGG16 in comparison to running the unpruned model on a single accelerator. In addition, our method showed a limited reduction some numbers in accuracy while partitioning fully connected layers.
http://arxiv.org/abs/1901.11391
Characterizing human values is a topic deeply interwoven with the sciences, humanities, art, and many other human endeavors. In recent years, a number of thinkers have argued that accelerating trends in computer science, cognitive science, and related disciplines foreshadow the creation of intelligent machines which meet and ultimately surpass the cognitive abilities of human beings, thereby entangling an understanding of human values with future technological development. Contemporary research accomplishments suggest sophisticated AI systems becoming widespread and responsible for managing many aspects of the modern world, from preemptively planning users’ travel schedules and logistics, to fully autonomous vehicles, to domestic robots assisting in daily living. The extrapolation of these trends has been most forcefully described in the context of a hypothetical “intelligence explosion,” in which the capabilities of an intelligent software agent would rapidly increase due to the presence of feedback loops unavailable to biological organisms. The possibility of superintelligent agents, or simply the widespread deployment of sophisticated, autonomous AI systems, highlights an important theoretical problem: the need to separate the cognitive and rational capacities of an agent from the fundamental goal structure, or value system, which constrains and guides the agent’s actions. The “value alignment problem” is to specify a goal structure for autonomous agents compatible with human values. In this brief article, we suggest that recent ideas from affective neuroscience and related disciplines aimed at characterizing neurological and behavioral universals in the mammalian class provide important conceptual foundations relevant to describing human values. We argue that the notion of “mammalian value systems” points to a potential avenue for fundamental research in AI safety and AI ethics.
http://arxiv.org/abs/1607.08289