Integer programming (IP) has proven to be highly effective in solving many path-based optimization problems in robotics. However, the applications of IP are generally done in an ad-hoc, problem specific manner. In this work, after examined a wide range of path-based optimization problems, we describe an IP solution methodology for these problems that is both easy to apply (in two simple steps) and high-performance in terms of the computation time and the achieved optimality. We demonstrate the generality of our approach through the application to three challenging path-based optimization problems: multi-robot path planning (MPP), minimum constraint removal (MCR), and reward collection problems RCPs). Associated experiments show that the approach can efficiently produce (near-)optimal solutions for problems with large state spaces, complex constraints, and complicated objective functions. In conjunction with the proposition of the IP methodology, we introduce two new and practical robotics problems: multi-robot minimum constraint removal (MMCR) and multi-robot path planning (MPP) with partial solutions, which can be quickly and effectively solved using our proposed IP solution pipeline.
http://arxiv.org/abs/1902.02652
In this paper we propose a robust visual odometry system for a wide-baseline camera rig with wide field-of-view (FOV) fisheye lenses, which provides full omnidirectional stereo observations of the environment. For more robust and accurate ego-motion estimation we adds three components to the standard VO pipeline, 1) the hybrid projection model for improved feature matching, 2) multi-view P3P RANSAC algorithm for pose estimation, and 3) online update of rig extrinsic parameters. The hybrid projection model combines the perspective and cylindrical projection to maximize the overlap between views and minimize the image distortion that degrades feature matching performance. The multi-view P3P RANSAC algorithm extends the conventional P3P RANSAC to multi-view images so that all feature matches in all views are considered in the inlier counting for robust pose estimation. Finally the online extrinsic calibration is seamlessly integrated in the backend optimization framework so that the changes in camera poses due to shocks or vibrations can be corrected automatically. The proposed system is extensively evaluated with synthetic datasets with ground-truth and real sequences of highly dynamic environment, and its superior performance is demonstrated.
https://arxiv.org/abs/1902.11154
The most widely used video encoders share a common hybrid coding framework that includes block-based motion estimation/compensation and block-based transform coding. Despite their high coding efficiency, the encoded videos often exhibit visually annoying artifacts, denoted as Perceivable Encoding Artifacts (PEAs), which significantly degrade the visual Qualityof- Experience (QoE) of end users. To monitor and improve visual QoE, it is crucial to develop subjective and objective measures that can identify and quantify various types of PEAs. In this work, we make the first attempt to build a large-scale subjectlabelled database composed of H.265/HEVC compressed videos containing various PEAs. The database, namely the PEA265 database, includes 4 types of spatial PEAs (i.e. blurring, blocking, ringing and color bleeding) and 2 types of temporal PEAs (i.e. flickering and floating). Each containing at least 60,000 image or video patches with positive and negative labels. To objectively identify these PEAs, we train Convolutional Neural Networks (CNNs) using the PEA265 database. It appears that state-of-theart ResNeXt is capable of identifying each type of PEAs with high accuracy. Furthermore, we define PEA pattern and PEA intensity measures to quantify PEA levels of compressed video sequence. We believe that the PEA265 database and our findings will benefit the future development of video quality assessment methods and perceptually motivated video encoders.
http://arxiv.org/abs/1903.00473
Visual Question Answering (VQA) research is split into two camps: the first focuses on VQA datasets that require natural image understanding and the second focuses on synthetic datasets that test reasoning. A good VQA algorithm should be capable of both, but only a few VQA algorithms are tested in this manner. We compare five state-of-the-art VQA algorithms across eight VQA datasets covering both domains. To make the comparison fair, all of the models are standardized as much as possible, e.g., they use the same visual features, answer vocabularies, etc. We find that methods do not generalize across the two domains. To address this problem, we propose a new VQA algorithm that rivals or exceeds the state-of-the-art for both domains.
http://arxiv.org/abs/1903.00366
Weighted bipolar argumentation frameworks offer a tool for decision support and social media analysis. Arguments are evaluated by an iterative procedure that takes initial weights and attack and support relations into account. Until recently, convergence of these iterative procedures was not very well understood in cyclic graphs. Mossakowski and Neuhaus recently introduced a unification of different approaches and proved first convergence and divergence results. We build up on this work, simplify and generalize convergence results and complement them with runtime guarantees. As it turns out, there is a tradeoff between semantics’ convergence guarantees and their ability to move strength values away from the initial weights. We demonstrate that divergence problems can be avoided without this tradeoff by continuizing semantics. Semantically, we extend the framework with a Duality property that assures a symmetric impact of attack and support relations. We also present a Java implementation of modular semantics and explain the practical usefulness of the theoretical ideas.
http://arxiv.org/abs/1809.07133
Adversarial generative image models yield outstanding sample quality, but suffer from two drawbacks: (i) they mode-drop, \ie, do not cover the full support of the target distribution, and (ii) they do not allow for likelihood evaluations on held-out data. Conversely, maximum likelihood estimation encourages models to cover the full support of the training data, but yields poor samples. To address these mutual shortcomings, we propose a generative model that can be jointly trained with both procedures. In our approach, the conditional independence assumption typically made in variational autoencoders is relaxed by leveraging invertible models. This leads to improved sample quality, as well as improved likelihood on held-out data. Our model significantly improves on existing hybrid models, yielding GAN-like samples, and IS and FID scores that are competitive with fully adversarial models, while offering likelihoods measures on held-out data comparable to recent likelihood-based methods.
http://arxiv.org/abs/1901.01091
Neural networks are powering the deployment of embedded devices and Internet of Things. Applications range from personal assistants to critical ones such as self-driving cars. It has been shown recently that models obtained from neural nets can be trojaned ; an attacker can then trigger an arbitrary model behavior facing crafted inputs. This has a critical impact on the security and reliability of those deployed devices. We introduce novel algorithms to detect the tampering with deployed models, classifiers in particular. In the remote interaction setup we consider, the proposed strategy is to identify markers of the model input space that are likely to change class if the model is attacked, allowing a user to detect a possible tampering. This setup makes our proposal compatible with a wide rage of scenarios, such as embedded models, or models exposed through prediction APIs. We experiment those tampering detection algorithms on the canonical MNIST dataset, over three different types of neural nets, and facing five different attacks (trojaning, quantization, fine-tuning, compression and watermarking). We then validate over five large models (VGG16, VGG19, ResNet, MobileNet, DenseNet) with a state of the art dataset (VGGFace2), and report results demonstrating the possibility of an efficient detection of model tampering.
http://arxiv.org/abs/1903.00317
Previous spatial-temporal action localization methods commonly follow the pipeline of object detection to estimate bounding boxes and labels of actions. However, the temporal relation of an action has not been fully explored. In this paper, we propose an end-to-end Progress Regression Recurrent Neural Network (PR-RNN) for online spatial-temporal action localization, which learns to infer the action by temporal progress regression. Two new action attributes, called progression and progress rate, are introduced to describe the temporal engagement and relative temporal position of an action. In our method, frame-level features are first extracted by a Fully Convolutional Network (FCN). Subsequently, detection results and action progress attributes are regressed by the Convolutional Gated Recurrent Unit (ConvGRU) based on all the observed frames instead of a single frame or a short clip. Finally, a novel online linking method is designed to connect single-frame results to spatial-temporal tubes with the help of the estimated action progress attributes. Extensive experiments demonstrate that the progress attributes improve the localization accuracy by providing more precise temporal position of an action in unconstrained videos. Our proposed PR-RNN achieves the stateof-the-art performance for most of the IoU thresholds on two benchmark datasets.
http://arxiv.org/abs/1903.00304
Autonomous driving presents one of the largest problems that the robotics and artificial intelligence communities are facing at the moment, both in terms of difficulty and potential societal impact. Self-driving vehicles (SDVs) are expected to prevent road accidents and save millions of lives while improving the livelihood and life quality of many more. However, despite large interest and a number of industry players working in the autonomous domain, there still remains more to be done in order to develop a system capable of operating at a level comparable to best human drivers. One reason for this is high uncertainty of traffic behavior and large number of situations that an SDV may encounter on the roads, making it very difficult to create a fully generalizable system. To ensure safe and efficient operations, an autonomous vehicle is required to account for this uncertainty and to anticipate a multitude of possible behaviors of traffic actors in its surrounding. We address this critical problem and present a method to predict multiple possible trajectories of actors while also estimating their probabilities. The method encodes each actor’s surrounding context into a raster image, used as input by deep convolutional networks to automatically derive relevant features for the task. Following extensive offline evaluation and comparison to state-of-the-art baselines, the method was successfully tested on SDVs in closed-course tests.
http://arxiv.org/abs/1809.10732
This article presents a continuous model for hierarchical networks based on a combination of mathematically derived models of receptive fields and biologically inspired computations. Based on a functional model of complex cells in terms of an oriented quasi quadrature combination of first- and second-order directional Gaussian derivatives, we couple such primitive computations in cascade over combinatorial expansions over image orientations. Scale-space properties of the computational primitives are analysed and it is shown that the resulting representation allows for provable scale and rotation covariance. A prototype application to texture analysis is developed and it is demonstrated that a simplified mean-reduced representation of the resulting QuasiQuadNet leads to promising experimental results on three texture datasets.
http://arxiv.org/abs/1903.00289
State-of-the-art offline handwriting text recognition systems tend to use neural networks and therefore require a large amount of annotated data to be trained. In order to partially satisfy this requirement, we propose a system based on Generative Adversarial Networks (GAN) to produce synthetic images of handwritten words. We use bidirectional LSTM recurrent layers to get an embedding of the word to be rendered, and we feed it to the generator network. We also modify the standard GAN by adding an auxiliary network for text recognition. The system is then trained with a balanced combination of an adversarial loss and a CTC loss. Together, these extensions to GAN enable to control the textual content of the generated word images. We obtain realistic images on both French and Arabic datasets, and we show that integrating these synthetic images into the existing training data of a text recognition system can slightly enhance its performance.
http://arxiv.org/abs/1903.00277
The task of video prediction is forecasting the next frames given some previous frames. Despite much recent progress, this task is still challenging mainly due to high nonlinearity in the spatial domain. To address this issue, we propose a novel architecture, Frequency Domain Transformer Network (FDTN), which is an end-to-end learnable model that estimates and uses the transformations of the signal in the frequency domain. Experimental evaluations show that this approach can outperform some widely used video prediction methods like Video Ladder Network (VLN) and Predictive Gated Pyramids (PGP).
http://arxiv.org/abs/1903.00271
To autonomously navigate and plan interactions in real-world environments, robots require the ability to robustly perceive and map complex, unstructured surrounding scenes. Besides building an internal representation of the observed scene geometry, the key insight towards a truly functional understanding of the environment is the usage of higher-level entities during mapping, such as individual object instances. We propose an approach to incrementally build volumetric object-centric maps during online scanning with a localized RGB-D camera. First, a per-frame segmentation scheme combines an unsupervised geometric approach with instance-aware semantic object predictions. This allows us to detect and segment elements both from the set of known classes and from other, previously unseen categories. Next, a data association step tracks the predicted instances across the different frames. Finally, a map integration strategy fuses information about their 3D shape, location, and, if available, semantic class into a global volume. Evaluation on a publicly available dataset shows that the proposed approach for building instance-level semantic maps is competitive with state-of-the-art methods, while additionally able to discover objects of unseen categories. The system is further evaluated within a real-world robotic mapping setup, for which qualitative results highlight the online nature of the method.
http://arxiv.org/abs/1903.00268
Modeling social interactions based on individual behavior has always been an area of interest, but prior literature generally presumes rational behavior. Thus, such models may miss out on capturing the effects of biases humans are susceptible to. This work presents a method to model egocentric bias, the real-life tendency to emphasize one’s own opinion heavily when presented with multiple opinions. We use a symmetric distribution centered at an agent’s own opinion, as opposed to the Bounded Confidence (BC) model used in prior work. We consider a game of iterated interactions where an agent cooperates based on its opinion about an opponent. Our model also includes the concept of domain-based self-doubt, which varies as the interaction succeeds or not. An increase in doubt makes an agent reduce its egocentricity in subsequent interactions, thus enabling the agent to learn reactively. The agent system is modeled with factions not having a single leader, to overcome some of the issues associated with leader-follower factions. We find that agents belonging to factions perform better than individual agents. We observe that an intermediate level of egocentricity helps the agent perform at its best, which concurs with conventional wisdom that neither overconfidence nor low self-esteem brings benefits.
http://arxiv.org/abs/1903.03443
Object recognition is a primary function of the human visual system. It has recently been claimed that the highly successful ability to recognise objects in a set of emergent computer vision systems—Deep Convolutional Neural Networks (DCNNs)—can form a useful guide to recognition in humans. To test this assertion, we systematically evaluated visual crowding, a dramatic breakdown of recognition in clutter, in DCNNs and compared their performance to extant research in humans. We examined crowding in two architectures of DCNNs with the same methodology as that used among humans. We manipulated multiple stimulus factors including inter-letter spacing, letter colour, size, and flanker location to assess the extent and shape of crowding in DCNNs to establish a clear picture of crowding in DCNNs. We found that crowding followed a predictable pattern across DCNN architectures that was fundamentally different from that in humans. Some characteristic hallmarks of human crowding, such as invariance to size, the effect of target-flanker similarity, and confusions between target and flanker identities, were completely missing, minimised or even reversed in DCNNs. These data show that DCNNs, while proficient in object recognition, likely achieve this competence through a set of mechanisms that are distinct from those in humans. They are not equivalent models of human or primate object recognition and caution must be exercised when inferring mechanisms derived from their operation.
http://arxiv.org/abs/1903.00258
We present, for the first time, a novel deep neural network architecture called \dcn with a dual-path connection between the input image and output class label for mammogram image processing. This architecture is built upon U-Net, which non-linearly maps the input data into a deep latent space. One path of the \dcnn, the locality preserving learner, is devoted to hierarchically extracting and exploiting intrinsic features of the input, while the other path, called the conditional graph learner, focuses on modeling the input-mask correlations. The learned mask is further used to improve classification results, and the two learning paths complement each other. By integrating the two learners our new architecture provides a simple but effective way to jointly learn the segmentation and predict the class label. Benefiting from the powerful expressive capacity of deep neural networks a more discriminative representation can be learned, in which both the semantics and structure are well preserved. Experimental results show that \dcn achieves the best mammography segmentation and classification simultaneously, outperforming recent state-of-the-art models.
http://arxiv.org/abs/1903.00001
Recently, learning to hash has been widely studied for image retrieval thanks to the computation and storage efficiency of binary codes. For most existing learning to hash methods, sufficient training images are required and used to learn precise hashing codes. However, in some real-world applications, there are not always sufficient training images in the domain of interest. In addition, some existing supervised approaches need a amount of labeled data, which is an expensive process in term of time, label and human expertise. To handle such problems, inspired by transfer learning, we propose a simple yet effective unsupervised hashing method named Optimal Projection Guided Transfer Hashing (GTH) where we borrow the images of other different but related domain i.e., source domain to help learn precise hashing codes for the domain of interest i.e., target domain. Besides, we propose to seek for the maximum likelihood estimation (MLE) solution of the hashing functions of target and source domains due to the domain gap. Furthermore,an alternating optimization method is adopted to obtain the two projections of target and source domains such that the domain hashing disparity is reduced gradually. Extensive experiments on various benchmark databases verify that our method outperforms many state-of-the-art learning to hash methods. The implementation details are available at https://github.com/liuji93/GTH.
http://arxiv.org/abs/1903.00252
Letting a deep network be aware of the quality of its own predictions is an interesting yet important problem. In the task of instance segmentation, the confidence of instance classification is used as mask quality score in most instance segmentation frameworks. However, the mask quality, quantified as the IoU between the instance mask and its ground truth, is usually not well correlated with classification score. In this paper, we study this problem and propose Mask Scoring R-CNN which contains a network block to learn the quality of the predicted instance masks. The proposed network block takes the instance feature and the corresponding predicted mask together to regress the mask IoU. The mask scoring strategy calibrates the misalignment between mask quality and mask score, and improves instance segmentation performance by prioritizing more accurate mask predictions during COCO AP evaluation. By extensive evaluations on the COCO dataset, Mask Scoring R-CNN brings consistent and noticeable gain with different models, and outperforms the state-of-the-art Mask R-CNN. We hope our simple and effective approach will provide a new direction for improving instance segmentation. The source code of our method is available at \url{https://github.com/zjhuang22/maskscoring_rcnn}.
http://arxiv.org/abs/1903.00241
Social media has revolutionized human communication and styles of interaction. Due to its easiness and effective medium, people share and exchange information, carry out discussion on various events, and express their opinions. For effective policy making and understanding the response of a community on different events, we need to monitor and analyze the social media. In social media, there are some users who are more influential, for example, a famous politician may have more influence than a common person. These influential users belong to specific communities. The main object of this research is to know the sentiments of a specific community on various events. For detecting the event based sentiments of a community we propose a generic framework. Our framework identifies the users of a specific community on twitter. After identifying the users of a community, we fetch their tweets and identify tweets belonging to specific events. The event based tweets are pre-processed. Pre-processed tweets are then analyzed for detecting sentiments of a community for specific events. Qualitative and quantitative evaluation confirms the effectiveness and usefulness of our proposed framework.
http://arxiv.org/abs/1903.00232
Camera shake during exposure is a major problem in hand-held photography, as it causes image blur that destroys details in the captured images.~In the real world, such blur is mainly caused by both the camera motion and the complex scene structure.~While considerable existing approaches have been proposed based on various assumptions regarding the scene structure or the camera motion, few existing methods could handle the real 6 DoF camera motion.~In this paper, we propose to jointly estimate the 6 DoF camera motion and remove the non-uniform blur caused by camera motion by exploiting their underlying geometric relationships, with a single blurry image and its depth map (either direct depth measurements, or a learned depth map) as input.~We formulate our joint deblurring and 6 DoF camera motion estimation as an energy minimization problem which is solved in an alternative manner. Our model enables the recovery of the 6 DoF camera motion and the latent clean image, which could also achieve the goal of generating a sharp sequence from a single blurry image. Experiments on challenging real-world and synthetic datasets demonstrate that image blur from camera shake can be well addressed within our proposed framework.
http://arxiv.org/abs/1903.00231
Given the task of learning robotic grasping solely based on a depth camera input and gripper force feedback, we derive a learning algorithm from an applied point of view to significantly reduce the amount of required training data. Major improvements in time and data efficiency are achieved by: Firstly, we exploit the geometric consistency between the undistorted depth images and the task space. Using a relative small, fully-convolutional neural network, we predict grasp and gripper parameters with great advantages in training as well as inference performance. Secondly, motivated by the small random grasp success rate of around 3%, the grasp space was explored in a systematic manner. The final system was learned with 23000 grasp attempts in around 60h, improving current solutions by an order of magnitude. For typical bin picking scenarios, we measured a grasp success rate of 96.6%. Further experiments showed that the system is able to generalize and transfer knowledge to novel objects and environments.
http://arxiv.org/abs/1903.00228
This work studies the design of safe control policies for large-scale non-linear systems operating in uncertain environments. In such a case, the robust control framework is a principled approach to safety that aims to maximize the worst-case performance of a system. However, the resulting optimization problem is generally intractable for non-linear systems with continuous states. To overcome this issue, we introduce two tractable methods that are based either on sampling or on a conservative approximation of the robust objective. The proposed approaches are applied to the problem of autonomous driving.
http://arxiv.org/abs/1903.00220
In this paper, we describe KT-Speech-Crawler: an approach for automatic dataset construction for speech recognition by crawling YouTube videos. We outline several filtering and post-processing steps, which extract samples that can be used for training end-to-end neural speech recognition systems. In our experiments, we demonstrate that a single-core version of the crawler can obtain around 150 hours of transcribed speech within a day, containing an estimated 3.5% word error rate in the transcriptions. Automatically collected samples contain reading and spontaneous speech recorded in various conditions including background noise and music, distant microphone recordings, and a variety of accents and reverberation. When training a deep neural network on speech recognition, we observed around 40\% word error rate reduction on the Wall Street Journal dataset by integrating 200 hours of the collected samples into the training set. The demo (this http URL) and the crawler code (https://github.com/EgorLakomkin/KTSpeechCrawler) are publicly available.
http://arxiv.org/abs/1903.00216
Collective intelligence is manifested when multiple agents coherently work in observation, interaction, decision-making and action. In this paper, we define and quantify the intelligence level of heterogeneous agents group with the improved Anytime Universal Intelligence Test(AUIT), based on an extension of the existing evaluation of homogeneous agents group. The relationship of intelligence level with agents composition, group size, spatial complexity and testing time is analyzed. The intelligence level of heterogeneous agents groups is compared with the homogeneous ones to analyze the effects of heterogeneity on collective intelligence. Our work will help to understand the essence of collective intelligence more deeply and reveal the effect of various key factors on group intelligence level.
http://arxiv.org/abs/1903.00206
We present a concept of constrained collaborative mobile agents (CCMA) system, which consists of multiple wheeled mobile agents constrained by a passive kinematic chain. This mobile robotic system is modular in nature, the passive kinematic chain can be easily replaced with different designs and morphologies for different functions and task adaptability. Depending solely on the actuation of the mobile agents, this mobile robotic system can manipulate or position an end-effector. However, the complexity of the system due to presence of several mobile agents, passivity of the kinematic chain and the nature of the constrained collaborative manipulation requires development of an optimization framework. We therefore present an optimization framework for forward simulation and kinematic control of this system. With this optimization framework, the number of deployed mobile agents, actuation schemes, the design and morphology of the passive kinematic chain can be easily changed, which reinforces the modularity and collaborative aspects of the mobile robotic system. We present results, in simulation, for spatial 4-DOF to 6-DOF CCMA system examples. Finally, we present experimental quantitative results for two different fabricated 4-DOF prototypes, which demonstrate different actuation schemes, control and collaborative manipulation of an end-effector.
http://arxiv.org/abs/1901.02935
Emphatic Temporal Difference (ETD) learning has recently been proposed as a convergent off-policy learning method. ETD was proposed mainly to address convergence issues of conventional Temporal Difference (TD) learning under off-policy training but it is different from conventional TD learning even under on-policy training. A simple counterexample provided back in 2017 pointed to a potential class of problems where ETD converges but TD diverges. In this paper, we empirically show that ETD converges on a few other well-known on-policy experiments whereas TD either diverges or performs poorly. We also show that ETD outperforms TD on the mountain car prediction problem. Our results, together with a similar pattern observed under off-policy training in prior works, suggest that ETD might be a good substitute over conventional TD.
http://arxiv.org/abs/1903.00194
The annotated medical images are usually expensive to be collected. This paper proposes a deep learning method on small data to classify Common Imaging Signs of Lung diseases (CISL) in computed tomography (CT) images. We explore both the real data and the data generated by Generative Adversarial Network (GAN) to improve the reliability and the generalization of learning. First, we use GAN to generate a large number of CISLs from small annotated data, which are difficult to be distinguished from real counterparts. These generated samples are used to pre-train a Convolutional Neural Network (CNN) for classifying CISLs. Second, we fine-tune the CNN classification model with real data. Experiments were conducted on the LISS database of CISLs. We successfully convinced radiologists that our generated CISLs samples were real for 56.7% of our experiments. The pre-trained CNN model achieves 88.4% of mean accuracy of binary classification, and after fine-tuning, the mean accuracy is significantly increased to 95.0%. For multi-classification of all types of CISLs and normal tissues, through the two stages of training, the mean accuracy, sensitivity and specificity are up to about 91.83%, 92.73% and 99.0%, respectively. To our knowledge, this is the best result achieved on the LISS database, which demonstrates that the proposed method is effective and promising for fulfilling deep learning on small data.
http://arxiv.org/abs/1903.00183
The availability of labeled image datasets has been shown critical for high-level image understanding, which continuously drives the progress of feature designing and models developing. However, constructing labeled image datasets is laborious and monotonous. To eliminate manual annotation, in this work, we propose a novel image dataset construction framework by employing multiple textual queries. We aim at collecting diverse and accurate images for given queries from the Web. Specifically, we formulate noisy textual queries removing and noisy images filtering as a multi-view and multi-instance learning problem separately. Our proposed approach not only improves the accuracy but also enhances the diversity of the selected images. To verify the effectiveness of our proposed approach, we construct an image dataset with 100 categories. The experiments show significant performance gains by using the generated data of our approach on several tasks, such as image classification, cross-dataset generalization, and object detection. The proposed method also consistently outperforms existing weakly supervised and web-supervised approaches.
http://arxiv.org/abs/1708.06495
Saliency detection is one of the basic challenges in computer vision. How to extract effective features is a critical point for saliency detection. Recent methods mainly adopt integrating multi-scale convolutional features indiscriminately. However, not all features are useful for saliency detection and some even cause interferences. To solve this problem, we propose Pyramid Feature Selective network to focus on effective high-level context features and low-level spatial structural features. First, we design Context-aware Pyramid Feature Extraction (CPFE) module for multi-scale high-level feature maps to capture rich context features. Second, we adopt channel-wise attention (CA) after CPFE feature maps and spatial attention (SA) after low-level feature maps, then fuse outputs of CA & SA together. Finally, we propose an edge preservation loss to guide network to learn more detailed information in boundary localization. Extensive evaluations on five benchmark datasets demonstrate that the proposed method outperforms the state-of-the-art approaches under different evaluation metrics.
http://arxiv.org/abs/1903.00179
Open Information Extraction (OpenIE) extracts meaningful structured tuples from free-form text. Most previous work on OpenIE considers extracting data from one sentence at a time. We describe NeurON, a system for extracting tuples from question-answer pairs. Since real questions and answers often contain precisely the information that users care about, such information is particularly desirable to extend a knowledge base with. NeurON addresses several challenges. First, an answer text is often hard to understand without knowing the question, and second, relevant information can span multiple sentences. To address these, NeurON formulates extraction as a multi-source sequence-to-sequence learning task, wherein it combines distributed representations of a question and an answer to generate knowledge facts. We describe experiments on two real-world datasets that demonstrate that NeurON can find a significant number of new and interesting facts to extend a knowledge base compared to state-of-the-art OpenIE methods.
http://arxiv.org/abs/1903.00172
Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new English reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 96k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% F1 on our generalized accuracy metric, while expert human performance is 96.0%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 47.0% F1.
http://arxiv.org/abs/1903.00161
The problem of localization on a geo-referenced satellite map given a query ground view image is useful yet remains challenging due to the drastic change in viewpoint. To this end, in this paper we work on the extension of our earlier work on the Cross-View Matching Network (CVM-Net) \cite{Hu2018} for the ground-to-aerial image matching task since the traditional image descriptors fail due to the drastic viewpoint change. In particular, we show more extensive experimental results and analyses of the network architecture on our CVM-Net. Furthermore, we propose a Markov localization framework that enforces the temporal consistency between image frames to enhance the geo-localization results in the case where a video stream of ground view images is available. Experimental results show that our proposed Markov localization framework can continuously localize the vehicle within a small error on our Singapore dataset.
http://arxiv.org/abs/1903.00159
Sequential Convex Programming (SCP) has recently seen a surge of interest as a tool for trajectory optimization. However, most available methods lack rigorous performance guarantees and they are often tailored to specific optimal control setups. In this paper, we present GuSTO (Guaranteed Sequential Trajectory Optimization), an algorithmic framework to solve trajectory optimization problems for control-affine systems with drift. GuSTO generalizes earlier SCP-based methods for trajectory optimization (by addressing, for example, goal-set constraints and problems with either fixed or free final time) and enjoys theoretical convergence guarantees in terms of convergence to, at least, a stationary point. The theoretical analysis is further leveraged to devise an accelerated implementation of GuSTO, which originally infuses ideas from indirect optimal control into an SCP context. Numerical experiments on a variety of trajectory optimization setups show that GuSTO generally outperforms current state-of-the-art approaches in terms of success rates, solution quality, and computation times.
http://arxiv.org/abs/1903.00155
Unsupervised neural machine translation (UNMT) requires only monolingual data of similar language pairs during training and can produce bi-directional translation models with relatively good performance on alphabetic languages (Lample et al., 2018). However, no research has been done to logographic language pairs. This study focuses on Chinese-Japanese UNMT trained by data containing sub-character (ideograph or stroke) level information which is decomposed from character level data. BLEU scores of both character and sub-character level systems were compared against each other and the results showed that despite the effectiveness of UNMT on character level data, sub-character level data could further enhance the performance, in which the stroke level system outperformed the ideograph level system.
http://arxiv.org/abs/1903.00149
Real-time navigation in dense human environments is a challenging problem in robotics. Most existing path planners fail to account for the dynamics of pedestrians because introducing time as an additional dimension in search space is computationally prohibitive. Alternatively, most local motion planners only address imminent collision avoidance and fail to offer long-term optimality. In this work, we present an approach, called Dynamic Channels, to solve this global to local quandary. Our method combines the high-level topological path planning with low-level motion planning into a complete pipeline. By formulating the path planning problem as graph searching in the triangulation space, our planner is able to explicitly reason about the obstacle dynamics and capture the environmental change efficiently. We evaluate efficiency and performance of our approach on public pedestrian datasets and compare it to a state-of-the-art planning algorithm for dynamic obstacle avoidance.
http://arxiv.org/abs/1903.00143
Within Music Information Retrieval (MIR), prominent tasks – including pitch-tracking, source-separation, super-resolution, and synthesis – typically call for specialised methods, despite their similarities. Conditional Generative Adversarial Networks (cGANs) have been shown to be highly versatile in learning general image-to-image translations, but have not yet been adapted across MIR. In this work, we present an end-to-end supervisable architecture to perform all aforementioned audio tasks, consisting of a WaveNet synthesiser conditioned on the output of a jointly-trained cGAN spectrogram translator. In doing so, we demonstrate the potential of such flexible techniques to unify MIR tasks, promote efficient transfer learning, and converge research to the improvement of powerful, general methods. Finally, to the best of our knowledge, we present the first application of GANs to guided instrument synthesis.
http://arxiv.org/abs/1903.00142
Neural machine translation systems have become state-of-the-art approaches for Grammatical Error Correction (GEC) task. In this paper, we propose a copy-augmented architecture for the GEC task by copying the unchanged words from the source sentence to the target sentence. Since the GEC suffers from not having enough labeled training data to achieve high accuracy. We pre-train the copy-augmented architecture with a denoising auto-encoder using the unlabeled One Billion Benchmark and make comparisons between the fully pre-trained model and a partially pre-trained model. It is the first time copying words from the source context and fully pre-training a sequence to sequence model are experimented on the GEC task. Moreover, We add token-level and sentence-level multi-task learning for the GEC task. The evaluation results on the CoNLL-2014 test set show that our approach outperforms all recently published state-of-the-art results by a large margin.
http://arxiv.org/abs/1903.00138
We predict future video frames from complex dynamic scenes, using an invertible neural network as the encoder of a nonlinear dynamic system with latent linear state evolution. Our invertible linear embedding (ILE) demonstrates successful learning, prediction and latent state inference. In contrast to other approaches, ILE does not use any explicit reconstruction loss or simplistic pixel-space assumptions. Instead, it leverages invertibility to optimize the likelihood of image sequences exactly, albeit indirectly. Comparison with a state-of-the-art method demonstrates the viability of our approach.
http://arxiv.org/abs/1903.00133
OpenRoACH is a 15-cm 200-gram self-contained hexapedal robot with an onboard single-board computer. To our knowledge, it is the smallest legged robot with the capability of running the Robot Operating System (ROS) onboard. The robot is fully open sourced, uses accessible materials and off-the-shelf electronic components, can be fabricated with benchtop fast-prototyping machines such as a laser cutter and a 3D printer, and can be assembled by one person within two hours. Its sensory capacity has been tested with gyroscopes, accelerometers, Beacon sensors, color vision sensors, linescan sensors and cameras. It is low-cost within $150 including structure materials, motors, electronics, and a battery. The capabilities of OpenRoACH are demonstrated with multi-surface walking and running, 24-hour continuous walking burn-ins, carrying 200-gram dynamic payloads and 800-gram static payloads, and ROS control of steering based on camera feedback. Information and files related to mechanical design, fabrication, assembly, electronics, and control algorithms are all publicly available on https://wiki.eecs.berkeley.edu/biomimetics/Main/OpenRoACH.
http://arxiv.org/abs/1903.00131
Natural language understanding for robotics can require substantial domain- and platform-specific engineering. For example, for mobile robots to pick-and-place objects in an environment to satisfy human commands, we can specify the language humans use to issue such commands, and connect concept words like red can to physical object properties. One way to alleviate this engineering for a new domain is to enable robots in human environments to adapt dynamically—continually learning new language constructions and perceptual concepts. In this work, we present an end-to-end pipeline for translating natural language commands to discrete robot actions, and use clarification dialogs to jointly improve language parsing and concept grounding. We train and evaluate this agent in a virtual setting on Amazon Mechanical Turk, and we transfer the learned agent to a physical robot platform to demonstrate it in the real world.
http://arxiv.org/abs/1903.00122
Inter-process communication (IPC) is one of the core functions of modern robotics middleware. We propose an efficient IPC technique called TZC (Towards Zero-Copy). As a core component of TZC, we design a novel algorithm called partial serialization. Our formulation can generate messages that can be divided into two parts. During message transmission, one part is transmitted through a socket and the other part uses shared memory. The part within shared memory is never copied or serialized during its lifetime. We have integrated TZC with ROS and ROS2 and find that TZC can be easily combined with current open-source platforms. By using TZC, the overhead of IPC remains constant when the message size grows. In particular, when the message size is 4MB (less than the size of a full HD image), TZC can reduce the overhead of ROS IPC from tens of milliseconds to hundreds of microseconds and can reduce the overhead of ROS2 IPC from hundreds of milliseconds to less than 1 millisecond. We also demonstrate the benefits of TZC by integrating with TurtleBot2 that are used in autonomous driving scenarios. We show that by using TZC, the braking distance can be shortened by 16% than ROS.
http://arxiv.org/abs/1810.00556
When considering sparse motion capture marker data, one typically struggles to balance its overfitting via a high dimensional blendshape system versus underfitting caused by smoothness constraints. With the current trend towards using more and more data, our aim is not to fit the motion capture markers with a parameterized (blendshape) model or to smoothly interpolate a surface through the marker positions, but rather to find an instance in the high resolution dataset that contains local geometry to fit each marker. Just as is true for typical machine learning applications, this approach benefits from a plethora of data, and thus we also consider augmenting the dataset via specially designed physical simulations that target the high resolution dataset such that the simulation output lies on the same so-called manifold as the data targeted.
http://arxiv.org/abs/1903.00119
One key issue in managing a large scale 3D shape dataset is to identify an effective way to retrieve a shape-of-interest. The sketch-based query, which enjoys the flexibility in representing the user’s intention, has received growing interests in recent years due to the popularization of the touchscreen technology. Essentially, the sketch depicts an abstraction of an shape in a certain view while the shape contains the full 3D information. Matching between them is a cross-modality retrieval problem. However, for a given query, only part of the viewpoints of the 3D shape is representative. Thus, blindly projecting a 3D shape into a feature vector without considering what is the query will inevitably bring query-unrepresentative information. To handle this issue, in this paper we propose a Deep Point-to-Subspace Metric Learning (DPSML) framework to project a sketch into a feature vector and a 3D shape into a subspace spanned by a few selected basis feature vectors. The similarity between them is defined as the distance between the query feature vector and its closest point in the subspace by solving an optimization problem on the fly. Note that, the closest point is query-adaptive and can reflect the viewpoint information that is representative to the given query. To efficiently learn such a deep model, we formulate it as a classification problem with a special classifier design. To reduce the redundancy of 3D shapes, we also introduce a Representative-View Selection (RVS) module to select the most representative views of a 3D shape. By conducting extensive experiments on various datasets, we show that the proposed approach can achieve superior performance over its competitive baselines and attain the state-of-the-art performance.
http://arxiv.org/abs/1903.00117
In this work we present a self-supervised learning framework to simultaneously train two Convolutional Neural Networks (CNNs) to predict depth and surface normals from a single image. In contrast to most existing frameworks which represent outdoor scenes as fronto-parallel planes at piece-wise smooth depth, we propose to predict depth with surface orientation while assuming that natural scenes have piece-wise smooth normals. We show that a simple depth-normal consistency as a soft-constraint on the predictions is sufficient and effective for training both these networks simultaneously. The trained normal network provides state-of-the-art predictions while the depth network, relying on much realistic smooth normal assumption, outperforms the traditional self-supervised depth prediction network by a large margin on the KITTI benchmark. Demo video: https://youtu.be/ZD-ZRsw7hdM
http://arxiv.org/abs/1903.00112
In scenarios where a robot generates and executes a plan, there may be instances where this generated plan is less costly for the robot to execute but incomprehensible to the human. When the human acts as a supervisor and is held accountable for the robot’s plan, the human may be at a higher risk if the incomprehensible behavior is deemed to be unsafe. In such cases, the robot, who may be unaware of the human’s exact expectations, may choose to do (1) the most constrained plan (i.e. one preferred by all possible supervisors) incurring the added cost of executing highly sub-optimal behavior when the human is observing it and (2) deviate to a more optimal plan when the human looks away. These problems amplify in situations where the robot has to fulfill multiple goals and cater to the needs of different human supervisors. In such settings, the robot, being a rational agent, should take any chance it gets to deviate to a lower cost plan. On the other hand, continuous monitoring of the robot’s behavior is often difficult for the human because it costs them valuable resources (eg. time, effort, cognitive overload etc.). To optimize the cost for constant monitoring while ensuring the robots follow the {\em safe} behavior, we model this problem in the game-theoretic framework of trust where the human is the agent that trusts the robot. We show that the notion of human’s trust, which is well-defined when there is a pure strategy equilibrium, is inversely proportional to the probability it assigns for observing the robot’s behavior. We then show that with high probability, our game lacks a pure strategy Nash equilibrium, forcing us to define a notion of trust boundary over mixed strategies of the human in order to guarantee safe behavior by the robot.
http://arxiv.org/abs/1903.00111
To automatically produce a brief yet expressive summary of a long video, an automatic algorithm should start by resembling the human process of summary generation. Prior work proposed supervised and unsupervised algorithms to train models for learning the underlying behavior of humans by increasing modeling complexity or craft-designing better heuristics to simulate human summary generation process. In this work, we take a different approach by analyzing a major cue that humans exploit for the summary generation; the nature and intensity of actions. We empirically observed that a frame is more likely to be included in human-generated summaries if it contains a substantial amount of deliberate motion performed by an agent, which is referred to as actionness. Therefore, we hypothesize that learning to automatically generate summaries involves an implicit knowledge of actionness estimation and ranking. We validate our hypothesis by running a user study that explores the correlation between human-generated summaries and actionness ranks. We also run a consensus and behavioral analysis between human subjects to ensure reliable and consistent results. The analysis exhibits a considerable degree of agreement among subjects within obtained data and verifying our initial hypothesis. Based on the study findings, we develop a method to incorporate actionness data to explicitly regulate a learning algorithm that is trained for summary generation. We assess the performance of our approach to four summarization benchmark datasets and demonstrate an evident advantage compared to state-of-the-art summarization methods.
http://arxiv.org/abs/1903.00110
A conditional general adversarial network (GAN) is proposed for image deblurring problem. It is tailored for image deblurring instead of just applying GAN on the deblurring problem. Motivated by that, dark channel prior is carefully picked to be incorporated into the loss function for network training. To make it more compatible with neuron networks, its original indifferentiable form is discarded and L2 norm is adopted instead. On both synthetic datasets and noisy natural images, the proposed network shows improved deblurring performance and robustness to image noise qualitatively and quantitatively. Additionally, compared to the existing end-to-end deblurring networks, our network structure is light-weight, which ensures less training and testing time.
http://arxiv.org/abs/1903.00107
Multilingual neural machine translation (NMT) enables training a single model that supports translation from multiple source languages into multiple target languages. In this paper, we push the limits of multilingual NMT in terms of number of languages being used. We perform extensive experiments in training massively multilingual NMT models, translating up to 102 languages to and from English within a single model. We explore different setups for training such models and analyze the trade-offs between translation quality and various modeling decisions. We report results on the publicly available TED talks multilingual corpus where we show that massively multilingual many-to-many models are effective in low resource settings, outperforming the previous state-of-the-art while supporting up to 59 languages. Our experiments on a large-scale dataset with 102 languages to and from English and up to one million examples per direction also show promising results, surpassing strong bilingual baselines and encouraging future work on massively multilingual NMT.
http://arxiv.org/abs/1903.00089
A change detection system takes as input two images of a region captured at two different times, and predicts which pixels in the region have undergone change over the time period. Since pixel-based analysis can be erroneous due to noise, illumination difference and other factors, contextual information is usually used to determine the class of a pixel (changed or not). This contextual information is taken into account by considering a pixel of the difference image along with its neighborhood. With the help of ground truth information, the labeled patterns are generated. Finally, Broad Learning classifier is used to get prediction about the class of each pixel. Results show that Broad Learning can classify the data set with a significantly higher F-Score than that of Multilayer Perceptron. Performance comparison has also been made with other popular classifiers, namely Multilayer Perceptron and Random Forest.
http://arxiv.org/abs/1903.00087
Fast and precise robot motion is needed in certain applications such as electronic manufacturing, additive manufacturing and assembly. Most industrial robot motion controllers allow externally commanded motion profile, but the trajectory tracking performance is affected by the robot dynamics and joint servo controllers which users have no direct access and little information. The performance is further compromised by time delays in transmitting the external command as a setpoint to the inner control loop. This paper presents an approach of combining neural networks and iterative learning control to improve the trajectory tracking performance for a multi-axis articulated industrial robot. For a given desired trajectory, the external command is iteratively refined using a high fidelity dynamical simulator to compensate for the robot inner loop dynamics. These desired trajectories and the corresponding refined input trajectories are then used to train multi-layer neural networks to emulate the dynamical inverse of the nonlinear inner loop dynamics. We show that with a sufficiently rich training set, the trained neural networks can generalize well to trajectories beyond the training set. In applying the trained neural networks to the physical robot, the tracking performance still improves but not as much as in the simulator. We show that transfer learning can effectively bridge the gap between simulation and the physical robot. In the end, we test the trained neural networks on other robot models in simulation and demonstrate the possibility of a general purpose network. Development and evaluation of this methodology is based on the ABB IRB6640-180 industrial robot and ABB RobotStudio software packages.
http://arxiv.org/abs/1903.00082