Convolutional Neural Networks (CNNs) have gained widespread popularity in the field of computer vision and image processing. Due to the huge computational requirements of CNNs, dedicated hardware-based implementations are being explored to improve their performance. Hardware platforms such as Field Programmable Gate Arrays (FPGAs) are widely used to design parallel architectures for this purpose. In this paper, we analyze Winograd minimal filtering (fast convolution) algorithms to reduce the arithmetic complexity of the convolutional layers of CNNs. We explore a complex design space to find the sets of parameters that result in improved throughput and power-efficiency. We also design a pipelined and parallel Winograd convolution engine that improves throughput and power-efficiency while reducing the computational complexity of the overall system. Our proposed designs show up to 4.75$\times$ and 1.44$\times$ improvements in throughput and power-efficiency, respectively, in comparison to the state-of-the-art design, while using approximately 2.67$\times$ more multipliers. Furthermore, we obtain savings of up to 53.6\% in logic resources compared with the state-of-the-art implementation.
https://arxiv.org/abs/1903.01811
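To make the arithmetic-complexity reduction concrete: the minimal sketch below implements the standard F(2x2, 3x3) Winograd transform (Lavin & Gray) in numpy, which computes a 2x2 output tile of a 3x3 convolution with 16 element-wise multiplications instead of 36. This illustrates the algorithm the paper builds on, not the paper's FPGA engine; the matrix names follow the usual convention.

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transforms (Lavin & Gray, 2016).
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]], dtype=np.float64)
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(tile, kernel):
    """2x2 output of a 'valid' correlation of a 4x4 tile with a 3x3 kernel."""
    U = G @ kernel @ G.T      # transform the filter: 4x4
    V = B_T @ tile @ B_T.T    # transform the input tile: 4x4
    M = U * V                 # 16 element-wise multiplications
    return A_T @ M @ A_T.T    # inverse transform: 2x2 output

# sanity check against direct sliding-window correlation
rng = np.random.default_rng(0)
d, g = rng.standard_normal((4, 4)), rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct)
```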
Indoor localization is one of the crucial enablers for the deployment of service robots. Although several successful techniques for indoor localization have been proposed in the past, the majority of them rely on maps generated from data gathered with the same sensor modality that is used for localization. Typically, tedious labor by experts is needed to acquire this data, limiting both the readiness of the system and its ease of installation for inexperienced operators. In this paper, we propose a memory- and computationally-efficient monocular camera-based localization system that allows a robot to estimate its pose given an architectural floor plan. Our method employs a convolutional neural network to predict room layout edges from a single camera image and estimates the robot pose using a particle filter that matches the extracted edges to the given floor plan. We evaluate our localization system in multiple real-world experiments and demonstrate that it has the robustness and accuracy required for reliable indoor navigation.
http://arxiv.org/abs/1903.01804
Point clouds are challenging to process due to their sparsity; as a result, autonomous vehicles rely more on appearance attributes than on pure geometric features. However, 3D LIDAR perception can provide crucial information for urban navigation in challenging light or weather conditions. In this paper, we investigate the versatility of shape completion for 3D object tracking in LIDAR point clouds. We design a Siamese tracker that encodes model and candidate shapes into a compact latent representation, and we regularize the encoding by enforcing that the latent representation decodes into an object model shape. We observe that 3D object tracking and 3D shape completion complement each other: learning a more meaningful latent representation yields better discriminatory capabilities and thus improved tracking performance. We test our method on the KITTI tracking dataset using car 3D bounding boxes. Our model reaches a 76.94% success rate and 81.38% precision for 3D object tracking, with the shape completion regularization contributing an improvement of 3% in both metrics.
https://arxiv.org/abs/1903.01784
In this work, we present a novel framework for on-line human gait stability prediction for elderly users of an intelligent robotic rollator using Long Short-Term Memory (LSTM) networks, fusing multimodal RGB-D and Laser Range Finder (LRF) data from non-wearable sensors. A Deep Learning (DL) based approach is used for upper-body pose estimation, and the detected pose is used to estimate the body Center of Mass (CoM) with an Unscented Kalman Filter (UKF). An augmented gait state estimation framework exploits the LRF data to estimate the legs' positions and the respective gait phase. These estimates form the inputs of an encoder-decoder sequence-to-sequence model that predicts the gait stability state as either Safe or Fall Risk walking. The framework is validated with data from real patients by exploring different network architectures and hyperparameter settings, and by comparing the proposed method with other baselines. The presented LSTM-based human gait stability predictor is shown to provide robust predictions of the human stability state, and thus has the potential to be integrated into a general user-adaptive control architecture as a fall-risk alarm.
http://arxiv.org/abs/1812.00252
Explainable Artificial Intelligence (XAI) systems need to include an explanation model to communicate internal decisions, behaviours, and actions to the interacting humans. Successful explanation involves both cognitive and social processes. In this paper, we focus on the challenge of meaningful interaction between an explainer and an explainee, and investigate the structural aspects of an interactive explanation in order to propose an interaction protocol. We follow a bottom-up approach to derive the model, analysing 398 explanation dialogues drawn from transcripts of different explanation dialogue types. We use grounded theory to code and identify the key components of an explanation dialogue. We formalize the model using the agent dialogue framework (ADF) as a new dialogue type and then evaluate it in a human-agent interaction study with 101 dialogues from 14 participants. Our results show that the proposed model can closely follow the explanation dialogues of human-agent conversations.
http://arxiv.org/abs/1903.02409
Scaling distributed deep learning to a massive GPU cluster is challenging due to the instability of large mini-batch training and the overhead of gradient synchronization. We address the instability of large mini-batch training with batch-size control and label smoothing, and we address the overhead of gradient synchronization with 2D-Torus all-reduce. Specifically, 2D-Torus all-reduce arranges GPUs in a logical 2D grid and performs a series of collective operations in different orientations. These two techniques are implemented with Neural Network Libraries (NNL). We have successfully trained ImageNet/ResNet-50 in 122 seconds without significant accuracy loss on the ABCI cluster.
http://arxiv.org/abs/1811.05233
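A minimal sketch of the 2D-Torus all-reduce schedule described above, simulated in numpy on a logical rows x cols grid: horizontal reduce-scatter, vertical all-reduce over the owned chunk, horizontal all-gather. The grid shape and chunking here are illustrative assumptions, not NNL's actual collective implementation.

```python
import numpy as np

def torus_allreduce_2d(grads, rows, cols):
    """Simulate 2D-Torus all-reduce on a rows x cols logical GPU grid."""
    assert len(grads) == rows * cols
    # split every gradient into `cols` chunks
    chunks = [np.array_split(g, cols) for g in grads]
    # (1) reduce-scatter along rows: GPU (r, c) ends up owning chunk c,
    #     summed over its row
    owned = {(r, c): sum(chunks[r * cols + j][c] for j in range(cols))
             for r in range(rows) for c in range(cols)}
    # (2) all-reduce along columns: chunk c becomes the global sum
    for c in range(cols):
        col_sum = sum(owned[(r, c)] for r in range(rows))
        for r in range(rows):
            owned[(r, c)] = col_sum
    # (3) all-gather along rows: every GPU reassembles the full gradient
    return [np.concatenate([owned[(r, c)] for c in range(cols)])
            for r in range(rows) for _ in range(cols)]

# every GPU ends up with the identical summed gradient
gs = [np.arange(8.0) + i for i in range(4)]
out = torus_allreduce_2d(gs, rows=2, cols=2)
assert all(np.allclose(o, sum(gs)) for o in out)
```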
A pathology report is arguably one of the most important documents in medicine, containing interpretive information about the visual findings from a patient's biopsy sample. Each pathology report has a retention period of up to 20 years after the treatment of the patient. Cancer registries all across the world process and encode high volumes of free-text pathology reports for surveillance of cancer and tumor diseases. In spite of the extremely valuable information they hold, pathology reports are not used in any systematic way to facilitate computational pathology. Therefore, in this study, we investigate automated machine-learning techniques to identify/predict the primary diagnosis (based on ICD-O codes) from pathology reports. We performed experiments by extracting TF-IDF features from the reports and classifying them using three different methods: SVM, XGBoost, and Logistic Regression. We constructed a new dataset of 1,949 pathology reports arranged into 37 ICD-O categories, collected from four different primary sites, namely lung, kidney, thymus, and testis. The reports were manually transcribed into text format after being collected as PDF files from the NCI Genomic Data Commons public dataset, and subsequently pre-processed to remove irrelevant textual artifacts produced by OCR software. The highest classification accuracy we achieved was 92\% using the XGBoost classifier on TF-IDF feature vectors, while the linear SVM scored 87\%. Furthermore, the study shows that TF-IDF vectors are suitable for highlighting the important keywords within a report, which can be helpful for the cancer research and diagnostic workflow. The results are encouraging in demonstrating the potential of machine learning methods for classification and encoding of pathology reports.
http://arxiv.org/abs/1903.07406
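A minimal sklearn sketch of the classification setup described above: TF-IDF feature vectors fed to a linear SVM, logistic regression, and XGBoost. `load_reports()` is a hypothetical placeholder for the pre-processed report texts and their ICD-O category indices; vectorizer settings are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

# hypothetical loader: pre-processed report strings and their
# ICD-O category indices (0..36)
reports, labels = load_reports()

for name, clf in [("LinearSVC", LinearSVC()),
                  ("LogisticRegression", LogisticRegression(max_iter=1000)),
                  ("XGBoost", XGBClassifier())]:
    pipe = make_pipeline(TfidfVectorizer(sublinear_tf=True), clf)
    scores = cross_val_score(pipe, reports, labels, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```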
Hue modification is the adjustment of the hue property of color images. Conducting hue modification on an image is trivial, and it can be abused to falsify viewers' opinions. Since shapes, edges, and textural information remain unchanged after hue modification, this type of manipulation is relatively hard to detect and localize. However, since small patches of an image inherit the same Color Filter Array (CFA) configuration and demosaicing, any distortion introduced by local hue modification can be detected by patch matching within the same image. In this paper, we propose to localize hue modification by means of a Siamese neural network specifically designed for matching two inputs. By crafting the network outputs, we are able to form a heatmap which highlights potentially malicious regions. Our proposed method deals well not only with uncompressed images but also with JPEG compression, an operation that usually hinders the exploitation of CFA and demosaicing artifacts. Experimental evidence corroborates the effectiveness of the proposed method.
https://arxiv.org/abs/1903.01735
Microblogs have become popular platforms for people to post, share, and seek information due to their convenience and low cost. However, they also facilitate the generation and propagation of fake news, which can have detrimental societal consequences, so detecting fake news on microblogs is important for societal good. Emotion is a significant indicator when verifying information on social media. Existing fake news detection studies utilize emotion mainly through users' stances or simple statistical emotional features, and exploitation of the emotion information in both news content and user comments remains limited. In realistic scenarios, to impress the audience and spread extensively, publishers typically either post a tweet with intense emotion that easily resonates with the crowd, or post a controversial statement unemotionally with the aim of evoking intense emotion among users. Therefore, in this paper, we study the novel problem of exploiting emotion information for fake news detection. We propose a new Emotion-based Fake News Detection framework (EFN), which can i) learn content- and comment-emotion representations for publishers and users, respectively; and ii) exploit content and social emotions simultaneously for fake news detection. Experimental results on a real-world dataset demonstrate the effectiveness of the proposed framework.
https://arxiv.org/abs/1903.01728
Over the years, many ellipse detection algorithms have sprung up and been studied broadly, yet detecting ellipses accurately and efficiently in real-world images remains a challenge. In this paper, we propose a valuable industry-oriented ellipse detector based on arc-support line segments, which simultaneously reaches high detection accuracy and efficiency. To simplify the complicated curves in an image while retaining general properties such as convexity and polarity, arc-support line segments are extracted, which grounds the successful detection of ellipses. Arc-support groups are formed by iteratively and robustly linking the arc-support line segments that latently belong to a common ellipse. Afterward, two complementary approaches, namely locally selecting the arc-support group with higher saliency and globally searching all valid paired groups, are adopted to fit the initial ellipses quickly. The ellipse candidate set is then formulated by hierarchical clustering in the 5D parameter space of the initial ellipses. Finally, the salient ellipse candidates are selected and refined as detections subject to stringent and effective verification. Extensive experiments on three public datasets show that our method achieves the best F-measure scores compared to state-of-the-art methods. The source code is available at https://github.com/AlanLuSun/High-quality-ellipse-detection.
http://arxiv.org/abs/1810.03243
Mobile manipulation planning commonly adopts a decoupled approach that plans for the base and the manipulator separately. While this approach is fast, it can generate sub-optimal paths. Another direction is a coupled approach that jointly adjusts the base and manipulator in a high-dimensional configuration space. This coupled approach addresses the sub-optimality and incompleteness of the decoupled approach, but has not been widely used due to its excessive computational overhead. Given this trade-off space, we present a simple yet effective mobile manipulation sampling method, harmonious sampling, that performs the coupled approach mainly in difficult regions, where the base and the manipulator must be maneuvered simultaneously. Our method identifies such difficult regions through a low-dimensional base space by utilizing a reachability map given the target end-effector pose and narrow passages detected by a generalized Voronoi diagram. For the remaining, simpler regions, we sample mainly on the base configurations with a predefined joint configuration, accelerating the planning process. We compare our method with the decoupled and coupled approaches on six different problems with varying difficulty. Our method shows meaningful experimental improvements over the decoupled approach in terms of time to find an initial solution (up to 5.6 times faster) and final solution cost (up to 17% lower), especially in difficult scenes with narrow spaces. We also demonstrate these benefits with a real, mobile Hubo robot.
http://arxiv.org/abs/1809.07497
The accuracy of an object detection model depends on whether its anchor boxes are trained effectively. Because the number of ground-truth (GT) boxes is small and the object targets are invariant during the training phase, the anchor boxes cannot be trained effectively. Extending the dataset is an effective way to improve detection accuracy. We propose a data augmentation method based on a foreground-background separation model, which uses a binary mask of the object target to randomly perturb the original dataset images. Perturbation methods include changing the color channels of the object, adding salt noise to the object, and enhancing contrast. The main contribution of this paper is a GAN-based data augmentation method that improves the detection accuracy of DSSD. Results are reported on both the PASCAL VOC2007 and PASCAL VOC2012 datasets: our model with 321x321 input achieves 78.7% mAP on the VOC2007 test set and 76.6% mAP on the VOC2012 test set.
https://arxiv.org/abs/1903.01716
Motion planning, a fundamental technology for automatic navigation of autonomous vehicles, remains an open challenge in real-life traffic situations and is mostly addressed by model-based approaches. However, due to the complexity of traffic situations and the uncertainty of edge cases, it is hard to devise a general motion planning system for an autonomous vehicle. In this paper, we propose a motion planning model based on deep learning (termed the spatiotemporal LSTM network), which generates real-time reactions based on spatiotemporal information extraction. Specifically, the model has three main components. First, Convolutional Long Short-Term Memory (Conv-LSTM) is used to extract hidden features from sequential image data. Then, a 3D Convolutional Neural Network (3D-CNN) extracts spatiotemporal information from the multi-frame features. Finally, fully connected layers construct a control model for the steering angle of the autonomous vehicle. Experiments demonstrate that the proposed method generates robust and accurate visual motion planning results for the autonomous vehicle.
https://arxiv.org/abs/1903.01712
The instrumentation of real systems is often designed for control purposes, and control inputs are designed to achieve nominal control objectives. Hence, the available measurements may not be sufficient to isolate faults with certainty, and diagnoses are ambiguous. Active diagnosis formulates a planning problem to generate a sequence of actions that, applied to the system, enforce diagnosability and allow ambiguous diagnoses to be iteratively refined. This paper analyses the requirements for applying active diagnosis to space systems and proposes ActHyDiag as an effective framework to solve this problem. It presents the results of applying ActHyDiag to a real space case study and of implementing the generated plans in the form of On-Board Control Procedures. The case study is a redundant SpaceWire network in which up to 6 instruments, monitored and controlled by the on-board software hosted in the Satellite Management Unit, transfer science data to a mass memory unit through SpaceWire routers. Experiments have been conducted on a real physical benchmark developed by Thales Alenia Space and demonstrate the effectiveness of the plans proposed by ActHyDiag.
https://arxiv.org/abs/1903.01710
Artificial Intelligence represents many things: a new market to conquer or a quality label for tech companies, a threat to traditional industries, a menace to democracy, or a blessing for our busy everyday life. The press abounds in examples illustrating these aspects, but one should not draw hasty and premature conclusions. The first successes in AI came as a surprise for society at large, including researchers in the field. Today, after the initial stupefaction, we see examples of the system reacting: traditional companies are heavily investing in AI, social platforms are monitored during elections, data collection is more and more regulated, etc. The resilience of an organization (i.e., its capacity to withstand a shock) relies deeply on its perception of its environment. Future problems have to be anticipated, while unforeseen events have to be quickly identified as they occur so that they can be mitigated as fast as possible. The author argues that this clear perception starts with a common definition of AI in terms of capacities and limits. AI practitioners should make notions and concepts accessible to the general public and the impacted fields (e.g., industry, law, education). It is a truism that only law experts have the potential to estimate AI's impacts on the judicial system. However, questions remain about how to connect different kinds of expertise and what level of detail is appropriate for these knowledge exchanges. The same consideration holds for dissemination towards society. Ultimately, society will live with the decisions made by the “experts”. It seems wise to involve society in the decision process rather than risk paying the consequences later. Therefore, society also needs the key concepts to understand AI's impact on people's lives. This was the purpose of the trial of an AI held in October 2018 at the Court of Appeal of Paris: gathering experts from various fields to expose challenges in law and science to a general public.
http://arxiv.org/abs/1903.09518
Recently, leveraging the development of end-to-end convolutional neural networks, deep stereo matching networks have achieved remarkable performance, far exceeding traditional approaches. However, state-of-the-art stereo methods still have difficulty finding correct correspondences in texture-less regions, detailed structures, small objects, and near boundaries, which could be alleviated by geometric clues such as edge contours and corresponding constraints. To improve the quality of disparity estimates in these challenging areas, we propose EdgeStereo, an effective multi-task learning network composed of a disparity estimation sub-network and an edge detection sub-network, enabling end-to-end prediction of both the disparity map and the edge map. To effectively incorporate edge cues, we propose an edge-aware smoothness loss and edge feature embedding for inter-task interactions. We demonstrate that, based on our unified model, the edge detection task and the stereo matching task promote each other. In addition, we design a compact module called the residual pyramid to replace the commonly used multi-stage cascaded structures or 3D-convolution-based regularization modules in current stereo matching networks. At the time of submission, EdgeStereo achieves state-of-the-art performance on the FlyingThings3D dataset and the KITTI 2012 and KITTI 2015 stereo benchmarks, outperforming other published stereo matching methods by a noteworthy margin. EdgeStereo also generalizes better for disparity estimation because of the incorporation of edge cues.
https://arxiv.org/abs/1903.01700
Cross-domain Chinese Word Segmentation (CWS) remains a challenge despite recent progress in neural CWS. The limited amount of annotated data in the target domain has been the key obstacle to satisfactory performance. In this paper, we propose a semi-supervised word-based approach to improving cross-domain CWS given a baseline segmenter. Notably, our model deploys only word embeddings trained on raw text in the target domain, discarding complex hand-crafted features and domain-specific dictionaries. Innovative subsampling and negative sampling methods are proposed to derive word embeddings optimized for CWS. We conduct experiments on five datasets in special domains, covering novels, medicine, and patents. Results show that our model can significantly improve cross-domain CWS, especially in the segmentation of domain-specific noun entities: the word F-measure increases by over 3.0% on four datasets, outperforming state-of-the-art semi-supervised and unsupervised cross-domain CWS approaches by a large margin.
https://arxiv.org/abs/1903.01698
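For context, the standard subsampling and negative-sampling knobs that the paper's innovations depart from look like this in gensim; this baseline sketch does not implement the paper's CWS-specific variants, and the toy corpus is purely illustrative.

```python
from gensim.models import Word2Vec

# toy segmented target-domain corpus (lists of candidate words)
sentences = [["患者", "每日", "服用", "阿司匹林"],
             ["患者", "服用", "阿司匹林", "后", "好转"]]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,
    sample=1e-4,      # subsampling threshold for frequent words
    negative=5,       # negative samples per positive pair
    min_count=1,
    sg=1)             # skip-gram
vec = model.wv["阿司匹林"]
```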
Person re-identification (re-id) is a cross-camera retrieval task which establishes correspondences between images of a person from multiple cameras. Deep learning methods have been successfully applied to this problem and have achieved impressive results. However, these methods require a large amount of labeled training data. Currently, labeled datasets in person re-id are limited in scale, and the manual acquisition of such large-scale datasets from surveillance cameras is a tedious and labor-intensive task. In this paper, we propose a framework that performs intelligent data augmentation and assigns partially smoothed labels to the generated data. Our approach first exploits the clustering property of existing person re-id datasets to create groups of similar objects that model cross-view variations. Each group is then used to generate realistic images through adversarial training. Our aim is to emphasize feature similarity between the generated samples and the original samples. Finally, we assign a non-uniform label distribution to the generated samples and define a regularized loss function for training. The proposed approach tackles two problems: (1) how to efficiently use the generated data, and (2) how to address the over-smoothness problem found in current regularization methods. Extensive experiments on four large-scale datasets show that our regularization method significantly improves re-id accuracy compared to existing methods.
http://arxiv.org/abs/1809.04976
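The non-uniform label distribution assigned to generated samples is in the spirit of label-smoothing regularization. A minimal PyTorch sketch of such a loss follows; the mixing weight `eps` and the uniform smoothing target are illustrative assumptions, as the paper's exact distribution may differ.

```python
import torch
import torch.nn.functional as F

def smoothed_loss(logits, labels, is_generated, eps=0.3):
    """Cross-entropy with partially smoothed targets.

    Real images keep one-hot targets; generated images mix the identity
    of their source group with a uniform distribution over K identities.
    """
    K = logits.size(1)
    one_hot = F.one_hot(labels, K).float()
    uniform = torch.full_like(one_hot, 1.0 / K)
    w = is_generated.float().unsqueeze(1) * eps   # smoothing only if generated
    target = (1 - w) * one_hot + w * uniform
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

logits = torch.randn(4, 10)
labels = torch.tensor([0, 3, 7, 2])
gen = torch.tensor([False, True, True, False])
loss = smoothed_loss(logits, labels, gen)
```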
We propose a novel real-time algorithm to localize hands and find their associations with multiple people in cluttered 4D volumetric data (dynamic 3D volumes). Different from traditional multiple-view approaches, which find key points in 2D and then triangulate to recover the 3D locations, our method directly processes the dynamic 3D data, which involves both clutter and crowds. The volumetric representation is more desirable than partial observations from different viewpoints and enables more robust and accurate results. However, due to the large amount of data in the volumetric representation, brute-force 3D schemes are slow. In this paper, we propose novel real-time methods that achieve both higher accuracy and faster speed than previous approaches. Our method detects the 3D bounding box of each subject and localizes the hands of each person. We develop new 2D features for fast candidate proposals and optimize the trajectory linking using a new max-covering bipartite matching formulation, which is critical for robust performance. We also propose a novel decomposition method that reduces the key point localization in each person's 3D volume to a sequence of efficient 2D problems. Our experiments show that the proposed method is faster than competing methods and gives almost half the localization error.
https://arxiv.org/abs/1903.01695
The use of information technology in the study of human behavior is a subject of great scientific interest. Cultural and personality aspects are factors that influence how people interact with one another in a crowd. This paper presents a methodology to detect cultural characteristics of crowds in video sequences. Based on filmed sequences, pedestrians are detected, tracked, and characterized. This information is then used to find cultural differences in those videos, based on the Big-Five personality model. Regarding the cultural differences of each country, results indicate that the model generates coherent information when compared to data reported in the literature.
https://arxiv.org/abs/1903.01688
Network pruning is widely used for reducing the heavy inference cost of deep models in low-resource settings. A typical pruning algorithm is a three-stage pipeline, i.e., training (a large model), pruning and fine-tuning. During pruning, according to a certain criterion, redundant weights are pruned and important weights are kept to best preserve the accuracy. In this work, we make several surprising observations which contradict common beliefs. For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined target network architecture, one can get rid of the full pipeline and directly train the target network from scratch. Our observations are consistent for multiple network architectures, datasets, and tasks, which imply that: 1) training a large, over-parameterized model is often not necessary to obtain an efficient final model, 2) learned “important” weights of the large model are typically not useful for the small pruned model, 3) the pruned architecture itself, rather than a set of inherited “important” weights, is more crucial to the efficiency in the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm. Our results suggest the need for more careful baseline evaluations in future research on structured pruning methods. We also compare with the “Lottery Ticket Hypothesis” (Frankle & Carbin 2019), and find that with optimal learning rate, the “winning ticket” initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization.
https://arxiv.org/abs/1810.05270
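The paper's comparison can be reproduced in miniature with the common L1-norm filter-pruning criterion: rank a convolution's output filters by L1 norm, keep the top fraction, and compare against training the smaller architecture from random initialization for the same budget. A sketch, with masking standing in for physically removing filters:

```python
import torch
import torch.nn as nn

def l1_filter_keep_mask(conv: nn.Conv2d, keep_ratio: float) -> torch.Tensor:
    """Rank output filters by L1 norm (a common structured criterion)
    and return a boolean keep-mask over the output channels."""
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    k = max(1, int(keep_ratio * norms.numel()))
    keep = torch.zeros_like(norms, dtype=torch.bool)
    keep[norms.topk(k).indices] = True
    return keep

conv = nn.Conv2d(3, 16, 3)
keep = l1_filter_keep_mask(conv, keep_ratio=0.5)
with torch.no_grad():                 # zero pruned filters in place
    conv.weight[~keep] = 0.0
    conv.bias[~keep] = 0.0

# scratch baseline: build the smaller architecture directly with
# fresh random init and train it for the same number of epochs
scratch = nn.Conv2d(3, int(keep.sum()), 3)
```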
Visually identifying materials is crucial for many tasks, yet material perception remains poorly understood. Distinguishing mirror from glass is particularly challenging as both materials derive their appearance from their surroundings, yet we rarely experience difficulties telling them apart. Here we took a ‘big data’ approach to uncovering the underlying visual cues and processes, leveraging recent advances in neural network models of vision. We trained thousands of convolutional neural networks on >750,000 simulated mirror and glass objects, and compared their performance with human judgments, as well as alternative classifiers based on ‘hand-engineered’ image features. For randomly chosen images, all classifiers and humans performed with high accuracy, and therefore correlated highly with one another. To tease the models apart, we then painstakingly assembled a diagnostic image set for which humans make highly systematic errors, allowing us to decouple accuracy from human-like performance. A large-scale, systematic search through feedforward neural architectures revealed that relatively shallow networks predicted human judgments better than any other models. However, surprisingly, no network correlated better than 0.6 with humans (below inter-human correlations). Thus, although the model sets new standards for simulating human vision in a challenging material perception task, the results cast doubt on recent claims that such architectures are generally good models of human vision.
https://arxiv.org/abs/1903.01671
Active localization is the problem of generating robot actions that allow the robot to maximally disambiguate its pose within a reference map. Traditional approaches to this problem use an information-theoretic criterion for action selection together with hand-crafted perceptual models. In this work, we propose an end-to-end differentiable method for learning to take informative actions that is trainable entirely in simulation and then transferable to real robot hardware with zero refinement. The system is composed of two modules: a convolutional neural network for perception, and a deep reinforcement-learned planning module. We introduce a multi-scale approach to the learned perceptual model, since the accuracy needed to perform action selection with reinforcement learning is much lower than the accuracy needed for robot control. We demonstrate that the resulting system outperforms systems that use the traditional approach for either perception or planning. We also demonstrate our approach's robustness to different map configurations and other nuisance parameters through the use of domain randomization in training. The code is compatible with the OpenAI Gym framework, as well as the Gazebo simulator.
http://arxiv.org/abs/1903.01669
Customized grippers have specifically designed fingers to increase the contact area with workpieces and improve grasp robustness. However, grasp planning for customized grippers is challenging due to object variations, surface contacts, and the structural constraints of the grippers. In this paper, we propose a learning framework to plan robust grasps for customized grippers in real-time. The framework contains a low-level optimization-based planner that searches for optimal grasps locally under object shape variations, and a high-level learning-based explorer that learns grasp exploration based on previous grasp experience. The optimization-based planner uses iterative surface fitting (ISF) to simultaneously search for the optimal gripper transformation and finger displacement by minimizing the surface fitting error. The high-level learning-based explorer trains a region-based convolutional neural network (R-CNN) to propose good optimization regions, which keeps ISF from getting stuck in bad local optima and improves collision avoidance. The proposed learning framework with RCNN-ISF is able to consider the structural constraints of the gripper, learn a grasp exploration strategy from previous experience, and plan optimal grasps in cluttered environments in real-time. The effectiveness of the algorithm is verified by experiments.
http://arxiv.org/abs/1809.08546
This paper proposes a method for tight fusion of visual, depth and inertial data in order to extend robotic capabilities for navigation in GPS-denied, poorly illuminated, and texture-less environments. Visual and depth information are fused at the feature detection and descriptor extraction levels to augment one sensing modality with the other. These multimodal features are then further integrated with inertial sensor cues using an extended Kalman filter to estimate the robot pose, sensor bias terms, and landmark positions simultaneously as part of the filter state. As demonstrated through a set of hand-held and Micro Aerial Vehicle experiments, the proposed algorithm is shown to perform reliably in challenging visually-degraded environments using RGB-D information from a lightweight and low-cost sensor and data from an IMU.
http://arxiv.org/abs/1903.01659
Automatic assembly has broad applications in industry. Traditional assembly tasks utilize predefined trajectories or tuned force-control parameters, which make automatic assembly time-consuming, difficult to generalize, and not robust to uncertainties. In this paper, we propose a learning framework for high-precision industrial assembly that combines supervised learning and reinforcement learning. The supervised learning utilizes trajectory optimization to provide initial guidance to the policy, while the reinforcement learning utilizes an actor-critic algorithm to establish the evaluation system even when the supervisor is not accurate. The proposed framework is more efficient than pure reinforcement learning and achieves better stability than pure supervised learning. The effectiveness of the method is verified by both simulation and experiment.
http://arxiv.org/abs/1809.08548
With an ever-widening domain of aerial robotic applications, including many mission-critical tasks such as disaster response operations, search and rescue missions, and infrastructure inspections taking place in GPS-denied environments, the need for reliable autonomous operation of aerial robots has become crucial. Operating in GPS-denied areas, aerial robots rely on a multitude of sensors to localize and navigate. Visible-spectrum cameras are the most commonly used sensors due to their low cost and weight. However, in visually-degraded environments, such as conditions of poor illumination, low texture, or the presence of obscurants including fog, smoke, and dust, the reliability of visible-light cameras deteriorates significantly. Nevertheless, maintaining reliable robot navigation in such conditions is essential. In contrast to visible-light cameras, thermal cameras offer visibility in the infrared spectrum and can be used in a complementary manner with visible-spectrum cameras for robot localization and navigation tasks, without paying the significant weight and power penalty typically associated with carrying other sensors. Exploiting this fact, in this work we present a multi-sensor fusion algorithm for reliable odometry estimation in GPS-denied and visually-degraded environments. The proposed method utilizes information from both the visible and thermal spectra for landmark selection and prioritizes feature extraction from informative image regions based on a metric over spatial entropy. Furthermore, inertial sensing cues are integrated to improve the robustness of the odometry estimation. To verify our solution, a set of challenging experiments was conducted inside a) an obscurant-filled, machine-shop-like industrial environment and b) a dark subterranean mine in the presence of heavy airborne dust.
http://arxiv.org/abs/1903.01656
High efficiency video coding (HEVC) has brought superior efficiency to video compression. To reduce the compression artifacts of HEVC, we propose a DenseNet-based approach as the in-loop filter of HEVC, which leverages multiple adjacent frames to enhance the quality of each encoded frame. Specifically, higher-quality frames are found by a reference frame selector (RFS). Then, a deep neural network for multi-frame in-loop filtering (named MIF-Net) is developed to enhance the quality of each encoded frame by utilizing the spatial information of that frame and the temporal information of its neighboring higher-quality frames. MIF-Net is built on the recently developed DenseNet, benefiting from its improved generalization capacity and computational efficiency. Finally, experimental results verify the effectiveness of our multi-frame in-loop filter, which outperforms the HM baseline and other state-of-the-art approaches.
https://arxiv.org/abs/1903.01648
This paper develops model-based grasp planning algorithms for assembly tasks. It focuses on industrial end-effectors like grippers and suction cups and plans grasp configurations using CAD models of target objects. The developed algorithms stably plan a large number of high-quality grasps, with high precision and little dependence on the quality of CAD models. The core underlying technique is superimposed segmentation, which pre-processes a mesh model by peeling it into facets. The algorithms use the superimposed segments to locate contact points and parallel facets, and synthesize grasp poses for popular industrial end-effectors. Several tunable parameters are provided to adapt the algorithms to various requirements. The experimental section demonstrates the advantages of the algorithms by analyzing their cost and stability, the precision of the planned grasps, and the tunable parameters, using both simulations and real-world experiments. Examples of robotic assembly systems using the proposed algorithms are also presented to demonstrate their efficacy.
http://arxiv.org/abs/1903.01631
NVDLA is an open-source deep neural network (DNN) accelerator which has received a lot of attention from the community since its introduction by Nvidia. It is a full-featured hardware IP and can serve as a good reference for conducting research and development of SoCs with integrated accelerators. However, an expensive FPGA board is required to do experiments with this IP in a real SoC. Moreover, since NVDLA is clocked at a lower frequency on an FPGA, it would be hard to do accurate performance analysis with such a setup. To overcome these limitations, we integrate NVDLA into a real RISC-V SoC on the Amazon cloud FPGA using FireSim, a cycle-exact FPGA-accelerated simulator. We then evaluate the performance of NVDLA by running the YOLOv3 object-detection algorithm. Our results show that NVDLA can sustain 7.5 fps when running YOLOv3. We further analyze the performance by showing that sharing the last-level cache with NVDLA can result in up to 1.56x speedup. We then identify that sharing the memory system with the accelerator can result in unpredictable execution times for the real-time tasks running on this platform. We believe this is an important issue that must be addressed in order for on-chip DNN accelerators to be incorporated in real-time embedded systems.
http://arxiv.org/abs/1903.06495
The verification of planning domain models is crucial to ensure the safety, integrity and correctness of planning-based automated systems. This task is usually performed using model checking techniques. However, directly applying model checkers to verify planning domain models can result in false positives, i.e. counterexamples that are unreachable by a sound planner when using the domain under verification during a planning task. In this paper, we discuss the downside of unconstrained planning domain model verification. We then propose a fail-safe practice for designing planning domain models that can inherently guarantee the safety of the produced plans in case of undetected errors in domain models. In addition, we demonstrate how model checkers, as well as state trajectory constraints planning techniques, should be used to verify planning domain models so that unreachable counterexamples are not returned.
http://arxiv.org/abs/1811.09231
While discriminative classifiers often yield strong predictive performance, missing feature values at prediction time can still be a challenge. Classifiers may not behave as expected under certain ways of substituting the missing values, since they inherently make assumptions about the data distribution they were trained on. In this paper, we propose a novel framework that classifies examples with missing features by computing the expected prediction on a given feature distribution. We then use geometric programming to learn a naive Bayes distribution that embeds a given logistic regression classifier and can efficiently take its expected predictions. Empirical evaluations show that our model achieves the same performance as the logistic regression with all features observed, and outperforms standard imputation techniques when features go missing during prediction time. Furthermore, we demonstrate that our method can be used to generate ‘sufficient explanations’ of logistic regression classifications, by removing features that do not affect the classification.
https://arxiv.org/abs/1903.01620
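The paper computes the expected prediction exactly via a naive Bayes distribution that embeds the logistic regression; as a rough stand-in for intuition only, the sketch below estimates the same expectation by Monte Carlo, sampling missing features from independent Gaussian marginals (an assumption the paper does not make).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def expected_prediction(clf, x, missing, mean, std, n=2000, seed=0):
    """Monte Carlo E[f(x)]: draw the missing entries from independent
    Gaussian marginals and average the classifier's probabilities."""
    rng = np.random.default_rng(seed)
    X = np.tile(x, (n, 1))
    for j in np.flatnonzero(missing):
        X[:, j] = rng.normal(mean[j], std[j], size=n)
    return clf.predict_proba(X)[:, 1].mean()

rng = np.random.default_rng(0)
Xtr = rng.standard_normal((500, 4))
ytr = (Xtr.sum(axis=1) > 0).astype(int)
clf = LogisticRegression().fit(Xtr, ytr)

x = np.array([0.2, 1.0, 0.0, 0.0])             # last two features missing
missing = np.array([False, False, True, True])
p = expected_prediction(clf, x, missing, Xtr.mean(0), Xtr.std(0))
```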
A plethora of recent work has shown that convolutional networks are not robust to adversarial images: images that are created by perturbing a sample from the data distribution so as to maximize the loss on the perturbed example. In this work, we hypothesize that adversarial perturbations move the image away from the image manifold in the sense that there exists no physical process that could have produced the adversarial image. This hypothesis suggests that a successful defense mechanism against adversarial images should aim to project the images back onto the image manifold. We study such defense mechanisms, which approximate the projection onto the unknown image manifold by a nearest-neighbor search against a web-scale image database containing tens of billions of images. Empirical evaluations of this defense strategy on ImageNet suggest that it is very effective in attack settings in which the adversary does not have access to the image database. We also propose two novel attack methods to break nearest-neighbor defenses and demonstrate conditions under which nearest-neighbor defense fails. We perform a series of ablation experiments, which suggest that there is a trade-off between robustness and accuracy in our defenses, that a large image database (with hundreds of millions of images) is crucial to getting good performance, and that careful construction of the image database is important for robustness against attacks tailored to circumvent our defenses.
https://arxiv.org/abs/1903.01612
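A toy version of the nearest-neighbor projection idea: replace a (possibly adversarial) query feature by the mean of its K nearest database neighbors and classify by neighbor vote. The real system searches tens of billions of images with approximate indices; this numpy sketch with random data only illustrates the mechanism.

```python
import numpy as np

def nn_defense(query_feat, db_feats, db_labels, k=50):
    """Project a query onto the 'image manifold' approximated by a
    feature database: average its k nearest neighbors and vote."""
    d = np.linalg.norm(db_feats - query_feat, axis=1)
    idx = np.argpartition(d, k)[:k]           # k smallest distances
    projected = db_feats[idx].mean(axis=0)    # defended feature
    label = np.bincount(db_labels[idx]).argmax()
    return projected, label

rng = np.random.default_rng(0)
db = rng.standard_normal((10000, 128))
labels = rng.integers(0, 10, size=10000)
feat, label = nn_defense(rng.standard_normal(128), db, labels)
```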
Recent work on the “lottery ticket hypothesis” proposes that randomly-initialized, dense neural networks contain much smaller, fortuitously initialized subnetworks (“winning tickets”) capable of training to similar accuracy as the original network at a similar speed. While strong evidence exists for the hypothesis across many settings, it has not yet been evaluated on large, state-of-the-art networks, and there is even evidence against the hypothesis on deeper networks. We modify the lottery ticket pruning procedure to make it possible to identify winning tickets on deeper networks. Rather than set the weights of a winning ticket to their original initializations, we set them to the weights obtained after a small number of training iterations (“late resetting”). Using late resetting, we identify the first winning tickets for ResNet-50 on ImageNet. To understand the efficacy of late resetting, we study the “stability” of neural network training to pruning, which we define as the consistency of the optimization trajectories followed by a winning ticket when it is trained in isolation and as part of the larger network. We find that late resetting produces more stable winning tickets and that improved stability correlates with higher winning ticket accuracy. This analysis offers new insights into the lottery ticket hypothesis and the dynamics of neural network learning.
https://arxiv.org/abs/1903.01611
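A compact sketch of lottery-ticket pruning with late resetting: warm up for k iterations, snapshot the weights, train to completion, prune globally by magnitude, then rewind the surviving weights to the snapshot instead of to initialization. `train_fn` is a hypothetical training-loop placeholder, and one-shot global pruning stands in for the paper's iterative procedure.

```python
import copy
import torch
import torch.nn as nn

def find_ticket_late_reset(model: nn.Module, train_fn,
                           prune_frac=0.8, rewind_iters=1000):
    """train_fn(model, n_iters) trains for n_iters steps
    (None = to completion); it is a placeholder."""
    train_fn(model, rewind_iters)                 # short warm-up
    theta_k = copy.deepcopy(model.state_dict())   # late-reset point
    train_fn(model, None)                         # train to completion

    # one-shot global magnitude pruning over all weight tensors
    all_w = torch.cat([p.detach().abs().flatten()
                       for n, p in model.named_parameters() if "weight" in n])
    thresh = torch.quantile(all_w, prune_frac)
    masks = {n: (p.detach().abs() > thresh).float()
             for n, p in model.named_parameters() if "weight" in n}

    # rewind surviving weights to theta_k and apply the masks
    model.load_state_dict(theta_k)
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in masks:
                p.mul_(masks[n])
    return model, masks   # retrain with the masks held fixed
```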
As deep learning continues to make progress on challenging perception tasks, there is increased interest in combining vision, language, and decision-making. Specifically, the Vision and Language Navigation (VLN) task involves navigating to a goal purely from language instructions and visual information, without explicit knowledge of the goal. Recent successful approaches have made inroads toward good success rates on this task but rely on beam search, which thoroughly explores a large number of trajectories and is unrealistic for applications such as robotics. In this paper, inspired by the intuition of viewing the problem as search on a navigation graph, we propose to use a progress monitor developed in prior work as a learnable heuristic for search. We then propose two modules incorporated into an end-to-end architecture: 1) a learned backtracking mechanism, which decides whether to continue moving forward or roll back to a previous state (Regret Module), and 2) a mechanism that helps the agent decide which direction to go next by showing directions that have been visited and their associated progress estimates (Progress Marker). Combined, the proposed approach significantly outperforms current state-of-the-art methods using greedy action selection, with 5% absolute improvement in success rate on the test server and, more importantly, 8% in success rate normalized by path length. Our code is available at https://github.com/chihyaoma/regretful-agent .
http://arxiv.org/abs/1903.01602
Image deblurring aims to restore latent sharp images from their blurred counterparts. In this paper, we present an unsupervised method for domain-specific single-image deblurring based on disentangled representations. The disentanglement is achieved by splitting the content and blur features of a blurred image using content and blur encoders. We enforce a KL divergence loss to regularize the distribution range of the extracted blur attributes so that they contain little content information. Meanwhile, to handle unpaired training data, a blurring branch and a cycle-consistency loss are added to guarantee that the content structures of the deblurred results match the original images. We also add an adversarial loss on the deblurred results to generate visually realistic images, and a perceptual loss to further mitigate artifacts. We perform extensive experiments on face and text deblurring using both synthetic datasets and real images, and achieve improved results compared to recent state-of-the-art deblurring methods.
https://arxiv.org/abs/1903.01594
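The KL divergence loss on the blur encoder can take the familiar closed form of the KL between a diagonal Gaussian and a standard normal, assuming (as in VAEs) that the blur encoder outputs a mean and log-variance; a sketch:

```python
import torch

def blur_kl_loss(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, sigma^2) || N(0, I) ), summed over the blur-code
    dimensions and averaged over the batch. Constraining the blur code
    toward a standard normal limits how much content it can carry."""
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    return kl.mean()

mu, logvar = torch.randn(8, 64), torch.randn(8, 64)
loss = blur_kl_loss(mu, logvar)
```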
When operating in unstructured environments such as warehouses, homes, and retail centers, robots are frequently required to interactively search for and retrieve specific objects from cluttered bins, shelves, or tables. Mechanical Search describes the class of tasks where the goal is to locate and extract a known target object. In this paper, we formalize Mechanical Search and study a version in which distractor objects are heaped over the target object in a bin. The robot uses an RGBD perception system and control policies to iteratively select, parameterize, and perform one of three actions – push, suction, grasp – until the target object is extracted, a time limit is exceeded, or no high-confidence push or grasp is available. We present a study of five algorithmic policies for Mechanical Search, with 15,000 simulated trials and 300 physical trials on heaps ranging from 10 to 20 objects. Results suggest that algorithmic policies can succeed in this long-horizon task in over 95% of instances and that the number of actions required scales approximately linearly with the size of the heap. Code and supplementary material can be found at this http URL .
http://arxiv.org/abs/1903.01588
For a given identity in a face dataset, certain iconic images are more representative of the subject than others. In this paper, we explore the problem of computing the iconicity of a face. The premise of the proposed approach is as follows: for an identity containing a mixture of iconic and non-iconic images, if a given face cannot be successfully matched with any other face of the same identity, then the iconicity of that face image is low. Using this information, we train a Siamese Multi-Layer Perceptron network such that each of its twins predicts the iconicity score of one of the image features in the pair fed as input. We observe the variation of the obtained scores with respect to covariates such as blur, yaw, pitch, roll, and occlusion to demonstrate that they effectively predict the quality of the image, and we compare them with other existing metrics. Furthermore, we use these scores to weight features for template-based face verification and compare this with media averaging of features.
https://arxiv.org/abs/1903.01581
Many modern nonlinear control methods aim to endow systems with guaranteed properties, such as stability or safety, and have been successfully applied to the domain of robotics. However, model uncertainty remains a persistent challenge, weakening theoretical guarantees and causing implementation failures on physical systems. This paper develops a machine learning framework centered around Control Lyapunov Functions (CLFs) to adapt to parametric uncertainty and unmodeled dynamics in general robotic systems. Our proposed method proceeds by iteratively updating estimates of Lyapunov function derivatives and improving controllers, ultimately yielding a stabilizing quadratic program model-based controller. We validate our approach on a planar Segway simulation, demonstrating substantial performance improvements by iteratively refining on a base model-free controller.
http://arxiv.org/abs/1903.01577
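The "stabilizing quadratic program model-based controller" has the standard CLF-QP form: minimize control effort subject to a Lyapunov decrease condition, with a slack variable for feasibility. A generic cvxpy sketch, where the Lie derivatives LfV and LgV would come from the iteratively updated estimates; the bounds and weights are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

def clf_qp(V, LfV, LgV, gamma=1.0, relax_weight=1e3, u_max=10.0):
    """Pointwise CLF-QP:  min ||u||^2 + p*d^2
    s.t.  LfV + LgV @ u <= -gamma * V + d,  |u| <= u_max,
    where the slack d keeps the problem feasible."""
    u = cp.Variable(LgV.shape[0])
    d = cp.Variable(nonneg=True)
    prob = cp.Problem(
        cp.Minimize(cp.sum_squares(u) + relax_weight * cp.square(d)),
        [LfV + LgV @ u <= -gamma * V + d,
         cp.norm(u, "inf") <= u_max])
    prob.solve()
    return u.value

u = clf_qp(V=2.0, LfV=1.5, LgV=np.array([0.8, -0.3]))
```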
3D multi-object detection and tracking are crucial for traffic scene understanding. However, the community has paid less attention to these areas due to the lack of a standardized benchmark dataset to advance the field. Moreover, existing datasets (e.g., KITTI) do not provide sufficient data and labels to tackle challenging scenes where highly interactive and occluded traffic participants are present. To address these issues, we present the Honda Research Institute 3D Dataset (H3D), a large-scale full-surround 3D multi-object detection and tracking dataset collected using a 3D LiDAR scanner. H3D comprises 160 crowded and highly interactive traffic scenes with a total of 1 million labeled instances in 27,721 frames. With its unique size, rich annotations, and complex scenes, H3D is gathered to stimulate research on full-surround 3D multi-object detection and tracking. To effectively and efficiently annotate a large-scale 3D point cloud dataset, we propose a labeling methodology that speeds up the overall annotation cycle. A standardized benchmark is created to evaluate full-surround 3D multi-object detection and tracking algorithms, and 3D object detection and tracking algorithms are trained and tested on H3D. Finally, sources of error are discussed to aid the development of future algorithms.
http://arxiv.org/abs/1903.01568
Learning interpretable and transferable subpolicies and performing task decomposition from a single, complex task is difficult. Some traditional hierarchical reinforcement learning techniques enforce this decomposition in a top-down manner, while meta-learning techniques require a task distribution at hand to learn such decompositions. This paper presents a framework for using diverse suboptimal world models to decompose complex task solutions into simpler modular subpolicies. The framework performs automatic decomposition of a single source task in a bottom-up manner, concurrently learning the required modular subpolicies as well as a controller to coordinate them. We perform a series of experiments on high-dimensional continuous-action control tasks to demonstrate the effectiveness of this approach at both complex single-task learning and lifelong learning. Finally, we perform ablation studies to understand the importance and robustness of different elements of the framework and the limitations of this approach.
https://arxiv.org/abs/1903.01567
We propose an unsupervised hashing method which aims to produce binary codes that preserve the ranking induced by a real-valued representation. Such compact hash codes enable the complete elimination of real-valued feature storage and allow for significant reductions in the computational complexity and storage cost of large-scale image retrieval applications. Specifically, we learn a neural-network-based model which transforms the input representation into a binary representation. We formalize the training objective of the network in an intuitive and effective way, considering each training sample as a query and aiming to obtain the same retrieval results with the produced hash codes as with the original features. This training formulation directly optimizes the hashing model for the target usage of the hash codes it produces. We further explore the addition of a decoder trained to obtain an approximate reconstruction of the original features. At test time, we retrieve the most promising database samples with an efficient graph-based search procedure using only our hash codes, and perform re-ranking using the reconstructed features, thus never needing to access the original features at all. Experiments conducted on multiple publicly available large-scale datasets show that our method consistently outperforms all compared state-of-the-art unsupervised hashing methods, and that the reconstruction procedure can effectively boost search accuracy at a minimal constant additional cost.
https://arxiv.org/abs/1903.01545
Performance evaluation of urban autonomous vehicles requires a realistic model of the behavior of other road users in the environment. Learning such models from data involves collecting naturalistic data of real-world human behavior. In many cases, acquiring this data can be prohibitively expensive or intrusive. Additionally, the available data often contain only typical behaviors and exclude behaviors that are classified as rare events. To evaluate the performance of AVs in such situations, we develop a model of traffic behavior based on the theory of bounded rationality. Based on experiments performed on large naturalistic driving data, we show that the developed model can be applied to estimate the probability of rare events, as well as to generate new traffic situations.
http://arxiv.org/abs/1903.01539
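Bounded rationality is often operationalized as a quantal-response (softmax) choice model, under which suboptimal "rare event" behaviors retain non-zero probability; whether the paper uses exactly this form is an assumption of this illustrative sketch.

```python
import numpy as np

def choice_probs(utilities, lam=2.0):
    """Quantal-response model: P(a) ∝ exp(lam * u(a)).
    lam -> inf recovers a perfectly rational agent; smaller lam
    spreads probability mass onto suboptimal (rare) behaviors."""
    z = lam * np.asarray(utilities, dtype=float)
    z -= z.max()                 # numerical stability
    p = np.exp(z)
    return p / p.sum()

# e.g. utilities of {yield, proceed, cut-in} for one traffic scene
p = choice_probs([1.0, 0.8, -0.5], lam=2.0)
rare_event_prob = p[2]           # probability of the risky cut-in
```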
In crowded social scenarios with a myriad of external stimuli, human brains exhibit a natural ability to filter out irrelevant information and narrowly focus attention. In the midst of multiple groups of people, humans use such sensory gating to effectively further their own group's interactional goals. In this work, we consider the design of a policy network to model multi-group multi-person communication. Our policy takes as input the state of the world, such as an agent's gaze direction, the body poses of other agents, or the history of past actions, and outputs an optimal action such as speaking, listening, or responding (communication modes). Inspired by humans' natural neurobiological filtering process, a central component of our policy network design is an information gating function, termed the Kinesic-Proxemic-Message Gate (KPM-Gate), that models the ability of an agent to selectively gather information from specific neighboring agents. The degree of influence of a neighbor is based on dynamic non-verbal cues such as body motion, head pose (kinesics), and interpersonal space (proxemics). We further show that the KPM-Gate can be used to discover social groups using its natural interpretation as a social attention mechanism. We pose the communication policy learning problem as a multi-agent imitation learning problem, and we learn a single policy shared by all agents under the assumption of a decentralized Markov decision process. We term our policy network the Multi-Agent Group Discovery and Communication Mode Network (MAGDAM), as it learns social group structure in addition to the dynamics of group communication. Our experimental validation on both synthetic and real-world data shows that our model is able to both discover social group structure and learn an accurate multi-agent communication policy.
https://arxiv.org/abs/1903.01537
Deep learning approaches for Visual-Inertial Odometry (VIO) have proven successful, but they rarely focus on incorporating robust fusion strategies for dealing with imperfect input sensory data. We propose a novel end-to-end selective sensor fusion framework for monocular VIO which fuses monocular images and inertial measurements in order to estimate the trajectory while improving robustness to real-life issues such as missing or corrupted data and bad sensor synchronization. In particular, we propose two fusion modalities based on different masking strategies: deterministic soft fusion and stochastic hard fusion, and we compare them with previously proposed direct fusion baselines. During testing, the network is able to selectively process the features of the available sensor modalities and produce a trajectory at scale. We present a thorough investigation of performance on three public datasets covering autonomous driving, Micro Aerial Vehicle (MAV) flight, and hand-held VIO. The results demonstrate the effectiveness of the fusion strategies, which offer better performance than direct fusion, particularly in the presence of corrupted data. In addition, we study the interpretability of the fusion networks by visualizing the masking layers in different scenarios and with varying data corruption, revealing interesting correlations between the fusion networks and imperfect sensory input data.
http://arxiv.org/abs/1903.01534
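The two masking strategies can be sketched as gating layers over concatenated visual and inertial features: deterministic soft fusion applies sigmoid weights, while stochastic hard fusion samples a binary mask with the Gumbel-softmax trick. The feature dimensions and single-layer gate below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveFusion(nn.Module):
    """Mask concatenated visual+inertial features before the pose
    regressor. mode='soft': deterministic sigmoid weights in [0, 1];
    mode='hard': stochastic binary mask via Gumbel-softmax."""
    def __init__(self, dim, mode="soft"):
        super().__init__()
        self.mode = mode
        out = dim if mode == "soft" else 2 * dim
        self.gate = nn.Linear(dim, out)

    def forward(self, feats):                    # feats: (B, dim)
        if self.mode == "soft":
            return feats * torch.sigmoid(self.gate(feats))
        logits = self.gate(feats).view(feats.size(0), -1, 2)
        mask = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 0]
        return feats * mask                      # differentiable binary mask

fusion = SelectiveFusion(dim=512 + 256, mode="hard")
fused = fusion(torch.randn(4, 512 + 256))
```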
To evaluate their performance, existing dehazing approaches generally rely on distance measures between the generated image and its corresponding ground truth. Despite their ability to produce visually pleasing images, pixel-based and even perceptual metrics do not guarantee, in general, that the produced image is fit to be used as input for low-level computer vision tasks such as segmentation. To overcome this weakness, we propose a novel end-to-end approach for image dehazing that produces images fit to be used as input to an image segmentation procedure while maintaining the visual quality of the generated images. Inspired by the success of Generative Adversarial Networks (GANs), we propose to optimize the generator by introducing a discriminator network and a loss function that evaluates the segmentation quality of dehazed images. In addition, we make use of a supplementary loss function that verifies that the visual and perceptual quality of the generated image is preserved in hazy conditions. Results obtained with the proposed technique are appealing, comparing favorably to state-of-the-art approaches when considering the performance of segmentation algorithms on the hazy images.
https://arxiv.org/abs/1903.01530
Deep neural network-based methods have been proven to achieve outstanding performance on object detection and classification tasks. Despite significant performance improvements, due to their deep structures they still require prohibitive runtime to process images, which hinders maintaining the highest possible performance in real-time applications. Observing that the human vision system (HVS) relies heavily on temporal dependencies among frames of the visual input to conduct recognition efficiently, we propose a novel framework dubbed TKD: temporal knowledge distillation. This framework distills temporal knowledge from a heavy neural-network-based model over selected video frames (the perception of the moments) to a light-weight model. To enable the distillation, we put forward two novel procedures: 1) a Long Short-Term Memory (LSTM) based key frame selection method; and 2) a novel teacher-bounded loss design. To validate our approach, we conduct comprehensive empirical evaluations using different object detection methods over multiple datasets, including the Youtube-Objects and Hollywood scene datasets. Our results show consistent improvement in accuracy-speed trade-offs for object detection over the frames of dynamic scenes, compared to other modern object recognition methods.
https://arxiv.org/abs/1903.01522
In radiologists’ routine work, one major task is to read a medical image, e.g., a CT scan, find significant lesions, and describe them in sentences in the radiology report. In this paper, we study the lesion description or annotation problem as an important step of computer-aided diagnosis (CAD). Given a lesion image, our aim is to predict multiple relevant labels, such as the lesion’s body part, type, and attributes. To address this problem, we define a set of 145 labels based on RadLex to describe a large variety of lesions in the DeepLesion dataset. We directly mine training labels from the lesion’s corresponding sentence in the radiology report, which requires minimal manual effort and is easily generalizable to large data and label sets. A multi-label convolutional neural network with a multi-scale structure and a noise-robust loss is then proposed. Quantitative and qualitative experiments demonstrate the effectiveness of the framework. The average area under the ROC curve on 1,872 test lesions is 0.9083.
https://arxiv.org/abs/1903.01505
We present a framework for creating navigable space from the sparse and noisy map points generated by sparse visual SLAM methods. Our method incrementally seeds and creates local convex regions free of obstacle points along a robot’s trajectory. A dense version of the point cloud is then reconstructed through a map point regulation process, where the original noisy map points are first projected onto a series of local convex hull surfaces, after which the points falling inside the convex hulls are culled. The regulated and refined map points allow human users to quickly recognize and abstract the environmental information. We have validated our proposed framework using both a public dataset and a real environmental structure. Our results reveal that the reconstructed navigable free space has small volume loss (error) compared with the ground truth, and that the method is highly efficient, allowing real-time computation and online planning.
http://arxiv.org/abs/1903.01503
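The "cull points inside the convex hulls" step can be reproduced with scipy: triangulate the hull vertices with Delaunay and drop map points whose simplex lookup succeeds, keeping the hull surface itself. A minimal sketch on random points; the paper applies this per local convex region along the trajectory.

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay

def cull_interior_points(points):
    """Keep only map points on the local convex hull surface;
    points strictly inside the hull are culled as free space."""
    hull = ConvexHull(points)
    tri = Delaunay(points[hull.vertices])
    inside = tri.find_simplex(points) >= 0   # True on or inside the hull
    on_hull = np.zeros(len(points), dtype=bool)
    on_hull[hull.vertices] = True            # re-keep the hull vertices
    return points[~inside | on_hull]

rng = np.random.default_rng(0)
pts = rng.standard_normal((500, 3))          # noisy local map points
kept = cull_interior_points(pts)
```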
We describe Voyageur, which is an application of experiential search to the domain of travel. Unlike traditional search engines for online services, experiential search focuses on the experiential aspects of the service under consideration. In particular, Voyageur needs to handle queries for subjective aspects of the service (e.g., quiet hotel, friendly staff) and combine these with objective attributes, such as price and location. Voyageur also highlights interesting facts and tips about the services the user is considering to provide them with further insights into their choices.
https://arxiv.org/abs/1903.01498