Generating realistic images from scene graphs requires neural networks to reason about object relationships and compositionality. As a relatively new task, how to properly ensure that generated images comply with scene graphs, and how to measure task performance, remain open questions. In this paper, we propose to harness scene graph context to improve image generation from scene graphs. We introduce a scene graph context network that pools features produced by a graph convolutional neural network and provides them to both the image generation network and the adversarial loss. With the context network, our model is trained not only to generate realistic-looking images, but also to better preserve non-spatial object relationships. We also define two novel evaluation metrics for this task, the relation score and the mean opinion relation score, which directly evaluate scene graph compliance. We use both quantitative and qualitative studies to demonstrate that our proposed model outperforms the state-of-the-art on this challenging task.
http://arxiv.org/abs/1901.03762
In this paper, we propose a pyramid network structure to improve FCN-based segmentation solutions and apply it to labeling thyroid follicles in histology images. Our design is based on the notion that a hierarchical updating scheme, if properly implemented, can help FCNs capture the major objects, as well as structural details, in an image. To this end, we devise a residual module to be mounted on consecutive network layers, through which pixel labels are propagated from the coarsest layer towards the finest layer in a bottom-up fashion. We add five residual units along the decoding path of a modified U-Net to form our segmentation network, Res-Seg-Net. Experiments demonstrate that the multi-resolution set-up in our model is effective in producing segmentations with improved accuracy and robustness.
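The hierarchical updating scheme lends itself to a compact illustration. Below is a minimal PyTorch sketch (not the authors' code; the module name, channel sizes, and the bilinear upsampling choice are assumptions) of a residual refinement unit that upsamples a coarse label map and corrects it with a residual predicted from finer-scale decoder features:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualRefineUnit(nn.Module):
        """Upsample a coarse label map and correct it with a learned residual
        predicted from finer-scale decoder features (illustrative sketch)."""
        def __init__(self, feat_channels, num_classes):
            super().__init__()
            self.residual_head = nn.Conv2d(feat_channels, num_classes,
                                           kernel_size=3, padding=1)

        def forward(self, coarse_logits, fine_features):
            up = F.interpolate(coarse_logits, size=fine_features.shape[-2:],
                               mode='bilinear', align_corners=False)
            # bottom-up label propagation: coarse prediction plus fine residual
            return up + self.residual_head(fine_features)

    # toy usage: propagate labels from an 8x8 prediction to a 16x16 feature map
    unit = ResidualRefineUnit(feat_channels=64, num_classes=2)
    coarse = torch.randn(1, 2, 8, 8)
    fine = torch.randn(1, 64, 16, 16)
    print(unit(coarse, fine).shape)  # torch.Size([1, 2, 16, 16])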
http://arxiv.org/abs/1901.03760
In person attribute recognition, we describe a person in terms of their appearance, typically covering a wide range of traits such as age, gender, clothing, and footwear. Although attribute recognition could be used in a wide variety of scenarios, it is generally applied to video surveillance, where it is hampered by low resolution and other issues such as variable pose, occlusion and shadow. Recent approaches have used deep convolutional neural networks (CNNs) to improve the accuracy of person attribute recognition. However, many of these networks are relatively shallow, and it is unclear to what extent they use contextual cues to improve classification accuracy. In this paper, we propose deeper methods for person attribute recognition. Interpreting the reasons behind the classifications is also highly important, as it can provide insight into how the classifier makes decisions. Interpretation suggests that deeper networks generally take more contextual information into consideration, which helps improve classification accuracy and generalizability. We present experimental analysis and results for whole-body attributes using the PA-100K and PETA datasets and for facial attributes using the CelebA dataset.
http://arxiv.org/abs/1901.03756
Generating low-level robot controllers often requires manual parameter tuning and significant system knowledge, which can result in long design times for highly specialized controllers. With the growth of automation, the need for such controllers may grow faster than the number of expert designers. To address the problem of rapidly generating low-level controllers without domain knowledge, we propose using model-based reinforcement learning (MBRL) trained on a few minutes of automatically generated data. In this paper, we explore the capabilities of MBRL on a Crazyflie quadrotor with rapid dynamics, where existing classical control schemes offer a baseline against which to judge the new method's performance. To our knowledge, this is the first use of MBRL for low-level controlled hover of a quadrotor using only on-board sensors, direct motor input signals, and no initial dynamics knowledge. Our forward dynamics model is a neural network trained to predict the state variables at the next time step, with a regularization term on the variance of predictions. The model predictive controller then transmits the best actions from a GPU-enabled base station to the quadrotor firmware via radio. In our experiments, the quadrotor achieved hover for up to 6 seconds after training on 3 minutes of experimental data.
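As a rough sketch of the control loop described above, the following Python snippet implements random-shooting model predictive control with a placeholder one-step dynamics model; the quadratic hover cost, horizon, and candidate count are illustrative assumptions, not the paper's settings:

    import numpy as np

    def dynamics_model(state, action):
        """Placeholder for the learned one-step dynamics network."""
        return state + 0.1 * action  # a trained neural network goes here

    def random_shooting_mpc(state, horizon=10, n_candidates=500, action_dim=4):
        """Return the first action of the lowest-cost random action sequence."""
        rng = np.random.default_rng(0)
        candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
        best_cost, best_action = np.inf, None
        for seq in candidates:
            s, cost = state.copy(), 0.0
            for a in seq:
                s = dynamics_model(s, a)
                cost += np.sum(s**2)  # e.g., squared deviation from a hover setpoint
            if cost < best_cost:
                best_cost, best_action = cost, seq[0]
        return best_action

    print(random_shooting_mpc(np.ones(4)))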
http://arxiv.org/abs/1901.03737
Quantitative reasoning is an important component of reasoning that any intelligent natural language understanding system can reasonably be expected to handle. We present EQUATE (Evaluating Quantitative Understanding Aptitude in Textual Entailment), a new dataset to evaluate the ability of models to reason with quantities in textual entailment (including not only arithmetic and algebraic computation, but also other phenomena such as range comparisons and verbal reasoning with quantities). The average performance of 7 published textual entailment models on EQUATE does not exceed a majority class baseline, indicating that current models do not implicitly learn to reason with quantities. We propose a new baseline Q-REAS that manipulates quantities symbolically, achieving some success on numerical reasoning, but struggling at more verbal aspects of the task. We hope our evaluation framework will support the development of new models of quantitative reasoning in language understanding.
http://arxiv.org/abs/1901.03735
We consider the problem of exploration in meta reinforcement learning. Two new meta reinforcement learning algorithms are suggested: E-MAML and E-$\text{RL}^2$. Results are presented on a novel environment we call 'Krazy World' and on a set of maze environments. We show that E-MAML and E-$\text{RL}^2$ deliver better performance on tasks where exploration is important.
http://arxiv.org/abs/1803.01118
Automated rationale generation is an approach for real-time explanation generation whereby a computational model learns to translate an autonomous agent’s internal state and action data representations into natural language. Training on human explanation data can enable agents to learn to generate human-like explanations for their behavior. In this paper, using the context of an agent that plays Frogger, we describe (a) how to collect a corpus of explanations, (b) how to train a neural rationale generator to produce different styles of rationales, and (c) how people perceive these rationales. We conducted two user studies. The first study establishes the plausibility of each type of generated rationale and situates their user perceptions along the dimensions of confidence, human-likeness, adequate justification, and understandability. The second study further explores user preferences between the generated rationales with regard to confidence in the autonomous agent, and to communicating failure and unexpected behavior. Overall, we find alignment between the intended differences in features of the generated rationales and the differences perceived by users. Moreover, context permitting, participants preferred detailed rationales to form a stable mental model of the agent’s behavior.
http://arxiv.org/abs/1901.03729
Action anticipation and forecasting in videos do not require a hat-trick: as long as there are signs in the context, it is possible to foresee how actions are going to unfold. Capturing these signs is hard because the context includes the past. We propose an end-to-end network for action anticipation and forecasting with memory, which both anticipates the current action and foresees the next one. Experiments on action-sequence datasets show excellent results, indicating that training on histories with a dynamic memory can significantly improve forecasting performance.
http://arxiv.org/abs/1901.03728
Motion is a dominant cue in automated driving systems. Optical flow is typically computed to detect moving objects and to estimate depth using triangulation. In this paper, our motivation is to leverage the existing dense optical flow to improve the performance of semantic segmentation. To provide a systematic study, we construct four different architectures which use RGB only, flow only, concatenated RGBF, and two-stream RGB + flow. We evaluate these networks on two automotive datasets, namely Virtual KITTI and Cityscapes, using the state-of-the-art flow estimator FlowNet v2. We also make use of the ground-truth optical flow in Virtual KITTI to serve as an ideal estimator, and a standard Farneback optical flow algorithm to study the effect of noise. Using the flow ground truth in Virtual KITTI, the two-stream architecture achieves the best results, with an improvement of 4% IoU. As expected, there is a large improvement for moving objects such as trucks, vans and cars, with increases of 38%, 28% and 6% in IoU. FlowNet produces an improvement of 2.4% in average IoU, with larger improvements for the moving objects of 26%, 11% and 5% for trucks, vans and cars. In Cityscapes, flow augmentation provided an improvement for moving objects like motorcycles and trains, with increases of 17% and 7% in IoU.
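A two-stream RGB + flow architecture of the kind compared above can be sketched in a few lines of PyTorch. This is an illustrative toy (the encoders, fusion by concatenation, and layer sizes are assumptions, not the paper's networks):

    import torch
    import torch.nn as nn

    def small_encoder(in_ch):
        return nn.Sequential(nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                             nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())

    class TwoStreamSeg(nn.Module):
        """Two-stream segmentation: separate RGB and flow encoders, fused by
        concatenation before a shared decoder (sketch, not the paper's nets)."""
        def __init__(self, num_classes=10):
            super().__init__()
            self.rgb_enc = small_encoder(3)
            self.flow_enc = small_encoder(2)   # optical flow has u, v channels
            self.decoder = nn.Sequential(
                nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
                nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
                nn.Conv2d(64, num_classes, 1))

        def forward(self, rgb, flow):
            fused = torch.cat([self.rgb_enc(rgb), self.flow_enc(flow)], dim=1)
            return self.decoder(fused)

    net = TwoStreamSeg()
    out = net(torch.randn(1, 3, 64, 64), torch.randn(1, 2, 64, 64))
    print(out.shape)  # torch.Size([1, 10, 64, 64])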
http://arxiv.org/abs/1901.07355
Neural networks have become the standard model for various computer vision tasks in automated driving, including semantic segmentation, moving object detection, depth estimation, visual odometry, etc. The main flavors of neural networks in common use are convolutional (CNN) and recurrent (RNN). In spite of rapid progress in embedded processors, power consumption and cost are still bottlenecks. Spiking Neural Networks (SNNs) are gradually progressing toward low-power event-driven hardware architectures with the potential for high efficiency. In this paper, we explore the role of deep spiking neural networks for automated driving applications. We provide an overview of progress on SNNs and argue how they can be a good fit for automated driving applications.
http://arxiv.org/abs/1903.02080
Despite remarkable advances in image synthesis research, existing works often fail in manipulating images under the context of large geometric transformations. Synthesizing person images conditioned on arbitrary poses is one of the most representative examples, where the generation quality largely relies on the capability of identifying and modeling arbitrary transformations on different body parts. Current generative models are often built on local convolutions and overlook the key challenges (e.g., heavy occlusions, different views, or dramatic appearance changes) that arise when distinct geometric changes happen to each part under arbitrary pose manipulations. This paper aims to resolve these challenges induced by geometric variability and spatial displacements via a new Soft-Gated Warping Generative Adversarial Network (Warping-GAN), which is composed of two stages: 1) it first synthesizes a target part segmentation map given a target pose, which depicts region-level spatial layouts for guiding image synthesis with higher-level structure constraints; 2) the Warping-GAN, equipped with a soft-gated warping block, learns a feature-level mapping to render textures from the original image into the generated segmentation map. Warping-GAN is capable of controlling different degrees of transformation given distinct target poses. Moreover, the proposed warping block is lightweight and flexible enough to be injected into any network. Human perceptual studies and quantitative evaluations demonstrate the superiority of our Warping-GAN, which significantly outperforms all existing methods on two large datasets.
https://arxiv.org/abs/1810.11610
In this paper we present MLaut (Machine Learning AUtomation Toolbox) for the python data science ecosystem. MLaut automates large-scale evaluation and benchmarking of machine learning algorithms on a large number of datasets. MLaut provides a high-level workflow interface to machine learning algorithms, implements a local back-end for a database of dataset collections, trained algorithms, and experimental results, and provides easy-to-use interfaces to the scikit-learn and keras modelling libraries. Experiments are easy to set up with default settings in a few lines of code, while remaining fully customizable down to the level of hyper-parameter tuning, pipeline composition, or deep learning architecture. As a principal test case for MLaut, we conducted a large-scale supervised classification study in order to benchmark the performance of a number of machine learning algorithms; to our knowledge, this is also the first larger-scale study on standard supervised learning datasets to include deep learning algorithms. While corroborating a number of previous findings in the literature, we found (within the limitations of our study) that deep neural networks do not perform well on basic supervised learning tasks, i.e., outside the more specialized image-, audio-, or text-based tasks.
http://arxiv.org/abs/1901.03678
We study the global convergence of generative adversarial imitation learning for linear quadratic regulators, which is posed as minimax optimization. To address the challenges arising from non-convex-concave geometry, we analyze the alternating gradient algorithm and establish its Q-linear rate of convergence to a unique saddle point, which simultaneously recovers the globally optimal policy and reward function. We hope our results may serve as a small step towards understanding and taming the instability in imitation learning as well as in more general non-convex-concave alternating minimax optimization that arises from reinforcement learning and generative adversarial learning.
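The alternating gradient scheme analyzed here can be illustrated on a toy saddle problem. The strongly-convex-strongly-concave example below is purely didactic and much simpler than the paper's LQR setting:

    import numpy as np

    # Toy saddle problem: min_x max_y  x*y + 0.1*x**2 - 0.1*y**2,
    # whose unique saddle point is the origin.
    x, y, lr = 1.0, -1.0, 0.1
    for t in range(2000):
        x -= lr * (y + 0.2 * x)   # descent step on x
        y += lr * (x - 0.2 * y)   # ascent step on y, using the updated x (alternating)
    print(x, y)  # both approach 0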
http://arxiv.org/abs/1901.03674
Inverse problems in medical imaging and computer vision are traditionally solved using purely model-based methods. Among these, variational regularization models are one of the most popular approaches. We propose a new framework for applying data-driven approaches to inverse problems, using a neural network as a regularization functional. The network learns to discriminate between the distribution of ground-truth images and the distribution of unregularized reconstructions. Once trained, the network is applied to the inverse problem by solving the corresponding variational problem. Unlike other data-based approaches to inverse problems, the algorithm can be applied even if only unsupervised training data is available. Experiments demonstrate the potential of the framework for denoising on the BSDS dataset and for computed tomography reconstruction on the LIDC dataset.
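Once the regularizer network is trained, reconstruction reduces to gradient descent on a variational objective of the form min_x ||Ax - y||^2 + lambda * Psi_theta(x). A minimal PyTorch sketch, with a toy linear operator and an untrained network standing in for the learned regularizer:

    import torch

    # toy forward operator (a blurring or undersampling operator would go here)
    A = torch.randn(32, 64)
    x_true = torch.randn(64)
    y = A @ x_true

    # stand-in for the trained regularizer network Psi_theta (here untrained)
    psi = torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.ReLU(),
                              torch.nn.Linear(32, 1))

    lam = 0.01
    x = torch.zeros(64, requires_grad=True)
    opt = torch.optim.SGD([x], lr=1e-3)
    for it in range(500):
        opt.zero_grad()
        loss = torch.sum((A @ x - y)**2) + lam * psi(x).squeeze()
        loss.backward()
        opt.step()
    print(float(torch.sum((A @ x - y)**2)))  # data fit of the reconstruction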
http://arxiv.org/abs/1805.11572
This paper discusses cooperative stabilization control of rigid formations via an event-based approach. We first design a centralized event-based formation control system, in which a central event controller determines the next triggering time and broadcasts the event signal to all the agents for control input update. We then build on this approach to propose a distributed event control strategy, in which each agent can use its local event trigger and local information to update the control input at its own event time. For both cases, the triggering condition, event function and triggering behavior are discussed in detail, and the exponential convergence of the event-based formation system is guaranteed.
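The basic event-triggered pattern, in which the control input is held constant between events and updated only when an event function fires, can be shown on a toy scalar system. The dynamics, trigger rule, and gains below are illustrative assumptions, not the paper's formation setup:

    import numpy as np

    # Toy event-triggered stabilization of  x' = x + u  with u = -2*x_hat,
    # where x_hat is the state sampled at the most recent event.
    dt, sigma = 0.01, 0.3   # sigma: state-dependent trigger threshold coefficient
    x, x_hat, events = 1.0, 1.0, 0
    for k in range(1000):
        if abs(x - x_hat) > sigma * abs(x):  # event function fires -> trigger
            x_hat = x                        # broadcast new state, update control
            events += 1
        u = -2.0 * x_hat                     # input held constant between events
        x += dt * (x + u)
    print(f"final state {x:.5f} after {events} control updates (vs 1000 steps)")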
http://arxiv.org/abs/1901.03656
We present ParaBank, a large-scale English paraphrase dataset that surpasses prior work in both quantity and quality. Following the approach of ParaNMT, we train a Czech-English neural machine translation (NMT) system to generate novel paraphrases of English reference sentences. By adding lexical constraints to the NMT decoding procedure, however, we are able to produce multiple high-quality sentential paraphrases per source sentence, yielding an English paraphrase resource with more than 4 billion generated tokens and exhibiting greater lexical diversity. Using human judgments, we also demonstrate that ParaBank’s paraphrases improve over ParaNMT on both semantic similarity and fluency. Finally, we use ParaBank to train a monolingual NMT model with the same support for lexically-constrained decoding for sentence rewriting tasks.
http://arxiv.org/abs/1901.03644
Accurate state estimation is a fundamental problem for autonomous robots. To achieve locally accurate and globally drift-free state estimation, multiple sensors with complementary properties are usually fused together. Local sensors (camera, IMU, LiDAR, etc.) provide precise pose within a small region, while global sensors (GPS, magnetometer, barometer, etc.) supply noisy but globally drift-free localization in a large-scale environment. In this paper, we propose a sensor fusion framework to fuse local states with global sensors, which achieves locally accurate and globally drift-free pose estimation. Local estimations, produced by existing VO/VIO approaches, are fused with global sensors in a pose graph optimization. Within the graph optimization, local estimations are aligned in a global coordinate frame, and the accumulated drift is eliminated. We evaluate the performance of our system on public datasets and with real-world experiments. Results are compared against other state-of-the-art algorithms. We highlight that our system is a general framework, which can easily fuse various global sensors in a unified pose graph optimization. Our implementations are open source\footnote{https://github.com/HKUST-Aerial-Robotics/VINS-Fusion}.
http://arxiv.org/abs/1901.03642
Nowadays, robots are equipped with more and more sensors to increase their robustness and autonomous ability. We have seen various sensor suites on different platforms, such as stereo cameras on ground vehicles, a monocular camera with an IMU (Inertial Measurement Unit) on mobile phones, and stereo cameras with an IMU on aerial robots. Although many algorithms for state estimation have been proposed in the past, they are usually applied to a single sensor or a specific sensor suite. Few of them can be employed with multiple sensor choices. In this paper, we propose a general optimization-based framework for odometry estimation, which supports multiple sensor sets. Every sensor is treated as a general factor in our framework. Factors which share common state variables are summed together to build the optimization problem. We further demonstrate the generality with visual and inertial sensors, which form three sensor suites (stereo cameras, a monocular camera with an IMU, and stereo cameras with an IMU). We validate the performance of our system on public datasets and through real-world experiments with multiple sensors. Results are compared against other state-of-the-art algorithms. We highlight that our system is a general framework, which can easily fuse various sensors in a pose graph optimization. Our implementations are open source\footnote{https://github.com/HKUST-Aerial-Robotics/VINS-Fusion}.
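The factor-summing idea can be illustrated with a toy 1-D pose graph, where relative odometry factors (a local sensor) and absolute position factors (a global sensor) share state variables and are summed into one least-squares problem. This is illustrative only; the actual system optimizes full 6-DoF poses:

    import numpy as np
    from scipy.optimize import least_squares

    # toy 1-D pose graph: 5 poses, relative odometry factors + 2 absolute factors
    odom = [1.0, 1.0, 1.0, 1.0]     # measured pose increments (local sensor)
    absol = {0: 0.0, 4: 4.3}        # absolute position fixes (global sensor)

    def residuals(x):
        r = [x[i + 1] - x[i] - odom[i] for i in range(4)]   # odometry factors
        r += [x[i] - z for i, z in absol.items()]           # global factors
        return np.array(r)

    sol = least_squares(residuals, x0=np.zeros(5))
    print(sol.x)  # trajectory consistent with both sensor types, drift spread out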
http://arxiv.org/abs/1901.03638
Cross-domain image-to-image translation should satisfy two requirements: (1) preserve the information that is common to both domains, and (2) generate convincing images covering variations that appear in the target domain. This is challenging, especially when there are no example translations available as supervision. Adversarial cycle consistency was recently proposed as a solution, with beautiful and creative results, yielding much follow-up work. However, augmented reality applications cannot readily use such techniques to provide users with compelling translations of real scenes, because the translations do not have high-fidelity constraints. In other words, current models are liable to change details that should be preserved: while re-texturing a face, they may alter the face’s expression in an unpredictable way. In this paper, we introduce the problem of high-fidelity image-to-image translation, and present a method for solving it. Our main insight is that low-fidelity translations typically escape a cycle-consistency penalty, because the back-translator learns to compensate for the forward-translator’s errors. We therefore introduce an optimization technique that prevents the networks from cooperating: simply train each network only when its input data is real. Prior works, in comparison, train each network with a mix of real and generated data. Experimental results show that our method accurately disentangles the factors that separate the domains, and converges to semantics-preserving translations that prior methods miss.
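The training rule can be sketched directly. In the toy PyTorch loop below, each cycle term back-propagates only into the network whose input is real, so the back-translator cannot learn to compensate for the forward-translator's errors; the linear "translators" are placeholders, and the adversarial terms are omitted for brevity:

    import torch

    G = torch.nn.Linear(8, 8)      # placeholder A -> B translator
    F_net = torch.nn.Linear(8, 8)  # placeholder B -> A translator
    opt = torch.optim.Adam(list(G.parameters()) + list(F_net.parameters()), lr=1e-3)
    l1 = torch.nn.L1Loss()

    for step in range(100):
        real_a, real_b = torch.randn(4, 8), torch.randn(4, 8)

        # Cycle through A: G's input is real, so G learns; F's input is
        # generated, so F is frozen while this term's graph is built.
        for p in F_net.parameters():
            p.requires_grad_(False)
        loss_a = l1(F_net(G(real_a)), real_a)
        for p in F_net.parameters():
            p.requires_grad_(True)

        # Cycle through B: F's input is real, so F learns; G is frozen here.
        for p in G.parameters():
            p.requires_grad_(False)
        loss_b = l1(G(F_net(real_b)), real_b)
        for p in G.parameters():
            p.requires_grad_(True)

        opt.zero_grad()
        (loss_a + loss_b).backward()  # G gets gradient only from loss_a, F only from loss_b
        opt.step()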
http://arxiv.org/abs/1901.03628
While deep neural networks have proven to be a powerful tool for many recognition and classification tasks, their stability properties are still not well understood. In the past, image classifiers have been shown to be vulnerable to so-called adversarial attacks, which are created by additively perturbing the correctly classified image. In this paper, we propose the ADef algorithm to construct a different kind of adversarial attack created by iteratively applying small deformations to the image, found through a gradient descent step. We demonstrate our results on MNIST with convolutional neural networks and on ImageNet with Inception-v3 and ResNet-101.
http://arxiv.org/abs/1804.07729
This paper describes the current TTÜ speech transcription system for Estonian speech. The system is designed to handle semi-spontaneous speech, such as broadcast conversations, lecture recordings and interviews recorded in diverse acoustic conditions. The system is based on the Kaldi toolkit. Multi-condition training using background noise profiles extracted automatically from untranscribed data is used to improve the robustness of the system. Out-of-vocabulary words are recovered using a phoneme n-gram based decoding subgraph and an FST-based phoneme-to-grapheme model. The system achieves a word error rate of 8.1% on a test set of broadcast conversations. The system also performs punctuation recovery and speaker identification. Speaker identification models are trained using a recently proposed weakly supervised training method.
http://arxiv.org/abs/1901.03601
In 2018, clinics and hospitals were hit with numerous attacks leading to significant data breaches and interruptions in medical services. An attacker with access to medical records can do much more than hold the data for ransom or sell it on the black market. In this paper, we show how an attacker can use deep learning to add or remove evidence of medical conditions from volumetric (3D) medical scans. An attacker may perform this act in order to stop a political candidate, sabotage research, commit insurance fraud, perform an act of terrorism, or even commit murder. We implement the attack using a 3D conditional GAN and show how the framework (CT-GAN) can be automated. Although the body is complex and 3D medical scans are very large, CT-GAN achieves realistic results and can be executed in milliseconds. To evaluate the attack, we focus on injecting and removing lung cancer from CT scans. We show how three expert radiologists and a state-of-the-art deep learning AI could not differentiate between tampered and non-tampered scans. We also evaluate state-of-the-art countermeasures and propose our own. Finally, we discuss the possible attack vectors on modern radiology networks and demonstrate one of the attack vectors on an active CT scanner.
http://arxiv.org/abs/1901.03597
Wasserstein Generative Adversarial Networks (WGANs) can be used to generate realistic samples from complicated image distributions. The Wasserstein metric used in WGANs is based on a notion of distance between individual images, which induces a notion of distance between probability distributions of images. So far the community has considered $\ell^2$ as the underlying distance. We generalize the theory of WGAN with gradient penalty to Banach spaces, allowing practitioners to select the features to emphasize in the generator. We further discuss the effect of some particular choices of underlying norms, focusing on Sobolev norms. Finally, we demonstrate a boost in performance for an appropriate choice of norm on CIFAR-10 and CelebA.
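One concrete family of underlying norms discussed here is the Sobolev scale. A Sobolev $H^s$ norm can be evaluated via the Fourier multiplier $(1+|\xi|^2)^{s/2}$, as in the following numpy sketch (an illustration of the norm itself, not the paper's WGAN training code):

    import numpy as np

    def sobolev_norm(img, s=1.0):
        """H^s norm of an image via the Fourier multiplier (1 + |xi|^2)^(s/2)."""
        H, W = img.shape
        fy = np.fft.fftfreq(H)[:, None]
        fx = np.fft.fftfreq(W)[None, :]
        mult = (1.0 + fx**2 + fy**2) ** (s / 2.0)
        f_img = np.fft.fft2(img)
        return np.sqrt(np.sum(np.abs(mult * f_img)**2) / (H * W))

    img = np.random.rand(64, 64)
    print(sobolev_norm(img, s=0.0))  # s = 0 recovers the plain L2 norm
    print(sobolev_norm(img, s=1.0))  # s > 0 weights high-frequency content more

Choosing s thus controls which image features the gradient penalty, and hence the generator, emphasizes.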
https://arxiv.org/abs/1806.06621
Computer vision applications based on videos often require the detection of moving objects as their first step. Background subtraction is then applied in order to separate the background and the foreground. In the literature, background subtraction is surely among the most investigated fields in computer vision, with a large number of publications. Most of them concern the application of mathematical and machine learning models to be more robust to the challenges met in videos. However, the ultimate goal is for the background subtraction methods developed in research to be employed in real applications like traffic surveillance. Looking at the literature, we can remark that there is often a gap between the methods currently used in real applications and the current methods in fundamental research. In addition, the videos evaluated in large-scale datasets are not exhaustive, in the sense that they cover only part of the complete spectrum of challenges met in real applications. In this context, we attempt to provide the most exhaustive survey possible of real applications that use background subtraction, in order to identify the real challenges met in practice, the background models currently used, and future directions. Challenges are investigated in terms of camera, foreground objects and environments. In addition, we identify the background models that are effectively used in these applications, in order to find potentially usable recent background models in terms of robustness, time and memory requirements.
http://arxiv.org/abs/1901.03577
The window mechanism was introduced by Chatterjee et al. [1] to strengthen classical game objectives with time bounds. It makes it possible to synthesize system controllers that exhibit acceptable behaviors within a configurable time frame, all along their infinite execution, in contrast to traditional objectives that only require correctness of behaviors in the limit. The window concept has proved its interest in a variety of two-player zero-sum games, thanks to the ability to reason about such time bounds in system specifications, but also the increased tractability that it usually yields. In this work, we extend the window framework to stochastic environments by considering the fundamental threshold probability problem in Markov decision processes for window objectives. That is, given such an objective, we want to synthesize strategies that guarantee satisfying runs with a given probability. We solve this problem for the usual variants of window objectives, where either the time frame is set as a parameter, or we ask whether such a time frame exists. We develop a generic approach for window-based objectives and instantiate it for the classical mean-payoff and parity objectives, already considered in games. Our work paves the way to a wide use of the window mechanism in stochastic models. [1] Krishnendu Chatterjee, Laurent Doyen, Mickael Randour, and Jean-Fran\c{c}ois Raskin. Looking at mean-payoff and total-payoff through windows. Inf. Comput., 242:25-52, 2015.
http://arxiv.org/abs/1901.03571
Underwater image enhancement has been attracting much attention due to its significance in marine engineering and aquatic robotics. Numerous underwater image enhancement algorithms have been proposed in the last few years. However, these algorithms are mainly evaluated using either synthetic datasets or a few selected real-world images. It is thus unclear how these algorithms would perform on images acquired in the wild and how we could gauge the progress in the field. To bridge this gap, we present the first comprehensive perceptual study and analysis of underwater image enhancement using large-scale real-world degraded images. In this paper, we construct an Underwater Image Enhancement Benchmark Dataset (UIEBD) including 950 real-world underwater images, 890 of which have corresponding reference images. We treat the remaining 60 underwater images, for which satisfactory reference images could not be obtained, as challenging data. Using this dataset, we conduct a comprehensive qualitative and quantitative study of state-of-the-art underwater image enhancement algorithms. In addition, we propose an end-to-end Deep Underwater Image Enhancement Network (DUIENet) trained on this benchmark as a baseline, which indicates the suitability of the proposed UIEBD for training Convolutional Neural Networks (CNNs). The benchmark evaluations and the proposed DUIENet demonstrate the performance and limitations of state-of-the-art algorithms, shedding light on future research in underwater image enhancement.
https://arxiv.org/abs/1901.05495
Image registration is the process of aligning two or more images of the same objects using geometric transformations. Most of the existing approaches work on the assumption of location invariance, and therefore require object-centric images to perform matching. Further, in the absence of intensity-level symmetry between corresponding points in two images, learning-based registration approaches rely on synthetic deformations, which often fail in real scenarios. To address these issues, a combination of convolutional neural networks (CNNs) is developed in this work to perform the desired registration. The complete objective is divided into three sub-objectives: object localization, segmentation and matching transformation. The object localization step establishes an initial correspondence between the images; a modified version of the single shot multi-box detector is used for this purpose. The detected region is cropped to make the images object-centric. Subsequently, the objects are segmented and matched using a spatial transformer network employing thin plate spline deformation. Initial experiments on the MNIST and Caltech-101 datasets show that the proposed model is able to produce accurate matching. Quantitative evaluation using the dice coefficient (DC) and mean intersection over union (mIoU) shows that the proposed method achieves values of 79% and 66%, respectively, on the MNIST dataset and 94% and 90%, respectively, on the Caltech-101 dataset. The proposed framework is extended to the registration of CT and US images, is free from any data-specific assumptions, and has better generalization capability than existing rule-based/classical approaches.
http://arxiv.org/abs/1805.00223
The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these challenges, it is essential that it can plan effectively. Prior work has typically utilized an explicit model of the environment, combined with a specific planning algorithm (such as tree search). More recently, a new family of methods has been proposed that learns how to plan, by providing the structure for planning via an inductive bias in the function approximator (such as a tree-structured neural network), trained end-to-end by a model-free RL algorithm. In this paper, we go even further, and demonstrate empirically that an entirely model-free approach, without special structure beyond standard neural network components such as convolutional networks and LSTMs, can learn to exhibit many of the characteristics typically associated with a model-based planner. We measure our agent’s effectiveness at planning in terms of its ability to generalize across a combinatorial and irreversible state space, its data efficiency, and its ability to utilize additional thinking time. We find that our agent has many of the characteristics that one might expect to find in a planning algorithm. Furthermore, it exceeds the state-of-the-art in challenging combinatorial domains such as Sokoban and outperforms other model-free approaches that utilize strong inductive biases toward planning.
http://arxiv.org/abs/1901.03559
This paper describes the development of a real-time Human-Robot Interaction (HRI) system for a service robot based on 3D human activity recognition and a human-like decision mechanism. The HRI system, which allows one person to interact with a service robot using natural body language, collects sequences of 3D skeleton joints comprising rich human movement information about the user via a Microsoft Kinect. This information is used to train a three-layer Long Short-Term Memory (LSTM) network for human action recognition. The robot understands user intent based on an online LSTM network test, and responds to the user via movements of the robotic arm or chassis. Furthermore, the human-like decision mechanism is fused into this process, which allows the robot to instinctively decide whether to interrupt the current task according to task priority. The framework of the overall system is established on the Robot Operating System (ROS) platform. Real-life activity interactions between our service robot and users were conducted to demonstrate the effectiveness of the developed HRI system.
http://arxiv.org/abs/1802.00272
This paper addresses non-prehensile rearrangement planning problems where a robot is tasked to rearrange objects among obstacles on a planar surface. We present an efficient planning algorithm that is designed to impose few assumptions on the robot’s non-prehensile manipulation abilities and is simple to adapt to different robot embodiments. For this, we combine sampling-based motion planning with reinforcement learning and generative modeling. Our algorithm explores the composite configuration space of objects and robot as a search over robot actions, forward simulated in a physics model. This search is guided by a generative model that provides robot states from which an object can be transported towards a desired state, and a learned policy that provides corresponding robot actions. As an efficient generative model, we apply Generative Adversarial Networks. We implement and evaluate our approach for robots endowed with configuration spaces in SE(2). We demonstrate empirically the efficacy of our algorithm design choices and observe more than 2x speedup in planning time on various test scenarios compared to a state-of-the-art approach.
http://arxiv.org/abs/1901.03557
The primary motivation of image-to-image transformation is to convert an image of one domain to another domain. Most research has focused on image transformation for a set of pre-defined domains, and very few works have actually developed a common framework for image-to-image transformation across different domains. With the introduction of Generative Adversarial Networks (GANs) as a general framework for the image generation problem, there has been tremendous growth in the area of image-to-image transformation, with most research focusing on a suitable objective function. In this paper, we propose new Cyclic-Synthesized Generative Adversarial Networks (CSGAN) for image-to-image transformation. The proposed CSGAN uses a new objective function (loss) called the Cyclic-Synthesized loss (CS), computed between the synthesized image of one domain and the cycled image of the other domain. The performance of the proposed CSGAN is evaluated on two benchmark image-to-image transformation datasets, the CUHK Face dataset and the CMP Facades dataset. The results are computed using the widely used evaluation metrics MSE, SSIM, PSNR, and LPIPS. The experimental results of the proposed CSGAN approach are compared with the latest state-of-the-art approaches such as GAN, Pix2Pix, DualGAN, CycleGAN and PS2GAN. The proposed CSGAN technique outperforms all these methods on the CUHK dataset and exhibits promising and comparable performance on the Facades dataset in terms of both qualitative and quantitative measures. The code is available at https://github.com/KishanKancharagunta/CSGAN.
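One reading of the Cyclic-Synthesized loss is an L1 penalty between each domain's directly synthesized image and its cycle-reconstructed counterpart, added on top of the usual adversarial and cycle losses. A hedged PyTorch sketch (the tensor names and exact pairing are assumptions based on the abstract, not the authors' code):

    import torch

    l1 = torch.nn.L1Loss()

    def cyclic_synthesized_loss(synth_b, cycled_b, synth_a, cycled_a):
        """CS loss sketch: penalize the gap between the directly synthesized
        image and the cycle-reconstructed image in each domain."""
        return l1(synth_b, cycled_b) + l1(synth_a, cycled_a)

    # e.g. synth_b = G_AB(real_a); cycled_b = G_AB(G_BA(real_b)); placeholders here
    synth_b, cycled_b = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
    synth_a, cycled_a = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
    print(cyclic_synthesized_loss(synth_b, cycled_b, synth_a, cycled_a))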
http://arxiv.org/abs/1901.03554
Here we present DIVE: Data-driven Inference of Vertexwise Evolution. DIVE is an image-based disease progression model with single-vertex resolution, designed to reconstruct long-term patterns of brain pathology from short-term longitudinal data sets. DIVE clusters vertex-wise biomarker measurements on the cortical surface that have similar temporal dynamics across a patient population, and concurrently estimates an average trajectory of vertex measurements in each cluster. DIVE uniquely outputs a parcellation of the cortex into areas with common progression patterns, leading to a new signature for individual diseases. DIVE further estimates the disease stage and progression speed for every visit of every subject, potentially enhancing stratification for clinical trials or management. On simulated data, DIVE can recover ground truth clusters and their underlying trajectory, provided the average trajectories are sufficiently different between clusters. We demonstrate DIVE on data from two cohorts: the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the Dementia Research Centre (DRC), UK, containing patients with Posterior Cortical Atrophy (PCA) as well as typical Alzheimer’s disease (tAD). DIVE finds similar spatial patterns of atrophy for tAD subjects in the two independent datasets (ADNI and DRC), and further reveals distinct patterns of pathology in different diseases (tAD vs PCA) and for distinct types of biomarker data: cortical thickness from Magnetic Resonance Imaging (MRI) vs amyloid load from Positron Emission Tomography (PET). Finally, DIVE can be used to estimate a fine-grained spatial distribution of pathology in the brain using any kind of voxelwise or vertexwise measures including Jacobian compression maps, fractional anisotropy (FA) maps from diffusion imaging or other PET measures. DIVE source code is available online: https://github.com/mrazvan22/dive
http://arxiv.org/abs/1901.03553
This work addresses the problem of learning compact yet discriminative patch descriptors within a deep learning framework. We observe that features extracted by convolutional layers in the pixel domain are largely complementary to features extracted in a transformed domain. We propose a convolutional network framework for learning binary patch descriptors in which pixel-domain features are fused with features extracted from the transformed domain. In our framework, while convolutional and transformed features are extracted separately, they are fused and provided to a single classifier which thus jointly operates on both. We experiment with matching patches from three different datasets, showing that our feature fusion approach outperforms multiple state-of-the-art approaches in terms of accuracy, rate, and complexity.
http://arxiv.org/abs/1901.03547
In this paper, we propose a deep convolutional neural network for learning embeddings of images that capture the notion of visual similarity. We present a deep siamese architecture that, when trained on positive and negative pairs of images, learns an embedding that accurately approximates the ranking of images by visual similarity. We also implement a novel loss calculation method using an angular loss metric suited to the problem's requirements. The final embedding of an image is a combined representation of its lower- and top-level embeddings. We use a fractional distance matrix to calculate the distances between the learned embeddings in n-dimensional space. Finally, we compare our architecture with other existing deep architectures and demonstrate the superiority of our solution in terms of image retrieval by testing on four datasets. We also show that our suggested network outperforms traditional deep CNNs at capturing fine-grained image similarities by learning an optimal embedding.
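Fractional distances are Minkowski-style distances with exponent p < 1, often argued to be more discriminative than L2 in high-dimensional spaces. A minimal numpy version (illustrative; the paper's exact choice of p is not stated in the abstract):

    import numpy as np

    def fractional_distance(x, y, p=0.5):
        """Minkowski-style distance with fractional exponent p < 1; such
        measures tend to separate points better in high dimensions."""
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    a, b = np.random.rand(128), np.random.rand(128)
    print(fractional_distance(a, b, p=0.5))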
http://arxiv.org/abs/1901.03546
Nowadays, Twitter has become a great source of user-generated information about events. Very often people report causal relationships between events in their tweets. Automatic detection of causality information in these events might play an important role in predictive event analytics. Existing approaches include both rule-based and data-driven supervised methods. However, it is challenging to correctly identify event causality using only linguistic rules due to the highly unstructured nature and grammatical incorrectness of social media short text such as tweets. Also, it is difficult to develop a data-driven supervised method for event causality detection in tweets due to insufficient contextual information. This paper proposes a novel event context word extension technique based on background knowledge. To demonstrate the effectiveness of our proposed event context word extension technique, we develop a feed-forward neural network based approach to detect event causality from tweets. Extensive experiments demonstrate the superiority of our approach.
http://arxiv.org/abs/1901.03526
We introduce Disease Knowledge Transfer (DKT), a novel technique for transferring biomarker information between related neurodegenerative diseases. DKT infers robust multimodal biomarker trajectories in rare neurodegenerative diseases, even when only limited, unimodal data is available, by transferring information from larger multimodal datasets of common neurodegenerative diseases. DKT is a joint-disease generative model of biomarker progressions, which exploits biomarker relationships that are shared across diseases. As opposed to current deep learning approaches, DKT is interpretable, which allows us to understand underlying disease mechanisms. Here we demonstrate DKT on Alzheimer’s disease (AD) variants and its ability to predict trajectories for multimodal biomarkers in Posterior Cortical Atrophy (PCA), in the absence of such data from PCA subjects. For this we train DKT on a combined dataset containing subjects with two distinct diseases and different amounts of data available: 1) a larger, multimodal typical AD (tAD) dataset from the TADPOLE Challenge, and 2) a smaller unimodal Posterior Cortical Atrophy (PCA) dataset from the Dementia Research Centre (DRC) UK, for which only a limited number of Magnetic Resonance Imaging (MRI) scans are available. We first show that DKT estimates plausible multimodal trajectories in PCA that agree with previous literature. We further validate DKT in two situations: (1) on synthetic data, showing that it can accurately estimate the ground truth parameters, and (2) on 20 DTI scans from controls and PCA patients, showing that it has favourable predictive performance compared to standard approaches. While we demonstrated DKT on Alzheimer’s variants, we note DKT is generalisable to other forms of related neurodegenerative diseases. Source code for DKT is available online: https://github.com/mrazvan22/dkt.
http://arxiv.org/abs/1901.03517
Computational paralinguistic analysis is increasingly being used in a wide range of cyber applications, including security-sensitive applications such as speaker verification, deceptive speech detection, and medical diagnostics. While state-of-the-art machine learning techniques, such as deep neural networks, can provide robust and accurate speech analysis, they are susceptible to adversarial attacks. In this work, we propose an end-to-end scheme to generate adversarial examples for computational paralinguistic applications by perturbing directly the raw waveform of an audio recording rather than specific acoustic features. Our experiments show that the proposed adversarial perturbation can lead to a significant performance drop of state-of-the-art deep neural networks, while only minimally impairing the audio quality.
http://arxiv.org/abs/1711.03280
The basic principles in designing convolutional neural network (CNN) structures for predicting objects at different levels, e.g., image-level, region-level, and pixel-level, are diverging. Generally, network structures designed specifically for image classification are used directly as the default backbone for other tasks, including detection and segmentation; backbone structures are seldom designed to unify the advantages of networks aimed at pixel-level or region-level prediction tasks, which may require very deep features with high resolution. Towards this goal, we design a fish-like network, called FishNet. In FishNet, the information of all resolutions is preserved and refined for the final task. In addition, we observe that existing works still cannot \emph{directly} propagate gradient information from deep layers to shallow layers; our design handles this problem better. Extensive experiments have been conducted to demonstrate the remarkable performance of FishNet. In particular, on ImageNet-1k, FishNet surpasses the accuracy of DenseNet and ResNet with fewer parameters. FishNet was applied as one of the modules in the winning entry of the COCO Detection 2018 challenge. The code is available at https://github.com/kevin-ssy/FishNet.
http://arxiv.org/abs/1901.03495
Information retrieval systems are evolving from document retrieval to answer retrieval. Web search logs provide large amounts of data about how people interact with ranked lists of documents, but very little is known about interaction with answer texts. In this paper, we use Amazon Mechanical Turk to investigate three answer presentation and interaction approaches in a non-factoid question answering setting. We find that people perceive and react to good and bad answers very differently, and can identify good answers relatively quickly. Our results provide the basis for further investigation of effective answer interaction and feedback methods.
http://arxiv.org/abs/1901.03491
Conversational assistants are being progressively adopted by the general population. However, they are not capable of handling complicated information-seeking tasks that involve multiple turns of information exchange. Due to the limited communication bandwidth in conversational search, it is important for conversational assistants to accurately detect and predict user intent in information-seeking conversations. In this paper, we investigate two aspects of user intent prediction in an information-seeking setting. First, we extract features based on the content, structural, and sentiment characteristics of a given utterance, and use classic machine learning methods to perform user intent prediction. We then conduct an in-depth feature importance analysis to identify key features in this prediction task. We find that structural features contribute most to the prediction performance. Given this finding, we construct neural classifiers to incorporate context information and achieve better performance without feature engineering. Our findings can provide insights into the important factors and effective methods of user intent prediction in information-seeking conversations.
http://arxiv.org/abs/1901.03489
Lung segmentation in computerized tomography (CT) images is an important procedure in the diagnosis of various lung diseases. Most current lung segmentation approaches are performed through a series of steps with manual, empirical parameter adjustments at each step. Pursuing an automatic segmentation method with fewer steps, in this paper we propose a novel deep learning Generative Adversarial Network (GAN) based lung segmentation schema, which we denote LGAN. The proposed schema can be generalized to different kinds of neural networks for lung segmentation in CT images and is evaluated on a dataset containing 220 individual CT scans with two metrics: segmentation quality and shape similarity. We also compared our work with current state-of-the-art methods. The results obtained in this study demonstrate that the proposed LGAN schema can be used as a promising tool for automatic lung segmentation, due to its simplified procedure as well as its good performance.
http://arxiv.org/abs/1901.03473
In this paper, a multi-scale framework with a local region based active contour and a boundary shape similarity constraint is proposed for the segmentation of the levator hiatus in ultrasound images. In order to obtain more precise initializations and reduce the computational cost, the Gaussian pyramid method is used to decompose the image into coarse-to-fine scales. A localized region active contour model is first applied to the coarsest-scale image to get a rough contour of the levator hiatus; the segmentation result at the coarse scale is then interpolated into the finer-scale image as the initialization. The boundary shape similarity between different scales is incorporated into the local region based active contour model, so that the result from the coarse scale can guide the contour evolution at the finer scale. By incorporating the multi-scale structure and the boundary shape similarity, the proposed method can precisely locate the levator hiatus boundaries despite various ultrasound image artifacts. On a dataset of 90 levator hiatus ultrasound images, the efficiency and accuracy of the proposed method are validated by quantitative and qualitative evaluations (TP, FP, Js) and by comparison with two other state-of-the-art active contour segmentation methods (the C-V model and the DRLSE model).
http://arxiv.org/abs/1901.03472
In this paper, we propose three methods for color recognition on a Rubik's cube: one offline method and two online methods. Scatter balance \& extreme learning machine (SB-ELM), an offline method, is proposed to illustrate the efficiency of training-based methods. We also point out the phenomenon of color drift, which indicates that offline methods are often ineffective and cannot work well under continuously changing conditions. By contrast, dynamic weight label propagation is proposed as an online method for labeling block colors given the known colors of the cube's center blocks. Furthermore, weak label hierarchic propagation, another online method, is proposed for the case where no color information is known, utilizing only the weak label of the center block for color recognition. We finally design a Rubik's cube robot and construct a dataset to illustrate the efficiency and effectiveness of our online methods, and to show, via color drift in our dataset, the ineffectiveness of the offline method.
http://arxiv.org/abs/1901.03470
RodFIter is a promising method of attitude reconstruction from inertial measurements based on the functional iterative integration of the Rodrigues vector. The Rodrigues vector is used to encode the attitude in place of the popular rotation vector because it has a polynomial-like rate equation and admits theoretically sound, exact integration. This paper further applies the RodFIter approach to the unit-norm quaternion for attitude reconstruction, named QuatFIter, and shows that it is identical to the previous Picard-type quaternion method. The Chebyshev polynomial approximation and truncation techniques from RodFIter are exploited to speed up its implementation. Numerical results demonstrate that QuatFIter is comparable in accuracy to RodFIter, although its convergence with respect to the number of iterations is relatively slower. Notably, QuatFIter is about two times more computationally efficient, thanks to the linear quaternion kinematic equation.
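The Picard-type iteration underlying QuatFIter repeatedly integrates the linear quaternion kinematics $\dot{q} = \frac{1}{2} q \otimes \omega$. The numpy sketch below uses trapezoidal quadrature as a stand-in for the Chebyshev-polynomial machinery and a constant angular rate so the result can be checked against the closed form (illustrative, not the authors' implementation):

    import numpy as np

    def quat_mult(p, q):
        """Hamilton product p (x) q for quaternions stored as [w, x, y, z]."""
        pw, px, py, pz = p
        qw, qx, qy, qz = q
        return np.array([pw*qw - px*qx - py*qy - pz*qz,
                         pw*qx + px*qw + py*qz - pz*qy,
                         pw*qy - px*qz + py*qw + pz*qx,
                         pw*qz + px*qy - py*qx + pz*qw])

    # constant body rate of 1 rad/s about z, written as a pure quaternion
    omega = np.array([0.0, 0.0, 0.0, 1.0])
    T, n = 1.0, 200
    dt = T / (n - 1)
    q0 = np.array([1.0, 0.0, 0.0, 0.0])

    # Picard iteration: q_{k+1}(t) = q0 + 0.5 * int_0^t q_k(s) (x) omega ds
    q = np.tile(q0, (n, 1))
    for _ in range(15):
        f = 0.5 * np.array([quat_mult(qi, omega) for qi in q])
        q_new = np.empty_like(q)
        q_new[0] = q0
        # cumulative trapezoidal quadrature (stand-in for Chebyshev integration)
        q_new[1:] = q0 + np.cumsum(0.5 * dt * (f[1:] + f[:-1]), axis=0)
        q = q_new

    print(q[-1])                                   # iterated solution at t = T
    print([np.cos(T / 2), 0.0, 0.0, np.sin(T / 2)])  # closed form for this case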
http://arxiv.org/abs/1901.03468
Automatically generating the descriptions of an image, i.e., image captioning, is an important and fundamental topic in artificial intelligence, which bridges the gap between computer vision and natural language processing. Based on the successful deep learning models, especially the CNN model and Long Short-Term Memories (LSTMs) with attention mechanism, we propose a hierarchical attention model by utilizing both of the global CNN features and the local object features for more effective feature representation and reasoning in image captioning. The generative adversarial network (GAN), together with a reinforcement learning (RL) algorithm, is applied to solve the exposure bias problem in RNN-based supervised training for language problems. In addition, through the automatic measurement of the consistency between the generated caption and the image content by the discriminator in the GAN framework and RL optimization, we make the finally generated sentences more accurate and natural. Comprehensive experiments show the improved performance of the hierarchical attention mechanism and the effectiveness of our RL-based optimization method. Our model achieves state-of-the-art results on several important metrics in the MSCOCO dataset, using only greedy inference.
https://arxiv.org/abs/1811.05253
Hand segmentation and fingertip detection play an indispensable role in hand gesture-based human-machine interaction systems. In this study, we propose a method to discriminate hand components and to locate fingertips in RGB-D images. The system consists of three main steps: hand detection using RGB images, providing regions which are considered promising areas for further processing; hand segmentation; and fingertip detection using the depth image and our modified SegNet, a single lightweight architecture that can process two independent tasks at the same time. The experimental results show that our system is a promising method for hand segmentation and fingertip detection, achieving comparable performance with a model complexity suitable for real-time applications.
http://arxiv.org/abs/1901.03465
Content-based adult video detection plays an important role in preventing pornography. However, existing methods usually rely on a single modality and seldom focus on multi-modal semantic representation. To address this problem, we put forward an approach that analyzes periodicity and saliency for adult video detection. First, periodic patterns and salient regions are respectively analyzed in audio frames and visual frames. Next, the multi-modal co-occurrence semantics is described by combining audio periodicity with visual saliency. Moreover, the performance of our approach is evaluated step by step. Experimental results show that our approach clearly outperforms several state-of-the-art methods.
http://arxiv.org/abs/1901.03462
This paper introduces the Seventh Dialog System Technology Challenges (DSTC), which use shared datasets to explore the problem of building dialog systems. Recently, end-to-end dialog modeling approaches have been applied to various dialog tasks. The seventh DSTC (DSTC7) focuses on developing technologies related to end-to-end dialog systems for (1) sentence selection, (2) sentence generation and (3) audio visual scene aware dialog. This paper summarizes the overall setup and results of DSTC7, including detailed descriptions of the different tracks and provided datasets. We also describe overall trends in the submitted systems and the key results. Each track introduced new datasets and participants achieved impressive results using state-of-the-art end-to-end technologies.
http://arxiv.org/abs/1901.03461
Motion has been shown to be useful for video understanding, where it is typically represented by optical flow. However, computing flow from video frames is very time-consuming. Recent works directly leverage the motion vectors and residuals readily available in compressed video to represent motion at no cost. While this avoids flow computation, it also hurts accuracy, since motion vectors are noisy and have substantially reduced resolution, making them a less discriminative motion representation. To remedy these issues, we propose a lightweight generator network which reduces noise in motion vectors and captures fine motion details, yielding a more Discriminative Motion Cue (DMC) representation. Since optical flow is a more accurate motion representation, we train the DMC generator to approximate flow using a reconstruction loss and a generative adversarial loss, jointly with the downstream action classification task. Extensive evaluations on three action recognition benchmarks (HMDB-51, UCF-101, and a subset of Kinetics) confirm the effectiveness of our method. Our full system, consisting of the generator and the classifier, is coined DMC-Net; it obtains accuracy close to that of using flow while running two orders of magnitude faster at inference time.
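The joint objective for a DMC-style generator combines a flow-reconstruction term, an adversarial term, and the downstream classification loss. A hedged PyTorch sketch with placeholder tensors (the specific loss forms and weights are assumptions for illustration, not the paper's values):

    import torch
    import torch.nn.functional as F

    def dmc_joint_loss(dmc, flow, disc_score_fake, class_logits, labels,
                       w_rec=1.0, w_adv=0.1, w_cls=1.0):
        """Combined objective for a DMC-style generator (illustrative weights):
        approximate optical flow (reconstruction), fool the flow discriminator
        (adversarial), and support the downstream action classifier."""
        rec = F.mse_loss(dmc, flow)
        adv = F.binary_cross_entropy_with_logits(
            disc_score_fake, torch.ones_like(disc_score_fake))  # non-saturating GAN loss
        cls = F.cross_entropy(class_logits, labels)
        return w_rec * rec + w_adv * adv + w_cls * cls

    # placeholder tensors standing in for network outputs
    loss = dmc_joint_loss(torch.randn(2, 2, 16, 16), torch.randn(2, 2, 16, 16),
                          torch.randn(2, 1), torch.randn(2, 10),
                          torch.tensor([3, 7]))
    print(loss)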
http://arxiv.org/abs/1901.03460
We introduce a new task named Story Ending Generation (SEG), which aims at generating a coherent story ending from a sequence of story plot. We propose a framework consisting of a Generator and a Reward Manager for this task. The Generator follows the pointer-generator network with a coverage mechanism to deal with out-of-vocabulary (OOV) and repetitive words. Moreover, a mixed loss method is introduced to enable the Generator to produce story endings of high semantic relevance with story plots. In the Reward Manager, the reward is computed to fine-tune the Generator with policy-gradient reinforcement learning (PGRL). We conduct experiments on the recently introduced ROCStories Corpus. We evaluate our model in both automatic evaluation and human evaluation. Experimental results show that our model exceeds the sequence-to-sequence baseline model by 15.75% and 13.57% in terms of CIDEr and consistency score respectively.
http://arxiv.org/abs/1901.03459