Image recognition using deep learning has been evolving for decades, yet advancing the field across different settings remains a challenge. In this paper, we present our findings in searching for better image classifiers in offline and online environments. We resort to Convolutional Neural Networks and variants of the fully connected Multi-layer Perceptron. Though still preliminary, these results are encouraging and may provide a better understanding of the field and directions for future work.
http://arxiv.org/abs/1903.07479
This paper proposes a solution to the problem of smooth path planning for mobile robots in dynamic and unknown environments. A novel concept of Time-Warped Grids is introduced to predict the pose of obstacles in the environment and avoid collisions. The algorithm is implemented using C/C++ and the CUDA programming environment; it combines stochastic estimation (Kalman filter), harmonic potential fields and a rubber band model, and translates naturally into the parallel paradigm of GPU programming. In simple terms, time-warped grids are progressively wider orbits around the mobile robot, representing the variable time intervals estimated by the robot to reach detected obstacles. The proposed method was tested using several simulation scenarios for the Pioneer P3-DX robot, which demonstrated the robustness of the algorithm by finding the optimal path in terms of smoothness, distance, and collision avoidance, in both static and dynamic environments and with a large number of obstacles.
http://arxiv.org/abs/1903.07441
Research has shown that diet recording can help people increase awareness of food intake and improve nutrition management, and thereby maintain a healthier life. Recently, researchers have been working on smartphone-based diet recording methods and applications that help users accomplish two tasks: recording what they eat and how much they eat. Although the former task has made great progress through the adoption of image recognition technology, it is still a challenge to estimate the volume of foods accurately and conveniently. In this paper, we propose a novel method, named MUSEFood, for food volume estimation. MUSEFood uses the camera to capture photos of the food, but unlike existing volume measurement methods, it requires neither training images with volume information nor a reference object of known size placed in the photos. In addition, considering the impact of different containers on the contour shape of foods, MUSEFood uses a multi-task learning framework to improve the accuracy of food segmentation, and uses a differential model applicable to various containers to further reduce the negative impact of container differences on volume estimation accuracy. Furthermore, MUSEFood uses the microphone and the speaker to accurately measure the vertical distance from the camera to the food in a noisy environment, thus scaling the size of food in the image to its actual size. Experiments on real foods indicate that MUSEFood outperforms state-of-the-art approaches and greatly improves the speed of food volume estimation.
http://arxiv.org/abs/1903.07437
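The acoustic ranging step above invites a concrete illustration. Below is a minimal sketch, not the authors' implementation, of time-of-flight distance estimation with a speaker and microphone: a known chirp is emitted, the recording is cross-correlated with it, and the lag between the direct-path peak and the echo peak gives the round-trip travel time of the reflection. The function name and the guard interval are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at ~20 degrees C

def estimate_distance(chirp: np.ndarray, recording: np.ndarray, fs: int) -> float:
    """Estimate the one-way distance to a reflecting surface from a chirp echo."""
    corr = np.correlate(recording, chirp, mode="valid")
    direct = int(np.argmax(corr))            # strongest peak: direct speaker-to-mic path
    guard = int(0.001 * fs)                  # skip 1 ms to move past the main correlation lobe
    echo = direct + guard + int(np.argmax(corr[direct + guard:]))
    dt = (echo - direct) / fs                # extra round-trip delay of the reflection
    return SPEED_OF_SOUND * dt / 2.0         # halve for the one-way distance in metres
```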
Recent work has shown that LSTMs trained on a generic language modeling objective capture syntax-sensitive generalizations such as long-distance number agreement. We have, however, no mechanistic understanding of how they accomplish this remarkable feat. Some have conjectured it depends on heuristics that do not truly take hierarchical structure into account. We present here a detailed study of the inner mechanics of number tracking in LSTMs at the single neuron level. We discover that long-distance number information is largely managed by two "number units". Importantly, the behaviour of these units is partially controlled by other units independently shown to track syntactic structure. We conclude that LSTMs are, to some extent, implementing genuinely syntactic processing mechanisms, paving the way to a more general understanding of grammatical encoding in LSTMs.
http://arxiv.org/abs/1903.07435
Federated learning obtains a central model on the server by aggregating models trained locally on clients. As a result, federated learning does not require clients to upload their data to the server, thereby preserving the data privacy of the clients. One challenge in federated learning is to reduce client-server communication, since end devices typically have very limited communication bandwidth. This paper presents an enhanced federated learning technique by proposing an asynchronous learning strategy on the clients and a temporally weighted aggregation of the local models on the server. In the asynchronous learning strategy, the layers of the deep neural networks are categorized into shallow and deep layers, and the parameters of the deep layers are updated less frequently than those of the shallow layers. Furthermore, a temporally weighted aggregation strategy is introduced on the server to make use of the previously trained local models, thereby enhancing the accuracy and convergence of the central model. The proposed algorithm is empirically evaluated on two datasets with different deep neural networks. Our results demonstrate that the proposed asynchronous federated deep learning outperforms the baseline algorithm in terms of both communication cost and model accuracy.
http://arxiv.org/abs/1903.07424
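To make the temporally weighted aggregation concrete, here is a minimal sketch under assumptions: each client model is weighted by its data size and discounted exponentially in the staleness of its last update. The discount base and the function names are illustrative, not taken from the paper.

```python
import numpy as np

def temporally_weighted_aggregate(client_params, client_sizes, last_update_round,
                                  current_round, base=np.e / 2):
    """Aggregate client parameter vectors into a central model.

    Clients whose local models were trained more recently receive larger
    weights; stale models are discounted exponentially in their staleness.
    """
    staleness = current_round - np.asarray(last_update_round, dtype=float)
    w = np.asarray(client_sizes, dtype=float) * base ** (-staleness)
    w /= w.sum()                                  # normalize to a convex combination
    return sum(wk * pk for wk, pk in zip(w, client_params))
```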
The Transformer translation model is easier to parallelize and provides better performance comparing with recurrent seq2seq models, which makes it popular among industry and research community. We implement Neutron in this work, including the Transformer model and several variants from most recent researches. It is easier to modify and provides comparable performance with interesting features while keep readability.
http://arxiv.org/abs/1903.07402
Exploration in sparse reward reinforcement learning remains a difficult open challenge. Many state-of-the-art methods use intrinsic motivation to complement the sparse extrinsic reward signal, giving the agent more opportunities to receive feedback during exploration. Most commonly, these signals are added as bonus rewards, which results in the mixture policy faithfully conducting neither exploration nor task fulfillment for an extended amount of time. In this paper, we instead learn separate intrinsic and extrinsic task policies and schedule between these different drives to accelerate exploration and stabilize learning. Moreover, we introduce a new type of intrinsic reward denoted as successor feature control (SFC), which is general and not task-specific. It takes into account statistics over complete trajectories and thus differs from previous methods that only use local information to evaluate intrinsic motivation. We evaluate our proposed scheduled intrinsic drive (SID) agent using three different environments with pure visual inputs: VizDoom, DeepMind Lab and OpenAI Gym classic control from pixels. The results show a greatly improved exploration efficiency with SFC and the hierarchical usage of the intrinsic drives. A video of our experimental results can be found at https://youtu.be/4ZHcBo7006Y.
http://arxiv.org/abs/1903.07400
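As a hedged sketch of one plausible form of the successor feature control (SFC) bonus: with successor features psi(s) estimating the discounted sum of future state features, the intrinsic reward scores how much a transition changes those long-horizon statistics. The exact form used in the paper may differ.

```python
import numpy as np

def sfc_reward(psi_t: np.ndarray, psi_tp1: np.ndarray) -> float:
    """Intrinsic reward from the change in successor features (illustrative).

    psi_t and psi_tp1 are estimates of psi(s_t) and psi(s_{t+1}), i.e. the
    expected discounted sums of future state features. A large change marks
    a transition that alters trajectory-level statistics, unlike bonuses
    computed from local state information alone.
    """
    return float(np.sum((psi_tp1 - psi_t) ** 2))
```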
The quest for comprehensive generative models of intonation that link linguistic and paralinguistic functions to prosodic forms has been a longstanding challenge of speech communication research. Traditional intonation models have given way to the overwhelming performance of deep learning (DL) techniques for training general-purpose end-to-end mappings using millions of tunable parameters. The shift towards black box machine learning models has nonetheless posed the reverse problem: a compelling need to discover knowledge, to explain, visualise and interpret. Our work bridges a comprehensive generative model of intonation and state-of-the-art DL techniques. We build upon the modelling paradigm of the Superposition of Functional Contours (SFC) model and propose a Variational Prosody Model (VPM) that uses a network of variational contour generators to capture the context-sensitive variation of the constituent elementary prosodic contours. We show that the VPM can give insight into the intrinsic variability of these prosodic prototypes by learning a meaningful prosodic latent space representation structure. We also show that the VPM can capture prosodic phenomena that have multiple dimensions of context-based variability. Since it is based on the principle of superposition, the VPM does not necessitate specially crafted corpora for the analysis, opening up the possibility of using big data for prosody analysis. In a speech synthesis scenario, the model can be used to generate a dynamic and natural prosody contour that is free of averaging effects.
http://arxiv.org/abs/1806.08685
Sentence fusion is the task of joining several independent sentences into a single coherent text. Current datasets for sentence fusion are small and insufficient for training modern neural models. In this paper, we propose a method for automatically generating fusion examples from raw text and present DiscoFuse, a large-scale dataset for discourse-based sentence fusion. We author a set of rules for identifying a diverse set of discourse phenomena in raw text and decomposing the text into two independent sentences. We apply our approach to two document collections, Wikipedia and sports articles, yielding 60 million fusion examples annotated with the discourse information required to reconstruct the fused text. We develop a sequence-to-sequence model on DiscoFuse and thoroughly analyze its strengths and weaknesses with respect to the various discourse phenomena, using both automatic and human evaluation. Finally, we conduct transfer learning experiments with WebSplit, a recent dataset for text simplification. We show that pretraining on DiscoFuse substantially improves performance on WebSplit when viewed as a sentence fusion task.
http://arxiv.org/abs/1902.10526
Encoder-decoder models have become an effective approach for sequence learning tasks like machine translation, image captioning and speech recognition, but have yet to show competitive results for handwritten text recognition. To this end, we propose an attention-based sequence-to-sequence model. It combines a convolutional neural network as a generic feature extractor with a recurrent neural network to encode both the visual information, as well as the temporal context between characters in the input image, and uses a separate recurrent neural network to decode the actual character sequence. We make experimental comparisons between various attention mechanisms and positional encodings, in order to find an appropriate alignment between the input and output sequence. The model can be trained end-to-end and the optional integration of a hybrid loss allows the encoder to retain an interpretable and usable output, e.g. for keyword spotting purposes without prior indexing, if desired. We achieve competitive results on the IAM and ICFHR2016 READ data sets compared to the state-of-the-art without the use of a language model, and we significantly improve over any recent sequence-to-sequence approaches.
http://arxiv.org/abs/1903.07377
There are many research tools which are also used for teaching the acoustic phonetics of speech rhythm and speech melody. But they were not purpose-designed for teaching-learning situations, and some have a steep learning curve. CRAFT (Creation and Recovery of Amplitude and Frequency Tracks) is custom-designed as a novel flexible online tool for visualisation and critical comparison of functions and transforms, with implementations of the Reaper, RAPT, PyRapt, YAAPT, YIN and PySWIPE F0 estimators, three Praat configurations, and two purpose-built estimators, PyAMDF, S0FT. Visualisations of amplitude and frequency envelope spectra, spectral edge detection of rhythm zones, and a parametrised spectrogram are included. A selection of audio clips from tone and intonation languages is provided for demonstration purposes. The main advantages of online tools are consistency (users have the same version and the same data selection), interoperability over different platforms, and ease of maintenance. The code is available on GitHub.
http://arxiv.org/abs/1903.08718
Combining items of clothing into an outfit is a major task in fashion retail. Recommending sets of items that are compatible with a particular seed item is useful for providing users with guidance and inspiration, but is currently a manual process that requires expert stylists and is therefore neither scalable nor easy to personalise. We use a multilayer neural network fed by visual and textual features to learn embeddings of items in a latent style space such that compatible items of different types are embedded close to one another. We train our model using the ASOS outfits dataset, which consists of a large number of outfits created by professional stylists and which we release to the research community. Our model shows strong performance in an offline outfit compatibility prediction task. We use our model to generate outfits and, for the first time in this field, perform an A/B test, comparing our generated outfits to those produced by a baseline model that matches appropriate product types but uses no information on style. Users approved of outfits generated by our model 21% and 34% more frequently than those generated by the baseline model for womenswear and menswear, respectively.
http://arxiv.org/abs/1904.00741
This paper proposes an uncalibrated photometric stereo method for non-Lambertian scenes based on deep learning. Unlike previous approaches that heavily rely on assumptions of specific reflectances and light source distributions, our method is able to determine both shape and light directions of a scene with unknown arbitrary reflectances observed under unknown varying light directions. To achieve this goal, we propose a two-stage deep learning architecture, called SDPS-Net, which can effectively take advantage of intermediate supervision, resulting in reduced learning difficulty compared to a single-stage model. Experiments on both synthetic and real datasets show that our proposed approach significantly outperforms previous uncalibrated photometric stereo methods.
http://arxiv.org/abs/1903.07366
Varying terrain conditions and a limited field-of-view restrict the visibility of aerial robots performing visual monitoring operations. In this paper, we study the multi-point monitoring problem on a 2.5D terrain using an unmanned aerial vehicle (UAV) with a limited camera field-of-view. This problem is NP-hard, and hence we develop a two-phase strategy to compute an approximate tour for the UAV. In the first phase, visibility regions on the flight plane are determined for each point of interest. In the second phase, a tour for the UAV to visit each visibility region is computed by casting the problem as an instance of the Traveling Salesman Problem with Neighbourhoods (TSPN). We design a constant-factor approximation algorithm for the TSPN instance. Further, we reduce the TSPN instance to an instance of the Generalized Traveling Salesman Problem (GTSP) and devise an ILP formulation to solve it. We present a comparative evaluation of solutions computed using a branch-and-cut implementation and an off-the-shelf GTSP tool, GLNS, while varying the density of points of interest, the sampling resolution and the camera field-of-view. We also show results from preliminary field experiments.
http://arxiv.org/abs/1903.07363
Driven by Convolutional Neural Networks, object detection and semantic segmentation have seen significant improvements. However, existing methods built on a full top-down module have limited robustness in handling the two tasks simultaneously. To this end, we present a joint multi-task framework, termed IvaNet. Unlike existing methods, IvaNet passes abstract semantic information backwards from higher layers to augment lower layers using local top-down modules. Comparisons against several counterparts on the PASCAL VOC and MS COCO datasets demonstrate the functionality of IvaNet.
http://arxiv.org/abs/1903.07360
Sensor-to-segment calibration is a crucial step in inertial motion tracking. When two segments are connected by a hinge joint, for example in human knee and finger joints as well as in many robotic limbs, then the joint axis vector must be identified in the intrinsic sensor coordinate systems. There exist methods that identify these coordinates by solving an optimization problem that is based on kinematic joint constraints, which involve either the measured accelerations or the measured angular rates. In the current paper we demonstrate that using only one of these constraints leads to inaccurate estimates at either fast or slow motions. We propose a novel method based on a cost function that combines both constraints. The restrictive assumption of a homogeneous magnetic field is avoided by using only accelerometer and gyroscope readings. To combine the advantages of both sensor types, the residual weights are adjusted automatically based on the estimated signal variances and a nonlinear weighting of the acceleration norm difference. The method is evaluated using real data from nine different motions of an upper limb exoskeleton. Results show that, unlike previous approaches, the proposed method yields accurate joint axis estimation after only five seconds for all fast and slow motions.
http://arxiv.org/abs/1903.07353
A core step in statistical data-to-text generation concerns learning correspondences between structured data representations (e.g., facts in a database) and associated texts. In this paper we aim to bootstrap generators from large scale datasets where the data (e.g., DBPedia facts) and related texts (e.g., Wikipedia abstracts) are loosely aligned. We tackle this challenging task by introducing a special-purpose content selection mechanism. We use multi-instance learning to automatically discover correspondences between data and text pairs and show how these can be used to enhance the content signal while training an encoder-decoder architecture. Experimental results demonstrate that models trained with content-specific objectives improve upon a vanilla encoder-decoder which solely relies on soft attention.
http://arxiv.org/abs/1804.06385
A robot making contact with an environment or human presents potential safety risks, including excessive collision force. While experiments on the effects of robot inertia, relative velocity, and interface stiffness on collision exist in the literature, analytical models for maximum collision force are limited to a simplified mass-spring robot model. This simplified model limits the analysis of control (force/torque, impedance, or admittance) and of compliant robots (joint and end-effector compliance). Here, the Sobolev norm is adapted to be a system norm, giving rigorous bounds on the maximum force on a stiffness element in a general dynamic system and allowing the study of collision with more accurate models and feedback control. The Sobolev norm can be found through the $\mathcal{H}_2$ norm of a transformed system, allowing efficient computation, connection with existing control theory, and controller synthesis to minimize collision force. The Sobolev norm is validated, first experimentally with an admittance-controlled robot, then in simulation with a linear flexible-joint robot. It is then used to investigate the impact of control, joint flexibility and end-effector compliance on collision, and a trade-off between collision performance and environmental estimation uncertainty is shown.
http://arxiv.org/abs/1810.03345
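As a sketch of the norm construction, with notation assumed since the abstract gives no formulas: a first-order Sobolev-type signal norm combines the signal and its derivative, and reduces to an H2 computation on an augmented system.

```latex
% Illustrative only; the weighting and notation are assumptions, not the paper's.
\[
  \|y\|_{\mathcal{S}}^2 = \int_0^\infty \left( \|y(t)\|^2 + \lambda \,\|\dot y(t)\|^2 \right) dt ,
\]
% i.e. the ordinary L2 norm of the augmented output [ y ; sqrt(lambda) * ydot ].
% For an LTI system G(s) driven by an impulse, this is the H2 norm of the
% transformed system [ G(s) ; sqrt(lambda) * s G(s) ], which standard H2
% machinery (Lyapunov equations) computes efficiently.
```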
This paper presents an unsupervised framework for jointly modeling topic content and discourse behavior in microblog conversations. Concretely, we propose a neural model to discover word clusters indicating what a conversation concerns (i.e., topics) and those reflecting how participants voice their opinions (i.e., discourse). Extensive experiments show that our model can yield both coherent topics and meaningful discourse behavior. Further study shows that our topic and discourse representations can benefit the classification of microblog messages, especially when they are jointly trained with the classifier.
http://arxiv.org/abs/1903.07319
Supervised learning methods to infer (hypothesize) depth of a scene from a single image require costly per-pixel ground-truth. We follow a geometric approach that exploits abundant stereo imagery to learn a model to hypothesize scene structure without direct supervision. Although we train a network with stereo pairs, we only require a single image at test time to hypothesize disparity or depth. We propose a novel objective function that exploits the bilateral cyclic relationship between the left and right disparities and we introduce an adaptive regularization scheme that allows the network to handle both the co-visible and occluded regions in a stereo pair. This process ultimately produces a model to generate hypotheses for the 3-dimensional structure of the scene as viewed in a single image. When used to generate a single (most probable) estimate of depth, our method outperforms state-of-the-art unsupervised monocular depth prediction methods on the KITTI benchmarks. We show that our method generalizes well by applying our models trained on KITTI to the Make3d dataset.
http://arxiv.org/abs/1903.07309
In this paper we describe a technology for translating algorithmic descriptions of discrete functions to SAT. The proposed technology is aimed at applications in algebraic cryptanalysis. We describe how cryptanalysis instances are reduced to SAT in a way that should be perceived as natural by the cryptographic community. In the theoretical part of the paper we therefore justify the main principles of general reduction to SAT for discrete functions from a class containing the majority of functions employed in cryptography. Based on these principles we describe the Transalg software system, developed with the specifics of SAT-based cryptanalysis in mind. We show the results of applying Transalg to the construction of a number of attacks on various cryptographic functions, some of which are state of the art. We also compare the functional capabilities of the proposed system with those of other software systems that can be used to reduce cryptanalysis instances to SAT, and with the CBMC system widely employed in symbolic verification. Finally, we present extensive experimental data obtained using the SAT solvers that took first place at SAT competitions in recent years.
http://arxiv.org/abs/1805.07239
This paper proposes an autoregressive (AR) model for sequences of graphs, which generalises traditional AR models. A first novelty consists in formalising the AR model for a very general family of graphs, characterised by a variable topology and attributes associated with nodes and edges. A graph neural network (GNN) is also proposed to learn the AR function associated with the graph-generating process (GGP) and subsequently predict the next graph in a sequence. The proposed method is compared with four baselines on synthetic GGPs, demonstrating significantly better performance on all considered problems.
http://arxiv.org/abs/1903.07299
Appearance-based gaze estimation has attracted increasing attention because of its wide range of applications. The use of deep convolutional neural networks has improved accuracy significantly. To improve estimation accuracy further, we focus on extracting better features from eye images. Relatively large changes in gaze angles may result in relatively small changes in eye appearance, and we argue that current architectures for gaze estimation may not be able to capture such small changes, as they apply multiple pooling or other downsampling layers that significantly reduce the spatial resolution of the high-level feature maps. To evaluate whether features extracted at high resolution can benefit gaze estimation, we adopt dilated convolutions to extract high-level features without reducing spatial resolution. In cross-subject experiments on the Columbia Gaze dataset for eye contact detection and the MPIIGaze dataset for 3D gaze vector regression, the resulting Dilated-Nets achieve significant (up to 20.8%) gains compared to similar networks without dilated convolutions. Our proposed Dilated-Net achieves state-of-the-art results on both the Columbia Gaze and the MPIIGaze datasets.
http://arxiv.org/abs/1903.07296
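As an illustration of the architectural idea, not the paper's exact network: a stack of dilated 3x3 convolutions in PyTorch grows the receptive field the way pooling would, while keeping full spatial resolution. Channel counts and dilation rates here are placeholders.

```python
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Stack of dilated 3x3 convolutions (illustrative sketch).

    Dilation rates 1, 2, 4 enlarge the receptive field without any
    downsampling, so small appearance changes in the eye region are not
    averaged away by pooling. Setting padding equal to the dilation rate
    keeps the spatial size of the feature maps unchanged.
    """
    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=4, dilation=4), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)
```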
We propose spatially-adaptive normalization, a simple but effective layer for synthesizing photorealistic images given an input semantic layout. Previous methods directly feed the semantic layout as input to the deep network, which is then processed through stacks of convolution, normalization, and nonlinearity layers. We show that this is suboptimal, as the normalization layers tend to "wash away" semantic information. To address the issue, we propose using the input layout to modulate the activations in normalization layers through a spatially-adaptive, learned transformation. Experiments on several challenging datasets demonstrate the advantage of the proposed method over existing approaches, regarding both visual fidelity and alignment with input layouts. Finally, our model allows user control over both semantics and style when synthesizing images. Code will be available at https://github.com/NVlabs/SPADE .
http://arxiv.org/abs/1903.07291
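A simplified PyTorch sketch of the layer's mechanics: activations are normalized without learned affine parameters, and the per-pixel scale and shift are instead predicted from the semantic layout, so the label information survives normalization. The hidden width and channel counts are placeholders.

```python
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Spatially-adaptive normalization (simplified sketch of the layer)."""

    def __init__(self, num_features: int, label_channels: int, hidden: int = 128):
        super().__init__()
        # normalize without learned affine parameters
        self.norm = nn.BatchNorm2d(num_features, affine=False)
        # predict per-pixel modulation parameters from the segmentation map
        self.shared = nn.Sequential(
            nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, num_features, 3, padding=1)
        self.beta = nn.Conv2d(hidden, num_features, 3, padding=1)

    def forward(self, x, segmap):
        seg = F.interpolate(segmap, size=x.shape[2:], mode="nearest")
        h = self.shared(seg)
        # spatially varying scale and shift, driven by the semantic layout
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)
```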
Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) have become very common and are used in many fields, as they are effective at solving many problems where general neural networks are inefficient. They have been applied to various problems, mostly related to images and sequences. Since LSTMs and CNNs take inputs of the same length and dimension, input images and sequences are padded to a maximum length during training and testing. This padding can affect the way the networks function and can make a great difference in performance and accuracy. This paper studies this effect and suggests the best way to pad an input sequence, using a simple sentiment analysis task for the purpose. We use the same dataset on both networks with various paddings to show the difference. We also discuss some preprocessing techniques applied to the data to ensure effective analysis.
http://arxiv.org/abs/1903.07288
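For concreteness, a minimal sketch of the two padding schemes compared in such studies; the function name and defaults are illustrative.

```python
import numpy as np

def pad_sequences(seqs, maxlen, mode="pre", value=0):
    """Pad variable-length token-id sequences to a fixed length.

    'pre' inserts the padding before the tokens, 'post' after them. For
    recurrent models the choice matters: with post-padding the LSTM's final
    hidden state is computed after reading padding tokens, whereas with
    pre-padding the real tokens arrive last.
    """
    out = np.full((len(seqs), maxlen), value, dtype=np.int64)
    for i, s in enumerate(seqs):
        s = s[:maxlen]                      # truncate overly long sequences
        if mode == "pre":
            out[i, maxlen - len(s):] = s
        else:
            out[i, :len(s)] = s
    return out

# e.g. pad_sequences([[5, 9], [3, 1, 4, 1, 5]], maxlen=4, mode="pre")
```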
Recently, deep neural networks have made significant progress and been successfully applied in various fields, but they have been found vulnerable to attack instances, e.g., adversarial examples. State-of-the-art attack methods can generate attack images by adding small perturbations to the source image. These attack images fool the classifier yet are barely perceptible to humans, which makes such attack instances difficult to generate by searching the feature space; how to design an effective and robust generation method has therefore become a focus of attention. Inspired by adversarial examples, we propose two novel generative models to produce adaptive attack instances directly, adopting a conditional generative adversarial network with a distinctive training strategy. Compared with common methods such as the Fast Gradient Sign Method, our models reduce the generation cost, improve robustness, and require about one fifth of the running time to produce an attack instance.
http://arxiv.org/abs/1903.07282
In this work, we formulate the process of generating explanations as model reconciliation for planning problems as one of planning with explanatory actions. We show that such problems can be better understood within the framework of epistemic planning and that, in fact, most earlier works on explanation as model reconciliation correspond to tractable subsets of epistemic planning problems. We empirically show that our approach is computationally more efficient than existing techniques for explanation generation, and we discuss how this approach can be extended to capture most of the existing variants of explanation as model reconciliation. We end the paper with a discussion of how the formulation can be extended to generate novel explanatory behaviors.
http://arxiv.org/abs/1903.07269
3D object detection is an essential task in autonomous driving. Recent techniques excel with highly accurate detection rates, provided the 3D input data is obtained from precise but expensive LiDAR technology. Approaches based on cheaper monocular or stereo imagery data have, until now, resulted in drastically lower accuracies — a gap that is commonly attributed to poor image-based depth estimation. However, in this paper we argue that data representation (rather than its quality) accounts for the majority of the difference. Taking the inner workings of convolutional neural networks into consideration, we propose to convert image-based depth maps to pseudo-LiDAR representations — essentially mimicking LiDAR signal. With this representation we can apply different existing LiDAR-based detection algorithms. On the popular KITTI benchmark, our approach achieves impressive improvements over the existing state-of-the-art in image-based performance — raising the detection accuracy of objects within 30m range from the previous state-of-the-art of 22% to an unprecedented 74%. At the time of submission our algorithm holds the highest entry on the KITTI 3D object detection leaderboard for stereo image based approaches.
https://arxiv.org/abs/1812.07179
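The pseudo-LiDAR conversion itself is a simple back-projection of each depth pixel into camera coordinates using the camera intrinsics. A minimal NumPy sketch, with assumed variable names:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, metres) into a 3D point cloud.

    Each pixel (u, v) with depth z is lifted to camera coordinates:
    x = (u - cx) * z / fx,  y = (v - cy) * z / fy.  The resulting N x 3
    array can then be fed, after any needed coordinate change, to
    LiDAR-based 3D detection pipelines.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```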
Given the complexity of the problem, intelligent algorithms show great power in solving the parts logistics optimization problem, a variant of the vehicle routing problem (VRP). However, most existing research on the VRP is not comprehensive and fails to solve real-world parts logistics problems. In this work, targeting the SAIC logistics problem, we propose a systematic solution to the 2-Dimensional Loading Capacitated Multi-Depot Heterogeneous VRP with Time Windows by integrating several types of intelligent algorithms: a heuristic algorithm that initializes feasible logistics planning schemes by imitating manual planning; a core Tabu Search algorithm for global optimization, accelerated by a novel bundle technique; heuristic algorithms for the associated routing, packing and queuing subproblems; and a heuristic post-optimization process to improve the final solution. Based on these algorithms, SAIC Motor has successfully established an intelligent management system that provides a systematic solution for parts logistics planning, superior to manual planning in performance, customizability and expandability.
http://arxiv.org/abs/1903.07260
Video anomaly detection under weak labels has been formulated as a typical multiple-instance learning problem in previous works. In this paper, we provide a new perspective: a supervised learning task under noisy labels. From this viewpoint, once label noise is cleaned away, we can directly apply fully supervised action classifiers to weakly supervised anomaly detection and take maximum advantage of these well-developed classifiers. For this purpose, we devise a graph convolutional network to correct noisy labels. Based on feature similarity and temporal consistency, our network propagates supervisory signals from high-confidence snippets to low-confidence ones. In this manner, the network can provide cleaned supervision for action classifiers. During the test phase, we only need to obtain snippet-wise predictions from the action classifier, without any extra post-processing. Extensive experiments on three datasets at different scales with two types of action classifiers demonstrate the efficacy of our method. Remarkably, we obtain a frame-level AUC score of 82.12% on UCF-Crime.
http://arxiv.org/abs/1903.07256
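As a rough, non-learned analogue of the label-cleaning idea (the paper trains a graph convolutional network; this sketch only illustrates the feature-similarity and propagation ingredients, with assumed parameter names):

```python
import numpy as np

def propagate_labels(features, noisy_scores, n_iters=10, sigma=1.0, alpha=0.8):
    """Smooth noisy snippet-level anomaly scores over a similarity graph.

    Snippets with similar features should share labels, so high-confidence
    scores are diffused toward low-confidence neighbours while each snippet
    stays anchored to its original (noisy) score.
    """
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))          # feature-similarity adjacency
    A /= A.sum(axis=1, keepdims=True)           # row-normalize
    y = noisy_scores.copy()
    for _ in range(n_iters):                    # classic label propagation
        y = alpha * (A @ y) + (1 - alpha) * noisy_scores
    return y
```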
Finding a template in a search image is one of the core problems in many computer vision tasks, such as image semantic alignment and image-to-GPS verification. We propose a novel quality-aware template matching method, QATM, which can be used not only as a standalone template matching algorithm but also as a trainable layer that can be easily embedded into any deep neural network. Specifically, we assess the quality of a matching pair using soft-ranking among all matching pairs, so that different matching scenarios such as 1-to-1, 1-to-many, and many-to-many are reflected in different values. Our extensive evaluation on classic template matching benchmarks and deep learning tasks demonstrates the effectiveness of QATM: it not only outperforms state-of-the-art template matching methods when used alone, but also largely improves existing deep network solutions.
http://arxiv.org/abs/1903.07254
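A hedged sketch of the soft-ranking idea behind QATM, assuming a precomputed patch-similarity matrix; the temperature and the exact likelihood definitions here are illustrative, not the paper's.

```python
import numpy as np

def qatm_scores(S, alpha=25.0):
    """Quality-aware matching scores from a patch-similarity matrix (sketch).

    S[i, j]: similarity between search patch i and template patch j.
    Soft-ranking each pair against the alternatives in both directions
    downweights 1-to-many and many-to-many matches, while 1-to-1 matches
    receive quality close to 1.
    """
    def softmax(x, axis):
        e = np.exp(alpha * (x - x.max(axis=axis, keepdims=True)))
        return e / e.sum(axis=axis, keepdims=True)

    L_t_given_s = softmax(S, axis=1)   # rank template patches for each search patch
    L_s_given_t = softmax(S, axis=0)   # rank search patches for each template patch
    quality = L_t_given_s * L_s_given_t
    return quality.max(axis=1)         # best matching quality per search patch
```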
Aspect-term sentiment analysis (ATSA) is a longstanding challenge in natural language understanding. It requires fine-grained semantic reasoning about a target entity that appears in the text. As manual annotation of aspects is laborious and time-consuming, the amount of labeled data is limited for supervised learning. This paper proposes a semi-supervised method for the ATSA problem using a Variational Autoencoder based on Transformer (VAET), which models the latent distribution via variational inference. By disentangling the latent representation into the aspect-specific sentiment and the lexical context, our method induces the underlying sentiment prediction for the unlabeled data, which then benefits the ATSA classifier. Our method is classifier-agnostic, i.e., the classifier is an independent module and various advanced supervised models can be integrated. Experimental results on SemEval 2014 Task 4 show that our method is effective with four classical classifiers; it outperforms two general semi-supervised methods and achieves state-of-the-art performance.
http://arxiv.org/abs/1810.10437
Existing polarimetric synthetic aperture radar (PolSAR) image classification methods cannot achieve satisfactory performance on complex scenes characterized by several types of land cover with significant levels of noise or similar scattering properties across land cover types. Hence, we propose a supervised classification method that constructs a classifier based on self-paced learning (SPL), which has been demonstrated to be effective at dealing with complex data while providing a classifier. In this paper, a novel Support Vector Machine (SVM) algorithm based on SPL with neighborhood constraints (SVM_SPLNC) is proposed. The proposed method leverages the easiest samples first to obtain an initial parameter vector; then, more complex samples are gradually incorporated to update the parameter vector iteratively. Moreover, neighborhood constraints are introduced during the training process to further improve performance. Experimental results on three real PolSAR images show that the proposed method performs well on complex scenes.
http://arxiv.org/abs/1903.07243
In this paper, we design a nonlinear observer to estimate the inertial pose and the velocity of a free-floating non-cooperative satellite (Target) using only relative pose measurements. In the context of control design for orbital robotic capture of such a non-cooperative Target, due to lack of navigational aids, only a relative pose estimate may be obtained from slow-sampled and noisy exteroceptive sensors. The velocity, however, cannot be measured directly. To address this problem, we develop a model-based observer which acts as an internal model for Target kinematics/dynamics and therefore, may act as a predictor during periods of no measurement. To this end, firstly, we formalize the estimation problem on the SE(3) Lie group with different state and measurement spaces. Secondly, we develop the kinematics and dynamics observer such that the overall observer error dynamics possesses a stability property. Finally, the proposed observer is validated through robust Monte-Carlo simulations and experiments on a robotic facility.
http://arxiv.org/abs/1903.07230
Machine learning models of music typically break up the task of composition into a chronological process, composing a piece of music in a single pass from beginning to end. In contrast, human composers write music in a nonlinear fashion, scribbling motifs here and there and often revisiting choices made previously. To better approximate this process, we train a convolutional neural network to complete partial musical scores and explore the use of blocked Gibbs sampling as an analogue to rewriting. Neither the model nor the generative procedure is tied to a particular causal direction of composition. Our model is an instance of orderless NADE (Uria et al., 2014), which allows more direct ancestral sampling. However, we find that Gibbs sampling greatly improves sample quality, which we demonstrate to be due to some conditional distributions being poorly modeled. Moreover, we show that even the cheap approximate blocked Gibbs procedure from Yao et al. (2014) yields better samples than ancestral sampling, based on both log-likelihood and human evaluation.
http://arxiv.org/abs/1903.07227
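A minimal sketch of the blocked Gibbs procedure for score completion, assuming a model that can fill in masked cells conditioned on the rest; the `model_sample` callable, block size and step count are placeholders.

```python
import numpy as np

def blocked_gibbs(model_sample, score, n_steps=100, block_frac=0.25, rng=None):
    """Blocked Gibbs resampling of a pianoroll-like score (illustrative).

    model_sample(score, mask) must return values for the masked cells drawn
    from the model's conditional given the unmasked cells; for an orderless
    NADE this is a single forward pass. Repeatedly erasing and rewriting
    random blocks mimics a composer revisiting earlier choices.
    """
    rng = rng or np.random.default_rng()
    score = score.copy()
    for _ in range(n_steps):
        mask = rng.random(score.shape) < block_frac   # random block to rewrite
        score[mask] = model_sample(score, mask)[mask] # resample masked cells
    return score
```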
This work develops a novel end-to-end deep unsupervised learning method based on a convolutional neural network (CNN) with pseudo-classes for remote sensing scene representation. First, we introduce center points as the centers of the pseudo-classes, so that training samples can be assigned pseudo labels based on the center points; the CNN model used to extract features from the scenes can then be trained in a supervised manner with the pseudo labels. Moreover, a pseudo-center loss is developed to decrease the variance between the samples and the corresponding pseudo center point. The pseudo-center loss is important since it simultaneously updates both the center points with the training samples and the CNN model with the center points during training. Finally, joint learning of the pseudo-center loss and the pseudo softmax loss, which is formulated over the samples and the pseudo labels, is developed for unsupervised remote sensing scene representation, yielding discriminative representations of the scenes. Experiments conducted on two commonly used remote sensing scene datasets validate the effectiveness of the proposed method, and the results show its superiority over other state-of-the-art methods.
http://arxiv.org/abs/1903.07224
Objective: Monitoring athlete internal workload exposure, including prevention of catastrophic non-contact knee injuries, relies on the existence of a custom early-warning detection system. This system must be able to estimate accurate, reliable, and valid musculoskeletal joint loads, for sporting maneuvers in near real-time and during match play. However, current methods are constrained to laboratory instrumentation, are labor and cost intensive, and require highly trained specialist knowledge, thereby limiting their ecological validity and volume deployment. Methods: Here we show that kinematic data obtained from wearable sensor accelerometers, in lieu of embedded force platforms, can leverage recent supervised learning techniques to predict in-game near real-time multidimensional ground reaction forces and moments (GRF/M). Competing convolutional neural network (CNN) deep learning models were trained using laboratory-derived stance phase GRF/M data and simulated sensor accelerations for running and sidestepping maneuvers derived from nearly half a million legacy motion trials. Then, predictions were made from each model driven by five sensor accelerations recorded during independent inter-laboratory data capture sessions. Results: Despite adversarial conditions, the proposed deep learning workbench achieved correlations for vertical GRF above 0.96 for running. Conclusion: The lessons learned from this study will facilitate the use of wearable sensors in conjunction with deep learning to accurately estimate near real-time on-field GRF/M. Significance: Coaching, medical, and allied health staff can use this technology to monitor a range of joint loading indicators during game play, with the ultimate aim to minimize the occurrence of non-contact injuries in elite and community-level sports.
http://arxiv.org/abs/1903.07221
While deep neural networks have achieved state-of-the-art performance across a large number of complex tasks, it remains a big challenge to deploy such networks for practical, on-device edge scenarios such as on mobile devices, consumer devices, drones, and vehicles. In this study, we take a deeper exploration into a human-machine collaborative design approach for creating highly efficient deep neural networks through a synergy between principled network design prototyping and machine-driven design exploration. The efficacy of human-machine collaborative design is demonstrated through the creation of AttoNets, a family of highly efficient deep neural networks for on-device edge deep learning. Each AttoNet possesses a human-specified network-level macro-architecture comprising custom modules with unique machine-designed module-level macro-architecture and micro-architecture designs, all driven by human-specified design requirements. Experimental results on the task of object recognition showed that the AttoNets created via human-machine collaborative design have significantly fewer parameters and computational costs than state-of-the-art networks designed for efficiency, while achieving noticeably higher accuracy (the smallest AttoNet achieves ~1.8% higher accuracy while requiring ~10x fewer multiply-add operations and parameters than MobileNet-V1). Furthermore, the efficacy of the AttoNets is demonstrated on instance-level object segmentation and object detection, where an AttoNet-based Mask R-CNN network was constructed with significantly fewer parameters and computational costs (~5x fewer multiply-add operations and ~2x fewer parameters) than a ResNet-50 based Mask R-CNN network.
http://arxiv.org/abs/1903.07209
Designing agents capable of explaining complex sequential decisions remains a significant open problem in automated decision-making. Recently, there has been much interest in developing approaches for generating such explanations for various decision-making paradigms. One such approach is the idea of explanation as model reconciliation. The framework hypothesizes that one common reason for the user's confusion is a mismatch between the user's model of the task and the one used by the system to generate the decisions. While this is a general framework, most works explicitly built on this explanatory philosophy have focused on settings where the model of the user's knowledge is available in declarative form. Our goal in this paper is to adapt the model reconciliation approach to cases where such user models are no longer explicitly provided. We present a simple and easy-to-learn labeling model that can help an explainer decide what information could help achieve model reconciliation between the user and the agent.
http://arxiv.org/abs/1903.07198
Superpixel decomposition methods are widely used in computer vision and image processing applications. By grouping homogeneous pixels, they can increase accuracy, and the reduced number of elements to process can drastically lower the computational burden. Most superpixel methods compute a trade-off between 1) color homogeneity, 2) adherence to the image contours and 3) shape regularity of the decomposition. In this paper, we propose a framework that jointly enforces all these aspects and provides accurate and regular Superpixels with Contour Adherence using Linear Path (SCALP). During the decomposition, we consider color features along the linear path between each pixel and the corresponding superpixel barycenter. A contour prior is also used to prevent the crossing of image boundaries when associating a pixel with a superpixel. Finally, to improve decomposition accuracy and robustness to noise, we integrate pixel neighborhood information while preserving the same computational complexity. SCALP is extensively evaluated on a standard segmentation dataset, and the obtained results outperform those of state-of-the-art methods. SCALP is also extended for supervoxel decomposition on MRI images.
http://arxiv.org/abs/1903.07193
This paper presents a learning-based clothing animation method for highly efficient virtual try-on simulation. Given a garment, we preprocess a rich database of physically-based dressed character simulations, for multiple body shapes and animations. Then, using this database, we train a learning-based model of cloth drape and wrinkles, as a function of body shape and dynamics. We propose a model that separates global garment fit, due to body shape, from local garment wrinkles, due to both pose dynamics and body shape. We use a recurrent neural network to regress garment wrinkles, and we achieve highly plausible nonlinear effects, in contrast to the blending artifacts suffered by previous methods. At runtime, dynamic virtual try-on animations are produced in just a few milliseconds for garments with thousands of triangles. We show qualitative and quantitative analysis of results.
http://arxiv.org/abs/1903.07190
In image processing tasks, spatial priors are essential for robust computations, regularization, algorithmic design and Bayesian inference. In this paper, we introduce weighted mean curvature (WMC) as a novel image prior and present an efficient computation scheme for its discretization in practical image processing applications. We first demonstrate the favorable properties of WMC, such as sampling invariance, scale invariance, and contrast invariance under a Gaussian noise model, and we show the relation of WMC to area regularization. We further propose an efficient computation scheme for discretized WMC, which is demonstrated herein to process over 33.2 giga-pixels/second on a GPU. This scheme lends itself to a convolutional neural network representation. Finally, WMC is evaluated on synthetic and real images, showing its quantitative superiority to total variation and mean curvature.
http://arxiv.org/abs/1903.07189
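One common formulation consistent with the abstract, stated here as an assumption since the abstract gives no formula: the mean curvature of the image level sets, weighted by the gradient norm.

```latex
% Assumed notation; illustrative, not quoted from the paper.
\[
  H_w(I) = \|\nabla I\| \, \operatorname{div}\!\left( \frac{\nabla I}{\|\nabla I\|} \right),
\]
% a quantity expressible through first- and second-order derivatives of I,
% hence discretizable with small convolution kernels; this is consistent with
% the GPU throughput and convolutional-network representation mentioned above.
```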
Disease progression modeling (DPM) using longitudinal data is a challenging machine learning task. Existing DPM algorithms neglect temporal dependencies among measurements, make parametric assumptions about biomarker trajectories, do not model multiple biomarkers jointly, and need an alignment of subjects' trajectories. In this paper, recurrent neural networks (RNNs) are utilized to address these issues. However, in many cases, longitudinal cohorts contain incomplete data, which hinders the application of standard RNNs and requires a pre-processing step such as imputation of the missing values. Instead, we propose a generalized training rule for the most widely used RNN architecture, long short-term memory (LSTM) networks, that can handle both missing predictor and target values. The proposed LSTM algorithm is applied to model the progression of Alzheimer's disease (AD) using six volumetric magnetic resonance imaging (MRI) biomarkers, i.e., volumes of ventricles, hippocampus, whole brain, fusiform, middle temporal gyrus, and entorhinal cortex, and it is compared to standard LSTM networks with data imputation and a parametric, regression-based DPM method. The results show that the proposed algorithm achieves a significantly lower mean absolute error (MAE) than the alternatives, with p < 0.05 using the Wilcoxon signed rank test, in predicting values of almost all of the MRI biomarkers. Moreover, a linear discriminant analysis (LDA) classifier applied to the predicted biomarker values produces a significantly larger AUC of 0.90 vs. at most 0.84, with p < 0.001 using McNemar's test, for clinical diagnosis of AD. Inspection of MAE curves as a function of the amount of missing data reveals that the proposed LSTM algorithm achieves the best performance until more than 74% of the values are missing. Finally, it is illustrated how the method can successfully be applied to data with varying time intervals.
http://arxiv.org/abs/1903.07173
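A minimal sketch of the masking idea for training on incomplete targets; the paper's generalized LSTM rule also handles missing inputs, and the tensor shapes here are assumptions.

```python
import torch

def masked_mae(pred, target, observed):
    """Mean absolute error over observed entries only.

    pred, target: (batch, time, n_biomarkers) tensors; observed is a boolean
    mask of the same shape that is False where a measurement is missing.
    Gradients then flow only through observed targets, so the LSTM can be
    trained on incomplete longitudinal data without imputation.
    """
    err = (pred - target).abs() * observed          # zero out missing entries
    return err.sum() / observed.sum().clamp(min=1)  # average over observed only
```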
Superpixels have become very popular in many computer vision applications. Nevertheless, they remain underexploited, since superpixel decomposition may produce irregular and unstable segmentation results due to its dependence on the image content. In this paper, we first introduce a novel structure, a superpixel-based patch called SuperPatch. The proposed structure, based on superpixel neighborhoods, leads to a robust descriptor since spatial information is naturally included. The generalization of the PatchMatch method to SuperPatches, named SuperPatchMatch, is then introduced. Finally, we propose a framework to perform fast segmentation and labeling from an image database, and demonstrate the potential of our approach by outperforming state-of-the-art methods, in terms of both computational cost and accuracy, on face labeling and medical image segmentation.
http://arxiv.org/abs/1903.07169
Parkinson's Disease (PD) is one of the most prevalent neurodegenerative diseases, affecting tens of millions of Americans. PD is highly progressive and heterogeneous. Quite a few studies have been conducted in recent years on predictive or disease progression modeling of PD using clinical and biomarker data. Neuroimaging, another important information source for neurodegenerative diseases, has also attracted considerable interest from the PD community. In this paper, we propose a deep learning method based on Graph Convolutional Networks (GCN) for fusing multiple modalities of brain images in relationship prediction, which is useful for distinguishing PD cases from controls. On the Parkinson's Progression Markers Initiative (PPMI) cohort, our approach achieved $0.9537\pm 0.0587$ AUC, compared with $0.6443\pm 0.0223$ AUC achieved by traditional approaches such as PCA.
http://arxiv.org/abs/1805.08801
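For reference, one graph-convolution layer of the standard Kipf and Welling form that such GCN-based fusion models build on; this is the generic layer, not the paper's full multi-modal architecture.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer with symmetric normalization.

    A: (n, n) adjacency matrix, H: (n, d_in) node features,
    W: (d_in, d_out) weight matrix. Each node aggregates its neighbours'
    (and, via self-loops, its own) features before the linear map and ReLU.
    """
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))          # D^{-1/2}
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)
```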
Automatic segmentation methods are important tools for quantitative analysis of Magnetic Resonance Images (MRI). Recently, patch-based label fusion approaches have demonstrated state-of-the-art segmentation accuracy. In this paper, we introduce a new patch-based label fusion framework to perform segmentation of anatomical structures. The proposed approach uses an Optimized PAtchMatch Label fusion (OPAL) strategy that drastically reduces the computation time required for the search of similar patches. The reduced computation time of OPAL opens the way for new strategies and facilitates processing on large databases. In this paper, we investigate new perspectives offered by OPAL by introducing a new multi-scale and multi-feature framework. During our validation on hippocampus segmentation we use two datasets: young adults in the ICBM cohort and elderly adults in the EADC-ADNI dataset. For both, OPAL is compared to state-of-the-art methods. Results show that OPAL obtained the highest median Dice coefficient (89.9% for ICBM and 90.1% for EADC-ADNI). Moreover, in both cases, OPAL produced a segmentation accuracy similar to inter-expert variability. On the EADC-ADNI dataset, we compare the hippocampal volumes obtained by manual and automatic segmentation. The volumes appear to be highly correlated, which enables more accurate separation of pathological populations.
http://arxiv.org/abs/1903.07165
In the superpixel literature, the comparison of state-of-the-art methods can be biased by the non-robustness of some metrics to decomposition aspects, such as the superpixel scale. Moreover, most recent decomposition methods allow setting a shape regularity parameter, which can have a substantial impact on the measured performances. In this paper, we introduce an evaluation framework that aims to unify the comparison process of superpixel methods. We investigate the limitations of existing metrics and propose to evaluate each of the three core decomposition aspects: color homogeneity, respect of image objects and shape regularity. To measure the regularity aspect, we propose a new global regularity measure (GR), which addresses the non-robustness of state-of-the-art metrics. We evaluate recent superpixel methods with these criteria, at several superpixel scales and regularity levels. The proposed framework reduces the bias in the comparison process of state-of-the-art superpixel methods. Finally, we demonstrate that the proposed GR measure is correlated with the performances of various applications.
http://arxiv.org/abs/1903.07162
This is a work-in-progress report, which aims to share preliminary results on a novel sequence-to-sequence schema for dependency parsing that relies on a combination of a BiLSTM and two Pointer Networks (Vinyals et al., 2015), in which the final softmax function has been replaced with logistic regression. The two pointer networks cooperate to develop latent syntactic knowledge by learning the lexical properties of "selection" and the lexical properties of "selectability", respectively. At the moment, and without fine-tuning, the parser implementation achieves a UAS of 93.14% on the English Penn Treebank (Marcus et al., 1993) annotated with Stanford Dependencies: 2-3% below the state of the art, but still attractive as a baseline for the approach.
http://arxiv.org/abs/1903.07161
We propose a data-driven framework to enable the modeling and optimization of human-machine interaction processes, e.g., systems aimed at assisting humans in decision-making or learning, work-load allocation, and interactive advertising. This is a challenging problem for several reasons. First, humans' behavior is hard to model or infer, as it may reflect biases, long-term memory, and sensitivity to sequencing, i.e., transience and exponential complexity in the length of the interaction. Second, due to the interactive nature of such processes, the machine policy used to engage with a human may bias possible data-driven inferences. Finally, in choosing machine policies that optimize interaction rewards, one must, on the one hand, avoid being overly sensitive to error/variability in the estimated human model, and on the other, avoid being overly deterministic/predictable, which may result in poor human 'engagement' in the interaction. To meet these challenges, we propose a robust approach, based on the maximum entropy principle, which iteratively estimates human behavior and optimizes the machine policy: the Alternating Entropy-Reward Ascent (AREA) algorithm. We characterize AREA in terms of its space and time complexity and convergence. We also provide an initial validation based on synthetic data generated by an established noisy nonlinear model for human decision-making.
http://arxiv.org/abs/1903.07157
Image restoration problems are typically ill-posed, requiring the design of suitable priors. These priors are typically hand-designed and fully instantiated throughout the process. In this paper, we introduce a novel framework for handling inverse problems related to image restoration, based on elements from the half-quadratic splitting method and proximal operators. Modeling the proximal operator as a convolutional network, we define an implicit prior on the image space as a function class during training. This is in contrast to the common practice in the literature of having the prior fixed and fully instantiated even during training. Further, we allow this proximal operator to be tuned differently for each iteration, which greatly increases modeling capacity and allows us to reduce the number of iterations by an order of magnitude compared to other approaches. Our final network is end-to-end; its run time matches the previous fastest algorithms while outperforming them in recovery fidelity on two image restoration tasks. Indeed, our approach achieves state-of-the-art results on benchmarks in image denoising and image super-resolution while recovering more complex and finer details.
http://arxiv.org/abs/1903.07154
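A hedged sketch of half-quadratic splitting with a learned, per-iteration proximal operator: the `degrade` operator, the step sizes, and the `prox_nets` list are placeholders, and the paper's exact update may use a closed-form data step instead of the gradient step shown here.

```python
import torch

def hqs_restore(y, degrade, prox_nets, n_iters, rho=0.5, lr=1.0):
    """Half-quadratic-splitting restoration with learned proximal steps (sketch).

    Alternates (i) a gradient step on the data-fidelity term
    ||degrade(x) - y||^2 + rho * ||x - z||^2 and (ii) a learned proximal
    step z = prox_nets[k](x); a different network per iteration mirrors
    the per-iteration tuning described above.
    """
    x = y.clone()
    z = y.clone()
    for k in range(n_iters):
        x = x.detach().requires_grad_(True)
        data_term = ((degrade(x) - y) ** 2).sum() + rho * ((x - z) ** 2).sum()
        grad, = torch.autograd.grad(data_term, x)
        x = (x - lr * grad).detach()          # data-fidelity update
        with torch.no_grad():
            z = prox_nets[k](x)               # learned prior (proximal) step
    return z
```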