In recent years, Signal Temporal Logic (STL) has gained traction as a practical and expressive means of encoding control objectives for robotic and cyber-physical systems. The state-of-the-art in STL trajectory synthesis is to formulate the problem as a Mixed Integer Linear Program (MILP). The MILP approach is sound and complete for bounded specifications, but such strong correctness guarantees come at the price of exponential complexity in the number of predicates and the time bound of the specification. In this work, we propose an alternative synthesis paradigm that relies on Bayesian optimization rather than mixed integer programming. This relaxes the completeness guarantee to probabilistic completeness, but is significantly more efficient: our approach scales polynomially in the STL time-bound and linearly in the number of predicates. We prove that our approach is sound and probabilistically complete, and demonstrate its scalability with a nontrivial example.
http://arxiv.org/abs/1905.03051
We propose an end-to-end deep learning learning model for graph classification and representation learning that is invariant to permutation of the nodes of the input graphs. We address the challenge of learning a fixed size graph representation for graphs of varying dimensions through a differentiable node attention pooling mechanism. In addition to a theoretical proof of its invariance to permutation, we provide empirical evidence demonstrating the statistically significant gain in accuracy when faced with an isomorphic graph classification task given only a small number of training examples. We analyse the effect of four different matrices to facilitate the local message passing mechanism by which graph convolutions are performed vs. a matrix parametrised by a learned parameter pair able to transition smoothly between the former. Finally, we show that our model achieves competitive classification performance with existing techniques on a set of molecule datasets.
http://arxiv.org/abs/1905.03046
In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class. Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains. To do so, we present basic algorithmic templates for building near-optimal predictors and reinforcement learners which behave as if they had a probabilistic model that allowed them to efficiently exploit task structure. Furthermore, we recast memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state-machine of sufficient statistics. Essentially, memory-based meta-learning translates the hard problem of probabilistic sequential inference into a regression problem.
http://arxiv.org/abs/1905.03030
Magnetic particle imaging (MPI) data is commonly reconstructed using a system matrix acquired in a time-consuming calibration measurement. The calibration approach has the important advantage over model-based reconstruction that it takes the complex particle physics as well as system imperfections into account. This benefit comes for the cost that the system matrix needs to be re-calibrated whenever the scan parameters, particle types or even the particle environment (e.g. viscosity or temperature) changes. One route for reducing the calibration time is the sampling of the system matrix at a subset of the spatial positions of the intended field-of-view and employing system matrix recovery. Recent approaches used compressed sensing (CS) and achieved subsampling factors up to 28 that still allowed reconstructing MPI images of sufficient quality. In this work, we propose a novel framework with a 3d-System Matrix Recovery Network and demonstrate it to recover a 3d system matrix with a subsampling factor of 64 in less than one minute and to outperform CS in terms of system matrix quality, reconstructed image quality, and processing time. The advantage of our method is demonstrated by reconstructing open access MPI datasets. The model is further shown to be capable of inferring system matrices for different particle types.
http://arxiv.org/abs/1905.03026
In this work, we present a method for automatic colorization of grayscale videos. The core of the method is a Generative Adversarial Network that is trained and tested on sequences of frames in a sliding window manner. Network convolutional and deconvolutional layers are three-dimensional, with frame height, width and time as the dimensions taken into account. Multiple chrominance estimates per frame are aggregated and combined with available luminance information to recreate a colored sequence. Colorization trials are run succesfully on a dataset of old black-and-white films. The usefulness of our method is also validated with numerical results, computed with a newly proposed metric that measures colorization consistency over a frame sequence.
http://arxiv.org/abs/1905.03023
Cancellable biometrics (CB) as a means for biometric template protection approach refers to an irreversible yet similarity preserving transformation on the original template. With similarity preserving property, the matching between template and query instance can be performed in the transform domain without jeopardizing accuracy performance. Unfortunately, this trait invites a class of attack, namely similarity-based attack (SA). SA produces a preimage, an inverse of transformed template, which can be exploited for impersonation and cross-matching. In this paper, we propose a Genetic Algorithm enabled similarity-based attack framework (GASAF) to demonstrate that CB schemes whose possess similarity preserving property are highly vulnerable to similarity-based attack. Besides that, a set of new metrics is designed to measure the effectiveness of the similarity-based attack. We conduct the experiment on two representative CB schemes, i.e. BioHashing and Bloom-filter. The experimental results attest the vulnerability under this type of attack.
http://arxiv.org/abs/1905.03021
With the increasing size of datasets and demand for real time response for interactive applications, improving runtime for algorithms with excessive computational requirements has become increasingly important. Many different algorithms combining efficient priority queues with various helper structures have been proposed for computing grey-weighted distance transforms. Here we compare the performance of popular competitive algorithms in different scenarios to form practical guidelines easy to adopt. The label-setting category of algorithms is shown to be the best choice for all scenarios. The hierarchical heap with a pointer array to keep track of nodes on the heap is shown to be the best choice as priority queue. However, if memory is a critical issue, then the best choice is the Dial priority queue for integer valued costs and the Untidy priority queue for real valued costs.
http://arxiv.org/abs/1905.03017
Internet of Things (IoT) devices and applications are being deployed in our homes and workplaces and in our daily lives. These devices often rely on continuous data collection and machine learning models for analytics and actuations. However, this approach introduces a number of privacy and efficiency challenges, as the service operator can perform arbitrary inferences on the available data. Recently, advances in edge processing have paved the way for more efficient, and private, data processing at the source for simple tasks and lighter models, though they remain a challenge for larger, and more complicated models. In this paper, we present a hybrid approach for breaking down large, complex deep neural networks for cooperative, privacy-preserving analytics. To this end, instead of performing the whole operation on the cloud, we let an IoT device to run the initial layers of the neural network, and then send the output to the cloud to feed the remaining layers and produce the final result. We manipulate the model with Siamese fine-tuning and propose a noise addition mechanism to ensure that the output of the user’s device contains no extra information except what is necessary for the main task, preventing any secondary inference on the data. We then evaluate the privacy benefits of this approach based on the information exposed to the cloud service. We also asses the local inference cost of different layers on a modern handset. Our evaluations show that by using Siamese fine-tuning and at a small processing cost, we can greatly reduce the level of unnecessary, potentially sensitive information in the personal data, and thus achieving the desired trade-off between utility, privacy, and performance.
http://arxiv.org/abs/1703.02952
While many individual tasks in the domain of human analysis have recently received an accuracy boost from deep learning approaches, multi-task learning has mostly been ignored due to a lack of data. New synthetic datasets are being released, filling this gap with synthetic generated data. In this work, we analyze four related human analysis tasks in still images in a multi-task scenario by leveraging such datasets. Specifically, we study the correlation of 2D/3D pose estimation, body part segmentation and full-body depth estimation. These tasks are learned via the well-known Stacked Hourglass module such that each of the task-specific streams shares information with the others. The main goal is to analyze how training together these four related tasks can benefit each individual task for a better generalization. Results on the newly released SURREAL dataset show that all four tasks benefit from the multi-task approach, but with different combinations of tasks: while combining all four tasks improves 2D pose estimation the most, 2D pose improves neither 3D pose nor full-body depth estimation. On the other hand 2D parts segmentation can benefit from 2D pose but not from 3D pose. In all cases, as expected, the maximum improvement is achieved on those human body parts that show more variability in terms of spatial distribution, appearance and shape, e.g. wrists and ankles.
http://arxiv.org/abs/1905.03003
For rescue robots, flipper endows the robot with additional ability to pass through various terrain. Autonomous motion becomes more important. In recent work autonomy is done by either planning with several special states or based on collected data. We are considering if it is possible to find a way to build continues states without collecting old trail data. In this paper, we first model the possible states as a global planning path with parameter configuration of the scene. Then, we follows the path to achieve the autonomous run. We plot the morphology of each path points to show the correctness of the path and implement a simple path following on real robot to demonstrate the performance of our algorithm.
http://arxiv.org/abs/1905.02984
Although many research vehicle platforms for autonomous driving have been built in the past, hardware design, source code and lessons learned have not been made available for the next generation of demonstrators. This raises the efforts for the research community to contribute results based on real-world evaluations as engineering knowledge of building and maintaining a research vehicle is lost. In this paper, we deliver an analysis of our approach to transferring an open source driving stack to a research vehicle. We put the hardware and software setup in context to other demonstrators and explain the criteria that led to our chosen hardware and software design. Specifically, we discuss the mapping of the Apollo driving stack to the system layout of our research vehicle, fortuna, including communication with the actuators by a controller running on a real-time hardware platform and the integration of the sensor setup. With our collection of the lessons learned, we encourage a faster setup of such systems by other research groups in the future.
http://arxiv.org/abs/1905.02980
Mapping is an essential task for mobile robots and topological representation often works as a basis for the various applications. In this paper, a novel framework that can build topological maps incrementally is proposed. The algorithm is using a distance map, and in our framework the topological map can grow as we append more sensor data to the map. To demonstrate our algorithm, we show the result of the distance map based method on several popular maps and run the incremental framework with raw sensor data to have a growing topological map, as an example of a robot exploring the environment.
http://arxiv.org/abs/1811.01547
The detection of allusive text reuse is particularly challenging due to the sparse evidence on which allusive references rely—commonly based on none or very few shared words. Arguably, lexical semantics can be resorted to since uncovering semantic relations between words has the potential to increase the support underlying the allusion and alleviate the lexical sparsity. A further obstacle is the lack of evaluation benchmark corpora, largely due to the highly interpretative character of the annotation process. In the present paper, we aim to elucidate the feasibility of automated allusion detection. We approach the matter from an Information Retrieval perspective in which referencing texts act as queries and referenced texts as relevant documents to be retrieved, and estimate the difficulty of benchmark corpus compilation by a novel inter-annotator agreement study on query segmentation. Furthermore, we investigate to what extent the integration of lexical semantic information derived from distributional models and ontologies can aid retrieving cases of allusive reuse. The results show that (i) despite low agreement scores, using manual queries considerably improves retrieval performance with respect to a windowing approach, and that (ii) retrieval performance can be moderately boosted with distributional semantics.
http://arxiv.org/abs/1905.02973
Inspired by the fact that different modalities in videos carry complementary information, we propose a Multimodal Semantic Attention Network(MSAN), which is a new encoder-decoder framework incorporating multimodal semantic attributes for video captioning. In the encoding phase, we detect and generate multimodal semantic attributes by formulating it as a multi-label classification problem. Moreover, we add auxiliary classification loss to our model that can obtain more effective visual features and high-level multimodal semantic attribute distributions for sufficient video encoding. In the decoding phase, we extend each weight matrix of the conventional LSTM to an ensemble of attribute-dependent weight matrices, and employ attention mechanism to pay attention to different attributes at each time of the captioning process. We evaluate algorithm on two popular public benchmarks: MSVD and MSR-VTT, achieving competitive results with current state-of-the-art across six evaluation metrics.
http://arxiv.org/abs/1905.02963
Blind video decaptioning is a problem of automatically removing text overlays and inpainting the occluded parts in videos without any input masks. While recent deep learning based inpainting methods deal with a single image and mostly assume that the positions of the corrupted pixels are known, we aim at automatic text removal in video sequences without mask information. In this paper, we propose a simple yet effective framework for fast blind video decaptioning. We construct an encoder-decoder model, where the encoder takes multiple source frames that can provide visible pixels revealed from the scene dynamics. These hints are aggregated and fed into the decoder. We apply a residual connection from the input frame to the decoder output to enforce our network to focus on the corrupted regions only. Our proposed model was ranked in the first place in the ECCV Chalearn 2018 LAP Inpainting Competition Track2: Video decaptioning. In addition, we further improve this strong model by applying a recurrent feedback. The recurrent feedback not only enforces temporal coherence but also provides strong clues on where the corrupted pixels are. Both qualitative and quantitative experiments demonstrate that our full model produces accurate and temporally consistent video results in real time (50+ fps).
http://arxiv.org/abs/1905.02949
We present a single-image 3D face synthesis technique that can handle challenging facial expressions while recovering fine geometric details. Our technique employs expression analysis for proxy face geometry generation and combines supervised and unsupervised learning for facial detail synthesis. On proxy generation, we conduct emotion prediction to determine a new expression-informed proxy. On detail synthesis, we present a Deep Facial Detail Net (DFDN) based on Conditional Generative Adversarial Net (CGAN) that employs both geometry and appearance loss functions. For geometry, we capture 366 high-quality 3D scans from 122 different subjects under 3 facial expressions. For appearance, we use additional 20K in-the-wild face images and apply image-based rendering to accommodate lighting variations. Comprehensive experiments demonstrate that our framework can produce high-quality 3D faces with realistic details under challenging facial expressions.
http://arxiv.org/abs/1903.10873
Emotion is intrinsic to humans and consequently emotion understanding is a key part of human-like artificial intelligence (AI). Emotion recognition in conversation (ERC) is becoming increasingly popular as a new research frontier in natural language processing (NLP) due to its ability to mine opinions from the plethora of publicly available conversational data in platforms such as Facebook, Youtube, Reddit, Twitter, and others. Moreover, it has potential applications in health-care systems (as a tool for psychological analysis), education (understanding student frustration) and more. Additionally, ERC is also extremely important for generating emotion-aware dialogues that require an understanding of the user’s emotions. Catering to these needs calls for effective and scalable conversational emotion-recognition algorithms. However, it is a strenuous problem to solve because of several research challenges. In this paper, we discuss these challenges and shed light on the recent research in this field. We also describe the drawbacks of these approaches and discuss the reasons why they fail to successfully overcome the research challenges in ERC.
http://arxiv.org/abs/1905.02947
In this paper, we propose a novel meta-learning method in a reinforcement learning setting, based on evolution strategies (ES), exploration in parameter space and deterministic policy gradients. ES methods are easy to parallelize, which is desirable for modern training architectures; however, such methods typically require a huge number of samples for effective training. We use deterministic policy gradients during adaptation and other techniques to compensate for the sample-efficiency problem while maintaining the inherent scalability of ES methods. We demonstrate that our method achieves good results compared to gradient-based meta-learning in high-dimensional control tasks in the MuJoCo simulator. In addition, because of gradient-free methods in the meta-training phase, which do not need information about gradients and policies in adaptation training, we predict and confirm our algorithm performs better in tasks that need multi-step adaptation.
http://arxiv.org/abs/1812.11314
Federated learning performs distributed model training using local data hosted by agents. It shares only model parameter updates for iterative aggregation at the server. Although it is privacy-preserving by design, federated learning is vulnerable to noise corruption of local agents, as demonstrated in the previous study on adversarial data poisoning threat against federated learning systems. Even a single noise-corrupted agent can bias the model training. In our work, we propose a collaborative and privacy-preserving machine teaching paradigm with multiple distributed teachers, to improve robustness of the federated training process against local data corruption. We assume that each local agent (teacher) have the resources to verify a small portions of trusted instances, which may not by itself be adequate for learning. In the proposed collaborative machine teaching method, these trusted instances guide the distributed agents to jointly select a compact while informative training subset from data hosted by their own. Simultaneously, the agents learn to add changes of limited magnitudes into the selected data instances, in order to improve the testing performances of the federally trained model despite of the training data corruption. Experiments on toy and real data demonstrate that our approach can identify training set bugs effectively and suggest appropriate changes to the labels. Our algorithm is a step toward trustworthy machine learning.
http://arxiv.org/abs/1905.02941
Artificial intelligence (AI) researchers claim that they have made great achievements' in clinical realms. However, clinicians point out the so-called
achievements’ have no ability to implement into natural clinical settings. The root cause for this huge gap is that many essential features of natural clinical tasks are overlooked by AI system developers without medical background. In this paper, we propose that the clinical benchmark suite is a novel and promising direction to capture the essential features of the real-world clinical tasks, hence qualifies itself for guiding the development of AI systems, promoting the implementation of AI in real-world clinical practice.
http://arxiv.org/abs/1905.02940
A generally intelligent learner should generalize to more complex tasks than it has previously encountered, but the two common paradigms in machine learning – either training a separate learner per task or training a single learner for all tasks – both have difficulty with such generalization because they do not leverage the compositional structure of the task distribution. This paper introduces the compositional problem graph as a broadly applicable formalism to relate tasks of different complexity in terms of problems with shared subproblems. We propose the compositional generalization problem for measuring how readily old knowledge can be reused and hence built upon. As a first step for tackling compositional generalization, we introduce the compositional recursive learner, a domain-general framework for learning algorithmic procedures for composing representation transformations, producing a learner that reasons about what computation to execute by making analogies to previously seen problems. We show on a symbolic and a high-dimensional domain that our compositional approach can generalize to more complex problems than the learner has previously encountered, whereas baselines that are not explicitly compositional do not.
http://arxiv.org/abs/1807.04640
As the expressive depth of an emotional face differs with individuals or expressions, recognizing an expression using a single facial image at a moment is difficult. A relative expression of a query face compared to a reference face might alleviate this difficulty. In this paper, we propose to utilize contrastive representation that embeds a distinctive expressive factor for a discriminative purpose. The contrastive representation is calculated at the embedding layer of deep networks by comparing a given (query) image with the reference image. We attempt to utilize a generative reference image that is estimated based on the given image. Consequently, we deploy deep neural networks that embed a combination of a generative model, a contrastive model, and a discriminative model with an end-to-end training manner. In our proposed networks, we attempt to disentangle a facial expressive factor in two steps including learning of a generator network and a contrastive encoder network. We conducted extensive experiments on publicly available face expression databases (CK+, MMI, Oulu-CASIA, and in-the-wild databases) that have been widely adopted in the recent literatures. The proposed method outperforms the known state-of-the art methods in terms of the recognition accuracy.
http://arxiv.org/abs/1703.07140
In this work we explore how fine-grained differences between the shapes of common objects are expressed in language, grounded on images and 3D models of the objects. We first build a large scale, carefully controlled dataset of human utterances that each refers to a 2D rendering of a 3D CAD model so as to distinguish it from a set of shape-wise similar alternatives. Using this dataset, we develop neural language understanding (listening) and production (speaking) models that vary in their grounding (pure 3D forms via point-clouds vs. rendered 2D images), the degree of pragmatic reasoning captured (e.g. speakers that reason about a listener or not), and the neural architecture (e.g. with or without attention). We find models that perform well with both synthetic and human partners, and with held out utterances and objects. We also find that these models are amenable to zero-shot transfer learning to novel object classes (e.g. transfer from training on chairs to testing on lamps), as well as to real-world images drawn from furniture catalogs. Lesion studies indicate that the neural listeners depend heavily on part-related words and associate these words correctly with visual parts of objects (without any explicit network training on object parts), and that transfer to novel classes is most successful when known part-words are available. This work illustrates a practical approach to language grounding, and provides a case study in the relationship between object shape and linguistic structure when it comes to object differentiation.
http://arxiv.org/abs/1905.02925
Stacking-based deep neural network (S-DNN) is aggregated with pluralities of basic learning modules, one after another, to synthesize a deep neural network (DNN) alternative for pattern classification. Contrary to the DNNs trained end to end by backpropagation (BP), each S-DNN layer, i.e., a self-learnable module, is to be trained decisively and independently without BP intervention. In this paper, a ridge regression-based S-DNN, dubbed deep analytic network (DAN), along with its kernelization (K-DAN), are devised for multilayer feature re-learning from the pre-extracted baseline features and the structured features. Our theoretical formulation demonstrates that DAN/K-DAN re-learn by perturbing the intra/inter-class variations, apart from diminishing the prediction errors. We scrutinize the DAN/K-DAN performance for pattern classification on datasets of varying domains - faces, handwritten digits, generic objects, to name a few. Unlike the typical BP-optimized DNNs to be trained from gigantic datasets by GPU, we disclose that DAN/K-DAN are trainable using only CPU even for small-scale training sets. Our experimental results disclose that DAN/K-DAN outperform the present S-DNNs and also the BP-trained DNNs, including multiplayer perceptron, deep belief network, etc., without data augmentation applied.
http://arxiv.org/abs/1811.07184
The paper discusses an adaptive strategy to effectively control nonlinear manipulation motions of a dual arm robot (DAR) under system uncertainties including parameter variations, actuator nonlinearities and external disturbances. It is proposed that the control scheme is first derived from the dynamic surface control (DSC) method, which allows the robot’s end-effectors to robustly track the desired trajectories. Moreover, since exactly determining the DAR system’s dynamics is impractical due to the system uncertainties, the uncertain system parameters are then proposed to be adaptively estimated by the use of the radial basis function network (RBFN). The adaptation mechanism is derived from the Lyapunov theory, which theoretically guarantees stability of the closed-loop control system. The effectiveness of the proposed RBFN-DSC approach is demonstrated by implementing the algorithm in a synthetic environment with realistic parameters, where the obtained results are highly promising.
http://arxiv.org/abs/1905.02914
Grading breast density is highly sensitive to normalization settings of digital mammogram as the density is tightly correlated with the distribution of pixel intensity. Also, the grade varies with readers due to uncertain grading criteria. These issues are inherent in the density assessment of digital mammography. They are problematic when designing a computer-aided prediction model for breast density and become worse if the data comes from multiple sites. In this paper, we proposed two novel deep learning techniques for breast density prediction: 1) photometric transformation which adaptively normalizes the input mammograms, and 2) label distillation which adjusts the label by using its output prediction. The photometric transformer network predicts optimal parameters for photometric transformation on the fly, learned jointly with the main prediction network. The label distillation, a type of pseudo-label techniques, is intended to mitigate the grading variation. We experimentally showed that the proposed methods are beneficial in terms of breast density prediction, resulting in significant performance improvement compared to various previous approaches.
http://arxiv.org/abs/1905.02906
Video inpainting, which aims at filling in missing regions of a video, remains challenging due to the difficulty of preserving the precise spatial and temporal coherence of video contents. In this work we propose a novel flow-guided video inpainting approach. Rather than filling in the RGB pixels of each frame directly, we consider video inpainting as a pixel propagation problem. We first synthesize a spatially and temporally coherent optical flow field across video frames using a newly designed Deep Flow Completion network. Then the synthesized flow field is used to guide the propagation of pixels to fill up the missing regions in the video. Specifically, the Deep Flow Completion network follows a coarse-to-fine refinement to complete the flow fields, while their quality is further improved by hard flow example mining. Following the guide of the completed flow, the missing video regions can be filled up precisely. Our method is evaluated on DAVIS and YouTube-VOS datasets qualitatively and quantitatively, achieving the state-of-the-art performance in terms of inpainting quality and speed.
http://arxiv.org/abs/1905.02884
In this paper, we present a new inpainting framework for recovering missing regions of video frames. Compared with image inpainting, performing this task on video presents new challenges such as how to preserving temporal consistency and spatial details, as well as how to handle arbitrary input video size and length fast and efficiently. Towards this end, we propose a novel deep learning architecture which incorporates ConvLSTM and optical flow for modeling the spatial-temporal consistency in videos. It also saves much computational resource such that our method can handle videos with larger frame size and arbitrary length streamingly in real-time. Furthermore, to generate an accurate optical flow from corrupted frames, we propose a robust flow generation module, where two sources of flows are fed and a flow blending network is trained to fuse them. We conduct extensive experiments to evaluate our method in various scenarios and different datasets, both qualitatively and quantitatively. The experimental results demonstrate the superior of our method compared with the state-of-the-art inpainting approaches.
http://arxiv.org/abs/1905.02882
Syntax has been demonstrated highly effective in neural machine translation (NMT). Previous NMT models integrate syntax by representing 1-best tree outputs from a well-trained parsing system, e.g., the representative Tree-RNN and Tree-Linearization methods, which may suffer from error propagation. In this work, we propose a novel method to integrate source-side syntax implicitly for NMT. The basic idea is to use the intermediate hidden representations of a well-trained end-to-end dependency parser, which are referred to as syntax-aware word representations (SAWRs). Then, we simply concatenate such SAWRs with ordinary word embeddings to enhance basic NMT models. The method can be straightforwardly integrated into the widely-used sequence-to-sequence (Seq2Seq) NMT models. We start with a representative RNN-based Seq2Seq baseline system, and test the effectiveness of our proposed method on two benchmark datasets of the Chinese-English and English-Vietnamese translation tasks, respectively. Experimental results show that the proposed approach is able to bring significant BLEU score improvements on the two datasets compared with the baseline, 1.74 points for Chinese-English translation and 0.80 point for English-Vietnamese translation, respectively. In addition, the approach also outperforms the explicit Tree-RNN and Tree-Linearization methods.
http://arxiv.org/abs/1905.02878
Generative models for 3D geometric data arise in many important applications in 3D computer vision and graphics. In this paper, we focus on 3D deformable shapes that share a common topological structure, such as human faces and bodies. Morphable Models were among the first attempts to create compact representations for such shapes; despite their effectiveness and simplicity, such models have limited representation power due to their linear formulation. Recently, non-linear learnable methods have been proposed, although most of them resort to intermediate representations, such as 3D grids of voxels or 2D views. In this paper, we introduce a convolutional mesh autoencoder and a GAN architecture based on the spiral convolutional operator, acting directly on the mesh and leveraging its underlying geometric structure. We provide an analysis of our convolution operator and demonstrate state-of-the-art results on 3D shape datasets compared to the linear Morphable Model and the recently proposed COMA model.
http://arxiv.org/abs/1905.02876
We introduce (1) a novel parser for Minimalist Grammars (MG), encoded as a system of first-order logic formulae that may be evaluated using an SMT-solver, and (2) a novel procedure for inferring Minimalist Grammars using this parser. The input to this procedure is a sequence of sentences that have been annotated with syntactic relations such as semantic role labels (connecting arguments to predicates) and subject-verb agreement. The output of this procedure is a set of minimalist grammars, each of which is able to parse the sentences in the input sequence such that the parse for a sentence has the same syntactic relations as those specified in the annotation for that sentence. We applied this procedure to a set of sentences annotated with syntactic relations and evaluated the inferred grammars using cost functions inspired by the Minimum Description Length principle and the Subset principle. Inferred grammars that were optimal with respect to certain combinations of these cost functions were found to align with contemporary theories of syntax.
http://arxiv.org/abs/1905.02869
Visual tracking is one of the most challenging computer vision problems. In order to achieve high performance visual tracking in various negative scenarios, a novel cascaded Siamese network is proposed and developed based on two different deep learning networks: a matching subnetwork and a classification subnetwork. The matching subnetwork is a fully convolutional Siamese network. According to the similarity score between the exemplar image and the candidate image, it aims to search possible object positions and crop scaled candidate patches. The classification subnetwork is designed to further evaluate the cropped candidate patches and determine the optimal tracking results based on the classification score. The matching subnetwork is trained offline and fixed online, while the classification subnetwork performs stochastic gradient descent online to learn more target-specific information. To improve the tracking performance further, an effective classification subnetwork update method based on both similarity and classification scores is utilized for updating the classification subnetwork. Extensive experimental results demonstrate that our proposed approach achieves state-of-the-art performance in recent benchmarks.
http://arxiv.org/abs/1905.02857
Frequently Asked Question (FAQ) retrieval is an important task where the objective is to retrieve the appropriate Question-Answer (QA) pair from a database based on the user’s query. In this study, we propose a FAQ retrieval system that considers the similarity between a user’s query and a question computed by a traditional unsupervised information retrieval system, as well as the relevance between the query and an answer computed by the recently-proposed BERT model. By combining the rule-based approach and the flexible neural approach, the proposed system realizes robust FAQ retrieval. A common approach to FAQ retrieval is to construct labeled data for training, which takes a lot of costs. However, a FAQ database generally contains a too small number of QA pairs to train a model. To surmount this problem, we leverage FAQ sets that are similar to the one in question. We construct localgovFAQ dataset based on FAQ pages of administrative municipalities throughout Japan. In this research, we evaluate our approach on two datasets, localgovFAQ dataset and StackExchange dataset, and demonstrate that our proposed method works effectively.
http://arxiv.org/abs/1905.02851
We aim to better understand attention over nodes in graph neural networks and identify factors influencing its effectiveness. Motivated by insights from the work on Graph Isomorphism Networks (Xu et al., 2019), we design simple graph reasoning tasks that allow us to study attention in a controlled environment. We find that under typical conditions the effect of attention is negligible or even harmful, but under certain conditions it provides an exceptional gain in performance of more than 40% in some of our classification tasks. However, we have yet to satisfy these conditions in practice.
http://arxiv.org/abs/1905.02850
Off-policy learning is more unstable compared to on-policy learning in reinforcement learning (RL). One reason for the instability of off-policy learning is a discrepancy between the target ($\pi$) and behavior (b) policy distributions. The discrepancy between $\pi$ and b distributions can be alleviated by employing a smooth variant of the importance sampling (IS), such as the relative importance sampling (RIS). RIS has parameter $\beta\in[0, 1]$ which controls smoothness. To cope with instability, we present the first relative importance sampling-off-policy actor-critic (RIS-Off-PAC) model-free algorithms in RL. In our method, the network yields a target policy (the actor), a value function (the critic) assessing the current policy ($\pi$) using samples drawn from behavior policy. We use action value generated from the behavior policy in reward function to train our algorithm rather than from the target policy. We also use deep neural networks to train both actor and critic. We evaluated our algorithm on a number of Open AI Gym benchmark problems and demonstrate better or comparable performance to several state-of-the-art RL baselines.
http://arxiv.org/abs/1810.12558
We formulate a new problem as Object Importance Estimation (OIE) in on-road driving videos, where the road users are considered as important objects if they have influence on the control decision of the ego-vehicle’s driver. The importance of a road user depends on both its visual dynamics, e.g., appearance, motion and location, in the driving scene and the driving goal, \emph{e.g}., the planned path, of the ego vehicle. We propose a novel framework that incorporates both visual model and goal representation to conduct OIE. To evaluate our framework, we collect an on-road driving dataset at traffic intersections in the real world and conduct human-labeled annotation of the important objects. Experimental results show that our goal-oriented method outperforms baselines and has much more improvement on the left-turn and right-turn scenarios. Furthermore, we explore the possibility of using object importance for driving control prediction and demonstrate that binary brake prediction can be improved with the information of object importance.
http://arxiv.org/abs/1905.02848
Pattern analysis often requires a pre-processing stage for extracting or selecting features in order to help the classification, prediction, or clustering stage discriminate or represent the data in a better way. The reason for this requirement is that the raw data are complex and difficult to process without extracting or selecting appropriate features beforehand. This paper reviews theory and motivation of different common methods of feature selection and extraction and introduces some of their applications. Some numerical implementations are also shown for these methods. Finally, the methods in feature selection and extraction are compared.
http://arxiv.org/abs/1905.02845
We characterize a prevalent weakness of deep neural networks (DNNs)—overthinking—which occurs when a DNN can reach correct predictions before its final layer. Overthinking is computationally wasteful, and it can also be destructive when, by the final layer, a correct prediction changes into a misclassification. Understanding overthinking requires studying how each prediction \emph{evolves} during a DNN’s forward pass, which conventionally is opaque. For prediction transparency, we propose the Shallow-Deep Network (SDN), a generic modification to off-the-shelf DNNs that introduces internal classifiers. We apply SDN to four modern architectures, trained on three image classification tasks, to characterize the overthinking problem. We show that SDNs can mitigate the wasteful effect of overthinking with confidence-based early exits, which reduce the average inference cost by more than 50% and preserve the accuracy. We also find that the destructive effect occurs for 50% of misclassifications on natural inputs and that it can be induced, adversarially, with a recent backdooring attack. To mitigate this effect, we propose a new confusion metric to quantify the internal disagreements that will likely to lead to misclassifications.
http://arxiv.org/abs/1810.07052
We propose a data-driven approach to online multi-object tracking (MOT) that uses a convolutional neural network (CNN) for data association in a tracking-by-detection framework. The problem of multi-target tracking aims to assign noisy detections to a-priori unknown and time-varying number of tracked objects across a sequence of frames. A majority of the existing solutions focus on either tediously designing cost functions or formulating the task of data association as a complex optimization problem that can be solved effectively. Instead, we exploit the power of deep learning to formulate the data association problem as inference in a CNN. To this end, we propose to learn a similarity function that combines cues from both image and spatial features of objects. Our solution learns to perform global assignments in 3D purely from data, handles noisy detections and a varying number of targets, and is easy to train. We evaluate our approach on the challenging KITTI dataset and show competitive results. Our code is available at https://git.uwaterloo.ca/wise-lab/fantrack.
http://arxiv.org/abs/1905.02843
ArCo is the Italian Cultural Heritage knowledge graph, consisting of a network of seven vocabularies and 169 million triples about 820 thousand cultural entities. It is distributed jointly with a SPARQL endpoint, a software for converting catalogue records to RDF, and a rich suite of documentation material (testing, evaluation, how-to, examples, etc.). ArCo is based on the official General Catalogue of the Italian Ministry of Cultural Heritage and Activities (MiBAC) - and its associated encoding regulations - which collects and validates the catalogue records of (ideally) all Italian Cultural Heritage properties (excluding libraries and archives), contributed by CH administrators from all over Italy. We present its structure, design methods and tools, its growing community, and delineate its importance, quality, and impact.
http://arxiv.org/abs/1905.02840
In this paper, we propose a novel control architecture, inspired from neuroscience, for adaptive control of continuous-time systems. The proposed architecture, in the setting of standard neural network (NN) based adaptive control, augments an external working memory to the NN. The external working memory, through a write operation, stores certain recently observed feature vectors from the hidden layer of the NN. It retrieves relevant vectors from the working memory to modify the final control signal generated by the controller. The use of external working memory is aimed at improving the context thereby inducing the learning system to search in a particular direction. This directed learning allows the learning system to find a good approximation of the unknown function even after abrupt changes quickly. A key objective explored in this paper is to design control architectures and algorithms that can learn and adapt quickly to changes that are even abrupt. We consider two classes of controllers for concrete development of our ideas (i) a model reference NN adaptive controller for linear systems with matched uncertainty (ii) robot arm controller. We prove that the resulting controllers lead to Uniformly Ulitmately Bounded (UUB) stable closed loop systems. We provide a detailed illustration of the working of this learning mechanism through a simple example. Through extensive simulations and specific metrics we also show that memory augmentation improves learning significantly even when the system undergoes sudden changes. Importantly, we also provide evidence for the proposed mechanism by which this specific memory augmentation improves learning.
https://arxiv.org/abs/1905.02832
Designing models that are robust to small adversarial perturbations of their inputs has proven remarkably difficult. In this work we show that the reverse problem—making models more vulnerable—is surprisingly easy. After presenting some proofs of concept on MNIST, we introduce a generic tilting attack that injects vulnerabilities into the linear layers of pre-trained networks by increasing their sensitivity to components of low variance in the training data without affecting their performance on test data. We illustrate this attack on a multilayer perceptron trained on SVHN and use it to design a stand-alone adversarial module which we call a steganogram decoder. Finally, we show on CIFAR-10 that a poisoning attack with a poisoning rate as low as 0.1% can induce vulnerabilities to chosen imperceptible backdoor signals in state-of-the-art networks. Beyond their practical implications, these different results shed new light on the nature of the adversarial example phenomenon.
http://arxiv.org/abs/1806.07409
In this paper, we propose a novel effective light-weight framework, called LightTrack, for online human pose tracking. The proposed framework is designed to be generic for top-down pose tracking and is faster than existing online and offline methods. Single-person Pose Tracking (SPT) and Visual Object Tracking (VOT) are incorporated into one unified functioning entity, easily implemented by a replaceable single-person pose estimation module. Our framework unifies single-person pose tracking with multi-person identity association and sheds first light upon bridging keypoint tracking with object tracking. We also propose a Siamese Graph Convolution Network (SGCN) for human pose matching as a Re-ID module in our pose tracking system. In contrary to other Re-ID modules, we use a graphical representation of human joints for matching. The skeleton-based representation effectively captures human pose similarity and is computationally inexpensive. It is robust to sudden camera shift that introduces human drifting. To the best of our knowledge, this is the first paper to propose an online human pose tracking framework in a top-down fashion. The proposed framework is general enough to fit other pose estimators and candidate matching mechanisms. Our method outperforms other online methods while maintaining a much higher frame rate, and is very competitive with our offline state-of-the-art. We make the code publicly available at: https://github.com/Guanghan/lighttrack.
http://arxiv.org/abs/1905.02822
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a large-scale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. Questions are designed to be compositional in nature, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. We provide analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task. The dataset is publicly available at this http URL
http://arxiv.org/abs/1809.01696
Recognizing a piece of writing as a poem or prose is usually easy for the majority of people; however, only specialists can determine which meter a poem belongs to. In this paper, we build Recurrent Neural Network (RNN) models that can classify poems according to their meters from plain text. The input text is encoded at the character level and directly fed to the models without feature handcrafting. This is a step forward for machine understanding and synthesis of languages in general, and Arabic language in particular. Among the 16 poem meters of Arabic and the 4 meters of English the networks were able to correctly classify poem with an overall accuracy of 96.38\% and 82.31\% respectively. The poem datasets used to conduct this research were massive, over 1.5 million of verses, and were crawled from different nontechnical sources, almost Arabic and English literature sites, and in different heterogeneous and unstructured formats. These datasets are now made publicly available in clean, structured, and documented format for other future research. To the best of the authors’ knowledge, this research is the first to address classifying poem meters in a machine learning approach, in general, and in RNN featureless based approach, in particular. In addition, the dataset is the first publicly available dataset ready for the purpose of future computational research.
https://arxiv.org/abs/1905.05700
Natural language is one of the most fundamental features that distinguish people from other living things and enable people to communicate each other. Language is a tool that enables people to express their feelings and thoughts and to transfers cultures through generations. Texts and audio are examples of natural language in daily life. In the natural language, many words disappear in time, on the other hand new words are derived. Therefore, while the process of natural language processing (NLP) is complex even for human, it is difficult to process in computer system. The area of linguistics examines how people use language. NLP, which requires the collaboration of linguists and computer scientists, plays an important role in human computer interaction. Studies in NLP have increased with the use of artificial intelligence technologies in the field of linguistics. With the deep learning methods which are one of the artificial intelligence study areas, platforms close to natural language are being developed. Developed platforms for language comprehension, machine translation and part of speech (POS) tagging benefit from deep learning methods. Recurrent Neural Network (RNN), one of the deep learning architectures, is preferred for processing sequential data such as text or audio data. In this study, Turkish POS tagging model has been proposed by using Bidirectional Long-Short Term Memory (BLSTM) which is an RNN type. The proposed POS tagging model is provided to natural language researchers with a platform that allows them to perform and use their own analysis. In the development phase of the platform developed by using BLSTM, the error rate of the POS tagger has been reduced by taking feedback with expert opinion.
https://arxiv.org/abs/1905.05699
Objective: This work addresses two key problems of skin lesion classification. The first problem is the effective use of high-resolution images with pretrained standard architectures for image classification. The second problem is the high class imbalance encountered in real-world multi-class datasets. Methods: To use high-resolution images, we propose a novel patch-based attention architecture that provides global context between small, high-resolution patches. We modify three pretrained architectures and study the performance of patch-based attention. To counter class imbalance problems, we compare oversampling, balanced batch sampling, and class-specific loss weighting. Additionally, we propose a novel diagnosis-guided loss weighting method which takes the method used for ground-truth annotation into account. Results: Our patch-based attention mechanism outperforms previous methods and improves the mean sensitivity by 7%. Class balancing significantly improves the mean sensitivity and we show that our diagnosis-guided loss weighting method improves the mean sensitivity by 3% over normal loss balancing. Conclusion: The novel patch-based attention mechanism can be integrated into pretrained architectures and provides global context between local patches while outperforming other patch-based methods. Hence, pretrained architectures can be readily used with high-resolution images without downsampling. The new diagnosis-guided loss weighting method outperforms other methods and allows for effective training when facing class imbalance. Significance: The proposed methods improve automatic skin lesion classification. They can be extended to other clinical applications where high-resolution image data and class imbalance are relevant.
http://arxiv.org/abs/1905.02793
Localization and tracking are two very active areas of research for robotics, automation, and the Internet-of-Things. Accurate tracking for a large number of devices usually requires deployment of substantial infrastructure (infrared tracking systems, cameras, wireless antennas, etc.), which is not ideal for inaccessible or protected environments. This paper stems from the challenge posed such environments: cover a large number of units spread over a large number of small rooms, with minimal required localization infrastructure. The idea is to accurately track the position of handheld devices or mobile robots, without interfering with its architecture. Using Ultra-Wide Band (UWB) devices, we leveraged our expertise in distributed and collaborative robotic systems to develop an novel solution requiring a minimal number of fixed anchors. We discuss a strategy to share the UWB network together with an Extended Kalman filter derivation to collaboratively locate and track UWB-equipped devices, and show results from our experimental campaign tracking visitors in the Chambord castle in France.
http://arxiv.org/abs/1905.03247
We present a stereo-based dense mapping algorithm for large-scale dynamic urban environments. In contrast to other existing methods, we simultaneously reconstruct the static background, the moving objects, and the potentially moving but currently stationary objects separately, which is desirable for high-level mobile robotic tasks such as path planning in crowded environments. We use both instance-aware semantic segmentation and sparse scene flow to classify objects as either background, moving, or potentially moving, thereby ensuring that the system is able to model objects with the potential to transition from static to dynamic, such as parked cars. Given camera poses estimated from visual odometry, both the background and the (potentially) moving objects are reconstructed separately by fusing the depth maps computed from the stereo input. In addition to visual odometry, sparse scene flow is also used to estimate the 3D motions of the detected moving objects, in order to reconstruct them accurately. A map pruning technique is further developed to improve reconstruction accuracy and reduce memory consumption, leading to increased scalability. We evaluate our system thoroughly on the well-known KITTI dataset. Our system is capable of running on a PC at approximately 2.5Hz, with the primary bottleneck being the instance-aware semantic segmentation, which is a limitation we hope to address in future work. The source code is available from the project website (this http URL).
http://arxiv.org/abs/1905.02781
Estimating statistical uncertainties allows autonomous agents to communicate their confidence during task execution and is important for applications in safety-critical domains such as autonomous driving. In this work, we present the uncertainty-aware imitation learning (UAIL) algorithm for improving end-to-end control systems via data aggregation. UAIL applies Monte Carlo Dropout to estimate uncertainty in the control output of end-to-end systems, using states where it is uncertain to selectively acquire new training data. In contrast to prior data aggregation algorithms that force human experts to visit sub-optimal states at random, UAIL can anticipate its own mistakes and switch control to the expert in order to prevent visiting a series of sub-optimal states. Our experimental results from simulated driving tasks demonstrate that our proposed uncertainty estimation method can be leveraged to reliably predict infractions. Our analysis shows that UAIL outperforms existing data aggregation algorithms on a series of benchmark tasks.
http://arxiv.org/abs/1905.02780