Finding the source of a disturbance or fault in complex systems such as industrial chemical processing plants can be a difficult task and consume a significant number of engineering hours. In many cases, a systematic elimination procedure is considered to be the only feasible approach, but it can cause undesired process upsets. Practitioners desire robust alternative approaches. This paper presents an unsupervised, data-driven method for ranking process elements according to the magnitude and novelty of their influence. Partial bivariate transfer entropy estimation is used to infer a weighted directed graph of process elements. Eigenvector centrality is applied to rank network nodes according to their overall effect. As the ranking of process elements relies on emergent properties that depend on the aggregate of many connections, the results are robust to errors in the estimation of individual edge properties and to the inclusion of indirect connections that do not represent the true causal structure of the process. A monitoring chart of continuously calculated process element importance scores over multiple overlapping time regions can assist with incipient fault detection. Ranking results combined with visual inspection of information transfer networks are also useful for root cause analysis of known faults and disturbances. A software implementation of the proposed method is available.
http://arxiv.org/abs/1904.04035
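A minimal sketch of the two-stage pipeline described above: it uses a simple plug-in histogram transfer entropy estimator (the paper uses partial bivariate transfer entropy), a made-up 3-element process in which element 0 drives 1, which drives 2, and networkx for the centrality ranking; the coupling strengths, bin counts and edge-weight floor are illustrative assumptions.

```python
import numpy as np
import networkx as nx

def transfer_entropy(x, y, bins=4, lag=1):
    """TE(x -> y) in nats: I(y_t ; x_{t-lag} | y_{t-lag}), histogram estimate."""
    xd = np.digitize(x, np.histogram_bin_edges(x, bins))
    yd = np.digitize(y, np.histogram_bin_edges(y, bins))
    yt, yp, xp = yd[lag:], yd[:-lag], xd[:-lag]

    def H(*cols):  # joint Shannon entropy of the given discrete columns
        _, counts = np.unique(np.stack(cols, axis=1), axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    return H(yt, yp) + H(yp, xp) - H(yt, yp, xp) - H(yp)

rng = np.random.default_rng(0)
n = 20000
s = rng.normal(size=(n, 3))
for t in range(1, n):
    s[t, 1] += 0.8 * s[t - 1, 0]   # element 0 drives element 1
    s[t, 2] += 0.8 * s[t - 1, 1]   # element 1 drives element 2

G = nx.DiGraph()
for i in range(3):
    for j in range(3):
        if i != j:
            G.add_edge(i, j, weight=max(transfer_entropy(s[:, i], s[:, j]), 1e-6))

# reverse edges so scores reflect outgoing influence (networkx weighs in-edges)
scores = nx.eigenvector_centrality_numpy(G.reverse(), weight="weight")
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```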
Since word embeddings have become the most popular input for many NLP tasks, evaluating their quality is of critical importance. Most research efforts focus on English word embeddings. This paper addresses the problem of constructing and evaluating such models for the Greek language. We created a new word analogy corpus based on the original English Word2vec word analogy corpus, while also considering specific linguistic aspects of the Greek language. Moreover, we created a Greek version of the WordSim353 corpus for a basic evaluation of word similarities. We tested seven word vector models, and our evaluation showed that we are able to create meaningful representations. Finally, we found that the morphological complexity of the Greek language and polysemy can influence the quality of the resulting word embeddings.
http://arxiv.org/abs/1904.04032
The gold standard for discovering causal relations is by means of experimentation. Over the last decades, alternative methods have been proposed that can infer causal relations between variables from certain statistical patterns in purely observational data. We introduce Joint Causal Inference (JCI), a novel approach to causal discovery from multiple data sets that elegantly unifies both approaches. JCI is a causal modeling approach rather than a specific algorithm, and it can be used in combination with any causal discovery algorithm that can take into account certain background knowledge. The main idea is to reduce causal discovery from multiple datasets originating from different contexts (e.g., different experimental conditions) to causal discovery from a single pooled dataset by adding auxiliary context variables and incorporating applicable background knowledge on the causal relationships involving the context variables. We propose different flavours of JCI that differ in the amount of background knowledge that is assumed. JCI can deal with several different types of interventions in a unified fashion, does not require knowledge of intervention targets or types in the case of interventional data, and allows one to fully exploit all the information in the joint distribution on system and context variables. We explain how some well-known causal discovery algorithms can be seen as implementations of the JCI framework, but we also propose novel implementations that are simple adaptations of existing causal discovery methods for purely observational data to the JCI setting. We evaluate different implementations of the JCI approach on synthetic data and on flow cytometry protein expression data and conclude that JCI implementations can outperform state-of-the-art causal discovery algorithms.
http://arxiv.org/abs/1611.10351
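A tiny sketch of just the pooling step that the approach rests on, with made-up Gaussian data: one observational regime and one do(X := 1) regime, merged with an auxiliary binary context variable C (the background knowledge passed to the downstream discovery algorithm would state that C is exogenous).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

obs = pd.DataFrame({"X": rng.normal(size=100)})       # observational regime
obs["Y"] = 2 * obs["X"] + rng.normal(size=100)

do_x = pd.DataFrame({"X": np.ones(100)})              # interventional regime: X := 1
do_x["Y"] = 2 * do_x["X"] + rng.normal(size=100)

# single pooled dataset with the auxiliary context variable C
pooled = pd.concat([obs.assign(C=0), do_x.assign(C=1)], ignore_index=True)
print(pooled.groupby("C").mean())   # Y's mean shifts with C; X is forced under C=1
```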
Artificial Intelligence allows the improvement of our daily life; for instance, speech and handwritten text recognition, real-time translation and weather forecasting are commonly used applications. In the livestock sector, machine learning algorithms have the potential for early detection and warning of problems, which represents a significant milestone in the poultry industry. Production problems generate economic loss that could be avoided by acting in a timely manner. In the current study, the training and testing of support vector machines is addressed for early detection of problems in the production curve of commercial eggs, using farm egg-production data from 478,919 laying hens grouped into 24 flocks. Experiments using support vector machines with 5-fold cross-validation were performed at different prior time intervals, to alert, with a forecasting interval of up to 5 days, whether a flock will experience a problem in its production curve. Performance metrics such as accuracy, specificity, sensitivity, and positive predictive value were evaluated, reaching 0-day values of 0.9874, 0.9876, 0.9783 and 0.6518, respectively, on unseen data (test set). The optimal forecasting interval was from zero to three days; performance metrics decrease as the forecasting interval increases. It should be emphasized that this technique was able to issue an alert one day in advance, achieving an accuracy of 0.9854, a specificity of 0.9865, a sensitivity of 0.9333 and a positive predictive value of 0.6135. This novel application, embedded in a poultry-management computer system, is able to provide significant improvements in early detection and warning of problems related to the production curve.
http://arxiv.org/abs/1904.03987
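A hedged sketch of the evaluation protocol (SVM, 5-fold cross-validation, and the four reported metrics), using synthetic features in place of the study's farm data; the feature count and label rule are placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 6))            # e.g., recent lay-rate deviations (toy)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 1.0).astype(int)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pred = cross_val_predict(clf, X, y, cv=cv)

tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
print("accuracy    :", (tp + tn) / len(y))
print("specificity :", tn / (tn + fp))
print("sensitivity :", tp / (tp + fn))
print("PPV         :", tp / (tp + fp))
```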
When one wants to train a neural network to perform semantic segmentation, creating pixel-level annotations for each of the images in the database is a tedious task. When working with aerial or satellite images, which are usually very large, it is even worse. With that in mind, we investigate how to use image-level annotations in order to perform semantic segmentation. Image-level annotations are much less expensive to acquire than pixel-level annotations, but we lose a lot of information for the training of the model. From the annotations of the images, the model must find by itself how to classify the different regions of the image. In this work, we use the method proposed by Ahn and Kwak [1] to produce pixel-level annotations from image-level annotations. We compare the overall quality of our generated dataset with the original dataset. In addition, we propose an adaptation of the AffinityNet that allows us to directly perform semantic segmentation. Our results show that the generated labels lead to the same performance for the training of several segmentation networks. Also, the quality of the semantic segmentation performed directly by the AffinityNet and the random walk is close to that of the best fully supervised approaches.
http://arxiv.org/abs/1904.03983
In hyperspectral remote sensing data mining, it is important to take into account both spectral and spatial information, such as the spectral signature, texture features and morphological properties, to improve performance, e.g., image classification accuracy. From a feature representation point of view, a natural approach to handle this situation is to concatenate the spectral and spatial features into a single, high-dimensional vector and then apply a certain dimension reduction technique directly to that concatenated vector before feeding it into the subsequent classifier. However, multiple features from various domains have different physical meanings and statistical properties, and such concatenation does not efficiently explore the complementary properties among different features, which would help boost feature discriminability. Furthermore, it is also difficult to interpret the transformed results of the concatenated vector. Consequently, finding a physically meaningful consensus low-dimensional feature representation of the original multiple features is still a challenging task. To address these issues, we propose a novel feature learning framework, i.e., a simultaneous spectral-spatial feature selection and extraction algorithm, for spectral-spatial feature representation and classification of hyperspectral images. Specifically, the proposed method learns a latent low-dimensional subspace by projecting the spectral-spatial features into a common feature space, where the complementary information is effectively exploited and, simultaneously, only the most significant original features are transformed. Encouraging experimental results on three publicly available hyperspectral remote sensing datasets confirm that our proposed method is both effective and efficient.
http://arxiv.org/abs/1904.03982
Air pollution is the leading environmental health hazard globally, with sources that include factory emissions, car exhaust and cooking stoves. As a precautionary measure, air pollution forecasts serve as the basis for taking effective pollution control measures, and accurate air pollution forecasting has become an important task. In this paper, we forecast fine-grained ambient air quality information for 5 prominent locations in Delhi, based on the historical and real-time ambient air quality and meteorological data reported by the Central Pollution Control Board. We present the VayuAnukulani system, a novel end-to-end solution for predicting air quality for the next 24 hours by estimating the concentration and level of different air pollutants, including nitrogen dioxide ($NO_2$) and particulate matter ($PM_{2.5}$ and $PM_{10}$), for Delhi. Extensive experiments on data sources obtained in Delhi demonstrate that the proposed adaptive attention-based bidirectional LSTM network outperforms several baselines for classification and regression models. The accuracy of the proposed adaptive system is $\sim 15 - 20\%$ better than that of the same model trained offline. We compare the proposed methodology against several competing baselines, and show that the network outperforms conventional methods by $\sim 3 - 5 \%$.
http://arxiv.org/abs/1904.03977
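The described predictor is an attention-based bidirectional LSTM; below is a minimal PyTorch sketch of that architecture. The hidden size, feature count and the 24-hour output head are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    def __init__(self, n_features, hidden=64, horizon=24):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Linear(2 * hidden, horizon)   # one value per future hour

    def forward(self, x):                       # x: (batch, time, features)
        h, _ = self.lstm(x)                     # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        ctx = (w * h).sum(dim=1)                # weighted context vector
        return self.head(ctx)                   # (batch, horizon)

model = BiLSTMAttention(n_features=8)
x = torch.randn(4, 48, 8)                       # 48 h of 8 met/pollutant channels
print(model(x).shape)                           # torch.Size([4, 24])
```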
Recent advances in neural network-based text-to-speech have reached human-level naturalness in synthetic speech. The present sequence-to-sequence models can directly map text to mel-spectrogram acoustic features, which are convenient for modeling, but present additional challenges for vocoding (i.e., waveform generation from the acoustic features). High-quality synthesis can be achieved with neural vocoders, such as WaveNet, but such autoregressive models suffer from slow sequential inference. Meanwhile, their existing parallel inference counterparts are difficult to train and require increasingly large model sizes. In this paper, we propose an alternative training strategy for a parallel neural vocoder utilizing generative adversarial networks, and integrate a linear predictive synthesis filter into the model. Results show that the proposed model achieves significant improvement in inference speed, while outperforming a WaveNet in copy-synthesis quality.
http://arxiv.org/abs/1904.03976
Morphological reconstruction (MR) is often employed by seeded image segmentation algorithms, such as watershed transform and power watershed, as it is able to filter seeds (regional minima) to reduce over-segmentation. However, MR might mistakenly filter meaningful seeds that are required for generating accurate segmentation, and it is also sensitive to scale because a single-scale structuring element is employed. In this paper, a novel adaptive morphological reconstruction (AMR) operation is proposed that has three advantages. Firstly, AMR can adaptively filter useless seeds while preserving meaningful ones. Secondly, AMR is insensitive to the scale of structuring elements because multiscale structuring elements are employed. Finally, AMR has two attractive properties, monotonic increasingness and convergence, that help seeded segmentation algorithms achieve a hierarchical segmentation. Experiments clearly demonstrate that AMR is useful for improving algorithms for seeded image segmentation and seed-based spectral segmentation. Compared to several state-of-the-art algorithms, the proposed algorithms provide better segmentation results while requiring less computing time. Source code is available at https://github.com/SUST-reynole/AMR.
http://arxiv.org/abs/1904.03973
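A rough approximation of the multiscale idea in scikit-image: opening-by-reconstruction at several structuring-element radii, combined pointwise. The authors' exact AMR rule is in their repository; this only conveys the flavor, and the radius range is arbitrary.

```python
import numpy as np
from skimage import data, img_as_float
from skimage.morphology import disk, erosion, reconstruction

img = img_as_float(data.coins())
amr = np.zeros_like(img)
for r in range(2, 8):                                      # structuring-element radii
    marker = erosion(img, disk(r))                         # erosion as marker
    rec = reconstruction(marker, img, method="dilation")   # opening by reconstruction
    amr = np.maximum(amr, rec)                             # strongest response per pixel
print(amr.shape, amr.min(), amr.max())
```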
Text generation is an important Natural Language Processing task with various applications. Although several metrics have been introduced to evaluate text generation methods, each of them has its own shortcomings. The most widely used metrics, such as BLEU, only consider the quality of generated sentences and neglect their diversity. For example, repeatedly generating a single high-quality sentence would result in a high BLEU score. On the other hand, Self-BLEU, a more recent metric introduced to evaluate the diversity of generated texts, ignores the quality of generated texts. In this paper, we propose metrics that evaluate both quality and diversity simultaneously by approximating the distance between the learned generative model and the real data distribution. For this purpose, we first introduce a metric that approximates this distance using n-gram based measures. Then, a feature-based measure based on BERT, a recent deep model trained on a large text corpus, is introduced. Finally, for the oracle training mode, in which the generator’s density can also be calculated, we propose to use distance measures between the corresponding explicit distributions. Eventually, the most popular and recent text generation models are evaluated using both the existing and the proposed metrics, and the advantages of the proposed metrics are demonstrated.
http://arxiv.org/abs/1904.03971
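The BLEU/Self-BLEU blind spots the abstract mentions are easy to reproduce with NLTK: a degenerate generator that repeats one fluent sentence gets a high BLEU (quality) but also a high Self-BLEU, revealing zero diversity. The sentences here are toy data.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

refs = [["the", "cat", "sat", "on", "the", "mat"],
        ["a", "dog", "ran", "in", "the", "park"]]
gens = [["the", "cat", "sat", "on", "the", "mat"]] * 4   # one sentence, repeated

sm = SmoothingFunction().method1
bleu = sum(sentence_bleu(refs, g, smoothing_function=sm) for g in gens) / len(gens)
# Self-BLEU: score each sample against the other samples as references
self_bleu = sum(
    sentence_bleu(gens[:i] + gens[i + 1:], g, smoothing_function=sm)
    for i, g in enumerate(gens)
) / len(gens)
print(f"BLEU={bleu:.3f}  Self-BLEU={self_bleu:.3f}")  # both ~1.0
```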
In online discussion fora, speakers often make arguments for or against something, say birth control, by highlighting certain aspects of the topic. In social science, this is referred to as issue framing. In this paper, we introduce a new issue frame annotated corpus of online discussions. We explore to what extent models trained to detect issue frames in newswire and social media can be transferred to the domain of discussion fora, using a combination of multi-task and adversarial training, assuming only unlabeled training data in the target domain.
http://arxiv.org/abs/1904.03969
The stability of word embedding algorithms, i.e., the consistency of the word representations they reveal when trained repeatedly on the same data set, has recently raised concerns. We here compare word embedding algorithms on three corpora of different sizes, and evaluate both their stability and accuracy. We find strong evidence that down-sampling strategies (used as part of their training procedures) are particularly influential for the stability of SVD-PPMI-type embeddings. This finding seems to explain diverging reports on their stability and leads us to a simple modification which provides superior stability as well as accuracy on par with skip-gram embeddings.
http://arxiv.org/abs/1808.06810
To understand historical texts, we must be aware that language – including the emotional connotation attached to words – changes over time. In this paper, we aim at estimating the emotion which is associated with a given word in former language stages of English and German. Emotion is represented following the popular Valence-Arousal-Dominance (VAD) annotation scheme. While being more expressive than polarity alone, existing word emotion induction methods are typically not suited for addressing it. To overcome this limitation, we present adaptations of two popular algorithms to VAD. To measure their effectiveness in diachronic settings, we present the first gold standard for historical word emotions, which was created by scholars with proficiency in the respective language stages and covers both English and German. In contrast to claims in previous work, our findings indicate that hand-selecting small sets of seed words with supposedly stable emotional meaning is actually harmful rather than helpful.
http://arxiv.org/abs/1806.08115
In its most basic form, decision-making can be viewed as a computational process that progressively eliminates alternatives, thereby reducing uncertainty. Such processes are generally costly, meaning that the amount of uncertainty that can be reduced is limited by the amount of available computational resources. Here, we introduce the notion of elementary computation based on a fundamental principle for probability transfers that reduce uncertainty. Elementary computations can be considered as the inverse of Pigou-Dalton transfers applied to probability distributions, closely related to the concepts of majorization, T-transforms, and generalized entropies that induce a preorder on the space of probability distributions. As a consequence we can define resource cost functions that are order-preserving and therefore monotonic with respect to the uncertainty reduction. This leads to a comprehensive notion of decision-making processes with limited resources. Along the way, we prove several new results on majorization theory, as well as on entropy and divergence measures.
http://arxiv.org/abs/1904.03964
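A small numeric illustration of an elementary computation as the inverse of a Pigou-Dalton transfer: moving probability mass from one entry toward another (weakly larger) one makes the distribution more unequal in the majorization order and strictly reduces Shannon entropy. The distribution and transfer size are arbitrary examples.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def elementary_computation(p, i, j, eps):
    """Inverse Pigou-Dalton step: move eps of probability from p[i] to p[j]."""
    q = p.copy()
    q[i] -= eps
    q[j] += eps
    return q

p = np.array([0.25, 0.25, 0.25, 0.25])
q = elementary_computation(p, i=0, j=1, eps=0.1)  # -> [0.15, 0.35, 0.25, 0.25]
print(entropy(p), entropy(q))                     # q majorizes p; entropy drops
```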
Existing methods usually utilize pre-defined criteria, such as the p-norm, to prune unimportant filters. There are two major limitations in these methods. First, the relations among the filters are largely ignored. The filters usually work jointly to make an accurate prediction in a collaborative way. Similar filters have equivalent effects on the network prediction, and such redundant filters can be further pruned. Second, the pruning criterion remains unchanged during training. As the network is updated at each iteration, the filter distribution also changes continuously. The pruning criterion should therefore be adaptively switched. In this paper, we propose Meta Filter Pruning (MFP) to solve the above problems. First, as a complement to the existing p-norm criterion, we introduce a new pruning criterion that considers the filter relation via filter distance. Additionally, we build a meta pruning framework for filter pruning, so that our method can adaptively select the most appropriate pruning criterion as the filter distribution changes. Experiments validate our approach on two image classification benchmarks. Notably, on ILSVRC-2012, our MFP reduces more than 50% of the FLOPs of ResNet-50 with only a 0.44% top-5 accuracy loss.
http://arxiv.org/abs/1904.03961
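A hedged sketch of the distance-based criterion as a complement to p-norms: flatten each filter, compute pairwise distances, and mark the filters with the smallest total distance to the others (the most mutually similar, hence redundant, ones) as prunable. The layer sizes and the 30% ratio are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=3)
W = conv.weight.detach().flatten(start_dim=1)        # (32 filters, 16*3*3)
total_dist = torch.cdist(W, W, p=2).sum(dim=1)       # distance to all other filters
prune_idx = total_dist.argsort()[: int(0.3 * W.shape[0])]  # most redundant 30%
print("filters to prune:", prune_idx.tolist())
```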
Convolutional neural networks (CNNs) have enabled state-of-the-art performance in many computer vision tasks. However, little effort has been devoted to establishing convolution in non-linear space. Existing works mainly leverage the activation layers, which can only provide point-wise non-linearity. To solve this problem, a new operation, kervolution (kernel convolution), is introduced to approximate complex behaviors of human perception systems by leveraging the kernel trick. It generalizes convolution, enhances the model capacity, and captures higher-order interactions of features via patch-wise kernel functions, without introducing additional parameters. Extensive experiments show that kervolutional neural networks (KNNs) achieve higher accuracy and faster convergence than baseline CNNs.
http://arxiv.org/abs/1904.03955
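A minimal sketch of patch-wise kernel convolution with a polynomial kernel via im2col/unfold: the patch-filter inner product of ordinary convolution is passed through a kernel function. Layer sizes and the kernel hyperparameters (degree, offset) are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Kerv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, degree=2, cp=1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch * k * k) * 0.05)
        self.k, self.degree, self.cp = k, degree, cp

    def forward(self, x):
        b, _, h, w = x.shape
        patches = F.unfold(x, self.k, padding=self.k // 2)  # (b, in*k*k, h*w)
        lin = self.weight @ patches                         # plain convolution term
        out = (lin + self.cp) ** self.degree                # polynomial kernel
        return out.view(b, -1, h, w)

y = Kerv2d(3, 8)(torch.randn(2, 3, 32, 32))
print(y.shape)   # torch.Size([2, 8, 32, 32])
```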
Image quality plays a big role in CNN-based image classification performance. Fine-tuning the network with distorted samples may be too costly for large networks. To solve this issue, we propose a transfer learning approach optimized to take into account that, in each layer of a CNN, some filters are more susceptible to image distortion than others. Our method identifies the most susceptible filters and applies retraining only to the filters that show the largest activation-map distance between clean and distorted images. Filters are ranked using the Borda count election method, and then only the most affected filters are fine-tuned. This significantly reduces the number of parameters to retrain. We evaluate this approach on the CIFAR-10 and CIFAR-100 datasets, testing it on two different models and two different types of distortion. Results show that the proposed transfer learning technique recovers most of the performance lost due to input data distortion, at a considerably faster pace than existing methods, thanks to the reduced number of parameters to fine-tune. When few noisy samples are provided for training, our filter-level fine-tuning performs particularly well, also outperforming state-of-the-art layer-level transfer learning approaches.
http://arxiv.org/abs/1904.03949
Photometric stereo (PS) techniques nowadays remain constrained to an ideal laboratory setup where modeling and calibration of lighting is amenable. This work aims to eliminate such restrictions. To this end, we introduce an efficient principled variational approach to uncalibrated PS under general illumination, which is approximated through a second-order spherical harmonic expansion. The joint recovery of shape, reflectance and illumination is formulated as a variational problem where shape estimation is carried out directly in terms of the underlying perspective depth map, thus implicitly ensuring integrability and bypassing the need for a subsequent normal integration. We provide a tailored numerical scheme to solve the resulting nonconvex problem efficiently and robustly. On a variety of evaluations, our method consistently reduces the mean angular error by a factor of 2-3 compared to the state-of-the-art.
http://arxiv.org/abs/1904.03942
3D scan registration is a classical yet highly useful problem in the context of 3D sensors such as Kinect and Velodyne. While several methods exist, the techniques are usually incremental: adjacent scans are registered first to obtain the initial poses, followed by motion averaging and bundle-adjustment refinement. In this paper, we take a different approach and develop minimal solvers for jointly computing the initial poses of cameras in small loops such as 3-, 4-, and 5-cycles. Note that the classical registration of 2 scans can be done using a minimum of 3 point matches to compute 6 degrees of relative motion. On the other hand, to jointly compute the 3D registrations in n-cycles, we take 2 point matches between the first n-1 consecutive pairs (i.e., Scan 1 & Scan 2, … , and Scan n-1 & Scan n) and 1 or 2 point matches between Scan 1 and Scan n. Overall, we use 5, 7, and 10 point matches for 3-, 4-, and 5-cycles, and recover 12, 18, and 24 degrees of transformation variables, respectively. Using simulations and real data, we show that 3D registration using mini n-cycles is computationally efficient and can provide alternate and better initial poses than standard pairwise methods.
http://arxiv.org/abs/1904.03941
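The quoted match counts follow from a simple tally; as a worked check, with $m \in \{1, 2\}$ the extra matches between Scan 1 and Scan n:

```latex
\begin{aligned}
\text{matches}(n) &= 2(n-1) + m, & \text{dof}(n) &= 6(n-1),\\
n=3:&\; 2(2)+1 = 5,  & 6(2) &= 12,\\
n=4:&\; 2(3)+1 = 7,  & 6(3) &= 18,\\
n=5:&\; 2(4)+2 = 10, & 6(4) &= 24.
\end{aligned}
```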
Noisy labels often occur in vision datasets, especially when they are issued from crowdsourcing or Web scraping. In this paper, we propose a new regularization method which enables one to learn robust classifiers in the presence of noisy data. To achieve this goal, we augment the virtual adversarial loss with a Wasserstein distance. This distance allows us to take into account specific relations between classes by leveraging the geometric properties of this optimal transport distance. Notably, we encode the class similarities in the ground cost that is used to compute the Wasserstein distance. As a consequence, we can promote smoothness between classes that are very dissimilar, while keeping the classification decision function sufficiently complex for similar classes. While designing this ground cost can be left as a problem-specific modeling task, we show in this paper that using the semantic relations between class names already leads to good results. Our proposed Wasserstein Adversarial Training (WAT) outperforms the state of the art on four datasets corrupted with noisy labels: three classical benchmarks and one real case in remote sensing image semantic segmentation.
http://arxiv.org/abs/1904.03936
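A small sketch of the ground-cost idea with the POT library (`pip install pot`): build the cost matrix from class-name embeddings, then compute the Wasserstein distance between two class-probability vectors. The toy 2-d embeddings are assumptions; confusing semantically close classes is cheap, distant ones expensive.

```python
import numpy as np
import ot  # Python Optimal Transport

# toy class-name embeddings (the paper uses semantic vectors of the label names)
emb = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.0, 1.0]}
E = np.array(list(emb.values()))
M = np.linalg.norm(E[:, None] - E[None, :], axis=-1)   # ground cost: pairwise distances

p = np.array([0.7, 0.2, 0.1])     # model prediction over (cat, dog, car)
q = np.array([0.1, 0.8, 0.1])     # (noisy) label distribution
print("W(p, q) =", ot.emd2(p, q, M))  # small: mass moves cat <-> dog, not to car
```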
There has recently been much interest in extending vector-based word representations to multiple languages, such that words can be compared across languages. In this paper, we shift the focus from words to documents and introduce a method for embedding documents written in any language into a single, language-independent vector space. For training, our approach leverages a multilingual corpus where the same concept is covered in multiple languages (but not necessarily via exact translations), such as Wikipedia. Our method, Cr5 (Crosslingual reduced-rank ridge regression), starts by training a ridge-regression-based classifier that uses language-specific bag-of-word features in order to predict the concept that a given document is about. We show that, when constraining the learned weight matrix to be of low rank, it can be factored to obtain the desired mappings from language-specific bags-of-words to language-independent embeddings. As opposed to most prior methods, which use pretrained monolingual word vectors, postprocess them to make them crosslingual, and finally average word vectors to obtain document vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as document-level. Moreover, since our algorithm uses the singular value decomposition as its core operation, it is highly scalable. Experiments show that our method achieves state-of-the-art performance on a crosslingual document retrieval task. Finally, although not trained for embedding sentences and words, it also achieves competitive performance on crosslingual sentence and word retrieval tasks.
http://arxiv.org/abs/1904.03922
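A hedged sketch of the low-rank factorization at the heart of the approach, for a single language with random toy data (the real system fits bag-of-words blocks for all languages jointly): fit a ridge regression from word counts to concept labels, then truncate the weight matrix with an SVD so it factors into a word-to-embedding map.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(200, 500)).astype(float)  # docs x vocabulary counts (toy)
Y = rng.normal(size=(200, 50))                       # stand-in for concept indicators

lam, rank = 1.0, 10
W = np.linalg.solve(X.T @ X + lam * np.eye(500), X.T @ Y)   # ridge solution (500, 50)
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * s[:rank]        # language-specific word -> embedding map
doc_embedding = X @ A             # (200, 10) language-independent document vectors
print(doc_embedding.shape)
```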
In computer vision, image datasets used for classification are naturally associated with multiple labels and composed of multiple views, because each image may contain several objects (e.g., pedestrian, bicycle and tree) and is properly characterized by multiple visual features (e.g., color, texture and shape). Currently available tools ignore either the label relationship or the complementarity of views. Motivated by the success of the vector-valued function that constructs matrix-valued kernels to explore the multi-label structure in the output space, we introduce multi-view vector-valued manifold regularization (MV$\mathbf{^3}$MR) to integrate multiple features. MV$\mathbf{^3}$MR exploits the complementary properties of different features and discovers the intrinsic local geometry of the compact support shared by different features under the theme of manifold regularization. We conducted extensive experiments on two challenging but popular datasets, PASCAL VOC’07 (VOC) and MIR Flickr (MIR), and validated the effectiveness of the proposed MV$\mathbf{^3}$MR for image classification.
http://arxiv.org/abs/1904.03921
Deep metric learning algorithms have been utilized to learn discriminative and generalizable models which are effective for classifying unseen classes. In this paper, a novel noise-tolerant deep metric learning algorithm is proposed. The proposed method, termed Density Aware Metric Learning, enforces the model to learn embeddings that are pulled towards the most dense region of the clusters for each class. This is achieved by iteratively shifting the estimate of the center towards the dense region of the cluster, thereby leading to faster convergence and higher generalizability. In addition, the approach is robust to noisy samples in the training data, which are often present as outliers. Detailed experiments and analysis on two challenging cross-modal face recognition databases and two popular object recognition databases exhibit the efficacy of the proposed approach. It has superior convergence, requires less training time, and yields better accuracies than several popular deep metric learning methods.
http://arxiv.org/abs/1904.03911
Motivation: Network visualizations of complex biological datasets usually result in ‘hairball’ images, which do not discriminate network modules. Results: We present the EntOptLayout Cytoscape plug-in based on a recently developed network representation theory. The plug-in provides an efficient visualization of network modules, which represent major protein complexes in protein-protein interaction and signalling networks. Importantly, the tool gives a quality score of the network visualization by calculating the information loss between the input data and the visual representation showing a 3- to 25-fold improvement over conventional methods. Availability: The plug-in (running on Windows, Linux, or Mac OS) and its tutorial (both in written and video forms) can be downloaded freely under the terms of the MIT license from: this http URL Contact: csermely.peter@med.semmelweis-univ.hu
http://arxiv.org/abs/1904.03910
Human pose estimation plays an important role in many computer vision tasks and has been studied for decades. However, due to complex appearance variations from poses, illuminations, occlusions and low resolutions, it still remains a challenging problem. Taking advantage of high-level semantic information from deep convolutional neural networks is an effective way to improve the accuracy of human pose estimation. In this paper, we propose a novel Cascade Feature Aggregation (CFA) method, which cascades several hourglass networks for robust human pose estimation. Features from different stages are aggregated to obtain abundant contextual information, leading to robustness to poses, partial occlusions and low resolution. Moreover, results from different stages are fused to further improve the localization accuracy. Extensive experiments on the MPII and LIP datasets demonstrate that our proposed CFA outperforms the state of the art and achieves the best performance on the MPII benchmark.
http://arxiv.org/abs/1902.07837
There is growing interest in multi-label image classification due to its critical role in web-based image analytics applications, such as large-scale image retrieval and browsing. Matrix completion (MC) has recently been introduced as a method for transductive (semi-supervised) multi-label classification and has several distinct advantages, including robustness to missing data and background noise in both the feature and label spaces. However, it is limited by considering only data represented by a single-view feature, which cannot precisely characterize images containing several semantic concepts. To utilize multiple features taken from different views, one has to concatenate the different features into a long vector. But this concatenation is prone to over-fitting and often leads to very high time complexity in MC-based image classification. Therefore, we propose to combine the MC outputs of different views in a weighted fashion, and present the multi-view matrix completion (MVMC) framework for transductive multi-label image classification. To learn the view combination weights effectively, we apply a cross-validation strategy on the labeled set. In the learning process, we adopt the average precision (AP) loss, which is particularly suitable for multi-label image classification. A least-squares loss formulation is also presented for the sake of efficiency, and the robustness of the AP-loss-based algorithm compared with the other losses is investigated. Experimental evaluations on two real-world datasets (PASCAL VOC’07 and MIR Flickr) demonstrate the effectiveness of MVMC for transductive (semi-supervised) multi-label image classification, and show that MVMC can exploit complementary properties of different features and output consistent labels for improved multi-label image classification.
http://arxiv.org/abs/1904.03901
This paper addresses the problem of key phrase extraction from sentences. Existing state-of-the-art supervised methods require large amounts of annotated data to achieve good performance and generalization. Collecting labeled data is, however, often expensive. In this paper, we redefine the problem as question-answer extraction and present SAMIE: Self-Asking Model for Information Extraction, a semi-supervised model which dually learns to ask and to answer questions by itself. Briefly, given a sentence $s$ and an answer $a$, the model needs to choose the most appropriate question $\hat q$; meanwhile, for the given sentence $s$ and the same question $\hat q$ selected in the previous step, the model will predict an answer $\hat a$. The model can support few-shot learning with very limited supervision. It can also be used to perform clustering analysis when no supervision is provided. Experimental results show that the proposed method outperforms typical supervised methods, especially when given little labeled data.
http://arxiv.org/abs/1904.03898
With conventional anti-jamming solutions like frequency hopping or spread spectrum, legitimate transceivers often tend to “escape” or “hide” themselves from jammers. These reactive anti-jamming approaches are constrained by the lack of timely knowledge of jamming attacks. Bringing together the latest advances in neural network architectures and ambient backscattering communications, this work allows wireless nodes to effectively “face” the jammer by first learning its jamming strategy, then adapting the rate or transmitting information right on the jamming signal. Specifically, to deal with unknown jamming attacks, existing work often relies on reinforcement learning algorithms, e.g., Q-learning. However, the Q-learning algorithm is notorious for its slow convergence to the optimal policy, especially when the system state and action spaces are large. This makes the Q-learning algorithm pragmatically inapplicable. To overcome this problem, we design a novel deep reinforcement learning algorithm using the recent dueling neural network architecture. Our proposed algorithm allows the transmitter to effectively learn about the jammer and attain the optimal countermeasures thousands of times faster than the conventional Q-learning algorithm. Through extensive simulation results, we show that our design (using ambient backscattering and the deep dueling neural network architecture) can improve the average throughput by up to 426% and reduce the packet loss by 24%. By augmenting devices with the ambient backscattering capability and using our algorithm, it is interesting to observe that the (successful) transmission rate increases with the jamming power. Our proposed solution can find applications in both civil (e.g., ultra-reliable and low-latency communications, or URLLC) and military scenarios (to combat both inadvertent and deliberate jamming).
http://arxiv.org/abs/1904.03897
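A minimal sketch of the dueling architecture the algorithm builds on: a shared trunk followed by separate value and advantage streams, recombined as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a). The layer sizes are placeholders, and the training loop around it is omitted.

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    def __init__(self, n_states, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)       # state-value stream V(s)
        self.adv = nn.Linear(hidden, n_actions)  # advantage stream A(s, a)

    def forward(self, s):
        h = self.trunk(s)
        a = self.adv(h)
        return self.value(h) + a - a.mean(dim=-1, keepdim=True)

q = DuelingDQN(n_states=10, n_actions=4)(torch.randn(2, 10))
print(q.shape)  # torch.Size([2, 4])
```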
There has been an increasing interest in 3D indoor navigation, where a robot in an environment moves to a target according to an instruction. To deploy a robot for navigation in the physical world, a lot of training data is required to learn an effective policy. It is quite labour-intensive to obtain sufficient real-environment data for training robots, whereas synthetic data is much easier to construct by rendering. Though it is promising to utilize synthetic environments to facilitate navigation training in the real world, real environments are heterogeneous from synthetic environments in two aspects. First, the visual representations of the two environments have significant variances. Second, the house plans of the two environments are quite different. Therefore, two types of information, i.e., visual representation and policy behavior, need to be adapted in the reinforcement model. The learning procedures of visual representation and of policy behavior are presumably reciprocal. We propose to jointly adapt visual representation and policy behavior to leverage the mutual impacts of environment and policy. Specifically, our method employs an adversarial feature adaptation model for visual representation transfer and a policy mimic strategy for policy behavior imitation. Experiments show that our method outperforms the baseline by 19.47% without any additional human annotations.
http://arxiv.org/abs/1904.03895
In general, deep learning based models require a tremendous number of samples for appropriate training, which is difficult to satisfy in the medical field. This issue can usually be avoided with a proper initialization of the weights. On the task of medical image segmentation in general, two techniques are usually employed to tackle the training of a deep network $f_T$. The first consists in reusing some weights of a network $f_S$ pre-trained on a large-scale database (e.g., ImageNet). This procedure, also known as \textit{transfer learning}, happens to reduce flexibility when it comes to new network designs, since $f_T$ is constrained to match some parts of $f_S$. The second commonly used technique consists in working on image patches to benefit from the large number of available patches. This paper brings together these two techniques and proposes to train arbitrarily designed networks, with a focus on relatively small databases, in two stages: patch pre-training and full-sized image fine-tuning. Experimental work has been carried out on the tasks of retinal blood vessel segmentation and optic disc segmentation, using four publicly available databases. Furthermore, three types of network are considered, ranging from a very lightweight network to a densely connected one. The final results show the efficiency of the proposed framework, along with state-of-the-art results on all the databases.
http://arxiv.org/abs/1904.03892
This paper presents a new task, the grounding of spatio-temporal identifying descriptions in videos. Previous work suggests potential bias in existing datasets and emphasizes the need for a new data creation schema to better model linguistic structure. We introduce a new data collection scheme based on grammatical constraints for surface realization to enable us to investigate the problem of grounding spatio-temporal identifying descriptions in videos. We then propose a two-stream modular attention network that learns and grounds spatio-temporal identifying descriptions based on appearance and motion. We show that motion modules help to ground motion-related words and also help appearance modules to learn, because modular neural networks resolve task interference between modules. Finally, we propose a future challenge and point out the need for a robust system arising from replacing ground-truth visual annotations with an automatic video object detector and temporal event localization.
http://arxiv.org/abs/1904.03885
Large sense-annotated datasets are increasingly necessary for training deep supervised systems in word sense disambiguation. However, gathering high-quality sense-annotated data for as many instances as possible is a laborious and expensive task. This has led to the proliferation of automatic and semi-automatic methods for overcoming the so-called knowledge-acquisition bottleneck. In this short survey we present an overview of currently available sense-annotated corpora, both manually and (semi)automatically constructed, for diverse languages and lexical resources (i.e. WordNet, Wikipedia, BabelNet). General statistics and specific features of each sense-annotated dataset are also provided.
http://arxiv.org/abs/1802.04744
In domain adaptation for neural machine translation, translation performance can benefit from separating features into domain-specific features and common features. In this paper, we propose a method to explicitly model the two kinds of information in the encoder-decoder framework so as to exploit out-of-domain data in in-domain training. In our method, we maintain a private encoder and a private decoder for each domain, which are used to model domain-specific information. In the meantime, we introduce a common encoder and a common decoder shared by all the domains, through which only domain-independent information can flow. In addition, we add a discriminator to the shared encoder and employ adversarial training for the whole model to reinforce the performance of information separation and machine translation simultaneously. Experimental results show that our method greatly outperforms competitive baselines on multiple data sets.
http://arxiv.org/abs/1904.03879
Compared with object detection in static images, object detection in videos is more challenging due to degraded image qualities. An effective way to address this problem is to exploit temporal contexts by linking the same object across the video to form tubelets and aggregating classification scores in the tubelets. In this paper, we focus on obtaining high-quality object linking results for better classification. Unlike previous methods that link objects by checking boxes between neighboring frames, we propose to link in the same frame. To achieve this goal, we extend prior methods in the following aspects: (1) a cuboid proposal network that extracts spatio-temporal candidate cuboids which bound the movement of objects; (2) a short tubelet detection network that detects short tubelets in short video segments; (3) a short tubelet linking algorithm that links temporally overlapping short tubelets to form long tubelets. Experiments on the ImageNet VID dataset show that our method outperforms both the static image detector and the previous state of the art. In particular, our method improves results by 8.8% over the static image detector for fast moving objects.
https://arxiv.org/abs/1801.09823
This work tackles the problem of learning a set of language-specific acoustic units from unlabeled speech recordings, given a set of labeled recordings from other languages. Our approach may be described as the following two-step procedure: first, the model learns the notion of acoustic units from the labeled data; then, the model uses this knowledge to find new acoustic units in the target language. We implement this process with the Bayesian Subspace Hidden Markov Model (SHMM), a model akin to the Subspace Gaussian Mixture Model (SGMM), where each low-dimensional embedding represents an acoustic unit rather than just an HMM state. The subspace is trained on 3 languages from the GlobalPhone corpus (German, Polish and Spanish), and the acoustic units are discovered on the TIMIT corpus. Results, measured in equivalent Phone Error Rate, show that this approach significantly outperforms previous HMM-based acoustic unit discovery systems and compares favorably with the Variational Autoencoder-HMM.
http://arxiv.org/abs/1904.03876
Dense video captioning is an extremely challenging task since accurate and coherent description of events in a video requires holistic understanding of video contents as well as contextual reasoning about individual events. Most existing approaches handle this problem by first detecting event proposals from a video and then captioning a subset of the proposals. As a result, the generated sentences are prone to be redundant or inconsistent since they fail to consider temporal dependencies between events. To tackle this challenge, we propose a novel dense video captioning framework, which explicitly models temporal dependencies across events in a video and leverages visual and linguistic context from prior events for coherent storytelling. This objective is achieved by 1) integrating an event sequence generation network to adaptively select a sequence of event proposals, and 2) feeding the sequence of event proposals to our sequential video captioning network, which is trained by reinforcement learning with two-level rewards, at both the event and episode levels, for better context modeling. The proposed technique achieves outstanding performance on the ActivityNet Captions dataset in most metrics.
http://arxiv.org/abs/1904.03870
In this paper, we present LidarStereoNet, the first unsupervised Lidar-stereo fusion network, which can be trained in an end-to-end manner without the need for ground-truth depth maps. By introducing a novel “Feedback Loop” to connect the network input with the output, LidarStereoNet can tackle both noisy Lidar points and misalignment between sensors, which have been ignored in existing Lidar-stereo fusion studies. Besides, we propose to incorporate a piecewise planar model into network learning to further constrain depths to conform to the underlying 3D geometry. Extensive quantitative and qualitative evaluations on both real and synthetic datasets demonstrate the superiority of our method, which significantly outperforms state-of-the-art stereo matching, depth completion and Lidar-stereo fusion approaches.
http://arxiv.org/abs/1904.03868
Overcoming robotics challenges in the real world requires resilient control systems capable of handling a multitude of environments and unforeseen events. Evolutionary optimization using simulations is a promising way to automatically design such control systems; however, if the disparity between simulation and the real world becomes too large, the optimization process may result in dysfunctional real-world behaviors. In this paper, we address this challenge by considering embodied phase coordination in the evolutionary optimization of a quadruped robot controller based on central pattern generators. With this method, leg phases, and indirectly also inter-leg coordination, are influenced by sensor feedback. By comparing two very similar control systems, we gain insight into how the sensory feedback approach affects the evolved parameters of the control system, and how the performance differs in simulation, in transferal to the real world, and in different real-world environments. We show that evolution enables the design of a control system with embodied phase coordination which is more complex than previously seen approaches, and that this system is capable of controlling a real-world multi-jointed quadruped robot. The approach reduces the performance discrepancy between simulation and the real world, and displays robustness towards new environments.
http://arxiv.org/abs/1904.03855
Unsupervised deep learning for optical flow computation has achieved promising results. Most existing deep-net based methods rely on image brightness consistency and local smoothness constraint to train the networks. Their performance degrades at regions where repetitive textures or occlusions occur. In this paper, we propose Deep Epipolar Flow, an unsupervised optical flow method which incorporates global geometric constraints into network learning. In particular, we investigate multiple ways of enforcing the epipolar constraint in flow estimation. To alleviate a “chicken-and-egg” type of problem encountered in dynamic scenes where multiple motions may be present, we propose a low-rank constraint as well as a union-of-subspaces constraint for training. Experimental results on various benchmarking datasets show that our method achieves competitive performance compared with supervised methods and outperforms state-of-the-art unsupervised deep-learning methods.
http://arxiv.org/abs/1904.03848
Distance metric learning (DML) is a critical factor for image analysis and pattern recognition. To learn a robust distance metric for a target task, we need abundant side information (i.e., the similarity/dissimilarity pairwise constraints over the labeled data), which is usually unavailable in practice due to the high labeling cost. This paper considers the transfer learning setting by exploiting the large quantity of side information from certain related, but different, source tasks to help with target metric learning (with only a little side information). The state-of-the-art metric learning algorithms usually fail in this setting because the data distributions of the source task and target task are often quite different. We address this problem by assuming that the target distance metric lies in the space spanned by the eigenvectors of the source metrics (or other randomly generated bases). The target metric is represented as a combination of the base metrics, which are computed using the decomposed components of the source metrics (or simply a set of random bases); we call the proposed method decomposition-based transfer DML (DTDML). In particular, DTDML learns a sparse combination of the base metrics to construct the target metric by forcing the target metric to be close to an integration of the source metrics. The main advantage of the proposed method compared with existing transfer metric learning approaches is that we directly learn the base metric coefficients instead of the target metric. To this end, far fewer variables need to be learned. We therefore obtain more reliable solutions given the limited side information, and the optimization tends to be faster. Experiments on the popular handwritten image (digit, letter) classification and challenging natural image annotation tasks demonstrate the effectiveness of the proposed method.
http://arxiv.org/abs/1904.03846
Person re-identification (ReID) benefits greatly from the accurate annotations of existing datasets (e.g., CUHK03 \cite{li2014deepreid} and Market-1501 \cite{zheng2015scalable}), which are quite expensive because each image in these datasets has to be assigned a proper label. In this work, we explore easing the annotation of ReID by replacing accurate annotation with inaccurate annotation, i.e., we group the images into bags in terms of time and assign a bag-level label to each bag. This greatly reduces the annotation effort and leads to the creation of a large-scale ReID benchmark called SYSU-30$k$. The new benchmark contains $30k$ categories of persons, which is about $20$ times larger than CUHK03 ($1.3k$ categories) and Market-1501 ($1.5k$ categories), and $30$ times larger than ImageNet ($1k$ categories). In total, it contains 29,606,918 images. Learning a ReID model with bag-level annotation is called the weakly supervised ReID problem. To solve this problem, we introduce conditional random fields (CRFs) to capture the dependencies among all images in a bag and generate a reliable pseudo label for each person image. The pseudo label is further used to supervise the learning of the ReID model. When compared with fully supervised ReID models, our method achieves state-of-the-art performance on SYSU-30$k$ and other datasets. The code, dataset, and pretrained model will be available online.
http://arxiv.org/abs/1904.03845
Task 4 of the DCASE2018 challenge demonstrated that substantially more research is needed for real-world application of sound event detection. Analyzing the challenge results, it can be seen that the most successful models are biased towards predicting long (e.g., over 5 s) utterances. This work investigates the performance impact of fixed-size-window median-filter post-processing and advocates the use of double thresholding as a more robust and predictable post-processing method. Further, four different temporal subsampling methods within the CRNN framework are proposed: mean-max, alpha-mean-max, Lp-norm and convolutional. We show that, for this task, subsampling the temporal resolution by a neural network enhances the F1 score as well as onset and offset accuracies. Our best single model achieves 30.1% F1 on the evaluation set and our best fusion model 32.5%, outperforming the previously best attempt by 0.1% while maintaining robustness to short events.
http://arxiv.org/abs/1904.03841
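A minimal sketch of double thresholding as described: frames above a high threshold seed events, which are then grown while the frame-level score stays above a low threshold. The two threshold values here are arbitrary.

```python
import numpy as np

def double_threshold(scores, hi=0.75, lo=0.2):
    active = np.zeros_like(scores, dtype=bool)
    for seed in np.flatnonzero(scores > hi):   # high threshold seeds events
        l = seed
        while l > 0 and scores[l - 1] > lo:    # grow left above low threshold
            l -= 1
        r = seed
        while r < len(scores) - 1 and scores[r + 1] > lo:  # grow right
            r += 1
        active[l : r + 1] = True
    return active

scores = np.array([0.1, 0.3, 0.8, 0.6, 0.3, 0.1, 0.5, 0.1])
print(double_threshold(scores).astype(int))  # [0 1 1 1 1 0 0 0]
```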
With the improvement of the pattern recognition and feature extraction abilities of Deep Neural Networks (DNNs), more and more problems are being tackled from an image-based point of view. Recently, the Reconstructive Neural Network (ReConNN) was proposed to obtain an image-based model from an analysis-based model, which can help solve many high-frequency problems where sampling is difficult, e.g., sonic waves and impacts. However, because the studied problems are mostly processes that change only slightly between steps, the low accuracy of the Convolutional Neural Network (CNN) and the poor diversity of the Generative Adversarial Network (GAN) make the reconstruction process inaccurate, inefficient, computationally expensive and labor-intensive. In this study, an improved ReConNN model is proposed to address these weaknesses. Through experiments, comparisons and analyses, the improved model is demonstrated to outperform the original in accuracy, efficiency and cost.
http://arxiv.org/abs/1905.03229
Redundancy is widely recognized in Convolutional Neural Networks (CNNs), and it enables removing unimportant filters from convolutional layers so as to slim the network with an acceptable performance drop. Inspired by the linear and combinational properties of convolution, we seek to make some filters increasingly close and eventually identical for network slimming. To this end, we propose Centripetal SGD (C-SGD), a novel optimization method which can train several filters to collapse into a single point in the parameter hyperspace. When the training is completed, the removal of the identical filters can trim the network with NO performance loss, so no fine-tuning is needed. By doing so, we have partly solved an open problem of constrained filter pruning on CNNs with complicated structure, where some layers must be pruned following others. Our experimental results on CIFAR-10 and ImageNet have justified the effectiveness of C-SGD-based filter pruning. Moreover, we have provided empirical evidence for the assumption that the redundancy in deep neural networks helps the convergence of training, by showing that a redundant CNN trained using C-SGD outperforms a normally trained counterpart of equivalent width.
http://arxiv.org/abs/1904.03837
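A simplified sketch of the centripetal idea, not the paper's exact update rule: filters in a cluster share an averaged gradient and are additionally pulled toward their cluster centroid, so their difference contracts geometrically and duplicates can later be deleted without fine-tuning. Cluster assignments, learning rate and centripetal strength are illustrative.

```python
import torch

def centripetal_step(W, grad, clusters, lr=0.1, eps=0.5):
    """W, grad: (n_filters, d); clusters: lists of filter indices to merge."""
    W = W.clone()
    for idx in clusters:
        g_mean = grad[idx].mean(dim=0)                     # shared gradient
        center = W[idx].mean(dim=0)                        # cluster centroid
        W[idx] -= lr * (g_mean + eps * (W[idx] - center))  # centripetal pull
    return W

W = torch.randn(4, 9)                                   # 4 filters, 3x3 on 1 channel
for _ in range(500):                                    # noisy stand-in gradients
    W = centripetal_step(W, torch.randn(4, 9), clusters=[[0, 1], [2, 3]])
print(torch.allclose(W[0], W[1], atol=1e-3))            # True: filters have merged
```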
Representation and learning of long-range dependencies is a central challenge confronted in modern applications of machine learning to sequence data. Yet despite the prominence of this issue, the basic problem of measuring long-range dependence, either in a given data source or as represented in a trained deep model, remains largely limited to heuristic tools. We contribute a statistical framework for investigating long-range dependence in current applications of sequence modeling, drawing on the statistical theory of long memory stochastic processes. By analogy with their linear predecessors in the time series literature, we identify recurrent neural networks (RNNs) as nonlinear processes that simultaneously attempt to learn both a feature representation for and the long-range dependency structure of an input sequence. We derive testable implications concerning the relationship between long memory in real-world data and its learned representation in a deep network architecture, which are explored through a semiparametric framework adapted to the high-dimensional setting. We establish the validity of statistical inference for a simple estimator, which yields a decision rule for long memory in RNNs. Experiments illustrating this statistical framework confirm the presence of long memory in a diverse collection of natural language and music data, but show that a variety of RNN architectures fail to capture this property even after training to benchmark accuracy in a language model.
http://arxiv.org/abs/1904.03834
Speech emotion recognition is a challenging task that heavily depends on hand-engineered acoustic features, which are typically crafted to echo human perception of speech signals. However, a filter bank designed from perceptual evidence is not always guaranteed to be the best in a statistical modelling framework where the end goal is, for example, emotion classification. This has fuelled the emerging trend of learning representations from raw speech, especially using deep neural networks. In particular, a combination of Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks has gained traction in this field: LSTMs for their intrinsic ability to learn the contextual information crucial for emotion recognition, and CNNs for their ability to overcome the scalability problem of regular neural networks. In this paper, we show that there are still opportunities to improve the performance of emotion recognition from raw speech by exploiting the properties of CNNs in modelling contextual information. We propose the use of parallel convolutional layers in the feature extraction block that are jointly trained with the LSTM-based classification network for the emotion recognition task. Our results suggest that the proposed model can reach the performance of CNNs with hand-engineered features on the IEMOCAP and MSP-IMPROV datasets.
http://arxiv.org/abs/1904.03833
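A minimal PyTorch sketch of the proposed front end: parallel 1-D convolutions with different kernel widths over the raw waveform, concatenated and fed to an LSTM classifier. Channel counts, strides and kernel widths are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class ParallelCNNLSTM(nn.Module):
    def __init__(self, n_classes=4, ch=32):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(1, ch, k, stride=16, padding=k // 2) for k in (32, 64, 128)
        )
        self.lstm = nn.LSTM(3 * ch, 64, batch_first=True)
        self.out = nn.Linear(64, n_classes)

    def forward(self, wav):                    # wav: (batch, samples)
        x = wav.unsqueeze(1)                   # (batch, 1, samples)
        feats = [torch.relu(b(x)) for b in self.branches]
        T = min(f.shape[-1] for f in feats)    # align time lengths across branches
        h = torch.cat([f[..., :T] for f in feats], dim=1).transpose(1, 2)
        _, (hn, _) = self.lstm(h)
        return self.out(hn[-1])                # classify from last hidden state

print(ParallelCNNLSTM()(torch.randn(2, 16000)).shape)  # torch.Size([2, 4])
```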
In the conventional person re-id setting, it is assumed that the labeled images are the person images within the bounding box for each individual; this labeling across multiple non-overlapping camera views from raw video surveillance is costly and time-consuming. To overcome this difficulty, we consider weakly supervised person re-id modeling. The weak setting refers to matching a target person with an untrimmed gallery video where we only know that the identity appears in the video, without the requirement of annotating the identity in any frame of the video during the training procedure. Hence, for a video, there can be multiple video-level labels. We cast this weakly supervised person re-id challenge as a multi-instance multi-label learning (MIML) problem. In particular, we develop a Cross-View MIML (CV-MIML) method that is able to explore potential intra-class person images from all the camera views by incorporating intra-bag alignment and cross-view bag alignment. Finally, the CV-MIML method is embedded into an existing deep neural network to develop the Deep Cross-View MIML (Deep CV-MIML) model. We have performed extensive experiments to show the feasibility of the proposed weakly supervised setting and verify the effectiveness of our method compared to related methods on four weakly labeled datasets.
http://arxiv.org/abs/1904.03832
This paper investigates a hybrid compositional approach to optimal mission planning for multi-rotor Unmanned Aerial Vehicles (UAVs). We consider a time-critical search and rescue scenario with two quadrotors in a constrained environment. Metric Temporal Logic (MTL) is used to formally describe the task specifications. In order to capture the various modes of UAV operation, we utilize a hybrid model for the system with linearized dynamics around different operating points. We divide the mission into several sub-tasks by exploiting the invariant nature of various task specifications, i.e., the mutual independence of safety and timing constraints along the way, and the different modes (i.e., dynamics) of the robot. For each sub-task, we translate the MTL formulae into linear constraints and solve the associated optimal control problem for the desired path using a Mixed Integer Linear Programming (MILP) solver. The complete path is constructed by composing the individual optimal sub-paths. We show that the resulting trajectory satisfies the task specifications, and that the proposed approach leads to a significant reduction in the computational complexity of the problem, making real-time implementation possible.
http://arxiv.org/abs/1904.03830
Audio is an important medium in people’s daily life, and hidden information can be embedded into audio for covert communication. Current audio information hiding techniques can be roughly classified into time domain-based and transform domain-based techniques. Time domain-based techniques have a large hiding capacity but low imperceptibility. Transform domain-based techniques have better imperceptibility, but their hiding capacity is poor. This paper proposes a new audio information hiding technique which shows both high hiding capacity and good imperceptibility. The proposed method takes the original audio signal as input and obtains the audio signal embedded with hidden information (called the stego audio) through the training of our private automatic speech recognition (ASR) model. Without knowing the internal parameters and structure of the private model, the hidden information can be extracted by the private model but not by public models. We use four other ASR models to try to extract the hidden information from the stego audios in order to evaluate the security of the private model. The experimental results show that the proposed audio information hiding technique has a high hiding capacity of 48 cps with good imperceptibility and high security. In addition, our proposed adversarial audio can be used to activate an intrinsic backdoor of DNN-based ASR models, which poses a serious threat to intelligent speakers.
http://arxiv.org/abs/1904.03829
Label propagation aims to iteratively diffuse label information from labeled examples to unlabeled examples over a similarity graph. Current label propagation algorithms cannot consistently yield satisfactory performance for two reasons: one is the instability of a single propagation method in dealing with various practical data, and the other is an improper propagation sequence that ignores the labeling difficulties of different examples. To remedy the above defects, this paper proposes a novel propagation algorithm called hybrid diffusion under ensemble teaching (HyDEnT). Specifically, HyDEnT integrates multiple propagation methods as base learners to fully exploit their individual wisdom, which helps HyDEnT remain stable and obtain consistently encouraging results. More importantly, HyDEnT conducts propagation under the guidance of an ensemble of teachers: in every propagation round, the simplest curriculum examples are wisely designated by a teaching algorithm, so that their labels can be reliably and accurately decided by the learners. To optimally choose these simplest examples, every teacher in the ensemble comprehensively considers the examples’ difficulties from its own viewpoint, as well as the common knowledge shared by all the teachers. This is accomplished by a designed optimization problem, which can be efficiently solved via the block coordinate descent method. Thanks to the efforts of the teachers, all the unlabeled examples are logically propagated from simple to difficult, leading to better propagation quality for HyDEnT than for existing methods.
http://arxiv.org/abs/1904.03828