Convolutional Neural Networks (CNNs) have proven extremely successful at solving computer vision tasks. State-of-the-art methods favor deep network architectures for their accuracy, at the cost of a massive number of parameters and high weight redundancy. Previous works have studied how to prune such CNN weights. In this paper, we go to the other extreme and analyze the performance of a network stacked with a single convolution kernel shared across layers, as well as other weight-sharing techniques. We name it Deep Anchored Convolutional Neural Network (DACNN). Sharing the same kernel weights across layers reduces the model size tremendously; more precisely, the network is compressed in memory by a factor of L, where L is the desired depth of the network, disregarding the fully connected layer used for prediction. The number of parameters in DACNN barely increases as the network grows deeper, which allows us to build deep DACNNs without any concern about memory costs. We also introduce a partially shared-weights network (DACNN-mix) as well as an easy plug-in module, coined regulators, to boost the performance of our architecture. We validate our idea on 3 datasets: CIFAR-10, CIFAR-100 and SVHN. Our results show that our model saves massive amounts of memory while maintaining high accuracy.
http://arxiv.org/abs/1904.09764
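As a rough illustration of the anchored weight-sharing idea above, the sketch below (in PyTorch, with hypothetical layer names and sizes, not the authors' exact DACNN) reuses a single convolution kernel across the depth of the network, so the parameter count stays essentially constant as depth grows.

```python
# A minimal sketch (not the authors' exact DACNN) of sharing one convolution
# kernel across many layers; channel width, depth, and head are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKernelNet(nn.Module):
    def __init__(self, channels=64, depth=16, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)            # maps RGB to the working width
        self.shared = nn.Conv2d(channels, channels, 3, padding=1)   # the single anchored kernel
        self.depth = depth                                          # reused `depth` times
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x):
        x = F.relu(self.stem(x))
        for _ in range(self.depth):        # same weights at every layer,
            x = F.relu(self.shared(x))     # so parameters do not grow with depth
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)
        return self.head(x)

net = SharedKernelNet()
print(sum(p.numel() for p in net.parameters()))  # roughly constant in `depth`
```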
In this paper, we propose a novel algorithm to rectify the illumination of digitized documents by eliminating shading artifacts. Firstly, a topographic surface of an input digitized document is created using the luminance value of each pixel. Then the shading artifact on the document is estimated by simulating an immersion process. The simulation of the immersion process is modeled using a novel diffusion equation with an iterative update rule. After estimating the shading artifacts, the digitized document is reconstructed using the Lambertian surface model. In order to evaluate the performance of the proposed algorithm, we conduct rigorous experiments on a set of digitized documents generated using smartphones under challenging lighting conditions. The experimental results show that the proposed method produces promising illumination correction results and outperforms state-of-the-art methods.
http://arxiv.org/abs/1904.09763
Multi-face alignment aims to identify the geometric structures of multiple human faces in an image, and its performance is important for many practical tasks, such as face recognition, face tracking and face animation. In this work, we present a fast bottom-up multi-face landmark detection approach, which can simultaneously localize multi-person facial landmarks with high precision. In more detail, unlike previous top-down approaches, our bottom-up architecture maps the landmarks to a high-dimensional space. Then, the discriminative high-dimensional features are aggregated to represent the landmarks. By clustering the features belonging to the same face, our approach can align the multi-person facial landmarks synchronously. Extensive experiments are conducted in this paper, and the experimental results demonstrate that our method achieves high performance in the multi-face landmark alignment task while being extremely fast. Moreover, we propose a new multi-face dataset to compare the speed and precision of bottom-up face alignment methods. Our dataset is publicly available at https://github.com/AISAResearch/FoxNet
http://arxiv.org/abs/1904.09758
This paper proposes a novel Non-Local Attention Optimized Deep Image Compression (NLAIC) framework, built on top of the popular variational auto-encoder (VAE) structure. Our NLAIC framework embeds non-local operations in the encoders and decoders for both the image and the latent feature probability information (known as the hyperprior) to capture both local and global correlations, and applies an attention mechanism to generate masks that weight the features of the image and hyperprior, implicitly adapting bit allocation for different features based on their importance. Furthermore, both hyperpriors and spatial-channel neighbors of the latent features are used to improve entropy coding. The proposed model outperforms existing methods on the Kodak dataset, including learned (e.g., Balle2019, Balle2018) and conventional (e.g., BPG, JPEG2000, JPEG) image compression methods, for both the PSNR and MS-SSIM distortion metrics.
http://arxiv.org/abs/1904.09757
Who did what to whom is a major focus in natural language understanding, and is precisely the aim of semantic role labeling (SRL). Although SRL is naturally essential to text comprehension tasks, it has been surprisingly ignored in previous work. This paper thus makes the first attempt to let SRL enhance text comprehension and inference by specifying verbal arguments and their corresponding semantic roles. In our deep learning models, embeddings are enhanced with semantic role labels for more fine-grained semantics. We show that the salient labels can be conveniently added to existing models and significantly improve deep learning models on challenging text comprehension tasks. Extensive experiments on benchmark machine reading comprehension and inference datasets verify that the proposed semantic learning helps our system reach a new state of the art.
http://arxiv.org/abs/1809.02794
Forward Vehicle Collision Warning (FCW) is one of the most important functions for autonomous vehicles. In this procedure, vehicle detection and distance measurement are core components, requiring accurate localization and estimation. In this paper, we propose a simple but efficient forward vehicle collision warning framework that aggregates monocular distance measurement and precise vehicle detection. To obtain the forward vehicle distance, we use a quick camera calibration method that only needs three physical points to calibrate the related camera parameters. For forward vehicle detection, a multi-scale detection algorithm that uses the calibration result as a distance prior is proposed to improve precision. Intensive experiments are conducted on our real-scene dataset, and the results demonstrate the effectiveness of the proposed framework.
http://arxiv.org/abs/1904.12642
Video-based vehicle detection and tracking is one of the most important components of Intelligent Transportation Systems (ITS). At road junctions, the problem becomes even more difficult due to occlusions and complex interactions among vehicles. In order to obtain precise detection and tracking results, in this work we propose a novel tracking-by-detection framework. In the detection stage, we present a sequential detection model to deal with serious occlusions. In the tracking stage, we model group behavior to handle complex interactions with overlaps and ambiguities. The main contributions of this paper are twofold: 1) A shape prior is exploited in the sequential detection model to tackle occlusions in crowded scenes. 2) A traffic force is defined in the traffic scene to model group behavior, which helps handle complex interactions among vehicles. We evaluate the proposed approach on real surveillance videos at road junctions, and its performance demonstrates the effectiveness of our method.
http://arxiv.org/abs/1904.12641
Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive. In this paper, we reveal surprising distributional differences between human text and machine text. In addition, we find that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. Our findings motivate Nucleus Sampling, a simple but effective method to draw the best out of neural generation. By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
http://arxiv.org/abs/1904.09751
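The core of Nucleus Sampling is truncating the next-token distribution to the smallest set of tokens whose cumulative probability reaches a threshold p, then renormalizing and sampling. A minimal sketch in PyTorch, with an illustrative threshold of 0.9 and an assumed GPT-2-sized vocabulary:

```python
# A minimal sketch of nucleus (top-p) sampling from a next-token distribution;
# the threshold and vocabulary size are illustrative.
import torch

def nucleus_sample(logits, p=0.9):
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    # keep the smallest prefix of tokens whose cumulative mass reaches p
    keep = cum - sorted_probs < p
    keep[0] = True                         # always keep the most likely token
    nucleus_probs = sorted_probs * keep
    nucleus_probs /= nucleus_probs.sum()   # renormalize over the nucleus
    choice = torch.multinomial(nucleus_probs, 1)
    return sorted_idx[choice]

logits = torch.randn(50257)                # e.g. a GPT-2-sized vocabulary
token_id = nucleus_sample(logits, p=0.9)
```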
Graph Neural Networks (GNNs) have been popularly used for analyzing non-Euclidean data such as social network data and biological data. Despite their success, the design of graph neural networks requires a lot of manual work and domain knowledge. In this paper, we propose a Graph Neural Architecture Search method (GraphNAS for short) that enables automatic search of the best graph neural architecture based on reinforcement learning. Specifically, GraphNAS first uses a recurrent network to generate variable-length strings that describe the architectures of graph neural networks, and then trains the recurrent network with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation data set. Extensive experimental results on node classification tasks in both transductive and inductive learning settings demonstrate that GraphNAS achieves consistently better performance on the Cora, Citeseer, and Pubmed citation networks, and a protein-protein interaction network. On node classification tasks, GraphNAS can design a novel network architecture that rivals the best human-invented architectures in terms of test set accuracy.
http://arxiv.org/abs/1904.09981
This paper presents an unsupervised deep-learning framework named Local Deep-Feature Alignment (LDFA) for dimension reduction. We construct a neighbourhood for each data sample and learn a local Stacked Contractive Auto-encoder (SCAE) from the neighbourhood to extract the local deep features. Next, we exploit an affine transformation to align the local deep features of each neighbourhood with the global features. Moreover, we derive an approach from LDFA to explicitly map a new data sample into the learned low-dimensional subspace. The advantage of the LDFA method is that it learns both local and global characteristics of the data sample set: the local SCAEs capture local characteristics contained in the data set, while the global alignment procedures encode the interdependencies between neighbourhoods into the final low-dimensional feature representations. Experimental results on data visualization, clustering and classification show that the LDFA method is competitive with several well-known dimension reduction techniques, and exploiting locality in deep learning is a research topic worth further exploring.
http://arxiv.org/abs/1904.09747
We present a constituency parsing algorithm that maps from word-aligned contextualized feature vectors to parse trees. Our algorithm proceeds strictly left-to-right, processing one word at a time by assigning it a label from a small vocabulary. We show that, with mild assumptions, our inference procedure requires constant computation time per word. Our method gets 95.4 F1 on the WSJ test set.
http://arxiv.org/abs/1904.09745
Large-scale point clouds generated from 3D sensors are more accurate than their image-based counterparts. However, they are seldom used in visual pose estimation due to the difficulty in obtaining 2D-3D image to point cloud correspondences. In this paper, we propose the 2D3D-MatchNet - an end-to-end deep network architecture to jointly learn the descriptors for 2D and 3D keypoints from images and point clouds, respectively. As a result, we are able to directly match and establish 2D-3D correspondences from the query image and 3D point cloud reference map for visual pose estimation. We create our Oxford 2D-3D Patches dataset from the Oxford Robotcar dataset with the ground truth camera poses and 2D-3D image to point cloud correspondences for training and testing the deep network. Experimental results verify the feasibility of our approach.
http://arxiv.org/abs/1904.09742
This paper proposes an automatic subtitle generation and semantic video summarization technique. The importance of automatic video summarization is vast in the present era of big data. Video summarization helps in efficient storage and quick surfing of large collections of videos without losing the important ones. The videos are summarized with the help of subtitles, which are obtained using several text summarization algorithms. The proposed technique generates subtitles for videos with or without subtitles using speech recognition, and then applies NLP-based text summarization algorithms to the subtitles. The performance of subtitle generation and video summarization is boosted through an ensemble method with two approaches: an intersection method and a weight-based learning method. Experimental results show the satisfactory performance of the proposed method.
http://arxiv.org/abs/1904.09740
Normalization methods are essential components in convolutional neural networks (CNNs). They either standardize or whiten data using statistics estimated in predefined sets of pixels. Unlike existing works that design normalization techniques for specific tasks, we propose Switchable Whitening (SW), which provides a general form unifying different whitening methods as well as standardization methods. SW learns to switch among these operations in an end-to-end manner. It has several advantages. First, SW adaptively selects appropriate whitening or standardization statistics for different tasks (see Fig.1), making it well suited for a wide range of tasks without manual design. Second, by integrating benefits of different normalizers, SW shows consistent improvements over its counterparts in various challenging benchmarks. Third, SW serves as a useful tool for understanding the characteristics of whitening and standardization techniques. We show that SW outperforms other alternatives on image classification (CIFAR-10/100, ImageNet), semantic segmentation (ADE20K, Cityscapes), domain adaptation (GTA5, Cityscapes), and image style transfer (COCO). For example, without bells and whistles, we achieve state-of-the-art performance with 45.33% mIoU on the ADE20K dataset. Code and models will be released.
http://arxiv.org/abs/1904.09739
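To give a flavor of the switching mechanism, the sketch below implements a simplified, standardization-only variant in PyTorch: softmax-gated weights choose between instance-wise and layer-wise statistics. The real SW also switches among whitening operations (e.g. batch or instance whitening), which are omitted here, and all names and sizes are illustrative.

```python
# A simplified, standardization-only sketch of the switching idea: learn softmax
# weights over candidate normalizers. The actual SW additionally includes
# whitening; this only illustrates the end-to-end switching mechanism.
import torch
import torch.nn as nn

class SwitchableStandardization(nn.Module):
    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.switch = nn.Parameter(torch.zeros(2))                 # one logit per statistic
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.eps = eps

    def forward(self, x):                                          # x: (N, C, H, W)
        w = torch.softmax(self.switch, dim=0)
        mean_in = x.mean(dim=(2, 3), keepdim=True)                 # instance statistics
        var_in = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        mean_ln = x.mean(dim=(1, 2, 3), keepdim=True)              # layer statistics
        var_ln = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        mean = w[0] * mean_in + w[1] * mean_ln                     # learned mixture of statistics
        var = w[0] * var_in + w[1] * var_ln
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```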
With the development of deep learning, the structure of convolutional neural networks is becoming more and more complex and the performance of object recognition is getting better. However, the classification mechanism of convolutional neural networks is still an unsolved core problem. The main difficulty is that convolutional neural networks have too many parameters, which makes them hard to analyze. In this paper, we design and train a convolutional neural network for expression recognition and explore the classification mechanism of the network. Using the deconvolution visualization method, the extremum points of the convolutional neural network are projected back to the pixel space of the original image, and we qualitatively verify that the trained expression recognition network forms detectors for specific facial action units. At the same time, we design a distance function to measure the distance between the presence of a facial action unit and the maximal value of the response on a feature map of the convolutional neural network. The greater the distance, the more sensitive the feature map is to the facial action unit. By comparing the maximum distance over all facial action units for each feature map, the mapping relationship between facial action units and the feature maps of the convolutional neural network is determined. Therefore, we verify that the convolutional neural network forms detectors for facial action units during training in order to realize expression recognition.
http://arxiv.org/abs/1904.09737
As DenseNet conserves intermediate features with diverse receptive fields by aggregating them with dense connections, it shows good performance on the object detection task. Although feature reuse enables DenseNet to produce strong features with a small number of model parameters and FLOPs, detectors with a DenseNet backbone show rather slow speed and low energy efficiency. We find that the input channels, which increase linearly with dense connections, lead to heavy memory access cost, which causes computation overhead and more energy consumption. To solve the inefficiency of DenseNet, we propose an energy- and computation-efficient architecture called VoVNet, comprised of One-Shot Aggregation (OSA). OSA not only adopts the strength of DenseNet, representing diversified features with multiple receptive fields, but also overcomes the inefficiency of dense connections by aggregating all features only once in the last feature map. To validate the effectiveness of VoVNet as a backbone network, we design both lightweight and large-scale VoVNets and apply them to one-stage and two-stage object detectors. Our VoVNet-based detectors outperform DenseNet-based ones with 2x faster speed, and their energy consumption is reduced by 1.6x - 4.1x. In addition to DenseNet, VoVNet also outperforms the widely used ResNet backbone with faster speed and better energy efficiency. In particular, small object detection performance is significantly improved over DenseNet and ResNet.
http://arxiv.org/abs/1904.09730
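A minimal PyTorch sketch of the One-Shot Aggregation idea: every convolution takes a constant-width input, and the input plus all intermediate feature maps are concatenated only once at the end of the block. Layer counts and channel widths are illustrative rather than the paper's exact configuration.

```python
# A minimal sketch of a One-Shot Aggregation (OSA) block: constant-width convs,
# with a single concatenation of all intermediate outputs at the end.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OSABlock(nn.Module):
    def __init__(self, in_ch, stage_ch, out_ch, num_layers=5):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Conv2d(ch, stage_ch, 3, padding=1))
            ch = stage_ch                      # input width stays constant after the first conv
        # single aggregation of the input and all intermediate feature maps
        self.concat_conv = nn.Conv2d(in_ch + num_layers * stage_ch, out_ch, 1)

    def forward(self, x):
        outputs = [x]
        for conv in self.layers:
            x = F.relu(conv(x))
            outputs.append(x)
        return F.relu(self.concat_conv(torch.cat(outputs, dim=1)))

block = OSABlock(in_ch=128, stage_ch=80, out_ch=256)
y = block(torch.randn(2, 128, 32, 32))          # -> (2, 256, 32, 32)
```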
Person identification in the wild is very challenging due to great variation in poses, face quality, clothes, makeup and so on. Traditional research, such as face recognition, person re-identification, and speaker recognition, often focuses on a single modality of information, which is inadequate to handle all the situations in practice. Multi-modal person identification is a more promising way in which we can jointly utilize face, head, body, audio features, and so on. In this paper, we introduce iQIYI-VID, the largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities. These video clips are extracted from 400K hours of online videos of various types, ranging from movies, variety shows, and TV series, to news broadcasting. All video clips pass through a careful human annotation process, and the error rate of labels is lower than 0.2%. We evaluated the state-of-the-art models of face recognition, person re-identification, and speaker recognition on the iQIYI-VID dataset. Experimental results show that these models are still far from perfect for the task of person identification in the wild. We propose a Multi-modal Attention module to fuse multi-modal features, which improves person identification considerably. We have released the dataset online to promote multi-modal person identification research.
http://arxiv.org/abs/1811.07548
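One simple way to realize attention-based fusion of per-modality features (face, head, body, audio) is shown below; the two-layer scorer and feature dimension are assumptions for illustration, not necessarily the paper's Multi-modal Attention module.

```python
# A minimal sketch of attention-weighted fusion over per-modality feature vectors.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 2), nn.Tanh(),
                                   nn.Linear(dim // 2, 1))

    def forward(self, feats):                                # feats: (batch, num_modalities, dim)
        weights = torch.softmax(self.score(feats), dim=1)    # (batch, M, 1) attention over modalities
        return (weights * feats).sum(dim=1)                  # fused (batch, dim) representation

fusion = AttentionFusion(dim=512)
fused = fusion(torch.randn(8, 4, 512))                       # e.g. face/head/body/audio features
```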
We introduce SocialIQa, the first large-scale benchmark for commonsense reasoning about social situations. This resource contains 45,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations (e.g., Q: "Skylar went to Jan's birthday party and gave her a gift. What does Skylar need to do before this?" A: "Go shopping"). Through crowdsourcing, we collect commonsense questions along with correct and incorrect answers about social interactions, using a new framework that mitigates stylistic artifacts in incorrect answers by asking workers to provide the right answer to the wrong question. While humans can easily solve these questions (90%), our benchmark is more challenging for existing question-answering (QA) models, such as those based on pretrained language models (77%). Notably, we further establish SocialIQa as a resource for transfer learning of commonsense knowledge, achieving state-of-the-art performance on several commonsense reasoning tasks (Winograd Schemas, COPA).
http://arxiv.org/abs/1904.09728
A non-signalized intersection is a typical and common scenario for connected and automated vehicles (CAVs). How to balance safety and efficiency remains difficult for researchers. To improve the original Responsibility Sensitive Safety (RSS) driving strategy at non-signalized intersections, we propose a new strategy in this paper based on right-of-way assignment (RWA). The performance of the RSS strategy, a cooperative driving strategy, and the RWA-based strategy are tested and compared. Testing results indicate that our strategy yields better traffic efficiency than the RSS strategy, but is not as satisfying as the cooperative driving strategy due to the limited communication range and the lack of long-term planning. However, our new strategy requires much lower communication costs among vehicles.
http://arxiv.org/abs/1905.01150
This paper proposes a robust localization system that employs deep learning for better scene representation and enhances the accuracy of 6-DOF camera pose estimation. Inspired by the fact that global scene structure can be revealed by a wide field of view, we leverage the large overlap of a fisheye camera between adjacent frames and the powerful high-level feature representations of deep learning. Our main contribution is a novel network architecture that extracts both temporal and spatial information using a Recurrent Neural Network. Specifically, we propose a novel pose regularization term combined with an LSTM. This leads to smoother pose estimation, especially for large outdoor scenes. Promising experimental results on three benchmark datasets demonstrate the effectiveness of the proposed approach.
http://arxiv.org/abs/1904.09722
We present two new datasets and a novel attention mechanism for Natural Language Inference (NLI). Existing neural NLI models, even when trained on existing large datasets, do not capture the notion of entities and roles well and often end up making mistakes such as inferring “Peter signed a deal” from “John signed a deal”. The two datasets have been developed to mitigate such issues and make the systems better at understanding the notion of “entities” and “roles”. After training the existing architectures on the new datasets, we observe that they do not perform well on one of the new benchmarks. We then propose a modification to the “word-to-word” attention function which has been uniformly reused across several popular NLI architectures. The resulting architectures perform as well as their unmodified counterparts on the existing benchmarks and perform significantly better on the new benchmark for “roles” and “entities”.
http://arxiv.org/abs/1904.09720
Face anti-spoofing has gained increased attention recently in both academic and industrial fields. With the emergence of various CNN-based solutions, multi-modal (RGB, depth and IR) CNN-based methods have shown better performance than single-modal classifiers. However, there is a need to improve performance and reduce complexity. Therefore, an extremely light network architecture (FeatherNet A/B) is proposed with a streaming module which fixes the weakness of Global Average Pooling and uses fewer parameters. Our single FeatherNet, trained on depth images only, provides a higher baseline with 0.00168 ACER, 0.35M parameters and 83M FLOPS. Furthermore, a novel fusion procedure with an "ensemble + cascade" structure is presented to satisfy performance-preferred use cases. Meanwhile, the MMFD dataset is collected to provide more attacks and diversity for better generalization. We used the fusion method in the Face Anti-spoofing Attack Detection Challenge@CVPR2019 and achieved 0.0013 (ACER), 0.999 (TPR@FPR=10e-2), 0.998 (TPR@FPR=10e-3) and 0.9814 (TPR@FPR=10e-4).
http://arxiv.org/abs/1904.09290
Arbitrary attribute editing can generally be tackled by incorporating an encoder-decoder and generative adversarial networks. However, the bottleneck layer in the encoder-decoder usually gives rise to blurry and low-quality editing results, and adding skip connections improves image quality at the cost of weakened attribute manipulation ability. Moreover, existing methods exploit the target attribute vector to guide flexible translation to the desired target domain. In this work, we suggest addressing these issues from a selective transfer perspective. Considering that a specific editing task is only related to the changed attributes rather than all target attributes, our model selectively takes the difference between the target and source attribute vectors as input. Furthermore, selective transfer units are incorporated into the encoder-decoder to adaptively select and modify encoder features for enhanced attribute editing. Experiments show that our method (i.e., STGAN) simultaneously improves attribute manipulation accuracy as well as perceptual quality, and performs favorably against state-of-the-art methods in arbitrary facial attribute editing and season translation.
http://arxiv.org/abs/1904.09709
Standard methods in deep learning for natural language processing fail to capture the compositional structure of human language that allows for systematic generalization outside of the training distribution. However, human learners readily generalize in this way, e.g. by applying known grammatical rules to novel words. Inspired by work in neuroscience suggesting separate brain systems for syntactic and semantic processing, we implement a modification to standard approaches in neural machine translation, imposing an analogous separation. The novel model, which we call Syntactic Attention, substantially outperforms standard methods in deep learning on the SCAN dataset, a compositional generalization task, without any hand-engineered features or additional supervision. Our work suggests that separating syntactic from semantic learning may be a useful heuristic for capturing compositional structure.
http://arxiv.org/abs/1904.09708
The Winograd Schema Challenge (WSC) was proposed as an AI-hard problem for testing computers' intelligence in commonsense representation and reasoning. This paper presents the new state of the art on WSC, achieving an accuracy of 71.1%. We demonstrate that the leading performance benefits from jointly modelling sentence structures, utilizing knowledge learned from cutting-edge pretraining models, and performing fine-tuning. We conduct detailed analyses, showing that fine-tuning is critical for achieving the performance, but it helps more on the simpler associative problems. Modelling sentence dependency structures, however, consistently helps on the harder non-associative subset of WSC. Analysis also shows that larger fine-tuning datasets yield better performance, suggesting the potential benefit of future work on annotating more Winograd schema sentences.
http://arxiv.org/abs/1904.09705
The use of deep pretrained bidirectional transformers has led to remarkable progress in learning multi-sentence representations for downstream language understanding tasks (Devlin et al., 2018). For tasks that make pairwise comparisons, e.g. matching a given context with a corresponding response, two approaches have permeated the literature. A Cross-encoder performs full self-attention over the pair; a Bi-encoder performs self-attention for each sequence separately, and the final representation is a function of the pair. While Cross-encoders nearly always outperform Bi-encoders on various tasks, both in our work and others’ (Urbanek et al., 2019), they are orders of magnitude slower, which hampers their ability to perform real-time inference. In this work, we develop a new architecture, the Poly-encoder, that is designed to approach the performance of the Cross-encoder while maintaining reasonable computation time. Additionally, we explore two pretraining schemes with different datasets to determine how these affect the performance on our chosen dialogue tasks: ConvAI2 and DSTC7 Track 1. We show that our models achieve state-of-the-art results on both tasks; that the Poly-encoder is a suitable replacement for Bi-encoders and Cross-encoders; and that even better results can be obtained by pretraining on a large dialogue dataset.
http://arxiv.org/abs/1905.01969
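A minimal sketch of the Poly-encoder scoring head: the context is summarized into m learned codes by attending over its token embeddings, the candidate embedding then attends over those codes, and the final score is a dot product. The dimensions, number of codes, and the use of precomputed contextualized token embeddings are assumptions for illustration.

```python
# A minimal sketch of Poly-encoder scoring over precomputed token embeddings.
import torch
import torch.nn as nn

class PolyEncoderHead(nn.Module):
    def __init__(self, dim=768, num_codes=64):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, dim) * 0.02)   # m learned context codes

    def forward(self, ctx_tokens, cand_emb):
        # ctx_tokens: (B, T, dim) contextualized context tokens
        # cand_emb:   (B, dim) one candidate vector per batch element
        attn = torch.softmax(self.codes @ ctx_tokens.transpose(1, 2), dim=-1)  # (B, m, T)
        ctx_codes = attn @ ctx_tokens                                           # (B, m, dim)
        w = torch.softmax(ctx_codes @ cand_emb.unsqueeze(-1), dim=1)            # candidate attends over codes
        ctx_vec = (w * ctx_codes).sum(dim=1)                                    # (B, dim)
        return (ctx_vec * cand_emb).sum(dim=-1)                                 # dot-product score

head = PolyEncoderHead()
scores = head(torch.randn(4, 32, 768), torch.randn(4, 768))
```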
Argument mining is generally performed at the sentence level – it is assumed that an entire sentence (not parts of it) corresponds to an argument. In this paper, we introduce the new task of Argument unit Recognition and Classification (ARC). In ARC, an argument is generally a part of a sentence – a more realistic assumption since several different arguments can occur in one sentence and longer sentences often contain a mix of argumentative and non-argumentative parts. Recognizing and classifying the spans that correspond to arguments makes ARC harder than previously defined argument mining tasks. We release ARC-8, a new benchmark for evaluating the ARC task. We show that token-level annotations for argument units can be gathered using scalable methods. ARC-8 contains 25% more arguments than a dataset annotated at the sentence level would. We cast ARC as a sequence labeling task, develop a number of methods for ARC sequence tagging and establish the state of the art for ARC-8. A focus of our work is robustness: both robustness against errors in sentence identification (which are frequent for noisy text) and robustness against divergence in training and test data.
http://arxiv.org/abs/1904.09688
This paper proposes a novel training scheme for fast matching models in Search Ads, which is motivated by real challenges in model training. The first challenge stems from the pursuit of high throughput, which prohibits the deployment of inseparable architectures and hence greatly limits model accuracy. The second problem arises from the heavy dependency on human-provided labels, which are expensive and time-consuming to collect, yet how to leverage unlabeled search log data is rarely studied. The proposed training framework aims to mitigate both issues by treating the stronger but undeployable models as annotators, and learning a deployable model from both human-provided relevance labels and weakly annotated search log data. Specifically, we first construct multiple auxiliary tasks from the enumerated relevance labels and train the annotators by jointly learning from those related tasks. The annotation models are then used to assign scores to both labeled and unlabeled training samples. The deployable model is first learned on the scored unlabeled data and then fine-tuned on the scored labeled data, leveraging both labels and scores by minimizing the proposed label-aware weighted loss. According to our experiments, compared with a baseline that directly learns from relevance labels, training with the proposed framework outperforms it by a large margin and improves data efficiency substantially by dispensing with 80% of labeled samples. The proposed framework allows us to improve the fast matching model by learning from stronger annotators while keeping its architecture unchanged. Meanwhile, our training framework offers a principled manner to leverage search log data in the training phase, which could effectively alleviate our dependency on human-provided labels.
https://arxiv.org/abs/1901.10710
Stochastic video prediction models take in a sequence of image frames, and generate a sequence of consecutive future image frames. These models typically generate future frames in an autoregressive fashion, which is slow and requires the input and output frames to be consecutive. We introduce a model that overcomes these drawbacks by generating a latent representation from an arbitrary set of frames that can then be used to simultaneously and efficiently sample temporally consistent frames at arbitrary time-points. For example, our model can “jump” and directly sample frames at the end of the video, without sampling intermediate frames. Synthetic video evaluations confirm substantial gains in speed and functionality without loss in fidelity. We also apply our framework to a 3D scene reconstruction dataset. Here, our model is conditioned on camera location and can sample consistent sets of images for what an occluded region of a 3D scene might look like, even if there are multiple possibilities for what that region might contain. Reconstructions and videos are available at https://bit.ly/2O4Pc4R.
http://arxiv.org/abs/1807.02033
We present a comprehensive study and evaluation of existing single image dehazing algorithms, using a new large-scale benchmark consisting of both synthetic and real-world hazy images, called REalistic Single Image DEhazing (RESIDE). RESIDE highlights diverse data sources and image contents, and is divided into five subsets, each serving different training or evaluation purposes. We further provide a rich variety of criteria for dehazing algorithm evaluation, ranging from full-reference metrics, to no-reference metrics, to subjective evaluation and the novel task-driven evaluation. Experiments on RESIDE shed light on the comparisons and limitations of state-of-the-art dehazing algorithms, and suggest promising future directions.
http://arxiv.org/abs/1712.04143
Communication structure plays a key role in the learning capability of decentralized systems. Structural self-adaptation, by means of self-organization, changes the order as well as the input information of the agents’ collective decision-making. This paper studies the role of agents’ repositioning on the same communication structure, i.e. a tree, as the means to expand the learning capacity in complex combinatorial optimization problems, for instance, load-balancing power demand to prevent blackouts or efficient utilization of bike sharing stations. The optimality of structural self-adaptations is rigorously studied by constructing a novel large-scale benchmark that consists of 4000 agents with synthetic and real-world data performing 4 million structural self-adaptations during which almost 320 billion learning messages are exchanged. Based on this benchmark dataset, 124 deterministic structural criteria, applied as learning meta-features, are systematically evaluated as well as two online structural self-adaptation strategies designed to expand learning capacity. Experimental evaluation identifies metrics that capture agents with influential information and their optimal positioning. Significant gain in learning performance is observed for the two strategies especially under low-performing initialization. Strikingly, the strategy that triggers structural self-adaptation in a more exploratory fashion is the most cost-effective.
http://arxiv.org/abs/1904.09681
With an ultimate goal of narrowing the gap between human and machine readers in text comprehension, we present the first collection of Challenging Chinese machine reading Comprehension datasets (C^3) collected from language and professional certification exams, which contains 13,924 documents and their associated 23,990 multiple-choice questions. Most of the questions in C^3 cannot be answered merely by surface-form matching against the given text. As a pilot study, we closely analyze the prior knowledge (i.e., linguistic, domain-specific, and general world knowledge) needed in these real world reading comprehension tasks. We further explore how to leverage linguistic knowledge including a lexicon of common idioms and proverbs and domain-specific knowledge such as textbooks to aid machine readers, through fine-tuning a pre-trained language model (Devlin et al.,2019). Our experimental results demonstrate that linguistic knowledge may help improve the performance of the baseline reader in both general and domain-specific tasks. C^3 will be available at this http URL
http://arxiv.org/abs/1904.09679
In this paper, we introduce UniSent, universal sentiment lexica for 1000 languages created using an English sentiment lexicon and a massively parallel corpus in the Bible domain. To the best of our knowledge, UniSent is the largest sentiment resource to date in terms of the number of covered languages, including many low-resource languages. To create UniSent, we propose Adapted Sentiment Pivot, a novel method that combines annotation projection, vocabulary expansion, and unsupervised domain adaptation. We evaluate the quality of UniSent for Macedonian, Czech, German, Spanish, and French and show that it is comparable to manually or semi-manually created sentiment resources. With the publication of this paper, we release the UniSent lexica as well as the code for the Adapted Sentiment Pivot method.
http://arxiv.org/abs/1904.09678
In machine learning, it is very important for a robot to be able to estimate dynamics from sequences of input data. This problem can be solved using a recurrent neural network. In this paper, we will discuss the preprocessing of 10 states of the dataset, then the use of an LSTM recurrent neural network to estimate one output state (dynamics) from the other 9 input states. We will discuss the architecture of the recurrent neural network, the data collection and preprocessing, the loss function, the results on the test data, and changes that could improve the network. The results of this paper will be used for artificial intelligence research and to identify the capabilities of an LSTM recurrent neural network architecture for estimating the dynamics of a system.
http://arxiv.org/abs/1904.09980
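A minimal sketch (with assumed tensor shapes) of the setup described above: an LSTM regresses the single output state from the other nine input states over a sequence, trained with a mean-squared-error loss.

```python
# A minimal sketch of LSTM-based dynamics regression: 9 input states -> 1 output state.
import torch
import torch.nn as nn

class DynamicsLSTM(nn.Module):
    def __init__(self, input_size=9, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, x):                   # x: (batch, time, 9)
        h, _ = self.lstm(x)
        return self.out(h)                  # (batch, time, 1) estimated output state

model = DynamicsLSTM()
pred = model(torch.randn(4, 100, 9))        # 100 time steps of the 9 input states
loss = nn.MSELoss()(pred, torch.randn(4, 100, 1))
```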
In machine learning, it is very important for a robot to know the state of an object and recognize particular desired states. This is an image classification problem that can be solved using a convolutional neural network. In this paper, we will discuss the use of a VGG convolutional neural network to recognize the states of cooking objects. We will discuss the use of activation functions, optimizers, data augmentation, layer additions, and other variations of the architecture. The results of this paper will be used to identify alternatives to the VGG convolutional neural network to improve accuracy.
http://arxiv.org/abs/1904.12613
We propose BERTScore, an automatic evaluation metric for text generation. Analogous to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference. However, instead of looking for exact matches, we compute similarity using contextualized BERT embeddings. We evaluate on several machine translation and image captioning benchmarks, and show that BERTScore correlates better with human judgments than existing metrics, often significantly outperforming even task-specific supervised metrics.
http://arxiv.org/abs/1904.09675
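The matching step of BERTScore can be sketched as follows, assuming contextualized token embeddings for the candidate and reference have already been computed (e.g. with a BERT encoder) and L2-normalized; the idf weighting used in the full metric is omitted here.

```python
# A minimal sketch of BERTScore's greedy matching over cosine similarities.
import torch

def bertscore(cand_emb, ref_emb):
    # cand_emb: (Tc, d), ref_emb: (Tr, d); rows are unit-normalized token embeddings
    sim = cand_emb @ ref_emb.t()               # pairwise cosine similarities
    precision = sim.max(dim=1).values.mean()   # each candidate token matched to its best reference token
    recall = sim.max(dim=0).values.mean()      # each reference token matched to its best candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision.item(), recall.item(), f1.item()

cand = torch.nn.functional.normalize(torch.randn(7, 768), dim=-1)
ref = torch.nn.functional.normalize(torch.randn(9, 768), dim=-1)
print(bertscore(cand, ref))
```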
The new demands for high-reliability and ultra-high-capacity wireless communication have led to extensive research into 5G communications. However, the current communication systems, which were designed on the basis of conventional communication theories, significantly restrict further performance improvements and lead to severe limitations. Recently, the emerging deep learning techniques have been recognized as a promising tool for handling complicated communication systems, and their potential for optimizing wireless communications has been demonstrated. In this article, we first review the development of deep learning solutions for 5G communication, and then propose efficient schemes for deep learning-based 5G scenarios. Specifically, the key ideas for several important deep learning-based communication methods are presented along with the research opportunities and challenges. In particular, novel communication frameworks of non-orthogonal multiple access (NOMA), massive multiple-input multiple-output (MIMO), and millimeter wave (mmWave) are investigated, and their superior performances are demonstrated. We envision that the appealing deep learning-based wireless physical layer frameworks will bring a new direction in communication theories and that this work will move us forward along this road.
http://arxiv.org/abs/1904.09673
Can neural networks learn to compare graphs without feature engineering? In this paper, we show that it is possible to learn representations for graph similarity with neither domain knowledge nor supervision (i.e., feature engineering or labeled graphs). We propose Deep Divergence Graph Kernels, an unsupervised method for learning representations over graphs that encodes a relaxed notion of graph isomorphism. Our method consists of three parts. First, we learn an encoder for each anchor graph to capture its structure. Second, for each pair of graphs, we train a cross-graph attention network which uses the node representations of an anchor graph to reconstruct another graph. This approach, which we call isomorphism attention, captures how well the representations of one graph can encode another. We use the attention-augmented encoder’s predictions to define a divergence score for each pair of graphs. Finally, we construct an embedding space for all graphs using these pair-wise divergence scores. Unlike previous work, much of which relies on 1) supervision, 2) domain specific knowledge (e.g. a reliance on Weisfeiler-Lehman kernels), and 3) known node alignment, our unsupervised method jointly learns node representations, graph representations, and an attention-based alignment between graphs. Our experimental results show that Deep Divergence Graph Kernels can learn an unsupervised alignment between graphs, and that the learned representations achieve competitive results when used as features on a number of challenging graph classification tasks. Furthermore, we illustrate how the learned attention allows insight into the alignment of sub-structures across graphs.
http://arxiv.org/abs/1904.09671
Current 3D object detection methods are heavily influenced by 2D detectors. In order to leverage architectures in 2D detectors, they often convert 3D point clouds to regular grids (i.e., to voxel grids or to bird’s eye view images), or rely on detection in 2D images to propose 3D boxes. Few works have attempted to directly detect objects in point clouds. In this work, we return to first principles to construct a 3D detection pipeline for point cloud data that is as generic as possible. However, due to the sparse nature of the data – samples from 2D manifolds in 3D space – we face a major challenge when directly predicting bounding box parameters from scene points: a 3D object centroid can be far from any surface point and thus hard to regress accurately in one step. To address this challenge, we propose VoteNet, an end-to-end 3D object detection network based on a synergy of deep point set networks and Hough voting. Our model achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and SUN RGB-D, with a simple design, compact model size and high efficiency. Remarkably, VoteNet outperforms previous methods by using purely geometric information without relying on color images.
http://arxiv.org/abs/1904.09664
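A minimal sketch of the voting step: each seed point regresses an offset (plus a feature residual) so that the resulting votes move toward object centroids, which are then grouped to propose boxes. The MLP sizes below are illustrative, not the paper's exact configuration.

```python
# A minimal sketch of a Hough-voting module: seed points predict offsets toward
# object centroids and residuals for their features.
import torch
import torch.nn as nn

class VotingModule(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, 3 + feat_dim))

    def forward(self, seed_xyz, seed_feat):      # (B, N, 3), (B, N, feat_dim)
        out = self.mlp(seed_feat)
        vote_xyz = seed_xyz + out[..., :3]        # votes pushed toward object centroids
        vote_feat = seed_feat + out[..., 3:]      # updated vote features
        return vote_xyz, vote_feat

voter = VotingModule()
xyz, feat = voter(torch.randn(2, 1024, 3), torch.randn(2, 1024, 256))
```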
The paper presents a new model for low-level interpretation of single-channel images. The image is decomposed into a graph which captures a complete set of structural features. The description allows every edge location and its correct connectivity to be identified accurately. The key features of the method are: vector description of the edges, subpixel precision, and parallelism of the underlying algorithm. The methodology outperforms classical and state-of-the-art edge detectors at both the conceptual and experimental levels. It also enables graph-based algorithms for higher-level feature extraction. Any image processing pipeline can benefit from such results: e.g., controlled denoising, edge-preserving filtering, upsampling, compression, vector- and graph-based pattern matching, and neural network training.
http://arxiv.org/abs/1904.09659
Embedding methods have achieved success in face recognition by comparing facial features in a latent semantic space. However, in a fully unconstrained face setting, the features learned by the embedding model could be ambiguous or may not even be present in the input face, leading to noisy representations. We propose Probabilistic Face Embeddings (PFEs), which represent each face image as a Gaussian distribution in the latent space. The mean of the distribution estimates the most likely feature values while the variance shows the uncertainty in the feature values. Probabilistic solutions can then be naturally derived for matching and fusing PFEs using the uncertainty information. Empirical evaluation on different baseline models, training datasets and benchmarks shows that the proposed method can improve the face recognition performance of deterministic embeddings by converting them into PFEs. The uncertainties estimated by PFEs also serve as good indicators of the potential matching accuracy, which are important for a risk-controlled recognition system.
http://arxiv.org/abs/1904.09658
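Matching two probabilistic embeddings can be sketched with a mutual-likelihood-style score between diagonal Gaussians (mean and per-dimension variance); the formulation below follows the usual PFE-style expression up to a constant and is an assumption for illustration, not copied from the paper.

```python
# A sketch of a mutual-likelihood-style match score between two face embeddings
# modeled as diagonal Gaussians (mu, sigma^2); the constant term is omitted.
import torch

def mutual_likelihood_score(mu1, var1, mu2, var2):
    # all inputs: (d,) mean and per-dimension variance of each face embedding
    var = var1 + var2
    return -0.5 * (((mu1 - mu2) ** 2) / var + torch.log(var)).sum()

mu1, mu2 = torch.randn(512), torch.randn(512)
var1, var2 = torch.rand(512) + 0.1, torch.rand(512) + 0.1
score = mutual_likelihood_score(mu1, var1, mu2, var2)   # higher = more likely the same person
```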
Research has provided evidence that associative classification produces more accurate results than other classification models. Classification Based on Association (CBA) is one of the well-known associative classification algorithms that generates accurate classifiers. However, current associative classification algorithms reside outside of databases, which reduces the flexibility of enterprise analytics systems. This paper implements CBA in the Oracle database using two variant models: hard-coding CBA in the Oracle Data Mining (ODM) package, and integrating the Oracle Apriori model with the Oracle Decision Tree model. We compared the proposed models' performance with Naive Bayes, Support Vector Machines, Random Forests, and Decision Trees over 18 datasets from UCI. Results show that our models outperform the original CBA model by 1 percent and are competitive with the chosen classification models over the benchmark datasets.
http://arxiv.org/abs/1904.09654
Estimating the parameter of a Bernoulli process arises in many applications, including photon-efficient active imaging where each illumination period is regarded as a single Bernoulli trial. Motivated by acquisition efficiency when multiple Bernoulli processes are of interest, we formulate the allocation of trials under a constraint on the mean as an optimal resource allocation problem. An oracle-aided trial allocation demonstrates that there can be a significant advantage from varying the allocation for different processes and inspires a simple trial allocation gain quantity. Motivated by realizing this gain without an oracle, we present a trellis-based framework for representing and optimizing stopping rules. Considering the convenient case of Beta priors, three implementable stopping rules with similar performances are explored, and the simplest of these is shown to asymptotically achieve the oracle-aided trial allocation. These approaches are further extended to estimating functions of a Bernoulli parameter. In simulations inspired by realistic active imaging scenarios, we demonstrate significant mean-squared error improvements: up to 4.36 dB for the estimation of p and up to 1.80 dB for the estimation of log p.
http://arxiv.org/abs/1809.08801
Previous studies have shown that neural machine translation (NMT) models can benefit from modeling translated (Past) and un-translated (Future) source contents as recurrent states (Zheng et al., 2018). However, the recurrent process is less interpretable. In this paper, we propose to model Past and Future by Capsule Network (Hinton et al.,2011), which provides an explicit separation of source words into groups of Past and Future by the process of parts-to-wholes assignment. The assignment is learned with a novel variant of routing-by-agreement mechanism (Sabour et al., 2017), namely Guided Dynamic Routing, in which what to translate at current decoding step guides the routing process to assign each source word to its associated group represented by a capsule, and to refine the representation of the capsule dynamically and iteratively. Experiments on translation tasks of three language pairs show that our model achieves substantial improvements over both RNMT and Transformer. Extensive analysis further verifies that our method does recognize translated and untranslated content as expected, and produces better and more adequate translations.
http://arxiv.org/abs/1904.09646
We use persistent homology along with the eigenfunctions of the Laplacian to study similarity amongst triangulated 2-manifolds. Our method relies on studying the lower-star filtration induced by the eigenfunctions of the Laplacian. This gives us a shape descriptor that inherits the rich information encoded in the eigenfunctions of the Laplacian. Moreover, the similarity between these descriptors can be easily computed using tools that are readily available in Topological Data Analysis. We provide experiments to illustrate the effectiveness of the proposed method.
http://arxiv.org/abs/1904.09639
Deep pre-training and fine-tuning models (like BERT, OpenAI GPT) have demonstrated excellent results in question answering areas. However, due to the sheer number of model parameters, the inference speed of these models is very slow. How to apply these complex models to real business scenarios becomes a challenging but practical problem. Previous works often leverage model compression approaches to resolve this problem. However, these methods usually induce information loss during the model compression procedure, leading to results that are not comparable between the compressed model and the original model. To tackle this challenge, we propose a Multi-task Knowledge Distillation Model (MKDM for short) for a web-scale question answering system, which distills knowledge from multiple teacher models to a light-weight student model. In this way, more generalized knowledge can be transferred. The experimental results show that our method can significantly outperform the baseline methods and even achieve comparable results with the original teacher models, along with a significant speedup of model inference.
http://arxiv.org/abs/1904.09636
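A minimal sketch of distilling several teachers into one student: soft teacher predictions are averaged and combined with the hard-label loss. The temperature, loss weighting, and two-class setup below are illustrative, not the paper's settings.

```python
# A minimal sketch of multi-teacher knowledge distillation: average teacher soft
# labels, then mix a KL distillation term with the usual cross-entropy loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, labels, T=2.0, alpha=0.5):
    teacher_probs = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list]).mean(dim=0)
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                         teacher_probs, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

student = torch.randn(8, 2)                    # e.g. relevant / not relevant
teachers = [torch.randn(8, 2) for _ in range(3)]
labels = torch.randint(0, 2, (8,))
loss = distillation_loss(student, teachers, labels)
```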
In this study, we propose leveraging interpretability for tasks beyond the purpose of explainability alone. In particular, this study puts forward a novel strategy for leveraging gradient-based interpretability in the realm of adversarial examples, where we use the insights gained to aid adversarial learning. More specifically, we introduce the concept of spatially constrained one-pixel adversarial perturbations, where we guide the learning of such adversarial perturbations towards more susceptible areas identified via gradient-based interpretability. Experimental results using different benchmark datasets show that such a spatially constrained one-pixel adversarial perturbation strategy can noticeably improve the speed of convergence as well as produce successful attacks that are also visually difficult to perceive, thus illustrating an effective use of interpretability methods for tasks beyond pure explainability.
http://arxiv.org/abs/1904.09633
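A minimal sketch of using gradient-based saliency to choose the single pixel at which to constrain the perturbation; the model and label are placeholders, and the full attack (how the chosen pixel is then perturbed) is not reproduced here.

```python
# A minimal sketch of gradient-guided pixel selection for a spatially constrained
# one-pixel perturbation; `model` is any differentiable image classifier.
import torch

def one_pixel_candidate(model, image, label):
    # image: (1, C, H, W); label: (1,) class index; returns the most salient (row, col)
    image = image.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(image), label)
    loss.backward()
    saliency = image.grad.abs().sum(dim=1).squeeze(0)    # aggregate gradient magnitude over channels
    idx = saliency.flatten().argmax()
    return divmod(idx.item(), saliency.shape[1])          # (row, col) to constrain the attack
```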
Metric Learning for visual similarity has mostly adopted binary supervision indicating whether a pair of images are of the same class or not. Such a binary indicator covers only a limited subset of image relations, and is not sufficient to represent semantic similarity between images described by continuous and/or structured labels such as object poses, image captions, and scene graphs. Motivated by this, we present a novel method for deep metric learning using continuous labels. First, we propose a new triplet loss that allows distance ratios in the label space to be preserved in the learned metric space. The proposed loss thus enables our model to learn the degree of similarity rather than just the order. Furthermore, we design a triplet mining strategy adapted to metric learning with continuous labels. We address three different image retrieval tasks with continuous labels in terms of human poses, room layouts and image captions, and demonstrate the superior performance of our approach compared to previous methods.
http://arxiv.org/abs/1904.09626
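One way to preserve label-space distance ratios in the embedding space is a log-ratio objective over triplets, sketched below; this illustrates the idea with squared Euclidean distances and random tensors, and is not necessarily the paper's exact loss or mining strategy.

```python
# A sketch of a ratio-preserving triplet objective for continuous labels: the
# ratio of embedding distances is pushed to match the ratio of label distances.
import torch

def log_ratio_loss(f_a, f_i, f_j, y_a, y_i, y_j, eps=1e-8):
    d_fi = ((f_a - f_i) ** 2).sum(dim=-1) + eps   # embedding-space distances
    d_fj = ((f_a - f_j) ** 2).sum(dim=-1) + eps
    d_yi = ((y_a - y_i) ** 2).sum(dim=-1) + eps   # label-space distances (e.g. pose)
    d_yj = ((y_a - y_j) ** 2).sum(dim=-1) + eps
    return ((torch.log(d_fi / d_fj) - torch.log(d_yi / d_yj)) ** 2).mean()

f = torch.randn(3, 32, 128)                       # anchor / neighbor-i / neighbor-j embeddings
y = torch.rand(3, 32, 4)                          # continuous labels for each
loss = log_ratio_loss(f[0], f[1], f[2], y[0], y[1], y[2])
```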
There is growing interest in social robots being considered for the therapy of children with autism due to their effectiveness in improving outcomes. However, children on the spectrum exhibit challenging behaviors that need to be considered when designing robots for them. A child could involuntarily throw a small social robot during a meltdown, and it could hit another person's head and cause harm (e.g. concussion). In this paper, the application of soft materials is investigated for its potential in attenuating the head's linear acceleration upon impact. The thickness and storage modulus of three different soft materials were considered as the control factors, while the noise factor was the impact velocity. The design of experiments was based on the Taguchi method. A total of 27 experiments were conducted on a developed dummy-head setup that reports the linear acceleration of the head. ANOVA tests were performed to analyze the data. The findings showed that the control factors are not statistically significant in attenuating the response. The optimal values of the control factors were identified using the signal-to-noise (S/N) ratio optimization technique. Confirmation runs at the optimal parameters (i.e. thickness of 3 mm and 5 mm) showed a better response compared to other conditions. Designers of social robots should consider the application of soft materials to their designs, as it helps in reducing the potential harm to the head.
http://arxiv.org/abs/1904.09621
Clinical decision support tools (DST) promise improved healthcare outcomes by offering data-driven insights. While effective in lab settings, almost all DSTs have failed in practice. Empirical research diagnosed poor contextual fit as the cause. This paper describes the design and field evaluation of a radically new form of DST. It automatically generates slides for clinicians’ decision meetings with subtly embedded machine prognostics. This design took inspiration from the notion of “Unremarkable Computing”, that by augmenting the users’ routines technology/AI can have significant importance for the users yet remain unobtrusive. Our field evaluation suggests clinicians are more likely to encounter and embrace such a DST. Drawing on their responses, we discuss the importance and intricacies of finding the right level of unremarkableness in DST design, and share lessons learned in prototyping critical AI systems as a situated experience.
http://arxiv.org/abs/1904.09612