While existing work on neural architecture search (NAS) tunes hyperparameters in a separate post-processing step, we demonstrate that architectural choices and other hyperparameter settings interact in a way that can render this separation suboptimal. Likewise, we demonstrate that the common practice of using very few epochs during the main NAS and much larger numbers of epochs during a post-processing step is inefficient due to little correlation in the relative rankings for these two training regimes. To combat both of these problems, we propose to use a recent combination of Bayesian optimization and Hyperband for efficient joint neural architecture and hyperparameter search.
https://arxiv.org/abs/1807.06906
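The abstract above (1807.06906) builds on Hyperband, whose core budget-allocation routine is successive halving. The following is only a minimal sketch of that routine, not the paper's method: the Bayesian-optimization sampling is replaced by uniform random sampling, and `toy_loss`, the configuration tuples, and the parameter names `min_budget` and `eta` are hypothetical.

```python
import random

def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the best 1/eta of the configurations at each rung and
    multiply the per-configuration budget by eta (lower score is better).
    Assumes `configs` is non-empty."""
    budget = min_budget
    survivors = list(configs)
    rungs = []  # (budget, number of configurations evaluated) per rung
    while len(survivors) > 1:
        rungs.append((budget, len(survivors)))
        scored = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = scored[:max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0], rungs

def toy_loss(config, budget):
    # Hypothetical stand-in for "train for `budget` epochs, return validation loss".
    # Its ranking does not change with the budget, which is a simplification:
    # the abstract argues that real rankings can differ between budgets.
    learning_rate, width = config
    return (learning_rate - 0.1) ** 2 + 1.0 / width + 1.0 / budget

rng = random.Random(0)
candidates = [(rng.uniform(0.001, 1.0), rng.choice([16, 32, 64, 128]))
              for _ in range(27)]
best, rungs = successive_halving(candidates, toy_loss)
```

With 27 candidates and `eta=3`, the rungs evaluate 27, 9, and 3 configurations at budgets 1, 3, and 9, so most of the compute goes to the configurations that survive early screening.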
Recent advances in deep neural networks have been developed via architecture search for stronger representational power. In this work, we focus on the effect of attention in general deep neural networks. We propose a simple and effective attention module, named Bottleneck Attention Module (BAM), that can be integrated with any feed-forward convolutional neural networks. Our module infers an attention map along two separate pathways, channel and spatial. We place our module at each bottleneck of models where the downsampling of feature maps occurs. Our module constructs a hierarchical attention at bottlenecks with a number of parameters and it is trainable in an end-to-end manner jointly with any feed-forward models. We validate our BAM through extensive experiments on CIFAR-100, ImageNet-1K, VOC 2007 and MS COCO benchmarks. Our experiments show consistent improvement in classification and detection performances with various models, demonstrating the wide applicability of BAM. The code and models will be publicly available.
https://arxiv.org/abs/1807.06514
Image captioning is a multimodal task involving computer vision and natural language processing, where the goal is to learn a mapping from the image to its natural language description. In general, the mapping function is learned from a training set of image-caption pairs. However, for some languages, a large-scale image-caption paired corpus might not be available. We present an approach to this unpaired image captioning problem by language pivoting. Our method can effectively capture the characteristics of an image captioner from the pivot language (Chinese) and align it to the target language (English) using another pivot-target (Chinese-English) sentence parallel corpus. We evaluate our method on two image-to-English benchmark datasets: MSCOCO and Flickr30K. Quantitative comparisons against several baseline approaches demonstrate the effectiveness of our method.
https://arxiv.org/abs/1803.05526
Planning problems are among the most important and well-studied problems in artificial intelligence. They are most typically solved by tree search algorithms that simulate ahead into the future, evaluate future states, and back-up those evaluations to the root of a search tree. Among these algorithms, Monte-Carlo tree search (MCTS) is one of the most general, powerful and widely used. A typical implementation of MCTS uses cleverly designed rules, optimized to the particular characteristics of the domain. These rules control where the simulation traverses, what to evaluate in the states that are reached, and how to back-up those evaluations. In this paper we instead learn where, what and how to search. Our architecture, which we call an MCTSnet, incorporates simulation-based search inside a neural network, by expanding, evaluating and backing-up a vector embedding. The parameters of the network are trained end-to-end using gradient-based optimisation. When applied to small searches in the well known planning problem Sokoban, the learned search algorithm significantly outperformed MCTS baselines.
https://arxiv.org/abs/1802.04697
In neural machine translation (NMT), the most common practice is to stack a number of recurrent or feed-forward layers in the encoder and the decoder. As a result, the addition of each new layer improves the translation quality significantly. However, this also leads to a significant increase in the number of parameters. In this paper, we propose to share parameters across all the layers thereby leading to a recurrently stacked NMT model. We empirically show that the translation quality of a model that recurrently stacks a single layer 6 times is comparable to the translation quality of a model that stacks 6 separate layers. We also show that using pseudo-parallel corpora by back-translation leads to further significant improvements in translation quality.
https://arxiv.org/abs/1807.05353
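The parameter-sharing idea in the abstract above (1807.05353) can be illustrated with a toy stack of layers. This is a sketch under simplifying assumptions, not the paper's NMT model: the "layer" is a hypothetical dense tanh layer, and `make_layer`, `run_stack`, and `count_parameters` are illustrative names.

```python
import math
import random

def make_layer(dim, rng):
    """A toy feed-forward layer: a dim x dim weight matrix and a bias vector."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    bias = [0.0] * dim
    return weights, bias

def apply_layer(layer, x):
    weights, bias = layer
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def run_stack(layers, x):
    for layer in layers:
        x = apply_layer(layer, x)
    return x

def count_parameters(layers):
    # Count each distinct layer object once, so shared layers are not double-counted.
    unique = {id(layer): layer for layer in layers}
    return sum(len(w) * len(w[0]) + len(b) for w, b in unique.values())

rng = random.Random(0)
dim, depth = 8, 6
separate = [make_layer(dim, rng) for _ in range(depth)]  # 6 separate layers
shared = [make_layer(dim, rng)] * depth                  # one layer applied 6 times

x = [1.0] * dim
out_separate = run_stack(separate, x)
out_shared = run_stack(shared, x)
```

Both stacks perform six layer applications, but the shared stack stores one sixth of the parameters (72 instead of 432 in this toy setting).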
Training 3D object detectors for autonomous driving has been limited to small datasets due to the effort required to generate annotations. Reducing both task complexity and the amount of task switching done by annotators is key to reducing the effort and time required to generate 3D bounding box annotations. This paper introduces a novel ground truth generation method that combines human supervision with pretrained neural networks to generate per-instance 3D point cloud segmentation, 3D bounding boxes, and class annotations. The annotators provide object anchor clicks which behave as a seed to generate instance segmentation results in 3D. The points belonging to each instance are then used to regress object centroids, bounding box dimensions, and object orientation. Our proposed annotation scheme requires 30x lower human annotation time. We use the KITTI 3D object detection dataset to evaluate the efficiency and the quality of our annotation scheme. We also test the proposed scheme on previously unseen data from the Autonomoose self-driving vehicle to demonstrate generalization capabilities of the network.
https://arxiv.org/abs/1807.06072
Entity linking is the task of mapping potentially ambiguous terms in text to their constituent entities in a knowledge base like Wikipedia. This is useful for organizing content, extracting structured data from textual documents, and in machine learning relevance applications like semantic search, knowledge graph construction, and question answering. Traditionally, this work has focused on text that has been well-formed, like news articles, but in common real world datasets such as messaging, resumes, or short-form social media, non-grammatical, loosely-structured text adds a new dimension to this problem. This paper presents Pangloss, a production system for entity disambiguation on noisy text. Pangloss combines a probabilistic linear-time key phrase identification algorithm with a semantic similarity engine based on context-dependent document embeddings to achieve better than state-of-the-art results (>5% in F1) compared to other research or commercially available systems. In addition, Pangloss leverages a local embedded database with a tiered architecture to house its statistics and metadata, which allows rapid disambiguation in streaming contexts and on-device disambiguation in low-memory environments such as mobile phones.
https://arxiv.org/abs/1807.06036
Recent advances in deep learning-based object detection techniques have revolutionized their applicability in several fields. However, since these methods rely on unwieldy and large amounts of data, a common practice is to download models pre-trained on standard datasets and fine-tune them for specific application domains with a small set of domain relevant images. In this work, we show that using synthetic datasets that are not necessarily photo-realistic can be a better alternative to simply fine-tuning pre-trained networks. Specifically, our results show an impressive 25% improvement in the mAP metric over a fine-tuning baseline when only about 200 labelled images are available to train. Finally, an ablation study of our results is presented to delineate the individual contribution of different components in the randomization pipeline.
https://arxiv.org/abs/1807.09834
Detecting the relations among objects, such as “cat on sofa” and “person ride horse”, is a crucial task in image understanding, and beneficial to bridging the semantic gap between images and natural language. Despite the remarkable progress of deep learning in detection and recognition of individual objects, it is still a challenging task to localize and recognize the relations between objects due to the complex combinatorial nature of various kinds of object relations. Inspired by the recent advances in one-shot learning, we propose a simple yet effective Semantics Induced Learner (SIL) model for solving this challenging task. Learning in a one-shot manner can enable a detection model to adapt to a huge number of object relations with diverse appearance effectively and robustly. In addition, the SIL combines bottom-up and top-down attention mechanisms, therefore enabling attention at the level of vision and semantics favorably. Within our proposed model, the bottom-up mechanism, which is based on Faster R-CNN, proposes object regions, and the top-down mechanism selects and integrates visual features according to semantic information. Experiments demonstrate the effectiveness of our framework over other state-of-the-art methods on two large-scale data sets for object relation detection.
https://arxiv.org/abs/1807.05857
Due to object detection’s close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles which combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy and optimization function, etc. In this paper, we provide a review on deep learning based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely Convolutional Neural Network (CNN). Then we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network based learning systems.
https://arxiv.org/abs/1807.05511
In this work we present a detailed analysis of the interplay of Coulomb effects and different mechanisms that can lead to carrier localization effects in c-plane InGaN/GaN quantum wells. As mechanisms for carrier localization we consider here effects introduced by random alloy fluctuations as well as structural inhomogeneities such as well width fluctuations. Special attention is paid to the impact of the well width on the results. All calculations have been carried out in the framework of atomistic tight-binding theory. Our theoretical investigations show that, independent of the well widths studied here, carrier localization effects due to built-in fields, well width fluctuations and random alloy fluctuations dominate over Coulomb effects in terms of charge density redistributions. However, the situation is less clear cut when the well width fluctuations are absent. For large well width (approx. > 2.5 nm) charge density redistributions are possible but the electronic and optical properties are basically dominated by the spatial out-of-plane carrier separation originating from the electrostatic built-in field. The situation changes for lower well width (< 2.5 nm) where the Coulomb effect can lead to significant charge density redistributions and thus might compensate a large fraction of the spatial in-plane wave function separation observed in a single-particle picture. Given that this in-plane separation has been regarded as one of the main drivers behind the green gap problem, our calculations indicate that radiative recombination rates might significantly benefit from a reduced quantum well barrier interface roughness.
https://arxiv.org/abs/1807.05392
This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do so in a visual space exclusively. Apart from this conceptual novelty, we contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and further transferred into a deep visual feature of choice via a simple multi-layer perceptron. We further generalize Word2VisualVec for video caption retrieval, by predicting from text both 3-D convolutional neural network features as well as a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset and the very recent NIST TrecVid challenge for video caption retrieval detail Word2VisualVec’s properties, its benefit over textual embeddings, the potential for multimodal query composition and its state-of-the-art results.
https://arxiv.org/abs/1709.01362
Recent advances in vision tasks (e.g., segmentation) highly depend on the availability of large-scale real-world image annotations obtained by cumbersome human labor. Moreover, the perception performance often drops significantly for new scenarios, due to the poor generalization capability of models trained on limited and biased annotations. In this work, we resort to transferring knowledge from automatically rendered scene annotations in the virtual world to facilitate real-world visual tasks. Although virtual-world annotations can be ideally diverse and unlimited, the discrepant data distributions between the virtual and real worlds make knowledge transfer challenging. We thus propose a novel Semantic-aware Grad-GAN (SG-GAN) to perform virtual-to-real domain adaption with the ability of retaining vital semantic information. Beyond the simple holistic color/texture transformation achieved by prior works, SG-GAN successfully personalizes the appearance adaption for each semantic region in order to preserve their key characteristics for better recognition. It presents two main contributions to traditional GANs: 1) a soft gradient-sensitive objective for keeping semantic boundaries; 2) a semantic-aware discriminator for validating the fidelity of personalized adaptions with respect to each semantic region. Qualitative and quantitative experiments demonstrate the superiority of our SG-GAN in scene adaption over state-of-the-art GANs. Further evaluations on semantic segmentation on Cityscapes show that using virtual images adapted by SG-GAN dramatically improves segmentation performance over the original virtual data. We release our code at this https URL.
https://arxiv.org/abs/1801.01726
In this paper we show strategies to easily identify fake samples generated with the Generative Adversarial Network framework. One strategy is based on the statistical analysis and comparison of raw pixel values and features extracted from them. The other strategy learns formal specifications from the real data and shows that fake samples violate the specifications of the real data. We show that fake samples produced with GANs have a universal signature that can be used to identify fake samples. We provide results on MNIST, CIFAR10, music and speech data.
https://arxiv.org/abs/1807.04919
This work provides a simple approach to discover tight object bounding boxes with only image-level supervision, called Tight box mining with Surrounding Segmentation Context (TS2C). We observe that object candidates mined through current multiple instance learning methods are usually trapped at discriminative object parts, rather than covering the entire object. TS2C leverages surrounding segmentation context derived from weakly-supervised segmentation to suppress such low-quality distracting candidates and boost the high-quality ones. Specifically, TS2C is developed based on two key properties of desirable bounding boxes: 1) high purity, meaning most pixels in the box have a high object response, and 2) high completeness, meaning the box covers high object response pixels comprehensively. With such novel and computable criteria, tighter candidates can be discovered for learning a better object detector. With TS2C, we obtain 48.0% and 44.4% mAP scores on VOC 2007 and 2012 benchmarks, which are the new state of the art.
https://arxiv.org/abs/1807.04897
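The two box properties named in the TS2C abstract above (1807.04897), purity and completeness, can be made concrete with a small sketch. The formulas here are one plausible reading of the abstract's wording, not the paper's actual scoring (which uses surrounding segmentation context); the threshold value, the box convention, and the function name are assumptions.

```python
def purity_and_completeness(response, box, threshold=0.5):
    """Score a box against a 2D grid of object-response values.

    response: list of rows of floats (hypothetical segmentation confidence).
    box: (top, left, bottom, right), bottom/right exclusive, inside the grid.
    purity: fraction of pixels in the box with a high response.
    completeness: fraction of all high-response pixels that the box covers.
    """
    top, left, bottom, right = box
    high_inside = 0
    high_total = 0
    for r, row in enumerate(response):
        for c, value in enumerate(row):
            if value >= threshold:
                high_total += 1
                if top <= r < bottom and left <= c < right:
                    high_inside += 1
    box_area = (bottom - top) * (right - left)
    purity = high_inside / box_area if box_area else 0.0
    completeness = high_inside / high_total if high_total else 0.0
    return purity, completeness

# A 4x4 response map with a 2x2 object in the middle.
response = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
tight = purity_and_completeness(response, (1, 1, 3, 3))
part_only = purity_and_completeness(response, (1, 1, 2, 2))
too_loose = purity_and_completeness(response, (0, 0, 4, 4))
```

A box over a single object part scores high purity but low completeness, while an oversized box does the opposite; only the tight box scores high on both.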
Real-time moving object detection in unconstrained scenes is a difficult task due to dynamic backgrounds, changing foreground appearance and limited computational resources. In this paper, an optical flow based moving object detection framework is proposed to address this problem. We utilize homography matrices to construct a background model online in the form of optical flow. To separate moving foreground objects from the scene, a dual-mode judge mechanism is designed to heighten the system’s adaptation to challenging situations. In the experiments, two evaluation metrics are redefined to reflect the performance of methods more properly. We quantitatively and qualitatively validate the effectiveness and feasibility of our method with videos in various scene conditions. The experimental results show that our method adapts itself to different situations and outperforms the state-of-the-art methods, indicating the advantages of optical flow based methods.
https://arxiv.org/abs/1807.04890
We present AVOD, an Aggregate View Object Detection network for autonomous driving scenarios. The proposed neural network architecture uses LIDAR point clouds and RGB images to generate features that are shared by two subnetworks: a region proposal network (RPN) and a second stage detector network. The proposed RPN uses a novel architecture capable of performing multimodal feature fusion on high resolution feature maps to generate reliable 3D object proposals for multiple object classes in road scenes. Using these proposals, the second stage detection network performs accurate oriented 3D bounding box regression and category classification to predict the extents, orientation, and classification of objects in 3D space. Our proposed architecture is shown to produce state of the art results on the KITTI 3D object detection benchmark while running in real time with a low memory footprint, making it a suitable candidate for deployment on autonomous vehicles. Code is at: this https URL
https://arxiv.org/abs/1712.02294
We develop an unsupervised, nonparametric, and scalable statistical learning method for detection of unknown objects in noisy images. The method uses results from percolation theory and random graph theory. We present an algorithm that detects objects of unknown shapes and sizes in the presence of nonparametric noise of unknown level. The noise density is assumed to be unknown and can be very irregular. The algorithm has linear complexity and exponential accuracy and is appropriate for real-time systems. We prove strong consistency and scalability of our method in this setup with minimal assumptions.
https://arxiv.org/abs/1102.5019
Generative adversarial networks (GANs) are powerful tools for learning generative models. In practice, the training may suffer from lack of convergence. GANs are commonly viewed as a two-player zero-sum game between two neural networks. Here, we leverage this game theoretic view to study the convergence behavior of the training process. Inspired by the fictitious play learning process, a novel training method, referred to as Fictitious GAN, is introduced. Fictitious GAN trains the deep neural networks using a mixture of historical models. Specifically, the discriminator (resp. generator) is updated according to the best-response to the mixture outputs from a sequence of previously trained generators (resp. discriminators). It is shown that Fictitious GAN can effectively resolve some convergence issues that cannot be resolved by the standard training approach. It is proved that asymptotically the average of the generator outputs has the same distribution as the data samples.
https://arxiv.org/abs/1803.08647
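The Fictitious GAN abstract above (1803.08647) draws on the classical fictitious play process, in which each player best-responds to the mixture of the opponent's past actions. The sketch below shows that classical process on a small zero-sum matrix game (matching pennies), not GAN training itself; the function name and tie-breaking rule are illustrative choices.

```python
def fictitious_play(payoff, rounds):
    """Fictitious play for a two-player zero-sum matrix game.

    payoff[i][j] is the row player's payoff; the column player gets its negative.
    Returns the empirical action frequencies of both players.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_action, col_action = 0, 0  # arbitrary initial actions
    for _ in range(rounds):
        row_counts[row_action] += 1
        col_counts[col_action] += 1
        # Each player best-responds to the opponent's historical mixture
        # (ties are broken in favor of the lowest action index).
        row_values = [sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
                      for i in range(n_rows)]
        col_values = [-sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
                      for j in range(n_cols)]
        row_action = max(range(n_rows), key=row_values.__getitem__)
        col_action = max(range(n_cols), key=col_values.__getitem__)
    row_freq = [count / rounds for count in row_counts]
    col_freq = [count / rounds for count in col_counts]
    return row_freq, col_freq

matching_pennies = [[1, -1], [-1, 1]]
row_freq, col_freq = fictitious_play(matching_pennies, rounds=20000)
```

The individual actions keep cycling, yet the empirical frequencies approach the mixed equilibrium (one half for each action), which mirrors the abstract's statement that the average of the generator outputs, rather than any single generator, matches the data distribution asymptotically.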
Generative Adversarial Networks are powerful generative models that are able to model the manifold of natural images. We leverage this property to perform manifold regularization by approximating a variant of the Laplacian norm using a Monte Carlo approximation that is easily computed with the GAN. When incorporated into the semi-supervised feature-matching GAN we achieve state-of-the-art results for GAN-based semi-supervised learning on CIFAR-10 and SVHN benchmarks, with a method that is significantly easier to implement than competing methods. We also find that manifold regularization improves the quality of generated images, and is affected by the quality of the GAN used to approximate the regularizer.
https://arxiv.org/abs/1807.04307
We explore recurrent encoder multi-decoder neural network architectures for semi-supervised sequence classification and reconstruction. We find that the use of multiple reconstruction modules helps models generalize in a classification task when only a small amount of labeled data is available, which is often the case in practice. Such models provide useful high-level representations of motions allowing clustering, searching and faster labeling of new sequences. We also propose a new, realistic partitioning of a well-known, high quality motion-capture dataset for better evaluations. We further explore a novel formulation for future-predicting decoders based on conditional recurrent generative adversarial networks, for which we propose both soft and hard constraints for transition generation derived from desired physical properties of synthesized future movements and desired animation goals. We find that using such constraints helps stabilize the training of recurrent adversarial architectures for animation generation.
https://arxiv.org/abs/1511.06653
Neural Machine Translation (NMT) performs poorly on the low-resource language pair $(X,Z)$, especially when $Z$ is a rare language. By introducing another rich language $Y$, we propose a novel triangular training architecture (TA-NMT) to leverage bilingual data $(Y,Z)$ (may be small) and $(X,Y)$ (can be rich) to improve the translation performance of low-resource pairs. In this triangular architecture, $Z$ is taken as the intermediate latent variable, and translation models of $Z$ are jointly optimized with a unified bidirectional EM algorithm under the goal of maximizing the translation likelihood of $(X,Y)$. Empirical results demonstrate that our method significantly improves the translation quality of rare languages on MultiUN and IWSLT2012 datasets, and achieves even better performance when combined with back-translation methods.
https://arxiv.org/abs/1805.04813
Quantum computing and neural networks show great promise for the future of information processing. In this paper we study a quantum reservoir computer, a framework harnessing quantum dynamics and designed for fast and efficient solving of temporal machine learning tasks such as speech recognition, time series prediction and natural language processing. Specifically, we study memory capacity and accuracy of a quantum reservoir computer based on the fully connected transverse field Ising model by investigating different forms of inter-spin interactions and computing timescales. We show that variation in inter-spin interactions leads to a better memory capacity in general, that the capacity can be greatly enhanced by engineering the type of interactions, and that there exists an optimal timescale at which the capacity is maximized. To connect computational capabilities to physical properties of the underlying system, we also study the out-of-time-ordered correlator and find that its faster decay implies a more accurate memory.
https://arxiv.org/abs/1807.03947
We report the results of subparsec-scale submillimeter observations towards an embedded high-mass young stellar object in the Small Magellanic Cloud (SMC) with ALMA. Complementary infrared data obtained with the AKARI satellite and the Gemini South telescope are also presented. The target infrared point source is spatially resolved into two dense molecular cloud cores; one is associated with a high-mass young stellar object (YSO core), while another is not associated with an infrared source (East core). The two cores are dynamically associated but show different chemical characteristics. Emission lines of CS, C33S, H2CS, SO, SO2, CH3OH, H13CO+, H13CN, SiO, and dust continuum are detected from the observed region. Tentative detection of HDS is also reported. The first detection of CH3OH in the SMC has a strong impact on our understanding of the formation of complex organic molecules in metal-poor environments. The gas temperature is estimated to be ~10 K based on the rotation analysis of CH3OH lines. The fractional abundance of CH3OH gas in the East core is estimated to be (0.5-1.5) x 10^(-8), which is comparable with or marginally higher than those of similar cold sources in our Galaxy despite a factor of five lower metallicity in the SMC. This work provides observational evidence that an organic molecule like CH3OH, which is largely formed on grain surfaces, can be produced even in a significantly lower metallicity environment compared to the solar neighborhood. A possible origin of cold CH3OH gas in the observed dense core is discussed.
https://arxiv.org/abs/1806.07120
This paper studies a multiple-input single-output non-orthogonal multiple access cognitive radio network relying on simultaneous wireless information and power transfer. A realistic non-linear energy harvesting model is applied and a power splitting architecture is adopted at each secondary user (SU). Since it is difficult to obtain perfect channel state information (CSI) in practice, either a bounded or a Gaussian CSI error model is considered instead. Our robust beamforming and power splitting ratio are jointly designed for two problems with different objectives, namely that of minimizing the transmission power of the cognitive base station and that of maximizing the total harvested energy of the SUs, respectively. The optimization problems are challenging to solve, mainly because of the non-linear structure of the energy harvesting and CSI error models. We convert them into convex forms by using semi-definite relaxation. For the minimum transmission power problem, we obtain the rank-2 solution under the bounded CSI error model, while for the maximum energy harvesting problem, a two-loop procedure using a one-dimensional search is proposed. Our simulation results show that the proposed scheme significantly outperforms its traditional orthogonal multiple access counterpart. Furthermore, the performance using the Gaussian CSI error model is generally better than that using the bounded CSI error model.
https://arxiv.org/abs/1807.03930
We present a new dataset, called Falling Things (FAT), for advancing the state-of-the-art in object detection and 3D pose estimation in the context of robotics. By synthetically combining object models and backgrounds of complex composition and high graphical quality, we are able to generate photorealistic images with accurate 3D pose annotations for all objects in all images. Our dataset contains 60k annotated photos of 21 household objects taken from the YCB dataset. For each image, we provide the 3D poses, per-pixel class segmentation, and 2D/3D bounding box coordinates for all objects. To facilitate testing different input modalities, we provide mono and stereo RGB images, along with registered dense depth images. We describe in detail the generation process and statistical analysis of the data.
https://arxiv.org/abs/1804.06534
Human moving path is an important feature in architecture design. By studying the path, architects know where to arrange the basic elements (e.g. structures, glasses, furniture, etc.) in the space. This paper presents SimArch, a multi-agent system for human moving path simulation. It involves a behavior model built by using a Markov Decision Process. The model simulates human mental states, target range detection, and collision prediction when agents are on the floor, in a particular small gallery, looking at an exhibit, or leaving the floor. It also models different kinds of human characteristics by assigning different transition probabilities. A modified weighted A* search algorithm quickly plans the sub-optimal path of the agents. In an experiment, SimArch takes a series of preprocessed floorplans as inputs, simulates the moving path, and outputs a density map for evaluation. The density map predicts how likely a person is to be present at a given location. A subsequent discussion illustrates how architects can use the density map to improve their floorplan design.
https://arxiv.org/abs/1807.03760
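The SimArch abstract above (1807.03760) mentions a modified weighted A* planner without describing the modification. The sketch below is therefore plain textbook weighted A* on a 4-connected grid, offered only to show why the planned path is fast to compute but possibly sub-optimal; the grid encoding, the `weight` default, and the tiny floorplan are assumptions.

```python
import heapq

def weighted_a_star(grid, start, goal, weight=1.5):
    """Weighted A* on a 4-connected grid; grid[r][c] == 1 marks an obstacle.

    Nodes are ordered by f = g + weight * h with a Manhattan-distance heuristic.
    With weight > 1 the search is greedier and the path cost is at most
    `weight` times the optimum. Returns a list of (row, col) cells or None.
    """
    def heuristic(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    best_cost = {start: 0}
    parent = {start: None}
    frontier = [(weight * heuristic(start), 0, start)]
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    parent[nxt] = cell
                    heapq.heappush(
                        frontier,
                        (new_cost + weight * heuristic(nxt), new_cost, nxt))
    return None

# Hypothetical floorplan: a wall with a single doorway at column 2.
floorplan = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = weighted_a_star(floorplan, (0, 0), (2, 0))
```

Counting how often simulated agents visit each cell along such paths would be one simple way to accumulate a density map like the one the abstract describes.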
We introduce a data-driven forecasting method for high-dimensional chaotic systems using long short-term memory (LSTM) recurrent neural networks. The proposed LSTM neural networks perform inference of high-dimensional dynamical systems in their reduced order space and are shown to be an effective set of nonlinear approximators of their attractor. We demonstrate the forecasting performance of the LSTM and compare it with Gaussian processes (GPs) in time series obtained from the Lorenz 96 system, the Kuramoto-Sivashinsky equation and a prototype climate model. The LSTM networks outperform the GPs in short-term forecasting accuracy in all applications considered. A hybrid architecture, extending the LSTM with a mean stochastic model (MSM-LSTM), is proposed to ensure convergence to the invariant measure. This novel hybrid method is fully data-driven and extends the forecasting capabilities of LSTM networks.
https://arxiv.org/abs/1802.07486
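One of the benchmarks named in the abstract above (1802.07486) is the Lorenz 96 system, defined by dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F with periodic indices. The following sketch generates a trajectory of the kind a forecasting model could be trained on; the dimension (40), forcing (F = 8), step size, and integrator are common choices assumed here, not values taken from the paper.

```python
def lorenz96_rhs(x, forcing=8.0):
    """Time derivative of the Lorenz 96 system with periodic indices:
    dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F. Requires len(x) >= 3."""
    n = len(x)
    # Negative indices wrap around in Python, which gives the periodic boundary.
    return [(x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + forcing
            for i in range(n)]

def rk4_step(x, dt, forcing=8.0):
    """One classical fourth-order Runge-Kutta step."""
    k1 = lorenz96_rhs(x, forcing)
    k2 = lorenz96_rhs([xi + 0.5 * dt * k for xi, k in zip(x, k1)], forcing)
    k3 = lorenz96_rhs([xi + 0.5 * dt * k for xi, k in zip(x, k2)], forcing)
    k4 = lorenz96_rhs([xi + dt * k for xi, k in zip(x, k3)], forcing)
    return [xi + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

def simulate(x0, steps, dt=0.01, forcing=8.0):
    trajectory = [list(x0)]
    for _ in range(steps):
        trajectory.append(rk4_step(trajectory[-1], dt, forcing))
    return trajectory

# Start near the uniform equilibrium x_i = F and perturb one component.
state = [8.0] * 40
state[0] += 0.01
trajectory = simulate(state, steps=500)
```

The uniform state x_i = F is an equilibrium of the equations, so the small perturbation is what sets the dynamics in motion.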
Attention mechanisms have attracted considerable interest in image captioning because of their powerful performance. Existing attention-based models use feedback information from the caption generator as guidance to determine which of the image features should be attended to. A common defect of these attention generation methods is that they lack higher-level guiding information from the image itself, which sets a limit on selecting the most informative image features. Therefore, in this paper, we propose a novel attention mechanism, called topic-guided attention, which integrates image topics in the attention model as guiding information to help select the most important image features. Moreover, we extract image features and image topics with separate networks, which can be fine-tuned jointly in an end-to-end manner during training. The experimental results on the benchmark Microsoft COCO dataset show that our method yields state-of-the-art performance on various quantitative metrics.
https://arxiv.org/abs/1807.03514
We’d like to share a simple tweak of Single Shot Multibox Detector (SSD) family of detectors, which is effective in reducing model size while maintaining the same quality. We share box predictors across all scales, and replace convolution between scales with max pooling. This has two advantages over vanilla SSD: (1) it avoids score miscalibration across scales; (2) the shared predictor sees the training data over all scales. Since we reduce the number of predictors to one, and trim all convolutions between them, model size is significantly smaller. We empirically show that these changes do not hurt model quality compared to vanilla SSD.
https://arxiv.org/abs/1807.03284
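The two changes described in the SSD abstract above (max pooling between scales, one predictor shared by all scales) can be sketched in a few lines. This is a toy illustration only: feature maps are plain 2D lists with a single scalar channel, and the "predictor" is a hypothetical one-parameter linear scorer, not the paper's box predictor.

```python
def max_pool_2x2(fmap):
    """Downsample a 2D feature map (list of rows) with a 2x2 max pool, stride 2."""
    rows, cols = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def build_pyramid(fmap, num_scales):
    """Build the multi-scale pyramid with parameter-free pooling, no convolutions."""
    pyramid = [fmap]
    for _ in range(num_scales - 1):
        pyramid.append(max_pool_2x2(pyramid[-1]))
    return pyramid


def shared_predictor(fmap, weight, bias):
    """Apply the same (weight, bias) at every location of whichever scale is given."""
    return [[weight * value + bias for value in row] for row in fmap]


# Toy 8x8 base feature map; three scales: 8x8, 4x4, 2x2.
base = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
pyramid = build_pyramid(base, num_scales=3)

# A single set of predictor parameters is reused for every scale.
scores = [shared_predictor(level, weight=0.5, bias=-1.0) for level in pyramid]
```

Because the pooling layers have no parameters and the predictor parameters are stored once, the only learned weights beyond the backbone are those of the single predictor, which is where the size reduction comes from.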
Though impressive results have been achieved in visual captioning, the task of generating abstract stories from photo streams is still a little-tapped problem. Different from captions, stories have more expressive language styles and contain many imaginary concepts that do not appear in the images. Thus it poses challenges to behavioral cloning algorithms. Furthermore, due to the limitations of automatic metrics on evaluating story quality, reinforcement learning methods with hand-crafted rewards also face difficulties in gaining an overall performance boost. Therefore, we propose an Adversarial REward Learning (AREL) framework to learn an implicit reward function from human demonstrations, and then optimize policy search with the learned reward function. Though automatic evaluation indicates a slight performance boost over state-of-the-art (SOTA) methods in cloning expert behaviors, human evaluation shows that our approach achieves significant improvement in generating more human-like stories than SOTA systems.
https://arxiv.org/abs/1804.09160
Recurrent networks are among the most powerful and promising artificial neural network algorithms for processing sequential data such as natural language, sound, and time series. Unlike a traditional feed-forward network, a recurrent network has an inherent feedback loop that allows it to store temporal context information and pass the state along the entire sequence of events. This helps to achieve state-of-the-art performance in many important tasks such as language modeling, stock market prediction, image captioning, speech recognition, machine translation, and object tracking. However, training a fully connected RNN and managing the gradient flow are complicated processes. Many studies have been carried out to address these limitations. This article is intended to provide brief details about recurrent neurons, their variants, and tips & tricks for training fully recurrent neural networks. This review work was carried out as a part of our IPO studio software module ‘Multiple Object Tracking’.
https://arxiv.org/abs/1807.02857
The explosion of video data on the internet requires effective and efficient technology to generate captions automatically for people who are not able to watch the videos. Despite the great progress of video captioning research, particularly on video feature encoding, the language decoder is still largely based on the prevailing RNN decoder such as LSTM, which tends to prefer the frequent word that aligns with the video. In this paper, we propose a boundary-aware hierarchical language decoder for video captioning, which consists of a high-level GRU-based language decoder, working as a global (caption-level) language model, and a low-level GRU-based language decoder, working as a local (phrase-level) language model. Most importantly, we introduce a binary gate into the low-level GRU language decoder to detect the language boundaries. Together with other advanced components including joint video prediction, shared soft attention, and boundary-aware video encoding, our integrated video captioning framework can discover hierarchical language information and distinguish the subject and the object in a sentence, which are usually confused during language generation. Extensive experiments on two widely used video captioning datasets, MSR-Video-to-Text (MSR-VTT) and YouTube-to-Text (MSVD), show that our method is highly competitive compared with the state-of-the-art methods.
https://arxiv.org/abs/1807.03658
Good and robust sensor data fusion in diverse weather conditions is quite a challenging task. There are several fusion architectures in the literature, e.g. the sensor data can be fused right at the beginning (Early Fusion), or they can be first processed separately and then concatenated later (Late Fusion). In this work, different fusion architectures are compared and evaluated by means of object detection tasks, in which the goal is to recognize and localize predefined objects in a stream of data. Usually, state-of-the-art object detectors based on neural networks are highly optimized for good weather conditions, since the well-known benchmarks only consist of sensor data recorded in optimal weather conditions. Therefore, these approaches degrade enormously or even fail in adverse weather conditions. In this work, different sensor fusion architectures are compared for good and adverse weather conditions in order to find the optimal fusion architecture for diverse weather situations. A new training strategy is also introduced such that the performance of the object detector is greatly enhanced in adverse weather scenarios or if a sensor fails. Furthermore, the paper addresses the question of whether the detection accuracy can be increased further by providing the neural network with a-priori knowledge such as the spatial calibration of the sensors.
https://arxiv.org/abs/1807.02323
Datacenter applications demand both low latency and high throughput; while interactive applications (e.g., Web Search) demand low tail latency for their short messages due to their partition-aggregate software architecture, many data-intensive applications (e.g., Map-Reduce) require high throughput for long flows as they move vast amounts of data across the network. Recent proposals improve latency of short flows and throughput of long flows by addressing the shortcomings of existing packet scheduling and congestion control algorithms, respectively. We make the key observation that long tails in the Flow Completion Times (FCT) of short flows result from packets that suffer congestion at more than one switch along their paths in the network. Our proposal, Slytherin, specifically targets packets that suffered from congestion at multiple points and prioritizes them in the network. Slytherin leverages the ECN mechanism, which is widely used in existing datacenters, to identify such tail packets and dynamically prioritizes them using existing priority queues. Compared to existing state-of-the-art packet scheduling proposals, Slytherin achieves 18.6% lower 99th percentile flow completion times for short flows without any loss of throughput. Further, Slytherin reduces the 99th percentile queue length in switches by a factor of about 2x on average.
https://arxiv.org/abs/1807.02184
Average precision (AP), the area under the recall-precision (RP) curve, is the standard performance measure for object detection. Despite its wide acceptance, it has a number of shortcomings, the most important of which are (i) the inability to distinguish very different RP curves, and (ii) the lack of a direct measure of bounding box localization accuracy. In this paper, we propose ‘Localization Recall Precision (LRP) Error’, a new metric which we specifically designed for object detection. LRP Error is composed of three components related to localization, false negative (FN) rate and false positive (FP) rate. Based on LRP, we introduce the ‘Optimal LRP’, the minimum achievable LRP error representing the best achievable configuration of the detector in terms of recall-precision and the tightness of the boxes. In contrast to AP, which considers precisions over the entire recall domain, Optimal LRP determines the ‘best’ confidence score threshold for a class, which balances the trade-off between localization and recall-precision. In our experiments, we show that, for state-of-the-art (SOTA) object detectors, Optimal LRP provides richer and more discriminative information than AP. We also demonstrate that the best confidence score thresholds vary significantly among classes and detectors. Moreover, we present LRP results of a simple online video object detector which uses a SOTA still image object detector and show that the class-specific optimized thresholds increase the accuracy against the common approach of using a general threshold for all classes. At this https URL we provide the source code that can compute LRP for the PASCAL VOC and MSCOCO datasets. Our source code can easily be adapted to other datasets as well.
https://arxiv.org/abs/1807.01696
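To make the baseline metric in the abstract above concrete, here is a minimal sketch of AP as the area under the recall-precision curve for a single class. It uses the plain, non-interpolated sum of precision times recall increment; benchmarks such as PASCAL VOC and MS COCO apply their own interpolation rules, so this is one simple convention rather than any benchmark's official procedure. The detection list is made-up example data, and the matching of detections to ground truth (the true-positive flags) is assumed to have been done already.

```python
def average_precision(detections, num_ground_truth):
    """Area under the recall-precision curve for one class.

    detections: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of ground-truth objects of this class.
    """
    # Sweep the confidence threshold from high to low.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    previous_recall = 0.0
    area = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
        precision = true_positives / rank
        recall = true_positives / num_ground_truth
        # Only steps that increase recall contribute to the area.
        area += precision * (recall - previous_recall)
        previous_recall = recall
    return area


# Hypothetical detections for one class with two ground-truth objects.
example = [(0.9, True), (0.8, False), (0.6, True)]
ap = average_precision(example, num_ground_truth=2)
```

Note that the true-positive flag discards how tightly each matched box fits, which is the localization shortcoming (ii) that the abstract raises.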
This paper proposes an approach for rapid bounding box annotation for object detection datasets. The procedure consists of two stages: the first stage is to annotate a part of the dataset manually, and the second stage proposes annotations for the remaining samples using a model trained with the first-stage annotations. We experimentally study which first/second stage split minimizes the total workload. In addition, we introduce a new fully labeled object detection dataset collected from indoor scenes. Compared to other indoor datasets, our collection has more class categories, different backgrounds, lighting conditions, occlusion and high intra-class differences. We train deep learning based object detectors with a number of state-of-the-art models and compare them in terms of speed and accuracy. The fully annotated dataset is released freely for the research community.
https://arxiv.org/abs/1807.03142
Based on the Just-Noticeable-Difference (JND) criterion, a subjective video quality assessment (VQA) dataset, called the VideoSet, was constructed recently. In this work, we propose a JND-based VQA model using a probabilistic framework to analyze and clean collected subjective test data. While most traditional VQA models focus on content variability, our proposed VQA model takes both subject and content variabilities into account. The model parameters used to describe subject and content variabilities are jointly optimized by solving a maximum likelihood estimation (MLE) problem. As an application, the new subjective VQA model is used to filter out unreliable video quality scores collected in the VideoSet. Experiments are conducted to demonstrate the effectiveness of the proposed model.
https://arxiv.org/abs/1807.00920
Most machine learning methods are known to capture and exploit biases of the training data. While some biases are beneficial for learning, others are harmful. Specifically, image captioning models tend to exaggerate biases present in training data. This can lead to incorrect captions in domains where unbiased captions are desired, or required, due to overreliance on the learned prior and image context. We investigate generation of gender-specific caption words (e.g. man, woman) based on the person’s appearance or the image context. We introduce a new Equalizer model that ensures equal gender probability when gender evidence is occluded in a scene and confident predictions when gender evidence is present. The resulting model is forced to look at a person rather than use contextual cues to make a gender-specific prediction. The losses that comprise our model, the Appearance Confusion Loss and the Confident Loss, are general, and can be added to any description model in order to mitigate impacts of unwanted bias in a description dataset. Our proposed model has lower error than prior work when describing images with people and mentioning their gender and more closely matches the ground truth ratio of sentences including women to sentences including men.
https://arxiv.org/abs/1807.00517
In this article, we use deep neural networks (DNNs) to develop a wireless end-to-end communication system, in which DNNs are employed for all signal-related functionalities, such as encoding, decoding, modulation, and equalization. However, accurate instantaneous channel transfer function, i.e., the channel state information (CSI), is necessary to compute the gradient of the DNN representing. In many communication systems, the channel transfer function is hard to obtain in advance and varies with time and location. In this article, this constraint is released by developing a channel agnostic end-to-end system that does not rely on any prior information about the channel. We use a conditional generative adversarial net (GAN) to represent the channel effects, where the encoded signal of the transmitter will serve as the conditioning information. In addition, in order to deal with the time-varying channel, the received signal corresponding to the pilot data can also be added as a part of the conditioning information. From the simulation results, the proposed method is effective on additive white Gaussian noise (AWGN) and Rayleigh fading channels, which opens a new door for building data-driven communication systems.
https://arxiv.org/abs/1807.00447
Recently, neural machine translation has achieved remarkable progress by introducing well-designed deep neural networks into its encoder-decoder framework. From the optimization perspective, residual connections are adopted to improve learning performance for both encoder and decoder in most of these deep architectures, and advanced attention connections are applied as well. Inspired by the success of the DenseNet model in computer vision problems, in this paper, we propose a densely connected NMT architecture (DenseNMT) that is able to train more efficiently for NMT. The proposed DenseNMT not only allows dense connection in creating new features for both encoder and decoder, but also uses the dense attention structure to improve attention quality. Our experiments on multiple datasets show that DenseNMT structure is more competitive and efficient.
https://arxiv.org/abs/1806.00722
Much work has been done on salient object detection using supervised or unsupervised approaches on colour images. Recently, a few studies demonstrated that efficient salient object detection can also be implemented by using spectral features in the visible spectrum of hyperspectral images from natural scenes. However, these models for hyperspectral salient object detection were tested on very few data selected from various online public datasets, which were not specifically created for object detection purposes. Therefore, here, we aim to contribute to the field by releasing a hyperspectral salient object detection dataset with a collection of 60 hyperspectral images with their respective ground-truth binary images and representative rendered colour images (sRGB). We took several aspects into consideration during the data collection, such as variation in object size, number of objects, foreground-background contrast, object position in the image, etc. Then, we prepared ground-truth binary images for each hyperspectral image, where salient objects are labelled on the images. Finally, we evaluated the performance of some existing hyperspectral saliency detection models from the literature using the Area Under Curve (AUC) metric.
https://arxiv.org/abs/1806.11314
User response prediction is a crucial component for personalized information retrieval and filtering scenarios, such as recommender systems and web search. The data in user response prediction is mostly in a multi-field categorical format and transformed into sparse representations via one-hot encoding. Due to the sparsity problems in representation and optimization, most research focuses on feature engineering and shallow modeling. Recently, deep neural networks have attracted research attention on such a problem for their high capacity and end-to-end training scheme. In this paper, we study user response prediction in the scenario of click prediction. We first analyze a coupled gradient issue in latent vector-based models and propose kernel product to learn field-aware feature interactions. Then we discuss an insensitive gradient issue in DNN-based models and propose Product-based Neural Network (PNN) which adopts a feature extractor to explore feature interactions. Generalizing the kernel product to a net-in-net architecture, we further propose Product-network In Network (PIN) which can generalize previous models. Extensive experiments on 4 industrial datasets and 1 contest dataset demonstrate that our models consistently outperform 8 baselines on both AUC and log loss. Besides, PIN achieves a large CTR improvement (34.67% relative) in an online A/B test.
https://arxiv.org/abs/1807.00311
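The multi-field categorical input format mentioned in the abstract above can be illustrated with a small one-hot encoder. The field names and values below are invented for the example. The sparse form stores only the index of the single active entry per field, which is why such data is usually kept as index lists rather than dense vectors.

```python
def build_vocab(records, fields):
    """Assign every (field, value) pair seen in the data its own column index."""
    vocab = {}
    for record in records:
        for field in fields:
            key = (field, record[field])
            if key not in vocab:
                vocab[key] = len(vocab)
    return vocab


def one_hot_sparse(record, fields, vocab):
    """Sparse one-hot encoding: the index of the active column for each field."""
    return [vocab[(field, record[field])] for field in fields]


def to_dense(indices, size):
    """Expand the sparse indices into the full 0/1 vector (for illustration only)."""
    vector = [0] * size
    for index in indices:
        vector[index] = 1
    return vector


# Hypothetical click-log records with three categorical fields.
FIELDS = ["weekday", "gender", "city"]
records = [
    {"weekday": "Tue", "gender": "F", "city": "London"},
    {"weekday": "Mon", "gender": "F", "city": "Paris"},
]

vocab = build_vocab(records, FIELDS)
sparse = one_hot_sparse(records[1], FIELDS, vocab)
dense = to_dense(sparse, len(vocab))
```

Each record activates exactly one column per field, so the dense vector grows with the total number of distinct values while the number of non-zero entries stays equal to the number of fields.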
Context is important for accurate visual recognition. In this work we propose an object detection algorithm that not only considers object visual appearance, but also makes use of two kinds of context: scene contextual information and object relationships within a single image. Object detection is therefore regarded as both a cognition problem and a reasoning problem when leveraging this structured information. Specifically, this paper formulates object detection as a problem of graph structure inference, where, given an image, the objects are treated as nodes in a graph and relationships between the objects are modeled as edges in that graph. To this end, we present a so-called Structure Inference Network (SIN), a detector that incorporates a graphical model, which aims to infer object state, into a typical detection framework (e.g. Faster R-CNN). Comprehensive experiments on PASCAL VOC and MS COCO datasets indicate that scene context and object relationships truly improve the performance of object detection with more desirable and reasonable outputs.
https://arxiv.org/abs/1807.00119
Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; and video description technologies, in which descriptions/captions are generated from videos using multimodal information. We introduce a new dataset of dialogs about videos of human behaviors. Each dialog is a typed conversation that consists of a sequence of 10 question-and-answer (QA) pairs between two Amazon Mechanical Turk (AMT) workers. In total, we collected dialogs on roughly 9,000 videos. Using this new dataset for Audio Visual Scene-aware dialog (AVSD), we trained an end-to-end conversation model that generates responses in a dialog about a video. Our experiments demonstrate that using multimodal features that were developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes (videos). Our dataset, model code and pretrained models will be publicly available for a new Video Scene-Aware Dialog challenge.
https://arxiv.org/abs/1806.08409
The nature of dispersal of many invasive pests and pathogens in agriculture and forestry makes it necessary to consider how the actions of one manager affect neighbouring properties. In addition to the direct effects of a potential spread of a pest and the resulting economic loss, there are also indirect consequences that affect whole regions and that require coordinated actions to manage and/or to eradicate it (like movement restrictions). In this paper we address the emergence and stability of cooperation among agents who respond to a threat of an invasive pest or disease. The model, based on the weakest-link paradigm, uses repeated multi-participant coordination games where players’ pay-offs depend on management decisions to prevent the invasion on their own land as well as of their neighbours on a network. We show that for the basic cooperation game agents select the risk-dominant strategy of a Stag hunt game over the pay-off dominant strategy of implementing control measures. However, cooperation can be achieved by the social planner offering a biosecurity payment. The critical level of this payment depends on the details of the decision-making process, with higher trust (based on a reputation of other agents reflecting their past performance) allowing a significant reduction in necessary payments and slowing down decay in cooperation when the payment is low. We also find that allowing for uncertainty in the decision-making process can enhance cooperation for low levels of payments. Finally, we show the importance of industry structure to the emergence of cooperation, with an increase in the average coordination number of network nodes leading to an increase in the critical biosecurity payment.
https://arxiv.org/abs/1807.00701
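The distinction the abstract above draws between the risk-dominant and pay-off dominant strategies can be seen in a two-player symmetric Stag hunt. The sketch below is a deliberate simplification: the paper studies repeated multi-participant games on a network, whereas this example uses a single 2x2 game with invented pay-off numbers and a payment that is simply added to the pay-off of the "control" action. It applies the standard Harsanyi-Selten comparison of deviation losses.

```python
def classify_stag_hunt(a, b, c, d):
    """Classify the two pure equilibria of a symmetric 2x2 coordination game.

    Row player's pay-offs: a = u(control, control), b = u(control, none),
    c = u(none, control), d = u(none, none).
    Returns (payoff_dominant, risk_dominant) equilibrium labels.
    """
    if not (a > c and d > b):
        raise ValueError("not a coordination game with two pure equilibria")
    payoff_dominant = "control" if a > d else "no control"
    # The equilibrium with the larger loss from unilateral deviation is risk-dominant.
    risk_dominant = "control" if (a - c) > (d - b) else "no control"
    return payoff_dominant, risk_dominant


def with_payment(a, b, c, d, payment):
    """Add a biosecurity payment to every pay-off of the 'control' action."""
    return a + payment, b + payment, c, d


def critical_payment(a, b, c, d):
    """Smallest payment above which 'control' becomes risk-dominant in this toy game."""
    return max(0.0, ((d - b) - (a - c)) / 2.0)


# Hypothetical pay-offs: joint control is best, but controlling alone is costly.
game = (4.0, 0.0, 3.0, 3.0)
before = classify_stag_hunt(*game)
threshold = critical_payment(*game)
after = classify_stag_hunt(*with_payment(*game, payment=1.5))
```

In this toy game, joint control pays more but "no control" is the safer choice until the payment exceeds the threshold, which mirrors the role of the biosecurity payment described in the abstract without reproducing its network model.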
Generative adversarial networks (GANs) are a novel approach to generative modelling, a task whose goal it is to learn a distribution of real data points. They have often proved difficult to train: GANs are unlike many techniques in machine learning, in that they are best described as a two-player game between a discriminator and generator. This has yielded both unreliability in the training process, and a general lack of understanding as to how GANs converge, and if so, to what. The purpose of this dissertation is to provide an account of the theory of GANs suitable for the mathematician, highlighting both positive and negative results. This involves identifying the problems when training GANs, and how topological and game-theoretic perspectives of GANs have contributed to our understanding and improved our techniques in recent years.
https://arxiv.org/abs/1806.11382
Synthesizing images or texts automatically is a useful research area in artificial intelligence. Generative adversarial networks (GANs), proposed by Goodfellow in 2014, allow this task to be done more efficiently by using deep neural networks. We consider generating corresponding images from an input text description using a GAN. In this paper, we analyze the GAN-CLS algorithm, an advanced GAN method proposed by Scott Reed in 2016. First, we identify a problem with this algorithm through inference. Then we correct the GAN-CLS algorithm according to the inference by modifying the objective function of the model. Finally, we run experiments on the Oxford-102 dataset and the CUB dataset. As a result, our modified algorithm can generate images which are more plausible than those of the GAN-CLS algorithm in some cases. Also, some of the generated images match the input texts better.
https://arxiv.org/abs/1806.11302
This notebook paper presents an overview and comparative analysis of our systems designed for the following five tasks in ActivityNet Challenge 2018: temporal action proposals, temporal action localization, dense-captioning events in videos, trimmed action recognition, and spatio-temporal action localization.
https://arxiv.org/abs/1807.00686
Although attention-based Neural Machine Translation (NMT) has achieved remarkable progress in recent years, it still suffers from issues of repeating and dropping translations. To alleviate these issues, we propose a novel key-value memory-augmented attention model for NMT, called KVMEMATT. Specifically, we maintain a timely updated key-memory to keep track of attention history and a fixed value-memory to store the representation of the source sentence throughout the whole translation process. Via nontrivial transformations and iterative interactions between the two memories, the decoder focuses on more appropriate source word(s) for predicting the next target word at each decoding step, and can therefore improve the adequacy of translations. Experimental results on Chinese=>English and WMT17 German<=>English translation tasks demonstrate the superiority of the proposed model.
https://arxiv.org/abs/1806.11249