With the rapid advances in the area of deep neural networks, several challenging image-based tasks have recently been approached by researchers in pattern recognition and computer vision. In this paper, we address one of these tasks, which is to match image content with natural language descriptions, sometimes referred to as multimodal content retrieval. Such a task is particularly challenging considering that we must find a semantic correspondence between captions and the respective image, a challenge for both the computer vision and natural language processing areas. To this end, we propose a novel multimodal approach based solely on convolutional neural networks for aligning images with their captions by directly convolving raw characters. Our proposed character-based textual embeddings allow the replacement of both word embeddings and recurrent neural networks for text understanding, saving processing time and requiring fewer learnable parameters. Our method is based on the idea of projecting both visual and textual information into a common embedding space. To train such embeddings, we optimize a contrastive loss function that is computed to minimize order-violations between images and their respective descriptions. We achieve state-of-the-art performance on the largest and most well-known image-text alignment dataset, namely Microsoft COCO, with a method that is conceptually much simpler and has considerably fewer parameters than current approaches.
https://arxiv.org/abs/1706.00999
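As an illustration of the loss described above, here is a minimal PyTorch sketch of a contrastive objective that penalizes order-violations between caption and image embeddings. The margin value and the shift-by-one negative sampling are illustrative assumptions, not the authors' exact formulation:

```python
import torch

def order_violation(cap, img):
    # Penalty incurred when the caption embedding exceeds the image embedding
    # coordinate-wise; zero when the pair satisfies the partial order.
    return torch.clamp(cap - img, min=0).pow(2).sum(dim=-1)

def alignment_loss(cap_emb, img_emb, margin=0.05):
    # cap_emb, img_emb: (batch, dim) non-negative embeddings of matched pairs.
    pos = order_violation(cap_emb, img_emb)                  # push toward 0
    neg = order_violation(cap_emb.roll(1, dims=0), img_emb)  # mismatched pairs
    return pos.sum() + torch.clamp(margin - neg, min=0).sum()
```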
In this paper, we address the basic problem of recognizing moving objects in video images using a Visual Vocabulary model and Bag of Words, and track our object of interest in the subsequent video frames using species-inspired PSO. Initially, shadow-free images are obtained by background modeling followed by foreground modeling to extract the blobs of our object of interest. Subsequently, we train a cubic SVM with human body datasets in accordance with our domain of interest for recognition and tracking. During training, using the principle of Bag of Words, we extract the necessary features of certain domains and objects for classification. Subsequently, by matching these feature sets with those of the extracted object blobs that are obtained by subtracting the shadow-free background from the foreground, we successfully detect our object of interest in the test domain. The performance of the classification by the cubic SVM is represented by a confusion matrix and ROC curve, reflecting the accuracy of each module. After classification, our object of interest is tracked in the test domain using species-inspired PSO. By combining adaptive learning tools with efficient classification by description, we achieve optimum accuracy in recognition of the moving objects. We evaluate our algorithm on the benchmark datasets iLIDS, VIVID, Walking2, and Woman. Comparative analysis of our algorithm against existing state-of-the-art trackers shows very satisfactory and competitive results.
https://arxiv.org/abs/1707.05224
Deep convolutional neural network (CNN) based salient object detection methods have achieved state-of-the-art performance and outperform unsupervised methods by a wide margin. In this paper, we propose to integrate deep and unsupervised saliency for salient object detection under a unified framework. Specifically, our method takes the results of unsupervised saliency (Robust Background Detection, RBD) and normalized color images as inputs, and directly learns an end-to-end mapping between the inputs and the corresponding saliency maps. The color images are fed into a Fully Convolutional Neural Network (FCNN) adapted from semantic segmentation to exploit high-level semantic cues for salient object detection. The results from the deep FCNN and RBD are then concatenated and fed into a shallow network that maps the concatenated feature maps to saliency maps. Finally, to obtain a spatially consistent saliency map with sharp object boundaries, we fuse superpixel-level saliency maps at multiple scales. Extensive experimental results on 8 benchmark datasets demonstrate that the proposed method outperforms state-of-the-art approaches by a clear margin.
https://arxiv.org/abs/1706.00530
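The fusion step described above can be pictured with a short PyTorch sketch: the deep (FCNN) and unsupervised (RBD) saliency maps are concatenated and passed through a shallow network. The layer widths and kernel sizes below are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class SaliencyFusion(nn.Module):
    # Shallow network mapping concatenated deep and handcrafted saliency maps
    # to a fused saliency map.
    def __init__(self):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=1), nn.Sigmoid(),
        )

    def forward(self, deep_sal, rbd_sal):
        # deep_sal, rbd_sal: (batch, 1, H, W) saliency maps in [0, 1].
        return self.fuse(torch.cat([deep_sal, rbd_sal], dim=1))
```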
For decades, conventional computers based on the von Neumann architecture have performed computation by repeatedly transferring data between their processing and memory units, which are physically separated. As computation becomes increasingly data-centric and as the scalability limits in terms of performance and power are being reached, alternative computing paradigms are being sought in which computation and storage are collocated. A fascinating new approach is that of computational memory, where the physics of nanoscale memory devices is used to perform certain computational tasks within the memory unit in a non-von Neumann manner. Here we present a large-scale experimental demonstration using one million phase-change memory devices organized to perform a high-level computational primitive by exploiting the crystallization dynamics. Also presented is an application of such a computational memory to process real-world datasets. The results show that this co-existence of computation and storage at the nanometer scale could be the enabler for new, ultra-dense, low-power, and massively parallel computing systems.
https://arxiv.org/abs/1706.00511
Partially observable environments present an important open challenge in the domain of sequential control learning with delayed rewards. Despite numerous attempts during the last two decades, the majority of reinforcement learning algorithms and associated approximate models applied to this context still assume Markovian state transitions. In this paper, we explore the use of a recently proposed attention-based model, the Gated End-to-End Memory Network, for sequential control. We call the resulting model the Gated End-to-End Memory Policy Network. More precisely, we use a model-free value-based algorithm to learn policies for partially observed domains using this memory-enhanced neural network. This model is end-to-end learnable and features unbounded memory. Indeed, its attention mechanism and associated non-parametric memory allow the model to attend over the entire observation stream, unlike recurrent models. We show encouraging results that illustrate the capability of our attention-based model in the context of the continuous-state non-stationary control problem of stock trading. We also present an OpenAI Gym environment for a simulated stock exchange and explain its relevance as a benchmark for the field of non-Markovian decision process learning.
https://arxiv.org/abs/1705.10993
In this work, we propose to utilize Convolutional Neural Networks to boost the performance of depth-induced salient object detection by capturing the high-level representative features of the depth modality. We formulate depth-induced saliency detection as a CNN-based cross-modal transfer problem to bridge the gap between the “data-hungry” nature of CNNs and the unavailability of sufficient labeled training data in the depth modality. In the proposed approach, we leverage auxiliary data from the source modality effectively by training the RGB saliency detection network to obtain task-specific pre-understanding layers for the target modality. Meanwhile, we exploit depth-specific information by pre-training a modality classification network that encourages modality-specific representations during optimization, making the feature representations of the RGB and depth modalities as discriminative as possible. These two modules are pre-trained independently and then stitched together to initialize and optimize the eventual depth-induced saliency detection model. Experiments demonstrate the effectiveness of the proposed novel pre-training strategy as well as the significant and consistent improvements of the proposed approach over other state-of-the-art methods.
https://arxiv.org/abs/1703.00122
In cognitive radio networks (CRNs), secondary users (SUs) can proactively obtain spectrum access opportunities by helping with primary users’ (PUs’) data transmissions. Currently, such spectrum access is implemented via a cooperative-communications-based, link-level, frame-based cooperative (LLC) approach in which individual SUs independently serve as relays for PUs in order to gain spectrum access opportunities. Unfortunately, this LLC approach cannot fully exploit spectrum access opportunities to enhance the throughput of CRNs and fails to motivate PUs to join the spectrum sharing processes. To address these challenges, we propose a network-level session-based cooperative (NLC) approach in which SUs are grouped together to cooperate with PUs session by session, instead of frame by frame as in existing works, for spectrum access opportunities of the corresponding group. Thanks to this group-based session-by-session cooperating strategy, our NLC approach is able to address all of the challenges of the LLC approach. To articulate our NLC approach, we further develop an NLC scheme under a cognitive capacity harvesting network (CCHN) architecture. We formulate the cooperative mechanism design as a cross-layer optimization problem with constraints on primary session selection, flow routing, and link scheduling. To search for solutions to the optimization problem, we propose an augmented scheduling index ordering based (SIO-based) algorithm to identify maximal independent sets. Through extensive simulations, we demonstrate the effectiveness of the proposed NLC approach and the superiority of the augmented SIO-based algorithm over the traditional method.
https://arxiv.org/abs/1705.10281
In this paper, we propose the first model able to generate visually grounded questions with diverse types for a single image. Visual question generation is an emerging topic which aims to ask questions in natural language based on visual input. To the best of our knowledge, there is a lack of automatic methods to generate meaningful questions of various types for the same visual input. To address this problem, we propose a model that automatically generates visually grounded questions of varying types. Our model takes as input both images and the captions generated by a dense caption model, samples the most probable question types, and generates the questions in sequence. Experimental results on two real-world datasets show that our model outperforms the strongest baseline in terms of both correctness and diversity by a wide margin.
https://arxiv.org/abs/1612.06530
This paper proposes a novel method to estimate the global scale of a 3D reconstructed model within a Kalman filtering-based monocular SLAM algorithm. Our Bayesian framework integrates height priors over detected objects belonging to a set of broad predefined classes, building on recent advances in fast generic object detection. Each observation is produced on a single frame, so no data association process across video frames is needed: we associate the height priors with the image region sizes at the locations where map feature projections fall within the object detection regions. We present very promising results of this approach obtained in several experiments with different object classes.
https://arxiv.org/abs/1705.09860
We propose a novel technique to make neural networks robust to adversarial examples using a generative adversarial network. We alternately train both the classifier and generator networks. The generator network generates an adversarial perturbation that can easily fool the classifier network by using the gradient of each image. Simultaneously, the classifier network is trained to correctly classify both the original and adversarial images generated by the generator. These procedures help the classifier network become more robust to adversarial perturbations. Furthermore, our adversarial training framework efficiently reduces overfitting and outperforms other regularization methods such as Dropout. We applied our method to supervised learning on the CIFAR datasets, and experimental results show that our method significantly lowers the generalization error of the network. To the best of our knowledge, this is the first method which uses a GAN to improve supervised learning.
https://arxiv.org/abs/1705.03387
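A minimal PyTorch sketch of the alternating training scheme is given below. In the paper the generator consumes the image gradient; here it consumes the image directly, and the perturbation bound eps and the loss signs are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def adversarial_train_step(classifier, generator, images, labels,
                           opt_c, opt_g, eps=8 / 255):
    # Generator step: craft a bounded perturbation that raises classifier loss.
    adv = (images + eps * generator(images)).clamp(0, 1)  # generator in [-1, 1]
    g_loss = -F.cross_entropy(classifier(adv), labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    # Classifier step: classify both clean and adversarial images correctly.
    adv = (images + eps * generator(images)).clamp(0, 1).detach()
    c_loss = (F.cross_entropy(classifier(images), labels)
              + F.cross_entropy(classifier(adv), labels))
    opt_c.zero_grad(); c_loss.backward(); opt_c.step()
```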
We propose an object detection method that improves the accuracy of the conventional SSD (Single Shot Multibox Detector), which is one of the top object detection algorithms in terms of both accuracy and speed. The performance of a deep network is known to improve as the number of feature maps increases. However, it is difficult to improve the performance by simply raising the number of feature maps. In this paper, we propose and analyze how to use feature maps effectively to improve the performance of the conventional SSD. The enhanced performance was obtained by changing the structure close to the classifier network, rather than growing layers close to the input data, e.g., by replacing VGGNet with ResNet. The proposed network allows weight sharing across the classifier networks, a property that makes training faster and improves generalization. For the Pascal VOC 2007 test set, trained with the VOC 2007 and VOC 2012 training sets, the proposed network with an input size of 300 x 300 achieved 78.5% mAP (mean average precision) at a speed of 35.0 FPS (frames per second), while the network with a 512 x 512 input achieved 80.8% mAP at 16.6 FPS on an Nvidia Titan X GPU. The proposed network shows state-of-the-art mAP, better than that of the conventional SSD, YOLO, Faster-RCNN, and RFCN, and it is faster than Faster-RCNN and RFCN.
https://arxiv.org/abs/1705.09587
During language acquisition, infants have the benefit of visual cues to ground spoken language. Robots similarly have access to audio and visual sensors. Recent work has shown that images and spoken captions can be mapped into a meaningful common space, allowing images to be retrieved using speech and vice versa. In this setting of images paired with untranscribed spoken captions, we consider whether computer vision systems can be used to obtain textual labels for the speech. Concretely, we use an image-to-words multi-label visual classifier to tag images with soft textual labels, and then train a neural network to map from the speech to these soft targets. We show that the resulting speech system is able to predict which words occur in an utterance—acting as a spoken bag-of-words classifier—without seeing any parallel speech and text. We find that the model often confuses semantically related words, e.g. “man” and “person”, making it even more effective as a semantic keyword spotter.
https://arxiv.org/abs/1703.08136
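Training the speech network against soft visual targets, as described above, amounts to a multi-label objective with fractional labels. A minimal PyTorch sketch, where the particular loss choice is an assumption:

```python
import torch.nn.functional as F

def soft_bow_loss(speech_logits, visual_probs):
    # speech_logits: (batch, vocab) utterance-level scores from the speech net.
    # visual_probs:  (batch, vocab) soft word probabilities from the image tagger.
    # Binary cross-entropy accepts soft targets in [0, 1], one per vocabulary word.
    return F.binary_cross_entropy_with_logits(speech_logits, visual_probs)
```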
Deep learning exploits large volumes of labeled data to learn powerful models. When the target dataset is small, it is common practice to perform transfer learning, using pre-trained models to learn new task-specific representations. However, pre-trained CNNs for image recognition are provided with limited information about the image during training, namely the label alone. Tasks such as scene retrieval suffer from features learned from this weak supervision and require stronger supervision to better understand the contents of the image. In this paper, we exploit the features learned by caption-generating models to learn novel task-specific image representations. In particular, we consider the state-of-the-art captioning system Show and Tell~\cite{SnT-pami-2016} and the dense region description model DenseCap~\cite{densecap-cvpr-2016}. We demonstrate that, owing to the richer supervision provided during training, the features learned by the captioning system perform better than those of CNNs. Further, we train a siamese network with a modified pair-wise loss to fuse the features learned by~\cite{SnT-pami-2016} and~\cite{densecap-cvpr-2016} and learn image representations suitable for retrieval. Experiments show that the proposed fusion exploits the complementary nature of the individual features and yields state-of-the-art retrieval results on benchmark datasets.
https://arxiv.org/abs/1705.09142
Given a collection of images and spoken audio captions, we present a method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions. For example, our model is able to detect spoken instances of the word ‘lighthouse’ within an utterance and associate them with image regions containing lighthouses. We do not use any form of conventional automatic speech recognition, nor do we use any text transcriptions or conventional linguistic annotations. Our model effectively implements a form of spoken language acquisition, in which the computer learns not only to recognize word categories by sound, but also to enrich the words it learns with semantics by grounding them in images.
https://arxiv.org/abs/1701.07481
Following the recent progress in image classification and captioning using deep learning, we develop a novel natural language person retrieval system based on an attention mechanism. More specifically, given the description of a person, the goal is to localize the person in an image. To this end, we first construct a benchmark dataset for natural language person retrieval. To do so, we generate bounding boxes for persons in a public image dataset from the segmentation masks, which are then annotated with descriptions and attributes using Amazon Mechanical Turk. We then adopt the region proposal network in Faster R-CNN as a candidate region generator. The cropped images based on the region proposals, as well as the whole images with attention weights, are fed into Convolutional Neural Networks for visual feature extraction, while the natural language expression and attributes are input to Bidirectional Long Short-Term Memory (BLSTM) models for text feature extraction. The visual and text features are integrated to score region proposals, and the one with the highest score is retrieved as the output of our system. The experimental results show significant improvement over the state-of-the-art method for generic object retrieval, and this line of research promises to benefit search in surveillance video footage.
https://arxiv.org/abs/1705.08923
We develop the first approximate inference algorithm for 1-Best (and M-Best) decoding in bidirectional neural sequence models by extending Beam Search (BS) to reason about both forward and backward time dependencies. Beam Search (BS) is a widely used approximate inference algorithm for decoding sequences from unidirectional neural sequence models. Interestingly, approximate inference in bidirectional models remains an open problem, despite their significant advantage in modeling information from both the past and future. To enable the use of bidirectional models, we present Bidirectional Beam Search (BiBS), an efficient algorithm for approximate bidirectional inference. To evaluate our method, and as an interesting problem in its own right, we introduce a novel Fill-in-the-Blank Image Captioning task which requires reasoning about both past and future sentence structure to reconstruct sensible image descriptions. We use this task as well as the Visual Madlibs dataset to demonstrate the effectiveness of our approach, consistently outperforming all baseline methods.
https://arxiv.org/abs/1705.08759
Factors determining the carrier distribution in InGaN/GaN multiple-quantum-well (MQW) light-emitting diodes (LEDs) are studied via photoluminescence and temperature-dependent electroluminescence spectra. Employing a dichromatic LED device, we demonstrate that the carrier recombination rate should be considered to play an important role in determining the carrier distribution in the MQW active region, and not just simple hole characteristics such as low mobility and large effective mass.
https://arxiv.org/abs/1705.09559
Explaining and reasoning about processes which underlie observed black-box phenomena enables the discovery of causal mechanisms, derivation of suitable abstract representations and the formulation of more robust predictions. We propose to learn high level functional programs in order to represent abstract models which capture the invariant structure in the observed data. We introduce the $\pi$-machine (program-induction machine) – an architecture able to induce interpretable LISP-like programs from observed data traces. We propose an optimisation procedure for program learning based on backpropagation, gradient descent and A* search. We apply the proposed method to three problems: system identification of dynamical systems, explaining the behaviour of a DQN agent and learning by demonstration in a human-robot interaction scenario. Our experimental results show that the $\pi$-machine can efficiently induce interpretable programs from individual data traces.
https://arxiv.org/abs/1705.08320
Salient object detection has increasingly become a popular topic in cognitive and computational sciences, including computer vision and artificial intelligence research. In this paper, we propose integrating \textit{semantic priors} into the salient object detection process. Our algorithm consists of three basic steps. First, the explicit saliency map is obtained based on the semantic segmentation refined by the explicit saliency priors learned from the data. Next, the implicit saliency map is computed by a trained model that maps regional features, in which implicit saliency priors are embedded, to saliency values. Finally, the explicit and implicit saliency maps are adaptively fused to form a pixel-accurate saliency map which uniformly covers the objects of interest. We further evaluate the proposed framework on two challenging datasets, namely ECSSD and HKU-IS. The extensive experimental results demonstrate that our method outperforms other state-of-the-art methods.
https://arxiv.org/abs/1705.08207
This paper highlights the significance of including memory structures in neural networks when the latter are used to learn perception-action loops for autonomous robot navigation. Traditional navigation approaches rely on global maps of the environment to overcome cul-de-sacs and plan feasible motions. Yet, maintaining an accurate global map may be challenging in real-world settings. A possible way to mitigate this limitation is to use learning techniques that forgo hand-engineered map representations and infer appropriate control responses directly from sensed information. An important but unexplored aspect of such approaches is the effect of memory on their performance. This work is a first thorough study of memory structures for deep-neural-network-based robot navigation, and offers novel tools to train such networks from supervision and quantify their ability to generalize to unseen scenarios. We analyze the separation and generalization abilities of feedforward, long short-term memory, and differentiable neural computer networks. We introduce a new method to evaluate the generalization ability by estimating the VC-dimension of networks with a final linear readout layer. We validate that the VC estimates are good predictors of actual test performance. The reported method can be applied to deep learning problems beyond robotics.
https://arxiv.org/abs/1705.08049
We prove explicit rationality results for Asai $L$-functions, $L^S(s,\Pi’,{\rm As}^\pm)$, and Rankin-Selberg $L$-functions, $L^S(s,\Pi\times\Pi’)$, over arbitrary CM-fields $F$, relating critical values to explicit powers of $(2\pi i)$. Besides determining the contribution of archimedean zeta-integrals to our formulas as concrete powers of $(2\pi i)$, it is one of the crucial advantages of our refined approach that it applies to very general non-cuspidal isobaric automorphic representations $\Pi’$ of ${\rm GL}_n(\mathbb A_F)$. As a major application, this enables us to establish a certain algebraic version of the Gan–Gross–Prasad conjecture, as refined by N.\ Harris, for totally definite unitary groups: this generalizes a deep result of Zhang and complements very recent progress of Beuzart-Plessis. As another application we obtain a generalization of an important result of Harder–Raghuram on quotients of consecutive critical values, proved by them for totally real fields and achieved here for arbitrary CM-fields $F$ and pairs $(\Pi,\Pi’)$ of relative rank one.
https://arxiv.org/abs/1705.07701
The exploitation of mm-wave bands is one of the key enablers of 5G mobile radio networks. However, the introduction of mm-wave technologies in cellular networks is not straightforward due to harsh propagation conditions that limit mm-wave access availability. Mm-wave technologies require high-gain antenna systems to compensate for high path loss and limited power. As a consequence, directional transmissions must be used for cell discovery and synchronization processes: this can lead to a non-negligible access delay caused by the exploration of the cell area with multiple transmissions along different directions. Integrating mm-wave technologies and conventional wireless access networks with the objective of speeding up the cell search process requires new 5G network architectural solutions. Such architectures introduce a functional split between the C-plane and U-plane, thereby guaranteeing the availability of a reliable signaling channel through conventional wireless technologies that provides the opportunity to collect useful context information from the network edge. In this article, we leverage context information related to user positions to improve the directional cell discovery process. We investigate the fundamental trade-offs of this process and the effects of context information accuracy on overall system performance. We also cope with obstacle obstructions in the cell area and propose an approach based on a geo-located context database where information gathered over time is stored to guide future searches. Analytic models and numerical results are provided to validate the proposed strategies.
https://arxiv.org/abs/1705.07291
From families to nations, what binds individuals in social groups is the degree to which they share beliefs, norms, and memories. While local clusters of communicating individuals can sustain shared memories and norms, communities characterized by isolated cliques are susceptible to information fragmentation and polarization dynamics. We employ experimental manipulations in lab-created communities to investigate how the temporal dynamics of conversational interactions can shape the formation of collective memories. We show that when individuals who bridge cliques (i.e., weak ties) communicate early on in a series of networked interactions, the community reaches higher mnemonic convergence compared to when individuals first interact within cliques (i.e., strong ties). This, we find, is due to the tradeoffs between information diversity and accumulated overlap over time. Using data-calibrated models, we extend these findings to a larger and more complex network structure. Our approach offers a framework to analyze and design interventions in communication networks that optimize shared remembering and diminish the likelihood of information bubbles and polarization.
https://arxiv.org/abs/1705.07185
Recently, the soft attention mechanism, which was originally proposed in language processing, has been applied to computer vision tasks such as image captioning. This paper presents improvements to the soft attention model by combining a convolutional LSTM with a hierarchical system architecture to recognize action categories in videos. We call this model the Convolutional Hierarchical Attention Model (CHAM). The model applies a convolutional operation inside the LSTM cell and an attention map generation process to recognize actions. The hierarchical architecture of this model is able to explicitly reason over multiple granularities of action categories. The proposed architecture achieved improved results on three publicly available datasets: the UCF sports dataset, the Olympic sports dataset, and the HMDB51 dataset.
https://arxiv.org/abs/1705.03146
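The convolutional operation inside the LSTM cell mentioned above replaces the usual matrix multiplications with convolutions, so the recurrent state keeps a spatial layout. A minimal PyTorch sketch of such a cell (the kernel size and gate packing are standard choices, assumed here):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    # LSTM cell whose gates are computed by a convolution over the
    # concatenated input and hidden state, both shaped (batch, ch, H, W).
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```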
Bilinear models provide an appealing framework for mixing and merging information in Visual Question Answering (VQA) tasks. They help to learn high-level associations between question meaning and visual concepts in the image, but they suffer from huge dimensionality issues. We introduce MUTAN, a multimodal tensor-based Tucker decomposition to efficiently parametrize bilinear interactions between visual and textual representations. In addition to the Tucker framework, we design a low-rank matrix-based decomposition to explicitly constrain the interaction rank. With MUTAN, we control the complexity of the merging scheme while keeping nicely interpretable fusion relations. We show how our MUTAN model generalizes some of the latest VQA architectures, providing state-of-the-art results.
https://arxiv.org/abs/1705.06676
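The rank-constrained bilinear interaction at the heart of MUTAN can be sketched as a sum of R element-wise products of projected question and image features. The dimensions, rank, and output size below are illustrative assumptions, not the paper's settings:

```python
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    # Rank-R approximation of a bilinear (Tucker-style) interaction between
    # question and image features.
    def __init__(self, q_dim, v_dim, hid=512, rank=10, n_answers=1000):
        super().__init__()
        self.q_proj = nn.ModuleList(nn.Linear(q_dim, hid) for _ in range(rank))
        self.v_proj = nn.ModuleList(nn.Linear(v_dim, hid) for _ in range(rank))
        self.out = nn.Linear(hid, n_answers)

    def forward(self, q, v):
        # A sum of R element-wise products stands in for the full core tensor.
        z = sum(wq(q) * wv(v) for wq, wv in zip(self.q_proj, self.v_proj))
        return self.out(z)
```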
OTS44 is one of only four free-floating planets known to have a disk. We have previously shown that it is the coolest and least massive known free-floating planet ($\sim$12 M$_{\rm Jup}$) with a substantial disk that is actively accreting. We have obtained Band 6 (233 GHz) ALMA continuum data of this very young disk-bearing object. The data show a clear unresolved detection of the source. We obtained disk-mass estimates via empirical correlations derived for young, higher-mass, central (substellar) objects. The range of values obtained is between 0.07 and 0.63 M$_\oplus$ (dust masses). We compare the properties of this unique disk with those recently reported around higher-mass young objects (brown dwarfs) in order to infer constraints on its mechanism of formation. While extreme assumptions on dust temperature yield disk-mass values that could slightly diverge from the general trends found for more massive brown dwarfs, a range of sensible values provide disk masses compatible with a unique scaling relation between $M_{\rm dust}$ and $M_{*}$ through the substellar domain down to planetary masses.
https://arxiv.org/abs/1705.06378
Deep Convolutional Neural Networks (CNNs) are the state-of-the-art performers for the object detection task. It is well known that object detection requires more computation and memory than image classification; thus, deploying CNN-based object detection on an embedded system is more challenging. In this work, we propose LCDet, a fully-convolutional neural network for generic object detection that aims to work in embedded systems. We design and develop an end-to-end TensorFlow (TF)-based model. Additionally, we employ 8-bit quantization on the learned weights. We use face detection as a use case. Our TF-Slim based network can predict faces of different shapes and sizes in a single forward pass. Our experimental results show that the proposed method achieves accuracy comparable to state-of-the-art CNN-based face detection methods, while reducing the model size by 3x and memory bandwidth by ~4x compared with one of the best real-time CNN-based object detectors, YOLO. The TF 8-bit quantized model provides an additional 4x memory reduction while keeping the accuracy as good as that of the floating-point model. The proposed model thus becomes amenable to embedded implementations.
https://arxiv.org/abs/1705.05922
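The 8-bit weight quantization mentioned above is, in its simplest form, an affine mapping of each weight tensor onto 256 levels. A minimal numpy sketch of that idea (per-tensor scaling is an assumption; deployed schemes often quantize per layer or per channel):

```python
import numpy as np

def quantize_8bit(w):
    # Map float weights onto uint8 codes; keep (scale, offset) to dequantize.
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # degenerate case: all weights equal
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_8bit(q, scale, lo):
    return q.astype(np.float32) * scale + lo
```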
For a vision-guided robot, camera parameters not only play an important role in determining the visual quality of perceived images, but also affect the performance of vision algorithms. By quantitatively evaluating four object detection algorithms with respect to varying ambient illumination, shutter speed, and voltage gain, we observe that the performance of the algorithms is highly dependent on these variables. From this observation, a novel method for active control of camera parameters is proposed to make robot vision more robust under different lighting conditions. Experimental results demonstrate the effectiveness of our proposed approach, which improves the performance of object detection algorithms compared with the conventional auto-exposure algorithm.
https://arxiv.org/abs/1705.05685
The integration of advanced optoelectronic properties in nanoscale devices of group III nitrides can be realized by understanding the coupling of charge carriers with optical excitations in these nanostructures. Native-defect-induced electron-phonon coupling in GaN nanowires is reported using various spectroscopic studies. GaN nanowires having different native defects are grown by the atmospheric-pressure chemical vapor deposition technique. X-ray photoelectron spectroscopic analysis reveals the variation of Ga/N ratios in nanowires having possible native defects, with respect to their growth parameters. The analysis of the characteristic features of electron-phonon coupling in the Raman spectra shows variations in carrier density and mobility with respect to the native defects in unintentionally doped GaN nanowires. The radiative recombination of donor-acceptor pair transitions and the corresponding LO phonon replicas observed in photoluminescence studies further emphasize the role of native defects in electron-phonon coupling.
https://arxiv.org/abs/1705.05662
Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability. We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at www.visualqa.org as part of the 2nd iteration of the Visual Question Answering Dataset and Challenge (VQA v2.0). We further benchmark a number of state-of-the-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what seems to be a qualitative sense among practitioners. Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model, which in addition to providing an answer to the given (image, question) pair, also provides a counter-example based explanation. Specifically, it identifies an image that is similar to the original image, but it believes has a different answer to the same question. This can help in building trust for machines among their users.
https://arxiv.org/abs/1612.00837
We train a generator by maximum likelihood and we also train the same generator architecture by Wasserstein GAN. We then compare the generated samples, exact log-probability densities and approximate Wasserstein distances. We show that an independent critic trained to approximate Wasserstein distance between the validation set and the generator distribution helps detect overfitting. Finally, we use ideas from the one-shot learning literature to develop a novel fast learning critic.
https://arxiv.org/abs/1705.05263
This paper studies the fundamental limits of caching in a network with two receivers and two files generated by a two-component discrete memoryless source with arbitrary joint distribution. Each receiver is equipped with a cache of equal capacity, and the requested files are delivered over a shared error-free broadcast link. First, a lower bound on the optimal peak rate-memory trade-off is provided. Then, in order to leverage the correlation among the library files to alleviate the load over the shared link, a two-step correlation-aware cache-aided coded multicast (CACM) scheme is proposed. The first step uses Gray-Wyner source coding to represent the library via one common and two private descriptions, such that a second correlation-unaware multiple-request CACM step can exploit the additional coded multicast opportunities that arise. It is shown that the rate achieved by the proposed two-step scheme matches the lower bound for a significant memory regime and it is within half of the conditional entropy for all other memory values.
https://arxiv.org/abs/1705.04616
Interface phonon (IF) modes of c-plane oriented [AlN/GaN]20 and [Al0.35Ga0.65N/Al0.55Ga0.45N]20 multi-quantum-well (MQW) structures grown via plasma-assisted molecular beam epitaxy are reported. The effect of the variation in the dielectric constant of the barrier layers on the IF optical phonon modes of the well layers periodically arranged in the MQWs is investigated.
https://arxiv.org/abs/1705.04445
We report on the impact of growth conditions on the surface hillock density of N-polar GaN grown on nominally on-axis (0001) sapphire substrates by metal organic chemical vapor deposition (MOCVD). A large reduction in hillock density was achieved by implementing an optimized high-temperature AlN nucleation layer and using an indium surfactant in the GaN overgrowth. As a result, a reduction by more than a factor of five in hillock density, from 1000 to 170 hillocks/cm$^2$, was achieved. The crystal quality and surface morphology of the resultant GaN films were characterized by high-resolution x-ray diffraction and atomic force microscopy and found to be relatively unaffected by the buffer conditions. It is also shown that the density of smaller surface features is unaffected by the AlN buffer conditions.
https://arxiv.org/abs/1705.04237
This paper proposes a novel approach to creating an automated visual surveillance system that is very efficient in detecting and tracking moving objects in a video captured by a moving camera, without any a priori information about the captured scene. Separating foreground from background is a challenging task in videos captured by a moving camera, as both foreground and background information change from frame to frame; thus a pseudo-motion is perceived in the background. In the proposed algorithm, the pseudo-motion in the background is estimated and compensated using phase correlation of consecutive frames, based on the principle of the Fourier shift theorem. Then a method is proposed to model an acting background from the recent history of commonality of the current frame, and the foreground is detected by the differences between the background model and the current frame. Further exploiting the recent history of dissimilarities of the current frame, actual moving objects are detected in the foreground. Next, a two-step morphological operation is proposed to refine the object region for an optimum object size. Each object is attributed by its centroid, dimensions, and the three highest peaks of its gray-value histogram. Finally, each object is tracked using a Kalman filter based on its attributes. The major advantage of this algorithm over most existing object detection and tracking algorithms is that it requires neither initialization of the object position in the first frame nor training on sample data to perform. The performance of the algorithm is tested on benchmark videos containing variable backgrounds, and very satisfactory results are achieved. The performance of the algorithm is also comparable to some of the state-of-the-art algorithms for object detection and tracking.
https://arxiv.org/abs/1706.02672
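The pseudo-motion compensation step above rests on the Fourier shift theorem: a translation between frames appears as a phase ramp in the frequency domain, so the peak of the inverse-transformed, normalized cross-power spectrum gives the shift. A minimal numpy sketch:

```python
import numpy as np

def phase_correlation(prev, curr):
    # Returns (dy, dx) such that curr is approximately prev shifted by (dy, dx).
    F1 = np.fft.fft2(prev)
    F2 = np.fft.fft2(curr)
    cross = np.conj(F1) * F2
    cross /= np.abs(cross) + 1e-12           # keep only the phase
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap peak positions past the midpoint back to negative shifts.
    if dy > prev.shape[0] // 2:
        dy -= prev.shape[0]
    if dx > prev.shape[1] // 2:
        dx -= prev.shape[1]
    return dy, dx
```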
Visual question answering (or VQA) is a new and exciting problem that combines natural language processing and computer vision techniques. We present a survey of the various datasets and models that have been used to tackle this task. The first part of the survey details the various datasets for VQA and compares them along some common factors. The second part of this survey details the different approaches for VQA, classified into four types: non-deep learning models, deep learning models without attention, deep learning models with attention, and other models which do not fit into the first three. Finally, we compare the performances of these approaches and provide some directions for future work.
https://arxiv.org/abs/1705.03865
Recently we presented TTC, a domain-specific compiler for tensor transpositions. Although the performance of the generated code is nearly optimal, due to its offline nature TTC cannot be utilized in application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore, it decomposes any transposition into multiple loops around a so-called micro-kernel. This modular design, inspired by BLIS, makes HPTT easy to port to different architectures by only replacing the hand-vectorized micro-kernel (e.g., a 4x4 transpose). HPTT also offers an optional autotuning framework, guided by a performance model, that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, Intel Knights Landing, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY and yields remarkable speedups over Eigen’s tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.
https://arxiv.org/abs/1704.04374
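The blocking idea behind HPTT's micro-kernel can be illustrated in a few lines: tiles are transposed one at a time so reads and writes both stay cache-friendly, and higher-dimensional permutations reduce to loops around such a 2D kernel. A toy numpy sketch of a blocked 2D transpose (the tile size is an arbitrary choice; real micro-kernels are hand-vectorized in C++):

```python
import numpy as np

def blocked_transpose(a, tile=32):
    # Cache-blocked out-of-place 2D transpose; numpy slicing handles the
    # ragged edge tiles automatically.
    m, n = a.shape
    out = np.empty((n, m), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            out[j:j + tile, i:i + tile] = a[i:i + tile, j:j + tile].T
    return out
```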
Recently, deep Convolutional Neural Networks (CNNs) have demonstrated strong performance on RGB salient object detection. Although depth information can help improve detection results, the exploration of CNNs for RGB-D salient object detection remains limited. Here we propose a novel deep CNN architecture for RGB-D salient object detection that exploits high-level, mid-level, and low-level features. Further, we present novel depth features that capture the ideas of background enclosure and depth contrast, and that are suitable for a learned approach. We show improved results compared to state-of-the-art RGB-D salient object detection methods. We also show that the low-level and mid-level depth features both contribute to improvements in the results. In particular, the F-score of our method is 0.848 on the RGBD1000 dataset, which is 10.7% better than the second-best method.
https://arxiv.org/abs/1705.03607
One thing that discriminates living things from inanimate matter is their ability to generate similarly complex or non-random architectures in large abundance. From DNA sequences to folded protein structures, living cells, microbial communities, and multicellular structures, the material configurations in biology can easily be distinguished from non-living material assemblies. This is also true of the products of complex organisms that can themselves construct complex tools, machines, and artefacts. While these objects are not living, they cannot form randomly, as they are the product of a biological organism and hence are either technological or cultural biosignatures. The problem is that it is not obvious how one might generalize an approach that aims to evaluate complex objects as possible biosignatures. Yet if it were possible, such a self-contained approach could be useful in exploring the cosmos for new life forms. This would require us to prove rigorously that a given artefact is too complex to have formed by chance. In this paper, we present a new type of complexity measure, Pathway Complexity, that allows us not only to threshold the abiotic-biotic divide, but also to demonstrate a probabilistic approach based upon object abundance and complexity which can be used to unambiguously assign complex objects as biosignatures. We hope that this approach not only opens up the search for biosignatures beyond Earth, but also allows us to explore Earth for new types of biology, as well as to observe when a complex chemical system discovered in the laboratory could be considered alive.
https://arxiv.org/abs/1705.03460
Due to their wide band gaps, III-N materials can exhibit behaviors ranging from the semiconductor class to the dielectric class. Through an analogy between a Metal/AlGaN/AlN/GaN diode and a MOS contact, we make use of this dual nature and show a direct path to capture the energy band diagram of the nitride system. We then apply transparency calculations to describe the forward conduction regime of a III-N heterojunction diode and demonstrate that it realizes a tunnel diode, in contrast to its usual Schottky Barrier Diode designation. Thermionic emission is ruled out and, instead, a coherent electron tunneling scenario allows us to account for transport at room temperature and above.
https://arxiv.org/abs/1704.08505
Image-based salient object detection (SOD) has been extensively studied in the past decades. However, video-based SOD is much less explored, since there is a lack of large-scale video datasets within which salient objects are unambiguously defined and annotated. Toward this end, this paper proposes a video-based SOD dataset that consists of 200 videos (64 minutes). In constructing the dataset, we manually annotate all objects and regions over 7,650 uniformly sampled keyframes and collect the eye-tracking data of 23 subjects who free-view all videos. From the user data, we find that salient objects in video can be defined as objects that consistently pop out throughout the video, and objects with such attributes can be unambiguously annotated by combining manually annotated object/region masks with the eye-tracking data of multiple subjects. To the best of our knowledge, this is currently the largest dataset for video-based salient object detection. Based on this dataset, this paper proposes an unsupervised baseline approach for video-based SOD using saliency-guided stacked autoencoders. In the proposed approach, multiple spatiotemporal saliency cues are first extracted at the pixel, superpixel, and object levels. With these saliency cues, stacked autoencoders are constructed without supervision; they automatically infer a saliency score for each pixel by progressively encoding the high-dimensional saliency cues gathered from the pixel and its spatiotemporal neighbors. Experimental results show that the proposed unsupervised approach outperforms 30 state-of-the-art models on the proposed dataset, including 19 image-based & classic (unsupervised or non-deep-learning) models, 6 image-based & deep-learning models, and 5 video-based & unsupervised models. Moreover, benchmarking results show that the proposed dataset is very challenging and has the potential to boost the development of video-based SOD.
https://arxiv.org/abs/1611.00135
We are interested in counting the number of instances of object classes in natural, everyday images. Previous counting approaches tackle the problem in restricted domains such as counting pedestrians in surveillance videos. Counts can also be estimated from outputs of other vision tasks like object detection. In this work, we build dedicated models for counting designed to tackle the large variance in counts, appearances, and scales of objects found in natural scenes. Our approach is inspired by the phenomenon of subitizing - the ability of humans to make quick assessments of counts given a perceptual signal, for small count values. Given a natural scene, we employ a divide and conquer strategy while incorporating context across the scene to adapt the subitizing idea to counting. Our approach offers consistent improvements over numerous baseline approaches for counting on the PASCAL VOC 2007 and COCO datasets. Subsequently, we study how counting can be used to improve object detection. We then show a proof of concept application of our counting methods to the task of Visual Question Answering, by studying the `how many?’ questions in the VQA and COCO-QA datasets.
https://arxiv.org/abs/1604.03505
Generative Adversarial Nets (GANs) represent an important milestone for effective generative models, and they have inspired numerous variants, seemingly different from each other. One of the main contributions of this paper is to reveal a unified geometric structure in GAN and its variants. Specifically, we show that the adversarial generative model training can be decomposed into three geometric steps: separating hyperplane search, discriminator parameter update away from the separating hyperplane, and generator update along the normal vector direction of the separating hyperplane. This geometric intuition reveals the limitations of existing approaches and leads us to propose a new formulation called geometric GAN, which uses the SVM separating hyperplane that maximizes the margin. Our theoretical analysis shows that the geometric GAN converges to a Nash equilibrium between the discriminator and generator. In addition, extensive numerical results show the superior performance of geometric GAN.
https://arxiv.org/abs/1705.02894
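The margin-maximizing formulation above corresponds to hinge losses for the discriminator and generator. A minimal PyTorch sketch of those objectives:

```python
import torch
import torch.nn.functional as F

def discriminator_hinge_loss(d_real, d_fake):
    # SVM-style soft-margin objective: real scores pushed above +1,
    # fake scores pushed below -1.
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def generator_hinge_loss(d_fake):
    # The generator moves samples along the normal of the separating
    # hyperplane, i.e. it raises the discriminator score of its samples.
    return -d_fake.mean()
```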
We present a simple method to incorporate syntactic information about the target language in a neural machine translation system by translating into linearized, lexicalized constituency trees. An experiment on the WMT16 German-English news translation task resulted in an improved BLEU score when compared to a syntax-agnostic NMT baseline trained on the same dataset. An analysis of the translations from the syntax-aware system shows that it performs more reordering during translation in comparison to the baseline. A small-scale human evaluation also showed an advantage for the syntax-aware system.
https://arxiv.org/abs/1704.04743
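A toy example of the target-side representation described above: the decoder emits a linearized, lexicalized constituency tree instead of a bare token sequence (the bracketing below is illustrative, not taken from the paper's data):

```python
# Source sentence (German) and its linearized constituency-tree target.
source = "das Haus ist klein"
target = "(ROOT (S (NP (DT the) (NN house) ) (VP (VBZ is) (ADJP (JJ small) ) ) ) )"
# Standard sequence-to-sequence training applies unchanged; the tree (and the
# plain translation) is recovered from the output string by bracket matching.
```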
Generative adversarial networks (GANs) have received a tremendous amount of attention in the past few years, and have inspired applications addressing a wide range of problems. Despite their great potential, GANs are difficult to train. Recently, a series of papers (Arjovsky & Bottou, 2017; Arjovsky et al., 2017; Gulrajani et al., 2017) proposed using the Wasserstein distance as the training objective and promised easy, stable GAN training across architectures with minimal hyperparameter tuning. In this paper, we compare the performance of the Wasserstein distance with other training objectives on a variety of GAN architectures in the context of single-image super-resolution. Our results agree that Wasserstein GAN with gradient penalty (WGAN-GP) provides stable and converging GAN training and that the Wasserstein distance is an effective metric to gauge training progress.
https://arxiv.org/abs/1705.02438
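The gradient penalty at the core of WGAN-GP constrains the critic's gradient norm to 1 at random interpolates between real and generated samples. A minimal PyTorch sketch, assuming 4D image batches:

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    # Sample points on lines between real and fake images and penalize
    # deviation of the critic's gradient norm from 1 there.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(x).sum(), x, create_graph=True)[0]
    return lam * ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
```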
We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. Max-pooling loss training can be further guided by initializing the network with one trained with a cross-entropy loss. A posterior-smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using cross-entropy loss or max-pooling loss outperform a cross-entropy loss trained baseline feed-forward Deep Neural Network (DNN). In addition, a max-pooling loss trained LSTM with a randomly initialized network performs better than a cross-entropy loss trained LSTM. Finally, the max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network shows the best performance, yielding a $67.6\%$ relative reduction compared to the baseline feed-forward DNN in the Area Under the Curve (AUC) measure.
https://arxiv.org/abs/1705.02411
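For a keyword utterance, the max-pooling loss back-propagates only through the frame with the highest keyword posterior, freeing the network from needing a frame-level alignment. A simplified PyTorch sketch for the keyword case (the handling of non-keyword utterances is omitted here as an assumption-laden detail):

```python
import torch
import torch.nn.functional as F

def max_pooling_loss(frame_logits, keyword_id):
    # frame_logits: (T, num_classes) per-frame LSTM outputs for one utterance.
    t_star = frame_logits[:, keyword_id].argmax()          # best-scoring frame
    target = torch.tensor([keyword_id], device=frame_logits.device)
    return F.cross_entropy(frame_logits[t_star].unsqueeze(0), target)
```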
Large scale is a trend in person re-identification (re-id), and it is important that search be performed in real time over a large gallery. While previous methods mostly focus on discriminative learning, this paper attempts to integrate deep learning and hashing into one framework to evaluate the efficiency and accuracy of large-scale person re-id. We integrate spatial information into the discriminative visual representation by partitioning the pedestrian image into horizontal parts. Specifically, Part-based Deep Hashing (PDH) is proposed, in which batches of triplet samples are employed as the input to the deep hashing architecture. Each triplet sample contains two pedestrian images (or parts) with the same identity and one pedestrian image (or part) with a different identity. A triplet loss function is employed with the constraint that the Hamming distance between pedestrian images (or parts) with the same identity is smaller than that between images with different identities. In the experiments, we show that the proposed Part-based Deep Hashing method yields very competitive re-id accuracy on the large-scale Market-1501 and Market-1501+500K datasets.
https://arxiv.org/abs/1705.02145
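The triplet constraint above is commonly trained on a continuous relaxation of the binary codes, with the sign taken only at test time. A minimal PyTorch sketch along those lines (the margin and the squared-distance surrogate for Hamming distance are assumptions):

```python
import torch
import torch.nn.functional as F

def triplet_hash_loss(h_anchor, h_pos, h_neg, margin=1.0):
    # h_*: (batch, bits) tanh-relaxed codes; binarize with sign() at test time.
    d_pos = (h_anchor - h_pos).pow(2).sum(dim=1)  # same identity
    d_neg = (h_anchor - h_neg).pow(2).sum(dim=1)  # different identity
    return F.relu(margin + d_pos - d_neg).mean()
```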
Attentional sequence-to-sequence models have become the new standard for machine translation, but one challenge of such models is a significant increase in training and decoding cost compared to phrase-based systems. Here, we focus on efficient decoding, with the goal of achieving accuracy close to the state-of-the-art in neural machine translation (NMT) while achieving CPU decoding speed/throughput close to that of a phrasal decoder. We approach this problem from two angles: first, we describe several techniques for speeding up an NMT beam search decoder, which obtain a 4.4x speedup over a very efficient baseline decoder without changing the decoder output. Second, we propose a simple but powerful network architecture which uses an RNN (GRU/LSTM) layer at the bottom, followed by a series of stacked fully-connected layers applied at every timestep. This architecture achieves accuracy similar to that of a deep recurrent model at a small fraction of the training and decoding cost. By combining these techniques, our best system achieves a very competitive accuracy of 38.3 BLEU on WMT English-French NewsTest2014, while decoding at 100 words/sec on a single-threaded CPU. We believe this is the best published accuracy/speed trade-off for an NMT system.
https://arxiv.org/abs/1705.01991
Our Keck/NIRC2 imaging survey searches for stellar companions around 144 systems with radial velocity (RV) detected giant planets in order to determine whether stellar binaries influence the planets’ orbital parameters. This survey, the largest of its kind to date, finds eight confirmed binary systems and three confirmed triple systems. These include three new multi-stellar systems (HD 30856, HD 86081, and HD 207832) and three multi-stellar systems with newly confirmed common proper motion (HD 43691, HD 116029, and HD 164509). We combine these systems with seven RV planet-hosting multi-stellar systems from the literature in order to test for differences in the properties of planets with semimajor axes between 0.1 and 5 au in single versus multi-stellar systems. We find no evidence that the presence or absence of stellar companions alters the distribution of planet properties in these systems. Although the observed stellar companions might influence the orbits of more distant planetary companions in these systems, our RV observations currently provide only weak constraints on the masses and orbital properties of planets beyond 5 au. In order to aid future efforts to characterize long-period RV companions in these systems, we publish our contrast curves for all 144 targets. Using four years of astrometry for six hierarchical triple star systems hosting giant planets, we fit the orbits of the stellar companions in order to characterize the orbital architecture of these systems. We find that the orbital planes of the secondary and tertiary companions are inconsistent with an edge-on orbit in four out of six cases.
https://arxiv.org/abs/1704.02326
Multi-party Conversational Systems are systems with natural language interaction among one or more people or systems. From the moment an utterance is sent to a group to the moment it is replied to by a member of the group, several activities must be performed by the system: utterance understanding, information search, and reasoning, among others. In this paper we present the challenges of designing and building multi-party conversational systems, the state of the art, our proposed hybrid architecture using both rules and machine learning, and some insights after implementing and evaluating one such system in the finance domain.
https://arxiv.org/abs/1705.01214