We present a scalable method for detecting objects and estimating their 3D poses in RGB-D data. To this end, we rely on an efficient representation of object views and employ hashing techniques to match these views against the input frame in a scalable way. While a similar approach already exists for 2D detection, we show how to extend it to estimate the 3D pose of the detected objects. In particular, we explore different hashing strategies and identify the one most suitable for our problem. We show empirically that the complexity of our method is sublinear in the number of objects, and we enable detection and pose estimation of many 3D objects with high accuracy while outperforming the state-of-the-art in terms of runtime.
https://arxiv.org/abs/1607.06062
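The hashing idea above can be sketched as a toy lookup table over coarsely quantized view descriptors, so that matching cost stays roughly constant in the number of stored views; the descriptors, the quantization step, and all names here are illustrative assumptions, not the paper's actual representation:

```python
# Hypothetical sketch: matching quantized view descriptors via a hash table,
# so lookup cost stays roughly constant in the number of stored object views.
from collections import defaultdict

def quantize(descriptor, step=0.5):
    """Coarsely quantize a float descriptor so similar views share a key."""
    return tuple(round(x / step) for x in descriptor)

class ViewHashTable:
    def __init__(self):
        self.table = defaultdict(list)  # hash key -> list of (object_id, pose)

    def add_view(self, descriptor, object_id, pose):
        self.table[quantize(descriptor)].append((object_id, pose))

    def lookup(self, descriptor):
        return self.table.get(quantize(descriptor), [])

db = ViewHashTable()
db.add_view([0.9, 1.2], "mug", pose=(0, 0, 30))
db.add_view([3.0, 0.1], "box", pose=(0, 90, 0))
print(db.lookup([1.0, 1.1]))  # close to the "mug" view -> same bucket
```

In a real system the key would come from a learned or engineered hash of the view descriptor; the point is only that lookup avoids comparing against every stored view.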
We present a 3D object detection method that uses regressed descriptors of locally-sampled RGB-D patches for 6D vote casting. For regression, we employ a convolutional auto-encoder that has been trained on a large collection of random local patches. During testing, scene patch descriptors are matched against a database of synthetic model view patches and cast 6D object votes which are subsequently filtered to refined hypotheses. We evaluate on three datasets to show that our method generalizes well to previously unseen input data, delivers robust detection results that compete with and surpass the state-of-the-art while being scalable in the number of objects.
https://arxiv.org/abs/1607.06038
This work presents an end-to-end trainable deep bidirectional LSTM (Long Short-Term Memory) model for image captioning. Our model builds on a deep convolutional neural network (CNN) and two separate LSTM networks. It is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. Two novel deep bidirectional variant models, in which we increase the depth of the nonlinearity transition in different ways, are proposed to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale and vertical mirror are proposed to prevent overfitting when training deep models. We visualize the evolution of the bidirectional LSTM internal states over time and qualitatively analyze how our models “translate” images to sentences. Our proposed models are evaluated on caption generation and image-sentence retrieval tasks with three benchmark datasets: Flickr8K, Flickr30K and MSCOCO. We demonstrate that bidirectional LSTM models achieve performance highly competitive with state-of-the-art results on caption generation, even without integrating additional mechanisms (e.g. object detection or attention models), and significantly outperform recent methods on the retrieval task.
https://arxiv.org/abs/1604.00790
Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models.
https://arxiv.org/abs/1607.05910
Science is a growing system, exhibiting ~4% annual growth in publications and ~1.8% annual growth in the number of references per publication. Combined, these trends correspond to a 12-year doubling period in the total supply of references, thereby challenging traditional methods of evaluating scientific production, from researchers to institutions. Against this background, we analyzed a citation network comprised of 837 million references produced by 32.6 million publications over the period 1965-2012, allowing for a temporal analysis of the 'attention economy' in science. Unlike previous studies, we analyzed the entire probability distribution of reference ages - the time difference between a citing and cited paper - thereby capturing previously overlooked trends. Over this half-century period we observe a narrowing range of attention - both classic and recent literature are being cited increasingly less, pointing to the important role of socio-technical processes. To better understand the impact of exponential growth on the underlying knowledge network we develop a network-based model, featuring the redirection of scientific attention via publications' reference lists, and validate the model against several empirical benchmarks. We then use the model to test the causal impact of real paradigm shifts, thereby providing guidance for science policy analysis. In particular, we show how perturbations to the growth rate of scientific output affect the reference age distribution and the functionality of the vast science citation network as an aid for the search & retrieval of knowledge. In order to account for the inflation of science, our study points to the need for a systemic overhaul of the counting methods used to evaluate citation impact - especially in the case of evaluating science careers, which can span several decades and thus several doubling periods.
https://arxiv.org/abs/1607.05606
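The redirection mechanism in the model above can be illustrated with a toy simulation (an assumption-laden sketch, not the paper's calibrated model): each new paper either cites a uniformly chosen earlier paper or copies a reference from another paper's list, which concentrates attention on already-cited work:

```python
import random

def simulate_citations(n_papers=2000, refs_per_paper=5, p_direct=0.5, seed=1):
    """Toy redirection model: each new paper either cites a uniformly chosen
    earlier paper, or 'redirects' by copying an entry from an earlier paper's
    reference list. All parameter values are illustrative assumptions."""
    random.seed(seed)
    ref_lists = [[]]   # paper 0 has no references
    citations = [0]
    for new in range(1, n_papers):
        refs = set()
        while len(refs) < min(refs_per_paper, new):
            donor = ref_lists[random.randrange(new)]
            if random.random() < p_direct or not donor:
                refs.add(random.randrange(new))  # cite an earlier paper directly
            else:
                refs.add(random.choice(donor))   # redirect via a reference list
        for t in refs:
            citations[t] += 1
        ref_lists.append(sorted(refs))
        citations.append(0)
    return citations

cites = simulate_citations()
print(max(cites), sum(cites) // len(cites))  # redirection concentrates attention
```

The copying step is what produces a heavy-tailed citation distribution: highly cited papers appear on many reference lists and so attract still more redirected citations.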
We investigate sub-monolayer InN quantum sheets embedded in GaN(0001) by temperature-dependent photoluminescence spectroscopy under both continuous-wave and pulsed excitation. Both the peak energy and the linewidth of the emission band associated with the quantum sheets exhibit an anomalous dependence on temperature indicative of carrier localization. Photoluminescence transients reveal a power law decay at low temperatures reflecting that the recombining electrons and holes occupy spatially separate, individual potential minima reminiscent of conventional (In,Ga)N(0001) quantum wells exhibiting the characteristic disorder of a random alloy. At elevated temperatures, carrier delocalization sets in and is accompanied by a thermally activated quenching of the emission. We ascribe the strong nonradiative recombination to extended states in the GaN barriers and confirm our assumption by a simple rate-equation model.
https://arxiv.org/abs/1605.00865
Given the vast amounts of video available online, and recent breakthroughs in object detection with static images, object detection in video offers a promising new frontier. However, motion blur and compression artifacts cause substantial frame-level variability, even in videos that appear smooth to the eye. Additionally, video datasets tend to have sparsely annotated frames. We present a new framework for improving object detection in videos that captures temporal context and encourages consistency of predictions. First, we train a pseudo-labeler, that is, a domain-adapted convolutional neural network for object detection. The pseudo-labeler is first trained individually on the subset of labeled frames, and then subsequently applied to all frames. Then we train a recurrent neural network that takes as input sequences of pseudo-labeled frames and optimizes an objective that encourages both accuracy on the target frame and consistency across consecutive frames. The approach incorporates strong supervision of target frames, weak-supervision on context frames, and regularization via a smoothness penalty. Our approach achieves mean Average Precision (mAP) of 68.73, an improvement of 7.1 over the strongest image-based baselines for the Youtube-Video Objects dataset. Our experiments demonstrate that neighboring frames can provide valuable information, even absent labels.
https://arxiv.org/abs/1607.04648
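The pseudo-labeling step described above can be sketched in miniature; the one-dimensional "detector" and all names here are stand-ins for the paper's domain-adapted CNN:

```python
# Minimal pseudo-labeling sketch (illustrative, not the paper's code):
# fit a trivial threshold "detector" on labeled frames, then propagate its
# predictions to unlabeled frames so a downstream temporal model sees every frame.

def fit_threshold(labeled):
    """Pick the midpoint between the class means of a 1-D frame score."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def pseudo_label(frames, labeled):
    """Apply the fitted 'pseudo-labeler' to every frame, labeled or not."""
    thr = fit_threshold(labeled)
    return [(x, 1 if x >= thr else 0) for x in frames]

labeled = [(0.9, 1), (0.8, 1), (0.2, 0), (0.1, 0)]   # sparse annotations
all_frames = [0.85, 0.15, 0.7, 0.3]                  # every frame in the video
print(pseudo_label(all_frames, labeled))
```

In the paper, the recurrent network then consumes sequences of such pseudo-labeled frames, with strong supervision on target frames and a smoothness penalty across neighbors.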
Aiming to improve the performance of existing detection algorithms developed for different applications, we propose a region-regression-based multi-stage class-agnostic detection pipeline, in which the existing algorithms provide the initial detection proposals. Better detection is obtained by exploiting the power of deep learning in the region regression scheme while avoiding the requirement for a huge amount of reference data to train deep neural networks. Additionally, a novel network architecture with recycled deep features is proposed, which provides superior regression results compared to commonly used architectures. As demonstrated on a dataset with ~1200 samples of different classes, it is feasible to successfully train a deep neural network with our proposed architecture and use it to obtain the desired detection performance. Since only slight modifications to common network architectures are required, and since the deep neural network is trained using standard hyperparameters, the proposed detection method is readily accessible and can easily be adapted to a broad variety of detection tasks.
https://arxiv.org/abs/1607.05066
Object detection is an important task in computer vision. A variety of methods have been proposed, but methods using weak labels still do not achieve satisfactory results. In this paper, we propose a new framework that uses a weakly supervised method's output as pseudo-strong labels to train a strongly supervised model. One weakly supervised method is treated as a black box to generate class-specific bounding boxes on the training dataset. A de-noising method is then applied to the noisy bounding boxes, and the de-noised pseudo-strong labels are used to train a strongly supervised object detection network. The whole framework is still weakly supervised because the entire process uses only image-level labels. Experimental results on PASCAL VOC 2007 demonstrate the validity of our framework: we obtain 43.4% mean average precision, compared to 39.5% for the previous best result and 34.5% for the initial method, respectively. The framework is simple and distinct, and promises to be easily applicable to other methods.
https://arxiv.org/abs/1607.04731
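A simple consensus filter illustrates the kind of bounding-box de-noising step described above; the IoU threshold and agreement count are illustrative choices, not the paper's method:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def denoise(boxes, min_agreement=2, thr=0.5):
    """Keep a box only if at least `min_agreement` boxes (itself included)
    overlap it with IoU >= thr -- a toy consensus filter for noisy labels."""
    return [b for b in boxes
            if sum(iou(b, o) >= thr for o in boxes) >= min_agreement]

noisy = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 60, 60)]
print(denoise(noisy))  # the isolated box is dropped
```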
Very deep convolutional neural networks (CNNs) yield state-of-the-art results on a wide variety of visual recognition problems. A number of state-of-the-art methods for image recognition are based on networks with well over 100 layers, and the performance-vs.-depth trend is moving towards networks in excess of 1000 layers. In such extremely deep architectures the vanishing or exploding gradient problem becomes a key issue. Recent evidence also indicates that convolutional networks could benefit from an interface to explicitly constructed memory mechanisms interacting with a CNN feature processing hierarchy. Correspondingly, we propose and evaluate a memory-mechanism-enhanced convolutional neural network architecture based on augmenting convolutional residual networks with a long short-term memory mechanism. We refer to this as a convolutional residual memory network. To the best of our knowledge, this approach can yield state-of-the-art performance on the CIFAR-100 benchmark and compares well with other state-of-the-art techniques on the CIFAR-10 and SVHN benchmarks. This is achieved using networks with more breadth, much less depth, and much less overall computation relative to comparable deep ResNets without the memory mechanism. Our experiments and analysis explore the importance of the memory mechanism, network depth, breadth, and predictive performance.
https://arxiv.org/abs/1606.05262
In the past ten years, modern societies have developed enormous communication and social networks. Their classification, and the retrieval of information from them, have become a formidable task for society. Due to the rapid growth of the World Wide Web and of social and communication networks, new mathematical methods have been invented to characterize the properties of these networks on a more detailed and precise level. Various search engines essentially rely on such methods. It is highly important to develop new tools to classify and rank the enormous amount of network information in a way adapted to internal network structures and characteristics. This review describes the Google matrix analysis of directed complex networks, demonstrating its efficiency on various examples including the World Wide Web, Wikipedia, software architecture, world trade, social and citation networks, brain neural networks, DNA sequences, and Ulam networks. The analytical and numerical matrix methods used in this analysis originate from the fields of Markov chains, quantum chaos, and Random Matrix theory.
https://arxiv.org/abs/1409.0428
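The Google matrix analysis described in the review rests on PageRank-style power iteration over G = alpha*S + (1-alpha)*E; a minimal sketch on a three-node graph (the damping factor 0.85 is the conventional choice):

```python
def pagerank(links, alpha=0.85, iters=100):
    """Power iteration on the Google matrix G = alpha*S + (1-alpha)*E.
    `links` maps node -> list of outgoing links; dangling nodes spread
    their rank uniformly over all nodes."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - alpha) / n for u in nodes}
        for u in nodes:
            out = links.get(u, [])
            if out:
                share = alpha * rank[u] / len(out)
                for v in out:
                    new[v] += share
            else:  # dangling node: distribute rank uniformly
                for v in nodes:
                    new[v] += alpha * rank[u] / n
        rank = new
    return rank

r = pagerank({"A": ["B"], "B": ["C"], "C": ["A", "B"]})
print(max(r, key=r.get))  # "B" receives links from both A and C
```

The ranking is the stationary distribution of a random surfer who follows links with probability alpha and teleports uniformly otherwise; the ranks always sum to one.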
One successful model of interacting biological systems is the Boolean network. The dynamics of a Boolean network, controlled with Boolean functions, is usually considered to be a Markovian (memory-less) process. However, both the self-organizing features of biological phenomena and their intelligent nature should raise some doubt about ignoring the history of their time evolution. Here, we extend the Markovian Boolean network approach by involving the effect of memory on the dynamics. This can be explored by modifying Boolean functions into non-Markovian functions, for example by investigating a non-Markovian version of the usual threshold function, one of the most widely applied Boolean functions. By applying the non-Markovian threshold function to the dynamical process of a cell-cycle network, we discover a power-law memory with more robust dynamics than in the Markovian case.
https://arxiv.org/abs/1607.03794
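The contrast between the usual Markovian threshold update and a memory-carrying variant can be sketched as follows; averaging the input over the last few states is one illustrative way to make the threshold function non-Markovian, not necessarily the paper's exact definition:

```python
def step_markovian(states, weights, thr=0):
    """Usual threshold rule: next state of each node from the current state only."""
    n = len(states)
    return [1 if sum(weights[i][j] * states[j] for j in range(n)) > thr else 0
            for i in range(n)]

def step_with_memory(history, weights, memory=3, thr=0):
    """Non-Markovian variant (an illustrative choice): the input is averaged
    over the last `memory` states instead of the current state alone."""
    n = len(history[-1])
    recent = history[-memory:]
    avg = [sum(s[j] for s in recent) / len(recent) for j in range(n)]
    return [1 if sum(weights[i][j] * avg[j] for j in range(n)) > thr else 0
            for i in range(n)]

w = [[0, 1], [1, 0]]       # two mutually activating nodes
hist = [[1, 0], [0, 1]]
for _ in range(4):
    hist.append(step_with_memory(hist, w))
print(hist)
```

With memory the two-node system settles into a fixed point, whereas the memoryless update from the same start oscillates forever between [1, 0] and [0, 1].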
Numerous efforts have been made to design different low-level saliency cues for RGBD saliency detection, such as color or depth contrast features and background and color compactness priors. However, how these saliency cues interact with each other, and how to incorporate them effectively to generate a master saliency map, remain challenging problems. In this paper, we design a new convolutional neural network (CNN) to fuse different low-level saliency cues into hierarchical features for automatically detecting salient objects in RGBD images. In contrast to existing works that directly feed raw image pixels to the CNN, the proposed method takes advantage of the knowledge in traditional saliency detection by adopting various meaningful and well-designed saliency feature vectors as input. This can guide the training of the CNN towards detecting salient objects more effectively due to the reduced learning ambiguity. We then integrate a Laplacian propagation framework with the learned CNN to extract a spatially consistent saliency map by exploiting the intrinsic structure of the input image. Extensive quantitative and qualitative experimental evaluations on three datasets demonstrate that the proposed method consistently outperforms state-of-the-art methods.
https://arxiv.org/abs/1607.03333
We present a study of germanium as an n-type dopant in wurtzite GaN films grown by plasma-assisted molecular beam epitaxy, reaching carrier concentrations of up to 6.7 × 10^20 cm^-3 at 300 K, well beyond the Mott density. The Ge concentration and free carrier density were found to scale linearly with the Ge flux in the studied range. All the GaN:Ge layers present smooth surface morphology with atomic terraces, without any trace of pits or cracks, and the mosaicity of the samples has no noticeable dependence on the Ge concentration. The variation of the GaN:Ge band gap with the carrier concentration is consistent with theoretical calculations of the band gap renormalization due to electron-electron and electron-ion interactions, and with the Burstein-Moss effect.
https://arxiv.org/abs/1604.00231
In order to control computational complexity, neural machine translation (NMT) systems convert all rare words outside the vocabulary into a single unk symbol. A previous solution (Luong et al., 2015) resorts to using multiple numbered unks to learn the correspondence between source and target rare words. However, words unseen in the training corpus cannot be handled by this method, which also suffers from noisy word alignment. In this paper, we focus on a major type of rare words, named entities (NEs), and propose to translate them with a character-level sequence-to-sequence model. The NE translation model is further used to derive high-quality NE alignments in the bilingual training corpus. With the integration of the NE translation and alignment modules, our NMT system is able to surpass the baseline system by 2.9 BLEU points on the Chinese-to-English task.
https://arxiv.org/abs/1607.01856
In this paper, we propose an effective way of biasing the attention mechanism of a sequence-to-sequence neural machine translation (NMT) model towards the well-studied statistical word alignment models. We show that our novel guided alignment training approach improves translation quality on real-life e-commerce texts consisting of product titles and descriptions, overcoming the problems posed by many unknown words and a large type/token ratio. We also show that meta-data associated with input texts, such as topic or category information, can significantly improve translation quality when used as an additional signal to the decoder part of the network. With both novel features, the BLEU score of the NMT system on a product title set improves from 18.6% to 21.3%. Even larger MT quality gains are obtained through domain adaptation of a general-domain NMT system to e-commerce data. The developed NMT system also performs well on the IWSLT speech translation task, where an ensemble of four variant systems outperforms the phrase-based baseline by 2.1% BLEU absolute.
https://arxiv.org/abs/1607.01628
This paper presents the design, fabrication, and experimental characterization of monolithically integrated p-n junction InGaN/GaN multiple quantum well diodes (MQWDs) and suspended waveguides. Suspended MQWDs can be used as transmitters and receivers simultaneously, and suspended waveguides are used for light coupling to create an in-plane visible light communication system. Compared to the waveguide with separation trench, the calculated total light efficiency is increased from 18% to 22% for the continuous waveguide. The MQWDs are characterized by their typical current-voltage performance, and the pulse excitation measurements confirm that the InGaN/GaN MQWDs can achieve the light emission and photodetection at the same time. The photocurrent measurements indicate that the photocurrent is modulated by a bias voltage and that the photons are being supplied from another transmitter. An experimental demonstration is presented showing that the proposed device works well for in-plane full-duplex communication using visible light.
https://arxiv.org/abs/1607.01455
We propose a simple domain adaptation method for neural networks in a supervised setting. Supervised domain adaptation improves generalization performance on the target domain by using the source domain dataset, assuming that both datasets are labeled. Recently, recurrent neural networks have been shown to be successful on a variety of NLP tasks such as caption generation; however, existing domain adaptation techniques are limited to (1) tuning the model parameters on the target dataset after training on the source dataset, or (2) designing the network to have dual outputs, one for the source domain and the other for the target domain. Reformulating the idea of the domain adaptation technique proposed by Daume (2007), we propose a simple domain adaptation method that can be applied to neural networks trained with a cross-entropy loss. On captioning datasets, we show performance improvements over other domain adaptation methods.
https://arxiv.org/abs/1607.00410
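The Daume (2007) technique the paper reformulates is feature augmentation: every feature gets a shared copy and a domain-specific copy, so a single linear model can learn which weights are shared across domains and which are domain-specific. A minimal sketch of the original feature-space trick (the paper's neural-network reformulation differs):

```python
def augment(features, domain):
    """Daume (2007) 'frustratingly easy' feature augmentation: duplicate each
    feature into a shared copy and a domain-specific copy."""
    out = {}
    for name, value in features.items():
        out["shared:" + name] = value
        out[domain + ":" + name] = value
    return out

# Hypothetical sparse features for one training example from the source domain:
print(augment({"len": 4, "caps": 1}, "source"))
```

Source and target examples then live in one combined feature space, and a standard classifier trained on their union performs the adaptation implicitly.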
Recently there has been a lot of interest in learning common representations for multiple views of data. Typically, such common representations are learned using a parallel corpus between the two views (say, 1M images and their English captions). In this work, we address a real-world scenario where no direct parallel data is available between two views of interest (say, $V_1$ and $V_2$) but parallel data is available between each of these views and a pivot view ($V_3$). We propose a model for learning a common representation for $V_1$, $V_2$ and $V_3$ using only the parallel data available between $V_1V_3$ and $V_2V_3$. The proposed model is generic and even works when there are $n$ views of interest and only one pivot view which acts as a bridge between them. There are two specific downstream applications that we focus on (i) transfer learning between languages $L_1$,$L_2$,…,$L_n$ using a pivot language $L$ and (ii) cross modal access between images and a language $L_1$ using a pivot language $L_2$. Our model achieves state-of-the-art performance in multilingual document classification on the publicly available multilingual TED corpus and promising results in multilingual multimodal retrieval on a new dataset created and released as a part of this work.
https://arxiv.org/abs/1510.03519
Neural Machine Translation (NMT), like many other deep learning domains, typically suffers from over-parameterization, resulting in large storage sizes. This paper examines three simple magnitude-based pruning schemes to compress NMT models, namely class-blind, class-uniform, and class-distribution, which differ in terms of how pruning thresholds are computed for the different classes of weights in the NMT architecture. We demonstrate the efficacy of weight pruning as a compression technique for a state-of-the-art NMT system. We show that an NMT model with over 200 million parameters can be pruned by 40% with very little performance loss as measured on the WMT’14 English-German translation task. This sheds light on the distribution of redundancy in the NMT architecture. Our main result is that with retraining, we can recover and even surpass the original performance with an 80%-pruned model.
https://arxiv.org/abs/1606.09274
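The class-blind versus class-uniform distinction above can be sketched directly: the former thresholds |w| over all weight classes jointly, the latter within each class separately. The toy weights and class names here are illustrative, not the NMT architecture's actual weight classes:

```python
def prune_class_blind(weight_classes, fraction):
    """Zero out the smallest-|w| weights across ALL classes jointly."""
    flat = sorted(abs(w) for ws in weight_classes.values() for w in ws)
    k = int(len(flat) * fraction)
    thr = flat[k - 1] if k else float("-inf")
    return {name: [0.0 if abs(w) <= thr else w for w in ws]
            for name, ws in weight_classes.items()}

def prune_class_uniform(weight_classes, fraction):
    """Zero out the same fraction of weights within EACH class separately."""
    pruned = {}
    for name, ws in weight_classes.items():
        flat = sorted(abs(w) for w in ws)
        k = int(len(ws) * fraction)
        thr = flat[k - 1] if k else float("-inf")
        pruned[name] = [0.0 if abs(w) <= thr else w for w in ws]
    return pruned

weights = {"embed": [0.1, -0.2, 0.9, 1.5], "softmax": [2.0, -3.0, 2.5, 0.05]}
print(prune_class_blind(weights, 0.5))
print(prune_class_uniform(weights, 0.5))
```

Note how class-blind pruning removes more of the small embedding weights and spares the large softmax weights, which is the behavior the paper finds most effective; ties at the threshold may prune slightly more than the requested fraction.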
Uncertain recognition success, unfavorable scaling of connection complexity, or dependence on complex external input impair the usefulness of current oscillatory neural networks for pattern recognition, or restrict technical realizations to small networks. We propose a new network architecture of coupled oscillators for pattern recognition which shows none of the mentioned flaws. Furthermore, we illustrate the recognition process with simulation results and analyze the new dynamics analytically: possible output patterns are isolated attractors of the system. Additionally, simple criteria for recognition success are derived from a lower bound on the basins of attraction.
https://arxiv.org/abs/1604.02085
To better understand the flows of ideas or information through social and biological systems, researchers develop maps that reveal important patterns in network flows. In practice, network flow models have implied memoryless first-order Markov chains, but recently researchers have introduced higher-order Markov chain models with memory to capture patterns in multi-step pathways. Higher-order models are particularly important for effectively revealing actual, overlapping community structure, but higher-order Markov chain models suffer from the curse of dimensionality: their vast parameter spaces require exponentially increasing data to avoid overfitting and therefore make mapping inefficient already for moderate-sized systems. To overcome this problem, we introduce an efficient cross-validated mapping approach based on network flows modeled by sparse Markov chains. To illustrate our approach, we present a map of citation flows in science with research fields that overlap in multidisciplinary journals. Compared with currently used categories in science of science studies, the research fields form better units of analysis because the map more effectively captures how ideas flow through science.
https://arxiv.org/abs/1606.08328
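A second-order Markov chain estimated from pathway data shows the memory effect the paper exploits; the journal names are made up, and the sparse regularization that makes this scale is not shown:

```python
from collections import Counter, defaultdict

def fit_second_order(paths):
    """Estimate a second-order Markov chain from observed paths:
    P(next | prev, current). Each extra order multiplies the state space,
    which is why sparse/regularized variants are needed at scale."""
    counts = defaultdict(Counter)
    for path in paths:
        for a, b, c in zip(path, path[1:], path[2:]):
            counts[(a, b)][c] += 1
    return {state: {c: n / sum(ctr.values()) for c, n in ctr.items()}
            for state, ctr in counts.items()}

# Hypothetical citation paths through journals J1-J3:
paths = [["J1", "J2", "J1"], ["J3", "J2", "J3"], ["J1", "J2", "J1"]]
model = fit_second_order(paths)
print(model[("J1", "J2")])  # memory: arriving from J1 predicts returning to J1
```

A first-order model would see only "at J2" and mix the two return flows; the second-order states separate them, which is exactly what reveals overlapping communities in multi-step pathways.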
Double barrier GaN/AlN resonant tunneling heterostructures have been grown by molecular beam epitaxy on the (0001) plane of commercially available bulk GaN substrates. Resonant tunneling diodes were fabricated; room temperature current-voltage measurements reveal the presence of a negative differential conductance region under forward bias with peak current densities of ~6.4 $kA/cm^2$ and a peak to valley current ratio of ~1.3. Reverse bias operation presents a characteristic turn-on threshold voltage intimately linked to the polarization fields present in the heterostructure. An analytic electrostatic model is developed to capture the unique features of polar-heterostructure-based resonant tunneling diodes; both the resonant and threshold voltages are derived as a function of the design parameters and polarization fields. Subsequent measurements confirm the repeatability of the negative conductance and demonstrate that III-nitride tunneling heterostructures are capable of robust resonant transport at room temperature.
https://arxiv.org/abs/1606.08100
Causal precedence between biochemical interactions is crucial in the biomedical domain, because it transforms collections of individual interactions, e.g., bindings and phosphorylations, into the causal mechanisms needed to inform meaningful search and inference. Here, we analyze causal precedence in the biomedical domain as distinct from open-domain, temporal precedence. First, we describe a novel, hand-annotated text corpus of causal precedence in the biomedical domain. Second, we use this corpus to investigate a battery of models of precedence, covering rule-based, feature-based, and latent representation models. The highest-performing individual model achieved a micro F1 of 43 points, approaching the best performers on the simpler temporal-only precedence tasks. Feature-based and latent representation models each outperform the rule-based models, but their performance is complementary to one another. We apply a sieve-based architecture to capitalize on this lack of overlap, achieving a micro F1 score of 46 points.
https://arxiv.org/abs/1606.08089
The main constraint on wireless sensor networks (WSNs) for wireless image communication is their high energy requirement, which may exceed even the future capabilities of battery technologies. In this paper we show that this bottleneck can be overcome by developing a local in-network image processing algorithm that offers optimal energy consumption. Our algorithm is well suited to intruder detection applications. Each node is responsible for processing the image captured by the video sensor, which consists of N×N blocks. If an intruder is detected in the monitored region, the node transmits the image for further processing; otherwise, the node takes no action. Results from our experiments show that our algorithm outperforms traditional moving object detection techniques by a factor of N/2 in terms of energy savings.
https://arxiv.org/abs/1606.07583
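The block-wise in-network processing can be sketched as a mean-difference test per block; the block size, threshold, and toy 4×4 "image" are illustrative assumptions, not the paper's algorithm:

```python
def block_means(image, n):
    """Mean intensity of each block of a square image split into n*n blocks."""
    size = len(image) // n
    return [[sum(image[r][c] for r in range(br * size, (br + 1) * size)
                             for c in range(bc * size, (bc + 1) * size)) / size ** 2
             for bc in range(n)] for br in range(n)]

def intruder_detected(frame, background, n=2, thr=10):
    """Transmit only if some block's mean differs enough from the background."""
    fm, bm = block_means(frame, n), block_means(background, n)
    return any(abs(fm[i][j] - bm[i][j]) > thr for i in range(n) for j in range(n))

background = [[0] * 4 for _ in range(4)]
frame = [row[:] for row in background]
frame[0][0] = frame[0][1] = 200              # bright object enters top-left block
print(intruder_detected(frame, background))  # True -> node transmits the image
```

The energy saving comes from the silent branch: when no block changes, the node skips the radio transmission entirely, which dominates the energy budget.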
While textual reviews have become prominent in many recommendation-based systems, automated frameworks that provide relevant visual cues for text reviews lacking pictures pose a new task for data mining and machine learning researchers. Suggesting pictures relevant to the content of a review could significantly benefit users by increasing the effectiveness of the review. We propose a deep-learning-based framework to automatically: (1) tag the images available in a review dataset, (2) generate a caption for each image that does not have one, and (3) enhance each review by recommending relevant images that might not have been uploaded by the corresponding reviewer. We evaluate the proposed framework using the Yelp Challenge Dataset. While a subset of the images in this particular dataset is correctly captioned, the majority of the pictures have no associated text. Moreover, there is no mapping between reviews and images, although each image carries a business tag indicating where the picture was taken. The overall data setting, and the unavailability of crucial pieces required for a mapping, make recommending images for reviews a major challenge. Qualitative and quantitative evaluations indicate that our proposed framework provides high-quality enhancements through automatic captioning, tagging, and recommendation for mapping reviews and images.
https://arxiv.org/abs/1606.07496
This paper describes the AMU-UEDIN submissions to the WMT 2016 shared task on news translation. We explore methods of decode-time integration of attention-based neural translation models with phrase-based statistical machine translation. Efficient batch-algorithms for GPU-querying are proposed and implemented. For English-Russian, our system stays behind the state-of-the-art pure neural models in terms of BLEU. Among restricted systems, manual evaluation places it in the first cluster tied with the pure neural model. For the Russian-English task, our submission achieves the top BLEU result, outperforming the best pure neural system by 1.1 BLEU points and our own phrase-based baseline by 1.6 BLEU. After manual evaluation, this system is the best restricted system in its own cluster. In follow-up experiments we improve results by additional 0.8 BLEU.
https://arxiv.org/abs/1605.04809
Nearly all previous work on neural machine translation (NMT) has used quite restricted vocabularies, perhaps with a subsequent method to patch in unknown words. This paper presents a novel word-character solution to achieving open vocabulary NMT. We build hybrid systems that translate mostly at the word level and consult the character components for rare words. Our character-level recurrent neural networks compute source word representations and recover unknown target words when needed. The twofold advantage of such a hybrid approach is that it is much faster and easier to train than character-based ones; at the same time, it never produces unknown words as in the case of word-based models. On the WMT'15 English to Czech translation task, this hybrid approach offers an additional boost of +2.1-11.4 BLEU points over models that already handle unknown words. Our best system achieves a new state-of-the-art result with a 20.7 BLEU score. We demonstrate that our character models can successfully learn not only to generate well-formed words for Czech, a highly inflected language with a very complex vocabulary, but also to build correct representations for English source words.
https://arxiv.org/abs/1604.00788
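The core lookup idea above can be sketched in a few lines: translate at the word level when a token is in the vocabulary, and hand rare tokens to the character component. This is an illustrative sketch only, not the authors' model; the vocabulary and sentence are made up.

```python
# Hybrid word/character lookup sketch: known words stay word-level units,
# out-of-vocabulary words become character sequences for a character RNN.

WORD_VOCAB = {"the", "cat", "sat"}  # hypothetical word-level vocabulary

def encode(sentence):
    """Return per-token units: the word itself if known, else its characters."""
    units = []
    for tok in sentence.split():
        if tok in WORD_VOCAB:
            units.append(("word", tok))
        else:
            # a character-level RNN would consume this sequence
            units.append(("chars", list(tok)))
    return units

encoded = encode("the cat miaowed")
```

In the full model the `("chars", ...)` branch feeds a character RNN that builds the rare word's representation on the source side and spells out unknown words on the target side.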
Metamorphic malware variants with the same malicious behavior (family) can obfuscate themselves to look different from one another. This structural variation leads to a huge signature database for traditional signature-matching techniques to detect them. To detect malware effectively and efficiently in large collections of executables, we need to partition these files into groups that identify their respective families. In addition, the grouping criterion should be chosen in such a way that it can also be applied to unknown files encountered on computers for classification. This paper studies malware and benign executables in groups to detect unknown malware with high accuracy. We studied the sizes of malware generated by three popular second-generation malware (metamorphic malware) creator kits, viz. G2, PS-MPC and NGVCK, and observed that the size variation between any two malware samples generated by the same kit is small. Hence, we grouped the executables on the basis of malware size using the Optimal k-Means Clustering algorithm and used the obtained groups to select promising features for training classifiers (Random forest, J48, LMT, FT and NBT) to detect variants of malware or unknown malware. We find that detecting malware on the basis of their respective file sizes gives accuracy of up to 99.11% with these classifiers.
http://arxiv.org/abs/1606.06908
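The grouping step can be illustrated with a plain 1-D k-means on file sizes. This is a minimal Lloyd's-algorithm sketch standing in for the paper's "Optimal k-Means Clustering" step, not the authors' code; the file sizes below are invented.

```python
# 1-D k-means sketch: cluster executables by file size so that samples from
# the same generator kit (which have very similar sizes) land in one group.

def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on scalar values; returns (centroids, labels)."""
    centroids = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centroids[j]))
                  for v in values]
        new = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new.append(sum(members) / len(members) if members else centroids[j])
        if new == centroids:
            break
        centroids = new
    return centroids, labels

# Hypothetical file sizes (bytes) from two kits with tight size ranges.
sizes = [4100, 4200, 4150, 9800, 9900, 9850]
centroids, labels = kmeans_1d(sizes, k=2)
```

Each resulting group would then get its own feature selection and classifier, as the abstract describes.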
Combating malware is very important for software/system security, but protecting software/systems from advanced malware, viz. metamorphic malware, is a challenging task, as it changes its structure/code after each infection. In this paper, we therefore present a novel approach to detect advanced malware with high accuracy by analyzing the occurrence of opcodes (features) after grouping the executables. These groups are formed on the basis of our earlier finding [1] that the difference between the sizes of any two malware samples generated by popular advanced malware kits, viz. PS-MPC, G2 and NGVCK, is within 5 KB. On the basis of the obtained promising features, we studied the performance of thirteen classifiers using N-fold cross-validation in the machine learning tool WEKA. Among these thirteen classifiers we studied the top five (Random forest, LMT, NBT, J48 and FT) in depth and obtained more than 96.28% accuracy for the detection of unknown malware, which is better than the maximum detection accuracy (95.9%) reported by Santos et al. (2013). Among these top five classifiers, our approach obtained a detection accuracy of 97.95% with Random forest.
http://arxiv.org/abs/1606.06897
We present region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image. To achieve this goal, we propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection. Our method can thus naturally adopt fully convolutional image classifier backbones, such as the latest Residual Networks (ResNets), for object detection. We show competitive results on the PASCAL VOC datasets (e.g., 83.6% mAP on the 2007 set) with the 101-layer ResNet. Meanwhile, our result is achieved at a test-time speed of 170ms per image, 2.5-20x faster than the Faster R-CNN counterpart. Code is made publicly available at: this https URL
https://arxiv.org/abs/1605.06409
We conduct large-scale studies on ‘human attention’ in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple novel game-inspired attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Overall, our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans.
https://arxiv.org/abs/1606.05589
This paper presents the University of Cambridge submission to WMT16. Motivated by the complementary nature of syntactical machine translation and neural machine translation (NMT), we exploit the synergies of Hiero and NMT in different combination schemes. Starting out with a simple neural lattice rescoring approach, we show that the Hiero lattices are often too narrow for NMT ensembles. Therefore, instead of a hard restriction of the NMT search space to the lattice, we propose to loosely couple NMT and Hiero by composition with a modified version of the edit distance transducer. The loose combination outperforms lattice rescoring, especially when using multiple NMT systems in an ensemble.
https://arxiv.org/abs/1606.04963
Interlingua based Machine Translation (MT) aims to encode multiple languages into a common linguistic representation and then decode sentences in multiple target languages from this representation. In this work we explore this idea in the context of neural encoder decoder architectures, albeit on a smaller scale and without MT as the end goal. Specifically, we consider the case of three languages or modalities X, Z and Y wherein we are interested in generating sequences in Y starting from information available in X. However, there is no parallel training data available between X and Y; training data is only available between X & Z and Z & Y (as is often the case in many real world applications). Z thus acts as a pivot/bridge. An obvious solution, which is perhaps less elegant but works very well in practice, is to train a two stage model which first converts from X to Z and then from Z to Y. Instead we explore an interlingua inspired solution which jointly learns to (i) encode X and Z to a common representation and (ii) decode Y from this common representation. We evaluate our model on two tasks: (i) bridge transliteration and (ii) bridge captioning. We report promising results in both these applications and believe that this is a right step towards truly interlingua inspired encoder decoder architectures.
https://arxiv.org/abs/1606.04754
Video captioning has been attracting broad research attention in the multimedia community. However, most existing approaches either ignore temporal information among video frames or just employ local contextual temporal knowledge. In this work, we propose a novel video captioning framework, termed \emph{Bidirectional Long-Short Term Memory} (BiLSTM), which deeply captures bidirectional global temporal structure in video. Specifically, we first devise a joint visual modelling approach to encode video data by combining a forward LSTM pass, a backward LSTM pass, together with visual features from Convolutional Neural Networks (CNNs). Then, we inject the derived video representation into the subsequent language model for initialization. The benefits are twofold: 1) comprehensively preserving sequential and visual information; and 2) adaptively learning dense visual features and sparse semantic representations for videos and sentences, respectively. We verify the effectiveness of our proposed video captioning framework on a commonly-used benchmark, i.e., the Microsoft Video Description (MSVD) corpus, and the experimental results demonstrate the superiority of the proposed approach compared to several state-of-the-art methods.
https://arxiv.org/abs/1606.04631
This paper describes the Georgia Tech team’s approach to the CoNLL-2016 supplementary evaluation on discourse relation sense classification. We use long short-term memories (LSTM) to induce distributed representations of each argument, and then combine these representations with surface features in a neural network. The architecture of the neural network is determined by Bayesian hyperparameter search.
https://arxiv.org/abs/1606.04503
We prove a conjectural formula relating the Bessel period of certain automorphic forms on $\mathrm{GSp}_4$ to a central $L$-value. This formula was proposed by Liu as the refined Gan-Gross-Prasad conjecture for the groups $(\mathrm{SO}(5), \mathrm{SO}(2))$. The conjecture has previously been proved for certain automorphic forms on $\mathrm{GSp}_4$ arising from lifts. In this paper, we extend the formula to Siegel modular forms of $\mathrm{Sp}_4(\mathbb{Z})$.
https://arxiv.org/abs/1512.09222
We present an approach to improve statistical machine translation of image descriptions by multimodal pivots defined in visual space. The key idea is to perform image retrieval over a database of images that are captioned in the target language, and use the captions of the most similar images for crosslingual reranking of translation outputs. Our approach does not depend on the availability of large amounts of in-domain parallel data, but only relies on available large datasets of monolingually captioned images, and on state-of-the-art convolutional neural networks to compute image similarities. Our experimental evaluation shows improvements of 1 BLEU point over strong baselines.
https://arxiv.org/abs/1601.03916
We present a variety of new architectural features and training procedures that we apply to the generative adversarial networks (GANs) framework. We focus on two applications of GANs: semi-supervised learning, and the generation of images that humans find visually realistic. Unlike most work on generative models, our primary goal is not to train a model that assigns high likelihood to test data, nor do we require the model to be able to learn well without using any labels. Using our new techniques, we achieve state-of-the-art results in semi-supervised classification on MNIST, CIFAR-10 and SVHN. The generated images are of high quality as confirmed by a visual Turing test: our model generates MNIST samples that humans cannot distinguish from real data, and CIFAR-10 samples that yield a human error rate of 21.3%. We also present ImageNet samples with unprecedented resolution and show that our methods enable the model to learn recognizable features of ImageNet classes.
https://arxiv.org/abs/1606.03498
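One technique this paper introduces is feature matching, where the generator is trained to match the mean discriminator features of real and generated batches rather than to directly fool the discriminator. The framework-free sketch below (toy features, invented values) shows only the loss computation, not a full GAN.

```python
# Feature-matching loss sketch: squared L2 distance between the per-dimension
# means of discriminator features on a real batch and a generated batch.

def feature_matching_loss(real_feats, fake_feats):
    """Each argument is a batch (list) of equal-length feature vectors."""
    dim = len(real_feats[0])
    mean = lambda batch, d: sum(f[d] for f in batch) / len(batch)
    return sum((mean(real_feats, d) - mean(fake_feats, d)) ** 2
               for d in range(dim))

real = [[1.0, 2.0], [3.0, 4.0]]   # batch mean features (2.0, 3.0)
fake = [[2.0, 2.0], [2.0, 2.0]]   # batch mean features (2.0, 2.0)
loss = feature_matching_loss(real, fake)
```

In practice the features come from an intermediate discriminator layer and the loss is backpropagated into the generator only.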
Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.1 and 1.3 BLEU, respectively.
https://arxiv.org/abs/1508.07909
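The byte-pair-encoding segmentation at the heart of this approach can be sketched compactly: count adjacent symbol pairs over a word-frequency dictionary and repeatedly merge the most frequent pair. This is a minimal illustration with a toy vocabulary, not the released implementation.

```python
# BPE merge-learning sketch: words start as character sequences (with an
# end-of-word marker); each iteration merges the most frequent adjacent pair.
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in vocab.items()}

vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}
for _ in range(3):  # learn three merge operations
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
```

At test time the learned merge operations are replayed in order on each new word, so rare and unknown words decompose into known subword units.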
There has been an explosion of work in the vision & language community during the past few years from image captioning to video transcription, and answering questions about images. These tasks have focused on literal descriptions of the image. To move beyond the literal, we choose to explore how questions about an image are often directed at commonsense inference and the abstract events evoked by objects in the image. In this paper, we introduce the novel task of Visual Question Generation (VQG), where the system is tasked with asking a natural and engaging question when shown an image. We provide three datasets which cover a variety of images from object-centric to event-centric, with considerably more abstract training data than provided to state-of-the-art captioning systems thus far. We train and test several generative and retrieval models to tackle the task of VQG. Evaluation results show that while such models ask reasonable questions for a variety of images, there is still a wide gap with human performance which motivates further work on connecting images with commonsense knowledge and pragmatics. Our proposed task offers a new challenge to the community which we hope furthers interest in exploring deeper connections between vision & language.
https://arxiv.org/abs/1603.06059
While cognitive representations of an environment can last for days and even months, the synaptic architecture of the neuronal networks that underlie these representations constantly changes due to various forms of synaptic and structural plasticity at a much faster timescale. This raises an immediate question: how can a transient network maintain a stable representation of space? In the following, we propose a computational model for describing emergence of the hippocampal cognitive map in a network of transient place cell assemblies and demonstrate, using methods of algebraic topology, that such a network can maintain a robust map of the environment.
https://arxiv.org/abs/1606.02765
We have demonstrated effective fringe field control of one-dimensional electron gas (1-DEG) in AlGaN/GaN lateral nanowires. The nanowires are site controlled and formed by a combination of dry and anisotropic wet etching. The nanowire dimensions are well controlled and can have a very high length/width aspect ratio of 10 μm / 5 nm or larger. The transport is controlled by a fringe gate and shows room temperature quantum transport where gradual filling of 1-D subbands gets manifested as oscillations in the transconductance. The fringe gate threshold voltage for depletion of one-dimensional electron gas is found to increase with increasing drain voltage indicating efficient control of 1-DEG. The transport characteristics and fringe field operation are explained by taking into account quantum capacitance in addition to the conventional geometric capacitance. The effect of nanowire width and fringe gate position is also discussed.
https://arxiv.org/abs/1606.02564
Most of the existing Neural Machine Translation (NMT) models focus on the conversion of sequential data and do not directly use syntactic information. We propose a novel end-to-end syntactic NMT model, extending a sequence-to-sequence model with the source-side phrase structure. Our model has an attention mechanism that enables the decoder to generate a translated word while softly aligning it with phrases as well as words of the source sentence. Experimental results on the WAT’15 English-to-Japanese dataset demonstrate that our proposed model considerably outperforms sequence-to-sequence attentional NMT models and compares favorably with the state-of-the-art tree-to-string SMT system.
https://arxiv.org/abs/1603.06075
In decentralized networks (of sensors, connected objects, etc.), there is an important need for efficient algorithms to optimize a global cost function, for instance to learn a global model from the local data collected by each computing unit. In this paper, we address the problem of decentralized minimization of pairwise functions of the data points, where these points are distributed over the nodes of a graph defining the communication topology of the network. This general problem finds applications in ranking, distance metric learning and graph inference, among others. We propose new gossip algorithms based on dual averaging which aim at solving such problems both in synchronous and asynchronous settings. The proposed framework is flexible enough to deal with constrained and regularized variants of the optimization problem. Our theoretical analysis reveals that the proposed algorithms preserve the convergence rate of centralized dual averaging up to an additive bias term. We present numerical simulations on Area Under the ROC Curve (AUC) maximization and metric learning problems which illustrate the practical interest of our approach.
http://arxiv.org/abs/1606.02421
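The communication primitive underneath such gossip algorithms is easy to sketch: in each synchronous round, every node replaces its value with the average over itself and its neighbors, driving the network toward consensus on the global mean. The graph and values below are made up, and this shows only the averaging step, not the dual-averaging optimization itself.

```python
# Synchronous gossip averaging sketch on a fixed communication graph.

def gossip_round(values, neighbors):
    """One round: each node averages its value with its neighbors' values."""
    return [
        (values[i] + sum(values[j] for j in neighbors[i]))
        / (1 + len(neighbors[i]))
        for i in range(len(values))
    ]

# A ring of 4 nodes holding local values with global mean 4.0.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
values = [0.0, 4.0, 8.0, 4.0]
for _ in range(50):
    values = gossip_round(values, neighbors)
```

Because the implied mixing matrix here is symmetric and doubly stochastic, the global average is preserved at every round while the values contract toward it.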
Continual Learning in artificial neural networks suffers from interference and forgetting when different tasks are learned sequentially. This paper introduces the Active Long Term Memory Networks (A-LTM), a model of sequential multi-task deep learning that is able to maintain previously learned associations between sensory input and behavioral output while acquiring new knowledge. A-LTM exploits the non-convex nature of deep neural networks and actively maintains knowledge of previously learned, inactive tasks using a distillation loss. Distortions of the learned input-output map are penalized but hidden layers are free to traverse towards new local optima that are more favorable for the multi-task objective. We re-frame McClelland’s seminal hippocampal theory with respect to the Catastrophic Interference (CI) behavior exhibited by modern deep architectures trained with back-propagation and inhomogeneous sampling of latent factors across epochs. We present empirical results of non-trivial CI during continual learning in Deep Linear Networks trained on the same task, in Convolutional Neural Networks when the task shifts from predicting semantic to graphical factors, and during domain adaptation from simple to complex environments. We present results of the A-LTM model’s ability to maintain viewpoint recognition learned in the highly controlled iLab-20M dataset with 10 object categories and 88 camera viewpoints, while adapting to the unstructured domain of Imagenet with 1,000 object categories.
https://arxiv.org/abs/1606.02355
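The distillation loss mentioned above can be sketched as a cross-entropy between stored soft targets from the frozen network and the current network's soft predictions. The temperature-softened Hinton-style form and the logit values below are assumptions for illustration, not details taken from the abstract.

```python
# Distillation loss sketch: penalize drift of the current network's outputs
# away from soft targets recorded before learning the new task.
import math

def softmax(logits, temp):
    exps = [math.exp(x / temp) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(old_logits, new_logits, temp=2.0):
    """Cross-entropy of current soft predictions against stored soft targets."""
    p = softmax(old_logits, temp)  # soft targets from the frozen network
    q = softmax(new_logits, temp)  # current network's soft predictions
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

loss_same = distillation_loss([2.0, 0.5], [2.0, 0.5])  # outputs unchanged
loss_diff = distillation_loss([2.0, 0.5], [0.5, 2.0])  # outputs drifted
```

The loss is minimal when the current outputs match the stored ones, so gradient descent on it keeps the old input-output map intact while the new task is learned.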
A key problem in salient object detection is how to effectively model the semantic properties of salient objects in a data-driven manner. In this paper, we propose a multi-task deep saliency model based on a fully convolutional neural network (FCNN) with global input (whole raw images) and global output (whole saliency maps). In principle, the proposed saliency model takes a data-driven strategy for encoding the underlying saliency prior information, and then sets up a multi-task learning scheme for exploring the intrinsic correlations between saliency detection and semantic image segmentation. Through collaborative feature learning from these two correlated tasks, the shared fully convolutional layers produce effective features for object perception. Moreover, the model captures the semantic information on salient objects across different levels using the fully convolutional layers, investigating the feature-sharing properties of salient object detection while greatly reducing feature redundancy. Finally, we present a graph Laplacian regularized nonlinear regression model for saliency refinement. Experimental results demonstrate the effectiveness of our approach in comparison with the state-of-the-art approaches.
https://arxiv.org/abs/1510.05484
We propose to enhance the RNN decoder in a neural machine translator (NMT) with external memory, as a natural but powerful extension to the state in the decoding RNN. This memory-enhanced RNN decoder is called \textsc{MemDec}. At each time step during decoding, \textsc{MemDec} reads from this memory and writes to it once, both with content-based addressing. Unlike the unbounded memory in previous work \cite{RNNsearch}, which stores the representation of the source sentence, the memory in \textsc{MemDec} is a matrix with pre-determined size designed to better capture the information important for the decoding process at each time step. Our empirical study on Chinese-English translation shows that it can improve by $4.8$ BLEU over Groundhog and $5.3$ BLEU over Moses, yielding the best performance achieved with the same training set.
https://arxiv.org/abs/1606.02003
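The content-based addressing used by such memory-enhanced decoders can be sketched as a softmax over query-cell similarities followed by a weighted read. This simplified, made-up instance (dot-product scoring, toy vectors) illustrates the read half only; \textsc{MemDec}'s actual parameterization is not given in the abstract.

```python
# Content-based memory read sketch: attention weights from a softmax over
# query-cell dot products; the read vector is the weighted sum of cells.
import math

def content_read(memory, query):
    """memory: list of equal-length vectors; query: vector of same length."""
    scores = [sum(q * m for q, m in zip(query, cell)) for cell in memory]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # stabilized softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    read = [sum(w * cell[d] for w, cell in zip(weights, memory))
            for d in range(len(query))]
    return read, weights

memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
read, weights = content_read(memory, query=[2.0, 0.0])
```

A write step would use the same weights to interpolate new content into the cells, keeping the whole operation differentiable.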
The ability to automatically detect other vehicles on the road is vital to the safety of partially-autonomous and fully-autonomous vehicles. Most of the high-accuracy techniques for this task are based on R-CNN or one of its faster variants. In the research community, much emphasis has been applied to using 3D vision or complex R-CNN variants to achieve higher accuracy. However, are there more straightforward modifications that could deliver higher accuracy? Yes. We show that increasing input image resolution (i.e. upsampling) offers up to 12 percentage-points higher accuracy compared to an off-the-shelf baseline. We also find situations where earlier/shallower layers of a CNN provide higher accuracy than later/deeper layers. We further show that shallow models and upsampled images yield competitive accuracy. Our findings contrast with the current trend towards deeper and larger models to achieve high accuracy in domain specific detection tasks.
https://arxiv.org/abs/1606.01561
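The paper's simplest accuracy lever is input upsampling; a minimal nearest-neighbor upsampler makes the operation concrete. This is illustrative only (the factor and tiny grid are invented); a real pipeline would typically use bilinear interpolation on full RGB images.

```python
# Nearest-neighbor upsampling sketch: enlarge a 2-D grid by an integer
# factor by repeating each source cell in a factor-by-factor block.

def upsample_nn(image, factor):
    """Upsample a rectangular 2-D grid (list of rows) by an integer factor."""
    return [
        [image[r // factor][c // factor]
         for c in range(len(image[0]) * factor)]
        for r in range(len(image) * factor)
    ]

img = [[1, 2],
       [3, 4]]
big = upsample_nn(img, 2)  # 4x4 grid of repeated 2x2 blocks
```

Feeding the detector a larger input grid is what lets small, distant vehicles occupy enough pixels to be detected.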
Significant performance gains in deep learning, coupled with the exponential growth of image and video data on the Internet, have resulted in the recent emergence of automated image captioning systems. Ensuring the scalability of automated image captioning systems with respect to the ever increasing volume of image and video data is a significant challenge. This paper provides the insight that detecting a few significant (top) objects in an image allows one to extract other relevant information, such as actions (verbs), in the image. We expect this insight to be useful in the design of scalable image captioning systems. We address two parameters by which the scalability of image captioning systems could be quantified: the traditional algorithmic time complexity, which is important given the resource limitations of the user device, and the system development time, since the programmers’ time is a critical resource constraint in many real-world scenarios. Additionally, we address the issue of how word embeddings could be used to infer the verb (action) from the nouns (objects) in a given image in a zero-shot manner. Our results show that it is possible to attain reasonably good performance on predicting actions and captioning images using our approaches, with the added advantage of simplicity of implementation.
https://arxiv.org/abs/1606.01393
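The zero-shot verb inference idea can be sketched as a nearest-neighbor search in embedding space: pick the verb whose embedding is most cosine-similar to the detected object's embedding. The tiny 3-d embedding table below is invented purely for illustration; real systems would use pretrained word vectors.

```python
# Zero-shot verb-from-noun sketch: cosine similarity over word embeddings.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

EMB = {  # hypothetical 3-d word embeddings
    "ball":  [0.9, 0.1, 0.0],
    "throw": [0.8, 0.2, 0.1],
    "read":  [0.0, 0.9, 0.4],
}

def infer_verb(noun, verbs):
    """Return the candidate verb closest to the noun in embedding space."""
    return max(verbs, key=lambda v: cosine(EMB[noun], EMB[v]))

verb = infer_verb("ball", ["throw", "read"])
```

No (noun, verb) training pairs are needed: the association comes entirely from the geometry of the pretrained embedding space, which is what makes the inference zero-shot.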