In this paper, we provide a new algorithm for the prediction problem in Reinforcement Learning, i.e., estimating the Value Function of a Markov Reward Process (MRP) using the linear function approximation architecture, with memory and computation costs that scale quadratically in the size of the feature set. The algorithm is a multi-timescale variant of the very popular Cross Entropy (CE) method, a model-based search method for finding the global optimum of a real-valued function. This is the first time a model-based search method has been used for the prediction problem, and the application of CE to a stochastic setting is a completely unexplored domain. A proof of convergence using the ODE method is provided. The theoretical results are supplemented with experimental comparisons: the algorithm achieves good performance fairly consistently on many RL benchmark problems, demonstrating its competitiveness against least squares and other state-of-the-art algorithms in terms of computational efficiency, accuracy and stability.
https://arxiv.org/abs/1609.09449
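For readers unfamiliar with CE, the following is a minimal Python sketch of the generic cross-entropy method for maximizing a real-valued function: sample candidates from a Gaussian, keep the elite fraction, and refit the Gaussian. It is not the multi-timescale, quadratic-cost variant developed in the paper, and every name and parameter below is illustrative.

import numpy as np

def cross_entropy_maximize(f, dim, n_samples=100, n_elite=10, n_iters=50, seed=0):
    """Generic CE method: repeatedly fit a Gaussian to the best-scoring samples."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(n_iters):
        samples = rng.normal(mean, std, size=(n_samples, dim))
        scores = np.array([f(x) for x in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]      # keep the n_elite best samples
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

# Toy usage: the maximizer of this concave function is x = 3 in every coordinate.
best = cross_entropy_maximize(lambda x: -np.sum((x - 3.0) ** 2), dim=5)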
Detecting moving objects in dynamic scenes from sequences of lidar scans is an important task in object tracking, mapping, localization, and navigation. Many works focus on change detection in previously observed scenes, while only a very limited body of literature addresses moving-object detection. The state-of-the-art method exploits Dempster-Shafer Theory to evaluate the occupancy of a lidar scan and to discriminate points belonging to the static scene from moving ones. In this paper we improve both the speed and the accuracy of this method by discretizing the occupancy representation and by removing false positives through visual cues. Many false positives lying on the ground plane are also removed thanks to a novel ground-plane removal algorithm. Efficiency is further improved through an octree indexing strategy. Experimental evaluation on the KITTI public dataset shows the effectiveness of our approach, both qualitatively and quantitatively, with respect to the state-of-the-art.
https://arxiv.org/abs/1609.09267
A novel variational autoencoder is developed to model images, as well as associated labels or captions. The Deep Generative Deconvolutional Network (DGDN) is used as a decoder of the latent image features, and a deep Convolutional Neural Network (CNN) is used as an image encoder; the CNN is used to approximate a distribution for the latent DGDN features/code. The latent code is also linked to generative models for labels (Bayesian support vector machine) or captions (recurrent neural network). When predicting a label/caption for a new image at test time, averaging is performed across the distribution of latent codes; this is computationally efficient as a consequence of the learned CNN-based encoder. Since the framework is capable of modeling the image in the presence/absence of associated labels/captions, a new semi-supervised setting is manifested for CNN learning with images; the framework even allows unsupervised CNN learning, based on images alone.
https://arxiv.org/abs/1609.08976
Recently, end-to-end memory networks have shown promising results on the Question Answering task; they encode past facts into an explicit memory and perform reasoning by making multiple computational steps over that memory. However, memory networks conduct the reasoning on sentence-level memory, output coarse semantic vectors, and do not employ any attention mechanism to focus on words, which may cause the model to lose detailed information, especially when the answers are rare or unknown words. In this paper, we propose a novel Hierarchical Memory Network, dubbed HMN. First, we encode the past facts into sentence-level memory and word-level memory respectively. Then, k-max pooling is applied after the reasoning module on the sentence-level memory to sample the k sentences most relevant to a question, and these sentences are fed into an attention mechanism on the word-level memory to focus on the words in the selected sentences. Finally, the prediction is jointly learned over the outputs of the sentence-level reasoning module and the word-level attention mechanism. The experimental results demonstrate that our approach successfully conducts answer selection on unknown words and achieves better performance than memory networks.
https://arxiv.org/abs/1609.08843
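As a rough illustration of the k-max selection step described above, the numpy sketch below scores sentences against the question with a dot product, keeps the k most relevant ones, and attends over their words; the real HMN modules are learned end-to-end, so the scoring and names here are only stand-ins.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def k_max_sentence_selection(question, sentence_mem, word_mem, k=3):
    """Pick the k sentences most relevant to the question, then attend over their words.

    question      : (d,) query vector
    sentence_mem  : (n_sent, d) sentence-level memory
    word_mem      : (n_sent, n_words, d) word-level memory
    """
    sent_scores = sentence_mem @ question                 # relevance of each sentence
    top_k = np.argsort(sent_scores)[-k:]                  # indices of the k best sentences
    words = word_mem[top_k].reshape(-1, word_mem.shape[-1])
    attn = softmax(words @ question)                      # word-level attention weights
    return attn @ words                                   # attended word representation

# Toy usage with random vectors standing in for learned embeddings.
rng = np.random.default_rng(0)
out = k_max_sentence_selection(rng.normal(size=64),
                               rng.normal(size=(10, 64)),
                               rng.normal(size=(10, 20, 64)))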
Recently, a number of deep-learning based models have been proposed for the task of Visual Question Answering (VQA). The performance of most models is clustered around 60-70%. In this paper we propose systematic methods to analyze the behavior of these models as a first step towards recognizing their strengths and weaknesses, and identifying the most fruitful directions for progress. We analyze two models, one each from the two major classes of VQA models (with-attention and without-attention), and show the similarities and differences in their behavior. We also analyze the winning entry of the VQA Challenge 2016. Our behavior analysis reveals that despite recent progress, today’s VQA models are “myopic” (tend to fail on sufficiently novel instances), often “jump to conclusions” (converge on a predicted answer after ‘listening’ to just half the question), and are “stubborn” (do not change their answers across images).
https://arxiv.org/abs/1606.07356
We present a novel neural architecture for answering queries, designed to optimally leverage explicit support in the form of query-answer memories. Our model is able to refine and update a given query while separately accumulating evidence for predicting the answer. Its architecture reflects this separation with dedicated embedding matrices and loosely connected information pathways (modules) for updating the query and accumulating evidence. This separation of responsibilities effectively decouples the search for query related support and the prediction of the answer. On recent benchmark datasets for reading comprehension, our model achieves state-of-the-art results. A qualitative analysis reveals that the model effectively accumulates weighted evidence from the query and over multiple support retrieval cycles which results in a robust answer prediction.
https://arxiv.org/abs/1607.03316
Motivated by the need to automate medical information extraction from free-text radiological reports, we present a bi-directional long short-term memory (BiLSTM) neural network architecture for modelling radiological language. The model has been used to address two NLP tasks: medical named-entity recognition (NER) and negation detection. We investigate whether learning several types of word embeddings improves BiLSTM’s performance on those tasks. Using a large dataset of chest x-ray reports, we compare the proposed model to a baseline dictionary-based NER system and a negation detection system that leverages the hand-crafted rules of the NegEx algorithm and the grammatical relations obtained from the Stanford Dependency Parser. Compared to these more traditional rule-based systems, we argue that BiLSTM offers a strong alternative for both our tasks.
https://arxiv.org/abs/1609.08409
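For reference, a minimal PyTorch sketch of a BiLSTM token tagger of the general kind used for NER here (embedding layer, bidirectional LSTM, per-token classification layer). The authors' embeddings, label set and negation-detection head are not reproduced; all dimensions are placeholders.

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM token classifier for NER-style tagging."""
    def __init__(self, vocab_size, n_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))
        return self.out(h)                   # (batch, seq_len, n_tags) tag scores

# Toy usage: tag a batch of two 12-token "reports" drawn from a 5000-word vocabulary.
model = BiLSTMTagger(vocab_size=5000, n_tags=9)
scores = model(torch.randint(0, 5000, (2, 12)))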
We demonstrate single photon emission from self-assembled m-plane InGaN quantum dots (QDs) embedded on the side-walls of GaN nanowires. A combination of electron microscopy, cathodoluminescence, time-resolved micro-PL and photon autocorrelation experiments gives a thorough evaluation of the QDs' structural and optical properties. The QD exhibits anti-bunched emission up to 100 K, with a measured autocorrelation function of g^2(0) = 0.28 (0.03) at 5 K. Studies on a statistically significant number of QDs show that these m-plane QDs exhibit very fast radiative lifetimes (260 +/- 55 ps), suggesting smaller internal fields than any of the previously reported c-plane and a-plane QDs. Moreover, the observed single photons are almost completely linearly polarized, aligned perpendicular to the crystallographic c-axis, with a degree of linear polarization of 0.84 +/- 0.12. Such InGaN QDs incorporated in a nanowire system meet many of the requirements for implementation into quantum information systems and could potentially open the door to wholly new device concepts.
https://arxiv.org/abs/1609.07973
Learning a joint language-visual embedding has a number of very appealing properties and can result in a variety of practical applications, including natural language image/video annotation and search. In this work, we study three different joint language-visual neural network model architectures. We evaluate our models on the large-scale LSMDC16 movie dataset for two tasks: 1) standard ranking for video annotation and retrieval, and 2) our proposed movie multiple-choice test. This test facilitates automatic evaluation of visual-language models for natural language video annotation based on human activities. In addition to the original Audio Description (AD) captions provided as part of LSMDC16, we collected and will make available a) manually generated re-phrasings of those captions obtained using Amazon MTurk, and b) automatically generated human activity elements in “Predicate + Object” (PO) phrases based on “Knowlywood”, an activity knowledge mining model. Our best model achieves Recall@10 of 19.2% on annotation and 18.9% on video retrieval for a subset of 1000 samples. On the multiple-choice test, our best model achieves an accuracy of 58.11% over the whole LSMDC16 public test set.
https://arxiv.org/abs/1609.08124
Parallel programming is emerging fast, and intensive applications need more resources, so there is a huge demand for on-chip multiprocessors. Accessing the L1 caches beside the cores is the fastest option after registers, but the size of private caches cannot grow because of design, cost and technology limits. Split I-caches and D-caches are therefore used together with a shared LLC (last-level cache). For a unified shared LLC, the bus interface is not scalable, so a distributed shared LLC (DSLLC) seems the better choice. Most papers assume a distributed shared LLC beside each core in the on-chip network; however, we show that this design ignores the effect of traffic congestion in the on-chip network. Our work instead focuses on the optimal placement of cores, DSLLCs and even memory controllers to minimize the expected latency based on the traffic load in a mesh on-chip network with a fixed number of cores and a fixed total cache capacity. We attempt some analytical modeling, deriving the intended cost function, and then optimize the mean delay of the on-chip network communication. The work is to be verified using traffic patterns run on the CSIM simulator.
https://arxiv.org/abs/1607.04298
Visual Question Answering (VQA) is the task of answering natural-language questions about images. We introduce the novel problem of determining the relevance of questions to images in VQA. Current VQA models do not reason about whether a question is even related to the given image (e.g. What is the capital of Argentina?) or if it requires information from external resources to answer correctly. This can break the continuity of a dialogue in human-machine interaction. Our approaches for determining relevance are composed of two stages. Given an image and a question, (1) we first determine whether the question is visual or not, (2) if visual, we determine whether the question is relevant to the given image or not. Our approaches, based on LSTM-RNNs, VQA model uncertainty, and caption-question similarity, are able to outperform strong baselines on both relevance tasks. We also present human studies showing that VQA models augmented with such question relevance reasoning are perceived as more intelligent, reasonable, and human-like.
https://arxiv.org/abs/1606.06622
We present a model for pragmatically describing scenes, in which contrastive behavior results from a combination of inference-driven pragmatics and learned semantics. Like previous learned approaches to language generation, our model uses a simple feature-driven architecture (here a pair of neural “listener” and “speaker” models) to ground language in the world. Like inference-driven approaches to pragmatics, our model actively reasons about listener behavior when selecting utterances. For training, our approach requires only ordinary captions, annotated without demonstration of the pragmatic behavior the model ultimately exhibits. In human evaluations on a referring expression game, our approach succeeds 81% of the time, compared to a 69% success rate using existing techniques.
https://arxiv.org/abs/1604.00562
We present an approach to simultaneously perform semantic segmentation and prepositional phrase attachment resolution for captioned images. Some ambiguities in language cannot be resolved without simultaneously reasoning about an associated image. If we consider the sentence “I shot an elephant in my pajamas”, looking at language alone (and not using common sense), it is unclear if it is the person or the elephant wearing the pajamas or both. Our approach produces a diverse set of plausible hypotheses for both semantic segmentation and prepositional phrase attachment resolution that are then jointly reranked to select the most consistent pair. We show that our semantic segmentation and prepositional phrase attachment resolution modules have complementary strengths, and that joint reasoning produces more accurate results than any module operating in isolation. Multiple hypotheses are also shown to be crucial to improved multiple-module reasoning. Our vision and language approach significantly outperforms the Stanford Parser (De Marneffe et al., 2006) by 17.91% (28.69% relative) and 12.83% (25.28% relative) in two different experiments. We also make small improvements over DeepLab-CRF (Chen et al., 2015).
https://arxiv.org/abs/1604.02125
This paper presents a new multi-view RGB-D dataset of nine kitchen scenes, each containing several objects in realistic cluttered environments including a subset of objects from the BigBird dataset. The viewpoints of the scenes are densely sampled and objects in the scenes are annotated with bounding boxes and in the 3D point cloud. Also, an approach for detection and recognition is presented, which is comprised of two parts: i) a new multi-view 3D proposal generation method and ii) the development of several recognition baselines using AlexNet to score our proposals, which is trained either on crops of the dataset or on synthetically composited training images. Finally, we compare the performance of the object proposals and a detection baseline to the Washington RGB-D Scenes (WRGB-D) dataset and demonstrate that our Kitchen scenes dataset is more challenging for object detection and recognition. The dataset is available at: this http URL.
https://arxiv.org/abs/1609.07826
Neurodegenerative diseases and traumatic brain injuries (TBI) are among the main causes of cognitive dysfunction in humans. Both manifestations exhibit the extensive presence of focal axonal swellings (FAS). FAS compromises the information encoded in spike trains, thus leading to potentially severe functional deficits. Complicating our understanding of the impact of FAS is our inability to access small scale injuries with non-invasive methods, the overall complexity of neuronal pathologies, and our limited knowledge of how networks process biological signals. Building on Hopfield’s pioneering work, we extend a model for associative memory to account for FAS and its impact on memory encoding. We calibrate all FAS parameters from biophysical observations of their statistical distribution and size, providing a framework to simulate the effects of brain disorders on memory recall performance. A face recognition example is used to demonstrate and validate the functionality of the novel model. Our results link memory recall ability to observed FAS statistics, allowing for a description of different stages of brain disorders within neuronal networks. This provides a first theoretical model to bridge experimental observations of FAS in neurodegeneration and TBI with compromised memory recall, thus closing the large gap between theory and experiment on how biological signals are processed in damaged, high-dimensional functional networks. The work further lends new insight into positing diagnostic tools to measure cognitive deficits.
https://arxiv.org/abs/1609.07656
We introduce a deep memory network for aspect-level sentiment classification. Unlike feature-based SVM and sequential neural models such as LSTM, this approach explicitly captures the importance of each context word when inferring the sentiment polarity of an aspect. These importance degrees and the text representation are calculated with multiple computational layers, each of which is a neural attention model over an external memory. Experiments on laptop and restaurant datasets demonstrate that our approach performs comparably to a state-of-the-art feature-based SVM system, and substantially better than LSTM and attention-based LSTM architectures. On both datasets we show that multiple computational layers improve the performance. Moreover, our approach is also fast: the deep memory network with 9 layers is 15 times faster than an LSTM with a CPU implementation.
https://arxiv.org/abs/1605.08900
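A simplified numpy sketch of a single computational layer, i.e. one attention hop over the context-word memory. The paper's layers also apply a learned linear transform to the aspect vector, which is omitted here, so this only conveys the overall structure; all sizes are arbitrary.

import numpy as np

def attention_hop(aspect_vec, memory):
    """One attention layer over the context-word memory.

    aspect_vec : (d,) current aspect representation
    memory     : (n_words, d) embeddings of the context words
    Returns an updated aspect representation; stacking several hops
    gives the multi-layer network described in the abstract.
    """
    scores = memory @ aspect_vec                     # relevance of each context word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax attention weights
    return weights @ memory + aspect_vec             # attended context plus carry-over

rng = np.random.default_rng(0)
aspect, memory = rng.normal(size=32), rng.normal(size=(15, 32))
for _ in range(9):                                   # e.g. 9 hops, as in the abstract
    aspect = attention_hop(aspect, memory)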
A big challenge in environmental monitoring is the spatiotemporal variation of the phenomena to be observed. To enable persistent sensing and estimation in such a setting, it is beneficial to have a time-varying underlying environmental model. Here we present a planning and learning method that enables an autonomous marine vehicle to perform persistent ocean monitoring tasks by learning and refining an environmental model. To alleviate the computational bottleneck caused by the accumulation of large-scale data, we propose a framework that iterates between a planning component aimed at collecting the most information-rich data, and a sparse Gaussian Process learning component in which the environmental model and hyperparameters are learned online using only the subset of data that provides the greatest contribution. Our simulations with ground-truth ocean data show that the proposed method is both accurate and efficient.
http://arxiv.org/abs/1609.07560
Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations. We hypothesize that these methods are not as expressive as an outer product of the visual and textual vectors. As the outer product is typically infeasible due to its high dimensionality, we instead propose utilizing Multimodal Compact Bilinear pooling (MCB) to efficiently and expressively combine multimodal features. We extensively evaluate MCB on the visual question answering and grounding tasks. We consistently show the benefit of MCB over ablations without MCB. For visual question answering, we present an architecture which uses MCB twice, once for predicting attention over spatial features and again to combine the attended representation with the question representation. This model outperforms the state-of-the-art on the Visual7W dataset and the VQA challenge.
https://arxiv.org/abs/1606.01847
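A numpy sketch of the core MCB operation in its usual Count Sketch plus FFT formulation: each modality is projected with random hashes and signs, and the element-wise product in the Fourier domain realizes the circular convolution that approximates the outer product. The sketch dimension and feature sizes below are arbitrary placeholders.

import numpy as np

def count_sketch(v, h, s, d):
    """Project vector v into d dimensions with hash indices h and signs s (Count Sketch)."""
    sketch = np.zeros(d)
    np.add.at(sketch, h, s * v)
    return sketch

def mcb_pool(visual, textual, d=16000, seed=0):
    """Multimodal Compact Bilinear pooling: circular convolution of two Count Sketches,
    computed via FFT, approximating the outer product of the two feature vectors."""
    rng = np.random.default_rng(seed)
    h_v = rng.integers(0, d, size=visual.shape[0])
    s_v = rng.choice([-1.0, 1.0], size=visual.shape[0])
    h_t = rng.integers(0, d, size=textual.shape[0])
    s_t = rng.choice([-1.0, 1.0], size=textual.shape[0])
    sk_v = count_sketch(visual, h_v, s_v, d)
    sk_t = count_sketch(textual, h_t, s_t, d)
    return np.fft.irfft(np.fft.rfft(sk_v) * np.fft.rfft(sk_t), n=d)

rng = np.random.default_rng(1)
pooled = mcb_pool(rng.normal(size=2048), rng.normal(size=300), d=8000)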
Radial velocity searches for exoplanets have detected many multi-planet systems around nearby bright stars. An advantage of this technique is that it generally samples the orbit outside of inferior/superior conjunction, potentially allowing the Keplerian elements of eccentricity and argument of periastron to be well characterized. The orbital architectures for some of these systems show signs of close planetary encounters that may render the systems unstable as described. We provide an in-depth analysis of two such systems: HD 5319 and HD 7924, for which the scenario of coplanar orbits results in rapid destabilization of the systems. The poorly constrained periastron arguments of the outer planets in these systems further emphasizes the need for detailed investigations. An exhaustive scan of parameters space via dynamical simulations reveals specific mutual inclinations between the two outer planets in each system that allow for stable configurations over long timescales. We compare these configurations with those presented by mean-motion resonance as possible stability sources. Finally, we discuss the relevance to interpretation of multi-planet Keplerian orbits and suggest additional observations that will help to resolve the system stabilities.
https://arxiv.org/abs/1608.02590
We consider the problem of scale detection in images where a region of interest is present together with a measurement tool (e.g. a ruler). For the segmentation part, we focus on the graph based method by Flenner and Bertozzi which reinterprets classical continuous Ginzburg-Landau minimisation models in a totally discrete framework. To overcome the numerical difficulties due to the large size of the images considered we use matrix completion and splitting techniques. The scale on the measurement tool is detected via a Hough transform based algorithm. The method is then applied to some measurement tasks arising in real-world applications such as zoology, medicine and archaeology.
https://arxiv.org/abs/1602.08574
Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015), which have proven successful at reducing the size of neural models in other domains, to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance, and somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher with little loss in performance. It is also significantly better than a baseline model trained without knowledge distillation: by 4.2/1.7 BLEU with greedy decoding/beam search. Applying weight pruning on top of knowledge distillation results in a student model that has 13 times fewer parameters than the original teacher model, with a decrease of 0.4 BLEU.
https://arxiv.org/abs/1606.07947
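A hedged PyTorch sketch of the word-level distillation objective, i.e. the cross-entropy of the student against the teacher's per-token distribution; the sequence-level variants the paper introduces (which distill from beam-searched teacher outputs) are not shown, and the temperature is an assumed knob rather than a detail taken from the paper.

import torch
import torch.nn.functional as F

def word_level_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student against the teacher's per-token distribution.

    student_logits, teacher_logits : (batch, seq_len, vocab) unnormalized scores
    """
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

# Toy usage with random scores over a 50-word vocabulary.
loss = word_level_distillation_loss(torch.randn(4, 7, 50), torch.randn(4, 7, 50))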
Accelerated by the tremendous increase in Internet bandwidth and storage space, video data has been generated, published and spread explosively, becoming an indispensable part of today’s big data. In this paper, we focus on reviewing two lines of research aiming to stimulate the comprehension of videos with deep learning: video classification and video captioning. While video classification concentrates on automatically labeling video clips based on their semantic contents like human actions or complex events, video captioning attempts to generate a complete and natural sentence, enriching the single label as in video classification, to capture the most informative dynamics in videos. In addition, we also provide a review of popular benchmarks and competitions, which are critical for evaluating the technical progress of this vibrant field.
https://arxiv.org/abs/1609.06782
Because most technology and computer architecture innovations were (intentionally) invisible to higher layers, application and other software developers could reap the benefits of this progress without engaging in it. Higher performance has both made more computationally demanding applications feasible (e.g., virtual assistants, computer vision) and made less demanding applications easier to develop by enabling higher-level programming abstractions (e.g., scripting languages and reusable components). Improvements in computer system cost-effectiveness enabled value creation that could never have been imagined by the field’s founders (e.g., distributed web search sufficiently inexpensive so as to be covered by advertising links). The wide benefits of computer performance growth are clear. Recently, Danowitz et al. apportioned computer performance growth roughly equally between technology and architecture, with architecture credited with ~80x improvement since 1985. As semiconductor technology approaches its “end-of-the-road” (see below), computer architecture will need to play an increasing role in enabling future ICT innovation. But instead of asking, “How can I make my chip run faster?,” architects must now ask, “How can I enable the 21st century infrastructure, from sensors to clouds, adding value from performance to privacy, but without the benefit of near-perfect technology scaling?”. The challenges are many, but with appropriate investment, opportunities abound. Underlying these opportunities is a common theme that future architecture innovations will require the engagement of and investments from innovators in other ICT layers.
https://arxiv.org/abs/1609.06756
The Visual Question Answering (VQA) task has showcased a new stage of interaction between language and vision, two of the most pivotal components of artificial intelligence. However, it has mostly focused on generating short and repetitive answers, mostly single words, which fall short of the rich linguistic capabilities of humans. We introduce the Full-Sentence Visual Question Answering (FSVQA) dataset, consisting of nearly 1 million pairs of questions and full-sentence answers for images, built by applying a number of rule-based natural language processing techniques to the original VQA dataset and to captions in the MS COCO dataset. This poses many additional complexities over the conventional VQA task, and we provide a baseline for approaching and evaluating the task, on top of which we invite the research community to build further improvements.
https://arxiv.org/abs/1609.06657
Encoder-decoder networks are popular for modeling sequences probabilistically in many applications. These models use the power of the Long Short-Term Memory (LSTM) architecture to capture the full dependence among variables, unlike earlier models like CRFs that typically assumed conditional independence among non-adjacent variables. However, in practice encoder-decoder models exhibit a bias towards short sequences that surprisingly gets worse with increasing beam size. In this paper we show that such a phenomenon is due to a discrepancy between the full sequence margin and the per-element margin enforced by the locally conditioned training objective of an encoder-decoder model. The discrepancy more adversely impacts long sequences, explaining the bias towards predicting short sequences. For the case where the predicted sequences come from a closed set, we show that a globally conditioned model alleviates the above problems of encoder-decoder models. From a practical point of view, our proposed model also eliminates the need for a beam search during inference, which reduces to an efficient dot-product based search in a vector space.
https://arxiv.org/abs/1606.03402
Neural machine translation (NMT) has become the new state-of-the-art, achieving promising translation results with a simple encoder-decoder neural network. This network is trained once on the parallel corpus, and the fixed network is then used to translate all the test sentences. We argue that such a general fixed network cannot best fit every specific test sentence. In this paper, we propose dynamic NMT, which learns a general network as usual and then fine-tunes the network for each test sentence. The fine-tuning is done on a small set of bilingual training data obtained through similarity search for the test sentence. Extensive experiments demonstrate that this method can significantly improve translation performance, especially when highly similar sentences are available.
https://arxiv.org/abs/1609.06490
In this paper we address the question of how to render sequence-level networks better at handling structured input. We propose a machine reading simulator which processes text incrementally from left to right and performs shallow reasoning with memory and attention. The reader extends the Long Short-Term Memory architecture with a memory network in place of a single memory cell. This enables adaptive memory usage during recurrence with neural attention, offering a way to weakly induce relations among tokens. The system is initially designed to process a single sequence but we also demonstrate how to integrate it with an encoder-decoder architecture. Experiments on language modeling, sentiment analysis, and natural language inference show that our model matches or outperforms the state of the art.
https://arxiv.org/abs/1601.06733
Machine-learning algorithms offer immense possibilities in the development of several cognitive applications. In fact, large-scale machine-learning classifiers now represent the state-of-the-art in a wide range of object detection/classification problems. However, the network complexities of large-scale classifiers make them one of the most challenging and energy-intensive workloads across the computing spectrum. In this paper, we present a new approach to optimize the energy efficiency of object detection tasks using semantic decomposition to build a hierarchical classification framework. We observe that certain semantic information like color/texture is common across various images in real-world datasets for object detection applications. We exploit these common semantic features to distinguish the objects of interest from the remaining inputs (non-objects of interest) in a dataset at a lower computational effort. We propose a 2-stage hierarchical classification framework, with increasing levels of complexity, wherein the first stage is trained to recognize the broad representative semantic features relevant to the object of interest. The first stage rejects the input instances that do not have the representative features and passes only the relevant instances to the second stage. Our methodology thus allows us to reject certain information at lower complexity and utilize the full computational effort of a network only on a smaller fraction of inputs to perform detection. We use color and texture as distinctive traits to carry out several experiments for object detection. Our experiments on the Caltech101/CIFAR10 dataset show that the proposed method yields 1.93x/1.46x improvement in average energy, respectively, over the traditional single classifier model.
https://arxiv.org/abs/1509.08970
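A toy Python sketch of the 2-stage cascade logic described above: a cheap first-stage check on a broad semantic cue (e.g. color/texture) decides whether the expensive second-stage classifier runs at all. Both classifiers below are placeholders, not the paper's trained networks.

import numpy as np

def cascade_predict(x, stage1, stage2, threshold=0.5):
    """Two-stage hierarchical classification: a cheap semantic screen, then a full model.

    stage1(x) -> score for the broad semantic cue (e.g. color/texture)
    stage2(x) -> full object-class prediction, run only when the screen passes
    """
    if stage1(x) < threshold:
        return "non-object-of-interest"       # rejected early at low computational cost
    return stage2(x)

# Toy usage with stand-in classifiers.
stage1 = lambda x: float(np.mean(x))          # cheap feature check (placeholder)
stage2 = lambda x: "object-of-interest"       # expensive CNN (placeholder)
print(cascade_predict(np.random.rand(32, 32, 3), stage1, stage2))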
We propose a simplified model of attention which is applicable to feed-forward neural networks and demonstrate that the resulting model can solve the synthetic “addition” and “multiplication” long-term memory problems for sequence lengths which are both longer and more widely varying than the best published results for these tasks.
https://arxiv.org/abs/1512.08756
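Read loosely, the simplified mechanism amounts to scoring each timestep with a learned vector, softmax-normalizing over time, and taking the weighted average; a minimal numpy sketch under that reading follows, with random data standing in for the sequence and the scoring vector.

import numpy as np

def feed_forward_attention(hidden_states, w):
    """Simplified attention for feed-forward models: score each timestep with a
    learned vector w, softmax over time, and return the weighted average."""
    scores = hidden_states @ w                       # (seq_len,) per-timestep scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ hidden_states                   # fixed-length summary of the sequence

rng = np.random.default_rng(0)
seq = rng.normal(size=(200, 64))                     # a long, variable-length input sequence
summary = feed_forward_attention(seq, rng.normal(size=64))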
The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. They enable us to define a new benchmark for localization of textual entity mentions in an image. We present a strong baseline for this task that combines an image-text embedding, detectors for common objects, a color classifier, and a bias towards selecting larger objects. While our baseline rivals in accuracy more complex state-of-the-art models, we show that its gains cannot be easily parlayed into improvements on such tasks as image-sentence retrieval, thus underlining the limitations of current methods and the need for further research.
https://arxiv.org/abs/1505.04870
We utilize machine learning models which are based on recurrent neural networks to optimize dynamical decoupling (DD) sequences. DD is a relatively simple technique for suppressing the errors in quantum memory for certain noise models. In numerical simulations, we show that with minimum use of prior knowledge and starting from random sequences, the models are able to improve over time and eventually output DD-sequences with performance better than that of the well known DD-families. Furthermore, our algorithm is easy to implement in experiments to find solutions tailored to the specific hardware, as it treats the figure of merit as a black box.
https://arxiv.org/abs/1604.00279
We present a new approach for neural machine translation (NMT) that uses the morphological and grammatical decomposition of the words (factors) on the output side of the neural network. This architecture addresses two main problems in MT, namely dealing with a large target-language vocabulary and with out-of-vocabulary (OOV) words. By means of factors, we are able to handle a larger vocabulary and reduce the training time (for systems with equivalent target-language vocabulary size). In addition, we can produce new words that are not in the vocabulary. We use a morphological analyser to obtain a factored representation of each word (lemma, Part-of-Speech tag, tense, person, gender and number). We extend the attention-based NMT approach to produce two different outputs, one for the lemmas and the other for the rest of the factors. The final translation is built using some a priori linguistic information. We compare our extension with a word-based NMT system. Experiments performed on the IWSLT’15 dataset, translating from English to French, show that while performance does not always increase, the system can manage a much larger vocabulary and consistently reduces the OOV rate. We observe up to 2% BLEU point improvement in a simulated out-of-domain translation setup.
https://arxiv.org/abs/1609.04621
The task of associating images and videos with a natural language description has attracted a great amount of attention recently. Rapid progress has been made in terms of both developing novel algorithms and releasing new datasets. Indeed, the state-of-the-art results on some of the standard datasets have been pushed into the regime where it has become more and more difficult to make significant improvements. Instead of proposing new models, this work investigates the possibility of empirically establishing performance upper bounds on various visual captioning datasets without extra data labelling effort or human evaluation. In particular, it is assumed that visual captioning is decomposed into two steps: from visual inputs to visual concepts, and from visual concepts to natural language descriptions. An upper bound can then be obtained by assuming the first step is perfect and only training a conditional language model for the second step. We demonstrate the construction of such bounds on MS-COCO, YouTube2Text and LSMDC (a combination of M-VAD and MPII-MD). Surprisingly, despite the imperfect process we used for visual concept extraction in the first step and the simplicity of the language model for the second step, we show that current state-of-the-art models fall short when compared with the learned upper bounds. Furthermore, with such a bound, we quantify several important factors concerning image and video captioning: the number of visual concepts captured by different models, the trade-off between the amount of visual elements captured and their accuracy, and the intrinsic difficulty and blessing of different datasets.
https://arxiv.org/abs/1511.04590
The attention mechanism is appealing for neural machine translation, since it is able to dynamically encode a source sentence by generating an alignment between a target word and source words. Unfortunately, it has been proved to be worse than conventional alignment models in alignment accuracy. In this paper, we analyze and explain this issue from the point of view of reordering, and propose a supervised attention mechanism which is learned with guidance from conventional alignment models. Experiments on two Chinese-to-English translation tasks show that the supervised attention mechanism yields better alignments, leading to substantial gains over standard attention-based NMT.
https://arxiv.org/abs/1609.04186
The attention mechanism is an important part of neural machine translation (NMT), where it was reported to produce richer source representations compared to fixed-length encoding sequence-to-sequence models. Recently, the effectiveness of attention has also been explored in the context of image captioning. In this work, we assess the feasibility of a multimodal attention mechanism that simultaneously focuses on an image and its natural language description for generating a description in another language. We train several variants of our proposed attention mechanism on the Multi30k multilingual image captioning dataset. We show that a dedicated attention for each modality achieves gains of up to 1.6 points in BLEU and METEOR compared to a textual NMT baseline.
https://arxiv.org/abs/1609.03976
We present the results of near-infrared (2.5–5.4um) long-slit spectroscopy of the extended green object (EGO) G318.05+0.09 with AKARI. Two distinct sources are found in the slit. The brighter source has strong red continuum emission with H2O ice, CO2 ice, and CO gas and ice absorption features at 3.0, 4.25, and 4.67um, respectively, while the other, greenish object shows peculiar emission with double peaks at around 4.5 and 4.7um. The former source is located close to the ultra-compact HII region IRAS 14498-5856 and is identified as an embedded massive young stellar object. The spectrum of the latter source can be interpreted as blue-shifted (-3000 ~ -6000km/s), optically-thin emission of the fundamental ro-vibrational transitions (v=1-0) of CO molecules with temperatures of 12000–3700K, without noticeable H2 and HI emission. We discuss the nature of this source in terms of outflow associated with the young stellar object and supernova ejecta associated with a supernova remnant.
https://arxiv.org/abs/1608.06698
We investigate the reasons why context in object detection has limited utility by isolating and evaluating the predictive power of different context cues under ideal conditions in which context is provided by an oracle. Based on this study, we propose a region-based context re-scoring method with dynamic context selection to remove noise and emphasize informative context. We introduce latent indicator variables to select (or ignore) potential contextual regions, and learn the selection strategy with latent-SVM. We conduct experiments to evaluate the performance of the proposed context selection method on the SUN RGB-D dataset. The method achieves a significant improvement in terms of mean average precision (mAP), compared with both appearance based detectors and a conventional context model without the selection scheme.
https://arxiv.org/abs/1609.02948
Deep neural networks have shown striking progress and obtained state-of-the-art results in many AI research fields in the recent years. However, it is often unsatisfying to not know why they predict what they do. In this paper, we address the problem of interpreting Visual Question Answering (VQA) models. Specifically, we are interested in finding what part of the input (pixels in images or words in questions) the VQA model focuses on while answering the question. To tackle this problem, we use two visualization techniques – guided backpropagation and occlusion – to find important words in the question and important regions in the image. We then present qualitative and quantitative analyses of these importance maps. We found that even without explicit attention mechanisms, VQA models may sometimes be implicitly attending to relevant regions in the image, and often to appropriate words in the question.
https://arxiv.org/abs/1608.08974
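A simple numpy sketch in the spirit of the occlusion technique: a gray patch is slid over the image and the drop in the probability of the originally predicted answer is recorded as an importance map. The answer_prob callable is a stand-in for a real VQA model, and the patch size, stride and fill value are arbitrary choices.

import numpy as np

def occlusion_importance(image, answer_prob, patch=16, stride=16, fill=0.5):
    """Slide a gray patch over the image and record the drop in the model's
    probability for its original answer; large drops mark important regions."""
    base = answer_prob(image)
    h, w = image.shape[:2]
    heatmap = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill
            heatmap[i, j] = base - answer_prob(occluded)
    return heatmap

# Toy usage with a stand-in "model" that simply prefers bright images.
img = np.random.rand(64, 64, 3)
hm = occlusion_importance(img, lambda im: float(im.mean()))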
A great video title describes the most salient event compactly and captures the viewer’s attention. In contrast, video captioning tends to generate sentences that describe the video as a whole. Although generating a video title automatically is a very useful task, it is much less addressed than video captioning. We address video title generation for the first time by proposing two methods that extend state-of-the-art video captioners to this new task. First, we make video captioners highlight-sensitive by priming them with a highlight detector. Our framework allows for jointly training a model for title generation and video highlight localization. Second, we induce high sentence diversity in video captioners, so that the generated titles are also diverse and catchy. This means that a large number of sentences might be required to learn the sentence structure of titles. Hence, we propose a novel sentence augmentation method to train a captioner with additional sentence-only examples that come without corresponding videos. We collected a large-scale Video Titles in the Wild (VTW) dataset of 18100 automatically crawled user-generated videos and titles. On VTW, our methods consistently improve title prediction accuracy, and achieve the best performance in both automatic and human evaluation. Finally, our sentence augmentation method also outperforms the baselines on the M-VAD dataset.
https://arxiv.org/abs/1608.07068
Accurate pedestrian detection has a primary role in automotive safety: for example, by issuing warnings to the driver or acting directly on the car’s brakes, it helps decrease the probability of injuries and human fatalities. In order to achieve very high accuracy, recent pedestrian detectors have been based on Convolutional Neural Networks (CNN). Unfortunately, such approaches require vast amounts of computational power and memory, preventing efficient implementations on embedded systems. This work proposes a CNN-based detector, adapting a general-purpose convolutional network to the task at hand. By thoroughly analyzing and optimizing each step of the detection pipeline, we develop an architecture that outperforms methods based on traditional image features and achieves an accuracy close to the state-of-the-art while having low computational complexity. Furthermore, the model is compressed in order to fit the tight constraints of low-power devices with a limited amount of embedded memory available. This paper makes two main contributions: (1) it proves that a region-based deep neural network can be finely tuned to achieve adequate accuracy for pedestrian detection; (2) it achieves a very low memory usage without reducing detection accuracy on the Caltech Pedestrian dataset.
https://arxiv.org/abs/1609.02500
Detection of periodicity in the broad-band non-thermal emission of blazars has so far proven elusive. However, there are a number of scenarios which could lead to quasi-periodic variations in blazar light curves. For example, orbital or thermal/viscous periods of accreting matter around central supermassive black holes could, in principle, be imprinted in the multi-wavelength emission of small-scale blazar jets, carrying crucial information about plasma conditions within the jet launching regions. In this paper, we present the results of our time series analysis of the $\sim 9.2$ year-long, exceptionally well-sampled optical light curve of the BL Lac OJ 287. The study primarily uses the data from our own observations performed at the Mt. Suhora and Kraków Observatories in Poland, and at the Athens Observatory in Greece. Additionally, SMARTS observations were used to fill in some of the gaps in the data. The Lomb-Scargle Periodogram and the Weighted Wavelet Z-transform methods were employed to search for possible QPOs in the resulting optical light curve of the source. Both methods consistently yielded a possible quasi-periodic signal around periods of $\sim 400$ and $\sim 800$ days, the former with a significance (over the underlying colored noise) of $\geq 99\%$. A number of likely explanations are discussed, with preference given to a modulation of the jet production efficiency by highly magnetized accretion disks. This supports the previous findings and the interpretation reported recently in the literature for OJ 287 and other blazar sources.
https://arxiv.org/abs/1609.02388
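For illustration, a short Python snippet computing a Lomb-Scargle periodogram of an unevenly sampled toy light curve with scipy; it only mimics the kind of ~400-day signal discussed above and does not reproduce the paper's Weighted Wavelet Z-transform analysis or its colored-noise significance estimate.

import numpy as np
from scipy.signal import lombscargle

# Toy quasi-periodic signal sampled at uneven epochs (in days), standing in
# for an unevenly sampled optical light curve spanning roughly 9.2 years.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 3400, size=500))
mag = np.sin(2 * np.pi * t / 400.0) + 0.3 * rng.normal(size=t.size)

periods = np.linspace(50, 1200, 2000)                 # trial periods in days
ang_freqs = 2 * np.pi / periods                       # scipy expects angular frequencies
power = lombscargle(t, mag - mag.mean(), ang_freqs)
best_period = periods[np.argmax(power)]               # should land near 400 days here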
In this work we introduce a convolutional neural network (CNN) that jointly handles low-, mid-, and high-level vision tasks in a unified architecture that is trained end-to-end. Such a universal network can act like a “Swiss knife” for vision tasks; we call this architecture an UberNet to indicate its overarching nature. We address two main technical challenges that emerge when broadening up the range of tasks handled by a single CNN: (i) training a deep architecture while relying on diverse training sets and (ii) training many (potentially unlimited) tasks with a limited memory budget. Properly addressing these two problems allows us to train accurate predictors for a host of tasks, without compromising accuracy. Through these advances we train in an end-to-end manner a CNN that simultaneously addresses (a) boundary detection (b) normal estimation (c) saliency estimation (d) semantic segmentation (e) human part segmentation (f) semantic boundary detection, (g) region proposal generation and object detection. We obtain competitive performance while jointly addressing all of these tasks in 0.7 seconds per frame on a single GPU. A demonstration of this system can be found at this http URL.
https://arxiv.org/abs/1609.02132
A peculiar source in the Galactic center known as the Dusty S-cluster Object (DSO/G2) moves on a highly eccentric orbit around the supermassive black hole with the pericenter passage in the spring of 2014. Its nature has been uncertain mainly because of the lack of any information about its intrinsic geometry. For the first time, we use near-infrared polarimetric imaging data to obtain constraints about the geometrical properties of the DSO. We find out that DSO is an intrinsically polarized source, based on the significance analysis of polarization parameters, with the degree of the polarization of $\sim 30\%$ and an alternating polarization angle as it approaches the position of Sgr A*. Since the DSO exhibits a near-infrared excess of $K_{\rm s}-L’>3$ and remains rather compact in emission-line maps, its main characteristics may be explained with the model of a pre-main-sequence star embedded in a non-spherical dusty envelope.
https://arxiv.org/abs/1609.02039
We present a method for discovering and exploiting object specific deep learning features and use face detection as a case study. Motivated by the observation that certain convolutional channels of a Convolutional Neural Network (CNN) exhibit object specific responses, we seek to discover and exploit the convolutional channels of a CNN in which neurons are activated by the presence of specific objects in the input image. A method for explicitly fine-tuning a pre-trained CNN to induce an object specific channel (OSC) and systematically identifying it for the human face object has been developed. Based on the basic OSC features, we introduce a multi-resolution approach to constructing robust face heatmaps for fast face detection in unconstrained settings. We show that multi-resolution OSC can be used to develop state of the art face detectors which have the advantage of being simple and compact.
https://arxiv.org/abs/1609.01366
Polynomial multiplication is a key algorithm underlying computer algebra systems (CAS) and its efficient implementation is crucial for the performance of CAS. In this paper we design and implement algorithms for polynomial multiplication using approaches based on the fast Fourier transform (FFT) and the truncated Fourier transform (TFT). We improve on the state-of-the-art in both theoretical and practical performance. The SPIRAL library generation system is extended and used to automatically generate and tune the performance of a polynomial multiplication library that is optimized for memory hierarchy, vectorization and multi-threading, using new and existing algorithms. The performance tuning has been aided by the use of automation where many code choices are generated and intelligent search is utilized to find the “best” implementation on a given architecture. The performance of autotuned implementations is comparable to, and in some cases better than, the best hand-tuned code.
https://arxiv.org/abs/1609.01010
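A minimal numpy sketch of FFT-based polynomial multiplication, the basic building block being optimized: evaluate both polynomials at roots of unity, multiply pointwise, and interpolate back. The SPIRAL-generated, TFT-based, vectorized and multi-threaded variants studied in the paper go far beyond this illustration.

import numpy as np

def poly_multiply_fft(a, b):
    """Multiply two polynomials (coefficient lists, lowest degree first) via the FFT.
    The product has len(a) + len(b) - 1 coefficients; work is O(n log n)."""
    n = len(a) + len(b) - 1
    fa = np.fft.rfft(a, n)
    fb = np.fft.rfft(b, n)
    return np.rint(np.fft.irfft(fa * fb, n)).astype(int)   # round back to integer coefficients

# (1 + 2x)(3 + 4x + 5x^2) = 3 + 10x + 13x^2 + 10x^3
print(poly_multiply_fft([1, 2], [3, 4, 5]))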
A formula for the volume of a solid of revolution is derived via integration by parts of a definite integral, first for a monotone function, and then extended to the general case in which the curvilinear trapezoid is bounded by a continuous, piecewise strictly monotone, differentiable function. Two examples are given: in one, the curvilinear trapezoid is determined by the Kepler equation, and in the other by a function transformed from the Kepler equation.
http://arxiv.org/abs/1609.04771
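For reference, a standard identity of the kind the abstract refers to, stated here for a strictly monotone, differentiable $f$ on $[a,b]$ and rotation about the x-axis; this is not necessarily the paper's exact formula:

$$
V \;=\; \pi \int_a^b f(x)^2\,dx
  \;=\; \pi \bigl[\,x\,f(x)^2\,\bigr]_a^b \;-\; 2\pi \int_a^b x\,f(x)\,f'(x)\,dx
  \;=\; \pi \bigl[\,x\,f(x)^2\,\bigr]_a^b \;-\; 2\pi \int_{f(a)}^{f(b)} y\,f^{-1}(y)\,dy,
$$

where the last step substitutes $y = f(x)$, which is legitimate precisely because $f$ is strictly monotone; the piecewise monotone case is handled by splitting $[a,b]$ at the monotonicity breakpoints.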
In recent years, numerous effective multi-object tracking (MOT) methods have been developed because of the wide range of applications. Existing performance evaluations of MOT methods usually separate the object tracking step from the object detection step by using the same fixed object detection results for comparisons. In this work, we perform a comprehensive quantitative study of the effects of object detection accuracy on the overall MOT performance, using the new large-scale University at Albany DETection and tRACking (UA-DETRAC) benchmark dataset. The UA-DETRAC benchmark dataset consists of 100 challenging video sequences captured from real-world traffic scenes (over 140,000 frames with rich annotations, including occlusion, weather, vehicle category, truncation, and vehicle bounding boxes) for object detection, object tracking and MOT systems. We evaluate complete MOT systems constructed from combinations of state-of-the-art object detection and object tracking methods. Our analysis shows the complex effects of object detection accuracy on MOT system performance. Based on these observations, we propose new evaluation tools and metrics for MOT systems that consider both object detection and object tracking for comprehensive analysis.
https://arxiv.org/abs/1511.04136
Visual Question Answering (VQA) is the task of taking as input an image and a free-form natural language question about the image, and producing an accurate answer. In this work we view VQA as a “feature extraction” module to extract image and caption representations. We employ these representations for the task of image-caption ranking. Each feature dimension captures (imagines) whether a fact (question-answer pair) could plausibly be true for the image and caption. This allows the model to interpret images and captions from a wide variety of perspectives. We propose score-level and representation-level fusion models to incorporate VQA knowledge in an existing state-of-the-art VQA-agnostic image-caption ranking model. We find that incorporating and reasoning about consistency between images and captions significantly improves performance. Concretely, our model improves state-of-the-art on caption retrieval by 7.1% and on image retrieval by 4.4% on the MSCOCO dataset.
https://arxiv.org/abs/1605.01379
As machines have become more intelligent, there has been a renewed interest in methods for measuring their intelligence. A common approach is to propose a task at which humans excel but which machines find difficult. However, an ideal task should also be easy to evaluate and not be easily gameable. We begin with a case study exploring the recently popular task of image captioning and its limitations as a task for measuring machine intelligence. An alternative and more promising task is Visual Question Answering, which tests a machine’s ability to reason about language and vision. We describe a dataset, unprecedented in size, created for the task that contains over 760,000 human-generated questions about images. Using around 10 million human-generated answers, machines may be easily evaluated.
https://arxiv.org/abs/1608.08716
This paper presents a robust multi-class multi-object tracking (MCMOT) method formulated within a Bayesian filtering framework. Multi-object tracking for unlimited object classes is conducted by combining detection responses with a change point detection (CPD) algorithm. The CPD model is used to observe abrupt or abnormal changes due to drift and occlusion, based on the spatiotemporal characteristics of track states. An ensemble of a convolutional neural network (CNN) based object detector and a Lucas-Kanade Tracker (KLT) based motion detector is employed to compute the likelihoods of foreground regions as the detection responses of different object classes. Extensive experiments are performed using recently introduced challenging benchmark videos: the ImageNet VID and MOT benchmark datasets. The comparison to state-of-the-art video tracking techniques shows very encouraging results.
https://arxiv.org/abs/1608.08434