This paper addresses 3D shape recognition. Recent work typically represents a 3D shape as a set of binary variables corresponding to 3D voxels of a uniform 3D grid centered on the shape, and resorts to deep convolutional neural networks (CNNs) for modeling these binary variables. Robust learning of such CNNs is currently limited by the small datasets of 3D shapes available, an order of magnitude smaller than other common datasets in computer vision. Related work typically deals with the small training datasets using a number of ad hoc, hand-tuned strategies. To address this issue, we formulate CNN learning as a beam search aimed at identifying an optimal CNN architecture, namely, the number of layers, nodes, and their connectivity in the network, as well as estimating the parameters of such an optimal CNN. Each state of the beam search corresponds to a candidate CNN. Two types of actions are defined to add new convolutional filters or new convolutional layers to a parent CNN, and thus transition to children states. The utility function of each action is efficiently computed by transferring parameter values of the parent CNN to its children, thereby enabling an efficient beam search. Our experimental evaluation on the 3D ModelNet dataset demonstrates that our model pursuit using beam search yields a CNN with superior performance on 3D shape classification compared to the state of the art.
https://arxiv.org/abs/1612.04774
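As a minimal illustration of the search loop described above (a sketch, not the authors' code): architectures are tuples of layer widths, the two actions either widen the last layer or append a new one, and a placeholder `train_and_evaluate` stands in for briefly fine-tuning a child CNN whose parameters are transferred from its parent. All names and constants here are assumptions.

```python
# Skeleton of beam search over CNN architectures; the CNN itself is mocked.
import heapq

BEAM_WIDTH = 3
FILTER_STEP = 16       # filters added by the "widen" action (assumed value)
MAX_LAYERS = 6

def children(arch):
    """Child states: widen the last conv layer or append a new layer."""
    kids = [arch[:-1] + (arch[-1] + FILTER_STEP,)]          # add filters
    if len(arch) < MAX_LAYERS:
        kids.append(arch + (FILTER_STEP,))                  # add a layer
    return kids

def train_and_evaluate(arch, parent_score):
    """Placeholder utility: in the paper this would briefly train the child
    CNN (initialized from the parent's weights) and return validation
    accuracy. Here we mock a score so the search skeleton runs."""
    return parent_score + 0.01 * len(arch) - 0.0001 * sum(arch)

def beam_search(init_arch=(16,), steps=5):
    beam = [(train_and_evaluate(init_arch, 0.0), init_arch)]
    for _ in range(steps):
        candidates = []
        for score, arch in beam:
            for kid in children(arch):
                candidates.append((train_and_evaluate(kid, score), kid))
        beam = heapq.nlargest(BEAM_WIDTH, candidates)       # prune to beam
    return max(beam)

print(beam_search())    # -> (best utility, best layer-width tuple)
```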
We demonstrate a series of InGaN/GaN double quantum well nanostructure elements. We grow a 2 µm undoped GaN template layer on top of a (0001)-oriented sapphire substrate. A 100 nm SiO2 thin film is deposited on top as a masking pattern layer. This layer is then covered with a 300 nm aluminum layer that serves as the anodic aluminum oxide (AAO) hole pattern layer. After oxalic acid etching, we transfer the hole pattern from the AAO layer to the SiO2 layer by reactive ion etching. Lastly, we utilize metal-organic chemical vapor deposition to grow GaN nanorods approximately 1.5 µm in size. We then grow two layers of InGaN/GaN double quantum wells on the semi-polar faces of the GaN nanorod substrate at different temperatures, and study the characteristics of the InGaN/GaN quantum wells formed on these semi-polar faces. We report the following findings from our study: first, using SiO2 with a repeating hole pattern, we are able to grow high-quality GaN nanorods with diameters of approximately 80-120 nm; second, photoluminescence (PL) measurements enable us to identify the Fabry-Perot effect from the InGaN/GaN quantum wells on the semi-polar faces, and we calculate the quantum wells' cavity thickness from the obtained PL measurements. Lastly, high-resolution TEM images allow us to study the lattice structure of the InGaN/GaN quantum wells on the GaN nanorods and identify threading dislocations in the lattice structure that affect the GaN nanorods' growth mechanism.
https://arxiv.org/abs/1612.04455
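For reference, the standard Fabry-Perot relation that converts fringe spacing in a PL spectrum to cavity thickness (a textbook estimate, not necessarily the paper's exact analysis; n is the refractive index of GaN, roughly 2.4 near 450 nm):

```latex
% Adjacent interference maxima at \lambda_1 < \lambda_2, normal incidence,
% dispersion neglected:
\[
  2 n L = m \lambda_2 = (m + 1)\,\lambda_1
  \quad\Longrightarrow\quad
  L = \frac{\lambda_1 \lambda_2}{2 n\,(\lambda_2 - \lambda_1)}.
\]
```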
Social event detection in a static image is a very challenging problem, and it is very useful for Internet of Things applications including automatic photo organization, ads recommender systems, and image captioning. Several publications show that the variety of objects, scenes, and people can make it very ambiguous for a system to decide which event occurs in an image. We propose a spatial pyramid configuration of a convolutional neural network (CNN) classifier for social event detection in a static image. By applying the spatial pyramid configuration to the CNN classifier, the details in the image can be observed more accurately by the classifier. The USED dataset provided by Ahmad et al., which consists of two different image sets, EiMM and SED, is used to evaluate our proposed method. As a result, the average accuracy of our system outperforms the baseline method by 15% and 2% on the two sets, respectively.
https://arxiv.org/abs/1612.04062
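A minimal sketch (assumed, not the authors' code) of a spatial pyramid pooling head on top of CNN feature maps: the same features are pooled at several grid resolutions and concatenated, so both global context and local detail reach the event classifier.

```python
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.AdaptiveMaxPool2d(k) for k in levels)   # k x k output grids

    def forward(self, feats):                  # feats: (B, C, H, W)
        pooled = [p(feats).flatten(1) for p in self.pools]
        return torch.cat(pooled, dim=1)        # (B, C * sum(k*k))

feats = torch.randn(8, 512, 14, 14)            # e.g. conv5 features
print(SpatialPyramidPooling()(feats).shape)    # torch.Size([8, 10752])
```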
SPIRou is a near-infrared spectropolarimeter and high-precision radial velocity instrument, to be installed at CFHT at the end of 2017. It focuses on the search for Earth-like planets around M dwarfs and on the study of stellar and planetary formation in the presence of stellar magnetic fields. The calibration unit and the radial-velocity reference module are essential to the short- and long-term precision (1 m/s). We highlight the specificities of the calibration techniques compared to the spectrographs HARPS (at La Silla, ESO) and SOPHIE (at OHP, France) due to the near-infrared wavelengths, the CMOS detectors, and the instrument design. We also describe the calibration unit architecture, design, and production.
https://arxiv.org/abs/1612.03679
In this paper, we address the problem of visual question answering by proposing a novel model, called VIBIKNet. Our model integrates Kernelized Convolutional Neural Networks and Long Short-Term Memory units to generate an answer given a question about an image. We show that VIBIKNet offers an optimal trade-off between accuracy and computational load, in terms of memory and time consumption. We validate our method on the VQA challenge dataset and compare it to the top performing methods in order to illustrate its performance and speed.
https://arxiv.org/abs/1612.03628
Visual attention plays an important role in understanding images and has demonstrated its effectiveness in generating natural language descriptions of images. On the other hand, recent studies show that language associated with an image can steer visual attention in the scene during our cognitive process. Inspired by this, we introduce a text-guided attention model for image captioning, which learns to drive visual attention using associated captions. For this model, we propose an exemplar-based learning approach that retrieves from training data the captions associated with each image, and uses them to learn attention on visual features. Our attention model enables describing a detailed state of scenes by distinguishing small or confusable objects effectively. We validate our model on the MS-COCO Captioning benchmark and achieve state-of-the-art performance on standard metrics.
https://arxiv.org/abs/1612.03557
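A hedged sketch of text-guided attention in the spirit of the approach above: an embedding of retrieved captions scores the spatial CNN features, so attention concentrates on regions the associated text talks about. Shapes and layer names are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedAttention(nn.Module):
    def __init__(self, feat_dim=512, text_dim=300, hidden=256):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, hidden)
        self.proj_t = nn.Linear(text_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, feats, text):
        # feats: (B, R, feat_dim) region features; text: (B, text_dim) caption embedding
        joint = torch.tanh(self.proj_v(feats) + self.proj_t(text).unsqueeze(1))
        alpha = F.softmax(self.score(joint).squeeze(-1), dim=1)   # (B, R)
        return (alpha.unsqueeze(-1) * feats).sum(dim=1), alpha    # attended feature

att = TextGuidedAttention()
ctx, alpha = att(torch.randn(2, 49, 512), torch.randn(2, 300))
print(ctx.shape, alpha.shape)   # torch.Size([2, 512]) torch.Size([2, 49])
```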
While end-to-end neural machine translation (NMT) has made remarkable progress recently, NMT systems only rely on parallel corpora for parameter estimation. Since parallel corpora are usually limited in quantity, quality, and coverage, especially for low-resource languages, it is appealing to exploit monolingual corpora to improve NMT. We propose a semi-supervised approach for training NMT models on the concatenation of labeled (parallel corpora) and unlabeled (monolingual corpora) data. The central idea is to reconstruct the monolingual corpora using an autoencoder, in which the source-to-target and target-to-source translation models serve as the encoder and decoder, respectively. Our approach can exploit not only the monolingual corpora of the target language, but also those of the source language. Experiments on the Chinese-English dataset show that our approach achieves significant improvements over state-of-the-art SMT and NMT systems.
https://arxiv.org/abs/1606.04596
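Schematically, the semi-supervised objective can be written as below (notation assumed; D is the parallel corpus, T a monolingual target corpus, and the symmetric source-side term is omitted). The autoencoder scores a monolingual sentence by translating it into a latent source sentence and back, and the intractable sum over latent translations is approximated by sampling:

```latex
\[
  J(\theta)
  = \sum_{(x,y) \in D} \log p(y \mid x; \overrightarrow{\theta})
  + \lambda \sum_{y \in T} \log \sum_{x}
      p(x \mid y; \overleftarrow{\theta})\,
      p(y \mid x; \overrightarrow{\theta}).
\]
```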
In object detection, reducing computational cost is as important as improving accuracy for most practical usage. This paper proposes a novel network structure, which is an order of magnitude lighter than other state-of-the-art networks while maintaining their accuracy. Based on the basic principle of more layers with fewer channels, this new deep neural network minimizes its redundancy by adopting recent innovations including C.ReLU and the Inception structure. We also show that this network can be trained efficiently to achieve solid results on well-known object detection benchmarks: 84.9% and 84.2% mAP on VOC2007 and VOC2012, while the required compute is less than 10% of the recent ResNet-101.
https://arxiv.org/abs/1611.08588
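A minimal sketch of a C.ReLU block as commonly described (an assumption, not the paper's exact layer): exploiting the observation that early conv filters tend to come in negated pairs, the convolution computes half the channels and negation supplies the other half, roughly halving the compute of that layer.

```python
import torch
import torch.nn as nn

class CReLU(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        assert out_ch % 2 == 0
        self.conv = nn.Conv2d(in_ch, out_ch // 2, k, padding=k // 2)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        y = self.conv(x)
        y = torch.cat([y, -y], dim=1)      # concatenate negation: "free" channels
        return torch.relu(self.bn(y))

print(CReLU(3, 32)(torch.randn(1, 3, 64, 64)).shape)  # (1, 32, 64, 64)
```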
Neural machine translation (NMT) heavily relies on word-level modelling to learn semantic representations of input sentences. However, for languages without natural word delimiters (e.g., Chinese), where input sentences have to be tokenized first, conventional NMT is confronted with two issues: 1) it is difficult to find an optimal tokenization granularity for source sentence modelling, and 2) errors in 1-best tokenizations may propagate to the encoder of NMT. To handle these issues, we propose word-lattice based Recurrent Neural Network (RNN) encoders for NMT, which generalize the standard RNN to word lattice topology. The proposed encoders take as input a word lattice that compactly encodes multiple tokenizations, and learn to generate new hidden states from arbitrarily many inputs and hidden states in preceding time steps. As such, the word-lattice based encoders not only alleviate the negative impact of tokenization errors but also are more expressive and flexible in embedding input sentences. Experimental results on Chinese-English translation demonstrate the superiority of the proposed encoders over the conventional encoder.
https://arxiv.org/abs/1609.07730
We present a framework to understand GAN training as alternating density ratio estimation and approximate divergence minimization. This provides an interpretation for the mismatched GAN generator and discriminator objectives often used in practice, and explains the problem of poor sample diversity. We also derive a family of generator objectives that target arbitrary $f$-divergences without minimizing a lower bound, and use them to train generative image models that target either improved sample quality or greater sample diversity.
https://arxiv.org/abs/1612.02780
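For context, the variational bound underlying this density-ratio view (the standard f-GAN form; notation assumed): for a convex f with convex conjugate f*, the discriminator T tightens a lower bound on the f-divergence, while the paper's generator objectives target the divergence itself rather than minimizing this bound:

```latex
\[
  D_f\!\left(p \,\Vert\, q_\theta\right) \;\ge\;
  \sup_{T}\; \mathbb{E}_{x \sim p}\!\left[T(x)\right]
  - \mathbb{E}_{x \sim q_\theta}\!\left[f^{*}\!\left(T(x)\right)\right].
\]
```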
We develop a general problem setting for training and testing the ability of agents to gather information efficiently. Specifically, we present a collection of tasks in which success requires searching through a partially-observed environment for fragments of information which can be pieced together to accomplish various goals. We combine deep architectures with techniques from reinforcement learning to develop agents that solve our tasks. We shape the behavior of these agents by combining extrinsic and intrinsic rewards. We empirically demonstrate that these agents learn to search actively and intelligently for new information to reduce their uncertainty, and to exploit information they have already acquired.
https://arxiv.org/abs/1612.02605
Automatic image annotation has been an important research topic in facilitating large scale image management and retrieval. Existing methods focus on learning image-tag correlations or correlations between tags to improve annotation accuracy. However, most of these methods evaluate their performance using top-k retrieval performance, where k is fixed. Although such a setting gives convenience for comparing different methods, it is not the natural way that humans annotate images: the number of annotated tags should depend on image contents. Inspired by the recent progress in machine translation and image captioning, we propose a novel Recurrent Image Annotator (RIA) model that formulates the image annotation task as a sequence generation problem, so that RIA can natively predict the proper number of tags according to image contents. We evaluate the proposed model on various image annotation datasets. In addition to comparing our model with existing methods using the conventional top-k evaluation measures, we also provide our model as a high quality baseline for the arbitrary-length image tagging task. Moreover, the results of our experiments show that the order of tags in the training phase has a great impact on the final annotation performance.
https://arxiv.org/abs/1604.05225
Understanding, predicting, and generating object motions and transformations is a core problem in artificial intelligence. Modeling sequences of evolving images may provide better representations and models of motion and may ultimately be used for forecasting, simulation, or video generation. Diagrammatic Abstract Reasoning is an avenue in which diagrams evolve in complex patterns and one needs to infer the underlying pattern sequence and generate the next image in the sequence. For this, we develop a novel Contextual Generative Adversarial Network based on Recurrent Neural Networks (Context-RNN-GANs), where both the generator and the discriminator modules are based on contextual history (modeled as RNNs) and the adversarial discriminator guides the generator to produce realistic images for the particular time step in the image sequence. We evaluate the Context-RNN-GAN model (and its variants) on a novel dataset of Diagrammatic Abstract Reasoning, where it performs competitively with 10th-grade human performance but there is still scope for interesting improvements as compared to college-grade human performance. We also evaluate our model on a standard video next-frame prediction task, achieving improved performance over comparable state-of-the-art.
https://arxiv.org/abs/1609.09444
We study the contribution of binary black hole (BH-BH) mergers from the first, metal-free stars in the Universe (Pop III) to gravitational wave detection rates. Our study combines initial conditions for the formation of Pop III stars based on N-body simulations of binary formation (including rates, binary fraction, initial mass function, orbital separation and eccentricity distributions) with an updated model of stellar evolution specific for Pop III stars. We find that the merger rate of these Pop III BH-BH systems is relatively small (< 0.1 Gpc^-3 yr^-1) at low redshifts (z<2), where it can be compared with the LIGO empirical estimate of 9-240 Gpc^-3 yr^-1 (Abbott et al. 2016). The predicted rates are even smaller for Pop III double neutron star and black hole neutron star mergers. Our rates are compatible with those of Hartwig et al. (2016), but significantly smaller than those found in previous work (Bond & Carr 1984; Belczynski et al. 2004; Kinugawa et al. 2014, 2016). We explain the reasons for this discrepancy by means of detailed model comparisons and point out that (i) identification of Pop III BH-BH mergers may not be possible by advanced LIGO, and (ii) the level of stochastic gravitational wave background from Pop III mergers may be lower than recently estimated (Kowalska et al. 2012; Inayoshi et al. 2016; Dvorkin et al. 2016). We further estimate gravitational wave detection rates for third-generation interferometric detectors. Our calculations are relevant for low to moderately rotating Pop III stars. We can now exclude significant (> 1 per cent) contribution of these stars to low-redshift BH-BH mergers. However, it remains to be tested whether (and at what level) rapidly spinning Pop III stars (homogeneous evolution) can contribute to BH-BH mergers in the local Universe.
https://arxiv.org/abs/1612.01524
Communicating and sharing intelligence among agents is an important facet of achieving Artificial General Intelligence. As a first step towards this challenge, we introduce a novel framework for image generation: Message Passing Multi-Agent Generative Adversarial Networks (MPM GANs). While GANs have recently been shown to be very effective for image generation and other tasks, these networks have been limited to mostly single generator-discriminator networks. We show that we can obtain multi-agent GANs that communicate through message passing to achieve better image generation. The objectives of the individual agents in this framework are twofold: a cooperation objective and a competing objective. The cooperation objective ensures that the message-sharing mechanism guides the other generator to generate better than itself, while the competing objective encourages each generator to generate better than its counterpart. We analyze and visualize the messages that these GANs share among themselves in various scenarios. We quantitatively show that the message-sharing formulation serves as a regularizer for the adversarial training. Qualitatively, we show that the different generators capture different traits of the underlying data distribution.
https://arxiv.org/abs/1612.01294
Point pair features are a popular representation for free-form 3D object detection and pose estimation. In this paper, their performance in an industrial random bin picking context is investigated. A new method to generate representative synthetic datasets is proposed. This allows us to investigate the influence of a high degree of clutter and the presence of self-similar features, which are typical of our application. We provide an overview of solutions proposed in the literature and discuss their strengths and weaknesses. A simple heuristic method to drastically reduce the computational complexity is introduced, which results in improved robustness, speed, and accuracy compared to the naive approach.
https://arxiv.org/abs/1612.01288
Current deep learning architectures are growing larger in order to learn from complex datasets. These architectures require giant matrix multiplication operations to train millions of parameters. Conversely, there is another growing trend to bring deep learning to low-power, embedded devices. The matrix operations, associated with both training and testing of deep networks, are very expensive from a computational and energy standpoint. We present a novel hashing based technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select the nodes with the highest activation efficiently. Our new algorithm for deep learning reduces the overall computational cost of forward and back-propagation by operating on significantly fewer (sparse) nodes. As a consequence, our algorithm uses only 5% of the total multiplications, while keeping on average within 1% of the accuracy of the original model. A unique property of the proposed hashing based back-propagation is that the updates are always sparse. Due to the sparse gradient updates, our algorithm is ideally suited for asynchronous and parallel training leading to near linear speedup with increasing number of cores. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations on several real datasets.
https://arxiv.org/abs/1602.08194
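A toy numpy sketch of the core idea (assumptions throughout, not the authors' implementation): hash each neuron's weight vector with signed random projections (SimHash, an LSH family for angular similarity), then for every input activate only the neurons that collide with the input's hash, so the layer touches a sparse subset of its units.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, n_neurons, n_bits = 64, 1024, 8
W = rng.standard_normal((n_neurons, d))
planes = rng.standard_normal((n_bits, d))          # shared hash hyperplanes

def simhash(v):
    return tuple((planes @ v > 0).astype(int))

table = defaultdict(list)                           # bucket -> neuron ids
for j in range(n_neurons):
    table[simhash(W[j])].append(j)

def sparse_forward(x):
    active = table[simhash(x)]                      # collision = likely large <w, x>
    out = np.zeros(n_neurons)
    out[active] = np.maximum(W[active] @ x, 0.0)    # ReLU on selected units only
    return out, len(active)

x = rng.standard_normal(d)
out, k = sparse_forward(x)
print(f"computed {k}/{n_neurons} neurons")
```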
Recently, video captioning has been attracting an increasing amount of interest, due to its potential for improving accessibility and information retrieval. While existing methods rely on different kinds of visual features and model structures, they do not fully exploit relevant semantic information. We present an extensible approach to jointly leverage several sorts of visual features and semantic attributes. Our novel architecture builds on LSTMs for sentence generation, with several attention layers and two multimodal layers. The attention mechanism learns to automatically select the most salient visual features or semantic attributes, and the multimodal layer yields overall representations for the input and outputs of the sentence generation component. Experimental results on the challenging MSVD and MSR-VTT datasets show that our framework outperforms the state-of-the-art approaches, while ground truth based semantic attributes are able to further elevate the output quality to a near-human level.
https://arxiv.org/abs/1612.00234
We demonstrate that GaN formed in a Nanowall Network (NwN) morphology can overcome fundamental limitations in optoelectronic devices, enabling high light extraction and effective Mg incorporation for efficient p-GaN. We report the growth of Mg-doped GaN nanowall network (NwN) by plasma-assisted molecular beam epitaxy (PA-MBE), characterized by photoluminescence (PL) spectroscopy, Raman spectroscopy, high-resolution X-ray diffraction (HR-XRD), X-ray photoelectron spectroscopy (XPS) and secondary ion mass spectroscopy (SIMS). We record a photoluminescence enhancement ($\approx$ 3.2 times) in lightly doped GaN as compared to that of undoped NwN. Two distinct (and broad) blue luminescence peaks appear at 2.95 and 2.7 eV for the heavily doped GaN (Mg $>10^{20}$ atoms cm$^{-3}$), of which the 2.95 eV peak is sensitive to annealing. XPS and SIMS measurements estimate the incorporated Mg concentration to be $10^{20}$ atoms cm$^{-3}$ in the GaN NwN morphology, while the band edge emission is retained at $\approx$ 3.4 eV. A higher Mg accumulation towards the GaN/Al$_2$O$_3$ interface as compared to the surface was observed from SIMS measurements.
https://arxiv.org/abs/1611.10263
Given an image, we would like to learn to detect objects belonging to particular object categories. Common object detection methods train on large annotated datasets which are annotated in terms of bounding boxes that contain the object of interest. Previous works on object detection model the problem as structured regression, ranking the correct bounding boxes higher than background ones. In this paper we develop algorithms which actively obtain annotations from human annotators for a small set of images, instead of all images, thereby reducing the annotation effort. Towards this goal, we make the following contributions: 1) we develop a principled version-space-based active learning method that solves object detection as a structured prediction problem in a weakly supervised setting; 2) we propose two variants of the margin sampling strategy; 3) we analyse results on standard object detection benchmarks which show that with only 20% of the data we can obtain more than 95% of the localization accuracy of full supervision. Our methods outperform random sampling and classical uncertainty-based active learning algorithms such as entropy sampling.
https://arxiv.org/abs/1611.07285
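A small numpy sketch of the margin-sampling idea (illustrative only; the paper's structured version scores bounding boxes rather than plain class probabilities): query the images whose best detection score is closest to the second best, i.e. where the current detector is least decided.

```python
import numpy as np

def margin_sample(scores, budget):
    """scores: (n_images, n_candidates) detector scores per candidate box.
    Returns indices of the `budget` images with the smallest top-2 margin."""
    top2 = np.sort(scores, axis=1)[:, -2:]          # two highest scores per image
    margins = top2[:, 1] - top2[:, 0]
    return np.argsort(margins)[:budget]             # smallest margin = most uncertain

rng = np.random.default_rng(0)
scores = rng.random((100, 20))
print(margin_sample(scores, budget=5))              # images to send to annotators
```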
Instance segmentation has attracted recent attention in computer vision, and existing methods in this domain mostly have an object detection stage. In this paper, we study the intrinsic challenge of the instance segmentation problem, the presence of a quotient space (swapping the labels of different instances leads to the same result), and propose new methods that are object-proposal- and object-detection-free. We propose three alternative methods, namely pixel-based affinity mapping, superpixel-based affinity learning, and boundary-based component segmentation, all focusing on performing labeling transformations to cope with the quotient space problem. By adopting fully convolutional network (FCN)-like models, our framework attains competitive results on both the PASCAL dataset (object-centric) and the Gland dataset (texture-centric), which the existing methods are not able to do. Our work also has advantages in its transparency, simplicity, and being entirely segmentation-based.
https://arxiv.org/abs/1611.08991
Despite the recent success of neural networks in image feature learning, a major problem in the video domain is the lack of sufficient labeled data for learning to model temporal information. In this paper, we propose an unsupervised temporal modeling method that learns from untrimmed videos. The speed of motion varies constantly, e.g., a man may run quickly or slowly. We therefore train a Multirate Visual Recurrent Model (MVRM) by encoding frames of a clip with different intervals. This learning process makes the learned model more capable of dealing with motion speed variance. Given a clip sampled from a video, we use its past and future neighboring clips as the temporal context, and reconstruct the two temporal transitions, i.e., present$\rightarrow$past transition and present$\rightarrow$future transition, reflecting the temporal information in different views. The proposed method exploits the two transitions simultaneously by incorporating a bidirectional reconstruction which consists of a backward reconstruction and a forward reconstruction. We apply the proposed method to two challenging video tasks, i.e., complex event detection and video captioning, in which it achieves state-of-the-art performance. Notably, our method generates the best single feature for event detection with a relative improvement of 10.4% on the MEDTest-13 dataset and achieves the best performance in video captioning across all evaluation metrics on the YouTube2Text dataset.
https://arxiv.org/abs/1611.09053
We report the first realization of molecular beam epitaxy grown strained GaN quantum well field-effect transistors on single-crystal bulk AlN substrates. The fabricated double heterostructure FETs exhibit a two- dimensional electron gas (2DEG) density in excess of 2x10^13/cm2. Ohmic contacts to the 2DEG channel were formed by n+ GaN MBE regrowth process, with a contact resistance of 0.13 Ohm-mm. Raman spectroscopy using the quantum well as an optical marker reveals the strain in the quantum well, and strain relaxation in the regrown GaN contacts. A 65-nm-long rectangular-gate device showed a record high DC drain current drive of 2.0 A/mm and peak extrinsic transconductance of 250 mS/mm. Small-signal RF performance of the device achieved current gain cutoff frequency fT~120 GHz. The DC and RF performance demonstrate that bulk AlN substrates offer an attractive alternative platform for strained quantum well nitride transistors for future high-voltage and high-power microwave applications.
https://arxiv.org/abs/1611.08914
Recurrent neural networks (RNNs) have achieved great success in language modeling. However, since RNNs have a fixed memory size, the memory cannot store all the information about the words seen earlier in the sentence, and thus useful long-term information may be ignored when predicting the next words. In this paper, we propose the Attention-based Memory Selection Recurrent Network (AMSRN), in which the model can review the information stored in the memory at each previous time step and select the relevant information to help generate the outputs. In AMSRN, the attention mechanism finds the time steps storing the relevant information in the memory, and memory selection determines which dimensions of the memory are involved in computing the attention weights and from which the information is extracted. In the experiments, AMSRN outperformed long short-term memory (LSTM) based language models on both English and Chinese corpora. Moreover, we investigate using entropy as a regularizer for attention weights and visualize how the attention mechanism helps language modeling.
https://arxiv.org/abs/1611.08656
We report the quantum efficiency of photoluminescence processes of Er optical centers, as well as the thermal quenching mechanism, in GaN epilayers prepared by metal-organic chemical vapor deposition. High-resolution infrared spectroscopy and temperature-dependent measurements of the photoluminescence intensity from Er ions in GaN under resonant excitation were performed. The data provide a picture of the thermal quenching processes and activation energy levels. By comparing the photoluminescence from Er ions in the epilayer with a reference sample of Er-doped SiO2, we find that the fraction of Er ions that emit photons at 1.54 micron upon resonant optical excitation is approximately 68%. This result represents a significant step towards the realization of GaN:Er epilayers as an optical gain medium at 1.54 micron.
https://arxiv.org/abs/1611.08620
We present a method for performing hierarchical object detection in images guided by a deep reinforcement learning agent. The key idea is to focus on those parts of the image that contain richer information and zoom in on them. We train an intelligent agent that, given an image window, is capable of deciding where to focus the attention among five different predefined region candidates (smaller windows). This procedure is iterated, providing a hierarchical image analysis. We compare two different candidate proposal strategies to guide the object search: with and without overlap. Moreover, our work compares two different strategies to extract features from a convolutional neural network for each region proposal: a first one that computes new feature maps for each region proposal, and a second one that computes the feature maps for the whole image and later generates crops for each region proposal. Experiments indicate better results for the overlapping candidate proposal strategy, and a loss of performance for the cropped image features due to the loss of spatial resolution. We argue that, while this loss seems unavoidable when working with large numbers of object candidates, the much smaller number of region proposals generated by our reinforcement learning agent makes it feasible to extract features for each location without sharing convolutional computation among regions.
https://arxiv.org/abs/1611.03718
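A sketch of the five predefined region candidates with overlap (the sizes and overlap factor are assumptions, not the paper's values): four corner sub-windows plus a central one, each a fraction of the parent window, iterated as the agent zooms in.

```python
def region_candidates(x, y, w, h, scale=0.75):
    """Return five child windows (x, y, w, h) inside the parent window."""
    cw, ch = w * scale, h * scale          # scale > 0.5 makes candidates overlap
    return [
        (x,          y,          cw, ch),             # top-left
        (x + w - cw, y,          cw, ch),             # top-right
        (x,          y + h - ch, cw, ch),             # bottom-left
        (x + w - cw, y + h - ch, cw, ch),             # bottom-right
        (x + (w - cw) / 2, y + (h - ch) / 2, cw, ch), # centre
    ]

# One zoom step on a 640x480 image: the agent picks one window and recurses.
for r in region_candidates(0, 0, 640, 480):
    print([round(v, 1) for v in r])
```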
V-pit defects in GaN-based light-emitting diodes induced by dislocations are considered beneficial to electroluminescence because they relax the strain in InGaN quantum wells and also enhance lateral hole injection through the sidewalls of the V-pits. In this paper, regularly arranged V-pits are formed on c-plane GaN grown by metal-organic vapor phase epitaxy on conventional c-plane cone-patterned sapphire substrates. The size of the V-pits and the area of flat GaN can be adjusted by changing the growth temperature. Five pairs of InGaN/GaN multi-quantum-wells and a light-emitting diode structure are grown on this V-pit-shaped GaN. Two peaks around 410 nm and 450 nm appearing in both photoluminescence and cathodoluminescence spectra originate from the semipolar InGaN/GaN multi-quantum-well on the sidewalls of the V-pits and the c-plane InGaN/GaN multi-quantum-well, respectively. In addition, dense bright spots can be observed on the surface of the light-emitting diode when it operates under small injection current, which are believed to be due to the enhanced hole injection around the V-pits.
https://arxiv.org/abs/1612.06355
In metal-organic vapor phase epitaxy of GaN, the growth mode is sensitive to reactor temperature. In this study, V-pit-shaped GaN has been grown on a normal c-plane cone-patterned sapphire substrate by decreasing the growth temperature of high-temperature GaN to around 950 °C, which leads to 3-dimensional growth of GaN. The so-called “WM” morphology describes the shape in which the bottom of a GaN V-pit lies directly over the top of a sapphire cone, and the regular arrangement of V-pits strictly follows the pattern of the sapphire substrate. Two types of semipolar facets, (1101) and (1122), are exposed on the sidewalls of the V-pits. Furthermore, by raising the growth temperature to 1000 °C, the growth mode of GaN can be transferred to 2-dimensional growth. Accordingly, the size of the V-pits becomes smaller and the area of c-plane GaN becomes larger, while the total thickness of GaN remains almost unchanged during this process. As long as the 2-dimensional growth lasts, the V-pits will disappear and only flat c-plane GaN remains. This means the area ratio of c-plane to semipolar-plane GaN can be controlled by the duration of the 2-dimensional growth.
https://arxiv.org/abs/1611.08337
We study the problem of troubleshooting machine learning systems that rely on analytical pipelines of distinct components. Understanding and fixing errors that arise in such integrative systems is difficult as failures can occur at multiple points in the execution workflow. Moreover, errors can propagate, become amplified or be suppressed, making blame assignment difficult. We propose a human-in-the-loop methodology which leverages human intellect for troubleshooting system failures. The approach simulates potential component fixes through human computation tasks and measures the expected improvements in the holistic behavior of the system. The method provides guidance to designers about how they can best improve the system. We demonstrate the effectiveness of the approach on an automated image captioning system that has been pressed into real-world use.
https://arxiv.org/abs/1611.08309
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation of this problem. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language inputs (image and question). We provide additional insights into the problem by analyzing how much information is contained in the language part alone, for which we provide a new human baseline. To study human consensus, which is related to the ambiguities inherent in this challenging task, we propose two novel metrics and collect additional answers which extend the original DAQUAR dataset to DAQUAR-Consensus. Moreover, we also extend our analysis to VQA, a large-scale dataset for question answering about images, where we investigate some particular design choices and show the importance of stronger visual models. At the same time, we achieve strong performance with our model, which still uses a global image representation. Finally, based on such analysis, we refine our Ask Your Neurons on DAQUAR, which also leads to better performance on this challenging task.
https://arxiv.org/abs/1605.02697
Attention mechanisms have attracted considerable interest in image captioning due to their strong performance. However, existing methods use only visual content for attention, and whether textual context can improve attention in image captioning remains an open question. To explore this problem, we propose a novel attention mechanism, called \textit{text-conditional attention}, which allows the caption generator to focus on certain image features given previously generated text. To obtain text-related image features for our attention model, we adopt the guiding Long Short-Term Memory (gLSTM) captioning architecture with CNN fine-tuning. Our proposed method allows joint learning of the image embedding, text embedding, text-conditional attention and language model with one network architecture in an end-to-end manner. We perform extensive experiments on the MS-COCO dataset. The experimental results show that our method outperforms state-of-the-art captioning methods on various quantitative metrics as well as in human evaluation, which supports the use of text-conditional attention in image captioning.
https://arxiv.org/abs/1606.04621
Unmanned Aerial Vehicles (UAVs) are of great importance to the army for border security. The main objective of this article is to develop OpenCV-Python code using the Haar cascade algorithm for object and face detection. Currently, UAVs are used for detecting and attacking infiltrated ground targets. The main drawback of this type of UAV is that objects are sometimes not properly detected, which can cause the object to hit the UAV. This project aims to avoid such unwanted collisions and damage to the UAV. The UAV is also used for surveillance, employing the Viola-Jones algorithm to detect and track humans, using a cascade object detector together with a training function to build the classifier. The main advantage of this code is the reduced processing time. The Python code was tested with the help of an available database of videos and images, and the output was verified.
https://arxiv.org/abs/1611.07791
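A minimal OpenCV-Python face-detection sketch in the spirit of the article (the cascade file and parameters are the standard OpenCV defaults, not necessarily the authors' exact settings; `frame.jpg` is a placeholder for one frame of the UAV video):

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("frame.jpg")                       # one frame from the UAV video
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)        # Haar features work on grayscale
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:                          # draw the detections
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detected.jpg", img)
```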
Automatically generating natural language descriptions of videos poses a fundamental challenge for the computer vision community. Most recent progress in this problem has been achieved through employing 2-D and/or 3-D Convolutional Neural Networks (CNNs) to encode video content and Recurrent Neural Networks (RNNs) to decode a sentence. In this paper, we present Long Short-Term Memory with Transferred Semantic Attributes (LSTM-TSA), a novel deep architecture that incorporates the transferred semantic attributes learnt from images and videos into the CNN plus RNN framework, by training them in an end-to-end manner. The design of LSTM-TSA is highly inspired by the facts that 1) semantic attributes make a significant contribution to captioning, and 2) images and videos carry complementary semantics and thus can reinforce each other for captioning. To boost video captioning, we propose a novel transfer unit to model the mutually correlated attributes learnt from images and videos. Extensive experiments are conducted on three public datasets, i.e., MSVD, M-VAD and MPII-MD. Our proposed LSTM-TSA achieves the best published performance to date in sentence generation on MSVD: 52.8% and 74.0% in terms of BLEU@4 and CIDEr-D. Superior results compared to state-of-the-art methods are also reported on M-VAD and MPII-MD.
https://arxiv.org/abs/1611.07675
Attention mechanisms have recently been introduced in deep learning for various tasks in natural language processing and computer vision. But despite their popularity, the “correctness” of the implicitly-learned attention maps has only been assessed qualitatively by visualization of several examples. In this paper we focus on evaluating and improving the correctness of attention in neural image captioning models. Specifically, we propose a quantitative evaluation metric for the consistency between the generated attention maps and human annotations, using recently released datasets with alignment between regions in images and entities in captions. We then propose novel models with different levels of explicit supervision for learning attention maps during training. The supervision can be strong when alignment between regions and caption entities are available, or weak when only object segments and categories are provided. We show on the popular Flickr30k and COCO datasets that introducing supervision of attention maps during training solidly improves both attention correctness and caption quality, showing the promise of making machine perception more human-like.
https://arxiv.org/abs/1605.09553
Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding. Many of the recently proposed VQA systems include attention or memory mechanisms designed to support “reasoning”. For multiple-choice VQA, nearly all of these systems train a multi-class classifier on image and question features to predict an answer. This paper questions the value of these common practices and develops a simple alternative model based on binary classification. Instead of treating answers as competing choices, our model receives the answer as input and predicts whether or not an image-question-answer triplet is correct. We evaluate our model on the Visual7W Telling and the VQA Real Multiple Choice tasks, and find that even simple versions of our model perform competitively. Our best model achieves state-of-the-art performance on the Visual7W Telling task and compares surprisingly well with the most complex systems proposed for the VQA Real Multiple Choice task. We explore variants of the model and study its transferability between both datasets. We also present an error analysis of our model that suggests a key problem of current VQA systems lies in the lack of visual grounding of concepts that occur in the questions and answers. Overall, our results suggest that the performance of current VQA systems is not significantly better than that of systems designed to exploit dataset biases.
https://arxiv.org/abs/1606.08390
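A hedged sketch of the binary alternative described above: concatenate image, question, and candidate-answer features and classify the triplet as correct or incorrect (the feature dimensions and the MLP are assumptions, not the paper's exact model).

```python
import torch
import torch.nn as nn

class TripletScorer(nn.Module):
    def __init__(self, img_dim=2048, q_dim=300, a_dim=300, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + q_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                 # one logit per triplet

    def forward(self, img, q, a):
        return self.mlp(torch.cat([img, q, a], dim=-1)).squeeze(-1)

scorer = TripletScorer()
logits = scorer(torch.randn(4, 2048), torch.randn(4, 300), torch.randn(4, 300))
loss = nn.functional.binary_cross_entropy_with_logits(
    logits, torch.tensor([1., 0., 0., 0.]))       # 1 = the correct answer choice
# At test time: score every multiple-choice answer and pick the argmax.
```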
We theoretically analyze the contrast observed at the outcrop of a threading dislocation at the GaN(0001) surface in cathodoluminescence and electron-beam induced current maps. We consider exciton diffusion and recombination including finite recombination velocities both at the planar surface and at the dislocation. Formulating the reciprocity theorem for this general case enables us to provide a rigorous analytical solution of this diffusion-recombination problem. The results of the calculations are applied to an experimental example to determine both the exciton diffusion length and the recombination strength of threading dislocations in a free-standing GaN layer with a dislocation density of $6\times10^{5}$ cm$^{-2}$.
https://arxiv.org/abs/1611.06895
Although end-to-end Neural Machine Translation (NMT) has achieved remarkable progress in the past two years, it suffers from a major drawback: translations generated by NMT systems often lack adequacy. It has been widely observed that NMT tends to repeatedly translate some source words while mistakenly ignoring others. To alleviate this problem, we propose a novel encoder-decoder-reconstructor framework for NMT. The reconstructor, incorporated into the NMT model, reconstructs the input source sentence from the hidden layer of the output target sentence, ensuring that the information on the source side is transferred to the target side as much as possible. Experiments show that the proposed framework significantly improves the adequacy of NMT output and achieves superior translation results over state-of-the-art NMT and statistical MT systems.
https://arxiv.org/abs/1611.01874
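Schematically, such a training objective can be written as below (notation assumed: $s_{1:|y|}$ are the decoder hidden states, the reconstructor is parameterized by $\gamma$, and $\lambda$ trades translation likelihood against reconstruction):

```latex
\[
  J(\theta, \gamma) = \sum_{(x, y)}
    \Big[ \log p\big(y \mid x; \theta\big)
        + \lambda \log p\big(x \mid s_{1:|y|}; \gamma\big) \Big],
\]
```

so the decoder states are pushed to retain enough source information for the output to cover the whole input (adequacy).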
Part of the appeal of Visual Question Answering (VQA) is its promise to answer new questions about previously unseen images. Most current methods demand training questions that illustrate every possible concept, and will therefore never achieve this capability, since the volume of required training data would be prohibitive. Answering general questions about images requires methods capable of Zero-Shot VQA, that is, methods able to answer questions beyond the scope of the training questions. We propose a new evaluation protocol for VQA methods which measures their ability to perform Zero-Shot VQA, and in doing so highlights significant practical deficiencies of current approaches, some of which are masked by the biases in current datasets. We propose and evaluate several strategies for achieving Zero-Shot VQA, including methods based on pretrained word embeddings, object classifiers with semantic embeddings, and test-time retrieval of example images. Our extensive experiments are intended to serve as baselines for Zero-Shot VQA, and they also achieve state-of-the-art performance in the standard VQA evaluation setting.
https://arxiv.org/abs/1611.05546
Generative Adversarial Networks (GANs) have recently been shown to successfully approximate complex data distributions. A relevant extension of this model is conditional GANs (cGANs), where the introduction of external information makes it possible to determine specific representations of the generated images. In this work, we evaluate encoders that invert the mapping of a cGAN, i.e., map a real image into a latent space and a conditional representation. This allows, for example, reconstructing and modifying real images of faces conditioned on arbitrary attributes. Additionally, we evaluate the design of cGANs. The combination of an encoder with a cGAN, which we call Invertible cGAN (IcGAN), enables re-generating real images with deterministic complex modifications.
https://arxiv.org/abs/1611.06355
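A hedged sketch of the IcGAN inference path (module shapes and the tiny conv trunk are assumptions): an encoder maps a real face to a latent code z and a condition vector y; editing y and decoding with the conditional generator yields the same face with modified attributes.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, z_dim=100, n_attr=18):
        super().__init__()
        self.body = nn.Sequential(                 # tiny stand-in conv trunk
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(), nn.Flatten())
        self.to_z = nn.LazyLinear(z_dim)
        self.to_y = nn.LazyLinear(n_attr)

    def forward(self, x):
        h = self.body(x)
        return self.to_z(h), torch.sigmoid(self.to_y(h))  # (z, attribute vector y)

enc = Encoder()
x = torch.randn(1, 3, 64, 64)                      # a real face image
z, y = enc(x)
y_edit = y.clone(); y_edit[0, 4] = 1.0             # e.g. switch one attribute on
# x_edit = G(z, y_edit)   # the conditional generator decodes the edited face
print(z.shape, y_edit.shape)
```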
Curriculum Learning emphasizes the order of training instances in a computational learning setup. The core hypothesis is that simpler instances should be learned early as building blocks for learning more complex ones. Despite its usefulness, it is still unknown how exactly the internal representations of models are affected by curriculum learning. In this paper, we study the effect of curriculum learning on Long Short-Term Memory (LSTM) networks, which have shown strong competency in many Natural Language Processing (NLP) problems. Our experiments on a sentiment analysis task and a synthetic task similar to sequence prediction tasks in NLP show that curriculum learning has a positive effect on the LSTM's internal states by biasing the model towards building constructive representations, i.e., the internal representations at previous timesteps are used as building blocks for the final prediction. We also find that smaller models improve significantly when they are trained with curriculum learning. Lastly, we show that curriculum learning helps more when the amount of training data is limited.
https://arxiv.org/abs/1611.06204
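A minimal sketch of one common curriculum schedule ("baby steps"; the difficulty score is an assumption, here simply sentence length): sort the training set from easy to hard and grow the visible portion each epoch.

```python
def curriculum_batches(dataset, n_epochs, difficulty=len):
    data = sorted(dataset, key=difficulty)          # easy examples first
    for epoch in range(1, n_epochs + 1):
        cutoff = int(len(data) * epoch / n_epochs)  # expand the pool gradually
        yield epoch, data[:cutoff]                  # train on this subset

corpus = ["a b", "a", "a b c d e", "a b c", "a b c d"]
for epoch, subset in curriculum_batches(corpus, n_epochs=3):
    print(epoch, subset)
```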
We present a survey of maritime object detection and tracking approaches, which are essential for the development of a navigational system for autonomous ships. The electro-optical (EO) sensor considered here is a video camera that operates in the visible or infrared spectra; it conventionally complements radar and sonar and has demonstrated its effectiveness for situational awareness at sea over the last few years. This paper provides a comprehensive overview of various video processing approaches for object detection and tracking in the maritime environment. We follow an approach-based taxonomy wherein the advantages and limitations of each approach are compared. The object detection system consists of the following modules: horizon detection, static background subtraction, and foreground segmentation. Each of these has been studied extensively in maritime situations and has been shown to be challenging due to the presence of background motion, especially due to waves and wakes. The main processes involved in object tracking include video frame registration, dynamic background subtraction, and the object tracking algorithm itself. The challenges for robust tracking arise from camera motion, dynamic backgrounds, and the low contrast of tracked objects, possibly due to environmental degradation. The survey also discusses multisensor approaches and commercial maritime systems that use EO sensors, and highlights methods from computer vision research which hold promise to perform well in maritime EO data processing. The performance of several maritime and computer vision techniques is evaluated on the newly proposed Singapore Maritime Dataset.
https://arxiv.org/abs/1611.05842
Machine reading using differentiable reasoning models has recently shown remarkable progress. In this context, End-to-End trainable Memory Networks, MemN2N, have demonstrated promising performance on simple natural language based reasoning tasks such as factual reasoning and basic deduction. However, other tasks, namely multi-fact question-answering, positional reasoning or dialog related tasks, remain challenging, particularly due to the necessity of more complex interactions between the memory and controller modules composing this family of models. In this paper, we introduce a novel end-to-end memory access regulation mechanism inspired by the current progress on the connection short-cutting principle in the field of computer vision. Concretely, we develop a Gated End-to-End trainable Memory Network architecture, GMemN2N. From the machine learning perspective, this new capability is learned in an end-to-end fashion without the use of any additional supervision signal, which is, as far as our knowledge goes, the first of its kind. Our experiments show significant improvements on the most challenging tasks in the 20 bAbI dataset, without the use of any domain knowledge. Then, we show improvements on the dialog bAbI tasks including the real human-bot conversation-based Dialog State Tracking Challenge (DSTC-2) dataset. On these two datasets, our model sets the new state of the art.
https://arxiv.org/abs/1610.04211
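A hedged sketch of a single gated memory hop (dimensions are assumed; a full MemN2N uses separate input and output memory embeddings): the gate g decides, per dimension, how much of the retrieved memory summary o overwrites the controller state u, which is the learned access regulation described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedHop(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, u, memory):
        # u: (B, d) controller state; memory: (B, N, d) embedded story sentences
        p = F.softmax(torch.einsum('bd,bnd->bn', u, memory), dim=1)  # addressing
        o = torch.einsum('bn,bnd->bd', p, memory)                    # read-out
        g = torch.sigmoid(self.gate(u))                              # access gate
        return g * o + (1 - g) * u                                   # gated update

hop = GatedHop()
u = torch.randn(2, 64)
print(hop(u, torch.randn(2, 10, 64)).shape)    # torch.Size([2, 64])
```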
The current trend in object detection and localization is to learn predictions with high-capacity deep neural networks trained on a very large amount of annotated data and using large amounts of processing power. In this work, we propose a new neural model which directly predicts bounding box coordinates. The particularity of our contribution lies in the local computation of predictions with a new form of local parameter sharing which keeps the overall number of trainable parameters low. Key components of the model are spatial 2D-LSTM recurrent layers which convey contextual information between the regions of the image. We show that this model is more powerful than the state of the art in applications where training data is not as abundant as in the classical configuration of natural images and ImageNet/Pascal VOC tasks. We particularly target the detection of text in document images, but our method is not limited to this setting. The proposed model also facilitates the detection of many objects in a single image and can deal with inputs of variable sizes without resizing.
https://arxiv.org/abs/1611.05664
Video captioning, which automatically translates video clips into natural language sentences, is a very important task in computer vision. By virtue of recent deep learning technologies, e.g., convolutional neural networks (CNNs) and recurrent neural networks (RNNs), video captioning has made great progress. However, learning an effective mapping from the visual sequence space to the language space is still a challenging problem. In this paper, we propose a Multimodal Memory Model (M3) to describe videos, which builds a visual and textual shared memory to model the long-term visual-textual dependency and further guide global visual attention on described targets. Specifically, the proposed M3 attaches an external memory to store and retrieve both visual and textual contents by interacting with video and sentence through multiple read and write operations. First, the text representation in the Long Short-Term Memory (LSTM) based text decoder is written into the memory, and the memory contents are read out to guide an attention mechanism to select related visual targets. Then, the selected visual information is written into the memory, which will be further read out to the text decoder. To evaluate the proposed model, we perform experiments on two public benchmark datasets: MSVD and MSR-VTT. The experimental results demonstrate that our method outperforms the state-of-the-art methods in terms of BLEU and METEOR.
https://arxiv.org/abs/1611.05592
The “CNN-RNN” design pattern is increasingly widely applied in a variety of image annotation tasks including multi-label classification and captioning. Existing models use the weakly semantic CNN hidden layer or its transform as the image embedding that provides the interface between the CNN and RNN. This leaves the RNN overstretched with two jobs: predicting the visual concepts and modelling their correlations for generating structured annotation output. Importantly, this makes the end-to-end training of the CNN and RNN slow and ineffective due to the difficulty of back-propagating gradients through the RNN to train the CNN. We propose a simple modification to the design pattern that makes learning more effective and efficient. Specifically, we propose to use a semantically regularised embedding layer as the interface between the CNN and RNN. Regularising the interface can partially or completely decouple the learning problems, allowing each to be trained more effectively and making joint training much more efficient. Extensive experiments show that state-of-the-art performance is achieved on multi-label classification as well as image captioning.
https://arxiv.org/abs/1611.05490
Intracellular transport is vital for the proper functioning and survival of a cell. Cargo (proteins, vesicles, organelles, etc.) is transferred from its place of creation to its target locations via molecular motor assisted transport along cytoskeletal filaments. The transport efficiency is strongly affected by the spatial organization of the cytoskeleton, which constitutes an inhomogeneous, complex network. In cells with a centrosome microtubules grow radially from the central microtubule organizing center towards the cell periphery whereas actin filaments form a dense meshwork, the actin cortex, underneath the cell membrane with a broad range of orientations. The emerging ballistic motion along filaments is frequently interrupted due to constricting intersection nodes or cycles of detachment and reattachment processes in the crowded cytoplasm. In order to investigate the efficiency of search strategies established by the cell’s specific spatial organization of the cytoskeleton we formulate a random velocity model with intermittent arrest states. With extensive computer simulations we analyze the dependence of the mean first passage times for narrow escape problems on the structural characteristics of the cytoskeleton, the motor properties and the fraction of time spent in each state. We find that an inhomogeneous architecture with a small width of the actin cortex constitutes an efficient intracellular search strategy.
https://arxiv.org/abs/1605.09230
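A toy Monte Carlo sketch of a random velocity model with intermittent arrest states (all parameters, the disk geometry, and the crude reflection rule are illustrative assumptions, not the paper's model): a searcher in a disk alternates ballistic runs in random directions with pauses, and we record the first passage time to a narrow escape arc on the boundary.

```python
import numpy as np

rng = np.random.default_rng(0)
R, target_halfwidth = 1.0, 0.05        # disk radius; escape arc around angle 0
v, t_run, t_pause = 1.0, 0.5, 0.2      # speed; mean run / pause durations

def first_passage():
    pos, t = np.zeros(2), 0.0
    while True:
        t += rng.exponential(t_pause)                 # arrest state
        theta = rng.uniform(0, 2 * np.pi)             # new run direction
        for _ in range(int(rng.exponential(t_run) / 0.01)):
            pos += 0.01 * v * np.array([np.cos(theta), np.sin(theta)])
            t += 0.01
            if np.hypot(*pos) >= R:                   # reached the boundary
                if abs(np.arctan2(pos[1], pos[0])) < target_halfwidth:
                    return t                          # escaped through the arc
                pos *= (R - 1e-3) / np.hypot(*pos)    # else bounce back inward
                theta = rng.uniform(0, 2 * np.pi)

print("mean first passage time:",
      np.mean([first_passage() for _ in range(200)]))
```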
A simple design for a self-powered UV photodetector device based on hybrid r-GO/GaN is demonstrated. Under zero bias, the fabricated hybrid photodetector shows a photosensitivity of ~85%, while an ohmic-contact GaN photodetector with an identical device structure exhibits only ~5.3% photosensitivity at 350 nm illumination (18 microWatt/cm^2). The responsivity and detectivity of the hybrid device were found to be 1.54 mA/W and 1.45x10^10 Jones (cm Hz^(1/2) W^(-1)), respectively, at zero bias under 350 nm illumination (18 microWatt/cm^2), with a fast response time (60 ms), recovery time (267 ms) and excellent repeatability. Power-density-dependent responsivity and detectivity revealed ultrasensitive behaviour under low light conditions. The observed self-powered effect in the hybrid photodetector is attributed to the depletion region formed at the r-GO/GaN quasi-ohmic interface.
https://arxiv.org/abs/1611.03597
In this paper, we present our first attempts at building a multilingual Neural Machine Translation framework under a unified approach. We are then able to employ attention-based NMT for many-to-many multilingual translation tasks. Our approach does not require any special treatment of the network architecture and allows us to learn a minimal number of free parameters in a standard training regime. Our approach has shown its effectiveness in an under-resourced translation scenario with considerable improvements of up to 2.6 BLEU points. In addition, the approach has achieved interesting and promising results when applied to translation tasks where there is no direct parallel corpus between the source and target languages.
https://arxiv.org/abs/1611.04798
Generative Adversarial Networks (GAN) have limitations when the goal is to generate sequences of discrete elements. The reason for this is that samples from a distribution on discrete objects such as the multinomial are not differentiable with respect to the distribution parameters. This problem can be avoided by using the Gumbel-softmax distribution, which is a continuous approximation to a multinomial distribution parameterized in terms of the softmax function. In this work, we evaluate the performance of GANs based on recurrent neural networks with Gumbel-softmax output distributions in the task of generating sequences of discrete elements.
https://arxiv.org/abs/1611.04051
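A minimal sketch of the Gumbel-softmax relaxation that keeps the generator differentiable (the standard formulation; the temperature value is an arbitrary choice): add Gumbel noise to the logits and take a softmax instead of a hard argmax, so gradients flow back to the logits.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=0.5):
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)

logits = torch.randn(4, 10, requires_grad=True)     # e.g. RNN output over a vocab
y = gumbel_softmax_sample(logits)                   # soft one-hot tokens
print(y.sum(dim=-1))                                # rows sum to 1
y.sum().backward()                                  # gradients reach the logits
print(logits.grad.shape)                            # torch.Size([4, 10])
```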
Interference detection for arbitrary geometric objects is not a trivial task due to the heavy computational load imposed by implementation issues. Hierarchically structured bounding boxes help us to quickly isolate the contours of segments in interference. In this paper, a new approach is introduced to treat the interference detection problem involving the representation of arbitrarily shaped objects. Our proposed method relies upon searching for the best possible way to represent contours by means of hierarchically structured, rectangular oriented bounding boxes. The technique handles 2D object boundaries defined by closed B-spline curves with roughness details. Each oriented box is adapted and fitted to the segments of the contour using second-order statistical indicators computed from elements of the object contour segments in a multiresolution framework. Our method is efficient and robust for real-time 2D animations, and it can deal with both smooth curves and polygonal approximations; results are presented to illustrate the performance of the new method.
https://arxiv.org/abs/1611.03666
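A sketch of fitting one oriented bounding box to a contour segment from second-order statistics (the general PCA recipe, not necessarily the paper's exact estimator): the covariance eigenvectors give the box axes, and the projection extents give its size.

```python
import numpy as np

def oriented_bbox(points):
    """points: (n, 2) samples along a contour segment -> (center, axes, half-extents)."""
    center = points.mean(axis=0)
    cov = np.cov((points - center).T)               # 2x2 second-order moments
    _, vecs = np.linalg.eigh(cov)                   # principal directions (columns)
    proj = (points - center) @ vecs                 # coordinates in the box frame
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return center + (lo + hi) / 2 @ vecs.T, vecs, (hi - lo) / 2

t = np.linspace(0, np.pi / 2, 50)                   # a quarter-circle segment
pts = np.c_[np.cos(t), np.sin(t)]
center, axes, half_sizes = oriented_bbox(pts)
print(center, half_sizes)                           # box midpoint and half-extents
```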