Welcome to AMDS123 Blog!

Recent Papers about CV, CL and SD

Visual Coreference Resolution in Visual Dialog using Neural Module Networks

2018-09-06

Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach

arXiv_CV

arXiv_CV QA VQA
Abstract

Visual dialog entails answering a series of questions grounded in an image, using dialog history as context. In addition to the challenges found in visual question answering (VQA), which can be seen as one-round dialog, visual dialog encompasses several more. We focus on one such problem called visual coreference resolution that involves determining which words, typically noun phrases and pronouns, co-refer to the same entity/object instance in an image. This is crucial, especially for pronouns (e.g., 'it'), as the dialog agent must first link it to a previous coreference (e.g., 'boat'), and only then can rely on the visual grounding of the coreference 'boat' to reason about the pronoun 'it'. Prior work (in visual dialog) models visual coreference resolution either (a) implicitly via a memory network over history, or (b) at a coarse level for the entire question; and not explicitly at a phrase level of granularity. In this work, we propose a neural module network architecture for visual dialog by introducing two novel modules - Refer and Exclude - that perform explicit, grounded, coreference resolution at a finer word level. We demonstrate the effectiveness of our model on MNIST Dialog, a visually simple yet coreference-wise complex dataset, by achieving near perfect accuracy, and on VisDial, a large and challenging visual dialog dataset on real images, where our model outperforms other approaches, and is more interpretable, grounded, and consistent qualitatively.

URL

https://arxiv.org/abs/1809.01816

PDF

https://arxiv.org/pdf/1809.01816
Interpretable Visual Question Answering by Reasoning on Dependency Trees

2018-09-06

Qingxing Cao, Xiaodan Liang, Bailin Li, Liang Lin

arXiv_CV

arXiv_CV QA Attention Relation VQA
Abstract

Collaborative reasoning for understanding each image-question pair is very critical but underexplored for an interpretable visual question answering system. Although very recent works also attempted to use explicit compositional processes to assemble multiple subtasks embedded in the questions, their models heavily rely on annotations or handcrafted rules to obtain valid reasoning processes, leading to either heavy workloads or poor performance on composition reasoning. In this paper, to better align image and language domains in diverse and unrestricted cases, we propose a novel neural network model that performs global reasoning on a dependency tree parsed from the question, and we thus phrase our model as parse-tree-guided reasoning network (PTGRN). This network consists of three collaborative modules: i) an attention module to exploit the local visual evidence for each word parsed from the question, ii) a gated residual composition module to compose the previously mined evidence, and iii) a parse-tree-guided propagation module to pass the mined evidence along the parse tree. Our PTGRN is thus capable of building an interpretable VQA system that gradually derives the image cues following a question-driven parse-tree reasoning route. Experiments on relational datasets demonstrate the superiority of our PTGRN over current state-of-the-art VQA methods, and the visualization results highlight the explainable capability of our reasoning system.

URL

https://arxiv.org/abs/1809.01810

PDF

https://arxiv.org/pdf/1809.01810
MDCN: Multi-Scale, Deep Inception Convolutional Neural Networks for Efficient Object Detection

2018-09-06

Wenchi Ma, Yuanwei Wu, Zongbo Wang, Guanghui Wang

arXiv_CV

arXiv_CV Object_Detection CNN Detection
Abstract

Object detection in challenging situations such as scale variation, occlusion, and truncation depends not only on feature details but also on contextual information. Most previous networks place too much emphasis on detailed feature extraction through deeper and wider networks, which may enhance the accuracy of object detection to a certain extent. However, the feature details are easily changed or washed out after passing through complicated filtering structures. To better handle these challenges, the paper proposes a novel framework, multi-scale, deep inception convolutional neural network (MDCN), which focuses on wider and broader object regions by activating feature maps produced in the deep part of the network. Instead of incepting inner layers in the shallow part of the network, multi-scale inceptions are introduced in the deep layers. The proposed framework integrates the contextual information into the learning process through a single-shot network structure. It is computationally efficient and avoids the hard training problem of previous macro feature extraction networks designed for shallow layers. Extensive experiments demonstrate the effectiveness and superior performance of MDCN over the state-of-the-art models.

URL

https://arxiv.org/abs/1809.01791

PDF

https://arxiv.org/pdf/1809.01791
GAN Lab: Understanding Complex Deep Generative Models using Interactive Visual Experimentation

2018-09-05

Minsuk Kahng, Nikhil Thorat, Duen Horng Chau, Fernanda Viégas, Martin Wattenberg

arXiv_CV

arXiv_CV Adversarial GAN Deep_Learning
Abstract

Recent success in deep learning has generated immense interest among practitioners and students, inspiring many to learn about this new technology. While visual and interactive approaches have been successfully developed to help people more easily learn deep learning, most existing tools focus on simpler models. In this work, we present GAN Lab, the first interactive visualization tool designed for non-experts to learn and experiment with Generative Adversarial Networks (GANs), a popular class of complex deep learning models. With GAN Lab, users can interactively train generative models and visualize the dynamic training process’s intermediate results. GAN Lab tightly integrates a model overview graph that summarizes GAN’s structure, and a layered distributions view that helps users interpret the interplay between submodels. GAN Lab introduces new interactive experimentation features for learning complex deep learning models, such as step-by-step training at multiple levels of abstraction for understanding intricate training dynamics. Implemented using TensorFlow.js, GAN Lab is accessible to anyone via modern web browsers, without the need for installation or specialized hardware, overcoming a major practical challenge in deploying interactive tools for deep learning.

URL

https://arxiv.org/abs/1809.01587

PDF

https://arxiv.org/pdf/1809.01587
Wasserstein Divergence for GANs

2018-09-05

Jiqing Wu, Zhiwu Huang, Janine Thoma, Dinesh Acharya, Luc Van Gool

arXiv_CV

arXiv_CV Adversarial GAN Optimization Quantitative
Abstract

In many domains of computer vision, generative adversarial networks (GANs) have achieved great success, among which the family of Wasserstein GANs (WGANs) is considered to be state-of-the-art due to the theoretical contributions and competitive qualitative performance. However, it is very challenging to approximate the $k$-Lipschitz constraint required by the Wasserstein-1 metric (W-met). In this paper, we propose a novel Wasserstein divergence (W-div), which is a relaxed version of W-met and does not require the $k$-Lipschitz constraint. As a concrete application, we introduce a Wasserstein divergence objective for GANs (WGAN-div), which can faithfully approximate W-div through optimization. Under various settings, including progressive growing training, we demonstrate the stability of the proposed WGAN-div owing to its theoretical and practical advantages over WGANs. Also, we study the quantitative and visual performance of WGAN-div on standard image synthesis benchmarks of computer vision, showing the superior performance of WGAN-div compared to the state-of-the-art methods.

URL

https://arxiv.org/abs/1712.01026

PDF

https://arxiv.org/pdf/1712.01026
Training Deeper Neural Machine Translation Models with Transparent Attention

2018-09-04

Ankur Bapna, Mia Xu Chen, Orhan Firat, Yuan Cao, Yonghui Wu

arXiv_CL

arXiv_CL Attention CNN Optimization NMT RNN
Abstract

While current state-of-the-art NMT models, such as RNN seq2seq and Transformers, possess a large number of parameters, they are still shallow in comparison to convolutional models used for both text and vision applications. In this work we attempt to train significantly (2-3x) deeper Transformer and Bi-RNN encoders for machine translation. We propose a simple modification to the attention mechanism that eases the optimization of deeper models, and results in consistent gains of 0.7-1.1 BLEU on the benchmark WMT’14 English-German and WMT’15 Czech-English tasks for both architectures.

URL

https://arxiv.org/abs/1808.07561

PDF

https://arxiv.org/pdf/1808.07561
t-Exponential Memory Networks for Question-Answering Machines

2018-09-04

Kyriakos Tolias, Sotirios Chatzis

arXiv_CV

arXiv_CV Sparse Attention Inference Deep_Learning Language_Model Prediction Memory_Networks
Abstract

Recent advances in deep learning have brought to the fore models that can make multiple computational steps in the service of completing a task; these are capable of describing long-term dependencies in sequential data. Novel recurrent attention models over possibly large external memory modules constitute the core mechanisms that enable these capabilities. Our work addresses learning subtler and more complex underlying temporal dynamics in language modeling tasks that deal with sparse sequential data. To this end, we improve upon these recent advances, by adopting concepts from the field of Bayesian statistics, namely variational inference. Our proposed approach consists in treating the network parameters as latent variables with a prior distribution imposed over them. Our statistical assumptions go beyond the standard practice of postulating Gaussian priors. Indeed, to allow for handling outliers, which are prevalent in long observed sequences of multivariate data, multivariate t-exponential distributions are imposed. On this basis, we proceed to infer corresponding posteriors; these can be used for inference and prediction at test time, in a way that accounts for the uncertainty in the available sparse training data. Specifically, to allow for our approach to best exploit the merits of the t-exponential family, our method considers a new t-divergence measure, which generalizes the concept of the Kullback-Leibler divergence. We perform an extensive experimental evaluation of our approach, using challenging language modeling benchmarks, and illustrate its superiority over existing state-of-the-art techniques.

URL

https://arxiv.org/abs/1809.01229

PDF

https://arxiv.org/pdf/1809.01229
Neural machine translation framework based cross-lingual document vector with distance constraint training

2018-09-04

Wei Li, Brian Mak

arXiv_CL

arXiv_CL Knowledge Attention Embedding NMT Classification
Abstract

A universal cross-lingual representation of documents is very important for many natural language processing tasks. In this paper, we present a document vectorization method which can effectively create document vectors via a self-attention mechanism using a neural machine translation (NMT) framework. The model used by our method can be trained with parallel corpora that are unrelated to the task at hand. During testing, our method will take a monolingual document and convert it into a “Neural machine Translation framework based crosslingual Document Vector with distance constraint training” (cNTDV). cNTDV is a follow-up study from our previous research on the neural machine translation framework based document vector. The cNTDV can produce the document vector from a forward-pass of the encoder with fast speed. Moreover, it is trained with a distance constraint, so that the document vectors obtained from different language pairs are always consistent with each other. In a cross-lingual document classification task, our cNTDV embeddings surpass the published state-of-the-art performance in the English-to-German classification test, and, to the best of our knowledge, it also achieves the second best performance in the German-to-English classification test. Compared to our previous research, it does not need a translator in the testing process, which makes the model faster and more convenient.

URL

https://arxiv.org/abs/1807.11057

PDF

https://arxiv.org/pdf/1807.11057
PFDet: 2nd Place Solution to Open Images Challenge 2018 Object Detection Track

2018-09-04

Takuya Akiba, Tommi Kerola, Yusuke Niitani, Toru Ogawa, Shotaro Sano, Shuji Suzuki

arXiv_CV

arXiv_CV Object_Detection Sparse Detection
Abstract

We present a large-scale object detection system by team PFDet. Our system enables training with huge datasets using 512 GPUs, handles sparsely verified classes, and massive class imbalance. Using our method, we achieved 2nd place in the Google AI Open Images Object Detection Track 2018 on Kaggle.

URL

https://arxiv.org/abs/1809.00778

PDF

https://arxiv.org/pdf/1809.00778
Hierarchical Video Understanding

2018-09-04

Farzaneh Mahdisoltani, Roland Memisevic, David Fleet

arXiv_CV

arXiv_CV Video_Caption Caption
Abstract

We introduce a hierarchical architecture for video understanding that exploits the structure of real world actions by capturing targets at different levels of granularity. We design the model such that it first learns simpler coarse-grained tasks, and then moves on to learn more fine-grained targets. The model is trained with a joint loss on different granularity levels. We demonstrate empirical results on the recent release of Something-Something dataset, which provides a hierarchy of targets, namely coarse-grained action groups, fine-grained action categories, and captions. Experiments suggest that models that exploit targets at different levels of granularity achieve better performance on all levels.
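
The joint loss over granularity levels mentioned in the abstract can be pictured as a weighted sum of per-level losses. The following is a minimal sketch under stated assumptions, not the authors' implementation: the level names, the per-level weights, and the use of cross-entropy at each level are all hypothetical choices made for illustration.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])

def joint_loss(predictions, targets, weights):
    """Weighted sum of per-level cross-entropy losses.

    predictions: dict mapping a level name to a list of class probabilities
    targets: dict mapping a level name to the index of the correct class
    weights: dict mapping a level name to its loss weight (hypothetical values)
    """
    total = 0.0
    for level, probs in predictions.items():
        total += weights[level] * cross_entropy(probs, targets[level])
    return total

# Toy example with two hypothetical levels: coarse action group, fine action category.
predictions = {"coarse": [0.5, 0.5], "fine": [0.25, 0.75]}
targets = {"coarse": 0, "fine": 1}
weights = {"coarse": 1.0, "fine": 1.0}
loss = joint_loss(predictions, targets, weights)
```

Setting a level's weight to zero removes it from training, which is one simple way to compare single-level and multi-level setups.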

URL

https://arxiv.org/abs/1809.03316

PDF

https://arxiv.org/pdf/1809.03316
Diverse and Coherent Paragraph Generation from Images

2018-09-03

Moitreya Chatterjee, Alexander G. Schwing

arXiv_CV

arXiv_CV Image_Caption Summarization Caption
Abstract

Paragraph generation from images, which has gained popularity recently, is an important task for video summarization, editing, and support of the disabled. Traditional image captioning methods fall short on this front, since they aren’t designed to generate long informative descriptions. Moreover, the vanilla approach of simply concatenating multiple short sentences, possibly synthesized from a classical image captioning system, doesn’t embrace the intricacies of paragraphs: coherent sentences, globally consistent structure, and diversity. To address those challenges, we propose to augment paragraph generation techniques with ‘coherence vectors’, ‘global topic vectors’, and modeling of the inherent ambiguity of associating paragraphs with images, via a variational auto-encoder formulation. We demonstrate the effectiveness of the developed approach on two datasets, outperforming existing state-of-the-art techniques on both.

URL

https://arxiv.org/abs/1809.00681

PDF

https://arxiv.org/pdf/1809.00681
A programmable three-qubit superconducting processor with all-to-all connectivity

2018-09-03

Tanay Roy, Sumeru Hazra, Suman Kundu, Madhavi Chand, Meghan P. Patankar, R. Vijay

arXiv_CV

arXiv_CV
Abstract

Superconducting circuits are at the forefront of quantum computing technology because of the unparalleled combination of good coherence, fast gates and flexibility in design parameters. The majority of experiments demonstrating small quantum algorithms in the superconducting architecture have used transmon qubits and transverse qubit-qubit coupling. However, efficient universal digital computing has remained a challenge because the majority of state-of-the-art architectures rely on nearest-neighbor coupling in one or two dimensions. The limited connectivity and the availability of only two-qubit entangling gates result in inefficient implementation of algorithms with reduced fidelity. In this work, we present a programmable three-qubit processor, nicknamed “trimon”, with strong all-to-all coupling and access to native three-qubit gates. We implement three-qubit versions of various algorithms, namely Deutsch-Jozsa, Bernstein-Vazirani, Grover’s search and the quantum Fourier transform, to demonstrate the performance of our processor. Our results show the potential of the trimon as a building block for larger systems with enhanced qubit-qubit connectivity.

URL

https://arxiv.org/abs/1809.00668

PDF

https://arxiv.org/pdf/1809.00668
The MeMAD Submission to the WMT18 Multimodal Translation Task

2018-09-03

Stig-Arne Grönroos, Benoit Huet, Mikko Kurimo, Jorma Laaksonen, Bernard Merialdo, Phu Pham, Mats Sjöberg, Umut Sulubacak, Jörg Tiedemann, Raphael Troncy, Raúl Vázquez

arXiv_CL

arXiv_CL NMT
Abstract

This paper describes the MeMAD project entry to the WMT Multimodal Machine Translation Shared Task. We propose adapting the Transformer neural machine translation (NMT) architecture to a multi-modal setting. In this paper, we also describe the preliminary experiments with text-only translation systems leading us up to this choice. We have the top scoring system for both English-to-German and English-to-French, according to the automatic metrics for flickr18. Our experiments show that the effect of the visual features in our system is small. Our largest gains come from the quality of the underlying text-only NMT system. We find that appropriate use of additional data is effective.

URL

https://arxiv.org/abs/1808.10802

PDF

https://arxiv.org/pdf/1808.10802
Regularizing Deep Hashing Networks Using GAN Generated Fake Images

2018-09-02

Libing Geng, Yan Pan, Jikai Chen, Hanjiang Lai

arXiv_CV

arXiv_CV Image_Retrieval Adversarial GAN
Abstract

Recently, deep-networks-based hashing (deep hashing) has become a leading approach for large-scale image retrieval. It aims to learn a compact bitwise representation for images via deep networks, so that similar images are mapped to nearby hash codes. Since a deep network model usually has a large number of parameters, it may be too complicated for the training data we have, leading to model over-fitting. To address this issue, in this paper, we propose a simple two-stage pipeline to learn deep hashing models, by regularizing the deep hashing networks using fake images. The first stage is to generate fake images from the original training set without extra data, via a generative adversarial network (GAN). In the second stage, we propose a deep architecture to learn hash functions, in which we use a maximum-entropy based loss to incorporate the newly created fake images by the GAN. We show that this loss acts as a strong regularizer of the deep architecture, by penalizing low-entropy output hash codes. This loss can also be interpreted as a model ensemble by simultaneously training many network models with massive weight sharing but over different training sets. Empirical evaluation results on several benchmark datasets show that the proposed method has superior performance gains over state-of-the-art hashing methods.
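
To make the idea of penalizing low-entropy hash outputs concrete, here is a small sketch of an entropy-based penalty. It assumes, purely for illustration, that the network emits one probability per hash bit and that the penalty is the negative mean binary entropy of those probabilities; the function names are hypothetical and the paper's actual loss may differ.

```python
import math

def bit_entropy(p):
    """Binary entropy (in nats) of a single bit with probability p of being 1."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def negative_entropy_loss(bit_probs):
    """Negative mean entropy: lower (better) when the bits are less confident."""
    return -sum(bit_entropy(p) for p in bit_probs) / len(bit_probs)

# Confident outputs have low entropy and therefore receive a larger loss.
uncertain = negative_entropy_loss([0.5, 0.5])
confident = negative_entropy_loss([0.99, 0.01])
```

Under this assumption, minimizing the penalty on generated images pushes their hash codes away from confident assignments, which is one way such a term could act as a regularizer.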

URL

https://arxiv.org/abs/1803.09466

PDF

https://arxiv.org/e-print/1803.09466
Chittron: An Automatic Bangla Image Captioning System

2018-09-02

Motiur Rahman, Nabeel Mohammed, Nafees Mansoor, Sifat Momen

arXiv_CV

arXiv_CV Image_Caption Caption Embedding RNN Language_Model
Abstract

Automatic image caption generation aims to produce an accurate description of an image in natural language automatically. However, Bangla, the fifth most widely spoken language in the world, lags considerably in research and development in this domain. Besides, while there are many established data sets related to image annotation in English, no such resource exists for Bangla yet. Hence, this paper outlines the development of “Chittron”, an automatic image captioning system in Bangla. Moreover, to address the data set availability issue, a collection of 16,000 Bangladeshi contextual images has been accumulated and manually annotated in Bangla. This data set is then used to train a model which integrates a pre-trained VGG16 image embedding model with stacked LSTM layers. The model is trained to predict the caption when the input is an image, one word at a time. The results show that the model has successfully been able to learn a working language model and to generate captions of images quite accurately in many cases. The results are evaluated mainly qualitatively. However, BLEU scores are also reported. It is expected that a better result can be obtained with a bigger and more varied data set.

URL

https://arxiv.org/abs/1809.00339

PDF

https://arxiv.org/pdf/1809.00339
Future-Prediction-Based Model for Neural Machine Translation

2018-09-02

Bingzhen Wei, Junyang Lin

arXiv_CL

arXiv_CL NMT Prediction
Abstract

We propose a novel model for Neural Machine Translation (NMT). Different from the conventional method, our model can predict the future text length and words at each decoding time step so that the generation can be helped with the information from the future prediction. With such information, the model does not stop generation without having translated enough content. Experimental results demonstrate that our model can significantly outperform the baseline models. Besides, our analysis reflects that our model is effective in the prediction of the length and words of the untranslated content.

URL

https://arxiv.org/abs/1809.00336

PDF

https://arxiv.org/pdf/1809.00336
Learning Dynamic Memory Networks for Object Tracking

2018-09-02

Tianyu Yang, Antoni B. Chan

arXiv_CV

arXiv_CV Dynamic_Memory_Network Attention Tracking Object_Tracking RNN Detection Memory_Networks
Abstract

Template-matching methods for visual tracking have gained popularity recently due to their comparable performance and fast speed. However, they lack effective ways to adapt to changes in the target object’s appearance, making their tracking accuracy still far from state-of-the-art. In this paper, we propose a dynamic memory network to adapt the template to the target’s appearance variations during tracking. An LSTM is used as a memory controller, where the input is the search feature map and the outputs are the control signals for the reading and writing process of the memory block. As the location of the target is at first unknown in the search feature map, an attention mechanism is applied to concentrate the LSTM input on the potential target. To prevent aggressive model adaptivity, we apply gated residual template learning to control the amount of retrieved memory that is used to combine with the initial template. Unlike tracking-by-detection methods where the object’s information is maintained by the weight parameters of neural networks, which requires expensive online fine-tuning to be adaptable, our tracker runs completely feed-forward and adapts to the target’s appearance changes by updating the external memory. Moreover, unlike other tracking methods where the model capacity is fixed after offline training, the capacity of our tracker can be easily enlarged as the memory requirements of a task increase, which is favorable for memorizing long-term object information. Extensive experiments on OTB and VOT demonstrate that our tracker MemTrack performs favorably against state-of-the-art tracking methods while retaining real-time speed of 50 fps.
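
The gated residual template idea can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' code: it assumes the final template is the initial template plus a gated copy of the memory read-out, with one gate value per channel, and the function name is hypothetical.

```python
def gated_residual_template(initial, retrieved, gate):
    """Combine the initial template with a memory read-out, channel by channel.

    initial: initial template values
    retrieved: values read from the external memory
    gate: per-channel values in [0, 1]; 0 ignores the memory, 1 adds it fully
    """
    if not (len(initial) == len(retrieved) == len(gate)):
        raise ValueError("all inputs must have the same length")
    return [t0 + g * m for t0, m, g in zip(initial, retrieved, gate)]

# First channel keeps the initial template, second fully applies the memory residual.
final_template = gated_residual_template([1.0, 2.0], [0.5, -1.0], [0.0, 1.0])
```

Because the initial template always passes through unchanged, a closed gate falls back to plain template matching, which matches the stated goal of limiting aggressive adaptation.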

URL

https://arxiv.org/abs/1803.07268

PDF

https://arxiv.org/pdf/1803.07268
Approximate Distribution Matching for Sequence-to-Sequence Learning

2018-09-02

Wenhu Chen, Guanlin Li, Shujie Liu, Zhirui Zhang, Mu Li, Ming Zhou

arXiv_CV

arXiv_CV Image_Caption Summarization Caption Optimization RNN Prediction
Abstract

Sequence-to-Sequence models were introduced to tackle many real-life problems like machine translation, summarization, image captioning, etc. The standard optimization algorithms are mainly based on example-to-example matching like maximum likelihood estimation, which is known to suffer from data sparsity problem. Here we present an alternate view to explain sequence-to-sequence learning as a distribution matching problem, where each source or target example is viewed to represent a local latent distribution in the source or target domain. Then, we interpret sequence-to-sequence learning as learning a transductive model to transform the source local latent distributions to match their corresponding target distributions. In our framework, we approximate both the source and target latent distributions with recurrent neural networks (augmenter). During training, the parallel augmenters learn to better approximate the local latent distributions, while the sequence prediction model learns to minimize the KL-divergence of the transformed source distributions and the approximated target distributions. This algorithm can alleviate the data sparsity issues in sequence learning by locally augmenting more unseen data pairs and increasing the model’s robustness. Experiments conducted on machine translation and image captioning consistently demonstrate the superiority of our proposed algorithm over the other competing algorithms.

URL

https://arxiv.org/abs/1808.08003

PDF

https://arxiv.org/pdf/1808.08003
Evaluation of Neural Networks for Image Recognition Applications: Designing a 0-1 MILP Model of a CNN to create adversarials

2018-09-01

Lucas Schelkes

arXiv_CV

arXiv_CV Adversarial CNN Recognition
Abstract

Image Recognition is a central task in computer vision with applications ranging across search, robotics, self-driving cars and many others. There are three purposes of this document: 1. We follow up on (Fischetti & Jo, December, 2017) and show how a standard convolutional neural network can be optimized to a more sophisticated capsule architecture. 2. We introduce a MILP model based on CNN to create adversarials. 3. We compare and evaluate each network for image recognition tasks.

URL

https://arxiv.org/abs/1809.00216

PDF

https://arxiv.org/pdf/1809.00216
MS-UEdin Submission to the WMT2018 APE Shared Task: Dual-Source Transformer for Automatic Post-Editing

2018-09-01

Marcin Junczys-Dowmunt, Roman Grundkiewicz

arXiv_CL

arXiv_CL NMT
Abstract

This paper describes the Microsoft and University of Edinburgh submission to the Automatic Post-editing shared task at WMT2018. Based on training data and systems from the WMT2017 shared task, we re-implement our own models from the last shared task and introduce improvements based on extensive parameter sharing. Next we experiment with our implementation of dual-source transformer models and data selection for the IT domain. Our submission decisively wins the SMT post-editing sub-task, establishing the new state-of-the-art, and is a very close second (or equal, 16.46 vs 16.50 TER) in the NMT sub-task. Based on the rather weak results in the NMT sub-task, we hypothesize that neural-on-neural APE might not be actually useful.

URL

https://arxiv.org/abs/1809.00188

PDF

https://arxiv.org/pdf/1809.00188
DAC-SDC Low Power Object Detection Challenge for UAV Applications

2018-09-01

Xiaowei Xu, Xinyi Zhang, Bei Yu, X. Sharon Hu, Christopher Rowen, Jingtong Hu, Yiyu Shi

arXiv_CV

arXiv_CV Object_Detection Detection
Abstract

The 55th Design Automation Conference (DAC) held its first System Design Contest (SDC) in 2018. SDC’18 features a low power object detection challenge (LPODC) on designing and implementing novel algorithms for object detection in images taken from unmanned aerial vehicles (UAV). The dataset includes 95 categories and 150k images, and the hardware platforms include Nvidia’s TX2 and Xilinx’s PYNQ Z1. DAC-SDC’18 attracted more than 110 entries from 12 countries. This paper presents in detail the dataset and evaluation procedure. It further discusses the methods developed by some of the entries as well as representative results. The paper concludes with directions for future improvements.

URL

https://arxiv.org/abs/1809.00110

PDF

https://arxiv.org/pdf/1809.00110
When to Finish? Optimal Beam Search for Neural Text Generation

2018-08-31

Liang Huang, Kai Zhao, Mingbo Ma

arXiv_CV

arXiv_CV Image_Caption Summarization Text_Generation Caption
Abstract

In neural text generation such as neural machine translation, summarization, and image captioning, beam search is widely used to improve the output text quality. However, in the neural generation setting, hypotheses can finish in different steps, which makes it difficult to decide when to end beam search to ensure optimality. We propose a provably optimal beam search algorithm that will always return the optimal-score complete hypothesis (modulo beam size), and finish as soon as the optimality is established (finishing no later than the baseline). To counter neural generation’s tendency for shorter hypotheses, we also introduce a bounded length reward mechanism which allows a modified version of our beam search algorithm to remain optimal. Experiments on neural machine translation demonstrate that our principled beam search algorithm leads to improvement in BLEU score over previously proposed alternatives.
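
One way to see why an early stopping point can still be optimal is the following sketch of a stopping check. It is an illustration under a stated assumption, not the paper's algorithm: every hypothesis score is taken to be a sum of log-probabilities (each term at most zero), so extending an unfinished hypothesis can never raise its score. The function name is hypothetical.

```python
def should_stop(best_finished, active_scores):
    """Return True once no active hypothesis can still beat the best finished one.

    best_finished: score of the best complete hypothesis so far, or None
    active_scores: scores of the unfinished hypotheses in the current beam
    Assumes scores only decrease as hypotheses are extended.
    """
    if best_finished is None:
        return False  # nothing complete yet, keep searching
    if not active_scores:
        return True  # nothing left to extend
    return best_finished >= max(active_scores)

# A finished hypothesis at -2.0 already beats every unfinished candidate here.
can_stop = should_stop(-2.0, [-2.5, -3.0])
# An unfinished candidate at -1.5 could still end up better, so search continues.
must_continue = not should_stop(-2.0, [-1.5, -3.0])
```

The monotonicity assumption breaks once a per-word reward is added to the score, so a check of this kind would then need an upper bound on the remaining reward.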

URL

https://arxiv.org/abs/1809.00069

PDF

https://arxiv.org/pdf/1809.00069
Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection

2018-08-31

Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, Ciprian Chelba

arXiv_CL

arXiv_CL NMT
Abstract

Measuring domain relevance of data and identifying or selecting well-fit domain data for machine translation (MT) is a well-studied topic, but denoising is not yet. Denoising is concerned with a different type of data quality and tries to reduce the negative impact of data noise on MT training, in particular, neural MT (NMT) training. This paper generalizes methods for measuring and selecting data for domain MT and applies them to denoising NMT training. The proposed approach uses trusted data and a denoising curriculum realized by online data selection. Intrinsic and extrinsic evaluations of the approach show its significant effectiveness for NMT to train on data with severe noise.

URL

https://arxiv.org/abs/1809.00068

PDF

https://arxiv.org/pdf/1809.00068
Correcting Length Bias in Neural Machine Translation

2018-08-31

Kenton Murray, David Chiang

arXiv_CL

arXiv_CL NMT
Abstract

We study two problems in neural machine translation (NMT). First, in beam search, whereas a wider beam should in principle help translation, it often hurts NMT. Second, NMT has a tendency to produce translations that are too short. Here, we argue that these problems are closely related and both rooted in label bias. We show that correcting the brevity problem almost eliminates the beam problem; we compare some commonly-used methods for doing this, finding that a simple per-word reward works well; and we introduce a simple and quick way to tune this reward using the perceptron algorithm.
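
The per-word reward mentioned in the abstract can be illustrated with a small rescoring sketch. The function name, the reward value `gamma`, and the toy hypotheses are assumptions made for illustration, not details from the paper: each hypothesis is scored as its log-probability plus `gamma` times its length, which offsets the way summed log-probabilities favor short outputs.

```python
def rescore(hypotheses, gamma):
    """Sort hypotheses by log-probability plus a per-word reward, best first.

    hypotheses: list of (tokens, log_prob) pairs
    gamma: reward added for every generated token (hypothetical value)
    """
    scored = [(log_prob + gamma * len(tokens), tokens) for tokens, log_prob in hypotheses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy candidates: the shorter one has the higher raw log-probability.
candidates = [
    (["a", "cat"], -2.0),
    (["a", "cat", "sat", "down"], -3.0),
]

best_without_reward = rescore(candidates, gamma=0.0)[0][1]
best_with_reward = rescore(candidates, gamma=0.8)[0][1]
```

With `gamma=0.0` the two-word candidate ranks first; with `gamma=0.8` the four-word candidate overtakes it. The abstract describes tuning this reward with the perceptron algorithm, which the sketch does not cover.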

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.10006

PDF

https://arxiv.org/pdf/1808.10006
Read All
Towards Distributed Coevolutionary GANs

2018-08-31

Abdullah Al-Dujaili, Tom Schmiedlechner, Erik Hemberg, Una-May O'Reilly

arXiv_CV

arXiv_CV Adversarial GAN Optimization
Abstract

Generative Adversarial Networks (GANs) have become one of the dominant methods for deep generative modeling. Despite their demonstrated success on multiple vision tasks, GANs are difficult to train and much research has been dedicated towards understanding and improving their gradient-based learning dynamics. Here, we investigate the use of coevolution, a class of black-box (gradient-free) co-optimization techniques and a powerful tool in evolutionary computing, as a supplement to gradient-based GAN training techniques. Experiments on a simple model that exhibits several of the GAN gradient-based dynamics (e.g., mode collapse, oscillatory behavior, and vanishing gradients) show that coevolution is a promising framework for escaping degenerate GAN training behaviors.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1807.08194

PDF

https://arxiv.org/pdf/1807.08194
Read All
Real-time Detection, Tracking, and Classification of Moving and Stationary Objects using Multiple Fisheye Images

2018-08-31

Iljoo Baek, Albert Davies, Geng Yan, Ragunathan (Raj) Rajkumar

arXiv_CV

arXiv_CV Object_Detection Tracking Classification Detection
Abstract

The ability to detect pedestrians and other moving objects is crucial for an autonomous vehicle. This must be done in real time with minimal system overhead. This paper discusses the implementation of a surround view system to identify moving as well as static objects that are close to the ego vehicle. The algorithm works on 4 views captured by fisheye cameras, which are merged into a single frame. The moving object detection and tracking solution uses minimal system overhead to isolate regions of interest (ROIs) containing moving objects. These ROIs are then analyzed using a deep neural network (DNN) to categorize the moving object. With deployment and testing on a real car in urban environments, we have demonstrated the practical feasibility of the solution. Video demos of our algorithm have been uploaded to YouTube.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1803.06077

PDF

https://arxiv.org/pdf/1803.06077
Read All
Characterizing the Rate-Memory Tradeoff in Cache Networks within a Factor of 2

2018-08-31

Qian Yu, Mohammad Ali Maddah-Ali, A. Salman Avestimehr

arXiv_CV

arXiv_CV
Abstract

We consider a basic caching system, where a single server with a database of $N$ files (e.g. movies) is connected to a set of $K$ users through a shared bottleneck link. Each user has a local cache memory with a size of $M$ files. The system operates in two phases: a placement phase, where each cache memory is populated up to its size from the database, and a following delivery phase, where each user requests a file from the database, and the server is responsible for delivering the requested contents. The objective is to design the two phases to minimize the load (peak or average) of the bottleneck link. We characterize the rate-memory tradeoff of the above caching system within a factor of $2.00884$ for both the peak rate and the average rate (under uniform file popularity), improving on the state of the art, which is within factors of $4$ and $4.7$, respectively. Moreover, in a practically important case where the number of files ($N$) is large, we exactly characterize the tradeoff for systems with no more than $5$ users, and characterize the tradeoff within a factor of $2$ otherwise. To establish these results, we develop two new converse bounds that improve over the state of the art.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1702.04563

PDF

https://arxiv.org/pdf/1702.04563
Read All
LUCSS: Language-based User-customized Colourization of Scene Sketches

2018-08-30

Changqing Zou, Haoran Mo, Ruofei Du, Xing Wu, Chengying Gao, Hongbo Fu

arXiv_CV

arXiv_CV Segmentation Caption Relation
Abstract

We introduce LUCSS, a language-based system for interactive colorization of scene sketches, based on their semantic understanding. LUCSS is built upon deep neural networks trained via a large-scale repository of scene sketches and cartoon-style color images with text descriptions. It consists of three sequential modules. First, given a scene sketch, the segmentation module automatically partitions an input sketch into individual object instances. Next, the captioning module generates the text description with spatial relationships based on the instance-level segmentation results. Finally, the interactive colorization module allows users to edit the caption and produce colored images based on the altered caption. Our experiments show the effectiveness of our approach and the desirability of its components to alternative choices.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.10544

PDF

https://arxiv.org/pdf/1808.10544
Read All
iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection

2018-08-30

Chen Gao, Yuliang Zou, Jia-Bin Huang

arXiv_CV

arXiv_CV Attention Prediction Detection
Abstract

Recent years have witnessed rapid progress in detecting and recognizing individual object instances. To understand the situation in a scene, however, computers need to recognize how humans interact with surrounding objects. In this paper, we tackle the challenging task of detecting human-object interactions (HOI). Our core idea is that the appearance of a person or an object instance contains informative cues on which relevant parts of an image to attend to for facilitating interaction prediction. To exploit these cues, we propose an instance-centric attention module that learns to dynamically highlight regions in an image conditioned on the appearance of each instance. Such an attention-based network allows us to selectively aggregate features relevant for recognizing HOIs. We validate the efficacy of the proposed network on the Verbs in COCO and HICO-DET datasets and show that our approach compares favorably with the state of the art.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.10437

PDF

https://arxiv.org/pdf/1808.10437
Read All
Deep Chronnectome Learning via Full Bidirectional Long Short-Term Memory Networks for MCI Diagnosis

2018-08-30

Weizheng Yan, Han Zhang, Jing Sui, Dinggang Shen

arXiv_CV

arXiv_CV RNN Classification Memory_Networks
Abstract

Brain functional connectivity (FC) extracted from resting-state fMRI (RS-fMRI) has become a popular approach for disease diagnosis, where discriminating subjects with mild cognitive impairment (MCI) from normal controls (NC) is still one of the most challenging problems. Dynamic functional connectivity (dFC), consisting of time-varying spatiotemporal dynamics, may characterize “chronnectome” diagnostic information for improving MCI classification. However, most of the current dFC studies are based on detecting discrete major brain status via spatial clustering, which ignores rich spatiotemporal dynamics contained in such chronnectome. We propose Deep Chronnectome Learning for exhaustively mining the comprehensive information, especially the hidden higher-level features, i.e., the dFC time series that may add critical diagnostic power for MCI classification. To this end, we devise a new Fully-connected Bidirectional Long Short-Term Memory Network (Full-BiLSTM) to effectively learn the periodic brain status changes using both past and future information for each brief time segment and then fuse them to form the final output. We have applied our method to a rigorously built large-scale multi-site database (i.e., with 164 data from NCs and 330 from MCIs, which can be further augmented by 25 folds). Our method outperforms other state-of-the-art approaches with an accuracy of 73.6% under solid cross-validations. We also made extensive comparisons among multiple variants of LSTM models. The results suggest high feasibility of our method with promising value also for other brain disorder diagnoses.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.10383

PDF

https://arxiv.org/pdf/1808.10383
Read All
ET Probes, Nodes, and Landbases: A Proposed Galactic Communications Architecture and Implied Search Strategies

2018-08-30

John Gertz

arXiv_CV

arXiv_CV
Abstract

Land-based beacons, information-laden probes sent into our solar system, and more distal communication nodes have each been proposed as the most likely means by which we might be contacted by ET. Each method, considered in isolation from ET’s point of view, has limitations and flaws. An overarching galactic communication architecture that tethers together probes, nodes, and land bases is proposed to be a better overall solution. From this more efficient construct flow several conclusions: (a) Earth has been thoroughly surveilled, (b) Earth will be contacted in due course, (c) SETI beyond half the distance that Earth’s EM has reached (~35-50 LY) is futile, and (d) the very quiescence of the galaxy paradoxically implies that Drake’s N = many, and that there is a system of galactic governance. Search strategies are proposed to detect the described probe-node-land base communications pathway.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.07024

PDF

https://arxiv.org/pdf/1808.07024
Read All
Pronoun Translation in English-French Machine Translation: An Analysis of Error Types

2018-08-30

Christian Hardmeier, Liane Guillou

arXiv_CL

arXiv_CL NMT
Abstract

Pronouns are a long-standing challenge in machine translation. We present a study of the performance of a range of rule-based, statistical and neural MT systems on pronoun translation based on an extensive manual evaluation using the PROTEST test suite, which enables a fine-grained analysis of different pronoun types and sheds light on the difficulties of the task. We find that the rule-based approaches in our corpus perform poorly as a result of oversimplification, whereas SMT and early NMT systems exhibit significant shortcomings due to a lack of awareness of the functional and referential properties of pronouns. A recent Transformer-based NMT system with cross-sentence context shows very promising results on non-anaphoric pronouns and intra-sentential anaphora, but there is still considerable room for improvement in examples with cross-sentence dependencies.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.10196

PDF

https://arxiv.org/pdf/1808.10196
Read All
Searching Toward Pareto-Optimal Device-Aware Neural Architectures

2018-08-30

An-Chieh Cheng, Jin-Dong Dong, Chi-Hung Hsu, Shu-Huan Chang, Min Sun, Shih-Chieh Chang, Jia-Yu Pan, Yu-Ting Chen, Wei Wei, Da-Cheng Juan

arXiv_CV

arXiv_CV Survey Image_Classification Inference Classification
Abstract

Recent breakthroughs in Neural Architectural Search (NAS) have achieved state-of-the-art performance in many tasks such as image classification and language understanding. However, most existing works only optimize for model accuracy and largely ignore other important factors imposed by the underlying hardware and devices, such as latency and energy, when making inference. In this paper, we first introduce the problem of NAS and provide a survey on recent works. Then we deep dive into two recent advancements on extending NAS into multiple-objective frameworks: MONAS and DPP-Net. Both MONAS and DPP-Net are capable of optimizing accuracy and other objectives imposed by devices, searching for neural architectures that can be best deployed on a wide spectrum of devices: from embedded systems and mobile devices to workstations. Experimental results are poised to show that architectures found by MONAS and DPP-Net achieve Pareto optimality w.r.t. the given objectives for various devices.
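
Pareto optimality over accuracy and a device objective can be illustrated with a small sketch. The candidate names and numbers below are invented, latency stands in for whichever device objective is used, and the brute-force filter is only a definition check, not the search procedure of MONAS or DPP-Net.

```python
def pareto_front(candidates):
    """Return the candidates that no other candidate dominates.

    Each candidate is (name, accuracy, latency_ms); higher accuracy and
    lower latency are better.
    """
    def dominates(a, b):
        no_worse = a[1] >= b[1] and a[2] <= b[2]
        strictly_better = a[1] > b[1] or a[2] < b[2]
        return no_worse and strictly_better

    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical architectures: (name, accuracy, latency in milliseconds).
archs = [("A", 0.92, 30.0), ("B", 0.90, 12.0), ("C", 0.89, 20.0), ("D", 0.94, 55.0)]
front = pareto_front(archs)
```

Here architecture C is dropped because B is both more accurate and faster; A, B, and D remain because each trades accuracy against latency differently.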

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.09830

PDF

https://arxiv.org/pdf/1808.09830
Read All
Hard Non-Monotonic Attention for Character-Level Transduction

2018-08-29

Shijie Wu, Pamela Shapiro, Ryan Cotterell

arXiv_CV

arXiv_CV Image_Caption Attention Caption
Abstract

Character-level string-to-string transduction is an important component of various NLP tasks. The goal is to map an input string to an output string, where the strings may be of different lengths and have characters taken from different alphabets. Recent approaches have used sequence-to-sequence models with an attention mechanism to learn which parts of the input string the model should focus on during the generation of the output string. Both soft attention and hard monotonic attention have been used, but hard non-monotonic attention has only been used in other sequence modeling tasks such as image captioning and has required a stochastic approximation to compute the gradient. In this work, we introduce an exact, polynomial-time algorithm for marginalizing over the exponential number of non-monotonic alignments between two strings, showing that hard attention models can be viewed as neural reparameterizations of the classical IBM Model 1. We compare soft and hard non-monotonic attention experimentally and find that the exact algorithm significantly improves performance over the stochastic approximation and outperforms soft attention.
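
The reason an exponential sum over alignments can be computed in polynomial time can be checked numerically. The sketch below assumes, as in IBM Model 1, that each target step picks its source position independently given the model's scores, so the sum over all alignments factorizes into a product of per-step sums. The probabilities are random placeholders, not outputs of a trained model.

```python
import itertools
import math
import random

random.seed(0)
n_src, n_tgt = 3, 4


def random_distribution(k):
    """Return k positive numbers that sum to one."""
    weights = [random.random() + 0.1 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


# attn[j][i]: probability that target step j aligns to source position i.
attn = [random_distribution(n_src) for _ in range(n_tgt)]
# emit[j][i]: probability of the observed target symbol at step j under that alignment.
emit = [[random.uniform(0.05, 1.0) for _ in range(n_src)] for _ in range(n_tgt)]

# Brute force: sum over all n_src ** n_tgt alignments.
brute = sum(
    math.prod(attn[j][a[j]] * emit[j][a[j]] for j in range(n_tgt))
    for a in itertools.product(range(n_src), repeat=n_tgt)
)

# Factorized: product over target steps of a sum over source positions.
exact = math.prod(
    sum(attn[j][i] * emit[j][i] for i in range(n_src)) for j in range(n_tgt)
)

same_value = math.isclose(brute, exact, rel_tol=1e-9)
```

The brute-force sum touches 3^4 = 81 alignments, while the factorized form needs only 3 × 4 = 12 terms and gives the same likelihood.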

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.10024

PDF

https://arxiv.org/pdf/1808.10024
Read All
Revisiting Character-Based Neural Machine Translation with Capacity and Compression

2018-08-29

Colin Cherry, George Foster, Ankur Bapna, Orhan Firat, Wolfgang Macherey

arXiv_CL

arXiv_CL NMT
Abstract

Translating characters instead of words or word-fragments has the potential to simplify the processing pipeline for neural machine translation (NMT), and improve results by eliminating hyper-parameters and manual feature engineering. However, it results in longer sequences in which each symbol contains less information, creating both modeling and computational challenges. In this paper, we show that the modeling problem can be solved by standard sequence-to-sequence architectures of sufficient depth, and that deep models operating at the character level outperform identical models operating over word fragments. This result implies that alternative architectures for handling character input are better viewed as methods for reducing computation time than as improved ways of modeling longer sequences. From this perspective, we evaluate several techniques for character-level NMT, verify that they do not match the performance of our deep character baseline model, and evaluate the performance versus computation time tradeoffs they offer. Within this framework, we also perform the first evaluation for NMT of conditional computation over time, in which the model learns which timesteps can be skipped, rather than having them be dictated by a fixed schedule specified before training begins.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.09943

PDF

https://arxiv.org/pdf/1808.09943
Read All
Interact as You Intend: Intention-Driven Human-Object Interaction Detection

2018-08-29

Bingjie Xu, Junnan Li, Yongkang Wong, Mohan S. Kankanhalli, Qi Zhao

arXiv_CV

arXiv_CV Attention Detection
Abstract

The recent advances in instance-level detection tasks lay a strong foundation for genuine comprehension of visual scenes. However, the ability to fully comprehend a social scene is still in its preliminary stage. In this work, we focus on detecting human-object interactions (HOIs) in social scene images, which is demanding in terms of research and increasingly useful for practical applications. To undertake social tasks interacting with objects, humans direct their attention and move their body based on their intention. Based on this observation, we provide a unique computational perspective to explore human intention in HOI detection. Specifically, the proposed human intention-driven HOI detection (iHOI) framework models human pose with the relative distances from body joints to the object instances. It also utilizes human gaze to guide the attended contextual regions in a weakly-supervised setting. In addition, we propose a hard negative sampling strategy to address the problem of mis-grouping. We perform extensive experiments on two benchmark datasets, namely V-COCO and HICO-DET, and show that iHOI outperforms the existing approaches. The efficacy of each proposed component has also been validated.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.09796

PDF

https://arxiv.org/pdf/1808.09796
Read All
An Operation Sequence Model for Explainable Neural Machine Translation

2018-08-29

Felix Stahlberg, Danielle Saunders, Bill Byrne

arXiv_CL

arXiv_CL NMT
Abstract

We propose to achieve explainable neural machine translation (NMT) by changing the output representation to explain itself. We present a novel approach to NMT which generates the target sentence by monotonically walking through the source sentence. Word reordering is modeled by operations which allow setting markers in the target sentence and moving a target-side write head between those markers. In contrast to many modern neural models, our system emits explicit word alignment information which is often crucial to practical machine translation as it improves explainability. Our technique can outperform a plain text system in terms of BLEU score under the recent Transformer architecture on Japanese-English and Portuguese-English, and is within 0.5 BLEU difference on Spanish-English.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.09688

PDF

https://arxiv.org/pdf/1808.09688
Read All
From VQA to Multimodal CQA: Adapting Visual QA Models for Community QA Tasks

2018-08-29

Avikalp Srivastava, Hsin Wen Liu, Sumio Fujita

arXiv_CV

arXiv_CV Knowledge QA Classification VQA
Abstract

In this work, we present novel methods to adapt visual QA models for community QA tasks of practical significance - automated question category classification and finding experts for question answering - on questions containing both text and image. To the best of our knowledge, this is the first work to tackle the multimodality challenge in CQA, and is an enabling step towards basic question-answering on image-based CQA. First, we analyze the differences between visual QA and community QA datasets, discussing the limitations of applying VQA models directly to CQA tasks, and then we propose novel augmentations to VQA-based models to best address those limitations. Our model, with the augmentations of an image-text combination method tailored for CQA and use of auxiliary tasks for learning better grounding features, significantly outperforms the text-only and VQA model baselines for both tasks on real-world CQA data from Yahoo! Chiebukuro, a Japanese counterpart of Yahoo! Answers.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.09648

PDF

https://arxiv.org/pdf/1808.09648
Read All
Multi-Reference Training with Pseudo-References for Neural Translation and Text Generation

2018-08-28

Renjie Zheng, Mingbo Ma, Liang Huang

arXiv_CV

arXiv_CV Image_Caption Summarization Text_Generation Caption
Abstract

Neural text generation, including neural machine translation, image captioning, and summarization, has been quite successful recently. However, during training time, typically only one reference is considered for each example, even though there are often multiple references available, e.g., 4 references in NIST MT evaluations, and 5 references in image captioning data. We first investigate several different ways of utilizing multiple human references during training. But more importantly, we then propose an algorithm to generate exponentially many pseudo-references by first compressing existing human references into lattices and then traversing them to generate new pseudo-references. These approaches lead to substantial improvements over strong baselines in both machine translation (+1.5 BLEU) and image captioning (+3.1 BLEU / +11.7 CIDEr).
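
How a lattice can yield many pseudo-references can be sketched with a deliberately simplified word graph. The sketch merges identical words across references into one node and enumerates every start-to-end path; this merging rule, the function names, and the example sentences are assumptions for illustration and are much cruder than the lattice compression the abstract describes.

```python
def build_word_graph(references):
    """Merge references into a graph with one node per distinct word."""
    graph = {}
    for ref in references:
        path = ["<s>"] + ref.split() + ["</s>"]
        for prev, nxt in zip(path, path[1:]):
            graph.setdefault(prev, set()).add(nxt)
    return graph


def enumerate_paths(graph, max_len=20):
    """Depth-first enumeration of all <s> ... </s> paths of at most max_len words."""
    results = []
    stack = [("<s>", [])]
    while stack:
        node, words = stack.pop()
        for nxt in sorted(graph.get(node, ())):
            if nxt == "</s>":
                results.append(" ".join(words))
            elif len(words) < max_len:
                # The length cap keeps the search finite if the graph has cycles.
                stack.append((nxt, words + [nxt]))
    return results


refs = ["the cat sat on a mat", "one cat lay on a rug"]
pseudo = enumerate_paths(build_word_graph(refs))
```

The two references share the words "cat", "on", and "a", so the graph branches three times and yields 2 × 2 × 2 = 8 sentences, including both originals and new combinations such as "the cat lay on a rug".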

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.09564

PDF

https://arxiv.org/pdf/1808.09564
Read All
A Tree-based Decoder for Neural Machine Translation

2018-08-28

Xinyi Wang, Hieu Pham, Pengcheng Yin, Graham Neubig

arXiv_CL

arXiv_CL Knowledge NMT RNN
Abstract

Recent advances in Neural Machine Translation (NMT) show that adding syntactic information to NMT systems can improve the quality of their translations. Most existing work utilizes some specific types of linguistically-inspired tree structures, like constituency and dependency parse trees. This is often done via a standard RNN decoder that operates on a linearized target tree structure. However, it is an open question of what specific linguistic formalism, if any, is the best structural representation for NMT. In this paper, we (1) propose an NMT model that can naturally generate the topology of an arbitrary tree structure on the target side, and (2) experiment with various target tree structures. Our experiments show the surprising result that our model delivers the best improvements with balanced binary trees constructed without any linguistic knowledge; this model outperforms standard seq2seq models by up to 2.1 BLEU points, and other methods for incorporating target-side syntax by up to 0.7 BLEU.
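
A balanced binary tree that uses no linguistic knowledge can be built by recursively halving the sentence. The sketch below is one such construction under that assumption; the abstract does not specify the exact procedure, and the function name and nested-tuple representation are hypothetical.

```python
def balanced_tree(tokens):
    """Recursively split a token list in half to build a nested-tuple binary tree."""
    if not tokens:
        raise ValueError("cannot build a tree from an empty sentence")
    if len(tokens) == 1:
        return tokens[0]
    mid = (len(tokens) + 1) // 2  # the left half gets the extra token
    return (balanced_tree(tokens[:mid]), balanced_tree(tokens[mid:]))


tree = balanced_tree("the cat sat on the mat".split())
# tree == ((("the", "cat"), "sat"), (("on", "the"), "mat"))
```

The split points depend only on sentence length, so the resulting structure carries no syntactic information at all.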

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.09374

PDF

https://arxiv.org/pdf/1808.09374
Read All
SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation

2018-08-28

Xinyi Wang, Hieu Pham, Zihang Dai, Graham Neubig

arXiv_CL

arXiv_CL Optimization NMT
Abstract

In this work, we examine methods for data augmentation for text-based tasks such as neural machine translation (NMT). We formulate the design of a data augmentation policy with desirable properties as an optimization problem, and derive a generic analytic solution. This solution not only subsumes some existing augmentation schemes, but also leads to an extremely simple data augmentation strategy for NMT: randomly replacing words in both the source sentence and the target sentence with other random words from their corresponding vocabularies. We name this method SwitchOut. Experiments on three translation datasets of different scales show that SwitchOut yields consistent improvements of about 0.5 BLEU, achieving better or comparable performances to strong alternatives such as word dropout (Sennrich et al., 2016a). Code to implement this method is included in the appendix.
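
The replacement step can be sketched as follows. This simplified version replaces each position independently with a fixed probability, which is an assumption made here for brevity; it is not the sampling scheme or the code from the paper's appendix, and the vocabularies and sentences are toy examples.

```python
import random


def switch_out(tokens, vocab, replace_prob, rng):
    """Replace each token independently with a random word from the vocabulary."""
    return [rng.choice(vocab) if rng.random() < replace_prob else tok
            for tok in tokens]


rng = random.Random(0)
src_vocab = ["a", "b", "c", "d"]
tgt_vocab = ["w", "x", "y", "z"]
src, tgt = ["a", "b", "c"], ["x", "y", "z"]

# Augment both sides of a sentence pair, each with its own vocabulary.
aug_src = switch_out(src, src_vocab, 0.3, rng)
aug_tgt = switch_out(tgt, tgt_vocab, 0.3, rng)
```

Sentence lengths are unchanged, and a replacement may by chance pick the original word again.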

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.07512

PDF

https://arxiv.org/pdf/1808.07512
Read All
A Stable and Effective Learning Strategy for Trainable Greedy Decoding

2018-08-28

Yun Chen, Victor O.K. Li, Kyunghyun Cho, Samuel R. Bowman

arXiv_CV

arXiv_CV Reinforcement_Learning
Abstract

Beam search is a widely used approximate search strategy for neural network decoders, and it generally outperforms simple greedy decoding on tasks like machine translation. However, this improvement comes at substantial computational cost. In this paper, we propose a flexible new method that allows us to reap nearly the full benefits of beam search with nearly no additional computational cost. The method revolves around a small neural network actor that is trained to observe and manipulate the hidden state of a previously-trained decoder. To train this actor network, we introduce the use of a pseudo-parallel corpus built using the output of beam search on a base model, ranked by a target quality metric like BLEU. Our method is inspired by earlier work on this problem, but requires no reinforcement learning, and can be trained reliably on a range of models. Experiments on three parallel corpora and three architectures show that the method yields substantial improvements in translation quality and speed over each base system.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1804.07915

PDF

https://arxiv.org/pdf/1804.07915
Read All
Single Shot Scene Text Retrieval

2018-08-27

Lluís Gómez, Andrés Mafla, Marçal Rusiñol, Dimosthenis Karatzas

arXiv_CV

arXiv_CV Image_Retrieval
Abstract

Textual information found in scene images provides high-level semantic information about the image and its context, and it can be leveraged for better scene understanding. In this paper we address the problem of scene text retrieval: given a text query, the system must return all images containing the queried text. The novelty of the proposed model consists in the usage of a single shot CNN architecture that predicts at the same time bounding boxes and a compact text representation of the words in them. In this way, the text-based image retrieval task can be cast as a simple nearest neighbor search of the query text representation over the outputs of the CNN over the entire image database. Our experiments demonstrate that the proposed architecture outperforms the previous state of the art while offering a significant increase in processing speed.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.09044

PDF

https://arxiv.org/pdf/1808.09044
Read All
Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement

2018-08-27

Jason Lee, Elman Mansimov, Kyunghyun Cho

arXiv_CV

arXiv_CV Image_Caption Caption
Abstract

We propose a conditional non-autoregressive neural sequence model based on iterative refinement. The proposed model is designed based on the principles of latent variable models and denoising autoencoders, and is generally applicable to any sequence generation task. We extensively evaluate the proposed model on machine translation (En-De and En-Ro) and image caption generation, and observe that it significantly speeds up decoding while maintaining the generation quality comparable to the autoregressive counterpart.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1802.06901

PDF

https://arxiv.org/pdf/1802.06901
Read All
A neural attention model for speech command recognition

2018-08-27

Douglas Coimbra de Andrade, Sabato Leo, Martin Loesener Da Silva Viana, Christoph Bernkopf

arXiv_CV

arXiv_CV Image_Caption Attention Caption CNN Recognition
Abstract

This paper introduces a convolutional recurrent network with attention for speech command recognition. Attention models are powerful tools to improve performance on natural language, image captioning and speech tasks. The proposed model establishes a new state-of-the-art accuracy of 94.1% on Google Speech Commands dataset V1 and 94.5% on V2 (for the 20-commands recognition task), while still keeping a small footprint of only 202K trainable parameters. Results are compared with previous convolutional implementations on 5 different tasks (20 commands recognition (V1 and V2), 12 commands recognition (V1), 35 word recognition (V1) and left-right (V1)). We show detailed performance results and demonstrate that the proposed attention mechanism not only improves performance but also allows inspecting what regions of the audio were taken into consideration by the network when outputting a given category.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.08929

PDF

https://arxiv.org/pdf/1808.08929
Read All
Speeding-up Object Detection Training for Robotics with FALKON

2018-08-27

Elisa Maiettini, Giulia Pasquale, Lorenzo Rosasco, Lorenzo Natale

arXiv_CV

arXiv_CV Object_Detection Deep_Learning Detection
Abstract

The latest deep learning methods for object detection provide remarkable performance, but have limits when used in robotic applications. One of the most relevant issues is the long training time, which is due to the large size and imbalance of the associated training sets, characterized by few positive and a large number of negative examples (i.e. background). Proposed approaches are based on end-to-end learning by back-propagation [22] or kernel methods trained with Hard Negatives Mining on top of deep features [8]. These solutions are effective, but prohibitively slow for on-line applications. In this paper we propose a novel pipeline for object detection that overcomes this problem and provides comparable performance, with a 60x training speedup. Our pipeline combines (i) the Region Proposal Network and the deep feature extractor from [22] to efficiently select candidate RoIs and encode them into powerful representations, with (ii) the FALKON [23] algorithm, a novel kernel-based method that allows fast training on large scale problems (millions of points). We address the size and imbalance of training data by exploiting the stochastic subsampling intrinsic to the method and a novel, fast, bootstrapping approach. We assess the effectiveness of the approach on a standard Computer Vision dataset (PASCAL VOC 2007 [5]) and demonstrate its applicability to a real robotic scenario with the iCubWorld Transformations [18] dataset.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1803.08740

PDF

https://arxiv.org/pdf/1803.08740
Read All
A Study of Reinforcement Learning for Neural Machine Translation

2018-08-27

Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, Tie-Yan Liu

arXiv_CL

arXiv_CL Reinforcement_Learning NMT
Abstract

Recent studies have shown that reinforcement learning (RL) is an effective approach for improving the performance of neural machine translation (NMT) systems. However, due to its instability, successful RL training is challenging, especially in real-world systems where deep models and large datasets are leveraged. In this paper, taking several large-scale translation tasks as testbeds, we conduct a systematic study on how to train better NMT models using reinforcement learning. We provide a comprehensive comparison of several important factors (e.g., baseline reward, reward shaping) in RL training. Furthermore, since it remains unclear whether RL is still beneficial when monolingual data is used, we propose a new method to leverage RL to further boost the performance of NMT systems trained with source/target monolingual data. By integrating all our findings, we obtain competitive results on WMT14 English-German, WMT17 English-Chinese, and WMT17 Chinese-English translation tasks, in particular setting a new state of the art on the WMT17 Chinese-English translation task.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.08866

PDF

https://arxiv.org/pdf/1808.08866
Read All
Character-level Chinese-English Translation through ASCII Encoding

2018-08-27

Nikola I. Nikolov, Yuhuang Hu, Mi Xue Tan, Richard H.R. Hahnloser

arXiv_CL

arXiv_CL CNN NMT
Abstract

Character-level Neural Machine Translation (NMT) models have recently achieved impressive results on many language pairs. They mainly do well for Indo-European language pairs, where the languages share the same writing system. However, for translating between Chinese and English, the gap between the two different writing systems poses a major challenge because of a lack of systematic correspondence between the individual linguistic units. In this paper, we enable character-level NMT for Chinese by breaking down Chinese characters into linguistic units similar to those of Indo-European languages. We use the Wubi encoding scheme, which preserves the original shape and semantic information of the characters, while also being reversible. We show promising results from training Wubi-based models on the character and subword levels with recurrent as well as convolutional models.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1805.03330

PDF

https://arxiv.org/pdf/1805.03330
Read All
simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions

2018-08-27

Fenglin Liu, Xuancheng Ren, Yuanxin Liu, Houfeng Wang, Xu Sun

arXiv_CV

arXiv_CV Image_Caption Attention Caption
Abstract

The encoder-decoder framework has shown recent success in image captioning. Visual attention, which excels at capturing detail, and semantic attention, which excels at comprehensiveness, have been separately proposed to ground the caption on the image. In this paper, we propose the Stepwise Image-Topic Merging Network (simNet), which makes use of the two kinds of attention at the same time. At each time step when generating the caption, the decoder adaptively merges the attentive information in the extracted topics and the image according to the generated context, so that the visual information and the semantic information can be effectively combined. The proposed approach is evaluated on two benchmark datasets and achieves state-of-the-art performance. (The code is available at this https URL)

Abstract (translated by Google)

URL

https://arxiv.org/abs/1808.08732

PDF

https://arxiv.org/pdf/1808.08732
Read All
NullaNet: Training Deep Neural Networks for Reduced-Memory-Access Inference

2018-08-27

Mahdi Nazemi, Ghasem Pasandi, Massoud Pedram

arXiv_CV

arXiv_CV Speech_Recognition Inference Recognition
Abstract

Deep neural networks have been successfully deployed in a wide variety of applications including computer vision and speech recognition. However, the computational and storage complexity of these models has forced the majority of computations to be performed on high-end computing platforms or in the cloud. To cope with this complexity, this paper presents a training method that enables a radically different approach to realizing deep neural networks through Boolean logic minimization. This realization completely removes the energy-hungry step of accessing memory to obtain model parameters, consumes about two orders of magnitude fewer computing resources than realizations that use floating-point operations, and has substantially lower latency.
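
The idea that a neuron can be realized as Boolean logic can be illustrated with a small sketch: a neuron with binary inputs and a binary output is a Boolean function, so it can be written out as a truth table. This is not the paper's training method, and the logic minimization step is omitted; the weights, bias, and function names below are assumptions chosen for illustration.

```python
from itertools import product


def neuron_truth_table(weights, bias):
    """Enumerate a binary-input threshold neuron as a truth table."""
    table = {}
    for bits in product((0, 1), repeat=len(weights)):
        activation = sum(w * b for w, b in zip(weights, bits)) + bias
        # Step activation: the neuron outputs 1 when the weighted sum is non-negative.
        table[bits] = 1 if activation >= 0 else 0
    return table


# Hypothetical parameters for a 3-input neuron.
TABLE = neuron_truth_table(weights=[1.0, 1.0, -1.5], bias=-0.25)


def infer(bits):
    """Inference is a table lookup; the weights are no longer read."""
    return TABLE[tuple(bits)]


for bits, out in sorted(TABLE.items()):
    print(bits, "->", out)
```

A table like this grows exponentially with the number of inputs, which is why a realization of this kind would rely on minimizing the resulting logic rather than storing the table directly.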

Abstract (translated by Google)

URL

https://arxiv.org/abs/1807.08716

PDF

https://arxiv.org/pdf/1807.08716
Read All
