Welcome to AMDS123 Blog!

Recent Papers about CV, CL and SD

Scalable Object Detection for Stylized Objects

2017-11-29

Aayush Garg, Thilo Will, William Darling, Willi Richert, Clemens Marschner

arXiv_CV

arXiv_CV Object_Detection CNN Detection Recognition
Abstract

Following recent breakthroughs in convolutional neural networks and monolithic model architectures, state-of-the-art object detection models can reliably and accurately scale into the realm of up to thousands of classes. Things quickly break down, however, when scaling into the tens of thousands, or, eventually, to millions or billions of unique objects. Further, bounding box-trained end-to-end models require extensive training data. Even though - with some tricks using hierarchies - one can sometimes scale up to thousands of classes, the labor requirements for clean image annotations quickly get out of control. In this paper, we present a two-layer object detection method for brand logos and other stylized objects for which prototypical images exist. It can scale to large numbers of unique classes. Our first layer is a CNN from the Single Shot Multibox Detector family of models that learns to propose regions where some stylized object is likely to appear. The contents of a proposed bounding box are then run against an image index that is targeted for the retrieval task at hand. The proposed architecture scales to a large number of object classes, allows new classes to be added continuously without retraining, and exhibits state-of-the-art quality on a stylized object detection task such as logo recognition.
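
The second layer described above (looking up the contents of a proposed box in an image index) can be sketched roughly as a nearest-neighbour search over prototype embeddings. This is only an illustration under assumptions: the `PrototypeIndex` class, the cosine-similarity scoring, the `min_score` cutoff and the toy 3-dimensional embeddings are all hypothetical and not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class PrototypeIndex:
    """Hypothetical index of one embedding per prototypical class image."""

    def __init__(self):
        self.entries = []  # list of (label, embedding)

    def add(self, label, embedding):
        # Adding a class only touches the index; the region proposer is untouched.
        self.entries.append((label, list(embedding)))

    def query(self, embedding, min_score=0.8):
        """Return the best-matching label, or None if nothing is similar enough."""
        best_label, best_score = None, min_score
        for label, proto in self.entries:
            score = cosine(embedding, proto)
            if score >= best_score:
                best_label, best_score = label, score
        return best_label

index = PrototypeIndex()
index.add("logo_a", [1.0, 0.0, 0.0])
index.add("logo_b", [0.0, 1.0, 0.0])
print(index.query([0.9, 0.1, 0.0]))   # embedding of a cropped proposal (toy values)

index.add("logo_c", [0.0, 0.0, 1.0])  # new class, no retraining step involved
print(index.query([0.0, 0.1, 0.95]))
```

The point of the sketch is the separation of concerns: the detector only proposes boxes, so growing the class list is an index update rather than a training run.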

URL

https://arxiv.org/abs/1711.09822

PDF

https://arxiv.org/pdf/1711.09822
SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving

2017-11-29

Bichen Wu, Alvin Wan, Forrest Iandola, Peter H. Jin, Kurt Keutzer

arXiv_CV

arXiv_CV Object_Detection CNN Inference Detection
Abstract

Object detection is a crucial task for autonomous driving. In addition to requiring high accuracy to ensure safety, object detection for autonomous driving also requires real-time inference speed to guarantee prompt vehicle control, as well as small model size and energy efficiency to enable embedded system deployment. In this work, we propose SqueezeDet, a fully convolutional neural network for object detection that aims to simultaneously satisfy all of the above constraints. In our network we use convolutional layers not only to extract feature maps, but also as the output layer to compute bounding boxes and class probabilities. The detection pipeline of our model only contains a single forward pass of a neural network, thus it is extremely fast. Our model is fully-convolutional, which leads to small model size and better energy efficiency. Finally, our experiments show that our model is very accurate, achieving state-of-the-art accuracy on the KITTI benchmark.

URL

https://arxiv.org/abs/1612.01051

PDF

https://arxiv.org/pdf/1612.01051
Speaker-Sensitive Dual Memory Networks for Multi-Turn Slot Tagging

2017-11-29

Young-Bum Kim, Sungjin Lee, Ruhi Sarikaya

arXiv_CV

arXiv_CV Face Memory_Networks
Abstract

In multi-turn dialogs, natural language understanding models can introduce obvious errors by being blind to contextual information. To incorporate dialog history, we present a neural architecture with Speaker-Sensitive Dual Memory Networks which encode utterances differently depending on the speaker. This addresses the different extents of information available to the system - the system knows only the surface form of user utterances while it has the exact semantics of system output. We performed experiments on real user data from Microsoft Cortana, a commercial personal assistant. The result showed a significant performance improvement over the state-of-the-art slot tagging models using contextual information.

URL

https://arxiv.org/abs/1711.10705

PDF

https://arxiv.org/pdf/1711.10705
Actor-Critic Sequence Training for Image Captioning

2017-11-28

Li Zhang, Flood Sung, Feng Liu, Tao Xiang, Shaogang Gong, Yongxin Yang, Timothy M. Hospedales

arXiv_CV

arXiv_CV Image_Caption Reinforcement_Learning Caption
Abstract

Generating natural language descriptions of images is an important capability for a robot or other visual-intelligence driven AI agent that may need to communicate with human users about what it is seeing. Such image captioning methods are typically trained by maximising the likelihood of ground-truth annotated caption given the image. While simple and easy to implement, this approach does not directly maximise the language quality metrics we care about such as CIDEr. In this paper we investigate training image captioning methods based on actor-critic reinforcement learning in order to directly optimise non-differentiable quality metrics of interest. By formulating a per-token advantage and value computation strategy in this novel reinforcement learning based captioning model, we show that it is possible to achieve the state of the art performance on the widely used MSCOCO benchmark.
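
To make the "per-token advantage and value computation" a little more concrete, here is a minimal sketch of a one-step temporal-difference advantage for a sampled caption. It assumes that the sequence-level reward (for example a CIDEr score) arrives only after the last token and that a critic has already produced one value estimate per token; the function name, the `gamma` default and the toy numbers are illustrative assumptions, not the paper's exact formulation.

```python
def per_token_advantages(values, final_reward, gamma=1.0):
    """One-step TD advantage per token.

    values[t] is the critic's estimate for the state in which token t is chosen.
    The reward is assumed to be zero everywhere except after the final token.
    """
    advantages = []
    for t in range(len(values)):
        is_last = t == len(values) - 1
        reward = final_reward if is_last else 0.0
        next_value = 0.0 if is_last else values[t + 1]
        advantages.append(reward + gamma * next_value - values[t])
    return advantages

# Toy caption of three tokens with hypothetical critic values and metric score.
advantages = per_token_advantages([0.5, 0.6, 0.9], final_reward=1.0)
print(advantages)
```

With `gamma=1.0` the advantages telescope: their sum equals the final reward minus the first value estimate, which is a quick sanity check on any implementation.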

URL

https://arxiv.org/abs/1706.09601

PDF

https://arxiv.org/pdf/1706.09601
MMD GAN: Towards Deeper Understanding of Moment Matching Network

2017-11-27

Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, Barnabás Póczos

arXiv_CV

arXiv_CV Adversarial GAN Gradient_Descent
Abstract

Generative moment matching network (GMMN) is a deep generative model that differs from Generative Adversarial Network (GAN) by replacing the discriminator in GAN with a two-sample test based on kernel maximum mean discrepancy (MMD). Although some theoretical guarantees of MMD have been studied, the empirical performance of GMMN is still not as competitive as that of GAN on challenging and large benchmark datasets. The computational efficiency of GMMN is also less desirable in comparison with GAN, partially due to its requirement for a rather large batch size during the training. In this paper, we propose to improve both the model expressiveness of GMMN and its computational efficiency by introducing adversarial kernel learning techniques, as the replacement of a fixed Gaussian kernel in the original GMMN. The new approach combines the key ideas in both GMMN and GAN, hence we name it MMD GAN. The new distance measure in MMD GAN is a meaningful loss that enjoys the advantage of weak topology and can be optimized via gradient descent with relatively small batch sizes. In our evaluation on multiple benchmark datasets, including MNIST, CIFAR-10, CelebA and LSUN, the performance of MMD GAN significantly outperforms GMMN, and is competitive with other representative GAN works.
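
For readers unfamiliar with MMD, the fixed-Gaussian-kernel baseline that the abstract refers to can be illustrated with the simple biased estimate of squared MMD between two sample sets. This sketch covers only that baseline statistic; the adversarially learned kernel proposed in the paper is not shown, and the bandwidth `sigma` and the toy samples are assumptions.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, sigma) for p in a for q in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

real = [[0.0], [0.1], [-0.1]]
fake_close = [[0.05], [0.0], [-0.05]]
fake_far = [[3.0], [3.1], [2.9]]
print(mmd_squared(real, fake_close))  # small: the two samples look alike
print(mmd_squared(real, fake_far))    # large: the two samples are far apart
```

A generator trained with this loss is pushed to make the statistic small; the estimate averages over all sample pairs, which is one reason batch size matters for GMMN.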

URL

https://arxiv.org/abs/1705.08584

PDF

https://arxiv.org/pdf/1705.08584
HP-GAN: Probabilistic 3D human motion prediction via GAN

2017-11-27

Emad Barsoum, John Kender, Zicheng Liu

arXiv_CV

arXiv_CV Adversarial GAN Prediction
Abstract

Predicting and understanding human motion dynamics has many applications, such as motion synthesis, augmented reality, security, and autonomous vehicles. Due to the recent success of generative adversarial networks (GAN), there has been much interest in probabilistic estimation and synthetic data generation using deep neural network architectures and learning algorithms. We propose a novel sequence-to-sequence model for probabilistic human motion prediction, trained with a modified version of improved Wasserstein generative adversarial networks (WGAN-GP), in which we use a custom loss function designed for human motion prediction. Our model, which we call HP-GAN, learns a probability density function of future human poses conditioned on previous poses. It predicts multiple sequences of possible future human poses, each from the same input sequence but a different vector z drawn from a random distribution. Furthermore, to quantify the quality of the non-deterministic predictions, we simultaneously train a motion-quality-assessment model that learns the probability that a given skeleton sequence is a real human motion. We test our algorithm on two of the largest skeleton datasets: NTURGB-D and Human3.6M. We train our model on both single and multiple action types. Its predictive power for long-term motion estimation is demonstrated by generating multiple plausible futures of more than 30 frames from just 10 frames of input. We show that most sequences generated from the same input have more than a 50% probability of being judged as a real human sequence. We will release all the code used in this paper on GitHub.

URL

https://arxiv.org/abs/1711.09561

PDF

https://arxiv.org/pdf/1711.09561
Learning to Remember Translation History with a Continuous Cache

2017-11-26

Zhaopeng Tu, Yang Liu, Shuming Shi, Tong Zhang

arXiv_CL

arXiv_CL NMT
Abstract

Existing neural machine translation (NMT) models generally translate sentences in isolation, missing the opportunity to take advantage of document-level information. In this work, we propose to augment NMT models with a very light-weight cache-like memory network, which stores recent hidden representations as translation history. The probability distribution over generated words is updated online depending on the translation history retrieved from the memory, endowing NMT models with the capability to dynamically adapt over time. Experiments on multiple domains with different topics and styles show the effectiveness of the proposed approach with negligible impact on the computational cost.
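
The cache idea can be pictured with a deliberately simplified sketch: a small fixed-size memory of (hidden representation, word) pairs, read by dot-product matching, whose output is mixed into the model's word distribution. Everything here is an assumption made for illustration (the `ContinuousCache` class, the softmax read, the fixed mixing weight `lam`, the toy vectors); the paper's actual memory layout and combination mechanism may well differ.

```python
import math
from collections import deque

class ContinuousCache:
    """Hypothetical bounded memory of recent (key vector, word) pairs."""

    def __init__(self, capacity=3):
        self.slots = deque(maxlen=capacity)  # oldest entry is dropped automatically

    def write(self, key, word):
        self.slots.append((list(key), word))

    def read(self, query):
        """Softmax over dot-product matches, aggregated per cached word."""
        if not self.slots:
            return {}
        scores = [sum(q * k for q, k in zip(query, key)) for key, _ in self.slots]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        dist = {}
        for (_, word), w in zip(self.slots, weights):
            dist[word] = dist.get(word, 0.0) + w / total
        return dist

def interpolate(model_probs, cache_probs, lam=0.2):
    """Mix the model distribution with the cache distribution."""
    if not cache_probs:
        return dict(model_probs)
    words = set(model_probs) | set(cache_probs)
    return {w: (1 - lam) * model_probs.get(w, 0.0) + lam * cache_probs.get(w, 0.0)
            for w in words}

cache = ContinuousCache(capacity=2)
cache.write([1.0, 0.0], "bank")
cache.write([0.0, 1.0], "river")
mixed = interpolate({"bank": 0.4, "shore": 0.6}, cache.read([5.0, 0.0]))
print(mixed)
```

Because the cache is only read and appended to at translation time, the adaptation happens online without touching the model weights.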

URL

https://arxiv.org/abs/1711.09367

PDF

https://arxiv.org/pdf/1711.09367
Multiple Instance Curriculum Learning for Weakly Supervised Object Detection

2017-11-25

Siyang Li, Xiangxin Zhu, Qin Huang, Hao Xu, C.-C. Jay Kuo

arXiv_CV

arXiv_CV Object_Detection Segmentation Weakly_Supervised Face Detection
Abstract

When supervising an object detector with weakly labeled data, most existing approaches are prone to getting trapped in the discriminative object parts, e.g., finding the face of a cat instead of the full body, because they lack supervision on the extent of full objects. To address this challenge, we incorporate object segmentation into the detector training, which guides the model to correctly localize the full objects. We propose the multiple instance curriculum learning (MICL) method, which injects curriculum learning (CL) into the multiple instance learning (MIL) framework. The MICL method starts by automatically picking the easy training examples, where the extent of the segmentation masks agrees with the detection bounding boxes. The training set is gradually expanded to include harder examples to train strong detectors that handle complex images. The proposed MICL method with segmentation in the loop outperforms the state-of-the-art weakly supervised object detectors by a substantial margin on the PASCAL VOC datasets.
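
One plausible way to read "easy examples, where the extent of the segmentation masks agrees with the detection bounding boxes" is as an overlap test between the box enclosing the mask and the detected box. The sketch below uses intersection-over-union with a threshold that is relaxed over rounds; the IoU criterion, the thresholds and the example boxes are assumptions for illustration rather than the paper's exact selection rule.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_easy(examples, threshold):
    """Keep images whose mask box and detection box overlap by at least threshold."""
    return [name for name, mask_box, det_box in examples
            if iou(mask_box, det_box) >= threshold]

# (image name, box around segmentation mask, detector's box): toy values.
examples = [
    ("cat", (0, 0, 10, 10), (0, 0, 10, 10)),    # full agreement
    ("dog", (0, 0, 10, 10), (0, 0, 5, 10)),     # detector covers half the object
    ("bird", (0, 0, 10, 10), (8, 8, 12, 12)),   # detector mostly off the object
]

# Curriculum: start strict, then relax the threshold to admit harder images.
for threshold in (0.9, 0.5, 0.0):
    print(threshold, select_easy(examples, threshold))
```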

URL

https://arxiv.org/abs/1711.09191

PDF

https://arxiv.org/pdf/1711.09191
Convolutional Image Captioning

2017-11-24

Jyoti Aneja, Aditya Deshpande, Alexander Schwing

arXiv_CV

arXiv_CV Image_Caption Caption CNN RNN
Abstract

Image captioning is an important but challenging task, applicable to virtual assistants, editing tools, image indexing, and support of the disabled. Its challenges are due to the variability and ambiguity of possible image descriptions. In recent years significant progress has been made in image captioning, using Recurrent Neural Networks powered by long-short-term-memory (LSTM) units. Despite mitigating the vanishing gradient problem, and despite their compelling ability to memorize dependencies, LSTM units are complex and inherently sequential across time. To address this issue, recent work has shown benefits of convolutional networks for machine translation and conditional image generation. Inspired by their success, in this paper, we develop a convolutional image captioning technique. We demonstrate its efficacy on the challenging MSCOCO dataset and demonstrate performance on par with the baseline, while having a faster training time per number of parameters. We also perform a detailed analysis, providing compelling reasons in favor of convolutional language generation approaches.

URL

https://arxiv.org/abs/1711.09151

PDF

https://arxiv.org/pdf/1711.09151
Critically excited states with enhanced memory and pattern recognition capacities in quantum brain networks: Lesson from black holes

2017-11-24

Gia Dvali

arXiv_CV

arXiv_CV Recognition
Abstract

We implement a mechanism - originally proposed as a model for the large memory storage capacity of black holes - in quantum neural networks and show that an exponentially increased capacity of pattern storage and recognition is achieved in certain critically excited states, without involvement of synaptic plasticity. We consider a simple network of N interconnected quantum neurons with weak excitatory synaptic connections. We show that for frozen synaptic weights there exist critical states of enhanced memory storage capacity. These states are achieved thanks to the high excitation levels of some of the neurons, which - despite feeble synaptic connections - dramatically lower the response threshold of the remaining weaker-excited neurons. As a result, the latter neurons acquire a capacity to store an exponentially large number of patterns within a narrow energy gap. The stored patterns can be recognized and retrieved with perfect response under the influence of arbitrarily soft input stimuli. In sharp contrast, under the same stimuli the recall is absent in the ground-state of the system. The lesson is that the state with the highest micro-state entropy and memory storage capacity is not necessarily a local minimum of energy, but rather an excited critical state. The considered phenomenon has a smooth classical limit and can serve for achieving an enhanced memory storage capacity in classical brain networks.

URL

https://arxiv.org/abs/1711.09079

PDF

https://arxiv.org/pdf/1711.09079
Long Short-Term Memory networks with jet constituents for boosted top tagging at the LHC

2017-11-24

Shannon Egan, Wojciech Fedorko, Alison Lister, Jannicke Pearkes, Colin Gay

arXiv_CV

arXiv_CV RNN Deep_Learning Memory_Networks
Abstract

Multivariate techniques based on engineered features have found wide adoption in the identification of jets resulting from hadronic top decays at the Large Hadron Collider (LHC). Recent Deep Learning developments in this area include the treatment of the calorimeter activation as an image or supplying a list of jet constituent momenta to a fully connected network. This latter approach lends itself well to the use of Recurrent Neural Networks. In this work the applicability of architectures incorporating Long Short-Term Memory (LSTM) networks is explored. Several network architectures, methods of ordering of jet constituents, and input pre-processing are studied. The best performing LSTM network achieves a background rejection of 100 for 50% signal efficiency. This represents more than a factor of two improvement over a fully connected Deep Neural Network (DNN) trained on similar types of inputs.
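
The headline number ("a background rejection of 100 for 50% signal efficiency") is a working-point metric. Under the common convention that background rejection is the inverse of the fraction of background passing the cut, it could be computed from classifier scores roughly as follows; the function name, the tie handling and the toy scores are assumptions for illustration.

```python
import math

def background_rejection(signal_scores, background_scores, signal_efficiency=0.5):
    """Rejection = 1 / (fraction of background passing the cut).

    The cut is the score of the weakest signal jet among the top
    `signal_efficiency` fraction of signal jets.
    """
    ranked = sorted(signal_scores, reverse=True)
    n_keep = max(1, math.ceil(signal_efficiency * len(ranked)))
    threshold = ranked[n_keep - 1]
    passed = sum(1 for s in background_scores if s >= threshold)
    if passed == 0:
        return float("inf")  # no background survives in this finite sample
    return len(background_scores) / passed

signal = [0.9, 0.8, 0.4, 0.3]      # toy scores for true top jets
background = [0.85] + [0.1] * 9    # toy scores for background jets
print(background_rejection(signal, background))
```

A rejection of 100 at this working point therefore means that about 1 in 100 background jets survives a cut that keeps half of the signal.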

URL

https://arxiv.org/abs/1711.09059

PDF

https://arxiv.org/pdf/1711.09059
Demystifying AlphaGo Zero as AlphaGo GAN

2017-11-24

Xiao Dong, Jiasong Wu, Ling Zhou

arXiv_CV

arXiv_CV GAN
Abstract

The astonishing success of AlphaGo Zero (Silver et al.) has prompted a worldwide discussion of the future of our human society with a mixed mood of hope, anxiousness, excitement and fear. We try to demystify AlphaGo Zero by a qualitative analysis to indicate that AlphaGo Zero can be understood as a specially structured GAN system which is expected to possess an inherent good convergence property. Thus we deduce that the success of AlphaGo Zero may not be a sign of a new generation of AI.

URL

https://arxiv.org/abs/1711.09091

PDF

https://arxiv.org/pdf/1711.09091
Feature Selective Networks for Object Detection

2017-11-24

Yao Zhai, Jingjing Fu, Yan Lu, Houqiang Li

arXiv_CV

arXiv_CV Object_Detection Attention Classification Detection
Abstract

Objects for detection usually have distinct characteristics in different sub-regions and different aspect ratios. However, in prevalent two-stage object detection methods, Region-of-Interest (RoI) features are extracted by RoI pooling with little emphasis on these translation-variant feature components. We present feature selective networks to reform the feature representations of RoIs by exploiting their disparities among sub-regions and aspect ratios. Our network produces the sub-region attention bank and aspect ratio attention bank for the whole image. The RoI-based sub-region attention map and aspect ratio attention map are selectively pooled from the banks, and then used to refine the original RoI features for RoI classification. Equipped with a light-weight detection subnetwork, our network gets a consistent boost in detection performance based on general ConvNet backbones (ResNet-101, GoogLeNet and VGG-16). Without bells and whistles, our detectors equipped with ResNet-101 achieve more than 3% mAP improvement compared to counterparts on PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO datasets.

URL

https://arxiv.org/abs/1711.08879

PDF

https://arxiv.org/pdf/1711.08879
Controlling Physical Attributes in GAN-Accelerated Simulation of Electromagnetic Calorimeters

2017-11-23

Luke de Oliveira, Michela Paganini, Benjamin Nachman

arXiv_CV

arXiv_CV Adversarial GAN
Abstract

High-precision modeling of subatomic particle interactions is critical for many fields within the physical sciences, such as nuclear physics and high energy particle physics. Most simulation pipelines in the sciences are computationally intensive; in a variety of scientific fields, Generative Adversarial Networks have been suggested as a solution to speed up the forward component of simulation, with promising results. An important component of any simulation system for the sciences is the ability to condition on any number of physically meaningful latent characteristics that can affect the forward generation procedure. We introduce an auxiliary task to the training of a Generative Adversarial Network on particle showers in a multi-layer electromagnetic calorimeter, which allows our model to learn an attribute-aware conditioning mechanism.

URL

https://arxiv.org/abs/1711.08813

PDF

https://arxiv.org/pdf/1711.08813
Memory-efficient and Ultra-fast Network Lookup and Forwarding using Othello Hashing

2017-11-22

Ye Yu, Djamal Belazzougui, Chen Qian, Qin Zhang

arXiv_CV

arXiv_CV
Abstract

Network algorithms always prefer low memory cost and fast packet processing speed. Forwarding information base (FIB), as a typical network processing component, requires a scalable and memory-efficient algorithm to support fast lookups. In this paper, we present a new network algorithm, Othello Hashing, and its application in a FIB design called Concise, which uses very little memory to support ultra-fast lookups of network names. Othello Hashing and Concise make use of minimal perfect hashing and rely on the programmable network framework to support dynamic updates. Our conceptual contribution of Concise is to optimize the memory efficiency and query speed in the data plane and move the relatively complex construction and update components to the resource-rich control plane. We implemented Concise on three platforms. Experimental results show that Concise uses significantly less memory to achieve much faster query speed compared to existing solutions for network name lookups.

URL

https://arxiv.org/abs/1608.05699

PDF

https://arxiv.org/pdf/1608.05699
Visual Question Answering as a Meta Learning Task

2017-11-22

Damien Teney, Anton van den Hengel

arXiv_CV

arXiv_CV QA VQA
Abstract

The predominant approach to Visual Question Answering (VQA) demands that the model represents within its weights all of the information required to answer any question about any image. Learning this information from any real training set seems unlikely, and representing it in a reasonable number of weights doubly so. We propose instead to approach VQA as a meta learning task, thus separating the question answering method from the information required. At test time, the method is provided with a support set of example questions/answers, over which it reasons to resolve the given question. The support set is not fixed and can be extended without retraining, thereby expanding the capabilities of the model. To exploit this dynamically provided information, we adapt a state-of-the-art VQA model with two techniques from the recent meta learning literature, namely prototypical networks and meta networks. Experiments demonstrate the capability of the system to learn to produce completely novel answers (i.e. never seen during training) from examples provided at test time. In comparison to the existing state of the art, the proposed method produces qualitatively distinct results with higher recall of rare answers, and a better sample efficiency that allows training with little initial data. More importantly, it represents an important step towards vision-and-language methods that can learn and reason on-the-fly.
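
Of the two meta learning techniques named above, prototypical networks are the easier to sketch: each answer is represented by the mean embedding of its support examples, and a query is assigned to the nearest prototype. The snippet below illustrates only that classification rule with made-up 2-dimensional embeddings; how the paper integrates it into a VQA model is not shown, and all names here are hypothetical.

```python
def prototypes(support):
    """support: list of (embedding, answer). Returns answer -> mean embedding."""
    sums, counts = {}, {}
    for emb, ans in support:
        acc = sums.setdefault(ans, [0.0] * len(emb))
        for i, v in enumerate(emb):
            acc[i] += v
        counts[ans] = counts.get(ans, 0) + 1
    return {ans: [v / counts[ans] for v in acc] for ans, acc in sums.items()}

def predict(query, protos):
    """Return the answer whose prototype is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda ans: sq_dist(query, protos[ans]))

support = [([1.0, 0.0], "red"), ([0.8, 0.2], "red"), ([0.0, 1.0], "blue")]
print(predict([0.9, 0.1], prototypes(support)))

# Extending the support set at test time adds a new answer without retraining.
support.append(([-1.0, -1.0], "green"))
print(predict([-0.9, -1.1], prototypes(support)))
```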

URL

https://arxiv.org/abs/1711.08105

PDF

https://arxiv.org/pdf/1711.08105
Towards Automatic Learning of Procedures from Web Instructional Videos

2017-11-21

Luowei Zhou, Chenliang Xu, Jason J. Corso

arXiv_CV

arXiv_CV Video_Caption Segmentation Caption
Abstract

The potential for agents, whether embodied or software, to learn by observing other agents performing procedures involving objects and actions is rich. Current research on automatic procedure learning heavily relies on action labels or video subtitles, even during the evaluation phase, which makes such methods infeasible in real-world scenarios. This leads to our question: can the human-consensus structure of a procedure be learned from a large set of long, unconstrained videos (e.g., instructional videos from YouTube) with only visual evidence? To answer this question, we introduce the problem of procedure segmentation: segmenting a video procedure into category-independent procedure segments. Given that no large-scale dataset is available for this problem, we collect a large-scale procedure segmentation dataset with procedure segments temporally localized and described; we use cooking videos and name the dataset YouCook2. We propose a segment-level recurrent network for generating procedure segments by modeling the dependencies across segments. The generated segments can be used as pre-processing for other tasks, such as dense video captioning and event parsing. We show in our experiments that the proposed model outperforms competitive baselines in procedure segmentation.

URL

https://arxiv.org/abs/1703.09788

PDF

https://arxiv.org/pdf/1703.09788
Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition

2017-11-21

Zhong Meng, Shinji Watanabe, John R. Hershey, Hakan Erdogan

arXiv_CV

arXiv_CV Speech_Recognition RNN Deep_Learning Recognition
Abstract

Far-field speech recognition in noisy and reverberant conditions remains a challenging problem despite recent deep learning breakthroughs. This problem is commonly addressed by acquiring a speech signal from multiple microphones and performing beamforming over them. In this paper, we propose to use a recurrent neural network with long short-term memory (LSTM) architecture to adaptively estimate real-time beamforming filter coefficients to cope with non-stationary environmental noise and the dynamic nature of source and microphone positions, which results in a set of time-varying room impulse responses. The LSTM adaptive beamformer is jointly trained with a deep LSTM acoustic model to predict senone labels. Further, we use hidden units in the deep LSTM acoustic model to assist in predicting the beamforming filter coefficients. The proposed system achieves 7.97% absolute gain over baseline systems with no beamforming on the CHiME-3 real evaluation set.
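
To see why time-varying filter coefficients matter, consider a bare-bones filter-and-sum beamformer for a single frequency bin, where the output is the sum over microphones of the conjugated filter times the observation. The sketch hard-codes the filters; in the paper they would be estimated by the LSTM, which is not modelled here, and the two-microphone phase values are invented for illustration.

```python
def filter_and_sum(filters, observations):
    """One time-frequency bin: y = sum over channels of conj(w_c) * x_c."""
    return sum(complex(w).conjugate() * complex(x)
               for w, x in zip(filters, observations))

def beamform_frames(filters_per_frame, observations_per_frame):
    """Apply a (possibly different) filter set to every frame."""
    return [filter_and_sum(w, x)
            for w, x in zip(filters_per_frame, observations_per_frame)]

# Two microphones; the source moves between frame 0 and frame 1, which flips
# the phase seen at the second microphone (toy numbers).
observations = [[1 + 0j, 1j], [1 + 0j, -1j]]

fixed = [[0.5, 0.5j], [0.5, 0.5j]]       # same filter for both frames
adaptive = [[0.5, 0.5j], [0.5, -0.5j]]   # filter follows the source

print(beamform_frames(fixed, observations))     # second frame cancels out
print(beamform_frames(adaptive, observations))  # both frames recover the signal
```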

URL

https://arxiv.org/abs/1711.08016

PDF

https://arxiv.org/pdf/1711.08016
Transferring Agent Behaviors from Videos via Motion GANs

2017-11-21

Ashley D. Edwards, Charles L. Isbell Jr

arXiv_CV

arXiv_CV Adversarial GAN Reinforcement_Learning
Abstract

A major bottleneck for developing general reinforcement learning agents is determining rewards that will yield desirable behaviors under various circumstances. We introduce a general mechanism for automatically specifying meaningful behaviors from raw pixels. In particular, we train a generative adversarial network to produce short sub-goals represented through motion templates. We demonstrate that this approach generates visually meaningful behaviors in unknown environments with novel agents and describe how these motions can be used to train reinforcement learning agents.

URL

https://arxiv.org/abs/1711.07676

PDF

https://arxiv.org/pdf/1711.07676
Efficient Architecture Search by Network Transformation

2017-11-21

Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, Jun Wang

arXiv_CV

arXiv_CV NAS Reinforcement_Learning CNN
Abstract

Techniques for automatically designing deep neural network architectures such as reinforcement learning based approaches have recently shown promising results. However, their success is based on vast computational resources (e.g. hundreds of GPUs), making them difficult to use widely. A noticeable limitation is that they still design and train each network from scratch during the exploration of the architecture space, which is highly inefficient. In this paper, we propose a new framework toward efficient architecture search by exploring the architecture space based on the current network and reusing its weights. We employ a reinforcement learning agent as the meta-controller, whose action is to grow the network depth or layer width with function-preserving transformations. As such, the previously validated networks can be reused for further exploration, thus saving a large amount of computational cost. We apply our method to explore the architecture space of the plain convolutional neural networks (no skip-connections, branching etc.) on image benchmark datasets (CIFAR-10, SVHN) with restricted computational resources (5 GPUs). Our method can design highly competitive networks that outperform existing networks using the same design scheme. On CIFAR-10, our model without skip-connections achieves 4.23% test error rate, exceeding a vast majority of modern architectures and approaching DenseNet. Furthermore, by applying our method to explore the DenseNet architecture space, we are able to achieve more accurate networks with fewer parameters.
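
A function-preserving width transformation can be illustrated on a tiny fully connected network: duplicate one hidden unit and split its outgoing weights in half, so the wider network computes exactly the same output. This is a sketch of one well-known way to do it, assumed here for illustration; the toy weights and the `widen` helper are hypothetical, and the paper applies such transformations to convolutional layers under a learned controller.

```python
def forward(x, w1, b1, w2, b2):
    """Two-layer network: ReLU hidden layer followed by a linear output layer."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def widen(w1, b1, w2, unit):
    """Duplicate hidden `unit` and halve its outgoing weights in both copies."""
    new_w1 = [row[:] for row in w1] + [w1[unit][:]]
    new_b1 = b1[:] + [b1[unit]]
    new_w2 = []
    for row in w2:
        half = row[unit] / 2.0
        new_row = row[:]
        new_row[unit] = half
        new_w2.append(new_row + [half])
    return new_w1, new_b1, new_w2

w1, b1 = [[1.0, -1.0], [0.5, 2.0]], [0.0, 0.1]
w2, b2 = [[1.0, -2.0]], [0.3]
x = [0.7, 0.2]

before = forward(x, w1, b1, w2, b2)
wide_w1, wide_b1, wide_w2 = widen(w1, b1, w2, unit=1)
after = forward(x, wide_w1, wide_b1, wide_w2, b2)
print(before, after)  # same output up to floating-point rounding, with one more hidden unit
```

Since the widened network starts from the same function, training can continue from the parent's accuracy instead of from scratch, which is the source of the savings the abstract describes.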

URL

https://arxiv.org/abs/1707.04873

PDF

https://arxiv.org/pdf/1707.04873
MARTA GANs: Unsupervised Representation Learning for Remote Sensing Image Classification

2017-11-21

Daoyu Lin, Kun Fu, Yang Wang, Guangluan Xu, Xian Sun

arXiv_CV

arXiv_CV Adversarial GAN CNN Image_Classification Represenation_Learning Classification Deep_Learning
Abstract

With the development of deep learning, supervised learning has frequently been adopted to classify remotely sensed images using convolutional networks (CNNs). However, due to the limited amount of labeled data available, supervised learning is often difficult to carry out. Therefore, we proposed an unsupervised model called multiple-layer feature-matching generative adversarial networks (MARTA GANs) to learn a representation using only unlabeled data. MARTA GANs consists of both a generative model $G$ and a discriminative model $D$. We treat $D$ as a feature extractor. To fit the complex properties of remote sensing data, we use a fusion layer to merge the mid-level and global features. $G$ can produce numerous images that are similar to the training data; therefore, $D$ can learn better representations of remotely sensed images using the training data provided by $G$. The classification results on two widely used remote sensing image databases show that the proposed method significantly improves the classification performance compared with other state-of-the-art methods.

URL

https://arxiv.org/abs/1612.08879

PDF

https://arxiv.org/pdf/1612.08879
Session-aware Information Embedding for E-commerce Product Recommendation

2017-11-20

Chen Wu, Ming Yan, Luo Si

arXiv_CV

arXiv_CV Embedding Quantitative Recommendation
Abstract

Most existing recommender systems assume that a user's visiting history can be constantly recorded. However, in recent online services, the user's identity is often unknown and only limited online user behaviors can be used. It is of great importance to model the temporal online user behaviors and make recommendations for anonymous users. In this paper, we propose a list-wise deep neural network based architecture to model the limited user behaviors within each session. To train the model efficiently, we first design a session embedding method to pre-train a session representation, which incorporates different kinds of user search behaviors such as clicks and views. Based on the learnt session representation, we further propose a list-wise ranking model to generate the recommendation result for each anonymous user session. We conduct quantitative experiments on a recently published dataset from an e-commerce company. The evaluation results validate the effectiveness of the proposed method, which can outperform the state-of-the-art significantly.

URL

https://arxiv.org/abs/1707.05955

PDF

https://arxiv.org/pdf/1707.05955
Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space

2017-11-19

Liwei Wang, Alexander G. Schwing, Svetlana Lazebnik

arXiv_CV

arXiv_CV Image_Caption Caption RNN
Abstract

This paper explores image caption generation using conditional variational auto-encoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space around $K$ components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). Our first model uses a Gaussian Mixture model (GMM) prior, while the second one defines a novel Additive Gaussian (AG) prior that linearly combines component means. We show that both models produce captions that are more diverse and more accurate than a strong LSTM baseline or a “vanilla” CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise.
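
The Additive Gaussian prior "linearly combines component means", which can be pictured as follows: each content type has its own mean vector, and the prior mean for an image is a weighted combination of the means of the types present. The normalisation of the weights, the shared isotropic `sigma` and the toy 2-dimensional means are assumptions made for this sketch and may not match the paper's parameterisation.

```python
import random

def additive_gaussian_mean(component_means, weights):
    """Prior mean = sum_k c_k * mu_k, with weights normalised to sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one component must have positive weight")
    norm = [w / total for w in weights]
    dim = len(component_means[0])
    return [sum(c * mu[d] for c, mu in zip(norm, component_means))
            for d in range(dim)]

def sample_latent(mean, sigma, rng):
    """Draw z from an isotropic Gaussian centred on the combined mean."""
    return [rng.gauss(m, sigma) for m in mean]

# Hypothetical means for K = 3 content types (say "dog", "ball", "car").
means = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
prior_mean = additive_gaussian_mean(means, [1, 1, 0])  # image with a dog and a ball
print(prior_mean)

rng = random.Random(0)
print(sample_latent(prior_mean, sigma=0.1, rng=rng))
```

Different draws of z around the same combined mean are what give several distinct captions for one image.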

URL

https://arxiv.org/abs/1711.07068

PDF

https://arxiv.org/pdf/1711.07068
A framework for Multi-A/B testing with online FDR control

2017-11-18

Fanny Yang, Aaditya Ramdas, Kevin Jamieson, Martin J. Wainwright

arXiv_CV

arXiv_CV Caption
Abstract

We propose an alternative framework to existing setups for controlling false alarms when multiple A/B tests are run over time. This setup arises in many practical applications, e.g. when pharmaceutical companies test new treatment options against control pills for different diseases, or when internet companies test their default webpages versus various alternatives over time. Our framework proposes to replace a sequence of A/B tests by a sequence of best-arm MAB instances, which can be continuously monitored by the data scientist. When interleaving the MAB tests with an online false discovery rate (FDR) algorithm, we can obtain the best of both worlds: low sample complexity and anytime online FDR control. Our main contributions are: (i) to propose reasonable definitions of a null hypothesis for MAB instances; (ii) to demonstrate how one can derive an always-valid sequential p-value that allows continuous monitoring of each MAB test; and (iii) to show that using rejection thresholds of online-FDR algorithms as the confidence levels for the MAB algorithms results in sample-optimality, high power and low FDR at any point in time. We run extensive simulations to verify our claims, and also report results on real data collected from the New Yorker Cartoon Caption contest.

URL

https://arxiv.org/abs/1706.05378

PDF

https://arxiv.org/pdf/1706.05378
Read All
AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding

2017-11-17

Jiahong Wu, He Zheng, Bo Zhao, Yixin Li, Baoming Yan, Rui Liang, Wenjia Wang, Shipei Zhou, Guosen Lin, Yanwei Fu, Yizhou Wang, Yonggang Wang

arXiv_CV

arXiv_CV Image_Caption Caption Classification Detection
Abstract

Significant progress has been achieved in Computer Vision by leveraging large-scale image datasets. However, large-scale datasets for complex Computer Vision tasks beyond classification are still limited. This paper proposes a large-scale dataset named AIC (AI Challenger) with three sub-datasets: human keypoint detection (HKD), large-scale attribute dataset (LAD) and image Chinese captioning (ICC). In this dataset, we annotate class labels (LAD), keypoint coordinates (HKD), bounding boxes (HKD and LAD), attributes (LAD) and captions (ICC). These rich annotations bridge the semantic gap between low-level images and high-level concepts. The proposed dataset is an effective benchmark to evaluate and improve different computational methods. In addition, for related tasks, others can also use our dataset as a new resource to pre-train their models.

URL

https://arxiv.org/abs/1711.06475

PDF

https://arxiv.org/pdf/1711.06475
Read All
Adaptive Feature Abstraction for Translating Video to Text

2017-11-17

Yunchen Pu, Martin Renqiang Min, Zhe Gan, Lawrence Carin

arXiv_CV

arXiv_CV Video_Caption Attention Caption CNN Quantitative
Abstract

Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as video features. However, the variable context-dependent semantics in the video may make it more appropriate to adaptively select features from multiple CNN layers. We propose a new approach for generating adaptive spatiotemporal representations of videos for the captioning task. A novel attention mechanism is developed that adaptively and sequentially focuses on different layers of CNN features (levels of feature “abstraction”), as well as local spatiotemporal regions of the feature maps at each layer. The proposed approach is evaluated on three benchmark datasets: YouTube2Text, M-VAD and MSR-VTT. Along with visualizing the results and how the model works, these experiments quantitatively demonstrate the effectiveness of the proposed adaptive spatiotemporal feature abstraction for translating videos to sentences with rich semantics.

URL

https://arxiv.org/abs/1611.07837

PDF

https://arxiv.org/pdf/1611.07837
Read All
VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection

2017-11-17

Yin Zhou, Oncel Tuzel

arXiv_CV

arXiv_CV Object_Detection Sparse Face Prediction Detection
Abstract

Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird’s eye view projection. In this work, we remove the need for manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to an RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR-based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.
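
The first step described above, dividing a point cloud into equally spaced 3D voxels and grouping the points that fall in each one, can be sketched as follows. This is only an illustration of the grouping step under assumed names and a plain-tuple point format; it does not attempt the learned VFE layer or anything else from the paper's implementation.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group 3D points by the index of the voxel that contains them.

    points: iterable of (x, y, z) tuples.
    voxel_size: (vx, vy, vz) edge lengths of one voxel (assumed parameter).
    Returns a dict mapping integer voxel indices to lists of points.
    """
    vx, vy, vz = voxel_size
    grid = defaultdict(list)
    for x, y, z in points:
        # floor (not int truncation) keeps negative coordinates in the right cell
        key = (math.floor(x / vx), math.floor(y / vy), math.floor(z / vz))
        grid[key].append((x, y, z))
    return dict(grid)


# Hypothetical toy cloud: two points share a voxel, one lies in its neighbour.
cloud = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (1.5, 0.1, 0.1)]
voxels = voxelize(cloud, (1.0, 1.0, 1.0))
```

Only occupied voxels get an entry, which matters for sparse LiDAR data where most of the grid is empty.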

URL

https://arxiv.org/abs/1711.06396

PDF

https://arxiv.org/pdf/1711.06396
Read All
Grounded Objects and Interactions for Video Captioning

2017-11-16

Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, Hans Peter Graf

arXiv_CV

arXiv_CV Video_Caption Caption
Abstract

We address the problem of video captioning by grounding language generation on object interactions in the video. Existing work mostly focuses on overall scene understanding with often limited or no emphasis on object interactions to address the problem of video understanding. In this paper, we propose SINet-Caption, which learns to generate captions grounded over higher-order interactions between arbitrary groups of objects for fine-grained video understanding. We discuss the challenges and benefits of such an approach. We further demonstrate state-of-the-art results on the ActivityNet Captions dataset using SINet-Caption, our model based on this approach.

URL

https://arxiv.org/abs/1711.06354

PDF

https://arxiv.org/pdf/1711.06354
Read All
Deep Matching Autoencoders

2017-11-16

Tanmoy Mukherjee, Makoto Yamada, Timothy M. Hospedales

arXiv_CV

arXiv_CV Image_Caption GAN Caption Represenation_Learning
Abstract

Increasingly many real-world tasks involve data in multiple modalities or views. This has motivated the development of many effective algorithms for learning a common latent space to relate multiple domains. However, most existing cross-view learning algorithms assume access to paired data for training. Their applicability is thus limited as the paired data assumption is often violated in practice: many tasks have only a small subset of data available with pairing annotation, or even no paired data at all. In this paper we introduce Deep Matching Autoencoders (DMAE), which learn a common latent space and pairing from unpaired multi-modal data. Specifically, we formulate this as a cross-domain representation learning and object matching problem. We simultaneously optimise parameters of representation learning auto-encoders and the pairing of unpaired multi-modal data. This framework elegantly spans the full regime from fully supervised through semi-supervised to unsupervised (no paired data) multi-modal learning. We show promising results in image captioning, and on a new task that is uniquely enabled by our methodology: unsupervised classifier learning.

URL

https://arxiv.org/abs/1711.06047

PDF

https://arxiv.org/pdf/1711.06047
Read All
Self-critical Sequence Training for Image Captioning

2017-11-16

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, Vaibhava Goel

arXiv_CV

arXiv_CV Image_Caption Reinforcement_Learning Caption Optimization Inference
Abstract

Recently it has been shown that policy-gradient methods for reinforcement learning can be utilized to train deep end-to-end systems directly on non-differentiable metrics for the task at hand. In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized. Our systems are built using a new optimization approach that we call self-critical sequence training (SCST). SCST is a form of the popular REINFORCE algorithm that, rather than estimating a “baseline” to normalize the rewards and reduce variance, utilizes the output of its own test-time inference algorithm to normalize the rewards it experiences. Using this approach, estimating the reward signal (as actor-critic methods must do) and estimating normalization (as REINFORCE algorithms typically do) are avoided, while at the same time harmonizing the model with respect to its test-time inference procedure. Empirically we find that directly optimizing the CIDEr metric with SCST and greedy decoding at test-time is highly effective. Our results on the MSCOCO evaluation server establish a new state-of-the-art on the task, improving the best result in terms of CIDEr from 104.9 to 114.7.
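
The core of the self-critical idea, using the reward of the model's own greedy (test-time) output as the REINFORCE baseline, can be sketched as a surrogate loss for one caption. The function name and scalar inputs are illustrative assumptions; in practice the log-probabilities would come from the captioning model and the rewards from a metric such as CIDEr.

```python
def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """Surrogate loss for one sampled caption under self-critical training.

    sample_log_probs: per-token log-probabilities of the sampled caption.
    sample_reward:    metric score of the sampled caption.
    greedy_reward:    metric score of the greedily decoded caption, which
                      plays the role of the baseline.
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the probability of samples that beat the greedy
    # output and lowers the probability of samples that fall short of it.
    return -advantage * sum(sample_log_probs)


# Hypothetical numbers: the sample scores 1.0, the greedy caption 0.5.
loss = scst_loss([-0.5, -0.25], sample_reward=1.0, greedy_reward=0.5)
```

When the sample and the greedy output score the same, the advantage is zero and the caption contributes no gradient.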

URL

https://arxiv.org/abs/1612.00563

PDF

https://arxiv.org/pdf/1612.00563
Read All
An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification

2017-11-15

Cristina España-Bonet, Ádám Csaba Varga, Alberto Barrón-Cedeño, Josef van Genabith

arXiv_CL

arXiv_CL Embedding NMT
Abstract

End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words (or sentences) which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present work is two-fold. First, we systematically study the NMT context vectors, i.e. output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as semantically related and semantically unrelated sentence pairs. Second, as extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F1 of 98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, F1 reaches 98.9%.
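
One simple way to picture "measuring similarities" between sentence representations is cosine similarity with a decision threshold. The sketch below is a generic illustration, not the paper's classifier: the function names, the fixed-length vector input and the threshold value of 0.8 are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def looks_parallel(src_vec, tgt_vec, threshold=0.8):
    """Flag a sentence pair as parallel if their vectors are close enough.

    threshold is a hypothetical value; it would need tuning on held-out data.
    """
    return cosine_similarity(src_vec, tgt_vec) >= threshold


# Hypothetical sentence vectors pointing in the same direction.
same_direction = looks_parallel([3.0, 4.0], [6.0, 8.0])
```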

URL

https://arxiv.org/abs/1704.05415

PDF

https://arxiv.org/pdf/1704.05415
Read All
DataVizard: Recommending Visual Presentations for Structured Data

2017-11-14

Rema Ananthanarayanan, Pranay Kr. Lohia, Srikanta Bedathur

arXiv_CV

arXiv_CV Image_Caption Caption Survey
Abstract

Selecting the appropriate visual presentation of the data such that it preserves the semantics of the underlying data and at the same time provides an intuitive summary of the data is an important, and often the final, step of data analytics. Unfortunately, this is also a step involving significant human effort, starting from the selection of groups of columns in the structured results from analytics stages, to the selection of the right visualization by experimenting with various alternatives. In this paper, we describe our DataVizard system aimed at reducing this overhead by automatically recommending the most appropriate visual presentation for the structured result. Specifically, we consider the following two scenarios: first, when one needs to visualize the results of a structured query such as SQL; and second, when one has acquired a data table with an associated short description (e.g., tables from the Web). Using a corpus of real-world database queries (and their results) and a number of statistical tables crawled from the Web, we show that DataVizard is capable of recommending visual presentations with high accuracy. We also present the results of a user survey that we conducted in order to assess user views of the suitability of the presented charts vis-a-vis the plain text captions of the data.

URL

https://arxiv.org/abs/1711.04971

PDF

https://arxiv.org/pdf/1711.04971
Read All
Sobolev GAN

2017-11-14

Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, Yu Cheng

arXiv_CV

arXiv_CV Regularization Adversarial GAN Text_Generation
Abstract

We propose a new Integral Probability Metric (IPM) between distributions: the Sobolev IPM. The Sobolev IPM compares the mean discrepancy of two distributions for functions (critic) restricted to a Sobolev ball defined with respect to a dominant measure $\mu$. We show that the Sobolev IPM compares two distributions in high dimensions based on weighted conditional Cumulative Distribution Functions (CDF) of each coordinate on a leave-one-out basis. The dominant measure $\mu$ plays a crucial role as it defines the support on which conditional CDFs are compared. Sobolev IPM can be seen as an extension of the one-dimensional Cramér-von Mises statistic to high-dimensional distributions. We show how Sobolev IPM can be used to train Generative Adversarial Networks (GANs). We then exploit the intrinsic conditioning implied by Sobolev IPM in text generation. Finally we show that a variant of Sobolev GAN achieves competitive results in semi-supervised learning on CIFAR-10, thanks to the smoothness enforced on the critic by Sobolev GAN, which relates to Laplacian regularization.

URL

https://arxiv.org/abs/1711.04894

PDF

https://arxiv.org/pdf/1711.04894
Read All
Simple And Efficient Architecture Search for Convolutional Neural Networks

2017-11-13

Thomas Elsken, Jan-Hendrik Metzen, Frank Hutter

arXiv_CV

arXiv_CV NAS CNN Optimization
Abstract

Neural networks have recently had a lot of success for many tasks. However, neural network architectures that perform well are still typically designed manually by experts in a cumbersome trial-and-error process. We propose a new method to automatically search for well-performing CNN architectures based on a simple hill climbing procedure whose operators apply network morphisms, followed by short optimization runs by cosine annealing. Surprisingly, this simple method yields competitive results, despite only requiring resources in the same order of magnitude as training a single network. E.g., on CIFAR-10, our method designs and trains networks with an error rate below 6% in only 12 hours on a single GPU; training for one day reduces this error further, to almost 5%.
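
The "short optimization runs by cosine annealing" refer to a learning-rate schedule that decays along half a cosine wave. A minimal sketch of the standard schedule formula follows; the function name and the example values are assumptions, and the paper's exact settings are not given in the abstract.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step` of a run lasting `total_steps` steps.

    Starts at lr_max, ends at lr_min, following half a cosine period.
    """
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical short run of 10 steps starting from a learning rate of 0.05.
schedule = [cosine_annealing_lr(t, 10, lr_max=0.05) for t in range(11)]
```

The rate falls slowly at first, fastest in the middle of the run, and flattens out near the end, which suits short training bursts after each architecture change.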

URL

https://arxiv.org/abs/1711.04528

PDF

https://arxiv.org/pdf/1711.04528
Read All
Word, Subword or Character? An Empirical Study of Granularity in Chinese-English NMT

2017-11-13

Yining Wang, Long Zhou, Jiajun Zhang, Chengqing Zong

arXiv_CL

arXiv_CL Knowledge NMT
Abstract

Neural machine translation (NMT), a new approach to machine translation, has been shown to outperform conventional statistical machine translation (SMT) across a variety of language pairs. Translation is an open-vocabulary problem, but most existing NMT systems operate with a fixed vocabulary, which leaves them unable to translate rare words. This problem can be alleviated by using different translation granularities, such as character, subword and hybrid word-character. Translation involving Chinese is one of the most difficult tasks in machine translation; however, to the best of our knowledge, there has not been any other work exploring which translation granularity is most suitable for Chinese in NMT. In this paper, we conduct an extensive comparison using Chinese-English NMT as a case study. Furthermore, we discuss the advantages and disadvantages of various translation granularities in detail. Our experiments show that the subword model performs best for Chinese-to-English translation when the vocabulary is not too large, while the hybrid word-character model is most suitable for English-to-Chinese translation. Moreover, experiments with different granularities show that the Hybrid_BPE method can achieve the best result on the Chinese-to-English translation task.

URL

https://arxiv.org/abs/1711.04457

PDF

https://arxiv.org/pdf/1711.04457
Read All
MODNet: Moving Object Detection Network with Motion and Appearance for Autonomous Driving

2017-11-12

Mennatullah Siam, Heba Mahgoub, Mohamed Zahran, Senthil Yogamani, Martin Jagersand, Ahmad El-Sallab

arXiv_CV

arXiv_CV Object_Detection Segmentation Detection
Abstract

We propose a novel multi-task learning system that combines appearance and motion cues for a better semantic reasoning of the environment. A unified architecture for joint vehicle detection and motion segmentation is introduced. In this architecture, a two-stream encoder is shared among both tasks. In order to evaluate our method in an autonomous driving setting, KITTI annotated sequences with detection and odometry ground truth are used to automatically generate static/dynamic annotations on the vehicles. This dataset is called KITTI Moving Object Detection dataset (KITTI MOD). The dataset will be made publicly available to act as a benchmark for the motion detection task. Our experiments show that the proposed method outperforms state-of-the-art methods that use only motion cues by 21.5% in mAP on KITTI MOD. Our method performs on par with the state-of-the-art unsupervised methods on the DAVIS benchmark for generic object segmentation. One of our interesting conclusions is that joint training of motion segmentation and vehicle detection benefits motion segmentation. Motion segmentation has relatively little data, unlike the detection task. However, the shared fusion encoder benefits from joint training to learn a generalized representation. The proposed method runs in 120 ms per frame, which beats the state-of-the-art motion detection/segmentation methods in computational efficiency.

URL

https://arxiv.org/abs/1709.04821

PDF

https://arxiv.org/pdf/1709.04821
Read All
High-Order Attention Models for Visual Question Answering

2017-11-12

Idan Schwartz, Alexander G. Schwing, Tamir Hazan

arXiv_CV

arXiv_CV QA Attention Relation VQA
Abstract

The quest for algorithms that enable cognitive abilities is an important part of machine learning. A common trait in many recently investigated cognitive-like tasks is that they take into account different data modalities, such as visual and textual input. In this paper we propose a novel and generally applicable form of attention mechanism that learns high-order correlations between various data modalities. We show that high-order correlations effectively direct the appropriate attention to the relevant elements in the different data modalities that are required to solve the joint task. We demonstrate the effectiveness of our high-order attention mechanism on the task of visual question answering (VQA), where we achieve state-of-the-art performance on the standard VQA dataset.

URL

https://arxiv.org/abs/1711.04323

PDF

https://arxiv.org/pdf/1711.04323
Read All
Phrase-based Image Captioning with Hierarchical LSTM Model

2017-11-11

Ying Hua Tan, Chee Seng Chan

arXiv_CV

arXiv_CV Image_Caption Caption Inference RNN
Abstract

Automatic generation of captions to describe the content of an image has been gaining a lot of research interest recently, with most existing works treating the image caption as pure sequential data. Natural language, however, possesses a temporal hierarchy structure, with complex dependencies between subsequences. In this paper, we propose a phrase-based hierarchical Long Short-Term Memory (phi-LSTM) model to generate image descriptions. In contrast to the conventional solutions that generate captions in a purely sequential manner, our proposed model decodes the image caption from phrase to sentence. It consists of a phrase decoder at the bottom hierarchy to decode noun phrases of variable length, and an abbreviated sentence decoder at the upper hierarchy to decode an abbreviated form of the image description. A complete image caption is formed by combining the generated phrases with the sentence during the inference stage. Empirically, our proposed model shows a better or competitive result on the Flickr8k, Flickr30k and MS-COCO datasets in comparison to the state-of-the-art models. We also show that our proposed model is able to generate more novel captions (not seen in the training data), which are richer in word content, on all three datasets.

URL

https://arxiv.org/abs/1711.05557

PDF

https://arxiv.org/pdf/1711.05557
Read All
Quantized Memory-Augmented Neural Networks

2017-11-10

Seongsik Park, Seijoon Kim, Seil Lee, Ho Bae, Sungroh Yoon

arXiv_CV

arXiv_CV CNN Inference RNN Memory_Networks
Abstract

Memory-augmented neural networks (MANNs) refer to a class of neural network models equipped with external memory (such as neural Turing machines and memory networks). These neural networks outperform conventional recurrent neural networks (RNNs) in terms of learning long-term dependency, allowing them to solve intriguing AI tasks that would otherwise be hard to address. This paper concerns the problem of quantizing MANNs. Quantization is known to be effective when we deploy deep models on embedded systems with limited resources. Furthermore, quantization can substantially reduce the energy consumption of the inference procedure. These benefits justify recent developments of quantized multilayer perceptrons, convolutional networks, and RNNs. However, no prior work has reported the successful quantization of MANNs. The in-depth analysis presented here reveals various challenges that do not appear in the quantization of the other networks. Without addressing them properly, quantized MANNs would normally suffer from excessive quantization error, which leads to degraded performance. In this paper, we identify memory addressing (specifically, content-based addressing) as the main reason for the performance degradation and propose a robust quantization method for MANNs to address the challenge. In our experiments, we achieved a computation-energy gain of 22x with 8-bit fixed-point and binary quantization compared to the floating-point implementation. Measured on the bAbI dataset, the resulting model, named the quantized MANN (Q-MANN), improved the error rate by 46% and 30% with 8-bit fixed-point and binary quantization, respectively, compared to the MANN quantized using conventional techniques.
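
For readers unfamiliar with the two number formats mentioned, the sketch below shows generic 8-bit fixed-point rounding and sign binarization of a single value. It illustrates conventional quantization only, not the Q-MANN method; the function names and the split into 4 fractional bits are assumptions.

```python
def quantize_fixed_point(x, frac_bits=4, total_bits=8):
    """Round x to the nearest signed fixed-point value and saturate.

    frac_bits:  number of bits after the binary point (assumed split).
    total_bits: total word length including the sign bit.
    """
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    # Clamp the integer code so out-of-range inputs saturate instead of wrapping.
    code = max(q_min, min(q_max, round(x * scale)))
    return code / scale


def binarize(x):
    """Binary quantization: keep only the sign, as +1.0 or -1.0."""
    return 1.0 if x >= 0 else -1.0


# Hypothetical weight value: 0.3 is not representable and moves to 0.3125.
approx = quantize_fixed_point(0.3)
```

The gap between `0.3` and `0.3125` is the quantization error; the abstract argues that such errors are particularly harmful in content-based memory addressing.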

URL

https://arxiv.org/abs/1711.03712

PDF

https://arxiv.org/pdf/1711.03712
Read All
Deep Neural Networks to Enable Real-time Multimessenger Astrophysics

2017-11-09

Daniel George, E. A. Huerta

arXiv_CV

arXiv_CV Object_Detection CNN Transfer_Learning Classification Detection
Abstract

Gravitational wave astronomy has set in motion a scientific revolution. To further enhance the science reach of this emergent field, there is a pressing need to increase the depth and speed of the gravitational wave algorithms that have enabled these groundbreaking discoveries. To contribute to this effort, we introduce Deep Filtering, a new highly scalable method for end-to-end time-series signal processing, based on a system of two deep convolutional neural networks, which we designed for classification and regression to rapidly detect and estimate parameters of signals in highly noisy time-series data streams. We demonstrate a novel training scheme with gradually increasing noise levels, and a transfer learning procedure between the two networks. We showcase the application of this method for the detection and parameter estimation of gravitational waves from binary black hole mergers. Our results indicate that Deep Filtering significantly outperforms conventional machine learning techniques, achieves similar performance compared to matched-filtering while being several orders of magnitude faster thus allowing real-time processing of raw big data with minimal resources. More importantly, Deep Filtering extends the range of gravitational wave signals that can be detected with ground-based gravitational wave detectors. This framework leverages recent advances in artificial intelligence algorithms and emerging hardware architectures, such as deep-learning-optimized GPUs, to facilitate real-time searches of gravitational wave sources and their electromagnetic and astro-particle counterparts.

URL

https://arxiv.org/abs/1701.00008

PDF

https://arxiv.org/pdf/1701.00008
Read All
Bayesian GAN

2017-11-08

Yunus Saatchi, Andrew Gordon Wilson

arXiv_CV

arXiv_CV Adversarial GAN Quantitative
Abstract

Generative adversarial networks (GANs) can implicitly learn rich distributions over images, audio, and data which are hard to model with an explicit likelihood. We present a practical Bayesian formulation for unsupervised and semi-supervised learning with GANs. Within this framework, we use stochastic gradient Hamiltonian Monte Carlo to marginalize the weights of the generator and discriminator networks. The resulting approach is straightforward and obtains good performance without any standard interventions such as feature matching, or mini-batch discrimination. By exploring an expressive posterior over the parameters of the generator, the Bayesian GAN avoids mode-collapse, produces interpretable and diverse candidate samples, and provides state-of-the-art quantitative results for semi-supervised learning on benchmarks including SVHN, CelebA, and CIFAR-10, outperforming DCGAN, Wasserstein GANs, and DCGAN ensembles.

URL

https://arxiv.org/abs/1705.09558

PDF

https://arxiv.org/pdf/1705.09558
Read All
Accelerating Neural Architecture Search using Performance Prediction

2017-11-08

Bowen Baker, Otkrist Gupta, Ramesh Raskar, Nikhil Naik

arXiv_CV

arXiv_CV NAS Reinforcement_Learning Optimization Classification Language_Model Prediction
Abstract

Methods for neural network hyperparameter optimization and meta-modeling are computationally expensive due to the need to train a large number of model configurations. In this paper, we show that standard frequentist regression models can predict the final performance of partially trained model configurations using features based on network architectures, hyperparameters, and time-series validation performance data. We empirically show that our performance prediction models are much more effective than prominent Bayesian counterparts, are simpler to implement, and are faster to train. Our models can predict final performance in both visual classification and language modeling domains, are effective for predicting performance of drastically varying model architectures, and can even generalize between model classes. Using these prediction models, we also propose an early stopping method for hyperparameter optimization and meta-modeling, which obtains a speedup of a factor up to 6x in both hyperparameter optimization and meta-modeling. Finally, we empirically show that our early stopping method can be seamlessly incorporated into both reinforcement learning-based architecture selection algorithms and bandit based search methods. Through extensive experimentation, we empirically show our performance prediction models and early stopping algorithm are state-of-the-art in terms of prediction accuracy and speedup achieved while still identifying the optimal model configurations.

URL

https://arxiv.org/abs/1705.10823

PDF

https://arxiv.org/pdf/1705.10823
Read All
Intriguing Properties of Adversarial Examples

2017-11-08

Ekin D. Cubuk, Barret Zoph, Samuel S. Schoenholz, Quoc V. Le

arXiv_CV

arXiv_CV Adversarial NAS Reinforcement_Learning Prediction
Abstract

It is becoming increasingly clear that many machine learning classifiers are vulnerable to adversarial examples. In attempting to explain the origin of adversarial examples, previous studies have typically focused on the fact that neural networks operate on high dimensional data, they overfit, or they are too linear. Here we argue that the origin of adversarial examples is primarily due to an inherent uncertainty that neural networks have about their predictions. We show that the functional form of this uncertainty is independent of architecture, dataset, and training protocol; and depends only on the statistics of the logit differences of the network, which do not change significantly during training. This leads to adversarial error having a universal scaling, as a power-law, with respect to the size of the adversarial perturbation. We show that this universality holds for a broad range of datasets (MNIST, CIFAR10, ImageNet, and random data), models (including state-of-the-art deep networks, linear models, adversarially trained networks, and networks trained on randomly shuffled labels), and attacks (FGSM, step l.l., PGD). Motivated by these results, we study the effects of reducing prediction entropy on adversarial robustness. Finally, we study the effect of network architectures on adversarial sensitivity. To do this, we use neural architecture search with reinforcement learning to find adversarially robust architectures on CIFAR10. Our resulting architecture is more robust to white and black box attacks compared to previous attempts.
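
As a reminder of what the simplest attack in that list does, here is a sketch of the fast gradient sign method (FGSM) applied to a linear classifier with logistic loss, one of the model families the abstract mentions. The function names, the label convention (+1 or -1) and the example numbers are assumptions made for illustration.

```python
import math


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm_linear(x, y, w, b, epsilon):
    """One FGSM step against a linear classifier with logistic loss.

    x: input features; y: true label in {-1, +1}; w, b: model parameters;
    epsilon: perturbation size per feature.
    """
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    # Gradient of log(1 + exp(-margin)) w.r.t. x is -y * sigmoid(-margin) * w.
    coeff = -y / (1.0 + math.exp(margin))
    grad = [coeff * wi for wi in w]
    # Move every feature by epsilon in the direction that increases the loss.
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical model and input.
x_adv = fgsm_linear([0.5, 0.5, 0.5], y=1, w=[1.0, -2.0, 0.0], b=0.0, epsilon=0.1)
```

Features with zero weight are left unchanged, since the loss does not depend on them.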

URL

https://arxiv.org/abs/1711.02846

PDF

https://arxiv.org/pdf/1711.02846
Read All
High-Speed Gate Driver Using GaN HEMTs for 20-MHz Hard Switching of SiC MOSFETs

2017-11-08

Takafumi Okuda, Takashi Hikihara

arXiv_CV

arXiv_CV GAN
Abstract

In this paper, we investigated a gate driver using a GaN HEMT push-pull configuration for the high-frequency hard switching of a SiC power MOSFET. Low on-resistance and low input capacitance of GaN HEMTs are suitable for a high-frequency gate driver from the logic level, and robustness of SiC MOSFET with high avalanche capability is suitable for a valve transistor in power converters. Our proposed gate driver consists of digital isolators, complementary Si MOSFETs, and GaN HEMTs. The GaN HEMT push-pull stage has a high driving capability owing to its superior switching characteristics, and complementary Si MOSFETs can enhance the control signal from the digital isolator. We investigated limiting factors of the switching frequency of the proposed gate driver by focusing on each circuit component and proposed an improved driving configuration for the gate driver. As a result, 20-MHz hard switching of a SiC MOSFET was achieved using the improved gate driver with GaN HEMTs.

URL

https://arxiv.org/abs/1711.02832

PDF

https://arxiv.org/pdf/1711.02832
Read All
Bayesian Scale Estimation for Monocular SLAM Based on Generic Object Detection for Correcting Scale Drift

2017-11-07

Edgar Sucar, Jean-Bernard Hayet

arXiv_CV

arXiv_CV Object_Detection Quantitative Detection SLAM
Abstract

This work proposes a new, online algorithm for estimating the local scale correction to apply to the output of a monocular SLAM system, in order to obtain a metric reconstruction of the 3D map and of the camera trajectory that is as faithful as possible. Within a Bayesian framework, it integrates observations from a deep-learning based generic object detector and a prior on the evolution of the scale drift. For each observation class, a predefined prior on the heights of the class objects is used. This allows us to define the observation likelihood. Due to the scale drift inherent to monocular SLAM systems, we integrate a rough model of the dynamics of scale drift. Quantitative evaluations of the system are presented on the KITTI dataset, and compared with different approaches. The results show a superior performance of our proposal in terms of relative translational error when compared to other monocular systems.

URL

https://arxiv.org/abs/1711.02768

PDF

https://arxiv.org/pdf/1711.02768
Read All
Surface-enhanced Raman scattering of graphene caused by self-induced nanogating by GaN nanowire array

2017-11-07

Jakub Kierdaszuk, Piotr Kaźmierczak, Rafał Bożek, Justyna Grzonka, Aleksandra Krajewska, Zbigniew R. Zytkiewicz, Marta Sobanska, Kamil Klosek, Agnieszka Wołoś, Maria Kamińska, Andrzej Wysmołek, Aneta Drabińska

arXiv_CV

arXiv_CV GAN Face Relation
Abstract

Gallium nitride (GaN) nanowires of constant height with graphene deposited on them are shown to give a strong enhancement of Raman scattering, whilst variable-height nanowires fail to give such an enhancement. Scanning electron microscopy reveals a smooth graphene surface when the GaN nanowires are uniform, whereas graphene on nanowires with substantial height differences is observed to be pierced and stretched by the uppermost nanowires. The energy shifts of the characteristic Raman bands confirm that these differences in nanowire height have a significant impact on the local graphene strain and the carrier concentration. The images obtained by Kelvin probe force microscopy show clearly that the carrier concentration in graphene is modulated by the nanowire substrate and depends on the nanowire density. Therefore, the observed surface-enhanced Raman scattering for graphene deposited on GaN nanowires of comparable height is triggered by self-induced nano-gating of the graphene. However, no clear correlation of the enhancement with the strain or the carrier concentration of graphene was discovered.

URL

https://arxiv.org/abs/1709.04908

PDF

https://arxiv.org/pdf/1709.04908

Theoretical limitations of Encoder-Decoder GAN architectures

2017-11-07

Sanjeev Arora, Andrej Risteski, Yi Zhang

arXiv_CV

arXiv_CV GAN Inference
Abstract

Encoder-decoder GAN architectures (e.g., BiGAN and ALI) seek to add an inference mechanism to the GAN setup, consisting of a small encoder deep net that maps data-points to their succinct encodings. The intuition is that being forced to train an encoder alongside the usual generator forces the system to learn meaningful mappings from the code to the data-point and vice versa, which should improve the learning of the target distribution and ameliorate mode collapse. It should also yield meaningful codes that are useful as features for downstream tasks. The current paper shows rigorously that even on real-life distributions of images, the encoder-decoder GAN training objectives (a) cannot prevent mode collapse, i.e., the objective can be near-optimal even when the generated distribution has low and finite support, and (b) cannot prevent learning meaningless codes for data, essentially white noise. Thus if encoder-decoder GANs do indeed work, then it must be due to reasons as yet not understood, since the training objective can be low even for meaningless solutions.

URL

https://arxiv.org/abs/1711.02651

PDF

https://arxiv.org/pdf/1711.02651

VEEGAN: Reducing Mode Collapse in GANs using Implicit Variational Learning

2017-11-06

Akash Srivastava, Lazar Valkov, Chris Russell, Michael U. Gutmann, Charles Sutton

arXiv_CV

arXiv_CV Adversarial GAN
Abstract

Deep generative models provide powerful tools for distributions over complicated manifolds, such as those of natural images. But many of these methods, including generative adversarial networks (GANs), can be difficult to train, in part because they are prone to mode collapse, which means that they characterize only a few modes of the true distribution. To address this, we introduce VEEGAN, which features a reconstructor network, reversing the action of the generator by mapping from data to noise. Our training objective retains the original asymptotic consistency guarantee of GANs, and can be interpreted as a novel autoencoder loss over the noise. In sharp contrast to a traditional autoencoder over data points, VEEGAN does not require specifying a loss function over the data, but rather only over the representations, which are standard normal by assumption. On an extensive set of synthetic and real world image datasets, VEEGAN indeed resists mode collapsing to a far greater extent than other recent GAN variants, and produces more realistic samples.
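
The "autoencoder loss over the noise" can be illustrated with a toy sketch. It shows only a noise-reconstruction term: draw noise z, map it to data with a generator, map it back with a reconstructor, and penalize the squared error in noise space. The affine generator, the squared-error distance, and all function names are assumptions chosen for illustration, and the adversarial part of the VEEGAN objective is not shown.

```python
import random


def generator(z: float, a: float = 2.0, b: float = 1.0) -> float:
    # Toy generator (assumption): an affine map from noise to data.
    return a * z + b


def reconstructor(x: float, a: float = 2.0, b: float = 1.0) -> float:
    # Toy reconstructor: the exact inverse of the toy generator above.
    return (x - b) / a


def noise_reconstruction_loss(zs, gen, rec) -> float:
    # Mean squared error measured in noise space, not in data space.
    return sum((z - rec(gen(z))) ** 2 for z in zs) / len(zs)


rng = random.Random(0)
zs = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # standard normal noise

# A reconstructor that inverts the generator gives a loss near zero.
print(noise_reconstruction_loss(zs, generator, reconstructor))
# A reconstructor that ignores its input cannot recover the noise.
print(noise_reconstruction_loss(zs, generator, lambda x: 0.0))
```

The second loss is roughly the variance of the noise, which shows why a reconstructor that carries no information about z is penalized even though no loss over the data itself was ever specified.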

URL

https://arxiv.org/abs/1705.07761

PDF

https://arxiv.org/pdf/1705.07761

A General Neural Network Hardware Architecture on FPGA

2017-11-06

Yufeng Hao

arXiv_CV

arXiv_CV Recognition
Abstract

Field Programmable Gate Arrays (FPGAs) play an increasingly important role in the data sampling and processing industries due to their highly parallel architecture, low power consumption, and flexibility in custom algorithms. In the artificial intelligence field especially, training and implementing neural networks and machine learning algorithms demands highly energy-efficient hardware implementations and massively parallel computing capacity. Therefore, many global companies have applied FPGAs to AI and machine learning fields such as autonomous driving and automatic spoken language recognition (Baidu) [1] [2] and Bing search (Microsoft) [3]. Considering the great potential of FPGAs in these fields, we aim to implement a general neural network hardware architecture on the XILINX ZU9CG System On Chip (SOC) platform [4], which contains abundant hardware resources and powerful processing capacity. The general neural network architecture on the FPGA SOC platform can perform forward and backward algorithms in deep neural networks (DNN) with high performance and can easily be adjusted according to the type and scale of the neural networks.

URL

https://arxiv.org/abs/1711.05860

PDF

https://arxiv.org/pdf/1711.05860

Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training

2017-11-06

Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, Bernt Schiele

arXiv_CV

arXiv_CV Image_Caption Adversarial Caption
Abstract

While strong progress has been made in image captioning over the last few years, machine and human captions are still quite distinct. A closer look reveals that this is due to deficiencies in the generated word distribution, vocabulary size, and a strong bias in the generators towards frequent captions. Furthermore, humans – rightfully so – generate multiple, diverse captions, due to the inherent ambiguity in the captioning task, which is not considered in today’s systems. To address these challenges, we change the training objective of the caption generator from reproducing groundtruth captions to generating a set of captions that is indistinguishable from human-generated captions. Instead of handcrafting such a learning target, we employ adversarial training in combination with an approximate Gumbel sampler to implicitly match the generated distribution to the human one. While our method achieves comparable performance to the state-of-the-art in terms of the correctness of the captions, we generate a set of diverse captions that are significantly less biased and match the word statistics better in several aspects.
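
One common form of approximate Gumbel sampling is the Gumbel-softmax relaxation, sketched below in plain Python. The abstract does not say that the paper uses exactly this form, so treat the function names, the temperature value, and the three-word "vocabulary" of logits as illustrative assumptions. Gumbel noise is added to the word logits and a temperature-scaled softmax turns the result into a soft, nearly one-hot sample, which is what lets gradients pass from a discriminator back into a discrete caption generator.

```python
import math
import random


def sample_gumbel(rng: random.Random) -> float:
    # Clamp u away from 0 and 1 so both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, temperature: float, rng: random.Random):
    # Perturb each logit with Gumbel noise, then apply a tempered softmax.
    perturbed = [(l + sample_gumbel(rng)) / temperature for l in logits]
    m = max(perturbed)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
word_logits = [2.0, 0.5, -1.0]  # hypothetical scores for three words
soft_sample = gumbel_softmax(word_logits, temperature=0.5, rng=rng)
print(soft_sample)
```

Lower temperatures push the output closer to a one-hot vector, at the cost of noisier gradients; higher temperatures give smoother but less discrete samples.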

URL

https://arxiv.org/abs/1703.10476

PDF

https://arxiv.org/pdf/1703.10476