Knowledge Graph Embedding (KGE) aims to represent the entities and relations of a knowledge graph in a low-dimensional continuous vector space. Recent works focus on combining structural knowledge with additional information, such as entity descriptions and relation paths. However, commonly used additional information usually contains substantial noise, which makes it hard to learn valuable representations. In this paper, we propose a new kind of additional information, called entity neighbors, which contains both semantic and topological features about a given entity. We then develop a deep memory network model to encode information from neighbors. Employing a gating mechanism, the representations of structure and neighbors are integrated into a joint representation. The experimental results show that our model outperforms existing KGE methods that utilize entity descriptions and achieves state-of-the-art metrics on four datasets.
https://arxiv.org/abs/1808.03752
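As a rough illustration of the gating mechanism in the abstract above: a learned sigmoid gate can interpolate, per dimension, between the structure-based and neighbor-based embeddings. The module below is a minimal PyTorch sketch; the class name and exact parameterization are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Minimal sketch: fuse a structural embedding with a neighbor-based
    embedding through a learned sigmoid gate (hypothetical module; the
    paper's exact parameterization may differ)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, e_struct, e_neigh):
        # g in (0, 1) decides, per dimension, how much each source contributes
        g = torch.sigmoid(self.gate(torch.cat([e_struct, e_neigh], dim=-1)))
        return g * e_struct + (1 - g) * e_neigh
```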
We replicate a variation of the image captioning architecture by Vinyals et al. (2015), then introduce dropout during inference mode to simulate the effects of neurodegenerative diseases like Alzheimer’s disease (AD) and Wernicke’s aphasia (WA). We evaluate the effects of dropout on language production by measuring the KL-divergence of word frequency distributions and other linguistic metrics as dropout is added. We find that the generated sentences most closely approximate the word frequency distribution of the training corpus when using a moderate dropout of 0.4 during inference.
https://arxiv.org/abs/1808.03747
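In PyTorch, keeping dropout active at inference time (as done above to simulate neurodegeneration) can be achieved by switching only the Dropout modules back to training mode after `model.eval()`. A minimal sketch, assuming a standard `nn.Dropout`-based captioning model:

```python
import torch.nn as nn

def enable_inference_dropout(model: nn.Module, p: float = 0.4) -> nn.Module:
    """Keep dropout active at inference time (sketch)."""
    model.eval()  # freeze batch-norm statistics, etc.
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = p      # e.g. the moderate 0.4 rate reported above
            m.train()    # Dropout masks inputs only in training mode
    return model
```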
Ancient Chinese embodies the wisdom and spiritual culture of the Chinese nation. Automatic translation from ancient Chinese to modern Chinese helps to inherit and carry forward the quintessence of the ancients. In this paper, we propose an Ancient-Modern Chinese clause alignment approach and apply it to create a large-scale Ancient-Modern Chinese parallel corpus which contains about 1.24M bilingual pairs. To the best of our knowledge, this is the first large high-quality Ancient-Modern Chinese dataset. Furthermore, we train SMT and various NMT-based models on this dataset and provide a strong baseline for this task.
https://arxiv.org/abs/1808.03738
We study the native charge compensation effect in Mg-doped GaN nanorods (NRs), grown by Plasma Assisted Molecular Beam Epitaxy (PAMBE), using Raman, photoluminescence (PL) and X-ray photoelectron (XPS) spectroscopies. The XPS valence band analysis shows that upon Mg incorporation the $E_F$-$E_{VBM}$ separation reduces, suggesting compensation of the native n-type character of the GaN NRs. Raman spectroscopic studies on these samples reveal that the line shape of the longitudinal phonon plasmon (LPP) coupled mode is sensitive to the Mg concentration and hence to the background n-type carrier density. We estimate a two-order-of-magnitude compensation of the native charge in GaN NRs upon Mg doping at a concentration of 10$^{19}$-10$^{20}$ atoms cm$^{-3}$. Room temperature (RT) PL measurements and our previous electronic structure calculations are used to identify the atomistic origin of this compensation effect.
https://arxiv.org/abs/1807.01121
We prove one direction of a recently posed conjecture by Gan-Gross-Prasad, which predicts the branching laws that govern restriction from p-adic $GL_n$ to $GL_{n-1}$ of irreducible smooth representations within the Arthur-type class. We extend this prediction to the full class of unitarizable representations, by exhibiting a combinatorial relation that must be satisfied for any pair of irreducible representations, in which one appears as a quotient of the restriction of the other. We settle the full conjecture for the cases in which either one of the representations in the pair is generic. The method of proof involves a transfer of the problem, using the Bernstein decomposition and the quantum affine Schur-Weyl duality, into the realm of quantum affine algebras. This restatement of the problem allows for an application of the combined power of a result of Hernandez on cyclic modules together with the Lapid-Minguez criterion from the p-adic setting.
https://arxiv.org/abs/1808.02640
Speaking rate refers to the average number of phonemes within some unit time, while rhythmic patterns refer to duration distributions for realizations of different phonemes within different phonetic structures. Both are key components of prosody in speech and differ across speakers. Models like the cycle-consistent adversarial network (Cycle-GAN) and the variational auto-encoder (VAE) have been successfully applied to voice conversion tasks without parallel data. However, due to the neural network architectures and feature vectors chosen for these approaches, the length of the predicted utterance has to be fixed to that of the input utterance, which limits the flexibility in mimicking the speaking rates and rhythmic patterns of the target speaker. On the other hand, sequence-to-sequence learning models have been used to remove the above length constraint, but they need parallel training data. In this paper, we propose an approach utilizing a sequence-to-sequence model trained with an unsupervised Cycle-GAN to perform the transformation between phoneme posteriorgram sequences for different speakers. In this way, the length constraint mentioned above is removed, offering rhythm-flexible voice conversion without requiring parallel data. Preliminary evaluation on two datasets showed very encouraging results.
https://arxiv.org/abs/1808.03113
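The unpaired training signal behind Cycle-GAN-style conversion is the cycle-consistency loss: mapping speaker A's posteriorgram sequence to B and back should reconstruct the input. A minimal sketch, assuming generators `G_ab`/`G_ba` that return sequences padded to the input length (the names and weighting are illustrative, and adversarial losses are omitted):

```python
import torch.nn.functional as F

def cycle_consistency_loss(G_ab, G_ba, x_a, x_b, lam=10.0):
    """Sketch of the Cycle-GAN reconstruction term over posteriorgram
    sequences; combined in practice with adversarial losses."""
    loss_a = F.l1_loss(G_ba(G_ab(x_a)), x_a)  # A -> B -> A
    loss_b = F.l1_loss(G_ab(G_ba(x_b)), x_b)  # B -> A -> B
    return lam * (loss_a + loss_b)
```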
This paper presents an efficient object detection method for satellite imagery. Among a number of machine learning algorithms, we propose a combination of two convolutional neural networks (CNNs) aimed at high precision and high recall, respectively. We validated our models using golf courses as target objects. The proposed deep learning method demonstrated higher accuracy than previous object identification methods.
https://arxiv.org/abs/1808.02996
We consider the problem of online path planning for joint detection and tracking of multiple unknown radio-tagged objects. This is a necessary task for gathering spatio-temporal information using UAVs with on-board sensors in a range of monitoring applications. In this paper, we propose an online path planning algorithm with joint detection and tracking for this problem, because signal measurements from these objects are inherently noisy. We derive a partially observable Markov decision process with a random finite set track-before-detect (TBD) multi-object filter. We show that, in practice, the likelihood function of raw signals received by the UAV, transformed into the time-frequency domain, is separable for multiple radio-tagged objects and results in a numerically efficient multi-object TBD filter. We derive a TBD filter with a jump Markov system to accommodate maneuvering objects capable of switching between different dynamic modes. Further, we impose a practical constraint using a void probability formulation to maintain a safe distance between the UAV and objects of interest. Our evaluations demonstrate the capability of our approach to handle multiple radio-tagged object behaviors such as birth, death, and motion-mode switching, as well as the superiority of the proposed online planning method with the TBD-based filter at tracking and detecting objects compared to its detection-based counterpart, especially in low signal-to-noise-ratio environments.
https://arxiv.org/abs/1808.04445
Individual neurons in convolutional neural networks supervised for image-level classification tasks have been shown to implicitly learn semantically meaningful concepts ranging from simple textures and shapes to whole or partial objects - forming a “dictionary” of concepts acquired through the learning process. In this work we introduce a simple, efficient zero-shot learning approach based on this observation. Our approach, which we call Neuron Importance-Aware Weight Transfer (NIWT), learns to map domain knowledge about novel “unseen” classes onto this dictionary of learned concepts and then optimizes for network parameters that can effectively combine these concepts - essentially learning classifiers by discovering and composing learned semantic concepts in deep networks. Our approach shows improvements over previous approaches on the CUBirds and AWA2 generalized zero-shot learning benchmarks. We demonstrate our approach on a diverse set of semantic inputs as external domain knowledge, including attributes and natural language captions. Moreover, by learning inverse mappings, NIWT can provide visual and textual explanations for the predictions made by the newly learned classifiers and provide neuron names. Our code is available at this https URL.
https://arxiv.org/abs/1808.02861
In this paper, we describe a tool for debugging the output and attention weights of neural machine translation (NMT) systems and for improved estimations of confidence about the output based on the attention. The purpose of the tool is to help researchers and developers find weak and faulty example translations that their NMT systems produce without the need for reference translations. Our tool also includes an option to directly compare translation outputs from two different NMT engines or experiments. In addition, we present a demo website of our tool with examples of good and bad translations: this http URL
https://arxiv.org/abs/1808.02733
In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse the high-level textual and visual features from the neural network and abandon the visual spatial information when learning multi-modal features. To address these problems, question-guided kernels generated from the input question are designed to convolve with visual features, capturing the textual and visual relationship at an early stage. The question-guided convolution tightly couples the textual and visual information, but also introduces more parameters when learning kernels. We apply group convolution, which consists of question-independent kernels and question-dependent kernels, to reduce the parameter size and alleviate over-fitting. The hybrid convolution can generate discriminative multi-modal features with fewer parameters. The proposed approach is also complementary to existing bilinear pooling fusion and attention based VQA methods. By integrating with them, our method can further boost performance. Extensive experiments on public VQA datasets validate the effectiveness of QGHC.
https://arxiv.org/abs/1808.02632
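A rough sketch of the hybrid convolution described above: part of the output channels come from ordinary question-independent kernels, and part from kernels predicted on the fly from the question embedding. For clarity a single question vector (batch size 1) is assumed, and all shapes and names are illustrative rather than the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedConv(nn.Module):
    """Hypothetical sketch of question-guided hybrid convolution."""
    def __init__(self, q_dim, in_ch, out_ch, k=3, dep_out=8):
        super().__init__()
        self.indep = nn.Conv2d(in_ch, out_ch - dep_out, k, padding=k // 2)
        self.kernel_gen = nn.Linear(q_dim, dep_out * in_ch * k * k)
        self.dep_out, self.in_ch, self.k = dep_out, in_ch, k

    def forward(self, feat, q):  # feat: (1, in_ch, H, W), q: (q_dim,)
        # question-dependent kernels, generated from the question embedding
        w = self.kernel_gen(q).view(self.dep_out, self.in_ch, self.k, self.k)
        dep = F.conv2d(feat, w, padding=self.k // 2)
        return torch.cat([self.indep(feat), dep], dim=1)
```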
We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequence data (e.g. a video clip and a language sentence). Our multimodal matching network consists of two key components. First, the Joint Semantic Tensor composes a dense pairwise representation of the two sequences into a 3D tensor. Then, the Convolutional Hierarchical Decoder computes their similarity score by discovering hidden hierarchical matches between the two sequence modalities. Both modules leverage hierarchical attention mechanisms that learn to promote well-matched representation patterns while pruning out misaligned ones in a bottom-up manner. Although JSFusion is a universal model applicable to any multimodal sequence data, this work focuses on video-language tasks, including multimodal retrieval and video QA. We evaluate the JSFusion model on three retrieval and VQA tasks in LSMDC, for which our model achieves the best performance reported so far. We also perform multiple-choice and movie retrieval tasks on the MSR-VTT dataset, on which our approach outperforms many state-of-the-art methods.
https://arxiv.org/abs/1808.02559
We contribute the first large-scale dataset of scene sketches, SketchyScene, with the goal of advancing research on sketch understanding at both the object and scene level. The dataset is created through a novel and carefully designed crowdsourcing pipeline, enabling users to efficiently generate large quantities of realistic and diverse scene sketches. SketchyScene contains more than 29,000 scene-level sketches, 7,000+ pairs of scene templates and photos, and 11,000+ object sketches. All objects in the scene sketches have ground-truth semantic and instance masks. The dataset is also highly scalable and extensible, easily allowing augmentation and/or changes to scene composition. We demonstrate the potential impact of SketchyScene by training new computational models for semantic segmentation of scene sketches and showing how the new dataset enables several applications, including image retrieval, sketch colorization, editing, and captioning. The dataset and code can be found at this https URL.
https://arxiv.org/abs/1808.02473
As astronomers, we are living in an exciting time for the search for other worlds. Recent discoveries have already deeply impacted our vision of planetary formation and architectures. Future bio-signature discoveries will probably deeply impact our scientific and philosophical understanding of life formation and evolution. In that unique perspective, the role of observation is crucial to extend our understanding of the formation and physics of giant planets shaping planetary systems. With the development of high contrast imaging techniques and instruments over more than two decades, vast efforts have been devoted to detecting and characterizing lighter, cooler and closer companions to nearby stars, and ultimately to imaging new planetary systems. Complementary to other planet-hunting techniques, this approach has opened a new astrophysical window to study the physical properties and the formation mechanisms of brown dwarfs and planets. I will briefly review the different observing techniques and strategies used, the main samples of targeted stars, and the key discoveries and surveys, to finally address the main results obtained so far about the physics and the mechanisms of formation and evolution of young giant planets and planetary system architectures.
https://arxiv.org/abs/1808.02454
Object detection and classification in 3D is a key task in Automated Driving (AD). LiDAR sensors are employed to provide the 3D point cloud reconstruction of the surrounding environment, while the task of 3D object bounding box detection in real time remains a strong algorithmic challenge. In this paper, we build on the success of the one-shot regression meta-architecture in the 2D perspective image space and extend it to generate oriented 3D object bounding boxes from LiDAR point clouds. Our main contribution is in extending the loss function of YOLO v2 to include the yaw angle, the 3D box center in Cartesian coordinates and the height of the box as a direct regression problem. This formulation enables real-time performance, which is essential for automated driving. Our results show promising figures on the KITTI benchmark, achieving real-time performance (40 fps) on a Titan X GPU.
https://arxiv.org/abs/1808.02350
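The extension described above amounts to adding the extra 3D terms to the YOLO v2 regression loss. A minimal sketch, assuming a `[x, y, z, w, l, h, yaw]` layout for matched boxes (the layout and plain MSE weighting are assumptions):

```python
import torch.nn.functional as F

def box3d_regression_loss(pred, target):
    """Sketch: regress the 3D centre, box extents and yaw directly,
    as described above (applied only to cells matched to objects)."""
    loc_loss = F.mse_loss(pred[..., :6], target[..., :6])
    yaw_loss = F.mse_loss(pred[..., 6], target[..., 6])  # direct yaw regression
    return loc_loss + yaw_loss
```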
Phase retrieval and the twin-image problem in digital in-line holographic microscopy can be resolved by iterative reconstruction routines. However, recovering the phase properties of an object in a hologram requires the object plane to be chosen correctly for reconstruction. In this work, we present a novel multi-wavelength Gerchberg-Saxton algorithm to determine the object plane using single-shot holograms recorded with multiple wavelengths in an in-line holographic microscope. For micro-sized objects, we verify the object positioning capabilities of the method for various shapes and derive the phase information using synthetic and experimental data. Experimentally, we built a compact digital in-line holographic microscopy setup around a standard optical microscope with a regular RGB-CCD camera and acquired holograms of micro-spheres, E. coli and red blood cells, illuminated using three lasers operating at 491 nm, 532 nm and 633 nm, respectively. We demonstrate that our method provides accurate object plane detection and phase retrieval under noisy conditions, e.g., using low-contrast holograms without background normalization. This method allows for automatic positioning and phase retrieval suitable for holographic particle velocimetry and object tracking in biophysical or colloidal research.
https://arxiv.org/abs/1808.02338
Visual dialog is the task of answering a series of inter-dependent questions about an input image, and often requires resolving visual references among the questions. This problem differs from visual question answering (VQA), which relies on spatial attention (a.k.a. visual grounding) estimated from an image and question pair. We propose a novel attention mechanism that exploits visual attentions in the past to resolve the current reference in the visual dialog scenario. The proposed model is equipped with an associative attention memory storing a sequence of previous (attention, key) pairs. From this memory, the model retrieves the previous attention, taking into account recency, that is most relevant to the current question, in order to resolve potentially ambiguous references. The model then merges the retrieved attention with a tentative one to obtain the final attention for the current question; specifically, we use dynamic parameter prediction to combine the two attentions conditioned on the question. Through extensive experiments on a new synthetic visual dialog dataset, we show that our model significantly outperforms the state-of-the-art (by ~16 percentage points) in situations where visual reference resolution plays an important role. Moreover, the proposed model achieves superior performance (~2 percentage points improvement) on the Visual Dialog dataset, despite having significantly fewer parameters than the baselines.
https://arxiv.org/abs/1709.07992
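The retrieval step of the associative attention memory above can be sketched as soft addressing: score each stored key against the current question key and combine the stored attention maps accordingly. The sketch below omits the recency term and the dynamic-parameter merge; function names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def retrieve_attention(mem_keys, mem_attns, q_key, tau=1.0):
    """Sketch: mem_keys (T, d) and mem_attns (T, H, W) store previous
    (key, attention) pairs; q_key (d,) encodes the current question."""
    scores = mem_keys @ q_key / tau        # relevance of each past step
    w = F.softmax(scores, dim=0)           # soft memory addressing
    return torch.einsum('t,thw->hw', w, mem_attns)
```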
The surface-enhanced Raman scattering (SERS) in graphene deposited on AlxGa1-xN/GaN axial heterostructure nanowires was investigated. The intensity of the graphene Raman spectra was found not to be correlated with aluminium content. Analysis of the graphene Raman band parameters, KPFM and electroreflectance showed a screening of polarization charges. Theoretical calculations showed that the plasmon resonance in graphene is far beyond the Raman spectral range. This excludes an electromagnetic mechanism of SERS and therefore suggests a chemical mechanism of enhancement.
https://arxiv.org/abs/1808.02080
The interplay between advances in stochastic and deterministic algorithms has recently led to the development of interesting new selected configuration interaction (SCI) methods for solving the many-body Schrödinger equation. The performance of these SCI methods can be greatly improved with a second order perturbation theory (PT2) correction, which is often evaluated in a stochastic or hybrid-stochastic manner. In this work, we present a highly efficient, fully deterministic PT2 algorithm for SCI methods and demonstrate that our approach is orders of magnitude faster than recent proposals for stochastic SCI+PT2. We also show that it is important to have a compact reference SCI wave function in order to obtain optimal SCI+PT2 energies. This indicates that it is advantageous to use accurate search algorithms such as ‘ASCI search’ rather than more approximate approaches. Our deterministic PT2 algorithm is based on sorting techniques that have been developed for modern computing architectures and is inherently straightforward to use on parallel computing architectures. Related architectures such as GPU implementations can also be used to further increase the efficiency. Overall, we demonstrate that the algorithms presented in this work allow for efficient evaluation of trillions of PT2 contributions with modest computing resources.
https://arxiv.org/abs/1808.02049
The recently proposed distributional approach to reinforcement learning (DiRL) is centered on learning the distribution of the reward-to-go, often referred to as the value distribution. In this work, we show that the distributional Bellman equation, which drives DiRL methods, is equivalent to a generative adversarial network (GAN) model. In this formulation, DiRL can be seen as learning a deep generative model of the value distribution, driven by the discrepancy between the distribution of the current value, and the distribution of the sum of current reward and next value. We use this insight to propose a GAN-based approach to DiRL, which leverages the strengths of GANs in learning distributions of high-dimensional data. In particular, we show that our GAN approach can be used for DiRL with multivariate rewards, an important setting which cannot be tackled with prior methods. The multivariate setting also allows us to unify learning the distribution of values and state transitions, and we exploit this idea to devise a novel exploration method that is driven by the discrepancy in estimating both values and states.
https://arxiv.org/abs/1808.01960
Training deep recurrent neural network (RNN) architectures is complicated by the increased network complexity. This disrupts the learning of higher-order abstractions using deep RNNs. In the case of feed-forward networks, training deep structures is simple and faster, but learning long-term temporal information is not possible. In this paper we propose a residual memory neural network (RMN) architecture to model short-term dependencies using deep feed-forward layers with residual and time-delayed connections. The residual connection paves the way to construct deeper networks by enabling unhindered flow of gradients, and the time-delay units capture temporal information with shared weights. The number of layers in an RMN signifies both the hierarchical processing depth and the temporal depth. The computational complexity of training an RMN is significantly lower than that of deep recurrent networks. RMN is further extended to a bi-directional RMN (BRMN) to capture both past and future information. Experimental analysis is done on the AMI corpus to substantiate the capability of RMN in learning long-term and hierarchical information. Recognition performance of an RMN trained with 300 hours of the Switchboard corpus is compared with various state-of-the-art LVCSR systems. The results indicate that RMN and BRMN gain 6% and 3.8% relative improvement over LSTM and BLSTM networks, respectively.
https://arxiv.org/abs/1808.01916
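One residual memory layer can be sketched as a feed-forward transform of the current frame plus a time-delayed connection to a past frame, wrapped in a residual connection. The exact wiring and delay handling below are assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMNLayer(nn.Module):
    """Hypothetical sketch of a residual memory layer."""
    def __init__(self, dim, delay):
        super().__init__()
        self.now = nn.Linear(dim, dim)      # current frame
        self.delayed = nn.Linear(dim, dim)  # frame t - delay, shared over time
        self.delay = delay

    def forward(self, x):                   # x: (B, T, dim)
        # frame t - delay, zero-padded at the sequence start
        x_d = F.pad(x, (0, 0, self.delay, 0))[:, : x.size(1)]
        return x + torch.relu(self.now(x) + self.delayed(x_d))
```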
In this paper, we apply different NMT models to the problem of historical spelling normalization for five languages: English, German, Hungarian, Icelandic, and Swedish. The NMT models operate at different levels, with different attention mechanisms and different neural network architectures. Our results show that NMT models are much better than SMT models in terms of character error rate. Vanilla RNNs are competitive with GRUs/LSTMs in historical spelling normalization. Transformer models perform better only when provided with more training data. We also find that subword-level models with a small subword vocabulary are better than character-level models for low-resource languages. In addition, we propose a hybrid method which further improves the performance of historical spelling normalization.
https://arxiv.org/abs/1806.05210
Despite the recent advancements in deploying neural networks for image classification, it has been found that adversarial examples are able to fool these models, leading them to misclassify images. Since these models are now widely deployed, we provide insight into the threat of these adversarial examples by evaluating their characteristics and transferability to more complex models that utilize image classification as a subtask. We demonstrate the ineffectiveness of adversarial examples when applied to instance segmentation and object detection models. We show that this ineffectiveness arises from the inability of adversarial examples to withstand transformations such as scaling or a change in lighting conditions. Moreover, we show that there exists a small threshold below which the adversarial property is retained while applying these input transformations. Additionally, these attacks demonstrate weak cross-network transferability across neural network architectures, e.g. VGG16 and ResNet50; however, the attack may fool both networks if passed sequentially through them during its formation. The lack of scalability and transferability calls into question how effective adversarial images would be in the real world.
https://arxiv.org/abs/1808.01452
Moving object detection (MOD) is a significant problem in computer vision that has many real-world applications. Different categories of methods have been proposed to solve MOD. One of the challenges is to separate moving objects from the illumination changes and shadows that are present in most real-world videos. State-of-the-art methods that can handle illumination changes and shadows work in a batch mode; thus, these methods are not suitable for long video sequences or real-time applications. In this paper, we propose an extension of a state-of-the-art batch MOD method (ILISD) to an online/incremental MOD using unsupervised and generative neural networks, which use illumination-invariant image representations. For each image in a sequence, we obtain a low-dimensional representation of a background image by a neural network and then, based on the illumination-invariant representation, decompose the foreground image into illumination changes and moving objects. Optimization is performed by stochastic gradient descent in an end-to-end and unsupervised fashion. Our algorithm can work in both batch and online modes. In the batch mode, like other batch methods, the optimizer uses all the images. In the online mode, images can be incrementally fed into the optimizer. Based on our experimental evaluation on benchmark image sequences, both the online and the batch modes of our algorithm achieve state-of-the-art accuracy on most data sets.
https://arxiv.org/abs/1808.01066
This paper presents an effective approach for parallel corpus mining using bilingual sentence embeddings. Our embedding models are trained to produce similar representations exclusively for bilingual sentence pairs that are translations of each other. This is achieved using a novel training method that introduces hard negatives consisting of sentences that are not translations but that have some degree of semantic similarity. The quality of the resulting embeddings is evaluated on parallel corpus reconstruction and by assessing machine translation systems trained on gold vs. mined sentence pairs. We find that the sentence embeddings can be used to reconstruct the United Nations Parallel Corpus at the sentence level with a precision of 48.9% for en-fr and 54.9% for en-es. When adapted to document-level matching, we achieve a parallel document matching accuracy that is comparable to the significantly more computationally intensive approach of [Jakob 2010]. Using reconstructed parallel data, we are able to train NMT models that perform nearly as well as models trained on the original data (within 1-2 BLEU).
https://arxiv.org/abs/1807.11906
A key problem in deep multi-attribute learning is to effectively discover the inter-attribute correlation structures. Typically, conventional deep multi-attribute learning approaches follow the pipeline of manually designing the network architectures based on task-specific prior knowledge and careful network tuning, leading to inflexibility for the various complicated scenarios encountered in practice. To address this problem, we propose an efficient greedy neural architecture search approach (GNAS) to automatically discover the optimal tree-like deep architecture for multi-attribute learning. In a greedy manner, GNAS divides the optimization of the global architecture into optimizations of individual connections, step by step. By iteratively updating the local architectures, the global tree-like architecture converges, with the bottom layers shared across relevant attributes and the branches in the top layers encoding more attribute-specific features. Experiments on three benchmark multi-attribute datasets show the effectiveness and compactness of the neural architectures derived by GNAS, and also demonstrate the efficiency of GNAS in searching neural architectures.
https://arxiv.org/abs/1804.06964
Through the development of neural machine translation, the quality of machine translation systems has been improved significantly. By exploiting advancements in deep learning, systems are now able to better approximate the complex mapping from source sentences to target sentences. But with this ability, new challenges also arise. An example is the translation of partial sentences in low-latency speech translation. Since the model has only seen complete sentences in training, it will always try to generate a complete sentence, though the input may only be a partial sentence. We show that NMT systems can be adapted to scenarios where no task-specific training data is available. Furthermore, this is possible without losing performance on the original training data. We achieve this by creating artificial data and by using multi-task learning. After adaptation, we are able to reduce the number of corrections displayed during incremental output construction by 45%, without a decrease in translation quality.
https://arxiv.org/abs/1808.00491
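The artificial-data idea above can be sketched by truncating full training pairs into prefix pairs, so the model learns that incomplete input warrants incomplete output. Proportional target truncation is a simplifying assumption here, not the paper's exact recipe:

```python
import random

def make_partial_pairs(src, tgt, n_prefixes=3):
    """Sketch: derive artificial partial-sentence pairs from one full
    pair (src/tgt are lists of tokens)."""
    pairs = [(src, tgt)]                            # keep the original pair
    for _ in range(n_prefixes):
        k = random.randint(1, len(src))
        m = max(1, round(len(tgt) * k / len(src)))  # proportional cut
        pairs.append((src[:k], tgt[:m]))
    return pairs

# e.g. make_partial_pairs("das ist ein Test".split(), "this is a test".split())
```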
This paper addresses the difficult problem of finding an optimal neural architecture design for a given image classification task. We propose a method that aggregates two main results of the previous state-of-the-art in neural architecture search: the strong sampling efficiency of a search scheme based on sequential model-based optimization (SMBO), and increased training efficiency from sharing weights among sampled architectures. Sequential search has previously demonstrated its capability to find state-of-the-art neural architectures for image classification. However, its computational cost remains high, and even prohibitive under modest computational settings. Equipping SMBO with weight-sharing alleviates this problem. On the other hand, progressive search with SMBO is inherently greedy, as it leverages a learned surrogate function to predict the validation error of neural architectures. This prediction is directly used to rank the sampled neural architectures. We propose to attenuate the greediness of the original SMBO method by relaxing the role of the surrogate function so that it predicts architecture sampling probability instead. We demonstrate with experiments on the CIFAR-10 dataset that our method, termed Efficient Progressive Neural Architecture Search (EPNAS), leads to increased search efficiency, while retaining the competitiveness of the found architectures.
https://arxiv.org/abs/1808.00391
A key aspect of interpretable VQA models is their ability to ground their answers in relevant regions of the image. Current approaches with this capability rely on supervised learning and human-annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific to visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that is automatically obtained from available region descriptions and object annotations. We also show that our model, trained with this mined supervision, generates visual groundings that achieve higher correlation with manually annotated groundings, while achieving state-of-the-art VQA accuracy.
https://arxiv.org/abs/1808.00265
Recently, generative adversarial networks have exhibited excellent performance in semi-supervised image analysis scenarios. In this paper, we go even further by proposing a fully unsupervised approach for segmentation applications with prior knowledge of the objects’ shapes. We propose and investigate different strategies to generate simulated label data and perform image-to-image translation between the image and the label domain using an adversarial model. Specifically, we assess the impact of the annotation model’s accuracy as well as the effect of simulating additional low-level image features. For experimental evaluation, we consider the segmentation of the glomeruli, an application scenario from renal pathology. Experiments provide proof of concept and also confirm that the strategy for creating the simulated label data is of particular relevance, considering the stability of GAN training.
https://arxiv.org/abs/1805.10059
Object detection is one of the major problems in computer vision, and has been extensively studied. Most existing detection works rely on labor-intensive supervision, such as ground-truth bounding boxes of objects or at least image-level annotations. In contrast, we propose an object detection method that does not require any form of human annotation on target tasks, by exploiting freely available web images. In order to facilitate effective knowledge transfer from web images, we introduce a multi-instance multi-label domain adaptation learning framework with two key innovations. First, we propose an instance-level adversarial domain adaptation network with attention on foreground objects to transfer object appearances from the web domain to the target domain. Second, to preserve the class-specific semantic structure of transferred object features, we propose a simultaneous transfer mechanism to transfer supervision across domains through pseudo strong label generation. With our end-to-end framework that simultaneously learns a weakly supervised detector and transfers knowledge across domains, we achieve significant improvements over baseline methods on the benchmark datasets.
https://arxiv.org/abs/1711.05954
Can we infer intentions and goals from a person’s actions? As an example of this family of problems, we consider here whether it is possible to decipher what a person is searching for by decoding their eye movement behavior. We conducted two human psychophysics experiments on object arrays and natural images in which we monitored subjects’ eye movements while they were looking for a target object. Using as input the pattern of “error” fixations on non-target objects before the target was found, we developed a model (InferNet) whose goal was to infer what the target was. “Error” fixations share similar features with the sought target. The InferNet model uses a pre-trained 2D convolutional architecture to extract features from the error fixations and computes a 2D similarity map between each error fixation and all locations across the search image by modulating the search image via convolution across layers. InferNet consolidates the modulated response maps across layers via max pooling to keep track of the sub-patterns highly similar to features at error fixations, and integrates these maps across all error fixations. InferNet successfully identifies the subject’s goal and outperforms all the competitive null models, even without any object-specific training on the inference task.
https://arxiv.org/abs/1807.11926
Recent work has shown local convergence of GAN training for absolutely continuous data and generator distributions. In this paper, we show that the requirement of absolute continuity is necessary: we describe a simple yet prototypical counterexample showing that in the more realistic case of distributions that are not absolutely continuous, unregularized GAN training is not always convergent. Furthermore, we discuss regularization strategies that were recently proposed to stabilize GAN training. Our analysis shows that GAN training with instance noise or zero-centered gradient penalties converges. On the other hand, we show that Wasserstein-GANs and WGAN-GP with a finite number of discriminator updates per generator update do not always converge to the equilibrium point. We discuss these results, leading us to a new explanation for the stability problems of GAN training. Based on our analysis, we extend our convergence results to more general GANs and prove local convergence for simplified gradient penalties even if the generator and data distribution lie on lower dimensional manifolds. We find these penalties to work well in practice and use them to learn high-resolution generative image models for a variety of datasets with little hyperparameter tuning.
https://arxiv.org/abs/1801.04406
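The simplified zero-centered gradient penalty analyzed above (often called R1) penalizes the squared gradient norm of the discriminator at real samples, pushing it toward zero on the data distribution. A compact PyTorch sketch, with the weighting `gamma` chosen as an illustrative assumption:

```python
import torch

def r1_penalty(discriminator, real, gamma=10.0):
    """Zero-centered gradient penalty on real data (sketch)."""
    real = real.detach().requires_grad_(True)
    out = discriminator(real).sum()
    # gradient of the discriminator output w.r.t. the real inputs
    (grad,) = torch.autograd.grad(out, real, create_graph=True)
    return 0.5 * gamma * grad.pow(2).flatten(1).sum(1).mean()
```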
We present a scalable approach for Detecting Objects by transferring Common-sense Knowledge (DOCK) from source to target categories. In our setting, the training data for the source categories have bounding box annotations, while those for the target categories only have image-level annotations. Current state-of-the-art approaches focus on image-level visual or semantic similarity to adapt a detector trained on the source categories to the new target categories. In contrast, our key idea is to (i) use similarity not at the image-level, but rather at the region-level, and (ii) leverage richer common-sense (based on attribute, spatial, etc.) to guide the algorithm towards learning the correct detections. We acquire such common-sense cues automatically from readily-available knowledge bases without any extra human effort. On the challenging MS COCO dataset, we find that common-sense knowledge can substantially improve detection performance over existing transfer-learning baselines.
https://arxiv.org/abs/1804.01077
Recent neural networks such as WaveNet and SampleRNN, which learn directly from speech waveform samples, have achieved very high-quality synthetic speech in terms of both naturalness and speaker similarity, even in multi-speaker text-to-speech synthesis systems. Such neural networks are being used as an alternative to vocoders and hence are often called neural vocoders. A neural vocoder uses acoustic features as local condition parameters, and these parameters need to be accurately predicted by another acoustic model. However, it is not yet clear how to train this acoustic model, which is problematic because the final quality of synthetic speech is significantly affected by the performance of the acoustic model. Significant degradation happens especially when the predicted acoustic features have mismatched characteristics compared to natural ones. In order to reduce this mismatch between natural and generated acoustic features, we propose frameworks that incorporate either a conditional generative adversarial network (GAN) or its variant, Wasserstein GAN with gradient penalty (WGAN-GP), into multi-speaker speech synthesis with the WaveNet vocoder. We also extend the GAN frameworks and use the discretized mixture logistic loss of a well-trained WaveNet, in addition to the mean squared error and adversarial losses, as parts of the objective functions. Experimental results show that acoustic models trained using the WGAN-GP framework with a back-propagated discretized-mixture-of-logistics (DML) loss achieve the highest subjective evaluation scores in terms of both quality and speaker similarity.
https://arxiv.org/abs/1807.11679
Recently, much progress has been made in image captioning, and an encoder-decoder framework has been adopted by all the state-of-the-art models. Under this framework, an input image is encoded by a convolutional neural network (CNN) and then translated into natural language with a recurrent neural network (RNN). Existing models built on this framework employ only one kind of CNN, e.g., ResNet or Inception-X, which describes image content from only one specific viewpoint. Thus, the semantic meaning of an input image cannot be comprehensively understood, which restricts captioning performance. In this paper, in order to exploit the complementary information from multiple encoders, we propose a novel Recurrent Fusion Network (RFNet) for image captioning. The fusion process in our model can exploit the interactions among the outputs of the image encoders and then generate new, compact yet informative representations for the decoder. Experiments on the MSCOCO dataset demonstrate the effectiveness of our proposed RFNet, which sets a new state-of-the-art for image captioning.
https://arxiv.org/abs/1807.09986
Despite the huge success of artificial intelligence, hardware systems running these algorithms consume orders of magnitude more energy than the human brain, mainly due to heavy data movement between the memory unit and the computation cores. Spiking neural networks (SNNs) built using bio-plausible neuron and synaptic models have emerged as a power-efficient choice for designing cognitive applications. These algorithms involve several lookup-table (LUT) based function evaluations, such as high-order polynomials and transcendental functions for solving complex neuro-synaptic models, that typically require additional storage. To that effect, we propose ‘SPARE’ - an in-memory, distributed processing architecture built on ROM-embedded RAM technology for accelerating SNNs. ROM-embedded RAMs allow storage of LUTs, embedded within a typical memory array, without additional area overhead. Our proposed architecture consists of a 2-D array of Processing Elements (PEs). Since most of the computations are done locally within each PE, unnecessary data transfers are restricted, thereby alleviating the von Neumann bottleneck. We evaluate SPARE for two different ROM-embedded RAM structures: CMOS-based ROM-embedded SRAMs (R-SRAMs) and STT-MRAM-based ROM-embedded MRAMs (R-MRAMs). Moreover, we analyze trade-offs in terms of energy, area and performance for the two technologies on a range of image classification benchmarks. Furthermore, we leverage the additional storage density to implement complex neuro-synaptic functionalities. This enhances the utility of the proposed architecture by provisioning the implementation of any neuron/synaptic behavior, as necessitated by the application. Our results show up to 1.75x, 1.95x and 1.95x improvement in energy, iso-storage area, and iso-area performance, respectively, using neural network accelerators built on ROM-embedded RAM primitives.
https://arxiv.org/abs/1711.07546
In this paper, a doubly-attentive transformer machine translation model (DATNMT) is presented, in which a doubly-attentive transformer decoder incorporates spatial visual features obtained via pretrained convolutional neural networks, bridging the gap between image captioning and translation. In this framework, the transformer decoder learns to attend to source-language words and parts of an image independently, by means of two separate attention components in an Enhanced Multi-Head Attention Layer of the doubly-attentive transformer, as it generates words in the target language. We find that the proposed model can effectively exploit not just the scarce multimodal machine translation data, but also large general-domain text-only machine translation corpora and image-text captioning corpora. The experimental results show that the proposed doubly-attentive transformer decoder performs better than a single-decoder transformer model, and gives state-of-the-art results in the English-German multimodal machine translation task.
https://arxiv.org/abs/1807.11605
Modern CNN-based object detectors rely on bounding box regression and non-maximum suppression (NMS) to localize objects. While the probabilities for class labels naturally reflect classification confidence, localization confidence is absent. This can cause properly localized bounding boxes to degenerate during iterative regression or even be suppressed during NMS. In this paper, we propose IoU-Net, which learns to predict the IoU between each detected bounding box and the matched ground truth. The network thus acquires a confidence of localization, which improves the NMS procedure by preserving accurately localized bounding boxes. Furthermore, an optimization-based bounding box refinement method is proposed, in which the predicted IoU is formulated as the objective. Extensive experiments on the MS-COCO dataset show the effectiveness of IoU-Net, as well as its compatibility with and adaptivity to several state-of-the-art object detectors.
https://arxiv.org/abs/1807.11590
This paper is on image and face super-resolution. The vast majority of prior work for this problem focuses on how to increase the resolution of low-resolution images which are artificially generated by simple bilinear down-sampling (or in a few cases by blurring followed by down-sampling). We show that such methods fail to produce good results when applied to real-world low-resolution, low-quality images. To circumvent this problem, we propose a two-stage process which first trains a High-to-Low Generative Adversarial Network (GAN) to learn how to degrade and downsample high-resolution images, requiring, during training, only unpaired high- and low-resolution images. Once this is achieved, the output of this network is used to train a Low-to-High GAN for image super-resolution, this time using paired low- and high-resolution images. Our main result is that this network can now be used to effectively increase the quality of real-world low-resolution images. We have applied the proposed pipeline to the problem of face super-resolution, where we report large improvements over baselines and prior work, although the proposed method is potentially applicable to other object categories.
https://arxiv.org/abs/1807.11458
Malware authors have always had the advantage of being able to adversarially test and augment their malicious code, before deploying the payload, using the anti-malware products at their disposal. Anti-malware developers and threat experts, on the other hand, have no such privilege of tuning anti-malware products against zero-day attacks proactively. This allows malware authors to be a step ahead of the anti-malware products, fundamentally biasing the cat-and-mouse game played by the two parties. In this paper, we propose a way to enable machine learning based threat prevention models to bridge that gap by tuning against a deep generative adversarial network (GAN), which takes up the role of a malware author and generates new types of malware. The GAN is trained over a reversible, distributed RGB image representation of known malware behaviors, encoding the sequence of API call n-grams and the corresponding term frequencies. The generated images represent synthetic malware that can be decoded back to the underlying API call sequence information. The image representation is demonstrated not only as a general technique of incorporating necessary priors for exploiting convolutional neural network architectures for generative or discriminative modeling, but also as a visualization method for easy manual software or malware categorization, by having individual API n-gram information distributed across the image space. In addition, we also propose using smart definitions for detecting malware based on perceptual hashing of these images. Such hashes are potentially more effective than cryptographic hashes, which do not carry any meaningful similarity metric and hence do not generalize well.
https://arxiv.org/abs/1807.07525
In neural machine translation (NMT), the computational cost at the output layer increases with the size of the target-side vocabulary. Using a limited-size vocabulary instead may cause a significant decrease in translation quality. This trade-off stems from a softmax-based loss function that handles in-vocabulary words independently, without considering word similarity. In this paper, we propose a novel NMT loss function that incorporates word similarity in the form of distances in a word embedding space. The proposed loss function encourages the NMT decoder to generate words close to their references in the embedding space; this helps the decoder choose similar acceptable words when the actual best candidates are not included in the vocabulary due to its size limitation. In experiments using the ASPEC Japanese-to-English and IWSLT17 English-to-French data sets, the proposed method showed improvements over a standard NMT baseline on both datasets; notably, with IWSLT17 En-Fr it achieved up to +1.72 in BLEU and +1.99 in METEOR. When the target-side vocabulary was limited to only 1,000 words, the proposed method demonstrated a substantial gain of +1.72 in METEOR with ASPEC Ja-En.
https://arxiv.org/abs/1807.11219
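One plausible instantiation of such a loss replaces the one-hot target with a soft distribution derived from embedding similarity to the reference word, so near-synonyms are penalized less. The temperature and the use of cosine similarity below are assumptions; the paper's exact formulation may differ:

```python
import torch.nn.functional as F

def embedding_aware_loss(logits, target_ids, emb, tau=0.1):
    """Sketch: logits (B, V), target_ids (B,), emb (V, d) is the
    target-side word embedding table."""
    ref = emb[target_ids]                                        # (B, d)
    sim = F.normalize(ref, dim=-1) @ F.normalize(emb, dim=-1).T  # (B, V)
    soft_target = F.softmax(sim / tau, dim=-1)  # mass spread over similar words
    return F.kl_div(F.log_softmax(logits, dim=-1), soft_target,
                    reduction='batchmean')
```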
In order to convey the most content in their limited space, advertisements embed references to outside knowledge via symbolism. For example, a motorcycle stands for adventure (a positive property the ad wants associated with the product being sold), and a gun stands for danger (a negative property to dissuade viewers from undesirable behaviors). We show how to use symbolic references to better understand the meaning of an ad. We further show how anchoring ad understanding in general-purpose object recognition and image captioning improves results. We formulate the ad understanding task as matching the ad image to human-generated statements that describe the action that the ad prompts, and the rationale it provides for taking this action. Our proposed method outperforms the state of the art on this task, and on an alternative formulation of question-answering on ads. We show additional applications of our learned representations for matching ads to slogans, and clustering ads according to their topic, without extra training.
https://arxiv.org/abs/1711.06666
Generating stylized captions for an image is an emerging topic in image captioning. Given an image as input, it requires the system to generate a caption that has a specific style (e.g., humorous, romantic, positive, and negative) while describing the image content semantically accurately. In this paper, we propose a novel stylized image captioning model that effectively takes both requirements into consideration. To this end, we first devise a new variant of LSTM, named style-factual LSTM, as the building block of our model. It uses two groups of matrices to capture the factual and stylized knowledge, respectively, and automatically learns the word-level weights of the two groups based on previous context. In addition, when training the model to capture stylized elements, we propose an adaptive learning approach based on a reference factual model, which provides factual knowledge to the model as it learns from stylized caption labels and can adaptively compute how much information to supply at each time step. We evaluate our model on two stylized image captioning datasets, which contain humorous/romantic captions and positive/negative captions, respectively. Experiments show that our proposed model outperforms the state-of-the-art approaches, without using extra ground-truth supervision.
https://arxiv.org/abs/1807.03871
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
https://arxiv.org/abs/1707.05612
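The central change is the max-of-hinges (hard negative) ranking loss: for each positive image-caption pair, only the hardest negative in the mini-batch contributes, rather than the sum over all negatives. A sketch over a batch similarity matrix (the margin value is an illustrative choice):

```python
import torch

def max_hinge_loss(sim, margin=0.2):
    """sim: (B, B) image-caption similarities with positives on the
    diagonal. Only the hardest in-batch negative per pair contributes."""
    pos = sim.diag().view(-1, 1)
    cost_c = (margin + sim - pos).clamp(min=0)      # wrong captions per image
    cost_i = (margin + sim - pos.t()).clamp(min=0)  # wrong images per caption
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_c = cost_c.masked_fill(mask, 0)            # ignore the positives
    cost_i = cost_i.masked_fill(mask, 0)
    return cost_c.max(dim=1).values.mean() + cost_i.max(dim=0).values.mean()
```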
Object detection has made great progress in the past few years along with the development of deep learning. However, most current object detection methods are resource-hungry, which hinders their wide deployment in many resource-restricted usages, such as on always-on devices, battery-powered low-end devices, etc. This paper considers the resource and accuracy trade-off for resource-restricted usages when designing the whole object detection framework. Based on the deeply supervised object detection (DSOD) framework, we propose Tiny-DSOD, dedicated to resource-restricted usages. Tiny-DSOD introduces two innovative and ultra-efficient architecture blocks: a depthwise dense block (DDB) based backbone and a depthwise feature-pyramid-network (D-FPN) based front-end. We conduct extensive experiments on three famous benchmarks (PASCAL VOC 2007, KITTI, and COCO) and compare Tiny-DSOD against the state-of-the-art ultra-efficient object detection solutions such as Tiny-YOLO, MobileNet-SSD (v1 & v2), SqueezeDet, and Pelee. Results show that Tiny-DSOD outperforms these solutions in all three metrics (parameter size, FLOPs, and accuracy) in each comparison. For instance, Tiny-DSOD achieves 72.1% mAP with only 0.95M parameters and 1.06B FLOPs, which is by far the state-of-the-art result with such a low resource requirement.
https://arxiv.org/abs/1807.11013
To avoid exhaustive search over locations and scales, current state-of-the-art object detection systems usually involve a crucial component that generates a batch of candidate object proposals from images. In this paper, we present a simple yet effective approach for segmenting object proposals via a deep architecture of recursive neural networks (ReNNs), which hierarchically groups regions to detect object candidates over scales. Unlike traditional methods that mainly adopt fixed similarity measures for merging regions or finding object proposals, our approach adaptively learns the region merging similarity and the objectness measure during the process of hierarchical region grouping. Specifically, guided by a structured loss, the ReNN model jointly optimizes the cross-region similarity metric with the region merging process as well as the objectness prediction. During inference for object proposal generation, we introduce randomness into the greedy search to cope with the ambiguity of grouping regions. Extensive experiments on standard benchmarks, e.g., PASCAL VOC and ImageNet, suggest that our approach is capable of producing object proposals with high recall while well preserving object boundaries, and outperforms other existing methods in both accuracy and efficiency.
https://arxiv.org/abs/1612.01057
Omnidirectional video enables spherical stimuli with a $360^\circ \times 180^\circ$ viewing range. Meanwhile, only the viewport region of omnidirectional video can be seen by the observer through head movement (HM), and an even smaller region within the viewport can be clearly perceived through eye movement (EM). Thus, the subjective quality of omnidirectional video may be correlated with HM and EM of human behavior. To fill in the gap between subjective quality and human behavior, this paper proposes a large-scale visual quality assessment (VQA) dataset of omnidirectional video, called VQA-OV, which collects 60 reference sequences and 540 impaired sequences. Our VQA-OV dataset provides not only the subjective quality scores of sequences but also the HM and EM data of subjects. By mining our dataset, we find that the subjective quality of omnidirectional video is indeed related to HM and EM. Hence, we develop a deep learning model, which embeds HM and EM, for objective VQA on omnidirectional video. Experimental results show that our model significantly improves the state-of-the-art performance of VQA on omnidirectional video.
https://arxiv.org/abs/1807.10990
Video quality assessment (VQA) technology has attracted a lot of attention in recent years due to the increasing demand for video streaming services. Existing VQA methods are designed to predict video quality in terms of the mean opinion score (MOS) calibrated by humans in subjective experiments. However, they cannot predict the satisfied user ratio (SUR) of an aggregated viewer group. Furthermore, they provide little guidance for video coding parameter selection, e.g. the Quantization Parameter (QP) of a set of consecutive frames, in practical video streaming services. To overcome these shortcomings, the just-noticeable-difference (JND) based VQA methodology has been proposed as an alternative. It has been observed experimentally that the JND location is a normally distributed random variable. In this work, we explain this distribution by proposing a user model that takes both subject variability and content variability into account. This model is built upon the user’s capability to discern the quality difference between video clips encoded with different QPs. Moreover, it analyzes video content characteristics to account for inter-content variability. The proposed user model is validated on the data collected in the VideoSet. It is demonstrated that the model can flexibly predict the SUR distribution of a specific user group.
https://arxiv.org/abs/1807.10894
Although deep networks have recently emerged as the model of choice for many computer vision problems, in order to yield good results they often require time-consuming architecture search. To combat the complexity of design choices, prior work has adopted the principle of modularized design which consists in defining the network in terms of a composition of topologically identical or similar building blocks (a.k.a. modules). This reduces architecture search to the problem of determining the number of modules to compose and how to connect such modules. Again, for reasons of design complexity and training cost, previous approaches have relied on simple rules of connectivity, e.g., connecting each module to only the immediately preceding module or perhaps to all of the previous ones. Such simple connectivity rules are unlikely to yield the optimal architecture for the given problem. In this work we remove these predefined choices and propose an algorithm to learn the connections between modules in the network. Instead of being chosen a priori by the human designer, the connectivity is learned simultaneously with the weights of the network by optimizing the loss function of the end task using a modified version of gradient descent. We demonstrate our connectivity learning method on the problem of multi-class image classification using two popular architectures: ResNet and ResNeXt. Experiments on four different datasets show that connectivity learning using our approach yields consistently higher accuracy compared to relying on traditional predefined rules of connectivity. Furthermore, in certain settings it leads to significant savings in number of parameters.
https://arxiv.org/abs/1807.11473
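A relaxed form of such connectivity learning can be sketched with sigmoid gates over the outputs of all previous modules, trained jointly with the network weights by gradient descent; the sigmoid relaxation and summation below are assumptions rather than the paper's exact masking scheme:

```python
import torch
import torch.nn as nn

class GatedConnections(nn.Module):
    """Sketch: each module consumes a gated sum of all previous
    modules' outputs, with gates learned by gradient descent."""
    def __init__(self, n_prev):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_prev))

    def forward(self, prev_outputs):            # list of (B, C, H, W)
        g = torch.sigmoid(self.logits)          # soft on/off per connection
        return sum(w * o for w, o in zip(g, prev_outputs))
```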