Welcome to AMDS123 Blog!

Recent Papers about CV, CL and SD

Visually Grounded Word Embeddings and Richer Visual Features for Improving Multimodal Neural Machine Translation

2017-12-16

Jean-Benoit Delbrouck, Stéphane Dupont, Omar Seddati

arXiv_CV

arXiv_CV Image_Caption Object_Detection Caption Embedding CNN NMT Detection
Abstract

In Multimodal Neural Machine Translation (MNMT), a neural model generates a translated sentence that describes an image, given the image itself and one source description in English. This is considered the multimodal image caption translation task. The images are processed with a Convolutional Neural Network (CNN) to extract visual features exploitable by the translation model. So far, the CNNs used have been pre-trained on object detection and localization tasks. We hypothesize that richer architectures, such as dense captioning models, may be more suitable for MNMT and could lead to improved translations. We extend this intuition to the word embeddings, where we compute both linguistic and visual representations for our corpus vocabulary. We combine and compare different confi

URL

https://arxiv.org/abs/1707.01009

PDF

https://arxiv.org/pdf/1707.01009
Tensor Product Generation Networks for Deep NLP Modeling

2017-12-16

Qiuyuan Huang, Paul Smolensky, Xiaodong He, Li Deng, Dapeng Wu

arXiv_CV

arXiv_CV Image_Caption Caption RNN Deep_Learning
Abstract

We present a new approach to the design of deep networks for natural language processing (NLP), based on the general technique of Tensor Product Representations (TPRs) for encoding and processing symbol structures in distributed neural networks. A network architecture — the Tensor Product Generation Network (TPGN) — is proposed which is capable in principle of carrying out TPR computation, but which uses unconstrained deep learning to design its internal representations. Instantiated in a model for image-caption generation, TPGN outperforms LSTM baselines when evaluated on the COCO dataset. The TPR-capable structure enables interpretation of internal representations and operations, which prove to contain considerable grammatical content. Our caption-generation model can be interpreted as generating sequences of grammatical categories and retrieving words by their categories from a plan encoded as a distributed representation.

URL

https://arxiv.org/abs/1709.09118

PDF

https://arxiv.org/pdf/1709.09118
Impression Network for Video Object Detection

2017-12-16

Congrui Hetang, Hongwei Qin, Shaohui Liu, Junjie Yan

arXiv_CV

arXiv_CV Object_Detection Sparse Detection
Abstract

Video object detection is more challenging than image object detection. Previous work has shown that applying an object detector frame by frame is not only slow but also inaccurate. Visual cues are weakened by defocus and motion blur, causing failures on the corresponding frames. Multi-frame feature fusion methods have proved effective in improving accuracy, but they dramatically sacrifice speed. Feature propagation based methods have proved effective in improving speed, but they sacrifice accuracy. So is it possible to improve speed and performance simultaneously? Inspired by how humans use impressions to recognize objects in blurry frames, we propose Impression Network, which embodies a natural and efficient feature aggregation mechanism. In our framework, an impression feature is established by iteratively absorbing sparsely extracted frame features. The impression feature is propagated all the way down the video, helping enhance features of low-quality frames. This impression mechanism makes it possible to perform long-range multi-frame feature fusion among sparse keyframes with minimal overhead. It significantly improves on the per-frame detection baseline on ImageNet VID while being 3 times faster (20 fps). We hope Impression Network can provide a new perspective on video feature enhancement. Code will be made available.
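
The idea of an impression feature that iteratively absorbs sparse keyframe features can be illustrated with a toy sketch. This is not the authors' implementation: the fixed blend weight `alpha`, the function names, and the use of plain Python lists as features are all assumptions made for illustration, and the abstract does not say how features are weighted or aligned between frames.

```python
from typing import List, Optional


def update_impression(impression: Optional[List[float]],
                      keyframe_feature: List[float],
                      alpha: float = 0.5) -> List[float]:
    """Blend a new keyframe feature into the running impression feature.

    `alpha` is a hypothetical fixed blend weight (an assumption).
    """
    if impression is None:
        # The first keyframe initializes the impression.
        return list(keyframe_feature)
    return [(1.0 - alpha) * old + alpha * new
            for old, new in zip(impression, keyframe_feature)]


def run_over_keyframes(features: List[List[float]],
                       alpha: float = 0.5) -> List[List[float]]:
    """Propagate the impression down a sequence of sparse keyframe features."""
    impression = None
    history = []
    for feature in features:
        impression = update_impression(impression, feature, alpha)
        history.append(impression)
    return history
```

Because each update reuses the previous impression, information from early keyframes persists with geometrically decaying weight at the cost of one blend per keyframe.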

URL

https://arxiv.org/abs/1712.05896

PDF

https://arxiv.org/pdf/1712.05896
On reproduction of On the regularization of Wasserstein GANs

2017-12-16

Junghoon Seo, Taegyun Jeon

arXiv_CV

arXiv_CV Regularization GAN
Abstract

This report has several purposes. First, it investigates the reproducibility of the submitted paper On the regularization of Wasserstein GANs (2018). Second, among the experiments performed in the submitted paper, five aspects were emphasized and reproduced: learning speed, stability, robustness against hyperparameters, estimation of the Wasserstein distance, and various sampling methods. Finally, we identify which parts of the contribution can be reproduced, and at what cost in terms of resources. All source code for reproduction is open to the public.

URL

https://arxiv.org/abs/1712.05882

PDF

https://arxiv.org/pdf/1712.05882
Kill Two Birds With One Stone: Boosting Both Object Detection Accuracy and Speed With adaptive Patch-of-Interest Composition

2017-12-15

Shihao Zhang, Weiyao Lin, Ping Lu, Weihua Li, Shuo Deng

arXiv_CV

arXiv_CV Video_Caption Object_Detection Detection
Abstract

Object detection is an important yet challenging task in video understanding and analysis, where one major challenge lies in the proper balance between two conflicting factors: detection accuracy and detection speed. In this paper, we propose a new adaptive patch-of-interest composition approach for boosting both the accuracy and speed of object detection. The proposed approach first extracts patches in a video frame which have the potential to include objects-of-interest. Then, an adaptive composition process is introduced to compose the extracted patches into an optimal number of sub-frames for object detection. With this process, we are able to maintain the resolution of the original frame during object detection (to preserve accuracy), while minimizing the number of inputs to detection (to boost speed). Experimental results on various datasets demonstrate the effectiveness of the proposed approach.

URL

https://arxiv.org/abs/1708.03795

PDF

https://arxiv.org/pdf/1708.03795
Tunable polymorphism of epitaxial iron oxides in the four-in-one ferroic-on-GaN system with magnetically ordered α-, γ-, ε-Fe2O3 and Fe3O4 layers

2017-12-15

Sergey Suturin, Alexander Korovin, Sergey Gastev, Mikhail Volkov, Masao Tabuchi, Nikolai Sokolov

arXiv_CV

arXiv_CV
Abstract

Hybridization of semiconducting and magnetic materials into a single heterostructure is believed to be potentially applicable to the design of novel functional spintronic devices. In the present work we report epitaxial stabilization of four magnetically ordered iron oxide phases (Fe3O4, γ-Fe2O3, α-Fe2O3 and most exotic metastable ε-Fe2O3) in the form of nanometer sized single crystalline films on GaN(0001) surface. The epitaxial growth of as many as four distinctly different iron oxide phases is demonstrated within the same single-target Laser MBE technological process on a GaN semiconductor substrate widely used for electronic device fabrication. The discussed iron oxides belong to a family of simple formula magnetic materials exhibiting a rich variety of outstanding physical properties including peculiar Verwey and Morin phase transitions in Fe3O4 and α-Fe2O3 and multiferroic behavior in metastable magnetically hard ε-Fe2O3 ferrite. The physical reasons standing behind the nucleation of a particular phase in an epitaxial growth process deserve interest from the fundamental point of view. The practical side of the presented study is to exploit the tunable polymorphism of iron oxides for creation of ferroic-on-semiconductor heterostructures usable in novel spintronic devices. By application of a wide range of experimental techniques the surface morphology, crystalline structure, electronic and magnetic properties of the single phase iron oxide epitaxial films on GaN have been studied. A comprehensive comparison has been made to the properties of the same ferrite materials in the bulk and nanostructured form reported by other research groups.

URL

https://arxiv.org/abs/1712.05632

PDF

https://arxiv.org/pdf/1712.05632
Personalization in Goal-Oriented Dialog

2017-12-15

Chaitanya K. Joshi, Fei Mi, Boi Faltings

arXiv_CL

arXiv_CL Memory_Networks
Abstract

The main goal of modeling human conversation is to create agents which can interact with people in both open-ended and goal-oriented scenarios. End-to-end trained neural dialog systems are an important line of research for such generalized dialog models as they do not resort to any situation-specific handcrafting of rules. However, incorporating personalization into such systems is a largely unexplored topic as there are no existing corpora to facilitate such work. In this paper, we present a new dataset of goal-oriented dialogs which are influenced by speaker profiles attached to them. We analyze the shortcomings of an existing end-to-end dialog system based on Memory Networks and propose modifications to the architecture which enable personalization. We also investigate personalization in dialog as a multi-task learning problem, and show that a single model which shares features among various profiles outperforms separate models for each profile.

URL

http://arxiv.org/abs/1706.07503

PDF

http://arxiv.org/pdf/1706.07503
Vacancy defect in bulk and at surface of GaN: A combined first-principles theoretical and experimental analysis

2017-12-14

Sanjay Nayak, Mit H. Naik, Manish Jain, U.V. Waghmare, S.M. Shivaprasad

arXiv_CV

arXiv_CV
Abstract

We determine the atomic and electronic structure, formation energy, stability and magnetic properties of native point defects, such as Gallium (Ga) and Nitrogen (N) vacancies, in bulk and at the non-polar (10$\overline{1}$0) surface of wurtzite Gallium Nitride (w-GaN) using first-principles calculations based on Density Functional Theory (DFT). Under both Ga-rich and N-rich conditions, the formation energy of N-vacancies is significantly lower than that of Ga-vacancies in bulk and at the (10$\overline{1}$0) surface. Experimental evidence of the presence of N-vacancies was noted from electron energy loss spectroscopy measurements, which further correlated with the high electrical conductivity observed in a GaN nanowall network. We find that the Fermi level pins at 0.35 ± 0.02 eV below the Ga-derived surface state. The presence of atomic steps in the nanostructure due to the formation of N-vacancies at the (10$\overline{1}$0) surface makes its electronic structure metallic. Clustering of N-vacancies and Ga-Ga metallic bond formation near these vacancies is seen to be another source of the electrical conductivity of the faceted GaN nanostructure that is observed experimentally.

URL

https://arxiv.org/abs/1710.05670

PDF

https://arxiv.org/pdf/1710.05670
A Brief Overview of the KTA WCET Tool

2017-12-14

David Broman

arXiv_CV

arXiv_CV
Abstract

KTA (KTH’s timing analyzer) is a research tool for performing timing analysis of program code. The currently available toolchain can perform two different kinds of analyses: i) exhaustive fine-grained timing analysis, where timing information can be provided between arbitrary timing program points within a function, and ii) abstract search-based timing analysis, where the tool can perform optimal worst-case execution time (WCET) analysis. The latter is based on a technique that combines divide-and-conquer search and abstract interpretation. The tool is under development and currently supports a subset of the MIPS instruction set architecture.

URL

https://arxiv.org/abs/1712.05264

PDF

https://arxiv.org/pdf/1712.05264
OSU Multimodal Machine Translation System Report

2017-12-14

Mingbo Ma, Dapeng Li, Kai Zhao, Liang Huang

arXiv_CV

arXiv_CV Image_Caption Caption
Abstract

This paper describes Oregon State University’s submissions to the shared WMT’17 task “multimodal translation task I”. In this task, all the sentence pairs are image captions in different languages. The key difference between this task and conventional machine translation is that we have corresponding images as additional information for each sentence pair. In this paper, we introduce a simple but effective system which takes an image shared between different languages and feeds it into both the encoding and decoding sides. We report our system’s performance for English-French and English-German with the Flickr30K (in-domain) and MSCOCO (out-of-domain) datasets. Our system achieves the best performance in TER for English-German on the MSCOCO dataset.

URL

https://arxiv.org/abs/1710.02718

PDF

https://arxiv.org/pdf/1710.02718
MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features

2017-12-13

Liang-Chieh Chen, Alexander Hermans, George Papandreou, Florian Schroff, Peng Wang, Hartwig Adam

arXiv_CV

arXiv_CV Object_Detection Segmentation Semantic_Segmentation Prediction Detection
Abstract

In this work, we tackle the problem of instance segmentation, the task of simultaneously solving object detection and semantic segmentation. Towards this goal, we present a model, called MaskLab, which produces three outputs: box detection, semantic segmentation, and direction prediction. Building on top of the Faster-RCNN object detector, the predicted boxes provide accurate localization of object instances. Within each region of interest, MaskLab performs foreground/background segmentation by combining semantic and direction prediction. Semantic segmentation assists the model in distinguishing between objects of different semantic classes including background, while the direction prediction, estimating each pixel’s direction towards its corresponding center, allows separating instances of the same semantic class. Moreover, we explore the effect of incorporating recent successful methods from both segmentation and detection (i.e., atrous convolution and hypercolumn). Our proposed model is evaluated on the COCO instance segmentation benchmark and shows comparable performance with other state-of-the-art models.
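
To make the direction target concrete, the following toy sketch quantizes the direction from a pixel towards an instance center into a fixed number of bins. It is an illustration only, not MaskLab's code: the choice of 8 bins, the bin layout (bin 0 centered on the +x axis), the (x, y) coordinate convention, and the function name are all assumptions.

```python
import math
from typing import Tuple


def direction_bin(pixel: Tuple[float, float],
                  center: Tuple[float, float],
                  num_bins: int = 8) -> int:
    """Return the quantized direction from `pixel` towards `center`.

    Bins are centered on multiples of 2*pi/num_bins (an assumed layout),
    so axis-aligned directions fall in the middle of a bin.
    """
    dx = center[0] - pixel[0]
    dy = center[1] - pixel[1]
    # atan2 returns an angle in [-pi, pi]; wrap it into [0, 2*pi).
    angle = math.atan2(dy, dx) % (2.0 * math.pi)
    width = 2.0 * math.pi / num_bins
    return int((angle + width / 2.0) / width) % num_bins
```

Two touching instances of the same class get different direction labels along their shared boundary, because pixels on each side point towards different centers; this is the cue the abstract describes for separating instances. A pixel exactly at the center has no direction, and this sketch arbitrarily assigns it bin 0.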

URL

https://arxiv.org/abs/1712.04837

PDF

https://arxiv.org/pdf/1712.04837
ADA: A Game-Theoretic Perspective on Data Augmentation for Object Detection

2017-12-12

Sima Behpour, Kris M. Kitani, Brian D. Ziebart

arXiv_CV

arXiv_CV Adversarial Object_Detection Detection
Abstract

The use of random perturbations of ground truth data, such as random translation or scaling of bounding boxes, is a common heuristic for data augmentation that has been shown to prevent overfitting and improve generalization. Since the design of data augmentation is largely guided by reported best practices, it is difficult to understand if those design choices are optimal. To provide a more principled perspective, we develop a game-theoretic interpretation of data augmentation in the context of object detection. We aim to find optimal adversarial perturbations of the ground truth data (i.e., the worst case perturbations) that force the object bounding box predictor to learn from the hardest distribution of perturbed examples for better test-time performance. We establish that the game-theoretic solution, the Nash equilibrium, provides both an optimal predictor and an optimal data augmentation distribution. We show that our adversarial method of training a predictor can significantly improve test-time performance for the task of object detection. On the ImageNet object detection task, our adversarial approach improves performance by over 16% compared to the best performing data augmentation method.

URL

https://arxiv.org/abs/1710.07735

PDF

https://arxiv.org/pdf/1710.07735
A Practical Approach for Detecting Logical Error in Object Oriented Environment

2017-12-12

Ghassan Samara

arXiv_CV

arXiv_CV
Abstract

A programming language is a formally constructed language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs to control the behavior of a machine or to express algorithms. Most programs written by programmers compile correctly with no syntax or semantic errors. However, some other errors appear only after the program is executed (logical errors). Logical Errors (LE) are errors that remain after all syntax errors have been removed. Usually, the compiler does not detect LE, so the produced results differ from what the programmer expects. For this reason, discovering and fixing logical errors is very hard and makes a good topic for research and practice. Some LE result from the misuse of class objects, and in the Software Development Life Cycle (SDLC), software with LE is considered low-quality software with a high maintenance cost. In this paper, an object-oriented environment is proposed that allows the programmer to detect and discover LE in order to avoid them. This environment, called Object Behavior Environment (OBEnvironment), enforces the correct use of objects according to their predefined behaviors by using tools like Xceed Component (which appeals to .Net Windows Forms developers for building better applications), Alsing Component (which provides an area where the programmer can write syntactically correct C# code), and Mind Fusion Component (which provides an area where the programmer can draw State Diagrams to show object state).

URL

https://arxiv.org/abs/1712.04189

PDF

https://arxiv.org/pdf/1712.04189
Co-attending Free-form Regions and Detections with Multi-modal Multiplicative Feature Embedding for Visual Question Answering

2017-12-12

Pan Lu, Hongsheng Li, Wei Zhang, Jianyong Wang, Xiaogang Wang

arXiv_CV

arXiv_CV QA Attention Embedding Detection VQA
Abstract

Recently, the Visual Question Answering (VQA) task has gained increasing attention in artificial intelligence. Existing VQA methods mainly adopt the visual attention mechanism to associate the input question with corresponding image regions for effective question answering. The free-form region based and the detection-based visual attention mechanisms are mostly investigated, with the former ones attending free-form image regions and the latter ones attending pre-specified detection-box regions. We argue that the two attention mechanisms are able to provide complementary information and should be effectively integrated to better solve the VQA problem. In this paper, we propose a novel deep neural network for VQA that integrates both attention mechanisms. Our proposed framework effectively fuses features from free-form image regions, detection boxes, and question representations via a multi-modal multiplicative feature embedding scheme to jointly attend question-related free-form image regions and detection boxes for more accurate question answering. The proposed method is extensively evaluated on two publicly available datasets, COCO-QA and VQA, and outperforms state-of-the-art approaches. Source code is available at this https URL.

URL

https://arxiv.org/abs/1711.06794

PDF

https://arxiv.org/pdf/1711.06794
Predicting Yelp Star Reviews Based on Network Structure with Deep Learning

2017-12-11

Luis Perez

arXiv_CV

arXiv_CV Review Caption Image_Classification Classification Deep_Learning
Abstract

In this paper, we tackle the real-world problem of predicting Yelp star-review ratings based on business features (such as images, descriptions), user features (average previous ratings), and, of particular interest, network properties (which businesses a user has rated before). We compare multiple models on different sets of features, from simple linear regression on network features only to deep learning models on network and item features. In recent years, breakthroughs in deep learning have led to increased accuracy in common supervised learning tasks, such as image classification, captioning, and language understanding. However, the idea of combining deep learning with network features and structure appears to be novel. While the problem of predicting future interactions in a network has been studied at length, these approaches have often ignored either node-specific data or global structure. We demonstrate that a mixed approach combining both node-level features and network information can effectively be used to predict Yelp-review star ratings. We evaluate on the Yelp dataset by splitting our data along the time dimension (as would naturally occur in the real world) and comparing our model against others which do not take advantage of the network structure and/or deep learning.
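
Splitting along the time dimension means every training review predates every test review, so the model never sees the future. A minimal sketch of such a split follows; the record layout, the `date` field name, and the 80/20 ratio are assumptions for illustration, not details taken from the paper.

```python
from typing import Any, Dict, List, Tuple

Review = Dict[str, Any]


def temporal_split(reviews: List[Review],
                   train_fraction: float = 0.8,
                   time_key: str = "date") -> Tuple[List[Review], List[Review]]:
    """Sort reviews by time and cut once: the earlier part trains, the later part tests."""
    ordered = sorted(reviews, key=lambda review: review[time_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical records; ISO-8601 date strings sort chronologically as text.
example_reviews = [
    {"user": "u1", "business": "b1", "stars": 4, "date": "2016-03-01"},
    {"user": "u2", "business": "b1", "stars": 2, "date": "2015-07-15"},
    {"user": "u1", "business": "b2", "stars": 5, "date": "2017-01-20"},
    {"user": "u3", "business": "b3", "stars": 3, "date": "2014-11-02"},
    {"user": "u2", "business": "b2", "stars": 1, "date": "2016-09-30"},
]
train_set, test_set = temporal_split(example_reviews)
```

A random split would instead leak future ratings into training, which matters here because the network features (who rated what) grow over time.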

URL

https://arxiv.org/abs/1712.04350

PDF

https://arxiv.org/pdf/1712.04350
On Convergence and Stability of GANs

2017-12-10

Naveen Kodali, Jacob Abernethy, James Hays, Zsolt Kira

arXiv_CV

arXiv_CV GAN
Abstract

We propose studying GAN training dynamics as regret minimization, which is in contrast to the popular view that there is consistent minimization of a divergence between real and generated distributions. We analyze the convergence of GAN training from this new point of view to understand why mode collapse happens. We hypothesize the existence of undesirable local equilibria in this non-convex game to be responsible for mode collapse. We observe that these local equilibria often exhibit sharp gradients of the discriminator function around some real data points. We demonstrate that these degenerate local equilibria can be avoided with a gradient penalty scheme called DRAGAN. We show that DRAGAN enables faster training, achieves improved stability with fewer mode collapses, and leads to generator networks with better modeling performance across a variety of architectures and objective functions.
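
The abstract does not state the penalty formula, so the sketch below only illustrates the general idea of a gradient penalty evaluated near real data points: perturb each real point slightly, measure the norm of the discriminator's gradient there, and penalize its squared deviation from a target value. The penalty form, the Gaussian noise, the constants `noise_std`, `target_norm` and `weight`, and the finite-difference gradient are all assumptions; a real implementation would use automatic differentiation in a deep learning framework.

```python
import math
import random
from typing import Callable, List, Sequence

Point = Sequence[float]


def grad_norm(f: Callable[[List[float]], float], x: Point, eps: float = 1e-5) -> float:
    """Estimate the Euclidean norm of the gradient of f at x by central differences."""
    squared = 0.0
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        partial = (f(hi) - f(lo)) / (2.0 * eps)
        squared += partial * partial
    return math.sqrt(squared)


def gradient_penalty(discriminator: Callable[[List[float]], float],
                     real_points: Sequence[Point],
                     noise_std: float = 0.1,
                     target_norm: float = 1.0,
                     weight: float = 10.0,
                     seed: int = 0) -> float:
    """Average squared deviation of the gradient norm from `target_norm`
    at randomly perturbed real points (assumed penalty form)."""
    rng = random.Random(seed)
    total = 0.0
    for x in real_points:
        perturbed = [xi + rng.gauss(0.0, noise_std) for xi in x]
        total += (grad_norm(discriminator, perturbed) - target_norm) ** 2
    return weight * total / len(real_points)
```

A discriminator whose output changes sharply around the real data receives a large penalty, while one with a moderate slope there receives almost none, which matches the abstract's observation about sharp gradients around real data points.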

URL

https://arxiv.org/abs/1705.07215

PDF

https://arxiv.org/pdf/1705.07215
Dynamics Transfer GAN: Generating Video by Transferring Arbitrary Temporal Dynamics from a Source Video to a Single Target Image

2017-12-10

Wissam J. Baddar, Geonmo Gu, Sangmin Lee, Yong Man Ro

arXiv_CV

arXiv_CV Adversarial GAN
Abstract

In this paper, we propose Dynamics Transfer GAN, a new method for generating video sequences based on generative adversarial learning. The spatial constructs of a generated video sequence are acquired from the target image. The dynamics of the generated video sequence are imported from a source video sequence, with arbitrary motion, and imposed onto the target image. To preserve the spatial construct of the target image, the appearance of the source video sequence is suppressed and only the dynamics are obtained before being imposed onto the target image. That is achieved using the proposed appearance suppressed dynamics feature. Moreover, the spatial and temporal consistencies of the generated video sequence are verified via two discriminator networks. One discriminator validates the fidelity of the generated frames' appearance, while the other validates the dynamic consistency of the generated video sequence. Experiments have been conducted to verify the quality of the video sequences generated by the proposed method. The results verified that Dynamics Transfer GAN successfully transferred arbitrary dynamics of the source video sequence onto a target image when generating the output video sequence. The experimental results also showed that Dynamics Transfer GAN maintained the spatial constructs (appearance) of the target image while generating spatially and temporally consistent video sequences.

URL

https://arxiv.org/abs/1712.03534

PDF

https://arxiv.org/pdf/1712.03534
Integrating both Visual and Audio Cues for Enhanced Video Caption

2017-12-09

Wangli Hao, Zhaoxiang Zhang, He Guan, Guibo Zhu

arXiv_CV

arXiv_CV Video_Caption Caption Inference
Abstract

Video captioning refers to automatically generating a descriptive sentence for a specific short video clip, and it has achieved remarkable success recently. However, most of the existing methods focus more on visual information while ignoring the synchronized audio cues. We propose three multimodal deep fusion strategies to maximize the benefits of visual-audio resonance information. The first one explores the impact on cross-modality feature fusion from low to high order. The second establishes the visual-audio short-term dependency by sharing weights of corresponding front-end networks. The third extends the temporal dependency to long-term through sharing multimodal memory across visual and audio modalities. Extensive experiments have validated the effectiveness of our three cross-modality fusion strategies on two benchmark datasets, Microsoft Research Video to Text (MSRVTT) and Microsoft Video Description (MSVD). It is worth mentioning that sharing weights can coordinate visual-audio feature fusion effectively and achieve state-of-the-art performance on both BLEU and METEOR metrics. Furthermore, we are the first to propose a dynamic multimodal feature fusion framework to deal with the case where some modalities are missing. Experimental results demonstrate that even when audio is absent, we can still obtain comparable results with the aid of the additional audio modality inference module.

URL

https://arxiv.org/abs/1711.08097

PDF

https://arxiv.org/e-print/1711.08097
Video Salient Object Detection via Fully Convolutional Networks

2017-12-09

Wenguan Wang, Jianbing Shen, Ling Shao

arXiv_CV

arXiv_CV Salient Object_Detection CNN Inference Deep_Learning Detection
Abstract

This paper proposes a deep learning model to efficiently detect salient regions in videos. It addresses two important issues: (1) deep video saliency model training in the absence of sufficiently large and pixel-wise annotated video data, and (2) fast video saliency training and detection. The proposed deep video saliency network consists of two modules, for capturing the spatial and temporal saliency information, respectively. The dynamic saliency model, explicitly incorporating saliency estimates from the static saliency model, directly produces spatiotemporal saliency inference without time-consuming optical flow computation. We further propose a novel data augmentation technique that simulates video training data from existing annotated image datasets, which enables our network to learn diverse saliency information and prevents overfitting with the limited number of training videos. Leveraging our synthetic video data (150K video sequences) and real videos, our deep video saliency model successfully learns both spatial and temporal saliency cues, thus producing accurate spatiotemporal saliency estimates. We advance the state-of-the-art on the DAVIS dataset (MAE of .06) and the FBMS dataset (MAE of .07), and do so with much improved speed (2 fps with all steps).

URL

https://arxiv.org/abs/1702.00871

PDF

https://arxiv.org/pdf/1702.00871
Long Text Generation via Adversarial Training with Leaked Information

2017-12-08

Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, Jun Wang

arXiv_CV

arXiv_CV Image_Caption Adversarial GAN Text_Generation Reinforcement_Learning Caption
Abstract

Automatically generating coherent and semantically meaningful text has many applications in machine translation, dialogue systems, image captioning, etc. Recently, by combining with policy gradient, Generative Adversarial Nets (GAN) that use a discriminative model to guide the training of the generative model as a reinforcement learning policy have shown promising results in text generation. However, the scalar guiding signal is only available after the entire text has been generated and lacks intermediate information about text structure during the generative process. As such, it limits their success when the generated text samples are long (more than 20 words). In this paper, we propose a new framework, called LeakGAN, to address the problem of long text generation. We allow the discriminative net to leak its own high-level extracted features to the generative net to further help the guidance. The generator incorporates such informative signals into all generation steps through an additional Manager module, which takes the extracted features of the currently generated words and outputs a latent vector to guide the Worker module for next-word generation. Our extensive experiments on synthetic data and various real-world tasks with Turing test demonstrate that LeakGAN is highly effective in long text generation and also improves the performance in short text generation scenarios. More importantly, without any supervision, LeakGAN is able to implicitly learn sentence structures only through the interaction between Manager and Worker.

URL

https://arxiv.org/abs/1709.08624

PDF

https://arxiv.org/pdf/1709.08624
Artificial Neural Networks that Learn to Satisfy Logic Constraints

2017-12-08

Gadi Pinkas, Shimon Cohen

arXiv_CV

arXiv_CV Knowledge Relation
Abstract

Logic-based problems such as planning, theorem proving, or puzzles typically involve combinatoric search and structured knowledge representation. Artificial neural networks are very successful statistical learners; however, for many years they have been criticized for their weaknesses in representing and processing complex structured knowledge, which is crucial for combinatoric search and symbol manipulation. Two neural architectures are presented which can encode structured relational knowledge in neural activation and store bounded First Order Logic constraints in connection weights. Both architectures learn to search for a solution that satisfies the constraints. Learning is done by unsupervised practicing on problem instances from the same domain, in a way that improves the network-solving speed. No teacher exists to provide answers for the problem instances of the training and test sets. However, the domain constraints are provided as prior knowledge to a loss function that measures the degree of constraint violations. Iterations of activation calculation and learning are executed until a solution that maximally satisfies the constraints emerges on the output units. As a test case, block-world planning problems are used to train networks that learn to plan in that domain, but the techniques proposed could be used more generally, for example in integrating prior symbolic knowledge with statistical learning.
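
One simple way to picture a loss that measures the degree of constraint violation is a soft relaxation of clauses over unit activations in [0, 1]. The sketch below is an assumption made for illustration, not the paper's loss: it uses propositional clauses in conjunctive normal form, a product-based violation score, and hypothetical block-world variable names.

```python
from typing import Dict, List, Tuple

# A literal is (variable name, is_positive); a clause is a disjunction of literals.
Literal = Tuple[str, bool]
Clause = List[Literal]


def clause_violation(clause: Clause, activations: Dict[str, float]) -> float:
    """Soft violation of one clause: the product of (1 - truth) over its literals.

    It is 0 when any literal is fully true and 1 when all are fully false.
    """
    violation = 1.0
    for name, is_positive in clause:
        truth = activations[name] if is_positive else 1.0 - activations[name]
        violation *= 1.0 - truth
    return violation


def constraint_loss(clauses: List[Clause], activations: Dict[str, float]) -> float:
    """Sum of soft violations over all clauses; 0 means every clause is satisfied."""
    return sum(clause_violation(clause, activations) for clause in clauses)


# Hypothetical constraints: (on_a_b OR on_a_table) AND (NOT on_a_b OR clear_a)
example_clauses = [
    [("on_a_b", True), ("on_a_table", True)],
    [("on_a_b", False), ("clear_a", True)],
]
```

Because the loss is computed from the constraints alone, no labelled answers are needed, which is the sense in which the abstract says no teacher exists.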

URL

https://arxiv.org/abs/1712.03049

PDF

https://arxiv.org/pdf/1712.03049
Semi-Supervised Learning with IPM-based GANs: an Empirical Study

2017-12-07

Tom Sercu, Youssef Mroueh

arXiv_CV

arXiv_CV Adversarial GAN Classification
Abstract

We present an empirical investigation of a recent class of Generative Adversarial Networks (GANs) using Integral Probability Metrics (IPM) and their performance for semi-supervised learning. IPM-based GANs like Wasserstein GAN, Fisher GAN and Sobolev GAN have desirable properties in terms of theoretical understanding, training stability, and a meaningful loss. In this work we investigate how the design of the critic (or discriminator) influences the performance in semi-supervised learning. We distill three key take-aways which are important for good SSL performance: (1) the K+1 formulation, (2) avoiding batch normalization in the critic and (3) avoiding gradient penalty constraints on the classification layer.

URL

https://arxiv.org/abs/1712.02505

PDF

https://arxiv.org/pdf/1712.02505
Read All
HyperPower: Power- and Memory-Constrained Hyper-Parameter Optimization for Neural Networks

2017-12-06

Dimitrios Stamoulis, Ermao Cai, Da-Cheng Juan, Diana Marculescu

arXiv_CV

arXiv_CV Optimization
Abstract

While selecting the hyper-parameters of Neural Networks (NNs) has been so far treated as an art, the emergence of more complex, deeper architectures poses increasingly more challenges to designers and Machine Learning (ML) practitioners, especially when power and memory constraints need to be considered. In this work, we propose HyperPower, a framework that enables efficient Bayesian optimization and random search in the context of power- and memory-constrained hyper-parameter optimization for NNs running on a given hardware platform. HyperPower is the first work (i) to show that power consumption can be used as a low-cost, a priori known constraint, and (ii) to propose predictive models for the power and memory of NNs executing on GPUs. Thanks to HyperPower, the number of function evaluations and the best test error achieved by a constraint-unaware method are reached up to 112.99x and 30.12x faster, respectively, while never considering invalid configurations. HyperPower significantly speeds up the hyper-parameter optimization, achieving up to 57.20x more function evaluations compared to constraint-unaware methods for a given time interval, effectively yielding significant accuracy improvements by up to 67.6%.
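
The core idea of treating power as an a priori known constraint can be sketched as a random search that rejects a candidate configuration before training it whenever a predictive model says it would exceed the budget. The sketch below is only an illustration under stated assumptions: `predicted_power_watts`, its coefficients, the search space in `sample_config`, and the `evaluate` callback are all hypothetical and are not taken from HyperPower.

```python
import random


def predicted_power_watts(config):
    """Hypothetical linear power model; the coefficients are made up for illustration."""
    return 40.0 + 0.05 * config["hidden_units"] + 2.0 * config["num_layers"]


def sample_config(rng):
    """Draw one configuration from an assumed (illustrative) search space."""
    return {
        "num_layers": rng.randint(1, 8),
        "hidden_units": rng.choice([64, 128, 256, 512, 1024]),
        "learning_rate": 10 ** rng.uniform(-4, -1),
    }


def constrained_random_search(evaluate, power_budget, num_trials, seed=0):
    """Random search that only spends evaluations on configurations predicted to fit the budget.

    `evaluate(config)` is assumed to train a network and return its test error.
    Returns (best_error, best_config), or None if no feasible configuration was sampled.
    """
    rng = random.Random(seed)
    best = None
    evaluated = 0
    # Cap the number of samples so the loop ends even if the budget is infeasible.
    for _ in range(100 * num_trials):
        if evaluated >= num_trials:
            break
        config = sample_config(rng)
        if predicted_power_watts(config) > power_budget:
            continue  # rejected without any training cost
        error = evaluate(config)
        evaluated += 1
        if best is None or error < best[0]:
            best = (error, config)
    return best
```

Because infeasible candidates cost only a model prediction, every one of the `num_trials` expensive evaluations goes to a configuration that is predicted to be valid.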

URL

https://arxiv.org/abs/1712.02446

PDF

https://arxiv.org/pdf/1712.02446
Read All
A method of immediate detection of objects with a near-zero apparent motion in series of CCD-frames

2017-12-06

V.E. Savanevych, S.V. Khlamov, I.B. Vavilova, A.B. Briukhovetskyi, A.V. Pohorelov, D.E. Mkrtichian, V.I. Kudak, L.K. Pakuliak, E.N. Dikov, R.G. Melnik, V.P. Vlasenko, D.E. Reichart

arXiv_CV

arXiv_CV Detection
Abstract

The paper deals with a computational method for detection of the solar system minor bodies (SSOs), whose inter-frame shifts in series of CCD-frames during the observation are commensurate with the errors in measuring their positions. These objects have velocities of apparent motion between CCD-frames not exceeding three RMS errors ($3\sigma$) of measurements of their positions. About 15% of objects have a near-zero apparent motion in CCD-frames, including objects beyond Jupiter’s orbit as well as asteroids heading straight to the Earth. The proposed method for detection of the object’s near-zero apparent motion in series of CCD-frames is based on the Fisher f-criterion instead of using the traditional decision rules that are based on the maximum likelihood criterion. We analyzed the quality indicators of detection of the object’s near-zero apparent motion applying statistical and in situ modeling techniques in terms of the conditional probability of the true detection of objects with a near-zero apparent motion. The efficiency of the method, implemented as a plugin for the Collection Light Technology (CoLiTec) software for automated asteroid and comet detection, has been demonstrated. Among the objects discovered with this plugin, there was the sungrazing comet C/2012 S1 (ISON). Within 26 minutes of the observation, the comet’s image moved by three pixels in a series of four CCD-frames (the velocity of its apparent motion at the moment of discovery was equal to 0.8 pixels per CCD-frame; the image size on the frame was about five pixels). Further verification in observations of asteroids with a near-zero apparent motion, conducted with small telescopes, has confirmed the efficiency of the method even in bad conditions (strong backlight from the full Moon). So, we recommend applying the proposed method for series of observations with four or more frames.

URL

https://arxiv.org/abs/1712.02425

PDF

https://arxiv.org/pdf/1712.02425
Read All
Multi-channel Encoder for Neural Machine Translation

2017-12-06

Hao Xiong, Zhongjun He, Xiaoguang Hu, Hua Wu

arXiv_CL

arXiv_CL Attention Embedding NMT RNN
Abstract

Attention-based Encoder-Decoder is an effective architecture for neural machine translation (NMT), which typically relies on recurrent neural networks (RNN) to build the blocks that will later be called by the attentive reader during the decoding process. This encoder design yields relatively uniform composition of the source sentence, despite the gating mechanism employed in the encoding RNN. On the other hand, we often want the decoder to take pieces of the source sentence at varying levels suiting its own linguistic structure: for example, we may want to take the entity name in its raw form while taking an idiom as a perfectly composed unit. Motivated by this demand, we propose Multi-channel Encoder (MCE), which enhances encoding components with different levels of composition. More specifically, in addition to the hidden state of encoding RNN, MCE takes 1) the original word embedding for raw encoding with no composition, and 2) a particular design of external memory in Neural Turing Machine (NTM) for more complex composition, while all three encoding strategies are properly blended during decoding. Empirical study on Chinese-English translation shows that our model can improve by 6.52 BLEU points upon a strong open source NMT system: DL4MT1. On the WMT14 English-French task, our single shallow system achieves BLEU=38.8, comparable with the state-of-the-art deep models.

URL

https://arxiv.org/abs/1712.02109

PDF

https://arxiv.org/pdf/1712.02109
Read All
Evaluating the Usability of Automatically Generated Captions for People who are Deaf or Hard of Hearing

2017-12-06

Sushant Kafle, Matt Huenerfauth

arXiv_CV

arXiv_CV Speech_Recognition Caption Relation Recognition
Abstract

The accuracy of Automated Speech Recognition (ASR) technology has improved, but it is still imperfect in many settings. Researchers who evaluate ASR performance often focus on improving the Word Error Rate (WER) metric, but WER has been found to have little correlation with human-subject performance on many applications. We propose a new captioning-focused evaluation metric that better predicts the impact of ASR recognition errors on the usability of automatically generated captions for people who are Deaf or Hard of Hearing (DHH). Through a user study with 30 DHH users, we compared our new metric with the traditional WER metric on a caption usability evaluation task. In a side-by-side comparison of pairs of ASR text output (with identical WER), the texts preferred by our new metric were preferred by DHH participants. Further, our metric had significantly higher correlation with DHH participants’ subjective scores on the usability of a caption, as compared to the correlation between WER metric and participant subjective scores. This new metric could be used to select ASR systems for captioning applications, and it may be a better metric for ASR researchers to consider when optimizing ASR systems.
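
For reference, WER is conventionally computed as the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below shows that standard computation only; it does not implement the new metric proposed in the paper, and the function name and example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "the meeting is not on friday"
minor_error = "a meeting is not on friday"       # article changed
severe_error = "the meeting is now on friday"    # meaning reversed
print(word_error_rate(reference, minor_error))
print(word_error_rate(reference, severe_error))
```

Both hypotheses contain one substitution in six words, so they receive the same WER even though only the second one changes the meaning of the sentence. This is the kind of case in which a caption-focused metric could separate outputs that WER treats as equal.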

URL

https://arxiv.org/abs/1712.02033

PDF

https://arxiv.org/pdf/1712.02033
Read All
ADC Bit Optimization for Spectrum- and Energy-Efficient Millimeter Wave Communications

2017-12-06

Jinseok Choi, Junmo Sung, Brian L. Evans, Alan Gatherer

arXiv_CV

arXiv_CV Optimization
Abstract

A spectrum- and energy-efficient system is essential for millimeter wave communication systems that require large antenna arrays with power-demanding ADCs. We propose an ADC bit allocation (BA) algorithm that solves a minimum mean squared quantization error problem under a power constraint. Unlike existing BA methods that only consider an ADC power constraint, the proposed algorithm considers a total receiver power constraint for a hybrid analog-digital beamforming architecture. The major challenge is the non-linearities in the minimization problem. To address this issue, we first convert the problem into a convex optimization problem through real number relaxation and substitution of ADC resolution switching power with constant average switching power. Then, we derive a closed-form solution by fixing the number of activated radio frequency (RF) chains M. Leveraging this solution, a binary search finds the optimal M and its corresponding optimal solution. We also provide an off-line training and modeling approach to estimate the average switching power. Simulation results validate the spectral and energy efficiency of the proposed algorithm. In particular, existing state-of-the-art digital beamformers can be used in the system in conjunction with the BA algorithm as it makes the quantization error negligible in the low-resolution regime.

URL

https://arxiv.org/abs/1712.02018

PDF

https://arxiv.org/pdf/1712.02018
Read All
Neural Machine Translation by Generating Multiple Linguistic Factors

2017-12-05

Mercedes García-Martínez, Loïc Barrault, Fethi Bougares

arXiv_CL

arXiv_CL NMT Quantitative
Abstract

Factored neural machine translation (FNMT) is founded on the idea of using the morphological and grammatical decomposition of the words (factors) at the output side of the neural network. This architecture addresses two well-known problems occurring in MT, namely the size of the target language vocabulary and the number of unknown tokens produced in the translation. The FNMT system is designed to manage a larger vocabulary and reduce the training time (for systems with equivalent target language vocabulary size). Moreover, we can produce grammatically correct words that are not part of the vocabulary. The FNMT model is evaluated on the IWSLT’15 English-to-French task and compared to the baseline word-based and BPE-based NMT systems. Promising qualitative and quantitative results (in terms of BLEU and METEOR) are reported.

URL

https://arxiv.org/abs/1712.01821

PDF

https://arxiv.org/pdf/1712.01821
Read All
Semi-supervised Learning with GANs: Manifold Invariance with Improved Inference

2017-12-05

Abhishek Kumar, Prasanna Sattigeri, P. Thomas Fletcher

arXiv_CV

arXiv_CV Adversarial GAN Inference
Abstract

Semi-supervised learning methods using Generative Adversarial Networks (GANs) have shown promising empirical success recently. Most of these methods use a shared discriminator/classifier which discriminates real examples from fake while also predicting the class label. Motivated by the ability of the GANs generator to capture the data manifold well, we propose to estimate the tangent space to the data manifold using GANs and employ it to inject invariances into the classifier. In the process, we propose enhancements over existing methods for learning the inverse mapping (i.e., the encoder) which greatly improves in terms of semantic similarity of the reconstructed sample with the input sample. We observe considerable empirical gains in semi-supervised learning over baselines, particularly in the cases when the number of labeled examples is low. We also provide insights into how fake examples influence the semi-supervised learning procedure.

URL

https://arxiv.org/abs/1705.08850

PDF

https://arxiv.org/pdf/1705.08850
Read All
ClusterNet: Detecting Small Objects in Large Scenes by Exploiting Spatio-Temporal Information

2017-12-04

Rodney LaLonde, Dong Zhang, Mubarak Shah

arXiv_CV

arXiv_CV Object_Detection Sparse Attention CNN Detection
Abstract

Object detection in wide area motion imagery (WAMI) has drawn the attention of the computer vision research community for a number of years. WAMI poses a number of unique challenges including extremely small object sizes, both sparse and densely-packed objects, and extremely large search spaces (large video frames). Nearly all state-of-the-art methods in WAMI object detection report that appearance-based classifiers fail in this challenging data and instead rely almost entirely on motion information in the form of background subtraction or frame-differencing. In this work, we experimentally verify the failure of appearance-based classifiers in WAMI, such as Faster R-CNN and a heatmap-based fully convolutional neural network (CNN), and propose a novel two-stage spatio-temporal CNN which effectively and efficiently combines both appearance and motion information to significantly surpass the state-of-the-art in WAMI object detection. To reduce the large search space, the first stage (ClusterNet) takes in a set of extremely large video frames, combines the motion and appearance information within the convolutional architecture, and proposes regions of objects of interest (ROOBI). These ROOBI can contain from one to clusters of several hundred objects due to the large video frame size and varying object density in WAMI. The second stage (FoveaNet) then estimates the centroid location of all objects in that given ROOBI simultaneously via heatmap estimation. The proposed method exceeds state-of-the-art results on the WPAFB 2009 dataset by 5-16% for moving objects and nearly 50% for stopped objects, as well as being the first proposed method in wide area motion imagery to detect completely stationary objects.

URL

https://arxiv.org/abs/1704.02694

PDF

https://arxiv.org/pdf/1704.02694
Read All
Learning by Asking Questions

2017-12-04

Ishan Misra, Ross Girshick, Rob Fergus, Martial Hebert, Abhinav Gupta, Laurens van der Maaten

arXiv_CV

arXiv_CV QA VQA
Abstract

We introduce an interactive learning framework for the development and testing of intelligent visual systems, called learning-by-asking (LBA). We explore LBA in the context of the Visual Question Answering (VQA) task. LBA differs from standard VQA training in that most questions are not observed during training time, and the learner must ask questions it wants answers to. Thus, LBA more closely mimics natural learning and has the potential to be more data-efficient than the traditional VQA setting. We present a model that performs LBA on the CLEVR dataset, and show that it automatically discovers an easy-to-hard curriculum when learning interactively from an oracle. Our LBA-generated data consistently matches or outperforms the CLEVR train data and is more sample efficient. We also show that our model asks questions that generalize to state-of-the-art VQA models and to novel test time distributions.

URL

https://arxiv.org/abs/1712.01238

PDF

https://arxiv.org/pdf/1712.01238
Read All
Mixed-precision training of deep neural networks using computational memory

2017-12-04

Nandakumar S. R., Manuel Le Gallo, Irem Boybat, Bipin Rajendran, Abu Sebastian, Evangelos Eleftheriou

arXiv_CV

arXiv_CV GAN Speech_Recognition Classification Recognition
Abstract

Deep neural networks have revolutionized the field of machine learning by providing unprecedented human-like performance in solving many real-world problems such as image and speech recognition. Training of large DNNs, however, is a computationally intensive task, and this necessitates the development of novel computing architectures targeting this application. A computational memory unit where resistive memory devices are organized in crossbar arrays can be used to locally store the synaptic weights in their conductance states. The expensive multiply accumulate operations can be performed in place using Kirchhoff’s circuit laws in a non-von Neumann manner. However, a key challenge remains the inability to alter the conductance states of the devices in a reliable manner during the weight update process. We propose a mixed-precision architecture that combines a computational memory unit storing the synaptic weights with a digital processing unit and an additional memory unit accumulating weight updates in high precision. The new architecture delivers classification accuracies comparable to those of floating-point implementations without being constrained by challenges associated with the non-ideal weight update characteristics of emerging resistive memories. A two layer neural network in which the computational memory unit is realized using non-linear stochastic models of phase-change memory devices achieves a test accuracy of 97.40% on the MNIST handwritten digit classification problem.

URL

https://arxiv.org/abs/1712.01192

PDF

https://arxiv.org/pdf/1712.01192
Read All
Real-valued Time Series Generation with Recurrent Conditional GANs

2017-12-04

Cristóbal Esteban, Stephanie L. Hyland, Gunnar Rätsch

arXiv_CV

arXiv_CV Adversarial GAN RNN Classification Quantitative
Abstract

Generative Adversarial Networks (GANs) have shown remarkable success as a framework for training models to produce realistic-looking data. In this work, we propose a Recurrent GAN (RGAN) and Recurrent Conditional GAN (RCGAN) to produce realistic real-valued multi-dimensional time series, with an emphasis on their application to medical data. RGANs make use of recurrent neural networks in the generator and the discriminator. In the case of RCGANs, both of these RNNs are conditioned on auxiliary information. We demonstrate our models in a set of toy datasets, where we show visually and quantitatively (using sample likelihood and maximum mean discrepancy) that they can successfully generate realistic time-series. We also describe novel evaluation methods for GANs, where we generate a synthetic labelled training dataset, and evaluate on a real test set the performance of a model trained on the synthetic data, and vice-versa. We illustrate with these metrics that RCGANs can generate time-series data useful for supervised training, with only minor degradation in performance on real test data. This is demonstrated on digit classification from ‘serialised’ MNIST and by training an early warning system on a medical dataset of 17,000 patients from an intensive care unit. We further discuss and analyse the privacy concerns that may arise when using RCGANs to generate realistic synthetic medical time series data.
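
Maximum mean discrepancy (MMD), one of the two quantitative measures named above, compares two sets of samples through a kernel. The sketch below computes the simple biased estimate of squared MMD with an RBF kernel; the choice of estimator, the kernel, the bandwidth, and the toy samples are all assumptions made for illustration and may differ from what the paper uses.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length sequences of floats."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""

    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy "time series" of length 2: real samples, a close imitation, and a poor one.
real = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.1)]
close_fake = [(0.1, 0.1), (0.1, 0.0), (0.0, 0.2)]
far_fake = [(3.0, 3.1), (3.2, 3.0), (3.1, 3.1)]
print(mmd_squared(real, close_fake))
print(mmd_squared(real, far_fake))
```

A generator whose samples resemble the real data should give a value near zero, while clearly mismatched samples give a much larger value.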

URL

https://arxiv.org/abs/1706.02633

PDF

https://arxiv.org/pdf/1706.02633
Read All
Always Lurking: Understanding and Mitigating Bias in Online Human Trafficking Detection

2017-12-03

Kyle Hundman, Thamme Gowda, Mayank Kejriwal, Benedikt Boecking

arXiv_CV

arXiv_CV Sparse Detection
Abstract

Web-based human trafficking activity has increased in recent years but it remains sparsely dispersed among escort advertisements and difficult to identify due to its often-latent nature. The use of intelligent systems to detect trafficking can thus have a direct impact on investigative resource allocation and decision-making, and, more broadly, help curb a widespread social problem. Trafficking detection involves assigning a normalized score to a set of escort advertisements crawled from the Web – a higher score indicates a greater risk of trafficking-related (involuntary) activities. In this paper, we define and study the problem of trafficking detection and present a trafficking detection pipeline architecture developed over three years of research within the DARPA Memex program. Drawing on multi-institutional data, systems, and experiences collected during this time, we also conduct post hoc bias analyses and present a bias mitigation plan. Our findings show that, while automatic trafficking detection is an important application of AI for social good, it also provides cautionary lessons for deploying predictive machine learning algorithms without appropriate de-biasing. This ultimately led to integration of an interpretable solution into a search system that contains over 100 million advertisements and is used by over 200 law enforcement agencies to investigate leads.

URL

https://arxiv.org/abs/1712.00846

PDF

https://arxiv.org/pdf/1712.00846
Read All
Classical Planning in Deep Latent Space: Bridging the Subsymbolic-Symbolic Boundary

2017-12-03

Masataro Asai, Alex Fukunaga

arXiv_CV

arXiv_CV Knowledge Deep_Learning
Abstract

Current domain-independent, classical planners require symbolic models of the problem domain and instance as input, resulting in a knowledge acquisition bottleneck. Meanwhile, although deep learning has achieved significant success in many fields, the knowledge is encoded in a subsymbolic representation which is incompatible with symbolic systems such as planners. We propose LatPlan, an unsupervised architecture combining deep learning and classical planning. Given only an unlabeled set of image pairs showing a subset of transitions allowed in the environment (training inputs), and a pair of images representing the initial and the goal states (planning inputs), LatPlan finds a plan to the goal state in a symbolic latent space and returns a visualized plan execution. The contribution of this paper is twofold: (1) State Autoencoder, which finds a propositional state representation of the environment using a Variational Autoencoder. It generates a discrete latent vector from the images, based on which a PDDL model can be constructed and then solved by an off-the-shelf planner. (2) Action Autoencoder / Discriminator, a neural architecture which jointly finds the action symbols and the implicit action models (preconditions/effects), and provides a successor function for the implicit graph search. We evaluate LatPlan using image-based versions of 3 planning domains: 8-puzzle, Towers of Hanoi and LightsOut.

URL

https://arxiv.org/abs/1705.00154

PDF

https://arxiv.org/pdf/1705.00154
Read All
Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks

2017-12-03

Guohao Li, Hang Su, Wenwu Zhu

arXiv_CV

arXiv_CV Knowledge_Graph Knowledge QA Dynamic_Memory_Network Attention Relation Memory_Networks VQA
Abstract

Visual Question Answering (VQA) has attracted much attention since it offers insight into the relationships between the multi-modal analysis of images and natural language. Most of the current algorithms are incapable of answering open-domain questions that require reasoning beyond the image contents. To address this issue, we propose a novel framework which endows the model with the capability to answer more complex questions by leveraging massive external knowledge with dynamic memory networks. Specifically, the questions along with the corresponding images trigger a process to retrieve the relevant information in external knowledge bases, which are embedded into a continuous vector space by preserving the entity-relation structures. Afterwards, we employ dynamic memory networks to attend to the large body of facts in the knowledge graph and images, and then perform reasoning over these facts to generate corresponding answers. Extensive experiments demonstrate that our model not only achieves the state-of-the-art performance in the visual question answering task, but can also answer open-domain questions effectively by leveraging the external knowledge.

URL

https://arxiv.org/abs/1712.00733

PDF

https://arxiv.org/pdf/1712.00733
Read All
Cascade R-CNN: Delving into High Quality Object Detection

2017-12-03

Zhaowei Cai, Nuno Vasconcelos

arXiv_CV

arXiv_CV Object_Detection Inference Detection
Abstract

In object detection, an intersection over union (IoU) threshold is required to define positives and negatives. An object detector trained with a low IoU threshold, e.g. 0.5, usually produces noisy detections. However, detection performance tends to degrade as the IoU threshold increases. Two main factors are responsible for this: 1) overfitting during training, due to exponentially vanishing positive samples, and 2) inference-time mismatch between the IoUs for which the detector is optimal and those of the input hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, is proposed to address these problems. It consists of a sequence of detectors trained with increasing IoU thresholds, to be sequentially more selective against close false positives. The detectors are trained stage by stage, leveraging the observation that the output of a detector is a good distribution for training the next higher quality detector. The resampling of progressively improved hypotheses guarantees that all detectors have a positive set of examples of equivalent size, reducing the overfitting problem. The same cascade procedure is applied at inference, enabling a closer match between the hypotheses and the detector quality of each stage. A simple implementation of the Cascade R-CNN is shown to surpass all single-model object detectors on the challenging COCO dataset. Experiments also show that the Cascade R-CNN is widely applicable across detector architectures, achieving consistent gains independently of the baseline detector strength. The code will be made available at this https URL.
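
To make the role of the IoU threshold concrete, the sketch below computes IoU for axis-aligned boxes and labels proposals as positive or negative at several thresholds. It is not the Cascade R-CNN itself: the box format `(x_min, y_min, x_max, y_max)`, the helper names, the example boxes, and the threshold values in the loop are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals, ground_truth, threshold):
    """A proposal is positive if it overlaps any ground-truth box with IoU >= threshold."""
    return [any(iou(p, g) >= threshold for g in ground_truth) for p in proposals]


ground_truth = [(0, 0, 10, 10)]
proposals = [(0, 0, 10, 10), (2, 0, 12, 10), (5, 0, 15, 10)]
for threshold in (0.5, 0.6, 0.7):
    labels = label_proposals(proposals, ground_truth, threshold)
    print(threshold, sum(labels), "positives")
```

Raising the threshold leaves fewer proposals labelled positive, which is the shrinking positive set the abstract links to overfitting when a single detector is trained at a high IoU.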

URL

https://arxiv.org/abs/1712.00726

PDF

https://arxiv.org/pdf/1712.00726
Read All
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

2017-12-03

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, Gunhee Kim

arXiv_CV

arXiv_CV QA Attention RNN VQA
Abstract

Vision and language understanding has emerged as a subject undergoing intense study in Artificial Intelligence. Among many tasks in this line of research, visual question answering (VQA) has been one of the most successful ones, where the goal is to learn a model that understands visual content at region-level details and finds their associations with pairs of questions and answers in the natural language form. Despite the rapid progress in the past few years, most existing work in VQA has focused primarily on images. In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways. First, we propose three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly. Next, we introduce a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with our new tasks. Finally, we propose a dual-LSTM based approach with both spatial and temporal attention, and show its effectiveness over conventional VQA techniques through empirical evaluations.

URL

https://arxiv.org/abs/1704.04497

PDF

https://arxiv.org/pdf/1704.04497
Read All
Structured Deep Hashing with Convolutional Neural Networks for Fast Person Re-identification

2017-12-03

Lin Wu, Yang Wang

arXiv_CV

arXiv_CV Re-identification Person_Re-identification CNN Optimization Deep_Learning
Abstract

Given a pedestrian image as a query, the purpose of person re-identification is to identify the correct match from a large collection of gallery images depicting the same person captured by disjoint camera views. The critical challenge is how to construct a robust yet discriminative feature representation to capture the compounded variations in pedestrian appearance. To this end, deep learning methods have been proposed to extract hierarchical features against extreme variability of appearance. However, existing methods in this category generally neglect the efficiency of the matching stage, whereas the searching speed of a re-identification system is crucial in real-world applications. In this paper, we present a novel deep hashing framework with Convolutional Neural Networks (CNNs) for fast person re-identification. Technically, we simultaneously learn both CNN features and hash functions/codes to get robust yet discriminative features and similarity-preserving hash codes. Thereby, person re-identification can be resolved by efficiently computing and ranking the Hamming distances between images. A structured loss function defined over positive pairs and hard negatives is proposed to formulate a novel optimization problem so that fast convergence and a more stable optimized solution can be obtained. Extensive experiments on two benchmarks, CUHK03 and Market-1501, show that the proposed deep architecture is more effective than state-of-the-art methods.
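
The matching step described above, ranking gallery images by Hamming distance between binary codes, can be sketched in a few lines. The codes here are hand-written 8-bit integers standing in for learned hash codes; the function names, the gallery identifiers, and the tie-breaking by identifier are assumptions for illustration.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions in which two integer-encoded hash codes differ."""
    return bin(code_a ^ code_b).count("1")


def rank_gallery(query_code, gallery):
    """Return gallery identifiers sorted by Hamming distance to the query.

    `gallery` maps an identifier to its integer hash code; ties are broken
    by identifier so that the ordering is deterministic.
    """
    return sorted(gallery, key=lambda key: (hamming_distance(query_code, gallery[key]), key))


query = 0b10110010
gallery = {
    "a": 0b10110011,  # differs in 1 bit
    "b": 0b01001101,  # differs in all 8 bits
    "c": 0b10100110,  # differs in 2 bits
}
print(rank_gallery(query, gallery))
```

Each comparison is a bitwise XOR followed by a bit count, which is why matching on hash codes is much cheaper than comparing real-valued feature vectors.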

URL

https://arxiv.org/abs/1702.04179

PDF

https://arxiv.org/pdf/1702.04179
Read All
Labeled Memory Networks for Online Model Adaptation

2017-12-02

Shiv Shankar, Sunita Sarawagi

arXiv_CV

arXiv_CV GAN RNN Classification Memory_Networks
Abstract

Augmenting a neural network with memory that can grow without growing the number of trained parameters is a recent powerful concept with many exciting applications. We propose a design of memory augmented neural networks (MANNs) called Labeled Memory Networks (LMNs) suited for tasks requiring online adaptation in classification models. LMNs organize the memory with classes as the primary key. The memory acts as a second boosted stage following a regular neural network, thereby allowing the memory and the primary network to play complementary roles. Unlike existing MANNs that write to memory for every instance and use LRU based memory replacement, LMNs write only for instances with non-zero loss and use label-based memory replacement. We demonstrate significant accuracy gains on various tasks including word-modelling and few-shot learning. In this paper, we establish their potential in adapting a batch-trained neural network online to domain-relevant labeled data at deployment time. We show that LMNs are better than other MANNs designed for meta-learning. We also found them to be more accurate and faster than state-of-the-art methods of retuning model parameters for adapting to domain-specific labeled data.

URL

https://arxiv.org/abs/1707.01461

PDF

https://arxiv.org/pdf/1707.01461
Read All
Competing charge density wave and antiferromagnetism of metallic atom wires in GaN and ZnO

2017-12-02

Yoon-Gu Kang, Sun-Woo Kim, Jun-Hyung Cho

arXiv_CV

arXiv_CV GAN Face Relation
Abstract

Low-dimensional electron systems often show a delicate interplay between electron-phonon and electron-electron interactions, giving rise to interesting quantum phases such as the charge density wave (CDW) and magnetism. Using the density-functional theory (DFT) calculations with the semilocal and hybrid exchange-correlation functionals as well as the exact-exchange plus correlation in the random-phase approximation (EX + cRPA), we systematically investigate the ground state of the metallic atom wires containing dangling-bond (DB) electrons, fabricated by partially hydrogenating the GaN(10${\overline{1}}$0) and ZnO(10${\overline{1}}$0) surfaces. We find that the CDW or antiferromagnetic (AFM) order has an electronic energy gain due to a band-gap opening, thereby being more stabilized compared to the metallic state. Our semilocal DFT calculation predicts that both DB wires in GaN(10${\overline{1}}$0) and ZnO(10${\overline{1}}$0) have the same CDW ground state, whereas the hybrid DFT and EX+cRPA calculations predict the AFM ground state for the former DB wire and the CDW ground state for the latter one. It is revealed that more localized Ga DB electrons in GaN(10${\overline{1}}$0) prefer the AFM order, while less localized Zn DB electrons in ZnO(10${\overline{1}}$0) the CDW formation. Our findings demonstrate that the drastically different ground states are competing in the DB wires created on the two representative compound semiconductor surfaces.

URL

https://arxiv.org/abs/1709.10251

PDF

https://arxiv.org/pdf/1709.10251
Read All
Multi-Content GAN for Few-Shot Font Style Transfer

2017-12-01

Samaneh Azadi, Matthew Fisher, Vladimir Kim, Zhaowen Wang, Eli Shechtman, Trevor Darrell

arXiv_CV

arXiv_CV GAN Face Style_Transfer
Abstract

In this work, we focus on the challenge of taking partial observations of highly-stylized text and generalizing the observations to generate unobserved glyphs in the ornamented typeface. To generate a set of multi-content images following a consistent style from very few examples, we propose an end-to-end stacked conditional GAN model considering content along channels and style along network layers. Our proposed network transfers the style of given glyphs to the contents of unseen ones, capturing highly stylized fonts found in the real world such as those on movie posters or infographics. We seek to transfer both the typographic stylization (e.g., serifs and ears) as well as the textual stylization (e.g., color gradients and effects). We base our experiments on our collected data set including 10,000 fonts with different styles and demonstrate effective generalization from a very small number of observed glyphs.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1712.00516

PDF

https://arxiv.org/pdf/1712.00516
Read All
Deep Learning Scaling is Predictable, Empirically

2017-12-01

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, Yanqi Zhou

arXiv_CV

arXiv_CV Speech_Recognition NAS Deep_Learning Language_Model Relation Recognition
Abstract

Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state of the art. This paper presents a large-scale empirical characterization of generalization error and model size growth as training sets grow. We introduce a methodology for this measurement and test four machine learning domains: machine translation, language modeling, image processing, and speech recognition. Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents (the “steepness” of the learning curve) that are yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size. These scaling relationships have significant implications for deep learning research, practice, and systems. They can assist model debugging, setting accuracy targets, and decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.
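
The power-law relationship described in the abstract can be illustrated with a small fit. The sketch below is only an illustration under stated assumptions: the function name `fit_power_law`, the synthetic data, and the constants 2.0, 1.5, and -0.3 are all hypothetical and do not come from the paper. It fits error ≈ alpha * m**beta by ordinary least squares in log-log space, where m is the training set size.

```python
import math


def fit_power_law(train_sizes, errors):
    """Fit error ~= alpha * m**beta by least squares on (log m, log error)."""
    xs = [math.log(m) for m in train_sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                           # slope: the power-law exponent
    alpha = math.exp(mean_y - beta * mean_x)   # intercept: the constant factor
    return alpha, beta


# Synthetic learning curves (made-up numbers, not measurements from the paper).
sizes = [1e3, 1e4, 1e5, 1e6]
baseline_errors = [2.0 * m ** -0.3 for m in sizes]
improved_errors = [1.5 * m ** -0.3 for m in sizes]  # shifted curve, same exponent

base_alpha, base_beta = fit_power_law(sizes, baseline_errors)
new_alpha, new_beta = fit_power_law(sizes, improved_errors)
```

In this toy setup the "improved" model lowers alpha while beta stays the same, which mirrors the abstract's claim that model improvements shift the error without changing the exponent.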

Abstract (translated by Google)

URL

https://arxiv.org/abs/1712.00409

PDF

https://arxiv.org/pdf/1712.00409
Read All
ME R-CNN: Multi-Expert R-CNN for Object Detection

2017-12-01

Hyungtae Lee, Sungmin Eum, Heesung Kwon

arXiv_CV

arXiv_CV Object_Detection Detection
Abstract

We introduce Multi-Expert Region-based CNN (ME R-CNN), which is equipped with multiple experts and built on top of the R-CNN framework, known to be one of the state-of-the-art object detection methods. ME R-CNN focuses on better capturing the appearance variations caused by different shapes, poses, and viewing angles. The proposed approach consists of three experts, each responsible for objects with a particular shape: horizontally elongated, square-like, and vertically elongated. In addition to using selective search, which provides a compact yet effective set of regions of interest (RoIs) for object detection, we augment the set by also employing exhaustive search for training only. Incorporating exhaustive search provides complementary advantages: i) it captures the multitude of neighboring RoIs missed by selective search, and thus ii) it provides a significantly larger number of training examples. We show that the ME R-CNN architecture provides a considerable performance increase over the baselines on the PASCAL VOC 07, 12, and MS COCO datasets.
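
The three shape categories named in the abstract can be made concrete with a simple aspect-ratio rule. This is a hypothetical sketch, not the paper's assignment mechanism: the abstract does not say how RoIs are routed to experts, and the function name `assign_expert` and the threshold of 1.5 are assumptions chosen purely for illustration.

```python
def assign_expert(width, height, ratio_threshold=1.5):
    """Map an RoI to one of three shape categories by its aspect ratio.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    if width <= 0 or height <= 0:
        raise ValueError("RoI width and height must be positive")
    aspect = width / height
    if aspect >= ratio_threshold:
        return "horizontal"   # horizontally elongated
    if aspect <= 1.0 / ratio_threshold:
        return "vertical"     # vertically elongated
    return "square"           # square-like


# Example RoIs given as (width, height) in pixels.
rois = [(200, 50), (100, 110), (50, 200)]
labels = [assign_expert(w, h) for w, h in rois]
```

A fixed threshold keeps the example easy to follow; a learned assignment would be a reasonable alternative.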

Abstract (translated by Google)

URL

https://arxiv.org/abs/1704.01069

PDF

https://arxiv.org/pdf/1704.01069
Read All
Sparse Coding on Stereo Video for Object Detection

2017-11-30

Sheng Y. Lundquist, Melanie Mitchell, Garrett T. Kenyon

arXiv_CV

arXiv_CV Object_Detection Sparse CNN Image_Classification Classification Detection
Abstract

Deep Convolutional Neural Networks (DCNNs) require millions of labeled training examples for image classification and object detection tasks, which restricts these models to domains where such datasets are available. In this paper, we explore the use of unsupervised sparse coding applied to stereo-video data to help alleviate the need for large amounts of labeled data. We show that replacing a typical supervised convolutional layer with an unsupervised sparse-coding layer within a DCNN allows for better performance on a car detection task when only a limited number of labeled training examples is available. Furthermore, the network that incorporates sparse coding allows for more consistent performance over varying initializations and orderings of training examples when compared to a fully supervised DCNN. Finally, we compare activations between the unsupervised sparse-coding layer and the supervised convolutional layer, and show that the sparse representation exhibits an encoding that is depth selective, whereas encodings from the convolutional layer do not exhibit such selectivity. These results indicate promise for using unsupervised sparse-coding approaches in real-world computer vision tasks in domains with limited labeled training data.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1705.07144

PDF

https://arxiv.org/pdf/1705.07144
Read All
Multi-Channel CNN-based Object Detection for Enhanced Situation Awareness

2017-11-30

Shuo Liu, Zheng Liu

arXiv_CV

arXiv_CV Object_Detection CNN Transfer_Learning Deep_Learning Detection
Abstract

Object detection is critical for automatic military operations. However, the performance of current object detection algorithms falls short of the requirements of military scenarios. This is mainly because object presence is hard to detect owing to indistinguishable appearance and dramatic changes in object size, which is determined by the distance to the detection sensors. Recent advances in deep learning have achieved promising results in many challenging tasks. The state of the art in object detection is represented by convolutional neural networks (CNNs), such as the Fast R-CNN algorithm. These CNN-based methods significantly improve detection performance on several public generic object detection datasets. However, their performance in detecting small or indistinguishable objects in visible-spectrum images is still insufficient. In this study, we propose a novel detection algorithm for military objects by fusing multi-channel CNNs. We combine spatial, temporal and thermal information by generating a three-channel image, and the channels are fused as CNN feature maps in an unsupervised manner. The backbone of our object detection framework is the Fast R-CNN algorithm, and we use a cross-domain transfer learning technique to fine-tune the CNN model on the generated multi-channel images. In the experiments, we validate the proposed method on images from the SENSIAC (Military Sensing Information Analysis Centre) database and compare it with the state of the art. The experimental results demonstrate the effectiveness of the proposed method in terms of both accuracy and computational efficiency.
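
The idea of packing spatial, temporal and thermal information into one three-channel image can be sketched as follows. The abstract does not specify what each channel contains, so everything here is an assumption for illustration: the visible-light frame stands in for the spatial channel, an absolute frame difference stands in for the temporal channel, and the helper names `stack_channels` and `frame_difference` are hypothetical.

```python
def frame_difference(previous, current):
    """Absolute per-pixel difference of two frames (assumed temporal cue)."""
    return [
        [abs(a - b) for a, b in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(previous, current)
    ]


def stack_channels(spatial, temporal, thermal):
    """Stack three equally sized 2-D grids into one H x W x 3 image."""
    height = len(spatial)
    width = len(spatial[0]) if height else 0
    named = (("spatial", spatial), ("temporal", temporal), ("thermal", thermal))
    for name, channel in named:
        if len(channel) != height or any(len(row) != width for row in channel):
            raise ValueError(f"{name} channel does not match {height}x{width}")
    return [
        [(spatial[r][c], temporal[r][c], thermal[r][c]) for c in range(width)]
        for r in range(height)
    ]


# Tiny 2x2 example with made-up pixel values.
visible_prev = [[10, 10], [10, 10]]
visible_curr = [[10, 40], [10, 10]]
thermal = [[5, 90], [5, 5]]

motion = frame_difference(visible_prev, visible_curr)
image = stack_channels(visible_curr, motion, thermal)
```

Each pixel of `image` is a (spatial, temporal, thermal) triple, so the result has the same layout as an ordinary RGB image and could be fed to a network that expects three input channels.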

Abstract (translated by Google)

URL

https://arxiv.org/abs/1712.00075

PDF

https://arxiv.org/pdf/1712.00075
Read All
Towards High Performance Video Object Detection

2017-11-30

Xizhou Zhu, Jifeng Dai, Lu Yuan, Yichen Wei

arXiv_CV

arXiv_CV Object_Detection Attention Detection
Abstract

There has been significant progress in image object detection in recent years. Nevertheless, video object detection has received little attention, although it is more challenging and more important in practical scenarios. Building on recent work, this paper proposes a unified approach based on the principle of multi-frame end-to-end learning of features and cross-frame motion. Our approach extends prior work with three new techniques and steadily pushes forward the performance envelope (speed-accuracy tradeoff) towards high-performance video object detection.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1711.11577

PDF

https://arxiv.org/pdf/1711.11577
Read All
A novel graph structure for salient object detection based on divergence background and compact foreground

2017-11-30

Chenxing Xia, Hanling Zhang, Keqin Li

arXiv_CV

arXiv_CV Salient Object_Detection Detection Relation
Abstract

In this paper, we propose an efficient and discriminative model for salient object detection. Our method proceeds stepwise, based on both divergence background and compact foreground cues. To effectively enhance the distinction between nodes along object boundaries and the similarity among object regions, a graph is constructed by introducing the concept of a virtual node. To remove incorrect outputs, we introduce a scheme for selecting background seeds and a method for generating compact foreground regions. Unlike prior methods, we calculate the saliency value of each node based on the relationship between that node and the virtual node. To achieve consistent and significant performance improvement, we propose an Extended Manifold Ranking (EMR) algorithm, which subtly combines suppressed / active nodes and mid-level information. Extensive experimental results demonstrate that the proposed algorithm performs favorably against state-of-the-art saliency detection methods in terms of different evaluation metrics on several benchmark datasets.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1711.11266

PDF

https://arxiv.org/pdf/1711.11266
Read All
An Abstract Method Linearization for Detecting Source Code Plagiarism in Object-Oriented Environment

2017-11-29

Oscar Karnalim

arXiv_CV

arXiv_CV Detection
Abstract

Although plagiarizing source code is a trivial task for most CS students, detecting such unethical behavior requires a considerable amount of effort. Thus, several plagiarism detection systems have been developed to handle this issue. This paper extends Karnalim’s work, a low-level approach for detecting Java source code plagiarism, by incorporating abstract method linearization. This extension is incorporated to enhance the accuracy of the low-level approach in terms of detecting plagiarism in an object-oriented environment. According to our evaluation, which was conducted on 23 design-pattern source code pairs, our extended low-level approach is more effective than both the state-of-the-art approach and Karnalim’s approach. On the one hand, when compared to the state-of-the-art approach, our approach generates fewer coincidental similarities and provides more accurate results. On the other hand, when compared to Karnalim’s approach, our approach can, to some extent, generate higher similarity when simple abstract method invocation is incorporated.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1711.10762

PDF

https://arxiv.org/pdf/1711.10762
Read All
From source to target and back: symmetric bi-directional adaptive GAN

2017-11-29

Paolo Russo, Fabio Maria Carlucci, Tatiana Tommasi, Barbara Caputo

arXiv_CV

arXiv_CV Adversarial GAN Quantitative
Abstract

The effectiveness of generative adversarial approaches in producing images according to a specific style or visual domain has recently opened new directions for solving the unsupervised domain adaptation problem. It has been shown that labeled source images can be modified to mimic target samples, making it possible to directly train a classifier in the target domain despite the original lack of annotated data. Inverse mappings from the target to the source domain have also been evaluated, but only by passing through adapted feature spaces, and thus without new image generation. In this paper we propose to better exploit the potential of generative adversarial networks for adaptation by introducing a novel symmetric mapping among domains. We jointly optimize bi-directional image transformations, combining them with target self-labeling. Moreover, we define a new class consistency loss that aligns the generators in the two directions by requiring that the class identity of an image be conserved when it passes through both domain mappings. A detailed qualitative and quantitative analysis of the reconstructed images confirms the power of our approach. By integrating the two domain-specific classifiers obtained with our bi-directional network, we exceed previous state-of-the-art unsupervised adaptation results on four different benchmark datasets.

Abstract (translated by Google)

URL

https://arxiv.org/abs/1705.08824

PDF

https://arxiv.org/pdf/1705.08824
Read All
