In a self-driving car, object detection, object classification, lane detection and object tracking are considered the crucial modules. A task of recent interest is to narrate, from real-time video, the scene captured by a camera fitted in the vehicle. To implement this task effectively, deep learning techniques and automatic video annotation tools are widely used. In the present paper, we compare the techniques available for each module and choose the best algorithm among them using appropriate metrics. For object detection, YOLO and RetinaNet-50 are considered and the better one is chosen based on mean Average Precision (mAP). For object classification, we consider VGG-19 and ResNet-50 and select the better algorithm based on low error rate and good accuracy. For lane detection, Udacity’s ‘Finding Lane Line’ and the deep-learning-based LaneNet algorithm are compared, and the one that more accurately identifies the given lane is chosen for implementation. As far as object tracking is concerned, we compare Udacity’s ‘Object Detection and Tracking’ algorithm and the deep-learning-based Deep SORT algorithm; the better one is chosen based on the accuracy of tracking the same object across many frames and of predicting the movement of objects. Our automatic video annotation tool is found to be 83% accurate when compared against a human annotator. We considered a video with 530 frames, each of resolution 1035 x 1800 pixels; on average, each frame contained about 15 objects. Our annotation tool took 43 minutes on a CPU-based system and 2.58 minutes on a mid-level GPU-based system to process all four modules, whereas the same video took one human annotator nearly 3060 minutes to narrate. We therefore claim that our proposed automatic video annotation tool is reasonably fast (about 1200 times faster than a human annotator on a GPU system) and accurate.
http://arxiv.org/abs/1904.12618
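A quick back-of-the-envelope check of the speedups reported in the abstract above (a minimal sketch; the timing figures are taken directly from the abstract):

    # Timing figures reported in the abstract above.
    human_minutes = 3060   # one human annotator narrating the 530-frame video
    cpu_minutes = 43       # all four modules on a CPU-based system
    gpu_minutes = 2.58     # all four modules on a mid-level GPU-based system

    print(f"GPU speedup over human: {human_minutes / gpu_minutes:.0f}x")  # ~1186x, i.e. about 1200x
    print(f"CPU speedup over human: {human_minutes / cpu_minutes:.0f}x")  # ~71x
    print(f"GPU speedup over CPU:   {cpu_minutes / gpu_minutes:.1f}x")    # ~16.7x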
Recognizing human actions is a core challenge for autonomous systems as they directly share the same space with humans. Systems must be able to recognize and assess human actions in real time. In order to train corresponding data-driven algorithms, a significant amount of annotated training data is required. We demonstrate a pipeline to detect humans, estimate their pose, track them over time and recognize their actions in real time with standard monocular camera sensors. For action recognition, we encode the human pose into a new data format called Encoded Human Pose Image (EHPI) that can then be classified using standard methods from the computer vision community. With this simple procedure we achieve competitive state-of-the-art performance in pose-based action detection and can ensure real-time performance. In addition, we show a use case in the context of autonomous driving to demonstrate how such a system can be trained to recognize human actions using simulation data.
http://arxiv.org/abs/1904.09140
Object detection plays an important role in various visual applications. However, the precision and speed of a detector are usually in conflict. One main reason for fast detectors’ reduced precision is that small objects are hard to detect. To address this problem, we propose a multiple receptive field and small-object-focusing weakly-supervised segmentation network (MRFSWSnet) to achieve fast object detection. In MRFSWSnet, a multiple receptive field (MRF) block attends to the object and its adjacent background at different spatial locations with different weights, enhancing the discriminability of the features. In addition, to improve the accuracy of small object detection, a small-object-focusing weakly-supervised segmentation module, which focuses only on small objects instead of all objects, is integrated into the detection network for auxiliary training. Extensive experiments show the effectiveness of our method on both the PASCAL VOC and MS COCO detection datasets. In particular, with a lower-resolution version of 300x300, MRFSWSnet achieves 80.9% mAP on the VOC2007 test set with an inference speed of 15 milliseconds per frame, which is state-of-the-art among real-time detectors.
http://arxiv.org/abs/1904.12619
Modern e-commerce catalogs contain millions of references, associated with textual and visual information that is of paramount importance for the products to be found via search or browsing. Of particular significance is the book category, where the author name(s) field poses a significant challenge. Indeed, books written by a given author (such as F. Scott Fitzgerald) might be listed under different authors’ names in a catalog due to abbreviations, spelling variants and mistakes, among other causes. To solve this problem at scale, we design a composite system involving open data sources for books as well as machine learning components leveraging deep learning-based techniques for natural language processing. In particular, we use Siamese neural networks for approximate matching with known author names, and sequence-to-sequence learning with neural networks for direct correction of the provided author’s name. We evaluate this approach on product data from the e-commerce website Rakuten France, and find that the top proposal of the system is the normalized author name with 72% accuracy.
http://arxiv.org/abs/1905.01973
In this paper we propose the use of Generative Adversarial Networks (GANs) to generate artificial training data for machine learning tasks. The generation of artificial training data can be extremely useful in situations such as imbalanced data sets, where it performs a role similar to SMOTE or ADASYN. It is also useful when the data contains sensitive information and it is desirable to avoid using the original data set as much as possible (for example, medical data). We test our proposal on benchmark data sets using different network architectures, and show that a Decision Tree (DT) classifier trained on the training data generated by the GAN reached the same (and, surprisingly, sometimes better) accuracy and recall as a DT trained on the original data set.
https://arxiv.org/abs/1904.09135
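A minimal sketch of the evaluation protocol described in the abstract above, assuming a hypothetical sample_from_gan(n) helper that draws labeled samples from an already-trained GAN. The GAN itself is not shown; the stub below merely resamples the training set so the sketch runs end to end:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score, recall_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, random_state=0)  # stand-in benchmark data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    def sample_from_gan(n):
        # Hypothetical stub: in the paper's setting this would draw n labeled
        # samples from a GAN trained on (X_train, y_train).
        idx = np.random.default_rng(0).integers(0, len(X_train), size=n)
        return X_train[idx], y_train[idx]

    X_syn, y_syn = sample_from_gan(len(X_train))
    dt_real = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    dt_syn = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)

    # Compare a DT trained on real data against one trained on GAN output.
    for name, clf in [("real", dt_real), ("synthetic", dt_syn)]:
        pred = clf.predict(X_test)
        print(name, accuracy_score(y_test, pred), recall_score(y_test, pred))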
Answer Set Programming (ASP) is a prominent knowledge representation language with roots in logic programming and non-monotonic reasoning. Biennial ASP competitions are organized in order to furnish challenging benchmark collections and assess the advancement of the state of the art in ASP solving. In this paper, we report on the design and results of the Seventh ASP Competition, jointly organized by the University of Calabria (Italy), the University of Genova (Italy), and the University of Potsdam (Germany), in affiliation with the 14th International Conference on Logic Programming and Non-Monotonic Reasoning (LPNMR 2017). (Under consideration for acceptance in TPLP).
http://arxiv.org/abs/1904.09134
We propose a simple Named Entity Linking system that can be trained from Wikidata only. This demonstrates the strengths and weaknesses of this data source for this task and provides an easily reproducible baseline to compare other systems against. Our model is lightweight to train, to run and to keep synchronous with Wikidata in real time.
http://arxiv.org/abs/1904.09131
Aspect-based sentiment analysis involves the recognition of so called opinion target expressions (OTEs). To automatically extract OTEs, supervised learning algorithms are usually employed which are trained on manually annotated corpora. The creation of these corpora is labor-intensive and sufficiently large datasets are therefore usually only available for a very narrow selection of languages and domains. In this work, we address the lack of available annotated data for specific languages by proposing a zero-shot cross-lingual approach for the extraction of opinion target expressions. We leverage multilingual word embeddings that share a common vector space across various languages and incorporate these into a convolutional neural network architecture for OTE extraction. Our experiments with 5 languages give promising results: We can successfully train a model on annotated data of a source language and perform accurate prediction on a target language without ever using any annotated samples in that target language. Depending on the source and target language pairs, we reach performances in a zero-shot regime of up to 77% of a model trained on target language data. Furthermore, we can increase this performance up to 87% of a baseline model trained on target language data by performing cross-lingual learning from multiple source languages.
http://arxiv.org/abs/1904.09122
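A minimal sketch of the architecture described in the abstract above: a one-dimensional convolutional tagger over pretrained cross-lingual word embeddings that emits per-token B/I/O labels for opinion target spans. The vocabulary size, embedding dimension, and random embedding matrix below are placeholders; in the paper the embeddings share a common multilingual vector space, which is what enables the zero-shot transfer.

    import torch
    import torch.nn as nn

    class OTETagger(nn.Module):
        # Conv1d token tagger over frozen cross-lingual embeddings.
        def __init__(self, embeddings, n_tags=3, kernel=3, channels=128):
            super().__init__()
            self.embed = nn.Embedding.from_pretrained(embeddings, freeze=True)
            self.conv = nn.Conv1d(embeddings.size(1), channels, kernel, padding=kernel // 2)
            self.out = nn.Linear(channels, n_tags)    # B/I/O tags

        def forward(self, token_ids):                  # (batch, seq_len)
            x = self.embed(token_ids).transpose(1, 2)  # (batch, dim, seq_len)
            h = torch.relu(self.conv(x)).transpose(1, 2)
            return self.out(h)                         # (batch, seq_len, n_tags)

    emb = torch.randn(1000, 300)                       # placeholder shared embeddings
    model = OTETagger(emb)
    logits = model(torch.randint(0, 1000, (4, 12)))    # 4 sentences, 12 tokens each
    print(logits.shape)                                # torch.Size([4, 12, 3])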
Segmentation of the pancreas is important for medical image analysis, yet it faces great challenges of class imbalance, background distractions and non-rigid geometrical features. To address these difficulties, we introduce a Deep Q-Network (DQN) driven approach with a deformable U-Net to accurately segment the pancreas by explicitly interacting with contextual information and extracting anisotropic features from the pancreas. The DQN-based model learns a context-adaptive localization policy to produce a visually tightened and precise localization bounding box of the pancreas. Furthermore, the deformable U-Net captures geometry-aware information of the pancreas by learning geometrically deformable filters for feature extraction. Experiments on the NIH dataset validate the effectiveness of the proposed framework in pancreas segmentation.
http://arxiv.org/abs/1904.09120
We present a self-supervised learning approach for optical flow. Our method distills reliable flow estimations from non-occluded pixels, and uses these predictions as ground truth to learn optical flow for hallucinated occlusions. We further design a simple CNN to utilize temporal information from multiple frames for better flow estimation. These two principles lead to an approach that yields the best performance for unsupervised optical flow learning on the challenging benchmarks including MPI Sintel, KITTI 2012 and 2015. More notably, our self-supervised pre-trained model provides an excellent initialization for supervised fine-tuning. Our fine-tuned models achieve state-of-the-art results on all three datasets. At the time of writing, we achieve EPE=4.26 on the Sintel benchmark, outperforming all submitted methods.
http://arxiv.org/abs/1904.09117
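A heavily hedged sketch of the distillation principle described in the abstract above, assuming hypothetical flow_net, occlusion_mask, and hallucinate helpers (the actual method estimates occlusions and hallucinates new ones with specific strategies not shown here):

    import torch

    def self_supervised_flow_loss(flow_net, occlusion_mask, hallucinate, im1, im2):
        # Teacher pass on clean frames; its flow on non-occluded pixels
        # becomes pseudo ground truth (no gradients into the teacher).
        with torch.no_grad():
            flow_teacher = flow_net(im1, im2)
            noc = occlusion_mask(im1, im2)             # 1 where pixels are non-occluded

        # Student pass on frames with hallucinated occlusions.
        im1_h, im2_h, new_occ = hallucinate(im1, im2)  # new_occ: newly occluded pixels
        flow_student = flow_net(im1_h, im2_h)

        # Supervise only where the teacher was reliable but the student's
        # input is occluded, so the student learns flow for occluded regions.
        mask = noc * new_occ
        return (mask * (flow_student - flow_teacher).abs()).sum() / mask.sum().clamp(min=1)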
Over the last years, social robots have been deployed in public environments, making evident the need for human-aware navigation capabilities. In this regard, the robotics community has made efforts to include proxemics and social conventions within navigation approaches. Nevertheless, few works have tackled the problem of labelling humans as interactive agents when they block the robot’s motion trajectory. Current state-of-the-art navigation planners will either propose an alternative path or freeze the motion until the path is free. We present the first prototype of a framework designed to enhance the social competency of robots while navigating in indoor environments. The implementation is done using open-source navigation and object detection software: specifically, the Robot Operating System (ROS) navigation stack, and OpenCV with Caffe deep learning models and the MobileNet Single Shot Detector (SSD), respectively.
http://arxiv.org/abs/1904.09116
Millions of people with severe speech disorders around the world may regain their communication capabilities through techniques of silent speech recognition (SSR). Using electroencephalography (EEG) as a biomarker for speech decoding has been popular in SSR. However, the lack of an SSR text corpus has impeded the development of this technique. Here, we construct a novel task-oriented text corpus for use in the field of SSR. In the process of construction, we propose a task-oriented hybrid construction method based on a natural language generation algorithm. The algorithm focuses on the strategy of data-to-text generation and has two advantages, linguistic quality and high diversity, achieved with a template-based method and deep neural networks, respectively. In an SSR experiment with the generated text corpus, analysis results show that our hybrid construction method outperforms pure methods such as template-based natural language generation or neural natural language generation models.
http://arxiv.org/abs/1905.01974
Visual-to-auditory sensory substitution devices can assist the blind in sensing the visual environment by translating visual information into a sound pattern. To improve the translation quality, the task performance of blind users is usually employed to evaluate different encoding schemes. In contrast to this toilsome human-based assessment, we argue that a machine model can also be developed for evaluation, and more efficiently. To this end, we first propose two distinct cross-modal perception models for the late-blind and congenitally-blind cases, which aim to generate concrete visual contents based on the translated sound. To validate the functionality of the proposed models, two novel optimization strategies for the primary encoding scheme are presented. Further, we conduct sets of human-based experiments to evaluate the encoding schemes and compare the results with the machine-based assessments on the cross-modal generation task. Their highly consistent results across different encoding schemes indicate that using a machine model to accelerate optimization and reduce experimental cost is feasible to some extent, which could dramatically speed up the improvement of encoding schemes and thus help the blind improve their visual perception ability.
http://arxiv.org/abs/1904.09115
We report an experiment to check the identification of a set of words in popular written Portuguese against two versions of a computational dictionary of Brazilian Portuguese, DELAF PB 2004 and DELAF PB 2015. This dictionary is freely available for use in linguistic analyses of Brazilian Portuguese and other research, which justifies critical study. The vocabulary comes from the PorPopular corpus, made of the popular newspapers Diário Gaúcho (DG) and Massa! (MA). From DG, we retained a set of texts with 984,465 words (tokens), published in 2008, with the spelling used before the Portuguese Language Orthographic Agreement adopted in 2009. From MA, we examined papers from 2012, 2014 and 2015, with 215,776 words (tokens), all with the new spelling. The checking involved: a) generating lists of words (types) occurring in DG and MA; b) comparing them with the entry lists of both versions of DELAF PB; c) assessing the coverage of this vocabulary; d) proposing ways of incorporating the items not covered. The results of the work show that an average of 19% of the types in DG were not found in DELAF PB 2004 or 2015. In MA, this average is 13%. Switching versions of the dictionary slightly affected the performance in recognizing the words.
http://arxiv.org/abs/1904.09108
With billions of personal images being generated from social media and cameras of all sorts on a daily basis, security and privacy are unprecedentedly challenged. Although extensive attempts have been made, existing face image de-identification techniques are either insufficient in photo-reality or incapable of balancing privacy and usability qualitatively and quantitatively, i.e., they fail to answer counterfactual questions such as “is it private now?”, “how private is it?”, and “can it be more private?” In this paper, we propose a novel framework called AnonymousNet, with an effort to address these issues systematically, balance usability, and enhance privacy in a natural and measurable manner. The framework encompasses four stages: facial attribute estimation, privacy-metric-oriented face obfuscation, directed natural image synthesis, and adversarial perturbation. Not only do we achieve the state of the art in terms of image quality and attribute prediction accuracy, we are also the first to show that facial privacy is measurable and can be factorized, and accordingly manipulated in a photo-realistic fashion to fulfill different requirements and application scenarios. Experiments further demonstrate the effectiveness of the proposed framework.
http://arxiv.org/abs/1904.12620
Leveraging user-provided translations to constrain NMT has practical significance. Existing methods can be classified into two main categories, namely the use of placeholder tags for lexicon words and the use of hard constraints during decoding. Both methods can hurt translation fidelity for various reasons. We investigate a data augmentation method, making code-switched training data by replacing source phrases with their target translations. Our method does not change the NMT model or decoding algorithm, allowing the model to learn lexicon translations by copying source-side target words. Extensive experiments show that our method achieves consistent improvements over existing approaches, improving translation of constrained words without hurting unconstrained words.
http://arxiv.org/abs/1904.09107
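A minimal sketch of the augmentation step described in the abstract above, with a toy English-to-German lexicon entry standing in for a user-provided constraint:

    def code_switch(source_tokens, lexicon):
        """Replace source phrases found in the lexicon with their target
        translations, producing code-switched training input."""
        out, i = [], 0
        phrases = sorted(lexicon, key=len, reverse=True)  # longest match first
        while i < len(source_tokens):
            for phrase in phrases:
                if tuple(source_tokens[i:i + len(phrase)]) == phrase:
                    out.extend(lexicon[phrase])           # splice in target words
                    i += len(phrase)
                    break
            else:
                out.append(source_tokens[i])
                i += 1
        return out

    lexicon = {("neural", "network"): ["neuronales", "Netz"]}  # toy constraint
    print(code_switch("a neural network translates".split(), lexicon))
    # ['a', 'neuronales', 'Netz', 'translates']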
The identification of pulmonary lobes is of great importance in disease diagnosis and treatment, as several lung diseases exhibit regional disorders at the lobar level. Thus, an accurate segmentation of pulmonary lobes is necessary. In this work, we propose an automated segmentation of pulmonary lobes from chest CT images using coordination-guided deep neural networks. We first employ automated lung segmentation to extract the lung area from the CT image, then exploit a volumetric convolutional neural network (V-net) for segmenting the pulmonary lobes. To reduce the misclassification of different lobes, we adopt coordination-guided convolutional layers (CoordConvs) that generate additional feature maps of the positional information of pulmonary lobes. The proposed model is trained and evaluated on several publicly available datasets and has achieved state-of-the-art accuracy with a mean Dice coefficient index of 0.947 $\pm$ 0.044.
http://arxiv.org/abs/1904.09106
Convolutional neural networks have proven very effective in a variety of image restoration tasks. Most state-of-the-art solutions, however, are trained using images with a single particular degradation level, and can deteriorate drastically when applied to other degradation settings. In this paper, we propose a novel method dubbed deep likelihood network (DL-Net), aiming at generalizing off-the-shelf image restoration networks to succeed over a spectrum of degradation settings while keeping their original learning objectives and core architectures. In particular, we slightly modify the original restoration networks by appending a simple yet effective recursive module, which is derived from a fidelity term for disentangling the effect of degradations. Extensive experimental results on image inpainting, interpolation and super-resolution demonstrate the effectiveness of our DL-Net.
http://arxiv.org/abs/1904.09105
Seen birds twitter and running cars are accompanied by noise: such natural audiovisual correspondences provide possibilities to explore and understand the outside world. However, the mixture of multiple objects and sounds makes it intractable to perform efficient matching in an unconstrained environment. To settle this problem, we propose to adequately excavate audio and visual components and perform elaborate correspondence learning among them. Concretely, a novel unsupervised audiovisual learning model is proposed, named Deep Multimodal Clustering (DMC), that synchronously performs sets of clustering with multimodal vectors of convolutional maps in different shared spaces for capturing multiple audiovisual correspondences. This integrated multimodal clustering network can be trained effectively with a max-margin loss in an end-to-end fashion. Extensive experiments in feature evaluation and on audiovisual tasks are performed. The results demonstrate that DMC can learn effective unimodal representations, with which the classifier can even outperform humans. Further, DMC shows noticeable performance in sound localization, multisource detection, and audiovisual understanding.
http://arxiv.org/abs/1807.03094
Much current study of legged locomotion has rightly focused on foot traction forces, including on granular media. Future legged millirobots will need to go through terrain, such as brush or other vegetation, where the body contact forces significantly affect locomotion. In this work, a (previously developed) low-cost 6-axis force/torque sensing shell is used to measure the interaction forces between a hexapedal millirobot and a set of compliant beams, which act as a surrogate for a densely cluttered environment. Experiments with a VelociRoACH robotic platform are used to measure lift and drag forces on the tactile shell, where negative lift forces can increase traction, even while drag forces increase. The drag energy and specific resistance required to pass through dense terrains can be measured. Furthermore, some contact between the robot and the compliant beams can lower specific resistance of locomotion. For small, light-weight legged robots in the beam environment, the body motion depends on both leg-ground and body-beam forces. A shell-shape which reduces drag but increases negative lift, such as the half-ellipsoid used, is suggested to be advantageous for robot locomotion in this type of environment.
http://arxiv.org/abs/1904.09101
In this paper, a new deep learning architecture for stereo disparity estimation is proposed. The proposed atrous multiscale network (AMNet) adopts an efficient feature extractor with depthwise-separable convolutions and an extended cost volume that deploys novel stereo matching costs on the deep features. A stacked atrous multiscale network is proposed to aggregate rich multiscale contextual information from the cost volume which allows for estimating the disparity with high accuracy at multiple scales. AMNet can be further modified to be a foreground-background aware network, FBA-AMNet, which is capable of discriminating between the foreground and the background objects in the scene at multiple scales. An iterative multitask learning method is proposed to train FBA-AMNet end-to-end. The proposed disparity estimation networks, AMNet and FBA-AMNet, show accurate disparity estimates and advance the state of the art on the challenging Middlebury, KITTI 2012, KITTI 2015, and Sceneflow stereo disparity estimation benchmarks.
http://arxiv.org/abs/1904.09099
While deep convolutional neural networks (CNNs) have achieved impressive success in image denoising with additive white Gaussian noise (AWGN), their performance remains limited on real-world noisy photographs. The main reason is that their learned models easily overfit to the simplified AWGN model, which deviates severely from the complicated real-world noise model. In order to improve the generalization ability of deep CNN denoisers, we suggest training a convolutional blind denoising network (CBDNet) with a more realistic noise model and real-world noisy-clean image pairs. On the one hand, both signal-dependent noise and the in-camera signal processing pipeline are considered to synthesize realistic noisy images. On the other hand, real-world noisy photographs and their nearly noise-free counterparts are also included to train our CBDNet. To further provide an interactive strategy to conveniently rectify the denoising result, a noise estimation subnetwork with asymmetric learning to suppress under-estimation of the noise level is embedded into CBDNet. Extensive experimental results on three datasets of real-world noisy photographs clearly demonstrate the superior performance of CBDNet over state-of-the-art methods in terms of quantitative metrics and visual quality. The code has been made available at https://github.com/GuoShi28/CBDNet.
http://arxiv.org/abs/1807.04686
Semantic segmentation, a pixel-level vision task, has developed rapidly with the use of convolutional neural networks (CNNs). Training CNNs requires a large amount of labeled data, but manually annotating data is difficult. To reduce the manual labeling effort, some synthetic datasets have been released in recent years. However, they still differ from real scenes, so a model trained on synthetic data (the source domain) does not achieve good performance on real urban scenes (the target domain). In this paper, we propose a weakly supervised adversarial domain adaptation to improve segmentation performance from synthetic data to real scenes, which consists of three deep neural networks. To be specific, a detection and segmentation (“DS” for short) model focuses on detecting objects and predicting the segmentation map; a pixel-level domain classifier (“PDC” for short) tries to distinguish which domain image features come from; and an object-level domain classifier (“ODC” for short) discriminates which domain objects come from and predicts their classes. PDC and ODC are treated as the discriminators, and DS is considered the generator. Through adversarial learning, DS is supposed to learn domain-invariant features. In experiments, our proposed method sets a new record on the mIoU metric for this problem.
http://arxiv.org/abs/1904.09092
Programmers typically organize executable source code using high-level coding patterns or idiomatic structures such as nested loops, exception handlers and recursive blocks, rather than as individual code tokens. In contrast, state-of-the-art semantic parsers still map natural language instructions to source code by building the code syntax tree one node at a time. In this paper, we introduce an iterative method to extract code idioms from large source code corpora by repeatedly collapsing the most frequent depth-2 subtrees of their syntax trees, and we train semantic parsers to apply these idioms during decoding. We apply this idiom-based code generation to a recent context-dependent semantic parsing task, and improve the state of the art by 2.2% BLEU score while reducing training time by more than 50%. This improved speed enables us to scale up the model by training on an extended training set that is 5x larger, further improving the state of the art by an additional 2.3% BLEU and 0.9% exact match.
http://arxiv.org/abs/1904.09086
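A minimal sketch of the idiom-mining step described in the abstract above, under one plausible reading of "depth-2 subtree" for Python code using the standard ast module: each subtree is summarized as a node type plus the types of its direct children, and the most frequent skeletons become candidates for collapsing into idioms.

    import ast
    import textwrap
    from collections import Counter

    def depth2_subtrees(source):
        """Yield (node_type, child_types) pairs -- depth-2 subtree skeletons."""
        for node in ast.walk(ast.parse(source)):
            children = tuple(type(c).__name__ for c in ast.iter_child_nodes(node))
            if children:
                yield (type(node).__name__, children)

    code = textwrap.dedent("""
        for i in range(10):
            if i % 2 == 0:
                print(i)
    """)
    counts = Counter(depth2_subtrees(code))
    for skeleton, n in counts.most_common(3):
        print(n, skeleton)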
LiDAR (Light Detection And Ranging) is an essential and widely adopted sensor for autonomous vehicles, particularly for those operating at higher levels (L4-L5) of autonomy. Recent work has demonstrated the promise of deep-learning approaches for LiDAR-based detection. However, deep-learning algorithms are extremely data hungry, requiring large amounts of labeled point-cloud data for training and evaluation. Annotating LiDAR point-cloud data is challenging due to the following issues: 1) a LiDAR point cloud is usually sparse and has low resolution, making it difficult for human annotators to recognize objects; 2) compared to annotation on 2D images, drawing 3D bounding boxes or even point-wise labels on LiDAR point clouds is more complex and time-consuming; 3) LiDAR data are usually collected in sequences, so consecutive frames are highly correlated, leading to repeated annotations. To tackle these challenges, we propose LATTE, an open-source annotation tool for LiDAR point clouds. LATTE features the following innovations: 1) Sensor fusion: we utilize image-based detection algorithms to automatically pre-label a calibrated image and transfer the labels to the point cloud. 2) One-click annotation: instead of drawing 3D bounding boxes or point-wise labels, we simplify the annotation to just one click on the target object and automatically generate the bounding box for the target. 3) Tracking: we integrate tracking into sequence annotation such that labels can be transferred from one frame to subsequent ones, significantly reducing repeated labeling. Experiments show the proposed features accelerate annotation by 6.2x and significantly improve label quality, with 23.6% and 2.2% higher instance-level precision and recall, and 2.0% higher bounding box IoU. LATTE is open-sourced at https://github.com/bernwang/latte.
http://arxiv.org/abs/1904.09085
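A minimal sketch of the sensor-fusion pre-labeling step described in the abstract above: projecting lidar points into a calibrated camera image so that 2D detections can be transferred back to the point cloud. The intrinsic and extrinsic matrices below are placeholders for real calibration data.

    import numpy as np

    def project_lidar_to_image(points, T_cam_lidar, K):
        """Project Nx3 lidar points to pixel coordinates.
        T_cam_lidar: 4x4 extrinsics (lidar -> camera); K: 3x3 intrinsics."""
        pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
        cam = (T_cam_lidar @ pts_h.T)[:3]                       # camera frame
        in_front = cam[2] > 0                                   # keep points with z > 0
        uv = K @ cam[:, in_front]
        uv = uv[:2] / uv[2]                                     # perspective divide
        return uv.T, in_front

    K = np.array([[700.0, 0, 640], [0, 700.0, 360], [0, 0, 1]])  # toy pinhole camera
    points = np.array([[0.5, 0.2, 10.0], [1.0, -0.3, 5.0]])
    uv, mask = project_lidar_to_image(points, np.eye(4), K)
    print(uv)  # pixel coordinates of the points in front of the camera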
Most scene graph parsers use a two-stage pipeline to detect visual relationships: the first stage detects entities, and the second predicts the predicate for each entity pair using a softmax distribution. We find that such pipelines, trained with only a cross entropy loss over predicate classes, suffer from two common errors. The first, Entity Instance Confusion, occurs when the model confuses multiple instances of the same type of entity (e.g. multiple cups). The second, Proximal Relationship Ambiguity, arises when multiple subject-predicate-object triplets appear in close proximity with the same predicate, and the model struggles to infer the correct subject-object pairings (e.g. mis-pairing musicians and their instruments). We propose a set of contrastive loss formulations that specifically target these types of errors within the scene graph parsing problem, collectively termed the Graphical Contrastive Losses. These losses explicitly force the model to disambiguate related and unrelated instances through margin constraints specific to each type of confusion. We further construct a relationship detector, called RelDN, using the aforementioned pipeline to demonstrate the efficacy of our proposed losses. Our model outperforms the winning method of the OpenImages Relationship Detection Challenge by 4.7\% (16.5\% relative) on the test set. We also show improved results over the best previous methods on the Visual Genome and Visual Relationship Detection datasets.
http://arxiv.org/abs/1903.02728
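A minimal sketch of the margin-based contrastive idea described in the abstract above (not the paper's exact formulation): the score of a correct subject-object pairing for a predicate is pushed above the score of a confusable pairing by at least a fixed margin.

    import torch

    def contrastive_margin_loss(pos_scores, neg_scores, margin=0.2):
        # Hinge loss: each positive (correct pairing) must beat its confusable
        # negative (e.g. another instance of the same entity class) by `margin`.
        return torch.clamp(margin + neg_scores - pos_scores, min=0).mean()

    pos = torch.tensor([0.9, 0.7])  # scores for correct subject-object pairs
    neg = torch.tensor([0.8, 0.2])  # scores for confusable pairs
    print(contrastive_margin_loss(pos, neg))  # tensor(0.0500): only the first pair violates the margin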
Recently, it has been shown that in super-resolution, there exists a tradeoff relationship between the quantitative and perceptual quality of super-resolved images, which correspond to the similarity to the ground-truth images and the naturalness, respectively. In this paper, we propose a novel super-resolution method that can improve the perceptual quality of the upscaled images while preserving the conventional quantitative performance. The proposed method employs a deep network for multi-pass upscaling in company with a discriminator network and two quantitative score predictor networks. Experimental results demonstrate that the proposed method achieves a good balance of the quantitative and perceptual quality, showing more satisfactory results than existing methods.
http://arxiv.org/abs/1809.04789
Pretrained contextual representation models (Peters et al., 2018; Devlin et al., 2018) have pushed forward the state of the art on many NLP tasks. A new release of BERT (Devlin, 2018) includes a model simultaneously pretrained on 104 languages with impressive performance for zero-shot cross-lingual transfer on a natural language inference task. This paper explores the broader cross-lingual potential of mBERT (multilingual BERT) as a zero-shot language transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification, NER, POS tagging, and dependency parsing. We compare mBERT with the best-published methods for zero-shot cross-lingual transfer and find mBERT competitive on each task. Additionally, we investigate the most effective strategy for utilizing mBERT in this manner, determine to what extent mBERT generalizes away from language-specific features, and measure factors that influence cross-lingual transfer.
http://arxiv.org/abs/1904.09077
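A minimal sketch of the zero-shot recipe described in the abstract above, using the HuggingFace transformers library (the checkpoint name is the public mBERT release; the three-way head and German example are illustrative): fine-tune on labeled data in one language, then run inference directly on another.

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "bert-base-multilingual-cased"  # public mBERT checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

    # ... fine-tune `model` on English NLI pairs here (training loop omitted) ...

    # Zero-shot inference on a target language the model never saw labels for.
    batch = tok("Das ist ein Beispiel.", "Dies ist ein Test.",
                return_tensors="pt", truncation=True)
    print(model(**batch).logits.argmax(-1))  # predicted NLI label for the German pair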
In this paper we present our approach and the system description for Sub-task A of SemEval 2019 Task 9: Suggestion Mining from Online Reviews and Forums. Given a sentence, the task asks to predict whether or not the sentence contains a suggestion. Our model is based on Universal Language Model Fine-tuning for Text Classification. We apply various pre-processing techniques before training the language and classification models. We further provide a detailed analysis of the results obtained using the trained model. Our team ranked 10th out of 34 participants, achieving an F1 score of 0.7011. We publicly share our implementation at https://github.com/isarth/SemEval9_MIDAS
http://arxiv.org/abs/1904.09076
Deep Learning (DL) approaches have been providing state-of-the-art performance across modalities in the field of medical imaging, including Digital Pathology Image Analysis (DPIA). Among the many DL approaches, Deep Convolutional Neural Network (DCNN) techniques provide superior performance for classification, segmentation, and detection tasks. Most DPIA problems can be addressed with classification, segmentation, or detection approaches; in addition, pre- and post-processing methods are sometimes applied for specific types of problems. Recently, different DCNN models, including the Inception Residual Recurrent CNN (IRRCNN), Densely Connected Recurrent Convolutional Network (DCRCN), Recurrent Residual U-Net (R2U-Net), and the R2U-Net based regression model (UD-Net), have been proposed and provide state-of-the-art performance for different computer vision and medical image analysis tasks. However, these advanced DCNN models had not been explored for solving problems related to DPIA. In this study, we apply these DCNN techniques to DPIA problems and evaluate them on publicly available benchmark datasets for seven different tasks in digital pathology: lymphoma classification, Invasive Ductal Carcinoma (IDC) detection, nuclei segmentation, epithelium segmentation, tubule segmentation, lymphocyte detection, and mitosis detection. The experimental results are evaluated with different performance metrics such as sensitivity, specificity, accuracy, F1-score, the Receiver Operating Characteristic (ROC) curve, the Dice coefficient (DC), and Mean Squared Error (MSE). The results demonstrate superior performance for classification, segmentation, and detection tasks compared to existing machine learning and DCNN-based approaches.
http://arxiv.org/abs/1904.09075
Deep learning model developers often use cloud GPU resources to experiment with large data and models that need expensive setups. However, this practice raises privacy concerns. Adversaries may be interested in: 1) personally identifiable information or objects encoded in the training images, and 2) the models trained with sensitive data, to launch model-based attacks. Learning deep neural networks (DNNs) from encrypted data is still impractical due to the large training data and the expensive learning process. A few recent studies have tried to provide efficient, practical solutions to protect data privacy in outsourced deep learning. However, we find that they are vulnerable under certain attacks. In this paper, we specifically identify two types of unique attacks on outsourced deep learning: 1) the visual re-identification attack on the training data, and 2) the class membership attack on the learned models, both of which can break existing privacy-preserving solutions. We develop an image disguising approach to address these attacks and design a suite of methods to evaluate the level of attack resilience of a privacy-preserving solution for outsourced deep learning. The experimental results show that our image-disguising mechanisms can provide a high level of protection against the two attacks while still generating high-quality DNN models for image classification.
http://arxiv.org/abs/1902.01878
Computing author intent from multimodal data like Instagram posts requires modeling a complex relationship between text and image. For example, a caption might reflect ironically on the image, so neither the caption nor the image is a mere transcript of the other. Instead they combine – via what has been called meaning multiplication – to create a new meaning that has a more complex relation to the literal meanings of text and image. Here we introduce a multimodal dataset of 1,299 Instagram posts labeled for three orthogonal taxonomies: the authorial intent behind the image-caption pair, the contextual relationship between the literal meanings of the image and caption, and the semiotic relationship between the signified meanings of the image and caption. We build a baseline deep multimodal classifier to validate the taxonomy, showing that employing both text and image improves intent detection by 8% compared to using only the image modality, demonstrating the commonality of non-intersective meaning multiplication. Our dataset offers an important resource for the study of the rich meanings that result from pairing text and image.
http://arxiv.org/abs/1904.09073
In this paper we present our approach and the system description for Sub-task A and Sub-task B of SemEval 2019 Task 6: Identifying and Categorizing Offensive Language in Social Media. Sub-task A involves identifying whether a given tweet is offensive, and Sub-task B involves detecting whether an offensive tweet is targeted towards someone (a group or an individual). Our model for Sub-task A is based on an ensemble of a Convolutional Neural Network, a Bidirectional LSTM with attention, and a Bidirectional LSTM + Bidirectional GRU, whereas for Sub-task B we rely on a set of heuristics derived from the training data and manual observation. We provide a detailed analysis of the results obtained using the trained models. Our team ranked 5th out of 103 participants in Sub-task A, achieving a macro F1 score of 0.807, and 8th out of 75 participants in Sub-task B, achieving a macro F1 of 0.695.
http://arxiv.org/abs/1904.09072
Intelligent personal assistant systems, with either text-based or voice-based conversational interfaces, are becoming increasingly popular. Most previous research has used either retrieval-based or generation-based methods. Retrieval-based methods have the advantage of returning fluent and informative responses with great diversity. The retrieved responses are easier to control and explain. However, the response retrieval performance is limited by the size of the response repository. On the other hand, although generation-based methods can return highly coherent responses given conversation context, they are likely to return universal or general responses with insufficient ground knowledge information. In this paper, we build a hybrid neural conversation model with the capability of both response retrieval and generation, in order to combine the merits of these two types of methods. Experimental results on Twitter and Foursquare data show that the proposed model can outperform both retrieval-based methods and generation-based methods (including a recently proposed knowledge-grounded neural conversation model) under both automatic evaluation metrics and human evaluation. Our models and research findings provide new insights on how to integrate text retrieval and text generation models for building conversation systems.
http://arxiv.org/abs/1904.09068
Consider a collaborative task that requires communication. Two agents are placed in an environment and must create a language from scratch in order to coordinate. Recent work has been interested in what kinds of languages emerge when deep reinforcement learning agents are put in such a situation, and in particular in the factors that cause language to be compositional, i.e., such that meaning is expressed by combining words which themselves have meaning. Evolutionary linguists have also studied the emergence of compositional language for decades, and they find that, in addition to structural priors like those already studied in deep learning, the dynamics of transmitting language from generation to generation contribute significantly to the emergence of compositionality. In this paper, we introduce these cultural evolutionary dynamics into language emergence by periodically replacing agents in a population to create a knowledge gap, implicitly inducing cultural transmission of language. We show that this implicit cultural transmission encourages the resulting languages to exhibit better compositional generalization, and we suggest how elements of cultural dynamics can be further integrated into populations of deep agents.
http://arxiv.org/abs/1904.09067
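A minimal sketch of the replacement dynamic described in the abstract above, with hypothetical make_agent and train_step helpers: periodically re-initializing one agent in the population creates the knowledge gap that induces cultural transmission of the emergent language.

    import random

    def cultural_evolution(make_agent, train_step, n_agents=4,
                           total_steps=10_000, replace_every=500):
        """Train a population; every `replace_every` steps, replace one agent
        with a fresh random initialization so the language must be re-taught."""
        population = [make_agent() for _ in range(n_agents)]
        for step in range(1, total_steps + 1):
            speaker, listener = random.sample(population, 2)  # random pairing
            train_step(speaker, listener)                     # one communication game
            if step % replace_every == 0:
                idx = random.randrange(n_agents)
                population[idx] = make_agent()                # newborn agent
        return population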
Haze degrades content and obscures information of images, which can negatively impact vision-based decision-making in real-time systems. In this paper, we propose an efficient fully convolutional neural network (CNN) image dehazing method designed to run on edge graphical processing units (GPUs). We utilize three variants of our architecture to explore the dependency of dehazed image quality on parameter count and model design. The first two variants presented, a small and big version, make use of a single efficient encoder–decoder convolutional feature extractor. The final variant utilizes a pair of encoder–decoders for atmospheric light and transmission map estimation. Each variant ends with an image refinement pyramid pooling network to form the final dehazed image. For the big variant of the single-encoder network, we demonstrate state-of-the-art performance on the NYU Depth dataset. For the small variant, we maintain competitive performance on the super-resolution O/I-HAZE datasets without the need for image cropping. Finally, we examine some challenges presented by the Dense-Haze dataset when leveraging CNN architectures for dehazing of dense haze imagery and examine the impact of loss function selection on image quality. Benchmarks are included to show the feasibility of introducing this approach into real-time systems.
http://arxiv.org/abs/1904.09059
We propose a learning framework named Feature Fusion Learning (FFL) that efficiently trains a powerful classifier through a fusion module which combines the feature maps generated by parallel neural networks. Specifically, we train a number of parallel neural networks as sub-networks, then combine the feature maps from each sub-network using a fusion module to create a more meaningful feature map. The fused feature map is passed into the fused classifier for overall classification. Unlike existing feature fusion methods, in our framework an ensemble of sub-network classifiers transfers its knowledge to the fused classifier, and the fused classifier then delivers its knowledge back to each sub-network, mutually teaching one another in an online knowledge distillation manner. This mutual teaching system not only improves the performance of the fused classifier but also yields performance gains in each sub-network. Moreover, our framework has the additional benefit that different types of networks can be used for the sub-networks. We have performed a variety of experiments on multiple datasets such as CIFAR-10, CIFAR-100 and ImageNet and shown that our method is more effective than alternative methods in terms of the performance of both the sub-networks and the fused classifier.
http://arxiv.org/abs/1904.09058
Unsupervised relation discovery aims to discover new relations from a given text corpus without annotated data. However, it does not consider existing human-annotated knowledge bases even when they are relevant to the relations to be discovered. In this paper, we study the problem of how to use out-of-relation knowledge bases to supervise the discovery of unseen relations, where out-of-relation means that the relations to discover from the text corpus and those in the knowledge bases do not overlap. We construct a set of constraints between entity pairs based on the knowledge base embedding and then incorporate these constraints into relation discovery via a variational auto-encoder based algorithm. Experiments show that our new approach can improve the state-of-the-art relation discovery performance by a large margin.
http://arxiv.org/abs/1905.01959
Thumbnails are widely used all over the world as a preview for digital images. In this work we propose a deep neural framework to generate thumbnails of any size and aspect ratio, even for unseen values during training, with high accuracy and precision. We use Global Context Aggregation (GCA) and a modified Region Proposal Network (RPN) with adaptive convolutions to generate thumbnails in real time. GCA is used to selectively attend and aggregate the global context information from the entire image while the RPN is used to predict candidate bounding boxes for the thumbnail image. Adaptive convolution eliminates the problem of generating thumbnails of various aspect ratios by using filter weights dynamically generated from the aspect ratio information. The experimental results indicate the superior performance of the proposed model over existing state-of-the-art techniques.
http://arxiv.org/abs/1810.13054
Search applications often display shortened sentences which must contain certain query terms and must fit within the space constraints of a user interface. This work introduces a new transition-based sentence compression technique developed for such settings. Our method constructs length and lexically constrained compressions in linear time, by growing a subgraph in the dependency parse of a sentence. This approach achieves a 4x speed up over baseline ILP compression techniques, and better reconstructs gold shortenings under constraints. Such efficiency gains permit constrained compression of multiple sentences, without unreasonable lag.
http://arxiv.org/abs/1904.09051
Because of image blurring, image deconvolution is often used for studying biological structures in fluorescence microscopy. Fluorescence microscopy image volumes inherently suffer from intensity inhomogeneity and blur, and are corrupted by various types of noise which degrade image quality at deeper tissue depths. Therefore, quantitative analysis of fluorescence microscopy in deeper tissue still remains a challenge. This paper presents a three-dimensional blind image deconvolution method for fluorescence microscopy using 3-way spatially constrained cycle-consistent adversarial networks. The restored volumes of the proposed deconvolution method and other well-known deconvolution methods, denoising methods, and an inhomogeneity correction method are visually and numerically evaluated. Experimental results indicate that the proposed method can restore and improve the quality of blurred and noisy deep-depth microscopy images, both visually and quantitatively.
http://arxiv.org/abs/1904.09974
Sequence-to-sequence (S2S) modeling is becoming a popular paradigm for automatic speech recognition (ASR) because of its ability to jointly optimize all the conventional ASR components in an end-to-end (E2E) fashion. This paper extends the ability of E2E ASR from standard close-talk to far-field applications by encompassing entire multichannel speech enhancement and ASR components within the S2S model. There have been previous studies on jointly optimizing neural beamforming alongside E2E ASR for denoising. It is clear from both recent challenge outcomes and successful products that far-field systems would be incomplete without solving both denoising and dereverberation simultaneously. This paper proposes a novel architecture for far-field ASR by composing neural extensions of dereverberation and beamforming modules with the S2S ASR module as a single differentiable neural network and also clearly defining the role of each subnetwork. To our knowledge, this is the first successful demonstration of such a system, which we term DFTnet (dry, focus, and transcribe). It achieves better performance than conventional pipeline methods on the DIRHA English dataset and comparable performance on the REVERB dataset. It also has additional advantages of being neither iterative nor requiring parallel noisy and clean speech data.
http://arxiv.org/abs/1904.09049
Current state-of-the-art object detection algorithms still suffer from an imbalanced distribution of training data over object classes and background. Recent work introduced a new loss function called focal loss to mitigate this problem, but at the cost of an additional hyperparameter. Manually tuning this hyperparameter for each training task is highly time-consuming. With automated focal loss, we introduce a new loss function which substitutes this hyperparameter with a parameter that is automatically adapted during training and controls the amount of focusing on hard training examples. We show on the COCO benchmark that this leads to up to 30% faster training convergence. We further introduce a focal regression loss which, on the more challenging task of 3D vehicle detection, outperforms other loss functions by up to 1.8 AOS and can be used as a value-range-independent metric for regression.
http://arxiv.org/abs/1904.09048
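For reference, a minimal sketch of the standard (binary) focal loss that the paper builds on; the paper's contribution is to adapt the focusing parameter gamma automatically during training rather than fixing it by hand, which is not shown here.

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0):
        """FL(p_t) = -(1 - p_t)^gamma * log(p_t); `gamma` is the hyperparameter
        the paper replaces with an automatically adapted parameter."""
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-ce)                 # probability assigned to the true class
        return ((1 - p_t) ** gamma * ce).mean()

    logits = torch.tensor([2.0, -1.0, 0.5])
    targets = torch.tensor([1.0, 0.0, 1.0])
    print(focal_loss(logits, targets))       # easy examples are down-weighted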
Rough inclusion functions (RIFs) are known by many other names in formal approaches to vagueness, belief, and uncertainty. Their use is often poorly grounded in factual knowledge or involves wild statistical assumptions. The concept of contamination, introduced and studied by the present author across a number of her papers, concerns the mixing of information across semantic domains (or domains of discourse). RIFs play a key role in contaminating algorithms, and some solutions that seek to replace or avoid them have been proposed and investigated by the present author in her earlier papers. The proposals break many rough-set algorithms in a serious way. In this research, algorithm-friendly granular generalizations of such functions that reduce contamination (and data intrusion) are proposed and investigated from a mathematically sound perspective. Interesting representation results are proved and a core algebraic strategy for generalizing the Skowron-Polkowski style of rough mereology is formulated.
http://arxiv.org/abs/1811.06560
Centimeter-level, globally accurate and consistent maps for autonomous vehicle navigation have long been achievable in open areas using on-board real-time kinematic (RTK) GPS. In urban environments, however, GPS experiences multipath and blockage in urban canyons, under bridges, inside tunnels, and in underground environments. In this paper we present strategies to efficiently register local maps in geographical coordinate systems through the tactical integration of GPS and information extracted from precisely geo-referenced, high-resolution aerial orthogonal imagery. Dense lidar point clouds obtained from a moving vehicle are projected down to the horizontal plane, accurately registered, and overlaid on aerial orthoimagery. Sparse, robust and long-term pole-like landmarks are used as anchor points to link lidar and aerial image sensing, and to constrain the spatial uncertainties of the remaining lidar points that cannot be directly measured and identified. We achieve 15-75 cm absolute average global accuracy using precisely geo-referenced aerial imagery as ground truth. This is valuable in enabling the fusion of ground-vehicle on-board sensor features with features extracted from aerial images, such as traffic and lane markings. An unbiased and accurate global reference is also useful for cooperative sensing. Experimental results are presented demonstrating the accuracy and consistency of the maps when operating in large areas.
http://arxiv.org/abs/1904.09047
Inspired by ideas in cognitive science, we propose a novel and general approach to solving human motion understanding via pattern completion on a learned latent representation space. Our model outperforms current state-of-the-art methods in human motion prediction across a number of tasks, with no customization. To construct a latent representation for time series of various lengths, we propose a new and generic autoencoder based on sequence-to-sequence learning. While traditional inference strategies find a correlation between an input and an output, we use pattern completion, which views the input as a partial pattern and predicts the best corresponding complete pattern. Our results demonstrate that this approach has advantages when combined with our autoencoder in solving human motion prediction, motion generation and action classification.
http://arxiv.org/abs/1904.09039
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code will be made publicly available in PyTorch.
http://arxiv.org/abs/1812.03982
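A minimal sketch of the two-pathway frame sampling described in the abstract above. The stride values follow the ratios reported for SlowFast (temporal stride 16 for the Slow pathway, 8x denser sampling for the Fast pathway) and are treated here as illustrative defaults.

    import torch

    def slowfast_sample(clip, tau=16, alpha=8):
        """Split a video clip (C, T, H, W) into Slow and Fast pathway inputs:
        Slow samples every tau-th frame; Fast samples alpha times more densely."""
        slow = clip[:, ::tau]            # low frame rate, full channel capacity
        fast = clip[:, ::tau // alpha]   # high frame rate, lightweight pathway
        return slow, fast

    clip = torch.randn(3, 64, 224, 224)  # a 64-frame RGB clip
    slow, fast = slowfast_sample(clip)
    print(slow.shape, fast.shape)        # (3, 4, 224, 224) and (3, 32, 224, 224)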
SafeAccess is an integrated system designed to provide easier and safer access to a smart home for people with or without disabilities. The system is designed to enhance safety and promote the independence of people with disabilities (i.e., the visually impaired). The key functionality of the system includes detecting and identifying humans and generating a contextual visual summary from the real-time video streams obtained from cameras placed in strategic locations around the house. In addition, the system classifies humans into groups (i.e., friends/family/caregivers versus intruders/burglars/unknowns). These features allow the user to grant or deny remote access to the premises or to call emergency services. In this paper, we focus on designing a prototype system for the smart home and building a robust recognition engine that meets the system criteria and addresses speed, accuracy, deployment and environmental challenges under a wide variety of practical and real-life situations. To interact with the system, we implemented a dialog-enabled interface to create a personalized profile using face images or video of friends/family/caregivers. To improve computational efficiency, we apply change detection to filter out frames, use Faster R-CNN to detect human presence, and extract faces using Multitask Cascaded Convolutional Networks (MTCNN). Subsequently, we apply LBP/FaceNet to identify a person and their group by matching extracted faces with the profile. SafeAccess sends the user a visual summary via MMS containing the person’s name if a match is found (or “Unknown” otherwise), a scene image, a facial description, and contextual information. SafeAccess identifies friends/family/caregivers versus intruders/unknowns with an average F-score of 0.97 and generates a visual summary from 10 classes with an average accuracy of 98.01%.
http://arxiv.org/abs/1904.01178
ProductNet is a collection of high-quality product datasets for better product understanding. Motivated by ImageNet, ProductNet aims at supporting product representation learning by curating product datasets of high quality with a properly chosen taxonomy. In this paper, the two goals of building high-quality product datasets and learning product representations support each other in an iterative fashion: the product embedding is obtained via a multi-modal deep neural network (master model) designed to leverage product image and catalog information; in return, the embedding is utilized via active learning (local model) to vastly accelerate the annotation process. For the labeled data, the proposed master model yields high categorization accuracy (94.7% top-1 accuracy for 1240 classes), which can be used as search indices, partition keys, and input features for machine learning models. The product embedding, as well as the fine-tuned master model for a specific business task, can also be used for various transfer learning tasks.
http://arxiv.org/abs/1904.09037
This paper introduces NSGA-Net – an evolutionary approach for neural architecture search (NAS). NSGA-Net is designed with three goals in mind: (1) a procedure considering multiple and conflicting objectives, (2) an efficient procedure balancing exploration and exploitation of the space of potential neural network architectures, and (3) a procedure finding a diverse set of trade-off network architectures in a single run. NSGA-Net is a population-based search algorithm that explores a space of potential neural network architectures in three steps: a population initialization step based on prior knowledge from hand-crafted architectures, an exploration step comprising crossover and mutation of architectures, and finally an exploitation step that utilizes the hidden useful knowledge stored in the entire history of evaluated neural architectures in the form of a Bayesian network. Experimental results suggest that combining the dual objectives of minimizing an error metric and computational complexity, as measured by FLOPs, allows NSGA-Net to find competitive neural architectures. Moreover, NSGA-Net achieves an error rate on the CIFAR-10 dataset on par with other state-of-the-art NAS methods while using orders of magnitude fewer computational resources. These results are encouraging and show promise for the further use of EC methods in various deep learning paradigms.
http://arxiv.org/abs/1810.03522
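A minimal sketch of the bi-objective comparison at the heart of the search described in the abstract above: an architecture dominates another if it is no worse on both objectives (error and FLOPs) and strictly better on at least one. The candidate numbers are made up.

    def dominates(a, b):
        """True if a Pareto-dominates b on (error, flops); lower is better."""
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    def pareto_front(candidates):
        return [c for c in candidates
                if not any(dominates(o, c) for o in candidates if o is not c)]

    archs = [(4.0, 1.2), (3.5, 2.0), (4.5, 0.8), (3.8, 2.5)]  # (error %, GFLOPs)
    print(pareto_front(archs))  # [(4.0, 1.2), (3.5, 2.0), (4.5, 0.8)]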