Domain adaptation is critical for learning in new and unseen environments. With domain adversarial training, deep networks can learn disentangled and transferable features that effectively diminish the dataset shift between the source and target domains for knowledge transfer. In the era of Big Data, the ready availability of large-scale labeled datasets has stimulated wide interest in partial domain adaptation (PDA), which transfers a recognizer from a labeled large domain to an unlabeled small domain. It extends standard domain adaptation to the scenario where target labels are only a subset of source labels. Under the condition that target labels are unknown, the key challenge of PDA is how to transfer relevant examples in the shared classes to promote positive transfer, and ignore irrelevant ones in the specific classes to mitigate negative transfer. In this work, we propose a unified approach to PDA, Example Transfer Network (ETN), which jointly learns domain-invariant representations across the source and target domains, and a progressive weighting scheme that quantifies the transferability of source examples while controlling their importance to the learning task in the target domain. A thorough evaluation on several benchmark datasets shows that our approach achieves state-of-the-art results for partial domain adaptation tasks.
http://arxiv.org/abs/1903.12230
In a wide array of areas, algorithms are matching and surpassing the performance of human experts, leading to consideration of the roles of human judgment and algorithmic prediction in these domains. The discussion around these developments, however, has implicitly equated the specific task of prediction with the general task of automation. We argue here that automation is broader than just a comparison of human versus algorithmic performance on a task; it also involves the decision of which instances of the task to give to the algorithm in the first place. We develop a general framework that poses this latter decision as an optimization problem, and we show how basic heuristics for this optimization problem can lead to performance gains even on heavily-studied applications of AI in medicine. Our framework also serves to highlight how effective automation depends crucially on estimating both algorithmic and human error on an instance-by-instance basis, and our results show how improvements in these error estimation problems can yield significant gains for automation as well.
http://arxiv.org/abs/1903.12220
This paper aims to count arbitrary objects in images. The leading counting approaches start from point annotations per object from which they construct density maps. Then, their training objective transforms input images to density maps through deep convolutional networks. We posit that the point annotations serve more supervision purposes than just constructing density maps. We introduce ways to repurpose the points for free. First, we propose supervised focus from segmentation, where points are converted into binary maps. The binary maps are combined with a network branch and accompanying loss function to focus on areas of interest. Second, we propose supervised focus from global density, where the ratio of point annotations to image pixels is used in another branch to regularize the overall density estimation. To assist both the density estimation and the focus from segmentation, we also introduce an improved kernel size estimator for the point annotations. Experiments on four datasets show that all our contributions reduce the counting error, regardless of the base network, resulting in state-of-the-art accuracy using only a single network. Finally, we are the first to count on WIDER FACE, allowing us to show the benefits of our approach in handling varying object scales and crowding levels.
http://arxiv.org/abs/1903.12206
Sliding-window object detectors that generate bounding-box object predictions over a dense, regular grid have advanced rapidly and proven popular. In contrast, modern instance segmentation approaches are dominated by methods that first detect object bounding boxes, and then crop and segment these regions, as popularized by Mask R-CNN. In this work, we investigate the paradigm of dense sliding-window instance segmentation, which is surprisingly under-explored. Our core observation is that this task is fundamentally different than other dense prediction tasks such as semantic segmentation or bounding-box object detection, as the output at every spatial location is itself a geometric structure with its own spatial dimensions. To formalize this, we treat dense instance segmentation as a prediction task over 4D tensors and present a general framework called TensorMask that explicitly captures this geometry and enables novel operators on 4D tensors. We demonstrate that the tensor view leads to large gains over baselines that ignore this structure, and leads to results comparable to Mask R-CNN. These promising results suggest that TensorMask can serve as a foundation for novel advances in dense mask prediction and a more complete understanding of the task. Code will be made available.
http://arxiv.org/abs/1903.12174
Learning descriptive spatio-temporal object models from data is paramount for the task of semi-supervised video object segmentation. Most existing approaches mainly rely on models that estimate the segmentation mask based on a reference mask at the first frame (aided sometimes by optical flow or the previous mask). These models, however, are prone to fail under rapid appearance changes or occlusions due to their limitations in modelling the temporal component. On the other hand, very recently, other approaches learned long-term features using a convolutional LSTM to leverage the information from all previous video frames. Even though these models achieve better temporal representations, they still have to be fine-tuned for every new video sequence. In this paper, we present an intermediate solution and devise a novel GAN architecture, FaSTGAN, to learn spatio-temporal object models over finite temporal windows. To achieve this, we concentrate all the heavy computational load to the training phase with two critics that enforce spatial and temporal mask consistency over the last K frames. Then at test time, we only use a relatively light regressor, which reduces the inference time considerably. As a result, our approach combines a high resiliency to sudden geometric and photometric object changes with efficiency at test time (no need for fine-tuning nor post-processing). We demonstrate that the accuracy of our method is on par with state-of-the-art techniques on the challenging YouTube-VOS and DAVIS datasets, while running at 32 fps, about 4x faster than the closest competitor.
http://arxiv.org/abs/1903.12161
State-of-the-art methods for text classification include several distinct steps of pre-processing, feature extraction and post-processing. In this work, we focus on end-to-end neural architectures and show that the best performance in text classification is obtained by combining information from different neural modules. Concretely, we combine convolution, recurrent and attention modules with ensemble methods and show that they are complementary. We introduce ECGA, an end-to-end go-to architecture for novel text classification tasks. We prove that it is efficient and robust, as it attains or surpasses the state-of-the-art on varied datasets, including both low and high data regimes.
http://arxiv.org/abs/1903.12157
Deep neural networks have recently been used to edit images with great success, in particular for faces. However, they are often limited to only being able to work at a restricted range of resolutions. Many methods are so flexible that face edits can often result in an unwanted loss of identity. This work proposes to learn how to perform semantic image edits through the application of smooth warp fields. Previous approaches that attempted to use warping for semantic edits required paired data, i.e. example images of the same subject with different semantic attributes. In contrast, we employ recent advances in Generative Adversarial Networks that allow our model to be trained with unpaired data. We demonstrate face editing at very high resolutions (4k images) with a single forward pass of a deep network at a lower resolution. We also show that our edits are substantially better at preserving the subject’s identity.
https://arxiv.org/abs/1811.12784
Detailed whole brain segmentation is an essential quantitative technique, which provides a non-invasive way of measuring brain regions from a structural magnetic resonance imaging (MRI). Recently, deep convolution neural network (CNN) has been applied to whole brain segmentation. However, restricted by current GPU memory, 2D based methods, downsampling based 3D CNN methods, and patch-based high-resolution 3D CNN methods have been the de facto standard solutions. 3D patch-based high resolution methods typically yield superior performance among CNN approaches on detailed whole brain segmentation (>100 labels), however, whose performance are still commonly inferior compared with multi-atlas segmentation methods (MAS) due to the following challenges: (1) a single network is typically used to learn both spatial and contextual information for the patches, (2) limited manually traced whole brain volumes are available (typically less than 50) for training a network. In this work, we propose the spatially localized atlas network tiles (SLANT) method to distribute multiple independent 3D fully convolutional networks (FCN) for high-resolution whole brain segmentation. To address the first challenge, multiple spatially distributed networks were used in the SLANT method, in which each network learned contextual information for a fixed spatial location. To address the second challenge, auxiliary labels on 5111 initially unlabeled scans were created by multi-atlas segmentation for training. Since the method integrated multiple traditional medical image processing methods with deep learning, we developed a containerized pipeline to deploy the end-to-end solution. From the results, the proposed method achieved superior performance compared with multi-atlas segmentation methods, while reducing the computational time from >30 hours to 15 minutes (https://github.com/MASILab/SLANTbrainSeg).
http://arxiv.org/abs/1903.12152
Label noise is inherent in many deep learning tasks when the training set becomes large. A typical approach to tackle noisy labels is using robust loss functions. Categorical cross entropy (CCE) is a successful loss function in many applications. However, CCE is also notorious for fitting samples with corrupted labels easily. In contrast, mean absolute error (MAE) is noise-tolerant theoretically, but it generally works much worse than CCE in practice. In this work, we have three main points. First, to explain why MAE generally performs much worse than CCE, we introduce a new understanding of them fundamentally by exposing their intrinsic sample weighting schemes from the perspective of every sample’s gradient magnitude with respect to logit vector. Consequently, we find that MAE’s differentiation degree over training examples is too small so that informative ones cannot contribute enough against the non-informative during training. Therefore, MAE generally underfits training data when noise rate is high. Second, based on our finding, we propose an improved MAE (IMAE), which inherits MAE’s good noise-robustness. Moreover, the differentiation degree over training data points is controllable so that IMAE addresses the underfitting problem of MAE. Third, the effectiveness of IMAE against CCE and MAE is evaluated empirically with extensive experiments, which focus on image classification under synthetic corrupted labels and video retrieval under real noisy labels.
http://arxiv.org/abs/1903.12141
Leather is a natural and durable material created through a process of tanning of hides and skins of animals. The price of the leather is subjective as it is highly sensitive to its quality and surface defects condition. In the literature, there are very few works investigating on the defects detection for leather using automatic image processing techniques. The manual defect inspection process is essential in an leather production industry to control the quality of the finished products. However, it is tedious, as it is labour intensive, time consuming, causes eye fatigue and often prone to human error. In this paper, a fully automatic defect detection and marking system on a calf leather is proposed. The proposed system consists of a piece of leather, LED light, high resolution camera and a robot arm. Succinctly, a machine vision method is presented to identify the position of the defects on the leather using a deep learning architecture. Then, a series of processes are conducted to predict the defect instances, including elicitation of the leather images with a robot arm, train and test the images using a deep learning architecture and determination of the boundary of the defects using mathematical derivation of the geometry. Note that, all the processes do not involve human intervention, except for the defect ground truths construction stage. The proposed algorithm is capable to exhibit 91.5% segmentation accuracy on the train data and 70.35% on the test data. We also report confusion matrix, F1-score, precision and specificity, sensitivity performance metrics to further verify the effectiveness of the proposed approach.
http://arxiv.org/abs/1903.12139
In the natural language processing literature, neural networks are becoming increasingly deeper and complex. The recent poster child of this trend is the deep language representation model, which includes BERT, ELMo, and GPT. These developments have led to the conviction that previous-generation, shallower neural networks for language understanding are obsolete. In this paper, however, we demonstrate that rudimentary, lightweight neural networks can still be made competitive without architecture changes, external training data, or additional input features. We propose to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as its siamese counterpart for sentence-pair tasks. Across multiple datasets in paraphrasing, natural language inference, and sentiment classification, we achieve comparable results with ELMo, while using roughly 100 times fewer parameters and 15 times less inference time.
http://arxiv.org/abs/1903.12136
Generative Adversarial Networks (GANs) can produce images of remarkable complexity and realism but are generally structured to sample from a single latent source ignoring the explicit spatial interaction between multiple entities that could be present in a scene. Capturing such complex interactions between different objects in the world, including their relative scaling, spatial layout, occlusion, or viewpoint transformation is a challenging problem. In this work, we propose a novel self-consistent Composition-by-Decomposition (CoDe) network to compose a pair of objects. Given object images from two distinct distributions, our model can generate a realistic composite image from their joint distribution following the texture and shape of the input objects. We evaluate our approach through qualitative experiments and user evaluations. Our results indicate that the learned model captures potential interactions between the two object domains, and generates realistic composed scenes at test time.
http://arxiv.org/abs/1807.07560
This paper explores the expressive capabilities of a swarm of miniature mobile robots within the context of inter-robot interactions and their mapping to the so-called fundamental emotions. In particular, we investigate how motion and shape descriptors that are psychologically associated with different emotions can be incorporated into different swarm behaviors for the purpose of artistic expositions. Based on these characterizations from social psychology, a set of swarm behaviors is created, where each behavior corresponds to a fundamental emotion. The effectiveness of these behaviors was evaluated in a survey in which the participants were asked to associate different swarm behaviors with the fundamental emotions. The results of the survey show that most of the research participants assigned to each video the emotion intended to be portrayed by design. These results confirm that abstract descriptors associated with the different fundamental emotions in social psychology provide useful motion characterizations that can be effectively transformed into expressive behaviors for a swarm of simple ground mobile robots.
http://arxiv.org/abs/1903.12118
Typical multi-task learning (MTL) methods rely on architectural adjustments and a large trainable parameter set to jointly optimize over several tasks. However, when the number of tasks increases so do the complexity of the architectural adjustments and resource requirements. In this paper, we introduce a method which applies a conditional feature-wise transformation over the convolutional activations that enables a model to successfully perform a large number of tasks. To distinguish from regular MTL, we introduce Many Task Learning (MaTL) as a special case of MTL where more than 20 tasks are performed by a single model. Our method dubbed Task Routing (TR) is encapsulated in a layer we call the Task Routing Layer (TRL), which applied in an MaTL scenario successfully fits hundreds of classification tasks in one model. We evaluate our method on 5 datasets against strong baselines and state-of-the-art approaches.
http://arxiv.org/abs/1903.12117
Evolving cybersecurity threats are a persistent challenge for systemadministrators and security experts as new malwares are continu-ally released. Attackers may look for vulnerabilities in commercialproducts or execute sophisticated reconnaissance campaigns tounderstand a targets network and gather information on securityproducts like firewalls and intrusion detection / prevention systems(network or host-based). Many new attacks tend to be modificationsof existing ones. In such a scenario, rule-based systems fail to detectthe attack, even though there are minor differences in conditions /attributes between rules to identify the new and existing attack. Todetect these differences the IDS must be able to isolate the subset ofconditions that are true and predict the likely conditions (differentfrom the original) that must be observed. In this paper, we proposeaprobabilistic abductive reasoningapproach that augments an exist-ing rule-based IDS (snort [29]) to detect these evolved attacks by (a)Predicting rule conditions that are likely to occur (based on existingrules) and (b) able to generate new snort rules when provided withseed rule (i.e. a starting rule) to reduce the burden on experts toconstantly update them. We demonstrate the effectiveness of theapproach by generating new rules from the snort 2012 rules set andtesting it on the MACCDC 2012 dataset [6].
http://arxiv.org/abs/1903.12101
The past decade has witnessed great success in applying deep learning to enhance the quality of compressed video. However, the existing approaches aim at quality enhancement on a single frame, or only using fixed neighboring frames. Thus they fail to take full advantage of the inter-frame correlation in the video. This paper proposes the Quality-Gated Convolutional Long Short-Term Memory (QG-ConvLSTM) network with bi-directional recurrent structure to fully exploit the advantageous information in a large range of frames. More importantly, due to the obvious quality fluctuation among compressed frames, higher quality frames can provide more useful information for other frames to enhance quality. Therefore, we propose learning the “forget” and “input” gates in the ConvLSTM cell from quality-related features. As such, the frames with various quality contribute to the memory in ConvLSTM with different importance, making the information of each frame reasonably and adequately used. Finally, the experiments validate the effectiveness of our QG-ConvLSTM approach in advancing the state-of-the-art quality enhancement of compressed video, and the ablation study shows that our QG-ConvLSTM approach is learnt to make a trade-off between quality and correlation when leveraging multi-frame information.
http://arxiv.org/abs/1903.04596
Automatic speech emotion recognition provides computers with critical context to enable user understanding. While methods trained and tested within the same dataset have been shown successful, they often fail when applied to unseen datasets. To address this, recent work has focused on adversarial methods to find more generalized representations of emotional speech. However, many of these methods have issues converging, and only involve datasets collected in laboratory conditions. In this paper, we introduce Adversarial Discriminative Domain Generalization (ADDoG), which follows an easier to train “meet in the middle” approach. The model iteratively moves representations learned for each dataset closer to one another, improving cross-dataset generalization. We also introduce Multiclass ADDoG, or MADDoG, which is able to extend the proposed method to more than two datasets, simultaneously. Our results show consistent convergence for the introduced methods, with significantly improved results when not using labels from the target dataset. We also show how, in most cases, ADDoG and MADDoG can be used to improve upon baseline state-of-the-art methods when target dataset labels are added and in-the-wild data are considered. Even though our experiments focus on cross-corpus speech emotion, these methods could be used to remove unwanted factors of variation in other settings.
http://arxiv.org/abs/1903.12094
In this paper, gating mechanisms are applied in deep neural network (DNN) training for x-vector-based text-independent speaker verification. First, a gated convolution neural network (GCNN) is employed for modeling the frame-level embedding layers. Compared with the time-delay DNN (TDNN), the GCNN can obtain more expressive frame-level representations through carefully designed memory cell and gating mechanisms. Moreover, we propose a novel gated-attention statistics pooling strategy in which the attention scores are shared with the output gate. The gated-attention statistics pooling combines both gating and attention mechanisms into one framework; therefore, we can capture more useful information in the temporal pooling layer. Experiments are carried out using the NIST SRE16 and SRE18 evaluation datasets. The results demonstrate the effectiveness of the GCNN and show that the proposed gated-attention statistics pooling can further improve the performance.
http://arxiv.org/abs/1903.12092
In this paper, we proposed a no-reference (NR) quality metric for RGB plus image-depth (RGB-D) synthesis images based on Generative Adversarial Networks (GANs), namely GANs-NQM. Due to the failure of the inpainting on dis-occluded regions in RGB-D synthesis process, to capture the non-uniformly distributed local distortions and to learn their impact on perceptual quality are challenging tasks for objective quality metrics. In our study, based on the characteristics of GANs, we proposed i) a novel training strategy of GANs for RGB-D synthesis images using existing large-scale computer vision datasets rather than RGB-D dataset; ii) a referenceless quality metric based on the trained discriminator by learning a `Bag of Distortion Word’ (BDW) codebook and a local distortion regions selector; iii) a hole filling inpainter, i.e., the generator of the trained GANs, for RGB-D dis-occluded regions as a side outcome. According to the experimental results on IRCCyN/IVC DIBR database, the proposed model outperforms the state-of-the-art quality metrics, in addition, is more applicable in real scenarios. The corresponding context inpainter also shows appealing results over other inpainting algorithms.
http://arxiv.org/abs/1903.12088
Neural speech synthesis algorithms are a promising new approach for coding speech at very low bitrate. They have so far demonstrated quality that far exceeds traditional vocoders, at the cost of very high complexity. In this work, we present a low-bitrate neural vocoder based on the LPCNet model. The use of linear prediction and sparse recurrent networks makes it possible to achieve real-time operation on general-purpose hardware. We demonstrate that LPCNet operating at 1.6 kb/s achieves significantly higher quality than MELP and that uncompressed LPCNet can exceed the quality of a waveform codec operating at low bitrate. This opens the way for new codec designs based on neural synthesis models.
http://arxiv.org/abs/1903.12087
We present a 3-step registration pipeline for differently stained histological serial sections that consists of 1) a robust pre-alignment, 2) a parametric registration computed on coarse resolution images, and 3) an accurate nonlinear registration. In all three steps the NGF distance measure is minimized with respect to an increasingly flexible transformation. We apply the method in the ANHIR image registration challenge and evaluate its performance on the training data. The presented method is robust (error reduction in 99.6% of the cases), fast (runtime < 4 seconds) and accurate (median relative target registration error 0.0019).
http://arxiv.org/abs/1903.12063
In this paper, we propose a hybrid depth imaging system in which a polarisation camera is augmented by a second image from a standard digital camera. For this modest increase in equipment complexity over conventional shape-from-polarisation, we obtain a number of benefits that enable us to overcome longstanding problems with the polarisation shape cue. The stereo cue provides a depth map which, although coarse, is metrically accurate. This is used as a guide surface for disambiguation of the polarisation surface normal estimates using a higher order graphical model. In turn, these are used to estimate diffuse albedo. By extending a previous shape-from-polarisation method to the perspective case, we show how to compute dense, detailed maps of absolute depth, while retaining a linear formulation. We show that our hybrid method is able to recover dense 3D geometry that is superior to state-of-the-art shape-from-polarisation or two view stereo alone.
http://arxiv.org/abs/1903.12061
The x-vector based deep neural network (DNN) embedding systems have demonstrated effectiveness for text-independent speaker verification. This paper presents a multi-task learning architecture for training the speaker embedding DNN, with the primary task of classifying the target speakers and the auxiliary task of reconstructing the higher-order statistics of the original input utterance. The proposed training strategy aggregates both the supervised and unsupervised learning into one framework to make the speaker embeddings more discriminative and robust. Experiments are carried out in the NIST SRE16 evaluation dataset and the VOiCES dataset. The results demonstrate that our proposed method outperform the original x-vector approach with very low additional complexity added.
http://arxiv.org/abs/1903.12058
Successive frames of a video are highly redundant, and the most popular object detection methods do not take advantage of this fact. Using multiple consecutive frames can improve detection of small objects or difficult examples and can improve speed and detection consistency in a video sequence, for instance by interpolating features between frames. In this work, a novel approach is introduced to perform online video object detection using two consecutive frames of video sequences involving road users. Two new models, RetinaNet-Double and RetinaNet-Flow, are proposed, based respectively on the concatenation of a target frame with a preceding frame, and the concatenation of the optical flow with the target frame. The models are trained and evaluated on three public datasets. Experiments show that using a preceding frame improves performance over single frame detectors, but using explicit optical flow usually does not.
http://arxiv.org/abs/1903.12049
Since programming concepts do not match their syntactic representations, code search is a very tedious task. For instance in Java or C, array doesn’t match [], so using “array” as a query, one cannot find what they are looking for. Often developers have to search code whether to understand any code, or to reuse some part of that code, or just to read it, without natural language searching, developers have to often scroll back and forth or use variable names as their queries. In our work, we have used Stackoverflow (SO) question and answers to make a mapping of programming concepts with their respective natural language keywords, and then tag these natural language terms to every line of code, which can further we used in searching using natural language keywords.
http://arxiv.org/abs/1903.12495
Recently, the state-of-the-art models for image captioning have overtaken human performance based on the most popular metrics, such as BLEU, METEOR, ROUGE, and CIDEr. Does this mean we have solved the task of image captioning? The above metrics only measure the similarity of the generated caption to the human annotations, which reflects its accuracy. However, an image contains many concepts and multiple levels of detail, and thus there is a variety of captions that express different concepts and details that might be interesting for different humans. Therefore only evaluating accuracy is not sufficient for measuring the performance of captioning models — the diversity of the generated captions should also be considered. In this paper, we proposed a new metric for measuring the diversity of image captions, which is derived from latent semantic analysis and kernelized to use CIDEr similarity. We conduct extensive experiments to re-evaluate recent captioning models in the context of both diversity and accuracy. We find that there is still a large gap between the model and human performance in terms of both accuracy and diversity and the models that have optimized accuracy (CIDEr) have low diversity. We also show that balancing the cross-entropy loss and CIDEr reward in reinforcement learning during training can effectively control the tradeoff between diversity and accuracy of the generated captions.
http://arxiv.org/abs/1903.12020
Evaluating translation models is a trade-off between effort and detail. On the one end of the spectrum there are automatic count-based methods such as BLEU, on the other end linguistic evaluations by humans, which arguably are more informative but also require a disproportionately high effort. To narrow the spectrum, we propose a general approach on how to automatically expose systematic differences between human and machine translations to human experts. Inspired by adversarial settings, we train a neural text classifier to distinguish human from machine translations. A classifier that performs and generalizes well after training should recognize systematic differences between the two classes, which we uncover with neural explainability methods. Our proof-of-concept implementation, DiaMaT, is open source. Applied to a dataset translated by a state-of-the-art neural Transformer model, DiaMaT achieves a classification accuracy of 75% and exposes meaningful differences between humans and the Transformer, amidst the current discussion about human parity.
http://arxiv.org/abs/1903.12017
In this paper, we address the problem of effectively self-training neural networks in a low-resource setting. Self-training is frequently used to automatically increase the amount of training data. However, in a low-resource scenario, it is less effective due to unreliable annotations created using self-labeling of unlabeled data. We propose to combine self-training with noise handling on the self-labeled data. Directly estimating noise on the combined clean training set and self-labeled data can lead to corruption of the clean data and hence, performs worse. Thus, we propose the Clean and Noisy Label Neural Network which trains on clean and noisy self-labeled data simultaneously by explicitly modelling clean and noisy labels separately. In our experiments on Chunking and NER, this approach performs more robustly than the baselines. Complementary to this explicit approach, noise can also be handled implicitly with the help of an auxiliary learning task. To such a complementary approach, our method is more beneficial than other baseline methods and together provides the best performance overall.
http://arxiv.org/abs/1903.12008
Face manipulation has shown remarkable advances with the flourish of Generative Adversarial Networks. However, due to the difficulties of controlling the structure and texture in high-resolution, it is challenging to simultaneously model pose and expression during manipulation. In this paper, we propose a novel framework that simplifies face manipulation with extreme pose and expression into two correlated stages: a boundary prediction stage and a disentangled face synthesis stage. In the first stage, we propose to use a boundary image for joint pose and expression modeling. An encoder-decoder network is employed to predict the boundary image of the target face in a semi-supervised way. Pose and expression estimators are used to improve the prediction accuracy. In the second stage, the predicted boundary image and the original face are encoded into the structure and texture latent space by two encoder networks respectively. A proxy network and a feature threshold loss are further imposed as constraints to disentangle the latent space. In addition, we build up a new high quality Multi-View Face (MVF-HQ) database that contains 120K high-resolution face images of 479 identities with pose and expression variations, which will be released soon. Qualitative and quantitative experiments on four databases show that our method pushes forward the advance of extreme face manipulation from 128 $\times$ 128 resolution to 1024 $\times$ 1024 resolution, and significantly improves the face recognition performance under large poses.
http://arxiv.org/abs/1903.12003
This paper presents an experimental validation of a new quadrotor-based aerial manipulator. A quadrotor is equipped with a 2-DOF robotic arm that is designed with a new topology to enable the end-effector of the whole system to track a 6-DOF trajectory. An identification experiment is carried out to find out the system parameters. The mathematical model of the whole system is presented. A measurement scheme is proposed to get the accurate pose of the vehicle considering the motion of the manipulator below the quadrotor. The system controller is designed and implemented based on PID with a gravity compensation algorithm. System simulation is implemented in the MATLAB/SIMULINK environment with real system parameters, to better emulate a realistic setup. Real-time Experiments are conducted. Both simulation and experimental results show the feasibility and a satisfactory efficiency of the proposed system in achieving the position holding and transferring an object to a specific target position.
http://arxiv.org/abs/1903.12001
We present a novel optimizer for deep neural networks that combines the ideas of Netwon’s method and line search to efficiently compute and utilize curvature information. Our work is based on empirical observation suggesting that the loss function can be approximated by a parabola in negative gradient direction. Due to this approximation, we are able to perform a variable and loss function dependent parameter update by jumping directly into the minimum of the approximated parabola. To evaluate our optimizer, we performed multiple comprehensive hyperparameter grid searches for which we trained more than 20000 networks in total. We can show that PAL outperforms RMSPROP, and can outperform gradient descent with momentum and ADAM on large-scale high-dimensional machine learning problems. Furthermore, PAL requires up to 52.2% less training epochs. PyTorch and TensorFlow implementations are provided at https://github.com/cogsys-tuebingen/PAL.
http://arxiv.org/abs/1903.11991
Trajectory optimization with learned dynamics models can often suffer from erroneous predictions of out-of-distribution trajectories. We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the dynamics model. We visually demonstrate the effectiveness of the regularization in gradient-based trajectory optimization for open-loop control of an industrial process. We compare with recent model-based reinforcement learning algorithms on a set of popular motor control tasks to demonstrate that the denoising regularization enables state-of-the-art sample-efficiency. We demonstrate the efficacy of the proposed method in regularizing both gradient-based and gradient-free trajectory optimization.
http://arxiv.org/abs/1903.11981
Imbalanced data commonly exists in real world, espacially in sentiment-related corpus, making it difficult to train a classifier to distinguish latent sentiment in text data. We observe that humans often express transitional emotion between two adjacent discourses with discourse markers like “but”, “though”, “while”, etc, and the head discourse and the tail discourse 3 usually indicate opposite emotional tendencies. Based on this observation, we propose a novel plug-and-play method, which first samples discourses according to transitional discourse markers and then validates sentimental polarities with the help of a pretrained attention-based model. Our method increases sample diversity in the first place, can serve as a upstream preprocessing part in data augmentation. We conduct experiments on three public sentiment datasets, with several frequently used algorithms. Results show that our method is found to be consistently effective, even in highly imbalanced scenario, and easily be integrated with oversampling method to boost the performance on imbalanced sentiment classification.
http://arxiv.org/abs/1903.11919
The intelligent Processing technique is more and more attractive to researchers due to its ability to deal with key problems in Vehicular Ad hoc networks. However, several problems in applying intelligent processing technologies in VANETs remain open. The existing applications are comprehensively reviewed and discussed, and classified into different categories in this paper. Their strategies, advantages/disadvantages, and performances are elaborated. By generalizing different tactics in various applications related to different scenarios of VANETs and evaluating their performances, several promising directions for future research have been suggested.
http://arxiv.org/abs/1903.11916
We are concerned with the vulnerability of computer vision models to distributional shifts. We cast this problem in terms of combinatorial optimization, evaluating the regions in the input space where a (black-box) model is more vulnerable. This is carried out by combining image transformations from a given set and standard search algorithms. We embed this idea in a training procedure, where we define new data augmentation rules over iterations, accordingly to the image transformations that the current model is most vulnerable to. An empirical evaluation on classification and semantic segmentation problems suggests that the devised algorithm allows to train models more robust against content-preserving image transformations, and in general, against distributional shifts.
http://arxiv.org/abs/1903.11900
Convolutional neural networks (CNNs) are easily spoofed by adversarial examples which lead to wrong classification results. Most of the defense methods focus only on how to improve the robustness of CNNs or to detect adversarial examples. They are incapable of detecting and correctly classifying adversarial examples simultaneously. We find that adversarial examples and original images have diverse representations in the feature space, and this difference grows as layers go deeper, which we call Adversarial Feature Separability (AFS). Inspired by AFS, we propose a defense framework based on Adversarial Feature Genome (AFG), which can detect and correctly classify adversarial examples into original classes simultaneously. AFG is an innovative encoding for both image and adversarial example. It consists of group features and a mixed label. With group features which are visual representations of adversarial and original images via group visualization method, one can detect adversarial examples because of ASF of group features. With a mixed label, one can trace back to the original label of an adversarial example. Then, the classification of adversarial example is modeled as a multi-label classification trained on the AFG dataset, which can get the original class of adversarial example. Experiments show that the proposed framework not only effectively detects adversarial examples from different attack algorithms, but also correctly classifies adversarial examples. Our framework potentially gives a new perspective, i.e., a data-driven way, to improve the robustness of a CNN model.
http://arxiv.org/abs/1812.10085
It is challenging to detect the anomaly in crowded scenes for quite a long time. In this paper, a self-supervised framework, abnormal event detection network (AED-Net), which is composed of PCAnet and kernel principal component analysis (kPCA), is proposed to address this problem. Using surveillance video sequences of different scenes as raw data, PCAnet is trained to extract high-level semantics of crowd’s situation. Next, kPCA,a one-class classifier, is trained to determine anomaly of the scene. In contrast to some prevailing deep learning methods,the framework is completely self-supervised because it utilizes only video sequences in a normal situation. Experiments of global and local abnormal event detection are carried out on UMN and UCSD datasets, and competitive results with higher EER and AUC compared to other state-of-the-art methods are observed. Furthermore, by adding local response normalization (LRN) layer, we propose an improvement to original AED-Net. And it is proved to perform better by promoting the framework’s generalization capacity according to the experiments.
http://arxiv.org/abs/1903.11891
Good parameter settings are crucial to achieve high performance in many areas of artificial intelligence (AI), such as propositional satisfiability solving, AI planning, scheduling, and machine learning (in particular deep learning). Automated algorithm configuration methods have recently received much attention in the AI community since they replace tedious, irreproducible and error-prone manual parameter tuning and can lead to new state-of-the-art performance. However, practical applications of algorithm configuration are prone to several (often subtle) pitfalls in the experimental design that can render the procedure ineffective. We identify several common issues and propose best practices for avoiding them. As one possibility for automatically handling as many of these as possible, we also propose a tool called GenericWrapper4AC.
http://arxiv.org/abs/1705.06058
Inertial navigation and attitude initialization in polar areas become a hot topic in recent years in the navigation community, as the widely-used navigation mechanization of the local level frame encounters the inherent singularity when the latitude approaches 90 degrees. Great endeavors have been devoted to devising novel navigation mechanizations such as the grid or transversal frames. This paper highlights the fact that the common Earth-frame mechanization is sufficiently good to well handle the singularity problem in polar areas. Simulation results are reported to demonstrate the singularity problem and the effectiveness of the Earth-frame mechanization.
http://arxiv.org/abs/1903.11863
This paper investigates the visual quality of the adversarial examples. Recent papers propose to smooth the perturbations to get rid of high frequency artefacts. In this work, smoothing has a different meaning as it perceptually shapes the perturbation according to the visual content of the image to be attacked. The perturbation becomes locally smooth on the flat areas of the input image, but it may be noisy on its textured areas and sharp across its edges. This operation relies on Laplacian smoothing, well-known in graph signal processing, which we integrate in the attack pipeline. We benchmark several attacks with and without smoothing under a white-box scenario and evaluate their transferability. Despite the additional constraint of smoothness, our attack has the same probability of success at lower distortion.
http://arxiv.org/abs/1903.11862
In multiple attribute decision analysis (MADA) problems, one often needs to deal with assessment information with uncertainty. The evidential reasoning approach is one of the most effective methods to deal with such MADA problems. As kernel of the evidential reasoning approach, an original evidential reasoning (ER) algorithm was firstly proposed by Yang et al, and later they modified the ER algorithm in order to satisfy the proposed four synthesis axioms with which a rational aggregation process needs to satisfy. However, up to present, the essential difference of the two ER algorithms as well as the rationality of the synthesis axioms are still unclear. In this paper, we analyze the ER algorithms in Dempster-Shafer theory (DST) framework and prove that the original ER algorithm follows the reliability discounting and combination scheme, while the modified one follows the importance discounting and combination scheme. Further we reveal that the four synthesis axioms are not valid criteria to check the rationality of one attribute aggregation algorithm. Based on these new findings, an extended ER algorithm is proposed to take into account both the reliability and importance of different attributes, which provides a more general attribute aggregation scheme for MADA with uncertainty. A motorcycle performance assessment problem is examined to illustrate the proposed algorithm.
http://arxiv.org/abs/1903.11857
A well-trained model should classify objects with a unanimous score for every category. This requires the high-level semantic features should be as much alike as possible among samples. To achive this, previous works focus on re-designing the loss or proposing new regularization constraints. In this paper, we provide a new perspective. For each category, it is assumed that there are two feature sets: one with reliable information and the other with less reliable source. We argue that the reliable set could guide the feature learning of the less reliable set during training - in spirit of student mimicking teacher behavior and thus pushing towards a more compact class centroid in the feature space. Such a scheme also benefits the reliable set since samples become closer within the same category - implying that it is easier for the classifier to identify. We refer to this mutual learning process as feature intertwiner and embed it into object detection. It is well-known that objects of low resolution are more difficult to detect due to the loss of detailed information during network forward pass (e.g., RoI operation). We thus regard objects of high resolution as the reliable set and objects of low resolution as the less reliable set. Specifically, an intertwiner is designed to minimize the distribution divergence between two sets. The choice of generating an effective feature representation for the reliable set is further investigated, where we introduce the optimal transport (OT) theory into the framework. Samples in the less reliable set are better aligned with aid of OT metric. Incorporated with such a plug-and-play intertwiner, we achieve an evident improvement over previous state-of-the-arts.
http://arxiv.org/abs/1903.11851
Current state of the art systems in NLP heavily rely on manually annotated datasets, which are expensive to construct. Very little work adequately exploits unannotated data – such as discourse markers between sentences – mainly because of data sparseness and ineffective extraction methods. In the present work, we propose a method to automatically discover sentence pairs with relevant discourse markers, and apply it to massive amounts of data. Our resulting dataset contains 174 discourse markers with at least 10k examples each, even for rare markers such as coincidentally or amazingly We use the resulting data as supervision for learning transferable sentence embeddings. In addition, we show that even though sentence representation learning through prediction of discourse markers yields state of the art results across different transfer tasks, it is not clear that our models made use of the semantic relation between sentences, thus leaving room for further improvements. Our datasets are publicly available (https://github.com/synapse-developpement/Discovery)
http://arxiv.org/abs/1903.11850
Machine reading comprehension have been intensively studied in recent years, and neural network-based models have shown dominant performances. In this paper, we present a Sogou Machine Reading Comprehension (SMRC) toolkit that can be used to provide the fast and efficient development of modern machine comprehension models, including both published models and original prototypes. To achieve this goal, the toolkit provides dataset readers, a flexible preprocessing pipeline, necessary neural network components, and built-in models, which make the whole process of data preparation, model construction, and training easier.
http://arxiv.org/abs/1903.11848
Liver lesion segmentation is a difficult yet critical task for medical image analysis. Recently, deep learning based image segmentation methods have achieved promising performance, which can be divided into three categories: 2D, 2.5D and 3D, based on the dimensionality of the models. However, 2.5D and 3D methods can have very high complexity and 2D methods may not perform satisfactorily. To obtain competitive performance with low complexity, in this paper, we propose a Feature-fusion Encoder-Decoder Network (FED-Net) based 2D segmentation model to tackle the challenging problem of liver lesion segmentation from CT images. Our feature fusion method is based on the attention mechanism, which fuses high-level features carrying semantic information with low-level features having image details. Additionally, to compensate for the information loss during the upsampling process, a dense upsampling convolution and a residual convolutional structure are proposed. We tested our method on the dataset of MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge and achieved competitive results compared with other state-of-the-art methods.
http://arxiv.org/abs/1903.11834
Large-scale distributed training of deep neural networks suffer from the generalization gap caused by the increase in the effective mini-batch size. Previous approaches try to solve this problem by varying the learning rate and batch size over epochs and layers, or some ad hoc modification of the batch normalization. We propose an alternative approach using a second-order optimization method that shows similar generalization capability to first-order methods, but converges faster and can handle larger mini-batches. To test our method on a benchmark where highly optimized first-order methods are available as references, we train ResNet-50 on ImageNet. We converged to 75% Top-1 validation accuracy in 35 epochs for mini-batch sizes under 16,384, and achieved 75% even with a mini-batch size of 131,072, which took only 978 iterations.
http://arxiv.org/abs/1811.12019
Modern approaches for semantic segmentation usually employ dilated convolutions in the backbone to extract high-resolution feature maps, which brings heavy computation complexity and memory footprint. To replace the time and memory consuming dilated convolutions, we propose a novel joint upsampling module named Joint Pyramid Upsampling (JPU) by formulating the task of extracting high-resolution feature maps into a joint upsampling problem. With the proposed JPU, our method reduces the computation complexity by more than three times without performance loss. Experiments show that JPU is superior to other upsampling modules, which can be plugged into many existing approaches to reduce computation complexity and improve performance. By replacing dilated convolutions with the proposed JPU module, our method achieves the state-of-the-art performance in Pascal Context dataset (mIoU of 53.13%) and ADE20K dataset (final score of 0.5584) while running 3 times faster.
http://arxiv.org/abs/1903.11816
Scene text detection, an essential step of scene text recognition system, is to locate text instances in natural scene images automatically. Some recent attempts benefiting from Mask R-CNN formulate scene text detection task as an instance segmentation problem and achieve remarkable performance. In this paper, we present a new Mask R-CNN based framework named Pyramid Mask Text Detector (PMTD) to handle the scene text detection. Instead of binary text mask generated by the existing Mask R-CNN based methods, our PMTD performs pixel-level regression under the guidance of location-aware supervision, yielding a more informative soft text mask for each text instance. As for the generation of text boxes, PMTD reinterprets the obtained 2D soft mask into 3D space and introduces a novel plane clustering algorithm to derive the optimal text box on the basis of 3D shape. Experiments on standard datasets demonstrate that the proposed PMTD brings consistent and noticeable gain and clearly outperforms state-of-the-art methods. Specifically, it achieves an F-measure of 80.13% on ICDAR 2017 MLT dataset.
http://arxiv.org/abs/1903.11800
Sound event detection with weakly labeled data is considered as a problem of multi-instance learning. And the choice of pooling function is the key to solving this problem. In this paper, we proposed a hierarchical pooling structure to improve the performance of weakly labeled sound event detection system. Proposed pooling structure has made remarkable improvements on three types of pooling function without adding any parameters. Moreover, our system has achieved competitive performance on Task 4 of Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge using hierarchical pooling structure.
http://arxiv.org/abs/1903.11791
In this paper, we report on a parallel freeviewpoint video synthesis algorithm that can efficiently reconstruct a high-quality 3D scene representation of sports scenes. The proposed method focuses on a scene that is captured by multiple synchronized cameras featuring wide-baselines. The following strategies are introduced to accelerate the production of a free-viewpoint video taking the improvement of visual quality into account: (1) a sparse point cloud is reconstructed using a volumetric visual hull approach, and an exact 3D ROI is found for each object using an efficient connected components labeling algorithm. Next, the reconstruction of a dense point cloud is accelerated by implementing visual hull only in the ROIs; (2) an accurate polyhedral surface mesh is built by estimating the exact intersections between grid cells and the visual hull; (3) the appearance of the reconstructed presentation is reproduced in a view-dependent manner that respectively renders the non-occluded and occluded region with the nearest camera and its neighboring cameras. The production for volleyball and judo sequences demonstrates the effectiveness of our method in terms of both execution time and visual quality.
http://arxiv.org/abs/1903.11785