Detecting acoustic shadows in ultrasound images is important in many clinical and engineering applications. Real-time feedback on acoustic shadows can guide sonographers to a standardized diagnostic viewing plane with minimal artifacts and can provide additional information for other automatic image analysis algorithms. However, automatically detecting shadow regions using learning-based algorithms is challenging because pixel-wise ground truth annotation of acoustic shadows is subjective and time-consuming. In this paper we propose a weakly supervised method for automatic confidence estimation of acoustic shadow regions. Our method is able to generate a dense shadow-focused confidence map. In our method, a shadow-seg module is built to learn general shadow features for shadow segmentation, based on global image-level annotations as well as a small number of coarse pixel-wise shadow annotations. A transfer function is introduced to extend the obtained binary shadow segmentation to a reference confidence map. Additionally, a confidence estimation network is proposed to learn the mapping between input images and the reference confidence maps. This network is able to predict shadow confidence maps directly from input images during inference. We use evaluation metrics such as DICE and inter-class correlation to verify the effectiveness of our method. Our method is more consistent than human annotation, and outperforms the state-of-the-art quantitatively in shadow segmentation and qualitatively in confidence estimation of shadow regions. We further demonstrate the applicability of our method by integrating shadow confidence maps into tasks such as ultrasound image classification, multi-view image fusion and automated biometric measurements.
http://arxiv.org/abs/1811.08164
In this paper, we propose a generative framework that unifies depth-based 3D facial pose tracking and face model adaptation on-the-fly, in unconstrained scenarios with heavy occlusions and arbitrary facial expression variations. Specifically, we introduce a statistical 3D morphable model that flexibly describes the distribution of points on the surface of the face model, with an efficient switchable online adaptation that gradually captures the identity of the tracked subject and rapidly constructs a suitable face model when the subject changes. Moreover, unlike prior art that employed ICP-based facial pose estimation, to improve robustness to occlusions, we propose a ray visibility constraint that regularizes the pose based on the face model’s visibility with respect to the input point cloud. Ablation studies and experimental results on the Biwi and ICT-3DHP datasets demonstrate that the proposed framework is effective and outperforms competing state-of-the-art depth-based methods.
http://arxiv.org/abs/1905.02114
In recent years, deep learning has achieved remarkable success in many fields, including computer vision, natural language processing, speech recognition and others. Adequate training data is the key to ensuring the effectiveness of deep models. However, obtaining valid data requires a lot of time and labor. Data augmentation (DA) is an effective alternative, which can generate new labeled data from existing data using label-preserving transformations. Although we can benefit a lot from DA, designing appropriate DA policies requires a lot of expert experience and time, and evaluating candidate policies in search of the optimal one is costly. So we raise a new question in this paper: how can automated data augmentation be achieved at as low a cost as possible? We propose a method named BO-Aug to automate the process by finding the optimal DA policies using Bayesian optimization. Our method can find the optimal policies at a relatively low search cost, and the policies searched on a specific dataset are transferable across different neural network architectures or even different datasets. We validate BO-Aug on three widely used image classification datasets: CIFAR-10, CIFAR-100 and SVHN. Experimental results show that the proposed method can achieve state-of-the-art or near state-of-the-art classification accuracy. Code to reproduce our experiments is available at this https URL.
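As a rough illustration of the search loop described above (not the authors' implementation), a policy search could be driven by an off-the-shelf Gaussian-process optimizer such as scikit-optimize; the operation list and the inline objective below are placeholders:

```python
# A hedged sketch (not the authors' code) of augmentation-policy search with
# Bayesian optimization, using scikit-optimize's gp_minimize.
from skopt import gp_minimize
from skopt.space import Categorical, Real

OPS = ["rotate", "shear_x", "translate_y", "autocontrast", "equalize"]

# A candidate policy: two (operation, probability, magnitude) triples.
space = [Categorical(OPS), Real(0.0, 1.0), Real(0.0, 1.0),
         Categorical(OPS), Real(0.0, 1.0), Real(0.0, 1.0)]

def objective(policy):
    # Placeholder so the sketch runs end-to-end; in practice this would
    # train a proxy model under `policy` and return its validation error.
    op1, p1, m1, op2, p2, m2 = policy
    return abs(p1 - 0.5) + abs(m2 - 0.3) + 0.1 * (op1 == op2)

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("best policy:", result.x, "objective:", result.fun)
```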
https://arxiv.org/abs/1905.02610
Online videos have witnessed unprecedented growth over the last decade, owing to a wide range of content creation. This provides advertisement and marketing agencies a plethora of opportunities for targeted advertisements. Such techniques involve replacing an existing advertisement in a video frame with a new advertisement. However, such post-processing of online videos is mostly done manually by video editors, which is cumbersome and time-consuming. In this paper, we propose DeepAds – a deep neural network, based on a simple encoder-decoder architecture, that can accurately localize the position of an advert in a video frame. Our approach of localizing billboards in outdoor scenes using neural nets is the first of its kind and achieves the best performance. We benchmark our proposed method against other semantic segmentation algorithms on a public dataset of outdoor scenes with manually annotated billboard binary maps.
http://arxiv.org/abs/1905.02106
In this paper, we propose Hard Person Identity Mining (HPIM), which attempts to refine hard example mining to improve exploration efficacy in person re-identification. It is motivated by the following observation: the more attributes some people share, the more difficult it is to separate their identities. Based on this observation, we develop HPIM via a transferred attribute describer, a deep multi-attribute classifier trained on noisy source person-attribute datasets. We encode each image of the target person re-ID dataset into an attribute probabilistic description. Afterwards, in the attribute code space, we consider each person as a distribution that generates view-specific attribute codes in different practical scenarios. We then estimate the person-specific statistical moments from zeroth to higher order, which are further used to calculate the central moment discrepancies between persons. Such discrepancies provide a basis for choosing hard identities to organize proper mini-batches, without concern for the person representation changing during metric learning. HPIM serves as a complementary tool to hard example mining, helping to explore the global rather than the local hard example constraint in mini-batches built from randomly sampled identities. Extensive experiments on two person re-identification benchmarks validate the effectiveness of our proposed algorithm.
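A hedged sketch of the moment-based hardness measure the abstract outlines; the exact discrepancy used in the paper may differ, e.g. in normalization:

```python
# A sketch (ours, not the authors') of a central moment discrepancy between
# two persons' sets of attribute probability codes.
import numpy as np

def cmd(codes_a, codes_b, order=3):
    """codes_*: (n_views, n_attributes) attribute probability codes."""
    mu_a, mu_b = codes_a.mean(0), codes_b.mean(0)
    d = np.linalg.norm(mu_a - mu_b)                  # first (raw) moment
    for k in range(2, order + 1):                    # higher central moments
        ca = ((codes_a - mu_a) ** k).mean(0)
        cb = ((codes_b - mu_b) ** k).mean(0)
        d += np.linalg.norm(ca - cb)
    return d

rng = np.random.default_rng(0)
a, b = rng.random((8, 20)), rng.random((8, 20))
# Small discrepancy -> similar attribute distributions -> a hard identity pair.
print(cmd(a, b))
```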
http://arxiv.org/abs/1905.02102
End-to-end deep learning language or dialect identification systems operate on the spectrogram or other acoustic features and directly generate identification scores for each class. An important issue for end-to-end systems is having some knowledge of the application domain, because the system can be vulnerable to use cases that were not seen in the training phase; such a scenario is often referred to as a domain mismatched condition. In general, we assume that there is enough variation in the training dataset to expose the system to multiple domains. In this work, we study how to best make use of a training dataset in order to have maximum effectiveness on unknown target domains. Our goal is to process the input without any knowledge of the target domain while preserving robust performance on other domains as well. To accomplish this objective, we propose a domain attentive fusion approach for end-to-end dialect/language identification systems. To help with experimentation, we collect a dataset from three different domains, and create experimental protocols for a domain mismatched condition. The results of our proposed approach, tested on a variety of broadcast and YouTube data, show a significant performance gain compared to traditional approaches, even without any prior target domain information.
http://arxiv.org/abs/1812.01501
We consider the problem of 3D object pose estimation. While much recent work has focused on the RGB domain, the reliance on accurately annotated images limits their generalizability and scalability. On the other hand, the easily available CAD models of objects are rich sources of data, providing a large number of synthetically rendered images. In this paper, we solve this key problem of existing methods requiring expensive 3D pose annotations by proposing a new method that matches RGB images to CAD models for object pose estimation. Our key innovations compared to existing work include removing the need for either real-world textures for CAD models or explicit 3D pose annotations for RGB images. We achieve this through a series of objectives that learn how to select keypoints and enforce viewpoint and modality invariance across RGB images and CAD model renderings. We conduct extensive experiments to demonstrate that the proposed method can reliably estimate object pose in RGB images, as well as generalize to object instances not seen during training.
http://arxiv.org/abs/1811.07249
3D hand pose estimation from a single depth image is an essential topic in computer vision and human-computer interaction. Although the rise of deep learning methods has boosted accuracy considerably, the problem remains hard to solve due to the complex structure of the human hand. Existing deep learning methods either lose spatial information of the hand structure or lack direct supervision of joint coordinates. In this paper, we propose a novel pixel-wise regression method, which uses a spatial-form representation (SFR) and a differentiable decoder (DD) to solve these two problems. Using this method, we build a model in which we design a particular SFR and its corresponding DD, which divide the 3D joint coordinates into two parts, plane coordinates and depth coordinates, and use two modules named Plane Regression (PR) and Depth Regression (DR) to handle them respectively. We conduct an ablation experiment showing that the proposed method achieves better results than former methods. We also explore how different training strategies influence the learned SFRs and results. Experiments on three public datasets demonstrate that our model is comparable with existing state-of-the-art models, and on one of them our model reduces mean 3D joint error by 25%.
http://arxiv.org/abs/1905.02085
Mapping and localization are essential capabilities of robotic systems. Although the majority of mapping systems focus on static environments, deployment in real-world situations requires them to handle dynamic objects. In this paper, we propose an approach for an RGB-D sensor that is able to consistently map scenes containing multiple dynamic elements. For localization and mapping, we employ efficient direct tracking on the truncated signed distance function (TSDF) and leverage color information encoded in the TSDF to estimate the pose of the sensor. The TSDF is efficiently represented using voxel hashing, with most computations parallelized on a GPU. For detecting dynamics, we exploit the residuals obtained after an initial registration, together with the explicit modeling of free space in the model. We evaluate our approach on existing datasets, and provide a new dataset showing highly dynamic scenes. These experiments show that our approach often surpasses other state-of-the-art dense SLAM methods. We make our dataset available with ground truth for both the trajectory of the RGB-D sensor, obtained by a motion capture system, and the model of the static environment, acquired with a high-precision terrestrial laser scanner. Finally, we release our approach as open source code.
http://arxiv.org/abs/1905.02082
Many learning algorithms have invariances: when their training data is transformed in certain ways, the function they learn transforms in a predictable manner. Here we formalize this notion using concepts from the mathematical field of category theory. The invariances that a supervised learning algorithm possesses are formalized by categories of predictor and target spaces, whose morphisms represent the algorithm’s invariances, and an index category whose morphisms represent permutations of the training examples. An invariant learning algorithm is a natural transformation between two functors from the product of these categories to the category of sets, representing training datasets and learned functions respectively. We illustrate the framework by characterizing and contrasting the invariances of linear regression and ridge regression.
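Under our reading of the abstract (notation is ours, not necessarily the paper's), the central definition can be sketched as follows:

```latex
% A sketch of the central definition, in our own notation (an assumption
% about the paper's exact formulation, based only on the abstract).
% Let $\mathcal{X}$ and $\mathcal{Y}$ be categories of predictor and target
% spaces whose morphisms encode the algorithm's invariances, and let
% $\mathcal{I}$ be an index category whose morphisms permute training examples.
% Let $D, F \colon \mathcal{X} \times \mathcal{Y} \times \mathcal{I} \to \mathbf{Set}$
% be functors assigning to each object its training datasets and its learned
% functions, respectively. An invariant learning algorithm is then a natural
% transformation $\alpha \colon D \Rightarrow F$, i.e.\ for every morphism
% $(f, g, h)$ the naturality square commutes:
\[
  F(f, g, h) \circ \alpha_{(X, Y, I)} \;=\; \alpha_{(X', Y', I')} \circ D(f, g, h).
\]
```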
https://arxiv.org/abs/1905.02072
We study the problem of semi-supervised singing voice separation, in which the training data contains a set of samples of mixed music (singing and instrumental) and an unmatched set of instrumental music. Our solution employs a single mapping function g, which, applied to a mixed sample, recovers the underlying instrumental music, and, applied to an instrumental sample, returns the same sample. The network g is trained using purely instrumental samples, as well as on synthetic mixed samples that are created by mixing reconstructed singing voices with random instrumental samples. Our results indicate that we are on a par with or better than fully supervised methods, which are also provided with training samples of unmixed singing voices, and are better than other recent semi-supervised methods.
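A minimal sketch, in our own notation, of the two training signals the abstract describes for the single mapping g; the voice-estimation step (mix minus g(mix)) is our assumption about how the synthetic mixes are built:

```python
# A hedged PyTorch sketch of the two losses, not the authors' code.
import torch

def training_losses(g, mixed, instrumental):
    """mixed, instrumental: (batch, ...) spectrogram tensors."""
    # 1) Identity on pure instrumentals: g should return its input unchanged.
    loss_id = (g(instrumental) - instrumental).abs().mean()
    # 2) Synthetic mixes: estimate the voice, remix it with a random
    #    instrumental, and require g to recover that instrumental.
    voice_est = (mixed - g(mixed)).detach()
    shuffled = instrumental[torch.randperm(instrumental.size(0))]
    loss_synth = (g(voice_est + shuffled) - shuffled).abs().mean()
    return loss_id + loss_synth

g = torch.nn.Identity()                      # stand-in for the real network
print(training_losses(g, torch.randn(4, 512), torch.randn(4, 512)))
```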
http://arxiv.org/abs/1812.06087
Multi-view clustering has attracted much attention recently; it aims to take advantage of multi-view information to improve clustering performance. However, most recent work focuses on self-representation based subspace clustering, which has high computational complexity. In this paper, we focus on the Markov chain based spectral clustering method and propose a novel essential tensor learning method to explore the high-order correlations for multi-view representation. We first construct a tensor based on the multi-view transition probability matrices of the Markov chain. By incorporating the idea of robust principal component analysis, a tensor nuclear norm based on the tensor singular value decomposition (t-SVD) is imposed to preserve the low-rank property of the essential tensor, which can well capture the principal information from multiple views. We also employ the tensor rotation operator for this task to better investigate the relationship among views and reduce computational complexity. The proposed method can be efficiently optimized by the alternating direction method of multipliers (ADMM). Extensive experiments on six real-world datasets covering five different applications show that our method achieves superior performance over other state-of-the-art methods.
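The t-SVD-based tensor nuclear norm admits a standard proximal step, sketched below: FFT along the third mode, per-slice singular value soft-thresholding, and inverse FFT. This is the generic building block such an ADMM solver would use, not the authors' code:

```python
# A hedged numpy sketch of the proximal operator of the t-SVD tensor
# nuclear norm (singular value thresholding in the Fourier domain).
import numpy as np

def tsvd_shrink(T, tau):
    """T: (n1, n2, n3) real tensor; tau: shrinkage threshold."""
    Tf = np.fft.fft(T, axis=2)               # FFT along the third mode
    out = np.zeros_like(Tf)
    for k in range(T.shape[2]):
        U, s, Vh = np.linalg.svd(Tf[:, :, k], full_matrices=False)
        s = np.maximum(s - tau, 0.0)          # soft-threshold singular values
        out[:, :, k] = (U * s) @ Vh
    return np.real(np.fft.ifft(out, axis=2))

X = np.random.default_rng(0).random((10, 10, 4))
Y = tsvd_shrink(X, tau=0.5)                   # Y has (approximately) low tubal rank
print(X.shape, Y.shape)
```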
http://arxiv.org/abs/1807.03602
Unsupervised domain adaptation (UDA) amounts to assigning class labels to the unlabeled instances of a dataset from a target domain, using labeled instances of a dataset from a related source domain. In this paper, we propose to cast this problem in a game-theoretic setting as a non-cooperative game and introduce a fully automated iterative algorithm for UDA based on graph transduction games (GTG). The main advantages of this approach are its principled foundation, the guaranteed termination of the iterative algorithm at a Nash equilibrium (which corresponds to a consistent labeling condition), and soft labels quantifying the uncertainty of the label assignment process. We also investigate the beneficial effect of using pseudo-labels from linear classifiers to initialize the iterative process. The performance of the resulting methods is assessed on publicly available object recognition benchmark datasets involving both shallow and deep features. The experimental results demonstrate the suitability of the proposed game-theoretic approach for solving UDA tasks.
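A toy sketch of the equilibrium-seeking dynamics behind graph transduction games, assuming the standard replicator-style update with labeled points clamped; the paper's payoff construction is richer than this:

```python
# A hedged sketch of replicator-style label propagation, not the GTG paper's code.
import numpy as np

def gtg(W, probs, labeled_mask, n_iter=100):
    """W: (n, n) nonnegative similarities; probs: (n, c) soft labels;
    labeled_mask: True where the label is known and stays fixed."""
    fixed = probs[labeled_mask].copy()
    for _ in range(n_iter):
        probs = probs * (W @ probs)                  # payoff-weighted update
        probs = probs / probs.sum(1, keepdims=True)  # renormalize each player
        probs[labeled_mask] = fixed                  # clamp source labels
    return probs

rng = np.random.default_rng(0)
X = rng.random((20, 5))
W = np.exp(-np.linalg.norm(X[:, None] - X[None], axis=2))
probs = np.full((20, 3), 1 / 3)
probs[:5] = np.eye(3)[rng.integers(0, 3, 5)]         # 5 labeled "source" points
mask = np.zeros(20, bool); mask[:5] = True
print(gtg(W, probs, mask).argmax(1))                 # soft labels -> hard labels
```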
http://arxiv.org/abs/1905.02036
Spike-based communication between biological neurons is sparse and unreliable. This enables the brain to process visual information from the eyes efficiently. Taking inspiration from biology, artificial spiking neural networks coupled with silicon retinas attempt to model these computations. Recent findings in machine learning have allowed the derivation of a family of powerful synaptic plasticity rules approximating backpropagation for spiking networks. Are these rules capable of processing real-world visual sensory data? In this paper, we evaluate the performance of Event-Driven Random Back-Propagation (eRBP) at learning representations from event streams provided by a Dynamic Vision Sensor (DVS). First, we show that eRBP matches state-of-the-art performance on the DvsGesture dataset with the addition of a simple covert attention mechanism. By remapping visual receptive fields relative to the center of motion, this attention mechanism provides translation invariance at low computational cost compared to convolutions. Second, we successfully integrate eRBP into a real robotic setup, where a robotic arm grasps objects according to detected visual affordances. In this setup, visual information is actively sensed by a DVS mounted on a robotic head performing microsaccadic eye movements. We show that our method classifies affordances within 100 ms after microsaccade onset, which is comparable to human performance reported in behavioral studies. Our results suggest that advances in neuromorphic technology and plasticity rules enable the development of autonomous robots operating at high speed and low energy consumption.
http://arxiv.org/abs/1904.04805
Reinforcement learning usually makes use of numerical rewards, which have nice properties but also come with drawbacks and difficulties. Using rewards on an ordinal scale (ordinal rewards) is an alternative to numerical rewards that has received more attention in recent years. In this paper, a general approach to adapting reinforcement learning problems to the use of ordinal rewards is presented and motivated. We show how to convert common reinforcement learning algorithms to an ordinal variation using the example of Q-learning, and introduce Ordinal Deep Q-Networks, which adapt deep reinforcement learning to ordinal rewards. Additionally, we run evaluations on problems provided by the OpenAI Gym framework, showing that our ordinal variants exhibit performance comparable to the numerical variants on a number of problems. We also give first evidence that our ordinal variant is able to produce better results for problems with less engineered and simpler-to-design reward signals.
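A small sketch of the ordinal idea as we read it: store per state-action histograms over ordinal reward ranks and compare actions by a winning probability rather than a numerical Q-value (the paper's exact measure and its deep variant are not reproduced here):

```python
# A hedged sketch of ordinal action comparison; names and details are ours.
import numpy as np

n_states, n_actions, n_ranks = 10, 2, 3
counts = np.ones((n_states, n_actions, n_ranks))   # rank histograms per (s, a)

def winning_prob(s, a):
    """P(a's ordinal rank beats another action's rank), ties counted half."""
    p = counts[s, a] / counts[s, a].sum()
    others = [counts[s, b] / counts[s, b].sum()
              for b in range(n_actions) if b != a]
    q = np.mean(others, axis=0)
    beats = sum(p[i] * q[:i].sum() for i in range(n_ranks))
    return beats + 0.5 * float(p @ q)

def update(s, a, rank):
    counts[s, a, rank] += 1                        # the ordinal "reward"

print(winning_prob(0, 0))   # 0.5 before any updates
update(0, 0, 2)             # observe the best rank for action 0
print(winning_prob(0, 0))   # now > 0.5
```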
http://arxiv.org/abs/1905.02005
Tactical decision making for autonomous driving is challenging due to the diversity of environments, the uncertainty in the sensor information, and the complex interaction with other road users. This paper introduces a general framework for tactical decision making, which combines the concepts of planning and learning, in the form of Monte Carlo tree search and deep reinforcement learning. The method is based on the AlphaGo Zero algorithm, which is extended to a domain with a continuous state space where self-play cannot be used. The framework is applied to two different highway driving cases in a simulated environment and it is shown to perform better than a commonly used baseline method. The strength of combining planning and learning is also illustrated by a comparison to using the Monte Carlo tree search or the neural network policy separately.
http://arxiv.org/abs/1905.02680
The behavior of deep neural networks (DNNs) is hard to understand. This makes it necessary to explore post hoc explanation methods. We conduct the first comprehensive evaluation of explanation methods for NLP. To this end, we design two novel evaluation paradigms that cover two important classes of NLP problems: small context and large context problems. Both paradigms require no manual annotation and are therefore broadly applicable. We also introduce LIMSSE, an explanation method inspired by LIME that is designed for NLP. We show empirically that LIMSSE, LRP and DeepLIFT are the most effective explanation methods and recommend them for explaining DNNs in NLP.
http://arxiv.org/abs/1801.06422
Provable safety is one of the most critical challenges in automated driving. The behavior of numerous traffic participants in a scene cannot be predicted reliably due to complex interdependencies and the indiscriminate behavior of humans. Additionally, we face high uncertainties and only incomplete environment knowledge. Recent approaches minimize risk with probabilistic and machine learning methods - even under occlusions. These generate comfortable behavior with good traffic flow, but cannot guarantee safety of their maneuvers. Therefore, we contribute a safety verification method for trajectories under occlusions. The field-of-view of the ego vehicle and a map are used to identify critical sensing field edges, each representing a potentially hidden obstacle. The state of occluded obstacles is unknown, but can be over-approximated by intervals over all possible states. Then set-based methods are extended to provide occupancy predictions for obstacles with state intervals. The proposed method can verify the safety of given trajectories (e.g. if they ensure collision-free fail-safe maneuver options) w.r.t. arbitrary safe-state formulations. The potential for provably safe trajectory planning is shown in three evaluative scenarios.
http://arxiv.org/abs/1807.01262
This paper studies the problem of automatically extracting a short title from a manually written longer description of E-commerce products for display on mobile devices. It is a new extractive summarization problem on short text inputs, for which we propose a feature-enriched network model combining three different categories of features in parallel. Experimental results show that our framework outperforms several baselines with a substantial gain of 4.5%. Moreover, we produce an extractive summarization dataset for E-commerce short texts and will release it to the research community.
http://arxiv.org/abs/1803.11359
Slot filling is a critical task in natural language understanding (NLU) for dialog systems. State-of-the-art approaches treat it as a sequence labeling problem and adopt models such as BiLSTM-CRF. While these models work relatively well on standard benchmark datasets, they face challenges in the context of E-commerce, where the slot labels are more informative and carry richer expressions. In this work, inspired by the unique structure of E-commerce knowledge bases, we propose a novel multi-task model with cascade and residual connections, which jointly learns segment tagging, named entity tagging and slot filling. Experiments show the effectiveness of the proposed cascade and residual structures. Our model has a 14.6% advantage in F1 score over strong baseline methods on a new Chinese E-commerce shopping assistant dataset, while achieving competitive accuracy on a standard dataset. Furthermore, an online test deployed on a dominant E-commerce platform shows a 130% improvement in the accuracy of understanding user utterances. Our model has already gone into production on the E-commerce platform.
http://arxiv.org/abs/1803.11326
Inter-personal anatomical differences limit the accuracy of person-independent gaze estimation networks. Yet there is a need to lower gaze errors further to enable applications requiring higher quality. Further gains can be achieved by personalizing gaze networks, ideally with few calibration samples. However, over-parameterized neural networks are not amenable to learning from few examples as they can quickly over-fit. We embrace these challenges and propose a novel framework for Few-shot Adaptive GaZE Estimation (FAZE) for learning person-specific gaze networks with very few (less than 9) calibration samples. FAZE learns a rotation-aware latent representation of gaze via a disentangling encoder-decoder architecture along with a highly adaptable gaze estimator trained using meta-learning. It is capable of adapting to any new person to yield significant performance gains with as few as 3 samples, yielding state-of-the-art performance of 3.18-deg on GazeCapture, a 19% improvement over prior art.
http://arxiv.org/abs/1905.01941
Understanding how cities visually differ from each other is interesting for planners, residents, and historians. We investigate the interpretation of deep features learned by convolutional neural networks (CNNs) for city recognition. Given a trained city recognition network, we first generate weighted masks using the well-known Grad-CAM technique to select the most discriminative regions in the image. Since the image classification label is the city name and carries no information about class-discriminative objects, we investigate the interpretability of deep representations with two methods. (i) An unsupervised method is used to cluster the objects appearing in the visual explanations. (ii) A pretrained semantic segmentation model is used to label objects at the pixel level, and we then introduce statistical measures to quantitatively evaluate the interpretability of discriminative objects. The influence of network architectures and random initializations during training on the interpretability of CNN features for city recognition is also studied. The results suggest that network architectures affect the interpretability of learned visual representations more than different initializations do.
http://arxiv.org/abs/1905.01932
This paper proposes a zero-shot learning approach for audio classification based on textual information about class labels, without any audio samples from the target classes. We propose an audio classification system built on a bilinear model, which takes audio feature embeddings and semantic class label embeddings as input, and measures the compatibility between an audio feature embedding and a class label embedding. We use VGGish to extract audio feature embeddings from audio recordings. We treat textual labels as semantic side information about audio classes, and use Word2Vec to generate class label embeddings. Results on the ESC-50 dataset show that the proposed system can perform zero-shot audio classification with a small training dataset. It achieves accuracy (26% on average) better than random guessing (10%) on each audio category. In particular, it reaches up to 39.7% for the category of natural audio classes.
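The bilinear compatibility model lends itself to a compact sketch: the score of an (audio, label) pair is x^T W y, and prediction picks the best-scoring candidate label. The dimensions and the random stand-ins for VGGish/Word2Vec embeddings below are illustrative only:

```python
# A minimal sketch of bilinear zero-shot classification, not the paper's code.
import numpy as np

d_audio, d_label, n_classes = 128, 300, 50
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(d_audio, d_label))  # learned in practice
label_emb = rng.normal(size=(n_classes, d_label))    # e.g. Word2Vec vectors

def predict(x, candidate_labels):
    scores = x @ W @ label_emb[candidate_labels].T   # compatibility scores
    return candidate_labels[int(np.argmax(scores))]

x = rng.normal(size=d_audio)                         # e.g. a VGGish embedding
print(predict(x, np.arange(10)))                     # zero-shot over unseen classes
```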
http://arxiv.org/abs/1905.01926
The performance of a Convolutional Neural Network (CNN) depends on its hyperparameters, such as the number of layers, kernel sizes, or the learning rate. Especially in smaller networks and applications with limited computational resources, optimisation is key. We present a fast and efficient approach for CNN architecture selection. Taking into account time consumption, precision and robustness, we develop a heuristic to quickly and reliably assess a network’s performance. In combination with Bayesian optimisation (BO), to effectively cover the vast parameter space, our contribution offers a plain and powerful architecture search for this machine learning technique.
https://arxiv.org/abs/1905.01924
In this letter, we report on a unique device design strategy for increasing the breakdown voltage, and hence the Baliga Figure of Merit (BFOM), of III-nitride HEMTs by engineering the gate edge towards the drain. The breakdown voltage of such devices with a meandering gate-drain access region (M-HEMT) is found to be 62% higher than that of a conventional HEMT, while the ON resistance suffers by 76%, leading to an overall improvement in the BFOM of 28%. 3D TCAD simulations show that the decrease in the peak electric field at the gate edge is responsible for the increased breakdown voltage.
https://arxiv.org/abs/1905.01922
Existing methods for face image manipulation generally focus on editing the expression, changing some predefined attributes, or applying different filters. However, users lack the flexibility to control the shapes of different semantic facial parts in the generated face. In this paper, we propose an approach to compute a disentangled shape representation for a face image, namely the FaceShapeGene. The proposed FaceShapeGene encodes the shape information of each semantic facial part separately into a 1D latent vector. On the basis of the FaceShapeGene, a novel part-wise face image editing system is developed, which contains a shape-remix network and a conditional label-to-face transformer. The shape-remix network can freely recombine the part-wise latent vectors from different individuals, producing a remixed face shape in the form of a label map, which contains the facial characteristics of multiple subjects. The conditional label-to-face transformer, which is trained in an unsupervised cyclic manner, performs part-wise face editing while preserving the original identity of the subject. Experimental results on several tasks demonstrate that the proposed FaceShapeGene representation correctly disentangles the shape features of different semantic parts. Comparisons to existing methods demonstrate the superiority of the proposed method in accomplishing novel face editing tasks.
http://arxiv.org/abs/1905.01920
Automatically generating descriptive captions for images is a well-researched area in computer vision. However, existing evaluation approaches focus on measuring the similarity between two sentences, disregarding the fine-grained semantics of the captions. In our setting of images depicting persons interacting with branded products, the subject, predicate, object and the name of the branded product are important evaluation criteria for the generated captions. Generating image captions under these constraints is a new challenge, which we tackle in this work. By simultaneously predicting integer-valued ratings that describe attributes of the human-product interaction, we optimize a deep neural network architecture in a multi-task learning setting, which considerably improves caption quality. Furthermore, we introduce a novel metric that allows us to assess whether the generated captions meet our requirements (i.e., subject, predicate, object, and product name), and we describe a series of experiments on caption quality and how to address annotator disagreements for the image ratings with an approach called soft targets. We also show that our clause-focused metrics are applicable to other image captioning datasets, such as the popular MSCOCO dataset.
http://arxiv.org/abs/1905.01919
We present a novel algorithm for Fast Registration Of image Groups (FROG), applied to large 3D image groups. Our approach extracts 3D SURF keypoints from images, computes matched pairs of keypoints and registers the group by minimizing pair distances in a hubless way, i.e., without computing any central mean image. Using keypoints significantly reduces the problem complexity compared to voxel-based approaches, and enables us to provide an in-core global optimization, similar to Bundle Adjustment for 3D reconstruction. As we aim to register images of different patients, the matching step yields many outliers, so we propose a new EM-weighting algorithm which efficiently discards them. Global optimization is carried out with a fast gradient descent algorithm. This allows our approach to robustly register large datasets. The result is a set of diffeomorphic half transforms which link the volumes together and can be subsequently exploited for computational anatomy and landmark detection. We show experimental results on whole-body CT scans, with groups of up to 103 volumes. On a benchmark based on anatomical landmarks, our algorithm compares favorably with the star-groupwise voxel-based ANTs and NiftyReg approaches while being much faster. We also discuss the limitations of our approach for lower resolution images such as brain MRI.
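A hedged sketch of EM-style outlier weighting over match distances, assuming a Gaussian-inlier-plus-uniform-outlier mixture; the paper's exact model may differ:

```python
# A sketch (ours, not the authors') of EM-based down-weighting of outlier matches.
import numpy as np

def em_weights(d, n_iter=20, outlier_density=1e-3):
    """d: (n,) keypoint-pair distances; returns per-match inlier posteriors."""
    pi, mu, var = 0.5, d.mean(), d.var() + 1e-8
    for _ in range(n_iter):
        inlier = pi * np.exp(-0.5 * (d - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        w = inlier / (inlier + (1 - pi) * outlier_density)   # E-step
        pi = w.mean()                                        # M-step
        mu = (w * d).sum() / w.sum()
        var = (w * (d - mu) ** 2).sum() / w.sum() + 1e-8
    return w                                                 # ~0 for outliers

rng = np.random.default_rng(0)
d = np.concatenate([rng.normal(2.0, 0.5, 200),   # plausible matches
                    rng.uniform(0, 50, 40)])     # gross outliers
print(em_weights(d)[:5])
```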
http://arxiv.org/abs/1809.03951
Breast cancer is the most common invasive cancer and has the highest cancer occurrence in females. Handheld ultrasound is one of the most efficient ways to identify and diagnose breast cancer. The area and shape information of a lesion is very helpful for clinicians to make diagnostic decisions. In this study we propose a new deep-learning scheme, the semi-pixel-wise cycle generative adversarial net (SPCGAN), for segmenting lesions in 2D ultrasound. The method takes advantage of a fully convolutional network (FCN) and a generative adversarial net to segment a lesion using prior knowledge. We compared the proposed method to an FCN and the level set segmentation method on a test dataset consisting of 32 malignant lesions and 109 benign lesions. Our proposed method achieved a Dice similarity coefficient (DSC) of 0.92, while the FCN and the level set achieved 0.90 and 0.79 respectively. In particular, for malignant lesions, our method significantly increases the DSC of the FCN from 0.90 to 0.93 (p < 0.001). The results show that our SPCGAN can obtain robust segmentation results and may be used to relieve the radiologists’ burden of annotation.
http://arxiv.org/abs/1905.01902
In this paper we propose an efficient deep learning encoder-decoder network for performing Harmonic-Percussive Source Separation (HPSS). We show that we are able to greatly reduce the number of trainable model parameters by using a dense arrangement of skip connections between the model layers. We also explore the use of different kernel sizes for the 2D filters of the convolutional layers, with the objective of allowing the network to learn the different time-frequency patterns associated with percussive and harmonic sources more efficiently. The training and evaluation of the separation have been carried out using the training and test sets of the MUSDB18 dataset. Results show that the proposed deep network achieves automatic learning of high-level features and maintains HPSS performance at a state-of-the-art level, while reducing the number of parameters and the training time.
http://arxiv.org/abs/1905.01899
Utilizing a human-perception-related objective function to train a speech enhancement model has become a popular topic recently, primarily because the conventional mean squared error (MSE) loss cannot reflect auditory perception well. Among the human-perception-related metrics, the perceptual evaluation of speech quality (PESQ) is a typical one and has been proven to correlate highly with the quality scores rated by humans. Owing to its complex and non-differentiable properties, however, the PESQ function cannot be used to optimize speech enhancement models directly. In this study, we propose optimizing the enhancement model with an approximated PESQ function, which is differentiable and learned from the training data. The experimental results indicate that, compared to the MSE-based pre-trained model, fine-tuning with the learned loss function further improves the average PESQ score of the enhanced speech by 0.1 points.
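A minimal PyTorch sketch of the two-phase recipe: fit a differentiable surrogate of PESQ, freeze it, and fine-tune the enhancement model against it. The surrogate architecture and feature shapes here are assumptions, not the paper's:

```python
# A hedged sketch of a learned, differentiable PESQ surrogate used as a loss.
import torch
import torch.nn as nn

class PesqProxy(nn.Module):
    def __init__(self, n_feat=257):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * n_feat, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, enhanced, clean):
        # (enhanced, clean) frame pairs -> one predicted PESQ-like score.
        return self.net(torch.cat([enhanced, clean], dim=-1)).mean()

proxy = PesqProxy()
# Phase 1 (not shown): regress proxy outputs onto true PESQ scores.
for p in proxy.parameters():            # phase 2: freeze the surrogate
    p.requires_grad_(False)

def perceptual_loss(enhanced, clean):
    return -proxy(enhanced, clean)      # maximize predicted quality

enhanced = torch.randn(2, 100, 257, requires_grad=True)
clean = torch.randn(2, 100, 257)
perceptual_loss(enhanced, clean).backward()   # gradients reach the enhancer
```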
http://arxiv.org/abs/1905.01898
Distributional semantics provides multi-dimensional, graded, empirically induced word representations that successfully capture many aspects of meaning in natural languages, as shown in a large body of work in computational linguistics; yet, its impact in theoretical linguistics has so far been limited. This survey provides a critical discussion of the literature on distributional semantics, with an emphasis on methods and results that are of relevance for theoretical linguistics, in three areas: semantic change, polysemy, and the grammar-semantics interface.
http://arxiv.org/abs/1905.01896
While deep neural networks nowadays achieve impressive performance on semantic segmentation tasks, they are usually trained by optimizing pixel-wise losses such as cross-entropy. As a result, the predictions output by such networks usually struggle to accurately capture object boundaries and exhibit holes inside objects. In this paper, we propose a novel approach to improve the structure of predicted segmentation masks. We introduce a novel semantic edge detection network, which allows matching the predicted and ground truth segmentation masks. This Semantic Edge-Aware strategy (SEMEDA) can be combined with any backbone deep network in an end-to-end training framework. Through thorough experimental validation on the Pascal VOC 2012 and Cityscapes datasets, we show that the proposed SEMEDA approach enhances the structure of the predicted segmentation masks by enforcing sharp boundaries and avoiding discontinuities inside objects, improving segmentation performance. In addition, our semantic edge-aware loss can be integrated into any popular segmentation network without requiring any additional annotation and with negligible computational load, compared to the standard pixel-wise cross-entropy loss.
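An illustrative edge-matching penalty in the spirit of SEMEDA; note the paper learns a semantic edge detection network, whereas this sketch uses a fixed Sobel-style operator purely to show where such a loss plugs in:

```python
# A hedged PyTorch sketch of an edge-aware segmentation penalty.
import torch
import torch.nn.functional as F

def soft_edges(mask):
    """mask: (B, C, H, W) soft segmentation scores -> per-class edge maps."""
    kx = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]])
    ky = kx.transpose(2, 3)
    c = mask.shape[1]
    gx = F.conv2d(mask, kx.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(mask, ky.repeat(c, 1, 1, 1), padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def edge_loss(pred, target_onehot):
    # Penalize mismatch between predicted and ground-truth boundaries.
    return (soft_edges(pred) - soft_edges(target_onehot)).abs().mean()

pred = torch.softmax(torch.randn(1, 4, 32, 32), dim=1)
target = torch.nn.functional.one_hot(
    torch.randint(0, 4, (1, 32, 32)), 4).permute(0, 3, 1, 2).float()
print(edge_loss(pred, target))
```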
http://arxiv.org/abs/1905.01892
Image completion is the problem of generating whole images from fragments only. It encompasses inpainting (generating a patch given its surrounding), reverse inpainting/extrapolation (generating the periphery given the central patch) as well as colorization (generating one or several channels given other ones). In this paper, we employ a deep network to perform image completion, with adversarial training as well as perceptual and completion losses, and call it the “missing data encoder” (MDE). We consider several configurations based on how the seed fragments are chosen. We show that training MDE for “random extrapolation and colorization” (MDE-REC), i.e. using random channel-independent fragments, allows a better capture of the image semantics and geometry. MDE training makes use of a novel “hide-and-seek” adversarial loss, where the discriminator seeks the original non-masked regions, while the generator tries to hide them. We validate our models both qualitatively and quantitatively on several datasets, showing their interest for image completion, unsupervised representation learning as well as face occlusion handling.
http://arxiv.org/abs/1905.01861
This paper describes the machine translation systems developed by the Universidade Federal do Rio Grande do Sul (UFRGS) team for the biomedical translation shared task. Our systems are based on statistical machine translation and neural machine translation, using the Moses and OpenNMT toolkits, respectively. We participated in four translation directions for the English/Spanish and English/Portuguese language pairs. To create our training data, we concatenated several parallel corpora, both from in-domain and out-of-domain sources, as well as terminological resources from UMLS. Our systems achieved the best BLEU scores according to the official shared task evaluation.
http://arxiv.org/abs/1905.01855
The Scielo database is an important source of scientific information in Latin America, containing articles from several research domains. A striking characteristic of Scielo is that many of its full-text contents are presented in more than one language, making it a potential source of parallel corpora. In this article, we present the development of a parallel corpus from Scielo in three languages: English, Portuguese, and Spanish. Sentences were automatically aligned using the Hunalign algorithm for all language pairs, as well as for a subset of trilingual articles. We demonstrate the capabilities of our corpus by training a Statistical Machine Translation system (Moses) for each language pair, which outperformed related works on scientific articles. Sentence alignment was also manually evaluated, showing an average of 98.8% correctly aligned sentences across all languages. Our parallel corpus is freely available in the TMX format, with complementary information regarding article metadata.
http://arxiv.org/abs/1905.01852
Most existing recognition algorithms are proposed for closed set scenarios, where all categories are known beforehand. However, in practice, recognition is essentially an open set problem. There are categories we know, called “knowns”, and there are more we do not know, called “unknowns”. Enumerating all categories beforehand is never possible; consequently, it is infeasible to prepare sufficient training samples for the unknowns. Applying closed set recognition methods naturally leads to unseen-category errors. To address this problem, we propose the prototype based Open Deep Network (P-ODN) for open set recognition tasks. Specifically, we introduce prototype learning into open set recognition. Prototypes and prototype radiuses are trained jointly to guide a CNN to derive more discriminative features. P-ODN then detects the unknowns by applying a multi-class triplet thresholding method based on the distance metric between features and prototypes. The unknowns detected in this process are manually labeled as new categories. Predictors for the new categories are added to the classification layer to “open” the deep neural network and incorporate new categories dynamically. The weights of the new predictors are initialized carefully by applying a distance-based algorithm to transfer the learned knowledge. This initialization method speeds up the fine-tuning process and reduces the number of samples needed to train new predictors. Extensive experiments show that P-ODN can effectively detect unknowns and needs only a few samples with human intervention to recognize a new category. In real-world scenarios, our method achieves state-of-the-art performance on the UCF11, UCF50, UCF101 and HMDB51 datasets.
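The prototype-and-radius rejection rule can be sketched compactly; the thresholding below is a simplified stand-in for the paper's multi-class triplet thresholding:

```python
# A sketch (ours) of prototype-based unknown detection: a sample is
# "unknown" when it falls outside every class's learned radius.
import numpy as np

def classify_or_reject(feature, prototypes, radii):
    """prototypes: (n_classes, d); radii: (n_classes,) learned thresholds."""
    dists = np.linalg.norm(prototypes - feature, axis=1)
    k = int(np.argmin(dists))
    return k if dists[k] <= radii[k] else -1         # -1 flags an unknown

protos = np.eye(3)                                   # toy 3-class prototypes
radii = np.full(3, 0.8)
print(classify_or_reject(np.array([1.0, 0.1, 0.0]), protos, radii))  # -> 0
print(classify_or_reject(np.array([5.0, 5.0, 5.0]), protos, radii))  # -> -1
```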
http://arxiv.org/abs/1905.01851
The abstraction of musical structures (notes, melodies, chords, harmonic or rhythmic progressions, etc.) as mathematical objects in a geometrical space is one of the great accomplishments of contemporary music theory. Building on this foundation, I generalize the concept of musical spaces as networks and derive functional principles of compositional design by the direct analysis of the network topology. This approach provides a novel framework for the analysis and quantification of similarity of musical objects and structures, and suggests a way to relate such measures to the human perception of different musical entities. Finally, the analysis of a single work or a corpus of compositions as complex networks provides alternative ways of interpreting the compositional process of a composer by quantifying emergent behaviors with well-established statistical mechanics techniques. Interpreting the latter as probabilistic randomness in the network, I develop novel compositional design frameworks that are central to my own artistic research.
http://arxiv.org/abs/1905.01842
The advancement of Artificial Intelligence (AI) technologies makes it possible to learn stylistic design criteria from existing maps or other visual art and transfer these styles to make new digital maps. In this paper, we propose a novel AI framework for map style transfer applicable across multiple map scales. Specifically, we identify and transfer the stylistic elements from a target group of visual examples, including Google Maps, OpenStreetMap, and artistic paintings, to unstylized GIS vector data through two generative adversarial network (GAN) models. We then train a binary classifier based on a deep convolutional neural network to evaluate whether the style-transferred map images preserve the original map design characteristics. Our experimental results show that GANs have great potential for multiscale map style transfer, but many challenges remain that require future research.
https://arxiv.org/abs/1905.02200
The emergence of big data enables us to evaluate the various human emotions at places from a statistical perspective by applying affective computing. In this study, a novel framework for extracting human emotions from large-scale georeferenced photos at different places is proposed. After constructing places based on spatial clustering of user-generated footprints collected from social media websites, online cognitive services are utilized to extract human emotions from facial expressions using state-of-the-art computer vision techniques. Two happiness metrics are then defined for measuring human emotions at different places. To validate the feasibility of the framework, we take 80 tourist attractions around the world as an example, and a happiness ranking list of places is generated based on human emotions calculated over 2 million faces detected in over 6 million photos. Different kinds of geographical contexts are taken into consideration to find out the relationship between human emotions and environmental factors. Results show that much of the emotional variation at different places can be explained by a few factors such as openness. The research may offer insights on integrating human emotions to enrich the understanding of sense of place in geography and in place-based GIS.
http://arxiv.org/abs/1905.01817
Current state-of-the-art object detectors are fine-tuned from off-the-shelf networks pretrained on the large-scale classification dataset ImageNet, which incurs some additional problems: 1) classification and detection have different degrees of sensitivity to translation, resulting in a learning objective bias; 2) the architecture is limited by the classification network, leading to inconvenient modification. To cope with these problems, training detectors from scratch is a feasible solution. However, detectors trained from scratch generally perform worse than pretrained ones, and even suffer from convergence issues during training. In this paper, we explore how to train object detectors from scratch robustly. By analysing previous work on the optimization landscape, we find that one of the overlooked points in current trained-from-scratch detectors is BatchNorm. Resorting to the stable and predictable gradients brought by BatchNorm, detectors can be trained from scratch stably while keeping favourable performance independent of the network architecture. Taking this advantage, we are able to explore various types of networks for object detection without suffering from poor convergence. Through extensive experiments and analyses of the downsampling factor, we propose the Root-ResNet backbone network, which makes full use of the information from the original images. Our ScratchDet achieves state-of-the-art accuracy on PASCAL VOC 2007, 2012 and MS COCO among all train-from-scratch detectors, and even performs better than several one-stage pretrained methods. Codes will be made publicly available at https://github.com/KimSoybean/ScratchDet.
http://arxiv.org/abs/1810.08425
Differentiable programming is able to combine different functions or programs in a processing pipeline with the goal of applying end-to-end learning or optimization. A significant impediment is the non-differentiable nature of some algorithms. We propose to use synthetic gradients (SG) to overcome this difficulty. SG uses the universal function approximation property of neural networks. We apply SG to combine convolutional neural network (CNN) with dynamic programming (DP) in end-to-end learning for segmenting left ventricle from short axis view of heart MRI. Our experiments show that end-to-end combination of CNN and DP requires fewer labeled images to achieve a significantly better segmentation accuracy than using only CNN.
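A hedged PyTorch sketch of the synthetic-gradient mechanism: a non-differentiable stage (here just `torch.round`, standing in for a DP solver) gets its backward pass from a small network. Training that network to match true gradients is a separate step not shown:

```python
# A sketch (ours, not the paper's code) of a synthetic-gradient wrapper.
import torch
import torch.nn as nn

grad_net = nn.Linear(16, 16)           # predicts dL/d(input) from the input

class SGWrapper(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Stand-in for the non-differentiable module (e.g. a DP solver).
        return torch.round(x)
    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Synthetic gradient supplied by the learned network.
        return grad_net(x) * grad_out

x = torch.randn(4, 16, requires_grad=True)
y = SGWrapper.apply(x)
y.sum().backward()                     # flows through the synthetic gradient
print(x.grad.shape)
```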
http://arxiv.org/abs/1812.00328
RSL19BD (Waseda University Sakai Laboratory) participated in the Fourth Dialogue Breakdown Detection Challenge (DBDC4) and submitted five runs to both English and Japanese subtasks. In these runs, we utilise the Decision Tree-based model and the Long Short-Term Memory-based (LSTM-based) models following the approaches of RSL17BD and KTH in the Third Dialogue Breakdown Detection Challenge (DBDC3) respectively. The Decision Tree-based model follows the approach of RSL17BD but utilises RandomForestRegressor instead of ExtraTreesRegressor. In addition, instead of predicting the mean and the variance of the probability distribution of the three breakdown labels, it predicts the probability of each label directly. The LSTM-based model follows the approach of KTH with some changes in the architecture and utilises Convolutional Neural Network (CNN) to perform text feature extraction. In addition, instead of targeting the single breakdown label and minimising the categorical cross entropy loss, it targets the probability distribution of the three breakdown labels and minimises the mean squared error. Run 1 utilises a Decision Tree-based model; Run 2 utilises an LSTM-based model; Run 3 performs an ensemble of 5 LSTM-based models; Run 4 performs an ensemble of Run 1 and Run 2; Run 5 performs an ensemble of Run 1 and Run 3. Run 5 statistically significantly outperformed all other runs in terms of MSE (NB, PB, B) for the English data and all other runs except Run 4 in terms of MSE (NB, PB, B) for the Japanese data (alpha level = 0.05).
http://arxiv.org/abs/1905.01799
This paper aims to learn a compact representation of a video for the video face recognition task. We make the following contributions: first, we propose a meta attention-based aggregation scheme which adaptively weighs features in a fine-grained manner along each feature dimension among all frames to form a compact and discriminative representation. It makes the best use of the valuable or discriminative part of each frame to promote face recognition performance, without discarding or dismissing low-quality frames as usual methods do. Second, we build a feature aggregation network comprised of a feature embedding module and a feature aggregation module. The embedding module is a convolutional neural network used to extract a feature vector from a face image, while the aggregation module consists of two cascaded meta attention blocks which adaptively aggregate the feature vectors into a single fixed-length representation. The network can deal with an arbitrary number of frames and is insensitive to frame order. Third, we validate the performance of the proposed aggregation scheme. Experiments on publicly available datasets, such as the YouTube face dataset and the IJB-A dataset, show the effectiveness of our method, which achieves competitive performance on both the verification and identification protocols.
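A minimal sketch of dimension-wise attentive aggregation across frames, the core operation the first contribution describes; the single linear scorer stands in for the paper's meta attention blocks:

```python
# A hedged sketch of per-dimension attentive frame aggregation.
import torch
import torch.nn as nn

d = 256
attn = nn.Linear(d, d)                       # scores each dim of each frame

def aggregate(frame_feats):
    """frame_feats: (n_frames, d) -> a single (d,) video representation."""
    scores = attn(frame_feats)               # (n_frames, d)
    weights = torch.softmax(scores, dim=0)   # normalize over frames, per dim
    return (weights * frame_feats).sum(0)    # order- and length-invariant

video = torch.randn(17, d)                   # arbitrary number of frames
print(aggregate(video).shape)                # torch.Size([256])
```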
http://arxiv.org/abs/1905.01796
To achieve lightweight object detectors for deployment on edge devices, an effective model compression pipeline is proposed in this paper. The compression pipeline consists of automatic channel pruning for the backbone, fixed channel deletion for the branch layers and knowledge distillation for guidance learning. As a result, ResNet50-v1d is auto-pruned and fine-tuned on ImageNet to attain a compact base model as the backbone of the object detector. Then, lightweight object detectors are implemented with the proposed compression pipeline. For instance, an SSD-300 with a model size of 16.3 MB, 2.31 GFLOPS and 71.2 mAP is created, a better result than SSD-300-MobileNet.
http://arxiv.org/abs/1905.01787
For network architecture search (NAS), it is crucial but challenging to simultaneously guarantee both effectiveness and efficiency. Towards achieving this goal, we develop a differentiable NAS solution, where the search space includes arbitrary feed-forward networks consisting of a predefined number of connections. Benefiting from a proposed ensemble Gumbel-Softmax estimator, our method optimizes both the architecture of a deep network and its parameters in the same round of backward propagation, yielding an end-to-end mechanism for searching network architectures. Extensive experiments on a variety of popular datasets provide strong evidence that our method is capable of discovering high-performance architectures while guaranteeing the requisite efficiency during searching.
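The Gumbel-Softmax trick that underlies such differentiable architecture choices can be sketched in a few lines; this is the generic estimator with a straight-through hard sample, not the paper's ensemble variant:

```python
# A hedged sketch of Gumbel-Softmax connection selection for NAS.
import torch

def gumbel_softmax(logits, tau=1.0, hard=True):
    g = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y = torch.softmax((logits + g) / tau, dim=-1)
    if hard:                                  # straight-through estimator
        one_hot = torch.zeros_like(y).scatter_(-1, y.argmax(-1, True), 1.0)
        y = (one_hot - y).detach() + y        # hard forward, soft backward
    return y

arch_logits = torch.zeros(5, requires_grad=True)   # 5 candidate connections
mask = gumbel_softmax(arch_logits)                 # differentiable selection
print(mask)                                        # one-hot in the forward pass
```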
https://arxiv.org/abs/1905.01786
We present our 7th place solution to the Gendered Pronoun Resolution challenge, which uses BERT without fine-tuning and a novel augmentation strategy designed for contextual-embedding token-level tasks. Our method anonymizes the referent by replacing candidate names with a set of common placeholder names. Besides the usual benefit of effectively increasing the training data size, this approach diversifies the idiosyncratic information embedded in names. Using the same set of common first names can also help the model recognize names better, shorten token length, and remove gender and regional biases associated with names. The system scored 0.1947 log loss in stage 2, where the augmentation contributed an improvement of 0.04. Post-competition analysis shows that, when using different embedding layers, the system scores 0.1799, which would have been third place.
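The augmentation itself is simple enough to sketch; the placeholder name list and helper below are illustrative, not the authors' code:

```python
# A sketch of placeholder-name anonymization for a pronoun resolution example.
import random

PLACEHOLDERS = ["Alice", "Bob", "Carol", "David", "Emma", "Frank"]

def anonymize(text, name_a, name_b, rng=random):
    """Swap the two candidate names for randomly chosen placeholder names."""
    pa, pb = rng.sample(PLACEHOLDERS, 2)
    return text.replace(name_a, pa).replace(name_b, pb), pa, pb

text = "Cheryl told Kathleen that she had won the prize."
print(anonymize(text, "Cheryl", "Kathleen", random.Random(0)))
```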
http://arxiv.org/abs/1905.01780
Videos can be manipulated by duplicating a sequence of consecutive frames with the goal of concealing or imitating a specific content in the same video. In this paper, we propose a novel coarse-to-fine framework based on deep Convolutional Neural Networks to automatically detect and localize such frame duplication. First, an I3D network finds coarse-level matches between candidate duplicated frame sequences and the corresponding selected original frame sequences. Then a Siamese network based on ResNet architecture identifies fine-level correspondences between an individual duplicated frame and the corresponding selected frame. We also propose a robust statistical approach to compute a video-level score indicating the likelihood of manipulation or forgery. Additionally, for providing manipulation localization information we develop an inconsistency detector based on the I3D network to distinguish the duplicated frames from the selected original frames. Quantified evaluation on two challenging video forgery datasets clearly demonstrates that this approach performs significantly better than four recent state-of-the-art methods.
http://arxiv.org/abs/1811.10762
A potentially huge advantage of spectral computed tomography (CT) is its capability to provide accurate material identification and quantitative tissue information. This can benefit clinical applications such as brain angiography, early tumor recognition, etc. To achieve more accurate material components with higher material image quality, we develop a dictionary learning based image-domain material decomposition (DLIMD) method for spectral CT in this paper. First, we reconstruct the spectral CT image from projections and calculate the material coefficient matrix by selecting uniform regions of basis materials from the image reconstruction results. Second, we employ the direct inversion (DI) method to obtain initial material decomposition results, and a set of image patches is extracted from the mode-1 unfolding of the normalized material image tensor to train a unified dictionary with the K-SVD technique. Third, the trained dictionary is employed to exploit the similarities within the decomposed material images by constructing the DLIMD model. Fourth, more constraints (i.e., volume conservation and the bounds of each pixel within the material maps) are further integrated into the model to improve the accuracy of material decomposition. Finally, both physical phantom and preclinical experiments are employed to evaluate the performance of the proposed DLIMD in terms of material decomposition accuracy, material image edge preservation and feature recovery.
https://arxiv.org/abs/1905.02567
What did it feel like to walk through a city from the past? In this work, we describe Nostalgin (Nostalgia Engine), a method that can faithfully reconstruct cities from historical images. Unlike existing work in city reconstruction, we focus on reconstructing 3D cities from historical images. Working with historical image data is substantially more difficult, as there are significantly fewer buildings available and the parameters of the cameras that captured the images are unknown. Nostalgin can generate a city model even if there is only a single image per facade, regardless of viewpoint or occlusions. To achieve this, our novel architecture combines image segmentation, rectification, and inpainting. We motivate our design decisions with experimental analysis of individual components of our pipeline, and show that we can improve on baselines in both speed and visual realism. We demonstrate the efficacy of our pipeline by recreating two 1940s Manhattan city blocks. We aim to deploy Nostalgin as an open source platform where users can generate immersive historical experiences from their own photos.
http://arxiv.org/abs/1905.01772