Fundamental frequency is one of the most important characteristics of speech and audio signals. Harmonic model-based fundamental frequency estimators offer a higher estimation accuracy and robustness against noise than the widely used autocorrelation-based methods. However, the traditional harmonic model-based estimators do not take the temporal smoothness of the fundamental frequency, the model order, and the voicing into account as they process each data segment independently. In this paper, a fully Bayesian fundamental frequency tracking algorithm based on the harmonic model and a first-order Markov process model is proposed. Smoothness priors are imposed on the fundamental frequencies, model orders, and voicing using first-order Markov process models. Using these Markov models, fundamental frequency estimation and voicing detection errors can be reduced. Using the harmonic model, the proposed fundamental frequency tracker has an improved robustness to noise. An analytical form of the likelihood function, which can be computed efficiently, is derived. Compared to the state-of-the-art neural network and non-parametric approaches, the proposed fundamental frequency tracking algorithm reduces the mean absolute errors and gross errors by 15\% and 20\% on the Keele pitch database and 36\% and 26\% on sustained /a/ sounds from a database of Parkinson’s disease voices under 0 dB white Gaussian noise. A MATLAB version of the proposed algorithm is made freely available for reproduction of the results\footnote{An implementation of the proposed algorithm using MATLAB may be found in \url{https://tinyurl.com/yxn4a543}}.
http://arxiv.org/abs/1905.08557
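For orientation, a minimal LaTeX sketch of the kind of harmonic model and first-order Markov smoothness prior referred to above; the notation is generic and the Gaussian transition density is only one illustrative choice, not necessarily the paper's:

\[
  x(n) = \sum_{l=1}^{L}\bigl(a_l\cos(l\,\omega_0 n) + b_l\sin(l\,\omega_0 n)\bigr) + e(n),
  \qquad
  p\bigl(\omega_0^{(t)}\mid\omega_0^{(t-1)}\bigr) \propto \exp\!\Bigl(-\tfrac{\bigl(\omega_0^{(t)}-\omega_0^{(t-1)}\bigr)^2}{2\sigma^2}\Bigr),
\]

where $L$ is the (unknown) number of harmonics, $e(n)$ is noise, and analogous first-order Markov priors are placed on the model order and the voicing state across segments $t$.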
This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge. The goal of the SELD task is to detect the temporal activities of a known set of sound event classes, and further localize them in space when active. As part of the challenge, a synthesized dataset with each sound event associated with a spatial coordinate represented using azimuth and elevation angles is provided. These sound events are spatialized using real-life impulse responses collected at multiple spatial coordinates in five different rooms with varying dimensions and material properties. A baseline SELD method employing a convolutional recurrent neural network is used to generate benchmark scores for this reverberant dataset. The benchmark scores are obtained using the recommended cross-validation setup.
http://arxiv.org/abs/1905.08546
X-ray images are used by physicians in every modern healthcare organization and hospital to guide surgical and medical treatment. Because X-ray imaging can depict bone structure painlessly, it allows physicians to evaluate and identify diseases of the skeletal system faster and more efficiently. This paper presents an efficient contrast enhancement technique using morphological operators that helps to visualize important bone segments and soft tissues more clearly. Top-hat and bottom-hat transforms are utilized to enhance the image, where the gradient magnitude is used to automatically select the structuring element (SE) size. Experimental evaluation on different X-ray imaging databases shows the effectiveness of our method, which also produces comparatively better output than some existing image enhancement techniques.
http://arxiv.org/abs/1905.08545
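A minimal Python/OpenCV sketch of the classic top-hat/bottom-hat contrast enhancement the abstract above builds on; the fixed structuring-element size below is a placeholder for the paper's automatic, gradient-magnitude-based selection:

    import cv2

    def enhance_xray(img, se_size=15):
        # img: grayscale uint8 X-ray image.
        # se_size is a hypothetical fixed value; the paper selects the SE size
        # automatically from the gradient magnitude of the image.
        se = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (se_size, se_size))
        top = cv2.morphologyEx(img, cv2.MORPH_TOPHAT, se)        # bright details
        bottom = cv2.morphologyEx(img, cv2.MORPH_BLACKHAT, se)   # dark details
        # Boost bright structures and suppress dark ones (saturating arithmetic).
        return cv2.subtract(cv2.add(img, top), bottom)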
High-dimensional data classification is a fundamental task in machine learning and imaging science. In this paper, we propose a two-stage multiphase semi-supervised classification method for classifying high-dimensional data and unstructured point clouds. To begin with, a fuzzy classification method such as the standard support vector machine is used to generate a warm initialization. We then apply a two-stage approach named SaT (smoothing and thresholding) to improve the classification. In the first stage, an unconstrained convex variational model is implemented to purify and smooth the initialization, followed by a second stage that projects the smoothed partition obtained in stage one onto a binary partition. These two stages can be repeated, with the latest result as a new initialization, to keep improving the classification quality. We show that the convex model of the smoothing stage has a unique solution and can be solved by a specifically designed primal-dual algorithm whose convergence is guaranteed. We test our method and compare it with the state-of-the-art methods on several benchmark data sets. The experimental results demonstrate clearly that our method is superior in both classification accuracy and computation speed for high-dimensional data and point clouds.
http://arxiv.org/abs/1905.08538
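The two SaT stages above can be illustrated schematically (a generic smoothing-plus-thresholding instance with a quadratic fidelity term and a convex regularizer $R$, not the paper's exact model):

\[
  u^{\star} = \arg\min_{u}\; \tfrac{1}{2}\|u - u^{0}\|_2^{2} + \lambda\, R(u),
  \qquad
  u_i^{\mathrm{bin}} =
  \begin{cases}
    1, & u^{\star}_i \ge \tau,\\
    0, & \text{otherwise},
  \end{cases}
\]

where $u^{0}$ is the fuzzy initialization (e.g. SVM scores), $\lambda > 0$ balances fidelity against smoothness, and $\tau$ is a threshold such as $1/2$; the binary partition can then serve as the initialization of the next round.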
The funnel lane concept is a qualitative visual navigation method that helps robots navigate autonomously using a recorded video. A visual path is extracted from the video by selecting some keyframes, and the robot uses this visual path for its navigation. Unlike some other methods, the funnel lane does not rely on traditional calculations of Jacobians, homographies, fundamental matrices, or the focus of expansion, and does not require any camera calibration. However, the funnel lane has some shortcomings. One problem is that it gives no information about the radius of rotation, so in turns the robot rotates with a constant radius along the path. This reduces maneuverability and prevents the robot from handling all turning conditions. In addition, this problem makes it hard for the robot to correct its path when it deviates from the desired path. Another flaw is that in some situations the robot cannot tell whether a translation or a rotation should be followed in the visual path, which leads it to deviate and fail to follow the desired path. This paper introduces the sloped funnel lane technique, which does not have these shortcomings. The roll and pitch angles are added to the funnel lane, which helps the robot set its radius of rotation according to the turning conditions it faces. Moreover, they help reduce the ambiguity between translation and rotation. Therefore the robot can deal with different turning conditions, and the navigation method is more robust and accurate. Experimental results on challenging scenarios on a real ground robot demonstrate the effectiveness of the sloped funnel lane technique.
http://arxiv.org/abs/1808.07707
The high sensitivity of neural architecture search (NAS) methods to their inputs, such as the step-size (i.e., learning rate) and the search space, prevents practitioners from applying them out-of-the-box to their own problems, even though their purpose is to automate part of the tuning process. Aiming at a fast, robust, and widely applicable NAS, we develop a generic optimization framework for NAS. We turn the coupled optimization of connection weights and neural architecture into a differentiable optimization by means of stochastic relaxation. The framework accepts an arbitrary search space (widely applicable) and enables gradient-based simultaneous optimization of weights and architecture (fast). We propose a stochastic natural gradient method with an adaptive step-size mechanism built upon our theoretical investigation (robust). Despite its simplicity and the absence of problem-dependent parameter tuning, our method exhibits near state-of-the-art performance with low computational budgets on both image classification and inpainting tasks.
https://arxiv.org/abs/1905.08537
The performance of a Part-of-speech (POS) tagger is highly dependent on the domain of the processed text, and for many domains there is no or only very little training data available. This work addresses the problem of POS tagging noisy user-generated text using a neural network. We propose an architecture that trains an out-of-domain model on a large newswire corpus and transfers those weights by using them as a prior for a model trained on the target domain (a dataset of German Tweets) for which very few annotations are available. The neural network has two standard bidirectional LSTMs at its core. However, we find it crucial to also encode a set of task-specific features and to obtain reliable (source-domain and target-domain) word representations. Experiments with different regularization techniques such as early stopping, dropout, and fine-tuning the domain-adaptation prior weights are conducted. Our best model uses external weights from the out-of-domain model as well as feature embeddings and pre-trained word and sub-word embeddings, and achieves a tagging accuracy of slightly over 90%, improving on the previous state of the art for this task.
http://arxiv.org/abs/1905.08920
Lake and Baroni (2018) introduced the SCAN dataset probing the ability of seq2seq models to capture compositional generalizations, such as inferring the meaning of “jump around” zero-shot from the component words. Recurrent networks (RNNs) were found to completely fail the most challenging generalization cases. We test here a convolutional network (CNN) on these tasks, reporting hugely improved performance with respect to RNNs. Despite this large improvement, however, the CNN has not induced systematic rules, suggesting that the difference between compositional and non-compositional behaviour is not clear-cut.
http://arxiv.org/abs/1905.08527
Many dialogue management frameworks allow the system designer to directly define belief rules to implement an efficient dialogue policy. Because these rules are directly defined, the components are said to be hand-crafted. As dialogues become more complex, the number of states, transitions, and policy decisions becomes very large. To facilitate the dialogue policy design process, we propose an approach to automatically learn belief rules using supervised machine learning. We validate our ideas in the Student-Advisor conversation domain, where we extract latent beliefs such as the student being curious, confused, or neutral. Further, we perform epistemic reasoning that helps tailor the dialogue to the student’s emotional state and hence improve the overall effectiveness of the dialogue system. Our latent belief identification approach shows an accuracy of 87%, which results in efficient and meaningful dialogue management.
http://arxiv.org/abs/1811.10238
Inverse reinforcement learning (IRL) is an ill-posed inverse problem, since expert demonstrations may be explained by many reward functions, which are hard to recover by local search methods such as gradient methods. In this paper, we generalize the original IRL problem to recovering a probability distribution over reward functions. We call this generalized problem stochastic inverse reinforcement learning (SIRL), and we first formulate it as an expectation optimization problem. We adopt the Monte Carlo expectation-maximization (MCEM) method, a global search method, to estimate the parameters of the probability distribution as the first solution to SIRL. With our approach, it is possible to observe the deep intrinsic properties of IRL from a global viewpoint, and the technique achieves considerably robust recovery performance on the classic learning environment, objectworld.
http://arxiv.org/abs/1905.08513
Question answering (QA) using textual sources, such as reading comprehension (RC), has attracted much attention recently. This study focuses on the task of explainable multi-hop QA, which requires the system to return the answer with evidence sentences by reasoning over and gathering disjoint pieces of the reference texts. For evidence extraction in explainable multi-hop QA, existing methods extract evidence sentences by evaluating the importance of each sentence independently. In this study, we propose the Query Focused Extractor (QFE) model and introduce multi-task learning of the QA model for answer selection and the QFE model for evidence extraction. Inspired by extractive summarization models, QFE sequentially extracts evidence sentences using an RNN with an attention mechanism over the question sentence. This enables QFE to consider the dependency among the evidence sentences and to cover the important information in the question sentence. Experimental results show that QFE with a simple RC baseline model achieves a state-of-the-art evidence extraction score on HotpotQA. Although designed for RC, QFE also achieves a state-of-the-art evidence extraction score on FEVER, which is a recognizing-textual-entailment task on a large textual database.
http://arxiv.org/abs/1905.08511
Graph Neural Networks (GNNs) are an effective framework for representation learning and prediction on graph-structured data. A neighborhood aggregation scheme is applied in the training of GNNs and their variants, in which the representation of each node is calculated by recursively aggregating and transforming the representations of its neighboring nodes. A variety of GNNs and variants have been built and have achieved state-of-the-art results on both node and graph classification tasks. However, despite the common neighborhood used in state-of-the-art GNN models, there is little analysis of the properties of the neighborhood in the neighborhood aggregation scheme. Here, we analyze the properties of the nodes, edges, and neighborhoods of the graph model. Our results characterize the efficiency of the common neighborhood used in state-of-the-art GNNs and show that it is not sufficient for the representation learning of the nodes. We propose a simple neighborhood that is likely to be more sufficient. We empirically validate our theoretical analysis on a number of graph classification benchmarks and demonstrate that our method achieves state-of-the-art performance on the listed benchmarks. The implementation code is available at \url{https://github.com/CODE-SUBMIT/Neighborhood-Enlargement-in-Graph-Network}.
http://arxiv.org/abs/1905.08509
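For reference, the generic neighborhood aggregation update analyzed above can be written as

\[
  h_v^{(k)} = \mathrm{UPDATE}^{(k)}\!\Bigl(h_v^{(k-1)},\; \mathrm{AGGREGATE}^{(k)}\bigl(\{\, h_u^{(k-1)} : u \in \mathcal{N}(v) \,\}\bigr)\Bigr),
\]

where $\mathcal{N}(v)$ is the neighborhood of node $v$; the choice (and proposed enlargement) of $\mathcal{N}(v)$ is the subject of the paper.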
Many Multi-View-Stereo algorithms extract a 3D mesh model of a scene, after fusing depth maps into a volumetric representation of the space. Due to the limited scalability of such representations, the estimated model does not capture fine details of the scene. Therefore a mesh refinement algorithm is usually applied; it improves the mesh resolution and accuracy by minimizing the photometric error induced by the 3D model into pairs of cameras. The choice of these pairs significantly affects the quality of the refinement and usually relies on sparse 3D points belonging to the surface. Instead, in this paper, to increase the quality of pairs selection, we exploit the 3D model (before the refinement) to compute five metrics: scene coverage, mutual image overlap, image resolution, camera parallax, and a new symmetry term. To improve the refinement robustness, we also propose an explicit method to manage occlusions, which may negatively affect the computation of the photometric error. The proposed method takes into account the depth of the model while computing the similarity measure and its gradient. We quantitatively and qualitatively validated our approach on publicly available datasets against state of the art reconstruction methods.
http://arxiv.org/abs/1905.08502
With the growth of images on the web, hashing, which enables high-speed image retrieval, has been actively studied. In recent years, various hashing methods based on deep neural networks have been proposed and have achieved higher precision than other hashing methods. In these methods, multiple losses for the hash codes and the parameters of the neural networks are defined, and hash codes are generated that minimize the weighted sum of the losses. Therefore, an expert has to tune the weights for the losses heuristically, and the probabilistic optimality of the loss function cannot be explained. In order to generate explainable hash codes without weight tuning, we theoretically derive a single loss function with no hyperparameters for the hash code from the probability distribution of the images. By generating hash codes that minimize this loss function, highly accurate image retrieval with probabilistic optimality is performed. We evaluate the hashing performance on MNIST, CIFAR-10, and SVHN, and show that the proposed method outperforms state-of-the-art hashing methods.
http://arxiv.org/abs/1905.08501
For a machine learning task, lacking sufficient samples means that the trained model has low confidence in approaching the ground truth function. Since the generative adversarial network (GAN) was proposed, there has been hope for small-sample data augmentation (DA) with realistic fake data, and many works have validated the viability of GAN-based DA. Although most of these works report that higher accuracy can be achieved using GAN-based DA, some researchers stress that the fake data generated by a GAN have an inherent bias. In this paper, we explore when this bias is low enough not to hurt performance. We set up experiments to depict the bias in different GAN-based DA settings and, from the results, design a pipeline to inspect whether a specific dataset can be efficiently augmented with GAN-based DA. Finally, based on our attempts to reduce the bias, we propose some advice to mitigate bias in GAN-based DA applications.
http://arxiv.org/abs/1905.08495
Multi-frame approaches for single-microphone speech enhancement, e.g., the multi-frame minimum-variance-distortionless-response (MVDR) filter, are able to exploit speech correlations across neighboring time frames. In contrast to single-frame approaches such as the Wiener gain, it has been shown that multi-frame approaches achieve a substantial noise reduction with hardly any speech distortion, provided that an accurate estimate of the correlation matrices and especially the speech interframe correlation vector is available. Typical estimation procedures of the correlation matrices and the speech interframe correlation (IFC) vector require an estimate of the speech presence probability (SPP) in each time-frequency bin. In this paper, we propose to use a bi-directional long short-term memory deep neural network (DNN) to estimate a speech mask and a noise mask for each time-frequency bin, using which two different SPP estimates are derived. Aiming at achieving a robust performance, the DNN is trained for various noise types and signal-to-noise ratios. Experimental results show that the multi-frame MVDR in combination with the proposed data-driven SPP estimator yields an increased speech quality compared to a state-of-the-art model-based estimator.
http://arxiv.org/abs/1905.08492
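One common form of the multi-frame MVDR filter mentioned above (written in generic notation, not necessarily the paper's) is

\[
  \mathbf{w}_{\mathrm{MFMVDR}} = \frac{\mathbf{R}_{n}^{-1}\,\boldsymbol{\gamma}_{x}}{\boldsymbol{\gamma}_{x}^{H}\,\mathbf{R}_{n}^{-1}\,\boldsymbol{\gamma}_{x}},
\]

where $\mathbf{R}_{n}$ is the noise correlation matrix across neighboring time frames and $\boldsymbol{\gamma}_{x}$ is the speech interframe correlation (IFC) vector; the DNN-based SPP estimates enter through the recursive estimation of these correlation quantities.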
In this paper, we propose a high-quality generative text-to-speech (TTS) system using an effective spectrum and excitation estimation method. Our previous research verified the effectiveness of the ExcitNet-based speech generation model in a parametric TTS framework. However, the challenge remains to build a high-quality speech synthesis system because auxiliary conditional features estimated by a simple deep neural network often contain large prediction errors, and the errors are inevitably propagated throughout the autoregressive generation process of the ExcitNet vocoder. To generate more natural speech signals, we exploited a sequence-to-sequence (seq2seq) acoustic model with an attention-based generative network (e.g., Tacotron 2) to estimate the condition parameters of the ExcitNet vocoder. Because the seq2seq acoustic model accurately estimates spectral parameters, and because the ExcitNet model effectively generates the corresponding time-domain excitation signals, combining these two models can synthesize natural speech signals. Furthermore, we verified the merit of the proposed method in producing expressive speech segments by adopting a global style token-based emotion embedding method. The experimental results confirmed that the proposed system significantly outperforms the systems with a similarly configured conventional WaveNet vocoder and our best prior parametric TTS counterpart.
http://arxiv.org/abs/1905.08486
This work offers a new method for generating photo-realistic images from semantic label maps and simulator edge map images. We do so in a conditional manner: we train a generative adversarial network (GAN), given an image and its semantic label map, to output a photo-realistic version of that scene. Existing GAN architectures still lack photo-realism capabilities. We address this issue by embedding edge maps and presenting the generator with an edge map image as a prior, which enables generating high-level details in the image. We also offer a model that uses this generator to create visually appealing videos when a sequence of images is given.
http://arxiv.org/abs/1905.08474
We analyze a new robust method for the reconstruction of probability distributions of observed data in the presence of output outliers. It is based on a so-called gradient conjugate prior (GCP) network which outputs the parameters of a prior. By rigorously studying the dynamics of the GCP learning process, we derive an explicit formula for correcting the obtained variance of the marginal distribution and removing the bias caused by outliers in the training set. Assuming a Gaussian (input-dependent) ground truth distribution contaminated with a proportion $\varepsilon$ of outliers, we show that the fitted mean is in a $c e^{-1/\varepsilon}$-neighborhood of the ground truth mean and the corrected variance is in a $b\varepsilon$-neighborhood of the ground truth variance, whereas the uncorrected variance of the marginal distribution can even be infinite. We explicitly find $b$ as a function of the output of the GCP network, without a priori knowledge of the outliers (possibly input-dependent) distribution. Experiments with synthetic and real-world data sets indicate that the GCP network fitted with a standard optimizer outperforms other robust methods for regression.
http://arxiv.org/abs/1905.08464
In this work, we propose a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and obtains about 17.5 times speed-up over Deep Voice 3 at synthesis while maintaining comparable speech quality using a WaveNet vocoder. Interestingly, it has even fewer attention errors than the autoregressive model on the challenging test sentences. Furthermore, we build the first fully parallel neural text-to-speech system by applying the inverse autoregressive flow~(IAF) as the parallel neural vocoder. Our system can synthesize speech from text through a single feed-forward pass. We also explore a novel approach to train the IAF from scratch as a generative model for raw waveform, which avoids the need for distillation from a separately trained WaveNet.
http://arxiv.org/abs/1905.08459
Prevalent approaches to the Chinese word segmentation task mostly rely on Bi-LSTM neural networks. However, Bi-LSTM-based methods have some inherent drawbacks: they are hard to parallelize, inefficient at applying dropout to inhibit overfitting, and inefficient at capturing character information at distant positions of a long sentence for the word segmentation task. In this work, we propose a sequence-to-sequence transformer model for Chinese word segmentation, which is based on a type of convolutional neural network called the temporal convolutional network. The model uses the temporal convolutional network to construct an encoder and one layer of fully connected neural network to build a decoder, applies dropout to inhibit overfitting, captures character information at distant positions of a sentence by adding encoder layers, couples a conditional random field model to train the parameters, and uses the Viterbi algorithm to infer the final Chinese word segmentation result. Experiments on traditional Chinese and simplified Chinese corpora show that the segmentation performance of the model is equivalent to that of Bi-LSTM-based methods, while the model offers far better parallelism than Bi-LSTM-based models.
http://arxiv.org/abs/1905.08454
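A minimal PyTorch sketch of the kind of temporal-convolutional encoder with a single linear decoder described above; layer sizes are illustrative, and the CRF layer and Viterbi decoding are omitted:

    import torch.nn as nn

    class TCNBlock(nn.Module):
        # One dilated 1-D convolution block with a residual connection.
        def __init__(self, channels, kernel_size=3, dilation=1):
            super().__init__()
            pad = (kernel_size - 1) // 2 * dilation      # keep the sequence length
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  padding=pad, dilation=dilation)
            self.act = nn.ReLU()
            self.drop = nn.Dropout(0.2)

        def forward(self, x):                            # x: (batch, channels, seq)
            return x + self.drop(self.act(self.conv(x)))

    class TCNSegmenter(nn.Module):
        # Stacked dilated convolutions as encoder, one linear layer as decoder;
        # the emission scores would normally feed a CRF layer.
        def __init__(self, vocab_size, emb_dim=128, n_tags=4, n_layers=4):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.Sequential(
                *[TCNBlock(emb_dim, dilation=2 ** i) for i in range(n_layers)])
            self.decoder = nn.Linear(emb_dim, n_tags)    # e.g. BMES tags

        def forward(self, char_ids):                     # char_ids: (batch, seq)
            x = self.emb(char_ids).transpose(1, 2)       # -> (batch, emb, seq)
            x = self.encoder(x).transpose(1, 2)          # -> (batch, seq, emb)
            return self.decoder(x)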
Recently, autonomous driving development has ignited competition among car makers and technology corporations. Cars with low-level automation are already commercially available, but highly automated vehicles, which drive by themselves without human monitoring, are still in their infancy. Such autonomous vehicles (AVs) rely entirely on the in-car computing system to interpret the environment and make driving decisions. Therefore, computing system design is essential, particularly for enhancing driving safety. However, to our knowledge, no clear guideline exists so far for safety-aware AV computing system and architecture design. To understand the safety requirements of AV computing systems, we performed a field study by running industrial Level-4 autonomous driving fleets in various locations, road conditions, and traffic patterns. The field study indicates that traditional computing system performance metrics, such as tail latency, average latency, maximum latency, and timeout, cannot fully satisfy the safety requirements of AV computing system design. To address this issue, we propose a `safety score’ as a primary metric for measuring the level of safety in AV computing system design. Furthermore, we propose a perception latency model, which helps architects estimate the safety score of a given architecture and system design without physically testing it in an AV. We demonstrate the use of our safety score and latency model by developing and evaluating a safety-aware computation hardware resource management scheme for AV computing systems.
http://arxiv.org/abs/1905.08453
The science of solving clinical problems by analyzing images generated in clinical practice is known as medical image analysis. The aim is to extract information in an effective and efficient manner for improved clinical diagnosis. Recent advances in the field of biomedical engineering have made medical image analysis one of the top research and development areas. One reason for this advancement is the application of machine learning techniques to the analysis of medical images. Deep learning is successfully used as a machine learning tool, where a neural network is capable of automatically learning features. This is in contrast to methods that use traditionally hand-crafted features, whose selection and calculation is a challenging task. Among deep learning techniques, deep convolutional networks are actively used for medical image analysis, in application areas such as segmentation, abnormality detection, disease classification, computer-aided diagnosis, and retrieval. In this study, a comprehensive review of the current state of the art in medical image analysis using deep convolutional networks is presented. The challenges and potential of these techniques are also highlighted.
http://arxiv.org/abs/1709.02250
Dynamically changing environments, unreliable state estimation, and operation under severe resource constraints are fundamental challenges for robotics, which still limit the deployment of small autonomous drones. We address these challenges in the context of autonomous, vision-based drone racing in dynamic environments. A racing drone must traverse a track with possibly moving gates at high speed. We enable this functionality by combining the performance of a state-of-the-art path-planning and control system with the perceptual awareness of a convolutional neural network (CNN). The CNN directly maps raw images to a desired waypoint and speed. Given the CNN output, the planner generates a short minimum-jerk trajectory segment that is tracked by a model-based controller to actuate the drone towards the waypoint. The resulting modular system has several desirable features: (i) it can run fully on-board, (ii) it does not require globally consistent state estimation, and (iii) it is both platform and domain independent. We extensively test the precision and robustness of our system, both in simulation and on a physical platform. In both domains, our method significantly outperforms the prior state of the art. In order to understand the limits of our approach, we additionally compare against professional human drone pilots with different skill levels.
http://arxiv.org/abs/1905.09727
Hamilton-Jacobi (HJ) reachability analysis has been developed over the past decades into a widely-applicable tool for determining goal satisfaction and safety verification in nonlinear systems. While HJ reachability can be formulated very generally, computational complexity can be a serious impediment for many systems of practical interest. Much prior work has been devoted to computing approximate solutions to large reachability problems, yet many of these methods may only apply to very restrictive problem classes, do not generate controllers, and/or can be extremely conservative. In this paper, we present a new method for approximating the optimal controller of the HJ reachability problem for control-affine systems. While also a specific problem class, many dynamical systems of interest are, or can be well approximated, by control-affine models. We explicitly avoid storing a representation of the reachability value function, and instead learn a controller as a sequence of simple binary classifiers. We compare our approach to existing grid-based methodologies in HJ reachability and demonstrate its utility on several examples, including a physical quadrotor navigation task.
http://arxiv.org/abs/1803.03237
Graph-based clustering has shown promising performance in many tasks. A key step of graph-based approaches is the construction of the similarity graph. In general, learning the graph in kernel space can enhance clustering accuracy due to the incorporation of nonlinearity. However, most existing kernel-based graph learning mechanisms are not similarity-preserving and hence lead to sub-optimal performance. To overcome this drawback, we propose a more discriminative graph learning method which, for the first time, can preserve the pairwise similarities between samples in an adaptive manner. Specifically, we require the learned graph to be close to a kernel matrix, which serves as a measure of similarity in the raw data. Moreover, the structure is adaptively tuned so that the number of connected components of the graph is exactly equal to the number of clusters. Finally, our method unifies clustering and graph learning, so cluster indicators can be obtained directly from the graph itself without a further clustering step. The effectiveness of this approach is examined in both single and multiple kernel learning scenarios on several datasets.
http://arxiv.org/abs/1905.08419
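Graph learning objectives of the kind described above typically take the following form (a generic illustration only, not the paper's exact formulation):

\[
  \min_{Z}\; \|K - Z\|_F^{2}
  \quad \text{s.t.} \quad Z \ge 0,\;\; Z\mathbf{1} = \mathbf{1},\;\; \operatorname{rank}(L_Z) = n - c,
\]

where $K$ is the kernel matrix, $Z$ the learned similarity graph, $L_Z$ its graph Laplacian, $n$ the number of samples, and $c$ the number of clusters; the rank constraint forces the graph to have exactly $c$ connected components, so the cluster indicators can be read off the graph directly.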
Using FPGAs to accelerate ConvNets has attracted significant attention in recent years. However, FPGA accelerator design has not leveraged the latest progress of ConvNets. As a result, key application characteristics such as frames-per-second (FPS) are ignored in favor of simply counting GOPs, and results on accuracy, which is critical to application success, are often not even reported. In this work, we adopt an algorithm-hardware co-design approach to develop a ConvNet accelerator called Synetgy and a novel ConvNet model called DiracDeltaNet. Both the accelerator and ConvNet are tailored to FPGA requirements. DiracDeltaNet, as the name suggests, is a ConvNet with only 1$\times$1 convolutions, while spatial convolutions are replaced by more efficient shift operations. DiracDeltaNet achieves competitive accuracy on ImageNet (88.7% top-5), but with 42$\times$ fewer parameters and 48$\times$ fewer OPs than VGG16. We further quantize DiracDeltaNet’s weights and activations to 4 bits, with less than 1% accuracy loss. These quantizations exploit well the nature of FPGA hardware. In short, DiracDeltaNet’s small model size, low computational OP count, low precision and simplified operators allow us to co-design a highly customized computing unit for an FPGA. We implement the computing units for DiracDeltaNet on an Ultra96 SoC system through high-level synthesis. Our accelerator’s final top-5 accuracy of 88.1% on ImageNet is higher than all previously reported embedded FPGA accelerators. In addition, the accelerator reaches an inference speed of 96.5 FPS on the ImageNet classification task, surpassing prior works with similar accuracy by at least 16.9$\times$.
http://arxiv.org/abs/1811.08634
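A rough PyTorch sketch of the shift-plus-pointwise-convolution pattern that replaces spatial convolutions in DiracDeltaNet above; the channel grouping below follows the generic shift-operator idea and is not the paper's exact configuration (torch.roll also wraps around at the borders, whereas a faithful shift would zero-fill):

    import torch
    import torch.nn as nn

    def shift(x):
        # Shift channel groups by one pixel in different directions; this adds
        # spatial mixing with zero parameters and negligible compute.
        c = x.shape[1]
        g = c // 5
        out = torch.empty_like(x)
        out[:, 0*g:1*g] = torch.roll(x[:, 0*g:1*g], shifts=1, dims=2)    # down
        out[:, 1*g:2*g] = torch.roll(x[:, 1*g:2*g], shifts=-1, dims=2)   # up
        out[:, 2*g:3*g] = torch.roll(x[:, 2*g:3*g], shifts=1, dims=3)    # right
        out[:, 3*g:4*g] = torch.roll(x[:, 3*g:4*g], shifts=-1, dims=3)   # left
        out[:, 4*g:] = x[:, 4*g:]                                        # unshifted
        return out

    class ShiftConvBlock(nn.Module):
        # Shift followed by a 1x1 convolution: all learnable parameters live in
        # the pointwise convolution.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
            self.bn = nn.BatchNorm2d(out_ch)
            self.act = nn.ReLU()

        def forward(self, x):
            return self.act(self.bn(self.pw(shift(x))))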
In the detection of anemia, leukemia, and other blood diseases, the number and type of leukocytes are essential evaluation parameters. However, the conventional leukocyte counting method is not only quite time-consuming but also error-prone. Consequently, many automated methods have been introduced for the diagnosis of medical images. It remains difficult to accurately extract related features and count the number of cells under variable conditions such as background, staining method, staining degree, lighting conditions, and so on. Therefore, in order to adapt to various complex situations, we consider the RGB color space, the HSI color space, and the linear combination of the G, H, and S components, and propose a fast and accurate algorithm for the segmentation of peripheral blood leukocytes in this paper. First, the leukocyte nucleus is separated using the stepwise averaging method. Then, based on interval-valued fuzzy sets, the leukocyte cytoplasm is segmented by minimizing the fuzzy divergence. Next, post-processing is carried out using the concave-convex iterative repair algorithm and the decision mechanism of candidate mask sets. Experimental results show that the proposed method outperforms the existing non-fuzzy-set methods. Among the methods based on fuzzy sets, interval-valued fuzzy sets perform slightly better than interval-valued intuitionistic fuzzy sets and intuitionistic fuzzy sets.
http://arxiv.org/abs/1905.08416
An accurate segmentation of lung nodules in computed tomography (CT) images is critical to lung cancer analysis and diagnosis. However, due to the variety of lung nodules and the similarity of visual characteristics between nodules and their surroundings, a robust segmentation of nodules becomes a challenging problem. In this study, we propose the Dual-branch Residual Network (DB-ResNet) which is a data-driven model. Our approach integrates two new schemes to improve the generalization capability of the model: 1) the proposed model can simultaneously capture multi-view and multi-scale features of different nodules in CT images; 2) we combine the features of the intensity and the convolution neural networks (CNN). We propose a pooling method, called the central intensity-pooling layer (CIP), to extract the intensity features of the center voxel of the block, and then use the CNN to obtain the convolutional features of the center voxel of the block. In addition, we designed a weighted sampling strategy based on the boundary of nodules for the selection of those voxels using the weighting score, to increase the accuracy of the model. The proposed method has been extensively evaluated on the LIDC dataset containing 986 nodules. Experimental results show that the DB-ResNet achieves superior segmentation performance with an average dice score of 82.74% on the dataset. Moreover, we compared our results with those of four radiologists on the same dataset. The comparison showed that our average dice score was 0.49% higher than that of human experts. This proves that our proposed method is as good as the experienced radiologist.
http://arxiv.org/abs/1905.08413
The 2D Multi-Agent Path Finding (MAPF) problem aims at finding collision-free paths for a number of agents, from a set of start locations to a set of goal positions in a known 2D environment. MAPF has been studied in theoretical computer science, robotics, and artificial intelligence over several decades, due to its importance for robot navigation. It is currently experiencing significant scientific progress due to its relevance in automated warehousing (such as those operated by Amazon) and in other contemporary application areas. In this paper, we demonstrate that many recently developed MAPF algorithms apply more broadly than currently believed in the MAPF research community. In particular, we describe the 3D Pipe Routing (PR) problem, which aims at placing collision-free pipes from given start locations to given goal locations in a known 3D environment. The MAPF and PR problems are similar: a solution to a MAPF instance is a set of blocked cells in x-y-t space, while a solution to the corresponding PR instance is a set of blocked cells in x-y-z space. We show how to use this similarity to apply several recently developed MAPF algorithms to the PR problem, and discuss their performance on abstract PR instances. We also discuss further research necessary to tackle real-world pipe-routing instances of interest to industry today. This opens up a new direction of industrial relevance for the MAPF research community.
http://arxiv.org/abs/1905.08412
Applying convolutional neural networks to spherical images requires particular considerations. We look to the millennia of work on cartographic map projections to provide the tools to define an optimal representation of spherical images for the convolution operation. We propose a representation for deep spherical image inference based on the icosahedral Snyder equal-area (ISEA) projection, a projection onto a geodesic grid, and show that it vastly exceeds the state-of-the-art for convolution on spherical images, improving semantic segmentation results by 12.6%.
http://arxiv.org/abs/1905.08409
Structured information about entities is critical for many semantic parsing tasks. We present an approach that uses a Graph Neural Network (GNN) architecture to incorporate information about relevant entities and their relations during parsing. Combined with a decoder copy mechanism, this approach provides a conceptually simple mechanism to generate logical forms with entities. We demonstrate that this approach is competitive with state-of-the-art across several tasks without pre-training, and outperforms existing approaches when combined with BERT pre-training.
http://arxiv.org/abs/1905.08407
Automated prediction of public speaking performance enables novel systems for tutoring public speaking skills. We use the largest open repository—TED Talks—to predict the ratings provided by the online viewers. The dataset contains over 2200 talk transcripts and the associated meta information including over 5.5 million ratings from spontaneous visitors to the website. We carefully removed the bias present in the dataset (e.g., the speakers’ reputations, popularity gained by publicity, etc.) by modeling the data generating process using a causal diagram. We use a word sequence based recurrent architecture and a dependency tree based recursive architecture as the neural networks for predicting the TED talk ratings. Our neural network models can predict the ratings with an average F-score of 0.77 which largely outperforms the competitive baseline method.
http://arxiv.org/abs/1905.08392
Usage similarity estimation addresses the semantic proximity of word instances in different contexts. We apply contextualized (ELMo and BERT) word and sentence embeddings to this task, and propose supervised models that leverage these representations for prediction. Our models are further assisted by lexical substitute annotations automatically assigned to word instances by context2vec, a neural model that relies on a bidirectional LSTM. We perform an extensive comparison of existing word and sentence representations on benchmark datasets addressing both graded and binary similarity. The best performing models outperform previous methods in both settings.
http://arxiv.org/abs/1905.08377
After a hurricane, damage assessment is critical to emergency managers for efficient response and resource allocation. One way to gauge the damage extent is to quantify the number of flooded/damaged buildings, which is traditionally done by ground survey. This process can be labor-intensive and time-consuming. In this paper, we propose to improve the efficiency of building damage assessment by applying image classification algorithms to post-hurricane satellite imagery. At the known building coordinates (available from public data), we extract square-sized images from the satellite imagery to create training, validation, and test datasets. Each square-sized image contains a building to be classified as either ‘Flooded/Damaged’ (labeled by volunteers in a crowd-sourcing project) or ‘Undamaged’. We design and train a convolutional neural network from scratch and compare it with an existing neural network used widely for common object classification. We demonstrate the promise of our damage annotation model (over 97% accuracy) in the case study of building damage assessment in the Greater Houston area affected by 2017 Hurricane Harvey.
http://arxiv.org/abs/1807.01688
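A compact PyTorch sketch of the kind of from-scratch CNN described above for classifying square satellite patches as 'Flooded/Damaged' or 'Undamaged'; the layer sizes are illustrative, not the architecture reported in the paper:

    import torch.nn as nn

    class DamageClassifier(nn.Module):
        def __init__(self, patch_size=128):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(128 * (patch_size // 8) ** 2, 64), nn.ReLU(),
                nn.Dropout(0.5),
                nn.Linear(64, 1),        # single logit; train with BCEWithLogitsLoss
            )

        def forward(self, x):            # x: (batch, 3, patch_size, patch_size)
            return self.classifier(self.features(x))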
Developing deep learning models for resource-constrained Internet-of-Things (IoT) devices is challenging, as it is difficult to achieve both good quality of results (QoR), such as DNN model inference accuracy, and quality of service (QoS), such as inference latency, throughput, and power consumption. Existing approaches typically separate the DNN model development step from its deployment on IoT devices, resulting in suboptimal solutions. In this paper, we first introduce a few interesting but counterintuitive observations about such a separate design approach, and empirically show why it may lead to suboptimal designs. Motivated by these observations, we then propose a novel and practical bi-directional co-design approach: a bottom-up DNN model design strategy together with a top-down flow for DNN accelerator design. It enables a joint optimization of both DNN models and their deployment configurations on IoT devices, represented here by FPGAs. We demonstrate the effectiveness of the proposed co-design approach on a real-life object detection application using the Pynq-Z1 embedded FPGA. Our method obtains state-of-the-art results on both QoR with high accuracy (IoU) and QoS with high throughput (FPS) and high energy efficiency.
http://arxiv.org/abs/1905.08369
We describe a rational, but low resolution model of probability.
http://arxiv.org/abs/1905.10924
Kernel methods have produced state-of-the-art results for a number of NLP tasks such as relation extraction, but suffer from poor scalability due to the high cost of computing kernel similarities between natural language structures. A recently proposed technique, kernelized locality-sensitive hashing (KLSH), can significantly reduce the computational cost, but is only applicable to classifiers operating on kNN graphs. Here we propose to use random subspaces of KLSH codes for efficiently constructing an explicit representation of NLP structures suitable for general classification methods. Further, we propose an approach for optimizing the KLSH model for classification problems by maximizing an approximation of mutual information between the KLSH codes (feature vectors) and the class labels. We evaluate the proposed approach on biomedical relation extraction datasets, and observe significant and robust improvements in accuracy w.r.t. state-of-the-art classifiers, along with drastic (orders-of-magnitude) speedup compared to conventional kernel methods.
http://arxiv.org/abs/1711.04044
Nowadays, Machine Learning (ML) is seen as the universal solution to improve the effectiveness of information retrieval (IR) methods. However, while mathematics is a precise and accurate science, it is usually expressed by less accurate and imprecise descriptions, contributing to the relative dearth of machine learning applications for IR in this domain. Generally, mathematical documents communicate their knowledge with an ambiguous, context-dependent, and non-formal language. Given recent advances in ML, it seems canonical to apply ML techniques to represent and retrieve mathematics semantically. In this work, we apply popular text embedding techniques to the arXiv collection of STEM documents and explore how these are unable to properly understand mathematics from that corpus. In addition, we also investigate the missing aspects that would allow mathematics to be learned by computers.
http://arxiv.org/abs/1905.08359
Bioacoustic sensors, sometimes known as autonomous recording units (ARUs), can record sounds of wildlife over long periods of time in scalable and minimally invasive ways. Deriving per-species abundance estimates from these sensors requires detection, classification, and quantification of animal vocalizations as individual acoustic events. Yet, variability in ambient noise, both over time and across sensors, hinders the reliability of current automated systems for sound event detection (SED), such as convolutional neural networks (CNN) in the time-frequency domain. In this article, we develop, benchmark, and combine several machine listening techniques to improve the generalizability of SED models across heterogeneous acoustic environments. As a case study, we consider the problem of detecting avian flight calls from a ten-hour recording of nocturnal bird migration, recorded by a network of six ARUs in the presence of heterogeneous background noise. Starting from a CNN yielding state-of-the-art accuracy on this task, we introduce two noise adaptation techniques, respectively integrating short-term (60-millisecond) and long-term (30-minute) context. First, we apply per-channel energy normalization (PCEN) in the time-frequency domain, which applies short-term automatic gain control to every subband in the mel-frequency spectrogram. Secondly, we replace the last dense layer in the network by a context-adaptive neural network (CA-NN) layer, i.e. an affine layer whose weights are dynamically adapted at prediction time by an auxiliary network taking long-term summary statistics of spectrotemporal features as input. We show that both techniques are helpful and complementary. […] We release a pre-trained version of our best performing system under the name of BirdVoxDetect, a ready-to-use detector of avian flight calls in field recordings.
http://arxiv.org/abs/1905.08352
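The short-term adaptation step above (PCEN) is available off the shelf; a minimal librosa sketch with default parameters (the hop length, mel resolution, and file path are placeholders, not the paper's settings):

    import librosa

    y, sr = librosa.load("field_recording.wav", sr=22050)       # placeholder path
    # Mel-frequency magnitude spectrogram.
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,
                                       hop_length=512, power=1.0)
    # Per-channel energy normalization: short-term automatic gain control per
    # mel band, followed by dynamic-range compression. The 2**31 scaling follows
    # the librosa documentation's convention for float input.
    S_pcen = librosa.pcen(S * (2 ** 31), sr=sr, hop_length=512)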
While research on iterated revision is predominant in the field of iterated belief change, the class of iterated contraction operators has received more attention in recent years. In this article, we examine a non-prioritized generalisation of iterated contraction. In particular, the class of weak decrement operators is introduced, which are operators that achieve over multiple steps the same as a contraction. Inspired by Darwiche and Pearl’s work on iterated revision, the subclass of decrement operators is defined. For both decrement and weak decrement operators, postulates are presented, and for each of them a representation theorem in the framework of total preorders is given. Furthermore, we present two types of decrement operators which have a unique representative.
http://arxiv.org/abs/1905.08347
The way developers edit day-to-day code tends to be repetitive, often reusing existing code elements. Many researchers have tried to automate repetitive code changes by learning from specific change templates applied to a limited scope. The advancement of Neural Machine Translation (NMT) and the availability of vast open-source evolutionary data open up the possibility of automatically learning those templates from the wild. However, unlike the natural languages for which NMT techniques were originally devised, source code and its changes have certain properties. For instance, compared to natural language, the source code vocabulary can be significantly larger. Further, good changes to code do not break its syntactic structure. Thus, deploying state-of-the-art NMT models without adapting the methods to the source code domain yields sub-optimal results. To this end, we propose a novel tree-based NMT system to model source code changes and learn code change patterns from the wild. We realize our model in a change suggestion engine, CODIT, train the model with more than 30k real-world changes, and evaluate it on 6k patches. Our evaluation shows the effectiveness of CODIT in learning and suggesting patches. CODIT also shows promise in generating bug-fix patches.
https://arxiv.org/abs/1810.00314
Operation in real-world traffic requires autonomous vehicles to be able to plan their motion in complex environments (multiple moving participants). Planning through such environments requires the right search space to be provided to the trajectory or maneuver planners so that the safest motion for the ego vehicle can be identified. Given the current states of the environment and its participants, analyzing the risks based on the predicted trajectories of all the traffic participants provides the necessary search space for motion planning. This paper provides a fresh taxonomy of the safety risks that an autonomous vehicle should be able to handle while navigating through traffic. It provides a reference system architecture that needs to be implemented and describes a novel way of identifying and predicting the behaviors of the traffic participants using classic Multi-Model Adaptive Estimation (MMAE). Preliminary simulation results of the implemented model are included.
http://arxiv.org/abs/1905.08332
A recent trend in DNN development is to extend the reach of deep learning applications to platforms that are more resource and energy constrained, e.g., mobile devices. These endeavors aim to reduce the DNN model size and improve the hardware processing efficiency, and have resulted in DNNs that are much more compact in their structures and/or have high data sparsity. These compact or sparse models are different from the traditional large ones in that there is much more variation in their layer shapes and sizes, and often require specialized hardware to exploit sparsity for performance improvement. Thus, many DNN accelerators designed for large DNNs do not perform well on these models. In this work, we present Eyeriss v2, a DNN accelerator architecture designed for running compact and sparse DNNs. To deal with the widely varying layer shapes and sizes, it introduces a highly flexible on-chip network, called hierarchical mesh, that can adapt to the different amounts of data reuse and bandwidth requirements of different data types, which improves the utilization of the computation resources. Furthermore, Eyeriss v2 can process sparse data directly in the compressed domain for both weights and activations, and therefore is able to improve both processing speed and energy efficiency with sparse models. Overall, with sparse MobileNet, Eyeriss v2 in a 65nm CMOS process achieves a throughput of 1470.6 inferences/sec and 2560.3 inferences/J at a batch size of 1, which is 12.6x faster and 2.5x more energy efficient than the original Eyeriss running MobileNet. We also present an analysis methodology called Eyexam that provides a systematic way of understanding the performance limits for DNN processors as a function of specific characteristics of the DNN model and accelerator design; it applies these characteristics as sequential steps to increasingly tighten the bound on the performance limits.
http://arxiv.org/abs/1807.07928
Algorithm selection and hyperparameter tuning remain two of the most challenging tasks in machine learning. Automated machine learning (AutoML) seeks to automate these tasks to enable widespread use of machine learning by non-experts. This paper introduces OBOE, a collaborative filtering method for time-constrained model selection and hyperparameter tuning. OBOE forms a matrix of the cross-validated errors of a large number of supervised learning models (algorithms together with hyperparameters) on a large number of datasets, and fits a low rank model to learn the low-dimensional feature vectors for the models and datasets that best predict the cross-validated errors. To find promising models for a new dataset, OBOE runs a set of fast but informative algorithms on the new dataset and uses their cross-validated errors to infer the feature vector for the new dataset. OBOE can find good models under constraints on the number of models fit or the total time budget. To this end, this paper develops a new heuristic for active learning in time-constrained matrix completion based on optimal experiment design. Our experiments demonstrate that OBOE delivers state-of-the-art performance faster than competing approaches on a test bed of supervised learning problems. Moreover, the success of the bilinear model used by OBOE suggests that AutoML may be simpler than was previously understood.
http://arxiv.org/abs/1808.03233
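A simplified numpy sketch of the collaborative-filtering core of OBOE described above (low-rank fit of the error matrix, then least-squares inference of a new dataset's latent features); the experiment-design step that chooses which fast models to run under a time budget is omitted, and all data below are placeholders:

    import numpy as np

    rng = np.random.default_rng(0)
    E = rng.random((50, 30))          # cross-validated errors: 50 models x 30 datasets
    k = 5                             # assumed latent rank

    # Low-rank model: model embeddings X such that E ~= X @ Y.
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    X = U[:, :k] * s[:k]              # (n_models, k) feature vectors for the models

    # For a new dataset, run a few fast but informative models and record errors.
    fast_models = [0, 3, 7, 12]                # illustrative indices of cheap models
    e_observed = rng.random(len(fast_models))  # their cross-validated errors (placeholder)

    # Infer the new dataset's latent features from the observed rows, then
    # predict the errors of all models and rank them.
    y_new, *_ = np.linalg.lstsq(X[fast_models], e_observed, rcond=None)
    e_predicted = X @ y_new
    promising = np.argsort(e_predicted)[:5]    # candidate models to fit in full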
Surgical workflow analysis is of importance for understanding the onset and persistence of surgical phases and individual tool usage across surgery and in each phase. It is beneficial for clinical quality control and to hospital administrators for understanding surgery planning. Video acquired during surgery can typically be leveraged for this task. Currently, a combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) is popularly used for video analysis in general, not being restricted to surgical videos. In this paper, we propose a multi-task learning framework using a CNN followed by a bi-directional long short-term memory (Bi-LSTM) to learn to encapsulate both forward and backward temporal dependencies. Further, the joint distribution indicating the set of tools associated with a phase is used as an additional loss during learning to correct for their co-occurrence in any predictions. Experimental evaluation is performed using the Cholec80 dataset. We report mean average precision (mAP) scores of 0.99 and 0.86 for tool and phase identification respectively, which are higher than the prior art in the field.
http://arxiv.org/abs/1905.08315
In this paper we analyze the convolutional layers of a VGG16 model pre-trained on ILSVRC2012. We base our analysis on the responses of neurons to the images of all classes in the ImageNet database. In our analysis, we first propose a visualization method to illustrate the learned content of each neuron. Next, we investigate single- and multi-faceted neurons based on the diversity of neuron responses to different classes. Finally, we compute the neuronal similarity at each layer and compare the layers. Our results demonstrate that neurons in lower layers exhibit a multi-faceted behavior, whereas the majority of neurons in higher layers are single-faceted and tend to respond to a smaller number of classes.
http://arxiv.org/abs/1811.00161
This article proposes a biologically inspired neurocomputational architecture which learns associations between words and referents in different contexts, considering evidence collected from the Psycholinguistics and Neurolinguistics literature. The multi-layered architecture takes as input raw images of objects (referents) and streams of word phonemes (labels), builds an adequate representation, recognizes the current context, and associates labels with referents incrementally by employing a Self-Organizing Map which creates new association nodes (prototypes) as required, adjusts the existing prototypes to better represent the input stimuli, and removes prototypes that become obsolete/unused. The model takes into account the current context to retrieve the correct meaning of words with multiple meanings. Simulations show that the model can reach up to 78% word-referent association accuracy in ambiguous situations and approximates well the learning rates of humans as reported by three different authors in five Cross-Situational Word Learning experiments, also displaying similar learning patterns in the different learning conditions.
http://arxiv.org/abs/1905.08300
Software analytics can be improved by surveying; i.e., rechecking and (possibly) revising the labels offered by prior analysis. Surveying is a time-consuming task and effective surveyors must carefully manage their time. Specifically, they must balance the cost of further surveying against the additional benefits of that extra effort. This paper proposes SURVEY0, an incremental logistic regression estimation method that implements cost/benefit analysis. Some classifier is used to rank the as-yet-unvisited examples according to how interesting they might be. Humans then review the most interesting examples, after which their feedback is used to update an estimator of how many examples remain. This paper evaluates SURVEY0 in the context of self-admitted technical debt. As software projects mature, they can accumulate “technical debt”, i.e., developer decisions which are sub-optimal and decrease the overall quality of the code. Such decisions are often commented on by programmers in the code; i.e., it is self-admitted technical debt (SATD). Recent results show that text classifiers can automatically detect such debt. We find that we can significantly outperform prior results by SURVEYing the data. Specifically, for ten open-source Java projects, we can find 83% of the technical debt via SURVEY0 using just 16% of the comments (and if higher levels of recall are required, SURVEY0 can adjust towards that with some additional effort).
http://arxiv.org/abs/1905.08297
N-discount optimality was introduced as a hierarchical form of policy and value-function optimality, with Blackwell optimality lying at the top level of the hierarchy (Veinott, 1969; Blackwell, 1962). We formalize notions of myopic discount factors, value functions, and policies in terms of Blackwell optimality in MDPs, and we provide a novel concept of regret, called Blackwell regret, which measures the regret compared to a Blackwell optimal policy. Our main analysis focuses on long-horizon MDPs with sparse rewards. We show that selecting a discount factor under which zero Blackwell regret can be achieved becomes arbitrarily hard. Moreover, even with oracle knowledge of such a discount factor that can realize a Blackwell regret-free value function, an $\epsilon$-Blackwell optimal value function may not even be gain optimal. Difficulties associated with this class of problems are discussed, and the notion of a policy gap at a state is defined as the difference in expected return between a given policy and any other policy that differs at that state; we prove certain properties related to this gap. Finally, we provide experimental results that further support our theoretical results.
http://arxiv.org/abs/1905.08293
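For reference, the standard definition underlying the setup above: a stationary policy $\pi^{*}$ is Blackwell optimal if there exists $\bar{\gamma} \in [0,1)$ such that

\[
  V^{\pi^{*}}_{\gamma}(s) \;\ge\; V^{\pi}_{\gamma}(s)
  \qquad \text{for all policies } \pi,\ \text{all states } s,\ \text{and all } \gamma \in [\bar{\gamma}, 1),
\]

i.e. $\pi^{*}$ is discount-optimal simultaneously for every discount factor sufficiently close to 1; the Blackwell regret defined in the paper measures performance relative to such a policy.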