The success of deep neural networks often relies on a large number of labeled examples, which can be difficult to obtain in many real scenarios. To address this challenge, unsupervised methods are strongly preferred for training neural networks without using any labeled data. In this paper, we present a novel paradigm of unsupervised representation learning by Auto-Encoding Transformation (AET), in contrast to the conventional Auto-Encoding Data (AED) approach. Given a randomly sampled transformation, AET seeks to predict it merely from the encoded features as accurately as possible at the output end. The idea is the following: as long as the unsupervised features successfully encode the essential information about the visual structures of the original and transformed images, the transformation can be well predicted. We will show that this AET paradigm allows us to instantiate a large variety of transformations, from parameterized to non-parameterized and GAN-induced ones. Our experiments show that AET greatly improves over existing unsupervised approaches, setting new state-of-the-art performance that comes substantially closer to the upper bounds set by fully supervised counterparts on the CIFAR-10, ImageNet and Places datasets.
https://arxiv.org/abs/1901.04596
Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data. In addition to using models for prediction, the ability to interpret what a model has learned is receiving an increasing amount of attention. However, this increased focus has led to considerable confusion about the notion of interpretability. In particular, it is unclear how the wide array of proposed interpretation methods are related, and what common concepts can be used to evaluate them. We aim to address these concerns by defining interpretability in the context of machine learning and introducing the Predictive, Descriptive, Relevant (PDR) framework for discussing interpretations. The PDR framework provides three overarching desiderata for evaluation: predictive accuracy, descriptive accuracy and relevancy, with relevancy judged relative to a human audience. Moreover, to help manage the deluge of interpretation methods, we introduce a categorization of existing techniques into model-based and post-hoc categories, with sub-groups including sparsity, modularity and simulatability. To demonstrate how practitioners can use the PDR framework to evaluate and understand interpretations, we provide numerous real-world examples. These examples highlight the often under-appreciated role played by human audiences in discussions of interpretability. Finally, based on our framework, we discuss limitations of existing methods and directions for future work. We hope that this work will provide a common vocabulary that will make it easier for both practitioners and researchers to discuss and choose from the full range of interpretation methods.
https://arxiv.org/abs/1901.04592
Needle shape, diameter, and path are critical parameters that directly affect suture depth and tissue trauma in autonomous suturing. This paper presents an optimization-based approach to specify these parameters. Given clinical suturing guidelines, a kinematic model of needle-tissue interaction was developed to quantify suture parameters and constraints. The model was further used to formulate constant curvature needle path planning as a nonlinear optimization problem. The optimization results were confirmed experimentally with the Raven II surgical system. The proposed needle path planning algorithm guarantees minimal tissue trauma and complies with a wide range of suturing requirements.
http://arxiv.org/abs/1901.04588
People learn in fast and flexible ways that have not been emulated by machines. Once a person learns a new verb “dax,” he or she can effortlessly understand how to “dax twice,” “walk and dax,” or “dax vigorously.” There have been striking recent improvements in machine learning for natural language processing, yet the best algorithms require vast amounts of experience and struggle to generalize to new concepts in compositional ways. To better understand these distinctively human abilities, we study the compositional skills of people through language-like instruction learning tasks. Our results show that people can learn and use novel functional concepts from very few examples (few-shot learning), successfully applying familiar functions to novel inputs. People can also compose concepts in complex ways that go beyond the provided demonstrations. Two additional experiments examined the assumptions and inductive biases that people make when solving these tasks, revealing three biases: mutual exclusivity, one-to-one mappings, and iconic concatenation. We discuss the implications for cognitive modeling and the potential for building machines with more human-like language learning capabilities.
https://arxiv.org/abs/1901.04587
We present Optimal Completion Distillation (OCD), a training procedure for optimizing sequence-to-sequence models based on edit distance. OCD is efficient, has no hyper-parameters of its own, and does not require pretraining or joint optimization with conditional log-likelihood. Given a partial sequence generated by the model, we first identify the set of optimal suffixes that minimize the total edit distance, using an efficient dynamic programming algorithm. Then, for each position of the generated sequence, we use a target distribution that puts equal probability on the first token of all the optimal suffixes. OCD achieves state-of-the-art performance on end-to-end speech recognition on both the Wall Street Journal and Librispeech datasets, achieving $9.3\%$ WER and $4.5\%$ WER respectively.
http://arxiv.org/abs/1810.01398
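The optimal-suffix construction above can be sketched with a single edit-distance DP pass. This is a toy illustration of the idea, not the paper's implementation; the `<eos>` marker is an assumed convention for "the prefix already matches optimally, so ending is an optimal continuation."

```python
def ocd_targets(prefix, target):
    """Tokens that begin some minimum-edit-distance completion of `prefix`.

    Computes the edit distance between `prefix` and every prefix of
    `target` in one DP pass, then returns target[j] for each position j
    where that distance is minimal; if matching the full target is
    optimal, the end-of-sequence marker is also an optimal next token.
    """
    # row[j] = edit distance between `prefix` and target[:j]
    row = list(range(len(target) + 1))
    for i, p in enumerate(prefix, start=1):
        new = [i]
        for j, t in enumerate(target, start=1):
            new.append(min(row[j] + 1,               # delete from prefix
                           new[j - 1] + 1,           # insert into prefix
                           row[j - 1] + (p != t)))   # substitute or match
        row = new
    best = min(row)
    toks = {target[j] for j in range(len(target)) if row[j] == best}
    if row[len(target)] == best:
        toks.add("<eos>")
    return toks
```

The OCD target distribution then places equal probability on each returned token.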
As more researchers have become aware of and passionate about algorithmic fairness, there has been an explosion in papers laying out new metrics, suggesting algorithms to address issues, and calling attention to issues in existing applications of machine learning. This research has greatly expanded our understanding of the concerns and challenges in deploying machine learning, but there has been much less work in seeing how the rubber meets the road. In this paper we provide a case study on the application of fairness in machine learning research to a production classification system, and offer new insights into how to measure and address algorithmic fairness issues. We discuss open questions in implementing equality of opportunity and describe our fairness metric, conditional equality, that takes into account distributional differences. Further, we provide a new approach to improve on the fairness metric during model training and demonstrate its efficacy in improving performance for a real-world product.
https://arxiv.org/abs/1901.04562
Previous attempts at music artist classification use frame-level audio features which summarize frequency content within short intervals of time. Comparatively, more recent music information retrieval tasks take advantage of temporal structure in audio spectrograms using deep convolutional and recurrent models. This paper revisits artist classification with this new framework and empirically explores the impacts of incorporating temporal structure in the feature representation. To this end, an established classification architecture, a Convolutional Recurrent Neural Network (CRNN), is applied to the artist20 music artist identification dataset under a comprehensive set of conditions. These include audio clip length, which is a novel contribution in this work, and previously identified considerations such as dataset split and feature-level. Our results improve upon baseline works, verify the influence of the production details on classification performance and demonstrate the trade-offs between sample length and training set size. The best performing model achieves an average F1-score of 0.937 across three independent trials which is a substantial improvement over the corresponding baseline under similar conditions. Finally, to showcase the effectiveness of the CRNN’s feature extraction capabilities, we visualize audio samples at its bottleneck layer demonstrating that learned representations segment into clusters belonging to their respective artists.
https://arxiv.org/abs/1901.04555
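For reference, the F1-score reported above is the harmonic mean of precision and recall; a minimal sketch from raw counts (the counts in the example below are hypothetical, not from the paper):

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall,
    computed from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```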
Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, we recommend that released models be accompanied by documentation detailing their performance characteristics. In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information. While we focus primarily on human-centered machine learning models in the application fields of computer vision and natural language processing, this framework can be used to document any trained machine learning model. To solidify the concept, we provide cards for two supervised models: One trained to detect smiling faces in images, and one trained to detect toxic comments in text. We propose model cards as a step towards the responsible democratization of machine learning and related AI technology, increasing transparency into how well AI technology works. We hope this work encourages those releasing trained machine learning models to accompany model releases with similar detailed evaluation numbers and other relevant documentation.
http://arxiv.org/abs/1810.03993
We propose to simultaneously learn to sample and reconstruct magnetic resonance images (MRI) to maximize the reconstruction quality given a limited sample budget, in a self-supervised setup. Unlike existing deep methods that focus only on reconstructing given data, thus being passive, we go beyond the current state of the art by considering both the data acquisition and the reconstruction process within a single deep-learning framework. As our network learns to acquire data, the network is active in nature. In order to do so, we simultaneously train two neural networks, one dedicated to reconstruction and the other to progressive sampling, each with an automatically generated supervision signal that links them together. The two supervision signals are created through Monte Carlo tree search (MCTS). MCTS returns a better sampling pattern than what the current sampling network can give and, thus, a better final reconstruction. The sampling network is trained to mimic the MCTS results using the previous sampling network, thus being enhanced. The reconstruction network is trained to give the highest reconstruction quality, given the MCTS sampling pattern. Through this framework, we are able to train the two networks without providing any direct supervision on sampling.
https://arxiv.org/abs/1901.04547
To investigate whether and to what extent central serous chorioretinopathy (CSC) depicted on color fundus photographs can be assessed using deep learning technology. We collected a total of 2,504 fundus images acquired on different subjects. We verified the CSC status of these images using their corresponding optical coherence tomography (OCT) images. A total of 1,329 images depicted CSC. These images were preprocessed and normalized. The resulting dataset was randomly split into three parts in the ratio of 8:1:1 for training, validation, and testing purposes, respectively. We used the deep learning architecture termed InceptionV3 to train the classifier. We performed nonparametric receiver operating characteristic (ROC) analyses to assess the capability of the developed algorithm to identify CSC. The Kappa coefficient between the two raters was 0.48 (p < 0.001), while the Kappa coefficients between the computer and the two raters were 0.59 (p < 0.001) and 0.33 (p < 0.05). Our experiments showed that the computer algorithm based on deep learning can assess CSC depicted on color fundus photographs in a relatively reliable and consistent way.
https://arxiv.org/abs/1901.04540
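The Kappa coefficients above measure inter-rater agreement beyond what chance alone would produce. A minimal sketch of Cohen's kappa computed from a confusion matrix between two raters (the counts used in the test are synthetic, not from the study):

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a KxK confusion matrix between two raters:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    total = sum(sum(row) for row in confusion)
    k = len(confusion)
    # observed agreement: fraction of items where the raters agree
    po = sum(confusion[i][i] for i in range(k)) / total
    # chance agreement: product of each rater's marginal label frequencies
    pe = sum((sum(confusion[i]) / total) *
             (sum(row[i] for row in confusion) / total)
             for i in range(k))
    return (po - pe) / (1 - pe)
```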
Recent GAN-based architectures have been able to deliver impressive performance on the general task of image-to-image translation. In particular, it was shown that a wide variety of image translation operators may be learned from two image sets, containing images from two different domains, without establishing an explicit pairing between the images. This was made possible by introducing clever regularizers to overcome the under-constrained nature of the unpaired translation problem. In this work, we introduce a novel architecture for unpaired image translation, and explore several new regularizers enabled by it. Specifically, our architecture comprises a pair of GANs, as well as a pair of translators between their respective latent spaces. These cross-translators enable us to impose several regularizing constraints on the learnt image translation operator, collectively referred to as latent cross-consistency. Our results show that our proposed architecture and latent cross-consistency constraints are able to outperform the existing state-of-the-art on a wide variety of image translation tasks.
https://arxiv.org/abs/1901.04530
Over the past few years, deep neural networks (DNNs) have exhibited great success in predicting the saliency of images. However, there are few works that apply DNNs to predict the saliency of generic videos. In this paper, we propose a novel DNN-based video saliency prediction method. Specifically, we establish a large-scale eye-tracking database of videos (LEDOV), which provides sufficient data to train the DNN models for predicting video saliency. Through the statistical analysis of our LEDOV database, we find that human attention is normally attracted by objects, particularly moving objects or the moving parts of objects. Accordingly, we propose an object-to-motion convolutional neural network (OM-CNN) to learn spatio-temporal features for predicting the intra-frame saliency via exploring the information of both objectness and object motion. We further find from our database that there exists a temporal correlation of human attention with a smooth saliency transition across video frames. Therefore, we develop a two-layer convolutional long short-term memory (2C-LSTM) network in our DNN-based method, using the extracted features of OM-CNN as the input. Consequently, the inter-frame saliency maps of videos can be generated, which consider the transition of attention across video frames. Finally, the experimental results show that our method advances the state-of-the-art in video saliency prediction.
http://arxiv.org/abs/1709.06316
In this paper, we present a principled method to model general planar sliding motion with a distributed convex contact patch. The effect of a contact patch with indeterminate pressure distribution can be equivalently modeled as the contact wrench at one point contact. We call this point the equivalent contact point (ECP). Our dynamic model embeds the ECP within the equations of the slider’s motion and a friction model that approximates the distributed contact patch, and eventually yields a system of quadratic equations. This discrete-time dynamic model allows us to solve for the two components of tangential friction impulses, the friction moment and the slip speed. The state of the slider as well as the ECP can be computed by solving a system of linear equations once the contact impulses are computed. In addition, we derive closed-form solutions for the state of the slider for quasi-static motion. Furthermore, in the pure translation case, based on the discrete-time model, we present closed-form expressions for the friction impulses the slider experiences and its state at each time step. Simulation examples are shown to demonstrate the validity of our approach.
http://arxiv.org/abs/1809.05511
This paper deals with the problem of computing surface ice concentration for two different types of ice from river ice images. It presents the results of attempting to solve this problem using several state-of-the-art semantic segmentation methods based on deep convolutional neural networks (CNNs). This task presents two main challenges: very limited availability of labeled training data, and the great difficulty of visually distinguishing the two types of ice, even for humans, leading to noisy labels. The results are used to analyze the extent to which some of the best deep learning methods currently in existence can handle these challenges.
http://arxiv.org/abs/1901.04412
We survey research on self-driving cars published in the literature, focusing on autonomous cars developed since the DARPA challenges, which are equipped with an autonomy system that can be categorized as SAE level 3 or higher. The architecture of the autonomy system of self-driving cars is typically organized into the perception system and the decision-making system. The perception system is generally divided into many subsystems responsible for tasks such as self-driving-car localization, static obstacle mapping, moving obstacle detection and tracking, road mapping, and traffic signalization detection and recognition, among others. The decision-making system is likewise commonly partitioned into many subsystems responsible for tasks such as route planning, path planning, behavior selection, motion planning, and control. In this survey, we present the typical architecture of the autonomy system of self-driving cars. We also review research on relevant methods for perception and decision making. Furthermore, we present a detailed description of the architecture of the autonomy system of UFES’s car, IARA. Finally, we list prominent autonomous research cars developed by technology companies and reported in the media.
http://arxiv.org/abs/1901.04407
A book about turning high-degree optimization problems into quadratic optimization problems that maintain the same global minimum (ground state). This book explores quadratizations for pseudo-Boolean optimization, perturbative gadgets used in QMA completeness theorems, and also non-perturbative k-local to 2-local transformations used for quantum mechanics, quantum annealing and universal adiabatic quantum computing. The book contains ~70 different Hamiltonian transformations, each of them on a separate page, where the cost (in number of auxiliary binary variables or auxiliary qubits, or number of sub-modular terms, or in graph connectivity, etc.), pros, cons, examples, and references are given. One can therefore look up a quadratization appropriate for the specific term(s) that need to be quadratized, much like using an integral table to look up the integral that needs to be done. This book is therefore useful for writing compilers to transform general optimization problems into a form that quantum annealing or universal adiabatic quantum computing hardware requires; or for transforming quantum chemistry problems written in the Jordan-Wigner or Bravyi-Kitaev form into a form where all multi-qubit interactions become 2-qubit pairwise interactions, without changing the desired ground state. Applications cited include computer vision problems (e.g. image de-noising, un-blurring, etc.), number theory (e.g. integer factoring), graph theory (e.g. Ramsey number determination), and quantum chemistry. The book is open source, and anyone can make modifications here: https://github.com/HPQC-LABS/Book_About_Quadratization.
http://arxiv.org/abs/1901.04405
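A classic example of the kind of transformation the book catalogs: the quadratization often attributed to Freedman and Drineas replaces a negative cubic monomial with a quadratic expression over one auxiliary binary variable, preserving the minimum on every assignment. A brute-force check of this identity:

```python
from itertools import product

def quadratize_check():
    """Verify that -x1*x2*x3 equals min over auxiliary z in {0,1} of
    z*(2 - x1 - x2 - x3) on every binary assignment, so minimizing the
    quadratic form over (x, z) recovers the cubic's ground state."""
    for x1, x2, x3 in product([0, 1], repeat=3):
        cubic = -x1 * x2 * x3
        quad = min(z * (2 - x1 - x2 - x3) for z in (0, 1))
        assert cubic == quad
    return True
```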
Spiking neural networks (SNNs) equipped with latency coding and spike-timing dependent plasticity rules offer an alternative to solve the data and energy bottlenecks of standard computer vision approaches: they can learn visual features without supervision and can be implemented by ultra-low power hardware architectures. However, their performance in image classification has never been evaluated on recent image datasets. In this paper, we compare SNNs to auto-encoders on three visual recognition datasets, and extend the use of SNNs to color images. Results show that SNNs are not competitive yet with traditional feature learning approaches, especially for color features. Further analyses of the results allow us to identify some of the bottlenecks of SNNs and provide specific directions towards improving their performance on vision tasks.
http://arxiv.org/abs/1901.04392
Textual network embedding leverages rich text information associated with the network to learn low-dimensional vectorial representations of vertices. Rather than using typical natural language processing (NLP) approaches, recent research exploits the relationship of texts on the same edge to graphically embed text. However, these models neglect to measure the complete level of connectivity between any two texts in the graph. We present diffusion maps for textual network embedding (DMTE), integrating global structural information of the graph to capture the semantic relatedness between texts, with a diffusion-convolution operation applied on the text inputs. In addition, a new objective function is designed to efficiently preserve the high-order proximity using the graph diffusion. Experimental results show that the proposed approach outperforms state-of-the-art methods on the vertex-classification and link-prediction tasks.
http://arxiv.org/abs/1805.09906
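In the spirit of the diffusion-convolution operation described above, a toy sketch of graph diffusion over vertex features: powers of the transition matrix mix in multi-hop (global) connectivity. The hyper-parameters `hops` and `decay` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def diffusion_features(adj, x, hops=3, decay=0.5):
    """Mix each vertex's feature vector with geometrically down-weighted
    contributions from its 1..`hops`-hop neighborhoods, using powers of
    the row-stochastic transition matrix of the graph."""
    deg = adj.sum(axis=1, keepdims=True)
    p = adj / np.maximum(deg, 1)          # row-stochastic transition matrix
    out, pk = x.copy(), np.eye(len(adj))
    for k in range(1, hops + 1):
        pk = pk @ p                        # k-step transition probabilities
        out += (decay ** k) * (pk @ x)
    return out
```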
This work aims to develop an end-to-end solution for seizure onset detection. We design SeizNet, a Convolutional Neural Network for seizure detection. To compare SeizNet with a traditional machine learning approach, a baseline classifier is implemented using spectrum band power features with Support Vector Machines (BPsvm). We explore the possibility of using the smallest number of channels for accurate seizure detection by evaluating the SeizNet and BPsvm approaches using all-channel and two-channel settings, respectively. EEG data is acquired from 29 pediatric patients admitted to KK Women’s and Children’s Hospital who were diagnosed with typical absence seizures. We conduct leave-one-out cross validation for all subjects. Using full-channel data, BPsvm yields a sensitivity of 86.6% and 0.84 false alarms (per hour), while SeizNet yields an overall sensitivity of 95.8% with 0.17 false alarms. More interestingly, two-channel SeizNet outperforms full-channel BPsvm with a sensitivity of 93.3% and 0.58 false alarms. We further investigate the interpretability of SeizNet by decoding the filters learned along the convolutional layers. Seizure-like characteristics can be clearly observed in the filters from the third and fourth convolutional layers.
http://arxiv.org/abs/1901.05305
Deep neural acoustic models benefit from context-dependent modeling of output symbols. However, their usage requires state-tying decision trees that are typically transferred from classical GMM-HMM systems. In this work we consider direct training of CTC networks with context-dependent outputs. A state-tying decision tree is replaced with a neural network that predicts the weights of the final SoftMax classifier in a context-dependent way. This network is trained together with the rest of the acoustic model and lifts one of the last cases in which neural systems have to be bootstrapped from GMM-HMM ones. We describe changes to the CTC cost function that are needed to accommodate context-dependent symbols and validate this idea on a bigram context-dependent system built for character-based WSJ.
http://arxiv.org/abs/1901.04379
We revisit skip-gram negative sampling (SGNS), one of the most popular neural-network based approaches to learning distributed word representations. We first point out the ambiguity issue undermining the SGNS model, in the sense that the word vectors can be entirely distorted without changing the objective value. To resolve the issue, we investigate the intrinsic structures in the solution that a good word embedding model should deliver. Motivated by this, we rectify the SGNS model with quadratic regularization, and show that this simple modification suffices to structure the solution in the desired manner. A theoretical justification is presented, which provides novel insights into quadratic regularization. Preliminary experiments are also conducted on Google’s analogical reasoning task to support the modified SGNS model.
http://arxiv.org/abs/1804.00306
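A minimal sketch of a rectified objective for one training pair: the standard SGNS loss plus an L2 (quadratic) penalty on the vectors involved. The penalty weight `lam` and the exact form of the regularizer are illustrative assumptions; the paper's formulation may differ.

```python
import numpy as np

def sgns_reg_loss(u, v_pos, v_negs, lam):
    """SGNS loss for one (center, context) pair with negative samples,
    plus a quadratic penalty that pins down the otherwise
    under-determined scale of the word vectors."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    loss = -np.log(sigmoid(u @ v_pos))                     # attract the true context
    loss -= sum(np.log(sigmoid(-u @ v)) for v in v_negs)   # repel negatives
    reg = lam * (u @ u + v_pos @ v_pos + sum(v @ v for v in v_negs))
    return loss + reg
```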
The goal of this article is to inspire data scientists to participate in the debate on the impact that their professional work has on society, and to become active in public debates on the digital world as data science professionals. How do ethical principles (e.g., fairness, justice, beneficence, and non-maleficence) relate to our professional lives? What responsibilities do we bear as professionals, given our expertise in the field? More specifically, this article makes an appeal to statisticians to join that debate, and to be part of the community that establishes data science as a proper profession in the sense of Airaksinen, a philosopher working on professional ethics. As we will argue, data science has one of its roots in statistics and extends beyond it. To shape the future of statistics, and to take responsibility for the statistical contributions to data science, statisticians should actively engage in the discussions. First the term data science is defined, and the technical changes that have led to a strong influence of data science on society are outlined. Next the systematic approach from CNIL is introduced. Prominent examples are given for ethical issues arising from the work of data scientists. Further, we provide reasons why data scientists should engage in shaping morality around data science and in formulating codes of conduct and codes of practice for the field. Next we present established ethical guidelines for the related fields of statistics and computing machinery. Thereafter, necessary steps in the community to develop professional ethics for data science are described. Finally we give our starting statement for the debate: Data science is at the focal point of current societal development. Without becoming a profession with professional ethics, data science will fail in building trust in its interaction with and its much needed contributions to society!
https://arxiv.org/abs/1901.04824
Imaging-based, non-contact measurement of physiology (including imaging photoplethysmography and imaging ballistocardiography) is a growing field of research. There are several strengths of imaging methods that make them attractive. They remove the need for uncomfortable contact sensors and can enable spatial and concomitant measurement from a single sensor. Furthermore, cameras are ubiquitous and often low-cost solutions for sensing. Open source toolboxes help accelerate the progress of research by providing a means to compare new approaches against standard implementations of the state-of-the-art. We present an open source imaging-based physiological measurement toolbox with implementations of many of the most frequently employed computational methods. We hope that this toolbox will contribute to the advancement of non-contact physiological sensing methods.
http://arxiv.org/abs/1901.04366
This paper approaches, using structural complexity theory, the question of whether there is a chasm between knowing an object exists and getting one’s hands on the object or its properties. In particular, we study the nontransparency of so-called backbones. A backbone of a boolean formula $F$ is a collection $S$ of its variables for which there is a unique partial assignment $a_S$ such that $F[a_S]$ is satisfiable [MZK+99,WGS03]. We show that, under the widely believed assumption that integer factoring is hard, there exist sets of boolean formulas that have obvious, nontrivial backbones yet finding the values, $a_S$, of those backbones is intractable. We also show that, under the same assumption, there exist sets of boolean formulas that obviously have large backbones yet producing such a backbone $S$ is intractable. Furthermore, we show that if integer factoring is not merely worst-case hard but is frequently hard, as is widely believed, then the frequency of hardness in our two results is not too much less than that frequency. These results hold more generally, namely, in the settings where, respectively, one’s assumption is that P $\neq$ NP $\cap$ coNP or that some problem in NP $\cap$ coNP is frequently hard.
http://arxiv.org/abs/1606.03634
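The backbone definition can be made concrete with a brute-force checker on small CNF formulas. This is exponential in the number of variables and purely illustrative; the paper's point is precisely that recovering backbones at scale is intractable under standard assumptions.

```python
from itertools import product

def backbone(clauses, n):
    """Brute-force backbone of a CNF formula over variables 1..n.

    Clauses are lists of nonzero ints; positive i means x_i, negative i
    means its negation. Returns {var: forced_value} for the variables
    that take the same value in every satisfying assignment."""
    sat = [a for a in product([False, True], repeat=n)
           if all(any(a[abs(l) - 1] == (l > 0) for l in c) for c in clauses)]
    if not sat:
        return {}
    return {i + 1: sat[0][i] for i in range(n)
            if all(a[i] == sat[0][i] for a in sat)}
```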
A lack of labeled data is a major problem in building machine learning based models when the manual annotation (labeling) is error-prone, expensive, tedious, and time-consuming. In this paper, we introduce an iterative deep learning based method to improve segmentation and counting of cells based on unbiased stereology applied to regions of interest of extended depth of field (EDF) images. This method uses an existing machine learning algorithm called the adaptive segmentation algorithm (ASA) to generate masks (verified by a user) for EDF images to train deep learning models. Then an iterative deep learning approach is used to feed newly predicted and accepted deep learning masks/images (verified by a user) into the training set of the deep learning model. The error rate in the unbiased stereology count of cells on an unseen test set was reduced from about 3% to less than 1% after 5 iterations of the iterative deep learning based unbiased stereology process.
http://arxiv.org/abs/1901.04355
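The iterative loop described above can be sketched as follows. All names here (`model`, `seed_pairs`, `user_verifies`) are illustrative placeholders, not the authors' API; the key idea is that each round folds user-accepted predictions back into the training set.

```python
def iterative_training(model, seed_pairs, unlabeled, user_verifies, rounds=5):
    """Sketch of an iterative human-in-the-loop training scheme:
    start from ASA-generated, user-verified masks; each round, train,
    predict masks for unlabeled images, keep only user-accepted ones,
    and add them to the training set for the next round."""
    train_set = list(seed_pairs)        # (image, mask) pairs verified by a user
    for _ in range(rounds):
        model.fit(train_set)            # retrain on everything accepted so far
        accepted = []
        for img in unlabeled:
            mask = model.predict(img)
            if user_verifies(img, mask):    # user accepts or rejects each mask
                accepted.append((img, mask))
        train_set.extend(accepted)
        done = {img for img, _ in accepted}
        unlabeled = [img for img in unlabeled if img not in done]
    return model
```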
During the last few years, spoken language technologies have improved considerably thanks to Deep Learning. However, Deep Learning-based algorithms require large amounts of data that are often difficult and costly to gather. In particular, modeling the variability in speech of different speakers, different styles or different emotions with little data remains challenging. In this paper, we investigate how to leverage fine-tuning on a pre-trained Deep Learning-based TTS model to synthesize speech with a small dataset of another speaker. We then investigate the possibility of adapting this model to produce emotional TTS by fine-tuning the neutral TTS model with a small emotional dataset.
http://arxiv.org/abs/1901.04276
In many problem settings, most notably in game playing, an agent receives a possibly delayed reward for its actions. Often, those rewards are handcrafted and not naturally given. Even simple terminal-only rewards, like winning equals 1 and losing equals -1, cannot be seen as an unbiased statement, since these values are chosen arbitrarily, and the behavior of the learner may change with different encodings, such as setting the value of a loss to -0.5, which is often done in practice to encourage learning. It is hard to argue about good rewards, and the performance of an agent often depends on the design of the reward signal. In particular, in domains where states by nature only have an ordinal ranking and where meaningful distance information between game state values is not available, a numerical reward signal is necessarily biased. In this paper, we take a look at Monte Carlo Tree Search (MCTS), a popular algorithm to solve MDPs, highlight a recurring problem concerning its use of rewards, and show that an ordinal treatment of the rewards overcomes this problem. Using the General Video Game Playing framework, we show a dominance of our newly proposed ordinal MCTS algorithm over preference-based MCTS, vanilla MCTS and various other MCTS variants.
http://arxiv.org/abs/1901.04274
In this paper, we propose to learn a shared semantic space with correlation alignment (${S}^{3}CA$) for multimodal data representations, which aligns nonlinear correlations of multimodal data distributions in deep neural networks designed for heterogeneous data. In the context of cross-modal (event) retrieval, we design a neural network with convolutional layers and fully-connected layers to extract features for images, including images on Flickr-like social media. Simultaneously, we exploit a fully-connected neural network to extract semantic features for texts, including news articles from news media. In particular, nonlinear correlations of layer activations in the two neural networks are aligned with correlation alignment during the joint training of the networks. Furthermore, we project the multimodal data into a shared semantic space for cross-modal (event) retrieval, where the distances between heterogeneous data samples can be measured directly. In addition, we contribute a Wiki-Flickr Event dataset, in which the multimodal data samples are not paired descriptions of each other as in existing paired datasets; instead, they all describe semantic events. Extensive experiments conducted on both paired and unpaired datasets demonstrate the effectiveness of ${S}^{3}CA$, outperforming the state-of-the-art methods.
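The correlation-alignment ingredient can be sketched with the classic CORAL loss, which penalizes the Frobenius distance between the covariance matrices of two batches of activations (a minimal illustration of the alignment idea; the paper's $S^{3}CA$ aligns nonlinear correlations inside a specific architecture, which this sketch does not reproduce):

```python
def covariance(X):
    """Sample covariance (d x d) of a batch X given as n rows of d features."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    C = [[0.0] * d for _ in range(d)]
    for row in X:
        for i in range(d):
            for j in range(d):
                C[i][j] += (row[i] - mean[i]) * (row[j] - mean[j])
    return [[C[i][j] / (n - 1) for j in range(d)] for i in range(d)]

def coral_loss(Xs, Xt):
    """Squared Frobenius distance between the covariances of two batches of
    activations (e.g. image-branch vs. text-branch), scaled by 1/(4 d^2)."""
    d = len(Xs[0])
    Cs, Ct = covariance(Xs), covariance(Xt)
    return sum((Cs[i][j] - Ct[i][j]) ** 2
               for i in range(d) for j in range(d)) / (4 * d * d)
```

Minimizing such a term during joint training pulls the second-order statistics of the two branches together, which is the intuition behind aligning heterogeneous modalities in a shared space.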
http://arxiv.org/abs/1901.04268
We present a method for turning a flash selfie taken with a smartphone into a photograph as if it was taken in a studio setting with uniform lighting. Our method uses a convolutional neural network trained on a set of pairs of photographs acquired in an ad-hoc acquisition campaign. Each pair consists of one photograph of a subject’s face taken with the camera flash enabled and another one of the same subject in the same pose illuminated using a photographic studio-lighting setup. We show how our method can amend defects introduced by a close-up camera flash, such as specular highlights, shadows, skin shine, and flattened images.
http://arxiv.org/abs/1901.04252
The continuing monitoring and surveying of nearby space to detect Near Earth Objects (NEOs) and Near Earth Asteroids (NEAs) is essential because of the threat that such objects pose to the future of our planet. We need more computational resources and advanced algorithms to deal with the exponential growth in the performance of digital cameras and to be able to process (in near real-time) data coming from large surveys. This paper presents a software platform called NEARBY that supports automated detection of moving sources (asteroids) among stars in astronomical images. The detection procedure is based on classic “blink” detection; after that, the system supports visual analysis techniques to validate the moving sources, assisted by static and dynamic presentations.
http://arxiv.org/abs/1901.04248
In this paper, we present a graph-based semi-supervised framework for hyperspectral image classification. We first introduce a novel superpixel algorithm based on the spectral covariance matrix representation of pixels to provide a better representation of our data. We then construct a superpixel graph, based on carefully considered feature vectors, before performing classification. We demonstrate, through a set of experimental results using two benchmarking datasets, that our approach outperforms three state-of-the-art classification frameworks, especially when an extremely small amount of labelled data is used.
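The graph-based semi-supervised step can be illustrated with plain label propagation over a weighted graph, where the few labelled nodes spread their labels to unlabelled neighbours (an illustrative sketch only; the paper's classifier, superpixel construction, and feature vectors are not reproduced, and the update rule and parameters here are assumptions):

```python
def label_propagation(adj, labels, n_classes, alpha=0.8, iters=50):
    """Semi-supervised label propagation on a weighted graph.
    adj: symmetric adjacency dict {node: {neighbour: weight}} (e.g. a
    superpixel graph); labels: {node: class} for the few labelled nodes.
    Iterates F <- alpha * S F + (1 - alpha) * Y with row-normalised S."""
    nodes = list(adj)
    Y = {v: [0.0] * n_classes for v in nodes}
    for v, c in labels.items():
        Y[v][c] = 1.0
    F = {v: list(Y[v]) for v in nodes}
    for _ in range(iters):
        newF = {}
        for v in nodes:
            deg = sum(adj[v].values()) or 1.0
            newF[v] = [alpha * sum(w * F[u][k] for u, w in adj[v].items()) / deg
                       + (1 - alpha) * Y[v][k] for k in range(n_classes)]
        F = newF
    return {v: max(range(n_classes), key=lambda k: F[v][k]) for v in nodes}
```

On a small path graph with the two endpoints labelled, the interior nodes take the label of the nearer endpoint, which is the behaviour that makes such methods effective with extremely little labelled data.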
http://arxiv.org/abs/1901.04240
In this work we investigate the accuracy of standard and state-of-the-art language identification methods in identifying Albanian in written text documents. A dataset consisting of news articles written in Albanian has been constructed for this purpose. We noticed a considerable decrease in accuracy when using test documents that lack the Albanian alphabet letters “Ë” and “Ç”, and created a custom training corpus that solved this problem, achieving an accuracy of more than 99%. Based on our experiments, the best-performing language identification methods for Albanian use a naïve Bayes classifier and n-gram based classification features.
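A character n-gram naïve Bayes language identifier of the kind found best-performing here can be sketched in a few lines (a minimal sketch with add-one smoothing; the paper's actual training corpus, n-gram order, and tooling are not specified by this snippet):

```python
from collections import Counter
import math

def ngrams(text, n=3):
    """Overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramNB:
    """Character n-gram naive Bayes language identifier (add-one smoothing)."""
    def __init__(self, n=3):
        self.n, self.counts, self.totals = n, {}, {}
        self.vocab = set()

    def fit(self, corpus):
        """corpus: {language: [documents]}"""
        for lang, docs in corpus.items():
            c = Counter(g for d in docs for g in ngrams(d.lower(), self.n))
            self.counts[lang] = c
            self.totals[lang] = sum(c.values())
            self.vocab |= set(c)

    def predict(self, text):
        V = len(self.vocab)
        def logp(lang):
            c, t = self.counts[lang], self.totals[lang]
            return sum(math.log((c[g] + 1) / (t + V))
                       for g in ngrams(text.lower(), self.n))
        return max(self.counts, key=logp)
```

Because the features are character n-grams, diacritics such as “Ë” and “Ç” carry real weight, which is consistent with the accuracy drop the authors observed on documents missing those letters.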
http://arxiv.org/abs/1901.04216
Visual SLAM has shown significant progress in recent years owing to high attention from the vision community, but challenges still remain for low-textured environments. Feature-based visual SLAM does not produce reliable camera and structure estimates due to insufficient features in a low-textured environment. Moreover, existing visual SLAM systems produce partial reconstructions when the number of 3D-2D correspondences is insufficient for incremental camera estimation using bundle adjustment. This paper presents Edge SLAM, a feature-based monocular visual SLAM which mitigates the above-mentioned problems. Our proposed Edge SLAM pipeline detects edge points from images and tracks them using optical flow for point correspondence. We further refine these point correspondences using the geometric relationship among three views. Owing to our edge-point tracking, we use a robust method for two-view initialization for bundle adjustment. Our proposed SLAM also identifies the potential situations where estimating a new camera into the existing reconstruction becomes unreliable, and we adopt a novel method to estimate the new camera reliably using a local optimization technique. We present an extensive evaluation of our proposed SLAM pipeline on the most popular open datasets and compare with the state-of-the-art. Experimental results indicate that our Edge SLAM is robust and works reliably for both textured and less-textured environments in comparison to existing state-of-the-art SLAMs.
http://arxiv.org/abs/1901.04210
Low rank matrix approximation (LRMA) has drawn increasing attention in recent years, due to its wide range of applications in computer vision and machine learning. However, LRMA, achieved by nuclear norm minimization (NNM), tends to over-shrink the rank components with the same threshold and ignore the differences between rank components. To address this problem, we propose a flexible and precise model named multi-band weighted $l_p$ norm minimization (MBWPNM). The proposed MBWPNM not only gives more accurate approximation with a Schatten $p$-norm, but also incorporates the prior knowledge that different rank components have different importance. We analyze the solution of MBWPNM and prove that MBWPNM is equivalent to a series of non-convex $l_p$ norm subproblems under a certain weight condition, whose global optimum can be solved by a generalized soft-thresholding algorithm. We then adopt the MBWPNM algorithm for color and multispectral image denoising. Extensive experiments on additive white Gaussian noise removal and realistic noise removal demonstrate that the proposed MBWPNM achieves better performance than several state-of-the-art algorithms.
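The generalized soft-thresholding (GST) solver for the scalar non-convex $l_p$ subproblem can be sketched as follows (a sketch of the standard GST scheme of Zuo et al. for $0 < p < 1$, assumed here to match the subproblem form $\arg\min_x \frac{1}{2}(x-y)^2 + w|x|^p$; the paper's multi-band weighting is not reproduced):

```python
def gst(y, w, p, iters=30):
    """Generalized soft-thresholding: approximately solves
    argmin_x 0.5*(x - y)^2 + w*|x|^p for 0 < p < 1.
    Below the threshold tau the optimum is exactly 0; above it, a fixed-point
    iteration x <- |y| - w*p*x^(p-1) converges to the nonzero minimizer."""
    tau = (2 * w * (1 - p)) ** (1 / (2 - p)) \
          + w * p * (2 * w * (1 - p)) ** ((p - 1) / (2 - p))
    if abs(y) <= tau:
        return 0.0
    x = abs(y)
    for _ in range(iters):
        x = abs(y) - w * p * x ** (p - 1)
    return x if y >= 0 else -x
```

Applied to each singular value with its own weight, this is how a weighted Schatten $p$-norm problem shrinks large (important) rank components less aggressively than small ones, unlike the uniform threshold of NNM.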
http://arxiv.org/abs/1901.04206
We propose FuCoLoT – a Fully Correlational Long-term Tracker. It exploits the novel DCF constrained filter learning method to design a detector that is able to re-detect the target in the whole image efficiently. FuCoLoT maintains several correlation filters trained on different time scales that act as the detector components. A novel mechanism based on the correlation response is used for tracking failure estimation. FuCoLoT achieves state-of-the-art results on standard short-term benchmarks and it outperforms the current best-performing tracker on the long-term UAV20L benchmark by over 19%. It has an order of magnitude smaller memory footprint than its best-performing competitors and runs at 15fps in a single CPU thread.
http://arxiv.org/abs/1711.09594
This book contains the accepted papers of the ProSocrates 2017 Symposium: Problem-solving, Creativity and Spatial Reasoning in Cognitive Systems. The ProSocrates 2017 symposium was held at the Hansewissenschaftkolleg (HWK) of Advanced Studies in Delmenhorst, 20-21 July 2017. This was the second edition of the symposium, which aims to bring together researchers interested in spatial reasoning, problem solving and creativity.
http://arxiv.org/abs/1901.04199
Context. Convolutional neural networks (CNNs) have been proven to perform fast classification and detection on natural images and have potential to infer astrophysical parameters on the exponentially increasing amount of sky survey imaging data. The inference pipeline can be trained either from real human-annotated data or simulated mock observations. Until now, star cluster analysis has been based on integral or individually resolved stellar photometry. This limits the amount of information that can be extracted from cluster images. Aims. To develop a CNN-based algorithm aimed at simultaneously deriving ages, masses, and sizes of star clusters directly from multi-band images, and to demonstrate CNN capabilities on low-mass semi-resolved star clusters in a low signal-to-noise ratio regime. Methods. A CNN was constructed based on the deep residual network (ResNet) architecture and trained on simulated images of star clusters with various ages, masses, and sizes. To provide realistic backgrounds, M31 star fields taken from the PHAT survey were added to the mock cluster images. Results. The proposed CNN was verified on mock images of artificial clusters and has demonstrated high precision and no significant bias for clusters of ages $\lesssim$3Gyr and masses between 250 and 4,000 ${\rm M_\odot}$. The pipeline is end-to-end, starting from input images all the way to the inferred parameters; no hand-coded steps have to be performed: estimates of parameters are provided by the neural network in one inferential step from raw images.
http://arxiv.org/abs/1807.07658
Short-term tracking is an open and challenging problem for which discriminative correlation filters (DCF) have shown excellent performance. We introduce the channel and spatial reliability concepts to DCF tracking and provide a novel learning algorithm for its efficient and seamless integration in the filter update and the tracking process. The spatial reliability map adjusts the filter support to the part of the object suitable for tracking. This both allows to enlarge the search region and improves tracking of non-rectangular objects. Reliability scores reflect channel-wise quality of the learned filters and are used as feature weighting coefficients in localization. Experimentally, with only two simple standard features, HoGs and Colornames, the novel CSR-DCF method – DCF with Channel and Spatial Reliability – achieves state-of-the-art results on VOT 2016, VOT 2015 and OTB100. The CSR-DCF runs in real-time on a CPU.
http://arxiv.org/abs/1611.08461
In this work, we propose a goal-driven collaborative task that contains language, vision, and action in a virtual environment as its core components. Specifically, we develop a Collaborative image-Drawing game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pieces in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip art pieces. The two players communicate via two-way communication using natural language. We collect the CoDraw dataset of ~10K dialogs consisting of ~138K messages exchanged between human agents. We define protocols and metrics to evaluate the effectiveness of learned agents on this testbed, highlighting the need for a novel crosstalk condition which pairs agents trained independently on disjoint subsets of the training data for evaluation. We present models for our task, including simple but effective nearest-neighbor techniques and neural network approaches trained using a combination of imitation learning and goal-driven training. All models are benchmarked using both fully automated evaluation and by playing the game with live human agents.
http://arxiv.org/abs/1712.05558
Demand for healthcare is increasing rapidly. To meet demand, we must improve the efficiency of our public health services. We present a mixed integer programming (MIP) formulation that simultaneously tackles the integrated Master Surgical Schedule (MSS) and Surgical Case Assignment (SCA) problems. We consider volatile surgical durations and non-elective arrivals whilst applying a rolling horizon approach to adjust the schedule after cancellations, equipment failure, or new arrivals on the waiting list. A case study of an Australian public hospital with a large surgical department is the basis for the model. The formulation includes significant detail and provides practitioners with a globally implementable model. We produce good feasible solutions in short amounts of computational time with a constructive heuristic and two hyper metaheuristics. Using a rolling horizon schedule increases patient throughput and can help reduce waiting lists.
http://arxiv.org/abs/1808.10139
In this paper, we apply a genetic algorithm (GA) to the Electrical Impedance Tomography (EIT) application. We first formulate the EIT problem as an optimization problem and define a target optimization function. We then show how the GA, as an alternative search algorithm, can be used to solve the EIT inverse problem. We explore evolutionary methods such as GAs combined with various regularization operators to solve the EIT inverse computing problem. Key words: Electrical Impedance Tomography (EIT), GA, Tikhonov operator, Mumford-Shah operator, Particle Swarm Optimization (PSO), Back Propagation (BP).
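The GA-plus-regularization recipe can be sketched generically: a real-coded GA searches for the parameter vector minimizing a data-misfit term plus a Tikhonov penalty (a minimal sketch under assumed operators; the actual EIT forward model, encoding, and GA parameters are not specified by the abstract):

```python
import random

def ga_minimize(objective, dim, pop_size=40, gens=60, sigma=0.3, seed=0):
    """Minimal real-coded genetic algorithm: tournament selection, uniform
    crossover, Gaussian mutation. An illustrative sketch, not the paper's GA."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(gens):
        def pick():
            a, b = rng.sample(pop, 2)          # binary tournament
            return a if objective(a) < objective(b) else b
        children = []
        for _ in range(pop_size):
            p1, p2 = pick(), pick()
            child = [(x if rng.random() < 0.5 else y) for x, y in zip(p1, p2)]
            if rng.random() < 0.3:             # Gaussian mutation of one gene
                i = rng.randrange(dim)
                child[i] += rng.gauss(0, sigma)
            children.append(child)
        pop = children
    return min(pop, key=objective)

def tikhonov(A, b, lam):
    """Tikhonov-regularised least-squares objective ||A x - b||^2 + lam ||x||^2,
    standing in for a linearised EIT misfit in this sketch."""
    def f(x):
        res = sum((sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i) ** 2
                  for row, b_i in zip(A, b))
        return res + lam * sum(v * v for v in x)
    return f
```

Swapping the penalty term (e.g. for a Mumford-Shah-style functional) changes the regularization operator while the GA search loop stays the same, which is the combination the paper explores.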
https://arxiv.org/abs/1901.04872
In the current field of computer vision, automatically generating text from given images has become a well-studied technique. To date, most work in this area focuses on describing image content, namely image captioning. However, few studies focus on generating product review texts, which are ubiquitous in online shopping malls and crucial for online shopping selection and evaluation. Unlike content description, review texts include more subjective information from customers, which may affect the results. Therefore, we target a new task: generating customer review text from images of online shopping products together with their ratings, which appear as non-image attributes. We made several adjustments to an existing image-captioning model to fit our task, in which non-image features must also be taken into consideration. We conducted experiments based on our model and obtained promising preliminary results.
http://arxiv.org/abs/1901.04140
Accurate prediction of the fading channel is essential to realize adaptive transmission and other methods that can save power and provide gains. In practice, the wireless channel model can be regarded as a new language model, and the time-varying channel can be seen as mumbling in this language, which is too complex to understand, let alone predict. Fortunately, neural networks have proved efficient at learning language models in recent years; moreover, sequence-to-sequence (seq2seq) models provide state-of-the-art performance in various tasks such as machine translation, image caption generation, and text summarization. Predicting the channel with neural networks therefore seems promising; however, vanilla neural networks cannot handle complex-valued inputs, while channel state information (CSI) lies in the complex domain. In this paper, we present a powerful method to understand and predict the complex-valued channel by utilizing seq2seq models. The results show that seq2seq models are also effective at time series prediction, and that realistic channel prediction with performance comparable or superior to channel estimation is attainable.
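A common workaround for the complex-valued input limitation mentioned above is to feed the model a two-channel real representation of the CSI sequence and rebuild complex values from the model's outputs (a minimal sketch of one assumed encoding; the paper does not specify its exact representation):

```python
def to_real_channels(csi):
    """Split a complex-valued CSI sequence into (real, imag) pairs that a
    real-valued seq2seq model can consume as a 2-channel input."""
    return [(z.real, z.imag) for z in csi]

def to_complex(channels):
    """Inverse mapping: rebuild the complex CSI sequence from model outputs."""
    return [complex(re, im) for re, im in channels]
```

The encoding is lossless, so prediction quality is limited by the model rather than the representation.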
http://arxiv.org/abs/1901.04119
Without a real bilingual corpus available, unsupervised Neural Machine Translation (NMT) typically requires pseudo parallel data generated with the back-translation method for model training. However, due to weak supervision, the pseudo data inevitably contain noise and errors that are accumulated and reinforced in the subsequent training process, leading to bad translation performance. To address this issue, we introduce phrase-based Statistical Machine Translation (SMT) models, which are robust to noisy data, as posterior regularizations to guide the training of unsupervised NMT models in the iterative back-translation process. Our method starts from SMT models built with pre-trained language models and word-level translation tables inferred from cross-lingual embeddings. Then SMT and NMT models are optimized jointly and boost each other incrementally in a unified EM framework. In this way, (1) the negative effect caused by errors in the iterative back-translation process can be promptly alleviated by SMT filtering noise from its phrase tables; meanwhile, (2) NMT can compensate for the deficiency of fluency inherent in SMT. Experiments conducted on en-fr and en-de translation tasks show that our method outperforms the strong baseline and achieves new state-of-the-art unsupervised machine translation performance.
http://arxiv.org/abs/1901.04112
This paper addresses the problem of 3D pose estimation for multiple people in a few calibrated camera views. The main challenge of this problem is to find the cross-view correspondences among noisy and incomplete 2D pose predictions. Most previous methods address this challenge by directly reasoning in 3D using a pictorial structure model, which is inefficient due to the huge state space. We propose a fast and robust approach to solve this problem. Our key idea is to use a multi-way matching algorithm to cluster the detected 2D poses in all views. Each resulting cluster encodes 2D poses of the same person across different views and consistent correspondences across the keypoints, from which the 3D pose of each person can be effectively inferred. The proposed convex optimization based multi-way matching algorithm is efficient and robust against missing and false detections, without knowing the number of people in the scene. Moreover, we propose to combine geometric and appearance cues for cross-view matching. The proposed approach achieves significant performance gains over the state of the art (96.3% vs. 90.6% and 96.9% vs. 88% on the Campus and Shelf datasets, respectively), while being efficient for real-time applications.
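The consistency constraint that multi-way matching enforces can be illustrated on pairwise matchings: going from view A to C via B must agree with the direct A-to-C matching (an illustrative sketch of cycle consistency only; the paper's convex optimization formulation is not reproduced, and the dict-based matching representation is an assumption):

```python
def compose(m_ab, m_bc):
    """Compose pairwise matchings: detection i in view A -> view B -> view C.
    Matchings are dicts {detection_index_in_first_view: index_in_second_view}."""
    return {i: m_bc[j] for i, j in m_ab.items() if j in m_bc}

def cycle_consistent(m_ab, m_bc, m_ac):
    """True when the A->B->C composition agrees with the direct A->C matching
    on every detection matched along both routes (cycle consistency)."""
    via_b = compose(m_ab, m_bc)
    return all(m_ac.get(i) == j for i, j in via_b.items())
```

Independent pairwise matchers can easily violate this constraint under noisy 2D detections; clustering all views jointly, as the paper does, makes the correspondences cycle-consistent by construction.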
http://arxiv.org/abs/1901.04111
The automatic recognition of emotion in speech can inform our understanding of language, emotion, and the brain. It also has practical application to human-machine interactive systems. This paper examines the recognition of emotion in naturally occurring speech, where there are no constraints on what is said or the emotions expressed. This task is more difficult than that using data collected in scripted, experimentally controlled settings, and fewer results are published. Our data come from couples in psychotherapy. Video and audio recordings were made of three couples (A, B, C) over 18 hour-long therapy sessions. This paper describes the method used to code the audio recordings for the four emotions of Anger, Sadness, Joy and Tension, plus Neutral, also covering our approach to managing the unbalanced samples that a naturally occurring emotional speech dataset produces. Three groups of acoustic features were used in our analysis: filter-bank, frequency, and voice-quality features. The random forests model classified the features. Recognition rates are reported for each individual, the result of the speaker-dependent models that we built. In each case, the best recognition rates were achieved using the filter-bank features alone. For Couple A, these rates were 90% for the female and 87% for the male for the recognition of three emotions plus Neutral. For Couple B, the rates were 84% for the female and 78% for the male for the recognition of all four emotions plus Neutral. For Couple C, a rate of 88% was achieved for the female for the recognition of the four emotions plus Neutral and 95% for the male for three emotions plus Neutral. For pairwise recognition, the rates ranged from 76% to 99% across the three couples. Our results show that couple therapy is a rich context for the study of emotion in naturally occurring speech.
http://arxiv.org/abs/1901.04110
Evolutionarily stable strategy (ESS) is an important solution concept in game theory which has been applied frequently to biological models. Informally, an ESS is a strategy that, if followed by the population, cannot be taken over by a mutation strategy that is initially rare. Finding such a strategy has been shown to be difficult from a theoretical complexity perspective. We present an algorithm for the case where mutations are restricted to pure strategies, and present experiments on several game classes including random games and a recently-proposed cancer model. Our algorithm is based on a mixed-integer non-convex feasibility program formulation, which constitutes the first general optimization formulation for this problem. It turns out that the vast majority of the games included in the experiments contain an ESS with small support, and our algorithm is outperformed by a support-enumeration based approach. However, we suspect our algorithm may be useful in the future as games are studied that have ESS with potentially larger and unknown support size.
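The pure-mutant restriction makes the ESS conditions directly checkable: a candidate mixed strategy s survives every pure mutant e if either s does strictly better against the population, or they tie and s does strictly better against the mutant itself (a minimal verification sketch, not the paper's mixed-integer feasibility program):

```python
def payoff(x, A, y):
    """Expected payoff x^T A y for mixed strategies over a symmetric game A."""
    return sum(x[i] * A[i][j] * y[j]
               for i in range(len(x)) for j in range(len(x)))

def is_ess_vs_pure(s, A):
    """Check the ESS conditions for mixed strategy s against pure mutants:
    for every pure strategy e != s, either u(s,s) > u(e,s), or
    u(s,s) == u(e,s) and u(s,e) > u(e,e)."""
    n = len(A)
    for i in range(n):
        e = [1.0 if j == i else 0.0 for j in range(n)]
        if e == s:
            continue
        uss, ues = payoff(s, A, s), payoff(e, A, s)
        if ues > uss:
            return False                        # mutant strictly invades
        if abs(ues - uss) < 1e-12 and payoff(s, A, e) <= payoff(e, A, e):
            return False                        # tie, but mutant not repelled
    return True
```

In the classic Hawk-Dove game with payoff matrix [[-1, 2], [0, 1]], the mixed strategy (0.5, 0.5) passes this check while pure Dove does not, matching the textbook result.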
https://arxiv.org/abs/1803.00607
Recently, neural models pretrained on a language modeling task, such as ELMo (Peters et al., 2017), OpenAI GPT (Radford et al., 2018), and BERT (Devlin et al., 2018), have achieved impressive results on various natural language processing tasks such as question-answering and natural language inference. In this paper, we describe a simple re-implementation of BERT for query-based passage re-ranking. Our system is the state of the art on the TREC-CAR dataset and the top entry in the leaderboard of the MS MARCO passage retrieval task, outperforming the previous state of the art by 27% (relative) in MRR@10. The code to reproduce our submission is available at https://github.com/nyu-dl/dl4marco-bert
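The MRR@10 metric cited above is straightforward to compute: for each query, take the reciprocal of the rank of the first relevant passage among the top 10, counting 0 when none appears, and average over queries (a standard-definition sketch; the input format here is an assumption, not the repository's API):

```python
def mrr_at_10(ranked_relevance):
    """Mean Reciprocal Rank at cutoff 10.
    ranked_relevance: one list per query of 0/1 relevance flags, ordered by
    the system's ranking. A query with no relevant hit in the top 10 scores 0."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, rel in enumerate(flags[:10], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)
```

Because only the first relevant hit counts, MRR@10 rewards exactly the behaviour a re-ranker is built for: pushing one good passage to the top of the list.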
http://arxiv.org/abs/1901.04085
Increasing the depth of convolutional neural networks (CNNs) is a highly promising way to increase their accuracy. However, increased CNN depth also increases the parameter count, leading to slow backpropagation convergence that is prone to overfitting. We trained our model (Residual-CNDS) to classify the very large-scale scene datasets MIT Places 205 and MIT Places 365-Standard. The results on the two datasets show that our proposed model effectively handles slow convergence, overfitting, and degradation. CNNs with deep supervision (CNDS) add supplementary supervision branches to the deep convolutional neural network at specified layers to counter vanishing gradients, effectively addressing delayed convergence and overfitting. Nevertheless, CNDS does not resolve degradation; hence, we add residual learning to CNDS in certain layers after studying the best place in which to add it. With this approach we overcome degradation in the very deep network. We built two models, Residual-CNDS 8 and Residual-CNDS 10, tested them on the two large-scale datasets, and compared our results with other recently introduced cutting-edge networks in terms of top-1 and top-5 classification accuracy. Both models show good improvement, which supports the assertion that the addition of residual connections enhances CNDS accuracy without adding any computational complexity.
http://arxiv.org/abs/1902.10030
Background subtraction is a common approach that detects moving objects in a video sequence by finding the significant differences between the video frames and a static background model. This paper presents a system that achieves vehicle detection using a block-based background subtraction algorithm followed by a deep-learning data validation step. The main idea is to segment the image into equal-size blocks and to model the static reference background image (SRBI) by calculating the variance between each block's pixels and the corresponding block's pixels in the adjacent frame. The system was implemented with four different methods: Absolute Difference, Image Entropy, Exclusive OR (XOR) and Discrete Cosine Transform (DCT). The experimental results showed that the DCT method has the highest vehicle detection accuracy.
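The block-based Absolute Difference variant can be sketched directly: split the frame and the static reference background into tiles and flag a tile when the mean absolute pixel difference exceeds a threshold (an illustrative sketch; the block size and threshold are assumptions, and the paper's deep-learning validation step is not included):

```python
def block_abs_diff(frame, background, block=2, threshold=10):
    """Flag candidate moving blocks: split both grayscale images into
    block x block tiles and mark a tile when the mean absolute difference
    against the static reference background exceeds the threshold.
    Returns (block_row, block_col) indices of flagged tiles."""
    h, w = len(frame), len(frame[0])
    flagged = []
    for by in range(0, h, block):
        for bx in range(0, w, block):
            diff = total = 0
            for y in range(by, min(by + block, h)):
                for x in range(bx, min(bx + block, w)):
                    diff += abs(frame[y][x] - background[y][x])
                    total += 1
            if diff / total > threshold:
                flagged.append((by // block, bx // block))
    return flagged
```

In the full system, the flagged blocks would then be passed to the validation stage to reject false positives before counting vehicles.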
http://arxiv.org/abs/1901.04077