The rapid progress in synthetic image generation and manipulation has now reached a point where it raises significant concerns about its implications for society. At best, this leads to a loss of trust in digital content, but it might even cause further harm through the spread of false information and the creation of fake news. In this paper, we examine the realism of state-of-the-art image manipulations and how difficult it is to detect them, either automatically or by humans. In particular, we focus on DeepFakes, Face2Face, and FaceSwap as prominent representatives of facial manipulation. We create more than half a million manipulated images for each approach. The resulting publicly available dataset is at least an order of magnitude larger than comparable alternatives, and it enables us to train data-driven forgery detectors in a supervised fashion. We show that the use of additional domain-specific knowledge improves forgery detection to an unprecedented accuracy, even in the presence of strong compression. By conducting a series of thorough experiments, we quantify the differences between classical approaches, novel deep learning approaches, and the performance of human observers.
http://arxiv.org/abs/1901.08971
Despite inherent ill-definition, anomaly detection is a research endeavor of great interest within machine learning and visual scene understanding alike. Most commonly, anomaly detection is considered as the detection of outliers within a given data distribution based on some measure of normality. The most significant challenge in real-world anomaly detection problems is that available data is highly imbalanced towards normality (i.e. non-anomalous) and contains at most a subset of all possible anomalous samples, hence limiting the use of well-established supervised learning methods. By contrast, we introduce an unsupervised anomaly detection model, trained only on the normal (non-anomalous, plentiful) samples in order to learn the normality distribution of the domain and hence detect abnormality based on deviation from this model. Our proposed approach employs an encoder-decoder convolutional neural network with skip connections to thoroughly capture the multi-scale distribution of the normal data in high-dimensional image space. Furthermore, utilizing an adversarial training scheme for this chosen architecture provides superior reconstruction both within high-dimensional image space and a lower-dimensional latent vector space encoding. Minimizing the reconstruction error metric within both the image and hidden vector spaces during training helps the model learn the distribution of normality as required. Higher reconstruction metrics during subsequent test and deployment are thus indicative of a deviation from this normal distribution, and hence of an anomaly. Experimentation over established anomaly detection benchmarks and challenging real-world datasets, within the context of X-ray security screening, shows the unique promise of such a proposed approach.
http://arxiv.org/abs/1901.08954
We explore the use of knowledge graphs that capture general or commonsense knowledge to augment the information extracted from images by state-of-the-art methods for image captioning. The results of our experiments on several benchmark data sets such as MS COCO, as measured by CIDEr-D, a performance metric for image captioning, show that variants of the state-of-the-art methods that make use of the information extracted from knowledge graphs can substantially outperform those that rely solely on the information extracted from images.
http://arxiv.org/abs/1901.08942
Learning with auxiliary tasks has been shown to improve the generalisation of a primary task. However, this comes at the cost of manually labelling additional tasks which may, or may not, be useful for the primary task. We propose a new method which automatically learns labels for an auxiliary task, such that any supervised learning task can be improved without requiring access to additional data. The approach is to train two neural networks: a label-generation network to predict the auxiliary labels, and a multi-task network to train the primary task alongside the auxiliary task. The loss for the label-generation network incorporates the multi-task network’s performance, and so this interaction between the two networks can be seen as a form of meta learning. We show that our proposed method, Meta AuXiliary Learning (MAXL), outperforms single-task learning on 7 image datasets by a significant margin, without requiring additional auxiliary labels. We also show that MAXL outperforms several other baselines for generating auxiliary labels, and is even competitive when compared with human-defined auxiliary labels. The self-supervised nature of our method leads to a promising new direction towards automated generalisation. The source code is available at https://github.com/lorenmt/maxl.
http://arxiv.org/abs/1901.08933
We present the first real-time human performance capture approach that reconstructs dense, space-time coherent deforming geometry of entire humans in general everyday clothing from just a single RGB video. We propose a novel two-stage analysis-by-synthesis optimization whose formulation and implementation are designed for high performance. In the first stage, a skinned template model is jointly fitted to background subtracted input video, 2D and 3D skeleton joint positions found using a deep neural network, and a set of sparse facial landmark detections. In the second stage, dense non-rigid 3D deformations of skin and even loose apparel are captured based on a novel real-time capable algorithm for non-rigid tracking using dense photometric and silhouette constraints. Our novel energy formulation leverages automatically identified material regions on the template to model the differing non-rigid deformation behavior of skin and apparel. The two resulting non-linear optimization problems per-frame are solved with specially-tailored data-parallel Gauss-Newton solvers. In order to achieve real-time performance of over 25Hz, we design a pipelined parallel architecture using the CPU and two commodity GPUs. Our method is the first real-time monocular approach for full-body performance capture. Our method yields comparable accuracy with off-line performance capture techniques, while being orders of magnitude faster.
http://arxiv.org/abs/1810.02648
Often when multiple labels are obtained for a training example it is assumed that there is an element of noise that must be accounted for. It has been shown that this disagreement can be considered signal instead of noise. In this work we investigate using soft labels for training data to improve generalization in machine learning models. However, using soft labels for training Deep Neural Networks (DNNs) is not practical due to the costs involved in obtaining multiple labels for large data sets. We propose soft label memorization-generalization (SLMG), a fine-tuning approach to using soft labels for training DNNs. We assume that differences in labels provided by human annotators represent ambiguity about the true label instead of noise. Experiments with SLMG demonstrate improved generalization performance on the Natural Language Inference (NLI) task. Our experiments show that by injecting a small percentage of soft label training data (0.03% of training set size) we can improve generalization performance over several baselines.
http://arxiv.org/abs/1702.08563
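To make the notion of a soft label in the abstract above concrete: a soft label can be read as the empirical distribution of annotator votes for one example. The sketch below is an illustration only and is not the SLMG fine-tuning procedure; the function names, the three-class NLI label order, and the example votes are all assumptions.

```python
import math
from collections import Counter

# Hypothetical label order for a three-class NLI task (an assumption).
CLASSES = ["entailment", "neutral", "contradiction"]

def soft_label(votes):
    """Turn a list of annotator votes into a probability distribution."""
    counts = Counter(votes)
    total = len(votes)
    return [counts[c] / total for c in CLASSES]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy H(target, predicted); target may be soft or one-hot."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Five made-up annotator votes for a single premise/hypothesis pair.
votes = ["entailment", "entailment", "neutral", "entailment", "neutral"]
target = soft_label(votes)          # 3/5 entailment, 2/5 neutral, 0 contradiction
predicted = [0.7, 0.2, 0.1]         # made-up model output probabilities
loss = cross_entropy(target, predicted)
```

With a one-hot target the same function reduces to the usual negative log-likelihood, so soft and hard labels can share one loss.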
Reconstructing a high-resolution 3D model of an object is a challenging task in computer vision. Designing scalable and light-weight architectures is crucial while addressing this problem. Existing point-cloud based reconstruction approaches directly predict the entire point cloud in a single stage. Although this technique can handle low-resolution point clouds, it is not a viable solution for generating dense, high-resolution outputs. In this work, we introduce DensePCR, a deep pyramidal network for point cloud reconstruction that hierarchically predicts point clouds of increasing resolution. Towards this end, we propose an architecture that first predicts a low-resolution point cloud, and then hierarchically increases the resolution by aggregating local and global point features to deform a grid. Our method generates point clouds that are accurate, uniform and dense. Through extensive quantitative and qualitative evaluation on synthetic and real datasets, we demonstrate that DensePCR outperforms the existing state-of-the-art point cloud reconstruction works, while also providing a light-weight and scalable architecture for predicting high-resolution outputs.
http://arxiv.org/abs/1901.08906
This paper aims to use term clustering to build a modular ontology according to a core ontology from domain-specific text. The acquisition of semantic knowledge focuses on noun phrases appearing with the same syntactic roles in relation to a verb or its preposition combination in a sentence. The construction of this co-occurrence matrix from context helps to build the feature space of noun phrases, which is then transformed into several encoding representations including feature selection and dimensionality reduction. In addition, the content is also represented through the construction of word vectors. These representations are clustered with the K-Means and Affinity Propagation (AP) methods respectively, which yields the different term clustering frameworks. Because K-Means is randomized, repeated runs are used to find the optimal parameter. The frameworks are evaluated extensively: AP is the most effective for co-occurring terms, and the NMF encoding technique stands out for its promise in feature compression.
http://arxiv.org/abs/1901.09037
Swarms of small spacecraft offer whole new capabilities in Earth observation, global positioning and communications compared to a large monolithic spacecraft. These small spacecraft can provide bigger apertures that increase gain in communication antennas, increase area coverage or effective resolution of distributed cameras and enable persistent observation of ground or space targets. However, there remain important challenges in operating a large number of spacecraft at once. Current methods would require a large number of ground operators to monitor and actively control these spacecraft, which poses coordination and control challenges that prevent the technology from being scaled up in a cost-effective manner. Technologies are required to enable one ground operator to manage tens if not hundreds of spacecraft. We propose to utilize laser beams directed from the ground or from a command and control spacecraft to organize and manage a large swarm. Each satellite in the swarm will have a customized “smart skin” containing solar panels, power and control circuitry and an embedded secondary propulsion unit. A secondary propulsion unit may include electrospray propulsion, a solar radiation pressure-based system, photonic laser thrusters and Lorentz force thrusters. Solar panels typically occupy the largest surface area on an Earth-orbiting satellite. A laser beam from another spacecraft or from the ground would interact with the solar panels of the spacecraft swarm. The laser beam would be used to select a ‘leader’ amongst a group of spacecraft and to set parameters for formation flight, including separation distance, local if-then rules and coordinated changes in attitude and position.
http://arxiv.org/abs/1901.08875
There is growing demand for satellite swarms and constellations for global positioning, remote sensing and relay communication in higher LEO orbits. This will result in many obsolete, damaged and abandoned satellites that will remain on-orbit beyond 25 years. These abandoned satellites and space debris may be economically valuable orbital real-estate and resources that can be reused, repaired or upgraded for future use. Space traffic management is critical to repair damaged satellites, divert satellites into warehouse orbits and effectively de-orbit satellites and space debris that are beyond repair and salvage. Current methods for on-orbit capture, servicing and repair require a large service satellite. However, accessing abandoned satellites and space debris carries an inherently heightened risk of damage to a servicing spacecraft. Sending multiple small robots, each specialized in a specific task, is a credible alternative, as the system is simple and cost-effective and the loss of one or more robots does not end the mission. In this work, we outline an end-to-end multirobot system to capture damaged and abandoned spacecraft for salvaging, repair and de-orbiting. We analyze the feasibility of sending multiple, decentralized robots that can work cooperatively to perform capture of the target satellite as a first step, followed by crawling onto damaged satellites to perform detailed mapping. After obtaining a detailed map of the satellite, the robots will proceed to either repair and replace or dismantle components for salvage operations. Finally, the remaining components will be packaged with a de-orbit device for accelerated de-orbit.
http://arxiv.org/abs/1901.11121
The problem of evaluating the performance of soccer players is attracting the interest of many companies and the scientific community, thanks to the availability of massive data capturing all the events generated during a match (e.g., tackles, passes, shots, etc.). Unfortunately, there is no consolidated and widely accepted metric for measuring performance quality in all of its facets. In this paper, we design and implement PlayeRank, a data-driven framework that offers a principled multi-dimensional and role-aware evaluation of the performance of soccer players. We build our framework by exploiting a massive dataset of soccer-logs consisting of millions of match events pertaining to four seasons of 18 prominent soccer competitions. By comparing PlayeRank to known algorithms for performance evaluation in soccer, and by exploiting a dataset of players’ evaluations made by professional soccer scouts, we show that PlayeRank significantly outperforms the competitors. We also explore the ratings produced by PlayeRank and discover interesting patterns about the nature of excellent performances and what distinguishes the top players from the others. Finally, we explore some applications of PlayeRank (i.e. searching players and player versatility), showing its flexibility and efficiency, which make it well suited for use in the design of a scalable platform for soccer analytics.
http://arxiv.org/abs/1802.04987
Image captioning is the process of generating a natural language description of an image. Most current image captioning models, however, do not take into account the emotional aspect of an image, which is very relevant to activities and interpersonal relationships represented therein. Towards developing a model that can produce human-like captions incorporating these, we use facial expression features extracted from images including human faces, with the aim of improving the descriptive ability of the model. In this work, we present two variants of our Face-Cap model, which embed facial expression features in different ways, to generate image captions. Using all standard evaluation metrics, our Face-Cap models outperform a state-of-the-art baseline model for generating image captions when applied to an image caption dataset extracted from the standard Flickr 30K dataset, consisting of around 11K images containing faces. An analysis of the captions finds that, perhaps surprisingly, the improvement in caption quality appears to come not from the addition of adjectives linked to emotional aspects of the images, but from more variety in the actions described in the captions.
https://arxiv.org/abs/1807.02250
We are proceeding towards the age of automation and robotic integration of our production lines [5]. Effective quality-control systems have to be put in place to maintain the quality of manufactured components. Among different quality-control systems, vision-based inspection systems have gained a considerable amount of popularity [8] due to developments in computing power and image processing techniques. In this paper, we present a vision-based inspection system (VBI) as a quality-control system, which not only detects the presence of defects, as conventional VBIs do, but also leverages developments in machine learning to predict the presence of surface fractures and wear. We use OpenCV, an open source computer-vision framework, and TensorFlow, an open source machine-learning framework developed by Google Inc., to accomplish the tasks of detecting and predicting the presence of surface defects, such as fractures, on manufactured gears.
http://arxiv.org/abs/1901.08864
In this work, we introduce a novel framework that employs cluster annotation to boost active learning by reducing the number of human interactions required to train deep neural networks. Instead of annotating single samples individually, humans can also label clusters, producing a higher number of annotated samples at the cost of a small label error. Our experiments show that the proposed framework requires 82% and 87% fewer human interactions for the CIFAR-10 and EuroSAT datasets respectively when compared with fully-supervised training, while maintaining similar performance on the test set.
http://arxiv.org/abs/1812.11780
Recently, state-of-the-art results have been achieved in semantic segmentation using fully convolutional networks (FCNs). Most of these networks employ an encoder-decoder style architecture similar to U-Net and are trained with images and the corresponding segmentation maps as a pixel-wise classification task. Such frameworks only exploit class information by using the ground truth segmentation maps. In this paper, we propose a multi-task learning framework with the main aim of exploiting structural and spatial information along with the class information. We modify the decoder part of the FCN to exploit the structural information as well as the class information, while keeping the number of network parameters as low as possible. We obtain the structural information in either of two ways: i) using the contour map and ii) using the distance map, both of which can be obtained from ground truth segmentation maps with no additional annotation costs. We also explore different ways in which distance maps can be computed and study the effects of different distance maps on the segmentation performance. We also experiment extensively on two different medical image segmentation applications: i) optic disc and cup segmentation using color fundus images and ii) polyp segmentation using endoscopic images. Through our experiments, we report results comparable to, and in some cases better than, the current state-of-the-art architectures, with a reduction on the order of 2x in the number of parameters.
http://arxiv.org/abs/1901.08824
This is the Proceedings of the AAAI 2019 Workshop on Network Interpretability for Deep Learning.
http://arxiv.org/abs/1901.08813
Face morphing nowadays represents a major security threat in the context of electronic identity documents as well as an interesting challenge for researchers in the field of face recognition. Despite the good performance obtained by state-of-the-art approaches on digital images, no satisfactory solutions have been identified so far to deal with cross-database testing and printed-scanned images (typically used in many countries for document issuing). In this work, novel approaches are proposed to train Deep Neural Networks for morphing detection: in particular, the generation of simulated printed-scanned images, together with other data augmentation strategies and pre-training on large face recognition datasets, makes it possible to reach state-of-the-art accuracy on challenging datasets from heterogeneous image sources.
http://arxiv.org/abs/1901.08811
This is the Proceedings of the eleventh Workshop on Answer Set Programming and Other Computing Paradigms (ASPOCP) 2018, which was held in Oxford, UK, on July 18th, 2018.
http://arxiv.org/abs/1812.03508
In this study, a multiple hypothesis tracking (MHT) algorithm for multi-target multi-camera tracking (MCT) with disjoint views is proposed. Our method forms track-hypothesis trees, each branch of which represents a multi-camera track of a target that may move within a camera as well as across cameras. Furthermore, multi-target tracking within a camera is performed simultaneously with the tree formation by manipulating the status of each track hypothesis. Each status represents one of three stages of a multi-camera track: tracking, searching, and end-of-track. The tracking status means that targets are tracked by a single-camera tracker. In the searching status, disappeared targets are examined to determine whether they reappear in other cameras. The end-of-track status means that the target is considered to have exited the camera network due to its lengthy invisibility. These three statuses help MHT form the track-hypothesis trees for multi-camera tracking. Furthermore, we present a gating technique for eliminating unlikely observation-to-track associations. In the experiments, we evaluate the proposed method on two datasets, DukeMTMC and NLPR-MCT, and show that it outperforms the state-of-the-art method in terms of accuracy. In addition, we show that the proposed method can operate online and in real time.
http://arxiv.org/abs/1901.08787
Neural network quantization has significant benefits for deployment on dedicated accelerators. We introduce the first practical 4-bit post-training quantization approach: it does not involve training the quantized model (“fine-tuning”), nor does it require the availability of the full dataset. Yet, it maintains accuracy that is just a few percent less than the state-of-the-art baseline across a wide range of convolutional models. This is unlike traditional approaches that fail entirely in these settings. To achieve this, we convert a full-precision pre-trained network to a limited-precision network by minimizing the quantization error at the tensor level. We analyze the trade-off between quantization noise and clipping distortion in low-precision networks. This enables us to derive approximate analytical expressions for the mean-square-error degradation due to clipping. By optimizing these expressions, we show marked improvements over standard quantization schemes that normally avoid clipping.
http://arxiv.org/abs/1810.05723
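The trade-off between quantization noise and clipping distortion mentioned in the quantization abstract above can be observed empirically. The following is a minimal sketch, assuming standard-normal samples as a stand-in for a tensor and a plain uniform quantizer; it measures the error numerically and does not reproduce the paper's analytical expressions. The function names and the trial threshold of 2.5 are illustrative choices.

```python
import random

def quantize(x, alpha, bits):
    """Uniform quantizer on [-alpha, alpha]; values outside are clipped."""
    levels = 2 ** bits
    step = 2 * alpha / levels
    x = max(-alpha, min(alpha, x))                   # clipping distortion arises here
    idx = min(int((x + alpha) / step), levels - 1)   # index of the bin containing x
    return -alpha + (idx + 0.5) * step               # reconstruct at the bin centre

def mse(samples, alpha, bits=4):
    """Mean squared error between the samples and their quantized values."""
    return sum((x - quantize(x, alpha, bits)) ** 2 for x in samples) / len(samples)

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(10000)]  # stand-in tensor values

full_range = max(abs(x) for x in samples)  # range that avoids clipping entirely
mse_full = mse(samples, full_range)        # coarse steps, no clipping
mse_clipped = mse(samples, 2.5)            # finer steps, a few outliers clipped
```

Shrinking the range makes each of the 16 steps smaller, which lowers rounding noise for the bulk of the values at the price of a larger error on the few clipped outliers; for bell-shaped data at 4 bits the clipped setting gives the lower overall error here.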
When trained on multimodal image datasets, normal Generative Adversarial Networks (GANs) are usually outperformed by class-conditional GANs and ensemble GANs, but conditional GANs are restricted to labeled datasets and ensemble GANs lack efficiency. We propose a novel GAN variant called virtual conditional GAN (vcGAN) which is not only an ensemble GAN with multiple generative paths while adding almost zero network parameters, but also a conditional GAN that can be trained on unlabeled datasets without explicit clustering steps or objectives other than the adversarial loss. Inside the vcGAN’s generator, a learnable “analog-to-digital converter (ADC)” module maps a slice of the input multivariate Gaussian noise to discrete/digital noise (a virtual label), according to which a selector selects the corresponding generative path to produce the sample. All the generative paths share the same decoder network, while in each path the decoder network is fed with a concatenation of a different pre-computed amplified one-hot vector and the input Gaussian noise. We conducted extensive experiments on several balanced/imbalanced image datasets to demonstrate that vcGAN converges faster and achieves an improved Fréchet Inception Distance (FID). In addition, we show, as a training byproduct, that the ADC in vcGAN learns the categorical probability of each mode and that each generative path generates samples of a specific mode, which enables class-conditional sampling. Code is available at https://github.com/annonnymmouss/vcgan
http://arxiv.org/abs/1901.09822
Decision making in multi-agent systems (MAS) is a great challenge due to enormous state and joint action spaces as well as uncertainty, making centralized control generally infeasible. Decentralized control offers better scalability and robustness but requires mechanisms to coordinate on joint tasks and to avoid conflicts. Common approaches to learn decentralized policies for cooperative MAS suffer from non-stationarity and lacking credit assignment, which can lead to unstable and uncoordinated behavior in complex environments. In this paper, we propose Strong Emergent Policy approximation (STEP), a scalable approach to learn strong decentralized policies for cooperative MAS with a distributed variant of policy iteration. For that, we use function approximation to learn from action recommendations of a decentralized multi-agent planning algorithm. STEP combines decentralized multi-agent planning with centralized learning, only requiring a generative model for distributed black box optimization. We experimentally evaluate STEP in two challenging and stochastic domains with large state and joint action spaces and show that STEP is able to learn stronger policies than standard multi-agent reinforcement learning algorithms, when combining multi-agent open-loop planning with centralized function approximation. The learned policies can be reintegrated into the multi-agent planning process to further improve performance.
http://arxiv.org/abs/1901.08761
YouTube is the leading social media platform for sharing videos. As a result, it is plagued with misleading content that includes staged videos presented as real footage from an incident, videos with misrepresented context and videos where audio/video content is morphed. We tackle the problem of detecting such misleading videos as a supervised classification task. We develop UCNet, a deep network to detect fake videos, and perform our experiments on two datasets: VAVD, created by us, and the publicly available FVC [8]. We achieve a macro-averaged F-score of 0.82 while training and testing on a 70:30 split of FVC, while the baseline model scores 0.36. We find that the proposed model generalizes well when trained on one dataset and tested on the other.
http://arxiv.org/abs/1901.08759
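The macro-averaged F-score reported in the abstract above is the unweighted mean of the per-class F1 scores, so a minority class counts as much as the majority class. The sketch below illustrates the metric on made-up labels; the function name and the data are assumptions and have nothing to do with the FVC or VAVD results.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up ground truth and predictions for six videos.
y_true = ["fake", "fake", "real", "real", "real", "real"]
y_pred = ["fake", "real", "real", "real", "real", "fake"]
score = macro_f1(y_true, y_pred)  # F1(fake) = 0.5, F1(real) = 0.75, mean = 0.625
```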
Biomedical text mining has become more important than ever as the number of biomedical documents rapidly grows. With the progress of machine learning, extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning is boosting the development of effective biomedical text mining models. However, as deep learning models require a large amount of training data, biomedical text mining with deep learning often fails due to the small sizes of training datasets in biomedical fields. Recent research on learning contextualized language representation models from text corpora sheds light on the possibility of leveraging a large number of unannotated biomedical text corpora. We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. Based on the BERT architecture, BioBERT effectively transfers the knowledge of a large amount of biomedical texts into biomedical text mining models. While BERT also shows competitive performance with previous state-of-the-art models, BioBERT significantly outperforms them on three representative biomedical text mining tasks including biomedical named entity recognition (1.86% absolute improvement), biomedical relation extraction (3.33% absolute improvement), and biomedical question answering (9.61% absolute improvement) with minimal task-specific architecture modifications. We make pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and source code for the fine-tuned models at https://github.com/dmis-lab/biobert.
http://arxiv.org/abs/1901.08746
Dynamic portfolio optimization is the process of sequentially allocating wealth to a collection of assets in some consecutive trading periods, based on investors’ return-risk profile. Automating this process with machine learning remains a challenging problem. Here, we design a deep reinforcement learning (RL) architecture with an autonomous trading agent such that investment decisions and actions are made periodically and autonomously, based on a global objective. In particular, without relying on a purely model-free RL agent, we train our trading agent using a novel RL architecture consisting of an infused prediction module (IPM), a generative adversarial data augmentation module (DAM) and a behavior cloning module (BCM). Our model-based approach works with both on-policy and off-policy RL algorithms. We further design the back-testing and execution engine which interact with the RL agent in real time. Using historical real financial market data, we simulate trading with practical constraints, and demonstrate that our proposed model is robust, profitable and risk-sensitive, as compared to baseline trading strategies and model-free RL agents from prior work.
http://arxiv.org/abs/1901.08740
The current state-of-the-art Scrabble agents are not learning-based but depend on truncated Monte Carlo simulations, and the quality of such agents is contingent upon the time available for running the simulations. This thesis takes steps towards building a learning-based Scrabble agent using self-play. Specifically, we try to find a better function approximation for the static evaluation function used in Scrabble, which determines the goodness of a move at a given board configuration. In this work, we experimented with evolutionary algorithms and Bayesian Optimization to learn the weights for an approximate feature-based evaluation function. However, these optimization methods were not very effective, which led us to explore the given problem from an Imitation Learning point of view. We also tried to imitate the ranking of moves produced by the Quackle simulation agent using supervised learning with a neural network function approximator which takes the raw representation of the Scrabble board as the input instead of using only a fixed number of handcrafted features.
http://arxiv.org/abs/1901.08728
Discrete-action algorithms have been central to numerous recent successes of deep reinforcement learning. However, applying these algorithms to high-dimensional action tasks requires tackling the combinatorial increase of the number of possible actions with the number of action dimensions. This problem is further exacerbated for continuous-action tasks that require fine control of actions via discretization. In this paper, we propose a novel neural architecture featuring a shared decision module followed by several network branches, one for each action dimension. This approach achieves a linear increase of the number of network outputs with the number of degrees of freedom by allowing a level of independence for each individual action dimension. To illustrate the approach, we present a novel agent, called Branching Dueling Q-Network (BDQ), as a branching variant of the Dueling Double Deep Q-Network (Dueling DDQN). We evaluate the performance of our agent on a set of challenging continuous control tasks. The empirical results show that the proposed agent scales gracefully to environments with increasing action dimensionality and indicate the significance of the shared decision module in coordination of the distributed action branches. Furthermore, we show that the proposed agent performs competitively against a state-of-the-art continuous control algorithm, Deep Deterministic Policy Gradient (DDPG).
http://arxiv.org/abs/1711.08946
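The scaling argument in the BDQ abstract above can be illustrated with a minimal sketch. The function names and example sizes are hypothetical and are not taken from the paper. The sketch only contrasts the number of outputs needed by a joint discretization with the number needed by one branch per action dimension, and shows greedy action selection done independently in each branch.

```python
def joint_output_count(num_dims: int, bins_per_dim: int) -> int:
    """Outputs needed when every combination of sub-actions is one discrete action."""
    return bins_per_dim ** num_dims


def branching_output_count(num_dims: int, bins_per_dim: int) -> int:
    """Outputs needed with one network branch per action dimension."""
    return num_dims * bins_per_dim


def greedy_branch_actions(branch_q_values):
    """Pick the index of the highest value independently in each branch."""
    return [max(range(len(q)), key=q.__getitem__) for q in branch_q_values]


if __name__ == "__main__":
    # Hypothetical task: 6 action dimensions, 11 discrete bins each.
    print(joint_output_count(6, 11), branching_output_count(6, 11))
    # Hypothetical per-branch values for a 2-dimensional action.
    print(greedy_branch_actions([[0.1, 0.9, 0.3], [2.0, -1.0, 0.5]]))
```

The joint count grows exponentially with the number of dimensions while the branching count grows linearly, which is the property the abstract describes.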
The extended Kalman filter (EKF) does not guarantee consistent mean and covariance under linearization, even though it is the main framework for robotic localization. While Lie groups improve the modeling of the state space in localization, the EKF on Lie groups still relies on the arbitrary Gaussian assumption in the face of nonlinear models. We instead use a von Mises filter for orientation estimation together with the conventional Kalman filter for position estimation, and thus we are able to characterize the first two moments of the state estimates. Since the proposed algorithm has a solid probabilistic basis, it is fundamentally relieved of the inconsistency problem. Furthermore, we extend the localization algorithm to a fully circular representation even for position, which is similar to grid patterns found in mammalian brains and in recurrent neural networks. The applicability of the proposed algorithms is substantiated not only by a strong mathematical foundation but also by comparison against other common localization methods.
http://arxiv.org/abs/1809.02910
Many real-world problems exhibit the coexistence of multiple types of heterogeneity, such as view heterogeneity (i.e., multi-view property) and task heterogeneity (i.e., multi-task property). For example, in an image classification problem containing multiple poses of the same object, each pose can be considered as one view, and the detection of each type of object can be treated as one task. Furthermore, in some problems, the data type of multiple views might be different. In a web classification problem, for instance, we might be provided with an image and text mixed data set, where the web pages are characterized by both images and texts. A common strategy to solve this kind of problem is to leverage the consistency of views and the relatedness of tasks to build the prediction model. In the context of deep neural networks, multi-task relatedness is usually realized by grouping tasks at each layer, while multi-view consistency is usually enforced by finding the maximal correlation coefficient between views. However, there is no existing deep learning algorithm that jointly models task and view dual heterogeneity, particularly for a data set with multiple modalities (text and image mixed data set or text and video mixed data set, etc.). In this paper, we bridge this gap by proposing a deep multi-task multi-view learning framework that learns a deep representation for such dual-heterogeneity problems. Empirical studies on multiple real-world data sets demonstrate the effectiveness of our proposed Deep-MTMV algorithm.
http://arxiv.org/abs/1901.08723
A unified method for extracting geometric shape features from binary image data using a steady state partial differential equation (PDE) system as a boundary value problem is presented in this paper. The PDE and functions are formulated to extract the thickness, orientation, and skeleton simultaneously. The main advantages of the proposed method are that the orientation is defined without derivatives and that the thickness computation does not impose a topological constraint on the target shape. A one-dimensional analytical solution is provided to validate the proposed method. In addition, two- and three-dimensional numerical examples are presented to confirm the usefulness of the proposed method.
http://arxiv.org/abs/1806.05299
One way of comparing the expressiveness of rectifier networks is by the number of linear regions, or pieces, of the piecewise linear functions modeled by such networks. However, enumerating these regions is prohibitive in practice and the known analytical bounds on their numbers are identical for networks having the same dimensions. In this work, we approximate the number of linear regions of rectifier networks through empirical bounds based on features of the trained network and probabilistic inference. Our first contribution is an algorithm for probabilistic lower bounds of mixed-integer linear sets, which is several orders of magnitude faster than exact counting and obtains values reaching similar orders of magnitude. Our second contribution is a tighter activation-based bound for the maximum number of linear regions, which is particularly stronger in networks with narrow layers. Combined, these bounds yield a reasonable proxy for the number of linear regions and the accuracy of the networks.
http://arxiv.org/abs/1810.03370
Limitations in actuation, sensing, and computation have forced small legged robots to rely on carefully tuned, mechanically mediated leg trajectories for effective locomotion. Recent advances in manufacturing, however, have enabled the development of small legged robots capable of operation at multiple stride frequencies using multi-degree-of-freedom leg trajectories. Proprioceptive sensing and control are key to extending the capabilities of these robots to a broad range of operating conditions. In this work, we leverage concomitant sensing for piezoelectric actuation to develop a computationally efficient framework for estimation and control of leg trajectories on a quadrupedal microrobot. We demonstrate accurate position estimation ($<$16% root-mean-square error) and control ($<$16% root-mean-square tracking error) during locomotion across a wide range of stride frequencies (10-50 Hz). This capability enables the exploration of two parametric leg trajectories designed to reduce leg slip and increase locomotion performance (e.g., speed, cost-of-transport, etc.). Using this approach, we demonstrate high performance locomotion at stride frequencies of 10-30 Hz, where the robot’s natural dynamics result in poor open-loop locomotion. Furthermore, we identify regions of highly dynamic locomotion, low cost-of-transport (3.33), and minimal leg slippage ($<$10%).
http://arxiv.org/abs/1901.08715
Studying the complexity of various bribery problems has been one of the main research foci in computational social choice. In all the models of bribery studied so far, the briber has to pay every voter some amount of money depending on what the briber wants the voter to report, and the briber has some budget at her disposal. Although these models successfully capture many real world applications, in many other scenarios, the voters may be unwilling to deviate too much from their true preferences. In this paper, we study the computational complexity of the problem of finding a preference profile which is as close to the true preference profile as possible and still achieves the briber’s goal subject to budget constraints. We call this problem Optimal Bribery. We consider three important measures of distance, namely, swap distance, footrule distance, and maximum displacement distance, and resolve the complexity of the optimal bribery problem for many common voting rules. We show that the problem is polynomial time solvable for the plurality and veto voting rules for all three measures of distance. On the other hand, we prove that the problem is NP-complete for a class of scoring rules which includes the Borda voting rule, maximin, Copeland$^\alpha$ for any $\alpha\in[0,1]$, and Bucklin voting rules for all three measures of distance, even when the distance allowed per voter is $1$ for the swap and maximum displacement distances and $2$ for the footrule distance, and even without the budget constraints (which corresponds to having an infinite budget). For the $k$-approval voting rule for any constant $k>1$ and the simplified Bucklin voting rule, we show that the problem is NP-complete for the swap distance even when the distance allowed is $2$ and for the footrule distance even when the distance allowed is $4$, even without the budget constraints.
http://arxiv.org/abs/1901.08711
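The three distance measures named in the Optimal Bribery abstract above have standard definitions on rankings, which the following minimal sketch computes. The function names and the example rankings are illustrative and not taken from the paper; each ranking is assumed to be a list of the same candidates ordered from most to least preferred.

```python
def positions(ranking):
    """Map each candidate to its 0-based position in the ranking."""
    return {candidate: index for index, candidate in enumerate(ranking)}


def swap_distance(r1, r2):
    """Number of candidate pairs ordered differently in r1 and r2 (Kendall tau)."""
    pos2 = positions(r2)
    seq = [pos2[c] for c in r1]
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


def footrule_distance(r1, r2):
    """Sum over candidates of the absolute change in position."""
    p1, p2 = positions(r1), positions(r2)
    return sum(abs(p1[c] - p2[c]) for c in r1)


def max_displacement_distance(r1, r2):
    """Largest absolute change in position of any single candidate."""
    p1, p2 = positions(r1), positions(r2)
    return max((abs(p1[c] - p2[c]) for c in r1), default=0)


if __name__ == "__main__":
    true_vote = ["a", "b", "c", "d"]
    bribed_vote = ["b", "a", "c", "d"]  # one adjacent swap
    print(swap_distance(true_vote, bribed_vote))
    print(footrule_distance(true_vote, bribed_vote))
    print(max_displacement_distance(true_vote, bribed_vote))
```

A single adjacent swap gives swap distance 1, footrule distance 2, and maximum displacement 1, which matches the per-voter thresholds quoted in the abstract.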
On the topic of journalistic integrity, the current state of accurate, impartial news reporting has garnered much debate in the context of the 2016 US Presidential Election. In pursuit of computational evaluation of news text, the statements (attributions) ascribed by media outlets to sources provide a common category of evidence on which to operate. In this paper, we develop an approach to compare partisan traits of news text attributions and apply it to characterize differences in statements ascribed to candidate Hillary Clinton and incumbent President Donald Trump. In doing so, we present a model trained on over 600 in-house annotated attributions to identify each candidate with accuracy > 88%. Finally, we discuss insights from its performance for future research.
http://arxiv.org/abs/1902.02179
We investigate the effectiveness of a simple solution to the common problem of deep learning in medical image analysis with limited quantities of labeled training data. The underlying idea is to assign artificial labels to abundantly available unlabeled medical images and, through a process known as surrogate supervision, pre-train a deep neural network model for the target medical image analysis task lacking sufficient labeled training data. In particular, we employ 3 surrogate supervision schemes, namely rotation, reconstruction, and colorization, in 4 different medical imaging applications representing classification and segmentation for both 2D and 3D medical images. 3 key findings emerge from our research: 1) pre-training with surrogate supervision is effective for small training sets; 2) deep models trained from initial weights pre-trained through surrogate supervision outperform the same models when trained from scratch, suggesting that pre-training with surrogate supervision should be considered prior to training any deep 3D models; 3) pre-training models in the medical domain with surrogate supervision is more effective than transfer learning from an unrelated domain (e.g., natural images), indicating the practical value of abundant unlabeled medical image data.
http://arxiv.org/abs/1901.08707
In this work, we propose a computational framework in which agents equipped with communication capabilities simultaneously play a series of referential games, where agents are trained using deep reinforcement learning. We demonstrate that the framework mirrors linguistic phenomena observed in natural language: i) the outcome of contact between communities is a function of inter- and intra-group connectivity; ii) linguistic contact either converges to the majority protocol, or in balanced cases leads to novel creole languages of lower complexity; and iii) a linguistic continuum emerges where neighboring languages are more mutually intelligible than farther removed languages. We conclude that intricate properties of language evolution need not depend on complex evolved linguistic capabilities, but can emerge from simple social exchanges between perceptually-enabled agents playing communication games.
http://arxiv.org/abs/1901.08706
Due to the heavy reliance of millimeter-wave (mmWave) wireless systems on directional links, beamforming (BF) with high-dimensional arrays is essential for cellular systems in these frequencies. How to perform the array processing in a power efficient manner is a fundamental challenge. Analog and hybrid BF require fewer analog-to-digital and digital-to-analog converters (ADCs and DACs), but can only communicate in a small number of directions at a time, limiting directional search, spatial multiplexing and control signaling. Digital BF enables flexible spatial processing, but must be operated at a low quantization resolution to stay within reasonable power levels. This decrease in quantizer resolution introduces noise in the received signal and degrades the quality of the transmitted signal. To assess the effect of low-resolution quantization on cellular systems, we present a simple additive white Gaussian noise (AWGN) model for quantization noise. Simulations with this model reveal that at moderate resolutions (3-4 bits per ADC), there is negligible loss in downlink cellular capacity from quantization. In essence, the low-resolution ADCs limit the high SNR, where cellular systems typically do not operate. For the transmitter, it is shown that DACs with 4 or more bits of resolution do not violate the adjacent carrier leakage limit set by 3rd Generation Partnership Project (3GPP) New Radio (NR) standards for cellular operations. Further, this work studies the effect of low resolution quantization on the error vector magnitude (EVM) of the transmitted signal. In fact, our findings suggest that low-resolution fully digital BF architectures can be a power efficient alternative to analog or hybrid beamforming for both transmitters and receivers at millimeter wave.
https://arxiv.org/abs/1901.08693
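The saturation effect described in the quantization abstract above can be sketched by treating quantization noise as an independent additive noise source whose power adds to the thermal noise. This is a minimal illustration under stated assumptions, not the paper's model: the signal-to-quantization-noise ratio is taken from the textbook approximation of 6.02 b + 1.76 dB for an ideal uniform b-bit quantizer driven by a full-scale sinusoid, and the function names are hypothetical.

```python
import math


def db_to_linear(value_db: float) -> float:
    return 10.0 ** (value_db / 10.0)


def linear_to_db(value: float) -> float:
    return 10.0 * math.log10(value)


def sqnr_db_uniform(bits: int) -> float:
    """Assumed SQNR of an ideal uniform quantizer (full-scale sinusoid input)."""
    return 6.02 * bits + 1.76


def effective_snr_db(snr_db: float, bits: int) -> float:
    """SNR after adding independent quantization noise to the thermal noise.

    With signal power S, thermal noise N and quantization noise Q,
    S / (N + Q) = 1 / (1/SNR + 1/SQNR).
    """
    snr = db_to_linear(snr_db)
    sqnr = db_to_linear(sqnr_db_uniform(bits))
    return linear_to_db(1.0 / (1.0 / snr + 1.0 / sqnr))


if __name__ == "__main__":
    for snr_db in (0.0, 10.0, 20.0, 40.0):
        print(snr_db, round(effective_snr_db(snr_db, bits=4), 2))
```

Under these assumptions, a 4-bit converter barely changes the effective SNR at low input SNR, while at very high input SNR the effective SNR flattens out near the quantizer's own SQNR.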
We present a novel Convolutional Neural Network (CNN) based approach for one class classification. The idea is to use zero-centered Gaussian noise in the latent space as the pseudo-negative class and train the network using the cross-entropy loss to learn a good representation as well as the decision boundary for the given class. A key feature of the proposed approach is that any pre-trained CNN can be used as the base network for one class classification. The proposed One Class CNN (OC-CNN) is evaluated on the UMDAA-02 Face, Abnormality-1001, and FounderType-200 datasets. These datasets are related to a variety of one class application problems such as user authentication, abnormality detection and novelty detection. Extensive experiments demonstrate that the proposed method achieves significant improvements over the recent state-of-the-art methods. The source code is available at: github.com/otkupjnoz/oc-cnn.
http://arxiv.org/abs/1901.08688
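The pseudo-negative idea in the OC-CNN abstract above can be sketched as a batch-construction step. This is a minimal sketch under assumptions, not the released implementation: the helper name, the noise standard deviation, and the one-noise-vector-per-feature pairing are all hypothetical, and the feature vectors stand in for the output of whatever pre-trained base network is used.

```python
import random


def make_one_class_batch(features, sigma=0.1, seed=0):
    """Pair real latent features (label 1) with zero-centered Gaussian noise (label 0).

    `features` is a list of equal-length lists of floats. `sigma` and the
    one-to-one pairing are illustrative choices, not values from the paper.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    noise = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in features]
    samples = list(features) + noise
    labels = [1] * len(features) + [0] * len(noise)
    return samples, labels


if __name__ == "__main__":
    latent = [[0.8, 1.2, 0.5], [0.9, 1.0, 0.4]]  # hypothetical CNN features
    samples, labels = make_one_class_batch(latent)
    print(len(samples), labels)
```

A binary classifier trained with cross-entropy on such batches then only needs data from the single target class.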
We address two questions for training a convolutional neural network (CNN) for hyperspectral image classification: i) is it possible to build a pre-trained network? and ii) is the pre-training effective in furthering the performance? To answer the first question, we have devised an approach that pre-trains a network on multiple source datasets that differ in their hyperspectral characteristics and fine-tunes on a target dataset. This approach effectively resolves the architectural issue that arises when transferring meaningful information between the source and the target networks. To answer the second question, we carried out several ablation experiments. Based on the experimental results, a network trained from scratch performs as well as a network fine-tuned from a pre-trained network. However, we observed that pre-training the network has its own advantage in achieving better performance when deeper networks are required.
http://arxiv.org/abs/1901.08658
Learning preferences implicit in the choices humans make is a well studied problem in both economics and computer science. However, most work makes the assumption that humans are acting (noisily) optimally with respect to their preferences. Such approaches can fail when people are themselves learning about what they want. In this work, we introduce the assistive multi-armed bandit, where a robot assists a human playing a bandit task to maximize cumulative reward. In this problem, the human does not know the reward function but can learn it through the rewards received from arm pulls; the robot only observes which arms the human pulls but not the reward associated with each pull. We offer sufficient and necessary conditions for successfully assisting the human in this framework. Surprisingly, better human performance in isolation does not necessarily lead to better performance when assisted by the robot: a human policy can do better by effectively communicating its observed rewards to the robot. We conduct proof-of-concept experiments that support these results. We see this work as contributing towards a theory behind algorithms for human-robot interaction.
http://arxiv.org/abs/1901.08654
Legged robots pose one of the greatest challenges in robotics. Dynamic and agile maneuvers of animals cannot be imitated by existing methods that are crafted by humans. A compelling alternative is reinforcement learning, which requires minimal craftsmanship and promotes the natural evolution of a control policy. However, so far, reinforcement learning research for legged robots is mainly limited to simulation, and only a few comparably simple examples have been deployed on real systems. The primary reason is that training with real robots, particularly with dynamically balancing systems, is complicated and expensive. In the present work, we introduce a method for training a neural network policy in simulation and transferring it to a state-of-the-art legged system, thereby leveraging fast, automated, and cost-effective data generation schemes. The approach is applied to the ANYmal robot, a sophisticated medium-dog-sized quadrupedal system. Using policies trained in simulation, the quadrupedal machine achieves locomotion skills that go beyond what had been achieved with prior methods: ANYmal is capable of precisely and energy-efficiently following high-level body velocity commands, running faster than before, and recovering from falling even in complex configurations.
http://arxiv.org/abs/1901.08652
Scaling end-to-end reinforcement learning to control real robots from vision presents a series of challenges, in particular in terms of sample efficiency. In contrast to end-to-end learning, state representation learning can help learn a compact, efficient and relevant representation of states that speeds up policy learning, reduces the number of samples needed, and is easier to interpret. We evaluate several state representation learning methods on goal-based robotics tasks and propose a new unsupervised model that stacks representations and combines the strengths of several of these approaches. This method encodes all the relevant features, performs on par with or better than end-to-end learning, and is robust to hyper-parameter changes.
http://arxiv.org/abs/1901.08651
We present a novel method for learning a set of disentangled reward functions that sum to the original environment reward and are constrained to be independently achievable. We define independent achievability in terms of value functions with respect to achieving one learned reward while pursuing another learned reward. Empirically, we illustrate that our method can learn meaningful reward decompositions in a variety of domains and that these decompositions exhibit some form of generalization performance when the environment’s reward is modified. Theoretically, we derive results about the effect of maximizing our method’s objective on the resulting reward functions and their corresponding optimal policies.
http://arxiv.org/abs/1901.08649
This letter introduces the LOOP binary descriptor (local optimal oriented pattern) that encodes rotation invariance into the main formulation itself. This makes any post-processing stage for rotation invariance redundant and improves on both accuracy and time complexity. We consider fine-grained lepidoptera (moth/butterfly) species recognition as the representative problem since it involves repetition of localized patterns and textures that may be exploited for discrimination. We evaluate the performance of LOOP against its predecessors as well as a few other popular descriptors. Besides experiments on standard benchmarks, we also introduce a new small image dataset on NZ Lepidoptera. LOOP performs as well as or better than previous binary descriptors on all datasets evaluated. The new dataset and demo code of the proposed method are to be made available through the lead author’s academic webpage and GitHub.
http://arxiv.org/abs/1710.09317
Radiation therapy of thoracic and abdominal tumors requires incorporating the respiratory motion into treatments. To precisely account for the patient respiratory motions and predict the respiratory signals, a generalized model for predictions of different types of respiratory motions is desired. The aim of this study is to explore the feasibility of developing a Long Short-Term Memory (LSTM)-based model for the respiratory signal prediction. To achieve that, 1703 sets of Real-Time Position Management data were collected from retrospective studies across three clinical institutions. These datasets were separated as the training, internal validity and external validity groups. 1187 datasets were used for model development and the remaining 516 datasets were used to test the generality power of the model. Furthermore, an exhaustive grid search was implemented to find the optimal hyper-parameters of the LSTM model. The hyper-parameters are the number of LSTM layers, the number of hidden units, the optimizer, the learning rate, the number of epochs, and the length of time lags. Our model achieved superior accuracy over conventional artificial neural network models: with a prediction window of 500ms, the LSTM model achieved an average relative Mean Absolute Error (MAE) of 0.037, an average Root Mean Square Error (RMSE) of 0.048, and a Maximum Error (ME) of 1.687 in the internal validity data, and an average relative MAE of 0.112, an average RMSE of 0.139 and an ME of 1.811 in the external validity data. Compared to the LSTM model trained with default hyper-parameters, the MAE of the optimized model results decreased by 20%, indicating the importance of tuning the hyper-parameters of LSTM models to obtain superior accuracy. This study demonstrates the potential of deep LSTM models for the respiratory signal prediction and illustrates the impacts of major hyper-parameters in LSTM models.
https://arxiv.org/abs/1901.08638
This technical note describes a new baseline for the Natural Questions. Our model is based on BERT and reduces the gap between the model F1 scores reported in the original dataset paper and the human upper bound by 30% and 50% relative for the long and short answer tasks respectively. This baseline has been submitted to the official NQ leaderboard at ai.google.com/research/NaturalQuestions and we plan to open-source the code for it in the near future.
http://arxiv.org/abs/1901.08634
Camera-equipped unmanned vehicles (UVs) have received a lot of attention in data collection for construction monitoring applications. To develop an autonomous platform, the UV should be able to process multiple modules (e.g., context-awareness, control, localization, and mapping) on an embedded platform. Pixel-wise semantic segmentation provides a UV with the ability to be contextually aware of its surrounding environment. However, in the case of mobile robotic systems with limited computing resources, the large size of the segmentation model and its high memory usage require high computing resources, which is a major challenge for mobile UVs (e.g., a small-scale vehicle with limited payload and space). To overcome this challenge, this paper presents a light and efficient deep neural network architecture to run on an embedded platform in real-time. The proposed model segments navigable space on an image sequence (i.e., a video stream), which is essential for an autonomous vehicle that is based on machine vision. The results demonstrate the performance efficiency of the proposed architecture compared to the existing models and suggest possible improvements that could make the model even more efficient, which is necessary for the future development of autonomous robotics systems.
http://arxiv.org/abs/1901.08630
Parallel corpora are very important for multilingual NLP tasks and for deep learning applications like Statistical Machine Translation systems. The Hindi-English parallel corpus available to date for the news translation task is very limited in size relative to the requirements of such systems. In this work, we have developed a prototype of an automatic parallel corpus generation system, which creates a Hindi-English parallel corpus for the news translation task. Further, to verify the quality of the generated parallel corpus, we have experimented with various performance metrics, and the results are quite interesting.
http://arxiv.org/abs/1901.08625
We consider the weighted belief-propagation (WBP) decoder recently proposed by Nachmani et al. where different weights are introduced for each Tanner graph edge and optimized using machine learning techniques. Our focus is on simple-scaling models that use the same weights across certain edges to reduce the storage and computational burden. The main contribution is to show that simple scaling with few parameters often achieves the same gain as the full parameterization. Moreover, several training improvements for WBP are proposed. For example, it is shown that minimizing average binary cross-entropy is suboptimal in general in terms of bit error rate (BER) and a new “soft-BER” loss is proposed which can lead to better performance. We also investigate parameter adapter networks (PANs) that learn the relation between the signal-to-noise ratio and the WBP parameters. As an example, for the (32,16) Reed-Muller code with a highly redundant parity-check matrix, training a PAN with soft-BER loss gives near-maximum-likelihood performance assuming simple scaling with only three parameters.
http://arxiv.org/abs/1901.08621
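The contrast drawn in the WBP abstract above between binary cross-entropy and a "soft-BER" loss can be sketched as follows. The soft-BER definition used here, the average probability that each bit estimate is wrong, is an assumed reading of the term and may differ from the paper's exact formulation; the function names and example values are hypothetical. Each `p_one[i]` is taken to be the decoder's estimated probability that bit `i` equals 1.

```python
import math


def binary_cross_entropy(bits, p_one, eps=1e-12):
    """Average binary cross-entropy between true bits and estimated P(bit = 1)."""
    total = 0.0
    for b, p in zip(bits, p_one):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= b * math.log(p) + (1 - b) * math.log(1.0 - p)
    return total / len(bits)


def soft_ber(bits, p_one):
    """Assumed soft-BER: average probability mass placed on the wrong bit value."""
    return sum(b * (1.0 - p) + (1 - b) * p for b, p in zip(bits, p_one)) / len(bits)


def hard_ber(bits, p_one):
    """Bit error rate after thresholding the estimates at 0.5."""
    return sum(int(p > 0.5) != b for b, p in zip(bits, p_one)) / len(bits)


if __name__ == "__main__":
    bits = [0, 0, 1, 1]
    p_one = [0.1, 0.6, 0.8, 0.4]  # hypothetical decoder outputs
    print(binary_cross_entropy(bits, p_one), soft_ber(bits, p_one), hard_ber(bits, p_one))
```

Unlike cross-entropy, this soft-BER is bounded per bit and is a smooth surrogate of the thresholded error count, which is one intuition for why it can track BER more closely.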
We employ triplet loss as a space embedding regularizer to boost classification performance. Standard architectures, like ResNet and DenseNet, are extended to support both losses with minimal hyper-parameter tuning. This promotes generality while fine-tuning pretrained networks. Triplet loss is a powerful surrogate for recently proposed embedding regularizers. Yet, it is avoided because of its large batch-size requirement and high computational cost. Through our experiments, we re-assess these assumptions. During inference, our network supports both classification and embedding tasks without any computational overhead. Quantitative evaluation highlights how our approach compares favorably to the existing state of the art on multiple fine-grained recognition datasets. Further evaluation on an imbalanced video dataset achieves significant improvement (>7%). Beyond boosting efficiency, triplet loss brings retrieval and interpretability to classification models.
http://arxiv.org/abs/1901.08616
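The idea in the triplet-loss abstract above, training with a classification loss and a triplet loss together, can be sketched as a weighted sum of the two. This is a minimal sketch under assumptions, not the paper's implementation: the squared Euclidean distance, the margin value, the weighting factor, and all function names are illustrative choices.

```python
import math


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def joint_loss(logits, label, anchor, positive, negative, weight=1.0, margin=0.2):
    """Classification loss plus a weighted triplet regularizer (weight is assumed)."""
    return cross_entropy(logits, label) + weight * triplet_loss(
        anchor, positive, negative, margin
    )


if __name__ == "__main__":
    # Hypothetical 2-D embeddings and 2-class logits.
    print(joint_loss([2.0, 0.5], 0, [0.0, 0.0], [0.0, 1.0], [1.0, 0.0]))
```

The triplet term vanishes once the negative is farther from the anchor than the positive by at least the margin, so it only regularizes embeddings that are not yet well separated.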