Histopathological cancer diagnosis is based on visual examination of stained tissue slides. Hematoxylin and eosin (H\&E) is a standard stain routinely employed worldwide. It is easy to acquire and cost-effective, but cells and tissue components show low contrast with varying tones of dark blue and pink, which makes visual assessment, digital image analysis, and quantification difficult. These limitations can be overcome by IHC staining of target proteins on the tissue slide. IHC provides selective, high-contrast imaging of cells and tissue components, but its use is largely limited by significantly more complex laboratory processing and high cost. We propose a conditional CycleGAN (cCGAN) network to transform H\&E-stained images into IHC-stained images, facilitating virtual IHC staining on the same slide. This data-driven method requires only a limited amount of labelled data but generates pixel-level segmentation results. The proposed cCGAN model improves the original network \cite{zhu_unpaired_2017} by adding category conditions and introducing two structural loss functions, which realize multi-subdomain translation and improve translation accuracy. Experiments demonstrate that the proposed model outperforms the original method in unpaired image translation with multiple subdomains. We also explore the potential of applying unpaired image-to-image translation to other histology image tasks with different staining techniques.
http://arxiv.org/abs/1901.04059
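For reference, the cycle-consistency objective of the original CycleGAN \cite{zhu_unpaired_2017}, which the cCGAN extends with a category condition $c$ (notation is ours, not taken from the abstract):

\[
\mathcal{L}_{\mathrm{cyc}}(G, F) = \mathbb{E}_{x}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y}\big[\lVert G(F(y)) - y \rVert_1\big],
\]

with the conditional generators taking the form $G(x, c)$ and $F(y, c)$ so that a single model covers multiple staining subdomains.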
In this work, we report the set-up and results of the Liver Tumor Segmentation Benchmark (LITS), organized in conjunction with the IEEE International Symposium on Biomedical Imaging (ISBI) 2016 and the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2017. Twenty-four valid state-of-the-art liver and liver tumor segmentation algorithms were applied to a set of 131 computed tomography (CT) volumes with different types of tumor contrast (hyper-/hypo-intense), tissue abnormalities (e.g., after metastasectomy), and varying lesion sizes and counts. The submitted algorithms were tested on 70 undisclosed volumes. The dataset was created in collaboration with seven hospitals and research institutions and manually reviewed by three independent radiologists. We found that no single algorithm performed best for both liver and tumors. The best liver segmentation algorithm achieved a Dice score of 0.96 (MICCAI), whereas for tumor segmentation the best algorithms achieved Dice scores of 0.67 (ISBI) and 0.70 (MICCAI). The LITS image data and manual annotations continue to be publicly available through an online evaluation system as an ongoing benchmarking resource.
http://arxiv.org/abs/1901.04056
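As a reminder of the ranking metric, here is the standard Dice coefficient; this sketch is ours, not the benchmark's evaluation code:

```python
import numpy as np

def dice_score(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice coefficient between two binary segmentation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, truth).sum() / denom
```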
We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images. Key to our approach is the generation and scoring of a number of pose proposals per image, which allows us to predict 2D and 3D poses of multiple people simultaneously. Hence, our approach does not require an approximate localization of the humans for initialization. Our Localization-Classification-Regression architecture, named LCR-Net, contains 3 main components: 1) the pose proposal generator that suggests candidate poses at different locations in the image; 2) a classifier that scores the different pose proposals; and 3) a regressor that refines pose proposals both in 2D and 3D. All three stages share the convolutional feature layers and are trained jointly. The final pose estimation is obtained by integrating over neighboring pose hypotheses, which is shown to improve over a standard non-maximum suppression algorithm. Our method recovers full-body 2D and 3D poses, hallucinating plausible body parts when the persons are partially occluded or truncated by the image boundary. Our approach significantly outperforms the state of the art in 3D pose estimation on Human3.6M, a controlled environment. Moreover, it shows promising results on real images for both single- and multi-person subsets of the MPII 2D pose benchmark and demonstrates satisfactory 3D pose results even for multi-person images.
http://arxiv.org/abs/1803.00455
Generating accurate and reliable sales forecasts is crucial in the E-commerce business. The current state-of-the-art techniques are typically univariate methods, which produce forecasts considering only the historical sales data of a single product. However, in a situation where large quantities of related time series are available, conditioning the forecast of an individual time series on the past behaviour of similar, related time series can be beneficial. Given that the product assortment hierarchy in an E-commerce platform contains large numbers of related products, in which the sales demand patterns can be correlated, we attempt to incorporate this cross-series information in a unified model. We achieve this by globally training a Long Short-Term Memory network (LSTM) that exploits the nonlinear demand relationships available in an E-commerce product assortment hierarchy. Aside from the forecasting engine, we propose a systematic pre-processing framework to overcome the challenges in an E-commerce setting. We also introduce several product grouping strategies to supplement the LSTM learning schemes in situations where sales patterns in a product portfolio are disparate. We empirically evaluate the proposed forecasting framework on a real-world online marketplace dataset from Walmart.com. Our method achieves competitive results on category-level and super-departmental-level datasets, outperforming state-of-the-art techniques.
https://arxiv.org/abs/1901.04028
Many challenging image processing tasks can be described by an ill-posed linear inverse problem: deblurring, deconvolution, inpainting, compressed sensing, and superresolution all lie in this framework. Traditional inverse problem solvers minimize a cost function consisting of a data-fit term, which measures how well an image matches the observations, and a regularizer, which reflects prior knowledge and promotes images with desirable properties like smoothness. Recent advances in machine learning and image processing have illustrated that it is often possible to learn a regularizer from training data that can outperform more traditional regularizers. We present an end-to-end, data-driven method of solving inverse problems inspired by the Neumann series, which we call a Neumann network. Rather than unroll an iterative optimization algorithm, we truncate a Neumann series which directly solves the linear inverse problem with a data-driven nonlinear regularizer. The Neumann network architecture outperforms traditional inverse problem solution methods, model-free deep learning approaches, and state-of-the-art unrolled iterative methods on standard datasets. Finally, when the images belong to a union of subspaces and under appropriate assumptions on the forward model, we prove there exists a Neumann network configuration that well-approximates the optimal oracle estimator for the inverse problem and demonstrate empirically that the trained Neumann network has the form predicted by theory.
http://arxiv.org/abs/1901.03707
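As background (standard linear algebra, not the paper's exact architecture): for $A x = y$ with step size $\eta$ chosen so that $\lVert I - \eta A^\top A \rVert < 1$, the Neumann series yields

\[
x^{\ast} = (A^\top A)^{-1} A^\top y = \eta \sum_{j=0}^{\infty} \left(I - \eta A^\top A\right)^{j} A^\top y,
\]

and a Neumann network truncates this sum after a fixed number of terms while replacing part of each summand with a learned nonlinear regularizer.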
In the authors’ opinion, anomaly detection systems (ADS) are the most promising direction in attack detection, because such systems can detect, among others, unknown (zero-day) attacks. To detect anomalies, the authors propose to use machine synesthesia. In this case, machine synesthesia is understood as an interface that allows image classification algorithms to be applied to the problem of detecting network anomalies, making it possible to use non-specialized image detection methods that have recently been widely and actively developed. The proposed approach is to “project” the network traffic data into an image. The experimental results show that the proposed anomaly detection method performs well in detecting attacks: on a large sample, the value of the composite efficiency indicator reaches 97%.
http://arxiv.org/abs/1901.04017
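The abstract does not specify the projection; one minimal way to “project” a traffic byte stream into an image, shown purely as an illustration of the idea (our assumption, not the authors' method), is to map payload bytes to pixel intensities:

```python
import numpy as np

def packets_to_image(payload: bytes, side: int = 32) -> np.ndarray:
    """Project a raw traffic byte stream onto a square grayscale image.

    Hypothetical illustration: bytes become pixel intensities,
    truncated or zero-padded to side*side values.
    """
    buf = np.frombuffer(payload[: side * side], dtype=np.uint8)
    buf = np.pad(buf, (0, side * side - buf.size))
    return buf.reshape(side, side)
```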
Our middleware approach, Context-Oriented Software Middleware (COSM), supports context-dependent software with self-adaptability and dependability in a mobile computing environment. The COSM-middleware is a generic and platform-independent adaptation engine that performs runtime composition of the software’s context-dependent behaviours based on the execution contexts. Our middleware distinguishes between the context-dependent and context-independent functionality of software systems. This enables the COSM-middleware to adapt the application behaviour by composing a set of context-oriented components that implement the context-dependent functionality of the software. Accordingly, software dependability is achieved by considering the functionality of the COSM-middleware and the adaptation impact/costs. The COSM-middleware uses a dynamic policy-based engine to evaluate the adaptation outputs and verify the fitness of the adaptation output with the application’s objectives, goals, and architecture quality attributes. These capabilities are demonstrated through an empirical evaluation of a case-study implementation.
http://arxiv.org/abs/1901.04011
Deep generative models have shown great promise when it comes to synthesising novel images. While they can generate images that look convincing at a higher level, generating fine-grained details remains a challenge. In order to foster research on more powerful generative approaches, this paper proposes a novel task: generative modelling of 2D tree skeletons. Trees are an interesting shape class because they exhibit complexity and variations that are well suited to measuring the ability of a generative model to generate detailed structures. We propose a new dataset for this task and demonstrate that state-of-the-art generative models fail to synthesise realistic images on our benchmark, even though they perform well on current datasets like MNIST digits. Motivated by these results, we propose a novel network architecture that combines a variational autoencoder using recurrent neural networks with a convolutional discriminator. The network, error metrics, and training procedure are adapted to the task of fine-grained sketching. Through quantitative and perceptual experiments, we show that our model outperforms previous work and that our dataset is a valuable benchmark for generative models. We will make our dataset publicly available.
http://arxiv.org/abs/1901.03991
Deep neural models are vulnerable to adversarial perturbations in classification. Many attack methods generate adversarial examples with large pixel modifications and low cosine similarity to the original images. In this paper, we propose an adversarial method that generates perturbations based on the root-mean-square (RMS) gradient: the perturbation size is formulated at the RMS level, and the gradient is updated directionally. Because gradients are updated with an adaptive, RMS-scaled stride, our method maps an original image directly to its corresponding adversarial image, which gives the generated adversarial examples good transferability. We compare our method against several traditional perturbation-generation approaches for image classification. Experimental results show that our approach works well, outperforming recent techniques at inducing misclassification with only slight pixel modifications, and fools deep network models with excellent efficiency.
http://arxiv.org/abs/1901.03706
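A sketch of our reading of the attack, assuming an RMSprop-style update and images scaled to [0, 1] (illustrative only, not the authors' code):

```python
import torch

def rms_gradient_attack(model, x, y, steps=10, lr=0.01, eps=1e-8, decay=0.9):
    """Iteratively perturb x using gradients scaled by their running RMS,
    so the step size adapts per pixel (our interpretation of the abstract)."""
    x_adv = x.clone().detach().requires_grad_(True)
    sq_avg = torch.zeros_like(x)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        sq_avg = decay * sq_avg + (1 - decay) * grad.pow(2)
        step = lr * grad / (sq_avg.sqrt() + eps)   # RMS-scaled ascent step
        # ascend the loss, keep pixels in the assumed [0, 1] range
        x_adv = (x_adv + step).clamp(0, 1).detach().requires_grad_(True)
    return x_adv.detach()
```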
Visualizing perceptual content by analyzing human functional magnetic resonance imaging (fMRI) has been an active research area. However, due to its high dimensionality, complex dimensional structure, and the small number of samples available, reconstructing realistic images from fMRI remains challenging. Recently, with the development of convolutional neural networks (CNNs) and generative adversarial networks (GANs), mapping multi-voxel fMRI data to complex, realistic images has become possible. In this paper, we propose a model, DCNN-GAN, that combines a reconstruction network and a GAN. We utilize the CNN for hierarchical feature extraction and the DCNN-GAN to reconstruct more realistic images. Extensive experiments have been conducted, showing that our method outperforms previous works in terms of reconstruction quality and computational cost.
http://arxiv.org/abs/1901.07368
We introduce a generative adversarial network (GAN) model to simulate the 3-dimensional Lagrangian motion of particles trapped in the recirculation zone of a buoyancy-opposed flame. The GAN model comprises a stochastic recurrent neural network, serving as a generator, and a convolutional neural network, serving as a discriminator. Adversarial training was performed to the point where the best-trained discriminator failed to distinguish the ground truth from the trajectories produced by the best-trained generator. The model performance was then benchmarked against a statistical analysis performed on both the simulated trajectories and the ground truth, with regard to accuracy and generalization criteria.
http://arxiv.org/abs/1901.03960
Replacing the background and simultaneously adjusting foreground objects is a challenging task in image editing. Current techniques for generating such images rely heavily on user interaction with image-editing software, which is a tedious job for professional retouchers. To reduce their workload, some exciting progress has been made on generating images with a given background. However, these models can neither adjust the position and scale of the foreground objects nor guarantee semantic consistency between foreground and background. To overcome these limitations, we propose a framework, ART (Auto-Retoucher), to generate images with sufficient semantic and spatial consistency. Images are first processed by semantic matting and scene parsing modules; then a multi-task verifier model gives two confidence scores for the current background and position setting. We demonstrate that our jointly optimized verifier model successfully improves visual consistency, and our ART framework performs well on images with human bodies as foregrounds.
http://arxiv.org/abs/1901.03954
Most computer vision and computational photography systems are based on visible light, which is a small fraction of the electromagnetic (EM) spectrum. In recent years, radio frequency (RF) hardware has become more widely available; for example, many cars are equipped with a RADAR, and almost every home has a WiFi device. In the context of imaging, the RF spectrum holds many advantages compared to visible light systems. In particular, in this regime, EM energy effectively interacts with matter in different ways. This property allows for many novel applications, such as privacy-preserving computer vision and imaging through materials that absorb or scatter visible light, such as walls. Here, we expand many of the concepts in computational photography from visible light to RF cameras. The main limitation of imaging with RF is the large wavelength, which limits the imaging resolution compared to visible light. However, the output of RF cameras is usually processed by computer vision and perception algorithms, which would benefit from multi-modal sensing of the environment and from sensing in situations in which visible light systems fail. To bridge the gap between computational photography and RF imaging, we extend the concept of the light field to RF. This work paves the way for novel computational sensing systems with RF.
http://arxiv.org/abs/1901.03953
Single Image Super Resolution (SISR) is a well-researched problem with broad commercial relevance. However, most of the SISR literature focuses on small-size images under 500px, whereas business needs can mandate the generation of very high resolution images. At Expedia Group, we were tasked with generating images of at least 2000px for display on the website, four times greater than the sizes typically reported in the literature. This requirement poses a challenge that state-of-the-art models, validated on small images, have not been proven to handle. In this paper, we investigate solutions to the problem of generating high-quality images for large-scale super resolution in a commercial setting. We find that training a generative adversarial network (GAN) with attention from scratch using a large-scale lodging image data set generates images with high PSNR and SSIM scores. We describe a novel attentional SISR model for large-scale images, A-SRGAN, that uses a Flexible Self Attention layer to enable processing of large-scale images. We also describe a distributed algorithm which speeds up training by around a factor of five.
https://arxiv.org/abs/1812.04821
An image retrieval method based on convolutional neural networks and dimensionality reduction is proposed in this paper. A convolutional neural network is used to extract high-level image features; because the extracted features are high-dimensional and strongly correlated, multilinear principal component analysis is used to reduce their dimensionality. The reduced features are then binary hash coded for fast image retrieval. Experiments show that the proposed method achieves better retrieval performance than a retrieval method based on ordinary principal component analysis on e-commerce image datasets.
http://arxiv.org/abs/1901.03924
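A minimal sketch of the pipeline, with ordinary PCA standing in for multilinear PCA (an assumption, since scikit-learn ships no MPCA) and median-threshold binarization for the hash coding:

```python
import numpy as np
from sklearn.decomposition import PCA

def build_hash_codes(features: np.ndarray, n_components: int = 64):
    """Reduce CNN features, then binarize each dimension around its median."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(features)           # (n_images, n_components)
    thresholds = np.median(reduced, axis=0)
    codes = (reduced > thresholds).astype(np.uint8) # binary hash codes
    return pca, thresholds, codes

def retrieve(query_code: np.ndarray, codes: np.ndarray, k: int = 10):
    """Rank the database by Hamming distance to the query code."""
    hamming = np.count_nonzero(codes != query_code, axis=1)
    return np.argsort(hamming)[:k]                  # indices of k nearest images
```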
Prevailing fingerprint recognition systems are vulnerable to spoof attacks. To mitigate these attacks, automated spoof detectors are trained to distinguish a set of live or bona fide fingerprints from a set of known spoof fingerprints. However, spoof detectors remain vulnerable when exposed to attacks from spoofs made with materials not seen during training of the detector. To alleviate this shortcoming, we approach spoof detection as a one-class classification problem. The goal is to train a spoof detector on only the live fingerprints such that once the concept of “live” has been learned, spoofs of any material can be rejected. We accomplish this by training multiple generative adversarial networks (GANs) on live fingerprint images acquired with the open source, dual-camera, 1900 ppi RaspiReader fingerprint reader. Our experimental results, conducted on 5.5K spoof images (from 12 materials) and 11.8K live images, show that the proposed approach improves cross-material spoof detection performance over state-of-the-art one-class and binary-class spoof detectors: TDR = 84.4% vs. TDR = 40.3% (both at FDR = 0.2%).
http://arxiv.org/abs/1901.03918
The automated recognition of music genres from audio information is a challenging problem, as genre labels are subjective and noisy. Artist labels are less subjective and less noisy, while certain artists may relate more strongly to certain genres. At the same time, at prediction time, it is not guaranteed that artist labels are available for a given audio segment. Therefore, in this work, we propose to apply the transfer learning framework, learning artist-related information which will be used at inference time for genre classification. We consider different types of artist-related information, expressed through artist group factors, which will allow for more efficient learning and stronger robustness to potential label noise. Furthermore, we investigate how to achieve the highest validation accuracy on the given FMA dataset, by experimenting with various kinds of transfer methods, including single-task transfer, multi-task transfer and finally multi-task learning.
http://arxiv.org/abs/1805.02043
Feature detectors and descriptors are key low-level vision tools that many higher-level tasks build on. Unfortunately these fail in the presence of challenging light transport effects including partial occlusion, low contrast, and reflective or refractive surfaces. Building on spatio-angular imaging modalities offered by emerging light field cameras, we introduce a new and computationally efficient 4D light field feature detector and descriptor: LiFF. LiFF is scale invariant and utilizes the full 4D light field to detect features that are robust to changes in perspective. This is particularly useful for structure from motion (SfM) and other tasks that match features across viewpoints of a scene. We demonstrate significantly improved 3D reconstructions via SfM when using LiFF instead of the leading 2D or 4D features, and show that LiFF runs an order of magnitude faster than the leading 4D approach. Finally, LiFF inherently estimates depth for each feature, opening a path for future research in light field-based SfM.
http://arxiv.org/abs/1901.03916
Photorealism is a complex concept that cannot easily be formulated mathematically. Deep Photo Style Transfer is an attempt to transfer the style of a reference image to a content image while preserving its photorealism. This is achieved by introducing a constraint that prevents distortions in the content image and by applying the style transfer independently for semantically different parts of the images. In addition, an automated segmentation process is presented that consists of a neural-network-based segmentation method followed by a semantic grouping step. To further improve the results, a measure for image aesthetics is used and elaborated. If the content and the style image are sufficiently similar, the resulting images look very realistic. With the automation of the image segmentation, the pipeline becomes completely independent of any user interaction, which allows for new applications.
http://arxiv.org/abs/1901.03915
Convolutional Neural Networks (CNNs) are successfully used for various visual perception tasks, including bounding-box object detection, semantic segmentation, optical flow, depth estimation, and visual SLAM. Generally, these tasks are explored and modeled independently. In this paper, we present a joint multi-task network design for learning object detection and semantic segmentation simultaneously. The main motivation is to achieve real-time performance on a low-power embedded SoC by sharing the encoder between both tasks. We construct an efficient architecture using a small ResNet10-like encoder which is shared by both decoders. Object detection uses a YOLO v2-like decoder and semantic segmentation uses an FCN8-like decoder. We evaluate the proposed network on two public datasets (KITTI, Cityscapes) and on our private fisheye camera dataset, and demonstrate that the joint network provides the same accuracy as separate networks. We further optimize the network to achieve 30 fps for 1280x384-resolution images.
http://arxiv.org/abs/1901.03912
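A toy sketch of the shared-encoder design (layer sizes are ours; the paper uses a ResNet10-like encoder with YOLO v2 and FCN8 style heads):

```python
import torch.nn as nn

class JointDetSeg(nn.Module):
    """One small encoder feeds both a detection head and a segmentation head,
    so the expensive feature computation is shared between the two tasks."""
    def __init__(self, n_classes=10, n_anchors=5):
        super().__init__()
        self.encoder = nn.Sequential(             # shared, downsamples by 4
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # YOLO-style head: per-cell box coordinates + objectness
        self.det_head = nn.Conv2d(64, n_anchors * 5, 1)
        # FCN-style head: per-pixel class logits, upsampled to input size
        self.seg_head = nn.Sequential(
            nn.Conv2d(64, n_classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )

    def forward(self, x):
        feats = self.encoder(x)                   # computed once, used twice
        return self.det_head(feats), self.seg_head(feats)
```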
Scene reconstruction from unorganized RGB images is an important task in many computer vision applications. Multi-view Stereo (MVS) is a common solution in photogrammetry applications for the dense reconstruction of a static scene. The static scene assumption, however, limits the general applicability of MVS algorithms, as many day-to-day scenes undergo non-rigid motion, e.g., clothes, faces, or human bodies. In this paper, we open up a new challenging direction: dense 3D reconstruction of scenes with non-rigid changes observed from arbitrary, sparse, and wide-baseline views. We formulate the problem as a joint optimization of deformation and depth estimation, using deformation graphs as the underlying representation. We propose a new sparse 3D to 2D matching technique, together with a dense patch-match evaluation scheme to estimate deformation and depth with photometric consistency. We show that creating a dense 4D structure from a few RGB images with non-rigid changes is possible, and demonstrate that our method can be used to interpolate novel deformed scenes from various combinations of these deformation estimates derived from the sparse views.
http://arxiv.org/abs/1901.03910
Speech Acts (SAs) are one of the important areas of pragmatics; they give us a better understanding of people's state of mind and convey an intended language function. Knowledge of the SA of a text can be helpful in analyzing that text in natural language processing applications. This study presents a dictionary-based statistical technique for Persian SA recognition. The proposed technique classifies a text into seven SA classes based on four kinds of criteria: lexical, syntactic, semantic, and surface features. WordNet is utilized as the tool for extracting synonyms and enriching the feature dictionary. To evaluate the proposed technique, we used four classification methods: Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB), and K-Nearest Neighbors (KNN). The experimental results demonstrate that the proposed method, using RF and SVM as the best classifiers, achieves state-of-the-art performance with an accuracy of 0.95 for classification of Persian SAs. Our original motivation for this work is to introduce an application of SA recognition to social media content, especially the common SAs in rumors. The proposed system is therefore used to determine the common SAs in rumors. The results show that Persian rumors are often expressed in three SA classes, namely narrative, question, and threat, and in some cases with the request SA.
http://arxiv.org/abs/1901.03904
This paper presents a new large-scale dataset for recognition and temporal localization of human actions collected from Web videos. We refer to it as HACS (Human Action Clips and Segments). We leverage consensus among visual classifiers to automatically mine candidate short clips from unlabeled videos, which are subsequently validated via manual verification. Annotated clips include both positive examples and hard negatives. The resulting dataset is dubbed HACS Clips. Through a separate process we also collect annotations defining action segment boundaries. This resulting dataset is called HACS Segments. Overall, HACS Clips consists of 1.55M annotated clips sampled from 504K untrimmed videos, and HACS Segments contains 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories. HACS Clips contains more labeled examples than any existing video benchmark. This renders our dataset an excellent source for spatiotemporal feature learning, as evidenced by our transfer learning experiments on three different target datasets where HACS Clips outperforms Kinetics and Sports1M as a pretraining benchmark, and yields the best published results to date. On HACS Segments, we evaluate state-of-the-art methods of action proposal generation and action localization, and highlight the new challenges posed by our dense and fine-grained temporal annotations.
http://arxiv.org/abs/1712.09374
This paper proposes a novel adaptive guidance system developed using reinforcement meta-learning with a recurrent policy and value function approximator. The use of recurrent network layers allows the deployed policy to adapt in real time to environmental forces acting on the agent. We compare the performance of the DR/DV guidance law, an RL agent with a non-recurrent policy, and an RL agent with a recurrent policy in four difficult tasks with unknown but highly variable dynamics. These tasks include a safe Mars landing with random engine failure and a landing on an asteroid with unknown environmental dynamics. We also demonstrate the ability of a recurrent policy to navigate using only Doppler radar altimeter returns, thus integrating guidance and navigation.
http://arxiv.org/abs/1901.04473
The recent interest in using deep learning for seismic interpretation tasks, such as facies classification, has been facing a significant obstacle, namely the absence of large publicly available annotated datasets for training and testing models. As a result, researchers have often resorted to annotating their own training and testing data. However, different researchers may annotate different classes, or use different train and test splits. In addition, it is common for papers that apply deep learning to facies classification to omit quantitative results and instead rely solely on visual inspection of the results. All of these practices have led to subjective results and have greatly hindered the ability to compare different machine learning models against each other and understand the advantages and disadvantages of each approach. To address these issues, we open-source an accurate 3D geological model of the Netherlands F3 Block. This geological model is based on both well log data and 3D seismic data and is grounded in careful study of the geology of the region. Furthermore, we propose two baseline models for facies classification based on deconvolution networks and make their code publicly available. Finally, we propose a scheme for evaluating different models on this dataset, and we share the results of our baseline models. In addition to making the dataset and the code publicly available, this work can help advance research in this area and create an objective benchmark for comparing the results of different machine learning approaches to facies classification.
http://arxiv.org/abs/1901.07659
Image steganography is a procedure for hiding messages inside pictures. While other techniques such as cryptography aim to prevent adversaries from reading the secret message, steganography aims to hide the presence of the message itself. In this paper, we propose a novel technique based on generative adversarial networks for hiding arbitrary binary data in images. We show that our approach achieves state-of-the-art payloads of 4.4 bits per pixel, evades detection by steganalysis tools, and is effective on images from multiple datasets. To enable fair comparisons, we have released an open source library that is available online at https://github.com/DAI-Lab/SteganoGAN.
http://arxiv.org/abs/1901.03892
Deep reinforcement learning algorithms have recently been used to train multiple interacting agents in a centralised manner whilst keeping their execution decentralised. When the agents can only acquire partial observations and are faced with a task requiring coordination and synchronisation skills, inter-agent communication plays an essential role. In this work, we propose a framework for multi-agent training using deep deterministic policy gradients that enables the concurrent, end-to-end learning of an explicit communication protocol through a memory device. During training, the agents learn to perform read and write operations enabling them to infer a shared representation of the world. We empirically demonstrate that concurrent learning of the communication device and individual policies can improve inter-agent coordination and performance, and illustrate how different communication patterns can emerge for different tasks.
http://arxiv.org/abs/1901.03887
We propose Stochastic Neural Architecture Search (SNAS), an economical end-to-end solution to Neural Architecture Search (NAS) that trains neural operation parameters and architecture distribution parameters in the same round of back-propagation, while maintaining the completeness and differentiability of the NAS pipeline. In this work, NAS is reformulated as an optimization problem on the parameters of a joint distribution over the search space in a cell. To leverage the gradient information in a generic differentiable loss for architecture search, a novel search gradient is proposed. We prove that this search gradient optimizes the same objective as reinforcement-learning-based NAS, but assigns credit to structural decisions more efficiently. This credit assignment is further augmented with a locally decomposable reward to enforce a resource-efficiency constraint. In experiments on CIFAR-10, SNAS takes fewer epochs to find a cell architecture with state-of-the-art accuracy than non-differentiable evolution-based and reinforcement-learning-based NAS, and the found architecture is also transferable to ImageNet. It is also shown that child networks of SNAS can maintain their validation accuracy during the search, whereas attention-based NAS requires parameter retraining to compete, exhibiting the potential for efficient NAS on big datasets.
https://arxiv.org/abs/1812.09926
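The general idea of differentiable architecture sampling can be illustrated with a Gumbel-softmax relaxation (our sketch; SNAS's exact parameterization may differ). Each edge in a cell holds logits over candidate operations, and a soft one-hot sample still passes gradients back to those logits:

```python
import torch
import torch.nn.functional as F

arch_logits = torch.zeros(4, requires_grad=True)   # 4 candidate ops on one edge
weights = F.gumbel_softmax(arch_logits, tau=1.0, hard=False)

candidate_outputs = torch.randn(4, 8)              # stand-in op outputs
edge_output = (weights.unsqueeze(1) * candidate_outputs).sum(dim=0)
edge_output.sum().backward()                       # gradients reach arch_logits
print(arch_logits.grad)
```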
Direct design of a robot’s rendered dynamics, such as in impedance control, is now a well-established control mode in uncertain environments. When the physical interaction port variables are not measured directly, dynamic and kinematic models are required to relate the measured variables to the interaction port variables. A typical example is serial manipulators with joint torque sensors, where the interaction occurs at the end-effector. As interactive robots perform increasingly complex tasks, they will be intermittently coupled with additional dynamic elements such as tools, grippers, or workpieces, some of which should be compensated and brought to the robot side of the interaction port, making the inverse dynamics multimodal. Furthermore, there may also be unavoidable and unmeasured external input when the desired system cannot be totally isolated. Towards semi-autonomous robots capable of handling such applications, a multimodal Gaussian process regression approach to manipulator dynamic modelling is developed. A sampling-based approach clusters different dynamic modes from unlabelled data, also allowing the separation of perturbed data with significant, irregular external input. The passivity of the overall approach is shown analytically, and experiments examine the performance and safety of this approach on a test actuator.
http://arxiv.org/abs/1901.03872
This paper is concerned with open-domain question answering (OpenQA). Recently, some works have viewed this problem as a reading comprehension (RC) task and directly applied successful RC models to it. However, the performance of such models is not as good as it is on the RC task. In our opinion, the RC perspective ignores three characteristics of the OpenQA task: 1) many paragraphs without the answer span are included in the data collection; 2) multiple answer spans may exist within one given paragraph; and 3) the end position of an answer span depends on its start position. In this paper, we first propose a new probabilistic formulation of OpenQA based on a three-level hierarchical structure, i.e., the question level, the paragraph level, and the answer span level. Then a Hierarchical Answer Spans Model (HAS-QA) is designed to capture each probability. HAS-QA has the ability to tackle the above three problems, and experiments on public OpenQA datasets show that it significantly outperforms traditional RC baselines and recent OpenQA baselines.
http://arxiv.org/abs/1901.03866
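One plausible reading of the three-level factorization, in our own notation (the paper's exact formulation may differ): marginalize over retrieved paragraphs and over the possibly multiple answer spans within each, with the end position conditioned on the start,

\[
P(a \mid q) = \sum_{p \in \mathcal{P}(q)} P(p \mid q) \sum_{(s, e) \in \mathrm{spans}(a, p)} P(s \mid p, q)\, P(e \mid s, p, q).
\]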
We present a new approach to the problem of estimating 3D room layout from a single panoramic image. We represent room layout as three 1D vectors that encode, at each image column, the boundary positions of floor-wall and ceiling-wall, and the existence of a wall-wall boundary. The proposed network architecture, HorizonNet, trained for predicting 1D layout, outperforms previous state-of-the-art approaches. The designed post-processing procedure for recovering 3D room layouts from 1D predictions can automatically infer the room shape with low computational cost: it takes less than 20 ms for a panoramic image, whereas prior works might need dozens of seconds. We also propose Pano Stretch Data Augmentation, which can diversify panorama data and be applied to other panorama-related learning tasks. Due to the limited training data available for non-cuboid layouts, we re-annotate 65 general layout examples from the current dataset for fine-tuning and qualitatively show the ability of our approach to estimate general layouts.
http://arxiv.org/abs/1901.03861
Continuous Speech Keyword Spotting (CSKS) is the problem of spotting keywords in recorded conversations when only a small number of instances of the keywords are available in training data. Unlike the more common keyword spotting, where an algorithm needs to detect lone keywords or short phrases like “Alexa”, “Cortana”, “Hi Alexa!”, “Whatsup Octavia?” etc. in speech, CSKS needs to filter out embedded words from a continuous flow of speech, i.e., spot “Anna” and “github” in “I know a developer named Anna who can look into this github issue.” Apart from the issue of limited training data availability, CSKS is an extremely imbalanced classification problem. We address the limitations of simple keyword spotting baselines for both aforementioned challenges by using a novel combination of loss functions (prototypical networks’ loss and metric loss) and transfer learning. Our method improves the F1 score by over 10%.
http://arxiv.org/abs/1901.03860
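A minimal sketch of the prototypical-network loss referenced above (the standard formulation, which we assume here; the encoder producing the embeddings is not shown):

```python
import torch
import torch.nn.functional as F

def prototypical_loss(support, support_labels, query, query_labels, n_classes):
    """support/query: (n, d) embeddings. Assumes every class appears
    in the support set. Nearer prototype => higher logit."""
    prototypes = torch.stack([
        support[support_labels == c].mean(dim=0) for c in range(n_classes)
    ])                                              # (n_classes, d)
    dists = torch.cdist(query, prototypes)          # (n_query, n_classes)
    return F.cross_entropy(-dists, query_labels)
```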
Existing approaches to automatic summarization assume that a length limit for the summary is given, and view content selection as an optimization problem to maximize informativeness and minimize redundancy within this budget. This framework ignores the fact that human-written summaries have rich internal structure which can be exploited to train a summarization system. We present NEXTSUM, a novel approach to summarization based on a model that predicts the next sentence to include in the summary using not only the source article, but also the summary produced so far. We show that such a model successfully captures summary-specific discourse moves, and leads to better content selection performance, in addition to automatically predicting how long the target summary should be. We perform experiments on the New York Times Annotated Corpus of summaries, where NEXTSUM outperforms lead and content-model summarization baselines by significant margins. We also show that the lengths of summaries produced by our system correlate with the lengths of the human-written gold standards.
http://arxiv.org/abs/1901.03859
The aim of this study was to develop a digital histopathology system for identifying odontogenic keratocysts in hematoxylin- and eosin-stained tissue specimens of jaw cysts. Approximately 5000 microscopy images with 400$\times$ magnification were obtained from 199 odontogenic keratocysts, 208 dentigerous cysts, and 55 radicular cysts. A proportion of these images were used to make training patches, which were annotated as belonging to one of the following three classes: keratocysts, non-keratocysts, and stroma. The patches for the cysts contained the complete lining epithelium, with the cyst cavity being present on the upper side. The convolutional neural network (CNN) VGG16 was finetuned to this dataset. The trained CNN could recognize the basal cell palisading pattern, which is the definitive criterion for diagnosing keratocysts. Some of the remaining images were scanned and analyzed by the trained CNN, whose output was then used to train another CNN for binary classification (keratocyst or not). The area under the receiver operating characteristics curve for the entire algorithm was 0.997 for the test dataset. Thus, the proposed patch classification strategy is usable for automated keratocyst diagnosis. However, further optimization must be performed to make it suitable for practical use.
http://arxiv.org/abs/1901.03857
In this paper, we extend the standard sequential belief propagation (BP) technique proposed in the tree-reweighted sequential method to fully connected CRF models with a geodesic distance affinity. The proposed method has been applied to the stereo matching problem. We also propose a new approach to the BP marginal solution, which we call one-view occlusion detection (OVOD). In contrast to the standard winner-takes-all (WTA) estimation, the proposed OVOD solution allows us to find occluded regions in the disparity map and simultaneously improve the matching result. As a result, we can perform only one energy minimization process and avoid the cost calculation for the second view and the left-right check procedure. We show that the OVOD approach considerably improves results for cost-augmentation and energy-minimization techniques in comparison with the standard one-view affinity space implementation. We apply our method to the Middlebury dataset and achieve state-of-the-art results, especially for the median, average, and mean-squared-error metrics.
http://arxiv.org/abs/1901.03852
Over the past few years, there has been astounding growth in the number of news channels as well as the amount of broadcast news video data. As a result, automated methods must be developed to effectively summarize and store this voluminous data. Format detection of news videos plays an important role in news video analysis. Our problem involves building a robust and versatile news format detector, which identifies the different band elements in a news frame. The probabilistic progressive Hough transform has been used for the detection of band edges. The detected bands are classified as natural images, computer-generated graphics (non-text), and text bands. A contrast-based text detector has been used to identify the text regions in news frames. Two classifiers have been trained and evaluated for labeling the detected bands as natural or artificial: a Support Vector Machine (SVM) classifier with an RBF kernel, and an Extreme Learning Machine (ELM) classifier. The classifiers have been trained on a dataset of 6000 images (3000 images of each class). The ELM classifier reports a balanced accuracy of 77.38%, while the SVM classifier outperforms it with a balanced accuracy of 96.5% using 10-fold cross-validation. Detected bands that were fragmented due to the presence of gradients in the image have been merged using a three-tier hierarchical reasoning model. The bands were detected with a Jaccard index of 0.8138 when compared to manually marked ground truth data. We also present an extensive literature review of previous work on news video format detection, element band classification, and associative reasoning.
http://arxiv.org/abs/1901.03842
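A small OpenCV sketch of band-edge detection with the probabilistic Hough transform (parameter values and the file name are illustrative, not the paper's):

```python
import cv2
import numpy as np

frame = cv2.imread("news_frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
edges = cv2.Canny(frame, 50, 150)
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                        minLineLength=frame.shape[1] // 2, maxLineGap=10)

# Keep near-horizontal segments as candidate band boundaries
horizontal = []
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        if abs(y1 - y2) <= 2:
            horizontal.append((x1, y1, x2, y2))
```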
Prediction accuracy and model explainability are the two most important objectives when developing machine learning algorithms to solve real-world problems. Neural networks are known to have good prediction performance but to lack sufficient model explainability. In this paper, we propose to enhance the explainability of neural networks through the following architecture constraints: a) sparse additive subnetworks; b) orthogonal projection pursuit; and c) smooth function approximation. This leads to a sparse, orthogonal, and smooth explainable neural network (SOSxNN). The multiple parameters in the SOSxNN model are simultaneously estimated by a modified mini-batch gradient descent algorithm, using the backpropagation technique to calculate the derivatives and the Cayley transform to preserve projection orthogonality. The hyperparameters controlling the sparsity and smoothness constraints are optimized by grid search. Through simulation studies, we compare the SOSxNN method to several benchmark methods, including the least absolute shrinkage and selection operator, support vector machines, random forests, and multi-layer perceptrons. It is shown that the proposed model keeps the flexibility of pursuing prediction accuracy while attaining improved interpretability, and can therefore be used as a promising surrogate model for complex model approximation. Finally, real data from the Lending Club are employed as a showcase of the SOSxNN application.
https://arxiv.org/abs/1901.03838
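The three constraints suggest a projection-pursuit form like the following (our notation, not taken from the abstract): an additive model over $k$ orthogonal projections,

\[
f(x) = \sum_{j=1}^{k} h_j\!\left(w_j^{\top} x\right) \quad \text{subject to} \quad W^{\top} W = I_k,
\]

with sparsity encouraged across the subnetworks and smoothness enforced on each ridge function $h_j$.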
Compared with other semantic segmentation tasks, portrait segmentation requires both higher precision and faster inference speed. However, this problem has not been well studied in previous works. In this paper, we propose a lightweight network architecture, called Boundary-Aware Network (BANet), which selectively extracts detail information in the boundary area to produce high-quality segmentation output at real-time speed (>25 FPS). In addition, we design a new loss function, called refine loss, which supervises the network with image-level gradient information. Our model is able to produce finer segmentation results, which have richer details than the annotations.
http://arxiv.org/abs/1901.03814
We propose nnstreamer, a software system that handles neural networks as filters of stream pipelines, applying the stream processing paradigm to neural network applications. A new trend accompanying the widespread adoption of deep neural network applications is on-device AI, i.e., processing neural networks directly on mobile devices or edge/IoT devices instead of cloud servers. Emerging privacy issues, data transmission costs, and operational costs signify the need for on-device AI, especially when a huge number of devices with real-time data processing are deployed. Nnstreamer efficiently handles neural networks with complex data stream pipelines on devices, improving the overall performance significantly with minimal effort. Besides, nnstreamer simplifies neural network pipeline implementations and allows reuse of off-the-shelf multimedia stream filters directly, thus reducing development costs significantly. Nnstreamer is already being deployed with a product to be released soon and is open source software applicable to a wide range of hardware architectures and software platforms.
https://arxiv.org/abs/1901.04985
Driven by recent computer vision and robotic applications, recovering 3D human poses has become increasingly important and has attracted growing interest. The task is quite challenging due to the diverse appearances, viewpoints, occlusions, and inherent geometric ambiguities in monocular images. Most existing methods focus on designing elaborate priors/constraints to directly regress 3D human poses from the corresponding 2D human pose-aware features or 2D pose predictions. However, due to insufficient 3D pose data for training and the domain gap between 2D and 3D space, these methods have limited scalability to all practical scenarios (e.g., outdoor scenes). To address this issue, this paper proposes a simple yet effective self-supervised correction mechanism to learn the intrinsic structures of human poses from abundant images. Specifically, the proposed mechanism involves two dual learning tasks, i.e., 2D-to-3D pose transformation and 3D-to-2D pose projection, which serve as a bridge between 3D and 2D human poses in a form of “free” self-supervision for accurate 3D human pose estimation. The 2D-to-3D pose transformation sequentially regresses intermediate 3D poses by transforming the pose representation from the 2D domain to the 3D domain under a sequence-dependent temporal context, while the 3D-to-2D pose projection refines the intermediate 3D poses by maintaining geometric consistency between the 2D projections of the 3D poses and the estimated 2D poses. We further apply our self-supervised correction mechanism to develop a 3D human pose machine, which jointly integrates 2D spatial relationships, temporal smoothness of predictions, and 3D geometric knowledge. Extensive evaluations demonstrate the superior performance and efficiency of our framework over the compared competing methods.
http://arxiv.org/abs/1901.03798
As the post-processing step for object detection, non-maximum suppression (GreedyNMS) has been widely used in most detectors for many years. It is efficient and accurate for sparse scenes, but suffers from an inevitable trade-off between precision and recall in crowded scenes. To overcome this drawback, we propose Pairwise-NMS to cure GreedyNMS. Specifically, a deep-learning-based pairwise-relationship network is learned to predict whether two overlapping proposal boxes contain two objects or zero/one object, which can handle multiple overlapping objects effectively. By coupling neatly with GreedyNMS without losing efficiency, consistent improvements have been achieved on heavily occluded datasets including MOT15, TUD-Crossing, and PETS. In addition, Pairwise-NMS can be integrated into any learning-based detector (both Faster-RCNN and DPM detectors are tested in this paper), thus building a bridge between GreedyNMS and end-to-end learning detectors.
http://arxiv.org/abs/1901.03796
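For context, a reference implementation of the GreedyNMS baseline that Pairwise-NMS modifies (the standard algorithm, assuming boxes in [x1, y1, x2, y2] format):

```python
import numpy as np

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Repeatedly keep the highest-scoring box and drop boxes whose IoU
    with it exceeds iou_thresh; in crowded scenes this suppression also
    removes true positives, which is the drawback Pairwise-NMS targets."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```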
Neural architecture search (NAS) automatically finds the best task-specific neural network topology, outperforming many manual architecture designs. However, it can be prohibitively expensive as the search requires training thousands of different networks, while each can last for hours. In this work, we propose the Graph HyperNetwork (GHN) to amortize the search cost: given an architecture, it directly generates the weights by running inference on a graph neural network. GHNs model the topology of an architecture and therefore can predict network performance more accurately than regular hypernetworks and premature early stopping. To perform NAS, we randomly sample architectures and use the validation accuracy of networks with GHN generated weights as the surrogate search signal. GHNs are fast – they can search nearly 10 times faster than other random search methods on CIFAR-10 and ImageNet. GHNs can be further extended to the anytime prediction setting, where they have found networks with better speed-accuracy tradeoff than the state-of-the-art manual designs.
https://arxiv.org/abs/1810.05749
Though quite challenging, leveraging large-scale unlabeled or partially labeled data in learning systems (e.g., model/classifier training) has attracted increasing attention due to its fundamental importance. To address this problem, many active learning (AL) methods have been proposed that employ up-to-date detectors to retrieve representative minority samples according to predefined confidence or uncertainty thresholds. However, these AL methods cause the detectors to ignore the remaining majority samples (i.e., those with low uncertainty or high prediction confidence). In this work, by developing a principled active sample mining (ASM) framework, we demonstrate that cost-effectively mining samples from these unlabeled majority data is key to training more powerful object detectors while minimizing user effort. Specifically, our ASM framework involves a switchable sample selection mechanism for determining whether an unlabeled sample should be manually annotated via AL or automatically pseudo-labeled via a novel self-learning process. The proposed process can be compatible with mini-batch based training (i.e., using a batch of unlabeled or partially labeled data as a one-time input) for object detection. In addition, a few samples with low-confidence predictions are selected and annotated via AL. Notably, our method is suitable for object categories that are not seen in the unlabeled data during the learning process. Extensive experiments clearly demonstrate that our ASM framework can achieve performance comparable to that of alternative methods but with significantly fewer annotations.
https://arxiv.org/abs/1807.00147
Question answering (QA) is an important natural language processing (NLP) task and has received much attention in academic research and industry communities. Existing QA studies assume that questions are raised by humans and answers are generated by machines. Nevertheless, in many real applications, machines are also required to determine human needs or perceive human states. In such scenarios, machines may proactively raise questions and humans supply answers. Subsequently, machines should attempt to understand the true meaning of these answers. This new QA approach is called reverse-QA (rQA) throughout this paper. In this work, the human answer understanding problem is investigated and solved by classifying the answers into predefined answer-label categories (e.g., True, False, Uncertain). To explore the relationships between questions and answers, we use the interactive attention network (IAN) model and propose an improved structure called semi-interactive attention network (Semi-IAN). Two Chinese data sets for rQA are compiled. We evaluate several conventional text classification models for comparison, and experimental results indicate the promising performance of our proposed models.
http://arxiv.org/abs/1901.03788
Geologic interpretation of large stacked or migrated seismic images can be a time-consuming task for seismic interpreters. Neural-network-based semantic segmentation provides fast and automatic interpretations, provided a sufficient number of example interpretations are available. Networks that map from image to image emerged recently as powerful tools for automatic segmentation, but standard implementations require fully interpreted examples. Generating training labels for large images manually is time-consuming. We introduce a partial loss function and labeling strategies such that networks can learn from partially interpreted seismic images. This strategy requires only a small number of annotated pixels per seismic image. Tests on seismic images and interpretation information from the Sea of Ireland show that we obtain high-quality predicted interpretations from a small number of large seismic images. The combination of a partial loss function, a multi-resolution network that explicitly takes small- and large-scale geological features into account, and new labeling strategies make neural networks a more practical tool for automatic seismic interpretation.
http://arxiv.org/abs/1901.03786
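The partial loss can be realized very simply; a sketch under the assumption that unlabeled pixels carry a sentinel label (our illustration; the paper's exact weighting may differ):

```python
import torch.nn.functional as F

IGNORE = -1  # sentinel for "pixel not interpreted"

def partial_cross_entropy(logits, labels):
    """logits: (B, n_facies, H, W); labels: (B, H, W) with IGNORE where
    unlabeled. Unannotated pixels contribute nothing to the loss."""
    return F.cross_entropy(logits, labels, ignore_index=IGNORE)
```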
In this paper, we propose a unified panoptic segmentation network (UPSNet) for tackling the newly proposed panoptic segmentation task. On top of a single backbone residual network, we first design a deformable convolution based semantic segmentation head and a Mask R-CNN style instance segmentation head which solve these two subtasks simultaneously. More importantly, we introduce a parameter-free panoptic head which solves the panoptic segmentation via pixel-wise classification. It first leverages the logits from the previous two heads and then innovatively expands the representation for enabling prediction of an extra unknown class which helps better resolve the conflicts between semantic and instance segmentation. Additionally, it handles the challenge caused by the varying number of instances and permits back propagation to the bottom modules in an end-to-end manner. Extensive experimental results on Cityscapes, COCO and our internal dataset demonstrate that our UPSNet achieves state-of-the-art performance with much faster inference.
http://arxiv.org/abs/1901.03784
Reconstruction of geometry from different input modes, such as images or point clouds, has been instrumental in the development of computer-aided design and computer graphics. Optimal implementations of these applications have traditionally involved the use of spline-based representations at their core. Most such methods attempt to solve optimization problems that minimize an output-target mismatch. However, these optimization techniques require an initialization that is close enough, as they are local methods by nature. We propose a deep learning architecture that adapts to perform spline fitting tasks, providing results complementary to the aforementioned traditional methods. We showcase the performance of our approach by reconstructing spline curves and surfaces from input images or point clouds.
http://arxiv.org/abs/1901.03781
In the last decade or so we have seen tremendous progress in Artificial Intelligence (AI). AI is now in the real world, powering applications that have a large practical impact. Most of it is based on modeling, i.e. machine learning of statistical models that make it possible to predict what the right decision might be in future situations. The next step for AI is machine creativity, i.e. tasks where the correct, or even good, solutions are not known, but need to be discovered. Methods for machine creativity have existed for decades. I believe we are now in a similar situation as deep learning was a few years ago: with the million-fold increase in computational power, those methods can now be used to scale up to creativity in real-world tasks. In particular, Evolutionary Computation is in a unique position to take advantage of that power, and become the next deep learning.
http://arxiv.org/abs/1901.03775
The study of human genes and diseases is very rewarding and can lead to improvements in healthcare, disease diagnostics and drug discovery. In this paper, we further our previous study on gene disease relationship specifically with the multifunctional genes. We investigate the multifunctional gene disease relationship based on the published molecular function annotations of genes from the Gene Ontology which is the most comprehensive source on gene functions.
https://arxiv.org/abs/1901.04847
We report FPGA implementation results for low-precision CNN convolution layers optimized for sparse and constant parameters. We describe techniques that amortize the cost of common-factor multiplication and automatically leverage dense hand-tuned LUT structures. We apply this method to corner-case residual blocks of a sparse ResNet50 model to assess achievable utilization and frequency, and demonstrate an effective performance of 131 and 23 TOP/chip for the corner-case blocks. The projected performance of a multi-chip persistent implementation of all ResNet50 convolution layers is 10k im/s/chip at batch size 2. This is 1.37x higher than the V100 GPU upper bound at the same batch size after normalizing for sparsity.
https://arxiv.org/abs/1901.04969