Because image sensor chips have a finite bandwidth with which to read out pixels, recording video typically requires a trade-off between frame rate and pixel count. Compressed sensing techniques can circumvent this trade-off by assuming that the image is compressible. Here, we propose using multiplexing optics to spatially compress the scene, enabling information about the whole scene to be sampled from a row of sensor pixels, which can be read off quickly via a rolling shutter CMOS sensor. Conveniently, such multiplexing can be achieved with a simple lensless, diffuser-based imaging system. Using sparse recovery methods, we are able to recover 140 video frames at over 4,500 frames per second, all from a single captured image with a rolling shutter sensor. Our proof-of-concept system uses easily-fabricated diffusers paired with an off-the-shelf sensor. The resulting prototype enables compressive encoding of high frame rate video into a single rolling shutter exposure, and exceeds the sampling-limited performance of an equivalent global shutter system for sufficiently sparse objects.
http://arxiv.org/abs/1905.13221
Over the past several years progress in designing better neural network architectures for visual recognition has been substantial. To help sustain this rate of progress, in this work we propose to reexamine the methodology for comparing network architectures. In particular, we introduce a new comparison paradigm of distribution estimates, in which network design spaces are compared by applying statistical techniques to populations of sampled models, while controlling for confounding factors like network complexity. Compared to current methodologies of comparing point and curve estimates of model families, distribution estimates paint a more complete picture of the entire design landscape. As a case study, we examine design spaces used in neural architecture search (NAS). We find significant statistical differences between recent NAS design space variants that have been largely overlooked. Furthermore, our analysis reveals that the design spaces for standard model families like ResNeXt can be comparable to the more complex ones used in recent NAS work. We hope these insights into distribution analysis will enable more robust progress toward discovering better networks for visual recognition.
http://arxiv.org/abs/1905.13214
Neural networks have successfully been applied to solving reasoning tasks, ranging from learning simple concepts like “close to”, to intricate questions whose reasoning procedures resemble algorithms. Empirically, not all network structures work equally well for reasoning. For example, Graph Neural Networks have achieved impressive empirical results, while less structured neural networks may fail to learn to reason. Theoretically, there is currently limited understanding of the interplay between reasoning tasks and network learning. In this paper, we develop a framework to characterize which tasks a neural network can learn well, by studying how well its structure aligns with the algorithmic structure of the relevant reasoning procedure. This suggests that Graph Neural Networks can learn dynamic programming, a powerful algorithmic strategy that solves a broad class of reasoning problems, such as relational question answering, sorting, intuitive physics, and shortest paths. Our perspective also implies strategies to design neural architectures for complex reasoning. On several abstract reasoning tasks, we see empirically that our theory aligns well with practice.
http://arxiv.org/abs/1905.13211
Learning to represent videos is a very challenging task both algorithmically and computationally. Standard video CNN architectures have been designed by directly extending architectures devised for image understanding to a third dimension (using a limited number of space-time modules such as 3D convolutions) or by introducing a handcrafted two-stream design to capture both appearance and motion in videos. We interpret a video CNN as a collection of multi-stream space-time convolutional blocks connected to each other, and propose the approach of automatically finding neural architectures with better connectivity for video understanding. This is done by evolving a population of overly-connected architectures guided by connection weight learning. Architectures combining representations that abstract different input types (i.e., RGB and optical flow) at multiple temporal resolutions are searched for, allowing different types or sources of information to interact with each other. Our method, referred to as AssembleNet, outperforms prior approaches on public video datasets, in some cases by a great margin.
http://arxiv.org/abs/1905.13209
Histology review is often used as the `gold standard’ for disease diagnosis. Computer aided diagnosis tools can potentially help improve current pathology workflows by reducing examination time and interobserver variability. Previous work in cancer grading has focused mainly on classifying pre-defined regions of interest (ROIs), or relied on large amounts of fine-grained labels. In this paper, we propose a two-stage attention-based multiple instance learning model for slide-level cancer grading and weakly-supervised ROI detection and demonstrate its use in prostate cancer. Compared with existing Gleason classification models, our model goes a step further by utilizing visualized saliency maps to select informative tiles for fine-grained grade classification. The model was primarily developed on a large-scale whole slide dataset consisting of 3,521 prostate biopsy slides with only slide-level labels from 718 patients. The model achieved state-of-the-art performance for prostate cancer grading with an accuracy of 85.11\% for classifying benign, low-grade (Gleason grade 3+3 or 3+4), and high-grade (Gleason grade 4+3 or higher) slides on an independent test set.
http://arxiv.org/abs/1905.13208
Quantifying and measuring uncertainty in deep neural networks, despite recent important advances, is still an open problem. Bayesian neural networks are a powerful solution, where the prior over network weights is a design choice, often a normal distribution or other distribution encouraging sparsity. However, this prior is agnostic to the generative process of the input data, which might lead to unwarranted generalization for out-of-distribution tested data. We suggest treating the generative process of the input data as a confounder for the relation between the input and the discriminative function, thereby conditioning the prior of the network weights on the distribution of the input. We propose an algorithm for modeling this confounder through neural connectivity patterns. This approach is ultimately translated into a new deep architecture—a compact hierarchy of networks. We demonstrate that sampling networks from this hierarchy, proportionally to their posterior, is efficient and enables estimating various types of uncertainties. Empirical evaluations of our method demonstrate significant improvement compared to state-of-the-art calibration and out-of-distribution detection methods.
http://arxiv.org/abs/1905.13195
While graph kernels (GKs) are easy to train and enjoy provable theoretical guarantees, their practical performances are limited by their expressive power, as the kernel function often depends on hand-crafted combinatorial features of graphs. Compared to graph kernels, graph neural networks (GNNs) usually achieve better practical performance, as GNNs use multi-layer architectures and non-linear activation functions to extract high-order information of graphs as features. However, due to the large number of hyper-parameters and the non-convex nature of the training procedure, GNNs are harder to train. Theoretical guarantees of GNNs are also not well-understood. Furthermore, the expressive power of GNNs scales with the number of parameters, and thus it is hard to exploit the full power of GNNs when computing resources are limited. The current paper presents a new class of graph kernels, Graph Neural Tangent Kernels (GNTKs), which correspond to \emph{infinitely wide} multi-layer GNNs trained by gradient descent. GNTKs enjoy the full expressive power of GNNs and inherit advantages of GKs. Theoretically, we show GNTKs provably learn a class of smooth functions on graphs. Empirically, we test GNTKs on graph classification datasets and show they achieve strong performance.
http://arxiv.org/abs/1905.13192
We study revenue-optimal pricing and driver compensation in ridesharing platforms when drivers have heterogeneous preferences over locations. If a platform ignores drivers’ location preferences, it may make inefficient trip dispatches; moreover, drivers may strategize so as to route towards their preferred locations. In a model with stationary and continuous demand and supply, we present a mechanism that incentivizes drivers to both (i) report their location preferences truthfully and (ii) always provide service. In settings with unconstrained driver supply or symmetric demand patterns, our mechanism achieves (full-information) first-best revenue. Under supply constraints and unbalanced demand, we show via simulation that our mechanism improves over existing mechanisms and has performance close to the first-best.
http://arxiv.org/abs/1905.13191
Artificial Intelligence (AI) technology is rapidly changing many areas of society. While there is tremendous potential in this transition, there are several pitfalls as well. Using the history of computing and the world-wide web as a guide, in this article we identify those pitfalls and actions that lead AI development to its full potential. If done right, AI will be instrumental in achieving the goals we set for economy, society, and the world in general.
http://arxiv.org/abs/1905.13178
We present a method to fabricate general models by multi-directional 3D printing systems, in which different regions of a model are printed along different directions. The core of our method is a support-effective volume decomposition algorithm that targets on minimizing the area of the regions with large overhangs. Optimal volume decomposition represented by a sequence of clipping planes is determined by a beam-guided searching algorithm according to manufacturing constraints. Different from existing approaches that need to manually assemble 3D printed components into a final model, regions decomposed by our algorithm can be automatically fabricated on a multi-directional 3D printing system. Our approach is general and can be applied to models with loops and handles. For those models that cannot completely eliminate supporting structures for large overhangs, an algorithm is developed to generate special supporting structures for multi-directional 3D printing. We developed two different hardware systems to physically verify the effectiveness of our method: a Cartesian-motion based system and an angular-motion based system. A variety of 3D models have been successfully fabricated on these systems.
http://arxiv.org/abs/1812.00606
In this paper, we develop a neural summarization model which can effectively process multiple input documents and distill Transformer architecture with the ability to encode documents in a hierarchical manner. We represent cross-document relationships via an attention mechanism which allows to share information as opposed to simply concatenating text spans and processing them as a flat sequence. Our model learns latent dependencies among textual units, but can also take advantage of explicit graph representations focusing on similarity or discourse relations. Empirical results on the WikiSum dataset demonstrate that the proposed architecture brings substantial improvements over several strong baselines.
http://arxiv.org/abs/1905.13164
Recent years have witnessed rapid developments on social recommendation techniques for improving the performance of recommender systems due to the growing influence of social networks to our daily life. The majority of existing social recommendation methods unify user representation for the user-item interactions (item domain) and user-user connections (social domain). However, it may restrain user representation learning in each respective domain, since users behave and interact differently in the two domains, which makes their representations to be heterogeneous. In addition, most of traditional recommender systems can not efficiently optimize these objectives, since they utilize negative sampling technique which is unable to provide enough informative guidance towards the training during the optimization process. In this paper, to address the aforementioned challenges, we propose a novel deep adversarial social recommendation framework DASO. It adopts a bidirectional mapping method to transfer users’ information between social domain and item domain using adversarial learning. Comprehensive experiments on two real-world datasets show the effectiveness of the proposed framework.
http://arxiv.org/abs/1905.13160
We develop a system to disambiguate objects based on simple physical descriptions. The system takes as input a natural language phrase and a depth image containing a segmented object and predicts how similar the observed object is to the described object. Our system is designed to learn from only a small amount of human-labeled language data and generalize to viewpoints not represented in the language-annotated depth-image training set. By decoupling 3D shape representation from language representation, our method is able to ground language to novel objects using a small amount of language-annotated depth-data and a larger corpus of unlabeled 3D object meshes, even when these objects are partially observed from unusual viewpoints. Our system is able to disambiguate between novel objects, observed via depth-images, based on natural language descriptions. Our method also enables view-point transfer; trained on human-annotated data on a small set of depth-images captured from frontal viewpoints, our system successfully predicted object attributes from rear views despite having no such depth images in its training set. Finally, we demonstrate our system on a Baxter robot, enabling it to pick specific objects based on human-provided natural language descriptions.
http://arxiv.org/abs/1905.13153
In the broadcast domain there is an abundance of related text data and partial transcriptions, such as closed captions and subtitles. This text data can be used for lightly supervised training, in which text matching the audio is selected using an existing speech recognition model. Current approaches to light supervision typically filter the data based on matching error rates between the transcriptions and biased decoding hypotheses. In contrast, semi-supervised training does not require matching text data, instead generating a hypothesis using a background language model. State-of-the-art semi-supervised training uses lattice-based supervision with the lattice-free MMI (LF-MMI) objective function. We propose a technique to combine inaccurate transcriptions with the lattices generated for semi-supervised training, thus preserving uncertainty in the lattice where appropriate. We demonstrate that this combined approach reduces the expected error rates over the lattices, and reduces the word error rate (WER) on a broadcast task.
http://arxiv.org/abs/1905.13150
Prostate cancer is one of the most common forms of cancer and the third leading cause of cancer death in North America. As an integrated part of computer-aided detection (CAD) tools, diffusion-weighted magnetic resonance imaging (DWI) has been intensively studied for accurate detection of prostate cancer. With deep convolutional neural networks (CNNs) significant success in computer vision tasks such as object detection and segmentation, different CNNs architectures are increasingly investigated in medical imaging research community as promising solutions for designing more accurate CAD tools for cancer detection. In this work, we developed and implemented an automated CNNs-based pipeline for detection of clinically significant prostate cancer (PCa) for a given axial DWI image and for each patient. DWI images of 427 patients were used as the dataset, which contained 175 patients with PCa and 252 healthy patients. To measure the performance of the proposed pipeline, a test set of 108 (out of 427) patients were set aside and not used in the training phase. The proposed pipeline achieved area under the receiver operating characteristic curve (AUC) of 0.87 (95% Confidence Interval (CI): 0.84-0.90) and 0.84 (95% CI: 0.76-0.91) at slice level and patient level, respectively.
http://arxiv.org/abs/1905.13145
Person re-identification (reID) aims to match person images to retrieve the ones with the same identity. This is a challenging task, as the images to be matched are generally semantically misaligned due to the diversity of human poses and capture viewpoints, incompleteness of the visible bodies (due to occlusion), etc. In this paper, we propose a framework that drives the reID network to learn semantics-aligned feature representation through delicate supervision designs. Specifically, we build a Semantics Aligning Network (SAN) which consists of a base network as encoder (SA-Enc) for re-ID, and a decoder (SA-Dec) for reconstructing/regressing the densely semantics aligned full texture image. We jointly train the SAN under the supervisions of person re-identification and aligned texture generation. Moreover, at the decoder, besides the reconstruction loss, we add triplet reID constraints/losses over the feature maps as the perceptual losses. The decoder is discarded in the inference/test and thus our scheme is computationally efficient. Ablation studies demonstrate the effectiveness of our design. We achieve the state-of-the-art performances on the benchmark datasets CUHK03, Market1501, MSMT17, and the partial person reID dataset Partial REID.
http://arxiv.org/abs/1905.13143
Spectral methods such as the improved Fourier Mellin Invariant (iFMI) transform have proved faster, more robust and accurate than feature based methods on image registration. However, iFMI is restricted to work only when the camera moves in 2D space and has not been applied on omni-cameras images so far. In this work, we extend the iFMI method and apply a motion model to estimate an omni-camera’s pose when it moves in 3D space. This is particularly useful in field robotics applications to get a rapid and comprehensive view of unstructured environments, and to estimate robustly the robot pose. In the experiment section, we compared the extended iFMI method against ORB and AKAZE feature based approaches on three datasets showing different type of environments: office, lawn and urban scenery (MPI-omni dataset). The results show that our method boosts the accuracy of the robot pose estimation two to four times with respect to the feature registration techniques, while offering lower processing times. Furthermore, the iFMI approach presents the best performance against motion blur typically present in mobile robotics.
http://arxiv.org/abs/1811.05306
High accuracy localisation technologies exist but are prohibitively expensive to deploy for large indoor spaces such as warehouses, factories, and supermarkets to track assets and people. However, these technologies can be used to lend their highly accurate localisation capabilities to low-cost, commodity, and less-accurate technologies. In this paper, we bridge this link by proposing a technology-agnostic calibration framework based on artificial intelligence to assist such low-cost technologies through highly accurate localisation systems. A single-layer neural network is used to calibrate less accurate technology using more accurate one such as BLE using UWB and UWB using a professional motion tracking system. On a real indoor testbed, we demonstrate an increase in accuracy of approximately 70% for BLE and 50% for UWB. Not only the proposed approach requires a very short measurement campaign, the low complexity of the single-layer neural network also makes it ideal for deployment on constrained devices typically for localisation purposes.
http://arxiv.org/abs/1905.13118
We present a reinforcement learning (RL) based guidance system for automated theorem proving geared towards Finding Longer Proofs (FLoP). FLoP focuses on generalizing from short proofs to longer ones of similar structure. To achieve that, FLoP uses state-of-the-art RL approaches that were previously not applied in theorem proving. In particular, we show that curriculum learning significantly outperforms previous learning-based proof guidance on a synthetic dataset of increasingly difficult arithmetic problems.
http://arxiv.org/abs/1905.13100
Medical imaging only indirectly measures the molecular identity of the tissue within each voxel, which often produces only ambiguous image evidence for target measures of interest, like semantic segmentation. This diversity and the variations of plausible interpretations are often specific to given image regions and may thus manifest on various scales, spanning all the way from the pixel to the image level. In order to learn a flexible distribution that can account for multiple scales of variations, we propose the Hierarchical Probabilistic U-Net, a segmentation network with a conditional variational auto-encoder (cVAE) that uses a hierarchical latent space decomposition. We show that this model formulation enables sampling and reconstruction of segmenations with high fidelity, i.e. with finely resolved detail, while providing the flexibility to learn complex structured distributions across scales. We demonstrate these abilities on the task of segmenting ambiguous medical scans as well as on instance segmentation of neurobiological and natural images. Our model automatically separates independent factors across scales, an inductive bias that we deem beneficial in structured output prediction tasks beyond segmentation.
http://arxiv.org/abs/1905.13077
Deep Neural Network (DNN) trained by the gradient descent method is known to be vulnerable to maliciously perturbed adversarial input, aka. adversarial attack. As one of the countermeasures against adversarial attack, increasing the model capacity for DNN robustness enhancement was discussed and reported as an effective approach by many recent works. In this work, we show that shrinking the model size through proper weight pruning can even be helpful to improve the DNN robustness under adversarial attack. For obtaining a simultaneously robust and compact DNN model, we propose a multi-objective training method called Robust Sparse Regularization (RSR), through the fusion of various regularization techniques, including channel-wise noise injection, lasso weight penalty, and adversarial training. We conduct extensive experiments across popular ResNet-20, ResNet-18 and VGG-16 DNN architectures to demonstrate the effectiveness of RSR against popular white-box (i.e., PGD and FGSM) and black-box attacks. Thanks to RSR, 85% weight connections of ResNet-18 can be pruned while still achieving 0.68% and 8.72% improvement in clean- and perturbed-data accuracy respectively on CIFAR-10 dataset, in comparison to its PGD adversarial training baseline.
http://arxiv.org/abs/1905.13074
Weakly Aggregative Modal Logic (WAML) is a collection of disguised polyadic modal logics with n-ary modalities whose arguments are all the same. WAML has some interesting applications on epistemic logic and logic of games, so we study some basic model theoretical aspects of WAML in this paper. Specifically, we give a van Benthem-Rosen characterization theorem of WAML based on an intuitive notion of bisimulation and show that each basic WAML system K_n lacks Craig Interpolation.
http://arxiv.org/abs/1803.10953
This paper describes Unbabel’s submission to the WMT2019 APE Shared Task for the English-German language pair. Following the recent rise of large, powerful, pre-trained models, we adapt the BERT pretrained model to perform Automatic Post-Editing in an encoder-decoder framework. Analogously to dual-encoder architectures we develop a BERT-based encoder-decoder (BED) model in which a single pretrained BERT encoder receives both the source src and machine translation tgt strings. Furthermore, we explore a conservativeness factor to constrain the APE system to perform fewer edits. As the official results show, when trained on a weighted combination of in-domain and artificial training data, our BED system with the conservativeness penalty improves significantly the translations of a strong Neural Machine Translation system by $-0.78$ and $+1.23$ in terms of TER and BLEU, respectively.
http://arxiv.org/abs/1905.13068
We propose a novel feed-forward network for video inpainting. We use a set of sampled video frames as the reference to take visible contents to fill the hole of a target frame. Our video inpainting network consists of two stages. The first stage is an alignment module that uses computed homographies between the reference frames and the target frame. The visible patches are then aggregated based on the frame similarity to fill in the target holes roughly. The second stage is a non-local attention module that matches the generated patches with known reference patches (in space and time) to refine the previous global alignment stage. Both stages consist of large spatial-temporal window size for the reference and thus enable modeling long-range correlations between distant information and the hole regions. Therefore, even challenging scenes with large or slowly moving holes can be handled, which have been hardly modeled by existing flow-based approach. Our network is also designed with a recurrent propagation stream to encourage temporal consistency in video results. Experiments on video object removal demonstrate that our method inpaints the holes with globally and locally coherent contents.
http://arxiv.org/abs/1905.13066
The ability of reasoning beyond data fitting is substantial to deep learning systems in order to make a leap forward towards artificial general intelligence. A lot of efforts have been made to model neural-based reasoning as an iterative decision-making process based on recurrent networks and reinforcement learning. Instead, inspired by the consciousness prior proposed by Yoshua Bengio, we explore reasoning with the notion of attentive awareness from a cognitive perspective, and formulate it in the form of attentive message passing on graphs, called neural consciousness flow (NeuCFlow). Aiming to bridge the gap between deep learning systems and reasoning, we propose an attentive computation framework with a three-layer architecture, which consists of an unconsciousness flow layer, a consciousness flow layer, and an attention flow layer. We implement the NeuCFlow model with graph neural networks (GNNs) and conditional transition matrices. Our attentive computation greatly reduces the complexity of vanilla GNN-based methods, capable of running on large-scale graphs. We validate our model for knowledge graph reasoning by solving a series of knowledge base completion (KBC) tasks. The experimental results show NeuCFlow significantly outperforms previous state-of-the-art KBC methods, including the embedding-based and the path-based. The reproducible code can be found by the link below.
http://arxiv.org/abs/1905.13049
Binarization is widely used as an image preprocessing step to separate object especially text from background before recognition. For noisy images with uneven illumination, threshold values should be computed pixel by pixel to obtain a good segmentation. Since local threshold values typically depend on moments-based statistics such as mean and variance of gray levels inside rectangular windows, integral images are commonly used to accelerate the calculation. However, integral images are memory consuming. For Sauvola’s method, the two integral images occupy $16HW$ bytes given a $H\times W$ input image. By using a recursive technique to avoid integral images, memory usage of intermediate data structures can be reduced significantly to $6\min{H,W}$ bytes, while the time complexity remains $O(HW)$ independent of window size. Therefore, the proposed implementation enable various local adaptive binarization methods to be applied in real-time use cases on devices with limited resources.
http://arxiv.org/abs/1905.13038
Chinese word segmentation (CWS) is often regarded as a character-based sequence labeling task in most current works which have achieved great success with the help of powerful neural networks. However, these works neglect an important clue: Chinese characters incorporate both semantic and phonetic meanings. In this paper, we introduce multiple character embeddings including Pinyin Romanization and Wubi Input, both of which are easily accessible and effective in depicting semantics of characters. We propose a novel shared Bi-LSTM-CRF model to fuse linguistic features efficiently by sharing the LSTM network during the training procedure. Extensive experiments on five corpora show that extra embeddings help obtain a significant improvement in labeling accuracy. Specifically, we achieve the state-of-the-art performance in AS and CityU corpora with F1 scores of 96.9 and 97.3, respectively without leveraging any external lexical resources.
http://arxiv.org/abs/1808.04963
Detecting and mapping informal settlements encompasses several of the United Nations sustainable development goals. This is because informal settlements are home to the most socially and economically vulnerable people on the planet. Thus, understanding where these settlements are is of paramount importance to both government and non-government organizations (NGOs), such as the United Nations Children’s Fund (UNICEF), who can use this information to deliver effective social and economic aid. We propose a method that detects and maps the locations of informal settlements using only freely available, Sentinel-2 low-resolution satellite spectral data and socio-economic data. This is in contrast to previous studies that only use costly very-high resolution (VHR) satellite and aerial imagery. We show how we can detect informal settlements by combining both domain knowledge and machine learning techniques, to build a classifier that looks for known roofing materials used in informal settlements. Please find additional material at https://frontierdevelopmentlab.github.io/informal-settlements/.
http://arxiv.org/abs/1812.00786
Nonnegative matrix factorization (NMF) is a linear dimensionality technique for nonnegative data with applications such as image analysis, text mining, audio source separation and hyperspectral unmixing. Given a data matrix $M$ and a factorization rank $r$, NMF looks for a nonnegative matrix $W$ with $r$ columns and a nonnegative matrix $H$ with $r$ rows such that $M \approx WH$. NMF is NP-hard to solve in general. However, it can be computed efficiently under the separability assumption which requires that the basis vectors appear as data points, that is, that there exists an index set $\mathcal{K}$ such that $W = M(:,\mathcal{K})$. In this paper, we generalize the separability assumption: We only require that for each rank-one factor $W(:,k)H(k,:)$ for $k=1,2,\dots,r$, either $W(:,k) = M(:,j)$ for some $j$ or $H(k,:) = M(i,:)$ for some $i$. We refer to the corresponding problem as generalized separable NMF (GS-NMF). We discuss some properties of GS-NMF and propose a convex optimization model which we solve using a fast gradient method. We also propose a heuristic algorithm inspired by the successive projection algorithm. To verify the effectiveness of our methods, we compare them with several state-of-the-art separable NMF algorithms on synthetic, document and image data sets.
http://arxiv.org/abs/1905.12995
We show that any DNNF circuit that expresses the set of linear orders over a set of $n$ candidates must be of size $2^{\Omega(n)}$. Moreover, we show that there exist DNNF circuits of size $2^{O(n)}$ expressing linear orders over $n$ candidates.
http://arxiv.org/abs/1807.06397
We investigate the data complexity of answering queries mediated by metric temporal logic ontologies under the event-based semantics assuming that data instances are finite timed words timestamped with binary fractions. We identify classes of ontology-mediated queries answering which can be done in AC0, NC1, L, NL, P, and coNP for data complexity, provide their rewritings to first-order logic and its extensions with primitive recursion, transitive closure or datalog, and establish lower complexity bounds.
http://arxiv.org/abs/1905.12990
Gastric endoscopy is a common clinical practice that enables medical doctors to diagnose the stomach inside a body. In order to identify a gastric lesion’s location such as early gastric cancer within the stomach, this work addressed to reconstruct the 3D shape of a whole stomach with color texture information generated from a standard monocular endoscope video. Previous works have tried to reconstruct the 3D structures of various organs from endoscope images. However, they are mainly focused on a partial surface. In this work, we investigated how to enable structure-from-motion (SfM) to reconstruct the whole shape of a stomach from a standard endoscope video. We specifically investigated the combined effect of chromo-endoscopy and color channel selection on SfM. Our study found that 3D reconstruction of the whole stomach can be achieved by using red channel images captured under chromo-endoscopy by spreading indigo carmine (IC) dye on the stomach surface.
http://arxiv.org/abs/1905.12988
Multi-robot visual simultaneous localization and mapping (SLAM) system is normally consisted of multiple mobile robots equipped with camera and/or other visual sensors. The networked robots work independently or cooperatively in an unknown scene in order to solve autonomous localization and mapping problem. One of the most critical issues in Multi-robot visual SLAM is the intensive computation that is normally required yet overwhelming for inexpensive mobile robots with limited on-board resources. To address this problem, a novel task offloading strategy and dense point cloud map construction method is proposed in this paper. First, we develop a novel strategy to remotely offload computation-intensive tasks to cloud center, so that the tasks that could not originally be achieved locally on the resource-limited robot systems become possible. Second, a modified iterative closest point algorithm (ICP), named fitness score hierarchical ICP algorithm (FS-HICP), is developed to accelerate point cloud registration. The correctness, efficiency, and scalability of the proposed strategy are evaluated with both theoretical analysis and experimental simulations. The results show that the proposed method can effectively reduce the energy consumption while increase the computation capability and speed of the multi-robot visual SLAM system, especially in indoor environment.
http://arxiv.org/abs/1905.12973
Rankings, representing preferences over a set of candidates, are widely used in many information systems, e.g., group decision making. It is of great importance to evaluate the consensus of the obtained rankings from multiple agents. There is often no ground truth available for a ranking task. An overall measure of the consensus degree enables us to have a clear cognition about the ranking data. Moreover, it could provide a quantitative indicator for consensus comparison between groups and further improvement of a ranking system. In this paper, a novel consensus quantifying approach, without the need for any correlation or distance functions, is proposed based on a concept of q-support patterns of rankings. The q-support patterns represent the commonality embedded in a set of rankings. A method for detecting outliers in a set of rankings is naturally derived from the proposed consensus quantifying approach. Experimental studies are conducted to demonstrate the effectiveness of the proposed approach.
http://arxiv.org/abs/1905.12966
Set-Based Multi-Task Priority is a recent framework to handle inverse kinematics for redundant structures. Both equality tasks, i.e., control objectives to be driven to a desired value, and set-bases tasks, i.e., control objectives to be satisfied with a set/range of values can be addressed in a rigorous manner within a priority framework. In addition, optimization tasks, driven by the gradient of a proper function, may be considered as well, usually as lower priority tasks. In this paper the proper design of the tasks, their priority and the use of a Set-Based Multi-Task Priority framework is proposed in order to handle several constraints simultaneously in real-time. It is shown that safety related tasks such as, e.g., joint limits or kinematic singularity, may be properly handled by consider them both at an higher priority as set-based task and at a lower within a proper optimization functional. Experimental results on a 7DOF Jaco$^2$
http://arxiv.org/abs/1905.12945
In this paper, we present a framework for real-time autonomous robot navigation based on cloud and on-demand databases to address two major issues of human-like robot interaction and task planning in global dynamic environment, which is not known a priori. Our framework contributes to make human-like brain GPS mapping system for robot using spatial information and performs 3D visual semantic SLAM for independent robot navigation. We accomplish the feat by separating robot’s memory system into Long-Term Memory (LTM) and Short-Term Memory (STM). We also form robot’s behavior and knowledge system by linking these memories to Autonomous Navigation Module (ANM), Learning Module (LM), and Behavior Planner Module (BPM). The proposed framework is assessed through simulation using ROS-based Gazebo-simulated mobile robot, RGB-D camera (3D sensor) and a laser range finder (2D sensor) in 3D model of realistic indoor environment. Simulation corroborates the substantial practical merit of our proposed framework.
http://arxiv.org/abs/1905.12942
We propose a novel reinforcement learning algorithm, AlphaNPI, that incorporates the strengths of Neural Programmer-Interpreters (NPI) and AlphaZero. NPI contributes structural biases in the form of modularity, hierarchy and recursion, which are helpful to reduce sample complexity, improve generalization and increase interpretability. AlphaZero contributes powerful neural network guided search algorithms, which we augment with recursion. AlphaNPI only assumes a hierarchical program specification with sparse rewards: 1 when the program execution satisfies the specification, and 0 otherwise. Using this specification, AlphaNPI is able to train NPI models effectively with RL for the first time, completely eliminating the need for strong supervision in the form of execution traces. The experiments show that AlphaNPI can sort as well as previous strongly supervised NPI variants. The AlphaNPI agent is also trained on a Tower of Hanoi puzzle with two disks and is shown to generalize to puzzles with an arbitrary number of disk
http://arxiv.org/abs/1905.12941
We present a computational model based on the CRISP theory (Content Representation, Intrinsic Sequences, and Pattern completion) of the hippocampus that allows to continuously store pattern sequences online in a one-shot fashion. Rather than storing a sequence in CA3, CA3 provides a pre-trained sequence that is hetero-associated with the input sequence, which allows the system to perform one-shot learning. Plasticity on a short time scale therefore only happens in the incoming and outgoing connections of CA3. Stored sequences can later be recalled from a single cue pattern. We identify the pattern separation performed by subregion DG to be necessary for storing sequences that contain correlated patterns. A design principle of the model is that we use a single learning rule named Hebbiand-escent to train all parts of the system. Hebbian-descent has an inherent forgetting mechanism that allows the system to continuously memorize new patterns while forgetting early stored ones. The model shows a plausible behavior when noisy and new patterns are presented and has a rather high capacity of about 40% in terms of the number of neurons in CA3. One notable property of our model is that it is capable of boot-strapping' (improving) itself without external input in a process we refer to as
dreaming’. Besides artificially generated input sequences we also show that the model works with sequences of encoded handwritten digits or natural images. To our knowledge this is the first model of the hippocampus that allows to store correlated pattern sequences online in a one-shot fashion without a consolidation process, which can instantaneously be recalled later.
http://arxiv.org/abs/1905.12937
In this paper we present an architecture for the operation of an assistive robot finally aimed at allowing users with severe motion disabilities to perform manipulation tasks that may help in daily-life operations. The robotic system, based on a lightweight robot manipulator, receives high level commands from the user through a Brain-Computer Interface based on P300 paradigm. The motion of the manipulator is controlled relying on a closed loop inverse kinematic algorithm that simultaneously manages multiple set-based and equality-based tasks. The software architecture is developed relying on widely used frameworks to operate BCIs and robots (namely, BCI2000 for the operation of the BCI and ROS for the control of the manipulator) integrating control, perception and communication modules developed for the application at hand. Preliminary experiments have been conducted to show the potentialities of the developed architecture.
http://arxiv.org/abs/1905.12927
Unsupervised text attribute transfer automatically transforms a text to alter a specific attribute (e.g. sentiment) without using any parallel data, while simultaneously preserving its attribute-independent content. The dominant approaches are trying to model the content-independent attribute separately, e.g., learning different attributes’ representations or using multiple attribute-specific decoders. However, it may lead to inflexibility from the perspective of controlling the degree of transfer or transferring over multiple aspects at the same time. To address the above problems, we propose a more flexible unsupervised text attribute transfer framework which replaces the process of modeling attribute with minimal editing of latent representations based on an attribute classifier. Specifically, we first propose a Transformer-based autoencoder to learn an entangled latent representation for a discrete text, then we transform the attribute transfer task to an optimization problem and propose the Fast-Gradient-Iterative-Modification algorithm to edit the latent representation until conforming to the target attribute. Extensive experimental results demonstrate that our model achieves very competitive performance on three public data sets. Furthermore, we also show that our model can not only control the degree of transfer freely but also allow to transfer over multiple aspects at the same time.
http://arxiv.org/abs/1905.12926
Robotic grasp detection is a fundamental capability for intelligent manipulation in unstructured environments. Previous work mainly employed visual and tactile fusion to achieve stable grasp, while, the whole process depending heavily on regrasping, which wastes much time to regulate and evaluate. We propose a novel way to improve robotic grasping: by using learned tactile knowledge, a robot can achieve a stable grasp from an image. First, we construct a prior tactile knowledge learning framework with novel grasp quality metric which is determined by measuring its resistance to external perturbations. Second, we propose a multi-phases Bayesian Grasp architecture to generate stable grasp configurations through a single RGB image based on prior tactile knowledge. Results show that this framework can classify the outcome of grasps with an average accuracy of 86% on known objects and 79% on novel objects. The prior tactile knowledge improves the successful rate of 55% over traditional vision-based strategies.
http://arxiv.org/abs/1905.12920
Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.
http://arxiv.org/abs/1706.04902
In this work, we demonstrate yet another approach to tackle the amodal segmentation problem. Specifically, we first introduce a new representation, namely a semantics-aware distance map (sem-dist map), to serve as our target for amodal segmentation instead of the commonly used masks and heatmaps. The sem-dist map is a kind of level-set representation, of which the different regions of an object are placed into different levels on the map according to their visibility. It is a natural extension of masks and heatmaps, where modal, amodal segmentation, as well as depth order information, are all well-described. Then we also introduce a novel convolutional neural network (CNN) architecture, which we refer to as semantic layering network, to estimate sem-dist maps layer by layer, from the global-level to the instance-level, for all objects in an image. Extensive experiments on the COCOA and D2SA datasets have demonstrated that our framework can predict amodal segmentation, occlusion and depth order with state-of-the-art performance.
http://arxiv.org/abs/1905.12898
In this paper, we propose a novel method for a sentence-level answer-selection task that is one of the fundamental problems in natural language processing. First, we explore the effect of additional information by adopting a pretrained language model to compute the vector representation of the input text and by applying transfer learning from a large-scale corpus. Second, we enhance the compare-aggregate model by proposing a novel latent clustering method to compute additional information within the target corpus and by changing the objective function from listwise to pointwise. To evaluate the performance of the proposed approaches, experiments are performed with the WikiQA and TREC-QA datasets. The empirical results demonstrate the superiority of our proposed approach, which achieve state-of-the-art performance on both datasets.
http://arxiv.org/abs/1905.12897
We address the problem of imitation learning with multi-modal demonstrations. Instead of attempting to learn all modes, we argue that in many tasks it is sufficient to imitate any one of them. We show that the state-of-the-art methods such as GAIL and behavior cloning, due to their choice of loss function, often incorrectly interpolate between such modes. Our key insight is to minimize the right divergence between the learner and the expert state-action distributions, namely the reverse KL divergence or I-projection. We propose a general imitation learning framework for estimating and minimizing any f-Divergence. By plugging in different divergences, we are able to recover existing algorithms such as Behavior Cloning (Kullback-Leibler), GAIL (Jensen Shannon) and Dagger (Total Variation). Empirical results show that our approximate I-projection technique is able to imitate multi-modal behaviors more reliably than GAIL and behavior cloning.
http://arxiv.org/abs/1905.12888
Computer vision produces representations of scene content. Much computer vision research is predicated on the assumption that these intermediate representations are useful for action. Recent work at the intersection of machine learning and robotics calls this assumption into question by training sensorimotor systems directly for the task at hand, from pixels to actions, with no explicit intermediate representations. Thus the central question of our work: Does computer vision matter for action? We probe this question and its offshoots via immersive simulation, which allows us to conduct controlled reproducible experiments at scale. We instrument immersive three-dimensional environments to simulate challenges such as urban driving, off-road trail traversal, and battle. Our main finding is that computer vision does matter. Models equipped with intermediate representations train faster, achieve higher task performance, and generalize better to previously unseen environments. A video that summarizes the work and illustrates the results can be found at https://youtu.be/4MfWa2yZ0Jc
http://arxiv.org/abs/1905.12887
Existing Earth Vision datasets are either suitable for semantic segmentation or object detection. In this work, we introduce the first benchmark dataset for instance segmentation in aerial imagery that combines instance-level object detection and pixel-level segmentation tasks. In comparison to instance segmentation in natural scenes, aerial images present unique challenges e.g., a huge number of instances per image, large object-scale variations and abundant tiny objects. Our large-scale and densely annotated Instance Segmentation in Aerial Images Dataset (iSAID) comes with 655,451 object instances for 15 categories across 2,806 high-resolution images. Such precise per-pixel annotations for each instance ensure accurate localization that is essential for detailed scene analysis. Compared to existing small-scale aerial image based instance segmentation datasets, iSAID contains 15$\times$ the number of object categories and 5$\times$ the number of instances. We benchmark our dataset using two popular instance segmentation approaches for natural images, namely Mask R-CNN and PANet. In our experiments we show that direct application of off-the-shelf Mask R-CNN and PANet on aerial images provide suboptimal instance segmentation results, thus requiring specialized solutions from the research community.
http://arxiv.org/abs/1905.12886
M-GWAP is a multimodal game with a purpose of that leverages on the wisdom of crowds phenomenon for the annotation of multimedia data in terms of mental states. This game with a purpose is developed in WordPress to allow users implementing the game without programming skills. The game adopts motivational strategies for the player to remain engaged, such as a score system, text motivators while playing, a ranking system to foster competition and mechanics for identify building. The current version of the game was deployed after alpha and beta testing helped refining the game accordingly.
http://arxiv.org/abs/1905.12884
Recently, deep convolutional neural networks (CNNs) have achieved great success in pathological image classification. However, due to the limited number of labeled pathological images, there are still two challenges to be addressed: (1) overfitting: the performance of a CNN model is undermined by the overfitting due to its huge amounts of parameters and the insufficiency of labeled training data. (2) privacy leakage: the model trained using a conventional method may involuntarily reveal the private information of the patients in the training dataset. The smaller the dataset, the worse the privacy leakage. To tackle the above two challenges, we introduce a novel stochastic gradient descent (SGD) scheme, named patient privacy preserving SGD (P3SGD), which performs the model update of the SGD in the patient level via a large-step update built upon each patient’s data. Specifically, to protect privacy and regularize the CNN model, we propose to inject the well-designed noise into the updates. Moreover, we equip our P3SGD with an elaborated strategy to adaptively control the scale of the injected noise. To validate the effectiveness of P3SGD, we perform extensive experiments on a real-world clinical dataset and quantitatively demonstrate the superior ability of P3SGD in reducing the risk of overfitting. We also provide a rigorous analysis of the privacy cost under differential privacy. Additionally, we find that the models trained with P3SGD are resistant to the model-inversion attack compared with those trained using non-private SGD.
http://arxiv.org/abs/1905.12883
Over the past few years the Angry Birds AI competition has been held in an attempt to develop intelligent agents that can successfully and efficiently solve levels for the video game Angry Birds. Many different agents and strategies have been developed to solve the complex and challenging physical reasoning problems associated with such a game. However none of these agents attempt one of the key strategies which humans employ to solve Angry Birds levels, which is restarting levels. Restarting is important in Angry Birds because sometimes the level is no longer solvable or some given shot made has little to no benefit towards the ultimate goal of the game. This paper proposes a framework and experimental evaluation for when to restart levels in Angry Birds. We demonstrate that restarting is a viable strategy to improve agent performance in many cases.
http://arxiv.org/abs/1905.12877