Projects consist of interconnected dimensions such as objectives, time, resources, and environment. Controlled use of these dimensions and their effective scheduling bring project success. The project scheduling process includes defining project activities and estimating the time and resources required for them. Project resource-scheduling problems attracted increasing attention after the Program Evaluation and Review Technique (PERT) and the Critical Path Method (CPM) were developed. However, the complexity and difficulty of the CPM and PERT processes have led to these techniques being applied through artificial-intelligence methods such as the Genetic Algorithm (GA). In this study, an algorithm is proposed and developed that determines the critical path, the critical activities, and the project completion time using a GA, instead of the CPM and PERT techniques traditionally used for network analysis within project management. GAs were chosen because they are an effective method for solving complex optimization problems. The obtained results therefore support correct decisions for the implemented project activities. Thus, using the model based on the dynamic algorithm, optimal results were obtained in a shorter time than with the CPM and PERT techniques. This study is expected to contribute to the performance (time, speed, low error rate, etc.) of other studies.
http://arxiv.org/abs/1902.00659
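For the GA-based critical-path study above, a minimal sketch is possible without any CPM/PERT machinery: assign each activity a random priority, decode a path through the activity network by always following the highest-priority successor, and let the GA maximize total path duration so that the fittest individual corresponds to the critical path. The toy network, operators, and parameters below are illustrative assumptions, not the paper's algorithm.

```python
import random

# Toy activity-on-node network: activity -> (duration, successors).
network = {
    "start": (0, ["A", "B"]),
    "A": (3, ["C"]),
    "B": (2, ["C", "D"]),
    "C": (4, ["end"]),
    "D": (1, ["end"]),
    "end": (0, []),
}

def decode(priorities):
    """Follow highest-priority successors from start to end; return path and duration."""
    node, path, duration = "start", ["start"], 0
    while network[node][1]:
        node = max(network[node][1], key=lambda n: priorities[n])
        path.append(node)
        duration += network[node][0]
    return path, duration

def genetic_critical_path(pop_size=30, generations=50, mutation=0.1):
    acts = list(network)
    pop = [{a: random.random() for a in acts} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: decode(ind)[1], reverse=True)   # longest path = fittest
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = random.sample(parents, 2)
            child = {a: random.choice((p1[a], p2[a])) for a in acts}  # uniform crossover
            if random.random() < mutation:
                child[random.choice(acts)] = random.random()          # mutation
            children.append(child)
        pop = parents + children
    return decode(max(pop, key=lambda ind: decode(ind)[1]))

print(genetic_critical_path())   # critical path here: start -> A -> C -> end (duration 7)
```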
The recent proposal of learned index structures opens up a new perspective on how traditional range indexes can be optimized. However, current learned indexes assume that the data distribution is relatively static and the access pattern is uniform, while real-world scenarios consist of skewed query distributions and evolving data. In this paper, we demonstrate that the missing consideration of access patterns and dynamic data distributions notably hinders the applicability of learned indexes. To this end, we propose solutions for learned indexes for dynamic workloads (called Doraemon). To improve the latency for skewed queries, Doraemon augments the training data with access frequencies. To address slow model re-training when the data distribution shifts, Doraemon caches previously trained models and incrementally fine-tunes them for similar access patterns and data distributions. Our preliminary results show that Doraemon improves query latency by 45.1% and reduces the model re-training time to 1/20.
http://arxiv.org/abs/1902.00655
Deep gated convolutional networks have proven to be very effective for single-channel speech separation. However, current state-of-the-art frameworks often train the gated convolutional networks in the time-frequency (TF) domain. Such an approach results in a limited perceptual score, such as the signal-to-distortion ratio (SDR) upper bound of the separated utterances, and also fails to exploit an end-to-end framework. In this paper we present a simple and effective integrated end-to-end approach to monaural speech separation, which consists of a deep gated convolutional neural network (GCNN) that takes the mixed utterance of two speakers and maps it to two separated utterances, where each utterance contains only one speaker's voice. In addition, long short-term memory (LSTM) is employed for long-term temporal modeling. For the objective, we propose to train the network by directly optimizing the utterance-level SDR in a permutation invariant training (PIT) style. Our experiments on the public WSJ0-2mix corpus demonstrate that this new scheme produces more discriminative separated utterances, leading to improved performance on the speaker separation task.
http://arxiv.org/abs/1902.00651
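For the utterance-level PIT objective in the separation paper above, a minimal sketch (using SI-SDR as a stand-in for the paper's SDR criterion) evaluates the loss under every assignment of network outputs to reference speakers and keeps only the best permutation; the variable names and toy signals are illustrative, not the paper's setup.

```python
import numpy as np
from itertools import permutations

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR (dB) between an estimated and a reference utterance."""
    ref_energy = np.sum(ref ** 2) + eps
    proj = np.sum(est * ref) / ref_energy * ref      # projection of est onto ref
    noise = est - proj
    return 10 * np.log10((np.sum(proj ** 2) + eps) / (np.sum(noise ** 2) + eps))

def pit_loss(estimates, references):
    """Utterance-level PIT: negative SI-SDR under the best output/reference pairing."""
    n = len(references)
    best = -np.inf
    for perm in permutations(range(n)):
        score = np.mean([si_sdr(estimates[i], references[p]) for i, p in enumerate(perm)])
        best = max(best, score)
    return -best  # minimize the negative SI-SDR of the best permutation

# Toy example with two 1-second "utterances".
rng = np.random.default_rng(0)
refs = [rng.standard_normal(8000), rng.standard_normal(8000)]
ests = [refs[1] + 0.1 * rng.standard_normal(8000),  # outputs arrive in swapped order
        refs[0] + 0.1 * rng.standard_normal(8000)]
print(pit_loss(ests, refs))  # low loss: PIT finds the swapped assignment
```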
During human-robot interaction (HRI), we want the robot to understand us, and we want to intuitively understand the robot. In order to communicate with and understand the robot, we can leverage interactions, where the human and robot observe each other’s behavior. However, it is not always clear how the human and robot should interpret these actions: a given interaction might mean several different things. Within today’s state-of-the-art, the robot assigns a single interaction strategy to the human, and learns from or teaches the human according to this fixed strategy. Instead, we here recognize that different users interact in different ways, and so one size does not fit all. Therefore, we argue that the robot should maintain a distribution over the possible human interaction strategies, and then infer how each individual end-user interacts during the task. We formally define learning and teaching when the robot is uncertain about the human’s interaction strategy, and derive solutions to both problems using Bayesian inference. In examples and a benchmark simulation, we show that our personalized approach outperforms standard methods that maintain a fixed interaction strategy.
http://arxiv.org/abs/1902.00646
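In its simplest discrete form, the idea above of maintaining a distribution over possible human interaction strategies is a Bayesian update over a finite strategy set. The likelihood table and action stream below are hypothetical, purely to show how the robot's belief concentrates on the strategy that best explains the observed behavior.

```python
import numpy as np

# Hypothetical strategy set: each row is P(action | strategy) over three discrete actions.
likelihoods = np.array([
    [0.7, 0.2, 0.1],   # "demonstrator": mostly gives full demonstrations
    [0.2, 0.6, 0.2],   # "corrector":    mostly gives small corrections
    [0.1, 0.2, 0.7],   # "observer":     mostly stays passive
])
belief = np.full(len(likelihoods), 1.0 / len(likelihoods))  # uniform prior

def update_belief(belief, action):
    """Bayes rule: posterior over strategies after observing one human action."""
    posterior = belief * likelihoods[:, action]
    return posterior / posterior.sum()

for action in [1, 1, 2, 1]:            # stream of observed human actions
    belief = update_belief(belief, action)
print(belief)                          # mass concentrates on the "corrector" strategy
```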
A hashing method maps similar high-dimensional data to binary hashcodes with smaller Hamming distance, and it has received broad attention due to its low storage cost and fast retrieval speed. Pairwise similarity is easily obtained and widely used for retrieval, and most supervised hashing algorithms are carefully designed for pairwise supervision. As labeling all data pairs is difficult, semi-supervised hashing has been proposed, which aims at learning efficient codes with limited labeled pairs and abundant unlabeled ones. Existing methods build graphs to capture the structure of the dataset, but they do not work well for complex data, as the graph is built from the data representations and determining representations of complex data is difficult. In this paper, we propose a novel teacher-student semi-supervised hashing framework in which the student is trained with the pairwise information produced by the teacher network. The network follows the smoothness assumption and achieves consistent distances for similar data pairs, so that the retrieval results are similar for neighborhood queries. Experiments on large-scale datasets show that the proposed method achieves impressive gains over supervised baselines and is superior to state-of-the-art semi-supervised hashing methods.
http://arxiv.org/abs/1902.00643
A hashing method maps similar data to binary hashcodes with smaller Hamming distance, and has received broad attention due to its low storage cost and fast retrieval speed. With the rapid development of deep learning, deep hashing methods have achieved promising results in efficient information retrieval. Most existing deep hashing methods adopt pairwise or triplet losses to deal with the similarities underlying the data, but training is difficult and less efficient because $O(n^2)$ data pairs and $O(n^3)$ triplets are involved. To address these issues, we propose a novel deep hashing algorithm with a unary loss that can be trained very efficiently. We first introduce a Unary Upper Bound of the traditional triplet loss, thus reducing the complexity to $O(n)$ and bridging the classification-based unary loss and the triplet loss. Second, we propose a novel Semantic Cluster Deep Hashing (SCDH) algorithm by introducing a modified Unary Upper Bound loss, named Semantic Cluster Unary Loss (SCUL). The resultant hashcodes form several compact clusters, which means that hashcodes in the same cluster have similar semantic information. We also demonstrate that the proposed SCDH can easily be extended to semi-supervised settings by incorporating state-of-the-art semi-supervised learning algorithms. Experiments on large-scale datasets show that the proposed method is superior to state-of-the-art hashing algorithms.
http://arxiv.org/abs/1805.08705
The short-time Fourier transform (STFT) is used as the front end of many popular and successful monaural speech separation methods, such as deep clustering (DPCL), permutation invariant training (PIT), and their various variants. However, the frequency resolution of the STFT is linear, while the frequency resolution of the human auditory system is nonlinear. In this work we propose and empirically study an alternative front end, the constant Q transform (CQT), in place of the STFT, to better approximate the frequency resolving power of the human auditory system. The upper bound on the signal-to-distortion ratio (SDR) of ideal speech separation based on the CQT's ideal ratio mask (IRM) is higher than that based on the STFT. In the same experimental setting on the WSJ0-2mix corpus, we examine the performance of the CQT under different backends, including the original DPCL, utterance-level PIT, and some of their variants. We find that all CQT-based methods outperform their STFT-based counterparts, achieving on average 0.4 dB better SDR improvement.
http://arxiv.org/abs/1902.00631
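To make the front-end swap above concrete, the following sketch builds ideal ratio masks on CQT magnitudes with librosa (assuming two single-speaker signals and their mixture are available as NumPy arrays); it mirrors how an STFT-based IRM would be computed, with librosa.cqt replacing librosa.stft. The sample rate, hop length, and bin counts are placeholder choices, not the paper's settings.

```python
import numpy as np
import librosa

sr = 16000
# Placeholder signals; in practice these would be WSJ0-2mix utterances.
s1 = np.random.randn(sr * 2).astype(np.float32)
s2 = np.random.randn(sr * 2).astype(np.float32)
mix = s1 + s2

def cqt_mag(y):
    # Constant-Q transform magnitude; parameters chosen arbitrarily for the sketch.
    return np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=72, bins_per_octave=12))

M1, M2, Mmix = cqt_mag(s1), cqt_mag(s2), cqt_mag(mix)

# Ideal ratio masks on the CQT magnitudes (eps avoids division by zero).
eps = 1e-8
irm1 = M1 / (M1 + M2 + eps)
irm2 = M2 / (M1 + M2 + eps)

# Masked mixture magnitudes approximate the individual sources.
est1, est2 = irm1 * Mmix, irm2 * Mmix
print(irm1.shape, est1.shape)
```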
Congealing is a flexible nonparametric data-driven framework for the joint alignment of data. It has been successfully applied to the joint alignment of binary images of digits, binary images of object silhouettes, grayscale MRI images, color images of cars and faces, and 3D brain volumes. This research enhances congealing to practically and effectively apply it to curve data. We develop a parameterized set of nonlinear transformations that allow us to apply congealing to this type of data. We present positive results on aligning synthetic and real curve data sets and conclude with a discussion on extending this work to simultaneous alignment and clustering.
http://arxiv.org/abs/1902.00626
A question answering system (QA system) was developed that uses graph-pattern association rules on the YAGO knowledge base. The answer, as the output of the system, is provided based on a user question as input. If the answer is missing or unavailable in the database, then graph-pattern association rules are used to obtain the answer. The architecture of this question answering system is as follows: question classification, graph component generation, query generation, and query processing. The question answering system uses association graph patterns in a waterfall model. In this paper, the architecture of the system is described, specifically discussing its reasoning and performance capabilities. The results of this research show that rules with high confidence and correct logic produce correct answers, and vice versa.
http://arxiv.org/abs/1902.00624
Cross-modal similarity search is the problem of designing a search system that supports querying across content modalities, e.g., using an image to search for texts or using a text to search for images. This paper presents a compact coding solution for efficient search, with a focus on the quantization approach, which has already shown superior performance over hashing solutions in single-modal similarity search. We propose a cross-modal quantization approach, which is among the early attempts to introduce quantization into cross-modal search. The major contribution lies in jointly learning the quantizers for both modalities through aligning the quantized representations for each pair of image and text belonging to a document. In addition, our approach simultaneously learns the common space for both modalities in which quantization is conducted to enable efficient and effective search using the Euclidean distance computed in the common space with fast distance table lookup. Experimental results compared with several competitive algorithms over three benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance.
http://arxiv.org/abs/1902.00623
In this paper, we address the problem of searching for semantically similar images from a large database. We present a compact coding approach, supervised quantization. Our approach simultaneously learns feature selection that linearly transforms the database points into a low-dimensional discriminative subspace, and quantizes the data points in the transformed space. The optimization criterion is that the quantized points not only approximate the transformed points accurately, but also are semantically separable: the points belonging to a class lie in a cluster that does not overlap with the clusters corresponding to other classes, which is formulated as a classification problem. Experiments on several standard datasets show the superiority of our approach over state-of-the-art supervised hashing and unsupervised quantization algorithms.
http://arxiv.org/abs/1902.00617
Multiple automakers have automated driving systems (ADS) in development or in production that offer freeway-pilot functions. This type of ADS is typically limited to restricted-access freeways only; that is, the transition from manual to automated mode takes place only after the ramp merging process is completed manually. One major challenge in extending the automation to ramp merging is that the automated vehicle needs to incorporate and optimize long-term objectives (e.g., a successful and smooth merge) while near-term actions must be safely executed. Moreover, the merging process involves interactions with other vehicles whose behaviors are sometimes hard to predict but may influence the merging vehicle's optimal actions. To tackle such a complicated control problem, we propose to apply Deep Reinforcement Learning (DRL) techniques for finding an optimal driving policy by maximizing the long-term reward in an interactive environment. Specifically, we apply a Long Short-Term Memory (LSTM) architecture to model the interactive environment, from which an internal state containing historical driving information is conveyed to a Deep Q-Network (DQN). The DQN is used to approximate the Q-function, which takes the internal state as input and generates Q-values as output for action selection. With this DRL architecture, the historical impact of the interactive environment on the long-term reward can be captured and taken into account when deciding the optimal control policy. The proposed architecture has the potential to be extended and applied to other autonomous driving scenarios such as driving through a complex intersection or changing lanes under varying traffic flow conditions.
http://arxiv.org/abs/1709.02066
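A minimal PyTorch sketch of the architecture described above: an LSTM summarizes the recent interaction history into an internal state, and a feed-forward Q-network maps that state to Q-values over a discrete set of merging actions. The observation dimension, hidden sizes, and action count are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LSTMDQN(nn.Module):
    """LSTM encoder over the driving history + MLP head producing Q-values."""
    def __init__(self, obs_dim=8, hidden_dim=64, n_actions=5):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Sequential(
            nn.Linear(hidden_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),            # one Q-value per discrete action
        )

    def forward(self, history):
        # history: (batch, time, obs_dim) sequence of past observations
        _, (h_n, _) = self.lstm(history)         # h_n: (1, batch, hidden_dim)
        return self.q_head(h_n.squeeze(0))       # (batch, n_actions)

model = LSTMDQN()
obs_history = torch.randn(4, 20, 8)              # 4 episodes, 20 past steps each
q_values = model(obs_history)
greedy_actions = q_values.argmax(dim=1)          # epsilon-greedy would add exploration
print(q_values.shape, greedy_actions)
```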
Over the past two decades, Unmanned Aerial Vehicles (UAVs), more commonly known as drones, have gained a lot of attention and are rapidly becoming ubiquitous because of their diverse applications such as surveillance, disaster management, pollution monitoring, film-making, and military reconnaissance. However, incidents such as fatal system failures, malicious attacks, and disastrous misuses have raised concerns in the recent past, and security and viability concerns in drone-based applications are growing at an alarming rate. Moreover, UAV networks (UAVNets) are distinct from other ad-hoc networks. Therefore, it is necessary to address these issues to ensure the proper functioning of UAVs while keeping their uniqueness in mind. Furthermore, adequate security and functionality require the consideration of many parameters, which may include an accurate cognizance of the working mechanism of the vehicles, geographical and weather conditions, and UAVNet communication. This is achievable by creating a simulator that includes these aspects. Performance evaluation through a relevant drone simulator thus becomes an indispensable procedure for testing features, configurations, and designs, and for demonstrating superiority over comparative schemes and suitability for the intended application. It is therefore of paramount importance to establish the credibility of simulation results by investigating the merits and limitations of each simulator prior to selection. Based on this motivation, we present a comprehensive survey of current drone simulators. In addition, open research issues and challenges are discussed.
http://arxiv.org/abs/1902.00616
With deep learning based image analysis becoming popular in recent years, many multiple-object tracking applications are in demand. Some of these applications (e.g., surveillance cameras, intelligent robotics, and autonomous driving) require the system to run in real time. Though recently proposed methods reach fairly high accuracy, their speed is still slower than real-time application requirements. In order to increase a tracking-by-detection system's speed for real-time tracking, we propose a confidence trigger detection (CTD) approach, which uses the confidence of the tracker to decide when to trigger object detection. Using this approach, the system can safely skip detection on frames in which objects barely move. We studied the influence of different confidence thresholds on three popular detectors separately. Though we found a trade-off between speed and accuracy, our approach reaches higher accuracy at a given speed.
http://arxiv.org/abs/1902.00615
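The control flow of confidence-triggered detection is simple enough to sketch directly: run the expensive detector only when the tracker's confidence drops below a threshold, and otherwise propagate tracks cheaply. The detector and tracker below are dummy stand-ins so the sketch runs; any real detector/tracker pair exposing the same calls could be substituted.

```python
class DummyDetector:
    def detect(self, frame):                       # stand-in for a CNN detector
        return [(10, 10, 50, 50)]

class DummyTracker:
    def __init__(self): self.boxes, self.conf = [], 0.0
    def confidence(self): return self.conf
    def reset(self, frame, detections): self.boxes, self.conf = detections, 1.0
    def update(self, frame): self.conf *= 0.9      # confidence decays as objects move
    def current_boxes(self): return self.boxes

def track_video(frames, detector, tracker, conf_threshold=0.6):
    """Tracking-by-detection that skips the detector on high-confidence frames."""
    results = []
    for i, frame in enumerate(frames):
        if i == 0 or tracker.confidence() < conf_threshold:
            tracker.reset(frame, detector.detect(frame))   # expensive detection pass
        else:
            tracker.update(frame)                           # cheap tracker-only update
        results.append(tracker.current_boxes())
    return results

print(len(track_video(range(30), DummyDetector(), DummyTracker())))
```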
Word embedding is a powerful tool in natural language processing. In this paper we consider the problem of word embedding composition: given vector representations of two words, compute a vector for the entire phrase. We give a generative model that can capture specific syntactic relations between words. Under our model, we prove that the correlations between three words (measured by their PMI) form a tensor that has an approximate low-rank Tucker decomposition. The result of the Tucker decomposition gives the word embeddings as well as a core tensor, which can be used to produce better compositions of the word embeddings. We also complement our theoretical results with experiments that verify our assumptions, and demonstrate the effectiveness of the new composition method.
http://arxiv.org/abs/1902.00613
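A small sketch of the decomposition step above using tensorly: a low-rank Tucker decomposition of a (here toy, random) third-order PMI tensor yields one factor matrix per mode, usable as word embeddings, plus a core tensor that can correct the plain additive composition of two embeddings. The vocabulary size, ranks, and the composition rule shown are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

vocab_size, rank = 50, 5
# Toy stand-in for the 3-way PMI tensor over word triples.
pmi = tl.tensor(np.random.rand(vocab_size, vocab_size, vocab_size))

core, factors = tucker(pmi, rank=[rank, rank, rank])
embeddings = factors[0]                     # rows: rank-dimensional word embeddings
print(core.shape, embeddings.shape)         # (5, 5, 5) (50, 5)

# Composition sketch: contract the core with two word embeddings to obtain a
# correction term on top of the plain additive composition.
def compose(core, u, v):
    correction = np.einsum('abc,a,b->c', tl.to_numpy(core), u, v)
    return u + v + correction

phrase_vec = compose(core, embeddings[3], embeddings[7])
print(phrase_vec.shape)                     # (5,)
```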
Iterative learning control has been successfully used for several decades to improve the performance of control systems that perform a single repeated task. Using information from prior control executions, learning controllers gradually determine open-loop control inputs whose reference tracking performance can exceed that of traditional feedback-feedforward control algorithms. This paper considers iterative learning control for a previously unexplored field: autonomous racing. Racecars are driven multiple laps around the same sequence of turns while operating near the physical limits of tire-road friction, where steering dynamics become highly nonlinear and transient, making accurate path tracking difficult. However, because the vehicle trajectory is identical for each lap in the case of single-car racing, the nonlinear vehicle dynamics and unmodelled road conditions are repeatable and can be accounted for using iterative learning control, provided the tire force limits have not been exceeded. This paper describes the design and application of proportional-derivative (PD) and quadratically optimal (Q-ILC) learning algorithms for multiple-lap path tracking of an autonomous race vehicle. Simulation results are used to tune controller gains and test convergence, and experimental results are presented on an Audi TTS race vehicle driving several laps around Thunderhill Raceway in Willows, CA at lateral accelerations of up to 8 $\mathrm{m/s^2}$. Both control algorithms are able to correct transient path tracking errors and improve the performance provided by a reference feedforward controller.
http://arxiv.org/abs/1902.00611
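The PD-type learning update at the heart of such a controller is compact enough to show directly: after each lap, the feedforward input is corrected by proportional and derivative terms of that lap's tracking error. The toy first-order plant below is a hypothetical stand-in for the vehicle dynamics; it only illustrates how the error shrinks over iterations.

```python
import numpy as np

dt, T = 0.01, 2.0
t = np.arange(0, T, dt)
reference = np.sin(2 * np.pi * t)            # trajectory to track on every "lap"

def run_plant(u):
    """Toy first-order plant y' = -y + u, standing in for the vehicle dynamics."""
    y = np.zeros_like(u)
    for k in range(1, len(u)):
        y[k] = y[k - 1] + dt * (-y[k - 1] + u[k - 1])
    return y

kp, kd = 0.8, 0.1
u = np.zeros_like(t)                         # feedforward input, refined each lap
for lap in range(10):
    e = reference - run_plant(u)             # tracking error for this lap
    de = np.gradient(e, dt)
    u = u + kp * e + kd * de                 # PD-type ILC update
    print(f"lap {lap}: RMS error = {np.sqrt(np.mean(e**2)):.4f}")
```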
The problem of maneuvering a vehicle through a race course in minimum time requires computation of both longitudinal (brake and throttle) and lateral (steering wheel) control inputs. Unfortunately, solving the resulting nonlinear optimal control problem is typically computationally expensive and infeasible for real-time trajectory planning. This paper presents an iterative algorithm that divides the path generation task into two sequential subproblems that are significantly easier to solve. Given an initial path through the race track, the algorithm runs a forward-backward integration scheme to determine the minimum-time longitudinal speed profile, subject to tire friction constraints. With this fixed speed profile, the algorithm updates the vehicle's path by solving a convex optimization problem that minimizes the resulting path curvature while staying within track boundaries and obeying affine, time-varying vehicle dynamics constraints. This two-step process is repeated iteratively until the predicted lap time no longer improves. While providing no guarantees of convergence or a globally optimal solution, the approach performs very well when validated on the Thunderhill Raceway course in Willows, CA. The predicted lap time converges after four to five iterations, with each iteration over the full 4.5 km race course requiring only thirty seconds of computation time on a laptop computer. The resulting trajectory is experimentally driven at the race circuit with an autonomous Audi TTS test vehicle, and the resulting lap time and racing line are comparable to both a nonlinear gradient descent solution and a trajectory recorded from a professional racecar driver. The experimental results indicate that the proposed method is a viable option for online trajectory planning in the near future.
http://arxiv.org/abs/1902.00606
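The forward-backward integration step above can be sketched in a few lines: first a pointwise speed cap from the lateral-friction constraint (v^2 |kappa| <= mu g), then a forward pass bounding acceleration and a backward pass bounding braking. The curvature profile and limits below are illustrative placeholders, not the Thunderhill data.

```python
import numpy as np

mu, g = 0.9, 9.81            # friction coefficient, gravity
a_max, b_max = 3.0, 6.0      # available longitudinal accel / braking (m/s^2)
ds = 1.0                     # path discretization step (m)

s = np.arange(0, 500, ds)                            # 500 m of path
kappa = 0.02 * np.abs(np.sin(2 * np.pi * s / 250))   # toy curvature profile (1/m)

# 1) Pointwise speed cap from lateral friction: v^2 * |kappa| <= mu * g
v = np.sqrt(mu * g / np.maximum(np.abs(kappa), 1e-6))
v = np.minimum(v, 60.0)                              # global cap on straights

# 2) Forward pass: limited acceleration, v_{i+1}^2 <= v_i^2 + 2 a_max ds
for i in range(len(v) - 1):
    v[i + 1] = min(v[i + 1], np.sqrt(v[i] ** 2 + 2 * a_max * ds))

# 3) Backward pass: limited braking, v_i^2 <= v_{i+1}^2 + 2 b_max ds
for i in range(len(v) - 2, -1, -1):
    v[i] = min(v[i], np.sqrt(v[i + 1] ** 2 + 2 * b_max * ds))

segment_time = np.sum(ds / np.maximum(v[:-1], 1e-6))
print(f"predicted segment time: {segment_time:.1f} s")
```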
Generating explanations of its behavior is an essential capability for a robotic teammate. Explanations help human partners better understand the situation and maintain trust in their teammates. Prior work on robots generating explanations focuses on providing the reasoning behind their decision making. These approaches, however, fail to heed the cognitive requirements of understanding an explanation. In other words, while they provide the right explanations from the explainer's perspective, the explainee part of the equation is ignored. In this work, we address an important aspect along this direction that contributes to a better understanding of a given explanation, which we refer to as the progressiveness of explanations. A progressive explanation improves understanding by limiting the cognitive effort required at each step of making the explanation. As a result, such explanations are expected to be smoother and hence easier to understand. A general formulation of progressive explanation is presented. Algorithms are provided based on several alternative quantifications of cognitive effort as an explanation is being made, and they are evaluated in a standard planning competition domain.
http://arxiv.org/abs/1902.00604
As autonomous vehicles (AVs) inch closer to reality, a central requirement for acceptance will be earning the trust of humans in everyday driving situations. In particular, the interaction between AVs and pedestrians is of high importance, as every human is a pedestrian at some point of the day. This paper considers the interaction of a pedestrian and an autonomous vehicle at a mid-block, unsignalized intersection where there is ambiguity over when the pedestrian should cross and when and how the vehicle should yield. By modeling pedestrian behavior through the concept of gap acceptance, the authors show that a hybrid controller with just four distinct modes allows an autonomous vehicle to successfully interact with a pedestrian across a continuous spectrum of possible crosswalk entry behaviors. The controller is validated through extensive simulation and compared to an alternate POMDP solution and experimental results are provided on a research vehicle for a virtual pedestrian.
http://arxiv.org/abs/1902.00597
Intuitively, human readers cope easily with errors in text; typos, misspellings, word substitutions, etc. do not unduly disrupt natural reading. Previous work indicates that letter transpositions result in increased reading times, but it is unclear if this effect generalizes to more natural errors. In this paper, we report an eye-tracking study that compares two error types (letter transpositions and naturally occurring misspellings) and two error rates (10% or 50% of all words contain errors). We find that human readers show unimpaired comprehension in spite of these errors, but error words cause more reading difficulty than correct words. Also, transpositions are more difficult than misspellings, and a high error rate increases difficulty for all words, including correct ones. We then present a computational model that uses character-based (rather than traditional word-based) surprisal to account for these results. The model explains that transpositions are harder than misspellings because they contain unexpected letter combinations. It also explains the error rate effect: upcoming words are more difficult to predict when the context is degraded, leading to increased surprisal.
http://arxiv.org/abs/1902.00595
In this paper, we present a generative retrieval method for a sponsored search engine, which uses neural machine translation (NMT) to generate keywords directly from the query. This method is completely end-to-end, skipping the query rewriting and relevance judging phases of traditional retrieval systems. Different from standard machine translation, the target space in the retrieval setting is a constrained closed set, where only committed keywords should be generated. We present a Trie-based pruning technique in beam search to address this problem. The biggest challenge in deploying this method in a real industrial environment is the latency impact of running the decoder. Self-normalized training coupled with Trie-based dynamic pruning dramatically reduces the inference time, yielding a speedup of more than 20 times. We also devise a mixed online-offline serving architecture to reduce latency and CPU consumption. To encourage the NMT model to generate new keywords not covered by the existing system, training data is carefully selected. This model has been successfully applied in Baidu's commercial search engine as a supplementary retrieval branch, bringing a remarkable revenue improvement of more than 10 percent.
https://arxiv.org/abs/1902.00592
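The Trie-based pruning idea above can be sketched without any NMT machinery: keep a trie over the token sequences of all committed keywords, and at each beam-search step expand only the continuations that exist as children of the current trie node, so every finished hypothesis is a valid keyword. The scoring function below is a stand-in for the decoder's log-probabilities.

```python
import math

def build_trie(keywords):
    """Trie over tokenized keywords; '<eos>' marks a complete keyword."""
    root = {}
    for kw in keywords:
        node = root
        for tok in kw.split() + ["<eos>"]:
            node = node.setdefault(tok, {})
    return root

def beam_search(trie, score_fn, beam_size=3, max_len=5):
    beams = [([], 0.0, trie)]                        # (tokens, log-prob, trie node)
    finished = []
    for _ in range(max_len):
        candidates = []
        for toks, logp, node in beams:
            for tok, child in node.items():          # prune: only trie children survive
                if tok == "<eos>":
                    finished.append((toks, logp))
                else:
                    candidates.append((toks + [tok], logp + score_fn(toks, tok), child))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if not beams:
            break
    return sorted(finished, key=lambda b: b[1], reverse=True)

keywords = ["cheap flights", "cheap hotels", "flight tickets", "hotel deals"]
trie = build_trie(keywords)
# Stand-in scorer: prefer tokens related to "flight" for an imagined flight query.
score_fn = lambda prefix, tok: 0.0 if "flight" in tok else math.log(0.5)
for toks, logp in beam_search(trie, score_fn):
    print(" ".join(toks), round(logp, 2))
```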
We learn end-to-end point-to-point and path-following navigation behaviors that avoid moving obstacles. These policies receive noisy lidar observations and output robot linear and angular velocities. The policies are trained in small, static environments with AutoRL, an evolutionary automation layer around Reinforcement Learning (RL) that searches for a deep RL reward and neural network architecture with large-scale hyper-parameter optimization. AutoRL first finds a reward that maximizes task completion, and then finds a neural network architecture that maximizes the cumulative value of that reward. Empirical evaluations, both in simulation and on-robot, show that AutoRL policies do not suffer from the catastrophic forgetfulness that plagues many other deep reinforcement learning algorithms, generalize to new environments and moving obstacles, are robust to sensor, actuator, and localization noise, and can serve as robust building blocks for larger navigation tasks. Our path-following and point-to-point policies are respectively 23% and 26% more successful than comparison methods across new environments. Video at: https://youtu.be/0UwkjpUEcbI
http://arxiv.org/abs/1809.10124
This paper presents Recurrent Dual Attention Network (ReDAN) for visual dialog, using multi-step reasoning to answer a series of questions about an image. In each turn of the dialog, ReDAN infers answers progressively through multiple steps. In each step, a recurrently-updated semantic representation of the (refined) query is used for iterative reasoning over both the image and previous dialog history. Experimental results on VisDial v1.0 dataset show that the proposed ReDAN model outperforms prior state-of-the-art approaches across multiple evaluation metrics. Visualization on the iterative reasoning process further demonstrates that ReDAN can locate context-relevant visual and textual clues leading to the correct answers step-by-step.
http://arxiv.org/abs/1902.00579
Adversarial attacks and the development of (deep) neural networks robust against them are currently two widely researched topics. The robustness of Learning Vector Quantization (LVQ) models against adversarial attacks has, however, not yet been studied to the same extent. We therefore present an extensive evaluation of three LVQ models: Generalized LVQ, Generalized Matrix LVQ and Generalized Tangent LVQ. The evaluation suggests that both Generalized LVQ and Generalized Tangent LVQ have a high base robustness, on par with the current state-of-the-art in robust neural network methods. In contrast to this, Generalized Matrix LVQ shows a high susceptibility to adversarial attacks, scoring consistently behind all other models. Additionally, our numerical evaluation indicates that increasing the number of prototypes per class improves the robustness of the models.
http://arxiv.org/abs/1902.00577
Voice controlled virtual assistants (VAs) are now available in smartphones, cars, and standalone devices in homes. In most cases, the user needs to first "wake up" the VA by saying a particular word or phrase every time he or she wants the VA to do something. Eliminating the need to say the wake-up word for every interaction could improve the user experience. This would require the VA to have the capability to detect the speech that is being directed at it and respond accordingly. In other words, the challenge is to distinguish between system-directed and non-system-directed speech utterances. In this paper, we present a number of neural network architectures for tackling this classification problem based on using only acoustic features. These architectures are based on convolutional, recurrent and feed-forward layers. In addition, we investigate the use of an attention mechanism applied to the output of the convolutional and the recurrent layers. It is shown that incorporating the proposed attention mechanism into the models always leads to significant improvement in classification accuracy. The best model achieved equal error rates of 16.25 and 15.62 percent on two distinct realistic datasets.
http://arxiv.org/abs/1902.00570
We study the efficacy of SHIELD in the face of alternative threat models. We find that SHIELD’s robustness decreases by 65% (accuracy drops from 63% to 22%) against an adaptive adversary (one who knows JPEG compression is being used as a pre-processing step but not necessarily the compression level) in the gray-box threat model (adversary is aware of the model architecture but not necessarily the weights of that model). However, these adversarial examples are, so far, unable to force a targeted prediction. We also find that the robustness of the JPEG-trained models used in SHIELD decreases by 67% (accuracy drops from 57% to 19% on average) against an adaptive adversary in the gray-box threat model. The addition of SLQ pre-processing to these JPEG-trained models is also not a robust defense (accuracy drops to 0.1%) against an adaptive adversary in the gray-box threat model, and an adversary can create adversarial perturbations that force a chosen prediction. We find that neither JPEG-trained models with SLQ pre-processing nor SHIELD are robust against an adaptive adversary in the white-box threat model (accuracy is 0.1%) and the adversary can control the predicted output of their adversarial images. Finally, ensemble-based attacks transfer better (29.8% targeted accuracy) than non-ensemble based attacks (1.4%) against the JPEG-trained models in SHIELD.
http://arxiv.org/abs/1902.00541
We propose the multi-layered cepstrum (MLC) method to estimate multiple fundamental frequencies (MF0) of a signal under challenging contamination such as high-pass filter noise. By applying the cepstrum operations (i.e., Fourier transform, filtering, and nonlinear activation) recursively, the MLC is shown to be an efficient method for enhancing MF0 saliency in a step-by-step manner. Evaluation on a real-world polyphonic music dataset under both normal and low-fidelity conditions demonstrates the potential of the MLC.
http://arxiv.org/abs/1902.00539
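A rough sketch of the recursive operation described above, applying a Fourier transform, filtering, and a nonlinear activation layer after layer; the crude high-pass filter, the log activation, and the number of layers are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def mlc_layer(x, cutoff_bins=2):
    """One layer: Fourier transform -> high-pass filtering -> nonlinear activation."""
    spectrum = np.abs(np.fft.rfft(x))
    spectrum[:cutoff_bins] = 0.0            # crude high-pass: drop the lowest bins
    return np.log1p(spectrum)               # nonlinear activation (log compression)

def multi_layered_cepstrum(x, n_layers=3):
    out = x
    for _ in range(n_layers):
        out = mlc_layer(out)
    return out

# Toy polyphonic signal: two fundamentals (220 Hz and 330 Hz) plus harmonics.
sr = 16000
t = np.arange(0, 0.5, 1 / sr)
signal = sum(np.sin(2 * np.pi * f0 * k * t) for f0 in (220.0, 330.0) for k in (1, 2, 3))
salience = multi_layered_cepstrum(signal)
print(salience.shape)                       # enhanced representation used as MF0 saliency
```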
From the early days of computing, games have been important testbeds for studying how well machines can do sophisticated decision making. In recent years, machine learning has made dramatic advances with artificial agents reaching superhuman performance in challenge domains like Go, Atari, and some variants of poker. As with their predecessors of chess, checkers, and backgammon, these game domains have driven research by providing sophisticated yet well-defined challenges for artificial intelligence practitioners. We continue this tradition by proposing the game of Hanabi as a new challenge domain with novel problems that arise from its combination of purely cooperative gameplay and imperfect information in a two to five player setting. In particular, we argue that Hanabi elevates reasoning about the beliefs and intentions of other agents to the foreground. We believe developing novel techniques capable of imbuing artificial agents with such theory of mind will not only be crucial for their success in Hanabi, but also in broader collaborative efforts, and especially those with human partners. To facilitate future research, we introduce the open-source Hanabi Learning Environment, propose an experimental framework for the research community to evaluate algorithmic advances, and assess the performance of current state-of-the-art techniques.
http://arxiv.org/abs/1902.00506
This paper proposes a novel algorithm which learns a formal regular grammar from real-world continuous data, such as videos or other streaming data. Learning latent terminals, non-terminals, and productions rules directly from streaming data allows the construction of a generative model capturing sequential structures with multiple possibilities. Our model is fully differentiable, and provides easily interpretable results which are important in order to understand the learned structures. It outperforms the state-of-the-art on several challenging datasets and is more accurate for forecasting future activities in videos. We plan to open-source the code.
http://arxiv.org/abs/1902.00505
Humans have entered the age of algorithms. Each minute, algorithms shape countless preferences, from suggesting a product to suggesting a potential life partner. In the marketplace, algorithms are trained to learn consumer preferences from customer reviews because user-generated reviews are considered the voice of customers and a valuable source of information to firms. Insights mined from reviews play an indispensable role in several business activities, ranging from product recommendation and targeted advertising to promotions and segmentation. In this research, we question whether reviews might hold stereotypic gender bias that algorithms learn and propagate. Utilizing data from millions of observations and a word embedding approach, GloVe, we show that algorithms designed to learn from human language output also learn gender bias. We also examine why such biases occur: whether the bias is caused by a negative bias against females or a positive bias for males. We examine the impact of gender bias in reviews on choice and conclude with policy implications for female consumers, especially when they are unaware of the bias, and the ethical implications for firms.
http://arxiv.org/abs/1902.00496
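The measurement itself, independent of the review corpus, is a standard embedding-association test: compare how close a target word sits to a male word set versus a female word set in a trained GloVe space. The sketch below uses gensim's downloadable pre-trained GloVe vectors as a stand-in for embeddings trained on reviews, and the word lists are small illustrative assumptions, not the paper's lexicons.

```python
import numpy as np
import gensim.downloader as api

# Pre-trained GloVe vectors (downloaded on first use).
vectors = api.load("glove-wiki-gigaword-100")

male_words = ["he", "man", "his", "male"]
female_words = ["she", "woman", "her", "female"]

def gender_association(word):
    """Positive: closer to the male set; negative: closer to the female set."""
    male_sim = np.mean([vectors.similarity(word, m) for m in male_words])
    female_sim = np.mean([vectors.similarity(word, f) for f in female_words])
    return male_sim - female_sim

for target in ["engineer", "nurse", "doctor", "receptionist"]:
    print(f"{target:>14s}: {gender_association(target):+.3f}")
```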
Recent approaches to English-language sentence compression rely on parallel corpora consisting of sentence-compression pairs. However, a sentence may be shortened in many different ways, which each might be suited to the needs of a particular application. Therefore, in this work, we collect and model crowdsourced judgements of the acceptability of many possible sentence shortenings. We then show how a model of such judgements can be used to support a flexible approach to the compression task. We release our model and dataset for future work.
http://arxiv.org/abs/1902.00489
Predicting the collective motion of a group of pedestrians (a crowd) under the vehicle influence is essential for autonomous vehicles to deal with mixed urban scenarios where interpersonal interaction and vehicle-crowd interaction (VCI) are significant. This usually requires a model that can describe individual pedestrian motion under the influence of nearby pedestrians and the vehicle. This study proposes two pedestrian trajectory datasets, the CITR dataset and the DUT dataset, so that pedestrian motion models can be further calibrated and verified, especially when vehicle influence on pedestrians plays an important role. The CITR dataset consists of experimentally designed fundamental VCI scenarios (front, back, and lateral VCI) and provides a unique ID for each pedestrian, which is suitable for exploring a specific aspect of VCI. The DUT dataset gives two ordinary and natural VCI scenarios on a crowded university campus, which can be used for more general-purpose VCI exploration. The trajectories of pedestrians, as well as vehicles, were extracted by processing video frames from a down-facing camera mounted on a hovering drone as the recording equipment. The final trajectories were refined by a Kalman filter, in which the pedestrian velocity was also estimated. The statistics of the velocity magnitude distribution demonstrate the validity of the proposed datasets. In total, there are approximately 340 pedestrian trajectories in the CITR dataset and 1793 pedestrian trajectories in the DUT dataset. The datasets are available on GitHub.
http://arxiv.org/abs/1902.00487
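The refinement step mentioned above, a Kalman filter over the extracted positions that also estimates velocity, can be sketched with a standard constant-velocity model; the frame rate and noise covariances here are placeholders that would be tuned to the drone footage.

```python
import numpy as np

dt = 1 / 30.0                                  # frame interval (assumed 30 fps)
F = np.array([[1, 0, dt, 0],                   # state: [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],                    # only position is measured
              [0, 1, 0, 0]], dtype=float)
Q = 0.05 * np.eye(4)                           # process noise (placeholder)
R = 0.5 * np.eye(2)                            # measurement noise (placeholder)

def kalman_smooth(positions):
    """Filter noisy (x, y) detections; returns filtered positions and velocities."""
    x = np.array([positions[0][0], positions[0][1], 0.0, 0.0])
    P = np.eye(4)
    out = []
    for z in positions:
        x, P = F @ x, F @ P @ F.T + Q                 # predict
        S = H @ P @ H.T + R                           # update
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.asarray(z) - H @ x)
        P = (np.eye(4) - K @ H) @ P
        out.append(x.copy())
    return np.array(out)

# Toy pedestrian walking at 1.2 m/s along x with measurement noise.
true_x = np.arange(0, 3, 1.2 * dt)
noisy = np.stack([true_x + 0.1 * np.random.randn(len(true_x)),
                  0.1 * np.random.randn(len(true_x))], axis=1)
states = kalman_smooth(noisy)
print("estimated velocity (m/s):", states[-1, 2:].round(2))
```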
Memorization of data in deep neural networks has become a subject of significant research interest. We prove that over-parameterized single layer fully connected autoencoders memorize training data: they produce outputs in (a non-linear version of) the span of the training examples. In contrast to fully connected autoencoders, we prove that depth is necessary for memorization in convolutional autoencoders. Moreover, we observe that adding nonlinearity to deep convolutional autoencoders results in a stronger form of memorization: instead of outputting points in the span of the training images, deep convolutional autoencoders tend to output individual training images. Since convolutional autoencoder components are building blocks of deep convolutional networks, we envision that our findings will shed light on the important phenomenon of memorization in over-parameterized deep networks.
http://arxiv.org/abs/1810.10333
Straight lines are common features in human-made environments. They are a richer feature than points, since they yield more information about the environment (they are one-dimensional features rather than the zero-dimensional points). Besides, they are easier to detect and track in image sensors. Having a robust estimation of the 3D parameters of a line measured from an image is a must for several control applications, such as visual servoing. In this work, a classical dynamical system that models the apparent motion of lines in a moving camera's image is presented. In order to obtain the 3D structure of lines, a nonlinear observer is proposed. However, in order to guarantee convergence, the dynamical system must be coupled with an algebraic equation. This is achieved by using spherical coordinates to represent the line's moment vector and a change of basis, which allows the algebraic constraint to be introduced directly into the system's dynamics. Finally, a control law that attempts to optimize the convergence behavior of the observer is presented. The approach is validated in simulation and with a real robotic platform with a camera onboard.
http://arxiv.org/abs/1902.00473
Computational simulation of ultrasound (US) echography is essential for training sonographers. Realistic simulation of US interaction with microscopic tissue structures is often modeled by a tissue representation in the form of point scatterers, convolved with a spatially varying point spread function. This yields a realistic US B-mode speckle texture, given that a scatterer representation for a particular tissue type is readily available. This is often not the case and scatterers are nontrivial to determine. In this work we propose to estimate scatterer maps from sample US B-mode images of a tissue, by formulating this inverse mapping problem as image translation, where we learn the mapping with Generative Adversarial Networks, using a US simulation software for training. We demonstrate robust reconstruction results, invariant to US viewing and imaging settings such as imaging direction and center frequency. Our method is shown to generalize beyond the trained imaging settings, demonstrated on in-vivo US data. Our inference runs orders of magnitude faster than optimization-based techniques, enabling future extensions for reconstructing 3D B-mode volumes with only linear computational complexity.
http://arxiv.org/abs/1902.00469
We describe TF-Replicator, a framework for distributed machine learning designed for DeepMind researchers and implemented as an abstraction over TensorFlow. TF-Replicator simplifies writing data-parallel and model-parallel research code. The same models can be effortlessly deployed to different cluster architectures (i.e. one or many machines containing CPUs, GPUs or TPU accelerators) using synchronous or asynchronous training regimes. To demonstrate the generality and scalability of TF-Replicator, we implement and benchmark three very different models: (1) A ResNet-50 for ImageNet classification, (2) a SN-GAN for class-conditional ImageNet image generation, and (3) a D4PG reinforcement learning agent for continuous control. Our results show strong scalability performance without demanding any distributed systems expertise of the user. The TF-Replicator programming model will be open-sourced as part of TensorFlow 2.0 (see https://github.com/tensorflow/community/pull/25).
http://arxiv.org/abs/1902.00465
The recent advent of the Internet of Things (IOT) has increased the demand for enabling AI-based edge computing. This has necessitated the search for efficient implementations of neural networks in terms of both computation and storage. Although extreme quantization has proven to be a powerful tool to achieve significant compression over full-precision networks, it can result in significant degradation in performance. In this work, we propose extremely quantized hybrid network architectures with both binary and full-precision sections to emulate the classification performance of full-precision networks while ensuring significant energy efficiency and memory compression. We explore several hybrid network architectures and analyze the performance of the networks in terms of accuracy, energy efficiency and memory compression. We perform our analysis on ResNet and VGG network architectures. Among the proposed network architectures, we show that the hybrid networks with full-precision residual connections emerge as the optimum by attaining accuracies close to full-precision networks while achieving excellent memory compression, up to 21.8x in the case of VGG-19. This work demonstrates an effective way of hybridizing networks which achieve performance close to full-precision networks while attaining significant compression, furthering the feasibility of using such networks for energy-efficient neural computing in IOT-based edge devices.
https://arxiv.org/abs/1902.00460
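The binary sections of such hybrid networks are typically built from layers whose weights are binarized in the forward pass while gradients flow through a straight-through estimator, with the full-precision sections kept as ordinary layers. The PyTorch sketch below shows one hypothetical binarized convolution inside a small hybrid block; it is not the paper's exact hybridization scheme.

```python
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    """sign() in the forward pass, straight-through (clipped identity) gradient."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()   # pass gradient only where |w| <= 1

class BinaryConv2d(nn.Conv2d):
    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)      # binarized weights, real activations
        return nn.functional.conv2d(x, w_bin, self.bias, self.stride,
                                    self.padding, self.dilation, self.groups)

# Hybrid block sketch: a binary conv followed by a full-precision 1x1 conv.
block = nn.Sequential(
    BinaryConv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=1),               # full-precision section
)
y = block(torch.randn(2, 16, 8, 8))
y.sum().backward()                                  # gradients reach the latent real weights
print(y.shape)
```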
The ability to interpret a scene is an important capability for a robot that is supposed to interact with its environment. The knowledge of what is in front of the robot is, for example, relevant for navigation, manipulation, or planning. Semantic segmentation labels each pixel of an image with a class label and thus provides a detailed semantic annotation of the surroundings to the robot. Convolutional neural networks (CNNs) are popular methods for addressing this type of problem. The available software for training and the integration of CNNs for real robots, however, is quite fragmented and often difficult to use for non-experts, despite the availability of several high-quality open-source frameworks for neural network implementation and training. In this paper, we propose a tool called Bonnet, which addresses this fragmentation problem by building a higher abstraction that is specific for the semantic segmentation task. It provides a modular approach to simplify the training of a semantic segmentation CNN independently of the used dataset and the intended task. Furthermore, we also address the deployment on a real robotic platform. Thus, we do not propose a new CNN approach in this paper. Instead, we provide a stable and easy-to-use tool to make this technology more approachable in the context of autonomous systems. In this sense, we aim at closing a gap between computer vision research and its use in robotics research. We provide an open-source codebase for training and deployment. The training interface is implemented in Python using TensorFlow and the deployment interface provides a C++ library that can be easily integrated in an existing robotics codebase, a ROS node, and two standalone applications for label prediction in images and videos.
http://arxiv.org/abs/1802.08960
The use of background knowledge remains largely unexploited in many text classification tasks. In this work, we explore word taxonomies as a means of constructing new semantic features, which may improve the performance and robustness of the learned classifiers. We propose tax2vec, a parallel algorithm for constructing taxonomy-based features, and demonstrate its use on six short-text classification problems, including gender, age and personality type prediction, drug effectiveness and side effect prediction, and news topic prediction. The experimental results indicate that the interpretable features constructed using tax2vec can notably improve the performance of classifiers; the constructed features, in combination with fast, linear classifiers tested against strong baselines such as hierarchical attention neural networks, achieved comparable or better classification results on short documents. Further, tax2vec can also serve for the extraction of corpus-specific keywords. Finally, we investigate the semantic space of potential features, where we observe a similarity with the well-known Zipf's law.
http://arxiv.org/abs/1902.00438
We find that 3.3% and 10% of the images from the CIFAR-10 and CIFAR-100 test sets, respectively, have duplicates in the training set. This may incur a bias on the comparison of image recognition techniques with respect to their generalization capability on these heavily benchmarked datasets. To eliminate this bias, we provide the “fair CIFAR” (ciFAIR) dataset, where we replaced all duplicates in the test sets with new images sampled from the same domain. The training set remains unchanged, in order not to invalidate pre-trained models. We then re-evaluate the classification performance of various popular state-of-the-art CNN architectures on these new test sets to investigate whether recent research has overfitted to memorizing data instead of learning abstract concepts. Fortunately, this does not seem to be the case yet. The ciFAIR dataset and pre-trained models are available at https://cvjena.github.io/cifair/, where we also maintain a leaderboard.
http://arxiv.org/abs/1902.00423
The fixed parameter tractable (FPT) approach is a powerful tool in tackling computationally hard problems. In this paper, we link FPT results to classic artificial intelligence (AI) techniques to show how they complement each other. Specifically, we consider the workflow satisfiability problem (WSP) which asks whether there exists an assignment of authorised users to the steps in a workflow specification, subject to certain constraints on the assignment. It was shown by Cohen et al. (JAIR 2014) that WSP restricted to the class of user-independent constraints (UI), covering many practical cases, admits FPT algorithms, i.e. can be solved in time exponential only in the number of steps $k$ and polynomial in the number of users $n$. Since usually $k \ll n$ in WSP, such FPT algorithms are of great practical interest. We present a new interpretation of the FPT nature of the WSP with UI constraints giving a decomposition of the problem into two levels. Exploiting this two-level split, we develop a new FPT algorithm that is by many orders of magnitude faster than the previous state-of-the-art WSP algorithm and also has only polynomial-space complexity. We also introduce new pseudo-Boolean (PB) and Constraint Satisfaction (CSP) formulations of the WSP with UI constraints which efficiently exploit this new decomposition of the problem and raise the novel issue of how to use general-purpose solvers to tackle FPT problems in a fashion that meets FPT efficiency expectations. In our computational study, we investigate, for the first time, the phase transition (PT) properties of the WSP, under a model for generation of random instances. We show how PT studies can be extended, in a novel fashion, to support empirical evaluation of scaling of FPT algorithms.
http://arxiv.org/abs/1604.05636
Slow acquisition has been one of the historical problems in dynamic magnetic resonance imaging (dMRI), but the rise of compressed sensing (CS) has brought numerous algorithms that successfully achieve high acceleration rates. While CS proposes random sampling for data acquisition, practical CS applications to dMRI have typically relied on random variable-density (VD) sampling patterns, where masks are drawn from probabilistic models that preferably sample from the center of the Fourier domain. In contrast to this model-driven approach, we propose the first data-driven, scalable framework for optimizing sampling patterns in dMRI. Through a greedy algorithm, this approach allows the data to directly govern the search for a mask that exhibits good empirical performance. The previous greedy approach, designed for static MRI, required very intensive computation, prohibiting its direct application to dMRI; we address this issue by resorting to a stochastic greedy algorithm that uses only a fraction of the resources of the previous approach without sacrificing reconstruction accuracy. A thorough comparison on in vivo datasets shows the inefficiency of model-based approaches in terms of sampling performance and suggests that our data-driven sampling approach could fully enable the potential of CS applied to dMRI.
http://arxiv.org/abs/1902.00386
We propose a method to incrementally learn an embedding space over the domain of network architectures, to enable the careful selection of architectures for evaluation during compressed architecture search. Given a teacher network, we search for a compressed network architecture by using Bayesian Optimization (BO) with a kernel function defined over our proposed embedding space to select architectures for evaluation. We demonstrate that our search algorithm can significantly outperform various baseline methods, such as random search and reinforcement learning (Ashok et al., 2018). The compressed architectures found by our method are also better than the state-of-the-art manually-designed compact architecture ShuffleNet (Zhang et al., 2018). We also demonstrate that the learned embedding space can be transferred to new settings for architecture search, such as a larger teacher network or a teacher network in a different architecture family, without any training.
http://arxiv.org/abs/1902.00383
What is the place of emotion in intelligent robots? In the past two decades, researchers have advocated for the inclusion of some emotion-related components in the general information processing architecture of autonomous agents, say, for better communication with humans, or to instill a sense of urgency to action. The framework advanced here goes beyond these approaches and proposes that emotion and motivation need to be integrated with all aspects of the architecture. Thus, cognitive-emotional integration is a key design principle. Emotion is not an “add on” that endows a robot with “feelings” (for instance, reporting or expressing its internal state). It allows the significance of percepts, plans, and actions to be an integral part of all its computations. It is hypothesized that a sophisticated artificial intelligence cannot be built from separate cognitive and emotional modules. A hypothetical test inspired by the Turing test, called the Dolores test, is proposed to test this assertion.
http://arxiv.org/abs/1902.00363
Convolutional neural networks are state-of-the-art for various segmentation tasks. While these networks are computationally efficient for 2D images, 3D convolutions have huge storage requirements and require long training times. To overcome this issue, we introduce a network structure for volumetric data without 3D convolution layers. The main idea is to integrate projection layers that transform the volumetric data into a sequence of images, where each image contains information about the full data. We then apply 2D convolutions to the projection images, followed by lifting the result back to volumetric data. The proposed network structure can be trained in much less time than any 3D network and still shows accurate performance for a sparse binary segmentation task.
http://arxiv.org/abs/1902.00347
Selecting an optimal set of icons is a crucial step in the pipeline of visual design to structure and navigate through content. However, designing icon sets is usually a difficult task that requires expert knowledge. In this work, to ease the process of icon set selection for users, we propose a similarity metric that captures the properties of style and visual identity. We train a Siamese Neural Network with an online dataset of icons organized in visually coherent collections, which are used to adaptively sample training data and optimize the training process. As the dataset contains noise, we further collect human-rated information on the perceived similarity of icons, which is used for evaluating and testing the proposed model. We present several results and applications based on searches, kernel visualizations, and optimized set proposals that can be helpful for designers and non-expert users while exploring large collections of icons.
http://arxiv.org/abs/1902.05378
This work proposes a new neural network feature representation that helps to leave out sensitive information in the decision-making process of pattern recognition and machine learning algorithms. The aim of this work is to develop a learning method capable of removing certain information from the feature space without a drop in performance on a recognition task based on that feature space. Our work is in part motivated by the new international regulation for personal data protection, which forces data controllers to avoid discriminative hazards while managing sensitive data of users. Our method is based on a generalization of triplet loss learning that introduces a sensitive information removal process. The method is evaluated on face recognition technologies using state-of-the-art algorithms and publicly available benchmarks. In addition, we present a new annotation dataset with a balanced distribution between genders and ethnic origins. The dataset includes more than 120K images from 24K identities with a variety of poses, image qualities, facial expressions, and illumination conditions. The experiments demonstrate that it is possible to reduce sensitive information such as gender or ethnicity in the feature representation while retaining competitive performance in a face recognition task.
http://arxiv.org/abs/1902.00334