Real-world multimedia data is often composed of multiple modalities such as an image or a video with associated text (e.g., captions, user comments, etc.) and metadata. Such multimodal data packages are prone to manipulation, where a subset of these modalities can be altered to misrepresent or repurpose data packages, with possible malicious intent. It is, therefore, important to develop methods to assess or verify the integrity of these multimedia packages. Using computer vision and natural language processing methods to directly compare the image (or video) and the associated caption to verify the integrity of a media package is only possible for a limited set of objects and scenes. In this paper, we present a novel deep learning-based approach for assessing the semantic integrity of multimedia packages containing images and captions, using a reference set of multimedia packages. We construct a joint embedding of images and captions with deep multimodal representation learning on the reference dataset in a framework that also provides image-caption consistency scores (ICCSs). The integrity of query media packages is assessed as the inlierness of the query ICCSs with respect to the reference dataset. We present the MultimodAl Information Manipulation dataset (MAIM), a new dataset of media packages from Flickr, which we make available to the research community. We use both the newly created dataset as well as the Flickr30K and MS COCO datasets to quantitatively evaluate our proposed approach. Note that the reference dataset does not contain unmanipulated versions of the tampered query packages. Our method achieves F1 scores of 0.75, 0.89 and 0.94 on MAIM, Flickr30K and MS COCO, respectively, for detecting semantically incoherent media packages.
https://arxiv.org/abs/1707.01606
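The abstract specifies the integrity check only at a high level. As a rough illustration (not the paper's actual model), the sketch below assumes cosine similarity in the joint embedding space as the image-caption consistency score (ICCS) and a simple percentile test against the reference-set scores as the inlier check; the embedding functions themselves are assumed to come from the learned multimodal model.

```python
import numpy as np

def iccs(image_emb, caption_emb):
    """Image-caption consistency score: cosine similarity in the joint space."""
    a = image_emb / (np.linalg.norm(image_emb) + 1e-8)
    b = caption_emb / (np.linalg.norm(caption_emb) + 1e-8)
    return float(a @ b)

def is_consistent(query_score, reference_scores, percentile=5.0):
    """Flag the query package as suspicious if its ICCS falls below a low
    percentile of the reference-set score distribution (a simple inlier test;
    the paper's outlier model may be more involved)."""
    threshold = np.percentile(reference_scores, percentile)
    return query_score >= threshold
```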
We propose dynamic scaling in temporal networks with heterogeneous activities and memory, and provide a comprehensive picture for the dynamic topologies of such networks, in terms of the modified activity-driven network model [H. Kim \textit{et al.}, Eur. Phys. J. B {\bf 88}, 315 (2015)]. Particularly, we focus on the interplay of the time resolution and memory in dynamic topologies. Through the random walk (RW) process, we investigate diffusion properties and topological changes as the time resolution increases. Our results with memory are compared to those of the memoryless case. Based on the temporal percolation concept, we derive scaling exponents in the dynamics of the largest cluster and the coverage of the RW process in time-varying networks. We find that the time resolution in the time-accumulated network determines the effective size of the network, while memory affects relevant scaling properties at the crossover from the dynamic regime to the static one. The origin of memory-dependent scaling behaviors is the dynamics of the largest cluster, which depends on temporal degree distributions. Finally, we conjecture an extended finite-size scaling ansatz for dynamic topologies and a fundamental property of temporal networks, both of which are confirmed numerically.
https://arxiv.org/abs/1711.07868
In this work we focus on a GAN-based solution for attribute-guided face synthesis. Previous works exploited GANs to generate photo-realistic face images but paid little attention to the diversity of the resulting images. The proposed solution, which introduces a novel latent space of unit complex numbers, achieves a diversity, measured by the “birthday paradox” test, of three times the size of the training dataset. It is important to emphasize that this result is obtained on a relatively small dataset (20k samples vs. 200k) while preserving the photo-realism of the generated faces at a significantly higher resolution (128x128, compared to 32x32 in previous works).
https://arxiv.org/abs/1806.10982
Our manuscript aims to develop a system that leads to energy conservation so that a few more homes can be lit with the electricity saved. The proposed work is accomplished using an Arduino microcontroller and sensors that control the streetlights based on darkness and object detection. Meanwhile, a counter counts the number of objects passing along the road. The beauty of the proposed work is that the wastage of unused electricity is reduced, the lifetime of the streetlights is enhanced because the lights do not stay ON throughout the night, and safety is improved. We are confident that the proposed idea will be beneficial to future applications of microcontrollers and sensors.
https://arxiv.org/abs/1806.10968
Automatically generated fake restaurant reviews are a threat to online review systems. Recent research has shown that users have difficulties in detecting machine-generated fake reviews hiding among real restaurant reviews. The method used in that work (char-LSTM) has one drawback: it has difficulty staying in context, i.e., when it generates a review for a specific target entity, the resulting review may contain phrases that are unrelated to the target, thus increasing its detectability. In this work, we present and evaluate a more sophisticated technique based on neural machine translation (NMT) with which we can generate reviews that stay on-topic. We test multiple variants of our technique using native English speakers on Amazon Mechanical Turk. We demonstrate that reviews generated by the best variant have almost optimal undetectability (class-averaged F-score 47%). We conduct a user study with skeptical users and show that our method evades detection more frequently compared to the state-of-the-art (average evasion 3.2/4 vs 1.5/4) with statistical significance at level {\alpha} = 1% (Section 4.3). We develop very effective detection tools and reach an average F-score of 97% in classifying these reviews. Although fake reviews are very effective in fooling people, effective automatic detection is still feasible.
https://arxiv.org/abs/1805.02400
Adversarial examples are intentionally crafted data with the purpose of deceiving neural networks into misclassification. When we talk about strategies to create such examples, we usually refer to perturbation-based methods that fabricate adversarial examples by applying invisible perturbations onto normal data. The resulting data preserve their visual appearance to human observers, yet can be totally unrecognizable to DNN models, which in turn leads to completely misleading predictions. In this paper, however, we consider crafting adversarial examples from existing data as a limitation to example diversity. We propose a non-perturbation-based framework that generates native adversarial examples from class-conditional generative adversarial networks. As such, the generated data will not resemble any existing data and thus expand example diversity, raising the difficulty in adversarial defense. We then extend this framework to pre-trained conditional GANs, in which we turn an existing generator into an “adversarial-example generator”. We conduct experiments on our approach for the MNIST and CIFAR10 datasets with satisfactory results, showing that this approach can be a potential alternative to previous attack strategies.
https://arxiv.org/abs/1806.10496
We study the problem of grounding distributional representations of texts on the visual domain, namely visual-semantic embeddings (VSE for short). Beginning with an insightful adversarial attack on VSE embeddings, we show the limitations of current frameworks and image-text datasets (e.g., MS-COCO) both quantitatively and qualitatively. The large gap between the number of possible constitutions of real-world semantics and the size of parallel data, to a large extent, restricts the model from establishing the link between textual semantics and visual concepts. We alleviate this problem by augmenting the MS-COCO image captioning dataset with textual contrastive adversarial samples. These samples are synthesized using linguistic rules and the WordNet knowledge base. The construction procedure is both syntax- and semantics-aware. The samples force the model to ground learned embeddings in concrete concepts within the image. This simple but powerful technique brings a noticeable improvement over the baselines on a diverse set of downstream tasks, in addition to defending against known types of adversarial attacks. We release the code at this https URL.
https://arxiv.org/abs/1806.10348
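The abstract mentions that contrastive captions are synthesized with linguistic rules and WordNet but does not spell out the rules. The snippet below is a minimal, hypothetical noun-swap along those lines using NLTK's WordNet interface: it proposes co-hyponyms ("siblings") of a noun as semantically close but incorrect substitutes. The paper's actual construction is richer and syntax-aware.

```python
# pip install nltk; then: python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

def noun_replacements(word, max_candidates=5):
    """Candidate contrastive substitutes: co-hyponyms ('siblings') of the word's
    first noun sense, i.e. words that share a hypernym but denote a different
    concept. A real pipeline would also check syntax and sense disambiguation."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return []
    sense = synsets[0]
    candidates = []
    for hypernym in sense.hypernyms():
        for sibling in hypernym.hyponyms():
            if sibling == sense:
                continue
            for lemma in sibling.lemma_names():
                name = lemma.replace("_", " ")
                if name.lower() != word.lower():
                    candidates.append(name)
    return candidates[:max_candidates]

print(noun_replacements("dog"))  # e.g. other canines sharing the same hypernym
```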
Despite the impressive search rate of one key per clock cycle, the update stage of a random-access-memory-based content-addressable memory (RAM-based CAM) always suffers from high latency. Two primary causes of such latency are: (1) the compulsory erasing stage that accompanies the writing stage and (2) the major difference in data width between the RAM-based CAM (e.g., 8-bit width) and modern systems (e.g., 256-bit width). This brief, therefore, aims for an efficient input/output (I/O) architecture of RAM-based binary CAM (RCAM) for low-latency updates. To achieve this goal, three techniques, namely centralized erase RAM, bit-slicing, and hierarchical partitioning, are proposed to eliminate the latency of the erasing stage, as well as to allow the RCAM to exploit the bandwidth of modern systems effectively. Several RCAMs, whose data widths range from 8 bits to 64 bits, were integrated into a 256-bit system for the evaluation. The experimental results on an Intel Arria V 5ASTFD5 FPGA show that at 100 MHz, the proposed designs achieve at least 9.6 times higher I/O efficiency compared to the traditional RCAM.
https://arxiv.org/abs/1804.02330
In order to support communication and computation cooperation, we propose the ME-RAN architecture, which consists of a mobile edge cloud (ME) as the computation provision platform and a radio access network (RAN) as the communication interface. A cooperative offloading framework is proposed to achieve the following tasks: (1) to increase user equipments' (UEs') computing capacity by triggering offloading, especially for those UEs that cannot complete their computations locally; and (2) to reduce the power consumption of all UEs under limited computing and communication resources. Based on the above objectives, we formulate the power minimization problem, which is shown to be a non-convex mixed-integer program. First, a Decentralized Local Decision Algorithm (DLDA) is proposed for each UE to estimate its possible local resource consumption and decide whether offloading is in its interest. This operation reduces the overhead and signalling in the later stage. Then, a Centralized decision and resource Allocation algoRithm (CAR) is proposed to conduct the decision making and resource allocation in ME-RAN. Moreover, two low-complexity algorithms are proposed: UE with largest saved power accepted first (CAR-P) and UE with smallest required data rate accepted first. Simulations show that the performance of the proposed algorithms is very close to that of exhaustive search but with much less complexity.
https://arxiv.org/abs/1705.10384
Attributing the pixels of an input image to a certain category is an important and well-studied problem in computer vision, with applications ranging from weakly supervised localisation to understanding hidden effects in the data. In recent years, approaches based on interpreting a previously trained neural network classifier have become the de facto state-of-the-art and are commonly used on medical as well as natural image datasets. In this paper, we discuss a limitation of these approaches which may lead to only a subset of the category specific features being detected. To address this problem we develop a novel feature attribution technique based on Wasserstein Generative Adversarial Networks (WGAN), which does not suffer from this limitation. We show that our proposed method performs substantially better than the state-of-the-art for visual attribution on a synthetic dataset and on real 3D neuroimaging data from patients with mild cognitive impairment (MCI) and Alzheimer’s disease (AD). For AD patients the method produces compellingly realistic disease effect maps which are very close to the observed effects.
https://arxiv.org/abs/1711.08998
Resources for non-English languages are scarce, and this paper addresses this problem in the context of machine translation by automatically extracting parallel sentence pairs from the multilingual articles available on the Internet. In this paper, we use an end-to-end Siamese bidirectional recurrent neural network to generate parallel sentences from comparable multilingual articles in Wikipedia. Subsequently, we show that using the harvested dataset improves BLEU scores on both NMT and phrase-based SMT systems for the low-resource language pairs English–Hindi and English–Tamil, when compared to training exclusively on the limited bilingual corpora collected for these language pairs.
https://arxiv.org/abs/1806.09652
Although the performance of person Re-Identification (ReID) has been significantly boosted, many challenging issues in real scenarios have not been fully investigated, e.g., complex scenes and lighting variations, viewpoint and pose changes, and the large number of identities in a camera network. To facilitate research towards conquering those issues, this paper contributes a new dataset called MSMT17 with many important features, e.g., 1) the raw videos are taken by a 15-camera network deployed in both indoor and outdoor scenes, 2) the videos cover a long period of time and present complex lighting variations, and 3) it contains currently the largest number of annotated identities, i.e., 4,101 identities and 126,441 bounding boxes. We also observe that a domain gap commonly exists between datasets, which essentially causes a severe performance drop when training and testing on different datasets. As a result, available training data cannot be effectively leveraged for new testing domains. To relieve the expensive cost of annotating new training samples, we propose a Person Transfer Generative Adversarial Network (PTGAN) to bridge the domain gap. Comprehensive experiments show that the domain gap can be substantially narrowed down by PTGAN.
https://arxiv.org/abs/1711.08565
This note describes the details of our solution to the dense-captioning events in videos task of the ActivityNet Challenge 2018. Specifically, we solve this problem in a two-stage manner, i.e., temporal event proposal followed by sentence generation. For temporal event proposal, we directly leverage the three-stage workflow in [13, 16]. For sentence generation, we capitalize on an LSTM-based captioning framework with a temporal attention mechanism (dubbed LSTM-T). Moreover, the input visual sequence to the LSTM-based video captioning model is comprised of RGB and optical flow images. At inference, we adopt a late fusion scheme to fuse the two LSTM-based captioning models for sentence generation.
https://arxiv.org/abs/1806.09278
We report here the first RF noise measurements on two designs of n-doped GaN/AlN double-barrier resonant tunneling diodes (RTDs), each having a room-temperature negative differential resistance (NDR) and also strong near-UV light emission. The measurements are made with a standard, un-isolated RF receiver and calibration is made using a substitution-resistor/hot-cold radiometric technique which works in the positive differential resistance (PDR) region but not the NDR region. A high-quality InGaAs/AlAs double-barrier RTD is used as a control sample and displays shot noise suppression down to $\Gamma\approx$0.5 in the PDR region, as expected. The GaN/AlN RTDs display both shot-noise enhancement and suppression in the PDR regions, but no obvious sign of sudden shot-noise enhancement in the threshold bias region of light emission. This supports the hypothesis that the holes required for light emission are created by electronic (Zener) interband tunneling, not impact ionization. Further the minimum shot-noise factor of $\Gamma\sim$ 0.34 suggests that the GaN/AlN RTDs are acting like triple-barrier devices.
https://arxiv.org/abs/1806.09270
“Which Generative Adversarial Network (GAN) generates the most plausible images?” has been a frequently asked question among researchers. To address this problem, we first propose an \emph{incomplete} U-statistics estimate of the maximum mean discrepancy, $\mathrm{MMD}_{\mathrm{inc}}$, to measure the distribution discrepancy between generated and real images. $\mathrm{MMD}_{\mathrm{inc}}$ enjoys the advantages of asymptotic normality, computational efficiency, and model agnosticism. We then propose a GAN analysis framework to select and test the “best” member of a GAN family using Post Selection Inference (PSI) with $\mathrm{MMD}_{\mathrm{inc}}$. In the experiments, we adopt the proposed framework on 7 GAN variants and compare their $\mathrm{MMD}_{\mathrm{inc}}$ scores.
https://arxiv.org/abs/1802.05411
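As a concrete reference for the incomplete U-statistic idea, the sketch below estimates MMD$^2$ with a Gaussian kernel by averaging the usual U-statistic kernel over a random subset of index pairs instead of all O(n^2) pairs; the kernel choice, bandwidth, and pair-sampling design here are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel evaluated row-wise on paired samples."""
    d2 = np.sum((a - b) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_incomplete(x, y, num_pairs=10000, sigma=1.0, seed=None):
    """Incomplete U-statistic estimate of MMD^2: instead of summing the
    pairwise kernel over all index pairs, average h((x_i, y_i), (x_j, y_j))
    over a random subset of pairs with i != j."""
    rng = np.random.default_rng(seed)
    n = min(len(x), len(y))
    i = rng.integers(0, n, size=num_pairs)
    j = rng.integers(0, n, size=num_pairs)
    keep = i != j
    i, j = i[keep], j[keep]
    h = (gaussian_kernel(x[i], x[j], sigma)
         + gaussian_kernel(y[i], y[j], sigma)
         - gaussian_kernel(x[i], y[j], sigma)
         - gaussian_kernel(x[j], y[i], sigma))
    return float(h.mean())
```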
Evaluating on adversarial examples has become a standard procedure to measure the robustness of deep learning models. Due to the difficulty of creating white-box adversarial examples for discrete text input, most analyses of the robustness of NLP models have been done through black-box adversarial examples. We investigate adversarial examples for character-level neural machine translation (NMT), and contrast black-box adversaries with a novel white-box adversary, which employs differentiable string-edit operations to rank adversarial changes. We propose two novel types of attacks which aim to remove or change a word in a translation, rather than simply break the NMT. We demonstrate that white-box adversarial examples are significantly stronger than their black-box counterparts in different attack scenarios, revealing more serious vulnerabilities than previously known. In addition, after performing adversarial training, which takes only 3 times longer than regular training, we can improve the model's robustness significantly.
https://arxiv.org/abs/1806.09030
Face de-identification has become increasingly important as image sources grow explosively and become easily accessible. The advance of new face recognition techniques also raises concerns about privacy leakage. The mainstream pipelines of face de-identification are mostly based on the k-same framework, which bears critiques of low effectiveness and poor visual quality. In this paper, we propose a new framework called Privacy-Protective-GAN (PP-GAN) that adapts GAN with novel verificator and regulator modules specially designed for the face de-identification problem, ensuring that the de-identified output generated from a single input retains structural similarity to it. We evaluate the proposed approach in terms of privacy protection, utility preservation, and structure similarity. Our approach not only outperforms existing face de-identification techniques but also provides a practical framework for adapting GANs with priors of domain knowledge.
https://arxiv.org/abs/1806.08906
We perform a systematic theoretical analysis of the nature and importance of alloy disorder effects on the electronic and optical properties of GaN$_{y}$As$_{1-x-y}$Bi$_{x}$ alloys and quantum wells (QWs), using large-scale atomistic supercell electronic structure calculations based on the tight-binding method. Using ordered alloy supercell calculations we also derive and parametrise an extended basis 14-band \textbf{k}$\cdot$\textbf{p} Hamiltonian for GaN$_{y}$As$_{1-x-y}$Bi$_{x}$. Comparison of the results of these models highlights the role played by short-range alloy disorder – associated with substitutional nitrogen (N) and bismuth (Bi) incorporation – in determining the details of the electronic and optical properties. Systematic analysis of large alloy supercells reveals that the respective impacts of N and Bi on the band structure remain largely independent, a robust conclusion we find to be valid even in the presence of significant alloy disorder where N and Bi atoms share common Ga nearest neighbours. Our calculations reveal that N- (Bi-) related alloy disorder strongly influences the conduction (valence) band edge states, leading in QWs to strong carrier localisation, as well as inhomogeneous broadening and modification of the conventional selection rules for optical transitions. Our analysis provides detailed insight into key properties and trends in this unusual material system, and enables quantitative evaluation of the potential of GaN$_{y}$As$_{1-x-y}$Bi$_{x}$ alloys for applications in photonic and photovoltaic devices.
https://arxiv.org/abs/1712.07693
Training generative adversarial networks is unstable in high-dimensions as the true data distribution tends to be concentrated in a small fraction of the ambient space. The discriminator is then quickly able to classify nearly all generated samples as fake, leaving the generator without meaningful gradients and causing it to deteriorate after a point in training. In this work, we propose training a single generator simultaneously against an array of discriminators, each of which looks at a different random low-dimensional projection of the data. Individual discriminators, now provided with restricted views of the input, are unable to reject generated samples perfectly and continue to provide meaningful gradients to the generator throughout training. Meanwhile, the generator learns to produce samples consistent with the full data distribution to satisfy all discriminators simultaneously. We demonstrate the practical utility of this approach experimentally, and show that it is able to produce image samples with higher quality than traditional training with a single discriminator.
https://arxiv.org/abs/1705.07831
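A minimal sketch of the central idea, assuming flattened inputs and illustrative layer sizes: each discriminator owns a fixed random low-dimensional projection of the data and never sees the full-dimensional sample, while the generator is trained against all of them at once. The paper's architectures and the number of discriminators will differ.

```python
import torch
import torch.nn as nn

class ProjectionDiscriminator(nn.Module):
    """A small critic that only sees a fixed random low-dimensional projection
    of the (flattened) input, not the full-dimensional sample."""
    def __init__(self, data_dim, proj_dim=32):
        super().__init__()
        # Fixed, non-trainable random projection; one per discriminator.
        self.register_buffer("proj", torch.randn(data_dim, proj_dim) / proj_dim ** 0.5)
        self.net = nn.Sequential(
            nn.Linear(proj_dim, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.net(x.flatten(1) @ self.proj)

# The generator's loss is aggregated over all discriminators, so it must
# produce samples consistent with every restricted view simultaneously.
discriminators = nn.ModuleList(
    [ProjectionDiscriminator(data_dim=784, proj_dim=32) for _ in range(8)]
)
```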
This notebook paper presents our system in the ActivityNet Dense Captioning in Video task (task 3). Temporal proposal generation and caption generation are both important to the dense captioning task. Therefore, we propose a proposal ranking model to employ a set of effective feature representations for proposal generation, and ensemble a series of caption models enhanced with context information to generate captions robustly on predicted proposals. Our approach achieves the state-of-the-art performance on the dense video captioning task with 8.529 METEOR score on the challenge testing set.
https://arxiv.org/abs/1806.08854
There is an ever-growing need to ensure the quality of food and assess specific quality parameters in all the links of the food chain, ranging from processing, distribution and retail to food preparation. Various imaging and sensing technologies, including X-ray imaging, ultrasound, and near-infrared reflectance spectroscopy, have been applied to the problem. Cost and other constraints restrict the application of some of these technologies. In this study we test a novel Multiplexing Electric Field Sensor (MEFS), an approach that allows for completely non-invasive and non-destructive testing. Our experiments demonstrate the reliable detection of certain foreign objects and provide evidence that this sensor technology has the capability of measuring fat content in minced meat. Given the fact that this technology can already be deployed at very low cost, with low maintenance and in various different form factors, we conclude that this type of MEFS is an extremely promising technology for addressing specific food quality issues.
https://arxiv.org/abs/1806.08596
In this paper, we prove, following earlier work of Waldspurger ([Wa1], [Wa4]), a sort of local relative trace formula which is related to the local Gan-Gross-Prasad conjecture for unitary groups over a local field $F$ of characteristic zero. As a consequence, we obtain a geometric formula for certain multiplicities $m(\pi)$ appearing in this conjecture and deduce from it a weak form of the local Gan-Gross-Prasad conjecture (multiplicity one in tempered L-packets). These results were already known over $p$-adic fields and thus are only new when $F=\mathbb{R}$.
https://arxiv.org/abs/1506.01452
The role of accretion disks in the formation of low-mass stars has been well assessed by means of high angular resolution observations at various wavelengths. These findings confirm the prediction that conservation of angular momentum during the collapse leading to the formation of a star is bound to produce flattening and rotation of the collapsing core. What about high-mass stars? At present, several authors have reported on detections of disks around high-mass YSOs. Notwithstanding these important results, the presence of disks rotating about high-mass stars is not sufficient by itself to prove unambiguously the accretion model: what is needed is iron-clad evidence of infall. Such evidence is very difficult to find, as the free-fall velocity becomes significant only very close to the accreting star, i.e., over a region of a few 0.01 pc ($\sim$2000 au), which is very difficult to access and disentangle from the surrounding quiescent or rotating material. In this chapter we discuss how to characterize the infall of material in a sample of 36 high-mass accretion disk candidates covering a broad range of luminosities, from 10$^3$ $L_\odot$ to 10$^6$ $L_\odot$, compiled by Beltrán & de Wit (2016) with the next generation Very Large Array (ngVLA).
https://arxiv.org/abs/1806.08143
While 3D object detection and pose estimation have been studied for a long time, their evaluation is not yet completely satisfactory. Indeed, existing datasets typically consist of numerous acquisitions of only a few scenes because of the tediousness of pose annotation, and existing evaluation protocols cannot properly handle objects with symmetries. This work aims at addressing those two points. We first present automatic techniques to produce fully annotated RGBD data of many object instances in arbitrary poses, with which we produce a dataset of thousands of independent scenes of bulk parts composed of both real and synthetic images. We then propose a consistent evaluation methodology suitable for any rigid object, regardless of its symmetries. We illustrate it with two reference object detection and pose estimation methods on different objects, and show that incorporating symmetry considerations into pose estimation methods themselves can lead to significant performance gains. The proposed dataset is available at this http URL.
https://arxiv.org/abs/1806.08129
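The abstract does not give the evaluation formula, so the following is only a sketch of a common way to make a pose error symmetry-aware: compute the average model-point distance between the estimated and ground-truth poses, minimized over the object's proper symmetry transforms. The paper's protocol may differ in the exact distance used and in how continuous symmetries are handled.

```python
import numpy as np

def pose_error_with_symmetries(model_points, R_gt, t_gt, R_est, t_est, symmetries):
    """Average model-point distance between estimated and ground-truth poses,
    minimised over the object's proper symmetry rotations S (include the
    identity so non-symmetric objects reduce to the plain distance)."""
    gt = model_points @ R_gt.T + t_gt
    errors = []
    for S in symmetries:
        est = (model_points @ S.T) @ R_est.T + t_est
        errors.append(np.mean(np.linalg.norm(gt - est, axis=1)))
    return min(errors)
```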
Recommendation systems play a vital role in keeping users engaged with personalized content in modern online platforms. Deep learning has revolutionized many research fields and there is a recent surge of interest in applying it to collaborative filtering (CF). However, existing methods compose deep learning architectures with the latent factor model, ignoring a major class of CF models: neighborhood- or memory-based approaches. We propose Collaborative Memory Networks (CMN), a deep architecture that unifies the two classes of CF models by capitalizing on the strengths of the global structure of the latent factor model and the local neighborhood-based structure in a nonlinear fashion. Motivated by the success of Memory Networks, we fuse a memory component and a neural attention mechanism as the neighborhood component. The associative addressing scheme with the user and item memories in the memory module encodes complex user-item relations, coupled with the neural attention mechanism to learn a user-item specific neighborhood. Finally, the output module jointly exploits the neighborhood with the user and item memories to produce the ranking score. Stacking multiple memory modules together yields deeper architectures capturing increasingly complex user-item relations. Furthermore, we show strong connections between CMN components, memory networks and the three classes of CF models. Comprehensive experimental results demonstrate the effectiveness of CMN on three public datasets, outperforming competitive baselines. Qualitative visualization of the attention weights provides insight into the model's recommendation process and suggests the presence of higher-order interactions.
https://arxiv.org/abs/1804.10862
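To make the "associative addressing plus neural attention" component concrete, here is a reduced PyTorch sketch in which a user-item query attends over the memory slots of the item's neighborhood (other users who interacted with the item) and returns an attended read-out. The dimensions and the composition with the latent factor part are assumptions, not the paper's full CMN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborhoodMemoryAttention(nn.Module):
    """Associative addressing sketch: a user-item query attends over the memory
    slots of the item's neighbourhood, and the attended read-out serves as the
    neighbourhood component of the preference score."""
    def __init__(self, num_users, num_items, dim=64):
        super().__init__()
        self.user_mem = nn.Embedding(num_users, dim)   # user memory (keys)
        self.user_out = nn.Embedding(num_users, dim)   # output memory (values)
        self.item_emb = nn.Embedding(num_items, dim)

    def forward(self, user, item, neighbour_users):
        # user, item: (B,) LongTensors; neighbour_users: (B, N) LongTensor
        q = self.user_mem(user) + self.item_emb(item)             # (B, d)
        keys = self.user_mem(neighbour_users)                     # (B, N, d)
        vals = self.user_out(neighbour_users)                     # (B, N, d)
        attn = F.softmax((keys * q.unsqueeze(1)).sum(-1), dim=-1) # (B, N)
        return (attn.unsqueeze(-1) * vals).sum(1)                 # (B, d) read-out
```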
Recently, neural machine translation (NMT) has been extended to multilinguality, that is, to handle more than one translation direction with a single system. Multilingual NMT has shown competitive performance against pure bilingual systems. Notably, in low-resource settings, it proved to work effectively and efficiently, thanks to a shared representation space that is forced across languages and induces a sort of transfer learning. Furthermore, multilingual NMT enables so-called zero-shot inference across language pairs never seen at training time. Despite the increasing interest in this framework, an in-depth analysis of what a multilingual NMT model is capable of and what it is not is still missing. Motivated by this, our work (i) provides a quantitative and comparative analysis of the translations produced by bilingual, multilingual and zero-shot systems; (ii) investigates the translation quality of two of the currently dominant neural architectures in MT, namely the Recurrent and the Transformer ones; and (iii) quantitatively explores how the closeness between languages influences zero-shot translation. Our analysis leverages multiple professional post-edits of automatic translations by several different systems and focuses both on automatic standard metrics (BLEU and TER) and on widely used error categories, namely lexical, morphological, and word order errors.
https://arxiv.org/abs/1806.06957
Genetic Algorithms (GAs) are used to solve search and optimization problems in which an optimal solution can be found using an iterative process with probabilistic and non-deterministic transitions. However, depending on the problem's nature, the time required to find a solution can be high on sequential machines due to the computational complexity of genetic algorithms. This work proposes a parallel implementation of a genetic algorithm on a field-programmable gate array (FPGA). Optimization of the system's processing time is the main goal of this project. Results for the processing time and area occupancy (on the FPGA) for various population sizes are analyzed. Studies concerning the accuracy of the GA response for the optimization of two-variable functions were also carried out for the hardware implementation. Moreover, the high-performance implementation proposed in this paper is able to handle more variables after some adjustments to the hardware architecture.
https://arxiv.org/abs/1806.11555
In this paper, we propose a method that disentangles the effects of multiple input conditions in Generative Adversarial Networks (GANs). In particular, we demonstrate our method in controlling color, texture, and shape of a generated garment image for computer-aided fashion design. To disentangle the effect of input attributes, we customize conditional GANs with consistency loss functions. In our experiments, we tune one input at a time and show that we can guide our network to generate novel and realistic images of clothing articles. In addition, we present a fashion design process that estimates the input attributes of an existing garment and modifies them using our generator.
https://arxiv.org/abs/1806.07819
Noisy, intermediate-scale quantum (NISQ) computers are expected to execute quantum circuits of up to a few hundred qubits. The circuits have to satisfy certain constraints concerning the placement and interactions of the involved qubits. Hence, a compiler takes an input circuit not conforming to a NISQ architecture and transforms it to a conforming output circuit. NISQ hardware is faulty and insufficient to implement computational fault-tolerance, such that computation results will be faulty, too. Accordingly, compilers need to optimise the depth and the gate count of the compiled circuits, because these influence the aggregated computation result error. This work discusses the complexity of compilation with a particular focus on the search space structure. The presented analysis decomposes the compilation problem into three combinatorial subproblems for which heuristics can be determined. The search space structure is the result of analysing jointly the gate sequence of the input circuit and its influence on how qubits have to be mapped to a NISQ architecture. These findings support the development of future NISQ compilers.
https://arxiv.org/abs/1806.07241
Generative adversarial networks (GANs) are powerful generative models based on providing feedback to a generative network via a discriminator network. However, the discriminator usually assesses individual samples. This prevents the discriminator from accessing global distributional statistics of generated samples, and often leads to mode dropping: the generator models only part of the target distribution. We propose to feed the discriminator with mixed batches of true and fake samples, and train it to predict the ratio of true samples in the batch. The latter score does not depend on the order of samples in a batch. Rather than learning this invariance, we introduce a generic permutation-invariant discriminator architecture. This architecture is provably a universal approximator of all symmetric functions. Experimentally, our approach reduces mode collapse in GANs on two synthetic datasets, and obtains good results on the CIFAR10 and CelebA datasets, both qualitatively and quantitatively.
https://arxiv.org/abs/1806.07185
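A DeepSets-style sketch of a permutation-invariant, ratio-predicting discriminator: per-sample features are mean-pooled across the batch (a symmetric operation) before a head predicts the fraction of real samples. The paper's architecture and its universal-approximation construction are more elaborate; the layer sizes here are illustrative.

```python
import torch
import torch.nn as nn

class BatchRatioDiscriminator(nn.Module):
    """Permutation-invariant critic: encode each sample, mean-pool across the
    batch (order-invariant), then predict the fraction of real samples that the
    mixed real/fake batch contains."""
    def __init__(self, data_dim, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, batch):                  # batch: (N, data_dim), mixed real/fake
        pooled = self.encoder(batch).mean(0)   # symmetric pooling over the batch
        return self.head(pooled)               # predicted ratio of real samples in [0, 1]
```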
We empirically investigate learning from partial feedback in neural machine translation (NMT), when partial feedback is collected by asking users to highlight a correct chunk of a translation. We propose a simple and effective way of utilizing such feedback in NMT training. We demonstrate how the common machine translation problem of domain mismatch between training and deployment can be reduced solely based on chunk-level user feedback. We conduct a series of simulation experiments to test the effectiveness of the proposed method. Our results show that chunk-level feedback outperforms sentence-based feedback by up to 2.61% BLEU absolute.
https://arxiv.org/abs/1806.07169
Recent work (Pennington et al., 2017) suggests that controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning. Motivated by this, we study the distribution of singular values of the Jacobian of the generator in Generative Adversarial Networks (GANs). We find that this Jacobian generally becomes ill-conditioned at the beginning of training. Moreover, we find that the average (with z from p(z)) conditioning of the generator is highly predictive of two other ad-hoc metrics for measuring the ‘quality’ of trained GANs: the Inception Score and the Fréchet Inception Distance (FID). We test the hypothesis that this relationship is causal by proposing a ‘regularization’ technique (called Jacobian Clamping) that softly penalizes the condition number of the generator Jacobian. Jacobian Clamping improves the mean Inception Score and the mean FID for GANs trained on several datasets. It also greatly reduces inter-run variance of the aforementioned scores, addressing (at least partially) one of the main criticisms of GANs.
https://arxiv.org/abs/1802.08768
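A hedged sketch of the Jacobian Clamping idea as described in the abstract: estimate, via a finite difference in latent space, how much the generator stretches a small random perturbation, and softly penalize stretch factors outside a target band. The band limits and perturbation scale below are placeholders, and the paper's exact formulation may differ in detail.

```python
import torch

def jacobian_clamping_penalty(generator, z, eps=1e-2, lam_min=1.0, lam_max=20.0):
    """Finite-difference sketch: measure ||G(z) - G(z + delta)|| / ||delta|| for a
    small random delta and softly penalise values outside [lam_min, lam_max],
    which indirectly constrains the conditioning of the generator Jacobian."""
    delta = torch.randn_like(z)
    delta = eps * delta / delta.norm(dim=1, keepdim=True)   # fixed-norm perturbation
    gz, gz_eps = generator(z), generator(z + delta)
    q = (gz - gz_eps).flatten(1).norm(dim=1) / delta.norm(dim=1)  # stretch factor
    penalty = (torch.clamp(q - lam_max, min=0.0) ** 2
               + torch.clamp(lam_min - q, min=0.0) ** 2)
    return penalty.mean()   # added to the generator loss with some weight
```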
We study the robustness of object detection under the presence of missing annotations. In this setting, the unlabeled object instances will be treated as background, which will generate an incorrect training signal for the detector. Interestingly, we observe that after dropping 30% of the annotations (and labeling them as background), the performance of CNN-based object detectors like Faster-RCNN only drops by 5% on the PASCAL VOC dataset. We provide a detailed explanation for this result. To further bridge the performance gap, we propose a simple yet effective solution, called Soft Sampling. Soft Sampling re-weights the gradients of RoIs as a function of overlap with positive instances. This ensures that the uncertain background regions are given a smaller weight compared to the hard negatives. Extensive experiments on curated PASCAL VOC datasets demonstrate the effectiveness of the proposed Soft Sampling method at different annotation drop rates. Finally, we show that on OpenImagesV3, which is a real-world dataset with missing annotations, Soft Sampling outperforms standard detection baselines by over 3%.
https://arxiv.org/abs/1806.06986
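The abstract states that background RoI gradients are re-weighted as a function of overlap with annotated positives, but does not give the exact function. The snippet below is a deliberately simple, hypothetical weighting that captures the stated behavior (isolated background regions receive less weight than hard negatives near positives); it is not the paper's formula.

```python
import numpy as np

def soft_sampling_weight(overlap, floor=0.25):
    """Illustrative weighting: background RoIs with high max-IoU to an annotated
    positive are likely true negatives and keep (near) full weight, while
    isolated background RoIs -- which may contain unlabeled objects -- are
    down-weighted towards `floor`. The paper's exact function may differ."""
    return floor + (1.0 - floor) * overlap  # overlap = max IoU with any positive

# Usage: scale each background RoI's classification-loss gradient by this weight.
weights = soft_sampling_weight(np.array([0.0, 0.3, 0.6]))
```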
We address the problem of story-based temporal summarization of long 360° videos. We propose a novel memory network model named Past-Future Memory Network (PFMN), in which we first compute the scores of 81 normal field of view (NFOV) region proposals cropped from the input 360° video, and then recover a latent, collective summary using the network with two external memories that store the embeddings of previously selected subshots and future candidate subshots. Our major contributions are two-fold. First, our work is the first to address story-based temporal summarization of 360° videos. Second, our model is the first attempt to leverage memory networks for video summarization tasks. For evaluation, we perform three sets of experiments. First, we investigate the view selection capability of our model on the Pano2Vid dataset. Second, we evaluate the temporal summarization with a newly collected 360° video dataset. Finally, we evaluate our model's performance in another domain, the image-based storytelling VIST dataset. We verify that our model achieves state-of-the-art performance on all these tasks.
https://arxiv.org/abs/1805.02838
This paper presents a method for detecting salient objects in videos, where temporal information in addition to spatial information is fully taken into account. Following recent reports on the advantage of deep features over conventional hand-crafted features, we propose a new set of SpatioTemporal Deep (STD) features that utilize local and global contexts over frames. We also propose a new SpatioTemporal Conditional Random Field (STCRF) to compute saliency from STD features. STCRF is our extension of CRF to the temporal domain and describes the relationships among neighboring regions both in a frame and over frames. STCRF leads to temporally consistent saliency maps over frames, contributing to the accurate detection of salient objects' boundaries and noise reduction during detection. Our proposed method first segments an input video into multiple scales and then computes a saliency map at each scale level using STD features with STCRF. The final saliency map is computed by fusing the saliency maps at different scale levels. Our experiments, using publicly available benchmark datasets, confirm that the proposed method significantly outperforms state-of-the-art methods. We also applied our saliency computation to the video object segmentation task, showing that our method outperforms existing video object segmentation methods.
https://arxiv.org/abs/1708.01447
Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well-known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlates well with human judgments, but fails to capture the syntactic structure of a sentence. To address these two challenges, we propose a novel learning-based discriminative evaluation metric that is directly trained to distinguish between human and machine-generated captions. In addition, we further propose a data augmentation scheme to explicitly incorporate pathological transformations as negative examples during training. The proposed metric is evaluated with three kinds of robustness tests and its correlation with human judgments. Extensive experiments show that the proposed data augmentation scheme not only makes our metric more robust toward several pathological transformations, but also improves its correlation with human judgments. Our metric outperforms other metrics on both caption-level human correlation in Flickr 8k and system-level human correlation in COCO. The proposed approach could serve as a learning-based evaluation metric that is complementary to existing rule-based metrics.
https://arxiv.org/abs/1806.06422
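As a rough sketch of a learned discriminative caption metric, the model below scores an (image feature, caption feature) pair as human-like or not; the feature extractors, dimensions, and the exact negative-sampling scheme (machine captions plus pathological transformations) are assumptions standing in for the paper's design.

```python
import torch
import torch.nn as nn

class CaptionDiscriminator(nn.Module):
    """Learned metric sketch: predict the probability that a caption is
    human-written for a given image, from pooled image and caption features
    produced by upstream encoders (assumed to exist)."""
    def __init__(self, img_dim=2048, txt_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, img_feat, cap_feat):
        return self.net(torch.cat([img_feat, cap_feat], dim=-1))

# Training would use human captions as positives, and machine captions plus
# pathological transformations (e.g. word shuffles) as augmented negatives.
```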
Fine-grained object categorization aims at distinguishing objects of subordinate categories that belong to the same entry-level object category. The task is challenging due to the facts that (1) training images with ground-truth labels are difficult to obtain, and (2) variations among different subordinate categories are subtle. It is well established that the characterizing features of different subordinate categories are located on local parts of object instances. In fact, careful part annotations are available in many fine-grained categorization datasets. However, manually annotating object parts requires expertise, which is also difficult to generalize to new fine-grained categorization tasks. In this work, we propose a Weakly Supervised Part Detection Network (PartNet) that is able to detect discriminative local parts for use in fine-grained categorization. A vanilla PartNet builds, on top of a base subnetwork, two parallel streams of upper network layers, which respectively compute scores of classification probabilities (over subordinate categories) and detection probabilities (over a specified number of discriminative part detectors) for local regions of interest (RoIs). The image-level prediction is obtained by aggregating element-wise products of these region-level probabilities. To generate a diverse set of RoIs as inputs to PartNet, we propose a simple Discretized Part Proposals (DPP) module that directly proposes candidates of discriminative local parts, without bridging via object-level proposals. Experiments on the benchmark CUB-200-2011 and Oxford Flower 102 datasets show the efficacy of our proposed method for both discriminative part detection and fine-grained categorization. In particular, we achieve the new state-of-the-art performance on the CUB-200-2011 dataset when ground-truth part annotations are not available.
https://arxiv.org/abs/1806.06198
Object tracking is the cornerstone of many visual analytics systems. While considerable progress has been made in this area in recent years, robust, efficient, and accurate tracking in real-world video remains a challenge. In this paper, we present a hybrid tracker that leverages motion information from the compressed video stream and a general-purpose semantic object detector acting on decoded frames to construct a fast and efficient tracking engine. The proposed approach is compared with several well-known recent trackers on the OTB tracking dataset. The results indicate advantages of the proposed method in terms of speed and/or accuracy. Other desirable features of the proposed method are its simplicity and deployment efficiency, which stem from the fact that it reuses the resources and information that may already exist in the system for other reasons.
https://arxiv.org/abs/1805.00107
In this paper, we provide two new stable online algorithms for the problem of prediction in reinforcement learning, \emph{i.e.}, estimating the value function of a model-free Markov reward process using the linear function approximation architecture, with memory and computation costs scaling quadratically in the size of the feature set. The algorithms employ the multi-timescale stochastic approximation variant of the very popular cross entropy (CE) optimization method, which is a model-based search method for finding the global optimum of a real-valued function. A proof of convergence of the algorithms using the ODE method is provided. We supplement our theoretical results with experimental comparisons. The algorithms achieve good performance fairly consistently on many RL benchmark problems with regard to computational efficiency, accuracy and stability.
https://arxiv.org/abs/1806.06720
Deep neural networks have achieved impressive success in large-scale visual object recognition tasks with a predefined set of classes. However, recognizing objects of novel classes unseen during training still remains challenging. The problem of detecting such novel classes has been addressed in the literature, but most prior works have focused on providing simple binary or regressive decisions, e.g., the output would be “known,” “novel,” or corresponding confidence intervals. In this paper, we study more informative novelty detection schemes based on a hierarchical classification framework. For an object of a novel class, we aim for finding its closest super class in the hierarchical taxonomy of known classes. To this end, we propose two different approaches termed top-down and flatten methods, and their combination as well. The essential ingredients of our methods are confidence-calibrated classifiers, data relabeling, and the leave-one-out strategy for modeling novel classes under the hierarchical taxonomy. Furthermore, our method can generate a hierarchical embedding that leads to improved generalized zero-shot learning performance in combination with other commonly-used semantic embeddings.
https://arxiv.org/abs/1804.00722
Generative Adversarial Networks (GANs) are a promising approach to language generation. The latest works introducing novel GAN models for language generation use n-gram based metrics for evaluation and only report single scores of the best run. In this paper, we argue that this often misrepresents the true picture and does not tell the full story, as GAN models can be extremely sensitive to the random initialization and small deviations from the best hyperparameter choice. In particular, we demonstrate that the previously used BLEU score is not sensitive to semantic deterioration of generated texts and propose alternative metrics that better capture the quality and diversity of the generated samples. We also conduct a set of experiments comparing a number of GAN models for text with a conventional Language Model (LM) and find that none of the considered models performs convincingly better than the LM.
https://arxiv.org/abs/1806.04936
The state of the art in handling rich morphology in neural machine translation (NMT) is to break word forms into subword units, so that the overall vocabulary size of these units fits the practical limits given by the NMT model and GPU memory capacity. In this paper, we compare two common but linguistically uninformed methods of subword construction (BPE and STE, the method implemented in the Tensor2Tensor toolkit) and two linguistically-motivated methods: Morfessor and a novel method based on a derivational dictionary. Our experiments with German-to-Czech translation, both languages being morphologically rich, document that, so far, the linguistically uninformed methods perform better. Furthermore, we identify a critical difference between BPE and STE and show a simple pre-processing step for BPE that considerably increases translation quality as evaluated by automatic measures.
https://arxiv.org/abs/1806.05482
Although it has long been believed that modeling relations between objects would help object recognition, there has been no evidence that the idea works in the deep learning era. All state-of-the-art object detection systems still rely on recognizing object instances individually, without exploiting their relations during learning. This work proposes an object relation module. It processes a set of objects simultaneously through interaction between their appearance features and geometry, thus allowing modeling of their relations. It is lightweight and in-place. It does not require additional supervision and is easy to embed in existing networks. It is shown to be effective at improving the object recognition and duplicate removal steps in the modern object detection pipeline. It verifies the efficacy of modeling object relations in CNN-based detection. It gives rise to the first fully end-to-end object detector.
https://arxiv.org/abs/1711.11575
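A stripped-down PyTorch sketch of the relation idea: each object feature is augmented, in place and with a residual connection, by an attention-weighted sum over the other objects' features. The paper's module additionally folds relative box geometry into the attention weights and uses multiple relation heads; that part is omitted here, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleRelationModule(nn.Module):
    """Reduced relation module: appearance-only attention among the detected
    objects, added residually to each object's feature. The full module also
    encodes relative box geometry in the attention weights."""
    def __init__(self, dim=1024, key_dim=64):
        super().__init__()
        self.wq = nn.Linear(dim, key_dim)
        self.wk = nn.Linear(dim, key_dim)
        self.wv = nn.Linear(dim, dim)

    def forward(self, feats):                       # feats: (N_objects, dim)
        scores = self.wq(feats) @ self.wk(feats).t() / self.wq.out_features ** 0.5
        attn = F.softmax(scores, dim=-1)            # (N_objects, N_objects)
        return feats + attn @ self.wv(feats)        # residual, in-place augmentation
```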
Sentences in a well-formed text are connected to each other via various links that form the cohesive structure of the text. Current neural machine translation (NMT) systems translate a text in a conventional sentence-by-sentence fashion, ignoring such cross-sentence links and dependencies. This may lead to incoherent target texts for coherent source texts. In order to handle this issue, we propose a cache-based approach to modeling coherence for neural machine translation by capturing contextual information either from recently translated sentences or the entire document. Particularly, we explore two types of caches: a dynamic cache, which stores words from the best translation hypotheses of preceding sentences, and a topic cache, which maintains a set of target-side topical words that are semantically related to the document to be translated. On this basis, we build a new layer to score target words in these two caches with a cache-based neural model. The estimated probabilities from the cache-based neural model are then combined with the NMT probabilities into the final word prediction probabilities via a gating mechanism. Finally, the proposed cache-based neural model is trained jointly with the NMT system in an end-to-end manner. Experiments and analysis presented in this paper demonstrate that the proposed cache-based model achieves substantial improvements over several state-of-the-art SMT and NMT baselines.
https://arxiv.org/abs/1711.11221
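The gating step is simple enough to illustrate directly. In the sketch below, a scalar gate computed from the decoder state mixes the NMT word distribution with a distribution derived from the cache scores; the shapes and the way cache scores are turned into probabilities are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CacheGate(nn.Module):
    """Gating sketch: mix the NMT word distribution with a cache-derived
    distribution, with a per-step gate computed from the decoder state."""
    def __init__(self, state_dim):
        super().__init__()
        self.gate = nn.Linear(state_dim, 1)

    def forward(self, decoder_state, p_nmt, p_cache):
        # decoder_state: (B, state_dim); p_nmt, p_cache: (B, vocab) distributions
        g = torch.sigmoid(self.gate(decoder_state))   # (B, 1), one gate per step
        return g * p_nmt + (1.0 - g) * p_cache        # final word probabilities
```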
Using pre-trained word embeddings as the input layer is a common practice in many natural language processing (NLP) tasks, but it is largely neglected for neural machine translation (NMT). In this paper, we conduct a systematic analysis of the effect of using pre-trained source-side monolingual word embeddings in NMT. We compare several strategies, such as fixing or updating the embeddings during NMT training on varying amounts of data, and we also propose a novel strategy called dual-embedding that blends the fixing and updating strategies. Our results suggest that pre-trained embeddings can be helpful if properly incorporated into NMT, especially when parallel data is limited or additional in-domain monolingual data is readily available.
https://arxiv.org/abs/1806.01515
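The abstract does not detail the dual-embedding blend, so the following is one plausible reading: keep a frozen copy of the pre-trained embeddings alongside a trainable copy and combine the two per token. The projection-after-concatenation combination below is an assumption made for illustration only.

```python
import torch
import torch.nn as nn

class DualEmbedding(nn.Module):
    """Dual-embedding sketch: a frozen copy of the pre-trained embeddings keeps
    the pre-trained information, a trainable copy adapts to the translation
    task, and a projection blends the two per token."""
    def __init__(self, pretrained: torch.Tensor):
        super().__init__()
        vocab, dim = pretrained.shape
        self.fixed = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.tuned = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, tokens):                                   # tokens: (B, T) ids
        both = torch.cat([self.fixed(tokens), self.tuned(tokens)], dim=-1)
        return self.proj(both)                                   # (B, T, dim)
```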
This work demonstrates the first nonpolar vertical GaN-on-GaN p-n power diodes grown on m-plane free-standing substrates by MOCVD. The SEM and HRXRD results showed the good crystal quality of the homoepitaxial nonpolar structure with low defect densities. The CL result confirmed that the nonpolar p-GaN was of high quality with considerably reduced deep-level states. At forward bias, the device showed good rectifying behavior with a turn-on voltage of 4.0 V, an on-resistance of 2.3 m$\Omega\,$cm$^2$, and a high on/off ratio of 10$^{10}$. At reverse bias, the current leakage and breakdown were described by the trap-assisted space-charge-limited current conduction mechanism, where $I \propto V^{4.5}$. The critical electric field was calculated to be 2.0 MV/cm without field plates or edge termination, which is the highest value reported for nonpolar power devices. The high-performance m-plane p-n diodes can serve as key building blocks to further develop nonpolar GaN power electronics and polarization-engineering-related advanced power device structures for power conversion applications.
https://arxiv.org/abs/1806.05308
In this work we introduce impostor networks, an architecture that makes it possible to perform fine-grained recognition with high accuracy using a lightweight convolutional network, making it particularly suitable for fine-grained applications on low-power and non-GPU-enabled platforms. Impostor networks compensate for the lightness of their `backend' network by combining it with a lightweight non-parametric classifier. The combination of a convolutional network and such a non-parametric classifier is trained in an end-to-end fashion. Similarly to convolutional neural networks, impostor networks can fit large-scale training datasets very well, while also being able to generalize to new data points. At the same time, the bulk of computations within impostor networks happens through nearest neighbor search in high dimensions. Such search can be performed efficiently on a variety of architectures, including standard CPUs, where deep convolutional networks are inefficient. In a series of experiments with three fine-grained datasets, we show that impostor networks are able to boost the classification accuracy of a moderate-sized convolutional network considerably at a very small computational cost.
https://arxiv.org/abs/1806.05217
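The abstract describes the classifier only as lightweight and non-parametric, with inference dominated by high-dimensional nearest-neighbor search. As a stand-in, the snippet below shows the simplest such rule (1-nearest-neighbor over reference embeddings produced by the CNN backend); the actual impostor classifier and its end-to-end training are more specific than this.

```python
import numpy as np

def nearest_neighbor_classify(query_emb, ref_embs, ref_labels):
    """Non-parametric 'frontend' sketch: classify an embedded query by the label
    of its nearest reference embedding. The CNN 'backend' that produces all the
    embeddings is assumed to be trained jointly with the classifier."""
    distances = np.linalg.norm(ref_embs - query_emb[None, :], axis=1)
    return ref_labels[int(np.argmin(distances))]
```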
We tackle the problem of one-shot segmentation: finding and segmenting a previously unseen object in a cluttered scene based on a single instruction example. We propose a novel dataset, which we call $\textit{cluttered Omniglot}$. Using a baseline architecture combining a Siamese embedding for detection with a U-net for segmentation we show that increasing levels of clutter make the task progressively harder. Using oracle models with access to various amounts of ground-truth information, we evaluate different aspects of the problem and show that in this kind of visual search task, detection and segmentation are two intertwined problems, the solution to each of which helps solving the other. We therefore introduce $\textit{MaskNet}$, an improved model that attends to multiple candidate locations, generates segmentation proposals to mask out background clutter and selects among the segmented objects. Our findings suggest that such image recognition models based on an iterative refinement of object detection and foreground segmentation may provide a way to deal with highly cluttered scenes.
https://arxiv.org/abs/1803.09597
Visual question answering (VQA) requires joint comprehension of images and natural language questions, where many questions cannot be directly or clearly answered from the visual content alone but require reasoning over structured human knowledge with confirmation from the visual content. This paper proposes a visual knowledge memory network (VKMN) to address this issue, which seamlessly incorporates structured human knowledge and deep visual features into memory networks in an end-to-end learning framework. Compared to existing methods for leveraging external knowledge to support VQA, this paper stresses two missing mechanisms. First is the mechanism for integrating visual contents with knowledge facts. VKMN handles this issue by embedding knowledge triples (subject, relation, target) and deep visual features jointly into the visual knowledge features. Second is the mechanism for handling multiple knowledge facts expanding from question and answer pairs. VKMN stores joint embeddings using a key-value pair structure in the memory networks so that it is easy to handle multiple facts. Experiments show that the proposed method achieves promising results on both the VQA v1.0 and v2.0 benchmarks, while outperforming state-of-the-art methods on knowledge-reasoning related questions.
https://arxiv.org/abs/1806.04860
Neural Architecture Search (NAS) is a laborious process. Prior work on automated NAS targets mainly improving accuracy but lacks consideration of computational resource use. We propose the Resource-Efficient Neural Architect (RENA), an efficient resource-constrained NAS using reinforcement learning with network embedding. RENA uses a policy network to process the network embeddings and generate new configurations. We demonstrate RENA on image recognition and keyword spotting (KWS) problems. RENA can find novel architectures that achieve high performance even under tight resource constraints. For CIFAR10, it achieves a 2.95% test error when compute intensity is greater than 100 FLOPs/byte, and a 3.87% test error when the model size is less than 3M parameters. For the Google Speech Commands Dataset, RENA achieves the state-of-the-art accuracy without resource constraints, and it outperforms the optimized architectures with tight resource constraints.
https://arxiv.org/abs/1806.07912