Abstract
Text-based person search aims to retrieve the corresponding person images from an image database given a sentence describing the person, which has great potential for applications such as video surveillance. Extracting the visual contents corresponding to the textual description is key to this cross-modal matching problem. Moreover, correlated images and descriptions involve different levels of semantic relevance. To exploit the multilevel relevance between the human description and the corresponding visual contents, we propose a pose-guided joint global and attentive local matching network (GALM), which includes global, uni-local and bi-local matching. The global matching network aims to learn global cross-modal representations. To further capture meaningful local relations, we propose a uni-local matching network that computes local similarities between image regions and the textual description and then utilizes a similarity-based hard attention to select the description-related image regions. In addition to sentence-level matching, fine-grained phrase-level matching is captured by the bi-local matching network, which employs pose information to learn latent semantic alignment between visual body parts and textual noun phrases. To verify the effectiveness of our model, we perform extensive experiments on the CUHK Person Description Dataset (CUHK-PEDES), which is currently the only available dataset for text-based person search. Experimental results show that our approach outperforms the state-of-the-art methods by 15% in terms of the top-1 metric.
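To make the uni-local matching idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of similarity-based hard attention: cosine similarities between image-region features and a sentence-level text feature are used to keep only the top-k description-related regions, whose similarities are then aggregated into a local matching score. The function name, the choice of k, and the mean aggregation are illustrative assumptions.

```python
# Hypothetical sketch of similarity-based hard attention for uni-local matching.
import torch
import torch.nn.functional as F

def uni_local_matching(region_feats, sent_feat, k=3):
    """region_feats: (R, D) features of R image regions.
    sent_feat: (D,) sentence-level text feature.
    Returns a local similarity score using top-k hard attention."""
    # Cosine similarity between each region and the sentence feature.
    sims = F.cosine_similarity(region_feats, sent_feat.unsqueeze(0), dim=1)  # (R,)
    # Hard attention: keep only the k regions most related to the description.
    topk_sims, _ = sims.topk(k)
    # Aggregate the selected local similarities (mean is an illustrative choice).
    return topk_sims.mean()

# Usage with random features (R=6 regions, D=256-dim embeddings).
regions = torch.randn(6, 256)
sentence = torch.randn(256)
score = uni_local_matching(regions, sentence, k=3)
print(score.item())
```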
URL
http://arxiv.org/abs/1809.08440