
An Analysis of Lemmatization on Topic Models of Morphologically Rich Language

2019-05-10
Chandler May, Ryan Cotterell, Benjamin Van Durme

Abstract

Topic models are typically represented by top-$m$ word lists for human interpretation. The corpus is often pre-processed with lemmatization (or stemming) so that those representations are not undermined by a proliferation of words with similar meanings, but there is little public work on the effects of that pre-processing. Recent work studied the effect of stemming on topic models of English texts and found no supporting evidence for the practice. We study the effect of lemmatization on topic models of Russian Wikipedia articles, finding in one configuration that it significantly improves interpretability according to a word intrusion metric. We conclude that lemmatization may benefit topic models on morphologically rich languages, but that further investigation is needed.

URL

http://arxiv.org/abs/1608.03995

PDF

http://arxiv.org/pdf/1608.03995

