Albanian Language Identification in Text Documents

2019-01-14
Klesti Hoxha, Artur Baxhaku

Abstract

In this work we investigate the accuracy of standard and state-of-the-art language identification methods in identifying Albanian in written text documents. A dataset consisting of news articles written in Albanian was constructed for this purpose. We noticed a considerable decrease in accuracy on test documents that lack the Albanian alphabet letters "Ë" and "Ç", and created a custom training corpus that solved this problem, achieving an accuracy of more than 99%. Based on our experiments, the best-performing language identification methods for Albanian use a naïve Bayes classifier and n-gram based classification features.
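To make the approach named in the abstract concrete, here is a minimal sketch of a naïve Bayes language identifier over character n-gram counts. It is not the authors' implementation: the class `NgramNaiveBayes`, the helper `char_ngrams`, the n-gram range (1 to 3), the add-one smoothing, and the tiny training sentences are all illustrative assumptions. Adding diacritic-free copies of the Albanian sentences is likewise only one assumed way of covering documents that lack "Ë" and "Ç"; the abstract does not say how the custom corpus was built.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max (lowercased, space-padded)."""
    text = " " + text.lower() + " "
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-gram counts (hypothetical helper class)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.docs = Counter()               # language -> number of training documents
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, samples):
        for text, lang in samples:
            grams = char_ngrams(text)
            self.counts[lang].update(grams)
            self.docs[lang] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text)
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_lang, best_score = None, -math.inf
        for lang, counter in self.counts.items():
            total = sum(counter.values())
            # Log prior plus summed log likelihoods with add-one smoothing.
            score = math.log(self.docs[lang] / total_docs)
            for gram in grams:
                score += math.log((counter[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Toy training data, invented for illustration only (not the paper's corpus).
# Each Albanian sentence also appears without "ë"/"ç" (an assumption, see above).
TRAIN = [
    ("Unë jam nga Shqipëria dhe flas shqip.", "sq"),
    ("Une jam nga Shqiperia dhe flas shqip.", "sq"),
    ("Çfarë është kjo dhe ku është shtëpia?", "sq"),
    ("Cfare eshte kjo dhe ku eshte shtepia?", "sq"),
    ("I am from Albania and I speak Albanian.", "en"),
    ("I come from England and I speak English.", "en"),
    ("What is this and where is the house?", "en"),
    ("Where is the house and what is that?", "en"),
]

clf = NgramNaiveBayes().fit(TRAIN)
print(clf.predict("ku eshte shtepia"))    # Albanian written without diacritics
print(clf.predict("where is the house"))
```

A corpus this small only illustrates the mechanics; the accuracy figures in the abstract come from the authors' own dataset and experiments.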

URL

http://arxiv.org/abs/1901.04216

PDF

http://arxiv.org/pdf/1901.04216

