A Large Parallel Corpus of Full-Text Scientific Articles

2019-05-06
Felipe Soares, Viviane Pereira Moreira, Karin Becker

Abstract

The Scielo database is an important source of scientific information in Latin America, containing articles from several research domains. A striking characteristic of Scielo is that many of its full-text contents are presented in more than one language, making it a potential source of parallel corpora. In this article, we present the development of a parallel corpus from Scielo in three languages: English, Portuguese, and Spanish. Sentences were automatically aligned using the Hunalign algorithm for all language pairs, and also for a subset of trilingual articles. We demonstrate the capabilities of our corpus by training a Statistical Machine Translation system (Moses) for each language pair, which outperformed related work on scientific articles. Sentence alignment was also manually evaluated, with an average of 98.8% of sentences correctly aligned across all languages. Our parallel corpus is freely available in the TMX format, with complementary information regarding article metadata.
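
Since the corpus is distributed as TMX, a reader may want to pull sentence pairs out of such a file. The following is a minimal sketch, assuming the standard TMX layout (tu elements containing tuv elements tagged with xml:lang, each holding a seg). The sample sentences, the language codes, and the read_pairs helper are hypothetical and are not taken from the released corpus, whose files may also carry extra metadata that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical TMX sample; the sentences are invented for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1.0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The results are shown in Table 1.</seg></tuv>
      <tuv xml:lang="pt"><seg>Os resultados são mostrados na Tabela 1.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Samples were collected weekly.</seg></tuv>
      <tuv xml:lang="es"><seg>Las muestras se recogieron semanalmente.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_pairs(tmx_text, src, tgt):
    """Return (source, target) sentence pairs for one language pair."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        # Map each language code in this translation unit to its segment text.
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        # Keep only units that contain both requested languages.
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


if __name__ == "__main__":
    for source, target in read_pairs(SAMPLE_TMX, "en", "pt"):
        print(source, "|||", target)
```

For a file on disk, ET.parse(path).getroot() could replace ET.fromstring; for very large files, ET.iterparse would avoid loading the whole tree into memory.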

URL

http://arxiv.org/abs/1905.01852

PDF

http://arxiv.org/pdf/1905.01852