A Large Parallel Corpus of Full-Text Scientific Articles

Abstract

The Scielo database is an important source of scientific information in Latin America, containing articles from several research domains. A striking characteristic of Scielo is that many of its full-text contents are presented in more than one language, making it a potential source of parallel corpora. In this article, we present the development of a parallel corpus from Scielo in three languages: English, Portuguese, and Spanish. Sentences were automatically aligned using the Hunalign algorithm for all language pairs, and also for a subset of trilingual articles. We demonstrate the capabilities of our corpus by training a Statistical Machine Translation system (Moses) for each language pair, which outperformed related work on scientific articles. Sentence alignment was also manually evaluated, with an average of 98.8% correctly aligned sentences across all languages. Our parallel corpus is freely available in the TMX format, with complementary information regarding article metadata.
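
As an illustration of how aligned sentences in a TMX file might be consumed, the following is a minimal sketch using only the Python standard library. The sample document, the `read_pairs` helper, and the language codes `en`, `pt`, and `es` are assumptions made for this example; the actual corpus files may use different codes or carry additional metadata elements.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample in TMX layout; these sentences are NOT from the corpus.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The results are shown below.</seg></tuv>
      <tuv xml:lang="pt"><seg>Os resultados são mostrados abaixo.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Samples were collected weekly.</seg></tuv>
      <tuv xml:lang="es"><seg>Las muestras se recolectaron semanalmente.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_pairs(tmx_text, src, tgt):
    """Return (source, target) sentence pairs for one language pair.

    Translation units that lack either language are skipped.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


if __name__ == "__main__":
    for en, pt in read_pairs(SAMPLE_TMX, "en", "pt"):
        print(en, "|||", pt)
```

For a large corpus file, a streaming parser such as `ET.iterparse` would likely be preferable to loading the whole document into memory.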

URL

http://arxiv.org/abs/1905.01852

PDF

http://arxiv.org/pdf/1905.01852
