Europarl Parallel Corpus
Description
The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.
The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. We identified sentence boundaries using a preprocessor, then sentence-aligned the data using a tool based on the Gale and Church algorithm.
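As an illustration of length-based sentence alignment in the style of Gale and Church, here is a minimal Python sketch. It is not the tool used to build the corpus. The function names (`match_cost`, `align`) are hypothetical, and the bead priors and variance constant are the values commonly quoted from the Gale and Church (1993) paper, taken here as assumptions. The sketch scores candidate groupings ("beads") of 0 to 2 sentences per side by how well their character lengths agree, and uses dynamic programming to find the cheapest sequence of beads.

```python
import math

# Assumed constants, as commonly quoted from Gale and Church (1993).
PRIORS = {
    (1, 1): 0.89,    # one source sentence matches one target sentence
    (1, 0): 0.0099,  # source sentence with no counterpart
    (0, 1): 0.0099,  # target sentence with no counterpart
    (2, 1): 0.089,   # two source sentences merged into one target sentence
    (1, 2): 0.089,   # one source sentence split into two target sentences
    (2, 2): 0.011,   # two-to-two match
}
C = 1.0   # expected number of target characters per source character
S2 = 6.8  # variance of the length difference per character


def match_cost(len_src, len_tgt):
    """Negative log of the two-sided tail probability of the length mismatch."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # 2 * (1 - Phi(|delta|)) for a standard normal equals erfc(|delta| / sqrt(2)).
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300))  # clamp to avoid log(0)


def align(src, tgt):
    """Return a list of beads (source indices, target indices) of minimal cost."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            # Try extending the best path ending at (i, j) with each bead type.
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                total = cost[i][j] - math.log(prior) + match_cost(len_src, len_tgt)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the bead sequence.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


# Made-up example sentences, not drawn from the corpus.
english = ["The session is resumed.", "I declare the session open.", "Thank you."]
french = ["La session est reprise.", "Je déclare la séance ouverte.", "Merci."]
for src_idx, tgt_idx in align(english, french):
    print(src_idx, tgt_idx)
```

Each printed pair lists the source and target sentence indices grouped into one bead. Because the method relies only on sentence lengths, it needs no dictionary, which makes it practical across many language pairs.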
Year of publication
2020
Authors
University of Edinburgh
Philipp Koehn - Curator, Creator
Other information
Languages
Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish
Open access
Restricted access