Europarl Parallel Corpus

Description

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek. The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. For this purpose, we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor, we identified sentence boundaries. We then sentence-aligned the data using a tool based on the Gale and Church algorithm.
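
To illustrate the length-based alignment idea behind the algorithm named above, here is a minimal Python sketch. It is not the tool used to build the corpus: the real algorithm scores candidate alignments with a probabilistic model of sentence-length ratios, whereas this sketch simply uses the absolute difference in character counts plus fixed penalties. The function name `align_by_length` and the parameters `skip_penalty` and `merge_penalty` are hypothetical and chosen only for illustration.

```python
def align_by_length(src, tgt, skip_penalty=50, merge_penalty=10):
    """Align two lists of sentences using character lengths only.

    Simplified illustration: the cost of a candidate pairing is the absolute
    difference in character counts plus a fixed penalty (an assumption of
    this sketch, not the probabilistic cost of the published algorithm).
    """
    # Allowed moves: (source sentences consumed, target sentences consumed, penalty)
    moves = [
        (1, 1, 0),              # one-to-one
        (1, 0, skip_penalty),   # source sentence with no counterpart
        (0, 1, skip_penalty),   # target sentence with no counterpart
        (2, 1, merge_penalty),  # two source sentences to one target sentence
        (1, 2, merge_penalty),  # one source sentence to two target sentences
    ]
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward dynamic programming over prefixes of both sentence lists.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + abs(src_len - tgt_len) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Trace back from the end to recover the aligned groups in order.
    aligned = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        aligned.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    aligned.reverse()
    return aligned
```

Each returned item pairs a group of source sentences with a group of target sentences; a one-to-two group means a single source sentence was matched to two consecutive target sentences. The penalty values are arbitrary and would need tuning on real data.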

Year of publication

2020

Type of data

Authors

University of Edinburgh

Philipp Koehn - Creator, Curator

Project

Other information

Fields of science

Languages

Language

Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish
