Parallel Corpus of Finnish and Easy-to-read Finnish from the Yle News Archive 2014-2018, source

Description

This resource is available for download from Kielipankki, the Language Bank of Finland. It is a parallel corpus built from Yle news articles from 2014-2018 by aligning the standard Finnish versions with their easy-language versions. The dataset, created by Anna Dmitrieva and available in CSV format, is aligned at the document level. The news articles were obtained from datasets available via Kielipankki (http://urn.fi/urn:nbn:fi:lb-2017070501 and http://urn.fi/urn:nbn:fi:lb-2019050901). This dataset extends the previously published Parallel Corpus of Finnish and Easy-to-read Finnish from the Yle News Archive 2019-2020 (http://urn.fi/urn:nbn:fi:lb-2022111625).

Please note that this dataset has not been assessed by a human expert. The articles were aligned automatically with the Vecalign document alignment algorithm (https://github.com/thompsonb/vecalign) without candidate rescoring, using LASER embeddings (https://github.com/facebookresearch/LASER).

Description of all columns in the dataset:

- index_in_selko: This index consists of two parts separated by an underscore. The first (longer) part identifies the entire Easy Finnish article in the original dataset. The second (shorter) part is the paragraph number. Yle Selkosuomi articles usually consist of multiple paragraphs, each describing a separate piece of news, so each paragraph is represented as an individual short article in this dataset. Paragraph numbering starts at 0.
- index_in_regular: The identifier of the regular Finnish article, taken from the original dataset.
- selko_text: A piece of news in Easy Finnish.
- regular_text: The corresponding piece of news in regular Finnish.
- distance: The cosine distance between the document vectors. The lower the distance, the more similar the documents are.
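The column layout described above can be illustrated with a minimal Python sketch that reads the CSV, splits index_in_selko into its article identifier and paragraph number, and keeps only pairs below a chosen cosine distance. Several things here are assumptions rather than properties of the corpus: the sample rows and their identifiers are invented for illustration, the helper names split_selko_index and load_pairs are hypothetical, the cutoff of 0.5 is arbitrary, and the file is assumed to be comma-separated with a header row that matches the column names.

```python
import csv
import io


def split_selko_index(index_in_selko):
    """Split 'articleid_paragraph' into (article_id, paragraph_number).

    The split is made at the last underscore, on the assumption that the
    paragraph number is always the final part of the index.
    """
    article_id, paragraph = index_in_selko.rsplit("_", 1)
    return article_id, int(paragraph)


def load_pairs(csv_file, max_distance=None):
    """Read aligned pairs, optionally dropping those above max_distance."""
    pairs = []
    for row in csv.DictReader(csv_file):
        distance = float(row["distance"])
        # Lower cosine distance means more similar documents.
        if max_distance is not None and distance > max_distance:
            continue
        article_id, paragraph = split_selko_index(row["index_in_selko"])
        pairs.append({
            "selko_article": article_id,
            "selko_paragraph": paragraph,  # numbering starts at 0
            "regular_article": row["index_in_regular"],
            "selko_text": row["selko_text"],
            "regular_text": row["regular_text"],
            "distance": distance,
        })
    return pairs


# Invented sample rows; real identifiers and texts come from the corpus file.
SAMPLE = """index_in_selko,index_in_regular,selko_text,regular_text,distance
3-0000001_0,3-0000002,Easy text A,Regular text A,0.21
3-0000001_1,3-0000003,Easy text B,Regular text B,0.64
"""

# The 0.5 cutoff is an arbitrary illustration, not a recommended threshold.
pairs = load_pairs(io.StringIO(SAMPLE), max_distance=0.5)
for pair in pairs:
    print(pair["selko_article"], pair["selko_paragraph"], pair["distance"])
```

For the downloaded file, the same function could be given a file object opened with open(path, newline="", encoding="utf-8"), where both the path and the encoding are assumptions to check against the actual download. Because the alignment has not been reviewed by a human expert, any distance cutoff is best chosen by inspecting a sample of pairs.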
Year of publication

2024

Authors

Finnish Broadcasting Company (Yle) - Creator, Rights holder

Anna Dmitrieva - Curator, Creator

Other information

Fields of science

Languages

Language

Finnish

Open access

Restricted access

License

CLARIN ACA+NC (Academic, Non Commercial) End User License 1.0