Finnish News Agency Archive 1992-2018, CoNLL-U, source
Description
This is the parsed version of the Finnish News Agency Archive 1992-2018 corpus (http://urn.fi/urn:nbn:fi:lb-2019041501). The corpus was parsed by Khalid Alnajjar (University of Helsinki) using the Turku neural parser pipeline (http://turkunlp.org/Turku-neural-parser-pipeline/).
The Finnish News Agency Archive corpus comprises newswire articles in Finnish sent to media outlets by the Finnish News Agency (STT) between 1992 and 2018. The corpus includes about 2.8 million items in total. Most of the material consists of news articles, ranging from short “news flashes” to telegrams and longer articles. News articles are categorized by department (domestic, foreign, economy, politics, culture, entertainment and sports) as well as by metadata (IPTC subject categories or keywords and location data). The archive also includes other material that STT has created or forwarded, such as news planning lists, sports results, analysis articles and press releases.
The corpus is available as whole texts for non-commercial research through the download service korp.csc.fi/download. Access is granted on the basis of a research plan submitted with the application in Language Bank Rights.
Notes:
-) Headlines and news content were parsed and the output is in CoNLL-U Format.
-) Filenames in the original corpus are preserved; only the file extension was changed. This allows mapping the parsed corpus to the original corpus to obtain additional metadata if needed.
-) Files with the prefix "h_" contain the parsed headline. All other files contain the parsed news content.
-) Not all documents in the corpus contained a headline and/or news content. In such cases, the file was ignored.
-) The corpus contains some English documents, for which the output of the parser is usually incorrect. Language identification could be used to detect the English documents and handle them appropriately.
-) UralicNLP (https://github.com/mikahama/uralicNLP/wiki/UD-parser) can be used to read and process the parsed corpus in Python.
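As an illustration of the notes above, the following is a minimal sketch that reads CoNLL-U text using only the Python standard library and tells headline files from content files by the "h_" prefix. It is not the tooling used to build the corpus. The function names, the ".conllu" extension, the file name "h_12345.conllu" and the sample sentence are all assumptions made for the example; only the ten-column, tab-separated CoNLL-U layout and the "h_" prefix come from the format and the notes.

```python
from pathlib import Path

# The ten CoNLL-U columns, in the order fixed by the format.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Yield each sentence as a list of token dicts keyed by CONLLU_FIELDS."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comment lines are skipped in this sketch.
            continue
        else:
            columns = line.split("\t")
            # Keep only well-formed token lines with exactly ten columns.
            if len(columns) == len(CONLLU_FIELDS):
                sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        yield sentence


def describe_parsed_file(path):
    """Return (kind, stem) for a parsed file.

    kind is "headline" for files prefixed with "h_", otherwise "content".
    stem is the file name without that prefix and without the extension,
    which can then be matched against file names in the original corpus.
    """
    stem = Path(path).stem
    if stem.startswith("h_"):
        return "headline", stem[2:]
    return "content", stem


# Made-up sample for illustration only; it is not taken from the corpus.
SAMPLE = (
    "# text = Hallitus kokoontui.\n"
    "1\tHallitus\thallitus\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tkokoontui\tkokoontua\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    kind, stem = describe_parsed_file("h_12345.conllu")  # hypothetical name
    for tokens in read_conllu(SAMPLE):
        print(kind, stem, [(t["form"], t["lemma"], t["upos"]) for t in tokens])
```

To process a real file, the text could be obtained with Path(path).read_text(encoding="utf-8") and passed to read_conllu. The sketch does not treat multiword-token or empty-node lines specially; it keeps them as ordinary rows.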
Acknowledgments:
-) This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media).
-) The corpus was processed on the Finnish Grid and Cloud Infrastructure (urn:nbn:fi:research-infras-2016072533).
Licence: http://urn.fi/urn:nbn:fi:lb-2019041502
Year of publication
2020
Authors
Oy Suomen Tietotoimisto Finska Notisbyrån Ab - Creator
University of Helsinki - Curator
Other information
Fields of science
Languages
Language
Finnish
Open access
Restricted access