Aalto Finnish Parliament ASR Corpus 2008-2020, version 2

Description

# Aalto Finnish Parliament ASR Corpus 2008-2020, version 2

Short name: `fi-parliament-asr-v2`

Persistent Identifier of this resource: http://urn.fi/urn:nbn:fi:lb-2022052002

This corpus is extracted from the Finnish parliament plenary session transcripts and videos by the Aalto Speech Recognition group. The original session transcripts and videos are available at the web portals of the Parliament of Finland (avoindata.eduskunta.fi and verkkolahetys.eduskunta.fi).

The corpus is split into three parts:

1. 2015-2020 set
2. 2008-2016 set
3. Development and test sets

A non-overlapping combination of the 2008-2016 set and the 2015-2020 set forms a training set of size:

- 1 422 318 sample pairs
- 3 130 hours of speech
- 19 356 831 word tokens

All audio files in this corpus are single-channel WAV files with a 16 kHz sample rate and 16-bit precision. The transcript files (.trn) are plain text files.

See this GitHub repository for data preparation and baseline models using the Kaldi toolkit: https://github.com/aalto-speech/fin-parl-models

---

## 2015-2020 set

This subset is extracted from the Finnish parliament plenary session transcripts and videos by the Aalto Speech Recognition group in 2021.

The tools and code used to produce this subset:

- Preprocessing and postprocessing: https://github.com/aalto-speech/fi-parliament-tools
- Decoding and segmentation: Kaldi, https://github.com/kaldi-asr/kaldi

### Data

This subset contains samples of speech (.wav) and their corresponding transcripts (.trn) from sessions between 1/2015 and 104/2020. A few sessions that had a broken or empty session transcript are left out, so the session range has some gaps.

Samples are grouped by session. Each filename is formed from the following components:

> Filename (Kaldi-compatible utterance id): `<mpid>-<session_number>-<session_year>-<startsec>-<endsec>`
> e.g.: `00259-001-2015-00186868-00187044`

Further details:

| Component | Definition |
|:----------------:|:----------------------------------------------------------------------------------------------------------------------------:|
| `<mpid>` | The unique Member of Parliament identifier given to the MPs in the parliament's public databases. |
| `<session_number>` | A running number given to the plenary session which, together with the working year, uniquely identifies the session. |
| `<session_year>` | The parliamentary working year of the session. In election years, the working year differs from the calendar year. |
| `<startsec>` | The start timestamp of the segment in the full plenary session audio, in seconds with two decimals and no decimal point: 00186868 = 1868.68 s. |
| `<endsec>` | The end timestamp of the segment in the full plenary session audio, in the same format as the start timestamp. |

This subset is machine-extracted, so some inaccuracies remain in the samples. The audio quality also varies.

### Statistics

In total, there are:

- 984 676 sample pairs
- 1 780 hours of speech
- 11 234 724 word tokens

### Text data

This subset comes with a 10 million word token in-domain text corpus in the file `parl-full-transcripts-78-2016-104-2020.train`. It can be combined with the 20 million token text corpus that comes with the 2008-2016 set to form a 30 million token text corpus.

### Note about MPIDs

There is one speaker in this subset who is not an MP, Risto Hiekkataipale (MPID: 00002). His MPID is arbitrary.

The 2015-2020 set and the 2008-2016 set use different speaker IDs. A mapping is provided in `speaker-id-mapping.csv`.

---

## 2008-2016 set

This subset is extracted from the Finnish parliament plenary session transcripts and videos by the Aalto Speech Recognition group in 2017.

Code used to produce this subset:

- https://github.com/aalto-speech/finnish-parliament-scripts

### Data

This subset contains samples of speech (.wav) and their corresponding transcripts (.trn) from sessions between 71/2008 and 77/2016. A list of the samples from sessions held in 2008-2014, which do not overlap with samples in the 2015-2020 set, is provided in `2008-2014-samples.list`.

Samples are grouped by speaker. Each filepath is formed from the following components:

> Utterance id: `<speaker-id>/<speaker-name>_<sample-id>`
> e.g.: `0004/aila_paloniemi_00045.wav`

Further details:

| Component | Definition |
|:--------------:|:-----------------------------------------------:|
| `<speaker-id>` | A number identifier for the speaker. |
| `<speaker-name>` | Speaker's name in "firstname_lastname" order. |
| `<sample-id>` | A number identifier assigned to each sample. |

This subset is machine-extracted, so some inaccuracies remain in the samples. The audio quality also varies.

A mapping to the 2015-2020 set MP IDs is provided in `speaker-id-mapping.csv`.

### Splits

The paper "Automatic Construction of the Finnish Parliament Speech Corpus" by Mansikkaniemi et al. (see citation) uses training splits which are defined in the following files:

- `parl-all.train.list`
- `parl-400.train.list`
- `parl-60min.train.list`
- `parl-30min.train.list`

### Additional files

There are two additional files provided with the 2008-2016 set:

1. `dropped_duplicates.list`: Some utterances in the raw dataset have overlapping utterance ids. This file indicates which duplicates were dropped in the paper by Mansikkaniemi et al. (see citation). The `local/data_prep.sh` script in the GitHub repository https://github.com/aalto-speech/fin-parl-models can recreate the Kaldi input files for the 2008-2016 set used in Mansikkaniemi et al.
2. `utt2year`: This file maps utterance ids to the year they were spoken. It is compatible with the Kaldi input files created by the script `local/data_prep.sh` mentioned above.

### Statistics

In total, there are:

- 522 543 sample pairs
- 1 560 hours of speech
- 9 743 296 word tokens (in .trn files)

In the 2008-2014 subset, there are:

- 437 642 sample pairs
- 1 350 hours of speech
- 8 122 107 word tokens (in .trn files)

### Text data

This subset comes with a 20 million word token in-domain text corpus in the file `parl-transcripts.train`. The text corpus is extracted from the 2008-2016 session transcripts.

---

## Development and test sets

This subset contains the dev and test sets for the Finnish Parliament ASR corpus. There are three sets:

1. 2016-dev
2. 2016-test
3. 2020-test

The 2016 sets have been created with the same tools as the 2008-2016 train set. Similarly, the 2020 test set and the 2015-2020 train set have been created with the same pipeline. Each dev and test set has been cleaned and corrected by hand.

### Data

The 2016 sets contain samples of speech (.wav) and their corresponding transcripts (.trn) from the same sessions as the 2008-2016 train set. The samples are split into seen and unseen speakers. Read more about the seen/unseen split in the paper "Automatic Construction of the Finnish Parliament Speech Corpus" by Mansikkaniemi et al. (see citation below).

Each filename is formed from the following components:

> Utterance id: `<speaker-name>_<sample-id>`
> e.g.: `anne_mari_virolainen_04297.wav`

Further details:

| Component | Definition |
|:--------------:|:-----------------------------------------------:|
| `<speaker-name>` | Speaker's name in "firstname_lastname" order. |
| `<sample-id>` | A number identifier assigned to each sample. |

A mapping that connects `<speaker-name>` to the speaker IDs used in training sets 2008-2016 and 2015-2020 is provided in `dev-test-speakers.csv`.

The 2020 test set has been created from the sessions held in autumn 2020, ranging from 105/2020 to 170/2020. The data is in the same format as the 2015-2020 train set.

### Splits

The paper "Automatic Construction of the Finnish Parliament Speech Corpus" uses seen and unseen speaker splits for the 2016 dev and test sets. These splits are defined in the following files (subset duration, H:MM:SS, in parentheses):

- `seen_dev.list` (2:36:51)
- `seen_test.list` (2:53:51)
- `unseen_dev.list` (2:45:27)
- `unseen_test.list` (2:48:20)

More details can be found in the paper.

---

## Citation

The 2008-2016 set and the dev and test sets are detailed in the following publication:

```
@conference{Aaltodoc:http://urn.fi/URN:NBN:fi:aalto-201710157137,
  title={Automatic Construction of the Finnish Parliament Speech Corpus},
  author={Mansikkaniemi, Andre; Smit, Peter; Kurimo, Mikko},
  year={2017-08},
  language={en},
  pages={3762-3766},
  keyword={automatic speech recognition; speech-to-text alignment; DNN acoustic models; parliament speech data; transcribed speech corpus},
  series={Interspeech 2017},
  doi={10.21437/Interspeech.2017-1115},
  url={http://urn.fi/URN:NBN:fi:aalto-201710157137},
}
```

---

## License

See the `LICENSE.md` file.

---

## Contact

Authors: Anja Virkkunen, André Mansikkaniemi, and Mikko Kurimo of the Aalto Speech Recognition Group

Contact via kielipankki@csc.fi
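
---

## Example: parsing 2015-2020 utterance ids

The filename convention of the 2015-2020 set (MP id, session number, session year, start and end timestamps in hundredths of a second) can be unpacked with a few lines of Python. The following is a minimal sketch and not part of the corpus tooling: the names `UtteranceId` and `parse_utterance_id` are hypothetical, and the code assumes the id is passed without the `.wav` or `.trn` extension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UtteranceId:
    """Components of a 2015-2020 set utterance id (hypothetical helper type)."""

    mpid: str             # kept as a string to preserve leading zeros
    session_number: int
    session_year: int
    start_seconds: float  # offset in the full plenary session audio
    end_seconds: float


def parse_utterance_id(utt_id: str) -> UtteranceId:
    """Split '<mpid>-<session_number>-<session_year>-<startsec>-<endsec>'."""
    parts = utt_id.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dash-separated fields, got {len(parts)}: {utt_id!r}")
    mpid, session_number, session_year, start, end = parts
    return UtteranceId(
        mpid=mpid,
        session_number=int(session_number),
        session_year=int(session_year),
        # Timestamps are seconds with two decimals and no decimal point,
        # so dividing by 100 recovers seconds (00186868 -> 1868.68).
        start_seconds=int(start) / 100,
        end_seconds=int(end) / 100,
    )


example = parse_utterance_id("00259-001-2015-00186868-00187044")
duration = example.end_seconds - example.start_seconds
print(example, f"duration: {duration:.2f} s")
```

Keeping `mpid` as a string makes it straightforward to join against `speaker-id-mapping.csv`, where leading zeros may matter; this is an assumption about that file's format rather than something the description states.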
Year of publication

2022

Authors

FIN-CLARIN

User support at CSC - IT Center for Science Ltd. The Language Bank of Finland - Curator

The Parliament of Finland - Creator

Project

Other information

Fields of science

Languages

Language

Finnish

Open access

Open

License

CLARIN PUB (Public) End User License 1.0
