Developing named-entity recognition for state authority archives
Year of publication
2025
Authors
Toivanen, Ida; Poso, Venla; Lipsanen, Mikko; Välisalo, Tanja
Abstract
Named entity recognition (NER) is one of the more common natural language processing tasks and usually entails detecting entities such as persons, locations and dates in textual data. Due to the bureaucratic language present in the data from state authority archives, existing NER models may not perform as well as the researchers using them would wish. The diversity of the archival data, which contains texts from different domains, as well as noise due to imperfect optical character recognition (OCR), creates challenges for NER. This gave us an incentive to train our own NER model, FinArcNER, and see whether it would produce better classification results in an archival setting. The aim of our study was to answer the following research questions: 1) Does training with noisy archival data bring the needed improvement to model performance? 2) Does training with noisy archival data skew the results on non-archival data? The FinArcNER model shows consistent performance when tested with modern and archival data (F1 scores 0.9200 and 0.8710, respectively). We can deduce from this that the increased diversity of the training data improved the model performance: even though we included archival data with OCR noise, the model still learned to detect named entities correctly in noise-free, non-archival data.
Organizations and authors
Publication type
Publication format
Article
Parent publication type
Conference
Article type
Other article
Audience
Scientific
Peer-reviewed
Peer-Reviewed
MINEDU's publication type classification code
A4 Article in conference proceedings
Publication channel information
Parent publication name
Publisher
Volume
7
Issue
3
ISSN
Publication forum
Publication forum level
1
Open access
Open access in the publisher’s service
Yes
Open access of publication channel
Fully open publication channel
Self-archived
Yes
Other information
Fields of science
Computer and information sciences; History and archaeology; Other humanities
Publication country
Norway
Internationality of the publisher
International
Language
English
International co-publication
No
Co-publication with a company
No
DOI
10.5617/dhnbpub.12262
The publication is included in the Ministry of Education and Culture’s Publication data collection
Yes