Developing named-entity recognition for state authority archives

Year of publication

2025

Authors

Toivanen, Ida; Poso, Venla; Lipsanen, Mikko; Välisalo, Tanja

Abstract

Named entity recognition (NER) is one of the more common natural language processing tasks and usually entails detecting entities such as persons, locations and dates in textual data. Because of the bureaucratic language in data from state authority archives, existing NER models may not perform as well as the researchers using them would wish. The diversity of the archival data, which contains texts from different domains, together with noise from imperfect optical character recognition (OCR), creates challenges for NER. This gave us an incentive to train our own NER model, FinArcNER, and to see whether it would produce better classification results in an archival setting. The aim of our study was to answer the following research questions: 1) Does training with noisy archival data bring the needed improvement in model performance? 2) Does training with noisy archival data skew the results on non-archival data? The FinArcNER model shows consistent performance when tested with modern and archival data (F1 scores 0.9200 and 0.8710, respectively). We deduce from this that the increased diversity of the training data improved model performance: even though we included archival data with OCR noise, the model still learned to detect named entities correctly in noise-free, non-archival data.

Organizations and authors

University of Jyväskylä

Toivanen Ida

Poso Venla

Publication type

Publication format

Article

Parent publication type

Conference

Article type

Other article

Audience

Scientific

Peer-reviewed

Peer-Reviewed

MINEDU's publication type classification code

A4 Article in conference proceedings

Publication channel information

Open access

Open access in the publisher’s service

Yes

Open access of publication channel

Fully open publication channel

Self-archived

Yes

Other information

Fields of science

Computer and information sciences; History and archaeology; Other humanities

Publication country

Norway

Internationality of the publisher

International

Language

English

International co-publication

No

Co-publication with a company

No

DOI

10.5617/dhnbpub.12262

The publication is included in the Ministry of Education and Culture’s Publication data collection

Yes