Making Sense of Bureaucratic Documents: Named Entity Recognition for State Authority Archives

Year of publication

2024

Authors

Poso, Venla; Lipsanen, Mikko; Toivanen, Ida; Välisalo, Tanja

Abstract

The usability and accessibility of digitised archival data can be improved using deep learning solutions. In this paper, the authors present their work in developing a named entity recognition (NER) model for digitised archival data, specifically state authority documents. The entities for the model were chosen based on surveying different user groups. In addition to common entities, two new entities were created to identify businesses (FIBC) and archival documents (JON). The NER model was trained by fine-tuning an existing Finnish BERT model. The training data also included modern digitally born texts to achieve good performance with various types of inputs. The finished model performs fairly well with OCR-processed data, achieving an overall F1 score of 0.868, and particularly well with the new entities (F1 scores of 0.89 and 0.97 for JON and FIBC, respectively).

Organizations and authors

University of Jyväskylä

Toivanen Ida

Välisalo Tanja

Poso Venla

Publication type

Publication format

Article

Parent publication type

Conference

Article type

Other article

Audience

Scientific

Peer-reviewed

Non Peer-Reviewed

MINEDU's publication type classification code

B3 Article in conference proceedings (non-peer-reviewed)

Publication channel information

Journal/Series

Archiving

Conference

Archiving Conference

Publisher

Society for Imaging Science & Technology

Pages

6-10

Open access

Open access in the publisher’s service

Yes

Open access of publication channel

Fully open publication channel

Self-archived

Yes

Other information

Fields of science

Computer and information sciences; History and archaeology; Other humanities

Publication country

United States

Internationality of the publisher

International

Language

English

International co-publication

No

Co-publication with a company

No

DOI

10.2352/issn.2168-3204.2024.21.1.2

The publication is included in the Ministry of Education and Culture’s Publication data collection

Yes