Making Sense of Bureaucratic Documents: Named Entity Recognition for State Authority Archives
Year of publication
2024
Authors
Poso, Venla; Lipsanen, Mikko; Toivanen, Ida; Välisalo, Tanja
Abstract
The usability and accessibility of digitised archival data can be improved using deep learning solutions. In this paper, the authors present their work in developing a named entity recognition (NER) model for digitised archival data, specifically state authority documents. The entities for the model were chosen based on surveying different user groups. In addition to common entities, two new entities were created to identify businesses (FIBC) and archival documents (JON). The NER model was trained by fine-tuning an existing Finnish BERT model. The training data also included modern digitally born texts to achieve good performance with various types of inputs. The finished model performs fairly well with OCR-processed data, achieving an overall F1 score of 0.868, and particularly well with the new entities (F1 scores of 0.89 and 0.97 for JON and FIBC, respectively).
Organizations and authors
Publication type
Publication format
Article
Parent publication type
Conference
Article type
Other article
Audience
Scientific
Peer-reviewed
Non Peer-Reviewed
MINEDU's publication type classification code
B3 Article in conference proceedings (non-peer-reviewed)
Publication channel information
Journal/Series
Archiving
Parent publication name
Conference
Archiving Conference
Publisher
Society for Imaging Science & Technology
Pages
6-10
ISSN
ISBN
Open access
Open access in the publisher’s service
Yes
Open access of publication channel
Fully open publication channel
Self-archived
Yes
Other information
Fields of science
Computer and information sciences; History and archaeology; Other humanities
Publication country
United States
Internationality of the publisher
International
Language
English
International co-publication
No
Co-publication with a company
No
DOI
10.2352/issn.2168-3204.2024.21.1.2
The publication is included in the Ministry of Education and Culture’s Publication data collection
Yes