Unseen Languages in Language Identification

Description of the granted funding

One of the most challenging issues in language identification is the handling of texts written in unseen languages. Systems base their predictions on training corpora covering a finite number of languages. Almost without exception, the methods label whatever text they encounter as one of the languages in their repertoire. When they encounter text written in an unknown language, they label it with the language they deem closest. The result can range from a close relative of the actual language to a seemingly random choice. The handling of unseen languages was identified as an outstanding issue as early as 2006, yet it remains without any real solution. We have gathered a highly skilled group of collaborators with whom we will examine several case studies in which unseen languages pose practical problems for researchers or for the users of the language resources they create. We aim to significantly improve both the understanding of the phenomenon and the methods used to handle it.

Starting year

2025

End year

2029

Granted funding

Tommi Jauhiainen
695 479 €

Funder

Research Council of Finland

Funding instrument

Academy research fellows

Decision maker

Scientific Council for Social Sciences and Humanities
17.06.2025

Other information

Funding decision number

370756

Fields of science

Languages

Research fields

Linguistics