Language Identification for Austronesian Languages

Dunn, J. & Nijhof, W. (2022). “Language Identification for Austronesian Languages.” In Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022). European Language Resources Association.

Abstract.

This paper provides language identification models for low- and under-resourced languages in the Pacific region, with a focus on previously unavailable Austronesian languages. Accurate language identification is an important part of developing language resources. The approach taken in this paper combines 29 Austronesian languages with 171 non-Austronesian languages to create an evaluation set drawn from eight data sources. After evaluating six approaches to language identification, we find that a classifier based on skip-gram embeddings achieves significantly higher performance than the alternative methods. We then systematically increase the number of non-Austronesian languages in the model, up to a total of 800 languages, to evaluate whether a larger language inventory leads to less precise predictions for the Austronesian languages of interest. This evaluation finds that increasing the inventory of non-Austronesian languages has only a minimal impact on accuracy. Further experiments adapt these language identification models for code-switching detection, achieving high accuracy across all 29 languages.
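The inventory-expansion evaluation described above can be illustrated with a minimal sketch: compute precision for the languages of interest only, once with a small inventory and once with a larger one, and compare. The function name `per_language_precision` and all labels below are hypothetical, invented for illustration; they are not the paper's data, code, or results, and the paper's own metric and setup may differ.

```python
from collections import Counter


def per_language_precision(gold, pred, targets):
    """Precision per target language: correct predictions of a label
    divided by all predictions of that label (0.0 if never predicted)."""
    predicted = Counter(pred)
    correct = Counter(p for g, p in zip(gold, pred) if g == p)
    return {
        lang: (correct[lang] / predicted[lang]) if predicted[lang] else 0.0
        for lang in targets
    }


# Toy ISO 639-3 labels, invented for illustration only.
targets = ["mri", "smo"]  # two Austronesian languages of interest

# Small inventory: every sample is labeled correctly.
gold_small = ["mri", "mri", "smo", "eng"]
pred_small = ["mri", "mri", "smo", "eng"]

# Larger inventory: two extra non-target languages are added, and one
# of their samples is mislabeled as a target language (a false positive).
gold_large = ["mri", "mri", "smo", "eng", "fra", "deu"]
pred_large = ["mri", "mri", "smo", "eng", "fra", "mri"]

small = per_language_precision(gold_small, pred_small, targets)
large = per_language_precision(gold_large, pred_large, targets)

for lang in targets:
    print(f"{lang}: small={small[lang]:.3f} large={large[lang]:.3f}")
```

In this toy case a single false positive from an added language lowers precision for one target language while the other is unaffected, which is the kind of effect the evaluation is designed to measure.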

pacificLID models: https://www.dropbox.com/sh/od1cfflcbqx5gfw/AACNhqvGmUqo6f4h9ujMYmVua?dl=0

pacific_CodeSwitch package: https://github.com/jonathandunn/pacific_CodeSwitch