Geographically-Informed Language Identification

Dunn, J. and Edwards-Brown, L. (2024). “Geographically-Informed Language Identification.” In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC/COLING 2024). 7672–7682. Abstract. This paper develops an approach to language identification in which the set of languages considered by the model depends on the geographic origin of the text in question. …

Validating and Exploring Large Geographic Corpora

Dunn, J. (2024). “Validating and Exploring Large Geographic Corpora.” In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC/COLING 2024). 17348–17358. Abstract. This paper investigates the impact of corpus creation decisions on large multi-lingual geographic web corpora. Beginning with a 427 billion word corpus derived from the Common Crawl, three …

Corpus similarity measures remain robust across diverse languages

Li, H. & Dunn, J. (2022). “Corpus similarity measures remain robust across diverse languages.” Lingua. Abstract. This paper experiments with frequency-based corpus similarity measures across 39 languages using a register prediction task. The goal is to quantify (i) the distance between different corpora from the same language and (ii) the homogeneity of individual corpora. Both …

Predicting Embedding Reliability in Low-Resource Settings

Dunn, J.; Li, H.; & Sastre, D. (2022). “Predicting Embedding Reliability in Low-Resource Settings Using Corpus Similarity Measures.” In Proceedings of the 13th International Conference on Language Resources and Evaluation. European Language Resources Association. 6461–6470. Abstract. This paper simulates a low-resource setting across 17 languages in order to evaluate embedding similarity, stability, and reliability under …