Geographically-Informed Language Identification

Dunn, J. and Edwards-Brown, L. (2024). “Geographically-Informed Language Identification.” In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC/COLING 2024). Abstract. This paper develops an approach to language identification in which the set of languages considered by the model depends on the geographic origin of the text in question. Given …

Validating and Exploring Large Geographic Corpora

Dunn, J. (2024). “Validating and Exploring Large Geographic Corpora.” In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC/COLING 2024). Abstract. This paper investigates the impact of corpus creation decisions on large multi-lingual geographic web corpora. Beginning with a 427 billion word corpus derived from the Common Crawl, three methods …

Register Variation Remains Stable Across 60 Languages

Li, H.; Dunn, J.; and Nini, A. (In Press). “Register Variation Remains Stable Across 60 Languages.” Corpus Linguistics and Linguistic Theory. Abstract. This paper measures the stability of cross-linguistic register variation. A register is a variety of a language that is associated with extra-linguistic context. The relationship between a register and its context is functional: …

Corpus similarity measures remain robust across diverse languages

Li, H. & Dunn, J. (2022). “Corpus similarity measures remain robust across diverse languages.” Lingua. Abstract. This paper experiments with frequency-based corpus similarity measures across 39 languages using a register prediction task. The goal is to quantify (i) the distance between different corpora from the same language and (ii) the homogeneity of individual corpora. Both …

Predicting Embedding Reliability in Low-Resource Settings

Dunn, J.; Li, H.; & Sastre, D. (2022). “Predicting Embedding Reliability in Low-Resource Settings Using Corpus Similarity Measures.” In Proceedings of the 13th International Conference on Language Resources and Evaluation. European Language Resources Association. 6461-6470. Abstract. This paper simulates a low-resource setting across 17 languages in order to evaluate embedding similarity, stability, and reliability under …

Representations of Language Varieties Are Reliable

Dunn, J. (2021). “Representations of Language Varieties Are Reliable Given Corpus Similarity Measures.” In Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties, and Dialects. Association for Computational Linguistics. 28-38. Abstract. This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the …

Measuring Linguistic Diversity During COVID-19

Dunn, J.; Coupe, T.; & Adams, B. (2020). “Measuring Linguistic Diversity During COVID-19.” Proceedings of the 4th Workshop on NLP and Computational Social Science. Association for Computational Linguistics. 1-10. Abstract. Computational measures of linguistic diversity help us understand the linguistic landscape using digital language data. The contribution of this paper is to calibrate measures of …

Mapping Languages: The Corpus of Global Language Use

Dunn, J. (2020). “Mapping Languages: The Corpus of Global Language Use.” Language Resources and Evaluation. 54: 999-1018. Abstract. This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages …

Geographically-Balanced Gigaword Corpora for 50 Language Varieties

Dunn, J. & Adams, B. (2020). “Geographically-Balanced Gigaword Corpora for 50 Language Varieties.” In Proceedings of the Language Resources and Evaluation Conference. European Language Resources Association. 2528-2536. Abstract. While text corpora have been steadily increasing in overall size, even very large corpora are not designed to represent global population demographics. For example, recent work has …

Mapping Languages and Demographics

Dunn, J. and Adams, B. (2019). “Mapping Languages and Demographics with Georeferenced Corpora.” In Proceedings of Geocomputation 2019. Abstract. This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets. The goal is to determine (i) which dataset best represents population demographics; (ii) in what parts …