Mapping Languages: The Corpus of Global Language Use

[Draft: Corpus_of_Global_Language_Use.Revisions.Preprint] [Data: The Corpus of Global Language Use v.4.2] Abstract: This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used (e.g., English, Arabic, Russian) together …

Geographically-Balanced Gigaword Corpora for 50 Language Varieties

[Read draft here: LREC_2020] [GeoWAC Corpus Family] Abstract: While text corpora have been steadily increasing in overall size, even very large corpora are not designed to represent global population demographics. For example, recent work has shown that existing English gigaword corpora over-represent inner-circle varieties from the US and the UK (Dunn, 2019b). To correct implicit …

Mapping Languages and Demographics with Georeferenced Corpora

[Read full-text] Co-authored with Ben Adams. Abstract: This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets. The goal is to determine (i) which dataset best represents population demographics; (ii) in what parts of the world the datasets are most representative of actual populations; and …

Computational Learning of Construction Grammars

[Read Full-Text: Computational Learning of Construction Grammars] [Original Data: https://drive.google.com/open?id=0B6oBPlj4dynZTFlXV1JNbF9GNEE] [Current code: https://github.com/jonathandunn/c2xg] This paper presents an algorithm for learning the construction grammar of a language from a large corpus. This grammar induction algorithm has two goals: first, to show that construction grammars are learnable without highly specified innate structure; second, to develop a model …

Multi-Unit Directional Measures of Association: Moving Beyond Pairs of Words

[Read Full Text: Dunn.Association Measures] [Data: https://drive.google.com/open?id=1Nc8huc0QEbNQ4WY3-4sqLVatbB0h0CC7] This paper formulates and evaluates a series of multi-unit measures of directional association, building on the pairwise ΔP measure, that are able to quantify association in sequences of varying length and type of representation. Multi-unit measures face an additional segmentation problem: once the implicit length constraint of pairwise …
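
For readers unfamiliar with the pairwise ΔP measure mentioned above, the following is a minimal sketch of its standard two-word form: ΔP(Y|X) = P(Y|X) − P(Y|¬X), with the right-to-left direction defined symmetrically. It does not reproduce the paper's multi-unit measures. The function names `bigrams` and `delta_p` are hypothetical, the toy sentence is invented for illustration, and treating a zero denominator as 0.0 is an assumption made only to keep the sketch runnable.

```python
def bigrams(tokens):
    """Return adjacent (left, right) word pairs from a token list."""
    return list(zip(tokens, tokens[1:]))


def delta_p(pairs, x, y):
    """Return (left-to-right, right-to-left) pairwise Delta-P for x followed by y.

    Contingency counts over all adjacent pairs:
      a = x followed by y        b = x followed by something else
      c = something else, then y d = neither x on the left nor y on the right
    """
    a = b = c = d = 0
    for left, right in pairs:
        if left == x and right == y:
            a += 1
        elif left == x:
            b += 1
        elif right == y:
            c += 1
        else:
            d += 1

    def ratio(num, den):
        # Assumption: an undefined proportion is treated as 0.0.
        return num / den if den else 0.0

    # P(y | x) - P(y | not x)
    left_to_right = ratio(a, a + b) - ratio(c, c + d)
    # P(x | y) - P(x | not y)
    right_to_left = ratio(a, a + c) - ratio(b, b + d)
    return left_to_right, right_to_left


if __name__ == "__main__":
    tokens = "the cat sat on the mat the cat ran".split()
    lr, rl = delta_p(bigrams(tokens), "the", "cat")
    print(f"Delta-P left-to-right: {lr:.3f}")
    print(f"Delta-P right-to-left: {rl:.3f}")
```

The two directions are returned separately because ΔP is asymmetric: how well the first word predicts the second need not match how well the second predicts the first.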