My research models two related phenomena: (A) the emergence of grammatical structure within individuals, with a focus on the degree to which structure can be learned from usage alone; and (B) variation in grammatical structures across populations and across registers, with a focus on how grammars change as complex adaptive systems. My basic research question is how language learning and language change interact at scale when we observe both an entire grammar and a global community of speaker-hearers. Computational models applied to large corpora provide a method for solving this difficult problem.
Cambridge University Press recently published my first book, on the intersection between natural language processing and linguistic analysis. I have also published over 30 papers in both linguistic venues (including Cognitive Linguistics, Language & Cognition, Metaphor & Symbol, the International Journal of Corpus Linguistics, Corpus Linguistics & Linguistic Theory, and Lingua) and computational venues (including the proceedings of the Association for Computational Linguistics): 13 papers in linguistic venues and 21 in computational venues. This page focuses on a few recent examples of my work to show how I have approached the relationship between the emergence of and variation in grammatical structure.
Modelling the Emergence of Grammatical Structure
My research experiments with usage-based grammar using computational models applied to large multi-lingual corpora. The basic idea behind these computational experiments is to understand the relationship between exposure to the production contained in a corpus (usage) and the emergence of grammatical structure. For example, a recent paper has shown that increased exposure, simulated using increased training data, produces grammars that are more similar across different contexts of production (Dunn & Tayyar Madabushi, 2021). This is important because it means that the idiosyncratic properties of specific corpora become less influential as the amount of exposure increases. Another recent paper has shown that grammars converge more quickly given exposure to a single individual rather than to an aggregation of individuals (Dunn & Nini, 2021). This is important because it tests the implicit assumption behind models that rely on aggregated corpora to represent the behavior of individuals. Finally, a forthcoming paper (Dunn, In Press) takes a cross-linguistic approach to the interaction between grammar and the lexicon during learning, showing that neither a strict separation between grammar and lexicon nor a naive continuum that makes no distinctions between the two is fully adequate.
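As a schematic illustration of the logic behind these experiments (not the actual grammar induction models used in the papers above), one can induce a toy "grammar" from increasing amounts of usage in two different registers and measure how similar the resulting inventories are. Here the induced grammar is simply the set of recurring word bigrams, a deliberately simplified stand-in for learned constructions, and similarity is Jaccard overlap; the toy registers and all function names are hypothetical.

```python
from collections import Counter

def extract_grammar(tokens, min_count=2):
    # Toy 'grammar': the set of word bigrams that recur at least
    # min_count times in the sample (a stand-in for constructions).
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {bg for bg, n in bigrams.items() if n >= min_count}

def jaccard(a, b):
    # Overlap between two grammars (1.0 = identical inventories).
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two toy 'registers' built from overlapping sentence templates.
news = ("the government said it would raise the tax on fuel . "
        "the minister said the plan would help the economy . ") * 20
chat = ("i said the plan would help but the tax is high . "
        "the government said the tax on fuel is fair . ") * 20

# Compare grammars induced from increasing amounts of exposure.
for n_tokens in (40, 200, 600):
    g1 = extract_grammar(news.split()[:n_tokens])
    g2 = extract_grammar(chat.split()[:n_tokens])
    print(n_tokens, round(jaccard(g1, g2), 3))
```

The point of the sketch is the experimental design: exposure is manipulated by varying the amount of training data, while everything else is held constant, so that any increase in cross-register similarity can be attributed to exposure.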
Modelling Variation in Grammatical Structure
My research also models variation and change in grammars across both geographic dialects and contexts of production. The basic idea behind these computational experiments is to understand how different types of exposure introduce and propagate variants across an entire grammar. Although language is a complex adaptive system, non-computational approaches divide grammatical structures into a small number of discrete variants (like the ditransitive vs the dative). The central challenge is to model grammatical variation without arbitrarily reducing a high-dimensional grammar to a handful of low-dimensional variables. For example, a recent paper has shown that learned grammars capture spatial variation in language to such a degree that it is possible to predict the dialect membership of held-out samples (Dunn, 2019). This is important because it allows us to observe variation across an entire grammar at once. Another recent paper (Dunn & Wong, 2022) undertakes a rigorous spatio-temporal analysis of these high-dimensional models of grammatical variation in order to distinguish between (i) variation that remains stable and (ii) variation that is leading to change.
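The dialect-prediction task can be sketched in miniature (again, this is a hedged illustration, not the actual models from Dunn, 2019): each sample is represented as a frequency vector over grammatical features, here word bigrams standing in for constructions, and a held-out sample is assigned to the dialect whose centroid it is closest to in that feature space. The toy "dialects" below contrast ditransitive-like and dative-like orderings; all data and names are hypothetical.

```python
import math
from collections import Counter

def freq_vector(tokens, features):
    # Relative frequency of each grammatical feature in one sample.
    counts = Counter(zip(tokens, tokens[1:]))
    total = max(sum(counts.values()), 1)
    return [counts[f] / total for f in features]

def centroid(vectors):
    return [sum(col) / len(col) for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict_dialect(sample_vec, centroids):
    # Assign a held-out sample to the nearest dialect centroid.
    return max(centroids, key=lambda d: cosine(sample_vec, centroids[d]))

# Toy training data: dialect A prefers the dative, B the ditransitive.
train = {
    "A": ["i gave the book to her .".split()] * 3,
    "B": ["i gave her the book .".split()] * 3,
}
features = sorted({bg for texts in train.values()
                   for t in texts for bg in zip(t, t[1:])})
centroids = {d: centroid([freq_vector(t, features) for t in texts])
             for d, texts in train.items()}

held_out = "she gave the book to him .".split()
print(predict_dialect(freq_vector(held_out, features), centroids))  # prints A
```

The design choice worth noting is that prediction operates over the whole feature vector at once, rather than over a single pre-selected variable, which is what allows variation to be observed across an entire grammar simultaneously.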
Connecting Models with Real-World Populations
My larger goal is to model how grammatical structure emerges and how language change is connected to this process of emergence. In order to carry out these experiments, we need large multi-lingual geo-referenced corpora that provide valid representations of particular communities of speaker-hearers. I have addressed this practical problem by developing state-of-the-art language identification models and the 400-billion-word Corpus of Global Language Use. This project covers over 400 languages from 100 countries (Dunn, 2020), including low-resource languages not covered by previous corpora (Dunn & Nijhof, 2022). The data is visualized through my earthLings.io project. My recent work has shown how geo-referenced data can be used to train models without the population bias that over-represents speakers from Western countries (Dunn & Adams, 2020). Other recent work has shown that we can observe changes in linguistic diversity caused by travel restrictions in these corpora (Dunn et al., 2020) and that corpus similarity measures can be used across languages to validate very large data sets (Li & Dunn, 2022). This kind of practical work is important because it enables us to undertake computational experiments on a population-level scale. It is also important for contributing to the development of representative language technology for low-resource languages (Dunn et al., 2022).
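One family of corpus similarity measures can be sketched as follows (a simplified illustration, not the specific measures validated in Li & Dunn, 2022): represent each corpus by its character n-gram frequencies and correlate the frequency rankings of the n-grams the two corpora share. Because it operates on rankings of language-internal units, the same measure can be applied to any language. The simple Spearman formula below ignores tied ranks, which is adequate for a sketch.

```python
from collections import Counter

def char_ngrams(text, n=3):
    # Character n-gram frequency profile of a corpus sample.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def rank(counter, vocab):
    # Rank each n-gram by frequency (0 = most frequent).
    ordered = sorted(vocab, key=lambda g: (-counter[g], g))
    return {g: i for i, g in enumerate(ordered)}

def spearman_similarity(c1, c2):
    # Spearman correlation between the frequency rankings of the
    # n-grams shared by two corpora; higher = more similar corpora.
    vocab = sorted(set(c1) & set(c2))
    if len(vocab) < 2:
        return 0.0
    r1, r2 = rank(c1, vocab), rank(c2, vocab)
    n = len(vocab)
    d2 = sum((r1[g] - r2[g]) ** 2 for g in vocab)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

A measure like this supports validation at scale: a register-matched sample from a trusted corpus can be compared against each new web-crawled sample, and samples whose similarity falls below a threshold can be flagged for inspection without any language-specific resources.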