Dunn, J. (2019). “Modeling Global Syntactic Variation in English Using Dialect Classification.” In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (NAACL 2019). Association for Computational Linguistics. 42-53.
Abstract. This paper evaluates global-scale dialect identification for 14 national varieties of English as a means for studying syntactic variation. The paper makes three main contributions: (i) introducing data-driven language mapping as a method for selecting the inventory of national varieties to include in the task; (ii) producing a large and dynamic set of syntactic features using grammar induction rather than focusing on a few hand-selected features such as function words; and (iii) comparing models across both web corpora and social media corpora in order to measure the robustness of syntactic variation across registers.
[Data, CC: https://labbcat.canterbury.ac.nz/download/?jonathandunn/CGLU_v3]
[Data, Grammars: https://labbcat.canterbury.ac.nz/download/?jonathandunn/CxG_Data_FixedSize]
[Code, CC: https://github.com/jonathandunn/common_crawl_corpus]
[Code, LID: https://github.com/jonathandunn/idNet]