The Automatic Identification of Lexical Variation between Language Varieties

Peirsman, Yves; Geeraerts, Dirk; Speelman, Dirk

doi:10.1017/S1351324910000161

Keywords:

Science & Technology, Social Sciences, Technology, Computer Science, Artificial Intelligence, Linguistics, Language & Linguistics, LATENT SEMANTIC ANALYSIS, SPACE, CORPUS, 0801 Artificial Intelligence and Image Processing, 1702 Cognitive Sciences, 2004 Linguistics, Artificial Intelligence & Image Processing, 4602 Artificial intelligence, 4605 Data management and data science, 4704 Linguistics

Abstract:

Languages are not uniform. Speakers of different language varieties use certain words differently: more or less frequently, or with different meanings. We argue that distributional semantics is the ideal framework for the investigation of such lexical variation. We address two research questions and present our analysis of the lexical variation between Belgian Dutch and Netherlandic Dutch. The first question involves a classic application of distributional models: the automatic retrieval of synonyms. We use corpora of two different language varieties to identify the Netherlandic Dutch synonyms for a set of typically Belgian words. Second, we address the problem of automatically identifying words that are typical of a given lect, either because of their high frequency or because of their divergent meaning. Overall, we show that distributional models are able to identify more lectal markers than traditional keyword methods. Distributional models also have a bias towards a different type of variation. In summary, our results demonstrate how distributional semantics can help research in variational linguistics, with possible future applications in lexicography or terminology extraction. Copyright © 2010 Cambridge University Press.