Mapping Semantic Space in Comparable Corpora. Token-level semantic vector spaces as an analysis tool for lexical variation

Heylen, Kris; Wielfaert, Thomas; Speelman, Dirk

Author:

Heylen, Kris; Wielfaert, Thomas; Speelman, Dirk

Keywords:

distributional semantics, lexical semantics, corpus linguistics, visualization, discourse analysis, Dutch

Abstract:

Conceptual space can be carved up linguistically in different ways. The mapping between a set of related concepts and a set of forms need not be one-to-one and can differ both between varieties of the same language and between different languages. Recently, a number of studies have combined quantitative corpus analysis with visualization techniques to study form-meaning mappings on the exemplar level, both cross-linguistically and within one language: Wälchli (2010) used distributional similarity in parallel corpora and Multi-Dimensional Scaling (MDS) to visualize how the exemplars of local phrase markers divide up the semantic space between themselves in different languages. Levshina (2011) coded exemplars of Dutch causative constructions for many different features in comparable corpora of different varieties and then used MDS to visualize how they carve up the causativity space. In this study, we present such an exemplar-level analysis and visualization for referentially rich lexical categories, rather than the less referential, grammatical categories studied by Wälchli and Levshina. We argue that the rich semantics of full lexical categories can be captured in a bottom-up, automatic way by token-level Semantic Vector Spaces (Turney & Pantel 2010; Heylen, Speelman & Geeraerts 2012), and we visualize how the individual occurrences of a set of near-synonyms carve up their concept’s semantic space in a comparable corpus of different language varieties. As a case study, we look at all the occurrences of lexemes used to refer to the concept IMMIGRANT in a 1.3-million-word corpus of Dutch and Belgian newspapers from 1999 to 2005. A token-level Semantic Vector Space (Heylen, Speelman & Geeraerts 2012) is then used to structure these occurrences semantically based on the similarity of their contextual usage. Multi-Dimensional Scaling allows us to represent these contextual similarities in a two-dimensional semantic space.
With an interactive visualization, we can analyze the different dimensions in the semantic space and their contextual realization, as well as the differences in form-meaning mapping between the Netherlands and Belgium and between different newspapers. We also examine how the space and the form-meaning mappings change over the period 1999-2005.
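
The pipeline described in the abstract (one context vector per occurrence, pairwise contextual similarity, then a two-dimensional MDS layout) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the English toy sentences are invented, the plain bag-of-words window of three words stands in for whatever context weighting the study used, and classical (Torgerson) MDS computed by power iteration stands in for the unspecified MDS variant. The function names `context_vector`, `cosine_distance` and `classical_mds` are hypothetical.

```python
import math
from collections import Counter


def context_vector(tokens, index, window=3):
    """Count the words within `window` positions of tokens[index]."""
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    return Counter(tokens[lo:index] + tokens[index + 1:hi])


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return max(0.0, 1.0 - dot / norm) if norm else 1.0


def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def classical_mds(dist, dims=2, iters=200):
    """Classical MDS: double-centre the squared distances, then take the
    leading eigenvectors by power iteration with deflation (a simple
    stand-in for a library eigensolver)."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [1.0 + i for i in range(n)]  # deterministic start vector
        for _ in range(iters):
            w = matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eig = sum(vi * wi for vi, wi in zip(v, matvec(b, v)))
        scale = math.sqrt(max(eig, 0.0))  # negative eigenvalues give a zero axis
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - eig * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Invented toy occurrences of two near-synonyms (not data from the study).
occurrences = [
    ("immigrant", "the immigrant found work in the city"),
    ("migrant", "the migrant found work in the harbour"),
    ("immigrant", "police arrested the illegal immigrant at the border"),
    ("migrant", "police stopped the illegal migrant at the border"),
]
vectors = []
for lexeme, sentence in occurrences:
    tokens = sentence.split()
    vectors.append(context_vector(tokens, tokens.index(lexeme)))

n = len(vectors)
dist = [[0.0 if i == j else cosine_distance(vectors[i], vectors[j])
         for j in range(n)] for i in range(n)]
coords = classical_mds(dist)
for (lexeme, _), (x, y) in zip(occurrences, coords):
    print(f"{lexeme:10s} {x:+.3f} {y:+.3f}")
```

In this toy layout the occurrences group by context of use (employment versus border control) rather than by lexeme, which is the kind of pattern the token-level visualization is meant to expose. On a real corpus a library eigensolver or a non-metric MDS routine would be the more usual choice.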