Digital Humanities @ Arts.KULeuven, Date: 2012/09/19 - 2012/09/21

Publication date: 2012-09-20

Author:

Wielfaert, Thomas
Heylen, Kris ; Speelman, Dirk ; Geeraerts, Dirk

Keywords:

distributional semantics, visualisation, corpus linguistics, Dutch

Abstract:

Investigating the different uses of a word in texts and corpora is a research activity in several fields of the humanities. Within linguistics, lexicology is the subdiscipline that analyses the semantic structure of words in terms of polysemy, vagueness and meaning relations like metaphor or metonymy. Historical linguists study how these uses developed over time (see Geeraerts 2010 for an overview). Lexicographers record this semantic structure of words in dictionaries. Yet also in disciplines for which language is not the research object per se, scholars analyse the different meanings and uses of words: in literary studies, researchers look at how writers develop themes throughout their works using specific words. Historians, legal scholars and theologians analyse how concepts have been construed by looking at specific word uses in a historic body of texts. Traditionally, such analyses have been done by sorting through concordances of words, i.e. corpus attestations of a word in context. Although many software packages are available to extract concordances and collocations and to annotate them, the actual semantic analysis still has to be done manually by the researcher. He or she has to go through a concordance list and organize the attestations in terms of which uses are similar and constitute a separate meaning or typical usage. However, in Computational Linguistics, so-called Semantic Vector Spaces (SVS) have been developed that can detect usage patterns and semantic structure automatically, based on a quantitative, statistical analysis of large corpora. More specifically, SVSs model word meaning in terms of frequency distributions of words over co-occurring context words (see Turney and Pantel, 2010 for an overview). Unfortunately, these models are largely black boxes that contain purely mathematical representations of meaning, and hence they are not easily accessible to humanities scholars.
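The idea of modelling a word through the frequencies of its co-occurring context words can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy English corpus, the window size, and the helper names `context_vector` and `cosine` are hypothetical, and raw counts stand in for the weighted frequencies a real SVS would typically use. It is not the model used in this study.

```python
from collections import Counter
from math import sqrt


def context_vector(target, sentences, window=2):
    """Count the words occurring within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo, hi = max(0, i - window), i + window + 1
            # Collect the neighbours of the target, skipping the target itself.
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Toy corpus, invented for illustration only (hypothetical data).
corpus = [
    "the monitor shows a sharp image",
    "the display shows a sharp image",
    "the leader plays with the children",
]

monitor = context_vector("monitor", corpus)
display = context_vector("display", corpus)
leader = context_vector("leader", corpus)

sim_close = cosine(monitor, display)  # words sharing their context words
sim_far = cosine(monitor, leader)     # words with mostly different contexts
print(sim_close, sim_far)
```

In this toy setting, words that occur with the same context words receive a higher similarity than words that do not. The resulting numbers carry no label saying which context words drove them, which hints at why such representations are hard to inspect without further tooling.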
However, we have argued (Heylen et al., 2012) that by visualizing these Semantic Vector Spaces, we can achieve a double goal: on the one hand, SVSs can become a supporting tool for lexicologists and other humanities scholars to investigate word meaning and usage on a larger scale and in a more data-driven fashion. On the other hand, the SVS models themselves become amenable to evaluation by human specialists. In this study, we use a token-based SVS that models the semantic distances between individual occurrences of a word in terms of their contextual usage. To visualize the output of the SVS and make it accessible to human experts, we use statistical dimension reduction techniques to create two-dimensional scatter plots. In these plots, so-called token clouds become visible and make it possible to distinguish a word's different meanings and usages. As a case study, we analyse the usage of a set of Dutch near-synonyms in a large corpus of Belgian and Netherlandic Dutch. The near-synonyms beeldscherm, computer scherm, monitor, and display all refer to the same concept of COMPUTER SCREEN. In Belgian Dutch, however, monitor can also be used to refer to a type of youth leader, for instance speelpleinmonitor (playground monitor). This specific usage in Belgian Dutch is clearly distinguishable in the Token Space. We have made an interactive implementation of the COMPUTER SCREEN scatter plot with both Google Visualization and R, using the method developed by Heylen et al. (2012). The Google implementation is annotated with manual semantic disambiguations, which help to identify the clusters visually through colour coding. In the R version, on the other hand, the context words are annotated with their weights, which show how much each context word contributed to the solution. The goal is to make a visualization in which both annotations and weights are combined.

References:

Geeraerts, D. (2010). Theories of Lexical Semantics.
Harris, Z. (1954). Distributional structure. Word, 10(2-3), 146-162.
Heylen, K., Speelman, D. and Geeraerts, D. (2012). Looking at word meaning. An interactive visualization of Semantic Vector Spaces for Dutch synsets. Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, 16-26.
Ruette, T., Geeraerts, D., Peirsman, Y. and Speelman, D. (2012). Semantic weighting mechanisms in scalable lexical sociolectometry. Aggregating dialectology and typology: linguistic variation in text and speech, within and across languages.
Schütze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1), 97-124.
Turney, P.D. and Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37(1), 141-188.