Aggregating lexical variation: towards large scale lexical lectometry

Ruette, Tom

Abstract:

The observation that instigated the current study is old: the choice of a word to express a certain concept may be related to the context. As an example, we can take the concept of Subterranean public transport. If this concept is expressed by the word subway, it is quite possible that the concept was realized in an American context; if the concept is expressed by the word underground, it is more likely that the concept was realized in a British context. To describe this variability, we employ specific terminology. The variation on the linguistic side, i.e. having different linguistic forms for one underlying meaning, is called linguistic variation or, in the case of words, lexical variation. The variation on the extra-linguistic side, i.e. the fact that there are different contexts, is called lectal variation. Sometimes, a specific context comes with its own idiosyncratic set of linguistic choices, and we call this a lect. So, because speaking English in an American context may be linked to the use of specific words, a specific phonology, and perhaps also specific morphology and syntax, we can speak of the lect American English.

Lects can be studied in two ways. On the one hand, one can be interested in investigating the lectal properties of a small set of words that express a single concept. As an example, we can investigate the set of words subway, underground and tube, expressing the concept Subterranean public transport, to find out in which contexts they typically appear. This is the standard perspective taken in variationist research, with an emphasis on the linguistic variable. On the other hand, it is also possible to put the emphasis on the lectal variation. In that case, the researcher is interested in describing the multifactorial lectal structure of linguistic variation.
As an example, we can limit the linguistic variation to lexical variation only, and then ask whether our choice of words is primarily related to register, topic, sociodemographic properties of the language user, audience, etc. It is this perspective that we will take in the current study.

Intuitively speaking, we could investigate the multifactorial lectal structure of linguistic variation by going one by one through a large number of linguistic variables. Every time a linguistic variable is sensitive to a certain lectal distinction, e.g. expressing Subterranean public transport is sensitive to the distinction between the United States and the United Kingdom, this lectal distinction gets a vote. In the end, we can easily say which lectal distinctions there are across our linguistic variables, and by simply looking at the number of votes for each lectal distinction, we can say which lectal distinction is the most important one for our set of linguistic variables. Although the actual technical and quantitative implementation is much more complex, the aggregation methodology just sketched is the basis for what we call lectometry.

From the above intuitive explanation, it becomes clear that a lectometric study is based on a large set of linguistic variables. This set of variables needs to be representative of the linguistic variation that we want to investigate, which in our case is lexical variation. So, to investigate the multifactorial lectal structure of lexical variation, we need a large set of what are commonly known as synonyms. However, we run into the longstanding philosophical problem of synonymy: do synonymous words actually exist? Now, for our research purposes, we can consider synonymy as a heuristic device, so we can ignore the fact that words are deemed to be non-synonymous due to a lectal difference, because it is exactly the lectal difference that we want to study.
However, it might also be possible that the choice between words is not only influenced by a lectal difference, but also by an actual "meaning difference". Let us give an example to make things clearer. Take the words escort, prostitute and whore, which are three words that describe women who engage in promiscuous sexual intercourse for money. Although there is clearly a lectal difference between the three words, with whore the most derogatory term, one could argue that escort is also conceptually different. A prostitute would not mind being called an escort, but this would not work the other way around. To what extent is it then valid to investigate the lectal alternation between these words? And is it not almost always possible to point out some conceptual difference between words that are considered synonymous? We would like to argue that it is therefore only valid to investigate the lectal variability of a set of (near-)synonymous words if we can be certain that the lectal variation is much "stronger" than the conceptual variation. In other words, we have to be certain that the conceptual differences between the variants are so small that they are negligible in the light of a strong lectal pattern.

Although the above reasoning is sound and firmly rooted in the contemporary theory of lexical semantics, one could argue from a technical point of view that a lectometric study can also do without the difficult semantic requirement of investigating lexical variables. As an alternative, it is possible to merely investigate individual words: indeed, what is the difference between, on the one hand, finding that subway is used more than underground in the United States and, on the other hand, finding that subway is used frequently and that underground is used infrequently in the United States?
In the case study of Chapter 3, we show that an approach that incorporates semantic knowledge is much better suited to our research goal than a model that ignores this semantic knowledge, because the lectal patterns become more pronounced than the conceptual patterns.

So, in Chapter 3 we have learned the importance of a semantically informed approach. This now leaves us with the question of how to measure the semantic similarity between words. In other words, how can we gauge whether words are sufficiently conceptually alike so that it is safe to investigate their lectal distribution? Moreover, as our goal is to investigate many lexical variables, we should find a scalable, preferably automatic approach to the measurement of semantic similarity. We have opted to use Semantic Vector Space models. These models are able to produce quantitative output that has been shown to correlate with semantic similarity, and we have used this output to automatically find lexical alternations. The case study in Chapter 4 shows that the automatic modeling is somewhat successful, because it provides a wide basis of candidate lexical variables from which we can select the most appropriate ones. However, it is also clear that a completely automatic approach does not work yet, for a number of reasons. One of the main reasons is the lack of semasiological sensitivity of the Semantic Vector Space models as we have applied them. As an example, Semantic Vector Space models collapse the senses of a polysemous word into one point that resides somewhere between the different senses of the word.
Therefore, we can use the Semantic Vector Space models as they are developed now only as a tool to explore the lexicon, so that the researcher can manually find lexical variables in a bottom-up way.

Next to the semantic difficulties that we have dealt with, we are also confronted with a serious drawback of our aggregation methodology: after the aggregation has been performed, the results are completely opaque and do not allow for advanced interpretation of the lectal patterns. Although we are able to make the desired claims such as "lectal dimension one is stronger than lectal dimension two", we cannot connect this observation with the behavior of the individual lexical variables. Therefore, we introduce Individual Differences Scaling as a way to perform transparent aggregation. This technique is thoroughly tested and applied in the case study of Chapter 5, where it is shown to be an easy-to-apply technique that offers unprecedented flexibility and interpretation possibilities in comparison to other aggregation methodologies.

All in all, we can conclude that a lectometric approach is revealing for the multifactorial structure of lectal variation in the lexicon. Confronted with complex semantic issues, we have found that the application of Semantic Vector Space models is useful as a generator of candidate lexical variables. Nonetheless, further development of these models is necessary to make them sensitive to the semasiological aspects of meaning. If Semantic Vector Space models reach that level of maturity, it might become possible to make the step towards a completely automatic lectometric methodology, without the need for manual intervention by the researcher. Finally, we have shown that Individual Differences Scaling is a valuable extension of aggregation methodology, because it allows for an in-depth interpretation of the aggregated lectal patterns by means of backtracing the behavior of the individual variables.
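
The vote-counting intuition behind lectometry, as sketched earlier in this abstract, can be illustrated with a minimal Python example. This is only a toy sketch and not the implementation used in the study, which the abstract describes as much more complex. All frequencies, the second and third variables (CONCEPT_B, CONCEPT_C), the two lectal dimensions (country, register), the distance measure and the threshold of 0.3 are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical lectal dimensions; each lect is a (country, register) pair.
DIMENSIONS = ("country", "register")

# Invented toy frequencies: variable -> lect -> variant -> count.
variables = {
    "SUBTERRANEAN PUBLIC TRANSPORT": {
        ("US", "news"): {"subway": 90, "underground": 10},
        ("US", "chat"): {"subway": 85, "underground": 15},
        ("UK", "news"): {"subway": 10, "underground": 90},
        ("UK", "chat"): {"subway": 15, "underground": 85},
    },
    "CONCEPT_B": {  # hypothetical variable with variants b1 and b2
        ("US", "news"): {"b1": 80, "b2": 20},
        ("US", "chat"): {"b1": 20, "b2": 80},
        ("UK", "news"): {"b1": 75, "b2": 25},
        ("UK", "chat"): {"b1": 25, "b2": 75},
    },
    "CONCEPT_C": {  # hypothetical variable with variants c1 and c2
        ("US", "news"): {"c1": 70, "c2": 30},
        ("US", "chat"): {"c1": 60, "c2": 40},
        ("UK", "news"): {"c1": 30, "c2": 70},
        ("UK", "chat"): {"c1": 20, "c2": 80},
    },
}


def pooled_profile(variable, dim_index, value):
    """Relative frequencies of the variants, pooled over all lects
    that share the given value on one lectal dimension."""
    pooled = Counter()
    for lect, freqs in variable.items():
        if lect[dim_index] == value:
            pooled.update(freqs)
    total = sum(pooled.values())
    return {word: n / total for word, n in pooled.items()}


def is_sensitive(variable, dim_index, threshold=0.3):
    """Assumed criterion: a variable is sensitive to a two-valued dimension
    if the halved city-block distance between the two pooled profiles
    exceeds the (arbitrary) threshold."""
    first, second = sorted({lect[dim_index] for lect in variable})
    a = pooled_profile(variable, dim_index, first)
    b = pooled_profile(variable, dim_index, second)
    words = set(a) | set(b)
    distance = sum(abs(a.get(w, 0.0) - b.get(w, 0.0)) for w in words) / 2
    return distance > threshold


# Each variable casts one vote for every dimension it is sensitive to.
votes = Counter()
for name, variable in variables.items():
    for dim_index, dimension in enumerate(DIMENSIONS):
        if is_sensitive(variable, dim_index):
            votes[dimension] += 1

for dimension, n in votes.most_common():
    print(dimension, n)
```

With these invented counts, the country distinction collects more votes than the register distinction, which is the kind of claim ("lectal dimension one is stronger than lectal dimension two") that the aggregation is meant to support. The sketch also keeps the link between each vote and the variable that cast it, which is the transparency that the abstract attributes to Individual Differences Scaling; the sketch itself does not implement that technique.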