
Computational Linguistics in The Netherlands (CLIN 2014), Date: 2014/01/17 - 2014/01/17, Location: Leiden

Publication date: 2014-01-01

Author:

Wielfaert, Thomas
Heylen, Kris ; Daems, Jocelyne ; Speelman, Dirk ; Geeraerts, Dirk

Abstract:

Distributional models of semantics have become the mainstay of large-scale modelling of word meaning in statistical NLP (see Turney and Pantel 2010 for an overview). In a Word Sense Disambiguation task, identifying semantic structure is usually seen as a clustering problem where occurrences of a polysemous word have to be assigned to the ‘correct’ sense. As linguists, however, we are not interested solely in performance evaluation against some gold standard; rather, we want to investigate the precise relation between a word's distributional behaviour and its meaning. Given that distributional models are extremely parameter-rich, we want to assess how well and in which way a specific model can capture a lexicological description of semantic structure. In this presentation, we discuss three tools we are developing for a lexicological assessment of distributional models. Firstly, we are creating our own lexicologically informed 'gold standard' of disambiguated noun occurrences, based on the ANW (Algemeen Nederlands Woordenboek) and a random sample from two large-scale Belgian (1.3G) and Netherlandic (500M) Dutch newspaper corpora. Secondly, we are developing a visualisation tool to analyse the impact of parameter settings on the semantic structure captured by a distributional model. Thirdly, we have adapted a clustering quality measure (McClain & Rao 1975) to assess how well a manual disambiguation is captured by a distributional model, independently of any specific clustering algorithm. Similar to Lapesa and Evert's (2013) parameter sweep for a type-level model on semantic priming data, we are striving towards a large-scale parameter evaluation for token-level models on sense-annotated occurrences.
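
To illustrate the third tool, the following is a minimal sketch of a McClain-Rao style index in its commonly cited form: the mean distance between occurrences that share a manually assigned sense, divided by the mean distance between occurrences with different senses. Lower values mean the token vectors separate the annotated senses more cleanly, and no clustering algorithm is involved. The abstract does not specify how the measure was adapted, so this is not the authors' version; the function names, the choice of cosine distance, and the toy vectors and sense labels are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mcclain_rao(vectors, labels, distance=cosine_distance):
    """Mean within-sense distance divided by mean between-sense distance.

    `vectors` holds one token-level vector per occurrence and `labels`
    the manually assigned sense of each occurrence (hypothetical inputs).
    """
    within, between = [], []
    for i, j in combinations(range(len(vectors)), 2):
        d = distance(vectors[i], vectors[j])
        if labels[i] == labels[j]:
            within.append(d)
        else:
            between.append(d)
    if not within or not between:
        raise ValueError("need at least one within-sense and one between-sense pair")
    return (sum(within) / len(within)) / (sum(between) / len(between))


# Toy data: four occurrences of one noun in a two-dimensional context space.
toy_vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
matching_senses = ["sense_a", "sense_a", "sense_b", "sense_b"]
mismatched_senses = ["sense_a", "sense_b", "sense_a", "sense_b"]

index_matching = mcclain_rao(toy_vectors, matching_senses)
index_mismatched = mcclain_rao(toy_vectors, mismatched_senses)
```

In this toy setting the index is far lower for the annotation that matches the geometry of the vectors than for the mismatched one. Computing such an index for each parameter setting of a model would be one way to compare settings against a manual disambiguation.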