
Unsupervised Algorithms for Cross-Lingual Text Analysis, Translation Mining, and Information Retrieval (Algoritmen voor ongesuperviseerde cross-linguale tekstanalyse, het identificeren van vertalingen en informatieontsluiting)

Publication date: 2014-06-10

Author:

Vulic, Ivan

Abstract:

With the ongoing growth of the global network and the influx of information in today's increasingly connected world, more and more content becomes readily available in a plethora of languages, dialects, and unofficial and community languages. Given the large amounts of multilingual data that are typically unstructured yet thematically aligned and comparable, there is a pressing need for unsupervised algorithms that can deal with such multilingual data and address the problems of meaning, translation, and information retrieval in multilingual settings.

The thesis makes four major contributions to the research fields of data mining, natural language processing, and information retrieval. First, we present a full overview of the newly developed multilingual probabilistic topic modeling (MuPTM) framework for mining multilingual data. The framework is used to induce high-level, language-independent representations of textual information (e.g., words, phrases, and documents). Second, we propose a new statistical framework for inducing bilingual lexicons (i.e., addressing the problem of translation) from parallel data, based on the novel paradigm of sub-corpora sampling. Third, we introduce a new statistical framework for modeling cross-lingual semantic similarity (i.e., addressing the problem of meaning) and inducing bilingual lexicons (i.e., the problem of translation) from comparable data. Here, we make a series of contributions to the field of (multilingual) natural language processing and its sub-field of distributional semantics by (i) proposing a series of MuPTM-based models of cross-lingual semantic similarity, (ii) designing an algorithm for detecting only highly reliable translation pairs in noisy multilingual environments, (iii) proposing a new language-pair-independent cross-lingual semantic space that relies on the concept of semantic word responding, (iv) presenting a new bootstrapping approach to cross-lingual semantic similarity and bilingual lexicon extraction, and (v) proposing a new context-sensitive framework for modeling semantic similarity. Fourth, we propose a new probabilistic framework for cross-lingual and monolingual information retrieval (i.e., tackling the problem of information retrieval) which relies on MuPTM-based text representations.

All proposed models are unsupervised and language-pair independent in their design, which makes them potentially applicable to many language pairs. The models have been evaluated on a variety of language pairs, and we show that they advance the state of the art in their respective fields. Due to their unsupervised and language-pair-independent nature, the presented models exhibit solid potential for future research and for applications dealing with different official and unofficial languages, dialects, and different idioms of the same language.
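
To make the MuPTM-based representation more concrete, the Python sketch below illustrates one common way such models can compute cross-lingual word similarity: each word is represented by its distribution over the K topics shared across languages, and candidate translations are ranked by comparing these language-independent signatures. The function names, the phi_source/phi_target dictionaries, and the choice of cosine similarity are illustrative assumptions; the thesis explores a family of such models and similarity functions rather than this single variant.

    import numpy as np

    def topic_signature(word, phi):
        # phi maps a word to a length-K vector of per-topic scores P(w | z_k),
        # estimated by a multilingual topic model trained on comparable documents.
        # Normalising over the K shared topics gives the word a language-independent
        # signature (illustrative assumption: plain sum-to-one normalisation).
        scores = np.asarray(phi[word], dtype=float)
        return scores / scores.sum()

    def cross_lingual_similarity(word_s, word_t, phi_source, phi_target):
        # Both signatures live in the same K-dimensional topic space, so words
        # from different languages are compared directly, without a dictionary.
        a = topic_signature(word_s, phi_source)
        b = topic_signature(word_t, phi_target)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def rank_translation_candidates(word_s, phi_source, phi_target, top_n=5):
        # Rank all target-language words by similarity to the source word;
        # the top-ranked pairs are candidate entries for a bilingual lexicon.
        ranked = sorted(
            phi_target,
            key=lambda w: cross_lingual_similarity(word_s, w, phi_source, phi_target),
            reverse=True,
        )
        return ranked[:top_n]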
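
Similarly, the probabilistic retrieval framework named in the fourth contribution can be pictured as scoring a target-language document by how well its topic mixture explains a source-language query, following the standard topic-model query-likelihood form P(q | d) = prod_j sum_k P(w_j | z_k) P(z_k | d). The sketch below is only a minimal illustration under that assumption; the variable names are hypothetical and the smoothing and combination with other evidence used in the thesis are not reproduced.

    import math

    def clir_score(query_terms, theta_d, phi_source, epsilon=1e-12):
        # theta_d: length-K topic proportions P(z_k | d) of a target-language document.
        # phi_source: maps a source-language word to its length-K vector P(w | z_k).
        # Because the K topics are shared across languages, the source-language
        # query is "generated" through the document's topics with no translation step.
        log_score = 0.0
        for w in query_terms:
            pw_given_z = phi_source.get(w, [0.0] * len(theta_d))
            p_w = sum(pw_z * pz_d for pw_z, pz_d in zip(pw_given_z, theta_d))
            log_score += math.log(p_w + epsilon)  # epsilon avoids log(0) for unseen terms
        return log_score

    def rank_documents(query_terms, documents, phi_source, top_n=10):
        # documents: dict mapping a document id to its topic proportions theta_d.
        ranked = sorted(
            documents,
            key=lambda d: clir_score(query_terms, documents[d], phi_source),
            reverse=True,
        )
        return ranked[:top_n]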