
Representation Learning for Weakly-Supervised Natural Language Processing Tasks

Publication date: 2018-12-18

Authors:

Heyman, G.; Moens, M.-F.; Vulic, I.

Abstract:

In recent years, representation learning has obtained impressive results across a wide range of machine learning tasks in domains such as natural language processing, computer vision, and speech recognition. Rather than relying on hand-crafted data representations, representation learning aims to acquire representations automatically from data. Its successes have been achieved on problems with large amounts of annotated data; datasets comprising millions of training examples are no exception. For many important problems, however, labeled data is not abundant. This is particularly the case in natural language processing, where, especially for languages other than English, such large-scale datasets are often lacking. In this thesis, we investigate and propose representation learning models for settings in which the amount of annotated training data is limited. The thesis has four main contributions, each of which sheds a different light on how representation learning can be used in weakly-supervised settings (minimal, illustrative code sketches of each follow the abstract).

First, we design a new cross-lingual probabilistic topic model that infers cross-lingual representations for words and documents after being trained on a collection of document pairs that are similar, but not necessarily identical, in content. This provides a means to obtain cross-lingual representations for words and documents that are interpretable and that serve as valuable features for tasks such as cross-lingual document classification, without the need for parallel data.

Second, we design methods to construct multilingual embedding spaces without using bilingual dictionaries, parallel corpora, or any other type of multilingual supervision, and we study the effectiveness of these spaces on downstream natural language processing tasks, namely bilingual lexicon induction, multilingual document classification, and multilingual dependency parsing. In contrast to previous research, our most effective method combines the following desirable properties: it incorporates dependencies between all targeted languages; it still works when the targeted languages have very different characteristics (e.g., projecting English into the same vector space as Finnish and/or Hungarian); and empirical evidence indicates that it is stable, as it never produced degenerate solutions in our experiments.

Third, we propose a deep learning model that tackles the correction of context-dependent dt-errors, one of the most prominent types of spelling error in Dutch, without using labeled examples of dt-errors. The model is designed to predict the correct suffix of a verb given its stem and the context in which it occurs. The data requirements are therefore limited to high-quality Dutch text, which is available in abundance. In comparative tests against other systems, including the spell checker that ships with Microsoft Word, the proposed model obtains the best results by a large margin.

Fourth, we present a new approach to obtaining bilingual dictionaries that combines character-level and word-level information to extract translations from non-parallel texts. Unlike the majority of prior work, we frame this task as a classification problem rather than a retrieval problem. This makes it possible to combine unsupervised and weakly-supervised representation learning techniques and to seamlessly integrate word-level and character-level information. In particular, starting from a set of seed translations, the model learns character-level representations rather than relying on hand-crafted feature extraction, and it learns how to fuse them with word-level representations that encode corpus statistics. The major findings are (a) that incorporating character-level information is particularly useful in the biomedical domain, where many terms have their origin in Greek and Latin or are acronyms or abbreviations, and (b) that learned character-level representations are superior to the hand-crafted representations used in prior work. Although we evaluate primarily on biomedical terms, the method is domain-agnostic and holds promise for translation mining in other domains.

The main conclusion of this dissertation is that representation learning is very much applicable to weakly-supervised natural language processing problems: both as a means to inject data-driven prior knowledge into tasks, by inducing textual input representations from unlabeled text, and as a paradigm for obtaining abstractions from labeled text data that classical feature engineering does not uncover.
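To make the first contribution concrete, the sketch below implements a basic bilingual LDA-style topic model with collapsed Gibbs sampling, in which each aligned document pair shares a single topic distribution while each language keeps its own topic-word distributions. This is a generic comparable-corpora baseline written under our own simplifying assumptions, not the new model proposed in the thesis; all names and hyperparameters are illustrative.

```python
import numpy as np

def bilda_gibbs(doc_pairs, V, K=20, alpha=0.1, beta=0.01, n_iters=200):
    """Collapsed Gibbs sampling for a basic bilingual LDA-style model.
    doc_pairs: list of (words_lang0, words_lang1), each a list of word ids.
    V: (vocab size of language 0, vocab size of language 1)."""
    rng = np.random.default_rng(0)
    D = len(doc_pairs)
    n_dk = np.zeros((D, K))                                  # topic counts per document pair
    n_kw = [np.zeros((K, V[0])), np.zeros((K, V[1]))]        # topic-word counts per language
    n_k = [np.zeros(K), np.zeros(K)]                         # topic totals per language
    assignments = []
    # random initialisation of topic assignments
    for d, pair in enumerate(doc_pairs):
        z_pair = []
        for lang, words in enumerate(pair):
            z = rng.integers(K, size=len(words))
            for w, k in zip(words, z):
                n_dk[d, k] += 1; n_kw[lang][k, w] += 1; n_k[lang][k] += 1
            z_pair.append(z)
        assignments.append(z_pair)
    # Gibbs sweeps: resample every token's topic from its full conditional
    for _ in range(n_iters):
        for d, pair in enumerate(doc_pairs):
            for lang, words in enumerate(pair):
                z = assignments[d][lang]
                for i, w in enumerate(words):
                    k = z[i]
                    n_dk[d, k] -= 1; n_kw[lang][k, w] -= 1; n_k[lang][k] -= 1
                    p = (n_dk[d] + alpha) * (n_kw[lang][:, w] + beta) / (n_k[lang] + V[lang] * beta)
                    k = rng.choice(K, p=p / p.sum())
                    z[i] = k
                    n_dk[d, k] += 1; n_kw[lang][k, w] += 1; n_k[lang][k] += 1
    theta = (n_dk + alpha) / (n_dk.sum(1, keepdims=True) + K * alpha)
    phi = [(n_kw[l] + beta) / (n_kw[l].sum(1, keepdims=True) + V[l] * beta) for l in (0, 1)]
    return theta, phi
```

In such a model, rows of theta act as cross-lingual document(-pair) representations, and the two phi matrices make words from both languages comparable through their shared topic dimensions.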
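For the second contribution, the following sketch shows a generic pairwise self-learning procedure for aligning two monolingual embedding spaces without a bilingual dictionary: a weak heuristic seed (e.g., identically spelled words or numerals) is alternately used to solve an orthogonal Procrustes mapping and to re-induce a dictionary from nearest neighbours. This is a standard baseline for illustration only; the thesis method is multilingual and differs in its details.

```python
import numpy as np

def normalize(m):
    # length-normalise rows so that dot products equal cosine similarities
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def orthogonal_map(x, y):
    # closed-form solution of min_W ||x @ W - y||_F with W orthogonal (Procrustes)
    u, _, vt = np.linalg.svd(x.T @ y)
    return u @ vt

def self_learning_alignment(src_emb, tgt_emb, seed_src, seed_tgt, n_iters=5):
    """Alternate between (1) fitting a Procrustes mapping on the current
    dictionary and (2) re-inducing the dictionary from nearest neighbours.
    seed_src/seed_tgt: index arrays of a weak initial dictionary (hypothetical
    heuristic seed, e.g. identically spelled words or numerals)."""
    src_emb, tgt_emb = normalize(src_emb), normalize(tgt_emb)
    src_idx, tgt_idx = np.asarray(seed_src), np.asarray(seed_tgt)
    for _ in range(n_iters):
        w = orthogonal_map(src_emb[src_idx], tgt_emb[tgt_idx])
        sims = (src_emb @ w) @ tgt_emb.T          # cosine similarities after mapping
        src_idx = np.arange(len(src_emb))         # pair every source word
        tgt_idx = sims.argmax(axis=1)             # with its nearest target word
    return w

# toy usage with random vectors standing in for monolingual word embeddings
rng = np.random.default_rng(0)
src, tgt = rng.normal(size=(500, 50)), rng.normal(size=(500, 50))
w = self_learning_alignment(src, tgt, seed_src=range(20), seed_tgt=range(20))
print(w.shape)  # (50, 50)
```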
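The key idea behind the third contribution, that training signal for dt-error correction can be harvested from correct Dutch text alone, can be illustrated with a deliberately tiny bag-of-words stand-in: verb suffixes are stripped from correct sentences and a classifier learns to predict them from the stem and surrounding context. The hand-written examples and the feature pipeline below are toy assumptions; the thesis trains a deep learning model on large amounts of text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy instances harvested from *correct* Dutch text: for each occurrence of a
# confusable verb stem, the suffix is stripped and the context becomes the input.
train_contexts = [
    "dat <stem=gebeur> hier wel vaker",   # "gebeurt" -> suffix "t"
    "het is al eerder <stem=gebeur>",     # "gebeurd" -> suffix "d"
    "hij <stem=word> morgen dertig",      # "wordt"   -> suffix "t"
    "ik <stem=word> daar erg moe van",    # "word"    -> no suffix
]
train_suffixes = ["t", "d", "t", ""]

# bag-of-words / bigram features over the context, multi-class logistic regression
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_contexts, train_suffixes)

# at correction time, the predicted suffix is compared with the written one
print(model.predict(["zij <stem=word> volgend jaar arts"]))
```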
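Finally, the fourth contribution frames translation mining as classification over candidate term pairs, fusing learned character-level representations with word-level information. The sketch below is a hypothetical architecture in that spirit, a character-level BiLSTM encoder for both terms concatenated with precomputed word-level features such as embedding similarity, and not the exact model of the thesis; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CharWordTranslationClassifier(nn.Module):
    """Binary classifier: is (source term, target term) a translation pair?"""

    def __init__(self, n_chars, char_dim=32, hidden=64, word_feat_dim=2):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.encoder = nn.LSTM(char_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden + word_feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def encode(self, char_ids):
        # char_ids: (batch, max_len) integer-encoded characters of one term
        emb = self.char_emb(char_ids)
        _, (h, _) = self.encoder(emb)
        # concatenate final forward and backward hidden states
        return torch.cat([h[0], h[1]], dim=-1)            # (batch, 2*hidden)

    def forward(self, src_chars, tgt_chars, word_feats):
        src = self.encode(src_chars)
        tgt = self.encode(tgt_chars)
        fused = torch.cat([src, tgt, word_feats], dim=-1)  # char + word-level fusion
        return self.classifier(fused).squeeze(-1)          # raw logits

# dummy forward pass: batch of 2 candidate pairs, terms padded to length 12
model = CharWordTranslationClassifier(n_chars=60)
src = torch.randint(1, 60, (2, 12))
tgt = torch.randint(1, 60, (2, 12))
word_feats = torch.rand(2, 2)   # e.g., embedding cosine similarity + frequency ratio
print(model(src, tgt, word_feats).shape)  # torch.Size([2])
```

Framing the task this way lets the character-level encoder and the word-level features be trained jointly from a set of seed translations, which is the weak supervision the abstract refers to.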