A data-driven machine translation approach using semantic tree alignment

Vanallemeersch, Thomas; Van Eynde, Frank

Author:

Vanallemeersch, Thomas

Van Eynde, Frank

Keywords:

machine translation, alignment, parsing, shallow semantics, semantic role labeling, syntax, linguistics, translation studies, parallel corpora, crosslingual projection

Abstract:

This dissertation deals with improving systems for machine translation (automated translation) using semantic information. Such information tends to remain constant during translation, while the syntactic structure of sentences often changes (as a result of linguistic necessities or translators' choices), as shown by the word order change in examples (1) and (2). These changes make it difficult to derive rules automatically from a database with translated sentences, i.e. to derive rules in a data-driven setting.
    (1) that their neighbour sold the car → dat hun buur de auto verkocht
    (2) they like that music → die muziek bevalt hen
In our experiments, we focus on the extension of data-driven systems that align (link) words to their translation and which create a “phrase table” using aligned words. Example (3) illustrates word alignment (correspondence is indicated using numbers), while (4) shows a phrase table entry. We extend such systems with translation rules derived from aligned syntactic trees. For instance, the car and de auto can be aligned based on the fact that they are both noun phrases. As shown in (5), rules derived from aligned trees allow reordering of constituents: the second noun phrase is moved before the verb.
    (3) that_1 their_2 neighbour_3 sold_4 the_5 car_6 → dat_1 hun_2 buur_3 de_5 auto_6 verkocht_4
    (4) their neighbour → hun buur
    (5) that [noun phrase 1] [verb] [noun phrase 2] → dat [noun phrase 1] [noun phrase 2] [verb]
Differences between the syntactic structures of sentences can complicate the tree alignment process. Therefore, we investigate two research questions:
    [1] does the use of semantic predicates and roles facilitate tree alignment?
    [2] does semantic tree alignment lead to more useful translation rules?
We experiment with a four-step approach in which we enrich the process of tree alignment and rule derivation with semantic information, with a minimum amount of manual intervention.
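As an illustration only, examples (3) to (5) can be sketched in a few lines of Python. This is not the dissertation's implementation: the names `phrase_pair`, `apply_rule` and `RULE_5` are hypothetical, and the consistency check is the one commonly used in phrase-based systems, assumed here for the sake of the example.

```python
# Word alignment from example (3), as (source index, target index) pairs (0-based).
SRC = ["that", "their", "neighbour", "sold", "the", "car"]
TGT = ["dat", "hun", "buur", "de", "auto", "verkocht"]
ALIGNMENT = {(0, 0), (1, 1), (2, 2), (3, 5), (4, 3), (5, 4)}


def phrase_pair(src_start, src_end, alignment=ALIGNMENT):
    """Return (source phrase, target phrase) for SRC[src_start:src_end],
    or None if the span cannot be extracted consistently."""
    tgt_positions = [t for s, t in alignment if src_start <= s < src_end]
    if not tgt_positions:
        return None
    lo, hi = min(tgt_positions), max(tgt_positions) + 1
    # Consistency: no target word inside [lo, hi) may be linked to a
    # source word outside the chosen source span.
    for s, t in alignment:
        if lo <= t < hi and not (src_start <= s < src_end):
            return None
    return " ".join(SRC[src_start:src_end]), " ".join(TGT[lo:hi])


# Rule (5): source pattern and target pattern over named slots (hypothetical encoding).
RULE_5 = (["that", "NP1", "V", "NP2"], ["dat", "NP1", "NP2", "V"])


def apply_rule(rule, fillers):
    """Fill the target side of a rule with already translated constituents."""
    _, target_pattern = rule
    return " ".join(fillers.get(symbol, symbol) for symbol in target_pattern)


print(phrase_pair(1, 3))  # the phrase table entry in (4)
print(phrase_pair(3, 5))  # "sold the" has no contiguous Dutch counterpart
print(apply_rule(RULE_5, {"NP1": "hun buur", "V": "verkocht", "NP2": "de auto"}))
```

The span "sold the" is rejected because its Dutch counterpart would have to include auto, which is linked to car. Rule (5) can still place the verb after the second noun phrase because it operates on constituents rather than on contiguous word sequences.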
The first step of our approach (described in Chapter 4) consists of enriching trees with semantic predicates and roles. Tools for automatically labelling the latter are only available for a limited number of languages. These tools make use of frameworks such as PropBank/NomBank, with labels like A0 (“agent”, the thing or person undertaking an action) and A1 (“patient”, undergoing an action or event). We design a method (implemented as a program called Sermano) which supports the creation of a new tool for a language on the basis of word alignment and mappings between structure and meaning. For instance, when running an existing tool on the English sentence in (6), the labels and word alignment help in creating a mapping for Dutch: the subject of the verb groeit has label A1.
    (6) they_1 observe_2 the growth_3 of [A1 the_4 market_5] → ze_1 zien_2 dat [A1 de_4 markt_5] groeit_3
The second step (described in Chapter 5 and implemented as a program called Serlino) consists of aligning trees via semantic labels. For instance, if both their neighbour and hun buur are labelled as A0, they can be aligned. The third step (described in Chapter 6 and implemented as a program called Linomat) consists of deriving translation rules based on semantic alignment. A rule containing semantic information is shown in (7): the verb sold indicates an action.
    (7) that [A0] sold [A1] → dat [A0] [A1] verkocht
In the final step (also described in Chapter 6 and implemented in Linomat), we extend a machine translation system containing a phrase table with semantic translation rules. Based on the evaluation of the above steps, we can answer the two research questions. The evaluation of the second step indicates that tree alignment using semantic predicates and roles is more precise than tree alignment without them. As for the third and fourth steps, our tests indicate that integrating semantic translation rules with a phrase table helps in improving translation results.
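The following Python sketch illustrates two of the ideas above on examples (6) and (7): projecting labels across a word alignment, and abstracting labelled constituents into a rule. It is a simplified illustration under stated assumptions, not the algorithm used in Sermano, Serlino or Linomat. The data layout and the function names `project_labels`, `abstract_roles` and `derive_rule` are hypothetical, and the English labels are assumed to come from an existing labelling tool.

```python
# Tokens as (word, link), where link is the alignment number from example (6),
# or None for words that carry no number there.
EN = [("they", 1), ("observe", 2), ("the", None), ("growth", 3),
      ("of", None), ("the", 4), ("market", 5)]
NL = [("ze", 1), ("zien", 2), ("dat", None), ("de", 4), ("markt", 5), ("groeit", 3)]
# Assumed output of an English labeller: predicate index and role spans [start, end).
EN_LABELS = {"predicate": 3, "roles": {"A1": (5, 7)}}


def project_labels(src, tgt, labels):
    """Carry the predicate and role spans over to the target sentence via links."""
    tgt_index = {link: i for i, (_, link) in enumerate(tgt) if link is not None}
    projected = {"predicate": tgt_index.get(src[labels["predicate"]][1]), "roles": {}}
    for role, (start, end) in labels["roles"].items():
        positions = sorted(tgt_index[link] for _, link in src[start:end]
                           if link in tgt_index)
        if positions:
            projected["roles"][role] = (positions[0], positions[-1] + 1)
    return projected


def abstract_roles(words, roles):
    """Replace every labelled span by its role label, e.g. [A0]."""
    starts = {start: (role, end) for role, (start, end) in roles.items()}
    out, i = [], 0
    while i < len(words):
        if i in starts:
            role, end = starts[i]
            out.append(f"[{role}]")
            i = end
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)


def derive_rule(src_words, src_roles, tgt_words, tgt_roles):
    """Constituents with the same label on both sides are treated as aligned
    and turned into slots; the remaining words stay lexicalised."""
    shared = set(src_roles) & set(tgt_roles)
    return (abstract_roles(src_words, {r: src_roles[r] for r in shared}),
            abstract_roles(tgt_words, {r: tgt_roles[r] for r in shared}))


print(project_labels(EN, NL, EN_LABELS))
print(derive_rule("that their neighbour sold the car".split(), {"A0": (1, 3), "A1": (4, 6)},
                  "dat hun buur de auto verkocht".split(), {"A0": (1, 3), "A1": (3, 5)}))
```

In the first call, the A1 span lands on de markt and the predicate on groeit, matching the mapping described for (6). The second call reproduces the two sides of rule (7).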
While we performed tests on the language pair English to Dutch, our methods and tools are sufficiently generic for tests on other language pairs and for contexts other than machine translation, such as translation studies. For instance, they can be applied in a tool like Poly-GrETEL, which allows for detecting specific syntactic structures in a database with aligned trees.