Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), pages 449-459
Location: Avignon, France. Date: 23-27 April 2012.
In this paper, we extend the work on using latent cross-language topic models for identifying word translations across comparable corpora. We present a novel precision-oriented algorithm that relies on per-topic word distributions obtained by the bilingual LDA (BiLDA) latent topic model. The algorithm aims at harvesting only the most probable word translations across languages in a greedy fashion, without any prior knowledge about the language pair, relying on a symmetrization process and the one-to-one constraint. We report results for the Italian-English and Dutch-English language pairs that outperform the current state-of-the-art results by a significant margin. In addition, we show how to use the algorithm for the construction of high-quality initial seed lexicons of translations.
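
The abstract names three ingredients: greedy harvesting, symmetrization, and a one-to-one constraint. The following is a minimal illustrative sketch of how such ingredients could fit together, not the paper's actual algorithm. It assumes each word is already represented by a vector of per-topic scores (for example, derived from a trained BiLDA model), uses cosine similarity as a stand-in similarity measure, and treats "symmetrization" as keeping a pair only when each word is the other's best match. The names `harvest_translations`, `src_topics`, `tgt_topics`, and `threshold` are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def harvest_translations(src_topics, tgt_topics, threshold=0.0):
    """Greedily harvest one-to-one translation pairs (illustrative only).

    src_topics / tgt_topics: dict mapping a word to its per-topic scores.
    Assumption: cosine similarity stands in for the paper's similarity model.
    """
    src = dict(src_topics)  # words still unmatched on the source side
    tgt = dict(tgt_topics)  # words still unmatched on the target side
    lexicon = []
    while src and tgt:
        score = {(s, t): cosine(sv, tv)
                 for s, sv in src.items() for t, tv in tgt.items()}
        # Best candidate in each direction among the remaining words.
        best_tgt = {s: max(tgt, key=lambda t: score[s, t]) for s in src}
        best_src = {t: max(src, key=lambda s: score[s, t]) for t in tgt}
        # Symmetrization (assumed form): keep only mutual best matches
        # whose similarity reaches the threshold.
        accepted = [(s, t, score[s, t]) for s, t in best_tgt.items()
                    if best_src[t] == s and score[s, t] >= threshold]
        if not accepted:
            break
        # One-to-one constraint: matched words leave both pools, so later
        # rounds can only pair words that are still free.
        for s, t, _ in accepted:
            del src[s]
            del tgt[t]
        lexicon.extend(accepted)
    return lexicon
```

Raising `threshold` trades recall for precision: fewer pairs are harvested, but only those with very similar topic profiles survive, which is the kind of behaviour one would want when building a small, reliable seed lexicon.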