Lecture Notes in Computer Science vol:6634 pages:549-560
PAKDD2011: the 15th Pacific-Asia conference on knowledge discovery and data mining
This paper explores bridging the content of two different languages via latent topics. Specifically, we propose a uniﬁed probabilistic model to simultaneously model latent topics from bilingual corpora that discuss comparable content and use the topics as features in a cross-lingual,
dictionary-less text categorization task. Experimental
results on multilingual Wikipedia data show that the
proposed topic model effectively discovers the topic
information from the bilingual corpora, and the learned
topics successfully transfer classification knowledge to other languages, for which no labeled training data are available.