Gene prioritization and clustering by multi-view text mining

Yu, Shi; Tranchevent, Léon-Charles; De Moor, Bart; Moreau, Yves

doi:10.1186/1471-2105-11-28

Author:

Yu, Shi ; Tranchevent, Léon-Charles ; De Moor, Bart ; Moreau, Yves

Keywords:

SISTA, Science & Technology, Life Sciences & Biomedicine, Biochemical Research Methods, Biotechnology & Applied Microbiology, Mathematical & Computational Biology, Biochemistry & Molecular Biology, CLASS DISCOVERY, DISEASE GENES, EXPRESSION, CONSENSUS, IDENTIFICATION, ANNOTATION, NETWORK, GENOME, CNTF, MEN, Cluster Analysis, Computational Biology, Data Mining, Databases, Factual, Disease, Genes, Information Storage and Retrieval, MEDLINE, United States, 01 Mathematical Sciences, 06 Biological Sciences, 08 Information and Computing Sciences, Bioinformatics, 31 Biological sciences, 46 Information and computing sciences, 49 Mathematical sciences

Abstract:

BACKGROUND: Text mining has become a useful tool for biologists trying to understand the genetics of diseases. In particular, it can help identify the most interesting candidate genes for a disease for further experimental analysis. Many text mining approaches have been introduced, but the effectiveness of disease-gene identification varies across text mining models. Incorporating several text mining models may therefore yield more refined and accurate knowledge. However, how to combine these models effectively remains a challenging question in machine learning. In particular, it is non-trivial to guarantee that the integrated model performs better than the best individual model.

RESULTS: We present a multi-view approach to retrieve biomedical knowledge using different controlled vocabularies. These controlled vocabularies are selected on the basis of nine well-known bio-ontologies and are applied to index the vast amounts of gene-based free-text information available in the MEDLINE repository. The text mining result specified by a vocabulary is considered as a view, and the multiple views obtained are integrated by multi-source learning algorithms. We investigate the effect of integration in two fundamental computational disease gene identification tasks: gene prioritization and gene clustering. The performance of the proposed approach is systematically evaluated and compared on real benchmark data sets. In both tasks, the multi-view approach demonstrates significantly better performance than the other methods compared.

CONCLUSIONS: In practical research, the relevance of a specific vocabulary to the task is usually unknown. In such cases, multi-view text mining is a superior and promising strategy for text-based disease gene identification.
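As an illustration of the multi-view idea in the abstract, the following minimal sketch treats each vocabulary as one view of gene text profiles, scores candidate genes against known disease genes within each view, and then combines the per-view scores. Everything in it is an assumption made for illustration: the toy profiles, the gene and vocabulary names, the cosine-similarity scoring, and the plain score averaging. The abstract does not specify which multi-source learning algorithms are used, so averaging is only a simple stand-in for that integration step.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def view_scores(profiles, training_genes, candidates):
    """Within one view, score each candidate by its mean similarity
    to the known (training) disease genes."""
    return {
        g: sum(cosine(profiles[g], profiles[t]) for t in training_genes)
        / len(training_genes)
        for g in candidates
    }


def prioritize(views, training_genes, candidates):
    """Average the per-view scores (an assumed, simplistic integration
    rule) and return candidates ranked from most to least promising."""
    per_view = [view_scores(p, training_genes, candidates) for p in views.values()]
    combined = {g: sum(s[g] for s in per_view) / len(per_view) for g in candidates}
    return sorted(candidates, key=lambda g: combined[g], reverse=True)


# Hypothetical toy data: two vocabularies (views), each giving every gene
# a small term-weight vector. T1 and T2 are known disease genes; C1..C3
# are candidates to be ranked.
views = {
    "vocab_A": {"T1": [1, 0, 1], "T2": [1, 1, 0],
                "C1": [1, 0, 1], "C2": [0, 0, 1], "C3": [0, 1, 0]},
    "vocab_B": {"T1": [1, 1], "T2": [1, 0],
                "C1": [1, 1], "C2": [0, 1], "C3": [0, 0]},
}
ranking = prioritize(views, ["T1", "T2"], ["C1", "C2", "C3"])
print(ranking)
```

In this toy setup a candidate that resembles the training genes in both views ranks above one that is supported by only a single view, which is the intuition behind combining views when the most relevant vocabulary is not known in advance.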