Biomedical Text Mining and Genomic Data Fusion for Disease-Gene Discovery

ElShal, Sarah

Keywords:

Text mining, pattern recognition, bioinformatics, gene prioritization, SISTA

Abstract:

Our genome is an amazing sequence of three billion chemical letters (DNA nucleotides) that is present in almost every cell of the human body. This sequence contains fragments called genes, which encode proteins with a wide diversity of functions. A mutation in a gene sequence may alter these functions, which is sometimes undesirable and contributes to disease. Hence, identifying which genes are associated with which diseases is of great medical importance. It is a key step in diagnosing and curing diseases, and it plays a key role in critical applications such as personalized medicine, early prediction, and drug design and repurposing. However, this task is not trivial, especially given the exponential growth of genomic data, which makes it challenging for geneticists to explore all possible hypotheses in a reasonable amount of time. In this thesis, we propose Beegle, an online search and discovery engine that allows geneticists to explore possible hypotheses about links between genes and diseases quickly and easily. It starts with text mining to present the user with an ordered list of genes that have been reported in the literature as linked to the query in question. It then applies genomic data fusion techniques to learn a model and generate novel gene hypotheses. In this work, we analysed over 20 million biomedical abstracts to extract relevant links between genes and diseases. We tested different statistical measures, ranging from co-occurrence to cosine similarity, to quantify the relevance of such links. We experimented with two biomedical text taggers that differ considerably in how they annotate biomedical text with biomedical concepts. We also investigated topic modelling, relying on a latent Dirichlet allocation model to infer a latent set of topics that better models our text data.
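The two ends of the range of relevance measures mentioned above, co-occurrence and cosine similarity, can be illustrated with a minimal sketch. It is not Beegle's implementation: the toy corpus, the function names, and the choice to represent each concept by the set of abstracts that mention it are all assumptions made for illustration.

```python
import math

# Hypothetical toy corpus: each abstract is reduced to the set of
# concepts that a text tagger found in it (assumed data, not from Beegle).
ABSTRACTS = [
    {"BRCA1", "breast cancer", "DNA repair"},
    {"BRCA1", "breast cancer"},
    {"TP53", "breast cancer", "apoptosis"},
    {"TP53", "apoptosis"},
    {"CFTR", "cystic fibrosis"},
]


def cooccurrence(a, b, docs):
    """Number of abstracts that mention both concepts."""
    return sum(1 for d in docs if a in d and b in d)


def cosine(a, b, docs):
    """Cosine similarity between the binary abstract-incidence vectors
    of two concepts: shared abstracts / sqrt(n_a * n_b)."""
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    if n_a == 0 or n_b == 0:
        return 0.0
    return cooccurrence(a, b, docs) / math.sqrt(n_a * n_b)


def rank_genes(disease, genes, docs):
    """Order candidate genes by decreasing cosine similarity to the query."""
    return sorted(genes, key=lambda g: cosine(g, disease, docs), reverse=True)


ranking = rank_genes("breast cancer", ["CFTR", "TP53", "BRCA1"], ABSTRACTS)
```

Unlike the raw co-occurrence count, the cosine score normalises by how often each concept is mentioned overall, so a frequently studied gene does not rank highly merely because it appears in many abstracts.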
Finally, we integrated state-of-the-art learning methodologies to analyse and fuse over 70 genomic data sources and compute gene similarity scores, so that the user is eventually presented with one final ranked list of hypotheses. We release Beegle at http://beegle.esat.kuleuven.be/, where an introductory video tutorial helps users start their disease-gene discovery experience. We validated Beegle in multiple experimental setups, some of which we created in-house from public genetic databases. We designed the validation process to mimic real discovery: we limited the information in our data sets to what was available up to a certain date, and then used test sets of disease-gene links that were only reported after this date. Hence, our hypotheses were not contaminated with novel information. In one setup, our results show that Beegle recommends on average 41.2% true novel hypotheses in the top 5% ranking genes. In another setup, our results show that Beegle recommends at least one true novel hypothesis in the top 20 ranking genes. Our methodology increases the true positive rate of manual approaches by 44% and reduces the error of automatic approaches by 50%. We believe Beegle is a useful tool for quickly exploring the gene hypotheses related to any query of interest. These hypotheses can then be assessed and filtered by the geneticist, who can carry out the necessary validation experiments. This motivates us to extend Beegle so that it also explores drug hypotheses in a similar way, which we see as promising future work given the availability of the relevant data sets.
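The date-based validation described above can be sketched as follows. This is an illustrative reading, not the thesis's evaluation code: the function names, the toy links, and the interpretation of the top-5% figure as the share of novel genes recovered in the top 5% of the ranking are assumptions.

```python
from datetime import date


def split_by_date(links, cutoff):
    """Split (gene, disease, report_date) links into those known on or
    before the cutoff and those first reported afterwards."""
    known = {(g, d) for g, d, t in links if t <= cutoff}
    novel = {(g, d) for g, d, t in links if t > cutoff}
    return known, novel


def recovered_in_top_fraction(ranked_genes, novel_genes, fraction=0.05):
    """Share of novel genes found in the top `fraction` of the ranking
    (one possible reading of the top-5% measure)."""
    if not novel_genes:
        return 0.0
    k = max(1, round(len(ranked_genes) * fraction))
    return len(set(ranked_genes[:k]) & set(novel_genes)) / len(novel_genes)


def hit_in_top_k(ranked_genes, novel_genes, k=20):
    """True if at least one novel gene appears among the top k genes."""
    return any(g in novel_genes for g in ranked_genes[:k])


# Hypothetical example: a 40-gene ranking produced from pre-cutoff data,
# and links for one disease with assumed report dates.
links = [
    ("G5", "disease X", date(2012, 3, 1)),
    ("G1", "disease X", date(2014, 6, 1)),
    ("G30", "disease X", date(2015, 1, 15)),
]
known, novel = split_by_date(links, cutoff=date(2013, 1, 1))
novel_genes = {g for g, _ in novel}
ranked = [f"G{i}" for i in range(40)]

top5_share = recovered_in_top_fraction(ranked, novel_genes)
top20_hit = hit_in_top_k(ranked, novel_genes, k=20)
```

Only the `known` links would be available when building the ranking; the `novel` links serve purely as the test set, which is what keeps later information from leaking into the hypotheses.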