Heterogeneous information sources for bioinformatics: integration methodology, search algorithms and case studies

Bonachela Capdevila, Francisco; De Causmaecker, Patrick; Deckmyn, Hans; Moreau, Yves

Author:

Bonachela Capdevila, Francisco

De Causmaecker, Patrick ; Deckmyn, Hans ; Moreau, Yves

Keywords:

itec, iMinds

Abstract:

Identifying the genetic basis of Mendelian disorders and complex phenotypes is essential in human genetics, both to design more effective treatments and, eventually, to better understand the molecular mechanisms behind these disorders. Usually, a list of candidates is obtained from a high-throughput experiment, such as a genome-wide association study. This set of genes (either a chromosomal region or a list of genes scattered across the genome) is usually too large to validate manually one by one, so the most promising candidates must be selected. This problem is known as gene prioritization, and in recent years several computational approaches have been proposed to address it. This thesis presents work on gene prioritization.

The first part of this text thoroughly reviews the web-based gene prioritization tools that are freely available to any user. We describe seventeen tools and stress their similarities and differences, with the aim of helping users choose the most appropriate one for their type of data. We have also reviewed the literature associated with these tools in search of validations and performance comparisons, and we have set up a website where this information and regular updates are stored. In the last two years, the number of tools described on the website has almost doubled.

Furthermore, we have carried out a performance comparison of gene prioritization tools, using either the whole genome or a limited set as the starting candidate set. We have compared the results of individual tools with those of combinations of tools, and we have completed the review by applying the combination of the best-performing tools in our benchmark to three real-life experiments. All the expertise gathered in this review has been used to find new candidate genes involved in congenital heart disease, congenital diaphragmatic hernia and asthma.

Finally, we propose the use of cluster analysis as a preprocessing step for gene prioritization approaches that use training genes to guide the prioritization. We claim that the automatic selection of a homogeneous training set produces more accurate rankings than expert-selected training sets. To this end, we have applied a transactional clustering algorithm, CLOPE, to two different gene prioritization tools: Endeavour and Genedistiller.