A framework for the creation and exploration of cross platform expression compendia

Fu, Qiang; Marchal, Kathleen; Engelen, Kristof

Author:

Fu, Qiang

Marchal, Kathleen; Engelen, Kristof

Abstract:

With the unprecedented ability to systematically probe gene expression at the genome scale, microarrays have become an indispensable technology adopted by most laboratories across the world, generating a wealth of data for a variety of species. Although comprehensive in the gene dimension, any single microarray-based study provides only a limited scope at the level of conditions. However, combining expression data from different labs provides the opportunity to investigate gene expression of a particular species at a more global level, and to view a specific study from the perspective of existing knowledge. The goal of this research is to develop a novel methodology and system to explore this opportunity.

We first developed a methodology to create an organism-specific cross-platform compendium based on publicly available gene expression data. Special attention has been paid to facilitating automated data retrieval by resolving heterogeneities in the data representation, and to improving data consistency and compatibility through systematic renormalization of the data. Compared with existing single-platform compendia, our methodology provides a broader range of data (cross-platform).

Using this novel methodology, we constructed three comprehensive expression compendia for the bacterial model organisms Escherichia coli, Bacillus subtilis, and Salmonella enterica serovar Typhimurium. Moreover, we created a web portal with intuitive functionalities for data analysis and visualization, providing public access to these three compendia.

One of the most important applications of compendia is to study the response of an organism to environmental changes by identifying condition-dependent functional modules and studying the underlying regulatory mechanisms responsible for the observed expression variations. Different methods exist for this purpose. Each makes distinct assumptions to handle the underdetermined nature of this complex problem, and consequently generates complementary results. Here, we demonstrate such complementarity between two methods, DISTILLER and COLOMBOS, in a case study in which co-expression modules containing the gene sodA are extracted from the E. coli compendium using each method and compared against each other. Through this example, we stress the importance of choosing the right method based on the research purpose.

Finally, we extended the methodology to handle the increased complexity of the monocot Zea mays, specifically addressing two issues: inconsistencies in platform-probe annotation, and the need for a more precise biological sample annotation that reflects the different genetic repositories of maize (breeding lines), the complexity of the plant's lifestyle (developmental stage), and its more complex tissue structure. We also upgraded the web access portal accordingly, with new functions adapted to queries specific to a higher organism like Zea mays.