Computational Analyses of Complex Phenotypes

Arslan, Ahmed; van Noort, Vera

Author:

Arslan, Ahmed

van Noort, Vera

Abstract:

The computational methods used and developed in this thesis contribute to the analysis of complex biological data. The aim was to bridge the gap between genotype and phenotype observed in different experimental outcomes. Variations in individual proteins and pathways can change phenotypes, yet the mechanisms behind compromised phenotypes are far from fully understood. An experimental study of a stress or disease condition can provide insight into the causal molecules and mechanisms, and combining computational methods with experimental setups can deepen that understanding further. Over the years, the integration of experimental and computational approaches has helped to explain the grey areas of phenotypes. As advances in technology produce a wealth of new data, it is almost impossible to make sense of these data without in silico input. Based on this idea, several computational methods have been developed to contribute to “omics” studies. However, functional analyses of many phenotypes need more creative and effective computational protocols, and more robust tools are required to assess the impact of variations on phenotypes. Motivated by this need, we analyzed disparate data sets from different experiments. The aim of these experiments was to understand molecular phenotypes by assessing the impact of variations on protein networks, pathways and conserved functional regions. For this, we tailored different computational pipelines that map variations onto functional regions such as modified residues or protein domains. The resulting data, with mutations mapped to functional regions, can provide a refined view of the pathways underlying a phenotype. This approach can facilitate the understanding of complex phenotypes.
The projects conducted during this thesis work can be separated into two parts. The first part comprises three projects on yeast computational systems biology; the second comprises a project on viral disease and the prediction of therapeutic targets. For the first project, we generated new experimental data from yeast under ethanol stress and analyzed the genomic variation data from whole genome sequencing. These variations were mapped onto functional regions for further analysis and interpretation. The resulting data suggested that modified residues suffered the least mutational burden, whereas the multiple protein domains and pathways involved indicate that these functional units can contribute to the stress-tolerating behavior of yeast. The second project was to develop a computational protocol for researchers working with big data from yeast experiments. Budding yeast is a very popular model organism for addressing fundamental biological questions, yet researchers lacked a tool for analyzing mutational data from yeast. We therefore created a Python-based tool, yMap, which takes large data sets of mutations at the genetic or proteomic level and maps the mutated residues to several evolutionarily conserved and functionally important regions. At the end of a typical analysis, the output contains information on mutated protein regions, mutation types, pathway enrichment and network visualization. We believe that this automated protocol can contribute to and facilitate yeast research in systems biology. Our third project was based on the integration of two different types of yeast data. The aim was to understand and explain protein regulation in yeast under ethanol stress.
For this purpose we created a computational data integration framework based on yeast protein-protein interactions. The strategy helped to probe causal regulatory proteins for their possible role in protein regulation in stress-oriented yeast clones. We could also establish the potential involvement of mutations in protein regulation alongside regulatory proteins. Altogether, this project contributed to understanding the mechanisms of protein regulation under ethanol stress via regulatory proteins and mutations. The second part of this thesis was dedicated to human disease. The aim was to create a new computational protocol to advance the present understanding of Ebola virus disease (EVD). Moreover, based on our strategy, we suggested possible therapeutic targets for EVD. We mapped the most conserved regions of Ebola virus proteins and predicted the modified residues present in these conserved regions. Phosphorylation was the most abundant type of modification predicted in our analysis. To target Ebola virus proteins, we predicted the host kinases that contribute to this phosphorylation. These kinases are potentially involved in the protein-modifying events of Ebola virus. This project opens a new area of research: analyzing conserved modified residues in order to target possible modifying enzymes. In sum, we conducted four different projects to help close the gap between genotype and phenotype. Each of the yeast projects brought unique insights into the molecular mechanisms underlying a phenotype. Moreover, the evolutionary insights into Ebola virus proteins from our analysis can facilitate drug development. Additionally, the methodologies developed in each project can help the larger research community in their own research.
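As a minimal illustration of the mapping step described in the abstract, the Python sketch below checks which mutated residues fall inside annotated functional regions. It is not the yMap implementation: the `Region` and `Mutation` classes, the `map_mutations` function and the toy protein names and coordinates are all hypothetical, chosen only to show how mutation positions might be overlaid on 1-based region coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A functional region on a protein (hypothetical structure for illustration)."""
    protein: str
    kind: str   # e.g. "domain" or "PTM"
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    label: str


@dataclass(frozen=True)
class Mutation:
    """A single-residue mutation (hypothetical structure for illustration)."""
    protein: str
    position: int  # 1-based residue position
    ref: str
    alt: str


def map_mutations(mutations, regions):
    """Return (mutation, region) pairs where the mutated residue lies in the region."""
    # Group regions by protein so each mutation is only compared to relevant regions.
    by_protein = {}
    for region in regions:
        by_protein.setdefault(region.protein, []).append(region)

    hits = []
    for mut in mutations:
        for region in by_protein.get(mut.protein, []):
            if region.start <= mut.position <= region.end:
                hits.append((mut, region))
    return hits


# Made-up toy data, not taken from the thesis.
regions = [
    Region("PROT_A", "domain", 10, 120, "example domain"),
    Region("PROT_A", "PTM", 45, 45, "phosphoserine"),
    Region("PROT_B", "domain", 200, 300, "example domain"),
]
mutations = [
    Mutation("PROT_A", 45, "S", "A"),
    Mutation("PROT_A", 130, "K", "R"),
    Mutation("PROT_B", 250, "G", "D"),
]

hits = map_mutations(mutations, regions)
for mut, region in hits:
    print(f"{mut.protein} {mut.ref}{mut.position}{mut.alt} -> {region.kind}: {region.label}")
```

In this toy input, the mutation at position 45 of PROT_A lies in both a domain and a modified residue, so it is reported twice, while the mutation at position 130 lies outside every annotated region and is not reported.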
Three months progress report

Student: Ahmed Arslan
Topic of PhD: Computational systems biology of complex phenotypes
Supervisor: Vera van Noort
Start date: 09 November 2013
Reporting period: 09 November 2013 to 09 February 2014

Background

The phenotype of an organism is the product of the environment and its genotype [1]. The environment includes everything that influences an individual from outside, for example temperature, nutrient intake and water availability. Through selective pressure, these factors can also affect the genotype of an organism and so result in phenotypic variation. Under laboratory conditions, the genetic make-up can be manipulated through genetic engineering. In nature, on the other hand, the environment controls the genetic behavior of a species in a much broader way and hence affects the genome of its inhabitants in a number of ways.

Introduction to the Project

Our aim is to assess the quantitative effect of the environment on the genome of a model organism (yeast), and to create new yeast strains that meet the challenges of a changing environment. To analyze the effect of different alcohol concentrations (the environment) on the model genome, a project was carried out in the laboratory of our collaborator Prof. Verstrepen. In six chemostats, populations of yeast strains were grown in growth media and allowed to reproduce for a certain number of generations while the concentration of alcohol was increased over time. The experiment aims to let the yeast genome adapt to higher alcohol levels, creating a ‘superior strain’ that can survive even higher than usual alcohol concentrations. To evaluate the genetic changes that took place in the course of the experiments, DNA was extracted from the surviving cells and subjected to whole genome sequencing. So far, the DNA sequencing data have been analyzed for single nucleotide polymorphisms (SNPs) and insertion/deletion (indel) mutations. Yeast is a useful model organism for this evaluation because it is very important for the pharmaceutical, chemical and beer industries.
Moreover, this unicellular organism has an ideal (relatively small) genome that makes systems-level analysis possible [2].

Progress

The downstream data analysis in this project involves computational analysis of the DNA sequencing data followed by statistical evaluation. These analyses include post-translational modification analysis, pathway-level analysis for individual SNPs and protein domain analysis. For this reason I have been learning in silico data analysis and interpretation approaches, including the programming language Python (with Biopython) and the statistical analysis software R. Python is very popular among bioinformaticians around the world, and Biopython is a set of Python libraries that further accelerates biological data analysis. During this period I have gained experience in Python programming and am now able to perform initial analyses such as protein domain analysis. Using Biopython scripts, I have extracted the protein sequences surrounding SNPs and indels, which will be used in downstream analyses. To learn more Python quickly, I have registered for the upcoming Python course at the ICT department of KU Leuven. This training will give me a further understanding of Python and will hopefully help me to work efficiently. To improve my R skills, I recently took part in a course at Maastricht University, The Netherlands. This course helped me understand the basics of the R working environment as well as advanced data handling.

Future direction

In the coming months, I will continue my training, both in scientific areas directly linked to the theme of my thesis and in non-scientific areas. To this end I will take a few courses this semester and learn more about computer programming, online databases, biological analysis software and statistical analysis tools.
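The extraction of the protein sequence surrounding a mutated residue, mentioned under Progress, can be sketched in a few lines of plain Python. This is only an illustrative stand-in for the Biopython scripts used in the project: the function name `flanking_window`, the `flank` parameter and the example sequence are assumptions made for this sketch.

```python
def flanking_window(sequence, position, flank=7):
    """Return the residues within `flank` positions of a 1-based residue position.

    The window is truncated at the protein termini rather than padded.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence (1..{len(sequence)})")
    start = max(0, position - 1 - flank)       # 0-based start of the slice
    end = min(len(sequence), position + flank)  # exclusive end of the slice
    return sequence[start:end]


# Made-up sequence: the 20 standard amino acids in alphabetical one-letter order.
protein = "ACDEFGHIKLMNPQRSTVWY"

print(flanking_window(protein, 10, flank=3))  # window centred on residue 10 (L)
print(flanking_window(protein, 2, flank=3))   # truncated at the N-terminus
print(flanking_window(protein, 20, flank=3))  # truncated at the C-terminus
```

Truncating at the termini keeps every returned residue real; a pipeline that needs fixed-width windows for motif matching might instead pad with a placeholder character, which is a design choice left open here.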
More precisely, the next steps in data analysis include extracting post-translational modification (PTM) sites and their locations from UniProt for yeast proteins that contain SNPs. The protein sequences and PTMs will be combined to identify changes that could affect the modification status of yeast proteins that have undergone adaptation to high alcohol concentrations. At the same time, phosphorylation motifs from databases such as Phospho.ELM will be extracted, and the sequences surrounding SNPs and indels will be searched for these motifs. This will add more information on SNPs that could affect PTMs.

References:

1 - Bradshaw, A.D. (1965) Evolutionary significance of phenotypic plasticity in plants. Adv. Genet. 13, 115–155.
2 - Mustacchi R, Hohmann S, Nielsen J. (2006) Yeast systems biology to unravel the network of life. Yeast. 23, 227-238. http://www.ncbi.nlm.nih.gov/pubmed/16498699