Interpretation and Prioritization of Genomic Single-Nucleotide Variation

Sifrim, Alejandro; Moreau, Yves; Vermeesch, Joris; Aerts, Jan

Author:

Sifrim, Alejandro

Moreau, Yves; Vermeesch, Joris; Aerts, Jan

Keywords:

SISTA

Abstract:

The field of human genetics has evolved at a dramatically fast pace over the past few decades. Breakthroughs in sequencing technologies and large-scale global initiatives such as the Human Genome Project and the 1000 Genomes Project have advanced our understanding of the human genome. The field of clinical genomics tries to leverage such resources to determine how variation in the genome influences human health, ranging from common diseases with complex genetic architectures to rare disorders caused by a single severe allele. As more human genomes are sequenced and other large-scale omics data sources (e.g. transcriptomics, proteomics, interactomics, phenomics) become available, we will be able to study more complex aspects of human biology. This will, however, require novel data analysis techniques capable of processing and integrating these large quantities of information. This thesis describes the development of computational analysis tools to facilitate the interpretation of single-nucleotide variants discovered in whole-exome sequencing datasets in the context of rare genomic disorders. After a general introduction to human genetics and the methods used therein, it presents a case study of a successful application of whole-exome sequencing and a common data analysis workflow to identify the genetic cause of a rare genomic disorder (Nicolaides-Baraitser syndrome). Based on this workflow, we describe a software tool called Annotate-it, which allows clinical geneticists to easily perform similar analyses, aggregates useful information, and handles the data management issues that come with this type of study. Although these approaches have proven powerful in several exome sequencing studies, they are not always feasible because of their design requirements, and they rely heavily on the geneticist’s expert interpretation of the aggregated information.
To remedy this, we developed a novel variant prioritization method that uses machine learning to automatically prioritize single-nucleotide variants by integrating information about the disease at hand with evolutionary and biochemical data for the variant. In an extensive benchmark, we show that this approach substantially outperforms classical phenotype-agnostic prediction methods. For cases where the disease is described by a series of individual phenotypes, we elaborate on different aggregation techniques for variant prioritization. Finally, we discuss the benefits and limitations of the presented approaches and suggest possible technical, methodological and conceptual improvements to them. In conclusion, this thesis presents computational methodology to assist the clinical geneticist in interpreting single-nucleotide variation found in the human genome in order to identify the underlying disease-causing variants.