A Novel Neural Network Framework for Genome Interpretation in Complex Diseases
STADIUS-25-105
Abstract:
Understanding how the genetic code gives rise to the complex traits we observe remains one of the key missions of modern genetics. Although many milestones in understanding the link between genotype and phenotype have been reached, the genome continues to conceal many of its secrets, particularly for complex diseases shaped by a myriad of genetic and environmental factors. The limited clinical utility of state-of-the-art predictive models, namely polygenic risk scores based on the results of genome-wide association studies, further highlights this challenge. These additive models provide an oversimplified representation of the intricate molecular biology underlying complex diseases. Today's vast amount of genetic data is shifting the bottleneck from data availability to data interpretation. This dissertation addresses this challenge by introducing a novel neural network-based framework for genome interpretation, capable of processing whole exome sequences end-to-end while allowing for nonlinear interactions between the inputs. We provide empirical evidence that, for predicting genetic risk in inflammatory bowel disease, nonlinear deep learning models can outperform state-of-the-art additive models, provided the amount of data is sufficiently large. We postulate that nonlinear models, able to capture complex interactions among the inputs, represent a more realistic approximation of real-life molecular biology, benefiting both predictive performance and the extent to which insights into the true underlying molecular mechanisms can be derived. Key to successful machine learning modeling is finding the right position on the bias-variance spectrum to address the underdetermination in genetic datasets (number of parameters p >> number of samples n). This underdetermination is one of the major drivers of the apparent optimality of additive modeling in clinical genetics today.
To tackle this issue while preserving the benefits of nonlinear modeling, we maximize n by leveraging large genetic datasets and at the same time minimize p by exploiting the sparsity of biological networks to constrain the complexity of the models. Further analysis demonstrates that the degree of sparsity is more decisive for predictive performance than the biological meaningfulness of the connections, although incorporating biological knowledge proved instrumental in extracting biological insights after the prediction phase. In this thesis, we benchmark one gene-level and two variant-level whole exome sequence encodings with biologically sparsified neural network architectures, ranging from feedforward fully connected networks to graph neural networks, transformers and convolutional networks. By investigating the decision process leading to the models' predictions with explainable AI methods, we identify pathways, genes and variants relevant to the disease. Additionally, we extend the framework to multiclass prediction in the context of inflammatory bowel disease subtype prediction, showcasing the framework's potential in stratified medicine. Possible future extensions, including the incorporation of other omics and environmental data, as well as accounting for population structure and sequencing batch effects, emphasize the value of the framework. In the last part, we assess the framework's generalizability by applying it to two additional complex diseases: type 2 diabetes mellitus and schizophrenia. The results validate the nonlinear advantage in a large schizophrenia dataset, confirming the critical role that sample size and the complexity of the encoding and model play in prediction, alongside the genetic component and the heterogeneity of the underlying molecular disease mechanisms.
Although further extensive external validation remains necessary, this framework contributes to the groundwork for future research aimed at refining disease classifications, improving patient stratification, and ultimately paving the way for more personalized and effective therapeutic strategies in the era of precision medicine.