Prediction of drug response from genetic sequence data using regression techniques
Regression techniques on a genotype-phenotype database to predict the efficacy of drugs
Van der Borght, Koen; S0111510
Regression techniques are increasingly important as automatic methods to study complex high-dimensional biological systems and to separate true signal from experimental noise. In this thesis, we developed novel methodologies to build linear regression models that have low complexity and still accurately predict drug response (phenotype) from HIV-1 genetic sequence mutations (genotype); the choice of methodology depended on the size of the genotype-phenotype data set. For large data sets we developed a novel cross-validated stepwise linear regression procedure to improve the selection of the model variables, i.e. mutations or interaction terms. The best results with this new methodology were obtained when building models for the non-nucleoside reverse transcriptase inhibitors (NNRTIs), leading to a reduced list of forty novel mutations putatively associated with NNRTI resistance. The effect on resistance of several of these mutations was confirmed experimentally by in vitro phenotyping of site-directed mutants, including mutations at positions not previously linked with NNRTI resistance (e.g. 102 and 139). Applying our method for large data sets to small data sets would not provide an effective solution against overfitting. Therefore, for small data sets we developed a novel methodology in which variable selection occurred by inference from multiple genetic algorithm (GA)-derived linear regression models.
Moreover, we extended this GA methodology to account for clustering in the data, which led to a more interpretable linear regression model for the integrase inhibitor raltegravir on a clonal genotype-phenotype data set containing multiple clones derived from the same clinical isolate. Finally, we developed a logistic regression method for the accurate detection of true minor single-nucleotide mutations in the presence of experimental noise, in large amounts of clonal data obtained for an individual patient with the Illumina next-generation sequencing technology.
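
To make the idea of cross-validated stepwise variable selection concrete, the following is a minimal sketch and not the procedure developed in the thesis. It assumes a binary mutation matrix `X` (rows are samples, columns are mutations) and a continuous phenotype `y`; it adds one variable at a time and stops when the k-fold cross-validated mean squared error no longer improves by at least `min_gain`. The function names, the stopping rule, the parameter values, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def cv_mse(X, y, cols, folds):
    """Cross-validated mean squared error of an OLS fit on the columns `cols`."""
    cols = np.asarray(cols, dtype=int)
    errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        # Design matrices: intercept plus the chosen mutation indicator columns.
        A_train = np.column_stack([np.ones(train.sum()), X[train][:, cols]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx][:, cols]])
        beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        errors.append(np.mean((y[test_idx] - A_test @ beta) ** 2))
    return float(np.mean(errors))

def forward_stepwise_cv(X, y, n_folds=5, min_gain=1e-3, seed=0):
    """Forward selection that keeps a variable only if it lowers the CV error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    selected = []
    best = cv_mse(X, y, selected, folds)  # intercept-only baseline
    while len(selected) < X.shape[1]:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        scores = [cv_mse(X, y, selected + [j], folds) for j in candidates]
        k = int(np.argmin(scores))
        if best - scores[k] < min_gain:
            break  # no candidate improves the CV error enough; stop
        selected.append(candidates[k])
        best = scores[k]
    return selected, best

# Synthetic example (assumed data): only mutations 2 and 7 affect the phenotype.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 10)).astype(float)
y = 1.5 * X[:, 2] - 1.0 * X[:, 7] + rng.normal(scale=0.1, size=200)
selected, score = forward_stepwise_cv(X, y)
```

Scoring candidates on held-out folds rather than on the training fit is what guards against adding variables that only explain noise. Interaction terms could be handled in the same way by appending product columns to `X` before selection.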