FROM SEQUENCING READS TO MICROBIAL DIVERSITY: BIOINFORMATIC ALGORITHMS FOR PROCESSING AMPLICON SEQUENCING DATA
Ahmed, Mohamed Mysara; R0643555
The development of high-throughput sequencing technologies has revolutionized the field of microbial ecology by offering a cost-efficient method to assess microbial diversity at an unprecedented depth using 16S rRNA amplicon sequencing approaches. Several preprocessing steps are needed to obtain a collection of highly reliable sequencing reads, ending with a clustering step that groups them into Operational Taxonomic Units (OTUs). However, this approach poses several challenges: the removal of PCR artefacts (called chimeras), the correction of sequencing errors introduced by the sequencing technologies, and the clustering of the resulting sequences into OTUs. In this work, various bioinformatics tools were developed to tackle those challenges. First, an ensemble classifier for chimera detection, named CATCh, was developed; it outperformed existing tools on different types of sequencing data. Second, two artificial intelligence-based algorithms, NoDe and IPED, were introduced to correct sequencing errors in 454 pyrosequencing and Illumina MiSeq data, respectively. A benchmarking study showed that NoDe and IPED achieve a more pronounced decrease of the error rate than other state-of-the-art tools. Third, a new method was developed that introduces an adaptive cut-off score in the OTU clustering step, making the clustering results less sensitive to variations in evolutionary rates between taxonomic lineages and to the region of the 16S rRNA gene targeted for amplification. Implementing such a dynamic cut-off value resulted in a closer correspondence between the number of OTUs and the actual diversity of the samples. Finally, a benchmark analysis comparing existing pipelines for 16S rRNA metagenomics data processing was performed, showing that a pipeline integrating our in-house developed algorithms achieved the highest accuracy.
In conclusion, the pipeline developed in this PhD translates amplicon sequencing data into high-quality OTUs that yield robust diversity estimates.