European Congress of Clinical Microbiology and Infectious Diseases (ECCMID), Date: 2015/04/25 - 2015/04/28, Location: Copenhagen, Denmark

Publication date: 2015-04-25

Author:

Libin, Pieter ; Deforche, Koen ; Vandamme, Anne-Mieke ; Theys, Kristof

Abstract:

Objectives: In the context of clinical follow-up and epidemiological surveillance, genetic data on viral pathogens are continuously generated and added to newly emerging or existing virus databases. Curation of virus sequence data is essential to ensure the quality of such databases and, in turn, to support sound clinical decision-making. However, sequence data can be affected by laboratory contamination, mislabeling of samples or virus sequences, intra-patient viral recombination, or superinfection. We propose an approach to identify anomalies in viral sequence databases based solely on the virus's molecular information.

Methods: Mismatch distributions (i.e. the distributions of pairwise nucleotide differences between sequences) of intra-patient and inter-patient virus populations are approximated by fitting the parameters of an appropriate probability distribution. Our method detects anomalies by exploiting the difference between these two mismatch distributions. For a given query sequence, the intra-patient and inter-patient genetic distances are determined, and an anomaly factor is calculated from the relationship between these distances and the approximated mismatch distributions. A factor value below a predefined threshold flags the query sequence as a potential anomaly.

Results: We developed a command-line tool that determines whether a virus sequence is similar to other virus sequences from the same patient and whether it is distant from virus sequences from other patients. The tool can also quantify the overall genetic distance between a patient's virus sequences and the other virus sequences in a database. We developed and tested the tool in the context of HIV sequence databases, which predominantly consist of sequences covering the pol genomic region. The tool was successfully applied to detect a set of anomalies in a clinical HIV database; the reported anomalies were verified by manual phylogenetic analysis, and all were confirmed to be sample or sequence labeling errors.

Conclusion: Inconsistencies in viral genetic information and its annotation can impair the management of viral infections. We developed a virus sequence anomaly detection tool, based on differences in genetic distance distributions, to improve the curation of virus sequence databases. Although the tool was successfully tested on a clinical HIV-1 database, further validation of its detection performance on a large HIV-1 dataset with artificially introduced anomalies is warranted. The tool and its source code will be made publicly available to other research and clinical institutions, offering opportunities for databases that contain genetic information on other viral pathogens.
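
The idea described under Methods can be illustrated with a minimal sketch. It is not the authors' tool: the abstract names neither the probability distribution nor the formula for the anomaly factor, so the Poisson model, the log-likelihood-ratio score, the threshold of 0.0, the function names and the toy sequences below are all assumptions made for illustration only.

```python
import math
from itertools import combinations

def mismatches(a, b):
    """Count nucleotide differences between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

def fit_poisson(distances):
    """ASSUMPTION: mismatch counts follow a Poisson distribution.
    The maximum-likelihood rate is the sample mean (floored to stay positive)."""
    return max(sum(distances) / len(distances), 1e-9)

def poisson_logpmf(k, lam):
    """Log-probability of observing k mismatches under Poisson(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def anomaly_factor(query, same_patient, lam_intra, lam_inter):
    """HYPOTHETICAL score: mean log-likelihood ratio (intra model vs. inter model)
    of the distances between the query and its patient's other sequences.
    Low values mean the query looks like it came from a different patient."""
    ds = [mismatches(query, s) for s in same_patient]
    llr = [poisson_logpmf(d, lam_intra) - poisson_logpmf(d, lam_inter) for d in ds]
    return sum(llr) / len(llr)

# Toy database (invented sequences): patient id -> aligned sequences.
db = {
    "P1": ["ACGTACGTAC", "ACGTACGTAT"],
    "P2": ["TCGAACCTAG", "TCGAACCTAA"],
}

# Intra-patient distances: all pairs within the same patient.
intra = [mismatches(a, b) for seqs in db.values() for a, b in combinations(seqs, 2)]
# Inter-patient distances: all pairs drawn from two different patients.
inter = [mismatches(a, b)
         for p, q in combinations(db, 2)
         for a in db[p] for b in db[q]]

lam_intra = fit_poisson(intra)
lam_inter = fit_poisson(inter)

THRESHOLD = 0.0  # ASSUMPTION: arbitrary cut-off chosen for this toy example

# One query that resembles P1, and one that resembles P2 but is labeled as P1.
consistent = anomaly_factor("ACGTACGTAG", db["P1"], lam_intra, lam_inter)
mislabeled = anomaly_factor("TCGAACCTAG", db["P1"], lam_intra, lam_inter)

for name, factor in [("consistent", consistent), ("mislabeled", mislabeled)]:
    flag = "potential anomaly" if factor < THRESHOLD else "ok"
    print(f"{name}: factor={factor:.3f} -> {flag}")
```

In this sketch a query whose distances to its own patient's sequences resemble inter-patient distances receives a factor below the threshold, mirroring the decision rule described in the abstract. A real application would work on properly aligned sequences and would need a distribution and threshold validated on the database at hand.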