Title: Messy Data in Life Sciences - A Discussion based on Case Studies
Other Titles: 'Messy' data in de biowetenschappen - Een discussie gebaseerd op casestudies
Authors: Winand, Raf
Issue Date: 23-Mar-2016
Abstract: In recent decades we have witnessed an enormous increase in the amount of data being generated in every imaginable field. Where the bottleneck used to be the creation of the raw data, it has now moved to the analysis of that data. Indeed, producing the raw data often no longer takes much effort, while extracting the relevant information and drawing sound conclusions can take an entire team of specialists, each in their own field. In this thesis we propose a conceptual framework for reasoning about the challenges that may arise during the analysis of data, and more specifically of biomedical data. The proposed framework consists of two dimensions: the amount of data in relevant dimensions and the messiness of the data.
While it is generally true that more data yields better models, we discuss what 'more' data actually means and the possible pitfalls of increasing the amount of data. For the second dimension of the framework, we consider data to be messy when it violates statistical assumptions or is influenced by stochastic processes, non-linear interactions and feedback loops, environmental effects, human behavior, missing data, or temporal effects, and when these factors are unknown or cannot easily be modeled or abstracted away. Studies with a large amount of data and low messiness are more likely to yield accurate and reliable results, while studies with little data and high messiness can lead to unexpected results and/or wrong predictions.
To illustrate the framework, we discuss several case studies and where each is situated in the framework. These case studies include an analysis of the transmission of HIV-1 drug resistance, the use of whole genome sequencing in the context of embryo selection, and epigenetic modifications associated with neural tube defects. In addition, we discuss the challenges to be expected in the analysis of personal health record data and in the development of a digital coach to help achieve sustainable weight loss, once the data from these projects becomes available.
Publication status: published
KU Leuven publication type: TH
Appears in Collections: ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics

Files in This Item:
File               Status     Size     Format
RafWinand_PhD.pdf  Published  8493 KB  Adobe PDF

These files are only available to some KU Leuven Association staff members

