Title: Machine Learning on Belgian Health Expenditure Data: Data-driven Screening for Type 2 Diabetes
Other Titles: Machine learning op Belgische ziekenfondsgegevens: data-gedreven screening voor type 2 diabetes
Authors: Claesen, Marc
Issue Date: 14-Dec-2015
Abstract: Diabetes mellitus is a metabolic disorder characterized by chronic hyperglycemia, which may cause serious harm to many of the body's systems. Diabetes is a deadly pandemic that places a significant burden on healthcare systems worldwide and will continue to do so as its global prevalence, particularly that of type 2 diabetes, rises rapidly. In developed countries, the rising prevalence is primarily driven by population aging, lifestyle changes and greater longevity of diabetes patients. Diabetes can be managed effectively when detected early. Unfortunately, early detection proves difficult, as the time between onset and clinical diagnosis may span several years. Furthermore, estimates indicate that over one third of diabetes patients in developed countries are undiagnosed.

We investigated the potential of Belgian health expenditure data as a basis for a cost-effective, population-wide screening approach for (type 2) diabetes mellitus. The aim is to improve secondary prevention by speeding up diagnosis, so that treatment can start before the disease has caused irreversible damage. We used health expenditure data collected by the National Alliance of Christian Mutualities, the largest social health insurer in Belgium. These data comprise basic biographic information and records of all reimbursed medical interventions and drug purchases, providing a long-term longitudinal overview of the medical expenditure histories of over 4 million individuals.

Screening was formulated as a binary classification task, in which diabetes patients represent the positive class. Due to the nature of the problem and limitations of health expenditure data, we were unable to identify a set of known negatives (patients without diabetes). Hence, we had to learn classifiers from positive and unlabeled data. During this project we made two contributions to this subdomain of semi-supervised learning: (i) a novel learning method which is robust to false positives and (ii) an approach to evaluate classifiers using traditional metrics without known negatives in the test set. Additionally, we mapped the survival of patients starting various antidiabetic pharmacotherapies and developed two open-source machine learning packages: one for ensemble learning and another to automate hyperparameter search.
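
The idea of learning from positive and unlabeled data can be illustrated with a minimal sketch. It is not the method developed in the thesis: it follows the general bagging scheme for PU learning (the outline lists "Bagging SVM" as related work), but swaps the SVM base learner for a nearest-centroid rule so that the example needs only the Python standard library. All names (pu_bagging_scores, n_rounds, the synthetic data) are assumptions made for illustration. In each round, a bootstrap sample of the unlabeled set is treated as if it were negative, and unlabeled points left out of that sample are scored.

```python
import random


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_bagging_scores(positives, unlabeled, n_rounds=50, seed=0):
    """Average out-of-bag score per unlabeled point (higher = more positive-like).

    Hypothetical helper: the base learner is a nearest-centroid rule,
    standing in for the SVMs used in the bagging literature.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    pos_c = centroid(positives)   # all known positives are used in every round
    k = len(positives)            # size of each pseudo-negative sample
    for _ in range(n_rounds):
        # Bootstrap sample (with replacement) of unlabeled points,
        # treated as negatives for this round only.
        idx = [rng.randrange(len(unlabeled)) for _ in range(k)]
        in_bag = set(idx)
        neg_c = centroid([unlabeled[i] for i in idx])
        for i, x in enumerate(unlabeled):
            if i in in_bag:
                continue  # score only points not used for training this round
            totals[i] += sq_dist(x, neg_c) - sq_dist(x, pos_c)
            counts[i] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Synthetic data (an assumption for illustration): positives cluster around
# (2, 2), negatives around (-2, -2). The unlabeled set hides 10 positives.
data_rng = random.Random(42)


def sample(mu, n):
    return [[data_rng.gauss(mu, 0.5), data_rng.gauss(mu, 0.5)] for _ in range(n)]


n_hidden = 10
positives = sample(2.0, 20)
unlabeled = sample(2.0, n_hidden) + sample(-2.0, 40)  # hidden positives first

scores = pu_bagging_scores(positives, unlabeled)
ranking = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
print("top-ranked unlabeled indices:", ranking[:n_hidden])
```

In a screening setting, the resulting ranking of unlabeled individuals would be the output of interest: those ranked highest are the candidates for follow-up testing. Because the pseudo-negative samples are contaminated with hidden positives, averaging over many rounds is what keeps the scores stable.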

We built a screening method whose performance is competitive with existing state-of-the-art approaches. This exceeded our expectations, since health expenditure data omit most information about the typical risk factors used by other screening methods, such as BMI, lifestyle and genetic predisposition. As such, the combination of health expenditure data and additional information about risk factors is a promising avenue for future research in screening for diabetes mellitus. Finally, our approach has a very low operational cost since we only used readily available data, which effectively removes one of the key barriers to population-wide screening for diabetes.
Description: Work done in collaboration with Landsbond der Christelijke Mutualiteiten.
Table of Contents: Abstract
Contents
List of Figures
List of Tables

1 Introduction
1.1 Diabetes mellitus
1.2 Early detection and intervention in type 2 diabetes
1.2.1 Diagnosis of diabetes
1.2.2 Existing screening and prescreening approaches
1.2.3 Situation in Belgium
1.3 Belgian mutual health insurance
1.3.1 Data related to medical interventions
1.3.2 Data related to drug purchases
1.3.3 Quality of health expenditure data
1.4 Machine learning challenges and contributions
1.4.1 Learning from positive and unlabeled data
1.4.2 Automated hyperparameter optimization
1.4.3 Open-source software
1.5 Structure of the thesis
1.A Regulation of blood glucose levels
1.B Complications and comorbidities of diabetes
1.C Classification of diabetes mellitus
1.D Prevalence and burden of diabetes
1.E Treatment of diabetes mellitus

2 Mortality in individuals treated with glucose lowering agents: a large, controlled cohort study
2.1 Introduction
2.2 Research design and methods
2.2.1 Study cohort selection
2.2.2 Control cohort selection
2.2.3 Therapy changes within cohorts
2.2.4 Censoring
2.2.5 Statistical analysis
2.3 Results
2.3.1 Baseline cohort characteristics
2.3.2 Five-year survival in individuals on different glucose lowering agents
2.3.3 Age-dependent 5-year survival of individuals on different glucose lowering agents
2.3.4 Statins and survival in individuals on different glucose lowering therapy
2.4 Conclusions

3 EnsembleSVM: A Library for Ensemble Learning Using SVMs
3.1 Introduction
3.2 Software Description
3.2.1 Implementation
3.2.2 Tools
3.3 Benchmark Results
3.4 Conclusions

4 SVM Ensemble Learning from Positive and Unlabeled Data
4.1 Introduction
4.2 Related work
4.2.1 Class-weighted SVM
4.2.2 Bagging SVM
4.3 Robust Ensemble of SVMs
4.3.1 Bootstrap resampling contaminated sets
4.3.2 Bagging predictors
4.3.3 Justification of the RESVM algorithm
4.3.4 RESVM training
4.3.5 RESVM prediction
4.4 Experimental setup
4.4.1 Simulation setup
4.4.2 Data sets
4.5 Results and discussion
4.5.1 Results for supervised classification
4.5.2 Results for PU learning
4.5.3 Results of semi-supervised classification
4.5.4 A note on the number of repetitions per experiment
4.5.5 Trend across data sets
4.5.6 Effect of contamination
4.5.7 RESVM optimal parameters
4.6 Conclusion

5 Hyperparameter Search in Machine Learning
5.1 Introduction
5.1.1 Example: controlling model complexity
5.1.2 Formalizing hyperparameter search
5.2 Challenges in hyperparameter search
5.2.1 Costly objective function evaluations
5.2.2 Randomness
5.2.3 Complex search spaces
5.3 Current approaches
5.4 Conclusion

6 Easy Hyperparameter Search Using Optunity
6.1 Introduction
6.2 Optunity
6.2.1 Functional Overview
6.2.2 Available Solvers
6.2.3 Software Design and Implementation
6.2.4 Development and Documentation
6.3 Related Work
6.4 Solver Benchmark
6.A Survey of hyperparameter optimization in NIPS 2014
6.B Performance benchmark
6.B.1 Setup
6.B.2 Results & Discussion

7 Assessing Binary Classifiers Using Only Positive and Unlabeled Data
7.1 Introduction
7.2 Background and definitions
7.2.1 Rank distributions and contingency tables
7.2.2 ROC and PR curves
7.2.3 Evaluation with partially labeled data
7.3 Relationship between the rank CDF of positives and contingency tables
7.3.1 Rank distributions and contingency tables based on subsets of positives within a ranking
7.3.2 Contingency tables based on partially labeled data
7.4 Efficiently computing the bounds
7.4.1 Computing the contingency table with greatest-lower bound on FPR at given rank r
7.4.2 Bounds on the rank distribution of $\mathcal{P}_U$
7.5 Constructing ROC and PR curve estimates
7.6 Discussion and Recommendations
7.6.1 Determining $\hat{\beta}$ and its effect
7.6.2 Model selection
7.6.3 Empirical quality of the estimates
7.6.4 Relative importance of known negatives compared to known positives
7.7 Conclusion
7.A Effect of $\hat{\beta}$ on contingency table entries and common performance metrics
7.B The effect of the fraction of known positives, known negatives and $\hat{\beta}$

8 Building Classifiers to Predict the Start of Glucose-Lowering Pharmacotherapy Using Belgian Health Expenditure Data
8.1 Introduction
8.2 Existing Type 2 Diabetes Risk Profiling Approaches
8.3 Health Expenditure Data
8.3.1 Records Related to Drug Purchases
8.3.2 Records Related to Medical Provisions
8.3.3 Advantages of Health Expenditure Data
8.3.4 Limitations of Health Expenditure Data
8.4 Methods
8.4.1 Experimental Setup
8.4.2 Data Set Construction
8.4.3 Learning Methods
8.5 Results and Discussion
8.5.1 Benchmark of learning methods
8.5.2 Performance Curves
8.5.3 Feature Importance Analysis for the RESVM Model
8.6 Conclusion

9 Conclusion
9.1 Machine learning contributions
9.1.1 Future work
9.2 Screening for type 2 diabetes
9.2.1 Weaknesses and limitations of our approach
9.2.2 Future work
9.2.3 Health expenditure data
9.2.4 The elephant in the room

List of publications
Publication status: published
KU Leuven publication type: TH
Appears in Collections: ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics
Environment and Health - miscellaneous

Files in This Item:
File Status Size Format
thesis.pdf Published 4137Kb Adobe PDF

These files are only available to some KU Leuven Association staff members