Understanding Machine Learning Performance with Experiment Databases (Het verwerven van inzichten in leerperformantie met experiment databanken)

Publication date: 2010-05-17

Author:

Vanschoren, Joaquin

Keywords:

Machine Learning, Data Mining, Knowledge Discovery, Databases, e-Sciences, Meta-learning

Abstract:

Research in machine learning and data mining can be sped up tremendously by moving empirical research results out of people’s heads and labs, onto the network and into tools that help us structure and filter the information. The massive streams of experiments that are being executed to benchmark new algorithms, test hypotheses or model new datasets have many more uses beyond their original intent, but they are often discarded, or their details are lost over time.

In this thesis, we developed a framework to automatically export experiments to experiment databases: databases specifically designed to collect all the details on large numbers of past experiments, performed by many different researchers, and to support queries about almost any aspect of the behavior of learning algorithms. They can be set up for personal use, to share results within a lab, or to build community-wide repositories.

Following similar developments in several other sciences, we first define a formal domain model, an ontology, for experimentation in machine learning. We then use this ontology to define an XML-based language to exchange experiments, as well as a database model to organize all submitted results.

Finally, we demonstrate how such databases can be queried to meta-learn: to gain new insight into learning algorithm behavior. Often using no more than a single database query, we obtained surprising new results. These include detailed rankings of learning algorithms, insight into the behavior of ensemble methods, suggestions for improvement of certain algorithms, learning curve analyses, and insight into the bias-variance behavior of algorithms. We also built meta-models for predicting and explaining the suitability of learning algorithms and parameter settings.

This illustrates that much can be learned by collecting and reusing past machine learning experiments, and that building experiment databases to query for them provides an effective way of tapping into this information, often yielding surprising new insight or generating interesting new research questions.
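
To make the idea of ranking learning algorithms with a single database query more concrete, the following is a minimal sketch in Python using the standard-library sqlite3 module. It is not the thesis's actual database model or query language: the `experiment` table, its columns, the algorithm and dataset names, and the accuracy values are all hypothetical placeholders chosen purely for illustration.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database holding a few made-up experiment results.

    The schema and every value below are illustrative assumptions,
    not data or structures taken from the thesis.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE experiment (
            id        INTEGER PRIMARY KEY,
            algorithm TEXT NOT NULL,
            dataset   TEXT NOT NULL,
            accuracy  REAL NOT NULL
        )
        """
    )
    rows = [
        ("algo_a", "data_1", 0.75),
        ("algo_a", "data_2", 0.5),
        ("algo_b", "data_1", 0.875),
        ("algo_b", "data_2", 0.75),
        ("algo_c", "data_1", 0.5),
        ("algo_c", "data_2", 0.25),
    ]
    conn.executemany(
        "INSERT INTO experiment (algorithm, dataset, accuracy) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def rank_algorithms(conn):
    """Rank algorithms by mean accuracy over all stored experiments."""
    query = """
        SELECT algorithm,
               AVG(accuracy) AS mean_accuracy,
               COUNT(*)      AS n_runs
        FROM experiment
        GROUP BY algorithm
        ORDER BY mean_accuracy DESC, algorithm ASC
    """
    return conn.execute(query).fetchall()
```

Calling `rank_algorithms(build_toy_db())` returns one `(algorithm, mean_accuracy, n_runs)` tuple per algorithm, best first. In a shared repository the same kind of query would run over results submitted by many researchers, which is what allows a single query to answer a meta-level question.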