Diverse subgroup set discovery

van Leeuwen, Matthijs; Knobbe, Arno

doi:10.1007/s10618-012-0273-y

Diverse subgroup set discovery

Author:

van Leeuwen, Matthijs

Knobbe, Arno

Keywords:

Subgroup set discovery, Pattern selection, Science & Technology, Technology, Computer Science, Artificial Intelligence, Computer Science, Information Systems, Computer Science, Exceptional model mining, Heuristic search, Diversity, PATTERN, ALGORITHM, 0801 Artificial Intelligence and Image Processing, 0804 Data Format, 0806 Information Systems, Artificial Intelligence & Image Processing, 46 Information and computing sciences

Abstract:

Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures is used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.