Download PDF

Intelligent Data Analysis, Date: 2018/10/24 - 2018/10/26, Location: ‘s-Hertogenbosch

Publication date: 2018-10-01
Volume: LNCS vol 11191 Pages: 353 - 366
ISSN: 9783030017675
Publisher: Springer

Proceedings of the 17th International Symposium on Intelligent Data Analysis

Author:

Van Craenendonck, Toon
Dumancic, Sebastijan ; Vanwolputte, Elia ; Blockeel, Hendrik

Keywords:

Science & Technology, Technology, Computer Science, Artificial Intelligence, Computer Science, Information Systems, Computer Science, Theory & Methods, Computer Science, Semi-supervised clustering, Pairwise constraints, Active clustering

Abstract:

© Springer Nature Switzerland AG 2018. Constraint-based clustering algorithms exploit background knowledge to construct clusterings that are aligned with the interests of a particular user. This background knowledge is often obtained by allowing the clustering system to pose pairwise queries to the user: should these two elements be in the same cluster or not? Answering yes results in a must-link constraint, no in a cannot-link. Ideally, the user should be able to answer a couple of these queries, inspect the resulting clustering, and repeat these two steps until a satisfactory result is obtained. Such an interactive clustering process requires the clustering system to satisfy three requirements: (1) it should be able to present a reasonable (intermediate) clustering to the user at any time, (2) it should produce good clusterings given few queries, i.e. it should be query-efficient, and (3) it should be time-efficient. We present COBRAS, an approach to clustering with pairwise constraints that satisfies these requirements. COBRAS constructs clusterings of super-instances, which are local regions in the data in which all instances are assumed to belong to the same cluster. By dynamically refining these super-instances during clustering, COBRAS is able to produce clusterings at increasingly fine-grained levels of granularity. It quickly produces good high-level clusterings, and is able to refine them to find more detailed structure as more queries are answered. In our experiments we demonstrate that COBRAS is the only method able to produce good solutions at all stages of the clustering process at fast runtimes, and hence the most suitable method for interactive clustering.