Download PDF

A Theory-based Study of Graph Mining

Publication date: 2016-02-08

Author:

Haghir Chehreghani, Mostafa

Abstract:

During the last decade or so, the amount of data that is generated and becomes publicly available is rapidly growing. This makes it impossible to extract useful information from this huge amount of data manually without using automatic tools and algorithms. Data mining has been defined as the process of automatic extraction of useful and previously unknown information from a large dataset using techniques from statistics, artificial intelligence and database management. Furthermore, in many applications, such as Bioinformatics, the world wide web, social and technological and communication networks, data are usually represented with graphs. This makes graph mining practically interesting, while also a challenging research area, due to high computational cost involved in processing graph data. In this dissertation, we investigate two key problems in graph mining: frequent pattern mining and betweenness centrality computation. Existing algorithms for finding frequent patterns from large single networks mainly use subgraph isomorphism. However, subgraph isomorphism is expensive to compute: deciding whether one graph is subgraph isomorphic to another graph is NP-complete in terms of the sizes of the graphs. Recently, a few algorithms have used subgraph homomorphism. However, they find very restricted classes of patterns such as trees. The main challenge with pattern mining under subgraph homomorphism is the pattern generation phase. In this work, we go beyond trees and propose an efficient algorithm for mining graph patterns from large networks under homomorphism. We introduce a new class of patterns, called rooted graphs, and present an algorithm for complete generation of rooted graphs. We also propose a new data structure for compact representation of all frequent patterns. By performing extensive experiments on several real-world and synthetic large networks, we show the empirical efficiency of our proposed algorithm, called HoPa. We then present an efficient algorithm for subtree homeomorphism with application to frequent pattern mining. We propose a compact data-structure, called occ, that can encode and represent several occurrences of a tree pattern and define efficient join operations on the occ data-structure, that help us to count occurrences of tree patterns according to occurrences of their proper subtrees. Based on the proposed subtree homeomorphism method, we develop an effective pattern mining algorithm, called TPMiner. We evaluate the efficiency of TPMiner on several real-world and synthetic datasets. Our extensive experiments confirm that TPMiner always outperforms well-known existing algorithms, and in several cases the improvement with respect to existing algorithms is significant. Finally, we propose a randomized algorithm for unbiased estimation of betweenness centrality. We discuss the conditions that a promising sampling technique should satisfy to minimize the approximation error. We then propose a sampling method that fits better with these conditions. By performing extensive experiments on synthetic and real-world networks, we compare our proposed method with existing algorithms and show that our method works with a better accuracy.