From Graph Patterns to Networked Statistics
Author:
Abstract:
Over the last decades, the amount of available data has grown rapidly. Data mining is the field studying how to discover interesting knowledge from data. When the data points are drawn identically and independently (i.i.d.), classical data mining techniques work well in practice. In particular, the employed statistical theories make sure that, if the sample size is large enough, then the mined knowledge is very likely to have a good generalization ability. However, the structure of available data becomes more and more complex, and the assumption that data points are drawn i.i.d. is violated. This usually makes the data less informative. We begin with defining support measures which gauge the frequency of a given pattern in a given dataset. For transactional datasets, one can simply count the occurrences of the pattern in the dataset. However, if the dataset is a large network, occurrences of a subgraph pattern may overlap, which makes the support measure definition less straightforward. For example, if we ignore overlaps and directly count the occurrences, we lose the important property of anti-monotonicity, which allows us to effectively prune in search spaces, An important class of anti-monotonic support measures relies on overlap graphs. Unfortunately, but all earlier overlap graph based support measures are expensive to compute. In this thesis, we introduce the concept of overlap hypergraphs, and propose an overlap hypergraph based support measure which is anti-monotonic and can be computed efficiently. In order to understand this support measure from a statistical point of view, we consider the problem of statistics on networked examples. We propose a model for representing networked examples as hyperedges in a hypergraph and reasonable assumptions to replace the classical i.i.d. assumption. Based on this model, we design an estimator on networked examples with the minimum variance and generalize Chernoff-Hoeffding inequalities to weighted networked examples. We can observe a close relationship between the effective sample size and the graph support measure we proposed. Our statistical concentration results can be used to adapt learning principles to learning from networked examples. We illustrate this by proving generalization error bounds for the empirical risk minimization principle.